Bug 1360467 - Newly added but never started OSDs cause osdmap update issues for Calamari
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat
Component: Calamari
Version: 1.3.2
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: rc
Target Release: 1.3.3
Assignee: Boris Ranto
QA Contact: Ramakrishnan Periyasamy
Docs Contact: Bara Ancincova
URL:
Whiteboard:
Depends On:
Blocks: 1372735
Reported: 2016-07-26 20:20 UTC by Justin Bautista
Modified: 2018-02-23 12:13 UTC (History)

Fixed In Version: RHEL: calamari-server-1.3.3-2.el7cp Ubuntu: calamari-server_1.3.3-3redhat1trusty
Doc Type: Bug Fix
Doc Text:
.Calamari now correctly handles manually added OSDs that do not have "ceph-osd" running
Previously, when OSD nodes were added manually to the Calamari server but the `ceph-osd` daemon was not started on the nodes, the Calamari server returned error messages and stopped updating statuses for the rest of the OSD nodes. The underlying source code has been modified, and Calamari now handles such OSDs properly.
Clone Of:
Environment:
Last Closed: 2016-09-29 13:00:27 UTC




Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2016:1972 normal SHIPPED_LIVE Moderate: Red Hat Ceph Storage 1.3.3 security, bug fix, and enhancement update 2016-09-29 16:51:21 UTC

Description Justin Bautista 2016-07-26 20:20:34 UTC
Description of problem:

Newly added but never started OSDs cause osdmap update issues for Calamari (OSDs are listed in Web UI, but are grayed out)

Version-Release number of selected component (if applicable):

calamari-server-1.3.3-1.el7cp.x86_64

How reproducible:

Always

Steps to Reproduce:
1. Add OSDs to the Ceph cluster without starting them
2. Add the OSD hosts to Calamari for monitoring

Actual results:

OSDs are grayed out in Calamari UI

Expected results:

OSDs should be listed/monitored (not grayed out)

Comment 3 Gregory Meno 2016-08-17 18:18:50 UTC
https://github.com/ceph/calamari/pull/483

Comment 4 Gregory Meno 2016-08-17 18:20:00 UTC
Federico,

I'd like to ship this already fixed issue in 1.3.3

Comment 6 Gregory Meno 2016-08-19 13:30:48 UTC
Steps to test:

1. Set up a Ceph cluster and Calamari.
2. Add an OSD as described here, BUT DO NOT start the OSD daemon: http://docs.ceph.com/docs/hammer/rados/operations/add-or-rm-osds/#adding-an-osd-manual
3. Observe that Calamari will no longer update OSD status from that node.
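
The failure mode in step 3 can be pictured with a small sketch. It is not Calamari's code (the actual change is in the pull request linked in comment 3). The record layout, the missing "state" field, and both function names are assumptions made only for illustration. The idea is that a status pass which fails on one incomplete OSD record produces no update for any OSD, while a tolerant pass still reports the others.

```python
# Hypothetical illustration only; this is NOT Calamari's ceph.py.
# Assumption: a never-started OSD appears without a runtime "state" field.

def summarize_strict(osds):
    """Build {osd_id: state}; raises KeyError on the first incomplete record,
    so no status is produced for any OSD."""
    return {osd["id"]: osd["state"] for osd in osds}


def summarize_tolerant(osds):
    """Build {osd_id: state}; an incomplete record is reported as 'unknown'
    instead of aborting the whole pass."""
    return {osd["id"]: osd.get("state", "unknown") for osd in osds}


# Made-up sample data: osd.9 was added to the cluster but never started.
osds = [
    {"id": 8, "state": "up"},
    {"id": 9},
    {"id": 10, "state": "up"},
]

try:
    summarize_strict(osds)
    strict_failed = False
except KeyError:
    strict_failed = True

tolerant_result = summarize_tolerant(osds)
```

With this sample data the strict pass fails as a whole, which matches the symptom described above: status updates stop for the healthy OSDs as well.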

Comment 12 Ramakrishnan Periyasamy 2016-09-15 09:06:36 UTC
Based on the observations below, it is not clear whether the OSD should be visible in some sections or not.

1. In the Workbench section, the OSD is listed with status down (red).
2. In the Manage > OSD section, the OSD is not visible.

I have yet to get a reply to comment 10. Marking this bug as failed verification from QE, since comment 6 says the OSD should not be visible.

Comment 13 Gregory Meno 2016-09-15 21:35:22 UTC
What I mean in c6 step 3 is that reporting on the state of OSD 10 will no longer happen before applying the fix from this BZ. OSD 9 is in a state that causes the error.

Verification should not be based on the fact that it is visible in the UI. Does this make sense?

so to confirm that you have a fix you should be able to see changes in state to OSD10

Comment 14 Ramakrishnan Periyasamy 2016-09-16 07:23:33 UTC
(In reply to Gregory Meno from comment #13)
> What I mean in c6 step 3 is that reporting on the state of OSD 10 will no
> longer happen before applying the fix from this BZ. OSD 9 is in a state that
> causes the error.

I would like to clarify a few things below. Based on that, we may be able to bring this defect to closure.

Please note that osd.9 is under test, not osd.10.

Initially my test setup had 11 OSDs. I removed osd.9 from it, then added it back to the cluster to verify the fix. As per the steps, I did not start osd.9 after adding it. As per the fix, osd.9 is expected not to be visible in Calamari.

Please correct me if I am wrong.

> 
> Verification should not be based on the fact that it is visible in the UI.
> Does this make sense?
> 
> so to confirm that you have a fix you should be able to see changes in state
> to OSD10

Comment 15 Gregory Meno 2016-09-16 17:24:59 UTC
I can understand why you feel that way. That being said, it is actually Calamari's ceph.py module on magna108 that is under test. osd.9 is the cause of the failure. The way to verify the fix is to observe updates to the state of any other OSD.

That module is responsible for reporting cluster status. If the failure is present you will not get cluster status updates.

The UI will report osd.9 and is expected to do so. The purpose of Calamari is to help manage and monitor Ceph; if we hide nodes or OSDs in bad states, we might not know there are problems.

Hope this helps,
Gregory
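
The check described above ("observe updates to the state of any other OSD") can be sketched as a comparison of two status snapshots. This is only a sketch under assumptions: the snapshots are plain dictionaries of OSD id to reported state that the tester has collected by hand, and `changed_osds` is a hypothetical helper, not part of Calamari or Ceph.

```python
# Hypothetical helper for the verification idea; not a Calamari or Ceph API.

def changed_osds(before, after, ignore=()):
    """Return the sorted ids of OSDs whose reported state differs between
    two snapshots, skipping any ids listed in `ignore`."""
    ids = (set(before) | set(after)) - set(ignore)
    return sorted(i for i in ids if before.get(i) != after.get(i))


# Made-up snapshots: the tester stops osd.10 between the two readings,
# while osd.9 is the never-started OSD and is left out of the comparison.
before = {8: "up", 9: "down", 10: "up"}
after = {8: "up", 9: "down", 10: "down"}

updated = changed_osds(before, after, ignore={9})
```

If the fix is in place, a deliberate state change on another OSD should show up in the later snapshot, so `updated` is non-empty. If the failure is present, the reported states stay frozen and the list is empty.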

Comment 16 Ramakrishnan Periyasamy 2016-09-19 11:54:30 UTC
Steps followed to verify this bug.

1. Configured a Ceph cluster and Calamari with 9 OSDs and 1 mon.
2. The OSDs were listed in the OSD tree and in the Calamari UI with status up.
3. Added one OSD (osd.9) from a new OSD host using "ceph-deploy prepare" but did not run "ceph-deploy activate", to make sure the OSD service was not started.
4. Verified in the OSD tree that the OSD was listed under the new host (CRUSH bucket) and that its status was not up.
5. Verified in the Calamari UI that the status of all existing OSDs was good and none was greyed out.
6. Activated the service for the OSD (osd.9) and added one more OSD (osd.10) to the cluster using ceph-deploy.
7. All 11 OSDs were now listed in the Calamari UI with status up.

Comment 17 Gregory Meno 2016-09-19 17:36:12 UTC
The only concern I have here is that in c16 you list slightly different steps to reproduce, using ceph-deploy vs. doing the OSD setup by hand.
I would be satisfied with this alternative if you would be willing to confirm the negative side of the test, i.e. that the steps in c16 with a previous version reproduce the error. What do you think?

Comment 23 errata-xmlrpc 2016-09-29 13:00:27 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2016-1972.html

