Bug 1512000 - [DOC][RFE] Documented OSP shutdown procedure incorrect when attached to Ceph [NEEDINFO]
Summary: [DOC][RFE] Documented OSP shutdown procedure incorrect when attached to Ceph
Keywords:
Status: ASSIGNED
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: documentation
Version: 13.0 (Queens)
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: ---
Assignee: Dan Macpherson
QA Contact: RHOS Documentation Team
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-11-10 15:40 UTC by jliberma@redhat.com
Modified: 2019-01-15 22:24 UTC
CC List: 4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
dmacpher: needinfo? (jliberma)



Description jliberma@redhat.com 2017-11-10 15:40:42 UTC
Description of problem:


We have an OSP shutdown procedure (https://access.redhat.com/solutions/1977013) that references a Ceph shutdown procedure (https://access.redhat.com/solutions/2139301).

If you follow the order of execution outlined in this procedure, you risk data corruption.

Ceph health should be checked and services shut down while the monitors are still up on the controllers.
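
For example, this is roughly the kind of check-and-quiesce step that has to happen while the mons are still up (a sketch only -- take the exact flag set from the Ceph shutdown article for the RHCS version in use; run from a controller or any host with the Ceph admin keyring, with sudo if needed):

    ceph -s                     # confirm HEALTH_OK and that the mons are in quorum
    ceph osd set noout          # don't mark OSDs out while they are down
    ceph osd set norecover      # suppress recovery
    ceph osd set norebalance    # suppress rebalancing
    ceph osd set nobackfill     # suppress backfill
    ceph osd set nodown         # don't mark OSDs down during the shutdown
    ceph osd set pause          # pause client IO -- this is the actual quiesce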


Specifically, this step:

"Once the cluster is stopped, login to each controller and trigger poweroff"

precedes this step:

"If there are any Ceph nodes in the overcloud, proceed with powering off the Ceph nodes. This is the procedure for shutting down Ceph cluster
(Note: This procedure for stopping the Ceph cluster applies to Red Hat Ceph Storage version 1.3)"

However, if the Ceph monitors are running on the controller nodes -- which is the default placement for Director-deployed Ceph -- this order leaves no safe way to quiesce access to the Ceph OSDs before powering them off: those commands must be issued while the Ceph monitors are online, and the monitors went down with the controllers.

The process should be amended so that:

1. You identify whether the Ceph monitors are running on the controllers.
2. If so, you quiesce data access to the OSDs from the controllers before powering down the controllers.
3. You then proceed to the Ceph OSD shutdown (roughly as sketched below).
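
In other words, something along these lines (a rough sketch only, assuming the mons are co-located on the controllers; which node you run this from and the exact flags should follow the Ceph shutdown article):

    # 1. Confirm where the mons live (any node with the Ceph admin keyring):
    ceph mon stat
    # 2. While the mons are still in quorum, quiesce IO and freeze the OSD flags:
    ceph osd set noout
    ceph osd set pause
    # 3. Power off the Ceph OSD/storage nodes.
    # 4. Power off the controllers last, since that takes the mons down with them.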


Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 Dan Macpherson 2018-08-08 03:11:02 UTC
Hi Jacob,

Scoping my old BZs. For this one, is the quiesce process a case of:

1. stopping ceph-mon on the controller node you intend to reboot
2. rebooting the controller
3. starting the ceph-mon
4. waiting for it to join the cluster

Or was there something else you had in mind for this?
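
To be concrete, per controller I was picturing something like this (assuming systemd-managed ceph-mon units; exact unit names and sudo usage depend on the deployment):

    sudo systemctl stop ceph-mon@$(hostname -s)
    sudo reboot
    # once the node is back up:
    sudo systemctl start ceph-mon@$(hostname -s)
    ceph quorum_status          # wait for this mon to rejoin quorum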

Comment 2 jliberma@redhat.com 2018-08-08 03:47:35 UTC
Hey mate, the issue is that we currently have two shutdown procedures that don't account for a combined deployment.

The procedure I am referring to is completely powering down the entire OSP and Ceph deployment, as if you were preparing for a power outage in the data center.

You need to quiesce the IO to the Ceph OSDs while the monitors are still online, because the monitors have to be up for the quiesce commands to work.

If you follow this procedure as documented there is no safe way to shut down Ceph, because the mons are required to quiesce the IO but they are already down.

Basically, the two procedures need to be interleaved. Please test, because I haven't looked at this for 9 months.
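
As a rough sanity check (illustrative only): while the controllers are still up, the quiesce commands should work and the flags should be visible, e.g.

    ceph -s                       # mons still in quorum at this point
    ceph osd set noout
    ceph osd set pause
    ceph osd dump | grep flags    # noout, pauserd/pausewr etc. should show up here

Once the controllers (and with them the mons) are powered off, none of that is possible any more.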

thanks, jacob

Comment 3 Dan Macpherson 2018-08-08 12:14:40 UTC
Ah, sorry, I overlooked the kbase articles and thought you were referring to the reboot procedures in the docs.

I think I understand the issue now. Basically, the Ceph OSD shutdown procedure needs to occur before the Controller shutdown, right?

