Bug 1514042 - Upgrade rhos11 to 12 failed due to "Stop cinder_scheduler service" fail
Summary: Upgrade rhos11 to 12 failed due to "Stop cinder_scheduler service" fail
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-cinder
Version: 12.0 (Pike)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: unspecified
Target Milestone: ---
Target Release: ---
Assignee: Eric Harney
QA Contact: Avi Avraham
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-11-16 14:45 UTC by Avi Avraham
Modified: 2017-12-01 14:37 UTC
CC List: 4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-12-01 14:37:25 UTC


Attachments
Logs and shell scripts of stack (deleted): 2017-11-16 15:42 UTC, Avi Avraham

Description Avi Avraham 2017-11-16 14:45:44 UTC
Description of problem:

OpenStack upgrade from RHOS 11 to RHOS 12 failed while disabling the Cinder scheduler service on a controller. The test included 4 active instances, each with a volume attached. I suspect the timeout for that step was not long enough, since according to the systemctl output:
Nov 16 11:54:31 controller-0 systemd[1]: Stopping OpenStack Cinder Scheduler Server...
Nov 16 11:55:11 controller-0 systemd[1]: Stopped OpenStack Cinder Scheduler Server.
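
As a quick sanity check on the timeout theory, the stop timeout configured for the unit can be inspected like this (assuming the standard openstack-cinder-scheduler unit name):

  systemctl show openstack-cinder-scheduler -p TimeoutStopUSec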
  
Version-Release number of selected component (if applicable):
attached sosreport files + upgrade log files

How reproducible:


Steps to Reproduce:
1. Install RHOS 11 and update to the latest version.
2. Create some instances with attached Cinder volumes (see the example commands below).
3. Upgrade to RHOS 12.
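
A minimal sketch of step 2 with the plain OpenStack CLI (flavor, image and network names are placeholders, not taken from this environment):

  openstack volume create --size 1 test-vol-1
  openstack server create --flavor m1.small --image <image> --nic net-id=<network> test-inst-1
  openstack server add volume test-inst-1 test-vol-1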

Actual results:
The upgrade failed 

Expected results:
The upgrade succeeds.

Additional info:
Logs attached to bug

Comment 1 Avi Avraham 2017-11-16 15:42:59 UTC
Created attachment 1353583 [details]
Logs and shell scripts of stack

Comment 2 Alan Bishop 2017-11-17 16:04:15 UTC
I took a look at the failed system, and also analyzed an sosreport I
found there. The failure is occurring in one of the "upgrade_tasks" for
cinder-volume, which is running under pacemaker. The problem isn't getting
cinder-scheduler to stop; it's with pacemaker getting cinder-volume to stop.
Running "openstack stack failures list --long overcloud" shows this:

    TASK [Stop cinder_volume service (pacemaker)] **********************************
    fatal: [localhost]: FAILED! => {"changed": false, "error": "Error: resource 'openstack-cinder-volume' is running on node 'controller-0'\n", "failed": true, "msg": "Failed, to set the resource openstack-cinder-volume to the state disable", "output": "", "rc": 1}

This error message can be seen on controller-0's /var/log/messages, and it
happens at Nov 16 06:55:13. I can't find any other pacemaker messages that
indicate what went wrong.
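
A simple way to pull the related entries out of the controller's syslog (using the resource name from the error above) is something like:

  grep 'openstack-cinder-volume' /var/log/messages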

But things get strange when looking at the cinder logs. For the non-pacemaker
cinder services (api, scheduler, manage), the log messages are all from Nov 14
(two days earlier) and appear to be truncated. Cinder's volume.log is even
stranger in that it has entries on Nov 14, and then the timestamp suddenly
jumps ahead to Nov 16 11:56:11, which is one minute past the pacemaker error.
The large gap in the timestamps is very odd.

Avi, can you try to reproduce this? If it fails again, hopefully the log
messages will be more complete.

Comment 3 Christian Schwede (cschwede) 2017-11-29 09:46:33 UTC
@Avi: were you able to reproduce this issue with the latest puddles?

Comment 4 Avi Avraham 2017-11-30 14:48:42 UTC
Yes, I just reproduced it again:
# openstack stack failures list overcloud
overcloud.AllNodesDeploySteps.ControllerUpgrade_Step1.1:
  resource_type: OS::Heat::SoftwareDeployment
  physical_resource_id: 77973c9d-8541-444d-8556-2caad2d5cdfb
  status: CREATE_FAILED
  status_reason: |
    Error: resources[1]: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 2
  deploy_stdout: |
    ...
    TASK [Stop cinder_scheduler service] *******************************************
    changed: [localhost]
    
    TASK [Stop cinder_volume service (pacemaker)] **********************************
    fatal: [localhost]: FAILED! => {"changed": false, "error": "Error: resource 'openstack-cinder-volume' is running on node 'controller-0'\n", "failed": true, "msg": "Failed, to set the resource openstack-cinder-volume to the state disable", "output": "", "rc": 1}
    	to retry, use: --limit @/var/lib/heat-config/heat-config-ansible/826d5730-fe54-4564-983d-4e82aa71102e_playbook.retry
    
    PLAY RECAP *********************************************************************
    localhost                  : ok=11   changed=10   unreachable=0    failed=1   
    
    (truncated, view all with --long)
  deploy_stderr: |

overcloud.AllNodesDeploySteps.ControllerUpgrade_Step1.0:
  resource_type: OS::Heat::SoftwareDeployment
  physical_resource_id: 304a3f77-da4a-407b-b1a6-85310feb661d
  status: CREATE_FAILED
  status_reason: |
    CREATE aborted
  deploy_stdout: |
None
  deploy_stderr: |
None
overcloud.AllNodesDeploySteps.ControllerUpgrade_Step1.2:
  resource_type: OS::Heat::SoftwareDeployment
  physical_resource_id: 7388b822-ca30-499b-b8bd-01e92ddb6472
  status: CREATE_FAILED
  status_reason: |
    CREATE aborted
  deploy_stdout: |
None
  deploy_stderr: |
None

Comment 5 Christian Schwede (cschwede) 2017-11-30 17:19:53 UTC
It might be an option to ensure cinder-volume is disabled before the upgrade step 1 is executed. Maybe using an extra environment file like this?

parameter_defaults:
  UpgradeInitCommand: |
    pcs resource disable openstack-cinder-volume --wait=300
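
That extra environment file would then be passed to the existing deploy/upgrade command with -e, roughly:

  openstack overcloud deploy --templates \
    [usual environment files] \
    -e cinder-volume-upgrade-init.yaml

(The file name is only an example.)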

Comment 6 Alan Bishop 2017-11-30 18:44:14 UTC
I logged onto the failed system(s) and found log message sequences like this on all 3 controllers at (roughly) the same time:

Nov 30 08:31:18 localhost systemd: Stopped OpenStack Cinder Scheduler Server.
Nov 30 08:31:18 localhost ansible-pacemaker_resource: Invoked with check_mode=False state=disable resource=openstack-cinder-volume timeout=300 wait_for_resource=True
...
Nov 30 08:31:19 localhost os-collect-config: TASK [Stop cinder_volume service (pacemaker)] **********************************
Nov 30 08:31:19 localhost os-collect-config: fatal: [localhost]: FAILED! => {"changed": false, "error": "Error: resource 'openstack-cinder-volume' is running on node 'controller-0'\n", "failed": true, "msg": "Failed, to set the resource openstack-cinder-volume to the state disable", "output": "", "rc": 1}
Nov 30 08:31:19 localhost os-collect-config: to retry, use: --limit @/var/lib/heat-config/heat-config-ansible/3321b9e9-f079-4353-b2b9-50801dbf9b26_playbook.retry

The timestamps show the ansible-pacemaker_resource task failed almost immediately. It's not a timeout issue waiting for cinder-volume to stop.

cinder-volume is running on controller-0, and its logs are intact (unlike what I reported in comment #2). But, there are no signs of any SIGTERM received around that timeframe.
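
For reference, a quick way to look for such a signal in the volume service log (assuming the standard non-containerized log path) is:

  grep -iE 'sigterm|caught signal' /var/log/cinder/volume.log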

Here is the current pcs status:

[root@controller-0 log]# pcs status --full
Cluster name: tripleo_cluster
WARNING: no stonith devices and stonith-enabled is not false
Stack: corosync
Current DC: controller-0 (1) (version 1.1.16-12.el7_4.4-94ff4df) - partition with quorum
Last updated: Thu Nov 30 18:29:38 2017
Last change: Thu Nov 30 13:30:45 2017 by root via cibadmin on controller-1

3 nodes configured
19 resources configured (1 DISABLED)

Online: [ controller-0 (1) controller-1 (2) controller-2 (3) ]

Full list of resources:

 Master/Slave Set: galera-master [galera]
     galera     (ocf::heartbeat:galera):        Master controller-2
     galera     (ocf::heartbeat:galera):        Master controller-1
     galera     (ocf::heartbeat:galera):        Master controller-0
     Masters: [ controller-0 controller-1 controller-2 ]
 Clone Set: rabbitmq-clone [rabbitmq]
     rabbitmq   (ocf::heartbeat:rabbitmq-cluster):      Started controller-2
     rabbitmq   (ocf::heartbeat:rabbitmq-cluster):      Started controller-1
     rabbitmq   (ocf::heartbeat:rabbitmq-cluster):      Started controller-0
     Started: [ controller-0 controller-1 controller-2 ]
 Master/Slave Set: redis-master [redis]
     redis      (ocf::heartbeat:redis): Slave controller-2
     redis      (ocf::heartbeat:redis): Master controller-1
     redis      (ocf::heartbeat:redis): Slave controller-0
     Masters: [ controller-1 ]
     Slaves: [ controller-0 controller-2 ]
 ip-192.168.24.7        (ocf::heartbeat:IPaddr2):       Started controller-0
 ip-10.0.0.101  (ocf::heartbeat:IPaddr2):       Started controller-1
 ip-172.17.1.13 (ocf::heartbeat:IPaddr2):       Started controller-0
 ip-172.17.1.17 (ocf::heartbeat:IPaddr2):       Started controller-1
 ip-172.17.3.13 (ocf::heartbeat:IPaddr2):       Started controller-0
 ip-172.17.4.19 (ocf::heartbeat:IPaddr2):       Started controller-1
 Clone Set: haproxy-clone [haproxy]
     haproxy    (systemd:haproxy):      Started controller-2
     haproxy    (systemd:haproxy):      Started controller-1
     haproxy    (systemd:haproxy):      Started controller-0
     Started: [ controller-0 controller-1 controller-2 ]
 openstack-cinder-volume        (systemd:openstack-cinder-volume):      Started controller-0 (disabled)

Node Attributes:
* Node controller-0 (1):
    + cinder-volume-role                : true      
    + galera-role                       : true      
    + haproxy-role                      : true      
    + master-galera                     : 100       
    + master-redis                      : 1         
    + rabbitmq-role                     : true      
    + redis-role                        : true      
    + rmq-node-attr-last-known-rabbitmq : rabbit@controller-0
    + rmq-node-attr-rabbitmq            : rabbit@controller-0
* Node controller-1 (2):
    + cinder-volume-role                : true      
    + galera-role                       : true      
    + haproxy-role                      : true      
    + master-galera                     : 100       
    + master-redis                      : 1         
    + rabbitmq-role                     : true      
    + redis-role                        : true      
    + rmq-node-attr-last-known-rabbitmq : rabbit@controller-1
    + rmq-node-attr-rabbitmq            : rabbit@controller-1
* Node controller-2 (3):
    + cinder-volume-role                : true      
    + galera-role                       : true      
    + haproxy-role                      : true      
    + master-galera                     : 100       
    + master-redis                      : 1         
    + rabbitmq-role                     : true      
    + redis-role                        : true      
    + rmq-node-attr-last-known-rabbitmq : rabbit@controller-2
    + rmq-node-attr-rabbitmq            : rabbit@controller-2

Migration Summary:
* Node controller-2 (3):
* Node controller-1 (2):
* Node controller-0 (1):

PCSD Status:
  controller-1: Online
  controller-2: Online
  controller-0: Online

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled


I'm hoping someone with more HA/pacemaker experience can take a look, and determine why the ansible-pacemaker_resource task failed, and why the cinder-volume service is still running.

Comment 8 Christian Schwede (cschwede) 2017-11-30 19:45:05 UTC
I had a look at this, and I found the reason why "pcs resource disable" didn't work properly.

pcs was manually set to enable fencing with stonith, but stonith was not configured (this was done manually on controller-0 using "pcs property set stonith-enabled=true"). 

Actually pacemaker is warning about this in the logs:

Nov 30 19:11:41 controller-0 pengine[779625]:    error: Resource start-up disabled since no STONITH resources have been defined
Nov 30 19:11:41 controller-0 pengine[779625]:    error: Either configure some or disable STONITH with the stonith-enabled option
Nov 30 19:11:41 controller-0 pengine[779625]:    error: NOTE: Clusters with shared data need STONITH to ensure data integrity

Therefore, executing "pcs resource disable openstack-cinder-volume --wait=300 ; echo $?" fails.

Now, after disabling stonith with "pcs property set stonith-enabled=false" all the resource commands worked properly, both for disabling and enabling.
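
In other words, the check-and-fix sequence on the affected controller looks roughly like this:

  sudo pcs property show stonith-enabled                                    # reported 'true' on the broken system
  sudo pcs property set stonith-enabled=false
  sudo pcs resource disable openstack-cinder-volume --wait=300 ; echo $?    # succeeds once stonith is off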

I think it was just bad luck that this failed on openstack-cinder-volume, because this is done in step 1, and all the other pacemaker resources that need to be stopped would be handled in step 2 (and would very likely have failed too, because "pcs resource disable" would fail for them as well).

Comment 9 Christian Schwede (cschwede) 2017-11-30 20:59:13 UTC
I re-ran ./overcloud-upgrade.sh and now it worked. So the worst case would be that we need to document this, but this does not look like a blocker to me.

Comment 10 Avi Avraham 2017-11-30 21:02:30 UTC
(In reply to Christian Schwede (cschwede) from comment #9)
> I re-ran ./overcloud-upgrade.sh and now it worked. So the worst case would be
> that we need to document this, but this does not look like a blocker to
> me.

It is not clear to me what was configured manually.
The whole installation was done according to the TripleO upgrade documents without any manual intervention; every change was made by the TripleO scripts, so if the value is wrong, this bug needs to be forwarded to the TripleO team.

Comment 11 Christian Schwede (cschwede) 2017-12-01 08:11:00 UTC
(In reply to Avi Avraham from comment #10)
> (In reply to Christian Schwede (cschwede) from comment #9)
> > I re-run ./overcloud-upgrade.sh and now it worked. So worst case would be
> > that we need to document this, but this one does not look like a blocker to
> > me.
> 
> It is not clear to me what was configured manually.
> The whole installation was done according to the TripleO upgrade documents
> without any manual intervention

Someone executed this command on controller-0 on Nov 28th (see bash history as well as pacemaker logs):

pcs property set stonith-enabled=true

This was the reason pacemaker couldn't properly stop cinder-volume (or any other service, cinder-volume was just the first in the list).

To make use of stonith, it needs to be configured manually as documented here: https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/11/html/advanced_overcloud_customization/sect-fencing_the_controller_nodes

"STONITH is disabled by default and requires manual configuration so that Pacemaker can control the power management of each node in the cluster. "

I simply reverted the above command by running:

pcs property set stonith-enabled=false

And afterwards the upgrade passed successfully.

Comment 12 Christian Schwede (cschwede) 2017-12-01 08:25:50 UTC
So I just found where the pcs command is documented - it's in the upgrade docs (https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/11/html-single/upgrading_red_hat_openstack_platform/#sect-Updating_the_Overcloud) and it says:

"If you configured fencing for your Controller nodes, the update process might disable it. When the update process completes, reenable fencing with the following command on one of the Controller nodes:

$ sudo pcs property set stonith-enabled=true"

This command should not have been used here, because fencing was not configured on the initial deployment.

You can check whether stonith was configured before the upgrade by running the following command on one of the overcloud controller nodes:

[heat-admin@controller-0 ~]$ sudo pcs property
Cluster Properties:
 cluster-infrastructure: corosync
 cluster-name: tripleo_cluster
 dc-version: 1.1.16-12.el7_4.5-94ff4df
 have-watchdog: false
 maintenance-mode: false
 redis_REPL_INFO: controller-1
 stonith-enabled: false
Node Attributes:
 controller-0: cinder-volume-role=true galera-role=true haproxy-role=true rabbitmq-role=true redis-role=true rmq-node-attr-last-known-rabbitmq=rabbit@controller-0
 controller-1: cinder-volume-role=true galera-role=true haproxy-role=true rabbitmq-role=true redis-role=true rmq-node-attr-last-known-rabbitmq=rabbit@controller-1
 controller-2: cinder-volume-role=true galera-role=true haproxy-role=true rabbitmq-role=true redis-role=true rmq-node-attr-last-known-rabbitmq=rabbit@controller-2

This was on a fresh OSP11 deployment, and as shown above stonith-enabled is set to false by default.
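
If only that single property is of interest, it can also be queried directly on each controller, e.g. (host names and the heat-admin user as above):

  for node in controller-0 controller-1 controller-2; do
      ssh heat-admin@$node sudo pcs property show stonith-enabled
  done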

Comment 13 Christian Schwede (cschwede) 2017-12-01 14:37:25 UTC
I ran the upgrade again on a fresh deployment and it finished successfully.

I'm closing this BZ; please feel free to re-open if there is still a failure.

