Bug 1511494 - [Docs][Upgrade] Include specific upgrade instructions for HCI nodes
Summary: [Docs][Upgrade] Include specific upgrade instructions for HCI nodes
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: documentation
Version: 12.0 (Pike)
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ga
Target Release: 12.0 (Pike)
Assignee: Dan Macpherson
QA Contact: Martin Lopes
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2017-11-09 13:12 UTC by Marius Cornea
Modified: 2017-12-19 04:20 UTC
CC List: 14 users

Fixed In Version:
Doc Type: Rebase: Bug Fixes and Enhancements
Doc Text:
When upgrading an HCI deployment, where Ceph OSDs are colocated with Nova Compute, launch the Ceph upgrade only after all Compute nodes have been upgraded. To do so:
1. For the major-upgrade-composable-steps-docker.yaml run, replace $THT/environments/storage-environment.yaml with $THT/environments/puppet-ceph.yaml, then run major-upgrade-composable-steps-docker.yaml.
2. Upgrade the HCI nodes with the upgrade-non-controller.sh script.
3. For the major-upgrade-converge-docker.yaml run, replace $THT/environments/puppet-ceph.yaml with $THT/environments/storage-environment.yaml. In addition, add the following parameter to an environment file:
CephAnsiblePlaybook: /usr/share/ceph-ansible/infrastructure-playbooks/switch-from-non-containerized-to-containerized-ceph-daemons.yml
Also convert the puppet-ceph parameters to CephAnsibleDisksConfig parameters. Use this environment file only during the upgrade, not in any subsequent command.
Clone Of:
Environment:
Last Closed: 2017-12-19 04:20:50 UTC


Attachments
ceph install workflow (deleted)
2017-11-09 13:12 UTC, Marius Cornea


Links
System: OpenStack gerrit
ID: 522535
Priority: None
Status: master: NEW
Summary: tripleo-heat-templates: Set the default CephAnsiblePlaybook to use into the env files
Last Updated: 2017-11-27 18:18:45 UTC

Description Marius Cornea 2017-11-09 13:12:07 UTC
Created attachment 1349925
ceph install workflow

Description of problem:
OSP11 -> OSP12 upgrade: the major-upgrade-composable-steps-docker step fails in an HCI environment because Docker is not running on the HCI nodes (Compute + Ceph OSD).

Checking /var/log/mistral/ceph-install-workflow.log shows that it fails on the following task:

2017-11-09 05:46:44,206 p=14238 u=mistral |  TASK [ceph-docker-common : pull ceph/rhceph-2-rhel7 image] *********************
2017-11-09 05:46:44,519 p=14238 u=mistral |  fatal: [192.168.0.11]: FAILED! => {"changed": false, "cmd": ["docker", "pull", "docker-registry.engineering.redhat.com/ceph/rhceph-2-rhel7:latest"], "delta": "0:00:00.017837", "end": "2017-11-09 10:46:44.475507", "failed": true, "msg": "non-zero return code", "rc": 1, "start": "2017-11-09 10:46:44.457670", "stderr": "Cannot connect to the Docker daemon. Is the docker daemon running on this host?", "stderr_lines": ["Cannot connect to the Docker daemon. Is the docker daemon running on this host?"], "stdout": "", "stdout_lines": []}

The reason is that the docker service is not running on the failing Compute node:

[root@overcloud-compute-0 heat-admin]# docker ps
Cannot connect to the Docker daemon. Is the docker daemon running on this host?

In regular deployments, the Docker service is started on the Compute nodes when upgrade-non-controller.sh runs. Because major-upgrade-composable-steps-docker runs before that script, the Docker service is not yet running on those nodes and the ceph-ansible playbook fails.
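This failure condition can be checked before launching the step. The sketch below is a hypothetical helper, check_docker_on_nodes, that asks each node whether the docker unit is active. It assumes you run it on the undercloud as the stack user and that the nodes accept SSH as heat-admin with passwordless sudo (the usual TripleO setup); it is not part of any TripleO tooling.

```shell
#!/bin/bash
# Sketch: report which overcloud nodes do not have the docker service active.
# Assumptions: run from the undercloud; nodes accept SSH as heat-admin, and
# heat-admin has passwordless sudo. check_docker_on_nodes is a hypothetical name.
check_docker_on_nodes() {
  local node status=0
  for node in "$@"; do
    # systemctl is-active --quiet exits 0 only if the unit is active.
    if ssh -o BatchMode=yes -o ConnectTimeout=5 "heat-admin@${node}" \
        sudo systemctl is-active --quiet docker; then
      echo "${node}: docker active"
    else
      echo "${node}: docker NOT active (ceph-ansible container tasks will fail)"
      status=1
    fi
  done
  # Non-zero if any node lacked an active docker service (or was unreachable).
  return "$status"
}
```

For example, `check_docker_on_nodes 192.168.0.11` targets the node that failed in the log above; on an HCI node that has not yet been through upgrade-non-controller.sh, it would be expected to report docker as not active.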

Version-Release number of selected component (if applicable):
openstack-tripleo-heat-templates-7.0.3-0.20171024200823.el7ost.noarch

How reproducible:
100%

Steps to Reproduce:
1. Deploy OSP11 with 3 Controllers + 3 HCI nodes (Compute + Ceph OSD)
2. Upgrade to OSP12:

source ~/stackrc
export THT=/usr/share/openstack-tripleo-heat-templates/

openstack overcloud deploy --templates $THT \
-r ~/openstack_deployment/roles/roles_data.yaml \
-e $THT/environments/network-isolation.yaml \
-e $THT/environments/network-management.yaml \
-e $THT/environments/ceph-ansible/ceph-ansible.yaml \
-e ~/openstack_deployment/environments/nodes.yaml \
-e ~/openstack_deployment/environments/network-environment.yaml \
-e ~/openstack_deployment/environments/disk-layout.yaml \
-e ~/openstack_deployment/environments/neutron-settings.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/major-upgrade-composable-steps-docker.yaml \
-e /home/stack/ceph-ansible-env.yaml \
-e /home/stack/docker-osp12.yaml


Actual results:
Upgrade fails while running the ceph-ansible playbook:

[root@undercloud-0 stack]# tail -10 /var/log/mistral/ceph-install-workflow.log 
2017-11-09 05:46:44,206 p=14238 u=mistral |  TASK [ceph-docker-common : pull ceph/rhceph-2-rhel7 image] *********************
2017-11-09 05:46:44,519 p=14238 u=mistral |  fatal: [192.168.0.11]: FAILED! => {"changed": false, "cmd": ["docker", "pull", "docker-registry.engineering.redhat.com/ceph/rhceph-2-rhel7:latest"], "delta": "0:00:00.017837", "end": "2017-11-09 10:46:44.475507", "failed": true, "msg": "non-zero return code", "rc": 1, "start": "2017-11-09 10:46:44.457670", "stderr": "Cannot connect to the Docker daemon. Is the docker daemon running on this host?", "stderr_lines": ["Cannot connect to the Docker daemon. Is the docker daemon running on this host?"], "stdout": "", "stdout_lines": []}
2017-11-09 05:46:44,520 p=14238 u=mistral |  PLAY RECAP *********************************************************************
2017-11-09 05:46:44,520 p=14238 u=mistral |  192.168.0.11               : ok=37   changed=3    unreachable=0    failed=1   
2017-11-09 05:46:44,521 p=14238 u=mistral |  192.168.0.12               : ok=4    changed=0    unreachable=0    failed=0   
2017-11-09 05:46:44,521 p=14238 u=mistral |  192.168.0.13               : ok=56   changed=10   unreachable=0    failed=0   
2017-11-09 05:46:44,521 p=14238 u=mistral |  192.168.0.20               : ok=4    changed=0    unreachable=0    failed=0   
2017-11-09 05:46:44,521 p=14238 u=mistral |  192.168.0.21               : ok=49   changed=7    unreachable=0    failed=0   
2017-11-09 05:46:44,521 p=14238 u=mistral |  192.168.0.25               : ok=49   changed=9    unreachable=0    failed=0   
2017-11-09 05:46:44,521 p=14238 u=mistral |  localhost                  : ok=0    changed=0    unreachable=0    failed=0   

Expected results:
The upgrade completes without failing.

Additional info:
Attaching /var/log/mistral/ceph-install-workflow.log

Comment 1 Marius Cornea 2017-11-09 13:14:05 UTC
This is the roles_data.yaml used during the upgrade:

- name: Controller
  description: |
    Controller role that has all the controller services loaded and handles
    Database, Messaging and Network functions.
  tags:
    - primary
    - controller
  networks:
    - External
    - InternalApi
    - Storage
    - StorageMgmt
    - Tenant
  uses_deprecated_params: True
  deprecated_param_extraconfig: "controllerExtraConfig"
  deprecated_param_flavor: "OvercloudControlFlavor"
  deprecated_param_image: "controllerImage"
  CountDefault: 1
  ServicesDefault:
    - OS::TripleO::Services::CACerts
    - OS::TripleO::Services::CephMon
    - OS::TripleO::Services::CephExternal
    - OS::TripleO::Services::CephRgw
    - OS::TripleO::Services::CinderApi
    - OS::TripleO::Services::CinderBackup
    - OS::TripleO::Services::CinderScheduler
    - OS::TripleO::Services::CinderVolume
    - OS::TripleO::Services::Iscsid
    - OS::TripleO::Services::CinderBackendDellEMCVMAXISCSI
    - OS::TripleO::Services::CinderBackendDellEMCUnity
    - OS::TripleO::Services::CinderBackendVRTSHyperScale
    - OS::TripleO::Services::Kernel
    - OS::TripleO::Services::Keystone
    - OS::TripleO::Services::GlanceApi
    - OS::TripleO::Services::HeatApi
    - OS::TripleO::Services::HeatApiCfn
    - OS::TripleO::Services::HeatApiCloudwatch
    - OS::TripleO::Services::HeatEngine
    - OS::TripleO::Services::MySQL
    - OS::TripleO::Services::Clustercheck
    - OS::TripleO::Services::NeutronDhcpAgent
    - OS::TripleO::Services::NeutronL3Agent
    - OS::TripleO::Services::NeutronLbaasv2Agent
    - OS::TripleO::Services::NeutronL2gwAgent
    - OS::TripleO::Services::NeutronMetadataAgent
    - OS::TripleO::Services::NeutronApi
    - OS::TripleO::Services::NeutronL2gwApi
    - OS::TripleO::Services::NeutronBgpVpnApi
    - OS::TripleO::Services::NeutronCorePlugin
    - OS::TripleO::Services::NeutronOvsAgent
    - OS::TripleO::Services::Vpp
    - OS::TripleO::Services::NeutronLinuxbridgeAgent
    - OS::TripleO::Services::NeutronVppAgent
    - OS::TripleO::Services::RabbitMQ
    - OS::TripleO::Services::HAproxy
    - OS::TripleO::Services::Keepalived
    - OS::TripleO::Services::Memcached
    - OS::TripleO::Services::Pacemaker
    - OS::TripleO::Services::Redis
    - OS::TripleO::Services::NovaConductor
    - OS::TripleO::Services::MongoDb
    - OS::TripleO::Services::NovaApi
    - OS::TripleO::Services::NovaMetadata
    - OS::TripleO::Services::NovaScheduler
    - OS::TripleO::Services::NovaConsoleauth
    - OS::TripleO::Services::NovaVncProxy
    - OS::TripleO::Services::Ntp
    - OS::TripleO::Services::SwiftProxy
    - OS::TripleO::Services::ExternalSwiftProxy
    - OS::TripleO::Services::SwiftStorage
    - OS::TripleO::Services::SwiftRingBuilder
    - OS::TripleO::Services::Snmp
    - OS::TripleO::Services::Timezone
    - OS::TripleO::Services::ContainersLogrotateCrond
    - OS::TripleO::Services::Tuned
    - OS::TripleO::Services::Securetty
    - OS::TripleO::Services::Docker
    - OS::TripleO::Services::CertmongerUser
    - OS::TripleO::Services::CeilometerApi
    - OS::TripleO::Services::CeilometerCollector
    - OS::TripleO::Services::CeilometerExpirer
    - OS::TripleO::Services::CeilometerAgentCentral
    - OS::TripleO::Services::CeilometerAgentNotification
    - OS::TripleO::Services::Horizon
    - OS::TripleO::Services::GnocchiApi
    - OS::TripleO::Services::GnocchiMetricd
    - OS::TripleO::Services::GnocchiStatsd
    - OS::TripleO::Services::ManilaApi
    - OS::TripleO::Services::ManilaScheduler
    - OS::TripleO::Services::ManilaBackendGeneric
    - OS::TripleO::Services::ManilaBackendNetapp
    - OS::TripleO::Services::ManilaBackendCephFs
    - OS::TripleO::Services::ManilaShare
    - OS::TripleO::Services::ManilaBackendVNX
    - OS::TripleO::Services::ManilaBackendVMAX
    - OS::TripleO::Services::ManilaBackendUnity
    - OS::TripleO::Services::ManilaBackendIsilon
    - OS::TripleO::Services::AodhApi
    - OS::TripleO::Services::AodhEvaluator
    - OS::TripleO::Services::AodhNotifier
    - OS::TripleO::Services::AodhListener
    - OS::TripleO::Services::SaharaApi
    - OS::TripleO::Services::SaharaEngine
    - OS::TripleO::Services::IronicApi
    - OS::TripleO::Services::IronicConductor
    - OS::TripleO::Services::NovaIronic
    - OS::TripleO::Services::TripleoPackages
    - OS::TripleO::Services::TripleoFirewall
    - OS::TripleO::Services::OpenDaylightApi
    - OS::TripleO::Services::OpenDaylightOvs
    - OS::TripleO::Services::SensuClient
    - OS::TripleO::Services::FluentdClient
    - OS::TripleO::Services::Sshd
    - OS::TripleO::Services::BarbicanApi
    - OS::TripleO::Services::PankoApi
    - OS::TripleO::Services::Zaqar
    - OS::TripleO::Services::OVNDBs
    - OS::TripleO::Services::OVNController
    - OS::TripleO::Services::NeutronML2FujitsuCfab
    - OS::TripleO::Services::CinderHPELeftHandISCSI
    - OS::TripleO::Services::NovaPlacement

- name: Compute
  description: |
    Basic Compute Node role
  networks:
    - InternalApi
    - Storage
    - Tenant
  uses_deprecated_params: True
  deprecated_param_image: "NovaImage"
  deprecated_param_extraconfig: "NovaComputeExtraConfig"
  deprecated_param_metadata: "NovaComputeServerMetadata"
  deprecated_param_scheduler_hints: "NovaComputeSchedulerHints"
  deprecated_param_ips: "NovaComputeIPs"
  deprecated_server_resource_name: "NovaCompute"
  CountDefault: 1
  HostnameFormatDefault: '%stackname%-compute-%index%'
  disable_upgrade_deployment: True
  ServicesDefault:
    - OS::TripleO::Services::CACerts
    - OS::TripleO::Services::CephClient
    - OS::TripleO::Services::CephOSD
    - OS::TripleO::Services::CephExternal
    - OS::TripleO::Services::Timezone
    - OS::TripleO::Services::ContainersLogrotateCrond
    - OS::TripleO::Services::Tuned
    - OS::TripleO::Services::Securetty
    - OS::TripleO::Services::Docker
    - OS::TripleO::Services::CertmongerUser
    - OS::TripleO::Services::Ntp
    - OS::TripleO::Services::Snmp
    - OS::TripleO::Services::NovaCompute
    - OS::TripleO::Services::NovaMigrationTarget
    - OS::TripleO::Services::Iscsid
    - OS::TripleO::Services::NovaLibvirt
    - OS::TripleO::Services::Kernel
    - OS::TripleO::Services::ComputeNeutronCorePlugin
    - OS::TripleO::Services::ComputeNeutronOvsAgent
    - OS::TripleO::Services::Vpp
    - OS::TripleO::Services::NeutronLinuxbridgeAgent
    - OS::TripleO::Services::NeutronVppAgent
    - OS::TripleO::Services::ComputeCeilometerAgent
    - OS::TripleO::Services::ComputeNeutronL3Agent
    - OS::TripleO::Services::ComputeNeutronMetadataAgent
    - OS::TripleO::Services::TripleoPackages
    - OS::TripleO::Services::TripleoFirewall
    - OS::TripleO::Services::NeutronSriovAgent
    - OS::TripleO::Services::OpenDaylightOvs
    - OS::TripleO::Services::SensuClient
    - OS::TripleO::Services::FluentdClient
    - OS::TripleO::Services::Sshd

Comment 2 Giulio Fidente 2017-11-09 13:28:51 UTC
As discussed on IRC, the issue is that when disable_upgrade_deployment is True, step1 does not run on every node (it is skipped on nodes of roles with that flag), so Docker is not started and ceph-ansible fails when it tries to bring up the containers.
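Following that explanation, the affected roles are those that set disable_upgrade_deployment: True and also carry the CephOSD service, like the Compute role in the roles_data above. A rough way to list such roles is sketched below. It is a line-based awk scan that assumes the layout shown in comment 1 (each role starts with `- name:` at column 0), not a real YAML parser, and find_hci_roles is a hypothetical helper name.

```shell
#!/bin/bash
# Sketch: print roles that set disable_upgrade_deployment: True AND include
# OS::TripleO::Services::CephOSD. Such nodes are upgraded manually with
# upgrade-non-controller.sh, so docker is not yet started on them during
# major-upgrade-composable-steps-docker.
# Assumption: roles_data.yaml uses the layout shown in comment 1; this is a
# line-based awk scan, not a YAML parser. find_hci_roles is hypothetical.
find_hci_roles() {
  awk '
    # Print the role collected so far if both conditions were seen.
    function report() {
      if (role != "" && manual && osd) print role
    }
    /^- name:/ { report(); role = $3; manual = 0; osd = 0; next }
    /^[ \t]+disable_upgrade_deployment:[ \t]*[Tt]rue/ { manual = 1 }
    /^[ \t]+- OS::TripleO::Services::CephOSD[ \t]*$/ { osd = 1 }
    END { report() }
  ' "$1"
}
```

Run against the roles_data from comment 1 (for example `find_hci_roles ~/openstack_deployment/roles/roles_data.yaml`), it should print `Compute`, since the Controller role does not set the flag.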

Comment 3 Marius Cornea 2017-11-20 21:56:39 UTC
I tested the following workaround proposed by Giulio, and the upgrade completed successfully:

1. For the major-upgrade-composable-steps-docker.yaml run, replace $THT/environments/storage-environment.yaml with $THT/environments/puppet-ceph.yaml, then run major-upgrade-composable-steps-docker.yaml.

2. Upgrade the HCI nodes with the upgrade-non-controller.sh script.

3. For the major-upgrade-converge-docker.yaml run, replace $THT/environments/puppet-ceph.yaml with $THT/environments/storage-environment.yaml. In addition, add the following parameter to an environment file:
CephAnsiblePlaybook: /usr/share/ceph-ansible/infrastructure-playbooks/switch-from-non-containerized-to-containerized-ceph-daemons.yml
Also convert the puppet-ceph parameters to CephAnsibleDisksConfig parameters.
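Step 3 needs an extra environment file that is passed only to the converge run. The sketch below writes one with a heredoc. The CephAnsiblePlaybook value comes from this bug; the file name, the device list, and the journal_size/osd_scenario values are placeholders (assumed, not taken from this report) that must be replaced with the disks your puppet-ceph configuration managed and checked against the OSP 12 ceph-ansible documentation.

```shell
#!/bin/bash
# Sketch: write the upgrade-only environment file for step 3 (converge).
# ENV_FILE name and all CephAnsibleDisksConfig values are hypothetical
# placeholders; replace them with the disks puppet-ceph managed before.
ENV_FILE=${ENV_FILE:-upgrade-ceph-containerize.yaml}

cat > "$ENV_FILE" <<'EOF'
parameter_defaults:
  # Playbook from this bug: converts non-containerized Ceph daemons to containers.
  CephAnsiblePlaybook: /usr/share/ceph-ansible/infrastructure-playbooks/switch-from-non-containerized-to-containerized-ceph-daemons.yml
  # Placeholder disk layout, e.g. converted from ceph::profile::params::osds.
  CephAnsibleDisksConfig:
    devices:
      - /dev/vdb
      - /dev/vdc
    journal_size: 512
    osd_scenario: collocated
EOF

echo "Wrote ${ENV_FILE}: pass it with -e for major-upgrade-converge-docker.yaml only."
```

The file would be added with `-e` to the major-upgrade-converge-docker.yaml command, alongside $THT/environments/storage-environment.yaml, and left out of every later deploy command.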

Comment 5 Lucy Bopf 2017-11-24 05:03:33 UTC
Assigning to Dan for review as part of the upgrade documentation for OSP 12.

Comment 8 Dan Macpherson 2017-12-08 02:42:40 UTC
@mlopes, I got Sandra to peer review the whole upgrade guide, including the HCI content, so I'll switch this one to VERIFIED (unless you want to do an additional peer review?)

