Bug 1691449 - Director deployed OCP 3.11 deployment fails with openshift-ansible getting stuck when restarting docker service on master nodes
Keywords:
Status: NEW
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: rhosp-director
Version: 14.0 (Rocky)
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: z2
Target Release: ---
Assignee: RHOS Maint
QA Contact: Sasha Smolyak
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-03-21 15:50 UTC by Marius Cornea
Modified: 2019-04-03 15:41 UTC (History)
5 users (show)

Fixed In Version:
Doc Type: Known Issue
Doc Text:
There is currently a known issue where director can hang while deploying OCP. This occurs because the fix described in https://bugzilla.redhat.com/show_bug.cgi?id=1671861 is not part of the `overcloud-full` image for the Red Hat OpenStack Platform 14 z1 release. As a workaround, prior to deploying the overcloud, run the following commands to update the docker package in the `overcloud-full` image:

$ sudo yum install -y libguestfs-tools
$ virt-customize --selinux-relabel -a overcloud-full.qcow2 --install docker
$ source stackrc
$ openstack overcloud image upload --update-existing

For more information on this procedure, see https://access.redhat.com/articles/1556833. After completing these steps, you can expect director to deploy OCP successfully.
Clone Of:
Environment:
Last Closed:
Target Upstream Version:



Description Marius Cornea 2019-03-21 15:50:42 UTC
This bug was initially created as a copy of Bug #1671861

I am copying this bug because: 

The issue is still present when deploying OCP via Director because the images provided by rhosp-director-images-14.0-20190304.2.el7ost.noarch contain the broken docker version:

[root@openshift-master-0 heat-admin]# rpm -q docker
docker-1.13.1-91.git07f3374.el7.x86_64

The deployment gets stuck even though an updated version of docker is available in the rhel-7-server-extras-rpms repo:

[root@openshift-master-0 heat-admin]# yum check-updates docker
Loaded plugins: product-id, search-disabled-repos, subscription-manager

docker.x86_64                                                                                       2:1.13.1-94.gitb2f74b2.el7                                                                                       rhel-7-server-extras-rpms
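
To make the version comparison above easier to repeat on other nodes, here is a minimal sketch that classifies an `rpm -q docker` string using only the builds mentioned in this bug. The function names `docker_release` and `classify_docker_build` are made up for illustration, and the labels reflect what was reported here (-88 worked, -90 and -91 hang, -94 is the update offered by the repo), not an authoritative list of good and bad builds.

```shell
# Extract the numeric release (e.g. 91) from an `rpm -q docker` string such as
# docker-1.13.1-91.git07f3374.el7.x86_64. Prints nothing if the string does not
# look like a docker-1.13.1 package name.
docker_release() {
    case "$1" in
        docker-1.13.1-[0-9]*) ;;
        *) return 0 ;;
    esac
    dr_rest="${1#docker-1.13.1-}"   # e.g. 91.git07f3374.el7.x86_64
    echo "${dr_rest%%.*}"           # keep everything before the first dot
}

# Classify a build using only the data points reported in this bug.
# Any other release is reported as "unknown" rather than guessed.
classify_docker_build() {
    case "$(docker_release "$1")" in
        88)    echo "reported-working" ;;
        90|91) echo "reported-broken" ;;
        94)    echo "update-offered-by-repo" ;;
        *)     echo "unknown" ;;
    esac
}

# On a node this would be called as: classify_docker_build "$(rpm -q docker)"
classify_docker_build "docker-1.13.1-91.git07f3374.el7.x86_64"
```

The sketch only understands the `rpm -q` name format, not the `epoch:version-release` format that yum prints.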


Description of problem:

Director deployed OCP 3.11 deployment fails with openshift-ansible getting stuck when restarting docker on master nodes.

Snippet from /var/lib/mistral/openshift/openshift/playbook.log:

TASK [container_runtime : Fix SELinux Permissions on /var/lib/containers] ******
ok: [openshift-infra-2]
ok: [openshift-infra-1]
ok: [openshift-infra-0]
ok: [openshift-master-2]
ok: [openshift-master-0]
ok: [openshift-master-1]
ok: [openshift-worker-2]
ok: [openshift-worker-0]
ok: [openshift-worker-1]

RUNNING HANDLER [container_runtime : restart container runtime] ****************
changed: [openshift-infra-2]
changed: [openshift-infra-1]
changed: [openshift-worker-2]
changed: [openshift-infra-0]
changed: [openshift-worker-0]
changed: [openshift-worker-1]

We can see that the handler ran fine on the non-master nodes. Checking the docker processes on one of the master nodes, we can see:

[root@openshift-master-0 heat-admin]# ps axu | grep docker
root     17174  0.0  0.0 136600 12472 ?        Ssl  16:24   0:00 /usr/libexec/docker/rhel-push-plugin
root     27861  0.0  0.0 134820  1440 ?        S    16:38   0:00 /bin/systemctl restart docker
root     27899  0.0  0.1 677516 24084 ?        Ssl  16:38   0:02 /usr/bin/dockerd-current --add-runtime docker-runc=/usr/libexec/docker/docker-runc-current --default-runtime=docker-runc --authorization-plugin=rhel-push-plugin --exec-opt native.cgroupdriver=systemd --userland-proxy-path=/usr/libexec/docker/docker-proxy-current --init-path=/usr/libexec/docker/docker-init-current --seccomp-profile=/etc/docker/seccomp.json --selinux-enabled --signature-verification=False -s overlay2 --mtu=1450 --add-registry registry.redhat.io --insecure-registry 192.168.24.1:8787 --add-registry registry.access.redhat.com --add-registry docker.io --add-registry registry.fedoraproject.org --add-registry quay.io --add-registry registry.centos.org
root     27908  0.0  0.0 394508 13568 ?        Ssl  16:38   0:01 /usr/bin/docker-containerd-current -l unix:///var/run/docker/libcontainerd/docker-containerd.sock --metrics-interval=0 --start-timeout 2m --state-dir /var/run/docker/libcontainerd/containerd --shim docker-containerd-shim --runtime docker-runc --runtime-args --systemd-cgroup=true
root     27997  0.0  0.0 115300  1464 ?        Ss   16:59   0:00 /usr/bin/sh -c DEAD=`docker ps -aq -f status=dead` && [ -n "$DEAD" ] && docker rm $DEAD; exit 0
root     28001  0.0  0.0 115300   652 ?        S    16:59   0:00 /usr/bin/sh -c DEAD=`docker ps -aq -f status=dead` && [ -n "$DEAD" ] && docker rm $DEAD; exit 0
root     28002  0.0  0.0 166116  9248 ?        Sl   16:59   0:00 /usr/bin/docker-current ps -aq -f status=dead


The docker service is stuck in the activating state:

[root@openshift-master-0 heat-admin]# systemctl status docker
● docker.service - Docker Application Container Engine
   Loaded: loaded (/usr/lib/systemd/system/docker.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/docker.service.d
           └─99-unset-mountflags.conf, custom.conf
   Active: activating (start) since Fri 2019-02-01 16:38:48 EST; 36min ago
     Docs: http://docs.docker.com
 Main PID: 27899 (dockerd-current)
    Tasks: 27
   Memory: 879.7M
   CGroup: /system.slice/docker.service
           ├─27899 /usr/bin/dockerd-current --add-runtime docker-runc=/usr/libexec/docker/docker-runc-current --default-runtime=docker-runc --authorization-plugin=rhel-push-plugin --exec-opt native.cgroupdriver=systemd --userland-proxy...
           └─27908 /usr/bin/docker-containerd-current -l unix:///var/run/docker/libcontainerd/docker-containerd.sock --metrics-interval=0 --start-timeout 2m --state-dir /var/run/docker/libcontainerd/containerd --shim docker-containerd-...

Feb 01 16:38:48 openshift-master-0 systemd[1]: Starting Docker Application Container Engine...
Feb 01 16:38:49 openshift-master-0 dockerd-current[27899]: time="2019-02-01T16:38:49.173839784-05:00" level=info msg="libcontainerd: new containerd process, pid: 27908"
Feb 01 16:38:50 openshift-master-0 dockerd-current[27899]: time="2019-02-01T16:38:50.284350795-05:00" level=info msg="Graph migration to content-addressability took 0.00 seconds"
Feb 01 16:38:50 openshift-master-0 dockerd-current[27899]: time="2019-02-01T16:38:50.285523377-05:00" level=info msg="Loading containers: start."
Feb 01 16:38:50 openshift-master-0 dockerd-current[27899]: time="2019-02-01T16:38:50.303415591-05:00" level=warning msg="libcontainerd: client is out of sync, restore was called on a fully synced container (9237ecfa83412...6b2943e34278)."
Feb 01 16:38:50 openshift-master-0 dockerd-current[27899]: time="2019-02-01T16:38:50.317157074-05:00" level=warning msg="libcontainerd: client is out of sync, restore was called on a fully synced container (2559d51999b2f...1ba97f1c1462)."
Hint: Some lines were ellipsized, use -l to show in full.
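
The stuck state shown above can also be detected from a script. This is a rough sketch, not part of the original report: the helper names `is_activating` and `wait_for_docker` and the timeout values are assumptions. It parses the `ActiveState=...` line printed by `systemctl show -p ActiveState docker` and gives up if the unit is still activating after a timeout.

```shell
# Return success if the given line, as printed by
# `systemctl show -p ActiveState docker`, reports the "activating" state.
is_activating() {
    [ "${1#ActiveState=}" = "activating" ]
}

# Hypothetical helper: poll until docker leaves "activating" or the timeout
# (in seconds, default 300) expires. Returns 1 if docker is still activating.
wait_for_docker() {
    wfd_timeout="${1:-300}"
    wfd_interval="${2:-10}"
    wfd_waited=0
    while is_activating "$(systemctl show -p ActiveState docker)"; do
        if [ "$wfd_waited" -ge "$wfd_timeout" ]; then
            echo "docker still activating after ${wfd_waited}s" >&2
            return 1
        fi
        sleep "$wfd_interval"
        wfd_waited=$((wfd_waited + wfd_interval))
    done
}
```

On the affected master nodes, `wait_for_docker` would be expected to time out, since the service above had been activating for 36 minutes.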


Version-Release number of selected component (if applicable):
docker-1.13.1-90.git07f3374.el7.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Deploy OCP 3.11 via OpenStack Director

Actual results:
Deployment cannot complete because docker gets stuck.

Expected results:
Deployment doesn't get stuck.

Additional info:

I can see in the job history that the deployment passed with a lower version of docker, so this could be a regression introduced by the newer docker package.

Working version:
docker-1.13.1-88.git07f3374.el7.x86_64

Comment 2 Martin André 2019-04-01 17:18:03 UTC
It is possible to work around the issue by installing the updated docker package in the overcloud-full image. From the undercloud, prior to deploying the overcloud:

$ sudo yum install -y libguestfs-tools
$ virt-customize --selinux-relabel -a overcloud-full.qcow2 --install docker
$ source stackrc
$ openstack overcloud image upload --update-existing

Comment 3 Martin André 2019-04-01 17:27:14 UTC
It will also be necessary to register the image. See the whole process at https://access.redhat.com/articles/1556833.
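
For anyone scripting the workaround from comment 2, the four commands could be wrapped as follows so that a failing step stops the sequence. This is a sketch under assumptions: the `IMAGE` and `DRY_RUN` variables and the `run` and `update_image` helpers are made up for illustration, `stackrc` is assumed to be in the current directory, and the image registration step from the linked article is not covered. It defaults to dry-run mode and only prints the commands.

```shell
# Illustrative variables, not part of the original workaround.
IMAGE="${IMAGE:-overcloud-full.qcow2}"
DRY_RUN="${DRY_RUN:-1}"   # set DRY_RUN=0 to actually execute the commands

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Run the workaround steps in order, stopping at the first failure.
update_image() {
    run sudo yum install -y libguestfs-tools || return 1
    run virt-customize --selinux-relabel -a "$IMAGE" --install docker || return 1
    # stackrc must be sourced into the current shell, so it cannot go through run.
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ source stackrc"
    else
        . ./stackrc || return 1
    fi
    run openstack overcloud image upload --update-existing || return 1
}

update_image
```

Run it from the directory on the undercloud that holds both `overcloud-full.qcow2` and `stackrc`.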

