Bug 1486851 - containerized installation with overlay2 docker driver fails on RHEL Atomic Host 7.4 or RHEL 7.4
Summary: containerized installation with overlay2 docker driver fails on RHEL Atomic Host 7.4 or RHEL 7.4
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Containers
Version: 3.7.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: low
Target Milestone: ---
Target Release: ---
Assignee: Vivek Goyal
QA Contact: DeShuai Ma
URL:
Whiteboard: aos-scalability-37
Duplicates: 1509895
Depends On:
Blocks:
 
Reported: 2017-08-30 15:52 UTC by jmencak
Modified: 2017-12-20 13:08 UTC
CC List: 12 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-12-20 13:08:01 UTC


Attachments (Terms of Use)
Ansible inventory and log files. (deleted) - 2017-08-30 15:52 UTC, jmencak
Full journalctl from 172.16.0.12 (deleted) - 2017-08-30 16:34 UTC, jmencak
Ansible inventory and log files. (deleted) - 2017-08-30 16:59 UTC, jmencak
Logs with device mapper backend. (deleted) - 2017-08-31 08:39 UTC, jmencak

Description jmencak 2017-08-30 15:52:31 UTC
Created attachment 1320180 [details]
Ansible inventory and log files.

Description of problem:
When installing OCP 3.7 with openshift_cloudprovider_kind=openstack using the advanced containerized installation on RHEL Atomic Host 7.4, the installation fails.
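
For context, the relevant inventory settings are roughly the following (a minimal sketch, not the attached inventory itself; the openshift_cloudprovider_openstack_* values are placeholders):

[OSEv3:vars]
containerized=true
openshift_cloudprovider_kind=openstack
openshift_cloudprovider_openstack_auth_url=http://<keystone-host>:5000/v2.0
openshift_cloudprovider_openstack_username=<username>
openshift_cloudprovider_openstack_password=<password>
openshift_cloudprovider_openstack_tenant_name=<tenant>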

Version-Release number of selected component (if applicable):
$ oc version
oc v3.7.0-0.117.0
kubernetes v1.7.0+695f48a16f
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://rhel-atomic-74-30g-cinder.novalocal:8443
openshift v3.7.0-0.117.0
kubernetes v1.7.0+695f48a16f

How reproducible:
Always when trying to configure OpenShift 3.7 for OpenStack on Atomic Host 7.4.

Steps to Reproduce:
1. Advanced installation using the attached inventory on OpenStack.
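
Judging from the retry-file path printed below, the run was driven by the BYO config playbook, along these lines (the inventory filename is a placeholder):

$ cd /root/openshift-ansible
$ ansible-playbook -i <inventory> playbooks/byo/config.yml

Actual results: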

TASK [openshift_master : Start and enable master controller on first master] ******************
task path: /root/openshift-ansible/roles/openshift_master/tasks/main.yml:332
FAILED - RETRYING: Start and enable master controller on first master (1 retries left).

RUNNING HANDLER [openshift_master : restart master controllers] *******************************
fatal: [172.16.0.12]: FAILED! => {"changed": false, "failed": true, "msg": "Unable to restart service atomic-openshift-master-controllers: Job for atomic-openshift-master-controllers.service failed because the control process exited with error code. See \"systemctl status atomic-openshift-master-controllers.service\" and \"journalctl -xe\" for details.\n"}
META: ran handlers
	to retry, use: --limit @/root/openshift-ansible/playbooks/byo/config.retry

PLAY RECAP ************************************************************************************
172.16.0.12                : ok=248  changed=83   unreachable=0    failed=1   
localhost                  : ok=12   changed=0    unreachable=0    failed=0   



Failure summary:


  1. Hosts:    172.16.0.12
     Play:     Configure masters
     Task:     restart master controllers
     Message:  Unable to restart service atomic-openshift-master-controllers: Job for atomic-openshift-master-controllers.service failed because the control process exited with error code. See "systemctl status atomic-openshift-master-controllers.service" and "journalctl -xe" for details.
               
$ oc get nodes
No resources found.


Expected results:
Successful installation with OpenStack Cinder support.

Additional info:
$ rpm -q ansible
ansible-2.3.2.0-1.el7.noarch
$ git describe  
openshift-ansible-3.7.0-0.120.0-12-g028a437

More log files are attached.  When the OpenStack-specific variables are omitted, the installation finishes fine.

Comment 1 Scott Dodson 2017-08-30 16:26:36 UTC
Can you grab the full journal for atomic-openshift-master-controllers on 172.16.0.12?

journalctl --no-pager -u atomic-openshift-master-controllers

Comment 2 jmencak 2017-08-30 16:34:20 UTC
Created attachment 1320184 [details]
Full journalctl from 172.16.0.12

Comment 3 jmencak 2017-08-30 16:59:06 UTC
Created attachment 1320186 [details]
Ansible inventory and log files.

Comment 4 Scott Dodson 2017-08-30 21:26:32 UTC
The logs show that it loops on the following for a while and then the controllers finally start up. Were there any problems with docker that you resolved, or anything like that? Does the problem happen if you're not using overlayfs?

Aug 30 15:35:15 rhel-atomic-74-30g-cinder.novalocal systemd[1]: Starting Atomic OpenShift Master Controllers...
Aug 30 15:35:15 rhel-atomic-74-30g-cinder.novalocal atomic-openshift-master-controllers[26900]: Error response from daemon: No such container: atomic-openshift-master-controllers
Aug 30 15:35:15 rhel-atomic-74-30g-cinder.novalocal atomic-openshift-master-controllers[26906]: /usr/bin/docker-current: Error response from daemon: error creating overlay mount to /var/lib/docker/overlay2/44369e42b72cf0e17088c5171962621b76ee66b29f4650d75921356d9a74a753-init/merged: mountfrom error on re-exec cmd: fork/exec /proc/self/exe: cannot allocate memory.
Aug 30 15:35:15 rhel-atomic-74-30g-cinder.novalocal atomic-openshift-master-controllers[26906]: See '/usr/bin/docker-current run --help'.
Aug 30 15:35:15 rhel-atomic-74-30g-cinder.novalocal systemd[1]: atomic-openshift-master-controllers.service: main process exited, code=exited, status=125/n/a
Aug 30 15:35:25 rhel-atomic-74-30g-cinder.novalocal atomic-openshift-master-controllers[26914]: Error response from daemon: No such container: atomic-openshift-master-controllers
Aug 30 15:35:25 rhel-atomic-74-30g-cinder.novalocal systemd[1]: atomic-openshift-master-controllers.service: control process exited, code=exited status=1
Aug 30 15:35:25 rhel-atomic-74-30g-cinder.novalocal systemd[1]: Failed to start Atomic OpenShift Master Controllers.
Aug 30 15:35:25 rhel-atomic-74-30g-cinder.novalocal systemd[1]: Unit atomic-openshift-master-controllers.service entered failed state.
Aug 30 15:35:25 rhel-atomic-74-30g-cinder.novalocal systemd[1]: atomic-openshift-master-controllers.service failed.
Aug 30 15:35:30 rhel-atomic-74-30g-cinder.novalocal systemd[1]: atomic-openshift-master-controllers.service holdoff time over, scheduling restart.
Aug 30 15:35:30 rhel-atomic-74-30g-cinder.novalocal systemd[1]: Starting Atomic OpenShift Master Controllers...
Aug 30 15:35:30 rhel-atomic-74-30g-cinder.novalocal atomic-openshift-master-controllers[26921]: Error response from daemon: No such container: atomic-openshift-master-controllers
Aug 30 15:35:30 rhel-atomic-74-30g-cinder.novalocal atomic-openshift-master-controllers[26927]: /usr/bin/docker-current: Error response from daemon: error creating overlay mount to /var/lib/docker/overlay2/248cc2574ac9ab114ccd6580b7a95ca1969533520f7df743b7ce994c93653552-init/merged: mountfrom error on re-exec cmd: fork/exec /proc/self/exe: cannot allocate memory.
Aug 30 15:35:30 rhel-atomic-74-30g-cinder.novalocal atomic-openshift-master-controllers[26927]: See '/usr/bin/docker-current run --help'.
Aug 30 15:35:30 rhel-atomic-74-30g-cinder.novalocal systemd[1]: atomic-openshift-master-controllers.service: main process exited, code=exited, status=125/n/a
Aug 30 15:35:40 rhel-atomic-74-30g-cinder.novalocal atomic-openshift-master-controllers[26962]: Error response from daemon: No such container: atomic-openshift-master-controllers
Aug 30 15:35:40 rhel-atomic-74-30g-cinder.novalocal systemd[1]: atomic-openshift-master-controllers.service: control process exited, code=exited status=1
Aug 30 15:35:40 rhel-atomic-74-30g-cinder.novalocal systemd[1]: Failed to start Atomic OpenShift Master Controllers.
Aug 30 15:35:40 rhel-atomic-74-30g-cinder.novalocal systemd[1]: Unit atomic-openshift-master-controllers.service entered failed state.
Aug 30 15:35:40 rhel-atomic-74-30g-cinder.novalocal systemd[1]: atomic-openshift-master-controllers.service failed.
Aug 30 15:35:46 rhel-atomic-74-30g-cinder.novalocal systemd[1]: atomic-openshift-master-controllers.service holdoff time over, scheduling restart.

Comment 5 jmencak 2017-08-31 08:39:59 UTC
Created attachment 1320509 [details]
Logs with device mapper backend.

Comment 6 jmencak 2017-08-31 08:42:55 UTC
Haven't touched docker at all during the install.  Please see https://bugzilla.redhat.com/attachment.cgi?id=1320509 for the inventory and logs with the devicemapper backend.

Comment 7 Gaoyun Pei 2017-09-13 05:51:38 UTC
Encountered the same error with the overlay2 docker storage driver when the installer tries to restart the atomic-openshift-master-controllers service during installation. It's also a containerized install on Atomic Host 7.4, but on GCE.


Version-Release number of selected component (if applicable):
openshift-ansible-3.7.0-0.125.0.git.0.91043b6.el7.noarch.rpm
ansible-2.3.2.0-2.el7.noarch
openshift v3.7.0-0.125.0

# rpm -q docker
docker-1.12.6-55.gitc4618fb.el7.x86_64


Actual results:

RUNNING HANDLER [openshift_master : restart master api] ************************
Wednesday 13 September 2017  05:20:47 +0000 (0:00:00.013)       0:14:11.080 *** 
skipping: [qe-gpei-overlay-4-master-etcd-1.0913-yd5.qe.rhcloud.com] => {
    "changed": false, 
    "skip_reason": "Conditional result was False", 
    "skipped": true
}

RUNNING HANDLER [openshift_master : restart master controllers] ****************
Wednesday 13 September 2017  05:20:47 +0000 (0:00:00.023)       0:14:11.104 *** 
fatal: [qe-gpei-overlay-4-master-etcd-1.0913-yd5.qe.rhcloud.com]: FAILED! => {
    "changed": false, 
    "failed": true
}

MSG:

Unable to restart service atomic-openshift-master-controllers: Job for atomic-openshift-master-controllers.service failed because the control process exited with error code. See "systemctl status atomic-openshift-master-controllers.service" and "journalctl -xe" for details.



# journalctl -f -u atomic-openshift-master-controllers.service 
-- Logs begin at Wed 2017-09-13 05:04:55 UTC. --
Sep 13 05:22:49 qe-gpei-overlay-4-master-etcd-1 atomic-openshift-master-controllers[18492]: Error response from daemon: No such container: atomic-openshift-master-controllers
Sep 13 05:22:49 qe-gpei-overlay-4-master-etcd-1 atomic-openshift-master-controllers[18496]: /usr/bin/docker-current: Error response from daemon: error creating overlay mount to /var/lib/docker/overlay2/6157fd04c7144892f4f76e4be43c12a773bef7344e1a5a1f0eec73ad60c422f5-init/merged: mountfrom error on re-exec cmd: fork/exec /proc/self/exe: cannot allocate memory.
Sep 13 05:22:49 qe-gpei-overlay-4-master-etcd-1 atomic-openshift-master-controllers[18496]: See '/usr/bin/docker-current run --help'.
Sep 13 05:22:49 qe-gpei-overlay-4-master-etcd-1 systemd[1]: atomic-openshift-master-controllers.service: main process exited, code=exited, status=125/n/a
Sep 13 05:22:59 qe-gpei-overlay-4-master-etcd-1 atomic-openshift-master-controllers[18536]: Error response from daemon: No such container: atomic-openshift-master-controllers
Sep 13 05:22:59 qe-gpei-overlay-4-master-etcd-1 systemd[1]: atomic-openshift-master-controllers.service: control process exited, code=exited status=1

Comment 9 Gaoyun Pei 2017-09-13 06:34:10 UTC
No such error when trying a containerized install with devicemapper storage.

This issue is now blocking testing of containerized OCP 3.7 environments with overlay2 storage, both on Atomic Host 7.4 and RHEL 7.4.

Comment 10 Johnny Liu 2017-09-13 09:27:35 UTC
According to comment 9, it seems this bug is not related to the cloud provider configuration, so I'm updating the bug title to reflect that.

Comment 11 jmencak 2017-09-13 09:36:50 UTC
I'm not so sure it is the same bug we are hitting, as I've seen it fail with devicemapper too.  See comment 5.

Comment 12 Gaoyun Pei 2017-09-13 10:06:50 UTC
(In reply to jmencak from comment #11)
> I'm not so sure it is the same bug we are hitting as I've seen it fail with
> devicemapper too.  See comment 5.

Hi Jiri, I checked the log attached in comment 5; that installation fails because one node is not available, not because of a master controllers service restart failure.

Today the QE team tried dozens of containerized installs with devicemapper, but unfortunately didn't reproduce the issue; no atomic-openshift-master-api/controllers or atomic-openshift-node start failures were encountered. It is really weird.

Comment 14 Tim Bielawa 2017-09-14 18:38:06 UTC
The journalctl has this little nugget sprinkled throughout it: 

> mountfrom error on re-exec cmd: fork/exec /proc/self/exe: cannot allocate memory.

Some searching around shows that this *may* be related to a known [fixed in 13.x] issue with docker [0,1]. 

> This change updates how we handle long lines of output from the
> container. The previous logic used a bufio reader to read entire lines
> of output from the container through an intermediate BytesPipe, and that
> allowed the container to cause dockerd to consume an unconstrained
> amount of memory as it attempted to collect a whole line of output, by
> outputting data without newlines.

I'm cautious about assigning blame, however, as this BZ states that this is happening during an install. Under those conditions docker should not have run enough containers to reach memory exhaustion.

We're going to need data, much more data, to understand what's happening here. I'll try to set up a reproducer locally based on the openshift-ansible g028a437 commit.

In the meantime, here is the testing that anyone else involved with this BZ can do (I'm going to start by doing the same once I have a reproducer):

While running the reproducer case:

* Keep a running log of used system memory resources. So run something like this on the target installation hosts:

> watch -n15 'date >> /tmp/free-m.log; free -m >> /tmp/free-m.log'

* Keep a running log of memory used by any docker processes during the install:

> watch -n15 'date >> /tmp/free-m.log; ps aux | grep docker | grep current >> /root/ps-grep-docker.log'

And one last thing of interest to me:

> watch -n15 'date >> /tmp/lsof-i-P-docker.log; lsof -i -P | grep docker >> /root/lsof-i-P-docker.log'


It'd be great if we had collectd or something on these hosts, but this (and some sed+awk) will have to do in the meantime.
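
If it's easier, the same three probes can run from a single background loop (a sketch of the same idea; the log path is arbitrary):

> while true; do
>     date >> /root/install-monitor.log
>     free -m >> /root/install-monitor.log
>     ps aux | grep '[d]ocker' >> /root/install-monitor.log
>     lsof -i -P | grep docker >> /root/install-monitor.log
>     sleep 15
> done &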



[0] https://github.com/moby/moby/issues/8539
[1] https://github.com/moby/moby/issues/18295

Comment 15 Tim Bielawa 2017-09-14 18:39:21 UTC
That last line should read


> watch -n15 'date >> /root/lsof-i-P-docker.log; lsof -i -P | grep docker >> /root/lsof-i-P-docker.log'

Comment 16 Tim Bielawa 2017-09-14 20:04:43 UTC
Note for later testers

To get closer I did a few things:

* added an additional virtual disk to my test VMs
* installed docker on the target VMs prior to installing OCP
* Per [0] I ran:

> # echo DEVS=/dev/vdb >> /etc/sysconfig/docker-storage-setup
> # systemctl start docker


I observed this in the `systemctl status docker` output:

> Sep 14 20:01:00 n01.example.com dockerd-current[1535]: time="2017-09-14T20:01:00.890228373Z" level=info msg="devmapper: Successfully created filesystem xfs on device docker-253:0-8409156-base"

This is important because overlay2 is only enabled on XFS file systems. I had to do some openshift-ansible code digging to figure that out. 

But anyway, that's how to ensure that you have the POTENTIAL to reproduce this bug. Because the docker default block device is loopback, and there is no configuration option to request 'overlay2', you must have the XFS filesystems for docker configured ahead of time (or at least I had to).
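
To double-check which backend docker actually picked, and that the backing filesystem is XFS with d_type support (ftype=1), something like this works, assuming the default /var/lib/docker location:

> # docker info | grep -A1 'Storage Driver'
> # xfs_info /var/lib/docker | grep ftype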




[0] https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux_atomic_host/7/html/managing_containers/managing_storage_with_docker_formatted_containers#setting_up_a_volume_group_and_lvm_thin_pool_on_user_specified_block_device

Comment 17 Tim Bielawa 2017-09-14 21:59:44 UTC
Follow up questions:

* Did your hosts have docker installed pre OCP install?
* How did you get the overlay2 driver to install?
* System stuff:

 - lscpu
 - cat /proc/meminfo
 - vgdisplay
 - lvdisplay
 - pvdisplay

* cat /etc/fstab
* cat /etc/docker/daemon.json
* cat /etc/sysconfig/docker-storage
* cat /etc/sysconfig/docker-storage-setup
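
If it helps, everything above can be captured in one pass and attached to the BZ (a quick sketch; the output path is arbitrary):

# { lscpu; cat /proc/meminfo; vgdisplay; lvdisplay; pvdisplay; } > /tmp/sysinfo.log 2>&1
# for f in /etc/fstab /etc/docker/daemon.json /etc/sysconfig/docker-storage /etc/sysconfig/docker-storage-setup; do echo "== $f =="; cat "$f"; done >> /tmp/sysinfo.log 2>&1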

Comment 21 Vivek Goyal 2017-10-12 17:35:46 UTC
The following error message seems to suggest that there is a lack of memory on the system and hence the re-exec failed.

Aug 30 15:35:15 rhel-atomic-74-30g-cinder.novalocal atomic-openshift-master-controllers[26906]: /usr/bin/docker-current: Error response from daemon: error creating overlay mount to /var/lib/docker/overlay2/44369e42b72cf0e17088c5171962621b76ee66b29f4650d75921356d9a74a753-init/merged: mountfrom error on re-exec cmd: fork/exec /proc/self/exe: cannot allocate memory.

How much memory do you have on the node? Try increasing the amount of RAM. Also monitor RAM usage (as suggested by other people) while you are trying to reproduce this issue.
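
For a quick point-in-time check of the headroom on the node (the watch-based logs from comment 14 give the view over time):

# free -m
# grep -i -e memfree -e memavailable -e commit /proc/meminfo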

Comment 22 Vivek Goyal 2017-10-12 17:39:23 UTC
Also, for overlayfs there might be too little space on the rootfs for Atomic, and you might have to set up overlayfs on a separate volume.

To do that, additionally specify the following in /etc/sysconfig/docker-storage-setup, then reset storage and start docker again.

CONTAINER_ROOT_LV_NAME=docker-root-lv
CONTAINER_ROOT_LV_MOUNT_PATH=/var/lib/docker/
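
On Atomic Host, the reset-and-restart sequence is roughly the following (a sketch; note that resetting storage removes existing containers and images, so save anything you need first):

# systemctl stop docker
# atomic storage reset
# systemctl start docker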

Comment 23 jmencak 2017-10-12 17:45:38 UTC
I no longer have access to the original system that caused the issues, and I've never seen the issue on an ec2 m4.xlarge (16 GB) instance.  Looks like Gaoyun's GCE instances were only 4 GB?

Comment 26 Jhon Honce 2017-10-19 16:59:36 UTC
Lowered severity as there is a documented workaround.

Comment 27 Gan Huang 2017-11-07 04:47:27 UTC
*** Bug 1509895 has been marked as a duplicate of this bug. ***

Comment 28 Vivek Goyal 2017-12-20 13:08:01 UTC
Closing this as NOTABUG. This is a configuration issue where the node does not have sufficient memory to run this workload.

