Bug 1367552 - Node becomes NotReady - Container Install
Summary: Node becomes NotReady - Container Install
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Containers
Version: 3.3.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: low
Target Milestone: ---
Target Release: ---
Assignee: Vivek Goyal
QA Contact: DeShuai Ma
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-08-16 18:18 UTC by Vikas Laad
Modified: 2016-09-19 14:00 UTC
CC: 8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-09-19 14:00:53 UTC



Description Vikas Laad 2016-08-16 18:18:40 UTC
Description of problem:
When running reliability tests for more than 3 days, one of the nodes became NotReady because the openvswitch service fails to start in a containerized install of OpenShift. I noticed the following errors in the log:

Aug 16 11:28:11 ip-172-31-32-196 openvswitch: docker: Error response from daemon: devmapper: Thin Pool has 1224 free metadata blocks which is less than minimum required 1228 free metadata blocks. Create more free metadata space in thin pool or use dm.min_free_space option to change behavior.
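The numbers in that message come from docker's devicemapper free-space gate. A minimal sketch of the arithmetic, assuming the documented behavior (containers are refused once free blocks drop below min_free_space percent of total blocks, default 10%); the total-block count below is inferred from the logged values, not taken from this node:

```python
# Hypothetical model of the devmapper min_free_space check (the real logic
# lives in docker's devicemapper graph driver). Returns whether the pool
# trips the check and the minimum free-block count it enforces.
def below_min_free(free_blocks: int, total_blocks: int, min_free_percent: int = 10):
    minimum = total_blocks * min_free_percent // 100
    return free_blocks < minimum, minimum

# Inferred total of ~12280 metadata blocks (10% -> 1228, matching the log):
# 1224 free blocks is just under the threshold, so container starts fail.
print(below_min_free(1224, 12280))  # (True, 1228)
```

This also explains why lowering dm.min_free_space (as done later in this bug) only defers the failure: it shrinks the reserved margin rather than freeing any space.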

After this error, openvswitch.service does not start:
Aug 16 11:57:39 ip-172-31-32-196 systemd: Failed to start openvswitch.service.
Aug 16 11:57:39 ip-172-31-32-196 systemd: Dependency failed for atomic-openshift-node.service.
Aug 16 11:57:39 ip-172-31-32-196 systemd: Job atomic-openshift-node.service/start failed with result 'dependency'.
Aug 16 11:57:39 ip-172-31-32-196 systemd: Unit openvswitch.service entered failed state.
Aug 16 11:57:39 ip-172-31-32-196 systemd: openvswitch.service failed.
Aug 16 11:57:44 ip-172-31-32-196 systemd: openvswitch.service holdoff time over, scheduling restart.
Aug 16 11:57:44 ip-172-31-32-196 systemd: Cannot add dependency job for unit atomic-openshift-master.service, ignoring: Unit atomic-openshift-master.service failed to load: No such file or directory.
Aug 16 11:57:44 ip-172-31-32-196 systemd: Cannot add dependency job for unit atomic-openshift-master.service, ignoring: Unit atomic-openshift-master.service failed to load: No such file or directory.
Aug 16 11:57:44 ip-172-31-32-196 systemd: Started atomic-openshift-node-dep.service.
Aug 16 11:57:44 ip-172-31-32-196 systemd: Starting atomic-openshift-node-dep.service...
Aug 16 11:57:44 ip-172-31-32-196 systemd: Starting openvswitch.service...
Aug 16 11:57:44 ip-172-31-32-196 openvswitch: Failed to remove container (openvswitch): Error response from daemon: No such container: openvswitch
Aug 16 11:57:44 ip-172-31-32-196 openvswitch: docker: Error response from daemon: Conflict. The name "/openvswitch" is already in use by container 0263d2a59521c92a6130b80ec1f92f5bbf5d1ace5270787ba43736c90a1f0b07. You have to remove (or rename) that container to be able to reuse that name..
Aug 16 11:57:44 ip-172-31-32-196 openvswitch: See '/usr/bin/docker-current run --help'.
Aug 16 11:57:44 ip-172-31-32-196 systemd: openvswitch.service: main process exited, code=exited, status=125/n/a
Aug 16 11:57:47 ip-172-31-32-196 lvm[746]: Rounding pool metadata size to boundary between physical extents: 12.00 MiB
Aug 16 11:57:47 ip-172-31-32-196 lvm[746]: Insufficient free space: 3802 extents needed, but only 1452 available
Aug 16 11:57:47 ip-172-31-32-196 lvm[746]: Failed to extend thin docker_vg-docker--pool.
Aug 16 11:57:49 ip-172-31-32-196 openvswitch: Failed to stop container (openvswitch): Error response from daemon: No such container: openvswitch

docker ps with that container ID returns nothing on that node:
docker ps -q | grep openvswitch

Version-Release number of selected component (if applicable):
openshift v3.3.0.18
kubernetes v1.3.0+507d3a7
etcd 2.3.0+git

root@ip-172-31-36-93: ~ # docker info
Containers: 3
 Running: 3
 Paused: 0
 Stopped: 0
Images: 29
Server Version: 1.10.3
Storage Driver: devicemapper
 Pool Name: docker_vg-docker--pool
 Pool Blocksize: 524.3 kB
 Base Device Size: 10.74 GB
 Backing Filesystem: xfs
 Data file: 
 Metadata file: 
 Data Space Used: 5.716 GB
 Data Space Total: 12.87 GB
 Data Space Available: 7.152 GB
 Metadata Space Used: 1.225 MB
 Metadata Space Total: 33.55 MB
 Metadata Space Available: 32.33 MB
 Udev Sync Supported: true
 Deferred Removal Enabled: true
 Deferred Deletion Enabled: true
 Deferred Deleted Device Count: 0
 Library Version: 1.02.107-RHEL7 (2015-10-14)
Execution Driver: native-0.2
Logging Driver: json-file
Plugins: 
 Volume: local
 Network: null host bridge
 Authorization: rhel-push-plugin
Kernel Version: 3.10.0-394.el7.x86_64
Operating System: Red Hat Enterprise Linux Server 7.2 (Maipo)
OSType: linux
Architecture: x86_64
Number of Docker Hooks: 2
CPUs: 4
Total Memory: 15.26 GiB
Name: ip-172-31-36-93.us-west-2.compute.internal
ID: 22OV:QWPO:T3UM:D5VM:2WS4:WAEZ:GPAT:M2PA:PT5M:UAL7:TEHE:INWD
WARNING: bridge-nf-call-iptables is disabled
Registries: registry.qe.openshift.com (insecure), registry.access.redhat.com (secure), docker.io (secure)

Steps to Reproduce:
1. Create a few projects
2. Keep rebuilding/scaling/redeploying them
3. The node becomes NotReady

Additional info:
1. Image/Builds/Deployments pruning was done every day to clean up unused data

Comment 1 Jhon Honce 2016-08-16 22:10:48 UTC
Please provide details on which commands you used for pruning.

Comment 2 Vikas Laad 2016-08-17 13:19:11 UTC
oadm prune deployments --orphans --keep-complete=5 --keep-failed=1     --keep-younger-than=60m
oadm prune builds --orphans --keep-complete=5 --keep-failed=1     --keep-younger-than=60m
oadm prune images --keep-tag-revisions=3 --keep-younger-than=60m --confirm
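For reference, a minimal sketch of the retention rule those flags express (a hypothetical model, not oadm's implementation): keep the newest N complete and M failed objects, and never prune anything younger than the cutoff:

```python
from datetime import datetime, timedelta

# Hypothetical model of 'oadm prune' retention flags. Each item is a
# (finish_time, succeeded) pair; returns the items eligible for pruning.
def prunable(items, keep_complete=5, keep_failed=1,
             keep_younger_than=timedelta(minutes=60), now=None):
    now = now or datetime.now()
    complete = sorted((i for i in items if i[1]), reverse=True)
    failed = sorted((i for i in items if not i[1]), reverse=True)
    # Everything beyond the newest keep_complete / keep_failed is a candidate...
    candidates = complete[keep_complete:] + failed[keep_failed:]
    # ...but only if it is older than the keep-younger-than cutoff.
    return [i for i in candidates if now - i[0] > keep_younger_than]
```

With 7 completed builds aged 1 through 7 hours, the two oldest fall outside the keep-complete window and past the 60-minute cutoff, so two would be pruned.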

Comment 3 Andy Goldstein 2016-08-23 01:24:09 UTC
Pruning removes data from etcd related to builds, deployments, and images. It also removes image layers from the registry's storage. It does **not** remove anything from the docker daemon's thin pool, which is apparently what is having an issue. However, looking at the output from 'docker info', everything appears to be OK. Sorry I can't be more helpful here. Perhaps vgoyal could?

Comment 4 Andy Goldstein 2016-08-23 13:54:39 UTC
One minor clarification: if 'oadm prune' deletes a build or a deployment and containers exist for the associated pods, those containers will be deleted, and anything in the containers' COW space will be deleted as well, which frees space in the thin pool.

Comment 5 Vikas Laad 2016-08-23 17:02:10 UTC
Is there any other cleanup recommended to make sure this does not happen?

Comment 7 Andy Goldstein 2016-08-23 17:04:59 UTC
I'm still confused as to why you got this error. Your output from 'docker info' appears to show plenty of space. How soon after you got the initial openvswitch error did you run 'docker info'?

Kube/OpenShift will automatically prune non-running containers as needed, and there are settings you can tweak for when that kicks in. It will also automatically prune images if it is running low on space.

Comment 8 Vikas Laad 2016-08-23 17:16:35 UTC
When I ran 'docker info', openvswitch was still not starting.

Comment 9 Vikas Laad 2016-08-23 17:28:55 UTC
After lowering dm.min_free_space, things started working.

Comment 10 Vivek Goyal 2016-08-23 17:39:42 UTC
Vikas,

Can you run docker on this system and, while docker is running, run the "dmsetup status" and "lvs -a" commands and paste the output here?

Comment 11 Vikas Laad 2016-08-23 18:11:43 UTC
Hi Vivek,

I do not have this cluster around; I am going to start the app reliability tests today and will update this bug when I hit this problem again.

Comment 12 Vikas Laad 2016-08-25 14:44:55 UTC
Andy,

I think you are talking about the following settings; we have them on all the nodes. I will stop doing manual pruning, since these settings should take care of it automatically.

  image-gc-high-threshold:
  - '80'
  image-gc-low-threshold:
  - '70'
  max-pods:
  - '250'
  maximum-dead-containers:
  - '20'
  maximum-dead-containers-per-container:
  - '1'
  minimum-container-ttl-duration:
  - 10s
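A minimal sketch of how the two image-gc thresholds above interact (a hypothetical model of the kubelet behavior: garbage collection triggers once disk usage crosses the high threshold and frees space until usage drops back to the low threshold):

```python
# Hypothetical model of kubelet image garbage collection thresholds.
# Returns how many bytes of images to reclaim (0 means GC does not trigger).
def bytes_to_free(used_bytes: int, capacity_bytes: int,
                  high_pct: int = 80, low_pct: int = 70) -> int:
    usage_pct = used_bytes * 100 / capacity_bytes
    if usage_pct < high_pct:
        return 0  # below the high-water mark: leave images alone
    target_used = capacity_bytes * low_pct // 100
    return used_bytes - target_used

# Example: 85 of 100 units used crosses the 80% trigger, so GC aims
# for 70% usage and reclaims 15 units.
print(bytes_to_free(85, 100))  # 15
```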

Please let me know if there is anything else I should do.
(In reply to Andy Goldstein from comment #7)
> I'm still confused as to why you got this error. Your output from 'docker
> info' appears to show plenty of space. How soon after you got the initial
> openvswitch error did you run 'docker info'?
> 
> Kube/OpenShift will automatically prune non-running containers as needed,
> and there are settings you can tweak for when that kicks in. It will also
> automatically prune images if it is running low on space.

Comment 13 Andy Goldstein 2016-08-25 14:46:11 UTC
Yes, but you should continue to run 'oadm prune' so it can get rid of completed builds and deployments and their associated pods/containers.

Comment 14 Vivek Goyal 2016-09-01 18:41:40 UTC
@vikas, have you been able to reproduce the problem? I think your thin pool just filled up, and that's why docker refused to start new containers. Lowering min_free_space just allows you to go a little further until you fill the last remaining free space.

So there should be a good mechanism in OpenShift to keep track of free space and keep cleaning up images/containers to make sure sufficient free space remains in the thin pool. After that, either stop sending jobs to that node or add more storage to that node.
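One way to track this is to parse 'dmsetup status' output for the thin pool. A sketch assuming the kernel's documented thin-pool status format (after the 'thin-pool' target name come the transaction id, used/total metadata blocks, then used/total data blocks); the sample line is fabricated for illustration, with metadata numbers chosen to match the error earlier in this bug:

```python
# Parse a device-mapper thin-pool status line and report free block counts,
# per the kernel thin-provisioning status format.
def thin_pool_free(status_line: str) -> dict:
    fields = status_line.split()
    i = fields.index("thin-pool")
    meta_used, meta_total = map(int, fields[i + 2].split("/"))
    data_used, data_total = map(int, fields[i + 3].split("/"))
    return {"meta_free": meta_total - meta_used,
            "data_free": data_total - data_used}

# Fabricated sample resembling 'dmsetup status docker_vg-docker--pool' output:
sample = ("0 25165824 thin-pool 42 11056/12280 10903/24545 "
          "- rw no_discard_passdown queue_if_no_space")
print(thin_pool_free(sample))  # {'meta_free': 1224, 'data_free': 13642}
```

A monitor built on this could trigger image/container cleanup, cordon the node, or extend the pool before the free counts reach docker's min_free_space floor.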

Comment 15 Vikas Laad 2016-09-02 01:31:01 UTC
@Vivek, I am still running the tests, will update the bug when/if I encounter the issue. If not I guess we will close this bug.

Comment 16 Vikas Laad 2016-09-19 14:00:53 UTC
I am not able to reproduce this issue after a couple of reliability runs on the container install. Closing this bug.

