|Summary:||OCP 3.5 pods reporting read only file system|
|Product:||Red Hat Gluster Storage||Reporter:||mdunn|
|Component:||CNS-deployment||Assignee:||Michael Adam <madam>|
|Status:||CLOSED INSUFFICIENT_DATA||QA Contact:||Prasanth <pprakash>|
|Version:||cns-3.5||CC:||akhakhar, hchiramm, jrivera, madam, pprakash, rhs-bugs, rtalur, sankarshan, sarumuga|
|Fixed In Version:||Doc Type:||If docs needed, set a value|
|Doc Text:||Story Points:||---|
|Last Closed:||2019-02-01 11:53:37 UTC||Type:||Bug|
|oVirt Team:||---||RHEL 7.3 requirements from Atomic Host:|
|Bug Depends On:|
Description mdunn 2017-11-14 16:21:09 UTC
Description of problem:
Pods that are deployed on my OCP 3.5 cluster and are using PVCs are in Crash loop states. Looking into the logs on those pods, you find that the pod is seeing a read only file system.

Example (from a pod running FIO):
fio-3.1
Starting 2 processes
fio: pid=7, err=30/file:io_u.c:1770, func=io_u error, error=Read-only file system
fio: io_u error on file /usr/share/fio/test: Read-only file system: read offset=53940224, buflen=4096
fio: io_u error on file /usr/share/fio/test: Read-only file system: read offset=8966144, buflen=4096

The cluster consists of 1 master node, 6 CNS nodes, and 3 non-CNS nodes. Each node has 4 vCPUs and 32GB of memory. Pods started hitting these read only file system errors while I was in the process of deploying new apps on the cluster.

Version-Release number of selected component (if applicable):
oc v126.96.36.199.36
kubernetes v1.5.2+43a9be4
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://dhcp19-231-243.css.lab.eng.bos.redhat.com:8443
openshift v188.8.131.52.36
kubernetes v1.5.2+43a9be4

How reproducible:
I am not sure how reproducible this issue would be once I get the file system to no longer be read only, but currently, most of the pods that are deployed are hitting this problem.

Steps to Reproduce:
1. Deploy pods that use PVCs
2. Continue deploying pods
3.

Actual results:
Pods start to fail with read only file system errors

Expected results:
Deployments should succeed and pods should run successfully
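For anyone triaging similar pod logs, the fio output above could be summarized with a small script. This is only a sketch: the function names `parse_fio_errors` and `errno_name` are hypothetical helpers, and the regular expression is an assumption based solely on the three log lines quoted in this report, not on fio's full set of output formats. It also shows that the `err=30` in the first line corresponds to EROFS on Linux.

```python
import errno
import re

# Pattern derived only from the log lines quoted in this report, e.g.:
#   fio: io_u error on file /usr/share/fio/test: Read-only file system: read offset=53940224, buflen=4096
IO_U_ERROR = re.compile(
    r"fio: io_u error on file (?P<path>\S+): (?P<reason>[^:]+): "
    r"(?P<op>\w+) offset=(?P<offset>\d+), buflen=(?P<buflen>\d+)"
)


def errno_name(code):
    """Map a numeric errno (such as fio's err=30) to its symbolic name, or None."""
    return errno.errorcode.get(code)


def parse_fio_errors(log_text):
    """Return one dict per 'io_u error' entry found in the given log text."""
    return [
        {
            "path": m["path"],
            "reason": m["reason"],
            "op": m["op"],
            "offset": int(m["offset"]),
            "buflen": int(m["buflen"]),
        }
        for m in IO_U_ERROR.finditer(log_text)
    ]
```

Feeding it the saved pod logs (for example, output captured from `oc logs <pod>`) gives a quick list of which files and offsets failed, which may help tell a single bad mount from a cluster-wide problem.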
Comment 3 Jose A. Rivera 2017-11-16 17:00:33 UTC
I looked into the setup a bit, but I'm still somewhat stumped. My findings thus far:

We have a similar situation to one we've seen elsewhere, where the # of PVs OCP thinks exist does not match the # of volumes Gluster thinks exist. I haven't been able to compare against the # of volumes heketi thinks exist, because the heketi pod keeps crashing... though it does come up Ready for a short while before doing so. I've seen no error messages to help diagnose why, though.

We were seeing the same symptom of glusterd processes consuming 100%+ CPU on all nodes. Bringing them all down and then bringing them up one by one seems to have resolved part of that issue, though the glusterfsd process still occasionally spikes to upwards of 200%. At this time we've only brought up three nodes, the ones with bricks for the heketidbstorage volume. According to gluster vol status, the volume and related bricks are healthy.

Any help would be appreciated.
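The PV-versus-volume mismatch described above could be checked with something like the following. This is a rough sketch, not what was actually run on this setup: it assumes the input is the saved text of `oc get pv -o json` and `gluster volume list`, that each GlusterFS-backed PV names its volume in `spec.glusterfs.path`, and the function names are hypothetical.

```python
import json


def gluster_pv_volumes(pv_json_text):
    """Gluster volume names referenced by PVs in `oc get pv -o json` output."""
    names = set()
    for pv in json.loads(pv_json_text).get("items", []):
        gluster = pv.get("spec", {}).get("glusterfs")
        # Non-Gluster PVs have no `glusterfs` section and are skipped.
        if gluster and gluster.get("path"):
            names.add(gluster["path"])
    return names


def gluster_volumes(volume_list_text):
    """Volume names from `gluster volume list` output (one name per line)."""
    return {line.strip() for line in volume_list_text.splitlines() if line.strip()}


def diff_volumes(pv_json_text, volume_list_text):
    """Report names that appear on only one side of the comparison."""
    pvs = gluster_pv_volumes(pv_json_text)
    vols = gluster_volumes(volume_list_text)
    return {
        "pv_without_volume": sorted(pvs - vols),
        "volume_without_pv": sorted(vols - pvs),
    }
```

Volumes created outside of dynamic provisioning (heketidbstorage is a likely example) would be expected to show up under `volume_without_pv`, so that list needs reading with some care.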
Comment 5 Yaniv Kaul 2019-01-29 08:23:55 UTC
What's the next step?
Comment 6 Michael Adam 2019-02-01 11:53:37 UTC
Closing this. There was no follow-up. There's no obvious customer case attached. The systems are not available any more. The software has greatly stabilized since the report (OCP 3.5). If the issue persists with latest software, please reopen this or file a new BZ.