
Bug 1513048

Summary: OCP 3.5 pods reporting read only file system
Product: Red Hat Gluster Storage
Reporter: mdunn
Component: CNS-deployment
Assignee: Michael Adam <madam>
Status: CLOSED INSUFFICIENT_DATA
QA Contact: Prasanth <pprakash>
Severity: urgent
Priority: urgent
Version: cns-3.5
CC: akhakhar, hchiramm, jrivera, madam, pprakash, rhs-bugs, rtalur, sankarshan, sarumuga
Target Milestone: ---
Keywords: ZStream
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2019-02-01 11:53:37 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Bug Depends On:
Bug Blocks: 1542093

Description mdunn 2017-11-14 16:21:09 UTC
Description of problem:
Pods deployed on my OCP 3.5 cluster that use PVCs are in CrashLoopBackOff states. The logs on those pods show that they are hitting a read-only file system.

Example (from a pod running FIO):
Starting 2 processes
fio: pid=7, err=30/file:io_u.c:1770, func=io_u error, error=Read-only file system
fio: io_u error on file /usr/share/fio/test: Read-only file system: read offset=53940224, buflen=4096
fio: io_u error on file /usr/share/fio/test: Read-only file system: read offset=8966144, buflen=4096

The cluster consists of 1 master node, 6 CNS nodes, and 3 non-CNS nodes. Each node has 4 vCPUs and 32GB of memory.

Pods started hitting these read-only file system errors while I was in the process of deploying new apps on the cluster.

Version-Release number of selected component (if applicable):
oc v3.
kubernetes v1.5.2+43a9be4
features: Basic-Auth GSSAPI Kerberos SPNEGO

openshift v3.
kubernetes v1.5.2+43a9be4

How reproducible:
I am not sure how reproducible this issue will be once the file system is no longer read-only, but currently most of the deployed pods are hitting this problem.

Steps to Reproduce:
1. Deploy pods that use PVCs
2. Continue deploying pods
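As a concrete illustration of step 1, here is a minimal sketch of the kind of PVC involved. The claim name, size, and storage class are illustrative assumptions, not taken from the reporter's environment; note that OCP 3.5 (Kubernetes 1.5) predates spec.storageClassName, so the beta annotation selects the storage class:

```yaml
# Hypothetical PVC sketch -- name, size, and storage class are assumptions.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: example-cns-pvc
  annotations:
    # Kubernetes 1.5 uses the beta annotation instead of spec.storageClassName
    # to select the CNS/Gluster-backed storage class.
    volume.beta.kubernetes.io/storage-class: glusterfs-storage
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
```

A pod then mounts the claim through spec.volumes[].persistentVolumeClaim.claimName, as in the FIO example above.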

Actual results:
Pods start to fail with read-only file system errors

Expected results:
Deployments should succeed and pods should run successfully

Comment 3 Jose A. Rivera 2017-11-16 17:00:33 UTC
I looked into the setup a bit, but I'm still somewhat stumped. My findings thus far:

We have a similar situation to one we've seen elsewhere, where the number of PVs that OCP thinks exist does not match the number of volumes that Gluster thinks exist. I haven't been able to compare against the number of volumes heketi thinks exist, because the heketi pod keeps crashing... though it does come up Ready for a short while before doing so. I've seen no error messages to help diagnose why.

We were also seeing the same symptom of glusterd processes consuming 100%+ CPU on all nodes. Bringing them all down and then back up one by one seems to have resolved part of that issue, though the glusterfsd process still occasionally spikes upwards of 200%. At this time we've only brought up three nodes, the ones with bricks for the heketidbstorage volume. According to gluster vol status, the volume and related bricks are healthy.
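The count comparison described above can be scripted. A hedged sketch, assuming cluster-admin access via oc, a working heketi-cli pointed at the heketi service, and Gluster CLI access on a storage node; all commands are read-only, and the script skips cleanly where a tool is absent:

```shell
# Diagnostic sketch: compare how many Gluster-backed volumes each layer
# (OCP, Gluster, heketi) thinks exist. Read-only; tool names are assumptions
# about what is reachable from the node where this runs.
for tool in oc gluster heketi-cli; do
  command -v "$tool" >/dev/null 2>&1 || { echo "SKIP: $tool not available"; exit 0; }
done

pv_count=$(oc get pv --no-headers | wc -l)
gluster_count=$(gluster volume list | wc -l)
heketi_count=$(heketi-cli volume list | wc -l)

echo "PVs in OCP:      $pv_count"
echo "Gluster volumes: $gluster_count"
echo "heketi volumes:  $heketi_count"
```

A mismatch between the three counts would confirm the PV/volume discrepancy without needing the heketi pod to stay up long enough for interactive inspection.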

Any help would be appreciated.

Comment 5 Yaniv Kaul 2019-01-29 08:23:55 UTC
What's the next step?

Comment 6 Michael Adam 2019-02-01 11:53:37 UTC
Closing this.
There was no follow-up.
There's no obvious customer case attached.
The systems are not available any more.
The software has greatly stabilized since the report (OCP 3.5).

If the issue persists with latest software, please reopen this or file a new BZ.