Bug 1513048 - OCP 3.5 pods reporting read only file system
Summary: OCP 3.5 pods reporting read only file system
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: CNS-deployment
Version: cns-3.5
Hardware: Unspecified
OS: Unspecified
Severity: urgent
Priority: urgent
Target Milestone: ---
Target Release: ---
Assignee: Michael Adam
QA Contact: Prasanth
URL:
Whiteboard:
Depends On:
Blocks: 1542093
 
Reported: 2017-11-14 16:21 UTC by mdunn
Modified: 2019-02-01 11:53 UTC
CC List: 9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-02-01 11:53:37 UTC


Attachments:

Description mdunn 2017-11-14 16:21:09 UTC
Description of problem:
Pods deployed on my OCP 3.5 cluster that use PVCs are in crash loop states. Looking into the logs on those pods, you find that the pod is seeing a read-only file system.

Example (from a pod running FIO):
fio-3.1
Starting 2 processes
fio: pid=7, err=30/file:io_u.c:1770, func=io_u error, error=Read-only file system
fio: io_u error on file /usr/share/fio/test: Read-only file system: read offset=53940224, buflen=4096
fio: io_u error on file /usr/share/fio/test: Read-only file system: read offset=8966144, buflen=4096
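
To confirm the read-only mount outside of fio, something along these lines can be run (the pod name is a placeholder, and oc exec only works while the container is still up):

# Placeholder pod name; substitute one of the affected pods
oc logs <fio-pod>
# Check how the Gluster volume is mounted inside the pod
oc exec <fio-pod> -- sh -c 'mount | grep gluster'
# A write on the PVC mount path should fail with EROFS if the file system is read-only
oc exec <fio-pod> -- touch /usr/share/fio/.rw-test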


The cluster consists of 1 master node, 6 CNS nodes, and 3 non-CNS nodes. Each node has 4 vCPUs and 32GB of memory.

Pods started hitting these read-only file system errors while I was in the process of deploying new apps on the cluster.

Version-Release number of selected component (if applicable):
oc v3.5.5.31.36
kubernetes v1.5.2+43a9be4
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://dhcp19-231-243.css.lab.eng.bos.redhat.com:8443
openshift v3.5.5.31.36
kubernetes v1.5.2+43a9be4


How reproducible:
I am not sure how reproducible this issue will be once the file system is no longer read-only, but currently most of the deployed pods are hitting this problem.

Steps to Reproduce:
1. Deploy pods that use PVCs (a minimal sequence is sketched after these steps)
2. Continue deploying pods
3.
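
A minimal sequence along these lines should exercise the same path (the app name, image, claim size, and the glusterfs-storage StorageClass name are illustrative placeholders, not the exact objects from this cluster):

# Create an app and attach a dynamically provisioned Gluster-backed PVC
oc new-app --name=fio-test --docker-image=<fio-image>
oc set volume dc/fio-test --add --type=persistentVolumeClaim \
    --claim-name=fio-pvc --claim-size=1Gi --claim-class=glusterfs-storage \
    --mount-path=/usr/share/fio
# Repeat with more apps/PVCs until pods start reporting read-only file systems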

Actual results:
Pods start to fail with read-only file system errors

Expected results:
Deployments should succeed and pods should run successfully

Comment 3 Jose A. Rivera 2017-11-16 17:00:33 UTC
I looked into the setup a bit, but I'm still somewhat stumped. My findings thus far:

We have a situation similar to one we've seen elsewhere, where the number of PVs OCP thinks exist does not match the number of volumes Gluster thinks exist. I haven't been able to compare against the number of volumes heketi thinks exist, because the heketi pod keeps crashing... though it does come up Ready for a short while before doing so. I've seen no error messages to help diagnose why.
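
For reference, that comparison can be done roughly like this (the glusterfs pod name and the heketi connection details are placeholders):

# Number of PVs OpenShift knows about
oc get pv --no-headers | wc -l
# Number of volumes Gluster knows about (run in any glusterfs pod)
oc exec <glusterfs-pod> -- gluster volume list | wc -l
# Number of volumes heketi knows about (only possible while the heketi pod is Ready)
heketi-cli --server http://<heketi-route> --user admin --secret <admin-key> volume list | wc -l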

We were seeing the same symptom of glusterd processes consuming 100%+ CPU on all nodes. Bringing them all down and then bringing them back up one by one seems to have resolved part of that issue, though the glusterfsd process still occasionally spikes upwards of 200%. At this time we've only brought up three nodes, the ones with bricks for the heketidbstorage volume. According to gluster vol status, the volume and related bricks are healthy.
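
For reference, the checks referred to above are roughly the following (namespace and pod name are placeholders; ps assumes procps is available in the glusterfs image):

# Top CPU consumers inside a glusterfs pod (look for glusterd/glusterfsd)
oc exec -n <cns-namespace> <glusterfs-pod> -- ps aux --sort=-%cpu | head
# Health of the heketidbstorage volume and its bricks
oc exec -n <cns-namespace> <glusterfs-pod> -- gluster volume status heketidbstorage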

Any help would be appreciated.

Comment 5 Yaniv Kaul 2019-01-29 08:23:55 UTC
What's the next step?

Comment 6 Michael Adam 2019-02-01 11:53:37 UTC
Closing this.
There was no follow-up.
There's no obvious customer case attached.
The systems are not available any more.
The software has greatly stabilized since the report (OCP 3.5).

If the issue persists with latest software, please reopen this or file a new BZ.

