Bug 1517652

Summary: [CRI-O] Cassandra doesn't start on crio
Product: OpenShift Container Platform Reporter: Anping Li <anli>
Component: Containers Assignee: Antonio Murdaca <amurdaca>
Status: CLOSED CURRENTRELEASE QA Contact: Junqi Zhao <juzhao>
Severity: high Docs Contact:
Priority: unspecified    
Version: 3.7.0 CC: amurdaca, anli, aos-bugs, jhonce, jokerman, miburman, mmccomas, mwringe, trankin, vlaad
Target Milestone: --- Keywords: Regression
Target Release: 3.9.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
: 1607984 (view as bug list) Environment:
Last Closed: 2018-09-11 17:36:06 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Attachments:
Description Flags
openshift infra logs none

Description Anping Li 2017-11-27 07:59:54 UTC
Description of problem:
When metrics are deployed on a CRI-O system, hawkular-metrics cannot connect to hawkular-cassandra. The readiness probe appears to be failing, so there is no endpoint for hawkular-cassandra.

Version-Release number of selected component (if applicable):
openshift-ansible-3.7.7-1.git.0.3e1b62b.el7.noarch

How reproducible:
always

Steps to Reproduce:
1. install OCP-3.7 with crio
openshift_use_crio=true
openshift_crio_systemcontainer_image_override=registry.access.stage.redhat.com/openshift3/cri-o:v3.7

2. deploy metrics

3. Check the metrics pod status
# oc get pods
NAME                         READY     STATUS    RESTARTS   AGE
hawkular-cassandra-1-jpz2v   0/1       Running   0          2m
hawkular-metrics-w7sr9       0/1       Running   7          1h
heapster-94vnp               0/1       Running   7          1h

#oc describe pod hawkular-cassandra-1-jpz2v
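
The READY column in step 3 reads as "ready containers / total containers", so 0/1 means the pod is running but has not passed its readiness check. A minimal Python sketch for picking such pods out of saved `oc get pods` output follows; the function name `not_ready_pods` is hypothetical and the sample text is copied from step 3.

```python
SAMPLE = """\
NAME                         READY     STATUS    RESTARTS   AGE
hawkular-cassandra-1-jpz2v   0/1       Running   0          2m
hawkular-metrics-w7sr9       0/1       Running   7          1h
heapster-94vnp               0/1       Running   7          1h
"""


def not_ready_pods(table_text):
    """Return names of pods whose READY column shows fewer ready than total containers."""
    lines = [line for line in table_text.splitlines() if line.strip()]
    header = lines[0].split()
    name_idx = header.index("NAME")
    ready_idx = header.index("READY")
    pods = []
    for line in lines[1:]:
        cols = line.split()
        ready, total = cols[ready_idx].split("/")
        if int(ready) < int(total):
            pods.append(cols[name_idx])
    return pods


print(not_ready_pods(SAMPLE))
```

The sketch assumes the whitespace-separated table layout shown above; it does not call `oc` itself.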

Actual results:
The container is not ready. oc describe on the hawkular-cassandra pod shows "Readiness probe failed: Could not get the Cassandra status".

[root@host-172-16-120-6 ~]# oc describe pod hawkular-cassandra-1-jpz2v
Name:		hawkular-cassandra-1-jpz2v
Namespace:	openshift-infra
Node:		172.16.120.40/172.16.120.40
Start Time:	Mon, 27 Nov 2017 00:46:09 -0500
Labels:		metrics-infra=hawkular-cassandra
		name=hawkular-cassandra-1
		type=hawkular-cassandra
Annotations:	kubernetes.io/created-by={"kind":"SerializedReference","apiVersion":"v1","reference":{"kind":"ReplicationController","namespace":"openshift-infra","name":"hawkular-cassandra-1","uid":"43502c23-d316-11...
		openshift.io/scc=restricted
Status:		Running
IP:		10.129.0.13
Created By:	ReplicationController/hawkular-cassandra-1
Controlled By:	ReplicationController/hawkular-cassandra-1
Containers:
  hawkular-cassandra-1:
    Container ID:	cri-o://cff2143c32fe916766c24c885471f0096d23c639358be22da0cd74bd8b242f6f
    Image:		registry.access.stage.redhat.com/openshift3/metrics-cassandra:v3.7
    Image ID:		c38e9f5edf372e446b03abc16c5d966cc1afa74909d1868436c17d2ccaa39d4c
    Ports:		9042/TCP, 9160/TCP, 7000/TCP, 7001/TCP
    Command:
      /opt/apache-cassandra/bin/cassandra-docker.sh
      --cluster_name=hawkular-metrics
      --data_volume=/cassandra_data
      --internode_encryption=all
      --require_node_auth=true
      --enable_client_encryption=true
      --require_client_auth=true
    State:		Running
      Started:		Mon, 27 Nov 2017 00:46:09 -0500
    Ready:		False
    Restart Count:	0
    Limits:
      memory:	2G
    Requests:
      memory:	1G
    Readiness:	exec [/opt/apache-cassandra/bin/cassandra-docker-ready.sh] delay=0s timeout=1s period=10s #success=1 #failure=3
    Environment:
      CASSANDRA_MASTER:			true
      CASSANDRA_DATA_VOLUME:		/cassandra_data
      JVM_OPTS:				-Dcassandra.commitlog.ignorereplayerrors=true
      ENABLE_PROMETHEUS_ENDPOINT:	True
      TRUSTSTORE_NODES_AUTHORITIES:	/hawkular-cassandra-certs/tls.peer.truststore.crt
      TRUSTSTORE_CLIENT_AUTHORITIES:	/hawkular-cassandra-certs/tls.client.truststore.crt
      POD_NAMESPACE:			openshift-infra (v1:metadata.namespace)
      MEMORY_LIMIT:			2000000000 (limits.memory)
      CPU_LIMIT:			node allocatable (limits.cpu)
    Mounts:
      /cassandra_data from cassandra-data (rw)
      /hawkular-cassandra-certs from hawkular-cassandra-certs (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from cassandra-token-zvpk7 (ro)
Conditions:
  Type		Status
  Initialized 	True 
  Ready 	False 
  PodScheduled 	True 
Volumes:
  cassandra-data:
    Type:	EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:	
  hawkular-cassandra-certs:
    Type:	Secret (a volume populated by a Secret)
    SecretName:	hawkular-cassandra-certs
    Optional:	false
  cassandra-token-zvpk7:
    Type:	Secret (a volume populated by a Secret)
    SecretName:	cassandra-token-zvpk7
    Optional:	false
QoS Class:	Burstable
Node-Selectors:	<none>
Tolerations:	<none>
Events:
  FirstSeen	LastSeen	Count	From			SubObjectPath				Type		Reason			Message
  ---------	--------	-----	----			-------------				--------	------			-------
  10s		10s		1	default-scheduler						Normal		Scheduled		Successfully assigned hawkular-cassandra-1-jpz2v to 172.16.120.40
  9s		9s		1	kubelet, 172.16.120.40						Normal		SuccessfulMountVolume	MountVolume.SetUp succeeded for volume "cassandra-data" 
  9s		9s		1	kubelet, 172.16.120.40						Normal		SuccessfulMountVolume	MountVolume.SetUp succeeded for volume "cassandra-token-zvpk7" 
  9s		9s		1	kubelet, 172.16.120.40						Normal		SuccessfulMountVolume	MountVolume.SetUp succeeded for volume "hawkular-cassandra-certs" 
  9s		9s		1	kubelet, 172.16.120.40	spec.containers{hawkular-cassandra-1}	Normal		Pulled			Container image "registry.access.stage.redhat.com/openshift3/metrics-cassandra:v3.7" already present on machine
  9s		9s		1	kubelet, 172.16.120.40	spec.containers{hawkular-cassandra-1}	Normal		Created			Created container
  9s		9s		1	kubelet, 172.16.120.40	spec.containers{hawkular-cassandra-1}	Normal		Started			Started container
  2s		2s		1	kubelet, 172.16.120.40	spec.containers{hawkular-cassandra-1}	Warning		Unhealthy		Readiness probe failed: Could not get the Cassandra status. This may mean that the Cassandra instance is not up yet. Will try again
/opt/apache-cassandra/bin/cassandra-docker-ready.sh: line 25: nodetool: command not found
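
The Readiness line in the output above (period=10s #success=1 #failure=3) describes how probe results turn into the Ready condition. The Python sketch below is a simplified model of that bookkeeping, not the kubelet's code; the function name `readiness_over_time` is hypothetical, and the model assumes a container starts out not ready and only flips state after the configured number of consecutive identical results.

```python
def readiness_over_time(results, success_threshold=1, failure_threshold=3):
    """Replay probe results (True = probe passed) and return the Ready state after each."""
    ready = False  # assumption: a container is not ready until a probe succeeds
    successes = 0
    failures = 0
    history = []
    for passed in results:
        if passed:
            successes += 1
            failures = 0
            if successes >= success_threshold:
                ready = True
        else:
            failures += 1
            successes = 0
            if failures >= failure_threshold:
                ready = False
        history.append(ready)
    return history


# A probe that never passes, as in this report, leaves the pod at 0/1.
print(readiness_over_time([False, False, False, False]))
# One pass makes the pod ready; three consecutive failures take it back out.
print(readiness_over_time([True, False, False, False]))
```

Under this model, a probe command that always fails keeps the pod out of the service endpoints indefinitely, which matches the missing hawkular-cassandra endpoint described above.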

Expected results:
Metrics work correctly on CRI-O.

Additional info:

Comment 1 Matt Wringe 2017-11-27 14:22:54 UTC
Hawkular Metrics requires Cassandra to be in the ready state before Hawkular Metrics itself can become ready. If Hawkular Metrics can't successfully connect to Cassandra after a certain time period, it will automatically restart the pod.

For any metrics issues you will need to attach:

- the logs for the metric components (Hawkular Metrics, Cassandra, Heapster). [But in this case we only need the Cassandra logs because the other pods can't start yet]

- the output of 'oc get pods -n openshift-infra -o yaml'

- the output of 'oc describe pod ${HAWKULAR_CASSANDRA_POD_NAME}'
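
As a rough aid, the three items above can be written down as command lines. The Python sketch below only builds the argument lists and prints them; it does not run anything. The helper name `diagnostic_commands` is hypothetical, and using `oc logs` to fetch the Cassandra logs is an assumption, since the comment does not say how the logs should be gathered.

```python
def diagnostic_commands(cassandra_pod, namespace="openshift-infra"):
    """Build the oc invocations for the diagnostics requested above."""
    return [
        # Assumption: the component logs are collected with `oc logs`.
        ["oc", "logs", cassandra_pod, "-n", namespace],
        ["oc", "get", "pods", "-n", namespace, "-o", "yaml"],
        ["oc", "describe", "pod", cassandra_pod, "-n", namespace],
    ]


for argv in diagnostic_commands("hawkular-cassandra-1-jpz2v"):
    print(" ".join(argv))
```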

Comment 2 Anping Li 2017-11-29 10:25:13 UTC
Created attachment 1360255
openshift infra logs

Comment 3 Matt Wringe 2017-11-29 13:55:06 UTC
It looks like it is failing due to:

/opt/apache-cassandra/bin/cassandra-docker-ready.sh: line 25: nodetool: command not found


@stefan: can you please take a look at this and make sure it's not something wrong with the docker image they are using?

Comment 5 Michael Burman 2017-11-29 20:57:37 UTC
Anping, can you verify your version of CRI-O?

Comment 8 Antonio Murdaca 2017-11-29 21:09:02 UTC
Does this very same image work on a docker cluster? I suspect something is wrong in the image rather than a difference between cri-o and docker clusters.

Comment 9 Antonio Murdaca 2017-11-29 21:15:04 UTC
/opt/apache-cassandra/bin/cassandra-docker-ready.sh: line 25: nodetool: command not found


If the readiness probe is run *inside* the container and fails with the error above, then it's an image issue.

Comment 10 Michael Burman 2017-11-29 21:41:17 UTC
At least the latest image seems fine:

[root@miranda rhq]# docker run -i -t c3d909b40322 /bin/bash
bash-4.2$ ls -l /opt/apache-cassandra/bin/nodetool
-rwxrwxrwx. 1 root root 3359 Jun 29 13:51 /opt/apache-cassandra/bin/nodetool
bash-4.2$ nodetool
Picked up JAVA_TOOL_OPTIONS: -Duser.home=/home/jboss -Duser.name=jboss
..

[miburman@miranda Downloads]$ docker pull registry.access.stage.redhat.com/openshift3/metrics-cassandra:v3.7
Trying to pull repository registry.access.stage.redhat.com/openshift3/metrics-cassandra ... 
sha256:02580cf6f69a49f0e9e1018aa2694b275d77779862dfd50794e2e3bddc85e216: Pulling from registry.access.stage.redhat.com/openshift3/metrics-cassandra
...

docker pull registry.access.stage.redhat.com/openshift3/metrics-cassandra@sha256:c38e9f5edf372e446b03abc16c5d966cc1afa74909d1868436c17d2ccaa39d4c

Seems to reply with 404, so I can't verify that one. Anping, can you verify with current image?

Comment 11 Antonio Murdaca 2017-11-29 21:48:16 UTC
(In reply to Michael Burman from comment #10)
> At least the latest image seems fine:
> 
> [root@miranda rhq]# docker run -i -t c3d909b40322 /bin/bash
> bash-4.2$ ls -l /opt/apache-cassandra/bin/nodetool
> -rwxrwxrwx. 1 root root 3359 Jun 29 13:51 /opt/apache-cassandra/bin/nodetool
> bash-4.2$ nodetool
> Picked up JAVA_TOOL_OPTIONS: -Duser.home=/home/jboss -Duser.name=jboss
> ..
> 
> [miburman@miranda Downloads]$ docker pull
> registry.access.stage.redhat.com/openshift3/metrics-cassandra:v3.7
> Trying to pull repository
> registry.access.stage.redhat.com/openshift3/metrics-cassandra ... 
> sha256:02580cf6f69a49f0e9e1018aa2694b275d77779862dfd50794e2e3bddc85e216:
> Pulling from registry.access.stage.redhat.com/openshift3/metrics-cassandra
> ...
> 
> docker pull
> registry.access.stage.redhat.com/openshift3/metrics-cassandra@sha256:
> c38e9f5edf372e446b03abc16c5d966cc1afa74909d1868436c17d2ccaa39d4c
> 
> Seems to reply with 404, so I can't verify that one. Anping, can you verify
> with current image?

Can you check "/opt/apache-cassandra/bin/cassandra-docker-ready.sh"?

Comment 12 Anping Li 2017-11-30 03:35:50 UTC
@Antonio, I am not sure that this is an image issue.

The cri-o images/version
registry.access.stage.redhat.com/openshift3/cri-o:v3.7
crio -v 
crio version 1.0.4

Both rsh and exec work:

# oc rsh hawkular-cassandra-1-f6bcf /opt/apache-cassandra/bin/cassandra-docker-ready.sh
Picked up JAVA_TOOL_OPTIONS: -Duser.home=/home/jboss -Duser.name=jboss
Cassandra is in the up and normal state. It is now ready.
# oc exec hawkular-cassandra-1-f6bcf /opt/apache-cassandra/bin/cassandra-docker-ready.sh
Picked up JAVA_TOOL_OPTIONS: -Duser.home=/home/jboss -Duser.name=jboss
Cassandra is in the up and normal state. It is now ready.
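
The transcript above shows the script succeeding through oc rsh and oc exec while the probe reports "nodetool: command not found". One way such a difference can arise, offered here as an assumption and not as the confirmed root cause, is that the two invocations see different PATH values: a bare command name like nodetool is only found if its directory is on PATH. The Python sketch below reproduces that lookup behaviour in a temporary directory; the directory layout and the two PATH values are made up for illustration.

```python
import os
import shutil
import stat
import tempfile


def resolve(command, path_value):
    """Return the file a PATH lookup would pick for `command`, or None if not found."""
    return shutil.which(command, path=path_value)


with tempfile.TemporaryDirectory() as root:
    # Hypothetical layout: a fake nodetool in the Cassandra bin directory.
    cassandra_bin = os.path.join(root, "opt", "apache-cassandra", "bin")
    system_bin = os.path.join(root, "usr", "bin")
    os.makedirs(cassandra_bin)
    os.makedirs(system_bin)
    tool = os.path.join(cassandra_bin, "nodetool")
    with open(tool, "w") as handle:
        handle.write("#!/bin/sh\necho ok\n")
    os.chmod(tool, os.stat(tool).st_mode | stat.S_IXUSR)

    # PATH that includes the Cassandra bin directory versus one that does not.
    path_with_cassandra = os.pathsep.join([cassandra_bin, system_bin])
    path_without_cassandra = system_bin

    found_with = resolve("nodetool", path_with_cassandra)
    found_without = resolve("nodetool", path_without_cassandra)

print("with cassandra bin on PATH:", found_with)
print("without cassandra bin on PATH:", found_without)
```

If this is what happens, comparing the environment seen by the probe with the one seen by oc exec would confirm or rule it out.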

Comment 13 Antonio Murdaca 2017-11-30 08:34:26 UTC
(In reply to Anping Li from comment #12)
> @Antonio, I am not sure if that is a image issue.
> 
> The cri-o images/version
> registry.access.stage.redhat.com/openshift3/cri-o:v3.7
> crio -v 
> crio version 1.0.4
> 
> Both rsh and exec works
> 
> # oc rsh hawkular-cassandra-1-f6bcf
> /opt/apache-cassandra/bin/cassandra-docker-ready.sh
> Picked up JAVA_TOOL_OPTIONS: -Duser.home=/home/jboss -Duser.name=jboss
> Cassandra is in the up and normal state. It is now ready.
> # oc exec hawkular-cassandra-1-f6bcf
> /opt/apache-cassandra/bin/cassandra-docker-ready.sh
> Picked up JAVA_TOOL_OPTIONS: -Duser.home=/home/jboss -Duser.name=jboss
> Cassandra is in the up and normal state. It is now ready.

Could you provide steps to deploy metrics so I can reproduce?

Comment 14 Antonio Murdaca 2017-11-30 09:06:19 UTC
Ok, I can reproduce this with kubernetes as well

Comment 15 Antonio Murdaca 2017-11-30 09:50:26 UTC
Fix is here https://github.com/kubernetes-incubator/cri-o/pull/1187

We'll publish a new patch release and a new system container once that's merged.

Comment 17 Matt Wringe 2018-01-05 14:26:22 UTC
*** Bug 1531495 has been marked as a duplicate of this bug. ***

Comment 18 Junqi Zhao 2018-01-22 06:30:34 UTC
Blocked by https://bugzilla.redhat.com/show_bug.cgi?id=1529233

Comment 19 Junqi Zhao 2018-01-22 07:39:38 UTC
(In reply to Junqi Zhao from comment #18)
> Blocked by https://bugzilla.redhat.com/show_bug.cgi?id=1529233

Ignore this comment, it is fixed in 3.9 now

Comment 20 Junqi Zhao 2018-01-22 11:11:00 UTC
Tested with crio version 1.9.0; the Cassandra pod starts up on crio.

# oc get po
NAME                         READY     STATUS    RESTARTS   AGE
hawkular-cassandra-1-nktsc   1/1       Running   0          3m
hawkular-metrics-tl844       1/1       Running   0          3m
heapster-wtg7q               1/1       Running   0          3m
[root@host-172-16-120-155 ~]# runc exec -t cri-o bash
bash: /sbin/consoletype: No such file or directory
[root@host-172-16-120-155 /]# crio -v
crio version 1.9.0

Comment 21 Junqi Zhao 2018-02-27 09:52:13 UTC
The same issue happens in crio 1.9.7: the metrics pods cannot start, and the cassandra pod reports "Readiness probe errored: rpc error: code = Unknown desc = command error: command timed out, stdout: , stderr: , exit code -1".
Reopening.

metrics version: v3.9.0-0.53.0.0
# crio --version
crio version 1.9.7

# oc get po
NAME                         READY     STATUS    RESTARTS   AGE
hawkular-cassandra-1-j2nzr   0/1       Running   0          8m
hawkular-metrics-76qxt       0/1       Running   1          8m
heapster-z7g84               0/1       Running   0          7m

# oc describe po hawkular-cassandra-1-j2nzr
Name:           hawkular-cassandra-1-j2nzr
Namespace:      openshift-infra
Node:           ip-172-18-9-9.ec2.internal/172.18.9.9
Start Time:     Tue, 27 Feb 2018 04:40:41 -0500
Labels:         metrics-infra=hawkular-cassandra
                name=hawkular-cassandra-1
                type=hawkular-cassandra
Annotations:    openshift.io/scc=restricted
Status:         Running
IP:             10.129.0.39
Controlled By:  ReplicationController/hawkular-cassandra-1
Containers:
  hawkular-cassandra-1:
    Container ID:  cri-o://672ecae1a98febb64530d5bc329716d466876f2d118cf10c3c8416f2458e6f32
    Image:         registry.reg-aws.openshift.com:443/openshift3/metrics-cassandra:v3.9.0-0.53.0.0
    Image ID:      registry.reg-aws.openshift.com:443/openshift3/metrics-cassandra@sha256:b8c367205542a4ff725bad029fb89a4142f2c0eb63940c094232690dda12325f
    Ports:         9042/TCP, 9160/TCP, 7000/TCP, 7001/TCP
    Command:
      /opt/apache-cassandra/bin/cassandra-docker.sh
      --cluster_name=hawkular-metrics
      --data_volume=/cassandra_data
      --internode_encryption=all
      --require_node_auth=true
      --enable_client_encryption=true
      --require_client_auth=true
    State:          Running
      Started:      Tue, 27 Feb 2018 04:43:48 -0500
    Ready:          False
    Restart Count:  0
    Limits:
      memory:  2G
    Requests:
      memory:   1G
    Readiness:  exec [/opt/apache-cassandra/bin/cassandra-docker-ready.sh] delay=0s timeout=1s period=10s #success=1 #failure=3
    Environment:
      CASSANDRA_MASTER:               true
      CASSANDRA_DATA_VOLUME:          /cassandra_data
      JVM_OPTS:                       -Dcassandra.commitlog.ignorereplayerrors=true
      ENABLE_PROMETHEUS_ENDPOINT:     True
      TRUSTSTORE_NODES_AUTHORITIES:   /hawkular-cassandra-certs/tls.peer.truststore.crt
      TRUSTSTORE_CLIENT_AUTHORITIES:  /hawkular-cassandra-certs/tls.client.truststore.crt
      POD_NAMESPACE:                  openshift-infra (v1:metadata.namespace)
      MEMORY_LIMIT:                   2000000000 (limits.memory)
      CPU_LIMIT:                      node allocatable (limits.cpu)
    Mounts:
      /cassandra_data from cassandra-data (rw)
      /hawkular-cassandra-certs from hawkular-cassandra-certs (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from cassandra-token-s6ksh (ro)
Conditions:
  Type           Status
  Initialized    True 
  Ready          False 
  PodScheduled   True 
Volumes:
  cassandra-data:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  metrics-cassandra-1
    ReadOnly:   false
  hawkular-cassandra-certs:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  hawkular-cassandra-certs
    Optional:    false
  cassandra-token-s6ksh:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  cassandra-token-s6ksh
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/memory-pressure:NoSchedule
Events:
  Type     Reason                 Age               From                                 Message
  ----     ------                 ----              ----                                 -------
  Normal   Scheduled              8m                default-scheduler                    Successfully assigned hawkular-cassandra-1-j2nzr to ip-172-18-9-9.ec2.internal
  Normal   SuccessfulMountVolume  8m                kubelet, ip-172-18-9-9.ec2.internal  MountVolume.SetUp succeeded for volume "cassandra-token-s6ksh"
  Normal   SuccessfulMountVolume  8m                kubelet, ip-172-18-9-9.ec2.internal  MountVolume.SetUp succeeded for volume "hawkular-cassandra-certs"
  Normal   SuccessfulMountVolume  8m                kubelet, ip-172-18-9-9.ec2.internal  MountVolume.SetUp succeeded for volume "pvc-1229192c-1ba2-11e8-845a-0ef9f426f1dc"
  Normal   Pulling                8m                kubelet, ip-172-18-9-9.ec2.internal  pulling image "registry.reg-aws.openshift.com:443/openshift3/metrics-cassandra:v3.9.0-0.53.0.0"
  Normal   Pulled                 5m                kubelet, ip-172-18-9-9.ec2.internal  Successfully pulled image "registry.reg-aws.openshift.com:443/openshift3/metrics-cassandra:v3.9.0-0.53.0.0"
  Normal   Created                5m                kubelet, ip-172-18-9-9.ec2.internal  Created container
  Normal   Started                5m                kubelet, ip-172-18-9-9.ec2.internal  Started container
  Warning  Unhealthy              1m (x19 over 4m)  kubelet, ip-172-18-9-9.ec2.internal  Readiness probe errored: rpc error: code = Unknown desc = command error: command timed out, stdout: , stderr: , exit code -1
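
The probe in the output above is configured with timeout=1s, and the event reports "command timed out" with exit code -1. The Python sketch below shows the general shape of an exec check with a deadline; it is an illustration only and says nothing about how CRI-O implements it. The function name `run_exec_probe` is hypothetical, and returning -1 on timeout simply mirrors the value in the event message.

```python
import subprocess
import sys


def run_exec_probe(argv, timeout_seconds):
    """Run a probe command; return (passed, exit_code), with -1 when it times out."""
    try:
        completed = subprocess.run(
            argv,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            timeout=timeout_seconds,
        )
    except subprocess.TimeoutExpired:
        # The command did not finish in time; treat it as a failed probe.
        return False, -1
    return completed.returncode == 0, completed.returncode


# Stand-ins for a probe script: one that overruns its deadline, one that does not.
slow_command = [sys.executable, "-c", "import time; time.sleep(5)"]
fast_command = [sys.executable, "-c", "pass"]

slow_result = run_exec_probe(slow_command, 0.5)
fast_result = run_exec_probe(fast_command, 30)
print(slow_result, fast_result)
```

A probe command that would eventually succeed still counts as failed if it overruns the deadline, so a pod can stay at 0/1 even when the same script passes when run by hand.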

Comment 22 Antonio Murdaca 2018-02-28 15:53:19 UTC
Should be fixed by https://github.com/kubernetes-incubator/cri-o/pull/1386

Comment 23 Junqi Zhao 2018-03-02 03:49:08 UTC
Please change to ON_QA. The issue is fixed and the metrics pods start up.
# oc get po -n openshift-infra
NAME                         READY     STATUS    RESTARTS   AGE
hawkular-cassandra-1-8cz8s   1/1       Running   0          58m
hawkular-metrics-8gx7z       1/1       Running   0          58m
heapster-lxctn               1/1       Running   0          58m

# openshift version
openshift v3.9.1
kubernetes v1.9.1+a0ce1bc657
etcd 3.2.16

# crio --version
crio version 1.9.7

cri-o images: v3.9.0-0.53.0.0

Comment 24 Junqi Zhao 2018-03-02 08:47:35 UTC
Removing the TestBlocker keyword since the issue is fixed.

Comment 25 Junqi Zhao 2018-03-06 04:22:01 UTC
Set to VERIFIED as per Comment 23