Bug 1365670 - Hawkular frequently fails liveness probes, returns 503s in Vagrant environment
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Hawkular
Version: unspecified
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Matt Wringe
QA Contact: chunchen
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-08-09 19:52 UTC by Samuel Padgett
Modified: 2018-07-26 19:09 UTC
CC: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-09-19 13:52:51 UTC


Attachments
Hawkular metrics pod log (deleted)
2016-08-09 19:52 UTC, Samuel Padgett
no flags
Cassandra logs (deleted)
2016-08-09 20:05 UTC, Samuel Padgett
no flags
My OpenShift Vagrant configuration (deleted)
2016-08-09 20:13 UTC, Samuel Padgett
no flags
hawkular-metrics pod metrics (deleted)
2016-08-10 12:39 UTC, Samuel Padgett
no flags

Description Samuel Padgett 2016-08-09 19:52:46 UTC
Created attachment 1189390 [details]
Hawkular metrics pod log

I use a Vagrant environment as described in the origin contributing doc here:

https://github.com/openshift/origin/blob/master/CONTRIBUTING.adoc#develop-on-virtual-machine-using-vagrant

I'm frequently getting 503 responses from Hawkular metrics in the web console. It works for a while, then stops. It fails its liveness probe and eventually restarts. I'm attaching the log from the hawkular-metrics pod.

Comment 1 Matt Wringe 2016-08-09 20:03:45 UTC
Can you please attach the Cassandra logs here? And the events log for Hawkular Metrics?

From the Hawkular Metrics logs, it is timing out when trying to connect to Cassandra:
"com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: hawkular-cassandra/172.30.163.20:9042 (com.datastax.driver.core.exceptions.OperationTimedOutException: [hawkular-cassandra/172.30.163.20] Timed out waiting for server response))"

And depending on what the Hawkular Metrics events log says, it could be that the liveness probe is timing out when trying to connect to its localhost.

Are you sure your VM is properly configured and that its network is functioning properly?

@chunchen: are you able to reproduce this? I have tried, but have been unable to reproduce this (outside of running on a faulty VM)

Comment 2 Samuel Padgett 2016-08-09 20:05:16 UTC
Created attachment 1189391 [details]
Cassandra logs

Comment 3 Samuel Padgett 2016-08-09 20:07:47 UTC
> Are you sure your VM is properly configured and that its network is functioning properly?

It's the Vagrant configuration for origin using VirtualBox. I haven't had a problem until updating to the newer hawkular-metrics image recently.

Comment 4 Samuel Padgett 2016-08-09 20:13:59 UTC
Created attachment 1189393 [details]
My OpenShift Vagrant configuration

Attaching my OpenShift Vagrant configuration.

https://github.com/openshift/origin/blob/master/CONTRIBUTING.adoc#develop-on-virtual-machine-using-vagrant

Comment 5 Matt Wringe 2016-08-09 20:24:30 UTC
Is there anything in particular which is triggering this? E.g., is the console open to a particular page? Does this happen even if the console is not open?

Comment 6 Samuel Padgett 2016-08-10 12:22:19 UTC
> Is there anything in particular which is triggering this? E.g., is the console open to a particular page?

Unfortunately I haven't noticed any particular pattern.

> Does this happen even if the console is not open?

Hard to say since I'm always using the console when I'm on my computer. I will try to test.

Comment 7 Samuel Padgett 2016-08-10 12:29:33 UTC
Just hit this again. Here is the error from the Hawkular log. (No errors in the Cassandra log.)

12:26:08,214 ERROR [org.hawkular.metrics.api.jaxrs.util.ApiUtils] (RxComputationScheduler-3) HAWKMETRICS200010: Failed to process request: java.util.concurrent.ExecutionException: com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: hawkular-cassandra/172.30.163.20:9042 (com.datastax.driver.core.exceptions.OperationTimedOutException: [hawkular-cassandra/172.30.163.20] Timed out waiting for server response))
	at com.google.common.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:299)
	at com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:286)
	at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116)
	at rx.observable.ListenableFutureObservable$2$1.run(ListenableFutureObservable.java:78)
	at rx.observable.ListenableFutureObservable$1$1.call(ListenableFutureObservable.java:50)
	at rx.internal.schedulers.EventLoopsScheduler$EventLoopWorker$1.call(EventLoopsScheduler.java:172)
	at rx.internal.schedulers.ScheduledAction.run(ScheduledAction.java:55)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
Caused by: com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: hawkular-cassandra/172.30.163.20:9042 (com.datastax.driver.core.exceptions.OperationTimedOutException: [hawkular-cassandra/172.30.163.20] Timed out waiting for server response))
	at com.datastax.driver.core.RequestHandler.reportNoMoreHosts(RequestHandler.java:211)
	at com.datastax.driver.core.RequestHandler.access$1000(RequestHandler.java:43)
	at com.datastax.driver.core.RequestHandler$SpeculativeExecution.sendRequest(RequestHandler.java:277)
	at com.datastax.driver.core.RequestHandler$SpeculativeExecution$1.run(RequestHandler.java:400)
	... 3 more
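
As a quick sanity check, the pod log can be scanned to count how often this driver exception occurs and which Cassandra host is failing. A minimal sketch in Python (the embedded log excerpt and the counting logic are illustrative only, not part of the actual bug report or of Hawkular itself):

```python
import re

# Illustrative excerpt; in practice, feed in the full hawkular-metrics pod log.
log = """\
12:26:08,214 ERROR [org.hawkular.metrics.api.jaxrs.util.ApiUtils] (RxComputationScheduler-3) \
HAWKMETRICS200010: Failed to process request: java.util.concurrent.ExecutionException: \
com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed \
(tried: hawkular-cassandra/172.30.163.20:9042 \
(com.datastax.driver.core.exceptions.OperationTimedOutException: \
[hawkular-cassandra/172.30.163.20] Timed out waiting for server response))
12:27:15,002 INFO  [org.hawkular.metrics] some unrelated line
"""

# Match NoHostAvailableException lines and capture the failed host and port.
pattern = re.compile(r"NoHostAvailableException.*?tried: ([\w.-]+/[\d.]+):(\d+)")
failures = pattern.findall(log)

print(len(failures))  # number of NoHostAvailableException occurrences
print(failures[0])    # ('hawkular-cassandra/172.30.163.20', '9042')
```

Repeated hits against the same host, as here, would point at the Cassandra service rather than at intermittent network flakiness across hosts.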

Comment 8 Samuel Padgett 2016-08-10 12:30:09 UTC
Events:
  FirstSeen	LastSeen	Count	From				SubobjectPath				Type		Reason		Message
  ---------	--------	-----	----				-------------				--------	------		-------
  16h		1m		28	{kubelet openshiftdev.local}	spec.containers{hawkular-metrics}	Warning		Unhealthy	Liveness probe failed:
  16h		1m		41	{kubelet openshiftdev.local}	spec.containers{hawkular-metrics}	Warning		Unhealthy	Readiness probe failed:
  16h		1m		3	{kubelet openshiftdev.local}	spec.containers{hawkular-metrics}	Normal		Pulling		pulling image "mwringe/origin-metrics-hawkular-metrics:relative"
  1m		1m		1	{kubelet openshiftdev.local}	spec.containers{hawkular-metrics}	Normal		Killing		Killing container with docker id 79b17f789215: pod "hawkular-metrics-oggqp_openshift-infra(cea10c48-5e67-11e6-8be2-080027893417)" container "hawkular-metrics" is unhealthy, it will be killed and re-created.
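
Repeated Unhealthy events followed by a Killing event usually mean the probe's timing is too tight for the environment (here, a resource-constrained VM). For illustration only, the relevant knobs on a Kubernetes probe look like this; the actual probe shipped in the metrics template differs, and these values are hypothetical, not a recommended fix:

```yaml
livenessProbe:
  httpGet:
    path: /hawkular/metrics/status   # Hawkular Metrics status endpoint
    port: 8443
    scheme: HTTPS
  initialDelaySeconds: 120  # allow time to connect to Cassandra on startup
  timeoutSeconds: 10        # tolerate slow responses on a loaded VM
  failureThreshold: 3       # kill the container only after repeated failures
```

Loosening these would mask, not fix, the underlying timeouts, which is consistent with the eventual resolution below being an image update rather than probe tuning.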

Comment 9 Samuel Padgett 2016-08-10 12:39:21 UTC
Created attachment 1189620 [details]
hawkular-metrics pod metrics

I see CPU usage for the hawkular-metrics pod spike before it begins failing.

Comment 10 Matt Wringe 2016-08-11 13:44:27 UTC
@chunchen are you able to reproduce this?

Comment 11 John Sanda 2016-08-11 21:19:59 UTC
Is heapster reporting any errors when it pushes data to Hawkular Metrics?

And after the error(s) occur, would it be possible to connect to Cassandra with jconsole? I want to do that because request timeouts are tracked as metrics and exposed via JMX. Since we are not currently collecting metrics on Cassandra, this would give us a better idea of what might be failing.

Comment 12 chunchen 2016-08-12 07:12:52 UTC
I am trying to reproduce this issue, but I always hit "/sbin/mount.vboxsf: mounting failed with the error: No such device" when executing `vagrant up` or `sudo vagrant up`.

Log details: http://pastebin.test.redhat.com/401997


[chunchen@F17-CCY Downloads]$ vagrant plugin list |grep guest
[chunchen@F17-CCY Downloads]$ sudo vagrant plugin list |grep guest
vagrant-vbguest (0.12.0)

Comment 13 Samuel Padgett 2016-08-12 12:24:37 UTC
skuznets@redhat.com -- Do you know who can help with the vagrant error?

Comment 14 Steve Kuznetsov 2016-08-12 13:04:27 UTC
I'm not sure who owns the Vagrantfile to be honest, but I think the issue with the setup from that pastebin is the Guest Additions mismatch:

==> master: Checking for guest additions in VM...
    master: The guest additions on this VM do not match the installed version of
    master: VirtualBox! In most cases this is fine, but in rare cases it can
    master: prevent things such as shared folders from working properly. If you see
    master: shared folder errors, please make sure the guest additions within the
    master: virtual machine match the version of VirtualBox you have installed on
    master: your host and reload your VM.
[openshiftdev] GuestAdditions versions on your host (4.3.12) and guest (5.0.12) do not match.


I would suggest installing vagrant-vbguest[1] and getting the guest additions on the guest to match those on the host.

[1] https://github.com/dotless-de/vagrant-vbguest

Comment 15 Matt Wringe 2016-08-17 23:07:12 UTC
@spadgett Can you try with the latest openshift/origin-metrics-hawkular-metrics:latest image? It's using a new WildFly version with a bug fix that may have been affecting you.

Comment 16 Samuel Padgett 2016-08-18 12:55:53 UTC
Matt, thanks, trying it now. I'll let you know how it goes.

Comment 17 Samuel Padgett 2016-08-18 13:22:09 UTC
I can no longer reproduce the crash with the latest hawkular-metrics image.

Comment 18 chunchen 2016-08-19 06:38:23 UTC
According to comment #17, marking this as verified.

