Bug 1358657 - Host stopped connecting
Summary: Host stopped connecting
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 3.6.7
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ovirt-4.0.4
Target Release: ---
Assignee: Oved Ourfali
QA Contact: meital avital
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-07-21 08:32 UTC by Michal Skrivanek
Modified: 2016-09-04 05:22 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-09-04 05:22:18 UTC
oVirt Team: Infra



Description Michal Skrivanek 2016-07-21 08:32:04 UTC
After a random failure (it seems the host was rebooted), the engine can no longer connect to the host. The host seems to be in a good state and network connectivity is OK (checked with openssl s_client -connect IP:54321), yet no connection attempt is logged on the host side.
The engine side keeps saying:

2016-07-21 11:26:41,119 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetCapabilitiesVDSCommand] (DefaultQuartzScheduler_Worker-57) [] Command 'GetCapabilitiesVDSCommand(HostName = dell-r420-02, VdsIdAndVdsVDSCommandParametersBase:{runAsync='true', hostId='f1c2e1c8-b6f6-4b88-bd40-11e072412844', vds='Host[dell-r420-02,f1c2e1c8-b6f6-4b88-bd40-11e072412844]'})' execution failed: VDSGenericException: VDSNetworkException: Message timeout which can be caused by communication issues
2016-07-21 11:26:41,120 ERROR [org.ovirt.engine.core.vdsbroker.HostMonitoring] (DefaultQuartzScheduler_Worker-57) [] Failure to refresh Vds runtime info: VDSGenericException: VDSNetworkException: Message timeout which can be caused by communication issues
2016-07-21 11:26:41,120 ERROR [org.ovirt.engine.core.vdsbroker.HostMonitoring] (DefaultQuartzScheduler_Worker-57) [] Exception: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: VDSGenericException: VDSNetworkException: Message timeout which can be caused by communication issues
        at org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase.proceedProxyReturnValue(BrokerCommandBase.java:188) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.vdsbroker.GetCapabilitiesVDSCommand.executeVdsBrokerCommand(GetCapabilitiesVDSCommand.java:16) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerCommand.executeVDSCommand(VdsBrokerCommand.java:110) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.VDSCommandBase.executeCommand(VDSCommandBase.java:65) [vdsbroker.jar:]
        at org.ovirt.engine.core.dal.VdcCommandBase.execute(VdcCommandBase.java:33) [dal.jar:]
        at org.ovirt.engine.core.vdsbroker.ResourceManager.runVdsCommand(ResourceManager.java:467) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.VdsManager.refreshCapabilities(VdsManager.java:652) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.HostMonitoring.refreshVdsRunTimeInfo(HostMonitoring.java:119) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.HostMonitoring.refresh(HostMonitoring.java:84) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.VdsManager.onTimer(VdsManager.java:227) [vdsbroker.jar:]
        at sun.reflect.GeneratedMethodAccessor15.invoke(Unknown Source) [:1.7.0_101]
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) [rt.jar:1.7.0_101]
        at java.lang.reflect.Method.invoke(Method.java:606) [rt.jar:1.7.0_101]
        at org.ovirt.engine.core.utils.timer.JobWrapper.invokeMethod(JobWrapper.java:81) [scheduler.jar:]
        at org.ovirt.engine.core.utils.timer.JobWrapper.execute(JobWrapper.java:52) [scheduler.jar:]
        at org.quartz.core.JobRunShell.run(JobRunShell.java:213) [quartz.jar:]
        at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:557) [quartz.jar:]
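
To see how often the engine logged this failure, and whether other failure reasons are mixed in, the "execution failed" lines can be tallied. This is only a minimal sketch: the function name tally_failures and the regular expression are illustrative assumptions based on the line format shown above, not part of oVirt.

```python
import re
from collections import Counter

# Assumed line shape (taken from the excerpt above):
#   ... Command '<Name>VDSCommand(<params>)' execution failed: <reason>
FAILURE_RE = re.compile(
    r"Command '(?P<command>\w+)\(.*execution failed: (?P<reason>.+)$"
)


def tally_failures(lines):
    """Count (command, reason) pairs over an iterable of engine log lines."""
    counts = Counter()
    for line in lines:
        match = FAILURE_RE.search(line)
        if match:
            key = (match.group("command"), match.group("reason").strip())
            counts[key] += 1
    return counts


# Small demonstration on a shortened copy of the log line from this report.
demo = [
    "2016-07-21 11:26:41,119 ERROR [...] Command 'GetCapabilitiesVDSCommand("
    "HostName = dell-r420-02)' execution failed: VDSGenericException: "
    "VDSNetworkException: Message timeout which can be caused by "
    "communication issues",
]
for (command, reason), count in tally_failures(demo).most_common():
    print(count, command, reason)
```

On a real setup the same function could be fed an open engine log file instead of the demo list; lines that do not match the assumed shape are simply skipped.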

Comment 7 Liron Aravot 2016-07-24 17:02:41 UTC
Michal, can you please attach the relevant logs?

Comment 8 Michal Skrivanek 2016-07-27 09:26:26 UTC
(In reply to Liron Aravot from comment #7)
> Michal, can you please attach the relevant logs?

Which logs are you interested in? Also, see comment #1: you can work on this issue directly since it's an internal setup.
I will attach whatever you request, but generally it's a waste of time if you can access it yourself and look at whatever you want.

Comment 13 Liron Aravot 2016-08-09 15:09:16 UTC
Regarding the SPM info: the SPM is moving between the different hosts because getting the pool info fails on a timeout. The logs from the relevant time were deleted from the hosts, so I can't see the exact issue that caused that timeout (see [1]).

Piotr - I see that many different commands fail on certificate_expired, for example GetCapabilities (see [2]).
Moving to Infra for further investigation of the issue (if needed); please let me know if I can help.


[1] - [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-14) [5360bd2] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VDSM command failed: Message timeout which can be caused by communication issues
2016-07-21 11:45:28,471 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (DefaultQuartzScheduler_Worker-14) [5360bd2] ERROR, org.ovirt.engine.core.vdsbroker.irsbroker.GetStoragePoolInfoVDSCommand, exception: VDSGenericException: VDSNetworkException: Message timeout which can be caused by communication issues, log id: a2c4e09
2016-07-21 11:45:28,471 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (DefaultQuartzScheduler_Worker-14) [5360bd2] Exception: org.ovirt.engine.core.vdsb


[2] - 
2016-07-21 00:20:49,601 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetCapabilitiesVDSCommand] (DefaultQuartzScheduler_Worker-23) [] Command 'GetCapabilitiesVDSCommand
da8ec1dc]'})' execution failed: javax.net.ssl.SSLHandshakeException: Received fatal alert: certificate_expired
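
A certificate_expired alert means one side of the TLS handshake judged the other side's certificate to be outside its validity period, using its own clock. The thread does not establish which certificate or which clock was involved. As a hedged sketch, the end date printed by `openssl x509 -noout -enddate -in <cert.pem>` can be compared against a given time; the helper name days_until_expiry is an assumption made for illustration.

```python
import ssl
import time


def days_until_expiry(enddate_line, now=None):
    """Return days left before the certificate end date (negative if expired).

    enddate_line is the output of `openssl x509 -noout -enddate`,
    for example "notAfter=Jan  5 09:34:43 2018 GMT".
    now is a Unix timestamp; it defaults to the local clock.
    """
    not_after = enddate_line.strip().split("=", 1)[-1]
    expiry = ssl.cert_time_to_seconds(not_after)
    if now is None:
        now = time.time()
    return (expiry - now) / 86400.0


# Example with an arbitrary end date, not one taken from this setup.
remaining = days_until_expiry("notAfter=Jan  5 09:34:43 2018 GMT")
print("expired" if remaining < 0 else "valid", round(remaining, 1))
```

Passing an explicit now value makes it possible to evaluate the same certificate against the clock of another machine, which matters if a host came back from a reboot with a wrong system time.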

Comment 14 Piotr Kliczewski 2016-08-09 16:24:14 UTC
(In reply to Liron Aravot from comment #13)
> Regarding the SPM info: the SPM is moving between the different hosts
> because getting the pool info fails on a timeout. The logs from the relevant
> time were deleted from the hosts, so I can't see the exact issue that
> caused that timeout (see [1]).
> 
> Piotr - I see that many different commands fail on certificate_expired,
> for example GetCapabilities (see [2]).
> Moving to Infra for further investigation of the issue (if needed); please
> let me know if I can help.
> 
> 

We see similar failures in other bugs. It seems like a broader issue that needs to be solved.

Comment 15 Martin Perina 2016-08-15 11:20:02 UTC
Yaniv, we have been trying to find the cause of this issue for more than a month, and we still have not been able to reproduce it or find its cause. It happens on only a few installations, so it's definitely not a common issue. I see no point in targeting 4.0.3, because we still don't know whether this is an environment issue.

Comment 16 Martin Perina 2016-08-15 12:24:36 UTC
Moving to 4.0.4 for now.

Comment 17 Oved Ourfali 2016-09-04 05:22:18 UTC
If it reproduces, please contact us ASAP.
We weren't able to reproduce it manually.

