Bug 1359059 - The agent got stuck if the broker takes more than 30 seconds to reach the smtp server
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-hosted-engine-ha
Classification: oVirt
Component: Broker
Version: 2.0.1
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ovirt-4.0.3
Target Release: 2.0.3
Assignee: Andrej Krejcir
QA Contact: Nikolai Sednev
URL:
Whiteboard: sla
Depends On:
Blocks: 1364286
Reported: 2016-07-22 08:20 UTC by Simone Tiraboschi
Modified: 2016-08-31 09:34 UTC (History)
9 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: The HA broker waits for a nonresponsive SMTP server without a timeout.
Consequence: The HA agent waits for the broker.
Fix: Add a timeout to the connection between the broker and the SMTP server.
Result: The broker and agent no longer wait a long time for an SMTP response.
Clone Of:
: 1364286 (view as bug list)
Environment:
Last Closed: 2016-08-31 09:34:14 UTC
oVirt Team: SLA
rule-engine: ovirt-4.0.z+
rule-engine: blocker+
mgoldboi: planning_ack+
msivak: devel_ack+
mavital: testing_ack+


Links
System        ID     Priority  Status  Summary                                   Last Updated
oVirt gerrit  61948  master    MERGED  Add a timeout to the notification sender  2016-08-23 09:53:30 UTC
oVirt gerrit  62717  v2.0.z    MERGED  Add a timeout to the notification sender  2016-08-23 11:43:18 UTC

Description Simone Tiraboschi 2016-07-22 08:20:01 UTC
Description of problem:

The connection between the agent and the broker has a 30-second timeout; once it expires, the connection is dropped and the agent restarts.

MainThread::DEBUG::2016-07-21 09:47:34,506::brokerlink::273::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(_communicate) Sending request: notify time=1469108854.51 type=state_transition detail=StartState-ReinitializeFSM hostname='poseidon.netsec'
MainThread::DEBUG::2016-07-21 09:47:34,506::util::77::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(socket_readline) socket_readline with 30.0 seconds timeout
MainThread::DEBUG::2016-07-21 09:48:04,545::util::88::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(socket_readline) Connection timeout while reading from socket
MainThread::ERROR::2016-07-21 09:48:04,545::brokerlink::279::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(_communicate) Connection closed: Connection timed out
MainThread::DEBUG::2016-07-21 09:48:04,546::brokerlink::86::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(disconnect) Closing connection to ha-broker
MainThread::ERROR::2016-07-21 09:48:04,547::agent::205::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Error: 'Failed to start monitor <type 'type'>, options {'hostname': 'poseidon.netsec'}: Connection timed out' - trying to restart agent
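
The log above shows the agent giving up after waiting 30 seconds for the broker's reply. A minimal sketch of that kind of timed read, assuming a plain socket and a hypothetical helper named `read_line_with_timeout` (it is not the project's actual `socket_readline`), might look like this:

```python
import socket


def read_line_with_timeout(sock, timeout):
    """Read one newline-terminated line from sock, or raise socket.timeout.

    Hypothetical helper for illustration; not the project's socket_readline.
    The timeout applies to each recv() call, not to the line as a whole.
    """
    sock.settimeout(timeout)
    chunks = []
    while True:
        data = sock.recv(1)  # byte-at-a-time keeps the sketch simple
        if not data:
            raise ConnectionError("peer closed the connection")
        if data == b"\n":
            return b"".join(chunks).decode("utf-8")
        chunks.append(data)
```

If the peer is busy doing something else (for example, waiting on a mail server) and sends nothing within `timeout` seconds, the caller sees only a generic timeout, which matches the "Connection timed out" error in the log.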

In some circumstances the broker has to send a notification email, and it does so synchronously. If connecting to the SMTP server takes more than 30 seconds, the agent restarts without a clear indication of the root cause (the log shows only a timeout).

The current code is:
server = smtplib.SMTP(cfg["smtp-server"], port=cfg["smtp-port"])

If this happens 10 times in a row (not that uncommon if the smtp server is not well configured/working), the agent will disable itself.
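
To illustrate why repeated timeouts are so damaging, here is a rough sketch of a restart loop that gives up after a fixed number of consecutive failures. The limit of 10 comes from the description above; the function and parameter names are hypothetical and not taken from ovirt-hosted-engine-ha.

```python
MAX_CONSECUTIVE_FAILURES = 10  # "10 times in a row", per the description


def run_with_restarts(run_once, max_failures=MAX_CONSECUTIVE_FAILURES):
    """Call run_once() repeatedly; stop after max_failures failures in a row.

    Returns True if run_once() completes normally, False if the failure
    limit was reached. A normal completion ends the loop, so every counted
    failure is consecutive.
    """
    failures = 0
    while failures < max_failures:
        try:
            run_once()
            return True
        except Exception:
            failures += 1
    return False
```

With a mail server that never answers, every attempt fails the same way, so the limit is reached without any attempt ever getting past the notification step.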

Version-Release number of selected component (if applicable):


How reproducible:
Only when connecting to the SMTP server takes more than 30 seconds.

Steps to Reproduce:
1. Deploy hosted-engine.
2. Make the SMTP server unreachable in a way that takes a long time to detect.

Actual results:
The connection between the agent and the broker is dropped due to the timeout, which causes the agent to restart. If this happens more than 10 times in a row, the agent disables itself.

Expected results:
The agent should keep working even if the broker takes too long to send a notification email, or at least it should fail with a clear error about sending the notification email.
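
One way to meet the first expectation is to take the mail send off the request path, so the broker can answer the agent immediately. A minimal sketch using a worker thread and a queue follows; `NotificationWorker` and `send_func` are hypothetical names for illustration, not part of the project.

```python
import queue
import threading


class NotificationWorker:
    """Run notification sends on a background thread (illustrative only)."""

    def __init__(self, send_func):
        self._send = send_func
        self._queue = queue.Queue()
        self.errors = []  # failures are recorded instead of blocking the caller
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def submit(self, message):
        """Queue a message and return immediately."""
        self._queue.put(message)

    def close(self):
        """Ask the worker to stop once the queued messages are handled."""
        self._queue.put(None)
        self._thread.join()

    def _run(self):
        while True:
            message = self._queue.get()
            if message is None:
                return
            try:
                self._send(message)
            except Exception as exc:
                self.errors.append(exc)
```

The trade-off is that the caller no longer learns whether the email was delivered, so failures have to be logged or collected on the worker side.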

Additional info:
We have to make the SMTP connection asynchronous or, more simply, add a timeout value (lower than the agent/broker connection timeout) to:
server = smtplib.SMTP(cfg["smtp-server"], port=cfg["smtp-port"])
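
A minimal sketch of the simpler option: pass a `timeout` to `smtplib.SMTP` and keep it below the 30-second agent/broker timeout. The `smtp-timeout` key mirrors the one visible in the traceback in comment 3; the `AGENT_TIMEOUT` constant, the 10-second default, and the helper names are assumptions made for this example.

```python
import smtplib

AGENT_TIMEOUT = 30.0  # agent/broker socket timeout seen in the agent log


def smtp_timeout_from(cfg, default=10.0):
    """Return an SMTP timeout strictly below the agent/broker timeout.

    The 10-second default is an assumption, not a project setting.
    """
    timeout = float(cfg.get("smtp-timeout", default))
    if not 0 < timeout < AGENT_TIMEOUT:
        raise ValueError(
            "smtp-timeout must be between 0 and %s seconds" % AGENT_TIMEOUT)
    return timeout


def connect_smtp(cfg):
    """Open an SMTP connection that gives up after the configured timeout."""
    return smtplib.SMTP(cfg["smtp-server"],
                        port=int(cfg["smtp-port"]),
                        timeout=smtp_timeout_from(cfg))
```

The timeout covers both the TCP connect and the wait for the server's greeting, so a server that accepts the connection but never answers is also cut off.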

Comment 2 Sandro Bonazzola 2016-08-25 14:23:16 UTC
Can you please fill doc-text?

Comment 3 Nikolai Sednev 2016-08-30 14:04:47 UTC
ha-agent does not die, although I do see this error in broker.log:
Thread-3::ERROR::2016-08-30 16:57:32,259::notifications::39::ovirt_hosted_engine_ha.broker.notifications.Notifications::(send_email) timed out
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/notifications.py", line 26, in send_email
    timeout=float(cfg["smtp-timeout"]))
  File "/usr/lib64/python2.7/smtplib.py", line 255, in __init__
    (code, msg) = self.connect(host, port)
  File "/usr/lib64/python2.7/smtplib.py", line 315, in connect
    self.sock = self._get_socket(host, port, self.timeout)
  File "/usr/lib64/python2.7/smtplib.py", line 290, in _get_socket
    return socket.create_connection((host, port), timeout)
  File "/usr/lib64/python2.7/socket.py", line 571, in create_connection
    raise err
timeout: timed out
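
The traceback above shows the timeout being logged by `send_email` rather than stalling the broker. A rough sketch of that pattern follows, assuming Python 3 and a hypothetical `smtp_factory` parameter added only so the example can be exercised without a mail server; the real `notifications.py` may be structured differently.

```python
import logging
import smtplib

log = logging.getLogger("notifications_sketch")  # hypothetical logger name


def send_email(cfg, sender, recipients, message, smtp_factory=smtplib.SMTP):
    """Try to send one message; log and return False on SMTP/socket errors."""
    try:
        server = smtp_factory(cfg["smtp-server"],
                              port=int(cfg["smtp-port"]),
                              timeout=float(cfg["smtp-timeout"]))
        try:
            server.sendmail(sender, recipients, message)
        finally:
            server.quit()
        return True
    except (OSError, smtplib.SMTPException) as exc:
        # socket.timeout is an OSError subclass, so a slow server lands here.
        log.error("%s", exc, exc_info=True)
        return False
```

Returning a boolean instead of raising lets the caller answer the agent promptly while the log still records the cause of the failed notification.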

Works for me with these components on the host:
libvirt-client-1.2.17-13.el7_2.5.x86_64
ovirt-imageio-common-0.3.0-0.el7ev.noarch
ovirt-vmconsole-1.0.4-1.el7ev.noarch
ovirt-hosted-engine-ha-2.0.3-1.el7ev.noarch
qemu-kvm-rhev-2.3.0-31.el7_2.21.x86_64
sanlock-3.2.4-3.el7_2.x86_64
rhevm-appliance-20160731.0-1.el7ev.noarch
ovirt-setup-lib-1.0.2-1.el7ev.noarch
ovirt-hosted-engine-setup-2.0.1.5-1.el7ev.noarch
mom-0.5.5-1.el7ev.noarch
ovirt-host-deploy-1.5.1-1.el7ev.noarch
vdsm-4.18.11-1.el7ev.x86_64
rhev-release-3.6.9-1-001.noarch
ovirt-imageio-daemon-0.3.0-0.el7ev.noarch
ovirt-vmconsole-host-1.0.4-1.el7ev.noarch
rhev-release-4.0.3-1-001.noarch
ovirt-engine-sdk-python-3.6.8.0-1.el7ev.noarch
Linux version 3.10.0-327.36.1.el7.x86_64 (mockbuild@x86-037.build.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-4) (GCC) ) #1 SMP Wed Aug 17 03:02:37 EDT 2016
Linux 3.10.0-327.36.1.el7.x86_64 #1 SMP Wed Aug 17 03:02:37 EDT 2016 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.2 (Maipo)

Engine:
ovirt-engine-dwh-setup-4.0.2-1.el7ev.noarch
ovirt-image-uploader-4.0.0-1.el7ev.noarch
ovirt-imageio-proxy-setup-0.3.0-0.el7ev.noarch
ovirt-engine-webadmin-portal-4.0.3-0.1.el7ev.noarch
ovirt-engine-restapi-4.0.3-0.1.el7ev.noarch
ovirt-host-deploy-1.5.1-1.el7ev.noarch
ovirt-engine-extension-aaa-jdbc-1.1.0-1.el7ev.noarch
ovirt-engine-cli-3.6.8.1-1.el7ev.noarch
ovirt-engine-websocket-proxy-4.0.3-0.1.el7ev.noarch
ovirt-vmconsole-1.0.4-1.el7ev.noarch
ovirt-setup-lib-1.0.2-1.el7ev.noarch
ovirt-engine-sdk-python-3.6.8.0-1.el7ev.noarch
ovirt-log-collector-4.0.0-1.el7ev.noarch
ovirt-imageio-proxy-0.3.0-0.el7ev.noarch
ovirt-engine-tools-4.0.3-0.1.el7ev.noarch
ovirt-engine-setup-base-4.0.3-0.1.el7ev.noarch
ovirt-engine-setup-plugin-ovirt-engine-common-4.0.3-0.1.el7ev.noarch
ovirt-engine-setup-plugin-ovirt-engine-4.0.3-0.1.el7ev.noarch
python-ovirt-engine-sdk4-4.0.0-0.5.a5.el7ev.x86_64
ovirt-iso-uploader-4.0.0-1.el7ev.noarch
ovirt-imageio-common-0.3.0-0.el7ev.noarch
ovirt-engine-dashboard-1.0.3-1.el7ev.x86_64
ovirt-engine-userportal-4.0.3-0.1.el7ev.noarch
ovirt-engine-4.0.3-0.1.el7ev.noarch
ovirt-host-deploy-java-1.5.1-1.el7ev.noarch
ovirt-engine-lib-4.0.3-0.1.el7ev.noarch
ovirt-engine-setup-plugin-websocket-proxy-4.0.3-0.1.el7ev.noarch
ovirt-engine-setup-4.0.3-0.1.el7ev.noarch
ovirt-engine-vmconsole-proxy-helper-4.0.3-0.1.el7ev.noarch
ovirt-engine-tools-backup-4.0.3-0.1.el7ev.noarch
ovirt-vmconsole-proxy-1.0.4-1.el7ev.noarch
ovirt-engine-dbscripts-4.0.3-0.1.el7ev.noarch
ovirt-engine-dwh-4.0.2-1.el7ev.noarch
ovirt-engine-setup-plugin-vmconsole-proxy-helper-4.0.3-0.1.el7ev.noarch
ovirt-engine-extensions-api-impl-4.0.3-0.1.el7ev.noarch
ovirt-engine-backend-4.0.3-0.1.el7ev.noarch
rhevm-spice-client-x86-msi-4.0-3.el7ev.noarch
rhevm-doc-4.0.0-3.el7ev.noarch
rhevm-spice-client-x64-msi-4.0-3.el7ev.noarch
rhev-guest-tools-iso-4.0-5.el7ev.noarch
rhevm-4.0.3-0.1.el7ev.noarch
rhevm-branding-rhev-4.0.0-5.el7ev.noarch
rhevm-guest-agent-common-1.0.12-3.el7ev.noarch
rhevm-dependencies-4.0.0-1.el7ev.noarch
rhevm-setup-plugins-4.0.0.2-1.el7ev.noarch
rhev-release-4.0.3-1-001.noarch
Linux version 3.10.0-327.22.2.el7.x86_64 (mockbuild@x86-030.build.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-4) (GCC) ) #1 SMP Thu Jun 9 10:09:10 EDT 2016
Linux 3.10.0-327.22.2.el7.x86_64 #1 SMP Thu Jun 9 10:09:10 EDT 2016 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.2 (Maipo)

