Bug 821991 - libvirt domain event handler can not receive messages after several times virKeepAliveTimer warning
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: libvirt
Version: 6.3
Hardware: x86_64
OS: Linux
Priority: medium
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: Michal Privoznik
QA Contact: Virtualization Bugs
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2012-05-16 03:09 UTC by weizhang
Modified: 2013-01-23 10:36 UTC
CC List: 7 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2012-05-31 15:32:04 UTC
Target Upstream Version:


Attachments (Terms of Use)
libvirtd.log (deleted)
2012-05-16 03:09 UTC, weizhang
libvirtd.log (deleted)
2012-05-25 14:04 UTC, Michal Privoznik
libvirt_client.log (deleted)
2012-05-25 14:05 UTC, Michal Privoznik

Description weizhang 2012-05-16 03:09:24 UTC
Created attachment 584831 [details]
libvirtd.log

Description of problem:
When the CPU is busy and a domain operation cannot reply for more than 30 s (the default), libvirt reports an error such as "error: End of file while reading data: Input/output error" and the libvirt event-python client exits. After several iterations of the loop, the event-python client stops receiving messages altogether.

Version-Release number of selected component (if applicable):
snapshot 3
libvirt-python-0.9.10-18.el6.x86_64
libvirt-client-0.9.10-18.el6.x86_64
libvirt-0.9.10-18.el6.x86_64


How reproducible:
100%, after running for a long time

Steps to Reproduce:
1. Edit libvirtd.conf to make the error easier to trigger, then restart libvirtd:
keepalive_interval = 2
keepalive_count = 1

2. On another console, run cp in a loop to keep the CPU busy:
# while true; do cp rhel6u3-1 test; rm -rf test; done

3. Do start->save->restore->destroy loop for a guest
# while true; do virsh start kvm-rhel6u3-x86_64; sleep 10; virsh save kvm-rhel6u3-x86_64 /var/lib/libvirt/images/kvm-rhel6u3-x86_64.save; virsh restore /var/lib/libvirt/images/kvm-rhel6u3-x86_64.save; virsh destroy kvm-rhel6u3-x86_64; done

4. On a third console, run the event example to receive domain lifecycle messages:
# while true; do python /usr/share/doc/libvirt-python-0.9.10/events-python/event-test.py; done
  
Actual results:
After many iterations of the loop, the libvirt event-python client no longer receives any messages.

Expected results:
The libvirt event-python client continues to receive messages.

Additional info:

Comment 3 Michal Privoznik 2012-05-25 14:02:08 UTC
Okay, I managed to reproduce. And I think this is NOTABUG.

What's happening here: the host is indeed under heavy load, so a process, be it libvirtd, may be scheduled by the kernel less often than we would like. It can then fail to reply to keepalive requests, and thus the connection is closed. In my case it was not uncommon for libvirtd to go unscheduled for 35 seconds. Therefore the python client watching for events decided to close the connection. And that is in fact correct: the server was unresponsive.
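The keepalive countdown visible in the logs below (countToDeath ticking down every interval until the connection is dropped, and resetting on any reply) can be sketched as a toy model. This is illustrative Python only, not libvirt code; the class and method names are invented for the sketch, and the field name mirrors the countToDeath value in the debug output:

```python
class KeepAlive:
    """Toy model of a keepalive countdown (not the real libvirt implementation)."""

    def __init__(self, interval=2, count=1):
        self.interval = interval          # seconds between timer ticks
        self.count = count                # probes tolerated with no reply
        self.count_to_death = count
        self.closed = False

    def on_timer(self, got_response):
        """One timer tick: reset on a reply, otherwise count down toward close."""
        if self.closed:
            return
        if got_response:
            self.count_to_death = self.count  # any peer traffic resets the countdown
        elif self.count_to_death == 0:
            self.closed = True                # peer deemed dead; drop the connection
        else:
            self.count_to_death -= 1          # send another probe, keep waiting

# With keepalive_interval=2, keepalive_count=1 (the reproduction settings),
# two consecutive silent ticks are enough to close the connection:
ka = KeepAlive(interval=2, count=1)
ka.on_timer(got_response=False)   # countToDeath 1 -> 0
ka.on_timer(got_response=False)   # still silent -> closed
print(ka.closed)                  # True
```

With these settings an unresponsive peer is dropped after roughly interval * (count + 1) seconds of silence, which is why the aggressive values in the reproduction steps surface the problem so quickly under load.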

From the attached libvirtd.log (notice the jump from 13:20:22 to 13:20:57):
2012-05-25 13:20:21.775+0000: 31242: debug : qemuProcessHandleShutdown:632 : vm=0x7f9a3000d300
2012-05-25 13:20:22.174+0000: 31243: debug : virDomainObjRef:1623 : obj=0x7f9a3000d300 refs=3
2012-05-25 13:20:22.174+0000: 31243: debug : qemuDomainObjBeginJobInternal:753 : Starting job: destroy (async=none)
2012-05-25 13:20:57.106+0000: 31243: debug : qemuProcessStop:3763 : Shutting down VM 'f16' pid=11967 migrated=0
2012-05-25 13:20:57.107+0000: 31243: debug : qemuMonitorClose:759 : QEMU_MONITOR_CLOSE: mon=0x7f9a240243a0 refs=4
2012-05-25 13:20:57.107+0000: 31243: debug : qemuMonitorUnref:210 : QEMU_MONITOR_UNREF: mon=0x7f9a240243a0 refs=3

Moreover, we can see timers that are set to fire every 2 seconds actually firing after 36 seconds (again from libvirtd.log):
2012-05-25 13:20:57.616+0000: 31242: debug : virKeepAliveTimer:166 : RPC_KEEPALIVE_TIMEOUT: ka=0x2126400 client=0x21242b0 countToDeath=1 idle=36
2012-05-25 13:20:57.616+0000: 31242: debug : virNetMessageNew:48 : msg=0x21e9340 tracked=0
2012-05-25 13:20:57.616+0000: 31242: debug : virNetMessageEncodePayloadEmpty:441 : Encode length as 28
2012-05-25 13:20:57.616+0000: 31242: debug : virKeepAliveSend:119 : Sending keepalive request to client 0x21242b0
2012-05-25 13:20:57.616+0000: 31242: debug : virKeepAliveSend:122 : RPC_KEEPALIVE_SEND: ka=0x2126400 client=0x21242b0 prog=1801807216 vers=1 proc=1

Even though the client was trying to send keepalive requests in the meantime (client log):
2012-05-25 13:20:26.206+0000: 11895: debug : virKeepAliveSend:119 : Sending keepalive request to client 0x7f7faa50d010
2012-05-25 13:20:26.206+0000: 11895: debug : virKeepAliveSend:122 : RPC_KEEPALIVE_SEND: ka=0xa45510 client=0x7f7faa50d010 prog=1801807216 vers=1 proc=1
2012-05-25 13:20:26.206+0000: 11895: debug : virNetClientSendInternal:1683 : RPC_CLIENT_MSG_TX_QUEUE: client=0x7f7faa50d010 len=28 prog=1801807216 vers=1 proc=1 type=2 status=0 serial=0
2012-05-25 13:20:26.206+0000: 11895: debug : virNetClientIO:1541 : Outgoing message prog=1801807216 version=1 serial=0 proc=1 type=2 length=28 dispatch=(nil)
2012-05-25 13:20:26.206+0000: 11895: debug : virNetClientIO:1599 : We have the buck 0xa45640 0xa45640

2012-05-25 13:20:31.211+0000: 11895: debug : virKeepAliveTimer:166 : RPC_KEEPALIVE_TIMEOUT: ka=0xa45510 client=0x7f7faa50d010 countToDeath=2 idle=10
2012-05-25 13:20:31.211+0000: 11895: debug : virNetMessageNew:48 : msg=0xa46ec0 tracked=0
2012-05-25 13:20:31.211+0000: 11895: debug : virNetMessageEncodePayloadEmpty:441 : Encode length as 28
2012-05-25 13:20:31.211+0000: 11895: debug : virKeepAliveSend:119 : Sending keepalive request to client 0x7f7faa50d010
2012-05-25 13:20:31.211+0000: 11895: debug : virKeepAliveSend:122 : RPC_KEEPALIVE_SEND: ka=0xa45510 client=0x7f7faa50d010 prog=1801807216 vers=1 proc=1
2012-05-25 13:20:31.211+0000: 11895: debug : virNetClientSendInternal:1683 : RPC_CLIENT_MSG_TX_QUEUE: client=0x7f7faa50d010 len=28 prog=1801807216 vers=1 proc=1 type=2 status=0 serial=0
2012-05-25 13:20:31.211+0000: 11895: debug : virNetClientIO:1541 : Outgoing message prog=1801807216 version=1 serial=0 proc=1 type=2 length=28 dispatch=(nil)
2012-05-25 13:20:31.211+0000: 11895: debug : virNetClientIO:1599 : We have the buck 0xa45640 0xa45640

2012-05-25 13:20:36.216+0000: 11895: debug : virKeepAliveTimer:166 : RPC_KEEPALIVE_TIMEOUT: ka=0xa45510 client=0x7f7faa50d010 countToDeath=1 idle=15
2012-05-25 13:20:36.216+0000: 11895: debug : virNetMessageNew:48 : msg=0xa46ec0 tracked=0
2012-05-25 13:20:36.216+0000: 11895: debug : virNetMessageEncodePayloadEmpty:441 : Encode length as 28
2012-05-25 13:20:36.216+0000: 11895: debug : virKeepAliveSend:119 : Sending keepalive request to client 0x7f7faa50d010
2012-05-25 13:20:36.216+0000: 11895: debug : virKeepAliveSend:122 : RPC_KEEPALIVE_SEND: ka=0xa45510 client=0x7f7faa50d010 prog=1801807216 vers=1 proc=1
2012-05-25 13:20:36.216+0000: 11895: debug : virNetClientSendInternal:1683 : RPC_CLIENT_MSG_TX_QUEUE: client=0x7f7faa50d010 len=28 prog=1801807216 vers=1 proc=1 type=2 status=0 serial=0
2012-05-25 13:20:36.216+0000: 11895: debug : virNetClientIO:1541 : Outgoing message prog=1801807216 version=1 serial=0 proc=1 type=2 length=28 dispatch=(nil)
2012-05-25 13:20:36.216+0000: 11895: debug : virNetClientIO:1599 : We have the buck 0xa45640 0xa45640

2012-05-25 13:20:41.221+0000: 11895: debug : virKeepAliveTimer:166 : RPC_KEEPALIVE_TIMEOUT: ka=0xa45510 client=0x7f7faa50d010 countToDeath=0 idle=20
2012-05-25 13:20:41.221+0000: 11895: warning : virKeepAliveTimer:182 : No response from client 0x7f7faa50d010 after 3 keepalive messages in 20 seconds
2012-05-25 13:20:41.221+0000: 11895: debug : virNetClientClose:521 : client=0x7f7faa50d010


And the server does in fact see these requests, but by then it is too late to reply (libvirtd.log):
2012-05-25 13:20:57.627+0000: 31242: debug : virKeepAliveCheckMessage:428 : Got keepalive request from client 0x21242b0
2012-05-25 13:20:57.627+0000: 31242: debug : virKeepAliveScheduleResponse:143 : Scheduling keepalive response to client 0x21242b0
2012-05-25 13:20:57.627+0000: 31242: debug : virNetMessageNew:48 : msg=0x22a9490 tracked=0
2012-05-25 13:20:57.627+0000: 31242: debug : virNetMessageEncodePayloadEmpty:441 : Encode length as 28

Notice that the client closed the connection at 13:20:41, but the server was not scheduled until 13:20:57 (16 seconds later), by which point it was too late.
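Since the client is right to drop a connection to an unresponsive server, a practical client-side mitigation is to reconnect with a capped exponential backoff rather than the tight `while true` shell loop used in the reproduction steps, so a briefly overloaded host isn't hammered with reconnect attempts. A generic sketch (plain Python, not a libvirt API; the function name is invented for illustration):

```python
import itertools

def backoff_delays(base=1.0, cap=30.0):
    """Yield reconnect delays in seconds: base, 2*base, 4*base, ..., capped at `cap`."""
    for attempt in itertools.count():
        yield min(base * (2 ** attempt), cap)

# A reconnect loop would sleep for next(delays) after each failed attempt
# and reset the generator once a connection is re-established.
delays = backoff_delays()
print([next(delays) for _ in range(6)])  # [1.0, 2.0, 4.0, 8.0, 16.0, 30.0]
```

Raising keepalive_interval / keepalive_count in libvirtd.conf has a similar effect on the server side: it trades slower detection of a genuinely dead peer for tolerance of scheduling stalls like the one shown above.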

Comment 4 Michal Privoznik 2012-05-25 14:04:36 UTC
Created attachment 586875 [details]
libvirtd.log

Comment 5 Michal Privoznik 2012-05-25 14:05:02 UTC
Created attachment 586876 [details]
libvirt_client.log

Comment 6 Michal Privoznik 2012-05-31 15:32:04 UTC
Okay, since nobody has objected, I am closing this as NOTABUG per comment 3. If anyone feels otherwise, don't hesitate to reopen.

