
Bug 1514884

Summary: 354% CPU on ovirt-engine -server
Product: [oVirt] ovirt-engine
Component: General
Version: 4.2.0
Hardware: Unspecified
OS: Unspecified
Status: CLOSED WORKSFORME
Severity: high
Priority: unspecified
Reporter: eberman
Assignee: bugs <bugs>
QA Contact: guy chen <guchen>
Docs Contact:
CC: bugs, dagur, gklein, izuckerm, mperina, pkliczew, ratamir, rgolan, rnori
Keywords: Performance
Flags: ratamir: needinfo-
Target Milestone: ---
Target Release: ---
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-04-05 13:08:30 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Infra
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---

Description eberman 2017-11-19 08:38:38 UTC
Description of problem:

top - 10:29:36 up 2 days, 23:34,  3 users,  load average: 11.85, 10.73, 13.22
Tasks: 2046 total,   1 running, 2041 sleeping,   4 stopped,   0 zombie
%Cpu(s): 37.5 us, 53.3 sy,  0.6 ni,  8.6 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 32754308 total,   566372 free, 13373212 used, 18814724 buff/cache
KiB Swap: 16516092 total, 16392444 free,   123648 used. 18598448 avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                                                  
11549 ovirt     20   0 8905048 4.801g   7360 S 354.3 15.4  13933:33 ovirt-engine -server -XX:+TieredCompilation -Xms4g -Xmx4g -Djava.awt.headless=true -Dsun.rmi.dgc.client+ 
 4946 root      39  19    3712   3372    120 S   3.6  0.0   8:02.79 /opt/netdata/usr/libexec/netdata/plugins.d/apps.plugin 1                                                 
 9206 root      20   0  159852   4284   1532 R   1.3  0.0   0:00.47 top                                                                                                
Version-Release number of selected component (if applicable):

Tracked the per-thread CPU usage of the process with:
ps -Lo pcpu,pmem,lwp -p `pidof ovirt-engine` | sort -h 
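The busiest LWP reported above can be matched against the jcmd thread dump below by converting it to hex, because the nid field in a HotSpot thread dump is the native thread id (the LWP) in hexadecimal. A minimal sketch of that lookup, using the same commands already shown in this report (11716 decimal is 0x2dc4):

# Take the hottest thread's LWP (last line after sort -h), convert it to hex,
# then pull the matching stack out of a jcmd thread dump.
BUSY_LWP=$(ps -Lo pcpu,pmem,lwp -p "$(pidof ovirt-engine)" | sort -h | tail -1 | awk '{print $3}')
BUSY_NID=$(printf '%x' "$BUSY_LWP")   # e.g. 11716 -> 2dc4
sudo su - ovirt -s /bin/sh -c "jcmd $(pidof ovirt-engine) Thread.print" | grep -A 5 "nid=0x$BUSY_NID"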


[root@vega09 ovirt-engine]# sudo su - ovirt -s /bin/sh -c "jcmd  $(pidof ovirt-engine) Thread.print" | grep 2dc4 -A 5
"SSL Stomp Reactor" #142 daemon prio=5 os_prio=0 tid=0x00007f55b890a800 nid=0x2dc4 runnable [0x00007f54ee6f8000]
   java.lang.Thread.State: RUNNABLE
	at sun.misc.Unsafe.unpark(Native Method)
	at java.util.concurrent.locks.LockSupport.unpark(LockSupport.java:141)
	at java.util.concurrent.FutureTask.finishCompletion(FutureTask.java:372)
	at java.util.concurrent.FutureTask.set(FutureTask.java:233)


How reproducible:

100%

Steps to Reproduce:
1. 6 hosts up and 300 VMs
2. Monitor the process overnight

Actual results:
A huge amount of CPU (~354%) is used by the ovirt-engine -server process.

Expected results:


Additional info:

Comment 2 Ilan Zuckerman 2017-11-19 09:07:41 UTC
Engine Software version:

[root@vega09 ilan]# rpm -qa |grep ovirt-engine

ovirt-engine-setup-plugin-vmconsole-proxy-helper-4.2.0-0.0.master.20171105155215.git59d3324.el7.centos.noarch
ovirt-engine-wildfly-overlay-11.0.1-1.el7.centos.noarch
ovirt-engine-dashboard-1.2.0-0.5.20171012git78f7a28.el7.centos.noarch
ovirt-engine-dwh-setup-4.2.1-0.2.master.20171026133424.el7.centos.noarch
ovirt-engine-extensions-api-impl-4.2.0-0.0.master.20171105155215.git59d3324.el7.centos.noarch
ovirt-engine-dbscripts-4.2.0-0.0.master.20171025204923.git6f4cbc5.el7.centos.noarch
ovirt-engine-setup-plugin-ovirt-engine-common-4.2.0-0.0.master.20171105155215.git59d3324.el7.centos.noarch
ovirt-engine-lib-4.2.0-0.0.master.20171105155215.git59d3324.el7.centos.noarch
ovirt-engine-wildfly-11.0.0-1.el7.centos.x86_64
ovirt-engine-api-explorer-0.0.1-0.0.20171013git1fbdc17.el7.centos.noarch
ovirt-engine-setup-plugin-websocket-proxy-4.2.0-0.0.master.20171105155215.git59d3324.el7.centos.noarch
ovirt-engine-cli-3.6.9.3-0.1.20170116.gitad0561a.el7.centos.noarch
ovirt-engine-setup-base-4.2.0-0.0.master.20171105155215.git59d3324.el7.centos.noarch
ovirt-engine-setup-plugin-ovirt-engine-4.2.0-0.0.master.20171105155215.git59d3324.el7.centos.noarch
ovirt-engine-extension-aaa-jdbc-1.1.7-0.0.master.git1660a27.el7.centos.noarch
ovirt-engine-tools-backup-4.2.0-0.0.master.20171025204923.git6f4cbc5.el7.centos.noarch
ovirt-engine-webadmin-portal-4.2.0-0.0.master.20171025204923.git6f4cbc5.el7.centos.noarch
ovirt-engine-backend-4.2.0-0.0.master.20171025204923.git6f4cbc5.el7.centos.noarch
ovirt-engine-vmconsole-proxy-helper-4.2.0-0.0.master.20171105155215.git59d3324.el7.centos.noarch
ovirt-engine-restapi-4.2.0-0.0.master.20171025204923.git6f4cbc5.el7.centos.noarch
ovirt-engine-sdk-python-3.6.9.2-0.1.20161204.gite99bbd1.el7.centos.noarch
ovirt-engine-websocket-proxy-4.2.0-0.0.master.20171105155215.git59d3324.el7.centos.noarch
ovirt-engine-4.2.0-0.0.master.20171025204923.git6f4cbc5.el7.centos.noarch
ovirt-engine-setup-4.2.0-0.0.master.20171105155215.git59d3324.el7.centos.noarch
ovirt-engine-dwh-4.2.1-0.0.master.20171001102132.el7.centos.noarch
python-ovirt-engine-sdk4-4.2.1-1.a3.20171103git1ef0871.el7.centos.x86_64
ovirt-engine-tools-4.2.0-0.0.master.20171025204923.git6f4cbc5.el7.centos.noarch
ovirt-engine-metrics-1.1.1-0.2.master.20171031211242.el7.centos.noarch

Comment 3 Yaniv Kaul 2017-11-19 14:45:24 UTC
Roy, I thought you saw something similar on RHEV.TLV?

Comment 4 Martin Perina 2017-11-20 07:31:47 UTC
I was unable to open the provided dump file. What's the format? Why not provide a normal thread dump of the JVM process? Why is the attached server.log empty?

Looking at the engine.log briefly, there's something broken in your environment. You claim that you have 6 hosts, but I can see many more NonResponsive hosts in your setup (218, if I grepped the log correctly). How is that?

Comment 5 Ilan Zuckerman 2017-11-20 08:17:46 UTC
(In reply to Martin Perina from comment #4)
Hi Martin,

> I was unable to open provided dump file? What's the format?
$file engine.dump
engine.dump: data
Apparently this is a binary data file. I attached it because we thought it might be useful for you.

> Why not providing normal thread dump of JVM processes?
Please suggest a proper way to do this next time we encounter this issue.

> Why attached server.log is empty?
For some reason nothing was written to it, neither the day before nor two days before.

> Looking at the engine.log briefly, there's something broken in your
> environment. You are claiming that you have 6 hosts, but I can see much more
> (218 if I grepped the log correctly) NonResponsive hosts in your setup. So
> how it is?
Those are probably nested hosts (which are actually VMs). There are only 6 physical hosts.

Comment 6 Yaniv Kaul 2017-11-20 08:35:05 UTC
Re-setting NEEDINFO on QE - we need more data to analyze.
If you don't have it, please close the bug.

Comment 7 Martin Perina 2017-11-20 08:55:24 UTC
(In reply to Ilan Zuckerman from comment #5)
> (In reply to Martin Perina from comment #4)
> Hi Martin,
> 
> > I was unable to open provided dump file? What's the format?
> $file engine.dump
> engine.dump: data
> Apparently this is binary data file. I attached this because we thought this
> might be useful for you guys.

Which tool did you create that dump with?

> 
> > Why not providing normal thread dump of JVM processes?
> Please suggest a proper way to do this next time we encounter this issue.

Why not use 'kill -3 PID'? With that, the actual thread dump will be stored in console.log.
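(Illustration of that suggestion, assuming a default oVirt installation where the engine console output goes to /var/log/ovirt-engine/console.log:)

# SIGQUIT makes the JVM print a full thread dump to its stdout; the process keeps running.
kill -3 "$(pidof ovirt-engine)"
tail -n 300 /var/log/ovirt-engine/console.log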

> 
> > Why attached server.log is empty?
> For some reason it was not written. Neither the day before, nor two days
> before.

OK, that means no critical errors appeared.

> 
> > Looking at the engine.log briefly, there's something broken in your
> > environment. You are claiming that you have 6 hosts, but I can see much more
> > (218 if I grepped the log correctly) NonResponsive hosts in your setup. So
> > how it is?
> Those are probably nested hosts (which are actually vms). Physical hosts are
> only 6.

I think the important information here is the number of hosts (either physical or nested) managed by the engine. How many hosts are added to the engine? How many of them are non-responsive?

Comment 8 eberman 2017-11-27 08:37:15 UTC
Roy, since we are currently trying to stabilize the system and gather the information requested for https://bugzilla.redhat.com/show_bug.cgi?id=1516256:

Can you please add your input, since you encountered the same issue?
Are you able to reproduce it on your current system?

Comment 9 eberman 2017-11-30 14:41:49 UTC
Just reproduced the high CPU on vega09 (system idle).

[root@vega09 ovirt-engine]# cat /etc/os-release 
VERSION="7.5 (Maipo)"
ID="rhel"
ID_LIKE="fedora"
VERSION_ID="7.5"
ANSI_COLOR="0;31"
HOME_URL="https://www.redhat.com/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"
REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 7"
REDHAT_BUGZILLA_PRODUCT_VERSION="7.5"
REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="7.5 Beta"
PRIVACY_POLICY_URL="http://www.redhat.com/en/about/privacy-policy"
NAME="Red Hat Enterprise Linux Server"
PRETTY_NAME="Red Hat Enterprise Linux Server 7.5 Beta (Maipo)"
VARIANT="Server"
VARIANT_ID="server"
CPE_NAME="cpe:/o:redhat:enterprise_linux:7.5:beta:server"


[root@vega09 ovirt-engine]# rpm -qa |grep ovirt-engine
ovirt-engine-setup-base-4.2.0-0.0.master.20171121184703.git173fe83.el7.centos.noarch
ovirt-engine-setup-plugin-vmconsole-proxy-helper-4.2.0-0.0.master.20171121184703.git173fe83.el7.centos.noarch
ovirt-engine-wildfly-overlay-11.0.1-1.el7.centos.noarch
ovirt-engine-dashboard-1.2.0-0.5.20171012git78f7a28.el7.centos.noarch
ovirt-engine-dwh-setup-4.2.1-0.2.master.20171026133424.el7.centos.noarch
ovirt-engine-setup-plugin-ovirt-engine-common-4.2.0-0.0.master.20171121184703.git173fe83.el7.centos.noarch
ovirt-engine-metrics-1.1.1-0.4.master.20171114114759.el7.centos.noarch
ovirt-engine-dbscripts-4.2.0-0.0.master.20171025204923.git6f4cbc5.el7.centos.noarch
ovirt-engine-wildfly-11.0.0-1.el7.centos.x86_64
ovirt-engine-api-explorer-0.0.1-0.0.20171013git1fbdc17.el7.centos.noarch
ovirt-engine-cli-3.6.9.3-0.1.20170116.gitad0561a.el7.centos.noarch
ovirt-engine-setup-4.2.0-0.0.master.20171121184703.git173fe83.el7.centos.noarch
ovirt-engine-extension-aaa-jdbc-1.1.7-0.0.master.git1660a27.el7.centos.noarch
ovirt-engine-tools-backup-4.2.0-0.0.master.20171025204923.git6f4cbc5.el7.centos.noarch
ovirt-engine-webadmin-portal-4.2.0-0.0.master.20171025204923.git6f4cbc5.el7.centos.noarch
ovirt-engine-backend-4.2.0-0.0.master.20171025204923.git6f4cbc5.el7.centos.noarch
ovirt-engine-extensions-api-impl-4.2.0-0.0.master.20171121184703.git173fe83.el7.centos.noarch
ovirt-engine-restapi-4.2.0-0.0.master.20171025204923.git6f4cbc5.el7.centos.noarch
ovirt-engine-vmconsole-proxy-helper-4.2.0-0.0.master.20171121184703.git173fe83.el7.centos.noarch
ovirt-engine-sdk-python-3.6.9.2-0.1.20161204.gite99bbd1.el7.centos.noarch
ovirt-engine-websocket-proxy-4.2.0-0.0.master.20171121184703.git173fe83.el7.centos.noarch
ovirt-engine-4.2.0-0.0.master.20171025204923.git6f4cbc5.el7.centos.noarch
python-ovirt-engine-sdk4-4.2.1-1.a3.20171121gitbdf815e.el7.centos.x86_64
ovirt-engine-dwh-4.2.1-0.0.master.20171001102132.el7.centos.noarch
ovirt-engine-lib-4.2.0-0.0.master.20171121184703.git173fe83.el7.centos.noarch
ovirt-engine-setup-plugin-websocket-proxy-4.2.0-0.0.master.20171121184703.git173fe83.el7.centos.noarch
ovirt-engine-setup-plugin-ovirt-engine-4.2.0-0.0.master.20171121184703.git173fe83.el7.centos.noarch
ovirt-engine-tools-4.2.0-0.0.master.20171025204923.git6f4cbc5.el7.centos.noarch

[root@vega09 ovirt-engine]# free
              total        used        free      shared  buff/cache   available
Mem:       32754064    17761048      245016      165204    14748000    14117156
Swap:      16516092      118528    16397564

top - 16:37:08 up 4 days, 16:42,  7 users,  load average: 1.45, 1.54, 1.88
Tasks: 2170 total,   1 running, 2149 sleeping,  20 stopped,   0 zombie
%Cpu(s): 26.7 us,  9.0 sy,  0.0 ni, 63.7 id,  0.3 wa,  0.0 hi,  0.2 si,  0.0 st
KiB Mem : 32754064 total, 18253232 free,  7210796 used,  7290036 buff/cache
KiB Swap: 16516092 total, 16409852 free,   106240 used. 24669436 avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                                                  
 4514 ovirt     20   0 8458416   2.2g  25088 S 126.4  7.1  65:52.75 java                                                                                                     
 1210 root      20   0   69308  17200   2256 S   4.3  0.1  55:30.69 ovsdb-server                                                                                             
 5134 postgres  20   0  431488 206788 125596 S   3.0  0.6   0:23.04 postgres                                                                                                 
 5153 postgres  20   0  392384 166156 117676 S   2.3  0.5   0:10.72 postgres                                                                                                 
 5430 postgres  20   0  370168  57636  32912 S   2.0  0.2   0:01.12 postgres                                                                                                 
 5381 postgres  20   0  428416 208860 130132 S   1.7  0.6   0:13.19 postgres                                                                                                 
 5432 postgres  20   0  431900 219900 137208 S   1.3  0.7   0:28.65 postgres                                                                                                 
17464 root      20   0  161944   4452   1572 R   1.0  0.0   0:00.42 top                        


Finding the busy thread:
 ps -Lo pcpu,pmem,lwp -p 4514 | sort -h
                      

%CPU %MEM   LWP
 0.1  6.8  4526
 0.1  6.8  4527
 0.1  6.8  4530
 0.1  6.8  4531
 0.1  6.8  4677
 0.1  6.8  4856
 0.1  6.8  4877
 0.1  6.8  4881
 0.1  6.8  4887
 0.1  6.8  4898
 0.1  6.8  4899
 0.1  6.8  4908
 0.1  6.8  4929
 0.1  6.8  4931
 0.1  6.8  4932
 0.1  6.8  4951
 0.1  6.8  4961
 0.1  6.8  4970
 0.1  6.8  4978
 0.1  6.8  4981
 0.1  6.8  4982
 0.1  6.8  4989
 0.1  6.8  4993
 0.1  6.8  4995
 0.1  6.8  4997
 0.1  6.8  5002
 0.1  6.8  5003
 0.1  6.8  5004
 0.1  6.8  5005
 0.1  6.8  5006
 0.1  6.8  5009
 0.1  6.8  5011
 0.1  6.8  5013
 0.1  6.8  5014
 0.1  6.8  5015
 0.1  6.8  5016
 0.1  6.8  5017
 0.1  6.8  5018
 0.1  6.8  5019
 0.1  6.8  5021
 0.1  6.8  5024
 0.1  6.8  5025
 0.1  6.8  5026
 0.1  6.8  5027
 0.1  6.8  5028
 0.1  6.8  5029
 0.1  6.8  5031
 0.1  6.8  5032
 0.1  6.8  5035
 0.1  6.8  5036
 0.1  6.8  5038
 0.1  6.8  5049
 0.1  6.8  5050
 0.1  6.8  5053
 0.1  6.8  5056
 0.1  6.8  5057
 0.1  6.8  5058
 0.1  6.8  5062
 0.1  6.8  5063
 0.1  6.8  5064
 0.1  6.8  5065
 0.1  6.8  5068
 0.1  6.8  5069
 0.1  6.8  5070
 0.1  6.8  5071
 0.1  6.8  5072
 0.1  6.8  5073
 0.1  6.8  5074
 0.1  6.8  5077
 0.1  6.8  5078
 0.1  6.8  5079
 0.1  6.8  5080
 0.1  6.8  5081
 0.1  6.8  5084
 0.1  6.8  5086
 0.1  6.8  5087
 0.1  6.8  5088
 0.1  6.8  5089
 0.1  6.8  5090
 0.1  6.8  5092
 0.1  6.8  5097
 0.1  6.8  5098
 0.1  6.8  5099
 0.1  6.8  5100
 0.1  6.8  5102
 0.1  6.8  5105
 0.1  6.8  5106
 0.1  6.8  5107
 0.1  6.8  5110
 0.1  6.8  5111
 0.1  6.8  5113
 0.1  6.8  5114
 0.1  6.8  5115
 0.2  6.8  4879
 0.2  6.8  4912
 0.2  6.8  4985
 0.2  6.8  5008
 0.2  6.8  5022
 0.2  6.8  5023
 0.2  6.8  5054
 0.2  6.8  5061
 0.2  6.8  5094
 0.2  6.8  5096
 0.2  6.8  5101
 0.2  6.8  5109
 0.3  6.8  4791
 0.5  6.8  4907
 0.6  6.8 15038
 0.6  6.8 15040
 0.6  6.8  4595
 2.6  6.8  4593
 2.7  6.8  4594
98.0  6.8  4904

Checking for the busy thread:
  function problematicThread() { ps -Lo pcpu,lwp -p $(pidof ovirt-engine) | sort -h | tail -1 | cut -d " " -f2 ; }

Sending the output to a text file:
  jcmd $(pidof ovirt-engine) Thread.print | egrep -A 10 $(printf "%x" $(problematicThread)) >> /tmp/grepOnThread.txt
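To keep collecting samples of the hot thread over time (e.g., overnight, as in the reproduction steps), roughly the following loop could be used; the 60-second interval and output path are illustrative, and problematicThread is the function defined above:

  # Sketch: append a timestamped stack sample of the busiest thread every minute.
  while true; do
      date >> /tmp/grepOnThread.txt
      jcmd "$(pidof ovirt-engine)" Thread.print \
          | egrep -A 10 "$(printf '%x' "$(problematicThread)")" >> /tmp/grepOnThread.txt
      sleep 60
  done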
 
From the text file you can see the repeating operation:

"SSL Stomp Reactor" #149 daemon prio=5 os_prio=0 tid=0x00007f4b2c202360 nid=0x1328 runnable [0x00007f4ae5da7000]
   java.lang.Thread.State: RUNNABLE
        at sun.nio.ch.NativeThread.current(Native Method)
        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:326)
        - locked <0x00000006cd1e3350> (a java.lang.Object)
        - locked <0x00000006cd1e3330> (a java.lang.Object)
        at org.ovirt.vdsm.jsonrpc.client.reactors.SSLEngineNioHelper.read(SSLEngineNioHelper.java:50)
        at org.ovirt.vdsm.jsonrpc.client.reactors.SSLClient.read(SSLClient.java:90)
        at org.ovirt.vdsm.jsonrpc.client.reactors.stomp.StompCommonClient.processIncoming(StompCommonClient.java:109)
        at org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient.process(ReactorClient.java:196)
        at org.ovirt.vdsm.jsonrpc.client.reactors.SSLClient.process(SSLClient.java:121)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328)
        at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:277)
        at org.ovirt.vdsm.jsonrpc.client.reactors.stomp.SSLStompClient.lambda$waitForConnect$0(SSLStompClient.java:111)
        at org.ovirt.vdsm.jsonrpc.client.reactors.stomp.SSLStompClient$$Lambda$818/2003211657.call(Unknown Source)
        at org.ovirt.vdsm.jsonrpc.client.utils.retry.Retryable.call(Retryable.java:27)
        at org.ovirt.vdsm.jsonrpc.client.utils.retry.AwaitRetry.retry(AwaitRetry.java:15)
        at org.ovirt.vdsm.jsonrpc.client.reactors.stomp.SSLStompClient.waitForConnect(SSLStompClient.java:110)
        at org.ovirt.vdsm.jsonrpc.client.reactors.stomp.SSLStompClient.sendMessage(SSLStompClient.java:80)
        at org.ovirt.vdsm.jsonrpc.client.JsonRpcClient.call(JsonRpcClient.java:93)
        at org.ovirt.engine.core.vdsbroker.jsonrpc.FutureMap.<init>(FutureMap.java:70)
        at org.ovirt.engine.core.vdsbroker.jsonrpc.JsonRpcVdsServer.getCapabilities(JsonRpcVdsServer.java:314)
--
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328)

Comment 10 eberman 2017-11-30 14:48:07 UTC
[root@vega09 tmp]# rpm -qa |grep vdsm
vdsm-jsonrpc-java-1.4.9-1.20171120122807.git1a5ecf6.el7.centos.noarch


adding logs

Comment 11 eberman 2017-11-30 15:29:38 UTC
(In reply to Martin Perina from comment #7)
> (In reply to Ilan Zuckerman from comment #5)
> > (In reply to Martin Perina from comment #4)
> > Hi Martin,
> > 
> > > I was unable to open provided dump file? What's the format?
> > $file engine.dump
> > engine.dump: data
> > Apparently this is binary data file. I attached this because we thought this
> > might be useful for you guys.
> 
> Which tool have you created that dump with?

Regenerated with:
jcmd $(pidof ovirt-engine) GC.heap_dump /tmp/eberman.ovirt-engine.hdump
Adding it now.
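For reference, a dump produced by jcmd GC.heap_dump is in HPROF format, so it can be inspected offline, for example with the jhat tool shipped with JDK 8 (the port and heap size below are arbitrary choices):

  # Sketch: serve the heap dump for browsing at http://localhost:7000
  jhat -J-Xmx8g -port 7000 /tmp/eberman.ovirt-engine.hdump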
> 
> > 
> > > Why not providing normal thread dump of JVM processes?
> > Please suggest a proper way to do this next time we encounter this issue.
> 
> Why not use 'kill -3 PID'? Using that actual thread dump will be stored in
> console.log

Originally tried that, but the process didn't even respond. Adding new logs now; it is included in the new logs.
> 
> > 
> > > Why attached server.log is empty?
> > For some reason it was not written. Neither the day before, nor two days
> > before.
> 
> OK, that mean no critical errors appeared
> 
> > 
> > > Looking at the engine.log briefly, there's something broken in your
> > > environment. You are claiming that you have 6 hosts, but I can see much more
> > > (218 if I grepped the log correctly) NonResponsive hosts in your setup. So
> > > how it is?
> > Those are probably nested hosts (which are actually vms). Physical hosts are
> > only 6.
> 
> I think the important information here is the number of hosts (either
> physical or nested) which are managed by engine. How many hosts are added
> into engine? How many of them are non-responsive?

It was necessary to upgrade the environment; afterwards we needed to adjust our MAC address pool and recreate our nested hosts, as the IPs had changed. After doing this, our current state is 238 hosts, of which 6 are physical and 232 are nested; 212 of the hosts are up.

Comment 12 eberman 2017-11-30 16:20:12 UTC
The logs are too large to attach; here is a link:


https://drive.google.com/a/redhat.com/file/d/1YiMM_r5LO4GVdZYsOTpC4dDcqgWauhKv/view?usp=sharing

Comment 13 eberman 2017-11-30 16:47:01 UTC
(In reply to Yaniv Kaul from comment #6)
> Re-setting NEEDINFO on QE - we need more data to analyze.
> If you don't have it, please close the bug.

added

Comment 14 Ravi Nori 2017-12-13 18:26:03 UTC
Piotr, can you take a look? I saw this in the logs:

2017-11-30 18:01:50,836+02 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetCapabilitiesVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-26) [] Command 'GetCapabilitiesVDSCommand(HostName = nested_host_from_vm_vmvm_updated__0_cluster_real_10.35.89.24, VdsIdAndVdsVDSCommandParametersBase:{hostId='a38bbad8-e2cf-4acf-a2b6-22c42fe10c95', vds='Host[nested_host_from_vm_vmvm_updated__0_cluster_real_10.35.89.24,a38bbad8-e2cf-4acf-a2b6-22c42fe10c95]'})' execution failed: VDSGenericException: VDSErrorException: Failed to GetCapabilitiesVDS, error = Internal JSON-RPC error: {'reason': '[Errno 12] Cannot allocate memory'}, code = -32603
2017-11-30 18:01:50,836+02 ERROR [org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring] (EE-ManagedThreadFactory-engineScheduled-Thread-26) [] Failure to refresh host 'nested_host_from_vm_vmvm_updated__0_cluster_real_10.35.89.24' runtime info: VDSGenericException: VDSErrorException: Failed to GetCapabilitiesVDS, error = Internal JSON-RPC error: {'reason': '[Errno 12] Cannot allocate memory'}, code = -32603
2017-11-30 18:01:50,836+02 WARN  [org.ovirt.engine.core.vdsbroker.VdsManager] (EE-ManagedThreadFactory-engineScheduled-Thread-26) [] Failed to refresh VDS , vds = 'nested_host_from_vm_vmvm_updated__0_cluster_real_10.35.89.24' : 'a38bbad8-e2cf-4acf-a2b6-22c42fe10c95', error = 'VDSGenericException: VDSErrorException: Failed to GetCapabilitiesVDS, error = Internal JSON-RPC error: {'reason': '[Errno 12] Cannot allocate memory'}, code = -32603', continuing.
2017-11-30 18:01:50,836+02 ERROR [org.ovirt.engine.core.vdsbroker.VdsManager] (EE-ManagedThreadFactory-engineScheduled-Thread-26) [] Exception: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to GetCapabilitiesVDS, error = Internal JSON-RPC error: {'reason': '[Errno 12] Cannot allocate memory'}, code = -32603
	at org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerCommand.createDefaultConcreteException(VdsBrokerCommand.java:81) [vdsbroker.jar:]
	at org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase.createException(BrokerCommandBase.java:223) [vdsbroker.jar:]
	at org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase.proceedProxyReturnValue(BrokerCommandBase.java:193) [vdsbroker.jar:]
	at org.ovirt.engine.core.vdsbroker.vdsbroker.GetCapabilitiesVDSCommand.executeVdsBrokerCommand(GetCapabilitiesVDSCommand.java:23) [vdsbroker.jar:]
	at org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerCommand.executeVDSCommand(VdsBrokerCommand.java:112) [vdsbroker.jar:]
	at org.ovirt.engine.core.vdsbroker.VDSCommandBase.executeCommand(VDSCommandBase.java:73) [vdsbroker.jar:]
	at org.ovirt.engine.core.dal.VdcCommandBase.execute(VdcCommandBase.java:33) [dal.jar:]
	at org.ovirt.engine.core.vdsbroker.vdsbroker.DefaultVdsCommandExecutor.execute(DefaultVdsCommandExecutor.java:14) [vdsbroker.jar:]
	at org.ovirt.engine.core.vdsbroker.ResourceManager.runVdsCommand(ResourceManager.java:386) [vdsbroker.jar:]
	at org.ovirt.engine.core.vdsbroker.ResourceManager$Proxy$_$$_WeldSubclass.runVdsCommand$$super(Unknown Source) [vdsbroker.jar:]
	at sun.reflect.GeneratedMethodAccessor136.invoke(Unknown Source) [:1.8.0_151]
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) [rt.jar:1.8.0_151]
	at java.lang.reflect.Method.invoke(Method.java:498) [rt.jar:1.8.0_151]
	at org.jboss.weld.interceptor.proxy.TerminalAroundInvokeInvocationContext.proceedInternal(TerminalAroundInvokeInvocationContext.java:49) [weld-core-impl-2.4.3.Final.jar:2.4.3.Final]
	at org.jboss.weld.interceptor.proxy.AroundInvokeInvocationContext.proceed(AroundInvokeInvocationContext.java:77) [weld-core-impl-2.4.3.Final.jar:2.4.3.Final]
	at org.ovirt.engine.core.common.di.interceptor.LoggingInterceptor.apply(LoggingInterceptor.java:12) [common.jar:]

Comment 15 Piotr Kliczewski 2017-12-13 18:38:54 UTC
Ravi, the stack trace you pasted in comment #14 shows that there was an issue on the vdsm side and it was propagated to the engine:

{'reason': '[Errno 12] Cannot allocate memory'}

I do not see vdsm logs in the provided link in comment #12. Please provide failing vdsm logs.
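(As a pointer for whoever collects them: on a host, vdsm logs normally live under /var/log/vdsm/, so the failing request could be located with something like the following; the path and pattern are assumptions.)

  # Run on the affected host:
  grep -n -B 2 -A 5 "Cannot allocate memory" /var/log/vdsm/vdsm.log*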

Comment 16 eberman 2017-12-19 14:04:14 UTC
(In reply to Piotr Kliczewski from comment #15)
> Ravi, The stack trace you pasted in comment #14 that there was an issue on
> vdsm side and it was propagated to the engine:
> 
> {'reason': '[Errno 12] Cannot allocate memory'}
> 
> I do not see vdsm logs in the provided link in comment #12. Please provide
> failing vdsm logs.

I added all the logs when opening the bug. Ravi, if you saw something additional, please add the logs Piotr is asking for.

Comment 17 Ravi Nori 2017-12-21 14:42:39 UTC
The error I posted above is the only issue I found

Comment 18 Piotr Kliczewski 2017-12-21 15:45:50 UTC
Ravi, please provide vdsm log so we can analyze what happened.

Comment 19 Ravi Nori 2017-12-21 20:36:16 UTC
The error is from engine log, I don't have the vdsm log

Comment 20 Martin Perina 2017-12-22 07:46:56 UTC
(In reply to eberman from comment #16)
> (In reply to Piotr Kliczewski from comment #15)
> > Ravi, The stack trace you pasted in comment #14 that there was an issue on
> > vdsm side and it was propagated to the engine:
> > 
> > {'reason': '[Errno 12] Cannot allocate memory'}
> > 
> > I do not see vdsm logs in the provided link in comment #12. Please provide
> > failing vdsm logs.
> 
> i added all the logs while open the Bug , Ravi if you saw something
> additional please add the logs Piotr is asking

You have added only engine logs and dumps [1], but not VDSM logs. So Eyal, do you still have VDSM logs from the affected host 'nested_host_from_vm_vmvm_updated__0_cluster_real_10.35.89.24'? If not, could you please try to reproduce and attach VDSM logs from such host(s)? Or could you even reproduce and allow us access to the system in question?

Thanks


[1] https://drive.google.com/a/redhat.com/file/d/1YiMM_r5LO4GVdZYsOTpC4dDcqgWauhKv/view?usp=sharing

Comment 21 Daniel Gur 2017-12-26 12:19:31 UTC
Adding more input from Martin, received by mail:
"Ravi and Piotr looked at the logs, but haven't found any clue from the engine logs; the only strange thing was the out-of-memory issue from one of the hosts (that's why Piotr asked for VDSM logs), but it's probably unrelated to the engine CPU usage.
Most probably we will need to either reproduce it or get access to the system where the issue is raised."

Daniel:
We cannot reproduce it in QE currently; we were requested to release all the servers (Heras) to the development team. So please try to reproduce it there.

The VDSM logs may still be on those Heras, but were probably already overwritten, as one month passed between when this issue was reported and when the VDSM logs were requested. In any case, as Martin says, this "needinfo" is not related to this issue but perhaps to a different bug ("out of memory issue on one of the hosts") that should be opened. Should we open an additional bug for it?

We are building a new lab in RDU and will try to reproduce it there, but this lab will be fully operational only in a number of weeks.
Regarding getting access to the engine system, please contact Guy Chen; he will provide you the access.

Comment 22 Martin Perina 2018-02-07 09:47:24 UTC
Ravi was not able to reproduce this behaviour when testing his non-blocking threads POC, so Daniel, could you please try to reproduce it during scale testing?

Comment 24 Martin Perina 2018-04-05 13:08:30 UTC
Closing for now, feel free to reopen if you are able to reproduce again

Comment 25 Daniel Gur 2018-06-24 15:00:02 UTC
We don't see it any more