Bug 1099856 - [SCALE] VDSM is consuming a lot of CPU time even with no active VMs on 100 NFS storage domains
Summary: [SCALE] VDSM is consuming a lot of CPU time even with no active VMs on 100 NF...
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 3.3.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 3.5.0
Assignee: Nir Soffer
QA Contact: Yuri Obshansky
URL:
Whiteboard: storage
Depends On:
Blocks: rhev3.5beta 1156165
Reported: 2014-05-21 11:07 UTC by Aharon Canan
Modified: 2016-02-10 18:18 UTC
CC List: 11 users

Fixed In Version: vt1.3, 4.16.0-1.el6_5
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2015-02-16 13:37:34 UTC
oVirt Team: Storage
Target Upstream Version:
scohen: needinfo+
scohen: Triaged+


Attachments
script to create 100 SDs (deleted)
2014-05-21 11:07 UTC, Aharon Canan
logs (deleted)
2014-05-21 11:13 UTC, Aharon Canan
create_sd.py (deleted)
2014-05-21 11:19 UTC, Aharon Canan
profile results sorted by time (deleted)
2014-05-21 15:06 UTC, Nir Soffer
profile results sorted by cumultive time (deleted)
2014-05-21 15:07 UTC, Nir Soffer


Links
System ID Priority Status Summary Last Updated
oVirt gerrit 28089 master MERGED nfsSD: Remove unneeded and expensive mount check Never

Description Aharon Canan 2014-05-21 11:07:52 UTC
Created attachment 897917 [details]
script to create 100 SDs

Description of problem:
High CPU usage attributed to 'vdsm' after setting up 100 NFS storage domains.

Version-Release number of selected component (if applicable):
is36.4 
vdsm-4.13.2-0.17.el6ev


How reproducible:
100%

Steps to Reproduce:
1. set NFS DC with 3 hosts (not sure if we really need 3 hosts)
2. create 100 NFS storage domains
3. run "top" on one of the hosts and check vdsm

Actual results:
==========
 7589 vdsm       0 -20 3537m  55m 6532 S 249.4  0.6   2540:04 vdsm


Expected results:


Additional info:

Comment 1 Aharon Canan 2014-05-21 11:13:07 UTC
Created attachment 897918 [details]
logs

Comment 2 Aharon Canan 2014-05-21 11:19:01 UTC
Created attachment 897920 [details]
create_sd.py

Comment 3 Nir Soffer 2014-05-21 11:29:30 UTC
How many cores does the machine showing 249% CPU usage have?

Comment 4 Aharon Canan 2014-05-21 11:33:08 UTC
4 cores 1 socket
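
For readers comparing the numbers: assuming top was running in its default mode, where 100% means one fully busy core, the reported 249.4% can be converted to a share of the whole 4-core machine. A minimal sketch (the helper name is made up for illustration):

```python
def normalized_cpu_percent(top_cpu_percent, cores):
    """Convert top's per-process %CPU (assumed: 100% == one full core)
    into a percentage of the machine's total CPU capacity."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return top_cpu_percent / cores


# Values taken from this bug: 249.4% reported by top, 4 cores.
share = normalized_cpu_percent(249.4, 4)
print("vdsm uses about %.1f%% of total CPU capacity" % share)
```

Under that assumption, vdsm alone keeps roughly two and a half of the four cores busy.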

Comment 5 Nir Soffer 2014-05-21 11:41:35 UTC
Some more info from the machine:

cpu: Intel(R) Xeon(R) CPU           E5504  @ 2.00GHz
release: Red Hat Enterprise Linux Server release 6.5 Beta (Santiago)
last yum update: 2014-05-04 (missing a lot of updates)

Comment 6 Nir Soffer 2014-05-21 14:23:32 UTC
Please repeat this test with a sane number of storage domains. We have customers using 30-40 storage domains, and it would be useful to see how the system behaves in normal conditions to evaluate the severity of this issue.

Comment 7 Nir Soffer 2014-05-21 15:04:13 UTC
I partially reproduced this using a master (2014-05-21) setup with 30 NFS storage domains. I don't get the extreme CPU usage reported by Aharon, only slightly high CPU usage of about 20% out of 800%.

Attached profiles showing where time is spent on this setup on the spm.

Comment 8 Nir Soffer 2014-05-21 15:06:28 UTC
Created attachment 898046 [details]
profile results sorted by time

Comment 9 Nir Soffer 2014-05-21 15:07:07 UTC
Created attachment 898047 [details]
profile results sorted by cumultive time
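
The two attachments above are the same profile sorted two ways. How these particular profiles were collected is not stated in this bug; as a rough illustration only, the sketch below uses the standard library's cProfile and pstats on a made-up workload (`busy_work` is hypothetical) to produce one report sorted by internal time and one sorted by cumulative time.

```python
import cProfile
import io
import pstats


def busy_work(n):
    """Hypothetical workload standing in for the code being profiled."""
    total = 0
    for i in range(n):
        total += sum(range(i % 100))
    return total


def profile_report(func, sort_key, limit=10):
    """Run func under cProfile and return a text report sorted by sort_key."""
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        func()
    finally:
        profiler.disable()
    out = io.StringIO()
    stats = pstats.Stats(profiler, stream=out)
    stats.sort_stats(sort_key).print_stats(limit)
    return out.getvalue()


# "time" ranks functions by time spent in the function itself;
# "cumulative" ranks by time including everything the function calls.
by_time = profile_report(lambda: busy_work(2000), "time")
by_cumulative = profile_report(lambda: busy_work(2000), "cumulative")
print(by_cumulative)
```

Sorting by internal time points at the hot function itself, while sorting by cumulative time points at the caller that is ultimately responsible for the cost.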

Comment 10 Nir Soffer 2014-05-21 15:15:40 UTC
The high CPU usage is caused by an inefficient implementation of the mount-related code, which has O(N^2) complexity.

NfsStorageDomain.selftest is responsible for 267 seconds out of a total of 458 seconds of CPU time (58%).
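
To illustrate the O(N^2) pattern in general terms: if each of N domains checks its mount by scanning a table of N mounts, the total work grows quadratically. The sketch below is not vdsm's actual code; the function names and mount paths are invented for illustration. Note that the merged patch (gerrit 28089) is titled as removing the check, so the indexed variant here is only a point of comparison, not the fix that was applied.

```python
def selftest_all_scan(domain_mountpoints, mount_table):
    """Each domain scans the whole mount table (as if re-reading it every
    time): N domains x N mounts comparisons."""
    comparisons = 0
    results = {}
    for mountpoint in domain_mountpoints:
        found = False
        for entry in mount_table:
            comparisons += 1
            if entry == mountpoint:
                found = True
        results[mountpoint] = found
    return results, comparisons


def selftest_all_indexed(domain_mountpoints, mount_table):
    """Build a set once, then do one constant-time lookup per domain."""
    mounted = set(mount_table)
    lookups = 0
    results = {}
    for mountpoint in domain_mountpoints:
        lookups += 1
        results[mountpoint] = mountpoint in mounted
    return results, lookups


# Hypothetical mount points for 100 storage domains.
mounts = ["/mnt/sd%03d" % i for i in range(100)]
scan_results, scan_cost = selftest_all_scan(mounts, mounts)
indexed_results, indexed_cost = selftest_all_indexed(mounts, mounts)
print(scan_cost, indexed_cost)
```

With 100 domains the scanning variant performs 100 times more comparisons than the indexed one performs lookups, which is why the cost is barely visible at 30 domains and dominant at 100.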

Comment 11 Nir Soffer 2014-05-21 15:20:16 UTC
Setting severity to medium and scheduling for 3.5.0, since with a normal setup (30 storage domains) this is not a major issue. This is also not a regression; the code responsible for this is from 2012.

Comment 12 Nir Soffer 2014-05-21 15:23:07 UTC
Marina, can you tell us what a common number of storage domains in the field is? Do we support systems with more than 30-40 NFS storage domains?

Comment 13 Allon Mureinik 2014-05-21 17:57:02 UTC
Without requirement guidelines, this kind of bug is pointless.

Sean - We need a concrete definition of the size of environment we need to support, and the hardware we require customers to have for it.

Aharon/Gil - we need input on what QA are able to test.

[in any event, 100 SDs sounds like a usecase we'll never see in the field, and if we do, the first action item would be to consolidate them.]

Comment 16 Aharon Canan 2014-05-22 09:18:32 UTC
Nir, 

You asked me to set 100 SDs in comment #7 from https://bugzilla.redhat.com/show_bug.cgi?id=1095907
 
Anyway, if it is supported, we need to fix it.
If it is not, we need to block the option to add SDs above the supported number.

I think it is up to PM to decide, and then we should continue accordingly.

Sean?

Comment 17 Nir Soffer 2014-05-22 09:48:07 UTC
(In reply to Aharon Canan from comment #16)
> You asked me to set 100 SDs in comment #7 from
> https://bugzilla.redhat.com/show_bug.cgi?id=1095907

In https://bugzilla.redhat.com/show_bug.cgi?id=1095907#c6 I asked for "30 ISCIS storage domains"
In https://bugzilla.redhat.com/show_bug.cgi?id=1095907#c7 I suggested creating "lot of (100?) mounts"

Sorry if that was not clear.

Comment 20 Nir Soffer 2014-05-25 15:48:50 UTC
Aharon, can you test the attached patch with your setup?

Comment 21 Aharon Canan 2014-05-26 08:06:39 UTC
I do not have resources for integration testing for now.

Comment 23 Yuri Obshansky 2014-12-23 07:53:14 UTC
Bug verified on
RHEV-M 3.5.0-0.22.el6ev
RHEL - 6Server - 6.6.0.2.el6
libvirt-0.10.2-46.el6_6.1
vdsm-4.16.7.6-1.el6ev 
Created 100 NFS Storage Domains and checked top on host:
top - 18:52:14 up 12 days,  9:09,  2 users,  load average: 1.29, 1.21, 1.15
Tasks: 1343 total,   1 running, 1341 sleeping,   0 stopped,   1 zombie
Cpu(s):  0.3%us,  1.0%sy,  0.0%ni, 97.9%id,  0.7%wa,  0.0%hi, 0.0%si, 0.0%st
Mem:  396875340k total,  6112184k used, 390763156k free,   159604k buffers
Swap: 16383996k total,        0k used, 16383996k free,  1931216k cached
PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+ COMMAND
45620 vdsm       0 -20 33.6g 121m 9700 S 62.4  0.0 171:36.09 vdsm
Bug didn't reproduce.

