Bug 1515933 - The engine fails to start VM with 1Gb hugepages and NUMA pinning
Summary: The engine fails to start VM with 1Gb hugepages and NUMA pinning
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: BLL.Virt
Version: 4.2.0
Hardware: x86_64
OS: Linux
Priority: high
Severity: medium
Target Milestone: ovirt-4.2.2
Target Release: 4.2.2.2
Assignee: Andrej Krejcir
QA Contact: Artyom
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2017-11-21 16:18 UTC by Artyom
Modified: 2018-03-29 11:02 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: When the VM uses hugepages, the size of every NUMA node has to be divisible by the hugepage size. But when creating NUMA nodes in the UI, the VM's memory is divided equally between all nodes.
Consequence: Created NUMA nodes can have a size that is not divisible by the hugepage size, so the VM fails to start.
Fix: The VM's memory is divided as equally as possible between nodes, while keeping the node size divisible by the hugepage size.
Result: A VM with hugepages and NUMA nodes can be started.
Clone Of:
Environment:
Last Closed: 2018-03-29 11:02:34 UTC
oVirt Team: SLA
rule-engine: ovirt-4.2+
mtessun: planning_ack+
michal.skrivanek: devel_ack+
mavital: testing_ack+


Attachments
engine and vdsm logs (deleted)
2017-11-21 16:18 UTC, Artyom


Links
System ID Priority Status Summary Last Updated
oVirt gerrit 86963 master MERGED core: HugePageUtils returns huge page size as integer. 2018-02-09 10:08:51 UTC
oVirt gerrit 86964 master MERGED core: NumaValidator checks if node size is divisible by hugepage size 2018-02-09 10:08:54 UTC
oVirt gerrit 86965 master MERGED webadmin: Make VM numa node size divisible by hugepage size 2018-02-09 10:08:59 UTC
oVirt gerrit 87379 ovirt-engine-4.2 MERGED core: HugePageUtils returns huge page size as integer. 2018-03-01 10:22:08 UTC
oVirt gerrit 87380 ovirt-engine-4.2 MERGED core: NumaValidator checks if node size is divisible by hugepage size 2018-03-01 10:22:13 UTC
oVirt gerrit 87381 ovirt-engine-4.2 MERGED webadmin: Make VM numa node size divisible by hugepage size 2018-03-01 10:23:04 UTC
oVirt gerrit 87781 master MERGED core: Fix MathUtils.greatestCommonDivisor() 2018-02-21 15:47:12 UTC

Description Artyom 2017-11-21 16:18:09 UTC
Created attachment 1356848
engine and vdsm logs

Description of problem:
The engine fails to start VM with 1Gb hugepages and NUMA pinning

Version-Release number of selected component (if applicable):
ovirt-engine-4.2.0-0.0.master.20171116212005.git61ffb5f.el7.centos.noarch
vdsm-4.20.7-34.gitab15536.el7.centos.x86_64
qemu-kvm-common-ev-2.9.0-16.el7_4.8.1.x86_64
qemu-kvm-ev-2.9.0-16.el7_4.8.1.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Configure VM:
Memory: 3Gb
CPU's: 2
Hugepages Custom Property: 1048576
Pin it to host with at least two NUMA nodes
A number of NUMA nodes: 2
Pin each VM NUMA node to separate physical NUMA node
2. Start the VM

Actual results:
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 1069, in createWithFlags
    if ret == -1: raise libvirtError ('virDomainCreateWithFlags() failed', dom=self)
libvirtError: internal error: process exited while connecting to monitor: qemu_madvise: Invalid argument
madvise doesn't support MADV_DONTDUMP, but dump_guest_core=off specified
2017-11-21T16:00:39.054676Z qemu-kvm: -object memory-backend-file,id=ram-node0,prealloc=yes,mem-path=/dev/hugepages1G/libvirt/qemu/1-golden_env_mixed_vir,size=1610612736,host-nodes=0,policy=interleave: cannot bind memory to host NUMA nodes: Invalid argument
2017-11-21 18:00:39,569+0200 INFO  (vm/fd80374a) [virt.vm] (vmId='fd80374a-cb00-4f78-9121-6e87cee581c0') Changed state to Down: internal error: process exited while connecting to monitor: qemu_madvise: Invalid argument
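
The size= value in the qemu-kvm command line above can be checked against the hugepage size with a few lines of arithmetic. This is only an illustrative sketch: it assumes the hugepages custom property is expressed in KiB (so 1048576 means 1 GiB pages), and the variable names are made up for the example.

```python
# Assumption: the "hugepages" custom property is given in KiB.
HUGEPAGE_SIZE_KIB = 1048576
# size= value of the failing memory-backend-file object, in bytes.
NODE_SIZE_BYTES = 1610612736

hugepage_size_bytes = HUGEPAGE_SIZE_KIB * 1024
whole_pages, remainder_bytes = divmod(NODE_SIZE_BYTES, hugepage_size_bytes)

# A non-zero remainder means the node cannot be backed by whole hugepages.
print(f"whole pages: {whole_pages}, remainder: {remainder_bytes} bytes")
```

Under that assumption, each 1.5 GiB node needs one and a half 1 GiB pages, which is consistent with the bind failure in the log.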

Expected results:
I think we have two options: either block the VM start on the scheduler level, or somehow round the hugepages usage of the NUMA nodes.

Additional info:

Comment 1 Michal Skrivanek 2017-11-22 05:31:16 UTC
thoughts?

Comment 3 Martin Tessun 2017-11-22 07:08:29 UTC
(In reply to Yaniv Kaul from comment #2)
> https://bugzilla.redhat.com/show_bug.cgi?id=1499492#c7 ?

Exactly that is the issue.
You need to take care that each NUMA node memory fits to the 1GB memory boundary of the Hugepages.

So e.g the following fails:
memory = 21GB
NUMA Nodes = 4
NUMA Memory Node 1 = 5.25GB
NUMA Memory Node 2 = 5.25GB
NUMA Memory Node 3 = 5.25GB
NUMA Memory Node 4 = 5.25GB

It would succeed in case of an asymmetric reservation, e.g.:
memory = 21GB
NUMA Nodes = 4
NUMA Memory Node 1 = 6.00GB
NUMA Memory Node 2 = 5.00GB
NUMA Memory Node 3 = 5.00GB
NUMA Memory Node 4 = 5.00GB

As you don't know in advance how many NUMA nodes are created, you cannot check the total memory, and as such I believe we need to do the asymmetric reservation and maybe print a warning (NUMA node memory imbalanced because max memory / NUMA nodes does not fit the hugepages boundary).

Thoughts?
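
The uneven split proposed in this comment could be sketched roughly as follows. This is not the code that was merged for this bug; the function name and the choice of MiB as the unit are assumptions made for illustration. It hands out whole hugepages to the nodes and gives the leftover pages to the first nodes.

```python
def split_memory_by_hugepages(total_mb, node_count, hugepage_mb):
    """Split total_mb across node_count nodes in whole hugepages.

    Hypothetical helper: sizes are in MiB, and extra pages go to the
    lowest-indexed nodes.
    """
    if node_count <= 0 or hugepage_mb <= 0:
        raise ValueError("node_count and hugepage_mb must be positive")
    total_pages, leftover_mb = divmod(total_mb, hugepage_mb)
    if leftover_mb:
        raise ValueError("total memory is not a multiple of the hugepage size")
    base_pages, extra_pages = divmod(total_pages, node_count)
    return [
        (base_pages + (1 if index < extra_pages else 0)) * hugepage_mb
        for index in range(node_count)
    ]


# 21 GiB over 4 nodes with 1 GiB pages.
print(split_memory_by_hugepages(21 * 1024, 4, 1024))
```

For the 21 GB example this yields one 6 GiB node and three 5 GiB nodes. Note that a node can still end up with zero pages when there are fewer pages than nodes, so a separate validation is still needed.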

Comment 4 Yaniv Kaul 2017-11-22 07:32:34 UTC
(In reply to Martin Tessun from comment #3)
> (In reply to Yaniv Kaul from comment #2)
> > https://bugzilla.redhat.com/show_bug.cgi?id=1499492#c7 ?
> 
> Exactly that is the issue.
> You need to take care that each NUMA node memory fits to the 1GB memory
> boundary of the Hugepages.
> 
> So e.g the following fails:
> memory = 21GB
> NUMA Nodes = 4
> NUMA Memory Node 1 = 5.25GB
> NUMA Memory Node 2 = 5.25GB
> NUMA Memory Node 3 = 5.25GB
> NUMA Memory Node 4 = 5.25GB
> 
> It would succeed in case of a asynchronous reservation, e.g.:
> memory = 21GB
> NUMA Nodes = 4
> NUMA Memory Node 1 = 6.00GB
> NUMA Memory Node 2 = 5.00GB
> NUMA Memory Node 3 = 5.00GB
> NUMA Memory Node 4 = 5.00GB
> 
> As you don't know in advance how many NUMA Nodes are created, you cannot
> check the total memory, and as such I believe we need to do the asynchronous
> reservation and maybe print a warning (NUMA node imbalanced memory due to
> max memory/NUMA nodes does not fit hugepages boundary).
> 
> Thoughts?

I'd limit our solution to uniform memory distribution across NUMA nodes, and then make sure the total memory is properly divisible between the number of set NUMA nodes - just to ensure we properly fail before running?
When you pin, you should know the theoretical values (of course, some memory when you try to run may already be taken!)

Comment 5 Tomas Jelinek 2017-11-22 07:45:32 UTC
(In reply to Yaniv Kaul from comment #4)
> (In reply to Martin Tessun from comment #3)
> > (In reply to Yaniv Kaul from comment #2)
> > > https://bugzilla.redhat.com/show_bug.cgi?id=1499492#c7 ?
> > 
> > Exactly that is the issue.
> > You need to take care that each NUMA node memory fits to the 1GB memory
> > boundary of the Hugepages.
> > 
> > So e.g the following fails:
> > memory = 21GB
> > NUMA Nodes = 4
> > NUMA Memory Node 1 = 5.25GB
> > NUMA Memory Node 2 = 5.25GB
> > NUMA Memory Node 3 = 5.25GB
> > NUMA Memory Node 4 = 5.25GB
> > 
> > It would succeed in case of a asynchronous reservation, e.g.:
> > memory = 21GB
> > NUMA Nodes = 4
> > NUMA Memory Node 1 = 6.00GB
> > NUMA Memory Node 2 = 5.00GB
> > NUMA Memory Node 3 = 5.00GB
> > NUMA Memory Node 4 = 5.00GB
> > 
> > As you don't know in advance how many NUMA Nodes are created, you cannot
> > check the total memory, and as such I believe we need to do the asynchronous
> > reservation and maybe print a warning (NUMA node imbalanced memory due to
> > max memory/NUMA nodes does not fit hugepages boundary).
> > 
> > Thoughts?
> 
> I'd limit our solution to uniform memory distribution across NUMA nodes, and
> then make sure the total memory is properly dividable between the number of
> set NUMA nodes - just to ensure we properly fail before running?

Why not fail on save? We know how many NUMA nodes we will create on run and how much memory we need to divide between them. We could provide a good validation message.

> When you pin, you should know the theoretical values (of course, some memory
> when you try to run may already be taken!)

Comment 6 Michal Skrivanek 2017-11-22 09:01:32 UTC
It should be easy enough to split the memory into the right chunks on the engine side; in the original case, into 2 GB and 1 GB. We do define the amount of memory in each node, I believe - Martin?

Comment 7 Martin Sivák 2017-11-22 09:27:42 UTC
We do allow custom size NUMA nodes when configured through REST API (iirc). But the UI based flow distributes the memory uniformly.

Btw, it is hard to show some meaningful validation when hugepage sizes are set using the generic custom parameters approach.

Comment 8 Tomas Jelinek 2017-11-22 10:03:11 UTC
(In reply to Martin Sivák from comment #7)
> We do allow custom size NUMA nodes when configured through REST API (iirc).
> But the UI based flow distributes the memory uniformly.

So the UI-based flow could have some logic to distribute the memory non-uniformly :)

But a validation will be needed anyway, because there is a chance that the memory cannot be split correctly at all (e.g. 2 NUMA nodes, 1G pages and 1G memory, so one of the nodes would end up with no memory).

> 
> Btw, it is hard to show some meaningful validation when hugepage sizes are
> set using the generic custom parameters approach.

I don't see the issue here. It is just a property and we use it.
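
The validation argued for in this comment could look roughly like the sketch below. It is a hypothetical helper, not the engine's validator: the name, the MiB units, and the returned messages are assumptions. It only answers whether any valid split exists for a given total, node count, and hugepage size.

```python
from typing import Optional


def hugepage_split_problem(total_mb: int, node_count: int, hugepage_mb: int) -> Optional[str]:
    """Return a reason the memory cannot be split, or None if a split exists."""
    if node_count <= 0 or hugepage_mb <= 0:
        return "node count and hugepage size must be positive"
    if total_mb % hugepage_mb != 0:
        return "total memory is not a multiple of the hugepage size"
    if total_mb // hugepage_mb < node_count:
        return "fewer hugepages than NUMA nodes, so some node would get no memory"
    return None


# The case from the comment: 2 NUMA nodes, 1G pages and 1G memory.
print(hugepage_split_problem(1024, 2, 1024))
```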

Comment 9 Artyom 2018-03-04 16:19:01 UTC
Checked on rhvm-4.2.2.1-0.1.el7.noarch

VM configuration:
<vm>
<name>golden_env_mixed_virtio_0</name>
</bios>
<cpu>
<architecture>x86_64</architecture>
<topology>
<cores>2</cores>
<sockets>1</sockets>
<threads>1</threads>
</topology>
</cpu>
<custom_properties>
<custom_property>
<name>hugepages</name>
<value>1048576</value>
</custom_property>
</custom_properties>
<memory>3221225472</memory>
<memory_policy>
<guaranteed>1073741824</guaranteed>
<max>4294967296</max>
</memory_policy>
<placement_policy>
<affinity>pinned</affinity>
<hosts>
<host href="/ovirt-engine/api/hosts/745204e0-f625-4577-b194-124f82a314fa" id="745204e0-f625-4577-b194-124f82a314fa"/>
</hosts>
</placement_policy>
<numa_tune_mode>interleave</numa_tune_mode>
</vm>

With NUMA nodes:
<vm_numa_nodes>
<vm_numa_node href="/ovirt-engine/api/vms/245b22b0-e711-4f48-9a47-26a9b15aa899/numanodes/cf1d0720-aef8-4bb7-b5be-f3d2e3e30b64" id="cf1d0720-aef8-4bb7-b5be-f3d2e3e30b64">
<cpu>
<cores>
<core>…</core>
</cores>
</cpu>
<index>0</index>
<memory>1536</memory>
<numa_node_pins>
<numa_node_pin>
<index>0</index>
</numa_node_pin>
</numa_node_pins>
<vm href="/ovirt-engine/api/vms/245b22b0-e711-4f48-9a47-26a9b15aa899" id="245b22b0-e711-4f48-9a47-26a9b15aa899"/>
</vm_numa_node>
<vm_numa_node href="/ovirt-engine/api/vms/245b22b0-e711-4f48-9a47-26a9b15aa899/numanodes/705ec19c-6c0f-45ee-970b-7f03f5bbc5d0" id="705ec19c-6c0f-45ee-970b-7f03f5bbc5d0">
<cpu>
<cores>
<core>…</core>
</cores>
</cpu>
<index>1</index>
<memory>1536</memory>
<numa_node_pins>
<numa_node_pin>
<index>1</index>
</numa_node_pin>
</numa_node_pins>
<vm href="/ovirt-engine/api/vms/245b22b0-e711-4f48-9a47-26a9b15aa899" id="245b22b0-e711-4f48-9a47-26a9b15aa899"/>
</vm_numa_node>
</vm_numa_nodes>

VM failed to start with the same error:
2018-03-04 16:09:44.304+0000: 5834: info : virObjectUnref:350 : OBJECT_UNREF: obj=0x7f3fc8111eb0
qemu_madvise: Invalid argument
madvise doesn't support MADV_DONTDUMP, but dump_guest_core=off specified
2018-03-04T16:09:44.609717Z qemu-kvm: -object memory-backend-file,id=ram-node0,prealloc=yes,mem-path=/dev/hugepages1G/libvirt/qemu/1-golden_env_mixed_vir,size=1610612736,host-nodes=0,policy=interleave: cannot bind memory to host NUMA nodes: Invalid argument
2018-03-04 16:09:44.680+0000: shutting down, reason=failed

Comment 10 Red Hat Bugzilla Rules Engine 2018-03-04 16:19:08 UTC
Target release should be placed once a package build is known to fix an issue. Since this bug is not modified, the target version has been reset. Please use target milestone to plan a fix for an oVirt release.

Comment 11 Andrej Krejcir 2018-03-05 09:47:17 UTC
The fix was released in 4.2.2.2. It is not yet in 4.2.2.1.

Comment 14 Artyom 2018-03-06 14:07:48 UTC
Verified on rhvm-4.2.2.2-0.1.el7.noarch

1) Define 2 NUMA nodes with 1.5Gb each and start VM
	Status: 400
	Reason: Bad Request
	Detail: [Memory size of each numa node must be a multiple of hugepage size.]

2) Define 2 NUMA nodes with 1Gb each and start VM
VM started successfully
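
The two verification cases can be replayed against a per-node check of the kind the error message describes. This is a sketch under assumptions, not the engine's NumaValidator: the function name is invented, node sizes are taken in MiB as in the vm_numa_node XML earlier in this bug, and the hugepage size is assumed to be in KiB as in the custom property.

```python
MESSAGE = "Memory size of each numa node must be a multiple of hugepage size."


def validate_numa_nodes(node_sizes_mb, hugepage_kb):
    """Return [MESSAGE] if any node size is not a whole number of hugepages."""
    for size_mb in node_sizes_mb:
        # Compare in KiB so that both values use the same unit.
        if (size_mb * 1024) % hugepage_kb != 0:
            return [MESSAGE]
    return []


print(validate_numa_nodes([1536, 1536], 1048576))  # case 1: 1.5Gb nodes
print(validate_numa_nodes([1024, 1024], 1048576))  # case 2: 1Gb nodes
```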

Comment 15 Sandro Bonazzola 2018-03-29 11:02:34 UTC
This bugzilla is included in oVirt 4.2.2 release, published on March 28th 2018.

Since the problem described in this bug report should be resolved in oVirt 4.2.2 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.

