Bug 1512260 - [TestOnly] Java max heap size might be too large
Summary: [TestOnly] Java max heap size might be too large
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: Setup.Engine
Version: 4.2.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ovirt-4.2.0
Target Release: 4.2.0
Assignee: Yedidyah Bar David
QA Contact: Daniel Gur
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-11-12 08:33 UTC by Yedidyah Bar David
Modified: 2018-06-26 08:15 UTC
CC List: 9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-05-06 07:56:46 UTC
oVirt Team: Integration
rule-engine: ovirt-4.2+



Description Yedidyah Bar David 2017-11-12 08:33:06 UTC
Description of problem:

Current behavior is:

engine-setup sets both the min and max heap size to max(25% of RAM, 1 GiB); see bug 1190466. Before that change, both were always 1 GiB.

ovirt-engine.py, the script that starts the engine, caps the configured values at 90% of RAM. This is needed in case the machine's RAM was decreased after engine-setup ran; see bug 1329119.
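
For reference, a minimal sketch of the two sizing rules above, in shell arithmetic (illustrative only, not the actual engine-setup/ovirt-engine.py code):

  # Illustrative only -- not the actual engine-setup/ovirt-engine.py code.
  ram=$(( 128 * 1024**3 ))          # total RAM in bytes (example: 128 GiB)
  gib=$(( 1024**3 ))

  # engine-setup default: max(25% of RAM, 1 GiB), for both min and max heap
  heap=$(( ram / 4 ))
  (( heap < gib )) && heap=$gib

  # ovirt-engine.py cap: never more than 90% of the RAM present at start
  cap=$(( ram * 9 / 10 ))
  (( heap > cap )) && heap=$cap

  echo "heap: $(( heap / gib )) GiB"   # 128 GiB RAM -> 32 GiB heap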

Since bug 1190466 was fixed, it has been suggested more than once that 25% of RAM may be too much. For example, on a machine with 128 GiB of RAM, 25% is 32 GiB; the heap then normally grows over time until a GC runs, and with such a large heap the GC needlessly takes a very long time. It was specifically suggested to limit the heap to 4 GiB regardless of RAM size, on the assumption that an engine that really needs more than 4 GiB, presumably because it manages lots of hosts/VMs/etc., will run into other scalability problems due to the way it works. I am not aware of any methodical tests of this, hence this bug.

For now, setting to TestOnly and moving to QE. QE - please test with a very large system, perhaps somewhat larger than the largest we officially support in terms of hosts/VMs/etc., with an engine machine having 256 GiB of RAM, and with different values of ENGINE_HEAP_MAX between, say, 1 GiB and 64 GiB, and see whether you can get into too-long GC times. Also check how heap size affects performance in general.
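
Overriding the heap for such a test could look roughly like this (a sketch; the override file name is arbitrary, and the exact location of engine-setup's own values under /etc/ovirt-engine/engine.conf.d/ should be verified on your system):

  # Sketch: override the heap for a test run; the file name is arbitrary,
  # later files in engine.conf.d override earlier ones.
  printf 'ENGINE_HEAP_MIN=4g\nENGINE_HEAP_MAX=4g\n' \
      > /etc/ovirt-engine/engine.conf.d/99-heap-test.conf
  systemctl restart ovirt-engine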

Juan - what else should be done/tested/etc.? Thanks.

Comment 2 Juan Hernández 2017-11-12 09:13:16 UTC
The effect of too-large heaps is twofold:

1. They make full garbage collections less frequent, but longer. Checking this in isolation doesn't make much sense; it should be correlated with the performance as perceived by the user. A possible way to check it is to measure the stability of the engine's response times: select a heavy API request and repeat it over time, recording the total response times (see the sketch after this list). The distribution should be uniform, with some peaks due to garbage collection. The expectation is that there is a point where increasing the heap size increases the peaks without increasing the performance.

2. More than 32 GiB of heap disables compressed object pointers. The JVM stores object pointers as 32-bit offsets if the heap is smaller than 32 GiB; if it is larger, it stores them as 64-bit pointers. This can multiply memory use by approximately 1.5, and increase CPU use and cache misses accordingly. So there is no point in using more than 32 GiB of heap if less is enough, even if the machine has much more RAM. In this regard we should verify that increasing the heap beyond 32 GiB does not improve performance, and then make sure that we never use more than 32 GiB of heap.
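
A minimal sketch of such a measurement, assuming a reachable engine; the URL, credentials, and iteration count are placeholders:

  # Repeat a heavy request and record total response times; peaks in the
  # distribution should correspond to GC pauses. URL/password: placeholders.
  URL='https://engine.example.com/ovirt-engine/api/vms'
  for i in $(seq 1 100); do
    curl -k -s -o /dev/null -w '%{time_total}\n' \
         -u 'admin@internal:password' "$URL"
  done > response_times.txt
  sort -n response_times.txt | tail -5    # the worst response times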

In version 4.2 of the engine we are introducing a new `follow` mechanism in the API that can easily be used to stress the engine. For example, a request like this:

  GET /ovirt-engine/api/vms?follow=disk_attachments.disk.statistics,nics.statistics

asks the engine to retrieve all the virtual machines, with their disk attachments, disks, disk statistics, NICs and NIC statistics. In a large environment this should put a lot of pressure on memory (but unfortunately also on database queries). It is a good candidate for a "heavy" request.
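
For example, with curl (the engine host and credentials below are placeholders):

  # One-off run of the "heavy" request above; %{size_download} shows how
  # much data came back, %{time_total} the total response time.
  curl -k -s -o /dev/null -w 'time: %{time_total}s size: %{size_download}\n' \
       -u 'admin@internal:password' \
       'https://engine.example.com/ovirt-engine/api/vms?follow=disk_attachments.disk.statistics,nics.statistics'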

Comment 3 Piotr Kliczewski 2017-11-13 09:03:39 UTC
In addition to adjusting the heap size, it is possible to reduce full-GC pause times by customizing the GC algorithm and its configuration. How much we can reduce stop-the-world time depends on how the engine allocates memory over time. Do we have any recent measurements of how, and how much, the engine allocates memory?
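
As a starting point for such measurements, GC logging could be enabled roughly like this (a sketch with Java 8-style flags; whether ENGINE_JVM_ARGS is the right variable to extend should be verified against the engine service configuration):

  # Sketch: enable GC logging so allocation/pause behavior can be measured.
  # Java 8-style flags; ENGINE_JVM_ARGS is an assumption -- verify it against
  # the engine service configuration before relying on this.
  printf '%s\n' \
    'ENGINE_JVM_ARGS="${ENGINE_JVM_ARGS} -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/var/log/ovirt-engine/gc.log"' \
    > /etc/ovirt-engine/engine.conf.d/99-gc-log.conf
  systemctl restart ovirt-engine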

Comment 4 Petr Matyáš 2017-11-14 14:29:03 UTC
From my tests on a not-so-large environment (8 GB of RAM for the engine) this might be verified; however, I think this should also be tested on a scale environment.

Eyal, can you take a look at this?

Comment 5 Lukas Svaty 2018-01-24 10:37:09 UTC
Moving to scale team.

