Bug 1491505 - Cannot Scale to more than 116 subnets and VMs [NEEDINFO]
Summary: Cannot Scale to more than 116 subnets and VMs
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-heat-templates
Version: 12.0 (Pike)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ga
Target Release: 12.0 (Pike)
Assignee: Sai Sindhur Malleni
QA Contact: Sai Sindhur Malleni
URL:
Whiteboard: scale_lab
Depends On:
Blocks: 1512489
 
Reported: 2017-09-14 03:28 UTC by Sai Sindhur Malleni
Modified: 2018-10-18 07:22 UTC (History)
23 users

Fixed In Version: openstack-tripleo-heat-templates-7.0.3-0.20171023134947.8da5e1f.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1512489 (view as bug list)
Environment:
N/A
Last Closed: 2017-12-13 22:08:57 UTC
Target Upstream Version:
pablo.iranzo: needinfo? (sauchter)




Links
System ID Priority Status Summary Last Updated
OpenStack gerrit 505381 None None None 2017-09-25 14:25:02 UTC
OpenStack gerrit 509433 None None None 2017-10-30 13:51:58 UTC
Red Hat Knowledge Base (Solution) 3228801 None None None 2017-11-03 13:51:32 UTC
Red Hat Product Errata RHEA-2017:3462 normal SHIPPED_LIVE Red Hat OpenStack Platform 12.0 Enhancement Advisory 2018-02-16 01:43:25 UTC

Description Sai Sindhur Malleni 2017-09-14 03:28:12 UTC
Description of problem:

We are using an OpenStack setup with 1 controller and 11 compute nodes. Ml2/ODL is the neutron backend. We execute the following Browbeat Rally Plugin:

1. Create a network
2. Create a subnet
3. Boot an instance on this subnet

We do the above sequence of operations 500 times at a concurrency of 8. 
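For reference, each iteration of the workload is roughly equivalent to the following openstack CLI sketch. This is an illustration of the workload shape, not the actual Browbeat/Rally plugin code; the resource names, the 10.x.y.0/24 CIDR scheme, and the cirros/m1.tiny image and flavor are all assumptions.

```shell
#!/bin/sh
# Sketch of one iteration: network -> subnet -> VM.
# Names, CIDRs, image and flavor are illustrative assumptions.

# Give each iteration its own /24 so 500 subnets do not overlap.
cidr_for() {
    i=$1
    # 10.<i/256>.<i%256>.0/24 is unique for iterations 0..499
    echo "10.$((i / 256)).$((i % 256)).0/24"
}

boot_one() {
    i=$1
    openstack network create "scale-net-$i"
    openstack subnet create "scale-subnet-$i" \
        --network "scale-net-$i" --subnet-range "$(cidr_for "$i")"
    openstack server create "scale-vm-$i" \
        --image cirros --flavor m1.tiny --network "scale-net-$i"
}
```

Running boot_one for i in 0..499, eight at a time, reproduces the shape of the workload; the failure appears once roughly 116 dnsmasq-backed subnets exist on the single controller.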

Even after several attempts we are unable to scale past 116 VMs (each VM is on its own subnet); 116 seems to be a hard limit. The port never transitions into ACTIVE: even though VIF plugging happens, the provisioning block for DHCP is never cleared. Since ML2/ODL uses the neutron DHCP agent for DHCP, we looked in the DHCP agent logs and found:

2017-09-13 21:45:38.124 91663 ERROR neutron.agent.dhcp.agent [req-288477aa-318a-40c3-954e-dd6fc98c6c1b - - - - -] Unable to enable dhcp for bb6cdb16-72c0-4cc4-a316-69ebcd7633b2.: ProcessExecutionError: Exit code: 5; Stdin: ; Stdout: ; Stderr:
dnsmasq: failed to create inotify: Too many open files
2017-09-13 21:45:38.124 91663 ERROR neutron.agent.dhcp.agent Traceback (most recent call last):
2017-09-13 21:45:38.124 91663 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/dhcp/agent.py", line 142, in call_driver
2017-09-13 21:45:38.124 91663 ERROR neutron.agent.dhcp.agent     getattr(driver, action)(**action_kwargs)
2017-09-13 21:45:38.124 91663 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/linux/dhcp.py", line 218, in enable
2017-09-13 21:45:38.124 91663 ERROR neutron.agent.dhcp.agent     self.spawn_process()
2017-09-13 21:45:38.124 91663 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/linux/dhcp.py", line 439, in spawn_process
2017-09-13 21:45:38.124 91663 ERROR neutron.agent.dhcp.agent     self._spawn_or_reload_process(reload_with_HUP=False)
2017-09-13 21:45:38.124 91663 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/linux/dhcp.py", line 453, in _spawn_or_reload_process
2017-09-13 21:45:38.124 91663 ERROR neutron.agent.dhcp.agent     pm.enable(reload_cfg=reload_with_HUP)
2017-09-13 21:45:38.124 91663 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/linux/external_process.py", line 96, in enable
2017-09-13 21:45:38.124 91663 ERROR neutron.agent.dhcp.agent     run_as_root=self.run_as_root)
2017-09-13 21:45:38.124 91663 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/linux/ip_lib.py", line 903, in execute
2017-09-13 21:45:38.124 91663 ERROR neutron.agent.dhcp.agent     log_fail_as_error=log_fail_as_error, **kwargs)
2017-09-13 21:45:38.124 91663 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py", line 151, in execute
2017-09-13 21:45:38.124 91663 ERROR neutron.agent.dhcp.agent     raise ProcessExecutionError(msg, returncode=returncode)
2017-09-13 21:45:38.124 91663 ERROR neutron.agent.dhcp.agent ProcessExecutionError: Exit code: 5; Stdin: ; Stdout: ; Stderr:
2017-09-13 21:45:38.124 91663 ERROR neutron.agent.dhcp.agent dnsmasq: failed to create inotify: Too many open files
2017-09-13 21:45:38.124 91663 ERROR neutron.agent.dhcp.agent
2017-09-13 21:45:38.124 91663 ERROR neutron.agent.dhcp.agent
2017-09-13 21:45:38.260 91663 ERROR neutron.agent.linux.utils [req-d0ade748-22ea-4a45-ba10-277d45f20981 - - - - -] Exit code: 5; Stdin: ; Stdout: ; Stderr:
dnsmasq: failed to create inotify: Too many open files


Even after increasing fs.inotify.max_user_watches from 8192 to 50000, we see the same behaviour:

[root@overcloud-controller-0 heat-admin]# sysctl fs.inotify.max_user_watches                                                                                                                  
fs.inotify.max_user_watches = 50000
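As later comments confirm, the limit that actually bites here is fs.inotify.max_user_instances (inotify instances per UID), not max_user_watches (watches per instance): each dnsmasq process creates its own inotify instance, and a 128-instance default minus the instances used by other daemons plausibly explains a ceiling near 116. A quick way to check both limits and the current instance count on the controller, using only standard /proc interfaces:

```shell
#!/bin/sh
# Compare the per-user inotify *instance* limit with how many
# inotify instances are currently open on this host.

# Per-UID cap on inotify_init() instances (RHEL 7 default: 128).
cat /proc/sys/fs/inotify/max_user_instances

# Per-instance cap on watches; raising this alone does not help here.
cat /proc/sys/fs/inotify/max_user_watches

count_inotify() {
    # Every open inotify fd appears as an anon_inode:inotify symlink.
    find /proc/[0-9]*/fd -lname 'anon_inode:inotify' 2>/dev/null | wc -l
}

count_inotify
```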




Version-Release number of selected component (if applicable):
OSP 12
puppet-neutron-11.3.0-0.20170805104936.743dde6.el7ost.noarch
python-neutronclient-6.5.0-0.20170807200849.355983d.el7ost.noarch
openstack-neutron-lbaas-11.0.0-0.20170807144457.c9adfd4.el7ost.noarch
python-neutron-11.0.0-0.20170807223712.el7ost.noarch
python-neutron-lbaas-11.0.0-0.20170807144457.c9adfd4.el7ost.noarch
openstack-neutron-linuxbridge-11.0.0-0.20170807223712.el7ost.noarch
openstack-neutron-ml2-11.0.0-0.20170807223712.el7ost.noarch
openstack-neutron-11.0.0-0.20170807223712.el7ost.noarch
openstack-neutron-sriov-nic-agent-11.0.0-0.20170807223712.el7ost.noarch
python-neutron-lib-1.9.1-0.20170731102145.0ef54c3.el7ost.noarch
openstack-neutron-metering-agent-11.0.0-0.20170807223712.el7ost.noarch
openstack-neutron-common-11.0.0-0.20170807223712.el7ost.noarch
openstack-neutron-openvswitch-11.0.0-0.20170807223712.el7ost.noarch
opendaylight-6.2.0-0.1.20170913snap58.el7.noarch
python-networking-odl-11.0.0-0.20170806093629.2e78dca.el7ost.noarch


How reproducible:
100%

Steps to Reproduce:
1. Run a scale test that boots VMs, with each VM on its own subnet

Actual results:

Cannot scale to more than 116 VMs

Expected results:
Should be able to boot 500 VMs on 500 different subnets, since we have the capacity from a hypervisor point of view.

Additional info:

Comment 1 Sai Sindhur Malleni 2017-09-14 03:59:28 UTC
[root@overcloud-controller-0 heat-admin]# rpm -qa | grep dnsmasq
dnsmasq-2.76-2.el7.x86_64
dnsmasq-utils-2.66-21.el7.x86_64

Comment 2 Brian Haley 2017-09-14 19:26:04 UTC
https://bugzilla.redhat.com/show_bug.cgi?id=1474515 has some good info on a similar bug reported to the RHEL team by Joe Talerico.

Comment 3 Sai Sindhur Malleni 2017-09-15 14:54:47 UTC
I can confirm that running sysctl -w fs.inotify.max_user_instances=256 >> /etc/sysctl.conf to raise the value from 128 to 256 results in more subnets and VMs being created.

Question is should we bump this default in RHEL or have Director bump it at least for overcloud nodes?
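For anyone hitting this before a packaged fix lands, a minimal way to persist the workaround above is a sysctl.d drop-in. The file name is hypothetical; the 256 value mirrors this comment, and whether it is enough for a full 500-VM run is not verified here.

```conf
# /etc/sysctl.d/99-neutron-inotify.conf  (hypothetical file name)
# Each dnsmasq spawned by the neutron DHCP agent consumes one
# inotify instance; the RHEL 7 default of 128 is too low at scale.
fs.inotify.max_user_instances = 256
```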

Comment 4 Brian Haley 2017-09-15 16:37:18 UTC
I would think bumping this in Director would be best, since it might not apply to all RHEL users. Assuming there is no negative side effect, setting it to its maximum might be best. Maybe someone on the RHEL team can help with that.
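Pending a changed default, a per-deployment override can be sketched with the generic sysctl hook in TripleO's kernel service. This assumes the ExtraSysctlSettings parameter is available in the deployed tripleo-heat-templates version, and the 1024 value is illustrative, not taken from the merged patch.

```yaml
# sysctl-inotify.yaml -- hypothetical environment file, passed as
# `openstack overcloud deploy ... -e sysctl-inotify.yaml`
parameter_defaults:
  ExtraSysctlSettings:
    fs.inotify.max_user_instances:
      value: 1024  # illustrative; RHEL 7 default is 128
```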

Comment 5 Jakub Libosvar 2017-10-30 13:51:59 UTC
Patch was merged

Comment 8 Sai Sindhur Malleni 2017-10-31 17:33:20 UTC
KCS: https://access.redhat.com/solutions/3228801

Comment 9 Sai Sindhur Malleni 2017-10-31 17:38:06 UTC
Ramon,

BZ for OSP10 backport

Comment 10 Sai Sindhur Malleni 2017-10-31 17:38:34 UTC
Sorry, I failed to link the actual BZ in the previous comment. Here it is,
https://bugzilla.redhat.com/show_bug.cgi?id=1508030

Comment 17 errata-xmlrpc 2017-12-13 22:08:57 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:3462

