Bug 1595230 - [UPGRADES][10] Packet loss during 'controller upgrade' step
Summary: [UPGRADES][10] Packet loss during 'controller upgrade' step
Keywords:
Status: ASSIGNED
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-neutron
Version: 10.0 (Newton)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 10.0 (Newton)
Assignee: Brian Haley
QA Contact: Toni Freger
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-06-26 12:15 UTC by Yurii Prokulevych
Modified: 2019-03-24 05:10 UTC
CC List: 8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Target Upstream Version:


Attachments

Description Yurii Prokulevych 2018-06-26 12:15:05 UTC
Description of problem:
-----------------------
There is 4% packet loss during the controller nodes upgrade.
This causes the automation to fail, since the threshold is set to 1%.


Version-Release number of selected component (if applicable):
-------------------------------------------------------------
openstack-neutron-9.4.1-19.el7ost

Also tested with openstack-neutron-9.4.1-25.el7ost

How reproducible:
-----------------
So far 100%

Steps to Reproduce:
-------------------
1. Upgrade the undercloud to RHOS-10
2. Spawn a VM and assign a floating IP to it
3. Before every upgrade step, start pinging the VM's FIP; after the step, check that packet loss is not higher than 1% (a sketch of this check follows the list)
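
A minimal sketch of the check described in step 3 (the floating IP and the upgrade-step stub are hypothetical placeholders; the real automation is not part of this report):

    # Hypothetical sketch of the packet-loss check around one upgrade step.
    # Assumes Linux iputils ping, whose summary line contains "...% packet loss".
    import re
    import signal
    import subprocess

    FIP = "10.0.0.50"   # hypothetical floating IP of the test VM
    THRESHOLD = 1.0     # the automation fails above 1% loss

    def run_upgrade_step():
        """Placeholder for the 'controller upgrade' step driven by the real automation."""
        pass

    # Start pinging before the upgrade step...
    ping = subprocess.Popen(["ping", FIP],
                            stdout=subprocess.PIPE, universal_newlines=True)
    run_upgrade_step()

    # ...then stop it and parse the summary that ping prints on SIGINT.
    ping.send_signal(signal.SIGINT)
    out, _ = ping.communicate()
    match = re.search(r"(\d+(?:\.\d+)?)% packet loss", out)
    loss = float(match.group(1)) if match else 100.0
    print("packet loss during step: %.1f%%" % loss)
    if loss > THRESHOLD:
        raise SystemExit("FAIL: %.1f%% loss exceeds %.1f%% threshold" % (loss, THRESHOLD))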


Actual results:
---------------
Packet loss during controller upgrade step

Additional info:
----------------
Virtual setup: 3 controllers + 2 computes + 3 ceph

Comment 2 Brian Haley 2018-06-27 17:44:41 UTC
I tried to reproduce this yesterday by pinging the floating IP while restarting the l3-agent on the controller hosting the master router.  I didn't see any packet loss, so I will go through the logs to see if I can piece something together.

Comment 3 Brian Haley 2018-06-27 18:33:14 UTC
I couldn't find l3-agent logs in the sosreport for all the controllers, but looking at controller-2, which is not the current master, I do see this after it's restarted:

2018-06-26 11:20:57.586 13204 DEBUG neutron.agent.linux.utils [-] Running command (rootwrap daemon): ['ip', 'netns', 'exec', 'qrouter-4bedd2a6-1c8f-4876-a9d5-411768c75d53', 'sysctl', '-w', 'net.ipv6.conf.all.forwarding=0']

2018-06-26 11:20:57.725 13204 INFO neutron.agent.l3.ha [-] Router 4bedd2a6-1c8f-4876-a9d5-411768c75d53 transitioned to master

2018-06-26 11:20:57.886 13204 DEBUG neutron.agent.linux.utils [-] Running command (rootwrap daemon): ['ip', 'netns', 'exec', 'qrouter-4bedd2a6-1c8f-4876-a9d5-411768c75d53', 'sysctl', '-w', 'net.ipv6.conf.qg-3913e1a3-8f.forwarding=1']

2018-06-26 11:21:09.472 13204 INFO neutron.agent.l3.ha [-] Router 4bedd2a6-1c8f-4876-a9d5-411768c75d53 transitioned to backup

2018-06-26 11:21:09.501 13204 DEBUG neutron.agent.linux.utils [-] Running command (rootwrap daemon): ['ip', 'netns', 'exec', 'qrouter-4bedd2a6-1c8f-4876-a9d5-411768c75d53', 'sysctl', '-w', 'net.ipv6.conf.qg-3913e1a3-8f.forwarding=0']

So it bounced the forwarding knob on and off, which was the issue that caused the failure originally.

But the router on controller-0 had already declared itself the master less than a minute before:

2018-06-26 11:19:44.463 12652 INFO neutron.agent.l3.ha [-] Router 4bedd2a6-1c8f-4876-a9d5-411768c75d53 transitioned to master

So the packet loss is most likely a result of the router on controller-2 transitioning from master to backup, since even that would cause an MLDv2 message to get sent, causing a possible interruption of ~1 minute.
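
For reference, the state the agent leaves behind can be checked by reading the same sysctls it toggles inside the router namespace; a rough sketch, run as root on a controller (the namespace and qg- device names are taken from the log lines above and will differ on other routers):

    # Rough sketch: read the forwarding sysctls the l3-agent toggles in the
    # qrouter namespace.  Namespace/device names are taken from the logs above.
    import subprocess

    NETNS = "qrouter-4bedd2a6-1c8f-4876-a9d5-411768c75d53"
    KNOBS = ["net.ipv6.conf.all.forwarding",
             "net.ipv6.conf.qg-3913e1a3-8f.forwarding"]

    for knob in KNOBS:
        value = subprocess.check_output(
            ["ip", "netns", "exec", NETNS, "sysctl", "-n", knob],
            universal_newlines=True).strip()
        print("%s = %s" % (knob, value))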

Comment 4 Brian Haley 2018-06-28 17:24:40 UTC
Yurii recreated this condition again today, but looking at the three controllers, they do not have the version of OSP 10 that has the fix:

openstack-neutron.noarch         1:9.4.1-19.el7ost

1:9.4.1-25.el7ost or later is required.  I know yesterday I did see -25 installed, so maybe a minor update is required?  And then we can re-check things?
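
A quick helper to confirm which build each controller actually has, for reference (it just runs rpm -q over ssh; the heat-admin user and controller hostnames are the usual TripleO defaults and may need adjusting):

    # Confirm the installed openstack-neutron build on each controller
    # (the fix needs 1:9.4.1-25.el7ost or later).  Hostnames and the
    # heat-admin user are assumed TripleO defaults; adjust to the real inventory.
    import subprocess

    CONTROLLERS = ["controller-0", "controller-1", "controller-2"]

    for host in CONTROLLERS:
        version = subprocess.check_output(
            ["ssh", "heat-admin@%s" % host, "rpm", "-q", "openstack-neutron"],
            universal_newlines=True).strip()
        print("%s: %s" % (host, version))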

Also, one other thing that would be useful is to test a 10 -> 11 upgrade and see if this failure occurs again.  Thanks.

Comment 6 Brian Haley 2018-07-12 16:25:43 UTC
So I have a thought on what could have happened, although I think this would have affected other upgrades as well.

From the logs I can see that net.ipv6.conf.*.forwarding is getting changed, even on the standby routers:

2018-07-12 06:16:21.292 27186 DEBUG neutron.agent.linux.utils [-] Running command (rootwrap daemon): ['ip', 'netns', 'exec', 'qrouter-d8abaffc-f1ed-42ff-97c0-a05db06756f8', 'sysctl', '-w', 'net.ipv6.conf.all.forwarding=1'] execute_rootwrap_daemon /usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:106
2018-07-12 07:26:49.823 17470 DEBUG neutron.agent.linux.utils [-] Running command (rootwrap daemon): ['ip', 'netns', 'exec', 'qrouter-d8abaffc-f1ed-42ff-97c0-a05db06756f8', 'sysctl', '-w', 'net.ipv6.conf.all.forwarding=0'] execute_rootwrap_daemon /usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:105
2018-07-12 07:26:50.132 17470 DEBUG neutron.agent.linux.utils [-] Running command (rootwrap daemon): ['ip', 'netns', 'exec', 'qrouter-d8abaffc-f1ed-42ff-97c0-a05db06756f8', 'sysctl', '-w', 'net.ipv6.conf.qg-1a2c96d7-d5.forwarding=0'] execute_rootwrap_daemon /usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:105
2018-07-12 09:09:33.280 73628 DEBUG neutron.agent.linux.utils [-] Running command (rootwrap daemon): ['ip', 'netns', 'exec', 'qrouter-d8abaffc-f1ed-42ff-97c0-a05db06756f8', 'sysctl', '-w', 'net.ipv6.conf.all.forwarding=0'] execute_rootwrap_daemon /usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:105
2018-07-12 09:09:33.913 73628 DEBUG neutron.agent.linux.utils [-] Running command (rootwrap daemon): ['ip', 'netns', 'exec', 'qrouter-d8abaffc-f1ed-42ff-97c0-a05db06756f8', 'sysctl', '-w', 'net.ipv6.conf.qg-1a2c96d7-d5.forwarding=0'] execute_rootwrap_daemon /usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:105

Even that switch from 1 to 0 at 7:26:49.823 could have caused a disruption, although I'm not sure if that's the time when the packets were lost, or if it was at 9:09?

Just for testing, I did flip all.forwarding on one of the standby routers and saw packet loss.  The good thing is that this shouldn't happen going forward, since all the backups have these set to 0 now, but I would have figured an upgrade from 10 to 13 would have seen this too, unless the 10 setup hadn't applied this patch yet.
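
For anyone who wants to repeat that experiment, a rough sketch, for test environments only (the namespace name is taken from the log lines above; the router must be a backup instance on that controller, and a ping to the FIP should be running elsewhere to observe the loss):

    # Rough sketch of the test above: flip all.forwarding inside a *backup*
    # router's namespace and watch for loss on a ping running elsewhere.
    # Run as root; namespace name comes from the logs above.
    import subprocess
    import time

    NETNS = "qrouter-d8abaffc-f1ed-42ff-97c0-a05db06756f8"

    def set_forwarding(value):
        subprocess.check_call(
            ["ip", "netns", "exec", NETNS,
             "sysctl", "-w", "net.ipv6.conf.all.forwarding=%d" % value])

    set_forwarding(1)   # mimic the state the upgrade left behind on the backup
    time.sleep(5)
    set_forwarding(0)   # the 1 -> 0 flip suspected of triggering the disruption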

This is what I was trying to do with https://review.openstack.org/#/c/569458/ - don't ever change that value from 1 to 0 if possible.  I stopped working on that once the other patch seemed to fix the issue, but I can pick it up again since it might fix this edge case.

