Bug 1601472 - FFU: after the ffu procedure, unable to receive metadata for a freshly booted dpdk instance [NEEDINFO]
Summary: FFU: after the ffu procedure, unable to receive metadata for a freshly booted dpdk instance
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo
Version: 13.0 (Queens)
Hardware: x86_64
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: z2
Target Release: 13.0 (Queens)
Assignee: Yolanda Robla
QA Contact: Arik Chernetsky
URL:
Whiteboard:
Depends On:
Blocks: 1560721 1601484 1607376
 
Reported: 2018-07-16 13:07 UTC by Ziv
Modified: 2018-12-24 11:40 UTC
CC List: 21 users

Fixed In Version:
Doc Type: Enhancement
Doc Text:
The procedures for upgrading from RHOSP 10 to RHOSP 13 with NFV deployed have been retested and updated for DPDK and SR-IOV environments.
Clone Of:
Environment:
Last Closed: 2018-08-29 16:37:56 UTC
Target Upstream Version:
jslagle: needinfo? (yroblamo)


Attachments
rhos 10 initial templates (deleted), 2018-07-16 13:07 UTC, Ziv
neutron errors from controller (deleted), 2018-07-16 13:14 UTC, Ziv
nova errors from controller (deleted), 2018-07-16 13:15 UTC, Ziv


Links
System ID Priority Status Summary Last Updated
OpenStack gerrit 585395 None None None 2018-07-27 09:14:49 UTC
Red Hat Bugzilla 1558452 None NEW br-link & br-int bridge datapath type is changed from netdev to system after ffu 2019-04-12 16:20:05 UTC
Red Hat Bugzilla 1601484 None CLOSED glance image save is failing after ffwd upgrade 2019-04-12 16:20:05 UTC
Red Hat Bugzilla 1607376 None CLOSED FFU: After FFU, i cannot restart the controllers 2019-04-12 16:20:06 UTC
Red Hat Bugzilla 1610196 None CLOSED FFU - All VMS are paused after FFU 2019-04-12 16:20:06 UTC
Red Hat Product Errata RHBA-2018:2574 None None None 2018-08-29 16:38:58 UTC

Description Ziv 2018-07-16 13:07:19 UTC
Created attachment 1459162 [details]
rhos 10 initial templates

Description of problem:

After the FFU procedure finished, I tried to boot a fresh instance for testing.
The instance booted successfully, but the network was not configured as expected:

openstack console log show dpdk_test_vm

[   35.706679] cloud-init[766]: Stderr: ''
[   35.716646] cloud-init[766]: ci-info: +++++++++++++++++++++++Net device info+++++++++++++++++++++++
[   35.719563] cloud-init[766]: ci-info: +--------+------+-----------+-----------+-------------------+
[   35.722448] cloud-init[766]: ci-info: | Device |  Up  |  Address  |    Mask   |     Hw-Address    |
[   35.725579] cloud-init[766]: ci-info: +--------+------+-----------+-----------+-------------------+
[   35.728385] cloud-init[766]: ci-info: |  lo:   | True | 127.0.0.1 | 255.0.0.0 |         .         |
[   35.728615] cloud-init[766]: ci-info: | eth0:  | True |     .     |     .     | fa:16:3e:52:f1:44 |
[   35.728903] cloud-init[766]: ci-info: +--------+------+-----------+-----------+-------------------+
[   35.731155] cloud-init[766]: ci-info: !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!Route info failed!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
[   35.964688] cloud-init[766]: 2018-07-16 07:10:11,311 - url_helper.py[WARNING]: Calling 'http://169.254.169.254/2009-04-04/meta-data/instance-id' failed [0/120s]: request error [('Connection aborted.', error(101, 'Network is unreachable'))]
[   36.968456] cloud-init[766]: 2018-07-16 07:10:12,315 - url_helper.py[WARNING]: Calling 'http://169.254.169.254/2009-04-04/meta-data/instance-id' failed [1/120s]: request error [('Connection aborted.', error(101, 'Network is unreachable'))]
[   37.976223] cloud-init[766]: 2018-07-16 07:10:13,322 - url_helper.py[WARNING]: Calling 'http://169.254.169.254/2009-04-04/meta-data/instance-id' failed [2/120s]: request error [('Connection aborted.', error(101, 'Network is unreachable'))]
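
For reference (a suggestion, not part of the original report), a quick way to re-check metadata reachability from inside the guest, assuming console or SSH access, is to query the same EC2-style endpoint cloud-init uses and look at the routing table:

curl -sv http://169.254.169.254/2009-04-04/meta-data/instance-id
ip route   # with no address on eth0 and no routes, as in the log above, the request cannot leave the guest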

How reproducible:
Always

Steps to Reproduce:
1. Deploy RHOS 10 z5 with the attached templates
2. Minor update to the latest RHOS 10z
3. Upgrade (FFU) to the latest RHOS 13

Additional info:
Error logs for nova/neutron from the controller are attached.

Comment 1 Ziv 2018-07-16 13:14:09 UTC
Created attachment 1459171 [details]
neutron errors from controller

Comment 2 Ziv 2018-07-16 13:15:08 UTC
Created attachment 1459172 [details]
nova errors from controller

Comment 4 Yolanda Robla 2018-07-17 12:49:00 UTC
Also seeing these errors from OVS on the compute:

2018-07-16T13:07:47.368Z|02471|dpif_netlink|WARN|system@ovs-system: cannot create port `dpdk0' because it has unsupported type `dpdk'
2018-07-16T13:07:47.368Z|02472|bridge|WARN|could not add network device dpdk0 to ofproto (Invalid argument)
2018-07-16T13:07:47.368Z|02473|dpif_netlink|WARN|system@ovs-system: cannot create port `dpdk1' because it has unsupported type `dpdk'
2018-07-16T13:07:47.368Z|02474|bridge|WARN|could not add network device dpdk1 to ofproto (Invalid argument)
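
These warnings are what OVS prints when a 'dpdk' port is added to a bridge backed by the kernel (system) datapath. A couple of checks on the compute (my suggestion, assuming OVS 2.9, where the dpdk_initialized column exists):

ovs-vsctl get Open_vSwitch . other_config:dpdk-init   # errors out or shows 'false' if DPDK is not enabled
ovs-vsctl get Open_vSwitch . dpdk_initialized
ovs-vsctl get bridge br-link0 datapath_type           # should be 'netdev' for DPDK ports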

Comment 5 Yolanda Robla 2018-07-17 12:55:38 UTC
Also, in case it is relevant, I found this in the messages log on the compute:
Jul 16 11:07:06 compute-0 cloud-init: 2018-07-16 11:07:06,469 - stages.py[WARNING]: Failed to rename devices: duplicate mac found! both 'p7p1' and 'bond_api' have mac 'f8:f2:1e:03:bc:40'

Comment 6 Yolanda Robla 2018-07-17 14:55:40 UTC
Did you reboot the compute after the OVS 2.9 minor update?

Comment 7 Yolanda Robla 2018-07-17 15:13:38 UTC
[root@compute-0 log]# ovs-vsctl get bridge br-link0 datapath_type
system
[root@compute-0 log]# ovs-vsctl --may-exist add-br br0 -- set bridge br-link0 datapath_type=netdev
[root@compute-0 log]# ovs-vsctl get bridge br-link0 datapath_type
netdev
[root@compute-0 log]# ifdown br-link0 && ifup br-link0
arping: recvfrom: La red no está activa
Cannot find device "br-link0"
ERROR     : [/etc/sysconfig/network-scripts/ifup-eth] Error adding address 10.10.126.104 for br-link0.
arping: Device br-link0 not available.
Cannot find device "br-link0"
[root@compute-0 log]# ovs-vsctl get bridge br-link0 datapath_type
system
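
A minimal sketch (my assumption, using the same OVS_EXTRA mechanism visible in comment 10) of pinning the datapath type in the interface file itself so it survives the ifdown/ifup cycle, since the one-off ovs-vsctl change above is lost when the bridge is recreated:

# /etc/sysconfig/network-scripts/ifcfg-br-link0 (hypothetical excerpt)
OVS_EXTRA="set port br-link0 tag=526 -- set bridge br-link0 datapath_type=netdev -- set bridge br-link0 fail_mode=standalone"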

Comment 8 Ziv 2018-07-17 16:08:01 UTC
(In reply to Yolanda Robla from comment #6)
> Have you proceeded with a reboot on the compute after ovs 2.9 minor update?

Yes, and the NFV regression tempest tests pass.

Comment 9 zenghui.shi 2018-07-18 00:33:01 UTC
(In reply to Yolanda Robla from comment #7)
> [root@compute-0 log]# ovs-vsctl get bridge br-link0 datapath_type
> system
> [root@compute-0 log]# ovs-vsctl --may-exist add-br br0 -- set bridge
> br-link0 datapath_type=netdev
> [root@compute-0 log]# ovs-vsctl get bridge br-link0 datapath_type
> netdev
> [root@compute-0 log]# ifdown br-link0 && ifup br-link0
> arping: recvfrom: La red no está activa
> Cannot find device "br-link0"
> ERROR     : [/etc/sysconfig/network-scripts/ifup-eth] Error adding address
> 10.10.126.104 for br-link0.
> arping: Device br-link0 not available.
> Cannot find device "br-link0"
> [root@compute-0 log]# ovs-vsctl get bridge br-link0 datapath_type
> system

This looks close.

May I ask whether there were any changes to the file ifcfg-br-link0 during or before the overcloud FFU (either manual changes to the file or changes coming from the nic-configs templates)? Note that even a minor change, like adding a blank space, counts here.

If yes: when the os-net-config command (os-net-config -c /etc/os-net-config/config.json -v) gets executed, br-link0 will be deleted and recreated (because os-net-config triggers "ifdown br-link0 and ifup br-link0"), and the newly created br-link0 will use the 'system' datapath_type instead of 'netdev'.

Why does os-net-config get executed during FFU?
A restart of the httpd service during the undercloud FFU upgrade will cause os-net-config to re-run on the overcloud if there are any nic-config template changes.
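
For reference (my suggestion, not from the original comment; assuming the --noop and --detailed-exit-codes options are available in this os-net-config version), a dry run shows up front whether the nic configs would be touched:

os-net-config -c /etc/os-net-config/config.json -v --noop --detailed-exit-codes
# With --detailed-exit-codes, an exit code of 2 means files would have been modified,
# i.e. the next real run would recreate br-link0 and flip its datapath_type.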

Comment 10 Yolanda Robla 2018-07-19 11:32:34 UTC
OK, so the problem is isolated to os-net-config. When we use the 13 version, the datapath type is changed to 'system'. When we downgrade the package to the 10 version, the port is set correctly to 'netdev'.
I performed a diff between the nic configs generated by the 10 and 13 os-net-config versions, and it reported the following:

diff -u /etc/sysconfig/network-scripts /etc/sysconfig/network-scripts-13
diff -u /etc/sysconfig/network-scripts/ifcfg-br-link0 /etc/sysconfig/network-scripts-13/ifcfg-br-link0
--- /etc/sysconfig/network-scripts/ifcfg-br-link0	2018-07-19 11:26:08.923619095 +0000
+++ /etc/sysconfig/network-scripts-13/ifcfg-br-link0	2018-07-19 11:25:22.983392104 +0000
@@ -9,4 +9,4 @@
 BOOTPROTO=static
 IPADDR=10.10.126.109
 NETMASK=255.255.255.0
-OVS_EXTRA="set port br-link0 tag=526 -- set bridge br-link0 fail_mode=standalone"
+OVS_EXTRA="set port br-link0 tag=526 -- set bridge br-link0 fail_mode=standalone -- del-controller br-link0 -- set bridge br-link0 fail_mode=standalone -- del-controller br-link0"
diff -u /etc/sysconfig/network-scripts/ifcfg-dpdkbond0 /etc/sysconfig/network-scripts-13/ifcfg-dpdkbond0
--- /etc/sysconfig/network-scripts/ifcfg-dpdkbond0	2018-07-19 11:26:08.923619095 +0000
+++ /etc/sysconfig/network-scripts-13/ifcfg-dpdkbond0	2018-07-19 11:25:22.983392104 +0000
@@ -9,4 +9,4 @@
 OVS_BRIDGE=br-link0
 BOND_IFACES="dpdk0 dpdk1"
 MTU=9000
-OVS_EXTRA="set interface dpdk0 mtu_request=$MTU -- set interface dpdk1 mtu_request=$MTU -- set interface dpdk0 options:n_rxq=2 -- set interface dpdk1 options:n_rxq=2 -- set Interface dpdk0 options:dpdk-devargs=0000:82:00.0 -- set Interface dpdk1 options:dpdk-devargs=0000:82:00.1 -- set interface dpdk0 mtu_request=$MTU -- set interface dpdk1 mtu_request=$MTU -- set interface dpdk0 options:n_rxq=2 -- set interface dpdk1 options:n_rxq=2"
+OVS_EXTRA="set interface dpdk0 mtu_request=$MTU -- set interface dpdk1 mtu_request=$MTU -- set interface dpdk0 options:n_rxq=2 -- set interface dpdk1 options:n_rxq=2 -- set Interface dpdk0 options:dpdk-devargs=0000:82:00.0 -- set Interface dpdk1 options:dpdk-devargs=0000:82:00.1 -- set Interface dpdk0 mtu_request=$MTU -- set Interface dpdk1 mtu_request=$MTU -- set interface dpdk0 mtu_request=$MTU -- set interface dpdk1 mtu_request=$MTU -- set interface dpdk0 options:n_rxq=2 -- set interface dpdk1 options:n_rxq=2"

Comment 11 Yolanda Robla 2018-07-19 12:50:08 UTC
Although that causes a config change and an OVS restart, it still seems to be a side effect. I downgraded os-collect-config to earlier versions and ensured that the nic configs were as they were in 10. On reboots, or OVS restarts, I can still see the type changed to 'system'.

Comment 12 Yolanda Robla 2018-07-20 09:28:13 UTC
New findings: the OVS restart was just a red herring. The issue seems to be that the neutron-ovs_agent container does not have the right datapath type. Looking at the config from the container, it shows the following:

()[neutron@compute-0 /]$ cat /var/lib/kolla/config_files/src/etc/neutron/plugins/ml2/openvswitch_agent.ini

[ovs]
bridge_mappings=tenant:br-link0
integration_bridge=br-int
tunnel_bridge=br-tun
local_ip=10.10.126.109

[agent]
l2_population=False
arp_responder=False
enable_distributed_routing=False
drop_flows_on_start=False
extensions=qos
tunnel_types=vxlan
vxlan_udp_port=4789

[securitygroup]
firewall_driver=iptables_hybrid
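
For comparison, a sketch (my assumption of the expected result, reusing the datapath and socket-directory values quoted in comment 16 below) of what the [ovs] section should carry once the DPDK agent service is mapped correctly:

[ovs]
bridge_mappings=tenant:br-link0
integration_bridge=br-int
local_ip=10.10.126.109
datapath_type=netdev
vhostuser_socket_dir=/var/lib/vhost_sockets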

Comment 13 Yolanda Robla 2018-07-20 12:05:42 UTC
The latest findings show that an ovs.pp puppet module is used for this, and DPDK is not correctly enabled. Looking at how it configures openvswitch_agent.ini, it simply does not consider the DPDK values. I modified the module manually, passing the DPDK parameters, and then I can see differences. So it may be that some template changes are needed as a preparation step from 10 to 13, before starting FFU; we are looking into it.

Comment 14 Yolanda Robla 2018-07-20 12:23:57 UTC
The prepare command for FFU was:

(undercloud) [stack@undercloud-0 ~]$ cat overcloud_upgrade_prepare.sh 
#!/bin/env bash
#
# Setup HEAT's output
#
set -euo pipefail

source /home/stack/stackrc

echo "Running  ffwd-upgrade upgrade prepare step"
openstack overcloud ffwd-upgrade prepare --stack overcloud \
    --templates /usr/share/openstack-tripleo-heat-templates \
    --yes \
    -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml \
    -e /usr/share/openstack-tripleo-heat-templates/environments/neutron-ovs-dpdk.yaml \
    -e /home/stack/ospd-10-vxlan-dpdk-two-ports-ctlplane-bonding/network-environment.yaml \
    -e /usr/share/openstack-tripleo-heat-templates/environments/ovs-dpdk-permissions.yaml \
    -e /home/stack/ffu_repos.yaml \
    -e /home/stack/virt/docker-images.yaml \
    2>&1

Comment 15 Saravanan KR 2018-07-20 12:36:12 UTC
After discussing with Yolanda, I believe this could be the issue. OSP 10 has the ComputeNeutronOvsAgent service mapped to the ovs-dpdk service, but in OSP 13 the same environment file has the new DPDK service, ComputeNeutronOvsDpdk.

Adding an environment file with the content below will ensure that the same mapping is retained for FFU:

------------------
resource_registry:
  OS::TripleO::Services::ComputeNeutronOvsAgent: /usr/share/openstack-tripleo-heat-templates/docker/services/neutron-ovs-dpdk-agent.yaml
------------------
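
For usage, the file (the name /home/stack/ffu-dpdk-mapping.yaml is just a placeholder) would be passed with -e to the same ffwd-upgrade prepare command shown in comment 14, after the other environment files:

openstack overcloud ffwd-upgrade prepare --stack overcloud \
    --templates /usr/share/openstack-tripleo-heat-templates \
    ... \
    -e /home/stack/ffu-dpdk-mapping.yaml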

Comment 16 Yolanda Robla 2018-07-20 13:23:35 UTC
So Saravanan's comments made sense. I modified your DPDK environment file so that it maps ComputeNeutronOvsAgent to /usr/share/openstack-tripleo-heat-templates/docker/services/neutron-ovs-dpdk-agent.yaml, and set it like the following:

# A Heat environment that can be used to deploy DPDK with OVS
# Deploying DPDK requires enabling hugepages for the overcloud nodes
resource_registry:
  OS::TripleO::Services::ComputeNeutronOvsDpdk: ../puppet/services/neutron-ovs-dpdk-agent.yaml
  OS::TripleO::Services::ComputeNeutronOvsAgent: /usr/share/openstack-tripleo-heat-templates/docker/services/neutron-ovs-dpdk-agent.yaml

parameter_defaults:
  NeutronDatapathType: "netdev"
  NeutronVhostuserSocketDir: "/var/lib/vhost_sockets"
  NovaSchedulerDefaultFilters: "RamFilter,ComputeFilter,AvailabilityZoneFilter,ComputeCapabilitiesFilter,ImagePropertiesFilter,NUMATopologyFilter"
  OvsDpdkDriverType: "vfio-pci"


  #ComputeOvsDpdkParameters:
    ## Host configuration Parameters
    #TunedProfileName: "cpu-partitioning"
    #IsolCpusList: ""               # Logical CPUs list to be isolated from the host process (applied via cpu-partitioning tuned).
                                    # It is mandatory to provide isolated cpus for tuned to achive optimal performance.
                                    # Example: "3-8,12-15,18"
    #KernelArgs: ""                 # Space separated kernel args to configure hugepage and IOMMU.
                                    # Deploying DPDK requires enabling hugepages for the overcloud compute nodes.
                                    # It also requires enabling IOMMU when using the VFIO (vfio-pci) OvsDpdkDriverType.
                                    # This should be done by configuring parameters via host-config-and-reboot.yaml environment file.

    ## Attempting to deploy DPDK without appropriate values for the below parameters may lead to unstable deployments
    ## due to CPU contention of DPDK PMD threads.
    ## It is highly recommended to to enable isolcpus (via KernelArgs) on compute overcloud nodes and set the following parameters:
    #OvsDpdkSocketMemory: ""       # Sets the amount of hugepage memory to assign per NUMA node.
                                   # It is recommended to use the socket closest to the PCIe slot used for the
                                   # desired DPDK NIC.  Format should be comma separated per socket string such as:
                                   # "<socket 0 mem MB>,<socket 1 mem MB>", for example: "1024,0".
    #OvsPmdCoreList: ""            # List or range of CPU cores for PMD threads to be pinned to.  Note, NIC
                                   # location to cores on socket, number of hyper-threaded logical cores, and
                                   # desired number of PMD threads can all play a role in configuring this setting.
                                   # These cores should be on the same socket where OvsDpdkSocketMemory is assigned.
                                   # If using hyperthreading then specify both logical cores that would equal the
                                   # physical core.  Also, specifying more than one core will trigger multiple PMD
                                   # threads to be spawned, which may improve dataplane performance.
    #NovaVcpuPinSet: ""            # Cores to pin Nova instances to.  For maximum performance, select cores
                                   # on the same NUMA node(s) selected for previous settings.
    #NumDpdkInterfaceRxQueues: 1


Note that I added that new line there, which maps ComputeNeutronOvsAgent to the neutron-ovs-dpdk-agent service.
The issue was that the service was mapped to the standard OVS agent, not to the OVS-DPDK one.

Ziv, can you modify your template and run FFU with that? This needs to be in place before the FFU prepare step.
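
After re-running FFU with this mapping, a quick check (my suggestion) from inside the neutron_ovs_agent container, using the same config path as in comment 12, that the DPDK settings were actually rendered:

grep -E 'datapath_type|vhostuser_socket_dir' /var/lib/kolla/config_files/src/etc/neutron/plugins/ml2/openvswitch_agent.ini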

Comment 17 Ziv 2018-07-22 20:21:44 UTC
As per your request, I added the definition below to my THTs and re-ran FFU end-to-end:

------------------
resource_registry:
  OS::TripleO::Services::ComputeNeutronOvsAgent: /usr/share/openstack-tripleo-heat-templates/docker/services/neutron-ovs-dpdk-agent.yaml
------------------

I checked the DPDK bond configuration; it no longer shows errors after the FFU:

        Port "dpdkbond0"
            Interface "dpdk0"
                type: dpdk
                options: {dpdk-devargs="0000:82:00.0", n_rxq="2"}
            Interface "dpdk1"
                type: dpdk
                options: {dpdk-devargs="0000:82:00.1", n_rxq="2"}

Immediately after the FFU I tried to boot up a DPDK instance, but the instance stayed in spawning/build state for a long period of time.
So after a failed attempt to delete it, I decided to reboot the overcloud nodes, just to make sure that everything was in place.

A new problem came up: the controller VMs (in virsh) freeze/get stuck with a black screen and eventually, instead of rebooting, they power off after 10 minutes or so. Checking the OpenStack status shows they are 'ACTIVE'.
I powered them on manually and repeated the overcloud reboot; exactly the same issue happened again.

Please, any idea what could be causing this?

Comment 18 Yolanda Robla 2018-07-23 09:28:55 UTC
I can see this in the logs:

2018-07-23 09:26:03.812 19840 INFO neutron.agent.securitygroups_rpc [req-5c81362f-c289-47b9-b2c6-1027548f778f - - - - -] Refresh firewall rules
2018-07-23 09:26:03.843 19840 INFO neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-5c81362f-c289-47b9-b2c6-1027548f778f - - - - -] Configuration for devices up [] and devices down [] completed.
2018-07-23 09:26:07.812 19840 WARNING neutron.agent.rpc [req-5c81362f-c289-47b9-b2c6-1027548f778f - - - - -] Device Port(admin_state_up=True,allowed_address_pairs=[],binding=PortBinding,binding_levels=[],created_at=2018-07-23T09:26:01Z,data_plane_status=<?>,description='',device_id='869e8699-aa23-40f0-bad7-5a94a72ec7f9',device_owner='compute:nova',dhcp_options=[],distributed_binding=None,dns=None,fixed_ips=[IPAllocation],id=a64b01e7-ff4e-4c62-a00e-ec6917e258c6,mac_address=fa:16:3e:a3:60:ac,name='',network_id=bf95a4f1-58e3-4c84-bb7a-4ec3122fec7e,project_id='6f2cc2c3c52b4137aea7cab531a34336',qos_policy_id=None,revision_number=8,security=PortSecurity(a64b01e7-ff4e-4c62-a00e-ec6917e258c6),security_group_ids=set([f878e7de-6a57-4ae1-919c-3add9bf0504a]),status='DOWN',updated_at=2018-07-23T09:26:04Z) is not bound.
2018-07-23 09:26:07.814 19840 WARNING neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-5c81362f-c289-47b9-b2c6-1027548f778f - - - - -] Device a64b01e7-ff4e-4c62-a00e-ec6917e258c6 not defined on plugin or binding failed
2018-07-23 09:26:07.817 19840 INFO neutron.agent.securitygroups_rpc [req-5c81362f-c289-47b9-b2c6-1027548f778f - - - - -] Preparing filters for devices set([u'a64b01e7-ff4e-4c62-a00e-ec6917e258c6'])
2018-07-23 09:26:07.856 19840 INFO neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-5c81362f-c289-47b9-b2c6-1027548f778f - - - - -] Configuration for devices up [] and devices down [] completed.
2018-07-23 09:26:39.138 19840 ERROR oslo.messaging._drivers.impl_rabbit [-] Failed to process message ... skipping it.: DuplicateMessageError: Found duplicate message(72a4883df5d042628b0a76008082a55a). Skipping it.
2018-07-23 09:26:39.138 19840 ERROR oslo.messaging._drivers.impl_rabbit Traceback (most recent call last):
2018-07-23 09:26:39.138 19840 ERROR oslo.messaging._drivers.impl_rabbit   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/impl_rabbit.py", line 368, in _callback
2018-07-23 09:26:39.138 19840 ERROR oslo.messaging._drivers.impl_rabbit     self.callback(RabbitMessage(message))
2018-07-23 09:26:39.138 19840 ERROR oslo.messaging._drivers.impl_rabbit   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 244, in __call__
2018-07-23 09:26:39.138 19840 ERROR oslo.messaging._drivers.impl_rabbit     unique_id = self.msg_id_cache.check_duplicate_message(message)
2018-07-23 09:26:39.138 19840 ERROR oslo.messaging._drivers.impl_rabbit   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqp.py", line 121, in check_duplicate_message
2018-07-23 09:26:39.138 19840 ERROR oslo.messaging._drivers.impl_rabbit     raise rpc_common.DuplicateMessageError(msg_id=msg_id)
2018-07-23 09:26:39.138 19840 ERROR oslo.messaging._drivers.impl_rabbit DuplicateMessageError: Found duplicate message(72a4883df5d042628b0a76008082a55a). Skipping it.
2018-07-23 09:26:39.138 19840 ERROR oslo.messaging._drivers.impl_rabbit 
2018-07-23 09:26:59.271 19840 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [-] Failed reporting state!: MessagingTimeout: Timed out waiting for a reply to message ID f876992db6ac4c3d897f6dc871abf6af
2018-07-23 09:26:59.271 19840 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent Traceback (most recent call last):
2018-07-23 09:26:59.271 19840 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent   File "/usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py", line 320, in _report_state
2018-07-23 09:26:59.271 19840 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent     True)
2018-07-23 09:26:59.271 19840 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent   File "/usr/lib/python2.7/site-packages/neutron/agent/rpc.py", line 93, in report_state
2018-07-23 09:26:59.271 19840 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent     return method(context, 'report_state', **kwargs)
2018-07-23 09:26:59.271 19840 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent   File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/client.py", line 174, in call
2018-07-23 09:26:59.271 19840 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent     retry=self.retry)
2018-07-23 09:26:59.271 19840 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent   File "/usr/lib/python2.7/site-packages/oslo_messaging/transport.py", line 131, in _send
2018-07-23 09:26:59.271 19840 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent     timeout=timeout, retry=retry)
2018-07-23 09:26:59.271 19840 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 559, in send
2018-07-23 09:26:59.271 19840 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent     retry=retry)
2018-07-23 09:26:59.271 19840 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 548, in _send
2018-07-23 09:26:59.271 19840 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent     result = self._waiter.wait(msg_id, timeout)
2018-07-23 09:26:59.271 19840 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 440, in wait
2018-07-23 09:26:59.271 19840 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent     message = self.waiters.get(msg_id, timeout=timeout)
2018-07-23 09:26:59.271 19840 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 328, in get
2018-07-23 09:26:59.271 19840 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent     'to message ID %s' % msg_id)
2018-07-23 09:26:59.271 19840 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent MessagingTimeout: Timed out waiting for a reply to message ID f876992db6ac4c3d897f6dc871abf6af
2018-07-23 09:26:59.271 19840 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent 
2018-07-23 09:26:59.272 19840 WARNING oslo.service.loopingcall [-] Function 'neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent.OVSNeutronAgent._report_state' run outlasted interval by 30.01 sec

Comment 19 Yolanda Robla 2018-07-23 09:32:26 UTC
I could also find issues when trying to restart the controllers from the control plane. It seems they could not be restarted gracefully; they were showing issues with rabbit/galera/haproxy, etc. These are the logs: http://pastebin.test.redhat.com/621684

Comment 22 Yolanda Robla 2018-07-31 11:31:13 UTC
The problem with the locked VMs was an incorrect mapping of VhostuserSocketGroup. It was mapped as:

parameter_defaults:
  ComputeOvsDpdkParameters:
    VhostuserSocketGroup: "hugetlbfs"

But that role is not used in this deployment. It needs to be:

parameter_defaults:
  ComputeParameters:
    VhostuserSocketGroup: "hugetlbfs"

since the default Compute role is the one being used in this deployment.
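
A quick way (my suggestion) to confirm the parameter took effect on the compute node is to check the group ownership of the vhost-user socket directory configured in comment 16:

ls -ld /var/lib/vhost_sockets
# The group should be 'hugetlbfs' so qemu and the OVS-DPDK vhost-user sockets can be shared.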

Comment 23 Ziv 2018-08-12 03:42:45 UTC
(In reply to Yolanda Robla from comment #22)
> The problem with locked vms, was a problem with incorrect mapping of
> VhostuserSocketGroup . It was mapped as:
> 
> parameter_defaults:
>   ComputeOvsDpdkParameters:
>     VhostuserSocketGroup: "hugetlbfs"
> 
> But this role was not used. This needs to be:
> 
> parameter_defaults:
>   ComputeParameters:
>     VhostuserSocketGroup: "hugetlbfs"
> 
> As this is the default role being used on this deployment.


The solution above has been verified.

Thanks,
Ziv

Comment 24 Joanne O'Flynn 2018-08-15 07:39:28 UTC
This bug is marked for inclusion in the errata but does not currently contain draft documentation text. To ensure the timely release of this advisory please provide draft documentation text for this bug as soon as possible.

If you do not think this bug requires errata documentation, set the requires_doc_text flag to "-".


To add draft documentation text:

* Select the documentation type from the "Doc Type" drop down field.

* A template will be provided in the "Doc Text" field based on the "Doc Type" value selected. Enter draft text in the "Doc Text" field.

Comment 26 errata-xmlrpc 2018-08-29 16:37:56 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:2574

