Bug 1694024 - pmd usage always shows zero in queue 1
Summary: pmd usage always shows zero in queue 1
Keywords:
Status: ASSIGNED
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openvswitch
Version: 10.0 (Newton)
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 10.0 (Newton)
Assignee: Eelco Chaudron
QA Contact: Roee Agiman
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-03-29 10:25 UTC by Chen
Modified: 2019-04-16 06:57 UTC
CC List: 10 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Target Upstream Version:


Attachments:

Description Chen 2019-03-29 10:25:29 UTC
Description of problem:

The pmd usage for queue 1 (on dpdk2/dpdk3) always shows zero. No dropped packets were observed during the test.

# ovs-appctl dpif-netdev/pmd-rxq-show
pmd thread numa_id 0 core_id 2:
        isolated : false
        port: vhu5083bcae-22    queue-id:  0    pmd usage:  0 %
pmd thread numa_id 0 core_id 3:
        isolated : false
        port: vhu11ecc7b8-19    queue-id:  0    pmd usage:  0 %
        port: vhu42c26989-db    queue-id:  0    pmd usage:  0 %
pmd thread numa_id 0 core_id 4:
        isolated : false
        port: vhu4f454486-6c    queue-id:  0    pmd usage:  0 %
pmd thread numa_id 0 core_id 5:
        isolated : false
        port: vhu97ffbd3f-89    queue-id:  0    pmd usage:  0 %
pmd thread numa_id 1 core_id 22:
        isolated : false
        port: dpdk1             queue-id:  0    pmd usage:  0 %
        port: vhu4fad327d-a6    queue-id:  0    pmd usage:  0 %
pmd thread numa_id 1 core_id 23:
        isolated : false
        port: vhu9142b504-3f    queue-id:  0    pmd usage: 12 %
        port: vhudb82304e-c6    queue-id:  0    pmd usage:  0 %
pmd thread numa_id 1 core_id 24:
        isolated : false
        port: vhubb74b1ff-3b    queue-id:  0    pmd usage:  0 %
        port: vhud3323c04-d5    queue-id:  0    pmd usage: 13 %
pmd thread numa_id 1 core_id 25:
        isolated : false
        port: vhu5e90c3f3-f0    queue-id:  0    pmd usage:  0 %
        port: vhuac48a9ef-cb    queue-id:  0    pmd usage: 12 %
pmd thread numa_id 0 core_id 42:
        isolated : false
        port: vhue900baee-1b    queue-id:  0    pmd usage:  0 %
pmd thread numa_id 0 core_id 43:
        isolated : false
        port: vhuc39cb667-e2    queue-id:  0    pmd usage:  0 %
pmd thread numa_id 0 core_id 44:
        isolated : false
        port: vhu596793e6-ca    queue-id:  0    pmd usage:  0 %
pmd thread numa_id 0 core_id 45:
        isolated : false
        port: vhufdd288de-20    queue-id:  0    pmd usage:  0 %
pmd thread numa_id 1 core_id 62:
        isolated : false
        port: dpdk3             queue-id:  0    pmd usage: 18 %
        port: vhufa8158a6-64    queue-id:  0    pmd usage:  0 %
pmd thread numa_id 1 core_id 63:
        isolated : false
        port: dpdk2             queue-id:  1    pmd usage:  0 %
        port: dpdk3             queue-id:  1    pmd usage:  0 %
pmd thread numa_id 1 core_id 64:
        isolated : false
        port: dpdk2             queue-id:  0    pmd usage: 22 %
        port: vhu17fc1f28-b7    queue-id:  0    pmd usage:  0 %
pmd thread numa_id 1 core_id 65:
        isolated : false
        port: dpdk0             queue-id:  0    pmd usage:  0 %

Version-Release number of selected component (if applicable):

OSP10
ovs-dpdk bonding
trunk bridge

How reproducible:

100% at customer site

Steps to Reproduce:
1. ovs-vsctl set Interface dpdk2 options:n_rxq=2
   ovs-vsctl set Interface dpdk3 options:n_rxq=2
2. Create VNF
3. Send traffic to dpdkbond1
4. The pmd usage for dpdk2/dpdk3 queue-id 1 always shows zero (a quick check of the queue setup is sketched below)
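
A quick way to confirm that the second rx queue was actually requested on both interfaces (a minimal check, assuming the interface names above; each command should print the configured n_rxq value):

# ovs-vsctl get Interface dpdk2 options:n_rxq
# ovs-vsctl get Interface dpdk3 options:n_rxq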

Actual results:


Expected results:


Additional info:

Comment 4 Eelco Chaudron 2019-03-29 16:07:04 UTC
Can we get the traffic pattern being sent? This is normally related to the traffic being sent in, not to OVS-DPDK itself.
You can probably get a trace with ovs-tcpdump to see what traffic is received.

On top of this, can you also get an SOS report while traffic is running, so I have an idea of what flows are programmed, etc.
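
For reference, the capture and the report could be collected roughly like this (a sketch only; the port name dpdkbond1 and the output path are assumptions):

# ovs-tcpdump -i dpdkbond1 -w /tmp/dpdkbond1.pcap
# sosreport --batch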

Comment 13 Eelco Chaudron 2019-04-02 09:52:36 UTC
(In reply to Chen from comment #10)
> Hi Eelco,
> 
> The customer has uploaded the tcpdump and sosreport. They are available at
> collab-shell.usersys.redhat.com:/case/02339229. You should be able to login
> the collab-shell by using krb5 id.
> 
> tcpdump: dpdkbond1.capture
> sosreport: sosreport-20190402-093246
> 
> I think that they got enough l3 addresses combination but I found that they
> were all using UDP and the port was never changed (udp port 2152). Could
> that be related ?

Thanks for the trace; however, it's a partial trace, so I cannot replay it in my lab.
Can you take the trace again, but now capture the full packet?

Also, looking at the number of sessions (+/- 800) and the total number of keys, this trace was taken without one of the DPDK ports disabled. So when you capture the packets again, can you make sure one of the DPDK ports is disabled?
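
Something along these lines should do it (a sketch; "-s 0" captures the full packet, and netdev-dpdk/set-admin-state is one way to take a bond member out of service, assuming this OVS build supports it; port names are examples):

# ovs-appctl netdev-dpdk/set-admin-state dpdk3 down
# ovs-tcpdump -i dpdkbond1 -s 0 -w /tmp/dpdkbond1-full.pcap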


Having about 800 sessions might be enough to distribute the load, or they are just unlucky. If I have the full trace, I'll try it in my lab. The UDP ports being the same should not matter too much, as the RSS hash is calculated over the IPs and L4 ports.
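
As a rough sanity check of the flow entropy in the trace, something like this counts the distinct src/dst IP:port combinations (an approximation only; the awk field positions assume standard tcpdump output for untagged IPv4/UDP):

$ tcpdump -nn -r dpdkbond1.capture udp 2>/dev/null | awk '{print $3, $5}' | sort -u | wc -l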


I also verified the rest of their OVS configuration, and it seems just fine. The OVS log is rather short; if this is a test environment, can they restart OVS just before the test and send me the full OVS log? I just want to verify that no odd DPDK initialization or other configuration issues show up in there.

Finally, they are running a rather old build of OVS 2.9 (-19 vs -101), so it might be good to upgrade as well, to avoid debugging any already-fixed issues: http://download-node-02.eng.bos.redhat.com/brewroot/packages/openvswitch/2.9.0/101.el7fdp/
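
If they download that build, the upgrade itself should be straightforward (a sketch; the exact package set installed on the compute node may differ):

# yum upgrade ./openvswitch-2.9.0-101.el7fdp.x86_64.rpm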


One additional experiment they could do is remove the bond and just add dpdk2 or dpdk3 directly, to see if it's related to bonding; that would help us narrow down the problem area.
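
A possible way to run that experiment (a sketch only; the bridge name br-link is an assumption, and <pci-address> must be replaced with the port's real PCI address):

# ovs-vsctl del-port br-link dpdkbond1
# ovs-vsctl add-port br-link dpdk2 -- set Interface dpdk2 type=dpdk options:dpdk-devargs=<pci-address> options:n_rxq=2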

Comment 26 Eelco Chaudron 2019-04-03 13:07:32 UTC
Nothing from the customer for now; I just want to confirm that it's working on my XL710 setup.

The only real thing that would stop this from working is the NIC itself, and the XL710 had several bug fixes in its firmware.

For the physical NICs you only need to specify multi-queue on the interface (options:n_rxq=2), which is what has been done; otherwise only one queue will be configured.
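
One way to double-check that both queues were actually configured on the physical ports (a sketch; for the userspace datapath, dpif/show should report the requested and configured rx queues for each DPDK port):

# ovs-appctl dpif/show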

Comment 27 Eelco Chaudron 2019-04-03 15:04:35 UTC
I've been replicating this in my lab with the same OVS version (openvswitch-2.9.0-101.el7fdp.x86_64).

$ ovs-vsctl show
7c97bbc8-f541-4538-a2f0-ffe8326954f1
    Bridge "ovs_pvp_br0"
        Port "ovs_pvp_br0"
            Interface "ovs_pvp_br0"
                type: internal
        Port "dpdkbond0"
            Interface "dpdk1"
                type: dpdk
                options: {dpdk-devargs="0000:05:00.1", n_rxq="2"}
            Interface "dpdk0"
                type: dpdk
                options: {dpdk-devargs="0000:05:00.0", n_rxq="2"}
    ovs_version: "2.9.0"

I'm using the trace captured on the customer's network:


$ tcpreplay -i p6p1 dpdkbond1.cap 
Actual: 702303 packets (401480442 bytes) sent in 7.85 seconds
Rated: 51102153.6 Bps, 408.81 Mbps, 89392.13 pps
Statistics for network device: p6p1
	Successful packets:        702303
	Failed packets:            0
	Truncated packets:         0
	Retried packets (ENOBUFS): 0
	Retried packets (EAGAIN):  0


[netdev64:~]$ ovs-appctl dpif-netdev/pmd-stats-show
pmd thread numa_id 0 core_id 1:
	packets received: 352219
	packet recirculations: 0
	avg. datapath passes per packet: 1.00
	emc hits: 135493
	smc hits: 0
	megaflow hits: 216672
	avg. subtable lookups per megaflow hit: 1.00
	miss with success upcall: 54
	miss with failed upcall: 0
	avg. packets per output batch: 1.37
	idle cycles: 509064355059 (99.91%)
	processing cycles: 452783762 (0.09%)
	avg cycles per packet: 1446591.86 (509517138821/352219)
	avg processing cycles per packet: 1285.52 (452783762/352219)
pmd thread numa_id 0 core_id 15:
	packets received: 350084
	packet recirculations: 0
	avg. datapath passes per packet: 1.00
	emc hits: 136699
	smc hits: 0
	megaflow hits: 213370
	avg. subtable lookups per megaflow hit: 1.00
	miss with success upcall: 15
	miss with failed upcall: 0
	avg. packets per output batch: 1.46
	idle cycles: 509064953149 (99.91%)
	processing cycles: 452173007 (0.09%)
	avg cycles per packet: 1455413.92 (509517126156/350084)
	avg processing cycles per packet: 1291.61 (452173007/350084)


$ ovs-appctl dpif-netdev/pmd-rxq-show
pmd thread numa_id 0 core_id 1:
	isolated : false
	port: dpdk0           	queue-id:  1	pmd usage:  0 %
	port: dpdk1           	queue-id:  0	pmd usage:  0 %
pmd thread numa_id 0 core_id 15:
	isolated : false
	port: dpdk0           	queue-id:  0	pmd usage:  0 %
	port: dpdk1           	queue-id:  1	pmd usage:  0 %


The only thing that I think could be the problem is the firmware of their NIC; mine is:

$ ethtool -i p5p1
driver: i40e
version: 2.3.2-k
firmware-version: 4.53 0x800020a2 0.0.0
expansion-rom-version: 
bus-info: 0000:05:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes


Next steps would be to verify that the customer is using the latest version, and secondly, maybe they can try the same simple setup in their lab, replaying the trace. In addition, you should also be able to replicate my setup to show them that we do not see the issue.
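
For that verification, a quick way to record both the OVS build and the NIC firmware on the customer's node (a sketch; the interface name is an example and must match the NIC in question):

# rpm -q openvswitch
# ethtool -i ens1f1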

Comment 33 Chen 2019-04-15 12:57:17 UTC
Hi Eelco,

The customer has upgraded the firmware but the issue still persists.

[root@overcloud-compute6138-0 Linux_x64]# ethtool -i ens1f1
driver: i40e
version: 2.1.14-k
firmware-version: 6.80 0x80003cf2 1.1313.0
expansion-rom-version:
bus-info: 0000:3b:00.1
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes

I am asking the customer whether the issue still happens after removing the bond. Did we have any luck with the XL710 dpdkbond lab?

Best Regards,
Chen

Comment 34 Eelco Chaudron 2019-04-15 13:22:50 UTC
(In reply to Chen from comment #33)
> Hi Eelco,
> 
> The customer has upgraded the firmware but the issue still persists.
> 
> [root@overcloud-compute6138-0 Linux_x64]# ethtool -i ens1f1
> driver: i40e
> version: 2.1.14-k
> firmware-version: 6.80 0x80003cf2 1.1313.0
> expansion-rom-version:
> bus-info: 0000:3b:00.1
> supports-statistics: yes
> supports-test: yes
> supports-eeprom-access: yes
> supports-register-dump: yes
> supports-priv-flags: yes

That's odd, as all the queue decisions are made by the NIC...

> Asking the customer whether the issue happens after removing the bond. Do we
> have any luck with XL710 dpdkbond lab ?

See comments 27 and 28: both Gowrishankar and I were able to replicate the setup with the trace provided, and the problem did not occur. This was a bonding setup, just like the customer's.

So what I would ask is that they replicate our experiment and see whether or not they hit the issue on a similar piece of hardware. See my request in comment #30.

"Secondly, do they see this issue only on this system or also on others?
Can they replicate this on a similar machine replaying the trace (verify the NIC firmware is the same)?"

