Bug 233608

Summary: Native SCTP - traffic randomly stops on RedHat AS4 Update 4
Product: Red Hat Enterprise Linux 4
Reporter: William Reich <reich>
Component: lksctp-tools
Assignee: Neil Horman <nhorman>
Status: CLOSED CURRENTRELEASE
Severity: medium
Priority: medium    
Version: 4.4   
Target Milestone: ---   
Target Release: ---   
Hardware: i386   
OS: Linux   
Whiteboard:
Fixed In Version: 2.6.9-55.EL5
Doc Type: Bug Fix
Last Closed: 2007-07-18 18:45:35 UTC
Attachments:
Description                                           Flags
version information                                   none
compile script for pitcher & catcher                  none
catcher source code                                   none
pitcher source code                                   none
script used to launch 6 pitchers                      none
sysreport from a machine where the problem appeared.  none
picture of test configuration                         none
sys report from redhat 5 box                          none
output of catcher on RH 4 update 5 32 bit system      none

Description William Reich 2007-03-23 13:08:56 UTC
Description of problem:
While using the Linux-provided SCTP on a RedHat AS4 Update 4 system
(dual CPU, 32-bit OS),
we configure a "catcher" and 6 "pitchers" to run on the same box.

Randomly, one of the pitchers just stops passing data to the catcher.

Debug shows that the pitcher is stuck on a "send" operation, and
the corresponding "catcher" thread is stuck on a "recv" operation.
This implies that SCTP is stuck in some way.
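
One way to make a stall like this visible, rather than having the pitcher hang silently, is to put a timeout on the send. The sketch below is not taken from the attached pitcher source. It is a hypothetical Python illustration that uses a local `socket.socketpair()` as a stand-in for an SCTP association, and the name `send_until_stalled` and the 0.2-second timeout are assumptions.

```python
import socket


def send_until_stalled(sock, payload, timeout_s=0.2, max_sends=100000):
    """Send payload repeatedly; return the number of completed sends before a timeout.

    Returns None if max_sends completes without any send timing out.
    """
    sock.settimeout(timeout_s)
    for completed in range(max_sends):
        try:
            sock.sendall(payload)
        except socket.timeout:
            # The peer stopped draining data: this send blocked longer than timeout_s.
            return completed
    return None


if __name__ == "__main__":
    # Stand-in for one pitcher/catcher pair; the "catcher" end never reads.
    pitcher, catcher = socket.socketpair()
    stalled_after = send_until_stalled(pitcher, b"x" * 4096)
    print("send stalled after", stalled_after, "messages")
    pitcher.close()
    catcher.close()
```

A timeout only reports that the send is blocked. It does not say whether the cause is a slow reader or a problem in the protocol stack.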

Version-Release number of selected component (if applicable):

 

Connected to chaplin.ulticom.com.
Escape character is '^]'.
Red Hat Enterprise Linux AS release 4 (Nahant Update 4)
Kernel 2.6.9-42.0.10.ELsmp on an i686
Last login: Fri Mar 23 08:10:12 from blade1.ulticom.com


chaplin 96% uname -a
Linux chaplin 2.6.9-42.0.10.ELsmp #1 SMP Fri Feb 16 17:17:21 EST 2007 i686 i686 i386 GNU/Linux

chaplin 99% rpm -qa | grep -i sctp
lksctp-tools-1.0.2-6.4E.1
lksctp-tools-doc-1.0.2-6.4E.1
lksctp-tools-devel-1.0.2-6.4E.1

NOTE - This machine was specifically built using the
latest available versions as of 3/14/07.


How reproducible:


Steps to Reproduce:
1. Run the attached pitcher/catcher programs (see comment 6 for the exact commands).
  
Actual results:
At least one of the pitchers will stop passing traffic.
This is random. It may work one time, then fail the next time.

Expected results:
Traffic should pass on all pitchers.

Additional info:

1) This problem was recreated on a 2.6.9-34 kernel as well.
2)

               native sctp   Pitcher / Catcher  test


   +---------+
   | pitcher | ----------------------------\
   +---------+                              \
                                             \
                                              \
   +---------+                                 \
   | pitcher | -----------------------\         \
   +---------+                         \         \
                                        \         |
                                         -----+   |
   +---------+                                |   |
   | pitcher | ---------------------\         |   |
   +---------+                       \        v   v
                                      -->  +---------+
                                           | catcher |
                                      -->  +---------+
                                     /        ^   ^
   +---------+                      /         |   |
   | pitcher | --------------------/          |   |
   +---------+                                |   |
                                       -------+   |
                                      /          /
   +---------+                       /          /
   | pitcher | ---------------------/          /
   +---------+                                /
                                             /
                                            /
   +---------+                             /
   | pitcher | ---------------------------/
   +---------+

3) I could not find an appropriate component to file this bug under,
so I just guessed.

Comment 1 William Reich 2007-03-23 13:08:56 UTC
Created attachment 150748
version information

Comment 2 William Reich 2007-03-23 13:10:55 UTC
Created attachment 150749
compile script for pitcher & catcher

Comment 3 William Reich 2007-03-23 13:11:41 UTC
Created attachment 150750
catcher source code

Comment 4 William Reich 2007-03-23 13:12:23 UTC
Created attachment 150751
pitcher source code

Comment 5 William Reich 2007-03-23 13:13:16 UTC
Created attachment 150752
script used to launch 6 pitchers

Comment 6 William Reich 2007-03-23 13:15:59 UTC
To recreate the test,
use 2 xterm windows on the same machine.

In first window, launch the catcher...

./catcher &

In 2nd window, launch the 6 pitchers by using the script 'j'...

./j

Observe in the first window that a report is printed out for
every 1000 messages that are received by the catcher for each pitcher.
During the failure case, at least one of the reports will simply stop
appearing.
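
The failure signature described here (one pitcher's report stops appearing) can also be checked mechanically instead of by watching the window. The following is a minimal sketch and not part of the attached catcher: it assumes the time of each per-pitcher report is available, and the names `StallWatchdog` and `record_report` and the 10-second threshold are hypothetical.

```python
import time


class StallWatchdog:
    """Track when each pitcher last produced a 1000-message report."""

    def __init__(self, stall_after_s=10.0):
        self.stall_after_s = stall_after_s
        self.last_report = {}  # pitcher id -> time of most recent report

    def record_report(self, pitcher_id, now=None):
        """Note that a report for pitcher_id was just printed."""
        self.last_report[pitcher_id] = time.monotonic() if now is None else now

    def stalled(self, now=None):
        """Return the sorted ids of pitchers whose reports have stopped."""
        now = time.monotonic() if now is None else now
        return sorted(
            pid for pid, seen in self.last_report.items()
            if now - seen > self.stall_after_s
        )


if __name__ == "__main__":
    watchdog = StallWatchdog(stall_after_s=10.0)
    for pid in range(1, 7):
        watchdog.record_report(pid, now=0.0)
    # Pitchers 1-5 keep reporting; pitcher 6 goes quiet.
    for pid in range(1, 6):
        watchdog.record_report(pid, now=12.0)
    print(watchdog.stalled(now=15.0))  # [6]
```

The threshold has to be longer than the normal interval between reports, otherwise a slow but healthy pitcher is flagged as stalled.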

Comment 7 William Reich 2007-03-23 13:18:08 UTC
Created attachment 150753
sysreport from a machine where the problem appeared.

This is the output of the sysreport tool.

Comment 8 William Reich 2007-03-23 13:20:31 UTC
Please note that this problem has only been seen
when the pitchers & catcher are on the same box.


Comment 9 William Reich 2007-03-23 13:22:53 UTC
Comment on attachment 150748
version information

fixed description of attachment 150748

Comment 10 William Reich 2007-03-23 13:23:57 UTC
Created attachment 150754
picture of test configuration

Comment 11 William Reich 2007-03-23 14:57:44 UTC
Please note that the IP address is hardcoded
in the source code of the test programs.

Change this address as appropriate for your own testing.
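
A common alternative to editing a hardcoded address is to take it from the command line. The attached programs are not shown on this page, so the sketch below is only a hypothetical Python illustration: the option names, the default of 127.0.0.1, and port 5000 are assumptions. Actually opening the socket requires SCTP support in the running kernel.

```python
import argparse
import socket

# IANA protocol number for SCTP; used if this Python build lacks the constant.
IPPROTO_SCTP = getattr(socket, "IPPROTO_SCTP", 132)


def parse_endpoint(argv=None):
    """Return (host, port) from the command line instead of a hardcoded address."""
    parser = argparse.ArgumentParser(description="pitcher/catcher endpoint")
    parser.add_argument("--host", default="127.0.0.1", help="catcher IP address")
    parser.add_argument("--port", type=int, default=5000, help="catcher port")
    args = parser.parse_args(argv)
    return args.host, args.port


def connect_sctp(host, port):
    """Open a one-to-one style SCTP socket to the catcher (needs kernel SCTP support)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM, IPPROTO_SCTP)
    sock.connect((host, port))
    return sock
```

For example, `parse_endpoint(["--host", "192.0.2.10"])` returns `("192.0.2.10", 5000)`, which can then be passed to `connect_sctp`.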

Comment 12 Neil Horman 2007-06-08 14:59:59 UTC
Please try this against the latest U5 kernel.  I believe this is a problem that
you and I resolved in another bugzilla previously.  Thanks!

Comment 13 William Reich 2007-06-12 13:18:32 UTC
msg received...
I'm going on vacation, so I'll get to this
upon my return.
( estimated time of work is 6/20-6/27/07 )

Comment 14 Neil Horman 2007-06-12 14:36:49 UTC
Thank you!  If U5 doesn't fix it, let me know.  I have a patch being
reviewed upstream that completely rewrites how our receive buffer management
code works, and it should fix any problems that remain in U5 with receive drops/stalls.

Comment 15 William Reich 2007-06-21 13:12:53 UTC
The machine I was assigned for this does not
contain the correct OS.
I gotta get the correct OS installed.

Comment 16 Neil Horman 2007-06-21 16:47:31 UTC
ok, let me know when you do.

Comment 17 William Reich 2007-06-29 15:31:43 UTC
Created attachment 158214
sys report from redhat 5 box

This is the sysreport from the machine that I used
to test the fix.
Red Hat Enterprise Linux 5
Kernel 2.6.18-8
sctp packages 1.0.6-1.el5.1

Comment 18 William Reich 2007-06-29 15:33:54 UTC
I ran the test, and it passed.
( Details in comment 17. )

Comment 19 Neil Horman 2007-06-29 16:08:50 UTC
OK, so it works with a RHEL5 kernel. Does it work with a 4.5 kernel as well
(since this is a RHEL4 bug)?

Comment 20 William Reich 2007-06-29 16:58:32 UTC
I'm off to vacation.
I will investigate the week of July 9, 2007.

Comment 21 Neil Horman 2007-06-29 19:50:10 UTC
fine, let me know

Comment 22 William Reich 2007-07-18 17:54:39 UTC
I ran this test on a RH 4 Update 5 32 bit system.
The test passed.
It is curious however that the traffic across the 6 applications is
not even. Some applications pass up to twice as much traffic as others.
-- case closed -- Thanks.
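
The imbalance noted above can be quantified from the per-pitcher message counts. This is a minimal sketch: the function name `traffic_spread` and the sample counts are made up for illustration and are not taken from the attached catcher output.

```python
def traffic_spread(counts):
    """Return (busiest id, quietest id, max/min ratio) for per-pitcher message counts."""
    if not counts:
        raise ValueError("no pitchers reported")
    busiest = max(counts, key=counts.get)
    quietest = min(counts, key=counts.get)
    if counts[quietest] == 0:
        # A pitcher that passed nothing at all: report an unbounded ratio.
        return busiest, quietest, float("inf")
    return busiest, quietest, counts[busiest] / counts[quietest]


if __name__ == "__main__":
    # Hypothetical message totals for six pitchers.
    sample = {1: 40000, 2: 22000, 3: 31000, 4: 20000, 5: 38000, 6: 25000}
    busiest, quietest, ratio = traffic_spread(sample)
    print(f"pitcher {busiest} passed {ratio:.1f}x the traffic of pitcher {quietest}")
```

A ratio near 1.0 means the pitchers are sharing the catcher evenly. A ratio near 2.0 matches the "up to twice as much" observation.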

Comment 23 William Reich 2007-07-18 17:56:35 UTC
Created attachment 159542
output of catcher on RH 4 update 5 32 bit system

output of "catcher" program that shows uneven traffic.
not a failure... just a curious observation...

Comment 24 Neil Horman 2007-07-18 18:45:35 UTC
The uneven traffic quite likely has a good deal to do with receive buffer
limitations and the need for retransmits.  Hard to say though without more data.
I'm closing this as currentrelease.  If the problem persists after 4.7 is released
(I plan to try to backport my receive buffer changes to 4.6 or 4.7), feel free to
open a new bz for this issue.