
Bug 108092

Summary: 2.4.20-20.7 has kernel stack trace deadlocks - gcc-2.96-113
Product: [Retired] Red Hat Linux
Reporter: Howard Owen <howen>
Component: kernel
Assignee: Dave Jones <davej>
Status: CLOSED ERRATA
QA Contact: Brian Brock <bbrock>
Severity: medium
Priority: high
Version: 7.3
CC: jason, k.georgiou, linuxcub, pfrields, riel
Keywords: Security
Hardware: i686
OS: Linux
Doc Type: Bug Fix
Last Closed: 2004-01-05 03:22:07 UTC
Bug Depends On: 87659
Attachments:
Script to deadlock kernels built with gcc-2.96-113 (flags: none)
Sample console messages when kernel deadlocks. (flags: none)

Description Howard Owen 2003-10-27 16:39:17 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686) Gecko/20030807 Galeon/1.3.5

Description of problem:
Kernels built with gcc-2.96-113 are vulnerable to a deadlock when the current
process has traversed 5 successive symlinks on an NFS file system, and the
kernel takes an IRQ. There being less than 1KiB remaining on the process's kernel
stack, the service routine prints a stack trace rather than servicing the
interrupt. If the system console is on a serial port, this results in more IRQs
from the UART, which results in a deadlock.
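
The low-stack condition described above can be modelled with a little arithmetic. The following is only a sketch in Python, not kernel code: the 8 KiB stack size is an assumption about i686 2.4 kernels, the 1 KiB threshold is the figure given in this report, and the function names and the `reserved` parameter are hypothetical.

```python
# Illustrative model only; this is not the kernel's implementation.
THREAD_SIZE = 8192     # assumed: 8 KiB per-process kernel stack on i686 2.4 kernels
WARN_THRESHOLD = 1024  # from the report: a trace is printed with under 1 KiB left


def stack_bytes_left(stack_pointer: int, reserved: int = 0) -> int:
    """Return the unused bytes below the stack pointer in its stack region.

    The stack grows downward, so the offset of the pointer inside its
    THREAD_SIZE-aligned region is the space still available. `reserved` is a
    hypothetical allowance for data kept at the bottom of that region.
    """
    return (stack_pointer & (THREAD_SIZE - 1)) - reserved


def irq_would_print_trace(stack_pointer: int, reserved: int = 0) -> bool:
    """Return True when an interrupt arriving now would find under 1 KiB left."""
    return stack_bytes_left(stack_pointer, reserved) < WARN_THRESHOLD
```

In this model, a stack pointer 512 bytes above the bottom of its region triggers the trace, while one 4096 bytes above does not. Each extra frame pushed by deeper symlink traversal moves the pointer closer to the threshold.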

Somehow, gcc-2.96-113 makes this condition far more likely. Kernels built with
gcc-2.96-112 or gcc3 do not exhibit the problem. The default rsize/wsize of 4KiB
also makes the problem less likely to occur. With rsize=wsize=16KiB, the problem
shows up reliably.

Version-Release number of selected component (if applicable):
kernel-2.4.20-20.7 

How reproducible:
Always

Steps to Reproduce:
1. On an NFS file system with rsize=wsize=16KiB, run the attached perl script
with --jobs=30 --net.
2. Start three to eight large transfers off the box using scp
3. Watch the serial console
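
The attached script is not reproduced in this report. As a rough illustration of the five-successive-symlinks condition only, here is a Python sketch that builds such a chain and counts the hops needed to resolve it. The file names and both function names are hypothetical, and the sketch does not generate load or touch the network; on its own it is not expected to trigger the deadlock.

```python
import os
import tempfile


def make_symlink_chain(base_dir: str, depth: int = 5) -> str:
    """Create a regular file and `depth` symlinks, each pointing at the previous one.

    Returns the path of the outermost link. All names are hypothetical.
    """
    target = os.path.join(base_dir, "target.txt")
    with open(target, "w") as handle:
        handle.write("payload\n")
    previous = target
    for index in range(depth):
        link = os.path.join(base_dir, f"link{index}")
        os.symlink(previous, link)
        previous = link
    return previous


def count_hops(path: str) -> int:
    """Count how many symlinks must be followed before reaching a non-link."""
    hops = 0
    while os.path.islink(path):
        # readlink may return a relative target, so resolve it against the link's directory
        path = os.path.join(os.path.dirname(path), os.readlink(path))
        hops += 1
    return hops
```

To try it against an NFS mount, pass a directory on that mount as `base_dir` (for example one created with `tempfile.mkdtemp(dir=...)`) and open the returned path; resolving it follows five links in succession.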
    

Actual Results:  Kernel stack trace messages appear on the serial console and
the system deadlocks

Expected Results:  Load should go up to above 30. The large copies should run to
completion. The system should stay up.

Additional info: Sample stack trace

Comment 1 Howard Owen 2003-10-27 16:44:04 UTC
Created attachment 95517 [details]
Script to deadlock kernels built with gcc-2.96-113

With cwd in an NFS mount with rsize=wsize=16KiB, run this script with --jobs=30
--net

Start 3-8 large file transfers off the box. (I use scp with a 500MiB file.)
Watch the serial console. The system will deadlock before the transfers
complete.

Comment 2 Howard Owen 2003-10-27 16:44:53 UTC
Created attachment 95518 [details]
Sample console messages when kernel deadlocks.

Comment 3 Howard Owen 2003-10-27 16:50:08 UTC
I placed the severity at "security" because this is essentially a
denial-of-service attack on the affected kernel. A local user with normal
privileges can deadlock the system.

Comment 4 Joshua Jensen 2003-12-23 19:21:10 UTC
I noticed that a new kernel errata,
https://rhn.redhat.com/errata/RHBA-2003-394.html, mentions this bug...
but it doesn't say why.  Does this mean that the kernel *wasn't*
compiled with gcc-2.96-113?  If so, what version does Red Hat recommend?

Comment 5 Howard Owen 2003-12-23 19:33:21 UTC
The new kernel was compiled with gcc-2.96-126, which is an unreleased
version. I haven't tested this kernel yet, but I will soon. Since they
call out this bug, I'm assuming the unreleased gcc addresses the issue.

But if you want to build your own kernel, the workaround of
downgrading to gcc-2.96-112 is the only solution I'm aware of. Perhaps
Red Hat will fix bug #87659 by releasing the updated gcc before 7.3
end-of-life next week. The fix for this bug is well over half a loaf,
however, since most installations won't be running custom kernels.

Comment 6 Erling Jacobsen 2003-12-29 22:10:11 UTC
I haven't found a gcc-2.96-124 myself, but I _did_ find a SRPM of
gcc-2.96-124 as an update to the 2.1 enterprise version of RHL.
One of the changes from -113 to -124 is apparently a fix for some
"excessive stack usage caused by the -fno-strict-aliasing patch".
Doesn't that sound relevant? I'm no expert, but I think it would
be interesting to take the relevant new patches from gcc-2.96-124
and stick them into gcc-2.96-113, rebuild gcc, and use that to rebuild
the kernel.


Comment 7 Howard Owen 2003-12-31 22:52:42 UTC
The patch you are apparently referring to,
gcc-strict-alias-optimization2.patch, seems to address the problem
when applied to the gcc-2.96-113 SRPM. At least, a variant of my crash
script doesn't crash a kernel built with the resulting gcc. The patch
applied cleanly and the gcc build went smoothly. However, I'm not in a
position to judge if this patch, applied in isolation, is a good
general fix for production systems.

One for Progeny, I guess.


Comment 8 Dave Jones 2004-01-05 04:16:49 UTC
*** Bug 91566 has been marked as a duplicate of this bug. ***