|Summary:||Server reboots on initial bringup of e1000 after power up occasionally|
|Product:||Red Hat Enterprise Linux 3||Reporter:||David Knierim <new_galoot>|
|Component:||kernel||Assignee:||John W. Linville <linville>|
|Status:||CLOSED WONTFIX||QA Contact:||Brian Brock <bbrock>|
|Fixed In Version:||Doc Type:||Bug Fix|
|Doc Text:||Story Points:||---|
|Last Closed:||2006-05-17 14:12:59 UTC||Type:||---|
Description David Knierim 2005-05-09 13:45:33 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.7) Gecko/20050416 Fedora/1.0.3-1.3.1 Firefox/1.0.3

Description of problem:
I have a number of servers with multiple e1000 interfaces (4+). The servers are dual-processor 3.2 GHz Xeon boxes with 2 - 8 GB of DRAM. Occasionally, when a box is booting up after being powered off, the server will reboot when it tries to bring up an Ethernet interface. After it reboots, the server never has any trouble with any interface that had already been brought up, or with the interface that caused the reboot, but it may reboot again on other interfaces that have not been accessed previously. I have seen a server with 8 interfaces reboot twice due to this issue. Once all interfaces have been brought up, the problem does not occur again until the box is power cycled (warm reboots do not cause any issues).

Yes, this is really strange, but I have observed it many times on many boxes.

In an effort to help solve this issue, I installed Intel's driver from their web site (version 5.7.6) and, if anything, the problem is more severe.

David

Version-Release number of selected component (if applicable):
2.4.21-27.0.2.ELsmp

How reproducible:
Sometimes

Steps to Reproduce:
1. Configure a dual 3.2 GHz Xeon box with lots of e1000 interfaces
2. Set all interfaces to get IP addresses (I have been using DHCP)
3. Power the box off and on again.
4. Observe the console to see if the box reboots during boot up.
5. Repeat steps 3 and 4.

Actual Results:
Sometimes, the server will reboot while bringing up Ethernet interfaces.

Additional info:
Comment 1 John W. Linville 2005-05-10 17:55:59 UTC
I have test kernels w/ a later e1000 driver available here: http://people.redhat.com/linville/kernels/rhel3/ Please give them a try to see if the problem has already been corrected. Please post the results here. Thanks!
Comment 2 John W. Linville 2005-06-27 14:34:09 UTC
Closed due to lack of response...please reopen if/when requested information becomes available...
Comment 3 David Knierim 2005-07-07 14:18:28 UTC
I finally got a chance to retest this. It failed booting up after power cycle running Red Hat test kernel (kernel-smp-2.4.21-32.9.EL.jwltest.35.i686.rpm):

Bringing up interface eth8: C<P0>UK 1e:rn Melac phainnei c:Ch Uencka bElex cteo p ticoonnt: i0nu00e0 00In0 0i00d0le0 00t4as -CP nUo t0 :s yMancchinigne Ch e

It failed the second time I booted the box up (with a power cycle before each boot). The box in question has 16 e1000 interfaces.
Comment 4 John W. Linville 2005-07-13 15:11:03 UTC
Hmmm, not a lot to go on...just curious, is this the same box as in bug 154680? Please attach the output of running "sysreport" on the box...thanks!
Comment 5 John W. Linville 2005-07-13 16:31:58 UTC
Comment 6 John W. Linville 2005-07-13 20:18:22 UTC
The problem described in the link from comment 5 sounds to me like it _might_ be related. I have hacked-up a patch that I think _might_ help...wanna try it? Test kernels available at the same location as in comment 1. Please give them a try to see if you can reproduce the issue with them, and post the results here...thanks!
Comment 7 John W. Linville 2005-07-13 20:20:19 UTC
Created attachment 116720 [details] jwltest-e1000-alloc_rx_buf.patch
Comment 8 David Knierim 2005-07-15 13:57:46 UTC
The box in question is from the same family as the box I reported bug 154680 against. It's physically a different box, but it has exactly the same hardware configuration except that it only has 2 GB of DRAM.

I just tried to boot it up running this kernel:

kernel-smp-2.4.21-32.10.EL.jwltest.37.i686.rpm

I have seen several issues when booting this kernel. Several times the box hung when starting up the first interface. Several times it hung running kudzu. One time it spewed forth loads of shell code on the screen and failed to configure the interface. After doing this, the server rebooted a minute or so later. Sorry, but I don't have a capture of the output from this failure.

When I went back to the latest errata kernel, the box booted just fine and all interfaces came up.
Comment 9 John W. Linville 2005-07-15 14:02:55 UTC
Hmmm...well, that sucks...don't use those kernels... :-) Thanks for the testing, anyway! I'll have to get back to you...
Comment 10 David Knierim 2005-07-15 14:08:34 UTC
Created attachment 116799 [details] Output of sysreport

sysreport output from server in question.
Comment 11 John W. Linville 2005-08-24 14:32:46 UTC
Do the "acpi=off" or "acpi=noirq" kernel command line parameters have any effect on this issue?
Comment 12 David Knierim 2005-09-08 21:58:57 UTC
I just did some testing in this area and here are my results:

Initially I configured a box with 16 e1000 ports. One was configured for DHCP and the rest were static. The static ports did not have a network attached.

Test         pass  fail
acpi=off        5     0
acpi=noirq      3     1

I then decided to make the test a bit more realistic. I configured all ports for DHCP and attached 15 of them to a hub with a DHCP server and the remaining port went to our main network, which has a different DHCP server.

Test         pass  fail
acpi=off       10     0
acpi=noirq      7     3
std acpi       10     1

"std acpi" means there is no entry for acpi on the kernel boot line.

I will be doing more testing with acpi=off to see if it actually clears up the problem.
Comment 13 David Knierim 2005-09-09 14:39:08 UTC
I ran an automated test overnight with acpi=off. Out of 111 attempts, 91 passed. I will run additional testing today.
Comment 14 David Knierim 2005-09-09 19:11:02 UTC
Running with no settings for acpi on the kernel line gave the following results: 33 attempts, 27 passes. I will be out of the office next week. I will not be doing further testing until I return.
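For a quick side-by-side of the two automated runs, the counts from comments 13 and 14 can be turned into failure rates. This is only a minimal sketch of the arithmetic; the helper name `failure_rate` and the run labels are made up for illustration and are not part of any test tooling used in this bug.

```python
def failure_rate(attempts, passes):
    """Return the fraction of attempts that did not pass."""
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    return (attempts - passes) / attempts


# (attempts, passes) as reported in comments 13 and 14 of this bug.
# The labels are illustrative only.
runs = {
    "acpi=off": (111, 91),
    "no acpi option": (33, 27),
}

for label, (attempts, passes) in runs.items():
    rate = failure_rate(attempts, passes)
    print(f"{label}: {attempts - passes}/{attempts} failed ({rate:.1%})")

# prints:
# acpi=off: 20/111 failed (18.0%)
# no acpi option: 6/33 failed (18.2%)
```

On these numbers the two configurations fail at nearly the same rate, so the earlier small-sample result (0 failures with acpi=off) does not appear to hold up over a longer run.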
Comment 15 John W. Linville 2005-09-23 14:57:25 UTC
Is there an oops or panic before the box reboots? Can you do anything to capture that oops (perhaps using netconsole)?
Comment 16 David Knierim 2005-10-07 13:05:09 UTC
I set up netconsole and attempted to capture anything further. Netconsole is working, but it does not capture anything with this failure.
Comment 17 David Knierim 2005-11-02 19:20:23 UTC
I just tried this using the latest driver from Intel (6.2.15). The problem exists there, too.

Determining IP information for eth1...C<P0U> 0K:e rnMaelch ipanne iCch:e Uckn aEbxlec eptoti coon:nt i0n00ue0 000I0n 00i0dl00e 04t 0 sk C-PU no1:t Msaynchciinnge Ch ec
Comment 18 David Knierim 2005-11-17 16:41:42 UTC
I have observed something that might be useful. I have been testing using a box with 16 interfaces. One goes to our production network and the rest all go to a single 10/100 hub with a DHCP server attached. No matter in which order the interfaces are brought up, the failure always happens when bringing up the interface attached to the production net. Up until now, all interfaces have been configured to use DHCP. I will set the production interface to be static and see what happens.
Comment 19 John W. Linville 2005-12-12 14:57:55 UTC
Did that configuration change make any difference? Are you still confident that it is the "production" interface that always causes the problem?
Comment 20 David Knierim 2005-12-16 21:53:31 UTC
The problem only happens on the interface attached to the "production" network. I have tried both DHCP and a statically configured IP address. Both fail. In a few cases, the interface came up with a static address, but I then got the failure when I tried to ssh out that interface to a second server. I have also moved the network cable from the production network to different ports, but that did not make any difference.

Since I noticed this behavior, I have never seen any of the 15 interfaces attached to the test network fail. (The production interface is running 100 Mbps, full duplex, while the test network is 100 Mbps, half duplex.) When I get a chance to test this more, I can try to put some traffic on the test network to see if that makes any difference. Let me know if you have any other ideas.
Comment 21 John W. Linville 2005-12-16 22:07:30 UTC
It certainly would not be the first time I've seen drivers have problems with incoming traffic during initialization...I'll have to get back to you...
Comment 23 John W. Linville 2006-02-08 20:58:26 UTC
Big e1000 update available in the test kernels here: http://people.redhat.com/linville/kernels/rhel4/ Please give those a try and post the results here...thanks!
Comment 24 John W. Linville 2006-02-08 20:59:10 UTC
Belay that...RHEL3 kernels not ready yet...
Comment 25 John W. Linville 2006-02-09 14:58:17 UTC
Ok, now they are here: http://people.redhat.com/linville/kernels/rhel3/ Please give those a try and post the results...thanks!
Comment 26 David Knierim 2006-03-07 20:52:41 UTC
Sorry it has taken so long to retest this. Getting the test setup back together was a PITA.

The problem is still happening. The failure does not seem to be happening as often as before. I will retest to see if this observation is accurate. I am currently running with one interface attached to a production network. I am testing with a different server and different production network, now that I think about it, so the failure rate does not necessarily correlate.

In any case, I have a script that does the following:
- boot up the server
- run and log online test on 1 interface
- run and log offline test on 1 interface
- log before bringing up interface
- bring up interface
- log after bringing up interface
- log and run ssh <remote host> /bin/true
- log and shut down interface
- log and bring up interface again (a leftover from when the script did more interfaces)
- log and ping remote host
- ssh to remote host and initiate script to cause power cycle (with logging)
- shut down server (the previous script will fire it back up)

After running this for about 24 hours, the logs have 275 entries for each of the steps before bringing up the interface. There were 233 entries for the remaining steps. This indicates that out of 275 attempts, 233 succeeded and 42 failed.

I am retesting with an older kernel to see what rate the error is happening with this server and network.
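The pass/fail count in comment 26 comes from comparing how many log entries were written just before the interface was brought up with how many were written just after. A minimal sketch of that tally is below. The actual log format was never posted, so the marker strings `before ifup` and `after ifup` and the function name `summarize` are hypothetical stand-ins.

```python
def summarize(log_lines, before_marker="before ifup", after_marker="after ifup"):
    """Count attempts and successes from a boot-test log.

    An attempt is any line containing before_marker; a success is any
    line containing after_marker. Both markers are hypothetical, since
    the real log format is not shown in this bug.
    """
    attempts = 0
    successes = 0
    for line in log_lines:
        if before_marker in line:
            attempts += 1
        elif after_marker in line:
            successes += 1
    return attempts, successes, attempts - successes


# Synthetic log that reproduces the counts from comment 26:
# 233 boots logged both steps, 42 boots died before the "after" entry.
log = []
for boot in range(275):
    log.append(f"boot {boot}: before ifup")
    if boot < 233:
        log.append(f"boot {boot}: after ifup")

attempts, successes, failures = summarize(log)
print(f"{attempts} attempts, {successes} succeeded, {failures} failed")
# prints: 275 attempts, 233 succeeded, 42 failed
```

This only works because a machine check reboots the box before the "after" entry can be written, so a missing entry is the sole evidence of a failure.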
Comment 27 David Knierim 2006-03-08 13:18:10 UTC
I retested the same configuration, changing only the kernel to the one we normally use (2.4.21-32.0.1.ELsmp). The scripts attempted to bring up the Ethernet interface 29 times. The server hung twice and rebooted a third time without bringing up the interface. The test kernel never hung when attempting to bring up the interface. However, the failure rate does not appear to be improved by the new test kernel.
Comment 28 John W. Linville 2006-05-17 14:12:59 UTC
RHEL3 is now late enough in its life cycle that I don't think this will ever be fixed. I'm sorry.