Bug 548198 - Will not boot on a HP DL380G6 server; complains of "NMI received for unknown reason"
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 12
Hardware: All
OS: Linux
Priority: low
Severity: high
Target Milestone: ---
Assignee: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Duplicates: 593003
Depends On:
Blocks:
Reported: 2009-12-16 21:21 UTC by Ben Webb
Modified: 2010-08-08 19:15 UTC
CC List: 15 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2010-08-08 19:15:26 UTC



Description Ben Webb 2009-12-16 21:21:27 UTC
Description of problem:
After an upgrade to F12, booting fails on an HP DL380G6 server. The kernel complains "NMI received for unknown reason" but continues; however, after a short period the boot hangs.

Although the message suggests that this is a hardware problem, the same system works with no problems using the latest F11 kernel.

Version-Release number of selected component (if applicable):
kernel-2.6.31.6-166.fc12.x86_64


How reproducible:
Always.


Steps to Reproduce:
1. Install F12 on an x86_64 HP DL380G6 server.
2. Try to boot.
  
Actual results:
Boot messages look similar to:
uhci_hcd 0000:00:1d.3: PCI INT D -> GSI 23 (level, low) -> IRQ 23
uhci_hcd 0000:00:1d.3: UHCI Host Controller
uhci_hcd 0000:00:1d.3: new USB bus registered, assigned bus number 5
uhci_hcd 0000:00:1d.3: irq 23, io base 0x00001060
usb usb5: New USB device found, idVendor=1d6b, idProduct=0001
usb usb5: New USB device strings: Mfr=3, Product=2, SerialNumber=1
usb usb5: Product: UHCI Host Controller
usb usb5: Manufacturer: Linux 2.6.31.6-166.fc12.x86_64 uhci_hcd
usb usb5: SerialNumber: 0000:00:1d.3
usb usb5: configuration #1 chosen from 1 choice
hub 5-0:1.0: USB hub found
hub 5-0:1.0: 2 ports detected
uhci_hcd 0000:01:04.4: PCI INT B -> GSI 22 (level, low) -> IRQ 22
uhci_hcd 0000:01:04.4: UHCI Host Controller
uhci_hcd 0000:01:04.4: new USB bus registered, assigned bus number 6
uhci_hcd 0000:01:04.4: port count misdetected? forcing to 2 ports
uhci_hcd 0000:01:04.4: irq 22, io base 0x00003800
usb usb6: New USB device found, idVendor=1d6b, idProduct=0001
Uhhuh. NMI received for unknown reason a1 on CPU 0.
You have some hardware problem, likely on the PCI bus.
Dazed and confused, but trying to continue
usb usb6: New USB device strings: Mfr=3, Product=2, SerialNumber=1
usb usb6: Product: UHCI Host Controller
usb usb6: Manufacturer: Linux 2.6.31.6-166.fc12.x86_64 uhci_hcd
usb usb6: SerialNumber: 0000:01:04.4
usb usb6: configuration #1 chosen from 1 choice
hub 6-0:1.0: USB hub found
hub 6-0:1.0: 2 ports detected

Booting continues and then usually hangs after:
ioc0: LSISAS1068E B3: Capabilities={Initiator}
uhci_hcd 0000:01:04.4: Unlink after no-IRQ?  Controller is probably using the wrong IRQ.
scsi0 : ioc0: LSISAS1068E B3, FwRev=01172a00h, Ports=1, MaxQ=163, IRQ=30
mptsas: ioc0: attaching ssp device: fw_channel 0, fw_id 5, phy 4, sas_addr 0x500110a0008db4ba
scsi 0:0:0:0: Sequential-Access HP       Ultrium 4-SCSI   U28W PQ: 0 ANSI: 5
scsi 0:0:0:0: Attached scsi generic sg0 type 1
scsi 0:0:0:1: Medium Changer    HP       MSL G3 Series    6.80 PQ: 0 ANSI: 5
scsi 0:0:0:1: Attached scsi generic sg1 type 8
mptsas: ioc0: attaching ssp device: fw_channel 0, fw_id 6, phy 5, sas_addr 0x500110a0008db4bd
scsi 0:0:1:0: Sequential-Access HP       Ultrium 4-SCSI   U28W PQ: 0 ANSI: 5
scsi 0:0:1:0: Attached scsi generic sg2 type 1
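
The "a1" in the NMI message above is a hex reason byte. The following is a minimal sketch for pulling it out of a boot log and decoding it. It assumes the byte follows the legacy PC port 0x61 status layout (bit 7 = parity/SERR#, bit 6 = I/O channel check); that layout is an assumption here, not something stated in this report. The names `NMI_RE` and `decode_nmi_line` are illustrative only.

```python
import re

# Matches the kernel message quoted in this report; the reason byte is printed in hex.
NMI_RE = re.compile(r"NMI received for unknown reason ([0-9a-fA-F]{2}) on CPU (\d+)")


def decode_nmi_line(line):
    """Return (reason, cpu, flags) for an NMI message line, or None if it does not match."""
    m = NMI_RE.search(line)
    if m is None:
        return None
    reason = int(m.group(1), 16)
    cpu = int(m.group(2))
    flags = []
    # Assumption: legacy port 0x61 layout (bit 7 = parity/SERR#, bit 6 = IOCHK#).
    if reason & 0x80:
        flags.append("SERR")
    if reason & 0x40:
        flags.append("IOCHK")
    return reason, cpu, flags
```

Under that assumption, `decode_nmi_line("Uhhuh. NMI received for unknown reason a1 on CPU 0.")` reports the SERR bit as set, which would be consistent with the kernel's "likely on the PCI bus" hint.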



Expected results:
Boot succeeds. Using the F11 kernel (kernel-2.6.30.9-102.fc11.x86_64) the boot messages look similar to:

Dec 16 11:47:06 angklung kernel: uhci_hcd 0000:00:1d.3: PCI INT D -> GSI 23 (level, low) -> IRQ 23
Dec 16 11:47:06 angklung kernel: uhci_hcd 0000:00:1d.3: UHCI Host Controller
Dec 16 11:47:06 angklung kernel: uhci_hcd 0000:00:1d.3: new USB bus registered, assigned bus number 5
Dec 16 11:47:06 angklung kernel: uhci_hcd 0000:00:1d.3: irq 23, io base 0x00001060
Dec 16 11:47:06 angklung kernel: usb usb5: New USB device found, idVendor=1d6b, idProduct=0001
Dec 16 11:47:06 angklung kernel: usb usb5: New USB device strings: Mfr=3, Product=2, SerialNumber=1
Dec 16 11:47:06 angklung kernel: usb usb5: Product: UHCI Host Controller
Dec 16 11:47:06 angklung kernel: usb usb5: Manufacturer: Linux 2.6.30.9-102.fc11.x86_64 uhci_hcd
Dec 16 11:47:06 angklung kernel: usb usb5: SerialNumber: 0000:00:1d.3
Dec 16 11:47:06 angklung kernel: usb usb5: configuration #1 chosen from 1 choice
Dec 16 11:47:06 angklung kernel: hub 5-0:1.0: USB hub found
Dec 16 11:47:06 angklung kernel: hub 5-0:1.0: 2 ports detected
Dec 16 11:47:06 angklung kernel: uhci_hcd 0000:01:04.4: PCI INT B -> GSI 22 (level, low) -> IRQ 22
Dec 16 11:47:06 angklung kernel: uhci_hcd 0000:01:04.4: UHCI Host Controller
Dec 16 11:47:06 angklung kernel: uhci_hcd 0000:01:04.4: new USB bus registered, assigned bus number 6
Dec 16 11:47:06 angklung kernel: uhci_hcd 0000:01:04.4: port count misdetected? forcing to 2 ports
Dec 16 11:47:06 angklung kernel: uhci_hcd 0000:01:04.4: irq 22, io base 0x00003800
Dec 16 11:47:06 angklung kernel: usb usb6: New USB device found, idVendor=1d6b, idProduct=0001
Dec 16 11:47:06 angklung kernel: usb usb6: New USB device strings: Mfr=3, Product=2, SerialNumber=1
Dec 16 11:47:06 angklung kernel: usb usb6: Product: UHCI Host Controller
Dec 16 11:47:06 angklung kernel: usb usb6: Manufacturer: Linux 2.6.30.9-102.fc11.x86_64 uhci_hcd
Dec 16 11:47:06 angklung kernel: usb usb6: SerialNumber: 0000:01:04.4
Dec 16 11:47:06 angklung kernel: usb usb6: configuration #1 chosen from 1 choice
Dec 16 11:47:06 angklung kernel: hub 6-0:1.0: USB hub found
Dec 16 11:47:06 angklung kernel: hub 6-0:1.0: 2 ports detected

Booting continues normally, and the "Unlink after no-IRQ?" message is not seen.
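
One way to compare the two boot logs mechanically is to strip the syslog prefix and the kernel version string, then list the lines that appear only in the failing boot. This is a rough sketch, assuming the prefix always looks like "Dec 16 11:47:06 angklung kernel: " and that version strings look like the two quoted above; the helper names are hypothetical.

```python
import re

# Assumed syslog prefix shape: "Mon DD HH:MM:SS hostname kernel: ".
SYSLOG_PREFIX = re.compile(r"^[A-Z][a-z]{2} +\d+ \d\d:\d\d:\d\d \S+ kernel: ")
# Assumed Fedora kernel version shape, e.g. 2.6.31.6-166.fc12.x86_64.
KERNEL_VERSION = re.compile(r"\d+(?:\.\d+)+-\d+\.fc\d+\.\w+")


def normalize(line):
    """Drop the syslog prefix and replace the kernel version with a placeholder."""
    line = SYSLOG_PREFIX.sub("", line.strip())
    return KERNEL_VERSION.sub("<kernel>", line)


def only_in_failing(failing_lines, working_lines):
    """Return normalized lines from the failing boot that never occur in the working boot."""
    working = {normalize(line) for line in working_lines}
    return [n for n in map(normalize, failing_lines) if n and n not in working]
```

Fed the F12 and F11 excerpts from this description, a comparison like this should leave only the NMI lines, since the USB and hub messages match once the version string is masked.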

Additional info:

Comment 1 Ben Webb 2010-01-04 19:15:20 UTC
Also fails with the latest F12 update, kernel-2.6.31.9-174.fc12.x86_64. Normal boot with the latest F11 update, kernel-2.6.30.10-105.fc11.x86_64.

Comment 2 Jon E 2010-01-06 19:27:23 UTC
As a workaround, try turning off C-states in the BIOS.

Comment 3 Ben Webb 2010-01-07 17:33:23 UTC
I see a bunch of settings related to power management (e.g. tuning HP's dynamic power savings mode) in the BIOS, but nothing that obviously says "turn C-States on/off". Do you happen to know what HP might call this in their BIOS?

Comment 4 Jon E 2010-01-07 21:05:33 UTC
Try "Minimum Processor Idle Power State". If you're navigating the BIOS screens, it's under "Power Management Options" => "Advanced Power Management Options".

I've had mixed results with this one. I had a 2.4GHz model that came up without this setting and is still running fine, and a 2.93GHz model that worked initially with this setting but developed problems later.

This could be an issue with certain G6 models and power state weirdness; you might want to talk to your vendor about it.

Comment 5 Jan ONDREJ 2010-03-04 14:43:01 UTC
I have the same problem on the same server.
After changing:

rbsu> SET CONFIG MINIMUM PROCESSOR IDLE POWER STATE 4
Minimum Processor Idle Power State
1|C6 State
2|C3 State
3|C1E State
4|No C-states <=

the message and problems still persist. Some services randomly fail to start, and the system freezes completely at a random time (approx. 5 minutes after boot).

Similar problems occur with the current testing kernel-2.6.32:
Kernel panic - not syncing: An NMI occurred, please see the Integrated Management Log for details.
(but there is nothing in the IML).

The Fedora 11 kernel boots fine, but my server sometimes has very high load and my virtual machines freeze for 1-60 minutes.

Any other solutions?

Comment 6 Jan ONDREJ 2010-03-04 20:20:50 UTC
Kernel 2.6.34-0.4.rc0.git2.fc14.x86_64:
- no uhhuh message
- Kernel panic - not syncing: An NMI occurred, please see the Integrated Management Log for details.

Still doesn't work. Looks like everything >=2.6.31 fails. :-(

Comment 7 Jan ONDREJ 2010-03-08 15:54:15 UTC
As a workaround, intel_iommu=off works for me. There may be some disadvantages, but at least I can run my server.

Comment 8 Jan ONDREJ 2010-03-09 11:22:58 UTC
With iommu=pt the server works well, but after a reboot it fails in the BIOS with the message:
  NMI - Undetermined Source
(this message is displayed before "Press F9 to setup ...")

ILO IML log says:
  Severity   Class      Description
  Critical   Host Bus   Unknown Event (Class 6, Code 4)

Another workaround is to disable VT-d in the BIOS. Then there is no need to add any *iommu* kernel parameter, and the server also works well after a reboot.

I don't need to assign PCI devices to guests; I use only virtio drivers in guests, so I don't need VT-d, but it would be good to fix this in the future.

More info about VT-d:
  http://www.linux-kvm.org/page/How_to_assign_devices_with_VT-d_in_KVM
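
To check which of the workarounds above is active on a running system, the kernel command line can be inspected. This is a small sketch that assumes the parameters appear as plain `iommu=...` and `intel_iommu=...` tokens; the function names are illustrative, and the text would normally come from /proc/cmdline.

```python
def iommu_settings(cmdline):
    """Collect the values of iommu= and intel_iommu= from a kernel command line string."""
    settings = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        # Only keep the two parameters discussed in this bug.
        if sep and key in ("iommu", "intel_iommu"):
            settings[key] = value
    return settings


def read_running_cmdline(path="/proc/cmdline"):
    """Read the command line of the running kernel (Linux only)."""
    with open(path) as f:
        return f.read()
```

For example, `iommu_settings(read_running_cmdline())` would return an empty dict on a system booted without either parameter, such as one where VT-d was disabled in the BIOS instead.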

Comment 9 Don Zickus 2010-06-03 20:06:29 UTC
(In reply to comment #6)
> Kernel 2.6.34-0.4.rc0.git2.fc14.x86_64:
> - no uhhuh message
> - Kernel panic - not syncing: An NMI occurred, please see the Integrated
> Management Log for details.
> 
> Still doesn't work. Looks like everything >=2.6.31 fails. :-(    

The messages went away because you are using the hpwdt kernel module, which intercepts the unknown NMIs and uses the BIOS to determine the source and save them in the ILO.  Upon reboot you see what the ILO figured out (which was nothing interesting).

I have a kernel module that can walk the PCI tree at the time of the NMI, but the problem needs to be reproducible after booting in order to load the module (unless you want a kernel patch to apply and build).  Of course you would need to disable the hpwdt module to use this one instead.

Let me know if anyone is interested.

Cheers,
Don
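
Since hpwdt changes which NMI message is printed, it can help to confirm whether the module is loaded before comparing logs. The sketch below assumes /proc/modules-style text, where the module name is the first whitespace-separated field on each line; the function names are hypothetical.

```python
def loaded_modules(proc_modules_text):
    """Return the set of module names listed in /proc/modules-style text."""
    return {
        line.split()[0]
        for line in proc_modules_text.splitlines()
        if line.strip()
    }


def hpwdt_loaded(proc_modules_text):
    """True if the hpwdt watchdog module appears in the module list."""
    return "hpwdt" in loaded_modules(proc_modules_text)
```

On a live system the input would be the contents of /proc/modules, read with an ordinary `open()` call.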

Comment 10 Jan ONDREJ 2010-06-04 05:55:32 UTC
I don't think that hpwdt is responsible for this problem. It only reports a bug; it does not cause it.

I have a similar problem with a DELL PowerEdge R710 server. I can't find a VT-d switch in the BIOS, so I can't disable VT-d. The only workaround is iommu=pt.

Is there any chance of fixing this in the kernel, either by ignoring VT-d or by fixing VT-d support?

Comment 11 Alex Williamson 2010-06-04 12:45:40 UTC
Please try updating the firmware on the Smart Array controller to version 3.00 or newer.  This should fix the NMI received during boot.  The NMI on reboot seems to be a BIOS issue, unless there is something more we can do to make sure the VT-d hardware is shut down and does not produce faults during the BIOS reboot.

Comment 12 Don Zickus 2010-06-04 13:24:53 UTC
(In reply to comment #10)
> I don't think, that hpwdt is responsible for this problem. It only reports an
> bug, but does not make it.

Sorry for the confusion.  I was referring to the change in behaviour of the messages.  I understand that hpwdt is not causing the problem.  That module just reports the info differently than the normal unknown NMI path.

Hopefully a firmware update can fix this problem :-)

Cheers,
Don

Comment 13 Jan ONDREJ 2010-06-07 12:59:11 UTC
I can't experiment on this server now. It's a production server.

But I have similar problems on a DELL server with all firmware updates installed; the problems are the same with F12 and F13. The DELL does not display a message after reboot, but it is also unable to boot without the iommu=pt parameter. I can't find a VT-d parameter in the BIOS, so I can't disable it. Can I test something on this server? I don't use it in production yet.

Comment 14 David Woodhouse 2010-06-07 13:16:31 UTC
For Dell BIOS brokenness; let's ask Dell... Jan, can you confirm that you're seeing the _same_ problem on the Dell box? Sorry to ask, but some people have a very strange idea of what 'similar bug' means :)

Comment 15 David Woodhouse 2010-06-07 13:28:26 UTC
Offline we have established that the Dell issue is different -- it manifests itself as a lockup soon after enabling VT-d. Probably due to a missing RMRR for the iDRAC which is being used?

Comment 16 Joshua Roys 2010-06-15 20:24:16 UTC
*** Bug 593003 has been marked as a duplicate of this bug. ***

Comment 17 Jan ONDREJ 2010-08-08 19:15:26 UTC
I can confirm that after these updates the problem disappeared:

  - BIOS update on the HP to the latest version
  - Compaq Smart Array update to the latest version
  - Fedora kernel update

On the Dell as well, all BIOSes and the kernel have been updated, and it now works well.
The problem is also gone after VT-d was re-enabled in the BIOS.
Thank you for the help; closing bug.

