Bug 1694779

Summary: list_del corruption in exit_sem
Product: [Fedora] Fedora
Reporter: Gary Duzan <Gary.Duzan>
Component: kernel
Assignee: Kernel Maintainer List <kernel-maint>
Status: NEW ---
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: unspecified
Docs Contact:
Priority: unspecified
Version: 29
CC: airlied, bskeggs, hdegoede, ichavero, itamar, jarodwilson, jeremy, jglisse, john.j5live, jonathan, josef, kernel-maint, linville, mchehab, mjg59, steved
Target Milestone: ---   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed:
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Attachments:
kernel log (flags: none)

Description Gary Duzan 2019-04-01 16:28:07 UTC
Created attachment 1550637
kernel log

1. Please describe the problem:

Periodically we get kernel lockups, with the kernel report below at the root of them. They typically occur under heavy load testing of GT.M, which makes significant use of semaphores.
Mar 30 04:25:33 kernel: list_del corruption, ffff953f1fe70e08->next is LIST_POISON1 (dead000000000100)
Mar 30 04:25:33 kernel: ------------[ cut here ]------------
Mar 30 04:25:33 kernel: kernel BUG at lib/list_debug.c:47!
Mar 30 04:25:33 kernel: invalid opcode: 0000 [#1] SMP PTI
Mar 30 04:25:33 kernel: CPU: 1 PID: 933549 Comm: mumps Not tainted 5.0.3-200.fc29.x86_64 #1
Mar 30 04:25:33 kernel: Hardware name: Dell Inc. PowerEdge R730xd/072T6D, BIOS 2.7.1 001/22/2018
Mar 30 04:25:33 kernel: RIP: 0010:__list_del_entry_valid.cold.1+0x12/0x4c
Mar 30 04:25:33 kernel: Code: c9 ff 0f 0b 48 89 c1 4c 89 c6 48 c7 c7 18 16 12 b1 e8 bc 15 c9 ff 0f 0b 48 89 fe 48 89 c2 48 c7 c7 a8 16 12 b1 e8 a8 15 c9 ff <0f> 0b 48 c7 c7 58 17 12 b1 e8 9a 15 c9 ff 0f 0b 48 89 f2 48 89 fe
Mar 30 04:25:33 kernel: RSP: 0018:ffffb02826157e00 EFLAGS: 00010246
Mar 30 04:25:33 kernel: RAX: 000000000000004e RBX: ffff953eed251600 RCX: 0000000000000000
Mar 30 04:25:33 kernel: RDX: 0000000000000000 RSI: ffff95407f4168c8 RDI: ffff95407f4168c8
Mar 30 04:25:33 kernel: RBP: ffff953f1fe70de0 R08: 000000000000065b R09: 0000000000000003
Mar 30 04:25:33 kernel: R10: 0000000000000000 R11: 0000000000000001 R12: 000000001af48021
Mar 30 04:25:33 kernel: R13: ffff953f11710d40 R14: ffff953f11710d48 R15: ffff953eed251688
Mar 30 04:25:33 kernel: FS:  00007f9e93f5a440(0000) GS:ffff95407f400000(0000) knlGS:0000000000000000
Mar 30 04:25:33 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Mar 30 04:25:33 kernel: CR2: 00007fa2cc113000 CR3: 000000155e40e001 CR4: 00000000003606e0
Mar 30 04:25:33 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Mar 30 04:25:33 kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Mar 30 04:25:33 kernel: Call Trace:
Mar 30 04:25:33 kernel:  exit_sem+0x12d/0x577
Mar 30 04:25:33 kernel:  do_exit+0x2a4/0xbb0
Mar 30 04:25:33 kernel:  ? __do_page_fault+0x26f/0x500
Mar 30 04:25:33 kernel:  do_group_exit+0x3a/0xa0
Mar 30 04:25:33 kernel:  __x64_sys_exit_group+0x14/0x20
Mar 30 04:25:33 kernel:  do_syscall_64+0x5b/0x160
Mar 30 04:25:33 kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
Mar 30 04:25:33 kernel: RIP: 0033:0x7f9e9406aad6
Mar 30 04:25:33 kernel: Code: Bad RIP value.
Mar 30 04:25:33 kernel: RSP: 002b:00007ffc79a3ac28 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
Mar 30 04:25:33 kernel: RAX: ffffffffffffffda RBX: 00007f9e9415d740 RCX: 00007f9e9406aad6
Mar 30 04:25:33 kernel: RDX: 0000000000000000 RSI: 000000000000003c RDI: 0000000000000000
Mar 30 04:25:33 kernel: RBP: 0000000000000000 R08: 00000000000000e7 R09: ffffffffffffff78
Mar 30 04:25:33 kernel: R10: 00007ffc79a3aa8e R11: 0000000000000246 R12: 00007f9e9415d740
Mar 30 04:25:33 kernel: R13: 0000000000000002 R14: 00007f9e94166448 R15: 0000000000000000
Mar 30 04:25:33 kernel: Modules linked in: btrfs xor zstd_compress raid6_pq zstd_decompress fuse vfat fat rpcsec_gss_krb5 nfsv4 dns_resolver nfs lockd grace fscache iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi f2fs intel_rapl sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass raid1 crct10dif_pclmul crc32_pclmul ghash_clmulni_intel intel_cstate mgag200 intel_uncore i2c_algo_bit intel_rapl_perf ttm ipmi_ssif drm_kms_helper drm iTCO_wdt iTCO_vendor_support mei_me dcdbas lpc_ich mei ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter mxm_wmi pcc_cpufreq auth_rpcgss sunrpc binfmt_misc xfs libcrc32c nvme crc32c_intel nvme_core megaraid_sas tg3 wmi loop
Mar 30 04:25:33 kernel: ---[ end trace d208a32963f4ac0e ]---
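
For context on the first line of the report: dead000000000100 is the LIST_POISON1 value that the x86_64 kernel stores in an entry's next pointer when the entry is removed from a list, so this check typically fires when an entry that was already removed gets deleted again. The following is a minimal user-space sketch of that kind of check, not the kernel source. The helper names list_del_entry_valid and list_del_checked are made up for illustration, and the LIST_POISON2 value is an assumption; only LIST_POISON1 appears in the log.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct list_head {
    struct list_head *next, *prev;
};

/* LIST_POISON1 matches the value printed in the log above.
 * LIST_POISON2 is an assumed value, used here only for illustration. */
#define LIST_POISON1 ((struct list_head *)(uintptr_t)0xdead000000000100ULL)
#define LIST_POISON2 ((struct list_head *)(uintptr_t)0xdead000000000200ULL)

/* Make an empty circular list: the head points at itself. */
static void list_init(struct list_head *head)
{
    head->next = head;
    head->prev = head;
}

/* Insert entry n right after head. */
static void list_add(struct list_head *n, struct list_head *head)
{
    assert(n != NULL && head != NULL);
    n->next = head->next;
    n->prev = head;
    head->next->prev = n;
    head->next = n;
}

/* Simplified validity check. Returns false in the situations where a
 * debug-enabled kernel would report "list_del corruption". The poison
 * comparisons come first so a poisoned pointer is never dereferenced. */
static bool list_del_entry_valid(const struct list_head *e)
{
    if (e->next == LIST_POISON1) {
        fprintf(stderr, "list_del corruption, %p->next is LIST_POISON1\n",
                (const void *)e);
        return false;
    }
    if (e->prev == LIST_POISON2) {
        fprintf(stderr, "list_del corruption, %p->prev is LIST_POISON2\n",
                (const void *)e);
        return false;
    }
    /* Both neighbours must still point back at this entry. */
    return e->prev->next == e && e->next->prev == e;
}

/* Unlink e and poison its pointers. Returns false, leaving the list
 * untouched, if the entry fails the validity check. */
static bool list_del_checked(struct list_head *e)
{
    if (!list_del_entry_valid(e))
        return false;
    e->next->prev = e->prev;
    e->prev->next = e->next;
    e->next = LIST_POISON1;
    e->prev = LIST_POISON2;
    return true;
}
```

Calling list_del_checked() twice on the same entry succeeds the first time and returns false the second time, which corresponds to the point where the kernel hits the BUG at lib/list_debug.c:47 in the trace above.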

2. What is the Version-Release number of the kernel:

Linux 5.0.3-200.fc29.x86_64 #1 SMP Tue Mar 19 15:07:58 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

3. Did it work previously in Fedora? If so, what kernel version did the issue
   *first* appear?  Old kernels are available for download at :

We have observed the issue for some time, maybe a year or so, though it isn't clear when the problem started.

4. Can you reproduce this issue? If so, please provide the steps to reproduce
   the issue below:

Unfortunately, while we do see the issue on a weekly to monthly basis, we don't have a specific trigger for it. I'm assuming it is timing-sensitive.
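
To illustrate the code path in the trace, here is a rough sketch of a workload that goes through exit_sem: several processes each take a System V semaphore with SEM_UNDO and then exit without releasing it, so the kernel has to apply the undo adjustment during process exit. It is only an assumption that this resembles what GT.M does under load, and this sketch is not known to reproduce the corruption. The function name undo_stress_round is hypothetical.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>
#include <sys/wait.h>
#include <unistd.h>

/* On Linux with glibc the caller has to define union semun itself. */
union semun {
    int val;
    struct semid_ds *buf;
    unsigned short *array;
};

/* Creates a private semaphore initialised to nprocs, forks nprocs children
 * that each decrement it with SEM_UNDO and exit while still holding it,
 * then returns the semaphore value after all children have been reaped.
 * Returns -1 on any error. */
static int undo_stress_round(int nprocs)
{
    assert(nprocs > 0);

    int id = semget(IPC_PRIVATE, 1, IPC_CREAT | 0600);
    if (id < 0)
        return -1;

    union semun arg = { .val = nprocs };
    if (semctl(id, 0, SETVAL, arg) < 0) {
        semctl(id, 0, IPC_RMID);
        return -1;
    }

    int forked = 0;
    for (int i = 0; i < nprocs; i++) {
        pid_t pid = fork();
        if (pid < 0)
            break;
        if (pid == 0) {
            /* Child: take the semaphore and exit without giving it back.
             * The undo adjustment is applied by the kernel at exit. */
            struct sembuf op = { .sem_num = 0, .sem_op = -1, .sem_flg = SEM_UNDO };
            _exit(semop(id, &op, 1) == 0 ? 0 : 1);
        }
        forked++;
    }

    int ok = (forked == nprocs);
    for (int i = 0; i < forked; i++) {
        int status;
        if (wait(&status) < 0 || !WIFEXITED(status) || WEXITSTATUS(status) != 0)
            ok = 0;
    }

    /* Read the value back, then remove the semaphore set. */
    int val = semctl(id, 0, GETVAL);
    semctl(id, 0, IPC_RMID);
    return ok ? val : -1;
}
```

If the undo handling works as intended, the returned value equals nprocs, because every exiting child has its decrement reverted. Running many such rounds in a loop, possibly from several parents at once, would be one way to put load on this path.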

5. Does this problem occur with the latest Rawhide kernel? To install the
   Rawhide kernel, run ``sudo dnf install fedora-repos-rawhide`` followed by
   ``sudo dnf update --enablerepo=rawhide kernel``:

It would not be practical to test with a rawhide kernel on the affected system.

6. Are you running any modules that are not shipped directly with Fedora's kernel?:


7. Please attach the kernel logs. You can get the complete kernel log
   for a boot with ``journalctl --no-hostname -k > dmesg.txt``. If the
   issue occurred on a previous boot, use the journalctl ``-b`` flag.

The full kernel log from that boot is just under 51MB, and quite repetitive, so I'll trim most of it.