Bug 1694779

Summary: list_del corruption in exit_sem
Product: [Fedora] Fedora
Reporter: Gary Duzan <Gary.Duzan>
Component: kernel
Assignee: Kernel Maintainer List <kernel-maint>
Status: NEW ---
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: unspecified
Docs Contact:
Priority: unspecified
Version: 29
CC: airlied, bskeggs, hdegoede, ichavero, itamar, jarodwilson, jeremy, jglisse, john.j5live, jonathan, josef, kernel-maint, linville, mchehab, mjg59, steved
Target Milestone: ---
Target Release: ---
Hardware: x86_64
OS: Linux
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed:
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Attachments:
Description: kernel log
Flags: none

Description Gary Duzan 2019-04-01 16:28:07 UTC
Created attachment 1550637: kernel log

1. Please describe the problem:

Periodically we get kernel lockups, with this particular kernel report at the root of them. They typically occur under heavy load testing of GT.M, which makes significant use of semaphores.
Mar 30 04:25:33 kernel: list_del corruption, ffff953f1fe70e08->next is LIST_POISON1 (dead000000000100)
Mar 30 04:25:33 kernel: ------------[ cut here ]------------
Mar 30 04:25:33 kernel: kernel BUG at lib/list_debug.c:47!
Mar 30 04:25:33 kernel: invalid opcode: 0000 [#1] SMP PTI
Mar 30 04:25:33 kernel: CPU: 1 PID: 933549 Comm: mumps Not tainted 5.0.3-200.fc29.x86_64 #1
Mar 30 04:25:33 kernel: Hardware name: Dell Inc. PowerEdge R730xd/072T6D, BIOS 2.7.1 001/22/2018
Mar 30 04:25:33 kernel: RIP: 0010:__list_del_entry_valid.cold.1+0x12/0x4c
Mar 30 04:25:33 kernel: Code: c9 ff 0f 0b 48 89 c1 4c 89 c6 48 c7 c7 18 16 12 b1 e8 bc 15 c9 ff 0f 0b 48 89 fe 48 89 c2 48 c7 c7 a8 16 12 b1 e8 a8 15 c9 ff <0f> 0b 48 c7 c7 58 17 12 b1 e8 9a 15 c9 ff 0f 0b 48 89 f2 48 89 fe
Mar 30 04:25:33 kernel: RSP: 0018:ffffb02826157e00 EFLAGS: 00010246
Mar 30 04:25:33 kernel: RAX: 000000000000004e RBX: ffff953eed251600 RCX: 0000000000000000
Mar 30 04:25:33 kernel: RDX: 0000000000000000 RSI: ffff95407f4168c8 RDI: ffff95407f4168c8
Mar 30 04:25:33 kernel: RBP: ffff953f1fe70de0 R08: 000000000000065b R09: 0000000000000003
Mar 30 04:25:33 kernel: R10: 0000000000000000 R11: 0000000000000001 R12: 000000001af48021
Mar 30 04:25:33 kernel: R13: ffff953f11710d40 R14: ffff953f11710d48 R15: ffff953eed251688
Mar 30 04:25:33 kernel: FS:  00007f9e93f5a440(0000) GS:ffff95407f400000(0000) knlGS:0000000000000000
Mar 30 04:25:33 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Mar 30 04:25:33 kernel: CR2: 00007fa2cc113000 CR3: 000000155e40e001 CR4: 00000000003606e0
Mar 30 04:25:33 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Mar 30 04:25:33 kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Mar 30 04:25:33 kernel: Call Trace:
Mar 30 04:25:33 kernel:  exit_sem+0x12d/0x577
Mar 30 04:25:33 kernel:  do_exit+0x2a4/0xbb0
Mar 30 04:25:33 kernel:  ? __do_page_fault+0x26f/0x500
Mar 30 04:25:33 kernel:  do_group_exit+0x3a/0xa0
Mar 30 04:25:33 kernel:  __x64_sys_exit_group+0x14/0x20
Mar 30 04:25:33 kernel:  do_syscall_64+0x5b/0x160
Mar 30 04:25:33 kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
Mar 30 04:25:33 kernel: RIP: 0033:0x7f9e9406aad6
Mar 30 04:25:33 kernel: Code: Bad RIP value.
Mar 30 04:25:33 kernel: RSP: 002b:00007ffc79a3ac28 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
Mar 30 04:25:33 kernel: RAX: ffffffffffffffda RBX: 00007f9e9415d740 RCX: 00007f9e9406aad6
Mar 30 04:25:33 kernel: RDX: 0000000000000000 RSI: 000000000000003c RDI: 0000000000000000
Mar 30 04:25:33 kernel: RBP: 0000000000000000 R08: 00000000000000e7 R09: ffffffffffffff78
Mar 30 04:25:33 kernel: R10: 00007ffc79a3aa8e R11: 0000000000000246 R12: 00007f9e9415d740
Mar 30 04:25:33 kernel: R13: 0000000000000002 R14: 00007f9e94166448 R15: 0000000000000000
Mar 30 04:25:33 kernel: Modules linked in: btrfs xor zstd_compress raid6_pq zstd_decompress fuse vfat fat rpcsec_gss_krb5 nfsv4 dns_resolver nfs lockd grace fscache iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi f2fs intel_rapl sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass raid1 crct10dif_pclmul crc32_pclmul ghash_clmulni_intel intel_cstate mgag200 intel_uncore i2c_algo_bit intel_rapl_perf ttm ipmi_ssif drm_kms_helper drm iTCO_wdt iTCO_vendor_support mei_me dcdbas lpc_ich mei ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter mxm_wmi pcc_cpufreq auth_rpcgss sunrpc binfmt_misc xfs libcrc32c nvme crc32c_intel nvme_core megaraid_sas tg3 wmi loop
Mar 30 04:25:33 kernel: ---[ end trace d208a32963f4ac0e ]---
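
For readers unfamiliar with the first line of this trace: with list debugging enabled, the kernel overwrites the pointers of a deleted list entry with poison values and checks for them on the next deletion, so "->next is LIST_POISON1" suggests that an entry which had already been removed was removed a second time. The following is a simplified userspace model of that check, not the kernel source. The function names are hypothetical, the LIST_POISON2 value is an assumption for this kernel version, and 64-bit pointers are assumed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct list_head { struct list_head *next, *prev; };

/* LIST_POISON1 matches the value printed in the log above.
 * LIST_POISON2 is an assumed value for illustration. */
#define LIST_POISON1 ((struct list_head *)(uintptr_t)0xdead000000000100ULL)
#define LIST_POISON2 ((struct list_head *)(uintptr_t)0xdead000000000200ULL)

/* An empty circular list points at itself in both directions. */
static void model_list_init(struct list_head *head)
{
    head->next = head;
    head->prev = head;
}

/* Insert entry n directly after head. */
static void model_list_add(struct list_head *n, struct list_head *head)
{
    assert(n != NULL && head != NULL);
    n->next = head->next;
    n->prev = head;
    head->next->prev = n;
    head->next = n;
}

/* Model of the debug check: returns false where the kernel would BUG. */
static bool model_list_del_entry_valid(const struct list_head *e)
{
    if (e->next == LIST_POISON1) {
        fprintf(stderr, "list_del corruption, %p->next is LIST_POISON1\n",
                (const void *)e);
        return false;
    }
    if (e->prev == LIST_POISON2) {
        fprintf(stderr, "list_del corruption, %p->prev is LIST_POISON2\n",
                (const void *)e);
        return false;
    }
    if (e->prev->next != e || e->next->prev != e) {
        fprintf(stderr, "list_del corruption, neighbours do not point back\n");
        return false;
    }
    return true;
}

/* Unlink e and poison its pointers so a second delete is detectable. */
static bool model_list_del(struct list_head *e)
{
    if (!model_list_del_entry_valid(e))
        return false;
    e->next->prev = e->prev;
    e->prev->next = e->next;
    e->next = LIST_POISON1;
    e->prev = LIST_POISON2;
    return true;
}
```

In this model, calling model_list_del twice on the same entry succeeds the first time and is rejected the second time. A race between two paths that both remove the same entry would look the same to the check.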

2. What is the Version-Release number of the kernel:

Linux 5.0.3-200.fc29.x86_64 #1 SMP Tue Mar 19 15:07:58 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

3. Did it work previously in Fedora? If so, what kernel version did the issue
   *first* appear?  Old kernels are available for download at :

We have observed the issue for some time, maybe a year or so, though it isn't clear when the problem started.

4. Can you reproduce this issue? If so, please provide the steps to reproduce
   the issue below:

Unfortunately, while we do see the issue on a weekly to monthly basis, we don't have a specific trigger for it; I assume it is timing-sensitive.
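
Since no trigger is known, the following is only a sketch of the kernel path that appears in the call trace, not a reproducer. exit_sem is the routine that applies a process's pending SEM_UNDO adjustments when the process exits, so a process that exits while holding a semaphore taken with SEM_UNDO passes through it. Whether GT.M uses SEM_UNDO is an assumption here, and the function name sem_undo_exit_once is hypothetical.

```c
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>
#include <sys/wait.h>
#include <unistd.h>

/* On Linux the caller must define union semun for semctl(). */
union semun {
    int val;
    struct semid_ds *buf;
    unsigned short *array;
};

/*
 * Create a private semaphore with value 1, let a child take it with
 * SEM_UNDO and exit without releasing it, then report the value the
 * parent sees afterwards. Returns -1 on any error.
 */
int sem_undo_exit_once(void)
{
    int id = semget(IPC_PRIVATE, 1, IPC_CREAT | 0600);
    if (id < 0)
        return -1;

    union semun arg = { .val = 1 };
    if (semctl(id, 0, SETVAL, arg) < 0) {
        semctl(id, 0, IPC_RMID);
        return -1;
    }

    pid_t pid = fork();
    if (pid < 0) {
        semctl(id, 0, IPC_RMID);
        return -1;
    }
    if (pid == 0) {
        /* Child: decrement with SEM_UNDO, then exit while still holding it. */
        struct sembuf op = { .sem_num = 0, .sem_op = -1, .sem_flg = SEM_UNDO };
        _exit(semop(id, &op, 1) == 0 ? 0 : 1);
    }

    int status = 0;
    if (waitpid(pid, &status, 0) < 0) {
        semctl(id, 0, IPC_RMID);
        return -1;
    }

    /* The kernel should have undone the child's decrement at exit. */
    int val = semctl(id, 0, GETVAL);
    semctl(id, 0, IPC_RMID);

    if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
        return -1;
    return val;
}
```

A stress test built on this idea would run many such children concurrently against shared semaphore sets, possibly while other processes remove those sets, but whether that matches the failing workload is not established by this report.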

5. Does this problem occur with the latest Rawhide kernel? To install the
   Rawhide kernel, run ``sudo dnf install fedora-repos-rawhide`` followed by
   ``sudo dnf update --enablerepo=rawhide kernel``:

It would not be practical to test with a rawhide kernel on the affected system.

6. Are you running any modules that are not shipped directly with Fedora's kernel?:


7. Please attach the kernel logs. You can get the complete kernel log
   for a boot with ``journalctl --no-hostname -k > dmesg.txt``. If the
   issue occurred on a previous boot, use the journalctl ``-b`` flag.

The full kernel log from that boot is just under 51MB, and quite repetitive, so I'll trim most of it.