
Bug 1692738

Summary: BUG: unable to handle kernel paging request at 00004047c748091e
Product: Red Hat Enterprise Linux 8
Reporter: Jianlin Shi <jishi>
Component: kernel-rt
Assignee: Juri Lelli <jlelli>
kernel-rt sub component: sctp
QA Contact: ying xu <yinxu>
Status: ASSIGNED ---
Docs Contact:
Severity: high
Priority: high
CC: bhu, daolivei, dzickus, mleitner, network-qe, qzhao, rasibley, vkabatov, williams
Version: 8.0
Target Milestone: rc
Target Release: 8.1
Hardware: Unspecified
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed:
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:

Description Jianlin Shi 2019-03-26 10:34:13 UTC
Description of problem:
 BUG: unable to handle kernel paging request at 00004047c748091e

Version-Release number of selected component (if applicable):
git://git.host.prod.eng.bos.redhat.com/kernel-rt.git commit: cc5442d8c774

How reproducible:
2/2

Steps to Reproduce:
1. Run the pmtu test case as in this Beaker job: https://beaker.engineering.redhat.com/jobs/3430543

Actual results:
[ 3043.692672] BUG: unable to handle kernel paging request at 00004047c748091e
[ 3043.692674] PGD 0 P4D 0 
[ 3043.692678] Oops: 0000 [#1] PREEMPT SMP PTI
[ 3043.692681] CPU: 23 PID: 24574 Comm: netserver Kdump: loaded Not tainted 4.18.0-rt9.cki+ #1
[ 3043.692681] Hardware name: Dell Inc. PowerEdge R820/04K5X5, BIOS 2.3.4 01/22/2016
[ 3043.692699] RIP: 0010:sctp_backlog_rcv+0x25/0x1b0 [sctp]
[ 3043.692701] Code: 00 00 00 00 00 66 66 66 66 90 41 57 41 56 41 55 41 54 55 53 48 83 ec 08 48 8b 6e 40 48 8b 9d 98 00 00 00 4c 8b ad e8 00 00 00 <80> 7b 1c 00 75 41 4c 8b 63 20 4c 8d 7b 28 49 39 fc 75 4e 48 89 ee
[ 3043.692702] RSP: 0018:ffff9bff4e9d7c20 EFLAGS: 00010282
[ 3043.692704] RAX: ffffffffc0a01a30 RBX: 00004047c7480902 RCX: 0000000000000000
[ 3043.692705] RDX: 0000000000000000 RSI: ffff8894fdb3ca00 RDI: ffff8894e3767400
[ 3043.692706] RBP: ffffffffa9b7ca60 R08: 0000000000000000 R09: ffff8884f40220d0
[ 3043.692707] R10: ffff8884f4021cd0 R11: ffff8894fbc74108 R12: ffff8894e3767488
[ 3043.692708] R13: 4489480000002825 R14: 00000000000004cc R15: ffff888541fe2200
[ 3043.692710] FS:  00007fb295dd1100(0000) GS:ffff8894ff2c0000(0000) knlGS:0000000000000000               
[ 3043.692711] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 3043.692712] CR2: 00004047c748091e CR3: 0000000ff1082002 CR4: 00000000000606e0
[ 3043.692713] Call Trace:
[ 3043.692726]  __release_sock+0x81/0xd0
[ 3043.692728]  release_sock+0x30/0xa0
[ 3043.692737]  sctp_recvmsg+0x196/0x330 [sctp]                                                           
[ 3043.692742]  inet_recvmsg+0x5b/0x100
[ 3043.692746]  ___sys_recvmsg+0xdd/0x1e0
[ 3043.692751]  ? __switch_to_asm+0x40/0x70
[ 3043.692752]  ? __switch_to_asm+0x40/0x70
[ 3043.692754]  ? __switch_to_asm+0x34/0x70
[ 3043.692755]  ? __switch_to_asm+0x40/0x70
[ 3043.692756]  ? __switch_to_asm+0x34/0x70
[ 3043.692757]  ? __switch_to_asm+0x40/0x70
[ 3043.692759]  ? __switch_to_asm+0x34/0x70
[ 3043.692760]  ? _raw_spin_unlock_irq+0x1d/0x50
[ 3043.692764]  ? finish_task_switch+0x85/0x2c0
[ 3043.692765]  ? __switch_to_asm+0x40/0x70
[ 3043.692768]  ? __schedule+0x262/0x680
[ 3043.692773]  ? __audit_syscall_entry+0xd7/0x160
[ 3043.692775]  __sys_recvmsg+0x54/0xa0
[ 3043.692781]  do_syscall_64+0x5b/0x1b0
[ 3043.692783]  entry_SYSCALL_64_after_hwframe+0x65/0xca
[ 3043.692786] RIP: 0033:0x7fb29515fda8
[ 3043.692788] Code: ff eb b1 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 f3 0f 1e fa 48 8d 05 75 59 2c 00 8b 00 85 c0 75 17 b8 2f 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 41 54 41 89 d4 55
[ 3043.692789] RSP: 002b:00007ffc3591ee48 EFLAGS: 00000246 ORIG_RAX: 000000000000002f
[ 3043.692790] RAX: ffffffffffffffda RBX: 00007ffc3591ef10 RCX: 00007fb29515fda8
[ 3043.692791] RDX: 0000000000000000 RSI: 00007ffc3591ee60 RDI: 0000000000000009
[ 3043.692792] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
[ 3043.692793] R10: 00007fb295bd2070 R11: 0000000000000246 R12: 0000000000000000
[ 3043.692794] R13: 00007ffc35921400 R14: 0000000000000000 R15: 0000000000000000
[ 3043.692797] Modules linked in: veth ah6 xfrm6_mode_transport ah4 xfrm4_mode_transport sctp mlx4_en mlx4_core devlink sunrpc intel_rapl sb_edac intel_powerclamp coretemp kvm_intel kvm joydev irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel iTCO_wdt intel_cstate iTCO_vendor_support dcdbas ipmi_ssif intel_uncore intel_rapl_perf ipmi_si ipmi_devintf pcspkr sg ext4 ipmi_msghandler mbcache wmi mei_me acpi_power_meter mei lpc_ich jbd2 ip_tables xfs libcrc32c sr_mod cdrom sd_mod mgag200 drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm crc32c_intel ahci ixgbe mdio libahci igb dca libata megaraid_sas i2c_algo_bit [last unloaded: veth]                                                    
[ 3043.692831] CR2: 00004047c748091e
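
A quick consistency check on the oops above, offered as a sketch rather than a diagnosis. The byte marked `<80>` in the Code: line starts the sequence `80 7b 1c 00`, which on x86-64 decodes as `cmpb $0x0,0x1c(%rbx)`, so the faulting address should equal RBX plus 0x1c. The helper name `fault_addr` is made up for illustration; the register values are copied from the log.

```shell
#!/bin/sh
# fault_addr BASE DISP: print BASE + DISP as a zero-padded 64-bit hex address.
# (Hypothetical helper; relies on the shell's 64-bit integer arithmetic.)
fault_addr() {
    printf '%016x\n' $(( $1 + $2 ))
}

rbx=0x00004047c7480902   # RBX from the register dump
disp=0x1c                # displacement byte in "80 7b 1c 00"
cr2=00004047c748091e     # CR2 from the oops

addr=$(fault_addr "$rbx" "$disp")
if [ "$addr" = "$cr2" ]; then
    echo "CR2 matches RBX+0x1c: $addr"
else
    echo "mismatch: computed $addr, CR2 is $cr2"
fi
```

The value in RBX is not a kernel address, which suggests that the pointer loaded just before the faulting instruction (the bytes `48 8b 9d 98 00 00 00`, i.e. `mov 0x98(%rbp),%rbx`) was already bad. That is an inference from the register dump, not something confirmed in this report.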

Expected results:
no panic

Additional info:


I cloned the job and reserved the system after pmtu finished. Running pmtu manually several times triggers the panic: https://beaker.engineering.redhat.com/recipes/6666947#task90187377,task90187378

Comment 3 Juri Lelli 2019-03-26 15:37:41 UTC
Hi, is the box currently free and can I ssh into it?

Comment 4 Juri Lelli 2019-03-26 16:52:12 UTC
Hi again,

Another couple of questions.

1 - I noticed this happened on a "4.18.0-rt9.cki+" kernel.
    What is the origin/purpose of this kernel? (and why not
    using official kernel-rt builds?)

2 - Couldn't find a way to get access to a vmlinux with
    debug symbols (for the aforementioned kernel). Could
    you please point me at that?

Thanks!

Comment 5 Jianlin Shi 2019-03-27 10:28:52 UTC
(In reply to Juri Lelli from comment #4)
> Hi again,
> 
> Another couple of questions.
> 
> 1 - I noticed this happened on a "4.18.0-rt9.cki+" kernel.
>     What is the origin/purpose of this kernel? (and why not
>     using official kernel-rt builds?)

The issue was found through per-patch CI testing: https://xci32.lab.eng.rdu2.redhat.com/cki-project/cki-pipeline/pipelines/5682

> 
> 2 - Couldn't find a way to get access to a vmlinux with
>     debug symbols (for the aforementioned kernel). Could
>     you please point me at that?
> 
> Thanks!


You can log in to the system and find vmlinux there.
netqe2.knqe.lab.eng.bos.redhat.com  root/redhat

Comment 6 Juri Lelli 2019-03-27 12:24:32 UTC
(In reply to Jianlin Shi from comment #5)
> (In reply to Juri Lelli from comment #4)
> > Hi again,
> > 
> > Another couple of questions.
> > 
> > 1 - I noticed this happened on a "4.18.0-rt9.cki+" kernel.
> >     What is the origin/purpose of this kernel? (and why not
> >     using official kernel-rt builds?)
> 
> the issue was found through CI testing per patch:
> https://xci32.lab.eng.rdu2.redhat.com/cki-project/cki-pipeline/pipelines/5682
> 
> > 
> > 2 - Couldn't find a way to get access to a vmlinux with
> >     debug symbols (for the aforementioned kernel). Could
> >     you please point me at that?
> > 
> > Thanks!
> 
> 
> you can log in the system, and you can find vmlinux there.
> netqe2.knqe.lab.eng.bos.redhat.com  root/redhat

However, what I could find (/boot/vmlinux-4.18.0-rt9.cki+) doesn't
seem to have debug symbols?

[root@netqe2 ~]# crash /boot/vmlinux-4.18.0-rt9.cki+ /var/crash/127.0.0.1-2019-03-26-05\:07\:55/vmcore

crash 7.2.3-16.el8
Copyright (C) 2002-2017  Red Hat, Inc.
Copyright (C) 2004, 2005, 2006, 2010  IBM Corporation
Copyright (C) 1999-2006  Hewlett-Packard Co
Copyright (C) 2005, 2006, 2011, 2012  Fujitsu Limited
Copyright (C) 2006, 2007  VA Linux Systems Japan K.K.
Copyright (C) 2005, 2011  NEC Corporation
Copyright (C) 1999, 2002, 2007  Silicon Graphics, Inc.
Copyright (C) 1999, 2000, 2001, 2002  Mission Critical Linux, Inc.
This program is free software, covered by the GNU General Public License,
and you are welcome to change it and/or distribute copies of it under
certain conditions.  Enter "help copying" to see the conditions.
This program has absolutely no warranty.  Enter "help warranty" for details.
 
GNU gdb (GDB) 7.6
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu"...

crash: /boot/vmlinux-4.18.0-rt9.cki+: no debugging data available

[root@netqe2 ~]# grep DEBUG_INFO /boot/config-4.18.0-rt9.cki+ 
# CONFIG_DEBUG_INFO is not set
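
The grep above shows that this kernel was configured without debug info. As a sketch, the same check can be wrapped in a small function. The name `has_debug_info` is hypothetical, and the demonstration runs against a throwaway file that mimics the config line shown above rather than the real /boot config.

```shell
#!/bin/sh
# has_debug_info CONFIG_FILE: succeed only if the kernel config
# explicitly sets CONFIG_DEBUG_INFO=y. (Hypothetical helper name.)
has_debug_info() {
    grep -q '^CONFIG_DEBUG_INFO=y$' "$1"
}

# Demonstration on a temporary file containing the line from the output above.
sample=$(mktemp)
echo '# CONFIG_DEBUG_INFO is not set' > "$sample"

if has_debug_info "$sample"; then
    echo "debug info enabled"
else
    echo "debug info disabled; expect no debugging data in vmlinux"
fi
rm -f "$sample"
```

On the affected machine the argument would be the config file from the grep command, /boot/config-4.18.0-rt9.cki+.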

Comment 7 Jianlin Shi 2019-03-27 13:07:09 UTC
I don't know how the kernel was compiled. Rachel, could you help explain it to Juri? Thanks.

Comment 8 Veronika Kabatova 2019-03-27 14:06:06 UTC
Rachel asked me to answer for her.

> 1 - I noticed this happened on a "4.18.0-rt9.cki+" kernel. What is the origin/purpose of this kernel? (and why not using official kernel-rt builds?)

CKI is collaborating with the RT team to get CI for kernel-rt. That means we are testing changes directly from git *before* an official build is created and released. The repo and branch we used for the RHEL 8.0 build were linked by Clark, so they should be the right ones to use. We can't use an official build if we are supposed to find bugs before it's created.

> 2 - Couldn't find a way to get access to a vmlinux with debug symbols (for the aforementioned kernel). Could you please point me at that?

The kernels are compiled without debuginfo. The full build log can be downloaded from https://xci32.lab.eng.rdu2.redhat.com/cki-project/cki-pipeline/-/jobs/35848/artifacts/file/workdir/build.log if it's any help. Here are the commands we ran (from pipeline logs):

(workdir is the checked out git repo)

make -C /cki-project/cki-pipeline/workdir rh-configs
/cki-project/cki-pipeline/workdir/scripts/config --file /cki-project/cki-pipeline/workdir/.config --disable debug_info
/cki-project/cki-pipeline/workdir/scripts/config --file /cki-project/cki-pipeline/workdir/.config --set-str LOCALVERSION .cki
make -C /cki-project/cki-pipeline/workdir INSTALL_MOD_STRIP=1 -j64 targz-pkg -j64

You should get the same build with debuginfo enabled if you skip the command that disables it (obviously).
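
As a sketch of what that could look like, the function below prints the same sequence with the `--disable debug_info` step left out. `build_steps` and its argument are illustrative names, and the function only echoes the commands instead of running them. Whether `rh-configs` leaves debuginfo enabled by default in this tree is an assumption that this thread does not confirm.

```shell
#!/bin/sh
# build_steps WORKDIR: print the build commands from the pipeline log,
# omitting the step that disables debug info. Nothing is executed.
# (Hypothetical helper; the duplicated -j64 from the log is given once.)
build_steps() {
    w=$1
    echo "make -C $w rh-configs"
    echo "$w/scripts/config --file $w/.config --set-str LOCALVERSION .cki"
    echo "make -C $w INSTALL_MOD_STRIP=1 -j64 targz-pkg"
}

build_steps /cki-project/cki-pipeline/workdir
```

INSTALL_MOD_STRIP=1 is kept from the original log. It strips the installed modules, so module debug symbols may still be missing from the resulting package even when vmlinux carries debug info.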

Comment 9 Clark Williams 2019-03-27 15:58:51 UTC
(In reply to Veronika Kabatova from comment #8)
> Rachel asked me to answer for her.
> 
> > 1 - I noticed this happened on a "4.18.0-rt9.cki+" kernel. What is the origin/purpose of this kernel? (and why not using official kernel-rt builds?)
> 
> CKI is collaborating with RT team to get CI for kernel-rt. That means we are
> testing changes directly from git *before* an official build is created and
> released. The repo and branch we used for RHEL8.0 build were linked by Clark
> so they should be the right ones to use. We can't use an official build if
> we are supposed to find bugs before it's created.

I'm honestly struggling to figure out why this build is being done. The git tree you guys are pulling from is only pushed after we've merged RHEL's latest changes into it *and* have successfully done a brew build.

It's a much simpler workflow than what RHEL has to do, since we're just merging the RHEL changes into an existing RT tree, potentially fixing up merge conflicts and then kicking off a build.

What am I missing?

Comment 10 Don Zickus 2019-03-27 17:52:16 UTC
Hi Clark,

The build is being done because we do not know what tests the RT team ran before releasing.  As such, we run our standard tests on your baseline to make sure future patches applied on top are not falsely determined to be the reason why the test failed.

If/when we switch to running CKI _before_ the RT team releases a kernel (i.e., gating), then running this post-release test becomes moot.

This bug was filed because, even though you already released this kernel, it was determined that your baseline kernel would cause problems with our standard tests. This means we have to be aware of it and disable this test. Otherwise, when CKI tests with patches on top, it may falsely accuse a patch of failing a test when in fact the baseline already had a bug in it. Make sense?

(not that we plan to put many patches on top)

Cheers,
Don
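
The baseline reasoning in the comment above can be sketched as a small decision function. The name `blame` and the PASS/FAIL strings are illustrative assumptions, not CKI's actual implementation.

```shell
#!/bin/sh
# blame BASELINE_RESULT PATCHED_RESULT: decide what a test result points at.
# (Hypothetical helper; results are the literal strings PASS or FAIL.)
blame() {
    case "$1:$2" in
        FAIL:FAIL)           echo "baseline" ;;  # already broken: do not blame the patch
        PASS:FAIL)           echo "patch" ;;     # regression introduced on top of the baseline
        PASS:PASS|FAIL:PASS) echo "none" ;;      # nothing failing in the patched run
        *)                   echo "unknown" ;;
    esac
}

blame FAIL FAIL   # the pmtu case on this kernel-rt baseline
blame PASS FAIL   # a failure that would be attributed to the patch
```

Without the baseline run, the first case would be indistinguishable from the second, which is the false accusation the comment describes.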