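Over the past few years I have debugged a fair number of crashes of many different kinds, but on this one I took a detour I should not have, and in hindsight it looks rather silly. To keep myself from repeating the mistake, and as a reminder to others, I am writing it down.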

The kernel is SUSE 11 SP1:

uname -a
Linux Ftp1 2.6.32.59-0.7-default #1 SMP 2012-07-13 15:50:56 +0200 x86_64 x86_64 x86_64 GNU/Linux

The crash directory contains three files:

README.txt  vmcore  vmlinux-2.6.32.59-0.7-default

The usual routine: build vmlinux, then open the dump with crash:

A10111916:~ # crash /home/caq/vmlinux /home/zxin11/vmcore

crash 4.0-7.6--------------------------------------------------------------------old version
Copyright (C) 2002, 2003, 2004, 2005, 2006, 2007, 2008  Red Hat, Inc.
Copyright (C) 2004, 2005, 2006  IBM Corporation
Copyright (C) 1999-2006  Hewlett-Packard Co
Copyright (C) 2005, 2006  Fujitsu Limited
Copyright (C) 2006, 2007  VA Linux Systems Japan K.K.
Copyright (C) 2005  NEC Corporation
Copyright (C) 1999, 2002, 2007  Silicon Graphics, Inc.
Copyright (C) 1999, 2000, 2001, 2002  Mission Critical Linux, Inc.
This program is free software, covered by the GNU General Public License,
and you are welcome to change it and/or distribute copies of it under
certain conditions.  Enter "help copying" to see the conditions.
This program has absolutely no warranty.  Enter "help warranty" for details.

GNU gdb 6.1
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu"...

crash: invalid structure size: x8664_pda
       FILE: x86_64.c  LINE: 561  FUNCTION: x86_64_cpu_pda_init()

[/usr/bin/crash] error trace: 535689 => 4569bd => 4cb321 => 4e5300

  4e5300: SIZE_verify+224
  4cb321: x86_64_init+1681
  4569bd: main_loop+93
  535689: (undetermined)

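At first I thought the vmcore had been damaged while copying, so I compared the vmcore on the production machine with the copy I had brought back: same size, same md5. Then I checked the vmlinux I had built, mainly the .config file and whether the gcc version in the build environment matched the gcc version of the production kernel. No problem there either. Only after quite a while did I start to suspect the version of crash itself. To test the idea I copied the vmlinux to the production machine, where crash is version 5.0.1, and it loaded without the error, so the crash version really was the issue. That was a lesson for me: only three things are involved, crash, vmlinux and vmcore. When parsing fails and you have made sure that vmlinux was built correctly and vmcore is complete, check the crash version carefully.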
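The integrity check described above (same size, same md5) can be sketched as follows. This is a minimal sketch, assuming both copies of the dump are reachable as local paths; the function names are hypothetical and not part of any tool mentioned here. Running md5sum on each machine gives the same answer.

```python
import hashlib
import os


def file_md5(path, chunk_size=1 << 20):
    """Return the hex md5 of a file, read in chunks so a multi-GB vmcore fits in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_dump(original, copy):
    """True when size and md5 both match (size is compared first because it is cheap)."""
    if os.path.getsize(original) != os.path.getsize(copy):
        return False
    return file_md5(original) == file_md5(copy)
```

A matching checksum only proves the copy is intact. It says nothing about whether vmlinux and the crash binary fit the dump, which is the part that went wrong here.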

crash 5.0.1------------------------------------------------version shipped with the OS
Copyright (C) 2002-2010  Red Hat, Inc.
Copyright (C) 2004, 2005, 2006  IBM Corporation
Copyright (C) 1999-2006  Hewlett-Packard Co
Copyright (C) 2005, 2006  Fujitsu Limited
Copyright (C) 2006, 2007  VA Linux Systems Japan K.K.
Copyright (C) 2005  NEC Corporation
Copyright (C) 1999, 2002, 2007  Silicon Graphics, Inc.
Copyright (C) 1999, 2000, 2001, 2002  Mission Critical Linux, Inc.
This program is free software, covered by the GNU General Public License,
and you are welcome to change it and/or distribute copies of it under
certain conditions.  Enter "help copying" to see the conditions.
This program has absolutely no warranty.  Enter "help warranty" for details.

GNU gdb (GDB) 7.0
Copyright (C) 2009 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu"...

      KERNEL: vmlinux
    DUMPFILE: vmcore
        CPUS: 48
        DATE: Thu Feb 28 23:10:39 2019
      UPTIME: 71 days, 17:26:09
LOAD AVERAGE: 0.08, 0.13, 0.10
       TASKS: 866
    NODENAME: Ftp1
     RELEASE: 2.6.32.59-0.7-default
     VERSION: #1 SMP 2012-07-13 15:50:56 +0200
     MACHINE: x86_64  (1861 Mhz)
      MEMORY: 32 GB
       PANIC: "[6186227.149497] Oops: 0000 [#1] SMP " (check log for details)
         PID: 0
     COMMAND: "swapper"
        TASK: ffffffff8180c020  (1 of 48)  [THREAD_INFO: ffffffff81800000]
         CPU: 0
       STATE: TASK_RUNNING (ACTIVE)
     WARNING: panic task not found

It reports "panic task not found". A typical dump looks like the one below; this one came with an extra warning.

       PANIC: "[335750.721156] Oops: 0002 [#1] SMP " (check log for details)
         PID: 6879
     COMMAND: "bash"
        TASK: ffff88031b886380  [THREAD_INFO: ffff880319958000]
         CPU: 1
       STATE: TASK_RUNNING (PANIC)

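This was progress compared with crash 4.0-7.6, which seemed like a good sign. I downloaded the crash 5.0.1 source to check and found that the warning itself did not matter much. But I made a fatal mistake, which was not reading this line carefully: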

PANIC: "[6186227.149497] Oops: 0000 [#1] SMP " (check log for details)

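With somewhat newer kernels, the kernel log is saved to a separate dmesg.txt file, and that file is at least enough to confirm the panic stack quickly. Here the PANIC line was telling me to run the log command, and I did not. Since there were not many tasks, I went straight to reading the stack of each process, which led to another detour. Two stacks looked suspicious: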

PID: 44451  TASK: ffff88067bbc6080  CPU: 8   COMMAND: "SMSvr"
 #0 [ffff88067dbf5dc8] schedule at ffffffff813923c4
 #1 [ffff88067dbf5de0] sys_reboot at ffffffff8105e00d
 #2 [ffff88067dbf5e60] do_notify_resume at ffffffff810028c5
 #3 [ffff88067dbf5f30] sys_rt_sigreturn at ffffffff81002aa8
 #4 [ffff88067dbf5f50] ptregscall_common at ffffffff81003216
    RIP: 00007f017b6e8efd  RSP: 00007f0178b71dc0  RFLAGS: 00000293
    RAX: fffffffffffffdfc  RBX: 0000000000000000  RCX: ffffffffffffffff
    RDX: 0000000000000000  RSI: 00007f0178b71df0  RDI: 00007f0178b71df0
    RBP: 00007f0178b71e00   R8: fefefefefefeffff   R9: 0000000000000001
    R10: 0000000000000800  R11: 0000000000000293  R12: 00007fff853b82f0
    R13: 00007f0178b72000  R14: 0000000000000003  R15: 0000000000001000
    ORIG_RAX: 0000000000000023  CS: 0033  SS: 002b

This stack actually contains a call to sys_reboot. I had never seen a reboot cause a panic, and I was not ready to give up on it, so I disassembled sys_reboot:

crash> dis -l sys_reboot
/home/caq/usr/src/linux-2.6.32.59-0.7/kernel/sys.c: 362
0xffffffff8105df50 <sys_reboot>:        test   %edi,0x1(%rbx)
0xffffffff8105df53 <sys_reboot+3>:      add    %al,(%rax)
0xffffffff8105df55 <sys_reboot+5>:      cmp    $0x1f,%ebx
0xffffffff8105df58 <sys_reboot+8>:      jg     0xffffffff8105df69 <sys_reboot+25>
0xffffffff8105df5a <sys_reboot+10>:     lea    -0x1(%rbx),%ecx
0xffffffff8105df5d <sys_reboot+13>:     mov    $0x8430000,%eax
/home/caq/usr/src/linux-2.6.32.59-0.7/kernel/sys.c: 367
0xffffffff8105df62 <sys_reboot+18>:     shr    %cl,%rax
0xffffffff8105df65 <sys_reboot+21>:     test   $0x1,%al
/home/caq/usr/src/linux-2.6.32.59-0.7/kernel/sys.c: 362

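The disassembly is clearly wrong. The reboot code has many cases keyed on magic numbers, yet the matching cmp instructions are missing, and the function does not even set up a stack frame on entry. That immediately made me doubt what crash was showing me again. In principle, crash takes the address of the corresponding symbol from vmlinux and then displays whatever sits at that address in vmcore, so this means vmcore and vmlinux still do not match. Yet this crash build gave no hint of it (I have seen the mismatch reported before, along the lines of WARNING: kernel version inconsistency between vmlinux and dumpfile).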

To verify this, I looked up sys_reboot in the vmlinux I had built:

linux-h9c2:/home/caq # objdump -d vmlinux >caq.txt
linux-h9c2:/home/caq # grep sys_reboot caq.txt
ffffffff8105df50 <sys_reboot>:
linux-h9c2:/home/caq # nm vmlinux |grep -i sys_reboot
ffffffff8105df50 T sys_reboot

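The address is ffffffff8105df50. crash went to this address for sys_reboot, yet what it printed is not the disassembly of sys_reboot. crash cannot be getting something this basic wrong, so vmlinux and vmcore must still be mismatched.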
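One cheap way to compare the two files directly is to look at the "Linux version ..." build banner the kernel embeds, which carries the release, compiler and build timestamp. The sketch below scans a binary for that string. It is only a sketch under assumptions: find_banners is a hypothetical helper, vmlinux must be uncompressed, and a raw scan of vmcore only works when the dump is an uncompressed ELF core rather than a compressed kdump file.

```python
import mmap
import re

# The kernel's build banner starts with "Linux version " and ends at a newline.
BANNER = re.compile(rb"Linux version [^\n\x00]{1,300}")


def find_banners(path):
    """Return the distinct 'Linux version ...' strings found in a binary file."""
    found = []
    with open(path, "rb") as f:
        # Map the file instead of reading it, so a large vmcore is not loaded at once.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mem:
            for match in BANNER.finditer(mem):
                text = match.group(0).decode("ascii", "replace")
                if text not in found:
                    found.append(text)
    return found
```

If the banner found in vmlinux differs from the one found in the dump, the debug kernel was not built from the same source, configuration or build as the kernel that crashed, even when the release string alone looks identical.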

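A reboot call should have nothing to do with a panic anyway, so I dropped this lead: if sys_reboot was wrong, the stack backtraces were probably all wrong too. That left pid 38021.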

crash> bt -f 38021
PID: 38021  TASK: ffff88003531c340  CPU: 2   COMMAND: "sh"
 #0 [ffff880476051de8] schedule at ffffffff813923c4
    ffff880476051df0: 0000000000000000 0000000000000000
    ffff880476051e00: 0000000000000000 0000000000000000
    ffff880476051e10: 0000000000000000 0000000000000000
    ffff880476051e20: 0000000000000000 0000000000000000
    ffff880476051e30: 0000000000000000 0000000000000000
    ffff880476051e40: 0000000000000000 0000000000000000
    ffff880476051e50: 0000000000000000 0000000000000000
    ffff880476051e60: 0000000000000000 0000000000000000
    ffff880476051e70: 0000000000000000 0000000000000000
    ffff880476051e80: 0000000000000000 0000000000000000
    ffff880476051e90: 0000000000000000 0000000000000000
    ffff880476051ea0: 0000000000000000 0000000000000000
    ffff880476051eb0: 0000000000000000 0000000000000000
    ffff880476051ec0: 0000000000000000 0000000000000000
    ffff880476051ed0: 0000000000000000 0000000000000000
    ffff880476051ee0: 0000000000000000 0000000000000000
    ffff880476051ef0: 0000000000000000 0000000000000000
    ffff880476051f00: 0000000000000000 0000000000000000
    ffff880476051f10: 0000000000000000 0000000000000000
    ffff880476051f20: 0000000000000000 0000000000000000
    ffff880476051f30: 00000000006c9870 ffff88027dd62480
    ffff880476051f40: ffff88084c3a8d40 0000000000000000
    ffff880476051f50: 00000000006a0dd0 00007fffbc69e690
    ffff880476051f60: 0000000000000441 00000000006d3040
    ffff880476051f70: 0000000000000003 00000000006d3ba0
    ffff880476051f80: ffffffff81002f7b
 #1 [ffff880476051f80] auditsys at ffffffff81002f7b
    RIP: 00007fb09a95b4f0  RSP: 00007fffbc69e6c0  RFLAGS: 00010202
    RAX: 0000000000000002  RBX: ffffffff81002f7b  RCX: 0000000000000000
    RDX: 00000000000001b6  RSI: 0000000000000441  RDI: 00000000006d3040
    RBP: 00000000006d3ba0   R8: 0000000000000020   R9: 6c6568732f6d732f
    R10: 0000000000000020  R11: 0000000000000246  R12: 0000000000000003
    R13: 00000000006d3040  R14: 0000000000000441  R15: 00007fffbc69e690
    ORIG_RAX: 0000000000000002  CS: 0033  SS: 002b

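The stack does not look right. auditsys is not a system call entry point; the first frame pushed should normally be the familiar system_call_fastpath. Looking up the address directly on the production machine: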

Ftp1:/home # grep ffffffff81002f /proc/kallsyms
ffffffff81002f00 T system_call_after_swapgs
ffffffff81002f65 t system_call_fastpath
ffffffff81002f80 t ret_from_sys_call
ffffffff81002f85 t sysret_check
ffffffff81002fd8 t sysret_careful
ffffffff81002fe8 t sysret_signal

So ffffffff81002f7b falls inside system_call_fastpath, which starts at ffffffff81002f65 and ends where ret_from_sys_call begins at ffffffff81002f80.
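The lookup done by eye above, finding the last symbol that starts at or before an address, can be sketched with a binary search. This is a minimal sketch, assuming input in the "address type name" layout of /proc/kallsyms; parse_kallsyms and resolve are hypothetical names, and the sample table is copied from the grep output above.

```python
import bisect


def parse_kallsyms(text):
    """Parse 'address type name' lines into a list of (address, name) sorted by address."""
    symbols = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) >= 3:
            symbols.append((int(parts[0], 16), parts[2]))
    symbols.sort()
    return symbols


def resolve(symbols, address):
    """Return (name, offset) for the last symbol starting at or before address, else None."""
    starts = [start for start, _ in symbols]
    index = bisect.bisect_right(starts, address) - 1
    if index < 0:
        return None
    start, name = symbols[index]
    return name, address - start


KALLSYMS = """\
ffffffff81002f00 T system_call_after_swapgs
ffffffff81002f65 t system_call_fastpath
ffffffff81002f80 t ret_from_sys_call
ffffffff81002f85 t sysret_check
"""

name, offset = resolve(parse_kallsyms(KALLSYMS), 0xFFFFFFFF81002F7B)
print(f"{name}+{offset:#x}")  # prints system_call_fastpath+0x16
```

Resolving a few return addresses from a backtrace against the running kernel's own symbol table is a quick cross-check on whether the names crash prints can be trusted.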

So this crash build was unusable: its symbol mapping was wrong. I got hold of a newer crash, version 7.0.9:

crash 7.0.9---------------------------------------------------newer version
Copyright (C) 2002-2014  Red Hat, Inc.
Copyright (C) 2004, 2005, 2006, 2010  IBM Corporation
Copyright (C) 1999-2006  Hewlett-Packard Co
Copyright (C) 2005, 2006, 2011, 2012  Fujitsu Limited
Copyright (C) 2006, 2007  VA Linux Systems Japan K.K.
Copyright (C) 2005, 2011  NEC Corporation
Copyright (C) 1999, 2002, 2007  Silicon Graphics, Inc.
Copyright (C) 1999, 2000, 2001, 2002  Mission Critical Linux, Inc.
This program is free software, covered by the GNU General Public License,
and you are welcome to change it and/or distribute copies of it under
certain conditions.  Enter "help copying" to see the conditions.
This program has absolutely no warranty.  Enter "help warranty" for details.

crash: vmlinux: no .gnu_debuglink section
GNU gdb (GDB) 7.6
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu"...

WARNING: kernel version inconsistency between vmlinux and dumpfile------------------------a warning this time

      KERNEL: vmlinux
    DUMPFILE: vmcore
        CPUS: 48
        DATE: Thu Feb 28 23:10:39 2019
      UPTIME: 71 days, 17:26:09
LOAD AVERAGE: 0.08, 0.13, 0.10
       TASKS: 867
    NODENAME: Ftp1
     RELEASE: 2.6.32.59-0.7-default
     VERSION: #1 SMP 2012-07-13 15:50:56 +0200
     MACHINE: x86_64  (1861 Mhz)
      MEMORY: 32 GB
       PANIC: "[6186227.149497] Oops: 0000 [#1] SMP " (check log for details)
         PID: 38021
     COMMAND: "sh"-------------------------------------------------------finds the actual panic task, more reliable than the previous version
        TASK: ffff88003531c340  [THREAD_INFO: ffff880476050000]
         CPU: 2
       STATE: TASK_RUNNING (PANIC)

With crash 7.0.9 loaded, I finally ran the log command. The relevant part of the log reads:

[6186227.149460] BUG: unable to handle kernel NULL pointer dereference at (null)
[6186227.149479] IP: [<ffffffff811e7752>] strlen+0x2/0x30
[6186227.149492] PGD 47b9be067 PUD 42e601067 PMD 0
[6186227.149497] Oops: 0000 [#1] SMP
[6186227.149502] last sysfs file: /sys/devices/pci0000:40/0000:40:07.0/0000:45:00.1/host4/rport-4:0-0/target4:0:0/4:0:0:0/state
[6186227.149510] CPU 2
[6186227.149513] Modules linked in: secureProof(N) iptable_filter ip_tables x_tables dm_round_robin dm_multipath scsi_dh ipv6 bonding microcode fuse loop dm_mod tpm_tis dcdbas(X) tpm qla2xxx usbhid tpm_bios hid iTCO_wdt scsi_transport_fc iTCO_vendor_support serio_raw sr_mod scsi_tgt ses cdrom pcspkr enclosure bnx2 sg rtc_cmos rtc_core rtc_lib wmi power_meter button uhci_hcd ehci_hcd usbcore sd_mod crc_t10dif edd ext3 mbcache jbd fan processor ide_pci_generic ide_core ata_generic ata_piix libata megaraid_sas thermal thermal_sys hwmon mpdh(N) mpdt(N) scsi_mod [last unloaded: secureProof]
[6186227.149571] Supported: Yes
[6186227.149577] Pid: 38021, comm: sh Tainted: G          NX 2.6.32.59-0.7-default #1 PowerEdge R910
[6186227.149582] RIP: 0010:[<ffffffff811e7752>]  [<ffffffff811e7752>] strlen+0x2/0x30
[6186227.149588] RSP: 0018:ffff880476051280  EFLAGS: 00010246
[6186227.149592] RAX: 0000000000000000 RBX: ffff8805f94ec000 RCX: 0000000000000000
[6186227.149596] RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000000
[6186227.149600] RBP: 0000000000000000 R08: ffff8804760511f8 R09: ffffffff81539570
[6186227.149604] R10: 0000000000000020 R11: 0000000000000fff R12: ffff880476051d38
[6186227.149608] R13: ffff88067bbc6080 R14: ffffffffa03d8f79 R15: 0000000000000000
[6186227.149612] FS:  00007fb09b21e700(0000) GS:ffff880487400000(0000) knlGS:0000000000000000
[6186227.149617] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[6186227.149621] CR2: 0000000000000000 CR3: 0000000473f4d000 CR4: 00000000000006e0
[6186227.149625] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[6186227.149629] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[6186227.149634] Process sh (pid: 38021, threadinfo ffff880476050000, task ffff88003531c340)
[6186227.149638] Stack:
[6186227.149640]  ffffffffa03d5768 ffffffffa03d8e4c ffff88067bbc6080 ffff880476051d38
[6186227.149645] <0> ffff8804760514c8 0000000000000019 ffffffffa03d5b53 ffff880476051d58
[6186227.149650] <0> ffffffffa03d8e4c 787a2f656d6f682f 7374642f30316e69 76534d532f6d732f
[6186227.149657] Call Trace:
[6186227.149678]  [<ffffffffa03d5768>] getprocpath+0xa8/0x150 [secureProof]
[6186227.149701]  [<ffffffffa03d5b53>] checkTrustProc+0x83/0x270 [secureProof]
[6186227.149710]  [<ffffffffa03d66ca>] checkProcAndFile+0x3da/0x890 [secureProof]
[6186227.149720]  [<ffffffffa03d7aba>] our_sys_open+0xfa/0x1d0 [secureProof]-----------the open hooked by our module
[6186227.149736]  [<ffffffff81002f7b>] system_call_fastpath+0x16/0x1b
[6186227.149745]  [<00007fb09a95b4f0>] 0x7fb09a95b4f0
[6186227.149749] Code: 00 48 83 c7 01 0f b6 07 84 c0 74 0c 0f b6 c0 f6 80 a0 08 85 81 20 75 e9 48 89 f8 c3 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 31 c0 <80> 3f 00 48 89 fa 74 15 66 0f 1f 44 00 00 48 83 c2 01 80 3a 00
[6186227.149779] RIP  [<ffffffff811e7752>] strlen+0x2/0x30
[6186227.149784]  RSP <ffff880476051280>
[6186227.149787] CR2: 0000000000000000

This matches the task crash identified: the sh process, pid 38021.

Then look at the code of strlen:

/home/caq/usr/src/linux-2.6.32.59-0.7/lib/string.c: 379
0xffffffff811e1750 <strlen>:    ljmpq  *(%rcx)
0xffffffff811e1752 <strlen+2>:  icebp

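The cause is a NULL pointer. In the register dump in the log, RDI, which carries the argument to strlen, is 0 (RCX is 0 as well), and the faulting bytes <80> 3f 00 in the Code: line decode to cmpb $0x0,(%rdi). The disassembly shown above appears to come from the still-mismatched vmlinux (its strlen address differs from the one in the log), so its instructions should not be taken literally. The flow in our module's code is at fault: getprocpath passes a NULL pointer to strlen, which dereferences it and crashes the kernel.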
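The Code: line of an oops marks the first byte of the faulting instruction with <..>, so the instruction can be recovered from the log alone, without relying on a matching vmlinux. The following is a minimal sketch of pulling those bytes out; faulting_bytes is a hypothetical helper, and the sample is a shortened copy of the Code: line from the log above.

```python
def faulting_bytes(code_line):
    """Split an oops 'Code:' line into (bytes before the fault, bytes from the fault on)."""
    # Drop any timestamp and the 'Code:' label, keep only the hex tokens.
    tokens = code_line.split("Code:", 1)[-1].split()
    before, after, seen = [], [], False
    for token in tokens:
        if token.startswith("<") and token.endswith(">"):
            seen = True           # the marked byte starts the faulting instruction
            token = token[1:-1]
        (after if seen else before).append(int(token, 16))
    if not seen:
        raise ValueError("no <..> marker found in Code: line")
    return bytes(before), bytes(after)


CODE = "[6186227.149749] Code: 00 00 00 00 31 c0 <80> 3f 00 48 89 fa 74 15"
before, after = faulting_bytes(CODE)
# 31 c0 is xor %eax,%eax; 80 3f 00 is the x86-64 encoding of cmpb $0x0,(%rdi).
print(before[-2:].hex(), after[:3].hex())  # prints 31c0 803f00
```

The bytes after the marker can be written to a file and passed to a disassembler such as objdump to get the exact instruction that trapped.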

To summarize:

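1. When analyzing a dump, use as recent a version of crash as you can. When a particular crash build has trouble parsing the dump, switch to another one without hesitation, and take the warnings crash prints seriously.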

2. Even veterans slip up. The gcc version used to build vmlinux should preferably match the gcc version that built the running kernel.

Reposted from: https://www.cnblogs.com/10087622blog/p/10609159.html
