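A crash tool version problem on SUSE 11 SP1
Over the past few years I have debugged quite a few kernel crashes of various kinds. On this one, though, I took a detour I should not have taken, and it looks rather amateurish in hindsight. I am writing it down so that I do not repeat the mistake, and as a warning to others.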
The kernel is SUSE 11 SP1:
uname -a
Linux Ftp1 2.6.32.59-0.7-default #1 SMP 2012-07-13 15:50:56 +0200 x86_64 x86_64 x86_64 GNU/Linux
The crash directory contains three files:
README.txt
vmcore
vmlinux-2.6.32.59-0.7-default
The usual routine: build vmlinux, then open the dump with crash:
A10111916:~ # crash /home/caq/vmlinux /home/zxin11/vmcore
crash 4.0-7.6    <-- older version
Copyright (C) 2002, 2003, 2004, 2005, 2006, 2007, 2008 Red Hat, Inc.
Copyright (C) 2004, 2005, 2006 IBM Corporation
Copyright (C) 1999-2006 Hewlett-Packard Co
Copyright (C) 2005, 2006 Fujitsu Limited
Copyright (C) 2006, 2007 VA Linux Systems Japan K.K.
Copyright (C) 2005 NEC Corporation
Copyright (C) 1999, 2002, 2007 Silicon Graphics, Inc.
Copyright (C) 1999, 2000, 2001, 2002 Mission Critical Linux, Inc.
This program is free software, covered by the GNU General Public License,
and you are welcome to change it and/or distribute copies of it under
certain conditions. Enter "help copying" to see the conditions.
This program has absolutely no warranty. Enter "help warranty" for details.
GNU gdb 6.1
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu"...
crash: invalid structure size: x8664_pda
FILE: x86_64.c LINE: 561 FUNCTION: x86_64_cpu_pda_init()
[/usr/bin/crash] error trace: 535689 => 4569bd => 4cb321 => 4e5300
4e5300: SIZE_verify+224
4cb321: x86_64_init+1681
4569bd: main_loop+93
535689: (undetermined)
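I first suspected that the vmcore had been damaged while copying, so I compared the production vmcore with the copy: same size, same md5. Then I checked the vmlinux I had built, mainly the .config file and whether the gcc version of the build environment matched the gcc that built the production kernel. No problem there either. Only after a good while did I begin to suspect the crash version. To test that idea I copied vmlinux to the production machine, where crash is version 5.0.1, and the error no longer appeared, so the crash version really did matter. Lesson learned: only three things are involved (crash, vmlinux, vmcore), so when parsing fails and you have made sure that vmlinux is built correctly and vmcore is complete, check the crash version carefully.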
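The size and md5 comparison described above can be scripted so it is not skipped next time. The following is a minimal sketch, not the procedure actually used in this investigation; the function names `file_md5` and `same_content` are made up for illustration. It reads the file in chunks because a vmcore can be tens of gigabytes.

```python
import hashlib


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it chunk by chunk."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # iter() with a sentinel stops when read() returns an empty bytes object
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(path_a, path_b):
    """True if both files have the same MD5 digest (hypothetical helper)."""
    return file_md5(path_a) == file_md5(path_b)
```

Calling `same_content` on the production vmcore and the local copy answers the "was the copy damaged" question in one step. A matching digest only rules out copy damage; it says nothing about whether vmlinux matches the dump.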
crash 5.0.1    <-- version shipped with the OS
Copyright (C) 2002-2010 Red Hat, Inc.
Copyright (C) 2004, 2005, 2006 IBM Corporation
Copyright (C) 1999-2006 Hewlett-Packard Co
Copyright (C) 2005, 2006 Fujitsu Limited
Copyright (C) 2006, 2007 VA Linux Systems Japan K.K.
Copyright (C) 2005 NEC Corporation
Copyright (C) 1999, 2002, 2007 Silicon Graphics, Inc.
Copyright (C) 1999, 2000, 2001, 2002 Mission Critical Linux, Inc.
This program is free software, covered by the GNU General Public License,
and you are welcome to change it and/or distribute copies of it under
certain conditions. Enter "help copying" to see the conditions.
This program has absolutely no warranty. Enter "help warranty" for details.
GNU gdb (GDB) 7.0
Copyright (C) 2009 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu"...
KERNEL: vmlinux
DUMPFILE: vmcore
CPUS: 48
DATE: Thu Feb 28 23:10:39 2019
UPTIME: 71 days, 17:26:09
LOAD AVERAGE: 0.08, 0.13, 0.10
TASKS: 866
NODENAME: Ftp1
RELEASE: 2.6.32.59-0.7-default
VERSION: #1 SMP 2012-07-13 15:50:56 +0200
MACHINE: x86_64 (1861 Mhz)
MEMORY: 32 GB
PANIC: "[6186227.149497] Oops: 0000 [#1] SMP " (check log for details)
PID: 0
COMMAND: "swapper"
TASK: ffffffff8180c020 (1 of 48) [THREAD_INFO: ffffffff81800000]
CPU: 0
STATE: TASK_RUNNING (ACTIVE)
WARNING: panic task not found
Surprisingly, it reports "panic task not found". A typical dump looks like the following; this one came with an extra warning:
PANIC: "[335750.721156] Oops: 0002 [#1] SMP " (check log for details)
PID: 6879
COMMAND: "bash"
TASK: ffff88031b886380 [THREAD_INFO: ffff880319958000]
CPU: 1
STATE: TASK_RUNNING (PANIC)
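Still, this was progress over crash 4.0-7.6 and seemed a good sign. I downloaded the crash 5.0.1 source to check and found that this warning does not matter much. But I made a fatal mistake: I did not read the following line carefully: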
PANIC: "[6186227.149497] Oops: 0000 [#1] SMP " (check log for details)
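With somewhat newer kernels, dmesg is saved to a separate dmesg.txt file, from which the panic stack can at least be identified quickly. Here the PANIC line explicitly tells me to check the output of the log command, and I did not. Since there were not many tasks, I went straight to the stacks of the individual processes, which was another detour. Two stacks looked suspicious: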
PID: 44451 TASK: ffff88067bbc6080 CPU: 8 COMMAND: "SMSvr"
#0 [ffff88067dbf5dc8] schedule at ffffffff813923c4
#1 [ffff88067dbf5de0] sys_reboot at ffffffff8105e00d
#2 [ffff88067dbf5e60] do_notify_resume at ffffffff810028c5
#3 [ffff88067dbf5f30] sys_rt_sigreturn at ffffffff81002aa8
#4 [ffff88067dbf5f50] ptregscall_common at ffffffff81003216
RIP: 00007f017b6e8efd RSP: 00007f0178b71dc0 RFLAGS: 00000293
RAX: fffffffffffffdfc RBX: 0000000000000000 RCX: ffffffffffffffff
RDX: 0000000000000000 RSI: 00007f0178b71df0 RDI: 00007f0178b71df0
RBP: 00007f0178b71e00 R8: fefefefefefeffff R9: 0000000000000001
R10: 0000000000000800 R11: 0000000000000293 R12: 00007fff853b82f0
R13: 00007f0178b72000 R14: 0000000000000003 R15: 0000000000001000
ORIG_RAX: 0000000000000023 CS: 0033 SS: 002b
This stack actually contains a sys_reboot call. I had never seen a reboot cause a panic, so, unwilling to give up, I disassembled sys_reboot:
crash> dis -l sys_reboot
/home/caq/usr/src/linux-2.6.32.59-0.7/kernel/sys.c: 362
0xffffffff8105df50 <sys_reboot>: test %edi,0x1(%rbx)
0xffffffff8105df53 <sys_reboot+3>: add %al,(%rax)
0xffffffff8105df55 <sys_reboot+5>: cmp $0x1f,%ebx
0xffffffff8105df58 <sys_reboot+8>: jg 0xffffffff8105df69 <sys_reboot+25>
0xffffffff8105df5a <sys_reboot+10>: lea -0x1(%rbx),%ecx
0xffffffff8105df5d <sys_reboot+13>: mov $0x8430000,%eax
/home/caq/usr/src/linux-2.6.32.59-0.7/kernel/sys.c: 367
0xffffffff8105df62 <sys_reboot+18>: shr %cl,%rax
0xffffffff8105df65 <sys_reboot+21>: test $0x1,%al
/home/caq/usr/src/linux-2.6.32.59-0.7/kernel/sys.c: 362
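The disassembly is clearly wrong. The reboot code checks several magic numbers in its case branches, yet the expected cmp instructions against those values are missing, and the function does not even set up a stack frame on entry. That immediately made me doubt the crash output again. In principle crash takes the address of a symbol from vmlinux and then displays whatever sits at that address in vmcore, so this means vmlinux and vmcore still do not correspond. Yet this crash version gave no hint of that (I have seen a mismatch warning before, something like WARNING: kernel version inconsistency between vmlinux and dumpfile).
To verify the idea, I looked up sys_reboot in the vmlinux I had built: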
linux-h9c2:/home/caq # objdump -d vmlinux >caq.txt
linux-h9c2:/home/caq # grep sys_reboot caq.txt
ffffffff8105df50 <sys_reboot>:
linux-h9c2:/home/caq # nm vmlinux |grep -i sys_reboot
ffffffff8105df50 T sys_reboot
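The address is ffffffff8105df50. crash looked up sys_reboot at that address, yet what it printed is not the disassembly of sys_reboot. The crash tool is unlikely to have such a basic bug, so vmlinux and vmcore must still not correspond.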
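One way to catch such a mismatch before starting crash is to compare the kernel banner embedded in vmlinux with the contents of /proc/version on the machine that produced the dump. The sketch below rests on an assumption: that the banner is stored in vmlinux as a plain "Linux version ..." string. `find_linux_banner` and `banners_match` are illustrative names, not part of any tool.

```python
import mmap
import re

# Printable ASCII run that starts with "Linux version <digit>"; requiring a
# digit skips format strings such as "Linux version %s".
BANNER_RE = re.compile(rb"Linux version [0-9][ -~]*")


def find_linux_banner(path):
    """Return the first kernel banner string found in the file, or None."""
    with open(path, "rb") as f:
        # Map the file instead of reading it, since vmlinux can be large.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mem:
            match = BANNER_RE.search(mem)
            return match.group().decode("ascii") if match else None


def banners_match(vmlinux_path, proc_version_text):
    """Compare the banner in vmlinux with the text of /proc/version."""
    banner = find_linux_banner(vmlinux_path)
    return banner is not None and banner.strip() == proc_version_text.strip()
```

A self-built vmlinux normally carries its own build host and build date in the banner, so a mismatch here would be an early sign that the symbols may not line up with the dump.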
A reboot call should have nothing to do with a panic anyway, so I dropped this lead: if sys_reboot is wrong, the whole backtrace may well be wrong too.
That left pid 38021.
crash> bt -f 38021
PID: 38021 TASK: ffff88003531c340 CPU: 2 COMMAND: "sh"
#0 [ffff880476051de8] schedule at ffffffff813923c4
ffff880476051df0: 0000000000000000 0000000000000000
ffff880476051e00: 0000000000000000 0000000000000000
ffff880476051e10: 0000000000000000 0000000000000000
ffff880476051e20: 0000000000000000 0000000000000000
ffff880476051e30: 0000000000000000 0000000000000000
ffff880476051e40: 0000000000000000 0000000000000000
ffff880476051e50: 0000000000000000 0000000000000000
ffff880476051e60: 0000000000000000 0000000000000000
ffff880476051e70: 0000000000000000 0000000000000000
ffff880476051e80: 0000000000000000 0000000000000000
ffff880476051e90: 0000000000000000 0000000000000000
ffff880476051ea0: 0000000000000000 0000000000000000
ffff880476051eb0: 0000000000000000 0000000000000000
ffff880476051ec0: 0000000000000000 0000000000000000
ffff880476051ed0: 0000000000000000 0000000000000000
ffff880476051ee0: 0000000000000000 0000000000000000
ffff880476051ef0: 0000000000000000 0000000000000000
ffff880476051f00: 0000000000000000 0000000000000000
ffff880476051f10: 0000000000000000 0000000000000000
ffff880476051f20: 0000000000000000 0000000000000000
ffff880476051f30: 00000000006c9870 ffff88027dd62480
ffff880476051f40: ffff88084c3a8d40 0000000000000000
ffff880476051f50: 00000000006a0dd0 00007fffbc69e690
ffff880476051f60: 0000000000000441 00000000006d3040
ffff880476051f70: 0000000000000003 00000000006d3ba0
ffff880476051f80: ffffffff81002f7b
#1 [ffff880476051f80] auditsys at ffffffff81002f7b
RIP: 00007fb09a95b4f0 RSP: 00007fffbc69e6c0 RFLAGS: 00010202
RAX: 0000000000000002 RBX: ffffffff81002f7b RCX: 0000000000000000
RDX: 00000000000001b6 RSI: 0000000000000441 RDI: 00000000006d3040
RBP: 00000000006d3ba0 R8: 0000000000000020 R9: 6c6568732f6d732f
R10: 0000000000000020 R11: 0000000000000246 R12: 0000000000000003
R13: 00000000006d3040 R14: 0000000000000441 R15: 00007fffbc69e690
ORIG_RAX: 0000000000000002 CS: 0033 SS: 002b
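The stack does not look right: auditsys is not a system call entry point, and the first frame pushed should normally be the familiar system_call_fastpath. So I looked the address up directly on the production machine: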
Ftp1:/home # grep ffffffff81002f /proc/kallsyms
ffffffff81002f00 T system_call_after_swapgs
ffffffff81002f65 t system_call_fastpath
ffffffff81002f80 t ret_from_sys_call
ffffffff81002f85 t sysret_check
ffffffff81002fd8 t sysret_careful
ffffffff81002fe8 t sysret_signal
So ffffffff81002f7b falls within the address range of system_call_fastpath, not auditsys.
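The lookup above amounts to finding the last symbol whose start address is not greater than the address in question. A small sketch of that step, using only the kallsyms lines quoted above (the helper names `parse_kallsyms` and `resolve` are made up for illustration):

```python
import bisect

# The /proc/kallsyms lines quoted above, from the production machine.
KALLSYMS = """\
ffffffff81002f00 T system_call_after_swapgs
ffffffff81002f65 t system_call_fastpath
ffffffff81002f80 t ret_from_sys_call
ffffffff81002f85 t sysret_check
ffffffff81002fd8 t sysret_careful
ffffffff81002fe8 t sysret_signal
"""


def parse_kallsyms(text):
    """Parse 'address type name' lines into a sorted list of (address, name)."""
    symbols = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) >= 3:
            symbols.append((int(parts[0], 16), parts[2]))
    symbols.sort()
    return symbols


def resolve(symbols, address):
    """Return (name, offset) of the symbol containing address, or None."""
    starts = [start for start, _ in symbols]
    index = bisect.bisect_right(starts, address) - 1
    if index < 0:
        return None  # address lies below the first known symbol
    start, name = symbols[index]
    return name, address - start


symbols = parse_kallsyms(KALLSYMS)
name, offset = resolve(symbols, 0xFFFFFFFF81002F7B)
print(f"{name}+{offset:#x}")  # system_call_fastpath+0x16
```

The result only says which symbol precedes the address in this excerpt; with a full symbol table the same lookup gives the real enclosing function.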
So this crash tool is unusable with this setup; its symbol mapping is wrong. I found a newer crash, version 7.0.9:
crash 7.0.9    <-- newer version
Copyright (C) 2002-2014 Red Hat, Inc.
Copyright (C) 2004, 2005, 2006, 2010 IBM Corporation
Copyright (C) 1999-2006 Hewlett-Packard Co
Copyright (C) 2005, 2006, 2011, 2012 Fujitsu Limited
Copyright (C) 2006, 2007 VA Linux Systems Japan K.K.
Copyright (C) 2005, 2011 NEC Corporation
Copyright (C) 1999, 2002, 2007 Silicon Graphics, Inc.
Copyright (C) 1999, 2000, 2001, 2002 Mission Critical Linux, Inc.
This program is free software, covered by the GNU General Public License,
and you are welcome to change it and/or distribute copies of it under
certain conditions. Enter "help copying" to see the conditions.
This program has absolutely no warranty. Enter "help warranty" for details.
crash: vmlinux: no .gnu_debuglink section
GNU gdb (GDB) 7.6
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu"...
WARNING: kernel version inconsistency between vmlinux and dumpfile    <-- a warning this time
KERNEL: vmlinux
DUMPFILE: vmcore
CPUS: 48
DATE: Thu Feb 28 23:10:39 2019
UPTIME: 71 days, 17:26:09
LOAD AVERAGE: 0.08, 0.13, 0.10
TASKS: 867
NODENAME: Ftp1
RELEASE: 2.6.32.59-0.7-default
VERSION: #1 SMP 2012-07-13 15:50:56 +0200
MACHINE: x86_64 (1861 Mhz)
MEMORY: 32 GB
PANIC: "[6186227.149497] Oops: 0000 [#1] SMP " (check log for details)
PID: 38021
COMMAND: "sh"    <-- the panic task is found; more reliable than the previous version
TASK: ffff88003531c340 [THREAD_INFO: ffff880476050000]
CPU: 2
STATE: TASK_RUNNING (PANIC)
After upgrading to 7.0.9, I ran the log command.
The log shows:
[6186227.149460] BUG: unable to handle kernel NULL pointer dereference at (null)
[6186227.149479] IP: [<ffffffff811e7752>] strlen+0x2/0x30
[6186227.149492] PGD 47b9be067 PUD 42e601067 PMD 0
[6186227.149497] Oops: 0000 [#1] SMP
[6186227.149502] last sysfs file: /sys/devices/pci0000:40/0000:40:07.0/0000:45:00.1/host4/rport-4:0-0/target4:0:0/4:0:0:0/state
[6186227.149510] CPU 2
[6186227.149513] Modules linked in: secureProof(N) iptable_filter ip_tables x_tables dm_round_robin dm_multipath scsi_dh ipv6 bonding microcode fuse loop dm_mod tpm_tis dcdbas(X) tpm qla2xxx usbhid tpm_bios hid iTCO_wdt scsi_transport_fc iTCO_vendor_support serio_raw sr_mod scsi_tgt ses cdrom pcspkr enclosure bnx2 sg rtc_cmos rtc_core rtc_lib wmi power_meter button uhci_hcd ehci_hcd usbcore sd_mod crc_t10dif edd ext3 mbcache jbd fan processor ide_pci_generic ide_core ata_generic ata_piix libata megaraid_sas thermal thermal_sys hwmon mpdh(N) mpdt(N) scsi_mod [last unloaded: secureProof]
[6186227.149571] Supported: Yes
[6186227.149577] Pid: 38021, comm: sh Tainted: G NX 2.6.32.59-0.7-default #1 PowerEdge R910
[6186227.149582] RIP: 0010:[<ffffffff811e7752>] [<ffffffff811e7752>] strlen+0x2/0x30
[6186227.149588] RSP: 0018:ffff880476051280 EFLAGS: 00010246
[6186227.149592] RAX: 0000000000000000 RBX: ffff8805f94ec000 RCX: 0000000000000000
[6186227.149596] RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000000
[6186227.149600] RBP: 0000000000000000 R08: ffff8804760511f8 R09: ffffffff81539570
[6186227.149604] R10: 0000000000000020 R11: 0000000000000fff R12: ffff880476051d38
[6186227.149608] R13: ffff88067bbc6080 R14: ffffffffa03d8f79 R15: 0000000000000000
[6186227.149612] FS: 00007fb09b21e700(0000) GS:ffff880487400000(0000) knlGS:0000000000000000
[6186227.149617] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[6186227.149621] CR2: 0000000000000000 CR3: 0000000473f4d000 CR4: 00000000000006e0
[6186227.149625] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[6186227.149629] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[6186227.149634] Process sh (pid: 38021, threadinfo ffff880476050000, task ffff88003531c340)
[6186227.149638] Stack:
[6186227.149640] ffffffffa03d5768 ffffffffa03d8e4c ffff88067bbc6080 ffff880476051d38
[6186227.149645] <0> ffff8804760514c8 0000000000000019 ffffffffa03d5b53 ffff880476051d58
[6186227.149650] <0> ffffffffa03d8e4c 787a2f656d6f682f 7374642f30316e69 76534d532f6d732f
[6186227.149657] Call Trace:
[6186227.149678] [<ffffffffa03d5768>] getprocpath+0xa8/0x150 [secureProof]
[6186227.149701] [<ffffffffa03d5b53>] checkTrustProc+0x83/0x270 [secureProof]
[6186227.149710] [<ffffffffa03d66ca>] checkProcAndFile+0x3da/0x890 [secureProof]
[6186227.149720] [<ffffffffa03d7aba>] our_sys_open+0xfa/0x1d0 [secureProof]    <-- open() hooked by our module
[6186227.149736] [<ffffffff81002f7b>] system_call_fastpath+0x16/0x1b
[6186227.149745] [<00007fb09a95b4f0>] 0x7fb09a95b4f0
[6186227.149749] Code: 00 48 83 c7 01 0f b6 07 84 c0 74 0c 0f b6 c0 f6 80 a0 08 85 81 20 75 e9 48 89 f8 c3 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 31 c0 <80> 3f 00 48 89 fa 74 15 66 0f 1f 44 00 00 48 83 c2 01 80 3a 00
[6186227.149779] RIP [<ffffffff811e7752>] strlen+0x2/0x30
[6186227.149784] RSP <ffff880476051280>
[6186227.149787] CR2: 0000000000000000
This output agrees with the task crash identified: the sh process, pid 38021.
Then look at the code of strlen:
/home/caq/usr/src/linux-2.6.32.59-0.7/lib/string.c: 379
0xffffffff811e1750 <strlen>: ljmpq *(%rcx)
0xffffffff811e1752 <strlen+2>: icebp
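The crash is a NULL pointer dereference inside strlen, reached from our own module: the business code passes a NULL pointer down the call chain without checking it. One caveat about the register involved. The disassembly above suggests rcx, but it comes from the self-built vmlinux, which does not match the dump (it lists strlen at ffffffff811e1750, while the Oops reports strlen+0x2 at ffffffff811e7752), so it should not be trusted. The Oops itself is the more reliable source: CR2 is 0, RDI is 0, and the faulting bytes marked in the Code: line (<80> 3f 00) are the encoding of cmpb $0x0,(%rdi). So the NULL value is the string pointer in rdi, the first argument of strlen.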
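The faulting instruction can be cross-checked from the Oops alone, without relying on vmlinux, by decoding the bytes that follow the <..> marker in the Code: line. The sketch below decodes only the ModRM byte by hand; it is not a disassembler, and the helper names are made up for illustration. On x86-64, opcode 0x80 with ModRM reg field 7 is CMP r/m8, imm8, and mod 0 with rm 7 (no REX prefix) addresses memory through %rdi, which points at RDI rather than the RCX shown in the disassembly above.

```python
def faulting_bytes(code_line):
    """Return the bytes starting at the <xx> marker of an Oops Code: line."""
    tokens = code_line.split()
    for i, token in enumerate(tokens):
        if token.startswith("<") and token.endswith(">"):
            return bytes(int(t.strip("<>"), 16) for t in tokens[i:])
    return b""  # no marker present


def decode_modrm(byte):
    """Split a ModRM byte into its (mod, reg, rm) fields."""
    return (byte >> 6) & 0x3, (byte >> 3) & 0x7, byte & 0x7


# Tail of the Code: line from the Oops; <80> marks the faulting instruction.
CODE_TAIL = "31 c0 <80> 3f 00 48 89 fa 74 15"

insn = faulting_bytes(CODE_TAIL)
opcode, modrm, imm8 = insn[0], insn[1], insn[2]
mod, reg, rm = decode_modrm(modrm)
print(hex(opcode), mod, reg, rm, imm8)  # 0x80 0 7 7 0
```

Read together with RDI: 0000000000000000 and CR2: 0000000000000000 in the register dump, this is the byte read from address 0 that strlen performs on its argument.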
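Summary:
1. When analysing a dump, use as recent a crash version as possible. If a given crash version misparses the dump, switch without hesitation, and take every warning the tool prints seriously.
2. Even veterans slip up. The gcc used to build vmlinux should preferably be the same version as the gcc that built the running kernel.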
Reposted from: https://www.cnblogs.com/10087622blog/p/10609159.html