跟coredump文件一样,ramdump文件也是内存转储文件,但是ramdump文件更大,因为它几乎是对整个物理内存的镜像,除了一些security类型的memory抓不出来之外,几乎所有的dram都被抓下来,有些问题的复现概率低,而且有些问题是由于踩内存导致的,这种问题靠log往往是无法分析出来的,所以如果可以在问题发生时候把内存镜像保存下来,对于分析问题是很有帮助的。

MTK平台抓ramdump的方法

因为每个平台抓取ramdump的方式都不一样,所以这里以MTK平台作为示例,在MTK平台上面,user/user debug build抓取ramdump的方法如下所示,

另外由于MTK平台和Android、Kernel的升级,下面的这个方式也不一定会有用,如果抓取不了可以咨询芯片厂家。

第1步:修改LK的make file "bootable/bootloader/lk/app/mt_boot/rules.mk"

build_mt_ramdump := yes ##原先是no,这里修改为yes

ifneq ($(TARGET_BUILD_VARIANT),user)

ifeq ($(ARCH_HAVE_MT_RAMDUMP),yes)

build_mt_ramdump := yes

endif

endif

第2步:修改user-build kernel config(kernel/arch/arm/configs/xxx_defconfig或

kernel/arch/arm64/configs/xxx_defconfig)把(没有对应的config则不需要修改)

# CONFIG_MTK_AEE_MRDUMP is not set

# CONFIG_HAVE_DDR_RESERVE_MODE is not set

修改为enable

CONFIG_MTK_AEE_MRDUMP=y

CONFIG_HAVE_DDR_RESERVE_MODE=y

第3步:编译下载后复现时请打开mtklogger,如果没有打开也不会抓ramdump。

crash工具介绍

同GDB工具一样,crash也是一个很有用的分析工具,它的官网crash,当前最新的版本为7.2.3,最新的版本支持ARM64平台,将代码下载下来之后解压,编译支持arm64的版本

taotaomami@taotaomami$ ls

alpha.c dev.c gdb_interface.c lkcd_dump_v7.h lkcd_x86_trace.c netdump.h s390dbf.c unwind.c vmware_vmss.c xen_hyper_dump_tables.c

arm64.c diskdump.c global_data.c lkcd_dump_v8.h lkcd_x86_trace.h ppc64.c s390_dump.c unwind_decoder.c vmware_vmss.h xen_hyper_global_data.c

arm.c diskdump.h help.c lkcd_fix_mem.c main.c ppc.c s390x.c unwind.h x86_64.c

binqQtgK4pQ66.bin extensions ia64.c lkcd_fix_mem.h makedumpfile.c qemu.c sadump.c unwind_i.h x86.c

build_data.c extensions.c ibm_common.h lkcd_v1.c makedumpfile.h qemu-load.c sadump.h unwind_x86_32_64.c xen_dom0.c

cmdline.c filesys.c ipcs.c lkcd_v2_v3.c Makefile qemu-load.h sparc64.c unwind_x86_64.h xen_dom0.h

configure gdb-7.6 kernel.c lkcd_v5.c memory.c ramdump.c symbols.c unwind_x86.h xendump.c

configure.c gdb-7.6.patch kvmdump.c lkcd_v7.c memory_driver README task.c vas_crash.h xendump.h

COPYING3 gdb-7.6-ppc64le-support.patch kvmdump.h lkcd_v8.c mips.c remote.c test.c va_server.c xen_hyper.c

crash.8 gdb-7.6.tar.gz lkcd_common.c lkcd_vmdump_v1.h net.c rse.h tools.c va_server.h xen_hyper_command.c

defs.h gdb.files lkcd_dump_v5.h lkcd_vmdump_v2_v3.h netdump.c s390.c unwind_arm.c va_server_v1.c xen_hyper_defs.h

taotaomami@taotaomami$ make target=arm64

TARGET: ARM64

CRASH: 7.1.7++

GDB: 7.6

gcc -g -O2 -I. -I. -I./common -I./config -DLOCALEDIR="\"/usr/local/share/locale\"" -DCRASH_MERGE -DHAVE_CONFIG_H -I./../include/opcode -I./../opcodes/.. -I./../readline/.. -I../bfd -I./../bfd -I./../include -I../libdecnumber -I./../libdecnumber -I./gnulib/import -Ibuild-gnulib/import -DTUI=1 -Wall -Wdeclaration-after-statement -Wpointer-arith -Wformat-nonliteral -Wno-pointer-sign -Wno-unused -Wunused-value -Wunused-function -Wno-switch -Wno-char-subscripts -Wmissing-prototypes -Wdeclaration-after-statement -Wempty-body -c -o symtab.o -MT symtab.o -MMD -MP -MF .deps/symtab.Tpo symtab.c

Making init.c

gcc -g -O2 -I. -I. -I./common -I./config -DLOCALEDIR="\"/usr/local/share/locale\"" -DCRASH_MERGE -DHAVE_CONFIG_H -I./../include/opcode -I./../opcodes/.. -I./../readline/.. -I../bfd -I./../bfd -I./../include -I../libdecnumber -I./../libdecnumber -I./gnulib/import -Ibuild-gnulib/import -DTUI=1 -Wall -Wdeclaration-after-statement -Wpointer-arith -Wformat-nonliteral -Wno-pointer-sign -Wno-unused -Wunused-value -Wunused-function -Wno-switch -Wno-char-subscripts -Wmissing-prototypes -Wdeclaration-after-statement -Wempty-body -c -o init.o -MT init.o -MMD -MP -MF .deps/init.Tpo init.c

cc -c -g -DARM64 -DGDB_7_6 build_data.c

cc -c -g -DARM64 -DGDB_7_6 main.c

cc -c -g -DARM64 -DGDB_7_6 tools.c

cc -c -g -DARM64 -DGDB_7_6 global_data.c

cc -c -g -DARM64 -DGDB_7_6 memory.c

cc -c -g -DARM64 -DGDB_7_6 filesys.c

cc -c -g -DARM64 -DGDB_7_6 help.c

cc -c -g -DARM64 -DGDB_7_6 task.c

cc -c -g -DARM64 -DGDB_7_6 kernel.c

cc -c -g -DARM64 -DGDB_7_6 test.c

cc -c -g -DARM64 -DGDB_7_6 gdb_interface.c

cc -c -g -DARM64 -DGDB_7_6 net.c

cc -c -g -DARM64 -DGDB_7_6 dev.c

cc -c -g -DARM64 -DGDB_7_6 alpha.c

cc -c -g -DARM64 -DGDB_7_6 x86.c -DMCLX

编译完成之后会在当前目录下生成crash可执行文件,执行这个文件就可以分析ramdump了,分析ramdump的时候除了ramdump文件之外,还需要一个vmlinux文件,这个文件就是内核的symbole文件。

taotaomami@taotaomami$ crash vmlinux SYS_COREDUMP

crash 7.1.7++

Copyright (C) 2002-2016 Red Hat, Inc.

Copyright (C) 2004, 2005, 2006, 2010 IBM Corporation

Copyright (C) 1999-2006 Hewlett-Packard Co

Copyright (C) 2005, 2006, 2011, 2012 Fujitsu Limited

Copyright (C) 2006, 2007 VA Linux Systems Japan K.K.

Copyright (C) 2005, 2011 NEC Corporation

Copyright (C) 1999, 2002, 2007 Silicon Graphics, Inc.

Copyright (C) 1999, 2000, 2001, 2002 Mission Critical Linux, Inc.

This program is free software, covered by the GNU General Public License,

and you are welcome to change it and/or distribute copies of it under

certain conditions. Enter "help copying" to see the conditions.

This program has absolutely no warranty. Enter "help warranty" for details.

GNU gdb (GDB) 7.6

Copyright (C) 2013 Free Software Foundation, Inc.

License GPLv3+: GNU GPL version 3 or later

This is free software: you are free to change and redistribute it.

There is NO WARRANTY, to the extent permitted by law. Type "show copying"

and "show warranty" for details.

This GDB was configured as "--host=x86_64-unknown-linux-gnu --target=aarch64-elf-linux"...

跟GDB工具一样,crash也有自己的命令帮助,这里就不去赘述了,感兴趣的童鞋可以一个一个的去看,这里还是跟前面一样通过对一个问题的分析来介绍crash、ramdump是如何帮助我们分析问题的。

crash> help

* files mach repeat timer

alias foreach mod runq tree

ascii fuser mount search union

bt gdb net set vm

btop help p sig vtop

dev ipcs ps struct waitq

dis irq pte swap whatis

eval kmem ptob sym wr

exit list ptov sys q

extend log rd task

crash version: 7.1.7++ gdb version: 7.6

For help on any command above, enter "help ".

For help on input options, enter "help input".

For help on output options, enter "help output".

问题背景

在某个机型上面,跑monkey,跑一段时间之后差不多所有机器都会卡住,卡住的时候没法连接adb和串口,更坑人的是重启之后发现SD卡下面保存log的目录只保留了很少的一部分,后面的log都没有了,我们只能通过/data/anr下面的trace文件看到一点信息

"Binder_1" prio=5 tid=8 Blocked

| group="main" sCount=1 dsCount=0 obj=0x12c5c0a0 self=0x7fabe09400

| sysTid=1042 nice=0 cgrp=default sched=0/0 handle=0x7fabdff440

| state=S schedstat=( 155829476357 56801688642 366831 ) utm=11187 stm=4395 core=1 HZ=100

| stack=0x7fabd04000-0x7fabd06000 stackSize=1009KB

| held mutexes=

at com.android.server.MountService.getVolumeList(MountService.java:3139)

- waiting to lock <0x07a7279e> (a java.lang.Object) held by thread 86

at android.os.storage.IMountService$Stub.onTransact(IMountService.java:1751)

at android.os.Binder.execTransact(Binder.java:462)

"Binder_2" prio=5 tid=9 Blocked

| group="main" sCount=1 dsCount=0 obj=0x12c5f0a0 self=0x7fa2fadc00

| sysTid=1043 nice=0 cgrp=default sched=0/0 handle=0x7fabd01440

| state=S schedstat=( 152468392027 56015549452 365405 ) utm=11086 stm=4160 core=0 HZ=100

| stack=0x7fabc06000-0x7fabc08000 stackSize=1009KB

| held mutexes=

at com.android.server.MountService.getVolumeList(MountService.java:3139)

- waiting to lock <0x07a7279e> (a java.lang.Object) held by thread 86

at android.os.storage.IMountService$Stub.onTransact(IMountService.java:1751)

at android.os.Binder.execTransact(Binder.java:462)

"Binder_6" prio=5 tid=86 Native

| group="main" sCount=1 dsCount=0 obj=0x1404b0a0 self=0x7f92fac600

| sysTid=1691 nice=0 cgrp=default sched=0/0 handle=0x7f929bf440

| state=S schedstat=( 149288745355 56180291820 364873 ) utm=10576 stm=4352 core=1 HZ=100

| stack=0x7f928c4000-0x7f928c6000 stackSize=1009KB

| held mutexes=

at libcore.io.Posix.statvfs(Native method)

at libcore.io.BlockGuardOs.statvfs(BlockGuardOs.java:298)

at java.io.File.getTotalSpace(File.java:1126)

at android.os.storage.StorageManager.getStorageLowBytes(StorageManager.java:978)

at android.os.storage.VolumeInfo.buildStorageVolume(VolumeInfo.java:371)

at com.android.server.MountService.getVolumeList(MountService.java:3146)

- locked <0x07a7279e> (a java.lang.Object)

at android.os.storage.IMountService$Stub.onTransact(IMountService.java:1751)

at android.os.Binder.execTransact(Binder.java:462)

从这里的信息来看,system_server所有的binder线程都卡住了,并且大多数都是卡在获取SD容量大小的接口上面,综合前面SD卡没法保留log以及这里的trace文件信息,怀疑sd卡工作出了问题,当时sd卡用的是fuse文件系统,fuse文件系统在user空间有一个守护进程名字就叫sdcard,但是由于连不了adb和串口,我们无法得知当时sdcard进程的状态和其他信息,由于这个问题复现概率还算高,并且卡住是由于发生了system_server SWT,所以决定在system_server进程发生SWT的时候,主动让kernel crash,抓取ramdump文件来分析。

echo c > /proc/sysrq-trigger 可以让kernel发生异常重启

因为fuse文件系统分为user空间和kernel空间两层,而如果kernel层出问题的话,那kernel层其他模块很大概率也会出问题,但是从几次问题现场来看,kernel应该还是工作正常的,所以将目光放到了user空间层的sdcard守护进程上面:

过滤所有D状态的进程, UN = UNINTERRUPTIBLE

crash> ps | grep UN

116 2 1 ffffffc003274000 UN 0.0 0 0 [display_esd_che]

121 2 0 ffffffc0b6202000 UN 0.0 0 0 [display_idle_de]

176 2 0 ffffffc0b5334000 UN 0.0 0 0 [hang_detect]

206 2 0 ffffffc0b4b52000 UN 0.0 0 0 [C2K_TX_ASC]

207 2 0 ffffffc0b45a2000 UN 0.0 0 0 [C2K_RX_ASC]

229 2 0 ffffffc0b46a8000 UN 0.0 0 0 [bat_routine_thr]

230 2 0 ffffffc0b45d0000 UN 0.0 0 0 [bat_update_thre]

3046 315 5 ffffffc095fdd000 UN 0.0 20136 2056 sdcard

3048 315 4 ffffffc0adade000 UN 0.0 20136 2056 sdcard

21528 2 0 ffffffc01e2df000 UN 0.0 0 0 [kworker/u16:6]

其中3046和3048这两个sdcard线程的状态都为D,对比一下正常的机器,这两个线程的状态是不正常的,获取这两个线程的堆栈信息:

crash> bt 3046

PID: 3046 TASK: ffffffc095fdd000 CPU: 5 COMMAND: "sdcard"

#0 [ffffffc095e43880] __switch_to at ffffffc000087204

#1 [ffffffc095e438b0] __schedule at ffffffc000ae4164

#2 [ffffffc095e43a60] schedule at ffffffc000ae4590

#3 [ffffffc095e43a70] schedule_preempt_disabled at ffffffc000ae45f0

#4 [ffffffc095e43a80] __mutex_lock_slowpath at ffffffc000ae5da4

#5 [ffffffc095e43ae0] mutex_lock at ffffffc000ae5eb8

#6 [ffffffc095e43b00] fuse_reverse_inval_entry at ffffffc0002ab150

#7 [ffffffc095e43b50] fuse_notify_delete at ffffffc0002a6078

#8 [ffffffc095e43bc0] fuse_dev_do_write at ffffffc0002a7a24

#9 [ffffffc095e43c70] fuse_dev_write at ffffffc0002a81a4

#10 [ffffffc095e43cf0] do_sync_readv_writev at ffffffc00019fb28

#11 [ffffffc095e43d70] do_readv_writev at ffffffc0001a1444

#12 [ffffffc095e43e70] vfs_writev at ffffffc0001a1524

#13 [ffffffc095e43e80] sys_writev at ffffffc0001a1620

#14 [ffffffc095e43ed0] el0_svc_naked at ffffffc00008542c

PC: 0000007f9deef9c0 LR: 0000007f9e105a50 SP: 0000007f9dbbd260

X29: 0000007f9dbbd260 X28: 0000007f9dbbf360 X27: 000000000000005d

X26: 00000000009eaa0a X25: 0000007f9dbbf3d0 X24: 0000007f9e108360

X23: 0000007fe5ccd56c X22: 0000007fe5ccb448 X21: 0000007f9d546e00

X20: 0000007f9d542c00 X19: 0000000000000006 X18: 0000007f9dd43c00

X17: 0000007f9deef9b8 X16: 0000007f9e119d70 X15: 000000000000005a

X14: 0000007fe5ccb440 X13: 0000007fe5cca440 X12: 0000000000000000

X11: 0101010101010101 X10: 0000007fe5ccd56c X9: 0000000000000018

X8: 0000000000000042 X7: 0000000000000010 X6: 000000000000003d

X5: 0000000000000015 X4: 0000000000000014 X3: 0000007f9dbbd2a8

X2: 0000000000000003 X1: 0000007f9dbbd2d0 X0: 000000000000000a

ORIG_X0: 000000000000000a SYSCALLNO: 42 PSTATE: 00000000

crash> bt 3048

PID: 3048 TASK: ffffffc0adade000 CPU: 4 COMMAND: "sdcard"

#0 [ffffffc0ac517880] __switch_to at ffffffc000087204

#1 [ffffffc0ac5178b0] __schedule at ffffffc000ae4164

#2 [ffffffc0ac517a60] schedule at ffffffc000ae4590

#3 [ffffffc0ac517a70] schedule_preempt_disabled at ffffffc000ae45f0

#4 [ffffffc0ac517a80] __mutex_lock_slowpath at ffffffc000ae5da4

#5 [ffffffc0ac517ae0] mutex_lock at ffffffc000ae5eb8

#6 [ffffffc0ac517b00] fuse_reverse_inval_entry at ffffffc0002ab150

#7 [ffffffc0ac517b50] fuse_notify_delete at ffffffc0002a6078

#8 [ffffffc0ac517bc0] fuse_dev_do_write at ffffffc0002a7a24

#9 [ffffffc0ac517c70] fuse_dev_write at ffffffc0002a81a4

#10 [ffffffc0ac517cf0] do_sync_readv_writev at ffffffc00019fb28

#11 [ffffffc0ac517d70] do_readv_writev at ffffffc0001a1444

#12 [ffffffc0ac517e70] vfs_writev at ffffffc0001a1524

#13 [ffffffc0ac517e80] sys_writev at ffffffc0001a1620

#14 [ffffffc0ac517ed0] el0_svc_naked at ffffffc00008542c

PC: 0000007f9deef9c0 LR: 0000007f9e105a50 SP: 0000007f9d9c1260

X29: 0000007f9d9c1260 X28: 0000007f9d9c3360 X27: 0000007f9d6ce3e0

X26: 0000000000120edc X25: 0000007f9d9c33d0 X24: 0000007f9e108360

X23: 0000007fe5d4d62c X22: 0000007fe5cc9418 X21: 0000007f9d546e00

X20: 0000007f9d542b80 X19: 0000000000000006 X18: 00000000000001fd

X17: 0000007f9deef9b8 X16: 0000007f9e119d70 X15: 0000000000000000

X14: 00000000000081b4 X13: 0000000000000000 X12: 0000000000000000

X11: 0101010101010101 X10: 0000007fe5d4d62c X9: 0000000000000018

X8: 0000000000000042 X7: 0000000000000010 X6: 000000000000003d

X5: 0000000000000015 X4: 0000000000000014 X3: 0000007f9d9c12a8

X2: 0000000000000003 X1: 0000007f9d9c12d0 X0: 0000000000000004

ORIG_X0: 0000000000000004 SYSCALLNO: 42 PSTATE: 00000000

从堆栈来看,这两个线程都是在删除文件,并且最终都在内核态等锁,其中mutex_lock和mutex的定义如下:

crash> whatis mutex_lock

void mutex_lock(struct mutex *);

crash> whatis struct mutex

struct mutex {

atomic_t count;

spinlock_t wait_lock;

struct list_head wait_list;

struct task_struct *owner;

struct optimistic_spin_queue osq;

}

SIZE: 40

mutex里面记录了这个锁被谁持有,所以如果可以找到mutex的内容,那么就可以找到这个锁的持有者了,接下来介绍一种简单的寻找锁持有者的方法,反汇编mutex_lock的方法

crash> dis mutex_lock

0xffffffc000ae5e74 : stp x29, x30, [sp,#-32]!

0xffffffc000ae5e78 : mov x29, sp

0xffffffc000ae5e7c : str x19, [sp,#16]

0xffffffc000ae5e80 : mov x19, x0 //将参数mutex传给了X19,

0xffffffc000ae5e84 : ldxr w1, [x0]

0xffffffc000ae5e88 : sub w1, w1, #0x1

0xffffffc000ae5e8c : stlxr w2, w1, [x0]

0xffffffc000ae5e90 : cbnz w2, 0xffffffc000ae5e84

0xffffffc000ae5e94 : dmb ish

0xffffffc000ae5e98 : tbnz w1, #31, 0xffffffc000ae5eb8

0xffffffc000ae5e9c : mov x0, sp

0xffffffc000ae5ea0 : and x0, x0, #0xffffffffffffc000

0xffffffc000ae5ea4 : ldr x0, [x0,#16]

0xffffffc000ae5ea8 : str x0, [x19,#24]

0xffffffc000ae5eac : ldr x19, [sp,#16]

0xffffffc000ae5eb0 : ldp x29, x30, [sp],#32

0xffffffc000ae5eb4 : ret

0xffffffc000ae5eb8 : bl 0xffffffc000ae5d10 <__mutex_lock_slowpath> //mutex_lock从这里调用的__mutex_lock_slowpath

0xffffffc000ae5ebc : mov x0, sp

0xffffffc000ae5ec0 : and x0, x0, #0xffffffffffffc000

0xffffffc000ae5ec4 : ldr x0, [x0,#16]

0xffffffc000ae5ec8 : str x0, [x19,#24]

0xffffffc000ae5ecc : ldr x19, [sp,#16]

0xffffffc000ae5ed0 : ldp x29, x30, [sp],#32

0xffffffc000ae5ed4 : ret

从0xffffffc000ae5e80到0xffffffc000ae5eb8这一段的指令里面X19寄存器都没有再被修改,所以找出X19寄存器的内容就是struct mutex的地址,再看__mutex_lock_slowpath的反汇编代码:

crash> dis __mutex_lock_slowpath

0xffffffc000ae5d10 <__mutex_lock_slowpath>: stp x29, x30, [sp,#-96]!

0xffffffc000ae5d14 <__mutex_lock_slowpath>: mov x1, sp

0xffffffc000ae5d18 <__mutex_lock_slowpath>: mov x29, sp

0xffffffc000ae5d1c <__mutex_lock_slowpath>: and x1, x1, #0xffffffffffffc000

0xffffffc000ae5d20 <__mutex_lock_slowpath>: stp x19, x20, [sp,#16] //把x19寄存器存放到了[sp,#16]

0xffffffc000ae5d24 <__mutex_lock_slowpath>: stp x21, x22, [sp,#32]

0xffffffc000ae5d28 <__mutex_lock_slowpath>: stp x23, x24, [sp,#48]

0xffffffc000ae5d2c <__mutex_lock_slowpath>: mov x19, x0

0xffffffc000ae5d30 <__mutex_lock_slowpath>: mov w0, #0x1 // #1

0xffffffc000ae5d34 <__mutex_lock_slowpath>: ldr x22, [x1,#16]

0xffffffc000ae5d38 <__mutex_lock_slowpath>: bl 0xffffffc0000c7bac

0xffffffc000ae5d3c <__mutex_lock_slowpath>: mov x1, #0x0 // #0

0xffffffc000ae5d40 <__mutex_lock_slowpath>: mov x0, x19

0xffffffc000ae5d44 <__mutex_lock_slowpath>: mov w2, w1

0xffffffc000ae5d48 <__mutex_lock_slowpath>: bl 0xffffffc0000e6554

0xffffffc000ae5d4c <__mutex_lock_slowpath>: uxtb w0, w0

0xffffffc000ae5d50 <__mutex_lock_slowpath>: cbnz w0, 0xffffffc000ae5e08 <__mutex_lock_slowpath>

......

0xffffffc000ae5e60 <__mutex_lock_slowpath>: ldp x19, x20, [sp,#16]

0xffffffc000ae5e64 <__mutex_lock_slowpath>: ldp x21, x22, [sp,#32]

0xffffffc000ae5e68 <__mutex_lock_slowpath>: ldp x23, x24, [sp,#48]

0xffffffc000ae5e6c <__mutex_lock_slowpath>: ldp x29, x30, [sp],#96

0xffffffc000ae5e70 <__mutex_lock_slowpath>: ret

从上面的反汇编代码来看,struct mutex的地址被放到了与__mutex_lock_slowpath栈指针偏移16字节的地方,而__mutex_lock_slowpath的栈指针为ffffffc095e43a80.

4 [ffffffc095e43a80] __mutex_lock_slowpath at ffffffc000ae5da4

5 [ffffffc095e43ae0] mutex_lock at ffffffc000ae5eb8

导出ffffffc095e43a80到ffffffc095e43ae0这段内存的内容

crash> rd ffffffc095e43a80 -e ffffffc095e43ae0

ffffffc095e43a80: ffffffc095e43ae0 ffffffc000ae5ebc .:.......^......

ffffffc095e43a90: ffffffc00f387b98 00000000ffffffec .{8.............

ffffffc095e43aa0: ffffffc00f387b98 ffffffc095e43b98 .{8......;......

ffffffc095e43ab0: 0000007f9d542c00 000000000000003d .,T.....=.......

ffffffc095e43ac0: ffffffc095e43ae0 ffffffc00f387ba0 .:.......{8.....

ffffffc095e43ad0: ffffffc00f387ba0 ffffffc095fdd000 .{8.............

可以得到struct mutex的地址为ffffffc00f387b98,利用这种方式可以得到你想要的变量的内容,打印这个锁可以获取这个锁的持有者:

crash> struct mutex ffffffc00f387b98

struct mutex {

count = {

counter = -1

},

wait_lock = {

{

rlock = {

raw_lock = {

owner = 1,

next = 1

}

}

}

},

wait_list = {

next = 0xffffffc095e43ac8,

prev = 0xffffffc095e43ac8

},

owner = 0xffffffc00bf4e000,

osq = {

tail = {

counter = 0

}

}

}

crash> struct task_struct.pid 0xffffffc00bf4e000

pid = 5446

得到3046这个sdcard线程正在等的锁的持有者是5446这个进程,接下来获取5446的堆栈:

crash> bt 5446

PID: 5446 TASK: ffffffc00bf4e000 CPU: 4 COMMAND: "io"

#0 [ffffffc0079b3a70] __switch_to at ffffffc000087204

#1 [ffffffc0079b3aa0] __schedule at ffffffc000ae4164

#2 [ffffffc0079b3c50] schedule at ffffffc000ae4590

#3 [ffffffc0079b3c60] wait_answer_interruptible at ffffffc0002a4f34

#4 [ffffffc0079b3cc0] __fuse_request_send at ffffffc0002a5424

#5 [ffffffc0079b3d30] fuse_request_send at ffffffc0002a83d4

#6 [ffffffc0079b3d40] fuse_unlink at ffffffc0002a8fb0

#7 [ffffffc0079b3d80] vfs_unlink at ffffffc0001af044

#8 [ffffffc0079b3dc0] do_unlinkat at ffffffc0001af2f4

#9 [ffffffc0079b3ec0] sys_unlinkat at ffffffc0001af350

#10 [ffffffc0079b3ed0] el0_svc_naked at ffffffc00008542c

PC: f751c298 LR: f7513639 SP: e9570150 PSTATE: 60070010

X12: 33116380 X11: 32e89670 X10: 348c61f0 X9: eeed1200

X8: 348c61f0 X7: 00000148 X6: eeed63c0 X5: 713b8ea0

X4: df154348 X3: 000081b0 X2: 00000000 X1: df154348

X0: ffffff9c

从这个堆栈来看,这个线程也是在删除文件....... ,但是这里是发送文件删除请求给fuse文件系统,fuse文集系统kernel层代码接收到请求的时候需要重新把请求路由给上层的sdcard进程,所以上面的3046和3048其中有一个线程就是在处理5446的删除文件请求,用同样的方法得到3048进程的锁是被12051获取了,而从它的堆栈来看,它也是在删除文件

crash> struct task_struct.pid 0xffffffc0082ac000

pid = 12051

crash> bt 12051

PID: 12051 TASK: ffffffc0082ac000 CPU: 5 COMMAND: "RxComputationSc"

#0 [ffffffc00ab7ba70] __switch_to at ffffffc000087204

#1 [ffffffc00ab7baa0] __schedule at ffffffc000ae4164

#2 [ffffffc00ab7bc50] schedule at ffffffc000ae4590

#3 [ffffffc00ab7bc60] wait_answer_interruptible at ffffffc0002a4f34

#4 [ffffffc00ab7bcc0] __fuse_request_send at ffffffc0002a5374

#5 [ffffffc00ab7bd30] fuse_request_send at ffffffc0002a83d4

#6 [ffffffc00ab7bd40] fuse_unlink at ffffffc0002a8fb0

#7 [ffffffc00ab7bd80] vfs_unlink at ffffffc0001af044

#8 [ffffffc00ab7bdc0] do_unlinkat at ffffffc0001af2f4

#9 [ffffffc00ab7bec0] sys_unlinkat at ffffffc0001af350

#10 [ffffffc00ab7bed0] el0_svc_naked at ffffffc00008542c

PC: 0000007f89fc7930 LR: 0000007f89fb72d8 SP: 0000007f6d1be4e0

X29: 0000007f6d1be4e0 X28: 000000001341e240 X27: 0000000013424180

X26: 000000001341c4a0 X25: 000000001341e240 X24: 0000000013210ea0

X23: 000000001341ba90 X22: 0000000013424180 X21: 0000007f6d1be5c8

X20: 0000007f80254c00 X19: 0000007f69a44980 X18: 0000007f86444b78

X17: 0000007f89fc7928 X16: 0000007f8a035510 X15: 0000000000000000

X14: 0000000000000001 X13: 0000000000000017 X12: 0000000000000000

X11: 0000000000000036 X10: 0000007f8643c9d0 X9: 0000007f8643a000

X8: 0000000000000023 X7: 0000000000430000 X6: 0000000000000000

X5: 000000000000002f X4: 0000000000430000 X3: 0000000000000100

X2: 0000000000000000 X1: 0000007f69a44980 X0: 00000000ffffff9c

ORIG_X0: 00000000ffffff9c SYSCALLNO: 23 PSTATE: 20000000

这个问题分析到这,大致可以知道是有两个线程同时向fuse文件系统删除文件,然后导致系统卡住,隐约可以感觉到这是一个死锁问题,只不过这个不是普通之间的线程死锁,而是进程之间,user空间和kernel空间的死锁问题,既然都是在删除文件,那么到底是在删除什么文件呢?类似前面获取mutex地址的方式一样,我们从5446和12051这两个线程的上下文里面获取它们正在删除的文件

crash> whatis fuse_unlink

int fuse_unlink(struct inode *, struct dentry *); //dentry里面就包含了路径信息

5446这个线程正在删除的文件信息

crash> bt 5446

PID: 5446 TASK: ffffffc00bf4e000 CPU: 4 COMMAND: "io"

#0 [ffffffc0079b3a70] __switch_to at ffffffc000087204

#1 [ffffffc0079b3aa0] __schedule at ffffffc000ae4164

#2 [ffffffc0079b3c50] schedule at ffffffc000ae4590

#3 [ffffffc0079b3c60] wait_answer_interruptible at ffffffc0002a4f34

#4 [ffffffc0079b3cc0] __fuse_request_send at ffffffc0002a5424

#5 [ffffffc0079b3d30] fuse_request_send at ffffffc0002a83d4

#6 [ffffffc0079b3d40] fuse_unlink at ffffffc0002a8fb0

#7 [ffffffc0079b3d80] vfs_unlink at ffffffc0001af044

#8 [ffffffc0079b3dc0] do_unlinkat at ffffffc0001af2f4

#9 [ffffffc0079b3ec0] sys_unlinkat at ffffffc0001af350

#10 [ffffffc0079b3ed0] el0_svc_naked at ffffffc00008542c

PC: f751c298 LR: f7513639 SP: e9570150 PSTATE: 60070010

X12: 33116380 X11: 32e89670 X10: 348c61f0 X9: eeed1200

X8: 348c61f0 X7: 00000148 X6: eeed63c0 X5: 713b8ea0

X4: df154348 X3: 000081b0 X2: 00000000 X1: df154348

X0: ffffff9c

crash> rd ffffffc0079b3cc0 -e ffffffc0079b3d30

ffffffc0079b3cc0: ffffffc0079b3d30 ffffffc0002a83d8 0=........*.....

ffffffc0079b3cd0: ffffffc0235981a0 ffffffc06a521e40 ..Y#....@.Rj....

ffffffc0079b3ce0: ffffffc098daa800 ffffffc06a521e40 ........@.Rj....

ffffffc0079b3cf0: ffffffc00f387b00 ffffffc0079b3e20 .{8..... >......

ffffffc0079b3d00: ffffffc0079b3d10 ffffffc0001aa028 .=......(.......

ffffffc0079b3d10: ffffffc0079b3d60 ffffffc0002c9bc8 `=........,.....

ffffffc0079b3d20: ffffffc06a521e40 ffffffc06a521e40 @.Rj....@.Rj....

crash> struct dentry ffffffc06a521e40

struct dentry {

d_flags = 4718788,

d_seq = {

sequence = 2

},

d_hash = {

next = 0x0,

pprev = 0xffffffc0b7c77f28

},

d_parent = 0xffffffc0052f96c0,

d_name = {

{

{

hash = 114945031,

len = 20

},

hash_len = 86014290951

},

name = 0xffffffc06a521e78 "10048316241362964103"

},

d_inode = 0xffffffc033afcb00,

d_iname = "10048316241362964103\000\000s\000l\000bytes",

d_lockref = {

{

lock_count = 4295294981,

{

lock = {

{

rlock = {

raw_lock = {

owner = 5,

next = 5

}

}

}

},

count = 1

}

}

},

d_op = 0xffffffc000b3fec0 ,

d_sb = 0xffffffc098daa000,

d_time = 4299504276,

d_fsdata = 0x0,

d_lru = {

next = 0xffffffc095eff088,

prev = 0xffffffc024a50140

},

d_child = {

next = 0xffffffc03bec35d0,

prev = 0xffffffc0052f9760

},

d_subdirs = {

next = 0xffffffc06a521ee0,

prev = 0xffffffc06a521ee0

},

d_u = {

d_alias = {

next = 0x0,

pprev = 0xffffffc033afcc08

},

d_rcu = {

next = 0x0,

func = 0xffffffc033afcc08

}

}

}

12051正在删除的文件信息:

crash> struct dentry ffffffc007383180

struct dentry {

d_flags = 4718788,

d_seq = {

sequence = 2

},

d_hash = {

next = 0x0,

pprev = 0xffffffc0b7d70318

},

d_parent = 0xffffffc03ad70780,

d_name = {

{

{

hash = 3934655135,

len = 20

},

hash_len = 89834001055

},

name = 0xffffffc0073831b8 "15919563341606802260"

},

d_inode = 0xffffffc00398b2c0,

d_iname = "15919563341606802260\000m\000re.bytes",

d_lockref = {

{

lock_count = 4295491592,

{

lock = {

{

rlock = {

raw_lock = {

owner = 8,

next = 8

}

}

}

},

count = 1

}

}

},

d_op = 0xffffffc000b3fec0 ,

d_sb = 0xffffffc003290800,

d_time = 4299504227,

d_fsdata = 0x0,

d_lru = {

next = 0xffffffc00acf1d40,

prev = 0xffffffc059711980

},

d_child = {

next = 0xffffffc059711990,

prev = 0xffffffc00acf1d50

},

d_subdirs = {

next = 0xffffffc007383220,

prev = 0xffffffc007383220

},

d_u = {

d_alias = {

next = 0x0,

pprev = 0xffffffc00398b3c8

},

d_rcu = {

next = 0x0,

func = 0xffffffc00398b3c8

}

}

}

上面这么看还是不够直观,在crash工具下面还发现一个很好用的命令files,用这个命令可以很直观的把文件路径打出来,

crash> files -d ffffffc06a521e40

DENTRY INODE SUPERBLK TYPE PATH

ffffffc06a521e40 ffffffc033afcb00 ffffffc098daa000 REG 10048316241362964103

//保存的这个ramdump里面不知道为什么这里打印不出来完整的路径名,

//但是在当时分析的其他ramdump里面是能打印出来,也是在删除com.xxx目录下的文件,

//只是路径前缀不一样,这个与Android M的运行时权限机制有关,像surfaceflinger这种系统服务

//进程和普通的应用进程都去访问SD卡下的同一个文件,但是它们看到的路径却是不一样的,比如

//surfaceflinger 看到的是/storage/runtime/default开头,而普通应用看到的是/mnt/runtime/write

crash> files -d ffffffc007383180

DENTRY INODE SUPERBLK TYPE PATH

ffffffc007383180 ffffffc00398b2c0 ffffffc003290800 REG /storage/runtime/default/emulated/0/com.xxx/15919563341606802260

从上面可以得知3048进程的锁是被12051获取了,而12051正在删除/storage/runtime/default/emulated/0/com.xxx/15919563341606802260这个文件,那我们是否可以从3048这个进程里面获取到它正在删除哪个文件呢?如果从它的上下文里面得到它正在删除10048316241362964103这个文件,那么就可以证明这是一个死锁的问题了,同样还是利用上面那个方法,从3048的堆栈上下文里面去分析它正在删除哪个文件.

crash> bt 3048

PID: 3048 TASK: ffffffc0adade000 CPU: 4 COMMAND: "sdcard"

#0 [ffffffc0ac517880] __switch_to at ffffffc000087204

#1 [ffffffc0ac5178b0] __schedule at ffffffc000ae4164

#2 [ffffffc0ac517a60] schedule at ffffffc000ae4590

#3 [ffffffc0ac517a70] schedule_preempt_disabled at ffffffc000ae45f0

#4 [ffffffc0ac517a80] __mutex_lock_slowpath at ffffffc000ae5da4

#5 [ffffffc0ac517ae0] mutex_lock at ffffffc000ae5eb8

#6 [ffffffc0ac517b00] fuse_reverse_inval_entry at ffffffc0002ab150

#7 [ffffffc0ac517b50] fuse_notify_delete at ffffffc0002a6078

#8 [ffffffc0ac517bc0] fuse_dev_do_write at ffffffc0002a7a24

#9 [ffffffc0ac517c70] fuse_dev_write at ffffffc0002a81a4

#10 [ffffffc0ac517cf0] do_sync_readv_writev at ffffffc00019fb28

#11 [ffffffc0ac517d70] do_readv_writev at ffffffc0001a1444

#12 [ffffffc0ac517e70] vfs_writev at ffffffc0001a1524

#13 [ffffffc0ac517e80] sys_writev at ffffffc0001a1620

#14 [ffffffc0ac517ed0] el0_svc_naked at ffffffc00008542c

crash> whatis fuse_notify_delete

int fuse_notify_delete(struct fuse_conn *, unsigned int, struct fuse_copy_state *);

crash> whatis fuse_reverse_inval_entry

int fuse_reverse_inval_entry(struct super_block *, u64, u64, struct qstr *);

crash> dis fuse_reverse_inval_entry

0xffffffc0002ab108 : stp x29, x30, [sp,#-80]!

0xffffffc0002ab10c : mov x29, sp

0xffffffc0002ab110 : add x4, x29, #0x50

0xffffffc0002ab114 : stp x19, x20, [sp,#16]

0xffffffc0002ab118 : stp x21, x22, [sp,#32]

0xffffffc0002ab11c : stp x23, x24, [sp,#48]

0xffffffc0002ab120 : str x1, [x4,#-8]!

0xffffffc0002ab124 : mov x23, x2

0xffffffc0002ab128 : adrp x2, 0xffffffc0002b0000

0xffffffc0002ab12c : mov x22, x3 //struct qstr参数放到了X22寄存器

0xffffffc0002ab130 : add x2, x2, #0xfa8

0xffffffc0002ab134 : mov x3, x4

0xffffffc0002ab138 : bl 0xffffffc0001ba738

0xffffffc0002ab13c : mov x19, x0

0xffffffc0002ab140 : cbz x0, 0xffffffc0002ab1e4

0xffffffc0002ab144 : add x21, x0, #0x98

0xffffffc0002ab148 : mov w20, #0xffffffec // #-20

0xffffffc0002ab14c : mov x0, x21

0xffffffc0002ab150 : bl 0xffffffc000ae5e74 //程序从这里跳到mutex_lock继续执行。

.......

继续分析mutex_lock和__mutex_lock_slowpath的汇编代码,发现它们都没有修改X22的值,并且在__mutex_lock_slowpath的汇编里面把X22保存在函数栈里面

crash> dis __mutex_lock_slowpath

0xffffffc000ae5d10 <__mutex_lock_slowpath>: stp x29, x30, [sp,#-96]!

0xffffffc000ae5d14 <__mutex_lock_slowpath>: mov x1, sp

0xffffffc000ae5d18 <__mutex_lock_slowpath>: mov x29, sp

0xffffffc000ae5d1c <__mutex_lock_slowpath>: and x1, x1, #0xffffffffffffc000

0xffffffc000ae5d20 <__mutex_lock_slowpath>: stp x19, x20, [sp,#16]

0xffffffc000ae5d24 <__mutex_lock_slowpath>: stp x21, x22, [sp,#32] //X22保存在与SP偏移32字节的地方

0xffffffc000ae5d28 <__mutex_lock_slowpath>: stp x23, x24, [sp,#48]

......

crash> rd ffffffc0ac517a80 -e ffffffc0ac517ae0

ffffffc0ac517a80: ffffffc0ac517ae0 ffffffc000ae5ebc .zQ......^......

ffffffc0ac517a90: ffffffc04f747098 00000000ffffffec .ptO............

ffffffc0ac517aa0: ffffffc04f747098 ffffffc0ac517b98 .ptO.....{Q.....

ffffffc0ac517ab0: 0000007f9d542b80 000000000000003d .+T.....=.......

ffffffc0ac517ac0: ffffffc0ac517ae0 ffffffc04f7470a0 .zQ......ptO....

ffffffc0ac517ad0: ffffffc04f7470a0 ffffffc0adade000 .ptO............

crash> struct qstr ffffffc0ac517b98

struct qstr {

{

{

hash = 114945031,

len = 20

},

hash_len = 86014290951

},

name = 0xffffffc010143000 "10048316241362964103" //果然3048正在删除10048316241362964103这个文件

}

分析到这里,可以知道这个问题确实是进程间死锁问题,然后根据目前获得的消息,由两个不同权限的进程同时去删除SD卡同一个目录下的文件(例如surfaceflinger和普通app进程),果然复现了这个问题,找到了必现的路径,反馈给BSP有关同事之后, 他们认为是fuse文件系统本身的一个缺陷,但是由于项目紧急,所以当时只是做了workground方案.

ramdump提取coredump文件的方法

既然ramdump文件是整个内存的镜像,那么理论上来讲,从它里面是可以提取单个进程的coredump的,网上搜罗了一堆资料,最后在crash extension modules这个网页里面找到了方法,crash是提供可扩展的,可以为它写一些类似插件的功能,gcore就是其中一款,可以利用这个插件从ramdump里面提取出单个进程的coredump文件.

Create a core dump file of a user-space task that was running in a kernel dumpfile. The module supports the X86, X86_64, ARM, ARM64, MIPS and PPC64 architectures.

If /usr/include/crash/defs.h exists, the module may be built like so:

$ tar xzf crash-gcore-command-1.4.0.tar.gz

$ cd crash-gcore-command-1.4.0

$ make -f gcore.mk

Alternatively, defs.h may be copied to the current directory:

$ tar xzf crash-gcore-command-1.4.0.tar.gz

$ cd crash-gcore-command-1.4.0

$ cp /defs.h .

$ make -f gcore.mk

To build the module from the top-level crash- directory, enter:

$ tar xzf crash-gcore-command-1.4.0.tar.gz

$ mv crash-gcore-command-1.4.0/* extensions

$ make extensions

This module's source files are self-contained in crash-gcore-command-1.4.0.tar.gz.

crashdumpandroid_Android Stability - crash 和 ramdump相关推荐

  1. crashdumpandroid_Android 中Crash时如何获取异常信息详解及实例

    Android 中Crash时如何获取异常信息详解 前言: 大家都知道,Android应用不可避免的会发生crash,无论你的程序写的多完美,总是无法完全避免crash的发生,可能是由于Android ...

  2. crashdumpandroid_Android 常见Crash Log汇总

    一.BinderProxy@4479b390 is not valid; is your activity running? 原因分析: 因为使用了AsyncTask 异步线程在线程完成以后的onPo ...

  3. 高通平台Android 蓝牙调试和配置手册--如何抓取log

    1.Android蓝牙调试概述 使用本文档可以使用以下方法调试一些常见的蓝牙问题: ■ Android主机日志(ADB logcat) ■ 蓝牙日志 ■ 音频配置文件日志 ■ 数据配置文件日志 ■ G ...

  4. (好文重发)朴英敏:用crash工具分析Linux内核死锁的一次实战

    本文简介: 内核死锁问题一般是读写锁(rw_semaphore)和互斥锁(mutex)引起的,本文主要讲如何通过ramdump+crash工具来分析这类死锁问题. 作者简介: 朴英敏,现就职于国内一家 ...

  5. 可伸缩性, 可用性和稳定性模式 Scalability, Availability Stability Patterns

    Scalability, Availability & Stability Patterns 一 自我有要求的读者应该提出问题:(研习:掌握层次:)能力级别:不会(了解)--领会(理解)--熟 ...

  6. crash工具解析_Android Crash 工具

    crash工具解析_Android Crash 工具_weixin_39543655的博客-CSDN博客本篇文章主要介绍Android开发中的部分知识点,通过阅读本篇文章,您将收获以下内容:一.Cra ...

  7. 【ARM Linux 系统稳定性分析入门及渐进 1 -- Crash 工具简介】

    文章目录 1.1 Kexec 和 Kdump 1.1.1 Kexec 加载优点 1.1.2 kdump 功能 1.1.3 kdump原理 1.2 Crash tool 1.2.1 ramdump 机制 ...

  8. System Reliability - One Crash per Thousand Server Years

    Servers crash, except when they don't. Our servers don't crash, at least not often. Our current rate ...

  9. Crash Dump Analysis 崩溃转储分析-摘抄翻译自深入解析windows操作系统

    Almost every Windows user has heard of, if not experienced, the infamous "blue screen of death. ...

  10. ADB 查看 crash log

    有时候项目Crash 了, 我们可以使用adb 查看crash 信息 在终端中输入 adb shell 然后就是查看crash 命令 logcat -b crash 这样就能查看crash 的地方了.

最新文章

  1. Linux下sh/bash/source/.命令的区别(转)
  2. swift_013(Swift 的运算符)
  3. deepin--安装Anaconda
  4. (一)、http原理
  5. mongo的php查询,使用PHP进行简单查询的mongo查询速度慢
  6. 一种改进的高光谱图像CEM目标检测算法
  7. 收藏 | 在yolov5上验证一些不成熟的想法
  8. 带父节点的平衡二叉树_平衡二叉树的左右旋以及双旋转的图文详解
  9. 多媒体计算机的信息类型,多媒体信息主要有几种类型
  10. 3D数学基础----欧拉角
  11. 计算机组成原理实验二八位寄存器,计算机组成原理课后参考答案
  12. 数据库设计(5)-理解用户需求
  13. Theano简单入门(三)
  14. storm windows 机制
  15. C语言基础 判断周几
  16. Maven使用(一)
  17. 沪市和深市股票托管方式的区别
  18. 我们成为什么样的人,是由自己的选择决定的---Leo读 不是孙振耀写的职场感言 大结局 7
  19. php是什么化学物,dmap是什么化学物质以及它的性质用途
  20. 后缀是dat的是什么文件 dat文件删除了能恢复吗

热门文章

  1. 安卓搞机教程--修改设置里 添加选项 添加文字 修改图标 修改版本号等等 实例解析
  2. 黑马python24期课件和代码_黑马Python 24期全套教程
  3. SRCNN中的PSNR计算问题
  4. MAC打开eclipse一直卡在core.ui的解决办法
  5. 微信公众号开发框架 For Java —— wechatapi
  6. WordPress 跨站请求伪造漏洞
  7. DP动态规划思想讲解
  8. Mac系统下编译并使用ijkplyer播放器
  9. 如何减小电压跟随器输出电阻_电压跟随器秘笈:LM358电压跟随器+运放问题
  10. 在web项目中使用MarkDown组件