Background
[The hang detect mechanism]
MTK adds a SW watchdog character device driver, /dev/RT_Monitor, to the kernel to monitor system_server. The driver starts a kernel thread that loops checking the state of the system_server process: if system_server does not exist, Android is restarting; if it does exist, hang_detect_counter runs its timeout flow, decrementing every 30 s from an initial value of 10 that the SW watchdog sets (the "feed the dog" operation) via an ioctl on the /dev/RT_Monitor device node. In other words, if roughly 10 * 30 s pass without a feed from the SW watchdog, the SW watchdog inside system_server is itself hung and the problem can no longer be surfaced as an SWT; at that point the hang_detect kernel thread prints the backtraces of all D-state (uninterruptible sleep) tasks to the kernel log for later analysis, and finally calls BUG() to end itself.
Main implementation:
434 static int hang_detect_thread(void *arg)
435 {
..
445     while (1) {
446         if ((1 == hd_detect_enabled) && (FindTaskByName("system_server") != -1)) {
447             LOGE("[Hang_Detect] hang_detect thread counts down %d:%d.\n", hang_detect_counter, hd_timeout);
448
449             if (hang_detect_counter <= 0)
450                 ShowStatus();
451
452             if (hang_detect_counter == 0) {
..
461                 } else {    /* only Customer user load  trigger KE */
462                     LOGE("[Hang_Detect] we should triger KE...\n");
463                     aee_kernel_exception_api(__FILE__, __LINE__,
464                          DB_OPT_NE_JBT_TRACES | DB_OPT_DISPLAY_HANG_DUMP,
465                          "\nCRDISPATCH_KEY:SS Hang\n",
466                          "we triger Kernel API DB ");
467                     msleep(30 * 1000);
468                     BUG();
469                 }
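For completeness, the user-space side of the feeding is system_server's SW watchdog periodically issuing an ioctl on /dev/RT_Monitor to reset hang_detect_counter. Below is only a minimal sketch of such a feeding loop, written under assumptions: the request code HANG_DETECT_SET_TIMEOUT and its argument are placeholders, not the real command defined by the aed/monitor_hang driver.

/* Hypothetical sketch of the "feed the dog" side, assuming an ioctl-based
 * interface on /dev/RT_Monitor. The request code is a placeholder. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

#define HANG_DETECT_SET_TIMEOUT  _IOW('p', 0x01, int)  /* placeholder, not the real code */

int main(void)
{
	int fd = open("/dev/RT_Monitor", O_RDWR);
	if (fd < 0) {
		perror("open /dev/RT_Monitor");
		return 1;
	}
	for (;;) {
		int counter = 10;	/* refill hang_detect_counter (placeholder semantics) */
		if (ioctl(fd, HANG_DETECT_SET_TIMEOUT, &counter) < 0)
			perror("ioctl");
		sleep(30);		/* feed well inside the 10 * 30 s budget */
	}
	return 0;
}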
Symptom
On the mt6580 platform (user build), repeatedly performing [camera burst shot, then back] for 5-30 minutes makes the phone freeze, after which it reboots automatically.
Initial analysis
Unpacking the db, the kernel log shows that a KE (kernel exception) occurred, and that it was triggered by BUG().
[ 1832.311664] -(0)[117:hang_detect]------------[ cut here ]------------
[ 1832.311696] -(0)[117:hang_detect]Kernel BUG at c0741d80 [verbose debug info unavailable]
[ 1832.311709] -(0)[117:hang_detect]Internal error: Oops - BUG: 0 [#1] PREEMPT SMP ARM
[ 1832.311723] -(0)[117:hang_detect]disable aee kernel api
[ 1833.311741] -(0)[117:hang_detect]Non-crashing CPUs did not react to IPI
[ 1833.311768] -(0)[117:hang_detect]CPU: 0 PID: 117 Comm: hang_detect Tainted: G W 3.18.35+ #2
[ 1833.311783] -(0)[117:hang_detect]task: df117200 ti: de608000 task.ti: de608000
[ 1833.311804] -(0)[117:hang_detect]PC is at hang_detect_thread+0x240/0x2e8
[ 1833.311821] -(0)[117:hang_detect]LR is at _raw_spin_unlock_irqrestore+0x30/0x5c
..
[ 1839.517818] -(0)[117:hang_detect]Backtrace:
[ 1839.518142] -(0)[117:hang_detect][<c0741b40>] (hang_detect_thread) from [<c013d634>] (kthread+0xe8/0xfc)
[ 1839.518286] -(0)[117:hang_detect] r9:00000000 r8:00000000 r7:c0741b40 r6:00000000 r5:00000000 r4:de5ebf80
[ 1839.518953] -(0)[117:hang_detect][<c013d54c>] (kthread) from [<c01070a0>] (ret_from_fork+0x14/0x34)
[ 1839.519095] -(0)[117:hang_detect] r7:00000000 r6:00000000 r5:c013d54c r4:de5ebf80
[ 1839.519619] -(0)[117:hang_detect]Code: e34c00de ebffe2ce e3070530 ebe908fe (e7f001f2)
[ 1839.519843] -(0)[117:hang_detect]---[ end trace 0d39c03f35ba1634 ]---
[ 1840.396724] -(0)[117:hang_detect]Kernel panic - not syncing: Fatal exception
Pull out the mini_ramdump plus vmlinux and start gdb; it is immediately clear that this issue falls into the hang detect category.
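(The address-to-source lookups used throughout this analysis only need vmlinux and a cross gdb; the toolchain prefix below is a placeholder, and loading the converted ramdump core is tool-dependent:)
$ arm-linux-androideabi-gdb vmlinux
(gdb) list *0xc0741d80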
#0  0xc0741d80 in hang_detect_thread (arg=<optimized out>)
    at /home/android/.jenkins/jobs/K300AG_7.0_TSDS/workspace/kernel-3.18/drivers/misc/mediatek/aee/aed/monitor_hang.c:467
467                     msleep(30 * 1000);
(gdb) list
462                     LOGE("[Hang_Detect] we should triger KE...\n");
463                     aee_kernel_exception_api(__FILE__, __LINE__,
464                          DB_OPT_NE_JBT_TRACES | DB_OPT_DISPLAY_HANG_DUMP,
465                          "\nCRDISPATCH_KEY:SS Hang\n",
466                          "we triger Kernel API DB ");
467                     msleep(30 * 1000);
468                     BUG();
469                 }
Analyzing the kernel log, look at the timestamps of the first and the last hang detect counter prints:
<3>[ 1441.021890]  (1)[117:hang_detect][Hang_Detect] hang_detect thread counts down 10:10.
...
<3>[ 1801.141892]  (0)[117:hang_detect][Hang_Detect] hang_detect thread counts down 0:10.
Roughly 10 * 30 s elapsed between these two prints.
The D-state tasks are then dumped one by one. To untangle which D-state task is blocking which, the dependencies among all D-state (uninterruptible sleep) tasks have to be examined, starting from where system_server's stuck watchdog thread stopped:
..
<3>[ 1802.041672]  (1)[117:hang_detect]watchdog        D (1)[117:hang_detect] c0aad124
..
<3>[ 1802.284342]  (0)[117:hang_detect]Camera@Display  D
..
<3>[ 1802.294514]  (0)[117:hang_detect]com.myos.camera D
..
<3>[ 1802.295677]  (0)[117:hang_detect]Binder:9840_1   D
<3>[ 1802.296097]  (0)[117:hang_detect]Binder:9840_2   D
<3>[ 1802.296392]  (0)[117:hang_detect]AsyncTask #1    D
<3>[ 1802.296711]  (0)[117:hang_detect]AsyncTask #2    D
<3>[ 1802.297264]  (0)[117:hang_detect]AsyncTask #3    D
..
<3>[ 1802.301450]  (0)[117:hang_detect]service         D
Analyzing the watchdog call stack
<3>[ 1802.041672]  (1)[117:hang_detect]watchdog        D (1)[117:hang_detect] c0aad124
<3>[ 1802.041709]  (1)[117:hang_detect]    0  1139    263 0x00000000
<6>[ 1802.041722]  (1)[117:hang_detect]Backtrace:
<6>[ 1802.041750]  (1)[117:hang_detect][<c0aacde0>] (__schedule) from [<c0aad564>] (schedule+0x38/0x84)
<6>[ 1802.041759]  (1)[117:hang_detect] r10:c1040eb4 r9:ca4a8028 r8:00000002 r7:ce632a00 r6:ffffffff r5:c1040eb0
<6>[ 1802.041786]  (1)[117:hang_detect] r4:c1040eac
<6>[ 1802.041805]  (1)[117:hang_detect][<c0aad52c>] (schedule) from [<c0aad5c8>] (schedule_preempt_disabled+0x18/0x24)
<6>[ 1802.041823]  (1)[117:hang_detect][<c0aad5b0>] (schedule_preempt_disabled) from [<c0aaee60>] (__mutex_lock_slowpath+0xb8/0x194)
<6>[ 1802.041840]  (1)[117:hang_detect][<c0aaeda8>] (__mutex_lock_slowpath) from [<c0aaef90>] (mutex_lock+0x54/0x58)
<6>[ 1802.041848]  (1)[117:hang_detect] r10:00000000 r9:de1cc700 r8:de30e958 r7:ca785480 r6:c1040eac r5:00a0003a
<6>[ 1802.041875]  (1)[117:hang_detect] r4:c1040eac
<6>[ 1802.041896]  (1)[117:hang_detect][<c0aaef3c>] (mutex_lock) from [<c03bdd5c>] (misc_open+0x2c/0x104)
<6>[ 1802.041904]  (1)[117:hang_detect] r4:c1040eac r3:c03bdd30
<6>[ 1802.041928]  (1)[117:hang_detect][<c03bdd30>] (misc_open) from [<c02241e0>] (chrdev_open+0x88/0x140)
<6>[ 1802.041936]  (1)[117:hang_detect] r8:00000000 r7:ca785480 r6:de30e958 r5:df2f6540 r4:00000000 r3:c03bdd30
<6>[ 1802.041972]  (1)[117:hang_detect][<c0224158>] (chrdev_open) from [<c021e200>] (do_dentry_open+0x180/0x2d0)
<6>[ 1802.041980]  (1)[117:hang_detect] r8:00000000 r7:c0224158 r6:ca785488 r5:de30e958 r4:ca785480
<6>[ 1802.042014]  (1)[117:hang_detect][<c021e080>] (do_dentry_open) from [<c021f2c4>] (vfs_open+0x68/0x70)
<6>[ 1802.042022]  (1)[117:hang_detect] r10:00000000 r9:00000000 r8:00000000 r7:ca4a9e88 r6:de1cc700 r5:de30e958
The above is the watchdog thread's call stack: it has already been scheduled out and appears to be waiting for the mutex lock in misc_open. Is that really so? Resolve the address to verify:
(gdb) list *0xc03bdd5c
0xc03bdd5c is in misc_open (/home/android/.jenkins/jobs/K300AG_7.0_TSDS/workspace/kernel-3.18/drivers/char/misc.c:121).
116     int err = -ENODEV;
117     const struct file_operations *new_fops = NULL;
118
119     mutex_lock(&misc_mtx);
120
121     list_for_each_entry(c, &misc_list, list) {
122         if (c->minor == minor) {
123             new_fops = fops_get(c->fops);
124             break;
125         }
(gdb)
Sure enough, the thread is trying to acquire the misc_mtx lock, which is currently held by someone else. A mutex can sleep, so the watchdog thread drops into D state. So who took this lock?
Print it and see:
(gdb) p misc_mtx
$1 = {count = {counter = 1}, wait_lock = {{rlock = {raw_lock = {{slock = 0, tickets = {owner = 0, next = 0}}}}}}, wait_list = {
    next = 0xc1040eb4 <misc_mtx+8>, prev = 0xc1040eb4 <misc_mtx+8>}, owner = 0x0 <__vectors_start>, osq = {tail = {counter = 0}}}
(gdb)
counter = 1 means unlocked, and owner shows 0 -- how can that be? The dump was captured at a different moment than the backtrace above, but it should still reflect the faulty state. Park this question for now; the guess is that it is a mini dump artifact, since a mini dump only preserves about 32 KB around the exception site, and a full dump would be needed for reliable data.
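As a cross-check on that gdb print, the kernel-3.18 struct mutex looks roughly like this (include/linux/mutex.h, lightly condensed); count is the field that encodes the lock state, while owner and osq are only spinner/debug bookkeeping:

struct mutex {
	/* 1: unlocked, 0: locked, negative: locked, possible waiters */
	atomic_t		count;
	spinlock_t		wait_lock;
	struct list_head	wait_list;
#if defined(CONFIG_DEBUG_MUTEXES) || defined(CONFIG_MUTEX_SPIN_ON_OWNER)
	struct task_struct	*owner;
#endif
#ifdef CONFIG_MUTEX_SPIN_ON_OWNER
	struct optimistic_spin_queue osq;	/* spinner MCS lock */
#endif
#ifdef CONFIG_DEBUG_MUTEXES
	void			*magic;
#endif
#ifdef CONFIG_DEBUG_LOCK_ALLOC
	struct lockdep_map	dep_map;
#endif
};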
Continue going through the D-state tasks, searching for whichever one is still inside misc_open operating on the misc_mtx lock:
  C:\Documents and Settings\user.com\桌面\stability training\k300死机\12.09\APLog_2017_1219_140241\kernel_log (9 hits)
    Line 49131: <6>[ 1801.473767]  (1)[117:hang_detect][<c0aaef3c>] (mutex_lock) from [<c03bdd5c>] (misc_open+0x2c/0x104)
    Line 50630: <6>[ 1801.953114]  (0)[117:hang_detect][<c0aaef3c>] (mutex_lock) from [<c03bdd5c>] (misc_open+0x2c/0x104)
    Line 50851: <6>[ 1802.041896]  (1)[117:hang_detect][<c0aaef3c>] (mutex_lock) from [<c03bdd5c>] (misc_open+0x2c/0x104)
    Line 51694: <6>[ 1802.283537]  (0)[117:hang_detect][<c0aaef3c>] (mutex_lock) from [<c03bdd5c>] (misc_open+0x2c/0x104)
    Line 52024: <6>[ 1802.289427]  (0)[117:hang_detect][<c0aaef3c>] (mutex_lock) from [<c03bdd5c>] (misc_open+0x2c/0x104)
    Line 52057: <6>[ 1802.290022]  (0)[117:hang_detect][<c0aaef3c>] (mutex_lock) from [<c03bdd5c>] (misc_open+0x2c/0x104)
    Line 52279: <6>[ 1802.294083]  (0)[117:hang_detect][<c0aaef3c>] (mutex_lock) from [<c03bdd5c>] (misc_open+0x2c/0x104)
    Line 52617: <6>[ 1802.299955]  (0)[117:hang_detect][<c0aaef3c>] (mutex_lock) from [<c03bdd5c>] (misc_open+0x2c/0x104)
    Line 52715: <6>[ 1802.301778]  (0)[117:hang_detect][<c0879300>] (binder_open) from [<c03bddc0>] (misc_open+0x90/0x104)
Sure enough, every task at c03bdd5c is waiting to take misc_open's mutex lock; only the one at c03bddc0 is different, and that is the task we are looking for.
Analyzing the service process
<3>[ 1802.301450]  (0)[117:hang_detect]service         D
<3>[ 1802.301461]  (0)[117:hang_detect] c0aad124 <3>[ 1802.301476]  (0)[117:hang_detect]    0 10387  10164 0x00000000
<6>[ 1802.301485]  (0)[117:hang_detect]Backtrace:
<6>[ 1802.301504]  (0)[117:hang_detect][<c0aacde0>] (__schedule) from [<c0aad564>] (schedule+0x38/0x84)
<6>[ 1802.301512]  (0)[117:hang_detect] r10:c108277c r9:c5296028 r8:00000002 r7:cb56e600 r6:ffffffff r5:c1082778
<6>[ 1802.301538]  (0)[117:hang_detect] r4:c1082774
<6>[ 1802.301639]  (0)[117:hang_detect][<c0aad52c>] (schedule) from [<c0aad5c8>] (schedule_preempt_disabled+0x18/0x24)
<6>[ 1802.301657]  (0)[117:hang_detect][<c0aad5b0>] (schedule_preempt_disabled) from [<c0aaee60>] (__mutex_lock_slowpath+0xb8/0x194)
<6>[ 1802.301674]  (0)[117:hang_detect][<c0aaeda8>] (__mutex_lock_slowpath) from [<c0aaef90>] (mutex_lock+0x54/0x58)
<6>[ 1802.301682]  (0)[117:hang_detect] r10:00000000 r9:cc64a580 r8:c50c2180 r7:c109ec54 r6:c5296000 r5:c5296000
<6>[ 1802.301709]  (0)[117:hang_detect] r4:c1082774
<6>[ 1802.301740]  (0)[117:hang_detect][<c0aaef3c>] (mutex_lock) from [<c08793f4>] (binder_open+0xf4/0x330)
<6>[ 1802.301748]  (0)[117:hang_detect] r4:ca18ac00 r3:00000000
<6>[ 1802.301778]  (0)[117:hang_detect][<c0879300>] (binder_open) from [<c03bddc0>] (misc_open+0x90/0x104)
<6>[ 1802.301790]  (0)[117:hang_detect] r9:cc64a580 r8:de30fd40 r7:c50c2180 r6:c1040eac r5:00000034 r4:c0879300
<6>[ 1802.301834]  (0)[117:hang_detect][<c03bdd30>] (misc_open) from [<c02241e0>] (chrdev_open+0x88/0x140)
Resolving the call stack shows that this process did acquire misc_mtx but never released it; it is blocked in the open callback below. So what is the service process waiting for?
(gdb) list *0xc03bddc0
0xc03bddc0 is in misc_open (/home/android/.jenkins/jobs/K300AG_7.0_TSDS/workspace/kernel-3.18/drivers/char/misc.c:147).
143     err = 0;
144     replace_fops(file, new_fops);
145     if (file->f_op->open) {
146         file->private_data = c;
147         err = file->f_op->open(inode,file);
148     }
149 fail:
150     mutex_unlock(&misc_mtx);
151     return err;
(gdb)
..
Here misc_open has dispatched into binder_open; resolve its addresses next:
(gdb) list *0xc0879300
0xc0879300 is in binder_open (/home/android/.jenkins/jobs/K300AG_7.0_TSDS/workspace/kernel-3.18/drivers/staging/android/binder.c:4361).
4360    static int binder_open(struct inode *nodp, struct file *filp)
4361    {
4362        struct binder_proc *proc;
..
4380        binder_lock(__func__);
..
(gdb) list *0xc08793f4
(gdb) list *0xc08793f4
0xc08793f4 is in binder_open (/home/android/.jenkins/jobs/K300AG_7.0_TSDS/workspace/kernel-3.18/include/linux/jump_label.h:88).
84  #include <linux/atomic.h>
85
86  static inline int static_key_count(struct static_key *key)
87  {
88      return atomic_read(&key->enabled);
89  }
90
91  #ifdef HAVE_JUMP_LABEL
92
(gdb)
What is this atomic_read? Tracing the kernel source shows it is simply code inlined on the mutex lock path, so what the service process is really waiting for here is the binder lock. Then who took the binder lock?
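For reference, in this kernel the binder lock is one global mutex that nearly every binder operation funnels through -- condensed from drivers/staging/android/binder.c (kernel 3.18):

static DEFINE_MUTEX(binder_main_lock);

static void binder_lock(const char *tag)
{
	trace_binder_lock(tag);
	mutex_lock(&binder_main_lock);	/* the "binder lock" everyone is waiting for */
	trace_binder_locked(tag);
}

static void binder_unlock(const char *tag)
{
	trace_binder_unlock(tag);
	mutex_unlock(&binder_main_lock);
}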
Since it is waiting for the binder lock, presumably some other task is in the middle of a binder thread read/write. Screening for tasks that are both in D state and inside binder_thread_read/write, the following two threads meet both conditions:
<3>[ 1802.284342]  (0)[117:hang_detect]Camera@Display  D
<3>[ 1802.295677]  (0)[117:hang_detect]Binder:9840_1   D
Analyzing the Binder:9840_1 thread
<3>[ 1802.295677]  (0)[117:hang_detect]Binder:9840_1   D
<3>[ 1802.295689]  (0)[117:hang_detect] c0aad124 <3>[ 1802.295704]  (0)[117:hang_detect]    0  9851    263 0x00000000
<6>[ 1802.295712]  (0)[117:hang_detect]Backtrace:
<6>[ 1802.295731]  (0)[117:hang_detect][<c0aacde0>] (__schedule) from [<c0aad564>] (schedule+0x38/0x84)
<6>[ 1802.295739]  (0)[117:hang_detect] r10:c108277c r9:dbf5c018 r8:00000002 r7:c2db3c00 r6:ffffffff r5:c1082778
<6>[ 1802.295766]  (0)[117:hang_detect] r4:c1082774
<6>[ 1802.295784]  (0)[117:hang_detect][<c0aad52c>] (schedule) from [<c0aad5c8>] (schedule_preempt_disabled+0x18/0x24)
<6>[ 1802.295801]  (0)[117:hang_detect][<c0aad5b0>] (schedule_preempt_disabled) from [<c0aaee60>] (__mutex_lock_slowpath+0xb8/0x194)
<6>[ 1802.295817]  (0)[117:hang_detect][<c0aaeda8>] (__mutex_lock_slowpath) from [<c0aaef90>] (mutex_lock+0x54/0x58)
<6>[ 1802.295825]  (0)[117:hang_detect] r10:c109ec54 r9:ca18a600 r8:ca18a668 r7:ca18a660 r6:dbf5de40 r5:dae2fd00
<6>[ 1802.295852]  (0)[117:hang_detect] r4:c1082774
<6>[ 1802.295872]  (0)[117:hang_detect][<c0aaef3c>] (mutex_lock) from [<c087dc1c>] (binder_thread_read+0x168/0x12f0)
<6>[ 1802.295880]  (0)[117:hang_detect] r4:00000000 r3:00000000
<6>[ 1802.295903]  (0)[117:hang_detect][<c087dab4>] (binder_thread_read) from [<c088269c>] (binder_ioctl+0x520/0xbe0)
<6>[ 1802.295911]  (0)[117:hang_detect] r10:cb6ff300 r9:a6981828 r8:c109ec54 r7:c0186201 r6:dbf5ded0 r5:c13374dc
<6>[ 1802.295938]  (0)[117:hang_detect] r4:00000000
Looking closely at this thread's call stack, it too is stuck waiting to acquire the binder lock and has been scheduled out, so it is not the current holder of the binder lock!
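That fits how binder_thread_read behaves in this driver version: it drops the global binder lock before sleeping for work and re-takes it after being woken, so a thread parked at mutex_lock inside binder_thread_read is a waiter, not the holder. A condensed excerpt (kernel 3.18 binder.c, elided with ".."):

static int binder_thread_read(struct binder_proc *proc, ...)
{
..
	binder_unlock(__func__);	/* drop the global binder lock before sleeping */
	if (wait_for_proc_work)
		ret = wait_event_freezable_exclusive(proc->wait,
				binder_has_proc_work(proc, thread));
	else
		ret = wait_event_freezable(thread->wait,
				binder_has_thread_work(thread));
	binder_lock(__func__);		/* <-- Binder:9840_1 is parked here, as a waiter */
..
}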
That leaves only the thread doing the binder write.
Analyzing the Camera@Display thread
<3>[ 1802.284342]  (0)[117:hang_detect]Camera@Display  D
<3>[ 1802.284354]  (0)[117:hang_detect] c0aad124 <3>[ 1802.284368]  (0)[117:hang_detect]    0 10194      1 0x00000000
<6>[ 1802.284377]  (0)[117:hang_detect]Backtrace:
<6>[ 1802.284396]  (0)[117:hang_detect][<c0aacde0>] (__schedule) from [<c0aad564>] (schedule+0x38/0x84)
<6>[ 1802.284404]  (0)[117:hang_detect] r10:de110ff8 r9:fffe0001 r8:de111004 r7:ffff0000 r6:ffff0001 r5:c2706000
<6>[ 1802.284431]  (0)[117:hang_detect] r4:00000002
<6>[ 1802.284450]  (0)[117:hang_detect][<c0aad52c>] (schedule) from [<c0aafc0c>] (rwsem_down_write_failed+0x154/0x388)
<6>[ 1802.284467]  (0)[117:hang_detect][<c0aafab8>] (rwsem_down_write_failed) from [<c0aaf53c>] (down_write+0x64/0x68)
<6>[ 1802.284475]  (0)[117:hang_detect] r10:d045bd00 r9:c13374dc r8:00000001 r7:00000000 r6:e7a03000 r5:ca18a600
<6>[ 1802.284501]  (0)[117:hang_detect] r4:de110ff8
<6>[ 1802.284521]  (0)[117:hang_detect][<c0aaf4d8>] (down_write) from [<c087bf28>] (binder_update_page_range+0x190/0x384)
<6>[ 1802.284529]  (0)[117:hang_detect] r4:c109ec54 r3:c109ec54
<6>[ 1802.284552]  (0)[117:hang_detect][<c087bd98>] (binder_update_page_range) from [<c0880744>] (binder_thread_write+0x16d4/0x310c)
<6>[ 1802.284560]  (0)[117:hang_detect] r10:d045bd00 r9:c13374dc r8:e7a01f44 r7:aa27f134 r6:00000000 r5:c1082750
<6>[ 1802.284587]  (0)[117:hang_detect] r4:c109ec54
<6>[ 1802.284606]  (0)[117:hang_detect][<c087f070>] (binder_thread_write) from [<c0882614>] (binder_ioctl+0x498/0xbe0)
<6>[ 1802.284614]  (0)[117:hang_detect] r10:dd5dbb40 r9:a9d024a8 r8:c109ec54 r7:c0186201 r6:c2707ed0 r5:c13374dc
<6>[ 1802.284640]  (0)[117:hang_detect] r4:dd5dbb41
<6>[ 1802.284658]  (0)[117:hang_detect][<c088217c>] (binder_ioctl) from [<c02316f8>] (do_vfs_ioctl+0x3e4/0x5a8)
From the call stack above, this is exactly the thread that holds the binder lock, but it was scheduled out while doing down_write on a semaphore. Camera@Display is therefore also stuck in D state, which blocks every later thread that needs the binder lock!
So what is that down_write waiting on? Keep digging:
(gdb) list *0xc087bf28
0xc087bf28 is in binder_update_page_range (/home/android/.jenkins/jobs/K300AG_7.0_TSDS/workspace/kernel-3.18/drivers/staging/android/binder.c:1459).
1454        else
1455            mm = get_task_mm(proc->tsk);
1456
1457        if (mm) {
1458            down_write(&mm->mmap_sem);
1459            vma = proc->vma;
1460            if (vma && mm != proc->vma_vm_mm) {
1461                pr_err("%d: vma mm and task mm mismatch\n", proc->pid);
1462                vma = NULL;
1463            }
(gdb)
So the Camera@Display thread was scheduled out waiting for the mm->mmap_sem semaphore. From the code above, mm is obtained via get_task_mm(), i.e. this is the mmap_sem rw_semaphore inside the target task's mm_struct -- the "mm lock".
So who took this mm->mmap_sem (mm lock)? (Getting dizzy yet?)
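For reference, the field in question (condensed from include/linux/mm_types.h, kernel 3.18):

struct mm_struct {
..
	struct rw_semaphore mmap_sem;	/* the "mm lock": down_write() in binder_update_page_range()
					 * vs. down_read() in the page-fault path */
..
};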
Analyzing the AsyncTask #2 thread
Continuing through the D-state tasks, the sdcardfs_setattr frame in the AsyncTask #2 thread looks suspicious:
<3>[ 1802.296711]  (0)[117:hang_detect]AsyncTask #2    D
<3>[ 1802.296723]  (0)[117:hang_detect] c0aad124 <3>[ 1802.296738]  (0)[117:hang_detect]    0  9897    263 0x00000000
<6>[ 1802.296747]  (0)[117:hang_detect]Backtrace:
<6>[ 1802.296765]  (0)[117:hang_detect][<c0aacde0>] (__schedule) from [<c0aad564>] (schedule+0x38/0x84)
<6>[ 1802.296773]  (0)[117:hang_detect] r10:deec5418 r9:cd044008 r8:00000002 r7:ce93da00 r6:ffffffff r5:deec5414
<6>[ 1802.296800]  (0)[117:hang_detect] r4:deec5410
<6>[ 1802.296818]  (0)[117:hang_detect][<c0aad52c>] (schedule) from [<c0aad5c8>] (schedule_preempt_disabled+0x18/0x24)
<6>[ 1802.296835]  (0)[117:hang_detect][<c0aad5b0>] (schedule_preempt_disabled) from [<c0aaee60>] (__mutex_lock_slowpath+0xb8/0x194)
<6>[ 1802.296851]  (0)[117:hang_detect][<c0aaeda8>] (__mutex_lock_slowpath) from [<c0aaef90>] (mutex_lock+0x54/0x58)
<6>[ 1802.296859]  (0)[117:hang_detect] r10:cd044000 r9:da3a4110 r8:cd045d98 r7:cd045ce0 r6:deec53a0 r5:00000000
<6>[ 1802.296885]  (0)[117:hang_detect] r4:deec5410
<6>[ 1802.296906]  (0)[117:hang_detect][<c0aaef3c>] (mutex_lock) from [<c02f1750>] (sdcardfs_setattr+0x140/0x33c)
<6>[ 1802.296914]  (0)[117:hang_detect] r4:c24414a8 r3:0000a068
<6>[ 1802.296936]  (0)[117:hang_detect][<c02f1610>] (sdcardfs_setattr) from [<c023a8bc>] (notify_change+0x1d0/0x33c)
..
gdb plus the source:
(gdb) list *0xc02f1610
0xc02f1610 is in sdcardfs_setattr (/home/android/.jenkins/jobs/K300AG_7.0_TSDS/workspace/kernel-3.18/fs/sdcardfs/inode.c:462).
461 static int sdcardfs_setattr(struct dentry *dentry, struct iattr *ia)
462 {
463     int err;
..
516     if (current->mm)
(gdb)
517         down_write(&current->mm->mmap_sem);
..
(gdb) list *0xc02f1750
0xc02f1750 is in sdcardfs_setattr (/home/android/.jenkins/jobs/K300AG_7.0_TSDS/workspace/kernel-3.18/fs/sdcardfs/inode.c:543).
538      * Note: we use lower_dentry->d_inode, because lower_inode may be
539      * unlinked (no inode->i_sb and i_ino==0.  This happens if someone
540      * tries to open(), unlink(), then ftruncate() a file.
541      */
542     mutex_lock(&lower_dentry->d_inode->i_mutex);
543     err = notify_change(lower_dentry, &lower_ia, /* note: lower_ia */
544             NULL);
545     mutex_unlock(&lower_dentry->d_inode->i_mutex);
546     if (current->mm)
547         up_write(&current->mm->mmap_sem);
Combining the call stack with the source, it is clear that AsyncTask #2 first took the mm lock (current->mm->mmap_sem, LINE 517) but, before it could release it (LINE 547), fell into waiting on i_mutex!
So every subsequent thread that needs the mm lock is blocked and left waiting.
Search for the sdcardfs keyword:
C:\Documents and Settings\user.com\桌面\stability training\k300死机\12.09\APLog_2017_1219_140241\kernel_log (4 hits)
    Line 52441: <6>[ 1802.296906]  (0)[117:hang_detect][<c0aaef3c>] (mutex_lock) from [<c02f1750>] (sdcardfs_setattr+0x140/0x33c)
    Line 52443: <6>[ 1802.296936]  (0)[117:hang_detect][<c02f1610>] (sdcardfs_setattr) from [<c023a8bc>] (notify_change+0x1d0/0x33c)
    Line 52493: <6>[ 1802.297793]  (0)[117:hang_detect][<c0220aa8>] (vfs_write) from [<c02f1420>] (sdcardfs_write+0x44/0x190)
    Line 52495: <6>[ 1802.297834]  (0)[117:hang_detect][<c02f13dc>] (sdcardfs_write) from [<c0220b54>] (vfs_write+0xac/0x1b8)
It turns out the AsyncTask #3 thread is inside sdcardfs_write:
Analyzing the AsyncTask #3 thread
<3>[ 1802.297264]  (0)[117:hang_detect]AsyncTask #3    D
<3>[ 1802.297275]  (0)[117:hang_detect] c0aad124 <3>[ 1802.297290]  (0)[117:hang_detect]    0  9951    263 0x00000000
<6>[ 1802.297299]  (0)[117:hang_detect]Backtrace:
<6>[ 1802.297318]  (0)[117:hang_detect][<c0aacde0>] (__schedule) from [<c0aad564>] (schedule+0x38/0x84)
<6>[ 1802.297326]  (0)[117:hang_detect] r10:c1028fe4 r9:8e47900b r8:00000017 r7:de110fc0 r6:00000002 r5:de110ff8
<6>[ 1802.297352]  (0)[117:hang_detect] r4:c212f200
<6>[ 1802.297371]  (0)[117:hang_detect][<c0aad52c>] (schedule) from [<c0aafa78>] (rwsem_down_read_failed+0xe0/0x120)
<6>[ 1802.297387]  (0)[117:hang_detect][<c0aaf998>] (rwsem_down_read_failed) from [<c0aaf4d4>] (down_read+0x38/0x3c)
<6>[ 1802.297395]  (0)[117:hang_detect] r6:c212f200 r5:c6712018 r4:c6713cb0
<6>[ 1802.297421]  (0)[117:hang_detect][<c0aaf49c>] (down_read) from [<c0114688>] (do_page_fault+0xb4/0x3ec)
<6>[ 1802.297435]  (0)[117:hang_detect][<c01145d4>] (do_page_fault) from [<c0100348>] (do_DataAbort+0x7c/0x110)
<6>[ 1802.297443]  (0)[117:hang_detect] r10:c1028fe4 r9:8e47900b r8:00000007 r7:00000017 r6:c6713cb0 r5:c1028f74
<6>[ 1802.297470]  (0)[117:hang_detect] r4:c6712000
<6>[ 1802.297487]  (0)[117:hang_detect][<c01002cc>] (do_DataAbort) from [<c010c7d8>] (__dabt_svc+0x38/0x60)
<6>[ 1802.297497]  (0)[117:hang_detect]Exception stack(0xc6713cb0 to 0xc6713cf8)
<6>[ 1802.297510]  (0)[117:hang_detect]3ca0:                                     00000000 8e47900b 00001000 00000034
<6>[ 1802.297525]  (0)[117:hang_detect]3cc0: 00001000 c6713e04 deec5474 00001000 00000000 c0b1a040 00001000 c6713d0c
<6>[ 1802.297538]  (0)[117:hang_detect]3ce0: 000d3394 c6713cf8 c6713dfc c01fd150 200f0013 ffffffff
<6>[ 1802.297545]  (0)[117:hang_detect] r10:00001000 r9:c0b1a040 r8:00000000 r7:c6713ce4 r6:ffffffff r5:200f0013
<6>[ 1802.297572]  (0)[117:hang_detect] r4:c01fd150
<6>[ 1802.297594]  (0)[117:hang_detect][<c01fd0d8>] (iov_iter_fault_in_readable) from [<c01dc1c0>] (generic_perform_write+0x74/0x1c4)
<6>[ 1802.297610]  (0)[117:hang_detect][<c01dc14c>] (generic_perform_write) from [<c01dc4dc>] (__generic_file_write_iter+0x1cc/0x3ec)
<6>[ 1802.297618]  (0)[117:hang_detect] r10:c6713e04 r9:00000000 r8:deec5474 r7:c6713e18 r6:c6712030 r5:db4a0300
<6>[ 1802.297645]  (0)[117:hang_detect] r4:00133394
<6>[ 1802.297664]  (0)[117:hang_detect][<c01dc310>] (__generic_file_write_iter) from [<c01dc738>] (generic_file_write_iter+0x3c/0xcc)
<6>[ 1802.297672]  (0)[117:hang_detect] r10:8e41800c r9:ccf380c0 r8:c212f200 r7:db4a0300 r6:c6713e04 r5:deec5410
<6>[ 1802.297698]  (0)[117:hang_detect] r4:c6713e18
<6>[ 1802.297717]  (0)[117:hang_detect][<c01dc6fc>] (generic_file_write_iter) from [<c0220974>] (new_sync_write+0x9c/0xc4)
<6>[ 1802.297725]  (0)[117:hang_detect] r7:db4a0300 r6:c6713f78 r5:00000000 r4:00000000
<6>[ 1802.297753]  (0)[117:hang_detect][<c02208d8>] (new_sync_write) from [<c0220b54>] (vfs_write+0xac/0x1b8)
<6>[ 1802.297761]  (0)[117:hang_detect] r8:da3a4e58 r7:00133394 r6:c6713f78 r5:8e41800c r4:db4a0300
<6>[ 1802.297793]  (0)[117:hang_detect][<c0220aa8>] (vfs_write) from [<c02f1420>] (sdcardfs_write+0x44/0x190)
<6>[ 1802.297801]  (0)[117:hang_detect] r9:ccf380c0 r8:da3a4e58 r7:00133394 r6:c6713f78 r5:8e41800c r4:db4a0300
<6>[ 1802.297834]  (0)[117:hang_detect][<c02f13dc>] (sdcardfs_write) from [<c0220b54>] (vfs_write+0xac/0x1b8)
Resolve the call stack and read the source:
(gdb) list *0xc02f13dc
0xc02f13dc is in sdcardfs_write (/home/android/.jenkins/jobs/K300AG_7.0_TSDS/workspace/kernel-3.18/fs/sdcardfs/file.c:61).
59  static ssize_t sdcardfs_write(struct file *file, const char __user *buf,
60                  size_t count, loff_t *ppos)
61  {
62      int err;
..
73      err = vfs_write(lower_file, buf, count, ppos);
74      /* update our inode times+sizes upon a successful lower write */
75      if (err >= 0) {
(gdb) 
(gdb) list *0xc02208d8
0xc02208d8 is in new_sync_write (/home/android/.jenkins/jobs/K300AG_7.0_TSDS/workspace/kernel-3.18/fs/read_write.c:467).
466 ssize_t new_sync_write(struct file *filp, const char __user *buf, size_t len, loff_t *ppos)
467 {
468     struct iovec iov = { .iov_base = (void __user *)buf, .iov_len = len };
469     struct kiocb kiocb;
470     struct iov_iter iter;
471     ssize_t ret;
(gdb)
..
(gdb) list *0xc0220b54
0xc0220b54 is in vfs_write (/home/android/.jenkins/jobs/K300AG_7.0_TSDS/workspace/kernel-3.18/fs/read_write.c:534).
529     ret = rw_verify_area(WRITE, file, pos, count);
530     if (ret >= 0) {
531         count = ret;
532         file_start_write(file);
533         if (file->f_op->write)
534             ret = file->f_op->write(file, buf, count, pos);
..
(gdb) list *0xc01dc738
0xc01dc738 is in generic_file_write_iter (/home/android/.jenkins/jobs/K300AG_7.0_TSDS/workspace/kernel-3.18/mm/filemap.c:2673).
2668        struct file *file = iocb->ki_filp;
2669        struct inode *inode = file->f_mapping->host;
2670        ssize_t ret;
2671
2672        mutex_lock(&inode->i_mutex);
2673        ret = __generic_file_write_iter(iocb, from);
2674        mutex_unlock(&inode->i_mutex);
..
From this it is clear that the i_mutex lock is held by AsyncTask #3, yet AsyncTask #3 has itself been scheduled out. So what is AsyncTask #3 waiting for? Continue down the call stack:
(gdb) list *0xc0114688
0xc0114688 is in __do_page_fault (/home/android/.jenkins/jobs/K300AG_7.0_TSDS/workspace/kernel-3.18/arch/arm/mm/fault.c:236).
231         unsigned int flags, struct task_struct *tsk)
232 {
233     struct vm_area_struct *vma;
234     int fault;
235
236     vma = find_vma(mm, addr); ???
237     fault = VM_FAULT_BADMAP;
238     if (unlikely(!vma))
239         goto out;
240     if (unlikely(vma->vm_start > addr))
..
This is the page-fault handling path; keep looking:
(gdb) list *0xc01145d4
0xc01145d4 is in do_page_fault (/home/android/.jenkins/jobs/K300AG_7.0_TSDS/workspace/kernel-3.18/arch/arm/mm/fault.c:266).
264 static int __kprobes
265 do_page_fault(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
266 {
267     struct task_struct *tsk;
268     struct mm_struct *mm;
..
299     if (!down_read_trylock(&mm->mmap_sem)) {
300         if (!user_mode(regs) && !search_exception_tables(regs->ARM_pc))
(gdb)
301             goto no_context;
302 retry:
303         down_read(&mm->mmap_sem);
304     } else {
305         /*
306          * The above down_read_trylock() might have succeeded in
307          * which case, we'll have missed the might_sleep() from
308          * down_read()
309          */
310         might_sleep();
(gdb)
311 #ifdef CONFIG_DEBUG_VM
312         if (!user_mode(regs) &&
313             !search_exception_tables(regs->ARM_pc))
314             goto no_context;
315 #endif
316     }
317
318     fault = __do_page_fault(mm, addr, fsr, flags, tsk);
The source gdb resolves here is slightly off -- the location the symbols point at does not fully match the call stack (probably a tooling issue, so it has to be read together with the backtrace) -- but the backtrace still shows that this thread is waiting for the mm lock (the mm->mmap_sem rw_semaphore, LINE 303): while holding i_mutex, the write path touched the not-yet-faulted-in user buffer, took a data abort, and the page-fault handler needs mmap_sem for read. At this point the blocking relationships are clear:
AsyncTask #2 holds the mm lock but is waiting for i_mutex,
AsyncTask #3 holds i_mutex but is waiting for the mm lock
==> deadlock
A simple diagram (the original image is not reproduced here) summed it up: watchdog waits for misc_mtx, held by service; service waits for the binder lock, held by Camera@Display; Camera@Display waits for mm->mmap_sem, held by AsyncTask #2; AsyncTask #2 waits for i_mutex, held by AsyncTask #3; AsyncTask #3 waits for mm->mmap_sem -- a cycle.
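To make the inversion concrete, here is a minimal user-space analogue (illustrative only, not kernel code): a pthread rwlock stands in for mm->mmap_sem and a pthread mutex for i_mutex; running it hangs in the same AB-BA way as the two AsyncTask threads.

/* Minimal user-space sketch of the AB-BA inversion described above. */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_rwlock_t mmap_sem = PTHREAD_RWLOCK_INITIALIZER;	/* stands in for the mm lock */
static pthread_mutex_t  i_mutex  = PTHREAD_MUTEX_INITIALIZER;	/* stands in for i_mutex     */

static void *async_task_2(void *arg)	/* sdcardfs_setattr() path */
{
	(void)arg;
	pthread_rwlock_wrlock(&mmap_sem);	/* down_write(&mm->mmap_sem)        */
	sleep(1);				/* widen the race window            */
	pthread_mutex_lock(&i_mutex);		/* blocks: task #3 holds i_mutex    */
	pthread_mutex_unlock(&i_mutex);
	pthread_rwlock_unlock(&mmap_sem);
	return NULL;
}

static void *async_task_3(void *arg)	/* sdcardfs_write() path */
{
	(void)arg;
	pthread_mutex_lock(&i_mutex);		/* generic_file_write_iter()        */
	sleep(1);
	pthread_rwlock_rdlock(&mmap_sem);	/* page fault: down_read(&mmap_sem) */
	pthread_rwlock_unlock(&mmap_sem);
	pthread_mutex_unlock(&i_mutex);
	return NULL;
}

int main(void)
{
	pthread_t t2, t3;

	pthread_create(&t2, NULL, async_task_2, NULL);
	pthread_create(&t3, NULL, async_task_3, NULL);
	pthread_join(t2, NULL);			/* never returns: deadlock */
	pthread_join(t3, NULL);
	printf("no deadlock this run\n");
	return 0;
}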

Solution
Release the mm lock before taking i_mutex.
fs/sdcardfs/inode.c
461 static int sdcardfs_setattr(struct dentry *dentry, struct iattr *ia)
462 {
..
516     if (current->mm)
517         down_write(&current->mm->mmap_sem);
518     if (ia->ia_valid & ATTR_SIZE) {
519         err = inode_newsize_ok(inode, ia->ia_size);
520         if (err) {
521             if (current->mm)
522                 up_write(&current->mm->mmap_sem);
523             lockdep_on();
524             goto out;
525         }
526         truncate_setsize(inode, ia->ia_size);
527     }
+       if (current->mm)
+           up_write(&current->mm->mmap_sem);
529     /*
530      * mode change is for clearing setuid/setgid bits. Allow lower fs
531      * to interpret this in its own way.
532      */
533     if (lower_ia.ia_valid & (ATTR_KILL_SUID | ATTR_KILL_SGID))
534         lower_ia.ia_valid &= ~ATTR_MODE;
535
536     /* notify the (possibly copied-up) lower inode */
537     /*
538      * Note: we use lower_dentry->d_inode, because lower_inode may be
539      * unlinked (no inode->i_sb and i_ino==0.  This happens if someone
540      * tries to open(), unlink(), then ftruncate() a file.
541      */
542     mutex_lock(&lower_dentry->d_inode->i_mutex);
543     err = notify_change(lower_dentry, &lower_ia, /* note: lower_ia */
544             NULL);
545     mutex_unlock(&lower_dentry->d_inode->i_mutex);
546 -   if (current->mm)
547 -       up_write(&current->mm->mmap_sem);
548     lockdep_on();
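With this change sdcardfs_setattr no longer holds current->mm->mmap_sem while taking the lower inode's i_mutex, so its lock order agrees with the write path (i_mutex first, mmap_sem only inside the page-fault handler), and the AB-BA cycle analyzed above can no longer form.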
