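This series already has four installments. Among hang-and-reboot problems, Watchdog issues are the most common, so today I will continue with how Watchdog problems are analyzed and how the mechanism works.
Application and System Stability, Part 1: A General Approach to Analyzing ANRs
Application and System Stability, Part 2: ANR Detection and Information Collection
Application and System Stability, Part 3: Notes on FD Leaks
Application and System Stability, Part 4: Analyzing a Null Pointer Problem Caused by a Single Thread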

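I. Watchdog Basics

1. What is the Watchdog?

A watchdog is exactly what the name suggests: if it is not "fed" on time, for longer than one minute, it bites. Android has upwards of a hundred system services. To keep a hang in one of SystemServer's core services from freezing the screen, Android introduced the Watchdog mechanism: when such a fault occurs, the Watchdog calls Process.killProcess(Process.myPid()) to kill the SystemServer process. system_server is zygote's eldest disciple, the first process zygote forks. Together the two hold up the Java world, and the death of either one brings the Java world down. So when the child process SystemServer dies, Zygote kills itself, and every process Zygote has spawned is restarted. The effect is a soft reboot of the phone, so the user is not left with an unusable frozen screen.

That is the system's strategy for coping with a hang. What we as developers care about is where exactly the Watchdog fired. As with ANRs, traces are dumped while a Watchdog timeout is in progress, so that the problem can eventually be located and fixed. That requires a mechanism that can reliably detect the timeout.

The Watchdog code lives in /frameworks/base/services/core/java/com/android/server/Watchdog.java

Two kinds of log are common: one is "Blocked in handler", the other is "Blocked in monitor". The difference is analyzed below.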

11-15 06:56:39.696 24203 24902 W Watchdog: *** WATCHDOG KILLING SYSTEM PROCESS: Blocked in handler on main thread (main), Blocked in handler on ui thread (android.ui)
11-15 06:56:39.696 24203 24902 W Watchdog: main thread stack trace:
11-15 06:56:39.696 24203 24902 W Watchdog:     at android.os.MessageQueue.nativePollOnce(Native Method)
11-15 06:56:39.696 24203 24902 W Watchdog:     at android.os.MessageQueue.next(MessageQueue.java:323)
11-15 06:56:39.696 24203 24902 W Watchdog:     at android.os.Looper.loop(Looper.java:142)
11-15 06:56:39.696 24203 24902 W Watchdog:     at com.android.server.SystemServer.run(SystemServer.java:377)
11-15 06:56:39.696 24203 24902 W Watchdog:     at com.android.server.SystemServer.main(SystemServer.java:239)
11-15 06:56:39.696 24203 24902 W Watchdog:     at java.lang.reflect.Method.invoke(Native Method)
11-15 06:56:39.696 24203 24902 W Watchdog:     at com.android.internal.os.ZygoteInit$MethodAndArgsCaller.run(ZygoteInit.java:901)
11-15 06:56:39.696 24203 24902 W Watchdog:     at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:791)
11-15 06:56:39.696 24203 24902 W Watchdog: ui thread stack trace:
......
10-26 00:07:00.884 1000 17132 17312 W Watchdog: *** WATCHDOG KILLING SYSTEM PROCESS: Blocked in monitor com.android.server.Watchdog$BinderThreadMonitor on foreground thread (android.fg)
10-26 00:07:00.884 1000 17132 17312 W Watchdog: foreground thread stack trace:
10-26 00:07:00.885 1000 17132 17312 W Watchdog: at android.os.Binder.blockUntilThreadAvailable(Native Method)
10-26 00:07:00.885 1000 17132 17312 W Watchdog: at com.android.server.Watchdog$BinderThreadMonitor.monitor(Watchdog.java:381)
10-26 00:07:00.885 1000 17132 17312 W Watchdog: at com.android.server.Watchdog$HandlerChecker.run(Watchdog.java:353)
10-26 00:07:00.885 1000 17132 17312 W Watchdog: at android.os.Handler.handleCallback(Handler.java:873)
10-26 00:07:00.886 1000 17132 17312 W Watchdog: at android.os.Handler.dispatchMessage(Handler.java:99)
10-26 00:07:00.886 1000 17132 17312 W Watchdog: at android.os.Looper.loop(Looper.java:193)
10-26 00:07:00.886 1000 17132 17312 W Watchdog: at android.os.HandlerThread.run(HandlerThread.java:65)
10-26 00:07:00.886 1000 17132 17312 W Watchdog: at com.android.server.ServiceThread.run(ServiceThread.java:44)
10-26 00:07:00.886 1000 17132 17312 W Watchdog: *** GOODBYE!
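The two log flavors above can be told apart mechanically. Below is a minimal sketch that pulls the blocked checkers out of the "WATCHDOG KILLING SYSTEM PROCESS" line. It assumes the subject always follows the two shapes seen in the samples ("Blocked in handler on NAME (THREAD)" and "Blocked in monitor CLASS on NAME (THREAD)"); the class WatchdogSubjectParser and its fields are made up for illustration and are not part of Android.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: splits a Watchdog subject line into its blocked checkers. */
final class WatchdogSubjectParser {

    /** One blocked checker extracted from the subject line. */
    static final class Blocked {
        final String kind;     // "handler" or "monitor"
        final String monitor;  // monitor class name, null for the handler case
        final String checker;  // checker name, e.g. "main thread"
        final String thread;   // thread name, e.g. "main"

        Blocked(String kind, String monitor, String checker, String thread) {
            this.kind = kind;
            this.monitor = monitor;
            this.checker = checker;
            this.thread = thread;
        }
    }

    // Assumed format, taken from the two sample logs above.
    private static final Pattern ENTRY = Pattern.compile(
            "Blocked in (handler|monitor (\\S+)) on (.+?) \\(([^)]+)\\)");

    static List<Blocked> parse(String line) {
        List<Blocked> out = new ArrayList<>();
        Matcher m = ENTRY.matcher(line);
        while (m.find()) {
            // Group 2 only participates in the match for the "monitor" alternative.
            boolean isMonitor = m.group(2) != null;
            out.add(new Blocked(isMonitor ? "monitor" : "handler",
                    m.group(2), m.group(3), m.group(4)));
        }
        return out;
    }
}
```

Feeding it the first sample line yields two "handler" entries (main thread and ui thread); the second sample yields one "monitor" entry naming Watchdog$BinderThreadMonitor on the foreground thread.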
2. Initialization
[Figure: watchdog初始化.png (Watchdog initialization)]

Watchdog itself extends Thread, and it is initialized during SystemServer startup.

public final class SystemServer {
    ......
    /**
     * Starts a miscellaneous grab bag of stuff that has yet to be refactored
     * and organized.
     */
    private void startOtherServices() {
        ......
        try {
            ......
            traceBeginAndSlog("InitWatchdog");
            final Watchdog watchdog = Watchdog.getInstance(); // get the Watchdog object
            // initialize it
            watchdog.init(context, mActivityManagerService); // registers a receiver for the system reboot broadcast
            Trace.traceEnd(Trace.TRACE_TAG_SYSTEM_SERVER);
            ......
        }
        ......
        mActivityManagerService.systemReady(new Runnable() {
            @Override
            public void run() {
                ......
                Watchdog.getInstance().start();
                ......
            }
        });
    }

    public static Watchdog getInstance() {
        if (sWatchdog == null) {
            sWatchdog = new Watchdog();
        }

        return sWatchdog;
    }

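To implement timeout detection, the Watchdog constructor builds a number of HandlerChecker objects, which fall into two categories:

  • Monitor Checker: checks for possible deadlocks on Monitor objects. Core system services such as AMS, PKMS and WMS are all Monitor objects.
  • Looper Checker: checks whether a thread's message queue has been busy for a long time. The Watchdog's own message queue and the global ui, io and display message queues are all checked. In addition, the message queues of some important threads, such as those of AMS and PKMS, are also added to the Looper Checker; this happens when the corresponding objects are initialized.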
    /* This handler will be used to post message back onto the main thread */
    final ArrayList<HandlerChecker> mHandlerCheckers = new ArrayList<>();

    private Watchdog() {
        // Effectively calls the parent Thread constructor to set the thread name
        super("watchdog");
        // Initialize handler checkers for each common thread we want to check.  Note
        // that we are not currently checking the background thread, since it can
        // potentially hold longer running operations with no guarantees about the timeliness
        // of operations there.

        // The shared foreground thread is the main checker.  It is where we
        // will also dispatch monitor checks and do other work.
        mMonitorChecker = new HandlerChecker(FgThread.getHandler(),
                "foreground thread", DEFAULT_TIMEOUT);
        mHandlerCheckers.add(mMonitorChecker);
        // Add checker for main thread.  We only do a quick check since there
        // can be UI running on the thread.
        mHandlerCheckers.add(new HandlerChecker(new Handler(Looper.getMainLooper()),
                "main thread", DEFAULT_TIMEOUT));
        // Add checker for shared UI thread.
        mHandlerCheckers.add(new HandlerChecker(UiThread.getHandler(),
                "ui thread", DEFAULT_TIMEOUT));
        // And also check IO thread.
        mHandlerCheckers.add(new HandlerChecker(IoThread.getHandler(),
                "i/o thread", DEFAULT_TIMEOUT));
        // And the display thread.
        mHandlerCheckers.add(new HandlerChecker(DisplayThread.getHandler(),
                "display thread", DEFAULT_TIMEOUT));

        // Initialize monitor for Binder threads.
        addMonitor(new BinderThreadMonitor());
        // Android O adds monitoring for FD leaks
        mOpenFdMonitor = OpenFdMonitor.create();
        ......
    }

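DEFAULT_TIMEOUT is normally one minute; for installd it is 10 minutes.
The two kinds of HandlerChecker have different emphases:

Monitor Checker warns us not to hold the object lock of a core system service for long, otherwise many functions will be blocked;
Looper Checker warns us not to hog a message queue for long, otherwise other messages will never get processed.

So the Watchdog relies on these two kinds of checker to do its work.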

3. Basic Principles
3.1 How Checker objects are added

Take AMS as an example: it adds both a Monitor Checker and a Looper Checker, and it implements the Watchdog.Monitor interface by overriding the monitor method.

public class ActivityManagerService extends IActivityManager.Stub
        implements Watchdog.Monitor, BatteryStatsImpl.BatteryCallback {
    ......
    public ActivityManagerService(Context systemContext) {
        ......
        Watchdog.getInstance().addMonitor(this);
        Watchdog.getInstance().addThread(mHandler);
        ......
    }
    ......
    /** In this method we try to acquire our lock to make sure that we have not deadlocked */
    public void monitor() {
        synchronized (this) { }
    }
    ......
}

When AMS is constructed, it calls Watchdog's addMonitor and addThread to register itself and mHandler, its MainHandler object.

    public void addThread(Handler thread) {
        addThread(thread, DEFAULT_TIMEOUT);
    }

    public void addThread(Handler thread, long timeoutMillis) {
        synchronized (this) {
            if (isAlive()) {
                throw new RuntimeException("Threads can't be added once the Watchdog is running");
            }
            final String name = thread.getLooper().getThread().getName();
            mHandlerCheckers.add(new HandlerChecker(thread, name, timeoutMillis));
        }
    }

    public void addMonitor(Monitor monitor) {
        synchronized (this) {
            if (isAlive()) {
                throw new RuntimeException("Monitors can't be added once the Watchdog is running");
            }
            mMonitorChecker.addMonitor(monitor);
        }
    }

mMonitorChecker is a HandlerChecker object, so Watchdog's addMonitor really delegates to HandlerChecker's addMonitor method, whereas mHandlerCheckers is an ArrayList, so a new checker can simply be added to it.

    public final class HandlerChecker implements Runnable {
        private final Handler mHandler;
        private final String mName;
        private final long mWaitMax;
        private final ArrayList<Monitor> mMonitors = new ArrayList<Monitor>();
        private boolean mCompleted;
        private Monitor mCurrentMonitor;
        private long mStartTime;

        HandlerChecker(Handler handler, String name, long waitMaxMillis) {
            mHandler = handler;
            mName = name;
            mWaitMax = waitMaxMillis;
            mCompleted = true;
        }

        public void addMonitor(Monitor monitor) {
            mMonitors.add(monitor);
        }
        ......
    }
3.2 Core principle

Once the checkers have been added, how are they used? Since Watchdog extends Thread, go straight to its run method.

    @Override
    public void run() {
        boolean waitedHalf = false;
        while (true) {
            final ArrayList<HandlerChecker> blockedCheckers;
            final String subject;
            final boolean allowRestart;
            // whether a debugger is attached
            int debuggerWasConnected = 0;
            synchronized (this) {
                // CHECK_INTERVAL is half of DEFAULT_TIMEOUT, normally 30s
                long timeout = CHECK_INTERVAL;
                // 1. Schedule every HandlerChecker
                for (int i=0; i<mHandlerCheckers.size(); i++) {
                    HandlerChecker hc = mHandlerCheckers.get(i);
                    hc.scheduleCheckLocked();
                }
                ......
                // 2. Start the periodic check
                long start = SystemClock.uptimeMillis();
                while (timeout > 0) {
                    if (Debug.isDebuggerConnected()) {
                        debuggerWasConnected = 2;
                    }
                    try {
                        wait(timeout);
                    } catch (InterruptedException e) {
                        Log.wtf(TAG, e);
                    }
                    if (Debug.isDebuggerConnected()) {
                        debuggerWasConnected = 2;
                    }
                    timeout = CHECK_INTERVAL - (SystemClock.uptimeMillis() - start);
                }
                // 3. Evaluate the state; the three non-overdue states are handled here
                final int waitState = evaluateCheckerCompletionLocked();
                if (waitState == COMPLETED) {
                    // The monitors have returned; reset
                    waitedHalf = false;
                    continue;
                } else if (waitState == WAITING) {
                    // still waiting but within their configured intervals; back off and recheck
                    continue;
                } else if (waitState == WAITED_HALF) {
                    if (!waitedHalf) {
                        // Half of the timeout has elapsed: dump stack traces
                        ArrayList<Integer> pids = new ArrayList<Integer>();
                        pids.add(Process.myPid());
                        ActivityManagerService.dumpStackTraces(true, pids, null, null,
                            getInterestingNativePids());
                        waitedHalf = true;
                    }
                    continue;
                }

                // Reaching this point means some HandlerChecker has timed out
                blockedCheckers = getBlockedCheckersLocked();
                subject = describeCheckersLocked(blockedCheckers);
                allowRestart = mAllowRestart;
            }

            // If we got here, that means that the system is most likely hung.
            // First collect stack traces from all threads of the system process.
            // Then kill this process so that the system will restart.
            // Record in the event log that a watchdog fired
            EventLog.writeEvent(EventLogTags.WATCHDOG, subject);

            ArrayList<Integer> pids = new ArrayList<>();
            pids.add(Process.myPid());
            if (mPhonePid > 0) pids.add(mPhonePid);
            // Pass !waitedHalf so that just in case we somehow wind up here without having
            // Dump stack traces for the processes in pids and in getInterestingNativePids
            final File stack = ActivityManagerService.dumpStackTraces(
                    !waitedHalf, pids, null, null, getInterestingNativePids());

            // Give some extra time to make sure the stack traces get written.
            // The system's been hanging for a minute, another second or two won't hurt much.
            SystemClock.sleep(2000);

            // Pull our own kernel thread stacks as well if we're configured for that
            // Dump kernel stack traces
            if (RECORD_KERNEL_THREADS) {
                dumpKernelStackTraces();
            }

            // Trigger the kernel to dump all blocked threads, and backtraces on all CPUs to the kernel log
            doSysRq('w');
            doSysRq('l');

            // Try to add the error to the dropbox, but assuming that the ActivityManager
            // itself may be deadlocked.  (which has happened, causing this statement to
            // deadlock and the watchdog as a whole to be ineffective)
            Thread dropboxThread = new Thread("watchdogWriteToDropbox") {
                    public void run() {
                        // Add the error to the DropBox files
                        mActivity.addErrorToDropBox(
                                "watchdog", null, "system_server", null, null,
                                subject, null, stack, null);
                    }
                };
            dropboxThread.start();
            ......

            // Only kill the process if the debugger is not attached.
            if (Debug.isDebuggerConnected()) {
                debuggerWasConnected = 2;
            }
            if (debuggerWasConnected >= 2) {
                Slog.w(TAG, "Debugger connected: Watchdog is *not* killing the system process");
            } else if (debuggerWasConnected > 0) {
                Slog.w(TAG, "Debugger was connected: Watchdog is *not* killing the system process");
            } else if (!allowRestart) {
                Slog.w(TAG, "Restart not allowed: Watchdog is *not* killing the system process");
            } else {
                Slog.w(TAG, "*** WATCHDOG KILLING SYSTEM PROCESS: " + subject);
                for (int i=0; i<blockedCheckers.size(); i++) {
                    Slog.w(TAG, blockedCheckers.get(i).getName() + " stack trace:");
                    StackTraceElement[] stackTrace
                            = blockedCheckers.get(i).getThread().getStackTrace();
                    for (StackTraceElement element: stackTrace) {
                        Slog.w(TAG, "    at " + element);
                    }
                }
                Slog.w(TAG, "*** GOODBYE!");
                // Finally kill the system process
                Process.killProcess(Process.myPid());
                System.exit(10);
            }

            waitedHalf = false;
        }
    }
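One small detail in step 2 of run() is worth isolating: wait(timeout) sits inside a loop that recomputes the remaining time, so an early wake-up (a notify, an interrupt) does not shorten the 30 s interval. Here is a minimal plain-Java sketch of that pattern. It uses System.nanoTime() in place of SystemClock.uptimeMillis(), and the class IntervalWait with its poke() helper is invented for the demonstration.

```java
/** Hypothetical demo of "wait for the full interval even if woken early". */
final class IntervalWait {
    private final Object lock = new Object();

    /** Blocks for at least intervalMs and returns the elapsed milliseconds. */
    long waitFor(long intervalMs) {
        final long start = System.nanoTime();
        synchronized (lock) {
            long remaining = intervalMs;
            while (remaining > 0) {
                try {
                    lock.wait(remaining);
                } catch (InterruptedException e) {
                    // The Watchdog loop logs and keeps waiting; this sketch just keeps waiting.
                }
                // Recompute from the start time, as the Watchdog loop does.
                remaining = intervalMs - elapsedMs(start);
            }
        }
        return elapsedMs(start);
    }

    /** Wakes the waiter early; waitFor() should go back to waiting. */
    void poke() {
        synchronized (lock) {
            lock.notifyAll();
        }
    }

    private static long elapsedMs(long startNanos) {
        return (System.nanoTime() - startNanos) / 1_000_000L;
    }
}
```

Calling poke() from another thread while waitFor(100) is running does not make it return before 100 ms have passed, which is the property the Watchdog loop relies on for its check interval.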

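Summary of the principle:

  • 1. Every service that needs to be monitored calls Watchdog's addMonitor to add a Monitor Checker to the mMonitors list, or addThread to add a Looper Checker to the mHandlerCheckers list.
  • 2. Once the Watchdog thread has started, its run method executes and loops forever:
    • Step one calls HandlerChecker#scheduleCheckLocked on every entry in mHandlerCheckers.
    • Step two checks periodically for a timeout. The interval between checks is set by the CHECK_INTERVAL constant, 30 seconds. Each check calls evaluateCheckerCompletionLocked() to evaluate the completion state of the HandlerCheckers:
      COMPLETED means the check has finished.
      WAITING and WAITED_HALF mean the check is still pending but has not timed out; a trace is dumped once at WAITED_HALF.
      OVERDUE means the check has timed out. By default the timeout is 1 minute.
  • 3. If the timeout expires and some HandlerChecker is still unfinished (OVERDUE), getBlockedCheckersLocked() returns the blocked HandlerCheckers, a description is generated, and logs are saved, including runtime stack information.
  • 4. Finally, the SystemServer process is killed.

    [Figure: Watchdog原理.png (how the Watchdog works)]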
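The four states in step 2 can be made concrete with a little arithmetic. The sketch below is my reading of how a checker's state follows from the time elapsed since scheduleCheckLocked(), and how the overall state is the worst state of any checker. The listing above does not show evaluateCheckerCompletionLocked(), so treat the numeric constants and the Checker class here as assumptions for illustration.

```java
import java.util.List;

/** Hypothetical model of the COMPLETED / WAITING / WAITED_HALF / OVERDUE evaluation. */
final class CompletionStateSketch {
    // Ordered so that a larger value means a worse state (assumed values).
    static final int COMPLETED = 0;
    static final int WAITING = 1;
    static final int WAITED_HALF = 2;
    static final int OVERDUE = 3;

    /** Minimal stand-in for a HandlerChecker's timing fields. */
    static final class Checker {
        final boolean completed;
        final long startTime; // when the check was scheduled, in ms
        final long waitMax;   // timeout, e.g. 60_000 ms

        Checker(boolean completed, long startTime, long waitMax) {
            this.completed = completed;
            this.startTime = startTime;
            this.waitMax = waitMax;
        }

        int state(long now) {
            if (completed) {
                return COMPLETED;
            }
            long latency = now - startTime;
            if (latency < waitMax / 2) {
                return WAITING;      // first half of the timeout
            } else if (latency < waitMax) {
                return WAITED_HALF;  // second half: time to dump a first trace
            }
            return OVERDUE;          // full timeout elapsed
        }
    }

    /** The overall state is the worst state among all checkers. */
    static int evaluate(List<Checker> checkers, long now) {
        int worst = COMPLETED;
        for (Checker c : checkers) {
            worst = Math.max(worst, c.state(now));
        }
        return worst;
    }
}
```

With a 60 s timeout and a check every 30 s, a checker that never completes is seen as WAITED_HALF at the first check (30 s) and as OVERDUE at the second (60 s), which matches the two trace dumps described above.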

That is the rough summary of the principle. A few details still need a closer look.

3.2.1 What does HandlerChecker#scheduleCheckLocked do?
        public void scheduleCheckLocked() {
            // If there are no monitors and the message queue is polling (idle), nothing is
            // blocked: set mCompleted = true and return immediately
            if (mMonitors.size() == 0 && mHandler.getLooper().getQueue().isPolling()) {
                // If the target looper has recently been polling, then
                // there is no reason to enqueue our checker on it since that
                // is as good as it not being deadlocked.  This avoid having
                // to do a context switch to check the thread.  Note that we
                // only do this if mCheckReboot is false and we have no
                // monitors, since those would need to be executed at this point.
                mCompleted = true;
                return;
            }
            ......
            mCompleted = false;
            mCurrentMonitor = null;
            mStartTime = SystemClock.uptimeMillis();
            // Post a message to the very front of mHandler's message queue
            mHandler.postAtFrontOfQueue(this);
        }

If the message posted above gets executed, the run method below is entered and tries to acquire the locks by calling monitor.

    public final class HandlerChecker implements Runnable {
        .......
        @Override
        public void run() {
            final int size = mMonitors.size();
            for (int i = 0 ; i < size ; i++) {
                synchronized (Watchdog.this) {
                    mCurrentMonitor = mMonitors.get(i);
                }
                mCurrentMonitor.monitor();
            }

            synchronized (Watchdog.this) {
                mCompleted = true;
                mCurrentMonitor = null;
            }
        }
    }

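For a Looper Checker, the question is whether the thread's message queue is idle. If the monitored queue never gets a chance to go idle, it has probably been blocked for a long time.

If the message posted in scheduleCheckLocked does get executed, then for a Monitor Checker the implementing class's monitor method is called, such as the AMS.monitor() method shown earlier. The implementation is usually trivial: it acquires the object's own lock. If that lock is already held by someone else, monitor() stays blocked until the timeout.

If the message posted in scheduleCheckLocked cannot be executed, the message ahead of it in the queue is still running and has not finished, which also ends in a timeout. You have to admire how neat this design is: postAtFrontOfQueue kills two birds with one stone, detecting both a lock that is held too long and a Message in the queue that runs too long.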
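The "two birds with one stone" idea can be reproduced outside Android. The sketch below is a plain-Java analogue, not the real implementation: a single-thread ExecutorService stands in for the Handler's message queue (it has no equivalent of postAtFrontOfQueue, so the probe is simply enqueued), and plain lock objects stand in for Monitor implementations. The names CheckerSketch, schedule(), describe() and demo() are invented. As far as I can tell, the two log strings correspond to whether the checker is stuck before or inside a monitor() call, which is what describe() models.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Hypothetical analogue of HandlerChecker built on java.util.concurrent. */
final class CheckerSketch implements Runnable {
    private final ExecutorService queue; // stands in for the Handler's message queue
    private final Object[] locks;        // each lock stands in for one Monitor
    private boolean completed = true;
    private Object currentLock;          // non-null while inside a "monitor" call

    CheckerSketch(ExecutorService queue, Object... locks) {
        this.queue = queue;
        this.locks = locks;
    }

    /** Counterpart of scheduleCheckLocked(): mark as pending and enqueue the probe. */
    synchronized void schedule() {
        completed = false;
        currentLock = null;
        queue.execute(this);
    }

    /** Runs on the monitored thread only if its queue gets around to it. */
    @Override
    public void run() {
        for (Object lock : locks) {
            synchronized (this) {
                currentLock = lock;
            }
            synchronized (lock) {
                // Same idea as AMS.monitor(): returns once the lock can be taken.
            }
        }
        synchronized (this) {
            completed = true;
            currentLock = null;
        }
    }

    synchronized String describe() {
        if (completed) {
            return "ok";
        }
        return currentLock == null ? "Blocked in handler" : "Blocked in monitor";
    }

    /** Returns descriptions for: healthy thread, busy queue, held lock. */
    static String[] demo() {
        String[] result = new String[3];
        final Object lock = new Object();

        // 1. Idle queue and free lock: the probe completes.
        ExecutorService q1 = Executors.newSingleThreadExecutor();
        CheckerSketch c1 = new CheckerSketch(q1, lock);
        c1.schedule();
        result[0] = awaitState(c1, "ok");
        q1.shutdown();

        // 2. A long-running message occupies the queue: the probe never starts.
        ExecutorService q2 = Executors.newSingleThreadExecutor();
        final CountDownLatch finish = new CountDownLatch(1);
        q2.execute(() -> awaitQuietly(finish));
        CheckerSketch c2 = new CheckerSketch(q2, lock);
        c2.schedule();
        result[1] = c2.describe();
        finish.countDown();
        q2.shutdown();

        // 3. Another thread holds the lock: the probe starts but blocks in the monitor.
        ExecutorService q3 = Executors.newSingleThreadExecutor();
        final CountDownLatch held = new CountDownLatch(1);
        final CountDownLatch release = new CountDownLatch(1);
        Thread holder = new Thread(() -> {
            synchronized (lock) {
                held.countDown();
                awaitQuietly(release);
            }
        }, "lock-holder");
        holder.start();
        awaitQuietly(held);
        CheckerSketch c3 = new CheckerSketch(q3, lock);
        c3.schedule();
        result[2] = awaitState(c3, "Blocked in monitor");
        release.countDown();
        q3.shutdown();
        return result;
    }

    /** Polls describe() for up to two seconds until it reports the wanted state. */
    private static String awaitState(CheckerSketch c, String wanted) {
        long deadline = System.nanoTime() + TimeUnit.SECONDS.toNanos(2);
        String s = c.describe();
        while (!s.equals(wanted) && System.nanoTime() < deadline) {
            Thread.yield();
            s = c.describe();
        }
        return s;
    }

    private static void awaitQuietly(CountDownLatch latch) {
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

CheckerSketch.demo() walks through the three situations in order. A real watchdog would of course compare elapsed time against a timeout instead of sampling the state right away; the point here is only that one posted probe distinguishes the two failure modes.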

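II. Case Analysis

When analyzing a Watchdog problem, the first thing to establish is whether the traces are valid. As the analysis above showed, the Watchdog dumps traces once at 30 s and once at 1 minute. Suppose we see the following log.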

09-24 11:25:43.442 1000 1540 2033 W Watchdog: *** WATCHDOG KILLING SYSTEM PROCESS: Blocked in handler on ActivityManager (ActivityManager)
09-24 11:25:43.442 1000 1540 2033 W Watchdog: ActivityManager stack trace:
09-24 11:25:43.442 1000 1540 2033 W Watchdog: at android.os.MessageQueue.nativePollOnce(Native Method)
09-24 11:25:43.442 1000 1540 2033 W Watchdog: at android.os.MessageQueue.next(MessageQueue.java:325)
09-24 11:25:43.442 1000 1540 2033 W Watchdog: at android.os.Looper.loop(Looper.java:148)
09-24 11:25:43.442 1000 1540 2033 W Watchdog: at android.os.HandlerThread.run(HandlerThread.java:65)
09-24 11:25:43.442 1000 1540 2033 W Watchdog: at com.android.server.ServiceThread.run(ServiceThread.java:46)
09-24 11:25:43.442 1000 1540 2033 W Watchdog: *** GOODBYE!

Next, look at the trace of the ActivityManager thread.

"ActivityManager" prio=5 tid=12 Blocked
group="main" sCount=1 dsCount=0 flags=1 obj=0x13180c38 self=0x73bb923600
sysTid=1579 nice=-2 cgrp=default sched=0/0 handle=0x73adbcf4f0
state=S schedstat=( 3039883125048 14149853235996 6778200 ) utm=112965 stm=191023 core=6 HZ=100
stack=0x73adacd000-0x73adacf000 stackSize=1037KB
held mutexes=
at com.android.server.am.ActiveServices.serviceTimeout(ActiveServices.java:3486)
waiting to lock <0x0748826a> (a com.android.server.am.ActivityManagerService) held by thread 10
at com.android.server.am.ActivityManagerService$MainHandler.handleMessage(ActivityManagerService.java:2032)
at android.os.Handler.dispatchMessage(Handler.java:106)
at android.os.Looper.loop(Looper.java:173)
at android.os.HandlerThread.run(HandlerThread.java:65)
at com.android.server.ServiceThread.run(ServiceThread.java:46)

Since ActivityManager is blocked by thread 10, continue with the trace of thread 10.

"Binder:1540_1C" prio=5 tid=10 Native
group="main" sCount=1 dsCount=0 flags=1 obj=0x1318deb8 self=0x73c0817600
sysTid=8946 nice=-4 cgrp=default sched=0/0 handle=0x739db674f0
state=S schedstat=( 2025031009459 6852098325718 5020435 ) utm=136019 stm=66484 core=1 HZ=100
stack=0x739da6d000-0x739da6f000 stackSize=1005KB
held mutexes=
kernel: __switch_to+0x9c/0xd0
kernel: futex_wait_queue_me+0xc4/0x13c
kernel: futex_wait+0xe4/0x204
kernel: do_futex+0x170/0x500
kernel: SyS_futex+0x90/0x1b0
kernel: __sys_trace+0x4c/0x4c
native: #00 pc 000000000001db2c /system/lib64/libc.so (syscall+28)
native: #01 pc 00000000000e74c8 /system/lib64/libart.so (art::ConditionVariable::WaitHoldingLocks(art::Thread*)+152)
native: #02 pc 00000000005227a8 /system/lib64/libart.so (art::GoToRunnable(art::Thread*)+440)
native: #03 pc 00000000005225a8 /system/lib64/libart.so (art::JniMethodEnd(unsigned int, art::Thread*)+28)
native: #04 pc 0000000000cb8fc0 /system/framework/arm64/boot-framework.oat (Java_android_os_Process_setThreadPriority__II+176)
at android.os.Process.setThreadPriority(Native method)
at com.android.server.ThreadPriorityBooster.boost(ThreadPriorityBooster.java:49)
at com.android.server.wm.WindowManagerThreadPriorityBooster.boost(WindowManagerThreadPriorityBooster.java:58)
at com.android.server.wm.WindowManagerService.boostPriorityForLockedSection(WindowManagerService.java:930)
at com.android.server.wm.WindowManagerService.containsDismissKeyguardWindow(WindowManagerService.java:3116)
locked <0x0b54880e> (a com.android.server.wm.WindowHashMap)
at com.android.server.am.ActivityRecord.hasDismissKeyguardWindows(ActivityRecord.java:1364)
at com.android.server.am.ActivityStack.checkKeyguardVisibility(ActivityStack.java:2070)
at com.android.server.am.ActivityStack.ensureActivitiesVisibleLocked(ActivityStack.java:1924)
at com.android.server.am.ActivityStackSupervisor.ensureActivitiesVisibleLocked(ActivityStackSupervisor.java:3626)
at com.android.server.am.ActivityStackSupervisor.attachApplicationLocked(ActivityStackSupervisor.java:1043)
at com.android.server.am.ActivityManagerService.attachApplicationLocked(ActivityManagerService.java:7471)
at com.android.server.am.ActivityManagerService.attachApplication(ActivityManagerService.java:7538)
locked <0x0748826a> (a com.android.server.am.ActivityManagerService)
at android.app.IActivityManager$Stub.onTransact(IActivityManager.java:292)
at com.android.server.am.ActivityManagerService.onTransact(ActivityManagerService.java:3026)
at android.os.Binder.execTransact(Binder.java:704)

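Could it be that setThreadPriority timed out? Without the 1-minute trace we cannot conclude that this is where things are stuck. When traces are dumped, the runtime increments suspend_count_ for every thread in the Suspended state and adds the thread to the suspended_count_modified_threads list. The threads in that list are then dumped together, and suspend_count_ is decremented for each thread once its dump is complete. A Suspended thread that wants to return from JNI to Java code (the Runnable state) checks suspend_count_ in GoToRunnable, and if it is non-zero the thread waits there until it drops to 0. So all this trace tells us is that tid=10 was executing the setThreadPriority native method at the moment of the dump. To determine whether it is really stuck there, the two sets of traces have to be compared.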
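Comparing the two dumps by eye is error-prone when the file is large, so the comparison can be scripted. The following is a minimal sketch under a few assumptions: each thread section starts with a line beginning with the quoted thread name, and frames are the lines starting with "at ", "native:" or "kernel:", as in the traces shown here. TraceCompare and its method names are invented for illustration. Identical stacks in both dumps are a strong hint that the thread is stuck, not a proof.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: compares one thread's stack across two trace dumps. */
final class TraceCompare {

    /** Collects the frame lines that belong to the named thread. */
    static List<String> framesOf(String dump, String threadName) {
        List<String> frames = new ArrayList<>();
        boolean inThread = false;
        for (String raw : dump.split("\n")) {
            String line = raw.trim();
            if (line.startsWith("\"")) {
                // A new thread header; only collect if it is the one we want.
                inThread = line.startsWith("\"" + threadName + "\"");
                continue;
            }
            if (inThread && (line.startsWith("at ")
                    || line.startsWith("native:")
                    || line.startsWith("kernel:"))) {
                frames.add(line);
            }
        }
        return frames;
    }

    /** True if the thread exists in the first dump and has the same frames in both. */
    static boolean likelyStuck(String firstDump, String secondDump, String threadName) {
        List<String> first = framesOf(firstDump, threadName);
        return !first.isEmpty() && first.equals(framesOf(secondDump, threadName));
    }
}
```

For the case above one would call likelyStuck(halfTimeDump, fullTimeDump, "Binder:1540_1C"), where the two strings hold the 30 s and 1 minute dumps. A false result means the thread was making progress and setThreadPriority was just where the dump happened to catch it.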

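2.1 Case 1

On some phones, a Watchdog that fires during Monkey testing does not lead to a reboot; the symptom may be a frozen screen. Looking at traces_SystemServer_WDT05_1月_23_50_59.974.txt, every thread turns out to be blocked by thread 73, and the two traces are identical.

[Figure: WDT.png]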
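Spotting that "everything is blocked by thread 73" can also be automated. A small sketch, assuming the ART dump format shown earlier in which a blocked thread reports "waiting to lock ... held by thread N"; the class LockHolderTally is hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: counts how many waiters each lock-holding thread has. */
final class LockHolderTally {
    private static final Pattern HELD_BY = Pattern.compile("held by thread (\\d+)");

    /** Maps holder tid to the number of "held by thread tid" occurrences. */
    static Map<Integer, Integer> tally(String dump) {
        Map<Integer, Integer> counts = new HashMap<>();
        Matcher m = HELD_BY.matcher(dump);
        while (m.find()) {
            counts.merge(Integer.parseInt(m.group(1)), 1, Integer::sum);
        }
        return counts;
    }

    /** Returns the tid that blocks the most threads, or -1 if nothing is blocked. */
    static int topHolder(String dump) {
        int best = -1;
        int bestCount = 0;
        for (Map.Entry<Integer, Integer> e : tally(dump).entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }
}
```

The thread returned by topHolder() is a good place to start reading, although it may itself be waiting on something else, for example a binder call into another process as in this case.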

Let's see what thread 73 is doing.

"Binder:1300_3" prio=5 tid=73 Native
| group="main" sCount=1 dsCount=0 flags=1 obj=0x14f89110 self=0x7ee794d600
| sysTid=1774 nice=-10 cgrp=default sched=0/0 handle=0x7ecbc474f0
| state=S schedstat=( 59882636556 104471794509 273786 ) utm=3455 stm=2533 core=6 HZ=100
| stack=0x7ecbb4d000-0x7ecbb4f000 stackSize=1005KB
| held mutexes=
kernel: __switch_to+0x94/0xa8
kernel: binder_thread_read+0x460/0x10a0
kernel: binder_ioctl_write_read+0x21c/0x360
kernel: binder_ioctl+0x50c/0x798
kernel: do_vfs_ioctl+0xb8/0x800
kernel: SyS_ioctl+0x84/0x98
kernel: el0_svc_naked+0x24/0x28
native: #00 pc 00000000000690a4  /system/lib64/libc.so (__ioctl+4)
native: #01 pc 0000000000024638  /system/lib64/libc.so (ioctl+132)
native: #02 pc 0000000000061a10  /system/lib64/libbinder.so (_ZN7android14IPCThreadState14talkWithDriverEb+256)
native: #03 pc 00000000000627a8  /system/lib64/libbinder.so (_ZN7android14IPCThreadState15waitForResponseEPNS_6ParcelEPi+340)
native: #04 pc 00000000000624c8  /system/lib64/libbinder.so (_ZN7android14IPCThreadState8transactEijRKNS_6ParcelEPS1_j+216)
native: #05 pc 0000000000056d98  /system/lib64/libbinder.so (_ZN7android8BpBinder8transactEjRKNS_6ParcelEPS1_j+72)
native: #06 pc 000000000008a86c  /system/lib64/libgui.so (???)
native: #07 pc 000000000009ec88  /system/lib64/libgui.so (_ZN7android16ScreenshotClient6updateERKNS_2spINS_7IBinderEEENS_4RectEjjiibj+260)
native: #08 pc 00000000000fb058  /system/lib64/libandroid_runtime.so (???)
native: #09 pc 0000000001326998  /system/framework/arm64/boot-framework.oat (Java_android_view_SurfaceControl_nativeScreenshot__Landroid_os_IBinder_2Landroid_graphics_Rect_2IIIIZZI+264)
at android.view.SurfaceControl.nativeScreenshot(Native method)
at android.view.SurfaceControl.screenshot(SurfaceControl.java:877)
at com.android.server.wm.DisplayContent.-com_android_server_wm_DisplayContent-mthref-0(DisplayContent.java:2863)
at com.android.server.wm.-$Lambda$OzPvdnGprtQoLZLCvw2GU8IaGyI.$m$0(unavailable:-1)
at com.android.server.wm.-$Lambda$OzPvdnGprtQoLZLCvw2GU8IaGyI.screenshot(unavailable:-1)
at com.android.server.wm.DisplayContent.screenshotApplications(DisplayContent.java:3125)
- locked <0x036ec628> (a com.android.server.wm.WindowHashMap)
at com.android.server.wm.DisplayContent.screenshotApplications(DisplayContent.java:2862)
at com.android.server.wm.AppWindowContainerController.screenshotApplications(AppWindowContainerController.java:749)
at com.android.server.am.ActivityRecord.screenshotActivityLocked(ActivityRecord.java:1650)
at com.android.server.am.ActivityRecord.setVisible(ActivityRecord.java:1675)
at com.android.server.am.ActivityStack.makeInvisible(ActivityStack.java:2078)
at com.android.server.am.ActivityStack.ensureActivitiesVisibleLocked(ActivityStack.java:1896)
at com.android.server.am.ActivityStackSupervisor.ensureActivitiesVisibleLocked(ActivityStackSupervisor.java:3575)
at com.android.server.am.ActivityManagerService.ensureConfigAndVisibilityAfterUpdate(ActivityManagerService.java:20965)
at com.android.server.am.ActivityManagerService.updateDisplayOverrideConfigurationLocked(ActivityManagerService.java:20897)
at com.android.server.am.ActivityManagerService.updateDisplayOverrideConfigurationLocked(ActivityManagerService.java:20867)
at com.android.server.am.ActivityStack.resumeTopActivityInnerLocked(ActivityStack.java:2608)
at com.android.server.am.ActivityStack.resumeTopActivityUncheckedLocked(ActivityStack.java:2246)
at com.android.server.am.ActivityStackSupervisor.resumeFocusedStackTopActivityLocked(ActivityStackSupervisor.java:2148)
at com.android.server.am.ActivityStack.completePauseLocked(ActivityStack.java:1480)
at com.android.server.am.ActivityStack.activityPausedLocked(ActivityStack.java:1406)
at com.android.server.am.ActivityManagerService.activityPaused(ActivityManagerService.java:7542)
- locked <0x08abeada> (a com.android.server.am.ActivityManagerService)
at android.app.IActivityManager$Stub.onTransact(IActivityManager.java:317)
at com.android.server.am.ActivityManagerService.onTransact(ActivityManagerService.java:3018)
at android.os.Binder.execTransact(Binder.java:677)

The last frames inside libgui.so are these two:

 native: #06 pc 000000000008a86c  /system/lib64/libgui.so (???)
 native: #07 pc 000000000009ec88  /system/lib64/libgui.so (_ZN7android16ScreenshotClient6updateERKNS_2spINS_7IBinderEEENS_4RectEjjiibj+260)
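Native frames like these are usually resolved with addr2line against the unstripped library. Typing the pc and library path by hand gets tedious, so here is a small sketch that turns a frame line into the command string. It assumes the "#NN pc ADDRESS  /path/to/lib.so" layout seen in these traces; the class Addr2LineCommand and the symbolsRoot parameter (the directory holding the unstripped libraries) are made up for illustration. The flags -C (demangle), -f (show function names) and -e (input file) are standard addr2line options.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: builds an addr2line command line from a native frame. */
final class Addr2LineCommand {
    // Captures the pc value and the absolute library path of a native frame.
    private static final Pattern FRAME =
            Pattern.compile("#\\d+ pc ([0-9a-fA-F]+)\\s+(/\\S+)");

    /**
     * @param frame       one line of a native backtrace
     * @param symbolsRoot directory prefix for unstripped libraries, e.g. "."
     * @return the command string, or empty if the line is not a native frame
     */
    static Optional<String> forFrame(String frame, String symbolsRoot) {
        Matcher m = FRAME.matcher(frame);
        if (!m.find()) {
            return Optional.empty();
        }
        return Optional.of("addr2line -Cfe " + symbolsRoot + m.group(2) + " " + m.group(1));
    }
}
```

The sketch only builds the string; running it (for example through ProcessBuilder) and choosing the right symbols directory for the build under test are left to the reader.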

Running addr2line -Cfe ./system/lib64/libgui.so 000000000009ec88 gives:

_ZN7android16ScreenshotClient6updateERKNS_2spINS_7IBinderEEENS_4RectEjjiibj
frameworks/native/libs/gui/SurfaceComposerClient.cpp:1018 (discriminator 1)
1003 status_t ScreenshotClient::captureToBuffer(const sp<IBinder>& display,
1004        Rect sourceCrop, uint32_t reqWidth, uint32_t reqHeight,
1005        int32_t minLayerZ, int32_t maxLayerZ, bool useIdentityTransform,
1006        uint32_t rotation,
1007        sp<GraphicBuffer>* outBuffer) {
1008    sp<ISurfaceComposer> s(ComposerService::getComposerService());
1009    if (s == NULL) return NO_INIT;
1010
1011    sp<IGraphicBufferConsumer> gbpConsumer;
1012    sp<IGraphicBufferProducer> producer;
1013    BufferQueue::createBufferQueue(&producer, &gbpConsumer);
1014    sp<BufferItemConsumer> consumer(new BufferItemConsumer(gbpConsumer,
1015           GRALLOC_USAGE_HW_TEXTURE | GRALLOC_USAGE_SW_READ_NEVER | GRALLOC_USAGE_SW_WRITE_NEVER,
1016           1, true));
1017
1018    status_t ret = s->captureScreen(display, producer, sourceCrop, reqWidth, reqHeight,
1019            minLayerZ, maxLayerZ, useIdentityTransform,
1020            static_cast<ISurfaceComposer::Rotation>(rotation));
1021    if (ret != NO_ERROR) {
1022        return ret;
1023    }
1024    BufferItem b;
1025    consumer->acquireBuffer(&b, 0, true);
1026    *outBuffer = b.mGraphicBuffer;
1027    return ret;
1028}

1018行captureScreen函数是在做截屏,看来是截屏时候发生了Watchdog,根据captureScreen,那么对应的surfaceflinger的trace如下:

"Binder:820_3" sysTid=1331#00 pc 0000000000068fb8  /system/lib64/libc.so (__epoll_pwait+8)#01 pc 000000000001fc68  /system/lib64/libc.so (epoll_pwait+48)#02 pc 0000000000015c84  /system/lib64/libutils.so (_ZN7android6Looper9pollInnerEi+144)#03 pc 0000000000015b6c  /system/lib64/libutils.so (_ZN7android6Looper8pollOnceEiPiS1_PPv+108)#04 pc 00000000000b921c  /system/lib64/libsurfaceflinger.so (_ZN7android14SurfaceFlinger13captureScreenERKNS_2spINS_7IBinderEEERKNS1_INS_22IGraphicBufferProducerEEENS_4RectEjjiibNS_16ISurfaceComposer8RotationE+672)#05 pc 0000000000088660  /system/lib64/libgui.so (_ZN7android17BnSurfaceComposer10onTransactEjRKNS_6ParcelEPS1_j+1788)#06 pc 00000000000b8828  /system/lib64/libsurfaceflinger.so (_ZN7android14SurfaceFlinger10onTransactEjRKNS_6ParcelEPS1_j+144)#07 pc 00000000000559ac  /system/lib64/libbinder.so (_ZN7android7BBinder8transactEjRKNS_6ParcelEPS1_j+136)#08 pc 0000000000061ecc  /system/lib64/libbinder.so (_ZN7android14IPCThreadState14executeCommandEi+536)#09 pc 0000000000061c04  /system/lib64/libbinder.so (_ZN7android14IPCThreadState20getAndExecuteCommandEv+156)#10 pc 0000000000062250  /system/lib64/libbinder.so (_ZN7android14IPCThreadState14joinThreadPoolEb+60)#11 pc 0000000000082bcc  /system/lib64/libbinder.so (_ZN7android10PoolThread10threadLoopEv+24)#12 pc 0000000000011674  /system/lib64/libutils.so (_ZN7android6Thread11_threadLoopEPv+280)#13 pc 0000000000066970  /system/lib64/libc.so (_ZL15__pthread_startPv+36)#14 pc 000000000001f474  /system/lib64/libc.so (__start_thread+68)

surfaceflinger的主线程trace如下

----- pid 820 at 2018-01-05 23:49:16 -----
Cmd line: /system/bin/surfaceflinger
ABI: 'arm64'"surfaceflinger" sysTid=820#00 pc 00000000000690a4  /system/lib64/libc.so (__ioctl+4)#01 pc 0000000000024638  /system/lib64/libc.so (ioctl+132)#02 pc 0000000000015210  /system/lib64/libhwbinder.so (_ZN7android8hardware14IPCThreadState14talkWithDriverEb+256)#03 pc 0000000000015f58  /system/lib64/libhwbinder.so (_ZN7android8hardware14IPCThreadState15waitForResponseEPNS0_6ParcelEPi+60)#04 pc 0000000000015d84  /system/lib64/libhwbinder.so (_ZN7android8hardware14IPCThreadState8transactEijRKNS0_6ParcelEPS2_j+216)#05 pc 00000000000128d4  /system/lib64/libhwbinder.so (_ZN7android8hardware10BpHwBinder8transactEjRKNS0_6ParcelEPS2_jNSt3__18functionIFvRS2_EEE+72)#06 pc 0000000000038bc4  /system/lib64/android.hardware.graphics.composer@2.1.so (_ZN7android8hardware8graphics8composer4V2_118BpHwComposerClient11createLayerEmjNSt3__18functionIFvNS3_5ErrorEmEEE+240)#07 pc 0000000000091cc0  /system/lib64/libsurfaceflinger.so (_ZN7android4Hwc28Composer11createLayerEmPm+100)#08 pc 000000000009a930  /system/lib64/libsurfaceflinger.so (_ZN4HWC27Display11createLayerEPNSt3__110shared_ptrINS_5LayerEEE+72)#09 pc 00000000000c5304  /system/lib64/libsurfaceflinger.so (_ZN7android10HWComposer11createLayerEi+152)#10 pc 00000000000b1524  /system/lib64/libsurfaceflinger.so (_ZN7android14SurfaceFlinger15setUpHWComposerEv+1560)#11 pc 00000000000b096c  /system/lib64/libsurfaceflinger.so (_ZN7android14SurfaceFlinger20handleMessageRefreshEv+108)#12 pc 00000000000aa660  /system/lib64/libsurfaceflinger.so (_ZN7android16ExSurfaceFlinger20handleMessageRefreshEv+16)#13 pc 00000000000b03e4  /system/lib64/libsurfaceflinger.so (_ZN7android14SurfaceFlinger17onMessageReceivedEi+260)#14 pc 0000000000015d40  /system/lib64/libutils.so (_ZN7android6Looper9pollInnerEi+332)#15 pc 0000000000015b6c  /system/lib64/libutils.so (_ZN7android6Looper8pollOnceEiPiS1_PPv+108)#16 pc 000000000008b944  /system/lib64/libsurfaceflinger.so (_ZN7android12MessageQueue11waitMessageEv+84)#17 pc 00000000000af338  
/system/lib64/libsurfaceflinger.so (_ZN7android14SurfaceFlinger3runEv+20)#18 pc 0000000000002cfc  /system/bin/surfaceflinger (main+948)#19 pc 000000000001b8b0  /system/lib64/libc.so (__libc_init+88)#20 pc 00000000000028a8  /system/bin/surfaceflinger (do_arm64_start+80)

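The main thread of surfaceflinger is inside createLayer, and from there it has made a binder call (hwbinder, judging by the libhwbinder.so frames) out of the surfaceflinger process into /vendor/bin/hw/android.hardware.graphics.composer@2.1-service. So next we look at the trace of the corresponding thread in graphics.composer.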

"HwBinder:738_1" sysTid=1273
  #00 pc 000000000001dc2c  /system/lib64/libc.so (syscall+28)
  #01 pc 0000000000066014  /system/lib64/libc.so (pthread_cond_wait+96)
  #02 pc 000000000001fc3c  /vendor/lib64/hw/hwcomposer.sdm845.so (_ZN3sdm10HWCSession11CreateLayerEP11hwc2_devicemPm+120)
  #03 pc 00000000000140e0  /vendor/lib64/hw/android.hardware.graphics.composer@2.1-impl.so (_ZN7android8hardware8graphics8composer4V2_114implementation14ComposerClient11createLayerEmjNSt3__18functionIFvNS3_5ErrorEmEEE+84)
  #04 pc 0000000000044840  /system/lib64/android.hardware.graphics.composer@2.1.so (_ZN7android8hardware8graphics8composer4V2_116BsComposerClient11createLayerEmjNSt3__18functionIFvNS3_5ErrorEmEEE+160)
  #05 pc 000000000003fd10  /system/lib64/android.hardware.graphics.composer@2.1.so (_ZN7android8hardware8graphics8composer4V2_118BnHwComposerClient10onTransactEjRKNS0_6ParcelEPS5_jNSt3__18functionIFvRS5_EEE+2224)
  #06 pc 0000000000011be0  /system/lib64/vndk-sp/libhwbinder.so (_ZN7android8hardware9BHwBinder8transactEjRKNS0_6ParcelEPS2_jNSt3__18functionIFvRS2_EEE+132)
  #07 pc 00000000000156fc  /system/lib64/vndk-sp/libhwbinder.so (_ZN7android8hardware14IPCThreadState14executeCommandEi+584)
  #08 pc 0000000000015404  /system/lib64/vndk-sp/libhwbinder.so (_ZN7android8hardware14IPCThreadState20getAndExecuteCommandEv+156)
  #09 pc 0000000000015b0c  /system/lib64/vndk-sp/libhwbinder.so (_ZN7android8hardware14IPCThreadState14joinThreadPoolEb+60)
  #10 pc 000000000001f5c8  /system/lib64/vndk-sp/libhwbinder.so (_ZN7android8hardware10PoolThread10threadLoopEv+24)
  #11 pc 0000000000011674  /system/lib64/vndk-sp/libutils.so (_ZN7android6Thread11_threadLoopEPv+280)
  #12 pc 0000000000066970  /system/lib64/libc.so (_ZL15__pthread_startPv+36)
  #13 pc 000000000001f474  /system/lib64/libc.so (__start_thread+68)

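We have finally found where the call is ultimately blocked: the HwBinder thread is parked in pthread_cond_wait, and the first two frames outside libc are: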

#02 pc 000000000001fc3c  /vendor/lib64/hw/hwcomposer.sdm845.so (_ZN3sdm10HWCSession11CreateLayerEP11hwc2_devicemPm+120)
#03 pc 00000000000140e0  /vendor/lib64/hw/android.hardware.graphics.composer@2.1-impl.so (_ZN7android8hardware8graphics8composer4V2_114implementation14ComposerClient11createLayerEmjNSt3__18functionIFvNS3_5ErrorEmEEE+84)
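
Picking these frames by hand gets tedious on long traces. Below is a minimal sketch, not part of the original analysis, that skips the leading libc.so frames of a native backtrace and prints a matching addr2line invocation for the next frames. The helper names (parse_frames, first_non_libc_frames, addr2line_command) are hypothetical, and the sketch assumes the unstripped copy of each library sits in the current directory under the same file name.

```python
import re

# One native frame, e.g. "#02 pc 000000000001fc3c  /path/lib.so (symbol+120)".
FRAME_RE = re.compile(
    r"#(?P<index>\d+)\s+pc\s+(?P<pc>[0-9a-fA-F]+)\s+(?P<lib>\S+)\s+"
    r"\((?P<symbol>[^()+]+)\+(?P<offset>\d+)\)"
)

# Top frames of the HwBinder:738_1 thread, copied from the trace above.
TRACE = (
    '"HwBinder:738_1" sysTid=1273'
    "#00 pc 000000000001dc2c  /system/lib64/libc.so (syscall+28)"
    "#01 pc 0000000000066014  /system/lib64/libc.so (pthread_cond_wait+96)"
    "#02 pc 000000000001fc3c  /vendor/lib64/hw/hwcomposer.sdm845.so "
    "(_ZN3sdm10HWCSession11CreateLayerEP11hwc2_devicemPm+120)"
    "#03 pc 00000000000140e0  /vendor/lib64/hw/"
    "android.hardware.graphics.composer@2.1-impl.so "
    "(_ZN7android8hardware8graphics8composer4V2_114implementation"
    "14ComposerClient11createLayerEmjNSt3__18functionIFvNS3_5ErrorEmEEE+84)"
)


def parse_frames(trace):
    """Return one dict per frame found in the trace text (flattened or not)."""
    return [m.groupdict() for m in FRAME_RE.finditer(trace)]


def first_non_libc_frames(frames, count=2):
    """Skip the leading libc.so frames and return the next `count` frames."""
    for i, frame in enumerate(frames):
        if not frame["lib"].endswith("/libc.so"):
            return frames[i:i + count]
    return []


def addr2line_command(frame):
    """Build the command line; assumes the symbol file is in the current directory."""
    lib_name = frame["lib"].rsplit("/", 1)[-1]
    pc = frame["pc"].lstrip("0") or "0"
    return f"addr2line -f -e {lib_name} {pc}"


if __name__ == "__main__":
    for frame in first_non_libc_frames(parse_frames(TRACE)):
        print(addr2line_command(frame))
```

The regular expression tolerates frames that run together on one line, which is how traces often look after being pasted into a bug report.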

Use addr2line again:
addr2line -f -e hwcomposer.sdm845.so 1fc3c
_ZN3sdm6Locker4WaitEv
hardware/qcom/display/include/../sdm/include/utils/locker.h:141

addr2line -f -e android.hardware.graphics.composer@2.1-impl.so 140e0
_ZN7android8hardware8graphics8composer4V2_114implementation14ComposerClient11createLayerEmjNSt3__18functionIFvNS3_5ErrorEmEEE
hardware/interfaces/graphics/composer/2.1/default/ComposerClient.cpp:299
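
The function names printed above are still mangled. In practice `addr2line -C` or `c++filt` will demangle them. To show how such names are encoded, here is a rough sketch that decodes only the qualified name from a symbol of the form `_ZN<len><id>...E`. It is an illustration rather than a full demangler: parameter types, templates and substitutions are ignored, and the function name nested_name is hypothetical.

```python
def nested_name(mangled):
    """Decode the qualified name of an Itanium-ABI symbol like _ZN3sdm6Locker4WaitEv.

    Only plain length-prefixed identifiers are handled. Returns None for
    symbols that do not start with _ZN or are not closed by 'E'.
    """
    if not mangled.startswith("_ZN"):
        return None
    pos = 3
    parts = []
    while pos < len(mangled) and mangled[pos].isdigit():
        # Read the decimal length prefix, then that many characters of name.
        end = pos
        while end < len(mangled) and mangled[end].isdigit():
            end += 1
        length = int(mangled[pos:end])
        parts.append(mangled[end:end + length])
        pos = end + length
    # A well-formed nested name is terminated by 'E'; the rest encodes parameters.
    if not parts or pos >= len(mangled) or mangled[pos] != "E":
        return None
    return "::".join(parts)


if __name__ == "__main__":
    print(nested_name("_ZN3sdm6Locker4WaitEv"))
    print(nested_name("_ZN3sdm10HWCSession11CreateLayerEP11hwc2_devicemPm"))
```

Read this way, the first addr2line result is sdm::Locker::Wait, which is consistent with the pthread_cond_wait frame sitting directly above HWCSession::CreateLayer in the trace.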

hardware/interfaces/graphics/composer/2.1/default/ComposerClient.cpp#299
295  Return<void> ComposerClient::createLayer(Display display,
296          uint32_t bufferSlotCount, createLayer_cb hidl_cb)
297  {
298      Layer layer = 0;
299      Error err = mHal.createLayer(display, &layer);  // frame #03: this call into the HAL never returns
300      if (err == Error::NONE) {
301          std::lock_guard<std::mutex> lock(mDisplayDataMutex);
302
303          auto dpy = mDisplayData.find(display);
304          if (dpy != mDisplayData.end()) {
305              auto ly = dpy->second.Layers.emplace(layer, LayerBuffers()).first;
306              ly->second.Buffers.resize(bufferSlotCount);
307          } else {
308              layer = 0;
309              err = Error::BAD_DISPLAY;
310          }
311      }
312
313      hidl_cb(err, layer);
314      return Void();
315  }

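It looks like createLayer is where things went wrong, so the issue was handed over to the colleagues who own the low-level display module for further analysis. A few Watchdog questions are still worth thinking about. For example, what has changed in Watchdog across Android versions, and what happens if the Watchdog thread itself is blocked? Watchdog problems are varied and complex, and every module's business logic is different; for reasons of space, these are left for readers to investigate on their own.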

Reference: https://duanqz.github.io/2015-10-12-Watchdog-Analysis
