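This bug came from a strange symptom in one of our projects: the code had no obvious locking problem, yet at runtime it behaved as if it were deadlocked. I simplified the business logic to this: a parent process keeps one child process alive at all times. (If you repost this article, please credit breaksoftware's CSDN blog.)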

First we define a struct, ProcessGuard, which holds the child process ID and the mutex that protects it. This lets multiple threads operate on the struct safely.

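#include <stdio.h>
#include <unistd.h>
#include <string.h>
#include <stdlib.h>
#include <signal.h>
#include <pthread.h>

struct ProcessGuard {
    pthread_mutex_t pids_mutex;  /* protects pid */
    pid_t pid;                   /* child PID; 0 means there is no child */
};

/* Global instance used by the functions below (declaration missing in the original listing). */
struct ProcessGuard *g_guard;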

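The main thread of the parent process starts a thread that keeps checking whether the pid in ProcessGuard is 0 (that is, no child process exists). If there is no child, it creates one and records its process ID in pid.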

void chile_process() {
    while (1) {
        printf("This is the child process. My PID is %d.My thread_id is %lu.\n", getpid(), pthread_self());
        sleep(1);
    }
}

void create_process_routine() {
    printf("This is the child thread of parent process. My PID is %d.My thread_id is %lu.\n", getpid(), pthread_self());
    while (1) {
        int child = 0;
        if (child == 0) {
            pthread_mutex_lock(&g_guard->pids_mutex);
        }
        if (g_guard->pid != 0) {
            /* Note: this path skips the unlock at the bottom of the loop,
               so the next iteration locks a mutex this thread already holds. */
            continue;
        }
        pid_t pid = fork();
        sleep(1);
        printf("Create child process %d.\n", pid);
        if (pid < 0) {
            perror("fork failed");
        }
        else if (pid == 0) {
            chile_process();
            child = 1;
            break;
        }
        else {
            // parent process
            g_guard->pid = pid;
            printf("dispatch task to process. pid is %d.\n", pid);
        }
        if (child == 0) {
            pthread_mutex_unlock(&g_guard->pids_mutex);
        }
        else {
            break;
        }
    }
}

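In the parent's main thread we register a signal handler. If the child process is killed, the handler sets pid in ProcessGuard to 0, so that the parent's monitor thread will start a new process.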

void sighandler(int signum) {
    printf("This is the parent process.Catch signal %d.My PID is %d.My thread_id is %lu.\n", signum, getpid(), pthread_self());
    pthread_mutex_lock(&g_guard->pids_mutex);
    g_guard->pid = 0;
    pthread_mutex_unlock(&g_guard->pids_mutex);
}

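Finally, the parent process: after initializing a few structures, it registers the signal handler and starts the thread that creates the child process.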

int main(void) {
    pthread_t creat_process_tid;

    g_guard = malloc(sizeof(struct ProcessGuard));
    if (g_guard == NULL) {
        perror("malloc failed");
        exit(1);
    }
    if (pthread_mutex_init(&g_guard->pids_mutex, NULL) != 0) {
        perror("init pids_mutex error.");
        exit(1);
    }
    g_guard->pid = 0;

    printf("This is the Main thread of parent process.PID is %d.My thread_id is %lu.\n", getpid(), pthread_self());

    signal(SIGCHLD, sighandler);
    pthread_create(&creat_process_tid, NULL, (void*)create_process_routine, NULL);

    while (1) {
        printf("Get task from network.\n");
        sleep(1);
    }

    pthread_mutex_destroy(&g_guard->pids_mutex);
    return 0;
}

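In the code above, the lock is used only in the thread function create_process_routine and in the signal handler sighandler. Neither one calls the other anywhere in the code, so a deadlock should be impossible. In practice, that is not what happens.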

We run the program and kill the child process, and find that the parent does not start a new child.

$ ./test
This is the Main thread of parent process.PID is 17641.My thread_id is 140014057678656.
Get task from network.
This is the child thread of parent process. My PID is 17641.My thread_id is 140014049122048.
Create child process 17643.
dispatch task to process. pid is 17643.
Create child process 0.
This is the child process. My PID is 17643.My thread_id is 140014049122048.
Get task from network.
This is the child process. My PID is 17643.My thread_id is 140014049122048.
This is the child process. My PID is 17643.My thread_id is 140014049122048.
Get task from network.
This is the child process. My PID is 17643.My thread_id is 140014049122048.
This is the child process. My PID is 17643.My thread_id is 140014049122048.
This is the child process. My PID is 17643.My thread_id is 140014049122048.
Get task from network.
This is the child process. My PID is 17643.My thread_id is 140014049122048.
Get task from network.
This is the child process. My PID is 17643.My thread_id is 140014049122048.
This is the child process. My PID is 17643.My thread_id is 140014049122048.
Get task from network.
This is the child process. My PID is 17643.My thread_id is 140014049122048.
This is the child process. My PID is 17643.My thread_id is 140014049122048.
This is the child process. My PID is 17643.My thread_id is 140014049122048.
Get task from network.
This is the child process. My PID is 17643.My thread_id is 140014049122048.
This is the child process. My PID is 17643.My thread_id is 140014049122048.
Get task from network.
This is the parent process.Catch signal 17.My PID is 17641.My thread_id is 140014049122048.
Get task from network.
Get task from network.
Get task from network.
Get task from network.
Get task from network.

This does not match the design, and it does not seem logical either. So we attach gdb to the parent process.

Attaching to process 17641
[New LWP 17642]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007f578fb7a9d0 in __GI___nanosleep (requested_time=requested_time@entry=0x7fffd2b41190, remaining=remaining@entry=0x7fffd2b41190) at ../sysdeps/unix/sysv/linux/nanosleep.c:28
28      ../sysdeps/unix/sysv/linux/nanosleep.c: No such file or directory.
(gdb) info threads
  Id   Target Id         Frame
* 1    Thread 0x7f57902be740 (LWP 17641) "test" 0x00007f578fb7a9d0 in __GI___nanosleep (requested_time=requested_time@entry=0x7fffd2b41190, remaining=remaining@entry=0x7fffd2b41190) at ../sysdeps/unix/sysv/linux/nanosleep.c:28
  2    Thread 0x7f578fa95700 (LWP 17642) "test" __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
(gdb) t 2
[Switching to thread 2 (Thread 0x7f578fa95700 (LWP 17642))]
#0  __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
135     ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S: No such file or directory.
(gdb) bt
#0  __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1  0x00007f578fe91023 in __GI___pthread_mutex_lock (mutex=0x55c51383e260) at ../nptl/pthread_mutex_lock.c:78
#2  0x000055c512c29a9d in sighandler ()
#3  <signal handler called>
#4  __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:133
#5  0x00007f578fe91023 in __GI___pthread_mutex_lock (mutex=0x55c51383e260) at ../nptl/pthread_mutex_lock.c:78
#6  0x000055c512c29b42 in create_process_routine ()
#7  0x00007f578fe8e6db in start_thread (arg=0x7f578fa95700) at pthread_create.c:463
#8  0x00007f578fbb788f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

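Looking at thread 2's call stack, we find that frame 5 and frame 1 are locking the same mutex (0x55c51383e260). Our thread code is written to lock and unlock in pairs, so where does the second lock come from?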

The lock in frame 1 comes from sighandler, the function in frame 2, that is, the code below:

void sighandler(int signum) {
    printf("This is the parent process.Catch signal %d.My PID is %d.My thread_id is %lu.\n", signum, getpid(), pthread_self());
    pthread_mutex_lock(&g_guard->pids_mutex);
    g_guard->pid = 0;
    pthread_mutex_unlock(&g_guard->pids_mutex);
}

So here is the question: create_process_routine never calls sighandler, so where does this call come from?

In the Linux documentation at http://man7.org/linux/man-pages/man7/signal.7.html, we find this passage about signals:

A process-directed signal may be delivered to any
one of the threads that does not currently have the signal blocked.
If more than one of the threads has the signal unblocked, then the
kernel chooses an arbitrary thread to which to deliver the signal.

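This says that a process-directed signal is delivered to any one of the threads that does not currently have that signal blocked; which one is up to the kernel. That means our sighandler may run in the main thread, or it may run in the child thread. Because a handler runs on the stack of the thread it interrupts, that thread cannot continue (or release anything it holds) until the handler returns. That is how the deadlock above came about.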

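So how do we fix it? The official approach is to use sigprocmask so that threads with a potential deadlock relationship do not receive these signals. But this approach has a weakness in complex systems: our projects usually pull in various open-source or third-party libraries, and we cannot control the threads they start. So my advice is: in signal handlers, use lock-free structures wherever possible, and design intermediate data that isolates the complex business code from the signal handler.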
