High iowait

I/O problems have always been hard to pin down. Today our production environment ran into a CPU load problem caused by I/O.

Linux has many troubleshooting tools; some are easy to use, others are more advanced.

I/O wait is an issue that requires some of the more advanced tools, as well as advanced usage of some of the basic ones. It is difficult to troubleshoot because plenty of tools will tell you that the system is I/O bound, but far fewer can narrow the problem down to a specific process or set of processes.

Answering whether or not I/O is causing system slowness

To check whether I/O is causing system slowness you can use several commands, but the easiest is the Unix command top.

top

top - 14:31:20 up 35 min, 4 users, load average: 2.25, 1.74, 1.68
Tasks: 71 total, 1 running, 70 sleeping, 0 stopped, 0 zombie
Cpu(s): 2.3%us, 1.7%sy, 0.0%ni, 0.0%id, 96.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 245440k total, 241004k used, 4436k free, 496k buffers
Swap: 409596k total, 5436k used, 404160k free, 182812k cached

The Cpu(s) line shows the current percentage of CPU time spent in I/O wait (the wa field, 96.0% here). The higher the number, the more time the CPUs are sitting idle while waiting for I/O to complete.

wa – iowait
Amount of time the CPU has been waiting for I/O to complete.
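The same figure can be derived without top by sampling /proc/stat twice. The following is a minimal sketch, not what top itself runs: the function name iowait_pct is made up for illustration, and it assumes the usual field order of the aggregate cpu line (user, nice, system, idle, iowait, irq, softirq, steal).

```shell
# iowait_pct BEFORE AFTER   (hypothetical helper name)
# BEFORE and AFTER are two snapshots of the aggregate "cpu" line from
# /proc/stat. Prints the share of CPU time spent in iowait between the
# two snapshots, as a percentage with one decimal place.
iowait_pct() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        {
            total = 0
            # Fields 2..9: user nice system idle iowait irq softirq steal
            for (i = 2; i <= 9; i++) total += $i
            iow[NR] = $6
            sum[NR] = total
        }
        END {
            dt = sum[2] - sum[1]
            if (dt <= 0) { print "0.0"; exit }
            printf "%.1f\n", 100 * (iow[2] - iow[1]) / dt
        }'
}

# Usage on a live system:
#   a=$(head -n 1 /proc/stat); sleep 2; b=$(head -n 1 /proc/stat)
#   iowait_pct "$a" "$b"
```

Taking the difference of two snapshots matters because the counters in /proc/stat accumulate from boot; a single reading only gives the average since the machine started.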

Finding which disk is being written to

The above top command shows I/O wait for the system as a whole, but it does not tell you which disk is affected; for this we will use the iostat command.

$ iostat -x 2 5
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           3.66    0.00   47.64   48.69    0.00    0.00

Device:   rrqm/s  wrqm/s     r/s     w/s     rkB/s     wkB/s avgrq-sz avgqu-sz   await r_await  w_await  svctm  %util
sda        44.50   39.27  117.28   29.32  11220.94  13126.70   332.17    65.77  462.79    9.80  2274.71   7.60 111.41
dm-0        0.00    0.00   83.25    9.95  10515.18   4295.29   317.84    57.01  648.54   16.73  5935.79  11.48 107.02
dm-1        0.00    0.00   57.07   40.84    228.27    163.35     8.00    93.84  979.61   13.94  2329.08  10.93 107.02

The iostat command in the example prints a report every 2 seconds, 5 times in total; the -x option tells iostat to print an extended report.

The first report from iostat prints statistics accumulated since the system was last booted; for this reason, in most circumstances the first report should be ignored. Every subsequent report covers the time since the previous report. For example, our command prints 5 reports: the 2nd contains disk statistics gathered since the 1st, the 3rd since the 2nd, and so on.

In the above example the %util value for sda is 111.41%, which means the device is saturated and is a good indicator that our problem lies with processes writing to sda. The test system in my example has only one disk, but this type of information is extremely helpful when the server has multiple disks, as it narrows down the search for the process generating the I/O.

Aside from %util there is a wealth of information in the output of iostat: read and write requests merged per second (rrqm/s & wrqm/s), reads and writes per second (r/s & w/s), and plenty more. In our example the workload appears to be both read and write heavy; this information will be helpful when trying to identify the offending process.
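When a server has many disks, picking out the busy ones by eye gets tedious. The sketch below filters iostat -x output for devices whose %util is at or above a threshold. The function name busy_devices and its defaults (threshold 90, skip the first report) are assumptions made for illustration; the %util column is located by its header name because its position differs between sysstat versions.

```shell
# busy_devices [THRESHOLD] [SKIP]   (hypothetical helper name)
# Reads `iostat -x` output on stdin and prints "device %util" for each
# device row whose %util is at or above THRESHOLD (default 90).
# SKIP is the number of leading reports to ignore (default 1, because the
# first report holds averages since boot).
busy_devices() {
    awk -v limit="${1:-90}" -v skip="${2:-1}" '
        /^Device/ {
            reports++
            in_table = 1
            # Find the %util column by name in the header row.
            for (i = 1; i <= NF; i++) if ($i == "%util") col = i
            next
        }
        # A blank line ends the current device table.
        NF == 0 { in_table = 0; next }
        in_table && col && reports > skip + 0 && $col + 0 >= limit + 0 {
            print $1, $col
        }'
}

# Usage on a live system:
#   iostat -x 2 5 | busy_devices 90
```

Fed the report shown earlier with a threshold of 110 and SKIP set to 0, this prints only sda, since dm-0 and dm-1 sit at 107.02.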

Finding the processes that are causing high I/O

iotop

Total DISK READ: 8.00 M/s | Total DISK WRITE: 20.36 M/s
TID PRIO USER DISK READ DISK WRITE SWAPIN IO> COMMAND
15758 be/4 root 7.99 M/s 8.01 M/s 0.00 % 61.97 % bonnie++ -n 0 -u 0 -r 239 -s 478 -f -b -d /tmp

The simplest way to find which process is using storage the most is the iotop command. From the statistics it is easy to identify bonnie++ as the process causing the most I/O on this machine.

While iotop is a great command and easy to use, it is not installed by default on all (or even the major) Linux distributions, and I personally prefer not to rely on commands that are not installed by default. A systems administrator may find themselves on a system where non-default packages cannot be installed until a scheduled maintenance window, which may be far too late depending on the issue.

If iotop is not available, the steps below will also allow you to narrow down the offending process or processes.
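A script that has to work on any host can check for iotop before deciding which route to take. This is only a sketch: io_tool is a made-up helper name, and the commented iotop invocation uses its batch (-b), only-active (-o), iteration count (-n) and delay (-d) options, which normally require root.

```shell
# io_tool   (hypothetical helper name)
# Prints "iotop" when an iotop binary is on PATH, otherwise "ps" to
# signal that the ps-based fallback should be used.
io_tool() {
    if command -v iotop >/dev/null 2>&1; then
        echo iotop
    else
        echo ps
    fi
}

# Usage: three non-interactive samples, two seconds apart, showing only
# tasks that are actually doing I/O:
#   [ "$(io_tool)" = iotop ] && sudo iotop -b -o -n 3 -d 2
```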

Process list “state”

The ps command has statistics for memory and CPU but none for disk I/O. It does, however, show each process's state, which can be used to tell whether or not a process is waiting for I/O.

The ps state field gives the process's current state; below is the list of states from the man page.

PROCESS STATE CODES
D uninterruptible sleep (usually IO)
R running or runnable (on run queue)
S interruptible sleep (waiting for an event to complete)
T stopped, either by a job control signal or because it is being traced.
W paging (not valid since the 2.6.xx kernel)
X dead (should never be seen)
Z defunct (“zombie”) process, terminated but not reaped by its parent.

Processes that are waiting for I/O are commonly in the "uninterruptible sleep" state, or "D"; given this, we can simply look for the processes that are constantly in that state.

Example:

for x in `seq 1 1 10`; do ps -eo state,pid,cmd | grep "^D"; echo "----"; sleep 5; done
D 248 [jbd2/dm-0-8]
D 16528 bonnie++ -n 0 -u 0 -r 239 -s 478 -f -b -d /tmp
----
D 22 [kswapd0]
D 16528 bonnie++ -n 0 -u 0 -r 239 -s 478 -f -b -d /tmp
----
D 22 [kswapd0]
D 16528 bonnie++ -n 0 -u 0 -r 239 -s 478 -f -b -d /tmp
----
D 22 [kswapd0]
D 16528 bonnie++ -n 0 -u 0 -r 239 -s 478 -f -b -d /tmp
----
D 16528 bonnie++ -n 0 -u 0 -r 239 -s 478 -f -b -d /tmp

The above for loop prints the processes in the "D" state every 5 seconds, 10 times in total.
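With more than a handful of processes it helps to count how often each PID shows up in the D state rather than reading the samples by eye. The sketch below does that tally; count_d_state is a hypothetical helper name, and it assumes input in the state,pid,cmd layout used by the loop above.

```shell
# count_d_state   (hypothetical helper name)
# Reads repeated `ps -eo state,pid,cmd` samples on stdin and prints
# "count pid command" for every process seen in state D, with the most
# frequently blocked process first.
count_d_state() {
    awk '$1 == "D" {
             pid = $2
             seen[pid]++
             # Rebuild the command line from field 3 onward.
             cmd = $3
             for (i = 4; i <= NF; i++) cmd = cmd " " $i
             name[pid] = cmd
         }
         END { for (pid in seen) print seen[pid], pid, name[pid] }' |
        sort -k1,1nr -k2,2n
}

# Usage on a live system (10 samples, 5 seconds apart):
#   for x in $(seq 1 10); do ps -eo state,pid,cmd; sleep 5; done | count_d_state
```

A process that appears in nearly every sample is a much stronger suspect than one that is caught in D once.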

From the output above, the bonnie++ process with PID 16528 is waiting for I/O more often than any other process. At this point bonnie++ seems likely to be causing the I/O wait, but a process being in an uninterruptible sleep state does not by itself prove that it is the cause.

To help confirm our suspicions we can use the /proc file system. Within each process's directory there is a file called "io" which holds per-process I/O statistics similar to those iotop reports.

# cat /proc/16528/io
rchar: 48752567
wchar: 549961789
syscr: 5967
syscw: 67138
read_bytes: 49020928
write_bytes: 549961728
cancelled_write_bytes: 0

read_bytes and write_bytes are the number of bytes this specific process has read from and written to the storage layer. In this case the bonnie++ process has read about 47 MiB from disk and written about 524 MiB to it. For some processes this may not be a lot, but in our example it is enough reading and writing to cause the high I/O wait this system is seeing.
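The byte counters are easier to judge once converted. A small sketch, assuming the "name: value" layout shown in the io file above; io_mib is a hypothetical helper name and 1 MiB is taken as 1048576 bytes.

```shell
# io_mib   (hypothetical helper name)
# Reads the contents of /proc/<pid>/io on stdin and prints the
# storage-layer counters converted to MiB (1 MiB = 1048576 bytes).
io_mib() {
    awk -F': *' '
        $1 == "read_bytes"  { printf "read  %.2f MiB\n", $2 / 1048576 }
        $1 == "write_bytes" { printf "write %.2f MiB\n", $2 / 1048576 }'
}

# Usage on a live system (needs permission to read the target process):
#   io_mib < /proc/16528/io
```

For the values in the example this gives 46.75 MiB read and 524.48 MiB written. Because the counters are cumulative over the life of the process, running it twice a few seconds apart and comparing the results shows whether the process is still actively doing I/O.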

Finding what files are being written to heavily

The lsof command shows all of the files opened by a specific process, or by all processes, depending on the options provided. From this list one can make an educated guess as to which files are being written to often, based on the size of each file and the amounts recorded in the "io" file within /proc.

To narrow down the output we will use the -p option to print only the files opened by a specific process ID.

lsof -p 16528

COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
bonnie++ 16528 root cwd DIR 252,0 4096 130597 /tmp
<truncated>
bonnie++ 16528 root 8u REG 252,0 501219328 131869 /tmp/Bonnie.16528
bonnie++ 16528 root 9u REG 252,0 501219328 131869 /tmp/Bonnie.16528
bonnie++ 16528 root 10u REG 252,0 501219328 131869 /tmp/Bonnie.16528
bonnie++ 16528 root 11u REG 252,0 501219328 131869 /tmp/Bonnie.16528
bonnie++ 16528 root 12u REG 252,0 501219328 131869 /tmp/Bonnie.16528
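For a process with hundreds of open descriptors, sorting by size brings the likely candidates to the top. The following is a rough sketch: largest_open_files is a made-up name, and it assumes the nine-column lsof layout shown above (TYPE in column 5, SIZE/OFF in column 7, NAME in column 9), so paths containing spaces would be cut short.

```shell
# largest_open_files [N]   (hypothetical helper name)
# Reads `lsof -p PID` output on stdin and prints "size path" for the N
# largest regular files (default 5). Each path is listed once even when
# several file descriptors point at it.
largest_open_files() {
    awk '$5 == "REG" && !seen[$9]++ { print $7, $9 }' |
        sort -k1,1nr | head -n "${1:-5}"
}

# Usage on a live system:
#   lsof -p 16528 | largest_open_files 5
```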

To further confirm that these files are being written to heavily, we can check whether the /tmp filesystem is part of sda.

df /tmp

Filesystem 1K-blocks Used Available Use% Mounted on
/dev/mapper/workstation-root 7667140 2628608 4653920 37% /

From the output of df we can determine that /tmp is part of the root logical volume in the workstation volume group.

pvdisplay

--- Physical volume ---
PV Name /dev/sda5
VG Name workstation
PV Size 7.76 GiB / not usable 2.00 MiB
Allocatable yes
PE Size 4.00 MiB
Total PE 1986
Free PE 8
Allocated PE 1978
PV UUID CLbABb-GcLB-l5z3-TCj3-IOK3-SQ2p-RDPW5S

Using pvdisplay we can see that /dev/sda5, a partition of the sda disk, is the physical volume backing the workstation volume group, and in turn is where /tmp lives. Given this information it is reasonable to conclude that the large files listed in the lsof output above are the files being read and written so frequently.
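The df step can be scripted so the lookup works for any path. A minimal sketch, with backing_source as a hypothetical helper name; it uses df -P so that the device name and the remaining columns stay on a single line.

```shell
# backing_source PATH   (hypothetical helper name)
# Prints the device or logical volume that holds PATH, taken from the
# first column of POSIX-format `df` output.
backing_source() {
    df -P "$1" | awk 'NR == 2 { print $1 }'
}

# Usage: on an LVM system the result is a /dev/mapper/... name. Where
# lsblk is available, its -s option lists the devices underneath it,
# down to the physical disk:
#   lsblk -s "$(backing_source /tmp)"
```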
