是否优化更新主题浏览量:_主题306：能力规划

是否优化更新主题浏览量:

在你开始前

了解这些教程可以教给您什么以及如何从中获得最大收益。

关于本系列

Linux Professional Institute （LPI）在三个级别上对Linux系统管理员进行认证：初级（也称为“认证级别1”），高级（也称为“认证级别2”）和高级（也称为“认证级别3”））。要获得认证1级，您必须通过考试101和102。要达到认证2级，您必须通过考试201和202。要达到认证3级，您必须具有有效的高级认证并通过考试301（“核心”）。您可能还需要通过其他高级专业考试。

developerWorks提供了教程，以帮助您准备五个初级，高级和高级认证考试。每个考试涵盖几个主题，并且每个主题在developerWorks上都有一个相应的自学教程。表1列出了LPI 301考试的六个主题和相应的developerWorks教程。

表1. LPI 301考试：教程和主题

LPI 301考试主题	developerWorks教程	教程总结
主题301	LPI 301考试准备：概念，体系结构和设计	了解有关LDAP概念和体系结构，如何设计和实现LDAP目录以及架构的信息。
主题302	LPI 301考试准备：安装与开发	了解如何安装，配置和使用OpenLDAP软件。
主题303	LPI 301考试准备：组态	详细了解如何配置OpenLDAP软件。
主题304	LPI 301考试准备：用法	了解如何搜索目录并使用OpenLDAP工具。
主题305	LPI 301考试准备：整合与迁移	了解如何使用LDAP作为系统和应用程序的数据源。
主题306	LPI 301考试准备：容量规划	（本教程。）测量资源，解决资源问题并计划未来的增长。请参阅详细目标。

要通过301考试（并达到3级认证），必须满足以下条件：

您应具有在多种计算机上安装和维护Linux的多种用途的多年经验。
您应该具有与各种技术和操作系统的集成经验。
您应该具有作为企业级Linux专业人员的专业经验或接受过培训（包括作为其他角色的一部分的经验）。
您应该了解Linux管理的高级和企业级别，包括安装，管理，安全性，故障排除和维护。
您应该能够使用开源工具来衡量容量计划并解决资源问题。
您应该具有使用LDAP与UNIX®服务和Microsoft®Windows®服务（包括Samba，可插拔身份验证模块（PAM），电子邮件和Active Directory）集成的专业经验。
您应该能够使用Samba和LDAP进行规划，架构，设计，构建和实施完整的环境，并能够评估服务的容量规划和安全性。
您应该能够使用Bash或Perl创建脚本，或者至少了解一种系统编程语言（例如C）。

要继续为3级认证做准备，请参阅LPI 301考试系列的developerWorks教程，以及整套developerWorks LPI教程。

Linux Professional Institute不特别认可任何第三方考试准备材料或技术。

关于本教程

欢迎使用“容量规划”，这是为您准备LPI 301考试而设计的六本教程的最后一部分。在本教程中，您将学习有关测量UNIX资源，分析需求以及预测未来资源需求的全部知识。

本教程根据该主题的LPI目标进行组织。粗略地说，对于权重较高的目标，应在考试中遇到更多问题。

目标

表2显示了本教程的详细目标。

表2.能力计划：本教程涵盖的考试目标

LPI考试目标	客观体重	客观总结
306.1 衡量资源使用情况	4	测量硬件和网络使用率。
306.2 解决资源问题	4	识别资源问题并进行故障排除。
306.3 分析需求	2	确定软件的容量需求。
306.4 预测未来的资源需求	1个	通过趋势化使用情况并预测应用程序何时需要更多资源来规划未来。

先决条件

为了从本教程中获得最大收益，您应该具有Linux的高级知识以及可以在其上实践所涵盖命令的Linux系统。

如果您的基本Linux技能有点生锈，那么您可能需要先阅读LPIC-1和LPIC-2考试的教程。

程序的不同版本可能会不同地格式化输出，因此您的结果可能看起来与本教程中的清单和图不完全相同。

系统要求

要按照这些教程中的示例进行操作，您将需要具有OpenLDAP软件包并支持PAM的Linux工作站。大多数现代发行版都满足这些要求。

衡量资源使用情况

本部分介绍了高级Linux专业人员（LPIC-3）考试301的主题306.1的材料。此主题的权重为4。

在本节中，学习如何：

测量CPU使用率
测量内存使用量
测量磁盘I / O
测量网络I / O
测量防火墙和路由吞吐量
地图客户端带宽使用情况

计算机依靠硬件资源：中央处理器（CPU），内存，磁盘和网络。您可以对这些资源进行评估，以了解计算机当前的运行状况以及任何潜在故障的隐患。在一段时间（例如几个月）内查看这些测量值，可以得出一些有趣的历史记录。通常可以将这些读数推算到将来，这可以帮助您预测其中一种资源何时用完。或者，您可以使用历史信息来验证系统模型，从而开发系统的数学模型，然后使用该模型更准确地预测将来的使用情况。

服务器始终需要多个硬件资源来完成一项任务。一个任务可能需要访问磁盘以检索数据，并需要内存来存储数据，而CPU却要对其进行处理。如果其中一种资源受到限制，则会影响性能。从磁盘读取信息后，CPU才能处理信息。如果内存已满，也无法存储信息。这些概念是相关的。当内存已满时，操作系统开始将其他内存交换到磁盘。内存也从缓冲区中删除，缓冲区用于加速磁盘活动。

了解资源

在进行测量之前，您必须了解要测量的内容。然后，您可以开始绘制有关系统的有用信息：当前信息，历史记录或将来的预测。

中央处理器

计算机的CPU执行应用程序所需的所有计算，使命令发布到磁盘和其他外围设备，并负责运行操作系统内核。一次只在CPU上运行一个任务，无论该任务是运行内核还是单个应用程序。当前任务可以被称为中断的硬件信号中断。中断是由外部事件触发的，例如接收到网络数据包。或内部事件，例如系统时钟（在Linux中称为“ 滴答” ）。当发生中断时，当前正在运行的进程将被挂起，并运行一个例程以确定系统下一步应该做什么。

当当前运行的进程超过其分配的时间时，内核可以使用称为上下文切换的过程将其换出另一个进程。如果进程发出任何I / O命令（例如，读取磁盘），则可以在分配的时间之前将其关闭。计算机比磁盘快得多，以至于CPU在等待挂起的进程的磁盘请求返回时可以运行其他任务。

在谈论Linux系统的CPU时，您应该考虑几个因素。第一个是CPU空闲时间占工作时间的百分比（实际上，CPU总是在做某件事 -如果没有等待执行的任务被认为是空闲的）。当空闲百分比为零时，CPU处于最大运行状态。 CPU的非空闲部分分为系统时间和用户时间，其中系统时间是指在内核中花费的时间，而用户时间是完成用户请求的工作所花费的时间。空闲时间被分为内核空闲时间（因为它无关紧要）和空闲时间（因为它正在等待I / O）。

测量这些计数器非常棘手，因为要获得一个准确的数字将需要CPU花所有的时间来确定其作用！内核每秒大约检查100次当前状态（系统，用户，iowait，空闲），并使用这些度量来计算百分比。

Linux用于传达CPU使用率的另一个指标是平均负载。该指标不直接与CPU使用率相关；它代表了过去1分钟，5分钟和15分钟内内核运行队列中任务数量的指数加权。稍后将更仔细地研究此指标。

有关内核的其他注意事项是中断负载和上下文切换。这些数字没有上限，但是执行的中断和上下文切换次数越多，CPU完成用户工作所需的时间就越短。

记忆

系统具有两种类型的内存：实内存和交换空间。实际内存是指主板上的RAM棒。交换空间是系统在尝试分配比实际存在更多的RAM时使用的临时保留点。在这种情况下，RAM页面将交换到磁盘上，以释放当前分配的空间。当再次需要数据时，数据将交换回RAM。

RAM可以由应用程序使用，也可以由系统使用，或者根本不使用。系统通过两种方式使用RAM：作为原始磁盘块（传入或传出）的缓冲区以及文件缓存。缓冲区和缓存的大小是动态的，因此可以根据需要将内存分配给应用程序。这就是为什么大多数人认为Linux系统没有可用内存的原因：系统已将未使用的内存分配给缓冲区和缓存。

交换内存位于磁盘上。大量交换会减慢速度，这表明系统内存不足。

磁碟

磁盘是存储长期数据的位置，例如存储在硬盘驱动器，闪存盘或磁带（统称为块设备）上。一个例外是RAM磁盘，其行为类似于块设备，但驻留在RAM中。系统关闭时，该数据将消失。磁盘最流行的形式是硬盘驱动器，因此本教程中对磁盘的讨论集中在这种介质上。

有两种测量方法可用来描述磁盘：空间和速度。磁盘上的可用空间是指磁盘上可使用的字节数。磁盘上的开销包括文件系统已使用或否则无法使用的任何空间。请记住，大多数制造商报告的磁盘数量为1,000,000,000字节的千兆字节，而您的操作系统使用的基数为2的值为1,073,741,824。这给消费者带来93％的“损失”。这不是开销，但如果不考虑这一点，您的计算将是错误的。

磁盘的第二个指标是speed ，它衡量从磁盘返回数据的速度。当CPU发出请求时，必须完成几件事才能将数据返回到CPU：

内核将请求放入队列中，在该队列中等待将请求发送到磁盘（等待时间）。
该命令将发送到磁盘控制器。
磁盘将磁盘头搜索到所需的块（查找时间）。
磁盘头从磁盘读取数据。
数据返回到CPU。

这些步骤中的每个步骤均以不同的方式衡量，有时甚至根本没有。服务时间包括最后三个步骤，代表一旦发出请求，请求服务需要花费多长时间。等待时间代表整个过程，包括排队时间和服务时间。

内核执行的一项优化操作是对步骤1中的队列中的请求进行重新排序和合并，以最大程度地减少磁盘查找的次数。这被称为电梯，并且多年来使用了几种不同的算法。

网络

Linux在网络方面扮演着两个主要角色：客户端，服务器上的应用程序发送和接收数据包；客户端。和路由器（或防火墙或网桥）。数据包在一个接口上接收，并在另一个接口上发送（也许在进行某些过滤或检查之后）。

网络通常以每秒比特数（或千比特，兆比特或千兆比特）和每秒数据包的形式进行度量。每秒测量数据包通常不太有用，因为计算机每个数据包具有固定的开销，因此在较小的数据包大小下吞吐量会降低。不要将网卡的速度（100Mbit /秒或千兆位）与从机器或通过机器的数据传输的预期速度混淆。外部因素在起作用，包括延迟和连接的远程端，更不用说在服务器上进行调整了。

Queue列

队列与其他资源不太适合，但是它们在性能监视中经常出现，因此必须提及。队列是一行，请求在其中等待处理。内核以多种方式使用队列，从保存要运行的进程列表的运行队列到磁盘队列，网络队列和硬件队列。通常，队列是指内存中的某个点，内核使用该点来跟踪特定的一组任务，但是它也可以指硬件所管理的硬件组件上的一块内存。

队列在性能调整中以两种方式出现。首先，当太多工作进入队列时，所有新工作都会丢失。例如，如果太多的数据包进入网络接口，则某些数据包将被丢弃（在网络圈子中，这会导致尾部丢弃）。其次，如果队列使用过多（有时甚至不够用），则另一个组件的性能将不尽必要。频繁运行队列中的大量进程可能意味着CPU过载。

衡量绩效

有几种工具可用于衡量Linux机器上的性能。其中一些工具直接测量CPU，磁盘，内存和网络，而其他工具则显示诸如队列使用率，进程创建和错误之类的指示。有些工具显示瞬时值，有些显示一段时间内的平均值。了解如何进行测量以及正在测量什么同样重要。

虚拟机

vmstat是一个有用的工具，用于实时显示最常用的性能指标。关于vmstat的最重要的了解是，它产生的第一个显示代表自系统启动以来的平均值，通常应该忽略该平均值。在命令行上指定重复时间（以秒为单位），以使vmstat使用当前数据重复报告信息。清单1显示了vmstat 5的输出。

清单1. `vmstat 5`的输出

# vmstat 5 procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu----- r b swpd free buff cache si so bi bo in cs us sy id wa st 0 3 17780 10304 18108 586076 0 0 2779 332 1 1 3 4 76 17 0 1 2 17780 10088 19796 556172 0 0 7803 3940 2257 4093 25 28 14 34 0 0 2 17780 9568 19848 577496 0 0 18060 1217 1610 910 0 3 48 49 0 0 0 17780 51696 20804 582396 0 0 9237 3490 1736 839 0 3 55 41 0

清单1显示了vmstat命令的输出，每5秒进行一次测量。值的第一行代表自系统启动以来的平均值，因此应将其忽略。前两列涉及流程。 r标题下的数字是测量时运行队列中的进程数。运行队列中的进程正在CPU上等待。下一列是在I / O上阻塞的进程数，这意味着它们在进入某些I / O之前一直处于睡眠状态，并且不能被中断。

内存标题下的列是有关系统内存的瞬时度量，以千字节（1024字节）为单位。 swpd是已交换到磁盘的内存量。 free是应用程序，缓冲区或缓存未使用的可用内存量。如果这个数字很低，请不要感到惊讶（有关可用内存的真正更多信息，请参见关于空闲的讨论）。 buff和cache表示分配给缓冲区和缓存的内存量。缓冲区存储原始磁盘块，而缓存存储文件。

前两个类别是瞬时测量。可能在很短的时间内，所有可用内存都已耗尽，但在下一个间隔之前返回了。其余值在整个采样期间取平均值。

swap是每秒从磁盘（ si ）换出到磁盘（ so ）的平均内存量；它以千字节为单位报告。 io是每秒从所有块设备读取并发送到块设备的磁盘块数。

系统类别描述了每秒的中断数（ in ）和每秒的上下文切换（ cs ）。中断来自设备（例如向网卡发送信号，通知内核正在等待数据包）和系统计时器。在某些内核中，系统计时器每秒触发1000次，因此这个数字可能会很高。

最终的测量类别显示CPU发生了什么，以占CPU总时间的百分比表示。这五个值应为100。我们是平均时间在CPU上的用户任务在采样周期中度过，和sy是平均时间在CPU上的系统任务度过。 id是CPU空闲的时间，而wa是CPU等待I / O的时间（清单1是从受I / O约束的系统中提取的，您可以看到CPU花费了34-49％的时间等待数据从磁盘返回）。最终值st （窃取时间）用于运行虚拟机监控程序和虚拟机的服务器。它是指管理程序可以运行虚拟机但还有其他事情要做的时间百分比。

从清单1中可以看到， vmstat提供了涵盖广泛指标的大量信息。如果登录时发生了某些情况， vmstat是缩小源范围的绝佳方法。

vmstat还可以显示有关每个设备的磁盘使用情况的一些有趣信息，这可以使您更加了解清单1中的swap和io类别。- -d参数报告一些磁盘统计信息，包括读取和写入的总数，基于每个磁盘。清单2显示了vmstat -d 5部分输出（ vmstat -d 5滤掉未使用的设备）。

清单2.使用`vmstat`显示磁盘使用情况

disk- ------------reads------------ ------------writes----------- -----IO------ total merged sectors ms total merged sectors ms cur sec hda 186212 28646 3721794 737428 246503 4549745 38981340 8456728 0 2583 hdd 181471 27062 3582080 789856 246450 4549829 38981624 8855516 0 2652

每个磁盘显示在单独的行上，并且输出分为读取和写入。读和写进一步分为发出的请求总数，在磁盘升降机中合并了多少个请求，从中读取或写入的扇区数以及总服务时间。所有这些数字都是计数器，因此与没有-d选项的情况下看到的平均值相反，它们将一直增加直到下一次重新启动。

清单2中，在IO标题下的最后一组测量结果显示了磁盘正在进行的I / O操作的当前数量以及自启动以来I / O花费的总秒数。

清单2显示了两个磁盘上的读取量相似，而写入量几乎相同。这两个磁盘恰好形成一个软件镜像，因此可以预期会出现这种情况。清单2中的信息可用于指示速度较慢的磁盘或使用率更高的磁盘。

iostat

与清单2中的vmstat -d示例紧密相关的是iostat 。此命令提供有关每个设备的磁盘使用情况的详细信息。 iostat通过提供更多详细信息对vmstat -d进行了改进。与vmstat ，您可以向iostat传递一个表示刷新间隔的数字。同样，与vmstat ，第一个输出表示自系统启动以来的值，因此通常被忽略。清单3显示了以5秒为间隔的iostat输出。

清单3. `iostat`命令的输出

$ iostat 5 Linux 26.20-1.3002.fc6xen (bob.ertw.com) 02/08/2008 avg-cpu: %user %nice %system %iowait %steal %idle 0.85 0.13 0.35 0.75 0.01 97.90 Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn hda 1.86 15.24 13351 4740568 41539964 hdd 1.85 14.69 133.51 4570088 41540256

每个测量间隔的第一部分显示CPU使用率，这也由vmstat 。但是，此处显示两位小数。输出的第二部分显示了系统上的所有块设备（为限制显示的设备数量，请在命令行上传递设备的名称，例如iostat 5 hda sda ）。第一列tps代表电梯将请求合并后每秒向设备的传输次数。没有指定传输的大小。最后四列处理512字节块，分别显示每秒读取的块，每秒写入的块，读取的总块和写入的总块。如果您希望以千字节或兆字节为单位报告值，请分别使用-k或-m 。如果需要该数据， -p选项将显示详细信息直至分区级别。

通过使用-x参数，您可以获得很多更多信息，如清单4所示。清单4还将输出限制为一个驱动器。调整了格式以适合页面宽度限制。

清单4. `iostat`扩展信息

# iostat -x 5 hda ..... CPU information removed ... Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz hda 16669.31 1.49 756.93 1.49 139287.13 27.72 18369 avgqu-sz await svctm %util 1.58 208 1.28 96.83

前六个值与每秒读取和写入有关。 rrqm / s和wrqm / s表示合并的读取和写入请求的数量。相比之下， r / s和w / s表示发送到磁盘的读取和写入次数。因此，合并的磁盘请求百分比为16669 /（16669 + 757）= 95％。 rsec / s和wsec / s以每秒扇区为单位显示读写速率。

接下来的四列显示有关磁盘队列和时间的信息。 avgrq-sz是发出给设备的请求的平均大小（以扇区为单位）。 avgqu-sz是整个测量间隔内磁盘队列的平均长度。等待是平均等待时间（以毫秒为单位），它表示从将请求发送到内核到返回请求所花费的平均时间。 svctm是平均服务时间（以毫秒为单位），这是从磁盘请求离开队列并发送到磁盘到返回磁盘所花费的时间。

最终值％util是系统在该设备上执行I / O的时间百分比，也称为饱和度。清单4中报告的96.83％表明，在此期间磁盘几乎处于容量极限。

mpstat

mpstat报告有关CPU（或多处理器计算机中的所有CPU）的详细信息。 iostat和vmstat以某种形式报告了许多此类信息，但是mpstat分别为所有处理器提供数据。清单5显示了具有5秒测量间隔的mpstat 。与iostat和vmstat不同，您不应忽略第一行。

清单5.使用`mpstat`显示CPU信息

# mpstat -P 0 5 Linux 2.620-1.3002.fc6xen (bob.ertw.com) 02/09/2008 09:45:23 PM CPU %user %nice %sys %iowait %irq %soft %steal %idle intr/s 09:45:25 PM 0 77.61 21.89 0.00 0.00 0.50 0.00 0.00 0.00 155.22 09:45:27 PM 0 68.16 30.85 1.00 0.00 0.00 0.00 0.00 0.00 154.73

-P 0指定应该显示第一个CPU（从0开始）。您也可以分别为所有CPU指定-P ALL 。 mpstat返回的字段如下：

％user ：用户任务花费的时间百分比，不包括出色的任务
％nice ：好的（较低优先级）用户任务中花费的时间百分比
％sys ：在内核任务中花费的时间百分比
％iowait ：空闲时等待I / O花费的时间百分比
％irq ：服务硬件中断的时间百分比
％soft ：在软件中断中花费的时间百分比
％steal ：虚拟机管理程序从虚拟机窃取的时间百分比
intr / s ：每秒平均中断数

pstree

在跟踪资源使用情况时，了解哪个进程产生了另一个进程会很有帮助。找到此方法的一种方法是使用ps -ef的输出，并使用父processid将您的方法返回PID 1（ init ）。您还可以使用ps -efjH ，它将输出分类到父子树中，并包括CPU时间使用情况。

名为pstree的实用程序以更加图形化的格式打印进程树，并且还将同一进程的多个实例滚动到一行中。清单6显示了pstree传递Postfix守护程序的PID后的输出。

清单6. `pstree`输出

[root@sergeant ~]# pstree 7988 master─┬─anvil ├─cleanup ├─local ├─pickup ├─proxymap ├─qmgr ├─2*[smtpd] └─2*[trivial-rewrite]

主过程，被称为巧妙主，催生了其它几个处理，诸如砧座，净化和局部的 。输出的最后两行格式为N *[ something ] ，其中something是进程的名称， N是该名称下的子代数。如果事情被括在大括号（{}），除了方括号（[]），这将表明N个线程运行（ ps不能正常显示线程，除非你使用-L ）。

w，正常运行时间和顶部

这些实用程序归为一类，因为它们是人们在调查问题时倾向于使用的第一个实用程序。清单7显示了w命令的输出。

清单7. `w`命令的输出

# w 12:14:15 up 33 days, 15:09, 2 users, load average: 0.06, 0.12, 0.09 USER TTY FROM LOGIN@ IDLE JCPU PCPU WHAT root tty2 - 17Jan08 18days 0.29s 0.04s login -- root root pts/0 bob Sat22 0.00s 0.57s 0.56s -bash

w的第一行提供了大量信息。第一部分“ 33：12：14：15，15：09”给出了当前时间，然后是33天，15小时和9分钟的正常运行时间。第二部分“ 2个用户”给出了已登录用户的数量。最后一部分是平均负载，平均为1分钟，5分钟和15分钟。

负载平均值是给定时间段内运行队列中的进程数的加权平均值。平均负载越高，试图争夺CPU的进程就越多。负载平均值未针对CPU数量进行归一化，这意味着负载平均值与CPU数量无关。

要使用平均负载，您还必须了解加权。平均负载每5秒更新一次，较旧的信息在计算中的作用较小。如果您的系统要立即从运行队列中的0个进程转到1个进程，则下一分钟的1分钟平均负载图不是一条直线，而是一条曲线，该曲线首先Swift上升，然后逐渐减小直到60秒标记。有关如何平均计算负载的更详细信息，请参阅“ 相关主题”部分。

加权平均负载的实际应用是将测量时实际负载的变化平滑掉。但是当前状态更多地反映在数字上，尤其是1分钟的平均值。

第一行之后是已登录用户的列表，包括其登录时间，位置以及有关CPU使用率的信息。第一个用户root是从tty2 （本地控制台）登录的，并且闲置了18天。第二个用户再次是root，但已通过网络登录，并且当前已在Shell上登录。 JCPU和PCPU列使您了解用户已使用了多少CPU时间。第一列包括过去的作业，而PCPU用于用户当前正在使用的进程。

uptime的输出显示与w第一行完全相同，但没有有关用户的信息。实际上，由于有关用户的更多信息以及键入时间更短，因此w在这两者中更有用。

另一个流行的命令是top ，除了一些其他指标外，它还显示持续更新的顶级进程列表（按内存或CPU使用率排序）。图1显示了top操作的屏幕截图。

图1 `top`的动作

第一行显示的是uptime ，例如正常运行时间和平均负载。第二行是进程数。正在运行的进程在运行队列中；睡眠过程正在等待某种唤醒。停止的进程已被暂停，可能是因为正在跟踪或调试该进程。僵尸进程已退出，但父进程尚未确认死亡。

东西没加起来

VIRT = RES + SWAP，这意味着SWAP = VIRT-RES。查看PID 13435，您可以看到VIRT为164m，RES为76m，这意味着SWAP必须为88m。但是，屏幕顶部的交换状态指示仅使用144K交换！可以通过在top使用f键并启用更多字段（例如swap）来验证这一点。

事实证明，交换不仅仅意味着页面已交换到磁盘。应用程序的二进制文件和库不必始终保持在内存中。内核当时可以将某些内存页标记为不必要；但是由于二进制文件位于磁盘上的已知位置，因此无需使用交换文件。由于部分代码不是常驻代码，因此仍被视为交换。此外，应用程序可以将内存映射到磁盘文件。由于应用程序的整个大小（VIRT）包括映射的内存，但它不是驻留的（RES），因此被视为交换。

第三行给出了CPU使用率：依次显示用户，系统，正常，空闲，I / O等待，硬件中断，软件中断和窃取时间。这些数字是上次测量间隔中的时间百分比（默认为3秒）。

顶部的最后两行显示内存统计信息。第一个提供有关真实内存的信息；在图1中，您可以看到系统具有961,780K的RAM（扣除内核开销之后）。除6,728K以外的所有内存均已使用，并具有大约30MB的缓冲区和456M的缓存（缓存显示在第二行的末尾）。第二行显示交换使用情况：系统几乎有3G交换，仅使用了144K。

屏幕的其余部分显示有关当前正在运行的进程的信息。 top会显示尽可能多的进程来填充窗口的大小。每个进程都有其自己的行，并且列表在每个测量时间间隔内更新，其中使用顶部CPU最多的任务。列如下：

PID ：进程的processid
USER ：进程的有效用户名（如果程序使用setuid(2)更改用户，则显示新用户）
PR ：任务的优先级，内核使用它来确定哪个进程首先获取CPU
NI ：任务的良好级别，由系统管理员设置，以影响哪个进程首先获取CPU
VIRT ：进程的虚拟映像的大小，它是RAM使用的空间（驻留大小）和交换中的数据大小（交换大小）的总和
RES ：进程的常驻大小，即您的进程使用的实际RAM量
SHR ：应用程序共享的内存量，例如SysV共享内存或动态库（* .so）
S ：状态，例如睡眠，跑步或僵尸
％CPU ：上次测量间隔中使用的CPU百分比
％MEM ：上次测量时使用的RAM（不包括交换）的百分比
TIME + ：进程使用的时间，以分钟：秒：百分之一为单位
COMMAND ：正在运行的命令的名称

top提供了一种快速的方法来查看哪些进程使用了最多的CPU，还为您提供了有关系统CPU和内存的良好仪表板视图。通过在top显示中键入M ，可以按内存使用量进行top排序。

自由

看完top ，您应该立即了解free 。清单8显示了free的输出，使用-m选项报告所有值（以兆字节为单位）。

清单8.使用`free`命令

# free -m total used free shared buffers cached Mem: 939 904 34 0 107 310 -/+ buffers/cache: 486 452 Swap: 2847 0 2847

free在几个方向上列出了内存使用情况。第一行显示与top相同的信息。第二行表示已使用和可用内存，未考虑缓冲区和高速缓存。在清单8中，452M内存可供应用程序使用；该内存将来自空闲内存（34M），缓冲区（107M）或缓存（310M）。

最后一行显示的交换统计信息与top相同。

显示网络统计信息

获取网络统计信息的直接性不如CPU，内存和磁盘。主要方法是从/ proc / net / dev读取计数器，该计数器以包和字节为单位报告每个接口的传输。如果要获得每秒的值，则必须自己将两个连续测量值之间的差除以时间间隔来自己计算。另外，您可以使用bwm类的工具来自动收集和显示总带宽。图2显示了bwm中的bwm 。

图2.使用`bwm`命令

bwm以多种方式显示界面使用情况。图2显示了每半秒的瞬时速率，尽管可以使用30秒的平均值，最大带宽和计数字节。从图2中可以看到eth0正在接收大约20K / sec的流量，这似乎来自vif0.0。如果不是您想要的每秒字节数，则可以使用u键在位，数据包和错误之间循环。

要获取有关哪些主机负责流量的更多详细信息，您需要iftop ，它为网络流量提供了一个类似top的界面。内核不会直接提供此信息，因此iftop使用pcap库检查网络上的数据包，这需要root特权。图3显示了连接到eth2设备时的iftop动作。

图3. `iftop -i eth2`的输出

iftop显示您网络上的主要对话者。缺省情况下，每个会话占用两行：一条用于发送一半，另一条用于接收一半。查看从mybox到pub1.kernel.org的第一个对话，第一行显示从mybox发送的流量，第二行显示mybox接收的流量。右侧的数字分别表示最近2秒，10秒和40秒的平均流量。您还可以在主机名上看到一个黑条，这是10秒平均值的可视指示（标度显示在屏幕顶部）。

仔细观察图3，第一次传输可能是下载，因为与少量的上载流量相比，接收到的大量流量（在过去10秒钟内平均每秒约半兆比特）。第二个发话人的流量相等，一直保持在75-78k /秒左右。这是通过les.net（我的VoIP服务提供商）进行的G.711语音呼叫。第三次传输显示了128K下载和少量上传：这是Internet广播流。

您要附加的接口的选择很重要。图3使用防火墙上的外部接口，该接口在所有数据包经过IP伪装后可以看到它们。这将导致内部地址丢失。使用其他接口（例如内部接口）将保留此信息。

萨尔

sar是整篇文章的主题（请参阅“ 相关主题”部分）。 sar每10分钟测量几十个关键指标，并提供一种检索这些指标的方法。您可以使用前面的工具来确定“现在正在做什么？”； sar回答：“这周发生了什么？” 请注意， sar清除其数据以仅保留最后7天的数据。

您必须通过在root的crontab中添加两行来配置数据收集。清单9显示了sar的典型crontab。

清单9.用于`sar`数据收集的root的crontab

# Collect measurements at 10-minute intervals 0,10,20,30,40,50 * * * * /usr/lib/sa/sa1 -d 1 1 # Create daily reports and purge old files 0 0 * * * /usr/lib/sa/sa2 -A

第一行执行sa1命令，每10分钟收集一次数据；此命令运行sadc进行实际收集。这项工作是独立的：它知道要写入哪个文件，不需要其他配置。第二行在午夜调用sa2以清除较旧的数据文件，并将当天的数据收集到可读的文本文件中。

在依赖数据之前，有必要检查一下系统如何运行sar 。某些系统禁用磁盘统计信息的收集。要解决此问题，必须在对sa1的调用中添加-d （清单9已添加了此内容）。

收集了一些数据后，您现在可以运行sar而不用任何选择来查看当天的CPU使用率。清单10显示了部分输出。

清单10. `sar`示例输出

[root@bob cron.d]# sar | head Linux 2.6.20-1.3002.fc6xen (bob.ertw.com) 02/11/2008 12:00:01 AM CPU %user %nice %system %iowait %steal %idle 12:10:01 AM all 0.18 0.00 0.18 3.67 0.01 95.97 12:20:01 AM all 0.08 0.00 0.04 0.02 0.01 99.85 12:30:01 AM all 0.11 0.00 0.03 0.02 0.01 99.82 12:40:01 AM all 0.12 0.00 0.02 0.02 0.01 99.83 12:50:01 AM all 0.11 0.00 0.03 0.05 0.01 99.81 01:00:01 AM all 0.12 0.00 0.02 0.02 0.01 99.83 01:10:01 AM all 0.11 0.00 0.02 0.03 0.01 99.83

The numbers shown in Listing 10 should be familiar by now: they're the various CPU counters shown by top , vmstat , and mpstat . You can view much more information by using one or more of the command-line parameters shown in Table 3.

Table 3. A synopsis of `sar` options

选项	例	描述
`-A`	`sar -A`	Displays everything . Unless you're dumping this result to a text file, you probably don't need it. If you do need it, this process is run nightly as part of `sa2` anyway.
`-b`	`sar -b`	Shows transactions and blocks sent to, and read from, block devices, much like `iostat` .
`-B`	`sar -B`	Shows paging (swap) statistics such as those reported by `vmstat` .
`-d`	`sar -d`	Shows disk activity much like `iostat -x` , which includes wait and service times, and queue length.
`-n`	`sar -n DEV or sar -n NFS`	Shows interface activity (like `bwm` ) when using `DEV` , or NFS client statistics when using the `NFS` keyword (use the `NFSD` keyword for the NFS server daemon stats). The `EDEV` keyword shows error information from the network cards.
`-q`	`sar -q`	Shows information about the run queue and total process list sizes, and load averages, such as those reported by `vmstat` and `uptime` .
`-r`	`sar -r`	Shows information about memory, swap, cache, and buffer usage (like `free` ).
`-f`	`sar -f /var/log/sa/sa11`	Reads information from a different file. Files are named after the day of the month.
`-s`	`sar -s 08:59:00`	Starts displaying information at the first measurement after the given time. If you specify 09:00:00, the first measurement will be at 09:10, so subtract a minute from the time you want.
`-e`	`sar -e 10:01:00`	Specifies the cutoff for displaying measurements. You should add a minute to the time you want, to make sure you get it.

You may also combine several parameters to get more than one report, or a different file with a starting and an ending time.

df

Your hard disks are a finite resource. If a partition runs out of space, be prepared for problems. The df command shows the disk-space situation. Listing 11 shows the output of df -h , which forces output to be in a more friendly format.

Listing 11. Checking disk space usage with `df -h`

$ df -h Filesystem Size Used Avail Use% Mounted on /dev/mapper/VolGroup00-LogVol00 225G 169G 44G 80% / /dev/hda1 99M 30M 64M 32% /boot tmpfs 474M 0 474M 0% /dev/shm

Listing 11 shows a single file system on the root that is 225G in size, with 44G free. The /boot partition is 99M with 64M free. tmpfs is a special file system and doesn't refer to any particular device. If a partition is full, you see no available space and 100% usage.

Troubleshoot resource problems

This section covers material for topic 306.2 for the Senior Level Linux Professional (LPIC-3) exam 301. This topic has a weight of 4.

在本节中，学习如何：

Match/correlate system symptoms with likely problems
Identify bottlenecks in a system

The previous section showed how to display different performance counters within the Linux system. It's now time to apply these commands to solving resource-related problems in your systems.

Troubleshooting methodology

Lay out your strategy for problem solving before getting into the details of resource constraints and your system. The many strategies for problem solving boil down to four steps:

Identify the symptoms.
Determine the root cause.
Implement a fix.
Evaluate the results.

Identify the symptoms

The first step to solving a problem is to identify the symptoms. An example of a symptom is "e-mail is slow" or "I'm out of disk space." The symptoms are causing your users to complain, and those users won't be happy until the symptoms go away. Don't confuse the symptoms with the problem, though: most often, the problem is different than what your users are reporting, although the problem is causing the symptoms.

Once you've collected the symptoms, try to quantify the complaint and clarify the conditions under which it occurs. Rather than the e-mail system being slow, you may learn that e-mails used to be received within seconds of being sent but now take hours. And a user who is out of disk space has to be doing something, such as saving a file or processing a batch job.

This final step of quantifying the complaint has two purposes. The first is to let you reproduce the problem, which will help later in determining when the problem has been solved. The second purpose is to get more details from the user that will help you determine the root cause. After learning that a job that once took 5 minutes to execute now takes an hour, you might inquire about the nature of the job. This might lead you to learn that it pulls information from a database on another server, which you must include in the scope of your search for the root cause.

Determine the root cause

Determining the root cause involves using the commands learned in the Measure resource usage section to find the cause of the problem. To do so, you must investigate the resources, such as CPU, memory, disk, and network. Ideally, you'll be able to collect data while the problem is occurring with real-time tools like vmstat , iostat , and top . If not, something that produces historical information such as sar may have to do.

If the problem is resource related, then one of two results will appear: one (or more) of your resources will be at 100% utilization, and the cause should be obvious; or nothing is obviously being overused.

In the second case, you should refer to a baseline . The baseline is a set of reference data you can use to compare what's "normal" to what you're seeing. Your baseline might be a series of graphs, or archived sar reports showing normal activity. The baseline will also be helpful later when you learn about predicting growth.

As you use the administration tools, you'll start to develop a picture of the problem's root cause. It may be that your mail server is stuck on a message, causing it to stop processing other messages. Or perhaps a batch job is consuming all the CPU on the system.

At this point, you must be careful that you have properly identified the root cause of the problem. An application that generates a huge logfile may have caused you to run out of space. If you identify the logfile as the root cause and decided to delete it, the application may still fill up the disk at some point in the future.

Implement a fix

You'll often have several ways to fix a problem. Take, for instance, a batch job that is consuming all CPU resources. If you kill the job, then the user running the job will probably lose his or her work, although the other users will have their server back. You may decide to renice the process to give other processes more time on the CPU. This is usually a judgment call, depending on the needs of the business and the urgency of the situation.

Evaluate the results

Once you've implemented your solution, you must go back and check to see if the problem was solved. Are e-mails delivered near-instantly? Are users able to log in? If not, then you must step back and look at the root cause again, which will lead to another fix that must be evaluated. If your fix failed, you must also check to see that you didn't make things worse!

After the problem is fixed, determine if you need to take any longer-term actions. Do you have to consider a bigger disk, or moving a user's batch job to another host? Unexplained processes on a machine mat prompt you to do a more in-depth security check of the server to make sure it hasn't been exploited.

Compound problems

Some performance problems are obvious. A user complains that something is running slowly: you log in and fire up top . You see a process unnecessarily hogging the CPU: you kill it, and the system returns to normal. After showering you with praise, your boss gives you a raise and the rest of the day off. (OK, maybe the last part is made up.)

What happens when your problem isn't obvious? Sometimes problems are caused by more than one thing, or a symptom is caused by something that may seem unrelated at first.

The swap spiral

Memory is fast, and you probably have lots of it in your system. But sometimes an application needs more memory than the system has, or a handful of processes combined end up using more memory than the system has. In this case, virtual memory is used. The kernel allocates a spot on disk and swaps the resident memory pages to disk so that the active application can use it. When the memory on disk is needed, it's brought back to RAM, optionally swapping out some other memory to disk to make room.

The problem with that process is that disk is slow. If you briefly dip into swap, you may not notice. But when the system starts aggressively swapping memory to disk in order to satisfy a growing demand for memory, you've got problems. You'll find your disk I/O skyrocketing, and it will seem that the system isn't responding. In fact, the system probably isn't responding, because your applications are waiting for their memory to be transferred from disk to RAM.

UNIX admins call this the swap spiral (or sometimes, more grimly, the swap death spiral ). Eventually, the system grinds to a halt as the disks are running at capacity trying to swap memory in and out. If your swap device is on the same physical disk as data, things get even worse. Once your application makes it onto the CPU and issues an I/O request, it has to wait longer while the swap activity is serviced.

The obvious symptom of the swap spiral is absurdly long waits to do anything, even getting the uptime. You also see a high load average, because many processes are in the run queue due to the backed-up system. To differentiate the problem from a high-CPU problem, you can check top to see if processes are using the CPU heavily, or you can check vmstat to see if there is a lot of swap activity. The solution is usually to start killing off processes until the system returns to order, although depending on the nature of the problem, you may be able to wait it out.

Out of disk space

Applications aren't required to check for errors. Many applications go through life assuming that every disk access executes perfectly and quickly. A disk volume that fills up often causes applications to behave in weird ways. For example, an application may consume all available CPU as it tries to do an operation over and over without realizing it's not going to work. You can use the strace command to see what the application is doing (if it's using system calls).

Other times, applications simply stop working. A Web application may return blank pages if it can't access its database.

Logging in and checking your available disk (with du ) is the quickest way to see if disk space is the culprit.

Blocked on I/O

When a process requests some form of I/O, the kernel puts the process to sleep until the I/O request returns. If something happens with the disk (sometimes as part of a swap spiral, disk failure, or network failure on networked file systems), many applications are put to sleep at the same time.

A process that's put to sleep can be put into an interruptible sleep or an uninterpretable sleep. The former can be killed by a signal, but the second can't. Running ps aux shows the state. Listing 12 shows one process in uninterpretable sleep and another in interruptible sleep.

Listing 12. Two processes in a sleep state

apache 26575 0.2 19.6 132572 50104 ? S Feb13 3:43 /usr/sbin/httpd root 8381 57.8 0.2 3844 532 pts/1 D 20:46 0:37 dd

The first process in Listing 12, httpd, is in an interruptible sleep state, indicated by the letter S just after the question mark. The second process, dd, is in an uninterpretable sleep state. Uninterpretable sleeps are most often associated with hard disk accesses, whereas interruptible sleeps are for operations that take comparably longer to execute, such as NFS and socket operations.

If you find a high load average (meaning a lot of processes in the run queue) and a lot of processes in an uninterpretable sleep state, then you may have a problem with hard drive I/O, either because the device is failing or because you're trying to get too many reads/writes out of the drive at a time.

Analyze demand

This section covers material for topic 306.3 for the Senior Level Linux Professional (LPIC-3) exam 301. This topic has a weight of 2.

在本节中，学习如何：

Identify capacity demands
Detail capacity needs of programs
Determine CPU/memory needs of programs
Assemble program needs into a complete analysis

Fixing immediate problems is a key task for the system admin. Another task involves analyzing how systems are currently performing in the hope that you can foresee resource constraints and address them before they become a problem. This section looks at analyzing the current demand, and the next section builds on that to predict future usage.

You can use two approaches to analyze current demand: measure the current demand over a period of time (like a baseline), or model the system and come up with a set of parameters that makes the model reflect current behavior. The first approach is easier and reasonably good. The second is more accurate but requires a lot of work. The real benefit of modeling comes when you need to predict future behavior. When you have a model of your system, you can change certain parameters to match growth projections and see how performance will change.

In practice, both of these approaches are used together. In some cases, it's too difficult to model a particular system, so measurements are the only basis on which to base demand and growth projections. Measurements are still required to generate models.

Model system behavior

The activity in a computer can be modeled as a series of queues . A queue is a construct where requests come in one end and are held until a resource is available. Once the resource is available, the task is executed and exits the queue.

Multiple queues can be attached together to form a bigger system. A disk can be modeled as a queue where requests come in to a buffer. When the request is ready to be serviced, it's passed to the disk. This request generally comes from the CPU, which is a single resource with multiple tasks contending for the use of the CPU. The study of queues and their applications is called queuing theory .

The book Analyzing Computer System Performance with Perl::PDQ (see Related topics for a link) introduces queuing theory and shows how to model a computer system as a series of queues. It further describes a C library called PDQ and an associated Perl interface that lets you define and solve the queues to give performance estimates. You can then estimate the result of changes to the system by changing parameters.

Introducing queues

Figure 4 shows a single queue. A request comes in from the left and enters the queue. As requests are processed by the circle, they leave the queue. The blocks to the left of the circle represent the queued objects.

Figure 4. A simple queue

The queue's behavior is measured in terms of times, rates, and sizes. The arrival rate is denoted as Lambda (Λ) and is usually expressed in terms of items per second. You can determine Λ by observing your system over a reasonable period of time and counting the arrivals. A reasonable amount of time is defined to be at least 100 times the service time , which is the length of time that the request is processed. The residence time is the total time a request spends in the queue, including the time it takes to be processed.

The arrival rate describes the rate at which items enter the queue, and the throughput defines the rate at which the items leave. In a more complex system, the nodal throughput defines the throughput of a single queuing node, and the system throughput refers to the system as a whole.

The size of the buffer doesn't matter in most cases because it will have a finite and predictable size as long as the following conditions hold true:

The buffer is big enough to handle the queued objects.
The queue doesn't grow unbounded.

The second constraint is the most important. If a queue can dispatch requests at the rate of one request per second, but requests come in more often than one per second, then the queue will grow unbounded. In reality, the arrival rate will fluctuate, but performance analysis is concerned with the steady state , so averages are used. Perhaps at one point, 10 requests per second come in, and at other times no requests come in. As long as the average is less than one per second, then the queue will have a finite length. If the average arrival rate exceeds the rate at which requests are dispatched, then the queue length will continue to grow and never reach a steady state.

The queue in Figure 4 is called an open queue because an unlimited population of requests is arriving, and they don't necessarily come back after they're processed. A closed queue feeds back to the input; there is a finite number of requests in the system. Once the requests have been processed, they go back to the arrival queue.

The classic example of a queue is a grocery store. The number of people entering the line divided by the measurement period is the arrival rate. The number of people leaving the line divided by the measurement period is the throughput. The average time it takes a cashier to process a customer is the service time. The average time a customer waits in line, plus the service time, is the residence time.

To move into the PDQ realm, consider the following scenario. A Web service sees 30,000 requests over the course of 1 hour. Through some tracing of an unloaded system, the service time is found to be 0.08 seconds. Figure 5 shows this drawn as a queue.

Figure 5. The Web service modeled as a queue

What information can PDQ provide? Listing 13 shows the required PDQ program and its output.

Listing 13. A PDQ program and its output

#!/usr/bin/perl use strict; use pdq; # Observations my $arrivals = 30000; # requests my $period = 3600; # seconds my $serviceTime = 0.08; # seconds # Derived my $arrivalRate = $arrivals / $period; my $throughput = 1 / $serviceTime; # Sanity check -- make sure arrival rate &lt; throughput if ($arrivalRate &gt; $throughput) { die "Arrival rate $arrivalRate &gt; throughput $throughput"; } # Create the PDQ model and define some units pdq::Init("Web Service"); pdq::SetWUnit("Requests"); pdq::SetTUnit("Seconds"); # The queuing node pdq::CreateNode("webservice", $pdq::CEN, $pdq::FCFS); # The circuit pdq::CreateOpen("system", $arrivalRate); # Set the service demand pdq::SetDemand("webservice", "system", $serviceTime); # Run the report pdq::Solve($pdq::CANON); pdq::Report(); ..... output .. *************************************** ****** Pretty Damn Quick REPORT ******* *************************************** *** of : Sat Feb 16 11:24:54 2008 *** *** for: Web Service *** *** Ver: PDQ Analyzer v4.2 20070228*** *************************************** *************************************** *************************************** ****** PDQ Model INPUTS ******* *************************************** Node Sched Resource Workload Class Demand ---- ----- -------- -------- ----- ------ CEN FCFS webservice system TRANS 0.0800 Queueing Circuit Totals: Streams: 1 Nodes: 1 WORKLOAD Parameters Source per Sec Demand ------ ------- ------ system 8.3333 0.0800 *************************************** ****** PDQ Model OUTPUTS ******* *************************************** Solution Method: CANON ****** SYSTEM Performance ******* Metric Value Unit ------ ----- ---- Workload: "system" Mean Throughput 8.3333 Requests/Seconds Response Time 0.2400 Seconds Bounds Analysis: Max Demand 12.5000 Requests/Seconds Max Throughput 12.5000 Requests/Seconds ****** RESOURCE Performance ******* Metric Resource Work Value Unit ------ -------- ---- ----- ---- Throughput webservice system 8.3333 Requests/Seconds Utilization webservice system 66.6667 Percent Queue Length webservice system 2.0000 Requests Residence Time webservice system 0.2400 Seconds

Listing 13 begins with the UNIX "shebang" line that defines the interpreter for the rest of the program. The first two lines of Perl code call for the use of the PDQ module and the strict module. PDQ offers the PDQ functions, whereas strict is a module that enforces good Perl programming behavior.

The next section of Listing 13 defines variables associated with the observations of the system. Given this information, the section that follows the observations calculates the arrival rate and the throughput. The latter is the inverse of the service time—if you can serve one request in N seconds, then you can serve 1/ N requests per second.

Installing PDQ

You can download the PDQ tarball from the author's Web site (see the Related topics ). Unpack it in a temporary directory with tar -xzf pdq.tar.gz , and change into the newly created directory with cd pdq42 . Then, run make to compile the C code and the Perl module. Finally, cd perl5 and then run ./setup.sh to finish building the Perl module and install it in your system directory.

The sanity test checks to make sure the queue length is bounded. Most of the PDQ functions flag an error anyway, but the author of the module recommends an explicit check. If more requests per second come in than are leaving, then the program dies with an error.

The rest of the program calls the PDQ functions directly. First, the module is initialized with the title of the model. Then, the time unit and work units are set so the reports show information the way you expect.

Each queue is created with the CreateNode function. In Listing 13, a queue called webservice is created (the names are tags to help you understand the final report) that is of type CEN (a queuing center, as opposed to a delay node that doesn't do any work). The queue is a standard first in, first out (FIFO) queue that PDQ calls a first-come first-served queue .

Next, CreateOpen is called to define the circuit (a collection of queues). The arrival rate to the circuit has already been calculated. Finally, the demand for the queue is set with SetDemand . SetDemand defines the time taken to complete a particular workload (a queue within a circuit).

The circuit is finally solved with the Solve function and reported with the Report function. Note that PDQ takes your model, turns it into a series of equations, and then solves them. PDQ doesn't simulate the model in any way.

Interpreting the output is straightforward. The report first starts with a header and a summary of the model. The WORKLOAD Parameters section provides more interesting information. The circuit's service time is 0.08 seconds, which was defined. The per second rate is the input rate.

The SYSTEM performance section calculates performance of the system as a whole. The circuit managed to keep up with the input rate of 8.3333 requests per second. The response time, which includes the 0.08 seconds of service time, and the time spent in queue was 0.24 seconds (more on this later). The maximum performance of the circuit was deemed to be 12.5 requests per second.

Looking closely at the queue, you can see that it's 66.6667% used. The average queue length is two requests. This means that a request coming in can expect to have two requests queued ahead of it, plus the request being executed. At 0.08 seconds per request, the average wait is then the 0.24 seconds reported earlier.

This model could be extended to show the components of the Web service. Rather than a single queue representing the Web service, you might have a queue to process the request, a queue to access a database, and a queue to package the response. The system performance should stay the same if your model is valid, but then you'll have insight into the inner workings of the Web service. From there, you can play "what if" and model a faster database or more Web servers to see what sort of improvement in response you get. The individual resource numbers will tell you if a particular queue is the bottleneck, and how much room you have to grow.

Listing 13 is a basic example of using the PDQ libraries. Read Analyzing Computer System Performance with Perl::PDQ (see Related topics for a link) to learn how to build more complex models.

Predict future resource needs

This section covers material for topic 306.4 for the Senior Level Linux Professional (LPIC-3) exam 301. This topic has a weight of 1.

在本节中，学习如何：

Predict capacity break point of a configuration
Observe growth rate of capacity usage
Graph the trend of capacity usage

The previous section introduced the PDQ library and a sample report. The report showed the calculated values for the utilization and maximum load of a queue and the system as a whole. You can use the same method to predict the break point of a configuration. You can also use graphs to show the growth of a system over time and predict when it will reach capacity.

More on PDQ

Listing 14 shows the same Web service as Listing 13 , but it's broken into two queues: one representing the CPU time on the Web server for processing the request and response, and one showing the time waiting for the database request to return.

Listing 14. A new PDQ program for the sample Web service

#!/usr/bin/perl use strict; use pdq; # Observations my $arrivals = 30000; # requests my $period = 3600; # seconds # Derived my $arrivalRate = $arrivals / $period; # Create the PDQ model and define some units pdq::Init("Web Service"); pdq::SetWUnit("Requests"); pdq::SetTUnit("Seconds"); # The queuing nodes pdq::CreateNode("dblookup", $pdq::CEN, $pdq::FCFS); pdq::CreateNode("process", $pdq::CEN, $pdq::FCFS); # The circuit pdq::CreateOpen("system", $arrivalRate); # Set the service demand pdq::SetDemand("dblookup", "system", 0.05); pdq::SetDemand("process", "system", 0.03); # Solve pdq::Solve($pdq::CANON); pdq::Report();

The code in Listing 14 adds another queue to the system. The total service time is still 0.08 seconds, comprising 0.05 seconds for a database lookup and 0.03 seconds for CPU processing. Listing 15 shows the generated report.

Listing 15. The PDQ report from Listing 14

*************************************** ****** Pretty Damn Quick REPORT ******* *************************************** *** of : Sun Feb 17 11:35:35 2008 *** *** for: Web Service *** *** Ver: PDQ Analyzer v4.2 20070228*** *************************************** *************************************** *************************************** ****** PDQ Model INPUTS ******* *************************************** Node Sched Resource Workload Class Demand ---- ----- -------- -------- ----- ------ CEN FCFS dblookup system TRANS 0.0500 CEN FCFS process system TRANS 0.0300 Queueing Circuit Totals: Streams: 1 Nodes: 2 WORKLOAD Parameters Source per Sec Demand ------ ------- ------ system 8.3333 0.0800 *************************************** ****** PDQ Model OUTPUTS ******* *************************************** Solution Method: CANON ****** SYSTEM Performance ******* Metric Value Unit ------ ----- ---- Workload: "system" Mean Throughput 8.3333 Requests/Seconds Response Time 0.1257 Seconds Bounds Analysis: Max Demand 20.0000 Requests/Seconds Max Throughput 20.0000 Requests/Seconds ****** RESOURCE Performance ******* Metric Resource Work Value Unit ------ -------- ---- ----- ---- Throughput dblookup system 8.3333 Requests/Seconds Utilization dblookup system 41.6667 Percent Queue Length dblookup system 0.7143 Requests Residence Time dblookup system 0.0857 Seconds Throughput process system 8.3333 Requests/Seconds Utilization process system 25.0000 Percent Queue Length process system 0.3333 Requests Residence Time process system 0.0400 Seconds

Look at the output side of the report, and note that the average response time has decreased from Listing 13, and the maximum requests per second has gone from 12.5 to 20. This is because the new model allows for pipelining . While a request is being dispatched to the database, another request can be processed by the CPU. In the older model, this wasn't possible to calculate because only one queue was used.

More important, you can see that the database is 42% utilized and the CPU is only 25% utilized. Thus the database will be the first to hit capacity as the system falls under higher load.

Change the arrivals to be 60,000 over the course of an hour, and you'll find that the average response time increases to 0.36 seconds, with the database hitting 83% utilization. You'll also see that out of the 0.36 second, .30 is spent waiting on the database. Thus, your time would be better spent speeding up database access.

You may define maximum capacity in different ways. At 20 requests per second (from the top of the report), the system is at 100% capacity. You may also choose to define your capacity in terms of average response time. At roughly 15 requests per second, the response time will exceed a quarter of a second. If your goal is to keep response to under 0.25 second, your system will hit capacity at that point even though you still have room to grow on the hardware.

Use graphs for analysis

Graphs are an excellent way to show historical information. You can look at the graph over a long period of time, such as 6 months to a year, and get an idea of your growth rate. Figure 6 represents the CPU usage of an application server over the course of a year. The average daily usage measurements were brought into a spreadsheet and graphed. A trendline was also added to show the growth.

Figure 6. Graphing the CPU usage of a server

From this graph, you can project the future usage (assuming growth stays constant). Growth on the server in Figure 6 is approximately 10% every 3 months. The effects of queuing are more pronounced at higher utilization, so you may find you need to upgrade prior to reaching 100% usage.

How to graph

The spreadsheet method doesn't scale well to many servers with many different measurements. One method takes the output from sar and passes it through a graphing tool like GNUplot. You may also look at the graphing tools available, many of which are open source. The open source category includes a series of tools based on the RRDTool package.

RRDTool is a series of programs and libraries that put data into a round-robin database (RRD). An RRD continually archives data as it comes in, so you can have hourly data for the past year and 5-minute averages for the week. This gives you a database that never grows and that constantly prunes old data. RRDTool also comes with tools to make graphs.

See the Related topics section for several good graphing tools.

What to graph

You should graph any information that is important to your service, and anything you could potentially use to make decisions. Graphs also play a secondary role by helping you find out what happened in the past, so you may end up graphing items like fan speeds. Normally, though, you'll focus your attention on graphing CPU, memory, disk, and network stats. If possible, graph response times from services. Not only will this help you make better decisions based on what your users will expect, but the information will also help you if you develop any models of your system.

摘要

In this tutorial, you learned about measuring and analyzing performance. You also learned to use your measurements to troubleshoot problems.

Linux provides a wealth of information regarding the health of the system. Tools like vmstat , iostat , and ps provide real-time information. Tools like sar provide longer-term information. Remember that when you're using vmstat and iostat , the first value reported isn't real-time!

When troubleshooting a system, you should first try to identify the symptoms of the problem, both to help you understand the problem and to know when it has been solved. Then, measure system resources while the problem is ongoing (if possible) to determine the source of the problem. Once you've identified a fix, implement it and then evaluate the results.

The PDQ Perl module allows you to solve queuing problems. After replacing your system with a series of queues, you can write a Perl script using the PDQ functions. You can then use this model to calculate demand based on current usage and on future predicted usage.

Both models and graphs can be used to predict growth. Ideally, you should use both methods and compare the results.

This concludes the series on preparing for the LPIC 3 exam. If you'll be writing the exam, I wish you success and hope this series is helpful to you.

翻译自: https://www.ibm.com/developerworks/linux/tutorials/l-lpic3306/index.html

是否优化更新主题浏览量: