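The previous post covered basic usage of the smartmontools toolkit:
http://blog.163.com/digoal@126/blog/static/16387704020121028103934749/
The smartmontools drive database (drivedb.h) is updated continually. If your drive has not been added to it yet, you will have to wait.
Even if your drive is already listed, don't celebrate too early: do you actually understand the SMART Attributes it reports?
SMART Attribute definitions are supplied by each drive vendor and there is no unified standard, which makes them painful to interpret.
The following uses an OCZ PCI-E RevoDrive3 as an example to look at what the SMART Attributes mean: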

[root@db-172-16-3-150 postgresql-9.2.1]# /opt/smartmontools-6.0/sbin/smartctl -A -f brief /dev/sdd
smartctl 6.0 2012-10-10 r3643 [x86_64-linux-2.6.18-274.el5] (local build)
Copyright (C) 2002-12, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     POSR--   089   089   050    -    0/113484753
  5 Retired_Block_Count     PO--CK   100   100   003    -    0
  9 Power_On_Hours_and_Msec -O--CK   100   100   000    -    170h+08m+29.325s
 12 Power_Cycle_Count       -O--CK   100   100   000    -    8
171 Program_Fail_Count      -O--CK   000   000   000    -    0
172 Erase_Fail_Count        -O--CK   000   000   000    -    0
174 Unexpect_Power_Loss_Ct  ----CK   000   000   000    -    7
177 Wear_Range_Delta        ------   000   000   000    -    2
181 Program_Fail_Count      -O--CK   000   000   000    -    0
182 Erase_Fail_Count        -O--CK   000   000   000    -    0
187 Reported_Uncorrect      -O--CK   100   100   000    -    0
194 Temperature_Celsius     -O---K   030   030   000    -    30 (Min/Max 30/30)
195 ECC_Uncorr_Error_Count  --SRC-   120   120   000    -    0/113484753
196 Reallocated_Event_Count PO--CK   100   100   003    -    0
201 Unc_Soft_Read_Err_Rate  --SRC-   120   120   000    -    0/113484753
204 Soft_ECC_Correct_Rate   --SRC-   120   120   000    -    0/113484753
230 Life_Curve_Status       PO--C-   100   100   000    -    100
231 SSD_Life_Left           PO--C-   100   100   010    -    0
233 SandForce_Internal      ------   000   000   000    -    0
234 SandForce_Internal      ------   000   000   000    -    0
241 Lifetime_Writes_GiB     -O--CK   000   000   000    -    44814
242 Lifetime_Reads_GiB      -O--CK   000   000   000    -    363
                            ||||||_ K auto-keep
                            |||||__ C event count
                            ||||___ R error rate
                            |||____ S speed/performance
                            ||_____ O updated online
                            |______ P prefailure warning
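What each column means (see the -A section of the smartctl man page):
ID# : the attribute ID, from 1 to 255.
ATTRIBUTE_NAME : the attribute name.
FLAGS : the flags carried by the attribute. They are printed when -f brief is used.

                            ||||||_ K auto-keep
                            |||||__ C event count
                            ||||___ R error rate
                            |||____ S speed/performance
                            ||_____ O updated online
                            |______ P prefailure warning
VALUE : the Normalized value, in the range 1 to 254. Lower is worse, higher is better (with 1 representing the worst case and 254 representing the best).
Note that the wiki gives the range as 1 to 253. This value is derived from RAW_VALUE by the drive vendor's firmware; smartmontools does not perform the conversion.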

Each Attribute has a "Raw" value, printed under the heading "RAW_VALUE", and a "Normalized" value printed under the heading "VALUE".
[Note: smartctl prints these values in base-10.] In the example just given, the "Raw Value" for Attribute 12 would be the actual number of times that the disk has been power-cycled, for example 365 if the disk has been turned on once per day for exactly one year.
Each vendor uses their own algorithm to convert this "Raw" value to a "Normalized" value in the range from 1 to 254.
Please keep in mind that smartctl only reports the different Attribute types, values, and thresholds as read from the device.
It does not carry out the conversion between "Raw" and "Normalized" values: this is done by the disk's firmware.
The conversion from Raw value to a quantity with physical units is not specified by the SMART standard.
In most cases, the values printed by smartctl are sensible.
For example the temperature Attribute generally has its raw value equal to the temperature in Celsius. However in some cases vendors use unusual conventions.
For example the Hitachi disk on my laptop reports its power-on hours in minutes, not hours.
Some IBM disks track three temperatures rather than one, in their raw values. And so on.

WORST : the lowest Normalized value recorded since SMART was enabled.

Each Attribute also has a "Worst" value shown under the heading "WORST". This is the smallest (closest to failure) value that the disk has recorded at any time during its lifetime when SMART was enabled. [Note however that some vendors firmware may actually increase the "Worst" value for some "rate-type" Attributes.]

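THRESH : the threshold. When the Normalized value is less than or equal to THRESH, the attribute is considered to have failed.

Each Attribute also has a Threshold value (whose range is 0 to 255) which is printed under the heading "THRESH". If the Normalized value is less than or equal to the Threshold value, then the Attribute is said to have failed. If the Attribute is a pre-failure Attribute, then disk failure is imminent.
Note what this says: if the attribute is a pre-failure attribute and its Normalized value <= THRESH, disk failure is imminent.
TYPE : there are two types, Pre-fail and Old_age. (In -f brief output there is no TYPE column; a pre-failure attribute is marked by the P flag.)
For a Pre-fail attribute, the Normalized value can warn you in advance that the disk is about to die. When it gets close to THRESH, replace the drive promptly.
For an Old_age attribute, the Normalized value reflects normal wear from use. When it gets close to THRESH you should still pay attention, but the situation is less serious than with a Pre-fail attribute.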

The Attribute table printed out by smartctl also shows the "TYPE" of the Attribute. Attributes are one of two possible types: Pre-failure or Old age. Pre-failure Attributes are ones which, if their Normalized value is less than or equal to their threshold value, indicate pending disk failure. Old age, or usage Attributes, are ones which indicate end-of-product life from old-age or normal aging and wearout, if the Attribute's Normalized value is less than or equal to the threshold. Please note: the fact that an Attribute is of type 'Pre-fail' does not mean that your disk is about to fail! It only has this meaning if the Attribute's current Normalized value is less than or equal to the threshold value.

UPDATED : shows when the attribute's value gets updated. Either it is updated during both normal operation and off-line testing (Always), or only during off-line testing (Offline). (In -f brief output this is the O flag, "updated online".)

The table column labeled "UPDATED" shows if the SMART Attribute values are updated during both normal operation and off-line testing, or only during offline testing. The former are labeled "Always" and the latter are labeled "Offline".

WHEN_FAILED : the current state of the attribute: FAILING_NOW (Normalized value <= THRESH), In_the_past (WORST <= THRESH), or "-" meaning OK (both the Normalized value and WORST are > THRESH). In -f brief output this is the FAIL column.

If the Attribute's current Normalized value is less than or equal to the threshold value, then the "WHEN_FAILED" column will display "FAILING_NOW". If not, but the worst recorded value is less than or equal to the threshold value, then this column will display "In_the_past". If the "WHEN_FAILED" column has no entry (indicated by a dash: '-') then this Attribute is OK now (not failing) and has also never failed in the past.
RAW_VALUE : the attribute's raw, unconverted value. It may be a counter, a temperature, or something else.
Note that the conversion from RAW_VALUE to the Normalized value is done by the vendor's firmware; smartmontools does not perform it.

So to summarize: the Raw Attribute values are the ones that might have a real physical interpretation, such as "Temperature Celsius", "Hours", or "Start-Stop Cycles".
Each manufacturer converts these, using their detailed knowledge of the disk's operations and failure modes, to Normalized Attribute values in the range 1-254.
The current and worst (lowest measured) of these Normalized Attribute values are stored on the disk, along with a Threshold value that the manufacturer has determined will indicate that the disk is going to fail, or that it has exceeded its design age or aging limit. smartctl does not calculate any of the Attribute values, thresholds, or types, it merely reports them from the SMART data on the device.
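
The column semantics above can be sketched in a few lines of Python. This is only an illustration, not part of smartmontools: the names parse_row, decode_flags and when_failed are made up here, and the regular expression assumes the -f brief layout shown in this post. Treating THRESH = 0 as "never fails" is an assumption inferred from the sample output, where rows with VALUE 000 and THRESH 000 still show "-" in the FAIL column.

```python
import re
from typing import List, NamedTuple, Optional

# Meaning of each of the 6 positions in the FLAGS column, left to right,
# taken from the legend printed by `smartctl -A -f brief`.
FLAG_NAMES = ["prefailure warning", "updated online", "speed/performance",
              "error rate", "event count", "auto-keep"]


class Attr(NamedTuple):
    id: int
    name: str
    flags: str
    value: int
    worst: int
    thresh: int
    raw: str


# ID, NAME, FLAGS, VALUE, WORST, THRESH, FAIL, RAW_VALUE (raw may contain spaces)
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+([A-Z-]{6})\s+(\d+)\s+(\d+)\s+(\d+)\s+(\S+)\s+(.*)$")


def parse_row(line: str) -> Optional[Attr]:
    """Parse one attribute row; return None for headers and legend lines."""
    m = ROW.match(line)
    if m is None:
        return None
    ident, name, flags, value, worst, thresh, _fail, raw = m.groups()
    return Attr(int(ident), name, flags, int(value), int(worst),
                int(thresh), raw.strip())


def decode_flags(flags: str) -> List[str]:
    """Return the legend text for every flag position that is set."""
    return [name for ch, name in zip(flags, FLAG_NAMES) if ch != "-"]


def when_failed(a: Attr) -> str:
    """Derive the WHEN_FAILED state from VALUE, WORST and THRESH."""
    if a.thresh == 0:          # assumption: a zero threshold never fails
        return "-"
    if a.value <= a.thresh:
        return "FAILING_NOW"
    if a.worst <= a.thresh:
        return "In_the_past"
    return "-"


row = parse_row("  1 Raw_Read_Error_Rate     POSR--   089   089   050    -    0/113484753")
print(row.name, decode_flags(row.flags), when_failed(row))
```

For the sample row the P flag is set, so attribute 1 is a pre-failure attribute on this drive, and with VALUE 89 against THRESH 50 its state is "-".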

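Other notes:
1. When an SSD is not in the smartmontools database, the attribute names displayed may be wrong, because the drive vendor may have redefined the attribute IDs.

Solid-state drives use different meanings for some of the attributes. In this case the attribute name printed by smartctl is incorrect unless the drive is already in the smartmontools drive database.
2. Note the K (auto-keep) flag. For an attribute without this flag the value is not kept on the drive, so WORST may be reset. For example, attribute ID=1 here read 89; on a later run it reads 91, and WORST also reads 91 rather than the historical minimum of 89. (This matches the man page note quoted above that some firmware may increase WORST for rate-type attributes.)
The workaround is to store the history of these values somewhere yourself.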
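
One possible way to keep that history is a small SQLite table, sketched below with Python's standard sqlite3 module. The table name smart_history and the helpers open_history, record and lowest_value are hypothetical; how the rows are collected from smartctl is left out. The two sample snapshots use the ID=1 readings from the outputs in this post.

```python
import sqlite3
import time


def open_history(path: str) -> sqlite3.Connection:
    """Open (or create) the history database. Use ':memory:' for a dry run."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS smart_history (
               ts      INTEGER NOT NULL,   -- sample time, seconds since epoch
               device  TEXT    NOT NULL,
               attr_id INTEGER NOT NULL,
               value   INTEGER NOT NULL,   -- Normalized value
               worst   INTEGER NOT NULL,
               raw     TEXT    NOT NULL
           )""")
    return conn


def record(conn, device, rows, ts=None):
    """Store one snapshot; rows is an iterable of (attr_id, value, worst, raw)."""
    ts = int(time.time()) if ts is None else ts
    conn.executemany(
        "INSERT INTO smart_history VALUES (?, ?, ?, ?, ?, ?)",
        [(ts, device, attr_id, value, worst, raw)
         for attr_id, value, worst, raw in rows])
    conn.commit()


def lowest_value(conn, device, attr_id):
    """Lowest Normalized value ever recorded by us, or None if no samples."""
    row = conn.execute(
        "SELECT MIN(value) FROM smart_history WHERE device = ? AND attr_id = ?",
        (device, attr_id)).fetchone()
    return row[0]


conn = open_history(":memory:")
record(conn, "/dev/sdd", [(1, 89, 89, "0/113484753")], ts=1)
record(conn, "/dev/sdd", [(1, 91, 91, "0/113746865")], ts=2)
print(lowest_value(conn, "/dev/sdd", 1))
```

With both snapshots stored, the recorded minimum for ID=1 stays at 89 even though the drive itself now reports WORST = 91.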

[root@db-172-16-3-150 ~]# /opt/smartmontools-6.0/sbin/smartctl -A -f brief /dev/sdd
smartctl 6.0 2012-10-10 r3643 [x86_64-linux-2.6.18-274.el5] (local build)
Copyright (C) 2002-12, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     POSR--   091   091   050    -    0/113746865
  5 Retired_Block_Count     PO--CK   100   100   003    -    0
  9 Power_On_Hours_and_Msec -O--CK   100   100   000    -    170h+43m+38.270s
 12 Power_Cycle_Count       -O--CK   100   100   000    -    8
171 Program_Fail_Count      -O--CK   000   000   000    -    0
172 Erase_Fail_Count        -O--CK   000   000   000    -    0
174 Unexpect_Power_Loss_Ct  ----CK   000   000   000    -    7
177 Wear_Range_Delta        ------   000   000   000    -    2
181 Program_Fail_Count      -O--CK   000   000   000    -    0
182 Erase_Fail_Count        -O--CK   000   000   000    -    0
187 Reported_Uncorrect      -O--CK   100   100   000    -    0
194 Temperature_Celsius     -O---K   030   030   000    -    30 (Min/Max 30/30)
195 ECC_Uncorr_Error_Count  --SRC-   120   120   000    -    0/113746865
196 Reallocated_Event_Count PO--CK   100   100   003    -    0
201 Unc_Soft_Read_Err_Rate  --SRC-   120   120   000    -    0/113746865
204 Soft_ECC_Correct_Rate   --SRC-   120   120   000    -    0/113746865
230 Life_Curve_Status       PO--C-   100   100   000    -    100
231 SSD_Life_Left           PO--C-   100   100   010    -    0
233 SandForce_Internal      ------   000   000   000    -    0
234 SandForce_Internal      ------   000   000   000    -    0
241 Lifetime_Writes_GiB     -O--CK   000   000   000    -    45175
242 Lifetime_Reads_GiB      -O--CK   000   000   000    -    363
                            ||||||_ K auto-keep
                            |||||__ C event count
                            ||||___ R error rate
                            |||____ S speed/performance
                            ||_____ O updated online
                            |______ P prefailure warning

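So what should disk monitoring focus on?
In decreasing order of severity:
1. Most severe: WHEN_FAILED = FAILING_NOW and TYPE = Pre-fail. The attribute is failing right now, and disk failure is imminent.
2. Next most severe: WHEN_FAILED = In_the_past and TYPE = Pre-fail. The attribute failed at some point in the past but is currently normal.
3. WHEN_FAILED = FAILING_NOW and TYPE = Old_age. The attribute is failing right now, but the disk itself may not have failed yet.
4. WHEN_FAILED = In_the_past and TYPE = Old_age. The attribute failed at some point in the past but is currently normal.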
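
The four-level ranking above maps directly onto a lookup table. The sketch below is illustrative only: the function name severity and the convention of returning 0 for a healthy attribute are assumptions, and the attribute type is read from the first character of the brief-format FLAGS column (P = pre-failure).

```python
# (is_prefail, when_failed) -> severity rank, 1 being the most severe.
RANK = {
    (True, "failing_now"): 1,
    (True, "in_the_past"): 2,
    (False, "failing_now"): 3,
    (False, "in_the_past"): 4,
}


def severity(flags: str, when_failed: str) -> int:
    """Return 1..4 following the list above, or 0 if the attribute never failed.

    flags       -- 6-character FLAGS column from `smartctl -A -f brief`
    when_failed -- "FAILING_NOW", "In_the_past" or "-"
    """
    status = when_failed.strip().lower()
    if status == "-":
        return 0
    # An unrecognised status raises KeyError rather than being ignored.
    return RANK[(flags.startswith("P"), status)]


print(severity("PO--CK", "FAILING_NOW"))   # pre-fail attribute failing now
print(severity("-O--CK", "In_the_past"))   # old-age attribute, failed earlier
```

A monitoring script could alert on the smallest non-zero rank across all attributes of a drive.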
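To avoid ending up in any of these four situations:
1. For attributes with UPDATED = Offline, have smartd run tests on a schedule (smartd can also send email), or schedule the tests with crontab.
2. Keep watching whether the Normalized value and WORST are approaching THRESH. When a value gets close to THRESH, replace the drive in advance.
3. Temperature: some drives, PCI-E SSDs for example, are sensitive to temperature and may die if they run too hot. Here reading RAW_VALUE is the more reliable approach.
Monitoring options:
1. Write a nagios plugin and monitor through nagios.
2. Use smartd, which can send alert emails.
3. Collect SMART data periodically and analyze the trend of each attribute's Normalized value to estimate the drive's remaining life more precisely.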
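
As a rough sketch of the trend analysis in option 3, the function below fits a straight line through (time, Normalized value) samples by least squares and extrapolates to the point where the line reaches THRESH. A linear decline is a simplifying assumption, not something the vendor guarantees, and the sample data in the example is hypothetical. The name estimate_threshold_time is made up for illustration.

```python
from typing import List, Optional, Tuple


def estimate_threshold_time(samples: List[Tuple[float, float]],
                            thresh: float) -> Optional[float]:
    """Extrapolate when the Normalized value will reach thresh.

    samples -- (time, normalized_value) pairs, time in any consistent unit
    Returns the time (same unit) at which the fitted line crosses thresh,
    or None when there are too few samples or the value is not declining.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    sxx = sum((t - mean_t) ** 2 for t, _ in samples)
    if sxx == 0:                      # all samples taken at the same time
        return None
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / sxx
    if slope >= 0:                    # flat or improving: no crossing ahead
        return None
    intercept = mean_v - slope * mean_t
    return (thresh - intercept) / slope


# Hypothetical readings: power-on hours vs. Normalized value.
history = [(0, 100), (100, 98), (200, 96)]
print(estimate_threshold_time(history, thresh=10))
```

Because Normalized values move in whole steps and firmware may convert raw data non-linearly, the result is best read as an order-of-magnitude hint and re-evaluated as more samples arrive.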
Output from another SSD:

[root@db-172-16-3-150 postgresql-9.2.1]# /opt/smartmontools-6.0/sbin/smartctl -A -f brief -d sat+megaraid,2 /dev/sdb
smartctl 6.0 2012-10-10 r3643 [x86_64-linux-2.6.18-274.el5] (local build)
Copyright (C) 2002-12, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     -O-R-K   001   001   000    -    16936932
  5 Reallocated_Sector_Ct   PO--CK   100   100   001    -    0
  9 Power_On_Hours          -O--CK   094   094   000    -    2423
 12 Power_Cycle_Count       -O--CK   099   099   000    -    61
 13 Read_Soft_Error_Rate    -O-R-K   100   100   000    -    0
175 Program_Fail_Count_Chip -O--CK   100   100   000    -    0
176 Erase_Fail_Count_Chip   -O--CK   100   100   000    -    0
177 Wear_Leveling_Count     -O--CK   099   099   000    -    3
178 Used_Rsvd_Blk_Cnt_Chip  PO--CK   074   074   001    -    43
179 Used_Rsvd_Blk_Cnt_Tot   PO--CK   091   091   001    -    895
180 Unused_Rsvd_Blk_Cnt_Tot -O--CK   091   091   000    -    9734
181 Program_Fail_Cnt_Total  -O--CK   100   100   000    -    0
182 Erase_Fail_Count_Total  -O--CK   100   100   000    -    0
183 Runtime_Bad_Block       PO--CK   100   100   001    -    0
194 Temperature_Celsius     -O---K   076   071   000    -    24 (Min/Max 22/28)
195 Hardware_ECC_Recovered  -O--CK   100   100   000    -    0
198 Offline_Uncorrectable   ----CK   100   100   000    -    0
199 UDMA_CRC_Error_Count    -O--CK   100   100   000    -    0
202 Unknown_SSD_Attribute   PO--CK   100   100   090    -    0
232 Available_Reservd_Space -O--CK   074   074   000    -    123
233 Media_Wearout_Indicator -O--CK   054   054   000    -    1953821162
240 Unknown_SSD_Attribute   -O--CK   100   100   000    -    0
                            ||||||_ K auto-keep
                            |||||__ C event count
                            ||||___ R error rate
                            |||____ S speed/performance
                            ||_____ O updated online
                            |______ P prefailure warning
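A brief analysis of the output above:
1. Raw_Read_Error_Rate (VALUE 001, THRESH 000) is already very close to THRESH, but it is only an Old_age (normal wear) attribute, so it does not necessarily mean the drive will fail.
2. Watch the ID=202 row: THRESH is 90 and the current Normalized value is 100, which is also fairly close. This attribute is Pre-fail, so its changes deserve even closer attention.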
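
The same analysis can be automated by computing the headroom (VALUE minus THRESH) for each attribute and listing pre-failure attributes first. The sketch below uses a handful of rows copied from the output above; the function name headroom_report and the tuple layout are illustrative assumptions.

```python
# (ID, name, FLAGS, VALUE, THRESH) copied from the smartctl output above.
rows = [
    (1,   "Raw_Read_Error_Rate",    "-O-R-K", 1,   0),
    (178, "Used_Rsvd_Blk_Cnt_Chip", "PO--CK", 74,  1),
    (179, "Used_Rsvd_Blk_Cnt_Tot",  "PO--CK", 91,  1),
    (183, "Runtime_Bad_Block",      "PO--CK", 100, 1),
    (202, "Unknown_SSD_Attribute",  "PO--CK", 100, 90),
]


def headroom_report(rows):
    """Return (headroom, type, id, name), pre-fail first, tightest first."""
    report = []
    for attr_id, name, flags, value, thresh in rows:
        kind = "Pre-fail" if flags[0] == "P" else "Old_age"
        report.append((value - thresh, kind, attr_id, name))
    # Sort key: pre-fail attributes before old-age ones, then by headroom.
    report.sort(key=lambda r: (r[1] != "Pre-fail", r[0]))
    return report


for headroom, kind, attr_id, name in headroom_report(rows):
    print(f"{attr_id:3d} {name:24s} {kind:8s} headroom={headroom}")
```

ID 202 comes out on top with a headroom of 10, while Raw_Read_Error_Rate, despite a headroom of only 1, is listed last because it is an Old_age attribute.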

[References]

1. http://en.wikipedia.org/wiki/Self-Monitoring%2C_Analysis%2C_and_Reporting_Technology
2. http://www.ocztechnologyforum.com/forum/showthread.php?75786-SMART-Attributes-for-Sandforce-SSD-s-%28Agility2-Vertex2-VertexLE%29
3. http://blog.163.com/digoal@126/blog/static/16387704020121028103934749/

[Appendix]

1. (SMART Attribute definitions: OCZ Vertex2, Agility2, etc.):

Addendum:

After about 245 TB written (Lifetime_Writes_GiB = 245127),
the VALUE of SSD_Life_Left had dropped to 96.

[root@db-172-16-3-150 ~]# /opt/smartmontools-6.0/sbin/smartctl -A -f brief /dev/sdd
smartctl 6.0 2012-10-10 r3643 [x86_64-linux-2.6.18-274.el5] (local build)
Copyright (C) 2002-12, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     POSR--   089   089   050    -    0/150282048
  5 Retired_Block_Count     PO--CK   100   100   003    -    0
  9 Power_On_Hours_and_Msec -O--CK   100   100   000    -    516h+42m+05.505s
 12 Power_Cycle_Count       -O--CK   100   100   000    -    10
171 Program_Fail_Count      -O--CK   000   000   000    -    0
172 Erase_Fail_Count        -O--CK   000   000   000    -    0
174 Unexpect_Power_Loss_Ct  ----CK   000   000   000    -    9
177 Wear_Range_Delta        ------   000   000   000    -    9
181 Program_Fail_Count      -O--CK   000   000   000    -    0
182 Erase_Fail_Count        -O--CK   000   000   000    -    0
187 Reported_Uncorrect      -O--CK   100   100   000    -    0
194 Temperature_Celsius     -O---K   030   030   000    -    30 (Min/Max 30/30)
195 ECC_Uncorr_Error_Count  --SRC-   120   120   000    -    0/150282048
196 Reallocated_Event_Count PO--CK   100   100   003    -    0
201 Unc_Soft_Read_Err_Rate  --SRC-   120   120   000    -    0/150282048
204 Soft_ECC_Correct_Rate   --SRC-   120   120   000    -    0/150282048
230 Life_Curve_Status       PO--C-   100   100   000    -    100
231 SSD_Life_Left           PO--C-   096   096   010    -    0
233 SandForce_Internal      ------   000   000   000    -    0
234 SandForce_Internal      ------   000   000   000    -    0
241 Lifetime_Writes_GiB     -O--CK   000   000   000    -    245127
242 Lifetime_Reads_GiB      -O--CK   000   000   000    -    6946
                            ||||||_ K auto-keep
                            |||||__ C event count
                            ||||___ R error rate
                            |||____ S speed/performance
                            ||_____ O updated online
                            |______ P prefailure warning
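
For a very rough feel of what these numbers imply, the arithmetic below extrapolates linearly from the two figures in this output: SSD_Life_Left at 96 (THRESH 10) after 245127 GiB written. It assumes the value started at 100 and falls in proportion to host writes, which the vendor does not promise; the function name is made up for illustration.

```python
def linear_writes_until_threshold(writes_gib, start_value, current_value, thresh):
    """Estimate further GiB of writes before current_value reaches thresh.

    Assumes wear is proportional to host writes (a simplification).
    Returns None when no points have been consumed yet.
    """
    used_points = start_value - current_value
    if used_points <= 0:
        return None
    gib_per_point = writes_gib / used_points
    return gib_per_point * (current_value - thresh)


# Figures from the output above: 245127 GiB written, VALUE 100 -> 96, THRESH 10.
print(linear_writes_until_threshold(245127, 100, 96, 10))   # 5270230.5
```

That is 61281.75 GiB per point times the 86 points left above the threshold. Since the value only moves in whole points, an estimate based on four points carries a large margin of error.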
