explain SMART Attributes

前一篇BLOG讲了一下smartmontools 工具的使用.

http://blog.163.com/digoal@126/blog/static/16387704020121028103934749/

smartmontools的数据库在不断的更新, 如果你的硬盘还没有加入到smartmontools 的 drivedb.h数据库中, 那就再等等吧.

已经加入到这里的, 也不要欣喜若狂, 读取到的SMART Attributes你都读懂了么?

SMART Attributes的定义是硬盘厂商提供的, 没有一个统一的标准, 所以比较痛苦.

下面以OCZ PCI-E RevoDrive3为例, 来看看SMART Attribute的意思 :

[root@db-172-16-3-150 postgresql-9.2.1]# /opt/smartmontools-6.0/sbin/smartctl -A -f brief /dev/sdd
smartctl 6.0 2012-10-10 r3643 [x86_64-linux-2.6.18-274.el5] (local build)
Copyright (C) 2002-12, Bruce Allen, Christian Franke, www.smartmontools.org=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE1 Raw_Read_Error_Rate     POSR--   089   089   050    -    0/1134847535 Retired_Block_Count     PO--CK   100   100   003    -    09 Power_On_Hours_and_Msec -O--CK   100   100   000    -    170h+08m+29.325s12 Power_Cycle_Count       -O--CK   100   100   000    -    8
171 Program_Fail_Count      -O--CK   000   000   000    -    0
172 Erase_Fail_Count        -O--CK   000   000   000    -    0
174 Unexpect_Power_Loss_Ct  ----CK   000   000   000    -    7
177 Wear_Range_Delta        ------   000   000   000    -    2
181 Program_Fail_Count      -O--CK   000   000   000    -    0
182 Erase_Fail_Count        -O--CK   000   000   000    -    0
187 Reported_Uncorrect      -O--CK   100   100   000    -    0
194 Temperature_Celsius     -O---K   030   030   000    -    30 (Min/Max 30/30)
195 ECC_Uncorr_Error_Count  --SRC-   120   120   000    -    0/113484753
196 Reallocated_Event_Count PO--CK   100   100   003    -    0
201 Unc_Soft_Read_Err_Rate  --SRC-   120   120   000    -    0/113484753
204 Soft_ECC_Correct_Rate   --SRC-   120   120   000    -    0/113484753
230 Life_Curve_Status       PO--C-   100   100   000    -    100
231 SSD_Life_Left           PO--C-   100   100   010    -    0
233 SandForce_Internal      ------   000   000   000    -    0
234 SandForce_Internal      ------   000   000   000    -    0
241 Lifetime_Writes_GiB     -O--CK   000   000   000    -    44814
242 Lifetime_Reads_GiB      -O--CK   000   000   000    -    363||||||_ K auto-keep|||||__ C event count||||___ R error rate|||____ S speed/performance||_____ O updated online|______ P prefailure warning

解释一下每个属性的意思 :

请参看smartctl的帮助文件 -A 部分 :

ID# : 属性ID, 从1到255.

ATTRIBUTE_NAME : 属性名.

FLAG : 表示这个属性携带的标记. 使用-f brief可以打印.

                            ||||||_ K auto-keep|||||__ C event count||||___ R error rate|||____ S speed/performance||_____ O updated online|______ P prefailure warning

VALUE : Normalized value, 取值范围1到254. 越低表示越差. 越高表示越好. (with 1 representing the worst case and 254 representing the best)

注意wiki上说的是1到253. 这个值是硬盘厂商根据RAW_VALUE转换来的, smartmontools工具不负责转换工作.

Each  Attribute  has  a  "Raw"  value,  printed  under the heading "RAW_VALUE", and a "Normalized" value printed under the heading "VALUE".
[Note: smartctl prints these values in  base-10.]   In  the  example just  given, the "Raw Value" for Attribute 12 would be the actual number of times that the disk has been power-cycled, for example 365 if the disk has been turned on once per day for exactly  one  year.
Each vendor  uses their own algorithm to convert this "Raw" value to a "Normalized" value in the range from 1 to 254.
Please keep in mind that smartctl only reports  the  different  Attribute  types,  values,  andthresholds as read from the device.
It does not carry out the conversion between "Raw" and "Normalized values: this is done by the disk's firmware.
The conversion from Raw value to a quantity with physical units is not specified by the SMART  standard.
In  most cases, the values printed by smartctl are sensible.
For example the temperature Attribute generally has its raw value equal to the temperature in Celsius.  However in some cases vendors use unusual conventions.
For  example  the  Hitachi  disk  on my laptop reports its power-on hours in minutes, not hours.
Some IBM disks track three temperatures rather than one, in their raw values.  And so on.

WORST : 表示SMART开启以来的, 所有Normalized values的最低值. (which represents the lowest recorded normalized value.)

                Each Attribute also has a "Worst" value shown under the heading "WORST".  This is the smallest  (closest to  failure)  value  that  the disk has recorded at any time during its lifetime when SMART was enabled.[Note however that some vendors firmware may actually increase the "Worst" value  for  some  "rate-type" Attributes.]

THRESH : 阈值, 当Normalized value小于等于THRESH值时, 表示这项指标已经failed了.

                 Each Attribute also has a Threshold value (whose range is 0 to 255) which is printed under  the  heading "THRESH".   If  the Normalized value is less than or equal to the Threshold value, then the Attribute is said to have failed.  If the Attribute is a pre-failure Attribute, then disk failure is imminent.

注意这里提到, 如果这个属性是pre-failure的, 那么这项如果出现Normalized value<=THRESH, 那么磁盘将马上failed掉.

TYPE : 这里存在两种TYPE类型, Pre-failed和Old_age.

Pre-failed 类型的Normalized value可以用来预先知道磁盘是否要坏了. 例如Normalized value接近THRESH时, 就赶紧换硬盘吧.

Old_age 类型的Normalized value是指正常的使用损耗值, 当Normalized value 接近THRESH时, 也需要注意, 但是比Pre-failed要好一点.

            The  Attribute  table printed out by smartctl also shows the "TYPE" of the Attribute. Attributes are one of two possible types: Pre-failure or Old age.  Pre-failure Attributes are ones which, if Normalized value less  than  or equal  to their threshold values, indicate pending disk failure.Old age, or usage Attributes, are ones which indicate end-of-product life from old-age or normal aging and wearout, if the Attribute's               Normalized value is less than or equal to the threshold.Please note: the fact that an Attribute is of type ’Pre-fail’ does not mean that your disk is about to fail!  It only has this meaning if the Attribute′s  current  Normalized value is less than or equal to the threshold value.

UPDATED : 这个字段表示这个属性的值在什么情况下会被更新. 一种是通常的操作和离线测试都更新(Always), 另一种是只在离线测试的情况下更新(Offline).

                 The  table  column  labeled "UPDATED" shows if the SMART Attribute values are updated during both normal operation and off-line testing, or only during offline testing.  The former are labeled "Always" and the latter are labeled "Offline".

WHEN_FAILED : 这个字段表示当前这个属性的状态 : failing_now(normalized_value <= THRESH), 或者in_the_past(WORST <= THRESH), 或者 - , 正常(normalized_value以及wrost >= THRESH).

            If  the  Attribute's  current  Normalized  value  is less than or equal to the threshold value, then the "WHEN_FAILED" column will display "FAILING_NOW". If not, but the worst recorded value is  less  than  or equal  to the threshold value, then this column will display "In_the_past".  If the "WHEN_FAILED" column has no entry (indicated by a dash: '-') then this Attribute is OK now (not failing) and has  also  never failed in the past.

RAW_VALUE : 表示这个属性的未转换前的RAW值, 可能是计数, 也可能是温度, 也可能是其他的.

注意RAW_VALUE转换成Normalized value是由厂商的firmware提供的, smartmontools不提供转换.

So  to  summarize: the Raw Attribute values are the ones that might have a real physical interpretation,
such as "Temperature Celsius", "Hours", or "Start-Stop Cycles".
Each manufacturer converts these, using their  detailed  knowledge of the disk's operations and failure modes, to Normalized Attribute values in the range 1-254.
The current and worst (lowest measured)  of  these  Normalized  Attribute  values  are stored on the disk, along with a Threshold value that the manufacturer has determined will indicate that the disk is going to fail, or that it has exceeded its design age or aging  limit. smartctl  does  not calculate  any of the Attribute values, thresholds, or types, it merely reports them from the SMART data on the device.

其他注意事项 :

1. 当SSD磁盘不在smartmontools 数据库中时, 显示的attribute name可能不准. 因为硬盘厂商可能更改了attribute id的定义.

Solid-state drives use different meanings for some of the attributes.  In this case the  attribute  name printed by smartctl is incorrect unless the drive is already in the smartmontools drive database.

2. 注意有个FLAG是KEEP, 如果不带这个FLAG的属性, 值将不会KEEP在磁盘中, 可能出现WORST值被刷新的情况, 例如这里的ID=1的值, 已经89了, 重新执行又变成91了, 但是WORST的值并不是历史以来的最低89.

遇到这种情况的解决办法是找个地方存储这些值的历史值.

[root@db-172-16-3-150 ~]# /opt/smartmontools-6.0/sbin/smartctl -A -f brief /dev/sdd
smartctl 6.0 2012-10-10 r3643 [x86_64-linux-2.6.18-274.el5] (local build)
Copyright (C) 2002-12, Bruce Allen, Christian Franke, www.smartmontools.org=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE1 Raw_Read_Error_Rate     POSR--   091   091   050    -    0/1137468655 Retired_Block_Count     PO--CK   100   100   003    -    09 Power_On_Hours_and_Msec -O--CK   100   100   000    -    170h+43m+38.270s12 Power_Cycle_Count       -O--CK   100   100   000    -    8
171 Program_Fail_Count      -O--CK   000   000   000    -    0
172 Erase_Fail_Count        -O--CK   000   000   000    -    0
174 Unexpect_Power_Loss_Ct  ----CK   000   000   000    -    7
177 Wear_Range_Delta        ------   000   000   000    -    2
181 Program_Fail_Count      -O--CK   000   000   000    -    0
182 Erase_Fail_Count        -O--CK   000   000   000    -    0
187 Reported_Uncorrect      -O--CK   100   100   000    -    0
194 Temperature_Celsius     -O---K   030   030   000    -    30 (Min/Max 30/30)
195 ECC_Uncorr_Error_Count  --SRC-   120   120   000    -    0/113746865
196 Reallocated_Event_Count PO--CK   100   100   003    -    0
201 Unc_Soft_Read_Err_Rate  --SRC-   120   120   000    -    0/113746865
204 Soft_ECC_Correct_Rate   --SRC-   120   120   000    -    0/113746865
230 Life_Curve_Status       PO--C-   100   100   000    -    100
231 SSD_Life_Left           PO--C-   100   100   010    -    0
233 SandForce_Internal      ------   000   000   000    -    0
234 SandForce_Internal      ------   000   000   000    -    0
241 Lifetime_Writes_GiB     -O--CK   000   000   000    -    45175
242 Lifetime_Reads_GiB      -O--CK   000   000   000    -    363||||||_ K auto-keep|||||__ C event count||||___ R error rate|||____ S speed/performance||_____ O updated online|______ P prefailure warning

因此监控磁盘的重点在哪里呢?

严重情况从上到下 :

1. 最严重的情况WHEN_FAILED = FAILING_NOW 并且 TYPE=Pre-failed, 表示现在这个属性已经出问题了. 并且硬盘也已经failed了.

2. 次严重的情况WHEN_FAILED = in_the_past 并且 TYPE=Pre-failed, 表示这个属性曾经出问题了. 但是现在是正常的.

3. WHEN_FAILED = FAILING_NOW 并且 TYPE=Old_age, 表示现在这个属性已经出问题了. 但是硬盘可能还没有failed.

4. WHEN_FAILED = in_the_past 并且 TYPE=Old_age, 表示现在这个属性曾经出问题了. 但是现在是正常的.

为了避免这4种情况的发生.

1. 对于UPDATE=Offline的属性, 应该让smartd定期进行测试(smartd还可以发邮件). 或者crontab进行测试.

2. 应该时刻关注磁盘的Normalized value以及WORST的值是否接近THRESH的值了. 当有值要接近THRESH了, 提前更换硬盘.

3. 温度, 有些磁盘对温度比较敏感, 例如PCI-E SSD硬盘. 如果温度过高可能就挂了. 这里读取RAW_VALUE就比较可靠了.

监控手段,

1. 可以写nagios插件, 通过nagios来监控.

2. smartd, 它可以发邮件告警.

3. 定期收集smart数据, 分析每个属性的Normalized value的变化趋势, 更加精确的推算出磁盘的剩余使用寿命.

另一块SSD硬盘的输出 :

[root@db-172-16-3-150 postgresql-9.2.1]# /opt/smartmontools-6.0/sbin/smartctl -A -f brief -d sat+megaraid,2 /dev/sdb
smartctl 6.0 2012-10-10 r3643 [x86_64-linux-2.6.18-274.el5] (local build)
Copyright (C) 2002-12, Bruce Allen, Christian Franke, www.smartmontools.org=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE1 Raw_Read_Error_Rate     -O-R-K   001   001   000    -    169369325 Reallocated_Sector_Ct   PO--CK   100   100   001    -    09 Power_On_Hours          -O--CK   094   094   000    -    242312 Power_Cycle_Count       -O--CK   099   099   000    -    613 Read_Soft_Error_Rate    -O-R-K   100   100   000    -    0
175 Program_Fail_Count_Chip -O--CK   100   100   000    -    0
176 Erase_Fail_Count_Chip   -O--CK   100   100   000    -    0
177 Wear_Leveling_Count     -O--CK   099   099   000    -    3
178 Used_Rsvd_Blk_Cnt_Chip  PO--CK   074   074   001    -    43
179 Used_Rsvd_Blk_Cnt_Tot   PO--CK   091   091   001    -    895
180 Unused_Rsvd_Blk_Cnt_Tot -O--CK   091   091   000    -    9734
181 Program_Fail_Cnt_Total  -O--CK   100   100   000    -    0
182 Erase_Fail_Count_Total  -O--CK   100   100   000    -    0
183 Runtime_Bad_Block       PO--CK   100   100   001    -    0
194 Temperature_Celsius     -O---K   076   071   000    -    24 (Min/Max 22/28)
195 Hardware_ECC_Recovered  -O--CK   100   100   000    -    0
198 Offline_Uncorrectable   ----CK   100   100   000    -    0
199 UDMA_CRC_Error_Count    -O--CK   100   100   000    -    0
202 Unknown_SSD_Attribute   PO--CK   100   100   090    -    0
232 Available_Reservd_Space -O--CK   074   074   000    -    123
233 Media_Wearout_Indicator -O--CK   054   054   000    -    1953821162
240 Unknown_SSD_Attribute   -O--CK   100   100   000    -    0||||||_ K auto-keep|||||__ C event count||||___ R error rate|||____ S speed/performance||_____ O updated online|______ P prefailure warning

简单的分析一下上面这个输出,

1. Raw_Read_Error_Rate 这个值已经非常接近THRESH了, 但是它只是一个正常损耗值, 不一定会带来硬盘的failed.

2. 需要注意ID=202的这条, THRESH=90, 当前的Normalized value=100, 也有点接近了, 并且这个属性是Pre-failed的, 所以更加要注意它的变化.

【参考】

1. http://en.wikipedia.org/wiki/Self-Monitoring%2C_Analysis%2C_and_Reporting_Technology

2. http://www.ocztechnologyforum.com/forum/showthread.php?75786-SMART-Attributes-for-Sandforce-SSD-s-%28Agility2-Vertex2-VertexLE%29

3. http://blog.163.com/digoal@126/blog/static/16387704020121028103934749/

【附】

1. (SMART Attributes define : OCZ Vertex2, Agility2, etc) :

补充 :

245TB 写入后

SSD_Life_Left 的VALUE 降低到了96 .

[root@db-172-16-3-150 ~]# /opt/smartmontools-6.0/sbin/smartctl -A -f brief /dev/sdd
smartctl 6.0 2012-10-10 r3643 [x86_64-linux-2.6.18-274.el5] (local build)
Copyright (C) 2002-12, Bruce Allen, Christian Franke, www.smartmontools.org=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE1 Raw_Read_Error_Rate     POSR--   089   089   050    -    0/1502820485 Retired_Block_Count     PO--CK   100   100   003    -    09 Power_On_Hours_and_Msec -O--CK   100   100   000    -    516h+42m+05.505s12 Power_Cycle_Count       -O--CK   100   100   000    -    10
171 Program_Fail_Count      -O--CK   000   000   000    -    0
172 Erase_Fail_Count        -O--CK   000   000   000    -    0
174 Unexpect_Power_Loss_Ct  ----CK   000   000   000    -    9
177 Wear_Range_Delta        ------   000   000   000    -    9
181 Program_Fail_Count      -O--CK   000   000   000    -    0
182 Erase_Fail_Count        -O--CK   000   000   000    -    0
187 Reported_Uncorrect      -O--CK   100   100   000    -    0
194 Temperature_Celsius     -O---K   030   030   000    -    30 (Min/Max 30/30)
195 ECC_Uncorr_Error_Count  --SRC-   120   120   000    -    0/150282048
196 Reallocated_Event_Count PO--CK   100   100   003    -    0
201 Unc_Soft_Read_Err_Rate  --SRC-   120   120   000    -    0/150282048
204 Soft_ECC_Correct_Rate   --SRC-   120   120   000    -    0/150282048
230 Life_Curve_Status       PO--C-   100   100   000    -    100
231 SSD_Life_Left           PO--C-   096   096   010    -    0
233 SandForce_Internal      ------   000   000   000    -    0
234 SandForce_Internal      ------   000   000   000    -    0
241 Lifetime_Writes_GiB     -O--CK   000   000   000    -    245127
242 Lifetime_Reads_GiB      -O--CK   000   000   000    -    6946||||||_ K auto-keep|||||__ C event count||||___ R error rate|||____ S speed/performance||_____ O updated online|______ P prefailure warning

explain SMART Attributes相关推荐

Linux中硬盘smart故障,硬盘驱动器 – 此SMART自检是否表示驱动器出现故障？
我想知道这个SMART自检的结果是否表明驱动器出现故障,这是唯一一个在结果中出现"完成:读取失败"的驱动器. # smartctl -l selftest /dev/sde sma ...
linux硬盘温度,linux查看硬盘温度跟使用情况
硬盘用在服务器上好几年了,加上用的时候还是一个用了好几年的旧硬盘,担心它会不会突然挂掉. 用百度搜索linux下查看硬盘温度和使用时间的,发现如下工具: 2007.2008年我配过两台台式机,都用的三 ...
硬盘运行微型linux,linux用smartctl看硬盘运行了多少小时
linux用smartctl看硬盘运行了多少小时 2013-02-03 1.什么是S.M.A.R.T. SMART是一种磁盘自我分析检测技术,早在90年代末就基本得到了普及每一块硬盘(包括IDE.SC ...
怎么查硬盘序列号_担心硬盘体质？不妨先给硬盘做一次体检
这个,技术这东西真不好说,毕竟技术无对错,任何技术的确都有风险和需要我们付出代价,与其瞎担心,不妨花点时间给硬盘做一次体检. 01 争议中的SMR 对于争议比较大的技术话题,小狮子一般不太愿意参与其中 ...
es-04-mapping和setting的建立
mapping和setting, 使用java客户端比较难组装, 可以使用python或者scala 这儿直接在kibana中进行DSL创建 1, mapping 创建索引的时候, 可以事先对数据进行 ...
linux 看硬盘运行时间长,Ubuntu 14.04查看硬盘使用时间
1.需要安装smartmontools Ubuntu14.04下安装方法: sudo apt-get install smartmontools 2.使用方法: 在终端输入:sudo smartctl ...
硬盘检测工具Smartmontools安装、部署、使用
在服务器管理的实际环境中,硬盘是最容易出现问题及发生故障的硬件,而且硬盘中存储着大量重要的数据,万一出现故障所造成的损失也是无法估计的,轻则需要化费大量的时间与精力去做数据恢复,重则硬盘报废,里面重要 ...
机械硬盘运行 linux 很慢,如果读写硬盘操作有问题，假死机、很慢等，就检查一下硬盘坏道...
本帖最后由 tonyliu2ca 于 16-3-28 11:12 编辑这个问题要是写出来有时一个大块文章,咱们这里简单说说. 现代的硬磁盘技术已经相当成熟和智能,成熟并不是说就不是好就是坏,磁盘的状 ...
检测磁盘是否有问题的方法
在windows系统下检测磁盘是否有问题的方法有可以安装一些检测的工具来测试硬盘是否是坏道 a)HD Tune 软件可以检测硬盘是否有坏道使用很简单的,网上下载好之后直接安装在系统上之后,打开安 ...

explain SMART Attributes

explain SMART Attributes相关推荐

最新文章

热门文章