Google has published an analysis report on HDD failures: Failure Trends in a Large Disk Drive Population

More than one hundred thousand disk drives were used for all the results presented here. The disks are a combination of serial and parallel ATA consumer-grade hard disk drives, ranging in speed from 5400 to 7200 rpm, and in size from 80 to 400 GB.



First, only very young and very old age groups appear to show the expected behavior. After the first year, the AFR of high utilization drives is at most moderately higher than that of low utilization drives. The three-year group in fact appears to have the opposite of the expected behavior, with low utilization drives having slightly higher failure rates than high utilization ones.



The figure shows that failures do not increase when the average temperature increases.


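The third point is an analysis of the relationship between SMART information and failure rates. The paper looks at several critical parameters: scan errors, reallocation counts, offline reallocations, and probational counts. For each of these, the failure rate rises considerably as soon as the count goes up by even one.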

We find that the group of drives with scan errors are ten times more likely to fail than the group with no errors.

the critical threshold for scan errors is one. After the first scan error, drives are 39 times more likely to fail within 60 days than drives without scan error.

After their first reallocation, drives are over 14 times more likely to fail within 60 days than drives without reallocation counts, making the critical threshold for this parameter also one.

After the first offline reallocation, drives have over 21 times higher chances of failure within 60 days than drives without offline reallocations; an effect that is again more drastic than total reallocations.

The critical threshold for probational counts is also one: after the first event, drives are 16 times more likely to fail within 60 days than drives with zero probational counts.
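
The quotes above put the critical threshold for all four parameters at one. A minimal sketch in Python of what a flag based on that rule could look like. The `SmartCounts` field names and the `is_at_risk` helper are hypothetical, made up for illustration; they do not come from the paper or from any SMART tool.

```python
from dataclasses import dataclass


@dataclass
class SmartCounts:
    # Hypothetical field names for the four signals discussed in the paper.
    scan_errors: int = 0
    reallocations: int = 0
    offline_reallocations: int = 0
    probational_counts: int = 0


def is_at_risk(counts: SmartCounts, threshold: int = 1) -> bool:
    """Return True if any of the four counts has reached the threshold."""
    return any(
        value >= threshold
        for value in (
            counts.scan_errors,
            counts.reallocations,
            counts.offline_reallocations,
            counts.probational_counts,
        )
    )


print(is_at_risk(SmartCounts()))                 # False
print(is_at_risk(SmartCounts(reallocations=1)))  # True
```

This only marks drives in the higher-risk groups described above; it does not predict that a particular drive will fail.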


Out of all failed drives, over 56% of them have no count in any of the four strong SMART signals, namely scan errors, reallocation count, offline reallocation, and probational count. In other words, models based only on those signals can never predict more than half of the failed drives.
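
The "never more than half" conclusion is simple arithmetic, sketched below in Python. The `max_recall` helper is a hypothetical name used only for illustration, and 0.56 is taken as a lower bound because the paper says "over 56%".

```python
def max_recall(fraction_without_signal: float) -> float:
    """Upper bound on the share of failures a signal-only model can catch."""
    if not 0.0 <= fraction_without_signal <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    return 1.0 - fraction_without_signal


# "Over 56%" of failed drives show none of the four signals, so a model
# that looks only at those signals can see at most the remaining share.
bound = max_recall(0.56)
print(f"at most {bound:.0%} of failures are detectable")  # at most 44% of failures are detectable
```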


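There is other interesting data as well. The HDD failure rate (AFR: Annualized Failure Rate) is roughly 2% through the first year, and around 6 to 9% from the second year onward. According to the paper, the figures published by vendors are around 2%, so this looks considerably higher. However, some apparently argue that this is because users report drives as failed when they are not actually broken.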

While drive manufacturers often quote yearly failure rates below 2% [2], user studies have seen rates as high as 6% [9]. Elerath and Shah [7] report between 15-60% of drives considered to have failed at the user site are found to have no defect by the manufacturers upon returning the unit. Hughes et al. [11] observe between 20-30% “no problem found” cases after analyzing failed drives from their study of 3477 disks.
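
As a rough back-of-the-envelope check on that argument, the sketch below discounts a user-observed failure rate by the quoted "no defect" fractions. This calculation is not from the paper: it assumes the 6% user-study rate and the 15-60% range from Elerath and Shah can simply be multiplied together, and `adjusted_failure_rate` is a hypothetical helper.

```python
def adjusted_failure_rate(observed_rate: float, no_defect_fraction: float) -> float:
    """Observed rate minus the share of returns found to have no defect."""
    return observed_rate * (1.0 - no_defect_fraction)


observed = 0.06  # user-observed yearly failure rate quoted above
for fraction in (0.15, 0.60):  # no-defect range quoted above
    rate = adjusted_failure_rate(observed, fraction)
    print(f"no-defect share {fraction:.0%}: adjusted rate {rate:.1%}")
```

Under these assumptions the adjusted rate falls somewhere between about 2.4% and 5.1%.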