Google analyzes HDD failures

Google has published a report analyzing HDD failures: Failure Trends in a Large Disk Drive Population http://216.239.37.132/papers/disk_failures.pdf
The results cover ATA drives and are based on more than 100,000 (!) HDDs.

More than one hundred thousand disk drives were used for all the results presented here. The disks are a combination of serial and parallel ATA consumer-grade hard disk drives, ranging in speed from 5400 to 7200 rpm, and in size from 80 to 400 GB.

Several interesting results come out of it, so let's go through them in order.

The first is that high HDD utilization does not necessarily mean a higher failure rate.

First, only very young and very old age groups appear to show the expected behavior. After the first year, the AFR of high utilization drives is at most moderately higher than that of low utilization drives. The three-year group in fact appears to have the opposite of the expected behavior, with low utilization drives having slightly higher failure rates than high utilization ones.

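Looking at the results, high-utilization drives do fail more often up to about the first year, but from the second year on the difference indeed becomes small. What is interesting, though, is that in the fifth year the failure rate of high-utilization drives jumps up again. So the takeaway is probably this: for drives in their first year of service, and for drives that have been in use for five years, expect a higher failure rate if you run them at high utilization.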

The second is that higher temperature does not necessarily correlate with a higher failure rate.

The figure shows that failures do not increase when the average temperature increases.

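Looking at the graph, the failure rate does go up when the temperature is markedly low or markedly high, but between 30 and 45 degrees Celsius there seems to be little change in the failure rate.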

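The third is an analysis of how SMART data relates to the failure rate. The paper examines several critical parameters: scan errors, reallocation counts, offline reallocations, and probational counts. For each of these, the failure rate rises considerably as soon as the count goes up by even one.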

We find that the group of drives with scan errors are ten times more likely to fail than the group with no errors.

the critical threshold for scan errors is one. After the first scan error, drives are 39 times more likely to fail within 60 days than drives without scan error.

After their first reallocation, drives are over 14 times more likely to fail within 60 days than drives without reallocation counts, making the critical threshold for this parameter also one.

After the first offline reallocation, drives have over 21 times higher chances of failure within 60 days than drives without offline reallocations; an effect that is again more drastic than total reallocations.

The critical threshold for probational counts is also one: after the first event, drives are 16 times more likely to fail within 60 days than drives with zero probational counts.
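Since the critical threshold for all four signals is one, a monitoring check can be as simple as "is any of these counters non-zero?". Below is a minimal sketch of that idea. The class `SmartCounts`, its field names, and the function `at_risk_signals` are hypothetical names made up for this illustration; they are not the attribute names reported by any particular SMART tool, and this is not the method used in the paper.

```python
from dataclasses import dataclass


@dataclass
class SmartCounts:
    # Hypothetical field names for the four signals named in the paper.
    scan_errors: int = 0
    reallocations: int = 0
    offline_reallocations: int = 0
    probational_counts: int = 0


def at_risk_signals(counts: SmartCounts, threshold: int = 1) -> list[str]:
    """Return the names of the signals whose count has reached the threshold.

    The default threshold of 1 follows the paper's statement that the
    critical threshold for each of these parameters is one.
    """
    return [name for name, value in vars(counts).items() if value >= threshold]


if __name__ == "__main__":
    healthy = SmartCounts()
    suspect = SmartCounts(scan_errors=1, probational_counts=3)
    print(at_risk_signals(healthy))
    print(at_risk_signals(suspect))
```

A drive that returns a non-empty list here falls into the group the paper describes as far more likely to fail within 60 days. An empty list does not mean the drive is safe.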

The most interesting part, however, is that over 56% of the failed HDDs failed without any count in these four SMART signals.

Out of all failed drives, over 56% of them have no count in any of the four strong SMART signals, namely scan errors, reallocation count, offline reallocation, and probational count. In other words, models based only on those signals can never predict more than half of the failed drives.
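To make the "can never predict more than half" point concrete: if a model can only react to non-zero counts, the share of failed drives it can catch is capped by the share of failed drives that ever showed a count. Here is a small sketch of that calculation. The function name `signal_coverage` and the data in `toy_failed` are invented purely for illustration and are not taken from the paper.

```python
def signal_coverage(failed_drives):
    """Fraction of failed drives with a non-zero count in at least one signal.

    Each entry is a tuple of counts:
    (scan errors, reallocations, offline reallocations, probational counts).
    """
    if not failed_drives:
        raise ValueError("need at least one failed drive")
    flagged = sum(1 for counts in failed_drives if any(c > 0 for c in counts))
    return flagged / len(failed_drives)


# Toy data, invented for illustration only (not from the paper).
toy_failed = [
    (0, 0, 0, 0),
    (1, 0, 0, 0),
    (0, 0, 0, 0),
    (0, 2, 1, 0),
    (0, 0, 0, 0),
]

coverage = signal_coverage(toy_failed)
print(f"upper bound on detectable failures: {coverage:.0%}")
```

In the paper's population, over 56% of failed drives had no count at all, so this upper bound is below 44% no matter how the model is tuned.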

Based on this result, the paper casts doubt on how useful SMART is for predicting failures.

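There are some other interesting data points as well. The HDD failure rate (AFR: Annualized Failure Rate) is around 2% up to the first year, and about 6 to 9% from the second year on. According to the paper, vendors typically quote figures below 2%, so the observed rates look considerably higher. That said, there seems to be a view that this is because users report drives as failed that are not actually broken.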

While drive manufacturers often quote yearly failure rates below 2% [2], user studies have seen rates as high as 6% [9]. Elerath and Shah [7] report between 15-60% of drives considered to have failed at the user site are found to have no defect by the manufacturers upon returning the unit. Hughes et al. [11] observe between 20-30% “no problem found” cases after analyzing failed drives from their study of 3477 disks.
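As a rough illustration of how these numbers relate, here is a sketch under two assumptions of mine. First, AFR is taken as failures per drive-year of operation, which is a common definition but may not match exactly how the paper computes it. Second, the "no defect found" fractions quoted above are applied to the 6% user-study figure, even though the two come from different studies, so treat the result as back-of-the-envelope arithmetic only. Both function names are hypothetical.

```python
def annualized_failure_rate(failures: int, drive_days: float) -> float:
    """Failures per drive-year (assumed definition of AFR)."""
    if drive_days <= 0:
        raise ValueError("drive_days must be positive")
    return failures / (drive_days / 365.0)


def confirmed_defect_rate(observed_rate: float, no_defect_fraction: float) -> float:
    """Observed failure rate scaled down by the share of returned drives
    in which the manufacturer found no defect."""
    return observed_rate * (1.0 - no_defect_fraction)


if __name__ == "__main__":
    # 1000 drives running for a full year with 20 failures: an AFR of about 2%.
    print(f"{annualized_failure_rate(20, 1000 * 365):.1%}")

    # The 6% user-observed rate with 15% and 60% "no defect" returns.
    for fraction in (0.15, 0.60):
        print(f"{fraction:.0%} no defect: {confirmed_defect_rate(0.06, fraction):.1%}")
```

With these inputs, a 6% user-observed rate corresponds to roughly 5.1% confirmed defects at the 15% end and roughly 2.4% at the 60% end, which shows how much of the gap to vendor figures could in principle come from drives that were not actually defective.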