Hard disks and S.M.A.R.T.

Old hard disks exposed a lot of their internals to the operating system: to request a data block from the drive, the system had to specify the exact cylinder, head and sector (CHS) where it was located (as still happens with floppy disks). This scheme became unsustainable as drives got larger (due to limits in the BIOS calls) and more intelligent.

Current hard disks are small, complex special-purpose machines that work in LBA mode rather than CHS. Oversimplifying, when presented with a sector number and an operation, they read or write the corresponding block wherever it physically is; that is, the operating system need not care any more about the physical location of that sector on the disk. (They do provide CHS values to the BIOS, but these are fake and do not cover the whole disk.) This is very interesting because the drive can automatically remap a failing sector to a different position if needed, thus correcting some serious errors transparently (more on this below).

Furthermore, "new" disks also have a very interesting diagnostic feature known as S.M.A.R.T. This interface keeps track of internal disk status information, which the user can query, and also provides a way to ask the drive to run some self-tests.

If you are wondering how I discovered this, it is because I recently had two hard disks fail (one in my desktop PC and the one in the iBook), both reporting physical read errors. I thought I had to replace them, but using smartmontools and dd(1) I was able to resolve the problems. Just try smartctl -a /dev/disk0 on your system and be impressed by the amount of detailed information it prints! (This should be harmless, but I take no responsibility if it fails for you in some way.)

I started by running an exhaustive surface test on the drive with smartctl -t long /dev/disk0.
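As a side note on the CHS addressing described earlier, the conventional mapping between a CHS triple and a linear block number is simple arithmetic. The sketch below is only an illustration: the function names are mine, and the geometry used in the example (16 heads, 63 sectors per track) is an assumed value, not something read from a real drive.

```python
def chs_to_lba(cylinder, head, sector, heads_per_cylinder, sectors_per_track):
    """Convert a CHS address to a linear block address.

    Cylinders and heads are counted from 0, sectors from 1.
    """
    if not 0 <= head < heads_per_cylinder:
        raise ValueError("head out of range for this geometry")
    if not 1 <= sector <= sectors_per_track:
        raise ValueError("sector out of range for this geometry")
    return (cylinder * heads_per_cylinder + head) * sectors_per_track + (sector - 1)


def lba_to_chs(lba, heads_per_cylinder, sectors_per_track):
    """Inverse of chs_to_lba for the same (assumed) geometry."""
    cylinder, rest = divmod(lba, heads_per_cylinder * sectors_per_track)
    head, sector_index = divmod(rest, sectors_per_track)
    return cylinder, head, sector_index + 1


if __name__ == "__main__":
    # Illustrative geometry only: 16 heads, 63 sectors per track.
    print(chs_to_lba(1, 0, 1, heads_per_cylinder=16, sectors_per_track=63))
    print(lba_to_chs(1007, heads_per_cylinder=16, sectors_per_track=63))
```

With LBA the operating system only ever handles the single number; the drive decides where that block physically lives.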
It is interesting to note that the test is performed by the drive itself, without interaction with the operating system; if you try it you will see that not even the hard disk LED blinks, which means that the test does not "emit" any data through the ATA bus.

The test ended prematurely due to the read errors and reported the first failing sector; this can be seen with smartctl -l selftest /dev/disk0. With the failing sector at hand (which was also reported in dmesg when the operating system first encountered it), I wrote some data over it with dd(1), hoping that the drive would remap it to a new place. This should have worked according to the instructions on the smartmontools web site, but it didn't. The sector kept failing and the disk kept reporting that it still had some sectors pending to be remapped (the Current_Pending_Sector attribute). (I now think this was because I didn't use a big enough block size for the write, so at some point dd(1) tried to read some data and failed.)

After a lot of testing, I decided to wipe the whole disk (also using dd(1)), hoping that at some point the writes would force the disk to remap the sector. And it worked! After a full pass, S.M.A.R.T. reported that there were no more sectors to be remapped and that several had been moved. Let's now hope that no more bad sectors appear... the desktop disk has been working fine for over a month since the "fixes" and has not developed any more problems.

All in all, a very handy tool for checking your computer's health. I recommend reading the full smartctl(1) manual page before trying it; it contains important information, especially if you are new to S.M.A.R.T. as I was.
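To check whether a pass like the one above actually cleared the pending sectors, you can look at two rows of the attribute table that smartctl -A prints: Current_Pending_Sector counts sectors still waiting to be remapped, and Reallocated_Sector_Ct counts sectors that have already been moved. The following is a rough sketch of pulling those raw values out of the text. The sample table is made up for illustration (it is not output from my disks), the helper name is hypothetical, and I am assuming the usual ten-column layout; some drives report raw values in other formats.

```python
# Made-up sample in the usual smartctl -A column layout (illustrative values only).
SAMPLE_OUTPUT = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       12
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
"""


def parse_attributes(text):
    """Return {attribute_name: raw_value_string} for rows that look like attributes."""
    attributes = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least ten columns;
        # the header row starts with "ID#" and is therefore skipped.
        if len(fields) >= 10 and fields[0].isdigit():
            attributes[fields[1]] = fields[9]
    return attributes


if __name__ == "__main__":
    attrs = parse_attributes(SAMPLE_OUTPUT)
    pending = int(attrs["Current_Pending_Sector"])
    reallocated = int(attrs["Reallocated_Sector_Ct"])
    print(f"pending: {pending}, reallocated: {reallocated}")
```

To use it on a real system, you would feed it the text produced by smartctl -A /dev/disk0 (for example, captured with the subprocess module) instead of the sample string.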

December 4, 2006 · Tags: <a href="/tags/hardware">hardware</a>, <a href="/tags/smart">smart</a>