Woke up one morning with an awesome message in the server logs:
smartd: Device: /dev/sdd [SAT], 1 Currently unreadable (pending) sectors
Any hard disk error is a nightmare. It’s even more interesting when you have no idea what happened.
The drive in question (/dev/sdd) is set up with another one in RAID 1. The error appeared while mdadm was running its monthly consistency check. Basically, mdadm was checking whether both drives hold the same data by reading and comparing bytes. At one point it got an error from /dev/sdd, but it carried on instead of kicking the drive out of the RAID array.
After that, smartd kept screaming the line above into the logs every 30 minutes.
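If you would rather not read the logs by hand, a small sketch like the one below could pull the device name and the sector count out of each warning. It assumes the message is worded exactly like the line quoted above; the pattern and the function name parse_pending_line are made up for illustration, and other smartd versions may phrase it differently.

```python
import re

# Pattern derived only from the single log line quoted in this post (an assumption).
PENDING_RE = re.compile(
    r"Device: (\S+)(?: \[[^\]]*\])?, (\d+) Currently unreadable \(pending\) sectors"
)


def parse_pending_line(line):
    """Return (device, pending_count), or None if the line is not a pending-sector warning."""
    match = PENDING_RE.search(line)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


if __name__ == "__main__":
    sample = "smartd: Device: /dev/sdd [SAT], 1 Currently unreadable (pending) sectors"
    print(parse_pending_line(sample))
```

Feeding it every line of the log and keeping the most recent result per device would show whether the count is still there after a repair attempt.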
What does it mean?
Well… the hard disk drive now has 1 unreadable (pending) sector. This means there’s probably a bad block on the hdd, but it hasn’t been remapped yet. The hdd detected the bad block when mdadm tried to read it, but the hdd didn’t take any further action. A bad block is remapped only when a write happens.
Is the hdd failing soon?
Maybe. Let’s just say the chances are a little higher.
You should make sure you have a backup at this point, and be prepared to replace the hdd at any time.
On the other hand, the hdd may function very well for another couple of years. You decide how precious your data is.
Can I get rid of the error?
Yes. If you force the hdd to write to the bad block, it will reassign it to a spare block, if there are any left.
There are probably a few ways to achieve this:
1) Format the whole hdd, but make sure you use the option to overwrite or zero the data – this means no quick format.
2) Use the tool badblocks and run it with the “-n” parameter (non-destructive read-write mode). Any filesystem on the drive needs to be unmounted beforehand. badblocks will read and then write the same data back to the hdd, block by block.
3) If you want to do this with no downtime and the hdd is part of a RAID array that can sustain one failed drive, there’s another way:
– run badblocks on the hdd in read-only mode with -s (shows a rough percentage of progress) and -v (verbose), and watch the logs to catch the percentage at which the hdd reports a media error
– stop badblocks once you get the error
– optional – mdadm: you can enable internal bitmaps; this makes re-adding the drive later much faster than a complete resync (at least on RAID 1). Check the man page
– mdadm: fail the drive, then remove it from the array
– run badblocks in non-destructive read-write mode; you can specify which range of blocks to go through – this is where you can calculate an approximate position from the percentage obtained above
– badblocks should finish faster this way than by running it over the whole drive
– re-add the drive to the RAID array
– mdadm: you can disable internal bitmaps if you don’t want them anymore
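The "calculate an approximate position" step is just arithmetic, so here is a rough sketch of it. The total block count, the 37% figure and the 1% safety margin are all made-up numbers for illustration, and badblocks_range is a hypothetical helper, not part of any tool. The progress percentage is only approximate, so a generous margin is probably wise. The block count has to be expressed in the same block size you pass to badblocks.

```python
import math


def badblocks_range(total_blocks, error_percent, margin_percent=1.0):
    """Return (first_block, last_block) surrounding the spot where the error appeared.

    total_blocks   -- size of the device in blocks (same block size you give badblocks)
    error_percent  -- progress percentage at which the media error showed up
    margin_percent -- extra percentage scanned on each side (an arbitrary assumption)
    """
    if total_blocks <= 0:
        raise ValueError("total_blocks must be positive")
    if not 0.0 <= error_percent <= 100.0:
        raise ValueError("error_percent must be between 0 and 100")

    low = max(0.0, error_percent - margin_percent)
    high = min(100.0, error_percent + margin_percent)

    first_block = math.floor(total_blocks * low / 100.0)
    # Clamp to the last valid block index on the device.
    last_block = min(total_blocks - 1, math.ceil(total_blocks * high / 100.0))
    return first_block, last_block


if __name__ == "__main__":
    # Hypothetical device: 244190208 blocks, error noticed at roughly 37%.
    first, last = badblocks_range(244190208, 37.0)
    # Note: badblocks expects last_block BEFORE first_block on its command line.
    print("first block:", first, "last block:", last)
```

Keep in mind that badblocks takes the last block before the first block as positional arguments after the device, which is easy to get backwards; the man page has the exact synopsis.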
What to check for?
Run “smartctl -a” on the drive. Check the SMART attribute table, look for the row “Current_Pending_Sector”, and the corresponding value under “RAW_VALUE”.
That’s the number of current pending sectors. It is 1 when you first get the error, and it goes back to 0 if you manage to remap the bad block.
There are no explicit commands given here. You should read the man pages of all the tools involved and make sure you understand what you are doing and why.
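To check that value from a script instead of by eye, something along these lines might do. It is only a sketch: it assumes the usual ten-column attribute table with RAW_VALUE as the last column, the function name pending_sectors is invented, and the sample rows are illustrative, not copied from the drive in this post.

```python
def pending_sectors(smartctl_output):
    """Return the RAW_VALUE of Current_Pending_Sector, or None if the row is missing."""
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Assumed column order:
        # ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
        if len(fields) >= 10 and fields[1] == "Current_Pending_Sector":
            try:
                return int(fields[9])
            except ValueError:
                return None
    return None


if __name__ == "__main__":
    # Illustrative sample only; real output from "smartctl -a" contains much more.
    sample = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       1
"""
    print(pending_sectors(sample))
```

You would pass it the text that “smartctl -a” prints for the drive. Some drives may not report this attribute at all, in which case the function returns None rather than 0.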