Linux software RAID is managed with the mdadm tool; I have another page that shows mdadm command examples for managing a software array on Linux. In this article, I'll focus on how to replace a failed disk drive in an mdadm array (RAID1), which is the most popular use case.

1. Receive mdadm monitor e-mail

If you have e-mail monitoring configured, you will receive an e-mail similar to one titled 'Fail event on /dev/md1:<hostname>'

The content looks like this:

P.S. The /proc/mdstat file currently contains the following:

Personalities : [raid1]
md3 : active raid1 sdi4[1] sda4[0](F)
      971064320 blocks super 1.1 [2/1] [_U]
      bitmap: 1/8 pages [4KB], 65536KB chunk

md0 : active raid1 sdi1[1] sda1[0]
      195264 blocks super 1.0 [2/2] [UU]
      
md1 : active raid1 sda2[0](F) sdi2[1]
      1463296 blocks super 1.1 [2/1] [_U]
      bitmap: 1/1 pages [4KB], 65536KB chunk

md2 : active raid1 sda3[0](F) sdi3[1]
      3904512 blocks super 1.1 [2/1] [_U]
      
unused devices: <none>

From the e-mail, you can tell it's the disk sda that failed (see the (F) flag next to its partitions).
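
If you don't have e-mail monitoring set up yet, a minimal sketch is to add a MAILADDR line to /etc/mdadm.conf (the address below is just a placeholder) and run mdadm in monitor mode; many distributions ship an mdmonitor service that does this for you:

# grep MAILADDR /etc/mdadm.conf
MAILADDR root@localhost
# mdadm --monitor --scan --daemonise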

2. Examine the software array

You can use mdadm -E, cat /proc/mdstat, or mdadm --detail to double-check the array status.

# mdadm --detail /dev/md1
/dev/md1:
        Version : 1.1
  Creation Time : Fri Mar 27 04:56:03 2015
     Raid Level : raid1
     Array Size : 1463296 (1429.24 MiB 1498.42 MB)
  Used Dev Size : 1463296 (1429.24 MiB 1498.42 MB)
   Raid Devices : 2
  Total Devices : 2
    Persistence : Superblock is persistent

  Intent Bitmap : Internal

    Update Time : Wed Aug 12 13:08:41 2015
          State : active, degraded
 Active Devices : 1
Working Devices : 1
 Failed Devices : 1
  Spare Devices : 0

           Name : pool05.fibrevillage.com:1 
           UUID : 8d7c3f3d:fc385020:cc773033:33df860d
         Events : 301

    Number   Major   Minor   RaidDevice State
       0       0        0        0      removed
       1       8      130        1      active sync   /dev/sdi2

       0       8        2        -      faulty   /dev/sda2
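
A quick way to check all four arrays at once (device names as in this example) is a small shell loop:

# for md in /dev/md0 /dev/md1 /dev/md2 /dev/md3; do echo "== $md =="; mdadm --detail $md | grep -E 'State :|Failed Devices'; done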

3. Mark and remove the failed disk from the array

You can skip this step if the faulty disk is completely dead; the software array will kick it out on its own. Sometimes only one partition of the disk shows as failed; in that case, mark and remove all partitions of the faulty disk. Here that means sda1, sda2, sda3, and sda4 (see the combined commands below).

Mark the disk failed

# mdadm --manage /dev/md1 --fail /dev/sda2

Remove the failed disk partition

# mdadm --manage /dev/md1 --remove /dev/sda2
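
Since every partition of the faulty disk is an array member in this example, you can fail and remove all four in one pass; mdadm accepts --fail and --remove in a single call:

# mdadm --manage /dev/md0 --fail /dev/sda1 --remove /dev/sda1
# mdadm --manage /dev/md1 --fail /dev/sda2 --remove /dev/sda2
# mdadm --manage /dev/md2 --fail /dev/sda3 --remove /dev/sda3
# mdadm --manage /dev/md3 --fail /dev/sda4 --remove /dev/sda4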

4. Replace the failed disk

Shut down the server and replace the failed disk with a new one, or do the replacement on the fly if the server supports hot-plugging (most servers do).
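
Before pulling the drive, it helps to confirm which physical disk sda is. One way, assuming the smartmontools package is installed, is to read the drive's serial number and match it against the label on the tray:

# smartctl -i /dev/sda | grep -i 'serial number'
# ls -l /dev/disk/by-id/ | grep -w sda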

5. Replicate the partition table to the new disk

Once the new disk is online and visible to the OS, replicate the partition table from the old/good disk.

sfdisk -d /dev/sdi | sfdisk /dev/sda

If the disk was partitioned with GPT, you will probably see an error like the one below:

# sfdisk -d /dev/sdi | sfdisk /dev/sda
Checking that no-one is using this disk right now ...

WARNING: GPT (GUID Partition Table) detected on '/dev/sdi'! The util sfdisk doesn't support GPT. Use GNU Parted.

OK

Disk /dev/sda: 121601 cylinders, 255 heads, 63 sectors/track
 /dev/sda: unrecognized partition table type
Old situation:
No partitions found
New situation:
Units = sectors of 512 bytes, counting from 0

   Device Boot    Start       End   #sectors  Id  System
/dev/sda1             1 1953525167 1953525167  ee  GPT
/dev/sda2             0         -          0   0  Empty
/dev/sda3             0         -          0   0  Empty
/dev/sda4             0         -          0   0  Empty
Warning: partition 1 does not end at a cylinder boundary

sfdisk: I don't like these partitions - nothing changed.
(If you really want this, use the --force option.)

In that case, try sgdisk instead (install the gdisk package):

sgdisk -R /dev/sda /dev/sdi  # copy the partition table from /dev/sdi to /dev/sda (target disk comes first)
sgdisk -G /dev/sda           # randomize the GUIDs on the disk and all its partitions so they don't clash with /dev/sdi
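
Before re-adding the partitions, it's worth verifying that the new disk now carries the same layout as the good one, for example:

# sgdisk -p /dev/sda          # print the new disk's GPT partition table
# lsblk /dev/sda /dev/sdi     # compare the two disks side by side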

6. Add the new partitions back to the software array

Check the array status

# cat /proc/mdstat 
Personalities : [raid1]
md3 : active raid1 sdi4[1]
      971064320 blocks super 1.1 [2/1] [_U]
      bitmap: 1/8 pages [4KB], 65536KB chunk

md0 : active raid1 sdi1[1]
      195264 blocks super 1.0 [2/1] [_U]
      
md1 : active raid1 sdi2[1]
      1463296 blocks super 1.1 [2/1] [_U]
      bitmap: 1/1 pages [4KB], 65536KB chunk

md2 : active raid1 sdi3[1]
      3904512 blocks super 1.1 [2/1] [_U]
      
unused devices: <none>

Add the four disk partitions back to their arrays.

# mdadm /dev/md0 --add /dev/sda1
mdadm: added /dev/sda1
# mdadm /dev/md1 --add /dev/sda2
mdadm: added /dev/sda2
# mdadm /dev/md2 --add /dev/sda3
mdadm: added /dev/sda3
# mdadm /dev/md3 --add /dev/sda4

Check the array status until all arrays are back in good shape.

# cat /proc/mdstat 
Personalities : [raid1]
md3 : active raid1 sda4[2] sdi4[1]
      971064320 blocks super 1.1 [2/1] [_U]
      [>....................]  recovery =  0.1% (1629120/971064320) finish=277.6min speed=58182K/sec
      bitmap: 1/8 pages [4KB], 65536KB chunk

md0 : active raid1 sda1[2] sdi1[1]
      195264 blocks super 1.0 [2/2] [UU]
      
md1 : active raid1 sda2[2] sdi2[1]
      1463296 blocks super 1.1 [2/2] [UU]
      bitmap: 0/1 pages [0KB], 65536KB chunk

md2 : active raid1 sda3[2] sdi3[1]
      3904512 blocks super 1.1 [2/2] [UU]
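
The resync can take hours on large partitions (note the finish estimate on md3 above). To follow the progress, you can run:

# watch -n 5 cat /proc/mdstat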

That's it!