Replace a failed disk in ZFS on a Sun Fire X4540

I wrote this procedure four years ago, but it is still worth sharing, particularly because people quite often ignore the post actions.

How to identify the failed disk

        a. Alert from ILOM SNMP.

        b. Our own daily RAID/ZFS monitoring script (a minimal sketch of such a check follows this list).

        c. Examine the /var/adm/messages file and the output of the metastat command.

        d. Run fmadm faulty.
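For reference, such a daily check does not need to be fancy; here is a minimal sketch of the idea (the mail address and layout are placeholders, not our actual script):

#!/bin/sh
# Minimal daily ZFS health check (sketch).
# "zpool status -x" prints "all pools are healthy" when there is nothing to report.
STATUS=`/usr/sbin/zpool status -x`
if [ "$STATUS" != "all pools are healthy" ]; then
    echo "$STATUS" | mailx -s "ZFS problem on `hostname`" admin@example.com
fi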

If it is a disk in a ZFS pool, run:

/usr/sbin/zpool status

...
raidz2 DEGRADED 0 0 0
c4t0d0 FAULTED 0 0 0 too many errors

c4t0 tells you the disk's logical location: controller 4, target 0. The disk status could be 'FAULTED', 'REMOVED', 'UNAVAIL', etc.
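If you want to cross-check the hardware side before touching the pool, Solaris also keeps per-device error counters that you can read with iostat; for the example device above:

# iostat -En c4t0d0    (look at the Soft/Hard/Transport Errors counters for this device)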

What to do if zpool status shows the disk as good, but ILOM and fmadm report a disk problem

In particular, the fmadm faulty output contains an Impact field that says:


Impact      : It is likely that the continued operation of
              this disk will result in data loss.

You still need to replace this disk. You could wait until ZFS marks it as failed, but for the safety of your data you had better replace it as soon as possible.
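To see the full fault record behind that message, you can pull the event details out of the fault manager; the UUID is whatever EVENT-ID fmadm faulty printed for your fault, not a fixed value:

# fmadm faulty              (note the EVENT-ID of the disk fault)
# fmdump -v -u <EVENT-ID>   (show the detailed fault record for that event)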

You could do the replacement without unconfiguring the disk first, but to me that is a bit too wild. Here is the proper way:

zpool offline spool01_1 c4t0d0

cfgadm -f  -c unconfigure c4::dsk/c4t0d0
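Before physically pulling the drive, it is worth confirming that the attachment point really went to the unconfigured state (on the X4540 the blue ready-to-remove LED for that slot should also light up):

# cfgadm -al | grep c4t0d0    (should now show "unconfigured")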

After the disk has been physically replaced (follow the next section to identify the disk's physical location), run:

cfgadm -c configure c4::dsk/c4t0d0
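The new drive should now be visible to the system again; a quick check:

# cfgadm -al | grep c4t0d0    (should now show "configured")
# format < /dev/null          (the new disk should appear in the disk list)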

Then follow the 'Notify ZFS after disk replacement' section.

Locate the failed disk and replace it with a new one

Find the disk map for that node. The map shows you the physical location of the failed disk, like the following one.

Sun Fire X4540 disk map

X4540 Rear

 3: c0t3    7: c0t7   11: c1t3   15: c1t7   19: c2t3   23: c2t7   27: c3t3   31: c3t7   35: c4t3   39: c4t7   43: c5t3   47: c5t7
 2: c0t2    6: c0t6   10: c1t2   14: c1t6   18: c2t2   22: c2t6   26: c3t2   30: c3t6   34: c4t2   38: c4t6   42: c5t2   46: c5t6
 1: c0t1    5: c0t5    9: c1t1   13: c1t5   17: c2t1   21: c2t5   25: c3t1   29: c3t5   33: c4t1   37: c4t5   41: c5t1   45: c5t5
 0: c0t0    4: c0t4    8: c1t0   12: c1t4   16: c2t0   20: c2t4   24: c3t0   28: c3t4   32: c4t0   36: c4t4   40: c5t0   44: c5t4

X4540 Front

Using the map, replace the failed disk with a new one.
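If you do not have the map at hand, note that the map above boils down to a simple rule: physical slot = controller x 8 + target. A quick ksh/bash sketch, using the example device from earlier:

DISK=c4t0d0
CTRL=`echo $DISK | sed 's/^c\([0-9]*\)t.*/\1/'`
TARGET=`echo $DISK | sed 's/^c[0-9]*t\([0-9]*\)d.*/\1/'`
SLOT=`expr $CTRL \* 8 + $TARGET`
echo "physical slot: $SLOT"    # c4t0d0 -> slot 32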

Notify ZFS after disk replacement

If your ZFS pool is configured with autoreplace, it will resilver by itself (see the note on the autoreplace property below); otherwise, you have to let the pool know that you have replaced a physical disk for it. Run:

# zpool status

You will see output like the following:

raidz2 DEGRADED 0 0 0
c4t0d0 UNAVAIL 0 0 0 cannot open
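A side note on autoreplace: whether the pool picks up the new disk by itself is controlled by the pool's autoreplace property. You can check it, and enable it if you want that behaviour, using the example pool name from earlier:

# zpool get autoreplace spool01_1    ("off" means you must run zpool replace yourself)
# zpool set autoreplace=on spool01_1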

 

Run the following command and you are done:

# zpool replace <poolname> <cXtYd0>    (use the disk's logical location)
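With the example pool and device used throughout this post, that would be:

# zpool replace spool01_1 c4t0d0

If zpool complains that the device is still offline from the earlier zpool offline step, bring it back with zpool online spool01_1 c4t0d0 and then retry the replace.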

 

You will see the following output if you run zpool status:

replacing DEGRADED 0 0 0
c4t0d0s0/o FAULTED 0 0 0 corrupted data
c4t0d0 ONLINE 0 0 0

Shortly after, you will see output like this:

raidz2 ONLINE 0 0 0
c4t0d0 ONLINE 0 0 0

That means ZFS has finished the disk replacement step and is now resilvering the new disk.

Note: the exact messages may vary with the situation and the ZFS version, but the steps are similar.
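You can watch the resilver and confirm when it has finished, again with the example pool name:

# zpool status spool01_1    (the scrub/scan line shows "resilver in progress" and, later, that it completed)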

The post actions mentioned at the beginning are the following:

  • Review the FMA/ZFS diagnostic counters from the previous fault.

    # fmstat
  • Reset the ZFS diagnostic counters and determine whether any new fault activity is occurring.

    # fmadm reset zfs-diagnosis
    # fmadm reset zfs-retire
    # fmstat
  • Determine the FMA fault event from the failed device.

    # fmadm faulty -a
    --------------- ------------------------------------ -------------- ---------
    TIME            EVENT-ID                             MSG-ID         SEVERITY
    --------------- ------------------------------------ -------------- ---------
    Jul 17 11:03:56 378924d1-840b-c4dd-c8e2-a5491d4047ff ZFS-8000-D3    Major
    ...
    Fault class : fault.fs.zfs.device
    Affects     : zfs://pool=rzpool/vdev=70f7855d9f673fcc
                  faulted but still in service
    Problem in  : zfs://pool=rzpool/vdev=70f7855d9f673fcc
                  faulted but still in service
    ...
  • Let FMA know that the ZFS fault event is cleared.

    # fmadm repair zfs://pool=rzpool/vdev=70f7855d9f673fcc
    fmadm: recorded repair to zfs://pool=rzpool/vdev=70f7855d9f673fcc
  • Confirm that no new faults have occurred.

    # fmadm faulty
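  • Finally, once the resilver is complete, both the fault manager and ZFS should come back clean; a quick way to check (the grep pattern is just a convenience):

    # fmstat | egrep 'module|zfs'    (the zfs-diagnosis / zfs-retire rows should show no new activity)
    # zpool status -x                (expect: all pools are healthy)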