I wrote this procedure 4 years ago, but it is still worth sharing, particularly because people quite often ignore the post-replacement actions.
How to identify the failed disk
a. An alert from ILOM SNMP
b. Our own daily RAID/ZFS monitoring script
c. Examining the /var/adm/messages file and the metastat command output
d. Running fmadm faulty
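For example, the last two checks can be run from the shell roughly like this (the grep pattern is only an illustration; adjust it to what your system actually logs):
# grep -i error /var/adm/messages | tail -20
# fmadm faulty
fmadm faulty prints nothing when FMA has not diagnosed any fault, which is a quick way to rule it out.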
If it is a disk in a ZFS pool, run:
/usr/sbin/zpool status
...
raidz2 DEGRADED 0 0 0
c4t0d0 FAULTED 0 0 0 too many errors
…
c4t0d0 shows the disk's logical location: controller 4, target 0. The disk status could be 'FAULTED', 'REMOVED', 'UNAVAIL', etc.
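If you just want a quick health check across all pools, zpool status has a flag for that; it prints 'all pools are healthy' when nothing is wrong and otherwise lists only the pools with problems:
# zpool status -x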
What to do if zpool status shows the disk as good, but ILOM and fmadm report a disk problem
In particular, the fmadm faulty output contains an Impact section that says:
Impact : It is likely that the continued operation of
this disk will result in data loss.
You still need to replace this disk. You could wait until ZFS marks it as failed, but for the safety of your data you had better replace it as soon as possible.
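If you want additional evidence before pulling the disk, the per-device error counters are worth a look (c4t0d0 is just the example device from above; the exact output format varies by Solaris release):
# iostat -En c4t0d0
A growing count of hard or transport errors usually backs up what ILOM and fmadm are telling you.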
You can replace the disk without unconfiguring it first, but to me that is a bit too risky. Here is the proper way:
zpool offline spool01_1 c4t0d0
cfgadm -f -c unconfigure c4::dsk/c4t0d0
After the disk replacement (follow the next step to identify the disk's physical location), run:
cfgadm -c configure c4::dsk/c4t0d0
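If you want to double-check the attachment point before pulling the disk and again after inserting the new one, cfgadm can list it (the exact ap_id format depends on your controller driver; this is just a sketch for the example device):
# cfgadm -al | grep c4t0d0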
Then follow the 'Notify ZFS after disk replacement' section below.
Locate the failed disk and replace it with a new one
Find the disk map for that node. The map shows you the physical location of the failed disk, like the following one.
Sun Fire X4540

X4540 Rear
 3: c0t3   7: c0t7  11: c1t3  15: c1t7  19: c2t3  23: c2t7  27: c3t3  31: c3t7  35: c4t3  39: c4t7  43: c5t3  47: c5t7
 2: c0t2   6: c0t6  10: c1t2  14: c1t6  18: c2t2  22: c2t6  26: c3t2  30: c3t6  34: c4t2  38: c4t6  42: c5t2  46: c5t6
 1: c0t1   5: c0t5   9: c1t1  13: c1t5  17: c2t1  21: c2t5  25: c3t1  29: c3t5  33: c4t1  37: c4t5  41: c5t1  45: c5t5
 0: c0t0   4: c0t4   8: c1t0  12: c1t4  16: c2t0  20: c2t4  24: c3t0  28: c3t4  32: c4t0  36: c4t4  40: c5t0  44: c5t4
X4540 Front
Using the map, replace the failed disk with the new one.
Notify ZFS after disk replacement
If your ZFS pool has the autoreplace property enabled, it will start resilvering onto the new disk by itself; otherwise, you have to let the pool know that you've replaced a physical disk for it.
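If you are not sure whether autoreplace is set, you can check the pool property and, if you like, turn it on (the pool name here is just the example pool from the offline step; a sketch, not part of the original procedure):
# zpool get autoreplace spool01_1
# zpool set autoreplace=on spool01_1
Either way, check the pool state first: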
#zpool status
You will see information like the following:
raidz2 DEGRADED 0 0 0
c4t0d0 UNAVAIL 0 0 0 cannot open
Run the following command, and you are done:
#zpool replace <poolname> cXtYd0 (the disk's logical location)
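For the example disk and pool used throughout this post, that would be:
# zpool replace spool01_1 c4t0d0
Since the new disk went into the same bay and has the same device name, you only need to name the device once; ZFS replaces it in place.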
You will see the following output if you run zpool status
replacing DEGRADED 0 0 0
c4t0d0s0/o FAULTED 0 0 0 corrupted data
c4t0d0 ONLINE 0 0 0
Shortly after, you will see output like this:
raidz2 ONLINE 0 0 0
c4t0d0 ONLINE 0 0 0
That means ZFS has finished the disk replacement and is now resilvering.
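You can watch the resilver progress by polling zpool status (the wording of the progress line differs between ZFS versions, but it will say 'resilver in progress' with a percentage, and 'resilver completed' once it is done):
# zpool status spool01_1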
Note: the messages may vary with the situation and the ZFS version, but the steps are similar.
Finally, clean up the FMA state. Review the FMA/ZFS diagnostic counters from the previous fault.
# fmstat
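If you only care about the ZFS-related modules, a simple grep narrows the output down (a sketch; the column layout may differ between releases):
# fmstat | grep zfs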
Reset the ZFS diagnostic counters and determine whether any new fault activity is occurring.
# fmadm reset zfs-diagnosis
# fmadm reset zfs-retire
# fmstat
Determine the FMA fault event from the failed device.
# fmadm faulty -a
--------------- ------------------------------------  -------------- ---------
TIME            EVENT-ID                              MSG-ID         SEVERITY
--------------- ------------------------------------  -------------- ---------
Jul 17 11:03:56 378924d1-840b-c4dd-c8e2-a5491d4047ff  ZFS-8000-D3    Major
...
Fault class : fault.fs.zfs.device
Affects     : zfs://pool=rzpool/vdev=70f7855d9f673fcc
              faulted but still in service
Problem in  : zfs://pool=rzpool/vdev=70f7855d9f673fcc
              faulted but still in service
...
Let FMA know that the ZFS fault event is cleared.
# fmadm repair zfs://pool=rzpool/vdev=70f7855d9f673fcc
fmadm: recorded repair to zfs://pool=rzpool/vdev=70f7855d9f673fcc
Confirm that no new faults have occurred.
# fmadm faulty