If you have IBM DS/DCS storage systems and have SNMP traps defined, it is not unusual to see a message like the one below.

This trap indicates an event where user-interaction is required immediately.
ipv4 dcs3700 1010 Feb 15, 2016 3:45:55 AM
Impending drive failure detected by drive Drive Enclosure 1, Drawer 2, Slot 6

You will also see the error message in the event log or in the Recovery Guru. The condition is detected by the IBM DS/DCS storage system, which checks and monitors the status of the disk drives in the background. A Predictive Failure Analysis (PFA) error is logged against a drive whenever the level of errors detected is high enough to be a concern while the drive still remains usable. Hence, the drive is likely to fail in the near future.
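
If you manage the subsystem from the command line, a quick way to confirm the condition is to query the overall health status with SMcli. A minimal example, assuming out-of-band management through the two controller IP addresses (replace the placeholders with your own values):

SMcli <Ctrl_a ip> <Ctrl_b ip> -c "show storageSubsystem healthStatus;"

The output lists the failures the controller firmware currently knows about, including drives flagged with an impending failure.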

 

This type of disk failure is called an impending drive failure. In most cases, you only need to fail the drive (see the steps below), but in some cases you have to be careful.

Here are the details.

There are three types of PFA (impending failure):

Low risk PFA
Medium risk PFA
High risk PFA

Deal with low/medium risk PFA (impending failure)

When an impending failure happens with a low or medium risk PFA, set the disk to failed and then replace it as soon as possible.

Set an impending-failure disk to failed (GUI)

Start IBM Storage Manager --> select the affected disk --> Hardware --> Drive --> Advanced --> Fail

Replace the failed disk, and the logical drive will start rebuilding right away. If a hot spare disk is configured, the hot spare takes over the role of the failed disk and the array starts rebuilding immediately.
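
If no hot spare has been defined yet, you can designate an unassigned drive as a global hot spare before failing the PFA drive. A minimal SMcli sketch, assuming the unassigned drive sits in enclosure 1, slot 12 (an example location only; verify the syntax against the CLI reference for your firmware level):

SMcli <Ctrl_a ip> <Ctrl_b ip> -c 'use password "<password>"; set drive [1,12] hotSpare=TRUE;'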

Set an impending-failure disk to failed (SMcli)

SMcli <Ctrl_a ip> <Ctrl_b ip> -c 'use password "<password>"; set drive [1,4,10] operationalState=failed;'
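
The numbers in drive [1,4,10] are the enclosure, drawer, and slot of the affected drive, matching the location reported in the trap and in the event log; on enclosures without drawers, only [enclosure,slot] is used. To double-check the drive location and state before failing it, you can list all drives first (the summary keyword is an assumption and may differ between Storage Manager versions):

SMcli <Ctrl_a ip> <Ctrl_b ip> -c "show allDrives summary;"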

Then replace the failed disk; the rebuild or hot spare takeover behaves the same as described in the GUI section above.

Deal with high risk PFA (impending failure)

The recovery options for high risk PFAs are different. It is a good idea to back up all data on
the affected logical drives.

There are normally two scenarios to deal with, each with its own recovery steps:

PFA warning on a disk in a RAID 0 array

An array is configured without redundancy (RAID 0) with the understanding that a single disk failure results in data loss. Only temporary or non-critical data should be stored on the associated logical drives. Therefore, the main PFA recovery action for RAID 0 arrays is a disruptive procedure with all associated LUNs being inaccessible while the affected drive is replaced and data restored.

Perform these steps:
1. Stop all I/O to the affected logical drives.
2. Volume Copy can be used as an alternative to tape backup and restore. This function is
only available with the optional premium feature. If any of the affected logical drives are
also source or target logical drives in a Volume Copy operation that is either Pending or In
Progress, you must stop the copy operation before continuing. Go to the Copy Manager by
selecting Logical Drive --> VolumeCopy --> Copy Manager, highlight each copy pair that
contains an affected logical drive, and select Copy --> Stop.
3. If you have FlashCopy logical drives associated with the affected logical drives, these
FlashCopy logical drives will no longer be valid. Perform any necessary operations (such
as backup) on the FlashCopy logical drives and then delete them.
4. Highlight the affected drive in the Physical View of the Subsystem Management window
and select Hardware --> Drive --> Advanced --> Fail Drive. The amber fault LED illuminates on the
affected disk. The affected logical drives become Failed.
Alternatively, use the SMcli command line; see the example command in the earlier section and the script sketch after this procedure.
5. Replace the failed drive.
6. Highlight the array associated with the replaced drive in the Logical View of the Subsystem
Management window and select Advanced --> Recovery --> Initialize --> Array. The
logical drives in the array are initialized, one at a time.
To monitor initialization progress for a logical drive, highlight the logical drive in the Logical
View of the Subsystem Management window and select Logical Drive --> Properties.
Note that after the operation in progress has completed, the progress bar is no longer
displayed in the Properties dialog.
When initialization is completed, all logical drives in the array have the Optimal status.
7. Use operating system tools to re-discover the initialized LUNs.
8. Restore data from backup or recreate any Volume Copy relationships by highlighting the
copy pairs in the Copy Manager (select Logical Drive --> VolumeCopy --> Copy Manager)
and selecting Copy --> Re-Copy.
It might also be possible to add redundancy by changing the RAID level if sufficient spare
drives are available. If successful, this alters the PFA risk level from high to medium, allowing
the disk to be replaced without disruption. However, there will be data loss if the affected disk
fails during this operation.
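
For reference, the GUI steps above can also be scripted with SMcli. The sketch below is an illustration only: the copy pair names "src" and "tgt", the logical drive name "lun1", and the drive location [1,2,6] are made-up examples, and the command names follow the DS CLI conventions used earlier in this article, so verify them against the Storage Manager CLI reference for your firmware level before running anything.

Stop any affected copy pair, then fail the PFA drive:

SMcli <Ctrl_a ip> <Ctrl_b ip> -c 'use password "<password>"; stop volumeCopy target ["tgt"] source ["src"];'
SMcli <Ctrl_a ip> <Ctrl_b ip> -c 'use password "<password>"; set drive [1,2,6] operationalState=failed;'

After physically replacing the drive, initialize the logical drives in the array and restart the copy:

SMcli <Ctrl_a ip> <Ctrl_b ip> -c 'use password "<password>"; start logicalDrive ["lun1"] initialize;'
SMcli <Ctrl_a ip> <Ctrl_b ip> -c 'use password "<password>"; recopy volumeCopy target ["tgt"] source ["src"];'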

PFA warning on a disk in a degraded array

For the array to be in a degraded state, there must already be a failed disk when the PFA is detected on another disk in the same array. Reconstruction to a hot spare might already be in progress. Two replacement disks will be required. In this scenario, it is important to replace the failed disk as soon as possible by performing the following steps:
1. Although not required, I/O to the affected logical drives should be stopped to reduce the
possibility of inducing a failure on the PFA disk before the failed disk is replaced.
2. If a standby hotspare drive is not available, replace the failed disk.
3. Monitor the progress of reconstruction on the affected logical drives or change the
reconstruction rate by highlighting the logical drive in the Logical View of the Subsystem
Management window and then selecting Logical Drive --> Properties. Note that after the
operation in progress has completed, the progress bar is no longer displayed in the
Properties dialog.
4. When all affected logical drives have returned to the Optimal status, the PFA risk level
reduces from high to medium. At this point, it is safe to resume I/O to the affected logical
drives.
5. Highlight the PFA flagged drive in the Physical View of the Subsystem Management
window and select Hardware --> Drive --> Advanced --> Fail Drive. The amber fault LED for the
affected disk illuminates. The affected logical drives become degraded until reconstruction
is complete.
Alternatively, use the SMcli command line; see the example in the earlier section.
6. Replace the failed drive(s).
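
As in the RAID 0 case, progress can also be watched from the command line. The health status query shown at the beginning of this article works here as well, and a per-logical-drive progress query along the following lines should show the reconstruction status ("lun1" is a made-up name, and the exact command keyword may differ by firmware level):

SMcli <Ctrl_a ip> <Ctrl_b ip> -c "show storageSubsystem healthStatus;"
SMcli <Ctrl_a ip> <Ctrl_b ip> -c 'show logicalDrive ["lun1"] actionProgress;'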