RAID-6 data loss probability calculation

This is quite an old topic, but not an easy one. Manufacturer MTTF figures can be very misleading. You can find many articles on the web, but most of them focus only on the MTTDL (mean time to data loss) RAID equation formulated by Gibson and Patterson in 1993. A more realistic model should account not just for disk failures but also for read errors. Before I started building one myself, I found that NetApp had already developed an easy-to-use, web-based equation.

http://raideqn.netapp.com/#

As it says on its page:

Our goal is to provide an easy-to-use tool that can help system designers, who might not be reliability analysis experts, explore the design space of their existing next-generation RAID solutions. Our intent is for this tool to help designers understand the performance and reliability trade-offs and choose the right design with more or less aggressive proactive mechanisms, such as media and RAID scrubbing, and understand the inherent reliability of their devices, including HDDs and solid-state drives.

Here is the example I tried for my storage experiment.

First, I set the disk model preset to 'Unreliable SATA', which is probably meant for the worst case.

System Parameters

Data disks in a RAID-6 group: 13

Mission time: 5 years

No. of RAID-6 groups: 132

HDD Failure Characteristics

I did not change any values here. These figures can vary from vendor to vendor, as well as between manufacturing batches. For general purposes, I used NetApp's numbers. You can also use your own numbers, or refer to the paper "Statistical Modeling and Analysis of Hard Disk Drives (HDDs) Failure" by Zhimin He, Hao Yang, and Min Xie.
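Assuming the HDD failure fields, like the repair fields further down, parameterize a Weibull distribution of disk lifetimes, here is a rough Python sketch of what the characteristic life (scale) and shape parameters mean: it computes the probability that a single disk fails within the 5-year mission time. The parameter values are hypothetical placeholders, not NetApp's defaults; substitute your own vendor data or published fits.

```python
import math

# Hypothetical Weibull parameters for illustration only; these are NOT
# NetApp's defaults. Replace them with vendor data or published fits.
ETA_HOURS = 500_000.0  # characteristic life (scale), hours
BETA = 1.1             # shape parameter (>1 means the hazard rises with age)

MISSION_HOURS = 5 * 365 * 24  # 5-year mission time = 43,800 hours


def weibull_cdf(t: float, eta: float, beta: float) -> float:
    """Probability that a unit has failed by time t under Weibull(eta, beta)."""
    return 1.0 - math.exp(-((t / eta) ** beta))


p_fail = weibull_cdf(MISSION_HOURS, ETA_HOURS, BETA)
print(f"P(a given disk fails within {MISSION_HOURS} h) = {p_fail:.4f}")
```

This ignores replacement of failed disks, so it is a per-disk probability, not the expected number of failures across the whole array over 5 years.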

Time To Repair a Failed HDD

In this section, I adjusted the repair-time parameters a bit so that the MTTR matches our system, which is around 31 hours.
I set the characteristic life to 35 and kept the default Weibull shape parameter of 1.65.
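As a quick sanity check, assuming 35 is the Weibull characteristic life (scale) in hours and 1.65 is the shape parameter, the mean repair time is η·Γ(1 + 1/β), which comes out at roughly 31 hours. A minimal Python sketch:

```python
import math

ETA_HOURS = 35.0  # characteristic life of the repair-time distribution (assumed hours)
BETA = 1.65       # Weibull shape parameter (tool default)


def weibull_mean(eta: float, beta: float) -> float:
    """Mean of a Weibull(eta, beta) distribution: eta * Gamma(1 + 1/beta)."""
    return eta * math.gamma(1.0 + 1.0 / beta)


mttr = weibull_mean(ETA_HOURS, BETA)
print(f"Mean time to repair = {mttr:.1f} hours")
```

If your own MTTR differs, you can adjust the characteristic life until the computed mean matches it.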

HDD Latent sector defect

I doubled the workload to 2 MB/sec to match my system. In my case, more than 10 times the disk capacity is read each year, plus a scrub once a month.
For the bit error rate (errors per bit read), you can also use the figure from the disk spec sheet; most SATA disks are rated at around 10^-15. This tool uses 2.8172e-15.
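To get a feel for these numbers, here is a back-of-the-envelope Python sketch of how much a disk reads per year and the expected number of read errors at the tool's bit error rate. It assumes the 2 MB/sec is the average per-disk read rate, that MB means 10^6 bytes, and a hypothetical 4 TB disk for the monthly scrub; none of these come from the tool itself.

```python
# Assumptions: 2 MB/s is the average per-disk read rate, MB = 10**6 bytes,
# and the disk capacity below is hypothetical.
READ_RATE_BYTES_PER_S = 2 * 10**6
SECONDS_PER_YEAR = 365 * 24 * 3600
BIT_ERROR_RATE = 2.8172e-15        # errors per bit read (value used by the tool)
DISK_CAPACITY_BYTES = 4 * 10**12   # hypothetical 4 TB disk
SCRUBS_PER_YEAR = 12               # one full-disk scrub per month

# Bytes read per year by the normal workload and by scrubbing.
workload_bytes = READ_RATE_BYTES_PER_S * SECONDS_PER_YEAR
scrub_bytes = SCRUBS_PER_YEAR * DISK_CAPACITY_BYTES
total_bits = (workload_bytes + scrub_bytes) * 8

# How many times the disk capacity the workload alone reads per year.
capacity_multiple = workload_bytes / DISK_CAPACITY_BYTES

# Expected read errors per disk per year = bits read * per-bit error rate.
expected_errors = total_bits * BIT_ERROR_RATE

print(f"Workload reads {capacity_multiple:.1f}x the disk capacity per year")
print(f"Expected read errors per disk per year: {expected_errors:.2f}")
```

This is an expected count per disk per year, not a probability of data loss; the tool's job is to model how such errors combine with disk failures and how scrubbing repairs them.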

Media Scrubbing

I set the characteristic life to 30 days, i.e. 720 hours.

With these settings, the tool reported an expected number of data loss events of about 0.15 by the end of year 5.
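An expected count of 0.15 is not quite the same as a probability. If you assume data loss events follow a Poisson process (my assumption, not something the tool states), the probability of at least one data loss event over the 5 years is 1 - exp(-0.15), roughly 14%. A small Python sketch, which also spreads the expectation evenly over the 132 groups:

```python
import math

EXPECTED_EVENTS = 0.15  # expected data loss events over 5 years, from the tool
NUM_GROUPS = 132

# Assumption: data loss events arrive as a Poisson process, so
# P(at least one event) = 1 - exp(-expected number of events).
p_any_loss = 1.0 - math.exp(-EXPECTED_EVENTS)

# Per-group figures, assuming the risk is spread evenly over all groups.
expected_per_group = EXPECTED_EVENTS / NUM_GROUPS
p_group_loss = 1.0 - math.exp(-expected_per_group)

print(f"P(any data loss in 5 years)       = {p_any_loss:.3f}")
print(f"P(data loss in a given RAID group) = {p_group_loss:.6f}")
```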

Note: The equation also accounts for restorative features of modern RAID implementations, including NetApp Data ONTAP® systems such as media and RAID parity scrubbing, which proactively repair transient errors.
So if you are using a product that is not from NetApp, you may need to adjust the parameters a bit more.

For general MTTDL calculations, I found this calculator quite handy:

http://www.servethehome.com/raid-calculator/raid-reliability-calculator-simple-mttdl-model/
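For comparison, the simple MTTDL model that most of those articles use can be sketched in Python. This is the commonly cited independent-failure approximation for a double-parity group of N disks, MTTDL ≈ MTTF^3 / (N(N-1)(N-2)·MTTR^2); I have not checked that it is exactly the formula the calculator above implements. The MTTF is a hypothetical manufacturer-style figure, and I assume 2 parity disks per group, so 15 disks per group.

```python
# Hypothetical input: a manufacturer-style MTTF chosen for illustration.
MTTF_HOURS = 1_000_000.0
MTTR_HOURS = 31.0         # repair time similar to my system
DATA_DISKS = 13
PARITY_DISKS = 2          # RAID-6 keeps two parity disks per group
NUM_GROUPS = 132
MISSION_HOURS = 5 * 365 * 24


def raid6_group_mttdl(mttf: float, mttr: float, n: int) -> float:
    """Classic independent, exponential-failure MTTDL approximation
    for one RAID-6 group of n disks (loss needs 3 overlapping failures)."""
    return mttf ** 3 / (n * (n - 1) * (n - 2) * mttr ** 2)


n = DATA_DISKS + PARITY_DISKS
mttdl_group = raid6_group_mttdl(MTTF_HOURS, MTTR_HOURS, n)

# Independent groups: the system loss rate is the sum of group rates.
mttdl_system = mttdl_group / NUM_GROUPS

# With a constant loss rate, expected losses over the mission = T / MTTDL.
expected_losses = MISSION_HOURS / mttdl_system

print(f"MTTDL per group: {mttdl_group:.3e} h")
print(f"Expected data loss events in 5 years: {expected_losses:.2e}")
```

With these inputs the expected number of losses over 5 years comes out several orders of magnitude below the 0.15 from the NetApp tool; the gap reflects both the hypothetical MTTF and the fact that the simple model ignores latent sector errors.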