Objectives

Upon completion of this lesson, you will be able to:

  • estimate availability
  • estimate reliability
  • connect reliability to availability
  • use MTBF, MTTR, MTTF

Overview

This lesson provides an overview of mathematical and statistical techniques for analyzing the reliability and availability of databases, servers, networks, storage systems, and information systems.

When designing a system, an effort must be made to assess its reliability. When assessing reliability, the estimated reliability of each system component must be determined. Related to reliability is the concept of availability. Users care about both, and this lesson explains the difference between the two and how to assess each.

Reliability vs Availability

Both are measures of the degree to which a system is usable by its users and can perform its expected tasks. The two concepts are closely related but are different measures. Both capture a non-functional solution requirement: quality of service.

Availability is a measure of when a system is available for use by its users. It is generally expressed as a percentage over some time period, such as 99.35% availability during business hours. The percentage can be interpreted as the probability that the system is available at some specific point in time.

Reliability measures how long a system operates as expected between failures. It measures how well, in terms of “uptime”, a system functions.

Simply put, availability is a measure of the percentage of time a system is in an operable state while reliability is a measure of how long the system performs its intended function.

Availability matters when service or system “uptime” is important. Availability captures in its measure both “Mean Time to Failure” (MTTF) and “Mean Time To Restore” (MTTR). Alternatively, we can measure “Mean Time Between Failure” (MTBF) rather than MTTF.

The times during which a system is expected to be available to its users (subject to the system’s reliability) are generally expressed as time intervals. For example:

  • Available Monday through Saturday from 5:00am to 11:59pm and Sunday from 7:00am to 11:59pm. The system may be down for maintenance during the other times

Estimating Availability

An empirical estimate of the availability of a system component can be calculated with the formula:

\[A = 1 - \frac{\sum\limits^{n}_{i=1}{f(t)_i}}{T}\]

where \(T\) is the total time period over which availability is measured (expressed in the same time units as the measurements for the failures) and \(f(t)_i\) is the amount of time that the system was unavailable due to the \(i^{th}\) failure.

Example: Availability Estimate

Based on observation and records, a system component \(C\) that is part of a database application experienced the following outages during the month of September:

Date    Outage duration
9/3     10 min
9/3     1 min
9/7     11 min
9/8     28 min
9/10    4 min
9/18    92 min
9/21    4 min
9/23    2 min
9/30    6 min
9/30    1 min

What is the estimated availability expressed as a probability?

The total time of all 10 failures combined is 159 min. The total number of minutes in the month of September is 43,200. So, \(1 - ({159}/{43200}) = 0.99632\) or an estimated availability of 99.63%.

If there is no better estimate, then we would use that as an estimate for future months until we have collected more data on failures. Naturally, meticulous records of failures are essential for such empirical estimates.
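
The calculation above is easy to script. Below is a minimal Python sketch (the variable names are illustrative and not part of the lesson materials) that reproduces the September estimate from the recorded outage durations:

```python
# Empirical availability: A = 1 - (total downtime) / (total observation period)
outages_min = [10, 1, 11, 28, 4, 92, 4, 2, 6, 1]  # September outages, in minutes
period_min = 30 * 24 * 60                          # minutes in September: 43,200

availability = 1 - sum(outages_min) / period_min
print(f"Estimated availability: {availability:.5f} ({availability:.2%})")
# Estimated availability: 0.99632 (99.63%)
```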

System Availability

A system is a collection of interconnected components that must all be available for the system to be available. System availability is the joint probability that all \(n\) components are available at the same time:

\[A_{System} = \prod\limits^{n}_{i=1}{A_i}\]

Example: Assume that a system \(S\) consists of four components having annual availability estimates of 99.9%, 99.999%, 94.3%, and 99%. What is the overall estimated annual availability of the system assuming a uniform distribution of failures? Further assume that for an online shopping cart system each minute of downtime costs $147 in lost sales due to abandoned carts. What is the annual loss given the aggregate availability?

Answer: Multiply all four availability estimates to get an aggregate availability estimate of 0.933 or 93.3%. Next, find the number of minutes that the system is not available during the year and multiply by the cost per minute: \((365 \times 24 \times 60) \times (1 - 0.933) \times 147 \approx 5{,}176{,}634\), or about $5.2 million.
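
A short Python sketch of the same calculation (the component availabilities and the $147-per-minute cost are taken from the example; the result differs slightly depending on whether the aggregate availability is rounded to 0.933 first):

```python
from math import prod

# System availability is the product of the component availabilities.
components = [0.999, 0.99999, 0.943, 0.99]
a_system = prod(components)                       # about 0.9326

minutes_per_year = 365 * 24 * 60                  # 525,600
downtime_min = minutes_per_year * (1 - a_system)  # minutes of downtime per year
loss = downtime_min * 147                         # at $147 per minute

print(f"System availability: {a_system:.4f}")
print(f"Annual downtime: {downtime_min:,.0f} min; estimated loss: ${loss:,.0f}")
# About 35,400 minutes of downtime and roughly $5.2 million in lost sales
```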

Improving Availability

Naturally, to improve the availability, a systems engineer would first address the availability of the weakest system component.

Systems can be made more available through:

  • Duplication and use of redundancy, e.g., multiple CPUs, power supplies, storage devices (RAID)
  • Faster recoverability to reduce MTTR
  • Data backup to avoid catastrophic loss of information
  • Hot swap capability, so that failed components can be replaced without shutting the system down, reducing MTTR
  • Continuous power through uninterruptible power supplies (UPS) and backup power sources

Estimating Reliability

Reliability is a measure of how long a system functions correctly without breaking. It is generally captured by two measures:

  • “Mean Time to Failure (MTTF)”, and
  • “Mean Time Between Failures (MTBF)”

For example, the MTBF for the Seagate Barracuda 7200 Serial ATA 1TB hard drive is 0.75 million hours with an annualized failure rate of 0.32%. This measure can be gleaned from specifications published by the manufacturer who, in turn, derived it from lab experiments. An extract of the relevant page from the manual is shown as an example.

Note that the reliability estimates are based on specific environmental and usage constraints; if a component is deployed outside those constraints, the estimates may not hold.

MTBF

Mean Time Between Failures (MTBF) is a measure of the average time that elapses between two consecutive system failures during which the system becomes unavailable.

\[MTBF = \frac{T_{total} - T_{down}}{n}\]

where \(T_{total}\) is the total time of observation and \(T_{down}\) is the total time during which the system was not available, i.e., the sum of time of the \(n\) failures.
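
Applied to the September data from the earlier example (10 failures totaling 159 minutes of downtime in a 43,200-minute month), the formula gives a rough MTBF, as this sketch shows:

```python
# MTBF = (total observation time - total downtime) / number of failures
t_total_min = 30 * 24 * 60   # 43,200 minutes in September
t_down_min = 159             # total downtime from the 10 recorded outages
n_failures = 10

mtbf_min = (t_total_min - t_down_min) / n_failures
print(f"MTBF = {mtbf_min:.1f} minutes ({mtbf_min / 60:.1f} hours)")
# MTBF = 4304.1 minutes (71.7 hours)
```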

MTTR

Mean Time to Repair/Restore (MTTR) is a metric that represents the time duration to repair or restore a failed system component until the overall system is available. This time is often codified in a service level agreement (SLA) between a system provider and the client.

Availability and MTTR/MTBF

Availability can be defined in terms of reliability when a reliability estimate exists but downtime measures have not been recorded. This is often the case for new systems that have not yet been deployed long enough to derive an empirical estimate of availability.

\[A = \frac{MTBF}{MTBF + MTTR}\]

As a corollary we can estimate MTBF in terms of availability A and MTTR, which might sometimes be easier to measure.

\[MTBF = A \times \frac{MTTR}{1-A}\]
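
The two formulas are easy to verify with illustrative numbers (an assumed MTBF of 500 hours and MTTR of 2 hours, not figures from the lesson):

```python
# Availability from MTBF/MTTR, and MTBF recovered from availability and MTTR.
mtbf = 500.0   # hours between failures (illustrative value)
mttr = 2.0     # hours to restore service (illustrative value)

availability = mtbf / (mtbf + mttr)
print(f"A = {availability:.4f}")            # 0.9960

mtbf_back = availability * mttr / (1 - availability)
print(f"MTBF = {mtbf_back:.1f} hours")      # 500.0 (recovers the original MTBF)
```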

Reliability as a Probability

Rather than expressing reliability in terms of MTBF and MTTR, it can also, like availability, be expressed as the likelihood that a system is functioning properly during some time period \(t\), given some measure of MTBF (and, as a corollary, in terms of availability and MTTR).

From this perspective, an estimate of the reliability can be calculated as follows:

\[R = e^{-t/MTBF}\]
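
For example, the following sketch estimates the one-year reliability of a component with an MTBF of 750,000 hours (the figure quoted earlier for the disk drive), assuming continuous operation:

```python
from math import exp

# R(t) = e^(-t / MTBF): probability of running failure-free for time t
mtbf_hours = 750_000        # manufacturer-quoted MTBF
t_hours = 365 * 24          # one year of continuous operation: 8,760 hours

reliability = exp(-t_hours / mtbf_hours)
print(f"One-year reliability: {reliability:.4f}")   # about 0.9884
```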

Service Level Agreements

A service level agreement (SLA) is an agreement between a system provider and a system purchaser containing service level objectives (SLO). An SLA codifies the quality of service requirements and avoids disputes between provider and user.

An SLA specifies maintenance intervals, provider responsibilities, costs, and a set of availability and reliability measures to which the system must adhere.

For example, let’s say that the SLA for a cloud data service specifies a 90% availability. How many hours can the system be “down” and still meet its SLO? The answer is 876 hours \((365 \times 24 \times (1 - 0.9))\). To get less than 48 hours of downtime, the availability must be at least 0.99452 or 99.452%.
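
The same bookkeeping can be sketched in a few lines (the 90% SLO and the 48-hour downtime budget are the figures from the example above):

```python
hours_per_year = 365 * 24                          # 8,760 hours

# Downtime allowed under a 90% availability objective
allowed_downtime = hours_per_year * (1 - 0.90)
print(f"Allowed downtime: {allowed_downtime:.0f} hours per year")     # 876

# Availability required to stay under 48 hours of downtime per year
required_availability = 1 - 48 / hours_per_year
print(f"Required availability: {required_availability:.5f}")          # 0.99452
```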

Practice Questions

  1. Assuming continual operation and an availability of 99.8%, how many minutes of downtime can be expected in a year?1
  2. What is the MTBF in hours in the above scenario assuming it takes on average 15 minutes to repair the system after a failure?2
  3. If a system has an MTBF of 750,000 hours, what is its expected annual reliability expressed as a probability?3
  4. What is the probability that the above system is not available (down) at some point during the year?4
  5. Razor LLC is planning to provide a web hosting service to its customers and wants to guarantee 99.98% availability in its Service Level Agreements (SLAs). The QA team has estimated an MTBF of 3 years and an MTTR of 4 hours. Are you confident you can deliver on the guarantee?5
  6. A key customer requests reliability of 99.9% during any given 30-day period for the above system. Can you deliver?6

Tutorial

The video tutorial below is a narration of this lesson supported by the linked slide deck.

Slide Deck: Reliability & Availability Modeling


Summary

To summarize, availability is the percentage of time a system remains operational under acceptable operating parameters and serves its intended purpose, while reliability is the probability that a system functions correctly for some period of time.


Files & Resources

All Files for Lesson 92.505

References

None.

Errata

None collected yet. Let us know.


  1. Answer: 1,051 minutes↩︎

  2. Answer: MTBF = 124.75 hours↩︎

  3. Answer: \(e^{-(365 \times 24)/750000} = 0.9884\)↩︎

  4. Answer: \(1 - R = 1 - 0.9884 \approx 0.0116\)↩︎

  5. Answer: Actual availability ≈ 99.985%, which just meets the guaranteed 99.98% (but barely, so agreeing to this would be very risky)↩︎

  6. Answer: Reliability = 97.3% over 30 days, so the requested 99.9% cannot be met↩︎