Upon completion of this lesson, you will be able to:
This lesson provides an overview of mathematical and statistical techniques for analyzing the reliability and availability of databases, servers, networks, storage systems, and information systems.
When designing a system, an effort must be made to assess its reliability, which in turn requires an estimate of the reliability of each system component. Related to reliability is the concept of availability. Users care about both, and this lesson explains the difference between the two and how to assess each.
Both are measures of the degree to which a system is usable by its users and can perform its expected tasks. The two concepts are closely related but are different measures. Both capture a non-functional solution requirement: quality of service.
Availability is a measure of when a system is available for use to its users. It is generally expressed as a percentage over some time period, such as 99.35% availability during business hours. The percentage can be interpreted as the probability that the system is available at some specific point in time.
Reliability measures how long a system is operating as expected between failures. It measures how well, in terms of “uptime”, a system functions.
Simply put, availability is a measure of the percentage of time a system is in an operable state while reliability is a measure of how long the system performs its intended function.
Availability matters when service or system “uptime” is important. Availability captures in its measure both “Mean Time to Failure” (MTTF) and “Mean Time To Restore” (MTTR). Alternatively, we can measure “Mean Time Between Failures” (MTBF) rather than MTTF.
The times during which a system is expected to be available to its users (subject to the system’s reliability) are generally expressed as time intervals. For example:
An empirical estimate of the availability of a system component can be calculated with the formula:
\[A = 1 - \frac{\sum\limits^{n}_{i=1}{f(t)_i}}{T}\]
where \(T\) is the total time period over which availability is measured (expressed in the same time units as the measurements for the failures) and \(f(t)_i\) is the amount of time that the system was unavailable due to the \(i^{th}\) failure.
Based on observation and records, a system component \(C\) that is part of a database application experienced the following outages during the month of September:
Date | Duration | Date | Duration |
9/3 | 10 min | 9/3 | 1 min |
9/7 | 11 min | 9/8 | 28 min |
9/18 | 92 min | 9/10 | 4 min |
9/21 | 4 min | 9/23 | 2 min |
9/30 | 6 min | 9/30 | 1 min |
What is the estimated availability expressed as a probability?
The total time of all 10 failures combined is 159 min. The total number of minutes in the month of September is 43,200. So, \(1 - ({159}/{43200}) = 0.99632\) or an estimated availability of 99.63%.
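The calculation above can be reproduced with a few lines of Python. This is a minimal sketch: the function name `empirical_availability` and its parameters are illustrative choices, not part of the lesson, and the outage durations are read directly from the September table.

```python
def empirical_availability(outage_minutes, total_minutes):
    """Return A = 1 - (sum of outage durations) / T.

    Both arguments must use the same time unit (minutes here).
    """
    if total_minutes <= 0:
        raise ValueError("total_minutes must be positive")
    downtime = sum(outage_minutes)
    if downtime > total_minutes:
        raise ValueError("downtime cannot exceed the observation period")
    return 1 - downtime / total_minutes


# Outage durations in minutes, taken from the September table.
september_outages = [10, 1, 11, 28, 92, 4, 4, 2, 6, 1]
minutes_in_september = 30 * 24 * 60  # 43,200 minutes

availability = empirical_availability(september_outages, minutes_in_september)
print(f"Downtime: {sum(september_outages)} min")   # Downtime: 159 min
print(f"Availability: {availability:.5f}")         # Availability: 0.99632
```

Keeping the outage log as a plain list of durations makes it easy to append new failures and re-estimate availability as more data is collected.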
If there is no better estimate, then we would use that as an estimate for future months until we have collected more data on failures. Naturally, meticulous records of failures are essential for such empirical estimates.
A system is a collection of interconnected components that must all be available for the system to be available. System availability is the joint probability that all \(n\) components are available at the same time. If the component failures are independent of one another, it is the product of the component availabilities:
\[A_{System} = \prod\limits^{n}_{i=1}{A_i}\]
Example: Assume that a system \(S\) consists of four components having annual availability estimates of 99.9%, 99.999%, 94.3%, and 99%. What is the overall estimated annual availability of the system assuming a uniform distribution of failures? Further assume that for an online shopping cart system each minute of downtime costs $147 in lost sales due to abandoned carts. What is the annual loss given the aggregate availability?
Answer: Multiply all four availability estimates to get an aggregate availability estimate of 0.933 or 93.3%. Next, find the number of minutes that the system is unavailable during the year and multiply it by the cost per minute of downtime: \((365 \times 24 \times 60) \times (1 - 0.933) \times 147 = 5176634\), or about $5.2 million.
Naturally, to improve the availability, a systems engineer would first address the availability of the weakest system component.
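The worked example can be checked with a short script. This is only a sketch: the names `series_availability`, `components`, and `annual_loss` are made up for illustration, and `math.prod` requires Python 3.8 or later. Note that the script multiplies the unrounded availabilities, so the loss it computes is slightly higher than the figure obtained from the rounded value of 0.933, though it is still about $5.2 million.

```python
import math


def series_availability(component_availabilities):
    """Availability of a system that needs every component to be up.

    Assumes component failures are independent, so the joint
    probability is the product of the individual availabilities.
    """
    return math.prod(component_availabilities)


# Annual availability estimates of the four components of system S.
components = [0.999, 0.99999, 0.943, 0.99]
a_system = series_availability(components)

minutes_per_year = 365 * 24 * 60   # 525,600 minutes
cost_per_minute = 147              # dollars lost per minute of downtime

downtime_minutes = minutes_per_year * (1 - a_system)
annual_loss = downtime_minutes * cost_per_minute

print(f"System availability: {a_system:.4f}")
print(f"Expected downtime:   {downtime_minutes:,.0f} min/year")
print(f"Expected loss:       ${annual_loss:,.0f}/year")

# The weakest component is the first candidate for improvement.
weakest = min(components)
print(f"Weakest component availability: {weakest}")
```

Raising the weakest value in `components` and re-running the script shows how strongly a single unreliable component dominates the product.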
Systems can be made more available through:
Reliability is a measure of how long a system functions correctly without breaking. It is generally captured by two measures:
For example, the MTBF for the Seagate Barracuda 7200 Serial ATA 1TB hard drive is 0.75 million hours with an annualized failure rate of 0.32%. This measure can be gleaned from specifications published by the manufacturer who, in turn, derived it from lab experiments. An extract of the relevant page from the manual is shown as an example.
Note that the reliability estimates are based on specific environmental and usage constraints; if a component is deployed outside those constraints, the estimates may no longer hold.
Mean Time Between Failures (MTBF) is a measure of the average time that elapses between two consecutive system failures during which the system becomes unavailable.
\[MTBF = \frac{T_{total} - T_{down}}{n}\]
where \(T_{total}\) is the total time of observation, \(n\) is the number of failures observed, and \(T_{down}\) is the total time during which the system was not available, i.e., the sum of the durations of the \(n\) failures.
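As an illustration of the MTBF formula, the sketch below reuses the September figures from the availability example earlier in the lesson (43,200 minutes observed, 159 minutes of downtime, 10 failures). The lesson itself does not compute this value, and the function name `mtbf` is a hypothetical helper.

```python
def mtbf(total_time, downtime, failures):
    """Return MTBF = (T_total - T_down) / n, in the unit of the inputs."""
    if failures <= 0:
        raise ValueError("at least one failure is needed to estimate MTBF")
    if downtime > total_time:
        raise ValueError("downtime cannot exceed the observation period")
    return (total_time - downtime) / failures


# September observations for component C, all in minutes.
mtbf_minutes = mtbf(total_time=43_200, downtime=159, failures=10)

print(f"MTBF: {mtbf_minutes:.1f} min")            # MTBF: 4304.1 min
print(f"MTBF: {mtbf_minutes / 60:.1f} hours")     # MTBF: 71.7 hours
```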
Mean Time to Repair/Restore (MTTR) is a metric that represents the time duration to repair or restore a failed system component until the overall system is available. This time is often codified in a service level agreement (SLA) between a system provider and the client.
Availability can be defined in terms of reliability when a reliability estimate exists but downtime measures have not been recorded. This is often the case for new systems that have not been deployed long enough to derive an empirical estimate of availability.
\[A = \frac{MTBF}{MTBF + MTTR}\]
As a corollary, solving this expression for MTBF lets us estimate MTBF in terms of availability \(A\) and MTTR, which might sometimes be easier to measure.
\[MTBF = A \times \frac{MTTR}{1-A}\]
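The two relationships between availability, MTBF, and MTTR can be sketched as a pair of functions. The function names and the sample values (an MTBF of 1,000 hours and an MTTR of 2 hours) are assumptions chosen purely for illustration and do not come from the lesson.

```python
import math


def availability_from_mtbf(mtbf, mttr):
    """Return A = MTBF / (MTBF + MTTR); both inputs in the same unit."""
    if mtbf <= 0 or mttr < 0:
        raise ValueError("mtbf must be positive and mttr non-negative")
    return mtbf / (mtbf + mttr)


def mtbf_from_availability(availability, mttr):
    """Return MTBF = A * MTTR / (1 - A), the same formula solved for MTBF."""
    if not 0 <= availability < 1:
        raise ValueError("availability must be in [0, 1)")
    return availability * mttr / (1 - availability)


# Hypothetical component: fails every 1,000 hours on average, 2 hours to restore.
a = availability_from_mtbf(mtbf=1000, mttr=2)
print(f"Availability: {a:.6f}")                    # Availability: 0.998004

# Round trip: recover the MTBF from the availability and the MTTR.
recovered = mtbf_from_availability(a, mttr=2)
print(f"Recovered MTBF: {recovered:.1f} hours")    # Recovered MTBF: 1000.0 hours
```

The round trip is a quick sanity check that the rearranged formula agrees with the original definition.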
Rather than being expressed in terms of MTBF and MTTR, reliability can also be expressed, like availability, as a likelihood: the probability that a system functions properly during some time period \(t\), given some measure of MTBF (and, as a corollary, given availability and MTTR).
From this perspective, an estimate of the reliability can be calculated as follows:
\[R = e^{-t/MTBF}\]
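The sketch below evaluates this formula for one year of continuous operation, using the 0.75 million hour MTBF quoted earlier for the hard drive. The function name `reliability` is illustrative. The exponential form corresponds to a constant failure rate, so it should be read as an approximation under that assumption.

```python
import math


def reliability(t, mtbf):
    """Return R = exp(-t / MTBF), with t and mtbf in the same time unit.

    Assumes a constant failure rate over the period t.
    """
    if mtbf <= 0 or t < 0:
        raise ValueError("mtbf must be positive and t non-negative")
    return math.exp(-t / mtbf)


hours_per_year = 365 * 24     # 8,760 hours
drive_mtbf_hours = 750_000    # MTBF quoted in the lesson for the drive

r_one_year = reliability(hours_per_year, drive_mtbf_hours)
print(f"Probability of no failure in one year: {r_one_year:.3f}")  # about 0.988

# When t equals the MTBF, the reliability drops to 1/e, roughly 37%.
print(f"R at t = MTBF: {reliability(drive_mtbf_hours, drive_mtbf_hours):.3f}")
```

The second call highlights a common misreading: an MTBF of 750,000 hours does not mean that a typical unit survives that long.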
A service level agreement (SLA) is an agreement between a system provider and a system purchaser containing service level objectives (SLO). An SLA codifies the quality of service requirements and avoids disputes between provider and user.
An SLA specifies maintenance intervals, provider responsibilities, costs, and a set of availability and reliability measures to which the system must adhere.
For example, let’s say that the SLA for a cloud data service specifies a 90% annual availability. How many hours can the system be “down” in a year and still meet its SLO? The answer is 876 hours \((365 \times 24 \times (1 - 0.9))\). To keep downtime to no more than 48 hours per year, the availability must be at least 0.99452 or 99.452%.
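Converting between an availability target and a downtime budget is simple arithmetic, sketched below. The function names and the default one-year period are illustrative assumptions, not terms defined by any particular SLA.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours


def allowed_downtime_hours(availability, period_hours=HOURS_PER_YEAR):
    """Hours of downtime permitted by an availability target over a period."""
    if not 0 <= availability <= 1:
        raise ValueError("availability must be between 0 and 1")
    return period_hours * (1 - availability)


def required_availability(max_downtime_hours, period_hours=HOURS_PER_YEAR):
    """Minimum availability needed to stay within a downtime budget."""
    if not 0 <= max_downtime_hours <= period_hours:
        raise ValueError("downtime budget must fit inside the period")
    return 1 - max_downtime_hours / period_hours


budget = allowed_downtime_hours(0.90)
print(f"90% availability allows {budget:.0f} hours of downtime per year")

target = required_availability(48)
print(f"A 48-hour budget requires {target:.5f} availability")  # 0.99452
```

The same functions work for monthly or quarterly SLAs by passing a different `period_hours`.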
The video tutorial below is a narration of this lesson supported by the linked slide deck.
Slide Deck: Reliability & Availability Modeling
To summarize, availability is defined as the percentage of time a system remains operational under acceptable parameters of operations and serves its intended purpose, while reliability is the probability that a system functions correctly for some period of time.