Upon completion of this lesson, you will be able to:

  * describe common non-volatile storage technologies used for databases and file systems
  * explain how the physical geometry of a mechanical hard disk drive determines its performance
  * calculate disk access times from seek time, rotational latency, and transfer rate
  * describe disk scheduling and caching strategies
  * compare common RAID levels and choose an appropriate configuration for a given workload
Data, such as images, media, files, or database records, must be stored on some physical device whose memory is non-volatile, i.e., memory that does not require a constant supply of power. This lesson explains common storage methods for data in databases and file systems.
This lesson is a derivative work of parts of “How Operating Systems Work” by Peter Desnoyer of Northeastern University which is permitted under the CC4 License. This license gives important freedoms, including the right to copy and distribute the chapter non-commercially without permission or fee. Specifically, it allows one to copy and redistribute the material in any medium or format and to remix, transform, and build upon the material which is what this lesson does.
The memory inside a computer (random access memory, video memory, and cache memory) is volatile: if the power supply is interrupted, the contents are irretrievably lost. Non-volatile storage devices include mechanical hard disk drives, solid-state devices (flash drives), CD-ROM, DVD, and tape. The most commonly used devices for databases and file systems are mechanical and solid-state drives.
At present, mechanical disk drives are still less expensive per GB (gigabyte) than solid-state drives, but the gap is shrinking. Most personal computers and smart devices (such as mobile phones and tablets) use solid-state drives, as they require far less power and are not subject to destruction when dropped or jarred.
The most widely used storage technology today is the hard disk drive, which records data magnetically on one or more spinning platters, in concentric circular tracks. The performance of a hard drive is primarily determined by physical factors: the size and geometry of its components and the speeds at which they move.
The principal components of a mechanical hard disk drive (or, simply, hard drive) are:
Spindle and platters: The platters rotate at a constant speed. Common rotational speeds are summarized in the table below; a short sketch of the arithmetic follows this component list.
| Speed | Rotations/Second | ms/Rotation |
|---|---|---|
| 5400 RPM | 90 r/sec | 11 ms/r |
| 7200 RPM | 120 r/sec | 8.3 ms/r |
| 10,000 RPM | 167 r/sec | 6 ms/r |
| 15,000 RPM | 250 r/sec | 4 ms/r |
Head and actuator arm: This takes between 1 and 10 ms to move from one track to another on consumer disk drives, depending on the distance between tracks, and between 1 and 4 ms on high-end enterprise drives (at the cost of higher power consumption and noise).
Bits and tracks: On modern drives each track is about 3 micro-inches (75nm) wide, and bits are about 1 micro-inch (25nm) long.
Electronics and interface: The drive electronics are responsible for controlling the actuator and transferring data to and from the host. On a consumer drive this occurs over a SATA interface, which has a top speed of 150, 300, or 600MB/s for SATA 1, 2, or 3.
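The figures in the rotational-speed table above follow from simple arithmetic; here is a quick sketch in Python (the function name is ours):

```python
def rotation_stats(rpm: float) -> tuple[float, float]:
    """Convert a spindle speed in RPM to rotations/second and ms/rotation."""
    rotations_per_sec = rpm / 60           # 60 seconds per minute
    ms_per_rotation = 1000 / rotations_per_sec
    return rotations_per_sec, ms_per_rotation

for rpm in (5400, 7200, 10_000, 15_000):
    rps, ms = rotation_stats(rpm)
    print(f"{rpm} RPM = {rps:.0f} r/sec, {ms:.1f} ms/rotation")
```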
In this fascinating demonstration, an engineer from Seagate Technologies looks inside a mechanical hard disk drive and explains the process a drive goes through to access data on disk.
In 2022, the largest solid-state drives are between 64 and 100 TB, while the largest hard-disk drives are smaller in capacity but much less expensive per GB. Therefore, most cloud storage providers are still provisioned using hard disks, as high-capacity drives (in 2022, “high capacity” means 15-20TB) remain far cheaper than SSDs.
Unlike memory, data on a disk drive is read and written in fixed-size units, or sectors, of either 512 or 4096 bytes. Thus small changes (e.g., to a single byte) require what is known as a read/modify/write operation: a full sector is read from disk into memory, modified, and then written back to disk.
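A minimal sketch of the read/modify/write cycle, assuming a raw device or disk-image file opened for binary read/write; the path and sector size are illustrative:

```python
SECTOR_SIZE = 512  # or 4096 on "Advanced Format" drives

def write_byte(dev_path: str, offset: int, value: int) -> None:
    """Change a single byte by rewriting the whole sector containing it."""
    sector_start = (offset // SECTOR_SIZE) * SECTOR_SIZE
    with open(dev_path, "r+b") as dev:
        dev.seek(sector_start)
        sector = bytearray(dev.read(SECTOR_SIZE))  # read the full sector
        sector[offset - sector_start] = value      # modify one byte in memory
        dev.seek(sector_start)
        dev.write(sector)                          # write the full sector back
```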
These sectors are arranged in concentric tracks on each platter surface; a sector may thus be identified by its geometric coordinates:
Cylinder: This is the track number; for historical reasons the group formed by the same track on all disk platters has been called a cylinder, as shown in the figure. Early disk drives could switch rapidly between accesses to tracks in the same cylinder; however this is no longer the case with modern drives.
Head: This identifies one of the platter surfaces, as there is a separate head per surface and the drive electronics switches electrically between them in order to access the different surfaces.
Sector: The sector within the track.
Platters are stacked within the hard disk drive, and the read/write heads fly only nanometers above each platter surface.
Years ago operating systems had to use cylinder/head/sector addresses to talk to a hard drive, which required detailed information about each drive being used. Modern drive interfaces (SATA, SAS/SCSI, etc.) use Logical Block Addressing, where the controller inside the drive is responsible for translating between an integer sector number or block address and a physical location on one of the platters.
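As an illustration of the kind of translation involved, the classic (pre-zoned-recording) CHS-to-LBA mapping can be written down directly; the geometry values in the example are hypothetical, and on modern drives the real mapping is more complex, which is exactly why it is hidden inside the controller:

```python
def chs_to_lba(cylinder: int, head: int, sector: int,
               heads_per_cylinder: int, sectors_per_track: int) -> int:
    """Map a cylinder/head/sector address to a logical block address.

    Sectors are numbered from 1 by convention, hence the (sector - 1)."""
    return ((cylinder * heads_per_cylinder + head) * sectors_per_track
            + (sector - 1))

# Example with a hypothetical geometry of 16 heads and 63 sectors/track:
print(chs_to_lba(cylinder=2, head=3, sector=10,
                 heads_per_cylinder=16, sectors_per_track=63))  # -> 2214
```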
Data on a drive can be identified by the platter surface it is on, the track on that surface, and finally the position on that track. Reading data from a disk (or writing to it) requires the following steps:
Switching the electronics to communicate with the appropriate head (we’ll ignore this, as it’s fast).
Moving the head assembly until the head is positioned over the target track; this is known as the seek time.
Waiting for the platter to rotate until the first bit of the target data passes under the head; this is known as the rotational latency.
Reading or writing until the last bit has passed under the head; this is known as the transfer time.
The overhead of seek and rotational delay has a major effect on disk performance. To give an example, consider randomly accessing a 4KB data block on a 7200 RPM disk with the following parameters:

  * average seek time: 8 ms
  * average rotational latency: 4 ms (half of the 8.3 ms per rotation)
  * transfer rate: 200 KB/ms (i.e., 200 MB/s)
So, to read some random 4KB block requires 8ms + 4ms + (4KB / 200KB/ms) ≈ 12ms for an average throughput of ~334 KB/s.
A 5MB block requires 8ms + 4ms + (5MB / 200KB/ms) ≈ 37ms, for an average throughput of ~135MB/s.
Take a moment to convince yourself that these are correctly calculated, and be mindful of the units.
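The arithmetic can be checked with a short sketch (parameter values taken from the example above; 1MB is treated as 1000KB, as in the text):

```python
def access_time_ms(block_kb: float, seek_ms: float = 8.0,
                   rotational_ms: float = 4.0,
                   transfer_kb_per_ms: float = 200.0) -> float:
    """Total time for one random access: seek + rotational delay + transfer."""
    return seek_ms + rotational_ms + block_kb / transfer_kb_per_ms

for size_kb in (4, 5 * 1000):  # a 4KB block and a 5MB block
    t = access_time_ms(size_kb)
    # KB/ms is numerically equal to MB/s
    print(f"{size_kb} KB -> {t:.1f} ms, throughput ~{size_kb / t:.2f} MB/s")
```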
Although disks are random-access devices, random access is expensive. To achieve anywhere near full bandwidth on a modern disk drive it is necessary to read or write data in large contiguous blocks; in our random access example, for instance, a 2MB transfer would require 22ms, or less than twice as long as the smallest transfer.
A number of strategies are used to avoid the full penalties of seek and rotational delay in disks. One of these strategies is that of optimizing the order in which requests are performed — for instance, reading sectors 10 and 11 on a track would require a seek, a rotational delay, and then two sectors of transfer time, while reading 11 first would require the same seek and initial rotational delay (plus one sector), followed by a full rotation to get back to sector 10.
This is known as disk scheduling, and relies on the fact that multitasking operating systems frequently generate multiple disk requests in parallel, which do not have to be completed in strict order. Although a single process may wait for a read or write to complete before continuing, when multiple processes are running they can each issue requests and go to sleep, and then be woken in the order that requests complete.
The primary algorithms used for disk scheduling are:
First-come first-served (FCFS): In other words no scheduling, with requests handled in the order that they are received.
Shortest seek time first (SSTF): This greedy strategy always services the request closest to the current head position, minimizing each individual seek; however, it is prone to starvation, as a stream of requests to nearby sections of the disk can prevent another request from being serviced for a long time.
SCAN: This (and variants) are what is termed the elevator algorithm — pending requests are served from the inside to the outside of the disk, then from the outside back in, etc., much like an elevator goes from the first floor to the highest requested one before going back down again. It is nearly as efficient as SSTF, while avoiding starvation.
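A minimal sketch of the elevator idea, assuming pending requests are identified simply by track number (the function name is ours):

```python
def scan_order(requests: list[int], head: int, direction: int = 1) -> list[int]:
    """Serve pending track requests in elevator (SCAN) order.

    direction=1 means the head is currently sweeping toward higher tracks."""
    ahead  = sorted(t for t in requests if (t - head) * direction >= 0)
    behind = sorted((t for t in requests if (t - head) * direction < 0),
                    reverse=True)
    if direction < 0:                    # sweeping toward lower tracks
        ahead, behind = ahead[::-1], behind[::-1]
    return ahead + behind

# Head at track 50, sweeping outward: requests ahead of the head are served
# in passing order, then the sweep reverses for the ones behind it.
print(scan_order([10, 95, 52, 60, 33], head=50))  # [52, 60, 95, 33, 10]
```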
Disk scheduling can be implemented in two ways – in the operating system, or in the device itself. OS-level scheduling is performed by keeping a queue of requests which can be re-ordered before they are sent to the disk. On-disk scheduling requires the ability to send multiple commands to the disk before the first one completes, so that the disk is given a choice of which to complete first. This is supported as Command Queuing in SCSI, and in SATA as Native Command Queuing (NCQ).
In addition to scheduling, the other strategy used to improve disk performance is caching. Disk drives typically have a small amount of RAM (8-16MB on many modern drives) used for caching. Although this is very small in comparison to the amount of RAM typically dedicated to caching on the host, if used properly it can make a significant difference in performance. Caching takes one of two forms:
Read caching (a.k.a. track buffering): At read time, after seeking to a track it is common practice to store the entire track in the on-disk cache, in case the host requests this data in the near future. Consider, for example, the case when the host requests sector 10 on a track, then almost (but not quite) immediately requests sector 11. Without the track buffer, the drive would have missed the chance to read sector 11 as it passed under the head and would have to wait an entire revolution for it to come back around; with the track buffer, small sequential requests such as this can be handled efficiently. (A toy model appears after this discussion.)
Write buffering: Write buffering is a different matter entirely, and refers to a feature where a disk drive may acknowledge a write request while the data is still in RAM, before it has been written to disk. This can risk loss of data, as there is a period of time during which the application thinks that data has been safely written, while it would in fact be lost if power failed.
Although in theory most or all of the performance benefit of write buffering could be achieved in a safer fashion via proper use of command queuing, that feature was not available (or was poorly implemented) in consumer drives until recently. As a result, write buffering is typically enabled by default; however, it can be enabled or disabled on a per-drive basis.
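Returning to read caching, here is a toy model of the track buffer, assuming sectors are addressed as (track, sector) pairs and the “disk” is just a dictionary mapping track numbers to lists of sector data:

```python
class TrackBuffer:
    """Toy read cache: a miss pulls the whole track in, so a follow-up
    request for a neighboring sector is a hit instead of a full rotation."""

    def __init__(self, disk: dict[int, list[bytes]]):
        self.disk = disk            # track number -> list of sector contents
        self.cached_track = None
        self.cache: list[bytes] = []

    def read_sector(self, track: int, sector: int) -> bytes:
        if track != self.cached_track:      # miss: buffer the entire track
            self.cached_track = track
            self.cache = self.disk[track]
        return self.cache[sector]           # hit: served from RAM

disk = {7: [f"sector-{i}".encode() for i in range(8)]}  # one 8-sector track
buf = TrackBuffer(disk)
buf.read_sector(7, 3)   # miss: the whole track is buffered
buf.read_sector(7, 4)   # hit: no extra rotation needed
```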
Solid-state drives (SSD) are more expensive (per GB) and come in smaller capacities than mechanical disks, but are far more reliable and extremely fast for access and writing. However, solid-state drives have a limited number of “write cycles” and therefore are not good for high volume transactional databases.
The video below explains how solid-state drives work and how they differ from mechanical disk drives.
Storage is a key component of most systems and is critical to systems with databases. The availability and reliability of storage must be virtually perfect, as otherwise a system would not function. The storage subsystem must therefore be designed with fault tolerance in mind. Imagine if all data for an organization were stored on a single disk and that disk failed. Even with a backup, it would take significant time to restore, and during that time the system would not be available to its users, potentially costing the organization lost revenue and client dissatisfaction, among other negative consequences. Therefore, a systems engineer must devise a storage system that continues to function even when a disk, or the data on a disk, is lost.
One of the most commonly used methods to prevent loss of data is RAID (Redundant Array of Independent Devices)1. A RAID configuration requires at least two storage devices, and data is copied, mirrored, or striped across the devices in the array. Each “RAID level” defines a different way in which the data is stored across the devices in the array.
There are several different RAID configurations distinguished by how they write the data across the drive array. Of these, RAID 0 (striping), 1 (mirroring), 5 (distributed parity), and 10 (RAID 1 + 0) are most common in practice. RAID levels are standardized by the Storage Networking Industry Association (SNIA) as part of the Common RAID Disk Drive Format (DDF) standard. The numbers are mere identifiers and do not imply any performance or reliability benefits.
While most RAID levels protect against, and allow recovery from, hardware failures such as defective drives, they do not provide any protection against data loss due to environmental disasters such as fire or flooding. Data centers must still guard against such events through distributed storage across multiple, geographically dispersed locations. Nor does RAID protect against software-induced losses such as data entry mistakes, user errors, software bugs, viruses, ransomware, or other malware. In a data storage architecture, RAID is simply one part of an overall data loss prevention and recovery policy; and, of course, it cannot replace a backup plan with offsite storage of backups.
In a RAID 0 configuration, the data is striped evenly across the storage devices without any parity (error-correcting) information. Therefore, RAID 0 improves read and write performance (marginally, according to benchmark tests2) but offers no fault tolerance: if a device fails, the entire array is compromised. RAID 0 is normally used in practice to improve performance or to build a single large storage device from multiple devices. For example, four 100TB disks can be combined into a 400TB drive array. With a hardware controller, the entire array can be connected to the computer as a single drive; the computer, operating system, and any applications will perceive the array as a single large disk.
While a RAID 0 array can be made from devices with differing capacities, the capacity of the array is limited to the smallest device capacity times the number of devices. For instance, if one 100TB hard disk and a 50TB hard disk are combined into a RAID 0 array, then the capacity is 2x50TB = 100TB. Some device controllers may allow the remaining 50TB capacity from the larger drive to be used for non-RAID storage.
A RAID 0 array of n storage devices provides data read and write transfer rates up to n times as high as the individual device rates, although with no data redundancy and no guard against device failure. Therefore, the use case for RAID 0 is generally in environments that require high performance yet can tolerate lower reliability, such as scientific computing. It is not an appropriate choice for databases, file systems, or object stores unless the application using the storage array has other means of fault tolerance, such as distributing data across systems.
A RAID 1 configuration consists of an exact copy (i.e., a mirror) of the data on two or more storage devices. This RAID level requires a minimum of two storage devices; more are possible as long as the number of devices is even. The devices are split into two equal-size groups. If the devices are of unequal storage capacity, all devices are limited to the storage capacity of the smallest device in the array.
Fault tolerance is achieved by having an exact copy of the data on a mirrored device, thus RAID 1 can sustain multiple device failures as long as the original and the mirror are not both damaged. There is no parity and no striping.
Reads can be done in parallel, thus improving retrieval performance over a single large device. However, write performance is degraded as the data must be written twice – if the write operations can be done in parallel, it is no worse than the write performance of a single device.
This configuration is useful when read performance is more important than write performance.
The cost per GB is double that of a RAID 0 configuration, as the storage capacity is essentially cut in half: a RAID 0 configuration of four 100TB devices provides 400TB, while a RAID 1 configuration of the same devices provides 200TB. Nevertheless, a systems engineer who needs 100TB of storage might build a RAID 1 array from two 100TB devices (or four 50TB devices) rather than use a single 100TB drive, as the RAID 1 configuration improves fault tolerance and read performance.
A RAID 5 configuration stripes the data across the storage devices in blocks and distributes parity (error-correcting) information across the devices. It can sustain at most one device failure. In the case of a failure, the information on the failed device can be reconstructed through a mathematical operation that uses the data from the remaining devices plus the parity data. As this reconstruction is computationally intensive and takes time, “reading” from the damaged device is slow, impacting performance. Because of the needed parity information, RAID 5 requires at least three devices.
Reading consecutive blocks can be done in parallel, so RAID 5 offers some read performance improvement, depending on the data that is read. Write performance is degraded compared to a single device or a RAID 0 or RAID 1 configuration, as the parity information must be calculated before the parity block can be written.
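The reconstruction relies on bitwise XOR: the parity block is the XOR of the data blocks in a stripe, so any single missing block equals the XOR of everything that survives. A minimal sketch:

```python
def xor_blocks(blocks: list[bytes]) -> bytes:
    """XOR equal-length blocks together, byte by byte."""
    result = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            result[i] ^= b
    return bytes(result)

data = [b"AAAA", b"BBBB", b"CCCC"]   # blocks on three data devices
parity = xor_blocks(data)            # parity block stored on a fourth device

# If the second device fails, its block is the XOR of everything that survived:
recovered = xor_blocks([data[0], data[2], parity])
assert recovered == data[1]
```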
RAID 6 extends RAID 5 by adding another parity block which allows two device failures. Like RAID 5, it uses block-level striping but now with two parity blocks distributed across all devices in the array. It has slower write performance than RAID 5 but is more fault tolerant. It has the same read performance as RAID 5.
The overall capacity of RAID 5 and RAID 6 is less than that of RAID 0, as some of each device’s storage capacity is used for parity information. RAID 6 requires a minimum of four devices.
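Under the simplifying assumption of n identical devices, the usable capacities of the RAID levels discussed in this lesson reduce to a few formulas; here is a small sketch (the function is ours):

```python
def usable_capacity(level: int, n_devices: int, device_tb: float) -> float:
    """Usable array capacity in TB for n identical devices.

    RAID 0 uses everything; RAID 1 and 10 store every block twice;
    RAID 5 gives up one device's worth of space to parity, RAID 6 two.
    (Minimum device counts per level are not enforced here.)"""
    overhead = {0: 0, 1: n_devices / 2, 5: 1, 6: 2, 10: n_devices / 2}
    return (n_devices - overhead[level]) * device_tb

for level in (0, 1, 5, 6, 10):
    tb = usable_capacity(level, n_devices=4, device_tb=100)
    print(f"RAID {level}: {tb:.0f} TB usable from four 100TB devices")
```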
The combination of striping across mirrored sets of an even number of (at least four) devices is referred to as RAID 10 (or sometimes RAID 1+0). There is also a RAID 01 (RAID 0+1), in which striped sets are mirrored. The image below illustrates the difference between the two.
The choice of a RAID configuration depends on the use cases of the system and the desired or needed degree of fault tolerance balanced with read and write performance as well as cost. There is no “best RAID configuration” – choosing any one of them is a trade-off.
If cost is not a factor, then RAID 1 is generally preferable when fault tolerance is needed. If read performance is paramount, fault tolerance is not a factor, and cost is an issue, then RAID 0 is a good option. To get the best of both worlds, at a higher cost, RAID 10 is the correct choice.
RAID 5 and 6 strike a balance between cost, fault tolerance, and increased read performance, but at the cost of write performance. For data warehouses, which are not updated in real time, RAID 5 and 6 are good choices. RAID 10 is the “best” option for transactional systems that must be highly available and are frequently read and written, when cost is not a factor.
All RAID systems must be controlled by either software or hardware: somehow, data has to be mirrored, blocks have to be distributed, and parity has to be calculated. This can be done either within the operating system (a software controller) or in a dedicated hardware-based controller. Of course, in most cases the hardware controller runs an embedded operating system that implements the RAID functionality in software rather than in actual firmware. Most often, a dedicated hardware-based controller is significantly faster and more reliable than a software-only solution, particularly one embedded in the operating system.
There are two common storage device technologies in use for file and database systems: mechanical hard disks and solid-state drives. Devices are often combined to form larger “disk arrays,” thereby increasing capacity and reliability. RAID provides a convenient way to combine disks into larger arrays for improved access performance and fault tolerance.
Images used in this lesson are licensed under CC4 or are linked from external sources.
Desnoyer, Peter (2020). How Operating Systems Work. Content from this manuscript is provided under a CC4 license.
This term evolved over time; initially it was Redundant Array of Inexpensive Disks.↩︎
[“Does RAID0 Really Increase Disk Performance?”](http://www.hardwaresecrets.com/does-raid0-really-increase-disk-performance/). HardwareSecrets.com. November 1, 2006.↩︎