Upon completion of this lesson, you will be able to:
Block storage devices, such as disks, provide a read/write interface for fixed-sized blocks of data, identified by a numeric address. A file system is a way to organize and manage data on block storage devices such as hard drives, solid-state drives (SSD), or USB drives. It provides a structure for storing and accessing files, and manages the allocation and retrieval of storage space.
There are many different types and implementations of file systems, each with its own specific characteristics, benefits, and drawbacks, but two of the most commonly used file systems are FAT32 and ext2. A third file system worth looking at is ISO9660.
FAT32 (File Allocation Table 32) is a file system developed by Microsoft for use in its Windows operating systems. It is a simple and widely supported file system that can be used on a wide variety of devices, including USB drives, memory cards, and external hard drives.
FAT32 organizes files into clusters, which are groups of contiguous sectors on the disk. Each cluster is allocated to a file, and the file’s data is stored in that cluster. The FAT32 file system uses a file allocation table (FAT) to keep track of which clusters are allocated to which files.
One limitation of FAT32 is its maximum file size, which is limited to 4GB. This makes it unsuitable for storing large files such as video files or disk images. However, FAT32 remains a popular file system for smaller storage devices due to its simplicity and widespread compatibility.
ext2 (Second Extended File System) is a file system developed for use in Linux operating systems. It was designed to improve upon the original ext file system by adding support for larger file sizes and improving file system performance.
ext2 organizes files into blocks, which are groups of contiguous sectors on the disk. Each block is allocated to a file, and the file’s data is stored in that block. The ext2 file system uses an inode structure to keep track of file metadata, such as permissions, ownership, and timestamps.
One of the advantages of ext2 is its support for larger file sizes, up to 2TB. It also provides better performance than some other file systems, particularly for large files. However, ext2 does not support journaling, which means that file system errors can result in data loss or corruption.
In summary, file systems are an essential component of data storage and retrieval, and each file system has its own specific characteristics and advantages. FAT32 is a simple and widely supported file system that is popular for smaller storage devices, while ext2 is a more advanced file system that is commonly used in Linux environments.
A third type of file system that is extremely simple, is the ISO9660 file system. It is the standard file system used in CD-ROMs. IS9660 was developed in 1986 and is widely used for distributing software, music, and other digital content on CDs. ISO9660 organizes files into a hierarchical directory structure and limits file names to 8 characters with a 3 character extension.
This lesson explains the inner working of these three file systems and uses these exemplars to illustrate common file system mechanisms and design approaches.
General-purpose operating systems typically provide access to block storage via a file system, which provides a more application- and user-friendly interface to storage. From the point of view of the user, a file system contains objects such as the files themselves as well as directories (also often referred to as folders on some operating systems) and other supporting objects and the operations on these objects. These are organized inside of a name space, the set of names identifying objects. Each file system object also has a set of properties.
File systems have traditionally used a tree-structured name space, as illustrated below:
This tree is constructed via the use of directories, or objects in the name space which map strings (i.e., individual names) to further file system objects, as shown below:
A full file name thus specifies a path from the root, through the tree, to the object (a file or directory) itself such as the one shown for the file /home/pjd/.profile. (This is why the term “path” is used for “file name” in Unix documentation: it’s literally the path through the tree to find a file.) Because there are no duplicate object names allowed at any level of the tree, a path guarantees a unique identifier for every object in the file system.
Incidentally, a URL (and, likewise, a URI) also use a hierarchical naming strategy as objects on the web were traditionally files in the file system of a web server. A URL also includes the retrieval protocol, e.g., http, https, or S3. An example of a URL “path” is:
http://artificium.us/lessons/92.systems/l-92-505-reliability/l-92-505.html
In most modern operating systems the contents of a file is a sequence of some number of 8-bit bytes, and nothing else. Any structure to the file – such as a JPEG image, an executable program, or a database – is the responsibility of applications which read and write the file. The file format is commonly indicated by a file extension like .jpg or .xml, but this is just a convention followed by applications and users. After all, you can rename file.pdf to file.jpg, which will confuse some applications and users, but have no effect on the file contents. Operating system often use the extension as a way to know which application to automatically launch to open a file.
Data in a byte-sequence file is identified by its offset (in bytes) in the file; it can be created by a write which appends more data to the end of a shorter file, and modified by over-writing in the middle of a file. However it can’t be “moved” from one offset to another. For example, if you use a text editor to add or delete text in the middle of a file, the editor must re-write the entire file (or at least from the modified part to the end).
Files didn’t use to be so simple – early operating systems supported a multitude of file types, offering record structures, indexes and sorting, and other functionality. For example, ISAM (Indexed Sequential Access Method) is a file structure used in the IBM z/OS operating system for mainframe computers. It is a type of file format and structure that provides indexed sequential access to records stored in a file.
In an ISAM file structure, records are stored in fixed-length blocks, and each block contains one or more records. A record can be accessed sequentially or directly using its index key value. The index is maintained separately from the data and contains the index key values and the block addresses where the records are located.
ISAM provides efficient access to records using its index, and it can also support parallel access to records from multiple users. However, ISAM does not provide full transaction support, meaning that changes to the data may not be atomic and may require manual intervention in case of system failures.
While ISAM is much more complex that files in Unix, ISAM is a quite powerful and efficient file structure because it makes managing large volumes of data efficient.
In Unix (and most other operating systems) each process has an associated current directory (often referred to as the “present working directory” and obtainable via the pwd command), which may be changed via the chdir system call (or cd command). Filenames beginning with “/” are termed absolute names and are interpreted relative to the root of the naming tree, while relative names are interpreted beginning at the current directory.
In addition each directory contains two special names1:
“..” refers to the parent of the directory it is in. Thus, /home/pjd/.. refers to the same directory as /home.
“.” refers to the directory itself. /home/pjd/. and /home/pjd/././. both refer to /home/pjd. (Note that “.” is also a legal part of a file name: only the names “.” and “.. by themselves are special and must be specified by themselves.)
An alternative to files and directories to allow multiple names for a file is what is known as a symbolic link, which is a third file system object which holds a text string, and is interpreted as a “pointer” to another location in the file system. When the kernel encounters a symbolic link while searching for a file, it substitutes this text into the current portion of the path, and continues the translation process. Thus, if we have:
Directory: /usr/program-1.0.1
File: /usr/program-1.0.1/file.txt
Symbolic link: /usr/program-current -> "program-1.0.1"
Using this information, if the OS is looking up the file /usr/program-current/file.txt, it will:
Note that a symbolic link may be broken if the file it refers to doesn’t exist. This can happen if the link was created in error, or the file or directory it points to is later deleted. In that case, path translation will fail with an error:
A typical system may provide access to several file systems at once, like a local disk and an external USB drive or network volume. There are two common operating system approaches when naming files in different file systems:
Windows identifies the file system explicitly using a “drive letter”, e.g., “Z:” or “C:”. Note that Windows 10 and later actually uses a mount table-like approach, with a unified name space at the lowest levels; device:\path naming used in the Win32 API is then translated into this name space.
On a related note, Windows uses the backslash character as a directory separator in paths, although the forward slash is also understood when calling file system operations from within a program, i.e., c:\users\mjjghs\downloads\bar.txt is the same as c:/users/mjjghs/downloads/bar.txt. Absolute path names must start with the drive letter.
Unix attaches file systems to each other via a process termed mounting, as is done in Unix-like operating systems. In this case the operating system “attaches” a file system to a “parent” directory on another file system, so that the it shows up as a tree under that parent. This is implemented via a mount table that tracks mount points to child file systems; when translating a name, this table is consulted before passing the request to the underlying file system.
The contents of the Unix (MacOS) mount table shown here (as output from the mount command) shows the root file system (ext3), two disk images, and one MS-DOS formatted USB drive.
There are several common types of file operations operations supported by Linux (and with slight differences, Windows). They can be classified into four main categories:
For more information on the operations listed in the following subsections, consult your Unix system’s man pages.
In order to access a file in Unix (or most operating systems) you first need to open the file, passing the file name and other parameters and receiving a handle (called a file descriptor in Unix) which may be used for further operations. This allows the operating system to (among other things) look up the location of the file, and store that and other information in a location identified by the handle, so that further reads or writes can skip that lookup process.
The open/close operations for C are:
int desc = open(name, O_READ)
Verify that the file name exists and you are allowed to read it, then return a descriptor which may be used for reading (but not writing) that file.
int desc = open(name, O_WRITE | flags, mode)
Verify permissions and open name for writing, creating it (or erasing existing contents) if necessary as specified in flags. Returns a descriptor which may be used for writing to that file.
close(desc)
Destroy the indicated desc if possible and free any resources allocated to it.
Each file descriptor has a current position, which is the offset in the file at which the next read or write operation will take place. This is initialized to 0 (i.e. the beginning of the file) by the open system call. (Unless you use the O_APPEND flag, in which case the position is set to the end of the file.)
The read/write operations are:
n = read(desc, buffer, max)
Read max bytes (or fewer if end-of-file) into buffer, starting at the current position, and returning the actual number of bytes (n) read. The current position is incremented by n.
n = write(desc, buffer, len)
Write len bytes from buffer into the file, starting at the current position, and incrementing the current position by len.
lseek(desc, offset, flag)
Set desc’s current position to that specified by offset and flag, which specifies whether offset is relative to the beginning, end, or current position).
Most programs either read or write a file from the beginning to the end, in which case the only operations needed are read or write. These functions don’t just work on real files — they work with anything that accepts or generates a stream of data, like another program, a terminal, or a network connection. Programs that must write to arbitrary locations in a file (e.g., databases like SQLite or MySQL) do this by using lseek
to move the current position before reading or writing.
A key feature of Unix is that most programs use only read and write, and can have their input and output redirected in ways that the original author may not have dreamed of. One reason for this consistency is probably because out-of-order access is just complicated enough that developers only use it when they actually need to.
The Windows equivalents to these functions are found in the Win32 API, e.g.,
OpenFile
,ReadFile
, andSetFilePointer
instead ofopen
,read
, andwrite
, as described in the quite detailed Microsoft documentation. For decades Win32 has been a library on top of the actual (and undocumented) Windows system calls; for quite a while now the Win32 interface has been deprecated in favor of higher-level mechanisms which actually skip Win32 and are more “Unix-like”.
In languages that compile to virtual machine code, such as Java, the operating system details are hidden and access to files is universal, but using different methods than described above.
In Unix there is a difference between a name (a directory entry) and the object (file or directory) that the name points to.
The naming and directories operations are:
rename(path1, path2)
Rename an object (i.e., a file or directory) by either changing the name in its directory entry (if the destination is in the same directory) or creating a new entry and deleting the old one (if it’s being moved into a new directory).
link(path1, path2)
Add a hard link to a file. A hard link is an additional directory entry pointing to the same i-node, giving the file two (or more) names. Hard links are peculiar to Unix, and in modern systems have mostly been replaced with symbolic links; however Apple’s Time Machine makes very good use of them. (Each backup is a “shadow” of the entire file system, with hard links for files which are unchanged.)
unlink(path)
delete a file (sort of). If there are multiple hard links to a file, then this just removes one of them; the file isn’t deleted until the last link is removed. Even then it might not be removed yet – on Unix, if you delete an open file it won’t actually be removed until all open file handles are closed. In general, deleting open files is a problem: while Unix solves the problem by deferring the actual delete, Windows solves it by protecting open files so that they cannot be deleted.
desc = opendir(path)
, readdir(desc, dirent*)
: This interface allows a program to enumerate names in a directory, and determine their type (i.e., file, directory, symbolic link, or special-purpose file).
mkdir(path)
, rmdir(path)
: Directory operations to create a new, empty directory, or delete an empty directory.
stat(file, statbuf)
, fstat(desc, statbuf)
Returns file attributes, like size, owner, permissions, modification time, etc. in statbuf. Remember that in Unix these are attributes of the file itself and can’t be found in the directory entry.
To store a file system on a physical disk, the high-level objects (directories, files, symbolic links) must be translated into fixed-sized blocks identified by logical block addresses.
Note that instead of 512-byte sectors, file systems traditionally use disk blocks, which are some small power-of-two multiple of the sector size, typically 1KB, 2KB, or 4KB. Reading and writing is performed in units of complete blocks, and addresses are stored as disk block numbers rather than LBAs (and multiplied by the appropriate value before being passed to the disk).
Designing on-disk data structures is complicated by the fact that for various reasons (virtual memory, disk controller restrictions, etc.) the data in a file needs to be stored in full-disk blocks – e.g. bytes 0 through 4095 of a file should be stored in a single 4096-byte block.
In general, file systems can be categorized by the different solutions their designers have found to three primary problems:
Identification: how to find objects (files, directories)
Organization: how to find the data within a file
Allocation: how to allocate free space for creating new files
In the figure, you can see an example of a very simple file system, similar to early versions of the ISO-9660 CD-ROM file system.
Objects on disk are either files or directories, each composed of one or more 2048-byte blocks; all addresses are in terms of block numbers, with blocks numbered from block 0 at the beginning of the disk. There are no hard links - each object has exactly one name – and the type of an object is indicated in its directory entry. (The only exception to the “one name” rule is the root directory, which has no name; however it is always found at the beginning of the disk.)
Why the 2048 byte block size? Because back in 1988 the people who wrote the CD-ROM standard decided it would be 2048 bytes, even though pretty much every disk at the time had 512-byte sectors. (or 256 bytes, for floppy disks) The CD-ROM format includes large strong error-correcting codes, to reduce data loss due to scratches, and these codes work much better on larger blocks, so the smaller sizes weren’t practical (and 2KB probably seemed huge already).
Directory entries include the file name, its starting block number, and length. Finally, all objects are contiguous, allowing them to be identified by start and length. Solutions to the three problems are:
Identification: Files are identified by their starting block number.
Organization: Blocks in a file are contiguous, so an offset in the file can be found by adding to the starting block number.
Allocation: Free space is not tracked. (in fact in a typical first-generation CD-ROM there is no free space - once all the directories and files are written, the data just stops and the rest of the disk is blank)
This organization works for a read-only file system, where all files (and their sizes) are known when the file system is created. It would be inefficient as a read-write file system, as (a) it would quickly fragment, making it impossible to create large files, and (b) locating free space for new files would be very inefficient.
The MS-DOS (or FAT, File Allocation Table) file system is a common file system for USB Drives despite being developed in the 1980s. It which organizes blocks within a file in a linked list:
In order to allow disk access to be performed in multiples of power-of-two disk blocks, pointers are kept externally to the blocks:
This table of pointers is called the File Allocation Table, or FAT. Entries in the FAT can indicate (a) the number of the next block in a file or directory, (b) that the block is the last one in a file or directory; or (c) the block is free.
Directories are similar to the CD-ROM file system where each entry has a name, the object type (file or directory), its length, and the starting address of the file contents. Note that although the last block of a file can be identified by a flag in the FAT, the length is still needed to know how much of the last block is valid (e.g. a 5-byte file will use one block, but only 5 bytes in that block). Each directory entry is a fixed size and has a field indicating whether it is valid; to delete a file this field is set to invalid and the blocks are marked as free in the file allocation table.
Like most file systems, linear search is used to locate a file in a directory. This is usually reasonably efficient, but works poorly for very large directories. (That’s why your browser cache has filenames that look like ab/xy/abxy123x.dat, instead of putting all its files in the same directory.)
The FAT table file system layout solves the three problems by:
Identification: Files/directories are identified by starting block number.
Organization: Blocks within a file are linked by pointers in the FAT.
Allocation: Free blocks are marked in the FAT, and linear search is used to find free space.
Most Unix file systems (including Linux ext2 and ext3) use a separate structure called an inode (short for indirect node, also sometimes abbreviated i-node) with a tree-like structure holding pointers to the blocks in the file. As shown here, a series of larger trees are used, allowing small files to avoid the space and time overhead of using multiple levels of indirection.
The inode contains a small number of direct block pointers (12 in ext2), so that files of 12 blocks or less need no indirect blocks. It then contains pointers to a series of trees of height 1, 2, and 3 (not shown), used for progressively larger files. In ext2, which uses 4-byte block numbers, the 3-level tree imposes a maximum file size of 4 TB if 4 KB blocks are used.
This organization allows random access within a file with overhead O(log N) where N is the file size; in contrast, accessing the middle of a large file in the MS-DOS file system requires traversing the FAT linked list.
Using an inode also allows the name (which could be the directory entry) of a file to be separate from the file itself (the inode and content blocks), allowing multiple hard links to the same file, with a single owner and permissions stored in the inode.
The ext2 file system has a more complicated layout and allocation mechanism in order to increase performance. At the basic level allocation is done via an allocation bitmap: a table with an entry (typically a bit) for each block indicating whether the block is free or in use, which can be scanned to find one or more free blocks. Another bitmap is used for inodes, and works the same way.
The disk is partitioned into block groups, each of which looks like a miniature file system with its own bitmaps, inodes, and data blocks. Since each block group is a small fraction of the entire disk, seeking within a block group is much faster (on a hard drive) than seeking from one end of the disk to the other.
To open and read a file it is usually necessary to read the contents of the directory containing the file (like the data blocks of that directory), and then the inode of the file, and finally the data blocks of the file itself. ext2 tries to keep each of these in the same block group; to distribute usage evenly across the disk, new directories are randomly placed in a different block group.
Solutions to the three problems are (for ext2 and other file systems derived from the Berkeley Fast File System):
Identification: Files/directories are identified by inode number, which translated directly to a block number.
Organization: Blocks within a file are located via pointers from the inode.
Allocation: Free blocks are tracked in a free-space bitmap, and block groups are used to keep blocks from the same file near to each other, their inode, and their directory.
When a new file system is created (often called “formatting the disk”) it is often possible to specify a number of parameters, such as:
Block size: Most file systems can be formatted with different block sizes, and the OS needs to know this size before it can interpret any pointers given in terms of disk blocks. Historically larger blocks were used for performance and to allow larger file systems, and smaller blocks for space efficiency. In recent years disk drives have transitioned to using an internal block size of 4KB, while keeping the traditional 512-byte sector addressing, so any file system should use a block size of at least 4KB.
Version: Including a version number allows backwards compatibility as a file system evolves. That way you can upgrade your OS, for instance, without reformatting your disk. (it makes the file system code uglier, but that’s life…)
Other parameters: In the MS-DOS file system the OS needs to know how large the FAT table is, so that it doesn’t accidentally go off the end and start looking at the first data block. In ext2 you need to know the sizes of the block groups, as well as the bitmap sizes, how many inodes are in each group, etc.
Dirty flag: When a file system is mounted, this flag is set; as part of a clean shutdown, the flag is cleared again. This allows for additional error checks to be performed before mounting in the event of a system crash or unclean termination of part or all of a file system.
These parameters are typically placed in a block (called the superblock) at the beginning of the disk, so that the OS can find it and get the information it needs to interpret the rest of the file system.
The ext2 and MS-DOS file systems contains pointers to every data block in a file, located in inodes and indirect blocks in the case of ext2, and in the file allocation table in MS-DOS. Yet implementations of both file systems go to great lengths to ensure that blocks are allocated in sequence, to allow reading and writing to be done at high speed without interruption by disk seeks.
There’s a way around this, though. Instead, the same sequence of data blocks can be represented by a start address and a count (like a file in the CD-ROM file system, except that in this case it’s only an arbitrary chunk of a file). That’s exactly how files are represented in extent-based file systems: as lists of extents. Each extent specifies the start and length of a file. If that list gets too long, you make (you guessed it) a list of extents holding more lists of extents.
An example of an extent-based file system is NTFS. In NTFS, the superblock contains a pointer to the start of a Master File Table (MTF) which resembles the inode table in ext2 – each file or directory has an entry in the table which holds thinks like permissions, timestamps, and block information. The first entry in the MFT is a description of itself so that it can grow as needed. Additional entries are structured as a set of attributes, with a $Data attribute holding a list of extents describing the file (or for very small files, holding the file data itself). As the file grows, eventually the list of extents making up the $Data attribute grows too large to fit into the file entry (the equivalent of an inode) and an $ATTRIBUTE_LIST field is created to hold a list of extents describing the blocks holding the list of extents describing the file. This can continue for one more level, which is enough to support files up to 16TB.
Free space is handled similarly, as a list of extents sorted by starting block number; allowing the free space list to be easily compacted with storage is freed. The advantage of this organization is that it is easy to minimize file fragmentation, reducing the number of disk seeks required to read a file or directory. The disadvantage is that random file access is somewhat more complex and appears to require reading the entire extent list to find which extend an offset may be found at.
Solutions to the three problems?
Identification: Files/directories are identified using the master file table entries.
Organization: Blocks within a file are located by searching the list of extents.
Allocation: Free blocks are tracked in another list of extents.
In the CD-ROM, MS-DOS, and ext2 file systems, a directory is just an array of directory entries, in unsorted order. To find a file, you search through the directory linearly; to delete a file, you mark its entry as unused; finally, to create a new entry, you find any entry that’s free. (In practice, it’s a bit more complicated for ext2 because of variable-length filenames, but the concept remains the same.)
From your data structures class you should realize that linear search isn’t an optimal data structure for searching, but it’s simple, robust, and fast enough for small directories, where the primary cost is retrieving a block of data from the disk. As an example, one of Peter Desnoyer’s Linux machines has 94944 directories that use a single 4KB block, another 957 that use 2 to 5 blocks, and only 125 larger than 5 blocks. In other words, 99% of the directories are small enough that a more complex data structure and search algorithm (a) won’t result in reading less data from disk, and (b) probably won’t run any faster than linear search.
However the largest directories are actually quite big: the largest on this machine, for example, has 13,748 entries. Linear search on a data structure of this size is expensive. As a result, modern file systems tend to use more advanced data structures for their directories. NTFS (and Linux Btrfs) use B-trees, a form of a balanced tree.
B-trees are commonly used in on-disk structures. B-trees can use large blocks containing many entries; if the block size is m entries – referred to as a B-tree of order m, then each block in the B-tree holds between m/2 and m entries. In the following video, m = 2 which is demonstrated using a visualizer from USFCA. In a real system each node would be an entire disk block.
The short narration below by Dr. Maria Jump (formerly Northeastern University, Boston) explains the use of B-Trees for disk structures and superior access performance.
Other file systems, like Sun ZFS, use hash tables for their directories, while ext4 uses a hybrid hash/tree structure.
This lesson explained how file systems organize and name objects such as files and directories using a hierarchical naming strategy. It also discussed file types, including the traditional sequence of 8-bit bytes, and the history of more complex file structures like ISAM. Additionally, the lesson covered file names, including absolute and relative names, and special names like “..” and “.”. Finally, it defined symbolic links and how they work as pointers to other locations in the file system.
File systems are used to store high-level objects like directories, files, and symbolic links as fixed-sized blocks on a physical disk. There are different types of file systems categorized by the solutions used to identify objects, organize data within files, and allocate free space. Examples of different file systems include the read-only ISO-9660 CD-ROM file system, the MS-DOS (FAT) file system that uses linked lists, and the Unix file system that uses a separate structure called an inode with a tree-like structure holding pointers to the blocks in the file.
Desnoyers, Peter (Fall 2020). How Operating Systems Work, Unpublished Manuscript. Fall 2020 Edition, Northeastern University.
This lesson is based on work by Desnoyers, Peter (Fall 2020). How Operating Systems Work, Fall 2020 Edition, Northeastern University, that is provided under Creative Commons License 4.0.
Inspiration for some sections was drawn through interactions with ChatGPT-3.5.
None collected yet. Let us know.
Note that in these explanations, double quotes enclose the single and double dots used to denote file system paths; that is simply for the purpose of clarity in these materials and is not a convention used in file names.↩︎