Upon completion of this lesson, you will be able to:
In this tutorial, you will learn how to use indexes in SQLite to speed up data retrievals (SELECT), improve sorting (ORDER BY), and enforce uniqueness constraints (UNIQUE). An index is an additional data structure stored in the databases that is associated with a single table with the goal to improve the performance of a query. A table may have multiple indexes. Most databases automatically create an index for the primary key, although SQLite does not, while MySQL does.
There are several data structures that may be used for indexing: hashmaps and trees. SQLite uses self-balancing B-trees for the structure of its indexes. A B-tree is a self-balancing tree data structure that maintains its data nodes in sorted order and allows searches in \(O(log_n)\) (logarithmic) time. The B-tree has generally more than 2 children per node. B-trees are well suited for storage systems that read and write relatively large blocks of data and is therefore commonly used in databases and file systems. B-trees were invented in 1972 by Rudolf Bayer and Edward M. McCreight (Bayer & McCreight, 1972). During access operations, the entire index, in the form of a B-tree, must be kept in memory, although the data remains on disk. Thus, access via a B-tree, reduces how much of the actual data must be loaded into memory, which results in less memory being used and far fewer disk accesses (which are comparatively slow, although that distinction is becoming less pronounced as traditional mechanical disks are being replaced with solid-state storage devices
For demonstration, let’s open an existing SQLite database.
## [1] 0
Prior to using an index, the index must be explicitly created. We will create an index on the column LastName in the table customers. The full definition of the table is shown below (and is the result of a .schema customer SQLite operation):
CREATE TABLE IF NOT EXISTS "customers" ( [CustomerId] INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL, [FirstName] NVARCHAR(40) NOT NULL, [LastName] NVARCHAR(20) NOT NULL, [Company] NVARCHAR(80), [Address] NVARCHAR(70), [City] NVARCHAR(40), [State] NVARCHAR(40), [Country] NVARCHAR(40), [PostalCode] NVARCHAR(10), [Phone] NVARCHAR(24), [Fax] NVARCHAR(24), [Email] NVARCHAR(60) NOT NULL, [SupportRepId] INTEGER, FOREIGN KEY ([SupportRepId]) REFERENCES "employees" ([EmployeeId]) ON DELETE NO ACTION ON UPDATE NO ACTION );
An index is created with the CREATE INDEX
operation.
The general form for CREATE INDEX
is shown below:
CREATE [UNIQUE] INDEX index_name [ASC|DESC] ON table_name(column_list);
Indexes are critical for non-prime columns when those columns are frequently in joins or searches. A B-Tree based index makes searching on equality (=) as well as ranges (< and >) significantly quicker. If a search is based on some specific single column, then an index on that column is warranted. However, if a search involves a compound key, i.e., several columns, then the index should be created on these columns – one index for all columns and not separate indexes on each column.
While indexes speed up searching dramatically, they are also useful in detecting duplicate values. The CREATE UNIQUE INDEX
statement is an index for that purpose. It generally is implemented with a hashmap or a hashtable rather than a B-Tree.
Indexes are not useful when pattern matching searches are performed with LIKE.
An index is often used when an ORDER BY clause is present in a query as the in-order traversal of a B-Tree results in an ordered set of leaf nodes. When creating an index, each column name or expression can be followed by one of the “ASC” or “DESC” keywords to indicate sort order.
Indexes are only useful when a direct match search is done on an indexed column in the WHERE clause. SQLite, like all other relational databases, automatically uses an index when available and when the use of an index improves performance of a query.
LastName | FirstName |
---|---|
Johansson | Joakim |
The EXPLAIN QUERY PLAN
operation can make the use of indexes explicit. The query plan below shows that the column is searched rather than scanned linearly, and that the index is used.
id | parent | notused | detail |
---|---|---|---|
3 | 0 | 0 | SEARCH customers USING INDEX custLN (LastName=?) |
If an approximate search using LIKE is performed, then an index is not useful and the table is scanned row by row. The query plan below shows that the table is scanned and the index is ignored.
id | parent | notused | detail |
---|---|---|---|
2 | 0 | 0 | SCAN customers |
Note that SQLite, unlike many other databases, does not automatically create an index for the primary key, but it does build a hashmap on the primary to quickly determine uniqueness.
id | parent | notused | detail |
---|---|---|---|
2 | 0 | 0 | SEARCH customers USING INTEGER PRIMARY KEY (rowid=?) |
To delete an index, use the DROP INDEX
statement.
The general form for CREATE INDEX
is shown below:
DROP INDEX [IN EXISTS] INDEX index_name;
Dropping an index is often advantageous when inserting a large number of rows as might be the case during bulk loading of new data or re-building an OLAP database. It is often faster to drop the index, perform the insertions, and then create the index anew.
To list all of the created and available indexes for some specific, use one the commands below:
seq | name | unique | origin | partial |
---|---|---|---|---|
0 | custLN | 0 | c | 0 |
1 | IFK_CustomerSupportRepId | 0 | c | 0 |
Another way to get all indexes from a database is to query from the sqlite_master table:
type | name | tbl_name | sql |
---|---|---|---|
index | sqlite_autoindex_playlist_track_1 | playlist_track | NA |
index | IFK_AlbumArtistId | albums | CREATE INDEX [IFK_AlbumArtistId] ON “albums” ([ArtistId]) |
index | IFK_CustomerSupportRepId | customers | CREATE INDEX [IFK_CustomerSupportRepId] ON “customers” ([SupportRepId]) |
index | IFK_EmployeeReportsTo | employees | CREATE INDEX [IFK_EmployeeReportsTo] ON “employees” ([ReportsTo]) |
index | IFK_InvoiceCustomerId | invoices | CREATE INDEX [IFK_InvoiceCustomerId] ON “invoices” ([CustomerId]) |
index | IFK_InvoiceLineInvoiceId | invoice_items | CREATE INDEX [IFK_InvoiceLineInvoiceId] ON “invoice_items” ([InvoiceId]) |
index | IFK_InvoiceLineTrackId | invoice_items | CREATE INDEX [IFK_InvoiceLineTrackId] ON “invoice_items” ([TrackId]) |
index | IFK_PlaylistTrackTrackId | playlist_track | CREATE INDEX [IFK_PlaylistTrackTrackId] ON “playlist_track” ([TrackId]) |
index | IFK_TrackAlbumId | tracks | CREATE INDEX [IFK_TrackAlbumId] ON “tracks” ([AlbumId]) |
index | IFK_TrackGenreId | tracks | CREATE INDEX [IFK_TrackGenreId] ON “tracks” ([GenreId]) |
Searching using substring matching or the keyword LIKE cannot take advantage of available indexes and should be avoided as they will involve linear table scans.
When new data is inserted or data is updated or deleted, indexes on the table where the modification occurred must be re-arranged which can be time consuming. The worse case run-time behavior of updating a B-Tree is \(O(n^2)\) which can lead to significant performance reduction.
Consequently, indexes should be dropped prior to loading data in bulk during ETL as they significantly reduce insertion performance. After ETL, indexes must be created again.
Searching on expressions does not use indexes unless a specific index on the expression is created
This experiment looks at the difference in retrieval performance with and without an index on a large table.
This experiment looks at the performance degradation when inserting new data into a table with an index versus no index.
Indexes also improve sorting, so this experiment will look at that benefit of indexes.
Indexing is an important performance improvement option for large tables where searches are performed frequently on non-prime attribute and a reduction in write performance is acceptable. SQLite offers two options for indexes: one for efficient searching and another for duplicate detection.
Bayer, R. and McCreight, E. (1972). Organization and maintenance of large ordered indexes. Acta Informatica, 1(1), 73-189.