Star Schema Design
A star schema is a widely used data warehouse design approach that organizes data into a central fact table and surrounding dimension tables. An OLAP database may contain multiple star schemas. Each fact table is for a particular “subject”, i.e., it is “subject-oriented”. For example, one fact table for a retail data warehouse might be organized around sales facts (total sales per quarter, sales per sales person, sales per region, sales per product), while another fact table might contain facts about sales agents (total sales per agent, average sales per agent per region, number of units sold by each sales agent). Different fact tables can have different dimensions; the “by” or “per” clause of a fact names a dimension. For example, the fact “average sales per sales person” means that “sales person” is a dimension.
The fact table contains the quantitative, measurable data, while the dimension tables contain the descriptive attributes that provide context for the facts. The star schema gets its name from the visual representation, where the fact table is at the center, and the dimension tables radiate outward like the points of a star, when drawn as an Entity-Relationship Diagram (ERD). The example ERD below illustrates this.
(Diagram: example star schema ERD, with the fact table at the center and dimension tables radiating outward; link to diagram source.)
The star schema is optimized for querying and data retrieval, as it minimizes the number of joins and groupings required to access the data. This is achieved by denormalizing the dimension tables, which means that redundant data is stored in the dimension tables to avoid the need for additional joins.
Dimension tables surround the fact table in a star schema and provide the descriptive attributes that give context to the facts. Each dimension table contains a primary key that is referenced by the foreign keys in the fact table. This relationship allows users to analyze the facts based on different dimensions, such as product categories, store locations, or time periods.
Dimension tables are typically smaller than fact tables and are denormalized to reduce the number of joins required during query execution. They contain hierarchical and descriptive attributes that enable users to drill down or roll up the data at different levels of detail. The diagram below shows the linking of a fact table with dimension tables.
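As a concrete illustration, here is a minimal DDL sketch of such a schema. All table and column names are illustrative assumptions, not taken from the diagrams; note that the date dimension is denormalized, storing day, month, quarter, and year together:

```sql
-- Hypothetical denormalized date dimension: day, month, quarter, and
-- year are stored together to avoid extra joins at query time.
CREATE TABLE date_dim (
  date_key   INT PRIMARY KEY,
  full_date  DATE,
  day        INT,
  month      INT,
  quarter    INT,
  year       INT
);

CREATE TABLE product_dim (
  product_key  INT PRIMARY KEY,
  product_name VARCHAR(100),
  category     VARCHAR(50)
);

CREATE TABLE store_dim (
  store_key  INT PRIMARY KEY,
  store_name VARCHAR(100),
  city       VARCHAR(50),
  country    VARCHAR(50)
);

-- Central fact table: one row per product per store per day,
-- with foreign keys pointing at the surrounding dimensions.
CREATE TABLE sales_fact (
  date_key      INT REFERENCES date_dim (date_key),
  product_key   INT REFERENCES product_dim (product_key),
  store_key     INT REFERENCES store_dim (store_key),
  sales_amount  DECIMAL(12,2),
  quantity_sold INT,
  PRIMARY KEY (date_key, product_key, store_key)
);
```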
By joining the fact table with the dimension tables (a sample query is sketched after this list), users can answer questions like:
- What were the total sales for “Corticosteroids” in “USA” last month?
- How does the sales performance compare across different store locations for a given time period?
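The first question might be expressed against the illustrative schema sketched above roughly as follows; the category value, country value, and the specific year and month are assumptions for the example:

```sql
-- Total sales for the "Corticosteroids" category in the USA
-- for one month, joining the fact table to three dimensions.
SELECT SUM(f.sales_amount) AS total_sales
FROM sales_fact f
JOIN product_dim p ON f.product_key = p.product_key
JOIN store_dim   s ON f.store_key   = s.store_key
JOIN date_dim    d ON f.date_key    = d.date_key
WHERE p.category = 'Corticosteroids'
  AND s.country  = 'USA'
  AND d.year = 2022 AND d.month = 12;
```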
Some designers forgo the use of dimension tables and merge the dimension information into the fact table. This makes the design less flexible and complicates adding new dimensions, but avoids additional joins during queries.
The choice of which facts to include in a table is based on working with subject matter experts and users. The facts can change over time, and it is likely that new facts must be added (requiring a schema change during the next loading of the data warehouse).
It is common to index all columns in all fact and dimension tables to make joins more efficient. In addition, some designers choose to add “roll-up” rows in some dimension tables, most often those pertaining to dates.
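For instance, the foreign-key columns of the illustrative fact table above might be indexed as follows (the index names are hypothetical):

```sql
-- Index the fact table's foreign-key columns so joins to the
-- dimension tables can use index lookups instead of full scans.
CREATE INDEX idx_sales_fact_date    ON sales_fact (date_key);
CREATE INDEX idx_sales_fact_product ON sales_fact (product_key);
CREATE INDEX idx_sales_fact_store   ON sales_fact (store_key);
```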
The star schema design, with fact tables at the center and dimension tables surrounding them, provides a foundation for efficient querying and analysis of large-scale data in data warehousing and OLAP systems.
Granularity of Fact Tables
Granularity refers to the level of aggregation or summarization at which data is stored in a fact table. It determines the lowest level of detail available for analysis and affects the size of the fact table and the types of queries that can be answered. Choosing the appropriate granularity is crucial for ensuring that the data warehouse meets the business requirements and performs efficiently. For example, in the above fact table we stored dates at the level of “days” but also provided roll-ups for quarter and year, although that is not strictly necessary and the roll-up can be computed at run-time. Storing the roll-ups can speed up analytical queries but requires more storage.
To determine the appropriate level of granularity, consider the following factors:
- Business requirements: Identify the most granular level of data needed to answer the business questions and support the desired analytics.
- Query performance: Consider the impact of granularity on query performance. More granular data results in larger fact tables, which can affect query response times. However, this can be mitigated through the use of roll-up dimensions.
- Storage costs: Assess the storage requirements for different levels of granularity. More granular data consumes more storage space.
- Data availability and quality: Evaluate the availability and quality of data at different levels of detail. Granular data may not always be available or reliable.
Example: Daily sales vs. monthly sales
Consider a retail sales data warehouse. The business may require analysis at different levels of granularity:
- Daily Sales: Storing sales data at the daily level allows for detailed analysis of day-to-day performance, seasonality, and trends. However, this level of granularity results in a larger fact table.
- Monthly Sales: Storing sales data at the monthly level provides a higher-level overview of performance and trends. The fact table will be smaller, but drilling down to daily details will no longer be possible.
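With a daily-grain fact table, monthly figures can still be produced at query time. A sketch of such a roll-up, using the illustrative schema and its denormalized date dimension:

```sql
-- Roll daily-grain facts up to monthly totals at query time,
-- using the month and year columns stored in date_dim.
SELECT d.year, d.month, SUM(f.sales_amount) AS monthly_sales
FROM sales_fact f
JOIN date_dim d ON f.date_key = d.date_key
GROUP BY d.year, d.month
ORDER BY d.year, d.month;
```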
Additivity of Facts
Facts (“measures”) in a fact table can be classified based on their additivity. Additive measures can be summed up along any dimension without losing meaning. Examples include sales amount and quantity sold. Semi-additive measures can be summed up along some dimensions but not others. For example, account balances can be summed up across accounts but not over time. Non-additive measures cannot be summed up along any dimension. Examples include ratios, percentages, and averages.
Non-additive measures should be handled differently in fact tables. Store non-additive measures in separate fact tables or as attributes in dimension tables, or calculate them on the fly during query execution using SQL expressions, business intelligence tools, or a programming language (such as R or Python when using Markdown for report generation). Use appropriate aggregation functions, such as AVG for averages or LAST for ending balances.
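Two sketches of this idea follow: computing a non-additive average on the fly so it stays correct at any aggregation level, and picking an ending balance for a semi-additive measure. The balance_fact table and its columns are hypothetical, and date_key is assumed to sort chronologically:

```sql
-- Non-additive measure computed on the fly: average unit price
-- derived from two additive measures rather than stored.
SELECT d.quarter,
       SUM(f.sales_amount) / SUM(f.quantity_sold) AS avg_unit_price
FROM sales_fact f
JOIN date_dim d ON f.date_key = d.date_key
GROUP BY d.quarter;

-- Semi-additive measure: ending balance per account, taking the
-- latest balance instead of summing over the time dimension
-- (balance_fact is a hypothetical fact table).
SELECT account_key, balance AS ending_balance
FROM (
  SELECT account_key, balance,
         ROW_NUMBER() OVER (PARTITION BY account_key
                            ORDER BY date_key DESC) AS rn
  FROM balance_fact
) t
WHERE t.rn = 1;
```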
Sparsity
Data sparsity refers to the presence of a high proportion of null or zero values in a fact table. This occurs when there are many possible combinations of dimension values, but not all combinations have associated facts. Sparse fact tables can impact query performance and storage efficiency. While they may avoid some joins for roll-up, they do require special query consideration; e.g., handling NULL in SQL requires IS NULL or IS NOT NULL predicates and may require outer joins.
It is recommended to store only the non-null fact records and their corresponding dimension keys. This reduces the size of the fact table but may require more complex queries and aggregation using SQL or programming constructs. Alternatively, one can use a dense fact table with null placeholders, where the fact table stores all possible combinations of dimension keys, using null or zero values for missing facts. This simplifies queries but increases the fact table size. Of course, a hybrid approach is also possible, combining sparse and dense fact table designs based on the sparsity levels of different dimensions.
For example, consider a retail data warehouse where not all products are sold in all stores every day. The sales fact table may have a high proportion of null values for product-store-date combinations with no sales. A sparse fact table design would store only the non-null sales records, while a dense fact table would include null placeholders for all possible combinations.
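A sketch of how such a sparse design is queried: the missing product-store combinations for one day are reintroduced with an outer join, and COALESCE turns the resulting NULLs into zeros (schema and the date key value are illustrative):

```sql
-- List every product/store pair for a given day, showing zero
-- for combinations that have no row in the sparse fact table.
SELECT p.product_key, s.store_key,
       COALESCE(f.sales_amount, 0) AS sales_amount
FROM product_dim p
CROSS JOIN store_dim s
LEFT JOIN sales_fact f
       ON  f.product_key = p.product_key
       AND f.store_key   = s.store_key
       AND f.date_key    = 20221231;   -- illustrative date key
```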
Partitioning
Partitioning involves dividing a large fact table into smaller, more manageable fact tables based on a specific criterion, such as time or geography. Benefits of partitioning include:
- Improved query performance: Queries can target specific partitions, reducing the amount of data scanned.
- Easier data management: Partitions can be loaded, refreshed, or archived independently.
- Increased availability: Maintenance activities can be performed on individual partitions without affecting the entire fact table.
Common partitioning strategies include:
- Time-based partitioning: Partition the fact table based on a time dimension, such as year, quarter, or month. This is useful for historical analysis and data archiving.
- Geography-based partitioning: Partition the fact table based on a geographic dimension, such as country or region. This can improve query performance for location-specific analysis.
- Hybrid partitioning: Combine multiple partitioning strategies, such as time and geography, for more granular partitioning.
For example, in a retail data warehouse, the sales fact table can be partitioned by year and quarter:
- Partition 1: Sales data for 2022 Q1
- Partition 2: Sales data for 2022 Q2
- Partition 3: Sales data for 2022 Q3
- Partition 4: Sales data for 2022 Q4
This partitioning scheme allows queries to target specific time periods efficiently and simplifies data management tasks, such as archiving old data or loading new data incrementally. Because the tables are smaller, joins and groupings are faster and roll-up queries need to work on smaller sets of data. However, aggregating across partitions can then be more difficult unless the combined large fact table is kept in addition to the partitions. Of course, that would substantially increase storage but provide faster query response time.
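On platforms that support declarative partitioning, this scheme can be expressed directly. A sketch in PostgreSQL syntax, using a variant of the sales fact table keyed by a DATE column (names are illustrative):

```sql
-- Range-partition the sales fact table by date,
-- with one partition per 2022 quarter.
CREATE TABLE sales_fact (
  sale_date     DATE,
  product_key   INT,
  store_key     INT,
  sales_amount  DECIMAL(12,2)
) PARTITION BY RANGE (sale_date);

CREATE TABLE sales_fact_2022q1 PARTITION OF sales_fact
  FOR VALUES FROM ('2022-01-01') TO ('2022-04-01');
CREATE TABLE sales_fact_2022q2 PARTITION OF sales_fact
  FOR VALUES FROM ('2022-04-01') TO ('2022-07-01');
CREATE TABLE sales_fact_2022q3 PARTITION OF sales_fact
  FOR VALUES FROM ('2022-07-01') TO ('2022-10-01');
CREATE TABLE sales_fact_2022q4 PARTITION OF sales_fact
  FOR VALUES FROM ('2022-10-01') TO ('2023-01-01');
```

Queries that filter on sale_date then scan only the relevant partitions, and an old quarter can be archived by detaching or dropping its partition.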
By considering granularity, additivity, sparsity, and partitioning when designing fact tables, it is easier to create a data warehouse that supports efficient querying, that meets business analytics requirements, and that scales effectively as data volume grows.