Upon completion of this lesson, you will be able to:
In this video tutorial, Khoury Boston’s Prof. Schedlbauer provides an overview of the role of data warehouses and their designs as the basis for data analytics and data mining efforts in organizations.
Download Slide Deck: Data Warehousing & OLAP
Organizations have accumulated large databases by collecting data about their day-to-day activities. Such databases range from a few gigabytes to hundreds of gigabytes or even terabytes. Information systems that record and update data in such databases are known as online transaction processing (OLTP) applications and are often supported by relational, hierarchical, and network database management systems.
In contrast to transactional (or operational) databases, data warehouses are databases that consolidate data from multiple sources and augment it with summary information and historical data spanning a long time period. Applications that provide access to this data for analytics and forecasting are often referred to as online analytical processing (OLAP) applications; they include dashboards, data mining, interactive exploration, scenario analysis, forecasting, and reporting.
OLAP refers to the analysis of complex data from the data warehouse, while data mining refers to the automated discovery of new information such as patterns, clusters, or association rules.
A key concept of data warehouses is their integration of data from several sources, which facilitates analytics.
The data loaded into the data warehouse includes not only data from operational databases (often relational), but also data from files (CSV and XML), non-relational databases (e.g., MongoDB, DynamoDB), and applications (often via an XML or JSON bridge). The key point is that the consolidated and integrated data exposes differences between sources and allows for data mining. The data warehouse is often very large (measured in gigabytes, terabytes, and beyond) and includes historical data (not just the latest values) as well as pre-computed and summary data. It is the basis for statistical analysis, data mining, and machine learning, as well as for reporting and exploratory data visualization.
To summarize, a data warehouse:

- is an integrated and consolidated collection of data from multiple sources,
- is primarily queried and is non-volatile,
- contains data records for a long time and retains historical data, and
- provides users with a single consolidated interface to data, enabling decision-support systems.
Because data warehouses are very large and often contain data that is not relevant for every purpose, organizations extract subsets of the data warehouse for specific purposes and around specific subject areas. Such extracts are called data marts. Data marts are subject-specific subsets of the data in the data warehouse geared towards the needs of a particular user.
The process of extracting, transforming, and loading data from different sources into the warehouse is often called ETL (Extraction, Transformation, and Loading).
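To make the ETL idea concrete, here is a minimal sketch in Python. It assumes a hypothetical CSV extract of orders and uses a SQLite database as a stand-in for the warehouse; the file name, column names, and table name are illustrative assumptions, not part of any particular product.

```python
import csv
import sqlite3

# Hypothetical source file and warehouse; names are illustrative only.
SOURCE_CSV = "orders_export.csv"             # extract produced by an operational system
warehouse = sqlite3.connect("warehouse.db")  # SQLite stands in for the warehouse

warehouse.execute("""
    CREATE TABLE IF NOT EXISTS sales_fact (
        order_date TEXT,
        product_id TEXT,
        region     TEXT,
        amount     REAL
    )
""")

# Extraction: read the raw rows from the operational export.
with open(SOURCE_CSV, newline="") as f:
    rows = list(csv.DictReader(f))

# Transformation: harmonize formats (identifiers, category spelling, currency).
def transform(row):
    return (
        row["OrderDate"],                       # assume ISO dates in the extract
        row["ProductID"].strip().upper(),       # normalize identifiers
        row["Region"].strip().title(),          # harmonize category spelling
        float(row["Amount"].replace("$", "")),  # strip currency symbol
    )

# Loading: insert the transformed rows into the warehouse table.
warehouse.executemany(
    "INSERT INTO sales_fact VALUES (?, ?, ?, ?)",
    (transform(r) for r in rows),
)
warehouse.commit()
```

In a production setting the same pattern is typically driven by an ETL tool or scheduler rather than a hand-written script, but the extract, transform, and load phases remain the same.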
Systems that extract high-level information stored in databases and use that information to support a variety of decision-making processes are called decision-support systems (DSS). Some of the tools and techniques used to build a DSS include data warehousing, online analytical processing (OLAP), data mining, and predictive and prescriptive modeling with machine learning.
Unlike OLTP databases that hold operational data, data warehouses are not updated (changed) in real time, only periodically. Redundancy in data warehouses is often deliberately planned in order to improve access.
Bill Inmon, considered to be one of the “fathers” of the modern data warehouse, defines a data warehouse as a subject-oriented, integrated, time-variant, non-volatile collection of data in support of management’s decisions.
A data warehouse is organized around a major subject (domain concept) such as accounts, products, or orders. Unlike an operational database, which supports an application, a data warehouse organizes data according to a subject. For example, an insurance company using a data warehouse might organize its data by customer, premium, and claim rather than by individual policy type (auto sweep policy, joint life policy, etc.).
A data warehouse is always a physically separate store of data. Due to this separation, data warehouses do not need to support transaction processing, recovery, or concurrency control. The data is not overwritten or deleted once it enters the data warehouse; it is only loaded, refreshed, and accessed for queries. The data in the data warehouse is retained for future reporting and trend analytics.
Data is stored in a data warehouse to provide a historical perspective. Consequently, the data in the data warehouse is time-variant, or historical, in nature. It is not unusual for a data warehouse to hold several years’ or even decades’ worth of data in order to facilitate comparative analytics, trend analysis, and forecasting. All values are stored so that reports and visualizations can be produced that show changes over time.
A data warehouse is almost always constructed by integrating multiple, heterogeneous sources of data such as relational databases, XML files, and other flat files. The database contains data from most, or all, of an organization’s operational applications. The data is integrated and harmonized to ensure a consistent view and to reduce data quality issues.
Integration does not come without its challenges. Among the most problematic integration issues are:

- category encoding and formatting
- field/attribute naming
- primary key selection
- duplicate data
- default values for missing data
- imputation of missing data
- inconsistent data with differing values
- filtering and aggregation
A simple example of a data integration issue: data about people is imported from various data sources. In one data source gender is encoded as “m” and “f”, in another as “M”, “F”, “B”, “H”, “U”, and “T”, and in yet another as 1/0. Often, the integration requires rules to be formulated and applied, as sketched below.
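A minimal sketch of such a rule in Python, assuming the three hypothetical source encodings above are mapped onto a single target coding. The target coding, the mapping of “B”, “H”, and “T” to an “other” category, and the “U” (unknown) fallback are illustrative assumptions; a real warehouse team would formulate these rules together with the business stakeholders.

```python
# Illustrative harmonization rule: map each source's gender coding onto one
# target coding used in the warehouse ("M", "F", "X" for other, "U" for unknown).
GENDER_RULES = {
    "source_a": {"m": "M", "f": "F"},
    "source_b": {"M": "M", "F": "F", "B": "X", "H": "X", "U": "U", "T": "X"},
    "source_c": {"1": "M", "0": "F"},   # numeric encoding in the third source
}

def harmonize_gender(source: str, value) -> str:
    """Apply the integration rule for a given source; default to 'U' (unknown)."""
    return GENDER_RULES.get(source, {}).get(str(value).strip(), "U")

print(harmonize_gender("source_a", "f"))   # -> F
print(harmonize_gender("source_c", 1))     # -> M
print(harmonize_gender("source_b", "T"))   # -> X
```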
The harmonization of data from the different sources is done programmatically at the time of ETL. During integration of data from multiple sources, data quality issues are generally resolved, and the updated data is then often copied back to the operational data sources. This process is generally referred to as backflushing.
Normalized relational schemas are two-dimensional: all tables are structurally equal, even though they do not all contain the same amount of data.
The data in a data warehouse that is built on a relational database is usually multidimensional, having measure attributes and dimension attributes. Measure attributes “quantify” some value and can be aggregated; they are often called “facts”. The attributes that define the dimensions along which the measures and their summaries are viewed are called dimension attributes. The facts are stored in special tables called fact tables, while the dimension attributes are stored in dimension tables.
It is important to note that not all data warehouses are built on relational databases; those that are not do not need fact and dimension tables if the database has inherent support for aggregation. Examples of such database types include cube databases.
Multidimensional databases:

- contain summarized data
- use facts, dimensions, and sparsity: values, counts, weights, occurrences
- view data as a “slice of time”, not a transaction
- concentrate on modeling “facts” rather than entities
For example, a “fact” for a data warehouse that aggregates sales data is the sales amount for some product in a particular month. Facts are numbers, e.g., $226,998. A fact table contains many facts: sales for each product aggregated by month, by quarter, and by year, as well as by store and by region, perhaps even by department or by salesperson.
The star schema is the simplest schema for relational data warehouses; it consists of a fact table with a single table for each dimension. The dimension tables are also referred to as lookup tables. A data warehouse may contain numerous star schemas. Fact tables can be very large, and the facts are pre-computed during ETL. That makes the lookup of facts (e.g., “What are the sales for each region for each quarter of last year?”) very quick, as it involves the retrieval of a single column for a few rows. Doing this in an operational database would require a SQL statement with GROUP BY and aggregation functions such as COUNT or SUM, and such statements can be very slow if there is a lot of data. Pre-computation eliminates that work and thus makes browsing and interactive exploration of the data “facts” through dashboards and analytics applications fast. Of course, the facts have to be periodically re-computed whenever the data warehouse is refreshed with new operational data.
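The following Python/SQLite sketch illustrates the idea with a hypothetical sales star schema: one fact table, one dimension table, and a summary table of quarterly totals that is pre-computed during ETL so that dashboards only read it back instead of re-aggregating the detail rows at query time. All table and column names are assumptions made for illustration.

```python
import sqlite3

db = sqlite3.connect(":memory:")

# One dimension table and one fact table of a hypothetical sales star schema.
db.executescript("""
    CREATE TABLE region_dim (region_id INTEGER PRIMARY KEY, region_name TEXT);
    CREATE TABLE sales_fact (region_id INTEGER, quarter TEXT, amount REAL);

    INSERT INTO region_dim VALUES (1, 'East'), (2, 'West');
    INSERT INTO sales_fact VALUES (1, 'Q1', 100.0), (1, 'Q1', 50.0),
                                  (2, 'Q1', 75.0), (2, 'Q2', 25.0);

    -- Pre-computation during ETL: aggregate once and store the summary facts.
    CREATE TABLE sales_by_region_quarter AS
        SELECT region_id, quarter, SUM(amount) AS total_sales
        FROM sales_fact
        GROUP BY region_id, quarter;
""")

# An OLAP query now only looks up the pre-computed rows (no GROUP BY at query time).
for row in db.execute("""
        SELECT d.region_name, s.quarter, s.total_sales
        FROM sales_by_region_quarter s JOIN region_dim d USING (region_id)
    """):
    print(row)
```

The trade-off is the one described above: the summary table must be rebuilt whenever new operational data is loaded into the warehouse.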
Fact tables are tables that:

- are heavily populated
- sit at the center of the star schema
- contain pre-joined data from multiple tables
- hold aggregated data

The surrounding tables are called dimension tables; they typically have far fewer rows.
Star schemas offer several benefits over normalized relational schemas:

- streamlined access for OLAP queries
- selective redundancy to simplify and speed up queries for decision support and data mining
- pre-computation of derived data

Fact tables are not appropriate for OLTP databases that experience frequent, real-time updates, as that would make real-time updates very slow.
Data Mining is a form of knowledge discovery that requires a data warehouse in which data is integrated, cleaned, and easily and quickly accessible, particularly for interactive browsing.
Mining is used to find patterns in the data and to build models for forecasting, prediction, and classification. Data Mining is now often called unsupervised machine learning. In the past, the term knowledge discovery in databases was used instead of data mining.
Data mining generally results in one of three types of models:
A supervised machine learning algorithm for automated data mining and prediction uses a set of labeled “training data” to identify patterns and build a predictive model. For example, we might have collected data about voters along with who they voted for. We can then construct a model that, given some data about a voter, predicts who they are likely to vote for based on historical voting patterns of others. The prediction of a target variable is based on known features. This type of data mining is often referred to as supervised machine learning and requires pre-labeled training data; without that data it is impossible to construct a predictive model. Of course, the prediction is only accurate to a degree, and evaluating the model’s accuracy and applicability is another aspect of data mining.
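A minimal sketch of that workflow in Python, assuming the scikit-learn library is available and using a tiny made-up voter data set. The features, labels, and the choice of a decision tree are all illustrative assumptions, not a prescribed method.

```python
from sklearn.tree import DecisionTreeClassifier

# Illustrative only: tiny, made-up labeled training data.
# Features: [age, income_bracket]; labels: party the voter actually voted for.
X_train = [[25, 1], [32, 2], [47, 3], [51, 3], [62, 2], [23, 1]]
y_train = ["A", "A", "B", "B", "B", "A"]

# Supervised learning: fit a model on the labeled training data.
model = DecisionTreeClassifier(max_depth=2)
model.fit(X_train, y_train)

# Predict the likely vote for new, unlabeled voters.
print(model.predict([[30, 1], [55, 3]]))

# Evaluating accuracy (ideally on held-out data) is a separate, essential step;
# scoring on the training data alone gives an optimistic estimate.
print(model.score(X_train, y_train))
```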
There are six steps to data mining and model construction:
In practice, many organizations rely on the CRISP-DM framework for data project management.
Most predictive models fit into one of the following five types, and each type of model is constructed with a different set of algorithms.
Model Type | Description | Example |
---|---|---|
Binary Classification | Make a decision among two choices | Which applicant will be a poor risk for a mortgage? |
General Classification | Determine to which class something belongs | For which candidate is a voter going to vote? |
Clustering | Can the data be grouped? | What are the different segments of customers? |
Regression | Predict a numeric value | What is the likely selling price of this property? |
Association Rules | What else might something be related to? | Which songs would this customer like given her preferences and the songs others listen to? |
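As an illustration of the regression row in the table above, here is a short Python sketch, again assuming scikit-learn and using a small made-up data set of property sizes and selling prices purely for illustration.

```python
from sklearn.linear_model import LinearRegression

# Illustrative only: made-up training data (square footage -> selling price).
X_train = [[900], [1200], [1500], [2000], [2400]]
y_train = [150_000, 195_000, 240_000, 310_000, 365_000]

# Regression: learn a function that predicts a numeric target value.
model = LinearRegression()
model.fit(X_train, y_train)

# Predict the likely selling price of a new property of 1800 square feet.
print(model.predict([[1800]]))
```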
A common confusion is the difference between machine learning, data mining, and knowledge discovery. Data Mining is a common umbrella term for extracting knowledge from large databases. Machine Learning often refers to the process of building computational predictive models. Data Mining is also an older term for unsupervised machine learning, in which no pre-labeled data sets are required and which is commonly used to discover structure and patterns in data. Clustering algorithms are an example of an unsupervised machine learning (data mining) algorithm. In contrast, supervised machine learning requires labeled data sets, and the goal of a supervised ML algorithm is to discover a function that maps features to target variables with a certain degree of accuracy.
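For contrast with the supervised sketch above, here is a minimal unsupervised example, again assuming scikit-learn: k-means groups unlabeled customer records into segments without any target variable. The two features and the choice of three clusters are illustrative assumptions.

```python
from sklearn.cluster import KMeans

# Illustrative only: unlabeled customer records [annual_spend, visits_per_year].
X = [[200, 2], [250, 3], [1200, 20], [1100, 18], [5000, 40], [4800, 45]]

# Unsupervised learning: no labels are given; the algorithm finds the structure.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

print(labels)                     # cluster assignment for each customer
print(kmeans.cluster_centers_)    # one "segment profile" per cluster
```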
Naturally, data mining and machine learning rely on very large amounts of data, which means that the problems of “Big Data” rear their ugly heads.
V | Characteristics |
---|---|
Volume | Terabytes, Records & Archived Data, Historical Values, Tables & Files |
Velocity | Batch, Real/Near-Real Processes, Streams/IoT |
Value | Statistical Analytics, Correlations, Predictions |
Variability | Changing Data, Changing Models, Linkage |
Veracity | Trustworthiness, Authenticity, Provenance, Availability |
Variety | Structured, Unstructured, Multi-Factor, Probabilistic, Linked |
In this short talk, Khoury Boston’s Prof. Schedlbauer provides an overview of the use of data mining to automate knowledge discovery in databases to enable decision support and predictive modeling.
Data warehouses are key to analytical queries and to enabling data mining through machine learning.
None.
None collected yet. Let us know.