Upon completion of this lesson, you will be able to:
In this video tutorial, Khoury Boston’s Prof. Schedlbauer provides an overview of the role of data warehouses and their designs as the basis for data analytics and data mining efforts in organizations.
Download Slide Deck: Data Warehousing & OLAP
Organizations have accumulated large databases by collecting data about their day-to-day activities. Such databases range from a few gigabytes to hundreds of gigabytes or even terabytes. Information systems that record and update data in such databases are known as online transaction processing (OLTP) applications and are often supported by relational, hierarchical, and network database management systems.
In contrast to transactional (or operational) databases, data warehouses are databases that consolidate data from multiple sources and augment it with summary information and historical data spanning a long time period. Applications that provide access to this data for analytics and forecasting are often referred to as online analytical processing (OLAP) applications; they include dashboards, data mining, interactive exploration, scenario analysis, forecasting, and reporting.
OLAP refers to the analysis of complex data from the data warehouse, while data mining refers to the automated discovery of new information such as patterns, clusters, or association rules.
A key concept of data warehouses is their integration of data from several sources, which facilitates analytics.
The data loaded into the data warehouse includes not only data from operational databases (often relational), but also data from files (CSV and XML), non-relational databases (e.g., MongoDB, DynamoDB), and applications (often via an XML or JSON bridge). The key point is that the consolidated and integrated data exposes differences between sources and allows for data mining. The data warehouse is often very large (measured in gigabytes, terabytes, and beyond) and includes historical data (not just the latest values) as well as pre-computed and summary data. It is the basis for statistical analysis, data mining, and machine learning, as well as for reporting and exploratory data visualization.
To summarize, a data warehouse:

- is an integrated and consolidated collection of data from multiple sources,
- is primarily queried and is non-volatile,
- contains data records for a long time and retains historical data, and
- provides users with a single consolidated interface to data, enabling decision-support systems.
Because data warehouses are very large and often contain data that is not relevant for every purpose, organizations extract subsets of the data warehouse for specific purposes and around specific subject areas. Such extracts are called data marts. Data marts are subject-specific subsets of the data in the data warehouse geared towards the needs of a particular user.
The process of extracting, transforming, and loading data from different sources into the warehouse is often called ETL (Extraction, Transformation, and Loading).
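To make the ETL idea concrete, here is a minimal sketch in Python. It assumes a hypothetical CSV extract of orders and uses a SQLite database as a stand-in for the warehouse; the file name, column names, and table name are illustrative assumptions, not part of any particular product.

```python
import csv
import sqlite3

# Hypothetical source file and warehouse; names are illustrative only.
SOURCE_CSV = "orders_export.csv"             # extract produced by an operational system
warehouse = sqlite3.connect("warehouse.db")  # SQLite stands in for the warehouse

warehouse.execute("""
    CREATE TABLE IF NOT EXISTS sales_fact (
        order_date TEXT,
        product_id TEXT,
        region     TEXT,
        amount     REAL
    )
""")

# Extraction: read the raw rows from the operational export.
with open(SOURCE_CSV, newline="") as f:
    rows = list(csv.DictReader(f))

# Transformation: harmonize formats (identifiers, category spelling, currency).
def transform(row):
    return (
        row["OrderDate"],                       # assume ISO dates in the extract
        row["ProductID"].strip().upper(),       # normalize identifiers
        row["Region"].strip().title(),          # harmonize category spelling
        float(row["Amount"].replace("$", "")),  # strip currency symbol
    )

# Loading: insert the transformed rows into the warehouse table.
warehouse.executemany(
    "INSERT INTO sales_fact VALUES (?, ?, ?, ?)",
    (transform(r) for r in rows),
)
warehouse.commit()
```

In a production setting the same pattern is typically driven by an ETL tool or scheduler rather than a hand-written script, but the extract, transform, and load phases remain the same.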
Systems that extract high-level information stored in databases and use that information to support a variety of decision-making processes are called decision-support systems (DSS). Some of the tools and techniques used to build a DSS include data warehousing, online analytical processing (OLAP), data mining, and predictive and prescriptive modeling with machine learning.
Unlike OLTP databases that hold operational data, data warehouses are not updated (changed) in real time, only periodically. Redundancy in data warehouses is often deliberately planned in order to improve access.
Bill Inmon, considered to be one of the “fathers” of the modern data warehouse, defines a data warehouse as a subject-oriented, integrated, time-variant, non-volatile collection of data in support of management’s decisions.
A data warehouse is organized around a major subject (domain concept) such as accounts, products, or orders. Unlike an operational database, which supports an application, a data warehouse organizes data according to a subject. For example, an insurance company using a data warehouse might organize its data by customer, premium, and claim rather than by individual policy type (auto sweep policy, joint life policy, etc.).
A data warehouse is always a physically separate store of data. Due to this separation, data warehouses do not need to support transaction processing, recovery, or concurrency control. The data is not overwritten or deleted once it enters the data warehouse; it is only loaded, refreshed, and accessed for queries. The data in the data warehouse is retained for future reporting and trend analytics.
Data is stored in a data warehouse to provide a historical perspective. Consequently, the data in the data warehouse is time-variant, or historical, in nature. It is not unusual for a data warehouse to hold several years’ or even decades’ worth of data in order to facilitate comparative analytics, trend analysis, and forecasting. All values are stored so that reports and visualizations can be produced that show changes over time.
A data warehouse is almost always constructed by integrating multiple, heterogeneous sources of data such as relational databases, XML files, and other flat files. The database contains data from most, or all, of an organization’s operational applications. The data is integrated and harmonized to ensure a consistent view and to reduce data quality issues.
Integration does not come without its challenges. Among the most problematic integration issues are:

- category encoding and formatting
- field/attribute naming
- primary key selection
- duplicate data
- default values for missing data
- imputation of missing data
- inconsistent data with differing values
- filtering and aggregation
A simple example of a data integration issue: data about people is imported from various data sources. In one data source gender is encoded as “m” and “f”, in another as “M”, “F”, “B”, “H”, “U”, and “T”, and in yet another as 1/0. Often, the integration requires rules to be formulated and applied, as sketched below.
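A minimal sketch of such a rule in Python, assuming the three hypothetical source encodings above are mapped onto a single target coding. The target coding, the mapping of “B”, “H”, and “T” to an “other” category, and the “U” (unknown) fallback are illustrative assumptions; a real warehouse team would formulate these rules together with the business stakeholders.

```python
# Illustrative harmonization rule: map each source's gender coding onto one
# target coding used in the warehouse ("M", "F", "X" for other, "U" for unknown).
GENDER_RULES = {
    "source_a": {"m": "M", "f": "F"},
    "source_b": {"M": "M", "F": "F", "B": "X", "H": "X", "U": "U", "T": "X"},
    "source_c": {"1": "M", "0": "F"},   # numeric encoding in the third source
}

def harmonize_gender(source: str, value) -> str:
    """Apply the integration rule for a given source; default to 'U' (unknown)."""
    return GENDER_RULES.get(source, {}).get(str(value).strip(), "U")

print(harmonize_gender("source_a", "f"))   # -> F
print(harmonize_gender("source_c", 1))     # -> M
print(harmonize_gender("source_b", "T"))   # -> X
```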
The harmonization of data from the different sources is done programmatically at the time of ETL. During integration of data from multiple sources, data quality issues are generally resolved, and the updated data is then often copied back to the operational data sources. This process is generally referred to as backflushing.
Normalized relational schemas are two-dimensional: all tables are structurally equal, even though they do not all contain the same amount of data.
The data in a data warehouse that is built on a relational database is usually multidimensional, having measure attributes and dimension attributes. Measure attributes “quantify” some value and can be aggregated; they are often called “facts”. The attributes that define the dimensions along which the measures and their summaries are viewed are called dimension attributes. The facts are stored in special tables called fact tables, while the dimension attributes are stored in dimension tables.
It is important to note that not all data warehouses are built on relational databases; those that are not do not need fact and dimension tables if the database has inherent support for aggregation. Examples of such database types include cube databases.
Multidimensional databases:

- contain summarized data
- use facts, dimensions, and sparsity: values, counts, weights, occurrences
- view data as a “slice of time”, not a transaction
- concentrate on modeling “facts” rather than entities
For example, a “fact” for a data warehouse that aggregates sales data is the sales amount for some product in a particular month. Facts are numbers, e.g., $226,998. A fact table contains many facts: sales for each product aggregated by month, by quarter, and by year, as well as by store and by region, perhaps even by department or by salesperson.
The star schema is the simplest schema for relational data warehouses; it consists of a fact table with a single table for each dimension. The dimension tables are also referred to as lookup tables. A data warehouse may contain numerous star schemas. Fact tables can be very large, and the facts are pre-computed during ETL. That makes the lookup of facts (e.g., “What are the sales for each region for each quarter of last year?”) very quick, as it involves the retrieval of a single column for a few rows. Doing this in an operational database would require a SQL statement with GROUP BY and aggregation functions such as COUNT or SUM, and such statements can be very slow if there is a lot of data. Pre-computation eliminates that work and thus makes browsing and interactive exploration of the data “facts” through dashboards and analytics applications fast. Of course, the facts have to be periodically re-computed whenever the data warehouse is refreshed with new operational data.
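The following Python/SQLite sketch illustrates the idea with a hypothetical sales star schema: one fact table, one dimension table, and a summary table of quarterly totals that is pre-computed during ETL so that dashboards only read it back instead of re-aggregating the detail rows at query time. All table and column names are assumptions made for illustration.

```python
import sqlite3

db = sqlite3.connect(":memory:")

# One dimension table and one fact table of a hypothetical sales star schema.
db.executescript("""
    CREATE TABLE region_dim (region_id INTEGER PRIMARY KEY, region_name TEXT);
    CREATE TABLE sales_fact (region_id INTEGER, quarter TEXT, amount REAL);

    INSERT INTO region_dim VALUES (1, 'East'), (2, 'West');
    INSERT INTO sales_fact VALUES (1, 'Q1', 100.0), (1, 'Q1', 50.0),
                                  (2, 'Q1', 75.0), (2, 'Q2', 25.0);

    -- Pre-computation during ETL: aggregate once and store the summary facts.
    CREATE TABLE sales_by_region_quarter AS
        SELECT region_id, quarter, SUM(amount) AS total_sales
        FROM sales_fact
        GROUP BY region_id, quarter;
""")

# An OLAP query now only looks up the pre-computed rows (no GROUP BY at query time).
for row in db.execute("""
        SELECT d.region_name, s.quarter, s.total_sales
        FROM sales_by_region_quarter s JOIN region_dim d USING (region_id)
    """):
    print(row)
```

The trade-off is the one described above: the summary table must be rebuilt whenever new operational data is loaded into the warehouse.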
Fact tables are tables that:

- are heavily populated
- sit at the center of the star schema
- contain pre-joined data from multiple tables
- hold aggregated data

The surrounding tables are called dimension tables; they typically have far fewer rows.
Star schemas offer several benefits over normalized relational schemas:

- streamlined access for OLAP queries
- selective redundancy to simplify and speed up queries for decision support and data mining
- pre-computation of derived data

Fact tables are not appropriate for OLTP databases that experience frequent, real-time updates, as that would make real-time updates very slow.
Data Mining is a form of knowledge discovery that requires a data warehouse in which data is integrated, cleaned, and easily and quickly accessible, particularly for interactive browsing.
Mining is used to find patterns in the data and to build models for forecasting, prediction, and classification. Data Mining is now often called unsupervised machine learning. In the past, the term knowledge discovery in databases was used instead of data mining.
Data mining generally results in one of three types of models:
A supervised machine learning algorithm for automated data mining and prediction uses a set of labeled “training data” to identify patterns and build a predictive model. For example, we might have collected data about voters along with who they voted for. We can then construct a model that, given some data about a voter, predicts who they are likely to vote for based on historical voting patterns of others. The prediction of a target variable is based on known features. This type of data mining is often referred to as supervised machine learning and requires pre-labeled training data; without that data it is impossible to construct a predictive model. Of course, the prediction is only accurate to a degree, and evaluating the model’s accuracy and applicability is another aspect of data mining.
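A minimal sketch of that workflow in Python, assuming the scikit-learn library is available and using a tiny made-up voter data set. The features, labels, and the choice of a decision tree are all illustrative assumptions, not a prescribed method.

```python
from sklearn.tree import DecisionTreeClassifier

# Illustrative only: tiny, made-up labeled training data.
# Features: [age, income_bracket]; labels: party the voter actually voted for.
X_train = [[25, 1], [32, 2], [47, 3], [51, 3], [62, 2], [23, 1]]
y_train = ["A", "A", "B", "B", "B", "A"]

# Supervised learning: fit a model on the labeled training data.
model = DecisionTreeClassifier(max_depth=2)
model.fit(X_train, y_train)

# Predict the likely vote for new, unlabeled voters.
print(model.predict([[30, 1], [55, 3]]))

# Evaluating accuracy (ideally on held-out data) is a separate, essential step;
# scoring on the training data alone gives an optimistic estimate.
print(model.score(X_train, y_train))
```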
There are six steps to data mining and model construction:
In practice, many organizations rely on the CRISP-DM framework for data project management.
Most predictive models fit into one of the following five types, and each type of model is constructed with a different set of algorithms.
Model Type | Description | Example |
---|---|---|
Binary Classification | Make a decision among two choices | Which applicant will be a poor risk for a mortgage? |
General Classification | Determine to which class something belongs | For which candidate is a voter going to vote? |
Clustering | Can the data be grouped? | What are the different segments of customers? |
Regression | Predict a numeric value | What is the likely selling price of this property? |
Association Rules | What else might something be related to? | Which songs would this customer like given her preferences and the songs others listen to? |
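As an illustration of the regression row in the table above, here is a short Python sketch, again assuming scikit-learn and using a small made-up data set of property sizes and selling prices purely for illustration.

```python
from sklearn.linear_model import LinearRegression

# Illustrative only: made-up training data (square footage -> selling price).
X_train = [[900], [1200], [1500], [2000], [2400]]
y_train = [150_000, 195_000, 240_000, 310_000, 365_000]

# Regression: learn a function that predicts a numeric target value.
model = LinearRegression()
model.fit(X_train, y_train)

# Predict the likely selling price of a new property of 1800 square feet.
print(model.predict([[1800]]))
```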
A common confusion is the difference between machine learning, data mining, and knowledge discovery. Data Mining is a common umbrella term for extracting knowledge from large databases. Machine Learning often refers to the process of building computational predictive models. Data Mining is also an older term for unsupervised machine learning, in which no pre-labeled data sets are required and which is commonly used to discover structure and patterns in data. Clustering algorithms are an example of an unsupervised machine learning (data mining) algorithm. In contrast, supervised machine learning requires labeled data sets, and the goal of a supervised ML algorithm is to discover a function that maps features to target variables with a certain degree of accuracy.
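For contrast with the supervised sketch above, here is a minimal unsupervised example, again assuming scikit-learn: k-means groups unlabeled customer records into segments without any target variable. The two features and the choice of three clusters are illustrative assumptions.

```python
from sklearn.cluster import KMeans

# Illustrative only: unlabeled customer records [annual_spend, visits_per_year].
X = [[200, 2], [250, 3], [1200, 20], [1100, 18], [5000, 40], [4800, 45]]

# Unsupervised learning: no labels are given; the algorithm finds the structure.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

print(labels)                     # cluster assignment for each customer
print(kmeans.cluster_centers_)    # one "segment profile" per cluster
```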
Naturally, data mining and machine learning rely on very large amounts of data, which means that the problems of “Big Data” rear their ugly heads.
V | Characteristics |
---|---|
Volume | Terabytes, Records & Archived Data, Historical Values, Tables & Files |
Velocity | Batch, Real/Near-Real Processes, Streams/IoT |
Value | Statistical Analytics, Correlations, Predictions |
Variability | Changing Data, Changing Models, Linkage |
Veracity | Trustworthiness, Authenticity, Provenance, Availability |
Variety | Structured, Unstructured, Multi-Factor, Probabilistic, Linked |
In this short talk, Khoury Boston’s Prof. Schedlbauer provides an overview of the use of data mining to automate knowledge discovery in databases to enable decision support and predictive modeling.
Data warehouses are key to analytical queries and to enabling data mining through machine learning.
None.
None collected yet. Let us know.