5.104 CRISP-DM Process for Data Analytics and Data Mining
Martin Schedlbauer
2024-02-14
Objectives
Upon completion of this lesson, you will be able to:
list the phases of CRISP-DM
understand the work products and activities of each phase
Introduction
CRISP-DM stands for Cross-Industry Standard Process for Data Mining. In this context, “data mining” encompasses not only data mining proper but also data analytics and machine learning. CRISP-DM is a widely used and well-established methodology that provides a structured approach to planning and executing data mining (as well as machine learning and data analytics) projects.
The CRISP-DM framework consists of six major phases:
Business Understanding: In this phase, the project objectives and requirements are defined, and the data mining problem is formulated.
Data Understanding: This phase involves collecting and analyzing the data that will be used for the project. This includes identifying the data sources, evaluating data quality, and exploring the data to gain insights.
Data Preparation: In this phase, the data is cleaned, transformed, and formatted in a way that makes it suitable for data mining. This can involve selecting relevant variables, dealing with missing values, and encoding categorical variables.
Modeling: In this phase, the data is analyzed using various modeling techniques, such as classification, clustering, or regression. The aim is to develop a model that can accurately predict or explain the target variable.
Evaluation: The model is evaluated using various performance metrics to determine its effectiveness and suitability for the intended purpose.
Deployment: In this final phase, the model is deployed in a real-world setting, and the results are monitored to ensure that the model is performing as expected.
The graphic below shows the iterative nature of the CRISP-DM framework [1]:
The short tutorial below by Dr. Soofastaei of Pearson Education summarizes the process. You might want to watch it before reading more about CRISP-DM.
Why CRISP-DM?
The CRISP-DM framework provides a structured and iterative approach to data mining, allowing for continuous improvement and refinement throughout the project. It provides a standard blueprint to guide data projects. By using a standard framework, the data mining process becomes reliable and repeatable, even when carried out by analysts with limited data mining experience. Furthermore, it reduces reliance on “heroes” and provides comfort for those new to data science and data mining.
In most data mining endeavors, 60-80% of the effort goes into data extraction, transformation, and loading, which includes data understanding and data preparation. These phases are “standard” and should be done in a repeatable manner. CRISP-DM provides a framework to do so.
CRISP-DM Phases
This section takes a more detailed look at the various stages of the pipeline. The video below provides a summary, should you find a narrated tutorial helpful.
Stage 1: Business Understanding
At the initial stage of the CRISP-DM process, it is essential to define the business objectives of the data project and to understand any constraints. The objective of this phase is to reveal key factors that might impact the project’s outcome. Failing to undertake this step may lead to expending significant effort on generating correct answers to irrelevant queries.
Define Project Outcomes
Set Objectives: This involves defining your primary business objectives, which could be keeping current customers by predicting when they may switch to a competitor. You may also have other related questions to address, such as “Does the number of times a customer visits our website impact their decision to abandon their shopping cart?” or “Will reducing shipping fees significantly increase the number of customers we retain?”
Produce Project Plan: In this phase, the outcome is a detailed plan for attaining the data mining and business objectives. The plan should outline the steps to be executed throughout the remainder of the project, including the initial selection of tools and techniques. This is comparable to a project plan in software project management and commonly includes a timeline, task list, and perhaps critical path analysis, task assignments, and cost and time estimates.
Define Success Criteria: In the “Business success criteria” stage, we define the criteria that will be applied to evaluate the project’s success from a business perspective. Ideally, these should be precise and quantifiable, such as a specific decrease in customer churn by some point in time. Whenever possible, the success criteria must be SMART, i.e., specific, measurable, attainable, relevant, and time-based. However, in some cases, subjective criteria like “provide valuable insights into the relationships” may be necessary. If so, it is crucial to specify who will make the subjective assessment and when.
Assess Current Situation
This phase necessitates in-depth information gathering on all resources, restrictions/constraints, assumptions, and other variables that must be taken into account while establishing your data analysis objective and project plan.
Inventory of Resources: Identify the resources available to the project including:
Personnel (e.g., subject matter experts, data scientists, data engineers, technical support)
Data (e.g., transactional and analytical databases, data files, third-party data)
Requirements, Assumptions and Constraints: Identify all current requirements including the schedule, use cases, quality of results, data security concerns, as well as any potential legal ramifications. Verify ownership of or license to use the data, along with any usage constraints. List any assumptions made by the project sponsors and stakeholders. These assumptions may pertain to the data, which can be confirmed during data mining, or they may relate to the business and cannot be verified. If the latter affects the results’ validity, it is crucial to enumerate them. Enumerate the project limitations, which could be restrictions on resource accessibility or technological constraints, such as the practicality of using a particular dataset size for modeling purposes.
Risks and Mitigation: List any risks or adverse events that might delay the project, cause it to fail, or impact the deliverable or schedule. For each identified risk, define its likelihood of occurrence and its impact on the project. Create risk mitigation strategies for all high-impact or likely risks. Define what actions would need to be taken to reduce a risk or to deal with it should it occur.
Terminology: Compile a glossary of terms relevant to the project and its stakeholders. The glossary should encompass a definition of all relevant business terms, which forms part of the business understanding available to the project. The glossary is generally a deliverable of the requirements gathering and elicitation efforts. Additionally, the project team must write a glossary of data mining terms to ensure that everyone has the same understanding of data mining techniques employed on the project.
Costs and benefits: Create a cost-benefit analysis that contrasts the project expenses with the potential business benefits if the project is successful. This comparison is part of the business case and should be as precise as possible, utilizing financial measures.
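The risk assessment described in this list can be kept as a simple risk register. The sketch below is only an illustration: the risk names, the 1-5 scoring scales, and the cutoff of 4 for "high-impact or likely" are assumptions, not part of CRISP-DM.

```python
# Hypothetical risk register; names and scores are illustrative only.
# Assumption: likelihood and impact are both scored on a 1-5 scale.
risks = [
    {"risk": "Key data source unavailable", "likelihood": 2, "impact": 5},
    {"risk": "Subject matter expert leaves", "likelihood": 1, "impact": 3},
    {"risk": "Data license restricts use", "likelihood": 3, "impact": 4},
]

HIGH = 4  # assumed cutoff for "high-impact" or "likely"


def exposure(risk):
    """Score a risk as likelihood times impact."""
    return risk["likelihood"] * risk["impact"]


# Rank risks so the highest exposure is reviewed first.
ranked = sorted(risks, key=exposure, reverse=True)

# Risks that are high-impact or likely need a mitigation strategy.
needs_mitigation = [
    r["risk"] for r in ranked if r["impact"] >= HIGH or r["likelihood"] >= HIGH
]

for r in ranked:
    print(f"{r['risk']}: exposure = {exposure(r)}")
```

Multiplying likelihood by impact is one common way to prioritize; a project may prefer a different scoring rule.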
Determine Data Mining Objectives
A business goal states objectives using measures that provide business value. The data mining objectives express the project’s technical aims. For instance, the business objective could be “Enhance e-mail marketing sales to current customers,” whereas the data mining objective might be “Forecast the order total for customers based on their previous three-year purchase history, demographic data (e.g., age, salary, zip code), and item pricing.” The data mining objectives are used to realize the business objectives. Data mining success criteria might include certain levels of precision, recall, accuracy, F1-score, among other technical data mining metrics.
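To make the technical success criteria concrete, the following sketch computes accuracy, precision, recall, and F1-score from the four cells of a binary confusion matrix and compares them with target levels. The counts and the targets are made-up values for illustration.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute common classification metrics from confusion-matrix counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical counts from a test set of 100 cases.
metrics = classification_metrics(tp=40, fp=10, fn=20, tn=30)

# Hypothetical data mining success criteria agreed on during this phase.
success_criteria = {"precision": 0.75, "recall": 0.70}
criteria_met = {
    name: metrics[name] >= target for name, target in success_criteria.items()
}

print(metrics)
print(criteria_met)
```

In this example the precision target is met but the recall target is not, which is the kind of finding that would send the project back to the modeling phase.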
Project Plan
The project plan defines how the project will be carried out, including which team members will do which activities and tasks, which tools are to be used, and when deliverables are due.
For the project tasks, define their duration, required resources and tools, inputs, outputs, and task dependencies. Where possible, try to make explicit the large-scale iterations in the data mining process, for example, repetitions of the modelling and evaluation phases. As part of the project plan, it is also important to analyze dependencies between time schedule and risks. Mark results of these analyses explicitly in the project plan, ideally with actions and recommendations if the risks are manifested. Decide at this point which evaluation strategy will be used in the evaluation phase. Your project plan will be a dynamic document.
At the end of each phase you’ll review progress and achievements and update the project plan accordingly. Specific review points for these updates should be part of the project plan.
Initial Assessment of Tools and Techniques: At the end of the first phase you should undertake an initial assessment of tools and techniques. Here, for example, you select a data mining tool that supports various methods for different stages of the process. It is important to assess tools and techniques early in the process since the selection of tools and techniques may influence the entire project.
Stage 2: Data Understanding
The second phase of the CRISP-DM process entails obtaining the data indicated in the project resources. This initial collection may involve data loading, especially if it aids in data comprehension. For instance, if a particular tool is utilized for data understanding, it is logical to upload the data into that tool. If there are multiple data sources, it is crucial to contemplate how and when they will be integrated.
A key deliverable is the data collection report which identifies the sources of data, their locations, and the methods necessary to retrieve the data, in addition to any data collection risks.
Describe Data
Among the first steps in understanding the data is to examine the data’s properties, i.e., how many instances, how many variables, what kinds of variables (categorical, numeric, textual), and the format of the data.
A key insight from this step is to know whether the right data is available to address the business objectives and whether there is sufficient data to build predictive models.
Examine the “gross” or “surface” properties of the acquired data and report on the results.
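As a rough illustration of examining the surface properties of a dataset, the sketch below counts instances and variables and guesses the kind of each variable. The inline CSV sample and the heuristic for telling categorical from textual columns are assumptions made for this example.

```python
import csv
import io

# Hypothetical sample data; a real project would read from the actual source.
RAW = """customer_id,age,segment,notes
1,34,retail,first order
2,51,wholesale,
3,,retail,prefers email
"""

rows = list(csv.DictReader(io.StringIO(RAW)))


def infer_kind(values):
    """Classify a column as numeric, categorical, or textual (simple heuristic)."""
    present = [v for v in values if v != ""]
    if not present:
        return "empty"
    try:
        for v in present:
            float(v)
        return "numeric"
    except ValueError:
        pass
    # Assumption: repeated values suggest a categorical variable.
    return "categorical" if len(set(present)) < len(present) else "textual"


summary = {
    "instances": len(rows),
    "variables": len(rows[0]),
    "kinds": {col: infer_kind([r[col] for r in rows]) for col in rows[0]},
}
print(summary)
```

On a dataset this small the heuristic is fragile; it is meant only to show what a data description report records.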
Explore Data
In this stage the data analysis team addresses initial, high-level data mining questions using data queries, descriptive statistics, and exploratory data visualizations. Exploratory analysis focuses chiefly on the distribution of target and predictor variables and correlations. Often, simple aggregations such as sums and means are explored, along with variance and covariance.
A key project deliverable is the Data Exploration Report, which contains the results of the exploratory data analysis, including key findings and initial hypotheses. It also states whether those findings might impact the remainder of the project, or even whether the project should be terminated at this point. To some extent, this step is a “gatekeeping” act with a “go/no-go” decision.
It is common to include initial exploratory analysis results in the report, including exploratory visualizations, graphs and plots to show data characteristics, and possible future areas of more detailed examination of relevant and interesting subsets of the data.
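The descriptive statistics mentioned in this section can be computed with a few lines of code. The sketch below uses two made-up variables, `ad_spend` and `sales`, and computes the mean, sample variance, sample covariance, and Pearson correlation.

```python
import statistics

# Hypothetical predictor and target values, for illustration only.
ad_spend = [10.0, 20.0, 30.0, 40.0, 50.0]
sales = [25.0, 41.0, 62.0, 79.0, 103.0]


def covariance(xs, ys):
    """Sample covariance of two equally long sequences."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)


def correlation(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    return covariance(xs, ys) / (statistics.stdev(xs) * statistics.stdev(ys))


mean_spend = statistics.mean(ad_spend)
var_spend = statistics.variance(ad_spend)
cov = covariance(ad_spend, sales)
corr = correlation(ad_spend, sales)

print(f"mean={mean_spend}, variance={var_spend}, cov={cov}, corr={corr:.4f}")
```

A correlation close to 1 here would be recorded in the Data Exploration Report as an initial finding worth closer examination, not as evidence of causation.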
Evaluate Data Quality
It is essential to conduct an evaluation of data quality, while considering pertinent issues such as:
Does the data encompass all the necessary cases, or is it incomplete?
Is the data accurate, or does it contain mistakes? If so, what is the extent of these errors?
Are there any omissions in the data? If present, how are they indicated, where do they arise, and how frequent are they?
Analysts generally enumerate the outcomes of the data quality validation in a report. In cases where quality issues are apparent, it is helpful to recommend applicable and appropriate remedies and imputation strategies. Addressing data quality problems ordinarily relies on a combination of expertise in data analytics and the business domain.
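The questions about omissions (how they are indicated, where they arise, how frequent they are) lend themselves to a small report. In this sketch the records and the set of missing-value markers are assumptions; each dataset encodes missing values in its own way.

```python
# Hypothetical records; None, "", "N/A", and -1 all stand for "missing" here.
records = [
    {"customer_id": 1, "age": 34, "zip": "02115"},
    {"customer_id": 2, "age": None, "zip": "N/A"},
    {"customer_id": 3, "age": -1, "zip": ""},
    {"customer_id": 4, "age": 29, "zip": "10001"},
]

# Assumed encodings of missing values in this dataset.
MISSING_MARKERS = {None, "", "N/A", -1}


def missing_report(rows):
    """Count missing values per column and express them as a rate."""
    report = {}
    for col in rows[0]:
        missing = sum(1 for r in rows if r[col] in MISSING_MARKERS)
        report[col] = {"missing": missing, "rate": missing / len(rows)}
    return report


quality = missing_report(records)
for col, stats in quality.items():
    print(f"{col}: {stats['missing']} missing ({stats['rate']:.0%})")
```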
Stage 3: Data Preparation
Select Data
The data selection phase of a project involves making informed choices about the dataset to be analyzed, based on various factors including the pertinence of the data to the data mining objectives, the data quality, and technical constraints such as data size and format. It is important to note that data selection encompasses the choice of both attributes (columns) and records (rows) within a table.
To justify the inclusion or exclusion of data, it is necessary to provide a rationale for these decisions. This involves documenting the list of selected and omitted data, along with the reasons that informed these decisions.
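One lightweight way to document data selection is to keep the rationale next to the selection itself. The column names, reasons, and row filter below are hypothetical and serve only to illustrate selecting both attributes and records.

```python
# Hypothetical dataset for illustration.
dataset = [
    {"customer_id": 1, "age": 34, "fax": None, "country": "US"},
    {"customer_id": 2, "age": 51, "fax": None, "country": "CA"},
    {"customer_id": 3, "age": 27, "fax": None, "country": "US"},
]

# Attributes (columns) kept or dropped, each with a recorded reason.
selected_columns = {
    "customer_id": "key needed to join with other sources",
    "age": "candidate predictor",
}
omitted_columns = {
    "fax": "empty for all records",
    "country": "used only to filter rows",
}


def keep_row(row):
    """Record (row) selection rule; assumed scope: US customers only."""
    return row["country"] == "US"


selected = [
    {col: row[col] for col in selected_columns}
    for row in dataset
    if keep_row(row)
]
print(selected)
```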
Clean Data
The next task focuses on enhancing the quality of the data to meet the requirements of the selected analysis techniques and data mining algorithms. This may entail selecting clean subsets of the data, introducing appropriate default values, identifying corrupted data, or implementing more complex procedures such as modelling to impute missing data. The process must be reproducible.
To document the steps taken to address data quality issues, it is necessary to create a report outlining the decisions and actions taken in terms of data cleaning. This report should detail any transformations applied to the data for the purposes of cleaning and their potential impact on the results of the analysis.
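A reproducible cleaning step can return both the cleaned data and a log entry for the cleaning report. The sketch below imputes missing ages with the median; the choice of median imputation and the sample values are assumptions for illustration, and other strategies may suit a given dataset better.

```python
import statistics

# Hypothetical raw values; None marks a missing age.
raw_ages = [34, None, 29, 41, None, 38]


def impute_median(values):
    """Replace None with the median of the observed values.

    Returns the cleaned list and a log entry describing the transformation.
    """
    observed = [v for v in values if v is not None]
    fill = statistics.median(observed)
    cleaned = [fill if v is None else v for v in values]
    log = {
        "step": "impute_median",
        "fill_value": fill,
        "rows_changed": len(values) - len(observed),
    }
    return cleaned, log


cleaned_ages, cleaning_log = impute_median(raw_ages)
print(cleaned_ages)
print(cleaning_log)
```

Because the function is deterministic, rerunning it on the same raw data yields the same cleaned data, and the log entry can be copied into the data cleaning report.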
Transform and Derive Data
Once the data has been cleaned, additional data preparation operations can be performed, including the creation of derived attributes, additional records, or transformed values for existing attributes.
Derived attributes refer to new variables constructed from one or more pre-existing attributes within the same record using some formula or rule. For instance, the variables of length and width might be used to derive a new attribute of area.
Generated records describe the process of creating entirely new records. For instance, it may be necessary to create records for customers who did not make any purchases in the past year, even though such records were not present in the raw data. Representing the fact that certain customers made zero purchases may be critical for modelling purposes and to ensure that data mining algorithms produce reasonable values.
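Both ideas can be shown in a few lines. The first part derives an area attribute from length and width; the second generates records for customers with zero purchases. All names and values are invented for the example.

```python
# Derived attribute: area computed from length and width in the same record.
rooms = [
    {"room": "A", "length": 4.0, "width": 3.0},
    {"room": "B", "length": 5.0, "width": 2.5},
]
for room in rooms:
    room["area"] = room["length"] * room["width"]

# Generated records: the raw purchase data only mentions customers who bought
# something, so customers without purchases get an explicit zero record.
all_customers = ["c1", "c2", "c3"]
purchase_counts = {"c1": 4, "c3": 1}  # hypothetical raw aggregates

purchase_records = [
    {"customer": c, "purchases": purchase_counts.get(c, 0)}
    for c in all_customers
]

print(rooms)
print(purchase_records)
```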
Integrate Data
Data integration involves combining information from multiple data sources, files, tables, or external databases to create new values or records.
Merged data refers to the process of joining two or more sources that contain different information about the same objects. For instance, a retail chain may have separate databases containing information about each store’s characteristics, summarized sales data, and demographics of the surrounding area. These databases can be merged into a new database with one record for each store, combining relevant fields from the source databases.
Aggregations involve computing new values by summarizing information from multiple records and/or tables. For instance, a table of customer purchases with one record for each purchase can be transformed into a new table with one record for each customer, containing fields such as number of purchases, average purchase amount, percentage of orders charged to credit card, and percentage of items purchased during a promotion.
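The store and purchase examples above can be sketched as follows. The store identifiers, field names, and amounts are assumptions; in practice this work is often done with SQL joins and GROUP BY or with a data frame library.

```python
from collections import defaultdict

# Merging: two hypothetical sources describing the same stores.
stores = {"s1": {"sqft": 12000}, "s2": {"sqft": 8000}}
demographics = {"s1": {"median_income": 72000}, "s2": {"median_income": 58000}}

# One merged record per store, combining fields from both sources.
merged = {sid: {**stores[sid], **demographics.get(sid, {})} for sid in stores}

# Aggregation: one record per purchase becomes one record per customer.
purchases = [
    {"customer": "c1", "amount": 20.0, "credit_card": True},
    {"customer": "c1", "amount": 40.0, "credit_card": False},
    {"customer": "c2", "amount": 15.0, "credit_card": True},
]

grouped = defaultdict(list)
for p in purchases:
    grouped[p["customer"]].append(p)

per_customer = {
    cust: {
        "n_purchases": len(items),
        "avg_amount": sum(i["amount"] for i in items) / len(items),
        "pct_credit_card": 100 * sum(i["credit_card"] for i in items) / len(items),
    }
    for cust, items in grouped.items()
}

print(merged)
print(per_customer)
```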
Stage 4: Modeling
Select Modeling Technique
The initial stage of modeling involves selecting the specific technique that will be used for analysis, even if a tool was already selected in the Business Understanding phase. As an example, the analysis team may need to choose a classification technique and determine whether to use a decision tree constructed with C5.0 or a neural network trained with backpropagation. If several modelling techniques will be applied, this task should be performed separately for each one.
The selected modelling techniques should be documented in detail and their training must be reproducible.
It is crucial to consider the assumptions made by the chosen modelling technique regarding the data. Many modelling techniques make specific assumptions about the data, such as requiring all attributes to have normal distributions, disallowing missing values, and requiring the class attribute to be categorical. All modelling assumptions made must be recorded.
Design Model Testing and Validation
Before constructing a model, it is necessary to develop a mechanism or procedure to evaluate its quality and validity. In supervised data mining tasks, such as classification, error rates are commonly used as quality measures for data mining models. As such, it is customary to split the dataset into training and testing sets, constructing the model on the training set, and assessing its quality on the separate testing set.
To train, test, and evaluate the models effectively, a comprehensive plan must be established. A key component of this plan is determining how to partition the available dataset into training, testing, and validation datasets. The plan should outline the specific methods and metrics that will be employed to assess the model’s accuracy and performance. It should also address any potential biases or limitations in the data and ensure that the models are evaluated in a fair and unbiased manner.
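A minimal sketch of partitioning a dataset into training, validation, and testing sets follows. The 70/15/15 proportions and the fixed seed are assumptions chosen for illustration; the seed makes the split reproducible.

```python
import random


def split_dataset(rows, train=0.70, validation=0.15, seed=42):
    """Shuffle rows reproducibly and split into train/validation/test lists.

    Whatever remains after the train and validation portions is the test set.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * validation)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],
    )


data = list(range(100))  # stand-in for 100 prepared records
train_set, validation_set, test_set = split_dataset(data)
print(len(train_set), len(validation_set), len(test_set))
```

A plain random split like this ignores class balance and time ordering; when either matters, the test design should call for a stratified or time-based split instead.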
Construct Model
Running the modelling tool on the prepared dataset involves creating one or more models for analysis.
Parameter settings for the modelling tool should be documented, including the values chosen for each parameter and the reasoning behind their selection. The parameter settings will vary depending on the specific modelling tool and technique used.
The resulting models created by the modelling tool should be documented, including all relevant details about the models.
Model descriptions should be provided, including any interpretation of the models, difficulties encountered during their analysis, and potential limitations or biases. A comprehensive report should be created, outlining the findings and insights gained from the models, as well as any recommendations for future analysis.
Evaluate Model
Interpreting the models according to domain knowledge, data mining success criteria, and desired test design is a critical step in evaluating the success of modelling and discovery techniques. It is essential to assess the technical aspects of the models and rank them accordingly, while also considering the business objectives and success criteria.
The models should be assessed in terms of their qualities, such as accuracy, and ranked in relation to one another. The results of this task should be summarised, highlighting the strengths and limitations of the generated models.
Based on the model assessment, parameter settings may need to be revised and tuned for the next modelling run. This iterative process of model building and assessment should continue until the best possible model(s) are identified. All revisions and assessments should be thoroughly documented to ensure transparency and reproducibility.
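The cycle of revising parameter settings, reassessing, and ranking can be illustrated with a deliberately tiny model: a single threshold applied to one score. The validation data and candidate thresholds are made up, and a real project would tune a real modelling technique in the same loop.

```python
# Hypothetical validation data: (score, true label).
validation = [(0.2, 0), (0.4, 0), (0.5, 1), (0.7, 1), (0.9, 1), (0.3, 0)]


def accuracy(threshold, data):
    """Fraction of cases where 'score >= threshold' matches the true label."""
    correct = sum(1 for x, label in data if (x >= threshold) == bool(label))
    return correct / len(data)


# Each modelling run is recorded with its parameter setting and its result,
# so the assessment stays transparent and reproducible.
candidate_thresholds = [0.25, 0.45, 0.65]
runs = [
    {"threshold": t, "accuracy": accuracy(t, validation)}
    for t in candidate_thresholds
]

# Rank the candidate models relative to one another.
ranking = sorted(runs, key=lambda run: run["accuracy"], reverse=True)
best = ranking[0]

for run in ranking:
    print(run)
```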
Stage 5: Evaluation
Evaluate Results and Objectives
In this step, the team assesses the extent to which the generated model meets the business objectives. This may involve testing the model(s) on real-world applications or identifying any deficiencies in the model that may impact its practical usefulness. The evaluation phase also includes an assessment of all other data mining results generated during the project, which may reveal additional challenges or insights for future directions.
The assessment of data mining results should be summarised in terms of business success criteria, including a final statement on whether the project has successfully achieved its initial business objectives.
Based on the assessment of models with respect to the business success criteria, the generated models that meet the selected criteria become the approved models.
Conduct Process Retrospective
After determining that the resulting models meet the business needs, it is necessary to conduct a more comprehensive review of the data mining engagement to ensure that no important factor or task has been overlooked. This review should also include quality assurance issues, such as whether the model was correctly built and whether only permissible attributes were used for analysis.
The process review should be summarised, highlighting any activities that were missed or overlooked, as well as any tasks that should be repeated to ensure that the data mining engagement was executed accurately and effectively. This review may also identify potential areas for improvement or refinement in the data mining process that could be implemented in future projects. Overall, the review should provide a comprehensive evaluation of the data mining engagement and ensure that all necessary steps were taken to achieve the project’s goals.
Assess Next Steps
After completing the assessment and process review, it is necessary to decide how to proceed with the project. This may involve finishing the project and moving on to deployment, initiating further iterations to improve the models or the process, or setting up new data mining projects. The remaining resources and budget should be taken into consideration when making this decision.
A list of potential further actions should be created, along with the reasons for and against each option. For example, proceeding to deployment may be the best option if the models meet the business needs and there are no additional insights to be gained from further iterations. On the other hand, initiating further iterations may be necessary if the models need to be refined or if there are still unanswered questions that need to be addressed.
The decision on how to proceed should be described in detail, along with the rationale for the chosen course of action. This decision should be based on a thorough evaluation of the data mining engagement, taking into account the results of the assessment, the process review, and the available resources and budget.
Stage 6: Deployment
Plan Deployment
The deployment stage involves taking the evaluation results and determining a strategy for deploying the models. If a general procedure has been identified for creating the relevant models, it should be documented for later deployment. It is important to consider the ways and means of deployment during the business understanding phase as well, as this is crucial to the success of the project and the operational side of the business.
The deployment plan should summarize the deployment strategy, including the necessary steps and how to perform them. This may involve integrating the models into existing systems or developing new applications to make use of the models. The plan should also identify any potential challenges or limitations in the deployment process and outline strategies for addressing them.
The deployment plan should be comprehensive and include all necessary details, such as timelines, resources, and key stakeholders involved in the deployment process. It should also be reviewed and updated regularly to ensure that it remains relevant and effective. Overall, a well-developed deployment plan is essential for maximizing the value of the data mining project and achieving the desired business outcomes.
Monitor Models
Monitoring and maintenance are crucial issues when data mining results become part of the day-to-day business environment. A well-prepared maintenance strategy is essential to ensure the continued accuracy and effectiveness of the data mining results and to avoid any unnecessary periods of incorrect usage.
To monitor the deployment of the data mining result(s), a detailed monitoring process plan should be developed, taking into account the specific type of deployment. This plan should summarise the monitoring and maintenance strategy, including the necessary steps and how to perform them.
The monitoring and maintenance plan should outline key performance indicators (KPIs) that will be used to evaluate the effectiveness of the data mining results and identify any potential issues or areas for improvement. It should also include a schedule for regular review and update of the models, as well as procedures for resolving any issues that may arise.
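As one possible shape for such a KPI, the sketch below tracks weekly accuracy of a deployed model and raises an alert when it falls too far below the level measured at evaluation time. The baseline, the tolerance, and the weekly outcomes are all assumed values.

```python
# Assumed accuracy measured during the evaluation phase.
BASELINE_ACCURACY = 0.90
# Assumed acceptable drop before the model is flagged for review.
TOLERANCE = 0.05

# Hypothetical outcomes per week: 1 = prediction was correct, 0 = incorrect.
weekly_outcomes = {
    "2024-W01": [1, 1, 1, 0, 1, 1, 1, 1, 1, 1],
    "2024-W02": [1, 0, 1, 1, 0, 1, 1, 0, 1, 1],
}


def weekly_kpis(outcomes):
    """Compute accuracy per week and flag weeks below the alert level."""
    kpis = {}
    for week, results in outcomes.items():
        acc = sum(results) / len(results)
        kpis[week] = {
            "accuracy": acc,
            "alert": acc < BASELINE_ACCURACY - TOLERANCE,
        }
    return kpis


kpis = weekly_kpis(weekly_outcomes)
for week, kpi in kpis.items():
    print(week, kpi)
```

An alert would trigger the issue-resolution procedure in the maintenance plan, for example a review of recent data or retraining of the model.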
The plan should be reviewed and updated regularly to ensure that it remains relevant and effective. This will help to ensure the continued success of the data mining project and enable the business to maximise the value of the data mining results.
Write Final Report
At the end of the data mining project, it is important to prepare a final report that summarises and organises all the results and deliverables. Depending on the deployment plan, the final report may be a comprehensive presentation of the data mining results or a summary of the project experiences (if they have not already been documented as an ongoing activity).
The final report should include all the previous deliverables, such as the business understanding, data preparation, modelling, evaluation, and deployment plans. It should also provide an overview of the data mining process, including any challenges and limitations encountered, and highlight the key findings and insights gained.
In addition to the final report, there will often be a meeting at the conclusion of the project where the results are presented to the customer. This final presentation should provide a clear and concise overview of the data mining results, highlighting the most important findings and their potential implications for the business.
Overall, the final report and presentation are critical components of the data mining project and help to ensure that the business can fully leverage the insights gained from the project to achieve its goals and objectives.
Conduct Retrospective
Assessing the data mining project at its conclusion is an important step for identifying what went well and what could be improved. Experience documentation can help to capture important insights and lessons learned from the project that can be used to inform future projects and improve overall performance.
Experience documentation should summarize the important experiences gained during the project, including any pitfalls encountered, misleading approaches, or hints for selecting the best suited data mining techniques in similar situations. This documentation can also include any reports that were written by individual project members during previous phases of the project, as these may contain valuable insights and lessons learned.
It is important to identify both the successes and failures of the project in order to improve future performance. This may involve reflecting on the effectiveness of the data preparation and modelling techniques used, the quality of the data, the appropriateness of the evaluation criteria, and the success of the deployment strategy. By understanding what went well and what needs to be improved, future data mining projects can be more effective and efficient, leading to better business outcomes.
Summary
The CRISP-DM (Cross-Industry Standard Process for Data Mining) framework is a widely-used methodology for data mining projects. It consists of six phases: Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, and Deployment. Each phase involves a set of tasks and objectives that help guide the project through its life cycle, from identifying the business problem to deploying the solution. The framework is iterative, allowing for adjustments and improvements at each stage, and emphasizes the importance of communication and collaboration between team members.