---
title: "Overview of Feature Engineering"
params:
  type: lesson
  category: 3
  stacks: 0
  number: 202
  time: 30
  level: beginner
  tags: missing data,machine learning,data cleaning,imputation,feature engineering,outliers
  description: "Provides an overview of the methods and purpose of feature
                engineering required to shape data for the training of 
                machine learning models."
date: "<small>`r Sys.Date()`</small>"
author: "<small>Martin Schedlbauer</small>"
email: "m.schedlbauer@neu.edu"
affilitation: "Northeastern University"
output: 
  bookdown::html_document2:
    toc: true
    toc_float: true
    collapsed: false
    number_sections: false
    code_download: true
    theme: spacelab
    highlight: tango
---

---
title: "<small>`r params$category`.`r params$number`</small><br/><span style='color: #2E4053; font-size: 0.9em'>`r rmarkdown::metadata$title`</span>"
---

```{r code=xfun::read_utf8(paste0(here::here(),'/R/_insert2DB.R')), include = FALSE}
```

------------------------------------------------------------------------

## Objectives

Upon completion of this lesson, you will be able to:

-   list the different methods of feature engineering
-   know when to apply different methods
-   consider what happens when data is not properly shaped

------------------------------------------------------------------------

## Introduction

Feature engineering is a key step in creating machine learning models. It has three aims: preparing the input data set so that it meets the requirements of the machine learning algorithm, improving model performance, and creating features from raw data that better represent the underlying problem.

In simpler terms, feature engineering is the process of transforming raw data into features that better represent the underlying problem, with the goal of improving model accuracy on unseen data.

## Features

Features in your data are characteristics or attributes that, when processed and interpreted by a machine learning model, can help the model make accurate predictions. They can be anything from directly recorded values (like a customer's age or income in a financial model) to values calculated from raw data (like the length of a text string or the sum total of certain transactions).

In the context of machine learning, a feature is an individual, measurable property or characteristic of a phenomenon being observed. It is essentially an input variable, also called an independent variable, of the model. In simpler terms, features are the attributes that we use to predict our target variable.

Alternative terms for feature include column, dimension, variable, attribute, and property.

For example, if you're trying to predict the price of a house (the target variable), features could include the number of bedrooms, the size of the house in square feet, the age of the house, the location, and so on. Each of these features provides information that the model can use to learn patterns and, ultimately, predict house prices.

In the same vein, if you're building a spam detection model, features could include the frequency of certain words, the email's length, the time it was sent, whether it contains attachments, and more.
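
As a small illustration, the house price example could be laid out as a data frame in which every column other than the target is a feature. The column names and values below are invented for this sketch and do not come from a real data set.

```{r}
# hypothetical data: names and values are invented for illustration only
houses <- data.frame(
  bedrooms = c(2, 3, 4),
  sqft     = c(900, 1500, 2200),
  age      = c(40, 12, 5),
  price    = c(210000, 340000, 515000)   # target variable
)

# every column except the target is a feature
features <- setdiff(names(houses), "price")
features
```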

## Feature Engineering

Feature engineering involves creating new features from existing data that can better represent the underlying problem to the predictive models, ultimately improving their performance. This might involve transforming existing features, creating new features through combinations of existing ones, or even using domain knowledge to create entirely new features.

Feature engineering requires domain knowledge of the specific field the model is being built for, as well as a solid understanding of the data. This is because creating effective new features involves an understanding of the relationship between existing features and the target variable we want to predict.

The process can involve a variety of tasks like:

1.  **Variable Transformation:** This includes tasks such as standardization (scaling your variable to have zero mean and unit variance), or normalization (scaling variables between a minimum and maximum value), and, when needed, transforming features to a normal (Gaussian) distribution.

2.  **Variable / Feature Creation:** This involves creating new, more complex features out of existing ones. This could be as simple as creating a "length of text" feature in a sentiment analysis model, or as complex as creating a "customer lifetime value" feature in a sales prediction model.

3.  **Handling Missing Values:** Not all features will have values for all data points. Deciding on how to handle missing data is a crucial step in feature engineering.

4.  **Handling Categorical Data:** Many machine learning models require inputs to be numerical. Converting categorical data into a suitable numerical form is another common feature engineering task.

5.  **Dealing with Outliers:** Outliers, or values that are significantly different from most of the other values, can distort the performance of a machine learning model. Feature engineering often involves identifying and dealing with outliers, such as removing them or treating them as missing values and imputing a new value.

6.  **Feature Grooming:** Some features may not be relevant or may not contribute to the training of a model, so they are removed. There may also be more features than the machine learning algorithm can handle, in which case some must be removed or combined into new features.

By leveraging domain knowledge, the feature engineering process can create more meaningful features and significantly boost the predictive power of machine learning models.

## Consequences of Poor Engineering

If feature engineering is omitted from the process of developing a machine learning model, it could lead to a number of negative consequences:

1.  **Poor Model Performance:** Without feature engineering, the model may not perform as well as it could. This is because the raw data may not adequately represent the underlying structures and patterns that the model needs to learn. By creating new features or transforming existing ones, we can expose these structures and patterns more clearly, making it easier for the model to learn from them.

2.  **Overfitting or Underfitting:** Overfitting occurs when a model fits the training data too closely, including its noise, and therefore performs poorly on unseen data. Underfitting occurs when the model fails to capture the patterns in the training data and performs poorly even on the training data. Feature engineering can help with both. For example, replacing many raw features with a smaller number of more meaningful ones reduces the dimensionality of the data, which can help avoid overfitting, while adding an informative feature can help a simple model that would otherwise underfit.

3.  **Inefficiency:** Without feature engineering, we might need to use more complex models to achieve the same performance. For example, with the right features, a linear model might be sufficient to get good results. Without feature engineering, we might need to use a much more computationally intensive model, like a neural network, to get the same performance.

4.  **Lack of Interpretability:** Feature engineering often results in models that are more interpretable. For example, if we use domain knowledge to create features that have a clear meaning, it can be easier to understand why the model makes its predictions.

5.  **Ignoring Domain Knowledge:** One of the biggest strengths of feature engineering is the ability to incorporate domain knowledge into the model. If we omit feature engineering, we miss out on this opportunity.

Overall, feature engineering is a crucial part of the machine learning process, and omitting it can lead to worse performance, inefficiency, overfitting, underfitting, and models that are hard to interpret.
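
The point about performance and model complexity can be illustrated with a small sketch. It assumes synthetic data in which the target depends on the square of a single feature; the data, the noise level, and the variable names are all made up for illustration. The same type of linear model is fit twice, once on the raw feature and once on an engineered squared feature.

```{r}
# synthetic data (an assumption for illustration): y depends on x squared
set.seed(42)
x <- seq(-3, 3, by = 0.25)
y <- 2 * x^2 + 1 + rnorm(length(x), sd = 0.5)

# linear model on the raw feature only
fit.raw <- lm(y ~ x)

# the same type of model, but on an engineered feature
x.sq <- x^2
fit.eng <- lm(y ~ x.sq)

# compare the proportion of variance explained by each model
r2.raw <- summary(fit.raw)$r.squared
r2.eng <- summary(fit.eng)$r.squared
round(c(raw = r2.raw, engineered = r2.eng), 3)
```

In this constructed case the linear model explains very little of the variance with the raw feature and most of it with the engineered one, so no more complex model is needed. Real data rarely behaves this cleanly.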

## Feature Engineering and CRISP-DM

The **CRISP-DM** (Cross-Industry Standard Process for Data Mining) framework consists of six phases:

1.  Business Understanding
2.  Data Understanding
3.  Data Preparation
4.  Modeling
5.  Evaluation
6.  Deployment

Feature engineering primarily occurs during the **Data Preparation** phase of the CRISP-DM framework. This is after you've developed an understanding of the business problem and the data that you have to work with, but before you begin modeling.

During the Data Preparation phase, you'll perform tasks such as data cleaning, dealing with missing values, transforming variables, and creating new features. These activities are crucial for ensuring that your data is in the right format and condition for modeling.

However, it's important to note that feature engineering is an iterative process. While the majority of feature engineering work often happens during Data Preparation, you might go back and perform additional feature engineering after doing some modeling, especially if the model performance is not satisfactory or if you get new insights during the modeling or evaluation phase.

For instance, you might find that a certain feature isn't providing as much predictive power as you thought it would, or that a different transformation of a feature might improve the model's performance. In these cases, you would go back to the Data Preparation phase, make the necessary adjustments, and then proceed with the modeling process again.

For more information on CRISP-DM, consult [Lesson 5.104 / CRISP-DM Process for Data Analytics and Data Mining](http://artificium.us/lessons/05.dm/l-5-104-crisp-dm/l-5-104.html).

## Feature Engineering Tasks

### Variable Transformation

This includes tasks such as standardization (scaling your variable to have zero mean and unit variance), or normalization (scaling variables between a minimum and maximum value), and, when needed, transforming features to a normal (Gaussian) distribution.
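
A minimal sketch of these three transformations in base R follows. The income values are hypothetical, and the log transform is only one of several ways to make a right-skewed feature more symmetric.

```{r}
# hypothetical feature values
income <- c(32000, 45000, 51000, 58000, 120000)

# standardization (z-score): subtract the mean, divide by the standard deviation
income.z <- (income - mean(income)) / sd(income)

# min-max normalization: rescale to the range [0, 1]
income.mm <- (income - min(income)) / (max(income) - min(income))

# log transform: one common way to reduce right skew
income.log <- log(income)

round(cbind(income.z, income.mm, income.log), 3)
```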

For more information consult [Lesson 3.206 / Normalizing Numeric Features for Machine Learning Algorithms](http://artificium.us/lessons/03.ml/l-3-206-feature-normalization/l-3-206.html).

### Variable / Feature Creation

This involves creating new, more complex features out of existing ones. This could be as simple as creating a "length of text" feature in a sentiment analysis model, or as complex as creating a "customer lifetime value" feature in a sales prediction model.

For more information consult [Lesson TBD]().
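
As a minimal sketch of feature creation, the code below derives a "length of text" feature and a word count from raw review text. The reviews and the column names are hypothetical.

```{r}
# hypothetical reviews for a sentiment analysis task
reviews <- data.frame(
  text = c("Great product", "Terrible. Broke after one day.", "OK"),
  stringsAsFactors = FALSE
)

# new feature: number of characters in each review
reviews$text.length <- nchar(reviews$text)

# new feature: number of whitespace-separated words in each review
reviews$word.count <- lengths(strsplit(reviews$text, "\\s+"))

reviews
```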

### Handling Missing Values

Not all features will have values for all data points. Deciding on how to handle missing data is a crucial step in feature engineering.
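
Two common options, sketched here on a hypothetical feature, are to impute a replacement value (the median is used below) or to keep only the observations that are complete. Which one is appropriate depends on the data and the model.

```{r}
# hypothetical feature with missing values
age <- c(23, NA, 35, 41, NA, 29)

# count the missing values
n.missing <- sum(is.na(age))

# option 1: replace missing values with the median of the observed values
age.imputed <- ifelse(is.na(age), median(age, na.rm = TRUE), age)

# option 2: keep only the observed values
age.complete <- age[!is.na(age)]

age.imputed
```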

For more information consult [Lesson 3.204 / Managing Missing Values in Data](http://artificium.us/lessons/03.ml/l-3-204-missing-values/l-3-204.html).

### Handling Categorical Data

Many machine learning models require inputs to be numerical. Converting categorical data into a suitable numerical form is another common feature engineering task.
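
One widely used conversion is one-hot (dummy) encoding, in which each category becomes its own 0/1 column. The sketch below uses `model.matrix()` from base R on a hypothetical `color` feature; other encodings may suit other situations better.

```{r}
# hypothetical categorical feature
shirts <- data.frame(color = factor(c("red", "green", "blue", "green")))

# one-hot encoding; "- 1" drops the intercept so every level gets a column
onehot <- model.matrix(~ color - 1, data = shirts)

onehot
```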

For more information consult [Lesson 3.207 / Encoding Categorical Features](http://artificium.us/lessons/03.ml/l-3-207-categorical-encoding/l-3-207.html).

### Dealing with Outliers

Outliers, or values that are significantly different from most of the other values, can distort the performance of a machine learning model. Feature engineering often involves identifying and dealing with outliers, such as removing them or treating them as missing values and imputing a new value.
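
The sketch below flags outliers with the 1.5 × IQR rule, treats them as missing, and imputes the median of the remaining values. The data is hypothetical, and the 1.5 × IQR rule is only one of several detection conventions, chosen here as an assumption.

```{r}
# hypothetical feature with one extreme value
amount <- c(12, 14, 15, 15, 16, 17, 18, 95)

# bounds based on the first and third quartiles
q1 <- unname(quantile(amount, 0.25))
q3 <- unname(quantile(amount, 0.75))
iqr <- q3 - q1
is.outlier <- amount < q1 - 1.5 * iqr | amount > q3 + 1.5 * iqr

# treat outliers as missing, then impute the median of the remaining values
amount.clean <- amount
amount.clean[is.outlier] <- NA
amount.clean[is.na(amount.clean)] <- median(amount.clean, na.rm = TRUE)

amount.clean
```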

For more information consult [Lesson 3.203 / Detecting and Managing Outliers](http://artificium.us/lessons/03.ml/l-3-203-outliers/l-3-203.html).

### Feature Grooming

Some features may not be relevant or may not contribute to the training of a model, so they are removed. There may also be more features than the machine learning algorithm can handle, in which case some must be removed or combined into new features.
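
As a rough sketch of removing features, the code below drops a feature that never varies and one feature from each pair that is almost perfectly correlated. The data set and the 0.95 correlation threshold are assumptions made for illustration.

```{r}
# hypothetical data: 'const' never varies, 'height.in' duplicates 'height.cm'
d <- data.frame(
  height.cm = c(150, 160, 170, 180, 190),
  height.in = c(150, 160, 170, 180, 190) / 2.54,
  weight.kg = c(55, 72, 60, 90, 77),
  const     = rep(1, 5)
)

# remove features with zero variance
d <- d[, sapply(d, var) > 0, drop = FALSE]

# find pairs of features with very high correlation (upper triangle only)
cors <- cor(d)
high <- which(abs(cors) > 0.95 & upper.tri(cors), arr.ind = TRUE)

# drop the second feature of each highly correlated pair
keep <- setdiff(seq_along(d), unique(high[, "col"]))
d.groomed <- d[, keep, drop = FALSE]

names(d.groomed)
```

Combining features rather than removing them, for example with principal component analysis via `prcomp()`, is another option that is not shown here.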

## Tutorials

<iframe src="https://player.vimeo.com/video/833650060?h=06a83c4670" width="640" height="564" allowfullscreen="allowfullscreen" allow="autoplay; fullscreen" data-external="1">

</iframe>

<iframe src="https://player.vimeo.com/video/833650171?h=462f71c095" width="640" height="564" allowfullscreen="allowfullscreen" allow="autoplay; fullscreen" data-external="1">

</iframe>

<iframe src="https://player.vimeo.com/video/833650221?h=58de102c33" width="640" height="564" allowfullscreen="allowfullscreen" allow="autoplay; fullscreen" data-external="1">

</iframe>

<iframe src="https://player.vimeo.com/video/833650266?h=fcc435d593" width="640" height="564" allowfullscreen="allowfullscreen" allow="autoplay; fullscreen" data-external="1">

</iframe>

<iframe src="https://player.vimeo.com/video/833650300?h=5d85cfa61c" width="640" height="564" allowfullscreen="allowfullscreen" allow="autoplay; fullscreen" data-external="1">

</iframe>

## Summary

Feature engineering is a critical step in the machine learning pipeline where raw data is transformed into a suitable format to improve the performance of machine learning models. It involves creating new features from existing ones, or transforming features, based on domain knowledge and insights gained from the data.

A "feature" in the context of machine learning refers to an individual measurable property or characteristic of the phenomenon being observed. Essentially, it's an input variable or an independent variable in the model that we use to predict our target variable.

Neglecting feature engineering can lead to various negative consequences such as poor model performance, overfitting or underfitting, the need for more complex and computationally intensive models, and models that are difficult to interpret. It also means missing the opportunity to incorporate domain knowledge into the model.

In the CRISP-DM (Cross-Industry Standard Process for Data Mining) framework, feature engineering primarily happens during the Data Preparation phase. This phase is after the Business Understanding and Data Understanding stages but before the Modeling phase. However, feature engineering is an iterative process and adjustments may be made after initial modeling and evaluation based on further insights and the performance of the model.

In summary, feature engineering is a critical, iterative process in machine learning model development that involves transforming and creating features to better represent the underlying problem to the model, improving model performance and interpretability.

------------------------------------------------------------------------

## Files & Resources

```{r zipFiles, echo=FALSE}
zipName = sprintf("LessonFiles-%s-%s.zip", 
                 params$category,
                 params$number)

textALink = paste0("All Files for Lesson ", 
               params$category,".",params$number)

# downloadFilesLink() is included from _insert2DB.R
knitr::raw_html(downloadFilesLink(".", zipName, textALink))
```

------------------------------------------------------------------------

## Learn More

-   [Lesson 3.206 / Normalizing Numeric Features for Machine Learning Algorithms](http://artificium.us/lessons/03.ml/l-3-206-feature-normalization/l-3-206.html)

-   [Lesson 3.204 / Managing Missing Values in Data](http://artificium.us/lessons/03.ml/l-3-204-missing-values/l-3-204.html)

-   [Lesson 3.207 / Encoding Categorical Features](http://artificium.us/lessons/03.ml/l-3-207-categorical-encoding/l-3-207.html)

-   [Lesson 3.203 / Detecting and Managing Outliers](http://artificium.us/lessons/03.ml/l-3-203-outliers/l-3-203.html)

-   [Lesson 55.202 / Data Quality, Outliers, and Missing Data](http://artificium.us/lessons/55.data-analytics/l-55-202-missing-data-outliers/l-55-202.html)

## Errata

None collected yet. [Let us know](https://form.jotform.com/212187072784157){target="_blank"}.

```{r code=xfun::read_utf8(paste0(here::here(),'/R/_deployKnit.R')), include = FALSE}
```
