Objectives

Upon completion of this lesson, you will be able to:

  • define the kNN algorithm
  • know when to use kNN
  • tune the value of the hyper-parameter k
  • appreciate the need for feature normalization
  • recognize why outliers can be problematic for kNN

Introduction

k-Nearest Neighbors (k-NN or kNN) is an instance-based learning algorithm used for both classification and regression tasks. It is called a ‘lazy learner’ because it does not build an explicit model during training; it simply stores the training dataset and defers all computation until a prediction is requested. A prediction for a new data point is based on its “neighborhood”, that is, the training data points that lie closest to it.

The basic steps involved in the k-NN algorithm are:

  1. Choose the value of k and a distance metric: The number of neighbors (k) and the distance metric are chosen to suit the problem. For example, you could choose k=3 and use Euclidean distance as the metric.

  2. Find the k nearest neighbors of the data point for which you want a prediction: For each such data point, you identify the k points in the training dataset that are “closest” according to the chosen distance metric.

  3. Assign the prediction: For a classification problem, the new data point is assigned the class that is most common among its k nearest neighbors (a majority vote). For a regression problem, the new data point is typically assigned the mean of the target values of its k nearest neighbors.
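
The three steps above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function names (euclidean, k_nearest, knn_classify, knn_regress) and the toy data are made up for this example, Euclidean distance is assumed, and tied votes are resolved arbitrarily in favor of the label that appears first among the nearest neighbors.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two equal-length numeric sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def k_nearest(train_X, train_y, query, k):
    """Step 2: return the targets of the k training points closest to query."""
    ranked = sorted(zip(train_X, train_y),
                    key=lambda pair: euclidean(pair[0], query))
    return [target for _, target in ranked[:k]]

def knn_classify(train_X, train_y, query, k=3):
    """Step 3 (classification): majority vote among the k nearest neighbors."""
    votes = Counter(k_nearest(train_X, train_y, query, k))
    return votes.most_common(1)[0][0]

def knn_regress(train_X, train_y, query, k=3):
    """Step 3 (regression): mean target value of the k nearest neighbors."""
    neighbors = k_nearest(train_X, train_y, query, k)
    return sum(neighbors) / len(neighbors)

# Hypothetical toy data: two well-separated groups of points.
X = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (6.0, 6.0), (6.5, 7.0), (7.0, 6.5)]
labels = ["A", "A", "A", "B", "B", "B"]       # categorical target
values = [10.0, 12.0, 11.0, 50.0, 52.0, 51.0]  # numeric target

# Step 1: k=3 with Euclidean distance.
print(knn_classify(X, labels, (2.0, 2.0), k=3))
print(knn_regress(X, values, (6.0, 6.5), k=3))
```

Note that there is no training step: every call sorts the entire training set by distance to the query, which is why prediction cost grows with the size of the dataset.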

kNN is a simple algorithm that is easy to understand and implement. It works well with smaller datasets that have few dimensions (features) and in which the classes are reasonably well separated. However, it may not be suitable for large, high-dimensional datasets: a straightforward implementation computes the distance from each new data point to every training point, which can become very expensive, and distances tend to become less informative as the number of dimensions grows. It is also sensitive to irrelevant or correlated features, which can distort which neighbors count as “closest”.

An important aspect of kNN is the choice of k. A small value of k can make the algorithm sensitive to noise, while a large value of k blurs the boundaries between classes. For two-class problems, an odd value of k avoids tied votes. Typically, the value of k is selected with a technique such as cross-validation.
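
As a rough sketch of how k might be tuned, the example below scores several candidate values of k with leave-one-out cross-validation, which is only one of several cross-validation schemes. The helper names (loocv_accuracy, select_k) and the dataset are hypothetical; the data is deliberately contrived so that each cluster contains one "noisy" point carrying the other cluster's label.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two equal-length numeric sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(train_X, train_y, query, k):
    """Majority vote among the k training points closest to query."""
    ranked = sorted(zip(train_X, train_y),
                    key=lambda pair: euclidean(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

def loocv_accuracy(X, y, k):
    """Hold out each point in turn, predict it from the rest, return accuracy."""
    correct = 0
    for i in range(len(X)):
        rest_X = X[:i] + X[i + 1:]
        rest_y = y[:i] + y[i + 1:]
        if knn_classify(rest_X, rest_y, X[i], k) == y[i]:
            correct += 1
    return correct / len(X)

def select_k(X, y, candidates):
    """Score every candidate k; on ties, prefer the smallest k."""
    scores = {k: loocv_accuracy(X, y, k) for k in candidates}
    best = max(sorted(candidates), key=lambda k: scores[k])
    return best, scores

# Contrived data: the center of each cluster is labeled with the other class.
X = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (2.0, 2.0), (1.5, 1.5),
     (6.0, 6.0), (6.0, 7.0), (7.0, 6.0), (7.0, 7.0), (6.5, 6.5)]
y = ["A", "A", "A", "A", "B",
     "B", "B", "B", "B", "A"]

best, scores = select_k(X, y, [1, 3, 5])
print(best, scores)
```

With k=1, the single noisy point at the center of each cluster is the nearest neighbor of every other point in that cluster, so it dictates their predictions; with k=3 the noisy point is outvoted. This is the sensitivity to noise described above, exaggerated by the tiny dataset.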

Examples of applications of kNN include recommendation systems, image recognition, and text categorization, among others.

Required Data Shaping

Because kNN relies entirely on distances between data points, the data must be shaped before the algorithm is applied:

  1. remove outliers
  2. impute or remove missing feature values
  3. normalize numeric features
  4. encode categorical features
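
Two of these steps, normalization and encoding, can be sketched as follows. This is an illustrative example under stated assumptions: min-max scaling and one-hot encoding are only one possible choice for each step, the helper names (min_max_normalize, one_hot) and the sample columns are invented for this example, and missing values and outliers are assumed to have been handled already.

```python
def min_max_normalize(column):
    """Rescale a numeric column to the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]

def one_hot(column):
    """Encode a categorical column as one 0/1 indicator per category."""
    categories = sorted(set(column))
    return [[1.0 if v == c else 0.0 for c in categories] for v in column]

# Hypothetical raw features on very different scales.
ages = [25, 35, 45, 55]
incomes = [30000, 90000, 32000, 88000]
housing = ["rent", "own", "rent", "own"]

norm_ages = min_max_normalize(ages)
norm_incomes = min_max_normalize(incomes)
encoded = one_hot(housing)  # columns are ordered ["own", "rent"]

# One numeric row per observation, ready for a distance calculation.
rows = [[a, i] + h for a, i, h in zip(norm_ages, norm_incomes, encoded)]
for row in rows:
    print(row)
```

Without normalization, a Euclidean distance over these features would be dominated by income, whose values are in the tens of thousands, while age differences of a few decades would barely register. The sketch also hints at why outliers are handled first: with min-max scaling, a single extreme value stretches the range, so that a column such as [1, 2, 3, 100] leaves the first three values squeezed close to 0.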

Tutorials

In the narrated presentation below, Khoury Boston’s Dr. Schedlbauer provides an introduction to the kNN machine learning algorithm for predicting categorical target variables.

In the presentation below, Dr. Schedlbauer provides a more detailed look at kNN, its implementation, and the data preparation steps required for kNN.


Files & Resources

All Files for Lesson 3.410

Errata

None collected yet. Let us know.

