Objectives

Upon completion of this lesson, you will be able to:

  • implement kNN in R for classification and regression

Motivation

This is a simple implementation of the k-Nearest Neighbors (kNN) algorithm for classification and regression. It is neither efficient nor scalable; it is meant to illustrate how the algorithm works and should not be used in production.

Definitions of Functions

kNNMODE - identifies the most frequently occurring value in a vector of nominal values

kNNMODE <- function(x) 
{
  ux <- unique(x)
  return (ux[which.max(tabulate(match(x, ux)))])
}
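
A short usage sketch may help show how the majority vote behaves, including what happens on a tie. The vectors `votes` and `tied` are made-up examples, and kNNMODE is restated only so the snippet runs on its own.

```r
# Restated from the lesson so the snippet is self-contained
kNNMODE <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

# Hypothetical labels of four neighbors: "fruit" occurs most often
votes <- c("fruit", "protein", "fruit", "vegetable")
winner <- kNNMODE(votes)       # "fruit"

# On a tie, which.max() picks the first maximum, so the value that
# appears first in the vector wins
tied <- c("protein", "fruit", "fruit", "protein")
tie.winner <- kNNMODE(tied)    # "protein"
```

Because ties are resolved by position rather than by distance, an odd k is often a reasonable choice for two-class problems.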

kNNAVG - calculates the average value in a vector of numeric values

kNNAVG <- function(x) 
{
  return (mean(x))
}

kNNDIST - calculates the Euclidean distance between two vectors of equal size containing numeric elements.

kNNDIST <- function(p, q)
{
  d <- 0
  for (i in seq_along(p)) {
    d <- d + (p[i] - q[i])^2
  }
  
  return(sqrt(d))
}
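
The loop above spells out the formula term by term. As a sketch, the same distance can be written with R's vectorized arithmetic; the name `euclid` is an illustrative choice and not part of the lesson's code.

```r
# Vectorized Euclidean distance between two numeric vectors of equal length
# (euclid is a hypothetical helper name)
euclid <- function(p, q) {
  stopifnot(length(p) == length(q))
  sqrt(sum((p - q)^2))
}

d <- euclid(c(0, 0), c(3, 4))   # the 3-4-5 right triangle, so d is 5
```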

kNNNeighbors - returns a vector of distances between an object u and each row of a data frame of features; all features must be numeric

kNNNeighbors <- function (train, u)
{
   m <- nrow(train)
   ds <- numeric(m)
   for (i in seq_len(m)) {
     p <- train[i,]
     ds[i] <- unlist(kNNDIST(p,u))
   }
   
   return(ds)
}
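
Looping over rows is easy to follow but slow on larger data. One possible vectorized alternative, shown here as a sketch, subtracts u from every row with sweep() and sums the squared differences per row. The function name `neighbor.dists` and the small `train` data frame are assumptions made for illustration.

```r
# Distances from u to every row of a numeric data frame, without a loop
# (neighbor.dists is a hypothetical name)
neighbor.dists <- function(train, u) {
  m <- as.matrix(train)
  stopifnot(ncol(m) == length(u))
  diffs <- sweep(m, 2, u, FUN = "-")   # subtract u[j] from column j
  unname(sqrt(rowSums(diffs^2)))
}

# Made-up training points
train <- data.frame(x = c(0, 3, 6), y = c(0, 4, 8))
ds <- neighbor.dists(train, c(0, 0))   # 0, 5, 10
```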

k.closest - finds the k smallest values in a vector of distances and returns their indexes

k.closest <- function(neighbors,k)
{
  # uses the order function from R to sort the vector 
  # of neighbors by distance
  ordered.neighbors <- order(neighbors)
  
  # extracts only the top k neighbors and
  # returns the indexes of those closest neighbors
  return(ordered.neighbors[1:k])
}
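
To make the role of order() concrete, here is a small sketch on a made-up vector of distances. It also shows one way to guard against a k that exceeds the number of neighbors, which the function above does not check.

```r
# Hypothetical distances from an unknown object to five training rows
dists <- c(4.2, 0.5, 3.1, 0.9, 7.0)

# order() returns positions sorted by value, not the values themselves
idx <- order(dists)[1:2]        # positions 2 and 4
nearest <- dists[idx]           # 0.5 and 0.9

# With k larger than the vector, [1:k] pads with NA; head() does not
padded <- order(dists)[1:7]
safe <- head(order(dists), 7)
```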

KNN.CLASSIFICATION - finds the most likely class that an unknown object u belongs to based on a training data frame of features, a corresponding vector of labels, and a provided k. The name is purposely capitalized to avoid conflicts with other implementations of kNN from packages.

KNN.CLASSIFICATION <- function (train, labels, u, k)
{
  nb <- kNNNeighbors(train,u)
  f <- k.closest(nb,k)
  return(kNNMODE(labels[f]))
}

KNN.REGRESSION - predicts the target value for an unknown object u based on a training data frame of features, a corresponding vector of target values, and a provided k. The name is purposely capitalized to avoid conflicts with other implementations of kNN from packages.

KNN.REGRESSION <- function (train, target, u, k)
{
  nb <- kNNNeighbors(train,u)
  f <- k.closest(nb,k)
  return(kNNAVG(target[f]))
}
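
Classification and regression differ only in how the neighbors' values are combined: a majority vote in one case, a mean in the other. The following sketch makes that explicit with a single function that takes the combining step as an argument. The names `knn.predict`, `vote`, and `toy`, as well as all values in `toy`, are made up for illustration, and the distance step is vectorized rather than looped as in the lesson.

```r
# Generic kNN: find the k nearest rows, then combine their y values
knn.predict <- function(train, y, u, k, combine) {
  d <- sqrt(rowSums(sweep(as.matrix(train), 2, u)^2))
  nearest <- order(d)[1:k]
  combine(y[nearest])
}

# Majority vote, same logic as kNNMODE
vote <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

# Hypothetical toy data (not the contents of foods.csv)
toy <- data.frame(
  sweetness   = c(10, 1, 9, 2, 8, 1),
  crunchiness = c(9, 4, 2, 3, 8, 1),
  type        = c("fruit", "protein", "fruit", "protein", "fruit", "protein"),
  cost        = c(2, 4, 3, 5, 2.5, 4.5)
)
X <- toy[, c("sweetness", "crunchiness")]
u <- c(9, 7)

cls   <- knn.predict(X, toy$type, u, k = 3, combine = vote)  # "fruit"
price <- knn.predict(X, toy$cost, u, k = 3, combine = mean)  # 2.5
```

Here the three nearest rows are 5, 1, and 3, all of type fruit, with costs 2.5, 2, and 3.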

Use of Algorithm: Classification

Let’s apply the algorithm to classify food items based on three features: sweetness, crunchiness, and saltiness.

foods.training <- read.csv("foods.csv")

head(foods.training)
##   ingredient sweetness crunchiness saltiness      type cost
## 1      apple        10           9         0     fruit  2.3
## 2      bacon         1           4         8   protein  1.8
## 3     banana        10           1         0     fruit  2.8
## 4     carrot         7          10         0 vegetable  2.1
## 5     celery         3          10         0 vegetable  1.2
## 6     cheese         1           1         5   protein  3.2
# unknown case 
# (new food items with a measured sweetness, crunchiness, and saltiness)
u <- c(3, 1, 2)
w <- c(10, 8, 2)

# separate the label and features
labels <- foods.training$type

# only consider numeric features (or encoded categorical features)
training.features <- foods.training[,2:4]

# classify the new item using our new knn algorithm
nn <- KNN.CLASSIFICATION(training.features, labels, u, k = 4)
print(paste0("food type is '",nn,"'"))
## [1] "food type is 'protein'"
nn <- KNN.CLASSIFICATION(training.features, labels, w, k = 4)
print(paste0("food type is '",nn,"'"))
## [1] "food type is 'fruit'"

Use of Algorithm: Regression

Let’s apply the algorithm to predict the cost of a food item based on three numeric features: sweetness, crunchiness, and saltiness.

# unknown case 
# (new food items with a measured sweetness, crunchiness, saltiness)
u <- c(3, 1, 2)
w <- c(10, 8, 2)

# separate the target feature and the predictive features
target <- foods.training$cost

# only consider numeric features
training.features <- foods.training[,2:4]

# predict the cost of the new item using our new knn algorithm
pr <- KNN.REGRESSION(training.features, target, u, k = 4)

# predicted food price based on features
print(paste0("food price is '",pr,"'"))
## [1] "food price is '4.95'"

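Let’s also consider the categorical feature type. We first have to convert it to a numeric feature using an encoding scheme, such as frequency encoding. In the code below, the type column is filled with the constant 0 as a placeholder for the encoded values, so it contributes nothing to the distances and the prediction is unchanged.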

# unknown case 
# (new food items with a measured sweetness, crunchiness, saltiness)
u <- c(3, 1, 2, 0)
w <- c(10, 8, 2, 0)

# separate the target feature and the predictive features
target <- foods.training$cost

# numeric features plus a column for the encoded type
# (filled with 0 here as a placeholder)
training.features <- foods.training[,2:4]
training.features$type <- 0

# predict the cost of the new item using our new knn algorithm
pr <- KNN.REGRESSION(training.features, target, u, k = 4)

# predicted food price based on features
print(paste0("food price is '",pr,"'"))
## [1] "food price is '4.95'"
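
The listing above sets the type column to a constant. As a rough sketch of what frequency encoding could look like, each category can be replaced by its relative frequency in the training data. The `type` vector below is a made-up example rather than the contents of foods.csv.

```r
# Hypothetical categorical feature
type <- c("fruit", "protein", "fruit", "vegetable",
          "vegetable", "protein", "fruit")

# Relative frequency of each category in the training data
freq <- table(type) / length(type)

# Replace each category by its frequency (lookup by name)
type.encoded <- as.numeric(freq[type])

# A new item's type must be encoded with the training frequencies
new.type.encoded <- as.numeric(freq["protein"])   # 2/7
```

Note that two categories with the same frequency (protein and vegetable here) receive the same code and become indistinguishable to the distance function.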

Make sure you standardize any new data values the same way as you standardized the training data; otherwise, distance calculations will not be meaningful.
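
A minimal sketch of that idea, assuming z-score standardization: compute the mean and standard deviation of each training feature once, then apply those same values to the new case. The rows below reuse the sweetness and crunchiness values of the first four foods printed by head(foods.training); the variable names are illustrative.

```r
# Two features from the first four rows shown earlier
train <- data.frame(sweetness = c(10, 1, 10, 7),
                    crunchiness = c(9, 4, 1, 10))

# Parameters are estimated from the training data only
mu  <- colMeans(train)
sdv <- apply(train, 2, sd)

# Standardize the training features
train.z <- scale(train, center = mu, scale = sdv)

# Standardize a new case with the SAME mu and sdv, never with its own
u   <- c(3, 1)
u.z <- (u - mu) / sdv
```

The standardized `train.z` and `u.z` would then be passed to the kNN functions in place of the raw values.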


Files & Resources

All Files for Lesson 3.411


