Objectives
Upon completion of this lesson, you will be able to:
- implement kNN in R for classification and regression
Motivation
This lesson presents a simple implementation of the k-Nearest Neighbors (kNN) algorithm for classification and regression. It is neither efficient nor scalable; it is meant to illustrate how the algorithm works and should not be used in production.
Definitions of Functions
kNNMODE - identifies the most frequently occurring value in a vector of nominal values
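kNNMODE <- function(x)
{
  # distinct values, in order of first appearance
  ux <- unique(x)
  # count each distinct value and return the most frequent one
  return (ux[which.max(tabulate(match(x, ux)))])
}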
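A short usage sketch may help show how kNNMODE behaves, including what happens on a tie. The vote vectors are made up for illustration, and the function is repeated from above only so that the block runs on its own.

```r
# kNNMODE as defined above, repeated so this block is self-contained
kNNMODE <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

# hypothetical labels of four nearest neighbors
votes <- c("fruit", "protein", "protein", "vegetable")
winner <- kNNMODE(votes)      # "protein" occurs twice, the others once

# hypothetical tie: two votes each
tied <- c("vegetable", "fruit", "fruit", "vegetable")
tie.winner <- kNNMODE(tied)   # "vegetable", the value that appears first
```

Because which.max returns the first maximum, a tie goes to the tied value that appears first in the vector. When the labels are ordered by distance, as they are in the classification function later in this lesson, that is the tied label with the nearest occurrence.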
kNNAVG - calculates the average value in a vector of numeric values
kNNAVG <- function(x)
{
  return (mean(x))
}
kNNDIST - calculates the Euclidean distance between two vectors of equal size containing numeric elements.
kNNDIST <- function(p, q)
{
  # accumulate the squared differences, one feature at a time
  d <- 0
  for (i in seq_along(p)) {
    d <- d + (p[i] - q[i])^2
  }
  return(sqrt(d))
}
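The loop above computes d(p, q) = sqrt(sum over i of (p_i - q_i)^2). As a sketch, the same quantity can be written with R's vectorized arithmetic. The name euclid is hypothetical and not part of this lesson's code.

```r
# hypothetical vectorized equivalent of the loop-based distance;
# assumes p and q are plain numeric vectors of equal length
euclid <- function(p, q) {
  stopifnot(length(p) == length(q))
  sqrt(sum((p - q)^2))
}

d1 <- euclid(c(0, 0), c(3, 4))         # sqrt(9 + 16) = 5
d2 <- euclid(c(3, 1, 2), c(1, 4, 8))   # sqrt(4 + 9 + 36) = 7
```

The loop version is kept in the lesson because it makes each step of the formula visible; the vectorized form is shorter and avoids the per-element loop.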
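kNNNeighbors - returns a vector of distances between an object u and each row of a data frame of features; all features must be numeric
kNNNeighbors <- function (train, u)
{
  m <- nrow(train)
  ds <- numeric(m)
  for (i in seq_len(m)) {
    # row i of the training data, as a one-row data frame
    p <- train[i,]
    # kNNDIST returns a one-cell data frame here, so unlist it
    ds[i] <- unlist(kNNDIST(p,u))
  }
  return(ds)
}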
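For comparison, here is a sketch of how the same vector of distances could be computed without an explicit loop over the rows. The function name neighbors.vectorized and the toy data are assumptions made for illustration only.

```r
# hypothetical loop-free variant: distances from u to every row of train
neighbors.vectorized <- function(train, u) {
  m <- as.matrix(train)
  # subtract u[j] from column j of the feature matrix
  diffs <- sweep(m, 2, u, "-")
  # sum the squared differences per row and take the square root
  sqrt(rowSums(diffs^2))
}

# made-up training data with two numeric features
toy <- data.frame(x = c(0, 3, 6), y = c(0, 4, 8))
ds <- neighbors.vectorized(toy, c(0, 0))   # 0 5 10
```

This trades the readability of the row-by-row loop for speed on larger data frames; it still requires that all features are numeric.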
k.closest - finds the indexes of the k smallest values in a vector of values
k.closest <- function(neighbors,k)
{
  # uses the order function from R to get the indexes that
  # sort the vector of neighbors by distance
  ordered.neighbors <- order(neighbors)
  # keeps only the first k indexes, i.e., those of the
  # k closest neighbors, and returns them
  return(ordered.neighbors[1:k])
}
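The difference between sorted values and sorted indexes is easy to miss, so a small sketch with made-up distances may help. It uses order directly rather than the function above.

```r
# hypothetical distances from an unknown case to five training cases
distances <- c(7.0, 2.5, 9.1, 0.4, 3.3)

# order returns positions, not values
ranking <- order(distances)      # 4 2 5 1 3

# positions of the three closest training cases
idx <- ranking[1:3]              # 4 2 5

# the corresponding distances, smallest first
closest <- distances[idx]        # 0.4 2.5 3.3
```

The indexes are what make the later steps work: they select the labels or target values of the nearest training cases. Note that indexing with 1:k yields NA entries when k exceeds the number of training cases; head(ranking, k) would be one way to guard against that.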
KNN.CLASSIFICATION - finds the most likely class that an unknown object u belongs to based on a training data frame of features, a corresponding vector of labels, and a provided k. The name is purposely capitalized to avoid conflicts with other implementations of kNN from packages.
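KNN.CLASSIFICATION <- function (train, labels, u, k)
{
  # distances from u to every training case
  nb <- kNNNeighbors(train,u)
  # indexes of the k nearest training cases
  f <- k.closest(nb,k)
  # majority vote among the labels of those cases
  return(kNNMODE(labels[f]))
}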
KNN.REGRESSION - predicts the target value of an unknown object u based on a training data frame of features, a corresponding vector of target values, and a provided k. The name is purposely capitalized to avoid conflicts with other implementations of kNN from packages.
KNN.REGRESSION <- function (train, target, u, k)
{
  # distances from u to every training case
  nb <- kNNNeighbors(train,u)
  # indexes of the k nearest training cases
  f <- k.closest(nb,k)
  # average of the target values of those cases
  return(kNNAVG(target[f]))
}
Use of Algorithm: Classification
Let’s apply the algorithm to classify food items based on three features: sweetness, crunchiness, and saltiness.
foods.training <- read.csv("foods.csv")
head(foods.training)
##   ingredient sweetness crunchiness saltiness      type cost
## 1      apple        10           9         0     fruit  2.3
## 2      bacon         1           4         8   protein  1.8
## 3     banana        10           1         0     fruit  2.8
## 4     carrot         7          10         0 vegetable  2.1
## 5     celery         3          10         0 vegetable  1.2
## 6     cheese         1           1         5   protein  3.2
# unknown case
# (new food items with a measured sweetness, crunchiness, and saltiness)
u <- c(3, 1, 2)
w <- c(10, 8, 2)
# separate the label and features
labels <- foods.training$type
# only consider numeric features (or encoded categorical features)
training.features <- foods.training[,2:4]
# classify the new item using our new knn algorithm
nn <- KNN.CLASSIFICATION(training.features, labels, u, k = 4)
print(paste0("food type is '",nn,"'"))
## [1] "food type is 'protein'"
nn <- KNN.CLASSIFICATION(training.features, labels, w, k = 4)
print(paste0("food type is '",nn,"'"))
## [1] "food type is 'fruit'"
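To see the steps at a smaller scale, the sketch below works through the classification of u by hand. It is an illustration under a strong assumption: it uses only the six rows printed by head(foods.training) and k = 3, so its neighbors are not the ones found in the full foods.csv. The name foods6 is hypothetical.

```r
# only the six rows shown by head(); the full data set has more rows
foods6 <- data.frame(
  ingredient  = c("apple", "bacon", "banana", "carrot", "celery", "cheese"),
  sweetness   = c(10, 1, 10, 7, 3, 1),
  crunchiness = c(9, 4, 1, 10, 10, 1),
  saltiness   = c(0, 8, 0, 0, 0, 5),
  type        = c("fruit", "protein", "fruit", "vegetable", "vegetable", "protein")
)
u <- c(3, 1, 2)

# Euclidean distance from u to each of the six rows
feats <- as.matrix(foods6[, 2:4])
d <- sqrt(rowSums(sweep(feats, 2, u, "-")^2))
round(d, 2)   # 10.82 7.00 7.28 10.05 9.22 3.61

# the three nearest rows and their labels
nearest <- order(d)[1:3]
foods6$ingredient[nearest]   # "cheese" "bacon" "banana"

# majority vote: protein, protein, fruit
votes <- table(foods6$type[nearest])
predicted <- names(votes)[which.max(votes)]   # "protein"
```

On these six rows the vote also comes out as protein, which agrees with the result for u above, but that agreement should not be read as a check of the full-data output.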
Use of Algorithm: Regression
Let’s apply the algorithm to predict the cost of a food item based on three numeric features: sweetness, crunchiness, and saltiness.
# unknown case
# (new food items with a measured sweetness, crunchiness, saltiness)
u <- c(3, 1, 2)
w <- c(10, 8, 2)
# separate the target feature and the predictive features
target <- foods.training$cost
# only consider numeric features
training.features <- foods.training[,2:4]
# predict the cost of the new item using our kNN algorithm
pr <- KNN.REGRESSION(training.features, target, u, k = 4)
# predicted food price based on features
print(paste0("food price is '",pr,"'"))
## [1] "food price is '4.95'"
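Let’s also consider the categorical feature type. We first have to convert it to a numeric feature using an encoding scheme, for example frequency encoding. The code below only reserves a column for the encoded feature and fills it with the placeholder value 0 for every case, so the distances, and therefore the prediction, are unchanged.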
# unknown case
# (new food items with a measured sweetness, crunchiness, saltiness,
# and a fourth value for the encoded type)
u <- c(3, 1, 2, 0)
w <- c(10, 8, 2, 0)
# separate the target feature and the predictive features
target <- foods.training$cost
# numeric features plus a column for the encoded type
training.features <- foods.training[,2:4]
# placeholder: every case gets 0; substitute the encoded values of type
training.features$type <- 0
# predict the cost of the new item using our kNN algorithm
pr <- KNN.REGRESSION(training.features, target, u, k = 4)
# predicted food price based on features
print(paste0("food price is '",pr,"'"))
## [1] "food price is '4.95'"
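As a sketch of what frequency encoding could look like, the block below replaces each category with its relative frequency in the training data. The type vector is made up for illustration and is not the column from foods.csv.

```r
# hypothetical categorical feature for seven training cases
type <- c("fruit", "protein", "fruit", "vegetable",
          "vegetable", "protein", "fruit")

# relative frequency of each category in the training data
freq <- table(type) / length(type)

# replace every category with its frequency
type.encoded <- as.numeric(freq[type])
round(type.encoded, 3)   # 0.429 0.286 0.429 0.286 0.286 0.286 0.429

# an unknown item of type "fruit" is encoded with the training frequency
u.type <- as.numeric(freq["fruit"])
```

One limitation is visible in this toy example: protein and vegetable occur equally often, so they receive the same encoded value and become indistinguishable to the distance calculation.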
If you standardize the training features, make sure you standardize any new data values the same way, using the means and standard deviations of the training data, or the distance calculations will not be meaningful.
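A minimal sketch of this, assuming z-score standardization with R's scale function: the training means and standard deviations are stored and then reused for the unknown case. The feature values are the six rows shown earlier by head(); the variable names are hypothetical.

```r
# features of the six rows printed by head(foods.training)
train <- data.frame(
  sweetness   = c(10, 1, 10, 7, 3, 1),
  crunchiness = c(9, 4, 1, 10, 10, 1),
  saltiness   = c(0, 8, 0, 0, 0, 5)
)

# z-score standardization; scale() records the parameters it used
scaled <- scale(train)
mu  <- attr(scaled, "scaled:center")   # column means of the training data
sdv <- attr(scaled, "scaled:scale")    # column standard deviations

# standardize the unknown case with the TRAINING parameters
u <- c(3, 1, 2)
u.scaled <- (u - mu) / sdv

# back to a data frame, the form the kNN functions in this lesson expect
train.scaled <- as.data.frame(scaled)
```

Computing a separate mean and standard deviation from the new data would put the unknown case on a different scale than the training cases, which is the situation the note above warns about.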
