Objectives

Upon completion of this lesson, you will be able to:

  • implement kNN in R for classification and regression

Motivation

This is a simple and illustrative implementation of the k-Nearest Neighbors algorithm for classification and regression. It is neither efficient nor scalable; it is meant to be illustrative and should not be used in production.

Definitions of Functions

kNNMODE - identifies the most frequently occurring value in a vector of nominal values

kNNMODE <- function(x) 
{
  ux <- unique(x)
  return (ux[which.max(tabulate(match(x, ux)))])
}

kNNAVG - calculates the average value in a vector of numeric values

kNNAVG <- function(x) 
{
  return (mean(x))
}

kNNDIST - calculates the Euclidean distance between two vectors of equal size containing numeric elements.

kNNDIST <- function(p, q)
{
  d <- 0
  # seq_along() is safe even when p is empty (1:length(p) would yield 1, 0)
  for (i in seq_along(p)) {
    d <- d + (p[i] - q[i])^2
  }
  
  return(sqrt(d))
}

kNNNeighbors - returns a vector of distances between an object u and each row of a data frame of features; all features must be numeric

kNNNeighbors <- function (train, u)
{
   m <- nrow(train)
   ds <- numeric(m)
   # seq_len() is safe even when train has no rows
   for (i in seq_len(m)) {
     p <- train[i,]
     ds[i] <- unlist(kNNDIST(p,u))
   }
   
   return(ds)
}

k.closest - finds the indexes of the k smallest values in a vector of values

k.closest <- function(neighbors,k)
{
  # order() returns the indexes that sort the vector 
  # of neighbors by ascending distance
  ordered.neighbors <- order(neighbors)
  
  # keeps only the indexes of the k closest neighbors;
  # head() avoids NA padding when k exceeds the number of neighbors
  return(head(ordered.neighbors, k))
}

KNN.CLASSIFICATION - finds the most likely class that an unknown object u belongs to based on a training data frame of features, a corresponding vector of labels, and a provided k. The name is purposely capitalized to avoid conflicts with other implementations of kNN from packages.

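KNN.CLASSIFICATION <- function (train, labels, u, k)
{
  # distances from u to every training row
  nb <- kNNNeighbors(train,u)
  # indexes of the k nearest rows
  f <- k.closest(nb,k)
  # majority vote among the labels of those rows
  return(kNNMODE(labels[f]))
}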

KNN.REGRESSION - estimates the target value of an unknown object u based on a training data frame of features, a corresponding vector of target values, and a provided k. The prediction is the average of the target values of the k nearest neighbors. The name is purposely capitalized to avoid conflicts with other implementations of kNN from packages.

KNN.REGRESSION <- function (train, target, u, k)
{
  # distances from u to every training row
  nb <- kNNNeighbors(train,u)
  # indexes of the k nearest rows
  f <- k.closest(nb,k)
  # average of the target values of those rows
  return(kNNAVG(target[f]))
}

Use of Algorithm: Classification

Let’s apply the algorithm to classify food items based on three features: sweetness, crunchiness, and saltiness.

foods.training <- read.csv("foods.csv")

head(foods.training)
##   ingredient sweetness crunchiness saltiness      type cost
## 1      apple        10           9         0     fruit  2.3
## 2      bacon         1           4         8   protein  1.8
## 3     banana        10           1         0     fruit  2.8
## 4     carrot         7          10         0 vegetable  2.1
## 5     celery         3          10         0 vegetable  1.2
## 6     cheese         1           1         5   protein  3.2
# unknown case 
# (new food items with a measured sweetness, crunchiness, and saltiness)
u <- c(3, 1, 2)
w <- c(10, 8, 2)

# separate the label and features
labels <- foods.training$type

# only consider numeric features (or encoded categorical features)
training.features <- foods.training[,2:4]

# classify the new item using our new knn algorithm
nn <- KNN.CLASSIFICATION(training.features, labels, u, k = 4)
print(paste0("food type is '",nn,"'"))
## [1] "food type is 'protein'"
nn <- KNN.CLASSIFICATION(training.features, labels, w, k = 4)
print(paste0("food type is '",nn,"'"))
## [1] "food type is 'fruit'"

Use of Algorithm: Regression

Let’s apply the algorithm to estimate the cost of a food item based on three numeric features: sweetness, crunchiness, and saltiness.

# unknown case 
# (new food items with a measured sweetness, crunchiness, saltiness)
u <- c(3, 1, 2)
w <- c(10, 8, 2)

# separate the target feature and the predictive features
target <- foods.training$cost

# only consider numeric features
training.features <- foods.training[,2:4]

# predict the cost of the new item using our new knn algorithm
pr <- KNN.REGRESSION(training.features, target, u, k = 4)

# predicted food price based on features
print(paste0("food price is '",pr,"'"))
## [1] "food price is '4.95'"

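Let’s also consider the categorical feature type. It first has to be converted to a numeric feature using an encoding scheme such as frequency encoding, which replaces each category with its relative frequency in the training data. The code below does not compute that encoding: it adds a placeholder column of zeros, which contributes nothing to the distances, so the prediction is unchanged.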

# unknown case 
# (new food items with a measured sweetness, crunchiness, saltiness,
#  and a placeholder value of 0 for the encoded type)
u <- c(3, 1, 2, 0)
w <- c(10, 8, 2, 0)

# separate the target feature and the predictive features
target <- foods.training$cost

# numeric features, plus a placeholder column (all zeros) for the encoded type
training.features <- foods.training[,2:4]
training.features$type <- 0

# predict the cost of the new item using our new knn algorithm
pr <- KNN.REGRESSION(training.features, target, u, k = 4)

# predicted food price based on features
print(paste0("food price is '",pr,"'"))
## [1] "food price is '4.95'"

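The examples above use the raw feature values. If you standardize the training features, make sure you standardize any new data values the same way, using the means and standard deviations of the training data; otherwise the distance calculations will not be meaningful.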


Files & Resources

All Files for Lesson 3.411

References

No references.

Errata

None collected yet. Let us know.

---
title: "Simple Implementation of kNN in R"
params:
  category: 3
  stacks: 0
  number: 411
  time: 30
  level: beginner
  tags: knn,machine learning,classification
  description: "Presents a simple implementation of kNN for classification
                and another implementation for regression, both in R."
date: "<small>`r Sys.Date()`</small>"
author: "<small>Martin Schedlbauer</small>"
email: "m.schedlbauer@neu.edu"
affilitation: "Northeastern University"
output: 
  bookdown::html_document2:
    toc: true
    toc_float: true
    collapsed: false
    number_sections: false
    code_download: true
    theme: spacelab
    highlight: tango
---

---
title: "<small>`r params$category`.`r params$number`</small><br/><span style='color: #2E4053; font-size: 0.9em'>`r rmarkdown::metadata$title`</span>"
---

```{r code=xfun::read_utf8(paste0(here::here(),'/R/_insert2DB.R')), include = FALSE}
```

------------------------------------------------------------------------

## Objectives

Upon completion of this lesson, you will be able to:

-   implement *kNN* in R for classification and regression

------------------------------------------------------------------------

## Motivation

This is a simple and illustrative implementation of the *k-Nearest Neighbors* algorithm for classification and regression. It is neither efficient nor scalable; it is meant to be illustrative and should not be used in production.

## Definitions of Functions

*kNNMODE* - identifies the most frequently occurring value in a vector of nominal values

```{r}
kNNMODE <- function(x) 
{
  ux <- unique(x)
  return (ux[which.max(tabulate(match(x, ux)))])
}
```
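
To make the one-line body of *kNNMODE* easier to follow, the sketch below walks through its intermediate steps on a small label vector. The vector and all `demo.*` names are hypothetical and only serve as an illustration.

```{r}
# hypothetical label vector with a tie between "protein" and "fruit"
demo.labels <- c("protein", "fruit", "fruit", "protein", "vegetable")

# distinct labels in order of first appearance
demo.ux <- unique(demo.labels)

# position of each label within demo.ux
demo.pos <- match(demo.labels, demo.ux)

# number of occurrences of each distinct label
demo.counts <- tabulate(demo.pos)

# which.max() returns the first maximum, so a tie goes to the label
# that appears first in the input
demo.mode <- demo.ux[which.max(demo.counts)]
print(demo.mode)
```

Because ties are broken by order of appearance, the result for tied votes depends on how the labels are ordered, which in this lesson's pipeline means the closest of the tied neighbors wins.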

*kNNAVG* - calculates the average value in a vector of numeric values

```{r}
kNNAVG <- function(x) 
{
  return (mean(x))
}
```

*kNNDIST* - calculates the Euclidean distance between two vectors of equal size containing numeric elements.

```{r}
kNNDIST <- function(p, q)
{
  d <- 0
  # seq_along() is safe even when p is empty (1:length(p) would yield 1, 0)
  for (i in seq_along(p)) {
    d <- d + (p[i] - q[i])^2
  }
  
  return(sqrt(d))
}
```
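
The loop above can also be written with R's vectorized arithmetic. The following is a minimal sketch of the same Euclidean distance for two plain numeric vectors; the two points and the `demo.*` names are hypothetical.

```{r}
# hypothetical points: (sweetness, crunchiness, saltiness)
demo.p <- c(10, 9, 0)
demo.q <- c(3, 1, 2)

# element-wise differences are squared, summed, and square-rooted
demo.d <- sqrt(sum((demo.p - demo.q)^2))

# differences are 7, 8, -2, so the result is sqrt(49 + 64 + 4) = sqrt(117)
print(demo.d)
```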

*kNNNeighbors* - returns a vector of distances between an object *u* and each row of a data frame of features; all features must be numeric

```{r}
kNNNeighbors <- function (train, u)
{
   m <- nrow(train)
   ds <- numeric(m)
   # seq_len() is safe even when train has no rows
   for (i in seq_len(m)) {
     p <- train[i,]
     ds[i] <- unlist(kNNDIST(p,u))
   }
   
   return(ds)
}
```
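
As a point of comparison, the row-by-row loop can be replaced by matrix operations. This is only a sketch under the assumption that all columns are numeric; the three training rows, the query point, and the `demo.*` names are made up for illustration.

```{r}
# hypothetical training features and query point
demo.train <- data.frame(
  sweetness   = c(10, 1, 1),
  crunchiness = c(9, 4, 1),
  saltiness   = c(0, 8, 5)
)
demo.u <- c(3, 1, 2)

# sweep() subtracts demo.u[j] from column j of the feature matrix
demo.diffs <- sweep(as.matrix(demo.train), 2, demo.u)

# one Euclidean distance per training row
demo.ds <- sqrt(rowSums(demo.diffs^2))
print(round(demo.ds, 2))
```

The result has one entry per training row, in the same order as the rows, which is what the later neighbor selection relies on.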

*k.closest* - finds the indexes of the *k* smallest values in a vector of values

```{r}
k.closest <- function(neighbors,k)
{
  # order() returns the indexes that sort the vector 
  # of neighbors by ascending distance
  ordered.neighbors <- order(neighbors)
  
  # keeps only the indexes of the k closest neighbors;
  # head() avoids NA padding when k exceeds the number of neighbors
  return(head(ordered.neighbors, k))
}
```
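
The short sketch below shows what `order()` returns for a vector of distances and what happens when *k* is larger than the number of neighbors. The distances and the `demo.*` names are hypothetical.

```{r}
# hypothetical distances from a query point to six training rows
demo.dist <- c(10.8, 7.0, 7.3, 10.0, 9.2, 3.6)

# order() returns row indexes sorted by ascending distance
demo.ord <- order(demo.dist)

# indexes of the three nearest rows
demo.top3 <- demo.ord[1:3]

# indexing past the end pads the result with NA ...
demo.padded <- demo.ord[1:8]

# ... whereas head() simply returns all available indexes
demo.safe <- head(demo.ord, 8)

print(demo.top3)
print(demo.padded)
print(demo.safe)
```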

*KNN.CLASSIFICATION* - finds the most likely class that an unknown object *u* belongs to based on a training data frame of features, a corresponding vector of labels, and a provided *k*. The name is purposely capitalized to avoid conflicts with other implementations of kNN from packages.

```{r}
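KNN.CLASSIFICATION <- function (train, labels, u, k)
{
  # distances from u to every training row
  nb <- kNNNeighbors(train,u)
  # indexes of the k nearest rows
  f <- k.closest(nb,k)
  # majority vote among the labels of those rows
  return(kNNMODE(labels[f]))
}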
```

*KNN.REGRESSION* - estimates the target value of an unknown object *u* based on a training data frame of features, a corresponding vector of target values, and a provided *k*. The prediction is the average of the target values of the *k* nearest neighbors. The name is purposely capitalized to avoid conflicts with other implementations of kNN from packages.

```{r}
KNN.REGRESSION <- function (train, target, u, k)
{
  # distances from u to every training row
  nb <- kNNNeighbors(train,u)
  # indexes of the k nearest rows
  f <- k.closest(nb,k)
  # average of the target values of those rows
  return(kNNAVG(target[f]))
}
```
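
To see how the steps fit together (distances, neighbor selection, then a vote or an average), here is a self-contained sketch that uses only base R. The six rows are a small inline stand-in for the training data, not the full *foods.csv* file, and the `demo.*` names and the choice of *k = 3* are assumptions made for this illustration.

```{r}
# small inline stand-in for the training data
demo.foods <- data.frame(
  ingredient  = c("apple", "bacon", "banana", "carrot", "celery", "cheese"),
  sweetness   = c(10, 1, 10, 7, 3, 1),
  crunchiness = c(9, 4, 1, 10, 10, 1),
  saltiness   = c(0, 8, 0, 0, 0, 5),
  type        = c("fruit", "protein", "fruit", "vegetable", "vegetable", "protein"),
  cost        = c(2.3, 1.8, 2.8, 2.1, 1.2, 3.2)
)
demo.item <- c(3, 1, 2)   # sweetness, crunchiness, saltiness
demo.k <- 3

# step 1: distance from the new item to every training row
demo.x <- as.matrix(demo.foods[, 2:4])
demo.dists <- sqrt(rowSums(sweep(demo.x, 2, demo.item)^2))

# step 2: indexes of the k nearest rows (here: cheese, bacon, banana)
demo.idx <- head(order(demo.dists), demo.k)

# step 3a: classification by majority vote among the neighbor labels
demo.votes <- table(demo.foods$type[demo.idx])
demo.class <- names(demo.votes)[which.max(demo.votes)]

# step 3b: regression by averaging the neighbor targets
demo.cost <- mean(demo.foods$cost[demo.idx])

print(demo.class)
print(demo.cost)
```

With only six rows the numbers differ from a run on the full training file; the point is the order of the steps, not the specific prediction.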

## Use of Algorithm: Classification

Let's apply the algorithm to classify food items based on three features: *sweetness*, *crunchiness*, and *saltiness*.

```{r}
foods.training <- read.csv("foods.csv")

head(foods.training)
```

```{r}
# unknown case 
# (new food items with a measured sweetness, crunchiness, and saltiness)
u <- c(3, 1, 2)
w <- c(10, 8, 2)

# separate the label and features
labels <- foods.training$type

# only consider numeric features (or encoded categorical features)
training.features <- foods.training[,2:4]

# classify the new item using our new knn algorithm
nn <- KNN.CLASSIFICATION(training.features, labels, u, k = 4)
print(paste0("food type is '",nn,"'"))

nn <- KNN.CLASSIFICATION(training.features, labels, w, k = 4)
print(paste0("food type is '",nn,"'"))
```

## Use of Algorithm: Regression

Let's apply the algorithm to estimate the cost of a food item based on three numeric features: *sweetness*, *crunchiness*, and *saltiness*.

```{r}
# unknown case 
# (new food items with a measured sweetness, crunchiness, saltiness)
u <- c(3, 1, 2)
w <- c(10, 8, 2)

# separate the target feature and the predictive features
target <- foods.training$cost

# only consider numeric features
training.features <- foods.training[,2:4]

# predict the cost of the new item using our new knn algorithm
pr <- KNN.REGRESSION(training.features, target, u, k = 4)

# predicted food price based on features
print(paste0("food price is '",pr,"'"))
```

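Let's also consider the categorical feature *type*. It first has to be converted to a numeric feature using an encoding scheme such as frequency encoding, which replaces each category with its relative frequency in the training data. The code below does not compute that encoding: it adds a placeholder column of zeros, which contributes nothing to the distances, so the prediction is unchanged.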

```{r}
# unknown case 
# (new food items with a measured sweetness, crunchiness, saltiness,
#  and a placeholder value of 0 for the encoded type)
u <- c(3, 1, 2, 0)
w <- c(10, 8, 2, 0)

# separate the target feature and the predictive features
target <- foods.training$cost

# numeric features, plus a placeholder column (all zeros) for the encoded type
training.features <- foods.training[,2:4]
training.features$type <- 0

# predict the cost of the new item using our new knn algorithm
pr <- KNN.REGRESSION(training.features, target, u, k = 4)

# predicted food price based on features
print(paste0("food price is '",pr,"'"))
```
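
For reference, one way frequency encoding could be computed is sketched here. The category vector and the `demo.*` names are hypothetical, and the snippet is an illustration of the encoding idea rather than part of the lesson's pipeline.

```{r}
# hypothetical categorical feature for six training rows
demo.type <- c("fruit", "protein", "fruit", "vegetable", "fruit", "protein")

# relative frequency of each category in the training data
demo.freq <- table(demo.type) / length(demo.type)

# replace each category with its relative frequency
demo.encoded <- as.numeric(demo.freq[demo.type])
print(round(demo.encoded, 3))

# a new item is encoded with the frequencies learned from the training data
demo.new.type <- as.numeric(demo.freq["protein"])
print(round(demo.new.type, 3))
```

Note that categories with the same frequency receive the same code, so this encoding cannot tell them apart.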

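The examples above use the raw feature values. If you standardize the training features, make sure you standardize any new data values the same way, using the means and standard deviations of the training data; otherwise the distance calculations will not be meaningful.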
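
A minimal sketch of that idea, assuming z-score standardization with base R's `scale()`: the centers and scales are computed from the training features only and then reused for the new item. The data and the `demo.*` names are hypothetical.

```{r}
# hypothetical training features
demo.raw <- data.frame(
  sweetness   = c(10, 1, 7, 3),
  crunchiness = c(9, 4, 10, 1),
  saltiness   = c(0, 8, 0, 5)
)

# z-score standardization; scale() stores the column means and
# standard deviations it used as attributes of the result
demo.scaled <- scale(demo.raw)
demo.center <- attr(demo.scaled, "scaled:center")
demo.spread <- attr(demo.scaled, "scaled:scale")

# standardize a new item with the training means and standard deviations
demo.new <- c(3, 1, 2)
demo.new.scaled <- (demo.new - demo.center) / demo.spread
print(round(demo.new.scaled, 2))
```

Distances would then be computed between `demo.new.scaled` and the rows of `demo.scaled`, so that no single feature dominates because of its measurement scale.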

------------------------------------------------------------------------

## Files & Resources

```{r zipFiles, echo=FALSE}
zipName = sprintf("LessonFiles-%s-%s.zip", 
                 params$category,
                 params$number)

textALink = paste0("All Files for Lesson ", 
               params$category,".",params$number)

# downloadFilesLink() is included from _insert2DB.R
knitr::raw_html(downloadFilesLink(".", zipName, textALink))
```

------------------------------------------------------------------------

## References

No references.

## Errata

None collected yet. Let us know.

```{r code=xfun::read_utf8(paste0(here::here(),'/R/_deployKnit.R')), include = FALSE}
```
