Read all instructions before starting.
In this assignment you will have an opportunity to practice programming in R and to implement a version of the kNN (k-Nearest Neighbors) machine learning algorithm to predict a categorical target variable. In addition, you will have the chance to learn more about training versus validation data and the evaluation of classification algorithms.
Completing this assignment will provide you with opportunities to practice:
Working individually is recommended, but working in pairs may be helpful.
Prior to working on this assignment, it is suggested that you review these lessons and refer to them during the assignment:
Create a new project in R Studio and then, within that project, create a new R Notebook. Set the title parameter of the notebook to “Practice / Implement kNN”; set the author parameter of the notebook to your name; set the date parameter to today’s date.
Follow the instructions below and build an R code chunk for each of the questions below. If you don't know how to proceed or don't understand the instructions, be sure to follow the prerequisite tutorials.
Some packages will need to be installed and loaded; the instructions will have details. Be sure to install each package before loading it. On occasion, installing new packages may require additional packages or updates to already installed packages.
Use a level 3 header (using ###) for each part of the exercise, e.g., ### Load Data. Label your code chunks.
Download the data set and save the file in your project folder or load from the URL.
Load the data into R (the choice of variable name and structure is yours). Inspect the data as you see fit.
Are there missing data values? How would you handle them?
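A minimal sketch of two common ways to inspect and handle missing values in R; the dataframe and column names here are invented for illustration, not taken from the assignment's data set:

```r
# Toy dataframe with a few NAs (hypothetical, for illustration only)
df <- data.frame(a = c(1, NA, 3), b = c(4, 5, NA))

# Count missing values per column
colSums(is.na(df))

# Option 1: drop rows containing any NA
df.complete <- na.omit(df)

# Option 2: impute each NA with the column mean
df.imputed <- df
for (col in names(df.imputed)) {
  m <- mean(df.imputed[[col]], na.rm = TRUE)
  df.imputed[[col]][is.na(df.imputed[[col]])] <- m
}
```

Which option is appropriate depends on how much data is missing and whether the missingness is random.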
Normalize the numeric features using z-score standardization and create a new dataframe with the normalized features.
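One way to apply z-score standardization, i.e., transforming each value to \((x - \mu)/\sigma\), is base R's `scale()`; the toy data here is made up for the sketch:

```r
# Toy numeric dataframe (hypothetical values)
df <- data.frame(x = c(10, 20, 30), y = c(1, 2, 3))

# scale() standardizes column-wise: (value - column mean) / column sd
df.norm <- as.data.frame(scale(df))

# Each column now has mean 0 and standard deviation 1
colMeans(df.norm)
apply(df.norm, 2, sd)
```

After this transformation, all features contribute on a comparable scale to the distance calculations kNN relies on.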
Split the data into two random samples: 80% for training and 20% for validation. Save the subsets in separate dataframes (perhaps calling them df.train and df.val).
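A minimal sketch of an 80/20 random split using `sample()`; the dataframe here is synthetic stand-in data:

```r
set.seed(42)  # for a reproducible split

# Stand-in data (hypothetical)
n <- 100
df <- data.frame(x = rnorm(n), y = rnorm(n))

# Draw 80% of the row indices at random for training
train.idx <- sample(seq_len(nrow(df)), size = floor(0.8 * nrow(df)))
df.train <- df[train.idx, ]
df.val   <- df[-train.idx, ]   # the remaining 20%
```

Setting a seed is optional but makes the split, and therefore your accuracy numbers, reproducible.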
Implement your own version of kNN to predict the categorical variable "diagnosis_result". Use Euclidean distance as the distance measure. Name the function xkNN() and have it take four arguments: a vector of target values from the training data, a dataframe of training-data features, a value for k, and a vector of feature values for which to make a prediction. The function should return the predicted value of the categorical target variable.
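A sketch of one way xkNN() could be structured, assuming a majority vote among the k nearest neighbors breaks ties by the first level returned by table(); the toy data at the bottom is invented for illustration:

```r
# Sketch of xkNN(): majority vote among the k nearest training rows,
# using Euclidean distance. Argument order follows the assignment's spec.
xkNN <- function(train.target, train.features, k, query) {
  # Subtract the query from every training row, column by column
  diffs <- sweep(as.matrix(train.features), 2, as.numeric(query))
  # Euclidean (L2) distance from the query to each training row
  dists <- sqrt(rowSums(diffs^2))
  # Indices of the k smallest distances
  nn <- order(dists)[1:k]
  # Majority vote among the neighbors' target values
  votes <- table(train.target[nn])
  names(votes)[which.max(votes)]
}

# Toy usage with made-up data: two well-separated clusters
train.features <- data.frame(x = c(0, 0, 1, 5, 5, 6),
                             y = c(0, 1, 0, 5, 6, 5))
train.target <- c("M", "M", "M", "B", "B", "B")
xkNN(train.target, train.features, k = 3, query = c(0.5, 0.5))  # "M"
```

A production version would also validate its inputs (e.g., that k does not exceed the number of training rows), but this shows the core computation.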
Use your function to calculate the accuracy (percentage of correct classifications) by predicting the validation data target using your version of kNN for different values of k (from 3 to \(\sqrt{d}\), where d is the number of dimensions in your data, i.e., the number of predictor features). What is the "optimal" k?
Modify kNN so that it uses a different distance formula: Manhattan distance.
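Only the distance calculation needs to change: Manhattan (L1) distance sums absolute differences instead of taking the square root of summed squares. A sketch of the modified function, with the same invented toy data as a usage check:

```r
# Same structure as the Euclidean version; only the distance line differs.
xkNN.manhattan <- function(train.target, train.features, k, query) {
  diffs <- sweep(as.matrix(train.features), 2, as.numeric(query))
  dists <- rowSums(abs(diffs))   # Manhattan (L1) distance
  nn <- order(dists)[1:k]
  votes <- table(train.target[nn])
  names(votes)[which.max(votes)]
}

# Toy usage with made-up data
train.features <- data.frame(x = c(0, 0, 1, 5, 5, 6),
                             y = c(0, 1, 0, 5, 6, 5))
train.target <- c("M", "M", "M", "B", "B", "B")
xkNN.manhattan(train.target, train.features, k = 3, query = c(0.5, 0.5))  # "M"
```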
Recalculate the accuracy for different values of k as you’ve done previously. Comment on the differences. Is the value of k the same? Did you get better or worse accuracy?
Is kNN sensitive to outliers? Should you have removed outliers before normalizing?