Read all instructions before starting.
In this assignment you will have an opportunity to practice reproducible data analytics programming in an R Notebook.
Completing this assignment will give you an opportunity to
Working individually is recommended, but working in pairs may be helpful.
Prior to working on this assignment, it is suggested that you review these lessons and refer to them during the assignment:
Create a new project in RStudio and then, within that project, create a new R Notebook. Set the title parameter of the notebook to “Practice / First Steps in Data Analytics”, the author parameter to your name, and the date parameter to today’s date.
Follow the instructions below and build an R code chunk for each of the questions. If you don’t know how to proceed or don’t understand the instructions, work through the prerequisite tutorials first.
You should not use any additional packages (such as psych); you should learn to do the tasks using only ‘Base R’.
Use a level 2 header (using ##) for each new question and use the question number as your title, e.g., ## Question 3.
Label each code chunk with the question number and the objective, e.g.,
```{r Q1_LoadCSV}
... your code goes here ...
```
An organization has collected data on customer visits, transactions, operating system, and gender and desires to build a model to predict revenue. For the moment, the goal is to prepare the data for modeling. Analyze the data set in the following manner and write a report. The result should be a printable report with embedded code and not a program.
Load the data set from the link above into a dataframe. Inspect the data set using the str() function. Each row represents a customer’s interactions with the organization’s web store. The first column is the number of visits of a customer, the second the number of transactions of that customer, the third column is the customer’s operating system, and the fourth column is the customer’s reported gender, while the last column is revenue, i.e., the total amount spent by that customer.
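A minimal sketch of the load-and-inspect step follows. The real file lives at the link referenced above, which is not reproduced here, so a few lines of inline CSV text stand in for it. The column names and values are assumptions made only for illustration.

```r
# Hypothetical stand-in for the assignment's CSV file; the column names
# and values below are assumptions, not the real data.
csv_text <- "Visits,Transactions,OS,Gender,Revenue
7,0,Android,Male,0
20,1,iOS,NA,576.87
22,1,iOS,Female,850
24,NA,iOS,Female,1050
1,0,Android,Male,0"

# read.csv() normally takes a file path or URL; text = is used here so
# the sketch runs without the real file.
df <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# str() reports the number of rows and columns and the type of each column.
str(df)
```

With the real data, the `text =` argument would be replaced by the path or URL of the CSV file.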
In an R code chunk that is not displayed (i.e., set echo = F), calculate the following summary statistics: total transaction amount (revenue), mean number of visits, median revenue, standard deviation of revenue, and most common gender. Exclude any cases where there is a missing value.
Add a header to the R Notebook (## Data Analysis) and use the data calculated above to produce the narrative below, inserting the computed values with embedded (inline) R code. Do not hard code the values, as they might change if the data changes.
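The statistics listed above could be computed along these lines. This is a sketch on a toy data frame whose column names and values are assumptions. Missing values are excluded per statistic with `na.rm = TRUE`; dropping incomplete rows first with `na.omit()` is another reasonable reading of the instruction.

```r
# Hypothetical toy data; column names are assumptions, not the real file's.
df <- data.frame(
  Visits       = c(7, 20, 22, 24, 1, 13),
  Transactions = c(0, 1, 1, NA, 0, 1),
  Gender       = c("Male", NA, "Female", "Female", "Male", "Male"),
  Revenue      = c(0, 576.87, 850, 1050, 0, 460)
)

total_revenue  <- sum(df$Revenue, na.rm = TRUE)
mean_visits    <- mean(df$Visits, na.rm = TRUE)
median_revenue <- median(df$Revenue, na.rm = TRUE)
sd_revenue     <- sd(df$Revenue, na.rm = TRUE)

# Base R has no function for the statistical mode. table() drops NA by
# default, and which.max() picks the most frequent remaining value.
gender_counts <- table(df$Gender)
mode_gender   <- names(gender_counts)[which.max(gender_counts)]
```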
There were 376 customers and the mean number of visits per customer was 3.4. The median revenue was US$ 35.8 (σ = 11.3). Most of the visitors were male.
The numbers in the above text are not necessarily correct – they are added for illustration only. Also, make sure that the gender for the most visits is calculated and not “hard coded”.
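One way to check the narrative before embedding it is to assemble the sentence from variables in a chunk, as sketched below. The variable names and values are hypothetical stand-ins for statistics computed earlier. In the notebook prose itself, the same expressions would go in inline code of the form `r round(mean_visits, 1)`.

```r
# Hypothetical values standing in for statistics computed in an earlier chunk.
n_customers    <- 376L
mean_visits    <- 3.4213
median_revenue <- 35.81
sd_revenue     <- 11.27
mode_gender    <- "Male"

# Every number and the gender come from variables, so the sentence
# updates automatically when the data changes.
narrative <- sprintf(
  paste(
    "There were %d customers and the mean number of visits per customer was %.1f.",
    "The median revenue was US$ %.1f (SD = %.1f).",
    "Most of the visitors were %s."
  ),
  n_customers, mean_visits, median_revenue, sd_revenue, tolower(mode_gender)
)
cat(narrative, "\n")
```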
Create a bar (aka column) chart of gender (x-axis) versus revenue (y-axis). Omit missing values, i.e., where gender is NA or missing. Use the plot() function rather than functions from ggplot2. Show only the chart and not the code that generated it. Add markdown text to comment on what the chart means.
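A possible sketch of such a chart using only plot() is shown below. It assumes that "revenue" means total revenue per gender, and the column names and values are hypothetical. In the notebook, the chunk would carry `echo = FALSE` so that only the chart appears.

```r
# Hypothetical toy data; column names and values are assumptions.
df <- data.frame(
  Gender  = c("Male", NA, "Female", "Female", "Male", "Male"),
  Revenue = c(0, 576.87, 850, 1050, 0, 460)
)

# tapply() ignores rows whose grouping value is NA, so customers with a
# missing gender are omitted from the totals.
rev_by_gender <- tapply(df$Revenue, df$Gender, sum, na.rm = TRUE)

# type = "h" draws vertical lines up from the axis; a large lwd with
# butt line ends (lend = 1) makes them look like bars.
x_pos <- seq_along(rev_by_gender)
plot(x_pos, as.vector(rev_by_gender),
     type = "h", lwd = 40, lend = 1,
     xaxt = "n", xlim = c(0.5, length(x_pos) + 0.5),
     ylim = c(0, max(rev_by_gender) * 1.1),
     xlab = "Gender", ylab = "Total revenue",
     main = "Total revenue by gender")
axis(1, at = x_pos, labels = names(rev_by_gender))
```

Mean revenue per gender would be an equally defensible choice; swapping `sum` for `mean` in the `tapply()` call is the only change needed.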
What is the Pearson product-moment correlation between number of visits and revenue? Comment on the correlation in your markdown, i.e., explain what it means and discuss its overall strength and relevance.
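The correlation can be obtained with the base R cor() function, as in this sketch on hypothetical data. The column names are assumptions.

```r
# Hypothetical toy data; column names and values are assumptions.
df <- data.frame(
  Visits  = c(7, 20, 22, 24, 1, 13),
  Revenue = c(0, 576.87, 850, 1050, 0, 460)
)

# method = "pearson" is the default and is spelled out for clarity;
# use = "complete.obs" drops any pair in which either value is NA.
r_visits_revenue <- cor(df$Visits, df$Revenue,
                        method = "pearson", use = "complete.obs")
print(round(r_visits_revenue, 2))
```

The coefficient lies between -1 and 1. Values near either end indicate a strong linear relationship, and values near 0 indicate a weak one.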
Which columns have missing data? How did you recognize them? How would you impute missing values? In your markdown, add comments on missing data and imputation strategies.
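One common way to locate missing data is to count the NA values per column, as sketched below on hypothetical data. This only detects values that R already treats as NA. Blank strings or sentinel codes in the real file, if any, would need to be checked separately.

```r
# Hypothetical toy data; column names and values are assumptions.
df <- data.frame(
  Visits       = c(7, 20, 22, 24, 1, 13),
  Transactions = c(0, 1, 1, NA, 0, 1),
  Gender       = c("Male", NA, "Female", "Female", "Male", "Male"),
  Revenue      = c(0, 576.87, 850, 1050, 0, 460)
)

# is.na() returns a logical matrix; colSums() counts the TRUEs per column.
na_counts    <- colSums(is.na(df))
cols_with_na <- names(na_counts)[na_counts > 0]

print(na_counts)
print(cols_with_na)
```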
Impute missing transaction and gender values. Use the mean for transaction (rounded to the nearest whole number) and the mode for gender. Recalculate the descriptive statistics of (2) and repeat the markdown of (3) with the new (computed) values. Comment in the markdown on the difference.
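The imputation described above might look like the following sketch. It uses a toy data frame with assumed column names and imputes in place.

```r
# Hypothetical toy data; column names and values are assumptions.
df <- data.frame(
  Transactions = c(0, 1, 1, NA, 0, 1),
  Gender       = c("Male", NA, "Female", "Female", "Male", "Male"),
  stringsAsFactors = FALSE
)

# Mean imputation for transactions, rounded to the nearest whole number.
mean_trans <- round(mean(df$Transactions, na.rm = TRUE))
df$Transactions[is.na(df$Transactions)] <- mean_trans

# Mode imputation for gender; table() ignores NA when counting.
gender_counts <- table(df$Gender)
mode_gender   <- names(gender_counts)[which.max(gender_counts)]
df$Gender[is.na(df$Gender)] <- mode_gender
```

Keeping an untouched copy of the original data frame makes it easier to compare the statistics before and after imputation.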
Split the data set into two equally sized data sets where one can be used for training a model and the other for validation. Take every odd numbered case and add them to the training data set and every even numbered case and add them to the validation data set, i.e., rows 1, 3, 5, 7, etc. are training data while rows 2, 4, 6, etc. are validation data. Put this work into a new section with a new header and comments on the need for this.
Calculate the mean revenue for the training and the validation data sets and compare them. Comment on the difference.
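The odd/even split and the mean comparison can be sketched as follows, again on hypothetical data with an assumed column name.

```r
# Hypothetical toy data; the column name and values are assumptions.
df <- data.frame(Revenue = c(0, 576.87, 850, 1050, 0, 460))

# Odd-numbered rows (1, 3, 5, ...) go to training,
# even-numbered rows (2, 4, 6, ...) go to validation.
train_rows <- seq(1, nrow(df), by = 2)
valid_rows <- seq(2, nrow(df), by = 2)

# drop = FALSE keeps the result a data frame even with a single column.
train      <- df[train_rows, , drop = FALSE]
validation <- df[valid_rows, , drop = FALSE]

mean_train <- mean(train$Revenue)
mean_valid <- mean(validation$Revenue)
print(c(training = mean_train, validation = mean_valid))
```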
There are R packages for many data mining and machine learning tasks, but here you should use the base R sample() function to split the data set so that 60% is used for training, 20% for testing, and another 20% for validation. To ensure that your code is reproducible and that everyone gets the same result, use the number 77654 as your seed for the random number generator.
Use the code fragment below for reference (it does not match the requirements above; it is for illustration only):
# Set the seed so that the same sample can be reproduced in the future
set.seed(101)
# Draw 75% of the row indices from the total 'n' rows of the data
sample <- sample.int(n = nrow(data), size = floor(.75*nrow(data)), replace = F)
# Rows in the sample form the training set; the remaining rows form the test set
train <- data[sample, ]
test <- data[-sample, ]
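For comparison, one possible way to get a 60/20/20 split with sample() is to shuffle the row indices once and cut the shuffled vector into three consecutive pieces. This is a sketch on a hypothetical ten-row data frame, not the reference solution. The `id` column exists only to make the partition easy to check.

```r
# Hypothetical toy data with an id column so the partition can be verified.
df <- data.frame(id = 1:10,
                 Revenue = c(0, 576.87, 850, 1050, 0, 460, 360, 920, 100, 250))

set.seed(77654)                 # seed required by the assignment
n   <- nrow(df)
idx <- sample(n)                # random permutation of the row indices

n_train <- floor(0.6 * n)
n_test  <- floor(0.2 * n)

# Consecutive slices of the shuffled indices cannot overlap.
train      <- df[idx[seq_len(n_train)], ]
test       <- df[idx[n_train + seq_len(n_test)], ]
validation <- df[idx[(n_train + n_test + 1):n], ]
```

Because floor() is used, any rows left over by rounding end up in the validation set.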
Be sure that this is a “report”. Knit the file to HTML and also to PDF. For PDF knitting you may need to install additional packages (particularly for TeX). Alternatively, you could upload your R Notebook to posit.cloud and knit to PDF there and then export the knitted PDF.
None yet.
The solution can be accessed from the link below. As it contains embedded code, it should be knitted.
Solution A-55-103: Rmd Notebook | Knitted PDF
A narrated code walk of the solution is below: