Section 1: The Process of Programming
A program doesn’t magically appear and run. It follows a systematic life cycle that begins with an idea and ends with execution. Let’s break this down step by step.
Programming Lifecycle
Think of a program as a recipe you want to cook. Just like you start by deciding what to cook, gathering ingredients, and then following the steps to prepare a dish, a program follows a similar life cycle:
Problem Definition: This is where you define the task you want the program to accomplish. For example, you might decide, “I want to calculate the average score of students in my class.”
Design: Here, you plan the structure of the program. This involves deciding the steps required to solve the problem. For our example, the steps might be:
- Collect the scores.
- Add all the scores together.
- Divide the total by the number of scores to find the average.
Programming: You write the program using a programming language like R. This is akin to writing down the recipe. Programming is also often referred to as “development” or as “coding”.
Execution: After coding the solution, you run the program. This is like cooking the recipe and seeing if the dish turns out as expected. Of course, execution is often very fast.
Testing and Debugging: Sometimes, the program doesn’t work perfectly the first time. You check for errors (bugs) and fix them. Imagine you forgot to add salt to your dish; you would taste it and adjust. You then return to the design and programming steps, make adjustments, and try again. Programming involves lots of experimentation and “trial and error”. It is an iterative process, not a linear one.
Maintenance: After the program works, you might need to update it or add features over time, just like modifying a recipe to improve it or adapt it to new ingredients.
Documentation: You need to document your program and how it works so it can be modified by others (which includes your future self in six months, a year, or several years).
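As a minimal sketch, the average-score example from the design step could look like this in R. The scores are made-up values for illustration, and the variable names are assumptions rather than part of any real class roster.

```r
# Step 1: collect the scores (hypothetical values for illustration)
scores <- c(82, 91, 76, 88, 95)

# Step 2: add all the scores together
total <- sum(scores)

# Step 3: divide the total by the number of scores
average <- total / length(scores)

print(average)  # prints 86.4
```

Each line of code maps to one step of the design, which is why a clear design makes the programming step much easier.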
Programming Tools
Programming tools help you move through the life cycle efficiently. Let’s explore the main tools and their roles, using everyday analogies and visualizations.
Text Editor
A text editor is where you write your code. It’s like a notebook where you draft your ideas. While you could use any basic editor like Notepad on Windows or TextEdit on macOS, specialized editors for programming provide helpful features, such as syntax highlighting (color-coding of your code) and error detection. Some of the common editors include Sublime Text, Notepad++, Visual Studio Code, and even editors that run entirely in the terminal, like Vim, Emacs, and vi.
In R, the Script Editor in RStudio is an example of a specialized text editor. When you write code in the Script Editor, you can save it, run it, and reuse it later. This editor is part of an integrated development environment (IDE) that includes tools for managing the files that are part of a project, tracking code changes, assisting with debugging, among many other useful features. Most modern programming is done within an IDE rather than just a text editor.
Execution
Execution is the process of running your code to see the results. When you write instructions in the Script Editor in RStudio, you need to have your computer execute them. There are two different ways this is done:
- compilation
- interpretation
In a “compiled” programming language like C or C++, the code that is written in the editor (the “source code”) must first be compiled. Compilation is done by a special program called a compiler. Its purpose is to translate the source code into binary machine language instructions suitable for execution by the computer’s CPU. The machine language instructions are then linked together with libraries of operating system functions into an “executable” which can then be run (“executed”). The executable is specific to a particular operating system and CPU. For example, if you write a program in C++ using a text editor, then compile it with the gcc compiler on a Mac with an Apple M2 CPU, the executable can only be run on that operating system with that type of CPU; you could not run that executable on a Windows computer with an Intel i7 CPU. Of course, you could take the C++ source code and compile it with a C++ compiler on Windows and get a Windows/Intel executable. (Java is also compiled, but to an intermediate “bytecode” that runs on the Java Virtual Machine rather than directly on the CPU.) The benefit of compiled languages is that their programs generally run much faster. Much performance-critical production code for data analytics and machine learning is written in C++ or Java and compiled for Linux, often with parts targeting specialized processors such as the GPUs made by Nvidia.
The alternative to a compiled language is an interpreted language. In an interpreted language, the code you write is “interpreted” and run line by line within an “interpreter”. Every language has its own interpreter. For R, you can download the interpreter from <r-project.org>. The benefit of an interpreted language is that it doesn’t require the compilation step and that you can experiment with the code. Development is often much faster, but the code usually executes more slowly. Often, data analytics, experimenting, and algorithm development are done in R or Python and production code is then written in Java or C++.
For example, when you write the R code:
x <- 10
y <- 5
result <- x + y
print(result)
and click “Run,” each line is executed one by one. In fact, within the RStudio IDE you can explicitly run each line one by one. This is very helpful for program development, as you can see the intermediate results and inspect the values of variables. This makes programming much faster.
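To make the idea of line-by-line interpretation more concrete, here is a toy sketch that uses R’s own parse() and eval() functions to run a small program one statement at a time and show the value after each step. It is only an illustration of the concept, not a description of how R or RStudio is implemented internally; the variable names code, statements, and workspace are assumptions chosen for this example.

```r
# The program to interpret, stored as plain text
code <- "x <- 10
y <- 5
result <- x + y"

# Split the text into individual R statements
statements <- parse(text = code)

# A separate environment to hold the variables the program creates
workspace <- new.env()

# Evaluate the statements one by one, reporting the value after each
for (statement in statements) {
  value <- eval(statement, envir = workspace)
  cat("Ran:", deparse(statement), "| value:", value, "\n")
}
```

After the loop finishes, the variable result in workspace holds 15, just as if you had run the three lines one at a time in RStudio.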
Tutorial / Installing R and Running R Scripts
The tutorial below shows how to install R (the programming language) and edit and execute simple R programs (aka scripts). In a later section, we will explain how to use an integrated development environment to make writing R programs simpler.
The R Programming Language
Let’s take a brief detour and talk a bit more about R – of course, lots more later, but it’s worthwhile to put R in perspective so the tutorials below will make more sense.
R is a programming language and software environment specifically designed for statistical computing, data analysis, and graphical representation of data. If you are venturing into the world of data analytics, R is one of the most powerful and versatile tools available, alongside Python.
R is both a language and a programming environment. As a language, it provides syntax and functions to perform complex statistical operations, create visualizations, and manipulate data. As a programming environment, it includes tools and resources, such as data handling capabilities and visualization libraries, that make data analysis more intuitive and efficient.
R was developed in the early 1990s by Ross Ihaka and Robert Gentleman at the University of Auckland. It was inspired by the S language, which was developed at Bell Laboratories. Unlike S, which was proprietary, R is open-source, meaning it’s free to use, and anyone can contribute to its development. This collaborative nature has helped R evolve rapidly, keeping it at the forefront of data analytics.
Think of R as a specialized toolkit for working with data. Just as a carpenter would use a hammer, saw, and chisel for woodwork, data analysts use R to process and interpret data. It’s designed with data in mind, so it excels at tasks like:
- Performing statistical calculations.
- Creating high-quality data visualizations.
- Handling large datasets with ease.
- Generating reproducible reports.
- Interacting with databases.
Unlike general-purpose programming languages like Python or Java, R is specifically built for statistical analysis, data analytics, data management, data visualization, machine learning, and data mining. This focus makes it particularly appealing to statisticians, data scientists, and researchers.
For example, imagine you have data on the sales performance of several products over a year. Using R, you can:
- Summarize the data to find the total sales for each product.
- Create a line chart to visualize sales trends over time.
- Perform a statistical test to determine whether sales increased significantly in the second half of the year.
All of this can be done efficiently using R’s built-in functions and packages.
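The three tasks above could be sketched roughly as follows using only functions that ship with R. The sales numbers are invented for illustration, the column names are assumptions, and the t-test is just one of several tests an analyst might reasonably choose here.

```r
# Hypothetical monthly unit sales for two products (made-up numbers)
sales <- data.frame(
  month   = rep(1:12, times = 2),
  product = rep(c("A", "B"), each = 12),
  units   = c(20, 22, 21, 25, 24, 26, 30, 31, 29, 33, 35, 34,
              15, 14, 16, 15, 17, 16, 18, 17, 19, 18, 20, 21)
)

# 1. Summarize: total units sold for each product
totals <- aggregate(units ~ product, data = sales, FUN = sum)
print(totals)

# 2. Visualize: line chart of product A's sales over the year
product_a <- sales[sales$product == "A", ]
plot(product_a$month, product_a$units, type = "l",
     xlab = "Month", ylab = "Units sold", main = "Product A")

# 3. Test: were product A's sales higher in the second half of the year?
first_half  <- product_a$units[product_a$month <= 6]
second_half <- product_a$units[product_a$month > 6]
test <- t.test(second_half, first_half, alternative = "greater")
print(test$p.value)  # a small p-value suggests a real increase
```

With real data you would first load the sales figures from a file or database instead of typing them in.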
Key Features of R
Statistical Power: R includes built-in functions for statistical operations like regression analysis, hypothesis testing, and clustering. For example, if you want to calculate the mean of a dataset, you can simply use:
data <- c(5, 10, 15, 20)
mean(data) # Output: 12.5
Data Visualization: R provides tools to create a wide variety of plots, from simple bar charts to complex multi-dimensional visualizations. Using the ggplot2 package, you can create elegant and informative graphs with minimal effort.
Extensive Package Ecosystem: R has thousands of user-contributed packages that extend its capabilities. For instance, the dplyr package simplifies data manipulation, while shiny allows you to build interactive web applications.
Reproducibility: With R, you can write scripts to automate data analysis tasks, ensuring consistency and reproducibility. This is particularly useful in research, where transparency and repeatability are essential.
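To give a feel for the statistical functions mentioned above, the following sketch runs a regression, a hypothesis test, and a clustering on mtcars, a small dataset about cars that is included with R. The choice of columns and the number of clusters are arbitrary assumptions made for this illustration.

```r
# Regression: how does fuel economy (mpg) change with weight (wt)?
fit <- lm(mpg ~ wt, data = mtcars)
print(coef(fit))

# Hypothesis test: does mpg differ between automatic (am = 0)
# and manual (am = 1) transmissions?
test <- t.test(mpg ~ am, data = mtcars)
print(test$p.value)

# Clustering: group the cars into 3 clusters by mpg and weight
set.seed(1)  # k-means starts from random centers; fix the seed to repeat results
clusters <- kmeans(mtcars[, c("mpg", "wt")], centers = 3)
print(table(clusters$cluster))
```

Because all of this is written as a script, anyone can rerun it and get the same analysis, which is the reproducibility benefit described above.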
Installing R and RStudio
To start using R, you need two main tools:
- R: The language itself. You can download it for free from CRAN, the Comprehensive R Archive Network. This is the “interpreter” that lets you run R code. It does include a simple user interface, but it’s really just the language run-time environment. For actual programming, an IDE such as RStudio is far more convenient; install R first, and then install the RStudio IDE.
- RStudio: A user-friendly interface (Integrated Development Environment or IDE) that makes working with R much easier. You can download RStudio Desktop for free from its [official website](https://posit.co/download/rstudio-desktop/). The company behind RStudio is now called Posit because its products support more than just R; they also support programming in Python and a number of other programming languages.
While R is the engine, RStudio is the dashboard that lets you interact with R more conveniently. Think of RStudio as a smartphone that makes the raw computing power of R accessible through a user-friendly interface.
The tutorial below guides you through the process of installing R and RStudio. As an alternative to installing the R programming language and the RStudio IDE, you can also create an account on the cloud service for RStudio: posit.cloud. Posit Cloud is run by Posit (the company formerly called RStudio), and there are free plans available as well as inexpensive monthly educational subscriptions.
The RStudio Interface
When you open RStudio, you’ll notice several key panels:
- Console: This is where R executes commands. If you type 2 + 2 in the Console and press Enter, R will immediately output 4.
- Script Editor: This is where you write and save longer pieces of code. For instance, if you want to analyze a dataset repeatedly, you can write a script in this panel and save it for future use.
- Environment: This panel shows all the variables, data, and objects currently in use. For example, if you create a dataset called sales_data, it will appear here.
- Plots Panel: When you create visualizations, they will appear in this panel. For example, if you plot a line graph of sales over time, you can view and export the graph here.
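To see each panel in action, you could type the following short script into the Script Editor and run it line by line. The contents of sales_data are invented for this illustration.

```r
# Console: the result of this expression is shown in the Console
2 + 2

# Environment: after this line runs, sales_data appears in the Environment panel
sales_data <- data.frame(
  month = 1:6,
  sales = c(120, 135, 128, 150, 162, 158)  # hypothetical sales figures
)

# ls() lists the objects currently in your workspace, the same
# objects that the Environment panel displays
ls()

# Plots: this line graph of sales over time appears in the Plots panel
plot(sales_data$month, sales_data$sales, type = "l",
     xlab = "Month", ylab = "Sales")
```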
The tutorial below provides an overview of RStudio and its capabilities.
The real strength of R lies in its community. Thousands of developers and analysts contribute to its ecosystem, creating packages and resources that make it easier to tackle any data analysis challenge. Whether you’re performing simple descriptive statistics or building a predictive model, R has the tools to help you succeed.
Learning R may seem challenging at first, but it’s a skill that pays off immensely in the world of data analytics. With practice, you’ll find that R is not just a programming language — it’s a partner that empowers you to turn raw data into actionable insights. You will find R is easier to learn than most other programming languages and becoming proficient at R takes comparatively little time.
Tutorial I: Programming Lifecycle
The tutorial below by Khoury Boston’s Prof. Schedlbauer demonstrates the program development lifecycle using a text editor and executing code written in R.
Integrated Development Environment (IDE)
An Integrated Development Environment (IDE) combines several tools into one interface to make programming easier. RStudio is an IDE specifically for R. It integrates a text editor, execution environment, debugging tools, and visualization panels in a single workspace.
Illustration: An IDE is like a fully equipped kitchen. Instead of having the oven, fridge, and sink scattered across different rooms, an IDE brings everything together in one space for convenience. RStudio’s Console, Script Editor, Environment panel, and Plots panel are like your oven, chopping board, spice rack, and serving area—everything you need to “cook” a program.
To summarize, programming follows a structured lifecycle: problem definition, design, coding, execution, testing, maintenance, and documentation. The lifecycle is not linear but iterative, and it involves frequent trial and error. A text editor is where you write your code. Execution happens when a compiled program is run or when an interpreter carries out each instruction. An IDE like RStudio combines all tools (editor, execution, debugging, and program output) into one convenient workspace.
“Code is like humor. When you have to explain it, it’s bad.” – Cory House
Section 2: Introduction to Programming
Programming is the art and science of instructing a computer to perform specific tasks. At its core, it involves writing a set of instructions in a language the computer can understand. These instructions, collectively called a program, are executed by the computer to solve a problem or perform an action. Let’s explore this idea in detail, step by step. This form of programming is generally called procedural programming; it is one of many programming styles. Other programming styles and ways to organize program code are object-oriented programming, functional programming, and logic programming. Machine learning is considered a newer form of programming where an algorithm derives a pattern from data to make a prediction and produce a result rather than a programmer writing the instructions to arrive at the result.
Imagine you are teaching a robot to make a cup of tea. You can’t simply say, “Make tea” and expect the robot to understand. Instead, you would need to break down the task into a series of small, precise steps, perhaps such as these:
- Pick up the kettle.
- Fill it with water.
- Place the kettle on the stove.
- Turn on the stove.
- Wait for the water to boil.
Similarly, procedural programming is about breaking down a problem into logical, clear steps and expressing these steps in a language that the computer understands. Computers, unlike humans, cannot infer what you mean—they only do exactly what you tell them to do.
In this way, programming is like writing a recipe or an instruction manual: you are describing how to achieve a specific outcome using a sequence of precisely defined and ordered actions.
Programming is essential because computers are incredibly fast, accurate, and consistent, but only when given precise instructions. For example, imagine you have a dataset containing the sales figures for thousands of stores over a year, and you want to calculate the total revenue. Doing this manually would take days or weeks. With a program, this calculation can be completed in seconds.
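As a rough sketch of the revenue example, the code below creates made-up yearly revenue figures for 5,000 stores and totals them. The figures are randomly generated purely for illustration; with real data you would load the numbers from a file instead.

```r
# Simulate yearly revenue for 5,000 stores (random, made-up values)
set.seed(123)  # makes the random numbers repeatable
n_stores <- 5000
store_revenue <- runif(n_stores, min = 50000, max = 2000000)

# Add up the revenue of all stores in a single step
total_revenue <- sum(store_revenue)

cat("Total revenue:", format(total_revenue, big.mark = ",", nsmall = 2), "\n")
```

The sum() call finishes almost instantly, whether there are five thousand stores or five million.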
For data analytics, programming is invaluable because it allows us to:
- Automate repetitive tasks, like cleaning data or generating reports.
- Analyze large datasets quickly and accurately.
- Create reproducible workflows, ensuring that results are consistent and can be verified.
Programming doesn’t just save time — it enables us to do things that would otherwise be impossible.
“I don’t need to program; I could just do it manually. But where’s the fun in not spending three hours writing a program to save ten minutes?” – Every Programmer
The R Programming Language
R is a programming language specifically designed for statistical computing and data visualization. It is widely used in data analytics because it simplifies many tasks that would be complex in other languages. For instance, R has built-in functions for calculating statistics, creating charts, and manipulating data, which makes it a perfect tool for beginners in data analytics.
Think of R as a specialized toolkit. Just as a carpenter has tools designed for woodwork, R provides tools tailored for working with data. These tools include functions for calculating averages, plotting trends, and summarizing datasets—all tasks that are central to the work of a data analyst.
Many data analytics professionals use R alongside Python as well as numerous specialized programming tools.
R originated as an open-source implementation of the S language, which was developed in the 1970s at Bell Laboratories (then part of AT&T) by John Chambers and his team. The S language was designed for statistical computing and data analysis, but it was proprietary, limiting its accessibility.
In the early 1990s, Ross Ihaka and Robert Gentleman, two statisticians from the University of Auckland in New Zealand, began developing R as a free alternative to S. Their goal was to create a programming language that retained the flexibility and statistical power of S while being open-source and freely available to the academic community.
The first version of R was released to the public in 1995, and its popularity grew quickly among statisticians and data scientists. In 1997, the R Development Core Team was established to oversee its continued development, ensuring that R remained a collaborative, community-driven project.
Today, R is a global standard for statistical computing and data visualization, widely used in academia, industry, and government. Its open-source nature and extensive package ecosystem make it an indispensable tool for data analysis, research, experimentation, and discovery.
How Does Programming Work?
The process of programming involves writing, testing, and running code. Let’s explore this process through an analogy: Imagine you are designing a set of instructions for a delivery robot. Your goal is to program the robot to deliver a package to a specific address. The steps might look something like this:
- Write the instructions: “Leave the warehouse, walk straight for 500 meters, turn left, and stop at house number 25.”
- Test the instructions: Follow them yourself to ensure they are correct.
- Run the program: Give the instructions to the robot and observe its behavior.
If the robot delivers the package successfully, your program works. If it ends up at the wrong house, you would need to review and correct the instructions.
This same process applies to programming in R. You write code (the instructions), test it by running small parts of it, and then execute the full program to see the results. If it doesn’t deliver the correct result, you find where the error is (well, more likely errors), fix them (hopefully), and run the program again. Observe again, and repeat the process until the program provides the desired result and meets the needs (requirements) of its intended users. To be fair, no program is without any defects: every program contains bugs; the question really is how many defects we are willing to live with and tolerate.
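The write, test, and run cycle might look like this in R. This is a minimal sketch: the function name average_score and the scores are assumptions made for illustration, and stopifnot() is just one simple way to check a result.

```r
# Write: the instructions, packaged as a small function
average_score <- function(scores) {
  sum(scores) / length(scores)
}

# Test: try it on tiny inputs where you already know the answer.
# stopifnot() stays silent if the check passes and stops with an error if not.
stopifnot(average_score(c(10, 20, 30)) == 20)
stopifnot(average_score(c(7)) == 7)

# Run: use it on the data you actually care about (hypothetical scores)
class_scores <- c(82, 91, 76, 88, 95)
print(average_score(class_scores))  # prints 86.4
```

If one of the checks fails, you go back, fix the function, and run the checks again.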
Breaking Down a Simple Example
Let’s look at a simple programming example in R:
# Define numbers
x <- 10
y <- 5
# Calculate the sum
result <- x + y
# Print the result
print(result)
Here’s what happens step by step:
- Define numbers: The variables x and y are like labeled containers that store the numbers 10 and 5, respectively.
- Calculate the sum: The program tells R to add the values in x and y and store the result in a new variable called result.
- Print the result: The program outputs the value stored in result, which is 15.
Note that in R, assigning a value to a variable is all it takes to create it; no separate declaration is needed, and you immediately have a container for values. Whatever value you assign to it determines the “type” of the variable, i.e., what kind of thing it currently holds: text, numbers, dates, etc. Other programming languages have different rules for variables and how to define them; these are the rules for R.
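You can ask R for the type of a variable with the built-in class() function. The short sketch below uses made-up values to show how the type follows from whatever was assigned.

```r
x <- 10
class(x)        # "numeric"

name <- "Ada"
class(name)     # "character" (text)

start <- as.Date("2024-01-15")
class(start)    # "Date"

# Assigning a different kind of value changes the variable's type
x <- "ten"
class(x)        # "character"
```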
Programming is Problem Solving
At its heart, programming is about solving problems. The computer acts as a tool to execute your solution. A simple problem might be, “How many hours are there in a week?” You could solve this manually, but in R, you would write:
# Calculate hours in a week
hours_per_day <- 24
days_per_week <- 7
hours_per_week <- hours_per_day * days_per_week
print(hours_per_week)
Here, the program calculates the answer (168) for you. This process demonstrates how programming allows you to focus on defining the solution while the computer handles the drudgery of the calculations. Of course, this is a trivial example, but imagine if the calculation involved many more numbers and was much more complicated; then you’d really appreciate having a computer do the work for you. In fact, until the mid-twentieth century, “computers” were people (applied mathematicians, to be more precise) who were really good at numerical methods for doing complex calculations.
Why Do Errors Happen?
Errors, or “defects” or simply “bugs”, are a common part of programming: frustrating, but common. They occur when the instructions are incomplete, ambiguous, or incorrect. For example, if you accidentally wrote:
hours_per_day <- 24
days_per_week <- 7
hours_per_week <- hours_per_day + days_per_week
print(hours_per_week)
The program would calculate 31 instead of 168. This mistake highlights the importance of precision in programming. Debugging, or finding and fixing errors, is an essential skill that you will develop as you practice. It is like being a sleuth: some programmers love debugging; they love the hunt and the dopamine shot one gets from finding an error. Others abhor the process of debugging and come to hate the hours of wasted time spent finding some “stupid” typo or fault in logic.
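Notice that R does not complain about this kind of mistake, because adding two numbers is perfectly valid code. One simple way to catch such logic errors, sketched below, is to compare the result against a value you worked out by hand; the variable expected is an assumption introduced for this illustration.

```r
hours_per_day <- 24
days_per_week <- 7

# The bug: + was typed instead of *. R runs this without any error message.
hours_per_week <- hours_per_day + days_per_week

# A value worked out by hand to check against
expected <- 168

if (hours_per_week != expected) {
  message("Check failed: got ", hours_per_week, ", expected ", expected)
}

# The fix: multiply instead of add, then check again
hours_per_week <- hours_per_day * days_per_week
stopifnot(hours_per_week == expected)  # passes silently
```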
It’s a Journey
In fact, programming is not just about writing code — it’s about thinking logically and solving problems systematically and methodically. As you begin your journey with programming and with R, remember that every program you write is an opportunity to learn. Mistakes are part of the process, and with practice, programming will become a powerful tool in your data analytics toolkit. Embrace the challenge, and enjoy the satisfaction that comes from making the computer do exactly what you want. You’ll also come to appreciate the beauty of well-written code and you will marvel at code written by “masters” of their craft. It is also neat to see code in different programming languages: find some and just try to decipher it.
“Learning to program is like learning to play an instrument: at first, it sounds terrible, but with practice, you’ll create something amazing.”
