Upon completion of this lesson, you will be able to:
describe the difference between R Scripts and R Notebooks
choose the correct R code mechanism
list the benefits of packages
write R Scripts and execute them
create R Notebooks and knit them to documents
list the most common packages
The Universe of R
R is a language and environment designed specifically for statistical computing, data analysis, data-oriented computing, machine learning, data mining, and visualization. Developed in the early 1990s by Ross Ihaka and Robert Gentleman at the University of Auckland as an open-source alternative to the S statistical programming language, R has grown to become one of the most widely used languages for data analysis [1]. It is open source and available under the GNU General Public License, which means it is freely available for anyone to use and modify.
R is both a programming language and a software environment. This dual nature means that while you can write scripts and programs in R, you can also interact with it directly through a command-line interface to perform calculations and visualize data on the fly. In fact, there are four distinct ways to leverage R:
write R scripts (programs) that can run by themselves
use R interactively as a “calculator”
create dynamic documents that intersperse markdown with R code chunks
build web apps that include interactive elements and visualizations
The R programming language is similar to Python. It is interpreted (and not compiled ahead of time like C or C++), dynamically typed (i.e., variables do not have to be declared first, nor do you have to define a variable’s data type), and very flexible. It is ideal for exploratory data work and experimentation, along with report and script writing. R, like Python and other interpreted languages, can be slow for certain types of computational tasks, so when code must run fast, it is often rewritten in C++ and then called from R.
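As a minimal sketch of what the absence of declarations looks like in practice, the snippet below assigns values of different types to the same variable and then uses R as a calculator. The variable names are illustrative only.

```r
# No declaration is needed: a variable takes its type from the assigned value.
x <- 42
class(x)        # "numeric"

# The same name can later hold a value of a different type.
x <- "forty-two"
class(x)        # "character"

# Used interactively, R evaluates expressions like a calculator.
radius <- 2
area <- pi * radius^2
area
```

Typing these lines one at a time into the R Console shows each result immediately, which is what makes the language convenient for exploration.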
R vs Python
R and Python are both powerful languages used extensively in data analysis, data-oriented computing, statistical analysis, data mining, and machine learning. While they share some similarities, they also have distinct differences that can influence the choice between them based on specific needs and preferences.
Python generally has a broader user base and is considered more versatile because it was designed as a general-purpose language. According to various industry surveys and studies, Python tends to be more commonly used than R in areas like web development and software engineering, while R is more commonly used in statistical analysis, data analytics reporting, and data mining. Both languages have an extensive set of add-on packages that extend the core functionality of the language, including packages for working with databases.
There are some key differences between these two popular languages:
Language Design and Syntax: Specifically designed for statistical computing and data analysis, R’s syntax can be more intuitive for statisticians and data analysts. It offers a vast number of packages tailored for statistical methods. A general-purpose programming language with a simple and readable syntax, Python is easy to learn and versatile, making it suitable for a wide range of applications beyond data science.
Libraries and Packages: R is known for its comprehensive statistical and data analysis libraries, such as ggplot2 for visualization, dplyr for data manipulation, and caret for machine learning. CRAN hosts over 18,000 packages, creating a vast ecosystem. Likewise, Python boasts comprehensive libraries for data science work, including pandas for data manipulation, matplotlib and seaborn for visualization, and scikit-learn for machine learning. Like R, it also integrates well with deep learning frameworks such as TensorFlow and PyTorch.
Data Visualization: R is renowned for its advanced data visualization capabilities, particularly with ggplot2, which allows for creating intricate and customizable plots that are reproducible and can be made interactive through shiny web apps. Similarly, Python offers robust visualization libraries such as matplotlib, seaborn, and plotly, which are highly capable and provide a wide range of plotting options.
Community and Support: R has a strong community focused on statistics and data analysis. CRAN and Bioconductor provide extensive resources and packages. Python benefits from a larger and more diverse community, whose support extends beyond data science into software development, automation, and more. Resources are abundant for both languages.
Integration and Deployment: R is primarily used for data analysis, reporting, prototyping, exploration, and research. Integration with production environments can be more challenging compared to Python. Python is excellent for integrating data science workflows into production systems. Its versatility makes it suitable for developing web applications, APIs, and automation scripts.
Python’s popularity has been rising steadily and it is currently more popular than R in some areas, as evidenced by surveys from organizations like Stack Overflow and the TIOBE Index. However, in practice both languages are essential for data work, and a practicing data or computer scientist should know both, in addition to other essential languages such as Java, C++, JavaScript, and perhaps Swift, Rust, Go, and Racket, among others. In fact, one can never know enough languages, and knowing a variety of languages that support differing programming paradigms makes learning emerging languages easier. Naturally, the set of languages popular at any given time changes. In the early 1990s, C++ was the most popular language; it was followed by Visual Basic, later by Java, and eventually by JavaScript in the early 2000s.
Downloading R
The base language environment of R can be freely obtained for Windows, macOS, and various versions of Linux from r-project.org.
The most common programming environment for R is the RStudio IDE from Posit (the company formerly named RStudio; this lesson also refers to the IDE itself as Posit), which can be obtained for free from posit.co. R must be installed prior to installing the IDE.
As an alternative to installing R and Posit locally, a cloud-hosted version is provided on posit.cloud. A free plan is available, along with various subscription plans, including one for education. posit.cloud supports both Posit and Jupyter Notebook for integrated R and Python.
Tutorial: Installing R
The tutorial below by Prof. Schedlbauer of Khoury Online demonstrates how to install R (the base language), use the R Console for interactive R, write scripts, and execute scripts from the command line.
Tutorial: Installing Posit
The tutorial below shows the process for installing and navigating Posit (formerly RStudio), the most commonly used IDE for R. Note that Posit also supports Python.
Tutorial: posit.cloud
Packages: R’s Secret Sauce
In R, a package is a collection of R functions, data, and compiled code bundled together for easy sharing and reuse. Packages extend the functionality of base R by providing additional tools, methods, and datasets that users can leverage to perform specific tasks more efficiently. Each package serves a particular purpose, such as data manipulation, statistical analysis, machine learning, visualization, or even integrating with databases and web services. Other languages might call packages libraries, but the key is that they provide additional functionality. There are literally thousands of packages freely available for R. They are the “secret sauce” that makes R such a vibrant environment for data work. As an aside, many packages implement their performance-critical parts in C++ for speed, and some have also been ported to Python.
Packages are important to R for several reasons:
Extended Functionality: Packages provide specialized functions that are not available in base R. For example, the ggplot2 package offers advanced data visualization capabilities, while dplyr simplifies data manipulation tasks.
Reusability: By using packages, you can avoid reinventing the wheel. Instead of writing your own functions from scratch, you can utilize well-tested and optimized functions provided by packages.
Community Contribution: The R community is highly active in developing and maintaining packages. This collaborative effort ensures a continuous influx of new tools and improvements to existing packages.
Efficiency: Many packages are optimized for performance, allowing you to perform complex tasks more quickly and efficiently. For example, the data.table package is known for its speed in handling large datasets.
Consistency and Best Practices: Using packages developed by experienced programmers helps your code follow best practices and stay consistent with standard methods used in the industry.
Availability of Packages
R has a vast ecosystem of packages available through repositories (kind of like “app stores”) such as CRAN (Comprehensive R Archive Network), Bioconductor, and GitHub. CRAN is the primary repository and hosts thousands of packages, each thoroughly tested and documented.
As of 2024, CRAN hosts over 18,000 packages, catering to a wide range of applications. Bioconductor, another significant repository, focuses on bioinformatics and computational biology, offering over 2,000 packages. Additionally, many developers share their packages on GitHub, providing access to cutting-edge tools and the latest advancements in R programming.
The bottom line is that packages are an integral part of the R ecosystem, significantly enhancing its capabilities and making it a powerful tool for data analysis, statistical computing, and beyond. With thousands of packages available across various repositories, R users have access to an extensive library of functions and tools that can address virtually any computational need.
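As a rough sketch of how packages are used day to day, the snippet below installs a package only if it is missing and then attaches it. The helper name use_package is hypothetical and not part of base R; install.packages(), requireNamespace(), and library() are the standard base R functions for this workflow.

```r
# Hypothetical helper (not part of base R): attach a package,
# installing it from the configured repository first if it is missing.
use_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  library(pkg, character.only = TRUE)
}

# 'tools' ships with every R installation, so nothing is downloaded here.
use_package("tools")

# Functions from an attached package can be called directly ...
file_ext("analysis_report.Rmd")

# ... or with an explicit namespace prefix, without attaching the package.
tools::file_path_sans_ext("analysis_report.Rmd")
```

The same pattern applies to CRAN packages such as dplyr or ggplot2; in that case the first call would download and install the package.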
Interpreted vs Compiled Languages
Understanding the difference between interpreted and compiled languages is essential in the context of programming languages like R, Python, Java, and C++. These differences impact how the languages are executed, their performance, and their typical use cases.
Interpreted Languages
R is an interpreted language, meaning its code is executed line-by-line by an interpreter at runtime. This approach offers several advantages and disadvantages. Code can be written and executed interactively, making these languages ideal for rapid prototyping, exploratory data analysis, and scripting. Errors can be detected and fixed on the fly, which is useful for debugging.
Other examples of interpreted languages include JavaScript and PHP.
Interpreted languages are generally more portable across different platforms since the interpreter handles the machine-specific details, but they are typically slower than compiled languages because the interpreter must translate the high-level instructions into machine code every time the code runs. This performance penalty can be mitigated through optimization techniques and the use of libraries written in compiled languages such as C++ (a method that R takes advantage of).
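The effect of delegating work to compiled code can be sketched with base R alone. In the example below, loop_sum is a hypothetical helper that adds numbers one at a time in interpreted R code, while the built-in sum() does the same work in compiled code. Timings vary by machine, so none are shown.

```r
# A vector of one hundred thousand numbers.
x <- as.numeric(1:100000)

# Hypothetical helper: sum the values with an explicit interpreted loop.
loop_sum <- function(v) {
  total <- 0
  for (value in v) {
    total <- total + value
  }
  total
}

# Both approaches produce the same result ...
stopifnot(loop_sum(x) == sum(x))

# ... but the built-in, compiled sum() is typically much faster.
system.time(loop_sum(x))
system.time(sum(x))
```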
Compiled Languages
C and C++ are examples of compiled languages, meaning their code is translated into machine code by a compiler before execution. This process creates an executable file that can be run independently of the original source code but is specific to a particular CPU and operating system (e.g., code compiled for an Apple M2 cannot run on a Linux computer with an Intel i7 CPU).
Compiled code generally runs much faster than interpreted code because it is translated directly into machine language, which the processor can execute directly. Optimization performed by the compiler can lead to significant performance improvements. In addition, some errors can be caught at compile-time, which can result in more robust and reliable code. Once compiled, the executable runs independently of the source code, which can be advantageous for distributing software as the source code isn’t revealed.
Hybrid Languages
Python and Java are considered hybrids between interpreted and compiled languages. Their source code is compiled into bytecode by a compiler. This bytecode is then interpreted or just-in-time (JIT) compiled by a virtual machine (e.g., the Java Virtual Machine, or JVM) at runtime. This approach combines some of the advantages of both interpreted and compiled languages, as the compiled bytecode is portable across operating systems and CPU designs.
The video below illustrates the common differences between these three types of execution paradigms for programming languages:
Common Use Cases for R
R is a powerful tool widely used across various fields due to its extensive capabilities in statistical computing, data analysis, and graphical representation. Below are some common use cases where R excels:
Data Analysis and Statistical Computing
R is specifically designed for data analysis and statistical computing, making it a preferred choice for statisticians and data scientists. It offers a vast array of built-in functions and packages for conducting statistical tests, building predictive models, and performing advanced data analysis. Common statistical tasks include hypothesis testing, regression analysis, ANOVA, time series analysis, and clustering.
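Two of the tasks listed above, regression analysis and hypothesis testing, can be sketched with functions that ship with base R. The example uses the built-in mtcars dataset; the choice of variables is purely illustrative.

```r
# Simple linear regression: fuel economy (mpg) as a function of weight (wt).
fit <- lm(mpg ~ wt, data = mtcars)
coef(fit)                 # intercept and slope
summary(fit)$r.squared    # proportion of variance explained

# Hypothesis test: do automatic (am = 0) and manual (am = 1) cars
# differ in mean fuel economy? (Welch two-sample t-test)
t_result <- t.test(mpg ~ am, data = mtcars)
t_result$p.value
```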
Database Access with R
R provides robust tools for accessing and manipulating databases, which is essential for handling large datasets and integrating data from various sources. The DBI package, along with database-specific packages like RMySQL, RPostgres, and RODBC, allows users to connect to relational databases such as MySQL, PostgreSQL, and SQL Server. For example, using the DBI and RSQLite packages, you can connect to an SQLite database, execute SQL queries, and retrieve results directly into R for analysis and visualization.
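A minimal sketch of that workflow, assuming the DBI and RSQLite packages are installed, might look as follows. It uses an in-memory SQLite database and the built-in mtcars dataset so that no database file is required; the table name and query are illustrative.

```r
library(DBI)

# Connect to a temporary in-memory SQLite database.
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Copy a built-in data frame into a database table.
dbWriteTable(con, "mtcars", mtcars)

# Run a SQL query and retrieve the result as a data frame.
res <- dbGetQuery(
  con,
  "SELECT cyl, AVG(mpg) AS avg_mpg FROM mtcars GROUP BY cyl ORDER BY cyl"
)
print(res)

# Always close the connection when done.
dbDisconnect(con)
```

Connecting to MySQL or PostgreSQL follows the same pattern, with a different driver and connection arguments passed to dbConnect().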
Data Visualization
One of R’s strengths is its ability to create high-quality graphics and visualizations. Packages like ggplot2 allow users to produce complex and aesthetically pleasing plots with ease. Whether it’s basic charts like bar plots and histograms or advanced visualizations like heatmaps and interactive graphs, R provides powerful tools to visualize data effectively.
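As a brief illustration, and assuming the ggplot2 package is installed, the sketch below builds a scatter plot from the built-in mtcars dataset. The variables and labels are chosen only for demonstration.

```r
library(ggplot2)

# Build the plot layer by layer: data and aesthetics, then points, then labels.
p <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point() +
  labs(
    x = "Weight (1000 lbs)",
    y = "Miles per gallon",
    colour = "Cylinders"
  )

# Printing the object (or typing p at the console) renders the plot.
# print(p)
```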
Machine Learning and Predictive Modeling
R is widely used in the field of machine learning for building and evaluating predictive models. The caret package, among others, provides a unified interface for training and tuning a variety of machine learning algorithms. R supports both supervised and unsupervised learning techniques, including decision trees, random forests, support vector machines, and neural networks.
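To keep the example free of additional packages, the sketch below does not use caret. Instead, it illustrates unsupervised learning with kmeans() from the stats package that ships with R, applied to the built-in iris dataset. The number of clusters and the seed are arbitrary choices for illustration.

```r
# Make the random cluster initialization repeatable.
set.seed(42)

# Cluster the four numeric measurements into three groups.
km <- kmeans(iris[, 1:4], centers = 3, nstart = 10)

# Compare the discovered clusters with the known species labels.
table(cluster = km$cluster, species = iris$Species)
```

With caret, a supervised model would instead be fitted through its train() function, which wraps many algorithms behind one interface.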
Data Mining with R
R is extensively used for data mining, which involves extracting useful information and patterns from large datasets. The rattle package provides a graphical user interface for data mining, making it accessible for users with varying levels of programming expertise. Additionally, packages like arules and arulesViz support association rule mining and visualization, while tm and text2vec facilitate text mining and natural language processing. R’s comprehensive suite of tools for classification, clustering, and regression, combined with its powerful visualization capabilities, makes it ideal for uncovering insights and knowledge from complex data.
Bioinformatics and Genomics
R is a popular choice in the bioinformatics community for analyzing and visualizing biological data. The Bioconductor project offers packages for the analysis of genomic data, including DNA sequencing, gene expression, and SNP analysis. R’s ability to handle large datasets and perform complex statistical analysis makes it suitable for biological research.
Financial Analysis
R is used extensively in the finance industry for tasks such as risk analysis, portfolio management, and time series forecasting. Packages like quantmod and TTR provide tools for financial modeling and technical analysis. R’s robust statistical capabilities are essential for developing and testing financial models.
Social Science Research
Researchers in social sciences use R for data analysis, survey analysis, and statistical modeling. The ability to manage and analyze large datasets, combined with R’s extensive statistical functions, makes it a valuable tool for conducting research in sociology, psychology, economics, and political science.
Reporting and Reproducible Research
R facilitates the creation of reproducible reports and dynamic documents through packages like knitr and rmarkdown. These tools allow users to integrate R code with Markdown or LaTeX, producing documents that include both analysis and narrative text. This is particularly useful in academic research and industry reporting, ensuring that analyses are transparent and reproducible.
In summary, R’s versatility and powerful statistical capabilities make it an invaluable tool across various domains. From data analysis and visualization to machine learning and bioinformatics, R provides the tools necessary to perform sophisticated analyses and produce high-quality results. Understanding these common use cases will help you leverage R effectively in your specific field of study or work.
Integrated Programming Environments for R
While you need nothing more than a simple text editor (such as Sublime Text, or even Notepad or TextEdit) to write R scripts (aka programs), most developers use an integrated development environment (IDE) that offers syntax-aware editors, code completion, AI assistants, GitHub integration, source code control, package control, and source file management with projects.
For R, there are several popular IDEs and tools that enhance the programming experience by offering features like syntax highlighting, code completion, debugging, and direct execution of code. Here are some of the most commonly used development environments for R:
Posit (RStudio)
Posit (formerly RStudio) is arguably the most popular and widely used IDE for R. It is available in both desktop and server versions and offers a user-friendly interface with many features designed to improve productivity and ease of use.
Key Features:
Syntax highlighting and code completion.
Integrated R help and documentation.
Interactive graphics with support for multiple plots.
Debugging and profiling tools.
Version control integration (Git and SVN).
Project management capabilities.
Support for R Markdown and dynamic report generation.
Usage: Posit is suitable for all levels of R users, from beginners to advanced programmers. It is extensively used in academic settings, data analysis, research environments, and for production. It supports the development of R scripts, markdown documents, presentations, and R Notebooks. Posit supports not just R, but also Python, SQL, D3, as well as Java and C++ allowing multi-language programming.
Posit (formerly RStudio) can be obtained for free from posit.co. R must be installed prior to installing Posit.
As an alternative to installing R and Posit locally, a cloud-hosted version is provided on posit.cloud. A free plan is available, along with various subscription plans, including one for education. posit.cloud supports both Posit and Jupyter Notebook for integrated R and Python.
Jupyter Notebooks
Jupyter Notebooks are an increasingly popular choice for data scientists and researchers due to their interactive and literate programming approach. Jupyter is language-agnostic: alongside Python, it supports R through IRkernel. Jupyter Notebooks are comparable to R Notebooks and the newer Quarto documents.
Key Features:
Interactive, web-based notebook environment.
Support for rich media output (plots, images, videos).
In-line code execution with immediate feedback.
Integration with various data science libraries.
Shareable notebooks with narrative text, code, and output.
Usage: Jupyter Notebooks are ideal for exploratory data analysis, teaching, and creating reproducible research documents.
Visual Studio Code (VS Code)
Visual Studio Code is a lightweight, powerful code editor developed by Microsoft. With the appropriate extensions, VS Code can be transformed into a robust IDE for R.
Key Features:
Syntax highlighting and code completion.
Integrated terminal and debugger.
Extensive extensions marketplace (e.g., R Extension for Visual Studio Code).
Git integration for version control.
Support for multiple programming languages.
Usage: VS Code is suitable for developers who prefer a versatile code editor that can be customized for various programming needs, including R development.
Emacs with ESS (Emacs Speaks Statistics)
Emacs, combined with the ESS (Emacs Speaks Statistics) package, provides a powerful environment for R programming. ESS is an add-on package that enhances Emacs to support interactive statistical programming and data analysis.
Key Features:
Advanced text editing capabilities.
Syntax highlighting and code completion.
Direct interaction with the R process.
Debugging and profiling tools.
Integration with other statistical packages (e.g., SAS, Stata).
Usage: Emacs with ESS is favored by users who are comfortable with Emacs and seek a highly customizable and extensible programming environment.
RGui (R Graphical User Interface)
RGui is the standard graphical user interface that comes with the base R installation. It provides a basic environment for writing and executing R code.
Key Features:
Simple and lightweight.
Integrated R console.
Basic script editor.
Usage: RGui is suitable for quick, simple tasks and is often used by beginners who are just starting with R.
R as a Programming Language
R is a high-level language that primarily follows the paradigm of functional programming but also supports procedural and object-oriented programming styles. Its flexibility in accommodating different programming paradigms makes it a versatile language for a wide range of computational tasks. However, the programming mechanisms for large-scale software engineering are not as robust as those in Java (or even Python), and therefore it is not recommended to build large applications in R. Instead, R is an ideal language for experimentation, exploration, proof-of-concept development, research, and prototype development. It is also ideal for quick programming tasks.
Functional Programming in R
Functional programming is at the core of R’s design. In this paradigm, computation is treated as the evaluation of mathematical functions, avoiding changes of state and mutable data. R’s functions are first-class objects, meaning they can be passed as arguments to other functions, returned as values from other functions, and assigned to variables.
For example, consider the use of the lapply function, which applies a function to each element of a list and returns a list of results:
# Define a simple function to square a number
square <- function(x) {
  return(x * x)
}

# Create a list of numbers
numbers <- list(1, 2, 3, 4, 5)

# Apply the square function to each element of the list
squared_numbers <- lapply(numbers, square)
print(squared_numbers)
Procedural Programming in R
R also supports procedural programming, where you write sequences of instructions to perform tasks. This approach is useful for tasks that involve looping, conditionals, and step-by-step computations. Here’s an example using a for loop to achieve the same result as the previous functional example:
# Create a list of numbers
numbers <- list(1, 2, 3, 4, 5)

# Initialize an empty list to store squared numbers
squared_numbers <- list()

# Use a for loop to square each number
for (i in seq_along(numbers)) {
  squared_numbers[[i]] <- numbers[[i]] * numbers[[i]]
}
print(squared_numbers)
Object-Oriented Programming in R
R also supports object-oriented programming (OOP) through several systems, most notably S3 and S4. The S3 system is more informal and relies on generic functions and method dispatch based on object class. The S4 system is more formal, with explicit class and method definitions.
Here’s a simple example using the S3 system:
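# Define a constructor for a simple S3 class representing a point
point <- function(x, y) {
  structure(list(x = x, y = y), class = "point")
}

# Define a print method for point objects; the signature matches
# the print() generic (first argument x, then ...)
print.point <- function(x, ...) {
  cat("Point(", x$x, ", ", x$y, ")\n", sep = "")
  invisible(x)
}

# Create a point object
p <- point(2, 3)

# Print the point object (dispatches to print.point)
print(p)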
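For comparison, the more formal S4 system can be sketched in a similar way. The class name Point and the generic describe are hypothetical names chosen for this illustration; setClass(), setGeneric(), setMethod(), and new() are the standard functions from the methods package.

```r
library(methods)

# Define a formal S4 class with two typed slots.
setClass("Point", slots = c(x = "numeric", y = "numeric"))

# Declare a generic function and a method for the Point class.
setGeneric("describe", function(object) standardGeneric("describe"))
setMethod("describe", "Point", function(object) {
  paste0("Point(", object@x, ", ", object@y, ")")
})

# Create an instance; slot types are checked at creation time.
p4 <- new("Point", x = 2, y = 3)
describe(p4)   # "Point(2, 3)"
```

Unlike S3, attempting to create a Point with a non-numeric slot value results in an error, which reflects the explicit class definitions that S4 requires.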
Literate Programming
Literate programming is a programming paradigm introduced by Donald Knuth in the early 1980s. The core idea behind literate programming is to write programs that are understandable by humans first and computers second. This is achieved by interspersing natural language explanations with source code, creating a document that can be read and understood like a book or an article. Literate programming emphasizes the importance of clear communication of ideas and logic, making the code more maintainable and easier to understand.
In literate programming, the emphasis is on explaining the logic and rationale behind the code in a way that is accessible to humans. The goal is to produce a document that is as much about conveying ideas as it is about writing executable code.
The document alternates between sections of explanatory text and sections of code. The text explains what the code does, why certain decisions were made, and how different parts of the code interact. This approach helps readers follow the thought process of the programmer, making it easier to understand complex algorithms and systems.
The structure of a literate program follows a logical narrative flow rather than the strict syntactical requirements of a programming language. This means that the order in which the code is presented in the document can differ from the order in which it is executed. Tools used in literate programming will later rearrange the code into a form that can be compiled or interpreted by the computer.
Literate programming tools allow the document to be compiled or interpreted, executing the code segments embedded within the text. This ensures that the documentation is always synchronized with the code and can be run to verify its correctness.
R Markdown and R Notebooks are the tools in the R ecosystem that support literate programming for data analysis and reporting. Literate programming is also supported in Python with Jupyter Notebooks. Note that R Notebooks and Jupyter Notebooks can actually contain a mix of code blocks in R, Python, SQL, Java, C++, D3, and many others, allowing one to use the programming language most suitable for a particular task. There is also support in Haskell and Racket for literate programming, so it is not restricted to R and Python.
Extending R with Packages
One of the greatest strengths of R is its extensive ecosystem of packages. Packages in R are collections of R functions, data, and compiled code that extend the capabilities of the base R system. The Comprehensive R Archive Network (CRAN) is the primary repository for R packages, hosting thousands of packages that cover a wide array of functionalities. The key packages include:
tidyverse: A collection of packages designed for data science. This includes ggplot2 for data visualization, dplyr for data manipulation, tidyr for data tidying, and others.
data.table: An extension of data.frame that provides an enhanced version with faster data manipulation capabilities.
shiny: For building interactive web applications directly from R.
caret: For machine learning and predictive modeling.
knitr: For dynamic report generation, integrating R code with LaTeX, HTML, and Markdown.
Rcpp: For seamless integration of R with C++ to enhance performance.
RSQLite: For access and integration with SQLite.
ggplot2: For creating reproducible visualizations.
kableExtra: For construction of visually appealing tables, building on the kable() function from knitr.
Scripts vs Notebooks
R code can be written in two general ways: (1) as Literate Programs using R Markdown Notebooks and (2) as Scripts. Scripts are programs that can run within an IDE such as RStudio or be directly executed from the command line through R. Scripts have the benefit that they can be debugged using a debugger, while R Notebook code chunks are more difficult to debug. Another benefit of scripts is that they can be run from the command line and as part of cron jobs, i.e., they can be scheduled to run automatically at a point in time. Finally, R scripts can be included in shell scripts (on Unix and macOS) and .bat batch programs (on Windows).
R Notebooks are markdown documents that contain embedded code in R and other languages. They are most useful when producing reports, memos, and analytics journals where we need to intersperse narratives and text with code. An R Notebook produces a document (HTML or PDF, most commonly) when it is “run” (aka “knitted”). An R Script, on the other hand, is a program that runs like in any other language.
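To make the structure of such a document concrete, the sketch below writes a minimal R Markdown file from R and knits it to HTML. The file name and contents are illustrative; rendering is attempted only if the rmarkdown package and pandoc are available, which is an assumption about the local setup.

```r
# Lines of a minimal R Markdown document: a YAML header,
# narrative text, and one embedded R code chunk.
rmd_lines <- c(
  "---",
  "title: \"Example Notebook\"",
  "output: html_document",
  "---",
  "",
  "The mean of the sample is computed below.",
  "",
  "```{r}",
  "mean(c(2, 4, 6))",
  "```"
)

# Write the document to a temporary file.
rmd_file <- file.path(tempdir(), "example.Rmd")
writeLines(rmd_lines, rmd_file)

# Knit the document if the required tools are installed.
if (requireNamespace("rmarkdown", quietly = TRUE) &&
    rmarkdown::pandoc_available()) {
  output_file <- rmarkdown::render(rmd_file, quiet = TRUE)
  print(output_file)   # path of the generated HTML document
}
```

In day-to-day work the .Rmd file is usually edited directly in the IDE and knitted with the Knit button rather than generated from code.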
R Scripts are preferable when you need stand-alone R programs that can be run directly from the command line or within a shell script. One main benefit of using R Scripts is that we can use the debugger within RStudio.
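A stand-alone script of this kind might look like the following sketch. The file name mean_args.R and the fallback values are hypothetical; commandArgs() is the base R function for reading command-line arguments.

```r
#!/usr/bin/env Rscript
# mean_args.R (hypothetical file name): print the mean of the numbers
# supplied on the command line.

# Keep only the arguments that follow the script name.
args <- commandArgs(trailingOnly = TRUE)
values <- as.numeric(args)

# Fall back to sample values when no arguments are supplied.
if (length(values) == 0) {
  values <- c(1, 2, 3)
}

cat("Mean:", mean(values), "\n")
```

Saved to a file, the script can be run from a terminal with a command such as Rscript mean_args.R 4 8 15, and the same command can be placed in a shell script or a cron job.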
See Also
For more information on R Scripts (aka R Programs):
In this lesson, we explored the concepts of literate programming and its implementation using R Notebooks. Literate programming, introduced by Donald Knuth, integrates natural language explanations with source code, making programs more understandable and maintainable by humans. R Notebooks exemplify this paradigm by combining narrative text, code, and output in a single document, enhancing readability and reproducibility. We contrasted R Notebooks for dynamic document generation with R Scripts, which are stand-alone programs that can be run from the command line or embedded within shell scripts.
We also compared R and Python, highlighting that while both are powerful tools for data analysis and machine learning, they differ in syntax, libraries, and use cases. R is favored for statistical analysis, data analytics reporting and visualization, whereas Python is more versatile, suitable for general-purpose programming and integration into production environments. Python’s broader user base and extensive libraries make it more popular overall while R has a strong following in research. An educated computer and data scientist must know both languages.
Additionally, we discussed common development environments for R, including RStudio/Posit.
[1] As “R” was a close approximation at the time to the popular “S” programming language, the name chosen was the letter prior to “S” in the alphabet. Today, “R” is much broader in scope than “S” and far more popular.