Objectives

Upon completion of this lesson, you will be able to:

  • connect to a MongoDB document database from R
  • retrieve and process data

Introduction

MongoDB is a non-relational – or NoSQL – database. It belongs to the family of “document databases”. There are several other NoSQL database architectures besides document databases, including key-value, graph, and columnar databases.

A document database is a type of NoSQL database that stores data as self-describing, semi-structured documents rather than as rows in fixed-schema tables. Each document is a collection of key-value pairs, typically serialized in JSON (JavaScript Object Notation) or a binary derivative such as BSON (Binary JSON, used by MongoDB). A document is roughly analogous to a row in a relational table, but with two critical differences: documents within the same collection are not required to share an identical structure, and a document can embed complex nested data — arrays, subdocuments, and arrays of subdocuments — directly within itself rather than in separate linked tables. This schema-less design is one of the main drivers of document databases’ flexibility.

MongoDB Documents

A single MongoDB document representing a student record might look like this:

{
  "_id": "CS001",
  "name": "Alice Müller",
  "department": "Computer Science",
  "gpa": 3.9,
  "courses": [
    { "code": "CS501", "title": "Algorithms", "grade": "A" },
    { "code": "CS512", "title": "Machine Learning", "grade": "A-" }
  ],
  "advisor": {
    "name": "Prof. Chen",
    "email": "chen@university.edu"
  }
}

In a relational database, this structure would require at minimum three tables — students, courses, and advisors — joined at query time. In a document database, all of this information lives in one place and is retrieved in a single read. Note the nested structure of the document; it is similar to the nesting found in XML documents.
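To make the nesting concrete, the same record can be parsed with nothing more than Python's standard json module (Python is also used later in this lesson); every embedded field is reachable directly, with no joins:

```python
import json

# The student document from above, as a JSON string
record = """
{
  "_id": "CS001",
  "name": "Alice Müller",
  "gpa": 3.9,
  "courses": [
    { "code": "CS501", "title": "Algorithms", "grade": "A" },
    { "code": "CS512", "title": "Machine Learning", "grade": "A-" }
  ],
  "advisor": { "name": "Prof. Chen", "email": "chen@university.edu" }
}
"""

student = json.loads(record)

# Nested fields are reached by ordinary indexing, not by joining tables
print(student["advisor"]["name"])               # Prof. Chen
print([c["code"] for c in student["courses"]])  # ['CS501', 'CS512']
```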

Key Terminology

MongoDB Term         | Relational Equivalent        | Notes
Database             | Database                     | Direct equivalent
Collection           | Table                        | No enforced schema by default
Document             | Row                          | JSON/BSON; flexible structure
Field                | Column                       | Can be nested or an array
_id                  | Primary Key                  | Auto-generated if not supplied
Index                | Index                        | Supports single-field, compound, text, and geospatial indexes
Aggregation Pipeline | GROUP BY + JOIN + subqueries | Staged transformation of documents

Learn more about the different NoSQL database architectures in Lesson 72.102 – Common NoSQL Database Paradigms.

About MongoDB

MongoDB, released in 2009 by 10gen (now MongoDB, Inc.), is the most widely deployed document database. It stores documents in BSON, a binary superset of JSON that adds data types not present in JSON (e.g., Date, ObjectId, Decimal128, BinData). Its distinguishing architectural features include:

Schema flexibility. Collections do not enforce a schema by default. Different documents in the same collection can have entirely different fields. Since version 3.6, MongoDB optionally supports JSON Schema validation at the collection level, allowing teams to enforce structure when consistency is required.
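For illustration, a JSON Schema validator for student records like those used later in this lesson might look like the following (a sketch; the required fields and GPA bounds are assumptions). In mongosh it would be passed as the validator option of createCollection(), or applied to an existing collection via the collMod command.

```json
{
  "$jsonSchema": {
    "bsonType": "object",
    "required": ["name", "gpa"],
    "properties": {
      "name": { "bsonType": "string" },
      "gpa":  { "bsonType": "double", "minimum": 0, "maximum": 4.0 }
    }
  }
}
```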

Horizontal scaling via sharding. MongoDB partitions data across multiple servers (shards) based on a shard key. This allows write and read throughput to scale horizontally by adding servers, which is architecturally difficult in traditional relational systems.

Replica sets. MongoDB replicates data across multiple nodes automatically. One node is the primary (accepts writes); others are secondaries (replicate asynchronously and can serve reads). Failover is automatic – if the primary goes down, the replica set elects a new primary without manual intervention.

Aggregation Pipeline. A powerful data transformation framework where data passes through a sequence of stages (e.g., $match, $group, $sort, $project, $lookup) to produce computed results, similar in power to complex SQL queries.

Atlas. MongoDB’s managed cloud service, available via AWS, Azure, and GCP. It handles provisioning, backups, monitoring, and scaling, and adds features such as Atlas Search (full-text search via Apache Lucene) and Atlas Vector Search (for similarity search in ML applications).

Common Use Cases of MongoDB

Content Management and Catalogs

Content such as articles, product listings, and media metadata is inherently variable in structure. A news article has a headline, body, tags, and author. A video has a duration, codec, and subtitles. A product listing has dimensions, weight, and SKU variants. Forcing all of these into a single relational table requires either many nullable columns or a complex inheritance hierarchy. A document database accommodates this naturally: each item is stored as a document with exactly the fields it needs.

User Profiles and Personalization

User profile data tends to be sparse and highly variable across users. One user may have a shipping address, a payment method, and a list of preferences; another may have only an email address. Document databases allow each profile to carry only its own relevant fields, and nested arrays make it straightforward to embed lists of addresses, orders, or settings directly on the user document.

Real-Time Event and Log Data

Applications that ingest high volumes of time-stamped events — clickstreams, IoT sensor readings, application logs — generate data at rates that often exceed what a single relational database write path can absorb. MongoDB’s horizontal write scaling and its ability to insert arbitrary JSON payloads without a schema migration make it well suited for this workload.

Mobile and Web Application Back Ends

The data layer of most web and mobile applications is naturally document-shaped: a single API response for a user’s dashboard might aggregate data from several relational tables. MongoDB allows the developer to store data in the shape it will be consumed, reducing or eliminating the translation layer between the database and the application.

Hierarchical and Graph-Adjacent Data

Data with deep nesting or recursive structure — organizational hierarchies, comment threads, bill-of-materials trees — can be stored as embedded documents. While not a graph database, MongoDB handles tree structures well when the nesting depth is bounded and predictable.

Prototyping and Evolving Schemas

During early development, when the data model is still changing, schema flexibility allows developers to iterate rapidly without writing and running migration scripts for every structural change. This may be particularly valuable in agile software development lifecycles that do not feature significant up-front design.

MongoDB vs Relational Database

The decision is architectural, not tribal, and not always clear-cut. Relational databases are the workhorse of corporate data stores, and SQL is ubiquitous, with strong analytical query capabilities. However, MongoDB may be the preferred choice when several of the following conditions hold:

The data is document-shaped. If the natural unit of your application is a self-contained entity that you always read and write as a whole (e.g., an order with its line items, a blog post with its comments), embedding that data in a document avoids expensive multi-table joins on every query. In e-commerce, for instance, a “shopping cart” is naturally a single document.

The schema is variable or unknown at design time. If different records legitimately need different fields, or if the schema is expected to evolve frequently, the cost of relational migrations is high and the benefit of a fixed schema is low.

Write throughput must scale horizontally. Relational databases scale vertically (bigger hardware) much more readily than horizontally (more servers). MongoDB’s sharding architecture is designed for horizontal write scaling across commodity hardware.

The workload is read-heavy with a known access pattern. If you almost always retrieve a complete entity by its ID or by a small set of indexed fields, the document model’s co-location of related data yields lower latency than a normalized relational model that requires joins.

You are building on top of a JSON-native API or language ecosystem. When your application already thinks in JSON – as most REST and GraphQL APIs do – MongoDB eliminates the object-relational impedance mismatch: the shape of the data in the database matches the shape of the data in memory.

Disadvantages of Document Databases

Understanding the limitations of document databases is as important as understanding their strengths. MongoDB is not a general-purpose replacement for relational databases; it involves several trade-offs and considerations.

Weak Multi-Document Transactional Support (Historically)

MongoDB did not support multi-document ACID transactions until version 4.0 (2018). Even today, multi-document transactions in MongoDB carry a meaningful performance overhead and are subject to a 60-second timeout by default. Workloads that require frequent, complex, multi-entity transactions — such as double-entry bookkeeping or inventory reservation — are still better served by relational databases, where transactions are a first-class, highly optimized primitive.

No Enforced Referential Integrity

MongoDB has no native foreign key constraints. If a document in one collection references the _id of a document in another collection, MongoDB will not prevent you from deleting the referenced document, leaving the reference dangling. Maintaining referential integrity is the responsibility of the application layer, which increases the surface area for data consistency bugs.

Data Duplication and Update Anomalies

The document model encourages denormalization – embedding the same data in multiple documents to avoid joins. While this improves read performance, it means that updating shared data (e.g., a professor’s name embedded in hundreds of course documents) requires updating every affected document, and there is no transactional guarantee that all of them will be updated atomically unless multi-document transactions are used explicitly. Document databases have no “normal forms”; this freedom can be useful, but it typically comes at the cost of data duplication.
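The update anomaly can be sketched in plain Python (this is an application-level simulation over dicts, not driver code; on a live server the equivalent would be a single updateMany with $set):

```python
# Hypothetical denormalized course documents: the advisor's name is embedded
# in every course rather than stored once in a separate advisors collection.
courses = [
    {"code": "CS501", "advisor": {"name": "Prof. Chen"}},
    {"code": "CS512", "advisor": {"name": "Prof. Chen"}},
    {"code": "EE310", "advisor": {"name": "Prof. Rossi"}},
]

# Renaming one advisor means touching every embedded copy
for doc in courses:
    if doc["advisor"]["name"] == "Prof. Chen":
        doc["advisor"]["name"] = "Prof. Chen-Lee"

touched = sum(d["advisor"]["name"] == "Prof. Chen-Lee" for d in courses)
print(touched)  # 2 documents had to change for one logical update
```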

Unbounded Document Growth

Embedding lists within documents is convenient, but lists that grow without bound (e.g., embedding every comment ever made on a post directly in the post document) can cause documents to exceed MongoDB’s 16 MB document size limit and degrade performance as ever-larger documents are read and rewritten. Large embedded arrays also make indexing costly, because every array element generates its own index entry.

Weaker Ad Hoc Query and Reporting Capability

SQL is a declarative, standardized language with decades of tooling, optimization, and practitioner knowledge behind it. MongoDB’s query language is expressive but more verbose, non-standardized, and less familiar to analysts and data scientists. Complex reporting queries, particularly those involving multiple joins across entities, are cumbersome in MongoDB and often require the aggregation pipeline, which has a steeper learning curve than equivalent SQL.

No Schema Enforcement by Default

Schema flexibility is a double-edged sword. Without validation rules, a collection can accumulate structurally inconsistent documents over time. This is sometimes called “schema drift”; it makes downstream querying, analytics, and maintenance significantly harder. Teams must enforce structure themselves, either through application-level validation or MongoDB’s optional JSON Schema validators, rather than relying on the database to enforce it.

Tooling and Ecosystem Maturity

The relational ecosystem – object-relational mappers (ORMs), BI connectors, ETL and data warehouse tools, query analyzers, and database administration tools – is substantially more mature than the MongoDB ecosystem. Organizations with established data warehousing, compliance, or reporting requirements will find more off-the-shelf tooling for relational databases and can leverage existing investments and skill sets.

Connecting to MongoDB from R

mongolite is the primary R package for interacting with MongoDB. It wraps the mongo-c-driver and provides a clean interface for CRUD (Create/Read/Update/Delete) operations.

After installation of the mongolite package, load the package with:

library(mongolite)

The central object in mongolite is the mongo connection object, created by calling mongo(). Its primary arguments are:

Argument   | Description
collection | The name of the MongoDB collection (analogous to a table in a relational database).
db         | The name of the database. MongoDB creates it automatically if it does not exist.
url        | The MongoDB connection string. The default is "mongodb://localhost" (port 27017).

Below is the full code to connect and verify that the connection is live by requesting a count of the documents in the collection. In this case, given that we have an empty database, a return value of 0 with no error confirms a successful connection.

library(mongolite)

# connect to a collection named "carts" in a database named "shop"

con <- mongo(
  collection = "carts",
  db         = "shop",
  url        = "mongodb://localhost:27017"
)

# request the number of documents in the collection (0 if empty, error if unreachable)
con$count()

The call to mongo() does not open a persistent socket immediately – it configures the connection. The actual connection is established on the first operation: con$count() in the code above.

If the MongoDB server is not running, you will receive an error at that point.

How to start the MongoDB server depends on your operating system: see 72.610 – Installing MongoDB on Windows and MacOS for details.

Connecting to MongoDB from Python

We can connect to MongoDB from Python rather than R. A few notes on the equivalences between the two:

Package: mongolite in R maps to pymongo in Python. Install it with pip install pymongo if you have not already.

Connection structure: mongolite collapses the client, database, and collection into a single mongo() call. In pymongo these are three distinct steps – you create a MongoClient, select a database from it by name, then select a collection from that database by name. The resulting con object in Python is directly equivalent to the con object in R.

Document count: con$count() in R maps to con.count_documents({}) in Python. The empty dict {} is the query filter, meaning “match all documents,” which is the same behavior as calling count() with no arguments in mongolite. The older con.count() method exists in pymongo but has been deprecated and should not be used.

from pymongo import MongoClient

# Connect to a collection named "students" in a database named "university"
client = MongoClient("mongodb://localhost:27017")
db     = client["university"]
con    = db["students"]

# Returns the number of documents in the collection (0 if empty, error if unreachable)
print(con.count_documents({}))

Saving Data to MongoDB

Let’s first look at how we store a document in MongoDB.

As an example, we will work with a data frame representing graduate student records. Recall that MongoDB stores documents as JSON/BSON, and mongolite automatically serializes R data frames into the appropriate format.

# Create a sample data frame
students <- data.frame(
  student_id  = c("CS001", "CS002", "CS003", "CS004", "CS005"),
  name        = c("Alice Müller", "Bob Zhang", "Carla Rossi", "David Kim", "Eva Novak"),
  department  = c("Computer Science", "Computer Science", "Electrical Engineering",
                  "Computer Science", "Data Science"),
  gpa         = c(3.9, 3.5, 3.7, 3.2, 3.8),
  enrolled    = c(TRUE, TRUE, FALSE, TRUE, TRUE),
  year        = c(2L, 3L, 1L, 4L, 2L),
  stringsAsFactors = FALSE
)

Inserting Data

Use the $insert() method to write one or more documents. mongolite accepts a data frame and inserts each row as a separate document.

# Insert all rows at once
con$insert(students)

# Confirm the count
cat("Documents inserted:", con$count(), "\n")

You should see:

Documents inserted: 5

Inserting a single document can be done by passing a one-row data frame or by using jsonlite::toJSON() directly, but passing a data frame is the clearest approach:

new.student <- data.frame(
  student_id = "CS006",
  name       = "Frank Osei",
  department = "Computer Science",
  gpa        = 3.6,
  enrolled   = TRUE,
  year       = 1L,
  stringsAsFactors = FALSE
)

con$insert(new.student)
cat("Documents after new insert:", con$count(), "\n")

Retrieving Data from MongoDB

Retrieving All Documents

The $find() method returns documents as an R data frame. With no arguments, it returns all documents.

all.students <- con$find()
print(all.students)

Filtering with a Query

MongoDB queries are expressed as JSON strings rather than SQL. The query argument of $find() accepts a JSON character string expressing query predicates on document keys.

Retrieve all enrolled students:

enrolled.students <- con$find(
  query = '{"enrolled": true}'
)
print(enrolled.students)

Retrieve students with a GPA greater than 3.6:

high.gpa <- con$find(
  query = '{"gpa": {"$gt": 3.6}}'
)
print(high.gpa)

Retrieve students from a specific department:

cs.students <- con$find(
  query = '{"department": "Computer Science"}'
)
print(cs.students)

Selecting Specific Fields (Projection)

To return only certain fields, use the fields argument. A value of 1 includes a field; 0 excludes it. MongoDB always returns the internal _id field unless you explicitly suppress it.

# Return only name and gpa, suppress the internal _id field
name.gpa <- con$find(
  query  = "{}",
  fields = '{"name": 1, "gpa": 1, "_id": 0}'
)
print(name.gpa)

Sorting Results

The sort argument controls the order of returned documents. Use 1 for ascending and -1 for descending.

# Sort by GPA descending
sorted.by.gpa <- con$find(
  query = "{}",
  sort  = '{"gpa": -1}'
)
print(sorted.by.gpa)

Limiting the Number of Results

# Return only the top 3 documents by GPA
top.three <- con$find(
  query = "{}",
  sort  = '{"gpa": -1}',
  limit = 3L
)
print(top.three)

Closing the Connection

Always close the connection when you are done to release resources:

con$disconnect()

Common Mistakes and Troubleshooting

Mistakes When Connecting

Mistake | Symptom | Resolution
MongoDB service is not running | Error: Failed to connect to localhost:27017 | Start the service (Start-Service MongoDB on Windows; brew services start mongodb-community on macOS).
Wrong port in the connection string | Connection refused | The default port is 27017. Verify with mongosh --port 27017. If you changed it, update url accordingly.
mongolite not installed or loaded | could not find function "mongo" | Run install.packages("mongolite") and library(mongolite).
Firewall or antivirus blocking port | Connection hangs indefinitely | Temporarily disable antivirus to test; then add an exception for port 27017.
Using an Atlas (cloud) connection string without TLS parameters | SSL handshake failed | Atlas requires TLS. Use the full connection string Atlas provides, which includes tls=true.
Typo in database or collection name | A new empty database/collection is silently created | MongoDB does not raise an error for non-existent names; it creates them. Run con$count() after connecting to check for unexpectedly empty collections.
Connecting inside a loop without disconnecting | R session memory grows; eventual connection pool exhaustion | Create the connection object once outside the loop and reuse it, or explicitly call con$disconnect() at the end.

Mistakes When Saving Data

Mistake | Symptom | Resolution
Passing a list instead of a data frame | Documents inserted with unexpected structure | mongolite’s $insert() expects a data frame or a JSON string. Convert lists with as.data.frame() first.
Factor columns in the data frame | Factors are serialized as integers, not strings | Use stringsAsFactors = FALSE in data.frame(), or wrap factor columns with as.character() before inserting.
Inserting duplicate _id values | Write error: E11000 duplicate key error | MongoDB requires unique _id values. If you define your own ID field and name it _id, ensure all values are unique. Alternatively, let MongoDB auto-generate _id.
Inserting an empty data frame | No error, but nothing is written | Check nrow(df) > 0 before calling $insert().
Special characters in field names (e.g., . or $) | Query operators malfunction; unexpected errors | MongoDB field names must not start with $ or contain .. Rename columns before inserting: names(df) <- gsub("\\.", "_", names(df)).
Not dropping the collection between test runs | Cumulative duplicate documents build up | Call con$drop() at the start of development scripts to ensure a clean state.

Mistakes When Retrieving Data

Mistake | Symptom | Resolution
Malformed JSON in the query argument | Error: Invalid JSON | JSON strings in R must use double quotes inside. Use single quotes to wrap the JSON string: '{"field": "value"}'.
Using R comparison operators in the query string | Query returns no results or an error | MongoDB uses its own operators ($gt, $lt, $gte, $lte, $ne, $in). Do not use >, <, or == inside the JSON query string.
Forgetting that $find() returns a data frame | Treating the result as a list and using list indexing | Access columns with standard data frame syntax: result$column_name or result[, "column_name"].
Querying a Boolean field with 0/1 instead of true/false | Query returns 0 results | In MongoDB JSON, Boolean values are true and false (lowercase), not 1/0 or "TRUE". Use '{"enrolled": true}'.
Projection includes a field that does not exist | No error; the field is silently absent from results | Check field names with names(con$find(limit = 1L)) to see what fields are actually stored.
Large result sets causing memory issues | R session freezes or crashes | Use limit and skip for pagination, or use $iterate() for cursor-based streaming of large collections.
Not accounting for the _id field in the returned data frame | Unexpected _id column in downstream processing | Suppress it in the projection: fields = '{"_id": 0}'.

MongoDB’s Query Language vs. SQL

MongoDB does not use SQL. Its query language is expressed as structured JSON objects passed to driver methods (or to the shell). The conceptual operations are equivalent, but the syntax is fundamentally different.

Selecting Documents (SELECT / WHERE)

SQL:

SELECT * FROM students WHERE department = 'Computer Science';

MongoDB:

db.students.find({ "department": "Computer Science" })

The JSON object passed to find() is the query filter. Fields in the filter are implicitly ANDed together.

Note that we can use Relational Algebra to express queries independently of the actual query language that will eventually be used.

Relational Algebra:

\(\sigma_{department = 'Computer Science'} (students)\)

Comparison Operators

SQL uses standard comparison symbols (>, <, >=, !=). MongoDB uses named operators prefixed with $ inside the filter object.

SQL Operator | MongoDB Operator
=            | (implicit, just the value)
!=           | $ne
>            | $gt
>=           | $gte
<            | $lt
<=           | $lte
IN (...)     | $in
NOT IN (...) | $nin

SQL:

SELECT * FROM students WHERE gpa >= 3.7 AND enrolled = TRUE;

MongoDB:

db.students.find({ "gpa": { "$gte": 3.7 }, "enrolled": true })

Logical Operators

SQL:

SELECT * FROM students WHERE department = 'Computer Science' OR gpa > 3.8;

MongoDB:

db.students.find({
  "$or": [
    { "department": "Computer Science" },
    { "gpa": { "$gt": 3.8 } }
  ]
})

MongoDB’s logical operators ($or, $and, $not, $nor) take an array of condition objects.

Projection (SELECT specific columns)

SQL:

SELECT name, gpa FROM students;

MongoDB:

db.students.find({}, { "name": 1, "gpa": 1, "_id": 0 })

The second argument to find() is the projection document. 1 includes a field; 0 excludes it. Mixing inclusion and exclusion is not allowed except for _id.

Relational Algebra:

\(\pi_{name,\, gpa} (students)\)

Sorting and Limiting

SQL:

SELECT * FROM students ORDER BY gpa DESC LIMIT 3;

MongoDB:

db.students.find({}).sort({ "gpa": -1 }).limit(3)

In MongoDB, 1 is ascending and -1 is descending – somewhat cryptic compared to SQL.

Inserting Data

SQL:

INSERT INTO students (student_id, name, gpa) VALUES ('CS007', 'Greta Han', 3.85);

MongoDB:

db.students.insertOne({ "student_id": "CS007", "name": "Greta Han", "gpa": 3.85 })

Updating Data

SQL:

UPDATE students SET gpa = 4.0 WHERE student_id = 'CS001';

MongoDB:

db.students.updateOne(
  { "student_id": "CS001" },
  { "$set": { "gpa": 4.0 } }
)

The first argument is the filter (which documents to update). The second argument is the update document. $set changes only the specified fields and leaves all other fields untouched. If you instead supply a plain replacement document (with replaceOne(), or the legacy update() without operators), MongoDB replaces the entire document with only the fields you supplied, a common source of data loss for newcomers.
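The difference can be sketched with plain Python dictionaries (a simulation of the semantics, not driver code):

```python
# A stored document
doc = {"student_id": "CS001", "name": "Alice Müller", "gpa": 3.9}

# "$set"-style update: merge the new fields into the existing document
after_set = {**doc, "gpa": 4.0}

# Replacement-style update: the document becomes ONLY the supplied fields
after_replace = {"gpa": 4.0}

print(after_set)      # student_id and name survive
print(after_replace)  # student_id and name are gone
```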

Deleting Data

SQL:

DELETE FROM students WHERE student_id = 'CS003';

MongoDB:

db.students.deleteOne({ "student_id": "CS003" })

Aggregation (GROUP BY)

The aggregation pipeline is MongoDB’s equivalent of SQL’s GROUP BY, HAVING, JOIN, and window functions combined. Data flows through a sequence of transformation stages.

SQL:

SELECT department, COUNT(*) AS total, AVG(gpa) AS avg_gpa
FROM students
GROUP BY department
ORDER BY avg_gpa DESC;

MongoDB Aggregation Pipeline:

db.students.aggregate([
  { "$group": {
      "_id": "$department",
      "total":   { "$sum": 1 },
      "avg_gpa": { "$avg": "$gpa" }
  }},
  { "$sort": { "avg_gpa": -1 } }
])

Each element of the array is a stage. Stages are applied in order: $group collapses documents by department and computes aggregates; $sort then reorders the resulting group summaries.
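What the two stages compute can be mirrored in a few lines of plain Python (a simulation over in-memory dicts, not the pipeline itself):

```python
from collections import defaultdict

# A miniature version of the students collection
students = [
    {"department": "Computer Science", "gpa": 3.9},
    {"department": "Computer Science", "gpa": 3.5},
    {"department": "Data Science",     "gpa": 3.8},
]

# $group: collapse documents by department and compute aggregates
groups = defaultdict(list)
for s in students:
    groups[s["department"]].append(s["gpa"])

summary = [
    {"_id": dept, "total": len(gpas), "avg_gpa": sum(gpas) / len(gpas)}
    for dept, gpas in groups.items()
]

# $sort: reorder the group summaries by avg_gpa descending
summary.sort(key=lambda d: d["avg_gpa"], reverse=True)
print(summary)
```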

Joining Collections ($lookup)

MongoDB does not have a native JOIN keyword. The aggregation pipeline stage $lookup performs a left outer join between two collections.

SQL:

SELECT s.name, s.gpa, a.advisor_name
FROM students s
JOIN advisors a ON s.advisor_id = a._id;

MongoDB:

db.students.aggregate([
  { "$lookup": {
      "from":         "advisors",
      "localField":   "advisor_id",
      "foreignField": "_id",
      "as":           "advisor_info"
  }}
])

The joined documents are embedded as an array (advisor_info) on each student document. This is more verbose than SQL joins, and chaining multiple $lookup stages quickly becomes unwieldy — one of the strongest practical arguments for choosing a relational database when your data model is inherently relational.

Summary of Conceptual Differences

Concept            | SQL                      | MongoDB
Query language     | Declarative text (SQL)   | JSON objects passed to methods
Standardization    | ISO standard (ANSI SQL)  | Proprietary, driver-specific
Filter conditions  | WHERE clause             | First argument to find() / updateOne() etc.
Field selection    | SELECT col1, col2        | Projection document (second argument to find())
Aggregation        | GROUP BY + HAVING        | $group stage in aggregation pipeline
Joining            | JOIN keyword             | $lookup aggregation stage
Schema enforcement | Enforced by CREATE TABLE | Optional JSON Schema validator
Null semantics     | NULL with IS NULL        | Field absence or explicit null value; $exists operator
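The null-semantics row deserves a concrete illustration. In MongoDB, the filter {"field": null} matches both documents where the field is explicitly null and documents where the field is absent, while {"field": {"$exists": false}} matches only the latter. The distinction, mirrored in plain Python (the email field is illustrative):

```python
docs = [
    {"name": "Alice", "email": "a@u.edu"},
    {"name": "Bob",   "email": None},  # explicit null
    {"name": "Carla"},                 # field absent entirely
]

# What MongoDB's {"email": null} would match: explicit null OR missing field
matches_null = [d["name"] for d in docs if d.get("email") is None]

# What {"email": {"$exists": false}} would match: missing field only
missing_field = [d["name"] for d in docs if "email" not in d]

print(matches_null)   # ['Bob', 'Carla']
print(missing_field)  # ['Carla']
```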

Conclusion

MongoDB offers a flexible, schema-less data model that is well suited to complex, nested, and heterogeneous data structures. This lesson explained how to connect to a MongoDB database from R, insert documents, and use a JSON-based query language to retrieve data.


Files & Downloads

All Files for Lesson 72.611

Acknowledgements

We acknowledge the use of Claude 4.6 (Sonnet) during the preparation of this lesson.

Errata

Let us know.