Objectives

Upon completion of this lesson, you will be able to:

  • connect to a MongoDB document database from R
  • retrieve and process data

Introduction

MongoDB is a non-relational – or NoSQL – database. It belongs to the family of “document databases”. There are several other NoSQL database architectures besides document databases, including key-value, graph, and columnar databases.

A document database is a type of NoSQL database that stores data as self-describing, semi-structured documents rather than as rows in fixed-schema tables. Each document is a collection of key-value pairs, typically serialized in JSON (JavaScript Object Notation) or a binary derivative such as BSON (Binary JSON, used by MongoDB). A document is roughly analogous to a row in a relational table, but with two critical differences: documents within the same collection are not required to share an identical structure, and a document can embed complex nested data — arrays, subdocuments, and arrays of subdocuments — directly within itself rather than in separate linked tables. This schema-less design is one of the main drivers of document databases’ flexibility.

MongoDB Documents

A single MongoDB document representing a student record might look like this:

{
  "_id": "CS001",
  "name": "Alice Müller",
  "department": "Computer Science",
  "gpa": 3.9,
  "courses": [
    { "code": "CS501", "title": "Algorithms", "grade": "A" },
    { "code": "CS512", "title": "Machine Learning", "grade": "A-" }
  ],
  "advisor": {
    "name": "Prof. Chen",
    "email": "chen@university.edu"
  }
}

In a relational database, this structure would require at minimum three tables — students, courses, and advisors — joined at query time. In a document database, all of this information lives in one place and is retrieved in a single read. Note the nested structure of the document; it is similar to the nesting found in XML documents.
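To make the nesting concrete, the same record can be parsed with nothing more than Python's standard json module (Python is also used later in this lesson); every embedded field is reachable directly, with no joins:

```python
import json

# The student document from above, as a JSON string
record = """
{
  "_id": "CS001",
  "name": "Alice Müller",
  "gpa": 3.9,
  "courses": [
    { "code": "CS501", "title": "Algorithms", "grade": "A" },
    { "code": "CS512", "title": "Machine Learning", "grade": "A-" }
  ],
  "advisor": { "name": "Prof. Chen", "email": "chen@university.edu" }
}
"""

student = json.loads(record)

# Nested fields are reached by ordinary indexing, not by joining tables
print(student["advisor"]["name"])               # Prof. Chen
print([c["code"] for c in student["courses"]])  # ['CS501', 'CS512']
```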

Key Terminology

MongoDB Term         | Relational Equivalent        | Notes
Database             | Database                     | Direct equivalent
Collection           | Table                        | No enforced schema by default
Document             | Row                          | JSON/BSON; flexible structure
Field                | Column                       | Can be nested or an array
_id                  | Primary Key                  | Auto-generated if not supplied
Index                | Index                        | Supports single-field, compound, text, and geospatial indexes
Aggregation Pipeline | GROUP BY + JOIN + subqueries | Staged transformation of documents

Learn more about the different NoSQL database architectures in Lesson 72.102 – Common NoSQL Database Paradigms.

About MongoDB

MongoDB, released in 2009 by 10gen (now MongoDB, Inc.), is the most widely deployed document database. It stores documents in BSON, a binary superset of JSON that adds data types not present in JSON (e.g., Date, ObjectId, Decimal128, BinData). Its distinguishing architectural features include:

Schema flexibility. Collections do not enforce a schema by default. Different documents in the same collection can have entirely different fields. Since version 3.6, MongoDB optionally supports JSON Schema validation at the collection level, allowing teams to enforce structure when consistency is required.
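For illustration, a JSON Schema validator for student records like those used later in this lesson might look like the following (a sketch; the required fields and GPA bounds are assumptions). In mongosh it would be passed as the validator option of createCollection(), or applied to an existing collection via the collMod command.

```json
{
  "$jsonSchema": {
    "bsonType": "object",
    "required": ["name", "gpa"],
    "properties": {
      "name": { "bsonType": "string" },
      "gpa":  { "bsonType": "double", "minimum": 0, "maximum": 4.0 }
    }
  }
}
```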

Horizontal scaling via sharding. MongoDB partitions data across multiple servers (shards) based on a shard key. This allows write and read throughput to scale horizontally by adding servers, which is architecturally difficult in traditional relational systems.

Replica sets. MongoDB replicates data across multiple nodes automatically. One node is the primary (accepts writes); others are secondaries (replicate asynchronously and can serve reads). Failover is automatic – if the primary goes down, the replica set elects a new primary without manual intervention.

Aggregation Pipeline. A powerful data transformation framework where data passes through a sequence of stages (e.g., $match, $group, $sort, $project, $lookup) to produce computed results, similar in power to complex SQL queries.

Atlas. MongoDB’s managed cloud service, available via AWS, Azure, and GCP. It handles provisioning, backups, monitoring, and scaling, and adds features such as Atlas Search (full-text search via Apache Lucene) and Atlas Vector Search (for similarity search in ML applications).

Common Use Cases of MongoDB

Content Management and Catalogs

Content such as articles, product listings, and media metadata is inherently variable in structure. A news article has a headline, body, tags, and author. A video has a duration, codec, and subtitles. A product listing has dimensions, weight, and SKU variants. Forcing all of these into a single relational table requires either many nullable columns or a complex inheritance hierarchy. A document database accommodates this naturally: each item is stored as a document with exactly the fields it needs.

User Profiles and Personalization

User profile data tends to be sparse and highly variable across users. One user may have a shipping address, a payment method, and a list of preferences; another may have only an email address. Document databases allow each profile to carry only its own relevant fields, and nested arrays make it straightforward to embed lists of addresses, orders, or settings directly on the user document.

Real-Time Event and Log Data

Applications that ingest high volumes of time-stamped events — clickstreams, IoT sensor readings, application logs — generate data at rates that often exceed what a single relational database write path can absorb. MongoDB’s horizontal write scaling and its ability to insert arbitrary JSON payloads without a schema migration make it well suited for this workload.

Mobile and Web Application Back Ends

The data layer of most web and mobile applications is naturally document-shaped: a single API response for a user’s dashboard might aggregate data from several relational tables. MongoDB allows the developer to store data in the shape it will be consumed, reducing or eliminating the translation layer between the database and the application.

Hierarchical and Graph-Adjacent Data

Data with deep nesting or recursive structure — organizational hierarchies, comment threads, bill-of-materials trees — can be stored as embedded documents. While not a graph database, MongoDB handles tree structures well when the nesting depth is bounded and predictable.

Prototyping and Evolving Schemas

During early development, when the data model is still changing, schema flexibility allows developers to iterate rapidly without writing and running migration scripts for every structural change. This may be particularly valuable in agile software development lifecycles that do not feature significant up-front design.

MongoDB vs Relational Database

The decision is architectural, not tribal, and not always clear-cut. Relational databases are the workhorse of corporate data stores, and SQL is ubiquitous, with strong analytical query capabilities. However, MongoDB may be the preferred choice when several of the following conditions hold:

The data is document-shaped. If the natural unit of your application is a self-contained entity that you always read and write as a whole (e.g., an order with its line items, a blog post with its comments), embedding that data in a document avoids expensive multi-table joins on every query. In e-commerce, for instance, a “shopping cart” is naturally a single document.

The schema is variable or unknown at design time. If different records legitimately need different fields, or if the schema is expected to evolve frequently, the cost of relational migrations is high and the benefit of a fixed schema is low.

Write throughput must scale horizontally. Relational databases scale vertically (bigger hardware) much more readily than horizontally (more servers). MongoDB’s sharding architecture is designed for horizontal write scaling across commodity hardware.

The workload is read-heavy with a known access pattern. If you almost always retrieve a complete entity by its ID or by a small set of indexed fields, the document model’s co-location of related data yields lower latency than a normalized relational model that requires joins.

You are building on top of a JSON-native API or language ecosystem. When your application already thinks in JSON – as most REST and GraphQL APIs do – MongoDB eliminates the object-relational impedance mismatch: the shape of the data in the database matches the shape of the data in memory.

Disadvantages of Document Databases

Understanding the limitations of document databases is as important as understanding their strengths. MongoDB is not a general-purpose replacement for relational databases; it involves several trade-offs and considerations.

Weak Multi-Document Transactional Support (Historically)

MongoDB did not support multi-document ACID transactions until version 4.0 (2018). Even today, multi-document transactions in MongoDB carry a meaningful performance overhead and are subject to a 60-second timeout by default. Workloads that require frequent, complex, multi-entity transactions — such as double-entry bookkeeping or inventory reservation — are still better served by relational databases, where transactions are a first-class, highly optimized primitive.

No Enforced Referential Integrity

MongoDB has no native foreign key constraints. If a document in one collection references the _id of a document in another collection, MongoDB will not prevent you from deleting the referenced document, leaving the reference dangling. Maintaining referential integrity is the responsibility of the application layer, which increases the surface area for data consistency bugs.

Data Duplication and Update Anomalies

The document model encourages denormalization – embedding the same data in multiple documents to avoid joins. While this improves read performance, it means that updating shared data (e.g., a professor’s name embedded in hundreds of course documents) requires updating every affected document, and there is no transactional guarantee that all of them will be updated atomically unless multi-document transactions are used explicitly. Document databases have no “normal forms”; this freedom can be useful, but it typically comes at the cost of data duplication.
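The update anomaly can be sketched in plain Python (this is an application-level simulation over dicts, not driver code; on a live server the equivalent would be a single updateMany with $set):

```python
# Hypothetical denormalized course documents: the advisor's name is embedded
# in every course rather than stored once in a separate advisors collection.
courses = [
    {"code": "CS501", "advisor": {"name": "Prof. Chen"}},
    {"code": "CS512", "advisor": {"name": "Prof. Chen"}},
    {"code": "EE310", "advisor": {"name": "Prof. Rossi"}},
]

# Renaming one advisor means touching every embedded copy
for doc in courses:
    if doc["advisor"]["name"] == "Prof. Chen":
        doc["advisor"]["name"] = "Prof. Chen-Lee"

touched = sum(d["advisor"]["name"] == "Prof. Chen-Lee" for d in courses)
print(touched)  # 2 documents had to change for one logical update
```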

Unbounded Document Growth

Embedding lists within documents is convenient, but lists that grow without bound (e.g., embedding every comment ever made on a post directly in the post document) can cause documents to exceed MongoDB’s 16 MB document size limit and degrade performance as ever-larger documents are read and rewritten. Large embedded arrays also make indexing costly, because every array element generates its own index entry.

Weaker Ad Hoc Query and Reporting Capability

SQL is a declarative, standardized language with decades of tooling, optimization, and practitioner knowledge behind it. MongoDB’s query language is expressive but more verbose, non-standardized, and less familiar to analysts and data scientists. Complex reporting queries, particularly those involving multiple joins across entities, are cumbersome in MongoDB and often require the aggregation pipeline, which has a steeper learning curve than equivalent SQL.

No Schema Enforcement by Default

Schema flexibility is a double-edged sword. Without validation rules, a collection can accumulate structurally inconsistent documents over time. This is sometimes called “schema drift”; it makes downstream querying, analytics, and maintenance significantly harder. Teams must enforce structure themselves, either through application-level validation or MongoDB’s optional JSON Schema validators, rather than relying on the database to enforce it.

Tooling and Ecosystem Maturity

The relational ecosystem – object-relational mappers (ORMs), BI connectors, ETL and data warehouse tools, query analyzers, and database administration tools – is substantially more mature than the MongoDB ecosystem. Organizations with established data warehousing, compliance, or reporting requirements will find more off-the-shelf tooling for relational databases and can leverage existing investments and skill sets.

Connecting to MongoDB from R

mongolite is the primary R package for interacting with MongoDB. It wraps the mongo-c-driver and provides a clean interface for CRUD (Create/Read/Update/Delete) operations.

After installation of the mongolite package, load the package with:

library(mongolite)

The central object in mongolite is the mongo connection object, created by calling mongo(). Its primary arguments are:

Argument   | Description
collection | The name of the MongoDB collection (analogous to a table in a relational database).
db         | The name of the database. MongoDB creates it automatically if it does not exist.
url        | The MongoDB connection string. The default is "mongodb://localhost" (port 27017).

Below is the full code to connect and verify that the connection is live by requesting a count of the documents in the collection. In this case, given that we have an empty database, a return value of 0 with no error confirms a successful connection.

library(mongolite)

# connect to a collection named "carts" in a database named "shop"

con <- mongo(
  collection = "carts",
  db         = "shop",
  url        = "mongodb://localhost:27017"
)

# request the number of documents in the collection (0 if empty, error if unreachable)
con$count()

The call to mongo() does not open a persistent socket immediately – it configures the connection. The actual connection is established on the first operation: con$count() in the code above.

If the MongoDB server is not running, you will receive an error at that point.

How to start the MongoDB server depends on your operating system: see 72.610 – Installing MongoDB on Windows and MacOS for details.

Connecting to MongoDB from Python

We can connect to MongoDB from Python rather than R. A few notes on the equivalences between the two:

Package: mongolite in R maps to pymongo in Python. Install it with pip install pymongo if you have not already.

Connection structure: mongolite collapses the client, database, and collection into a single mongo() call. In pymongo these are three distinct steps – you create a MongoClient, select a database from it by name, then select a collection from that database by name. The resulting con object in Python is directly equivalent to the con object in R.

Document count: con$count() in R maps to con.count_documents({}) in Python. The empty dict {} is the query filter, meaning “match all documents,” which is the same behavior as calling count() with no arguments in mongolite. The older con.count() method exists in pymongo but has been deprecated and should not be used.

from pymongo import MongoClient

# Connect to a collection named "students" in a database named "university"
client = MongoClient("mongodb://localhost:27017")
db     = client["university"]
con    = db["students"]

# Returns the number of documents in the collection (0 if empty, error if unreachable)
print(con.count_documents({}))

Saving Data to MongoDB

Let’s first look at how we store a document in MongoDB.

As an example, we will work with a data frame representing graduate student records. Recall that MongoDB stores documents as JSON/BSON, and mongolite automatically serializes R data frames into the appropriate format.

# Create a sample data frame
students <- data.frame(
  student_id  = c("CS001", "CS002", "CS003", "CS004", "CS005"),
  name        = c("Alice Müller", "Bob Zhang", "Carla Rossi", "David Kim", "Eva Novak"),
  department  = c("Computer Science", "Computer Science", "Electrical Engineering",
                  "Computer Science", "Data Science"),
  gpa         = c(3.9, 3.5, 3.7, 3.2, 3.8),
  enrolled    = c(TRUE, TRUE, FALSE, TRUE, TRUE),
  year        = c(2L, 3L, 1L, 4L, 2L),
  stringsAsFactors = FALSE
)

Inserting Data

Use the $insert() method to write one or more documents. mongolite accepts a data frame and inserts each row as a separate document.

# Insert all rows at once
con$insert(students)

# Confirm the count
cat("Documents inserted:", con$count(), "\n")

You should see:

Documents inserted: 5

Inserting a single document can be done by passing a one-row data frame or by using jsonlite::toJSON() directly, but passing a data frame is the clearest approach:

new.student <- data.frame(
  student_id = "CS006",
  name       = "Frank Osei",
  department = "Computer Science",
  gpa        = 3.6,
  enrolled   = TRUE,
  year       = 1L,
  stringsAsFactors = FALSE
)

con$insert(new.student)
cat("Documents after new insert:", con$count(), "\n")

Retrieving Data from MongoDB

Retrieving All Documents

The $find() method returns documents as an R data frame. With no arguments, it returns all documents.

all.students <- con$find()
print(all.students)

Filtering with a Query

MongoDB queries are expressed as JSON strings rather than SQL. The query argument of $find() accepts a JSON character string expressing query predicates on document keys.

Retrieve all enrolled students:

enrolled.students <- con$find(
  query = '{"enrolled": true}'
)
print(enrolled.students)

Retrieve students with a GPA greater than 3.6:

high.gpa <- con$find(
  query = '{"gpa": {"$gt": 3.6}}'
)
print(high.gpa)

Retrieve students from a specific department:

cs.students <- con$find(
  query = '{"department": "Computer Science"}'
)
print(cs.students)

Selecting Specific Fields (Projection)

To return only certain fields, use the fields argument. A value of 1 includes a field; 0 excludes it. MongoDB always returns the internal _id field unless you explicitly suppress it.

# Return only name and gpa, suppress the internal _id field
name.gpa <- con$find(
  query  = "{}",
  fields = '{"name": 1, "gpa": 1, "_id": 0}'
)
print(name.gpa)

Sorting Results

The sort argument controls the order of returned documents. Use 1 for ascending and -1 for descending.

# Sort by GPA descending
sorted.by.gpa <- con$find(
  query = "{}",
  sort  = '{"gpa": -1}'
)
print(sorted.by.gpa)

Limiting the Number of Results

# Return only the top 3 documents by GPA
top.three <- con$find(
  query = "{}",
  sort  = '{"gpa": -1}',
  limit = 3L
)
print(top.three)

Closing the Connection

Always close the connection when you are done to release resources:

con$disconnect()

Common Mistakes and Troubleshooting

Mistakes When Connecting

Mistake | Symptom | Resolution
MongoDB service is not running | Error: Failed to connect to localhost:27017 | Start the service (Start-Service MongoDB on Windows; brew services start mongodb-community on macOS).
Wrong port in the connection string | Connection refused | The default port is 27017. Verify with mongosh --port 27017. If you changed it, update url accordingly.
mongolite not installed or loaded | could not find function "mongo" | Run install.packages("mongolite") and library(mongolite).
Firewall or antivirus blocking port | Connection hangs indefinitely | Temporarily disable antivirus to test; then add an exception for port 27017.
Using an Atlas (cloud) connection string without TLS parameters | SSL handshake failed | Atlas requires TLS. Use the full connection string Atlas provides, which includes tls=true.
Typo in database or collection name | A new empty database/collection is silently created | MongoDB does not raise an error for non-existent names; it creates them. Run con$count() after connecting to check for unexpectedly empty collections.
Connecting inside a loop without disconnecting | R session memory grows; eventual connection pool exhaustion | Create the connection object once outside the loop and reuse it, or explicitly call con$disconnect() at the end.

Mistakes When Saving Data

Mistake | Symptom | Resolution
Passing a list instead of a data frame | Documents inserted with unexpected structure | mongolite’s $insert() expects a data frame or a JSON string. Convert lists with as.data.frame() first.
Factor columns in the data frame | Factors are serialized as integers, not strings | Use stringsAsFactors = FALSE in data.frame(), or wrap factor columns with as.character() before inserting.
Inserting duplicate _id values | Write error: E11000 duplicate key error | MongoDB requires unique _id values. If you define your own ID field and name it _id, ensure all values are unique. Alternatively, let MongoDB auto-generate _id.
Inserting an empty data frame | No error, but nothing is written | Check nrow(df) > 0 before calling $insert().
Special characters in field names (e.g., . or $) | Query operators malfunction; unexpected errors | MongoDB field names must not start with $ or contain .. Rename columns before inserting: names(df) <- gsub("\\.", "_", names(df)).
Not dropping the collection between test runs | Cumulative duplicate documents build up | Call con$drop() at the start of development scripts to ensure a clean state.

Mistakes When Retrieving Data

Mistake | Symptom | Resolution
Malformed JSON in the query argument | Error: Invalid JSON | JSON strings in R must use double quotes inside. Use single quotes to wrap the JSON string: '{"field": "value"}'.
Using R comparison operators in the query string | Query returns no results or an error | MongoDB uses its own operators ($gt, $lt, $gte, $lte, $ne, $in). Do not use >, <, or == inside the JSON query string.
Forgetting that $find() returns a data frame | Treating the result as a list and using list indexing | Access columns with standard data frame syntax: result$column_name or result[, "column_name"].
Querying a Boolean field with 0/1 instead of true/false | Query returns 0 results | In MongoDB JSON, Boolean values are true and false (lowercase), not 1/0 or "TRUE". Use '{"enrolled": true}'.
Projection includes a field that does not exist | No error; the field is silently absent from results | Check field names with names(con$find(limit = 1L)) to see what fields are actually stored.
Large result sets causing memory issues | R session freezes or crashes | Use limit and skip for pagination, or use $iterate() for cursor-based streaming of large collections.
Not accounting for the _id field in the returned data frame | Unexpected _id column in downstream processing | Suppress it in the projection: fields = '{"_id": 0}'.

MongoDB’s Query Language vs. SQL

MongoDB does not use SQL. Its query language is expressed as structured JSON objects passed to driver methods (or to the shell). The conceptual operations are equivalent, but the syntax is fundamentally different.

Selecting Documents (SELECT / WHERE)

SQL:

SELECT * FROM students WHERE department = 'Computer Science';

MongoDB:

db.students.find({ "department": "Computer Science" })

The JSON object passed to find() is the query filter. Fields in the filter are implicitly ANDed together.

Note that we can use Relational Algebra to express queries independently of the actual query language that will eventually be used.

Relational Algebra:

\(\sigma_{department = 'Computer Science'} (students)\)

Comparison Operators

SQL uses standard comparison symbols (>, <, >=, !=). MongoDB uses named operators prefixed with $ inside the filter object.

SQL Operator | MongoDB Operator
=            | (implicit, just the value)
!=           | $ne
>            | $gt
>=           | $gte
<            | $lt
<=           | $lte
IN (...)     | $in
NOT IN (...) | $nin

SQL:

SELECT * FROM students WHERE gpa >= 3.7 AND enrolled = TRUE;

MongoDB:

db.students.find({ "gpa": { "$gte": 3.7 }, "enrolled": true })

Logical Operators

SQL:

SELECT * FROM students WHERE department = 'Computer Science' OR gpa > 3.8;

MongoDB:

db.students.find({
  "$or": [
    { "department": "Computer Science" },
    { "gpa": { "$gt": 3.8 } }
  ]
})

MongoDB’s logical operators ($or, $and, $not, $nor) take an array of condition objects.

Projection (SELECT specific columns)

SQL:

SELECT name, gpa FROM students;

MongoDB:

db.students.find({}, { "name": 1, "gpa": 1, "_id": 0 })

The second argument to find() is the projection document. 1 includes a field; 0 excludes it. Mixing inclusion and exclusion is not allowed except for _id.

Relational Algebra:

\(\pi_{name,\, gpa} (students)\)

Sorting and Limiting

SQL:

SELECT * FROM students ORDER BY gpa DESC LIMIT 3;

MongoDB:

db.students.find({}).sort({ "gpa": -1 }).limit(3)

In MongoDB, 1 is ascending and -1 is descending – somewhat cryptic compared to SQL.

Inserting Data

SQL:

INSERT INTO students (student_id, name, gpa) VALUES ('CS007', 'Greta Han', 3.85);

MongoDB:

db.students.insertOne({ "student_id": "CS007", "name": "Greta Han", "gpa": 3.85 })

Updating Data

SQL:

UPDATE students SET gpa = 4.0 WHERE student_id = 'CS001';

MongoDB:

db.students.updateOne(
  { "student_id": "CS001" },
  { "$set": { "gpa": 4.0 } }
)

The first argument is the filter (which documents to update). The second argument is the update document. $set changes only the specified fields and leaves all other fields untouched. If you instead supply a plain replacement document (with replaceOne(), or the legacy update() without operators), MongoDB replaces the entire document with only the fields you supplied, a common source of data loss for newcomers.
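The difference can be sketched with plain Python dictionaries (a simulation of the semantics, not driver code):

```python
# A stored document
doc = {"student_id": "CS001", "name": "Alice Müller", "gpa": 3.9}

# "$set"-style update: merge the new fields into the existing document
after_set = {**doc, "gpa": 4.0}

# Replacement-style update: the document becomes ONLY the supplied fields
after_replace = {"gpa": 4.0}

print(after_set)      # student_id and name survive
print(after_replace)  # student_id and name are gone
```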

Deleting Data

SQL:

DELETE FROM students WHERE student_id = 'CS003';

MongoDB:

db.students.deleteOne({ "student_id": "CS003" })

Aggregation (GROUP BY)

The aggregation pipeline is MongoDB’s equivalent of SQL’s GROUP BY, HAVING, JOIN, and window functions combined. Data flows through a sequence of transformation stages.

SQL:

SELECT department, COUNT(*) AS total, AVG(gpa) AS avg_gpa
FROM students
GROUP BY department
ORDER BY avg_gpa DESC;

MongoDB Aggregation Pipeline:

db.students.aggregate([
  { "$group": {
      "_id": "$department",
      "total":   { "$sum": 1 },
      "avg_gpa": { "$avg": "$gpa" }
  }},
  { "$sort": { "avg_gpa": -1 } }
])

Each element of the array is a stage. Stages are applied in order: $group collapses documents by department and computes aggregates; $sort then reorders the resulting group summaries.
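What the two stages compute can be mirrored in a few lines of plain Python (a simulation over in-memory dicts, not the pipeline itself):

```python
from collections import defaultdict

# A miniature version of the students collection
students = [
    {"department": "Computer Science", "gpa": 3.9},
    {"department": "Computer Science", "gpa": 3.5},
    {"department": "Data Science",     "gpa": 3.8},
]

# $group: collapse documents by department and compute aggregates
groups = defaultdict(list)
for s in students:
    groups[s["department"]].append(s["gpa"])

summary = [
    {"_id": dept, "total": len(gpas), "avg_gpa": sum(gpas) / len(gpas)}
    for dept, gpas in groups.items()
]

# $sort: reorder the group summaries by avg_gpa descending
summary.sort(key=lambda d: d["avg_gpa"], reverse=True)
print(summary)
```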

Joining Collections ($lookup)

MongoDB does not have a native JOIN keyword. The aggregation pipeline stage $lookup performs a left outer join between two collections.

SQL:

SELECT s.name, s.gpa, a.advisor_name
FROM students s
JOIN advisors a ON s.advisor_id = a._id;

MongoDB:

db.students.aggregate([
  { "$lookup": {
      "from":         "advisors",
      "localField":   "advisor_id",
      "foreignField": "_id",
      "as":           "advisor_info"
  }}
])

The joined documents are embedded as an array (advisor_info) on each student document. This is more verbose than SQL joins, and chaining multiple $lookup stages quickly becomes unwieldy — one of the strongest practical arguments for choosing a relational database when your data model is inherently relational.

Summary of Conceptual Differences

Concept            | SQL                      | MongoDB
Query language     | Declarative text (SQL)   | JSON objects passed to methods
Standardization    | ISO standard (ANSI SQL)  | Proprietary, driver-specific
Filter conditions  | WHERE clause             | First argument to find() / updateOne() etc.
Field selection    | SELECT col1, col2        | Projection document (second argument to find())
Aggregation        | GROUP BY + HAVING        | $group stage in aggregation pipeline
Joining            | JOIN keyword             | $lookup aggregation stage
Schema enforcement | Enforced by CREATE TABLE | Optional JSON Schema validator
Null semantics     | NULL with IS NULL        | Field absence or explicit null value; $exists operator
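The null-semantics row deserves a concrete illustration. In MongoDB, the filter {"field": null} matches both documents where the field is explicitly null and documents where the field is absent, while {"field": {"$exists": false}} matches only the latter. The distinction, mirrored in plain Python (the email field is illustrative):

```python
docs = [
    {"name": "Alice", "email": "a@u.edu"},
    {"name": "Bob",   "email": None},  # explicit null
    {"name": "Carla"},                 # field absent entirely
]

# What MongoDB's {"email": null} would match: explicit null OR missing field
matches_null = [d["name"] for d in docs if d.get("email") is None]

# What {"email": {"$exists": false}} would match: missing field only
missing_field = [d["name"] for d in docs if "email" not in d]

print(matches_null)   # ['Bob', 'Carla']
print(missing_field)  # ['Carla']
```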

Conclusion

MongoDB offers a flexible, schema-less data model that is well suited to complex, nested, and heterogeneous data structures. This lesson explained how to connect to a MongoDB database from R, insert documents, and use a JSON-based query language to retrieve data.


Files & Downloads

All Files for Lesson 72.611

Acknowledgements

We acknowledge the use of Claude 4.6 (Sonnet) during the preparation of this lesson.

Errata

Let us know.