---
title: "Navigating the File System in R"
params:
  category: 6
  number: 402
  time: 45
  level: beginner
  tags: "r,files,folders"
  description: "This lesson explains how to navigate the file system
                from R."
date: "<small>`r Sys.Date()`</small>"
author: "<small>Martin Schedlbauer</small>"
email: "m.schedlbauer@neu.edu"
affiliation: "Northeastern University"
output: 
  bookdown::html_document2:
    toc: true
    toc_float: true
    collapsed: false
    number_sections: false
    code_download: true
    theme: spacelab
    highlight: tango
---

---
title: "<small>`r params$category`.`r params$number`</small><br/><span style='color: #2E4053; font-size: 0.9em'>`r rmarkdown::metadata$title`</span>"
---

```{r code=xfun::read_utf8(paste0(here::here(),'/R/_insert2DB.R')), include = FALSE}
```

## Introduction

Like other programming languages, R needs to read data (and other information) from various types of files, including CSV, HTML, PDF, and XML. Files are located in a file system on a local or networked drive and are identified through a file path (hierarchical list of directories) and a file name. The path plus file name are unique and form a kind of key, which means that a file system is a type of hierarchical document database.

Files are located in directories (also called folders on some systems) and have a file name and an (optional) extension. The extension is separated from the file name by a dot (.), although most file systems also allow dots to be part of the file name, *e.g.*, *Report.2022.pdf* is a legal file name on most systems; the *.pdf* is the extension. The extension is often used to identify the type of information in the file and the file format, *e.g.*, *.pdf* is for a PDF file, while *.Rmd* is for a file containing an R Notebook. Extensions are a convention and can vary between systems. They are generally up to three letters long but do not have to be, *e.g.*, *.sqlite* is often used for files that contain a SQLite database.

To navigate the file system from within R, we need to learn how to:

-   access a file
-   know where we are in the file system
-   specify a path
-   list all files in a folder
-   move to a particular folder in the file system
-   create new folders in the file system
-   list and set permissions on files and folders
-   write contents to a file

## Folders, Paths, and File Names

The file system is a tree, and the top of that tree is called the "root". On Windows, the root is the drive letter followed by a colon and a backward slash, *e.g.*, *C:\\*. The backward slash is the escape character in R strings (and in most programming languages), so within programs we generally write either \\\\ or / instead, *e.g.*, *C:/*. The macOS file system is a Unix file system. On Unix, the root is identified by the symbolic name */*, a forward slash.

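File names on Unix are generally case sensitive, which means that the files *Report.2022.pdf* and *report.2022.pdf* are not the same file and can exist in the same folder. Windows preserves the case of a file name but does not distinguish names by case, so *Report.2022.pdf* and *report.2022.pdf* are the same file and cannot both be in the same folder. Note that macOS, although Unix-based, also uses a case-insensitive (but case-preserving) file system by default. To keep programs portable, it is best to assume case sensitivity and to spell file names exactly as they are stored.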

All file access is relative to a "current working directory", *i.e.*, the location in which the running program looks by default unless a full path starting with the root folder is specified. So, *customers.csv* refers to a file in the current working directory, while */users/alfred/data/customers.csv* refers to the file *customers.csv* in the folder *data*, which is a folder within the folder *alfred*, which, in turn, is within *users*, a folder directly off the root folder. The former is called a relative path, while the latter is an absolute path. Any path that starts with */* on Unix or with a drive letter such as *C:/* on Windows is absolute.

Of course, a Windows system can have additional drives and not only *C:*. For instance, a USB drive might be assigned *D:/*. Also, note that we are using */* for Windows paths, as that is commonly done from within programs. At the Windows command line you will need to continue using the backward slash, *e.g.*, *C:\\*.

There are two special folder names: *.* is the name of the current folder and *..* is the name of the parent folder directly above the current folder. So, a path of *"../../data/bars.txt"* refers to the file *bars.txt* in the folder *data*, which is located two levels above the current folder. In other words, think about directions: it is telling R to look up, look up again, then down into *data*.
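As a minimal sketch of how such paths can be assembled and resolved in R, the fragment below builds the relative path from the paragraph above with <code>file.path()</code> and resolves the parent folder with <code>normalizePath()</code>. Both are base R functions; the path itself is purely illustrative and does not need to exist.

```{r}
# build a relative path from its components; file.path() joins
# the components with a forward slash on all platforms
rel.path <- file.path("..", "..", "data", "bars.txt")
print(rel.path)

# resolve ".." (the parent of the current working directory)
# into an absolute path; winslash = "/" only matters on Windows
parent <- normalizePath("..", winslash = "/")
print(parent)
```

Building paths from components with <code>file.path()</code> avoids having to think about which slash to type.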

## Navigating the File System in R

### Getting Current Directory

The function <code>getwd()</code> returns the current working directory for R, *i.e.*, the folder in which R looks for files that are specified with a relative path name (one that does not start with the root folder, / on Unix or C:/ on Windows).

```{r}
cwd <- getwd()

print(cwd)
```

It is possible to reset the current working directory to a different folder using the function <code>setwd()</code> but this is generally discouraged as it makes programs less portable since they presume a certain folder structure.
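If a change of folder cannot be avoided, one cautious pattern (a sketch, not the only option) is to save the previous working directory and restore it afterwards. It relies on the fact that <code>setwd()</code> returns the previous working directory; R's temporary folder from <code>tempdir()</code> is used here only as a harmless target.

```{r}
# setwd() returns the previous working directory (invisibly),
# so it can be saved and restored later
old.wd <- setwd(tempdir())
print(getwd())

# restore the original working directory
setwd(old.wd)
print(getwd())
```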

### List All Files in a Folder

The code fragment below lists all files in the folder *data*, a subfolder of the current working directory. It returns a vector of file names.

```{r}
files <- list.files(path = "data")

print(files)
```

There are a number of useful parameters to <code>list.files()</code>, including:

| Parameter    | Meaning                                                | Example                |
|:-------------|:-------------------------------------------------------|:-----------------------|
| pattern      | a regular expression describing which files to include | pattern = "\\\\.cpp\$" |
| recursive    | whether to include files in subfolders                 | recursive = TRUE       |
| include.dirs | whether to include folder names in recursive listings  | include.dirs = TRUE    |
| full.names   | whether to list file name only or path + file name     | full.names = TRUE      |

```{r}
# pattern is a regular expression: match names that end in ".Rmd"
files <- list.files(path = ".", pattern = "\\.Rmd$",
                    include.dirs = TRUE, recursive = TRUE)

print(files)
```
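Because <code>pattern</code> is interpreted as a regular expression rather than as a shell wildcard, it can help to see both styles side by side. The sketch below creates a scratch folder with a few empty files (the folder and file names are made up for illustration) and selects the CSV files once with a regular expression and once with a wildcard converted by <code>glob2rx()</code> from the **utils** package.

```{r}
# create a scratch folder with a few empty files (hypothetical names)
scratch <- file.path(tempdir(), "pattern-demo")
dir.create(scratch, showWarnings = FALSE)
file.create(file.path(scratch, c("a.csv", "b.csv", "csv-notes.txt")))

# regular expression: names that end in ".csv"
csv.files <- list.files(path = scratch, pattern = "\\.csv$")
print(csv.files)

# glob2rx() converts a wildcard pattern into a regular expression
csv.files2 <- list.files(path = scratch, pattern = glob2rx("*.csv"))
print(csv.files2)
```

The file *csv-notes.txt* is not selected in either case because the expression is anchored to the end of the name.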

### List All Subfolders in a Folder

To list all of the folders in a folder (subfolders), use the function <code>list.dirs()</code> rather than <code>list.files()</code>. It accepts the *path*, *full.names*, and *recursive* parameters described above, but not *pattern* or *include.dirs*. By default, the function is recursive and returns each folder together with its path.

```{r}
dirs <- list.dirs(path = "../../03.ml", recursive = TRUE)

print(dirs)
```

### Interactive Folder and File Selection

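Assuming you are running an R program interactively rather than knitting an R Notebook, you can prompt the user to select a folder or a file using <code>choose.dir()</code> and <code>choose.files()</code>. Both functions are available on Windows only; on other platforms, <code>file.choose()</code> offers interactive selection of a single file.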

```{r eval=F}
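# choose.files() is Windows-only; filters is a two-column matrix
# of descriptions and wildcard patterns
data.File <- choose.files(caption = "Select Data File", multi = FALSE,
                          filters = matrix(c("CSV files", "*.csv"), ncol = 2))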
```

## Checking File and Folder Existence

We can check for the existence of a file or folder with the functions <code>dir.exists()</code> and <code>file.exists()</code>. The functions return *TRUE* if the file or folder exists and *FALSE* otherwise.

```{r}
df.folder <- "data"

if (!dir.exists(df.folder))
{
  dir.create(df.folder)
}
```

## Create New File

Most functions that write data to a file will automatically create a new file as needed, but it is possible to create a new (empty) file directly with <code>file.create()</code>; it is the equivalent of the function <code>dir.create()</code> for creating new folders.
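A minimal sketch of both approaches follows. The file name is hypothetical and the file is placed in R's temporary folder so that nothing in the project is touched.

```{r}
# a hypothetical file name inside R's temporary folder
new.file <- file.path(tempdir(), "notes.txt")

# create an empty file; file.create() returns TRUE on success
created <- file.create(new.file)
print(created)
print(file.size(new.file))

# writing functions such as writeLines() create (or overwrite) the file
writeLines(c("first line", "second line"), new.file)
print(readLines(new.file))
```

Be aware that <code>file.create()</code> truncates a file that already exists, so it is worth checking with <code>file.exists()</code> first.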

## Copy File

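To copy a file and its contents from one folder to another, use the function <code>file.copy()</code>. In the example below, the file *l-6-402.Rmd* in the current working directory is copied to the folder */users/tmp*. The trailing */* signals that the destination is meant to be a folder. If the destination is an existing folder, the file is copied into it under its original name; otherwise, the destination is interpreted as the new path and name of the file. The function returns *TRUE* on success and does not overwrite an existing file unless *overwrite = TRUE* is specified.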

```{r eval=F}
file.copy("./l-6-402.Rmd", "/users/tmp/")
```
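The difference between copying into a folder and copying to a new name can be tried safely in R's temporary folder. The sketch below uses made-up file and folder names; <code>overwrite = TRUE</code> is set only so that the fragment can be run repeatedly.

```{r}
# set up a source file and a destination folder in R's temporary folder
src <- file.path(tempdir(), "source.txt")
writeLines("some content", src)
dest.dir <- file.path(tempdir(), "copy-demo")
dir.create(dest.dir, showWarnings = FALSE)

# destination is an existing folder: the copy keeps its name
ok1 <- file.copy(src, dest.dir, overwrite = TRUE)

# destination is a file name: the copy is stored under the new name
ok2 <- file.copy(src, file.path(dest.dir, "renamed.txt"), overwrite = TRUE)

print(list.files(dest.dir))
```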

## File Information

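To get information about a file, such as its size, its permissions (mode), its ownership, and the times of its last modification and last access, use the function <code>file.info()</code>. The column *ctime* holds the time of the last status change on Unix and the time of creation on Windows.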

```{r}
file.info("l-6-402.Rmd")
```
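The result of <code>file.info()</code> is a data frame with one row per file, so individual attributes can be picked out by column name. The sketch below inspects a small scratch file (a hypothetical name in R's temporary folder) and also shows the base R shortcuts <code>file.size()</code> and <code>file.mtime()</code>.

```{r}
# create a small file to inspect (hypothetical name)
info.file <- file.path(tempdir(), "info-demo.txt")
writeLines("hello", info.file)

info <- file.info(info.file)

# file.info() returns a data frame with one row per file
print(info$size)    # size in bytes
print(info$isdir)   # FALSE for a regular file
print(info$mtime)   # time of last modification

# convenience functions for individual attributes
print(file.size(info.file))
print(file.mtime(info.file))
```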

## Remove a File

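To remove a file, use either the function <code>unlink()</code> or the function <code>file.remove()</code>. The function <code>unlink()</code> can also remove a folder and its contents when called with *recursive = TRUE*.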
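The two functions report success differently, which the following sketch illustrates with a scratch file and a scratch folder (hypothetical names in R's temporary folder).

```{r}
# create a file and a folder with content to delete (hypothetical names)
tmp.file <- file.path(tempdir(), "delete-me.txt")
tmp.dir <- file.path(tempdir(), "delete-me-too")
file.create(tmp.file)
dir.create(tmp.dir, showWarnings = FALSE)
file.create(file.path(tmp.dir, "inner.txt"))

# file.remove() returns TRUE for each file that was deleted
removed <- file.remove(tmp.file)
print(removed)

# unlink() returns 0 on success; recursive = TRUE is needed for folders
status <- unlink(tmp.dir, recursive = TRUE)
print(status)
```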

## Other Useful Functions

-   <code>basename()</code> returns the file name portion of a path
-   <code>dirname()</code> returns the folder portion of a path

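Several related functions come from other packages:

-   <code>file_ext()</code> from the package **tools** returns the extension of a file name
-   <code>file.choose()</code> from **base** lets the user select a file interactively
-   <code>is_file()</code> and <code>is_dir()</code> from the package **fs** test whether a path is a file or a folder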
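As a short illustration of the path helpers, the sketch below takes the example path used earlier in this lesson apart. The path is only a string here and does not need to exist; <code>file_path_sans_ext()</code> is an additional function from **tools** that is not in the list above.

```{r}
# a path used purely for illustration; it need not exist
p <- "/users/alfred/data/customers.csv"

print(basename(p))          # file name without the folders
print(dirname(p))           # folders without the file name
print(tools::file_ext(p))   # extension without the dot

# file name without folders and without extension
print(tools::file_path_sans_ext(basename(p)))
```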

## Example: Decompress *.gz* Files

The code below uses the <code>gunzip()</code> function to decompress all *.gz* (gzip-compressed) files in a folder. Note that the <code>gunzip()</code> function is from the **R.utils** package.

```{r unZipPubMedDataFiles, eval=F}

# required for the gunzip() function
library(R.utils)

# uncompress all files in a folder
folder <- "pubmed-xml" 

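# pattern is a regular expression: match names that end in ".gz"
gzfiles <- list.files(path = folder, pattern = "\\.gz$", full.names = TRUE)

# iterating over the vector directly is safe even if no files were found
for (gzfile in gzfiles) {
  gunzip(gzfile, remove = FALSE, skip = TRUE)
}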
```

------------------------------------------------------------------------

## Files & Resources

```{r zipFiles, echo=FALSE}
zipName = sprintf("LessonFiles-%s-%s.zip", 
                 params$category,
                 params$number)

textALink = paste0("All Files for Lesson ", 
               params$category,".",params$number)

# downloadFilesLink() is included from _insert2DB.R
knitr::raw_html(downloadFilesLink(".", zipName, textALink))
```

------------------------------------------------------------------------

## Errata

None collected yet. Let us know.

```{=html}
<script src="https://form.jotform.com/static/feedback2.js" type="text/javascript">
  new JotformFeedback({
    formId: "212187072784157",
    buttonText: "Feedback",
    base: "https://form.jotform.com/",
    background: "#F59202",
    fontColor: "#FFFFFF",
    buttonSide: "left",
    buttonAlign: "center",
    type: false,
    width: 700,
    height: 500,
    isCardForm: false
  });
</script>
```
```{r code=xfun::read_utf8(paste0(here::here(),'/R/_deployKnit.R')), include = FALSE}
```
