Learning Outcomes

Upon completion of this tutorial, you will be able to:

  • Build flexible models in Excel and Google Sheet

  • Load data from files and the web

  • Analyze data in tables and pivot tables

  • Shape data through text processing

  • Handle errors in formulas

  • Use lookup tables

Introduction

This lesson provides a primer on the use of spreadsheets that follow the standards set by Microsoft Excel. At its core, spreadsheets are a type of computer application that allows one to organize, analyze, and store data in tabular format. A spreadsheet is a computerized version of a paper accounting worksheet that was commonly used in accounting prior to the creation of electronic spreadsheets; the first of which was VisiCalc which was followed by Lotus 1-2-3 and then Microsoft Excel, Apple Pages, Google Sheet, and Zoho Sheet, among others.

A spreadsheet is an arrangement of data into tables consisting of columns and rows. Numeric, date, currency, and other data as well as text can be stored in each cell. Cells can also contain formulas which automatically calculate and display a value based on other cells’ contents. Any stored value can be adjusted in a spreadsheet and the effect observed on the calculated values can be observed. As a result, the spreadsheet can be utilized for “what-if” analysis as many cases can be quickly investigated without the need for manual recalculation. Spreadsheets are a type of document containing structured information data objects.

Structured information objects have properties in the form of attributes. For example, instances of a Course have attributes such as title, length, credits, among others, depending on the use cases. Such information must often be externalized and stored in some structure where they can be shared and queried. A tabular organization is a common format for the externalization of structured information objects.

Today, spreadsheets are generally referred to as worksheets or, simply, sheets. Worksheets are organized into workbooks. Multiple sheets can be interconnected to represent relationships, and data can be displayed as text, numbers, or graphics. Interconnected worksheets are a form of relational data and a workbook can be looked at as a relational data store – in fact, some spreadsheet applications, such as Excel, allow SQL queries on workbooks.

Spreadsheet programs such as Excel are a popular end-user development tool (EUD). EUD refers to actions or methods used by non-professional developers to build computational data objects and automated behavior without requiring a deep understanding of programming or programming languages. In many ways using spreadsheets to express computational structure is more convenient and quicker than creating a typical program that performs the same function, although languages such as R often allow rapid development of sophisticated data-driven code. Such as end-user developed spreadsheet if sometimes called a spreadsheet program or a spreadsheet model.

Tabular Structures

Information objects are stored in tabular form in:

  • CSV text data files
  • Excel (and other spreadsheet program) worksheets
  • Data frames in R and Python
  • Relational database tables
  • Two-Level XML documents

All tabular structures have the same format: rows with columns. Each row is an instance of an information object, while each column is an attribute value. Rows are generally numbered while columns are named.

CSV

The example below shows data in a CSV. Note the separation of columns by commas.

Product,CerealName,Manufacturer,Calories,Sodium,Fiber,Carbs,Sugars,Shelf,Year,,
1,100% Bran,Nabisco,70,130,10,5,6,3,1942,,
2,All-Bran,Kellogg,70,260,9,7,5,3,1916,,
3,All-Bran w/Extra Fiber,Kellogg,50,140,14,8,0,3,1916,,

A cell or a range of cells cannot be directly specified until the CSV is loaded into a program for processing.

Excel

The image below shows an example of a tabular structure in Excel. Columns are letters (A..Z, AA-ZZ, etc.) and rows are numbered; each intersection of a column and a row is a cell. In addition to data values, cells can also contain formulas which are algebraic expressions referencing cells or ranges of cells. A range of cells is similar to a vector, array, or a matrix in R.

Sample Tabular Structure in an Excel Worksheet
Sample Tabular Structure in an Excel Worksheet

A cell in Excel is referenced as a column/row pair, e.g., D4 references the Calories attribute of Product instance 3 and has the value 50. A range of cells can be specified by a starting and ending cell, e.g., D2:D8 is the vector of values \(\{70,70,50,110,110,110,130\}\).

Data Frames

Data frames are in-memory tabular structures common in several programming languages, including R and Python. While they can be programmatically generated, they are typically the result of loading a CSV file, Excel worksheet, or SQL query result into a data frame.

For more information on working with data frames in R, see Lesson 6.103 Working with Vectors and Data Frames in R and for more information on reading files into data frames, see Lesson 6.106 Import Data into R from CSV, TSV, and Excel Files

XML

In some situations, XML documents where elements directly underneath the root represent rows and the child elements under the rows represents columns are used as an alternative to CSV. An example is shown below where each element is a row and the columns are , , and .

<?xml version="1.0"?>

<document>
  <row>
    <Girth>8.3</Girth>
    <Height>70</Height>
    <Volume>10.3</Volume>
  </row>
  <row>
    <Girth>8.6</Girth>
    <Height>65</Height>
    <Volume>10.3</Volume>
  </row>
  ...
</document>

Most commonly, such XML files are converted to a CSV or read into a programming language as a table. For example, in R, the function xmlToDataFrame() is often used to read such a specially-formatted XML into a data frame.

For more information on using xmlToDataFrame(), see Lesson 6.323 Load Simple XML into Dataframe in R using xmlToDataFrame.

Data Analytics Pipeline

Data analysis follows a process and must be methodical. For that purpose, industry standard data project management methodologies, such as CRISP-DM, have been developed.

Data Analytics Pipeline
Data Analytics Pipeline

Essentials of Excel and Sheet

Objectives

  • Distinguish between values, types, and format of a cell
  • Apply conditional formatting

Tutorial

Watch the tutorial below to get an overview of Excel and Google Sheet.

Slide Deck: Information Processing with Excel and Google Sheet: A Primer

References, Ranges, and Simple Formulas

Objectives

  • Create references
  • Build simple formulas
  • Use anchoring of cell references
  • Define named ranges

Tutorial

Watch the tutorial below for details, examples, and demonstrations on the concepts of this section.

Slide Deck: Information Processing with Excel and Google Sheet: A Primer

Text Processing

Some attributes of data objects are textual and thos attributes do not often come in a ready to process form. Text processing helps shape text attributes into analyzable formats.

Furthermore, text data is not the same as numeric data, and text must be often be converted to numeric format before calculations can be performed on those values. We will see a number of different text processing and text parsing function, plus the IF function for cases where text patterns have exceptions.

Tutorial

Watch the tutorial below for details, examples, and demonstrations on the concepts of this section.

Slide Deck: Information Processing with Excel and Google Sheet: A Primer

Worked Example

Load the workbook textprocessing.xslx into Excel or Google Sheet and then answer the following questions:

  1. In the worksheet “Exercise” are three rows of text data. Extract the each member’s name and the fee and place them into the two marked columns.
  2. Note that it is possible that some text data fields do not have a member name. How would you deal with that scenario?
  3. Convert the fees to numbers and then calculate the sum of the fees.
  4. Format the fees column in Accounting Format.

The tutorial below go through a solution approach. Of course, only watch it after you have diligently and honestly tried to work out the solution for yourself.

Table Lookup

Watch the tutorial below for details, examples, and demonstrations on the concepts of this section.

Slide Deck: Information Processing with Excel and Google Sheet: A Primer

Data Validation

Watch the tutorial below for details, examples, and demonstrations on the concepts of this section.

Slide Deck: Information Processing with Excel and Google Sheet: A Primer

Excel vs R

Both Excel and R are excellent tools for analyzing tabular data, but R has the distinct benefit, especially with the use of R Notebooks, if making the analysis reproducible. Reproducability is critical for analytics, data mining, and predictive modeling as it allows the process to be inspected and audited – one knows precisely how the result was produced. This is not easy to do in Excel.

In fact, there have been numerous cases where a bad formula or an incorrect cell reference led to incorrect conclusions. One such instance occurred in 2013 when UMass-Amherst graduate student Thomas Herndon uncovered major errors in the spreadsheet used by the economists Reinhart and Rogoff in their influential 2010 paper Growth in a Time of Debt. Their work was used to drive major economic policy which led to the implementation of the 2010-2013 European austerity programs. Unfortunately, the policy decision were based on incorrect conclusions from a flawed Excel model – an outcome that likely could have been avoided if reproducible analysis within an R Notebook (or a similar language) had been used.

Excel is not very good at summarizing data, grouping data, or finding multiple matching values. For that, R and SQL are much better, although some of those types of “queries” can be done in Excel using Pivot Tables or by writing VBA macros.

Nevertheless, Excel has one significant benefit: what-if analysis. Cells values can be changed interactively to explore outcomes of models interactively. This can only be done if we build R Dashboards using packages such as shiny. So, there are still good use cases for spreadsheet programs, but the arsenal of the data analyst must also include R (and similar languages).

Summary

This lesson provided a primer on working with tabular information objects in Excel and Google Sheet.


Files & Resources

All Files for Lesson 56.800

References

None.

Errata

Let us know.

---
title: "Primer on Excel and Google Sheet"
params:
  category: 56
  number: 800
  time: "120 min"
  level: beginner
  tags: excel,google sheet,spreadsheet,tabular data,tables,charts
  description: "Spreadsheets are a common tabular exchange format for structured 
                information objects. This lesson provides a primer on using
                Microsoft Excel and Google Sheet to manipulate information
                objects represented in tabular spreadsheets."
date: "<small>`r Sys.Date()`</small>"
author: "<small>Martin Schedlbauer</small>"
email: "m.schedlbauer@neu.edu"
affilitation: "Northeastern University"
output: 
  bookdown::html_document2:
    toc: true
    toc_float: true
    collapsed: false
    number_sections: false
    code_download: true
    theme: spacelab
    highlight: tango
editor_options: 
  markdown: 
    wrap: 72
---

---
title: "<small>`r params$category`.`r params$number`</small><br/><span style='color: #2E4053; font-size: 0.9em'>`r rmarkdown::metadata$title`</span>"
---

```{r code=xfun::read_utf8(paste0(here::here(),'/R/_insert2DB.R')), include = FALSE}
```

## Learning Outcomes

Upon completion of this tutorial, you will be able to:

-   Build flexible models in Excel and Google Sheet

-   Load data from files and the web

-   Analyze data in tables and pivot tables

-   Shape data through text processing

-   Handle errors in formulas

-   Use lookup tables

## Introduction

This lesson provides a primer on the use of spreadsheets that follow the
standards set by Microsoft Excel. At its core, spreadsheets are a type
of computer application that allows one to organize, analyze, and store
data in tabular format. A spreadsheet is a computerized version of a
paper accounting worksheet that was commonly used in accounting prior to
the creation of electronic spreadsheets; the first of which was VisiCalc
which was followed by Lotus 1-2-3 and then Microsoft Excel, Apple Pages,
Google Sheet, and Zoho Sheet, among others.

A spreadsheet is an arrangement of data into tables consisting of
columns and rows. Numeric, date, currency, and other data as well as
text can be stored in each cell. Cells can also contain formulas which
automatically calculate and display a value based on other cells'
contents. Any stored value can be adjusted in a spreadsheet and the
effect observed on the calculated values can be observed. As a result,
the spreadsheet can be utilized for "what-if" analysis as many cases can
be quickly investigated without the need for manual recalculation.
Spreadsheets are a type of document containing structured information
data objects.

Structured information objects have properties in the form of
attributes. For example, instances of a *Course* have attributes such as
*title*, *length*, *credits*, among others, depending on the use cases.
Such information must often be externalized and stored in some structure
where they can be shared and queried. A tabular organization is a common
format for the externalization of structured information objects.

Today, spreadsheets are generally referred to as *worksheets* or,
simply, *sheets*. Worksheets are organized into *workbooks*. Multiple
sheets can be interconnected to represent relationships, and data can be
displayed as text, numbers, or graphics. Interconnected worksheets are a
form of relational data and a workbook can be looked at as a *relational
data store* -- in fact, some spreadsheet applications, such as Excel,
allow SQL queries on workbooks.

Spreadsheet programs such as Excel are a popular *end-user development
tool (EUD)*. EUD refers to actions or methods used by non-professional
developers to build computational data objects and automated behavior
without requiring a deep understanding of programming or programming
languages. In many ways using spreadsheets to express computational
structure is more convenient and quicker than creating a typical program
that performs the same function, although languages such as R often
allow rapid development of sophisticated data-driven code. Such as
end-user developed spreadsheet if sometimes called a *spreadsheet
program* or a *spreadsheet model*.

## Tabular Structures

Information objects are stored in tabular form in:

-   CSV text data files
-   Excel (and other spreadsheet program) worksheets
-   Data frames in R and Python
-   Relational database tables
-   Two-Level XML documents

All tabular structures have the same format: rows with columns. Each row
is an instance of an information object, while each column is an
attribute value. Rows are generally numbered while columns are named.

### CSV

The example below shows data in a CSV. Note the separation of columns by
commas.

```         
Product,CerealName,Manufacturer,Calories,Sodium,Fiber,Carbs,Sugars,Shelf,Year,,
1,100% Bran,Nabisco,70,130,10,5,6,3,1942,,
2,All-Bran,Kellogg,70,260,9,7,5,3,1916,,
3,All-Bran w/Extra Fiber,Kellogg,50,140,14,8,0,3,1916,,
```

A cell or a range of cells cannot be directly specified until the CSV is
loaded into a program for processing.

### Excel

The image below shows an example of a tabular structure in Excel.
Columns are letters (A..Z, AA-ZZ, etc.) and rows are numbered; each
intersection of a column and a row is a *cell*. In addition to data
values, cells can also contain formulas which are algebraic expressions
referencing cells or ranges of cells. A range of cells is similar to a
vector, array, or a matrix in R.

![Sample Tabular Structure in an Excel
Worksheet](images/SampleExcel.jpg){width="50%"}

A cell in Excel is referenced as a column/row pair, *e.g.*, *D4*
references the *Calories* attribute of *Product* instance 3 and has the
value *50*. A range of cells can be specified by a starting and ending
cell, *e.g.*, *D2:D8* is the vector of values
$\{70,70,50,110,110,110,130\}$.

### Data Frames

Data frames are in-memory tabular structures common in several
programming languages, including R and Python. While they can be
programmatically generated, they are typically the result of loading a
CSV file, Excel worksheet, or SQL query result into a data frame.

For more information on working with data frames in R, see [Lesson 6.103
Working with Vectors and Data Frames in
R](http://artificium.us/lessons/06.r/l-6-103-vecs-and-dfs/l-6-103.html)
and for more information on reading files into data frames, see [Lesson
6.106 Import Data into R from CSV, TSV, and Excel
Files](http://artificium.us/lessons/06.r/l-6-106-load-csv-tsv-excel-files/l-6-106.html)

### XML

In some situations, XML documents where elements directly underneath the
root represent rows and the child elements under the rows represents
columns are used as an alternative to CSV. An example is shown below
where each <row> element is a row and the columns are <Girth>, <Height>,
and <Volume>.

``` xml
<?xml version="1.0"?>

<document>
  <row>
    <Girth>8.3</Girth>
    <Height>70</Height>
    <Volume>10.3</Volume>
  </row>
  <row>
    <Girth>8.6</Girth>
    <Height>65</Height>
    <Volume>10.3</Volume>
  </row>
  ...
</document>
```

Most commonly, such XML files are converted to a CSV or read into a
programming language as a table. For example, in R, the function
`xmlToDataFrame()` is often used to read such a specially-formatted XML
into a data frame.

For more information on using `xmlToDataFrame()`, see [Lesson 6.323 Load
Simple XML into Dataframe in R using
xmlToDataFrame](http://artificium.us/lessons/06.r/l-6-323-load-xml-xmlToDataFrame/l-6-323.html).

## Data Analytics Pipeline

Data analysis follows a process and must be methodical. For that
purpose, industry standard data project management methodologies, such
as
[CRISP-DM](https://en.wikipedia.org/wiki/Cross-industry_standard_process_for_data_mining),
have been developed.

![Data Analytics Pipeline](images/i-56-800-data-analytics-process.jpg)

## Essentials of Excel and Sheet

### Objectives

-   Distinguish between values, types, and format of a cell
-   Apply conditional formatting

### Tutorial

Watch the tutorial below to get an overview of Excel and Google Sheet.

<iframe src="https://northeastern.hosted.panopto.com/Panopto/Pages/Embed.aspx?id=f110c746-c9e8-41fa-a295-af8800e2ce8d&amp;autoplay=false&amp;offerviewer=true&amp;showtitle=false&amp;showbrand=false&amp;captions=false&amp;interactivity=all" height="270" width="480" style="border: 1px solid #464646;" data-external="1" />

**Slide Deck**: [Information Processing with Excel and Google Sheet: A
Primer](s-56-800-excel-primer.pptx)

## References, Ranges, and Simple Formulas

### Objectives

-   Create references
-   Build simple formulas
-   Use anchoring of cell references
-   Define named ranges

### Tutorial

Watch the tutorial below for details, examples, and demonstrations on
the concepts of this section.

<iframe src="https://northeastern.hosted.panopto.com/Panopto/Pages/Embed.aspx?id=76d48684-569e-45a2-8946-af880108e095&amp;autoplay=false&amp;offerviewer=true&amp;showtitle=false&amp;showbrand=false&amp;captions=false&amp;interactivity=all" height="270" width="480" style="border: 1px solid #464646;" data-external="1" />

**Slide Deck**: [Information Processing with Excel and Google Sheet: A
Primer](s-56-800-excel-primer.pptx)

## Text Processing

Some attributes of data objects are textual and thos attributes do not
often come in a ready to process form. Text processing helps shape text
attributes into analyzable formats.

Furthermore, text data is not the same as numeric data, and text must be
often be converted to numeric format before calculations can be
performed on those values. We will see a number of different text
processing and text parsing function, plus the `IF` function for cases
where text patterns have exceptions.

### Tutorial

Watch the tutorial below for details, examples, and demonstrations on
the concepts of this section.

<iframe src="https://player.vimeo.com/video/789448030?h=8b030c7ccf" width="480" height="270" frameborder="0" allow="autoplay; fullscreen; picture-in-picture" allowfullscreen data-external="1" />

**Slide Deck**: [Information Processing with Excel and Google Sheet: A
Primer](s-56-800-excel-primer.pptx)

### Worked Example

Load the workbook [textprocessing.xslx](data-files/textprocessing.xlsx)
into Excel or Google Sheet and then answer the following questions:

1.  In the worksheet "Exercise" are three rows of text data. Extract the
    each member's name and the fee and place them into the two marked
    columns.
2.  Note that it is possible that some text data fields do not have a
    member name. How would you deal with that scenario?
3.  Convert the fees to numbers and then calculate the sum of the fees.
4.  Format the fees column in *Accounting Format*.

The tutorial below go through a solution approach. Of course, only watch
it after you have diligently and honestly tried to work out the solution
for yourself.

```{=html}
<iframe width="480" height="270" src="https://www.youtube.com/embed/kWZkvg17XLI" title="Text Processing in Excel" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen data-external="1"></iframe>
```
## Table Lookup

Watch the tutorial below for details, examples, and demonstrations on
the concepts of this section.

<iframe src="https://player.vimeo.com/video/789707029?h=070ebe010b&amp;title=0&amp;byline=0&amp;portrait=0&amp;speed=0&amp;badge=0&amp;autopause=0&amp;player_id=0&amp;app_id=58479" width="480" height="270" frameborder="0" allow="autoplay; fullscreen; picture-in-picture" allowfullscreen title="Table Lookup with VLOOKUP" data-external="1" />

**Slide Deck**: [Information Processing with Excel and Google Sheet: A
Primer](s-56-800-excel-primer.pptx)

## Data Validation

Watch the tutorial below for details, examples, and demonstrations on
the concepts of this section.

<iframe src="https://player.vimeo.com/video/790832483?h=2c7d5b2d0b" width="480" height="338" frameborder="0" allow="autoplay; fullscreen; picture-in-picture" allowfullscreen data-external="1">

</iframe>

**Slide Deck**: [Information Processing with Excel and Google Sheet: A
Primer](s-56-800-excel-primer.pptx)

## Excel vs R

Both Excel and R are excellent tools for analyzing tabular data, but R
has the distinct benefit, especially with the use of R Notebooks, if
making the analysis *reproducible*. Reproducability is critical for
analytics, data mining, and predictive modeling as it allows the process
to be inspected and audited -- one knows precisely how the result was
produced. This is not easy to do in Excel.

In fact, there have been numerous cases where a bad formula or an
incorrect cell reference led to incorrect conclusions. One such instance
occurred in 2013 when UMass-Amherst graduate student Thomas Herndon
uncovered major errors in the spreadsheet used by the economists
Reinhart and Rogoff in their influential 2010 paper *Growth in a Time of
Debt*. Their work was used to drive major economic policy which led to
the implementation of the 2010-2013 European austerity programs.
Unfortunately, the policy decision were based on incorrect conclusions
from a flawed Excel model -- an outcome that likely could have been
avoided if reproducible analysis within an R Notebook (or a similar
language) had been used.

Excel is not very good at summarizing data, grouping data, or finding
multiple matching values. For that, R and SQL are much better, although
some of those types of "queries" can be done in Excel using Pivot Tables
or by writing VBA macros.

Nevertheless, Excel has one significant benefit: *what-if analysis*.
Cells values can be changed interactively to explore outcomes of models
interactively. This can only be done if we build R Dashboards using
packages such as **shiny**. So, there are still good use cases for
spreadsheet programs, but the arsenal of the data analyst must also
include R (and similar languages).

## Summary

This lesson provided a primer on working with tabular information
objects in Excel and Google Sheet.

------------------------------------------------------------------------

## Files & Resources

```{r zipFiles, echo=FALSE}
zipName = sprintf("LessonFiles-%s-%s.zip", 
                 params$category,
                 params$number)

textALink = paste0("All Files for Lesson ", 
               params$category,".",params$number)

# downloadFilesLink() is included from _insert2DB.R
knitr::raw_html(downloadFilesLink(".", zipName, textALink))
```

------------------------------------------------------------------------

## References

None.

## Errata

[Let us
know](https://form.jotform.com/212187072784157){target="_blank"}.

