Upon completion of this tutorial, you will be able to:
Build flexible models in Excel and Google Sheet
Load data from files and the web
Analyze data in tables and pivot tables
Shape data through text processing
Handle errors in formulas
Use lookup tables
Introduction
This lesson provides a primer on the use of spreadsheets that follow the
standards set by Microsoft Excel. At its core, spreadsheets are a type
of computer application that allows one to organize, analyze, and store
data in tabular format. A spreadsheet is a computerized version of a
paper accounting worksheet that was commonly used in accounting prior to
the creation of electronic spreadsheets; the first of which was VisiCalc
which was followed by Lotus 1-2-3 and then Microsoft Excel, Apple Pages,
Google Sheet, and Zoho Sheet, among others.
A spreadsheet is an arrangement of data into tables consisting of
columns and rows. Numeric, date, currency, and other data as well as
text can be stored in each cell. Cells can also contain formulas which
automatically calculate and display a value based on other cells’
contents. Any stored value can be adjusted in a spreadsheet and the
effect observed on the calculated values can be observed. As a result,
the spreadsheet can be utilized for “what-if” analysis as many cases can
be quickly investigated without the need for manual recalculation.
Spreadsheets are a type of document containing structured information
data objects.
Structured information objects have properties in the form of
attributes. For example, instances of a Course have attributes such as
title, length, credits, among others, depending on the use cases.
Such information must often be externalized and stored in some structure
where they can be shared and queried. A tabular organization is a common
format for the externalization of structured information objects.
Today, spreadsheets are generally referred to as worksheets or,
simply, sheets. Worksheets are organized into workbooks. Multiple
sheets can be interconnected to represent relationships, and data can be
displayed as text, numbers, or graphics. Interconnected worksheets are a
form of relational data and a workbook can be looked at as a relational
data store – in fact, some spreadsheet applications, such as Excel,
allow SQL queries on workbooks.
Spreadsheet programs such as Excel are a popular end-user development
tool (EUD). EUD refers to actions or methods used by non-professional
developers to build computational data objects and automated behavior
without requiring a deep understanding of programming or programming
languages. In many ways using spreadsheets to express computational
structure is more convenient and quicker than creating a typical program
that performs the same function, although languages such as R often
allow rapid development of sophisticated data-driven code. Such as
end-user developed spreadsheet if sometimes called a spreadsheet
program or a spreadsheet model.
Tabular Structures
Information objects are stored in tabular form in:
CSV text data files
Excel (and other spreadsheet program) worksheets
Data frames in R and Python
Relational database tables
Two-Level XML documents
All tabular structures have the same format: rows with columns. Each row
is an instance of an information object, while each column is an
attribute value. Rows are generally numbered while columns are named.
CSV
The example below shows data in a CSV. Note the separation of columns by
commas.
A cell or a range of cells cannot be directly specified until the CSV is
loaded into a program for processing.
Excel
The image below shows an example of a tabular structure in Excel.
Columns are letters (A..Z, AA-ZZ, etc.) and rows are numbered; each
intersection of a column and a row is a cell. In addition to data
values, cells can also contain formulas which are algebraic expressions
referencing cells or ranges of cells. A range of cells is similar to a
vector, array, or a matrix in R.
Sample Tabular Structure in an Excel
Worksheet
A cell in Excel is referenced as a column/row pair, e.g., D4
references the Calories attribute of Product instance 3 and has the
value 50. A range of cells can be specified by a starting and ending
cell, e.g., D2:D8 is the vector of values
\(\{70,70,50,110,110,110,130\}\).
Data Frames
Data frames are in-memory tabular structures common in several
programming languages, including R and Python. While they can be
programmatically generated, they are typically the result of loading a
CSV file, Excel worksheet, or SQL query result into a data frame.
In some situations, XML documents where elements directly underneath the
root represent rows and the child elements under the rows represents
columns are used as an alternative to CSV. An example is shown below
where each element is a row and the columns are , ,
and .
Most commonly, such XML files are converted to a CSV or read into a
programming language as a table. For example, in R, the function
xmlToDataFrame() is often used to read such a specially-formatted XML
into a data frame.
Data analysis follows a process and must be methodical. For that
purpose, industry standard data project management methodologies, such
as
CRISP-DM,
have been developed.
Data Analytics Pipeline
Essentials of Excel and Sheet
Objectives
Distinguish between values, types, and format of a cell
Apply conditional formatting
Tutorial
Watch the tutorial below to get an overview of Excel and Google Sheet.
Some attributes of data objects are textual and thos attributes do not
often come in a ready to process form. Text processing helps shape text
attributes into analyzable formats.
Furthermore, text data is not the same as numeric data, and text must be
often be converted to numeric format before calculations can be
performed on those values. We will see a number of different text
processing and text parsing function, plus the IF function for cases
where text patterns have exceptions.
Tutorial
Watch the tutorial below for details, examples, and demonstrations on
the concepts of this section.
Load the workbook textprocessing.xslx
into Excel or Google Sheet and then answer the following questions:
In the worksheet “Exercise” are three rows of text data. Extract the
each member’s name and the fee and place them into the two marked
columns.
Note that it is possible that some text data fields do not have a
member name. How would you deal with that scenario?
Convert the fees to numbers and then calculate the sum of the fees.
Format the fees column in Accounting Format.
The tutorial below go through a solution approach. Of course, only watch
it after you have diligently and honestly tried to work out the solution
for yourself.
Table Lookup
Watch the tutorial below for details, examples, and demonstrations on
the concepts of this section.
Both Excel and R are excellent tools for analyzing tabular data, but R
has the distinct benefit, especially with the use of R Notebooks, if
making the analysis reproducible. Reproducability is critical for
analytics, data mining, and predictive modeling as it allows the process
to be inspected and audited – one knows precisely how the result was
produced. This is not easy to do in Excel.
In fact, there have been numerous cases where a bad formula or an
incorrect cell reference led to incorrect conclusions. One such instance
occurred in 2013 when UMass-Amherst graduate student Thomas Herndon
uncovered major errors in the spreadsheet used by the economists
Reinhart and Rogoff in their influential 2010 paper Growth in a Time of
Debt. Their work was used to drive major economic policy which led to
the implementation of the 2010-2013 European austerity programs.
Unfortunately, the policy decision were based on incorrect conclusions
from a flawed Excel model – an outcome that likely could have been
avoided if reproducible analysis within an R Notebook (or a similar
language) had been used.
Excel is not very good at summarizing data, grouping data, or finding
multiple matching values. For that, R and SQL are much better, although
some of those types of “queries” can be done in Excel using Pivot Tables
or by writing VBA macros.
Nevertheless, Excel has one significant benefit: what-if analysis.
Cells values can be changed interactively to explore outcomes of models
interactively. This can only be done if we build R Dashboards using
packages such as shiny. So, there are still good use cases for
spreadsheet programs, but the arsenal of the data analyst must also
include R (and similar languages).
Summary
This lesson provided a primer on working with tabular information
objects in Excel and Google Sheet.