Objectives

Upon completion of this lesson, you will be able to:

  • define “Big Data”
  • list key mechanisms of data analytics
  • distinguish data analytics and data science
  • describe the six V’s

Data-Driven Organizations

Data drives decision-making in most organizations, e.g., where to locate a new franchise, which customers to target in marketing, where bottlenecks exist in a process, how customers feel about a product, and so forth. Because of that, data has become an important asset and source of competitive advantage. Data analytics is the systematic process of examining, cleansing, transforming, and modeling data to extract useful information, inform conclusions, and support decision-making. It encompasses a wide range of techniques and methodologies aimed at understanding data patterns, future trends, and relationships.

At its core, data analytics seeks to convert raw data into actionable insights, providing a basis for strategic planning and data-driven decision making. The role of a data professional, be it a data analyst or a data scientist, is to turn data into actionable information.

Big Data

The term “big data” refers to extremely large datasets that cannot be easily managed, processed, or analyzed using traditional data processing tools. Big data is characterized by its scale and complexity, and its analysis requires specialized techniques and technologies, such as distributed computing frameworks and advanced database systems. Big data is a critical resource in modern analytics, enabling organizations to uncover deeper insights by analyzing diverse and voluminous datasets in real-time or near-real-time.

In the video below, Dr. Schedlbauer summarizes his perspective on “big data” and the key characteristics that shape any discussion around “big data”.

In this second video, Dr. Schedlbauer provides an overview of the landscape of the “data universe”. We invite you to watch it before continuing as it provides an important summary and sets context.

Data comes in both “structured” and “unstructured” forms, and both are necessary for the data-driven organization. Naturally, structured data is more easily analyzed and useful for computational processing – it is a prerequisite for data analytics and machine learning. However, unstructured data, such as emails or social media posts, is also invaluable but more difficult to process. Large Language Models (LLMs) such as ChatGPT have made querying unstructured data more accessible, but extracting useful information is still difficult. One such example is extracting sentiment about a product from social media posts. The video below by Dr. Schedlbauer provides a comparison of “structured” and “unstructured” data.

Key Concepts

There are a lot of buzzwords floating around in the world of data science and the different terms can be very confusing. So, let’s try to make some sense of where these different terms fit in, what they mean, and what some of the key differences between them are.

Dan Ariely, a Duke Economics Professor, once said about big data: “Everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it.” There is a lot of truth to that statement. And this shroud of mystery is also apparent in the terminology of the data domain. A great many people toss around terms like “data science,” “data analysis,” “big data,” and “data mining,” and even the experts have trouble defining them. Add to that terms like “machine learning”, “predictive modeling”, and “data engineering” and you’ve got an even bigger puddle of confusion. If that’s not enough, statisticians will claim that all of this is really just statistical analysis anyway. And, finally, you have all of the data folks yelling “NoSQL”, “data warehouses”, “data marts”, and “data lakes”. That will make anyone’s head spin.

So, let’s try to define at least a few of the terms. For now, let’s focus on one of the more important distinctions as it relates to your data career: the often-muddled differences between data analytics and data science. Before reading the remainder of this section, you may wish to watch Dr. Schedlbauer’s perspective in the video below:

Data Analytics

Data analytics work is done by a data analyst. The responsibilities of data analysts can vary across industries and companies, but fundamentally, data analysts use data to draw meaningful insights and solve problems. They analyze well-defined sets of data using an arsenal of different tools to answer tangible business questions, such as “why did sales drop in a certain quarter”, “why did a marketing campaign fare better in certain regions”, “how does internal attrition affect revenue”, “what is the trend in customer growth”, “is there a difference in average sales between the North and South regions”, “what are sales by product and by sales executive”, and so forth. The questions are pointed and directed.

Data analysts carry a range of job titles, including (but not limited to) database analyst, business analyst, business data analyst, market research analyst, sales analyst, financial analyst, marketing analyst, advertising analyst, customer success analyst, operations analyst, pricing analyst, or international strategy analyst. The best data analysts have both technical expertise and domain knowledge, in addition to the ability to communicate quantitative findings to non-technical colleagues or clients.

Data analysts are often responsible for designing and maintaining data systems and databases, using statistical tools to interpret data sets, building custom database queries, and preparing reports with visualizations that effectively communicate trends, patterns, and predictions based on relevant findings.

Data analysis requires a mix of skills including knowledge of simple inferential statistics, trend forecasting, working with time-series data, and programming in scripting languages such as R and Python. In addition, strong knowledge of various products and platforms is often required, including Excel, SPSS, Tableau, reporting, relational database design, SQL, and visualization.

Most organizations distinguish between four types of data analytics projects:

  1. Descriptive analytics examines historical data. Examples include analyzing prior monthly or quarterly revenue or website traffic during some period of time, generally focusing on understanding trends and tendencies (a minimal code sketch follows this list).
  2. Diagnostic analytics considers why something happened by comparing descriptive data sets to identify dependencies and patterns. This helps an organization determine the cause of a positive or negative outcome. Often A/B tests figure heavily into this type of analysis.
  3. Predictive analytics seeks to determine likely outcomes by detecting tendencies in descriptive and diagnostic analyses. This allows an organization to take proactive action—like reaching out to a customer who is unlikely to renew a contract, for example.
  4. Prescriptive analytics makes recommendations as to which actions a business should take. While this type of analysis brings significant value in the ability to address potential problems or stay ahead of industry trends, it often requires the use of complex machine learning algorithms and advanced statistical analyses.
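To make the first of these concrete, here is a minimal descriptive-analytics sketch in Python using pandas. The file name sales.csv and its columns (order_date, revenue) are hypothetical, purely for illustration.

```python
# A minimal descriptive-analytics sketch: summarize historical revenue by month.
# "sales.csv" and its columns are hypothetical examples.
import pandas as pd

sales = pd.read_csv("sales.csv", parse_dates=["order_date"])

# Aggregate revenue by calendar month to expose trends and tendencies.
monthly = (
    sales.set_index("order_date")
         .resample("ME")["revenue"]   # "ME" = month-end frequency (use "M" on older pandas)
         .agg(["sum", "mean", "count"])
)

print(monthly.tail(12))  # totals, averages, and order counts for the last twelve months
```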

Data Science

Data scientists, on the other hand, estimate the unknown by asking questions, writing algorithms, and building more advanced statistical models. The main difference between a data analyst and a data scientist is that data scientists work with larger and more complex data sets, leverage heavy coding, use machine learning algorithms to discover patterns and make predictions, and thus need a deeper understanding of statistics and programming. Data scientists can arrange undefined sets of data using multiple tools at the same time, perform data transformations, integrate disparate data sets, and build their own automation systems and frameworks.

Drew Conway, data science expert and founder of Alluvium, describes a data scientist as someone who has mathematical and statistical knowledge, hacking skills, and substantive domain expertise to know what questions to ask and how to leverage the findings in the data to solve a business problem.

Key skills of a data scientist include supervised and unsupervised machine learning methods, software development, large relational and non-relational (NoSQL) databases, ability to design data warehouses, data analysis, integration of structured and unstructured data, natural language and text processing, plus solid programming skills in Java, C++, R, and Python.

Senior data scientists are often tasked with designing data models, building automated data acquisition processes, and creating algorithms and predictive models to extract the information needed by an organization to solve complex problems. They are aided by junior data scientists, data analysts, and data engineers.

In many organizations, the two disciplines – data science and data analytics – are not clearly distinguished. In fact, many organizations that claim to need a data scientist or “do data science” are actually doing data analytics.

While the actual title might vary from organization to organization, the most common roles in “data science” include these.

  • Data Scientist: Design data analysis processes to select algorithms and construct predictive models, as well as perform custom analysis and develop new algorithms.
  • Data Analyst: Acquire and manipulate large data sets and mine them to identify trends and assist strategic business decision-making.
  • Data Engineer: Create tools and automate the process of cleaning, aggregating, and integrating data from disparate sources and transfer it to analytics stores and data warehouses.
  • Business Intelligence Specialist: Identify trends in data sets, build custom reports, configure dashboards.
  • Data Architect: Design, create, and manage an organization’s data architecture including analytics stores, data warehouses, and data marts.

At the end of the day, Data Science is an interdisciplinary field that includes “Big” Data Analytics, Data Mining, Predictive Modeling, Data Visualization, Mathematics, Computer Science, Algorithm Design, Computational Statistics, and Computational Social Science. In fact, many of the methods used in data science are adapted from Statistics, and thus Data Science is often viewed as applied statistics with really powerful computers. Some consider it a new paradigm in Science, alongside the Theoretical, Empirical, and Computational paradigms.

Machine Learning

Machine Learning (ML) is a branch of computer science that uses algorithms to map input values to a result. It uses data sets known as training data to discover patterns and relationships in sets of features that can be used to produce a prediction. A prediction can be a class or a number, called classification or regression, respectively. For example, a machine learning algorithm can comb through historical data on customer churn and then discover a model that allows marketing and sales teams to identify which customers, based on purchase history, demographics, and other customer features, are likely to terminate their account or contract. That is a classification because the algorithm classifies customers as “likely to churn” or “unlikely to churn”. Forecasting the likely selling price of a boat based on geography, outfitting, and age, among other features, would be a regression model as it predicts a number. This type of machine learning is often referred to as predictive modeling, as one seeks to build a model using an algorithm based on historical data to make a prediction about the future.

The reliance on pre-labeled data to “train” the model makes this supervised machine learning. There are many useful machine learning algorithms, among them k-Nearest Neighbor, Decision Trees, Linear Regression, Support Vector Machines, and Bayes Classifiers. One class of supervised machine learning algorithms that has received significant attention recently is the class built around artificial neural networks (ANNs). ANNs which have many (hidden) layers and use sophisticated training techniques are referred to as deep learning networks or, simply, deep learning.
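As a minimal sketch of the classification/regression distinction, the toy example below uses scikit-learn with invented data; the customer and boat features are made up purely for illustration, not taken from any real dataset.

```python
# Supervised learning on invented toy data: a classifier predicts a label,
# a regressor predicts a number.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LinearRegression

# Classification: predict "churn" vs. "stay" from [months_as_customer, purchases_last_year].
X_customers = [[2, 1], [3, 0], [24, 10], [36, 15], [5, 2], [48, 20]]
y_churn = ["churn", "churn", "stay", "stay", "churn", "stay"]
classifier = KNeighborsClassifier(n_neighbors=3).fit(X_customers, y_churn)
print(classifier.predict([[4, 1]]))   # predicts a class label

# Regression: predict a selling price from [age_years, length_ft].
X_boats = [[1, 20], [5, 22], [10, 25], [20, 30], [3, 18], [15, 28]]
y_price = [42000, 35000, 30000, 22000, 38000, 26000]
regressor = LinearRegression().fit(X_boats, y_price)
print(regressor.predict([[7, 24]]))   # predicts a number
```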

Machine Learning combines computer science, algorithm design, computational statistics, and probability theory to come up with models that can be used to make predictions. These models resulting from applying a machine learning algorithm are called predictive models.

In addition to discovering models from historical data, data scientists also need to evaluate the validity and veracity of the models and how good the predictions likely are, and when and how they can be used in practice. Furthermore, the algorithms are often quite complex and require substantial data, so understanding how to make them efficient is key to training models – after all, if a model takes two weeks to train, it might not be useful when data has already changed after it’s been trained. Many data scientists working in building complex and advanced machine learning models have advanced graduate degrees in computer or data science.

The process of predictive modeling typically involves the following steps:

  1. Defining the Problem: Clearly articulating the question or prediction objective.
  2. Data Collection and Preparation: Gathering relevant data and performing wrangling to ensure quality.
  3. Feature Engineering: Identifying and transforming variables to improve model performance.
  4. Model Selection: Choosing an appropriate algorithm, such as linear regression, decision trees, or neural networks, depending on the problem and data characteristics.
  5. Training and Testing: Dividing data into training and testing sets to evaluate model performance and avoid overfitting.
  6. Evaluation and Deployment: Assessing the model’s accuracy and deploying it to generate predictions.

Examples of predictive modeling applications include fraud detection, customer behavior analysis, and medical diagnosis. Machine learning models rely on algorithms such as support vector machines, ensemble methods like random forests, and deep learning techniques to achieve high levels of predictive accuracy.
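To make the training-and-testing steps concrete, the sketch below uses scikit-learn’s bundled iris dataset; the choice of a decision tree, the 25% test split, and accuracy as the metric are illustrative assumptions, not recommendations for any particular project.

```python
# Steps 4-6 of the predictive-modeling process on a small bundled dataset.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# Step 5: hold out a test set so the model is judged on data it has not seen.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Step 4: choose an algorithm; a shallow decision tree is used here only as an example.
model = DecisionTreeClassifier(max_depth=3, random_state=42)
model.fit(X_train, y_train)

# Step 6: evaluate on the held-out data before any deployment decision is made.
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```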

Data Mining

Data Mining is the process of discovering patterns in a data set, such as groupings or clusters, trends, or relationships. It is an important component in knowledge discovery; in fact, it used to be commonly referred to as KDD – Knowledge Discovery in Databases – although that term has now been supplanted by the term Data Mining. Other terms that were in use for a time included data harvesting, information harvesting, data archaeology, forensic analytics, and knowledge extraction.

Data Mining often requires analyzing a vast amount of historical data – data that was often ignored or even erased by organizations and businesses. Recently, business executives have started to understand that there is value in their historical data and that it can aid the decision-making process leading to data-driven decision-making.

Data Mining also uses machine learning algorithms, but attempts to discover the “labels” and thus does not require labeled training data. It is therefore also called unsupervised machine learning.
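A minimal sketch of this idea, using scikit-learn’s k-means algorithm on invented customer data (the spend and visit numbers below are made up for illustration), discovers groups without any pre-assigned labels.

```python
# Unsupervised learning: discover customer segments without labeled training data.
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Invented features: [annual_spend_dollars, visits_per_month]
customers = [
    [200, 1], [250, 2], [300, 1],        # occasional shoppers
    [5000, 12], [5200, 15], [4800, 10],  # frequent, high-spend shoppers
    [1500, 5], [1700, 6], [1600, 4],     # a middle group
]

# Scale the features so the large spend values do not dominate the visit counts.
scaled = StandardScaler().fit_transform(customers)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(scaled)
print(kmeans.labels_)  # the cluster ("label") discovered for each customer
```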

Artificial Intelligence

Unlike machine learning and data mining, which provide information for decision-making by a human, AI uses algorithms (including machine learning algorithms) to automate decision-making and to build autonomous devices that do not require human guidance or intervention. So, in some sense, machine learning is a sub-discipline of AI. In particular, deep learning is a common method used in building autonomous agents. Other examples of autonomous agents are Large Language Models, which use vast amounts of textual training data to make word predictions and are used in products such as ChatGPT.

Data Engineering

Machine Learning, Data Mining, and Data Analytics efforts require vast amounts of data that is clean, usable, and fit for the intended purpose. Data shaping or wrangling to get the data into a form that can be consumed by algorithms requires specialized skills in programming. While in many organizations much of this “data engineering” work is done by Data Analysts or Data Scientists, many organizations have realized that this requires a different set of skills that are much more software development and programming oriented rather than data and algorithm oriented. So, the role of the Data Engineer has been added as a job role and title.

Data wrangling, also known as data munging, is the process of cleaning, organizing, and preparing data for analysis. This stage is essential for ensuring data quality, as raw data is often incomplete, inconsistent, or contains errors. Key activities in data wrangling include handling missing values, resolving inconsistencies, transforming data into a suitable format, and integrating datasets from multiple sources. Without thorough preparation, the validity and reliability of any analytical conclusions may be compromised. Effective data wrangling requires domain knowledge, technical skills, and an understanding of the analytical goals.
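A minimal wrangling sketch in pandas, using small invented tables, shows the typical moves: fixing types, handling missing values, and integrating two sources.

```python
# Data wrangling sketch: fix types, handle missing values, merge disparate sources.
# All data here is invented for illustration.
import pandas as pd

orders = pd.DataFrame({
    "customer_id": [1, 2, 2, 3],
    "order_date": ["2024-01-05", "2024-01-07", "not recorded", "2024-02-01"],
    "amount": ["19.99", "5.00", "12.50", None],
})
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "region": ["North", "South", "North"],
})

# Transform columns into usable types; unparseable values become NaT/NaN.
orders["order_date"] = pd.to_datetime(orders["order_date"], errors="coerce")
orders["amount"] = pd.to_numeric(orders["amount"], errors="coerce")

# Handle missing values; here, missing amounts are filled with the column median.
orders["amount"] = orders["amount"].fillna(orders["amount"].median())

# Integrate the two sources so region is available for analysis.
clean = orders.merge(customers, on="customer_id", how="left")
print(clean)
```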

Machine Learning Engineering

Once a predictive model has been constructed using supervised machine learning algorithms, it needs to be deployed on web sites, within products, and within databases and applications. In addition, the model must be updated as new training data is collected and tuned as the model’s performance in practice is observed. The deployment and maintenance of machine learning derived predictive models is much more engineering than science, so it falls on Machine Learning Engineers in many organizations.
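One routine piece of that engineering work is packaging a trained model as an artifact that an application can load and use for predictions. The sketch below shows one simple way to do this with joblib; the file name and the iris model are illustrative only, and real deployments add versioning, monitoring, and retraining on top of this.

```python
# Persist a trained model to disk, then reload it where predictions are served.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
import joblib

X, y = load_iris(return_X_y=True)
model = DecisionTreeClassifier(max_depth=3).fit(X, y)

# Save the fitted model as a deployable artifact...
joblib.dump(model, "iris_model.joblib")

# ...then, inside a web service or batch job, load it and make predictions.
serving_model = joblib.load("iris_model.joblib")
print(serving_model.predict([[5.1, 3.5, 1.4, 0.2]]))
```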

Relationships between Disciplines

Historical Perspective

Before considering the requisite skills of a data specialist such as a data analyst, data scientist, or data engineer, let’s briefly look at the evolution of the terms used to describe the data field; it will help clarify how the terms are currently used – and why there are duplicate terms and to some degree quite a bit of confusion.

According to Priya Pedamkar of EDUCBA, the term “Data Science” was first coined in the 1960s, although more as an alternative to “Computer Science”, which was just emerging as an actual discipline separate from Electrical Engineering and Mathematics. However, since the early 2000s, it has started to carry a quite different meaning.

In 2008, D. J. Patil and Jeff Hammerbacher became the first individuals to call themselves “Data Scientists” to describe their roles at LinkedIn and Facebook, respectively. In fact, Dr. Patil was named the first U.S. Chief Data Scientist by President Barack Obama and served in that role from 2015 to 2017 – in effect acknowledging the importance of data in governance and decision-making. In 2012, a Harvard Business Review article called Data Scientist the ‘Sexiest Job of the 21st Century’.

The term Data Mining evolved in parallel with Data Science. It became a common term in the database communities of the late 1990s. Data Mining owes its origin to KDD (Knowledge Discovery in Databases). KDD is the process of finding patterns and relationships automatically, without human assistance, from information stored in very large databases. Today, Data Mining is often used interchangeably with KDD. The algorithms used to make automated discoveries are types of machine learning algorithms, so data mining is a flavor of machine learning where no labeled training data is required. Data mining is now often called unsupervised machine learning by many in the machine learning community, although data professionals continue to use the term data mining.

Key Skills

Among the most important key technologies of the data analytics world are, of course, programming languages. The two programming languages used most commonly are Python and R. Then of course there are the databases: relational databases vs NoSQL databases, alongside numerous tools and platforms.

In the video below, Dr. Schedlbauer provides an overview of the key skills required of data analytics professionals. Watch it before diving deeper into the subsequent sections.

The answer to the question “What does a data analyst do?” will vary depending on the type of organization and the extent to which a business has adopted data-driven decision-making practices. Generally speaking, though, the responsibilities of a data analyst typically include the following:

  • Designing and maintaining data systems and databases; includes fixing coding errors and other data-related problems.
  • Mining data from primary and secondary sources, then reorganizing said data in a format that can be easily read by either human or machine.
  • Using statistical tools to interpret data sets, paying particular attention to trends and patterns that could be valuable for diagnostic and predictive analytics efforts.
  • Demonstrating the significance of their work in the context of local, national, and global trends that impact both their organization and industry.
  • Preparing reports for executive leadership that effectively communicate trends, patterns, and predictions using relevant data.
  • Collaborating with programmers, engineers, and organizational leaders to identify opportunities for process improvements, recommend system modifications, and develop policies for data governance.
  • Creating appropriate documentation that allows stakeholders to understand the steps of the data analysis process and duplicate or replicate the analysis if necessary.

Given those responsibilities, key requisite skills of a “data specialist” – to use a broader term for the moment – include the following:

  • Excel and spreadsheets: When you think of Excel, the first thing that comes to mind is likely a spreadsheet, but there’s a lot more analysis power under the hood of this tool. While a programming language like R or Python is better suited to handle a large data set, advanced Excel methods like writing macros and using lookups are still widely used for smaller lifts and quick analytics. If you are working at a lean company or startup, the first version of your database may even be in Excel. Over the years, the tool has remained a mainstay for businesses in every industry, so learning it is a must. Luckily, there is an abundance of great free resources online to help you get started, as well as structured data analytics classes for those looking for a deeper understanding of the tool. Unfortunately, Excel is limited to smaller datasets (a worksheet holds at most about one million rows), so learning a data-oriented programming language is often another must as you move up in the sizes of data sets that you work with.
  • Statistics: Analysis of data requires at least some understanding of descriptive statistics so that you can get a feel for the shape of the data, its central tendency, whether there are outliers or missing values, and whether the data values are somehow dependent on (or correlated with) one another. Statistics is also the underpinning of many machine learning methods for building predictive models: regression is one that comes to mind, but many others also require some understanding of statistics (and probability).
  • Programming in R and/or Python: Data sets are often large and complicated, and the data is often not in a neat, easily digestible, analyzable format. You will need to “shape” and “wrangle” the data, and while Excel can help with simple manipulations, when the data gets large and complex you will need to use programming. And for that, knowing how to program in at least one of two data languages is key: R and Python. Like SQL, R and Python can handle what Excel can’t. Both are powerful data programming languages used to perform advanced analyses, create powerful visualizations, do predictive analytics on big data sets, apply machine learning and data mining, and create exploratory and explanatory visualizations. And they’re both industry standard. To truly work as a data analyst, you’ll need to go beyond SQL and master at least one of these languages. One main benefit of doing all data analysis work in a programming language is that your analysis steps are clearly defined and “reproducible”. Anyone can see how you arrived at your analysis, your model, or your visualization. So, which one should you learn? Both R and Python are open source and free, and employers typically don’t care which their employees choose to use as long as their analyses are accurate. Since it was built specifically for analytics, however, some analysts prefer R over Python for exploring data sets and doing ad-hoc analysis. So, where do visual programming environments like SPSS and SAS come in? Well, those can be useful too, but learning R or Python over these tools is better because, like Excel, SPSS and SAS are limited, and complex programs for complex data sets are very difficult to build in them.
  • Relational Databases and SQL: The data has to be stored: files might be fine for small data sets, but large amounts of data require an actual database. And that requires a way to access the data in the database: SQL. SQL is the ubiquitous industry-standard data query language and among the most important skills for data analysts to have. The language is often thought of as the “graduated” version of Excel; it is able to handle large datasets that Excel simply can’t. Almost every organization needs someone who knows SQL—whether to manage and store data, relate multiple databases, build reports, or build data warehouses that simplify analytical queries and enable dashboards.
  • Data Visualization: Data analysts need to know how to build visualization to (a) understand their data, and (b) communicate their findings. Often, the analysis of a data set starts with exploratory data analysis (EDA) and that often involves some statistical analysis and many exploratory visualizations such as scatter plots, line charts, bar graphs, column charts, bubble charts, heat maps, among many others. Once the analysis is done, and insights have been uncovered, then it’s important to be able to tell a compelling story with data through the use of well-built, easily understandable explanatory visualizations. In those types of visualizations, presentation matters. And for that, knowing how humans consume visualizations is critical. So, knowing the key workings of cognitive psychology helps tremendously. If your findings can’t be easily and quickly identified, then you’re going to have a difficult time getting through to others. For this reason, data visualization can have a make-or-break effect when it comes to the impact of your data. Visualizations are often built in Excel or Tableau but to be reproducible they ought to be built in R or Python.
  • Critical Thinking: Using data to find answers to your questions means figuring out what to ask in the first place, which can often be quite tricky. To succeed as an analyst, you have to think like an analyst. It is the role of a data analyst to uncover and synthesize connections that are not always so clear. While this ability is innate to a certain extent, there are a number of tips you can try to help improve your critical thinking skills. For example, asking yourself basic questions about the issue at hand can help you stay grounded when searching for a solution, rather than getting carried away with an explanation that is more complex than it needs to be. Additionally, it is important that you remember to think for yourself instead of relying on what already exists.
  • Domain Knowledge: It most certainly helps to have “domain knowledge” – an understanding of the area of business in which you do your analysis. The most successful data analysts combine business acumen and critical thinking with solid technical skills.
  • Communication Skills: Data visualization and communication/presentation skills go hand-in-hand. But presenting doesn’t always come naturally to everyone, so seasoned data analysts know how to craft a visual story and know how to present their findings to an audience in an engaging and informative way. Unfortunately, others within your organization, especially often the key stakeholders, are likely not as data literate as you are. Whereas you might be fluent in the language of analytics, they’re fluent in the language of project management, or in the language of your organization’s business challenges.
  • Machine Learning: As artificial intelligence and predictive analytics are two of the hottest topics in the field of data science, an understanding of machine learning has been identified as a key component of an analyst’s toolkit. While not every analyst works with machine learning, the tools and concepts are important to know in order to get ahead in the field and to successfully work with data science groups. You’ll need to have your statistical programming skills down first to advance in this area, however. Thankfully, R and Python have many pre-built machine learning algorithms that can be used “out-of-the-box”, so a deep understanding of programming the often-complex algorithms is not necessary. However, knowing how the algorithms function is important, as data analysts often need to tune the algorithms, and that requires an understanding of what is tunable. Machine learning comes in two flavors: supervised and unsupervised. In supervised machine learning, a machine learning algorithm attempts to relate data to a pre-known result and define a “mapping” so that the mapping can be used to make new predictions. A prediction is either in the form of a classification or a regression. An example of a classification model is deciding whether a message is spam or not; in other words, classifying the message as belonging to the class “spam” or the class “ham”. On the other hand, a regression model predicts a number, for example, the anticipated selling price of a home with certain features. With both kinds of models, you are making a “prediction”, hence the term “predictive model” – all predictive models are the results of applying a supervised machine learning algorithm. Data analysts must not only understand how to use the algorithms to build predictive models but also how to evaluate whether the models make good predictions that can be trusted and are useful in practice. Unlike supervised machine learning, which requires a pre-labeled data set, an unsupervised machine learning algorithm attempts to discover patterns in the data on its own. An example might be to discover the different groups of customers a company has and what distinguishes one group from another. That would be useful in building targeted marketing campaigns, for example. These types of methods are also often called data mining methods.

One of the best ways to develop skills is “by doing”. Analyze data sets, create databases, load data sets into databases, run queries, build reports, integrate tools, and practice programming. There are numerous data sets available on the web for virtually any domain: data.gov and Kaggle.com are good starting points.
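As one small “by doing” exercise, the sketch below loads a CSV into a local SQLite database from Python and answers a question with SQL. The file name sales.csv and its columns (region, product, revenue) are hypothetical, and SQLite simply stands in for whatever relational database you practice with.

```python
# Practice loop: load a data set into a database, then answer a question with SQL.
# "sales.csv" and its columns are hypothetical examples.
import sqlite3
import pandas as pd

conn = sqlite3.connect("practice.db")

# Load the CSV into a relational table (replacing it if the script is re-run).
pd.read_csv("sales.csv").to_sql("sales", conn, if_exists="replace", index=False)

# A custom query: total revenue by region, largest first.
query = """
    SELECT region, SUM(revenue) AS total_revenue
    FROM sales
    GROUP BY region
    ORDER BY total_revenue DESC
"""
print(pd.read_sql_query(query, conn))

conn.close()
```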

The Six V’s of Big Data

Big data is quite big – but not only in quantity. Big Data has a set of characteristics, called the 6 V’s (it was originally the 3 V’s, then the 5 V’s, and now you’ll hear data scientists discuss these as the 6 V’s). Data work, especially when working with very large amounts of data, is commonly described through the six V’s:

  1. Volume: Refers to the sheer amount of data generated and collected, often measured in terabytes or petabytes.
  2. Variety: Represents the diverse forms of data, including structured, semi-structured, and unstructured formats such as text, images, videos, and log files.
  3. Velocity: Indicates the speed at which data is generated, processed, and analyzed, emphasizing the need for real-time analytics.
  4. Veracity: Relates to the reliability and accuracy of data, addressing challenges associated with data inconsistencies and errors.
  5. Validity: Highlights whether the data is correct and appropriate for its intended use, so that meaningful and actionable insights can be drawn from it.
  6. Volatility: Acknowledges the changing nature of data over time and the need to adapt analytical methods accordingly.

Understanding and addressing these dimensions are essential for effective data analytics. Let’s take a closer look.

In the video chat below, Dr. Schedlbauer provides an overview of the Six V’s, before additional details are provided subsequently.

Volume

Volume is one of the core attributes of “big data”. Big Data implies enormous amounts of structured and unstructured data that is generated by social and sensor networks, transaction and search history, and manual data collection.

Let’s take the example of an online recruiting portal. On the portal, there are thousands upon thousands of resumes to manage, in addition to thousands upon thousands of job postings and applications to those job postings. Recruiters and job applicants might comment on discussion boards; they might make general posts about jobs and what it is like to work at companies. The company might have an active social media presence. And perhaps they manage the actual recruiting process: filtering resumes, setting up interviews, and negotiating job offers. It is conceivable that they also have a presence on YouTube and Twitter, and one can immediately imagine that there are hundreds of thousands of data points being collected every day. So, there is a very large amount of data that needs to be managed, stored, and analyzed. This amount of data is referred to as volume. And big data is big volume.

Variety

Data comes from a variety of sources and contains both structured and unstructured data. Data types are not restricted to simply numbers and short text fields, but also include images, emails, text messages, web pages, blog entries, social media content, documents, audio, video, and time series.

Velocity

The flow of data that needs to be stored and analyzed is continuous. Human interactions, business processes, machines, and networks generate data continuously and in enormous quantities. The data is generally analyzed in real-time to gain a strategic advantage.

Sampling can help mitigate some of the problems with large data volume and velocity.
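For instance, a minimal sketch of that idea with pandas draws a reproducible 1% random sample from a large table and estimates a statistic from it; the events table and its response_ms column are invented stand-ins for any large, fast-arriving data set.

```python
# Estimate a statistic from a 1% random sample instead of scanning the full data set.
import numpy as np
import pandas as pd

# An invented stand-in for a large table of incoming events.
events = pd.DataFrame({"response_ms": np.random.default_rng(0).exponential(200, 1_000_000)})

sample = events.sample(frac=0.01, random_state=7)                   # reproducible 1% sample
print(sample["response_ms"].mean(), events["response_ms"].mean())   # sample vs. full-data mean
```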

Veracity

Data veracity characterizes the inherent noise, biases, abnormalities, and mistakes present in virtually all data streams. “Dirty” data presents a significant risk as analyses are incorrect when based on “bad” data. Data must be cleaned in real-time, and processes must be established to keep “dirty data” from accumulating.

Validity

While the data may not be “dirty”, biased, or abnormal, it may still not be valid or valuable for the intended use. Data that is valid for the intended use is essential to making decisions based on the data.

Volatility

Data changes over time, and volatility (or variability) characterizes the degree to which it changes. Decisions and analyses are based on data that has an “expiration date”. Data scientists must define the point in time at which a data stream is no longer relevant and can no longer be used for planning.

Concepts Pharma has built a data repository that collects self-reported eating habits of clinical trial participants through a mobile app. The translational medicine group is using the data to determine if the drug in the trial is causing digestive issues when taken with certain food groups. Which of the V’s should be of most concern to them?

Careers in Data Science

Data science and analytics lie at the intersection of business, information technology, and data skills. Individuals who work in the data science landscape have to have skills in all three of these areas in order to be successful. You may see or hear about professionals in this area split into data scientist and data analyst roles, but we are seeing a call for professionals with skills in both areas. These analytics professionals have business knowledge, analytics skills, and deep experience with information technology. As you can see in the diagram below, there is a significant overlap in all of the skills areas we will discuss in this course and beyond.

Though there is significant overlap in the type of work that people in the data science and analytics world do, we can broadly describe the roles as follows:

| Data Analyst | Data Scientist |
| --- | --- |
| Use well-defined data sets to solve problems and create organizational knowledge | Work with undefined sets of data and look to learn about what is not known from the data |
| Have technical expertise in business, statistics, and technology | Have expertise and extensive skills in computer science, mathematics, and statistics (often with graduate degrees) |
| Responsible for designing and maintaining databases, analyzing and interpreting data, and communicating with business groups | Design processes, create predictive models, and write algorithms |
| For more about this topic: What Does a Data Analyst Do? | For more about this topic: What Does a Data Scientist Do? |

Looking at the table above, there is a fairly clear division of labor and expertise between the two roles. This can be helpful when you are looking for additional training and knowledge, but in reality, jobs in data analytics often blend skills from both of these career paths. A quick review of jobs in the data science world shows that employers are often looking for professionals who have skills from both the data analytics and data science areas.

Between the traditional data analyst and data scientist roles lies an area that is a blend of both. This role focuses on collecting, preparing, and using data. For this work, professionals need higher-level technical skills and an understanding of data science concepts, though they are not likely to work in the more technical areas of data science. Companies often use the term data scientist when looking for someone who can take data from various sources, load it into an analysis environment, visualize the data, provide insight, and build a simple model. This data science professional draws on skills from both the data analyst and data scientist roles and often includes data engineering skills to collect, clean, and shape data.

The following interviews will give you a bit of a sense of what it is like to work as a data scientist. The work is not always about math, programming, and algorithms. Notice how much time they spend on shaping data and determining the actual problem that needs to be solved.

Practical Applications

Data science, driven by analytics and big data, has the ability to change the world. In fact, areas of data science including machine learning and algorithms are already transforming how we live, learn, and manage the world around us.

While headlines typically focus on science-fiction-style stories, such as the Uber self-driving test car that struck and killed a pedestrian (source), data science has provided many opportunities to improve human life and our ecosystem. Wherever technology is used, it is possible to imagine ways of using data science – and the use of data science is only limited by imagination. Big data, machine learning, and artificial intelligence are all areas where everyday researchers and data scientists are using this technology to better understand the world around us.

Machine learning has made possible ecological research that was nearly unheard of 50 years ago. Tracking and identifying living species of fish, animals, and insects is extremely time- and resource-consuming. With the use of machine learning, though, it is possible to capture and analyze data quickly and efficiently. New advances in data science not only power these types of projects but also allow researchers to discover new patterns and identify new ways of protecting animal species.

One example of the use of machine learning to understand and improve conservation comes from the Nisqually River Foundation. Technology was used to capture images of fish populations in the river and then identify the fish species, in order to provide an accurate picture of the river ecosystem. Tracking fish by hand is not typically feasible because of the costs and the impact on the ecosystem; with the use of technology, however, it was possible to automate data collection and interpretation, leading to better outcomes for the river and the animals that inhabit it (source).

The Serengeti National Park and Grumeti Reserve in Tanzania are considered among the last intact habitats for large mammals. Protecting this land and the community it supports is complicated because of the large variety and quantity of animals. Data science has allowed researchers to catalog the variety of species and analyze the movement and behavior of individual animals and community groups. By limiting the physical presence of human researchers, it is possible to study the community of animals in real-time while limiting harm and interference. Data science has given researchers the ability to understand and take steps to protect this habitat (source).

Observing behavior in groups of animals like fish is difficult. Machine learning methods can help researchers analyze the group and also study individuals within the group. This data has been used to make sense of motion and posture for individual fish, as well as the school of fish (source).

Data mining can also be used to catalog swarms of insect species to identify and predict the relationship between insects and the environment. Cataloging and monitoring insect species and behaviors using data science makes it possible for scientists to predict impacts and take steps to mitigate issues (source).

Data science is quietly but consistently making changes in healthcare and has given us the ability to identify people at risk for serious health issues and even diagnose disease more accurately and effectively. Data science methods, including machine learning and artificial intelligence, have already shown promise in managing healthcare, including providing monitoring for patients, which led to fewer complications and fewer hospitalizations (source: https://www.modernhealthcare.com/indepth/how-ai-plays-role-in-population-health-management/).

Osteoarthritis of the knee is a common and debilitating condition in the United States. Machine learning algorithms, trained to read knee X-rays, are able to predict pain from those X-rays. In some measures, the algorithms outperformed experienced radiologists and were both more accurate and able to identify evidence of disease that affected pain but was not yet visible to physicians. Tools like this, in conjunction with trained medical professionals, can extend what we know and can do about health conditions (source: https://ai-med.io/more-news/new-deep-learning-model-reveals-racial-disparities-in-knee-pain-assessment/; source).

Artificial intelligence is already in use in clinical environments. In the Mayo Clinic, AI has been used to:

  • Identify heart issues that may lead to heart failure without treatment
  • Improve the outcomes of individuals who had a stroke through faster diagnosis based on CT data
  • Detect atrial fibrillation based on heart rhythm tests
  • Screen for hypertrophic cardiomyopathy

See Source 1 and Source 2

Machine learning and artificial intelligence have also been used to detect and diagnose cancer. AI is able to detect precancerous lesions in cervical cancer, predict prostate cancer, and identify genomic changes in tumors (source).

Disaster response and emergencies pose a challenge to first responders and groups who need to quickly assess and react to unfolding issues like storms, fires, and floods. Artificial intelligence has significant implications for managing disasters. AI is able to predict what resources might be necessary and how people may be displaced, and it can assess damage in real-time. AI is also able to determine where and how to allocate resources for the most impact. Much or all of this can be done with minimal human risk, enabling more effective and safe responses while giving people involved in the emergency the best chance of survival (source, source).

Data science has been used to identify areas where structures and human lives are at risk, before, during, and after disasters, when it is dangerous, time-consuming, or costly to send in human workers (source).

Data science technology and techniques have been used to identify and track disasters in real-time by mining social media and other readily available data (source). By including machine learning and artificial intelligence in disaster planning and response, rescue personnel have more access to information, better ability to make decisions, and are able to determine more effectively where resources and personnel can do the most good.

Data science can be used in nearly all settings to improve human life and health and to protect the environment around us. Like all tools, though, data science holds tremendous promise but requires that the people who use it understand the power of these tools, how they can benefit the world, and the biases, inequality, and issues that they may inadvertently create when they build systems that can learn. The fact that a self-driving test car killed a pedestrian is a harsh reminder that those who work with data, machine learning, and artificial intelligence need to be aware of and constantly monitor the tools they build. It is important to make sure that data science tools are understood and monitored to ensure that the solutions we build are good for all.

Summary

Data analytics serves as a cornerstone of modern decision-making, providing a structured approach to derive insights from data. By understanding its relationship with data science, leveraging big data, and employing effective wrangling techniques, analysts can maximize the value of their datasets. Addressing the six V’s ensures a comprehensive understanding of data complexity, while predictive modeling with machine learning unlocks future possibilities through informed predictions. Together, these components form the foundation of a robust data analytics framework, equipping students with the tools and knowledge to excel in the field.

Data comes from many sources and in many formats. Unfortunately, the data is rarely in a format that is conducive to analysis. Therefore, the data scientist must “munge” and “wrangle” the data to suit the desired analytical and visualization goals. In practice that means format conversions, filling in missing data, converting data and fields to appropriate formats, and storing the data in a data store suitable for its intended purpose. Databases use different architectures to deal with different types and amounts of data. The data specialist must choose the appropriate database and storage architecture based on the data and how it will be used.

Reflection

Now that you have completed this lesson, browse the resources below, take notes on areas where you need to read further, or review the lesson, and then consider the following:

  1. Where do you see yourself fitting in the data science and analytics world?
  2. How did the information in this lesson confirm what you know and challenge what you thought?
  3. Where do you already have the skills you will need to be successful in data science?
  4. What skills do you want to and need to work on?
  5. What type of professional experience will help you move deeper into your data analytics career?
  6. Do you understand the six V’s? Can you explain them to someone else?

Certifications

Certificates and formal education – through for-credit courses at universities (often leading to a degree) or online non-credit courses (on platforms such as Teachable, Udemy, Udacity, or Coursera) – are a good way to build and maintain skills in data science and other professional areas, including project management. These types of courses, offered by experts in the field who are actively working and publishing, help ensure that the information you receive is accurate, useful, and up to date.

Certification is a popular and valuable practice that helps a data science professional prepare for and demonstrate mastery of skills, tasks, and technology. Certifications can be vendor-neutral and focus on specific skills, which can be used throughout different roles and responsibilities, regardless of the platform or technology used for data storage and analytics work.

The Certified Analytics Professional (CAP) credential is overseen by the Analytics Certification Board (ACB) and sponsored by INFORMS (informs.org), an international association for professionals in data analytics. The CAP certification requires both education and experience, as well as an exam. More information about the CAP certification is available online at https://certifiedanalytics.org.

The International Institute of Business Analysis (IIBA) offers several vendor-neutral certifications related to data analytics and data science. Among these are the CBAP (Certified Business Analysis Professional), CBDA (Certification in Business Data Analytics), and CCBA (Certification of Capability in Business Analysis). Each IIBA certification has a defined focus area and set of competencies. Individuals interested in certification must complete exam questions based on scenarios and case studies. They may also be asked to provide evidence of work experience and professional learning. For more information on each certification, please see the organization’s home page, iiba.org.

There are many other groups that provide the chance to demonstrate mastery, including the Data Science Council of America (DASCA) and the American Statistical Association. You might also want to pursue certifications in project management, like the Project Management Professional (PMP) certification through the Project Management Institute; certifications in methodologies such as Agile and Scrum can be helpful as well.

In addition to vendor-neutral certifications, there are numerous vendor and product-specific certifications. Information on these certifications is available through the vendor websites.

Key Terms

Like all fields, data science has a specific vocabulary and many concepts that you will want to be familiar with. You may want to familiarize yourself with the following terminology before moving on in the course.

Algorithm: a sequence of defined instructions that are designed to perform a computation or solve a problem

Artificial intelligence: an interdisciplinary branch of computer science that creates machines able to perform tasks that typically require human intelligence.

Big data: a field that deals with data sets that are too large and complex for traditional analysis methods. Big data, by definition, is high volume and high variety, flows continuously (velocity), requires cleaning in real-time to ensure veracity and validity (accuracy and usability), and is volatile (constantly changing).

Data analysis: a process of transforming and modeling data to discover useful information (insight) to inform and support decision making

Data mining: the process of turning raw data into useful information by finding patterns within large data sets

Data science: a field of study that uses statistics, scientific methods, computer algorithms and systems, and data analysis to extract knowledge from data sets

Machine learning: the study and use of computer systems that can adapt and improve performance automatically (without explicit instruction) by the use of models that analyze patterns in data

NoSQL databases: often called “not only SQL” databases, these databases store data that does not conform to the table structure common in RDBMS. As such, they can hold data in a variety of forms and lend themselves to efficient searches, queries, and computations. NoSQL databases are often considered complements to an RDBMS, as they extend the capabilities of an organization’s data storage.

Python: a powerful general purpose programming language

R: a programming language that was specifically designed for statistics and data analysis

Relational database: relational databases are structured storage systems that allow for efficient retrieval and manipulation of data.

SQL: a language used to modify, retrieve, and manage data in a relational database.

Structured vs unstructured data: structured data is data that has been processed and organized so that it can be easily accessed and used. Unstructured data, on the other hand, is usually in its native form and requires processing before it can be used. The vast majority of data in the world is unstructured, while database management systems in organizations are likely to use structured data.


Errata

None collected yet.

Let us know.