r4ds-instructors
Instructors' Guide to accompany "R for Data Science"
Basic Info
- Host: GitHub
- Owner: rstudio-education
- License: other
- Default Branch: master
- Size: 122 KB
R for Data Science Instructor's Guide
DRAFT: notes for people teaching R4DS with each chapter's learning objectives and key points.
This work is licensed under a Creative Commons Attribution 4.0 International License.
References
Hadley Wickham and Garrett Grolemund: R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. 1st ed., O'Reilly Media, 2017.
Learner Personas
Nethira, 27, is wrapping up a PhD in nursing and trying to decide whether to do a post-doc or take a data analyst position with an NGO in Tamil Nadu. She did two courses on statistics as an undergraduate, both using Stata, and picked up a bit of R from a labmate in grad school, but has never really come to grips with it as a tool. Nethira would like to improve her skills so that she can finish analyzing the data she collected for her thesis and get a couple of papers out, and to prepare herself for a possible change of career. These lessons will show her how to use the tidyverse in R to clean up, analyze, visualize, and model her data without working long nights or weekends.
Hannu, 40, has worked as a traffic engineer for the Finnish Ministry of Transportation for the past 12 years, during which time he has become proficient with SQL and Python. As part of an open data initiative, his department has decided to build a traffic capacity dashboard using Shiny, and Hannu wants to learn the basics of modern R in a hurry so that he can join this project. These lessons will introduce him to the packages that make up the tidyverse, and prepare him for a deeper dive into more advanced R programming.
Derived constraints:
- Learners know what variables are, how to index a list, how loops and conditionals work, and grasp the basics of programming language syntax, such as how to write a string or a list (both).
- Learners only have a shaky grasp of variable scope and the call stack, and will not understand closures or higher-order functions without detailed exposition (Nethira).
- Learners know very basic statistics (mean, standard deviation, linear regression), but do not understand what a p-value is or why an observation can only be used once during hypothesis confirmation (Hannu).
- Learners have 20-40 hours to work through this material. They may be able to ask more advanced friends or colleagues for help, but will primarily be learning on their own and by searching online (both).
Note: definitions of terms are marked with _single underscores_, while other
forms of emphasis use *single asterisks*. This makes it easy to extract
definitions for glossary construction.
1. Introduction
Objectives
- Describe the steps in the basic data analysis cycle.
- Explain the relative strengths and weaknesses of visualization and modeling.
- Explain when techniques beyond those described in these lessons may be needed.
- Explain the differences between hypothesis generation and hypothesis confirmation.
- Describe and install the prerequisites for these lessons.
- Explain where and how to get help.
Key Points
- The basic data analysis cycle is import, tidy, repeatedly transform, visualize, and model, and then communicate.
- Visualizations provide novel insight, but don't scale well.
- Models scale well, but cannot provide unexpected insight.
- The techniques described in these lessons are good for tabular (rectangular) data up to about a gigabyte in size.
- Hypothesis generation is the ad hoc process of exploring data to find possible hypotheses. An observation can be used many times during hypothesis generation.
- Hypothesis confirmation is the rigorous application of mathematics to test falsifiable hypotheses. An observation can only be used once in hypothesis confirmation.
- These lessons require R, RStudio, and a set of R packages called the tidyverse.
- Help can be found by:
- Typing `?name` at an interactive R prompt (where `name` identifies a package, function, or variable).
- Copying and pasting the error message into a web search.
- Searching on Stack Overflow.
- When asking for help, create a reproducible example that:
- Loads packages.
- Includes a small amount of data.
- Has short, readable code.
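As a rough illustration of the checklist above, a reproducible example might look like the following sketch. The `scores` data and the question it poses are hypothetical, invented only for this example.

```r
# 1. Load the packages the example needs.
library(dplyr)

# 2. Include a small amount of data (hypothetical values).
scores <- tibble(
  student = c("a", "b", "c"),
  score   = c(10, NA, 30)
)

# 3. Keep the code short and readable: why is the mean missing?
result <- summarize(scores, mean_score = mean(score))
print(result)
```

Anyone who pastes this into a fresh R session should see the same missing mean, which is what makes the question easy to answer.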
2. Introduction
- See above.
3. Data Visualization
Objectives
- Explain what a data frame is and how to access the ones included with the tidyverse.
- Explore the properties of a data frame.
- Explain what geometries and mappings are.
- Create a basic visualization of a single data frame with a single geometry and a single mapping.
- Create visualizations using the `x`, `y`, `color`, `size`, `alpha`, and `shape` properties.
- Explain ways in which continuous and discrete variables should and shouldn't be visualized.
- Explain what facets are and use them to display subsets of data in a single plot.
- Create scatterplots and continuous line charts.
- Create plots that represent data in two or more ways.
- Describe three places in which the visual aspects of a plot can be specified.
- Explain what a stat is in a plot, and how stats relate to geometries.
- Create stacked and side-by-side bar charts.
- Create a scatterplot to display data with many repeated values.
- Flip the XY axes in a plot.
- Use polar coordinates in a plot.
- Describe the seven parameters to a full ggplot2 visualization.
Key Points
- A data frame is a rectangular set of observations (rows) of the same variables (columns).
- Example data frames can be loaded using `library(tidyverse)` and then referred to by name (e.g., `mpg`) or by fully-qualified name (e.g., `ggplot2::mpg`).
- To explore a data frame:
- Type the name of the frame to see its shape, column titles, and first few rows.
- Use `nrow(frame)` to get the number of rows.
- Use `ncol(frame)` to get the number of columns.
- Use `View(frame)` in RStudio to visualize the data frame as a table.
- A geometry is an object that a plot uses to represent data, such as a scatterplot or a line.
- A mapping describes how to connect features of a data frame to properties of a geometry, and is described by an aesthetic.
- A very simple visualization has the form `ggplot(data = <DATA>) + <GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))`.
- Continuous variables should be visualized using smoothly-varying properties such as size and color.
- Discrete variables should be visualized using properties such as shape and line type.
- A facet is a subplot that displays a subset of the overall data.
- Use `facet_wrap(<FORMULA>, nrow=<NUM>)` with a single discrete variable as a formula (such as `~COLUMN`) to display a single sub-plot for each value of the discrete variable.
- Use `facet_grid(<FORMULA>)` with a formula of two discrete variables (such as `FIRST ~ SECOND`) to display a single sub-plot for each unique combination of the two variables.
- Use `. ~ COLUMN` or `COLUMN ~ .` as a formula to display facets in rows or columns only.
- Use `geom_point` to create a scatterplot and `geom_smooth` to create a line chart.
- Add multiple geometries after the initial call to `ggplot`.
- The visual aspects of a plot can be specified as follows (each overrides the one(s) before):
- Globally by specifying a value in the initial `ggplot` function.
- For a particular geometry by specifying a value outside an aesthetic.
- In a data-dependent way by setting a property of an aesthetic.
- A stat performs a data transformation, such as counting the number of elements in a subset of the data.
- Stats and geometries can often be used interchangeably since each stat has a default geometry and each geometry has a default stat.
- Map one variable to `x` and another to `fill` in the aesthetic for `geom_bar` to create a stacked bar chart.
- Set `position="dodge"` in `geom_bar` (outside the aesthetic) to create a side-by-side bar chart.
- Use `position="jitter"` in `geom_point` (outside the aesthetic) to add randomization to a scatterplot to show data with duplication.
- Add `coord_flip` to a visualization to flip the XY axes.
- Add `coord_polar` to a visualization to use polar coordinates instead of Cartesian coordinates.
- The seven parameters of a full ggplot2 visualization are:
- data: the data frame to be plotted
- geometry: how the data is to be displayed (e.g., scatterplot or line)
- mapping: how the properties of the data map to the properties of the geometry (e.g., which columns map to X and Y coordinates)
- stat: the transformation to apply to the data (e.g., count the number of observations)
- position: how to adjust the positions of displayed elements (e.g., jittering points in a scatterplot)
- coordinate function: whether to use Cartesian coordinates or polar coordinates
- facet: how to subset the data to create multiple subplots
```
ggplot(data = <DATA>) +
  <GEOM_FUNCTION>(
    mapping = aes(<MAPPINGS>),
    stat = <STAT>,
    position = <POSITION>
  ) +
  <COORDINATE_FUNCTION> +
  <FACET_FUNCTION>
```
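As one concrete instance of the template above, the sketch below fills in each slot using the `mpg` data frame that ships with ggplot2. The particular columns, the jitter position, and the facet variable are illustrative choices, not something the lessons prescribe.

```r
library(ggplot2)

# Scatterplot of engine displacement against highway mileage,
# colored by vehicle class, with one facet per drive type.
p <- ggplot(data = mpg) +
  geom_point(
    mapping = aes(x = displ, y = hwy, color = class),
    stat = "identity",       # plot the values as they are
    position = "jitter"      # add a little noise to separate duplicates
  ) +
  coord_cartesian() +        # the default coordinate system, made explicit
  facet_wrap(~ drv, nrow = 1)

print(p)
```

Swapping `coord_cartesian()` for `coord_flip()` or `coord_polar()` changes only the coordinate slot and leaves the rest of the specification untouched.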
4. Workflow: Basics
Objectives
- Assign values to variables.
- Call functions.
- Write readable code.
Key Points
- Use `name <- value` to assign a value to a variable (do not use `=`).
- Use `function(value1, value2)` to call a function.
- Construct variable names out of words joined with underscores `like_this_example`.
5. Data Transformation
Objectives
- Describe the five basic data transformation operations in the tidyverse and explain their purpose.
- Choose records by value using comparisons and logical operators.
- Explain why filter conditions shouldn't use `==`, and correctly use `%in%` instead.
- Explain the purpose of `NA`, how it affects arithmetic and logical operations, and how to test for it.
- Explain how filtering treats `NA` and how to obtain different behavior.
- Reorder records in ascending or descending order according to the values of one or more variables.
- Select a subset of variables for all records by variable name.
- Select and rename a subset of variables for all records by variable name.
- Add new variables to a frame by deriving new values from existing ones.
- Combine the values in a data frame to create one new value, or one new value per group.
- Explain how summarization treats missing values (`NA`) and how to change this behavior.
- Combine multiple transformations in order using pipe notation.
- Explain why and how to include counts in summarization as a check on the validity of conclusions.
- Name and describe half a dozen common summarization functions.
- Explain the relationship between grouping and summarization.
Key Points
- The five basic data transformation operations in the tidyverse are:
- `filter`: choose records by value(s).
- `arrange`: reorder records.
- `select`: choose variables by name.
- `mutate`: derive new variables from existing ones.
- `summarize`: combine many values to create a single new value.
- `filter(frame, ...criteria...)` keeps records that pass all of the specified criteria.
- `name == value`: records must have the specified value for the named variable (note `==` rather than `=`).
- `name > value`: the records' values must be greater than the given value (and similarly for `>=`, `!=`, `<`, and `<=`).
- Use `min_rank(name)` to rank values, giving the smallest values the smallest ranks.
- Use `near(expression, value)` to compare floating-point numbers rather than `==`.
- Use `&` (and) to require both conditions, `|` (or) to accept either condition, and `!` (not) to invert the sense of a condition.
- Use `name %in% c(value1, value2, ...)` to accept any of a fixed set of values for a variable.
- `NA` (meaning "not available") represents unknown values.
- Most operations involving `NA` produce `NA`, because there's no way to know the output without knowing the input.
- Use `is.na(value)` to determine if a value is `NA`.
- `filter` discards records with `FALSE` and `NA` results in tests.
- Use `is.na(value)` to include `NA`s explicitly.
- Use `desc(name)` inside `arrange` to order by descending value of the variable `name` instead of by ascending value.
- Use `select(frame, name1, name2, ...)` to select only the named variables from the given frame.
- Use `name1:name2` to select all variables from `name1` to `name2` (inclusive).
- Use `-(name1:name2)` to unselect all variables from `name1` to `name2` (inclusive).
- Use `rename(frame, new_name = old_name)` to rename variables in a frame while keeping all the others.
- Use `everything()` to select every variable that hasn't otherwise been selected.
- Use `one_of(c("name1", "name2"))` to select all of the variables named in the given vector.
- Use `mutate(frame, name1=expression1, name2=expression2, ...)` to add new variables to the end of the given frame.
- Use `transmute(frame, name1=expression1, name2=expression2, ...)` to create a new data frame whose values are derived from the values of an existing frame.
- Use `group_by(frame, name1, name2, ...)` to group the values of `frame` according to distinct combinations of the values of the named variables.
- Use `ungroup` to remove the grouping and restore the original (ungrouped) data.
- Use `summarize(frame, name = function(...))` to aggregate the values in an entire data frame or by group within a data frame.
- By default `summarize` produces `NA` as output when there are `NA`s in the input.
- Use `na.rm = TRUE` to remove `NA`s from the data before summarization.
- Use `frame %>% operation1(...) %>% operation2(...) %>% ...` to produce a new data frame by applying each operation to an existing one in order.
- Use `n()` (for a simple count) or `sum(!is.na(name))` (to count the number of non-`NA`s) when summarizing values in order to see how many records contribute to an aggregated result.
- Common summarization functions include:
- `mean` and `median`
- `sd` for standard deviation
- `min`, `quantile`, and `max` for extrema and intermediate values
- `first`, `nth`, and `last` for positional extrema and intermediate values
- `n_distinct` for the number of distinct values
- `count` to calculate counts or weighted sums
- Each summarization peels off one layer of grouping.
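The operations listed above can be chained into a single pipeline, as in the minimal sketch below. The `flights` tibble here is a tiny hypothetical stand-in (not the dataset used in the book), and the column names are assumptions made for illustration.

```r
library(dplyr)

# Hypothetical flight delays; the NA marks an unknown delay.
flights <- tibble(
  carrier = c("AA", "AA", "UA", "UA", "UA"),
  month   = c(1, 2, 1, 1, 12),
  delay   = c(10, NA, 5, 25, 40)
)

by_carrier <- flights %>%
  filter(month %in% c(1, 2)) %>%        # keep January and February only
  mutate(delay_hours = delay / 60) %>%  # derive a new variable
  group_by(carrier) %>%                 # one group per carrier
  summarize(
    n_records  = n(),                   # records contributing to each group
    n_known    = sum(!is.na(delay)),    # records with a known delay
    mean_delay = mean(delay, na.rm = TRUE)
  ) %>%
  arrange(desc(mean_delay))             # largest average delay first

print(by_carrier)
```

Reporting `n_records` and `n_known` next to `mean_delay` shows how few observations sit behind each average, which is the validity check the key points recommend.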
6. Workflow: Scripts
Objectives
- Use the RStudio editor to write, save, and run R scripts.
- Describe two things that should not be put in scripts.
- Explain how to spot and fix syntax errors in the RStudio editor.
Key Points
- Use Cmd/Ctrl + Enter in the editor to run the current R expression in the console.
- Use Cmd/Ctrl + Shift + S to run the complete script in the console.
- Do not put `install.packages` or `setwd` in scripts, since they will affect other people's machines when run.
- The RStudio editor uses white-on-red X's and red squiggly underlining to highlight syntax errors.
7. Exploratory Data Analysis
Objectives
- Describe the steps in exploratory data analysis (EDA).
- Describe two types of questions that are useful to ask during EDA.
- Correctly define variable, value, observation, variation, and tidy data.
- Explain what a categorical variable is and how best to store and visualize one.
- Explain what a continuous variable is and how best to store and visualize one.
- Explain why it is important to use a variety of bin widths when visualizing continuous variables as histograms.
- List three questions whose answers will help you understand your data.
- Describe and use a heuristic for identifying subgroups in data.
- Explain how to handle outliers or unusual values in data.
- Define covariation and describe how to visualize it for different combinations of two categorical and continuous variables.
- Explain how to make code clearer to experienced readers by omitting information.
Key Points
- Exploratory data analysis consists of:
- Generating questions about data.
- Searching for answers by visualizing, transforming, and modeling data.
- Using what is found to refine questions or generate new ones.
- Two questions that are always useful to ask during EDA are:
- What type of variation occurs within my variables?
- What type of covariation occurs between my variables?
- A variable is something that can be measured.
- A value is the state of a variable when measured.
- An observation is a set of measurements made under similar conditions, and may contain several values (each associated with a different variable).
- Variation is the tendency of values to differ from measurement to measurement.
- Tidy data is a set of values, each of which is associated with exactly one variable and observation. Tidy data is usually displayed in tabular form: each observation (or record) is a row, while each variable is a column with a name and a type.
- A categorical variable is one that takes on only one of a small set of values.
- Categorical variables are best represented using factors or character strings.
- The distribution of a categorical variable is best visualized using a bar chart (created using `geom_bar`).
- `dplyr::count(name)` counts the number of occurrences of each value of a categorical variable.
- A continuous variable is one that takes on any of an infinite set of ordered values.
- Continuous variables are best represented using numbers or date-times.
- The distribution of a continuous variable is best visualized using a histogram.
- `dplyr::count(ggplot2::cut_width(name, width))` divides occurrences into bins and counts the number of occurrences in each bin.
- Histograms with different bin widths can have very different visual appearances, so varying the bin width provides insight that no single bin width can.
- Use `geom_histogram(mapping=..., binwidth=value)` to vary the width of histogram bins.
- Or use `geom_freqpoly` to display histograms using lines instead of bars.
- Three questions to ask of any dataset are:
- Which values are most common (and why)?
- Which values are rare (and why)?
- What patterns are present in the data?
- How can you describe the pattern?
- How strong is it?
- Is it a coincidence?
- Does the pattern change if you examine subgroups of the data?
- Clusters of similar values suggest that data contains subgroups. To characterize these subgroups, ask:
- How are observations in each cluster similar?
- How do observations in different clusters differ?
- What might explain the existence of these clusters?
- How might the appearance of these clusters be misleading (e.g., an artifact of the visualization used)?
- If outliers are present, repeat each analysis with and without them.
- If there are only a few, and dropping them doesn't affect results, use `mutate` and `ifelse` to replace them with `NA`.
- If there are many, or dropping them changes results, account for them in analysis and reporting.
- Covariation is the tendency for some variables to vary in related ways.
- When visualizing the relationship between continuous and categorical variables:
- Displaying raw counts can be misleading if the number of items in different categories varies widely.
- Displaying densities (i.e., counts standardized so that the area of each curve is the same) can be more informative.
- Boxplots show less of the raw data, but are easier to interpret when there are many categories.
- Reorder unordered categorical variables to make trends easier to see.
- When visualizing the relationship between two categorical variables:
- Display the counts for each pairing of values (e.g., using `geom_count` or `geom_tile`).
- In general, put the categorical variable with the greater number of categories or the longer labels on the Y axis.
- When visualizing the relationship between two continuous variables:
- Use a scatterplot with jittering or transparency to handle datasets with up to hundreds of points.
- Use `geom_bin2d` or `geom_hex` to bin values in two dimensions.
- Bin one or both of the continuous variables so that visualizations for categorical variables can be used.
- Use `cut_width` or `cut_number` to bin continuous values by value range or number of values respectively.
- Omitting argument names for commonly-used functions makes code easier for experienced programmers to understand.
- The first two arguments to `ggplot` are the dataset and the mapping.
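Several of the points above (replacing unusual values with `NA`, binning a continuous variable, and omitting common argument names) can be sketched together as follows. The `measurements` data and the cutoff of 220 are hypothetical, chosen only to make the example small.

```r
library(dplyr)
library(ggplot2)

# Hypothetical measurements; 250 is an implausible value.
measurements <- tibble(height_cm = c(150, 152, 158, 161, 163, 169, 250))

# Replace the unusual value with NA rather than dropping the whole row.
cleaned <- measurements %>%
  mutate(height_cm = ifelse(height_cm > 220, NA, height_cm))

# Count the remaining observations in bins 10 units wide.
binned <- cleaned %>%
  filter(!is.na(height_cm)) %>%
  count(bin = cut_width(height_cm, 10))

# Histogram with argument names omitted: the first argument to ggplot()
# is the data and the second is the mapping.
hist_plot <- ggplot(cleaned, aes(height_cm)) +
  geom_histogram(binwidth = 10)

print(binned)
```

Re-running the last two steps with a different width (say 5 instead of 10) is a quick way to see how much the apparent shape of a distribution depends on the bins.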
8. Workflow: Projects
Objectives
- Explain why analysts should save scripts rather than environments.
- Explain what a working directory is and how to find what yours is.
- Explain why setting your working directory from within your script is a bad idea.
- Explain the difference between an absolute path and a relative path and the meaning of the symbol `~` in a path.
- Explain what an RStudio project is and how one is stored.
Key Points
- Analysts should save scripts rather than environments because it is much easier to reconstruct an environment from a script than to reconstruct a script from an environment.
- The working directory is the directory where R looks for and saves files by default, and is displayed by calling `getwd()`.
- Setting the working directory from within a script with `setwd` makes reproducibility more difficult because that directory may not exist on some other (person's) machine.
- An absolute path specifies a single location starting from the top of the filesystem.
- A relative path specifies a location starting from the current directory, and may identify different locations depending on where it is used.
- The symbol `~` refers to the user's home directory on macOS and Linux, and to the user's `Documents` directory on Windows.
- An RStudio project is a directory that contains the scripts and other files involved in an analysis.
- Each RStudio project contains a `.Rproj` file with information about the project.
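The difference between relative and absolute paths can be seen with a few base R calls. This is only a sketch: the file `data/survey.csv` is hypothetical and does not need to exist for the code to run.

```r
# The directory where R currently looks for and saves files.
working_dir <- getwd()

# A relative path: its meaning depends on the working directory.
relative <- file.path("data", "survey.csv")

# The absolute path that the relative path refers to right now.
absolute <- file.path(working_dir, relative)

# Expand ~ to the directory it stands for on this machine.
home <- path.expand("~")

print(relative)
print(absolute)
print(home)
```

Inside an RStudio project the working directory is the project directory, so scripts can use relative paths like the one above without ever calling `setwd`.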
10. Tibbles
Objectives
- Explain the relationship between a tibble and a `data.frame` and the main ways in which tibbles differ from `data.frame`s.
- Create tibbles from `data.frame`s and from scratch.
- Explain what a non-syntactic name is and how to create tibble columns with non-syntactic names.
- FIXME: explain how to use `tribble` (which requires an understanding of `~`).
- Display an arbitrary number of rows and columns of a tibble.
- Subset tibbles using `[[...]]`.
- Subset tibbles using `$`.
- FIXME: explain use of `[...]` (single bracket).
Key Points
- A tibble is a `data.frame` whose behaviors have been modified to work better with the tidyverse.
- Tibbles never change their inputs' types.
- Tibbles never adjust the names of variables.
- Tibbles evaluate their constructor arguments lazily and sequentially, so that later variables can use the values of earlier variables.
- Tibbles do not create row names.
- Tibbles only recycle inputs of length 1, because recycling longer inputs has been a frequent source of bugs.
- Tibbles can be created from `data.frame`s using `as_tibble` or from scratch using `tibble`.
- Use `is_tibble` to determine if something is a tibble or not.
- Use `class` to determine the classes of something.
- A non-syntactic name is one which is not a valid R variable name.
- To create a non-syntactic column name, enclose the name in back-quotes.
- Use `print` with `n` to set the number of rows and `width` to set the number of character columns.
- Use `name[["variable"]]` or `name$variable` to extract the column named `variable` from a tibble.
- Use `name[[N]]` to extract column `N` (integer) from a tibble.
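The behaviors described above can be tried out with a short sketch like the one below. The column names and values are hypothetical.

```r
library(tibble)

# Later columns can refer to earlier ones, and a length-1 input is recycled.
tb <- tibble(
  x = 1:3,
  y = x * 2,
  `my column` = "a"   # non-syntactic name, enclosed in back-quotes
)

# Convert an ordinary data.frame to a tibble.
from_df <- as_tibble(data.frame(a = 1:2, b = c("u", "v")))

print(is_tibble(from_df))
print(class(tb))

# Extract single columns by name or by position.
y_values   <- tb[["y"]]
x_values   <- tb$x
third_col  <- tb[[3]]

# Show only the first two rows, however wide the tibble is.
print(tb, n = 2, width = Inf)
```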
11. Data Import
Objectives
- Name six functions for reading tabular data and explain their use.
- Read CSV data files with multiple header lines, comments, missing headers, and/or markers for missing data.
- Explain how data reader functions determine whether they have extra and missing values, and how they handle them.
- Name four functions used to parse individual values and explain their use.
- Explain how to obtain a summary of parsing problems encountered by data reading functions.
- Define locale and explain its purpose and use.
- Define encoding and explain its purpose and use.
- Explain how `readr` functions determine column types.
- Set the data types of columns explicitly while reading data.
- Explain how to write well-formatted tabular data.
- Describe what information is lost when writing tibbles to delimited files and what formats can be used instead.
Key Points
- Use the following functions to read tabular data in common formats:
- `read_csv`: comma-delimited files.
- `read_csv2`: semicolon-delimited files.
- `read_tsv`: tab-delimited files.
- `read_delim`: files using an arbitrary delimiter.
- `read_fwf`: files with fixed-width fields.
- `read_table`: common fixed-width tabular formats with whitespace separators.
- Use `skip=n` to skip the first `n` lines of a file.
- Use `comment="#"` (or something similar) to ignore lines starting with `#`.
- Use `col_names=FALSE` to stop `read_csv` from interpreting the first row as column headers.
- Use `col_names=c("first", "second", "third")` to specify column names by hand.
- Use `na="."` (or something similar) to specify the value(s) used to mark missing data.
- Data reader functions use the number of values in the first row to determine the number of columns.
- Extra values in subsequent rows are omitted.
- Missing values in subsequent rows are set to `NA`.
- Use `parse_integer`, `parse_number`, `parse_logical`, and `parse_date` to parse strings containing integers, general numbers, Booleans, and dates.
- Use `na="."` (or similar) to specify the value(s) that should be interpreted as missing data.
- Use `problems(name)` to access the `problems` attribute of the output of data reading functions.
- A locale is a collection of linguistic and/or regional settings for information formats, such as Canadian English or Brazilian Portuguese.
- Use `locale(...)` to specify such things as the separator character used in long numbers.
- An encoding is a specification of how characters are represented digitally, such as ASCII or UTF-8.
- Specify `encoding="name"` in `locale(...)` when parsing data to interpret the character data correctly.
- UTF-8 is now the most commonly-used character encoding scheme.
- Data reading functions read the first 1000 rows of the dataset and use the heuristics embodied in `guess_parser` to guess the type of each column.
- Use `col_types=cols(...)` to manually specify the types of the columns of a data file.
- Use `name = col_double()` to set the type of the column called `name` to `double`.
- Use `name = col_date()` to set the type of the column called `name` to `date`.
- Use `write_csv` to write data in comma-separated format and `write_tsv` to write it in tab-separated format.
- Use `write_excel_csv` to write CSV with extra information so that it can immediately be loaded by Microsoft Excel.
- Use `na="marker"` to specify how `NA` should be shown in the output.
- Delimited file formats only store column names, not column types, so the latter have to be re-guessed when the file is re-read.
- (Old) Saving data in R's custom binary format RDS will save type information.
- (New) Saving data in the cross-language Feather format will also save type information, and this data can be read in multiple languages.
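The reading options above can be combined in one call, as in the sketch below. The file contents are hypothetical and are written to a temporary file so that the example is self-contained; the column names are assumptions made for illustration.

```r
library(readr)

# Write a small hypothetical file: one header line to skip, one comment,
# the column names, and two rows in which "." marks a missing value.
path <- tempfile(fileext = ".csv")
writeLines(c(
  "exported by a hypothetical instrument",
  "# a comment line",
  "id,reading,taken",
  "1,2.5,2017-01-01",
  "2,.,2017-01-02"
), path)

readings <- read_csv(
  path,
  skip = 1,              # skip the first line of the file
  comment = "#",         # ignore lines starting with #
  na = ".",              # treat "." as missing
  col_types = cols(      # set column types explicitly instead of guessing
    id      = col_integer(),
    reading = col_double(),
    taken   = col_date()
  )
)

# Summary of any parsing problems (empty if everything parsed cleanly).
print(problems(readings))

# Parse individual values.
amount <- parse_number("$1,234.50")
day    <- parse_date("2017-01-15")

# Write the data back out, choosing how NA is shown.
out_path <- tempfile(fileext = ".csv")
write_csv(readings, out_path, na = "missing")
```

Because the written file stores only column names, reading `out_path` back in would require the same `col_types` specification (and `na = "missing"`) to recover the original types.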
12. Tidy Data
Objectives
- Describe the three rules tabular data must obey to be considered "tidy", and the advantages of storing data this way.
- Explain what gathering data means and use gather operations to tidy datasets.
- Explain what spreading data means and use spread operations to tidy datasets.
- Explain what separating data means and use separate operations to tidy datasets.
- Explain what uniting data means and use unite operations to tidy datasets.
- Describe two ways in which values can be missing from a dataset.
- Explain how to complete a dataset and use completion operations to tidy datasets.
- Explain why it can be useful to carry values forward and use this to tidy datasets.
Key Points
- Tidy data obeys three rules:
- Each variable has its own column.
- Each observation has its own row.
- Each value has its own cell.
- Tidy data is easier to process because:
- No subsidiary processing is required (e.g., to split names into personal and family names).
- Each column can be processed independently (e.g., there's no need to choose the type of processing based on a "type" field in another column).
- To gather data means to take N columns whose names are actually values and transform them into 2 columns where the first column holds the former column names and the second holds the values.
- Use `gather(name, name, ..., key="key_name", value="value_name")` to transform the named columns into two columns with names `key_name` and `value_name`.
- To spread data means to take two columns, the N values in the first of which identify the meanings of the values in the second, and create N columns, one for each of the distinct values in the first column.
- Use `spread(key=first, value=second)` to spread the values in `second` according to the keys in `first`.
- To separate data means to split one column into multiple columns.
- Use `separate(name, into=c("first", "second", ...))` to separate the values in one column to create multiple new columns.
- To unite data means to combine the values of two or more columns into a single column.
- Use `unite(new_name, first, second, ...)` to combine the named columns to create a column named `new_name`.
- Values will be combined with `_` unless `sep="#"` (or similar) is used (with `sep=""` to unite without a separator).
- Use `convert=TRUE` with these functions to (try to) convert data types.
- Values can be explicitly missing (the presence of an absence) or their entries can be missing entirely (the absence of a presence).
- To complete a dataset means to fill in missing combinations of values.
- Use `complete(first, second, ...)` to fill in missing combinations of the values from the named columns.
- Missing values sometimes indicate that the most recent value should be carried forward.
- Use `fill(first, second, ...)` to carry the most recent observation(s) forward in the named column(s).
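The tidying operations above can be tried on very small tables, as in the following sketch. All of the data is hypothetical, and the functions are the `gather`/`spread` family named in these notes.

```r
library(tidyr)
library(tibble)

# Hypothetical case counts with years stored as column names.
wide <- tibble(
  country = c("A", "B"),
  `1999`  = c(10, 20),
  `2000`  = c(15, 25)
)

# Gather the two year columns into key/value pairs.
long <- gather(wide, `1999`, `2000`,
               key = "year", value = "cases", convert = TRUE)

# Spread them back out: one column per distinct year.
wide_again <- spread(long, key = year, value = cases)

# Separate one column into two, then unite them again.
rates  <- tibble(rate = c("10/100", "20/200"))
parts  <- separate(rates, rate, into = c("cases", "population"), convert = TRUE)
joined <- unite(parts, rate, cases, population, sep = "/")

# Make an implicitly missing combination explicit, then carry values forward.
returns   <- tibble(year = c(2015, 2015, 2016), qtr = c(1, 2, 1),
                    ret = c(1.5, NA, 0.5))
completed <- complete(returns, year, qtr)  # adds a row for 2016, quarter 2
filled    <- fill(completed, ret)          # replaces each NA with the value above it

print(long)
print(filled)
```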
13. Relational Data
Objectives
- Define relational data and explain what keys are and how they are used when processing it.
- Explain the difference between a primary key and a foreign key, and explain how to determine whether a key is actually a primary key.
- Explain what a surrogate key is and why surrogate keys are sometimes needed.
- Explain how relations are represented in relational data and describe three types of relations.
- Define a mutating join and use mutating joins to combine information from two tables.
- Define four kinds of joins and use each to combine information from two tables.
- Explain what joins do if some keys are duplicated, and when this might occur.
- Describe and use some common criteria for joins.
- Define a filtering join, describe two types of filtering joins, and use them to combine information from two tables.
- Describe the difference between how mutating joins and filtering joins behave in the presence of duplicated keys.
- Describe three steps for identifying keys in tables that can be used in joins.
- Describe and use three set operations on records.
Key Points
- Relational data is made up of sets of tables that are related in some way.
- A key is a variable or set of variables whose values uniquely identify observations in a table.
- Keys are used to connect observations in one table to observations in another.
- A primary key uniquely identifies an observation in its own table.
- Use `count(name)` and `filter(n > 1)` to identify multiple occurrences of what is supposed to be a primary key.
- A foreign key uniquely identifies an observation in some other table, and is used to connect information between those tables.
- A surrogate key is an arbitrary identifier associated with an observation (such as a row number) that has no real-world meaning.
- Surrogate keys are sometimes added to data when the data itself has no valid primary keys.
- Relations are represented by matching primary keys in one table to foreign keys in another. Relations can be:
- One-to-one (or 1-1), meaning there is exactly one matching value in each table.
- One-to-many (or 1-N), meaning that each value in one table may have any number of matching values in another.
- Many-to-many (or N-N), meaning that there may be many matching values in each table.
- A mutating join updates one table with corresponding information from another table.
- An inner join combines observations from two tables when their keys are equal, discarding any unmatched rows.
- Use `inner_join(left, right, by="name")` to join tables `left` and `right` on equal values of the column `name`.
- A left outer join (or simply left join) combines observations when keys are equal, keeping rows from the left table even if there are no corresponding values from the right table.
- Missing values from the right table are assigned `NA` in the result.
- Use `left_join` with arguments as above.
- A right outer join (or simply right join) does the same, but keeps rows from the right table even when rows from the left are missing.
- Use `right_join` with arguments as above.
- A full outer join (or simply full join) keeps all rows from both tables, filling in for gaps in either.
- Use `full_join` with arguments as above.
- If a key is duplicated in one or both tables, a join will produce all combinations of records with that key.
- This often arises when a key is a primary key in one table and a foreign key in another.
- If keys are duplicated in both tables, it may be a sign that the data is corrupt or that the supposed key actually isn't one.
- A natural join combines tables using equal values for all columns with identical names.
- Use `by=NULL` in a join function to force a natural join.
- Use `by=c("name1", "name2", ...)` to join on equal values of named columns.
- Use `by=c("a" = "b", "c" = "d", ...)` to join on columns with different names.
- Use `suffix=c("name1", "name2")` to override the default `.x`, `.y` suffixes used for name collisions.
- A filtering join is one that keeps (or discards) observations from one table based on whether they match (or do not match) observations in a second table.
- Use `semi_join(left, right)` to keep rows in `left` that have matches in `right`.
- Use `anti_join(left, right)` to keep rows in `left` that do not have matches in `right`.
- Because they only keep or discard rows, filtering joins never create duplicate entries, while mutating joins can if keys are duplicated.
- Three steps for identifying keys in tables that can be used in joins are:
- Identify the variable or variables that form the primary key for each table based on an understanding of the data.
- Check that each table's primary key has no missing values.
- Check that possible foreign keys match primary keys in other tables (e.g., by using `anti_join` to look for missing matches).
- Three set operations that work on entire records are:
- `union(left, right)`: returns unique observations from either or both tables.
- `intersect(left, right)`: returns unique observations that are in both tables.
- `setdiff(left, right)`: returns observations that are in `left` but not in `right`.
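
The joins described above can be sketched as follows. This is a minimal illustration, not part of the original guide: the tables `people` and `visits` and their columns are hypothetical, and `dplyr` is assumed to be installed.

```r
library(dplyr)

# Hypothetical tables: `id` is the primary key of `people`;
# `person_id` in `visits` is a foreign key that refers to it.
people <- tibble(id = c(1, 2, 3), name = c("Ana", "Ben", "Cho"))
visits <- tibble(person_id = c(1, 1, 2, 4), site = c("A", "B", "A", "C"))

# Check that `id` really is a primary key: this should have zero rows.
duplicated_ids <- people %>% count(id) %>% filter(n > 1)

# Mutating joins add columns from `visits` to `people`.
inner <- inner_join(people, visits, by = c("id" = "person_id"))
left  <- left_join(people, visits, by = c("id" = "person_id"))
full  <- full_join(people, visits, by = c("id" = "person_id"))

# Filtering joins only keep or discard rows of the left table.
with_visits    <- semi_join(people, visits, by = c("id" = "person_id"))
without_visits <- anti_join(people, visits, by = c("id" = "person_id"))

# Foreign keys that have no matching primary key (here, person 4).
orphans <- anti_join(visits, people, by = c("person_id" = "id"))
```

Because person 1 appears twice in `visits`, the inner join has more rows than there are matched people, while `semi_join` returns each matched person once.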
14. Strings
Objectives
- Write character strings in R, including ones that contain special characters.
- Write multiple strings to the terminal, respecting escaped characters.
- Use functions from the `stringr` package to perform basic operations on strings.
- Explain what a regular expression is and what kinds of patterns they can match.
- Describe two functions that implement regular expressions and use them to match simple patterns against text.
- Describe nine patterns provided by regular expressions.
- Capture subsections of matched text in regular expressions and re-match captured text within a pattern.
- Detect and extract matches between a pattern and the strings in a vector.
- Replace substrings that match regular expressions.
- Split strings based on regular expression matches.
- Locate substrings that match regular expressions.
- Control matching options in regular expressions.
- Find objects in the global environment whose names match a regular expression.
- Find files and directories whose names match a regular expression.
Key Points
- Character strings in R are enclosed in matching single or double quotes.
- Use backslash to escape special characters such as `\"`, `\n`, and `\\`.
- Use `writeLines` to display a string or a vector of strings with special characters interpreted.
- Use `str_length` to get a string's length.
- Use `str_c` to concatenate strings.
- Use `str_sub` to extract or replace substrings.
- Use `str_to_lower`, `str_to_upper`, and `str_to_title` to change the case of strings.
- Use `str_sort` to sort a vector of strings and `str_order` to get the ordered indices of the strings in a vector.
- Use `str_pad` to pad a string to fit a specified width and `str_trim` to remove leading and trailing whitespace.
- A regular expression is a pattern that matches text.
- Regular expressions are written as text using punctuation and other characters to express choice, repetition, and other operations.
- Regular expressions can express patterns that have fixed nesting, but not patterns that have unlimited nesting (such as nested parenthesization).
- Use `str_view(text, pattern)` to find the first match of `pattern` to `text` and `str_view_all` to view all matches.
- Nine patterns used in regular expressions are:
- `.` matches any single character.
- `\` escapes the character that follows it.
- `^` and `$` match the beginning and end of the string respectively (without consuming any characters).
- Use `\d` to match digits and `\s` to match whitespace.
- Use `[abc]` to match any single character in a set and `[^abc]` to match any character not in a set.
- Use `left|right` to match either of two patterns.
- Use `{M,N}` to repeat a pattern M to N times.
- Use `?` to signal that a pattern is optional (i.e., repeated zero or one times), `*` to repeat a pattern zero or more times, and `+` to repeat a pattern at least once.
- Use parentheses `(...)` for grouping, just as in mathematics.
- Every set of parentheses in a regular expression creates a numbered capture group.
- Use `\1`, `\2`, etc. to refer to capture groups within a pattern in order to match the same actual text two or more times.
- Use `str_detect(strings, pattern)` to create a logical vector showing where a pattern does or doesn't match.
- Use `str_subset(strings, pattern)` to select the subset of strings that match a pattern and `str_count` to count the number of matches.
- Use `str_extract(strings, pattern)` to extract the first match for the pattern in each string.
- Use `str_extract_all(strings, pattern)` to extract all matches for the pattern in each string.
- Use `str_match(string, pattern)` to extract parenthesized sub-matches for a pattern.
- Use `tidyr::extract` to extract parenthesized sub-matches from a tibble into new columns.
- Use `str_replace` or `str_replace_all` to replace substrings that match regular expressions.
- Use `str_split` to split a string based on regular expression matches.
- Use `str_locate` and `str_locate_all` to find the starting and ending positions of substrings that match regular expressions.
- Use `regex` explicitly to construct a regular expression and control options such as multi-line matches and embedded comments.
- Use `apropos` to find objects in the global environment whose names match a regular expression.
- Use `dir` to find objects in the filesystem whose names match a regular expression.
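
A short sketch of several of these functions in use may help when preparing a demo. The `notes` vector and the phone-number-like pattern are made up for illustration; `stringr` is assumed to be installed.

```r
library(stringr)

# Hypothetical data: strings that may contain numbers like 555-1234.
notes <- c("call 555-1234 now", "no number here", "fax 555-9876 or 555-0000")

# Two capture groups: three digits, a hyphen, then four digits.
pattern <- "(\\d{3})-(\\d{4})"

has_number  <- str_detect(notes, pattern)          # which strings match at all
n_matches   <- str_count(notes, pattern)           # how many matches per string
first_match <- str_extract(notes, pattern)         # first match, or NA
parts       <- str_match(notes, pattern)           # matrix: full match plus groups
masked      <- str_replace_all(notes, "\\d", "#")  # replace every digit
words       <- str_split(notes[1], " ")[[1]]       # split the first string on spaces

# Backreference: \\1 re-matches whatever the first group captured,
# so this finds strings containing a repeated pair of characters.
repeated <- str_subset(c("banana", "apple", "kiwi"), "(..)\\1")
```

Note that backslashes are doubled because the pattern is written as an R string before it is interpreted as a regular expression.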
15. Factors
Objectives
- Define factor and explain the purpose of factors in R.
- Create and (re-)order factors.
- Determine the valid levels of a factor.
- Rename the levels of a factor.
Key Points
- A factor is a variable that can take on one of a fixed set of values.
- Factors are ordered, but the order is not necessarily alphabetical.
- Use `factor(values, levels)` to create a vector of factors by matching strings in `values` to level names in `levels`.
- Values that don't match level names are converted to `NA`.
- The idiom `factor(values, unique(values))` orders the factors according to their first appearance in the data.
- Use `fct_reorder(factor, values)` to reorder a factor according to a set of numeric values.
- Use `levels(factor)` to recover the valid levels of a factor.
- Use `fct_relevel(factors, "levels")` to move the named levels to the front of the list of factors (e.g., for display purposes).
- Use `fct_infreq` to reorder factors by frequency.
- Use `fct_rev` to reverse the order of factors.
- Use `fct_recode(factor, "new_name_1" = "old_name_1", "new_name_2" = "old_name_2", ...)` to rename some or all levels of a factor.
- Assigning several old levels to a single new level combines entries.
- Use `fct_collapse(factors, new_name = c("old_name_1", "old_name_2"), ...)` to collapse many levels at once.
- Use `fct_lump(factor, n=N)` to combine the least frequent levels into a single `Other` level, keeping the `N` most common levels.
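
The following sketch exercises the factor functions listed above on a small made-up vector of sizes. It assumes `forcats` is installed; the data and level names are illustrative only.

```r
library(forcats)

# A value that is not a listed level ("huge") becomes NA.
f_na <- factor(c("small", "huge"), levels = c("small", "medium", "large"))

# Hypothetical data with an explicit, non-alphabetical level order.
sizes <- c("small", "large", "medium", "small", "small", "large")
f <- factor(sizes, levels = c("small", "medium", "large"))

by_freq   <- fct_infreq(f)            # most common level first
reversed  <- fct_rev(f)               # reverse the level order
front     <- fct_relevel(f, "large")  # move "large" to the front
renamed   <- fct_recode(f, "S" = "small", "M" = "medium", "L" = "large")
collapsed <- fct_collapse(f, not_small = c("medium", "large"))
lumped    <- fct_lump(f, n = 1)       # keep the most common level, lump the rest
```

Calling `levels()` on each result is a quick way to show learners that only the levels change, not the underlying observations.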
16. Dates and Times
Objectives
- Describe three types of data that refer to an instant in time.
- Get the current date and time.
- Describe and use three ways to create a date-time.
- Convert dates to date-times and date-times to dates.
- Describe and use eight accessor functions to extract components of dates and date-times.
- Describe and use three functions for rounding dates.
- Explain how to modify components of dates and date-times.
- Explain an idiom for exploring patterns in the lower-order components of date-times.
- Explain how the difference between two moments in time is represented in base R and when using `lubridate`.
- Explain the difference between a difftime, a period, and an interval.
- Determine your current timezone.
Key Points
- Instants in time are described by date, time, and date-time.
- Use `today` to get the current date and `now` to get the current date-time.
- A date-time can be created from a string, from individual date and time components, or from an existing date-time.
- Use `lubridate` functions such as `ymd` or `dmy` to parse year-month-day dates.
- Use functions such as `ymd_hms` to parse full date-times.
- Supplying a timezone with `tz="XYZ"` forces the creation of a date-time instead of just a date.
- Use `make_date` or `make_datetime` to construct a date or date-time from numeric components.
- Use `as_datetime` to convert a date to a date-time and `as_date` to convert a date-time to a date.
- Use the following accessor functions to extract components from dates and date-times:
- `year` and `month`
- `yday` (day of the year), `mday` (day of the month), and `wday` (day of the week)
- `hour`, `minute`, and `second`
- Use `floor_date`, `round_date`, and `ceiling_date` to round dates down, to nearest, or up to a specified unit.
- Use an accessor function on the left side of assignment to modify a portion of a date or date-time in place.
- E.g., use `year(x) <- 2018` to set the year of a date or date-time to 2018.
- Use `update(existing, name=value, ...)` to create a new date-time with modified values.
- Use `update` to set the higher-order components of date-times to a constant in order to explore the variation in the lower-order components.
- A difftime represents the difference between two moments in time; base R may record it in seconds, minutes, hours, days, or weeks.
- Use `as.duration(difftime)` to convert to a `lubridate` duration, which always uses seconds to represent differences in times.
- Use `dyears`, `dseconds`, etc. to construct differences explicitly.
- A period represents the difference between two times taking human factors into account (such as daylight saving time).
- An interval is a duration with a starting point, which makes it precise enough that its exact length can be determined.
- Use `Sys.timezone()` to determine your current timezone.
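
A minimal sketch of these operations, assuming `lubridate` is installed. The specific dates are arbitrary and chosen only so that the results are easy to check by hand.

```r
library(lubridate)

# Three ways to create a date or date-time.
d    <- ymd("2018-03-15")                              # from a string
dt   <- ymd_hms("2018-03-15 13:45:30", tz = "UTC")     # full date-time
made <- make_datetime(2018, 3, 15, 13, 45, 30)         # from components (UTC by default)

# Accessors extract components.
yr  <- year(dt)
doy <- yday(dt)   # day of the year
hr  <- hour(dt)

# Rounding to a unit.
hour_start <- floor_date(dt, "hour")
next_month <- ceiling_date(d, "month")

# update() returns a modified copy; the original is unchanged.
shifted <- update(dt, year = 2020, hour = 0)

# Subtracting dates gives a base R difftime; as.duration() converts it to seconds.
gap     <- ymd("2018-03-15") - ymd("2018-03-01")
gap_dur <- as.duration(gap)

# A period works in human units such as months.
one_month_later <- d + months(1)
```

Comparing `gap_dur` with `ddays(14)` is one way to show that durations are stored as an exact number of seconds.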
18. Pipes
Objectives
- Describe the pros and cons of four ways to write successive operations on data.
- Explain the use of `%T>%`, `%$%`, and `%<>%`.
Key Points
- Four ways to write successive operations on data are:
- Save each intermediate step as a new object: a lot of typing with many opportunities for transposition mistakes.
- Overwrite the original object many times: loss of originals makes debugging difficult, and repetition of a single name makes reading difficult.
- Compose functions: unnatural reading order and parameters widely separated from function names.
- Use the pipe `%>%`: simple to read if the transformations are sequential and applied to a single main stream of data.
- `%T>%` ("tee") returns its left side rather than its right.
- `%$%` unpacks the variables in a `data.frame` (which is useful when calling functions in base R that don't rely on `data.frame`s).
- `%<>%` assigns the result back to the starting variable.
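
The styles compared above can be shown side by side on a tiny example. This sketch assumes `magrittr` is installed; the `values` vector is made up, and `mtcars` is the data set that ships with R.

```r
library(magrittr)

values <- c(4, 9, 16)

# 1. Save each intermediate step as a new object.
roots   <- sqrt(values)
total_1 <- sum(roots)

# 2. Compose functions (read from the inside out).
total_2 <- sum(sqrt(values))

# 3. Use the pipe (read from left to right).
total_3 <- values %>% sqrt() %>% sum()

# Tee: print() is called for its side effect, then the roots flow on to sum().
total_4 <- values %>% sqrt() %T>% print() %>% sum()

# Exposition: make the columns of a data frame visible to cor().
r <- mtcars %$% cor(mpg, wt)

# Assignment pipe: overwrite `values` with the result.
values %<>% sqrt()
```

All four totals are equal; the styles differ in readability, not in result.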
19. Functions (and Control Flow)
Objectives
- Explain the benefits of creating functions.
- Describe three steps in the creation of a function.
- Define functions of zero or more arguments.
- Describe three rules that function names should follow.
- Describe the difference between data and details in function arguments.
- Define conditional statement and write conditional statements with multiple branches and a default branch.
- Explain what a short-circuit operator is and write conditions using these operators.
- Define precondition and implement preconditions in functions.
- Write functions that take (and pass on) a varying number of arguments.
- Describe and use two ways to return values from functions.
- Implement pipeable functions that perform transformations or have side effects.
Key Points
- Create functions to eliminate duplicated code, make programs more readable, and simplify maintenance and evolution.
- When creating a function, select a name, decide on its arguments, and write its body.
- Function names should:
- Prefer verbs (actions) to nouns (things).
- Use full words and consistent typography.
- Be consistent with other functions in the same package or application.
- Arguments to functions are (broadly speaking) either:
- Data to be operated on (come first).
- Details controlling how the function operates (come last, and should have default values).
- When overriding the value of a default, use the full name of the argument.
- A conditional statement may or may not execute code depending on whether a condition is true or false.
- Each conditional statement must have one `if`, zero or more `else if`, and zero or one `else` in that order.
- Each branch except the `else` must have a logical condition that determines whether it is selected.
- Branch conditions are tested in order, and only the code associated with the first branch whose condition is true is executed.
- If no condition is true, and an `else` is present, the code in the `else` branch is executed.
- Conditions must be `TRUE` or `FALSE`, not vectors or `NA`s.
- A short-circuit operator stops evaluating terms as soon as it knows whether the overall value is `TRUE` or `FALSE`.
- "and", written `&&`, stops as soon as a term is `FALSE`.
- "or", written `||`, stops as soon as a term is `TRUE`.
- Use the functions `any`, `all`, and `identical` to collapse vectors into single values for testing.
- Always indent the bodies of conditionals and functions (preferably by two spaces) and obey style rules for placement of curly braces.
- A precondition is something that must be true of a function's inputs in order for the function to work correctly.
- Use `if` and `stop` to check that inputs are sensible before processing them, and generate a meaningful error message when they are not.
- Or use `stopifnot` to check that one or more conditions are true (without generating a custom error message).
- Use `...` (three dots) as a placeholder for zero or more arguments, which can then be passed into other functions.
- Use `list(...)` to convert the actual arguments to a list for processing.
- A function in R returns either:
- An explicit value when `return(value)` is called.
- The value of the last expression evaluated if no explicit `return` was executed.
- To make a function pipeable:
- For a transformation, take the data to be transformed as the first argument and return a modified object.
- For a side effect, perform the operation (e.g., save to a file) and use `invisible(value)` to return the value without printing it.
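
Several of these points can be combined in one small example. The functions `center` and `show_range` below are hypothetical helpers written for this sketch (they are not from the book), and only base R is used.

```r
# Hypothetical helper: subtract the mean or median from a numeric vector.
# `x` is the data (first argument); `by` is a detail with a default value;
# `...` is passed on to mean() or median() (e.g., na.rm = TRUE).
center <- function(x, by = "mean", ...) {
  # Preconditions: fail early with a meaningful message.
  if (!is.numeric(x)) {
    stop("`x` must be numeric, not ", typeof(x))
  }
  stopifnot(is.character(by), length(by) == 1)

  # `||` short-circuits: all() is only evaluated when x is non-empty.
  if (length(x) == 0 || all(is.na(x))) {
    return(x)  # explicit early return
  }

  if (by == "mean") {
    middle <- mean(x, ...)
  } else if (by == "median") {
    middle <- median(x, ...)
  } else {
    stop("unknown value for `by`: ", by)
  }

  x - middle  # the value of the last expression is returned
}

# Hypothetical pipeable side-effect function: report, then return the input invisibly.
show_range <- function(x) {
  cat("range:", range(x, na.rm = TRUE), "\n")
  invisible(x)
}

centered <- center(show_range(c(1, 2, 10)), by = "median")
```

Writing `by = "median"` with the full argument name follows the guideline above for overriding defaults.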
20. Vectors
Objectives
- Define atomic vector and list, explain the differences between them, and give examples of each.
- Explain what `NULL` is used for and how it differs from `NA`.
- Describe the values that logical vectors can contain and how they are usually constructed.
- Describe the values that integer and double vectors can contain and the special values that each type can contain.
- Describe the values that character vectors can contain.
- Explain the difference between explicit coercion and implicit coercion and use the former to convert values from one type to another.
- Explain the rule used to determine the type of a vector explicitly constructed out of values of different types.
- Define recycling and correctly identify and interpret uses of it.
- Recycle values explicitly.
- Give vector elements names and explain when and why this is useful.
- Describe six ways to subset a vector.
- Explain the difference between single `[...]` and double `[[...]]`.
- Create and inspect lists.
- Subset lists.
- Define attribute and augmented vector.
- Describe three ways vector attributes are used in R.
- Explain how factors are implemented using augmented vectors.
- Explain how tibbles are implemented using augmented vectors.
Key Points
- An atomic vector is a homogeneous structure that holds logical, integer, double, character, complex, or raw data.
- A list (sometimes called a recursive vector) is a vector that can hold heterogeneous data, including other vectors.
- The special value `NULL` represents the absence of a vector, or a vector of length zero, while `NA` represents the absence of a value.
- Use `typeof(thing)` to obtain the name of the type of `thing`.
- Use `is_logical`, `is_integer`, and similarly-named functions to test the types of values.
- Use `is_scalar_integer` and similarly-named functions to test the type of a value and whether it is scalar or vector.
- Use `length(thing)` to obtain the (integer) length of `thing`.
- Logical vectors can contain `TRUE`, `FALSE`, and `NA`, and are often constructed using Boolean expressions such as comparisons.
- Integer vectors contain integer values, which should be used for counting.
- To force a value to be stored as an integer, write it without a decimal portion and put `L` after it (for "long").
- Integer vectors can contain the special value `NA`.
- Use `is.na` to check for this special value.
- Double vectors contain floating-point numbers, which should be used for measurement.
- Double vectors can contain the special values `NA`, `NaN` (not a number), `Inf` (infinity), and `-Inf` (negative infinity).
- Use `is.finite`, `is.infinite`, and `is.nan` to check for these special values.
- Character vectors can contain character strings, each of which can be arbitrarily long.
- Explicit coercion is the use of a function to convert values from one type to another.
- Use `as.logical`, `as.character`, `as.integer`, or `as.double` to create a new vector containing the converted values from an original.
- Implicit coercion occurs when a value or vector of one type is used where another type is expected.
- The function `c(value1, value2, ...)` creates a vector whose type is the most complex of the types of the provided values.
- In order of increasing complexity, types are logical - integer - double - character.
- To recycle values is to re-use those from the shorter vector involved in an operation to match the length of the longer vector.
- A "scalar" in R is actually a vector of length 1, and most recycling involves replicating a scalar to have the same length as a vector.
- Base R produces a warning if the length of the longer is not an integer multiple of the length of the shorter.
- Tidyverse functions throw errors in this case to forestall unexpected results.
- Use `rep(values, times)` to recycle (or repeat) values explicitly.
- Some or all vector elements can be given names when the vector is constructed using `c(name1=value1, ...)`.
- Use `purrr::set_names(vector, names)` to set the names of a vector's values after the fact.
- A vector can be subsetted using `[...]` in six ways:
- Subsetting with a vector of positive integers selects those elements in order (possibly with repeats).
- Subsetting with a vector of negative integers selects all elements except those identified.
- Subsetting with zero creates an empty vector.
- Subsetting with a logical vector keeps values corresponding to `TRUE` elements of the logical vector.
- Subsetting with a character vector keeps only those values with the given names (possibly with repeats).
- Using an empty subscript `[]` returns the entire vector.
- Create lists using `list(value1, value2, ...)` and inspect their structure with `str(name)`.
- Subsetting a list with `[...]` always returns a list.
- Subsetting a list with `[[...]]` returns a single component (i.e., has one less level of nesting than the original).
- `list$name` does the same thing as `[[...]]` for a named element of a list.
- An augmented vector is one that has extra named attributes attached to it.
- Get the value of a vector attribute using `attr(vector, name)`.
- Set the value of a vector attribute using `attr(vector, name) <- value`.
- Use `attributes(name)` to display all of the attributes of a vector.
- Vector attributes are used to:
- Name the elements of a vector (`names`).
- Store dimensions to make a vector behave like a matrix.
- Store a class name to implement classes in the S3 object-oriented system.
- A factor is an integer vector that has the class `factor` and a `levels` attribute with the factors' names.
- A tibble is a list with three classes and `names` and `row.names` attributes.
- All elements of a tibble must be vectors having identical lengths.
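
The points about types, coercion, recycling, and subsetting can be checked interactively with a few lines of base R. The vectors below are made up for illustration.

```r
# Types: integer needs the L suffix; plain numbers are doubles.
type_int <- typeof(1L)
type_dbl <- typeof(1)

# c() coerces to the most complex type present (here, character).
mixed <- c(TRUE, 1L, 2.5, "a")

# Recycling: the shorter vector is repeated to match the longer one.
recycled <- c(1, 2, 3, 4) + c(10, 20)
explicit <- rep(c(10, 20), times = 2)

# Six ways to subset a named vector.
x <- c(a = 1, b = 2, c = 3)
pos   <- x[c(1, 1, 3)]   # positive integers (repeats allowed)
neg   <- x[-1]           # everything except the first element
none  <- x[0]            # empty vector
lgl   <- x[x > 1]        # logical vector
chr   <- x[c("c", "a")]  # names
whole <- x[]             # empty subscript: the entire vector

# Lists: [ returns a list, [[ and $ return the component itself.
lst <- list(n = 1:3, s = "text")
sub_list  <- lst["n"]
component <- lst[["n"]]

# A factor is an integer vector with class and levels attributes.
f <- factor(c("lo", "hi", "lo"))
f_type <- typeof(f)
f_attr <- attributes(f)
```

Running `str(lst)` and `attributes(f)` at the console is a useful follow-up when demonstrating how lists and augmented vectors are put together.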
21. Iteration
Objectives
- Describe the parts of a simple `for` loop.
- Create empty vectors of a given type and length.
- Explain why it is safer to use `seq_along(x)` than `1:length(x)`.
- Write loops that iterate over the columns of a tibble using either indices or names.
- Describe and use an efficient way to write loops when the size of the eventual output cannot be known in advance.
- Explain how to write a loop when the number of required iterations is not known in advance.
- Describe what happens when looping over the names of a vector that has some unnamed elements.
- Explain what higher-order functions are, explain why they're useful, and write higher-order functions.
- Describe the `map` family of functions and their purpose, and rewrite simple `for` loops to use `map` functions.
- Describe the purpose and use of the `safely`, `possibly`, and `quietly` functions.
- Describe and use `map2` and `pmap`.
- Describe and use `walk`, `walk2`, and `pwalk`.
- Define predicate function and describe higher-order functions that work with predicate functions.
- Define reduction and use `reduce` to implement it.
- Define accumulation and use `accumulate` to implement it.
Key Points
- A `for` loop usually has:
- A variable whose value changes for each iteration of the loop.
- A set of values being iterated over (such as the indices of a vector).
- A body that is executed once for each iteration.
- An output variable where results are stored (whose space is usually preallocated for efficiency).
- Use `vector("type", length)` to generate a vector of the specified type and length (usually to be filled in later).
- `1:length(x)` is non-empty when `x` is empty; `seq_along(x)` is empty when `x` is empty, and so is better to use in loop controls.
- To loop over the columns of a tibble:
- Use `for (variable in seq_along(tibble))` to loop over the numeric indices of the columns.
- Use `for (variable in names(tibble))` to loop over the names of the columns.
- When the size of a loop's eventual output cannot be known in advance, use a list to collect partial results and then `unlist` or `purrr::flatten_dbl` to combine them into a vector.
- If values are tables, collect them in a list and use `bind_rows` to combine them all after the loop.
- If the number of required iterations is not known in advance, use a `while` loop instead of a `for` loop.
- Make sure that the condition of the `while` loop can be changed by the loop body so that the loop does not run forever.
- If none of the elements of a vector have names, `names(vector)` returns `NULL`, so a `for` loop doesn't execute any iterations.
- If some of the elements of a vector have names and some don't, `names(vector)` returns empty strings for unnamed elements.
- This means that a `for` loop will execute, but that attempts to access unnamed vector elements by name will fail.
- A higher-order function is one that takes other functions as arguments.
- Higher-order functions allow programmers to write control flow once and re-use it with different operations.
- `map(object, function)` applies `function` to each element of `object` and returns a list of results.
- The specialized functions `map_lgl`, `map_int`, etc., operate on and return vectors of specific types (logical, integer, etc.)
- These functions preserve names and pass extra arguments through to the function provided.
- FIXME: go through "21.5.1 Shortcuts" after learning about formulas.
- `safely(func)` creates a new function that never throws an error, but instead always returns a list of two values:
- `result` is either the original result (if the original function ran without an error) or `NULL` (if there was an error).
- `error` is either `NULL` (if the original function ran without an error) or the error object (if there was an error).
- `map(data, safely(func))` will therefore return a list of pairs.
- And `transpose(map(data, safely(func)))` will return a pair of lists.
- `possibly(func)` creates a new function that returns a user-supplied default value instead of throwing an error.
- `quietly(func)` works like `safely` but captures printed output, messages, and warnings.
- `map2(vec1, vec2, function)` applies `function` to corresponding elements from `vec1` and `vec2`.
- `pmap(list_of_lists, function)` applies `function` to the values in each of the sub-lists.
- It is safest to give the sub-lists names that match the names of the function's parameters rather than relying on positional matching.
- The `walk` family of functions execute functions without collecting and returning their results.
- A predicate function is one that returns a single logical value.
- `keep` and `discard` keep elements of the input where a predicate function returns `TRUE` or `FALSE` respectively.
- `some` and `every` determine whether a predicate is true for any or all elements of the input data.
- `detect` returns the first element for which a predicate is true.
- `detect_index` returns the index of the first element for which a predicate is true.
- `head_while` and `tail_while` collect runs of values from the start or end of a structure for which a predicate is true.
- Reduction combines many values using a binary (two-argument) function to create a single resulting value.
- Use `reduce(data, function)` to do this.
- `reduce` throws an error if `data` is empty unless an initial value `init` is provided.
- Accumulation performs the same operation as reduction, but keeps the intermediate results (e.g., calculates a running sum).
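
The sketch below contrasts a preallocated `for` loop with its `map` equivalent and then touches the other `purrr` helpers named above. It assumes `purrr` is installed; the data frame `df` and the inputs are made up for illustration.

```r
library(purrr)

# Hypothetical data: two numeric columns.
df <- data.frame(a = c(1, 2, 3), b = c(10, 20, 30))

# A for loop with a preallocated output vector.
means <- vector("double", length(df))
for (i in seq_along(df)) {
  means[[i]] <- mean(df[[i]])
}

# The same computation with a map function (names are preserved).
means_map <- map_dbl(df, mean)

# safely() never throws: each call returns list(result = , error = ).
safe_log <- safely(log)
results  <- transpose(map(list(10, "x"), safe_log))

# possibly() substitutes a default value when the function fails.
logs <- map_dbl(list(1, "x"), possibly(log, NA_real_))

# Parallel iteration over two vectors.
sums <- map2_dbl(c(1, 2), c(10, 20), `+`)

# A predicate function used with keep().
evens <- keep(1:6, function(n) n %% 2 == 0)

# Reduction gives one value; accumulation keeps the running results.
total   <- reduce(1:4, `+`)
running <- accumulate(1:4, `+`)
```

Here `results$result` and `results$error` are each lists of length two, which is the "pair of lists" shape described above.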
Owner
- Name: RStudio Education
- Login: rstudio-education
- Kind: organization
- Website: https://education.rstudio.com/
- Twitter: RStudioEDU
- Repositories: 24
- Profile: https://github.com/rstudio-education
Citation (CITATION.md)
# Citation

Please cite this work as:

> Greg Wilson (ed.): "R for Data Science Instructors' Guide". <https://github.com/rstudio-education/r4ds-instructors>, 2018.