This data is currently in “wide” format: a row has more than one observation. That is, the same outcome variable appears in multiple columns.
The outcome observation that spans across several columns is income range.
The different incomes columns are essentially different levels of the same categorical variable.
Long data
We might want long format: the outcome variable only exists in one column, and a second column/variable tells us the different levels.
Each row has one observation, but the units of observation (in this case, number of people in each income bracket) are repeated down one column.
This is helpful for us to perform group_by() or facet_wrap().
Converting from wide -> long
We will do this with the pivot_longer() function. This function requires a couple of arguments:
cols: which columns to pivot into a “longer” format. That is, the columns that we should “move”
names_to: a string character that provides the new column name for the categorical variable you are creating
values_to: a string character that provides the new variable name for the response/outcome variable that is common across all levels of the categorical variable
# A tibble: 19 × 12
fish Release I80_1 Lisbon Rstr Base_TD BCE BCW BCE2 BCW2 MAE MAW
<fct> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 4842 1 1 1 1 1 1 1 1 1 1 1
2 4843 1 1 1 1 1 1 1 1 1 1 1
3 4844 1 1 1 1 1 1 1 1 1 1 1
4 4845 1 1 1 1 1 NA NA NA NA NA NA
5 4847 1 1 1 NA NA NA NA NA NA NA NA
6 4848 1 1 1 1 NA NA NA NA NA NA NA
7 4849 1 1 NA NA NA NA NA NA NA NA NA
8 4850 1 1 NA 1 1 1 1 NA NA NA NA
9 4851 1 1 NA NA NA NA NA NA NA NA NA
10 4854 1 1 NA NA NA NA NA NA NA NA NA
11 4855 1 1 1 1 1 NA NA NA NA NA NA
12 4857 1 1 1 1 1 1 1 1 1 NA NA
13 4858 1 1 1 1 1 1 1 1 1 1 1
14 4859 1 1 1 1 1 NA NA NA NA NA NA
15 4861 1 1 1 1 1 1 1 1 1 1 1
16 4862 1 1 1 1 1 1 1 1 1 NA NA
17 4863 1 1 NA NA NA NA NA NA NA NA NA
18 4864 1 1 NA NA NA NA NA NA NA NA NA
19 4865 1 1 1 NA NA NA NA NA NA NA NA
Note that there are some NA values after pivoting. From the Help file, this is because misses were not directly recorded in the original form of the data. Try adding the argument values_fill = 0 to the pivot_wider() function.
Joins
Joining data frames
Assume we have two data frames, df1 and df2. There are some shared variable(s) across the two.
We want to combine df1 and df2 into one single data frame. The dplyr code will look something like:
xxx_join(df1, df2)
How do we combine df1 and df2 them? The following mutating joins give us rules:
left_join(): keep all cases from df1
right_join(): keep all cases from df2
full_join(): keep all cases from both df1 and df2
inner_join(): keep all cases from df1 where there are matching values in df2 (more complicated in the scenario of multiple matches)
I know this is confusing! Examples will help clarify!
Setup
For the next few slides…
df1
ID
name
1
x1
2
x2
3
x3
df2
ID
grade
1
y1
2
y2
4
y4
I recommend writing these two down, as we will work with them in the next few slides.
Note! The variable ID is shared between these two data frames. We will “join on” ID.
left_join(df1, df2)
Adds columns to df1 from df2, matching the cases in df1. Retains ALL cases from left data frame df1, and adds NA for variables from df2 that have no match.
df1
ID
name
1
x1
2
x2
3
x3
df2
ID
grade
1
y1
2
y2
4
y4
left_join(df1, df2, by ="ID")
ID
name
grade
1
x1
y1
2
x2
y2
3
x3
NA
Think of left data frame as fixed.
right_join(df1, df2)
Adds columns to df1 from df2, matching the cases in df2. Retains ALL cases from right data frame df2, and adds NA for variables from df1 that have no match.
df1
ID
name
1
x1
2
x2
3
x3
df2
ID
grade
1
y1
2
y2
4
y4
right_join(df1, df2, by ="ID")
ID
name
grade
1
x1
y1
2
x2
y2
4
NA
y4
Think of right data frame as fixed
full_join(df1, df2)
Adds columns to df1 from df2, fully combining the two data frames, matching when possible. Retains ALL cases from df1 AND df2, and adds NAs as necesary.
df1
ID
name
1
x1
2
x2
3
x3
df2
ID
grade
1
y1
2
y2
4
y4
full_join(df1, df2, by ="ID")
ID
name
grade
1
x1
y1
2
x2
y2
3
x3
NA
4
NA
y4
No chance for losing information.
inner_join(df1, df2)
Adds columns to df1 from df2, retaining only cases that exist in both df1 AND df2.
df1
ID
name
1
x1
2
x2
3
x3
df2
ID
grade
1
y1
2
y2
4
y4
inner_join(df1, df2, by ="ID")
ID
name
grade
1
x1
y1
2
x2
y2
Comprehension check
In which of the joins are NAs possibly introduced? In which of the joins are NAs never going to be introduced?
If we have some df1 and df2, in what scenario will the result of the four joins be the same? (i.e. what has to be true about df1 and df2?)
What do you think we call these “mutating” joins? Think about where we’ve seen that verb before!
Note: it’s possible to join on more than one variable!
Example
We have data on fishery harvests (in tons) by countries from 2016, stored in data frame fish:
fish |>slice(1:8)
country
capture
aquaculture
Afghanistan
1000
1200
Albania
7886
950
Algeria
95000
1361
American Samoa
3047
20
Andorra
0
0
Angola
486490
655
Antigua and Barbuda
3000
10
Argentina
755226
3673
Suppose I would like to explore the data on a continent level. We have another data frame called continent that looks like this:
continents |>slice(1:8)
Country
continent
Afghanistan
Asia
Åland Islands
Europe
Albania
Europe
Algeria
Africa
American Samoa
Oceania
Andorra
Europe
Angola
Africa
Anguilla
Americas
Discuss
fish data frame snapshot:
country
capture
aquaculture
Afghanistan
1000
1200
Albania
7886
950
Algeria
95000
1361
American Samoa
3047
20
Andorra
0
0
continents data frame snapshot:
Country
continent
Afghanistan
Asia
Åland Islands
Europe
Albania
Europe
Algeria
Africa
American Samoa
Oceania
We want to keep all rows and columns from fish and add a column for corresponding continents. Which join function should we use?
We want to keep only the rows from fish for which we have a corresponding continent and add a column for corresponding continents. Which join function should we use?
Which variable(s) are we joining on?
Example (cont.)
Question 1:
left_join(fish, continents, by =c("country"="Country")) |>slice(1:8)
country
capture
aquaculture
continent
Afghanistan
1000
1200
Asia
Albania
7886
950
Europe
Algeria
95000
1361
Africa
American Samoa
3047
20
Oceania
Andorra
0
0
Europe
Angola
486490
655
Africa
Antigua and Barbuda
3000
10
NA
Argentina
755226
3673
Americas
Notice the NA
Question 2:
inner_join(fish, continents, by =c("country"="Country")) |>slice(1:8)
country
capture
aquaculture
continent
Afghanistan
1000
1200
Asia
Albania
7886
950
Europe
Algeria
95000
1361
Africa
American Samoa
3047
20
Oceania
Andorra
0
0
Europe
Angola
486490
655
Africa
Argentina
755226
3673
Americas
Armenia
3758
16381
Asia
Notice we don’t have two copies of same country/Country variable!
Piping into xxx_join()
In the following line of code, when would an NA be introduced? (Answer forthcoming.)
right_join(fish, continents, by =c("country"="Country"))
Remember in data wrangling, we pipe from a previous line of code directly into another
We can pipe directly into the join functions! The following achieves the same as above:
fish |>right_join(continents, by =c("country"="Country"))
Answer to previous question
NAs would be introduced if there is a country in continents that does not appear in fish. See the last 50-ish cases: