Joins data frames and pivoting

2025-12-01

Housekeeping

Content today is optional; might be useful for your final project!

Pivoting

Wide data

Let’s take a look at first few observations of the relig_income data frame from tidyverse package:

relig_income |>
  head()
# A tibble: 6 × 11
  religion  `<$10k` `$10-20k` `$20-30k` `$30-40k` `$40-50k` `$50-75k` `$75-100k`
  <chr>       <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>      <dbl>
1 Agnostic       27        34        60        81        76       137        122
2 Atheist        12        27        37        52        35        70         73
3 Buddhist       27        21        30        34        33        58         62
4 Catholic      418       617       732       670       638      1116        949
5 Don’t kn…      15        14        15        11        10        35         21
6 Evangeli…     575       869      1064       982       881      1486        949
# ℹ 3 more variables: `$100-150k` <dbl>, `>150k` <dbl>,
#   `Don't know/refused` <dbl>
  • This data is currently in “wide” format: a row has more than one observation. That is, the same outcome variable appears in multiple columns.

    • The outcome observation that spans across several columns is income range.

    • The different incomes columns are essentially different levels of the same categorical variable.

Long data

We might want long format: the outcome variable only exists in one column, and a second column/variable tells us the different levels.

  • Each row has one observation, but the units of observation (in this case, number of people in each income bracket) are repeated down one column.

  • This is helpful for us to perform group_by() or facet_wrap().

Converting from wide -> long

We will do this with the pivot_longer() function. This function requires a couple of arguments:

  1. cols: which columns to pivot into a “longer” format. That is, the columns that we should “move”

  2. names_to: a string character that provides the new column name for the categorical variable you are creating

  3. values_to: a string character that provides the new variable name for the response/outcome variable that is common across all levels of the categorical variable

relig_income |>
  pivot_longer(cols = 2:11, 
               names_to = "income_range",  
               values_to = "count")

pivot_longer()

Resulting data frame:

# A tibble: 20 × 3
   religion income_range       count
   <chr>    <chr>              <dbl>
 1 Agnostic <$10k                 27
 2 Agnostic $10-20k               34
 3 Agnostic $20-30k               60
 4 Agnostic $30-40k               81
 5 Agnostic $40-50k               76
 6 Agnostic $50-75k              137
 7 Agnostic $75-100k             122
 8 Agnostic $100-150k            109
 9 Agnostic >150k                 84
10 Agnostic Don't know/refused    96
11 Atheist  <$10k                 12
12 Atheist  $10-20k               27
13 Atheist  $20-30k               37
14 Atheist  $30-40k               52
15 Atheist  $40-50k               35
16 Atheist  $50-75k               70
17 Atheist  $75-100k              73
18 Atheist  $100-150k             59
19 Atheist  >150k                 74
20 Atheist  Don't know/refused    76
  • Note that the unit of observation is repeated down several rows!
# This also works
relig_income |>
  pivot_longer(cols = !religion, names_to = "income_range",  values_to = "count")

Converting long -> wide

Let’s look at the fish_encounters data. Convince yourself that it’s currently in long format!

# A tibble: 6 × 3
  fish  station  seen
  <fct> <fct>   <int>
1 4842  Release     1
2 4842  I80_1       1
3 4842  Lisbon      1
4 4842  Rstr        1
5 4842  Base_TD     1
6 4842  BCE         1
  • We want to pivot the data such that each fish is an observation, and we can easily see which stations it was observed at.
  • We can do this using the pivot_wider() function!

pivot_wider()

The pivot_wider() function takes two arguments:

  1. names_from: the name of the variable(s) in the data frame to get the name of the output column.

  2. values_from: the name of the variable(s) in the data frame to get the cell values from

  • The input to names_from should be the categorical variable(s), and the different levels of this categorical variable will be the new column names.

    • What should we fill these cells with? The values specified by values_from.
fish_encounters |>
  pivot_wider(names_from = station, values_from = seen)

pivot_wider()

# A tibble: 19 × 12
   fish  Release I80_1 Lisbon  Rstr Base_TD   BCE   BCW  BCE2  BCW2   MAE   MAW
   <fct>   <int> <int>  <int> <int>   <int> <int> <int> <int> <int> <int> <int>
 1 4842        1     1      1     1       1     1     1     1     1     1     1
 2 4843        1     1      1     1       1     1     1     1     1     1     1
 3 4844        1     1      1     1       1     1     1     1     1     1     1
 4 4845        1     1      1     1       1    NA    NA    NA    NA    NA    NA
 5 4847        1     1      1    NA      NA    NA    NA    NA    NA    NA    NA
 6 4848        1     1      1     1      NA    NA    NA    NA    NA    NA    NA
 7 4849        1     1     NA    NA      NA    NA    NA    NA    NA    NA    NA
 8 4850        1     1     NA     1       1     1     1    NA    NA    NA    NA
 9 4851        1     1     NA    NA      NA    NA    NA    NA    NA    NA    NA
10 4854        1     1     NA    NA      NA    NA    NA    NA    NA    NA    NA
11 4855        1     1      1     1       1    NA    NA    NA    NA    NA    NA
12 4857        1     1      1     1       1     1     1     1     1    NA    NA
13 4858        1     1      1     1       1     1     1     1     1     1     1
14 4859        1     1      1     1       1    NA    NA    NA    NA    NA    NA
15 4861        1     1      1     1       1     1     1     1     1     1     1
16 4862        1     1      1     1       1     1     1     1     1    NA    NA
17 4863        1     1     NA    NA      NA    NA    NA    NA    NA    NA    NA
18 4864        1     1     NA    NA      NA    NA    NA    NA    NA    NA    NA
19 4865        1     1      1    NA      NA    NA    NA    NA    NA    NA    NA
  • Note that there are some NA values after pivoting. From the Help file, this is because misses were not directly recorded in the original form of the data. Try adding the argument values_fill = 0 to the pivot_wider() function.

Joins

Joining data frames

Assume we have two data frames, df1 and df2. There are some shared variable(s) across the two.

We want to combine df1 and df2 into one single data frame. The dplyr code will look something like:

xxx_join(df1, df2)
  • How do we combine df1 and df2 them? The following mutating joins give us rules:

    • left_join(): keep all cases from df1
    • right_join(): keep all cases from df2
    • full_join(): keep all cases from both df1 and df2
    • inner_join(): keep all cases from df1 where there are matching values in df2 (more complicated in the scenario of multiple matches)

I know this is confusing! Examples will help clarify!

Setup

For the next few slides…

df1
ID name
1 x1
2 x2
3 x3
df2
ID grade
1 y1
2 y2
4 y4

I recommend writing these two down, as we will work with them in the next few slides.

Note! The variable ID is shared between these two data frames. We will “join on” ID.

left_join(df1, df2)

Adds columns to df1 from df2, matching the cases in df1. Retains ALL cases from left data frame df1, and adds NA for variables from df2 that have no match.

df1
ID name
1 x1
2 x2
3 x3
df2
ID grade
1 y1
2 y2
4 y4
left_join(df1, df2, by = "ID")
ID name grade
1 x1 y1
2 x2 y2
3 x3 NA

Think of left data frame as fixed.

right_join(df1, df2)

Adds columns to df1 from df2, matching the cases in df2. Retains ALL cases from right data frame df2, and adds NA for variables from df1 that have no match.

df1
ID name
1 x1
2 x2
3 x3
df2
ID grade
1 y1
2 y2
4 y4
right_join(df1, df2, by = "ID")
ID name grade
1 x1 y1
2 x2 y2
4 NA y4

Think of right data frame as fixed

full_join(df1, df2)

Adds columns to df1 from df2, fully combining the two data frames, matching when possible. Retains ALL cases from df1 AND df2, and adds NAs as necesary.

df1
ID name
1 x1
2 x2
3 x3
df2
ID grade
1 y1
2 y2
4 y4
full_join(df1, df2, by = "ID")
ID name grade
1 x1 y1
2 x2 y2
3 x3 NA
4 NA y4

No chance for losing information.

inner_join(df1, df2)

Adds columns to df1 from df2, retaining only cases that exist in both df1 AND df2.

df1
ID name
1 x1
2 x2
3 x3
df2
ID grade
1 y1
2 y2
4 y4
inner_join(df1, df2, by = "ID")
ID name grade
1 x1 y1
2 x2 y2

Comprehension check

  1. In which of the joins are NAs possibly introduced? In which of the joins are NAs never going to be introduced?
  2. If we have some df1 and df2, in what scenario will the result of the four joins be the same? (i.e. what has to be true about df1 and df2?)
  3. What do you think we call these “mutating” joins? Think about where we’ve seen that verb before!
  • Note: it’s possible to join on more than one variable!

Example

We have data on fishery harvests (in tons) by countries from 2016, stored in data frame fish:

fish |>
    slice(1:8)
country capture aquaculture
Afghanistan 1000 1200
Albania 7886 950
Algeria 95000 1361
American Samoa 3047 20
Andorra 0 0
Angola 486490 655
Antigua and Barbuda 3000 10
Argentina 755226 3673

Suppose I would like to explore the data on a continent level. We have another data frame called continent that looks like this:

continents |>
  slice(1:8)
Country continent
Afghanistan Asia
Åland Islands Europe
Albania Europe
Algeria Africa
American Samoa Oceania
Andorra Europe
Angola Africa
Anguilla Americas

Discuss

fish data frame snapshot:

country capture aquaculture
Afghanistan 1000 1200
Albania 7886 950
Algeria 95000 1361
American Samoa 3047 20
Andorra 0 0

continents data frame snapshot:

Country continent
Afghanistan Asia
Åland Islands Europe
Albania Europe
Algeria Africa
American Samoa Oceania
  • We want to keep all rows and columns from fish and add a column for corresponding continents. Which join function should we use?

  • We want to keep only the rows from fish for which we have a corresponding continent and add a column for corresponding continents. Which join function should we use?

  • Which variable(s) are we joining on?

Example (cont.)

Question 1:

left_join(fish, continents, 
          by = c("country" = "Country")) |>
  slice(1:8)
country capture aquaculture continent
Afghanistan 1000 1200 Asia
Albania 7886 950 Europe
Algeria 95000 1361 Africa
American Samoa 3047 20 Oceania
Andorra 0 0 Europe
Angola 486490 655 Africa
Antigua and Barbuda 3000 10 NA
Argentina 755226 3673 Americas
  • Notice the NA

Question 2:

inner_join(fish, continents, 
           by = c("country" = "Country")) |>
  slice(1:8)
country capture aquaculture continent
Afghanistan 1000 1200 Asia
Albania 7886 950 Europe
Algeria 95000 1361 Africa
American Samoa 3047 20 Oceania
Andorra 0 0 Europe
Angola 486490 655 Africa
Argentina 755226 3673 Americas
Armenia 3758 16381 Asia
  • Notice we don’t have two copies of same country/Country variable!

Piping into xxx_join()

In the following line of code, when would an NA be introduced? (Answer forthcoming.)

right_join(fish, continents, by = c("country" = "Country"))
  • Remember in data wrangling, we pipe from a previous line of code directly into another

  • We can pipe directly into the join functions! The following achieves the same as above:

fish |>
  right_join(continents, by = c("country" = "Country"))

Answer to previous question

NAs would be introduced if there is a country in continents that does not appear in fish. See the last 50-ish cases: