Data wrangling with dplyr (part 1)

Published

September 24, 2025

The textbook Modern Dive has some great descriptions and visuals to explain some of these functions. I highly recommend you read through the online textbook linked here: https://moderndive.com/3-wrangling.html#filter.

library(tidyverse)

url_file <- "https://raw.githubusercontent.com/midd-stat201a-fall25/midd-stat201a-fall25.github.io/refs/heads/main/data/amazon_books.csv"
books <- read_csv(url_file)

By default, all dplyr functions expect the first argument to be a data frame.

Filter

In the following code, we use the filter() function to only retain the observations where the book’s list_price is less than or equal to $20, and then exactly equal to $20. This function requires specifying a logical condition, and keeps observations in which the condition is met (i.e. TRUE).

books |>
  filter(list_price <= 20)

# A tibble: 12 × 13
    isbn_10 author list_price amazon_price num_pages cover pub_year height width
      <dbl> <chr>       <dbl>        <dbl>     <dbl> <chr>    <dbl>  <dbl> <dbl>
 1   6.85e8 Earne…      12            9.09       128 P         1995    5.2  NA  
 2   8.87e7 Kawas…      16           12.5        224 P         2000    8.2   5.3
 3   1.60e9 O'Bri…      17.0         10.8        368 H         2011   NA    NA  
 4   3.86e8 Dashn…       9.99         9.99       400 P         2010    8.2   5.5
 5   6.19e8 Richa…      17.0          9.16       464 P         2008    8.1   5.6
 6   3.85e8 Grish…      15           10.2        384 P         2004    8     5.4
 7   3.85e8 Shaff…      15            8.05       290 H         2009    7.8   5.4
 8   7.43e8 Jodi …      16           10.8        448 P         2005    8.2   5.4
 9   1.46e8 Mark …      15.0         15.0        276 P         2011    7.8   5.1
10   1.57e7 Lewis       13            9.64       168 P         1964    8     5.3
11   1.57e9 Gruen       15.0          8.14       350 P         2007    8.2   5.4
12   6.07e7 Steve…      16.0          9.95       315 P         2009    7.9   5.3
# ℹ 4 more variables: thick <dbl>, weight_oz <dbl>, title <chr>,
#   publisher <chr>

books |>
  filter(amazon_price == 20)

# A tibble: 0 × 13
# ℹ 13 variables: isbn_10 <dbl>, author <chr>, list_price <dbl>,
#   amazon_price <dbl>, num_pages <dbl>, cover <chr>, pub_year <dbl>,
#   height <dbl>, width <dbl>, thick <dbl>, weight_oz <dbl>, title <chr>,
#   publisher <chr>

books |>
  filter(is.na(height))

# A tibble: 2 × 13
    isbn_10 author list_price amazon_price num_pages cover pub_year height width
      <dbl> <chr>       <dbl>        <dbl>     <dbl> <chr>    <dbl>  <dbl> <dbl>
1    3.76e8 Paoli…       28.0         14.0       704 H         2011     NA    NA
2    1.60e9 O'Bri…       17.0         10.8       368 H         2011     NA    NA
# ℹ 4 more variables: thick <dbl>, weight_oz <dbl>, title <chr>,
#   publisher <chr>

books |>
  filter(!is.na(height))

# A tibble: 18 × 13
    isbn_10 author list_price amazon_price num_pages cover pub_year height width
      <dbl> <chr>       <dbl>        <dbl>     <dbl> <chr>    <dbl>  <dbl> <dbl>
 1   6.85e8 Earne…      12            9.09       128 P         1995    5.2  NA  
 2   6.79e8 Jospe…      24           16.3        624 H         1995    8.4   5.3
 3   8.87e7 Kawas…      16           12.5        224 P         2000    8.2   5.3
 4   3.86e8 Dashn…       9.99         9.99       400 P         2010    8.2   5.5
 5   3.07e8 Erik …      26           13          464 H         2011    9.5   6  
 6   6.19e8 Richa…      17.0          9.16       464 P         2008    8.1   5.6
 7   3.85e8 Grish…      15           10.2        384 P         2004    8     5.4
 8   1.93e9 Timot…      28.0         20.0        144 P         2008   10.8   8.4
 9   3.46e8 Paula…      25           14.5        336 H         2011    9.3   6.4
10   7.81e8 Augus…      21.1         16.5         NA H         1986    9     6  
11   3.85e8 Shaff…      15            8.05       290 H         2009    7.8   5.4
12   7.43e8 Jodi …      16           10.8        448 P         2005    8.2   5.4
13   1.46e8 Mark …      15.0         15.0        276 P         2011    7.8   5.1
14   5.20e8 R.W. …      25.0         22.8        349 P         2005    9     6.1
15   1.57e7 Lewis       13            9.64       168 P         1964    8     5.3
16   1.57e9 Gruen       15.0          8.14       350 P         2007    8.2   5.4
17   6.07e7 Steve…      16.0          9.95       315 P         2009    7.9   5.3
18   1.59e9 Kawas…      27.0         15.0        211 H         2011    8.3   5.5
# ℹ 4 more variables: thick <dbl>, weight_oz <dbl>, title <chr>,
#   publisher <chr>

What’s going on here?

Start with the data frame books
Pipe (|>) the data frame to the filter() function and specify that we want to retain the cases where a particular condition is TRUE
The result is a new data frame

We can use the following to create logical conditions: ==, >, <, >=, <=, is.na(), is.numeric(), and many more!

To check for the lack of equality we use either != or !is.__(). In general, the ! reverses the boolean (so !TRUE == FALSE).

Note

What is the difference between using is.na() within filter() and the following function na.omit()? Try running it!

books |>
  na.omit()

We can also filter for more than one condition at once. Within filter(), the comma , specifies that all conditions must be true. It can be read as “and”. In the following code, we retain cases where a book’s price on Amazon is that same as its list price, and has a list price under $20.

books |>
  filter(list_price == amazon_price, list_price < 20)

# A tibble: 2 × 13
    isbn_10 author list_price amazon_price num_pages cover pub_year height width
      <dbl> <chr>       <dbl>        <dbl>     <dbl> <chr>    <dbl>  <dbl> <dbl>
1 385737955 Dashn…       9.99         9.99       400 P         2010    8.2   5.5
2 146352719 Mark …      15.0         15.0        276 P         2011    7.8   5.1
# ℹ 4 more variables: thick <dbl>, weight_oz <dbl>, title <chr>,
#   publisher <chr>

If we just need just one of multiple conditions to be true, we can use the | operator which stands for “or”:

books |>
  filter( (list_price == amazon_price) | list_price < 20)

# A tibble: 12 × 13
    isbn_10 author list_price amazon_price num_pages cover pub_year height width
      <dbl> <chr>       <dbl>        <dbl>     <dbl> <chr>    <dbl>  <dbl> <dbl>
 1   6.85e8 Earne…      12            9.09       128 P         1995    5.2  NA  
 2   8.87e7 Kawas…      16           12.5        224 P         2000    8.2   5.3
 3   1.60e9 O'Bri…      17.0         10.8        368 H         2011   NA    NA  
 4   3.86e8 Dashn…       9.99         9.99       400 P         2010    8.2   5.5
 5   6.19e8 Richa…      17.0          9.16       464 P         2008    8.1   5.6
 6   3.85e8 Grish…      15           10.2        384 P         2004    8     5.4
 7   3.85e8 Shaff…      15            8.05       290 H         2009    7.8   5.4
 8   7.43e8 Jodi …      16           10.8        448 P         2005    8.2   5.4
 9   1.46e8 Mark …      15.0         15.0        276 P         2011    7.8   5.1
10   1.57e7 Lewis       13            9.64       168 P         1964    8     5.3
11   1.57e9 Gruen       15.0          8.14       350 P         2007    8.2   5.4
12   6.07e7 Steve…      16.0          9.95       315 P         2009    7.9   5.3
# ℹ 4 more variables: thick <dbl>, weight_oz <dbl>, title <chr>,
#   publisher <chr>

While not necessary, the extra ( ) for the first logical condition makes the code easier to read!

We can also filter() based on categorical variables. If I only wanted hardcover books, I can write the following code:

books |>
  filter(cover == "H")

# A tibble: 8 × 13
    isbn_10 author list_price amazon_price num_pages cover pub_year height width
      <dbl> <chr>       <dbl>        <dbl>     <dbl> <chr>    <dbl>  <dbl> <dbl>
1    3.76e8 Paoli…       28.0        14.0        704 H         2011   NA    NA  
2    6.79e8 Jospe…       24          16.3        624 H         1995    8.4   5.3
3    1.60e9 O'Bri…       17.0        10.8        368 H         2011   NA    NA  
4    3.07e8 Erik …       26          13          464 H         2011    9.5   6  
5    3.46e8 Paula…       25          14.5        336 H         2011    9.3   6.4
6    7.81e8 Augus…       21.1        16.5         NA H         1986    9     6  
7    3.85e8 Shaff…       15           8.05       290 H         2009    7.8   5.4
8    1.59e9 Kawas…       27.0        15.0        211 H         2011    8.3   5.5
# ℹ 4 more variables: thick <dbl>, weight_oz <dbl>, title <chr>,
#   publisher <chr>

If there are many levels of a categorical variable that I’d like to filter for, we can use %in% logical operator:

books |>
  filter(publisher %in% c("Crown", "Delta"))

# A tibble: 2 × 13
    isbn_10 author list_price amazon_price num_pages cover pub_year height width
      <dbl> <chr>       <dbl>        <dbl>     <dbl> <chr>    <dbl>  <dbl> <dbl>
1 307408841 Erik …         26         13         464 H         2011    9.5   6  
2 385337930 Grish…         15         10.2       384 P         2004    8     5.4
# ℹ 4 more variables: thick <dbl>, weight_oz <dbl>, title <chr>,
#   publisher <chr>

I read this as: is this books’s value of publisher in the vector ("Crown", "Delta")?

You could also use the “or” operator | but that can get a bit tedious!

books |>
  filter( (publisher == "Crown") | (publisher ==  "Delta"))

# A tibble: 2 × 13
    isbn_10 author list_price amazon_price num_pages cover pub_year height width
      <dbl> <chr>       <dbl>        <dbl>     <dbl> <chr>    <dbl>  <dbl> <dbl>
1 307408841 Erik …         26         13         464 H         2011    9.5   6  
2 385337930 Grish…         15         10.2       384 P         2004    8     5.4
# ℹ 4 more variables: thick <dbl>, weight_oz <dbl>, title <chr>,
#   publisher <chr>

Mutate to add a new variable

It is typical for us to add new variables to a given data frame. We do this with the mutate() function. We must specify:

The name of the new variable and
How to calculate the value of that new variable for each observation. This will typically involve operations involving variables already present in the data frame. Importantly, you should think of those function operating row-wise (i.e. each case will get its own new value).

We link the two with ONE equals sign =.

In the following, we create a new variable called body_wt_lbs based off the rule weight_oz / 16. Thus, each case gets its own new value of weight_lbs based on its original value of weight. You can see the first three cases here:

books |>
  mutate(weight_lbs = weight_oz / 16)

Note

New variables from mutate() always get added to the last column of the data frame!

Storing our work from mutate

We’re doing all this good data wrangling, but we’re not “saving our work on the data frame”. That is, even though we’re using dplyr functions, all of the work is going into the ether unless we deliberatively store our wrangled data frame.

For example, what’s the purpose of creating a new variable if we don’t use it later on? Let’s create a new data frame called books2 that includes the new weight_lbs variable:

books2 <- books |>
  mutate(weight_lbs = weight_oz / 16)

Note

Remember, when we store values in R, we never see the executed output. So when you run the above code, you don’t see a snapshot of the data frame. If you wanted to see your new books2 data frame, use View()! Just remember to remove View() before rendering!

Summarise

The summarise() function gives us an easy way to calculate summary statistics of variables in the data frame! We just need to know the name of the function such as mean() or sd() that will calculate the summary statistic for us.

Remember that the output is a data frame. Convention is to use the function like this: summarise(informative_name = function for statistic).

Think about summarise() as operating column-wise: it will use ALL the cases for the specified variable.

Let’s find the average weight (in pounds) of all our books:

books2 |>
  summarise(mean_list_price = mean(list_price))

# A tibble: 1 × 1
  mean_list_price
            <dbl>
1            19.0

books2 |>
  summarise(mean_list_price = mean(list_price), sd_list_price = sd(list_price))

# A tibble: 1 × 2
  mean_list_price sd_list_price
            <dbl>         <dbl>
1            19.0          5.81

You can obtain multiple summary statistics at once by separating the desired summary statistics with commas.

Note

The summarise() function changes the data frame entirely. It collapses rows down to a summary statistic, and removes all columns that are irrelevant to the calculation.