Data wrangling with dplyr (part 2)

Remember, we’ve learned up to this point:

Let’s re-load in our books data frame here along with the tidyverse package:

library(tidyverse)

url_file <- "https://raw.githubusercontent.com/midd-stat201a-fall25/midd-stat201a-fall25.github.io/refs/heads/main/data/amazon_books.csv"
books <- read_csv(url_file)

Group by

This is by far one of the most powerful dplyr functions! But it’s also a bit confusing.

Suppose we’d like to get the mean price of books, split by cover type (paperback or hardback). We’d still need to summarise(), but we need a way to encode our desire to treat the cases in the Paperback category separately from the case in the Hardback category. This is easily done using the group_by() function, which groups cases together based on a particular level of a categorical variable. This grouping is done behind the scenes.

We specify group_by(<categorical variable to group on>):

books |>
  group_by(cover) |>
  summarise(mean_price_amazon = mean(amazon_price))
# A tibble: 2 × 2
  cover mean_price_amazon
  <chr>             <dbl>
1 H                  13.5
2 P                  12.3
TipNote!

To be useful, group_by() is always followed by other dplyr functions.

Remember: we group on a categorical variable. Then all cases with a particular level of that categorical variable are treated as a mini new data frame, separate from cases with a different level of that variable.

Count

Let’s count the number of times we see a particular value of a variable. For example, I want to know how many paperback and how many hardback books there are in the data set. That’s saying: count up the number of cases in each level! To code this, we pipe to count(<variable we'd like to count the different levels of>).

books |>
  count(cover)
# A tibble: 2 × 2
  cover     n
  <chr> <int>
1 H         8
2 P        12

We use count() to create frequency tables!

TipNote

The count() function, like summarise(), changes the data frame entirely. It retains only the variable(s) you used in count(), and adds a new variable n which is represents the counts for that particular level.

Arrange

Another very common data wrangling tasks is to sort a data frame’s rows in the alphanumeric order of one of the variables. We can do this with arrange(), where we sort the rows according the values of a variable that you specify.

In the code below, we can re-order the rows based on the listing price:

books |>
  arrange(list_price)
# A tibble: 10 × 13
    isbn_10 author list_price amazon_price num_pages cover pub_year height width
      <dbl> <chr>       <dbl>        <dbl>     <dbl> <chr>    <dbl>  <dbl> <dbl>
 1   3.86e8 Dashn…       9.99         9.99       400 P         2010    8.2   5.5
 2   6.85e8 Earne…      12            9.09       128 P         1995    5.2  NA  
 3   1.57e7 Lewis       13            9.64       168 P         1964    8     5.3
 4   1.46e8 Mark …      15.0         15.0        276 P         2011    7.8   5.1
 5   1.57e9 Gruen       15.0          8.14       350 P         2007    8.2   5.4
 6   3.85e8 Grish…      15           10.2        384 P         2004    8     5.4
 7   3.85e8 Shaff…      15            8.05       290 H         2009    7.8   5.4
 8   6.07e7 Steve…      16.0          9.95       315 P         2009    7.9   5.3
 9   8.87e7 Kawas…      16           12.5        224 P         2000    8.2   5.3
10   7.43e8 Jodi …      16           10.8        448 P         2005    8.2   5.4
# ℹ 4 more variables: thick <dbl>, weight_oz <dbl>, title <chr>,
#   publisher <chr>

Notice that the default is increasing order. If you’d like the go in decreasing order, we can use the following:

books |>
  arrange(desc(list_price))
# A tibble: 10 × 13
    isbn_10 author list_price amazon_price num_pages cover pub_year height width
      <dbl> <chr>       <dbl>        <dbl>     <dbl> <chr>    <dbl>  <dbl> <dbl>
 1   3.76e8 Paoli…       28.0        14.0        704 H         2011   NA    NA  
 2   1.93e9 Timot…       28.0        20.0        144 P         2008   10.8   8.4
 3   1.59e9 Kawas…       27.0        15.0        211 H         2011    8.3   5.5
 4   3.07e8 Erik …       26          13          464 H         2011    9.5   6  
 5   5.20e8 R.W. …       25.0        22.8        349 P         2005    9     6.1
 6   3.46e8 Paula…       25          14.5        336 H         2011    9.3   6.4
 7   6.79e8 Jospe…       24          16.3        624 H         1995    8.4   5.3
 8   7.81e8 Augus…       21.1        16.5         NA H         1986    9     6  
 9   1.60e9 O'Bri…       17.0        10.8        368 H         2011   NA    NA  
10   6.19e8 Richa…       17.0         9.16       464 P         2008    8.1   5.6
# ℹ 4 more variables: thick <dbl>, weight_oz <dbl>, title <chr>,
#   publisher <chr>

Distinct

Sometimes, we’d like to identify unique rows/cases based on specific columns. That is, we’re looking for distinct values of a variable. If we wanted to know: what are the levels of cover, we could code:

books |>
  distinct(cover)
# A tibble: 2 × 1
  cover
  <chr>
1 H    
2 P    

Even though we had 20 cases in books, the code above is asking: How many unique values of cover are there? Just two: H and P.

Just like summarise() and count(), the function distinct() completely changes the data frame. It will drop all the other variables.

Maybe I’d like to know how many authors are represented in my data frame, without caring about the authors themselves. We can use distinct(), and pipe to the function nrow() that we’ve seen before.

books |>
  distinct(author) |>
  nrow()
[1] 19

Note: The function nrow() is not a dplyr function. We can see that the output from nrow() is a numeric object, not a data frame! However, it can accept a data frame as input.

We can also use distinct() on several variables at once to yield instances of all unique combinations of levels observed in the data frame. In the following, we’re looking for all combinations of book cover types and publish years that appear in books, arranged by year:

books |>
  distinct(pub_year, cover) |>
  arrange(pub_year)
# A tibble: 14 × 2
   pub_year cover
      <dbl> <chr>
 1     1964 P    
 2     1986 H    
 3     1995 P    
 4     1995 H    
 5     2000 P    
 6     2004 P    
 7     2005 P    
 8     2007 P    
 9     2008 P    
10     2009 H    
11     2009 P    
12     2010 P    
13     2011 H    
14     2011 P    

So we see, for example, that we have cases with both cover types in the pub_year 1995, 2009, and 2011!

Select

Sometimes, we don’t want/need to have all the variables in the data frame. This can especially be true when we mutate() new variables along the way. If we want to present a “cleaner” data frame with only select few variables, we can do that by using the function select() and specifying which variables I want to keep:

books |>
  select(isbn_10, amazon_price)
# A tibble: 20 × 2
      isbn_10 amazon_price
        <dbl>        <dbl>
 1  375856110        14.0 
 2  684801221         9.09
 3  679437223        16.3 
 4   88730995        12.5 
 5 1596435704        10.8 
 6  385737955         9.99
 7  307408841        13   
 8  618918248         9.16
 9  385337930        10.2 
10 1929133472        20.0 
11  345521307        14.5 
12  780707508        16.5 
13  385341008         8.05
14  743454537        10.8 
15  146352719        15.0 
16  520246980        22.8 
17   15676248         9.64
18 1565125606         8.14
19   60731338         9.95
20 1591843790        15.0 

Notice that the variable names don’t go in quotes here! We don’t treat them as character objects!