Categorical data

Visualizations and contingency tables

2025-09-22

Housekeeping

Problem set 2 is due tonight! Please be sure to submit both the written and rendered parts, combined into a single PDF
Problem set 1 graded

Categorical data

Recall that a variable is either numerical or categorical
Categorical variables are variables that can take one of a limited (usually fixed) number of possible values, known as levels
- Represent data that can be divided into groups
Two types:
- Ordinal: the levels have a special ordering
- Nominal: the levels don’t have an ordering
  - We almost exclusively treat categorical variables as nominal in this class
Examples:
- Blood type (A, B, AB, O)
- Education level (high school, college, graduate degree, other)
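
In R, categorical variables are usually stored as factors. The sketch below shows one way the nominal/ordinal distinction could be encoded; the vectors `blood` and `education` are made-up illustrations (not part of the insurance data), and the "other" education level is left out because it has no natural place in the ordering.

```r
# Nominal: the levels are just labels, with no ordering implied
blood <- factor(c("A", "O", "B", "AB", "O"),
                levels = c("A", "B", "AB", "O"))

# Ordinal: ordered = TRUE tells R that the levels run from low to high
education <- factor(
  c("college", "high school", "graduate degree"),
  levels  = c("high school", "college", "graduate degree"),
  ordered = TRUE
)

is.ordered(blood)      # FALSE
is.ordered(education)  # TRUE

# Comparisons such as < are only defined for ordered factors
below_grad <- education < "graduate degree"
```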

Insurance data

Univariate EDA

If we are interested in understanding the distribution of a single categorical variable, it is common to:

Display a frequency table, which is a table of counts of each level

# A tibble: 2 × 2
  smoker     n
  <chr>  <int>
1 no       155
2 yes       45

Create a bar plot, where different levels are displayed on one axis and the counts are portrayed on the other
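
A frequency table like the one above can be computed in a few ways. The following is a minimal base R sketch. Because the insurance data are not reproduced here, it builds a stand-in data frame from the counts shown on these slides; the real `insurance` data have more columns.

```r
# Stand-in for the insurance data, rebuilt from the counts on the slides
insurance <- data.frame(
  smoker = rep(c("no", "no", "yes", "yes"), times = c(87, 68, 17, 28)),
  sex    = rep(c("female", "male", "female", "male"), times = c(87, 68, 17, 28))
)

# Count how many observations fall in each level of smoker
smoker_counts <- table(smoker = insurance$smoker)
smoker_counts
# no: 155, yes: 45

# Same counts as a data frame with columns smoker and n
smoker_df <- as.data.frame(smoker_counts, responseName = "n")
```

With dplyr loaded, `count(insurance, smoker)` gives the same counts as a tibble.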

Bivariate EDA

Perhaps we are interested in examining the distribution of two categorical variables at the same time
Summarize the distribution using a two-way table known as a contingency table:
- Each value in the table counts the number of times a particular combination of variable 1 and variable 2 levels occurred in the data
  
Contingency table
smoker	female	male
no	87	68
yes	17	28

How can we use the contingency table to obtain the distribution of just one of the variables?
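
One way to explore this in R is sketched below with base R's `table()`. The `insurance` data frame here is a stand-in rebuilt from the counts on these slides, not the real data set.

```r
# Stand-in for the insurance data, rebuilt from the counts on the slides
insurance <- data.frame(
  smoker = rep(c("no", "no", "yes", "yes"), times = c(87, 68, 17, 28)),
  sex    = rep(c("female", "male", "female", "male"), times = c(87, 68, 17, 28))
)

# Two-way contingency table: rows are smoker levels, columns are sex levels
tab <- table(smoker = insurance$smoker, sex = insurance$sex)

# Summing across each row collapses sex, leaving the counts for smoker alone
smoker_totals <- rowSums(tab)   # no: 155, yes: 45

# Summing down each column collapses smoker, leaving the counts for sex alone
sex_totals <- colSums(tab)      # female: 104, male: 96

# addmargins() appends both sets of totals to the table
addmargins(tab)
```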

Dodged bar plot

The dodged bar plot directly converts the contingency table to a visualization.

Contingency table
smoker	female	male
no	87	68
yes	17	28

Stacked bar plot

The stacked bar plot looks at the counts either row-wise or column-wise.

Contingency table
smoker	female	male
no	87	68
yes	17	28

Proportions

We can convert the contingency table to proportions row-wise or column-wise to obtain the fractional breakdown of one variable within each level of the other.

Contingency table
smoker	female	male
no	87	68
yes	17	28

Row-wise proportions
smoker	female	male
no	0.561	0.439
yes	0.378	0.622

What does the quantity 0.378 represent?
If we take the proportions row-wise, does each row need to sum to 1?
If we take the proportions row-wise, does each column need to sum to 1?
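
The row-wise proportions above can be reproduced with base R's `prop.table()`. This is a sketch that types the contingency table in by hand as a matrix named `tab` (a name chosen here for illustration) rather than computing it from the data.

```r
# Contingency table entered by hand; matrix() fills column by column
tab <- matrix(
  c(87, 17,    # female column: no, yes
    68, 28),   # male column:   no, yes
  nrow = 2,
  dimnames = list(smoker = c("no", "yes"), sex = c("female", "male"))
)

# margin = 1 divides every entry by its row total
row_props <- prop.table(tab, margin = 1)
round(row_props, 3)

# Compare the row sums and the column sums of the proportions
rowSums(row_props)
colSums(row_props)
```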

Proportions (cont.)

Set up how to find the column-wise proportions using our contingency table

Contingency table
smoker	female	male
no	87	68
yes	17	28
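
After setting this up by hand, one possible way to check the result in R is sketched below. As before, the matrix `tab` is typed in from the slide rather than computed from the data.

```r
# Contingency table entered by hand; matrix() fills column by column
tab <- matrix(
  c(87, 17,    # female column: no, yes
    68, 28),   # male column:   no, yes
  nrow = 2,
  dimnames = list(smoker = c("no", "yes"), sex = c("female", "male"))
)

# Column totals are the denominators for column-wise proportions
col_totals <- colSums(tab)   # female: 104, male: 96

# margin = 2 divides every entry by its column total
col_props <- prop.table(tab, margin = 2)
round(col_props, 3)
```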

Standardized bar plot

The standardized bar plot visualizes these row-wise or column-wise proportions.

Choosing a bar plot

Using any of the plots, do you believe that smoker status and sex are associated?
When might you prefer to use the stacked, dodged, or standardized bar plot?

Live code

Bar plots
Aesthetics: fill, shape
Faceting
Plot background

Bar plot (univariate)

ggplot(data = insurance, mapping = aes(x = smoker)) +
  geom_bar()

Note: if your data are already in the form of a frequency table, you should use geom_col() instead!
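
As a rough sketch of the geom_col() case, the data frame `smoker_counts` below is a hypothetical frequency table typed in from the smoker counts shown earlier. The counts are mapped to the y aesthetic explicitly.

```r
library(ggplot2)

# Hypothetical frequency table: one row per level, counts already computed
smoker_counts <- data.frame(
  smoker = c("no", "yes"),
  n      = c(155, 45)
)

# geom_col() takes bar heights from the y aesthetic instead of counting rows
p <- ggplot(data = smoker_counts, mapping = aes(x = smoker, y = n)) +
  geom_col()
p
```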

Bivariate bar plots

ggplot(insurance, aes(x = smoker, fill = sex)) +
  geom_bar(position = "dodge")

ggplot(insurance, aes(x = smoker, fill = sex)) +
  geom_bar(position = "stack") # this is the default

Bivariate bar plots (cont.)

ggplot(insurance, aes(x = smoker, fill = sex)) +
  geom_bar(position = "fill")

How might we make the bars horizontal instead of vertical?

Visualizing numerical and categorical

ggplot(data = insurance, mapping = aes(x = bmi, y = charges, col = smoker)) +
  geom_point()

What do you notice about the legend for color compared to the legend for color from last week?

Aesthetic: shape

ggplot(data = insurance, mapping = aes(x = bmi, y = charges, shape = smoker)) +
  geom_point()

facet_wrap()

Faceting is used when we want to split a particular visualization by the values of another (categorical) variable

ggplot(data = insurance, 
       mapping = aes(x = bmi)) +
  geom_histogram() +
  facet_wrap(~ smoker)

ggplot(data = insurance, 
       mapping = aes(x = bmi)) +
  geom_histogram() +
  facet_wrap(~ smoker, scales = "free_y")

Side-by-side box plots

ggplot(data = insurance, 
       mapping = aes(x = smoker, y = bmi)) +
  geom_boxplot()

Like faceting, but only for box plots. Really good for comparing a numerical variable across the levels of a categorical variable!

Changing plot theme

Change the background of plots by adding on any one of the following:

theme_bw(), theme_minimal(), theme_gray(), theme_void(), and a few more (see all the options in the help file for any one of these)

ggplot(data = insurance, 
       mapping = aes(x = smoker, y = bmi)) +
  geom_boxplot() +
  theme_minimal()
