Visualization and summary statistics (part 1)
2025-09-15
Variables can be broadly broken into two categories: numerical (quantitative) and categorical (qualitative)
Numerical variables take a wide range of numerical values, and it is sensible to add/subtract/do mathematical operations with those values. Two types:
Discrete if it can only take on finitely many numerical values within a given interval
Continuous if it can take on any of infinitely many values within a given interval
Categorical variables are essentially everything else (more on this next week!)
Examples and non-examples?
We will be looking at some medical insurance data throughout these slides.
Which of the following variables are numerical? Which are discrete vs. continuous?
We often care about a variable’s distribution: the different values the variable can take on, along with how often each value occurs
Rather than provide someone with an entire dataset, it is more useful to provide quick “snapshot” information
Two pieces of quantitative information that describe a distribution:
Center
Spread
The most common way to measure the center of the distribution of a numerical variable is the mean (also called the average)
Sample mean: a mean calculated using sampled data. The sample mean is typically denoted as \(\bar{x}\)
The sample mean is the sum of all the observed values of the variable, divided by the total number of observations \(n\):
\[\bar{x} = \frac{x_{1} + x_{2} + \ldots + x_{n}}{n} = \frac{1}{n} \sum_{i=1}^{n} x_{i}\]
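As a minimal sketch, the formula above can be followed step by step in R. The vector `weights` and its values are made up for illustration; they are not data from the class activity.

```r
# Hypothetical candy weights in grams (made-up values for illustration only)
weights <- c(2.1, 1.8, 2.4, 2.0, 1.7)

n <- length(weights)        # number of observations
x_bar <- sum(weights) / n   # sum of the observed values, divided by n

x_bar
```

The hand-computed value should agree with R's built-in `mean(weights)`.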
The sample mean \(\bar{x}\) is an example of a sample statistic
The mean over the entire population is an example of a population parameter. The population mean is denoted \(\mu\) (Greek letter mu)
Let’s calculate the sample mean weight of a piece of candy in our bag. Let \(x\) be the weight of a candy.
Calculate your \(\bar{x}\)
How would we obtain the population mean \(\mu\)?
What is the average of the following values? \(\qquad 1, 4, 4\)
If instead there were ten 1’s and twenty 4’s, would the average be the same?
Thus, we see that means depend on proportions!
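A short R check of this example, assuming nothing beyond the values stated above. The names `small` and `large` are chosen here for illustration.

```r
# Three values: one 1 and two 4's
small <- c(1, 4, 4)

# Thirty values: ten 1's and twenty 4's (the same 1-to-2 proportions)
large <- c(rep(1, 10), rep(4, 20))

mean(small)  # 3
mean(large)  # 3

# Equivalent view: weight each distinct value by its proportion in the data
weighted.mean(c(1, 4), w = c(1/3, 2/3))
```

Both datasets have one third 1's and two thirds 4's, so the means agree even though the number of observations differs.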
Also at the heart of statistics is the variability, or spread, of a variable’s distribution
We will work with variance and standard deviation, which are ways to describe how spread out data are from their mean

We begin with deviation, which is the difference between an observation and the (sample) mean
The sample variance \(s^2\) squares the deviations and takes (nearly) an average, dividing by \(n-1\) rather than \(n\):
\[ s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_{i} - \bar{x})^2 \]
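The variance formula can be sketched in R one step at a time. The vector `x` holds made-up values for illustration, not data from the activity.

```r
# Hypothetical observations (made-up values for illustration only)
x <- c(2.1, 1.8, 2.4, 2.0, 1.7)

n <- length(x)
x_bar <- mean(x)

# Deviation of each observation from the sample mean
deviations <- x - x_bar

# Square the deviations, add them up, and divide by n - 1
s2 <- sum(deviations^2) / (n - 1)

s2
```

The deviations themselves sum to (essentially) zero, which is one reason they are squared before being averaged.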
Set up the calculation of the sample variance of your sample
The sample standard deviation \(s\) is simply the square root of the sample variance (\(s = \sqrt{s^2}\))
Like the mean, the population values for variance and standard deviation are denoted with Greek letters:
\(\sigma\) for population standard deviation (Greek letter “sigma”)
\(\sigma^2\) for population variance
If the standard deviation takes an extra step to calculate beyond the variance, why do we bother with the standard deviation?
Functions to calculate sample mean, variance, and standard deviation in R. Each expects a vector of numerical values as input:
mean()
var()
sd()
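A brief usage sketch of these three functions. The vector `x` is made up for illustration. The `na.rm = TRUE` argument is a standard option of these functions for dropping missing values before calculating.

```r
# Hypothetical numerical vector (made-up values for illustration only)
x <- c(2.1, 1.8, 2.4, 2.0, 1.7)

x_bar <- mean(x)  # sample mean
s2 <- var(x)      # sample variance (divides by n - 1)
s <- sd(x)        # sample standard deviation

# The standard deviation is the square root of the variance
all.equal(s, sqrt(s2))  # TRUE

# With a missing value, the result is NA unless missing values are removed
x_missing <- c(x, NA)
mean(x_missing)                # NA
mean(x_missing, na.rm = TRUE)  # same as mean(x)
```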
Scatterplots are bivariate (two-variable) visualizations that provide a case-by-case view of the data for two numerical variables

Use scatterplots to reveal:
Association (positive, negative, none), and if there is an association:
The strength (very weak to very strong)
The type of association (e.g. linear, quadratic)
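A minimal base R sketch of a scatterplot. The data are simulated here purely for illustration (they are not the insurance data), and the generating line `2 + 3 * x` is an assumption chosen to produce a positive, fairly strong, linear association.

```r
# Simulated data for illustration only (not the insurance data)
set.seed(1)
x <- runif(50, min = 0, max = 10)    # 50 made-up x values
y <- 2 + 3 * x + rnorm(50, sd = 2)   # y rises with x, plus random noise

# One point per case: x on the horizontal axis, y on the vertical axis
plot(x, y,
     xlab = "x (made-up values)",
     ylab = "y (made-up values)",
     main = "Positive, fairly strong, linear association")
```

Increasing the `sd` of the noise would weaken the visible association; flipping the sign of the slope would make it negative.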

To visualize the distribution (i.e. behavior) of a single variable, we could create a dot plot where:
Each case is plotted on a horizontal axis as a dot
Values that appear multiple times in the dataset would have stacked dots
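One way such a dot plot might be drawn in base R is with `stripchart()` and `method = "stack"`. The values in `weights` are made up for illustration.

```r
# Hypothetical weights rounded to whole grams (made-up values)
weights <- c(2, 2, 3, 3, 3, 4, 5, 5)

# method = "stack" piles repeated values on top of one another
stripchart(weights,
           method = "stack",
           pch = 16,        # solid dots
           offset = 0.5,    # vertical spacing between stacked dots
           xlab = "weight (grams, made-up values)")
```

Here the value 3 appears three times, so it gets the tallest stack.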
We can make a dot plot from our activity
Pros and cons?
The bmi variable has observed values from \(15.96\) through \(49.6\). Consider the following bins of width 4: [15, 19), [19, 23), [23, 27), …, [49, 53)
Histograms are visualizations that display the count of observations in each bin as a bar.
| bmi_bin | count |
|---|---|
| [15, 19) | 5 |
| [19, 23) | 12 |
| [23, 27) | 35 |
| [27, 31) | 58 |
| [31, 35) | 41 |
| [35, 39) | 35 |
| [39, 43) | 13 |
| [49, 53) | 1 |
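
A sketch of how binned counts like these could be produced in R with `cut()` and `table()`, and drawn with `hist()`. The `bmi` values below are made up for illustration (only the minimum 15.96 and maximum 49.6 come from the text), and the break points (width 4, starting at 15) are an assumption that may not match the bins used for the table.

```r
# Hypothetical bmi values (made up, apart from the stated minimum and maximum)
bmi <- c(15.96, 22.4, 24.1, 26.7, 28.3, 29.9, 30.2, 33.5, 36.8, 49.6)

# Assumed bin edges: 15, 19, 23, ..., 55
breaks <- seq(15, 55, by = 4)

# right = FALSE makes each bin include its left edge: [15, 19), [19, 23), ...
bmi_bin <- cut(bmi, breaks = breaks, right = FALSE)

# Count of observations in each bin
counts <- table(bmi_bin)
counts

# Histogram using the same bins: one bar per bin, bar height = count
hist(bmi, breaks = breaks, right = FALSE,
     xlab = "bmi (made-up values)", main = "Histogram of bmi")
```

Unlike a summary table that lists only non-empty bins, `table()` on the result of `cut()` also reports bins with a count of 0.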

A convenient way to describe a variable’s behavior is through the shape of its distribution. Using histograms, we should identify the skewness and the modality of the distribution.
How would you describe the shape (i.e. skewness and modality) of the distributions in the following two histograms?


Working in your groups: