2025-11-10
Homework 8 due tonight!
Project proposal feedback (revisions due tonight midnight)
Crash course; take STAT 211 for more depth!
Recall equation of a line: \(y = mx + b\)
Intercept \(b\) and slope \(m\) determine specific line
This function is deterministic: as long as we know \(x\), we know value of \(y\) exactly
Simple linear regression: statistical method where the relationship between variables \(x\) and \(y\) is modeled as a line + error:
\[ y = \underbrace{\beta_{0} \ +\ \beta_{1} x}_{\text{line}} \ + \underbrace{\epsilon}_{\text{error}} \]
We have two variables, \(x\) and \(y\). In the model:
\(\beta_{0}\) and \(\beta_{1}\) are the model parameters (intercept and slope)
\(\epsilon\) (epsilon) represents the error
Accounts for variability: we do not expect all data to fall perfectly on the line!
Sometimes we drop the \(\epsilon\) term for convenience
Suppose we have the following data:
Suppose we have some specific estimates \(b_0\) and \(b_{1}\). We could approximate the linear relationship using these values as:
\[ \hat{y} = b_{0} + b_{1} x \]
The hat on \(y\) signifies an estimate: \(\hat{y}\) is the estimated/fitted value of \(y\) given these specific values of \(x\), \(b_{0}\) and \(b_{1}\)
Note that the fitted value is obtained without the error
Residuals (denoted as \(e\)) are the remaining variation in the data after fitting a model.
\[ \text{observed response} = \text{fit} + \text{residual} \]
\[y_{i} = \hat{y}_{i} + e_{i} \Rightarrow e_{i} = y_{i} - \hat{y}_{i}\]
Residual = difference between the observed and fitted values
In the plot, the residual is indicated by the vertical dashed line
What is the ideal value for a residual? What does a positive/negative residual indicate?
Residual values for the three highlighted observations:
| x | y | y_hat | residual |
|---|---|---|---|
| -2.991 | 2.481 | -0.130 | 2.611 |
| -1.005 | -1.302 | 0.691 | -1.994 |
| 3.990 | 3.929 | 2.757 | 1.172 |
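As a quick numerical check, here is a minimal Python sketch (values copied from the table above; the fitted values \(\hat{y}\) are taken as given) that recovers the residual column via \(e_{i} = y_{i} - \hat{y}_{i}\):

```python
import numpy as np

# Three highlighted observations, copied from the table above
y     = np.array([ 2.481, -1.302, 3.929])   # observed responses
y_hat = np.array([-0.130,  0.691, 2.757])   # fitted values from the line

# Residual = observed - fitted
residuals = y - y_hat
print(residuals)   # roughly [ 2.611, -1.993, 1.172 ]; the table differs only by rounding
```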
Residuals are very helpful in evaluating how well a model fits a set of data
Residual plot: the original \(x\) values on the \(x\)-axis plotted against the corresponding residuals on the \(y\)-axis

Residual plots can be useful for identifying characteristics/patterns that remain in the data even after fitting a model.
Just because you fit a model to data does not mean the model is a good fit!

Can you identify any patterns remaining in the residuals?
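For reference, here is a minimal matplotlib sketch of how such a residual plot could be drawn; the arrays simply reuse the three highlighted observations from earlier, purely for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt

# x values and their residuals (e.g. computed as y - y_hat, as in the earlier sketch)
x = np.array([-2.991, -1.005, 3.990])
residuals = np.array([2.611, -1.993, 1.172])

plt.scatter(x, residuals)
plt.axhline(0, linestyle="--")   # reference line at the ideal residual value of 0
plt.xlabel("x")
plt.ylabel("residual")
plt.title("Residual plot")
plt.show()
```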
Different data may exhibit different strengths of linear relationship:
Correlation describes the strength of a linear relationship between two variables
Always takes a value between -1 and 1
-1 = perfectly linear and negative
1 = perfectly linear and positive
0 = no linear relationship
Nonlinear trends, even when strong, sometimes produce correlations that do not reflect the strength of the relationship
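As a small illustration, here is a sketch of computing a correlation with numpy (the data are made up; `np.corrcoef` returns a correlation matrix, so we take an off-diagonal entry):

```python
import numpy as np

# Hypothetical paired observations
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y = np.array([-1.9, -0.8, 0.2, 1.1, 1.8])

r = np.corrcoef(x, y)[0, 1]   # correlation between x and y
print(r)                      # always falls between -1 and 1
```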

In algebra class, there was a single (intercept, slope) pair because the \((x,y)\) points had no error; all points landed exactly on the line.
Now, we assume there is error
How do we choose a single “best” \((b_{0}, b_{1})\) pair?
The following plots display the same set of 50 observations.

Which line would you say fits the data the best?
There are infinitely many choices of \((b_{0}, b_{1})\) that could be used to create a line
We want the BEST choice (i.e. the one that gives us the “line of best fit”)
How to define “best”?
One way to define “best” is to choose the specific values of \((b_{0}, b_{1})\) that make the residuals as small as possible across all \(n\) data points. Since positive and negative residuals would cancel if we simply summed them, we work with their magnitudes, giving two possible criteria to minimize (compared numerically in the sketch below):
\[ |e_{1}| + |e_{2}| + \ldots + |e_{n}| \qquad \text{(least absolute deviations)} \]
\[ e_{1}^2 + e_{2}^2 + \ldots + e_{n}^2 \qquad \text{(least squares)} \]
The choice of \((b_{0}, b_{1})\) that satisfies the least squares criterion yields the least squares line, and this will be our definition of “best”
On the previous slide, the yellow line is the least squares line, whereas the pink line is the least absolute deviations line
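Here is the small numerical comparison of the two criteria mentioned above; the data and the candidate \((b_{0}, b_{1})\) pairs are made up purely for illustration:

```python
import numpy as np

# Hypothetical data
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y = np.array([-1.8, -0.7, 0.3, 0.8, 2.1])

def criteria(b0, b1):
    e = y - (b0 + b1 * x)                     # residuals for this candidate line
    return np.sum(np.abs(e)), np.sum(e**2)    # (least absolute deviations, least squares)

# Evaluate a few candidate (b0, b1) pairs
for b0, b1 in [(0.0, 0.9), (0.1, 1.0), (0.2, 0.8)]:
    abs_sum, sq_sum = criteria(b0, b1)
    print(f"b0={b0}, b1={b1}: sum|e|={abs_sum:.3f}, sum e^2={sq_sum:.3f}")
```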
Remember, our linear regression model is:
\[ y = \beta_{0} + \beta_{1}x + \epsilon \]
While not wrong, it can be good practice to be specific about an observation \(i\):
\[ y_{i} = \beta_{0} + \beta_{1} x_{i} + \epsilon_{i}, \qquad i = 1,\ldots, n \]
Here, we are stating that each observation \(i\) has its own specific response \(y_{i}\), explanatory value \(x_{i}\), and error \(\epsilon_{i}\)
In SLR, we further assume that the errors \(\epsilon_{i}\) are independent and Normally distributed
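To make these assumptions concrete, here is a minimal sketch that simulates data from this model with independent, Normally distributed errors; all parameter values are made up:

```python
import numpy as np

rng = np.random.default_rng(42)

n = 50
beta0, beta1, sigma = 1.0, 0.5, 0.8     # hypothetical true intercept, slope, and error SD

x = rng.uniform(-3, 3, size=n)          # explanatory values
eps = rng.normal(0, sigma, size=n)      # independent Normal errors
y = beta0 + beta1 * x + eps             # y_i = beta_0 + beta_1 x_i + epsilon_i
```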
Like when using the CLT, we should check some conditions before saying a linear regression model is appropriate!
Assume for now that \(x\) is a continuous numerical variable.
Linearity: data should show a linear trend between \(x\) and \(y\)
Independence: the observations \(i\) are independent of each other
e.g. random sample
Non-example: time-series data
Normality/nearly normal residuals: the residuals should appear approximately Normal
Equal variability: variability of points around the least squares line remains roughly constant
We will see how to check for these four LINE conditions using the cherry data from openintro.
| diam (inches) | volume (cubic ft) |
|---|---|
| 8.3 | 10.3 |
| 8.6 | 10.3 |
| 8.8 | 10.2 |
| 10.5 | 16.4 |
| 10.7 | 18.8 |
Explanatory variable \(x\): diam
Response variable \(y\): volume
Our candidate linear regression model is as follows
\[ \text{volume} = \beta_{0} + \beta_{1} \text{diameter} +\epsilon \]
Assess before fitting the linear regression model by making a scatterplot of \(x\) vs. \(y\):
Does there appear to be a linear relationship between diameter and volume?
Assess before fitting the linear regression model by understanding how your data were sampled.
The cherry data do not explicitly say that the trees were randomly sampled, but it might be a reasonable assumption.
An example where independence is violated:

Here, the data are a time series, where observation at time point \(i\) depends on the observation at time \(i-1\).
Because the first two conditions are met, we can go ahead and fit the linear regression model (i.e. estimate the values of the coefficients)
\[ \widehat{\text{volume}} = -36.94 + 5.07 \times \text{diameter} \]
Remember: the “hat” denotes an estimated/fitted value!
We will soon see how \(b_{0}\) and \(b_{1}\) are calculated and how to interpret them
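As a quick use of the fitted equation, here is a sketch that plugs a hypothetical diameter into \(\widehat{\text{volume}} = -36.94 + 5.07 \times \text{diameter}\); the diameter value is made up:

```python
# Fitted coefficients from the slide above
b0, b1 = -36.94, 5.07

diameter = 12.0                  # hypothetical diameter (inches)
volume_hat = b0 + b1 * diameter  # fitted/predicted volume
print(volume_hat)                # -36.94 + 5.07 * 12 = 23.9 (approximately)
```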
The next two checks can only occur after fitting the model.
Assess after fitting the model by making a histogram of the residuals and checking for approximate Normality.
Do the residuals appear approximately Normal?
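A minimal matplotlib sketch for this check; the residuals here are stand-in values, since in practice you would use the residuals `y - y_hat` from your fitted model:

```python
import numpy as np
import matplotlib.pyplot as plt

# Stand-in residuals for illustration; replace with y - y_hat from your fitted model
rng = np.random.default_rng(1)
residuals = rng.normal(0, 3, size=30)

plt.hist(residuals, bins=10)
plt.xlabel("residual")
plt.title("Histogram of residuals")
plt.show()
```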
Assess after fitting the model by examining a residual plot and looking for patterns.
A good residual plot:

A bad residual plot:

We usually add a horizontal line at 0.
Let’s examine the residual plot of our fitted model for the cherry data:
Do we think equal variance is met?
I would say there is a definite pattern in the residuals, so the equal variance condition is not met.
Some of the variability in the errors appears to be related to diameter