📚 Study Guide: Exploring Two-Variable Data
Unit 2: Exploring Two-Variable Data
When analyzing relationships between two quantitative variables, we use scatterplots, correlation, and linear regression. This unit teaches you to describe the direction, form, and strength of associations, compute and interpret the correlation coefficient r, and fit least-squares regression lines. You will learn to distinguish between explanatory and response variables, interpret the slope and intercept in context, and analyze residuals to assess the appropriateness of a linear model. Transformations to achieve linearity, such as logarithmic transformations, are also introduced. The AP exam emphasizes interpreting regression output and residual plots rather than performing extensive hand calculations.
Key Concepts
- Scatterplots: Graphical display of two quantitative variables; describe direction (positive/negative), form (linear/nonlinear), and strength (weak/moderate/strong).
- Correlation r: Measures the strength and direction of a linear association, ranging from -1 to 1. Correlation has no units, is unaffected by switching x and y or by changing the units of measurement, and can be strongly affected by outliers.
- Least-Squares Regression: The line y_hat = a + bx, with intercept a and slope b, chosen to minimize the sum of squared residuals.
- Slope Interpretation: For every 1 unit increase in x, the predicted y changes by b units.
- Residuals: e = y - y_hat. Residual plots should show random scatter around zero for a linear model to be appropriate.
- Coefficient of Determination r^2: The proportion of variability in y explained by the linear model with x.
- Influential Points: Points that, if removed, significantly change the regression line, often extreme in the x-direction.
- Transformations: Log or power transformations can linearize nonlinear relationships for modeling.
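To make the transformation idea concrete, here is a minimal sketch using only Python's standard library. The data are made up (a quantity that doubles at each step), and the names years, population, and correlation are hypothetical choices for illustration, not part of any AP-provided tool.

```python
import math
from statistics import mean

def correlation(xs, ys):
    """Correlation r computed from sums of deviations about the means."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: a population that doubles every year (exponential growth).
years = [0, 1, 2, 3, 4, 5, 6, 7]
population = [100 * 2 ** t for t in years]

# Correlation on the original scale: the curve weakens the linear fit.
r_raw = correlation(years, population)

# Correlation after transforming y with log base 10: the points become linear.
log_population = [math.log10(p) for p in population]
r_log = correlation(years, log_population)

print(f"r on raw data:      {r_raw:.3f}")
print(f"r on log10(y) data: {r_log:.3f}")
```

On the raw scale r comes out at roughly 0.85 even though the growth pattern is perfectly regular. After the log transformation the points fall on a straight line and r is essentially 1, which is why a linear model fits the transformed data better.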
Vocabulary
- Explanatory variable: The variable that may explain or predict changes in the response variable (x-axis).
- Response variable: The outcome variable measured in a study (y-axis).
- Residual: The difference between the observed value and the value predicted by the regression line.
- Least-squares regression line: The line that minimizes the sum of the squared residuals.
- Influential point: A data point whose removal causes a substantial change in the regression equation.
- Lurking variable: A variable not included in the analysis that may influence the relationship between the explanatory and response variables.
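The definition of an influential point can be checked numerically by fitting the line with and without the suspect point. The sketch below assumes made-up data: five points lying exactly on y = x, plus one hypothetical extra point far to the right. The helper name fit_line is an illustrative choice.

```python
from statistics import mean

def fit_line(xs, ys):
    """Return (a, b) for the least-squares line y_hat = a + b*x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b

# Hypothetical data: five points exactly on the line y = x.
xs = [1, 2, 3, 4, 5]
ys = [1, 2, 3, 4, 5]

# Fit without the extra point, then with one point that is extreme in x.
a_without, b_without = fit_line(xs, ys)
a_with, b_with = fit_line(xs + [20], ys + [2])

print(f"without the point: y_hat = {a_without:.2f} + {b_without:.2f}x")
print(f"with the point:    y_hat = {a_with:.2f} + {b_with:.2f}x")
```

With this data the slope drops from 1 to nearly 0 once the single point at x = 20 is included, so that point is influential. A point with an unusual y-value near the middle of the x-range would typically shift the line far less.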
Formulas
- Regression line: y_hat = a + bx
- Slope: b = r * (s_y / s_x)
- Intercept: a = y_bar - b*x_bar
- Residual: e = y - y_hat
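- Coefficient of determination: r^2 = (explained variation) / (total variation) = 1 - SSE/SST, where SSE is the sum of squared residuals and SST is the sum of (y - y_bar)^2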
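The formulas above can be traced step by step on a small data set. This is a minimal sketch using Python's standard library. The data (hours and score) are invented for illustration, and the function names are assumptions rather than anything required on the exam, where a calculator or computer output normally does this work.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Sample correlation r: sum of z-score products divided by n - 1."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    n = len(xs)
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys))
    return total / (n - 1)

# Hypothetical data: hours studied (x) and quiz score (y).
hours = [1, 2, 3, 4, 5]
score = [2, 4, 5, 4, 5]

r = correlation(hours, score)

# Slope: b = r * (s_y / s_x)
b = r * stdev(score) / stdev(hours)

# Intercept: a = y_bar - b * x_bar
a = mean(score) - b * mean(hours)

# Residuals: e = y - y_hat (observed minus predicted)
predicted = [a + b * x for x in hours]
residuals = [y - y_hat for y, y_hat in zip(score, predicted)]

# r^2 two ways: squaring r, and 1 - SSE/SST
sse = sum(e ** 2 for e in residuals)
sst = sum((y - mean(score)) ** 2 for y in score)
r_squared = r ** 2
r_squared_check = 1 - sse / sst

print(f"y_hat = {a:.2f} + {b:.2f}x, r = {r:.3f}, r^2 = {r_squared:.2f}")
```

For this data the line is y_hat = 2.2 + 0.6x and r^2 is 0.6, so about 60% of the variation in score is accounted for by the linear model with hours. The residuals of a least-squares line always sum to zero, which is a quick check on hand calculations.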
Common Mistakes
- Assuming correlation implies causation; always consider lurking variables.
- Interpreting the y-intercept when x = 0 lies outside the range of the observed x-values, or when x = 0 has no meaning in context.
- Using the regression line to predict far outside the observed x-range (extrapolation).
- Concluding a linear model is good just because r is high without checking the residual plot for patterns.
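The last mistake is easy to demonstrate. The sketch below fits a line to made-up data that follow y = x^2 exactly, so the true relationship is curved. All names are hypothetical, and the example assumes only Python's standard library.

```python
import math
from statistics import mean

# Hypothetical data: a perfectly curved (quadratic) relationship.
xs = list(range(1, 11))
ys = [x ** 2 for x in xs]

# Least-squares slope and intercept from sums of deviations.
x_bar, y_bar = mean(xs), mean(ys)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
sxx = sum((x - x_bar) ** 2 for x in xs)
syy = sum((y - y_bar) ** 2 for y in ys)
b = sxy / sxx
a = y_bar - b * x_bar
r = sxy / math.sqrt(sxx * syy)

# Residuals: observed minus predicted, in order of increasing x.
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]

print(f"r = {r:.3f}")
print("residual signs:", "".join("+" if e > 0 else "-" for e in residuals))
```

Here r is about 0.97, which looks like a strong linear association, yet the residuals are positive at both ends and negative in the middle. That U-shaped pattern in the residual plot shows a line is not the appropriate model, despite the high r.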
AP Exam Strategies
- When interpreting slope, always include "predicted" or "on average" because regression gives predictions, not exact values.
- On FRQs, describe residual plots by stating whether there is a pattern: random scatter around zero supports a linear model, while a curved pattern indicates the linear model is not appropriate and a transformation may help.
- If asked for a prediction, substitute into y_hat = a + bx; if asked for a residual, compute observed minus predicted.
- Report r^2 as a percentage when explaining how much variation is accounted for by the model.
Real-World Applications
- Economics: Regression models relate advertising spending to sales revenue for budget allocation.
- Environmental Science: Scatterplots display the association between CO2 levels and global temperature anomalies.
- Medicine: Linear models predict patient recovery time based on initial severity scores.