Regression Analysis: A Practical Guide
Regression remains the workhorse for predicting continuous outcomes and explaining relationships between variables. Despite the rise of deep learning, linear and generalised linear models provide transparency, speed, and strong baselines. This guide covers assumptions, diagnostics, regularisation, and when to switch to non‑linear approaches.
1) Assumptions & Diagnostics
Check linearity (partial residual plots), independence (from the study design and sampling scheme), homoscedasticity (residuals vs. fitted values), and normality of errors (QQ plots). Violations are common and not fatal: robust regression, transformations, or heteroscedasticity-consistent standard errors can mitigate the damage.
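A minimal diagnostic sketch, assuming statsmodels and matplotlib with synthetic data standing in for a real dataset; it walks the same checks as above and finishes with heteroscedasticity-consistent (HC3) standard errors as one mitigation.

```python
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

# Synthetic data standing in for a real design matrix and outcome.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 1.0 + X @ np.array([2.0, -1.0]) + rng.normal(scale=1.0, size=200)

fit = sm.OLS(y, sm.add_constant(X)).fit()

# Residuals vs. fitted values: funnels suggest heteroscedasticity,
# curvature suggests a missed non-linearity.
plt.figure()
plt.scatter(fit.fittedvalues, fit.resid, s=10)
plt.axhline(0, color="grey")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")

# QQ plot of residuals: systematic departures from the line
# indicate non-normal errors.
sm.qqplot(fit.resid, line="45", fit=True)
plt.show()

# Heteroscedasticity-consistent (HC3) standard errors as a mitigation.
print(fit.get_robustcov_results(cov_type="HC3").summary())
```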
2) Feature Selection & Multicollinearity
Use domain knowledge first. Quantify collinearity via variance inflation factors (VIF). Penalised regression helps: ridge stabilises coefficient estimates under collinearity, while the lasso reduces variance by selecting a sparse subset (somewhat arbitrarily among highly correlated predictors). Remember that sparsity from the lasso is a modelling choice, not a truth claim.
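A short sketch of VIF computation with statsmodels on deliberately collinear synthetic data; the 5-10 threshold in the comment is a common rule of thumb, not a hard cutoff.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(1)
x1 = rng.normal(size=300)
X = pd.DataFrame({
    "x1": x1,
    "x2": x1 + rng.normal(scale=0.1, size=300),  # nearly collinear with x1
    "x3": rng.normal(size=300),
})
X = sm.add_constant(X)

# VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing column j on
# the rest. Rules of thumb flag VIFs above roughly 5-10; the "const" row
# is an artefact of including the intercept and can be ignored.
vifs = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
print(vifs)
```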
3) Regularisation
Ridge (L2) shrinks coefficients towards zero, the lasso (L1) sets some exactly to zero, and the elastic-net blends both. Standardise features first, since penalties are scale-sensitive; cross-validate the penalty strength and prefer simpler models when performance differences are negligible.
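A sketch of cross-validating the elastic-net penalty with scikit-learn on synthetic data; the pipeline standardises features first because the penalty is scale-sensitive.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNetCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=500, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

# l1_ratio=1.0 is the lasso; values near 0 approach ridge behaviour.
model = make_pipeline(
    StandardScaler(),
    ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9, 1.0], cv=5),
)
model.fit(X, y)

enet = model.named_steps["elasticnetcv"]
print("chosen alpha:", enet.alpha_, "l1_ratio:", enet.l1_ratio_)
print("non-zero coefficients:", int(np.sum(enet.coef_ != 0)))
```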
4) Non‑Linearities & Interactions
Use splines, low-degree polynomials (with care, since high-degree fits oscillate and extrapolate badly), or tree-based models to capture non-linear effects. Interaction terms reveal conditional relationships (e.g., price sensitivity differs by customer segment).
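A sketch using the statsmodels formula API, combining a B-spline basis for price with a price-by-segment interaction; the sales/price/segment frame is hypothetical, built to echo the price-sensitivity example above.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n = 400
df = pd.DataFrame({
    "price": rng.uniform(1, 10, n),
    "segment": rng.choice(["retail", "wholesale"], n),
})
# Segment-specific price sensitivity baked into the simulated sales.
slope = np.where(df["segment"] == "retail", -1.5, -0.4)
df["sales"] = 50 + slope * df["price"] + rng.normal(scale=2.0, size=n)

# bs(price, df=4) adds a B-spline basis for a smooth non-linear price
# effect; "* segment" expands to main effects plus the interaction.
fit = smf.ols("sales ~ bs(price, df=4) * segment", data=df).fit()
print(fit.summary())
```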
5) GLMs & Beyond
When outcomes are counts or rates, Poisson regression may fit better, or negative binomial when the counts are overdispersed. For bounded outcomes consider beta regression. Quantile regression estimates conditional quantiles and is robust to outliers in the response, which makes it well suited to service-level guarantees.
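A sketch fitting a Poisson GLM and a 90th-percentile quantile regression with statsmodels; the incident counts and exposure covariate are synthetic, and the negative-binomial alternative for overdispersed counts is noted in a comment.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n = 500
df = pd.DataFrame({"exposure_hours": rng.uniform(1, 20, n)})
lam = np.exp(0.2 + 0.1 * df["exposure_hours"])
df["incidents"] = rng.poisson(lam.to_numpy())

# Poisson GLM with the default log link; swap the family for
# sm.families.NegativeBinomial() if the counts are overdispersed.
pois = smf.glm("incidents ~ exposure_hours", data=df,
               family=sm.families.Poisson()).fit()
print(pois.summary())

# Conditional 90th percentile: a natural fit for service-level guarantees.
q90 = smf.quantreg("incidents ~ exposure_hours", data=df).fit(q=0.9)
print(q90.params)
```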
FAQ
How do I choose between RMSE and MAE?
MAE is robust to outliers, while RMSE penalises large errors more heavily because errors are squared before averaging. Choose based on the business cost of large misses; the short sketch below makes the contrast concrete.
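A tiny numeric illustration using only NumPy: a single large miss moves RMSE far more than MAE.

```python
import numpy as np

y_true = np.array([10.0, 12.0, 11.0, 9.0, 10.0])
y_pred = np.array([10.5, 11.5, 11.0, 9.5, 30.0])  # one large miss

mae = np.mean(np.abs(y_true - y_pred))           # each error weighted equally
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))  # squaring amplifies the big miss

print(f"MAE:  {mae:.2f}")   # ~4.30
print(f"RMSE: {rmse:.2f}")  # ~8.95
```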
Can I interpret coefficients causally?
Only under strong identification assumptions. For interventions, use causal inference tools such as instrumental variables (IV), difference-in-differences (DiD), or randomised controlled trials (RCTs).