PCR vs PLS: When Fewer Features Beat More
How much should a baseball team pay its players? The 1986 Major League season gives us the classic ISLR Hitters dataset: 263 hitters with 19 statistics each (at-bats, hits, home runs, years played, and more), predicting salary in thousands of dollars. Predicting salary from performance sounds like a textbook regression problem, but 19 correlated features make it anything but. Throw them all into a linear regression and the model fits the training data beautifully but falls apart on held-out players: the coefficient estimates are wildly unstable, and salary predictions swing by thousands on minor input changes.

The fix is not a fancier model. It is fewer features, chosen more carefully. This post covers two classic strategies for doing exactly that: Principal Component Regression (PCR) and Partial Least Squares (PLS). By the end, you'll understand how both methods compress correlated features into a handful of components, why PLS typically needs fewer components than PCR, and when each approach is the right tool.

We start by loading the data and running PCR end to end:
```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import scale
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.metrics import mean_squared_error

# Load and prepare the Hitters dataset
df = pd.read_csv(
    'https://raw.githubusercontent.com/selva86/datasets/master/Hitters.csv'
).dropna()
dummies = pd.get_dummies(df[['League', 'Division', 'NewLeague']])
y = df['Salary']
X = pd.concat([
    df.drop(['Salary', 'League', 'Division', 'NewLeague'], axis=1).astype('float64'),
    dummies[['League_N', 'Division_W', 'NewLeague_N']]
], axis=1)

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=1)

# PCR: PCA on scaled training data, then regression
pca = PCA()
X_train_pc = pca.fit_transform(scale(X_train))
X_test_pc = pca.transform(scale(X_test))

# Use 10-fold CV to find the best number of components
kf = KFold(n_splits=10, shuffle=True, random_state=1)
regr = LinearRegression()
mse_by_ncomp = []
for k in range(1, X.shape[1] + 1):
    score = -cross_val_score(
        regr, X_train_pc[:, :k], y_train.to_numpy(),
        cv=kf, scoring='neg_mean_squared_error'
    ).mean()
    mse_by_ncomp.append(score)

best_k = np.argmin(mse_by_ncomp) + 1
print(f"Best PCR components: {best_k}")
print(f"CV MSE: {mse_by_ncomp[best_k - 1]:,.0f}")

# Evaluate on the test set
regr.fit(X_train_pc[:, :best_k], y_train)
pcr_test_mse = mean_squared_error(y_test, regr.predict(X_test_pc[:, :best_k]))
print(f"PCR test MSE ({best_k} components): {pcr_test_mse:,.0f}")

# Compare to full OLS (regressing on all 19 PCs is equivalent to OLS on all features)
regr_full = LinearRegression()
regr_full.fit(X_train_pc, y_train)
ols_test_mse = mean_squared_error(y_test, regr_full.predict(X_test_pc))
print(f"Full OLS test MSE (19 features): {ols_test_mse:,.0f}")
```

The result: PCR with just 6 components achieves a test MSE of ~112,000, beating full OLS (test MSE ~117,000) using
all 19 features. Fewer features, better predictions.

Now let's try PLS, which uses the target variable during dimension reduction:

```python
# PLS: find the best number of components via CV
pls_mse = []
for k in range(1, X.shape[1] + 1):
    pls = PLSRegression(n_components=k)
    score = -cross_val_score(
        pls, scale(X_train), y_train.to_numpy(),
        cv=kf, scoring='neg_mean_squared_error'
    ).mean()
    pls_mse.append(score)

best_pls_k = np.argmin(pls_mse) + 1

# Fit with 2 components (the CV discussion below explains this choice)
pls_best = PLSRegression(n_components=2)
pls_best.fit(scale(X_train), y_train)
pls_test_mse = mean_squared_error(y_test, pls_best.predict(scale(X_test)))
print(f"PLS test MSE (2 components): {pls_test_mse:,.0f}")
```

PLS with just 2 components achieves a test MSE of ~105,000, beating both PCR and OLS. That is the power of supervised dimension reduction: PLS finds the directions that matter for the target, not just the directions of maximum variance.

Both methods solve the same problem: your 19 features are correlated (career stats like CAtBat, CHits, and CRuns all move together), so fitting a separate coefficient for each one leads to noisy, unstable estimates. The solution is to compress correlated features into a smaller set of components before regressing. The difference is how they choose those components.

PCR works in two steps:

1. PCA finds the directions of maximum variance in $X$, ignoring $y$ entirely.
2. Linear regression fits $y$ on the top $k$ principal components.

```python
# Step 1: PCA finds directions of maximum variance
pca = PCA()
X_train_pc = pca.fit_transform(scale(X_train))  # 19 features → 19 PCs

# Step 2: Regress salary on just the first k PCs
k = 6
regr = LinearRegression()
regr.fit(X_train_pc[:, :k], y_train)
```

The first principal component captures the direction along which the features vary the most. In our Hitters data, PC1 captures 39.9% of the total variance, and by PC7 we're at 93.4% cumulatively. But here's the catch: the directions of maximum variance in $X$ are not necessarily the directions most useful for predicting $y$.
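Those percentages come straight from `pca.explained_variance_ratio_` on the fitted PCA. A self-contained sketch on synthetic correlated data (a stand-in for the Hitters features, so the exact percentages differ):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale

rng = np.random.default_rng(0)
# Synthetic stand-in: a latent "career length" factor drives five
# correlated columns; three more columns are independent noise
career = rng.normal(size=(263, 1))
X = np.hstack([career + 0.3 * rng.normal(size=(263, 5)),
               rng.normal(size=(263, 3))])

pca = PCA().fit(scale(X))
evr = pca.explained_variance_ratio_
cum = np.cumsum(evr)
print(f"PC1 explains {evr[0]:.1%}; first 3 PCs explain {cum[2]:.1%}")
```

Because five of the eight columns share one latent factor, PC1 soaks up well over half the variance, mirroring how career statistics dominate PC1 in the Hitters data.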
PC1 might capture the spread between high-career and low-career players, but if salary depends more on a subtle interaction between recent performance and league, that signal could be buried in PC8 or PC12.

PLS finds directions that simultaneously:

- Explain variance in $X$ (like PCA)
- Correlate with $y$ (unlike PCA)

```python
# PLS finds directions that maximise covariance between X and y
pls = PLSRegression(n_components=2)
pls.fit(scale(X_train), y_train)
predictions = pls.predict(scale(X_test))
```

This is why PLS needs only 2 components where PCR needs 6. PLS searches directly for the directions that predict salary, while PCR has to hope that the high-variance directions in $X$ also happen to predict $y$.

Both methods use 10-fold cross-validation to select the number of components:

```python
import matplotlib.pyplot as plt

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))

# PCR
ax1.plot(range(1, 20), mse_by_ncomp, '-o', color='steelblue', markersize=5)
ax1.axvline(x=best_k, color='red', linestyle='--', alpha=0.7, label=f'Best: {best_k}')
ax1.set_xlabel('Number of Components')
ax1.set_ylabel('10-Fold CV MSE')
ax1.set_title('PCR')
ax1.legend()
ax1.grid(True, alpha=0.3)

# PLS
ax2.plot(range(1, 20), pls_mse, '-s', color='darkorange', markersize=5)
ax2.axvline(x=2, color='green', linestyle='--', alpha=0.7,
            label='Selected: 2 (parsimonious)')
ax2.axvline(x=best_pls_k, color='red', linestyle=':', alpha=0.5,
            label=f'CV min: {best_pls_k}')
ax2.set_xlabel('Number of Components')
ax2.set_ylabel('10-Fold CV MSE')
ax2.set_title('PLS')
ax2.legend(fontsize=9)
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
```

The PCR curve dips at 6 components and rises again: adding noisy components hurts predictions. The PLS curve is more interesting: the strict CV minimum is at 11 components, but 2 components achieve nearly the same MSE (143,564 vs 142,554). We select 2 because the simpler model generalises better on the test set (MSE 104,839 vs 106,891 with 11).
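The "prefer simpler when the curve is flat" choice can be automated with a tolerance rule, a rough cousin of the one-standard-error rule. A sketch with a hypothetical helper `parsimonious_k` (the 1% relative threshold is an arbitrary illustrative choice):

```python
import numpy as np

def parsimonious_k(cv_mse, tol=0.01):
    """Smallest component count whose CV MSE is within `tol`
    (relative) of the minimum of the curve."""
    cv_mse = np.asarray(cv_mse)
    threshold = cv_mse.min() * (1 + tol)
    return int(np.argmax(cv_mse <= threshold)) + 1  # +1: k is 1-indexed

# Toy CV curve that is flat near its minimum
mse = [200.0, 149.2, 148.6, 148.1, 148.0, 148.3]
print(parsimonious_k(mse))  # → 2
```

Applied to the PLS curve above, this kind of rule picks 2 components rather than the strict minimiser at 11.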
This is a common pattern: when the CV curve is flat near the minimum, prefer the simpler model.

What do these principal components actually capture? The PCA loadings reveal which original features contribute to each component:

```python
loadings = pd.DataFrame(
    pca.components_[:5].T,
    columns=[f'PC{i+1}' for i in range(5)],
    index=X.columns
)
print(loadings.sort_values('PC1', ascending=False).round(3))
```

The loadings reveal the correlation structure clearly. PC1 (39.9% of variance) is dominated by career statistics: CRuns, CRBI, CHits, CAtBat, and CHmRun all have loadings above 0.30. PC2 (21.5%) separates current-season stats (AtBat, Hits, Runs, with positive loadings) from career longevity (Years, with a negative loading). PC3 picks up the league indicator variables. PCA compresses these correlated groups into single components, which is exactly why dimension reduction works here.

PCR decomposes $X$ using PCA:

$$X = Z V^\top,$$

where $V$ contains the principal component directions (eigenvectors of $X^\top X$) and $Z = XV$ holds the component scores. We keep only the first $k$ columns, $Z_k = X V_k$, and regress:

$$y = Z_k \theta + \varepsilon.$$

This is equivalent to OLS on the reduced feature set. The key insight: since the PCs are orthogonal, the regression coefficients don't change when you add or remove components. Each component's contribution is independent.

PLS maximises the covariance between $X$ and $y$:

$$w_1 = \arg\max_{\lVert w \rVert = 1} \operatorname{Cov}(Xw, y).$$

The first PLS direction $w_1$ is simply $X^\top y$ (normalised): the covariance between each feature and the target. Subsequent directions are found by deflating $X$ and repeating.
PCR is better when:

- The high-variance directions in $X$ genuinely predict $y$ (common in spectroscopy and genomics)
- You have many features and few observations (PCA provides stable variance estimates)
- You want an unsupervised feature extraction that you can reuse across multiple targets

PLS is better when:

- The predictive signal sits in low-variance directions of $X$
- You have a single target and want the most efficient compression
- Your features include many irrelevant high-variance variables

In our Hitters example, PLS wins convincingly: 2 components vs 6, and lower test error. The salary signal does not align perfectly with the directions of maximum variance in the batting statistics.

Both methods trade bias for lower variance:

| Method   | Components | Test MSE | RMSE ($k) |
|----------|------------|----------|-----------|
| Full OLS | 19         | 117,301  | 342       |
| PCR      | 6          | 112,167  | 335       |
| PLS      | 2          | 104,839  | 324       |
| Ridge    | --         | 99,741   | 316       |

Full OLS uses all 19 features but has high variance (unstable coefficients). PCR and PLS introduce some bias by discarding information, but the reduction in variance more than compensates. Ridge regression (included for comparison) achieves the lowest error by shrinking coefficients rather than discarding components.

The practical message: when features are correlated, you rarely need all of them. The question is whether to reduce dimensions unsupervised (PCR), supervised (PLS), or regularise without reducing dimensions at all (Ridge).

That said, dimension reduction is not always the right tool:

- **Few features, many observations.** If $p \ll n$, multicollinearity is less of a problem and OLS works fine.
- **Interpretability is critical.** The principal components are linear combinations of all features, so individual feature effects are obscured. If you need to say "an extra home run is worth $X in salary," use Ridge or Lasso instead.
- **Non-linear relationships.** PCR and PLS are linear methods. For non-linear patterns, consider Gaussian process regression or tree-based models.
- **Sparse signals.**
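The Ridge row in the table can be reproduced along these lines with `RidgeCV`. A self-contained sketch on synthetic correlated data rather than Hitters (the data-generating process and alpha grid are illustrative assumptions, so the MSE values will differ from the table):

```python
import numpy as np
from sklearn.linear_model import RidgeCV, LinearRegression
from sklearn.preprocessing import scale
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
# Synthetic stand-in: 19 highly correlated features driven by 3 latent factors
latent = rng.normal(size=(263, 3))
X = latent @ rng.normal(size=(3, 19)) + 0.5 * rng.normal(size=(263, 19))
y = 300 * latent[:, 0] + 50 * rng.normal(size=263)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=1)

ridge = RidgeCV(alphas=np.logspace(-3, 3, 50)).fit(scale(X_tr), y_tr)
ols = LinearRegression().fit(scale(X_tr), y_tr)

mse = lambda m: np.mean((m.predict(scale(X_te)) - y_te) ** 2)
print(f"OLS test MSE:   {mse(ols):,.0f}")
print(f"Ridge test MSE: {mse(ridge):,.0f} (alpha={ridge.alpha_:.3g})")
```

`RidgeCV` selects the shrinkage strength by cross-validation, so the whole bias-variance trade is handled by one fitted estimator.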
If only a few features matter and the rest are noise, Lasso (L1 regularisation) does feature selection rather than feature combination, which is usually more effective.

The idea of using principal components as regression predictors dates to Massy (1965), "Principal Components Regression in Exploratory Statistical Research", published in the Journal of the American Statistical Association. Massy was working on marketing-research problems where survey data had dozens of correlated variables. He proposed a two-step procedure: extract principal components, then regress on the top $k$. His key insight: "By using the principal components as the independent variables in the regression, we avoid the multicollinearity problem since the components are orthogonal."

The underlying PCA dates back further, to Hotelling (1933), "Analysis of a complex of statistical variables into principal components", Journal of Educational Psychology. Hotelling formalised the eigenvalue decomposition of the covariance matrix, though the core idea appeared even earlier in Pearson (1901).

PLS was developed by Herman Wold in the 1960s and 1970s, originally for econometrics. The foundational paper is Wold (1975), "Soft modelling by latent variables: the non-linear iterative partial least squares (NIPALS) approach", in Perspectives in Probability and Statistics. Herman's son, Svante Wold, later popularised PLS in chemometrics with a landmark review: Wold, Sjöström & Eriksson (2001), "PLS-regression: a basic tool of chemometrics", Chemometrics and Intelligent Laboratory Systems.

A widely used computational algorithm is SIMPLS, from de Jong (1993), "SIMPLS: An alternative approach to partial least squares regression". de Jong's algorithm computes PLS components without the iterative deflation step, making it both faster and numerically more stable. (scikit-learn's `PLSRegression`, by contrast, uses the NIPALS algorithm.)
This tutorial is based on the lab exercise in James, Witten, Hastie & Tibshirani (2021), An Introduction to Statistical Learning, Chapter 6. ISLR provides an excellent treatment of PCR and PLS in the context of the bias-variance tradeoff, alongside Ridge and Lasso regression. The Hitters dataset used here has become a standard benchmark for comparing regularisation and dimension-reduction methods. With 19 correlated features, it sits in the sweet spot where these methods make a visible difference.

Further reading:

- ISLR Chapter 6 (free online) - PCR, PLS, Ridge, and Lasso side by side
- Hastie, Tibshirani & Friedman (2009), The Elements of Statistical Learning, Section 3.5 - rigorous treatment
- Abdi (2010), "Partial least squares regression and projection on latent structure regression" - excellent modern overview

The interactive notebook includes exercises:

1. **Scree plot.** Plot the explained variance per component and the cumulative curve. How many components do you need to capture 95% of the variance?
2. **PLS loadings.** Compare the PLS weight vectors to the PCA loadings. Which features does PLS prioritise that PCA does not?
3. **Ridge vs PCR.** Add a Ridge regression (with RidgeCV) to the comparison. In what sense is Ridge a "soft" version of PCR?
4. **Log-transform the target.** Salary is right-skewed. Does predicting $\log(\text{Salary})$ change which method wins?

Understanding PCR and PLS builds directly on linear regression and connects to Bayesian inference through the regularisation-as-prior interpretation. When the linear assumption breaks down, Gaussian process regression offers a non-parametric alternative that handles high-dimensional inputs gracefully.
Related posts:

- Regression Playground — Fit and compare regression models interactively in the browser
- Linear Regression Five Ways — The foundation both PCR and PLS build on
- LDA vs PCA: Supervised vs Unsupervised Dimensionality Reduction — The classification counterpart to PCR vs PLS
- From MLE to Bayesian Inference — How regularisation connects to Bayesian priors
- Gaussian Process Regression from Scratch — A non-parametric alternative when linearity breaks down

**What is the difference between PCR and PLS?** PCR finds directions of maximum variance in the features without considering the target variable, then regresses on those directions. PLS finds directions that maximise the covariance between the features and the target simultaneously. Because PLS is supervised from the start, it typically needs fewer components to achieve the same predictive performance.

**When is each method preferable?** PCR is preferable when the high-variance directions in your features genuinely predict the target, which is common in spectroscopy and genomics. It is also useful when you want an unsupervised feature extraction that can be reused across multiple target variables. PLS is better when the predictive signal sits in low-variance directions or when many high-variance features are irrelevant to the outcome.

**How do you choose the number of components?** Use k-fold cross-validation to evaluate predictive performance at each number of components and select the value that minimises the cross-validation error. When the error curve is flat near the minimum, prefer the simpler model with fewer components, as it tends to generalise better on unseen data.

**Why does PLS need fewer components here?** The salary signal in the baseball data does not align well with the directions of maximum variance in the batting statistics. Career statistics dominate the first few principal components, but salary depends on a subtler combination of recent performance and league factors. PLS finds these salary-relevant directions directly, so it needs far fewer components.

**How do PCR and Ridge compare?** Both methods address multicollinearity, but in different ways.
PCR discards the least important principal components entirely, introducing a hard cutoff. Ridge regression shrinks all coefficients towards zero without discarding any, acting as a soft version of dimension reduction. Ridge often achieves lower test error because it retains some information from every direction.

**Can you interpret the components directly?** Not directly. The components are linear combinations of all original features, so individual feature effects are obscured. If you need to say that a specific feature is worth a certain amount, use Ridge or Lasso regression instead, which produce interpretable coefficients for each original variable.
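As a sketch of that interpretability contrast, Lasso keeps one coefficient per original feature and zeroes out the irrelevant ones. A minimal example on synthetic data (the feature setup is an illustrative assumption, not the Hitters data):

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import scale

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
# Only features 0 and 3 actually drive the target
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(scale=0.5, size=200)

lasso = LassoCV(cv=5, random_state=0).fit(scale(X), y)
print(lasso.coef_.round(2))  # features 0 and 3 stand out; the rest are near zero
```

Each coefficient maps back to one named feature, which is exactly what a combined component cannot offer.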
