55. Multiple Regression: More Features, More Power (And More Ways to Break Things)

DEV Community
Akhilesh

In the last post, you predicted house prices using one feature. One number in, one number out. Real problems don't work like that. House prices depend on size, location, number of rooms, age of the building, crime rate in the area, school ratings, and a dozen other things. Using just one feature leaves most of the useful information on the table.

Multiple regression is the same idea as linear regression, just with more inputs. Sounds simple. But more features bring new problems you need to know about. This post covers:

- How multiple regression extends single-feature regression
- What multicollinearity is and why it messes up your model
- How to check for it and what to do about it
- Feature selection: picking what actually matters
- Regularization basics: Ridge and Lasso
- Full working code on a real dataset

Single-feature linear regression: `y = w * x + b`

Multiple regression with 3 features: `y = w1*x1 + w2*x2 + w3*x3 + b`

With n features: `y = w1*x1 + w2*x2 + ... + wn*xn + b`

Each feature gets its own weight. The model finds the set of weights that minimizes the total error across all training examples. The math underneath is the same least squares approach, just extended to more dimensions. In matrix form it's just `y = X @ w + b`, where X is your full feature matrix. That's why scikit-learn handles 1 feature or 100 features with the exact same code.

```python
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score, mean_squared_error
import numpy as np
import pandas as pd

# Load data - 8 features this time
housing = fetch_california_housing()
X = pd.DataFrame(housing.data, columns=housing.feature_names)
y = housing.target

print(f"Features: {X.shape[1]}")
print(f"Samples: {X.shape[0]}")
print(f"\nFeature names: {list(X.columns)}")
```

Output:

```
Features: 8
Samples: 20640

Feature names: ['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup', 'Latitude', 'Longitude']
```

```python
# Same workflow as before, just more features
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

model = LinearRegression()
model.fit(X_train_s, y_train)

y_pred = model.predict(X_test_s)
r2 = r2_score(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))

print(f"R2: {r2:.3f}")
print(f"RMSE: {rmse:.3f}")
```

Output:

```
R2: 0.576
RMSE: 0.746
```

Same code. More features. Better result than using just one.

With multiple features, each coefficient tells you: holding everything else constant, if this feature increases by 1 standard deviation (the features were standardized), the prediction changes by this much.

```python
# See all feature weights
coef_df = pd.DataFrame({
    'Feature': housing.feature_names,
    'Coefficient': model.coef_
}).sort_values('Coefficient', ascending=False)

print(coef_df.to_string(index=False))
```

Output:

```
   Feature  Coefficient
    MedInc        0.852
  AveRooms        0.308
  HouseAge        0.122
Population       -0.034
  AveOccup       -0.039
 AveBedrms       -0.249
 Longitude       -0.869
  Latitude       -0.900
```

MedInc has the biggest positive effect. Latitude and Longitude have big negative effects. This is a California dataset where southern areas (lower latitude) tend to be more expensive.
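To tie this back to the matrix form mentioned earlier, here is a quick sanity check. This is a minimal sketch that reuses `model`, `X_test_s`, and `y_pred` from the snippets above: the fitted model really is just `y = X @ w + b`.

```python
# The learned weights live in model.coef_ and the bias in model.intercept_.
# Multiplying the feature matrix by the weights and adding the bias
# reproduces model.predict() exactly.
manual_pred = X_test_s @ model.coef_ + model.intercept_
print(np.allclose(manual_pred, y_pred))  # True
```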
Here's where multiple regression gets tricky. Multicollinearity happens when two or more of your features are strongly correlated with each other. When features are highly correlated, the model can't figure out which one is actually causing the effect on the target. The coefficients become unstable and hard to interpret.

A simple example: imagine predicting house price using both "number of rooms" and "house size in square feet." These two features move together almost perfectly. Bigger houses have more rooms. The model gets confused about how to split credit between them.

```python
# Check correlations between features
import matplotlib.pyplot as plt
import seaborn as sns

corr_matrix = X.corr()

plt.figure(figsize=(9, 7))
sns.heatmap(corr_matrix, annot=True, fmt='.2f', cmap='coolwarm',
            center=0, square=True)
plt.title('Feature Correlation Matrix')
plt.tight_layout()
plt.savefig('correlation_heatmap.png', dpi=100)
plt.show()
```

Look at AveRooms and AveBedrms. They're likely correlated because more rooms usually means more bedrooms.

```python
print(f"Correlation between AveRooms and AveBedrms: "
      f"{X['AveRooms'].corr(X['AveBedrms']):.3f}")
# Output: Correlation between AveRooms and AveBedrms: 0.848
```

0.848 is high. These two features are telling the model similar information.

Correlation between pairs of features is a start. But VIF (Variance Inflation Factor) gives you a proper measure for each feature. VIF tells you how much the variance of a coefficient is inflated because of correlations with other features.

- VIF = 1: no correlation with other features. Good.
- VIF between 1 and 5: moderate. Usually fine.
- VIF above 5 or 10: high multicollinearity. Problem.

```python
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.preprocessing import StandardScaler
import pandas as pd
import numpy as np

# Scale first (VIF works on scaled data too)
X_scaled = pd.DataFrame(
    StandardScaler().fit_transform(X),
    columns=X.columns
)

# Calculate VIF for each feature
vif_data = pd.DataFrame()
vif_data['Feature'] = X_scaled.columns
vif_data['VIF'] = [
    variance_inflation_factor(X_scaled.values, i)
    for i in range(X_scaled.shape[1])
]

print(vif_data.sort_values('VIF', ascending=False).to_string(index=False))
```

Output:

```
   Feature     VIF
 Longitude  74.231
  Latitude  71.482
  AveRooms   7.342
 AveBedrms   6.218
    MedInc   2.451
  HouseAge   1.308
Population   1.228
  AveOccup   1.219
```

Latitude and Longitude have very high VIF because they're geographically correlated. AveRooms and AveBedrms are also high.

Option 1: Drop one of the correlated features

If two features carry similar information, pick the one that makes more sense to keep, or the one with the lower VIF.

```python
# Drop AveBedrms since AveRooms carries similar info
X_reduced = X.drop(columns=['AveBedrms'])

X_train_r, X_test_r, y_train_r, y_test_r = train_test_split(
    X_reduced, y, test_size=0.2, random_state=42
)

scaler_r = StandardScaler()
X_train_rs = scaler_r.fit_transform(X_train_r)
X_test_rs = scaler_r.transform(X_test_r)

model_r = LinearRegression()
model_r.fit(X_train_rs, y_train_r)

r2_r = r2_score(y_test_r, model_r.predict(X_test_rs))
print(f"R2 after dropping AveBedrms: {r2_r:.3f}")
```
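To confirm the drop actually helped, you can recompute VIF on the reduced feature set. This is a quick sketch that reuses `X_reduced` from the snippet above; AveRooms' VIF should come down once its near-duplicate is gone.

```python
# Recompute VIF without AveBedrms (reuses X_reduced from above)
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.preprocessing import StandardScaler
import pandas as pd

X_reduced_scaled = pd.DataFrame(
    StandardScaler().fit_transform(X_reduced),
    columns=X_reduced.columns
)

vif_after = pd.DataFrame({
    'Feature': X_reduced_scaled.columns,
    'VIF': [
        variance_inflation_factor(X_reduced_scaled.values, i)
        for i in range(X_reduced_scaled.shape[1])
    ]
})
print(vif_after.sort_values('VIF', ascending=False).to_string(index=False))
```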
Option 2: Use Ridge Regression

Ridge adds a penalty for large coefficients. When features are correlated, Ridge shrinks them both rather than giving all the credit to one randomly.

```python
from sklearn.linear_model import Ridge

ridge = Ridge(alpha=1.0)
ridge.fit(X_train_s, y_train)

r2_ridge = r2_score(y_test, ridge.predict(X_test_s))
print(f"Ridge R2: {r2_ridge:.3f}")

# Compare coefficients
print("\nLinear vs Ridge coefficients:")
for feat, lr_c, ri_c in zip(housing.feature_names, model.coef_, ridge.coef_):
    print(f"  {feat:<12} {lr_c:>8.3f} {ri_c:>8.3f}")
```

Quick reference for the problems covered in this post:

| Problem | What to do |
| --- | --- |
| High VIF (> 5) on a feature | Drop one of the correlated features |
| Too many features | Use SelectKBest or Lasso |
| Unstable coefficients | Use Ridge regression |
| Want automatic feature removal | Use Lasso |
| Want to keep all features but shrink them | Use Ridge |
| No idea which features matter | Check correlation with target first |

Practice:

- Level 1: Load `load_diabetes()`. Check correlations between all features. Which two features are most correlated with each other? (A starter sketch is at the end of this post.)
- Level 2:
- Level 3:

Resources:

- Scikit-learn: Ridge
- Scikit-learn: Lasso
- Scikit-learn: Feature Selection
- Statsmodels VIF
- StatQuest: Ridge vs Lasso (YouTube)

Next up, Post 56: Logistic Regression: Classification With a Probability. We switch from predicting numbers to predicting categories, and the sigmoid function is the reason it works.
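If you want a starting point for the Level 1 exercise above, here is one possible approach. It's a minimal, self-contained sketch; the variable names are just illustrative.

```python
from sklearn.datasets import load_diabetes
import pandas as pd
import numpy as np

# Load the diabetes dataset as a DataFrame
diabetes = load_diabetes()
X_d = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)

# Absolute correlations between every pair of features
corr = X_d.corr().abs()

# Mask the diagonal so a feature is never paired with itself,
# then find the pair with the highest correlation
mask = ~np.eye(len(corr), dtype=bool)
pair = corr.where(mask).stack().idxmax()
print(f"Most correlated pair: {pair}, r = {corr.loc[pair[0], pair[1]]:.3f}")
```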