
# 54. Linear Regression: Predicting Numbers From Patterns

DEV Community
Akhilesh

You want to predict something. A number. How much a house will sell for. How many units you'll sell next month. What temperature it'll be tomorrow. That's a regression problem. And linear regression is the first tool you reach for. It's the simplest ML model that actually does something useful. Every more complex model builds on the ideas here. You can't skip this one.

In this post:

- What linear regression actually does
- The equation y = mx + b and what each part means in ML
- What a cost function is and why we need one
- How least squares fitting works
- Building linear regression from scratch and with scikit-learn
- How to evaluate regression models (not accuracy, different metrics)
- Multiple features and what changes

## What Linear Regression Actually Does

You have two things that seem related. Hours studied and exam score. House size and house price. Temperature and ice cream sales. Plot them on a graph and you get a scatter of dots. Linear regression draws the best possible straight line through those dots. Once you have that line, you can plug in a new value on the X axis and read off a prediction on the Y axis. That's it. That's linear regression.

## The Equation

You've seen this since school:

```
y = mx + b
```

- y = the thing you're predicting (the output)
- x = the input feature you know
- m = slope (how much y changes when x increases by 1)
- b = intercept (the value of y when x is 0)

In ML we write it slightly differently, but it's the same thing:

```
y_hat = w * x + b
```

- y_hat = predicted value
- w = weight (same as slope m)
- x = input feature
- b = bias (same as intercept)

The model's job is to find the values of w and b that make the line fit the data as well as possible.

## The Cost Function

For any line you draw, some predictions will be too high and some too low. The difference between the real answer and your prediction is called the residual, or error:

```
error = actual - predicted
```

You want to minimize the total error across all your training examples. But you can't just add up raw errors, because positive and negative errors cancel out.
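To see the cancellation problem concretely, here's a tiny sketch with made-up numbers: one prediction is 2 too low, the other is 2 too high, and the raw errors sum to zero even though both predictions are wrong.

```python
import numpy as np

# Made-up example: two predictions, one too low and one too high
actual = np.array([10.0, 10.0])
predicted = np.array([8.0, 12.0])

errors = actual - predicted
print(errors)        # [ 2. -2.]
print(errors.sum())  # 0.0 -- the raw errors cancel even though both predictions are off
```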
So instead, you square each error and average them. This is the Mean Squared Error (MSE):

```
MSE = (1/n) * sum((actual - predicted)^2)
```

Squaring does two things: it makes all errors positive, and it punishes big errors more than small ones. A prediction that's off by 10 contributes 100 to the sum, four times as much as one that's off by 5. The line that gives you the lowest MSE is your best-fit line. That's what scikit-learn finds when you call `.fit()`.

## From Scratch

Before using scikit-learn, let's implement linear regression manually so you can see what's actually happening.

```python
import numpy as np
import matplotlib.pyplot as plt

# Simple dataset: study hours vs exam score
hours = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)
scores = np.array([52, 55, 60, 65, 68, 72, 75, 81, 85, 90], dtype=float)

# Calculate slope (w) and intercept (b) using the least squares formulas
mean_x = np.mean(hours)
mean_y = np.mean(scores)

# Slope: w = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x)^2)
numerator = np.sum((hours - mean_x) * (scores - mean_y))
denominator = np.sum((hours - mean_x) ** 2)
w = numerator / denominator

# Intercept: b = mean_y - w * mean_x
b = mean_y - w * mean_x

print(f"Slope (w): {w:.2f}")
print(f"Intercept (b): {b:.2f}")
print(f"Equation: score = {w:.2f} * hours + {b:.2f}")

# Make predictions
predictions = w * hours + b

# Plot
plt.figure(figsize=(8, 5))
plt.scatter(hours, scores, color='blue', label='Actual scores', zorder=5)
plt.plot(hours, predictions, color='red', linewidth=2, label=f'y = {w:.1f}x + {b:.1f}')
plt.xlabel('Hours Studied')
plt.ylabel('Exam Score')
plt.title('Linear Regression From Scratch')
plt.legend()
plt.grid(True, alpha=0.3)
plt.savefig('linear_regression_scratch.png', dpi=100)
plt.show()
```

Output:

```
Slope (w): 4.19
Intercept (b): 47.27
Equation: score = 4.19 * hours + 47.27
```

This tells you: for every extra hour of study, the score goes up by about 4.19 points. If someone studies 0 hours, the model predicts 47.27.
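As a quick sanity check on the manual computation above, NumPy's `np.polyfit` does the same ordinary least squares fit in one call when `deg=1`; its slope and intercept should match the hand-derived ones.

```python
import numpy as np

hours = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)
scores = np.array([52, 55, 60, 65, 68, 72, 75, 81, 85, 90], dtype=float)

# np.polyfit with deg=1 fits a straight line by least squares;
# it returns coefficients highest degree first, so (slope, intercept)
w, b = np.polyfit(hours, scores, deg=1)
print(f"Slope (w): {w:.2f}, Intercept (b): {b:.2f}")
```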
Now predict for a new student who studied 7.5 hours:

```python
new_hours = 7.5
predicted_score = w * new_hours + b
print(f"Predicted score for {new_hours} hours: {predicted_score:.1f}")
# Output: Predicted score for 7.5 hours: 78.7
```

## With Scikit-learn

You'll almost always use scikit-learn in practice. Let's do the same thing properly.

```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import numpy as np

# Same data, reshaped for sklearn (it needs 2D input for features)
hours = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]).reshape(-1, 1)
scores = np.array([52, 55, 60, 65, 68, 72, 75, 81, 85, 90])

# Split
X_train, X_test, y_train, y_test = train_test_split(
    hours, scores, test_size=0.2, random_state=42
)

# Train
model = LinearRegression()
model.fit(X_train, y_train)

print(f"Slope (w): {model.coef_[0]:.2f}")
print(f"Intercept (b): {model.intercept_:.2f}")

# Predict
y_pred = model.predict(X_test)

# Evaluate
mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"\nMSE: {mse:.2f}")
print(f"RMSE: {np.sqrt(mse):.2f}")
print(f"MAE: {mae:.2f}")
print(f"R2: {r2:.3f}")
```

## Regression Metrics

For classification you use accuracy. For regression, accuracy doesn't make sense. You use these instead.
**MAE (Mean Absolute Error):** `MAE = mean(|actual - predicted|)`. The average error, in the same units as the target.

**MSE (Mean Squared Error):** `MSE = mean((actual - predicted)^2)`. Penalizes big errors heavily, but the units are squared.

**RMSE (Root Mean Squared Error):** `RMSE = sqrt(MSE)`. Back in the original units, still punishing big errors.

**R2 Score (R-squared):** the fraction of the variation in the target that the model explains.

- R2 = 1.0: perfect predictions
- R2 = 0.8: the model explains 80% of the variation in the data
- R2 = 0.0: the model is no better than just predicting the mean every time
- R2 < 0: the model is worse than predicting the mean (something is very wrong)

```python
# Quick comparison of all metrics
print("Metric comparison:")
print(f"  MAE:  {mae:.2f}  <- average error in original units")
print(f"  RMSE: {np.sqrt(mse):.2f}  <- also in original units, penalizes big errors")
print(f"  R2:   {r2:.3f}  <- higher is better, 1.0 is perfect")
```

## A Real Dataset

Let's use a real dataset with actual features.

```python
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np
import pandas as pd

# Load data
housing = fetch_california_housing()
X = pd.DataFrame(housing.data, columns=housing.feature_names)
y = housing.target  # median house value in $100,000s

print(X.head())
print(f"\nTarget range: ${y.min()*100:.0f}k to ${y.max()*100:.0f}k")

# Split and scale
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train
model = LinearRegression()
model.fit(X_train_scaled, y_train)

# Evaluate
y_pred = model.predict(X_test_scaled)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)
print(f"RMSE: ${rmse*100:.0f}k")
print(f"R2: {r2:.3f}")
```

Output:

```
RMSE: $74k
R2: 0.576
```

The model's typical error is about $74k. An R2 of 0.576 means it explains about 58% of the variation in house prices. Not bad for a simple linear model with no tuning.
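To make "explains X% of the variation" concrete, R2 can be computed by hand on toy numbers: it compares the model's squared error to the squared error you'd get by always predicting the mean.

```python
import numpy as np

# Toy values, just to show the formula R2 = 1 - SS_res / SS_tot
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.0, 7.5, 9.0])

ss_res = np.sum((y_true - y_pred) ** 2)         # model's squared error: 0.5
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # error of predicting the mean: 20.0
r2 = 1 - ss_res / ss_tot
print(f"R2: {r2:.3f}")  # 0.975
```

A model whose squared error is a tiny fraction of the "just predict the mean" baseline gets an R2 near 1; a model no better than that baseline gets 0.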
## Interpreting the Coefficients

One great thing about linear regression: you can see exactly how much each feature matters.

```python
# Feature influence from coefficients (on scaled features)
coefficients = pd.Series(model.coef_, index=housing.feature_names)
coefficients_sorted = coefficients.sort_values()

print("Feature weights (bigger absolute value = more influence):")
print(coefficients_sorted)
```

Output:

```
Latitude     -0.900
Longitude    -0.870
AveOccup     -0.391
AveBedrms    -0.049
Population   -0.003
HouseAge      0.123
AveRooms      0.323
MedInc        0.827
```

MedInc (median income) has the biggest positive weight: higher-income neighborhoods have higher house prices. Makes sense. Latitude has a big negative weight: moving north in California (higher latitude) generally means lower prices. Also makes sense geographically. This interpretability is one of the biggest advantages of linear regression. You can explain every prediction.

## Common Mistakes

**Mistake 1: Not scaling features.** Linear regression works with raw numbers. If one feature ranges from 0 to 1,000,000 and another from 0 to 1, the big-range feature dominates the coefficients. Always scale when features have very different ranges.

```python
# Scale before fitting when feature ranges differ widely
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```

**Mistake 2: Using it for non-linear relationships.** Linear regression assumes the relationship is, well, linear. If the real pattern curves, a straight line won't fit it. Check a scatter plot first.

**Mistake 3: Reporting R2 without checking residuals.** R2 can look decent even when your model is completely wrong in systematic ways. Always plot actual vs predicted and residuals vs predicted.
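Mistake 2 is easy to demonstrate with synthetic data (the numbers below are illustrative): fit a straight line to a curved pattern and R2 collapses, then add an x-squared column and the very same LinearRegression captures the curve.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import r2_score

rng = np.random.default_rng(42)
x = np.linspace(-3, 3, 200).reshape(-1, 1)
y = x.ravel() ** 2 + rng.normal(0, 0.3, 200)  # curved (quadratic) pattern

# Straight line: barely better than predicting the mean
linear = LinearRegression().fit(x, y)
r2_line = r2_score(y, linear.predict(x))

# Add an x^2 column and the same "linear" model fits the curve
x_poly = PolynomialFeatures(degree=2).fit_transform(x)
curved = LinearRegression().fit(x_poly, y)
r2_curve = r2_score(y, curved.predict(x_poly))

print(f"Line R2:  {r2_line:.3f}")   # near 0
print(f"Curve R2: {r2_curve:.3f}")  # near 1
```

The model is still linear in its weights; only the features changed. That trick is worth remembering before reaching for anything fancier.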
Here's the residual check:

```python
import matplotlib.pyplot as plt

# Residual plot
residuals = y_test - y_pred
plt.figure(figsize=(8, 4))
plt.scatter(y_pred, residuals, alpha=0.3, color='blue')
plt.axhline(y=0, color='red', linewidth=1)
plt.xlabel('Predicted Values')
plt.ylabel('Residuals')
plt.title('Residual Plot - should look like random scatter around 0')
plt.savefig('residual_plot.png', dpi=100)
plt.show()
```

If the residuals show a pattern (a curve, a funnel shape), your linear model is missing something.

## Quick Reference

| Thing | Code |
| --- | --- |
| Train model | `model = LinearRegression(); model.fit(X_train, y_train)` |
| Get slope | `model.coef_` |
| Get intercept | `model.intercept_` |
| Predict | `model.predict(X_test)` |
| MAE | `mean_absolute_error(y_test, y_pred)` |
| RMSE | `np.sqrt(mean_squared_error(y_test, y_pred))` |
| R2 | `r2_score(y_test, y_pred)` |
| Scale features | `StandardScaler().fit_transform(X_train)` |

## Practice

- Level 1: Load `load_diabetes()` from sklearn. Train a linear regression model. Print the RMSE and R2. Which feature has the highest positive weight?
- Level 2:
- Level 3:

## Resources

- Scikit-learn: LinearRegression
- Scikit-learn: Regression metrics
- StatQuest: Linear Regression (YouTube)
- Khan Academy: Least Squares

Next up, Post 55: Multiple Regression: More Features, More Power. What changes when you go from 1 input to 10 inputs, how multicollinearity breaks your model, and how to pick the right features.