
Neural Networks: A Broad Overview

Shaurya Pethe

Before we start, this blog is for people who have at least heard the terms involved with neural networks. It gives an extremely high-level overview of a few concepts, helpful for building intuition.

Neural networks are essentially used to fit a curve to some set of data points. That's it. At least that's how much I know. Theoretically, by this definition, you can fit any sort of data using a neural network.

A standard neural network is just a series of nested functions. Data flows strictly in one direction, from input to output. Each layer takes the data, multiplies it by its weights (which determine the strength of a connection), adds a bias (which shifts the activation threshold), and passes the result through an activation function to the next layer. The output of the activation function from the first layer becomes the input for the second layer. This sequential processing allows the network to build hierarchical representations.

Every neural network architecture embodies a fundamental tension between bias and variance:

High bias (underfitting) occurs when your network lacks the capacity to capture the underlying patterns, the "meaning", of the data. A network that's too shallow or too constrained produces consistently wrong predictions across both training and test data. The model's assumptions are too rigid.

High variance (overfitting) emerges when your network is too complex. It memorizes the training data, including the noise, rather than learning generalizable patterns. Deep networks with millions of parameters can achieve near-zero training error while failing catastrophically on unseen data. The model is too flexible, capturing noise as signal.

The tradeoff is unavoidable: reducing bias typically increases variance and vice versa. Engineering a neural network is a constant balancing act between building a model expressive enough to learn the data, but constrained enough to generalize to the real world.

Neural networks learn through two types of parameters:

Weights (W) define the strength and sign of connections between neurons. In a fully connected layer, the weight matrix transforms input vectors into new representational spaces. Each weight determines how much influence one neuron has on another: positive weights amplify signals, negative weights inhibit them.

Biases (b) shift activation thresholds independently of input magnitude. Without biases, neurons can only learn transformations passing through the origin, a severe limitation. The bias term allows output = W·x + b, enabling the network to learn arbitrary decision boundaries.

Together, these parameters define the network's hypothesis space. A network with L layers, each containing n neurons, can have millions of parameters, each one adjusted during training to minimize the loss.

Activation functions are the crucial non-linear components that make deep learning possible. Without them, a neural network is just a series of linear matrix multiplications: no matter how many layers you stack, they collapse into a single linear transformation, rendering depth meaningless.

ReLU (Rectified Linear Unit) is the industry standard for hidden layers. Its math is simple:

f(x) = max(0, x)

Why ReLU? It is cheaper to compute than exponential functions like sigmoid or tanh, and it has a constant gradient of 1 for positive inputs, which prevents gradients from vanishing.
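To make the forward pass concrete, here is a minimal sketch in NumPy. The layer sizes, random weights, and function names (relu, forward) are my own illustrative choices, not from the post; each layer computes W·x + b and applies ReLU before handing the result to the next layer:

```python
import numpy as np

def relu(x):
    # f(x) = max(0, x): passes positives through, zeroes out negatives
    return np.maximum(0, x)

rng = np.random.default_rng(0)

# Illustrative sizes: 3 inputs -> 4 hidden neurons -> 1 output
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)

def forward(x):
    # Hidden layer: linear transform W·x + b, then the non-linearity
    h = relu(W1 @ x + b1)
    # The hidden activations become the input to the output layer
    return W2 @ h + b2

print(forward(np.array([1.0, -2.0, 0.5])))  # a single (untrained) prediction
```

Notice that if you delete the relu call, the two layers collapse into the single linear map (W2 @ W1)·x plus a bias, which is exactly the depth-collapse point made above.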
There are some edge cases, though. The "dying ReLU" problem occurs when neurons that receive inputs less than zero get stuck at zero due to the nature of the ReLU function: the gradient for negative inputs is zero, so those neurons stop updating. This causes many neurons to "die", effectively decreasing the accuracy of the neural network. It is overcome by adding a very small slope for all negative inputs (the Leaky ReLU variant), so that the gradient is non-zero and gradients keep flowing steadily.

Loss functions tell us how wrong our model is after a training pass. Like activation functions, there are many loss functions. For regression tasks, we often use the Residual Sum of Squares (RSS), also called the Sum of Squared Errors:

RSS = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2

where y_i is the actual value and \hat{y}_i is the predicted value. We square the residuals for two critical reasons:

Positivity: it turns all negative errors into positive numbers, preventing opposing errors from canceling each other out.

Penalization: squaring mathematically punishes large errors much more aggressively than small ones, forcing the model to prioritize its biggest mistakes.

However, for classification tasks (like predicting the next word in a language model), RSS falls short. Instead, we use the Cross-Entropy Loss function. It takes a predicted probability distribution and compares it against the true answer, calculating a smooth, differentiable error score that teaches the model how to adjust its confidence for the next pass. More about Cross-Entropy Loss in another blog.

I studied all of this from the YouTube channel "StatQuest" (an amazing resource). The narrator sang a silly song about gradient descent that I still remember: "Gradient Descent is decent... at optimising parameters!"

And that's it. After calculating the loss (using the loss function), we update the parameters (weights and biases) and repeat the calculation until the loss stops decreasing, meaning the model is trained as well as it can be. Here's a useful visualisation for understanding gradient descent better: imagine the model is blindfolded on a high-dimensional mountain, trying to find the lowest point in the valley (the minimum error). Because the model is blindfolded, it uses calculus to feel the slope of the ground directly under its feet. Gradient descent is the mathematical process of taking a step downhill, and the size of that step is dictated by your learning rate. In simple terms, gradient descent uses the chain rule to calculate the derivative of the loss function with respect to each parameter, then applies the update:

θ_new = θ_old - η∇L(θ)

where η is the learning rate, ∇L(θ) is the gradient of the loss with respect to the parameters, and θ is the parameter (either a weight or a bias) being updated.

Backpropagation is the algorithm that makes training deep networks tractable. It computes gradients of the loss with respect to every parameter through recursive application of the chain rule: starting at the final output (the loss), it propagates backward through the network, calculating the exact gradient (slope) for every single weight, and gradient descent then uses those gradients to update the parameters.
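Here is a toy sketch of that loop (my own example, not from the post or StatQuest): gradient descent fitting a one-parameter linear model y ≈ w·x under the RSS loss. The gradient dL/dw = Σ 2(ŷ_i - y_i)·x_i comes straight from the chain rule, which is what backpropagation automates for every weight in a deep network:

```python
import numpy as np

# Toy data generated from y = 3x, so the best weight is w = 3
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 3.0 * x

w = 0.0      # initial parameter value
eta = 0.01   # learning rate: the size of each downhill step

for step in range(100):
    y_hat = w * x                        # forward pass: predictions
    loss = np.sum((y - y_hat) ** 2)      # RSS = sum of squared residuals
    grad = np.sum(2 * (y_hat - y) * x)   # dL/dw via the chain rule
    w = w - eta * grad                   # θ_new = θ_old - η∇L(θ)

print(w, loss)  # w approaches 3.0 as the loss approaches zero
```

With a real network, the only difference is that grad becomes a vector of gradients computed by backpropagation for every weight and bias, instead of a single hand-derived derivative.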
Attention has revolutionized neural networks, enabling models to focus on relevant information rather than processing all inputs equally. The mechanism computes weighted combinations of input representations based on learned relevance scores. The core attention operation is:

Attention(Q, K, V) = softmax(QK^T / √d_k) · V

where queries (Q), keys (K), and values (V) are learned projections of the inputs, and d_k is the dimensionality of the keys, used to scale the dot products. This allows the model to learn which parts of the input are most relevant for each prediction. Attention mechanisms underpin Transformers, which have become the dominant architecture for NLP (GPT, BERT, Claude, etc.). Self-attention enables modeling long-range dependencies without the recurrence bottlenecks of RNNs. A detailed exploration of attention mechanisms, multi-head attention, and Transformer architectures will be covered in an upcoming blog post.

As data passes through dozens of layers, the distribution of the activations can shift wildly, a phenomenon known as internal covariate shift, which can cause gradients to explode or vanish. Batch Normalization normalizes the activations across the batch dimension; it is highly effective for CNNs and standard feed-forward networks. Layer Normalization normalizes the activations across the feature dimension for a single data point; it is the stabilization technique of choice for modern Transformer models, ensuring that varying sequence lengths and batch sizes don't disrupt training.

That's it for this blog. Hope you liked it! Be sure to suggest any changes if needed :)