
How CNNs Work — From Convolution Kernels to ResNet

shangkyu shin · DEV Community

CNNs changed computer vision because they stopped treating images like flat lists of numbers. An image has structure: pixels near each other usually matter together. That is exactly what CNNs are built to capture.

A Convolutional Neural Network is designed for spatial data. Instead of looking at every pixel independently, it scans small regions with filters. Those filters learn useful patterns: edges, corners, textures, shapes. As layers get deeper, simple visual patterns become higher-level features.

A basic CNN flow looks like this:

Image → Convolution → Activation → Pooling → Deeper Features → Classifier

The important part is convolution. A convolution kernel moves across the image and extracts local features. In simple terms:

Kernel + Local Image Region → Feature Value

So the CNN does not memorize the whole image at once. It learns reusable visual detectors.

At a high level, a CNN layer works like this:

- take an input image
- slide a kernel over small regions
- compute feature values
- apply activation
- optionally reduce spatial size with pooling
- pass feature maps to the next layer

Both the kernel mechanics and this layer recipe are sketched in code right after the model timeline below.

This is why CNNs are practical for image tasks. The same kernel is reused across the image. That means far fewer parameters than a fully connected layer over all pixels. It also means the model can detect the same pattern in different locations.

Imagine a model looking at a cat image. Early convolution layers may detect:

- edges
- curves
- color contrasts

Middle layers may combine them into:

- eyes
- ears
- whisker-like patterns

Deeper layers may combine those into:

- face-like structures
- object-level features

That hierarchy is the core intuition. CNNs do not jump directly from pixels to "cat." They build features step by step.

A standard fully connected neural network treats an image mostly as a flat vector. That loses the spatial structure. CNNs preserve local relationships.

Standard neural network:

- flattens the image early
- connects many pixels directly
- uses many parameters
- does not naturally preserve spatial locality

CNN:

- keeps spatial structure
- uses local filters
- shares weights across locations
- learns hierarchical visual features

That is why CNNs became the default architecture for image recognition. They match the structure of the data.

The kernel is the core mechanism. A kernel is a small matrix that scans over the image. It responds strongly when it finds a pattern it has learned. For example:

- one kernel may detect vertical edges
- another may detect horizontal edges
- another may respond to texture

In deep learning, these kernels are learned from data. You do not manually define all of them. The model discovers useful filters during training.

Early CNNs showed that convolution worked, and deeper CNNs learned richer features. That created a new problem: how do you train very deep networks without making optimization unstable? This is where the model timeline becomes useful.

- LeNet showed the basic CNN idea.
- AlexNet showed that CNNs could dominate large-scale image recognition.
- VGGNet showed the power of simple depth.
- GoogLeNet improved efficiency with Inception modules.
- ResNet made very deep networks trainable with residual connections.

A simple timeline looks like this:

LeNet → AlexNet → VGGNet → GoogLeNet → ResNet

Each model solved a different pressure point.
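Before walking through the individual models, it helps to see the core mechanism in code. Here is a minimal NumPy sketch of the sliding-window operation described above (strictly speaking this is cross-correlation, which is what deep learning frameworks compute under the name "convolution"). The function and the hand-written edge kernel are illustrative, not from the original post:

```python
import numpy as np

def conv2d(image, kernel):
    """Slide a kernel over an image: one feature value per local region."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            region = image[i:i + kh, j:j + kw]            # local image region
            feature_map[i, j] = np.sum(region * kernel)   # kernel + region -> feature value
    return feature_map

# A hand-written vertical-edge kernel. In a real CNN these nine numbers
# are learned from data, not written by hand.
vertical_edge = np.array([[-1.0, 0.0, 1.0],
                          [-1.0, 0.0, 1.0],
                          [-1.0, 0.0, 1.0]])

image = np.zeros((6, 6))
image[:, 3:] = 1.0   # dark left half, bright right half: one vertical edge

print(conv2d(image, vertical_edge))
# Every row reads [0, 3, 3, 0]: the kernel fires only where the edge actually is.
```

Note that the same nine weights are reused at every position. That is the weight sharing mentioned above, and it is why the same detector finds the same pattern anywhere in the image.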
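The layer recipe listed earlier (convolve, activate, pool) maps almost one-to-one onto framework code. Here is a minimal PyTorch sketch, assuming a single-channel 28×28 input; the channel counts and sizes are arbitrary choices for illustration:

```python
import torch
import torch.nn as nn

# One CNN stage, matching the flow above: convolution -> activation -> pooling.
stage = nn.Sequential(
    nn.Conv2d(in_channels=1, out_channels=8, kernel_size=3, padding=1),  # 8 learned kernels
    nn.ReLU(),                                                           # activation
    nn.MaxPool2d(kernel_size=2),                                         # halve spatial size
)

x = torch.randn(1, 1, 28, 28)     # batch of one 28x28 grayscale image
feature_maps = stage(x)
print(feature_maps.shape)         # torch.Size([1, 8, 14, 14])

# Weight sharing keeps this cheap: 8 * (1*3*3) weights + 8 biases = 80 parameters.
# A fully connected layer mapping the same 784 inputs to the same 1568 outputs
# would need 784 * 1568 ≈ 1.2 million weights.
print(sum(p.numel() for p in stage.parameters()))  # 80
```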
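Finally, since residual connections are what close the timeline above, here is roughly what a ResNet-style skip connection looks like. This is a simplified sketch (the blocks in the actual ResNet paper also use batch normalization and handle downsampling), not a faithful reimplementation:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Simplified residual block: output = relu(F(x) + x).

    The identity shortcut lets gradients flow straight through,
    which is what makes very deep stacks trainable.
    """
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        shortcut = x                       # skip connection
        out = self.relu(self.conv1(x))
        out = self.conv2(out)
        return self.relu(out + shortcut)   # add the input back, then activate

x = torch.randn(1, 16, 32, 32)
print(ResidualBlock(16)(x).shape)          # torch.Size([1, 16, 32, 32])
```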
LeNet:

- early CNN structure
- useful for digit recognition
- showed convolution could work

AlexNet:

- large-scale breakthrough
- helped trigger the deep learning boom
- proved CNNs could scale with data and GPUs

VGGNet:

- simple repeated convolution blocks
- showed depth could improve representation
- easy to understand structurally

GoogLeNet:

- focused on efficiency and multi-scale features
- used Inception-style modules
- reduced unnecessary computation

ResNet:

- solved the degradation problem in very deep networks
- used skip connections (sketched in code above)
- made extremely deep CNNs practical

CNN progress was not only about architecture. It also needed large-scale benchmarks. ImageNet and ILSVRC gave researchers a clear way to compare models. That mattered because architecture improvements became measurable: better models were not just theoretically interesting, they produced visible gains on the same benchmark. This is why AlexNet became such a turning point. It showed that deep CNNs could outperform older computer vision pipelines at scale.

If CNNs feel like a list of model names, learn them in this order:

1. Convolutional Neural Network
2. Convolution Kernel
3. Deep Convolutional Network
4. LeNet
5. AlexNet
6. VGGNet
7. GoogLeNet
8. ResNet
9. ImageNet / ILSVRC

This order works because you first understand the mechanism, then the architecture, then the historical model flow.

CNNs work because they match the structure of images. Images are spatial, local patterns matter, and the same pattern can appear in many locations. CNNs use convolution kernels to capture that structure efficiently. The shortest version is:

Local filters + shared weights + deep feature hierarchy = CNN power

If you remember one idea, remember this: CNNs turn pixels into features by repeatedly detecting local patterns and composing them into higher-level visual concepts.

When learning CNNs, do you find it easier to start from the convolution kernel itself, or from the model timeline like LeNet → AlexNet → ResNet?

Originally published at zeromathai.com: https://zeromathai.com/en/cnn-complete-hub-en/

GitHub resources: https://github.com/zeromathai/zeromathai-ai