From LeNet to Vision Transformers: decades of progress
Computer vision has transformed from a field struggling with hand-coded features to one powered by deep learning that matches and exceeds human performance on many tasks. This evolution spans decades of research, from the first convolutional networks to today's vision transformers that power everything from medical imaging to autonomous vehicles.
This article traces the key milestones in this evolution: the foundational CNN architectures, the ImageNet breakthrough that validated deep learning for vision, the ResNet revolution that enabled training of very deep networks, the YOLO family that enabled real-time detection, and the Vision Transformer that brought transformer architectures to bear on visual tasks.
Convolutional Neural Networks (CNNs) are the foundation of modern computer vision. They exploit the spatial structure of images through localized connections and parameter sharing.
A convolution slides a kernel (filter) across the input, computing dot products at each position:
Input (H×W×C) * Kernel (K×K×C×F) → Output (H×W×F)  (stride 1, zero-padded so H and W are preserved)
Where:
H, W = height, width
C = channels (3 for RGB)
K = kernel size (typically 3 or 5)
F = number of filters (output channels)
Each filter learns to detect a specific feature: edges, textures, shapes, or more abstract patterns.
The hierarchical composition of these layers enables increasingly abstract representations: early layers detect edges, intermediate layers combine edges into textures and shapes, deeper layers recognize objects and concepts.
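To make the shape arithmetic concrete, here is a minimal PyTorch sketch (the specific values—a 3×3 kernel, 64 filters, a 224×224 RGB input—are illustrative choices, not from any particular architecture):

```python
import torch
import torch.nn as nn

# Illustrative values: C=3 (RGB), K=3, F=64; padding=1 keeps H and W unchanged at stride 1.
x = torch.randn(1, 3, 224, 224)                   # one RGB image, H = W = 224
conv = nn.Conv2d(in_channels=3, out_channels=64,  # C input channels -> F filters
                 kernel_size=3, padding=1)        # K = 3, "same" padding
y = conv(x)
print(y.shape)                                    # torch.Size([1, 64, 224, 224])
```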
Before 2012, computer vision relied on hand-crafted features like SIFT and HOG. The 2012 AlexNet result at the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) changed everything.
ImageNet's 1000-class classification task became the benchmark for visual recognition. The error rate (top-5, meaning correct answer in top 5 predictions) tells the story:
| Year | Method | Top-5 Error |
|---|---|---|
| 2011 | Best traditional methods | 25.7% |
| 2012 | AlexNet (Krizhevsky et al.) | 16.4% |
| 2014 | VGGNet, GoogLeNet | 7.3% |
| 2015 | ResNet (He et al.) | 3.6% |
| 2017 | SENet | 2.3% |
| 2021+ | Vision Transformers, CoAtNet | ~1.0% |
Human top-5 error on ImageNet is approximately 5.1%, meaning deep learning surpassed human-level performance by 2015.
ResNet (He et al., 2016) solved the degradation problem that plagued very deep networks. As networks got deeper, they started performing worse than shallower ones—not due to overfitting, but because deeper networks were harder to optimize.
Standard layer: y = F(x)
ResNet block: y = F(x) + x
Where F(x) learns the residual (what needs to change),
and the skip connection preserves the input.
The skip connection creates an information highway that lets gradients flow directly through the network, enabling training of networks with 100+ layers (compared to ~20 before ResNet).
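As a sketch (simplified relative to the torchvision implementation: stride 1, matching input and output channels), a basic residual block in PyTorch looks like this:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Illustrative basic residual block computing y = F(x) + x."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        residual = x                           # the skip connection preserves the input
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))        # F(x): the learned residual
        return self.relu(out + residual)       # y = F(x) + x
```

Because the gradient of `out + residual` with respect to `residual` is the identity, the error signal can bypass F(x) entirely, which is what makes 100+ layer networks trainable.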
While image classification assigns one label per image, object detection localizes and classifies multiple objects. The YOLO (You Only Look Once) family revolutionized real-time detection.
Unlike earlier two-stage detectors (R-CNN family) that first propose regions then classify, YOLO does detection in a single forward pass:
Input image → Grid (e.g., 13×13)
↓
Each grid cell predicts:
- B bounding boxes (x, y, w, h, confidence)
- C class probabilities
↓
Post-processing: NMS (non-maximum suppression)
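The post-processing details differ across YOLO versions, but the core of NMS is simple: keep the highest-scoring box, drop boxes that overlap it too heavily, and repeat. Below is a minimal class-agnostic greedy sketch in PyTorch (boxes in [x1, y1, x2, y2] format); production code would typically call torchvision.ops.nms instead.

```python
import torch

def nms(boxes: torch.Tensor, scores: torch.Tensor, iou_thresh: float = 0.5):
    """Greedy non-maximum suppression; returns indices of kept boxes."""
    order = scores.argsort(descending=True)
    keep = []
    while order.numel() > 0:
        i = order[0]
        keep.append(i.item())
        if order.numel() == 1:
            break
        # IoU between the top-scoring box and the remaining candidates
        rest = boxes[order[1:]]
        xy1 = torch.maximum(boxes[i, :2], rest[:, :2])
        xy2 = torch.minimum(boxes[i, 2:], rest[:, 2:])
        inter = (xy2 - xy1).clamp(min=0).prod(dim=1)
        area_i = (boxes[i, 2:] - boxes[i, :2]).prod()
        area_r = (rest[:, 2:] - rest[:, :2]).prod(dim=1)
        iou = inter / (area_i + area_r - inter)
        order = order[1:][iou <= iou_thresh]   # discard boxes that overlap too much
    return keep
```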
| Version | COCO mAP | FPS (V100) | Key Innovation |
|---|---|---|---|
| YOLOv3 | 55.3% | 35 | Multi-scale detection |
| YOLOv5 | 68.0% | 155 | Improved training pipeline, mosaic augmentation |
| YOLOX | 68.3% | 68 | Anchor-free, OTA assignment |
| YOLOv8 | 72.7% | 80 | Anchor-free, decoupled head, improved backbone |
| YOLO11 | 74.4% | 100+ | Anchor-free, optimized architecture |
YOLOv8 and YOLO11 represent the current state of the art for real-time detection, with COCO mAP scores competitive with much slower two-stage detectors.
In 2020, the Vision Transformer (ViT) applied the transformer architecture—proven in NLP—to computer vision. The key change: images as sequences of patches.
Image (224×224×3) → Patch embedding (16×16 patches)
→ 196 patches × 768 dimensions
→ Add positional embeddings
→ Transformer encoder layers
→ [CLS] token → Classification head
ViT divides the image into 16×16 patches, linearly embeds each patch, adds positional information, and processes the resulting sequence with a standard transformer encoder.
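A minimal sketch of the patch-embedding step for ViT-B/16-like sizes (a 16-stride convolution is equivalent to cutting the image into non-overlapping 16×16 patches and linearly projecting each one; details such as positional-embedding initialization and dropout are omitted):

```python
import torch
import torch.nn as nn

# ViT-B/16-style sizes assumed: 16x16 patches, 768-dim embeddings.
patch_embed = nn.Conv2d(3, 768, kernel_size=16, stride=16)
cls_token = nn.Parameter(torch.zeros(1, 1, 768))
pos_embed = nn.Parameter(torch.zeros(1, 197, 768))      # 196 patches + [CLS]

x = torch.randn(1, 3, 224, 224)
patches = patch_embed(x)                                 # (1, 768, 14, 14)
tokens = patches.flatten(2).transpose(1, 2)              # (1, 196, 768)
tokens = torch.cat([cls_token, tokens], dim=1) + pos_embed  # (1, 197, 768)
# `tokens` is then fed through standard transformer encoder layers;
# the [CLS] output feeds the classification head.
```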
| Model | Parameters | ImageNet Top-1 | Pre-training Data |
|---|---|---|---|
| ResNet-50 | 25M | 76.1% | ImageNet (1.2M) |
| ViT-B/16 | 86M | 77.9% | ImageNet (1.2M) |
| ViT-L/16 | 307M | 76.5% | ImageNet (1.2M) |
| ViT-L/16 | 307M | 87.1% | JFT-300M |
| Swin-T | 29M | 81.3% | ImageNet |
| EfficientNetV2-L | 118M | 85.7% | ImageNet |
ViT requires more data than CNNs to train effectively (hence training on JFT-300M). Hybrid approaches and better training recipes have closed this gap.
Swin Transformer introduced hierarchical representations with shifted windows, bringing CNN-like efficiency to transformers.
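The key mechanism is that attention is computed only within local windows, so cost grows linearly with image size rather than quadratically. A sketch of the window-partition step in PyTorch (the shifting and attention-masking logic is omitted):

```python
import torch

def window_partition(x: torch.Tensor, window_size: int) -> torch.Tensor:
    """Split a (B, H, W, C) feature map into non-overlapping windows;
    self-attention is then computed independently inside each window."""
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    windows = x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size, window_size, C)
    return windows                      # (B * num_windows, window_size, window_size, C)
```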
EfficientNetV2 uses compound scaling (depth, width, resolution) optimized through neural architecture search. It achieves better accuracy than predecessors with significantly improved training speed.
ConvNeXt modernized ResNet with techniques borrowed from transformers: larger kernels (7×7), layer normalization, fewer activations, and inverted bottleneck. The result: a pure CNN that matches Vision Transformers on their own terms.
In medical imaging, these advances have already reached the clinic: FDA-approved AI systems like IDx-DR (diabetic retinopathy screening) and Lunit INSIGHT (chest X-ray analysis) demonstrate clinical viability.
Labeling images is expensive. Self-supervised learning sidesteps this by training on unlabeled data through pretext tasks such as contrastive instance discrimination and masked-patch prediction.
Models pretrained with self-supervision (like DINOv2) can be fine-tuned with far fewer labeled examples for specific tasks.
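As a sketch of that workflow, the snippet below freezes a self-supervised backbone and trains only a small linear head on a few labeled examples. The torch.hub entry point and the 384-dimensional embedding size are assumptions based on the public DINOv2 release; check the facebookresearch/dinov2 repository for the exact model names.

```python
import torch
import torch.nn as nn

# Load a self-supervised backbone (hub name assumed; see the DINOv2 repo).
backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
backbone.eval()
for p in backbone.parameters():
    p.requires_grad = False                    # freeze pretrained features

head = nn.Linear(384, 10)                      # 384-dim embeddings assumed; 10 target classes
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)

x = torch.randn(8, 3, 224, 224)                # small labeled batch (illustrative)
labels = torch.randint(0, 10, (8,))

optimizer.zero_grad()
with torch.no_grad():
    features = backbone(x)                     # (8, 384) global embeddings (assumed output shape)
loss = nn.functional.cross_entropy(head(features), labels)
loss.backward()
optimizer.step()
```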
Computer vision has undergone a remarkable transformation from hand-crafted features to learned representations that exceed human performance on standardized benchmarks. The progression from LeNet to AlexNet to ResNet to Vision Transformers reflects both algorithmic innovation and increased computational power.
Today's practitioners have access to pretrained models that can be fine-tuned for specific applications with modest labeled datasets. The remaining challenges lie in robustness (handling distribution shift), efficiency (deployment on edge devices), and integration with other modalities for richer understanding.