Computer Vision Evolution

From LeNet to Vision Transformers: decades of progress

Published: January 2026 | Reading Time: 15 minutes | Category: AI & Machine Learning

[Figure: Computer vision and pattern recognition visualization]

Computer vision has transformed from a field struggling with hand-coded features to one powered by deep learning that matches or exceeds human performance on many tasks. This evolution spans decades of research, from the first convolutional networks to today's vision transformers that power everything from medical imaging to autonomous vehicles.

This article traces the key milestones in this evolution: the foundational CNN architectures, the ImageNet breakthrough that validated deep learning for vision, the ResNet revolution that enabled training of very deep networks, the YOLO family that enabled real-time detection, and the Vision Transformer that brought transformer architectures to bear on visual tasks.

CNN Fundamentals

Convolutional Neural Networks (CNNs) are the foundation of modern computer vision. They exploit the spatial structure of images through localized connections and parameter sharing.

Convolution Operation

A convolution slides a kernel (filter) across the input, computing dot products at each position:

Input (H×W×C) * Kernel (K×K×C×F) → Output (H'×W'×F)

Where:
  H, W = input height, width
  C = input channels (3 for RGB)
  K = kernel size (typically 3 or 5)
  F = number of filters (output channels)
  H', W' = output height, width; equal to H, W with 'same' padding,
           and (H - K + 2P)/S + 1 in general for padding P and stride S
    

Each filter learns to detect a specific feature: edges, textures, shapes, or more abstract patterns.
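
To make the sliding-window computation concrete, here is a naive single-channel convolution in NumPy (no padding, stride 1; deep learning frameworks compute the same operation, technically cross-correlation, far more efficiently and over many channels and filters):

import numpy as np

def conv2d(image, kernel):
    # Naive 2D convolution: slide the kernel over the image and take dot products.
    # Output size is (H - K + 1) x (W - K + 1) with no padding and stride 1.
    H, W = image.shape
    K = kernel.shape[0]
    out = np.zeros((H - K + 1, W - K + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + K, j:j + K] * kernel)
    return out

# A hand-crafted vertical-edge filter; a CNN learns kernels like this from data.
edge_kernel = np.array([[-1.0, 0.0, 1.0],
                        [-2.0, 0.0, 2.0],
                        [-1.0, 0.0, 1.0]])
response = conv2d(np.random.rand(8, 8), edge_kernel)
print(response.shape)  # (6, 6)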

Key CNN Components

A typical CNN is built from a small set of layer types: convolutional layers that extract local features, non-linear activations such as ReLU, pooling layers that downsample spatial resolution, normalization layers that stabilize training, and fully connected layers that map the final features to predictions. The hierarchical composition of these layers enables increasingly abstract representations: early layers detect edges, intermediate layers combine edges into textures and shapes, and deeper layers recognize objects and concepts.
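
A minimal sketch of this layer stack in PyTorch (the class name, layer sizes, and the 32×32 input assumption are illustrative, not taken from any particular paper):

import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    # Convolution -> activation -> pooling, repeated, then a fully connected head.
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),   # early layers: edges
            nn.ReLU(),
            nn.MaxPool2d(2),                               # downsample 2x
            nn.Conv2d(16, 32, kernel_size=3, padding=1),   # deeper layers: textures, parts
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 8 * 8, num_classes)  # assumes 32x32 input images

    def forward(self, x):
        x = self.features(x)
        return self.classifier(x.flatten(1))

logits = TinyCNN()(torch.randn(1, 3, 32, 32))
print(logits.shape)  # torch.Size([1, 10])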

The ImageNet Moment

Before 2012, computer vision relied on hand-crafted features such as SIFT and HOG. The 2012 AlexNet result at the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) changed everything.

ImageNet Competition

ImageNet's 1000-class classification task became the benchmark for visual recognition. The top-5 error rate (the fraction of images for which the correct label is not among the model's five highest-scoring predictions) tells the story:

Year  | Method                       | Top-5 Error
2011  | Best traditional methods     | 25.7%
2012  | AlexNet (Krizhevsky et al.)  | 16.4%
2014  | VGGNet, GoogLeNet            | 7.3%
2015  | ResNet (He et al.)           | 3.6%
2017  | SENet                        | 2.3%
2021+ | Vision Transformers, CoAtNet | ~1.0%

Human top-5 error on ImageNet is approximately 5.1%, which means deep learning surpassed human-level accuracy on this benchmark by 2015.

ResNet: The Residual Revolution

ResNet (He et al., 2016) solved the degradation problem that plagued very deep networks. As networks got deeper, they started performing worse than shallower ones—not due to overfitting, but because deeper networks were harder to optimize.

The Skip Connection Solution

Standard layer:    y = F(x)
ResNet block:      y = F(x) + x

Where F(x) learns the residual (what needs to change),
and the skip connection preserves the input.
    

The skip connection creates an information highway that lets gradients flow directly through the network, enabling training of networks with 100+ layers (compared to ~20 before ResNet).
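
The idea translates directly into code. Here is a basic residual block in PyTorch, a sketch of the two-convolution block used in smaller ResNets (the class name and the usage example are illustrative):

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    # Basic ResNet block: two 3x3 convolutions plus an identity skip connection.
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.bn2(self.conv2(self.relu(self.bn1(self.conv1(x)))))
        return self.relu(out + x)  # y = F(x) + x: gradients flow through the identity path

y = ResidualBlock(64)(torch.randn(1, 64, 56, 56))
print(y.shape)  # torch.Size([1, 64, 56, 56])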

ResNet Variants

The design scales from ResNet-18 and ResNet-34, which stack basic two-convolution blocks, up to ResNet-50, ResNet-101, and ResNet-152, which use bottleneck blocks (1×1, 3×3, 1×1 convolutions) to keep computation manageable at greater depth.

Object Detection: The YOLO Family

While image classification assigns one label per image, object detection localizes and classifies multiple objects. The YOLO (You Only Look Once) family revolutionized real-time detection.

YOLO Approach

Unlike earlier two-stage detectors (R-CNN family) that first propose regions then classify, YOLO does detection in a single forward pass:

Input image → Grid (e.g., 13×13)
             ↓
Each grid cell predicts:
  - B bounding boxes (x, y, w, h, confidence)
  - C class probabilities
             ↓
Post-processing: NMS (non-maximum suppression)
    

YOLO Evolution

Version | COCO mAP@0.5 | FPS (V100) | Key Innovation
YOLOv3  | 55.3%        | 35         | Multi-scale detection
YOLOv5  | 68.0%        | 155        | Streamlined PyTorch implementation, improved training
YOLOX   | 68.3%        | 68         | Anchor-free, OTA assignment
YOLOv8  | 72.7%        | 80         | Decoupled heads, improved backbone
YOLO11  | 74.4%        | 100+       | Anchor-free, optimized architecture

YOLOv8 and YOLO11 represent the current state of the art for real-time detection, with COCO mAP scores competitive with much slower two-stage detectors.
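
The post-processing step mentioned in the pipeline above, non-maximum suppression, is simple enough to sketch directly. A minimal greedy implementation in NumPy (the function name and threshold are illustrative; production detectors use batched, vectorized, or GPU versions):

import numpy as np

def nms(boxes, scores, iou_threshold=0.5):
    # Greedy non-maximum suppression over [x1, y1, x2, y2] boxes.
    x1, y1, x2, y2 = boxes.T
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]   # process boxes from highest to lowest confidence
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        # Intersection-over-union between the kept box and all remaining boxes
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        order = order[1:][iou <= iou_threshold]  # drop heavily overlapping boxes
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [50, 50, 60, 60]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))  # [0, 2]: the near-duplicate of box 0 is suppressed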

Vision Transformers (ViT)

In 2020, the Vision Transformer (ViT) applied the transformer architecture, already proven in NLP, to computer vision. The key change: treating an image as a sequence of patches.

ViT Architecture

Image (224×224×3) → Patch embedding (16×16 patches)
                   → 196 patches × 768 dimensions
                   → Add positional embeddings
                   → Transformer encoder layers
                   → [CLS] token → Classification head
    

ViT divides the image into 16×16 patches, linearly embeds each patch, adds positional information, and processes the resulting sequence with a standard transformer encoder.
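
The patch-embedding step is just a strided convolution followed by a reshape. A minimal sketch in PyTorch using ViT-Base dimensions (16×16 patches, 768-dimensional tokens); the [CLS] token and positional embeddings are initialized to zeros here purely for illustration:

import torch
import torch.nn as nn

# Split a 224x224 image into 16x16 patches and project each to 768 dimensions.
patchify = nn.Conv2d(3, 768, kernel_size=16, stride=16)

img = torch.randn(1, 3, 224, 224)
patches = patchify(img)                      # (1, 768, 14, 14)
tokens = patches.flatten(2).transpose(1, 2)  # (1, 196, 768): one token per patch

# Prepend a [CLS] token and add positional embeddings (learnable in a real model).
cls_token = torch.zeros(1, 1, 768)
pos_embed = torch.zeros(1, 197, 768)
sequence = torch.cat([cls_token, tokens], dim=1) + pos_embed
print(sequence.shape)                        # torch.Size([1, 197, 768])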

ViT vs CNN Performance

Model            | Parameters | ImageNet Top-1 | Training Data
ResNet-50        | 25M        | 76.1%          | ImageNet (1.2M)
ViT-B/16         | 86M        | 77.9%          | ImageNet (1.2M)
ViT-L/16         | 307M       | 76.5%          | ImageNet (1.2M)
ViT-L/16         | 307M       | 87.1%          | JFT-300M
Swin-T           | 29M        | 81.3%          | ImageNet
EfficientNetV2-L | 118M       | 85.7%          | ImageNet

ViT requires more data than CNNs to train effectively (hence training on JFT-300M). Hybrid approaches and better training recipes have closed this gap.

Modern Architectures

Swin Transformer

Swin Transformer introduced hierarchical feature maps computed with shifted-window attention: self-attention is restricted to local windows that shift between successive layers, so information still mixes across windows while the cost stays linear in image size. This brings CNN-like efficiency and multi-scale structure to transformers, making Swin a strong backbone for detection and segmentation as well as classification.

EfficientNetV2

EfficientNetV2 combines compound scaling (jointly increasing depth, width, and resolution) with training-aware neural architecture search. It achieves better accuracy than its predecessors with significantly faster training.
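
The compound-scaling rule originates with the original EfficientNet and carries over conceptually: a single coefficient phi grows depth, width, and resolution together. A rough sketch (alpha, beta, gamma are the base coefficients reported in the original EfficientNet paper; the value of phi is chosen here just for illustration):

# Compound scaling: one coefficient phi scales depth, width, and resolution together,
# with alpha * beta**2 * gamma**2 ≈ 2 so FLOPs roughly double per unit of phi.
alpha, beta, gamma = 1.2, 1.1, 1.15   # grid-searched base coefficients (EfficientNet)
phi = 3                               # compound coefficient, illustrative value

depth_multiplier = alpha ** phi       # ~1.73x more layers
width_multiplier = beta ** phi        # ~1.33x more channels
resolution_multiplier = gamma ** phi  # ~1.52x larger input resolution
print(depth_multiplier, width_multiplier, resolution_multiplier)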

ConvNeXt

ConvNeXt modernized ResNet with techniques borrowed from transformers: larger kernels (7×7), layer normalization, fewer activations, and inverted bottleneck. The result: a pure CNN that matches Vision Transformers on their own terms.
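
Those ingredients fit in a short block definition. A sketch of a ConvNeXt-style block in PyTorch (layer scale and stochastic depth from the full implementation are omitted; the class name and dimensions are illustrative):

import torch
import torch.nn as nn

class ConvNeXtBlock(nn.Module):
    # 7x7 depthwise convolution, LayerNorm, inverted-bottleneck MLP, one activation.
    def __init__(self, dim):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)
        self.norm = nn.LayerNorm(dim)
        self.pwconv1 = nn.Linear(dim, 4 * dim)   # expand (inverted bottleneck)
        self.act = nn.GELU()
        self.pwconv2 = nn.Linear(4 * dim, dim)   # project back down

    def forward(self, x):                        # x: (N, C, H, W)
        shortcut = x
        x = self.dwconv(x)
        x = x.permute(0, 2, 3, 1)                # channels-last for LayerNorm/Linear
        x = self.pwconv2(self.act(self.pwconv1(self.norm(x))))
        x = x.permute(0, 3, 1, 2)
        return shortcut + x                      # residual connection

y = ConvNeXtBlock(96)(torch.randn(1, 96, 56, 56))
print(y.shape)  # torch.Size([1, 96, 56, 56])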

Industry Applications

Medical Imaging

FDA-approved AI systems like IDx-DR (diabetic retinopathy) and Lunit INSIGHT (chest X-ray) demonstrate clinical viability.

Autonomous Vehicles

Perception stacks for driver assistance and self-driving combine real-time object detection, semantic segmentation, and depth estimation to track vehicles, pedestrians, lanes, and traffic signals from camera (and often lidar and radar) input.

Industrial Inspection

Vision models automate defect detection and quality control on production lines, where high-throughput, consistent inspection is difficult to sustain manually.

Self-Supervised Learning

Labeling images is expensive. Self-supervised learning trains on unlabeled data by creating pretext tasks, such as masked image modeling (MAE), contrastive learning (SimCLR, MoCo), and self-distillation (DINO, DINOv2).

Models pretrained with self-supervision (like DINOv2) can be fine-tuned with far fewer labeled examples for specific tasks.
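
The downstream recipe is the same whether the backbone was pretrained with labels or with self-supervision: freeze most of the network and train a small task-specific head. A minimal sketch using torchvision's ImageNet-pretrained ResNet-50 as a stand-in backbone (the 5-class head and learning rate are illustrative):

import torch
import torch.nn as nn
from torchvision import models

# Load ImageNet-pretrained weights and freeze the feature extractor.
backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
for p in backbone.parameters():
    p.requires_grad = False

# Replace the classification head for a small downstream task (5 classes assumed).
backbone.fc = nn.Linear(backbone.fc.in_features, 5)

# Only the new head's parameters are trained.
optimizer = torch.optim.AdamW(backbone.fc.parameters(), lr=1e-3)
# ...standard supervised training loop over the small labeled dataset...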

Conclusion

Computer vision has undergone a remarkable transformation from hand-crafted features to learned representations that exceed human performance on standardized benchmarks. The progression from LeNet to AlexNet to ResNet to Vision Transformers reflects both algorithmic innovation and increased computational power.

Today's practitioners have access to pretrained models that can be fine-tuned for specific applications with modest labeled datasets. The remaining challenges lie in robustness (handling distribution shift), efficiency (deployment on edge devices), and integration with other modalities for richer understanding.