Cloud vs edge vs fog computing, AI chips, and latency-critical architectures
The cloud computing model—centralizing computation in massive data centers—has dominated for the past two decades. But as IoT devices proliferate and AI applications demand millisecond-scale latency, the pendulum is swinging toward distributed computation. Edge computing brings intelligence closer to where data is generated, reducing latency, bandwidth consumption, and privacy exposure.
This article explores the edge computing landscape: the architectural paradigms, the specialized AI hardware enabling edge inference, and the practical considerations for deploying AI at the edge.
Traditional cloud computing centralizes resources in large data centers. All devices send data to the cloud for processing, which returns results.
Device → [Internet] → Cloud Data Center → [Internet] → Device
Advantages:
- Effectively unlimited compute (scale up or down as needed)
- Centralized management
- Latest hardware available
Disadvantages:
- Latency: 50-200ms round-trip
- Bandwidth: Transmitting all raw data is expensive
- Privacy: Sensitive data leaves the device
- Connectivity: Requires internet connection
Edge computing pushes computation to the "edge" of the network—devices, gateways, or nearby servers. Data is processed locally with only relevant results sent to the cloud.
Device → Local Processing → Summary Results → Cloud
Advantages:
- Ultra-low latency: 1-10ms
- Reduced bandwidth: Only transmit processed data
- Privacy: Raw data never leaves the device
- Offline capability: Works without internet
Disadvantages:
- Limited compute: Constrained by device hardware
- Management complexity: Distributed updates and monitoring
- Model size: Must fit in device memory
Fog computing sits between cloud and edge—an intermediate layer of computing resources (local servers, gateway devices) that provides more resources than edge while being closer than cloud.
Device → Fog Node (local server) → Cloud Data Center
Example: a factory runs a fog server that aggregates sensor data from multiple machines, performs initial processing, and sends only summaries to the cloud.
| Application | Required Latency | Edge Processing? |
|---|---|---|
| Autonomous driving | < 10ms | Essential |
| Industrial robotics | < 5ms | Essential |
| AR/VR | < 20ms | Essential |
| Voice assistants | < 200ms perceived | Yes (partial processing) |
| Smart home | < 1 second acceptable | Flexible |
| Predictive maintenance | Minutes acceptable | Flexible |
The simplest option: run inference on the device CPU. Modern CPUs from Intel (Core series), Qualcomm (Snapdragon), and Apple (M-series) include vector extensions (AVX, Neon, AMX) that accelerate matrix operations.
NVIDIA dominates GPU-based edge computing with the Jetson family:
| Device | GPU | AI Performance (TOPS) | Power | Typical Use |
|---|---|---|---|---|
| Jetson Nano | Maxwell 128-core | 0.5 | 5-10W | Prototyping, simple CV |
| Jetson Xavier NX | Volta 384-core | 21 | 10-15W | Robotics, smart cameras |
| Jetson AGX Orin | Ampere 2048-core | 275 | 15-60W | Autonomous vehicles, high-end robotics |
| Jetson Thor (2025) | Ada Lovelace | 700+ | 60-90W | Level 4+ autonomous driving |
Apple's Neural Engine (ANE) is a dedicated AI accelerator integrated into A-series and M-series chips. It enables real-time AI features on iPhones: photo segmentation, Siri processing, and live transcription. Core ML abstracts the hardware differences, allowing developers to target the ANE without explicit per-chip optimization.
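As a rough illustration (the MobileNetV3 backbone, input shape, and file name below are assumptions, not Apple's canonical example), a PyTorch model can be traced and converted with the coremltools package; Core ML then decides at run time whether to schedule it on the CPU, GPU, or ANE:

```python
# Sketch: convert a traced PyTorch model to Core ML with coremltools.
# The MobileNetV3 backbone, input shape, and file name are illustrative choices.
import torch
import torchvision
import coremltools as ct

model = torchvision.models.mobilenet_v3_small(weights="IMAGENET1K_V1").eval()
example = torch.rand(1, 3, 224, 224)
traced = torch.jit.trace(model, example)

mlmodel = ct.convert(
    traced,
    inputs=[ct.TensorType(name="image", shape=(1, 3, 224, 224))],
    convert_to="mlprogram",
    compute_units=ct.ComputeUnit.ALL,  # let Core ML choose CPU, GPU, or ANE
)
mlmodel.save("MobileNetV3Small.mlpackage")
```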
Google's Edge TPU is a purpose-built ASIC for edge inference:
| Form Factor | Performance | Power | Use Case |
|---|---|---|---|
| Coral Dev Board | 4 TOPS | 2W | Prototyping |
| USB Accelerator | 4 TOPS | 2W | Adding AI to existing devices |
| Edge TPU Module | 8 TOPS | 4W | Production embedded |
Edge TPUs are optimized for INT8 quantized models and run efficiently with TensorFlow Lite. They're particularly popular for vision applications in retail and industrial settings.
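As a concrete, hedged example, a model compiled for the Edge TPU can be run from Python with the tflite_runtime package; the model path and dummy input below are placeholders:

```python
# Sketch: INT8 inference on a Coral Edge TPU via the TensorFlow Lite runtime.
# "model_edgetpu.tflite" stands in for a model compiled with the Edge TPU compiler.
import numpy as np
import tflite_runtime.interpreter as tflite

interpreter = tflite.Interpreter(
    model_path="model_edgetpu.tflite",
    experimental_delegates=[tflite.load_delegate("libedgetpu.so.1")],
)
interpreter.allocate_tensors()

inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]
frame = np.zeros(inp["shape"], dtype=inp["dtype"])  # placeholder for a camera frame
interpreter.set_tensor(inp["index"], frame)
interpreter.invoke()
scores = interpreter.get_tensor(out["index"])
```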
Qualcomm's Hexagon processor, integrated into Snapdragon SoCs, includes dedicated AI processing capabilities exposed to developers through the Qualcomm AI Engine and its SDKs.
Quantization reduces model weights from FP32 to INT8 or even INT4:
FP32 (32-bit float): 4 bytes per weight
INT8 (8-bit integer): 1 byte per weight
INT4 (4-bit integer): 0.5 bytes per weight
For a 10M parameter model:
FP32: 40 MB
INT8: 10 MB (75% reduction, ~2% accuracy loss)
INT4: 5 MB (87.5% reduction, ~5% accuracy loss)
Quantization-aware training (QAT) produces better accuracy than post-training quantization by simulating quantization during training.
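A minimal post-training example, using PyTorch's dynamic quantization on a toy model (the layer sizes are arbitrary), shows the weight-size effect directly:

```python
# Sketch: post-training dynamic quantization in PyTorch (FP32 weights -> INT8).
# The toy model and layer sizes are arbitrary; QAT would instead simulate
# quantization during training for better accuracy.
import os
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(512, 1024), nn.ReLU(), nn.Linear(1024, 10))
quantized = torch.ao.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

def size_mb(m, path="tmp.pt"):
    torch.save(m.state_dict(), path)  # serialize weights to measure on-disk size
    size = os.path.getsize(path) / 1e6
    os.remove(path)
    return size

print(f"FP32: {size_mb(model):.2f} MB  INT8: {size_mb(quantized):.2f} MB")
```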
Pruning removes redundant individual weights (unstructured pruning) or entire neurons and channels (structured pruning).
Modern vision transformers can often be pruned 30-50% with minimal accuracy loss.
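A minimal sketch with PyTorch's built-in pruning utilities (the single Linear layer and the 40% sparsity target are arbitrary choices):

```python
# Sketch: unstructured magnitude pruning with torch.nn.utils.prune.
# The layer and the 40% sparsity target are illustrative only.
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(256, 256)
prune.l1_unstructured(layer, name="weight", amount=0.4)  # zero the 40% smallest weights
prune.remove(layer, "weight")                            # bake the mask into the weights

sparsity = (layer.weight == 0).float().mean().item()
print(f"weight sparsity: {sparsity:.0%}")
```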
Large Teacher Model (high accuracy) → Small Student Model (compressed)
Training signal: Student mimics teacher logits + intermediate activations
Result: Student model much smaller but retains most teacher capability
DistilBERT uses knowledge distillation to produce a model 40% smaller and 60% faster than BERT while retaining 97% of its language-understanding performance.
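A common way to express the training signal is a temperature-scaled KL divergence against the teacher's logits mixed with the usual hard-label loss; the temperature and mixing weight below are illustrative hyperparameters:

```python
# Sketch: a typical soft-target distillation loss for classification.
# T (temperature) and alpha (mixing weight) are illustrative hyperparameters.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    # Soft targets: match the teacher's temperature-softened distribution.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # rescale so the soft loss is comparable to the hard loss
    # Hard targets: ordinary cross-entropy against ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```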
Some architectures are inherently more edge-friendly: MobileNet's depthwise separable convolutions, SqueezeNet's fire modules, and EfficientNet-Lite's hardware-aware simplifications were all designed with constrained devices in mind.
TensorFlow Lite (TFLite) converts TensorFlow models to efficient edge formats:
1. Convert: TF Model → TFLite FlatBuffer format
2. Optimize: Quantize, prune, optimize ops
3. Deploy: Run on mobile, embedded, or microcontrollers
Supported platforms:
- Android (CPU, GPU, NNAPI, EdgeTPU)
- iOS (Core ML, Metal)
- Linux (x86, ARM)
- Microcontrollers (TensorFlow Lite for Microcontrollers)
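The conversion path above is a few lines of Python; the Keras MobileNetV2 model and the default optimization flag below are illustrative choices:

```python
# Sketch: convert a Keras model to TFLite with default optimizations (quantization).
# MobileNetV2 and the output file name are illustrative.
import tensorflow as tf

model = tf.keras.applications.MobileNetV2(weights="imagenet")

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # enable weight quantization
tflite_model = converter.convert()

with open("mobilenet_v2.tflite", "wb") as f:
    f.write(tflite_model)

# On-device, the same bytes are loaded by tf.lite.Interpreter (or tflite_runtime).
interpreter = tf.lite.Interpreter(model_content=tflite_model)
interpreter.allocate_tensors()
```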
ONNX Runtime provides cross-platform inference:
# Convert a PyTorch model to ONNX
import torch
torch.onnx.export(model, example_input, "model.onnx")
# (Keras/TensorFlow models can be exported to ONNX with the tf2onnx package)

# Run with ONNX Runtime
import onnxruntime as ort
session = ort.InferenceSession("model.onnx")
# The feed dict key must match the exported graph's input name
results = session.run(None, {"input": data})
ONNX Runtime supports hardware acceleration on CPU, CUDA, TensorRT, Core ML, and more.
PyTorch's mobile stack enables direct deployment:
# PyTorch Mobile
import torch
import torchvision

model = torchvision.models.mobilenet_v3_large(weights="IMAGENET1K_V1")
model.eval()
model_scripted = torch.jit.script(model)  # TorchScript runs on-device via the mobile runtime
model_scripted.save("mobilenet.pt")

# ExecuTorch (newer, more flexible) follows a capture-then-compile flow, roughly:
#   from torch.export import export
#   from executorch.exir import to_edge
#   program = to_edge(export(model, (example_input,))).to_executorch()
# (API sketch only; see the ExecuTorch documentation for the current interface)
Many applications use a hybrid approach:
Edge device does initial filtering:
- Voice keyword detection (always-on, minimal compute)
- If keyword detected → transcribe and send to cloud
- Cloud does full NLP processing
Result: Cloud only contacted when relevant, saving bandwidth
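In code, such a cascade is little more than a cheap, always-on check gating the expensive path; the keyword check and cloud call below are stand-ins for a real keyword-spotting model and speech/NLU service:

```python
# Sketch of an edge-to-cloud cascade for a voice assistant.
# keyword_detected() and send_to_cloud() are placeholders, not real APIs.
def keyword_detected(audio_frame) -> bool:
    # Placeholder: a tiny always-on keyword-spotting model would run here.
    return max(audio_frame, default=0.0) > 0.5

def send_to_cloud(audio_frame) -> str:
    # Placeholder: transcription and full NLP happen server-side, only when needed.
    return "cloud result"

def handle_frame(audio_frame):
    if not keyword_detected(audio_frame):  # cheap, on-device, runs on every frame
        return None                        # nothing leaves the device
    return send_to_cloud(audio_frame)      # expensive path, taken rarely
```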
Level 1 (Device): Filter, compress, simple anomaly detection
Level 2 (Gateway): Aggregate, correlate, complex analytics
Level 3 (Cloud): Historical analysis, model retraining
Train models across distributed edge devices without centralizing data:
1. Cloud sends model to edge devices
2. Edge devices train locally on local data
3. Edge devices send model updates (not raw data) to cloud
4. Cloud aggregates updates, updates global model
5. Repeat
This enables learning from private data while keeping data local.
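The aggregation in step 4 is essentially a data-weighted average of the client updates; the sketch below uses synthetic per-layer NumPy arrays and sample counts to show just that step:

```python
# Sketch: FedAvg aggregation -- a sample-count-weighted average of client weights.
# Client updates and sample counts here are synthetic illustrations.
import numpy as np

def fedavg(client_weights, client_sizes):
    """Average per-layer weights, weighting each client by its number of samples."""
    total = sum(client_sizes)
    n_layers = len(client_weights[0])
    return [
        sum(w[layer] * (n / total) for w, n in zip(client_weights, client_sizes))
        for layer in range(n_layers)
    ]

# Three clients, each contributing one toy layer of shape (2, 2).
clients = [[np.full((2, 2), v)] for v in (1.0, 2.0, 3.0)]
sizes = [100, 200, 700]
global_layer = fedavg(clients, sizes)[0]  # pulled toward the 700-sample client (2.6)
```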
Edge devices have limited RAM. Always compute memory requirements:
Memory = Model weights + Activations + KV cache (for transformers)
For MobileNet-V3 Large (~5.4M parameters, INT8-quantized weights):
Weights: ~5 MB
Activations (batch=1): ~5 MB
Total: ~10 MB (fits in any device)
For Llama 3 8B at FP16 (impractical on most edge devices):
Weights (FP16): 16 GB
KV cache (8K context): ~1 GB
Total: ~17 GB (requires server-class hardware)
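A back-of-the-envelope estimator reproduces these figures; the Llama 3 8B shape constants (32 layers, 8 KV heads, head dimension 128) are taken from the public model card and should be treated as approximate:

```python
# Sketch: rough deployment-memory math (weights + transformer KV cache).
# Llama 3 8B shape constants are approximate, from the public model card.
def weights_gb(n_params, bytes_per_param):
    return n_params * bytes_per_param / 1e9

def kv_cache_gb(n_layers, n_kv_heads, head_dim, context_len, bytes_per_value=2):
    # Keys and values are cached per layer, per KV head, per position.
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_value / 1e9

w = weights_gb(8e9, 2)              # ~16 GB of FP16 weights
kv = kv_cache_gb(32, 8, 128, 8192)  # ~1 GB of KV cache at an 8K context
print(f"Llama 3 8B: ~{w + kv:.0f} GB total")
```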
AI inference generates heat. Continuous high-power inference may require active cooling; passively cooled devices will otherwise throttle under sustained load.
Managing fleets of deployed edge devices requires robust update mechanisms: over-the-air (OTA) model and firmware updates, staged rollouts, and the ability to roll back a bad release.
Edge computing is essential for latency-critical, privacy-sensitive, or connectivity-constrained applications. Specialized AI hardware—Jetson, Edge TPU, Apple Neural Engine—has made significant inference capability available at the edge. Model optimization techniques (quantization, pruning, distillation) enable capable AI within tight resource budgets.
The future is hybrid: edge devices handle immediate processing, fog nodes provide regional aggregation, and cloud offers centralized training and complex analytics. The right split depends on latency requirements, data sensitivity, and update frequency. Understanding these trade-offs is essential for architects building next-generation IoT and AI systems.