Edge Computing and IoT

Cloud vs edge vs fog computing, AI chips, and latency-critical architectures

Published: January 2026 | Reading Time: 14 minutes | Category: Infrastructure


The cloud computing model—centralizing computation in massive data centers—has dominated for the past two decades. But as IoT devices proliferate and AI applications demand millisecond latency, the pendulum is swinging toward distributed computation. Edge computing brings intelligence closer to where data is generated, reducing latency, bandwidth costs, and privacy exposure.

This article explores the edge computing landscape: the architectural paradigms, the specialized AI hardware enabling edge inference, and the practical considerations for deploying AI at the edge.

The Computing Paradigms

Cloud Computing

Traditional cloud computing centralizes resources in large data centers. All devices send data to the cloud for processing, which returns results.

Device → [Internet] → Cloud Data Center → [Internet] → Device

Advantages:
  - Unlimited compute (scale up/down as needed)
  - Centralized management
  - Latest hardware available
  
Disadvantages:
  - Latency: 50-200ms round-trip
  - Bandwidth: Transmitting all raw data is expensive
  - Privacy: Sensitive data leaves the device
  - Connectivity: Requires internet connection
    

Edge Computing

Edge computing pushes computation to the "edge" of the network—devices, gateways, or nearby servers. Data is processed locally with only relevant results sent to the cloud.

Device → Local Processing → Summary Results → Cloud

Advantages:
  - Ultra-low latency: 1-10ms
  - Reduced bandwidth: Only transmit processed data
  - Privacy: Raw data never leaves the device
  - Offline capability: Works without internet
  
Disadvantages:
  - Limited compute: Constrained by device hardware
  - Management complexity: Distributed updates and monitoring
  - Model size: Must fit in device memory
    

Fog Computing

Fog computing sits between cloud and edge—an intermediate layer of computing resources (local servers, gateway devices) that provides more resources than edge while being closer than cloud.

Device → Fog Node (local server) → Cloud Data Center

Example: A factory has a fog server that aggregates
sensor data from multiple machines, does initial
processing, and only sends summaries to the cloud.
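The factory example can be sketched in a few lines. This is a minimal illustration, not a real fog stack: the field names and summary statistics are illustrative assumptions, standing in for whatever a given deployment aggregates.

```python
# Sketch of the factory example above: a fog node aggregates raw
# sensor readings locally and forwards only a compact summary upstream.

def summarize_readings(readings):
    """Reduce a batch of per-machine temperature readings to a summary."""
    values = [r["temp_c"] for r in readings]
    return {
        "count": len(values),
        "mean_temp_c": sum(values) / len(values),
        "max_temp_c": max(values),
    }

batch = [
    {"machine": "press-1", "temp_c": 61.0},
    {"machine": "press-2", "temp_c": 63.5},
    {"machine": "lathe-1", "temp_c": 70.5},
]
summary = summarize_readings(batch)  # only this dict goes to the cloud
```

The raw readings never leave the fog node; the cloud sees three numbers instead of a continuous sensor stream.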
    

Latency Requirements by Application

Application              Required Latency        Edge Feasible?
Autonomous driving       < 10ms                  Essential
Industrial robotics      < 5ms                   Essential
AR/VR                    < 20ms                  Essential
Voice assistants         < 200ms perceived       Yes (partial processing)
Smart home               < 1 second acceptable   Flexible
Predictive maintenance   Minutes acceptable      Flexible

The Speed of Light Limit: A round trip to a distant cloud datacenter adds 30-50ms minimum from light propagation alone (speed of light through fiber is ~200,000 km/s, plus routing overhead). Edge processing eliminates this fundamental limit.
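The propagation floor is easy to compute from the figure above (~200,000 km/s in fiber); routing, queuing, and processing only add to it.

```python
# Back-of-envelope latency floor from fiber propagation alone.
FIBER_SPEED_KM_S = 200_000  # approximate speed of light in fiber

def round_trip_ms(distance_km):
    """Minimum round-trip time in milliseconds over fiber."""
    return 2 * distance_km / FIBER_SPEED_KM_S * 1000

print(round_trip_ms(3000))  # ~30 ms floor for a 3,000 km distant datacenter
print(round_trip_ms(10))    # ~0.1 ms for a nearby edge site
```

At 3,000 km the floor alone already violates the autonomous-driving and robotics budgets in the table, regardless of how fast the datacenter computes.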

Edge AI Hardware

CPU-based Inference

The simplest option: run inference on the device CPU. Modern CPUs from Intel (Core series), Qualcomm (Snapdragon), and Apple (M-series) include vector extensions (AVX, Neon, AMX) that accelerate matrix operations.

GPU-based Edge Devices

NVIDIA dominates GPU-based edge computing with the Jetson family:

Device               GPU                AI Performance (TOPS)   Power    Typical Use
Jetson Nano          Maxwell 128-core   0.5                     5-10W    Prototyping, simple CV
Jetson Xavier NX     Volta 384-core     21                      10-15W   Robotics, smart cameras
Jetson AGX Orin      Ampere 2048-core   275                     15-60W   Autonomous vehicles, high-end robotics
Jetson Thor (2025)   Blackwell          700+                    60-90W   Level 4+ autonomous driving

Apple Neural Engine (ANE)

Apple's Neural Engine is a dedicated AI accelerator integrated into A-series and M-series chips.

The ANE enables real-time AI features on iPhones: photo segmentation, Siri processing, live transcription. Core ML abstracts hardware differences, allowing developers to target the ANE without explicit optimization.

Google Edge TPU

Google's Edge TPU is a purpose-built ASIC for edge inference:

Form Factor       Performance   Power   Use Case
Coral Dev Board   4 TOPS        2W      Prototyping
USB Accelerator   4 TOPS        2W      Adding AI to existing devices
Edge TPU Module   8 TOPS        4W      Production embedded

Edge TPUs are optimized for INT8 quantized models and run efficiently with TensorFlow Lite. They're particularly popular for vision applications in retail and industrial settings.

Qualcomm Hexagon DSP

Qualcomm's Hexagon DSP includes dedicated AI processing capabilities and powers on-device inference in most Snapdragon-based Android phones; developers typically reach it through the Qualcomm AI Engine SDK or through frameworks such as TensorFlow Lite.

Model Optimization for Edge

Quantization

Quantization reduces model weights from FP32 to INT8 or even INT4:

FP32 (32-bit float):  4 bytes per weight
INT8 (8-bit integer):  1 byte per weight
INT4 (4-bit integer): 0.5 bytes per weight

For a 10M parameter model:
  FP32: 40 MB
  INT8: 10 MB (75% reduction, ~2% accuracy loss)
  INT4: 5 MB (87.5% reduction, ~5% accuracy loss)
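The arithmetic above reduces to one formula, shown here as a small helper:

```python
# Model size for a given parameter count and weight precision.
def model_size_mb(num_params, bits_per_weight):
    return num_params * bits_per_weight / 8 / 1e6  # bytes -> MB

for bits in (32, 8, 4):
    print(f"{bits}-bit: {model_size_mb(10_000_000, bits):.1f} MB")
```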
    

Quantization-aware training (QAT) produces better accuracy than post-training quantization by simulating quantization during training.
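The core of post-training quantization is a simple mapping. Below is a minimal sketch of symmetric INT8 quantization with a single scale factor; real toolchains additionally calibrate scales per channel and handle activations, not just weights.

```python
# Symmetric INT8 quantization: map floats to integers in [-127, 127]
# using one scale factor, then dequantize and inspect the error.

def quantize_int8(weights):
    scale = max(abs(w) for w in weights) / 127
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

weights = [0.42, -1.27, 0.05, 0.9]
q, scale = quantize_int8(weights)
recovered = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, recovered))
# max_err is bounded by scale/2 — the rounding error of one quantization step
```

The accuracy loss quoted above comes from exactly this rounding: each weight moves by at most half a quantization step.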

Pruning

Pruning removes redundant weights or neurons. Unstructured pruning zeroes individual weights, while structured pruning removes whole channels or attention heads, which maps more directly onto real hardware speedups.

Modern vision transformers can often be pruned 30-50% with minimal accuracy loss.
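The simplest variant, magnitude pruning, can be shown in miniature: zero out the smallest-magnitude fraction of weights and keep the rest.

```python
# Magnitude pruning: zero the smallest `sparsity` fraction of weights.
def prune_by_magnitude(weights, sparsity):
    k = int(len(weights) * sparsity)  # number of weights to remove
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:k]:
        pruned[i] = 0.0
    return pruned

w = [0.9, -0.01, 0.5, 0.02, -0.7, 0.03]
print(prune_by_magnitude(w, 0.5))  # → [0.9, 0.0, 0.5, 0.0, -0.7, 0.0]
```

In practice the zeros only pay off when the runtime or hardware exploits sparsity, which is why structured pruning is often preferred on edge devices.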

Knowledge Distillation

Large Teacher Model → Small Student Model
     (accuracy)         (compressed)

Training signal: Student mimics teacher logits + intermediate activations
Result: Student model much smaller but retains most teacher capability
    

DistilBERT uses knowledge distillation to produce a model 40% smaller and 60% faster than BERT while retaining 97% of its language-understanding performance.
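The training signal sketched above is, at its core, a cross-entropy between temperature-softened distributions. A minimal illustration (logit matching only; real recipes combine this with the hard-label loss and sometimes intermediate activations):

```python
import math

# Distillation loss: the student is trained to match the teacher's
# softened probability distribution at temperature T.

def softmax(logits, temperature=1.0):
    exps = [math.exp(l / temperature) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """Cross-entropy between softened teacher and student distributions."""
    t = softmax(teacher_logits, temperature)
    s = softmax(student_logits, temperature)
    return -sum(ti * math.log(si) for ti, si in zip(t, s))

teacher = [3.0, 1.0, 0.2]
close = distillation_loss(teacher, [2.8, 1.1, 0.3])  # student tracks teacher
far = distillation_loss(teacher, [0.2, 1.0, 3.0])    # student disagrees
# close < far: matching the teacher's distribution lowers the loss
```

The temperature matters: softening exposes the teacher's relative confidence across wrong answers ("dark knowledge"), which plain one-hot labels discard.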

Architecture Choices

Some architectures are inherently more edge-friendly: MobileNet and EfficientNet-Lite rely on depthwise-separable convolutions to cut compute and parameter counts, while SqueezeNet-style designs trade a little accuracy for a much smaller footprint.

Edge AI Frameworks

TensorFlow Lite

TensorFlow Lite (TFLite) converts TensorFlow models to efficient edge formats:

1. Convert: TF Model → TFLite FlatBuffer format
2. Optimize: Quantize, prune, optimize ops
3. Deploy: Run on mobile, embedded, or microcontrollers

Supported platforms:
  - Android (CPU, GPU, NNAPI, EdgeTPU)
  - iOS (Core ML, Metal)
  - Linux (x86, ARM)
  - Microcontrollers (TensorFlow Lite for Microcontrollers)
    

ONNX Runtime

ONNX Runtime provides cross-platform inference:

# Convert a PyTorch model to ONNX
import torch
torch.onnx.export(model, example_input, "model.onnx")
# (Keras/TensorFlow models can be converted with the tf2onnx package)

# Run with ONNX Runtime
import onnxruntime as ort
session = ort.InferenceSession("model.onnx")
results = session.run(None, {"input": data})
    

ONNX Runtime supports hardware acceleration on CPU, CUDA, TensorRT, Core ML, and more.

PyTorch Mobile and ExecuTorch

PyTorch's mobile stack enables direct deployment:

# PyTorch Mobile (TorchScript)
import torch
import torchvision
model = torchvision.models.mobilenet_v3_large(weights="IMAGENET1K_V1")
model.eval()
model_scripted = torch.jit.script(model)
model_scripted.save("mobilenet.pt")

# ExecuTorch (newer, more flexible): export the model with torch.export,
# lower it via executorch.exir.to_edge, then call .to_executorch() to
# produce a program for the on-device runtime.
    

Hybrid Architectures

Many applications use a hybrid approach:

Streaming/Caching

Edge device does initial filtering:
  - Voice keyword detection (always-on, minimal compute)
  - If keyword detected → transcribe and send to cloud
  - Cloud does full NLP processing
  
Result: Cloud only contacted when relevant, saving bandwidth
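The always-on gate above can be illustrated with a deliberately crude first stage. The threshold and energy heuristic are illustrative assumptions; real keyword spotters run a tiny neural network, but the architectural point is the same: a cheap check decides whether to wake the expensive pipeline.

```python
# First-stage filter: only frames that pass this gate proceed to
# transcription and the cloud.

def should_wake(frame, threshold=0.1):
    """Cheap energy check: mean absolute amplitude vs a threshold."""
    energy = sum(abs(s) for s in frame) / len(frame)
    return energy > threshold

silence = [0.01, -0.02, 0.01, 0.0]
speech = [0.4, -0.5, 0.3, -0.2]
# should_wake(silence) is False; should_wake(speech) is True
```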
    

Hierarchical Processing

Level 1 (Device):    Filter, compress, simple anomaly detection
Level 2 (Gateway):   Aggregate, correlate, complex analytics
Level 3 (Cloud):     Historical analysis, model retraining
    

Federated Learning

Train models across distributed edge devices without centralizing data:

1. Cloud sends model to edge devices
2. Edge devices train locally on local data
3. Edge devices send model updates (not raw data) to cloud
4. Cloud aggregates updates, updates global model
5. Repeat
    

This enables learning from private data while keeping data local.
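Step 4 of the loop above is usually federated averaging (FedAvg): the cloud combines client models weighted by how much data each client trained on. A minimal sketch with flat weight lists:

```python
# Federated averaging: weighted mean of per-client model weights.
def fed_avg(client_weights, client_sizes):
    total = sum(client_sizes)
    n = len(client_weights[0])
    return [
        sum(w[i] * size for w, size in zip(client_weights, client_sizes)) / total
        for i in range(n)
    ]

clients = [[1.0, 2.0], [3.0, 4.0]]
sizes = [100, 300]              # client 2 holds 3x the data
print(fed_avg(clients, sizes))  # → [2.5, 3.5]
```

Note that only the weight vectors cross the network; the 100 and 300 local examples never leave their devices.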

Practical Considerations

Memory Constraints

Edge devices have limited RAM. Always compute memory requirements:

Memory = Model weights + Activations + KV cache (for transformers)

For MobileNet-V3 Large:
  Weights: ~5 MB
  Activations (batch=1): ~5 MB
  Total: ~10 MB (fits in any device)

For Llama 3 8B (impossible on most edge devices):
  Weights (FP16): 16 GB
  KV cache (8K context): ~1 GB
  Total: ~17 GB (requires server-class hardware)
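The arithmetic above can be packaged as helpers. The KV-cache formula assumes a standard transformer layout (2 tensors, K and V, per layer); the Llama 3 8B shape parameters used below (32 layers, 8 KV heads via grouped-query attention, head dimension 128) are public model-card values.

```python
# Rough memory estimate: weights + KV cache, in (decimal) GB.

def weights_gb(num_params, bytes_per=2):
    """Weight memory at a given precision (default FP16 = 2 bytes)."""
    return num_params * bytes_per / 1e9

def kv_cache_gb(layers, kv_heads, head_dim, context, bytes_per=2):
    """KV cache: 2 (K and V) x layers x heads x head_dim x context."""
    return 2 * layers * kv_heads * head_dim * context * bytes_per / 1e9

w = weights_gb(8_000_000_000)       # 16.0 GB in FP16
kv = kv_cache_gb(32, 8, 128, 8192)  # ~1.07 GB at 8K context
```

Running the same model at INT4 (`bytes_per=0.5`) drops the weights to ~4 GB, which is why aggressive quantization is the usual route to fitting LLMs onto high-end edge hardware at all.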
    

Thermal Management

AI inference generates heat. Continuous high-power inference may require active cooling, and most edge modules throttle their clocks when thermal limits are reached, so sustained throughput can fall well below peak TOPS figures.

OTA Updates

Managing deployed edge devices requires robust update mechanisms: signed firmware and model images, staged rollouts, delta updates to conserve bandwidth, and automatic rollback when a new version fails health checks.

Conclusion

Edge computing is essential for latency-critical, privacy-sensitive, or connectivity-constrained applications. Specialized AI hardware—Jetson, Edge TPU, Apple Neural Engine—has made significant inference capability available at the edge. Model optimization techniques (quantization, pruning, distillation) enable capable AI within tight resource budgets.

The future is hybrid: edge devices handle immediate processing, fog nodes provide regional aggregation, and cloud offers centralized training and complex analytics. The right split depends on latency requirements, data sensitivity, and update frequency. Understanding these trade-offs is essential for architects building next-generation IoT and AI systems.