Cloud vs edge vs fog computing, AI chips, and latency-critical architectures
The cloud computing model—centralizing computation in massive data centers—has dominated for the past two decades. But as IoT devices proliferate and AI applications demand millisecond-scale latency, the pendulum is swinging toward distributed computation. Edge computing brings intelligence closer to where data is generated, reducing latency, bandwidth consumption, and privacy exposure.
This article explores the edge computing landscape: the architectural paradigms, the specialized AI hardware enabling edge inference, and the practical considerations for deploying AI at the edge.
Traditional cloud computing centralizes resources in large data centers. All devices send data to the cloud for processing, which returns results.
Device → [Internet] → Cloud Data Center → [Internet] → Device
Advantages:
- Effectively unlimited compute (scale up or down as needed)
- Centralized management
- Latest hardware available
Disadvantages:
- Latency: 50-200ms round-trip
- Bandwidth: Transmitting all raw data is expensive
- Privacy: Sensitive data leaves the device
- Connectivity: Requires internet connection
Edge computing pushes computation to the "edge" of the network—devices, gateways, or nearby servers. Data is processed locally with only relevant results sent to the cloud.
Device → Local Processing → Summary Results → Cloud
Advantages:
- Ultra-low latency: 1-10ms
- Reduced bandwidth: Only transmit processed data
- Privacy: Raw data never leaves the device
- Offline capability: Works without internet
Disadvantages:
- Limited compute: Constrained by device hardware
- Management complexity: Distributed updates and monitoring
- Model size: Must fit in device memory
Fog computing sits between cloud and edge—an intermediate layer of computing resources (local servers, gateway devices) that provides more resources than edge while being closer than cloud.
Device → Fog Node (local server) → Cloud Data Center
Example: a factory runs a fog server that aggregates sensor data from multiple machines, performs initial processing, and sends only summaries to the cloud.
| Application | Required Latency | Edge Processing? |
|---|---|---|
| Autonomous driving | < 10ms | Essential |
| Industrial robotics | < 5ms | Essential |
| AR/VR | < 20ms | Essential |
| Voice assistants | < 200ms perceived | Yes (partial processing) |
| Smart home | < 1 second acceptable | Flexible |
| Predictive maintenance | Minutes acceptable | Flexible |
The simplest option: run inference on the device CPU. Modern CPUs from Intel (Core series), Qualcomm (Snapdragon), and Apple (M-series) include vector extensions (AVX, Neon, AMX) that accelerate matrix operations.
NVIDIA dominates GPU-based edge computing with the Jetson family:
| Device | GPU | AI Performance (TOPS) | Power | Typical Use |
|---|---|---|---|---|
| Jetson Nano | Maxwell 128-core | 0.5 | 5-10W | Prototyping, simple CV |
| Jetson Xavier NX | Volta 384-core | 21 | 10-15W | Robotics, smart cameras |
| Jetson AGX Orin | Ampere 2048-core | 275 | 15-60W | Autonomous vehicles, high-end robotics |
| Jetson Thor (2025) | Ada Lovelace | 700+ | 60-90W | Level 4+ autonomous driving |
Apple's Neural Engine (ANE) is a dedicated AI accelerator integrated into A-series and M-series chips. It enables real-time AI features on iPhones: photo segmentation, Siri processing, and live transcription. Core ML abstracts the hardware differences, allowing developers to target the ANE without explicit per-chip optimization.
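As a rough illustration (the MobileNetV3 backbone, input shape, and file name below are assumptions, not Apple's canonical example), a PyTorch model can be traced and converted with the coremltools package; Core ML then decides at run time whether to schedule it on the CPU, GPU, or ANE:

```python
# Sketch: convert a traced PyTorch model to Core ML with coremltools.
# The MobileNetV3 backbone, input shape, and file name are illustrative choices.
import torch
import torchvision
import coremltools as ct

model = torchvision.models.mobilenet_v3_small(weights="IMAGENET1K_V1").eval()
example = torch.rand(1, 3, 224, 224)
traced = torch.jit.trace(model, example)

mlmodel = ct.convert(
    traced,
    inputs=[ct.TensorType(name="image", shape=(1, 3, 224, 224))],
    convert_to="mlprogram",
    compute_units=ct.ComputeUnit.ALL,  # let Core ML choose CPU, GPU, or ANE
)
mlmodel.save("MobileNetV3Small.mlpackage")
```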
Google's Edge TPU is a purpose-built ASIC for edge inference:
| Form Factor | Performance | Power | Use Case |
|---|---|---|---|
| Coral Dev Board | 4 TOPS | 2W | Prototyping |
| USB Accelerator | 4 TOPS | 2W | Adding AI to existing devices |
| Edge TPU Module | 8 TOPS | 4W | Production embedded |
Edge TPUs are optimized for INT8 quantized models and run efficiently with TensorFlow Lite. They're particularly popular for vision applications in retail and industrial settings.
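As a concrete, hedged example, a model compiled for the Edge TPU can be run from Python with the tflite_runtime package; the model path and dummy input below are placeholders:

```python
# Sketch: INT8 inference on a Coral Edge TPU via the TensorFlow Lite runtime.
# "model_edgetpu.tflite" stands in for a model compiled with the Edge TPU compiler.
import numpy as np
import tflite_runtime.interpreter as tflite

interpreter = tflite.Interpreter(
    model_path="model_edgetpu.tflite",
    experimental_delegates=[tflite.load_delegate("libedgetpu.so.1")],
)
interpreter.allocate_tensors()

inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]
frame = np.zeros(inp["shape"], dtype=inp["dtype"])  # placeholder for a camera frame
interpreter.set_tensor(inp["index"], frame)
interpreter.invoke()
scores = interpreter.get_tensor(out["index"])
```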
Qualcomm's Hexagon processor, integrated into Snapdragon SoCs, includes dedicated AI processing capabilities exposed to developers through the Qualcomm AI Engine and its SDKs.
Quantization reduces model weights from FP32 to INT8 or even INT4:
FP32 (32-bit float): 4 bytes per weight
INT8 (8-bit integer): 1 byte per weight
INT4 (4-bit integer): 0.5 bytes per weight
For a 10M parameter model:
FP32: 40 MB
INT8: 10 MB (75% reduction, ~2% accuracy loss)
INT4: 5 MB (87.5% reduction, ~5% accuracy loss)
Quantization-aware training (QAT) produces better accuracy than post-training quantization by simulating quantization during training.
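A minimal post-training example, using PyTorch's dynamic quantization on a toy model (the layer sizes are arbitrary), shows the weight-size effect directly:

```python
# Sketch: post-training dynamic quantization in PyTorch (FP32 weights -> INT8).
# The toy model and layer sizes are arbitrary; QAT would instead simulate
# quantization during training for better accuracy.
import os
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(512, 1024), nn.ReLU(), nn.Linear(1024, 10))
quantized = torch.ao.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

def size_mb(m, path="tmp.pt"):
    torch.save(m.state_dict(), path)  # serialize weights to measure on-disk size
    size = os.path.getsize(path) / 1e6
    os.remove(path)
    return size

print(f"FP32: {size_mb(model):.2f} MB  INT8: {size_mb(quantized):.2f} MB")
```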
Pruning removes redundant individual weights (unstructured pruning) or entire neurons and channels (structured pruning).
Modern vision transformers can often be pruned 30-50% with minimal accuracy loss.
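A minimal sketch with PyTorch's built-in pruning utilities (the single Linear layer and the 40% sparsity target are arbitrary choices):

```python
# Sketch: unstructured magnitude pruning with torch.nn.utils.prune.
# The layer and the 40% sparsity target are illustrative only.
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(256, 256)
prune.l1_unstructured(layer, name="weight", amount=0.4)  # zero the 40% smallest weights
prune.remove(layer, "weight")                            # bake the mask into the weights

sparsity = (layer.weight == 0).float().mean().item()
print(f"weight sparsity: {sparsity:.0%}")
```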
Large Teacher Model (high accuracy) → Small Student Model (compressed)
Training signal: Student mimics teacher logits + intermediate activations
Result: Student model much smaller but retains most teacher capability
DistilBERT uses knowledge distillation to produce a model 40% smaller and 60% faster than BERT while retaining 97% of its language-understanding performance.
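A common way to express the training signal is a temperature-scaled KL divergence against the teacher's logits mixed with the usual hard-label loss; the temperature and mixing weight below are illustrative hyperparameters:

```python
# Sketch: a typical soft-target distillation loss for classification.
# T (temperature) and alpha (mixing weight) are illustrative hyperparameters.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    # Soft targets: match the teacher's temperature-softened distribution.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # rescale so the soft loss is comparable to the hard loss
    # Hard targets: ordinary cross-entropy against ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```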
Some architectures are inherently more edge-friendly: MobileNet's depthwise separable convolutions, SqueezeNet's fire modules, and EfficientNet-Lite's hardware-aware simplifications were all designed with constrained devices in mind.
TensorFlow Lite (TFLite) converts TensorFlow models to efficient edge formats:
1. Convert: TF Model → TFLite FlatBuffer format
2. Optimize: Quantize, prune, optimize ops
3. Deploy: Run on mobile, embedded, or microcontrollers
Supported platforms:
- Android (CPU, GPU, NNAPI, EdgeTPU)
- iOS (Core ML, Metal)
- Linux (x86, ARM)
- Microcontrollers (TensorFlow Lite for Microcontrollers)
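The conversion path above is a few lines of Python; the Keras MobileNetV2 model and the default optimization flag below are illustrative choices:

```python
# Sketch: convert a Keras model to TFLite with default optimizations (quantization).
# MobileNetV2 and the output file name are illustrative.
import tensorflow as tf

model = tf.keras.applications.MobileNetV2(weights="imagenet")

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # enable weight quantization
tflite_model = converter.convert()

with open("mobilenet_v2.tflite", "wb") as f:
    f.write(tflite_model)

# On-device, the same bytes are loaded by tf.lite.Interpreter (or tflite_runtime).
interpreter = tf.lite.Interpreter(model_content=tflite_model)
interpreter.allocate_tensors()
```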
ONNX Runtime provides cross-platform inference:
# Convert a PyTorch model to ONNX
import torch
torch.onnx.export(model, example_input, "model.onnx")
# (Keras/TensorFlow models can be exported to ONNX with the tf2onnx package)

# Run with ONNX Runtime
import onnxruntime as ort
session = ort.InferenceSession("model.onnx")
# The feed dict key must match the exported graph's input name
results = session.run(None, {"input": data})
ONNX Runtime supports hardware acceleration on CPU, CUDA, TensorRT, Core ML, and more.
PyTorch's mobile stack enables direct deployment:
# PyTorch Mobile
import torch
import torchvision

model = torchvision.models.mobilenet_v3_large(weights="IMAGENET1K_V1")
model.eval()
model_scripted = torch.jit.script(model)  # TorchScript runs on-device via the mobile runtime
model_scripted.save("mobilenet.pt")

# ExecuTorch (newer, more flexible) follows a capture-then-compile flow, roughly:
#   from torch.export import export
#   from executorch.exir import to_edge
#   program = to_edge(export(model, (example_input,))).to_executorch()
# (API sketch only; see the ExecuTorch documentation for the current interface)
Many applications use a hybrid approach:
Edge device does initial filtering:
- Voice keyword detection (always-on, minimal compute)
- If keyword detected → transcribe and send to cloud
- Cloud does full NLP processing
Result: Cloud only contacted when relevant, saving bandwidth
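In code, such a cascade is little more than a cheap, always-on check gating the expensive path; the keyword check and cloud call below are stand-ins for a real keyword-spotting model and speech/NLU service:

```python
# Sketch of an edge-to-cloud cascade for a voice assistant.
# keyword_detected() and send_to_cloud() are placeholders, not real APIs.
def keyword_detected(audio_frame) -> bool:
    # Placeholder: a tiny always-on keyword-spotting model would run here.
    return max(audio_frame, default=0.0) > 0.5

def send_to_cloud(audio_frame) -> str:
    # Placeholder: transcription and full NLP happen server-side, only when needed.
    return "cloud result"

def handle_frame(audio_frame):
    if not keyword_detected(audio_frame):  # cheap, on-device, runs on every frame
        return None                        # nothing leaves the device
    return send_to_cloud(audio_frame)      # expensive path, taken rarely
```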
Level 1 (Device): Filter, compress, simple anomaly detection
Level 2 (Gateway): Aggregate, correlate, complex analytics
Level 3 (Cloud): Historical analysis, model retraining
Train models across distributed edge devices without centralizing data:
1. Cloud sends model to edge devices
2. Edge devices train locally on local data
3. Edge devices send model updates (not raw data) to cloud
4. Cloud aggregates updates, updates global model
5. Repeat
This enables learning from private data while keeping data local.
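The aggregation in step 4 is essentially a data-weighted average of the client updates; the sketch below uses synthetic per-layer NumPy arrays and sample counts to show just that step:

```python
# Sketch: FedAvg aggregation -- a sample-count-weighted average of client weights.
# Client updates and sample counts here are synthetic illustrations.
import numpy as np

def fedavg(client_weights, client_sizes):
    """Average per-layer weights, weighting each client by its number of samples."""
    total = sum(client_sizes)
    n_layers = len(client_weights[0])
    return [
        sum(w[layer] * (n / total) for w, n in zip(client_weights, client_sizes))
        for layer in range(n_layers)
    ]

# Three clients, each contributing one toy layer of shape (2, 2).
clients = [[np.full((2, 2), v)] for v in (1.0, 2.0, 3.0)]
sizes = [100, 200, 700]
global_layer = fedavg(clients, sizes)[0]  # pulled toward the 700-sample client (2.6)
```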
Edge devices have limited RAM. Always compute memory requirements:
Memory = Model weights + Activations + KV cache (for transformers)
For MobileNet-V3 Large (~5.4M parameters, INT8-quantized weights):
Weights: ~5 MB
Activations (batch=1): ~5 MB
Total: ~10 MB (fits in any device)
For Llama 3 8B at FP16 (impractical on most edge devices):
Weights (FP16): 16 GB
KV cache (8K context): ~1 GB
Total: ~17 GB (requires server-class hardware)
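A back-of-the-envelope estimator reproduces these figures; the Llama 3 8B shape constants (32 layers, 8 KV heads, head dimension 128) are taken from the public model card and should be treated as approximate:

```python
# Sketch: rough deployment-memory math (weights + transformer KV cache).
# Llama 3 8B shape constants are approximate, from the public model card.
def weights_gb(n_params, bytes_per_param):
    return n_params * bytes_per_param / 1e9

def kv_cache_gb(n_layers, n_kv_heads, head_dim, context_len, bytes_per_value=2):
    # Keys and values are cached per layer, per KV head, per position.
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_value / 1e9

w = weights_gb(8e9, 2)              # ~16 GB of FP16 weights
kv = kv_cache_gb(32, 8, 128, 8192)  # ~1 GB of KV cache at an 8K context
print(f"Llama 3 8B: ~{w + kv:.0f} GB total")
```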
AI inference generates heat. Continuous high-power inference may require active cooling; passively cooled devices will otherwise throttle under sustained load.
Managing fleets of deployed edge devices requires robust update mechanisms: over-the-air (OTA) model and firmware updates, staged rollouts, and the ability to roll back a bad release.
Edge computing is essential for latency-critical, privacy-sensitive, or connectivity-constrained applications. Specialized AI hardware—Jetson, Edge TPU, Apple Neural Engine—has made significant inference capability available at the edge. Model optimization techniques (quantization, pruning, distillation) enable capable AI within tight resource budgets.
The future is hybrid: edge devices handle immediate processing, fog nodes provide regional aggregation, and cloud offers centralized training and complex analytics. The right split depends on latency requirements, data sensitivity, and update frequency. Understanding these trade-offs is essential for architects building next-generation IoT and AI systems.