What edge AI unlocks for your product.
Edge inference suits any user-facing feature where latency is perceptible, such as smart autocomplete, real-time content suggestions, or inline sentiment scoring on support ticket input. Running the model on Workers AI avoids the round-trip to a centralised cloud data centre.
- Llama 3.1 8B (edge) — Sub-50ms reasoning for common queries at the network edge
- Mistral 7B (edge) — Lightweight general reasoning with fast first-token latency
- Streaming responses — Token-by-token streaming for chat interfaces that feel instant
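As a rough illustration of token-by-token streaming, the sketch below parses a server-sent event stream into tokens on the client side. It assumes each event has the form `data: {"response":"<token>"}` and that the stream ends with `data: [DONE]`; `createTokenParser` is a hypothetical helper name, not part of any Cloudflare SDK.

```typescript
// Minimal sketch: turn streamed SSE text chunks into model tokens.
// Assumption: events look like `data: {"response":"<token>"}` and are
// separated by a blank line; the final event is `data: [DONE]`.
interface TokenParser {
  push(chunk: string): string[];
}

function createTokenParser(): TokenParser {
  // Holds the tail of an event that was split across network chunks.
  let buffer = "";
  return {
    push(chunk: string): string[] {
      buffer += chunk;
      const tokens: string[] = [];
      const events = buffer.split("\n\n");
      // The last piece is either empty or an incomplete event; keep it for later.
      buffer = events.pop() ?? "";
      for (const event of events) {
        const line = event.trim();
        if (!line.startsWith("data:")) continue;
        const payload = line.slice("data:".length).trim();
        if (payload === "[DONE]") continue;
        try {
          const parsed = JSON.parse(payload) as { response?: unknown };
          if (typeof parsed.response === "string") tokens.push(parsed.response);
        } catch {
          // Skip malformed events instead of aborting the whole stream.
        }
      }
      return tokens;
    },
  };
}
```

In a chat interface, each chunk read from the response body would be decoded to text and passed to `push`, with the returned tokens appended to the visible message as they arrive.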
Whisper Large v3 on Workers AI transcribes audio at the edge — meeting recordings, voice notes, call-centre audio — without audio data leaving the region it was recorded in. Ideal for GDPR-sensitive audio processing pipelines.
- Whisper Large v3 — State-of-the-art transcription, processed in-region
- Multi-language — Transcription in 99 languages with automatic language detection
- Timestamped output — Word-level timestamps for meeting minutes and search indexing
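To show how word-level timestamps could feed meeting minutes, here is a small sketch that groups timed words into lines wherever the speaker pauses. The `TimedWord` shape (a word with `start` and `end` in seconds) is an assumption about the transcription output, and `groupWords` and `formatTimestamp` are hypothetical helpers.

```typescript
// Assumed shape of one transcribed word; times are in seconds.
interface TimedWord {
  word: string;
  start: number;
  end: number;
}

interface Segment {
  start: number;
  end: number;
  text: string;
}

// Start a new segment whenever the silence between two words exceeds maxGapSeconds.
function groupWords(words: TimedWord[], maxGapSeconds = 1.0): Segment[] {
  const segments: Segment[] = [];
  let current: Segment | null = null;
  for (const w of words) {
    if (current !== null && w.start - current.end <= maxGapSeconds) {
      current.text += " " + w.word;
      current.end = w.end;
    } else {
      current = { start: w.start, end: w.end, text: w.word };
      segments.push(current);
    }
  }
  return segments;
}

// Format whole seconds as MM:SS for a minutes document or a search index key.
function formatTimestamp(seconds: number): string {
  const total = Math.floor(seconds);
  const minutes = Math.floor(total / 60);
  const secs = total % 60;
  return (minutes < 10 ? "0" : "") + minutes + ":" + (secs < 10 ? "0" : "") + secs;
}
```

Each segment can then be rendered as a line such as `formatTimestamp(segment.start) + " " + segment.text`, or indexed by its start time so that a search hit jumps to the right point in the recording.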
Run a fast, lightweight model at the edge before routing to an expensive cloud model. Workers AI acts as a cost gate: simple queries are resolved at the edge, and complex ones are escalated to GPT-4o or Gemini 1.5 Pro. Depending on the share of queries that can be resolved at the edge, this alone can cut inference costs by 40–60%.
- Fast pre-classification — Tag query complexity at the edge, route accordingly
- Intent detection — Identify user intent before invoking a full RAG pipeline
- Cost-gating — Resolve simple queries at edge cost; escalate complex ones
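The savings from cost-gating depend on how many queries the edge can resolve and on the price gap between tiers. The sketch below works through that arithmetic under a simple assumed model: every query pays for edge classification, resolved queries pay the edge price, and escalated queries pay the cloud price. All names and figures are illustrative, not measured costs.

```typescript
interface CostInputs {
  edgeFraction: number; // share of queries fully resolved at the edge, 0..1
  edgeCost: number;     // cost of answering one query at the edge
  cloudCost: number;    // cost of answering one query with the cloud model
  classifyCost: number; // cost of edge pre-classification, paid by every query
}

// Average cost per query when the edge acts as a gate in front of the cloud model.
function gatedCostPerQuery(c: CostInputs): number {
  if (c.edgeFraction < 0 || c.edgeFraction > 1) {
    throw new RangeError("edgeFraction must be between 0 and 1");
  }
  if (c.cloudCost <= 0) {
    throw new RangeError("cloudCost must be positive");
  }
  return (
    c.classifyCost +
    c.edgeFraction * c.edgeCost +
    (1 - c.edgeFraction) * c.cloudCost
  );
}

// Savings relative to sending every query straight to the cloud model.
// A negative result means gating costs more than it saves.
function savingsFraction(c: CostInputs): number {
  return 1 - gatedCostPerQuery(c) / c.cloudCost;
}

// Illustrative inputs only: half the queries resolved at the edge, edge answers
// at a tenth of the cloud price, classification at a twentieth.
const exampleSavings = savingsFraction({
  edgeFraction: 0.5,
  edgeCost: 0.1,
  cloudCost: 1.0,
  classifyCost: 0.05,
});
```

With these illustrative inputs the saving comes out at roughly 40%. If no queries are resolved at the edge, the result is negative, because the classification step is paid for without any offsetting benefit.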
With 300+ global points of presence, Workers AI gives users in most geographies low-latency inference. Once a region policy is configured, data boundaries are respected automatically: EU users' data is processed in EU data centres, and US data stays in the US.
- Regional routing — Requests routed to the nearest Cloudflare PoP automatically
- GDPR data residency — EU inference stays within EU; configurable per-region
- Zero egress costs — Cloudflare's flat-rate model eliminates data transfer fees
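A per-region residency policy could be expressed as a small lookup, sketched below. It assumes the caller has a two-letter country code for the request (in a Worker this is available as `request.cf.country`). The `POLICY` table, the endpoint URLs, and the function names are hypothetical, and the country list covers EU member states only.

```typescript
type Region = "eu" | "us" | "other";

// EU member states by ISO 3166-1 alpha-2 code.
// Assumption: a real policy may also need EEA countries or other jurisdictions.
const EU_COUNTRIES = new Set<string>([
  "AT", "BE", "BG", "HR", "CY", "CZ", "DK", "EE", "FI",
  "FR", "DE", "GR", "HU", "IE", "IT", "LV", "LT", "LU",
  "MT", "NL", "PL", "PT", "RO", "SK", "SI", "ES", "SE",
]);

// Map a request's country code to the residency region it belongs to.
function residencyRegion(country: string | undefined): Region {
  if (country === undefined) return "other";
  const code = country.toUpperCase();
  if (EU_COUNTRIES.has(code)) return "eu";
  if (code === "US") return "us";
  return "other";
}

// Hypothetical policy: where escalated queries from each region may be sent.
interface RegionPolicy {
  escalationEndpoint: string;
}

const POLICY: Record<Region, RegionPolicy> = {
  eu: { escalationEndpoint: "https://eu.inference.example.com" },
  us: { escalationEndpoint: "https://us.inference.example.com" },
  other: { escalationEndpoint: "https://global.inference.example.com" },
};

// Pick the in-region endpoint for a query that has to leave the edge.
function escalationEndpointFor(country: string | undefined): string {
  return POLICY[residencyRegion(country)].escalationEndpoint;
}
```

Keeping the policy in one table means a change in residency rules touches a single place, and escalation code never hard-codes a destination.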
How Gilligan Tech uses edge inference.
- Edge entry: User requests arrive at the nearest Cloudflare PoP. A Worker handles the request there and calls Workers AI before anything reaches origin servers, so there is no round-trip to a cloud region.
- Fast classification: A lightweight edge model (Llama 3.1 8B or Mistral 7B) classifies the request in under 50ms. Simple, high-confidence queries are resolved here.
- Smart escalation: Complex queries, multi-document lookups, or low-confidence edge results are forwarded to the appropriate cloud model — Gemini, GPT-4o, or Llama on Bedrock.
- Response caching: Cloudflare's edge cache stores common query responses. Repeated queries on the same content return in single-digit milliseconds with zero inference cost.
- Unified logging: Edge inference events are forwarded to the same audit log as cloud inference — giving a complete picture of cost, latency, and resolution tier per query.
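The five steps above can be sketched as a single request handler. This is a minimal sketch under stated assumptions, not the production pipeline: the edge model, cloud model, cache, and logger are injected as hypothetical dependencies (`Deps`), a plain `Map` stands in for the edge cache, and the edge model is assumed to return a confidence score alongside its answer.

```typescript
type Tier = "cache" | "edge" | "cloud";

interface EdgeResult {
  answer: string;
  confidence: number; // assumed to be in the range 0..1
}

interface LogEntry {
  query: string;
  tier: Tier;
  latencyMs: number;
}

// Hypothetical dependencies, injected so each tier can be swapped or mocked.
interface Deps {
  cache: Map<string, string>; // stand-in for an edge cache
  edgeModel: (query: string) => Promise<EdgeResult>;
  cloudModel: (query: string) => Promise<string>;
  log: (entry: LogEntry) => void;
  now: () => number; // clock in milliseconds
}

async function handleQuery(
  query: string,
  deps: Deps,
  minConfidence = 0.8,
): Promise<{ answer: string; tier: Tier }> {
  const started = deps.now();
  // Every exit path logs the tier that resolved the query and its latency.
  const finish = (answer: string, tier: Tier) => {
    deps.log({ query, tier, latencyMs: deps.now() - started });
    return { answer, tier };
  };

  // Response caching: repeated queries skip inference entirely.
  const cached = deps.cache.get(query);
  if (cached !== undefined) return finish(cached, "cache");

  // Fast classification: try the lightweight edge model first.
  const edge = await deps.edgeModel(query);
  if (edge.confidence >= minConfidence) {
    deps.cache.set(query, edge.answer);
    return finish(edge.answer, "edge");
  }

  // Smart escalation: low-confidence results go to the cloud model.
  const answer = await deps.cloudModel(query);
  deps.cache.set(query, answer);
  return finish(answer, "cloud");
}
```

Because every path goes through `finish`, the log records cost tier and latency for cache hits, edge answers, and escalations alike, which is what makes a per-query comparison across tiers possible.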
Workers AI models we deploy.
| Model | Latency | Best for |
|---|---|---|
| Llama 3.1 8B Instruct | <50ms | Edge reasoning, fast Q&A, real-time suggestions |
| Mistral 7B Instruct | <40ms | Lightweight general reasoning, autocomplete, triage |
| Whisper Large v3 | Near-real-time | Audio transcription in-region (EU, US, APAC) |
| BGE Small EN v1.5 | <10ms | Fast edge embeddings for semantic classification |
| BGE M3 | <15ms | Multilingual edge embeddings for global deployments |
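One way edge embeddings could drive semantic classification is nearest-centroid matching with cosine similarity, sketched below. It assumes the query has already been embedded by one of the embedding models above and that a representative vector (centroid) exists for each label; `classifyEmbedding`, the labels, and the threshold are all hypothetical.

```typescript
interface LabelCentroid {
  label: string;
  centroid: number[]; // representative embedding for this label
}

// Cosine similarity between two vectors of equal length.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length || a.length === 0) {
    throw new Error("vectors must have the same non-zero length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Return the best-matching label, or null when no label clears minScore.
// A null result is a natural signal to escalate instead of guessing.
function classifyEmbedding(
  embedding: number[],
  labels: LabelCentroid[],
  minScore = 0.5,
): { label: string | null; score: number } {
  let bestLabel: string | null = null;
  let bestScore = -Infinity;
  for (const candidate of labels) {
    const score = cosineSimilarity(embedding, candidate.centroid);
    if (score > bestScore) {
      bestScore = score;
      bestLabel = candidate.label;
    }
  }
  if (bestLabel === null || bestScore < minScore) {
    return { label: null, score: bestScore };
  }
  return { label: bestLabel, score: bestScore };
}
```

Centroids could be computed offline by averaging the embeddings of a few labelled example queries per class, so the edge only has to embed the incoming query and run a handful of dot products.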