Run Kokoro-82M Locally — Free Open-Source TTS Guide
54 voices, 8 languages, 96× real-time on T4 GPU. The #1 open-source text-to-speech model — runs on CPU, zero API costs.
What Is Kokoro-82M
Kokoro-82M is an open-source text-to-speech model by hexgrad, built on an improved StyleTTS 2 architecture. With 82 million parameters — about 1/100 the size of a typical LLM — it delivers speech quality that beats models 14× larger on the TTS Spaces Arena benchmark.
The model is trained on less than 100 hours of audio data (total training cost: ~$1,000 on an A100), yet achieves state-of-the-art results. It's released under Apache 2.0, meaning you can use it commercially, deploy it in products, or run it entirely offline.
Why Kokoro Matters
1. Lightweight by Design
Most high-quality TTS models are heavy: Fish Speech uses ~500M params, and XTTS v2 pushes past 1B. Kokoro achieves competitive quality at 82M params, a 6–12× reduction. It runs on any laptop CPU (tested on an M1 MacBook Air with 8 GB of RAM), needs under 2 GB of VRAM on GPU, and delivers sub-500 ms latency for short sentences. It also works in the browser via ONNX + Transformers.js, with no server needed.
2. Truly Open (Apache 2.0)
Unlike many "open-source" TTS models with non-commercial licenses (XTTS v2) or usage restrictions, Kokoro is Apache 2.0. Use it in commercial products, deploy behind a paid API, fork and modify model weights, redistribute freely.
Note: Kokoro relies on espeak-ng for phoneme generation, and espeak-ng is GPL-licensed. For most use cases this is a non-issue, but enterprise deployments should consult legal counsel. The ONNX path (Method 2 below) avoids espeak-ng entirely.
3. Multi-Language Support
| Language | Code | Quality | Voices |
|---|---|---|---|
| American English | a | ★★★★★ | 14 |
| British English | b | ★★★★★ | 8 |
| Spanish | e | ★★★★ | 6 |
| French | f | ★★★★ | 6 |
| Italian | i | ★★★★ | 6 |
| Japanese | j | ★★★ | 6 |
| Mandarin Chinese | z | ★★★ | special config |
English is strongest. Non-English, especially Chinese/Japanese, has noticeable accent artifacts.
4. Cost: Kokoro vs ElevenLabs
| | Kokoro (self-hosted) | ElevenLabs (API) |
|---|---|---|
| Monthly cost | $0 (your hardware) | $99/mo (Pro) |
| Per-hour audio | ~$0 (electricity) | ~$11 |
| API calls | Unlimited | 500K chars/mo |
| Voice cloning | No (fixed voices) | Yes |
| Latency | <500ms (local) | 200–800ms (API) |
| Offline use | Yes | No |
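The per-hour figure in the table can be sanity-checked with a little arithmetic. The sketch below takes the plan price and character quota from the table, and assumes an average speaking rate of about 15 characters per second. That rate is an illustrative assumption, not a figure from this guide, and the function name is hypothetical.

```python
def api_cost_per_audio_hour(monthly_price_usd, chars_per_month, chars_per_second=15.0):
    """Estimate the cost of one hour of generated audio on a quota-based plan.

    chars_per_second is an ASSUMED average speaking rate, not a measured value.
    """
    seconds_of_audio = chars_per_month / chars_per_second
    hours_of_audio = seconds_of_audio / 3600
    return monthly_price_usd / hours_of_audio


# Values from the comparison table: $99/month for 500K characters.
cost = api_cost_per_audio_hour(99, 500_000)
print(f"~${cost:.2f} per hour of audio")
```

Under that assumed rate, the quota covers roughly nine hours of audio per month, so the cost lands a little under $11 per hour, in line with the table. A slower or faster speaking rate shifts the estimate proportionally.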
How to Install & Run Kokoro-82M
Method 1: Python pip (Quickest)
# 1. Install dependencies
pip install "kokoro>=0.9.4" soundfile  # quotes stop the shell treating >= as a redirect
# 2. Install espeak-ng (REQUIRED)
# Ubuntu/Debian:
sudo apt-get install espeak-ng
# macOS:
brew install espeak-ng
# Windows: download from github.com/espeak-ng/espeak-ng/releases
# Add to PATH. See FAQ for troubleshooting.
# 3. Generate speech (Python)
from kokoro import KPipeline
import soundfile as sf

pipeline = KPipeline(lang_code='a')  # 'a' = American English
text = "Hello world! Kokoro is running locally on my machine."
generator = pipeline(text, voice='af_heart', speed=1.0)

# Each item is (graphemes, phonemes, audio) for one segment of the input text
for i, (gs, ps, audio) in enumerate(generator):
    sf.write(f'output_{i}.wav', audio, 24000)  # 24 kHz sample rate
    print(f"Generated segment {i}")
Method 2: ONNX (Cross-platform, No espeak-ng)
The ONNX path eliminates the espeak-ng dependency entirely. It also offers smaller model files (q4 quantized: ~305 MB) and runs in the browser via kokoro-js + Transformers.js.
pip install kokoro-onnx soundfile
from kokoro_onnx import Kokoro
import soundfile as sf

# The model and voices files must be downloaded beforehand (see the kokoro-onnx releases)
kokoro = Kokoro("kokoro-v1.0.onnx", "voices-v1.0.bin")
samples, sample_rate = kokoro.create(
    "Life is like a box of chocolates.",
    voice="af_sarah",
    speed=1.0,
)
sf.write("output.wav", samples, sample_rate)
Browser usage:
import { KokoroTTS } from "kokoro-js";

const tts = await KokoroTTS.from_pretrained(
  "onnx-community/Kokoro-82M-ONNX", { dtype: "q4" }
);
const audio = await tts.generate("Hello from your browser!", {
  voice: "af_heart"
});
Method 3: Docker (Kokoro-FastAPI)
The most popular community deployment. Pre-built images for CPU, NVIDIA GPU, AMD ROCm, and Apple Silicon.
# CPU-only (any machine)
docker run -p 8880:8880 ghcr.io/remsky/kokoro-fastapi-cpu:latest
# NVIDIA GPU
docker run --gpus all -p 8880:8880 ghcr.io/remsky/kokoro-fastapi-gpu:latest
# Test the API
curl -X POST http://localhost:8880/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"model": "kokoro", "input": "Kokoro is running via FastAPI.", "voice": "af_heart"}' \
  --output speech.mp3
Features: OpenAI-compatible /v1/audio/speech endpoint, voice mixing, per-word timestamps, phoneme endpoints.
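The same request can be made from Python without extra dependencies. This is a minimal sketch using only the standard library; it mirrors the curl call above (same URL, JSON fields, and port), while the helper names `build_speech_request` and `synthesize_to_file` are hypothetical and not part of Kokoro-FastAPI.

```python
import json
import urllib.request


def build_speech_request(text, voice="af_heart", base_url="http://localhost:8880"):
    """Build a POST request for the /v1/audio/speech endpoint shown above."""
    payload = {"model": "kokoro", "input": text, "voice": voice}
    return urllib.request.Request(
        url=f"{base_url}/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize_to_file(text, out_path="speech.mp3", **kwargs):
    """Send the request and write the returned audio bytes to out_path."""
    request = build_speech_request(text, **kwargs)
    with urllib.request.urlopen(request, timeout=60) as response:
        audio_bytes = response.read()
    with open(out_path, "wb") as f:
        f.write(audio_bytes)
    return out_path


# Usage (requires the container to be running locally):
#   synthesize_to_file("Kokoro is running via FastAPI.")
```

Keeping request construction separate from the network call makes the payload easy to inspect or unit-test without a running server.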
Method 4: Open WebUI Integration
# Deploy Kokoro Web (one command)
docker run -p 3000:3000 ghcr.io/remsky/kokoro-web:latest
In Open WebUI → Admin Settings → Audio → TTS Settings:
- TTS Engine: OpenAI
- API Base URL: http://localhost:3000/api/v1
- API Key: any-string
- TTS Model: kokoro
- TTS Voice: af_heart
54 Voices Overview
Voice naming: af_* = American Female (14 voices), am_* = American Male (8), bf_* = British Female (8), bm_* = British Male (4). Plus Spanish, French, Hindi, Italian, Japanese, Portuguese, Mandarin.
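The naming convention can be decoded mechanically. The sketch below maps the first prefix letter to a language using only the codes listed in the language table of this guide, and the second letter to a gender. It assumes non-English voices follow the same two-letter pattern; codes not listed in this guide are reported as "unknown" instead of being guessed. The function name `describe_voice` is hypothetical.

```python
# Language codes as listed in this guide's language table.
LANGUAGES = {
    "a": "American English",
    "b": "British English",
    "e": "Spanish",
    "f": "French",
    "i": "Italian",
    "j": "Japanese",
    "z": "Mandarin Chinese",
}
GENDERS = {"f": "Female", "m": "Male"}


def describe_voice(voice_id):
    """Split an ID such as 'af_heart' into (language, gender, name)."""
    prefix, sep, name = voice_id.partition("_")
    if not sep or len(prefix) != 2 or not name:
        raise ValueError(f"unexpected voice ID format: {voice_id!r}")
    language = LANGUAGES.get(prefix[0], "unknown")  # first letter: language
    gender = GENDERS.get(prefix[1], "unknown")      # second letter: gender
    return language, gender, name


print(describe_voice("af_heart"))
print(describe_voice("bm_george"))
```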
Voice Blending
Create custom voices by blending existing ones with weighted combinations:
# Blend 50% bella + 50% sarah (a comma-separated voice string averages the listed voices)
generator = pipeline(text, voice='af_bella,af_sarah')
Popular community blends:
- af_bella × 0.3 + af_sarah × 0.7: Warm, professional narration
- af_heart × 0.5 + am_adam × 0.5: Gender-neutral podcast voice
- af_nicole × 0.4 + am_michael × 0.6: Energetic tutorial voice
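Unequal blends like the ones listed above amount to a weighted average of voice embeddings. The sketch below shows only that arithmetic. It assumes each voice is a tensor-like object that supports scalar multiplication and addition (a PyTorch tensor or NumPy array would qualify), and `blend_voices` is a hypothetical helper, not part of the Kokoro package. Whether the resulting tensor can be passed directly as the `voice` argument should be checked against the installed Kokoro version.

```python
def blend_voices(packs, weights):
    """Return the weighted average of voice embeddings.

    packs:   tensor-like objects of identical shape (assumed).
    weights: non-negative numbers; they are normalized to sum to 1.
    """
    if not packs or len(packs) != len(weights):
        raise ValueError("packs and weights must be non-empty and the same length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    blended = packs[0] * (weights[0] / total)
    for pack, weight in zip(packs[1:], weights[1:]):
        blended = blended + pack * (weight / total)
    return blended


# Plain floats stand in for embeddings here to show the arithmetic.
print(blend_voices([1.0, 3.0], [0.3, 0.7]))
```

Normalizing the weights means `[3, 7]` and `[0.3, 0.7]` give the same blend, which avoids accidentally scaling the embedding.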
Production Deployment
Docker Compose (CPU)
# docker-compose.yml
services:
  kokoro:
    image: ghcr.io/remsky/kokoro-fastapi-cpu:latest
    ports:
      - "8880:8880"
    environment:
      - DEFAULT_VOICE=af_heart
      - DEFAULT_LANG=a
    restart: unless-stopped
HF 429 Fix — Local Model Caching
If you deploy to the cloud (GCP, AWS Lambda), you may hit Hugging Face rate limits (HTTP 429). Cache the model files locally:
# Pre-download model files into the Hugging Face cache (do once)
pip install huggingface_hub
export HF_HOME=/path/to/cached/models
python -c "
from huggingface_hub import snapshot_download
snapshot_download('hexgrad/Kokoro-82M')
"
# Set environment variables in deployment (same HF_HOME as above):
export HF_HOME=/path/to/cached/models
export HF_HUB_OFFLINE=1 # After caching, go fully offline
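When the environment cannot be set from the shell (for example inside a serverless handler), the same two variables can be set from Python. The sketch below enables offline mode only if the cache directory actually contains files, so a missing cache does not lock the process out of downloads. `configure_offline` is a hypothetical helper, and it is assumed to run before `huggingface_hub` or `kokoro` is imported, since these variables are typically read at import time.

```python
import os
from pathlib import Path


def configure_offline(cache_dir, env=os.environ):
    """Point Hugging Face libraries at cache_dir and go offline if it is populated.

    Returns True when offline mode was enabled, False when the cache looks empty.
    """
    cache_path = Path(cache_dir)
    has_files = cache_path.is_dir() and any(cache_path.iterdir())
    if not has_files:
        return False  # leave the environment untouched so downloads still work
    env["HF_HOME"] = str(cache_path)
    env["HF_HUB_OFFLINE"] = "1"
    return True


# Usage, at the very top of the entry point:
#   configure_offline("/path/to/cached/models")
```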
FAQ
espeak-ng won't install — what do I do?
Windows: Download from espeak-ng releases, install to C:\Program Files\eSpeak NG\, add to System PATH, restart terminal. If still failing — use kokoro-onnx (Method 2) which doesn't need espeak-ng.
macOS M1/M2/M3: brew install espeak-ng then export PYTORCH_ENABLE_MPS_FALLBACK=1
Linux: sudo apt-get install espeak-ng (Debian/Ubuntu) or sudo dnf install espeak-ng (Fedora)
Can Kokoro do voice cloning?
No. Kokoro ships with a fixed set of 54 voices and does not support zero-shot voice cloning. If cloning is essential, consider these complementary tools:
- Chatterbox — MIT license, 63.75% blind test wins vs ElevenLabs
- Fish Speech S2 Pro — Multi-language voice cloning
How do I handle long text / audiobook generation?
Kokoro auto-chunks, but you can manually split for better control:
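import numpy as np

def chunk_text(text, max_chars=400):
    """Group sentences into chunks of roughly max_chars characters or fewer."""
    sentences = [s.strip() for s in text.replace('\n', ' ').split('. ') if s.strip()]
    chunks, current = [], ""
    for s in sentences:
        # Restore the period removed by split(), unless the sentence already ends one
        s = s if s[-1] in '.!?' else s + '.'
        if current and len(current) + len(s) + 1 > max_chars:
            chunks.append(current)
            current = s
        else:
            current = f"{current} {s}".strip()
    if current:
        chunks.append(current)
    return chunks

for i, chunk in enumerate(chunk_text(long_text)):
    # A chunk may yield several audio segments; join them so none is overwritten
    segments = [np.asarray(audio) for _, _, audio in pipeline(chunk, voice='af_heart')]
    sf.write(f'chapter_{i}.wav', np.concatenate(segments), 24000)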
I keep getting "429 Too Many Requests" on HuggingFace
Three-step fix:
- Get a HF token: huggingface-cli login
- Cache models locally (see Production section above)
- Set HF_HUB_OFFLINE=1 to go fully offline
What GPU do I need to run Kokoro?
None required. Kokoro runs on CPU. For GPU acceleration, any NVIDIA GPU with 2+ GB VRAM works. T4 achieves 96× real-time speed. Even a laptop CPU produces speech faster than real-time for short sentences.
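"96× real-time" means 96 seconds of audio are produced per second of compute. To check what a given machine achieves, the real-time factor can be measured directly. The sketch below is hardware-agnostic: `measure` accepts any callable that takes text and returns a sequence of audio samples, which is an assumed interface chosen for illustration, and both function names are hypothetical. The 24000 default matches the 24 kHz sample rate used in this guide.

```python
import time


def real_time_factor(num_samples, sample_rate, elapsed_seconds):
    """Seconds of audio produced per second of compute (higher is faster)."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return (num_samples / sample_rate) / elapsed_seconds


def measure(synthesize, text, sample_rate=24000):
    """Time a synthesize(text) callable that returns a sequence of audio samples."""
    start = time.perf_counter()
    samples = synthesize(text)
    elapsed = time.perf_counter() - start
    return real_time_factor(len(samples), sample_rate, elapsed)


# 96 seconds of 24 kHz audio generated in 1 second of compute:
print(real_time_factor(24000 * 96, 24000, 1.0))
```

A value above 1.0 means speech is generated faster than it plays back. Short inputs tend to understate throughput because fixed startup costs dominate, so it helps to measure with text of realistic length.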
Kokoro vs ElevenLabs — which should I choose?
Choose Kokoro if: you want zero API costs, need offline/local TTS, are building a free product, or need unlimited generation. Choose ElevenLabs if: you need voice cloning, the absolute highest quality, or don't want to manage infrastructure. The quality gap is real but shrinking — Kokoro is #1 on TTS Arena for a reason.