♫ Apache 2.0 · TTS Arena #1

Run Kokoro-82M Locally — Free Open-Source TTS Guide

54 voices, 8 languages, 96× real-time on T4 GPU. The #1 open-source text-to-speech model — runs on CPU, zero API costs.

What Is Kokoro-82M

Kokoro-82M is an open-source text-to-speech model by hexgrad, built on an improved StyleTTS 2 architecture. With 82 million parameters — about 1/100 the size of a typical LLM — it delivers speech quality that beats models 14× larger on the TTS Spaces Arena benchmark.

The model is trained on less than 100 hours of audio data (total training cost: ~$1,000 on an A100), yet achieves state-of-the-art results. It's released under Apache 2.0, meaning you can use it commercially, deploy it in products, or run it entirely offline.

Key numbers: 13.4M downloads on HuggingFace · 6.3k likes · 54 built-in voices · Under 2 GB VRAM · 96× real-time on T4 GPU · Runs on CPU

Why Kokoro Matters

1. Lightweight by Design

Most high-quality TTS models are heavy: Fish Speech uses ~500M params, and XTTS v2 pushes past 1B. Kokoro achieves competitive quality at 82M params, a 6–12× reduction. It runs on any laptop CPU (tested on an M1 MacBook Air with 8 GB RAM), needs under 2 GB of VRAM on GPU, and delivers sub-500ms latency for short sentences. It also works in the browser via ONNX + Transformers.js, with no server needed.

2. Truly Open (Apache 2.0)

Unlike many "open-source" TTS models with non-commercial licenses (XTTS v2) or usage restrictions, Kokoro is Apache 2.0. You can use it in commercial products, deploy it behind a paid API, fork and modify the model weights, and redistribute it freely.

Caveat: Kokoro depends on espeak-ng for phoneme generation, which is GPL licensed. For most use cases this is a non-issue, but enterprise deployments should consult legal. The ONNX path (Method 2 below) avoids a separate espeak-ng install, but check which phonemizer the ONNX packages bundle before treating it as a license workaround.

3. Multi-Language Support

Language	Code	Quality	Voices
American English	`a`	★★★★★	20
British English	`b`	★★★★★	8
Spanish	`e`	★★★★	3
French	`f`	★★★★	1
Italian	`i`	★★★★	2
Japanese	`j`	★★★	5
Mandarin Chinese	`z`	★★★	8 (special config)

English is strongest. Non-English output, especially Chinese and Japanese, has noticeable accent artifacts. Hindi (`h`) and Portuguese (`p`) are also supported but are not rated in this table.

4. Cost: Kokoro vs ElevenLabs

	Kokoro (self-hosted)	ElevenLabs (API)
Monthly cost	$0 (your hardware)	$99/mo (Pro)
Per-hour audio	~$0 (electricity)	~$11
API calls	Unlimited	500K chars/mo
Voice cloning	No (fixed voices)	Yes
Latency	<500ms (local)	200–800ms (API)
Offline use	Yes	No
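
The per-hour figure in the table can be reproduced with simple arithmetic. The sketch below takes the plan price and character quota from the table and assumes a speaking rate of about 15 characters per second. That rate is an illustrative assumption, not a vendor number, and the function name is hypothetical.

```python
# Rough cost per hour of generated audio on a character-metered TTS plan.
# ASSUMPTION: ~15 characters per second of speech (illustrative only).

def cost_per_audio_hour(monthly_price, monthly_chars, chars_per_second=15.0):
    """Return the effective price of one hour of audio on a character quota."""
    chars_per_hour = chars_per_second * 3600
    hours_per_month = monthly_chars / chars_per_hour
    return monthly_price / hours_per_month

hourly = cost_per_audio_hour(monthly_price=99, monthly_chars=500_000)
print(f"~${hourly:.2f} per hour of audio")  # ~$10.69 per hour of audio
```

Under that assumption the quota covers roughly 9 hours of speech per month, which is where the ~$11 per hour comes from. A faster or slower speaking rate shifts the result proportionally.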

How to Install & Run Kokoro-82M

Method 1: Python pip (Quickest)

# 1. Install dependencies
pip install "kokoro>=0.9.4" soundfile  # quotes stop the shell from treating >= as a redirect

# 2. Install espeak-ng (REQUIRED)
# Ubuntu/Debian:
sudo apt-get install espeak-ng

# macOS:
brew install espeak-ng

# Windows: download from github.com/espeak-ng/espeak-ng/releases
# Add to PATH. See FAQ for troubleshooting.

# 3. Generate speech (Python)
from kokoro import KPipeline
import soundfile as sf

pipeline = KPipeline(lang_code='a')  # 'a' = American English

text = "Hello world! Kokoro is running locally on my machine."
generator = pipeline(text, voice='af_heart', speed=1.0)

for i, (gs, ps, audio) in enumerate(generator):
    sf.write(f'output_{i}.wav', audio, 24000)  # 24kHz sample rate
    print(f"Generated segment {i}")

Kokoro automatically splits long text into chunks. For manual chunking, see FAQ.

Method 2: ONNX (Cross-platform, No espeak-ng)

The ONNX path removes the need to install espeak-ng separately, offers smaller model files (q4 quantized: ~305 MB), and runs in the browser via kokoro-js + Transformers.js.

pip install kokoro-onnx soundfile

import soundfile as sf
from kokoro_onnx import Kokoro

# Both files must be downloaded beforehand (see the kokoro-onnx README)
kokoro = Kokoro("kokoro-v1.0.onnx", "voices-v1.0.bin")
samples, sample_rate = kokoro.create(
    "Life is like a box of chocolates.",
    voice="af_sarah",
    speed=1.0
)

sf.write("output.wav", samples, sample_rate)

Browser usage:

import { KokoroTTS } from "kokoro-js";
const tts = await KokoroTTS.from_pretrained(
    "onnx-community/Kokoro-82M-ONNX", { dtype: "q4" }
);
const audio = await tts.generate("Hello from your browser!", {
    voice: "af_heart"
});

Method 3: Docker (Kokoro-FastAPI)

The most popular community deployment. Pre-built images for CPU, NVIDIA GPU, AMD ROCm, and Apple Silicon.

# CPU-only (any machine)
docker run -p 8880:8880 ghcr.io/remsky/kokoro-fastapi-cpu:latest

# NVIDIA GPU
docker run --gpus all -p 8880:8880 ghcr.io/remsky/kokoro-fastapi-gpu:latest

# Test the API
curl -X POST http://localhost:8880/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"model": "kokoro", "input": "Kokoro is running via FastAPI.", "voice": "af_heart"}' \
  --output speech.mp3

Features: OpenAI-compatible /v1/audio/speech endpoint, voice mixing, per-word timestamps, phoneme endpoints.
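
The same request can be sent from Python. This is a minimal sketch using only the standard library; it mirrors the curl call above (same URL, JSON fields, and default voice), and the helper names `build_speech_request` and `synthesize` are hypothetical.

```python
import json
import urllib.request

def build_speech_request(text, voice="af_heart", base_url="http://localhost:8880"):
    """Build a POST request for the OpenAI-compatible speech endpoint."""
    payload = {"model": "kokoro", "input": text, "voice": voice}
    return urllib.request.Request(
        url=f"{base_url}/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def synthesize(text, out_path="speech.mp3", **kwargs):
    """Send the request and write the returned audio bytes to out_path."""
    request = build_speech_request(text, **kwargs)
    with urllib.request.urlopen(request, timeout=60) as response:
        audio_bytes = response.read()
    with open(out_path, "wb") as f:
        f.write(audio_bytes)
    return out_path

# Usage (requires the Kokoro-FastAPI container to be running):
# synthesize("Kokoro is running via FastAPI.")
```

Keeping request construction separate from the network call makes the payload easy to inspect or unit-test without a running server.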

Method 4: Open WebUI Integration

# Deploy Kokoro Web (one command)
docker run -p 3000:3000 ghcr.io/eduardolat/kokoro-web:latest

In Open WebUI → Admin Settings → Audio → TTS Settings:

TTS Engine: OpenAI
API Base URL: http://localhost:3000/api/v1
API Key: any-string
TTS Model: kokoro
TTS Voice: af_heart

54 Voices Overview

Voice naming: the first letter is the language and the second is the gender, so af_* = American Female (11 voices), am_* = American Male (9), bf_* = British Female (4), bm_* = British Male (4). The remaining voices cover Spanish, French, Hindi, Italian, Japanese, Portuguese, and Mandarin.
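
The naming scheme can be decoded mechanically. The sketch below is a hypothetical helper, not part of Kokoro: it maps the two-letter prefix to a language and gender using the codes from the language table, and it assumes `h` for Hindi and `p` for Portuguese, which that table does not list.

```python
# Language codes as used in voice prefixes.
# ASSUMPTION: 'h' (Hindi) and 'p' (Portuguese) follow the same pattern.
LANGUAGES = {
    "a": "American English",
    "b": "British English",
    "e": "Spanish",
    "f": "French",
    "h": "Hindi",
    "i": "Italian",
    "j": "Japanese",
    "p": "Portuguese",
    "z": "Mandarin Chinese",
}
GENDERS = {"f": "female", "m": "male"}

def describe_voice(voice_id):
    """Split an id such as 'af_heart' into (language, gender, name)."""
    prefix, _, name = voice_id.partition("_")
    valid = (
        len(prefix) == 2
        and name
        and prefix[0] in LANGUAGES
        and prefix[1] in GENDERS
    )
    if not valid:
        raise ValueError(f"unrecognized voice id: {voice_id!r}")
    return LANGUAGES[prefix[0]], GENDERS[prefix[1]], name

print(describe_voice("af_heart"))  # ('American English', 'female', 'heart')
```

The first letter of the prefix is also the value to pass as `lang_code` when creating the pipeline for that voice.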

Voice Blending

Create custom voices by blending existing ones with weighted combinations:

# Blend bella and sarah equally: a comma-separated voice string averages the listed voices
generator = pipeline(text, voice='af_bella,af_sarah')

Popular community blends:

af_bella × 0.3 + af_sarah × 0.7 — Warm, professional narration
af_heart × 0.5 + am_adam × 0.5 — Gender-neutral podcast voice
af_nicole × 0.4 + am_michael × 0.6 — Energetic tutorial voice
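
Unequal blends like the ones listed above amount to a weighted average of the voice embeddings. The helper below is a minimal sketch of that arithmetic; `blend_voices` is a hypothetical name, and it works on any objects that support `*` and `+` (plain floats here, tensors in practice). The commented usage assumes that `pipeline.load_voice(name)` returns a tensor and that the pipeline accepts a tensor as `voice`, so verify both against your installed kokoro version.

```python
def blend_voices(packs, weights):
    """Weighted average of voice packs; weights are normalized to sum to 1."""
    if not packs or len(packs) != len(weights):
        raise ValueError("need at least one pack and one weight per pack")
    total = float(sum(weights))
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    blended = packs[0] * (weights[0] / total)
    for pack, weight in zip(packs[1:], weights[1:]):
        blended = blended + pack * (weight / total)
    return blended

# Stand-in numbers so the sketch runs without the model installed.
print(blend_voices([1.0, 3.0], [0.3, 0.7]))

# With Kokoro (ASSUMPTION: load_voice returns a tensor, voice accepts a tensor):
# packs = [pipeline.load_voice("af_bella"), pipeline.load_voice("af_sarah")]
# generator = pipeline(text, voice=blend_voices(packs, [0.3, 0.7]))
```

Normalizing the weights means ratios such as 3:7 and 0.3:0.7 produce the same voice.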

Production Deployment

Architecture:

Client → Kokoro-FastAPI (Docker, Port 8880, OpenAI-compatible) → Kokoro 82M

Docker Compose (CPU)

# docker-compose.yml
services:
  kokoro:
    image: ghcr.io/remsky/kokoro-fastapi-cpu:latest
    ports:
      - "8880:8880"
    environment:
      - DEFAULT_VOICE=af_heart
      - DEFAULT_LANG=a
    restart: unless-stopped

HF 429 Fix — Local Model Caching

If you deploy in the cloud (GCP, AWS Lambda), you may hit HuggingFace rate limits. Cache the model files locally:

# Pre-download model files into the HF cache (do once, with network access)
pip install huggingface_hub
export HF_HOME=/path/to/cached/models
python -c "
from huggingface_hub import snapshot_download
snapshot_download('hexgrad/Kokoro-82M')
"

# Set the same environment variables in the deployment:
export HF_HOME=/path/to/cached/models
export HF_HUB_OFFLINE=1  # After caching, go fully offline
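
If setting environment variables in the deployment is awkward, the same switch can be made from Python at startup. This is a minimal sketch with a hypothetical helper name; it assumes the cache directory was populated beforehand and that it runs before `huggingface_hub` or `kokoro` is imported, since these settings are typically read at import time.

```python
import os
from pathlib import Path

def enable_offline_mode(cache_dir):
    """Point Hugging Face libraries at a local cache and disable network access."""
    cache = Path(cache_dir).expanduser().resolve()
    # Fail early with a clear message instead of a confusing download error later.
    if not cache.is_dir() or not any(cache.iterdir()):
        raise FileNotFoundError(f"no cached model files found in {cache}")
    os.environ["HF_HOME"] = str(cache)
    os.environ["HF_HUB_OFFLINE"] = "1"
    return cache

# Usage, at the very top of the entry point:
# enable_offline_mode("/path/to/cached/models")
# from kokoro import KPipeline
```

Checking that the directory exists and is non-empty turns a missing cache into an immediate startup failure instead of a failed request later.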

FAQ

espeak-ng won't install — what do I do?

Windows: Download from espeak-ng releases, install to C:\Program Files\eSpeak NG\, add to System PATH, restart terminal. If still failing — use kokoro-onnx (Method 2) which doesn't need espeak-ng.

macOS M1/M2/M3: brew install espeak-ng then export PYTORCH_ENABLE_MPS_FALLBACK=1

Linux: sudo apt-get install espeak-ng (Debian/Ubuntu) or sudo dnf install espeak-ng (Fedora)

Can Kokoro do voice cloning?

No. Kokoro has a fixed set of 54 voices and does not support zero-shot voice cloning. If cloning is essential, consider these complementary tools:

Chatterbox — MIT license, 63.75% blind test wins vs ElevenLabs
Fish Speech S2 Pro — Multi-language voice cloning

How do I handle long text / audiobook generation?

Kokoro auto-chunks, but you can manually split for better control:

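import re
import numpy as np
import soundfile as sf
from kokoro import KPipeline

pipeline = KPipeline(lang_code='a')

def chunk_text(text, max_chars=400):
    # Split after sentence-ending punctuation so the punctuation is kept
    sentences = re.split(r'(?<=[.!?])\s+', text.replace('\n', ' ').strip())
    chunks, current = [], ""
    for s in sentences:
        if current and len(current) + len(s) + 1 > max_chars:
            chunks.append(current)
            current = s
        else:
            current = f"{current} {s}".strip()
    if current:
        chunks.append(current)
    return chunks  # a single sentence longer than max_chars stays in one chunk

# 'book.txt' is a placeholder for your own source text
with open('book.txt', encoding='utf-8') as f:
    long_text = f.read()

for i, chunk in enumerate(chunk_text(long_text)):
    # A chunk can still yield several segments; join them so none is overwritten
    segments = [np.asarray(audio) for _, _, audio in pipeline(chunk, voice='af_heart')]
    if segments:
        sf.write(f'chapter_{i}.wav', np.concatenate(segments), 24000)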
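
For an audiobook you will usually want one file, not many. The sketch below joins the per-chunk WAV files using only the standard library. It assumes the files are PCM WAV with identical channel count, sample width, and sample rate, and it holds all audio in memory, which is fine for a few hours of 24 kHz mono. The function name `merge_wav_files` is hypothetical.

```python
import wave

def merge_wav_files(input_paths, output_path):
    """Concatenate WAV files that share the same format into one file."""
    if not input_paths:
        raise ValueError("no input files given")
    fmt, frames = None, []
    for path in input_paths:
        with wave.open(str(path), "rb") as part:
            part_fmt = (part.getnchannels(), part.getsampwidth(), part.getframerate())
            if fmt is None:
                fmt = part_fmt
            elif part_fmt != fmt:
                raise ValueError(f"{path} has a different audio format")
            frames.append(part.readframes(part.getnframes()))
    with wave.open(str(output_path), "wb") as out:
        out.setnchannels(fmt[0])
        out.setsampwidth(fmt[1])
        out.setframerate(fmt[2])
        out.writeframes(b"".join(frames))
    return output_path

# Usage: sort numerically so chapter_10 comes after chapter_2.
# import glob, re
# paths = sorted(glob.glob("chapter_*.wav"),
#                key=lambda p: int(re.search(r"\d+", p).group()))
# merge_wav_files(paths, "audiobook.wav")
```

Rejecting mismatched formats avoids silently producing a file that plays at the wrong speed.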

I keep getting "429 Too Many Requests" on HuggingFace

Three-step fix:

1. Authenticate with a HF token: huggingface-cli login
2. Cache models locally (see Production section above)
3. Set HF_HUB_OFFLINE=1 to go fully offline

What GPU do I need to run Kokoro?

None required. Kokoro runs on CPU. For GPU acceleration, any NVIDIA GPU with 2+ GB VRAM works. T4 achieves 96× real-time speed. Even a laptop CPU produces speech faster than real-time for short sentences.
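
To see what "96× real-time" means in practice, or to measure your own hardware, a small helper is enough. This is a sketch with hypothetical function names; `measure_rtf` assumes you pass a callable that returns the generated samples and that the output is 24 kHz, as in the examples in this guide.

```python
import time

def real_time_factor(audio_seconds, wall_seconds):
    """Seconds of audio produced per second of compute (higher is faster)."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return audio_seconds / wall_seconds

def synthesis_time(audio_seconds, rtf):
    """Estimated compute time for a given audio duration at a given speed."""
    return audio_seconds / rtf

def measure_rtf(generate, sample_rate=24000):
    """Time a callable that returns a sequence of audio samples."""
    start = time.perf_counter()
    samples = generate()
    elapsed = max(time.perf_counter() - start, 1e-9)  # avoid division by zero
    return real_time_factor(len(samples) / sample_rate, elapsed)

# One hour of audio at 96x real-time:
print(synthesis_time(3600, 96))  # 37.5
```

At the quoted T4 speed, an hour of audio takes under 40 seconds of compute. Any value above 1 means generation is faster than playback.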

Kokoro vs ElevenLabs — which should I choose?

Choose Kokoro if: you want zero API costs, need offline/local TTS, are building a free product, or need unlimited generation. Choose ElevenLabs if: you need voice cloning, the absolute highest quality, or don't want to manage infrastructure. The quality gap is real but shrinking — Kokoro is #1 on TTS Arena for a reason.