kokoro
give your agent a voice.
on-device text-to-speech for Swift. 54 voices. CoreML on Apple Silicon. no cloud. no MLX.
Why
Fast, not waiting
6-16x faster than real-time on Apple Silicon. CoreML handles inference while your agent keeps thinking.
Local, not calling home
no API keys. no latency. no usage fees. runs entirely on-device. your users' words stay on their hardware.
Native, not ported
pure Swift, CoreML under the hood. not a Python wrapper. not MLX. just a package you add and call.
powered by
CoreML on Apple Silicon
CoreML picks the fastest path on your hardware -- Neural Engine, GPU, or both. that's how you get 112ms per chunk while the rest of the system stays free for your app.
no MLX. no Python runtime. no Metal shaders. just CoreML doing what it does best. you get native performance without managing any of it.
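what that looks like in raw CoreML, if you were wiring it yourself -- a sketch only, since kokoro configures this internally; the model filename here is a placeholder:

```swift
import Foundation
import CoreML

// illustrative only: tell CoreML to pick the fastest compute path.
// kokoro does this for you; "Kokoro.mlmodelc" is a placeholder path.
let config = MLModelConfiguration()
config.computeUnits = .all  // Neural Engine, GPU, or CPU -- CoreML decides

let model = try MLModel(
    contentsOf: URL(fileURLWithPath: "Kokoro.mlmodelc"),
    configuration: config
)
```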
54 voices that don't sound like robots
american and british english, plus accents from around the world. male and female. each with its own character.
the command line
speech from your terminal. pipe it, script it, alias it.
$ kokoro say "your deploy is live"
$ kokoro say -v am_adam -s 1.3 "moving fast"
$ kokoro say --stream "start talking before I finish thinking"
$ echo "long document contents" | kokoro say --stream
$ kokoro say -o briefing.wav "save this for later"
$ kokoro say --list-voices
af_alloy af_aoede af_bella af_heart af_jessica ...
$ kokoro daemon start
# models stay warm — subsequent calls are 3x faster
three lines to speech
let engine = try KokoroEngine()
for await event in try engine.speak("hello from the other side", voice: "af_heart") {
    player.play(event)
}
async streaming. audio chunks arrive as they're synthesized. playback starts immediately.
what's under the hood
streaming synthesis
text goes in, audio chunks stream out via AsyncStream. playback starts before the full text is processed.
automatic chunking
pass any length text. kokoro splits it at sentence and clause boundaries, merges short fragments, handles everything.
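a minimal sketch of the idea, assuming a naive sentence splitter and an illustrative merge threshold -- the real chunker also handles clause boundaries, abbreviations, and more:

```swift
import Foundation

// sketch: split on sentence-ending punctuation, then merge fragments
// shorter than `minLength` into the next chunk. illustrative only.
func chunk(_ text: String, minLength: Int = 20) -> [String] {
    var sentences: [String] = []
    var current = ""
    for ch in text {
        current.append(ch)
        if ".!?".contains(ch) {
            sentences.append(current.trimmingCharacters(in: .whitespaces))
            current = ""
        }
    }
    let tail = current.trimmingCharacters(in: .whitespaces)
    if !tail.isEmpty { sentences.append(tail) }

    // merge short fragments so the model never gets a tiny chunk
    var merged: [String] = []
    for s in sentences {
        if let last = merged.last, last.count < minLength {
            merged[merged.count - 1] = last + " " + s
        } else {
            merged.append(s)
        }
    }
    return merged
}
```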
speed control
0.5x to 2.0x. slow it down for clarity, speed it up for notifications. one parameter.
IPA input
need precise pronunciation? skip the G2P pipeline entirely. pass IPA phonemes directly.
daemon mode
keep models loaded in memory. 3x faster for repeat synthesis. Unix socket IPC. start it, forget it.
smart G2P
lexicon lookup, CamelCase splitting, neural BART fallback, number expansion. technical terms just work.
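one of those steps, sketched -- CamelCase splitting, so "KokoroEngine" reads as two words. this is an illustration of the idea, not the actual pipeline code (which also does lexicon lookup, number expansion, and the BART fallback):

```swift
// sketch: insert a space before each interior uppercase letter.
// a real splitter would also handle acronyms like "TTSEngine".
func splitCamelCase(_ word: String) -> String {
    var out = ""
    for (i, ch) in word.enumerated() {
        if ch.isUppercase && i > 0 { out.append(" ") }
        out.append(ch)
    }
    return out
}
```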
built for agents
your AI assistant shouldn't have to phone a server to say something out loud.
conversational agents
stream responses as speech in real-time. the voice keeps pace with the LLM output.
accessibility
read any content aloud with natural-sounding voices. no internet required.
CLI tools
kokoro say "deploy complete" — pipe text in, get audio out. works from any script.
get started
macOS 15+ or iOS 18+. Apple Silicon.
brew install jud/kokoro-coreml/kokoro
.package(url: "https://github.com/jud/kokoro-coreml.git", from: "0.3.0")
models (~640MB) download automatically on first run.
the API, briefly
synthesize
let result = try engine.synthesize(
    text: "the quick brown fox",
    voice: "af_heart",
    speed: 1.2
)
// result.samples → [Float] at 24kHz
// result.realTimeFactor → 12.4x
stream
for await event in try engine.speak(
    "long text...",
    voice: "am_adam"
) {
    switch event {
    case .audio(let buf): player.schedule(buf)
    case .chunkFailed(let e): print(e)
    }
}
IPA
let result = try engine.synthesize(
    ipa: "hˈɛloʊ wˈɜːld",
    voice: "bf_emma"
)
daemon
# keep models warm for fast repeat synthesis
kokoro daemon start
# subsequent calls are 3x faster
kokoro say "instant response"