kokoro
give your agent a voice.
on-device text-to-speech for Swift. 54 voices. CoreML on Apple Silicon. no cloud. no MLX.
Why
Fast, not waiting
6-16x faster than real-time on Apple Silicon. CoreML handles inference while your agent keeps thinking.
Local, not calling home
no API keys. no latency. no usage fees. runs entirely on-device. your users' words stay on their hardware.
Native, not ported
pure Swift, CoreML under the hood. not a Python wrapper. not MLX. just a package you add and call.
powered by
CoreML on Apple Silicon
CoreML picks the fastest path on your hardware -- Neural Engine, GPU, or both. that's how you get 112ms per chunk while the rest of the system stays free for your app.
no MLX. no Python runtime. no Metal shaders. just CoreML doing what it does best. you get native performance without managing any of it.
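what that looks like in raw CoreML, if you were wiring it yourself -- a sketch only, since kokoro configures this internally; the model filename here is a placeholder:

```swift
import Foundation
import CoreML

// illustrative only: tell CoreML to pick the fastest compute path.
// kokoro does this for you; "Kokoro.mlmodelc" is a placeholder path.
let config = MLModelConfiguration()
config.computeUnits = .all  // Neural Engine, GPU, or CPU -- CoreML decides

let model = try MLModel(
    contentsOf: URL(fileURLWithPath: "Kokoro.mlmodelc"),
    configuration: config
)
```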
54 voices that don't sound like robots
american and british english, plus accents from around the world. male and female. each with its own character.
the command line
speech from your terminal. pipe it, script it, alias it.
$ kokoro say "your deploy is live"
$ kokoro say -v am_adam -s 1.3 "moving fast"
$ kokoro say --stream "start talking before I finish thinking"
$ echo "long document contents" | kokoro say --stream
$ kokoro say -o briefing.wav "save this for later"
$ kokoro say --list-voices
af_alloy af_aoede af_bella af_heart af_jessica ...
$ kokoro daemon start
# models stay warm — subsequent calls are 3x faster
three lines to speech
let engine = try KokoroEngine()
for await event in try engine.speak("hello from the other side", voice: "af_heart") {
    player.play(event)
}
async streaming. audio chunks arrive as they're synthesized. playback starts immediately.
what's under the hood
streaming synthesis
text goes in, audio chunks stream out via AsyncStream. playback starts before the full text is processed.
automatic chunking
pass any length text. kokoro splits it at sentence and clause boundaries, merges short fragments, handles everything.
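a minimal sketch of the idea, assuming a naive sentence splitter and an illustrative merge threshold -- the real chunker also handles clause boundaries, abbreviations, and more:

```swift
import Foundation

// sketch: split on sentence-ending punctuation, then merge fragments
// shorter than `minLength` into the next chunk. illustrative only.
func chunk(_ text: String, minLength: Int = 20) -> [String] {
    var sentences: [String] = []
    var current = ""
    for ch in text {
        current.append(ch)
        if ".!?".contains(ch) {
            sentences.append(current.trimmingCharacters(in: .whitespaces))
            current = ""
        }
    }
    let tail = current.trimmingCharacters(in: .whitespaces)
    if !tail.isEmpty { sentences.append(tail) }

    // merge short fragments so the model never gets a tiny chunk
    var merged: [String] = []
    for s in sentences {
        if let last = merged.last, last.count < minLength {
            merged[merged.count - 1] = last + " " + s
        } else {
            merged.append(s)
        }
    }
    return merged
}
```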
speed control
0.5x to 2.0x. slow it down for clarity, speed it up for notifications. one parameter.
IPA input
need precise pronunciation? skip the G2P pipeline entirely. pass IPA phonemes directly.
daemon mode
keep models loaded in memory. 3x faster for repeat synthesis. Unix socket IPC. start it, forget it.
smart G2P
lexicon lookup, CamelCase splitting, neural BART fallback, number expansion. technical terms just work.
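one of those steps, sketched -- CamelCase splitting, so "KokoroEngine" reads as two words. this is an illustration of the idea, not the actual pipeline code (which also does lexicon lookup, number expansion, and the BART fallback):

```swift
// sketch: insert a space before each interior uppercase letter.
// a real splitter would also handle acronyms like "TTSEngine".
func splitCamelCase(_ word: String) -> String {
    var out = ""
    for (i, ch) in word.enumerated() {
        if ch.isUppercase && i > 0 { out.append(" ") }
        out.append(ch)
    }
    return out
}
```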
built for agents
your AI assistant shouldn't have to phone a server to say something out loud.
conversational agents
stream responses as speech in real-time. the voice keeps pace with the LLM output.
accessibility
read any content aloud with natural-sounding voices. no internet required.
CLI tools
kokoro say "deploy complete" — pipe text in, get audio out. works from any script.
get started
macOS 15+ or iOS 18+. Apple Silicon.
brew install jud/kokoro-coreml/kokoro
.package(url: "https://github.com/jud/kokoro-coreml.git", from: "0.3.0")
models (~640MB) download automatically on first run.
the API, briefly
synthesize
let result = try engine.synthesize(
    text: "the quick brown fox",
    voice: "af_heart",
    speed: 1.2
)
// result.samples → [Float] at 24kHz
// result.realTimeFactor → 12.4x
stream
for await event in try engine.speak(
    "long text...",
    voice: "am_adam"
) {
    switch event {
    case .audio(let buf): player.schedule(buf)
    case .chunkFailed(let e): print(e)
    }
}
IPA
let result = try engine.synthesize(
    ipa: "hˈɛloʊ wˈɜːld",
    voice: "bf_emma"
)
daemon
# keep models warm for fast repeat synthesis
kokoro daemon start
# subsequent calls are 3x faster
kokoro say "instant response"