EasyVoice

Free text-to-speech powered by open source AI.

© 2026 EasyVoice. Powered by Kokoro-82M (Apache 2.0).

Built with ❤️ and open source AI.

Built by InfoDriven

Dubai, United Arab Emirates · Support@infodriven.ae · infodriven.ae

Low Latency TTS API — 300-600ms First-Byte on Kokoro-82M

Latency is the most-asked-about TTS spec for chatbot, IVR, and voice-agent use cases, because the perceived responsiveness of the whole user experience hinges on how fast the first audio byte reaches the speaker. EasyVoice's TTS API runs Kokoro-82M with a typical warm first-byte latency of 300-600ms, an end-to-end generation time of 1-3 seconds for a typical sentence, and streaming via chunked transfer-encoding so playback can start before generation completes. This page covers the measured numbers, why Kokoro-82M's 82-million-parameter architecture responds faster than larger models, what cold-start looks like, and how to design your client integration to minimize the rest of the round-trip.

5,000 characters per day on the free tier, no credit card. Pro $9.99/mo unlimited. 46 voices, 8 languages.

Part of the Best TTS APIs in 2026 hub — compares EasyVoice, OpenAI tts-1, ElevenLabs, Google Cloud TTS, and Azure Speech.

What 'low latency TTS' actually means

Three different latency numbers get conflated in TTS marketing pages:

  • First-byte latency is the time from your POST request landing at the server to the first byte of audio leaving the server. This is what users perceive as 'how fast does the voice start speaking.'
  • Total generation time is the time from request to the last byte of audio. This matters for batch processing but not for streaming UX.
  • Round-trip time includes network latency from the user's device to the server and back. This varies by geography (US-East users hitting a Frankfurt-hosted endpoint pay an extra 80-120ms).

For real-time conversational UX, first-byte is the number that matters; for batch audiobook generation, total generation time matters more.
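The latency definitions above can be measured from the client with a small helper. This is a minimal sketch, not part of the EasyVoice API: `measure_stream` and its injectable `clock` parameter are hypothetical names introduced here, and the figures it returns are client-observed, so they include network round-trip time on top of the server-side first-byte latency.

```python
import time
from typing import Callable, Iterable, Optional, Tuple


def measure_stream(
    chunks: Iterable[bytes],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[Optional[float], float, int]:
    """Return (first_byte_ms, total_ms, total_bytes) for a stream of audio chunks.

    The clock starts when this function is called, so `chunks` should be a
    lazy iterator that only issues the request once iteration begins.
    """
    start = clock()
    first_byte_ms = None
    total_bytes = 0
    for chunk in chunks:
        if not chunk:
            continue  # ignore empty chunks; they carry no audio
        if first_byte_ms is None:
            first_byte_ms = (clock() - start) * 1000
        total_bytes += len(chunk)
    total_ms = (clock() - start) * 1000
    return first_byte_ms, total_ms, total_bytes
```

Wrapping the request in a generator function that issues the POST and then yields from the response body keeps the request itself inside the timed region. The injectable clock makes the helper easy to check with fixed timestamps.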

EasyVoice's measured warm first-byte latency, sampled across 1,000 requests from our US-East monitoring against the production endpoint, sits at a median of 380ms with a P90 of 580ms and a P99 of 850ms. Total generation time for a 200-character (~30-second-audio) sentence is 800ms-1.5s. The 'warm' qualifier matters: cold-start adds 1-2 seconds when the model has been evicted from memory after a long idle period. In practice, the production GPU pool keeps the model resident continuously, so cold-start is rare for any caller doing more than a few requests per hour.
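To compute the same kind of summary from your own latency samples, a nearest-rank percentile is enough. This is a sketch under an assumption: the page does not say which percentile method the monitoring uses, and nearest-rank is only one of several common definitions. The names `percentile` and `summarize` are illustrative.

```python
import math
from typing import Dict, Sequence


def percentile(samples_ms: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples_ms:
        raise ValueError("need at least one sample")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples_ms)
    # Multiply before dividing to avoid float error such as 0.9 * 100 > 90.
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def summarize(samples_ms: Sequence[float]) -> Dict[str, float]:
    """Median, P90, and P99 of a list of first-byte latency samples."""
    return {
        "median": percentile(samples_ms, 50),
        "p90": percentile(samples_ms, 90),
        "p99": percentile(samples_ms, 99),
    }
```

With 1,000 samples, P99 is the 990th-smallest value, so it is set by the slowest 1% of requests rather than by the typical case.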

Why Kokoro-82M is fast

Kokoro-82M is an 82-million-parameter neural TTS model from hexgrad, released on Hugging Face under Apache-2.0. The size matters: 82M parameters is roughly 1/7th the size of OpenAI's tts-1 (estimated at ~600M parameters based on inference characteristics) and 1/40th the size of ElevenLabs Multilingual v2. All else being equal, a smaller model executes faster on the same hardware because each inference step does less compute and moves fewer weights through memory. Kokoro's architecture trades off the very-top-of-stack expressive realism (character voice acting, extreme emotion range) for inference speed and lower compute cost. That is the right tradeoff for the vast majority of TTS use cases (chatbots, accessibility, IVR, content audio versions), where the listener doesn't need theatrical performance, just clean natural narration.

The other latency lever is co-locating GPU inference with the API server. EasyVoice runs inference on dedicated GPU instances in the same datacenter region as the HTTP frontend, which eliminates the inter-service network hop most cloud TTS architectures incur. By comparison, OpenAI's tts-1 routes through OpenAI's general inference fabric — fast, but not optimized for single-region single-model latency. The OpenAI tts-1 first-byte latency from the same US-East monitoring point sits at a median of ~900ms vs our 380ms — a 2.4× difference that's perceptible in conversational UX.

Streaming with chunked transfer-encoding

The endpoint streams audio via HTTP chunked transfer-encoding. The response opens with Content-Type: audio/mpeg (or audio/wav, audio/opus) and Transfer-Encoding: chunked. Bytes flow as the model generates them; typical chunk size is 4-16KB, which corresponds to roughly 200-800ms of playback audio. Any standard HTTP library handles this without special configuration: fetch in browsers and Node exposes response.body as a ReadableStream, requests in Python supports stream=True, and Go's http.Client returns the response body as a stream by default. You don't need WebSockets or gRPC for low-latency TTS. Chunked HTTP is sufficient and simpler.
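The relationship between chunk size and playback time is simple arithmetic for constant-bitrate audio. The sketch below assumes a 160 kbps MP3 stream, which is not stated on this page; it is the bitrate at which 4-16KB chunks work out to roughly 200-800ms. The function name `chunk_playback_ms` is illustrative.

```python
def chunk_playback_ms(chunk_bytes: int, bitrate_kbps: float) -> float:
    """Playback duration, in milliseconds, of a constant-bitrate audio chunk."""
    if bitrate_kbps <= 0:
        raise ValueError("bitrate_kbps must be positive")
    bits = chunk_bytes * 8
    return bits / (bitrate_kbps * 1000) * 1000


# Assumed bitrate: 160 kbps (an inference from the figures above, not a documented value).
for size_kb in (4, 16):
    ms = chunk_playback_ms(size_kb * 1024, bitrate_kbps=160)
    print(f"{size_kb} KB at 160 kbps is about {ms:.0f} ms of audio")
```

At a lower bitrate the same chunk holds more audio, so the playback time per chunk depends on the response_format and encoder settings actually in use.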

Client-side, the playback pattern is to pipe incoming bytes into a MediaSource (browser) or an audio buffer queue (native): feed bytes into the queue and the audio thread plays them as they arrive. For browser apps, the simplest pattern is to read with response.body.getReader() and append each chunk to a MediaSource via SourceBuffer.appendBuffer(). MediaSource support for a given audio format varies by browser, so check MediaSource.isTypeSupported() with the response's MIME type first. For Node apps generating audio for downstream services, write the chunks to a writable stream as they arrive. End to end, a user clicking 'play' on a chatbot response hears audio within about 380ms instead of after the full 1.5-second generation completes, roughly a 4× perceived-latency improvement.
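For the native case, the buffer-queue pattern can be sketched with a bounded queue and a consumer thread. This is an illustration under stated assumptions, not EasyVoice client code: `play_stream` and `sink` are hypothetical names, and `sink` stands in for whatever writes bytes to your audio device or decoder.

```python
import queue
import threading
from typing import Callable, Iterable, List

_END = object()  # sentinel marking the end of the stream


def play_stream(
    chunks: Iterable[bytes],
    sink: Callable[[bytes], None],
    max_buffered: int = 32,
) -> None:
    """Hand chunks to `sink` on a separate thread while the caller keeps reading."""
    buffer: queue.Queue = queue.Queue(maxsize=max_buffered)
    errors: List[BaseException] = []

    def consumer() -> None:
        while True:
            item = buffer.get()
            if item is _END:
                break
            if errors:
                continue  # after a sink failure, drain without playing
            try:
                sink(item)  # e.g. write to an audio device or a decoder pipe
            except Exception as exc:
                errors.append(exc)

    worker = threading.Thread(target=consumer, daemon=True)
    worker.start()
    try:
        for chunk in chunks:
            if chunk:
                buffer.put(chunk)  # blocks if playback falls far behind
    finally:
        buffer.put(_END)
        worker.join()
    if errors:
        raise errors[0]
```

The bounded queue applies backpressure: if the sink stalls, the reader stops pulling from the network instead of buffering the whole response in memory.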

Cold-start behavior

Cold-start in TTS APIs refers to the latency added when the model isn't already resident in GPU memory and needs to be loaded. For EasyVoice's production endpoint, the model is kept warm continuously across the GPU pool — cold-starts occur only after long idle periods on under-utilized inference workers, which the load balancer routes around. In practice, callers making at least one request per ~60 minutes will never see a cold-start latency penalty.

If your usage pattern is highly bursty (silent for hours, then a burst of traffic), the first request in the burst may pay a 1-2 second cold-start penalty. The mitigation is a periodic keep-warm request — a single 1-character generation every 10-15 minutes is enough to keep your route's worker hot. For chatbot and IVR use cases where bursts are common, this is the right pattern and adds negligible cost on the free tier (a 1-character keep-warm is well under any cap). For batch jobs that run once a week and then stop, the cold-start is unavoidable but only affects the first request of the batch.
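The quota cost of a keep-warm schedule is easy to check. The sketch below assumes each keep-warm request is counted as exactly one character against the 5,000 characters per day free tier, with no minimum billing increment; that counting rule is an assumption, and the function name is illustrative.

```python
FREE_TIER_CHARS_PER_DAY = 5_000  # free-tier cap stated on this page


def keep_warm_chars_per_day(interval_minutes: int, chars_per_request: int = 1) -> int:
    """Characters consumed per day by one keep-warm request every `interval_minutes`."""
    if interval_minutes <= 0:
        raise ValueError("interval_minutes must be positive")
    requests_per_day = (24 * 60) // interval_minutes
    return requests_per_day * chars_per_request


for interval in (10, 15):
    used = keep_warm_chars_per_day(interval)
    share = used / FREE_TIER_CHARS_PER_DAY
    print(f"every {interval} min: {used} chars/day ({share:.1%} of the free tier)")
```

Under that assumption, a 10-minute interval uses 144 characters per day and a 15-minute interval uses 96, both a small fraction of the free-tier cap.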

Network and geography

Network latency is independent of the TTS API and matters as much as the inference latency. Ping to easyvoice.ae's primary region is ~30ms from the US-East monitoring point, ~25ms from EU-West, and ~180-250ms from APAC depending on routing. APAC users will see total first-byte latency of around 550-800ms even on warm inference, simply because of TCP round-trip time. The mitigation is the same as for any geo-sensitive service: place Cloudflare or a similar edge in front of your application to terminate TLS close to the user, even if the TTS request itself still travels to the origin.
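A rough latency budget makes the geography effect concrete. This is a simplified model, not a measurement: it treats the warm first-byte figures as server-side numbers and adds one network round trip per exchange. The function name and the `round_trips` parameter are illustrative.

```python
def perceived_first_byte_ms(
    server_first_byte_ms: float,
    rtt_ms: float,
    round_trips: int = 1,
) -> float:
    """Estimate client-perceived first-byte latency.

    round_trips=1 models a reused (keep-alive) connection: the request goes out
    and the first audio byte comes back. A fresh connection adds a TCP handshake
    and a TLS 1.3 handshake, about one round trip each, for 3 in total.
    """
    if round_trips < 1:
        raise ValueError("round_trips must be at least 1")
    return server_first_byte_ms + rtt_ms * round_trips


# Figures taken from this page: 380 ms median warm inference, 180-250 ms APAC ping.
print(perceived_first_byte_ms(380, 180))                 # reused connection, best-case APAC
print(perceived_first_byte_ms(380, 250))                 # reused connection, worst-case APAC
print(perceived_first_byte_ms(380, 250, round_trips=3))  # fresh TCP + TLS connection
```

With a reused connection the estimate lands roughly in line with the 550-800ms range quoted above. The fresh-connection case shows why keeping connections alive, or terminating TLS at a nearby edge, matters most for distant users.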

For ultra-low-latency conversational use cases (sub-300ms perceived first-byte from anywhere in the world), the only architectural answer is regional inference replicas — which we'll ship when demand from users in non-US/EU regions justifies it. Today, US and EU users see the benchmark numbers above; APAC users see modestly higher first-byte latency that's still competitive with all the alternatives in the same geography.

Comparison: latency vs OpenAI, ElevenLabs, Google, Azure

Measured first-byte latency from the same US-East monitoring point, median across 1,000 requests, same input (200 characters), warm state:

  • EasyVoice Kokoro-82M: ~380ms
  • OpenAI tts-1: ~900ms
  • OpenAI tts-1-hd: ~1,400ms
  • ElevenLabs Turbo v2.5: ~400ms
  • ElevenLabs Multilingual v2: ~1,800ms
  • Google Cloud TTS WaveNet: ~700ms
  • Azure Neural TTS: ~800ms

EasyVoice and ElevenLabs Turbo v2.5 are roughly tied at the low end; everything else trails by 2-5×.
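The multiples follow directly from the medians quoted above. The short sketch below only restates that arithmetic; the dictionary holds the figures from this page and the function name is illustrative.

```python
from typing import Dict

# Median warm first-byte latency in milliseconds, as quoted on this page.
MEDIAN_FIRST_BYTE_MS: Dict[str, int] = {
    "EasyVoice Kokoro-82M": 380,
    "OpenAI tts-1": 900,
    "OpenAI tts-1-hd": 1400,
    "ElevenLabs Turbo v2.5": 400,
    "ElevenLabs Multilingual v2": 1800,
    "Google Cloud TTS WaveNet": 700,
    "Azure Neural TTS": 800,
}


def relative_latency(figures: Dict[str, int], baseline: str) -> Dict[str, float]:
    """Express each latency as a multiple of the baseline service's latency."""
    base = figures[baseline]
    return {name: ms / base for name, ms in figures.items()}


ratios = relative_latency(MEDIAN_FIRST_BYTE_MS, "EasyVoice Kokoro-82M")
for name, ratio in sorted(ratios.items(), key=lambda kv: kv[1]):
    print(f"{name:<28} {ratio:.1f}x")
```

On these figures the slower services range from about 1.8× (Google Cloud TTS WaveNet) to about 4.7× (ElevenLabs Multilingual v2) of the EasyVoice median, with OpenAI tts-1 at about 2.4×.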

ElevenLabs Turbo v2.5 is the closest competitor on raw latency, but it sits behind ElevenLabs' tiered, per-character pricing rather than their lowest-cost plan. EasyVoice's flat $9.99/mo unlimited plan applies to every voice, every endpoint, and every request. For real-time conversational voice agents that need both low first-byte latency and predictable bills as call volume grows, the flat-rate plus Kokoro-82M combination is the configuration to beat.

Code samples

Drop-in examples for the EasyVoice TTS API. Every request below assumes you've set EASYVOICE_API_KEY as an environment variable.

Streaming response — Node 18+

Pipe chunks into the audio output as they arrive
// Run as an ES module (.mjs) so top-level await is available.
// Start the clock before the request so the measurement covers the full wait.
const start = performance.now();
const res = await fetch("https://easyvoice.ae/api/tts/generate", {
  method: "POST",
  headers: {
    "Authorization": `Bearer ${process.env.EASYVOICE_API_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    voice: "am_adam",
    input: "This audio starts playing before generation completes.",
    response_format: "mp3",
  }),
});
if (!res.ok) {
  throw new Error(`TTS request failed: HTTP ${res.status}`);
}

const audioQueue = []; // placeholder sink; swap in your player or output stream
const reader = res.body.getReader();
let firstByteAt = null;
while (true) {
  const { value, done } = await reader.read();
  if (done) break;
  if (firstByteAt === null) {
    firstByteAt = performance.now() - start;
    console.log(`First-byte latency: ${firstByteAt.toFixed(0)}ms`);
  }
  // Pipe `value` (Uint8Array) into your audio sink
  audioQueue.push(value);
}

Streaming response — Python

Use stream=True + iter_content for chunked playback
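import io
import os
import time

import requests

audio_buffer = io.BytesIO()  # placeholder sink; swap in pyaudio, an ffmpeg pipe, or a file

start = time.perf_counter()  # start the clock before the request is sent
with requests.post(
    "https://easyvoice.ae/api/tts/generate",
    headers={"Authorization": f"Bearer {os.environ['EASYVOICE_API_KEY']}"},
    json={"voice": "am_adam", "input": "Hello", "response_format": "mp3"},
    stream=True,
    timeout=30,
) as res:
    res.raise_for_status()  # fail fast on auth or quota errors
    first_byte_at = None
    for chunk in res.iter_content(chunk_size=4096):
        if not chunk:
            continue
        if first_byte_at is None:
            first_byte_at = (time.perf_counter() - start) * 1000
            print(f"First-byte latency: {first_byte_at:.0f}ms")
        # Pipe chunk into pyaudio / ffmpeg / wherever
        audio_buffer.write(chunk)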

Keep-warm pattern

Avoid cold-start on bursty traffic patterns
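# Run every 10 minutes from cron / a sidecar process.
# A crontab entry must stay on one line (cron does not support backslash
# continuations), and cron does not load your shell profile, so define
# EASYVOICE_API_KEY in the crontab itself or in a wrapper script.
*/10 * * * * curl -s -o /dev/null -X POST https://easyvoice.ae/api/tts/generate -H "Authorization: Bearer $EASYVOICE_API_KEY" -H "Content-Type: application/json" -d '{"voice":"am_adam","input":".","response_format":"mp3"}'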

Voices to try with the API

Every voice below is callable via the same voice parameter — preview samples and read the full character profile.

  • Adam (Free): American English, am_adam
  • Heart (Free): American English, af_heart
  • Michael (Free): American English, am_michael

Frequently asked questions

What's EasyVoice's first-byte latency?

Median 380ms warm, P90 580ms, P99 850ms, measured from US-East against the production endpoint with a 200-character input. Cold-start (model not resident in GPU memory) adds 1-2 seconds but is rare in practice — the production pool keeps the model warm continuously across all callers. Total generation time for a 200-character (~30-second-audio) sentence is 800ms-1.5s end-to-end.

How does that compare to OpenAI tts-1 or ElevenLabs?

Measured from the same US-East monitoring point, median across 1,000 same-input requests: EasyVoice Kokoro-82M ~380ms, OpenAI tts-1 ~900ms, OpenAI tts-1-hd ~1.4s, ElevenLabs Turbo v2.5 ~400ms, ElevenLabs Multilingual v2 ~1.8s, Google Cloud TTS WaveNet ~700ms, Azure Neural ~800ms. EasyVoice and ElevenLabs Turbo v2.5 are roughly tied at the low end. Everything else trails by 2-5×.

Does the API support streaming?

Yes — chunked transfer-encoding. The response opens with Transfer-Encoding: chunked and audio bytes flow as the model generates them. Any standard HTTP library handles it: fetch with response.body.getReader() in Node/browser, requests with stream=True in Python, Go's http.Client streams by default. You don't need WebSockets or gRPC. Typical chunk size is 4-16KB, corresponding to 200-800ms of playback audio.

What causes cold-start latency?

Cold-start is the 1-2 second penalty when the model isn't already resident in GPU memory. In production it's rare — the GPU pool keeps the model warm. If your traffic is bursty (silent for hours, then a burst), schedule a keep-warm 1-character generation every 10-15 minutes to avoid the penalty on your first burst request. The keep-warm pattern fits under any tier's quota.

Is Kokoro-82M fast because it's small?

Largely yes. At 82 million parameters Kokoro is roughly 1/7th the size of OpenAI tts-1 (estimated ~600M params) and 1/40th the size of ElevenLabs Multilingual v2. All else being equal, smaller models execute faster on the same hardware. Kokoro trades top-of-stack expressive realism (theatrical character voice acting) for inference speed, which is the right tradeoff for chatbot, IVR, accessibility, and most narration use cases where clean natural delivery matters more than performance.

Will latency be consistent in production?

Yes, as long as you're hitting the warm pool. The gap between median and P99 latency (380ms vs 850ms) reflects ordinary network and GPU scheduling jitter, not architectural inconsistency. From non-US/EU regions you'll see modestly higher latency due to TCP round-trip time; we'll ship regional inference replicas when demand from those regions justifies it.

Related TTS API guides

Best TTS API for Customer Support Chatbots and Voice Agents

Best TTS API for customer support chatbots. EasyVoice wires into Twilio Voice, Voiceflow, Dialogflow CX in minutes. Flat $9.99 unlimited. Low-latency streaming.

TTS API for Developers — Bearer Auth, OpenAI Shape, Flat Pricing

TTS API for developers — Bearer auth, OpenAI-compatible request shape, curl/JS/Python/Go samples. 5K chars/day free. $9.99/mo unlimited Pro. 46 voices.

Comparing vendors? See EasyVoice vs ElevenLabs.

Start building with the EasyVoice TTS API

5,000 characters per day free, no credit card. Pro $9.99/mo unlimited. OpenAI-compatible request shape.
