EasyVoice

Free text-to-speech powered by open source AI.

© 2026 EasyVoice. Powered by Kokoro-82M (Apache 2.0).

Built with ❤️ and open source AI.

Built by InfoDriven

Dubai, United Arab Emirates · Support@infodriven.ae · infodriven.ae

Multilingual TTS API — 8 Languages, 46 Voices, Single Endpoint

Most TTS APIs that claim 'multilingual support' do one of two things: either they expose a single multilingual neural model that handles every language with the same generic voice (OpenAI tts-1), or they expose hundreds of locale-specific voices behind tier gating that makes each language a separate budget conversation (Google Cloud TTS, Azure Neural). EasyVoice ships a middle path that matches what most multilingual apps actually need: 46 native-speaker voices across 8 high-value languages (American and British English, Spanish, French, Italian, Portuguese, Japanese, Hindi, Chinese), callable from the same endpoint with the same Bearer key at the same flat $9.99/mo Pro price. This guide covers what 'multilingual support' actually means in practice, which language voices are right for which use cases, and how to swap between locales programmatically in a single conversation flow.

5,000 characters per day on the free tier, no credit card. Pro $9.99/mo unlimited. 46 voices, 8 languages.

Part of the Best TTS APIs in 2026 hub — compares EasyVoice, OpenAI tts-1, ElevenLabs, Google Cloud TTS, and Azure Speech.

What 'multilingual TTS' actually requires in production

The textbook definition of multilingual TTS, 'an API that synthesizes audio in multiple languages', undersells the actual product requirement. Real multilingual apps need four things: (1) native-speaker voices in each target language, not English-engine voices reading translated text (which sound audibly off to native listeners and undermine trust); (2) a single authentication surface and a single endpoint URL across all languages (separate auth per language is operationally expensive); (3) a consistent pricing model across all languages (per-language pricing tiers, common with Google Cloud TTS Studio voices, make budget forecasting harder); (4) a language-detection or language-passing convention that lets the request specify which voice to use without out-of-band coordination.

EasyVoice satisfies all four. 46 native-speaker voices across the 8 supported languages — every voice in the catalog is recorded by a native speaker of its target language, not a multilingual model approximating an accent. Single endpoint at https://easyvoice.ae/api/tts/generate; single Bearer key; flat $9.99/mo Pro covers all languages and all voices identically. The voice parameter in the request body identifies both the language and the specific voice (e.g. ef_dora = Spanish female 'Dora', ff_siwis = French female 'Siwis'); no separate language code is required, though the locale prefix in the voice ID makes language-routing trivial in code.
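Because the locale lives in the voice ID prefix, routing code can recover the language from the ID alone. The sketch below is illustrative only: the prefix table is inferred from the voice IDs quoted in this guide (af_, bm_, ef_, ff_, and so on), and parse_voice_id is a hypothetical helper, not part of the EasyVoice API.

```python
# Prefix letters inferred from the voice IDs quoted in this guide; treat the
# table as an assumption and confirm it against the live voice catalog.
LANG_BY_PREFIX = {
    "a": "en-US", "b": "en-GB", "e": "es", "f": "fr", "i": "it",
    "p": "pt", "j": "ja", "h": "hi", "z": "zh",
}
GENDER_BY_LETTER = {"f": "female", "m": "male"}


def parse_voice_id(voice_id: str) -> tuple[str, str, str]:
    """Split an ID such as 'ef_dora' into (language, gender, name)."""
    prefix, sep, name = voice_id.partition("_")
    if not sep or len(prefix) != 2 or not name:
        raise ValueError(f"unexpected voice ID shape: {voice_id!r}")
    try:
        return LANG_BY_PREFIX[prefix[0]], GENDER_BY_LETTER[prefix[1]], name
    except KeyError:
        raise ValueError(f"unknown voice prefix: {prefix!r}") from None


print(parse_voice_id("ef_dora"))
print(parse_voice_id("bm_george"))
```

A helper like this is mostly useful for grouping a voice list by language in a settings screen or for validating that a configured voice matches the detected language.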

The 8 supported languages and 46 voices

American English (en-US, 22 voices including af_heart, af_aoede, af_bella, af_nova, am_adam, am_michael, am_onyx): the largest catalog and the most common starting point for English-language apps. British English (en-GB, 8 voices including bf_emma, bf_alice, bm_george, bm_daniel): modern Received Pronunciation for UK-targeted content, audiobook narration of British literature, and any context where the listener expects a UK accent rather than an American one. Spanish (es, 3 voices including ef_dora, em_alex): covers Spanish-speaking markets across Spain and Latin America with a neutral Iberian register; for Mexico-specific content, ef_dora remains broadly intelligible across LATAM markets, including on less common vocabulary.

French (fr, 1 voice: ff_siwis): a French (France) native-speaker voice, suitable for both French and Canadian French content with light register adjustments. Italian (it, 2 voices: if_sara, im_nicola): standard Italian (Italy) voices that work well for both peninsular and Italian-Swiss content. Portuguese (pt, 3 voices including pf_dora, pm_alex): Brazilian Portuguese register, also intelligible for European Portuguese audiences. Japanese (ja, 5 voices including jf_alpha, jf_gongitsune, jm_kumo): standard Tokyo Japanese with the natural pitch-accent patterns native Japanese listeners expect. Hindi (hi, 4 voices including hf_alpha, hm_omega): standard Hindi (Delhi/UP register), supporting Devanagari script input and transliterated Roman-script Hindi. Chinese (zh, Mandarin): covered by additional catalog voices reachable through the /api/voices/list endpoint.

Switching languages mid-conversation

Multilingual apps frequently need to switch languages within a single user session — a support chatbot detects the customer's preferred language from their first message and switches the bot voice to match, an educational app reads bilingual content alternating between source and target languages, a translation tool reads both the input and the translated output. The pattern is straightforward: each call to /api/tts/generate is stateless, so changing the voice parameter between calls changes the language with zero latency penalty. There's no language-detection call to make first; you decide the language client-side (from the LLM's response, from the user's stated preference, from a language-detection library) and pass the appropriate voice ID.

For pages that mix multiple languages in the same audio stream (e.g. a Hindi-English code-switched script common in Indian content), the best pattern is to chunk the text at language boundaries, synthesize each chunk with the appropriate voice, and concatenate the audio buffers client-side. The chunks should align to sentence or phrase boundaries, not mid-word — splitting mid-word in code-switched content produces audibly choppy output. For pure single-language content, just pin a voice and call the endpoint with the full text; the model handles intra-sentence flow naturally.
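One way to keep chunks aligned to sentence boundaries is to split on sentence-ending punctuation first and only then assign a language to each sentence. The sketch below is a simplification that assumes Hindi is written in Devanagari and English in ASCII letters; sentence_chunks is a hypothetical helper name, and a production app would likely swap the script count for a real language-detection library.

```python
import re

DEVANAGARI = re.compile(r"[\u0900-\u097F]")
LATIN = re.compile(r"[A-Za-z]")
# Split after '.', '!', '?' or the Devanagari danda when whitespace follows.
SENTENCE_END = re.compile(r"(?<=[.!?\u0964])\s+")


def sentence_chunks(text: str) -> list[tuple[str, str]]:
    """Return (chunk, lang) pairs, merging adjacent sentences in one language."""
    chunks: list[tuple[str, str]] = []
    for sentence in SENTENCE_END.split(text.strip()):
        if not sentence:
            continue
        hi_count = len(DEVANAGARI.findall(sentence))
        en_count = len(LATIN.findall(sentence))
        # The dominant script decides the voice; ties fall back to English.
        lang = "hi" if hi_count > en_count else "en"
        if chunks and chunks[-1][1] == lang:
            # Same language as the previous sentence: extend that chunk so the
            # voice reads both sentences in a single request.
            chunks[-1] = (chunks[-1][0] + " " + sentence, lang)
        else:
            chunks.append((sentence, lang))
    return chunks


for chunk, lang in sentence_chunks("Hello there. नमस्ते दोस्तों। How are you? I am fine."):
    print(lang, chunk)
```

Merging consecutive same-language sentences keeps the number of API calls down and lets the voice carry natural prosody across sentence breaks.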

Comparison: multilingual coverage vs competitors

OpenAI tts-1 has 6 voices but a single multilingual model that handles 57 languages. Pronunciation is solid on the big locales but degrades on the long tail — Hindi, Korean, and Vietnamese all have audible accent issues compared to native-speaker voices. Voice choice is the same six identities regardless of which language you synthesize, which means your Spanish bot and your French bot literally sound like the same English-speaking person reading translated text. Acceptable for some products; off-brand for any product where listener trust matters.

Google Cloud TTS has 220+ voices across 40+ languages, a broad catalog if you need tail languages (Zulu, Cebuano, Bengali, Welsh, Khmer). The cost is GCP authentication overhead and tiered pricing (Studio voices are $160/1M chars vs $4/1M for Standard). Azure Neural TTS has 400+ voices across 140+ locales, making it the absolute coverage leader, at $16/1M chars on Neural. ElevenLabs Multilingual v2 covers 29+ languages under its tiered pricing model. EasyVoice's 8-language coverage at a flat $9.99/mo is narrower than the GCP/Azure giants but covers roughly 4.5B of the world's 8B people: the high-volume locales most multilingual apps actually ship to. If you need Zulu, go Azure or Google. If you ship to English, Spanish, French, Italian, Portuguese, Japanese, Hindi, or Chinese audiences, the flat rate is the configuration to beat.
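To put the flat rate next to the per-character prices quoted above, a rough break-even calculation helps. The sketch below uses only the list prices named in this guide and ignores any free quotas or volume discounts the other vendors may offer; break_even_chars is a hypothetical helper.

```python
def break_even_chars(flat_monthly_cents: int, cents_per_million_chars: int) -> int:
    """Monthly character volume at which per-character billing matches the flat fee.

    Prices are passed in whole cents so the arithmetic stays in integers.
    The result is rounded down to a whole character.
    """
    return flat_monthly_cents * 1_000_000 // cents_per_million_chars


FLAT_PRO_CENTS = 999  # $9.99/mo

# Per-1M-character prices as quoted in this guide, in cents.
for label, rate_cents in [
    ("Google Standard ($4/1M)", 400),
    ("Azure Neural ($16/1M)", 1_600),
    ("Google Studio ($160/1M)", 16_000),
]:
    print(label, break_even_chars(FLAT_PRO_CENTS, rate_cents))
```

Below the break-even volume the per-character plan is cheaper on paper; above it the flat rate wins, which is why the comparison depends heavily on expected monthly volume.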

Language detection and routing patterns

Production multilingual apps need a language-detection step somewhere in the request flow to decide which voice to call. Three common patterns: (1) Trust the user's explicit preference — a settings toggle in your app that the user sets once. Simple, predictable, no detection accuracy concerns. (2) Detect from the LLM's response — if your chatbot stack is OpenAI/Anthropic/Llama, prompt the LLM to emit the language code alongside its response (e.g. response: '...', lang: 'es'). The model is generally accurate at identifying its own output language. (3) Use a separate language-detection library on the text — fasttext, lingua-py, or franc in JS — for static content. All three work fine with EasyVoice; we don't currently expose a server-side language detection endpoint because the client-side libraries are excellent and adding it would be redundant.
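Pattern (2) can be sketched as a small parser for the LLM's reply. This assumes you have prompted the model to return a JSON object with the response and lang fields shown above; the fallback behavior and the parse_llm_reply name are illustrative choices, not something EasyVoice prescribes.

```python
import json

# Base language codes this guide maps to voices.
SUPPORTED = {"en", "es", "fr", "it", "pt", "ja", "hi", "zh"}


def parse_llm_reply(raw: str, default_lang: str = "en") -> tuple[str, str]:
    """Return (text_to_speak, lang) from a reply like {"response": ..., "lang": ...}."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        # The model ignored the format: speak the raw text in the default language.
        return raw, default_lang
    if not isinstance(data, dict) or not isinstance(data.get("response"), str):
        return raw, default_lang
    # Reduce regional tags such as "es-MX" to their base language.
    lang = str(data.get("lang", "")).lower().split("-")[0]
    return data["response"], lang if lang in SUPPORTED else default_lang


print(parse_llm_reply('{"response": "Claro, dígame.", "lang": "es"}'))
print(parse_llm_reply("Plain text with no JSON wrapper"))
```

Falling back to a default language rather than raising keeps the voice pipeline running even when the model occasionally drops the structured format.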

Once you've detected the language, map it to an EasyVoice voice ID with a simple lookup table: {en: 'am_adam', 'en-US': 'am_adam', 'en-GB': 'bm_george', es: 'ef_dora', fr: 'ff_siwis', it: 'if_sara', pt: 'pf_dora', ja: 'jf_alpha', hi: 'hf_alpha', zh: 'zf_xiaoxiao'}. For app-wide voice consistency across languages, pin a single 'persona' per language, such as one female voice in each language. An all-male persona is not available in every language, because French currently has only the female voice ff_siwis. For LLM-driven dynamic voice selection (the bot picks the voice based on context), expose the voice catalog to the LLM as a tool-call parameter and let it select a voice per response.
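Detection libraries and browsers often return regional tags (es-MX, pt_BR) that a flat table will miss. A minimal sketch of a resolver that tries the exact tag, then the base language, then a default is shown below; resolve_voice is a hypothetical helper, and the table simply mirrors the mapping above.

```python
# Mirrors the lookup table above, with keys lowercased for matching.
VOICE_BY_LANG = {
    "en": "am_adam", "en-us": "am_adam", "en-gb": "bm_george",
    "es": "ef_dora", "fr": "ff_siwis", "it": "if_sara", "pt": "pf_dora",
    "ja": "jf_alpha", "hi": "hf_alpha", "zh": "zf_xiaoxiao",
}


def resolve_voice(lang_tag: str, default: str = "am_adam") -> str:
    """Map a language tag to a voice ID: exact tag, then base language, then default."""
    tag = lang_tag.strip().replace("_", "-").lower()
    if tag in VOICE_BY_LANG:
        return VOICE_BY_LANG[tag]
    base = tag.split("-")[0]
    return VOICE_BY_LANG.get(base, default)


for tag in ["en-GB", "es-MX", "pt_BR", "ko"]:
    print(tag, resolve_voice(tag))
```

Trying the exact tag first means regional overrides such as en-GB still win, while unlisted regions degrade to the base-language voice instead of silently switching to English.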

Quality and accent notes per language

Spanish: ef_dora and em_alex are neutral Iberian Spanish (Castilian register) — fully intelligible across LATAM markets but slightly off-register for Mexico-specific marketing copy where neutral Mexican Spanish would be more natural. For most cross-border Spanish-language apps (Latin America + Spain), the neutral Iberian voices are a safe default. French: ff_siwis is metropolitan French (Île-de-France region); Canadian French and Belgian French audiences will perceive it as France-French. Italian: if_sara and im_nicola are standard Italian — work for both peninsular and Italian-Swiss content. Portuguese: Brazilian Portuguese register dominates the catalog; European Portuguese audiences will perceive a Brazilian accent but tail vocabulary remains intelligible.

Japanese: jf_alpha and similar voices follow standard Tokyo pitch-accent patterns; Kansai-dialect content will read in a Tokyo accent. Hindi: hf_alpha covers the standard Delhi/UP Hindi register and accepts both Devanagari and Roman-script Hindi input (transliteration is handled internally for the common script-mixing patterns). Where the model struggles: heavily code-switched intra-sentence content (Hinglish, Spanglish, Singlish) is read in the language of the selected voice; for that use case, chunk by language and concatenate as covered above. Tail vocabulary (specialized scientific terms, rare proper nouns) is pronounced correctly more often than not but doesn't reach human-narrator reliability; for high-stakes content (medical narration, legal proceedings), add a human review pass before publishing.

Code samples

Drop-in examples for the EasyVoice TTS API. Every request below assumes you've set EASYVOICE_API_KEY as an environment variable.

Language → voice lookup

Simple mapping table for multilingual app routing
const VOICE_BY_LANG = {
  "en": "am_adam",      "en-US": "am_adam",   "en-GB": "bm_george",
  "es": "ef_dora",      "fr": "ff_siwis",     "it": "if_sara",
  "pt": "pf_dora",      "pt-BR": "pf_dora",
  "ja": "jf_alpha",     "hi": "hf_alpha",
};

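async function ttsMultilingual(text, langCode) {
  // Fall back to the English voice when the language code is unmapped.
  const voice = VOICE_BY_LANG[langCode] ?? VOICE_BY_LANG["en"];
  const res = await fetch("https://easyvoice.ae/api/tts/generate", {
    method: "POST",
    headers: {
      "Authorization": `Bearer ${process.env.EASYVOICE_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ voice, input: text, response_format: "mp3" }),
  });
  // Surface HTTP errors instead of returning an error body as audio.
  if (!res.ok) {
    throw new Error(`EasyVoice TTS request failed: ${res.status}`);
  }
  return Buffer.from(await res.arrayBuffer());
}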

// Same endpoint, same key, voice ID switches the language
const enAudio = await ttsMultilingual("Hello world", "en");
const esAudio = await ttsMultilingual("Hola mundo", "es");
const jaAudio = await ttsMultilingual("こんにちは世界", "ja");

Mid-conversation language switch

Support bot detects customer language, switches voice on next turn
import os, requests
from typing import Iterable

VOICE_BY_LANG = {
    "en": "am_adam", "es": "ef_dora", "fr": "ff_siwis",
    "it": "if_sara", "pt": "pf_dora", "ja": "jf_alpha",
    "hi": "hf_alpha",
}

def speak(text: str, lang: str = "en") -> bytes:
    # Unmapped language codes fall back to the English voice.
    voice = VOICE_BY_LANG.get(lang, "am_adam")
    res = requests.post(
        "https://easyvoice.ae/api/tts/generate",
        headers={"Authorization": f"Bearer {os.environ['EASYVOICE_API_KEY']}"},
        json={"voice": voice, "input": text, "response_format": "mp3"},
        timeout=30,
    )
    # Fail loudly rather than return an error body as audio.
    res.raise_for_status()
    return res.content

# Conversation flow — customer switches language mid-session
turn_1_audio = speak("Welcome to support, how can I help?", "en")
# Customer responds in Spanish, bot detects and switches
turn_2_audio = speak("Claro, dígame cómo puedo ayudarle.", "es")

Code-switched content chunking

Split a Hindi-English script and concatenate per-chunk audio
import os
import re
from typing import Iterable

import requests

def detect_chunks(text: str) -> Iterable[tuple[str, str]]:
    # Naive Hindi-English chunker: Devanagari = hi, ASCII = en.
    # Real apps should use a language-detection library.
    pattern = re.compile(r"([\u0900-\u097F\s।]+|[A-Za-z\s.,'?!]+)")
    for match in pattern.finditer(text):
        seg = match.group(0).strip()
        # Skip empty or punctuation-only segments; there is nothing to speak.
        if not re.search(r"[A-Za-z\u0900-\u097F]", seg):
            continue
        lang = "hi" if re.search(r"[\u0900-\u097F]", seg) else "en"
        yield (seg, lang)

def speak_codeswitched(text: str) -> bytes:
    audio = b""
    for chunk, lang in detect_chunks(text):
        voice = "hf_alpha" if lang == "hi" else "am_adam"
        res = requests.post(
            "https://easyvoice.ae/api/tts/generate",
            headers={"Authorization": f"Bearer {os.environ['EASYVOICE_API_KEY']}"},
            json={"voice": voice, "input": chunk, "response_format": "mp3"},
            timeout=30,
        )
        res.raise_for_status()
        # Raw MP3 byte concatenation plays in most decoders, but reported
        # durations can be off; join decoded audio if that matters.
        audio += res.content
    return audio
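If exact durations or clean joins matter, one alternative is to join uncompressed audio instead of raw MP3 bytes. The sketch below uses Python's standard wave module and assumes the segments are WAV files with matching channel count, sample width, and sample rate. Whether the endpoint accepts "response_format": "wav" is an assumption not confirmed in this guide, and concat_wav is a hypothetical helper.

```python
import io
import wave


def concat_wav(segments: list[bytes]) -> bytes:
    """Join WAV byte strings that share the same audio format into one WAV."""
    if not segments:
        raise ValueError("no segments to join")
    out = io.BytesIO()
    params = None
    with wave.open(out, "wb") as writer:
        for seg in segments:
            with wave.open(io.BytesIO(seg), "rb") as reader:
                fmt = (reader.getnchannels(), reader.getsampwidth(), reader.getframerate())
                if params is None:
                    # The first segment defines the output format.
                    params = fmt
                    writer.setnchannels(fmt[0])
                    writer.setsampwidth(fmt[1])
                    writer.setframerate(fmt[2])
                elif fmt != params:
                    raise ValueError("segments must share channels, sample width, and rate")
                writer.writeframes(reader.readframes(reader.getnframes()))
    return out.getvalue()
```

The joined result carries a single header with the correct total length, so players report the right duration; the trade-off is larger payloads than MP3.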

Voices to try with the API

Every voice below is callable via the same voice parameter.

Dora · Spanish · ef_dora
Siwis · French · ff_siwis
Alpha · Hindi · hf_alpha

Frequently asked questions

How many languages does EasyVoice's TTS API support?

Eight languages with native-speaker voices in each: American English (en-US, 22 voices), British English (en-GB, 8), Spanish (es, 3), French (fr, 1), Italian (it, 2), Portuguese (pt, 3), Japanese (ja, 5), Hindi (hi, 4). Total catalog: 46 voices across the 8 languages, plus additional Mandarin Chinese voices reachable via /api/voices/list. All voices and languages are included on every plan — there's no per-language tier gating.

How do I switch languages in a multilingual app?

Change the voice parameter in your request body. Each call to /api/tts/generate is stateless, so switching the voice between calls switches the language with no latency penalty. Map your app's language codes (en, es, fr, etc.) to voice IDs (am_adam, ef_dora, ff_siwis) with a lookup table — see the code example above. The same Bearer key and same endpoint URL handle all languages.

Are these real native-speaker voices or one multilingual model?

Real native-speaker voices. Every voice in the catalog is recorded by a native speaker of its target language, not a multilingual model approximating accents from English-trained data. This matters more than most TTS marketing pages admit: English-engine voices reading translated Spanish/French/Hindi/Japanese text sound audibly off to native listeners and undermine product trust. OpenAI tts-1 uses the single-multilingual-model approach; we don't.

How does this compare to Google Cloud TTS or Azure Neural's language coverage?

Google has 220+ voices across 40+ languages; Azure has 400+ voices across 140+ locales. Both are broader than EasyVoice's 8 languages. If you need Zulu, Welsh, Cebuano, Bengali, or other tail languages, go Google or Azure. If you ship to English, Spanish, French, Italian, Portuguese, Japanese, Hindi, or Chinese audiences, EasyVoice covers roughly 4.5B of the world's 8B people — the high-volume locales most multilingual apps actually ship to — at flat $9.99/mo vs Google's $4-$160/1M chars and Azure's $16/1M Neural.

Can I handle code-switched content (Hinglish, Spanglish)?

Pure single-language content reads natively. Code-switched intra-sentence content (Hindi-English, Spanish-English, etc.) reads in the language of the chosen voice — a Hindi voice reading English chunks will pronounce them with a Hindi accent. The right pattern is to chunk the text at language boundaries with a language-detection library, synthesize each chunk with the appropriate voice, and concatenate the audio buffers. See the code example above for a Hindi-English chunker.

Does pricing differ by language?

No. Flat $9.99/mo Pro unlimited covers every voice in every language. Free tier is 5,000 chars/day across all voices and all languages combined (the cap is per-account, not per-language). This is intentionally simpler than Google Cloud TTS (Studio voices cost $160/1M while Standard voices cost $4/1M) or ElevenLabs (per-character billing varies by model). Multilingual apps don't need to model per-language pricing into their TTS budget.
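Because the free-tier cap is shared across all voices and languages, a client-side counter can help avoid hitting it mid-session. The sketch below is a local approximation only: how the server counts characters and when its day resets are not stated in this guide, so both are assumptions, and DailyCharBudget is a hypothetical helper.

```python
from datetime import date
from typing import Optional


class DailyCharBudget:
    """Track characters sent per calendar day against a single shared cap."""

    def __init__(self, limit: int = 5000):
        self.limit = limit
        self._day: Optional[date] = None
        self._used = 0

    def try_spend(self, text: str, today: Optional[date] = None) -> bool:
        """Reserve len(text) characters; return False if that would exceed the cap."""
        today = today or date.today()
        if today != self._day:
            # New day (local clock, an assumption): reset the counter.
            self._day, self._used = today, 0
        if self._used + len(text) > self.limit:
            return False
        self._used += len(text)
        return True


budget = DailyCharBudget()
if budget.try_spend("Hola mundo"):
    print("ok to send")
```

One counter covers every language, which matches the per-account cap: Spanish, Hindi, and English requests all draw from the same daily allowance.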

Related TTS API guides

Best TTS API for Customer Support Chatbots and Voice Agents

Best TTS API for customer support chatbots. EasyVoice wires into Twilio Voice, Voiceflow, Dialogflow CX in minutes. Flat $9.99 unlimited. Low-latency streaming.

TTS API for Developers — Bearer Auth, OpenAI Shape, Flat Pricing

TTS API for developers — Bearer auth, OpenAI-compatible request shape, curl/JS/Python/Go samples. 5K chars/day free. $9.99/mo unlimited Pro. 46 voices.

Start building with the EasyVoice TTS API

5,000 characters per day free, no credit card. Pro $9.99/mo unlimited. OpenAI-compatible request shape.

More TTS API guides

TTS API hub
OpenAI TTS API Alternative — Drop-in Migration to Flat-Rate
Free TTS API — 5,000 Characters Per Day, No Credit Card
Low Latency TTS API — 300-600ms First-Byte on Kokoro-82M
TTS API for Developers — Bearer Auth, OpenAI Shape, Flat Pricing
Best TTS API for Customer Support Chatbots and Voice Agents