New: Audio-to-Audio+Text API powered by Gemma 4
Text-Generator.io now offers a single API endpoint that accepts audio (and optional text), understands it with Gemma 4, generates a text response, and speaks it back using Kokoro TTS -- all in one request.
Why Gemma 4?
Gemma 4 is Google's latest open-weight multimodal model family. The E4B variant natively understands audio input alongside text and supports 140+ languages with a compact 4B-parameter footprint, so it runs efficiently on a single GPU.
Key advantages:
- Native audio understanding -- no separate STT step needed
- Multimodal reasoning across audio and text simultaneously
- Apache 2.0 licensed, open weights
- Runs on commodity GPU hardware
The Multimodal Generate API
Send audio + text in, get text + audio back. One request.
POST /api/v1/multimodal-generate
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| audio_file | file | required | Audio input (WAV, MP3, etc.) |
| text | string | "" | Text prompt alongside audio |
| voice | string | "af_heart" | Kokoro TTS voice name |
| speed | float | 1.0 | TTS speed multiplier |
| max_length | int | 500 | Max tokens for text generation |
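As a minimal sketch, the defaults in the table can be collected into a small helper that builds the non-file form fields. The helper name `build_form_fields` and its range checks (positive `speed` and `max_length`) are assumptions for illustration, not documented API behavior.

```python
def build_form_fields(text="", voice="af_heart", speed=1.0, max_length=500):
    """Return the non-file form fields for /api/v1/multimodal-generate.

    Defaults mirror the parameter table. The range checks are assumptions;
    the API itself may accept or reject other values.
    """
    if speed <= 0:
        raise ValueError("speed must be positive")
    if max_length <= 0:
        raise ValueError("max_length must be positive")
    # Multipart form values travel as strings, so convert the numbers here.
    return {
        "text": text,
        "voice": voice,
        "speed": str(speed),
        "max_length": str(max_length),
    }


fields = build_form_fields(text="Summarize this recording", speed=1.25)
print(fields)
```

The resulting dict can be passed as the `data=` argument of `requests.post`, next to `files={"audio_file": ...}`.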
Response
{
"generated_text": "The audio contains someone saying hello...",
"audio_base64": "UklGR...",
"audio_sample_rate": 24000,
"audio_format": "wav"
}
Example: Python
import base64
import os

import requests

# Read the API key from the environment rather than hard-coding it
API_KEY = os.getenv("TEXT_GENERATOR_API_KEY")
headers = {"secret": API_KEY}

# Upload the audio file together with a text prompt
with open("question.wav", "rb") as f:
    resp = requests.post(
        "https://api.text-generator.io/api/v1/multimodal-generate",
        files={"audio_file": ("question.wav", f, "audio/wav")},
        data={"text": "Answer this question", "voice": "af_heart"},
        headers=headers,
        timeout=120,
    )
resp.raise_for_status()  # fail early on HTTP errors
data = resp.json()
print("Response:", data["generated_text"])

# Save the spoken response
audio_bytes = base64.b64decode(data["audio_base64"])
with open("response.wav", "wb") as f:
    f.write(audio_bytes)
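Before saving or playing the audio, it can be worth checking that the decoded bytes agree with the metadata in the JSON response. The sketch below uses a hypothetical helper, `decode_audio`, and assumes the returned WAV is plain PCM that Python's standard `wave` module can read. It runs against a locally generated stand-in payload rather than a real API response.

```python
import base64
import io
import wave


def decode_audio(payload):
    """Decode audio_base64 from a response dict; return (wav_bytes, seconds)."""
    if payload.get("audio_format") != "wav":
        raise ValueError("this sketch only handles WAV output")
    wav_bytes = base64.b64decode(payload["audio_base64"])
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        rate = wav.getframerate()
        frames = wav.getnframes()
    # Cross-check the WAV header against the JSON metadata.
    if rate != payload["audio_sample_rate"]:
        raise ValueError(
            f"sample rate mismatch: {rate} != {payload['audio_sample_rate']}"
        )
    return wav_bytes, frames / rate


# Stand-in payload: one second of 16-bit mono silence at 24 kHz.
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(24000)
    w.writeframes(b"\x00\x00" * 24000)

sample = {
    "generated_text": "stand-in text",
    "audio_base64": base64.b64encode(buf.getvalue()).decode("ascii"),
    "audio_sample_rate": 24000,
    "audio_format": "wav",
}

wav_bytes, seconds = decode_audio(sample)
print(f"{len(wav_bytes)} bytes, {seconds:.2f} s")
```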
Also new: Gemini-powered Speech-to-Text
Our existing /api/v1/audio-extraction endpoint now supports a "model": "gemini" option alongside the existing Whisper backend. Gemini excels at multilingual audio and noisy environments.
import requests

headers = {"secret": "YOUR_SECRET_HERE"}
data = {
    "audio_url": "https://example.com/audio.mp3",
    "model": "gemini",  # select the Gemini backend
}
resp = requests.post(
    "https://api.text-generator.io/api/v1/audio-extraction",
    json=data,
    headers=headers,
    timeout=120,
)
resp.raise_for_status()  # fail early on HTTP errors
print(resp.json()["text"])
Kokoro TTS: Fast, Natural Speech
The audio output is generated by Kokoro-82M, an 82M-parameter TTS model that produces natural-sounding speech at 24 kHz. It supports 8 languages and 54+ voices.
Available voice families:
- American English: af_heart, af_bella, af_nova, am_adam, am_echo, am_michael, and more
- British English: bf_*, bm_* voices
- Spanish, French, Italian, Japanese, Portuguese, Chinese
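The voice names above appear to follow a two-letter prefix pattern: the first letter marks the accent (`a` for American, `b` for British) and the second the speaker's gender (`f` or `m`). The sketch below splits a name on that inferred convention. The helper `describe_voice` is hypothetical, and because the prefixes for the other languages are not listed here, it reports them as unknown.

```python
# Prefix meanings inferred from the voice names listed above.
ACCENTS = {"a": "American English", "b": "British English"}
GENDERS = {"f": "female", "m": "male"}


def describe_voice(voice):
    """Split a Kokoro-style voice name such as 'af_heart' into its parts."""
    prefix, sep, name = voice.partition("_")
    if not sep or len(prefix) != 2 or not name:
        raise ValueError(f"unexpected voice name: {voice!r}")
    return {
        "accent": ACCENTS.get(prefix[0], "unknown"),
        "gender": GENDERS.get(prefix[1], "unknown"),
        "name": name,
    }


print(describe_voice("af_heart"))
print(describe_voice("am_adam"))
```

A helper like this can be handy for grouping voices in a picker UI, but the API itself only needs the full voice string.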
Use Cases
- Voice assistants -- send a spoken question, get a spoken answer
- Audio translation -- send foreign audio, get English text + speech
- Podcast Q&A -- ask questions about audio content
- Accessibility -- convert any audio interaction to text and back
Pricing
Each request costs 2c USD, which covers both the Gemma 4 inference and the Kokoro TTS generation. The free tier includes 50 requests per month.
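For a rough budget, the pricing can be turned into a small estimate. This sketch assumes the 50 free requests are deducted first each month and that every further request is billed at the flat rate; actual billing rules may differ.

```python
FREE_REQUESTS_PER_MONTH = 50
PRICE_CENTS_PER_REQUEST = 2


def monthly_cost_usd(requests_made):
    """Estimated monthly bill in USD, assuming free requests are used first."""
    if requests_made < 0:
        raise ValueError("requests_made must be non-negative")
    billable = max(0, requests_made - FREE_REQUESTS_PER_MONTH)
    # Work in whole cents, then convert to dollars at the end.
    return billable * PRICE_CENTS_PER_REQUEST / 100


for n in (40, 50, 1000):
    print(n, monthly_cost_usd(n))
```

Under these assumptions, 1,000 requests in a month come to 950 billable requests, or 19 USD.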
Plug
We are on a mission to bring about affordable AI for everyone.
Text Generator offers APIs for text, speech, code generation, and now multimodal audio understanding. Secure, affordable, and accurate.
Try it yourself at: Text Generator Docs