Gemma 4 Multimodal Audio API - Audio In, Text + Audio Out

New: Audio-to-Audio+Text API powered by Gemma 4

Text-Generator.io now offers a single API endpoint that accepts audio (and optional text), understands it with Gemma 4, generates a text response, and speaks it back using Kokoro TTS -- all in one request.

Why Gemma 4?

Gemma 4 is Google's latest open-weight multimodal model. Its E4B variant natively understands audio input alongside text and supports 140+ languages. With a small 4B-parameter footprint, it runs efficiently on a single GPU.

Key advantages:

Native audio understanding -- no separate STT step needed
Multimodal reasoning across audio and text simultaneously
Apache 2.0 licensed, open weights
Runs on commodity GPU hardware

The Multimodal Generate API

Send audio + text in, get text + audio back. One request.

POST /api/v1/multimodal-generate

Parameters

Parameter	Type	Default	Description
audio_file	file	required	Audio input (WAV, MP3, etc.)
text	string	""	Text prompt alongside audio
voice	string	"af_heart"	Kokoro TTS voice name
speed	float	1.0	TTS speed multiplier
max_length	int	500	Max tokens for text generation
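
As a rough sketch, the optional parameters in the table could be collected and sanity-checked on the client before sending. The helper name `build_form_fields` and the range checks are assumptions for illustration; the API's actual limits are not documented here.

```python
def build_form_fields(text="", voice="af_heart", speed=1.0, max_length=500):
    """Validate optional parameters and return them as form fields.

    Defaults mirror the parameter table; the accepted ranges below are
    illustrative assumptions, not documented API limits.
    """
    if speed <= 0:
        raise ValueError("speed must be a positive multiplier")
    if max_length < 1:
        raise ValueError("max_length must be at least 1 token")
    # Multipart form values are sent as strings.
    return {
        "text": text,
        "voice": voice,
        "speed": str(float(speed)),
        "max_length": str(int(max_length)),
    }
```

The returned dict can be passed as `data=` to `requests.post`, next to the `files=` argument that carries `audio_file`.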

Response

{
  "generated_text": "The audio contains someone saying hello...",
  "audio_base64": "UklGR...",
  "audio_sample_rate": 24000,
  "audio_format": "wav"
}
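
A minimal sketch of how a client might unpack this response, assuming `audio_base64` holds a complete WAV file (the `UklGR` prefix is what a base64-encoded `RIFF` header looks like). The helper name `decode_audio_response` is hypothetical and not part of the API.

```python
import base64
import io
import wave


def decode_audio_response(payload: dict) -> tuple[str, bytes, int]:
    """Return (text, wav_bytes, sample_rate) from a response dict.

    Hypothetical helper: field names follow the sample response above.
    """
    text = payload["generated_text"]
    wav_bytes = base64.b64decode(payload["audio_base64"])

    # Read the sample rate from the WAV header itself and compare it with
    # the value the API reported, so a mismatch is caught early.
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        header_rate = wav.getframerate()
    reported_rate = payload.get("audio_sample_rate", header_rate)
    if reported_rate != header_rate:
        raise ValueError(
            f"sample rate mismatch: header {header_rate}, reported {reported_rate}"
        )
    return text, wav_bytes, header_rate
```

Checking the header against `audio_sample_rate` is optional, but it helps avoid playing the audio back at the wrong speed.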

Example: Python

import base64
import requests
import os

API_KEY = os.environ["TEXT_GENERATOR_API_KEY"]  # raises KeyError if the key is not set
headers = {"secret": API_KEY}

with open("question.wav", "rb") as f:
    resp = requests.post(
        "https://api.text-generator.io/api/v1/multimodal-generate",
        files={"audio_file": ("question.wav", f, "audio/wav")},
        data={"text": "Answer this question", "voice": "af_heart"},
        headers=headers,
        timeout=120,
    )

resp.raise_for_status()  # fail early on HTTP errors instead of parsing an error body
data = resp.json()
print("Response:", data["generated_text"])

# Save the spoken response
audio_bytes = base64.b64decode(data["audio_base64"])
with open("response.wav", "wb") as f:
    f.write(audio_bytes)

Also new: Gemini-powered Speech-to-Text

Our existing /api/v1/audio-extraction endpoint now supports a "model": "gemini" option alongside the existing Whisper backend. Gemini excels at multilingual audio and noisy environments.

import requests

headers = {"secret": "YOUR_SECRET_HERE"}
data = {
    "audio_url": "https://example.com/audio.mp3",
    "model": "gemini"
}
resp = requests.post(
    "https://api.text-generator.io/api/v1/audio-extraction",
    json=data,
    headers=headers,
    timeout=120,
)
resp.raise_for_status()  # fail early on HTTP errors
print(resp.json()["text"])

Kokoro TTS: Fast, Natural Speech

The audio output is generated by Kokoro-82M, an 82M-parameter TTS model that produces natural-sounding speech at 24 kHz. It supports 8 languages and 54+ voices.

Available voice families:

American English: af_heart, af_bella, af_nova, am_adam, am_echo, am_michael, and more
British English: bf_*, bm_* voices
Spanish, French, Italian, Japanese, Portuguese, Chinese
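
The voice names listed above appear to follow a two-letter prefix pattern. The sketch below assumes the first letter marks the accent (`a` for American English, `b` for British English) and the second the gender (`f` or `m`). That reading is inferred from the names, so treat the mapping and the `describe_voice` helper as illustrative only; prefixes for the other languages are left as "unknown".

```python
# Assumed prefix meanings, inferred from the voice names listed above.
ACCENTS = {"a": "American English", "b": "British English"}
GENDERS = {"f": "female", "m": "male"}


def describe_voice(voice: str) -> dict:
    """Split a voice name such as 'af_heart' into its assumed parts."""
    prefix, _, name = voice.partition("_")
    if len(prefix) != 2 or not name:
        raise ValueError(f"unexpected voice name: {voice!r}")
    return {
        "name": name,
        "language": ACCENTS.get(prefix[0], "unknown"),
        "gender": GENDERS.get(prefix[1], "unknown"),
    }
```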

Use Cases

Voice assistants -- send a spoken question, get a spoken answer
Audio translation -- send foreign audio, get English text + speech
Podcast Q&A -- ask questions about audio content
Accessibility -- convert any audio interaction to text and back

Pricing

Each request costs 2c USD, which covers both the Gemma 4 inference and the Kokoro TTS generation. The free tier includes 50 requests per month.
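
For budgeting, a back-of-the-envelope estimate might look like the sketch below. It assumes the 50 free requests are deducted first and reset each month, which is an assumption about how billing works rather than a documented rule.

```python
FREE_REQUESTS_PER_MONTH = 50   # free tier, from the pricing above
PRICE_CENTS_PER_REQUEST = 2    # 2c USD per request


def monthly_cost_usd(requests_made: int) -> float:
    """Estimate the monthly bill, assuming free requests are used up first."""
    billable = max(0, requests_made - FREE_REQUESTS_PER_MONTH)
    return billable * PRICE_CENTS_PER_REQUEST / 100


print(monthly_cost_usd(1000))  # 950 billable requests at 2c each: 19.0
```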

Plug

We are on a mission to bring about affordable AI for everyone.

Text Generator offers APIs for text, speech, code generation, and now multimodal audio understanding. Secure, affordable, and accurate.

Try it yourself at: Text Generator Docs