# Audio Capabilities

The RouteLLM API provides first-class support for audio understanding (speech input / audio analysis) and audio generation (text-to-speech). Both capabilities use the same unified `/v1/chat/completions` endpoint.
## Supported Models

**OpenAI Audio**

| Model ID | Aliases | Capabilities |
|---|---|---|
| `gpt-4o-audio-preview` | `gpt-audio`, `gpt-4o-audio-preview-2024-12-17` | Audio input + Audio output |
| `gpt-4o-mini-audio-preview` | `gpt-audio-mini` | Audio input + Audio output |

**Google Gemini TTS**

| Model ID | Capabilities |
|---|---|
| `gemini-2.5-flash-preview-tts` | Audio output (TTS) |
| `gemini-2.5-pro-preview-tts` | Audio output (TTS) |
Gemini TTS models are dedicated text-to-speech models and do not support audio input. For audio understanding with Gemini, use the standard Gemini chat models (e.g., `gemini-2.5-pro`, `gemini-2.5-flash`).
## Capability Summary

| Capability | Supported Models |
|---|---|
| Audio Input (Understanding) | `gpt-4o-audio-preview`, `gpt-4o-mini-audio-preview`, `gemini-2.5-pro`, `gemini-2.5-flash` |
| Audio Output (TTS) | `gpt-4o-audio-preview`, `gpt-4o-mini-audio-preview`, `gemini-2.5-flash-preview-tts`, `gemini-2.5-pro-preview-tts` |
**Auto-routing:** If audio input is detected in the `messages` array and no explicit model is specified, the API automatically routes the request to `gpt-4o-audio-preview`.
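The routing rule can be illustrated with a small client-side sketch. This mirrors the documented behavior only; it is not the server's implementation, and the function name is hypothetical:

```python
def pick_default_model(messages):
    """Return the documented default audio model when any message part
    is audio input; otherwise None (no audio-specific default applies)."""
    for message in messages:
        content = message.get("content")
        if isinstance(content, list):
            for part in content:
                if part.get("type") == "input_audio":
                    return "gpt-4o-audio-preview"
    return None

messages = [{"role": "user", "content": [
    {"type": "input_audio", "input_audio": {"data": "<b64>", "format": "mp3"}},
    {"type": "text", "text": "What is being said?"},
]}]
```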
## Pricing

| Model | Input (per 1K tokens) | Output (per 1K tokens) |
|---|---|---|
| `gpt-4o-audio-preview` | $0.0025 | $0.010 |
| `gpt-4o-mini-audio-preview` | $0.00015 | $0.0006 |
| `gemini-2.5-flash-preview-tts` | $0.0005 | $0.002 |
| `gemini-2.5-pro-preview-tts` | $0.001 | $0.005 |
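For budgeting, the per-1K rates above translate into a simple estimate. This helper is hypothetical (not part of the API); actual billing may weight audio tokens differently:

```python
# Per-1K-token rates (USD) from the pricing table above.
RATES = {
    "gpt-4o-audio-preview": (0.0025, 0.010),
    "gpt-4o-mini-audio-preview": (0.00015, 0.0006),
    "gemini-2.5-flash-preview-tts": (0.0005, 0.002),
    "gemini-2.5-pro-preview-tts": (0.001, 0.005),
}

def estimate_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Estimated USD cost for one request, using the table's per-1K rates."""
    input_rate, output_rate = RATES[model]
    return (prompt_tokens / 1000) * input_rate + (completion_tokens / 1000) * output_rate
```

For example, a `gpt-4o-audio-preview` call with 1,000 prompt tokens and 1,000 completion tokens comes to $0.0125.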
## Audio Understanding (Audio Input)

Send audio clips as part of the conversation using the `input_audio` content type. The model will transcribe, analyze, and respond to the audio content.
### Request Schema

```json
{
  "model": "gpt-4o-audio-preview",
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "input_audio",
          "input_audio": {
            "data": "<base64-encoded-audio>",
            "format": "mp3"
          }
        },
        {
          "type": "text",
          "text": "What is being said in this audio clip?"
        }
      ]
    }
  ]
}
```
### `input_audio` Content Item Fields

| Field | Type | Required | Description |
|---|---|---|---|
| `data` | string | Yes | Base64-encoded audio data. |
| `format` | string | Yes | Audio format. Supported: `wav`, `mp3`, `ogg`, `flac`, `webm`, `m4a`, `aac`, `pcm`, `mpga`. |
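A small client-side helper (hypothetical, not part of any SDK) can encode raw audio bytes and sanity-check the format before building the content item:

```python
import base64

# Formats from the field table above.
SUPPORTED_FORMATS = {"wav", "mp3", "ogg", "flac", "webm", "m4a", "aac", "pcm", "mpga"}

def make_input_audio_item(raw_bytes: bytes, fmt: str) -> dict:
    """Base64-encode raw audio bytes into an input_audio content item."""
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported audio format: {fmt}")
    return {
        "type": "input_audio",
        "input_audio": {
            "data": base64.b64encode(raw_bytes).decode("utf-8"),
            "format": fmt,
        },
    }
```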
### Code Examples
- Python SDK
- TypeScript/JavaScript
- cURL
```python
import base64
from openai import OpenAI

client = OpenAI(
    base_url="<your base url>",
    api_key="<your_api_key>",
)

# Load and encode audio file
with open("audio_clip.mp3", "rb") as f:
    audio_data = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o-audio-preview",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "input_audio",
                    "input_audio": {
                        "data": audio_data,
                        "format": "mp3"
                    }
                },
                {
                    "type": "text",
                    "text": "Transcribe this audio and summarize the key points."
                }
            ]
        }
    ]
)

print(response.choices[0].message.content)
```
```typescript
import OpenAI from 'openai';
import fs from 'fs';

const openai = new OpenAI({
  baseURL: '<your base url>',
  apiKey: '<your_api_key>',
});

const audioData = fs.readFileSync('audio_clip.mp3').toString('base64');

const response = await openai.chat.completions.create({
  model: 'gpt-4o-audio-preview',
  messages: [
    {
      role: 'user',
      content: [
        {
          type: 'input_audio',
          input_audio: {
            data: audioData,
            format: 'mp3'
          }
        } as any,
        {
          type: 'text',
          text: 'Transcribe this audio and summarize the key points.'
        }
      ]
    }
  ]
});

console.log(response.choices[0].message.content);
```
```bash
# -i is the macOS base64 flag; on Linux use: base64 -w 0 audio_clip.mp3
AUDIO_B64=$(base64 -i audio_clip.mp3)

curl -X POST "<your base url>/chat/completions" \
  -H "Authorization: Bearer <your_api_key>" \
  -H "Content-Type: application/json" \
  -d "{
    \"model\": \"gpt-4o-audio-preview\",
    \"messages\": [
      {
        \"role\": \"user\",
        \"content\": [
          {
            \"type\": \"input_audio\",
            \"input_audio\": {
              \"data\": \"${AUDIO_B64}\",
              \"format\": \"mp3\"
            }
          },
          {
            \"type\": \"text\",
            \"text\": \"Transcribe this audio and summarize the key points.\"
          }
        ]
      }
    ]
  }"
```
## Audio Generation (Text-to-Speech)
Generate spoken audio from text using OpenAI GPT-4o Audio models or Google Gemini TTS models.
### Request Schema

```json
{
  "model": "gpt-4o-audio-preview",
  "messages": [
    {
      "role": "user",
      "content": "Read the following announcement aloud: Welcome to Abacus AI."
    }
  ],
  "modalities": ["text", "audio"],
  "audio": {
    "voice": "alloy",
    "format": "mp3"
  }
}
```
**Important:** For OpenAI audio models, `modalities` must be `["text", "audio"]`; both values must be specified together. Using `["audio"]` alone returns a validation error.
### `audio` Parameter Fields

| Field | Type | Required | Description |
|---|---|---|---|
| `voice` | string | Yes | Voice to use for audio generation. See Available Voices. |
| `format` | string | No | Output audio format: `mp3` (default), `wav`, `opus`, `aac`, `flac`. |
### Available Voices

The same voice names work across both OpenAI and Gemini TTS models. The API automatically maps them to the appropriate native voice.

| Voice | Character | Gemini Equivalent |
|---|---|---|
| `alloy` | Neutral, balanced | Kore |
| `echo` | Male, soft | Charon |
| `fable` | Expressive, British | Puck |
| `onyx` | Deep, authoritative | Fenrir |
| `nova` | Female, energetic | Aoede |
| `shimmer` | Female, warm | Zephyr |
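The table's mapping can be expressed as a simple lookup. The API performs this mapping server-side; the sketch below just restates the table for reference:

```python
# OpenAI voice name -> Gemini native voice, per the table above.
VOICE_MAP = {
    "alloy": "Kore",
    "echo": "Charon",
    "fable": "Puck",
    "onyx": "Fenrir",
    "nova": "Aoede",
    "shimmer": "Zephyr",
}

def gemini_voice(openai_voice: str) -> str:
    """Resolve an OpenAI-style voice name to its Gemini equivalent."""
    try:
        return VOICE_MAP[openai_voice]
    except KeyError:
        raise ValueError(f"unknown voice: {openai_voice}") from None
```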
### Response Format

The audio data is returned as base64-encoded content in the `audio` field of the response message.

**Non-streaming:**
```json
{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1677858242,
  "model": "gpt-4o-audio-preview",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Welcome to Abacus AI.",
        "audio": {
          "data": "<base64-encoded-audio>",
          "format": "mp3"
        }
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 15,
    "completion_tokens": 30,
    "total_tokens": 45
  }
}
```
**Streaming (`stream: true`):** Audio data is delivered incrementally in `delta.audio` of each chunk:

```text
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1677858242,"model":"gpt-4o-audio-preview","choices":[{"index":0,"delta":{"role":"assistant","content":"Welcome","audio":{"data":"<partial-base64>","format":"mp3"}},"finish_reason":null}]}
```
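When consuming the raw SSE stream directly (rather than through an SDK), each `data:` line can be parsed as follows. This is a sketch assuming the chunk shape shown above:

```python
import json

def extract_audio_b64(sse_line: str) -> str:
    """Pull the partial base64 audio payload out of one SSE data line.

    Returns an empty string for [DONE] sentinels and chunks without audio.
    """
    if not sse_line.startswith("data: ") or sse_line.strip() == "data: [DONE]":
        return ""
    chunk = json.loads(sse_line[len("data: "):])
    choices = chunk.get("choices") or []
    if not choices:
        return ""
    audio = choices[0].get("delta", {}).get("audio") or {}
    return audio.get("data", "")
```

Concatenate the extracted fragments in order, then base64-decode the joined string to recover the full audio file.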
### Code Examples
- OpenAI Audio (Python)
- Gemini TTS (Python)
- TypeScript/JavaScript
- cURL
```python
import base64
from openai import OpenAI

client = OpenAI(
    base_url="<your base url>",
    api_key="<your_api_key>",
)

response = client.chat.completions.create(
    model="gpt-4o-audio-preview",
    messages=[
        {
            "role": "user",
            "content": "Read the following announcement aloud: Welcome to Abacus AI."
        }
    ],
    modalities=["text", "audio"],
    audio={"voice": "alloy", "format": "mp3"}
)

# Extract and save generated audio (message.audio is a typed object, not a dict)
audio_data = response.choices[0].message.audio.data
with open("output.mp3", "wb") as f:
    f.write(base64.b64decode(audio_data))

print("Audio saved to output.mp3")
print("Text response:", response.choices[0].message.content)
```
```python
import base64
from openai import OpenAI

client = OpenAI(
    base_url="<your base url>",
    api_key="<your_api_key>",
)

response = client.chat.completions.create(
    model="gemini-2.5-flash-preview-tts",
    messages=[
        {
            "role": "user",
            "content": "Say this in a friendly tone: Hello! How can I assist you today?"
        }
    ],
    modalities=["text", "audio"],
    audio={"voice": "nova", "format": "mp3"}
)

# message.audio is a typed object, not a dict
audio_data = response.choices[0].message.audio.data
with open("output.mp3", "wb") as f:
    f.write(base64.b64decode(audio_data))

print("Audio saved to output.mp3")
```
```typescript
import OpenAI from 'openai';
import fs from 'fs';

const openai = new OpenAI({
  baseURL: '<your base url>',
  apiKey: '<your_api_key>',
});

const response = await openai.chat.completions.create({
  model: 'gpt-4o-audio-preview',
  messages: [
    {
      role: 'user',
      content: 'Read the following announcement aloud: Welcome to Abacus AI.'
    }
  ],
  modalities: ['text', 'audio'],
  audio: { voice: 'alloy', format: 'mp3' }
} as any);

const audioData = (response.choices[0].message as any).audio?.data;
if (audioData) {
  fs.writeFileSync('output.mp3', Buffer.from(audioData, 'base64'));
  console.log('Audio saved to output.mp3');
}
console.log('Text response:', response.choices[0].message.content);
```
```bash
curl -X POST "<your base url>/chat/completions" \
  -H "Authorization: Bearer <your_api_key>" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o-audio-preview",
    "messages": [
      {
        "role": "user",
        "content": "Read the following announcement aloud: Welcome to Abacus AI."
      }
    ],
    "modalities": ["text", "audio"],
    "audio": {
      "voice": "alloy",
      "format": "mp3"
    }
  }' | jq -r '.choices[0].message.audio.data' | base64 --decode > output.mp3
```
## Streaming Audio Generation
- Python SDK
```python
import base64
from openai import OpenAI

client = OpenAI(
    base_url="<your base url>",
    api_key="<your_api_key>",
)

audio_chunks = []

# stream=True yields raw chat.completion.chunk objects with a .delta
# (the .stream() helper yields higher-level events instead)
stream = client.chat.completions.create(
    model="gpt-4o-audio-preview",
    messages=[
        {
            "role": "user",
            "content": "Narrate a short story about a robot exploring the ocean."
        }
    ],
    modalities=["text", "audio"],
    audio={"voice": "fable", "format": "mp3"},
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta if chunk.choices else None
    audio = getattr(delta, "audio", None) if delta else None
    if audio:
        audio_chunks.append(audio.get("data", ""))

# Combine and save all audio chunks
full_audio = base64.b64decode("".join(audio_chunks))
with open("streamed_output.mp3", "wb") as f:
    f.write(full_audio)

print("Streamed audio saved to streamed_output.mp3")
```
## Combining Audio Input and Output

Send an audio clip as input and request an audio response in the same call, enabling full voice-to-voice interactions.
- Python SDK
```python
import base64
from openai import OpenAI

client = OpenAI(
    base_url="<your base url>",
    api_key="<your_api_key>",
)

# Load input audio
with open("user_question.wav", "rb") as f:
    input_audio = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o-audio-preview",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "input_audio",
                    "input_audio": {
                        "data": input_audio,
                        "format": "wav"
                    }
                }
            ]
        }
    ],
    modalities=["text", "audio"],
    audio={"voice": "nova", "format": "mp3"}
)

# Save audio response (message.audio is a typed object, not a dict)
audio_data = response.choices[0].message.audio.data
with open("assistant_response.mp3", "wb") as f:
    f.write(base64.b64decode(audio_data))

print("Voice response saved to assistant_response.mp3")
print("Text transcript:", response.choices[0].message.content)
```
## Validation and Error Guidance

| Scenario | Resolution |
|---|---|
| Audio input sent to a non-audio model | Switch to `gpt-4o-audio-preview` or another audio-capable model |
| `modalities: ["audio"]` without `"text"` on OpenAI models | Use `modalities: ["text", "audio"]` |
| `modalities: ["text", "audio"]` on a non-audio model | Switch to `gpt-4o-audio-preview`, `gpt-4o-mini-audio-preview`, or a Gemini TTS model |
| Invalid base64 in `input_audio.data` | Ensure the audio is correctly base64-encoded |
| Unsupported audio format | Use one of: `wav`, `mp3`, `ogg`, `flac`, `webm`, `m4a`, `aac`, `pcm`, `mpga` |
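Several of these rules can be checked client-side before a request is sent. The pre-flight helper below is hypothetical, using the model sets from Capability Summary above:

```python
# Model sets from the Capability Summary table above.
AUDIO_INPUT_MODELS = {
    "gpt-4o-audio-preview", "gpt-4o-mini-audio-preview",
    "gemini-2.5-pro", "gemini-2.5-flash",
}
AUDIO_OUTPUT_MODELS = {
    "gpt-4o-audio-preview", "gpt-4o-mini-audio-preview",
    "gemini-2.5-flash-preview-tts", "gemini-2.5-pro-preview-tts",
}

def preflight_check(model: str, has_audio_input: bool, modalities=None) -> list:
    """Return a list of problems that would trigger API validation errors."""
    problems = []
    if has_audio_input and model not in AUDIO_INPUT_MODELS:
        problems.append(f"{model} does not accept audio input")
    if modalities and "audio" in modalities:
        if model not in AUDIO_OUTPUT_MODELS:
            problems.append(f"{model} cannot generate audio")
        if model.startswith("gpt-") and "text" not in modalities:
            problems.append('OpenAI models require modalities ["text", "audio"]')
    return problems
```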