
Audio Capabilities

The RouteLLM API provides first-class support for audio understanding (speech input / audio analysis) and audio generation (text-to-speech). Both capabilities use the same unified /v1/chat/completions endpoint.

Supported Models

| Model ID | Aliases | Capabilities |
|---|---|---|
| gpt-4o-audio-preview | gpt-audio, gpt-4o-audio-preview-2024-12-17 | Audio input + audio output |
| gpt-4o-mini-audio-preview | gpt-audio-mini | Audio input + audio output |

Capability Summary

| Capability | Supported Models |
|---|---|
| Audio input (understanding) | gpt-4o-audio-preview, gpt-4o-mini-audio-preview, gemini-2.5-pro, gemini-2.5-flash |
| Audio output (TTS) | gpt-4o-audio-preview, gpt-4o-mini-audio-preview, gemini-2.5-flash-preview-tts, gemini-2.5-pro-preview-tts |

Auto-routing: If audio input is detected in the messages array and no explicit model is specified, the API automatically routes the request to gpt-4o-audio-preview.
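The auto-routing rule above amounts to a simple check: if any message carries an `input_audio` content item and no model was specified, fall back to the audio default. The sketch below is illustrative only (the function name is ours, not part of the API), but it encodes the documented behavior:

```python
DEFAULT_AUDIO_MODEL = "gpt-4o-audio-preview"

def resolve_model(request: dict) -> str:
    """Pick the model for a request, auto-routing audio input as documented above."""
    if request.get("model"):
        return request["model"]
    for message in request.get("messages", []):
        content = message.get("content")
        if isinstance(content, list) and any(
            item.get("type") == "input_audio" for item in content
        ):
            return DEFAULT_AUDIO_MODEL
    raise ValueError("No model specified and no audio input detected")
```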

Pricing

| Model | Input (per 1K tokens) | Output (per 1K tokens) |
|---|---|---|
| gpt-4o-audio-preview | $0.0025 | $0.010 |
| gpt-4o-mini-audio-preview | $0.00015 | $0.0006 |
| gemini-2.5-flash-preview-tts | $0.0005 | $0.002 |
| gemini-2.5-pro-preview-tts | $0.001 | $0.005 |
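As a quick sanity check on the table: a gpt-4o-audio-preview request with 2,000 prompt tokens and 500 completion tokens costs 2 × $0.0025 + 0.5 × $0.010 = $0.01. A minimal sketch of that arithmetic (rates copied from the table; the helper name is our own):

```python
# Per-1K-token (input_rate, output_rate) in dollars, from the pricing table above.
RATES = {
    "gpt-4o-audio-preview": (0.0025, 0.010),
    "gpt-4o-mini-audio-preview": (0.00015, 0.0006),
    "gemini-2.5-flash-preview-tts": (0.0005, 0.002),
    "gemini-2.5-pro-preview-tts": (0.001, 0.005),
}

def estimate_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Estimate request cost in dollars from token counts."""
    input_rate, output_rate = RATES[model]
    return (prompt_tokens / 1000) * input_rate + (completion_tokens / 1000) * output_rate
```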

Audio Understanding (Audio Input)

Send audio clips as part of the conversation using the input_audio content type. The model will transcribe, analyze, and respond to the audio content.

Request Schema

```json
{
  "model": "gpt-4o-audio-preview",
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "input_audio",
          "input_audio": {
            "data": "<base64-encoded-audio>",
            "format": "mp3"
          }
        },
        {
          "type": "text",
          "text": "What is being said in this audio clip?"
        }
      ]
    }
  ]
}
```

input_audio Content Item Fields

| Field | Type | Required | Description |
|---|---|---|---|
| data | string | Yes | Base64-encoded audio data. |
| format | string | Yes | Audio format. Supported: wav, mp3, ogg, flac, webm, m4a, aac, pcm, mpga. |
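A small helper can build this content item from a local file, inferring `format` from the file extension. This is an illustrative convenience sketch (the helper name and extension handling are our own, not part of any SDK):

```python
import base64
from pathlib import Path

# The supported `format` values from the table above.
SUPPORTED_FORMATS = {"wav", "mp3", "ogg", "flac", "webm", "m4a", "aac", "pcm", "mpga"}

def input_audio_item(path: str) -> dict:
    """Build an input_audio content item from an audio file on disk."""
    fmt = Path(path).suffix.lstrip(".").lower()
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(f"Unsupported audio format: {fmt}")
    data = base64.b64encode(Path(path).read_bytes()).decode("utf-8")
    return {"type": "input_audio", "input_audio": {"data": data, "format": fmt}}
```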

Code Examples

```python
import base64
from openai import OpenAI

client = OpenAI(
    base_url="<your base url>",
    api_key="<your_api_key>",
)

# Load and encode audio file
with open("audio_clip.mp3", "rb") as f:
    audio_data = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o-audio-preview",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "input_audio",
                    "input_audio": {
                        "data": audio_data,
                        "format": "mp3",
                    },
                },
                {
                    "type": "text",
                    "text": "Transcribe this audio and summarize the key points.",
                },
            ],
        }
    ],
)

print(response.choices[0].message.content)
```

Audio Generation (Text-to-Speech)

Generate spoken audio from text using OpenAI GPT-4o Audio models or Google Gemini TTS models.

Request Schema

```json
{
  "model": "gpt-4o-audio-preview",
  "messages": [
    {
      "role": "user",
      "content": "Read the following announcement aloud: Welcome to Abacus AI."
    }
  ],
  "modalities": ["text", "audio"],
  "audio": {
    "voice": "alloy",
    "format": "mp3"
  }
}
```

Important: For OpenAI audio models, modalities must be ["text", "audio"] — both must be specified together. Using ["audio"] alone will return a validation error.

audio Parameter Fields

| Field | Type | Required | Description |
|---|---|---|---|
| voice | string | Yes | Voice to use for audio generation. See Available Voices. |
| format | string | No | Output audio format: mp3 (default), wav, opus, aac, flac. |

Available Voices

The same voice names work across both OpenAI and Gemini TTS models. The API automatically maps them to the appropriate native voice.

| Voice | Character | Gemini Equivalent |
|---|---|---|
| alloy | Neutral, balanced | Kore |
| echo | Male, soft | Charon |
| fable | Expressive, British | Puck |
| onyx | Deep, authoritative | Fenrir |
| nova | Female, energetic | Aoede |
| shimmer | Female, warm | Zephyr |
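The mapping above can be expressed directly in code, for example to log which native Gemini voice a request will actually use. The dictionary is copied from the table; the helper name is our own:

```python
# Shared voice name -> native Gemini voice, per the table above.
GEMINI_VOICE_MAP = {
    "alloy": "Kore",
    "echo": "Charon",
    "fable": "Puck",
    "onyx": "Fenrir",
    "nova": "Aoede",
    "shimmer": "Zephyr",
}

def gemini_voice(voice: str) -> str:
    """Return the native Gemini voice a shared voice name maps to."""
    try:
        return GEMINI_VOICE_MAP[voice]
    except KeyError:
        raise ValueError(f"Unknown voice: {voice}") from None
```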

Response Format

The audio data is returned as base64-encoded content in the audio field of the response message.

Non-streaming:

```json
{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1677858242,
  "model": "gpt-4o-audio-preview",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Welcome to Abacus AI.",
        "audio": {
          "data": "<base64-encoded-audio>",
          "format": "mp3"
        }
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 15,
    "completion_tokens": 30,
    "total_tokens": 45
  }
}
```

Streaming (stream: true): Audio data is delivered incrementally in delta.audio of each chunk:

```
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1677858242,"model":"gpt-4o-audio-preview","choices":[{"index":0,"delta":{"role":"assistant","content":"Welcome","audio":{"data":"<partial-base64>","format":"mp3"}},"finish_reason":null}]}
```
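When consuming the stream without an SDK, each SSE `data:` line can be parsed and the partial audio accumulated. A minimal parsing sketch (the function name is our own; it assumes the chunk shape shown above):

```python
import json

def extract_audio_delta(sse_line: str) -> str:
    """Pull the base64 audio fragment out of one SSE `data:` line, if any."""
    payload = sse_line.removeprefix("data: ").strip()
    if not payload or payload == "[DONE]":
        return ""
    chunk = json.loads(payload)
    choices = chunk.get("choices") or []
    if not choices:
        return ""
    audio = choices[0].get("delta", {}).get("audio") or {}
    return audio.get("data", "")
```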

Code Examples

```python
import base64
from openai import OpenAI

client = OpenAI(
    base_url="<your base url>",
    api_key="<your_api_key>",
)

response = client.chat.completions.create(
    model="gpt-4o-audio-preview",
    messages=[
        {
            "role": "user",
            "content": "Read the following announcement aloud: Welcome to Abacus AI.",
        }
    ],
    modalities=["text", "audio"],
    audio={"voice": "alloy", "format": "mp3"},
)

# Extract and save generated audio
audio_data = response.choices[0].message.audio["data"]
with open("output.mp3", "wb") as f:
    f.write(base64.b64decode(audio_data))

print("Audio saved to output.mp3")
print("Text response:", response.choices[0].message.content)
```

Streaming Audio Generation

Pass stream=True to receive audio incrementally, then concatenate the base64 fragments from each chunk's delta.audio:

```python
import base64
from openai import OpenAI

client = OpenAI(
    base_url="<your base url>",
    api_key="<your_api_key>",
)

audio_chunks = []

stream = client.chat.completions.create(
    model="gpt-4o-audio-preview",
    messages=[
        {
            "role": "user",
            "content": "Narrate a short story about a robot exploring the ocean.",
        }
    ],
    modalities=["text", "audio"],
    audio={"voice": "fable", "format": "mp3"},
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta if chunk.choices else None
    audio = getattr(delta, "audio", None) if delta else None
    if audio:
        audio_chunks.append(audio.get("data", ""))

# Combine and save all audio chunks
full_audio = base64.b64decode("".join(audio_chunks))
with open("streamed_output.mp3", "wb") as f:
    f.write(full_audio)

print("Streamed audio saved to streamed_output.mp3")
```

Combining Audio Input and Output

Send an audio clip as input and request an audio response in the same call — enabling full voice-to-voice interactions.

```python
import base64
from openai import OpenAI

client = OpenAI(
    base_url="<your base url>",
    api_key="<your_api_key>",
)

# Load input audio
with open("user_question.wav", "rb") as f:
    input_audio = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o-audio-preview",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "input_audio",
                    "input_audio": {
                        "data": input_audio,
                        "format": "wav",
                    },
                }
            ],
        }
    ],
    modalities=["text", "audio"],
    audio={"voice": "nova", "format": "mp3"},
)

# Save audio response
audio_data = response.choices[0].message.audio["data"]
with open("assistant_response.mp3", "wb") as f:
    f.write(base64.b64decode(audio_data))

print("Voice response saved to assistant_response.mp3")
print("Text transcript:", response.choices[0].message.content)
```

Validation and Error Guidance

| Scenario | Resolution |
|---|---|
| Audio input sent to a non-audio model | Switch to gpt-4o-audio-preview or another audio-capable model |
| modalities: ["audio"] without "text" on OpenAI models | Use modalities: ["text", "audio"] |
| modalities: ["text", "audio"] on a non-audio model | Switch to gpt-4o-audio-preview, gpt-4o-mini-audio-preview, or a Gemini TTS model |
| Invalid base64 in input_audio.data | Ensure the audio is correctly base64-encoded |
| Unsupported audio format | Use one of: wav, mp3, ogg, flac, webm, m4a, aac, pcm, mpga |
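Some of these errors can be caught client-side before a request is sent. The sketch below is illustrative (the function is our own, not part of any SDK) and encodes only the modalities and format rules from the table above:

```python
SUPPORTED_FORMATS = {"wav", "mp3", "ogg", "flac", "webm", "m4a", "aac", "pcm", "mpga"}

def validate_audio_request(request: dict) -> list:
    """Return a list of problems that would trigger the validation errors above."""
    problems = []
    if request.get("modalities") == ["audio"]:
        problems.append('modalities must be ["text", "audio"], not ["audio"] alone')
    for message in request.get("messages", []):
        content = message.get("content")
        if not isinstance(content, list):
            continue
        for item in content:
            if item.get("type") != "input_audio":
                continue
            fmt = item.get("input_audio", {}).get("format")
            if fmt not in SUPPORTED_FORMATS:
                problems.append(f"unsupported audio format: {fmt}")
    return problems
```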