Text to Speech
Learn how to turn text into lifelike spoken audio.
Overview
The Audio API provides a speech endpoint based on our text-to-speech (TTS) model. It comes with six built-in voices and can be used to:
- Narrate a written blog post
- Produce spoken audio in multiple languages
- Give real-time audio output using streaming
Quickstart
The speech endpoint takes in three key inputs: the model, the text that should be turned into audio, and the voice to be used for the audio generation.
Generate spoken audio from input text
```javascript
import fs from "fs";
import path from "path";
import OpenAI from "openai";

const openai = new OpenAI();
const speechFile = path.resolve("./speech.mp3");

// Request spoken audio for the input text.
const mp3 = await openai.audio.speech.create({
  model: "tts-1",
  voice: "alloy",
  input: "Today is a wonderful day to build something people love!",
});

// The response body is binary audio; write it to disk.
const buffer = Buffer.from(await mp3.arrayBuffer());
await fs.promises.writeFile(speechFile, buffer);
```
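For Python users, an equivalent sketch using the official `openai` Python package (assumes `OPENAI_API_KEY` is set in the environment; the `build_speech_request` helper is purely illustrative):

```python
from pathlib import Path

def build_speech_request(text, voice="alloy", model="tts-1"):
    """Assemble the three key inputs for the speech endpoint."""
    return {"model": model, "voice": voice, "input": text}

def synthesize(text, out_path="speech.mp3"):
    """Generate spoken audio for `text` and save it to disk."""
    from openai import OpenAI  # requires the `openai` package and an API key
    client = OpenAI()
    response = client.audio.speech.create(**build_speech_request(text))
    # The response body is binary audio data.
    Path(out_path).write_bytes(response.content)
```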
Voice Options
Six built-in voices are available, each optimized for natural-sounding speech in English:
- alloy
- echo
- fable
- onyx
- nova
- shimmer
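The quickest way to choose a voice is to render the same sentence with each one and listen. A small sketch (the `generate_voice_samples` helper is illustrative; it assumes the `openai` Python package and an API key):

```python
from pathlib import Path

# The six built-in voices exposed by the speech endpoint.
VOICES = ["alloy", "echo", "fable", "onyx", "nova", "shimmer"]

def generate_voice_samples(text, out_dir="voice_samples"):
    """Render the same sentence once per voice for side-by-side comparison."""
    from openai import OpenAI  # requires the `openai` package and an API key
    client = OpenAI()
    Path(out_dir).mkdir(exist_ok=True)
    for voice in VOICES:
        response = client.audio.speech.create(
            model="tts-1", voice=voice, input=text
        )
        (Path(out_dir) / f"{voice}.mp3").write_bytes(response.content)
```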
Streaming Real Time Audio
The Speech API supports real-time audio streaming using chunked transfer encoding. This means the audio can be played back before the full file has been generated.
Streaming example
```python
from openai import OpenAI

client = OpenAI()

# Stream the audio to disk as it is generated, rather than
# buffering the full response body in memory first.
with client.audio.speech.with_streaming_response.create(
    model="tts-1",
    voice="alloy",
    input="Hello world! This is a streaming test.",
) as response:
    response.stream_to_file("output.mp3")
```
Supported Formats
- MP3: the default format, widely supported
- Opus: for internet streaming, low latency
- AAC: preferred by YouTube, Android, and iOS
- FLAC: lossless audio compression
- WAV: uncompressed audio, low latency
- PCM: 24 kHz, 16-bit signed raw samples
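The output container is selected with the `response_format` parameter of the create call. A sketch (the `FORMATS` table and `speech_to_file` helper are illustrative; assumes the `openai` Python package and an API key):

```python
from pathlib import Path

# The supported response_format values, mapped to their typical use case.
FORMATS = {
    "mp3": "default, widely supported",
    "opus": "internet streaming, low latency",
    "aac": "YouTube, Android, iOS",
    "flac": "lossless compression",
    "wav": "uncompressed, low latency",
    "pcm": "24 kHz 16-bit signed raw samples",
}

def speech_to_file(text, fmt="opus", out_stem="speech"):
    """Request a specific container and save with a matching file extension."""
    if fmt not in FORMATS:
        raise ValueError(f"unsupported format: {fmt}")
    from openai import OpenAI  # requires the `openai` package and an API key
    client = OpenAI()
    response = client.audio.speech.create(
        model="tts-1",
        voice="alloy",
        input=text,
        response_format=fmt,
    )
    Path(f"{out_stem}.{fmt}").write_bytes(response.content)
```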
FAQ
How can I control the emotional range of the generated audio?
There is no direct mechanism to control the emotional range of the generated audio. Certain factors, such as capitalization or grammar in the input text, may influence the output, but results vary.
Can I create a custom copy of my own voice?
No, this is not currently supported.
Do I own the outputted audio files?
Yes. As with all outputs from our API, the person who created them owns the output. Note that you must inform end users that they are hearing AI-generated audio.