Text to Speech

Learn how to turn text into lifelike spoken audio.

Overview

The Audio API provides a speech endpoint based on our TTS (text-to-speech) model. It comes with 6 built-in voices and can be used to:

  • Narrate a written blog post
  • Produce spoken audio in multiple languages
  • Provide real-time audio output using streaming

Quickstart

The speech endpoint takes in three key inputs: the model, the text that should be turned into audio, and the voice to be used for the audio generation.

Generate spoken audio from input text

import fs from "fs";
import path from "path";
import OpenAI from "openai";

const openai = new OpenAI();
const speechFile = path.resolve("./speech.mp3");

// Request spoken audio for the input text
const mp3 = await openai.audio.speech.create({
  model: "tts-1",
  voice: "alloy",
  input: "Today is a wonderful day to build something people love!",
});

// Write the binary audio response to disk
const buffer = Buffer.from(await mp3.arrayBuffer());
await fs.promises.writeFile(speechFile, buffer);

Voice Options

Six built-in voices are available, each optimized for natural-sounding speech in English:

  • alloy
  • echo
  • fable
  • onyx
  • nova
  • shimmer

Streaming Real-Time Audio

The Speech API supports real-time audio streaming using chunked transfer encoding. This means audio can be played back before the full file has been generated.

Streaming example

from openai import OpenAI

client = OpenAI()

# Stream the generated audio straight to a file as chunks arrive,
# instead of buffering the entire response first
with client.audio.speech.with_streaming_response.create(
    model="tts-1",
    voice="alloy",
    input="Hello world! This is a streaming test.",
) as response:
    response.stream_to_file("output.mp3")

Supported Formats

MP3

Default format, widely supported

Opus

For internet streaming, low latency

AAC

Preferred by YouTube, Android, iOS

FLAC

Lossless audio compression

WAV

Uncompressed audio, low latency

PCM

24 kHz, 16-bit signed raw samples

FAQ

How can I control the emotional range of the generated audio?

There is no direct mechanism to control the emotional output. Factors like capitalization or grammar may influence the output but results may vary.

Can I create a custom copy of my own voice?

No, this is not currently supported.

Do I own the outputted audio files?

Yes. As with all outputs from our API, the person who created them owns the output. You must inform end users that they are hearing AI-generated audio.