Open Source · MIT Licensed

Text-to-speech that actually runs anywhere.

112M parameters. CPU-only. Faster than realtime. Streams with word-level timestamps. Runs on iOS. Zero-shot cloning with no reference transcript. No buzz.

Try the Demo ↓ View on GitHub
112M
Parameters
0.31×
RTF on CPU
<150ms
First chunk latency
$42/mo
Total infra cost. Unlimited.
Features
Everything you asked for.
Nothing you didn't.

Every feature is a direct answer to something broken in existing open TTS.

Zero-shot voice cloning

Drop in a reference clip. That's it. No transcript of the reference needed — Aria figures it out on its own.

Paralinguistic tags

Write [laughs], [sighs], [coughs] inline. Your voices actually sound human.
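As a rough illustration, inline tags like these can be pulled from a script with a simple regex before synthesis. A minimal sketch — the tag names come from the spec sheet, but this parsing is not part of Aria's API:

```python
import re

TAG_RE = re.compile(r"\[([^\]]+)\]")

def split_tags(text: str):
    """Return (plain_text, tags) for a script containing inline
    paralinguistic tags such as [laughs] or [clears throat]."""
    tags = TAG_RE.findall(text)
    # Replace each tag with a space, then normalize whitespace.
    plain = " ".join(TAG_RE.sub(" ", text).split())
    return plain, tags

plain, tags = split_tags("Sure [laughs] why not [clears throat] let's go.")
# tags -> ['laughs', 'clears throat']
```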

Streaming + word highlights

Sub-200ms to first chunk. Word-level timestamps stream alongside audio — no post-processing.
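A client consuming that stream might look like this. A hedged sketch: the event shapes below are assumptions for illustration, not Aria's actual wire format:

```python
from typing import Iterable, List, Tuple

def collect_stream(events: Iterable[tuple]) -> Tuple[bytes, List[tuple]]:
    """Demultiplex an interleaved stream of ('audio', chunk_bytes) and
    ('word', word, start_seconds) events into raw audio plus a
    timestamp list. Event shapes are hypothetical."""
    audio = bytearray()
    words = []
    for event in events:
        if event[0] == "audio":
            audio.extend(event[1])              # append the next audio chunk
        elif event[0] == "word":
            words.append((event[1], event[2]))  # word arrives with its start time
    return bytes(audio), words

fake_stream = [("audio", b"\x00\x01"), ("word", "hello", 0.0),
               ("audio", b"\x02"), ("word", "world", 0.41)]
audio, words = collect_stream(fake_stream)
# words -> [('hello', 0.0), ('world', 0.41)]
```

Because timestamps ride alongside the audio, word highlighting needs no post-processing pass.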

Runs on CPU

2.7–3.2× realtime on an m6a.large. Two vCPUs. 8 GB RAM. No GPU. $42/mo reserved. Not a typo.

Runs on iOS

ONNX Runtime. On-device inference. Private. No round trip. 112M parameters fit comfortably on your device.

Actually open

MIT license. Weights included. No "open" with asterisks. No usage caps. No phone-home. Do what you want.

EOS gating via alignment

The alignment head tells us exactly where we are in the text at every step. While text remains unspoken, we gate the end-of-speech logit entirely — the model cannot stop early. No other open TTS does this.
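The gating step reduces to masking a single logit. A minimal sketch, assuming the alignment head exposes how many text tokens have been spoken so far (all names hypothetical):

```python
import math

def gate_eos(logits, eos_id, spoken_tokens, total_tokens):
    """While any text token remains unspoken, set the end-of-speech
    logit to -inf so EOS can never be sampled."""
    if spoken_tokens < total_tokens:
        logits = list(logits)
        logits[eos_id] = -math.inf  # EOS is impossible mid-utterance
    return logits

# Mid-utterance: the EOS logit is masked out.
masked = gate_eos([0.2, 0.1, 0.3], eos_id=2, spoken_tokens=5, total_tokens=9)
# masked[2] -> -inf
```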

Forced stop, squashed hallucinations

Once the alignment head confirms every token has been spoken, we force EOS immediately. The model never rambles, never repeats, never invents. Alignment-gated generation is a hard guarantee, not a sampling trick.
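The forcing side is the mirror image of the gate: a sketch of one decode step under the same hypothetical alignment signal (sample_fn stands in for the real sampler):

```python
def choose_next(logits, eos_id, spoken_tokens, total_tokens, sample_fn):
    """Alignment-gated decode step: force EOS the moment every text
    token has been spoken; otherwise defer to the sampler."""
    if spoken_tokens >= total_tokens:
        return eos_id          # hard stop: no rambling past the text
    return sample_fn(logits)

# Everything spoken -> EOS, regardless of what the sampler would pick.
token = choose_next([0.1, 0.9], eos_id=0, spoken_tokens=4, total_tokens=4,
                    sample_fn=lambda l: 1)
# token -> 0
```

Together with the gate, this makes stopping a deterministic consequence of alignment state rather than a sampling outcome.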

Small models sound small.
Aria doesn't.

Every lightweight TTS model you've tried is buzzy. Thin. Artifacts everywhere. We don't accept that tradeoff. 32 kHz clean audio. Full frequency response. No buzz. 112M parameters shouldn't sound this good — but it does.

Benchmarks
Receipts.

Real numbers. One m6a.large. CPU only. No tricks.

Short — "The quick brown fox jumps over the lazy dog."
Audio duration: 2.85 s
Inference time: 0.88 s
Prefill: 50.3 ms
Decode: 56 steps · 13.5 ms mean
RTF: 0.309×  (3.2× realtime)
Long — multi-clause paragraph (232 decode steps)
Audio duration: 11.65 s
Inference time: 4.35 s
Prefill: 175.5 ms
Decode: 232 steps · 17.7 ms mean
RTF: 0.374×  (2.7× realtime)
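RTF is just inference time over audio duration, so the figures above can be rechecked directly:

```python
def rtf(inference_s: float, audio_s: float) -> float:
    """Real-time factor: generation time per second of audio.
    Below 1.0 means faster than realtime."""
    return inference_s / audio_s

short = rtf(0.88, 2.85)     # ~0.309, i.e. 2.85 / 0.88 ~= 3.2x realtime
long = rtf(4.35, 11.65)     # ~0.37, i.e. ~2.7x realtime
```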
Try it. Right now.

This is the actual inference server. This page is hosted right now on an m6a.large. Type something. Watch the word timestamps stream in real time. Listen.

aria — inference console
Specifications
The full sheet.
Parameters: ~112M
Architecture: Autoregressive transformer · 8 layers · 960 dims
Audio output: 32 kHz · 3-codebook RVQ (1024 / 4096 / 8192)
Codec framerate: 20 fps (50 ms per frame)
RTF (CPU): 0.31–0.37× on m6a.large (2.7–3.2× realtime)
First chunk latency: <150 ms (KV-cached reference)
Decode step: 13–18 ms mean
Zero-shot: Reference audio only · no transcript required
Word timestamps: Streaming · parallel to audio
Paralinguistic: [laughs] [sighs] [coughs] [gasps] [clears throat] and more
Deployment: CPU (ONNX) · iOS (ONNX) · GPU
Language: English
License: MIT
Codec: Released separately · link coming soon
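The spec numbers hang together: at 20 fps each codec frame covers 50 ms of audio, so a 13–18 ms decode step leaves comfortable headroom for realtime generation. A quick check:

```python
FPS = 20
frame_ms = 1000 / FPS        # 50 ms of audio produced per decode step
worst_decode_ms = 18         # upper end of the measured decode range
# Even in the worst case, each step generates audio ~2.8x faster
# than that audio takes to play back.
headroom = frame_ms / worst_decode_ms
```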

Coming Soon.

We're not quite ready to open the doors yet.
Check back soon.