Open Source · MIT Licensed

Text-to-speech that actually runs anywhere.

112M parameters. CPU-only. Faster than realtime. Streams with word-level timestamps. Runs on iOS. Zero-shot cloning with no reference transcript. No buzz.

Try the Demo ↓ View on GitHub
112M
Parameters
0.31×
RTF on CPU
<150ms
First chunk latency
$42/mo
Total infra cost. Unlimited.
Features
Everything you asked for.
Nothing you didn't.

Every feature is a direct answer to something broken in existing open TTS.

Zero-shot voice cloning

Drop in a reference clip. That's it. No transcript of the reference needed — Aria figures it out on its own.

Paralinguistic tags

Write [laughs], [sighs], [coughs] inline. Your voices actually sound human.
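As a rough illustration, inline tags like these can be pulled from a script with a simple regex before synthesis. A minimal sketch — the tag names come from the spec sheet, but this parsing is not part of Aria's API:

```python
import re

TAG_RE = re.compile(r"\[([^\]]+)\]")

def split_tags(text: str):
    """Return (plain_text, tags) for a script containing inline
    paralinguistic tags such as [laughs] or [clears throat]."""
    tags = TAG_RE.findall(text)
    # Replace each tag with a space, then normalize whitespace.
    plain = " ".join(TAG_RE.sub(" ", text).split())
    return plain, tags

plain, tags = split_tags("Sure [laughs] why not [clears throat] let's go.")
# tags -> ['laughs', 'clears throat']
```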

Streaming + word highlights

Sub-200ms to first chunk. Word-level timestamps stream alongside audio — no post-processing.
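A client consuming that stream might look like this. A hedged sketch: the event shapes below are assumptions for illustration, not Aria's actual wire format:

```python
from typing import Iterable, List, Tuple

def collect_stream(events: Iterable[tuple]) -> Tuple[bytes, List[tuple]]:
    """Demultiplex an interleaved stream of ('audio', chunk_bytes) and
    ('word', word, start_seconds) events into raw audio plus a
    timestamp list. Event shapes are hypothetical."""
    audio = bytearray()
    words = []
    for event in events:
        if event[0] == "audio":
            audio.extend(event[1])              # append the next audio chunk
        elif event[0] == "word":
            words.append((event[1], event[2]))  # word arrives with its start time
    return bytes(audio), words

fake_stream = [("audio", b"\x00\x01"), ("word", "hello", 0.0),
               ("audio", b"\x02"), ("word", "world", 0.41)]
audio, words = collect_stream(fake_stream)
# words -> [('hello', 0.0), ('world', 0.41)]
```

Because timestamps ride alongside the audio, word highlighting needs no post-processing pass.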

Runs on CPU

2.7–3.2× realtime on an m6a.large. Two vCPUs. 8 GB RAM. No GPU. $42/mo reserved. Not a typo.

Runs on iOS

ONNX Runtime. On-device inference. Private. No round trip. 112M parameters fit comfortably on your device.

Actually open

MIT license. Weights included. No "open" with asterisks. No usage caps. No phone-home. Do what you want.

EOS gating via alignment

The alignment head tells us exactly where we are in the text at every step. While text remains unspoken, we gate the end-of-speech logit entirely — the model cannot stop early. No other open TTS does this.
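The gating step reduces to masking a single logit. A minimal sketch, assuming the alignment head exposes how many text tokens have been spoken so far (all names hypothetical):

```python
import math

def gate_eos(logits, eos_id, spoken_tokens, total_tokens):
    """While any text token remains unspoken, set the end-of-speech
    logit to -inf so EOS can never be sampled."""
    if spoken_tokens < total_tokens:
        logits = list(logits)
        logits[eos_id] = -math.inf  # EOS is impossible mid-utterance
    return logits

# Mid-utterance: the EOS logit is masked out.
masked = gate_eos([0.2, 0.1, 0.3], eos_id=2, spoken_tokens=5, total_tokens=9)
# masked[2] -> -inf
```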

Forced stop, squashed hallucinations

Once the alignment head confirms every token has been spoken, we force EOS immediately. The model never rambles, never repeats, never invents. Alignment-gated generation is a hard guarantee, not a sampling trick.
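The forcing side is the mirror image of the gate: a sketch of one decode step under the same hypothetical alignment signal (sample_fn stands in for the real sampler):

```python
def choose_next(logits, eos_id, spoken_tokens, total_tokens, sample_fn):
    """Alignment-gated decode step: force EOS the moment every text
    token has been spoken; otherwise defer to the sampler."""
    if spoken_tokens >= total_tokens:
        return eos_id          # hard stop: no rambling past the text
    return sample_fn(logits)

# Everything spoken -> EOS, regardless of what the sampler would pick.
token = choose_next([0.1, 0.9], eos_id=0, spoken_tokens=4, total_tokens=4,
                    sample_fn=lambda l: 1)
# token -> 0
```

Together with the gate, this makes stopping a deterministic consequence of alignment state rather than a sampling outcome.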

Small models sound small.
Aria doesn't.

Every lightweight TTS model you've tried is buzzy. Thin. Artifacts everywhere. We don't accept that tradeoff. 32 kHz clean audio. Full frequency response. No buzz. 112M parameters shouldn't sound this good — but it does.

Benchmarks
Receipts.

Real numbers. One m6a.large. CPU only. No tricks.

Short — "The quick brown fox jumps over the lazy dog."
Audio duration: 2.85 s
Inference time: 0.88 s
Prefill: 50.3 ms
Decode: 56 steps · 13.5 ms mean
RTF: 0.309×  (3.2× realtime)
Long — multi-clause paragraph (232 decode steps)
Audio duration: 11.65 s
Inference time: 4.35 s
Prefill: 175.5 ms
Decode: 232 steps · 17.7 ms mean
RTF: 0.374×  (2.7× realtime)
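RTF is just inference time over audio duration, so the figures above can be rechecked directly:

```python
def rtf(inference_s: float, audio_s: float) -> float:
    """Real-time factor: generation time per second of audio.
    Below 1.0 means faster than realtime."""
    return inference_s / audio_s

short = rtf(0.88, 2.85)     # ~0.309, i.e. 2.85 / 0.88 ~= 3.2x realtime
long = rtf(4.35, 11.65)     # ~0.37, i.e. ~2.7x realtime
```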
Try it. Right now.

This is the actual inference server. This page is hosted right now on an m6a.large. Type something. Watch the word timestamps stream in real time. Listen.

aria — inference console
Specifications
The full sheet.
Parameters: ~112M
Architecture: Autoregressive transformer · 8 layers · 960 dims
Audio output: 32 kHz · 3-codebook RVQ (1024 / 4096 / 8192)
Codec framerate: 20 fps (50 ms per frame)
RTF (CPU): 0.31–0.37× on m6a.large (2.7–3.2× realtime)
First chunk latency: <150 ms (KV-cached reference)
Decode step: 13–18 ms mean
Zero-shot: Reference audio only · no transcript required
Word timestamps: Streaming · parallel to audio
Paralinguistic: [laughs] [sighs] [coughs] [gasps] [clears throat] and more
Deployment: CPU (ONNX) · iOS (ONNX) · GPU
Language: English
License: MIT
Codec: Released separately · link coming soon
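The spec numbers hang together: at 20 fps each codec frame covers 50 ms of audio, so a 13–18 ms decode step leaves comfortable headroom for realtime generation. A quick check:

```python
FPS = 20
frame_ms = 1000 / FPS        # 50 ms of audio produced per decode step
worst_decode_ms = 18         # upper end of the measured decode range
# Even in the worst case, each step generates audio ~2.8x faster
# than that audio takes to play back.
headroom = frame_ms / worst_decode_ms
```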

Coming Soon.

We're not quite ready to open the doors yet.
Check back soon.