112M parameters. CPU-only. Faster than realtime. Streams with word-level timestamps. Runs on iOS. Zero-shot cloning with no reference transcript. No buzz.
Every feature is a direct answer to something broken in existing open TTS.
Drop in a reference clip. That's it. No transcript of the reference needed — Aria figures it out on its own.
Write [laughs], [sighs], [coughs] inline. Your voices actually sound human.
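
A minimal sketch of how a client might check inline tags before sending text, assuming tags are plain square-bracketed names. The `KNOWN_TAGS` set and the `split_inline_tags` helper are hypothetical and not part of Aria; the real tag list is longer than the five named on this page.

```python
import re

# Hypothetical tag set: only the tags named on this page. The real list may be longer.
KNOWN_TAGS = {"laughs", "sighs", "coughs", "gasps", "clears throat"}

# Matches either a bracketed tag or a run of text with no brackets in it.
TOKEN_RE = re.compile(r"\[([^\[\]]+)\]|([^\[\]]+)")


def split_inline_tags(text):
    """Split text into ("tag", name) and ("text", chunk) segments, in order."""
    segments = []
    for match in TOKEN_RE.finditer(text):
        tag, plain = match.group(1), match.group(2)
        if tag is not None:
            name = tag.strip().lower()
            if name in KNOWN_TAGS:
                segments.append(("tag", name))
            else:
                # Unknown bracketed text is kept as literal text, not dropped.
                segments.append(("text", f"[{tag}]"))
        elif plain.strip():
            segments.append(("text", plain.strip()))
    return segments


print(split_inline_tags("Well [laughs] that was close. [sighs]"))
# [('text', 'Well'), ('tag', 'laughs'), ('text', 'that was close.'), ('tag', 'sighs')]
```

Keeping unknown brackets as literal text is a design choice for this sketch: a typo like `[laughz]` shows up in the output instead of silently vanishing.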
Sub-200ms to first chunk. Word-level timestamps stream alongside audio — no post-processing.
2.7–3.2× realtime on an m6a.large. Two vCPUs. 8 GB RAM. No GPU. $42/mo reserved. Not a typo.
ONNX runtime. On-device inference. Private. No round trip. 112M params fit comfortably on your device.
MIT license. Weights included. No "open" with asterisks. No usage caps. No phone-home. Do what you want.
The alignment head tells us exactly where we are in the text at every step. While text remains unspoken, we gate the end-of-speech logit entirely — the model cannot stop early. No other open TTS does this.
Once the alignment head confirms every token has been spoken, we force EOS immediately. The model never rambles, never repeats, never invents. Alignment-gated generation is a hard guarantee, not a sampling trick.
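
The two rules above can be sketched as a logit gate. This is an illustration under stated assumptions, not Aria's actual code: `EOS_ID`, `gate_eos`, and the `tokens_spoken` / `tokens_total` counters are hypothetical names, and the alignment head is assumed to report how many text tokens have been spoken so far.

```python
import math

EOS_ID = 0  # hypothetical index of the end-of-speech logit


def gate_eos(logits, tokens_spoken, tokens_total, eos_id=EOS_ID):
    """Return a copy of logits with the EOS decision forced by alignment progress."""
    gated = list(logits)
    if tokens_spoken < tokens_total:
        # Text remains unspoken: EOS cannot be chosen by any sampler.
        gated[eos_id] = -math.inf
    else:
        # Every token has been spoken: EOS is the only possible choice.
        gated = [-math.inf] * len(gated)
        gated[eos_id] = 0.0
    return gated


def argmax(values):
    """Index of the largest value (greedy pick, for the demo only)."""
    return max(range(len(values)), key=values.__getitem__)


raw = [5.0, 1.0, 2.0]  # the model "wants" EOS (index 0) here
print(argmax(gate_eos(raw, tokens_spoken=3, tokens_total=10)))   # 2
print(argmax(gate_eos(raw, tokens_spoken=10, tokens_total=10)))  # 0
```

Because the gate sets logits to negative infinity, the masked option has zero probability after softmax at any temperature, which is what makes this a guarantee and not a sampling heuristic.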
Every lightweight TTS model you've tried is buzzy. Thin. Artifacts everywhere. We don't accept that tradeoff. 32 kHz clean audio. Full frequency response. No buzz. 112M parameters shouldn't sound this good, but they do.
Real numbers. One m6a.large. CPU only. No tricks.
This is the actual inference server. This page is hosted right now on an m6a.large. Type something. Watch the word timestamps stream in real time. Listen.
| Spec | Value |
| --- | --- |
| Parameters | ~112M |
| Architecture | Autoregressive transformer · 8 layers · 960 dims |
| Audio output | 32 kHz · 3-codebook RVQ (1024 / 4096 / 8192) |
| Codec framerate | 20 fps (50 ms per frame) |
| RTF (CPU) | 0.31–0.37× on m6a.large (2.7–3.2× realtime) |
| First chunk latency | <150 ms (KV-cached reference) |
| Decode step | 13–18 ms mean |
| Zero-shot | Reference audio only — no transcript required |
| Word timestamps | Streaming · parallel to audio |
| Paralinguistic | [laughs] [sighs] [coughs] [gasps] [clears throat] and more |
| Deployment | CPU (ONNX) · iOS (ONNX) · GPU |
| Language | English |
| License | MIT |
| Codec | Released separately — link coming soon |
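
The numbers in the table can be cross-checked with a little arithmetic. This is a sketch under two assumptions that the table does not state: each codec frame carries one index from each of the three codebooks, and one decode step produces one frame.

```python
codebook_sizes = [1024, 4096, 8192]
frames_per_second = 20

# Bits needed to index each codebook: 10 + 12 + 13 for these power-of-two sizes.
bits_per_frame = sum(size.bit_length() - 1 for size in codebook_sizes)
bitrate_bps = bits_per_frame * frames_per_second
print(bits_per_frame, bitrate_bps)  # 35 700

# Time budget per frame, and the share of it used by a 13-18 ms decode step.
frame_ms = 1000 / frames_per_second
decode_share = [ms / frame_ms for ms in (13, 18)]
print(frame_ms, [f"{s:.2f}" for s in decode_share])  # 50.0 ['0.26', '0.36']

# Realtime factor (RTF) and "x realtime" speed are reciprocals of each other.
speedup = [1 / rtf for rtf in (0.31, 0.37)]
print([f"{s:.1f}" for s in speedup])  # ['3.2', '2.7']
```

Under those assumptions the codec stream is about 700 bits per second, the decode step alone uses roughly a quarter to a third of each 50 ms frame, and an RTF of 0.31–0.37 is the same claim as 2.7–3.2× realtime.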
We're not quite ready to open the doors yet.
Check back soon.