
MisoTTS

MisoTTS is an 8B-parameter text-to-speech model from Miso Labs for expressive speech and dialogue generation. Miso Labs says the weights are open source on Hugging Face, the model can condition on text and optional audio context, and API access is coming soon.

Why it matters

MisoTTS merits its own page because it is a recent voice-model release with primary-source technical detail, open weights, and early community attention around expressive speech generation. The page also gives GetLLMs a current speech-model explainer without implying that the API and catalog fields are stable enough for a `/models` directory record.

Source-backed summary

Miso Labs announced MisoTTS on June 3, 2026 as an 8B-parameter model for emotive speech and dialogue generation. The official blog describes a 7.7B-parameter backbone plus a 300M-parameter decoder, residual vector quantization with 32 audio codebooks, text and optional audio-context conditioning, current half-duplex limitations, open-source weights on Hugging Face, and API access coming soon. The Hugging Face model card confirms the MisoLabs/MisoTTS repository, the text-to-speech task, the model summary, and the license field, and states that the model is not deployed by an Inference Provider. The Hacker News discussion is cited only as evidence of recency and community interest.

Primary use cases

Track Miso Labs speech-model release facts and availability.
Compare expressive TTS models that use text plus audio context.
Understand RVQ-style audio tokenization in a reader-friendly way.
Decide whether to wait for hosted API access before production evaluation.

What Miso Labs confirms

The official release describes MisoTTS as an 8B-parameter speech model for expressive speech and dialogue generation. It generates from text and optional audio context, uses residual vector quantization, and is built around a 7.7B-parameter backbone with a 300M-parameter decoder.

Task: text-to-speech and expressive speech generation.
Architecture: a large temporal backbone plus a smaller depth decoder for audio codebooks.
Availability: open weights on Hugging Face; API access is described as coming soon.
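
As a rough back-of-the-envelope check on the stated sizes, the sketch below adds the backbone and decoder parameter counts and estimates weight-only memory. The bytes-per-parameter table is an assumption for illustration; the sources summarized here do not say which numeric precision the released weights use, and the estimate ignores activations and any runtime overhead.

```python
# Parameter counts as stated in the Miso Labs release.
BACKBONE_PARAMS = 7.7e9
DECODER_PARAMS = 0.3e9

# Assumption (not from the source): bytes per parameter for common dtypes.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "int8": 1}


def total_params() -> float:
    """Sum of the backbone and decoder parameter counts."""
    return BACKBONE_PARAMS + DECODER_PARAMS


def weight_memory_gb(dtype: str) -> float:
    """Approximate weight-only memory in GB, taking 1 GB as 1e9 bytes."""
    return total_params() * BYTES_PER_PARAM[dtype] / 1e9


if __name__ == "__main__":
    print(f"total parameters: {total_params() / 1e9:.1f}B")
    print(f"decoder share of parameters: {DECODER_PARAMS / total_params():.2%}")
    for dtype in BYTES_PER_PARAM:
        print(f"{dtype}: ~{weight_memory_gb(dtype):.0f} GB of weights")
```

The split shows that almost all of the parameter budget sits in the backbone, with the decoder accounting for only a few percent of the total.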

Why this is not a model-directory record yet

The Hugging Face card is useful for model identity, task, setup, and architecture, but it also says the model is not deployed by an Inference Provider. Until a stable public API, provider ID, pricing, limits, and hosted availability are verified, this should remain an entity explainer rather than a structured `/models` catalog entry.
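
The rule above can be sketched as a simple completeness check. This is an illustrative sketch only: `HostedListing`, its field names, and the two page-type labels are hypothetical and do not describe an actual GetLLMs schema. A field left as `None` stands for "not yet verified".

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class HostedListing:
    """Hypothetical directory-grade metadata; None means 'not yet verified'."""
    public_api_url: Optional[str] = None
    provider_id: Optional[str] = None
    pricing: Optional[str] = None
    rate_limits: Optional[str] = None
    hosted_availability: Optional[str] = None


def missing_fields(listing: HostedListing) -> List[str]:
    """Names of the fields that still lack a verified value."""
    return [f.name for f in fields(listing) if getattr(listing, f.name) is None]


def page_type(listing: HostedListing) -> str:
    """Promote to a catalog entry only when every field is verified."""
    if missing_fields(listing):
        return "entity-explainer"
    return "models-catalog-entry"


if __name__ == "__main__":
    # Nothing is verified yet: API access is described as coming soon and the
    # model card says the model is not deployed by an Inference Provider.
    misotts = HostedListing()
    print(page_type(misotts), missing_fields(misotts))
```

Requiring every field, rather than a majority, mirrors the conservative stance of this page: a single unverified item such as pricing is enough to keep the entry out of the structured catalog.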

Limits and caveats

Miso Labs says the current model handles individual turns and half-duplex audio, but does not yet solve turn-taking or full-duplex conversation. Treat quality and expressiveness claims as vendor-provided until independent evaluations and provider listings mature.

Related concepts

AI Model API

The API-selection layer where hosted availability, model IDs, pricing, limits, and provider fields need verification.

Image-to-3D

Another modality-expansion concept where directory records depend on stable provider metadata.

Related entities

MOSS-TTS

OpenMOSS speech-generation family with flagship, local, realtime, dialogue, voice-design, and Nano variants.

MAI-Thinking-1

Recent model entity that also separates launch facts from directory-grade API metadata.

MAI-Code-1-Flash

Recent Microsoft model entity with deferred `/models` catalog status.

Sources

Releasing the MisoTTS. Miso Labs. Source confidence: official-docs.
MisoLabs/MisoTTS model card. Hugging Face / MisoLabs. Source confidence: official-docs.
MisoTTS Hacker News discussion. Hacker News. Source confidence: kol-community.

MisoTTS FAQ

Page-level questions for MisoTTS.

What is MisoTTS?

MisoTTS is an 8B-parameter text-to-speech model from Miso Labs for expressive speech and dialogue generation. It can generate speech from text and optional audio context, with open weights available on Hugging Face.

Can I use MisoTTS through a hosted API?

Not yet, as far as the sources show. Miso Labs says API access is coming soon, and the Hugging Face model card says the model is not deployed by an Inference Provider, so GetLLMs does not list it as a stable catalog record. Production users should recheck hosted availability before planning around it.

Why does MisoTTS use residual vector quantization?

Miso Labs uses residual vector quantization to represent audio with multiple codebook indices instead of one very large flat vocabulary. The goal is to cover more speech variation while keeping the model architecture practical.
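
To make the idea concrete, here is a minimal pure-Python sketch of generic residual vector quantization. It is not Miso Labs' implementation. The 32 stages match the codebook count in the release, but the codebook size (16), the vector dimension (4), and the random untrained codebooks are assumptions chosen for illustration. Each toy codebook also includes a zero entry so that adding a stage can never increase the reconstruction error.

```python
import random
from typing import List

Vector = List[float]
Codebook = List[Vector]


def squared_error(a: Vector, b: Vector) -> float:
    return sum((p - q) ** 2 for p, q in zip(a, b))


def make_toy_codebooks(num_stages: int, size: int, dim: int, seed: int = 0) -> List[Codebook]:
    """Random (untrained) codebooks whose scale halves at every stage."""
    rng = random.Random(seed)
    codebooks = []
    for stage in range(num_stages):
        scale = 0.5 ** stage
        entries = [[0.0] * dim]  # zero entry: a stage may choose to add nothing
        for _ in range(size - 1):
            entries.append([rng.uniform(-scale, scale) for _ in range(dim)])
        codebooks.append(entries)
    return codebooks


def rvq_encode(x: Vector, codebooks: List[Codebook]) -> List[int]:
    """Greedy encoding: each stage quantizes what earlier stages left over."""
    residual = list(x)
    indices = []
    for codebook in codebooks:
        best = min(range(len(codebook)), key=lambda i: squared_error(codebook[i], residual))
        indices.append(best)
        residual = [r - c for r, c in zip(residual, codebook[best])]
    return indices


def rvq_decode(indices: List[int], codebooks: List[Codebook]) -> Vector:
    """Reconstruction is the sum of the selected entry from each stage."""
    out = [0.0] * len(codebooks[0][0])
    for i, codebook in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, codebook[i])]
    return out


if __name__ == "__main__":
    codebooks = make_toy_codebooks(num_stages=32, size=16, dim=4)
    x = [0.3, -0.7, 0.5, 0.1]
    indices = rvq_encode(x, codebooks)
    for n in (1, 4, 32):
        approx = rvq_decode(indices[:n], codebooks[:n])
        print(f"stages used: {n:2d}  squared error: {squared_error(x, approx):.3e}")
    print("distinct index combinations:", 16 ** 32)
```

The last line illustrates the trade-off described above: 32 small codebooks can address far more distinct combinations than any single flat vocabulary of practical size, while each individual prediction stays a choice among a small number of entries.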