
MOSS-TTS

MOSS-TTS is an open-source speech and sound generation model family from MOSI.AI and the OpenMOSS team, covering flagship text-to-speech, spoken-dialogue generation, voice design, real-time streaming TTS, sound effects, and lightweight local TTS variants.

Why it matters

MOSS-TTS matters because open voice generation is splitting into several jobs at once: high-quality voice cloning, long-form narration, multi-speaker dialogue, real-time voice agents, sound effects, and CPU-friendly local deployment. A single family page helps readers understand which variant solves which job before they choose a model record or runtime.

Source-backed summary

The OpenMOSS/MOSS-TTS repository describes MOSS-TTS as an Apache-2.0 open-source speech and sound generation family from MOSI.AI and OpenMOSS, with v1.5, Local Transformer v1.5, Nano, Realtime, TTSD, VoiceGenerator, and SoundEffect variants. The MOSS-TTS technical report explains the discrete-audio-token, autoregressive modeling recipe. Hugging Face model cards confirm the current v1.5 model identity and the Local Transformer v1.5 deployment path. The MOSS-TTS-Nano repository documents the 0.1B CPU-friendly branch. Official MOSI and OpenMOSS posts on X confirm how recent the v1.5 and Local Transformer v1.5 releases are. X and Reddit discussion is used only as evidence of demand: questions about local setup, language support, VRAM, Windows, Apple Silicon, voice cloning, and model comparisons.

Primary use cases

  • Generate multilingual text-to-speech with voice cloning.
  • Create long-form narration, audiobook, dubbing, podcast, and dialogue audio.
  • Build low-latency voice-agent speech output with streaming synthesis.
  • Run lightweight local TTS demos or services with the Nano branch.
  • Evaluate pronunciation control through Pinyin, IPA, duration, pause, and language-tag controls.
  • Compare open speech models before choosing a hosted API or local runtime.

What the family includes

OpenMOSS frames MOSS-TTS as a speech and sound generation family rather than one monolithic TTS checkpoint. The family separates flagship single-speaker TTS, long-form spoken dialogue, voice design, real-time streaming speech for voice agents, sound-effect generation, and lightweight local TTS.

  • MOSS-TTS v1.5: flagship voice cloning, long-form generation, multilingual synthesis, pronunciation control, duration control, and explicit pause control.
  • MOSS-TTS-Local-Transformer v1.5: 48 kHz stereo speech, native streaming, 31 languages, and a serving path through SGLang-Omni.
  • MOSS-TTS-Nano: 0.1B-parameter local TTS for real-time, CPU-friendly deployment and simple demos.
  • MOSS-TTSD, Realtime, VoiceGenerator, and SoundEffect: specialized branches for dialogue, voice agents, text-described voices, and sound generation.

Why v1.5 gets the model record

The structured model-directory record focuses on MOSS-TTS-v1.5 because it has a stable public model ID, a Hugging Face model card, documented capabilities, an Apache-2.0 license, and a direct Transformers loading path. The other variants are important, but they serve narrower needs; each should get a separate record only when a task calls for that variant or search demand is clearly variant-specific.

Local and serving angle

MOSS-TTS is especially useful for builders who care about local or controllable voice infrastructure. OpenMOSS documents llama.cpp, ONNX, SGLang-Omni, vLLM-Omni, and Nano CPU paths, while SGLang-Omni documents why the Local Transformer model needs a multi-stage speech-serving runtime rather than a plain text-only LLM loop.

Evidence caveat

Use OpenMOSS, MOSI, Hugging Face model cards, papers, and runtime documentation for facts about model IDs, license, architecture, supported features, and setup. Use official X posts for launch freshness and community X or Reddit sources only to understand what users ask about: which variant to run, whether it works on Windows or Apple Silicon, how much VRAM is needed, how voice cloning compares, and whether a smaller local model is enough.

MOSS-TTS FAQ

Page-level questions for MOSS-TTS.

What is MOSS-TTS?

MOSS-TTS is an open-source speech and sound generation model family from MOSI.AI and the OpenMOSS team. It includes variants for voice cloning, long-form text-to-speech, spoken dialogue, voice design, real-time streaming speech, sound effects, and small local deployment.

Which MOSS-TTS variant should I start with?

Start with MOSS-TTS-v1.5 when you want the flagship voice-cloning and long-form TTS model, MOSS-TTS-Local-Transformer-v1.5 when you need 48 kHz stereo streaming deployment, MOSS-TTS-Nano when CPU-friendly local TTS matters most, and MOSS-TTSD when your input is multi-speaker dialogue. Treat VoiceGenerator, Realtime, and SoundEffect as specialized branches for voice design, voice agents, and sound generation.
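The guidance above can be read as a simple lookup from need to variant. The following is a minimal Python sketch of that lookup, not an official tool: the `Need` enum, the `VARIANT_FOR_NEED` table, and the `pick_variant` function are hypothetical names introduced here for illustration, and the variant strings are the family names as written on this page, not verified Hugging Face model IDs.

```python
from enum import Enum


class Need(Enum):
    """Hypothetical labels for the needs named in the answer above."""
    FLAGSHIP_CLONING_LONG_FORM = "flagship voice cloning and long-form TTS"
    STEREO_48K_STREAMING = "48 kHz stereo streaming deployment"
    CPU_LOCAL = "CPU-friendly local TTS"
    MULTI_SPEAKER_DIALOGUE = "multi-speaker dialogue input"
    VOICE_DESIGN = "voice design"
    VOICE_AGENT = "voice agents"
    SOUND_GENERATION = "sound generation"


# Family names as they appear on this page (assumption: these are display
# names, not necessarily the exact repository or model identifiers).
VARIANT_FOR_NEED = {
    Need.FLAGSHIP_CLONING_LONG_FORM: "MOSS-TTS-v1.5",
    Need.STEREO_48K_STREAMING: "MOSS-TTS-Local-Transformer-v1.5",
    Need.CPU_LOCAL: "MOSS-TTS-Nano",
    Need.MULTI_SPEAKER_DIALOGUE: "MOSS-TTSD",
    Need.VOICE_DESIGN: "VoiceGenerator",
    Need.VOICE_AGENT: "Realtime",
    Need.SOUND_GENERATION: "SoundEffect",
}


def pick_variant(need: Need) -> str:
    """Return the suggested starting variant for a given need."""
    return VARIANT_FOR_NEED[need]


if __name__ == "__main__":
    # Print the full need-to-variant table.
    for need in Need:
        print(f"{need.value}: {pick_variant(need)}")
```

A flat table is used rather than branching logic because the answer maps each need to exactly one starting point; real projects with several needs at once would have to weigh the variants against each other.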

Can MOSS-TTS run locally?

Yes, but the right local path depends on the variant and hardware. OpenMOSS documents PyTorch, llama.cpp, ONNX, SGLang-Omni, vLLM-Omni, and Nano CPU-oriented paths, while community discussion focuses on VRAM, Windows setup, and whether the smaller Nano branch is enough for a given product.

Why does GetLLMs list MOSS-TTS-v1.5 as the model record instead of every variant?

GetLLMs lists MOSS-TTS-v1.5 first because it is the clearest current flagship record with a stable model ID, model card, license, and documented loading path. Other variants deserve separate records when the user specifically asks for that variant or when source evidence shows distinct search demand around setup, pricing, examples, or deployment.