Designed for everyone, everywhere
OpenSpeech provides reference implementations of various ASR (Automatic Speech Recognition) papers, supporting 20+ models and including recipes for three widely used languages.
PyTorch Lightning based framework
Multi-GPU and TPU support
Mixed-precision training
Hierarchical configuration management
Supports the LibriSpeech, AISHELL-1, and KsponSpeech datasets
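To give a feel for what "hierarchical configuration management" means here, the sketch below shows the general idea in plain Python: a base configuration is recursively merged with experiment-level overrides, with deeper levels taking precedence. This is a generic illustration using only the standard library, not OpenSpeech's actual configuration API; the config keys (`trainer`, `model`, etc.) are hypothetical.

```python
def merge(base: dict, override: dict) -> dict:
    """Recursively merge `override` into `base`; override values win."""
    out = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(out.get(key), dict):
            out[key] = merge(out[key], value)  # descend into nested sections
        else:
            out[key] = value
    return out

# Hypothetical base config shared across experiments.
base = {
    "trainer": {"precision": 32, "devices": 1},
    "model": {"name": "conformer", "encoder_dim": 256},
}

# Experiment-level overrides: enable mixed precision and multi-GPU.
experiment = {
    "trainer": {"precision": 16, "devices": 4},
}

config = merge(base, experiment)
print(config["trainer"])          # {'precision': 16, 'devices': 4}
print(config["model"]["name"])    # conformer
```

Keeping shared defaults in a base layer and confining per-experiment changes to small override files is what makes it practical to maintain recipes for 20+ models.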
Fork the project on GitHub
We welcome all contributions to OpenSpeech! Whether it’s a bug fix, documentation improvement, or a major feature addition, feel free to open an issue or submit a pull request.
View GitHub Project
Interested in hiring our expert?
We are offering the services of a Neural Speech Engineer Consultant who can help design, train, and deploy streaming deep learning speech systems for Speech-to-Text (S2T), Text-to-Speech (T2S), and Speech-to-Speech (S2S). With expertise in optimizing low-resource systems, our consultant will ensure high-performance solutions tailored to your needs.
Contact us
Speech-to-text (S2T)
Our neural speech-to-text training framework delivers state-of-the-art accuracy with remarkable efficiency, transforming raw audio into precise transcriptions effortlessly. Designed with scalability and adaptability in mind, it supports diverse languages, accents, and acoustic conditions, ensuring robust performance across real-world applications. With cutting-edge deep learning models and optimized training pipelines, our framework sets a new standard for speech recognition technology.
Text-to-speech (T2S)
Our neural text-to-speech (TTS) training framework is a state-of-the-art solution leveraging GAN-based architectures and seq2seq models to generate high-fidelity, natural speech with precise prosody and articulation. It supports discrete unit-based synthesis through vector quantization and self-supervised representation learning, enhancing phonetic control and speaker adaptation. With an incremental training paradigm, reinforcement learning-based optimization, and Transformer-based acoustic modeling, it ensures fast convergence, robustness, and superior synthesis quality, setting a new benchmark in neural TTS systems.
Speech-to-speech (S2S)
Our speech-to-speech (S2S) framework leverages advanced speech disentanglement techniques and semantic codecs to achieve high-fidelity voice transformation. It enables real-time streaming for voice conversion, accent adaptation, and seamless speech-to-speech translation while preserving speaker identity and natural prosody. By incorporating factorized speech representations and adaptive modulation, it ensures efficient and low-latency processing, making it a powerful solution for multilingual communication and personalized voice applications.
Built with flexibility
Our toolkit provides unparalleled flexibility, enabling you to design and train highly customized models that are perfectly suited to your unique needs and requirements. Whether you're optimizing for performance, resource efficiency, or specific use cases, our toolkit empowers you to achieve the best results.
50+ Recipes

Faster

Architectures