Need help with training ASR model from scratch.

espnet

Mentions: 15 | Stars: 7,872 | Activity: 10.0 | Language: Python

End-to-End Speech Processing Toolkit

You actually don't need phone-level alignments for your data. Both hybrid and end-to-end approaches can work with utterance-level alignment, i.e. each audio utterance paired with its transcript. For the hybrid approach, you would need a lexicon that maps each unique word in your training transcriptions to its phone sequence. You can obtain this with CMU's tool. For the end-to-end approach, you will need a byte pair encoder to tokenize the words in the transcriptions into subwords.
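To make the two preparation steps above concrete, here is a rough sketch in plain Python. It is not taken from ESPnet or any other toolkit: the function names (build_lexicon, learn_bpe, merge_pair, apply_bpe) and the tiny pronunciation dictionary are made up for illustration, and the BPE part is a toy version of the usual merge-the-most-frequent-pair procedure. In practice you would more likely rely on an existing subword tokenizer such as SentencePiece.

```python
from collections import Counter


def build_lexicon(transcripts, pron_dict):
    """Map each unique training word to its phone sequence (hybrid approach).

    pron_dict is assumed to be a word -> list-of-phones mapping that you
    loaded from some pronunciation dictionary. Words that are missing from
    it are returned separately so they can be handled by hand or with G2P.
    """
    lexicon, oov = {}, set()
    for line in transcripts:
        for word in line.lower().split():
            if word in pron_dict:
                lexicon[word] = pron_dict[word]
            else:
                oov.add(word)
    return lexicon, sorted(oov)


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` by one symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(transcripts, num_merges):
    """Learn up to `num_merges` BPE merge rules from the transcripts."""
    # Each word starts as a tuple of characters plus an end-of-word marker.
    vocab = Counter(
        tuple(word) + ("</w>",)
        for line in transcripts
        for word in line.lower().split()
    )
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the new merge to every word in the vocabulary.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def apply_bpe(word, merges):
    """Split one word into subwords by replaying the learned merges in order."""
    symbols = tuple(word.lower()) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


if __name__ == "__main__":
    transcripts = ["the lower shelf", "the lowest shelf is lower"]
    # Hypothetical toy dictionary; a real one would cover far more words.
    toy_dict = {"the": ["DH", "AH"], "shelf": ["SH", "EH", "L", "F"]}
    lexicon, oov = build_lexicon(transcripts, toy_dict)
    print("lexicon:", lexicon)
    print("missing pronunciations:", oov)
    merges = learn_bpe(transcripts, num_merges=10)
    print("subwords:", apply_bpe("lowest", merges))
```

Either way, the only inputs are the utterance-level transcripts (plus a pronunciation dictionary for the hybrid case), which is why no phone-level alignment is needed up front.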

NeMo

Mentions: 29 | Stars: 10,021 | Activity: 9.8 | Language: Python

A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)

This is a relatively small amount of speech for training a model from scratch, but you can initialize training from another pre-trained model. There are a number of end-to-end ASR toolkits that can be used for this: https://github.com/NVIDIA/NeMo and https://github.com/espnet/espnet

NOTE: The number of mentions on this list indicates mentions on common posts plus user suggested alternatives. Hence, a higher number means a more popular project.
