You don't actually need phone-level alignment for your data. Both hybrid and end-to-end approaches can work with utterance-level alignment. For the hybrid approach, you need a lexicon that maps each unique word in your training transcriptions to its phone sequence. You can obtain this with CMU's tool. For the end-to-end approach, you need a byte pair encoder to tokenize the words in the transcriptions into subwords.
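To make the two preprocessing steps concrete, here is a rough sketch in plain Python. It assumes the pronunciations come in the CMU Pronouncing Dictionary text layout (`WORD PH1 PH2 ...`); the excerpt, the sample transcripts, and all function names are made up for illustration. The BPE part is a toy learner, not what any particular toolkit ships; in practice you would likely use the tokenizer that comes with your toolkit.

```python
from collections import Counter

# Hypothetical excerpt, assumed to follow the "WORD PH1 PH2 ..." layout.
CMUDICT_EXCERPT = """\
HELLO  HH AH0 L OW1
WORLD  W ER1 L D
SPEECH  S P IY1 CH
"""

def parse_pron_dict(text):
    """Parse 'WORD PH1 PH2 ...' lines into {word: [phones]}."""
    prons = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(";;;"):  # skip blanks and comments
            continue
        word, *phones = line.split()
        prons.setdefault(word.lower(), phones)  # keep first pronunciation
    return prons

def build_lexicon(transcripts, prons):
    """Hybrid approach: map each unique transcript word to its phones.
    Returns (lexicon, oov); oov lists words with no pronunciation."""
    vocab = sorted({w.lower() for t in transcripts for w in t.split()})
    lexicon = {w: prons[w] for w in vocab if w in prons}
    oov = [w for w in vocab if w not in prons]
    return lexicon, oov

def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def learn_bpe(words, num_merges):
    """End-to-end approach: learn BPE merges from a list of words."""
    vocab = Counter(tuple(w) + ("</w>",) for w in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        merged = Counter()
        for symbols, freq in vocab.items():
            merged[merge_pair(symbols, best)] += freq
        vocab = merged
    return merges

def apply_bpe(word, merges):
    """Split a word into subwords by replaying the learned merges in order."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)

transcripts = ["hello world", "hello speech recognition"]
lexicon, oov = build_lexicon(transcripts, parse_pron_dict(CMUDICT_EXCERPT))
merges = learn_bpe(["low", "low", "lower", "lowest"], num_merges=2)
```

Words that end up in `oov` would need a pronunciation from somewhere else (a G2P model or manual entry) before hybrid training. Note that neither step needs any timing information, only the utterance-level transcripts.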
This is a relatively small amount of speech for training a model from scratch, but you can initialize from a pre-trained model and continue training on your data. A number of end-to-end ASR toolkits can be used for this, for example https://github.com/NVIDIA/NeMo and https://github.com/espnet/espnet