
Scaling Speech AI with Ease: How Valohai Supercharges NVIDIA NeMo
by Toni Perämäki | on June 06, 2025

Large Language Models aren't just for text anymore. With NVIDIA NeMo, enterprises are unlocking the power of generative AI for speech, translation, and audio understanding. But building real-world, production-grade pipelines? That’s where the pain begins.
Unless you bring Valohai to the party.
In this technical spotlight, we’re showcasing how you can take the state-of-the-art NeMo ASR (Automatic Speech Recognition) pipeline and wrap it in a reproducible, versioned, and scalable Valohai workflow, from dataset preprocessing to fine-tuning and evaluation. This is not just theory. This is a working, production-style example straight from our GitHub to yours.
Why NeMo?
NVIDIA NeMo is one of the most powerful open-source toolkits for building, training, and deploying large-scale speech and language models. With NeMo, you can:
- Fine-tune large ASR models like QuartzNet and Conformer-CTC
- Leverage pre-trained models from NVIDIA’s Model Catalog
- Scale training on multi-GPU or distributed compute environments
In short: you get world-class speech AI, backed by NVIDIA’s research muscle.
But NeMo, by itself, doesn’t solve all your MLOps headaches.
Enter Valohai
Valohai is the only MLOps platform built from the ground up for deep learning workflows. We don’t pretend notebooks are pipelines. We don’t settle for local hacks. We give you:
- Automatic pipeline versioning for every experiment and dataset
- Reproducibility across cloud or on-prem – no more “it worked on my machine”
- Seamless orchestration of NeMo workflows across Docker, Git, and GPU clusters
And yes, it’s fully hybrid. Train on your infra, deploy on any cloud, and keep your data where it belongs.
The Project: NeMo + Valohai in Action
This example project shows you how to:
1. Preprocess Audio Data
Use Valohai to convert raw .wav files and generate manifests in a format that NeMo can digest – reliably and repeatably.
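NeMo expects training data described by a JSON-lines manifest, one entry per clip with its audio path, duration, and transcript. Here’s a minimal sketch of what that preprocessing step might do (the directory layout and transcript lookup are illustrative assumptions, not the example repo’s exact code):

```python
import json
import wave
from pathlib import Path

def build_manifest(audio_dir: str, transcripts: dict, manifest_path: str) -> int:
    """Write a NeMo-style JSON-lines manifest for every .wav in audio_dir.

    `transcripts` maps file stems to reference text; in a real pipeline
    this would come from your dataset's annotation files (placeholder here).
    Returns the number of entries written.
    """
    count = 0
    with open(manifest_path, "w") as out:
        for wav_path in sorted(Path(audio_dir).glob("*.wav")):
            # Duration in seconds, read from the WAV header.
            with wave.open(str(wav_path), "rb") as wav:
                duration = wav.getnframes() / wav.getframerate()
            entry = {
                "audio_filepath": str(wav_path),
                "duration": round(duration, 3),
                "text": transcripts.get(wav_path.stem, ""),
            }
            out.write(json.dumps(entry) + "\n")
            count += 1
    return count
```

Running this as a Valohai execution means the resulting manifest is versioned as an output artifact, so every downstream training run knows exactly which data it saw.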
2. Train a QuartzNet ASR Model
Spin up a training job with NeMo’s CLI, powered by Valohai’s GPU-capable pipelines. You define your hyperparameters and configs in Git. Valohai handles the execution.
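In Valohai, that Git-defined configuration lives in a `valohai.yaml` file. A sketch of what a training step could look like (the image tag, script name, input URI, and parameter names below are placeholders, not the example project’s exact config):

```yaml
# Illustrative valohai.yaml training step.
- step:
    name: train-quartznet
    image: nvcr.io/nvidia/nemo:24.05  # placeholder NeMo container tag
    command:
      - python train.py {parameters}
    inputs:
      - name: train-manifest
        default: datum://train_manifest.json  # versioned dataset artifact
    parameters:
      - name: epochs
        type: integer
        default: 50
      - name: learning_rate
        type: float
        default: 0.01
```

Because the step is declared in Git, changing a hyperparameter is a commit, and Valohai records which commit produced which model.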
3. Evaluate & Predict
Run model evaluation and prediction steps in isolation or in sequence. Chain them into a pipeline or run A/B tests on multiple checkpoints.
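The headline metric in ASR evaluation is word error rate (WER). NeMo ships its own WER utilities, which the pipeline uses in practice; as a minimal, dependency-free illustration of what the metric computes:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / reference words,
    computed as word-level Levenshtein edit distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)
```

Comparing this number across checkpoints in a pipeline is exactly the A/B scenario above: run the same evaluation step against two models and let Valohai keep the scores side by side.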
4. Track Everything
Every artifact, every parameter, every run: logged, versioned, and reproducible by default. Even your random seed is tracked.
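Much of that tracking is automatic, and runtime metrics are just as easy: Valohai collects JSON objects printed to stdout as run metadata. A small sketch (the metric names are illustrative):

```python
import json

def log_metadata(step: int, **metrics) -> str:
    """Print a JSON line that Valohai picks up as run metadata.

    Metric names (e.g. loss, wer) are examples, not a fixed schema.
    """
    record = {"step": step, **metrics}
    line = json.dumps(record, sort_keys=True)
    print(line)
    return line

# Example: called once per epoch inside the training loop.
log_metadata(1, loss=0.42, wer=0.18)
```

Those values then show up as live metric charts in the Valohai UI, attached to the exact execution that produced them.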
Why This Combo Matters
Let’s be honest – most MLOps setups still rely on hand-stitched scripts, notebooks, and wishful thinking. But when you combine NeMo’s raw modeling power with Valohai’s reproducibility and automation, you get a solution that’s ready for:
- Enterprise-scale experimentation
- Regulated environments (yes, reproducibility matters)
- Fast onboarding of new team members
- True separation of code and infrastructure
This is what MLOps is supposed to look like.
Get Started Now
Ready to take speech AI into production – without reinventing the wheel?
➡️ Check out the example project on GitHub
➡️ Sign up for Valohai and run your first NeMo job in minutes
➡️ Explore NVIDIA Partnership
➡️ Or drop us a line. We’ll show you how Valohai + NeMo can cut your dev time in half.
Speech AI at scale. MLOps that works. That’s the Valohai way.