smol-audio: A Colab-Friendly Notebook Collection for Fine-Tuning Whisper, Parakeet, Voxtral, Granite Speech, and Audio Flamingo 3


Source: MarkTechPost

Audio AI has had a breakout year. Automatic speech recognition has gotten dramatically better with models like OpenAI’s Whisper variants, NVIDIA’s Parakeet, and Mistral’s Voxtral. Audio understanding stepped forward with models like NVIDIA’s Audio Flamingo 3. Dialogue-grade text-to-speech arrived via Nari Labs’ Dia-1.6B. And Meta shipped the Perception Encoder Audiovisual (PE-AV), a multimodal encoder capable of learning a shared embedding space across audio, video, and text. The frontier has never moved faster.

The catch? The practical knowledge required to actually work with these models — how to fine-tune them, adapt them to new languages, or run efficient inference — is scattered across GitHub issues, research blogs, and private notebooks that never see the light of day. If you are an ML engineer who just wants to fine-tune Whisper on a new domain or run zero-shot video classification with PE-AV, you are often starting from scratch.

That is the gap smol-audio is designed to close.

What is smol-audio?

Released under the Apache-2.0 license by the Deep-unlearning team, smol-audio is a flat repository of self-contained Jupyter notebooks, each focused on a single practical audio AI task. Every notebook is designed to be opened directly in Google Colab, requires no local GPU setup, and is built entirely on the Hugging Face ecosystem — specifically transformers, datasets, peft, and accelerate. Most recipes fit within a 16 GB Colab runtime, which means a free or standard Colab tier is sufficient for the majority of tasks.
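For orientation, a typical setup cell for these notebooks looks roughly like the following; the package list comes from the stack described above, while exact version pins and any notebook-specific extras are omitted:

```python
# Install the Hugging Face stack the notebooks rely on (versions unpinned here;
# individual notebooks may pin versions or add extra packages).
!pip install -q transformers datasets peft accelerate

import torch

# Confirm a GPU runtime is attached -- most recipes assume roughly 16 GB of GPU memory.
print(torch.cuda.get_device_name(0) if torch.cuda.is_available() else "No GPU runtime attached")
```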

The “flat repo” design is a deliberate choice. Rather than wrapping recipes inside a framework or hiding complexity behind convenience functions, smol-audio exposes every step. You can read the training loop, understand the data pipeline, and modify the configuration without reverse-engineering a library. For early-career engineers, that transparency is genuinely educational.

ASR Fine-Tuning: Whisper, Parakeet, Voxtral, and Granite Speech

The largest category in the repo today covers ASR fine-tuning across four distinct model families. Each requires meaningfully different handling.

The Whisper notebook covers fine-tuning using transformers and datasets, making it straightforward to adapt the encoder-decoder architecture to a custom language or narrow domain. Whisper uses a sequence-to-sequence approach, generating transcripts token by token — familiar territory for anyone who has worked with language models.
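As a minimal sketch of that sequence-to-sequence setup (using the public openai/whisper-small checkpoint and a dummy audio clip; the actual notebook's data pipeline and hyperparameters will differ):

```python
import numpy as np
from transformers import WhisperForConditionalGeneration, WhisperProcessor

# Load a small Whisper checkpoint plus its feature extractor / tokenizer.
processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

# One second of silence at 16 kHz stands in for a real training example.
audio = np.zeros(16000, dtype=np.float32)
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
labels = processor.tokenizer("ciao mondo", return_tensors="pt").input_ids

# Seq2seq training step: the decoder learns to emit the transcript token by token.
outputs = model(input_features=inputs.input_features, labels=labels)
print(outputs.loss)
```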

NVIDIA’s Parakeet uses a CTC (Connectionist Temporal Classification) architecture rather than a sequence-to-sequence setup. CTC is faster and lighter at inference because it predicts a token distribution for every audio frame and marginalizes over the possible alignments between frames and output tokens, instead of decoding autoregressively. The smol-audio notebook covers both full fine-tuning and LoRA (Low-Rank Adaptation) for Parakeet, which matters because fully fine-tuning large CTC models can be memory-intensive.
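The contrast with Whisper’s decoding is easiest to see in the loss itself. The sketch below uses PyTorch’s generic nn.CTCLoss on random tensors purely as an illustration; it is not the Parakeet training code:

```python
import torch
import torch.nn as nn

# Generic CTC illustration (not the Parakeet/NeMo API): the encoder emits a
# per-frame distribution over the vocabulary plus a blank token, and the CTC
# loss marginalizes over all alignments between frames and the target tokens.
batch, frames, vocab = 2, 50, 32                       # vocab index 0 is the CTC blank
log_probs = torch.randn(frames, batch, vocab).log_softmax(dim=-1)
targets = torch.randint(1, vocab, (batch, 12))         # target token ids (no blanks)
input_lengths = torch.full((batch,), frames, dtype=torch.long)
target_lengths = torch.full((batch,), 12, dtype=torch.long)

ctc = nn.CTCLoss(blank=0)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
print(loss)

# At inference, a simple greedy decode takes the per-frame argmax and collapses
# repeats and blanks -- no autoregressive token-by-token generation is needed.
```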

Mistral’s Voxtral is architecturally distinct from both Whisper and Parakeet. Rather than a traditional ASR encoder-decoder, Voxtral is built on a large language model backbone — Ministral 3B for Voxtral Mini and Mistral Small 3.1 24B for Voxtral Small — making it an LLM-based speech understanding model. The smol-audio notebook handles fine-tuning for ASR with prompt masking, supporting both full fine-tuning and LoRA. Prompt masking is important here precisely because of this LLM architecture: when a model accepts text prompts alongside audio input, you typically do not want to compute loss on the prompt tokens themselves — only on the generated transcription. Getting this wrong leads to degraded training dynamics, so having a working reference implementation saves significant debugging time.
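A minimal sketch of the masking idea (toy token ids, not Voxtral’s actual chat template): prompt positions in the label tensor are set to -100, the index that PyTorch’s cross-entropy and the Hugging Face trainers ignore, so only the transcription tokens contribute to the loss:

```python
import torch

IGNORE_INDEX = -100  # positions with this label are skipped by cross-entropy

# Toy ids: the first 4 tokens are the text prompt, the rest are the transcript.
input_ids = torch.tensor([[5, 17, 42, 9, 101, 102, 103, 2]])
prompt_len = 4

labels = input_ids.clone()
labels[:, :prompt_len] = IGNORE_INDEX   # mask out the prompt span
print(labels)
# tensor([[-100, -100, -100, -100,  101,  102,  103,    2]])
```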

IBM’s Granite Speech gets its own notebook focused on Italian ASR using the YODAS-Granary dataset. This is a useful example beyond just the model: it demonstrates domain- and language-specific fine-tuning on a real multilingual speech corpus, a common production scenario.
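The loading pattern for such a corpus typically looks like the sketch below; the dataset identifier and column names are placeholders rather than the notebook’s exact YODAS-Granary configuration:

```python
from datasets import Audio, load_dataset

# Placeholder dataset ID and config -- not the notebook's actual YODAS-Granary path.
ds = load_dataset("some-org/some-speech-corpus", "it", split="train", streaming=True)

# Most ASR models expect 16 kHz mono audio, so the audio column is resampled on the fly.
ds = ds.cast_column("audio", Audio(sampling_rate=16000))

sample = next(iter(ds))
print(sample["audio"]["sampling_rate"], sample.get("text"))
```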

Audio Understanding with NVIDIA’s Audio Flamingo 3

Audio Flamingo 3, developed by NVIDIA, is a Large Audio Language Model (LALM) for reasoning and understanding across speech, sound, and music. The smol-audio notebook fine-tunes it specifically for the audio captioning task — generating a natural language description of an audio clip, which is useful for accessibility tooling, content indexing, and retrieval systems. The notebook covers both full fine-tuning and LoRA-based fine-tuning, giving practitioners the choice between maximum performance and memory efficiency.

LoRA, for those newer to parameter-efficient fine-tuning, works by freezing the original model weights and injecting small trainable rank-decomposition matrices into specific layers. For large multimodal models like Audio Flamingo 3, LoRA can reduce GPU memory requirements by an order of magnitude compared to full fine-tuning, enabling iteration on commodity hardware.
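A minimal peft configuration along these lines is shown below, using whisper-small as a stand-in backbone since it loads quickly; the target module names and rank are illustrative assumptions, and the actual notebooks choose them per architecture:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForSpeechSeq2Seq

# Stand-in backbone for illustration; the Audio Flamingo 3 notebook targets its own model.
base = AutoModelForSpeechSeq2Seq.from_pretrained("openai/whisper-small")

lora_cfg = LoraConfig(
    r=16,                                  # rank of the trainable low-rank matrices
    lora_alpha=32,                         # scaling factor applied to the LoRA update
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt (assumption)
)

model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()         # typically well under 1% of total parameters
```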

Dialogue TTS with Dia-1.6B

The Dia-1.6B notebook covers dialogue-style text-to-speech, where the goal is not just synthesizing a single speaker but generating natural conversational exchanges. Dia is a 1.6-billion-parameter TTS model by Nari Labs capable of producing multi-speaker dialogue, making it relevant for anyone building voice agents, podcast generation tools, or conversational interfaces.

Multimodal Inference with Meta’s PE-AV

Perhaps the most forward-looking notebook in the current release covers inference with Meta’s Perception Encoder Audiovisual (PE-AV). PE-AV is a multimodal encoder that learns a single shared embedding space across audio, video, and text — enabling zero-shot video classification without any task-specific fine-tuning, and audio↔text retrieval on benchmarks like AudioCaps. Because all three modalities map into the same embedding space, cross-modal queries such as retrieving an audio clip from a text description work via simple dot-product similarity.

The notebook demonstrates how to run these inference pipelines directly, which is valuable because multimodal models with joint audio-visual-text encoders are architecturally more complex than single-modality models and typically require careful preprocessing of multiple input modalities.
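To make the shared-embedding retrieval idea concrete, here is a toy similarity computation with random vectors standing in for real PE-AV audio and text embeddings; it is not the PE-AV API itself:

```python
import torch
import torch.nn.functional as F

# Random vectors stand in for embeddings produced by a shared audio/text encoder.
dim = 512
audio_embeds = F.normalize(torch.randn(10, dim), dim=-1)   # 10 candidate audio clips
text_embed = F.normalize(torch.randn(1, dim), dim=-1)      # 1 text query

# Dot-product (cosine, after normalization) similarity between query and candidates.
scores = text_embed @ audio_embeds.T                       # shape (1, 10)
print("best matching clip index:", scores.argmax(dim=-1).item())
```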


Check out the Repo here.

Michal Sutter is a data science professional with a Master of Science in Data Science from the University of Padova. With a solid foundation in statistical analysis, machine learning, and data engineering, Michal excels at transforming complex datasets into actionable insights.