Feb 10, 2026

TDT: How Nvidia Dominates the HuggingFace Leaderboards in This Key Metric

Word Error Rate (WER) is a useful metric to optimise, but if your model takes 10 seconds to transcribe 1 second of audio, nobody's shipping it. The HuggingFace Open ASR Leaderboard tracks both accuracy and speed. At the time of writing, in the HuggingFace top 10, Nvidia's Parakeet TDT models are more than 3x ahead of the nearest competition in RTFx (inverse real-time factor, i.e. throughput: how many seconds of audio the model can process per second of wall-clock time).

These models are significantly faster than the competition while maintaining competitive WERs. The mechanism? A modification to the RNN-Transducer called the Token-and-Duration Transducer (TDT). In this post, we'll build up from first principles to understand how TDT works, why it's faster, and what the tradeoffs are.


The RNN-Transducer: A Quick Derivation

Without going into too much detail, there are a few ways to train a speech-to-text model: CTC (Connectionist Temporal Classification loss), AED (attention-based encoder-decoder with cross-entropy loss), decoder-only (language-model-style self-attention with cross-entropy loss), or RNN-T/TDT.

RNN-T hits a useful middle ground in this field: it has enough modeling capacity to capture label dependencies (unlike CTC), its autoregressive component is lightweight (unlike AED/decoder-only), and it can be trained end-to-end with a well-understood loss function. RNN-T is already fast - much faster than AED or decoder-only models, and only somewhat slower than CTC. But there's still room to speed it up; the key observation is that we can allow time-frame skipping.

To understand TDT, we first need to understand RNN-T properly. Let's build it up.

Architecture

An RNN-T consists of three components:

  1. Encoder $h$: maps audio $\mathbf{X}$ to hidden representations $h_t(\mathbf{X}) \in \mathbb{R}^{H_\text{enc}}$ for each frame $t$.
  2. Predictor (a.k.a. decoder) $g$: an autoregressive network that maps the previous non-blank tokens $\mathbf{y}_{<u}$ to representations $g_u(\mathbf{y}_{<u}) \in \mathbb{R}^{H_\text{pred}}$.
  3. Joint network $J$: combines encoder and predictor outputs to produce logits over the vocabulary $\mathcal{V} \cup \{\varnothing\}$ (where $\varnothing$ is the blank symbol):

$$\mathbf{z}_{t,u} = J(h_t(\mathbf{X}), g_u(\mathbf{y}_{<u})) \in \mathbb{R}^{|\mathcal{V}| + 1}$$

The joint network generates a probability distribution over tokens at each $(t, u)$ position in the lattice - a $T \times (U+1)$ grid where $T$ is the number of encoder frames and $U$ is the number of target tokens. The full output has shape $(B, T, U+1, |\mathcal{V}|+1)$. This means that, for every possible time step, and every possible position through the target sequence, we get a full probability distribution over the next token prediction.
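As a shapes-only sketch (with hypothetical dimensions, and a generic additive joint - the exact combination a real model uses may differ), the full lattice of logits can be produced by broadcasting the encoder states against the predictor states:

```python
import numpy as np

# Hypothetical dimensions for illustration.
T, U, H_enc, H_pred, H_joint, V = 8, 4, 16, 12, 32, 10

rng = np.random.default_rng(0)
enc = rng.normal(size=(T, H_enc))        # encoder states h_t(X)
pred = rng.normal(size=(U + 1, H_pred))  # predictor states g_u(y_<u), incl. start state

# Project both into a shared joint space, combine with broadcasting,
# then project to vocabulary + blank logits.
W_enc = rng.normal(size=(H_enc, H_joint))
W_pred = rng.normal(size=(H_pred, H_joint))
W_out = rng.normal(size=(H_joint, V + 1))

joint_in = np.tanh(enc @ W_enc)[:, None, :] + np.tanh(pred @ W_pred)[None, :, :]
logits = joint_in @ W_out   # one (V+1)-way set of logits per lattice node

assert logits.shape == (T, U + 1, V + 1)
```

The broadcast is the important part: every encoder frame is paired with every predictor state, which is exactly why the output is a full $T \times (U+1)$ grid.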

The Lattice and Alignments

The output of an RNN-T model is a full lattice of transition probabilities. An alignment is a path through this lattice from $(0, 0)$ to $(T, U)$. At each position $(t, u)$, the model either:

  • Emits blank $\varnothing$: moves from $(t, u) \to (t+1, u)$ - a step right along the time axis.
  • Emits a token $y_{u+1}$: moves from $(t, u) \to (t, u+1)$ - a step up along the label axis. In this case we only care about the probability of the "correct" $y_{u+1}$.

At each point we get the full probability distribution over all $|\mathcal{V}| + 1$ options: each token in the vocabulary $\mathcal{V}$ plus blank $\varnothing$. In the lattice, however, we only consider two of these: the probability of the next target token, $P(y_{u+1} \mid t, u)$, and the probability of blank, $P(\varnothing \mid t, u)$.

Here's what the lattice looks like for a concrete example. Suppose we have $T = 8$ encoder frames and the target transcription $\mathbf{y} = [\text{the}, \text{quick}, \text{brown}, \text{fox}]$ ($U = 4$):

[Figure: the RNN-T lattice for $T = 8$, $U = 4$, with example alignment paths]

Every valid path from bottom-left [start] to top-right [end] that emits exactly the target sequence is a valid alignment. Different paths correspond to different timings of the same transcription. For example:

Path A (early speech):    ∅, the, quick, brown, fox, ∅, ∅, ∅, ∅, ∅, ∅, ∅
       (orange)           → all tokens emitted by t=4, rest is silence

Path B (spread out):      the, ∅, ∅, quick, ∅, brown, ∅, ∅, fox, ∅, ∅, ∅
       (pink)             → tokens spread across the utterance

Path C (late speech):     ∅, ∅, ∅, ∅, ∅, the, quick, brown, ∅, ∅, fox, ∅
       (blue)             → speech starts late, around t=5

All three paths produce the same transcription "the quick brown fox" - they just disagree on when each token aligns to the audio. The total probability of $\mathbf{y}$ is the sum over all such paths:

$$P(\mathbf{y} \mid \mathbf{X}) = \sum_{\mathbf{a} \in \mathcal{A}(\mathbf{y})} P(\mathbf{a} \mid \mathbf{X})$$

where $\mathcal{A}(\mathbf{y})$ is the set of all valid alignments for $\mathbf{y}$. The sequence $\mathbf{y}$ here represents the correct transcription; any probability mass assigned to incorrect tokens isn't counted in this summation.

The Forward-Backward Algorithm

We can't enumerate all paths (there are exponentially many), so we use dynamic programming.

The forward variable $\alpha(t, u)$ represents the total probability mass that flows from the start state $(0, 0)$ to node $(t, u)$ - i.e., the sum of probabilities over all partial paths that reach $(t, u)$:

$$\alpha(t, u) = \sum_{\text{paths from } (0,0) \text{ to } (t,u)} P(\text{path})$$

The recurrence is:

$$\alpha(t, u) = \alpha(t-1, u) \cdot P(\varnothing \mid t-1, u) + \alpha(t, u-1) \cdot P(y_u \mid t, u-1)$$

with $\alpha(0, 0) = 1$. Each term says: the mass arriving at $(t, u)$ is the mass that was at the predecessor node, times the probability of the transition that leads here.

The total log-likelihood includes the final blank that exits the lattice from $(T-1, U)$:

$$\log P(\mathbf{y} \mid \mathbf{X}) = \log \left[ \alpha(T-1, U) \cdot P(\varnothing \mid T-1, U) \right]$$

This is (negated) RNN-T loss! We want to minimise this total negative log likelihood:

$$\mathcal{L}_\text{RNNT} = -\log P(\mathbf{y} \mid \mathbf{X})$$

The backward variable $\beta(t, u)$ is the mirror image: it represents the total probability mass that flows from node $(t, u)$ to the final state - "how much probability mass can still reach the target from here."

$$\beta(t, u) = \beta(t+1, u) \cdot P(\varnothing \mid t, u) + \beta(t, u+1) \cdot P(y_{u+1} \mid t, u)$$

[Figure: the forward and backward passes over the lattice]
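To make the two recurrences concrete, here's a minimal NumPy sketch on a toy lattice (random stand-in probabilities; we assume the convention that a final blank emitted from $(T-1, U)$ terminates the path). It checks that $\beta(0, 0)$ and the $\alpha$-based total both agree with brute-force path enumeration:

```python
import numpy as np

def forward_backward(p_blank, p_tok):
    """p_blank[t, u] = P(blank | t, u); p_tok[t, u] = P(correct next token | t, u)."""
    T, U1 = p_blank.shape
    U = U1 - 1
    alpha = np.zeros((T, U + 1))
    alpha[0, 0] = 1.0
    for t in range(T):
        for u in range(U + 1):
            if t > 0:
                alpha[t, u] += alpha[t - 1, u] * p_blank[t - 1, u]
            if u > 0:
                alpha[t, u] += alpha[t, u - 1] * p_tok[t, u - 1]
    beta = np.zeros((T, U + 1))
    beta[T - 1, U] = p_blank[T - 1, U]  # final blank exits the lattice
    for t in range(T - 1, -1, -1):
        for u in range(U, -1, -1):
            if (t, u) == (T - 1, U):
                continue
            if t + 1 < T:
                beta[t, u] += p_blank[t, u] * beta[t + 1, u]
            if u + 1 <= U:
                beta[t, u] += p_tok[t, u] * beta[t, u + 1]
    return alpha, beta

T, U = 5, 3
rng = np.random.default_rng(1)
p_blank = rng.uniform(0.1, 0.9, size=(T, U + 1))
p_tok = rng.uniform(0.1, 0.9, size=(T, U + 1))

alpha, beta = forward_backward(p_blank, p_tok)
total = alpha[T - 1, U] * p_blank[T - 1, U]

def brute(t, u):
    """Sum path probabilities by explicit enumeration (exponential; toy-sized only)."""
    if (t, u) == (T - 1, U):
        return p_blank[t, u]
    s = 0.0
    if t + 1 < T:
        s += p_blank[t, u] * brute(t + 1, u)
    if u + 1 <= U:
        s += p_tok[t, u] * brute(t, u + 1)
    return s

assert np.isclose(beta[0, 0], total)
assert np.isclose(brute(0, 0), total)
```

Note that $\beta(0, 0)$ and the forward-based total are two routes to the same number - the summed probability of every valid alignment - which is exactly the identity the gradient derivation below relies on.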

The gradient follows naturally from this picture. The product $\alpha(t, u) \cdot \beta(t, u)$ gives the total probability mass that passes through node $(t, u)$ on its way from start to end. To compute the gradient of the log-likelihood with respect to a particular transition (say, emitting token $v$ at position $(t, u)$), we ask: how much would shifting this single transition's probability change the final log-likelihood? Intuitively, it's the amount of the final probability mass that flows through this transition: the mass that reaches the transition, times the mass that leaves it and still arrives at the end point. We can also think of this as the sum over all paths that use this transition.

[Figure: probability mass flowing through a single transition in the lattice]

$$\underbrace{\partial P(\mathbf{y} \mid \mathbf{X})}_{\partial(\text{total P})} = \underbrace{\alpha(t,u)}_{\text{P mass in}} \cdot \underbrace{\partial P(v \mid t, u)}_{\partial(\text{transition P})} \cdot \underbrace{\beta(t', u')}_{\text{P mass out}}$$

$$\frac{\partial P(\mathbf{y} \mid \mathbf{X})}{\partial P(v \mid t, u)} = \underbrace{\alpha(t,u) \cdot \beta(t', u')}_{\text{total P through transition}}$$

where $(t', u')$ is the node reached after the transition. The forward variable gets the mass to the transition, and the backward variable represents the mass that will eventually arrive at the target from that point. The full loss gradient normalizes by the total likelihood (see the original Graves 2012 paper for the complete derivation):

$$\mathcal{L}_\text{RNNT} = -\log P(\mathbf{y} \mid \mathbf{X})$$

$$\frac{\partial \mathcal{L}_\text{RNNT}}{\partial P(v \mid t, u)} = \frac{\partial \mathcal{L}_\text{RNNT}}{\partial P(\mathbf{y} \mid \mathbf{X})} \cdot \frac{\partial P(\mathbf{y} \mid \mathbf{X})}{\partial P(v \mid t, u)} = -\frac{\alpha(t,u) \cdot \beta(t', u')}{P(\mathbf{y} \mid \mathbf{X})}$$

This gives a nice result! The gradient with respect to any given transition probability is just the proportion of the total probability mass that flows through that transition. Early in training, when $P(\mathbf{y} \mid \mathbf{X})$ is small, the gradient is still significant for correct transitions. It also explains why lattice paths tend to collapse to a small number of dominant alignments later in training - the highest-probability paths receive the largest gradients, incentivizing further path concentration. This isn't necessarily a bad thing, but it may limit the effectiveness of standard EMBR training.

n.b. This "path collapse" is a key insight behind Dan Povey's pruned RNN-T loss, which simplifies the gradient computation significantly by only considering paths near (in time) the high-probability alignments.

The forward-backward algorithm computes all of this in $O(T \cdot U)$ time.

The Inference Problem

RNN-T is already fast relative to AED and decoder-only models, since the predictor network is small. But it still has a structural constraint. During greedy decoding:

# RNN-T Greedy Decoding (simplified)
t = 0
output = []
while t < T:
    logits = joint(encoder[t], predictor(output))
    token = argmax(logits)
    if token == BLANK:
        t += 1                   # advance exactly ONE frame
    else:
        output.append(token)     # stay at the same t, extend the label sequence
                                 # (real decoders also cap symbols per frame)

The model processes the encoder output one frame at a time. For every single frame, it must:

  1. Run the joint network
  2. Check if the output is blank or a token
  3. If blank, advance by exactly one frame; Else, run the predictor network

For a 10-second utterance at 80ms frame rate (after subsampling), that's ~125 sequential joint network calls at minimum. Most of those will be blanks - in typical speech, tokens are sparse relative to frames. The model spends most of its time predicting "nothing is happening" one frame at a time. The joint network is cheap per call, but the sequential one-frame-at-a-time structure leaves performance on the table.

TDT addresses this.


Token-and-Duration Transducer (TDT)

The core idea of TDT (Xu et al., 2023): instead of predicting just a token at each lattice position, jointly predict the token and how many frames it covers.

The Key Modification

In standard RNN-T, the joint network outputs a single distribution over $|\mathcal{V}| + 1$ symbols (vocabulary + blank). In TDT, the joint network outputs two independent distributions:

  1. Token distribution: $P(v \mid t, u) \in \Delta^{|\mathcal{V}|+1}$ - same as RNN-T
  2. Duration distribution: $P(d \mid t, u) \in \Delta^{|\mathcal{D}|}$ - probability over a set of allowed durations

where $\mathcal{D}$ is a predefined set of durations. A typical choice is $\mathcal{D} = \{0, 1, 2, 3, 4\}$, though the set can be configured - for example, $\{1, 2, 3, 4\}$ (omitting 0) is also valid.

The two heads share the same encoder and predictor representations but are independently normalized (separate softmax operations):

# TDT Joint Network Output
logits = joint(encoder[t], predictor(output))  # shape: [V + 1 + |D|]

# Split into token and duration logits
token_logits = logits[:V+1]                    # shape: [V + 1]
duration_logits = logits[V+1:]                 # shape: [|D|]

# Independent softmax
token_probs = softmax(token_logits)
duration_probs = softmax(duration_logits)
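A quick numeric check of that independence, with toy logits (nothing model-specific): after its own softmax, each head is a proper distribution in its own right.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

V, D = 10, 5                            # hypothetical vocab and duration-set sizes
rng = np.random.default_rng(5)
logits = rng.normal(size=V + 1 + D)     # joint output: vocab + blank + durations

token_probs = softmax(logits[:V + 1])
duration_probs = softmax(logits[V + 1:])

# Separately normalized: sharpening one head never steals mass from the other.
assert np.isclose(token_probs.sum(), 1.0)
assert np.isclose(duration_probs.sum(), 1.0)
```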

How Durations Change the Lattice

In standard RNN-T, transitions in the lattice are simple:

  • Blank: $(t, u) \to (t+1, u)$ - always advances by exactly 1 frame
  • Token: $(t, u) \to (t, u+1)$ - stays at the same frame

In TDT, transitions become:

  • Blank with duration $d$: $(t, u) \to (t+d, u)$ - advances by $d \geq 1$ frames
  • Token with duration $d$: $(t, u) \to (t+d, u+1)$ - emits a token and advances by $d \geq 0$ frames

Note the asymmetry: blanks must have $d \geq 1$ (you must advance at least one frame when emitting nothing), but tokens can have $d = 0$ if 0 is in $\mathcal{D}$ (emitting a token without advancing - useful for fast speech or multi-token emissions at a single frame). If $\mathcal{D}$ doesn't include 0, every emission also advances at least one frame.

[Figure: the TDT lattice with variable-duration transitions]

This is a much richer lattice - the model can now "skip ahead" multiple frames in a single step.

Why This Makes Inference Fast

The inference speedup is immediate. During greedy decoding:

# TDT Greedy Decoding (simplified)
t = 0
output = []
while t < T:
    logits = joint(encoder[t], predictor(output))
    token_logits, duration_logits = logits[:V+1], logits[V+1:]
    token = argmax(token_logits)
    duration = argmax(duration_logits)

    if token == BLANK:
        t += max(1, duration)    # skip MULTIPLE frames!
    else:
        output.append(token)
        t += duration            # can also skip frames on token emission

Instead of advancing one frame at a time, the model can skip over stretches of silence or steady-state audio. If the model predicts blank with duration 4, it skips 4 frames in one step - reducing joint network calls for that stretch proportionally.

The TDT paper reports up to 2.82x faster inference than standard RNN-T on speech recognition tasks, with comparable or better accuracy. The speedup is more pronounced on longer utterances with more silence.


Training TDT: The Modified Forward-Backward

Training TDT requires modifying the forward-backward algorithm to account for the duration variable. The loss is still the negative log-likelihood $-\log P(\mathbf{y} \mid \mathbf{X})$, but the lattice transitions are now richer. We now have two independent distributions predicted from each node, and transition probabilities factorize:

$$P(y_u, d \mid t, u) = P_T(y_u \mid t, u) \cdot P_D(d \mid t, u)$$

Modified Forward Variable

The forward variable $\alpha(t, u)$ now has a more complex recurrence. At each position $(t, u)$, we must sum over all durations that could have led here:

$$\alpha(t, u) = \underbrace{\sum_{d \in \mathcal{D},\, d \geq 1} \alpha(t-d, u) \cdot P(\varnothing, d \mid t-d, u)}_{\text{blank transitions from various durations back}} + \underbrace{\sum_{d \in \mathcal{D}} \alpha(t-d, u-1) \cdot P(y_u, d \mid t-d, u-1)}_{\text{token transitions from various durations back}}$$
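Here's a toy NumPy sketch of this recurrence (random stand-in probabilities; for the boundary we assume a path must land exactly on $(T, U)$ - real implementations handle the final frame more carefully). The DP total matches brute-force enumeration over duration-labelled paths:

```python
import numpy as np

T, U = 6, 2
durations = [0, 1, 2]
rng = np.random.default_rng(2)
p_tok = rng.uniform(0.1, 0.9, size=(T, U + 1))    # P_T(correct next token | t, u)
p_blank = rng.uniform(0.1, 0.9, size=(T, U + 1))  # P_T(blank | t, u)
p_dur = rng.uniform(size=(T, U + 1, len(durations)))
p_dur /= p_dur.sum(axis=-1, keepdims=True)        # independently normalized head

alpha = np.zeros((T + 1, U + 1))
alpha[0, 0] = 1.0
for t in range(T + 1):
    for u in range(U + 1):
        for i, d in enumerate(durations):
            if d >= 1 and t - d >= 0:             # blank: d >= 1 only
                alpha[t, u] += alpha[t - d, u] * p_blank[t - d, u] * p_dur[t - d, u, i]
            if u >= 1 and 0 <= t - d < T:         # token: d = 0 allowed
                alpha[t, u] += alpha[t - d, u - 1] * p_tok[t - d, u - 1] * p_dur[t - d, u - 1, i]

total = alpha[T, U]

def brute(t, u):
    """Enumerate every duration-labelled path explicitly (toy-sized only)."""
    if t == T:
        return 1.0 if u == U else 0.0
    s = 0.0
    for i, d in enumerate(durations):
        if d >= 1 and t + d <= T:
            s += p_blank[t, u] * p_dur[t, u, i] * brute(t + d, u)
        if u < U and t + d <= T:
            s += p_tok[t, u] * p_dur[t, u, i] * brute(t + d, u + 1)
    return s

assert np.isclose(brute(0, 0), total)
```

The structure is identical to the RNN-T forward pass; the only change is the extra inner loop over durations, which is where the $O(|\mathcal{D}|)$ factor in training cost comes from.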

Backward Variable and Gradients

The backward variable $\beta(t, u)$ follows the same pattern but in reverse:

$$\beta(t, u) = \sum_{d \in \mathcal{D},\, d \geq 1} P(\varnothing, d \mid t, u) \cdot \beta(t+d, u) + \sum_{d \in \mathcal{D}} P(y_{u+1}, d \mid t, u) \cdot \beta(t+d, u+1)$$

The gradient computation uses both $\alpha$ and $\beta$ in the standard way, summing over each possible duration for a given prediction and scaling by the duration probabilities $P_D$. For the token logit $v$ at position $(t, u)$, the gradient is:

$$\frac{\partial \mathcal{L}}{\partial P_T(v \mid t, u)} = -\sum_{(t', u') \in C_{t,u}} \frac{\alpha(t,u) \cdot \beta(t',u') \cdot P_D(t'-t \mid t, u)}{P(\mathbf{y} \mid \mathbf{X})}$$

where $C_{t,u}$ is the set of states reachable from $(t, u)$ (for $v = y_{u+1}$ only the token transitions apply; for $v = \varnothing$ only the blank transitions):

$$C_{t,u} = \underbrace{\{(t + d, u+1) \mid d \in \mathcal{D}\}}_{\text{predicting the next token } y_{u+1}} \;\cup\; \underbrace{\{(t + d, u) \mid d \in \mathcal{D},\, d \geq 1\}}_{\text{predicting blank } \varnothing}$$

For the duration logits, the gradient at position $(t, u)$ for duration $d$ accounts for all transitions that use duration $d$ - either the correct next token or a blank. For $d > 0$:

$$\frac{\partial \mathcal{L}}{\partial P_D(d \mid t, u)} = -\frac{\alpha(t,u) \cdot \beta(t+d, u+1) \cdot P_T(y_{u+1} \mid t, u)}{P(\mathbf{y} \mid \mathbf{X})} - \frac{\alpha(t,u) \cdot \beta(t+d, u) \cdot P_T(\varnothing \mid t, u)}{P(\mathbf{y} \mid \mathbf{X})}$$

while for $d = 0$ (blank is not allowed at zero duration), only the token transition contributes:

$$\frac{\partial \mathcal{L}}{\partial P_D(0 \mid t, u)} = -\frac{\alpha(t,u) \cdot \beta(t, u+1) \cdot P_T(y_{u+1} \mid t, u)}{P(\mathbf{y} \mid \mathbf{X})}$$

This too is intuitive: it represents the sum over all valid paths that use this duration. And that's it - this is all the maths required to understand the efficient TDT training mechanics. For the full derivation, see the TDT paper.

Some More Training Tricks

The Sigma Trick - Logit Under-Normalization: every transition in the lattice, whether blank or token, gets penalized by $\sigma$ (typically 0.05) in log-space. Since this penalty is applied per transition, paths with more steps accumulate a larger total penalty. This biases the model toward using fewer, larger-duration steps rather than many duration-1 steps.

The Omega Trick - Sampled RNN-T Loss: with probability $\omega$, the loss falls back to the standard RNN-T loss (ignoring durations entirely). This acts as a regularizer, ensuring the token predictions remain well-calibrated even without duration information.
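A schematic illustration of both tricks with made-up numbers (just the arithmetic of the biases, not the real loss code):

```python
import numpy as np

sigma = 0.05  # per-transition log-space penalty

def path_logprob(transition_logps, sigma=0.0):
    # every transition, blank or token, pays the same sigma penalty
    return sum(lp - sigma for lp in transition_logps)

# Two hypothetical alignments with identical total log-probability:
many_small = [-0.1] * 12  # twelve duration-1 steps
few_big = [-0.3] * 4      # four larger-duration steps
assert np.isclose(sum(many_small), sum(few_big))

# Without sigma they tie; with sigma the 12-step path pays 12*sigma vs 4*sigma,
# so the model is nudged toward fewer, larger-duration steps.
assert path_logprob(few_big, sigma) > path_logprob(many_small, sigma)

# The omega trick is a stochastic switch: with probability omega, a training
# step uses the plain RNN-T loss and ignores the duration head.
omega = 0.1
rng = np.random.default_rng(6)
fallbacks = sum(rng.uniform() < omega for _ in range(10_000))
assert 800 < fallbacks < 1200  # roughly omega of steps fall back
```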


Practical Considerations and Pitfalls

Training Memory

TDT has the same memory footprint challenge as standard RNN-T: the joint network output is a 4D tensor of shape $(B, T, U+1, |\mathcal{V}| + 1 + |\mathcal{D}|)$. For large vocabularies and long sequences, this can be enormous. The standard mitigation is fused loss computation - instead of materializing the full joint tensor, compute the loss and gradients in a fused kernel that only materializes one $(t, u)$ slice at a time.
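Back-of-the-envelope numbers make the problem obvious (hypothetical but realistic sizes):

```python
# Hypothetical sizes: batch 32, 1000 encoder frames, 200 target tokens,
# 1024-token vocab + blank + 5 durations, fp32 (4 bytes per value).
B, T, U, V, D = 32, 1000, 200, 1024, 5

full = B * T * (U + 1) * (V + 1 + D) * 4  # materializing the whole joint tensor
fused = B * (V + 1 + D) * 4               # one (t, u) slice at a time

print(f"full: {full / 2**30:.1f} GiB, fused slice: {fused / 2**10:.1f} KiB")
```

Tens of GiB for the full tensor versus kilobytes per slice - which is why the fused-kernel approach is the standard mitigation.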

Batch Inference

One subtlety: during batch inference, different utterances in a batch will have different predicted durations, meaning they advance through the encoder at different rates. This makes batched greedy decoding trickier than for standard RNN-T, where all utterances advance by exactly one frame per step.
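A sketch of the bookkeeping this requires (hypothetical durations standing in for model predictions): each utterance carries its own frame pointer, and only still-active utterances take part in a decoding step.

```python
import numpy as np

T_lens = np.array([10, 14, 8])       # encoder lengths per utterance in the batch
t = np.zeros(3, dtype=int)           # per-utterance frame pointers
steps = 0
rng = np.random.default_rng(4)
while (t < T_lens).any():
    active = t < T_lens              # utterances still inside their audio
    d = rng.integers(1, 5, size=3)   # stand-in for predicted (blank) durations
    t = np.where(active, t + d, t)   # finished utterances stop advancing
    steps += 1

assert (t >= T_lens).all()
assert steps <= T_lens.max()         # never more steps than the longest utterance
```

In standard RNN-T the pointers would all move in lockstep; here the masking is what keeps short or fast-skipping utterances from running past their own encoder output.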

Duration Set Design

The choice of duration set $\mathcal{D}$ matters. The paper uses $\{0, 1, 2, 3, 4\}$ as the default. Some considerations:

  • Must include 1: duration 1 is needed to recover standard single-frame advancement. Duration 0 is optional - it allows token emission without frame advancement (useful for fast speech), but some configurations omit it.
  • Larger durations = more skipping: the model learns when to use large skips vs. small ones. In practice, the model is conservative enough that large durations don't cause problems.
  • More durations = slightly slower training: the forward-backward complexity scales linearly with $|\mathcal{D}|$, though with typical set sizes (4-5 elements) this is a small constant factor.

Comparison with Multi-Blank Transducer

TDT is related to but distinct from the Multi-Blank Transducer, which adds multiple blank symbols (big-blank-2, big-blank-3, etc.) that skip different numbers of frames. The key difference:

|                     | Multi-Blank                               | TDT                                         |
|---------------------|-------------------------------------------|---------------------------------------------|
| Duration prediction | Implicit (via blank type)                 | Explicit (separate head)                    |
| Token durations     | Always 0 (no frame skip on token)         | Variable (tokens can skip frames too)       |
| Vocab size increase | $\lvert \mathcal{D} \rvert$ blank symbols | None; separate duration head                |
| Independence        | Token and duration coupled                | Token and duration independently normalized |

TDT's independent normalization means the model doesn't need to use vocabulary capacity on multiple blank symbols, and the duration prediction can be more fine-grained.


How Does It Actually Work? An Inference Walkthrough

Let's trace through a concrete example. Suppose we have:

  • Audio: 8 encoder frames ($T = 8$)
  • Target: "hi" → tokens [h, i] ($U = 2$)
  • Durations: $\mathcal{D} = \{0, 1, 2, 3\}$

Forward Pass (Inference)

t=0: joint(enc[0], pred([]))
     → token=h (p=0.8), duration=0 (p=0.7)
     → emit 'h', stay at t=0

t=0: joint(enc[0], pred([h]))
     → token=i (p=0.6), duration=2 (p=0.5)
     → emit 'i', jump to t=2

t=2: joint(enc[2], pred([h, i]))
     → token=blank (p=0.9), duration=3 (p=0.6)
     → skip to t=5

t=5: joint(enc[5], pred([h, i]))
     → token=blank (p=0.95), duration=3 (p=0.8)
     → skip to t=8 → DONE!

4 joint network calls instead of 8+ for standard RNN-T. That's the speedup.
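The trace can be replayed mechanically - scripted (token, duration) predictions stand in for the joint network, and the loop is the same greedy decoder as above:

```python
T = 8
# Scripted predictions keyed by (t, emitted-so-far), from the walkthrough.
scripted = {
    (0, ""): ("h", 0),
    (0, "h"): ("i", 2),
    (2, "hi"): ("<blank>", 3),
    (5, "hi"): ("<blank>", 3),
}

t, output, calls = 0, "", 0
while t < T:
    token, duration = scripted[(t, output)]  # stand-in for joint + argmax
    calls += 1
    if token == "<blank>":
        t += max(1, duration)
    else:
        output += token
        t += duration

assert output == "hi"
assert calls == 4  # vs 8+ joint calls for frame-by-frame RNN-T
```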


Summary

TDT extends RNN-T by jointly predicting tokens and their durations. The key ideas are:

  1. Two-headed joint network: independently predict token and duration distributions
  2. Variable-stride lattice: transitions can skip multiple frames, not just one
  3. Modified forward-backward: same algorithm structure, just summing over durations at each step
  4. Training tricks: logit under-normalization ($\sigma$) and sampled RNN-T loss ($\omega$) for stable training

The result: models that are up to 2.82x faster at inference with comparable or better accuracy than standard transducers - and RNN-T was already fast to begin with. This is how Nvidia's Parakeet-TDT models dominate the RTFx column at the top of the HuggingFace leaderboard.

The NeMo toolkit has a full implementation, and pretrained Parakeet-TDT checkpoints are available on HuggingFace.


References:

  • Graves, A. (2012). Sequence Transduction with Recurrent Neural Networks. arXiv:1211.3711.
  • Xu, H., Jia, F., Majumdar, S., Huang, H., Watanabe, S., & Ginsburg, B. (2023). Efficient Sequence Transduction by Jointly Predicting Tokens and Durations. ICML 2023.
  • Xu, H., et al. (2022). Multi-blank Transducers for Speech Recognition.
  • Kuang, F., et al. (2022). Pruned RNN-T for Fast, Memory-Efficient ASR Training. Interspeech 2022.