S4-260209 - AI Summary

[FS_ULBC] Alignment Analysis on Complexity of DAC model


Alignment Analysis on Complexity of DAC Model

1. Introduction

This contribution addresses a significant discrepancy in complexity reporting for AI-based codecs in the ULBC study. Two contributions (S4-260165 from Dolby et al. and S4-260155 from vivo et al.) both reported models with approximately 3M parameters but showed substantially different complexity metrics:

  • S4-260165: ~3M parameter model (32 kHz) requires 0.79 GMACS
  • S4-260155: ~3M parameter model (32 kHz) requires approximately 1.41 GMACS (derived from 2821 MFlops/s)

Notably, the S4-260165 model's complexity (0.79 GMACS) aligns more closely with the S4-260155 model operating at 16 kHz (~0.70 GMACS), despite the difference in sampling rate.

The contribution demonstrates that Model Size (parameter count) is an insufficient metric for constraining complexity across different neural architectures. It proposes GMACS (giga multiply-accumulate operations per second) as a robust, architecture-agnostic metric that correlates linearly with RTF.

2. Architectural Analysis and Discrepancy Resolution

2.1 The "Model Size" Trap

The two architectures were compared component by component to understand why models with similar parameter counts exhibit different computational footprints:

| Metric | S4-260155 [2] (16 kHz, ~3M) | S4-260165 [1] (32 kHz, ~3M) |
|--------|----------------|----------------|
| Input Rate | 16,000 Hz | 32,000 Hz |
| Total Stride | 320 (2×4×5×8) | 1280 (4×4×8×10) |
| Latent Rate | 50.0 Hz | 25.0 Hz |
| Encoder MACs (M) | 436.30 | 461.92 |
| Quantizer MACs (M) | 2.25 | 0.50 |
| Decoder MACs (M) | 984.50 | 1037.12 |
| Total MFlops/s | 1423.05 | 1499.54 |
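The stride, latent-rate and total rows of the table can be cross-checked with a few lines of arithmetic. The sketch below is only a consistency check on the tabulated numbers; the dictionary layout and the `summarize` helper are illustrative names, not part of either contribution.

```python
from math import prod

# Values copied from the table above: input rate, per-stage strides,
# and the encoder / quantizer / decoder figures (in millions).
models = {
    "[2] 16k": {"rate_hz": 16_000, "strides": (2, 4, 5, 8),
                "parts": (436.30, 2.25, 984.50)},
    "[1] 32k": {"rate_hz": 32_000, "strides": (4, 4, 8, 10),
                "parts": (461.92, 0.50, 1037.12)},
}

def summarize(model):
    """Return (total stride, latent rate in Hz, sum of the three parts)."""
    total_stride = prod(model["strides"])
    latent_rate_hz = model["rate_hz"] / total_stride
    total = round(sum(model["parts"]), 2)
    return total_stride, latent_rate_hz, total

for name, model in models.items():
    print(name, summarize(model))
```

Under these inputs the strides multiply out to 320 and 1280, the latent rates come to 50 Hz and 25 Hz, and the three component figures add up to the values in the Total row.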

Key Analysis:

  • The S4-260165 (32k, ~3M) model runs at twice the input rate (32 kHz vs 16 kHz), which increases encoder cost
  • The S4-260165 model uses a 4× larger total stride (1280 vs 320), reducing the latent rate to 25 Hz (compared to the standard 50 Hz)
  • The reduced latent rate significantly lowers decoder cost (fewer frames to upsample)
  • The higher input-side cost is offset by the lower decoder/latent cost, resulting in comparable total MFlops/s

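Conclusion: Parameter count alone does not determine compute. Two models with identical parameter counts can have vastly different runtimes depending on where the parameters sit (shallow vs. deep layers) and on the stride configuration, because these determine how often each weight is applied per second.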
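A minimal sketch of why this happens, assuming a plain (non-grouped) 1-D convolution: its weight count is fixed, but its cost per second scales with the frame rate at which it runs. The layer dimensions and the two frame rates below are hypothetical and are not taken from either contribution.

```python
def conv1d_params(c_in, c_out, kernel):
    """Weight count of a dense 1-D convolution (bias ignored)."""
    return c_in * c_out * kernel

def conv1d_macs_per_second(c_in, c_out, kernel, output_rate_hz):
    """Each output frame costs one MAC per weight."""
    return conv1d_params(c_in, c_out, kernel) * output_rate_hz

# Hypothetical layer, used only for illustration.
layer = dict(c_in=256, c_out=256, kernel=7)

params = conv1d_params(**layer)
# Same weights placed near the waveform (high frame rate) ...
near_waveform = conv1d_macs_per_second(**layer, output_rate_hz=4_000)
# ... or near the latent (low frame rate, e.g. 25 Hz).
near_latent = conv1d_macs_per_second(**layer, output_rate_hz=25)

print(f"params: {params}")
print(f"MAC/s at 4 kHz frame rate: {near_waveform}")
print(f"MAC/s at 25 Hz frame rate: {near_latent}")
print(f"ratio: {near_waveform // near_latent}x")
```

The parameter count is the same in both placements, while the compute differs by the ratio of the two frame rates (160× in this example).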

2.2 Verification of Complexity Metrics

Theoretical complexity (GMACS) was recalculated to validate the analysis:

  • Using the standard conversion (one MAC counted as two floating-point operations): GMACS ≈ MFlops/s / 1000 × 0.5
  • The S4-260165 (32k, ~3M) model at 32 kHz yields ~1,499.5 MFlops/s
  • Calculated GMACS: 1499.5 / 1000 × 0.5 ≈ 0.75 GMACS
  • This aligns closely with the reference value of 0.79 GMACS reported in S4-260165
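The conversion above can be reproduced directly. The sketch below applies it to the recalculated 1,499.54 MFlops/s figure and to the 2821 MFlops/s figure quoted in the introduction, and expresses the remaining gap to the reported 0.79 GMACS as a fraction; the function and variable names are illustrative.

```python
def mflops_to_gmacs(mflops_per_s):
    """GMACS ≈ MFlops/s / 1000 × 0.5 (one MAC = two floating-point ops)."""
    return mflops_per_s / 1000 * 0.5

recalculated = mflops_to_gmacs(1499.54)   # S4-260165 model, recalculated
reported = 0.79                           # value reported in S4-260165
relative_gap = (reported - recalculated) / reported

vivo_32k = mflops_to_gmacs(2821)          # S4-260155 model at 32 kHz

print(f"recalculated: {recalculated:.2f} GMACS")
print(f"gap to reported value: {relative_gap:.1%}")
print(f"S4-260155 (32 kHz): {vivo_32k:.2f} GMACS")
```

With these inputs the recalculated value is about 0.75 GMACS, roughly 5% below the reported 0.79 GMACS, and the 2821 MFlops/s figure converts to about 1.41 GMACS.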

3. GMACS as the Metric

When RTF data from S4-260155 is plotted against GMACS (rather than Model Size), the data aligns consistently across architectures.

Key Findings:

  • RTF scales linearly with GMACS across different CPU tiers (Efficiency, Performance, Prime cores)
  • A specific GMACS budget (e.g., 2.0 GMACS) yields predictable RTF on a target CPU core and frequency, regardless of architectural choices (high-sample-rate input vs. large parameter count in decoder)
  • This metric decouples complexity constraint from specific architectural choices (stride, latent rate), allowing codec designers flexibility in optimization
  • High-complexity validation: S4-260155's 20M model (~5.14 GMACS) demonstrates an RTF of 0.9 in power-efficient execution mode on a high-end 2023 device, which aligns with the mid-range Prime Core (3.0 GHz) trend, where ~5.3 GMACS corresponds to RTF ≈ 1.0
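The linear relationship can be sketched as a simple proportional model. This assumes the trend line passes through the origin, which the summary does not state, and uses only the one calibration point given above (~5.3 GMACS at RTF ≈ 1.0 on a mid-range Prime core at 3.0 GHz). The function names are hypothetical.

```python
# Calibration point from the text: ~5.3 GMACS corresponds to RTF ≈ 1.0
# on a mid-range Prime core at 3.0 GHz.
PRIME_3GHZ_GMACS_AT_RTF_ONE = 5.3

def predicted_rtf(gmacs, gmacs_at_rtf_one=PRIME_3GHZ_GMACS_AT_RTF_ONE):
    """Proportional model (assumption: zero intercept)."""
    return gmacs / gmacs_at_rtf_one

def gmacs_budget(target_rtf, gmacs_at_rtf_one=PRIME_3GHZ_GMACS_AT_RTF_ONE):
    """Largest GMACS value that stays at or below the target RTF."""
    return target_rtf * gmacs_at_rtf_one

print(f"20M model (5.14 GMACS): RTF ≈ {predicted_rtf(5.14):.2f}")
print(f"2.0 GMACS budget:       RTF ≈ {predicted_rtf(2.0):.2f}")
print(f"budget for RTF 0.5:     {gmacs_budget(0.5):.2f} GMACS")
```

Under this assumption the 5.14 GMACS model lands at about 0.97, in the same range as the reported RTF of 0.9 (which was measured on a different device and execution mode), and a 2.0 GMACS budget would correspond to an RTF of about 0.38 on this core.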

4. Conclusion

By adopting GMACS as the primary complexity metric, the apparent discrepancies between different contribution data are resolved. This enables a unified set of requirements that accurately reflects real-time capability of mobile devices.

5. Proposal

It is proposed to include this analysis in 3GPP TR 26.940, specifically capturing the following:

  • Model Size is not a consistent proxy for complexity across varying architectures (e.g., high-stride vs. low-stride configurations)
  • GMACS/GFLOPs demonstrates strong linear correlation with real-time performance on mobile devices
  • This analysis provides a solid basis for defining complexity constraints for ULBC candidates

References

[1] S4-260165, "[FS_ULBC] On ULBC complexity and RTF analysis"

[2] S4-260155, "[FS_ULBC] Analysis of AI Codec Real-Time Performance (RTF) and Complexity Scaling"

Document Information
Source: vivo Mobile Communication Co.,
Type: pCR
For: Agreement
Agenda item: 7.8
Agenda item description: FS_ULBC (Study on Ultra Low Bitrate Speech Codec)
Abstract: In this meeting, several companies have presented complexity analyses for AI-based codecs suitable for ULBC, notably S4-260165 [1] (Dolby et al.) and S4-260155 [2] (vivo et al.). While both contributions agree on the general feasibility of AI codecs on modern smartphones, there appeared to be a discrepancy when comparing "Model Size" (parameter count) to Real-Time Factor (RTF). Specifically, for a model of a similar parameter count (e.g., ~3M parameters), the reported complexity and RTF varied significantly between the architectures tested in [1] and [2]. For instance, the [1] ~3M parameter model (32 kHz) requires 0.79 GMACS, whereas the equivalent [2] ~3M parameter model (32 kHz) requires approximately 1.41 GMACS (derived from 2821 MFlops/s). In fact, the [1] model's complexity (0.79 GMACS) aligns more closely with the [2] model operating at 16 kHz (~0.70 GMACS), despite the difference in sampling rate. This contribution provides a detailed analysis of this discrepancy. It demonstrates that "Model Size" is an insufficient metric for constraining complexity across different neural architectures, or even within the same architecture (e.g., DAC). Instead, we show that GMACS (Giga Multiply-Accumulate Operations per Second) provides a robust, linear correlation with RTF across different architectures, sampling rates, and frame rates.
Release: Rel-20
Specification: 26.940
Version: 0.5.1
Related WIs: FS_ULBC
Contact: Wang Dong
Uploaded: 2026-02-03T17:47:01.727000
Contact ID: 107237
Revised to: S4-260443
TDoc Status: revised
Reservation date: 03/02/2026 17:43:14
Agenda item sort order: 20