[FS_ULBC] Analysis of AI Codec Real-Time Performance (RTF) and Complexity Scaling
This contribution addresses a gap in the Ultra Low Bitrate Speech Codec (ULBC) study by moving beyond theoretical complexity metrics (FLOPs, WMOPS) to evaluate real-world performance on mobile devices. The key observation is that static metrics fail to capture system-level bottlenecks, including memory bandwidth pressure and thermal constraints on mobile SoCs. The document presents a real-time factor (RTF) analysis of a neural audio codec (based on the Descript Audio Codec architecture) across multiple model sizes and sample rates on representative mid-range mobile hardware.
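As a point of reference for the RTF figures discussed in this summary, the following is a minimal sketch of how a real-time factor could be measured. It assumes the common convention RTF = processing time / audio duration, so values below 1 mean faster than real time; the source does not spell out which convention it uses. The names `real_time_factor`, `is_real_time`, and `process` are hypothetical and are not taken from the contribution.

```python
import time


def real_time_factor(process, audio_seconds, repeats=5):
    """Median of (wall-clock processing time / audio duration) over several runs.

    `process` is a hypothetical zero-argument callable that encodes and
    decodes `audio_seconds` worth of audio. Assumed convention: RTF < 1
    means the codec runs faster than real time.
    """
    ratios = []
    for _ in range(repeats):
        start = time.perf_counter()
        process()
        elapsed = time.perf_counter() - start
        ratios.append(elapsed / audio_seconds)
    ratios.sort()
    return ratios[len(ratios) // 2]


def is_real_time(rtf, limit=1.0):
    """True when the measured RTF stays under the chosen limit."""
    return rtf < limit
```

On a mobile SoC the measurement would normally be repeated long enough to reach thermal steady state, since a short burst can overstate sustained performance.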
Eight model variants were evaluated, ranging from enc8dec144 to enc64dec1536, with parameter counts spanning 1M to 74M:
Key complexity observations from Table 1:
- Parameter counts range from 1.09M (enc8dec144) to 74.50M (enc64dec1536)
- Model sizes range from 4.3 MB to 283.6 MB
- Computational complexity scales proportionally with sample rate (e.g., enc32dec768: 4955.9 MFlops/s @ 8 kHz, 9972.6 MFlops/s @ 16 kHz, 20006.1 MFlops/s @ 32 kHz)
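The relation between parameter count and model size quoted above can be checked with a short calculation. This sketch assumes that "MB" means 10^6 bytes; the source states neither the unit convention nor the weight precision, so the interpretation in the comments is an assumption rather than a stated fact.

```python
# Smallest and largest variants, with figures as quoted from Table 1.
variants = {
    "enc8dec144": {"params_m": 1.09, "size_mb": 4.3},
    "enc64dec1536": {"params_m": 74.50, "size_mb": 283.6},
}


def bytes_per_parameter(params_m, size_mb):
    """MB per million parameters equals bytes per parameter (assuming MB = 1e6 bytes)."""
    return size_mb / params_m


for name, v in variants.items():
    ratio = bytes_per_parameter(v["params_m"], v["size_mb"])
    print(f"{name}: {ratio:.2f} bytes per parameter")
```

Both ratios land a little under 4 bytes per parameter, which would be consistent with weights stored at roughly 32-bit precision, although the contribution summary does not confirm this.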
Critical finding: For a given model variant, computational complexity scales linearly with sample rate. For enc32dec768:
- 8 kHz: ~0.20 GFLOP count (4955.9 MFlops/s), the baseline
- 16 kHz: ~0.40 GFLOP count (9972.6 MFlops/s), 2x the 8 kHz figure
- 32 kHz: ~0.80 GFLOP count (20006.1 MFlops/s), 4x the 8 kHz figure
Implication: Higher sampling rates incur a proportional computational penalty. For resource-constrained devices (IoT, wearables), narrowband (NB) mode at 8 kHz is recommended.
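The linear scaling can be verified directly from the enc32dec768 figures quoted above. The sketch below compares each complexity ratio with the corresponding sample-rate ratio and also derives a per-sample cost; the variable and function names are illustrative only.

```python
# enc32dec768 complexity in MFlops/s, keyed by sample rate in Hz (from the text).
mflops_per_s = {8_000: 4955.9, 16_000: 9972.6, 32_000: 20006.1}

BASE_RATE = 8_000


def scaling_report(costs, base_rate=BASE_RATE):
    """Return (sample rate, rate ratio, cost ratio, FLOPs per sample) per entry."""
    base_cost = costs[base_rate]
    rows = []
    for rate in sorted(costs):
        rate_ratio = rate / base_rate
        cost_ratio = costs[rate] / base_cost
        flops_per_sample = costs[rate] * 1e6 / rate
        rows.append((rate, rate_ratio, cost_ratio, flops_per_sample))
    return rows


for rate, rate_ratio, cost_ratio, per_sample in scaling_report(mflops_per_s):
    print(f"{rate} Hz: rate x{rate_ratio:.0f}, cost x{cost_ratio:.3f}, "
          f"{per_sample / 1e6:.3f} MFLOP per sample")
```

The cost ratios stay within about 1% of the rate ratios, and the cost per input sample is nearly constant at roughly 0.62 MFLOP, which is what linear scaling with sample rate implies.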
Energy-conserving state with severe constraints:
Typical sustained workload state:
High-performance state approaching sustained limits:
Key observation: The inverse relationship between sample rate and feasible model size holds consistently across these states.
Analysis at peak locked frequencies establishes absolute upper bounds:
Even at peak frequency, the A55 remains highly constrained. Models exceeding ~5M parameters (enc16dec384) fail real-time constraints at 8 kHz and above, which makes this core unsuitable for models with large weight matrices.
This is the most relevant benchmark for ULBC, since it represents the sustained compute capability of modern mobile devices.
Critical "Complexity vs. Bandwidth" trade-off identified:
Results mirror the A78 trends, with slight improvements due to the higher clock frequency. The bandwidth bottleneck remains dominant: a higher clock speed provides a safety margin for borderline models (e.g., enc24dec576 @ 32 kHz) but does not fundamentally shift the feasible model size category.
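One way to see why a higher clock does not move the feasible size category is to estimate the weight traffic a model generates, which does not depend on core frequency. The sketch below is a rough model under loudly stated assumptions: every weight is re-read from memory once per inference pass (no cache reuse), "MB" means 10^6 bytes, and the pass rate of 50 per second is a purely hypothetical value chosen for illustration, not a figure from the contribution.

```python
def weight_traffic_gb_per_s(model_size_mb, passes_per_second):
    """Estimated weight traffic in GB/s.

    Assumes the full set of weights is streamed from memory once per
    inference pass and that MB = 1e6 bytes. Both are simplifying assumptions.
    """
    return model_size_mb * 1e6 * passes_per_second / 1e9


HYPOTHETICAL_PASSES_PER_SECOND = 50  # illustrative only, not from the source

for name, size_mb in [("enc8dec144", 4.3), ("enc64dec1536", 283.6)]:
    traffic = weight_traffic_gb_per_s(size_mb, HYPOTHETICAL_PASSES_PER_SECOND)
    print(f"{name}: ~{traffic:.2f} GB/s of weight traffic")
```

Under these assumptions the traffic grows with model size and pass rate only, so raising the CPU clock shortens the compute portion of each pass but leaves the memory-bound portion unchanged.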
Established an inverse relationship on performance cores, where halving the sample rate approximately doubles the feasible parameter count:
- 32 kHz → 10M parameters
- 16 kHz → 20M parameters
- 8 kHz → 39M parameters
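The three operating points above imply that the product of sample rate and feasible parameter count is roughly constant. The following sketch checks that product and derives a simple budget estimate anchored at the 32 kHz point; `estimated_param_budget` is a hypothetical helper for illustration, not a formula given in the contribution.

```python
# Feasible parameter counts on performance cores, keyed by sample rate in Hz.
feasible_params = {32_000: 10e6, 16_000: 20e6, 8_000: 39e6}

# If the relationship is inverse, rate * params should be nearly constant.
products = {rate: rate * params for rate, params in feasible_params.items()}


def estimated_param_budget(sample_rate_hz,
                           reference_rate_hz=32_000,
                           reference_params=10e6):
    """Extrapolate a parameter budget assuming params is proportional to 1 / sample rate."""
    return reference_params * reference_rate_hz / sample_rate_hz


for rate in sorted(feasible_params):
    reported = feasible_params[rate] / 1e6
    estimate = estimated_param_budget(rate) / 1e6
    print(f"{rate} Hz: reported {reported:.0f}M, estimated {estimate:.0f}M")
```

The estimate matches the reported 16 kHz value and overshoots the 8 kHz value slightly (40M against 39M), which fits the "approximately doubles" wording. Extrapolating beyond the measured sample rates would be speculative.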
Provided concrete RTF measurements across representative mobile hardware configurations, revealing that:
- Theoretical complexity metrics (FLOPs) don't capture real-world bottlenecks
- Memory bandwidth and thermal constraints significantly impact feasibility
- Efficiency cores (A55) are unsuitable for neural codec workloads beyond minimal complexity
Identified a hard limit of 10M parameters for 32 kHz operation on mid-range mobile devices (A78 @ 2.6 GHz), providing concrete guidance for ULBC candidate selection.
The contribution proposes including these RTF analysis findings in TR 26.940 to inform complexity constraint selection for ULBC candidates, moving the standardization process toward real-world deployability considerations rather than purely theoretical metrics.