S4-260155 - AI Summary

[FS_ULBC] Analysis of AI Codec Real-Time Performance (RTF) and Complexity Scaling

1. Introduction and Motivation

This contribution addresses a critical gap in the Ultra Low Bitrate Speech Codec (ULBC) study by moving beyond theoretical complexity metrics (FLOPs, WMOPS) to evaluate real-world performance on mobile devices. The key observation is that static metrics fail to capture system-level bottlenecks, including memory bandwidth pressure and thermal constraints on mobile SoCs. The document presents a comprehensive Real-Time Factor (RTF) analysis of a neural audio codec (based on the Descript Audio Codec architecture) across multiple model sizes and sample rates on representative mid-range mobile hardware.

2. Experimental Setup

2.1 Model Configuration

Eight model variants were evaluated, ranging from enc8dec144 (about 1M parameters) to enc64dec1536 (about 74M parameters). All variants share the following configuration:

- Architecture: fully convolutional encoder-decoder with Residual Vector Quantization (RVQ)
- Frame length: 40 ms (fixed across all variants)
- Total up/down-sampling factor: 320 (consistent across variants)
- Sample rates tested: 8 kHz (320 samples per frame), 16 kHz (640 samples per frame), 32 kHz (1280 samples per frame)
- Export format: ONNX with Float32 precision

Key complexity observations from Table 1:
- Parameter counts range from 1.09M (enc8dec144) to 74.50M (enc64dec1536)
- Model sizes range from 4.3 MB to 283.6 MB
- Computational complexity scales proportionally with sample rate (e.g., enc32dec768: 4955.9 MFlops/s @ 8kHz, 9972.6 MFlops/s @ 16kHz, 20006.1 MFlops/s @ 32kHz)
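
The frame sizes quoted above follow directly from the 40 ms frame length. The sketch below reproduces that arithmetic. The last step, dividing each frame by the up/down-sampling factor of 320 to get the number of latent steps per frame, is an assumption about how the factor applies and is not stated in the summary.

```python
FRAME_MS = 40          # frame length stated in the summary
RESAMPLE_FACTOR = 320  # total up/down-sampling factor stated in the summary


def samples_per_frame(sample_rate_hz: int, frame_ms: int = FRAME_MS) -> int:
    """Number of PCM samples contained in one codec frame."""
    return sample_rate_hz * frame_ms // 1000


frame_sizes = {sr: samples_per_frame(sr) for sr in (8_000, 16_000, 32_000)}

# Assumption: the factor of 320 maps input samples to latent steps, so a
# longer frame (in samples) yields proportionally more latent steps.
latent_steps = {sr: n // RESAMPLE_FACTOR for sr, n in frame_sizes.items()}

for sr, n in frame_sizes.items():
    print(f"{sr // 1000} kHz: {n} samples/frame, {latent_steps[sr]} latent step(s)")
```

Under that assumption, the work per frame grows with the number of samples a fully convolutional model has to process, which is consistent with the proportional scaling reported above.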

2.2 Device Under Test (DUT) Environment

- Platform: MediaTek Dimensity 1200 (6nm), a representative mid-range SoC
- Inference engine: ONNX Runtime v1.14+ with the CPU execution provider (single-threaded)
- CPU clusters tested:
  - Efficiency cluster: Cortex-A55
  - Performance cluster: Cortex-A78
  - Prime core: Cortex-A78+
- Methodology: frequency-locked operation, with thermal services and power HALs disabled to eliminate dynamic frequency scaling noise
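
The summary does not spell out how RTF was computed. A minimal sketch, assuming the usual definition (processing time for one frame divided by the frame duration, with RTF below 1.0 meaning real-time), could look like this. The names `measure_rtf` and `dummy_frame`, the warm-up count, and the use of the median are illustrative choices, not the contribution's method. On the device under test, `process_frame` would wrap one single-threaded encode/decode call of the exported ONNX model.

```python
import time
from statistics import median
from typing import Callable


def measure_rtf(process_frame: Callable[[], object],
                frame_ms: float = 40.0,
                warmup: int = 5,
                runs: int = 50) -> float:
    """Median per-frame processing time divided by the frame duration."""
    for _ in range(warmup):          # discard cold-start iterations
        process_frame()
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        process_frame()
        timings.append(time.perf_counter() - start)
    return median(timings) / (frame_ms / 1000.0)


def dummy_frame() -> int:
    """Hypothetical stand-in workload; replace with a real codec call."""
    return sum(i * i for i in range(10_000))


rtf = measure_rtf(dummy_frame)
print(f"RTF = {rtf:.4f} ({'real-time' if rtf < 1.0 else 'too slow'})")
```

Pinning the process to a single core and locking its frequency, as the methodology describes, is a platform-level step that this sketch does not attempt.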

3. Results and Analysis

3.1 Complexity Scaling vs. Bandwidth

Critical finding: For a given model variant, computational complexity scales linearly with sample rate. For enc32dec768:
- 8 kHz: ~0.20 GFLOP per 40 ms frame (4955.9 MFlops/s)
- 16 kHz: ~0.40 GFLOP per 40 ms frame (9972.6 MFlops/s), a 2x increase over 8 kHz
- 32 kHz: ~0.80 GFLOP per 40 ms frame (20006.1 MFlops/s), a 4x increase over 8 kHz

Implication: Higher sampling rates incur a proportional computational penalty. For resource-constrained devices (IoT, wearables), narrowband (NB) mode at 8 kHz is recommended.
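
As a cross-check of the enc32dec768 figures, the short calculation below converts the quoted MFlops/s rates into work per frame and into ratios relative to 8 kHz. It assumes the GFLOP counts refer to one 40 ms frame, which the numbers support (rate multiplied by 0.040 s), although the summary does not say so explicitly.

```python
FRAME_S = 0.040  # 40 ms frame length

# enc32dec768 compute rates quoted in the summary, in MFlops/s, keyed by kHz
mflops_per_s = {8: 4955.9, 16: 9972.6, 32: 20006.1}

# Work per frame in GFLOP: rate (MFLOP/s) * frame duration (s) / 1000
gflop_per_frame = {khz: rate * FRAME_S / 1000 for khz, rate in mflops_per_s.items()}

# Scaling relative to the 8 kHz configuration
ratio_vs_8k = {khz: rate / mflops_per_s[8] for khz, rate in mflops_per_s.items()}

for khz in sorted(mflops_per_s):
    print(f"{khz:>2} kHz: {gflop_per_frame[khz]:.3f} GFLOP/frame, "
          f"{ratio_vs_8k[khz]:.2f}x vs 8 kHz")
```

The measured ratios land slightly above the ideal 2x and 4x, so "linear" here should be read as a close approximation.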

3.2 Real-Time Factor (RTF) Analysis Across Three Frequency Tiers

3.2.1 Tier 1: Low Frequency (A55@750MHz, A78@902MHz, A78+@1.1GHz)

Energy-conserving state with severe constraints:

- Cortex-A55 @ 750 MHz: Only the smallest models (enc8dec144) maintain real-time operation at 8 kHz; 16 and 32 kHz are infeasible
- Cortex-A78 @ 902 MHz:
  - 32 kHz: Limited to <3M parameters
  - 16 kHz: Supports up to ~8M parameters
  - 8 kHz: Supports up to ~10M parameters
- Cortex-A78+ @ 1.108 GHz: Similar to the A78, but extends the 16 kHz limit closer to 10M parameters

3.2.2 Tier 2: Mid Frequency (A55@1.0GHz, A78@1.16GHz, A78+@1.37GHz)

Typical sustained workload state:

- Cortex-A55 @ 1.0 GHz: 8 kHz supports up to ~2M parameters; 16 and 32 kHz remain largely infeasible
- Cortex-A78 @ 1.162 GHz:
  - 32 kHz: ~5M parameter limit
  - 16 kHz: ~10M parameters (covers the "Low Complexity" profile)
  - 8 kHz: Robust up to ~20M parameters
- Cortex-A78+ @ 1.37 GHz: Performance parity with the A78 (clock speed is the primary differentiator)

3.2.3 Tier 3: High Frequency (A55@1.73GHz, A78@1.45GHz, A78+@1.63GHz)

High-performance state approaching sustained limits:

- Cortex-A55 @ 1.73 GHz:
  - 8 kHz: ~3M parameters
  - 16 kHz: ~2M parameters
  - 32 kHz: ~1M parameters
- Cortex-A78 @ 1.451 GHz:
  - 32 kHz: ~7M parameters
  - 16 kHz: ~10M parameters
  - 8 kHz: ~20M parameters
- Cortex-A78+ @ 1.632 GHz (highest headroom):
  - 32 kHz: ~8M parameters
  - 16 kHz: Comfortably supports 10M parameters
  - 8 kHz: ~20M parameters

Key observation: In every tier, the feasible model size shrinks as the sample rate increases.
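
The tier results can be collected into a small lookup table to confirm that the trend holds for each core and tier that reports all three sample rates. The values are the approximate limits quoted above, in millions of parameters. The table layout and the helper name `capacity_shrinks_with_rate` are illustrative only.

```python
# Approximate real-time parameter limits (millions), keyed by sample rate in kHz.
limits_m = {
    ("A78", "tier1"): {8: 10, 16: 8, 32: 3},   # 32 kHz is quoted as "<3M"; 3 is an upper bound
    ("A78", "tier2"): {8: 20, 16: 10, 32: 5},
    ("A55", "tier3"): {8: 3, 16: 2, 32: 1},
    ("A78", "tier3"): {8: 20, 16: 10, 32: 7},
    ("A78+", "tier3"): {8: 20, 16: 10, 32: 8},
}


def capacity_shrinks_with_rate(by_rate: dict) -> bool:
    """True if the limit never grows as the sample rate goes up."""
    ordered = [by_rate[khz] for khz in sorted(by_rate)]
    return all(lo_rate >= hi_rate for lo_rate, hi_rate in zip(ordered, ordered[1:]))


checks = {key: capacity_shrinks_with_rate(v) for key, v in limits_m.items()}
for (core, tier), ok in checks.items():
    print(f"{core:>4} {tier}: {'consistent' if ok else 'violates trend'}")
```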

3.3 Maximum Performance Envelope

Analysis at peak locked frequencies establishes absolute upper bounds:

3.3.1 Efficiency Core (Cortex-A55 @ 2.0 GHz)

Even at peak frequency, the A55 remains highly constrained. Models exceeding ~5M parameters (enc16dec384) fail real-time constraints at 8 kHz and above, which makes this core unsuitable for models with large weight matrices.

3.3.2 Performance Core (Cortex-A78 @ 2.6 GHz)

Most relevant benchmark for ULBC: it represents the sustained compute capability of modern mobile devices.

Critical "Complexity vs. Bandwidth" trade-off identified:

- 32 kHz: RTF crosses 1.0 near 10M parameters (enc24dec576 variant)
  - Hard limit for High-Fidelity ULBC candidates
- 16 kHz: Feasible model size effectively doubles to ~20M parameters (enc32dec768 variant)
  - enc40dec960 fails real-time constraints
  - Halving the audio bandwidth roughly doubles the parameter capacity
- 8 kHz: Extends to ~39M parameters
  - enc40dec960 (29M) is safe
  - Trend suggests failure before enc64dec1536
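
The A78 @ 2.6 GHz limits can be turned into a coarse feasibility check for a candidate model size. This is a rule-of-thumb sketch built only from the approximate crossover points quoted above. It is not a substitute for measuring RTF, and the names `RT_LIMIT_M` and `fits_real_time` are hypothetical.

```python
# Approximate parameter counts (millions) at which RTF reaches 1.0
# on the Cortex-A78 @ 2.6 GHz, keyed by sample rate in kHz.
RT_LIMIT_M = {32: 10.0, 16: 20.0, 8: 39.0}

# Parameter counts (millions) quoted in the summary for two variants.
VARIANTS_M = {"enc40dec960": 29.0, "enc64dec1536": 74.50}


def fits_real_time(params_m: float, sample_rate_khz: int) -> bool:
    """Coarse check: is the model at or below the quoted real-time limit?"""
    return params_m <= RT_LIMIT_M[sample_rate_khz]


feasible = {
    name: [khz for khz in sorted(RT_LIMIT_M) if fits_real_time(params, khz)]
    for name, params in VARIANTS_M.items()
}

for name, rates in feasible.items():
    print(f"{name}: real-time at {rates or 'no tested sample rate'} (kHz)")
```

The outcome matches the statements above: enc40dec960 passes only at 8 kHz, and enc64dec1536 exceeds the limit at every tested sample rate.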

3.3.3 Prime Core (Cortex-A78+ @ 3.0 GHz)

Results mirror the A78 trends, with slight improvements from the higher clock frequency. The bandwidth bottleneck remains dominant: the higher clock speed provides a safety margin for borderline models (e.g., enc24dec576 @ 32kHz) but does not fundamentally shift the feasible model size category.

4. Key Technical Contributions

4.1 Quantified Complexity-Bandwidth Trade-off

Established an approximately inverse relationship between sample rate and feasible model size: each halving of the sample rate roughly doubles the feasible parameter count on performance cores:
- 32 kHz → 10M parameters
- 16 kHz → 20M parameters
- 8 kHz → 39M parameters
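
One way to read these three points is that the product of sample rate and feasible parameter count stays roughly constant. The short check below uses only the figures listed above. Treating the product as a fixed "budget" is an interpretation, not a claim made by the contribution.

```python
# Feasible parameter counts (millions) on the performance core, keyed by kHz
limits_m = {32: 10, 16: 20, 8: 39}

# Product of sample rate (kHz) and parameter limit (millions)
budget = {khz: khz * m for khz, m in limits_m.items()}

# Growth in capacity for each halving of the sample rate
gain_32_to_16 = limits_m[16] / limits_m[32]
gain_16_to_8 = limits_m[8] / limits_m[16]

print(budget)                          # {32: 320, 16: 320, 8: 312}
print(gain_32_to_16, gain_16_to_8)     # 2.0 1.95
```

The 8 kHz point falls slightly short of an exact doubling, which is why the relationship is best described as approximate.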

4.2 Real-World Performance Benchmarks

Provided concrete RTF measurements across representative mobile hardware configurations, revealing that:
- Theoretical complexity metrics (FLOPs) don't capture real-world bottlenecks
- Memory bandwidth and thermal constraints significantly impact feasibility
- Efficiency cores (A55) are unsuitable for neural codec workloads beyond minimal complexity

4.3 Practical Complexity Constraints for ULBC

Identified a hard limit of about 10M parameters for 32 kHz operation on mid-range mobile devices (A78 @ 2.6 GHz), providing concrete guidance for ULBC candidate selection.

5. Proposal

The contribution proposes including these RTF analysis findings in TR 26.940 to inform complexity constraint selection for ULBC candidates, moving the standardization process toward real-world deployability considerations rather than purely theoretical metrics.

Document Information

TDoc: S4-260155

Source: vivo Mobile Communication Co., Xiaomi Technology, Spreadtrum, Bytedance

Type: pCR

For: Agreement

Agenda item: 7.8

Agenda item description: FS_ULBC (Study on Ultra Low Bitrate Speech Codec)

Abstract: As part of the study on the new Ultra Low Bitrate Speech Codec (ULBC) [1], it is necessary to establish complexity constraints that reflect real-world device capabilities. Previous contributions have analyzed theoretical complexity using static metrics such as FLOPs and WMOPS [2] [5]. However, static metrics often fail to capture system-level bottlenecks, such as memory bandwidth pressure and thermal constraints on mobile System-on-Chips (SoCs). This contribution presents a comprehensive performance analysis of a neural audio codec (based on the Descript Audio Codec architecture) running on a representative mid-range mobile platform. By sweeping across model sizes (1M to 74M parameters) and sample rates (8, 16, 32 kHz), we evaluate the correlation between theoretical complexity and the Real-Time Factor (RTF).

Release: Rel-20

Specification: 26.940

Version: 0.4.0

Related WIs: FS_ULBC

Contact: Wang Dong

Uploaded: 2026-02-03T13:43:09.937000

Contact ID: 107237

Revised to: S4-260445

TDoc Status: revised

Reservation date: 03/02/2026 12:35:47

Agenda item sort order: 20