All Summaries - Table View

Meeting: TSGS4_135_India | Agenda Item: 7.8

38 documents found

TDoc Number Source Title Summary
Bytedance
[FS_ULBC] Analysis on complexity evaluation of ULBC with WMOPS

Analysis on Complexity Evaluation of ULBC with WMOPS

1. Introduction

This contribution examines the use of WMOPS (Weighted Million Operations Per Second) as a complexity metric for ULBC (Ultra Low Bitrate Codec). WMOPS has been proposed as one of the possible complexity metrics and is traditionally used to evaluate the complexity of 3GPP speech codecs. The analysis focuses on the WMC tool, which automates WMOPS calculation for floating-point C code.

2. Technical Analysis: Discrepancies Between ITU-T Documentation and WMC Tool Implementation

The source conducted systematic testing of the WMC tool against the examples provided in ITU-T standards documentation (specifically clause 18.12.7 and related tables in the ITU-T Software Tool Library 2024 User's Manual). Several discrepancies were identified:

2.1 'Move' Operator Counting

Issue: Extra MOVE operations are counted by the WMC tool

  • Expected behavior (per Table 18.4): Division by constant b = a / L should count as 1 MULT (since 1/L is constant, operation becomes multiplication)
  • Actual WMC output: 1 MULT + 1 MOVE
  • Discrepancy: 1 additional MOVE operation

2.2 Increment Operator ('++')

Issue: Missing operations in WMC tool output

  • Expected behavior (per Table 18.4): (*rnd_T0)++ should count as 1 ADD + 1 STORE (equivalent to *rnd_T0 = *rnd_T0 + 1)
  • Actual WMC output: 0 operations counted
  • Discrepancy: 1 ADD + 1 STORE not counted
  • Note: This may not affect actual complexity on DSP implementations where pointer increment can be combined with other operations

2.3 Logical Operators ('AND/OR')

Issue: Missing TEST operation counting

  • Expected behavior (per Table 18.4): if (a!=b || c==d) should count as 2 ADD + 2 BRANCH + 1 TEST
  • Actual WMC output: 2 ADD + 2 BRANCH
  • Discrepancy: 1 TEST operation missing

2.4 Indirect Addressing

Issue: Extra MOVE operation and incorrect INDIRECT counting

  • Expected behavior (per Table 18.4): Double indirection indice[0] = indirect_dico1[indice[0]] should count as 2 INDIRECT
  • Actual WMC output: 1 MOVE
  • Discrepancy: 2 INDIRECT operations not counted, 1 MOVE added instead

2.5 Instrumentation with Array Subscripts

Issue: Arithmetic operations inside array subscripts not counted

  • Expected behavior (per Table 18.3): Operations like a[i*2+1] should count arithmetic operations within the subscript (1 INDIRECT + 1 MULT + 1 ADD)
  • Actual WMC output: Only 1 MOVE counted
  • Discrepancy: INDIRECT, MULT, and ADD operations inside array subscripts are not instrumented
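
The five comparisons in clauses 2.1 to 2.5 can be tabulated and differenced mechanically. The following Python sketch is illustrative only: the operation counts are copied from the bullets above, while the dictionary layout and the helper name `count_delta` are assumptions made for this example and are not part of the WMC tool.

```python
from collections import Counter

# (expected count per the ITU-T tables, actual WMC output), as listed in clauses 2.1-2.5
CASES = {
    "2.1 b = a / L": ({"MULT": 1}, {"MULT": 1, "MOVE": 1}),
    "2.2 (*rnd_T0)++": ({"ADD": 1, "STORE": 1}, {}),
    "2.3 if (a!=b || c==d)": ({"ADD": 2, "BRANCH": 2, "TEST": 1}, {"ADD": 2, "BRANCH": 2}),
    "2.4 double indirection": ({"INDIRECT": 2}, {"MOVE": 1}),
    "2.5 a[i*2+1]": ({"INDIRECT": 1, "MULT": 1, "ADD": 1}, {"MOVE": 1}),
}


def count_delta(expected, actual):
    """Return actual minus expected per operation, dropping zero entries."""
    ops = set(expected) | set(actual)
    delta = {op: actual.get(op, 0) - expected.get(op, 0) for op in ops}
    return {op: d for op, d in sorted(delta.items()) if d != 0}


if __name__ == "__main__":
    net = Counter()
    for name, (expected, actual) in CASES.items():
        delta = count_delta(expected, actual)
        net.update(delta)  # Counter.update adds counts and keeps negative values
        print(f"{name}: {delta}")
    print("net over all five cases:", dict(net))
```

A positive value marks an operation the tool over-counts relative to the documentation; a negative value marks one it under-counts.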

3. Observations and Impact Assessment

The source identifies three key observations:

  1. Systematic discrepancies exist between ITU-T standards documentation and WMC tool implementation, with both over-counting (e.g., extra MOVE operations) and under-counting (e.g., missing operations in array subscripts) observed

  2. Potential significance for AI codecs: Some discrepancies, particularly the counting of MOVE operators and instrumentation inside arrays, could significantly impact WMOPS measurements for AI-based codecs

  3. Need for clarification: If WMOPS is adopted as a complexity metric for ULBC, these differences must be carefully addressed and the calculation methodology must be clearly defined

4. Proposal

The source proposes to document the findings from Clause 2 and Clause 3 in clause 6.2 of the permanent document to ensure proper consideration of these WMOPS calculation issues in the ULBC complexity evaluation framework.

Bytedance
[FS_ULBC] Influence of code optimization on WMOPS

Summary of S4-260128: Influence of Code Optimization on WMOPS

1. Introduction and Motivation

This contribution investigates the impact of C code implementation choices on WMOPS (Weighted Million Operations Per Second) measurements for neural audio codecs, specifically in the context of the ULBC (Ultra-Low Bitrate Codec) study. The source examines whether WMOPS, traditionally used for 3GPP speech codec complexity evaluation, is suitable for neural audio codecs given that actual C implementation can significantly affect WMOPS measurements.

2. Experimental Analysis

2.1 Operator-Level Analysis

The source conducted experiments on Conv1D and Conv1DTranspose operators, which are extensively used in DAC (Descript Audio Codec) for audio feature dimension manipulation:

  • Non-optimized implementation: Naïve nested loop implementation
  • Optimized implementation: Loop unrolling along kernel size dimension only

Key Results:

  • Conv1D: 441 WOPS (non-optimized) → 301 WOPS (optimized) = 68.25% ratio
  • Conv1DTranspose: 554 WOPS (non-optimized) → 260 WOPS (optimized) = 46.93% ratio

Finding: The same optimization strategy yields significantly different optimization ratios for different operators.
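
To make the two implementation styles concrete, the sketch below contrasts a nested-loop convolution with a version unrolled along the kernel dimension. It is not the source's C code: the single-channel layout, stride 1, absence of padding, and the fixed kernel size of 3 are all assumptions chosen to keep the example short. Python cannot reproduce the WMOPS difference itself; the example only shows that unrolling removes the inner loop while leaving the result unchanged.

```python
def conv1d_naive(x, w):
    """Valid (no padding) 1-D convolution in correlation form, nested loops."""
    n_out = len(x) - len(w) + 1
    y = [0.0] * n_out
    for i in range(n_out):
        acc = 0.0
        for k in range(len(w)):  # inner loop over kernel taps
            acc += x[i + k] * w[k]
        y[i] = acc
    return y


def conv1d_unrolled_k3(x, w):
    """Same computation with the kernel loop unrolled for kernel size 3."""
    if len(w) != 3:
        raise ValueError("this sketch is unrolled for kernel size 3 only")
    w0, w1, w2 = w
    n_out = len(x) - 2
    y = [0.0] * n_out
    for i in range(n_out):
        # One straight-line expression replaces the inner loop and its bookkeeping.
        y[i] = x[i] * w0 + x[i + 1] * w1 + x[i + 2] * w2
    return y


if __name__ == "__main__":
    signal = [1.0, 2.0, 3.0, 4.0, 5.0]
    kernel = [1.0, 0.0, -1.0]
    print(conv1d_naive(signal, kernel))
    print(conv1d_unrolled_k3(signal, kernel))
```

In an instrumented C implementation the unrolled form would drop the per-tap loop and indexing overhead, which is the kind of difference an operation-counting tool picks up.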

2.2 Full-Model Level Analysis

Using the optimized and non-optimized operator implementations, the source measured WMOPS for two DAC configurations (enc16dec384 and enc64dec1536) and compared against previously reported results [4]:

Total WMOPS:

  • enc16dec384: 13,320.35 (non-opt) → 8,152.01 (opt) → 5,785.17 (reported in [4])
  • enc64dec1536: 201,552.55 (non-opt) → 123,966.49 (opt) → 84,441.99 (reported in [4])

Encoder WMOPS:

  • enc16dec384: 3,411.08 (non-opt) → 2,621.98 (opt) → 1,060.79 (reported in [4])
  • enc64dec1536: 50,089.70 (non-opt) → 39,604.59 (opt) → 13,675.30 (reported in [4])

Decoder WMOPS:

  • enc16dec384: 9,847.22 (non-opt) → 5,484.21 (opt) → 4,724.38 (reported in [4])
  • enc64dec1536: 151,291.59 (non-opt) → 84,255.25 (opt) → 70,766.69 (reported in [4])
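
The optimization ratios can be recomputed directly from the reported figures. The sketch below is a plain arithmetic aid: the numbers are those quoted in clauses 2.1 and 2.2, and the helper name `ratio_percent` and the dictionary layout are assumptions of this example.

```python
# (non-optimized, optimized) values as quoted in the summary
OPERATORS = {
    "Conv1D": (441.0, 301.0),
    "Conv1DTranspose": (554.0, 260.0),
}
MODELS = {
    ("enc16dec384", "total"): (13320.35, 8152.01),
    ("enc16dec384", "encoder"): (3411.08, 2621.98),
    ("enc16dec384", "decoder"): (9847.22, 5484.21),
    ("enc64dec1536", "total"): (201552.55, 123966.49),
    ("enc64dec1536", "encoder"): (50089.70, 39604.59),
    ("enc64dec1536", "decoder"): (151291.59, 84255.25),
}


def ratio_percent(non_opt, opt):
    """Optimized complexity as a percentage of the non-optimized complexity."""
    return 100.0 * opt / non_opt


if __name__ == "__main__":
    for name, (non_opt, opt) in OPERATORS.items():
        print(f"{name}: {ratio_percent(non_opt, opt):.2f}%")
    for (config, part), (non_opt, opt) in MODELS.items():
        print(f"{config} {part}: {ratio_percent(non_opt, opt):.2f}%")
```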

3. Observations and Conclusions

The source draws two critical observations:

  1. Code optimization sensitivity: Simple optimizations (e.g., single-layer loop unrolling) can result in widely different WMOPS results for the same model
  2. Inconsistent optimization impact: The same optimization strategy produces different WMOPS reduction ratios across different models

Main Conclusion: If WMOPS is adopted as a complexity metric for ULBC, results will be highly influenced not only by model design but also by the actual C code implementation, potentially making comparisons between different codec proposals inconsistent.

4. Proposal

The source proposes to document the experimental findings and observations as a new clause 7.6.5 "WMOPS analysis on DAC" in TR 26.940, with three sub-clauses:

  • 7.6.5.1: On operator level
  • 7.6.5.2: On full-model level
  • 7.6.5.3: Observation

This would capture the implementation-dependency issues of WMOPS measurements for neural audio codecs in the technical report.

China Mobile Com. Corporation
[FS_ULBC] Discussion of FS_ULBC Objective Speech Quality Assessment Method

Summary of S4-260132: Discussion of FS_ULBC Objective Speech Quality Assessment Method

Background

This contribution addresses speech quality assessment challenges for ultra-low bitrate codecs (ULBC). While subjective testing remains the benchmark for ULBC codec selection, objective speech evaluation methods can serve as predictive tools during intermediate testing and parameter adjustment processes, enabling more convenient and efficient quality verification.

Overview of Existing Speech Objective Quality Evaluation Methods

The document provides a comprehensive comparison of available objective assessment tools:

Standardized ITU-T Methods

  • P.863 (POLQA): Full-reference method, widely adopted in ITU/3GPP, supports NB/WB/SWB, maintains performance below 4kbps in SWB mode
  • P.563: No-reference method suitable for real-time applications, but less accurate for extreme noise or complex distortions compared to full-reference methods

Open Source Methods

  • ViSQOL: Full-reference, performs well for low bitrates (under 8kbps with good MOS correlation), but not formally standardized
  • STOI/ESTOI: Full-reference, focuses on speech intelligibility, computationally efficient with high correlation to subjective tests in noisy conditions. ESTOI improves robustness to nonlinear distortions (e.g., neural codecs)
  • SCOREQ: No-reference model with strong cross-domain robustness and improved correlation with human judgments

Capabilities and Limitations for ULBC

The document analyzes each method's suitability for ultra-low bitrate scenarios:

  • P.863: Most widely adopted, broad bandwidth support, proven performance at low bitrates
  • P.563: Limited adaptability to non-linear distortions from neural codecs
  • ViSQOL: Good consistency with MOS at low bitrates but lacks formal standardization
  • STOI/ESTOI: Effective for intelligibility assessment, robust to nonlinear distortions, but not ITU-T/3GPP standardized
  • SCOREQ: Addresses domain-generalization shortcomings with improved out-of-domain robustness

Proposal

Recommended Objective Assessment Methods

After excluding unsuitable methods, the contribution recommends considering P.863, ViSQOL, and ESTOI as potential objective quality assessment methods for ULBC.

Text Proposal for TR 26.940

The document proposes a pCR to TR 26.940 Section 9 (Test methodologies) that includes:

New Section 9.1.1: Typical Quality Impairments

Identifies ULBC-specific impairment categories:

  • Loss of listening-only audio quality
  • Audio bandwidth loss
  • Impaired intelligibility
  • Impaired speaker identifiability
  • Prosodic impairments
  • Hallucination (word and phone confusions)
  • Sensitivity to non-speech input (background noise, music, reverberant speech)

New Section 9.1.2: Challenges of Quality Assessment

Addresses testing challenges specific to ULBC:

  • Traditional 3GPP Practice: AMR/AMR-WB/EVS used P.800 ACR for clean speech and DCR for noisy/mixed content, but did not focus on intelligibility, speaker identifiability, or prosodic impairments

  • ULBC-Specific Challenges: ML-based codecs introduce new impairment types (e.g., hallucination) requiring alternative test methods

  • Additional Test Methodologies (non-exhaustive list):
      • Diagnostic Rhyme Tests (DRT)
      • Modified Rhyme Tests (MRT)
      • MOS testing for speaker similarity
      • Speaker verification/identification tests
      • Prosodic naturalness MOS tests
      • Intonation recognition tests
      • Transcription tests for word/semantic equivalence
      • Phoneme recognition tests
      • Automatic speech recognition tests

  • Objective Methods as Optional Tools: Proposes documenting that objective methods (P.863, ViSQOL, ESTOI, etc.) can be considered as optional tools for predicting speech quality during ULBC simulation testing and parameter optimization, acknowledging that subjective listening remains the most important evaluation method despite being time and resource-intensive

  • Speech Enhancement Evaluation: Notes that P.835 multi-dimensional rating scales can be used for speech enhancement tools that may be part of ULBC

Technical Contribution

The main technical contribution is establishing a framework for objective quality assessment in ULBC standardization that:

  1. Recognizes the unique challenges of ML-based codecs
  2. Identifies suitable objective methods as predictive tools
  3. Proposes their documentation as optional assessment methods in TR 26.940
  4. Maintains subjective testing as the primary benchmark while enabling more efficient intermediate evaluation

Xiaomi Technology
Updates to the simulation results for FS_ULBC

Summary of S4-260136: Updates to Simulation Results for FS_ULBC

1. Introduction and Context

This document presents updated link-level simulation (LLS) results for the Ultra-Low Bitrate Codec (ULBC) over Non-Terrestrial Networks (NTN). The simulations follow the NTN-TDL-C channel model as specified in TS 36.102. This revision adds:

  • Missing simulation results for NTN-TDL-C 10 NPUSCH
  • New simulation results for NTN-TDL-C 10 NPDSCH
  • Updated TBS (Transport Block Size) values for both NPDSCH and NPUSCH with 10 degrees elevation angle

The simulations are based on parameters discussed in S4aA250038 and follow agreements from previous meetings.

2. Channel Model Assumptions

Key Parameters:

  • Satellite elevation angle: 12.5 degrees for link budget calculations
  • Channel model parameters (delay and power of each path) determined for 10 degrees elevation (approximation of 12.5 degrees)
  • Channel model: NTN-TDL-C from TS 36.102

3. Link Budget Analysis

3.1 CNR Baseline Values (from RAN1)

Uplink: CNR = 2.6 dB (0 dBi UE antenna gain, 3.75 kHz SCS, 1 tone, 23 dBm UE TX power)

Downlink: CNR = -3.3 dB (0 dBi UE antenna gain, 15 kHz SCS, 12 tones, 1 RX antenna, 7 dB noise figure)

3.2 Additional UL CNR Configurations

| Configuration | SCS | UE Power | UL CNR |
|---------------|-----|----------|---------|
| Config 1 | 3.75 kHz | 23 dBm | 2.6 dB |
| Config 2 | 15 kHz | 23 dBm | -3.42 dB |
| Config 3 | 3.75 kHz | 26 dBm | 5.6 dB |
| Config 4 | 15 kHz | 26 dBm | -0.42 dB |
| Config 5 | 3.75 kHz | 31 dBm | 10.6 dB |
| Config 6 | 15 kHz | 31 dBm | 4.58 dB |

3.3 Additional DL CNR Configurations

| Configuration | Number of RX | G/T value (dB/K) | DL CNR |
|---------------|--------------|------------------|---------|
| Config 1 | 1 | -31.6 | -3.3 dB |
| Config 2 | 2 | -31.6 | -0.3 dB |
| Config 3 | 1 | -28.6 | -0.3 dB |
| Config 4 | 2 | -28.6 | 2.7 dB |
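
The UL and DL values tabulated in clauses 3.2 and 3.3 are consistent with simple dB scaling from the baselines in clause 3.1: transmit power and G/T add directly, a wider subcarrier spacing spreads the same power over more noise bandwidth, and a second receive antenna adds about 3 dB. Whether the source derived the tables this way is an assumption; the function names below are hypothetical.

```python
import math

# Baselines from clause 3.1
UL_REF = {"cnr_db": 2.6, "scs_khz": 3.75, "tx_dbm": 23.0}
DL_REF = {"cnr_db": -3.3, "gt_dbk": -31.6, "n_rx": 1}


def ul_cnr_db(scs_khz, tx_dbm):
    """Scale the UL baseline by transmit power and noise bandwidth (single tone)."""
    power_gain = tx_dbm - UL_REF["tx_dbm"]
    bandwidth_penalty = 10.0 * math.log10(scs_khz / UL_REF["scs_khz"])
    return UL_REF["cnr_db"] + power_gain - bandwidth_penalty


def dl_cnr_db(n_rx, gt_dbk):
    """Scale the DL baseline by G/T and by ideal combining over n_rx antennas."""
    gt_gain = gt_dbk - DL_REF["gt_dbk"]
    rx_gain = 10.0 * math.log10(n_rx / DL_REF["n_rx"])
    return DL_REF["cnr_db"] + gt_gain + rx_gain


if __name__ == "__main__":
    for scs, power in [(3.75, 23), (15, 23), (3.75, 26), (15, 26), (3.75, 31), (15, 31)]:
        print(f"UL {scs} kHz, {power} dBm: {ul_cnr_db(scs, power):.2f} dB")
    for n_rx, gt in [(1, -31.6), (2, -31.6), (1, -28.6), (2, -28.6)]:
        print(f"DL {n_rx} RX, G/T {gt}: {dl_cnr_db(n_rx, gt):.1f} dB")
```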

4. NPUSCH Simulation Results

4.1 Common Simulation Parameters

  • Scenario: GEO orbit, 10 degree elevation
  • Carrier frequency: 2 GHz
  • Channel model: NTN-TDL-C with 5 ns delay spread
  • Physical channel: NPUSCH format 1
  • SCS: 3.75 kHz and 15 kHz
  • Number of tones: Single tone
  • Waveform: SC-FDMA
  • MIMO: SISO (1T1R)
  • DMRS: OS#4 per slot for 3.75 kHz, OS#3 per slot for 15 kHz
  • UE velocity: 3 km/h
  • Symbols per slot: 7
  • Slots per RU: 16
  • Modulation: π/4-QPSK
  • Receiver: MMSE with real channel estimation
  • Target BLER: 1%, 2%, 6%, 10%

4.2 Results Part 1 - Various TBS and Bundling Times

80 ms Bundling Time

144 bits (Cases 1-4):

  • Case 1: 15 kHz, MCS 2, 4 RUs, 2 reps → SNR: -4.97 dB (10% BLER) to -4.35 dB (1% BLER)
  • Case 2: 15 kHz, MCS 2, 2 RUs, 1 rep → SNR: 1.8 dB to 2.7 dB
  • Case 3: 3.75 kHz, MCS 10, 1 RU, 2 reps → SNR: 1.5 dB to 2.30 dB
  • Case 4: 15 kHz, MCS 10, 1 RU, 4 reps → SNR: -4.64 dB to -3.90 dB

256 bits (Cases 5-8): SNR ranges from -2.84 dB to 5.9 dB depending on configuration

328 bits (Cases 9-11): SNR ranges from -1.53 dB to 9.45 dB depending on configuration

424 bits (Case 12): SNR: 1.44 dB (10% BLER) to 2.05 dB (1% BLER)

160 ms Bundling Time

208 bits (Cases 13-15): SNR ranges from -5.56 dB to 1.53 dB

424 bits (Case 16): SNR: -1.95 dB to -1.52 dB

600 bits (Case 17): SNR: -1.38 dB to -0.97 dB

808 bits (Cases 18-19): SNR ranges from -1.42 dB to 0.21 dB

320 ms Bundling Time

328 bits (Cases 20-25): SNR ranges from -6.80 dB to -0.22 dB

776 bits (Cases 26-27): SNR ranges from -2.48 dB to 6.46 dB

1000 bits (Cases 28-30): SNR ranges from -1.95 dB to 7.47 dB

1544 bits (Case 31): SNR: 0.48 dB to 0.76 dB

4.3 Results Part 2 - Additional Cases

Covers Cases 32-46 with various TBS values (440, 584, 680, 936, 1096, 1384, 1736 bits) for 80 ms and 160 ms bundling times. SNR requirements range from -3.6 dB to 8.0 dB depending on configuration.
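
As a rough aid for reading the NPUSCH cases, the sketch below estimates the airtime of a transmission and the payload rate implied by delivering one transport block per bundling period. It relies on two assumptions that are not stated in the summary: the NB-IoT single-tone slot durations of 2 ms at 3.75 kHz and 0.5 ms at 15 kHz (per TS 36.211), and the reading that exactly one transport block is sent per bundling period. Scheduling gaps and control overhead are ignored.

```python
# Assumed single-tone NPUSCH slot durations in ms, keyed by subcarrier spacing in kHz
SLOT_MS = {3.75: 2.0, 15.0: 0.5}
SLOTS_PER_RU = 16  # from the common simulation parameters in clause 4.1


def airtime_ms(scs_khz, num_ru, repetitions):
    """Time on air for num_ru resource units repeated `repetitions` times."""
    return SLOT_MS[scs_khz] * SLOTS_PER_RU * num_ru * repetitions


def payload_rate_bps(tbs_bits, bundling_ms):
    """Payload rate if one transport block is delivered per bundling period."""
    return tbs_bits * 1000.0 / bundling_ms


# Cases 1-4 from clause 4.2: (SCS in kHz, number of RUs, repetitions)
CASES_144_BITS = {1: (15.0, 4, 2), 2: (15.0, 2, 1), 3: (3.75, 1, 2), 4: (15.0, 1, 4)}

if __name__ == "__main__":
    for case, (scs, n_ru, reps) in CASES_144_BITS.items():
        print(f"Case {case}: {airtime_ms(scs, n_ru, reps):.0f} ms on air")
    for tbs, bundling in [(144, 80), (424, 160), (1544, 320)]:
        print(f"{tbs} bits / {bundling} ms: {payload_rate_bps(tbs, bundling):.0f} bit/s")
```

Under these assumptions each of Cases 1 to 4 fits within the 80 ms bundling period.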

5. NPDSCH Simulation Results

5.1 Common Simulation Parameters

  • Scenario: GEO orbit, 10 degree elevation
  • Carrier frequency: 2 GHz
  • Channel model: NTN-TDL-C with 5 Hz Doppler spread
  • Physical channel: NPDSCH
  • SCS: 15 kHz
  • Number of tones: 12
  • Waveform: OFDM
  • MIMO: SISO (1T1R or 1T2R)
  • Symbols per subframe: 14
  • Modulation: QPSK
  • Receiver: MMSE with NRS-bundling and real channel estimation
  • Target BLER: 1%, 2%, 6%, 10%

5.2 Results Part 1 - Various TBS and Bundling Times

80 ms Bundling Time

144 bits (Case 0a):

  • 1R: SNR -6.6 dB to -5.3 dB
  • 2R: SNR -9.3 dB to -8.1 dB

256 bits (Case 0b):

  • 1R: SNR -4.3 dB to -3.1 dB
  • 2R: SNR -7.1 dB to -6.1 dB

328 bits (Cases 1-2): SNR ranges from -11.8 dB to -4.0 dB

424 bits (Cases 3, 3b): SNR ranges from -11.6 dB to -5.0 dB

160 ms Bundling Time

208 bits (Case 4): SNR: -15.3 dB to -11.8 dB

424 bits (Cases 5, 5b): SNR ranges from -14.6 dB to -8.0 dB

600 bits (Case 6): SNR: -11.1 dB to -7.2 dB

808 bits (Cases 7, 7b, 8, 8b): SNR ranges from -11.0 dB to -1.1 dB

320 ms Bundling Time

328 bits (Cases 9-11b): SNR ranges from -17.7 dB to -8.1 dB

776 bits (Cases 12, 12b): SNR ranges from -14.8 dB to -8.1 dB

1000 bits (Case 13): SNR: -10.3 dB to -6.4 dB

1544 bits (Cases 14, 14b): SNR ranges from -11.7 dB to -5.0 dB

5.3 Results Part 2 - Additional Cases

Covers Cases 15-46 with various TBS values (440, 584, 680, 936, 1096, 1384, 1736 bits) for 80 ms and 160 ms bundling times.

1T1R Results: SNR requirements range from -10.9 dB to 1.1 dB

1T2R Results:

  • SNR requirements range from -13.6 dB to -1.9 dB
  • Approximately 3 dB gain compared to 1T1R configurations

6. Conclusions

The document recommends considering these simulation results for determining design constraints for ULBC. The results demonstrate:

  • Performance across various TBS values (144 to 1736 bits)
  • Multiple bundling times (80, 160, 320 ms)
  • Different SCS configurations (3.75 kHz, 15 kHz)
  • Impact of repetitions on SNR requirements
  • Benefits of 2 RX antennas (approximately 3 dB gain)

Huawei Technologies Co., Ltd.
[FS_ULBC] On eCall scenario for ULBC

Summary of S4-260137: On eCall Scenario for ULBC

1. Background

This contribution addresses the eCall (emergency call) scenario for Ultra-Low Bitrate Codec (ULBC) work. Previous contributions (S4-251908, S4-251848, S4-251881) emphasized the importance of preserving background signals in emergency communications. China has developed a related national standard, "On-Board Emergency Call System for Road Vehicles", which is expected to take effect on July 1, 2027.

The document highlights that eCall scenarios have special requirements and different conditions compared to regular call scenarios, necessitating different design constraints and test methodologies.

2. eCall Scenario Description

2.1 System Overview

The eCall system is an in-vehicle safety technology that:

  • Automatically dials emergency numbers (e.g., 112 in EU) upon severe collision detection
  • Sends minimum data set (MSD) including GPS location, VIN, collision direction and time
  • Can be triggered by built-in sensors or manual SOS button
  • Functions via GEO satellite even without terrestrial network coverage

2.2 Communication Architecture

The bi-directional voice data flow involves:

  • Vehicle side: Integrated microphones and speakers communicating over GEO satellite network
  • Emergency response center: Connected via terrestrial mobile network (VoLTE, VoNR), fixed-line, or other IMS-supported platform
  • Key requirement: Background noise captured within vehicle must be delivered with fidelity to emergency response center
  • Asymmetric requirement: Noise preservation may not be required in the opposite direction (emergency center to vehicle)
  • Dedicated system: No mobile phones involved in the communication link

3. Key Observations

Observation 1: eCall is a dedicated system between vehicles and emergency response centers. Speech codec designed for eCall is not necessarily the same as that for regular call scenarios, allowing for separate design constraints or performance requirements for ULBC-eCall.

Observation 2: Compared with the devices used in regular call scenarios, the vehicle and the emergency response center have significantly different hardware capabilities:

  • Less sensitive to power consumption
  • Higher computational capability
  • Higher storage capability

This allows for relaxed design constraints and more demanding performance requirements for ULBC-eCall.

4. Proposed Changes to TR 26.940

4.1 New Clause 4.5: eCall Communication

The contribution proposes adding a new scenario (Scenario 4) to TR 26.940 documenting:

High-level Prerequisites for ULBC in eCall:

  • Very low bitrate support
  • Background noise preserved with no DTX during the call (at least for vehicle-to-emergency center direction)
  • Error concealment
  • Real-time implementation capability (encoding and decoding)
  • Good audio quality for reasonable QoE
  • Relaxed hardware constraints compared to mobile phones

4.2 Modified Clause 6.2: Design Constraint Parameters

The contribution proposes creating separate design constraint columns in Table 6.2-1:

  • Design Constraint (regular call): Existing constraints
  • Design Constraint (eCall): New column with eCall-specific constraints

Key Differences for eCall Design Constraints:

| Parameter | Regular Call | eCall |
|-----------|-------------|-------|
| Noise Suppression | Not required; noise suppression may be applied | Background noise preserved during call (at least vehicle-to-center direction); opposite direction may not require preservation |
| DTX Support | Support | No DTX support during call (at least vehicle-to-center direction) |
| Complexity/Memory | Standard mobile constraints | Relaxed constraints possible |

5. Technical Contributions

The main technical contributions of this document are:

  1. Introduction of eCall as a distinct ULBC scenario with specific requirements different from regular call scenarios
  2. Identification of asymmetric requirements for noise preservation (required vehicle-to-center, optional center-to-vehicle)
  3. Proposal for relaxed design constraints based on different hardware capabilities of eCall endpoints
  4. Explicit requirement for background noise preservation and no DTX in critical direction
  5. Framework for separate performance requirements for eCall vs. regular call scenarios in TR 26.940

The document establishes that eCall scenarios justify different codec design approaches due to their dedicated nature, different hardware capabilities, and specific regulatory/safety requirements.

Huawei Technologies Co., Ltd.
[FS_ULBC] On target platforms for ULBC

Summary of S4-260141: Target Platforms for ULBC

1. Introduction and Motivation

This contribution addresses a gap in TR 26.940 regarding target platforms for Ultra Low Bit rate Codec (ULBC) deployment. While the TR currently discusses NPU as a possible platform in clause 6.2.1.5.1, it lacks coverage of other non-NPU platforms. The document aims to complete this missing information, particularly focusing on DSP-enabled devices.

2. Technical Problem Statement

The contribution identifies an inconsistency in TR 26.940:

  • Clause 6.2.1.1 states that the codec must support real-time processing alongside other audio processing units and should fit real-time resource constraints of CPUs, potential accelerators, and DSPs across a range of devices
  • Clause 6.2.1.5.1 currently only describes NPU as a target platform, omitting DSP and other non-NPU platforms

The source references previous contributions (S4aA250267 and S4-251747) that discussed the need for DSP deployment and provided clarification on DSP-enabled UE devices.

3. DSP-Enabled Device Definition

The contribution adopts the definition from S4-251747 for DSP-enabled UE devices:

  • Devices with DSP only or devices with multiple computing units including DSP
  • For multi-unit devices (with CPU/NPU/DSP), there remains a preference for DSP deployment due to:
      • Lower power consumption
      • Reduced heat generation
      • Better battery life
  • Target devices include vehicle-mounted devices, glasses, and mobile phones with low computational capability
  • DSP refers to audio processing DSPs available in mobile phones or other devices for voice communication

4. Proposed Text Changes

Main Technical Contribution

The proposal adds a new paragraph to clause 6.2.1.5.1 that:

  1. Acknowledges vendor flexibility: Vendors may choose any computing unit to implement ULBC based on business needs or product constraints

  2. Highlights DSP advantages:
      • Cheaper in terms of silicon real estate
      • Less power hungry
      • Less heat generation
      • Typically single-threaded for synchronized real-time execution with low overhead
      • Potentially wider range of product support

  3. Establishes DSP deployment requirement: ULBC should be deployable on DSP-enabled UE devices, including:
      • Devices with DSP only
      • Devices with multiple computing units including DSP

  4. Provides deployment rationale: Even when CPU or NPU are available, DSP may be preferred for power-sensitive applications (wearables, mobile phones)

  5. Defines DSP computational power: Audio processing DSPs typically range from several hundred to over a thousand MIPS

Context Preservation

The proposal maintains the existing text about:

  • NPU prevalence in modern smartphones
  • NPU being 5-20x more power efficient than CPUs for AI tasks
  • The note that ULBC may need to run on non-NPU platforms in certain configurations

5. Technical Impact

This contribution ensures that TR 26.940 provides comprehensive guidance on target platforms for ULBC deployment, balancing the AI-optimized NPU approach with the power-efficient DSP approach, thereby supporting a wider range of device implementations and use cases.

Huawei Technologies Co., Ltd.
[FS_ULBC] On complexity and memory constraints for ULBC

Summary of S4-260142: On Complexity and Memory Constraints for ULBC

Introduction

This contribution addresses complexity and memory constraints for Ultra Low Bitrate Codec (ULBC) as part of the study in TR 26.940. The document aims to clarify previous discussions on measurement metrics and specific constraints, proposing concrete values for complexity, RAM, and ROM requirements.

Main Technical Contributions

Complexity Measurement Metrics

The contribution proposes using both MACS (Million Multiply-Accumulate Operations per Second) and Codec/Model Size together to characterize ULBC complexity, rather than relying on a single metric:

  • Codec/Model Size: Directly impacts memory requirements and power consumption (a larger memory footprint requires more frequent DRAM access, leading to higher power consumption)
  • MACS: More suitable for guiding computing hardware unit selection
  • These metrics do not necessarily correlate, as different model architectures can result in very different MACS for the same model size

Memory Constraints Clarification

The document clarifies confusion from previous contributions (S4aA250253 and S4-251807) regarding the 5-10M parameters proposal:

ROM Constraints

  • ROM characterized by overall Model Sizes across all operation modes
  • Major impact is FLASH consumption in product design
  • Minimal power consumption impact (only one model's parameters accessed at a time)
  • Proposed constraint: < 15M parameters (relaxed from previously discussed 10M to support more operation modes)
  • Enables support for ~5 operation modes (e.g., 2-3 bitrates for 2 different sampling rates)

RAM Constraints

  • RAM characterized by maximum single Model Size (assuming no switching between operation modes)
  • Proposed constraint: < 3M parameters
  • With 15M ROM, this allows 5 operation modes
  • Whether switching between operation modes will be supported is FFS
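
The parameter-count limits can be translated into storage sizes once a numeric format is chosen. The summary does not state a precision, so the bytes-per-parameter values below are assumptions for illustration, as are the helper names; activations and codec state, which also occupy RAM in practice, are not modelled.

```python
# Assumed storage formats; the contribution expresses limits in parameters only
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}

ROM_LIMIT_PARAMS = 15_000_000  # proposed ROM constraint (all operation modes)
RAM_LIMIT_PARAMS = 3_000_000   # proposed RAM constraint (largest single model)


def footprint_mib(num_params, dtype):
    """Storage needed for num_params weights in the given format, in MiB."""
    return num_params * BYTES_PER_PARAM[dtype] / (1024 * 1024)


def max_operation_modes(rom_params, per_mode_params):
    """Number of full-size models that fit in the ROM budget."""
    return rom_params // per_mode_params


if __name__ == "__main__":
    print("modes:", max_operation_modes(ROM_LIMIT_PARAMS, RAM_LIMIT_PARAMS))
    for dtype in BYTES_PER_PARAM:
        rom = footprint_mib(ROM_LIMIT_PARAMS, dtype)
        ram = footprint_mib(RAM_LIMIT_PARAMS, dtype)
        print(f"{dtype}: ROM {rom:.1f} MiB, RAM {ram:.1f} MiB")
```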

Complexity Constraints

MACS Reference Point

The contribution references the 2025 Low-Resource Audio Codec (LRAC) Challenge sponsored by Cisco Systems as a relevant benchmark:

LRAC Challenge Requirements:

  • Sampling rate: 24 kHz
  • Mono audio input
  • Bitrate: up to 1 kbps (ultralow) and up to 6 kbps (low)
  • Latency: 30 ms (Track 1) or 50 ms (Track 2)
  • Compute complexity: ≤ 350 MMACS total; ≤ 150 MMACS receive-side
  • Winner (ByteDance) used ~4M parameters

Proposed MACS Value

  • While LRAC suggested 350 MMACS, the contribution proposes < 600 MMACS for ULBC
  • Rationale: Slightly increased complexity enables better speech quality while remaining within target hardware (e.g., DSP) computational capacity
  • Validation: Handcrafted 3M parameter codec (reduced from SoundStream architecture) achieved 600 MMACS
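
To put the proposed figure next to the LRAC limit, the sketch below computes the relative increase and the multiply-accumulate budget available per frame. The 20 ms frame length is an assumption for illustration only, since the frame length for ULBC is not fixed in this summary; the function names are hypothetical.

```python
LRAC_TOTAL_MMACS = 350.0     # LRAC Challenge total compute limit
ULBC_PROPOSED_MMACS = 600.0  # limit proposed in this contribution


def relative_increase_percent(new, reference):
    """How much larger `new` is than `reference`, in percent."""
    return 100.0 * (new - reference) / reference


def macs_per_frame(mmacs, frame_ms):
    """Multiply-accumulate operations available within one frame."""
    return mmacs * 1e6 * frame_ms / 1000.0


if __name__ == "__main__":
    increase = relative_increase_percent(ULBC_PROPOSED_MMACS, LRAC_TOTAL_MMACS)
    print(f"proposed limit vs LRAC: +{increase:.1f}%")
    frame_ms = 20.0  # assumed frame length
    budget = macs_per_frame(ULBC_PROPOSED_MMACS, frame_ms)
    print(f"budget per {frame_ms:.0f} ms frame: {budget:.0f} MACs")
```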

Proposed Design Constraints Summary

The contribution proposes the following specific constraints for ULBC:

  1. Complexity:
      • Single Model Size < 3M parameters
      • < 600 MMACS

  2. RAM:
      • < 3M parameters (assuming no switching between operation modes)
      • Whether switching will be supported is FFS

  3. ROM:
      • < 15M parameters

Text Proposal

The contribution includes a change request to TR 26.940, Section 6.2 (Design Constraint Parameter), Table 6.2-1, adding the specific complexity and memory constraints detailed above to the "Complexity and memory demands" parameter row.

China Mobile Com. Corporation
[FS_ULBC] TR 26.940 V 0.5.1

3GPP TR 26.940 - Study on Ultra Low Bit rate Speech Codecs (Release 20)

Document Overview

This Technical Report documents the study on Ultra Low Bit rate Speech Codecs (ULBC) for 3GPP Release 20. The primary focus is on IMS voice services over Geostationary Orbit (GEO) satellite access, with additional consideration for multi-party voice communication and other access types.

1. Application Scenarios for Ultra-Low Bit Rate Communication Services

1.1 Scenario 1: IMS Voice Call over GEO (Primary Scenario)

Background:

  • GEO satellites operate at 35,786 km altitude, resulting in ~285ms one-way propagation delay
  • TR 22.887 and TS 22.261 assume total transmission data rates of [1-3] kbit/s
  • Current 3GPP codecs (lowest: AMR at 4.75 kbit/s) cannot support these constraints

Scenario Descriptions:

Main Scenario (4.2.2.2): One UE connects via GEO satellite access

  • UE1: Phone supporting IMS voice over GEO satellite
  • UE2: Either "regular" phone (requiring transcoding in core network) or "upgraded" phone supporting ULBC over other access (enabling transcoder-free operation)

Sub-Scenario (4.2.2.3): Both UEs connect via GEO satellite access

  • Less common but relevant for disaster/cyberattack scenarios
  • Even with transparent payload, voice packets transmit to ground before reaching other UE
  • May enable transcoder-free operation

High-level Prerequisites:

  • Very low bitrate support
  • DTX support [TBC]
  • Error concealment
  • Real-time implementation on smartphones
  • Good audio quality for reasonable QoE

1.2 Scenario 2: Multi-Party Voice Communication

Background:

  • Addresses poor/unstable network conditions in WLAN access
  • Network congestion during peak usage or in areas with limited infrastructure
  • Codec selection critical for maintaining quality under bandwidth constraints

Scenario Description:

  • One participant (UE1) on unstable network using ULBC, other (UE2) on stable network with conventional codec (requires transcoding)
  • Both participants on unstable networks using ULBC simultaneously (no transcoding needed)

High-level Prerequisites:

  • Ultra-low bitrate capability
  • Real-time operation on consumer devices (smartphones, laptops)
  • Audio quality matching or exceeding existing voice services

1.3 Scenario 3: IMS Voice Call with ULBC over Other Access Types

Motivation:

  • ULBC may provide enhanced robustness against poor network conditions
  • Lower bit rates may benefit coverage/capacity
  • Reduces transcoding needs when calls bridge GEO and other access types

Scenario Description: Both UEs support ULBC but connect via 3GPP access other than GEO (LTE, NR, WLAN)

2. Channel Characteristics and Service-Related Dependencies

2.1 Mouth-to-Ear Delay Estimation for GEO Scenarios

Delay Components:

UE Delay (Table 5.1.2-2):

  • Depends on voice bundling period (80ms, 160ms, 320ms) and codec frame size (20-320ms)
  • Performance objective range: 268-1435ms (excluding solution-specific delay)
  • Maximum requirement range: 355-1435ms (excluding solution-specific delay)
  • Components: 2x voice bundling period + 2x vendor-specific encoder/decoder processing + vendor delay budget + JBM

Core Network Delay:

  • Ground station to core network: [5-20]ms minimum, [200]ms maximum
  • eNodeB to core network: 5-20ms
  • Transcoding: 7ms (AMR/AMR-WB) to 14ms (EVS)

GEO Transmission Delay:

  • Minimum: 248ms
  • Maximum: 280ms (per TS 22.261 KPI requirement)
  • Variation of 32ms depending on UE location within beam
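
The dependence of GEO propagation delay on UE location can be illustrated with basic geometry. The sketch below computes the slant range and the propagation delay of a single leg (UE to satellite, or satellite to gateway) as a function of elevation angle. It assumes a spherical Earth with a mean radius of 6371 km and free-space propagation; it is not the method used in the TR and is not intended to reproduce the 248 ms and 280 ms figures, which cover the full transmission path.

```python
import math

SPEED_OF_LIGHT_KM_S = 299_792.458
EARTH_RADIUS_KM = 6371.0     # assumed mean radius, spherical Earth
GEO_ALTITUDE_KM = 35_786.0   # altitude quoted for GEO satellites


def slant_range_km(elevation_deg, altitude_km=GEO_ALTITUDE_KM):
    """Distance from a ground terminal to the satellite at the given elevation."""
    s = EARTH_RADIUS_KM * math.sin(math.radians(elevation_deg))
    return math.sqrt(s * s + altitude_km ** 2 + 2.0 * altitude_km * EARTH_RADIUS_KM) - s


def one_leg_delay_ms(elevation_deg):
    """Free-space propagation delay of one ground-to-satellite leg."""
    return slant_range_km(elevation_deg) / SPEED_OF_LIGHT_KM_S * 1000.0


if __name__ == "__main__":
    for elevation in (10.0, 12.5, 45.0, 90.0):
        print(f"{elevation:5.1f} deg: {slant_range_km(elevation):8.0f} km, "
              f"{one_leg_delay_ms(elevation):6.1f} ms per leg")
```

With a transparent payload, a voice packet traverses two such legs (service link and feeder link), so the per-leg values roughly double for the UE-to-gateway path.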

Mouth-to-Ear Delay Estimates (Table 5.1.3-1):

For Main Scenario (GEO-TN): - 80ms bundling, 20ms frame: 548ms (lower) to 872ms (upper) + solution-specific delay X - 320ms bundling, 320ms frame: 1952ms (lower) to 2395ms (upper) + solution-specific delay X

For Sub-Scenario (GEO-GEO): - 80ms bundling, 20ms frame: 804ms (lower) to 1315ms (upper) + solution-specific delay X - 320ms bundling, 320ms frame: 1952ms (lower) to 2395ms (upper) + solution-specific delay X

2.2 NB-IoT NTN System Parameters

System Architecture: - Service link: UE to NTN payload - Feeder link: NTN payload to NTN Gateway

RAN Parameters: - Channel coding: Turbo code (NPUSCH Format 1 uplink), TBCC (NPDSCH downlink) - MCS: π/2 BPSK, π/4 QPSK, QPSK, 16QAM - Subcarrier spacings: 3.75kHz and 15kHz for NPUSCH Format 1 - Resource unit (RU) duration varies with subcarrier spacing and number of tones

QoS Characteristics: - Managed through QCI (QoS Class Identifier) - Same PELR (Packet Error Loss Rate) required for UL and DL - Suggests balanced UL/DL time-domain transmission resources

3. Design Constraints

3.1 Design Constraint Parameters (Table 6.2-1)

Key parameters identified: - Bit rates: [TBD] - Sample rate and audio bandwidth: [TBD] - Frame length: [TBD] - Complexity and memory demands: [TBD] - Algorithmic delay: Frame size buffering + inherent codec delays (look-ahead, sample-rate conversion, post-processing) - Packet loss concealment (PLC): [TBD] - Noise suppression: [TBD] - Discontinuous transmission (DTX): Including VAD and comfort noise [TBD] - Robustness to non-speech input: [TBD]

3.2 Complexity and Memory Considerations

Current Evaluation Analysis: - Codec must support real-time thread and concurrent processing - ML codecs with [5-10M] parameters considered for efficient operation within latency bounds - Must operate within compute constraints of devices for real-time voice communication

Memory and Power Considerations: - Larger models require more DRAM access → higher power consumption - Memory footprint critical for device performance and usability

Complexity Metrics for AI-Based Codecs:

TOPS (Tera Operations Per Second): - TOPS = 2 × MAC unit count × Frequency / 1 trillion - Smartphone NPUs: 8-59 TOPS reported (varying precision: INT8, INT16, FP16) - TOPS/W (power efficiency): 2-15 TOPS/W for smartphone NPUs - Note: TOPS/W typically benchmarked under full-load; lighter workloads like audio codecs may show different characteristics

Alternative Metrics: - MACs (Multiply-Accumulate operations): Practical for complexity assessment - RTF (Real-Time Factor): Ratio of frame length to encoding/decoding time; reliable but resource-intensive to measure - Model Size: Number of parameters and precision; directly impacts memory and power - Tools available: ptflops, torchinfo, fvcore for MAC counting

Observations: - NPUs/TPUs significantly more power-efficient than CPUs for AI tasks (5-20x) - Actual NPU performance depends on computational graph structure - Irregular/sequential/unsupported operations may require CPU fallback - ULBC complexity constraints should be based on desired power consumption/computational performance, not relative to existing 3GPP codecs - Million MACs + model size provide first indication of complexity - RTF useful but requires standardized test benches - WMOPS not directly suitable for NPU-capable devices but mapping to TOPS/RTF beneficial

Complexity Target Estimation: - Target devices: Modern smartphones with NPU components - Example: DAC codec estimated at ~150 Giga MAC/sec (~0.3 TOPS) - Actual power consumption on smartphone NPUs: TBD - Model size and architecture significantly impact DRAM operations and overall power consumption
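The TOPS relation quoted above and the DAC estimate can be checked with a few lines of arithmetic. This is a minimal sketch, assuming each MAC counts as two operations (as the factor 2 in the formula implies). The NPU configuration (4096 MAC units at 1 GHz) is a hypothetical example and does not describe any device mentioned in the document.

```python
def peak_tops(mac_units: int, frequency_hz: float) -> float:
    """Peak TOPS = 2 x MAC unit count x frequency / 1e12."""
    return 2 * mac_units * frequency_hz / 1e12


def macs_per_second_to_tops(macs_per_second: float) -> float:
    """Convert a sustained MAC rate to TOPS, counting one MAC as two operations."""
    return 2 * macs_per_second / 1e12


# DAC estimate from the text: ~150 Giga MAC/s
dac_tops = macs_per_second_to_tops(150e9)
print(f"DAC workload: {dac_tops:.2f} TOPS")

# Hypothetical NPU (assumption for illustration only)
npu_tops = peak_tops(mac_units=4096, frequency_hz=1.0e9)
print(f"Hypothetical NPU peak: {npu_tops:.3f} TOPS")
print(f"Share of peak used by DAC: {100 * dac_tops / npu_tops:.1f}%")
```

A low share of peak TOPS does not by itself imply real-time operation, since the observations above note that actual NPU performance depends on the computational graph and on possible CPU fallback.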

3.3 Design Constraint Verification

Editor's note: Algorithmic delay verification method for AI-based codecs required.

3.4 Additional Design Considerations

Codec Parameters and Configuration: - Static parameters: Rarely changed, exchanged via SDP or predefined - Dynamic parameters: May change frequently, included in each packet/frame - Common static/dynamic parameters to be identified

4. Existing Technologies and Feasibility Evidence

4.1 Overview of Existing Codec Technologies (Table 7.1.1-1)

Categories:

  1. 3GPP IMS codecs: Reference conditions (AMR, AMR-WB, EVS)
  2. Conventional Ultra Low Bitrate Codecs: DSP-based (MELP/MELPe, AMBE-LR, MPEG-HVXC, TWELP MR, Codec2)
  3. AI-based postprocessor: Enhancement of conventional codec output
  4. AI-based encoder/decoder:
     • Causal systems: Real-time capable (LPCNet, LyraV2, EnCodec, Mimi-Codec, TS3, TAAE, LMCodec2)
     • Non-causal systems: Non-real-time due to large look-ahead (DAC, DAC-IBM, SNAC, SpeechTokenizer, SemantiCodec, FunCodec, WavTokenizer, BigCodec, FocalCodec)

Key Codec Properties:

3GPP IMS Codecs: - AMR: NB, 5ms delay, 20ms frame, 4.75 kbps - AMR-WB: WB, 5.9375ms delay, 20ms frame, 6.6 kbps - EVS: NB/WB/SWB, 12ms delay, 20ms frame, 7.2-9.6 kbps

Conventional Ultra Low Bitrate: - MELP/MELPe: NB, 20-36ms delay, 22.5-90ms frame, 0.6-2.4 kbps - Codec2: NB, 40ms delay, 20-40ms frame, 0.45-2.4 kbps

AI-based (Causal): - LPCNet: WB, 25ms delay, 40ms frame, 1.6 kbps - LyraV2: WB, [TBD] delay, 20ms frame, 3.2/6/9.2 kbps - Mimi-Codec: 24kHz, 0ms delay, 80ms frame, 0.55/1.1 kbps

AI-based (Non-causal): - DAC: WB/24kHz, 244-366ms delay, 13.3-20ms frame, 0.5-3+ kbps - DAC-IBM: 24kHz, 366ms delay, 13.3ms frame, 0.75/1.5/3 kbps - SNAC: 24kHz, 1000ms delay, 80ms frame, 0.98 kbps

4.2 Observations on Codec Parameters

Audio Bandwidth: - Conventional codecs: NB only - Modern AI codecs: WB or higher

Algorithmic Codec Delay: - IMS codecs: 25-32ms - Conventional ultra-low: 60-126ms - Causal AI: 20-80ms - Non-causal AI: 500ms+ or full signal

Frame Duration: - Conventional ultra-low: Increased vs. standard 20ms VoIP - Some AI codecs maintain 20ms, others increase (e.g., Mimi 80ms)

Bitrate: - All listed codecs (except IMS and LyraV2) offer ≥1 mode <3 kbps

Complexity: - AI codecs generally higher than IMS/conventional codecs - Exception: LyraV2 requires only 35% of ARM A53 core (RaspberryPi 3+) - RAM: AI codecs significantly higher (e.g., LyraV2: 54MB vs. EVS: 294KB) - ROM: AI codecs much higher (e.g., TAAE: 950M parameters ≈ 900MB @ 8-bit; SNAC: 19M ≈ 18MB @ 8-bit; EVS: ~2MB)
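The ROM figures above follow from parameter count times storage precision. The sketch below reproduces them, assuming that "MB" in the text means 2^20 bytes (an assumption; the source does not say) and that weights are the only contributor to ROM.

```python
def model_size_bytes(num_params: float, bits_per_param: int) -> float:
    """Storage needed for the weights alone, in bytes."""
    return num_params * bits_per_param / 8


def to_mib(num_bytes: float) -> float:
    """Convert bytes to MiB (2**20 bytes); assumed meaning of 'MB' in the text."""
    return num_bytes / 2**20


# Parameter counts and precision as quoted in the text
for name, params, bits in [("TAAE", 950e6, 8), ("SNAC", 19e6, 8)]:
    size = to_mib(model_size_bytes(params, bits))
    print(f"{name}: {params / 1e6:.0f}M parameters @ {bits}-bit = {size:.0f} MiB")
```

Under these assumptions TAAE comes out near 906 MiB and SNAC near 18 MiB, in line with the rounded values quoted above.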

4.3 Performance Evaluation

P.808 ACR Test Results (Figure 7.1.4-1):

Test setup: - English clean speech (4 talkers × 6 samples) - 32kHz, SWB, normalized to -26 dBoV - 24 subjects

Key Findings: - Codec2 (all rates) significantly worse than AMR 4.75 kbps - SemantiCodec, LyraV2, LPCNet, Mimi 0.55 kbps: comparable to AMR-WB 6.6 kbps - Three conditions on par or slightly better than EVS 9.6 kbps: - Mimi-Codec 1.1 kbps (causal) - DAC-IBM 1.5 kbps (non-causal) - SNAC 0.98 kbps (non-causal) - AI-based solutions show 2+ MOS improvement over conventional ultra-low bitrate codecs

4.4 Packet Loss Concealment (PLC) Experiments

4.4.1 PLC Experiment with DAC

Test Configuration (Table 7.1.5.1-1): - Bitrates: 1, 2.5, 4.5, 6 kbps - Loss percentages: 1%, 6%, 10%, 20% - Frame size: 80ms - Based on NB-IoT NTN data at ~3dB CNR (SCS=15kHz) and 9dB (SCS=3.75kHz)

Loss Simulation Methods: 1. Consecutive 4 blocks drop and repeat: Simulates 80ms packet loss 2. Interleaved drop and repeat: Spreads loss over 2 packets (adds latency)

MUSHRA Test Results (8 listeners): - Despite higher loss percentage, 4.5 kbps and 6 kbps significantly better than 1 kbps and 2.5 kbps - 6 kbps @ 20% loss rated close to 4.5 kbps @ 10% loss - Interleaving benefit increases with error rate - Potential for improvement if model trained with random loss patterns

4.4.2 PLC Experiment with DAC and DAC-IBM

Comparison: - DAC (default): 16kHz, general audio training, scalable bitrate - DAC-IBM: 24kHz, speech-specific training, fixed 1.5 kbps

MUSHRA Test Results (8 listeners, resampled to 16kHz): - DAC-IBM 1.5 kbps @ 3% PLR significantly outperforms all other DAC conditions - DAC 4.5 kbps @ 10% PLR and 6 kbps @ 20% PLR show no significant improvement over DAC-IBM 1.5 kbps @ 3% PLR - Specific training for target bitrate crucial for optimal performance - Error resilience improvable through appropriate training/design choices

Conclusions: - More design freedom needed in bitrate and BLER selection for optimal quality at given SNR - Optimal coding performance (even under errors) achieved with appropriate training strategy - Bitrate scalability (e.g., DAC) comes with significant performance cost, especially at lower bitrates - Dedicated training (e.g., DAC-IBM) much more efficient

4.5 Very Low Bitrate Listening Test Results

Test Setup (Nokia): - Clean Finnish speech (3 males, 3 females, 4 sample pairs each) - Diotic presentation via Sennheiser HD650 headphones - Experienced listeners - Extended ACR5 scale (0.5-5.5) and DCR methodologies - Bandwidths tested: NB (4kHz), MB (6kHz), WB (8kHz), 10kHz, SSWB (12kHz), SWB (16kHz), FB (20kHz)

Codecs Tested: - DSP: Codec2 (0.7, 1.3, 2.4, 3.2 kbps), MELP (2.4 kbps), MPEG4 HVXC (2.0, 4.0 kbps) - 3GPP: AMR, AMR-WB, EVS at various rates - ML: DAC 44k (0.9, 1.7, 2.6, 3.4, 6.9 kbps), TSAC 44k (0.6, 1.2, 2.5, 3.2, 5.9 kbps)

Extended ACR5 Results (Figures 7.2.3-1, 7.2.3-2): - Increased bandwidth improves quality up to ~12kHz (saturation region) - 4kHz bandwidth significantly limits perceived quality - MELP 2.4k and MPEG4 HVXC perform better than Codec2 - 3GPP codecs perform as expected at lowest bitrates - TSAC and DAC show very good performance in clean speech - TSAC ≥1.2 kbps and DAC ≥1.7 kbps suitable as ML-based references - Both poor quality <1 kbps

DCR Results (Figure 7.2.4-1): - Results align with ACR test - Exception: MELP preferred over HVXC 2.0 in DCR (full 4kHz bandwidth vs. ~3.7kHz) - Listeners more likely to notice degradations with reference available

4.6 Test Results on Clean Speech and Music/Mixed Content

4.6.1 DCR Test on Clean Speech (Figure 7.3.2-1)

Test Setup: - French, 30 listeners (5 panels × 6) - 8sec double sentences, 3 male + 3 female - 20-20,000Hz bandpass, -26dB LKFS normalized

Codecs: - Conventional: Opus (12, 16, 24 kbps), EVS-WB (7.2, 8 kbps), EVS-SWB (9.6, 13.2, 24.4 kbps) - AI: LPCNet (1.6), Lyra V2 (3.2, 6, 9.2), EnCodec (1.5, 3, 6, 12, 24), AudioCraft (1.5, 3, 6), AudioDec, DAC (1.7, 2.6, 5.2, 7.8)

Key Findings: - DAC best DMOS among ~1.5 kbps codecs; approaches "Direct" quality <8 kbps - EnCodec doesn't achieve "Direct" quality even @ 24 kbps; below EVS/Opus at this rate - Lyra V2 (6, 9.2 kbps) on par with EVS-WB (7.2, 8 kbps)

4.6.2 ACR Test on Clean Speech (Figure 7.3.3-1)

Same setup as the DCR test, but using the ACR methodology to allow better comparison with objective metrics. The observations match those of the DCR test.

4.6.3 DCR Test on Music and Mixed Content (Figure 7.3.4-1)

Test Setup: - 30 listeners (5 panels × 6) - 6 categories: instrumental/vocal classical, instrumental/vocal modern, captured mixed, artificial mixed (speech + music background) - 20-20,000Hz bandpass, -26dB LKFS

Codecs: - Conventional: xHE-AAC (8, 12, 16, 24), Opus audio (16, 24), Opus voip (12, 16, 24), EVS-SWB (9.6, 13.2, 24.4) - AI: EnCodec (12, 24), DAC (4.3, 6, 7.8), HILCodec (4.5, 6, 9), SNAC (2.6), FlowDec (4.5, 6, 7.5) - Note: Many neural codecs pretested but excluded due to low quality (LPCNet, Lyra V2, AudioDec, FreqCodec, HifiCodec, Spectral Codecs, Vocos, DisCodec, Mimi, AudioCraft)

Key Findings: - Best quality: EVS and xHE-AAC @ ~24 kbps - Neural codec advantage visible at low bitrates - No tested neural codec achieves quality close to "Direct" - FlowDec 7.5 kbps: 4.08 DMOS (best neural codec) - No tested AI codec provided reasonable quality for music/mixed content <2.6 kbps

4.7 Impact of Noise Suppression on AI-Based Codecs

4.7.1 Background on Existing Systems

Classical Speech Coding: - Studies on MELPe and AMR show noise reduction preprocessing improves parameter extraction and decoded speech quality - Especially beneficial in noisy conditions and low SNRs - Improves intelligibility and perceptual quality - Integrated in 3GPP2 EVRC and VMR-WB standards

Neural Speech Coding: - Known to be sensitive to noisy environments - Robustness influenced by training data diversity, low bitrates, capacity/complexity, quantization - Data-driven approaches make failure modes difficult to anticipate - Noise suppression can minimize issues and allow codec to focus on useful signal

4.7.2 Test Design

Two Listening Tests (ITU-T P.808 ACR):

Test 1 - High SNR: - Assumptions from 3GPP EVS characterization - SNRs: +15 to +20 dB (WB) - Noises: car, street, office (from ITU-T P.501 Annex B) - 24 pairs of sentences (8 pairs × 3 noises) - 20 listeners

Test 2 - Low SNR: - More adversarial environments - SNRs: -5 to +15 dB - Noises: street, construction, metro, car, office, restaurant - 24 pairs of sentences (4 pairs × 6 noises) - 21 listeners

Noise Suppression: - DeepFilterNet2: State-of-the-art DNN-based, operates at 48kHz - Applied as preprocessor before coding

Mixing Procedure: - Loudness normalization using BS1770demo (ITU-T STL) - RMS long-term option for background noise level

4.7.3 Conditions Under Test

Classical Codecs: - MELPe, AMR, AMR-WB, EVS

Neural Codecs: - SNAC, MIMI, DAC_IBM (speech-trained, <2 kbps) - LyraV2 3.2 kbps (likely trained on diverse data including noisy speech) - DAC (original, 24kHz, 1.5/3/6 kbps) - Test 1 only

All tested with and without noise suppression ("_nr" suffix).

4.7.4 High SNR Test Results (Figures 7.4.2.4-1, 7.4.2.4-2)

Key Observations: - Listeners prefer uncoded denoised speech over uncoded noisy speech - Denoised speech as good as clean speech at high SNRs (minimal artifacts) - Noise suppression beneficial for all codecs except MELPe (already has noise reduction; benefit minimized at high SNRs) - Classical codecs: Benefit increases with bitrate/quality - Neural codecs: Greater benefit, >0.5 MOS improvement for several (SNAC, DAC_ibm, DAC @ 3 kbps) - DAC_ibm vs. DAC: Same architecture/complexity, very different behavior due to training data/target bitrate - Plain DAC @ 24kHz not competitive at 1.5 kbps - LyraV2: ~70x less complex than other neural codecs; @ 3.2 kbps performs worse except vs. DAC @ 3 kbps (on par)

4.7.5 Low SNR Test Results (Figures 7.4.2.5-1, 7.4.2.5-2)

Key Observations: - Listeners strongly prefer uncoded denoised speech (~1 MOS difference) - All classical codecs benefit from denoising (<1 MOS improvement) - Neural codecs benefit even more (>1 MOS improvement possible) - Neural codecs at vastly lower bitrates can compete with conventional codecs under adverse conditions when combined with noise suppression - Generative AI-based codecs (e.g., DAC-IBM) can improve absolute quality of input signal when coding denoised speech

4.7.6 Conclusions

  • Speech coder performance in noisy conditions significantly enhanced by ensuring high SNR (e.g., via noise suppression)
  • Neural speech coders more sensitive to noisy environments; benefit more from noise suppression than traditional coders
  • Raising the SNR at the codec input (e.g., via noise suppression) enables improved performance at very low bitrates, whether the original input has high or low SNR
  • Noise suppression impact on delay/complexity requires further study
  • Note: Removing all background audio may not always be desirable (e.g., emergency calls where background contains relevant information)

4.8 Analysis of Existing AI Codec: Lyra V2

Key Characteristics: - Publicly reported: "38x faster than real-time" on high-end 2021 smartphone - Entirely CPU execution (no NPU/TPU) - Open-source under Apache 2.0 license (permissive for commercial/standardization)

Code-Level Analysis: - Core components (LyraGanModel, SoundStreamEncoder) explicitly use CPU backend (XNNPACK via TensorFlow Lite) - Flag use_xnn=true directs to CPU execution - No hardware accelerator delegates (NNAPI, Hexagon, CoreML, TPU) - Single-threaded execution (threads explicitly set to 1) - Benchmark: Mean 0.525ms processing time for 20ms frame = ~38x real-time

Conclusion: - Proves state-of-the-art low-bitrate AI speech codec can achieve/exceed real-time requirements on high-end 2021 smartphone CPU - Significant margin towards max RTF - CPU-only approach viable for ULBC
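The "38x real-time" figure follows directly from the benchmark numbers quoted above. The sketch below computes it together with the reciprocal quantity. Note that the document uses both conventions: the definition under Alternative Metrics gives RTF as frame length over processing time, while the DAC measurements in 4.9 treat RTF above 1 as slower than real time, which matches processing time over frame length. The function names are illustrative only.

```python
def speedup_over_real_time(processing_ms: float, frame_ms: float) -> float:
    """Frame duration divided by processing time (values above 1 are faster than real time)."""
    return frame_ms / processing_ms


def rtf_processing_over_frame(processing_ms: float, frame_ms: float) -> float:
    """Processing time divided by frame duration (values below 1 are real-time capable)."""
    return processing_ms / frame_ms


# Lyra V2 benchmark from the text: mean 0.525 ms per 20 ms frame
speedup = speedup_over_real_time(0.525, 20.0)
rtf = rtf_processing_over_frame(0.525, 20.0)
print(f"Speed-up over real time: {speedup:.1f}x")
print(f"RTF (processing/frame):  {rtf:.4f}")
print(f"Real-time capable:       {rtf < 1.0}")
```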

4.9 Complexity Analysis of Existing AI Codec: DAC

Methodology: - ONNX Runtime library for execution - Tested on CPU backend and NNAPI backend (Android NPU interface) - Model: Unmodified pretrained DAC @ 44.1kHz, 8 kbps (from reference) - No quantization applied (original float model) - Metrics: Real-Time Factor (RTF) for end-to-end and individual components

Theoretical Complexity Analysis (Figure 7.6.2-1): - Tools: ptflops v0.7.5, thop v2.0.17 (cross-verification) - Complexity scales with frame size: 1.4 GFLOP (20ms) to 31.6 GFLOP (320ms) - Model: 76.9M parameters, 293MB size - Note: Different library versions produce different results due to ConvTranspose1d calculation methodology changes
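To compare the per-frame figures across frame sizes, it can help to express them as a sustained compute rate. This is a rough sketch, assuming the quoted GFLOP values are the total cost of processing one frame of the stated duration (the summary does not say whether they cover encoder, decoder, or both).

```python
def sustained_gflops(gflop_per_frame: float, frame_ms: float) -> float:
    """Compute rate needed to process one frame per frame period, in GFLOP/s."""
    return gflop_per_frame * 1000.0 / frame_ms


# Per-frame complexity values quoted in the text
for gflop, frame_ms in [(1.4, 20), (31.6, 320)]:
    rate = sustained_gflops(gflop, frame_ms)
    print(f"{gflop} GFLOP per {frame_ms} ms frame = {rate:.2f} GFLOP/s sustained")
```

Under this assumption the two end points correspond to roughly 70 and 99 GFLOP/s, so the per-second load grows far less than the per-frame figure when the frame size increases.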

Real-World Inference Performance:

Test Platforms: 1. High-end desktop: AMD Ryzen 9 7950X (5.7GHz fixed) 2. High-end mobile: Qualcomm Snapdragon 8 Gen 2

Key Findings (Figures 7.6.4-1, 7.6.4-2, 7.6.4-3):

Desktop CPU: - Single-threaded: NOT real-time (RTF 1.6-1.9) - Multi-threaded (4 threads): Real-time capable (RTF 0.67-0.86) - Still very slow for high-end desktop CPU

Mobile SoC: - NO configuration achieves real-time performance - Best-case RTF: 2.125 (>2x slower than real-time) - Worst-case RTF: 5.884 (~6x slower than real-time) - NNAPI backend (NPU): Inconsistent results; sometimes helped slightly, sometimes significantly worse than CPU - Cannot assume NPU automatically improves performance; NPU-specific optimizations may be required

Critical Gap: - Significant gap between theoretical NPU capacity and actual measured performance (RTF) - Model appearing suitable on paper (~2-5 GFLOP/frame) unable to run real-time on top-tier mobile phone - Real-world testing essential

Editor's note: NNAPI may fall back to CPU for float models; impact needs verification.

5. Test Methodologies

5.1 General Considerations

5.1.1 Typical Quality Impairments of Ultra-Low Bit Rate Speech Coding

Categories: - Loss of listening-only audio quality - Audio bandwidth loss - Impaired intelligibility - Impaired speaker identifiability - Prosodic impairments - Hallucination (word and phone confusions) - Sensitivity to non-speech input (background noise, music, noisy speech, interfering talker, reverberant speech)

Additional Considerations: - Speech enhancement algorithms (noise suppression, gain normalization) may be part of ULBC

5.1.2 Challenges of Quality Assessment

Traditional 3GPP Practice: - AMR, AMR-WB, EVS: Listening-only evaluations using P.800 ACR and modified DCR - ACR: Generally for clean speech - DCR: For SWB clean speech, mixed-bandwidth, speech + background noise, music/mixed content - Focus not on intelligibility, speaker identifiability, prosodic impairments

ULBC Challenges: - May need to address additional aspects directly through dedicated tests - Hallucination: Specific to ML-based systems - ACR may not optimally quantify all impairments (hallucination, intelligibility, prosodic)

Alternative Test Methods: - Automatic speech recognition - DCR tests (for prosodic differences) - Diagnostic Rhyme Tests (DRT) - Modified Rhyme Tests (MRT) - MOS testing for speaker similarity - Speaker verification/identification tests - Prosodic naturalness MOS tests - Intonation recognition tests - Transcription tests (word/semantic equivalence) - Phoneme recognition tests

Noise Suppression Evaluation: - P.835: Multi-dimensional rating (speech quality and noise suppression capability separately) - Typically used for systems with noise suppression

DCR Considerations: - Subjects may consider noise suppression as degradation when comparing to uncoded noisy reference

5.1.3 Subjective Testing Considerations

China Mobile Com. Corporation
[FS_ULBC] Permanent Document v0.5.0

Comprehensive Summary of 3GPP FS_ULBC Permanent Document

Document Overview

This permanent document (p-doc) version 0.45.0 supports the Study Item on Ultra Low Bitrate Speech Codec (FS_ULBC), focusing on developing recommendations for normative work on an ultra-low bit rate codec for voice over Geostationary Orbit (GEO) satellites. The document tracks agreements, open issues, and progress across the study objectives defined in the SID.

1. Introduction and Scope

The study addresses nine key objectives: - Document application scenarios for ultra-low bit rate communication services - Study GEO channel characteristics and derive service-related dependencies - Identify relevant design constraints - Provide feasibility evidence - Define performance requirements and test methodologies - Identify/develop objective measures for design constraint verification - Identify reference codecs - Coordinate with other 3GPP groups (SA2, RAN, CT1) - Define potential normative work item objectives and timeline

Working Procedure: - Maintains one TR and one p-doc - Contributions via pCRs - Brackets restricted to values only - Open issues documented in p-doc

2. Application Scenarios

2.1 Main Scenario: IMS Voice Call over GEO

Key Technical Assumptions:

UE1 Uplink (UE1 → GEO satellite → Ground station): - Transmission data rate significantly limited ([1-3] kbit/s) - Requires ultra-low bit rate codec fitting this transmission rate - Subject to transmission errors reflecting GEO satellite access - Delay greater than typical terrestrial networks

UE1 Downlink (Ground station → GEO satellite → UE1): - Similarly limited transmission data rate - Subject to similar transmission errors and delay

UE2 Connection (Core Network → UE2): - Regular TN network transmission data rate available - Could use existing IMS codec (with transcoding) or same ULBC (transcoding-free) - Transcoding functionality in core network likely needed for seamless communication across network types

2.2 Sub-Scenario

Both connections (UE1 and UE2) via GEO satellite with significantly limited transmission data rate ([1-3] kbit/s), allowing both transcoded and transcoding-free operation.

3. Channel Characteristics and Service-Related Dependencies

3.1 End-to-End Simulation Model

Methodology: - Reuses simulation model from TS 26.132 Annex E (LTE reference scenario) - Adapted for GEO access scenario with "new GEO channel" - Potential inclusion of Non-IP Data Delivery (NIDD) option

Key Input Parameters:

BLER_tx/BLER_rx: Block error rates for uplink/downlink from RAN simulation

drx_cycle_length: DRX cycle duration (20-40ms for LTE; suitability for GEO TBC with RAN2)

mis_eNB1_eNB2: Scheduling time mis-alignment; determines buffer waiting time

nFrames considerations: - Frame length: Maximum 80ms assumed for GEO (vs. 20ms for LTE) - Voice packet size: Depends on protocol overhead (user plane vs. control plane, IP vs. Non-IP NIDD) - RTP Payload Size: Product of frame length and codec bit rate

Editor's Note: SA2 concluded in TR 23.700-19 that voice packets shall be transported over NB-IoT (GEO) user plane.

3.2 RAN Simulation Model for Error Traces

Objective: Generate multiple loss traces for combinations of: - Frame loss rate (target BLER) - Raw bitrate (TBS) - Voice bundling period - Doppler spread

Simulation Parameters: - Number of seeds: 10 - Trace duration: 400 seconds (6.67 minutes) - Channel consistency: Same channel realizations across all combinations
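The size of the resulting trace set can be estimated by enumerating the combinations. The sketch below uses the target BLER, Doppler spread, bundling period, and TBS values listed under the uplink simulation parameters in 3.2.2. Whether every combination is actually simulated is an assumption made here for illustration.

```python
from itertools import product

TRACE_DURATION_S = 400
NUM_SEEDS = 10
TARGET_BLER = [0.01, 0.02, 0.06, 0.10]
DOPPLER_HZ = [1, 5]
# TBS candidates in bits, keyed by voice bundling period in ms
TBS_BITS = {
    80: [144, 256, 328, 424],
    160: [208, 424, 600, 808],
    320: [328, 776, 1096, 1544],
}


def enumerate_trace_configs():
    """List every (bundling, TBS, BLER, Doppler) combination."""
    configs = []
    for bundling_ms, tbs_list in TBS_BITS.items():
        for tbs, bler, doppler in product(tbs_list, TARGET_BLER, DOPPLER_HZ):
            configs.append((bundling_ms, tbs, bler, doppler))
    return configs


def blocks_per_trace(bundling_ms: int) -> int:
    """Number of transport blocks in one trace, one block per bundling period."""
    return TRACE_DURATION_S * 1000 // bundling_ms


configs = enumerate_trace_configs()
print(f"Combinations: {len(configs)}, traces with seeds: {len(configs) * NUM_SEEDS}")
for bundling_ms in TBS_BITS:
    print(f"{bundling_ms} ms bundling: {blocks_per_trace(bundling_ms)} blocks per trace")
```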

3.2.1 Link Budget Analysis

Baseline CNR values from TR 36.763: - UL CNR = 2.6dB (0dBi UE antenna gain, 3.75kHz SCS, 1 tone, 23dBm UE max TX power) - DL CNR = -3.3dB (0dBi UE antenna gain, 15kHz SCS, 12 tones, 1 UE receive antenna, 23dBm UE max TX power)

3.2.2 Uplink Simulation Parameters

Channel model: NTN-TDL-C [38811]

Elevation angle: 10 degrees (parameters specified in Table 5.2.2.2-1)

Modulation: QPSK, π/2 BPSK

Subcarrier Spacing (SCS): 3.75kHz, 15kHz

Number of tones: 1 for both SCS values

Voice bundling period: 80ms, 160ms, 320ms - Note: 40ms not considered due to insufficient time for DL transmissions with 3.75kHz SCS

Doppler spread: 1Hz, 5Hz

Target BLER: 1%, 2%, 6%, 10% (fixed target BLER is FFS)

Maximum Achievable SNR: SNR = (3GPP SET-1 UL SNR) - 10×log₁₀(B/3.75) + (P - 23dBm) + G + [X] dB

Where: - 3GPP SET-1 UL SNR = 2.6dB - B = bandwidth (3.75kHz or 15kHz) - P = max UE TX power (23, 26, 31 dBm) - G = UE antenna gain difference (0 to -5.5dBi) - X = TBD (accounts for lower loss, better satellite performance)
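The uplink SNR expression can be evaluated for the listed parameter values as follows. This is a minimal sketch: X is still TBD in the document, so it defaults to 0 dB here purely as a placeholder, and the function and argument names are illustrative.

```python
import math


def max_ul_snr_db(bandwidth_khz: float,
                  tx_power_dbm: float,
                  antenna_gain_delta_db: float = 0.0,
                  x_db: float = 0.0,          # TBD in the document; 0 dB is a placeholder
                  base_snr_db: float = 2.6) -> float:
    """SNR = base - 10*log10(B / 3.75) + (P - 23) + G + X, all in dB."""
    bandwidth_penalty = 10 * math.log10(bandwidth_khz / 3.75)
    return (base_snr_db - bandwidth_penalty
            + (tx_power_dbm - 23) + antenna_gain_delta_db + x_db)


for b_khz, p_dbm, g_db in [(3.75, 23, 0.0), (15, 23, 0.0), (3.75, 31, -5.5)]:
    snr = max_ul_snr_db(b_khz, p_dbm, g_db)
    print(f"B={b_khz} kHz, P={p_dbm} dBm, G={g_db} dB: SNR = {snr:.2f} dB")
```

With the baseline settings (3.75kHz, 23dBm, G = 0) the expression returns the 2.6dB reference value; moving to 15kHz costs about 6dB because the same power is spread over four times the bandwidth.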

TBS Values and PHY Bitrates:

For 80ms bundling: - TBS: 144, 256, 328, 424 bits - PHY bitrate: 1.8, 3.2, 4.1, 5.3 kbps - Codec bitrate: 1.1, 2.5, 3.4, 4.6 kbps (assuming 7 bytes packet header)

For 160ms bundling: - TBS: 208, 424, 600, 808 bits - PHY bitrate: 1.30, 2.65, 3.75, 5.05 kbps - Codec bitrate: 0.95, 2.30, 3.40, 4.70 kbps

For 320ms bundling: - TBS: 328, 776, 1096, 1544 bits - PHY bitrate: 1.025, 2.425, 3.425, 4.825 kbps - Codec bitrate: 0.850, 2.250, 3.250, 4.650 kbps
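The PHY and codec bitrates above can be derived from the TBS and the bundling period. The sketch below assumes one transport block per bundling period and a single 7-byte header per block, which reproduces the listed values.

```python
HEADER_BITS = 7 * 8  # 7-byte packet header, counted once per transport block

TBS_BITS = {
    80: [144, 256, 328, 424],
    160: [208, 424, 600, 808],
    320: [328, 776, 1096, 1544],
}


def rates_kbps(tbs_bits: int, bundling_ms: int, header_bits: int = HEADER_BITS):
    """Return (PHY bitrate, codec bitrate) in kbps; bits per ms equals kbps."""
    phy = tbs_bits / bundling_ms
    codec = (tbs_bits - header_bits) / bundling_ms
    return phy, codec


for bundling_ms, tbs_list in TBS_BITS.items():
    for tbs in tbs_list:
        phy, codec = rates_kbps(tbs, bundling_ms)
        print(f"{bundling_ms:3d} ms, TBS {tbs:4d} bits: "
              f"PHY {phy:.3f} kbps, codec {codec:.3f} kbps")
```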

Notes: - Packet header counted once regardless of bundled frames - Loss of single TB means loss of multiple consecutive voice frames - Need for 320ms bundling to be revisited after channel simulation results
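The second note can be made concrete by expanding a per-transport-block loss trace into a per-frame loss trace. This is a sketch under the assumption that each transport block carries exactly bundling period / frame length voice frames; the helper names are hypothetical.

```python
def expand_tb_losses(tb_lost, bundling_ms: int, frame_ms: int):
    """Map each transport-block loss flag onto the voice frames it carries."""
    if bundling_ms % frame_ms != 0:
        raise ValueError("bundling period must be a multiple of the frame length")
    frames_per_tb = bundling_ms // frame_ms
    return [lost for lost in tb_lost for _ in range(frames_per_tb)]


def longest_burst(frame_lost) -> int:
    """Length of the longest run of consecutive lost frames."""
    longest = current = 0
    for lost in frame_lost:
        current = current + 1 if lost else 0
        longest = max(longest, current)
    return longest


# One lost block out of three, 80 ms bundling, 20 ms codec frames
frames = expand_tb_losses([False, True, False], bundling_ms=80, frame_ms=20)
print(frames)
print("Longest burst in frames:", longest_burst(frames))
```

With 320ms bundling and 20ms frames the same single block loss would remove 16 consecutive frames, which is one reason the need for 320ms bundling is flagged for revisiting.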

3.2.3 Downlink Simulation Parameters

SCS: 15kHz

Number of tones: 12

Achievable SNR: SNR = (3GPP SET-1 DL SNR) + G + [Y] dB

Where: - 3GPP SET-1 DL SNR = -3.3dB - G = UE antenna gain difference (0 to -5.5dBi) - Y = TBD (accounts for 2 RX antennas providing up to 3dB gain, lower loss, better G/T values, better satellite performance)

Editor's Note: Four companies reported Y=3 due to better G/T from field measurements (-28.6dB/K vs. -31.6dB/K assumed), but no RAN1 consensus reached.
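The downlink expression and the G/T difference quoted in the editor's note can be checked numerically. This is a minimal sketch: Y = 3 dB is the value reported by some companies, not an agreed value, and Y defaults to 0 dB here as a placeholder.

```python
def dl_snr_db(antenna_gain_delta_db: float = 0.0,
              y_db: float = 0.0,             # TBD in the document; 0 dB is a placeholder
              base_snr_db: float = -3.3) -> float:
    """SNR = (SET-1 DL SNR) + G + Y, all in dB."""
    return base_snr_db + antenna_gain_delta_db + y_db


def gt_improvement_db(measured_gt_db_per_k: float, assumed_gt_db_per_k: float) -> float:
    """Difference between measured and assumed receiver G/T, in dB."""
    return measured_gt_db_per_k - assumed_gt_db_per_k


y = gt_improvement_db(-28.6, -31.6)
print(f"G/T improvement: {y:.1f} dB")
print(f"DL SNR, baseline:          {dl_snr_db():.1f} dB")
print(f"DL SNR with reported Y:    {dl_snr_db(y_db=y):.1f} dB")
print(f"DL SNR, Y and G = -5.5 dB: {dl_snr_db(-5.5, y):.1f} dB")
```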

TBS values: Identical to uplink (Clause 5.2.2.2)

3.2.4 Frame Structure

Dynamic Scheduling Example (80ms bundling, Half-duplex FDD): - NPDSCH duration: 4ms (variable depending on DL SNR) - UL frequency allocation options: 1, 3, 6, 12 tones with 15kHz per tone

Semi-Persistent Scheduling (SPS): - If specified by RAN for NB-IoT NTN - NPDSCH can be anywhere in first 15ms (maintaining minimum 1ms gap to NPUSCH) - "Cell_specific_Koffset" approach proposed (not dependent on "TA report UE capability")

Gap between DL and UL consists of: - Processing time + DL-to-UL switching (minimum 1ms for half-duplex device) - Max differential delay: [close to 0 to 10.3ms] (TBC)

RAN1 Note: Example frame structures supportable in most scenarios but may not work for very large cells (>3000km) when UE doesn't support TA report and network doesn't support UE-specific K-offset. RAN1/2 have not yet designed SPS.

3.3 Open Issues for NB-IoT GEO Simulation

Issue 1 - UE Power Class: Whether to use specified 23dBm or broader range (26, 29, 31, 33 dBm) - Pending RAN input

Issue 2 - Latitude-Dependent Loss: Scintillation loss (2.2dB or 0dB depending on latitude) - Solved (accounted via X term)

Issue 3 - Elevation Angles: Keeping both 2.3° and 12.5° - Solved (accounted via X term)

Issue 4 - UL/DL Guard Time: 1ms assumption - Pending RAN confirmation

Issue 5 - Candidate TBS Values: Multiple proposals from companies - Unsolved

Issue 6 - Approaches to Select TBS: Three approaches provided - Unsolved

Issue 7 - Overall Simulation Methodology: High-level description needed - Unsolved (to be addressed after simulation completion)

Issue 8 - Simulation Channel Model: NTN-TDL-C vs. NTN-TDL-C5 - Solved (NTN-TDL-C used)

Issue 9 - Protocol Overhead: Clarify packet header for different transport options - Pending RAN2/SA2 confirmation

Issue 10 - Repetition Numbers: Specify and report in simulation - Solved

Issue 11 - RX G/T for Downlink: 3dB better value observed in field - Unsolved

3.4 Alternative Methodology for Determining ULBC Bit Rate

Editor's Note: This methodology remains an open issue.

Proposed Steps:

  1. Agree on operation points: Set of maximum achievable receive SNRs covering marginal to error-free operation with NTN-TDL-C fading

  2. Define performance requirements for each SNR operation point

  3. Agree on source bit rates for each bundling time (80, 160, 320ms) based on transport formats (TBS, SCS, MCS, NRep)

     • Current range: 825-4650 bits/s
     • Granularity appears insufficient and unequal

  4. Determine optimum transport format (SCS, MCS, NRep) for each source bit rate based on BLER vs. SNR curves

  5. Produce packet loss patterns for each bundling time and source bit rate at relevant SNRs (unknown to proponents during selection)

  6. Compare ULBC candidates based on performance requirements at relevant SNRs

Example Workflow:

  • Proponent has design at 0.95 kbps and 3.4 kbps
  • For 160ms bundling with 7-byte overhead:
     • Low rate: TBS = 208 bits
     • High rate: TBS = 600 bits
  • Select best transport format configuration from available options
  • Generate BLER patterns for different UE TX powers (23, 26, 29, 31 dBm)
  • Run codec simulation with these patterns
  • Evaluate quality (e.g., listening test) with weighted averaging across power settings
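The TBS choice in the example workflow can be sketched as picking the smallest candidate transport block that fits the codec payload plus header. The candidate lists are the TBS values from the uplink simulation parameters; the selection rule itself is an assumption made for illustration and is not stated in the document.

```python
HEADER_BYTES = 7

CANDIDATE_TBS_BITS = {
    80: [144, 256, 328, 424],
    160: [208, 424, 600, 808],
    320: [328, 776, 1096, 1544],
}


def required_tb_bits(codec_bps: int, bundling_ms: int,
                     header_bytes: int = HEADER_BYTES) -> int:
    """Codec payload for one bundling period (rounded up) plus one header."""
    payload_bits = -(-codec_bps * bundling_ms // 1000)  # ceiling division
    return payload_bits + 8 * header_bytes


def select_tbs(codec_bps: int, bundling_ms: int):
    """Smallest candidate TBS that fits, or None if no candidate is large enough."""
    need = required_tb_bits(codec_bps, bundling_ms)
    fitting = [tbs for tbs in CANDIDATE_TBS_BITS[bundling_ms] if tbs >= need]
    return min(fitting) if fitting else None


for rate_bps in (950, 3400):
    print(f"{rate_bps} bit/s at 160 ms bundling: TBS = {select_tbs(rate_bps, 160)} bits")
```

For the two rates in the example (0.95 and 3.4 kbps) this yields the 208-bit and 600-bit transport blocks named in the workflow.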

Note: Important to test candidates for other conditions beyond NTN NB-IoT (e.g., Terrestrial IMS with 1% BLER, OTT with 0% BLER, extreme conditions with 10% BLER or blockage losses)

3.5 Simulation Results

Table 5-6 documents preliminary results: - 80ms bundling: Qualcomm submitted S4-251739 - Company A, B, C: TBD

4. Design Constraints

4.1 Complexity and Memory Demands

Target Device Types: - Handheld mobile phones - Smart watches - Smart glasses/head mounted devices - TCU (Telematics Control Unit) - CPE (Customer Premises Equipment) - Vehicles - Other IoT devices

Recommended Constraints: - Implementable on DSP/CPU/NPU enabled UE devices - For low-end DSP-only UEs: - Complexity: <500 WMOPS (measured on C reference code) - ROM memory: <20MB, assuming 32 bits/parameter (i.e., 5M model parameters)

Editor's Notes: - Definition of "DSP enabled UE devices" needs clarification - Exact complexity estimation metric and limits are TBD

4.2 Design Constraint Verification

Complexity Verification: - Constraints may be based on platform-agnostic metrics: - MACs/FLOPs for AI-based components - WMOPS for traditional signal processing - Model size and precision - Verification process details and timing are FFS

Algorithmic Delay: - Verification method for AI-based codecs required

5. Performance Requirements

5.1 Scope

Define performance requirements and test methodologies for: - Speech quality, intelligibility, conversational quality - Clean speech and noisy speech - Tandeming with existing IMS voice codecs - Clean channel and GEO channel conditions - Identify relevant reference codecs

5.2 Status Tracking

Core influencing factors identified: - DC: Sample rate and audio bandwidth - DC: Bitrates (External dependency) - DC: Frame length - DC: PLC (External dependency) - DC: Algorithmic Delay - DC: Complexity, Memory - Test Methodologies - DC: Noise suppression - DC: DTX/CNG - DC: Robust Non-Speech - Evidence DCs - Reference codec

All items currently have open issues and progress TBD

6. Coordination and Dependencies

6.1 External Dependencies

From RAN: - HARQ retransmission parameters (max_tx/max_rx) - DRX cycle length suitability for GEO - Scheduling parameters (dynamic vs. SPS) - Frame structure confirmation - UE power class - UL/DL guard time - Protocol overhead - G/T values for downlink

From SA2: - Transport path for voice packets (user plane vs. control plane, IP vs. Non-IP NIDD) - Protocol overhead details - Transcoding functionality requirements

From RAN2: - Dynamic scheduling vs. Semi-Persistent Scheduling - MAC header size (1-byte feasibility) - Timing parameters

7. Key Technical Contributions

7.1 Simulation Framework Establishment

The document establishes a comprehensive RAN simulation framework for generating error traces: - Defined methodology using NTN-TDL-C channel model - Specified uplink and downlink parameters - Established TBS values and corresponding codec bitrates for multiple bundling periods - Defined channel consistency requirements across simulations

7.2 Link Budget Analysis

Adopted baseline CNR values from TR 36.763 with provisions for: - Variable UE power classes - Latitude-dependent losses - Elevation angle variations - Better-than-assumed satellite performance

7.3 Bitrate Determination Methodology

Proposed alternative methodology allowing proponents design freedom:

  • Operation point definition based on receive SNRs
  • Transport format optimization for each source bit rate
  • Packet loss pattern generation
  • Comparative evaluation framework

7.4 Frame Structure Definition

Defined frame structures for:

  • Dynamic scheduling with Half-duplex FDD
  • Semi-Persistent Scheduling options
  • Cell_specific_Koffset approach for large cells

7.5 End-to-End Delay-Error Profile Model

Adapted TS 26.132 Annex E model for GEO scenarios:

  • Identified required input parameters
  • Defined voice packet structure with protocol overhead
  • Established relationship between frame length, bundling, and packet loss

8. Open Issues Summary

High Priority (Blocking):

  1. Consensus on UE power class (23 dBm vs. higher values)
  2. RAN confirmation on frame structures and scheduling
  3. SA2/RAN2 confirmation on protocol overhead
  4. Selection of candidate TBS values and selection methodology
  5. Downlink RX G/T value consensus

Medium Priority:

  1. Fixed vs. variable target BLER
  2. Need for 320ms bundling option
  3. Complexity metric definition and limits
  4. Algorithmic delay verification for AI codecs

Lower Priority:

  1. Overall simulation methodology description (after completion)
  2. Definition of "DSP enabled UE devices"

9. Document Status

Current Version: 0.45.0 (SA4#135, February 2026)

Recent Updates:

  • Added 10-degree channel model parameters
  • Updated simulation parameters per multiple agreed TDOCs
  • Added company simulation results reporting
  • Clarified voice packet transport over user plane

Working Status:

  • Active study phase
  • Collecting simulation results from companies
  • Coordinating with RAN and SA2 for parameter confirmation
  • Developing design constraints and performance requirements

China Mobile Com. Corporation
[FS_ULBC] WorkPlan of FS_ULBC v0.5

Timeplan for FS_ULBC Study Item

1. Introduction

This document outlines the timeplan for the Feasibility Study on Ultra Low Bitrate Speech Codec (FS_ULBC). The study focuses on developing a codec for ultra-low bit rate communication services, particularly for IMS Voice Call Using GEO Access as documented in TR 22.887.

Study Item Objectives

The FS_ULBC study has nine main objectives:

  1. Application Scenarios: Document ultra-low bit rate communication service scenarios based on TR 22.887 use cases and requirements for IMS Voice Call Using GEO Access
  2. GEO Channel Characteristics: Study GEO channel characteristics and derive service-related dependencies (bitrates, mouth-to-ear delay, loss/delay/jitter profiles)
    • Note: NB-IoT services impact is out of scope
  3. Design Constraints: Identify relevant design constraints in coordination with other WGs:
    • Bit rates
    • Sample rate and audio bandwidth
    • Frame length
    • Complexity and memory demands
    • Algorithmic delay
    • Packet loss concealment (PLC)
    • Potential noise suppression integration
    • Discontinuous transmission (DTX) including VAD and comfort noise
    • Speech quality
    • Robustness to non-speech input
  4. Feasibility Evidence: Provide evidence that design criteria can be met using existing reference codecs
  5. Performance Requirements: Define performance requirements and identify test methodologies for:
    • Speech quality and intelligibility
    • Conversational quality
    • Clean and noisy speech conditions
    • Tandeming with existing IMS voice codecs
    • Clean channel and GEO channel conditions
  6. Objective Measures: Identify or develop objective measures to verify design constraints (e.g., complexity and memory measurements)
  7. Reference Codecs: Identify relevant reference codecs for comparison and evaluation
  8. Coordination: Coordinate with other 3GPP groups (SA2, RAN, CT1, etc.)
  9. Normative Work: Define potential normative work item objectives and timeline

2. Current Progress Status

Application Scenarios (85% Complete)

  • Scenario 1: IMS Voice Call over GEO (TR 4.2, P-doc 4.1)
  • Scenario 2: Multi-Party Voice Communication (TR 4.3)
  • Scenario 3: IMS Voice Call with ULBC over other access types than GEO (TR 4.4)
  • Next Steps: Finalize high-level prerequisites and resolve ENs
  • Dependencies: SA2, RAN, CT

GEO Channel Characteristics & Simulation (75% Complete)

  • NB-IoT system design and simulation parameters documented (TR 5.1.4, P-doc 5.2.2)
  • SA4#135 Plans:
    • Finalize remaining parameters
    • Gather candidate Transport Block Sizes (TBS)
  • Dependencies: SA2, RAN

Simulation Methodology (60% Complete)

  • SA4#135 Plans: Discuss and confirm simulation methodology
  • Dependencies: RAN

Company Simulation Results (40% Complete)

  • Companies providing simulation results (P-doc 5.2.3)
  • SA4#135 and Post Ad-hoc Plans:
    • Select appropriate TBS
    • Collect company simulation results
  • SA4#136 Plans:
    • Cross-check simulation results
    • Finalize feasible TBS values and loss traces

Mouth-to-Ear Delay (95% Complete)

  • Documented in TR 5.1
  • Next Steps: Resolve ENs

Design Constraints Progress

Bit Rates (0% Complete)

  • SA4#136 and Post Ad-hoc: Decide bit rates for ULBC (dependent on simulation results)
  • Dependencies: RAN

Sample Rate and Audio Bandwidth (5% Complete)

  • SA4#135 and Post Ad-hoc: Discuss supported audio bandwidth
  • SA4#136 and Post Ad-hoc: Decide supported audio bandwidth

Frame Length (0% Complete)

  • SA4#136 and Post Ad-hoc: Decide frame length for ULBC

Complexity and Memory Demands (80% Complete)

  • Documented in TR 6.2.1, P-doc 6.1.1
  • SA4#135 and Post Ad-hoc: Finalize complexity measurement metrics and resolve ENs

Algorithmic Delay (0% Complete)

  • SA4#135 and Post Ad-hoc: Discuss algorithmic delay for ULBC

Packet Loss Concealment (15% Complete)

  • Documented in TR 7.1.5
  • SA4#135 and Post Ad-hoc: Discuss PLC for ULBC

Noise Suppression (15% Complete)

  • Documented in TR 7.4
  • SA4#135 and Post Ad-hoc: Discuss noise suppression for ULBC

DTX (0% Complete)

  • SA4#135 and Post Ad-hoc: Discuss DTX support for ULBC

Design Constraint Verification (5% Complete)

  • P-doc 6.3.1
  • Next Steps: Verify design constraints

Other Considerations (5% Complete)

  • TR 6.4.1
  • Next Steps: Document additional design considerations and resolve ENs

Existing Codec Technologies (85% Complete)

  • Reference codecs documented (e.g., DAC, Lyra) in TR 7
  • SA4#135 and Post Ad-hoc: Continue documenting evidence of existing technologies and resolve ENs

Performance Requirements (0% Complete)

  • SA4#136 and Post Ad-hoc: Define performance requirements

Test Methodologies (50% Complete)

  • Subjective test methodologies documented in TR 9.1.3
  • SA4#135 and Post Ad-hoc: Identify appropriate test methodologies

Coordination with Other WGs

  • Analysis of current liaisons from RAN, CT, and SA2 available (S4aA250139)
  • Ongoing coordination as needed

3. Detailed Timeplan

TSG SA#107 (March 12-14, 2025, Incheon, KR)

  • Approval of FS_ULBC study item

SA4#131-bis-e (April 11-17, 2025)

  • Start documenting application scenarios for ultra-low bit rate communication services
  • Start studying GEO channel characteristics and service-related dependencies
  • Start identifying relevant reference codecs
  • Start coordinating with other 3GPP groups

Audio SWG Telco (May 5, 2025)

  • Focus on application scenarios and technical contributions

SA4#132 (May 19-23, 2025, Fukuoka, JP)

  • Finalize application scenarios documentation
  • Progress GEO channel characteristics study
  • Progress reference codec identification
  • Progress coordination with other WGs
  • Start identifying/developing objective measures for design constraint verification
  • Start identifying relevant design constraints (bit rates, sample rate, frame length, complexity, algorithmic delay, PLC, noise suppression, DTX, speech quality, robustness)
  • Start providing feasibility evidence using existing reference codecs
  • Start defining performance requirements and test methodologies for speech quality
  • If time permits: Start documenting additional application scenarios

Audio SWG Telcos (June-July 2025)

Multiple telcos scheduled to:

  • Progress GEO channel characteristics study
  • Perform RAN-related simulations within SA4
  • Align on RAN link-level simulations
  • Power to send LS to SA2 and RAN WGs if needed

SA4#133-e (July 21-25, 2025)

  • Progress all ongoing work items:
    • GEO channel characteristics
    • Coordination with other WGs
    • Reference codec identification
    • Objective measures development
    • Design constraints identification
    • Feasibility evidence
    • Performance requirements and test methodologies
  • If time permits: Progress additional application scenarios

F2F Ad-hoc Meeting (September 23-25, 2025, Erlangen, Germany)

  • Hosted by Fraunhofer IIS
  • Electronic participation on best effort basis

Audio SWG Telcos (October 2025)

  • Opportunity for feedback from other WGs
  • Progress work on:
    • GEO channel characteristics study
    • Existing technologies documentation
    • Design constraints identification
    • Performance requirements and test methodologies
    • Application scenarios if time permits

SA4#134 (November 17-21, 2025, Dallas, US)

Major milestone meeting:

  • Finalize:
    • GEO channel characteristics study
    • Coordination with other WGs
    • Reference codec identification
    • Design constraints: bit rates, sample rate, audio bandwidth, frame length, PLC, noise suppression, DTX
  • Progress:
    • Feasibility evidence
    • Objective measures development
    • Design constraints: complexity, algorithmic delay, speech quality, robustness
    • Performance requirements and test methodologies
  • Start defining potential normative work item objectives and timeline
  • If time permits: Finalize additional application scenarios

Audio SWG Telcos (December 2025 - January 2026)

  • Finalize GEO channel characteristics study
  • Progress:
    • Simulation parameters for end-to-end simulation
    • Existing technologies documentation
    • Design constraints identification
    • Performance requirements and test methodologies
    • Application scenarios if time permits
  • Power to send reply LS for incoming LS postponed during SA4#134

SA4#135 (February 9-13, 2026, India)

  • Finalize objective measures for design constraint verification
  • Progress:
    • Design constraints: complexity, algorithmic delay, speech quality, robustness
    • Feasibility evidence
    • Performance requirements and test methodologies

TSG SA#111 (March 10-13, 2026, Japan)

  • TR for information

SA4#136 (April 13-17, 2026)

  • Finalize:
    • Design constraints: complexity, algorithmic delay, robustness to non-speech input
    • Feasibility evidence
  • Progress:
    • Design constraints: speech quality
    • Performance requirements for speech quality

SA4#137 (May 11-15, 2026)

Final study meeting:

  • Finalize:
    • Design constraints: speech quality
    • Performance requirements and test methodologies (clean/noisy speech, tandeming, clean/GEO channel conditions)
    • Potential normative work item objectives and timeline

TSG SA#112 (June 9-12, 2026, Singapore)

  • TR for approval - Study completion
China Mobile Com. Corporation
[FS_ULBC] On Assumptions and Open Issues for NB-IoT GEO Simulation

Summary of S4-260149: Updates on Assumptions and Open Issues for NB-IoT GEO Simulation

Document Overview

This contribution from China Mobile addresses outstanding assumptions and open issues for NB-IoT GEO satellite simulation work within the ULBC (Ultra-Low Bitrate Codec) study. The document consolidates discussions from multiple Audio Ad-hoc meetings (June 4, June 17, and July 11) and proposes updates to TR 26.940 clause 5.2.2.4.

Main Technical Contributions

Status Updates on Simulation Parameters

The document provides a comprehensive status table tracking 11 key simulation issues, with updates on their resolution status:

Resolved Issues

  1. UE Power Class (Issue 1):
    • Previously pending decision between 23 dBm (specified for NTN NB-IoT) vs. higher commercial values (26-33 dBm)
    • Resolution: 37 dBm adopted based on RAN4 Reply LS S4aA250219
    • Note: 37 dBm is under study in ongoing RAN work
  2. Latitude-Dependent Loss (Issue 2):
    • Addressed scintillation loss variation (2.2 dB vs. 0 dB) based on latitude per TR 38.821
    • Resolution: Simulation accounts for latitude-dependent loss using X term
    • Additional note: New 10-degree channel model introduced, may increase feasible TBS
  3. Elevation Angles (Issue 3):
    • Proposal to maintain both 2.3° and 12.5° elevation angles for worst-case scenarios
    • Resolution: Simulation accounts for elevation angles using X term
  4. Simulation Channel Model (Issue 8):
    • Choice between NTN-TDL-C or NTN-TDL-C5
    • Resolution: NTN-TDL-C is used
  5. Repetition Numbers (Issue 10):
    • Proposal to specify and report repetition numbers in simulation
    • Resolution: Solved

Partially Resolved Issues

  1. Protocol Overhead (Issue 9):
    • Requires clarification of packet header overhead for different protocol combinations (user plane, control plane, IP vs. non-IP)
    • Partial Resolution: SA2 confirmed voice packets transported over User Plane
    • Still Pending: Overhead for User Plane (IP vs. Non-IP) needs RAN confirmation
  2. RX G/T for Downlink (Issue 11):
    • Field measurements show 3dB better value than current RAN assumptions
    • Status: Editorial note added in P-doc 5.2.2.3 to capture field-measured data
    • Current Status: Listed as "Unsolved" in table

Unresolved Issues

  1. UL/DL Guard Time (Issue 4):
    • Current assumption: 1 ms guard time for UL/DL switching
    • Status: Needs RAN confirmation on feasibility
  2. Determine Candidate TBS Values (Issue 5):
    • Multiple proposals from different companies:
      • Xiaomi (S4aA250035)
      • Fraunhofer (S4aA250031)
      • Skylo (S4-251540)
      • Dolby (S4-251390)
      • Huawei (S4aA250230)
      • Qualcomm (S4-251548)
      • vivo (S4aA250215)
    • Status: Unsolved, requires further verification
  3. Approaches to Select TBS (Issue 6):
    • Three approaches provided in S4aA250072
    • One approach detailed in clause 5.2.2.4.1
    • Status: Unsolved, requires further discussion
  4. Overall Simulation Methodology Description (Issue 7):
    • Need for high-level description of simulation execution, including optimization parameters and result parameters
    • General description documented in P-doc Clause 5.2.2
    • Status: Unsolved, to be addressed after all simulation work completed

Proposal

The document proposes to:

  1. Update the P-doc (TR 26.940) based on the status updates provided
  2. Continue tracking these issues until full resolution

Key Dependencies

The document highlights several dependencies on other working groups:

  • RAN4: UE power class confirmation
  • RAN: UL/DL guard time feasibility, protocol overhead confirmation
  • SA2: Protocol overhead for different transport configurations

vivo Mobile Communication Co.,
[FS_ULBC] Updates of the permanent document based on 3GPP TR 23.700-19

Summary of 3GPP Technical Document: Updates to FS_ULBC Permanent Document

Document Overview

This contribution updates the FS_ULBC (Ultra Low Bitrate Speech Codec) Permanent Document to align with SA2 conclusions on Key Issue #1 regarding IMS voice call support over NB-IoT via GEO satellite connecting to EPC, as documented in TR 23.700-19.

Main Technical Contributions

1. Reference Updates

The document adds critical new references to align with recent 3GPP work:

  • TR 23.700-19 V1.2.0: Study on Integration of satellite components in the 5G architecture; Phase 4
  • S2-2509293: Interim conclusions on KI#1 Support of IMS voice call over NB-IoT NTN via GEO satellite connecting to EPC
  • TR 36.763: Study on NB-IoT/eMTC support for Non-Terrestrial Networks
  • R1-2506541: Reply LS on RAN simulation assumptions for ULBC

2. End-to-End Simulation Model Updates (Clause 5.2.1.3)

2.1 Architecture and Protocol Stack Changes

The document introduces significant modifications to the end-to-end simulation model:

  • New GEO Channel Model: Extends the reference LTE scenario (Annex E of TS 26.132) to accommodate GEO satellite access
  • Three Architectural Scenarios Defined:
    • Reference LTE VoLTE scenario (Figure 5.2.1.3-1)
    • Main GEO scenario with IP transport (Figure 5.2.1.3-2)
    • GEO scenario with Non-IP Data Delivery option (Figure 5.2.1.3-2a)

2.2 Transport Mechanism Agreements

Based on SA2 conclusions in TR 23.700-19:

  • User Plane Transport: Voice packets shall be transported over NB-IoT (GEO) user plane using DRB and S1-U
  • Single PDN Connection: Both IMS signaling and IMS voice use a single PDN connection
  • Mandatory Mechanism: Transport of IP packets (UP/IP) with RoHC recommended
  • Optional Mechanism: Transport using removal and restoration of parts of RTP/UDP/IP headers (UP/non-IP)

2.3 Simulation Input Parameters

Key parameters updated for GEO scenarios:

  • BLER_tx/BLER_rx: Block error rates for UL/DL based on error traces from Clause 5.2.2
  • max_tx/max_rx: HARQ retransmissions (note: HARQ feedback suggested to be disabled for IMS voice over GEO per Release 18)
  • drx_cycle_length: DRX cycle duration (LTE values 20-40ms, suitability for GEO requires RAN2 confirmation)
  • mis_eNB1_eNB2: Scheduling time misalignment between eNBs
  • Speech sequence frame length: Maximum 80ms frame length for GEO (vs. 20ms for LTE)
  • Voice packet size: Depends on protocol overhead, varies by transport mechanism

2.4 Protocol Overhead Considerations

Two protocol overhead scenarios illustrated:

  • UP/IP with RoHC (Figure 5.2.1.3-4 left): Mandatory mechanism
  • UP/non-IP with header removal (Figure 5.2.1.3-4 right): Optional mechanism

Editor's Note: Exact overhead for UDP/IP (SA2 scope) and RTP (SA4 scope) for the removal/restoration mechanism requires determination.

3. Simulation Assumptions and Open Issues (Clause 5.2.2.4)

3.1 Resolved Issues

| Issue | Resolution |
|-------|-----------|
| Latitude-Dependent Loss | Simulation accounts for latitude-dependent scintillation loss using X term (2.2 dB or 0 dB beyond ±20° latitude per TR 38.821) |
| Elevation Angles | Both 2.3° and 12.5° angles considered using X term for worst-case scenarios |
| Simulation Channel Model | NTN-TDL-C selected |
| Repetition Numbers | Specified and reported in simulation |

3.2 Pending Issues Requiring RAN Input

  • UE Power Class: 23 dBm (specified for NTN NB-IoT) vs. commercial UE range (26-37 dBm) - requires RAN confirmation
  • UL/DL Guard Time: 1ms assumption needs RAN verification
  • RX G/T for Downlink: Field observations show 3dB better performance than current RAN assumptions

3.3 Unresolved Issues

  • Candidate TBS Values: Multiple proposals from Xiaomi, Fraunhofer, Skylo, Dolby, Huawei, Qualcomm, and vivo require evaluation
  • TBS Selection Approaches: Three approaches in S4aA250072 need discussion
  • Overall Simulation Methodology: High-level description to be completed after simulation work
  • Protocol Overhead for UP/non-IP: Exact overhead values for removal/restoration mechanism depend on specific RTP fields selected (SA4 decision)

3.4 Updated Understanding on Protocol Overhead

Based on SA2 agreements:

  • Control Plane transport excluded: Only User Plane transport considered
  • Mandatory: UP/IP with RoHC recommended
  • Optional: UP/non-IP with partial header removal/restoration
  • Exact overhead values for optional mechanism pending SA4 decisions on RTP field selection

Key Dependencies and Cross-WG Coordination

The document identifies several inter-working group dependencies:

  • RAN1: Physical layer timing, power class confirmation
  • RAN2: HARQ configuration, DRX cycle parameters, scheduling mechanisms
  • SA2: UDP/IP overhead for non-IP mechanism
  • SA4: RTP overhead, frame length confirmation, RTP field selection for header removal

Editor's Notes

Two critical editor's notes remain:

  1. Whether the eNB1-eNB2 delay model for LTE scenarios accurately reflects GEO deployment delays
  2. Whether RTP payload size affects the delay-error profile
China Mobile Com. Corporation
[FS_ULBC]Considerations for ULBC Codec Selection Process

Comprehensive Summary: Considerations for ULBC Codec Selection Process

Document Overview

This document appears to be a presentation or discussion paper related to the ULBC (Ultra Low Bitrate Codec) codec selection process. However, the provided content is fragmentary and contains mixed language elements (English and Chinese), making comprehensive technical analysis challenging.

Main Technical Areas Identified

1. ULBC Codec Selection Process

The document's primary focus is on considerations for selecting codecs in the ULBC (Ultra Low Bitrate Codec) context. However, specific technical criteria, evaluation methodologies, or selection parameters are not detailed in the provided content.

2. JPEG AI Integration

Overview

  • The document references JPEG AI as a relevant technology
  • JPEG AI appears to be considered as a potential codec or compression technology for the ULBC use case

Working Mechanism

  • A section is dedicated to JPEG AI's working mechanism
  • Specific technical details of the mechanism are not provided in the extracted content

Timeline

  • Timeline considerations for JPEG AI are mentioned
  • Specific milestones or deployment schedules are not detailed

3. Cross-Working Group Coordination

SA2 Related Work

  • SA2 has related work in Release 18 (R18) and Release 19 (R19)
  • Key Issue Identified: Lack of unified architecture design
  • Requirements are coming from RAN but lack unified architectural framework
  • Suggests fragmentation in approach across different scenarios

RAN Liaison Statements

  • Latest LS (Liaison Statement) from RAN concerns model transmission
  • Indicates coordination requirements between SA and RAN working groups

4. Architecture Considerations

Network Function Changes

  • Reference to "NF变CN" (Network Function changes to Core Network)
  • Suggests potential architectural modifications at the Core Network level
  • Specific changes or proposals are not detailed in the provided content

Open Questions

The document includes an "Open Questions" section, indicating ongoing discussions and unresolved technical issues. However, the specific questions are not provided in the extracted content.

Technical Gaps in Provided Content

Due to the fragmentary nature of the document provided:

  • Specific codec selection criteria are not detailed
  • Technical evaluation parameters are missing
  • Comparison methodologies between candidate codecs are not present
  • Detailed architectural proposals are not included
  • Specific agreements or decisions are not documented

Observations

  1. Multi-Release Scope: The work spans R18 and R19, indicating ongoing evolution
  2. Cross-WG Dependencies: Clear dependencies between SA2 and RAN work
  3. Architecture Fragmentation: Identified need for unified architecture design
  4. Emerging Technologies: JPEG AI considered as potential solution
  5. Core Network Impact: Potential changes to CN architecture implied

Note: This summary is based on fragmentary content with significant portions in template format or non-English text. A complete technical analysis would require the full document with all technical details, agreements, and proposals.

vivo Mobile Communication Co., Nokia, Xiaomi Technology, Samsung, Spreadtrum, Bytedance
[FS_ULBC] Analyzing semantic intelligibility in lossy coded audio signals

Comprehensive Summary: Analyzing Semantic Intelligibility in Lossy Coded Audio Signals

1. Introduction and Objectives

This contribution presents experimental evaluation results focusing on semantic intelligibility of audio codecs under Ultra Low Bitrate Codec (ULBC) constraints for GEO satellite communications. The primary objective is to quantify semantic preservation (listener's ability to accurately understand spoken content) using Automatic Speech Recognition (ASR) Word Error Rate (WER) as a proxy metric, rather than traditional perceptual quality (MOS) metrics.

The study evaluates:

  • Descript Audio Codec (DAC): AI-based codec
  • Enhanced Voice Services (EVS) codec: 3GPP standard reference

The analysis specifically investigates whether higher audio bandwidths (wideband vs. narrowband) improve or reduce intelligibility at very low bitrates, providing data-driven guidance for audio bandwidth design constraints and quality floor determination.

2. Background and Motivation

2.1 ULBC Context

The ULBC study item targets voice over GEO satellite communications where balancing audio quality, robustness, and bit-efficiency is critical. At extremely low bitrates (< 3 kbps or ~1 kbps), a fundamental trade-off emerges:

  • Wideband audio (16 kHz) offers naturalness and perceptual quality
  • Bit allocation challenge: Allocating scarce bits to higher frequencies reduces the budget for core speech spectrum, potentially introducing artifacts that outweigh bandwidth benefits

2.2 Critical Communication Requirements

For emergency rescue operations, semantic intelligibility is the highest priority. Key considerations include:

  • Wideband generally improves comfort and speaker identification, but its impact on speech understanding in "last resort" scenarios requires verification
  • System interoperability with legacy endpoints (PSTN, GSM fallback) remains important in remote areas
  • Need to balance modern expectations with legacy requirements and emergency scenarios

2.3 EVS as Reference Anchor

EVS serves as a quality anchor and concrete standardized baseline for semantic preservation, enabling:

  • Practical quality floor definition for ULBC
  • Comparison against established carrier-grade standards
  • Isolation of bandwidth choice impact independent of codec architecture

3. Methodology

3.1 Evaluation Pipeline

  • Dataset: LibriSpeech train-clean-100 subset (standard benchmark for high-quality read English speech)
  • Sample size: 500 audio files randomly sampled across three seeds (101, 102, 103)
  • Consistency: Same audio files used for all codec and bitrate configurations

3.2 Processing Chain

  1. Process input audio through target codecs (DAC and EVS) at various bitrates
  2. Transcribe processed audio using OpenAI Whisper model (large-v3) - selected for state-of-the-art performance and noise robustness
  3. Compare transcripts against LibriSpeech ground truth
  4. Calculate WER using jiwer library
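The WER computation in the last step can be sketched as a word-level edit distance divided by the reference length. The sketch below is a pure-Python stand-in rather than the jiwer library named above, and its text normalization (lowercasing and whitespace splitting) is an assumption; the transcripts are hypothetical examples, not data from the contribution.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER in percent: word-level edit distance / reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must not be empty")
    # prev[j] = edit distance between the ref words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return 100.0 * prev[-1] / len(ref)


# Hypothetical transcripts, not taken from the contribution.
reference = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jumped over lazy dog"
print(f"WER = {word_error_rate(reference, hypothesis):.2f}%")
```

Here one substitution and one deletion against nine reference words give a WER of about 22.22%. A production pipeline would typically also normalize punctuation and numerals before scoring.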

4. Experimental Setup

4.1 Codec Configurations

DAC model: Evaluated at three sampling rates

  • 16 kHz
  • 24 kHz
  • 44 kHz

EVS codec: Evaluated in standard modes

  • Narrowband (NB)
  • Wideband (WB)

Baseline: Uncompressed PCM audio (resampled from 48 kHz to NB and WB)

4.2 Observations on Baseline Variance

  • NB PCM occasionally scored ~0.1% better than WB PCM
  • Attributed to inherent ASR model variance rather than signal quality differences
  • Explains why high-bitrate DAC models occasionally score slightly lower than WB baseline

4.3 Primary Metric

WER (Word Error Rate): Lower percentage indicates better performance. Log-scale visualization employed to distinguish performance differences in the 3-5% WER range.

5. Results and Analysis

5.1 DAC Performance vs Bitrate

Key Findings:

  • DAC achieves high efficiency at low bitrates (~2 kbps)
  • WER drops rapidly as bitrate increases, stabilizing around 3-4%
  • At 1.5 kbps: WER approximately 5.5%
  • Significant improvement observed in 1.5-3.0 kbps range

Bandwidth Impact at Low Bitrates:

  • At low bitrates (1.5 kbps and 3 kbps), 16 kHz model outperforms 24 kHz model
  • With constant model size, 16 kHz model allocates more bits per spectral unit within narrower band
  • Results in better semantic preservation vs. 24 kHz model suffering from bit starvation

5.2 DAC 8 kHz Narrowband Model Analysis

A dedicated 8 kHz sampling rate model was trained to investigate bandwidth impact at the lower bitrate bound.

Model Configuration:

  • Sample rate: 8000 Hz
  • Encoder rates: [2, 4, 4, 8], dimension: 64
  • Decoder rates: [8, 4, 4, 2], dimension: 1536
  • Quantization: 6 codebooks, size 1024, dimension 36
  • Training: 200,000 steps on VCTK corpus
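The bitrates reachable with this configuration can be derived from the listed parameters. The sketch below assumes that the encoder rates are per-stage downsampling strides (so their product is the hop size in samples) and that each active codebook contributes log2(codebook size) bits per latent frame; both are assumptions about the model's conventions, not statements from the contribution.

```python
from math import log2, prod

SAMPLE_RATE_HZ = 8000
ENCODER_RATES = [2, 4, 4, 8]   # assumed to be per-stage downsampling strides
CODEBOOK_SIZE = 1024
MAX_CODEBOOKS = 6

hop_samples = prod(ENCODER_RATES)                 # samples per latent frame
frames_per_second = SAMPLE_RATE_HZ / hop_samples  # latent frame rate in Hz
bits_per_codebook = log2(CODEBOOK_SIZE)           # bits per codebook index


def bitrate_bps(num_codebooks: int) -> float:
    """Bitrate when only the first num_codebooks quantizers are transmitted."""
    return frames_per_second * bits_per_codebook * num_codebooks


for n in range(1, MAX_CODEBOOKS + 1):
    print(f"{n} codebook(s): {bitrate_bps(n):.1f} bps")
```

Under these assumptions the hop is 256 samples (32 ms at 8 kHz), and the six operating points span 312.5 to 1875 bps, with 937.5 bps and 1562.5 bps at three and five codebooks. These agree, after rounding, with the 938 bps and 1563 bps points and the 312-1875 bps range reported in this summary.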

Critical Findings at Sub-1.5 kbps:

  • At ~1 kbps:
    • 8 kHz model (938 bps): WER 5.86%
    • 16 kHz model (1000 bps): WER 11.23%
    • Semantic penalty > 5 percentage points when forcing WB at 1 kbps
  • At 1.5 kbps:
    • 8 kHz model (1563 bps): WER 3.86%
    • 16 kHz model (1500 bps): WER 5.46%

Conclusion: At sub-2 kbps bitrates, available bit budget is insufficient to support wider bandwidth without degrading core spectral content required for intelligibility. Native Narrowband mode allows high-precision bit allocation to fundamental frequencies (0-4 kHz), preserving semantic content more effectively.
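The size of the wideband penalty can be reproduced from the WER figures reported above. This is a small illustrative calculation; the helper name and the choice to report both an absolute and a relative gap are assumptions made here, not part of the contribution.

```python
# WER values (percent) as reported in the summary above.
reported = {
    "~1 kbps":  {"8 kHz": 5.86, "16 kHz": 11.23},
    "1.5 kbps": {"8 kHz": 3.86, "16 kHz": 5.46},
}


def wer_gap(nb: float, wb: float) -> tuple[float, float]:
    """Return (absolute gap in percentage points, relative increase in percent)."""
    return wb - nb, 100.0 * (wb - nb) / nb


for label, wer in reported.items():
    absolute, relative = wer_gap(wer["8 kHz"], wer["16 kHz"])
    print(f"{label}: +{absolute:.2f} pp, +{relative:.1f}% relative")
```

The absolute gap is 5.37 percentage points at about 1 kbps and 1.60 percentage points at 1.5 kbps, consistent with the "> 5 percentage points" statement above.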

5.3 DAC vs EVS Comparison

Competitive Advantage:

  • DAC achieves comparable WER scores at significantly lower bitrates than EVS
  • DAC 16 kHz performance curves converge towards high-quality PCM baselines faster than traditional codecs

ULBC Application: For GEO scenarios in [1-3] kbps range, semantic preservation is critical for defining quality floor.

5.4 EVS Narrowband vs Wideband Analysis

Performance at Different Bitrates:

  • At 13.2 kbps (highest tested):
    • EVS-NB: 3.16%
    • EVS-WB: 3.14%
    • Nearly identical, indicating saturation point in semantic quality
  • At 5.9-8.0 kbps range:
    • EVS-WB maintains marginal advantage (e.g., at 8.0 kbps: WB 3.15% vs. NB 3.41%)
    • Both modes provide sufficient basic audio quality
  • At 9.6 kbps:
    • EVS-NB: 3.19%
    • EVS-WB: 3.23%
    • NB performance very close to WB, difference within ASR model error margin

Conclusion: For semantic understanding, NB bandwidth limitation is less critical than codec's bit allocation efficiency.

5.5 EVS Degradation Analysis

Methodology: Calculated WER Degradation = (WER_coded - WER_baseline) / (100 - WER_baseline) to isolate codec processing impact from ASR model variance.
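The degradation metric can be written as a small function. The sketch below implements the formula exactly as quoted, with WER values in percent. The EVS WER inputs are the 13.2 kbps figures reported in this summary, while the PCM baseline values are hypothetical placeholders, since the summary does not list them.

```python
def wer_degradation(wer_coded: float, wer_baseline: float) -> float:
    """Normalized degradation; both inputs are WER in percent (0-100).

    Implements (WER_coded - WER_baseline) / (100 - WER_baseline).
    """
    if not 0.0 <= wer_baseline < 100.0:
        raise ValueError("baseline WER must be in [0, 100)")
    return (wer_coded - wer_baseline) / (100.0 - wer_baseline)


# Coded WER: EVS at 13.2 kbps as reported in the summary.
# Baseline WER: hypothetical placeholders for the NB and WB PCM baselines.
cases = {
    "EVS-NB": (3.16, 3.00),
    "EVS-WB": (3.14, 3.10),
}
for name, (coded, baseline) in cases.items():
    print(f"{name}: degradation = {wer_degradation(coded, baseline):.5f}")
```

A value of 0 means the codec adds no errors beyond the uncoded baseline, and a value of 1 would mean every remaining word is lost. Normalizing by the baseline's headroom lets NB and WB be compared even when their PCM baselines differ slightly.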

Key Findings:

  • Semantic loss introduced by EVS in both NB and WB modes is minimal
  • Degradation metric confirms that pure coding loss of NB and WB is statistically indistinguishable when subtracting baseline PCM variance
  • Additional frequency content in wideband contributes negligible semantic information for machine understanding compared to core NB spectrum

Strategic Implication: Robust NB mode is sufficient for intelligibility requirements of critical last resort communications, without bit starvation risk associated with wider bandwidths at low bitrates.

5.6 Summary of Findings

Strategic Conclusions for ULBC Design:

  1. For ~1 kbps emergency/GEO scenarios:
    • Semantic intelligibility is paramount
    • NB and WB offer comparable semantic preservation
    • Enforcing wider bandwidth at extremely low bitrates is risky due to limited bit budget
    • Narrowband is the superior design choice at the lowest bitrates, allowing the encoder to focus bits on the basic voice quality foundation
  2. AI-based codec sampling rate optimization:
    • DAC 16 kHz model provides a distinct advantage over the 24 kHz model at lower bitrates
    • 8 kHz model (trained for only 200k steps) outperforms the official 16 kHz model at low bitrates
    • Optimizing sampling rate to match available bit budget is critical for system design
    • Performance of intermediate rates (e.g., 12 kHz) remains an open question

6. Proposals

Proposal: Include relevant content from Sections 3, 4, and 5 into TR 26.940, capturing:

  • Methodology
  • Experimental setup
  • Analysis of results concerning audio bandwidth impact on semantic intelligibility

7. Detailed Results Tables

Complete experimental data provided in appendix tables covering:

  • Table 1.a: DAC Model Results (16/24/44 kHz) across bitrates 500-7751 bps
  • Table 1.b: DAC NB Model Results (8 kHz) across bitrates 312-1875 bps
  • Table 2: EVS & PCM Baseline Results for NB/WB modes at 5900-13200 bps

China Mobile Com. Corporation
[FS_ULBC]pCR on Existing codec technologies

Summary of pCR on Existing Codec Technologies (S4-260154)

Document Information

  • Source: China Mobile Com. Corporation
  • Specification: 3GPP TR 26.940 V0.5.1
  • Meeting: TSG-SA WG4 Meeting #135, Goa, India, 09-13 February 2026

Purpose and Scope

This pCR proposes updates to Clause 7.1 of TR 26.940, which documents existing codec technologies for evidence that design criteria can be met and for comparison/evaluation purposes. The document adds information about recently emerged ultra-low bit-rate voice codecs (below 1 kbps) as reference for further work.

Main Technical Contributions

Expanded Codec Technology Reference Table

The pCR significantly expands Table 7.1.1-1 "List of existing codec technologies" by adding multiple categories of codecs beyond the existing 3GPP IMS codecs. The table includes the following parameters for each codec:

  • Source/Reference
  • Audio bandwidth (NB/WB/SWB/FB)
  • Codec delay (ms)
  • Frame duration (ms)
  • Bitrates (kbps)
  • Specification access/software availability

New Codec Categories Added

1. Conventional Ultra Low Bitrate Codecs

  • MELP/MELPe: 0.6-2.4 kbps, NB, 22.5-90ms frame duration
  • AMBE-LR: 1.6-1.8 kbps, NB
  • MPEG-HVXC: 2-4 kbps, NB
  • TWELP MR: 0.3-3.2 kbps, NB, various frame durations (40-120ms)
  • Codec2: 0.45-2.4 kbps, NB, primarily 40ms frames

2. AI-Based Decoders

  • WaveNet Codec2: 2.4 kbps, WB, 20ms frames
  • CQNV Codec2: 1.0-1.1 kbps, WB, 40-60ms frames

3. AI-Based Encoder and Decoder (Causal)

These codecs support real-time operation:

  • LPCNet: 1.6 kbps, WB, 40ms frames, 25ms delay
  • LyraV2 (SoundStream): 3.2-9.2 kbps, WB, 20ms frames
  • EnCodec: 1.5-24 kbps, 24kHz/FB, 0-1000ms delay, 13.3ms frames
  • Mimi-Codec: 0.55-1.1 kbps, 24kHz, 80ms frames, 0ms delay
  • TS3: 0.64-0.8 kbps, WB, 20ms frames, 0ms delay
  • TAAE: 0.4-0.7 kbps, WB, 20-40ms frames, 0ms delay
  • LMCodec2: Parameters TBD

4. AI-Based Encoder and Decoder (Non-Causal)

These codecs are designed for offline/non-real-time applications:

  • DAC: 0.5-3 kbps, WB/24kHz, 244-366ms delay
  • DAC-IBM: 0.75-3 kbps, 24kHz, 366ms delay
  • SNAC: 0.98 kbps, 24kHz, 1000ms delay, 80ms frames
  • SpeechTokenizer: 0.5-1.0 kbps, WB, full-signal delay
  • SemantiCodec: 0.31-1.4 kbps, WB, 10-40ms frames, full-signal delay
  • FunCodec: 0.25-1.0+ kbps, WB, 20-40ms frames
  • WavTokenizer: 0.25-0.9 kbps, 24kHz, 25-40ms frames
  • BigCodec: 1.04 kbps, WB, 12.5ms frames
  • FocalCodec: 0.16-0.65 kbps, WB, 20-80ms frames
  • ALMTokenizer: 0.41 kbps, WB, 13.3ms frames
  • XY-Tokenizer: 1 kbps, WB, 20ms frames
  • LongCat-Audio-Codec: 0.43-0.87 kbps, WB, 60ms frames
  • AcademiCodec: Parameters TBD
  • MuCodec: 0.35-1.35 kbps, FB

Additional Notes

The pCR includes several important notes:

  • Note 1: Some codecs may include noise suppression
  • Note 2: MPEG-HVXC decoder and reference encoder available only to MPEG members
  • Note 3: Codec2 uses 20ms overlapping FFT/iFFT with overlap-add
  • Note 4: Some codecs only have non-causal versions publicly available
  • Note 5: TWELP has a complete quality assessment testbench available despite lacking open reference implementation

An editor's note indicates that more codecs may be added to the table in future revisions.

Key Observations

The pCR demonstrates significant industry progress in ultra-low bitrate speech coding, particularly:

  • Multiple AI-based solutions achieving sub-1 kbps bitrates
  • Wide range of delay characteristics (0ms to 1000ms)
  • Various bandwidth support (NB to FB)
  • Different availability levels for specifications and software implementations

vivo Mobile Communication Co., Xiaomi Technology, Spreadtrum, Bytedance
[FS_ULBC] Analysis of AI Codec Real-Time Performance (RTF) and Complexity Scaling

Analysis of AI Codec Real-Time Performance (RTF) and Complexity Scaling

1. Introduction and Motivation

This contribution addresses a critical gap in the Ultra Low Bitrate Speech Codec (ULBC) study by moving beyond theoretical complexity metrics (FLOPs, WMOPS) to evaluate real-world performance on mobile devices. The key observation is that static metrics fail to capture system-level bottlenecks including memory bandwidth pressure and thermal constraints on mobile SoCs. The document presents a comprehensive RTF analysis of a neural audio codec (based on Descript Audio Codec architecture) across multiple model sizes and sample rates on representative mid-range mobile hardware.

2. Experimental Setup

2.1 Model Configuration

Eight model variants were evaluated, ranging from enc8dec144 to enc64dec1536, with parameter counts spanning 1M to 74M:

  • Architecture: Fully convolutional encoder-decoder with Residual Vector Quantization (RVQ)
  • Frame length: 40ms (fixed across all variants)
  • Total up/down-sampling factor: 320 (consistent across variants)
  • Sample rates tested: 8 kHz (320 samples), 16 kHz (640 samples), 32 kHz (1280 samples)
  • Export format: ONNX with Float32 precision

Key complexity observations from Table 1:

  • Parameter counts range from 1.09M (enc8dec144) to 74.50M (enc64dec1536)
  • Model sizes range from 4.3 MB to 283.6 MB
  • Computational complexity scales proportionally with sample rate (e.g., enc32dec768: 4955.9 MFlops/s @ 8kHz, 9972.6 MFlops/s @ 16kHz, 20006.1 MFlops/s @ 32kHz)

2.2 Device Under Test (DUT) Environment

  • Platform: MediaTek Dimensity 1200 (6nm) - representative mid-range SoC
  • Inference engine: ONNX Runtime v1.14+ with CPU execution provider (single-threaded)
  • CPU clusters tested:
      • Efficiency cluster: Cortex-A55
      • Performance cluster: Cortex-A78
      • Prime core: Cortex-A78+
  • Methodology: Frequency-locked operation with disabled thermal services and power HALs to eliminate dynamic frequency scaling noise

3. Results and Analysis

3.1 Complexity Scaling vs. Bandwidth

Critical finding: For a given model variant, computational complexity scales linearly with sample rate. For enc32dec768:

  • 8 kHz: ~0.20 GFLOP counts (4955.9 MFlops/s)
  • 16 kHz: ~0.40 GFLOP counts (9972.6 MFlops/s), a 2x increase over 8 kHz
  • 32 kHz: ~0.80 GFLOP counts (20006.1 MFlops/s), a 4x increase over 8 kHz

Implication: Higher sampling rates incur proportional computational penalty. For resource-constrained devices (IoT, wearables), NB mode at 8 kHz is recommended.
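
The scaling factors quoted above can be re-derived from the reported throughput figures. The sketch below is illustrative only: it assumes the "GFLOP counts" are per 40 ms frame (the frame length given in clause 2.1), and the function names are hypothetical.

```python
# Reported enc32dec768 throughput from the summary: sample rate (kHz) -> MFLOP/s.
REPORTED_MFLOPS = {8: 4955.9, 16: 9972.6, 32: 20006.1}
FRAME_LENGTH_S = 0.040  # 40 ms frame, fixed across all variants


def gflop_per_frame(mflops_per_s: float, frame_length_s: float = FRAME_LENGTH_S) -> float:
    """Convert a sustained MFLOP/s rate into GFLOP per frame (assumed interpretation)."""
    return mflops_per_s * frame_length_s / 1000.0


def scaling_vs_8khz() -> dict:
    """Ratio of each reported load to the 8 kHz load."""
    base = REPORTED_MFLOPS[8]
    return {rate: value / base for rate, value in REPORTED_MFLOPS.items()}


if __name__ == "__main__":
    ratios = scaling_vs_8khz()
    for rate, mflops in REPORTED_MFLOPS.items():
        print(f"{rate} kHz: {gflop_per_frame(mflops):.2f} GFLOP/frame, "
              f"{ratios[rate]:.2f}x the 8 kHz load")
```

Under that assumption the per-frame figures and the 2x/4x ratios agree with the values listed in the summary to within rounding.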

3.2 Real-Time Factor (RTF) Analysis Across Three Frequency Tiers

3.2.1 Tier 1: Low Frequency (A55@750MHz, A78@902MHz, A78+@1.1GHz)

Energy-conserving state with severe constraints:

  • Cortex-A55 @ 750 MHz: Only smallest models (enc8dec144) maintain real-time at 8 kHz; 16/32 kHz unfeasible
  • Cortex-A78 @ 902 MHz:
      • 32 kHz: Limited to <3M parameters
      • 16 kHz: Supports up to ~8M parameters
      • 8 kHz: Supports up to ~10M parameters
  • Cortex-A78+ @ 1.108 GHz: Similar to A78 but extends 16 kHz limit closer to 10M parameters

3.2.2 Tier 2: Mid Frequency (A55@1.0GHz, A78@1.16GHz, A78+@1.37GHz)

Typical sustained workload state:

  • Cortex-A55 @ 1.0 GHz: 8 kHz supports up to ~2M parameters; 16/32 kHz remain largely unfeasible
  • Cortex-A78 @ 1.162 GHz:
      • 32 kHz: ~5M parameter limit
      • 16 kHz: ~10M parameters (covers "Low Complexity" profile)
      • 8 kHz: Robust up to ~20M parameters
  • Cortex-A78+ @ 1.37 GHz: Performance parity with A78 (clock speed is primary differentiator)

3.2.3 Tier 3: High Frequency (A55@1.73GHz, A78@1.45GHz, A78+@1.63GHz)

High-performance state approaching sustained limits:

  • Cortex-A55 @ 1.73 GHz:
      • 8 kHz: ~3M parameters
      • 16 kHz: ~2M parameters
      • 32 kHz: ~1M parameters
  • Cortex-A78 @ 1.451 GHz:
      • 32 kHz: ~7M parameters
      • 16 kHz: ~10M parameters
      • 8 kHz: ~20M parameters
  • Cortex-A78+ @ 1.632 GHz: Highest headroom
      • 32 kHz: ~8M parameters
      • 16 kHz: Comfortably supports 10M parameters
      • 8 kHz: ~20M parameters

Key observation: Inverse relationship between sample rate and model size capacity is consistently demonstrated.

3.3 Maximum Performance Envelope

Analysis at peak locked frequencies establishes absolute upper bounds:

3.3.1 Efficiency Core (Cortex-A55 @ 2.0 GHz)

Even at peak frequency, A55 remains highly constrained. Models exceeding ~5M parameters (enc16dec384) fail real-time constraints at 8 kHz and above. Unsuitable for large weight matrices.

3.3.2 Performance Core (Cortex-A78 @ 2.6 GHz)

Most relevant benchmark for ULBC - represents sustained compute capability of modern mobile devices.

Critical "Complexity vs. Bandwidth" trade-off identified:

  • 32 kHz: RTF crosses 1.0 near 10M parameters (enc24dec576 variant)
      • Hard limit for High-Fidelity ULBC candidates
  • 16 kHz: Feasible model size effectively doubles to ~20M parameters (enc32dec768 variant)
      • enc40dec960 fails real-time constraints
      • Linear relationship between bandwidth reduction and parameter capacity
  • 8 kHz: Extends to ~39M parameters
      • enc40dec960 (29M) is safe
      • Trend suggests failure before enc64dec1536

3.3.3 Prime Core (Cortex-A78+ @ 3.0 GHz)

Results mirror A78 trends with slight improvements due to higher clock frequency. The bandwidth bottleneck remains dominant - higher clock speed provides safety margin for borderline models (e.g., enc24dec576 @ 32kHz) but doesn't fundamentally shift feasible model size category.

4. Key Technical Contributions

4.1 Quantified Complexity-Bandwidth Trade-off

Established precise inverse relationship: halving sample rate approximately doubles feasible parameter count on performance cores:

  • 32 kHz → 10M parameters
  • 16 kHz → 20M parameters
  • 8 kHz → 39M parameters
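
One way to read the inverse relationship is that the product of sample rate and feasible parameter count stays roughly constant. The sketch below checks this against the three reported limits; the dictionary and function names are hypothetical, and the limits are the approximate values quoted in the summary, not new measurements.

```python
# Approximate real-time limits on the Cortex-A78 @ 2.6 GHz, as quoted above:
# sample rate (kHz) -> feasible parameter count (millions).
FEASIBLE_PARAMS_M = {32: 10.0, 16: 20.0, 8: 39.0}


def rate_param_products(limits: dict) -> dict:
    """Sample rate times feasible parameters; near-constant if the relation is inverse."""
    return {rate: rate * params for rate, params in limits.items()}


if __name__ == "__main__":
    for rate, product in sorted(rate_param_products(FEASIBLE_PARAMS_M).items()):
        print(f"{rate} kHz: {product:.0f} (kHz x M parameters)")
```

The three products differ by only a few percent, which is consistent with the "halving doubles" description.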

4.2 Real-World Performance Benchmarks

Provided concrete RTF measurements across representative mobile hardware configurations, revealing that:

  • Theoretical complexity metrics (FLOPs) don't capture real-world bottlenecks
  • Memory bandwidth and thermal constraints significantly impact feasibility
  • Efficiency cores (A55) are unsuitable for neural codec workloads beyond minimal complexity

4.3 Practical Complexity Constraints for ULBC

Identified 10M parameter hard limit for 32 kHz operation on mid-range mobile devices (A78 @ 2.6 GHz), providing concrete guidance for ULBC candidate selection.

5. Proposal

The contribution proposes including these RTF analysis findings in TR 26.940 to inform complexity constraint selection for ULBC candidates, moving the standardization process toward real-world deployability considerations rather than purely theoretical metrics.

vivo, Samsung, MediaTek Inc., Bytedance, Nokia, Xiaomi, Spreadtrum
[FS_ULBC] Discussion on Audio Bandwidth for ULBC

Technical Summary: Audio Bandwidth Requirements for ULBC

1. Introduction and Scope

This contribution addresses audio bandwidth design constraints for the Ultra-Low Bitrate Codec (ULBC), targeting primarily voice over GEO satellite communications. The document argues against mandatory Wideband (WB) and Super-Wideband (SWB) support, proposing instead that Narrowband (NB) should be mandatory with WB as an enhancement.

2. Key Technical Arguments

2.1 Global NB Usage and System Efficiency

Current Network Reality:

  • 2G/3G connections (primarily AMR-NB) still represent 20% of global technology mix (end of 2023)
  • Regional variations: 81% in Sub-Saharan Africa, 46% in Middle East and North Africa
  • NB serves as universal fallback for interoperability (CS fallback scenarios)

System Inefficiency Without NB Mode:

  • WB ULBC to NB user calls waste upper frequency band (4-8 kHz)
  • Significant bitrate wasted transmitting data that recipient cannot hear
  • Over expensive, scarce satellite link, this inefficiency is unacceptable
  • Native NB mode provides most efficient solution for legacy network connectivity

2.2 User Expectations in "Last Resort" Scenarios

Baseline Expectation Setting:

  • GEO call is final option after terrestrial network failure
  • Users typically experience AMR-NB fallback before resorting to GEO
  • ULBC must be at least as reliable as NB fallback to meet user expectations
  • WB-only ULBC failure in conditions where NB would work represents service failure

2.3 Primary Use Case: Emergency Communications

Typical Deployment Scenario:

  • Rescue teams in remote areas (e.g., Himalayan mountains)
  • Mixed-connectivity environment:
      • Squad A: GEO-only (outside TN coverage)
      • Squad B: GSM fallback at coverage fringe
      • Base Camp: PSTN connection (NB service)

Technical Implications:

  • Terminating endpoints predominantly NB
  • Emergency systems use traditional NB codecs (Codec2, MELP) for robustness
  • Transmitting WB over satellite to NB endpoint wastes critical resources in life-or-death situations
  • Real-world deployment example provided (China rescue missions)

Evaluation Priority:

  • ULBC candidates should prioritize intelligibility and robustness testing in NB mode

2.4 Performance at Very Low Bitrates

Quality vs. Bandwidth Trade-off:

  • Forcing wider bandwidth at very low bitrates spreads available data too thinly
  • Research shows lower sampling rates can achieve higher perceptual quality at very low bitrates
  • WB codec at ~1 kbps may compromise intelligibility, especially with packet loss
  • NB signal more robustly reconstructed under constrained conditions

Analogy: "Spreading butter" - concentrating bits on narrower bandwidth preserves speech richness and intelligibility

2.5 Complexity and Power Consumption

Computational Scaling Issues:

  • AI-based codec architectures don't scale gracefully
  • Doubling sampling rate (NB to WB): 2x to 4x complexity increase for CNN/Transformer models
  • WB-only mandate imposes unnecessary computational burden
  • Critical issue for power-constrained mobile devices
  • Native NB mode offers high-quality voice at significantly lower complexity/power budget

3. Experimental Analysis: Higher Bandwidth Inefficiency

3.1 Experiment Setup

Test Configuration:

  • Codec: Descript Audio Codec (DAC) with pre-trained models
  • Sampling rates tested: 44.1 kHz, 24 kHz (SWB), 16 kHz (WB)
  • Test corpus: 100 clean speech samples from MS-SNSD dataset
  • Bitrate variation: 1-9 active quantization codebooks
  • Quality metric: ViSQOL algorithm (speech mode, MOS estimate)

Model Specifications:

| Model | Compression | Frame Rate | Codebooks | Bitrate/Codebook |
|-------|-------------|------------|-----------|------------------|
| 16 kHz (WB) | 320x [2,4,5,8] | 50 Hz | 12 (10-bit) | 0.50 kbps |
| 24 kHz (SWB) | 320x [2,4,5,8] | 75 Hz | 32 (10-bit) | 0.75 kbps |
| 44.1 kHz | 512x [2,4,8,8] | ~86.1 Hz | 9 (10-bit) | ~0.86 kbps |
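
The frame-rate and bitrate-per-codebook columns follow from the sample rate, the compression factor, and the 10-bit codebook size. The sketch below reproduces them; the function names are hypothetical, and it assumes each active codebook contributes one 10-bit index per frame.

```python
def frame_rate_hz(sample_rate_hz: float, compression: int) -> float:
    """Latent frames per second after the total down-sampling factor."""
    return sample_rate_hz / compression


def bitrate_per_codebook_kbps(sample_rate_hz: float, compression: int,
                              bits_per_code: int = 10) -> float:
    """Bitrate of one codebook, assuming one index of bits_per_code bits per frame."""
    return frame_rate_hz(sample_rate_hz, compression) * bits_per_code / 1000.0


def total_bitrate_kbps(sample_rate_hz: float, compression: int,
                       active_codebooks: int, bits_per_code: int = 10) -> float:
    """Bitrate when only the first active_codebooks codebooks are transmitted."""
    return active_codebooks * bitrate_per_codebook_kbps(
        sample_rate_hz, compression, bits_per_code)


if __name__ == "__main__":
    for name, rate, factor in [("16 kHz", 16000, 320),
                               ("24 kHz", 24000, 320),
                               ("44.1 kHz", 44100, 512)]:
        print(f"{name}: {frame_rate_hz(rate, factor):.1f} Hz, "
              f"{bitrate_per_codebook_kbps(rate, factor):.2f} kbps per codebook")
```

For example, `total_bitrate_kbps(16000, 320, 5)` gives 2.5 kbps, i.e. five active codebooks of the 16 kHz model.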

3.2 Key Experimental Findings

Quality vs. Bitrate Results:

  • WB (16 kHz): Achieves excellent quality (ViSQOL MOS > 4.0) at ~2.5 kbps
  • 24 kHz SWB: Requires higher bitrate to match WB quality
  • 44.1 kHz: Provides minimal perceptible improvement over 24 kHz SWB
  • Conclusion: Bitrate cost of SWB not justified by quality improvement for voice content

Efficiency Analysis:

  • Clear trend: diminishing returns for bandwidth beyond WB
  • SWB/FB represents inefficient use of bandwidth for ULBC service

4. Proposed Design Constraints

4.1 Bandwidth Requirements

Mandatory Support:

  1. 8 kHz sampling rate (NB): 50-4000 Hz audio bandwidth
  2. 16 kHz sampling rate (WB): 50-8000 Hz audio bandwidth
     • Enhanced quality where channel conditions and device capabilities permit
     • WB support can be limited to higher bitrates than NB operation

Further Study:

  • Necessity and feasibility of SWB and FB support remains FFS

4.2 Text Proposal for TR 26.940

Change to Table 6.2-1 (Design Constraint Parameters):

Sample rate and audio bandwidth:

  • The ultra low bitrate codec shall support sampling rates of 8kHz (NB) and 16kHz (WB)
  • Supported audio bandwidth:
      • NB: 50-4000 Hz
      • WB: 50-8000 Hz

5. Supporting Evidence Summary

Quantitative Data:

  • 20% global 2G/3G connections (hundreds of millions of users)
  • Regional NB dominance: up to 81% in some areas
  • WB achieves MOS > 4.0 at 2.5 kbps
  • 2x-4x complexity increase for WB vs. NB in AI codecs

Qualitative Arguments:

  • System efficiency (no wasted bandwidth to NB endpoints)
  • User expectation alignment (last resort reliability)
  • Emergency use case requirements
  • Computational/power constraints for mobile devices
  • Diminishing returns for SWB/FB at target bitrates

vivo Mobile Communication Co.,
[FS_ULBC] Analysis of AI Codec Complexity Scaling

Complexity Analysis of AI Codec Scaling for ULBC

1. Introduction

This contribution addresses the need for establishing relevant complexity evaluation methods for the new ULBC codec standardization. Previous contributions (e.g., S4aA250264) highlighted potential gaps between theoretical complexity metrics (FLOPs) and practical on-device performance (Real-Time Factor).

This document provides a complementary analysis focusing on how complexity metrics scale with AI model architecture itself. The analysis investigates the relationship between model architecture, theoretical complexity, and traditional metrics using the publicly available DAC codec as a test case.

2. Analysis of AI Codec Complexity Scaling

2.1. Methodology

The analysis created seven "dummy" model variants based on the open-source DAC codec's 16kHz configuration. The approach:

  • Base Configuration:
      • Sample rate: 16kHz
      • Encoder dimension: 64
      • Encoder rates: [2, 4, 5, 8]
      • Decoder dimension: 1536
      • Decoder rates: [8, 5, 4, 2]

  • Scaling Approach:
      • Only encoder_dim and decoder_dim were modified
      • Encoder/decoder rates kept constant across all variants
      • Total up/down-sampling factor maintained at 320 (2×4×5×8 = 8×5×4×2)
      • Frame size: 20ms (320 samples at 16kHz)

  • Variant Configurations:
      • enc8dec144
      • enc12dec288
      • enc16dec384
      • enc24dec576
      • enc32dec768
      • enc40dec960
      • enc64dec1536

Complexity Metrics Measured:

  1. Model Parameters (Millions): Total trainable parameters
  2. Theoretical Complexity (MFLOP/s): Calculated using thop profiling library (aligned with S4aA250264 and S4aA250231)
  3. WMOPS: Traditional methodology using ITU-T STL wmc_tool, measured separately for encoder and decoder

Implementation Notes:

  • Each AI operation implemented in pure C
  • Source files annotated and compiled using wmc_tool
  • WMOPS highly sensitive to C implementation efficiency
  • Naive implementations can yield significantly higher counts than optimized versions

2.2. Complexity vs. Model Dimensions

Key Findings:

  • Clear non-linear relationship between latent dimensions and resulting parameters/computational load
  • Model parameters and MFLOP/s scale quadratically (or faster), not linearly, as encoder_dim and decoder_dim increase
  • Results visualized in Figure 1 (Parameters vs. Dimension) and Figure 2 (MFLOP/s vs. Dimension)
  • Encoder and decoder points are linked pairs corresponding to bundled setups

2.3. WMOPS vs. Model Parameters

Key Finding: Clear relationship between AI model size (in millions of parameters) and traditional WMOPS complexity.

Observations on DAC Model:

  1. Clear correlation between number of model parameters and resulting WMOPS when using same architecture with same C optimization level
  2. Decoder complexity is substantially higher than encoder complexity for all variants and scales significantly faster (DAC allocates more parameters and complexity to the decoder to achieve better reconstructed audio quality)
  3. Growth in WMOPS appears linear relative to increase in parameters for both encoder and decoder

2.4. Summary of Scaled Variants

Complete complexity metrics for all seven DAC variants (16kHz, 20ms frame):

| Variant | Enc Dim | Dec Dim | Params (M) | GFLOP counts | MFLOP/s | WMOPS Enc | WMOPS Dec |
|---------|---------|---------|------------|--------------|---------|-----------|-----------|
| enc8dec144 | 8 | 144 | 1.09 | 0.009 | 437.09 | 333.92 | 760.53 |
| enc12dec288 | 12 | 288 | 2.89 | 0.028 | 1397.63 | 648.23 | 2732.96 |
| enc16dec384 | 16 | 384 | 4.94 | 0.050 | 2481.98 | 1060.79 | 4724.38 |
| enc24dec576 | 24 | 576 | 10.76 | 0.112 | 5578.38 | 2228.92 | 10399.00 |
| enc32dec768 | 32 | 768 | 18.90 | 0.198 | 9911.72 | 3693.56 | 18093.30 |
| enc40dec960 | 40 | 960 | 29.34 | 0.310 | 15482.00 | 5599.48 | 28019.70 |
| enc64dec1536 | 64 | 1536 | 74.50 | 0.792 | 39614.50 | 13675.30 | 70766.69 |

Data demonstrates rapid scaling of all metrics as encoder and decoder dimensions increase.
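
The quadratic-in-dimension and linear-in-parameters trends can be checked directly from the table. The sketch below is an illustration, not the contribution's own analysis: it fits a least-squares slope in log-log space (a slope near 2 indicates quadratic growth) and computes decoder WMOPS per million parameters. The helper names are hypothetical; the numbers are copied from the table.

```python
import math

# (decoder dim, parameters in millions, decoder WMOPS), copied from the table above.
VARIANTS = [
    (144, 1.09, 760.53),
    (288, 2.89, 2732.96),
    (384, 4.94, 4724.38),
    (576, 10.76, 10399.00),
    (768, 18.90, 18093.30),
    (960, 29.34, 28019.70),
    (1536, 74.50, 70766.69),
]


def loglog_slope(xs, ys):
    """Least-squares slope of log(y) against log(x); about 2 means quadratic growth."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx, my = sum(lx) / len(lx), sum(ly) / len(ly)
    sxy = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    sxx = sum((a - mx) ** 2 for a in lx)
    return sxy / sxx


def decoder_wmops_per_mparam():
    """Decoder WMOPS divided by parameter count (millions) for each variant."""
    return [wmops / params for _, params, wmops in VARIANTS]


if __name__ == "__main__":
    dims = [d for d, _, _ in VARIANTS]
    params = [p for _, p, _ in VARIANTS]
    print(f"log-log slope, params vs decoder dim: {loglog_slope(dims, params):.2f}")
    print("decoder WMOPS per million params:",
          [round(r) for r in decoder_wmops_per_mparam()])
```

On these figures the fitted exponent comes out a little under 2, and the WMOPS-per-parameter ratio is nearly constant from enc12dec288 upward, which matches the linear relationship described in clause 2.3.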

3. Observations and Conclusions

Based on the DAC model variant analysis:

  1. Linear Relationship: For the DAC model, there is a clear linear relationship between Theoretical Complexity (MFLOP/s), Model Parameters, and measured WMOPS. As MFLOP/s or parameter count increases, WMOPS increases linearly, provided C coding style remains consistent.

  2. Quadratic Growth: Increasing model's internal dimensions causes complexity to grow quadratically. Even small dimension increases lead to disproportionately large jumps in MFLOP/s and WMOPS.

  3. Implementation Dependency: WMOPS score depends heavily on source C code efficiency.

4. Proposal

It is proposed to capture the above analysis into 3GPP TR 26.940.

vivo, Samsung, Spreadtrum, MediaTek Inc.
[FS_ULBC] On codec bitrate and capacity discussion for ULBC

Summary of 3GPP CR S4-260159: On Codec Bitrate and Capacity Discussion for ULBC

1. Introduction

This CR addresses the TBS (Transport Block Size) and codec bitrate values for ULBC (Ultra Low Bitrate Codec) evaluation, which are currently noted as 'companies reported' in TR 26.940 v0.4.0. The contribution provides analysis on:

  • Multiplexed UE number analysis
  • Confirmation of TBS/Codec bitrate values

2. Technical Analysis

2.1 Multiplexed UE Number Analysis

The document presents a methodology for calculating supported UE numbers considering:

  • TDM (Time Division Multiplexing): Both UL and DL can schedule different UEs in TDM manner
  • FDM (Frequency Division Multiplexing): UL can additionally use FDM since NPUSCH may occupy few subcarriers within 180kHz bandwidth
      • FDM capacity: 48 UEs for 3.75kHz SCS (single tone)
      • FDM capacity: 12 UEs for 15kHz SCS (single tone)
  • Bi-directional constraint: Final supported UE number is the minimum of UL and DL capacity
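
The single-tone FDM capacities are the number of subcarriers that fit in the 180 kHz carrier, and the final figure is capped by the weaker direction. The sketch below shows only these two steps; the function names and the example DL capacity are hypothetical, and the TDM scheduling part of the methodology is not modeled.

```python
CARRIER_BANDWIDTH_KHZ = 180.0  # NB-IoT carrier bandwidth quoted above


def fdm_single_tone_capacity(scs_khz: float) -> int:
    """Number of single-tone UEs that fit side by side in the carrier."""
    return int(CARRIER_BANDWIDTH_KHZ // scs_khz)


def supported_ues(ul_capacity: int, dl_capacity: int) -> int:
    """Bi-directional constraint: the smaller of the UL and DL capacities."""
    return min(ul_capacity, dl_capacity)


if __name__ == "__main__":
    print(fdm_single_tone_capacity(3.75))  # 3.75 kHz SCS
    print(fdm_single_tone_capacity(15.0))  # 15 kHz SCS
    # Hypothetical DL capacity of 10 UEs, used for illustration only.
    print(supported_ues(fdm_single_tone_capacity(3.75), 10))
```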

2.2 Capacity Evaluation Results

Analysis conducted under 50-degree elevation channel model with 2% BLER:

Key Observations:

  • Observation 1: Higher UE transmit power leads to higher capacity (multiplexed UE number) for a given codec bitrate
  • Observation 2: For codec bitrate of ~3kbps, capacity is limited to ~10 UEs with 31dBm UE power. Capacity further degrades with increased bitrate (e.g., ≤5 UEs for 4.5kbps)

Performance characteristics:

  • 23 dBm UE power shows very poor capacity
  • Performance improves with higher power UEs (26 dBm, 31 dBm)
  • Capacity increases with ptime value

2.3 Benchmark Considerations

  • SA1 assumes transmission data rate of 1-3kbps as benchmark
  • Commercial GEO system (Tiantong) operates within 0.8-2.4kbps range (per clause 5.2.1.3 of reference [1])
  • Real-world deployments support focusing on bitrates below 3kbps

Additional analysis provided in Annex assuming 1% BLER under 10-degree elevation channel model.

3. Proposed Changes to TR 26.940

3.1 TBS and Codec Bitrate Tables

The CR proposes specific TBS values selected from TS 36.213 table 16.5.1.2-2 for NB-IoT NPUSCH, with corresponding PHY bitrates and codec bitrates calculated for each bundling period (assuming 7-byte packet header).

| Table | Bundling period | TBS range | PHY bitrate range | Codec bitrate range |
|-------|-----------------|-----------|-------------------|---------------------|
| Table 1 | 80ms | 88-256 bits | 1.1-3.2 kbps | 0.4-2.5 kbps |
| Table 2 | 160ms | 120-424 bits | 0.75-2.65 kbps | 0.4-2.30 kbps |
| Table 3 | 320ms | 208-808 bits | 0.65-2.52 kbps | 0.475-2.35 kbps |
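
The PHY and codec bitrate ranges above follow from the TBS, the bundling period, and the assumed 7-byte packet header. The sketch below reproduces the range endpoints; the function names are hypothetical, and it assumes the header is counted once per transport block, as stated in NOTE 2 of the CR.

```python
HEADER_BITS = 7 * 8  # 7-byte packet header assumed in the CR


def phy_bitrate_kbps(tbs_bits: int, bundling_ms: int) -> float:
    """PHY bitrate: transport block bits per bundling period (bits/ms equals kbps)."""
    return tbs_bits / bundling_ms


def codec_bitrate_kbps(tbs_bits: int, bundling_ms: int,
                       header_bits: int = HEADER_BITS) -> float:
    """Codec bitrate: payload left after removing one packet header per block."""
    return (tbs_bits - header_bits) / bundling_ms


if __name__ == "__main__":
    for bundling_ms, tbs_min, tbs_max in [(80, 88, 256), (160, 120, 424), (320, 208, 808)]:
        print(f"{bundling_ms} ms: "
              f"PHY {phy_bitrate_kbps(tbs_min, bundling_ms):.3f}-"
              f"{phy_bitrate_kbps(tbs_max, bundling_ms):.3f} kbps, "
              f"codec {codec_bitrate_kbps(tbs_min, bundling_ms):.3f}-"
              f"{codec_bitrate_kbps(tbs_max, bundling_ms):.3f} kbps")
```

Because the header is a fixed cost per block, longer bundling periods spend a smaller share of the PHY bitrate on overhead.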

3.2 Additional Notes

NOTE 1: Final packet header size depends on SA2 and RAN conclusions, including feasibility of 1-byte MAC header

NOTE 2: Packet header counted only once regardless of bundled voice frames

NOTE 3: Relationship between voice frame duration and bundling time depends on RTP payload design. Loss of single TB means loss of multiple consecutive voice frames when bundled.

4. Proposals

Proposal 1: Agree that the codec bitrate should be bounded to less than 3kbps

Proposal 2: Agree to the proposed changes to Section 5.2.2.2 (Uplink simulation parameters) of TR 26.940, including:

  • Updated TBS values and PHY bitrates tables
  • Voice bundling periods: 80ms, 160ms, 320ms (40ms excluded due to insufficient time for DL transmissions with 3.75kHz SCS)
  • Target BLER values: 1%, 2%, 6%, 10%
  • Maximum Achievable SNR formula incorporating UE power (23/26/31 dBm), bandwidth, and antenna gain variations

5. Supporting Information

The Annex provides additional multiplexed UE number analysis for different codec bitrates and UE power levels under 10-degree elevation channel model, supporting the main technical conclusions.

Dolby Laboratories Inc., Novamint, Nokia
On ULBC complexity and RTF analysis

Summary of S4-260165: ULBC Complexity and RTF Analysis

Background and Motivation

This contribution addresses the need to finalize complexity and memory design constraints for the ULBC (Ultra-Low Bitrate Codec) study. Previous discussions at SA4 #133-e and the ULBC ADHOC meeting explored various complexity metrics and RTF performance data for existing AI codecs (DAC, Lyra v2, HIL). However, insufficient data exists to draw definitive conclusions on complexity constraints for ULBC.

The document builds upon previous contribution S4-251844 with the following modifications:

  • Added CPU core information for experiments
  • Aligned RTF definition with TR 26.940 clause 7.5.3
  • Focused on model sizes 3-20M parameters (more relevant to ULBC use cases)
  • Provided pCR for TR 26.940
  • Removed large chunk-based processing experiments (not relevant for real-time voice communication)

Experimental Setup and Methodology

Model Configuration

Modified DAC architecture with reduced parameters while maintaining general structure:

  • Model sizes: 20M, 15M, 9M, and 3M parameters (float32 precision)
  • Training: Optimized for ~1 kbps bitrate at 32 kHz sampling rate
  • Encoder rates: 4,4,8,10 for all models

Complexity Analysis

Theoretical Complexity (GMACS):

  • Computed using ptflops library
  • Results show linear relationship between model size and GMACS:
      • 20M: 5.14 GMACS
      • 15M: 4.03 GMACS
      • 9M: 2.39 GMACS
      • 3M: 0.79 GMACS
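
A quick way to see the linear relationship is to divide each GMACS figure by the model size: a linear relation through the origin gives a near-constant ratio. The sketch below does that with the four reported values; the names are hypothetical and no new measurements are implied.

```python
# Model size (millions of parameters) -> reported GMACS, from the list above.
REPORTED_GMACS = {20: 5.14, 15: 4.03, 9: 2.39, 3: 0.79}


def gmacs_per_mparam(results: dict) -> dict:
    """GMACS divided by parameter count (millions) for each model size."""
    return {size: gmacs / size for size, gmacs in results.items()}


if __name__ == "__main__":
    for size, ratio in sorted(gmacs_per_mparam(REPORTED_GMACS).items()):
        print(f"{size}M: {ratio:.3f} GMACS per million parameters")
```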

RTF Testing Methodology

  • PyTorch models converted to ONNX format
  • ONNX runtime with XNNPACK execution provider
  • Frame-by-frame processing (80 ms frames)
  • Test duration: 2 minutes (1500 inferences per session)
  • 5 repetitions per experiment
  • Single-threaded execution
  • RTF calculation: max(inference time / frame length) across all frames
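
The RTF definition in the last bullet is a worst-case ratio over all frames of a session. A minimal sketch follows, assuming per-frame inference times are available in seconds; the function names and the example timings are hypothetical and do not come from the contribution.

```python
FRAME_LENGTH_S = 0.080  # 80 ms frames, as in the methodology above


def max_rtf(inference_times_s, frame_length_s: float = FRAME_LENGTH_S) -> float:
    """Worst-case real-time factor: max over frames of inference time / frame length."""
    if not inference_times_s:
        raise ValueError("at least one inference time is required")
    return max(t / frame_length_s for t in inference_times_s)


def session_duration_s(num_inferences: int, frame_length_s: float = FRAME_LENGTH_S) -> float:
    """Audio duration covered by a session of frame-by-frame inferences."""
    return num_inferences * frame_length_s


if __name__ == "__main__":
    example_times = [0.020, 0.050, 0.030]  # hypothetical per-frame timings in seconds
    print(f"max RTF: {max_rtf(example_times):.3f}")
    print(f"session length: {session_duration_s(1500):.0f} s")
```

Taking the maximum rather than the mean means a single slow frame (for example after a core or frequency switch) determines the reported RTF, and values below 1.0 indicate every frame was processed faster than real time.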

Experimental Results

Test Devices

Device 1 (2023):

  • Hexa-core CPU: 2×3.46 GHz (P core) + 4×2.02 GHz (E core)
  • Dynamic core switching observed between P and E cores

Device 2 (2022):

  • Octa-core CPU: 1×3.00 GHz Cortex-X2 + 3×2.50 GHz Cortex-A710 + 4×1.80 GHz Cortex-A510
  • Processing on Cortex-X2 with frequency switching between 2.4 GHz and 1.8 GHz

RTF Performance Results

| Model Size | Max RTF (High Performance) | Max RTF (Power Efficient) |
|------------|---------------------------|---------------------------|
| 20M | 0.39-0.63 | 0.81-0.9 |
| 15M | 0.29-0.43 | 0.66-0.74 |
| 9M | 0.19-0.29 | 0.44-0.57 |
| 3M | 0.09-0.13 | 0.18-0.31 |

Results demonstrate linear increase in RTF with model size across both performance modes.

Key Observations

  1. All tested models achieve RTF < 1.0, indicating real-time capability
  2. Significant RTF variation between high-performance and power-efficient modes
  3. Dynamic CPU core/frequency switching impacts performance
  4. 20M model shows max RTF=0.63 (high performance) and RTF=0.9 (power efficient)
  5. Smaller models (3M-9M) provide substantial RTF headroom for real-time operation

Proposed Text for TR 26.940

The contribution provides a comprehensive pCR adding new clause 6.2.1.7 "RTF and MACS analysis for AI based codecs" with detailed experimental results. Key additions to TR 26.940 include:

Complexity Considerations (Clause 6.2.1)

  • Real-time processing requirements for voice communication
  • Model size considerations (5-10M parameters for efficient operation)
  • Memory access and power consumption challenges with larger models

Complexity Metrics (Clause 6.2.1.4)

  • Discussion of NPU/TPU capabilities measured in TOPS
  • TOPS/W as power efficiency metric (2-15 TOPS/W range for smartphones)
  • MAC operations and MACS as practical complexity metrics
  • RTF as reliable complexity assessment metric
  • Comparison with traditional WMOPS metric

Target Devices (Clause 6.2.1.5)

  • NPUs present in most modern smartphones
  • Theoretical max TOPS: 8-59 TOPS (varying precision)
  • TOPS/W range: 2-15 TOPS/W
  • DAC codec estimate: ~150 Giga MAC/sec (~0.3 TOPS)
  • Note on DRAM operations significantly impacting power consumption

Key Conclusions (Clause 6.2.1.6)

  • ML codecs require careful model size and complexity optimization
  • NPUs offer 5-20× power efficiency vs CPUs for AI tasks
  • ULBC complexity constraints should not reference existing 3GPP speech codecs
  • Million MACS + model size provide first-order complexity indication
  • RTF useful but requires standardized test platforms
  • WMOPS not directly suitable for NPU-based AI solutions

Experimental Data (Clause 6.2.1.7)

  • Complete documentation of DAC-like architecture experiments
  • Detailed RTF and GMACS results for 3M-20M parameter models
  • Device specifications and performance characteristics

Proposal

Document the experimental methodology, results, and observations in clause 6.2.1 of TR 26.940 as shown in the provided pCR.

vivo Mobile Communication Co.,
[FS_ULBC] Discussion on Methodology for Delay & Error Trace Generation

Discussion on Methodology for Delay & Error Trace Generation for FS_ULBC

Introduction

This contribution addresses the ongoing debate within SA4 Audio SWG regarding the methodology for generating delay and error traces for Ultra Low Bitrate Codec (ULBC) evaluation under Non-Terrestrial Network (NTN) conditions. Two competing approaches have emerged:

  • Fixed BLER / Target Error Rate: Prioritizes "realistic" channel behavior by fixing a target BLER (e.g., 2% or 10%) and finding feasible Transport Block Sizes (TBS)
  • Fixed Resource / Link Budget: Prioritizes "fair resource usage" by fixing the SNR/Link Budget and allowing codec/modem to trade off bitrate against error robustness (Best Effort)

The contribution proposes clarifying the purpose of these simulations by distinguishing between Design and Verification phases.

Analysis of Current Approaches

2.1 The Precedent: LTE Simulation Methodology (TS 26.132)

Legacy Mechanism (Trace Generation)

The LTE MTSI testing methodology in TS 26.132 (Annex E and F) operated on "Stationary" conditions:

  • Input: BLER_tx (e.g., 10%) was a fixed input parameter
  • Process: The model assumed the network had converged to this average BLER and drew random block errors using the test if (rand(1) < BLER_tx)
  • Output: Traces reflecting packet losses and delays based on re-transmissions

Usage for Verification (Annex E & F)

Critically, TS 26.132 defined these traces as verification tools (System Testing):

  • UE Delay Verification (Annex E): Generated profiles verify UE can maintain synchronization and meet delay budgets under specific error conditions
  • JBM and PLC Evaluation (Annex F): Profiles constructed with deliberate impairments to verify robustness:
      • Jitter Bursts
      • Packet Inversions
      • Packet Duplication

Key Finding: Profiles were treated as Test Vectors to verify robustness against defined impairments, not as "realistic channel recordings" to train codec design.

The Shift for NTN

NTN scenarios introduce challenges that invalidate the LTE approach:

  • Cannot rely on simplistic i.i.d. (independent and identically distributed) random error models
  • NTN channel impairments (shadowing, scintillation) introduce complex, non-stationary error patterns
  • ULBC robustness directly influences tolerance levels, making fixed BLER targets inappropriate
  • Must pivot from 'Assumed BLER' model toward 'Derived Performance' model

2.2 Analysis of Current Approaches for FS_ULBC

Approach A: The "Realism" Perspective (Fixed BLER)

Methodology:

  • Define TBSs for each candidate bitrate and bundling time
  • Traverse all link parameters (SCS, Tone, etc.) to evaluate if resulting link budgets satisfy predefined Target BLER
  • Generate error trace for each configuration meeting BLER threshold
  • Number of output traces = Number of defined Target BLERs (for each TBS)

Underlying Assumption: AI-based Codecs (specifically PLC mechanisms) require specific "real" error patterns during training/design phase

Observation: Limits testing scope to specific "safe" operating points, potentially overlooking codec behavior under unexpected channel degradation

Approach B: The "Resource" Perspective (Fixed SNR)

Methodology:

  • Normalize TBS across all candidate codec bitrates assuming consistent packet overhead
  • For each unique Link Budget (fixed SNR) derived from specific UE, satellite, and link parameters, generate dedicated error traces
  • Number of output traces = Number of unique Link Budgets (for each TBS)

Underlying Assumption: Mimics "Best Effort" or competitive scenario similar to EVS selection, where end-to-end quality (MOS) matters more than intermediate BLER

Observation: Logically sound for optimizing system performance, but implies vast search space potentially leading to unmanageable simulation workload

2.3 The Core Issue: Verification vs. Design

The Logic Chain

The standard workflow should be:

Delay/Error Profiles Generation → Codec/PLC Verification → System Performance Evaluation

Misalignment

The current deadlock stems from treating RAN simulation outputs as Design Constraints (training data) rather than Verification Tools.

Key Principles:

  • Robustness over Overfitting: Robust Codec and PLC design should not rely on "learning" a specific channel trace from a specific simulator. Design should handle variety of harsh conditions (burst losses, high jitter, varying BLER). Data augmentation is standard practice for training robust AI models.

  • The Role of Traces: As in TS 26.132 Annex F, generated traces serve as "Test Vectors" defining challenging conditions under which the Codec must survive. Whether traces represent 90% or 99% of real-world cases is secondary to sufficiently stress-testing JBM and PLC algorithms.

  • Historical Practice: Delay/Error profiles officially generated by SA4 were never distributed to codec proponents for training purposes; they were solely used to verify codec candidates fulfill design constraints and performance requirements.

2.4 Proposal for the Way Forward

Re-orient simulation efforts towards generating a Verification Suite rather than a "Perfect Reality Model":

  • Avoid Excessive "Realism" Filtering: Do not discard simulation results simply because they don't meet strict low-BLER threshold. High BLER conditions are valid "Corner Cases" that ULBC must handle, especially in satellite scenarios with tight link budgets.

  • Limit the Search Space: Select representative subset of challenging conditions (e.g., Deep Fading, High Doppler) at fixed SNR points resulting in range of BLERs (e.g., from <1% up to >10%).

  • Verification Focus: Output traces should verify candidate codecs degrade gracefully under varied conditions. Burden is on Codec proponent to design PLC that works across these profiles, not on RAN simulation group to provide "training set" guaranteeing codec works.

Proposal: Multi-point Fine-grained Trace Generation (MFTG)

The MFTG methodology aims to decouple physical layer simulation assumptions from application-layer codec design by providing a high-resolution library of error traces rather than a single static operating point.

Step 1: Resource Baseline Normalization (TBS Definition)

  • Define set of Reference Transport Block Sizes (TBS) based on unified packet overhead
  • Keep TBS values consistent across all candidate codec bitrates to ensure fair comparison of resource efficiency

Step 2: Link Budget Mapping and Granularity Setup

  • Identify target range of Link Budgets (SNR/CNR) based on realistic NTN deployment scenarios (e.g., LEO/GEO, UE power classes)
  • Establish fine-grained sampling interval (e.g., 1% BLER to 10% BLER in steps of 1% or 2% from BLER perspective, or -5dB to 10dB in steps of 1dB from SNR perspective)

Step 3: Large-scale Link-Level Simulation (LLS)

  • Execute Monte Carlo simulations for each defined TBS at every fine-grained sampling interval
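
The size of the simulation campaign implied by Steps 1 to 3 can be sketched as a simple grid. This is an illustration only: the TBS list is a placeholder (not an agreed set), `snr_grid` is a hypothetical helper, and the SNR range is the -5dB to 10dB example from Step 2.

```python
from itertools import product

# Hypothetical reference TBS values in bits (placeholders, not agreed values).
reference_tbs_bits = [144, 256, 424]

def snr_grid(start_db, stop_db, step_db):
    """Inclusive list of SNR sampling points in dB."""
    count = int(round((stop_db - start_db) / step_db)) + 1
    return [start_db + i * step_db for i in range(count)]

snr_points_db = snr_grid(-5, 10, 1)  # -5 dB to 10 dB in 1 dB steps
simulation_cases = list(product(reference_tbs_bits, snr_points_db))
print(len(snr_points_db), len(simulation_cases))  # 16 48
```

Each (TBS, SNR) pair would correspond to one Monte Carlo run, so the workload grows with the product of the two lists.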

Step 4: Flexible Trace Selection for Verification

  • For Performance Comparison: Proponents selecting specific source bitrate can identify and utilize trace from library whose SNR/BLER most closely matches their design's intended link budget

  • For Robustness Testing: Proponents can select "stress-test" traces (e.g., those with higher BLER or specific jitter profiles) from same library to verify PLC and JBM algorithms
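
Step 4 can be read as two lookups against the trace library. The sketch below assumes each trace is tagged with its TBS, SNR, and measured BLER; the `Trace` record, both helper functions, and the library entries are hypothetical and do not come from the contribution.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Trace:
    tbs_bits: int
    snr_db: float
    bler: float  # measured block error rate of the trace

def closest_trace(library, tbs_bits, target_snr_db):
    """Performance comparison: trace whose SNR is nearest the target."""
    candidates = [t for t in library if t.tbs_bits == tbs_bits]
    return min(candidates, key=lambda t: abs(t.snr_db - target_snr_db))

def stress_traces(library, tbs_bits, min_bler):
    """Robustness testing: traces at or above a BLER threshold."""
    return [t for t in library if t.tbs_bits == tbs_bits and t.bler >= min_bler]

# Made-up library entries, for illustration only.
library = [
    Trace(424, -3.0, 0.12),
    Trace(424, -1.0, 0.05),
    Trace(424, 1.0, 0.01),
]
print(closest_trace(library, 424, -0.6).snr_db)
print(len(stress_traces(library, 424, 0.05)))
```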

Conclusion

While the source understands the rationale behind both the Fixed BLER approach and the Fixed Resource / Link Budget approach for GEO network simulation, a compromise solution is necessary for FS_ULBC to progress. MFTG is therefore proposed for consideration and agreement.

vivo Mobile Communication Co.,
[FS_ULBC] Proposed ULBC design constraints living document

Candidate Convenor for 3GPP Systems Aspects TSG - ULBC Design Constraints Living Document

1. Scope

This living document consolidates design constraints being considered within SA4 for FS_ULBC (Feasibility Study on Ultra-Low Bitrate Codec). Due to the working procedure requiring consensus agreements for design constraints to be integrated into ULBC PD or TR 26.940, and the lack of such consensus so far, this document captures the current status of design constraints even though some items are not fully agreed.

2. ULBC Design Constraints

2.1 Sampling Frequency and Audio Bandwidth

Design Constraint: Support of [8, 16, 32] kHz / [NB, WB, SWB] required [1], [2]

Editor's Notes and Open Issues:

  • Support of 8 kHz justified for interoperability; clarification needed on whether NB would be tested/supported "externally" based on external resampling
  • Support of 48 kHz may be considered at higher bitrate operation
  • Consideration of at least a single model (e.g., SWB)
  • Many neural codecs operate at 24 kHz; this specific sampling rate should be discussed
  • Complexity considerations associated with this parameter; joint decisions may be needed

Reference: NB audio typically sampled at 8 kHz (100-3500 Hz), WB at 16 kHz (50-7000 Hz), SWB at 32 kHz (50-14000 Hz), FB up to 20000 Hz

2.2 Number of Audio Channels

Design Constraint: ULBC candidate codecs shall support mono coding with one channel input and one channel output

2.3 Bit Rates

Design Constraint: ULBC candidate codecs shall operate at bitrates lower than [3.00] kb/s [3]

2.4 Frame Length

Design Constraint: Candidate codecs shall operate with a coding frame size that is a multiple of 20 ms

Note: Since larger than 20ms bundling time periods will be used, codec proponents should be allowed to consider solutions with larger than 20ms frame sizes

2.5 Algorithmic Delay

Design Constraint: Algorithmic delay shall be less than [coding frame size + x] ms

2.6 Complexity

Design Constraint: Complexity limits applied according to categories. Computational complexity and program ROM (PROM) of candidate codecs for each category shall be measured with ITU-T STL2009 [4] as observed worst-case encoder + observed worst-case decoder complexity within the same category [5], [6]

Categories:

  • Computational: Less than [x] wMOPS
  • Memory: RAM, ROM, Program ROM (values TBD)

Editor's Notes:

  • Model size per operation mode is less than [5-10] million parameters
  • Total number of parameters is less than [Z] million
  • ULBC Codec should be implementable on mobile device using today's technology
  • Increased computational complexity and memory usage should be commensurate with gain in quality of user experience (e.g., higher audio bandwidth such as SWB or stereo) or increased efficiency (e.g., lower bit rate for same quality compared to reference codec)

2.7 Potential Use of Noise Suppression as Part of the Codec

Design Constraint: If noise suppression is supported inside ULBC, there should be a mechanism to disable noise suppression in the codec [7], [8]

Editor's Notes - Clarifications Needed:

  • Need to support noise suppression in ULBC? (typically vendor specific, defined outside the codec)
  • Impacts on test methodology, DTX operation/performance

Motivations:

  • Disabling noise suppression required to test feature separately
  • Avoid tandeming in real operation
  • IMS voice communication defined in TS 22.228; GEO satellite access has no specific requirement on noise handling

2.8 Jitter Buffer Management (JBM)

Design Constraint: A JBM solution conforming to requirements in TS 26.114, except for the functional requirement in sub-clause 8.2.2 of TS 26.114: "Speech JBM used in MTSI shall support all the codecs as defined in clause 5.2.1", shall be provided with candidate codecs

2.9 Rate Switching

Design Constraint: Candidate codecs shall perform rate switching upon command to the encoder throughout the entire bit rate range at arbitrary frame boundaries. Rate switching may imply switching between different bandwidths

Note: Due to the Bundling period and associated TBS, switching might have to happen at the boundary of bundling period

2.10 Packet Loss Concealment (PLC)

Design Constraint: A PLC solution shall be provided by ULBC candidate codecs [9]

Editor's Notes:

  • Typical loss profiles/characteristics to be clarified
  • Support of redundancy to be clarified
  • Need to be able to handle BLER up to [x%]

2.11 RTP Payload Format

Design Constraint: Candidate codecs shall provide an RTP payload format specification supporting the full set of features and functionality of the ULBC candidate codecs

2.12 DTX

Design Constraint: Candidate codecs shall provide a complete VAD/DTX/CNG framework. It shall be possible to operate the codec with DTX on or DTX off

Editor's Note: Typical radio characteristics and optimizations (SPS, DRX, bitrate) to be clarified

2.13 Output Gain Limitation

Design Constraint: ULBC candidate codecs shall not amplify the output signal relative to the input signal beyond limits

Editor's Note: Similar limits and methodology to measure the amplification are described in the EVS-7a,b processing plan permanent document

3. References

[1] S4-251794 - Discussion on Audio Bandwidth for ULBC (vivo, Samsung, MediaTek Inc., Bytedance, Nokia, Xiaomi, Spreadtrum)

[2] S4-251808 - Pseudo-CR on Design Constraints of ULBC: Audio bandwidth (Fraunhofer IIS)

[3] S4-251792 - On codec bitrate and capacity discussion for ULBC (vivo, Samsung, Spreadtrum, MediaTek Inc.)

[4] ITU-T G.191 - Software tools for speech and audio coding standardization (March 2010)

[5] S4-251747 - On complexity constraints for ULBC (Huawei Technologies Co., Ltd.)

[6] S4-251807 - On complexity design constraints for ULBC (Fraunhofer IIS)

[7] S4-251395 - Pseudo-CR on Design Constraints of ULBC: Noise suppression (Fraunhofer IIS)

[8] S4-251748 - On noise suppression for ULBC (Huawei Technologies Co., Ltd.)

[9] S4aA250268 - Packet Loss Concealment with existing AI based codec DAC (Dolby Laboratories Inc., Ericsson LM, Nokia, Novamint)


Note: Items in light blue are candidates for agreement at SA4#135.

vivo Mobile Communication Co.,
[FS_ULBC] Alignment Analysis on Complexity of DAC model

Alignment Analysis on Complexity of DAC Model

1. Introduction

This contribution addresses a significant discrepancy in complexity reporting for AI-based codecs in the ULBC study. Two contributions (S4-260165 from Dolby et al. and S4-260155 from vivo et al.) both reported models with approximately 3M parameters but showed substantially different complexity metrics:

  • S4-260165: ~3M parameter model (32 kHz) requires 0.79 GMACS
  • S4-260155: ~3M parameter model (32 kHz) requires approximately 1.41 GMACS (derived from 2821 MFlops/s)

Notably, the S4-260165 model's complexity (0.79 GMACS) aligns more closely with the S4-260155 model operating at 16 kHz (~0.70 GMACS), despite the difference in sampling rate.

The contribution demonstrates that Model Size (parameter count) is an insufficient metric for constraining complexity across different neural architectures, and proposes GMACS as a robust, architecture-agnostic metric that provides linear correlation with RTF.

2. Architectural Analysis and Discrepancy Resolution

2.1 The "Model Size" Trap

A detailed breakdown comparison was performed between the two architectures to understand why models with similar parameter counts exhibit different computational footprints:

| Metric | [2] (16k, ~3M) | [1] (32k, ~3M) |
|--------|----------------|----------------|
| Input Rate | 16,000 Hz | 32,000 Hz |
| Total Stride | 320 (2×4×5×8) | 1280 (4×4×8×10) |
| Latent Rate | 50.0 Hz | 25.0 Hz |
| Encoder MACs (M) | 436.30 | 461.92 |
| Quantizer MACs (M) | 2.25 | 0.50 |
| Decoder MACs (M) | 984.50 | 1037.12 |
| Total MFlops/s | 1423.05 | 1499.54 |

Key Analysis:

  • The S4-260165 (32k, ~3M) model runs at 2× higher input rate (32k vs 16k), increasing encoder computational cost
  • The S4-260165 model uses 4× higher stride (1280 vs 320), reducing the latent rate to 25Hz (compared to standard 50Hz)
  • The reduced latent rate significantly lowers decoder cost (fewer frames to upsample)
  • Higher input cost balances with lower decoder/latent cost, resulting in comparable total MFlops/s
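
The stride and latent-rate figures in the table follow from a one-line relation: latent rate equals input sample rate divided by the product of the per-layer strides. A minimal sketch, with the hypothetical helper name `latent_rate_hz`:

```python
from math import prod

def latent_rate_hz(sample_rate_hz, strides):
    """Latent frames per second = input rate / product of layer strides."""
    return sample_rate_hz / prod(strides)

model_16k = latent_rate_hz(16_000, (2, 4, 5, 8))    # total stride 320
model_32k = latent_rate_hz(32_000, (4, 4, 8, 10))   # total stride 1280
print(model_16k, model_32k)  # 50.0 25.0
```

Doubling the input rate while quadrupling the total stride therefore halves the number of latent frames the decoder has to process per second.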

Conclusion: Two models with identical parameter counts can have vastly different runtimes depending on parameter location (shallow vs. deep layers) and stride configuration.

2.2 Verification of Complexity Metrics

Theoretical complexity (GMACS) was recalculated to validate the analysis:

  • Using the standard conversion: GMACS ≈ MFlops/s / 1000 × 0.5
  • The S4-260165 (32k, ~3M) model at 32 kHz yields ~1,499.5 MFlops/s
  • Calculated GMACS: 1499.5 / 1000 × 0.5 ≈ 0.75 GMACS
  • This aligns closely with the reference value of 0.79 GMACS reported in S4-260165
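
The recalculation can be reproduced from the per-component figures in the breakdown table. The sketch below applies the conversion quoted above (one multiply-accumulate counted as two floating-point operations); the function name `gmacs_from_mflops` is illustrative.

```python
def gmacs_from_mflops(mflops_per_s):
    """GMACS ≈ MFlops/s / 1000 × 0.5, as quoted in the summary."""
    return mflops_per_s / 1000 * 0.5

# Encoder + quantizer + decoder figures from the 32 kHz column of the table.
total_32k = 461.92 + 0.50 + 1037.12
# Same three components from the 16 kHz column.
total_16k = 436.30 + 2.25 + 984.50

print(round(total_32k, 2), round(gmacs_from_mflops(total_32k), 2))  # 1499.54 0.75
print(round(total_16k, 2), round(gmacs_from_mflops(total_16k), 2))  # 1423.05 0.71
```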

3. GMACS as the Metric

When RTF data from S4-260155 is plotted against GMACS (rather than Model Size), the data aligns consistently across architectures.

Key Findings:

  • RTF scales linearly with GMACS across different CPU tiers (Efficiency, Performance, Prime cores)
  • A specific GMACS budget (e.g., 2.0 GMACS) yields predictable RTF on a target CPU core and frequency, regardless of architectural choices (high-sample-rate input vs. large parameter count in decoder)
  • This metric decouples complexity constraint from specific architectural choices (stride, latent rate), allowing codec designers flexibility in optimization
  • High-complexity validation: S4-260155's 20M model (~5.14 GMACS) demonstrates RTF of 0.9 in power-efficient execution mode on high-end 2023 device, aligning with mid-range Prime Core (3.0 GHz) trend where ~5.3 GMACS corresponds to RTF ≈ 1.0

4. Conclusion

By adopting GMACS as the primary complexity metric, the apparent discrepancies between different contribution data are resolved. This enables a unified set of requirements that accurately reflects real-time capability of mobile devices.

5. Proposal

Propose to include this analysis in 3GPP TR 26.940, specifically capturing:

  • Model Size is not a consistent proxy for complexity across varying architectures (e.g., high-stride vs. low-stride configurations)
  • GMACS/GFLOPs demonstrates strong linear correlation with real-time performance on mobile devices
  • This analysis provides a solid basis for defining complexity constraints for ULBC candidates

References

[1] S4-260165, "[FS_ULBC] On ULBC complexity and RTF analysis"

[2] S4-260155, "[FS_ULBC] Analysis of AI Codec Real-Time Performance (RTF) and Complexity Scaling"

Qualcomm Incorporated, Xiaomi
[FS_ULBC] Feasible bitrates for the NTN-TDL-C channel model with 10-degree elevation angle

Summary of S4-260214: Feasible Bitrates for NTN-TDL-C Channel Model with 10-Degree Elevation Angle

Background and Motivation

This contribution addresses the determination of feasible Transport Block Sizes (TBS) for the newly agreed NTN-TDL-C channel model at 10-degree elevation angle, which was adopted at SA4 #134 (documented in S4-252108). Key observations include:

  • Two channel models now exist: The original channel model from TR 38.811 Table 6.9.2-3 and the new NTN-TDL-C model for 10-degree elevation
  • Channel model validation: The new channel model shows better correlation with field data from NB-IoT GSO service with handheld devices
      • Field data shows 1-3 dB gap between 1st and 50th percentile SNR
      • New channel model: ~1 dB gap (consistent with field data)
      • Initial channel model: ~6 dB gap (less consistent)
  • TBS table update requirement: The TBS values in permanent document tables (5.2.2.1-1, 5.2.2.1-2, 5.2.2.1-3) should reflect the union of supported TBS values for both channel models

Simulation Methodology

The contribution evaluates maximum feasible bitrates under worst-case conditions without DMRS bundling, considering two scenarios:

Scenario 1: Ideal Timing

  • 80ms bundling period, 2 UE RX, 15kHz SCS, 5Hz Doppler spread
  • Without DMRS bundling
  • No uncertainty in scheduling timing

Scenario 2: Timing Uncertainty

  • 80ms bundling period, 2 UE RX, 15kHz SCS, 5Hz Doppler spread
  • Without DMRS bundling
  • 10ms uncertainty in scheduling timing (relevant for large cells without UE-specific Koffset or TA report)

Both scenarios target 2% BLER for uplink and downlink.

Simulation Results

Scenario 1 Results (No Timing Uncertainty)

  • Maximum TBS: 936 bits
  • Uplink: 15kHz SCS, 3 tones, 48ms (N_RU=6, N_rep=2)
      • BLER: 1.5% at 31dBm UE TX power
  • Downlink: 24ms (N_SF=6, N_rep=4)
      • BLER: 1.1% at -3.3dB SNR

Scenario 2 Results (10ms Timing Uncertainty)

  • Maximum TBS: 680 bits
  • Uplink: 15kHz SCS, 3 tones, 40ms (N_RU=5, N_rep=2)
      • BLER: 0.2% at 31dBm UE TX power
  • Downlink: 20ms (N_SF=5, N_rep=4)
      • BLER: 0.5% at -3.3dB SNR

Proposed Changes to Permanent Document

TBS Table Updates

The contribution proposes adding new TBS values to support the higher bitrates enabled by the new channel model:

For 80ms Bundling Period (Table 5.2.2.1-1)

  • Add TBS 936 bits (PHY bitrate: 11.7 kbps, Net bitrate: 11.0 kbps)
  • Add TBS 680 bits (PHY bitrate: 8.5 kbps, Net bitrate: 7.8 kbps)
  • Add intermediate value between 680 and current maximum (424)

For 160ms Bundling Period (Table 5.2.2.1-2)

  • Add corresponding new TBS values scaled appropriately

For 320ms Bundling Period (Table 5.2.2.1-3)

  • Add corresponding new TBS values scaled appropriately

Terminology Change

  • Change "codec bitrate" to "net bitrate" to clarify that this represents the bitrate available to the codec (after accounting for packet headers), not a required codec operating bitrate

Updated Tables

The proposed tables include:

  • Packet header assumption: 7 bytes (with note that final size depends on SA2/RAN conclusions on 1-byte MAC header feasibility)
  • Header counting: Packet header counted only once per bundling period, regardless of number of voice frames bundled
  • TBS values: Selected from TS 36.213 Table 16.5.1.2-2 for NB-IoT NPUSCH
  • Net bitrate calculation: PHY bitrate minus overhead from packet headers

The complete updated tables show TBS values ranging from 144 to 936 bits for 80ms bundling, with corresponding PHY and net bitrates calculated for each bundling period configuration.
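
The PHY and net bitrates quoted for the new TBS values can be reproduced with the stated assumptions (7-byte header counted once per bundling period). The sketch below is an illustration of that arithmetic; the function names are hypothetical, and the exact rounding used in the permanent document tables is not confirmed by this summary.

```python
HEADER_BYTES = 7  # packet header assumption, counted once per bundling period

def phy_bitrate_kbps(tbs_bits, bundling_ms):
    """Bits per millisecond is numerically equal to kbit/s."""
    return tbs_bits / bundling_ms

def net_bitrate_kbps(tbs_bits, bundling_ms, header_bytes=HEADER_BYTES):
    """Bitrate left for the codec after removing the packet header."""
    return (tbs_bits - 8 * header_bytes) / bundling_ms

for tbs in (680, 936):
    print(tbs, phy_bitrate_kbps(tbs, 80), net_bitrate_kbps(tbs, 80))
# 680 8.5 7.8
# 936 11.7 11.0
```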

Qualcomm Incorporated, Ericsson LM
[FS_ULBC] On the scheduling timing uncertainty

Summary of 3GPP Technical Document: FS_ULBC - Scheduling Timing Uncertainty

1. Background and Motivation

This contribution addresses ambiguities in interpreting RAN1 LS S4-251654 regarding uplink and downlink timing for NB-IoT NTN with GEO satellites. The interpretation of this LS has direct implications on:

  • Scheduling timing uncertainty assumptions
  • Link capacity calculations

The document proposes clarifications to the Permanent Document (PD) version 0.4.0 to resolve these interpretation issues discussed at SA4 #133-e and subsequent meetings.

2. Main Technical Contributions

2.1 Frame Structure for Dynamic Scheduling

The document maintains the existing frame structure example for Half-duplex FDD with 80ms bundling period:

  • NPDSCH duration: 4ms (variable depending on DL SNR)
  • Multiple UL frequency allocation options: 1, 3, 6, and 12 tones with 15 kHz per tone
  • Allocation choice depends on UL and DL channel capacity

2.2 Semi-Persistent Scheduling (SPS) Frame Structure

Two SPS approaches are presented:

Approach 1 (Figure 5.2.2.3-2):

  • NPDSCH can be positioned anywhere within first 15ms
  • Maintains minimum 1ms gap to NPUSCH

Approach 2 (Figure 5.2.2.3-3):

  • Based on "Cell_specific_Koffset" approach
  • Does not depend on "TA report UE capability"

2.3 Gap Composition Between DL and UL

The gap consists of:

  1. Processing time + DL-to-UL switching: Minimum 1ms for half-duplex device switching
  2. Max differential delay: Accounts for different round-trip delays of UEs in NTN cell
      • Typical range: close to 0 to 10.3ms depending on deployment

2.4 Baseline Assumptions for Codec Simulations

Key Changes Proposed:

For 80ms bundling:

  • Original assumption: "Max differential delay" is 10ms AND X + Y ≤ 68ms
  • Proposed change: Replace with reference to beam size no larger than 1500km
  • Note clarifies this corresponds to scenarios where difference between closest and farthest point to satellite is <1500km
  • Explicitly states codec can be deployed in scenarios not meeting these constraints

For 160ms bundling:

  • Original assumption: "Max differential delay" is 10ms AND X + Y ≤ 148ms
  • Proposed change: Replace with reference to beam size no larger than 1500km
  • Same flexibility noted regarding deployment in other scenarios
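
A short calculation suggests why a 1500km reference and a 10ms differential delay describe similar conditions. The sketch below is an interpretation, not the contribution's own derivation: it assumes free-space propagation at the speed of light and treats the differential delay as the round-trip delay difference over the path-length difference between the closest and farthest UE. The function names are hypothetical.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458
SWITCHING_GAP_MS = 1.0  # minimum processing + DL-to-UL switching time (clause 2.3)

def max_differential_delay_ms(path_difference_km):
    """Round-trip delay difference between the nearest and farthest UE."""
    return 2 * path_difference_km * 1_000 / SPEED_OF_LIGHT_M_PER_S * 1_000

def min_dl_ul_gap_ms(path_difference_km):
    """Gap between DL and UL: switching time plus max differential delay."""
    return SWITCHING_GAP_MS + max_differential_delay_ms(path_difference_km)

print(round(max_differential_delay_ms(1500), 2))  # 10.01
print(round(min_dl_ul_gap_ms(1500), 2))           # 11.01
```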

2.5 Important Notes and Editor's Notes

RAN1 LS Clarification:

  • Figure 5.2.2.3-1 supportable in most scenarios
  • May not be supportable when:
      • Cell is very large (e.g., >3000km)
      • UE does not support TA report
      • Network does not support UE-specific K-offset
  • Requires UE configuration with two HARQ processes and HARQ feedback disabled

SPS Design Status:

  • RAN1/2 have not yet started SPS design work
  • RAN1 cannot currently confirm whether SPS frame structure examples (Figures 5.2.2.3-2 and associated text) will be supported

Editor's Note: Range of "Max differential delay" is TBC (To Be Confirmed)

3. Summary of Changes

The primary technical contribution is replacing specific timing constraint assumptions (X + Y values and max differential delay) with a more practical reference based on beam size ≤ 1500km for codec simulation baseline, while explicitly allowing codec deployment in scenarios exceeding these reference conditions. This provides clearer guidance to SA4 while maintaining flexibility for various deployment scenarios.

Qualcomm Incorporated
[FS_ULBC] On transmission delay for voice over NB-IoT NTN

Summary of S4-260216: On Transmission Delay for Voice over NB-IoT NTN

Document Overview

This contribution from Qualcomm addresses gaps in TR 26.940's mouth-to-ear delay calculations for NB-IoT NTN systems, specifically highlighting the omission of NPUSCH/NPDSCH transmission durations and clarifying the distinction between propagation delay and transmission delay.

Main Technical Issues Identified

Problem Statement

  1. Missing Transmission Duration: TR 26.940 did not account for the duration of NPUSCH transmission or NPDSCH transmission, which can be significant for NB-IoT (e.g., 64ms for NPUSCH)

  2. Terminology Confusion: TR 26.940 confuses propagation delay with transmission delay, where:
      • Transmission delay: The interval from when the first bit leaves a transmitter to when the last bit leaves the transmitter
      • Propagation delay: The time for signal propagation through the medium
      • Processing delay (up to 3ms) can be ignored in mouth-to-ear calculations

Proposed Technical Changes

5.2.2.4 Propagation Delay Corrections

Key Change: Renamed "Transmission delay" to "Propagation delay" for GEO satellite link

  • Maximum propagation delay: 280ms (per KPI requirement in clause 7.4.2 of reference document)
  • Minimum propagation delay: 248ms (280ms - 32ms, accounting for UE location within beam)
  • Assumes no retransmissions over GEO satellite link

5.2.2.5 Transmission Delay (New Section)

New Addition: Introduces proper definition and consideration of transmission delay

  • Defines transmission delay as the interval from first bit to last bit leaving the transmitter
  • Highlights significance for NB-IoT NTN (up to 64ms for NPUSCH in uplink)
  • Must be accounted for in mouth-to-ear delay calculations
  • Transmission delay for transport block size should be based on RAN simulation results

5.2.2.5/6 ULBC Delay Components

  • Section renumbered from 5.2.2.5 to 5.2.2.6
  • References existing algorithmic delays for IMS codecs (AMR and EVS: 5ms to 12ms)
  • Notes that ULBC may have different delay values from codec processing and algorithmic delays
  • Marked as FFS (For Further Study)

5.1.3 Mouth-to-Ear Delay Estimation Updates

Editorial Note Added:

  • Numbers in Table 5.1.3-1 will be updated once RAN simulation is completed to account for transmission delays in uplink and downlink
  • Current values assume AMR and EVS algorithmic delays
  • ULBC delay components still need to be addressed
  • Minimum Delay_GSCN assumed as 20ms

Existing Table Structure Maintained:

  • Frame sizes: 20ms, 40ms, 80ms, 160ms, 320ms
  • Two scenarios: GEO-TN (main) and GEO-GEO (sub-scenario 1)
  • Lower and upper bounds for mouth-to-ear delay
  • Delay ranges from 428-712ms (20ms frame, GEO-TN) to 984-1455ms (320ms frame, GEO-GEO)

Dependencies and Next Steps

  • Awaiting RAN simulation results to determine actual transmission delay values for different transport block sizes
  • ULBC-specific delay components require further study
  • Terminology alignment needed with clause 4 "Application Scenario"
  • Table 5.1.3-1 values pending update based on RAN simulation completion

Qualcomm Incorporated
[FS_ULBC] Support for Dual-Tone Multi-Frequency for IMS voice over NB-IoT NTN

Summary of S4-260217: Support for Dual-Tone Multi-Frequency for IMS Voice over NB-IoT NTN

Background

SA1 has mandated support for Dual-Tone Multi-Frequency (DTMF) for IMS voice over NB-IoT NTN. The document addresses the need to consider multiplexing of DTMF traffic with voice traffic in the system design, referencing RFC 4733 for DTMF payload formats.

DTMF Use in IMS Voice Services and Traffic Characteristics

DTMF Payload Types

RFC 4733 defines two DTMF payload format types:

  • Telephone events: User button presses (0-9, *, #) during calls
  • Tones: Ringing tone, busy tone, etc.

For IMS calls, tones are generated locally (e.g., "180 Ringing" or "486 Busy Here" SIP messages trigger local tone generation), so only telephone events need to be transported over the air.

Technical Specifications

  • RTP payload size: 4 bytes
  • Telephone events: Standard DTMF digits and symbols
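
For reference, the 4-byte size matches the telephone-event payload layout of RFC 4733: an 8-bit event code, the E (end) and R (reserved) bits, a 6-bit volume, and a 16-bit duration in RTP timestamp units. The sketch below packs one such payload; the function name, the default volume, and the duration of 1600 units are illustrative choices, not values from the contribution.

```python
import struct

# RFC 4733 event codes for the digits and symbols named above.
DTMF_EVENT_CODES = {**{str(d): d for d in range(10)}, "*": 10, "#": 11}

def pack_telephone_event(digit, duration, volume=10, end=False):
    """Build the 4-byte payload: event, E/R/volume, duration."""
    if not 0 <= volume <= 63 or not 0 <= duration <= 0xFFFF:
        raise ValueError("volume is 6 bits, duration is 16 bits")
    flags_volume = (0x80 if end else 0x00) | volume  # R bit left at 0
    # Network byte order: two single bytes followed by a 16-bit unsigned int.
    return struct.pack("!BBH", DTMF_EVENT_CODES[digit], flags_volume, duration)

payload = pack_telephone_event("#", duration=1600, end=True)
print(len(payload), payload.hex())  # 4 0b8a0640
```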

Traffic Characteristics

The document identifies key DTMF traffic characteristics:

  • DTMF packets are transmitted infrequently (only on button press)
  • Telephone events may or may not overlap with active voice activity
  • Multiple DTMF packets may be transmitted per button press, with the RTP marker bit indicating the first packet
  • RTP packets must differentiate between voice and DTMF packets for multiplexing

Design Assumptions

Three key assumptions are established:

  1. DTMF packet size ≤ voice packet size
  2. DTMF delay requirements are less stringent than voice service
  3. DTMF takes priority over voice

SPS Configuration Considerations

When SPS (Semi-Persistent Scheduling) is configured for voice traffic with fixed TBS:

  • If DTMF packets don't overlap with active voice frames, they can be multiplexed with SID frames (smaller than active voice frames) and transmitted in SPS occasions
  • If overlapping occurs, the UE can puncture an active voice frame and send the DTMF frame instead
  • SA4 needs to coordinate with RAN1 and RAN2 on SPS design
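
The per-occasion behaviour described above can be written as a small decision function. This is a sketch of the logic as summarized here, with hypothetical names (`SpsAction`, `sps_occasion_action`); the actual mechanism remains subject to SA4, RAN1, and RAN2 design work.

```python
from enum import Enum

class SpsAction(Enum):
    SEND_VOICE = "send active voice frame"
    SEND_SID = "send SID frame"
    SEND_SID_WITH_DTMF = "multiplex DTMF with SID frame"
    PUNCTURE_VOICE_FOR_DTMF = "drop voice frame and send DTMF"

def sps_occasion_action(dtmf_pending, voice_active):
    """Choose what the UE sends in one SPS occasion with fixed TBS."""
    if not dtmf_pending:
        return SpsAction.SEND_VOICE if voice_active else SpsAction.SEND_SID
    if voice_active:
        # DTMF takes priority over voice, so the voice frame is punctured.
        return SpsAction.PUNCTURE_VOICE_FOR_DTMF
    # No active voice: DTMF fits alongside the smaller SID frame.
    return SpsAction.SEND_SID_WITH_DTMF

print(sps_occasion_action(dtmf_pending=True, voice_active=True).value)
```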

Proposals

Proposal 1: Make DTMF support an integral part of IMS voice service over NB-IoT NTN

Proposal 2: Design DTMF support based on the three assumptions:

  • DTMF packet size ≤ voice packet size
  • DTMF delay requirement less stringent than voice
  • DTMF priority over voice

Proposal 3: SA4 to design mechanisms for voice and DTMF multiplexing for SPS and coordinate with RAN1 and RAN2

Nokia
Proposed design constraints for noise suppression, DTX, and non-speech inputs

Summary of S4-260220: Design Constraints for Noise Suppression, DTX, and Non-Speech Inputs

1. Background and Context

This contribution addresses design constraints for the ULBC (Ultra Low Bitrate Codec) over GEO channel solution, building upon previous discussions from S4-251881 and S4-251786. The document focuses on three key areas:

  • Noise suppression handling
  • Discontinuous transmission (DTX) framework
  • Robustness to non-speech inputs

Emergency Call Use Case

The contribution emphasizes that emergency calls represent a critical use case for ULBC over GEO, particularly when terrestrial network (TN) service coverage is unavailable. Key considerations include:

  • Background signals may contain critical contextual information (e.g., voices, environmental sounds indicating danger)
  • Post-call analysis requirements (ASR transcripts, emergency response evaluation, criminal investigations)
  • Need for full situational awareness rather than aggressive noise suppression

2. Technical Analysis

2.1 Noise Suppression Trade-offs

The document identifies several technical challenges:

  • Performance requirements alone may be insufficient: Testing with background signals (even using ITU-T P.800 DCR methodology) may not prevent systems from employing aggressive noise suppression that removes critical background information
  • Ultra-low bit rate optimization: At very low bit rates, there exists an unknown trade-off between:
      • Applying noise suppression
      • Accepting more coding artifacts
      • Potentially reduced intelligibility in presence of background signals
  • Device-specific processing: Acknowledges that device-specific noise suppression is standard practice and will likely be applied before ULBC encoding

2.2 Updated Approach

The contribution updates the original proposal from S4-251881 by:

  • Maintaining the requirement for disableable noise suppression within the codec
  • Adding specific SNR ranges for stationary (5-15 dB) and non-stationary (10-25 dB) noise
  • Deferring specific noise type definitions for future discussion
  • Linking noise suppression behavior primarily to performance requirements

3. Proposed Design Constraints

The document proposes updates to Table 6.2-1 in draft TR 26.940 with three new/modified constraint parameters:

3.1 Noise Suppression Constraint

Requirement: If noise suppression is supported as part of the candidate codec, it must be possible to disable it to preserve background signals.

Editor's Notes:

  • EN1: Requirement to disable may be considered in connection with specific operating bit rate(s)
  • EN2: Solution behavior w.r.t. potential noise suppression is primarily enforced via performance requirements; default operation for tests is with noise suppression disabled

3.2 DTX Framework Constraint

Requirement: The candidate codec shall provide a framework for:

  • Voice Activity Detection (VAD)
  • Discontinuous Transmission (DTX)
  • Comfort Noise Generation (CNG)
  • Operation with DTX on or DTX off

Editor's Note: Operation relating to DTX on and disabling/enabling potential noise suppressor may need clarification

3.3 Robustness to Non-Speech Input

Requirement: The candidate codec shall be robust to:

  • Noisy speech with stationary noise (5-15 dB SNR)
  • Noisy speech with non-stationary noise (10-25 dB SNR)
  • Background signals during and between speech segments
  • Other non-speech input signals

Editor's Notes: - EN1: May need to be in performance requirements - EN2: Relevant background signals to be further defined as part of performance requirements, including both stationary and non-stationary types

4. Key Technical Contributions

  1. Balanced approach to noise suppression: Recognizes both the need for flexibility in noise suppression (for speech quality) and the critical requirement to preserve background signals (for emergency use cases)

  2. Mandatory DTX framework: Establishes VAD/DTX/CNG as a required feature rather than optional, with explicit on/off control

  3. Quantified robustness requirements: Provides specific SNR ranges for different noise conditions that the codec must handle

  4. Testing methodology guidance: Proposes default testing with noise suppression disabled, while allowing performance requirements to govern overall behavior

5. Open Issues

Several editor's notes indicate areas requiring further work:

  • Specific operating bit rates where noise suppression disable requirement applies
  • Clarification of DTX and noise suppression interaction
  • Final placement of robustness requirements (design constraints vs. performance requirements)
  • Definition of specific background signal types for testing
  • Speech quality requirements (to be addressed separately in performance requirements)

Ericsson LM
UE Antenna Gain in link-budget evaluations

UE Antenna Gain in Link-Budget Evaluations

Introduction

This contribution addresses the need to establish common assumptions for UE Antenna Gain in link-budget evaluations for FS_ULBC (Ultra Low Bitrate Speech Codec). The document highlights that different assumptions on UE Antenna Gain lead to significantly different conclusions on suitable radio configurations, and proposes alignment with existing 5G NR-NTN assumptions.

Problem Statement

The current FS_ULBC Pdoc references TR 36.763 with UE Antenna Gain assumptions ranging between 0 dBi and -5.5 dBi. Previous SA4 contributions on link level simulations have shown divergent assumptions regarding achievable link level performance, leading to inconsistent conclusions. The lack of a common assumption for UE Antenna Gain (G_Tx) significantly impacts:

  • Link-budget results
  • Performance references for link-level evaluations
  • Overall system design conclusions

Link-Budget Analysis

Comparative Evaluation

The document presents a detailed side-by-side comparison of link-budget calculations for GEO satellite uplink with two different UE Antenna Gain assumptions:

Scenario Parameters (Common):

  • Satellite Orbit: GEO
  • Link Direction: Uplink
  • Device Type: Handheld
  • Satellite Elevation Angle: 2.3 degrees
  • Satellite Altitude: 35,786 km
  • Slant Range: 41,417.91 km
  • Carrier Frequency: 2000 MHz
  • Free Space Path Loss (FSPL): 190.8 dB
  • UE Transmit Power: 23 dBm
  • Receive Antenna Gain: 51 dBi
  • Satellite G/T: 19 dB/K
  • Bandwidth: 3750 Hz
  • Various losses (atmospheric, shadow fading, scintillation, polarization, additional): 11.4 dB total

Key Results:

| UE Antenna Gain | Received Power | Noise Power | SNR at Satellite Receiver |
|-----------------|----------------|-------------|---------------------------|
| 0 dBi | -135.58 dBm | -138.23 dBm | 2.66 dB |
| -5.5 dBi | -141 dBm | -138.23 dBm | -2.84 dB |

The difference in UE Antenna Gain assumption results in a 5.5 dB difference in SNR, which is highly significant for link-level performance evaluation and system design.
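The geometry and path-loss entries in the scenario parameters can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only: the function names and the 6371 km Earth radius are assumptions that do not come from the contribution, and the SNR is simply taken as received power minus noise power from the table. The results agree with the tabulated values to within rounding (the -141 dBm entry appears to be rounded to the nearest dB).

```python
import math

EARTH_RADIUS_KM = 6371.0            # assumed mean Earth radius (not stated in the summary)
SPEED_OF_LIGHT_M_S = 299_792_458.0


def slant_range_km(altitude_km: float, elevation_deg: float) -> float:
    """Distance from a ground terminal to a satellite seen at the given elevation."""
    re_sin = EARTH_RADIUS_KM * math.sin(math.radians(elevation_deg))
    return math.sqrt(re_sin**2 + altitude_km**2 + 2 * altitude_km * EARTH_RADIUS_KM) - re_sin


def fspl_db(distance_km: float, freq_mhz: float) -> float:
    """Free-space path loss: 20*log10(4*pi*d*f/c)."""
    d_m = distance_km * 1e3
    f_hz = freq_mhz * 1e6
    return 20 * math.log10(4 * math.pi * d_m * f_hz / SPEED_OF_LIGHT_M_S)


def snr_db(received_dbm: float, noise_dbm: float) -> float:
    """SNR as the difference between received power and noise power (both in dBm)."""
    return received_dbm - noise_dbm


distance = slant_range_km(altitude_km=35_786.0, elevation_deg=2.3)
path_loss = fspl_db(distance, freq_mhz=2000.0)

# Table values for the 0 dBi case; the -5.5 dBi case shifts the SNR by the gain difference.
snr_0dbi = snr_db(received_dbm=-135.58, noise_dbm=-138.23)
snr_minus55dbi = snr_0dbi - 5.5

print(f"slant range: {distance:.2f} km, FSPL: {path_loss:.1f} dB")
print(f"SNR: {snr_0dbi:.2f} dB (0 dBi), {snr_minus55dbi:.2f} dB (-5.5 dBi)")
```

Because the antenna gain enters the link budget as a plain additive term in dB, any change in the assumed gain carries over one-to-one to the SNR.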

Observations

Observation 1: The assumption for UE_Antenna_Gain (G_Tx) critically impacts the resulting SNR at the satellite receiver, which in turn affects conclusions on link-level results. Clarification is needed on whether to use 0 dBi, -5.5 dBi, or both values.

Observation 2: It is unlikely that an NB-IoT device would have superior antenna performance compared to an NR handheld device. Therefore, the UE_Antenna_Gain assumption should align with 5G NR-NTN specifications, which use -5.5 dBi.

Observation 3: RAN4 guidance (R1-2208353) explicitly recommends -5.5 dBi as a realistic UE antenna gain value, stating: "The UE antenna gain varies depending on the operating frequency and UE design. RAN4 thinks that a realistic UE antenna gain value would be -5.5 dBi. RAN4 would then recommend RAN1 to take this value as an assumption for their link budget evaluation."

Proposal

Proposal 1: For the support of voice-over-GEO in NB-IoT NTN, align the assumption on UE_Antenna_Gain (G_Tx) with 5G NR-NTN specifications, i.e., -5.5 dBi.

This alignment ensures:

  • Consistency with existing 3GPP NTN specifications
  • Realistic assumptions based on RAN4 recommendations
  • Comparable link-budget evaluations across different contributions
  • Appropriate performance targets for codec and system design

Fraunhofer IIS, Apple Inc.
On reference code and model format

3GPP Change Request Summary: S4-260233

Document Overview

This contribution proposes the use of ML model formats as intermediate representations (IR) for the ULBC (Ultra Low Bitrate Codec) reference implementation, rather than a pure C implementation. The document is structured as a proposed Change Request (pCR) to TR 26.940, introducing a new clause 6.4.2.


Main Technical Contributions

1. Problem Statement and Motivation (Goal Section)

The document identifies a fundamental question for ULBC standardization: whether to provide the entire codec reference implementation in C (including neural network components) or to define specific parts based on ML model formats (e.g., ONNX, PyTorch, TensorFlow).

Key concerns with pure C implementation:

  • Limits UE vendors from leveraging custom architectures and optimizations
  • UE vendors typically have custom optimization pipelines to port ML models to internal formats
  • Pure C approach restricts full utilization of specialized hardware (NPUs, DSPs, TPUs)

2. Limitations of C-Based Reference Implementation (Clause 6.4.2.1)

Issues with existing WMC (Weighted Million Operations) tool for complexity measurement:

  • Weights in Table 18.3 of G.191 do not account for vectorized implementations of matrix multiplications
  • Theoretical complexity estimation does not reflect actual runtime complexity
  • Does not account for diversity of target platforms

Additional limitations identified:

  • Hardware/platform dependencies: C implementations may rely on platform-specific intrinsics and vectorization pragmas, limiting portability to NPUs
  • Unoptimized reference code: May not be optimized for certain platforms
  • Compiler dependencies: Intrinsics are compiler-specific
  • Maintenance burden: Keeping C implementation updated with new ML operators and architectures is costly and error-prone

3. Definitions and Concepts (Clause 6.4.2.1 - Definitions)

The document establishes clear terminology:

  • Graph format: Describes neural network as computational graph (structure only, no parameters)
  • Model format: Combines graph representation, trained parameters (weights, biases), and metadata; self-contained and directly runnable
  • Intermediate Representation (IR): Serves as bridge between high-level ML framework and execution runtimes

Note: PyTorch does not contain a graph format and requires the model to be defined as Torch code.

4. Advantages of Model Format Approach (Clause 6.4.3.2)

Platform Portability:

  • Specifies what is computed, not how it's executed
  • Framework-agnostic: models can be exported from different training frameworks
  • Allows vendors to use custom toolchains for hardware-specific optimization

Hardware Evolution:

  • Future-proof method to leverage latest AI processor developments
  • Maintains compatibility with low maintenance effort

Combination with Standard C-code:

  • ULBC can combine ML parts (as model format) with classic signal processing (in ANSI C)
  • Backend runtime in C can integrate ML components
  • Enables traditional 3GPP codec reference implementation structure

5. Comprehensive Model Format Analysis (Clause 6.4.3.3)

The document provides detailed comparison of major ML model formats:

| Format | Type | Key Advantages | Key Limitations |
|--------|------|----------------|-----------------|
| ONNX | Framework-agnostic IR | Cross-framework portability, wide runtime/hardware support, native OS support (Windows/Linux), dedicated C/C++ runtime | Operator coverage limitations, limited dynamic graph support |
| TensorFlow Lite (TFLite/LiteRT) | Edge/embedded-focused IR | Mobile/edge optimized, strong Android ecosystem, quantization tools, C/C++ runtime | TensorFlow-centric, partially vendor-specific maintenance |
| PyTorch/Python | Torch.nn.Module + checkpoints | Easy prototyping, highly optimized conversion tools | Suboptimal for real-world testing, Python dependencies, no C/C++ runtime without Python |
| TorchScript | PyTorch-specific serialized IR | Static graph without Python dependencies, supports custom ops, LibTorch C++ runtime | PyTorch-specific, deprecated (being replaced by ExportedProgram) |
| ExportedProgram & ExecuTorch | Two IRs: ExportedProgram and .pte | Replaces TorchScript, canonical PyTorch export IR, dedicated C++ runtime | PyTorch-specific, requires compilation to another IR, pipeline not fully mature |
| OpenVINO IR | Intel/CPU-centric IR | Strong Intel CPU/GPU optimization | Not suitable for mobile SoCs, extra conversion step needed |
| Proprietary vendor IRs | Vendor-specific internal IR | Highly hardware-optimized | Not portable, requires conversion from open IR |

Key observations:

  • PyTorch format provides maximum flexibility and transparency but may have long-term compatibility concerns due to format evolution
  • ONNX and TFLite are designed for inference deployment and cross-platform compatibility, representing stable industry standards
  • ULBC ML parts will likely be based on PyTorch format, convertible to stable formats like ONNX or TFLite

6. SoC AI Engine Support Analysis (Clause 6.4.3.4)

Hardware landscape:

  • Major smartphone SoCs include NPUs, DSPs, TPUs, GPUs, and CPUs
  • Vendors provide specialized runtime environments and SDKs
  • Vendors use native/preferred internal model formats optimized for their architecture

Industry pattern:

  • All major vendors provide conversion mechanisms from popular open-source formats
  • Common supported formats: ONNX, TFLite, PyTorch, TensorFlow
  • References provided for major vendors: Qualcomm, Apple, Samsung, MediaTek, Google, Huawei

7. Summary and Recommendations (Clause 6.4.3.4)

Advantages of model-format/IR-based reference implementation:

  • Decouples algorithm definition from hardware-specific implementation
  • Leverages existing SoC vendor compilers, AI accelerators, and runtimes
  • Significantly more portable, maintainable, and future-proof

Recommended approach for ULBC reference implementation:

  1. Base reference on ML model-format with auxiliary signal processing in C
  2. Include both ONNX and PyTorch as ML model-formats
  3. Define neural network model-format including operator set and version
  4. Specify I/O interfaces of ML models and auxiliary signal processing steps in C
  5. Use reference implementation for integration illustration, verification, and testing


Proposal

The document proposes:

  1. Discussion and agreement on selection of one or more model formats for ULBC reference implementation
  2. Agreement on principle of using model format as part of ULBC standardization reference model
  3. Documentation of findings in TR 26.940 under new clause 6.4.2


Key Technical Impact

This contribution represents a significant departure from traditional 3GPP codec standardization approaches by advocating ML model formats rather than pure C implementations. The proposal addresses practical deployment considerations for ML-based codecs while maintaining compatibility with 3GPP standardization practices through a hybrid approach that combines model formats with C code for the signal processing components.

Orange
On the use of objective metrics in ULBC standardization

Summary of 3GPP Technical Document on Objective Metrics in ULBC Standardization

Introduction and Scope

This document addresses the Study on Ultra Low Bitrate Speech Codec (FS_ULBC), specifically focusing on performance requirements and test methodologies as defined in the WID. The contribution targets study objective 5 regarding speech quality, intelligibility, and conversational quality testing under various conditions (clean/noisy speech, tandeming with IMS codecs, clean/GEO channel conditions).

Main Technical Contributions

Test Methodologies (Clause 9)

Quality Impairments of Ultra-Low Bit Rate Speech Coding (9.1.1)

The document identifies specific impairment categories relevant to ULBC:

  • Loss of listening-only audio quality
  • Audio bandwidth loss
  • Impaired intelligibility
  • Impaired speaker identifiability
  • Prosodic impairments
  • Hallucination (word and phone confusions)
  • Sensitivity to non-speech input (background noise, music, noisy speech, interfering talkers, reverberant speech)

The document additionally notes that ULBC may incorporate speech enhancement algorithms (noise suppression, gain normalization).

Challenges of Quality Assessment (9.1.2)

The document highlights that ULBC testing introduces new challenges compared to signal processing-based codecs (AMR, AMR-WB, EVS):

Traditional 3GPP Approach:

  • Historical reliance on ITU-T P.800 ACR (Absolute Category Rating) for clean speech
  • P.800 DCR (Degradation Category Rating) for SWB clean speech, mixed-bandwidth, speech + background noise, and music/mixed content
  • Previous codec standardizations did not focus on intelligibility, speaker identifiability, or prosodic impairments

ULBC-Specific Considerations:

  • ML-based coding systems introduce new impairment types (e.g., hallucination) not present in signal-processing codecs
  • ACR may not optimally quantify all impairments (hallucination, intelligibility, prosodic issues)
  • DCR focuses on differences to reference, which may not directly impact conversational capability but affects aspects like identity recognition

Alternative Test Methodologies Listed:

  • Diagnostic Rhyme Tests (DRT)
  • Modified Rhyme Tests (MRT)
  • MOS testing for speaker similarity
  • Speaker verification/identification tests
  • Prosodic naturalness MOS tests
  • Intonation recognition tests
  • Transcription tests for word and semantic equivalence
  • Phoneme recognition tests
  • Automatic speech recognition tests
  • P.835 multi-dimensional rating scales for speech enhancement evaluation

Subjective Testing Considerations (9.1.3)

Robustness Related to Source Material (9.1.3.1):

  • Multiple languages with diverse intonations
  • Non-speech signals
  • Various linguistic features and accents
  • Wide range of speakers (different voice pitches, speaking styles)
  • Overlapping talkers

Simulation of Real-world Acoustic Conditions (9.1.3.2):

  • Clean environments (minimal background noise)
  • Noisy environments (traffic, human chatter, vehicle)
  • Various reverberation levels (RT60 ranging from 0.3s to 1.0s)

Tandeming and Compatibility Testing (9.1.3.3):

  • Testing with speech previously encoded by ITU-T G.711, AMR, AMR-WB, and EVS
  • Various input levels: -16dBov, -26dBov, and -36dBov

Conclusion (9.1.3.4):

  • ITU-T P.800 ACR/DCR serves as backbone for most subjective testing
  • Other methodologies may be considered
  • Emphasis on diverse test material: multilingual/multi-speaker testing, real-world acoustic conditions, and tandeming

Objective Testing Considerations (9.1.4)

Correlation Analysis Results (9.1.4.1):

The document presents correlation analysis based on ACR experiments (clause 7.3.3) evaluating objective models:

Speech-oriented metrics: PESQ, POLQA, ViSQOL-S, WARP-Q, DNSMOS, NISQA, NORESQA, UTMOS, SCOREQ

General audio metrics: PEAQ, ViSQOL-A

Evaluation metrics used: Pearson correlation coefficient, RMSE, Kendall's Tau rank correlation coefficient
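As a rough illustration of how these three figures are obtained from paired subjective and objective scores, the sketch below implements them with the standard library only. The score lists are hypothetical placeholders, not data from the contribution, and Kendall's Tau is computed in its simple tau-a form (ties count as neither concordant nor discordant), which may differ from the variant used in the source.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson linear correlation coefficient."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def rmse(x, y):
    """Root-mean-square error between two score lists."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))


def kendall_tau_a(x, y):
    """Kendall tau-a: (concordant - discordant) / number of pairs."""
    pairs = list(combinations(range(len(x)), 2))
    score = 0
    for i, j in pairs:
        sign = (x[i] - x[j]) * (y[i] - y[j])
        score += (sign > 0) - (sign < 0)
    return score / len(pairs)


# Hypothetical per-condition scores, for illustration only.
subjective_mos = [1.8, 2.4, 3.1, 3.6, 4.2]
objective_mos = [2.0, 2.3, 3.3, 3.5, 4.4]

r = pearson(subjective_mos, objective_mos)
err = rmse(subjective_mos, objective_mos)
tau = kendall_tau_a(subjective_mos, objective_mos)
print(f"Pearson={r:.3f}  RMSE={err:.3f}  Kendall tau-a={tau:.3f}")
```

Pearson measures linear agreement, Kendall's Tau only checks whether the ranking of conditions is preserved, and RMSE is the one that is sensitive to a scale offset between the two score sets.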

Key Observations for Clean Speech:

  • Best performing models (POLQA, UTMOS, PESQ, WARP-Q, SCOREQ) accurately predicted monotonic bitrate/quality behavior
  • 16 kHz models (PESQ without mapping, UTMOS and WARP-Q with mapping) showed relatively good performance even for fullband codecs
  • Mapping generally improves accuracy (RMSE) except for few models (PESQ, POLQA)

Correlation Analysis for Music/Mixed Content:

Based on DCR experiments (clause 7.3.4), evaluating: POLQA, PEMO-Q, ViSQOL-A, and 2f-model

Key Observations for Music/Mixed Content:

  • POLQA (despite not being recommended for non-speech) showed best correlation results (Pearson, Kendall, RMSE after 3rd order mapping)
  • 2f-model was second-best performing
  • ViSQOL Audio, PEAQ, and PEMO-Q showed fair performance
  • Correlation scores lower than clean speech, possibly due to more difficult task of predicting general audio quality and mismatch with DCR grading methodology

Discussion (9.1.4.2):

  • P.862 (PESQ) officially "withdrawn" by ITU-T, cannot be considered valid standard
  • P.863 remains main ITU-T standard, P.SAMD emerging as potential alternative
  • Testing and parameter adjustment based on objective tools not recommended
  • 3GPP TR 26.921 documented that tuning noise reduction based on PESQ should be avoided

Conclusion (9.1.4.3):

  • Subjective testing remains "golden reference" for codec selection
  • Objective metrics NOT recommended for codec selection criteria or codec tuning
  • Correlation of subjective and objective metrics may be considered for codec characterization
  • Objective metrics have merits in other tasks such as codec conformance testing

Proposed Changes to TR 26.940

The document proposes comprehensive revisions to TR 26.940 v0.5.1, specifically to Clause 9 (Test methodologies), incorporating all the analysis and recommendations detailed above regarding both subjective and objective testing approaches for ULBC standardization.

Fraunhofer IIS
On complexity estimation of ULBC

Summary of S4-260241: On Complexity Estimation of ULBC

Document Overview

This contribution addresses the complexity measurement methodology for the Ultra-Low Bitrate Codec (ULBC) under development in 3GPP SA4. The document proposes a hybrid complexity metric that combines traditional DSP-based measurements with ML-specific metrics.

Background and Motivation

Multiple input documents [1-4] have previously discussed complexity measurement approaches:

  • Documents [1] and [3] proposed using WMOPS (Weighted Million Operations Per Second), following conventional speech codec practices
  • Document [2] suggested using MACs and a modified WMOPS version
  • Document [4] emphasized model size considerations

The key challenge is that ULBC will operate on heterogeneous, non-fixed target hardware and processors, requiring a platform-agnostic complexity metric.

Main Technical Contributions

Proposed Hybrid Complexity Metric

The document proposes combining two complementary measurement approaches:

For DSP-based components:

  • Use traditional WMOPS measurement

For ML-based components:

  • Use MAC (Multiply-Accumulate) operations count
  • Include parameter count for memory/model size considerations

Combined metric formula: WMOPS + w · MACs, where w is an ML weighting factor (expected to be < 1) that reflects the vectorization capability of matrix multiplications.
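A minimal sketch of the combined metric follows. The function name, the choice to express the MAC count in millions per second so that both terms share a scale, and all numeric inputs (including w = 0.125) are hypothetical; the contribution does not fix any of them.

```python
def hybrid_complexity(wmops: float, ml_mmacs_per_s: float, w: float) -> float:
    """Combined complexity: WMOPS + w * MACs.

    wmops           -- WMOPS of the DSP-based components
    ml_mmacs_per_s  -- MAC count of the ML components, in millions per second (assumed unit)
    w               -- ML weighting factor, expected to be below 1
    """
    if w < 0:
        raise ValueError("weighting factor must be non-negative")
    return wmops + w * ml_mmacs_per_s


# Hypothetical codec: 20 WMOPS of classic signal processing plus 400 MMAC/s of inference.
total = hybrid_complexity(wmops=20.0, ml_mmacs_per_s=400.0, w=0.125)
print(total)  # 70.0
```

With a weighting factor below 1, a design that moves work from scalar DSP code into matrix multiplications is penalized less per operation, which is the intent behind tying w to vectorization capability.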

Rationale for the Hybrid Approach

Limitations of WMOPS-only approach:

  • WMOPS reflects complexity primarily for DSP operations
  • Does not account for modern vectorization capabilities available even on modern DSPs
  • Less relevant for non-DSP processor types
  • The WMOPS toolbox doesn't reflect modern computational capabilities

ML-specific considerations:

  • ML component complexity is dominated by matrix multiplications
  • Inference time and energy consumption are highly platform-dependent
  • MAC count provides architecture-agnostic computational load measurement
  • Parameter count relates directly to model size, memory usage, and energy consumption

Advantages of the Proposed Metric

The hybrid approach:

  1. Provides an overall complexity estimate for hybrid DSP+ML codec designs
  2. Avoids over-constraining codec design toward specific platforms (referenced S4-260233)
  3. Allows UE vendors to leverage custom architectures and optimizations
  4. Accounts for efficient vectorization of ML components
  5. Enables flexible computational cost balancing between DSP-based and ML-based components
  6. Maintains continuity with established practice while accommodating emerging ML-based designs

Vectorization Capability Reference Data

The document provides example processing units and their vectorization capabilities to inform the ML weighting factor w:

| Chip | Type | Vectorization Capabilities |
|------|------|---------------------------|
| HiFi 5s | DSP | 32×(8×8 bit MAC); 16×(32×16 bit MAC); 8×(32×32 bit MAC) |
| ARM Cortex A55 | CPU | 16×(8×8 MAC); 8×(16×16 MAC FP) |

Proposal

The source proposes to:

  1. Define computational complexity metric by counting:
      • WMOPS for DSP-based components
      • MAC for ML-based components
      • Combine according to: WMOPS + w · MACs (where w is an ML weighting factor)

  2. Define a maximum value as the computational complexity limit in design constraints

  3. Apply similar principles for memory counting metrics

References

The document references four previous contributions [1-4] and two external technical specifications [5-6] for processor capabilities.

Dolby Laboratories Inc., Nokia, Novamint
[FS_ULBC] ULBC Re-Focus Proposal

ULBC Re-Focus Proposal

Background and Motivation

The FS_ULBC study item, initiated nearly a year ago, aims to establish a normative ULBC standard for voice communication over GEO within Rel-20. However, progress has been slow, with crucial issues such as end-to-end simulation parameters remaining unresolved. This contribution proposes a focused approach to meet 3GPP standardization timelines.

Core Proposal: Two-Phase Standardization Approach

The document proposes separating ULBC standardization into two distinct phases to ensure timely delivery while accommodating future enhancements:

Phase 1: Rel-20 ULBC Baseline

  • Scope: GEO-focused functionality based strictly on stable Rel-19 service requirements
  • Rationale: Ongoing 6G Media requirements in SA1, SA2, and SA4 have not yet produced consolidated or normative requirement sets suitable for codec design
  • Principle: Following established 3GPP procedures, the ULBC work item shall not define new service requirements but rely on formally defined and stabilized upstream requirements

Phase 2: Rel-21 ULBC Advanced

  • Scope: Extended functionality aligned with finalized 6G Media requirements
  • Application scenarios: Beyond Rel-20 IMS Voice Call over GEO
  • Compatibility: Should be backward compatible extension of Rel-20 baseline

Technical Configuration Comparison

Application Scenarios

Baseline (Rel-20):

  • IMS Voice Call over GEO based strictly on Rel-19 service requirements

Advanced (Rel-21):

  • Multi-Party Voice Communication
  • IMS Voice Call with ULBC over additional access types beyond GEO

GEO Channel Characteristics & Simulation

Baseline (Rel-20):

  • Single baseline UE Tx/Rx capability
  • Single CNR in UL and DL (e.g., UL single-tone 23 dBm: CNR=5.28 dB for SCS=3.75 kHz, CNR=-0.74 dB for 15 kHz; DL 12-tone single Rx: CNR=-0.61 dB)
  • Single agreed target bitrate compatible with baseline UE capability enabling acceptable system capacity
  • Reliance only on mandatory Rel-19 NB-IoT radio protocol features (except SPS)
  • i.i.d. random block error patterns
  • Single SPS/bundling period (160 ms)

Advanced (Rel-21):

  • Advanced UE capabilities (e.g., increased Tx power, multiple Rx antennas)
  • Multiple CNR assumptions in UL and DL
  • Codec designers may choose optimal bitrate/TBS per CNR
  • Allow reliance on expected Rel-20 and selected non-mandatory NB-IoT features
  • Simulated block error patterns based on advanced features
  • Additional SPS/bundling periods (e.g., 80 ms, 320 ms)
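The two uplink CNR examples quoted for the baseline differ by almost exactly the bandwidth ratio between the two subcarrier spacings. The sketch below shows that relation under the assumption of a fixed carrier power and a flat noise density, so that noise power scales linearly with bandwidth; this reading is inferred from the numbers and is not stated in the contribution. The function name is hypothetical.

```python
import math


def cnr_at_bandwidth(cnr_ref_db: float, bw_ref_hz: float, bw_hz: float) -> float:
    """Rescale a CNR to another noise bandwidth, assuming constant carrier power."""
    return cnr_ref_db - 10 * math.log10(bw_hz / bw_ref_hz)


# Reference point from the text: 5.28 dB at 3.75 kHz subcarrier spacing.
cnr_15khz = cnr_at_bandwidth(cnr_ref_db=5.28, bw_ref_hz=3750.0, bw_hz=15000.0)
print(f"{cnr_15khz:.2f} dB")
```

Quadrupling the bandwidth costs about 6 dB of CNR, which is the gap between the 3.75 kHz and 15 kHz single-tone figures.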

Design Constraints: Bitrate

Baseline (Rel-20):

  • Single target bitrate derived from Rel-19 GEO IMS voice service requirements
  • Example: TBS=208 with SPS period 160 ms, achieving 950 bps net bitrate

Advanced (Rel-21):

  • Multiple target CNRs with bitrate as codec design choice
  • Additional bitrates for future 6G-related scenarios
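The relation between transport block size, SPS period, and bitrate in the baseline example is simple arithmetic, sketched below with hypothetical function and variable names. The gross rate follows directly from TBS and period; the 56 bits per period separating it from the quoted 950 bps net rate are only implied by the two figures, and the summary does not itemize what they contain.

```python
def bitrate_bps(bits_per_period: float, period_ms: float) -> float:
    """Bitrate obtained when sending a fixed number of bits every period."""
    return bits_per_period * 1000.0 / period_ms


TBS_BITS = 208
SPS_PERIOD_MS = 160
NET_BITRATE_BPS = 950

gross_bps = bitrate_bps(TBS_BITS, SPS_PERIOD_MS)        # all bits of the transport block
net_bits = NET_BITRATE_BPS * SPS_PERIOD_MS / 1000.0     # codec payload bits per period
implied_overhead_bits = TBS_BITS - net_bits             # remainder, not broken down in the text

print(gross_bps, net_bits, implied_overhead_bits)
```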

Design Constraints: Sample Rate and Audio Bandwidth

Baseline (Rel-20):

  • Single sample rate: e.g., 16 kHz
  • Audio bandwidth: up to WB
  • Note: May depend on agreed target bitrate

Advanced (Rel-21):

  • Input/output sampling rates: at least 8, 16, 32, 48 kHz
  • Audio bandwidth unconstrained (codec design choice)

Design Constraints: Frame Length and Algorithmic Delay

Baseline (Rel-20):

  • Corresponding to SPS/bundling period (160 ms) or sub-multiples thereof
  • Algorithmic delay excl. framing: e.g., ≤80 ms (0.5 × SPS/bundling period)

Advanced (Rel-21):

  • Frame structure and algorithmic delay aligned with advanced SPS/bundling options and future 6G Media requirements

Design Constraints: Complexity and Memory

Baseline (Rel-20):

  • Limited; sufficiently low to not preclude deployment on current-generation smartphones
  • TBD MMAC/s
  • E.g., 3M parameters

Advanced (Rel-21):

  • Relaxed, enabling multiple models
  • Addressing future 6G Media requirements while leveraging new UE hardware trends
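To give a sense of scale for the example parameter budget, the sketch below converts a parameter count into weight storage for a few common numeric precisions. The precisions listed are assumptions chosen for illustration; the proposal only gives the parameter count and does not say how weights would be stored.

```python
# Bytes per stored weight for a few common precisions (assumed, not from the proposal).
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def model_size_mb(n_params: int, dtype: str) -> float:
    """Weight storage in megabytes (10**6 bytes), ignoring metadata and activations."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e6


sizes_mb = {dtype: model_size_mb(3_000_000, dtype) for dtype in BYTES_PER_PARAM}
print(sizes_mb)
```

The same parameter limit therefore maps to noticeably different memory footprints depending on quantization, which is one reason a memory constraint may need to be stated separately from a parameter count.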

Design Constraints: Packet Loss Concealment

Baseline (Rel-20):

  • Required; capable of addressing single agreed-upon target bit rate and operation point of IMS Voice Call over GEO

Advanced (Rel-21):

  • Required; capable of supporting anticipated extended application scenarios beyond Rel-20 IMS Voice Call over GEO, while fulfilling potential 6G Media requirements

Design Constraints: Noise Suppression and Robustness

Baseline (Rel-20):

  • No requirement to provide noise suppression
  • Required capability to handle and reconstruct noisy speech input with moderate to high SNR
  • Note: Noise reconstruction capability primarily enforced through performance requirements

Advanced (Rel-21):

  • No requirement to provide noise suppression
  • Required capability to handle speech and generic input anticipated in extended application scenarios

Design Constraints: DTX

Baseline (Rel-20):

  • No requirement to support DTX
  • Note: No separate DTX-related performance requirement

Advanced (Rel-21):

  • DTX support may be required for certain extended application scenarios, depending on potential 6G Media requirements

Performance Requirements

Baseline (Rel-20):

  • Requirements focusing on clean and noisy speech performance
  • NWT AMR7.4 or NWT AMR-WB8.85 depending on target bandwidth for:
      • Clean speech
      • Noisy speech (AMR/AMR-WB references operated with DTX on)
      • Relevant transcoding cases with G.711, AMR, AMR-WB, EVS

Advanced (Rel-21):

  • Complex set of requirements considering required capability to handle speech and generic input anticipated in extended application scenarios

Test Methodology and Test Plan

Baseline (Rel-20):

  • Subjective: P.800 DCR
  • Note: Test methodology and test plan should be conceptually aligned with corresponding EVS codec standardization Pdocs (e.g., DCR test design, applicable SNRs and types of noises for noisy speech test cases)

Advanced (Rel-21):

  • Subjective: Suitable for critical evaluation of candidate codec(s) against expected complex set of performance requirements

Proposal

SA4 is asked to adopt this phased approach for ULBC standardization as a working assumption:

  1. Rel-20 ULBC Baseline: GEO-focused functionality based solely on Rel-19 service requirements and mandatory Rel-19 features (except SPS), enabling completion of viable ULBC baseline standard within Rel-20 schedule

  2. Rel-21 ULBC Advanced: Extended ULBC functionality aligned with finalized 6G Media requirements, supporting application scenarios beyond Rel-20 IMS Voice Call over GEO, possibly leveraging advanced UE capabilities, and providing backward compatible extension of Rel-20 baseline

This approach ensures a deliverable ULBC baseline in Rel-20 while providing a clear and orderly path toward an enhanced ULBC design in Rel-21.

Qualcomm Incorporated
[FS_ULBC] Feasible TBS values and packet loss traces for 80ms bundling period for ULBC over NB-IoT NTN GEO channel

Feasible TBS Values and Packet Loss Traces for 80ms Bundling Period for ULBC over NB-IoT NTN GEO Channel

1. Background and Scope

This contribution presents simulation results for 80ms bundling period following the Simulation One ("target QoS based simulation") methodology. The document provides:

  • Feasible TBS values
  • Packet loss traces for optimal configurations
  • System capacity analysis

2. Simulation Parameters and Trace Labeling

The simulations cover the following parameter ranges:

  • Direction: UL/DL
  • TBS: 144, 256, 328, 424 bits (for 80ms bundling)
  • Bundling period: 80ms (focus of this paper)
  • Doppler spread: 1Hz, 5Hz
  • Number of RX: UL: 1, DL: 1, 2
  • SCS: UL: 3.75kHz, 15kHz; DL: 15kHz
  • Number of tones: UL: 1 for 3.75kHz SCS; 1, 3, 6, 12 for 15kHz SCS; DL: 15
  • BLER targets: 1%, 2%, 6%, 10%
  • UE power class: 23dBm, 26dBm, 31dBm

Trace file naming convention established for both UL and DL scenarios.

3. Optimal Configurations

3.1 TBS 144, 1 UE RX

Optimality criterion: Tradeoff between per-UE performance (TBS and BLER) and system capacity.

1% BLER

  • DL: 16ms NPDSCH (N_SF=4, N_rep=4), required SNR: -4.6dB, capacity: 5 UEs
  • UL: 48ms NPUSCH (N_RU=6, SCS 15kHz, 1 tone), required SNR: 0.0dB, UE TX power: 26.4dBm
  • Feasibility: Only with 31dBm UE power class

2% BLER

  • DL: 12ms NPDSCH (N_SF=3, N_rep=4), required SNR: -4.1dB, capacity: 6 UEs
  • UL: 64ms NPUSCH (N_RU=8, SCS 15kHz, 1 tone), required SNR: -2.2dB, UE TX power: 24.2dBm
  • Feasibility: 26dBm and 31dBm UE power classes

6% BLER

  • DL: 8ms NPDSCH (N_SF=4, N_rep=2), required SNR: -4.0dB, capacity: 10 UEs
  • UL: 64ms NPUSCH (N_RU=8, SCS 15kHz, 1 tone), required SNR: -3.2dB, UE TX power: 23.2dBm
  • Feasibility: 26dBm and 31dBm UE power classes

10% BLER

  • DL: 6ms NPDSCH (N_SF=3, N_rep=2), required SNR: -3.4dB, capacity: 12 UEs (limited by UL)
  • UL: 64ms NPUSCH (N_RU=8, SCS 15kHz, 1 tone), required SNR: -3.7dB, UE TX power: 22.7dBm
  • Feasibility: All power classes (23dBm, 26dBm, 31dBm)
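The feasibility statements in this clause are consistent with a simple rule: a configuration is usable by a power class when the required UE TX power does not exceed the nominal power of that class. The sketch below applies that rule to the four uplink figures above. The rule itself is an assumption read off the numbers, the names are hypothetical, and downlink constraints are ignored.

```python
POWER_CLASSES_DBM = (23, 26, 31)


def feasible_classes(required_tx_dbm: float) -> list:
    """Power classes whose nominal power covers the required UE TX power."""
    return [pc for pc in POWER_CLASSES_DBM if pc >= required_tx_dbm]


# Required UE TX power (dBm) per BLER target (%), TBS 144, 1 UE RX, as listed above.
required_tx_by_bler = {1: 26.4, 2: 24.2, 6: 23.2, 10: 22.7}

feasibility = {bler: feasible_classes(p) for bler, p in required_tx_by_bler.items()}
for bler, classes in feasibility.items():
    print(f"{bler}% BLER -> power classes {classes} dBm")
```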

3.2 TBS 144, 2 UE RX

For 23dBm UE Power Class

  • Only 10% BLER achievable with system capacity: 12 UEs
  • Uses 64ms NPUSCH (N_RU=8, SCS 15kHz, 1 tone) and 3ms NPDSCH (N_SF=3, N_rep=1)
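The transmission durations quoted with each configuration follow from the allocation parameters. The sketch below reproduces several of them; the function names are hypothetical, and the resource-unit durations come from the NB-IoT uplink numerology (NPUSCH format 1) rather than from this summary. The 3-tone, 40 ms entries in clause 3.5 are not checked because the summary lists no repetition factor for them.

```python
# NPUSCH format 1 resource-unit duration in ms, keyed by (subcarrier spacing in kHz, tones).
RU_DURATION_MS = {
    (3.75, 1): 32,
    (15.0, 1): 8,
    (15.0, 3): 4,
    (15.0, 6): 2,
    (15.0, 12): 1,
}


def npusch_ms(n_ru: int, scs_khz: float, tones: int, n_rep: int = 1) -> int:
    """Uplink duration: resource units x repetitions x RU duration."""
    return RU_DURATION_MS[(float(scs_khz), tones)] * n_ru * n_rep


def npdsch_ms(n_sf: int, n_rep: int = 1) -> int:
    """Downlink duration: 1 ms subframes x repetitions."""
    return n_sf * n_rep


print(npusch_ms(8, 15, 1))           # N_RU=8, SCS 15kHz, 1 tone
print(npusch_ms(2, 3.75, 1))         # N_RU=2, SCS 3.75kHz, 1 tone
print(npdsch_ms(4, 4), npdsch_ms(3, 1))
```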

For 26dBm and 31dBm UE Power Classes

  • 1% BLER: 20 UEs, uses 64ms NPUSCH (N_RU=2, SCS 3.75kHz, 1 tone), UE TX power: 25.6dBm, 4ms NPDSCH
  • 2% BLER: 20 UEs, uses 64ms NPUSCH (N_RU=2, SCS 3.75kHz, 1 tone), UE TX power: 25.0dBm, 4ms NPDSCH
  • 6% BLER: 20 UEs, uses 64ms NPUSCH (N_RU=2, SCS 3.75kHz, 1 tone), UE TX power: 23.9dBm, 4ms NPDSCH
  • 10% BLER: 26 UEs, uses 64ms NPUSCH (N_RU=2, SCS 3.75kHz, 1 tone), UE TX power: 23.4dBm, 3ms NPDSCH

Key observation: 3.75kHz SCS configuration becomes optimal for higher power classes due to better coding rate.

3.3 TBS 256, 2 UE RX

For 26dBm UE Power Class

  • 10% BLER: 12 UEs, 64ms NPUSCH (N_RU=8, SCS 15kHz, 1 tone), UE TX power: 24.8dBm
  • 6% BLER: 12 UEs, 64ms NPUSCH (N_RU=8, SCS 15kHz, 1 tone), UE TX power: 25.3dBm
  • 2% and 1% BLER: Infeasible

For 31dBm UE Power Class

  • 10% BLER: 16 UEs, 64ms NPUSCH (N_RU=2, SCS 3.75kHz, 1 tone), UE TX power: 27.2dBm, 5ms NPDSCH
  • 6% BLER: 16 UEs, 64ms NPUSCH (N_RU=2, SCS 3.75kHz, 1 tone), UE TX power: 27.8dBm, 5ms NPDSCH
  • 2% BLER: 10 UEs, 64ms NPUSCH (N_RU=8, SCS 15kHz, 1 tone), UE TX power: 26.3dBm, 8ms NPDSCH
  • 1% BLER: 10 UEs, 64ms NPUSCH (N_RU=8, SCS 15kHz, 1 tone), UE TX power: 26.8dBm, 8ms NPDSCH

3.4 TBS 328, 2 UE RX

For 26dBm UE Power Class

  • Only 10% BLER achievable: 12 UEs, 64ms NPUSCH (N_RU=8, SCS 15kHz, 1 tone), UE TX power: 25.88dBm

For 31dBm UE Power Class

  • 10% BLER: 13 UEs, 64ms NPUSCH (N_RU=2, SCS 3.75kHz, 1 tone), UE TX power: 30.5dBm, 6ms NPDSCH
  • 6% BLER: 10 UEs, 64ms NPUSCH (N_RU=4, N_rep=2, SCS 15kHz, 1 tone), UE TX power: 26.4dBm, 8ms NPDSCH
  • 2% BLER: 10 UEs, 64ms NPUSCH (N_RU=4, N_rep=2, SCS 15kHz, 1 tone), UE TX power: 27.5dBm, 8ms NPDSCH
  • 1% BLER: 8 UEs, 64ms NPUSCH (N_RU=4, N_rep=2, SCS 15kHz, 1 tone), UE TX power: 28.1dBm, 10ms NPDSCH

3.5 TBS 424, 2 UE RX

Note: Coarse 5ms granularity for NPDSCH time-domain configuration.

For 31dBm UE Power Class

  • 10% BLER: 4 UEs, 40ms NPUSCH (N_RU=5, SCS 15kHz, 3 tones), UE TX power: 29.10dBm, 10ms NPDSCH
  • 6% BLER: 4 UEs, 40ms NPUSCH (N_RU=5, SCS 15kHz, 3 tones), UE TX power: 29.73dBm, 10ms NPDSCH
  • 2% BLER: 4 UEs, 40ms NPUSCH (N_RU=5, SCS 15kHz, 3 tones), UE TX power: 30.96dBm, 10ms NPDSCH
  • 1% BLER: Infeasible

4. Feasible TBS Values

Observation: For 80ms bundling period with UE power class up to 31dBm:

  • All TBS values (144, 256, 328, 424) are feasible for BLERs 1%, 2%, 6%, and 10%
  • Exception: TBS 424 is not feasible at 1% BLER
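One way to read the per-class results in clauses 3.2 to 3.5 is that a configuration is usable by a power class when the required UE TX power does not exceed the class maximum. The sketch below applies that rule to a few of the quoted TX powers. The rule itself is an assumption made for illustration; the contribution's own feasibility criterion may include further conditions, and the function names are hypothetical.

```python
# UE power classes considered in the summary, as maximum TX power in dBm.
POWER_CLASSES_DBM = (23.0, 26.0, 31.0)


def feasible_power_classes(required_tx_dbm):
    """Power classes whose maximum output covers the required TX power.

    Assumption: feasibility means required power <= class maximum.
    """
    return [pc for pc in POWER_CLASSES_DBM if required_tx_dbm <= pc]


def headroom_db(required_tx_dbm, power_class_dbm):
    """Margin in dB left between the class maximum and the required power."""
    return power_class_dbm - required_tx_dbm


# TBS 144 at 1% BLER needs 25.6 dBm: 26 and 31 dBm classes only.
print(feasible_power_classes(25.6))
# TBS 424 at 2% BLER needs 30.96 dBm: only the 31 dBm class, with little margin.
print(feasible_power_classes(30.96), round(headroom_db(30.96, 31.0), 2))
```

Under this reading, TBS 424 at 2% BLER leaves almost no margin in the 31dBm class, which fits the 1% BLER target being reported as infeasible.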

5. Packet Loss Traces

299,391 traces provided in attached zip file for all 4 TBS values (144, 256, 328, 424).

6. Proposal

Proposal: Include clauses 2 through 5 in the PD or TR to provide a workable example of determining configurations based on the optimal tradeoff between per-UE performance and system capacity.

Apple Inc.
[FS_ULBC] ULBC Performance Requirements

Summary of S4-260271: ULBC Performance Requirements

Document Information

  • Source: Apple Inc.
  • Meeting: 3GPP TSG SA WG4#135, Goa, India (09-13 February 2026)
  • Type: Discussion and Agreement
  • Revision: Revision of S4aA250135 addressing comments from SA4 #134 post-adhoc telco (Dec 02)

Main Technical Contributions

Performance Requirements Framework

The document proposes establishing minimum performance requirements for the Ultra-Low Bitrate Codec (ULBC) based on the following rationale:

  • ULBC targets IMS voice service over GEO and NGSO satellite systems (per Clause 4, TR 26.940)
  • Quality must be consistent with deployed VoLTE IMS voice services
  • Current TBS discussions center on bitrates in the 1-3 kbps range
  • AMR-WB 12.65kbps and EVS-SWB 13.2kbps are commonly deployed VoLTE operating points

Proposed Minimum Performance Benchmarks

The document establishes two key performance anchors:

  1. At lowest operating range (~1 kbps):
       • ULBC shall provide speech quality No Worse Than (NWT) AMR-WB @12.65kbps
       • Applies to: clean speech, noisy speech, and packet loss conditions

  2. At higher operating range (~3 kbps):
       • ULBC shall provide speech quality No Worse Than (NWT) EVS-SWB @13.2kbps
       • Applies to: clean speech, noisy speech, and packet loss conditions

Reference Codecs and Operating Points for Testing

The document proposes a comprehensive list of reference codecs and operating points for ToR comparison testing in subjective evaluation:

  • AMR: 12.2kbps
  • AMR-WB: 8.85kbps, 12.65kbps, 23.85kbps
  • EVS AMR-WB-IO: 8.85kbps, 12.65kbps, 23.85kbps
  • EVS-WB/SWB: 7.2kbps, 8kbps, 9.6kbps, 13.2kbps, 13.2kbps CA, 24.4kbps

Text Proposal

The document proposes updates to Clause 8 (Performance requirements) of TR 26.940, adding:

  • New Clause 8.1 (General) containing the performance requirements framework and minimum benchmarks
  • New Clause 8.1.1 (A List of Reference Codecs and Operating Points) containing the reference codec list for subjective evaluation

Apple Inc.
[FS_ULBC] ULBC Codec Testing in Background Noise

Summary of S4-260272: ULBC Codec Testing in Background Noise

Document Overview

This contribution proposes a testing framework for the Ultra-Low Bitrate Codec (ULBC) in noisy conditions, drawing from EVS codec testing methodologies. The document is a revision of S4-251786 from SA4#134 and proposes updates to TR 26.940 Clause 9.

Background and Motivation

Noise Suppression Considerations

The document argues against mandating NS algorithms within the codec specification based on several key considerations:

  • Device-Specific Optimization: NS algorithms are typically optimized for specific device microphone array configurations. A generic NS algorithm applied uniformly could result in suboptimal performance across different device types.

  • Codec Robustness vs. NS Artifacts: Testing ULBC with clean, noisy, and optionally NS-processed speech provides better understanding of the codec's inherent robustness. NS algorithms may introduce speech distortions that could bias codec testing results.

  • Emergency Call Requirements: For emergency calls, preserving background noise is critical as it may contain important contextual information (alarms, traffic, voices) that helps identify the caller's environment or ongoing danger.

  • Complexity and Latency Concerns: ML-based NS algorithms can be computationally complex, increasing power consumption and end-to-end latency. Mandating complex NS could burden some devices inefficiently.

The document advocates for flexibility in NS implementation to enable manufacturers to develop device-specific solutions.

Proposed Testing Framework

Core Testing Scenarios (Table 9.1.4.1)

Following EVS codec testing principles (TR 26.952), the proposal includes:

| Source Material | Noise Type | SNR | Test Methodology |
|----------------|------------|-----|------------------|
| Clean speech | - | - | ITU-T P.800 ACR and/or DCR |
| Speech + Noise | Stationary (car, etc.) | 15 dB | ITU-T P.800 DCR |
| Speech + Noise | Non-stationary (street, babble, etc.) | 20-25 dB | ITU-T P.800 DCR |

This framework aligns with EVS testing, which used:

  • Car noise at 15 dB
  • Street noise at 20 dB
  • Office/babble noise at 20 dB
  • ITU-T P.800 DCR methodology ("Degradation of Speech in Noise" DMOS test)
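Preparing "Speech + Noise" material at a target SNR amounts to scaling the noise relative to the speech before mixing. The following is a minimal sketch of that scaling, assuming plain mean power over the whole signal. Formal test plans typically measure active speech level (for example per ITU-T P.56) and may weight the noise, which this sketch does not do. All function names are illustrative.

```python
import math


def mean_power(samples):
    """Mean squared amplitude of a sequence of samples."""
    return sum(x * x for x in samples) / len(samples)


def noise_gain_for_snr(speech, noise, target_snr_db):
    """Linear gain for the noise so that speech/noise power hits the target SNR.

    Assumption: power is a plain mean over all samples, with no
    active-speech-level measurement.
    """
    ratio = 10 ** (target_snr_db / 10)
    return math.sqrt(mean_power(speech) / (mean_power(noise) * ratio))


def mix_at_snr(speech, noise, target_snr_db):
    """Return speech plus noise scaled to the requested SNR."""
    gain = noise_gain_for_snr(speech, noise, target_snr_db)
    return [s + gain * n for s, n in zip(speech, noise)]


# Toy signals (hypothetical data), mixed at the 15 dB car-noise SNR.
speech = [1.0, -1.0] * 4
noise = [0.5, 0.5, -0.5, -0.5] * 2
mixed = mix_at_snr(speech, noise, 15)
```

The same function covers the optional low-SNR conditions by changing only the target value.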

Optional Extended Testing for Low SNR (Table 9.1.4.2)

To characterize ULBC robustness in challenging low SNR conditions:

| Source Material | Noise Type | SNR | Test Methodology |
|----------------|------------|-----|------------------|
| Speech + Noise | Stationary (car, etc.) | 5-10 dB | ITU-T P.800 DCR |
| Speech + Noise | Non-stationary (street, babble, etc.) | 10-15 dB | ITU-T P.800 DCR |
| NS processed speech + Noise | Stationary (car, etc.) | 5-10 dB | ITU-T P.800 DCR |
| NS processed speech + Noise | Non-stationary (street, babble, etc.) | 10-15 dB | ITU-T P.800 DCR |

Key Notes:

  • To avoid bias, a common NS processing tool should be used for generating NS-processed speech
  • Selection of specific noise types and the NS processing tool is FFS
  • Reference is made to TR 26.989 v19.0.0 (MCPTT work) where EVS was evaluated in siren noise at 5 dB SNR

Proposed Specification Changes

The document proposes adding new Clause 9.1.4 to TR 26.940 with two subclauses:

  • 9.1.4.1 Background: Captures the rationale for flexible NS implementation
  • 9.1.4.2 Recommendations for ULBC Codec Testing: Defines the testing framework with Tables 9.1.4.1 and 9.1.4.2

Action Requested

The document seeks Discussion and Agreement on:

  1. The proposed testing framework for ULBC in noisy conditions
  2. Updates to TR 26.940 Clause 9 as specified in the text proposal

Dolby Laboratories Inc., Nokia, Novamint
[FS_ULBC] On device capability diversity

Summary of S4-260275: On Device Capability Diversity for ULBC

Overview

This document (revision of S4aA260006) addresses UE capability diversity in NB-IoT NTN deployments for ULBC voice services. It proposes a capability-aware system design approach rather than assuming uniform baseline UE capabilities, accompanied by a pCR to TR 26.940.

Key Technical Contributions

1. UE Capability Diversity Framework

Identified Capability Dimensions:

  • Transmit Power Classes:
      • Baseline: PC3 (23 dBm)
      • Enhanced: PC2 (26 dBm) or PC1 (31 dBm)
      • Future: up to 37 dBm under study for Rel-20

  • Receive Antenna Configurations:
      • Standard: Single RX antenna
      • Enhanced: Dual RX antennas (providing ~3 dB gain)

  • Advanced Features:
      • Improved RF sensitivity
      • Multi-tone NPUSCH transmission capability

Key Insight: These capabilities are optional and vary across device categories, market segments, and implementations.

2. Benefits of Enhanced Capabilities

Enhanced UE capabilities enable:

  • Reduced time-domain resource usage in half-duplex NB-IoT transmission
  • Overcoming limitations of 80 ms SPS periods (excessive BLER and capacity constraints)
  • Multi-tone NPUSCH transmission for:
  • Higher ULBC bitrates
  • Reduced time-domain resource usage
  • Improved link robustness (reduced packet error rates)

3. Capability-Aware Multi-User SPS Scheduling

Proposed Scheduling Strategy:

  • Dynamic SPS Assignment: Enhanced UEs use shorter SPS periods (80 ms) while baseline UEs use longer periods (160/320 ms)
  • Multi-Tone Transmission: Enhanced UEs utilize multi-tone NPUSCH formats
  • Load Balancing: Resource allocation prioritized based on UE capability
  • Service Differentiation: Three-tier service model:
      • Baseline Service: Conservative configurations (long SPS, single-tone NPUSCH)
      • Intermediate Service: Moderate enhancements (shorter SPS, possible multi-tone)
      • Enhanced Service: Higher bitrates and reduced latency (shortest SPS, multi-tone, dual RX)

Practical Example (Figure 1):

  • UE Type A (Baseline): 160 ms SPS, 128 ms single-tone NPUSCH, 950 bits/s net bitrate (TBS 208 bits)
  • UE Type B (Intermediate): 80 ms SPS, 64 ms NPUSCH (higher TX power), 1100 bits/s net bitrate (TBS 144 bits)
  • UE Type C (Enhanced): 80 ms SPS, 64 ms NPUSCH, reduced NPDSCH duration (dual RX), 1100 bits/s net bitrate (TBS 144 bits)
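The bitrates in this example follow from the transport block size and the SPS period. The sketch below computes the gross rate (TBS divided by period) and a net rate after subtracting a fixed per-packet overhead. The 56-bit overhead is not stated in the summary; it is an inferred value, chosen only because it reproduces both quoted net bitrates, and should be treated as an assumption.

```python
# Assumed per-transport-block overhead in bits. NOT given in the summary:
# inferred because it reproduces both quoted net bitrates.
ASSUMED_OVERHEAD_BITS = 56


def gross_bitrate_bps(tbs_bits, sps_period_ms):
    """Bits per second if the whole transport block carried payload."""
    return tbs_bits * 1000 / sps_period_ms


def net_bitrate_bps(tbs_bits, sps_period_ms, overhead_bits=ASSUMED_OVERHEAD_BITS):
    """Bits per second left for the codec after per-packet overhead."""
    return (tbs_bits - overhead_bits) * 1000 / sps_period_ms


# UE Type A: TBS 208 bits every 160 ms.
print(gross_bitrate_bps(208, 160), net_bitrate_bps(208, 160))  # 1300.0 950.0
# UE Types B and C: TBS 144 bits every 80 ms.
print(gross_bitrate_bps(144, 80), net_bitrate_bps(144, 80))    # 1800.0 1100.0
```

With a fixed per-packet overhead, halving the SPS period doubles the overhead rate, which is why the shorter period with a smaller TBS yields only a modest net gain.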

4. ULBC Bitrate Differentiation

Proposed Approach:

  • Leverage UE capability diversity for bitrate differentiation
  • Align bitrates with service tiers and UE capabilities
  • Recommended minimum set of 3 ULBC target bitrates:
      • Basic tier: [600 - 1000] bits/s
      • Intermediate tier: [1000 - 1800] bits/s
      • Enhanced tier: [1800 - 3000] bits/s
  • Higher bitrates may be considered in second ULBC standardization phase

Note: Actual bitrates subject to ongoing TBS discussions; values >3000 bits/s may become relevant.

Proposed Changes to TR 26.940 (pCR)

Section 5.2.4: New Clause on UE Capabilities

Documents capability variations for NB-IoT NTN:

  • Transmit Power Classes: PC3/PC5 (Rel-18), PC1/PC2 (Rel-19), potential >31 dBm (Rel-20)
  • Receive Antennas: Single (typical) vs. dual (enhanced)
  • Enhanced Capabilities: Higher TX power, improved RF sensitivity

Section 5.2.5: Enhanced Multi-User Considerations

Replaces assumption of uniform UE configuration with capability-aware scheduling:

  • Capability-Aware Resource Allocation: Different SPS periods based on UE capabilities
  • Multi-Tone Transmission for Enhanced UEs: Increased bitrate and/or reduced resource usage
  • Dynamic Load Balancing: Optimized capacity through capability-based prioritization
  • Service Level Differentiation: Three-tier service model aligned with UE capabilities

Includes Figure 1 demonstrating practical multi-user scheduling scenario with three UE types.

Section 5.1.2.2: UE Delay Tables

Updates delay estimation tables (5.1.2-2, 5.1.3-1) to include:

  • Voice bundling periods: 80, 160, 320 ms
  • Codec frame sizes: 20, 40, 80, 160, 320 ms
  • Mouth-to-ear delay estimates for GEO-TN and GEO-GEO scenarios

Recommendations

  1. Adopt capability-aware ULBC system design rather than assuming single baseline configuration
  2. Agree on minimum set of 3 ULBC target bitrates for codec evaluation (approximate ranges: [600-1000], [1000-1800], [1800-3000] bits/s)
  3. Document agreed ULBC target bitrates in Pdoc
  4. Consider higher bitrates in second ULBC standardization phase

References

Key dependencies: S4aA260006 (previous version), S4-260144 (TR 26.940 v0.5.1), S4-260255 (ULBC Re-Focus Proposal), TS 36.763 (UE radio transmission/reception), S4-251863 (system capacity), S4aA250112 (error trace methodology), S4aA250118 (RAN simulation results)

Orange, Dolby Laboratories Inc.
On the use of objective metrics in ULBC standardization

Summary of 3GPP Technical Document on Objective Metrics in ULBC Standardization

Introduction and Scope

This document addresses the "Study on Ultra Low Bitrate Speech Codec" (FS_ULBC) approved at SA#107, specifically focusing on study objective 5 from the WID regarding performance requirements and test methodologies for speech quality, intelligibility, and conversational quality across various conditions (clean/noisy speech, tandeming with IMS codecs, clean/GEO channel conditions).

The contribution provides correlation analysis results of objective quality models as a complement to subjective test results on clean speech and music/mixed content in TR 26.940, building upon previous discussions in S4-251814.

Main Technical Contributions

Test Methodologies - General Considerations (Clause 9.1.1-9.1.2)

Quality Impairment Categories for ULBC:

  • Loss of listening-only audio quality
  • Audio bandwidth loss
  • Impaired intelligibility
  • Impaired speaker identifiability
  • Prosodic impairments
  • Hallucination (word and phone confusions)
  • Sensitivity to non-speech input (background noise, music, noisy speech, interfering talkers, reverberant speech)

Testing Challenges:

  • ML-based ULBC codecs introduce new impairment categories (e.g., hallucination) not present in signal-processing based codecs (AMR, AMR-WB, EVS)
  • Traditional P.800 ACR methodology may not optimally quantify all potential impairments
  • DCR methodology focuses on differences to reference, suitable for small impairments and prosodic differences
  • Previous 3GPP codec standardization (AMR, AMR-WB, EVS) used ACR for clean speech and DCR for SWB, mixed-bandwidth, noisy speech, and music evaluations

Alternative Test Methods Listed:

  • Diagnostic Rhyme Tests (DRT)
  • Modified Rhyme Tests (MRT)
  • MOS testing for speaker similarity
  • Speaker verification/identification tests
  • Prosodic naturalness MOS tests
  • Intonation recognition tests
  • Transcription tests for word and semantic equivalence
  • Phoneme recognition tests
  • Automatic speech recognition tests
  • P.835 multi-dimensional rating scales for speech enhancement evaluation

Subjective Testing Considerations (Clause 9.1.3)

Source Material Robustness (9.1.3.1):

  • Multiple languages with diverse intonations
  • Various phonetic and linguistic environments
  • Different voice pitches and speaking styles
  • Overlapping talkers

Real-world Acoustic Conditions (9.1.3.2):

  • Clean environments (minimal background noise)
  • Noisy environments (traffic, human chatter, vehicle)
  • Varying reverberation levels (RT60 ranging from 0.3s to 1.0s)

Tandeming and Compatibility Testing (9.1.3.3):

  • Testing with speech previously encoded by ITU-T G.711, AMR, AMR-WB, and EVS
  • Various input levels: -16dBov, -26dBov, and -36dBov

Conclusion:

  • P.800 ACR/DCR serves as backbone for most subjective testing
  • Other methodologies may be considered
  • Emphasis on diverse test material covering multilingual/multi-speaker testing, real-world acoustic conditions, and tandeming

Objective Testing Considerations (Clause 9.1.4)

Correlation Analysis on Clean Speech (9.1.4.1):

Evaluated objective models from references [7-11]:

  • Speech-oriented metrics: PESQ, POLQA, ViSQOL-S, WARP-Q, DNSMOS, NISQA, NORESQA, UTMOS
  • General audio metrics: PEAQ, ViSQOL-A
  • Additional metric: SCOREQ

Evaluation metrics used: Pearson correlation coefficient, RMSE, Kendall's Tau rank correlation coefficient
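For readers unfamiliar with the three figures of merit, the sketch below computes them for a pair of score lists using only the Python standard library. The subjective and objective scores are hypothetical placeholders, not data from the contribution, and Kendall's Tau is implemented in its tau-b form as an assumption, since the summary does not say which variant was used.

```python
import math


def pearson(x, y):
    """Pearson linear correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def rmse(x, y):
    """Root mean squared error between two score lists."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))


def kendall_tau_b(x, y):
    """Kendall rank correlation (tau-b), counting pairs in O(n^2)."""
    concordant = discordant = tied_x = tied_y = 0
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied in both lists: excluded from every term
            if dx == 0:
                tied_x += 1
            elif dy == 0:
                tied_y += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    untied = concordant + discordant
    return (concordant - discordant) / math.sqrt((untied + tied_x) * (untied + tied_y))


# Hypothetical per-condition scores (NOT from the contribution).
subjective_mos = [1.8, 2.5, 3.1, 3.9, 4.4]
objective_mos = [2.0, 2.4, 3.3, 3.6, 4.5]
print(pearson(subjective_mos, objective_mos))
print(rmse(subjective_mos, objective_mos))
print(kendall_tau_b(subjective_mos, objective_mos))
```

Pearson and RMSE measure how closely the predicted values track the subjective scores, while Kendall's Tau only checks whether conditions are ranked in the same order, which is the property behind the "monotonic bitrate/quality behavior" observation.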

Key Observations (Clean Speech):

  • Best performing models (POLQA, UTMOS, PESQ, WARP-Q, SCOREQ) accurately predicted monotonic bitrate/quality behavior of multirate codecs
  • Models operating at 16 kHz (PESQ without mapping, UTMOS and WARP-Q with mapping) showed relatively good performance even for fullband codecs
  • Mapping generally improves accuracy (RMSE), except for a few models (PESQ, POLQA)

Correlation Analysis on Music/Mixed Content:

Evaluated models from references [7-12]: POLQA, PEMO-Q, ViSQOL-A, and 2f-model

Key Observations (Music/Mixed Content):

  • POLQA (despite not being recommended for non-speech signals) gave best correlation results (Pearson, Kendall, RMSE after 3rd order mapping)
  • 2f model was second-best performing
  • ViSQOL Audio, PEAQ, and PEMO-Q showed fair performance despite being adapted to music/mixed content
  • Correlation scores lower than clean speech, possibly due to more difficult task of predicting quality for general audio and mismatch with DCR test methodology grading

Discussion (9.1.4.2):

  • P.862 (PESQ) has been officially "withdrawn" by ITU-T and cannot be considered a valid standard
  • P.863 remains the main ITU-T standard, with P.SAMD emerging as a potential alternative
  • Testing and parameter adjustment based on objective tools is not recommended
  • 3GPP TR 26.921 documented that tuning noise reduction based on PESQ should be avoided

Conclusion (9.1.4.3):

  • Subjective testing remains "golden reference" for codec selection
  • Objective metrics NOT recommended for codec selection criteria or codec tuning
  • Correlation of subjective/objective metrics may be considered for characterization of new codec
  • Objective metrics have merits in other tasks such as codec conformance testing

Document Type

This is a proposed Change Request (pCR) to TR 26.940, specifically targeting Clause 9 (Test methodologies) with additions to subclauses 9.1.1 through 9.1.4.
