[FS_ULBC] Analysis on complexity evaluation of ULBC with WMOPS |
Bytedance |
Analysis on Complexity Evaluation of ULBC with WMOPS
1. Introduction
This contribution examines the use of WMOPS (Weighted Million Operations Per Second) as a complexity metric for ULBC (Ultra Low Bitrate Codec). WMOPS has been proposed as one possible complexity metric and is traditionally used to evaluate the complexity of 3GPP speech codecs. The analysis focuses on the WMC tool, which automates WMOPS calculation for floating-point C code.
2. Technical Analysis: Discrepancies Between ITU-T Documentation and WMC Tool Implementation
The source conducted systematic testing of the WMC tool against the examples provided in ITU-T standards documentation (specifically clause 18.12.7 and related tables in the ITU-T Software Tool Library 2024 User's Manual). Several discrepancies were identified:
2.1 'Move' Operator Counting
Issue: Extra MOVE operations are counted by the WMC tool
- Expected behavior (per Table 18.4): Division by constant
b = a / L should count as 1 MULT (since 1/L is constant, operation becomes multiplication)
- Actual WMC output: 1 MULT + 1 MOVE
- Discrepancy: 1 additional MOVE operation
2.2 Increment Operator ('++')
Issue: Missing operations in WMC tool output
- Expected behavior (per Table 18.4):
(*rnd_T0)++ should count as 1 ADD + 1 STORE (equivalent to *rnd_T0 = *rnd_T0 + 1)
- Actual WMC output: 0 operations counted
- Discrepancy: 1 ADD + 1 STORE not counted
- Note: This may not affect actual complexity on DSP implementations where pointer increment can be combined with other operations
2.3 Logical Operators ('AND/OR')
Issue: Missing TEST operation counting
- Expected behavior (per Table 18.4):
if (a!=b || c==d) should count as 2 ADD + 2 BRANCH + 1 TEST
- Actual WMC output: 2 ADD + 2 BRANCH
- Discrepancy: 1 TEST operation missing
2.4 Indirect Addressing
Issue: Extra MOVE operation and incorrect INDIRECT counting
- Expected behavior (per Table 18.4): Double indirection
indice[0] = indirect_dico1[indice[0]] should count as 2 INDIRECT
- Actual WMC output: 1 MOVE
- Discrepancy: 2 INDIRECT operations not counted, 1 MOVE added instead
2.5 Instrumentation with Array Subscripts
Issue: Arithmetic operations inside array subscripts not counted
- Expected behavior (per Table 18.3): Operations like
a[i*2+1] should count arithmetic operations within the subscript (1 INDIRECT + 1 MULT + 1 ADD)
- Actual WMC output: Only 1 MOVE counted
- Discrepancy: INDIRECT, MULT, and ADD operations inside array subscripts are not instrumented
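The five cases above can be tabulated to make the over-counting and under-counting explicit. The following is a minimal sketch that only restates the expected and actual operation counts listed in clauses 2.1 to 2.5; the dictionary layout and the `count_delta` helper are hypothetical and are not part of the WMC tool.

```python
# Expected counts (per the ITU-T manual tables) and counts reported by the
# WMC tool, copied from clauses 2.1-2.5 above. The layout is illustrative.
CASES = {
    "2.1": ({"MULT": 1}, {"MULT": 1, "MOVE": 1}),
    "2.2": ({"ADD": 1, "STORE": 1}, {}),
    "2.3": ({"ADD": 2, "BRANCH": 2, "TEST": 1}, {"ADD": 2, "BRANCH": 2}),
    "2.4": ({"INDIRECT": 2}, {"MOVE": 1}),
    "2.5": ({"INDIRECT": 1, "MULT": 1, "ADD": 1}, {"MOVE": 1}),
}


def count_delta(expected, actual):
    """Return (actual - expected) per operator, omitting operators that match."""
    ops = sorted(set(expected) | set(actual))
    delta = {op: actual.get(op, 0) - expected.get(op, 0) for op in ops}
    return {op: d for op, d in delta.items() if d != 0}


# Positive values are over-counted by the tool, negative values under-counted.
DELTAS = {clause: count_delta(exp, act) for clause, (exp, act) in CASES.items()}

if __name__ == "__main__":
    for clause, delta in DELTAS.items():
        print(clause, delta)
```

A positive entry marks an operator the tool counts but the manual does not expect; a negative entry marks an expected operator the tool misses.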
3. Observations and Impact Assessment
The source identifies three key observations:
- Systematic discrepancies exist between ITU-T standards documentation and WMC tool implementation, with both over-counting (e.g., extra MOVE operations) and under-counting (e.g., missing operations in array subscripts) observed
- Potential significance for AI codecs: Some discrepancies, particularly the counting of MOVE operators and instrumentation inside arrays, could significantly impact WMOPS measurements for AI-based codecs
- Need for clarification: If WMOPS is adopted as a complexity metric for ULBC, these differences must be carefully addressed and the calculation methodology must be clearly defined
4. Proposal
The source proposes to document the findings from Clause 2 and Clause 3 in clause 6.2 of the permanent document to ensure proper consideration of these WMOPS calculation issues in the ULBC complexity evaluation framework.
|
Proposal: The source would like to propose to document clause 2 and clause 3 in clause 6.2 of the pdoc.
|
[FS_ULBC] Influence of code optimization on WMOPS |
Bytedance |
Summary of S4-260128: Influence of Code Optimization on WMOPS
1. Introduction and Motivation
This contribution investigates the impact of C code implementation choices on WMOPS (Weighted Million Operations Per Second) measurements for neural audio codecs, specifically in the context of the ULBC (Ultra-Low Bitrate Codec) study. The source examines whether WMOPS, traditionally used for 3GPP speech codec complexity evaluation, is suitable for neural audio codecs, given that the actual C implementation can significantly affect the measured WMOPS.
2. Experimental Analysis
2.1 Operator-Level Analysis
The source conducted experiments on Conv1D and Conv1DTranspose operators, which are extensively used in DAC (Descript Audio Codec) for audio feature dimension manipulation:
- Non-optimized implementation: Naïve nested loop implementation
- Optimized implementation: Loop unrolling along kernel size dimension only
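The source's C implementations are not reproduced in this summary. As a rough illustration of what "unrolling along the kernel size dimension only" means, the sketch below contrasts a naive nested-loop Conv1D with a variant whose innermost kernel loop is written out by hand. The function names, the fixed kernel size of 3, and the stride-1 "valid" convolution are all assumptions made for the example; it shows the code transformation, not the WMOPS figures.

```python
def conv1d_naive(x, w):
    """x: [C_in][T], w: [C_out][C_in][K]; stride 1, no padding."""
    c_out, c_in, k = len(w), len(w[0]), len(w[0][0])
    t_out = len(x[0]) - k + 1
    y = [[0.0] * t_out for _ in range(c_out)]
    for o in range(c_out):
        for t in range(t_out):
            acc = 0.0
            for c in range(c_in):
                for j in range(k):  # innermost loop over the kernel taps
                    acc += w[o][c][j] * x[c][t + j]
            y[o][t] = acc
    return y


def conv1d_unrolled_k3(x, w):
    """Same computation with the kernel loop unrolled (assumes K == 3)."""
    c_out, c_in = len(w), len(w[0])
    assert len(w[0][0]) == 3, "this sketch hard-codes a kernel size of 3"
    t_out = len(x[0]) - 2
    y = [[0.0] * t_out for _ in range(c_out)]
    for o in range(c_out):
        for t in range(t_out):
            acc = 0.0
            for c in range(c_in):
                wk, xc = w[o][c], x[c]
                # The three taps are written out; no loop counter or loop test.
                acc += wk[0] * xc[t]
                acc += wk[1] * xc[t + 1]
                acc += wk[2] * xc[t + 2]
            y[o][t] = acc
    return y


X = [[1, 2, 3, 4], [0, 1, 0, 1]]
W = [[[1, 0, -1], [2, 2, 2]]]
Y_NAIVE = conv1d_naive(X, W)
Y_UNROLLED = conv1d_unrolled_k3(X, W)
```

Both variants perform the same multiply-accumulate operations; what changes is the loop bookkeeping around them, which is the kind of difference an operation-counting tool picks up.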
Key Results:
- Conv1D: 441 WOPS (non-optimized) → 301 WOPS (optimized) = 68.25% ratio
- Conv1DTranspose: 554 WOPS (non-optimized) → 260 WOPS (optimized) = 46.93% ratio
Finding: The same optimization strategy yields significantly different optimization ratios for different operators.
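The quoted percentages are the optimized count divided by the non-optimized count. A short check, using only the four figures given above (the helper name is hypothetical):

```python
def ratio_percent(optimized, baseline):
    """Optimized count as a percentage of the non-optimized count."""
    return round(100.0 * optimized / baseline, 2)


CONV1D_RATIO = ratio_percent(301, 441)            # Conv1D
CONV1D_TRANSPOSE_RATIO = ratio_percent(260, 554)  # Conv1DTranspose

if __name__ == "__main__":
    print(CONV1D_RATIO, CONV1D_TRANSPOSE_RATIO)
```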
2.2 Full-Model Level Analysis
Using the optimized and non-optimized operator implementations, the source measured WMOPS for two DAC configurations (enc16dec384 and enc64dec1536) and compared against previously reported results [4]:
Total WMOPS:
- enc16dec384: 13,320.35 (non-opt) → 8,152.01 (opt) → 5,785.17 (reported in [4])
- enc64dec1536: 201,552.55 (non-opt) → 123,966.49 (opt) → 84,441.99 (reported in [4])
Encoder WMOPS:
- enc16dec384: 3,411.08 (non-opt) → 2,621.98 (opt) → 1,060.79 (reported in [4])
- enc64dec1536: 50,089.70 (non-opt) → 39,604.59 (opt) → 13,675.30 (reported in [4])
Decoder WMOPS:
- enc16dec384: 9,847.22 (non-opt) → 5,484.21 (opt) → 4,724.38 (reported in [4])
- enc64dec1536: 151,291.59 (non-opt) → 84,255.25 (opt) → 70,766.69 (reported in [4])
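To compare the effect of the optimization across configurations, the figures above can be reduced to optimized/non-optimized ratios. The sketch below uses only the numbers listed in this clause; the table layout and function name are assumptions for illustration.

```python
# (non-optimized, optimized, reported in [4]) WMOPS, copied from the lists above.
WMOPS = {
    ("total", "enc16dec384"): (13320.35, 8152.01, 5785.17),
    ("total", "enc64dec1536"): (201552.55, 123966.49, 84441.99),
    ("encoder", "enc16dec384"): (3411.08, 2621.98, 1060.79),
    ("encoder", "enc64dec1536"): (50089.70, 39604.59, 13675.30),
    ("decoder", "enc16dec384"): (9847.22, 5484.21, 4724.38),
    ("decoder", "enc64dec1536"): (151291.59, 84255.25, 70766.69),
}


def opt_ratio(non_opt, opt, _reported):
    """Optimized WMOPS as a fraction of non-optimized WMOPS."""
    return opt / non_opt


RATIOS = {key: opt_ratio(*values) for key, values in WMOPS.items()}

if __name__ == "__main__":
    for (part, config), r in RATIOS.items():
        print(f"{part:8s} {config:13s} {100 * r:.1f}%")
```

With these inputs the ratio is roughly 61% for the total, 77 to 79% for the encoder, and 56% for the decoder.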
3. Observations and Conclusions
The source draws two critical observations:
- Code optimization sensitivity: Simple optimizations (e.g., single-layer loop unrolling) can result in widely different WMOPS results for the same model
- Inconsistent optimization impact: The same optimization strategy produces different WMOPS reduction ratios across different models
Main Conclusion: If WMOPS is adopted as a complexity metric for ULBC, results will be highly influenced not only by model design but also by the actual C code implementation, potentially making comparisons between different codec proposals inconsistent.
4. Proposal
The source proposes to document the experimental findings and observations as a new clause 7.6.5 "WMOPS analysis on DAC" in TR 26.940, with three sub-clauses:
- 7.6.5.1: On operator level
- 7.6.5.2: On full-model level
- 7.6.5.3: Observation
This would capture the implementation-dependency issues of WMOPS measurements for neural audio codecs in the technical report.
|
Proposal: The source proposes to document clause 2 and clause 3 in TR 26.940 clause 7.6 as a new clause 7.6.5.
|
[FS_ULBC] Discussion of FS_ULBC Objective Speech Quality Assessment Method |
China Mobile Com. Corporation |
Summary of S4-260132: Discussion of FS_ULBC Objective Speech Quality Assessment Method
Background
This contribution addresses speech quality assessment challenges for ultra-low bitrate codecs (ULBC). While subjective testing remains the benchmark for ULBC codec selection, objective speech evaluation methods can serve as predictive tools during intermediate testing and parameter adjustment processes, enabling more convenient and efficient quality verification.
Overview of Existing Speech Objective Quality Evaluation Methods
The document provides a comprehensive comparison of available objective assessment tools:
Standardized ITU-T Methods
- P.863 (POLQA): Full-reference method, widely adopted in ITU/3GPP, supports NB/WB/SWB, maintains performance below 4kbps in SWB mode
- P.563: No-reference method suitable for real-time applications, but less accurate for extreme noise or complex distortions compared to full-reference methods
Open Source Methods
- ViSQOL: Full-reference, performs well for low bitrates (under 8kbps with good MOS correlation), but not formally standardized
- STOI/ESTOI: Full-reference, focuses on speech intelligibility, computationally efficient with high correlation to subjective tests in noisy conditions. ESTOI improves robustness to nonlinear distortions (e.g., neural codecs)
- SCOREQ: No-reference model with strong cross-domain robustness and improved correlation with human judgments
Capabilities and Limitations for ULBC
The document analyzes each method's suitability for ultra-low bitrate scenarios:
- P.863: Most widely adopted, broad bandwidth support, proven performance at low bitrates
- P.563: Limited adaptability to non-linear distortions from neural codecs
- ViSQOL: Good consistency with MOS at low bitrates but lacks formal standardization
- STOI/ESTOI: Effective for intelligibility assessment, robust to nonlinear distortions, but not ITU-T/3GPP standardized
- SCOREQ: Addresses domain-generalization shortcomings with improved out-of-domain robustness
Proposal
Recommended Objective Assessment Methods
After excluding unsuitable methods, the contribution recommends considering P.863, ViSQOL, and ESTOI as potential objective quality assessment methods for ULBC.
Text Proposal for TR 26.940
The document proposes a pCR to TR 26.940 Section 9 (Test methodologies) that includes:
New Section 9.1.1: Typical Quality Impairments
Identifies ULBC-specific impairment categories:
- Loss of listening-only audio quality
- Audio bandwidth loss
- Impaired intelligibility
- Impaired speaker identifiability
- Prosodic impairments
- Hallucination (word and phone confusions)
- Sensitivity to non-speech input (background noise, music, reverberant speech)
New Section 9.1.2: Challenges of Quality Assessment
Addresses testing challenges specific to ULBC:
- Traditional 3GPP Practice: AMR/AMR-WB/EVS used P.800 ACR for clean speech and DCR for noisy/mixed content, but did not focus on intelligibility, speaker identifiability, or prosodic impairments
- ULBC-Specific Challenges: ML-based codecs introduce new impairment types (e.g., hallucination) requiring alternative test methods
- Additional Test Methodologies (non-exhaustive list):
  - Diagnostic Rhyme Tests (DRT)
  - Modified Rhyme Tests (MRT)
  - MOS testing for speaker similarity
  - Speaker verification/identification tests
  - Prosodic naturalness MOS tests
  - Intonation recognition tests
  - Transcription tests for word/semantic equivalence
  - Phoneme recognition tests
  - Automatic speech recognition tests
- Objective Methods as Optional Tools: Proposes documenting that objective methods (P.863, ViSQOL, ESTOI, etc.) can be considered as optional tools for predicting speech quality during ULBC simulation testing and parameter optimization, acknowledging that subjective listening remains the most important evaluation method despite being time and resource-intensive
- Speech Enhancement Evaluation: Notes that P.835 multi-dimensional rating scales can be used for speech enhancement tools that may be part of ULBC
Technical Contribution
The main technical contribution is establishing a framework for objective quality assessment in ULBC standardization that:
1. Recognizes the unique challenges of ML-based codecs
2. Identifies suitable objective methods as predictive tools
3. Proposes their documentation as optional assessment methods in TR 26.940
4. Maintains subjective testing as the primary benchmark while enabling more efficient intermediate evaluation
|
Proposals
Proposal
We propose to document the consideration of optional objective test method for ultra-low bit rate codecs provided above in a pCR to TR 26.940 or PD file as shown below.
|
Updates to the simulation results for FS_ULBC |
Xiaomi Technology |
Summary of S4-260136: Updates to Simulation Results for FS_ULBC
1. Introduction and Context
This document presents updated link-level simulation (LLS) results for the Ultra-Low Bitrate Codec (ULBC) study over Non-Terrestrial Networks (NTN). The simulations follow the NTN-TDL-C channel model as specified in TS 36.102. This revision adds:
- Missing simulation results for NTN-TDL-C 10 NPUSCH
- New simulation results for NTN-TDL-C 10 NPDSCH
- Updated TBS (Transport Block Size) values for both NPDSCH and NPUSCH with 10 degrees elevation angle
The simulations are based on parameters discussed in S4aA250038 and follow agreements from previous meetings.
2. Channel Model Assumptions
Key Parameters:
- Satellite elevation angle: 12.5 degrees for link budget calculations
- Channel model parameters (delay and power of each path) determined for 10 degrees elevation (approximation of 12.5 degrees)
- Channel model: NTN-TDL-C from TS 36.102
3. Link Budget Analysis
3.1 CNR Baseline Values (from RAN1)
Uplink:
- CNR = 2.6 dB (0 dBi UE antenna gain, 3.75 kHz SCS, 1 tone, 23 dBm UE TX power)
Downlink:
- CNR = -3.3 dB (0 dBi UE antenna gain, 15 kHz SCS, 12 tones, 1 RX antenna, 7 dB noise figure)
3.2 Additional UL CNR Configurations
| Configuration | SCS | UE Power | UL CNR |
|---------------|-----|----------|---------|
| Config 1 | 3.75 kHz | 23 dBm | 2.6 dB |
| Config 2 | 15 kHz | 23 dBm | -3.42 dB |
| Config 3 | 3.75 kHz | 26 dBm | 5.6 dB |
| Config 4 | 15 kHz | 26 dBm | -0.42 dB |
| Config 5 | 3.75 kHz | 31 dBm | 10.6 dB |
| Config 6 | 15 kHz | 31 dBm | 4.58 dB |
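The summary does not state how the additional uplink values were derived. They are, however, consistent with scaling the RAN1 baseline by the change in UE transmit power and by the wider noise bandwidth of the 15 kHz tone. The sketch below checks that reading; the scaling rule and the function name are assumptions, not taken from the source.

```python
import math

BASE_CNR_DB = 2.6       # RAN1 baseline: 3.75 kHz SCS, 1 tone, 23 dBm
BASE_SCS_KHZ = 3.75
BASE_POWER_DBM = 23.0


def ul_cnr_db(scs_khz, power_dbm):
    """Assumed scaling: +1 dB per dB of Tx power, -10*log10 of the bandwidth ratio."""
    bandwidth_penalty_db = 10.0 * math.log10(scs_khz / BASE_SCS_KHZ)
    return BASE_CNR_DB + (power_dbm - BASE_POWER_DBM) - bandwidth_penalty_db


UL_TABLE = {
    (scs, p): round(ul_cnr_db(scs, p), 2)
    for p in (23, 26, 31)
    for scs in (3.75, 15)
}

if __name__ == "__main__":
    for (scs, p), cnr in UL_TABLE.items():
        print(f"{scs} kHz, {p} dBm -> {cnr} dB")
```

Moving from 3.75 kHz to 15 kHz quadruples the bandwidth, which costs about 6.02 dB and accounts for the gap between each pair of rows.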
3.3 Additional DL CNR Configurations
| Configuration | Number of RX antennas | G/T (dB/K) | DL CNR |
|---------------|--------------|-----------|---------|
| Config 1 | 1 | -31.6 | -3.3 dB |
| Config 2 | 2 | -31.6 | -0.3 dB |
| Config 3 | 1 | -28.6 | -0.3 dB |
| Config 4 | 2 | -28.6 | 2.7 dB |
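The downlink values follow a similar pattern: each row is consistent with the baseline CNR plus an ideal combining gain for the second receive antenna and the change in G/T. This interpretation is an assumption; the summary does not give the derivation.

```python
import math

BASE_DL_CNR_DB = -3.3   # baseline: 1 RX antenna, G/T = -31.6
BASE_G_OVER_T = -31.6


def dl_cnr_db(num_rx, g_over_t):
    """Assumed scaling: 10*log10(num_rx) combining gain plus the G/T difference."""
    combining_gain_db = 10.0 * math.log10(num_rx)
    return BASE_DL_CNR_DB + combining_gain_db + (g_over_t - BASE_G_OVER_T)


DL_TABLE = {
    (n, gt): round(dl_cnr_db(n, gt), 1)
    for gt in (-31.6, -28.6)
    for n in (1, 2)
}

if __name__ == "__main__":
    for (n, gt), cnr in DL_TABLE.items():
        print(f"{n} RX, G/T {gt} -> {cnr} dB")
```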
4. NPUSCH Simulation Results
4.1 Common Simulation Parameters
- Scenario: GEO orbit, 10 degree elevation
- Carrier frequency: 2 GHz
- Channel model: NTN-TDL-C with 5 ns delay spread
- Physical channel: NPUSCH format 1
- SCS: 3.75 kHz and 15 kHz
- Number of tones: Single tone
- Waveform: SC-FDMA
- MIMO: SISO (1T1R)
- DMRS: OS#4 per slot for 3.75 kHz, OS#3 per slot for 15 kHz
- UE velocity: 3 km/h
- Symbols per slot: 7
- Slots per RU: 16
- Modulation: π/4-QPSK
- Receiver: MMSE with real channel estimation
- Target BLER: 1%, 2%, 6%, 10%
4.2 Results Part 1 - Various TBS and Bundling Times
80 ms Bundling Time
144 bits (Cases 1-4):
- Case 1: 15 kHz, MCS 2, 4 RUs, 2 reps → SNR: -4.97 dB (10% BLER) to -4.35 dB (1% BLER)
- Case 2: 15 kHz, MCS 2, 2 RUs, 1 rep → SNR: 1.8 dB to 2.7 dB
- Case 3: 3.75 kHz, MCS 10, 1 RU, 2 reps → SNR: 1.5 dB to 2.30 dB
- Case 4: 15 kHz, MCS 10, 1 RU, 4 reps → SNR: -4.64 dB to -3.90 dB
256 bits (Cases 5-8):
- SNR ranges from -2.84 dB to 5.9 dB depending on configuration
328 bits (Cases 9-11):
- SNR ranges from -1.53 dB to 9.45 dB depending on configuration
424 bits (Case 12):
- SNR: 1.44 dB (10% BLER) to 2.05 dB (1% BLER)
160 ms Bundling Time
208 bits (Cases 13-15):
- SNR ranges from -5.56 dB to 1.53 dB
424 bits (Case 16):
- SNR: -1.95 dB to -1.52 dB
600 bits (Case 17):
- SNR: -1.38 dB to -0.97 dB
808 bits (Cases 18-19):
- SNR ranges from -1.42 dB to 0.21 dB
320 ms Bundling Time
328 bits (Cases 20-25):
- SNR ranges from -6.80 dB to -0.22 dB
776 bits (Cases 26-27):
- SNR ranges from -2.48 dB to 6.46 dB
1000 bits (Cases 28-30):
- SNR ranges from -1.95 dB to 7.47 dB
1544 bits (Case 31):
- SNR: 0.48 dB to 0.76 dB
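To relate the simulated transport block sizes to codec bitrates, a gross physical-layer rate can be taken as TBS divided by bundling time. The sketch below does this for the TBS values listed in clause 4.2. It assumes one transport block per bundling period and ignores all protocol overhead, so the results are upper bounds on what is available to the codec rather than figures from the source.

```python
# TBS values (bits) per bundling time (ms), as listed in clause 4.2 above.
TBS_BY_BUNDLING_MS = {
    80: [144, 256, 328, 424],
    160: [208, 424, 600, 808],
    320: [328, 776, 1000, 1544],
}


def gross_rate_kbps(tbs_bits, bundling_ms):
    """Bits per millisecond is numerically equal to kbit/s."""
    return tbs_bits / bundling_ms


RATES = {
    (bundling, tbs): gross_rate_kbps(tbs, bundling)
    for bundling, sizes in TBS_BY_BUNDLING_MS.items()
    for tbs in sizes
}

if __name__ == "__main__":
    for (bundling, tbs), rate in RATES.items():
        print(f"{tbs:5d} bits / {bundling:3d} ms = {rate:.3f} kbit/s")
```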
4.3 Results Part 2 - Additional Cases
Covers Cases 32-46 with various TBS values (440, 584, 680, 936, 1096, 1384, 1736 bits) for 80 ms and 160 ms bundling times. SNR requirements range from -3.6 dB to 8.0 dB depending on configuration.
5. NPDSCH Simulation Results
5.1 Common Simulation Parameters
- Scenario: GEO orbit, 10 degree elevation
- Carrier frequency: 2 GHz
- Channel model: NTN-TDL-C with 5 Hz Doppler spread
- Physical channel: NPDSCH
- SCS: 15 kHz
- Number of tones: 12
- Waveform: OFDM
- MIMO: SISO (1T1R or 1T2R)
- Symbols per subframe: 14
- Modulation: QPSK
- Receiver: MMSE with NRS-bundling and real channel estimation
- Target BLER: 1%, 2%, 6%, 10%
5.2 Results Part 1 - Various TBS and Bundling Times
80 ms Bundling Time
144 bits (Case 0a):
- 1R: SNR -6.6 dB to -5.3 dB
- 2R: SNR -9.3 dB to -8.1 dB
256 bits (Case 0b):
- 1R: SNR -4.3 dB to -3.1 dB
- 2R: SNR -7.1 dB to -6.1 dB
328 bits (Cases 1-2):
- SNR ranges from -11.8 dB to -4.0 dB
424 bits (Cases 3, 3b):
- SNR ranges from -11.6 dB to -5.0 dB
160 ms Bundling Time
208 bits (Case 4):
- SNR: -15.3 dB to -11.8 dB
424 bits (Cases 5, 5b):
- SNR ranges from -14.6 dB to -8.0 dB
600 bits (Case 6):
- SNR: -11.1 dB to -7.2 dB
808 bits (Cases 7, 7b, 8, 8b):
- SNR ranges from -11.0 dB to -1.1 dB
320 ms Bundling Time
328 bits (Cases 9-11b):
- SNR ranges from -17.7 dB to -8.1 dB
776 bits (Cases 12, 12b):
- SNR ranges from -14.8 dB to -8.1 dB
1000 bits (Case 13):
- SNR: -10.3 dB to -6.4 dB
1544 bits (Cases 14, 14b):
- SNR ranges from -11.7 dB to -5.0 dB
5.3 Results Part 2 - Additional Cases
Covers Cases 15-46 with various TBS values (440, 584, 680, 936, 1096, 1384, 1736 bits) for 80 ms and 160 ms bundling times.
1T1R Results:
- SNR requirements range from -10.9 dB to 1.1 dB
1T2R Results:
- SNR requirements range from -13.6 dB to -1.9 dB
- Approximately 3 dB gain compared to 1T1R configurations
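The "approximately 3 dB" figure can be compared with the 1R and 2R ranges given for Cases 0a and 0b in clause 5.2. The sketch below pairs the endpoints of each range, assuming the first value of each range corresponds to the same BLER target for both antenna configurations; that pairing is an assumption, as the summary does not label the endpoints.

```python
import math

# (low, high) SNR endpoints in dB for 1 RX and 2 RX, from clause 5.2.
CASES = {
    "0a": {"1R": (-6.6, -5.3), "2R": (-9.3, -8.1)},
    "0b": {"1R": (-4.3, -3.1), "2R": (-7.1, -6.1)},
}


def rx_gain_db(one_rx, two_rx):
    """SNR reduction obtained with the second antenna, per range endpoint."""
    return tuple(round(a - b, 1) for a, b in zip(one_rx, two_rx))


GAINS = {case: rx_gain_db(v["1R"], v["2R"]) for case, v in CASES.items()}
IDEAL_GAIN_DB = round(10.0 * math.log10(2), 2)  # doubling the receive antennas

if __name__ == "__main__":
    print(GAINS, IDEAL_GAIN_DB)
```

The observed gains of 2.7 to 3.0 dB sit just below the ideal 3.01 dB for two-branch combining.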
6. Conclusions
The document recommends considering these simulation results for determining design constraints for ULBC. The results demonstrate:
- Performance across various TBS values (144 to 1736 bits)
- Multiple bundling times (80, 160, 320 ms)
- Different SCS configurations (3.75 kHz, 15 kHz)
- Impact of repetitions on SNR requirements
- Benefits of 2 RX antennas (approximately 3 dB gain)
|
Extracted Proposals
This document does not contain any explicit proposals. The document presents simulation results and technical analysis for ULBC (Ultra-Low Bitrate Codec) in NTN (Non-Terrestrial Networks) scenarios, but does not include any sections explicitly marked as "Proposal", "Proposal:", "Proposal X:", etc.
The conclusion section states "It is recommended to consider the above simulation results for determining the design constraints for ULBC" but this is presented as a recommendation rather than a formal proposal.
|
[FS_ULBC] On eCall scenario for ULBC |
Huawei Technologies Co., Ltd. |
Summary of S4-260137: On eCall Scenario for ULBC
1. Background
This contribution addresses the eCall (emergency call) scenario for Ultra-Low Bitrate Codec (ULBC) work. Previous contributions (S4-251908, SA-251848, SA-251881) emphasized the importance of preserving background signals in emergency communications. China has developed a related national standard "On-Board Emergency Call System for Road Vehicles" expected to take effect on July 1, 2027.
The document highlights that eCall scenarios have special requirements and different conditions compared to regular call scenarios, necessitating different design constraints and test methodologies.
2. eCall Scenario Description
2.1 System Overview
The eCall system is an in-vehicle safety technology that:
- Automatically dials emergency numbers (e.g., 112 in EU) upon severe collision detection
- Sends minimum data set (MSD) including GPS location, VIN, collision direction and time
- Can be triggered by built-in sensors or manual SOS button
- Functions via GEO satellite even without terrestrial network coverage
2.2 Communication Architecture
The bi-directional voice data flow involves:
- Vehicle side: Integrated microphones and speakers communicating over GEO satellite network
- Emergency response center: Connected via terrestrial mobile network (VoLTE, VoNR), fixed-line, or other IMS-supported platform
- Key requirement: Background noise captured within vehicle must be delivered with fidelity to emergency response center
- Asymmetric requirement: Noise preservation may not be required in the opposite direction (emergency center to vehicle)
- Dedicated system: No mobile phones involved in the communication link
3. Key Observations
Observation 1: eCall is a dedicated system between vehicles and emergency response centers. Speech codec designed for eCall is not necessarily the same as that for regular call scenarios, allowing for separate design constraints or performance requirements for ULBC-eCall.
Observation 2: Vehicle and emergency response center have significantly different hardware capabilities compared to regular call scenarios:
- Less sensitive to power consumption
- Higher computational capability
- Higher storage capability
- Together, these differences allow relaxed design constraints and more stringent performance requirements for ULBC-eCall
4. Proposed Changes to TR 26.940
4.1 New Clause 4.5: eCall Communication
The contribution proposes adding a new scenario (Scenario 4) to TR 26.940 documenting:
High-level Prerequisites for ULBC in eCall:
- Very low bitrate support
- Background noise preserved with no DTX during the call (at least for vehicle-to-emergency center direction)
- Error concealment
- Real-time implementation capability (encoding and decoding)
- Good audio quality for reasonable QoE
- Relaxed hardware constraints compared to mobile phones
4.2 Modified Clause 6.2: Design Constraint Parameters
The contribution proposes creating separate design constraint columns in Table 6.2-1:
- Design Constraint (regular call): Existing constraints
- Design Constraint (eCall): New column with eCall-specific constraints
Key Differences for eCall Design Constraints:
| Parameter | Regular Call | eCall |
|-----------|-------------|-------|
| Noise Suppression | Not required; noise suppression may be applied | Background noise preserved during call (at least vehicle-to-center direction); opposite direction may not require preservation |
| DTX Support | Support | No DTX support during call (at least vehicle-to-center direction) |
| Complexity/Memory | Standard mobile constraints | Relaxed constraints possible |
5. Technical Contributions
The main technical contributions of this document are:
- Introduction of eCall as a distinct ULBC scenario with specific requirements different from regular call scenarios
- Identification of asymmetric requirements for noise preservation (required vehicle-to-center, optional center-to-vehicle)
- Proposal for relaxed design constraints based on different hardware capabilities of eCall endpoints
- Explicit requirement for background noise preservation and no DTX in critical direction
- Framework for separate performance requirements for eCall vs. regular call scenarios in TR 26.940
The document establishes that eCall scenarios justify different codec design approaches due to their dedicated nature, different hardware capabilities, and specific regulatory/safety requirements.
|
Proposals from the Document
Proposal: The source proposes to add a new clause 4.5 eCall Communication in TR 26.940 for documenting the eCall scenario, and change the clause 6.2 Design Constraint Parameter in TR 26.940 to have separate design constraints for eCall and regular call as follows:
[Note: The proposal includes the detailed text changes shown in the "First Change" and "Second Change" sections of the document, which add clause 4.5 and modify Table 6.2-1 to include separate design constraints for regular call and eCall scenarios.]
|
[FS_ULBC] On target platforms for ULBC |
Huawei Technologies Co., Ltd. |
Summary of S4-260141: Target Platforms for ULBC
1. Introduction and Motivation
This contribution addresses a gap in TR 26.940 regarding target platforms for Ultra Low Bit rate Codec (ULBC) deployment. While the TR currently discusses NPU as a possible platform in clause 6.2.1.5.1, it lacks coverage of other non-NPU platforms. The document aims to complete this missing information, particularly focusing on DSP-enabled devices.
2. Technical Problem Statement
The contribution identifies an inconsistency in TR 26.940:
- Clause 6.2.1.1 states that the codec must support real-time processing alongside other audio processing units and should fit real-time resource constraints of CPUs, potential accelerators, and DSPs across a range of devices
- Clause 6.2.1.5.1 currently only describes NPU as a target platform, omitting DSP and other non-NPU platforms
The source references previous contributions (S4aA250267 and S4-251747) that discussed the need for DSP deployment and provided clarification on DSP-enabled UE devices.
3. DSP-Enabled Device Definition
The contribution adopts the definition from S4-251747 for DSP-enabled UE devices:
- Devices with DSP only or devices with multiple computing units including DSP
- For multi-unit devices (with CPU/NPU/DSP), there remains a preference for DSP deployment due to:
- Lower power consumption
- Reduced heat generation
- Better battery life
- Target devices include vehicle-mounted devices, glasses, and mobile phones with low computational capability
- DSP refers to audio processing DSPs available in mobile phones or other devices for voice communication
4. Proposed Text Changes
Main Technical Contribution
The proposal adds a new paragraph to clause 6.2.1.5.1 that:
- Acknowledges vendor flexibility: Vendors may choose any computing unit to implement ULBC based on business needs or product constraints
- Highlights DSP advantages:
  - Cheaper in terms of silicon real estate
  - Less power hungry
  - Less heat generation
  - Typically single-threaded for synchronized real-time execution with low overhead
  - Potentially wider range of product support
- Establishes DSP deployment requirement: ULBC should be deployable on DSP-enabled UE devices, including:
  - Devices with DSP only
  - Devices with multiple computing units including DSP
- Provides deployment rationale: Even when CPU or NPU are available, DSP may be preferred for power-sensitive applications (wearables, mobile phones)
- Defines DSP computational power: Audio processing DSPs typically range from several hundred to over a thousand MIPS
Context Preservation
The proposal maintains the existing text about:
- NPU prevalence in modern smartphones
- NPU being 5-20x more power efficient than CPUs for AI tasks
- The note that ULBC may need to run on non-NPU platforms in certain configurations
5. Technical Impact
This contribution ensures that TR 26.940 provides comprehensive guidance on target platforms for ULBC deployment, balancing the AI-optimized NPU approach with the power-efficient DSP approach, thereby supporting a wider range of device implementations and use cases.
|
Extracted Proposals
Proposal: It is proposed to agree the following changes to the TR 26.940.
|
[FS_ULBC] On complexity and memory constraints for ULBC |
Huawei Technologies Co., Ltd. |
Summary of S4-260142: On Complexity and Memory Constraints for ULBC
Introduction
This contribution addresses complexity and memory constraints for Ultra Low Bitrate Codec (ULBC) as part of the study in TR 26.940. The document aims to clarify previous discussions on measurement metrics and specific constraints, proposing concrete values for complexity, RAM, and ROM requirements.
Main Technical Contributions
Complexity Measurement Metrics
The contribution proposes using both MACS (Multiply-Accumulate Operations per Second; the values below are given in millions, MMACS) and Codec/Model Size together to characterize ULBC complexity, rather than relying on a single metric:
- Codec/Model Size: Directly impacts memory requirements and power consumption (more memory footprint requires more frequent DRAM access, leading to higher power consumption)
- MACS: More suitable for guiding computing hardware unit selection
- These metrics do not necessarily correlate, as different model architectures can result in very different MACS for the same model size
Memory Constraints Clarification
The document clarifies confusion from previous contributions (S4aA250253 and S4-251807) regarding the 5-10M parameters proposal:
ROM Constraints
- ROM characterized by overall Model Sizes across all operation modes
- Major impact is FLASH consumption in product design
- Minimal power consumption impact (only one model's parameters accessed at a time)
- Proposed constraint: < 15M parameters (relaxed from previously discussed 10M to support more operation modes)
- Enables support for ~5 operation modes (e.g., 2-3 bitrates for 2 different sampling rates)
RAM Constraints
- RAM characterized by maximum single Model Size (assuming no switching between operation modes)
- Proposed constraint: < 3M parameters
- With 15M ROM, this allows 5 operation modes
- Whether switching between operation modes will be supported is FFS
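The relation between the two memory limits is simple arithmetic: the ROM budget divided by the largest single model gives the number of full-size operation modes that fit. The sketch below also converts parameter counts to bytes. The bytes-per-parameter values are hypothetical (the contribution states its limits in parameters only, with no weight precision).

```python
ROM_LIMIT_PARAMS = 15_000_000  # proposed: all operation modes together
RAM_LIMIT_PARAMS = 3_000_000   # proposed: largest single model


def max_operation_modes(rom_params, ram_params):
    """Number of full-size models that fit in the ROM budget."""
    return rom_params // ram_params


def footprint_mib(params, bytes_per_param):
    """Weight storage in MiB for an assumed weight precision."""
    return params * bytes_per_param / 2**20


MODES = max_operation_modes(ROM_LIMIT_PARAMS, RAM_LIMIT_PARAMS)

if __name__ == "__main__":
    print("operation modes:", MODES)
    for label, nbytes in (("8-bit", 1), ("16-bit", 2), ("32-bit", 4)):  # assumed
        print(label, f"{footprint_mib(RAM_LIMIT_PARAMS, nbytes):.2f} MiB per model")
```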
Complexity Constraints
MACS Reference Point
The contribution references the 2025 Low-Resource Audio Codec (LRAC) Challenge sponsored by Cisco Systems as a relevant benchmark:
LRAC Challenge Requirements:
- Sampling rate: 24 kHz
- Mono audio input
- Bitrate: up to 1 kbps (ultralow) and up to 6 kbps (low)
- Latency: 30 ms (Track 1) or 50 ms (Track 2)
- Compute complexity: ≤ 350 MMACS total; ≤ 150 MMACS receive-side
- Winner (ByteDance) used ~4M parameters
Proposed MACS Value
- While LRAC suggested 350 MMACS, the contribution proposes < 600 MMACS for ULBC
- Rationale: Slightly increased complexity enables better speech quality while remaining within target hardware (e.g., DSP) computational capacity
- Validation: Handcrafted 3M parameter codec (reduced from SoundStream architecture) achieved 600 MMACS
Proposed Design Constraints Summary
The contribution proposes the following specific constraints for ULBC:
- Complexity:
  - Single Model Size < 3M parameters
  - < 600 MMACS
- RAM:
  - < 3M parameters (assuming no switching between operation modes)
  - Whether switching will be supported is FFS
- ROM:
  - < 15M parameters
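The proposed limits can be expressed as a small compliance check. The sketch below is illustrative only: the `Candidate` class, its field names, and the two example candidates are hypothetical, and the limits are read as strict inequalities as written in the proposal.

```python
from dataclasses import dataclass
from typing import List

MAX_MODEL_PARAMS = 3_000_000   # single model / RAM limit
MAX_TOTAL_PARAMS = 15_000_000  # ROM limit across all operation modes
MAX_MMACS = 600.0              # complexity limit


@dataclass
class Candidate:
    name: str
    params_per_mode: List[int]  # parameter count of each operation mode's model
    mmacs: float                # complexity of the most demanding mode


def check(c: Candidate) -> dict:
    """Evaluate a candidate against the proposed constraints."""
    largest = max(c.params_per_mode)
    return {
        "complexity": largest < MAX_MODEL_PARAMS and c.mmacs < MAX_MMACS,
        "ram": largest < MAX_MODEL_PARAMS,
        "rom": sum(c.params_per_mode) < MAX_TOTAL_PARAMS,
    }


# Hypothetical candidates, not taken from any contribution.
SMALL = check(Candidate("five-mode example", [2_500_000] * 5, 450.0))
LARGE = check(Candidate("single 4M-parameter model", [4_000_000], 350.0))
```

The second example shows why the two metrics are proposed together: a model can stay well under the MMACS limit and still exceed the model-size limit.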
Text Proposal
The contribution includes a change request to TR 26.940, Section 6.2 (Design Constraint Parameter), Table 6.2-1, adding the specific complexity and memory constraints detailed above to the "Complexity and memory demands" parameter row.
|
Proposal
We propose to use both the MACS and Codec/Model Size to characterize ULBC complexity and set complexity, RAM and ROM respectively for ULBC as below:
Complexity: single Model Size < 3M parameters and < 600 MMACS.
RAM: < 3M parameters.
ROM: < 15M parameters.
And it is proposed to agree the following changes to the TR 26.940.
|
[FS_ULBC] TR 26.940 V0.5.1 |
China Mobile Com. Corporation |
3GPP TR 26.940 - Study on Ultra Low Bit rate Speech Codecs (Release 20)
Document Overview
This Technical Report documents the study on Ultra Low Bit rate Speech Codecs (ULBC) for 3GPP Release 20. The primary focus is on IMS voice services over Geostationary Orbit (GEO) satellite access, with additional consideration for multi-party voice communication and other access types.
1. Application Scenarios for Ultra-Low Bit Rate Communication Services
1.1 Scenario 1: IMS Voice Call over GEO (Primary Scenario)
Background:
- GEO satellites operate at 35,786 km altitude, resulting in ~285ms one-way propagation delay
- TR 22.887 and TS 22.261 assume total transmission data rates of [1-3] kbit/s
- Current 3GPP codecs (lowest: AMR at 4.75 kbit/s) cannot support these constraints
Scenario Descriptions:
Main Scenario (4.2.2.2): One UE connects via GEO satellite access
- UE1: Phone supporting IMS voice over GEO satellite
- UE2: Either "regular" phone (requiring transcoding in core network) or "upgraded" phone supporting ULBC over other access (enabling transcoder-free operation)
Sub-Scenario (4.2.2.3): Both UEs connect via GEO satellite access
- Less common but relevant for disaster/cyberattack scenarios
- Even with transparent payload, voice packets transmit to ground before reaching other UE
- May enable transcoder-free operation
High-level Prerequisites:
- Very low bitrate support
- DTX support [TBC]
- Error concealment
- Real-time implementation on smartphones
- Good audio quality for reasonable QoE
1.2 Scenario 2: Multi-Party Voice Communication
Background:
- Addresses poor/unstable network conditions in WLAN access
- Network congestion during peak usage or in areas with limited infrastructure
- Codec selection critical for maintaining quality under bandwidth constraints
Scenario Description:
- One participant (UE1) on unstable network using ULBC, other (UE2) on stable network with conventional codec (requires transcoding)
- Both participants on unstable networks using ULBC simultaneously (no transcoding needed)
High-level Prerequisites:
- Ultra-low bitrate capability
- Real-time operation on consumer devices (smartphones, laptops)
- Audio quality matching or exceeding existing voice services
1.3 Scenario 3: IMS Voice Call with ULBC over Other Access Types
Motivation:
- ULBC may provide enhanced robustness against poor network conditions
- Lower bit rates may benefit coverage/capacity
- Reduces transcoding needs when calls bridge GEO and other access types
Scenario Description:
- Both UEs support ULBC but connect via 3GPP access other than GEO (LTE, NR, WLAN)
2. Channel Characteristics and Service-Related Dependencies
2.1 Mouth-to-Ear Delay Estimation for GEO Scenarios
Delay Components:
UE Delay (Table 5.1.2-2):
- Depends on voice bundling period (80ms, 160ms, 320ms) and codec frame size (20-320ms)
- Performance objective range: 268-1435ms (excluding solution-specific delay)
- Maximum requirement range: 355-1435ms (excluding solution-specific delay)
- Components: 2x voice bundling period + 2x vendor-specific encoder/decoder processing + vendor delay budget + JBM
Core Network Delay:
- Ground station to core network: [5-20]ms minimum, [200]ms maximum
- eNodeB to core network: 5-20ms
- Transcoding: 7ms (AMR/AMR-WB) to 14ms (EVS)
GEO Transmission Delay:
- Minimum: 248ms
- Maximum: 280ms (per TS 22.261 KPI requirement)
- Variation of 32ms depending on UE location within beam
Mouth-to-Ear Delay Estimates (Table 5.1.3-1):
For Main Scenario (GEO-TN):
- 80ms bundling, 20ms frame: 548ms (lower) to 872ms (upper) + solution-specific delay X
- 320ms bundling, 320ms frame: 1952ms (lower) to 2395ms (upper) + solution-specific delay X
For Sub-Scenario (GEO-GEO):
- 80ms bundling, 20ms frame: 804ms (lower) to 1315ms (upper) + solution-specific delay X
- 320ms bundling, 320ms frame: 1952ms (lower) to 2395ms (upper) + solution-specific delay X
2.2 NB-IoT NTN System Parameters
System Architecture:
- Service link: UE to NTN payload
- Feeder link: NTN payload to NTN Gateway
RAN Parameters:
- Channel coding: Turbo code (NPUSCH Format 1 uplink), TBCC (NPDSCH downlink)
- MCS: pi/2 BPSK, pi/4 QPSK, QPSK, 16QAM
- Subcarrier spacings: 3.75kHz and 15kHz for NPUSCH Format 1
- Resource unit (RU) duration varies with subcarrier spacing and number of tones
QoS Characteristics:
- Managed through QCI (QoS Class Identifier)
- Same PELR (Packet Error Loss Rate) required for UL and DL
- Suggests balanced UL/DL time-domain transmission resources
3. Design Constraints
3.1 Design Constraint Parameters (Table 6.2-1)
Key parameters identified:
- Bit rates: [TBD]
- Sample rate and audio bandwidth: [TBD]
- Frame length: [TBD]
- Complexity and memory demands: [TBD]
- Algorithmic delay: Frame size buffering + inherent codec delays (look-ahead, sample-rate conversion, post-processing)
- Packet loss concealment (PLC): [TBD]
- Noise suppression: [TBD]
- Discontinuous transmission (DTX): Including VAD and comfort noise [TBD]
- Robustness to non-speech input: [TBD]
3.2 Complexity and Memory Considerations
Current Evaluation Analysis:
- Codec must support real-time thread and concurrent processing
- ML codecs with [5-10M] parameters considered for efficient operation within latency bounds
- Must operate within compute constraints of devices for real-time voice communication
Memory and Power Considerations:
- Larger models require more DRAM access → higher power consumption
- Memory footprint critical for device performance and usability
Complexity Metrics for AI-Based Codecs:
TOPS (Tera Operations Per Second):
- TOPS = 2 × MAC unit count × Frequency / 1 trillion
- Smartphone NPUs: 8-59 TOPS reported (varying precision: INT8, INT16, FP16)
- TOPS/W (power efficiency): 2-15 TOPS/W for smartphone NPUs
- Note: TOPS/W typically benchmarked under full-load; lighter workloads like audio codecs may show different characteristics
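As a minimal sketch of the TOPS formula above: each MAC unit is counted as two operations (one multiply, one add) per clock cycle. The function name and the example NPU figures (4096 MAC units, 1 GHz) are hypothetical and chosen only for illustration; they do not describe any specific device.

```python
def peak_tops(mac_units: int, frequency_hz: float) -> float:
    """Peak tera-operations per second; each MAC counts as 2 operations."""
    return 2 * mac_units * frequency_hz / 1e12


# Hypothetical NPU: 4096 MAC units clocked at 1 GHz (illustrative values only).
npu_tops = peak_tops(mac_units=4096, frequency_hz=1.0e9)
print(f"Peak throughput: {npu_tops:.3f} TOPS")
```

This is a peak figure only; as noted above, sustained throughput and TOPS/W for a light workload such as an audio codec may differ.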
Alternative Metrics:
- MACs (Multiply-Accumulate operations): Practical for complexity assessment
- RTF (Real-Time Factor): Ratio of encoding/decoding time to frame length, so values below 1 indicate real-time operation; reliable but resource-intensive to measure
- Model Size: Number of parameters and precision; directly impacts memory and power
- Tools available: ptflops, torchinfo, fvcore for MAC counting
Observations:
- NPUs/TPUs significantly more power-efficient than CPUs for AI tasks (5-20x)
- Actual NPU performance depends on computational graph structure
- Irregular/sequential/unsupported operations may require CPU fallback
- ULBC complexity constraints should be based on desired power consumption/computational performance, not relative to existing 3GPP codecs
- Million MACs + model size provide first indication of complexity
- RTF useful but requires standardized test benches
- WMOPS not directly suitable for NPU-capable devices but mapping to TOPS/RTF beneficial
Complexity Target Estimation:
- Target devices: Modern smartphones with NPU components
- Example: DAC codec estimated at ~150 Giga MAC/sec (~0.3 TOPS)
- Actual power consumption on smartphone NPUs: TBD
- Model size and architecture significantly impact DRAM operations and overall power consumption
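The DAC estimate above can be reproduced with the same 2-operations-per-MAC convention used for TOPS. The sketch below also expresses that load as a share of the 8-59 TOPS peak range reported for smartphone NPUs. The function name is an assumption, and the share is only a paper figure: it does not predict power consumption or measured RTF.

```python
def macs_per_second_to_tops(giga_macs_per_second: float) -> float:
    """Convert a MAC rate in GMAC/s to TOPS, counting 2 operations per MAC."""
    return 2 * giga_macs_per_second * 1e9 / 1e12


dac_tops = macs_per_second_to_tops(150.0)  # ~150 GMAC/s estimate for DAC
print(f"DAC load: {dac_tops:.2f} TOPS")

# Share of the reported smartphone NPU peak range (8 to 59 TOPS).
for npu_peak_tops in (8.0, 59.0):
    share = dac_tops / npu_peak_tops
    print(f"  {share:.2%} of a {npu_peak_tops:.0f} TOPS NPU (peak, on paper)")
```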
3.3 Design Constraint Verification
Editor's note: Algorithmic delay verification method for AI-based codecs required.
3.4 Additional Design Considerations
Codec Parameters and Configuration:
- Static parameters: Rarely changed, exchanged via SDP or predefined
- Dynamic parameters: May change frequently, included in each packet/frame
- Common static/dynamic parameters to be identified
4. Existing Technologies and Feasibility Evidence
4.1 Overview of Existing Codec Technologies (Table 7.1.1-1)
Categories:
1. 3GPP IMS codecs: Reference conditions (AMR, AMR-WB, EVS)
2. Conventional Ultra Low Bitrate Codecs: DSP-based (MELP/MELPe, AMBE-LR, MPEG-HVXC, TWELP MR, Codec2)
3. AI-based postprocessor: Enhancement of conventional codec output
4. AI-based encoder/decoder:
- Causal systems: Real-time capable (LPCNet, LyraV2, EnCodec, Mimi-Codec, TS3, TAAE, LMCodec2)
- Non-causal systems: Non-real-time due to large look-ahead (DAC, DAC-IBM, SNAC, SpeechTokenizer, SemantiCodec, FunCodec, WavTokenizer, BigCodec, FocalCodec)
Key Codec Properties:
3GPP IMS Codecs:
- AMR: NB, 5ms delay, 20ms frame, 4.75 kbps
- AMR-WB: WB, 5.9375ms delay, 20ms frame, 6.6 kbps
- EVS: NB/WB/SWB, 12ms delay, 20ms frame, 7.2-9.6 kbps
Conventional Ultra Low Bitrate:
- MELP/MELPe: NB, 20-36ms delay, 22.5-90ms frame, 0.6-2.4 kbps
- Codec2: NB, 40ms delay, 20-40ms frame, 0.45-2.4 kbps
AI-based (Causal):
- LPCNet: WB, 25ms delay, 40ms frame, 1.6 kbps
- LyraV2: WB, [TBD] delay, 20ms frame, 3.2/6/9.2 kbps
- Mimi-Codec: 24kHz, 0ms delay, 80ms frame, 0.55/1.1 kbps
AI-based (Non-causal):
- DAC: WB/24kHz, 244-366ms delay, 13.3-20ms frame, 0.5-3+ kbps
- DAC-IBM: 24kHz, 366ms delay, 13.3ms frame, 0.75/1.5/3 kbps
- SNAC: 24kHz, 1000ms delay, 80ms frame, 0.98 kbps
4.2 Observations on Codec Parameters
Audio Bandwidth:
- Conventional codecs: NB only
- Modern AI codecs: WB or higher
Algorithmic Codec Delay:
- IMS codecs: 25-32ms
- Conventional ultra-low: 60-126ms
- Causal AI: 20-80ms
- Non-causal AI: 500ms+ or full signal
Frame Duration:
- Conventional ultra-low: Increased vs. standard 20ms VoIP
- Some AI codecs maintain 20ms, others increase (e.g., Mimi 80ms)
Bitrate:
- All listed codecs (except IMS and LyraV2) offer ≥1 mode <3 kbps
Complexity:
- AI codecs generally higher than IMS/conventional codecs
- Exception: LyraV2 requires only 35% of one ARM A53 core (Raspberry Pi 3+)
- RAM: AI codecs significantly higher (e.g., LyraV2: 54MB vs. EVS: 294KB)
- ROM: AI codecs much higher (e.g., TAAE: 950M parameters ≈ 900MB @ 8-bit; SNAC: 19M ≈ 18MB @ 8-bit; EVS: ~2MB)
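The ROM figures above follow from parameter count times precision. A minimal sketch, assuming ROM is dominated by the model weights and ignoring code size and any container overhead; the helper name is hypothetical. The quoted values (about 900MB and 18MB) appear closest to binary megabytes, so both units are printed.

```python
MIB = 1024 ** 2  # bytes per binary megabyte


def model_size_bytes(num_parameters: float, bits_per_parameter: int) -> float:
    """Weight storage in bytes for a given parameter count and precision."""
    return num_parameters * bits_per_parameter / 8


# Parameter counts and 8-bit precision as quoted above.
examples = {"TAAE": (950e6, 8), "SNAC": (19e6, 8)}
for name, (params, bits) in examples.items():
    size = model_size_bytes(params, bits)
    print(f"{name}: {size / 1e6:.0f} MB (decimal) = {size / MIB:.0f} MiB (binary)")
```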
4.3 Performance Evaluation
P.808 ACR Test Results (Figure 7.1.4-1):
Test setup:
- English clean speech (4 talkers × 6 samples)
- 32kHz, SWB, normalized to -26 dBoV
- 24 subjects
Key Findings:
- Codec2 (all rates) significantly worse than AMR 4.75 kbps
- SemantiCodec, LyraV2, LPCNet, Mimi 0.55 kbps: comparable to AMR-WB 6.6 kbps
- Three conditions on par or slightly better than EVS 9.6 kbps:
- Mimi-Codec 1.1 kbps (causal)
- DAC-IBM 1.5 kbps (non-causal)
- SNAC 0.98 kbps (non-causal)
- AI-based solutions show 2+ MOS improvement over conventional ultra-low bitrate codecs
4.4 Packet Loss Concealment (PLC) Experiments
4.4.1 PLC Experiment with DAC
Test Configuration (Table 7.1.5.1-1):
- Bitrates: 1, 2.5, 4.5, 6 kbps
- Loss percentages: 1%, 6%, 10%, 20%
- Frame size: 80ms
- Based on NB-IoT NTN data at ~3dB CNR (SCS=15kHz) and 9dB (SCS=3.75kHz)
Loss Simulation Methods:
1. Consecutive 4 blocks drop and repeat: Simulates 80ms packet loss
2. Interleaved drop and repeat: Spreads loss over 2 packets (adds latency)
MUSHRA Test Results (8 listeners):
- Despite higher loss percentage, 4.5 kbps and 6 kbps significantly better than 1 kbps and 2.5 kbps
- 6 kbps @ 20% loss rated close to 4.5 kbps @ 10% loss
- Interleaving benefit increases with error rate
- Potential for improvement if model trained with random loss patterns
4.4.2 PLC Experiment with DAC and DAC-IBM
Comparison:
- DAC (default): 16kHz, general audio training, scalable bitrate
- DAC-IBM: 24kHz, speech-specific training, fixed 1.5 kbps
MUSHRA Test Results (8 listeners, resampled to 16kHz):
- DAC-IBM 1.5 kbps @ 3% PLR significantly outperforms all other DAC conditions
- DAC 4.5 kbps @ 10% PLR and 6 kbps @ 20% PLR show no significant improvement over DAC-IBM 1.5 kbps @ 3% PLR
- Specific training for target bitrate crucial for optimal performance
- Error resilience improvable through appropriate training/design choices
Conclusions:
- More design freedom needed in bitrate and BLER selection for optimal quality at given SNR
- Optimal coding performance (even under errors) achieved with appropriate training strategy
- Bitrate scalability (e.g., DAC) comes with significant performance cost, especially at lower bitrates
- Dedicated training (e.g., DAC-IBM) much more efficient
4.5 Very Low Bitrate Listening Test Results
Test Setup (Nokia):
- Clean Finnish speech (3 males, 3 females, 4 sample pairs each)
- Diotic presentation via Sennheiser HD650 headphones
- Experienced listeners
- Extended ACR5 scale (0.5-5.5) and DCR methodologies
- Bandwidths tested: NB (4kHz), MB (6kHz), WB (8kHz), 10kHz, SSWB (12kHz), SWB (16kHz), FB (20kHz)
Codecs Tested:
- DSP: Codec2 (0.7, 1.3, 2.4, 3.2 kbps), MELP (2.4 kbps), MPEG4 HVXC (2.0, 4.0 kbps)
- 3GPP: AMR, AMR-WB, EVS at various rates
- ML: DAC 44k (0.9, 1.7, 2.6, 3.4, 6.9 kbps), TSAC 44k (0.6, 1.2, 2.5, 3.2, 5.9 kbps)
Extended ACR5 Results (Figures 7.2.3-1, 7.2.3-2):
- Increased bandwidth improves quality up to ~12kHz (saturation region)
- 4kHz bandwidth significantly limits perceived quality
- MELP 2.4k and MPEG4 HVXC perform better than Codec2
- 3GPP codecs perform as expected at lowest bitrates
- TSAC and DAC show very good performance in clean speech
- TSAC ≥1.2 kbps and DAC ≥1.7 kbps suitable as ML-based references
- Both poor quality <1 kbps
DCR Results (Figure 7.2.4-1):
- Results align with ACR test
- Exception: MELP preferred over HVXC 2.0 in DCR (full 4kHz bandwidth vs. ~3.7kHz)
- Listeners more likely to notice degradations with reference available
4.6 Test Results on Clean Speech and Music/Mixed Content
4.6.1 DCR Test on Clean Speech (Figure 7.3.2-1)
Test Setup:
- French, 30 listeners (5 panels × 6)
- 8sec double sentences, 3 male + 3 female
- 20-20,000Hz bandpass, -26dB LKFS normalized
Codecs:
- Conventional: Opus (12, 16, 24 kbps), EVS-WB (7.2, 8 kbps), EVS-SWB (9.6, 13.2, 24.4 kbps)
- AI: LPCNet (1.6), Lyra V2 (3.2, 6, 9.2), EnCodec (1.5, 3, 6, 12, 24), AudioCraft (1.5, 3, 6), AudioDec, DAC (1.7, 2.6, 5.2, 7.8)
Key Findings:
- DAC best DMOS among ~1.5 kbps codecs; approaches "Direct" quality <8 kbps
- EnCodec doesn't achieve "Direct" quality even @ 24 kbps; below EVS/Opus at this rate
- Lyra V2 (6, 9.2 kbps) on par with EVS-WB (7.2, 8 kbps)
4.6.2 ACR Test on Clean Speech (Figure 7.3.3-1)
Same setup as DCR test, ACR methodology for better objective metric comparison. Same observations as DCR test.
4.6.3 DCR Test on Music and Mixed Content (Figure 7.3.4-1)
Test Setup:
- 30 listeners (5 panels × 6)
- 6 categories: instrumental/vocal classical, instrumental/vocal modern, captured mixed, artificial mixed (speech + music background)
- 20-20,000Hz bandpass, -26dB LKFS
Codecs:
- Conventional: xHE-AAC (8, 12, 16, 24), Opus audio (16, 24), Opus voip (12, 16, 24), EVS-SWB (9.6, 13.2, 24.4)
- AI: EnCodec (12, 24), DAC (4.3, 6, 7.8), HILCodec (4.5, 6, 9), SNAC (2.6), FlowDec (4.5, 6, 7.5)
- Note: Many neural codecs pretested but excluded due to low quality (LPCNet, Lyra V2, AudioDec, FreqCodec, HifiCodec, Spectral Codecs, Vocos, DisCodec, Mimi, AudioCraft)
Key Findings:
- Best quality: EVS and xHE-AAC @ ~24 kbps
- Neural codec advantage visible at low bitrates
- No tested neural codec achieves quality close to "Direct"
- FlowDec 7.5 kbps: 4.08 DMOS (best neural codec)
- No tested AI codec provided reasonable quality for music/mixed content <2.6 kbps
4.7 Impact of Noise Suppression on AI-Based Codecs
4.7.1 Background on Existing Systems
Classical Speech Coding:
- Studies on MELPe and AMR show noise reduction preprocessing improves parameter extraction and decoded speech quality
- Especially beneficial in noisy conditions and low SNRs
- Improves intelligibility and perceptual quality
- Integrated in 3GPP2 EVRC and VMR-WB standards
Neural Speech Coding:
- Known to be sensitive to noisy environments
- Robustness influenced by training data diversity, low bitrates, capacity/complexity, quantization
- Data-driven approaches make failure modes difficult to anticipate
- Noise suppression can minimize issues and allow codec to focus on useful signal
4.7.2 Test Design
Two Listening Tests (ITU-T P.808 ACR):
Test 1 - High SNR:
- Assumptions from 3GPP EVS characterization
- SNRs: +15 to +20 dB (WB)
- Noises: car, street, office (from ITU-T P.501 Annex B)
- 24 pairs of sentences (8 pairs × 3 noises)
- 20 listeners
Test 2 - Low SNR:
- More adversarial environments
- SNRs: -5 to +15 dB
- Noises: street, construction, metro, car, office, restaurant
- 24 pairs of sentences (4 pairs × 6 noises)
- 21 listeners
Noise Suppression:
- DeepFilterNet2: State-of-the-art DNN-based, operates at 48kHz
- Applied as preprocessor before coding
Mixing Procedure:
- Loudness normalization using BS1770demo (ITU-T STL)
- RMS long-term option for background noise level
4.7.3 Conditions Under Test
Classical Codecs:
- MELPe, AMR, AMR-WB, EVS
Neural Codecs:
- SNAC, Mimi, DAC-IBM (speech-trained, <2 kbps)
- LyraV2 3.2 kbps (likely trained on diverse data including noisy speech)
- DAC (original, 24kHz, 1.5/3/6 kbps) - Test 1 only
All tested with and without noise suppression ("_nr" suffix).
4.7.4 High SNR Test Results (Figures 7.4.2.4-1, 7.4.2.4-2)
Key Observations:
- Listeners prefer uncoded denoised speech over uncoded noisy speech
- Denoised speech as good as clean speech at high SNRs (minimal artifacts)
- Noise suppression beneficial for all codecs except MELPe (already has noise reduction; benefit minimized at high SNRs)
- Classical codecs: Benefit increases with bitrate/quality
- Neural codecs: Greater benefit, >0.5 MOS improvement for several (SNAC, DAC-IBM, DAC @ 3 kbps)
- DAC-IBM vs. DAC: Same architecture/complexity, very different behavior due to training data/target bitrate
- Plain DAC @ 24kHz not competitive at 1.5 kbps
- LyraV2: ~70x less complex than other neural codecs; @ 3.2 kbps performs worse except vs. DAC @ 3 kbps (on par)
4.7.5 Low SNR Test Results (Figures 7.4.2.5-1, 7.4.2.5-2)
Key Observations:
- Listeners strongly prefer uncoded denoised speech (~1 MOS difference)
- All classical codecs benefit from denoising (<1 MOS improvement)
- Neural codecs benefit even more (>1 MOS improvement possible)
- Neural codecs at vastly lower bitrates can compete with conventional codecs under adverse conditions when combined with noise suppression
- Generative AI-based codecs (e.g., DAC-IBM) can improve absolute quality of input signal when coding denoised speech
4.7.6 Conclusions
- Speech coder performance in noisy conditions significantly enhanced by ensuring high SNR (e.g., via noise suppression)
- Neural speech coders more sensitive to noisy environments; benefit more from noise suppression than traditional coders
- A high SNR at the codec input (e.g., after noise suppression) enables improved performance at very low bitrates, whether the original capture conditions are high or low SNR
- Noise suppression impact on delay/complexity requires further study
- Note: Removing all background audio may not always be desirable (e.g., emergency calls where background contains relevant information)
4.8 Analysis of Existing AI Codec: Lyra V2
Key Characteristics:
- Publicly reported: "38x faster than real-time" on high-end 2021 smartphone
- Entirely CPU execution (no NPU/TPU)
- Open-source under Apache 2.0 license (permissive for commercial/standardization)
Code-Level Analysis:
- Core components (LyraGanModel, SoundStreamEncoder) explicitly use CPU backend (XNNPACK via TensorFlow Lite)
- Flag use_xnn=true directs to CPU execution
- No hardware accelerator delegates (NNAPI, Hexagon, CoreML, TPU)
- Single-threaded execution (threads explicitly set to 1)
- Benchmark: Mean 0.525ms processing time for 20ms frame = ~38x real-time
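The benchmark arithmetic can be checked directly. The sketch below takes RTF as processing time divided by frame duration, the convention under which values above 1 mean "not real-time" (as in the DAC measurements in this report); the "times faster than real-time" figure is then its inverse. Function names are illustrative.

```python
def real_time_factor(processing_ms: float, frame_ms: float) -> float:
    """RTF = processing time / audio duration; below 1 means real-time capable."""
    return processing_ms / frame_ms


def speedup_over_real_time(processing_ms: float, frame_ms: float) -> float:
    """How many times faster than real-time the codec runs (inverse of RTF)."""
    return frame_ms / processing_ms


# Lyra V2 benchmark figures quoted above: 0.525 ms per 20 ms frame.
lyra_rtf = real_time_factor(0.525, 20.0)
lyra_speedup = speedup_over_real_time(0.525, 20.0)
print(f"RTF = {lyra_rtf:.4f}, about {lyra_speedup:.0f}x faster than real-time")
```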
Conclusion:
- Proves state-of-the-art low-bitrate AI speech codec can achieve/exceed real-time requirements on high-end 2021 smartphone CPU
- Significant margin towards max RTF
- CPU-only approach viable for ULBC
4.9 Complexity Analysis of Existing AI Codec: DAC
Methodology:
- ONNX Runtime library for execution
- Tested on CPU backend and NNAPI backend (Android NPU interface)
- Model: Unmodified pretrained DAC @ 44.1kHz, 8 kbps (from reference)
- No quantization applied (original float model)
- Metrics: Real-Time Factor (RTF) for end-to-end and individual components
Theoretical Complexity Analysis (Figure 7.6.2-1):
- Tools: ptflops v0.7.5, thop v2.0.17 (cross-verification)
- Complexity scales with frame size: 1.4 GFLOP (20ms) to 31.6 GFLOP (320ms)
- Model: 76.9M parameters, 293MB size
- Note: Different library versions produce different results due to ConvTranspose1d calculation methodology changes
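To compare frame sizes on a common basis, the per-frame figures can be converted to a sustained rate. This is a sketch that assumes the quoted GFLOP values are the cost of processing one frame of the stated duration; the function name is hypothetical.

```python
def gflop_per_second(gflop_per_frame: float, frame_ms: float) -> float:
    """Sustained compute rate needed to process frames in real time."""
    return gflop_per_frame / (frame_ms / 1000.0)


# Figures quoted above for the shortest and longest frame sizes.
for gflop, frame_ms in ((1.4, 20.0), (31.6, 320.0)):
    rate = gflop_per_second(gflop, frame_ms)
    print(f"{frame_ms:.0f} ms frame: {gflop} GFLOP/frame -> {rate:.2f} GFLOP/s")
```

Under that assumption the sustained rate differs far less between the two frame sizes than the per-frame cost does.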
Real-World Inference Performance:
Test Platforms:
1. High-end desktop: AMD Ryzen 9 7950X (5.7GHz fixed)
2. High-end mobile: Qualcomm Snapdragon 8 Gen 2
Key Findings (Figures 7.6.4-1, 7.6.4-2, 7.6.4-3):
Desktop CPU:
- Single-threaded: NOT real-time (RTF 1.6-1.9)
- Multi-threaded (4 threads): Real-time capable (RTF 0.67-0.86)
- Still very slow for high-end desktop CPU
Mobile SoC:
- NO configuration achieves real-time performance
- Best-case RTF: 2.125 (>2x slower than real-time)
- Worst-case RTF: 5.884 (~6x slower than real-time)
- NNAPI backend (NPU): Inconsistent results; sometimes helped slightly, sometimes significantly worse than CPU
- Cannot assume NPU automatically improves performance; NPU-specific optimizations may be required
Critical Gap:
- Significant gap between theoretical NPU capacity and actual measured performance (RTF)
- Model appearing suitable on paper (~2-5 GFLOP/frame) unable to run real-time on top-tier mobile phone
- Real-world testing essential
Editor's note: NNAPI may fall back to CPU for float models; impact needs verification.
5. Test Methodologies
5.1 General Considerations
5.1.1 Typical Quality Impairments of Ultra-Low Bit Rate Speech Coding
Categories:
- Loss of listening-only audio quality
- Audio bandwidth loss
- Impaired intelligibility
- Impaired speaker identifiability
- Prosodic impairments
- Hallucination (word and phone confusions)
- Sensitivity to non-speech input (background noise, music, noisy speech, interfering talker, reverberant speech)
Additional Considerations:
- Speech enhancement algorithms (noise suppression, gain normalization) may be part of ULBC
5.1.2 Challenges of Quality Assessment
Traditional 3GPP Practice:
- AMR, AMR-WB, EVS: Listening-only evaluations using P.800 ACR and modified DCR
- ACR: Generally for clean speech
- DCR: For SWB clean speech, mixed-bandwidth, speech + background noise, music/mixed content
- Focus not on intelligibility, speaker identifiability, prosodic impairments
ULBC Challenges:
- May need to address additional aspects directly through dedicated tests
- Hallucination: Specific to ML-based systems
- ACR may not optimally quantify all impairments (hallucination, intelligibility, prosodic)
Alternative Test Methods:
- Automatic speech recognition
- DCR tests (for prosodic differences)
- Diagnostic Rhyme Tests (DRT)
- Modified Rhyme Tests (MRT)
- MOS testing for speaker similarity
- Speaker verification/identification tests
- Prosodic naturalness MOS tests
- Intonation recognition tests
- Transcription tests (word/semantic equivalence)
- Phoneme recognition tests
Noise Suppression Evaluation:
- P.835: Multi-dimensional rating (speech quality and noise suppression capability separately)
- Typically used for systems with noise suppression
DCR Considerations:
- Subjects may consider noise suppression as degradation when comparing to uncoded noisy reference
5.1.3 Subjective Testing Considerations
Extracted Proposals
This document does not contain any explicit proposals marked with "Proposal X:", "Proposal X.", "Proposal:", "Proposal.", or "Proposal " formatting.
The document is a Technical Report (TR 26.940) that is still under development (Version 0.5.1) and contains study items, observations, editor's notes, and conclusions, but no formally marked proposals in the standard 3GPP proposal format.
[FS_ULBC] Permanent Document v0.5.0 |
China Mobile Com. Corporation |
Comprehensive Summary of 3GPP FS_ULBC Permanent Document
Document Overview
This permanent document (p-doc) version 0.5.0 supports the Study Item on Ultra Low Bitrate Speech Codec (FS_ULBC), focusing on developing recommendations for normative work on an ultra-low bit rate codec for voice over Geostationary Orbit (GEO) satellites. The document tracks agreements, open issues, and progress across the study objectives defined in the SID.
1. Introduction and Scope
The study addresses nine key objectives:
- Document application scenarios for ultra-low bit rate communication services
- Study GEO channel characteristics and derive service-related dependencies
- Identify relevant design constraints
- Provide feasibility evidence
- Define performance requirements and test methodologies
- Identify/develop objective measures for design constraint verification
- Identify reference codecs
- Coordinate with other 3GPP groups (SA2, RAN, CT1)
- Define potential normative work item objectives and timeline
Working Procedure:
- Maintains one TR and one p-doc
- Contributions via pCRs
- Brackets restricted to values only
- Open issues documented in p-doc
2. Application Scenarios
2.1 Main Scenario: IMS Voice Call over GEO
Key Technical Assumptions:
UE1 Uplink (UE1 → GEO satellite → Ground station):
- Transmission data rate significantly limited ([1-3] kbit/s)
- Requires ultra-low bit rate codec fitting this transmission rate
- Subject to transmission errors reflecting GEO satellite access
- Delay greater than typical terrestrial networks
UE1 Downlink (Ground station → GEO satellite → UE1):
- Similarly limited transmission data rate
- Subject to similar transmission errors and delay
UE2 Connection (Core Network → UE2):
- Regular TN network transmission data rate available
- Could use existing IMS codec (with transcoding) or same ULBC (transcoding-free)
- Transcoding functionality in core network likely needed for seamless communication across network types
2.2 Sub-Scenario
Both connections (UE1 and UE2) via GEO satellite with significantly limited transmission data rate ([1-3] kbit/s), allowing both transcoded and transcoding-free operation.
3. Channel Characteristics and Service-Related Dependencies
3.1 End-to-End Simulation Model
Methodology:
- Reuses simulation model from TS 26.132 Annex E (LTE reference scenario)
- Adapted for GEO access scenario with "new GEO channel"
- Potential inclusion of Non-IP Data Delivery (NIDD) option
Key Input Parameters:
BLER_tx/BLER_rx: Block error rates for uplink/downlink from RAN simulation
drx_cycle_length: DRX cycle duration (20-40ms for LTE; suitability for GEO TBC with RAN2)
mis_eNB1_eNB2: Scheduling time mis-alignment; determines buffer waiting time
nFrames considerations:
- Frame length: Maximum 80ms assumed for GEO (vs. 20ms for LTE)
- Voice packet size: Depends on protocol overhead (user plane vs. control plane, IP vs. Non-IP NIDD)
- RTP Payload Size: Product of frame length and codec bit rate
Editor's Note: SA2 concluded in TR 23.700-19 that voice packets shall be transported over NB-IoT (GEO) user plane.
3.2 RAN Simulation Model for Error Traces
Objective: Generate multiple loss traces for combinations of:
- Frame loss rate (target BLER)
- Raw bitrate (TBS)
- Voice bundling period
- Doppler spread
Simulation Parameters:
- Number of seeds: 10
- Trace duration: 400 seconds (6.67 minutes)
- Channel consistency: Same channel realizations across all combinations
3.2.1 Link Budget Analysis
Baseline CNR values from TR 36.763:
- UL CNR = 2.6dB (0dBi UE antenna gain, 3.75kHz SCS, 1 tone, 23dBm UE max TX power)
- DL CNR = -3.3dB (0dBi UE antenna gain, 15kHz SCS, 12 tones, 1 UE receive antenna, 23dBm UE max TX power)
3.2.2 Uplink Simulation Parameters
Channel model: NTN-TDL-C [38811]
Elevation angle: 10 degrees (parameters specified in Table 5.2.2.2-1)
Modulation: QPSK, π/2 BPSK
Subcarrier Spacing (SCS): 3.75kHz, 15kHz
Number of tones: 1 for both SCS values
Voice bundling period: 80ms, 160ms, 320ms
- Note: 40ms not considered due to insufficient time for DL transmissions with 3.75kHz SCS
Doppler spread: 1Hz, 5Hz
Target BLER: 1%, 2%, 6%, 10% (fixed target BLER is FFS)
Maximum Achievable SNR:
SNR = (3GPP SET-1 UL SNR) - 10×log₁₀(B/3.75) + (P - 23dBm) + G + [X] dB
Where:
- 3GPP SET-1 UL SNR = 2.6dB
- B = bandwidth (3.75kHz or 15kHz)
- P = max UE TX power (23, 26, 31 dBm)
- G = UE antenna gain difference (0 to -5.5dBi)
- X = TBD (accounts for lower loss, better satellite performance)
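The uplink SNR expression can be evaluated as follows. This is a sketch under stated assumptions: X is still TBD in the source, so it defaults to 0 dB here purely as a placeholder, and the function and parameter names are illustrative.

```python
import math

SET1_UL_SNR_DB = 2.6  # baseline: 3.75 kHz SCS, 1 tone, 23 dBm, 0 dBi


def max_ul_snr_db(bandwidth_khz: float, tx_power_dbm: float,
                  antenna_gain_diff_db: float = 0.0, x_db: float = 0.0) -> float:
    """Maximum achievable UL SNR in dB; x_db is a placeholder for the TBD term X."""
    return (SET1_UL_SNR_DB
            - 10 * math.log10(bandwidth_khz / 3.75)  # noise bandwidth scaling
            + (tx_power_dbm - 23.0)                  # power class offset
            + antenna_gain_diff_db                   # G, between 0 and -5.5
            + x_db)


for bw in (3.75, 15.0):
    for power in (23.0, 26.0, 31.0):
        print(f"B={bw} kHz, P={power:.0f} dBm: {max_ul_snr_db(bw, power):+.2f} dB")
```

With G = 0 and X = 0, moving from 3.75 kHz to 15 kHz costs about 6 dB, which a higher power class can partly or fully recover.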
TBS Values and PHY Bitrates:
For 80ms bundling:
- TBS: 144, 256, 328, 424 bits
- PHY bitrate: 1.8, 3.2, 4.1, 5.3 kbps
- Codec bitrate: 1.1, 2.5, 3.4, 4.6 kbps (assuming 7 bytes packet header)
For 160ms bundling:
- TBS: 208, 424, 600, 808 bits
- PHY bitrate: 1.30, 2.65, 3.75, 5.05 kbps
- Codec bitrate: 0.95, 2.30, 3.40, 4.70 kbps
For 320ms bundling:
- TBS: 328, 776, 1096, 1544 bits
- PHY bitrate: 1.025, 2.425, 3.425, 4.825 kbps
- Codec bitrate: 0.850, 2.250, 3.250, 4.650 kbps
Notes:
- Packet header counted once regardless of bundled frames
- Loss of single TB means loss of multiple consecutive voice frames
- Need for 320ms bundling to be revisited after channel simulation results
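The PHY and codec bitrates listed above follow from the TBS and the bundling period. A minimal sketch, assuming the 7-byte packet header stated for 80ms bundling applies to all three bundling periods (this reproduces the listed codec bitrates); names are illustrative.

```python
HEADER_BITS = 7 * 8  # 7-byte packet header, counted once per transport block

# Candidate TBS values in bits, keyed by voice bundling period in ms.
TBS_BITS = {
    80: [144, 256, 328, 424],
    160: [208, 424, 600, 808],
    320: [328, 776, 1096, 1544],
}


def phy_bitrate_kbps(tbs_bits: int, bundling_ms: int) -> float:
    """Raw rate: bits per millisecond is numerically equal to kbit/s."""
    return tbs_bits / bundling_ms


def codec_bitrate_kbps(tbs_bits: int, bundling_ms: int,
                       header_bits: int = HEADER_BITS) -> float:
    """Rate left for codec payload after removing the packet header."""
    return (tbs_bits - header_bits) / bundling_ms


for bundling_ms, tbs_list in TBS_BITS.items():
    for tbs in tbs_list:
        print(f"{bundling_ms:>3} ms, TBS {tbs:>4}: "
              f"PHY {phy_bitrate_kbps(tbs, bundling_ms):.3f} kbps, "
              f"codec {codec_bitrate_kbps(tbs, bundling_ms):.3f} kbps")
```

Because the header is counted once per transport block, longer bundling periods spend a smaller share of the TBS on overhead.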
3.2.3 Downlink Simulation Parameters
SCS: 15kHz
Number of tones: 12
Achievable SNR:
SNR = (3GPP SET-1 DL SNR) + G + [Y] dB
Where:
- 3GPP SET-1 DL SNR = -3.3dB
- G = UE antenna gain difference (0 to -5.5dBi)
- Y = TBD (accounts for 2 RX antennas providing up to 3dB gain, lower loss, better G/T values, better satellite performance)
Editor's Note: Four companies reported Y=3 due to better G/T from field measurements (-28.6dB/K vs. -31.6dB/K assumed), but no RAN1 consensus reached.
TBS values: Identical to uplink (Clause 5.2.2.2)
3.2.4 Frame Structure
Dynamic Scheduling Example (80ms bundling, Half-duplex FDD):
- NPDSCH duration: 4ms (variable depending on DL SNR)
- UL frequency allocation options: 1, 3, 6, 12 tones with 15kHz per tone
Semi-Persistent Scheduling (SPS):
- If specified by RAN for NB-IoT NTN
- NPDSCH can be anywhere in first 15ms (maintaining minimum 1ms gap to NPUSCH)
- "Cell_specific_Koffset" approach proposed (not dependent on "TA report UE capability")
Gap between DL and UL consists of:
- Processing time + DL-to-UL switching (minimum 1ms for half-duplex device)
- Max differential delay: [close to 0 to 10.3ms] (TBC)
RAN1 Note: Example frame structures supportable in most scenarios but may not work for very large cells (>3000km) when UE doesn't support TA report and network doesn't support UE-specific K-offset. RAN1/2 have not yet designed SPS.
3.3 Open Issues for NB-IoT GEO Simulation
Issue 1 - UE Power Class: Whether to use specified 23dBm or broader range (26, 29, 31, 33 dBm) - Pending RAN input
Issue 2 - Latitude-Dependent Loss: Scintillation loss (2.2dB or 0dB depending on latitude) - Solved (accounted via X term)
Issue 3 - Elevation Angles: Keeping both 2.3° and 12.5° - Solved (accounted via X term)
Issue 4 - UL/DL Guard Time: 1ms assumption - Pending RAN confirmation
Issue 5 - Candidate TBS Values: Multiple proposals from companies - Unsolved
Issue 6 - Approaches to Select TBS: Three approaches provided - Unsolved
Issue 7 - Overall Simulation Methodology: High-level description needed - Unsolved (to be addressed after simulation completion)
Issue 8 - Simulation Channel Model: NTN-TDL-C vs. NTN-TDL-C5 - Solved (NTN-TDL-C used)
Issue 9 - Protocol Overhead: Clarify packet header for different transport options - Pending RAN2/SA2 confirmation
Issue 10 - Repetition Numbers: Specify and report in simulation - Solved
Issue 11 - RX G/T for Downlink: 3dB better value observed in field - Unsolved
3.4 Alternative Methodology for Determining ULBC Bit Rate
Editor's Note: This methodology remains an open issue.
Proposed Steps:
1. Agree on operation points: Set of maximum achievable receive SNRs covering marginal to error-free operation with NTN-TDL-C fading
2. Define performance requirements for each SNR operation point
3. Agree on source bit rates for each bundling time (80, 160, 320ms) based on transport formats (TBS, SCS, MCS, NRep)
- Current range: 825-4650 bits/s
- Granularity appears insufficient and unequal
4. Determine optimum transport format (SCS, MCS, NRep) for each source bit rate based on BLER vs. SNR curves
5. Produce packet loss patterns for each bundling time and source bit rate at relevant SNRs (unknown to proponents during selection)
6. Compare ULBC candidates based on performance requirements at relevant SNRs
Example Workflow:
- Proponent has design at 0.95 kbps and 3.4 kbps
- For 160ms bundling with 7-byte overhead:
- Low rate: TBS = 208 bits
- High rate: TBS = 600 bits
- Select best transport format configuration from available options
- Generate BLER patterns for different UE TX powers (23, 26, 29, 31 dBm)
- Run codec simulation with these patterns
- Evaluate quality (e.g., listening test) with weighted averaging across power settings
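The first steps of this workflow amount to mapping a codec bit rate to the smallest transport block that can carry one bundled packet. A minimal sketch, assuming the 160ms TBS candidates from the uplink parameters and the 7-byte overhead; the function names and the selection rule (smallest fitting TBS) are assumptions made for illustration.

```python
import math
from typing import Optional, Sequence


def required_tbs_bits(codec_kbps: float, bundling_ms: int,
                      header_bytes: int = 7) -> int:
    """Payload bits for one bundling period plus the packet header."""
    # Small epsilon guards against floating-point noise before rounding up.
    payload_bits = math.ceil(codec_kbps * bundling_ms - 1e-9)
    return payload_bits + header_bytes * 8


def select_tbs(codec_kbps: float, bundling_ms: int,
               candidates: Sequence[int]) -> Optional[int]:
    """Smallest candidate TBS that fits the packet, or None if none fits."""
    need = required_tbs_bits(codec_kbps, bundling_ms)
    return min((tbs for tbs in candidates if tbs >= need), default=None)


TBS_160MS = [208, 424, 600, 808]
for rate in (0.95, 3.4):
    print(f"{rate} kbps at 160 ms -> TBS {select_tbs(rate, 160, TBS_160MS)} bits")
```

The transport format choice, BLER pattern generation, and weighted quality averaging in the later steps depend on RAN simulation data and are not sketched here.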
Note: Important to test candidates for other conditions beyond NTN NB-IoT (e.g., Terrestrial IMS with 1% BLER, OTT with 0% BLER, extreme conditions with 10% BLER or blockage losses)
3.5 Simulation Results
Table 5-6 documents preliminary results:
- 80ms bundling: Qualcomm submitted S4-251739
- Company A, B, C: TBD
4. Design Constraints
4.1 Complexity and Memory Demands
Target Device Types:
- Handheld mobile phones
- Smart watches
- Smart glasses/head mounted devices
- TCU (Telematics Control Unit)
- CPE (Customer Premises Equipment)
- Vehicles
- Other IoT devices
Recommended Constraints:
- Implementable on DSP/CPU/NPU enabled UE devices
- For low-end DSP-only UEs:
- Complexity: <500 WMOPS (measured on C reference code)
- ROM memory: <20MB assuming 32bit/parameter (or 5M model parameters)
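A candidate could be screened against the recommended low-end DSP constraints as sketched below. The limits are the bracketed recommendations above and remain subject to change; the class and function names are hypothetical, ROM is assumed to be dominated by model parameters, and megabytes are taken as decimal (5M parameters at 32 bits gives 20MB).

```python
from dataclasses import dataclass

WMOPS_LIMIT = 500.0   # complexity limit, measured on C reference code
ROM_LIMIT_MB = 20.0   # ROM limit in decimal megabytes


@dataclass
class CodecProfile:
    wmops: float
    num_parameters: int
    bits_per_parameter: int = 32

    @property
    def rom_mb(self) -> float:
        """Parameter storage in decimal MB (code size is ignored here)."""
        return self.num_parameters * self.bits_per_parameter / 8 / 1e6


def meets_low_end_dsp_constraints(profile: CodecProfile) -> bool:
    """True if both limits are met with strict inequality."""
    return profile.wmops < WMOPS_LIMIT and profile.rom_mb < ROM_LIMIT_MB


# Hypothetical candidates, for illustration only.
for profile in (CodecProfile(300.0, 4_000_000),
                CodecProfile(300.0, 10_000_000),
                CodecProfile(300.0, 10_000_000, bits_per_parameter=8)):
    print(profile, f"ROM={profile.rom_mb:.0f} MB",
          "OK" if meets_low_end_dsp_constraints(profile) else "exceeds limits")
```

The third case shows why precision matters: the same 10M-parameter model fits the ROM limit at 8 bits per parameter but not at 32.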
Editor's Notes:
- Definition of "DSP enabled UE devices" needs clarification
- Exact complexity estimation metric and limits are TBD
4.2 Design Constraint Verification
Complexity Verification:
- Constraints may be based on platform-agnostic metrics:
- MACs/FLOPs for AI-based components
- WMOPS for traditional signal processing
- Model size and precision
- Verification process details and timing are FFS
Algorithmic Delay:
- Verification method for AI-based codecs required
5. Performance Requirements
5.1 Scope
Define performance requirements and test methodologies for:
- Speech quality, intelligibility, conversational quality
- Clean speech and noisy speech
- Tandeming with existing IMS voice codecs
- Clean channel and GEO channel conditions
- Identify relevant reference codecs
5.2 Status Tracking
Core influencing factors identified:
- DC: Sample rate and audio bandwidth
- DC: Bitrates (External dependency)
- DC: Frame length
- DC: PLC (External dependency)
- DC: Algorithmic Delay
- DC: Complexity, Memory
- Test Methodologies
- DC: Noise suppression
- DC: DTX/CNG
- DC: Robust Non-Speech
- Evidence DCs
- Reference codec
All items currently have open issues and progress TBD
6. Coordination and Dependencies
6.1 External Dependencies
From RAN:
- HARQ retransmission parameters (max_tx/max_rx)
- DRX cycle length suitability for GEO
- Scheduling parameters (dynamic vs. SPS)
- Frame structure confirmation
- UE power class
- UL/DL guard time
- Protocol overhead
- G/T values for downlink
From SA2:
- Transport path for voice packets (user plane vs. control plane, IP vs. Non-IP NIDD)
- Protocol overhead details
- Transcoding functionality requirements
From RAN2:
- Dynamic scheduling vs. Semi-Persistent Scheduling
- MAC header size (1-byte feasibility)
- Timing parameters
7. Key Technical Contributions
7.1 Simulation Framework Establishment
The document establishes a comprehensive RAN simulation framework for generating error traces:
- Defined methodology using NTN-TDL-C channel model
- Specified uplink and downlink parameters
- Established TBS values and corresponding codec bitrates for multiple bundling periods
- Defined channel consistency requirements across simulations
7.2 Link Budget Analysis
Adopted baseline CNR values from TR 36.763 with provisions for:
- Variable UE power classes
- Latitude-dependent losses
- Elevation angle variations
- Better-than-assumed satellite performance
7.3 Bitrate Determination Methodology
Proposed alternative methodology allowing proponents design freedom:
- Operation point definition based on receive SNRs
- Transport format optimization for each source bit rate
- Packet loss pattern generation
- Comparative evaluation framework
7.4 Frame Structure Definition
Defined frame structures for:
- Dynamic scheduling with Half-duplex FDD
- Semi-Persistent Scheduling options
- Cell_specific_Koffset approach for large cells
7.5 End-to-End Delay-Error Profile Model
Adapted TS 26.132 Annex E model for GEO scenarios:
- Identified required input parameters
- Defined voice packet structure with protocol overhead
- Established relationship between frame length, bundling, and packet loss
8. Open Issues Summary
High Priority (Blocking):
1. Consensus on UE power class (23 dBm vs. higher values)
2. RAN confirmation on frame structures and scheduling
3. SA2/RAN2 confirmation on protocol overhead
4. Selection of candidate TBS values and selection methodology
5. Downlink RX G/T value consensus
Medium Priority:
1. Fixed vs. variable target BLER
2. Need for 320ms bundling option
3. Complexity metric definition and limits
4. Algorithmic delay verification for AI codecs
Lower Priority:
1. Overall simulation methodology description (after completion)
2. Definition of "DSP enabled UE devices"
9. Document Status
Current Version: 0.45.0 (SA4#135, February 2026)
Recent Updates:
- Added 10-degree channel model parameters
- Updated simulation parameters per multiple agreed TDOCs
- Added company simulation results reporting
- Clarified voice packet transport over user plane
Working Status:
- Active study phase
- Collecting simulation results from companies
- Coordinating with RAN and SA2 for parameter confirmation
- Developing design constraints and performance requirements
|
Extracted Proposals
This 3GPP document contains no formal proposals explicitly marked with the word "Proposal" followed by a colon, period, number, or other punctuation.
The document is a Permanent Document (p-doc) for the FS_ULBC study item that contains:
- Technical assumptions
- Open issues
- Editor's notes
- Tables with simulation parameters
- References to other documents
It is a working document that tracks the status of study objectives, technical parameters, and open issues rather than presenting formal proposals for agreement.
|
|
|
(pdf)
|
[FS_ULBC] WorkPlan of FS_ULBC v0.5 |
China Mobile Com. Corporation |
Timeplan for FS_ULBC Study Item
1. Introduction
This document outlines the timeplan for the Feasibility Study on Ultra Low Bitrate Speech Codec (FS_ULBC). The study focuses on developing a codec for ultra-low bit rate communication services, particularly for IMS Voice Call Using GEO Access as documented in TR 22.887.
Study Item Objectives
The FS_ULBC study has nine main objectives:
- Application Scenarios: Document ultra-low bit rate communication service scenarios based on TR 22.887 use cases and requirements for IMS Voice Call Using GEO Access
- GEO Channel Characteristics: Study GEO channel characteristics and derive service-related dependencies (bitrates, mouth-to-ear delay, loss/delay/jitter profiles)
- Note: NB-IoT services impact is out of scope
- Design Constraints: Identify relevant design constraints in coordination with other WGs:
- Bit rates
- Sample rate and audio bandwidth
- Frame length
- Complexity and memory demands
- Algorithmic delay
- Packet loss concealment (PLC)
- Potential noise suppression integration
- Discontinuous transmission (DTX) including VAD and comfort noise
- Speech quality
- Robustness to non-speech input
- Feasibility Evidence: Provide evidence that design criteria can be met using existing reference codecs
- Performance Requirements: Define performance requirements and identify test methodologies for:
- Speech quality and intelligibility
- Conversational quality
- Clean and noisy speech conditions
- Tandeming with existing IMS voice codecs
- Clean channel and GEO channel conditions
- Objective Measures: Identify or develop objective measures to verify design constraints (e.g., complexity and memory measurements)
- Reference Codecs: Identify relevant reference codecs for comparison and evaluation
- Coordination: Coordinate with other 3GPP groups (SA2, RAN, CT1, etc.)
- Normative Work: Define potential normative work item objectives and timeline
2. Current Progress Status
Application Scenarios (85% Complete)
- Scenario 1: IMS Voice Call over GEO (TR 4.2, P-doc 4.1)
- Scenario 2: Multi-Party Voice Communication (TR 4.3)
- Scenario 3: IMS Voice Call with ULBC over other access types than GEO (TR 4.4)
- Next Steps: Finalize high-level prerequisites and resolve ENs
- Dependencies: SA2, RAN, CT
GEO Channel Characteristics & Simulation (75% Complete)
- NB-IoT system design and simulation parameters documented (TR 5.1.4, P-doc 5.2.2)
- SA4#135 Plans:
- Finalize remaining parameters
- Gather candidate Transport Block Sizes (TBS)
- Dependencies: SA2, RAN
Simulation Methodology (60% Complete)
- SA4#135 Plans: Discuss and confirm simulation methodology
- Dependencies: RAN
Company Simulation Results (40% Complete)
- Companies providing simulation results (P-doc 5.2.3)
- SA4#135 and Post Ad-hoc Plans:
- Select appropriate TBS
- Collect company simulation results
- SA4#136 Plans:
- Cross-check simulation results
- Finalize feasible TBS values and loss traces
Mouth-to-Ear Delay (95% Complete)
- Documented in TR 5.1
- Next Steps: Resolve ENs
Design Constraints Progress
Bit Rates (0% Complete)
- SA4#136 and Post Ad-hoc: Decide bit rates for ULBC (dependent on simulation results)
- Dependencies: RAN
Sample Rate and Audio Bandwidth (5% Complete)
- SA4#135 and Post Ad-hoc: Discuss supported audio bandwidth
- SA4#136 and Post Ad-hoc: Decide supported audio bandwidth
Frame Length (0% Complete)
- SA4#136 and Post Ad-hoc: Decide frame length for ULBC
Complexity and Memory Demands (80% Complete)
- Documented in TR 6.2.1, P-doc 6.1.1
- SA4#135 and Post Ad-hoc: Finalize complexity measurement metrics and resolve ENs
Algorithmic Delay (0% Complete)
- SA4#135 and Post Ad-hoc: Discuss algorithmic delay for ULBC
Packet Loss Concealment (15% Complete)
- Documented in TR 7.1.5
- SA4#135 and Post Ad-hoc: Discuss PLC for ULBC
Noise Suppression (15% Complete)
- Documented in TR 7.4
- SA4#135 and Post Ad-hoc: Discuss noise suppression for ULBC
DTX (0% Complete)
- SA4#135 and Post Ad-hoc: Discuss DTX support for ULBC
Design Constraint Verification (5% Complete)
- P-doc 6.3.1
- Next Steps: Verify design constraints
Other Considerations (5% Complete)
- TR 6.4.1
- Next Steps: Document additional design considerations and resolve ENs
Existing Codec Technologies (85% Complete)
- Reference codecs documented (e.g., DAC, Lyra) in TR 7
- SA4#135 and Post Ad-hoc: Continue documenting evidence of existing technologies and resolve ENs
Performance Requirements (0% Complete)
- SA4#136 and Post Ad-hoc: Define performance requirements
Test Methodologies (50% Complete)
- Subjective test methodologies documented in TR 9.1.3
- SA4#135 and Post Ad-hoc: Identify appropriate test methodologies
Coordination with Other WGs
- Analysis of current liaisons from RAN, CT, and SA2 available (S4aA250139)
- Ongoing coordination as needed
3. Detailed Timeplan
TSG SA#107 (March 12-14, 2025, Incheon, KR)
- Approval of FS_ULBC study item
SA4#131-bis-e (April 11-17, 2025)
- Start documenting application scenarios for ultra-low bit rate communication services
- Start studying GEO channel characteristics and service-related dependencies
- Start identifying relevant reference codecs
- Start coordinating with other 3GPP groups
Audio SWG Telco (May 5, 2025)
- Focus on application scenarios and technical contributions
SA4#132 (May 19-23, 2025, Fukuoka, JP)
- Finalize application scenarios documentation
- Progress GEO channel characteristics study
- Progress reference codec identification
- Progress coordination with other WGs
- Start identifying/developing objective measures for design constraint verification
- Start identifying relevant design constraints (bit rates, sample rate, frame length, complexity, algorithmic delay, PLC, noise suppression, DTX, speech quality, robustness)
- Start providing feasibility evidence using existing reference codecs
- Start defining performance requirements and test methodologies for speech quality
- If time permits: Start documenting additional application scenarios
Audio SWG Telcos (June-July 2025)
Multiple telcos scheduled to:
- Progress GEO channel characteristics study
- Perform RAN-related simulations within SA4
- Align on RAN link-level simulations
- Power to send LS to SA2 and RAN WGs if needed
SA4#133-e (July 21-25, 2025)
- Progress all ongoing work items:
- GEO channel characteristics
- Coordination with other WGs
- Reference codec identification
- Objective measures development
- Design constraints identification
- Feasibility evidence
- Performance requirements and test methodologies
- If time permits: Progress additional application scenarios
F2F Ad-hoc Meeting (September 23-25, 2025, Erlangen, Germany)
- Hosted by Fraunhofer IIS
- Electronic participation on best effort basis
Audio SWG Telcos (October 2025)
- Opportunity for feedback from other WGs
- Progress work on:
- GEO channel characteristics study
- Existing technologies documentation
- Design constraints identification
- Performance requirements and test methodologies
- Application scenarios if time permits
SA4#134 (November 17-21, 2025, Dallas, US)
Major milestone meeting:
- Finalize:
- GEO channel characteristics study
- Coordination with other WGs
- Reference codec identification
- Design constraints: bit rates, sample rate, audio bandwidth, frame length, PLC, noise suppression, DTX
- Progress:
- Feasibility evidence
- Objective measures development
- Design constraints: complexity, algorithmic delay, speech quality, robustness
- Performance requirements and test methodologies
- Start defining potential normative work item objectives and timeline
- If time permits: Finalize additional application scenarios
Audio SWG Telcos (December 2025 - January 2026)
- Finalize GEO channel characteristics study
- Progress:
- Simulation parameters for end-to-end simulation
- Existing technologies documentation
- Design constraints identification
- Performance requirements and test methodologies
- Application scenarios if time permits
- Power to send reply LS for incoming LS postponed during SA4#134
SA4#135 (February 9-13, 2026, India)
- Finalize objective measures for design constraint verification
- Progress:
- Design constraints: complexity, algorithmic delay, speech quality, robustness
- Feasibility evidence
- Performance requirements and test methodologies
TSG SA#111 (March 10-13, 2026, Japan)
SA4#136 (April 13-17, 2026)
- Finalize:
- Design constraints: complexity, algorithmic delay, robustness to non-speech input
- Feasibility evidence
- Progress:
- Design constraints: speech quality
- Performance requirements for speech quality
SA4#137 (May 11-15, 2026)
Final study meeting:
- Finalize:
- Design constraints: speech quality
- Performance requirements and test methodologies (clean/noisy speech, tandeming, clean/GEO channel conditions)
- Potential normative work item objectives and timeline
TSG SA#112 (June 9-12, 2026, Singapore)
- TR for approval - Study completion
|
Extracted Proposals
This document does not contain any explicit proposals. The document is a timeplan for the FS_ULBC (Feasibility Study on Ultra Low Bitrate Speech Codec) Study Item, outlining objectives, current progress, and meeting schedules, but does not include any sections explicitly marked as "Proposal", "Proposal X:", "Proposal X.", etc.
|
|
|
(pdf)
|
[FS_ULBC] On Assumptions and Open Issues for NB-IoT GEO Simulation |
China Mobile Com. Corporation |
Summary of S4-260149: Updates on Assumptions and Open Issues for NB-IoT GEO Simulation
Document Overview
This contribution from China Mobile addresses outstanding assumptions and open issues for NB-IoT GEO satellite simulation work within the ULBC (Ultra-Low Bitrate Codec) study. The document consolidates discussions from multiple Audio Ad-hoc meetings (June 4, June 17, and July 11) and proposes updates to TS 26.940 clause 5.2.2.4.
Main Technical Contributions
Status Updates on Simulation Parameters
The document provides a comprehensive status table tracking 11 key simulation issues, with updates on their resolution status:
Resolved Issues
- UE Power Class (Issue 1):
- Previously pending decision between 23 dBm (specified for NTN NB-IoT) vs. higher commercial values (26-33 dBm)
- Resolution: 37 dBm adopted based on RAN4 Reply LS S4aA250219
- Note: 37 dBm is under study in ongoing RAN work
- Latitude-Dependent Loss (Issue 2):
- Addressed scintillation loss variation (2.2 dB vs. 0 dB) based on latitude per TR 38.821
- Resolution: Simulation accounts for latitude-dependent loss using X term
- Additional note: New 10-degree channel model introduced, which may increase feasible TBS
- Elevation Angles (Issue 3):
- Proposal to maintain both 2.3° and 12.5° elevation angles for worst-case scenarios
- Resolution: Simulation accounts for elevation angles using X term
- Simulation Channel Model (Issue 8):
- Choice between NTN-TDL-C or NTN-TDL-C5
- Resolution: NTN-TDL-C is used
- Repetition Numbers (Issue 10):
- Proposal to specify and report repetition numbers in simulation
- Resolution: Solved
Partially Resolved Issues
- Protocol Overhead (Issue 9):
- Requires clarification of packet header overhead for different protocol combinations (user plane, control plane, IP vs. non-IP)
- Partial Resolution: SA2 confirmed voice packets transported over User Plane
- Still Pending: Overhead for User Plane (IP vs. Non-IP) needs RAN confirmation
- RX G/T for Downlink (Issue 11):
- Field measurements show a value 3 dB better than current RAN assumptions
- Status: Editorial note added in P-doc 5.2.2.3 to capture field-measured data
- Current Status: Listed as "Unsolved" in table
Unresolved Issues
- UL/DL Guard Time (Issue 4):
- Current assumption: 1 ms guard time for UL/DL switching
- Status: Needs RAN confirmation on feasibility
- Determine Candidate TBS Values (Issue 5):
- Multiple proposals from different companies:
- Xiaomi (S4aA250035)
- Fraunhofer (S4aA250031)
- Skylo (S4-251540)
- Dolby (S4-251390)
- Huawei (S4aA250230)
- Qualcomm (S4-251548)
- vivo (S4aA250215)
- Status: Unsolved, requires further verification
- Approaches to Select TBS (Issue 6):
- Three approaches provided in S4aA250072
- One approach detailed in clause 5.2.2.4.1
- Status: Unsolved, requires further discussion
- Overall Simulation Methodology Description (Issue 7):
- Need for high-level description of simulation execution, including optimization parameters and result parameters
- General description documented in P-doc Clause 5.2.2
- Status: Unsolved, to be addressed after all simulation work completed
Proposal
The document proposes to:
1. Update the P-doc (TS 26.940) based on the status updates provided
2. Continue tracking these issues until full resolution
Key Dependencies
The document highlights several dependencies on other working groups:
- RAN4: UE power class confirmation
- RAN: UL/DL guard time feasibility, protocol overhead confirmation
- SA2: Protocol overhead for different transport configurations
|
Proposal
It is proposed to update the P-doc based on the content in Clause 2 of this document and to continue tracking the status of these issues.
|
|
|
(pdf)
|
[FS_ULBC] Updates of the permanent document based on 3GPP TR 23.700-19 |
vivo Mobile Communication Co., |
Summary of 3GPP Technical Document: Updates to FS_ULBC Permanent Document
Document Overview
This contribution updates the FS_ULBC (Ultra Low Bitrate Speech Codec) Permanent Document to align with SA2 conclusions on Key Issue #1 regarding IMS voice call support over NB-IoT via GEO satellite connecting to EPC, as documented in TR 23.700-19.
Main Technical Contributions
1. Reference Updates
The document adds critical new references to align with recent 3GPP work:
- TR 23.700-19 V1.2.0: Study on Integration of satellite components in the 5G architecture; Phase 4
- S2-2509293: Interim conclusions on KI#1 Support of IMS voice call over NB-IoT NTN via GEO satellite connecting to EPC
- TR 36.763: Study on NB-IoT/eMTC support for Non-Terrestrial Networks
- R1-2506541: Reply LS on RAN simulation assumptions for ULBC
2. End-to-End Simulation Model Updates (Clause 5.2.1.3)
2.1 Architecture and Protocol Stack Changes
The document introduces significant modifications to the end-to-end simulation model:
- New GEO Channel Model: Extends the reference LTE scenario (Annex E of TS 26.132) to accommodate GEO satellite access
- Three Architectural Scenarios Defined:
- Reference LTE VoLTE scenario (Figure 5.2.1.3-1)
- Main GEO scenario with IP transport (Figure 5.2.1.3-2)
- GEO scenario with Non-IP Data Delivery option (Figure 5.2.1.3-2a)
2.2 Transport Mechanism Agreements
Based on SA2 conclusions in TR 23.700-19:
- User Plane Transport: Voice packets shall be transported over NB-IoT (GEO) user plane using DRB and S1-U
- Single PDN Connection: Both IMS signaling and IMS voice use a single PDN connection
- Mandatory Mechanism: Transport of IP packets (UP/IP) with RoHC recommended
- Optional Mechanism: Transport using removal and restoration of parts of RTP/UDP/IP headers (UP/non-IP)
2.3 Simulation Input Parameters
Key parameters updated for GEO scenarios:
- BLER_tx/BLER_rx: Block error rates for UL/DL based on error traces from Clause 5.2.2
- max_tx/max_rx: HARQ retransmissions (note: HARQ feedback suggested to be disabled for IMS voice over GEO per Release 18)
- drx_cycle_length: DRX cycle duration (LTE values 20-40ms, suitability for GEO requires RAN2 confirmation)
- mis_eNB1_eNB2: Scheduling time misalignment between eNBs
- Speech sequence frame length: Maximum 80ms frame length for GEO (vs. 20ms for LTE)
- Voice packet size: Depends on protocol overhead, varies by transport mechanism
2.4 Protocol Overhead Considerations
Two protocol overhead scenarios illustrated:
- UP/IP with RoHC (Figure 5.2.1.3-4 left): Mandatory mechanism
- UP/non-IP with header removal (Figure 5.2.1.3-4 right): Optional mechanism
Editor's Note: Exact overhead for UDP/IP (SA2 scope) and RTP (SA4 scope) for the removal/restoration mechanism requires determination.
3. Simulation Assumptions and Open Issues (Clause 5.2.2.4)
3.1 Resolved Issues
| Issue | Resolution |
|-------|-----------|
| Latitude-Dependent Loss | Simulation accounts for latitude-dependent scintillation loss using X term (2.2 dB or 0 dB beyond ±20° latitude per TR 38.821) |
| Elevation Angles | Both 2.3° and 12.5° angles considered using X term for worst-case scenarios |
| Simulation Channel Model | NTN-TDL-C selected |
| Repetition Numbers | Specified and reported in simulation |
3.2 Pending Issues Requiring RAN Input
- UE Power Class: 23 dBm (specified for NTN NB-IoT) vs. commercial UE range (26-37 dBm) - requires RAN confirmation
- UL/DL Guard Time: 1ms assumption needs RAN verification
- RX G/T for Downlink: Field observations show 3 dB better performance than current RAN assumptions
3.3 Unresolved Issues
- Candidate TBS Values: Multiple proposals from Xiaomi, Fraunhofer, Skylo, Dolby, Huawei, Qualcomm, and vivo require evaluation
- TBS Selection Approaches: Three approaches in S4aA250072 need discussion
- Overall Simulation Methodology: High-level description to be completed after simulation work
- Protocol Overhead for UP/non-IP: Exact overhead values for removal/restoration mechanism depend on specific RTP fields selected (SA4 decision)
3.4 Updated Understanding on Protocol Overhead
Based on SA2 agreements:
- Control Plane transport excluded: Only User Plane transport considered
- Mandatory: UP/IP with RoHC recommended
- Optional: UP/non-IP with partial header removal/restoration
- Exact overhead values for optional mechanism pending SA4 decisions on RTP field selection
Key Dependencies and Cross-WG Coordination
The document identifies several inter-working group dependencies:
- RAN1: Physical layer timing, power class confirmation
- RAN2: HARQ configuration, DRX cycle parameters, scheduling mechanisms
- SA2: UDP/IP overhead for non-IP mechanism
- SA4: RTP overhead, frame length confirmation, RTP field selection for header removal
Editor's Notes
Two critical editor's notes remain:
- Whether the eNB1-eNB2 delay model for LTE scenarios accurately reflects GEO deployment delays
- Whether RTP payload size affects the delay-error profile
|
Extracted Proposals
This 3GPP contribution contains no explicit proposals.
The document is a contribution to update the permanent document (PD) based on TR 23.700-19, and it contains:
- Reasons for change
- Proposed changes to the PD (editorial updates to align with agreements already reached)
- Technical assumptions and open issues
However, there are no sections explicitly marked as "Proposal", "Proposal X:", "Proposal X.", etc. The document presents updates and changes to an existing document rather than new proposals for consideration.
|
|
|
(pdf)
|
[FS_ULBC]Considerations for ULBC Codec Selection Process |
China Mobile Com. Corporation |
Comprehensive Summary: Considerations for ULBC Codec Selection Process
Document Overview
This document appears to be a presentation or discussion paper related to the ULBC (Ultra Low Bitrate Codec) selection process. However, the provided content is fragmentary and contains mixed language elements (English and Chinese), making comprehensive technical analysis challenging.
Main Technical Areas Identified
1. ULBC Codec Selection Process
The document's primary focus is on considerations for selecting a codec in the ULBC (Ultra Low Bitrate Codec) context. However, specific technical criteria, evaluation methodologies, or selection parameters are not detailed in the provided content.
2. JPEG AI Integration
Overview
- The document references JPEG AI as a relevant technology
- JPEG AI appears to be considered as a potential codec or compression technology for the ULBC use case
Working Mechanism
- A section is dedicated to JPEG AI's working mechanism
- Specific technical details of the mechanism are not provided in the extracted content
Timeline
- Timeline considerations for JPEG AI are mentioned
- Specific milestones or deployment schedules are not detailed
3. Cross-Working Group Coordination
SA2 Related Work
- SA2 has related work in Release 18 (R18) and Release 19 (R19)
- Key Issue Identified: Lack of unified architecture design
- Requirements are coming from RAN but lack unified architectural framework
- Suggests fragmentation in approach across different scenarios
RAN Liaison Statements
- Latest LS (Liaison Statement) from RAN concerns model transmission
- Indicates coordination requirements between SA and RAN working groups
4. Architecture Considerations
Network Function Changes
- Reference to "NF变CN" (Network Function changes to Core Network)
- Suggests potential architectural modifications at the Core Network level
- Specific changes or proposals are not detailed in the provided content
Open Questions
The document includes an "Open Questions" section, indicating ongoing discussions and unresolved technical issues. However, the specific questions are not provided in the extracted content.
Technical Gaps in Provided Content
Due to the fragmentary nature of the document provided:
- Specific codec selection criteria are not detailed
- Technical evaluation parameters are missing
- Comparison methodologies between candidate codecs are not present
- Detailed architectural proposals are not included
- Specific agreements or decisions are not documented
Observations
- Multi-Release Scope: The work spans R18 and R19, indicating ongoing evolution
- Cross-WG Dependencies: Clear dependencies between SA2 and RAN work
- Architecture Fragmentation: Identified need for unified architecture design
- Emerging Technologies: JPEG AI considered as potential solution
- Core Network Impact: Potential changes to CN architecture implied
Note: This summary is based on fragmentary content with significant portions in template format or non-English text. A complete technical analysis would require the full document with all technical details, agreements, and proposals.
|
Proposals
This document does not contain any proposals. The document appears to be a presentation about ULBC Codec Selection Process and JPEG AI, but no sections are explicitly marked as "Proposal" in any of the standard formats.
|
|
|
(pdf)
|
[FS_ULBC] Analyzing semantic intelligibility in lossy coded audio signals |
vivo Mobile Communication Co., Nokia, Xiaomi Technology, Samsung, Spreadtrum, Bytedance |
Comprehensive Summary: Analyzing Semantic Intelligibility in Lossy Coded Audio Signals
1. Introduction and Objectives
This contribution presents experimental evaluation results focusing on semantic intelligibility of audio codecs under Ultra Low Bitrate Codec (ULBC) constraints for GEO satellite communications. The primary objective is to quantify semantic preservation (the listener's ability to accurately understand spoken content) using Automatic Speech Recognition (ASR) Word Error Rate (WER) as a proxy metric, rather than traditional perceptual quality (MOS) metrics.
The study evaluates:
- Descript Audio Codec (DAC) - AI-based codec
- Enhanced Voice Services (EVS) codec - 3GPP standard reference
The analysis specifically investigates whether higher audio bandwidths (wideband vs. narrowband) improve or reduce intelligibility at very low bitrates, providing data-driven guidance for audio bandwidth design constraints and quality floor determination.
2. Background and Motivation
2.1 ULBC Context
The ULBC study item targets voice over GEO satellite communications where balancing audio quality, robustness, and bit-efficiency is critical. At extremely low bitrates (below 3 kbps, down to ~1 kbps), a fundamental trade-off emerges:
- Wideband audio (16 kHz) offers naturalness and perceptual quality
- Bit allocation challenge: Allocating scarce bits to higher frequencies reduces the budget for core speech spectrum, potentially introducing artifacts that outweigh bandwidth benefits
2.2 Critical Communication Requirements
For emergency rescue operations, semantic intelligibility is the highest priority. Key considerations include:
- Wideband generally improves comfort and speaker identification, but its impact on speech understanding in "last resort" scenarios requires verification
- System interoperability with legacy endpoints (PSTN, GSM fallback) remains important in remote areas
- Need to balance modern expectations with legacy requirements and emergency scenarios
2.3 EVS as Reference Anchor
EVS serves as a quality anchor and concrete standardized baseline for semantic preservation, enabling:
- Practical quality floor definition for ULBC
- Comparison against established carrier-grade standards
- Isolation of bandwidth choice impact independent of codec architecture
3. Methodology
3.1 Evaluation Pipeline
- Dataset: LibriSpeech train-clean-100 subset (standard benchmark for high-quality read English speech)
- Sample size: 500 audio files randomly sampled across three seeds (101, 102, 103)
- Consistency: Same audio files used for all codec and bitrate configurations
3.2 Processing Chain
- Process input audio through target codecs (DAC and EVS) at various bitrates
- Transcribe processed audio using OpenAI Whisper model (large-v3) - selected for state-of-the-art performance and noise robustness
- Compare transcripts against LibriSpeech ground truth
- Calculate WER using jiwer library
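The final scoring step of this chain can be illustrated without external dependencies. The contribution used the jiwer library; the sketch below is a standard-library stand-in that computes corpus-level WER as word-level edit distance divided by the number of reference words. The text normalization (lowercasing, whitespace split) is an assumption, since the summary does not state which normalization was applied:

```python
from typing import Iterable, Tuple


def word_edit_distance(ref: list, hyp: list) -> int:
    """Levenshtein distance between two word sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def corpus_wer(pairs: Iterable[Tuple[str, str]]) -> float:
    """WER in percent over (reference, hypothesis) transcript pairs."""
    errors = words = 0
    for reference, hypothesis in pairs:
        ref = reference.lower().split()
        hyp = hypothesis.lower().split()
        errors += word_edit_distance(ref, hyp)
        words += len(ref)
    return 100.0 * errors / words


if __name__ == "__main__":
    # Toy transcripts, not taken from LibriSpeech.
    print(corpus_wer([("the cat sat on the mat", "the cat sat on mat")]))
```

Pooling errors over all files before dividing, rather than averaging per-file WER, keeps short utterances from dominating the score.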
4. Experimental Setup
4.1 Codec Configurations
DAC model: Evaluated at three sampling rates
- 16 kHz
- 24 kHz
- 44.1 kHz
EVS codec: Evaluated in standard modes
- Narrowband (NB)
- Wideband (WB)
Baseline: Uncompressed PCM audio (resampled from 48 kHz to NB and WB)
4.2 Observations on Baseline Variance
- NB PCM occasionally scored ~0.1% better than WB PCM
- Attributed to inherent ASR model variance rather than signal quality differences
- Explains why high-bitrate DAC models occasionally score slightly lower than WB baseline
4.3 Primary Metric
WER (Word Error Rate): Lower percentage indicates better performance. Log-scale visualization employed to distinguish performance differences in the 3-5% WER range.
5. Results and Analysis
5.1 DAC Performance vs Bitrate
Key Findings:
- DAC achieves high efficiency at low bitrates (~2 kbps)
- WER drops rapidly as bitrate increases, stabilizing around 3-4%
- At 1.5 kbps: WER approximately 5.5%
- Significant improvement observed in 1.5-3.0 kbps range
Bandwidth Impact at Low Bitrates:
- At low bitrates (1.5 kbps and 3 kbps), 16 kHz model outperforms 24 kHz model
- With constant model size, 16 kHz model allocates more bits per spectral unit within narrower band
- Results in better semantic preservation vs. 24 kHz model suffering from bit starvation
5.2 DAC 8 kHz Narrowband Model Analysis
A dedicated 8 kHz sampling rate model was trained to investigate bandwidth impact at the lower bitrate bound.
Model Configuration:
- Sample rate: 8000 Hz
- Encoder rates: [2, 4, 4, 8], dimension: 64
- Decoder rates: [8, 4, 4, 2], dimension: 1536
- Quantization: 6 codebooks, size 1024, dimension 36
- Training: 200,000 steps on VCTK corpus
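The bitrate of such a model follows from its configuration. A minimal sketch, assuming the usual residual vector quantization accounting: one latent frame per product of the encoder rates, and log2(codebook size) bits per active codebook per frame. The function name is illustrative, and the number of active codebooks per operating point is an inference, not something the summary states:

```python
import math


def rvq_bitrate_bps(sample_rate_hz: int, encoder_rates: list,
                    codebook_size: int, n_codebooks: int) -> float:
    """Bitrate of a residual-VQ codec with n_codebooks active codebooks."""
    hop = math.prod(encoder_rates)            # input samples per latent frame
    frame_rate = sample_rate_hz / hop         # latent frames per second
    bits_per_frame = n_codebooks * math.log2(codebook_size)
    return frame_rate * bits_per_frame


if __name__ == "__main__":
    for n in (3, 5, 6):
        print(n, "codebooks:", rvq_bitrate_bps(8000, [2, 4, 4, 8], 1024, n), "bps")
```

Under these assumptions the configuration yields 31.25 frames per second at 10 bits per codebook. Three and five active codebooks give 937.5 bps and 1562.5 bps, consistent with the 938 bps and 1563 bps operating points reported for the 8 kHz model in this section, and all six codebooks give 1875 bps.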
Critical Findings at Sub-1.5 kbps:
- At ~1 kbps:
- 8 kHz model (938 bps): WER 5.86%
- 16 kHz model (1000 bps): WER 11.23%
- Semantic penalty > 5 percentage points when forcing WB at 1 kbps
- At 1.5 kbps:
- 8 kHz model (1563 bps): WER 3.86%
- 16 kHz model (1500 bps): WER 5.46%
Conclusion: At sub-2 kbps bitrates, available bit budget is insufficient to support wider bandwidth without degrading core spectral content required for intelligibility. Native Narrowband mode allows high-precision bit allocation to fundamental frequencies (0-4 kHz), preserving semantic content more effectively.
5.3 DAC vs EVS Comparison
Competitive Advantage:
- DAC achieves comparable WER scores at significantly lower bitrates than EVS
- DAC 16 kHz performance curves converge towards high-quality PCM baselines faster than traditional codecs
ULBC Application: For GEO scenarios in [1-3] kbps range, semantic preservation is critical for defining quality floor.
5.4 EVS Narrowband vs Wideband Analysis
Performance at Different Bitrates:
- At 13.2 kbps (highest tested):
- EVS-NB: 3.16%
- EVS-WB: 3.14%
- Nearly identical, indicating saturation point in semantic quality
- At 5.9-8.0 kbps range:
- EVS-WB maintains marginal advantage (e.g., at 8.0 kbps: WB 3.15% vs. NB 3.41%)
- Both modes provide sufficient basic audio quality
- At 9.6 kbps:
- EVS-NB: 3.19%
- EVS-WB: 3.23%
- NB performance very close to WB, difference within ASR model error margin
Conclusion: For semantic understanding, NB bandwidth limitation is less critical than codec's bit allocation efficiency.
5.5 EVS Degradation Analysis
Methodology: Calculated WER Degradation = (WER_coded - WER_baseline) / (100 - WER_baseline) to isolate codec processing impact from ASR model variance.
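A minimal sketch of this degradation metric, following the formula as written above. The baseline value used in the example is hypothetical (the PCM baseline WER is not quoted in this summary); the coded values are the EVS 13.2 kbps figures from Section 5.4.

```python
def wer_degradation(wer_coded: float, wer_baseline: float) -> float:
    """Codec-induced WER increase, normalised by the headroom above baseline.

    Both inputs are percentages in the range 0..100. The result is a
    fraction of the remaining (100 - baseline) range.
    """
    return (wer_coded - wer_baseline) / (100.0 - wer_baseline)


HYPOTHETICAL_BASELINE = 3.10  # placeholder, not a value from the source

for label, wer in (("EVS-NB", 3.16), ("EVS-WB", 3.14)):
    d = wer_degradation(wer, HYPOTHETICAL_BASELINE)
    print(f"{label}: degradation = {d:.5f}")
```

With any baseline close to 3%, both modes land within a small fraction of a percent of each other, which is the sense in which the coding loss is described as indistinguishable.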
Key Findings:
- Semantic loss introduced by EVS in both NB and WB modes is minimal
- Degradation metric confirms that pure coding loss of NB and WB is statistically indistinguishable when subtracting baseline PCM variance
- Additional frequency content in wideband contributes negligible semantic information for machine understanding compared to core NB spectrum
Strategic Implication: Robust NB mode is sufficient for intelligibility requirements of critical last resort communications, without bit starvation risk associated with wider bandwidths at low bitrates.
5.6 Summary of Findings
Strategic Conclusions for ULBC Design:
- For ~1 kbps emergency/GEO scenarios:
- Semantic intelligibility is paramount
- NB and WB offer comparable semantic preservation
- Enforcing wider bandwidth at extremely low bitrates is risky due to limited bit budget
- Narrowband is superior design choice at lowest bitrates, allowing encoder to focus bits on basic voice quality foundation
- AI-based codec sampling rate optimization:
- DAC 16 kHz model provides distinct advantage over 24 kHz model at lower bitrates
- 8 kHz model (trained for only 200k steps) outperforms the official 16 kHz model at low bitrates
- Optimizing sampling rate to match available bit budget is critical for system design
- Performance of intermediate rates (e.g., 12 kHz) remains open question
6. Proposals
Proposal: Include relevant content from Sections 3, 4, and 5 into TR 26.940, capturing:
- Methodology
- Experimental setup
- Analysis of results concerning audio bandwidth impact on semantic intelligibility
7. Detailed Results Tables
Complete experimental data provided in appendix tables covering:
- Table 1.a: DAC Model Results (16/24/44 kHz) across bitrates 500-7751 bps
- Table 1.b: DAC NB Model Results (8 kHz) across bitrates 312-1875 bps
- Table 2: EVS & PCM Baseline Results for NB/WB modes at 5900-13200 bps
|
Proposals
It is proposed to include the relevant content from Sections 3, 4, and 5 of this document into TR 26.940. This includes the methodology, experimental setup, and the analysis of results concerning the impact of audio bandwidth on semantic intelligibility, to capture the findings of this study.
|
|
|
(pdf)
|
[FS_ULBC]pCR on Existing codec technologies |
China Mobile Com. Corporation |
Summary of pCR on Existing Codec Technologies (S4-260154)
Document Information
- Source: China Mobile Com. Corporation
- Specification: 3GPP TR 26.940 V0.5.1
- Meeting: TSG-SA WG4 Meeting #135, Goa, India, 09-13 February 2026
Purpose and Scope
This pCR proposes updates to Clause 7.1 of TR 26.940, which documents existing codec technologies for evidence that design criteria can be met and for comparison/evaluation purposes. The document adds information about recently emerged ultra-low bit-rate voice codecs (below 1 kbps) as reference for further work.
Main Technical Contributions
Expanded Codec Technology Reference Table
The pCR significantly expands Table 7.1.1-1 "List of existing codec technologies" by adding multiple categories of codecs beyond the existing 3GPP IMS codecs. The table includes the following parameters for each codec:
- Source/Reference
- Audio bandwidth (NB/WB/SWB/FB)
- Codec delay (ms)
- Frame duration (ms)
- Bitrates (kbps)
- Specification access/software availability
New Codec Categories Added
1. Conventional Ultra Low Bitrate Codecs
- MELP/MELPe: 0.6-2.4 kbps, NB, 22.5-90ms frame duration
- AMBE-LR: 1.6-1.8 kbps, NB
- MPEG-HVXC: 2-4 kbps, NB
- TWELP MR: 0.3-3.2 kbps, NB, various frame durations (40-120ms)
- Codec2: 0.45-2.4 kbps, NB, primarily 40ms frames
2. AI-Based Decoders
- WaveNet Codec2: 2.4 kbps, WB, 20ms frames
- CQNV Codec2: 1.0-1.1 kbps, WB, 40-60ms frames
3. AI-Based Encoder and Decoder (Causal)
These codecs support real-time operation:
- LPCNet: 1.6 kbps, WB, 40ms frames, 25ms delay
- LyraV2 (SoundStream): 3.2-9.2 kbps, WB, 20ms frames
- EnCodec: 1.5-24 kbps, 24kHz/FB, 0-1000ms delay, 13.3ms frames
- Mimi-Codec: 0.55-1.1 kbps, 24kHz, 80ms frames, 0ms delay
- TS3: 0.64-0.8 kbps, WB, 20ms frames, 0ms delay
- TAAE: 0.4-0.7 kbps, WB, 20-40ms frames, 0ms delay
- LMCodec2: Parameters TBD
4. AI-Based Encoder and Decoder (Non-Causal)
These codecs are designed for offline/non-real-time applications:
- DAC: 0.5-3 kbps, WB/24kHz, 244-366ms delay
- DAC-IBM: 0.75-3 kbps, 24kHz, 366ms delay
- SNAC: 0.98 kbps, 24kHz, 1000ms delay, 80ms frames
- SpeechTokenizer: 0.5-1.0 kbps, WB, full-signal delay
- SemantiCodec: 0.31-1.4 kbps, WB, 10-40ms frames, full-signal delay
- FunCodec: 0.25-1.0+ kbps, WB, 20-40ms frames
- WavTokenizer: 0.25-0.9 kbps, 24kHz, 25-40ms frames
- BigCodec: 1.04 kbps, WB, 12.5ms frames
- FocalCodec: 0.16-0.65 kbps, WB, 20-80ms frames
- ALMTokenizer: 0.41 kbps, WB, 13.3ms frames
- XY-Tokenizer: 1 kbps, WB, 20ms frames
- LongCat-Audio-Codec: 0.43-0.87 kbps, WB, 60ms frames
- AcademiCodec: Parameters TBD
- MuCodec: 0.35-1.35 kbps, FB
Additional Notes
The pCR includes several important notes:
- Note 1: Some codecs may include noise suppression
- Note 2: MPEG-HVXC decoder and reference encoder available only to MPEG members
- Note 3: Codec2 uses 20ms overlapping FFT/iFFT with overlap-add
- Note 4: Some codecs only have non-causal versions publicly available
- Note 5: TWELP has a complete quality assessment testbench available despite lacking open reference implementation
An editor's note indicates that more codecs may be added to the table in future revisions.
Key Observations
The pCR demonstrates significant industry progress in ultra-low bitrate speech coding, particularly:
- Multiple AI-based solutions achieving sub-1 kbps bitrates
- Wide range of delay characteristics (0ms to 1000ms)
- Various bandwidth support (NB to FB)
- Different availability levels for specifications and software implementations
|
Proposal 1: It is proposed to agree the following changes to clause 7.1 of 3GPP TR 26.940.
|
|
|
(pdf)
|
[FS_ULBC] Analysis of AI Codec Real-Time Performance (RTF) and Complexity Scaling |
vivo Mobile Communication Co., Xiaomi Technology, Spreadtrum, Bytedance |
Analysis of AI Codec Real-Time Performance (RTF) and Complexity Scaling
1. Introduction and Motivation
This contribution addresses a critical gap in the Ultra Low Bitrate Speech Codec (ULBC) study by moving beyond theoretical complexity metrics (FLOPs, WMOPS) to evaluate real-world performance on mobile devices. The key observation is that static metrics fail to capture system-level bottlenecks including memory bandwidth pressure and thermal constraints on mobile SoCs. The document presents a comprehensive RTF analysis of a neural audio codec (based on Descript Audio Codec architecture) across multiple model sizes and sample rates on representative mid-range mobile hardware.
2. Experimental Setup
2.1 Model Configuration
Eight model variants were evaluated, ranging from enc8dec144 to enc64dec1536, with parameter counts spanning 1M to 74M:
- Architecture: Fully convolutional encoder-decoder with Residual Vector Quantization (RVQ)
- Frame length: 40ms (fixed across all variants)
- Total up/down-sampling factor: 320 (consistent across variants)
- Sample rates tested: 8 kHz (320 samples), 16 kHz (640 samples), 32 kHz (1280 samples)
- Export format: ONNX with Float32 precision
Key complexity observations from Table 1:
- Parameter counts range from 1.09M (enc8dec144) to 74.50M (enc64dec1536)
- Model sizes range from 4.3 MB to 283.6 MB
- Computational complexity scales proportionally with sample rate (e.g., enc32dec768: 4955.9 MFlops/s @ 8kHz, 9972.6 MFlops/s @ 16kHz, 20006.1 MFlops/s @ 32kHz)
2.2 Device Under Test (DUT) Environment
- Platform: MediaTek Dimensity 1200 (6nm) - representative mid-range SoC
- Inference engine: ONNX Runtime v1.14+ with CPU execution provider (single-threaded)
- CPU clusters tested:
- Efficiency cluster: Cortex-A55
- Performance cluster: Cortex-A78
- Prime core: Cortex-A78+
- Methodology: Frequency-locked operation with disabled thermal services and power HALs to eliminate dynamic frequency scaling noise
3. Results and Analysis
3.1 Complexity Scaling vs. Bandwidth
Critical finding: For a given model variant, computational complexity scales linearly with sample rate:
- enc32dec768 example:
- 8 kHz: ~0.20 GFLOP counts (4955.9 MFlops/s)
- 16 kHz: ~0.40 GFLOP counts (9972.6 MFlops/s) - 2x increase
- 32 kHz: ~0.80 GFLOP counts (20006.1 MFlops/s) - 4x increase
Implication: Higher sampling rates incur proportional computational penalty. For resource-constrained devices (IoT, wearables), NB mode at 8 kHz is recommended.
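The per-frame GFLOP counts and the MFLOP/s figures above are linked through the 40 ms frame length. A minimal sketch of that conversion, using the enc32dec768 values quoted above (function and variable names are illustrative):

```python
FRAME_S = 0.040  # 40 ms frame, fixed across all variants

# enc32dec768 throughput requirement in MFLOP/s, keyed by sample rate in Hz
MFLOPS_BY_RATE = {8000: 4955.9, 16000: 9972.6, 32000: 20006.1}


def gflop_per_frame(mflops_per_s: float, frame_s: float = FRAME_S) -> float:
    """Operations needed to process one frame, in GFLOP."""
    return mflops_per_s * 1e6 * frame_s / 1e9


base = MFLOPS_BY_RATE[8000]
for rate_hz, mflops in MFLOPS_BY_RATE.items():
    print(f"{rate_hz // 1000} kHz: {gflop_per_frame(mflops):.3f} GFLOP/frame, "
          f"{mflops / base:.2f}x the 8 kHz load")
```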
3.2 Real-Time Factor (RTF) Analysis Across Three Frequency Tiers
3.2.1 Tier 1: Low Frequency (A55@750MHz, A78@902MHz, A78+@1.1GHz)
Energy-conserving state with severe constraints:
- Cortex-A55 @ 750 MHz: Only smallest models (enc8dec144) maintain real-time at 8 kHz; 16/32 kHz unfeasible
- Cortex-A78 @ 902 MHz:
- 32 kHz: Limited to <3M parameters
- 16 kHz: Supports up to ~8M parameters
- 8 kHz: Supports up to ~10M parameters
- Cortex-A78+ @ 1.108 GHz: Similar to A78 but extends 16 kHz limit closer to 10M parameters
3.2.2 Tier 2: Mid Frequency (A55@1.0GHz, A78@1.16GHz, A78+@1.37GHz)
Typical sustained workload state:
- Cortex-A55 @ 1.0 GHz: 8 kHz supports up to ~2M parameters; 16/32 kHz remain largely unfeasible
- Cortex-A78 @ 1.162 GHz:
- 32 kHz: ~5M parameter limit
- 16 kHz: ~10M parameters (covers "Low Complexity" profile)
- 8 kHz: Robust up to ~20M parameters
- Cortex-A78+ @ 1.37 GHz: Performance parity with A78 (clock speed is primary differentiator)
3.2.3 Tier 3: High Frequency (A55@1.73GHz, A78@1.45GHz, A78+@1.63GHz)
High-performance state approaching sustained limits:
- Cortex-A55 @ 1.73 GHz:
- 8 kHz: ~3M parameters
- 16 kHz: ~2M parameters
- 32 kHz: ~1M parameters
- Cortex-A78 @ 1.451 GHz:
- 32 kHz: ~7M parameters
- 16 kHz: ~10M parameters
- 8 kHz: ~20M parameters
- Cortex-A78+ @ 1.632 GHz: Highest headroom
- 32 kHz: ~8M parameters
- 16 kHz: Comfortably supports 10M parameters
- 8 kHz: ~20M parameters
Key observation: Inverse relationship between sample rate and model size capacity is consistently demonstrated.
3.3 Maximum Performance Envelope
Analysis at peak locked frequencies establishes absolute upper bounds:
3.3.1 Efficiency Core (Cortex-A55 @ 2.0 GHz)
Even at peak frequency, A55 remains highly constrained. Models exceeding ~5M parameters (enc16dec384) fail real-time constraints at 8 kHz and above. Unsuitable for large weight matrices.
3.3.2 Performance Core (Cortex-A78 @ 2.6 GHz)
Most relevant benchmark for ULBC - represents sustained compute capability of modern mobile devices.
Critical "Complexity vs. Bandwidth" trade-off identified:
- 32 kHz: RTF crosses 1.0 near 10M parameters (enc24dec576 variant)
- Hard limit for High-Fidelity ULBC candidates
- 16 kHz: Feasible model size effectively doubles to ~20M parameters (enc32dec768 variant)
- enc40dec960 fails real-time constraints
- Linear relationship between bandwidth reduction and parameter capacity
- 8 kHz: Extends to ~39M parameters
- enc40dec960 (29M) is safe
- Trend suggests failure before enc64dec1536
3.3.3 Prime Core (Cortex-A78+ @ 3.0 GHz)
Results mirror A78 trends with slight improvements due to higher clock frequency. The bandwidth bottleneck remains dominant - higher clock speed provides safety margin for borderline models (e.g., enc24dec576 @ 32kHz) but doesn't fundamentally shift feasible model size category.
4. Key Technical Contributions
4.1 Quantified Complexity-Bandwidth Trade-off
Established an approximately inverse relationship between sample rate and feasible model size: halving the sample rate roughly doubles the feasible parameter count on performance cores:
- 32 kHz → 10M parameters
- 16 kHz → 20M parameters
- 8 kHz → 39M parameters
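One way to read the three limits above is as a roughly constant product of sample rate and parameter count. The sketch below checks that reading. Treating the product as a "budget" is an interpretation added here for illustration, not a quantity defined in the contribution.

```python
# Feasible model size (millions of parameters) per sample rate (kHz),
# as reported for the Cortex-A78 performance core.
LIMIT_M_PARAMS = {32: 10, 16: 20, 8: 39}

# Hypothetical "budget": sample rate (kHz) times parameter count (M).
budget = {khz: khz * params for khz, params in LIMIT_M_PARAMS.items()}

for khz, value in budget.items():
    print(f"{khz} kHz: {LIMIT_M_PARAMS[khz]}M params -> product {value}")

spread = max(budget.values()) / min(budget.values())
print(f"max/min product ratio: {spread:.3f}")
```

The products differ by only a few percent, which is consistent with the approximate doubling described above.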
4.2 Real-World Performance Benchmarks
Provided concrete RTF measurements across representative mobile hardware configurations, revealing that:
- Theoretical complexity metrics (FLOPs) don't capture real-world bottlenecks
- Memory bandwidth and thermal constraints significantly impact feasibility
- Efficiency cores (A55) are unsuitable for neural codec workloads beyond minimal complexity
4.3 Practical Complexity Constraints for ULBC
Identified 10M parameter hard limit for 32 kHz operation on mid-range mobile devices (A78 @ 2.6 GHz), providing concrete guidance for ULBC candidate selection.
5. Proposal
The contribution proposes including these RTF analysis findings in TR 26.940 to inform complexity constraint selection for ULBC candidates, moving the standardization process toward real-world deployability considerations rather than purely theoretical metrics.
|
Proposal: It is proposed to include the findings of this RTF analysis in TR 26.940 to inform the selection of complexity constraint for the ULBC candidate.
|
|
|
(pdf)
|
[FS_ULBC] Discussion on Audio Bandwidth for ULBC |
vivo, Samsung, MediaTek Inc., Bytedance, Nokia, Xiaomi, Spreadtrum |
Technical Summary: Audio Bandwidth Requirements for ULBC
1. Introduction and Scope
This contribution addresses audio bandwidth design constraints for the Ultra-Low Bitrate Codec (ULBC), targeting primarily voice over GEO satellite communications. The document argues against mandatory Wideband (WB) and Super-Wideband (SWB) support, proposing instead that Narrowband (NB) should be mandatory with WB as an enhancement.
2. Key Technical Arguments
2.1 Global NB Usage and System Efficiency
Current Network Reality:
- 2G/3G connections (primarily AMR-NB) still represent 20% of global technology mix (end of 2023)
- Regional variations: 81% in Sub-Saharan Africa, 46% in Middle East and North Africa
- NB serves as universal fallback for interoperability (CS fallback scenarios)
System Inefficiency Without NB Mode:
- WB ULBC to NB user calls waste upper frequency band (4-8 kHz)
- Significant bitrate wasted transmitting data that recipient cannot hear
- Over expensive, scarce satellite link, this inefficiency is unacceptable
- Native NB mode provides most efficient solution for legacy network connectivity
2.2 User Expectations in "Last Resort" Scenarios
Baseline Expectation Setting:
- GEO call is final option after terrestrial network failure
- Users typically experience AMR-NB fallback before resorting to GEO
- ULBC must be at least as reliable as NB fallback to meet user expectations
- WB-only ULBC failure in conditions where NB would work represents service failure
2.3 Primary Use Case: Emergency Communications
Typical Deployment Scenario:
- Rescue teams in remote areas (e.g., Himalayan mountains)
- Mixed-connectivity environment:
- Squad A: GEO-only (outside TN coverage)
- Squad B: GSM fallback at coverage fringe
- Base Camp: PSTN connection (NB service)
Technical Implications:
- Terminating endpoints predominantly NB
- Emergency systems use traditional NB codecs (Codec2, MELP) for robustness
- Transmitting WB over satellite to NB endpoint wastes critical resources in life-or-death situations
- Real-world deployment example provided (China rescue missions)
Evaluation Priority:
- ULBC candidates should prioritize intelligibility and robustness testing in NB mode
2.4 Performance at Very Low Bitrates
Quality vs. Bandwidth Trade-off:
- Forcing wider bandwidth at very low bitrates spreads available data too thinly
- Research shows lower sampling rates can achieve higher perceptual quality at very low bitrates
- WB codec at ~1 kbps may compromise intelligibility, especially with packet loss
- NB signal more robustly reconstructed under constrained conditions
Analogy: "Spreading butter" - concentrating bits on narrower bandwidth preserves speech richness and intelligibility
2.5 Complexity and Power Consumption
Computational Scaling Issues:
- AI-based codec architectures don't scale gracefully
- Doubling sampling rate (NB to WB): 2x to 4x complexity increase for CNN/Transformer models
- WB-only mandate imposes unnecessary computational burden
- Critical issue for power-constrained mobile devices
- Native NB mode offers high-quality voice at significantly lower complexity/power budget
3. Experimental Analysis: Higher Bandwidth Inefficiency
3.1 Experiment Setup
Test Configuration:
- Codec: Descript Audio Codec (DAC) with pre-trained models
- Sampling rates tested: 44.1 kHz, 24 kHz (SWB), 16 kHz (WB)
- Test corpus: 100 clean speech samples from MS-SNSD dataset
- Bitrate variation: 1-9 active quantization codebooks
- Quality metric: ViSQOL algorithm (speech mode, MOS estimate)
Model Specifications:
| Model | Compression | Frame Rate | Codebooks | Bitrate/Codebook |
|-------|-------------|------------|-----------|------------------|
| 16 kHz (WB) | 320x [2,4,5,8] | 50 Hz | 12 (10-bit) | 0.50 kbps |
| 24 kHz (SWB) | 320x [2,4,5,8] | 75 Hz | 32 (10-bit) | 0.75 kbps |
| 44.1 kHz | 512x [2,4,8,8] | ~86.1 Hz | 9 (10-bit) | ~0.86 kbps |
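The frame rate and bitrate-per-codebook columns follow from the sample rate, the total compression factor, and the 10-bit codebook index. A short sketch that reproduces them from the table values (names are illustrative):

```python
from math import prod

BITS_PER_INDEX = 10  # 10-bit codebooks, as listed in the table

# model label -> (sample rate in Hz, per-stage strides)
MODELS = {
    "16 kHz (WB)": (16000, [2, 4, 5, 8]),
    "24 kHz (SWB)": (24000, [2, 4, 5, 8]),
    "44.1 kHz": (44100, [2, 4, 8, 8]),
}


def frame_rate_hz(sample_rate_hz: int, strides: list) -> float:
    """Latent frames per second after total downsampling."""
    return sample_rate_hz / prod(strides)


def kbps_per_codebook(sample_rate_hz: int, strides: list) -> float:
    """Bitrate contributed by one active codebook."""
    return frame_rate_hz(sample_rate_hz, strides) * BITS_PER_INDEX / 1000.0


for label, (sr, strides) in MODELS.items():
    print(f"{label}: {prod(strides)}x, {frame_rate_hz(sr, strides):.1f} Hz, "
          f"{kbps_per_codebook(sr, strides):.2f} kbps per codebook")
```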
3.2 Key Experimental Findings
Quality vs. Bitrate Results:
- WB (16 kHz): Achieves excellent quality (ViSQOL MOS > 4.0) at ~2.5 kbps
- 24 kHz SWB: Requires higher bitrate to match WB quality
- 44.1 kHz: Provides minimal perceptible improvement over 24 kHz SWB
- Conclusion: Bitrate cost of SWB not justified by quality improvement for voice content
Efficiency Analysis:
- Clear trend: diminishing returns for bandwidth beyond WB
- SWB/FB represents inefficient use of bandwidth for ULBC service
4. Proposed Design Constraints
4.1 Bandwidth Requirements
Mandatory Support:
1. 8 kHz sampling rate (NB): 50-4000 Hz audio bandwidth
2. 16 kHz sampling rate (WB): 50-8000 Hz audio bandwidth
- Enhanced quality where channel conditions and device capabilities permit
- WB support can be limited to higher bitrates than NB operation
Further Study:
- Necessity and feasibility of SWB and FB support remains FFS
4.2 Text Proposal for TR 26.940
Change to Table 6.2-1 (Design Constraint Parameters):
Sample rate and audio bandwidth:
- The ultra low bitrate codec shall support sampling rates of 8kHz (NB) and 16kHz (WB)
- Supported audio bandwidth:
- NB: 50-4000 Hz
- WB: 50-8000 Hz
5. Supporting Evidence Summary
Quantitative Data:
- 20% global 2G/3G connections (hundreds of millions of users)
- Regional NB dominance: up to 81% in some areas
- WB achieves MOS > 4.0 at 2.5 kbps
- 2x-4x complexity increase for WB vs. NB in AI codecs
Qualitative Arguments:
- System efficiency (no wasted bandwidth to NB endpoints)
- User expectation alignment (last resort reliability)
- Emergency use case requirements
- Computational/power constraints for mobile devices
- Diminishing returns for SWB/FB at target bitrates
|
Extracted Proposals
Proposal 1: The codec shall be able to operate with an 8 kHz sampling rate, supporting an audio bandwidth of 50 – 4000 Hz.
Proposal 2: The codec shall be able to operate with a 16 kHz sampling rate (50 – 8000 Hz audio bandwidth) to offer enhanced quality where channel conditions and device capabilities permit. For example, WB support can be limited to higher bitrates than NB operation.
Proposal 3: The necessity and feasibility of including Super-Wideband (SWB) and Full-band (FB) support remains for further study.
|
|
|
(pdf)
|
[FS_ULBC] Analysis of AI Codec Complexity Scaling |
vivo Mobile Communication Co., |
Complexity Analysis of AI Codec Scaling for ULBC
1. Introduction
This contribution addresses the need for establishing relevant complexity evaluation methods for the new ULBC codec standardization. Previous contributions (e.g., S4aA250264) highlighted potential gaps between theoretical complexity metrics (FLOPs) and practical on-device performance (Real-Time Factor).
This document provides a complementary analysis focusing on how complexity metrics scale with AI model architecture itself. The analysis investigates the relationship between model architecture, theoretical complexity, and traditional metrics using the publicly available DAC codec as a test case.
2. Analysis of AI Codec Complexity Scaling
2.1. Methodology
The analysis created seven "dummy" model variants based on the open-source DAC codec's 16kHz configuration. The approach:
- Base Configuration:
- Sample rate: 16kHz
- Encoder dimension: 64
- Encoder rates: [2, 4, 5, 8]
- Decoder dimension: 1536
- Decoder rates: [8, 5, 4, 2]
- Scaling Approach:
- Only encoder_dim and decoder_dim were modified
- Encoder/decoder rates kept constant across all variants
- Total up/down-sampling factor maintained at 320 (2×4×5×8 = 8×5×4×2)
- Frame size: 20ms (320 samples at 16kHz)
- Variant Configurations:
- enc8dec144
- enc12dec288
- enc16dec384
- enc24dec576
- enc32dec768
- enc40dec960
- enc64dec1536
Complexity Metrics Measured:
- Model Parameters (Millions): Total trainable parameters
- Theoretical Complexity (MFLOP/s): Calculated using thop profiling library (aligned with S4aA250264 and S4aA250231)
- WMOPS: Traditional methodology using ITU-T STL wmc_tool, measured separately for encoder and decoder
Implementation Notes:
- Each AI operation implemented in pure C
- Source files annotated and compiled using wmc_tool
- WMOPS highly sensitive to C implementation efficiency
- Naive implementations can yield significantly higher counts than optimized versions
2.2. Complexity vs. Model Dimensions
Key Findings:
- Clear non-linear relationship between latent dimensions and resulting parameters/computational load
- Model parameters and MFLOP/s scale quadratically (or faster), not linearly, as encoder_dim and decoder_dim increase
- Results visualized in Figure 1 (Parameters vs. Dimension) and Figure 2 (MFLOP/s vs. Dimension)
- Encoder and decoder points are linked pairs corresponding to bundled setups
2.3. WMOPS vs. Model Parameters
Key Finding: Clear relationship between AI model size (in millions of parameters) and traditional WMOPS complexity.
Observations on DAC Model:
- Clear correlation between number of model parameters and resulting WMOPS when using same architecture with same C optimization level
- Decoder complexity scales significantly faster and is substantially higher than encoder complexity for all variants (DAC allocates more parameters/complexity to the decoder to achieve better reconstructed audio quality)
- Growth in WMOPS appears linear relative to increase in parameters for both encoder and decoder
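The linear trend can be checked by dividing WMOPS by parameter count for each variant, using the values from the summary table in Section 2.4. This is only an illustrative check: the parameter counts are totals for the whole model, so the per-parameter ratios are indicative rather than exact per-module figures.

```python
# (variant, params in millions, WMOPS encoder, WMOPS decoder)
ROWS = [
    ("enc8dec144", 1.09, 333.92, 760.53),
    ("enc12dec288", 2.89, 648.23, 2732.96),
    ("enc16dec384", 4.94, 1060.79, 4724.38),
    ("enc24dec576", 10.76, 2228.92, 10399.00),
    ("enc32dec768", 18.90, 3693.56, 18093.30),
    ("enc40dec960", 29.34, 5599.48, 28019.70),
    ("enc64dec1536", 74.50, 13675.30, 70766.69),
]

dec_ratio = {name: dec / params for name, params, _, dec in ROWS}
enc_ratio = {name: enc / params for name, params, enc, _ in ROWS}

for name in dec_ratio:
    print(f"{name}: decoder {dec_ratio[name]:.0f}, "
          f"encoder {enc_ratio[name]:.0f} WMOPS per M params")
```

Apart from the smallest variant, the decoder ratio stays within a narrow band, which is what a linear WMOPS-versus-parameters relationship would predict.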
2.4. Summary of Scaled Variants
Complete complexity metrics for all seven DAC variants (16kHz, 20ms frame):
| Variant | Enc Dim | Dec Dim | Params (M) | GFLOP counts | MFLOP/s | WMOPS Enc | WMOPS Dec |
|---------|---------|---------|------------|--------------|---------|-----------|-----------|
| enc8dec144 | 8 | 144 | 1.09 | 0.009 | 437.09 | 333.92 | 760.53 |
| enc12dec288 | 12 | 288 | 2.89 | 0.028 | 1397.63 | 648.23 | 2732.96 |
| enc16dec384 | 16 | 384 | 4.94 | 0.050 | 2481.98 | 1060.79 | 4724.38 |
| enc24dec576 | 24 | 576 | 10.76 | 0.112 | 5578.38 | 2228.92 | 10399.00 |
| enc32dec768 | 32 | 768 | 18.90 | 0.198 | 9911.72 | 3693.56 | 18093.30 |
| enc40dec960 | 40 | 960 | 29.34 | 0.310 | 15482.00 | 5599.48 | 28019.70 |
| enc64dec1536 | 64 | 1536 | 74.50 | 0.792 | 39614.50 | 13675.30 | 70766.69 |
Data demonstrates rapid scaling of all metrics as encoder and decoder dimensions increase.
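To put a number on the scaling, a log-log least-squares fit of the table values against decoder dimension gives an empirical exponent. The sketch below is an illustrative check and not part of the contribution's methodology. Using the decoder dimension as the single independent variable is a simplification, since the encoder dimension changes at the same time.

```python
import math

# (decoder_dim, params in millions, MFLOP/s) from the summary table
VARIANTS = [
    (144, 1.09, 437.09),
    (288, 2.89, 1397.63),
    (384, 4.94, 2481.98),
    (576, 10.76, 5578.38),
    (768, 18.90, 9911.72),
    (960, 29.34, 15482.00),
    (1536, 74.50, 39614.50),
]


def loglog_slope(xs, ys):
    """Least-squares slope of log(y) against log(x), i.e. the power-law exponent."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx = sum(lx) / len(lx)
    my = sum(ly) / len(ly)
    num = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    den = sum((a - mx) ** 2 for a in lx)
    return num / den


dims = [v[0] for v in VARIANTS]
param_exp = loglog_slope(dims, [v[1] for v in VARIANTS])
flops_exp = loglog_slope(dims, [v[2] for v in VARIANTS])
print(f"params  ~ dim^{param_exp:.2f}")
print(f"MFLOP/s ~ dim^{flops_exp:.2f}")
```

An exponent near 2 corresponds to quadratic growth, and an exponent of 1 would indicate linear growth.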
3. Observations and Conclusions
Based on the DAC model variant analysis:
- Linear Relationship: For the DAC model, there is a clear linear relationship between Theoretical Complexity (MFLOP/s), Model Parameters, and measured WMOPS. As MFLOP/s or parameter count increases, WMOPS increases linearly, provided C coding style remains consistent.
- Quadratic Growth: Increasing model's internal dimensions causes complexity to grow quadratically. Even small dimension increases lead to disproportionately large jumps in MFLOP/s and WMOPS.
- Implementation Dependency: WMOPS score depends heavily on source C code efficiency.
4. Proposal
It is proposed to capture the above analysis into 3GPP TR 26.940.
|
Proposal
It is proposed to capture the above analysis into 3GPP TR 26.940.
|
|
|
(pdf)
|
[FS_ULBC] On codec bitrate and capacity discussion for ULBC |
vivo, Samsung, Spreadtrum, MediaTek Inc. |
Summary of 3GPP CR S4-260159: On Codec Bitrate and Capacity Discussion for ULBC
1. Introduction
This CR addresses the TBS (Transport Block Size) and codec bitrate values for ULBC (Ultra Low Bitrate Codec) evaluation, which are currently noted as 'companies reported' in TR 26.940 v0.4.0. The contribution provides analysis on:
- Multiplexed UE number analysis
- Confirmation of TBS/Codec bitrate values
2. Technical Analysis
2.1 Multiplexed UE Number Analysis
The document presents a methodology for calculating supported UE numbers considering:
- TDM (Time Division Multiplexing): Both UL and DL can schedule different UEs in TDM manner
- FDM (Frequency Division Multiplexing): UL can additionally use FDM since NPUSCH may occupy few subcarriers within 180kHz bandwidth
- FDM capacity: 48 UEs for 3.75kHz SCS (single tone)
- FDM capacity: 12 UEs for 15kHz SCS (single tone)
- Bi-directional constraint: Final supported UE number is the minimum of UL and DL capacity
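The FDM figures above follow from dividing the 180 kHz carrier by the subcarrier spacing, and the bi-directional constraint is a simple minimum. A minimal sketch is given below. The UL and DL capacities in the example are hypothetical placeholders, and the TDM contribution is not modelled.

```python
CARRIER_BW_KHZ = 180.0


def single_tone_fdm_ues(scs_khz: float) -> int:
    """Number of single-tone UEs that fit side by side in the carrier."""
    return int(CARRIER_BW_KHZ // scs_khz)


def supported_ues(ul_capacity: int, dl_capacity: int) -> int:
    """Calls are bi-directional, so the weaker link sets the limit."""
    return min(ul_capacity, dl_capacity)


print(single_tone_fdm_ues(3.75))   # 3.75 kHz SCS
print(single_tone_fdm_ues(15.0))   # 15 kHz SCS

# Hypothetical capacities, for illustration only
print(supported_ues(ul_capacity=12, dl_capacity=10))
```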
2.2 Capacity Evaluation Results
Analysis conducted under 50-degree elevation channel model with 2% BLER:
Key Observations:
- Observation 1: Higher UE transmit power leads to higher capacity (multiplexed UE number) for a given codec bitrate
- Observation 2: For codec bitrate of ~3kbps, capacity is limited to ~10 UEs with 31dBm UE power. Capacity further degrades with increased bitrate (e.g., ≤5 UEs for 4.5kbps)
Performance characteristics:
- 23 dBm UE power shows very poor capacity
- Performance improves with higher power UEs (26 dBm, 31 dBm)
- Capacity increases with ptime value
2.3 Benchmark Considerations
- SA1 assumes transmission data rate of 1-3kbps as benchmark
- Commercial GEO system (Tiantong) operates within 0.8-2.4kbps range (per clause 5.2.1.3 of reference [1])
- Real-world deployments support focusing on bitrates below 3kbps
Additional analysis provided in Annex assuming 1% BLER under 10-degree elevation channel model.
3. Proposed Changes to TR 26.940
3.1 TBS and Codec Bitrate Tables
The CR proposes specific TBS values selected from TS 36.213 table 16.5.1.2-2 for NB-IoT NPUSCH, with corresponding PHY bitrates and codec bitrates calculated for each bundling period (assuming 7-byte packet header).
Table 1: 80ms bundling
- TBS range: 88-256 bits
- PHY bitrate range: 1.1-3.2 kbps
- Codec bitrate range: 0.4-2.5 kbps
Table 2: 160ms bundling
- TBS range: 120-424 bits
- PHY bitrate range: 0.75-2.65 kbps
- Codec bitrate range: 0.4-2.30 kbps
Table 3: 320ms bundling
- TBS range: 208-808 bits
- PHY bitrate range: 0.65-2.52 kbps
- Codec bitrate range: 0.475-2.35 kbps
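The PHY and codec bitrate ranges above can be reproduced from the TBS values, the bundling period, and the assumed 7-byte header counted once per transport block. A short sketch of that arithmetic (function names are illustrative):

```python
HEADER_BITS = 7 * 8  # 7-byte packet header, counted once per transport block


def phy_kbps(tbs_bits: int, bundling_ms: int) -> float:
    """Bits per millisecond equals kbps."""
    return tbs_bits / bundling_ms


def codec_kbps(tbs_bits: int, bundling_ms: int) -> float:
    """Bitrate left for the codec payload after removing the header."""
    return (tbs_bits - HEADER_BITS) / bundling_ms


# bundling period (ms) -> (smallest TBS, largest TBS) from Tables 1 to 3
TBS_RANGE = {80: (88, 256), 160: (120, 424), 320: (208, 808)}

for ms, (lo, hi) in TBS_RANGE.items():
    print(f"{ms} ms: PHY {phy_kbps(lo, ms):.3f}-{phy_kbps(hi, ms):.3f} kbps, "
          f"codec {codec_kbps(lo, ms):.3f}-{codec_kbps(hi, ms):.3f} kbps")
```

A different header size, for example if the 1-byte MAC header mentioned in NOTE 1 proves feasible, would only change `HEADER_BITS`.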
3.2 Additional Notes
NOTE 1: Final packet header size depends on SA2 and RAN conclusions, including feasibility of 1-byte MAC header
NOTE 2: Packet header counted only once regardless of bundled voice frames
NOTE 3: Relationship between voice frame duration and bundling time depends on RTP payload design. Loss of single TB means loss of multiple consecutive voice frames when bundled.
4. Proposals
Proposal 1: Agree that codec bitrate should be bound to be less than 3kbps
Proposal 2: Agree to the proposed changes to Section 5.2.2.2 (Uplink simulation parameters) of TR 26.940, including:
- Updated TBS values and PHY bitrates tables
- Voice bundling periods: 80ms, 160ms, 320ms (40ms excluded due to insufficient time for DL transmissions with 3.75kHz SCS)
- Target BLER values: 1%, 2%, 6%, 10%
- Maximum Achievable SNR formula incorporating UE power (23/26/31 dBm), bandwidth, and antenna gain variations
5. Supporting Information
The Annex provides additional multiplexed UE number analysis for different codec bitrates and UE power levels under 10-degree elevation channel model, supporting the main technical conclusions.
|
Extracted Proposals
Proposal 1: Agree that the codec bitrate should be bound to be less than 3kbps.
Proposal 2: Agree the following changes to 3GPP TR 26.940 [2].
|
|
|
(pdf)
|
On ULBC complexity and RTF analysis |
Dolby Laboratories Inc., Novamint, Nokia |
Summary of S4-260165: ULBC Complexity and RTF Analysis
Background and Motivation
This contribution addresses the need to finalize complexity and memory design constraints for the ULBC (Ultra-Low Bitrate Codec) study. Previous discussions at SA4 #133-e and the ULBC ADHOC meeting explored various complexity metrics and RTF performance data for existing AI codecs (DAC, Lyra v2, HIL). However, insufficient data exists to draw definitive conclusions on complexity constraints for ULBC.
The document builds upon previous contribution S4-251844 with the following modifications:
- Added CPU core information for experiments
- Aligned RTF definition with TR 26.940 clause 7.5.3
- Focused on model sizes 3-20M parameters (more relevant to ULBC use cases)
- Provided pCR for TR 26.940
- Removed large chunk-based processing experiments (not relevant for real-time voice communication)
Experimental Setup and Methodology
Model Configuration
Modified DAC architecture with reduced parameters while maintaining general structure:
- Model sizes: 20M, 15M, 9M, and 3M parameters (float32 precision)
- Training: Optimized for ~1 kbps bitrate at 32 kHz sampling rate
- Encoder rates: 4,4,8,10 for all models
Complexity Analysis
Theoretical Complexity (GMACS):
- Computed using ptflops library
- Results show linear relationship between model size and GMACS:
- 20M: 5.14 GMACS
- 15M: 4.03 GMACS
- 9M: 2.39 GMACS
- 3M: 0.79 GMACS
RTF Testing Methodology
- PyTorch models converted to ONNX format
- ONNX runtime with XNNPACK execution provider
- Frame-by-frame processing (80 ms frames)
- Test duration: 2 minutes (1500 inferences per session)
- 5 repetitions per experiment
- Single-threaded execution
- RTF calculation: max(inference time / frame length) across all frames
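The RTF definition in the last bullet takes the worst frame rather than the average. A minimal sketch is shown below. The timing values are hypothetical and only illustrate the calculation; they are not measurements from the contribution.

```python
FRAME_MS = 80.0  # frame length used in the experiments


def max_rtf(inference_ms, frame_ms: float = FRAME_MS) -> float:
    """Worst-case real-time factor over all processed frames."""
    return max(t / frame_ms for t in inference_ms)


def frames_per_session(duration_s: float, frame_ms: float = FRAME_MS) -> int:
    """Number of frame-by-frame inferences in a test session."""
    return int(duration_s * 1000 / frame_ms)


timings_ms = [22.4, 31.0, 50.4, 27.8]  # hypothetical per-frame inference times
print(f"max RTF: {max_rtf(timings_ms):.2f}")
print(f"inferences in 2 minutes: {frames_per_session(120)}")
```

Using the maximum means a single slow frame, for example during a core or frequency switch, determines the reported RTF.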
Experimental Results
Test Devices
Device 1 (2023):
- Hexa-core CPU: 2×3.46 GHz (P core) + 4×2.02 GHz (E core)
- Dynamic core switching observed between P and E cores
Device 2 (2022):
- Octa-core CPU: 1×3.00 GHz Cortex-X2 + 3×2.50 GHz Cortex-A710 + 4×1.80 GHz Cortex-A510
- Processing on Cortex-X2 with frequency switching between 2.4 GHz and 1.8 GHz
RTF Performance Results
| Model Size | Max RTF (High Performance) | Max RTF (Power Efficient) |
|------------|---------------------------|---------------------------|
| 20M | 0.39-0.63 | 0.81-0.9 |
| 15M | 0.29-0.43 | 0.66-0.74 |
| 9M | 0.19-0.29 | 0.44-0.57 |
| 3M | 0.09-0.13 | 0.18-0.31 |
Results demonstrate linear increase in RTF with model size across both performance modes.
Key Observations
- All tested models achieve RTF < 1.0, indicating real-time capability
- Significant RTF variation between high-performance and power-efficient modes
- Dynamic CPU core/frequency switching impacts performance
- 20M model shows max RTF=0.63 (high performance) and RTF=0.9 (power efficient)
- Smaller models (3M-9M) provide substantial RTF headroom for real-time operation
Proposed Text for TR 26.940
The contribution provides a comprehensive pCR adding new clause 6.2.1.7 "RTF and MACS analysis for AI based codecs" with detailed experimental results. Key additions to TR 26.940 include:
Complexity Considerations (Clause 6.2.1)
- Real-time processing requirements for voice communication
- Model size considerations (5-10M parameters for efficient operation)
- Memory access and power consumption challenges with larger models
Complexity Metrics (Clause 6.2.1.4)
- Discussion of NPU/TPU capabilities measured in TOPS
- TOPS/W as power efficiency metric (2-15 TOPS/W range for smartphones)
- MAC operations and MACS as practical complexity metrics
- RTF as reliable complexity assessment metric
- Comparison with traditional WMOPS metric
Target Devices (Clause 6.2.1.5)
- NPUs present in most modern smartphones
- Theoretical max TOPS: 8-59 TOPS (varying precision)
- TOPS/W range: 2-15 TOPS/W
- DAC codec estimate: ~150 Giga MAC/sec (~0.3 TOPS)
- Note on DRAM operations significantly impacting power consumption
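The DAC estimate above equates ~150 Giga MAC/sec with ~0.3 TOPS. A minimal sketch of that conversion follows, assuming each MAC is counted as two operations (one multiply and one add); the function name `gmacs_to_tops` is illustrative and not taken from the pCR.

```python
OPS_PER_MAC = 2  # assumption: one multiply plus one add per MAC


def gmacs_to_tops(gmacs_per_s, ops_per_mac=OPS_PER_MAC):
    """Convert a rate in Giga MAC/s to Tera operations per second."""
    giga_ops_per_s = gmacs_per_s * ops_per_mac
    return giga_ops_per_s / 1000  # Giga -> Tera


print(f"{gmacs_to_tops(150):.2f} TOPS")
```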
Key Conclusions (Clause 6.2.1.6)
- ML codecs require careful model size and complexity optimization
- NPUs offer 5-20× power efficiency vs CPUs for AI tasks
- ULBC complexity constraints should not reference existing 3GPP speech codecs
- Million MACS + model size provide first-order complexity indication
- RTF useful but requires standardized test platforms
- WMOPS not directly suitable for NPU-based AI solutions
Experimental Data (Clause 6.2.1.7)
- Complete documentation of DAC-like architecture experiments
- Detailed RTF and GMACS results for 3M-20M parameter models
- Device specifications and performance characteristics
Proposal
Document the experimental methodology, results, and observations in clause 6.2.1 of TR 26.940 as shown in the provided pCR.
|
Proposals
Proposal
The source proposes to document clause 2 and clause 3 in clause 6.2.1 of 3GPP TR 26.940 as shown in the pCR below.
|
|
|
(pdf)
|
[FS_ULBC] Discussion on Methodology for Delay & Error Trace Generation |
vivo Mobile Communication Co., |
Discussion on Methodology for Delay & Error Trace Generation for FS_ULBC
Introduction
This contribution addresses the ongoing debate within SA4 Audio SWG regarding the methodology for generating delay and error traces for Ultra Low Bitrate Codec (ULBC) evaluation under Non-Terrestrial Network (NTN) conditions. Two competing approaches have emerged:
- Fixed BLER / Target Error Rate: Prioritizes "realistic" channel behavior by fixing a target BLER (e.g., 2% or 10%) and finding feasible Transport Block Sizes (TBS)
- Fixed Resource / Link Budget: Prioritizes "fair resource usage" by fixing the SNR/Link Budget and allowing codec/modem to trade off bitrate against error robustness (Best Effort)
The contribution proposes clarifying the purpose of these simulations by distinguishing between Design and Verification phases.
Analysis of Current Approaches
2.1 The Precedent: LTE Simulation Methodology (TS 26.132)
Legacy Mechanism (Trace Generation)
The LTE MTSI testing methodology in TS 26.132 (Annex E and F) operated on "Stationary" conditions:
- Input: BLER_tx (e.g., 10%) was a fixed input parameter
- Process: The model assumed the network had converged to this average BLER, with random errors drawn using if (rand(1) < BLER_tx) logic
- Output: Traces reflecting packet losses and delays based on re-transmissions
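The stationary rand(1) < BLER_tx logic described above can be sketched in a few lines. This is an illustration of the i.i.d. error model only, not the TS 26.132 reference script: the function name, the fixed seed, and the packet count are assumptions, and re-transmissions and delays are left out.

```python
import random


def iid_loss_trace(num_packets, bler_tx, seed=0):
    """Return a list of booleans, True where a packet is marked as lost.

    Each packet is lost independently with probability bler_tx, which
    mirrors the `rand(1) < BLER_tx` test of the stationary model.
    """
    rng = random.Random(seed)  # fixed seed so the trace is reproducible
    return [rng.random() < bler_tx for _ in range(num_packets)]


trace = iid_loss_trace(num_packets=10_000, bler_tx=0.10)
print(f"observed loss rate: {sum(trace) / len(trace):.3f}")
```

Because every draw is independent, such a trace contains no deliberate loss bursts, which is the limitation the NTN discussion below turns on.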
Usage for Verification (Annex E & F)
Critically, TS 26.132 defined these traces as verification tools (System Testing):
- UE Delay Verification (Annex E): Generated profiles verify UE can maintain synchronization and meet delay budgets under specific error conditions
- JBM and PLC Evaluation (Annex F): Profiles constructed with deliberate impairments to verify robustness:
- Jitter Bursts
- Packet Inversions
- Packet Duplication
Key Finding: Profiles were treated as Test Vectors to verify robustness against defined impairments, not as "realistic channel recordings" to train codec design.
The Shift for NTN
NTN scenarios introduce challenges that invalidate the LTE approach:
- Cannot rely on simplistic i.i.d. (independent and identically distributed) random error models
- NTN channel impairments (shadowing, scintillation) introduce complex, non-stationary error patterns
- ULBC robustness directly influences tolerance levels, making fixed BLER targets inappropriate
- Must pivot from 'Assumed BLER' model toward 'Derived Performance' model
2.2 Analysis of Current Approaches for FS_ULBC
Approach A: The "Realism" Perspective (Fixed BLER)
Methodology:
- Define TBSs for each candidate bitrate and bundling time
- Traverse all link parameters (SCS, Tone, etc.) to evaluate if resulting link budgets satisfy predefined Target BLER
- Generate error trace for each configuration meeting BLER threshold
- Number of output traces = Number of defined Target BLERs (for each TBS)
Underlying Assumption: AI-based Codecs (specifically PLC mechanisms) require specific "real" error patterns during training/design phase
Observation: Limits testing scope to specific "safe" operating points, potentially overlooking codec behavior under unexpected channel degradation
Approach B: The "Resource" Perspective (Fixed SNR)
Methodology:
- Normalize TBS across all candidate codec bitrates assuming consistent packet overhead
- For each unique Link Budget (fixed SNR) derived from specific UE, satellite, and link parameters, generate dedicated error traces
- Number of output traces = Number of unique Link Budgets (for each TBS)
Underlying Assumption: Mimics "Best Effort" or competitive scenario similar to EVS selection, where end-to-end quality (MOS) matters more than intermediate BLER
Observation: Logically sound for optimizing system performance, but implies vast search space potentially leading to unmanageable simulation workload
2.3 The Core Issue: Verification vs. Design
The Logic Chain
The standard workflow should be:
Delay/Error Profiles Generation → Codec/PLC Verification → System Performance Evaluation
Misalignment
The current deadlock stems from treating RAN simulation outputs as Design Constraints (training data) rather than Verification Tools.
Key Principles:
- Robustness over Overfitting: Robust Codec and PLC design should not rely on "learning" a specific channel trace from a specific simulator. Design should handle variety of harsh conditions (burst losses, high jitter, varying BLER). Data augmentation is standard practice for training robust AI models.
- The Role of Traces: As in TS 26.132 Annex F, generated traces serve as "Test Vectors" defining challenging conditions under which the Codec must survive. Whether traces represent 90% or 99% of real-world cases is secondary to sufficiently stress-testing JBM and PLC algorithms.
- Historical Practice: Delay/Error profiles officially generated by SA4 were never distributed to codec proponents for training purposes; they were solely used to verify codec candidates fulfill design constraints and performance requirements.
2.4 Proposal for the Way Forward
Re-orient simulation efforts towards generating a Verification Suite rather than a "Perfect Reality Model":
- Avoid Excessive "Realism" Filtering: Do not discard simulation results simply because they don't meet strict low-BLER threshold. High BLER conditions are valid "Corner Cases" that ULBC must handle, especially in satellite scenarios with tight link budgets.
- Limit the Search Space: Select representative subset of challenging conditions (e.g., Deep Fading, High Doppler) at fixed SNR points resulting in range of BLERs (e.g., from <1% up to >10%).
- Verification Focus: Output traces should verify candidate codecs degrade gracefully under varied conditions. Burden is on Codec proponent to design PLC that works across these profiles, not on RAN simulation group to provide "training set" guaranteeing codec works.
Proposal: Multi-point Fine-grained Trace Generation (MFTG)
The MFTG methodology aims to decouple physical layer simulation assumptions from application-layer codec design by providing a high-resolution library of error traces rather than a single static operating point.
Step 1: Resource Baseline Normalization (TBS Definition)
- Define set of Reference Transport Block Sizes (TBS) based on unified packet overhead
- Keep TBS values consistent across all candidate codec bitrates to ensure fair comparison of resource efficiency
Step 2: Link Budget Mapping and Granularity Setup
- Identify target range of Link Budgets (SNR/CNR) based on realistic NTN deployment scenarios (e.g., LEO/GEO, UE power classes)
- Establish fine-grained sampling interval (e.g., 1% BLER to 10% BLER in steps of 1% or 2% from BLER perspective, or -5dB to 10dB in steps of 1dB from SNR perspective)
Step 3: Large-scale Link-Level Simulation (LLS)
- Execute Monte Carlo simulations for each defined TBS at every fine-grained sampling interval
Step 4: Flexible Trace Selection for Verification
- For Performance Comparison: Proponents selecting specific source bitrate can identify and utilize trace from library whose SNR/BLER most closely matches their design's intended link budget
- For Robustness Testing: Proponents can select "stress-test" traces (e.g., those with higher BLER or specific jitter profiles) from same library to verify PLC and JBM algorithms
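Steps 2 and 4 can be sketched as a sampling grid plus a nearest-match lookup. This is only an illustration under stated assumptions: the dictionary layout, the trace names, and the two TBS values are placeholders, and the grid uses the example SNR range of -5 dB to 10 dB in 1 dB steps quoted in Step 2.

```python
def snr_grid(start_db=-5, stop_db=10, step_db=1):
    """SNR sampling points in dB, inclusive of both end points."""
    return list(range(start_db, stop_db + 1, step_db))


def select_trace(library, target_snr_db):
    """Pick the library entry whose SNR is closest to the target SNR."""
    return min(library, key=lambda entry: abs(entry["snr_db"] - target_snr_db))


# Hypothetical library: one placeholder trace name per (TBS, SNR) point.
library = [
    {"tbs": tbs, "snr_db": snr, "trace": f"trace_tbs{tbs}_snr{snr}dB"}
    for tbs in (144, 424)  # placeholder TBS values
    for snr in snr_grid()
]

# A proponent first fixes the TBS, then looks up the closest link budget.
candidates = [entry for entry in library if entry["tbs"] == 424]
best = select_trace(candidates, target_snr_db=-3.3)
print(best["trace"])
```

The same lookup works for robustness testing by passing a deliberately low target SNR, so both uses in Step 4 draw on one library.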
Conclusion
While the source understands the rationale behind both the Fixed BLER approach and the Fixed Resource / Link Budget approach for GEO network simulation, a compromise solution is necessary for FS_ULBC to progress. MFTG is therefore proposed for consideration and agreement.
|
Extracted Proposals
Proposal: Multi-point Fine-grained Trace Generation
The MFTG methodology aims to decouple physical layer simulation assumptions from application-layer codec design. By providing a high-resolution library of error traces rather than a single static operating point, it enables a fair and flexible evaluation of various codec strategies (e.g., different bitrate/BLER trade-offs) while bypassing the current standardization deadlock.
Step 1: Resource Baseline Normalization (TBS Definition)
- Define a set of Reference Transport Block Sizes (TBS) based on a unified packet overhead.
- These TBS values must be kept consistent across all candidate codec bitrates to ensure a fair comparison of resource efficiency.
Step 2: Link Budget Mapping and Granularity Setup
- Identify the target range of Link Budgets (SNR/CNR) based on realistic NTN deployment scenarios (e.g., LEO/GEO, UE power classes).
- Establish a fine-grained sampling interval (e.g., 1% BLER to 10% BLER in steps of 1% or 2% from the BLER perspective, or -5dB to 10dB in steps of 1dB from the SNR perspective) along the SNR-BLER curve to ensure high resolution for subsequent selection.
Step 3: Large-scale Link-Level Simulation (LLS)
- Execute Monte Carlo simulations for each defined TBS at every fine-grained sampling interval.
Step 4: Flexible Trace Selection for Verification
- For Performance Comparison: Proponents selecting a specific source bitrate can identify and utilize the trace from the library whose SNR/BLER most closely matches their design's intended link budget.
- For Robustness Testing: Proponents can select "stress-test" traces (e.g., those with higher BLER or specific jitter profiles) from the same library to verify PLC and JBM algorithms.
|
|
|
(pdf)
|
[FS_ULBC] Proposed ULBC design constraints living document |
vivo Mobile Communication Co., |
Candidate Convenor for 3GPP Systems Aspects TSG - ULBC Design Constraints Living Document
1. Scope
This living document consolidates design constraints being considered within SA4 for FS_ULBC (Feasibility Study on Ultra-Low Bitrate Codec). Due to the working procedure requiring consensus agreements for design constraints to be integrated into ULBC PD or TR 26.940, and the lack of such consensus so far, this document captures the current status of design constraints even though some items are not fully agreed.
2. ULBC Design Constraints
2.1 Sampling Frequency and Audio Bandwidth
Design Constraint: Support of [8, 16, 32] kHz / [NB, WB, SWB] required [1], [2]
Editor's Notes and Open Issues:
- Support of 8 kHz justified for interoperability; clarification needed on whether NB would be tested/supported "externally" based on external resampling
- Support of 48 kHz may be considered at higher bitrate operation
- Consideration of at least a single model (e.g., SWB)
- Many neural codecs operate at 24 kHz; this specific sampling rate should be discussed
- Complexity considerations associated with this parameter; joint decisions may be needed
Reference: NB audio typically sampled at 8 kHz (100-3500 Hz), WB at 16 kHz (50-7000 Hz), SWB at 32 kHz (50-14000 Hz), FB up to 20000 Hz
2.2 Number of Audio Channels
Design Constraint: ULBC candidate codecs shall support mono coding with one channel input and one channel output
2.3 Bit Rates
Design Constraint: ULBC candidate codecs shall operate at bitrates lower than [3.00] kb/s [3]
2.4 Frame Length
Design Constraint: Candidate codecs shall operate with a coding frame size that is a multiple of 20 ms
Note: Since bundling periods longer than 20 ms will be used, codec proponents should be allowed to consider solutions with frame sizes larger than 20 ms
2.5 Algorithmic Delay
Design Constraint: Algorithmic delay shall be less than [coding frame size + x] ms
2.6 Complexity
Design Constraint: Complexity limits applied according to categories. Computational complexity and program ROM (PROM) of candidate codecs for each category shall be measured with ITU-T STL2009 [4] as observed worst-case encoder + observed worst-case decoder complexity within the same category [5], [6]
Categories:
- Computational: wMOPS: Less than [x] wMOPS
- Memory: RAM, ROM, Program ROM (values TBD)
Editor's Notes:
- Model size per operation mode is less than [5-10] million parameters
- Total number of parameters is less than [Z] million
- ULBC Codec should be implementable on mobile device using today's technology
- Increased computational complexity and memory usage should be commensurate with gain in quality of user experience (e.g., higher audio bandwidth such as SWB or stereo) or increased efficiency (e.g., lower bit rate for same quality compared to reference codec)
2.7 Potential Use of Noise Suppression as Part of the Codec
Design Constraint: If noise suppression is supported inside ULBC, there should be a mechanism to disable noise suppression in the codec [7], [8]
Editor's Notes - Clarifications Needed:
- Need to support noise suppression in ULBC? (typically vendor specific, defined outside the codec)
- Impacts on test methodology, DTX operation/performance
Motivations:
- Disabling noise suppression required to test feature separately
- Avoid tandeming in real operation
- IMS voice communication defined in TS 22.228; GEO satellite access has no specific requirement on noise handling
2.8 Jitter Buffer Management (JBM)
Design Constraint: A JBM solution conforming to requirements in TS 26.114, except for the functional requirement in sub-clause 8.2.2 of TS 26.114: "Speech JBM used in MTSI shall support all the codecs as defined in clause 5.2.1", shall be provided with candidate codecs
2.9 Rate Switching
Design Constraint: Candidate codecs shall perform rate switching upon command to the encoder throughout the entire bit rate range at arbitrary frame boundaries. Rate switching may imply switching between different bandwidths
Note: Due to the Bundling period and associated TBS, switching might have to happen at the boundary of bundling period
2.10 Packet Loss Concealment (PLC)
Design Constraint: A PLC solution shall be provided by ULBC candidate codecs [9]
Editor's Notes:
- Typical loss profiles/characteristics to be clarified
- Support of redundancy to be clarified
- Need to be able to handle BLER up to [x%]
2.11 RTP Payload Format
Design Constraint: Candidate codecs shall provide an RTP payload format specification supporting the full set of features and functionality of the ULBC candidate codecs
2.12 DTX
Design Constraint: Candidate codecs shall provide a complete VAD/DTX/CNG framework. It shall be possible to operate the codec with DTX on or DTX off
Editor's Note: Typical radio characteristics and optimizations (SPS, DRX, bitrate) to be clarified
2.13 Output Gain Limitation
Design Constraint: ULBC candidate codecs shall not amplify the output signal relative to the input signal beyond limits
Editor's Note: Similar limits and methodology to measure the amplification are described in the EVS-7a,b processing plan permanent document
3. References
[1] S4-251794 - Discussion on Audio Bandwidth for ULBC (vivo, Samsung, MediaTek Inc., Bytedance, Nokia, Xiaomi, Spreadtrum)
[2] S4-251808 - Pseudo-CR on Design Constraints of ULBC: Audio bandwidth (Fraunhofer IIS)
[3] S4-251792 - On codec bitrate and capacity discussion for ULBC (vivo, Samsung, Spreadtrum, MediaTek Inc.)
[4] ITU-T G.191 - Software tools for speech and audio coding standardization (March 2010)
[5] S4-251747 - On complexity constraints for ULBC (Huawei Technologies Co., Ltd.)
[6] S4-251807 - On complexity design constraints for ULBC (Fraunhofer IIS)
[7] S4-251395 - Pseudo-CR on Design Constraints of ULBC: Noise suppression (Fraunhofer IIS)
[8] S4-251748 - On noise suppression for ULBC (Huawei Technologies Co., Ltd.)
[9] S4aA250268 - Packet Loss Concealment with existing AI based codec DAC (Dolby Laboratories Inc., Ericsson LM, Nokia, Novamint)
Note: Items in light blue are candidates for agreement at SA4#135.
|
This document does not contain any proposals in the standard 3GPP format. The document is a "living document" that attempts to capture and consolidate design constraints for FS_ULBC, but it does not include any sections explicitly marked as "Proposal", "Proposal X:", "Proposal X.", etc.
The document contains design constraints presented in a table format and various editor's notes, but these are not formatted as formal proposals.
|
|
|
(pdf)
|
[FS_ULBC] Alignment Analysis on Complexity of DAC model |
vivo Mobile Communication Co., |
Alignment Analysis on Complexity of DAC Model
1. Introduction
This contribution addresses a significant discrepancy in complexity reporting for AI-based codecs in the ULBC study. Two contributions (S4-260165 from Dolby et al. and S4-260155 from vivo et al.) both reported models with approximately 3M parameters but showed substantially different complexity metrics:
- S4-260165: ~3M parameter model (32 kHz) requires 0.79 GMACS
- S4-260155: ~3M parameter model (32 kHz) requires approximately 1.41 GMACS (derived from 2821 MFlops/s)
Notably, the S4-260165 model's complexity (0.79 GMACS) aligns more closely with the S4-260155 model operating at 16 kHz (~0.70 GMACS), despite the difference in sampling rate.
The contribution demonstrates that Model Size (parameter count) is an insufficient metric for constraining complexity across different neural architectures, and proposes GMACS as a robust, architecture-agnostic metric that provides linear correlation with RTF.
2. Architectural Analysis and Discrepancy Resolution
2.1 The "Model Size" Trap
A detailed breakdown comparison was performed between the two architectures to understand why models with similar parameter counts exhibit different computational footprints:
| Metric | [2] (16k, ~3M) | [1] (32k, ~3M) |
|--------|----------------|----------------|
| Input Rate | 16,000 Hz | 32,000 Hz |
| Total Stride | 320 (2×4×5×8) | 1280 (4×4×8×10) |
| Latent Rate | 50.0 Hz | 25.0 Hz |
| Encoder MFlops/s | 436.30 | 461.92 |
| Quantizer MFlops/s | 2.25 | 0.50 |
| Decoder MFlops/s | 984.50 | 1037.12 |
| Total MFlops/s | 1423.05 | 1499.54 |
Key Analysis:
- The S4-260165 (32k, ~3M) model runs at 2× higher input rate (32k vs 16k), increasing encoder computational cost
- The S4-260165 model uses 4× higher stride (1280 vs 320), reducing the latent rate to 25Hz (compared to standard 50Hz)
- The reduced latent rate significantly lowers decoder cost (fewer frames to upsample)
- Higher input cost balances with lower decoder/latent cost, resulting in comparable total MFlops/s
Conclusion: Two models with identical parameter counts can have vastly different runtimes depending on parameter location (shallow vs. deep layers) and stride configuration.
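The stride and latent-rate rows of the table follow directly from the per-layer strides. A minimal sketch, with the helper name `latent_rate_hz` chosen here for illustration:

```python
from math import prod


def latent_rate_hz(sample_rate_hz, strides):
    """Return (total stride, latent frame rate in Hz).

    The total stride is the product of the per-layer strides, and the
    latent rate is the input sample rate divided by that product.
    """
    total_stride = prod(strides)
    return total_stride, sample_rate_hz / total_stride


# Stride factors as listed in the comparison table.
print(latent_rate_hz(16_000, (2, 4, 5, 8)))
print(latent_rate_hz(32_000, (4, 4, 8, 10)))
```

Doubling the sample rate while quadrupling the total stride halves the latent rate, which is the mechanism behind the lower decoder cost noted in the analysis.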
2.2 Verification of Complexity Metrics
Theoretical complexity (GMACS) was recalculated to validate the analysis:
- Using the standard conversion (one MAC counted as two floating-point operations): GMACS ≈ MFlops/s / 1000 × 0.5
- The S4-260165 (32k, ~3M) model at 32 kHz yields ~1,499.5 MFlops/s
- Calculated GMACS: 1499.5 / 1000 × 0.5 ≈ 0.75 GMACS
- This aligns closely with the reference value of 0.79 GMACS reported in S4-260165
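The conversion used in this verification can be written as a one-line helper. The sketch below applies it to the MFlops/s figures quoted in this summary; the function name is illustrative.

```python
def mflops_to_gmacs(mflops_per_s):
    """Convert MFlops/s to GMACS, counting one MAC as two flops."""
    return mflops_per_s / 1000 * 0.5


# MFlops/s figures quoted in this summary.
for label, mflops in [
    ("32 kHz model, recalculated", 1499.54),
    ("16 kHz model, recalculated", 1423.05),
    ("32 kHz model, S4-260155", 2821.0),
]:
    print(f"{label}: {mflops_to_gmacs(mflops):.2f} GMACS")
```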
3. GMACS as the Metric
When RTF data from S4-260155 is plotted against GMACS (rather than Model Size), the data aligns consistently across architectures.
Key Findings:
- RTF scales linearly with GMACS across different CPU tiers (Efficiency, Performance, Prime cores)
- A specific GMACS budget (e.g., 2.0 GMACS) yields predictable RTF on a target CPU core and frequency, regardless of architectural choices (high-sample-rate input vs. large parameter count in decoder)
- This metric decouples complexity constraint from specific architectural choices (stride, latent rate), allowing codec designers flexibility in optimization
- High-complexity validation: S4-260155's 20M model (~5.14 GMACS) demonstrates RTF of 0.9 in power-efficient execution mode on high-end 2023 device, aligning with mid-range Prime Core (3.0 GHz) trend where ~5.3 GMACS corresponds to RTF ≈ 1.0
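If the linear trend is read as a straight line through the origin, a GMACS budget maps to an RTF estimate by simple scaling. The sketch below makes that assumption explicit; it is not the fit used in the contribution, and the only anchor point taken from the text is ~5.3 GMACS at RTF ≈ 1.0 on the mid-range Prime core (3.0 GHz). Other cores would need their own anchor value.

```python
GMACS_AT_RTF_1 = 5.3  # quoted anchor: mid-range Prime core at 3.0 GHz


def predicted_rtf(gmacs, gmacs_at_rtf_1=GMACS_AT_RTF_1):
    """Estimate RTF assuming it is proportional to GMACS.

    Assumption: the linear trend passes through the origin, so a single
    anchor point fixes the slope.
    """
    if gmacs < 0 or gmacs_at_rtf_1 <= 0:
        raise ValueError("gmacs must be >= 0 and the anchor must be > 0")
    return gmacs / gmacs_at_rtf_1


def is_real_time(gmacs, rtf_limit=1.0):
    """True when the estimated RTF stays below the given limit."""
    return predicted_rtf(gmacs) < rtf_limit


for budget in (0.79, 2.0, 5.14):
    print(f"{budget:.2f} GMACS -> estimated RTF {predicted_rtf(budget):.2f}")
```

Such an estimate is only a first-order check; measured RTF also depends on core and frequency switching, as the experimental results earlier show.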
4. Conclusion
By adopting GMACS as the primary complexity metric, the apparent discrepancies between different contribution data are resolved. This enables a unified set of requirements that accurately reflects real-time capability of mobile devices.
5. Proposal
Propose to include this analysis in 3GPP TR 26.940, specifically capturing:
- Model Size is not a consistent proxy for complexity across varying architectures (e.g., high-stride vs. low-stride configurations)
- GMACS/GFLOPs demonstrates strong linear correlation with real-time performance on mobile devices
- This analysis provides a solid basis for defining complexity constraints for ULBC candidates
References
[1] S4-260165, "[FS_ULBC] On ULBC complexity and RTF analysis"
[2] S4-260155, "[FS_ULBC] Analysis of AI Codec Real-Time Performance (RTF) and Complexity Scaling"
|
Proposal
We propose to include the analysis presented in this contribution into 3GPP TR 26.940. Specifically, the text should capture the findings that Model Size is not a consistent proxy for complexity across varying architectures (e.g., high-stride vs. low-stride), and that GMACS/GFLOPs demonstrates a strong linear correlation with real-time performance on mobile devices. Documenting this analysis will provide a solid basis for defining the complexity constraints for the ULBC candidate.
|
|
|
(pdf)
|
[FS_ULBC] Feasible bitrates for the NTN-TDL-C channel model with 10-degree elevation angle |
Qualcomm Incorporated, Xiaomi |
Summary of S4-260214: Feasible Bitrates for NTN-TDL-C Channel Model with 10-Degree Elevation Angle
Background and Motivation
This contribution addresses the determination of feasible Transport Block Sizes (TBS) for the newly agreed NTN-TDL-C channel model at 10-degree elevation angle, which was adopted at SA4 #134 (documented in S4-252108). Key observations include:
- Two channel models now exist: The original channel model from TR 38.811 Table 6.9.2-3 and the new NTN-TDL-C model for 10-degree elevation
- Channel model validation: The new channel model shows better correlation with field data from NB-IoT GSO service with handheld devices
- Field data shows 1-3 dB gap between 1st and 50th percentile SNR
- New channel model: ~1 dB gap (consistent with field data)
- Initial channel model: ~6 dB gap (less consistent)
- TBS table update requirement: The TBS values in permanent document tables (5.2.2.1-1, 5.2.2.1-2, 5.2.2.1-3) should reflect the union of supported TBS values for both channel models
Simulation Methodology
The contribution evaluates maximum feasible bitrates under worst-case conditions without DMRS bundling, considering two scenarios:
Scenario 1: Ideal Timing
- 80ms bundling period, 2 UE RX, 15kHz SCS, 5Hz Doppler spread
- Without DMRS bundling
- No uncertainty in scheduling timing
Scenario 2: Timing Uncertainty
- 80ms bundling period, 2 UE RX, 15kHz SCS, 5Hz Doppler spread
- Without DMRS bundling
- 10ms uncertainty in scheduling timing (relevant for large cells without UE-specific Koffset or TA report)
Both scenarios target 2% BLER for uplink and downlink.
Simulation Results
Scenario 1 Results (No Timing Uncertainty)
- Maximum TBS: 936 bits
- Uplink: 15kHz SCS, 3 tones, 48ms (N_RU=6, N_rep=2)
- BLER: 1.5% at 31dBm UE TX power
- Downlink: 24ms (N_SF=6, N_rep=4)
- BLER: 1.1% at -3.3dB SNR
Scenario 2 Results (10ms Timing Uncertainty)
- Maximum TBS: 680 bits
- Uplink: 15kHz SCS, 3 tones, 40ms (N_RU=5, N_rep=2)
- BLER: 0.2% at 31dBm UE TX power
- Downlink: 20ms (N_SF=5, N_rep=4)
- BLER: 0.5% at -3.3dB SNR
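The uplink and downlink durations reported above are consistent with duration = resource-unit length × N_RU × N_rep for NPUSCH and N_SF × N_rep subframes of 1 ms for NPDSCH. The sketch below checks that arithmetic. The resource-unit lengths in the lookup table are an assumption based on the NB-IoT NPUSCH format 1 definition in TS 36.211, not values stated in the contribution, and only the 3-tone entry is exercised by the figures above.

```python
# Assumed resource-unit duration in ms per tone count, 15 kHz SCS,
# NPUSCH format 1 (from TS 36.211, not from the contribution).
RU_DURATION_MS = {1: 8, 3: 4, 6: 2, 12: 1}

SUBFRAME_MS = 1  # LTE/NB-IoT subframe length


def npusch_duration_ms(n_tones, n_ru, n_rep):
    """Uplink transmission time for one transport block."""
    return RU_DURATION_MS[n_tones] * n_ru * n_rep


def npdsch_duration_ms(n_sf, n_rep):
    """Downlink transmission time for one transport block."""
    return SUBFRAME_MS * n_sf * n_rep


# Scenario 1 (TBS 936) and Scenario 2 (TBS 680) configurations.
print(npusch_duration_ms(3, n_ru=6, n_rep=2), npdsch_duration_ms(6, 4))
print(npusch_duration_ms(3, n_ru=5, n_rep=2), npdsch_duration_ms(5, 4))
```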
Proposed Changes to Permanent Document
TBS Table Updates
The contribution proposes adding new TBS values to support the higher bitrates enabled by the new channel model:
For 80ms Bundling Period (Table 5.2.2.1-1)
- Add TBS 936 bits (PHY bitrate: 11.7 kbps, Net bitrate: 11.0 kbps)
- Add TBS 680 bits (PHY bitrate: 8.5 kbps, Net bitrate: 7.8 kbps)
- Add intermediate value between 680 and current maximum (424)
For 160ms Bundling Period (Table 5.2.2.1-2)
- Add corresponding new TBS values scaled appropriately
For 320ms Bundling Period (Table 5.2.2.1-3)
- Add corresponding new TBS values scaled appropriately
Terminology Change
- Change "codec bitrate" to "net bitrate" to clarify that this represents the bitrate available to the codec (after accounting for packet headers), not a required codec operating bitrate
Updated Tables
The proposed tables include:
- Packet header assumption: 7 bytes (with note that final size depends on SA2/RAN conclusions on 1-byte MAC header feasibility)
- Header counting: Packet header counted only once per bundling period, regardless of number of voice frames bundled
- TBS values: Selected from TS 36.213 Table 16.5.1.2-2 for NB-IoT NPUSCH
- Net bitrate calculation: PHY bitrate minus overhead from packet headers
The complete updated tables show TBS values ranging from 144 to 936 bits for 80ms bundling, with corresponding PHY and net bitrates calculated for each bundling period configuration.
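The PHY and net bitrates quoted for the new TBS values can be reproduced from the stated assumptions: a 7-byte packet header counted once per bundling period. A minimal sketch, with illustrative function names:

```python
HEADER_BYTES = 7  # packet header assumption, counted once per bundling period


def phy_bitrate_kbps(tbs_bits, bundling_ms):
    """PHY bitrate: transport block size per bundling period (bits/ms = kbps)."""
    return tbs_bits / bundling_ms


def net_bitrate_kbps(tbs_bits, bundling_ms, header_bytes=HEADER_BYTES):
    """Net bitrate available to the codec after the packet header."""
    payload_bits = tbs_bits - 8 * header_bytes
    return payload_bits / bundling_ms


for tbs in (936, 680):
    phy = phy_bitrate_kbps(tbs, 80)
    net = net_bitrate_kbps(tbs, 80)
    print(f"TBS {tbs}: PHY {phy:.1f} kbps, net {net:.1f} kbps")
```

If the header size changes following the SA2/RAN conclusions, only `HEADER_BYTES` needs updating.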
|
Proposal: The proposed change includes the following:
Add TBS 936 and TBS 680 and an additional value between 680 and the current maximum TBS value (424) to the TBS table for the 80ms bundling period in the permanent document v.0.4.0.
Add new TBS values to the TBS tables for 160ms and 320 ms bundling periods accordingly.
Change the "codec bitrate" to "net bitrate" to indicate that the "net bitrate" is the bitrate available for the codec and does not require a codec to operate at the bitrate.
|
|
|
(pdf)
|
[FS_ULBC] On the scheduling timing uncertainty |
Qualcomm Incorporated, Ericsson LM |
Summary of 3GPP Technical Document: FS_ULBC - Scheduling Timing Uncertainty
1. Background and Motivation
This contribution addresses ambiguities in interpreting RAN1 LS S4-251654 regarding uplink and downlink timing for NB-IoT NTN with GEO satellites. The interpretation of this LS has direct implications on:
- Scheduling timing uncertainty assumptions
- Link capacity calculations
The document proposes clarifications to the Permanent Document (PD) version 0.4.0 to resolve these interpretation issues discussed at SA4 #133-e and subsequent meetings.
2. Main Technical Contributions
2.1 Frame Structure for Dynamic Scheduling
The document maintains the existing frame structure example for Half-duplex FDD with 80ms bundling period:
- NPDSCH duration: 4ms (variable depending on DL SNR)
- Multiple UL frequency allocation options: 1, 3, 6, and 12 tones with 15 kHz per tone
- Allocation choice depends on UL and DL channel capacity
2.2 Semi-Persistent Scheduling (SPS) Frame Structure
Two SPS approaches are presented:
Approach 1 (Figure 5.2.2.3-2):
- NPDSCH can be positioned anywhere within first 15ms
- Maintains minimum 1ms gap to NPUSCH
Approach 2 (Figure 5.2.2.3-3):
- Based on "Cell_specific_Koffset" approach
- Does not depend on "TA report UE capability"
2.3 Gap Composition Between DL and UL
The gap consists of:
1. Processing time + DL-to-UL switching: Minimum 1ms for half-duplex device switching
2. Max differential delay: Accounts for different round-trip delays of UEs in NTN cell
- Typical range: close to 0 to 10.3ms depending on deployment
2.4 Baseline Assumptions for Codec Simulations
Key Changes Proposed:
For 80ms bundling:
- Original assumption: "Max differential delay" is 10ms AND X + Y ≤ 68ms
- Proposed change: Replace with reference to beam size no larger than 1500km
- Note clarifies this corresponds to scenarios where difference between closest and farthest point to satellite is <1500km
- Explicitly states codec can be deployed in scenarios not meeting these constraints
For 160ms bundling:
- Original assumption: "Max differential delay" is 10ms AND X + Y ≤ 148ms
- Proposed change: Replace with reference to beam size no larger than 1500km
- Same flexibility noted regarding deployment in other scenarios
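The replacement of the 10ms "Max differential delay" by a 1500km reference can be sanity-checked with the speed of light. The sketch below assumes the differential delay is a round-trip quantity (as described in 2.3) over the path-length difference between the closest and farthest points; the function name is illustrative.

```python
SPEED_OF_LIGHT_KM_PER_S = 299_792.458


def max_differential_delay_ms(path_difference_km, round_trip=True):
    """Delay spread in ms caused by a difference in path length.

    With round_trip=True the one-way figure is doubled, matching a
    differential delay defined on round-trip delays.
    """
    one_way_ms = path_difference_km / SPEED_OF_LIGHT_KM_PER_S * 1000
    return 2 * one_way_ms if round_trip else one_way_ms


print(f"{max_differential_delay_ms(1500):.2f} ms")
```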
2.5 Important Notes and Editor's Notes
RAN1 LS Clarification:
- Figure 5.2.2.3-1 supportable in most scenarios
- May not be supportable when:
- Cell is very large (e.g., >3000km)
- UE does not support TA report
- Network does not support UE-specific K-offset
- Requires UE configuration with two HARQ processes and HARQ feedback disabled
SPS Design Status:
- RAN1/2 have not yet started SPS design work
- RAN1 cannot currently confirm whether SPS frame structure examples (Figures 5.2.2.3-2 and associated text) will be supported
Editor's Note:
- Range of "Max differential delay" is TBC (To Be Confirmed)
3. Summary of Changes
The primary technical contribution is replacing specific timing constraint assumptions (X + Y values and max differential delay) with a more practical reference based on beam size ≤ 1500km for codec simulation baseline, while explicitly allowing codec deployment in scenarios exceeding these reference conditions. This provides clearer guidance to SA4 while maintaining flexibility for various deployment scenarios.
|
Extracted Proposals
No proposals explicitly marked as "Proposal" (in any of its variations) were found in this 3GPP document.
The document contains proposed changes to PD (Permanent Document) version 0.4.0, specifically modifications to section 5.2.2.3 on Frame structure, but these are presented as "Proposed change" rather than formatted as numbered or unnumbered proposals in the standard proposal format.
|
|
|
(pdf)
|
[FS_ULBC] On transmission delay for voice over NB-IoT NTN |
Qualcomm Incorporated |
Summary of S4-260216: On Transmission Delay for Voice over NB-IoT NTN
Document Overview
This contribution from Qualcomm addresses gaps in TR 26.940's mouth-to-ear delay calculations for NB-IoT NTN systems, specifically highlighting the omission of NPUSCH/NPDSCH transmission durations and clarifying the distinction between propagation delay and transmission delay.
Main Technical Issues Identified
Problem Statement
- Missing Transmission Duration: TR 26.940 did not account for the duration of NPUSCH transmission or NPDSCH transmission, which can be significant for NB-IoT (e.g., 64ms for NPUSCH)
- Terminology Confusion: TR 26.940 confuses propagation delay with transmission delay, where:
- Transmission delay: The interval from when the first bit leaves a transmitter to when the last bit leaves the transmitter
- Propagation delay: The time for signal propagation through the medium
- Processing delay (up to 3ms) can be ignored in mouth-to-ear calculations
Proposed Technical Changes
5.2.2.4 Propagation Delay Corrections
Key Change: Renamed "Transmission delay" to "Propagation delay" for GEO satellite link
- Maximum propagation delay: 280ms (per KPI requirement in clause 7.4.2 of reference document)
- Minimum propagation delay: 248ms (280ms - 32ms, accounting for UE location within beam)
- Assumes no retransmissions over GEO satellite link
5.2.2.5 Transmission Delay (New Section)
New Addition: Introduces proper definition and consideration of transmission delay
- Defines transmission delay as the interval from first bit to last bit leaving the transmitter
- Highlights significance for NB-IoT NTN (up to 64ms for NPUSCH in uplink)
- Must be accounted for in mouth-to-ear delay calculations
- Transmission delay for transport block size should be based on RAN simulation results
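The distinction between the two delay types can be illustrated with a minimal sketch. The function names are hypothetical, and the transport-block size and link rate are made-up values chosen only so that the result equals the 64ms NPUSCH figure quoted above; real values would come from the RAN simulation results. The 280ms and 248ms figures are KPI-based and are not derived here.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458.0


def propagation_delay_ms(path_length_km: float) -> float:
    """Time for the signal to travel the path; independent of packet size."""
    return path_length_km * 1_000.0 / SPEED_OF_LIGHT_M_PER_S * 1_000.0


def transmission_delay_ms(transport_block_bits: int, link_rate_bps: float) -> float:
    """Interval from the first bit to the last bit leaving the transmitter."""
    return transport_block_bits / link_rate_bps * 1_000.0


# One hop at GEO altitude (35,786 km, satellite directly overhead).
one_hop_ms = propagation_delay_ms(35_786.0)

# Hypothetical transport block and rate (assumption, not from TR 26.940).
npusch_ms = transmission_delay_ms(transport_block_bits=256, link_rate_bps=4_000.0)

print(f"propagation (one hop): {one_hop_ms:.1f} ms, transmission: {npusch_ms:.1f} ms")
```

The two terms add up in the mouth-to-ear budget: the propagation term depends only on geometry, while the transmission term grows with the transport block size and shrinks with the achievable rate.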
5.2.2.5/6 ULBC Delay Components
- Section renumbered from 5.2.2.5 to 5.2.2.6
- References existing algorithmic delays for IMS codecs (AMR and EVS: 5ms to 12ms)
- Notes that ULBC may have different delay values from codec processing and algorithmic delays
- Marked as FFS (Further Study)
5.1.3 Mouth-to-Ear Delay Estimation Updates
Editorial Note Added:
- Numbers in Table 5.1.3-1 will be updated once RAN simulation is completed to account for transmission delays in uplink and downlink
- Current values assume AMR and EVS algorithmic delays
- ULBC delay components still need to be addressed
- Minimum Delay_GSCN assumed as 20ms
Existing Table Structure Maintained:
- Frame sizes: 20ms, 40ms, 80ms, 160ms, 320ms
- Two scenarios: GEO-TN (main) and GEO-GEO (sub-scenario 1)
- Lower and upper bounds for mouth-to-ear delay
- Delay ranges from 428-712ms (20ms frame, GEO-TN) to 984-1455ms (320ms frame, GEO-GEO)
Dependencies and Next Steps
- Awaiting RAN simulation results to determine actual transmission delay values for different transport block sizes
- ULBC-specific delay components require further study
- Terminology alignment needed with clause 4 "Application Scenario"
- Table 5.1.3-1 values pending update based on RAN simulation completion
|
Proposal 1: Consider the 280ms as the max. propagation delay and consequently 248ms (280ms – 32ms) as the minimal propagation delay.
Proposal 2: The transmission delay for a transport block size should be based on the RAN simulation results.
|
|
|
(pdf)
|
[FS_ULBC] Support for Dual-Tone Multi-Frequency for IMS voice over NB-IoT NTN |
Qualcomm Incorporated |
Summary of S4-260217: Support for Dual-Tone Multi-Frequency for IMS Voice over NB-IoT NTN
Background
SA1 has mandated support for Dual-Tone Multi-Frequency (DTMF) for IMS voice over NB-IoT NTN. The document addresses the need to consider multiplexing of DTMF traffic with voice traffic in the system design, referencing RFC 4733 for DTMF payload formats.
DTMF Use in IMS Voice Services and Traffic Characteristics
DTMF Payload Types
RFC 4733 defines two DTMF payload format types:
- Telephone events: User button presses (0-9, *, #) during calls
- Tones: Ringing tone, busy tone, etc.
For IMS calls, tones are generated locally (e.g., "180 Ringing" or "486 Busy Here" SIP messages trigger local tone generation), so only telephone events need to be transported over the air.
Technical Specifications
- RTP payload size: 4 bytes
- Telephone events: Standard DTMF digits and symbols
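For orientation, the 4-byte telephone-event payload defined in RFC 4733 consists of an 8-bit event code, an end (E) bit, a reserved (R) bit, a 6-bit volume, and a 16-bit duration in RTP timestamp units. The sketch below packs and unpacks that layout; the function names and the example volume and duration are illustrative assumptions, not taken from the contribution.

```python
import struct

# DTMF event codes as assigned in RFC 4733: digits 0-9, then *, #, A-D.
DTMF_EVENT_CODES = {
    **{str(d): d for d in range(10)},
    "*": 10, "#": 11, "A": 12, "B": 13, "C": 14, "D": 15,
}


def pack_telephone_event(key: str, end: bool, volume: int, duration: int) -> bytes:
    """Build the 4-byte payload: event | E,R,volume | duration (network order)."""
    if not 0 <= volume <= 63:
        raise ValueError("volume is a 6-bit field (0..63)")
    if not 0 <= duration <= 0xFFFF:
        raise ValueError("duration is a 16-bit field")
    flags_and_volume = (int(end) << 7) | volume  # reserved bit (0x40) stays 0
    return struct.pack("!BBH", DTMF_EVENT_CODES[key], flags_and_volume, duration)


def unpack_telephone_event(payload: bytes) -> dict:
    """Inverse of pack_telephone_event for a 4-byte payload."""
    event, flags_and_volume, duration = struct.unpack("!BBH", payload)
    return {
        "event": event,
        "end": bool(flags_and_volume & 0x80),
        "volume": flags_and_volume & 0x3F,
        "duration": duration,
    }


# Illustrative values only: '#' key, final packet of the event.
payload = pack_telephone_event("#", end=True, volume=10, duration=1280)
print(payload.hex(), len(payload))
```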
Traffic Characteristics
The document identifies key DTMF traffic characteristics:
- DTMF packets are transmitted infrequently (only on button press)
- Telephone events may or may not overlap with active voice activity
- Multiple DTMF packets may be transmitted per button press, with the RTP marker bit indicating the first packet
- RTP packets must differentiate between voice and DTMF packets for multiplexing
Design Assumptions
Three key assumptions are established:
1. DTMF packet size ≤ voice packet size
2. DTMF delay requirements are less stringent than voice service
3. DTMF takes priority over voice
SPS Configuration Considerations
When SPS (Semi-Persistent Scheduling) is configured for voice traffic with fixed TBS:
- If DTMF packets don't overlap with active voice frames, they can be multiplexed with SID frames (smaller than active voice frames) and transmitted in SPS occasions
- If overlapping occurs, the UE can puncture an active voice frame and send the DTMF frame instead
- SA4 needs to coordinate with RAN1 and RAN2 on SPS design
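The two SPS cases above amount to a small per-occasion decision. The sketch below is one possible reading of that logic; the enum, the function name, and the assumption that the decision is taken independently per SPS occasion are hypothetical and not specified in the contribution.

```python
from enum import Enum


class SpsAction(Enum):
    SEND_VOICE_AS_USUAL = "no DTMF pending: send voice or SID as scheduled"
    MUX_DTMF_WITH_SID = "no active voice frame: carry DTMF together with SID"
    PUNCTURE_VOICE_FOR_DTMF = "active voice frame: drop it and send DTMF"


def choose_sps_action(voice_active: bool, dtmf_pending: bool) -> SpsAction:
    """Pick what to transmit in one SPS occasion with a fixed TBS."""
    if not dtmf_pending:
        return SpsAction.SEND_VOICE_AS_USUAL
    if voice_active:
        # DTMF takes priority over voice (design assumption 3).
        return SpsAction.PUNCTURE_VOICE_FOR_DTMF
    # SID frames are smaller than active voice frames, leaving room for DTMF.
    return SpsAction.MUX_DTMF_WITH_SID


for voice_active in (False, True):
    for dtmf_pending in (False, True):
        print(voice_active, dtmf_pending, choose_sps_action(voice_active, dtmf_pending).name)
```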
Proposals
Proposal 1: Make DTMF support an integral part of IMS voice service over NB-IoT NTN
Proposal 2: Design DTMF support based on the three assumptions:
- DTMF packet size ≤ voice packet size
- DTMF delay requirement less stringent than voice
- DTMF priority over voice
Proposal 3: SA4 to design mechanisms for voice and DTMF multiplexing for SPS and coordinate with RAN1 and RAN2
|
Proposal 1: Make DTMF support part of the IMS voice service over NB-IoT NTN.
Proposal 2: Design DTMF support under the following assumptions:
- The DTMF packet size is at most as large as the voice packet size
- Delay requirement of DTMF is not as stringent as voice service
- DTMF takes priority over voice
Proposal 3: SA4 designs the mechanisms needed to support voice and DTMF multiplexing for SPS and works with RAN1 and RAN2.
|
|
|
(pdf)
|
Proposed design constraints for noise suppression, DTX, and non-speech inputs |
Nokia |
Summary of S4-260220: Design Constraints for Noise Suppression, DTX, and Non-Speech Inputs
1. Background and Context
This contribution addresses design constraints for the ULBC (Ultra Low Bitrate Codec) over GEO channel solution, building upon previous discussions from S4-251881 and S4-251786. The document focuses on three key areas:
- Noise suppression handling
- Discontinuous transmission (DTX) framework
- Robustness to non-speech inputs
Emergency Call Use Case
The contribution emphasizes that emergency calls represent a critical use case for ULBC over GEO, particularly when terrestrial network (TN) service coverage is unavailable. Key considerations include:
- Background signals may contain critical contextual information (e.g., voices, environmental sounds indicating danger)
- Post-call analysis requirements (ASR transcripts, emergency response evaluation, criminal investigations)
- Need for full situational awareness rather than aggressive noise suppression
2. Technical Analysis
2.1 Noise Suppression Trade-offs
The document identifies several technical challenges:
- Performance requirements alone may be insufficient: Testing with background signals (even using ITU-T P.800 DCR methodology) may not prevent systems from employing aggressive noise suppression that removes critical background information
- Ultra-low bit rate optimization: At very low bit rates, there exists an unknown trade-off between:
- Applying noise suppression
- Accepting more coding artifacts
- Potentially reduced intelligibility in presence of background signals
- Device-specific processing: Acknowledges that device-specific noise suppression is standard practice and will likely be applied before ULBC encoding
2.2 Updated Approach
The contribution updates the original proposal from S4-251881 by:
- Maintaining the requirement for disableable noise suppression within the codec
- Adding specific SNR ranges for stationary (5-15 dB) and non-stationary (10-25 dB) noise
- Deferring specific noise type definitions for future discussion
- Linking noise suppression behavior primarily to performance requirements
3. Proposed Design Constraints
The document proposes updates to Table 6.2-1 in draft TR 26.940 with three new/modified constraint parameters:
3.1 Noise Suppression Constraint
Requirement: If noise suppression is supported as part of the candidate codec, it must be possible to disable it to preserve background signals.
Editor's Notes:
- EN1: Requirement to disable may be considered in connection with specific operating bit rate(s)
- EN2: Solution behavior w.r.t. potential noise suppression is primarily enforced via performance requirements; default operation for tests is with noise suppression disabled
3.2 DTX Framework Constraint
Requirement: The candidate codec shall provide a framework for:
- Voice Activity Detection (VAD)
- Discontinuous Transmission (DTX)
- Comfort Noise Generation (CNG)
- Operation with DTX on or DTX off
Editor's Note: Operation relating to DTX on and disabling/enabling potential noise suppressor may need clarification
3.3 Robustness to Non-Speech Input
Requirement: The candidate codec shall be robust to:
- Noisy speech with stationary noise (5-15 dB SNR)
- Noisy speech with non-stationary noise (10-25 dB SNR)
- Background signals during and between speech segments
- Other non-speech input signals
Editor's Notes:
- EN1: May need to be in performance requirements
- EN2: Relevant background signals to be further defined as part of performance requirements, including both stationary and non-stationary types
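To make the SNR ranges concrete, the following sketch shows one common way to build a noisy-speech test item at a target SNR by scaling the noise before adding it. It uses plain mean power over the whole signal, which is a simplifying assumption; an actual test plan might instead use an active speech level measurement. The function names and the synthetic signals are illustrative only.

```python
import math


def mean_power(samples):
    """Average power of a sequence of samples."""
    return sum(s * s for s in samples) / len(samples)


def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so speech-to-noise power ratio equals `snr_db`, then add.

    Returns the mixed signal and the gain that was applied to the noise.
    """
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    noise_power = mean_power(noise)
    if noise_power == 0.0:
        raise ValueError("noise signal is silent")
    target_noise_power = mean_power(speech) / (10.0 ** (snr_db / 10.0))
    gain = math.sqrt(target_noise_power / noise_power)
    return [s + gain * n for s, n in zip(speech, noise)], gain


# Synthetic stand-ins for speech and noise (not real test material).
speech = [math.sin(0.1 * i) for i in range(1000)]
noise = [math.cos(0.37 * i) for i in range(1000)]

for snr_db in (5.0, 15.0):  # edges of the stationary-noise range above
    mixed, gain = mix_at_snr(speech, noise, snr_db)
    print(f"target SNR {snr_db:.0f} dB -> noise gain {gain:.3f}")
```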
4. Key Technical Contributions
- Balanced approach to noise suppression: Recognizes both the need for flexibility in noise suppression (for speech quality) and the critical requirement to preserve background signals (for emergency use cases)
- Mandatory DTX framework: Establishes VAD/DTX/CNG as a required feature rather than optional, with explicit on/off control
- Quantified robustness requirements: Provides specific SNR ranges for different noise conditions that the codec must handle
- Testing methodology guidance: Proposes default testing with noise suppression disabled, while allowing performance requirements to govern overall behavior
5. Open Issues
Several editor's notes indicate areas requiring further work:
- Specific operating bit rates where noise suppression disable requirement applies
- Clarification of DTX and noise suppression interaction
- Final placement of robustness requirements (design constraints vs. performance requirements)
- Definition of specific background signal types for testing
- Speech quality requirements (to be addressed separately in performance requirements)
|
Proposals
The source proposes to update Table 6.2-1 in draft TR 26.940 as follows:
[The proposal consists of updating Table 6.2-1 with the following design constraint parameters:]
Parameter: Potential use of noise suppression as part of the codec
Design Constraint: If noise suppression is supported as part of the candidate codec, it can be disabled to preserve background signals
Editor's note 1: Requirement to disable noise suppression may be considered in connection with specific operating bit rate(s)
Editor's note 2: Solution behaviour w.r.t. potential noise suppression is primarily enforced via performance requirements. The default operation for tests is with noise suppression disabled.
Parameter: Discontinuous transmission including voice activity detection and comfort noise
Design Constraint: The candidate codec shall provide a framework for voice activity detection (VAD) and discontinuous transmission (DTX) with comfort noise generation (CNG). It shall be possible to operate the codec with DTX on or DTX off.
Editor's note: Operation relating to DTX on and disabling/enabling potential noise suppressor may need to be clarified
Parameter: Robustness to non-speech input
Design Constraint: The candidate codec shall be robust to noisy speech (stationary noise 5-15 dB, non-stationary noise 10-25 dB), background signals during and between speech segments, and other non-speech input signals
Editor's note 1: May need to be in performance requirements
Editor's note 2: Relevant background signals, etc. to be further defined as part of performance requirements, these include both stationary and non-stationary background signal types
|
|
|
(pdf)
|
UE Antenna Gain in link-budget evaluations |
Ericsson LM |
UE Antenna Gain in Link-Budget Evaluations
Introduction
This contribution addresses the need to establish common assumptions for UE Antenna Gain in link-budget evaluations for FS_ULBC (Ultra Low Bitrate Speech Codec). The document highlights that different assumptions on UE Antenna Gain lead to significantly different conclusions on suitable radio configurations, and proposes alignment with existing 5G NR-NTN assumptions.
Problem Statement
The current FS_ULBC Pdoc references TR 36.763 with UE Antenna Gain assumptions ranging between 0 dBi and -5.5 dBi. Previous SA4 contributions on link level simulations have shown divergent assumptions regarding achievable link level performance, leading to inconsistent conclusions. The lack of a common assumption for UE Antenna Gain (G_Tx) significantly impacts:
- Link-budget results
- Performance references for link-level evaluations
- Overall system design conclusions
Link-Budget Analysis
Comparative Evaluation
The document presents a detailed side-by-side comparison of link-budget calculations for GEO satellite uplink with two different UE Antenna Gain assumptions:
Scenario Parameters (Common):
- Satellite Orbit: GEO
- Link Direction: Uplink
- Device Type: Handheld
- Satellite Elevation Angle: 2.3 degrees
- Satellite Altitude: 35,786 km
- Slant Range: 41,417.91 km
- Carrier Frequency: 2000 MHz
- Free Space Path Loss (FSPL): 190.8 dB
- UE Transmit Power: 23 dBm
- Receive Antenna Gain: 51 dBi
- Satellite G/T: 19 dB/K
- Bandwidth: 3750 Hz
- Various losses (atmospheric, shadow fading, scintillation, polarization, additional): 11.4 dB total
Key Results:
| UE Antenna Gain | Received Power | Noise Power | SNR at Satellite Receiver |
|-----------------|----------------|-------------|---------------------------|
| 0 dBi | -135.58 dBm | -138.23 dBm | 2.66 dB |
| -5.5 dBi | -141 dBm | -138.23 dBm | -2.84 dB |
The difference in UE Antenna Gain assumption results in a 5.5 dB difference in SNR, which is highly significant for link-level performance evaluation and system design.
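The SNR column can be approximated from the common parameters with the usual G/T-based link-budget expression, SNR = EIRP + G/T - k - path loss - other losses - 10·log10(B). This is a sketch under the assumption that the contribution uses this formula; the summary does not show the intermediate steps, so the received-power and noise-power columns are not reproduced here. Function names are hypothetical. The 51 dBi receive antenna gain is not needed in this form because G/T already accounts for it.

```python
import math

# Boltzmann constant in dBW/K/Hz (about -228.6).
BOLTZMANN_DBW_PER_K_HZ = 10.0 * math.log10(1.380649e-23)


def fspl_db(distance_km: float, freq_mhz: float) -> float:
    """Free-space path loss for distance in km and frequency in MHz."""
    return 20.0 * math.log10(distance_km) + 20.0 * math.log10(freq_mhz) + 32.45


def uplink_snr_db(tx_power_dbm, ue_antenna_gain_dbi, g_over_t_db_per_k,
                  path_loss_db, other_losses_db, bandwidth_hz):
    """SNR at the satellite receiver from EIRP, G/T, losses and bandwidth."""
    eirp_dbw = tx_power_dbm - 30.0 + ue_antenna_gain_dbi
    return (eirp_dbw + g_over_t_db_per_k - BOLTZMANN_DBW_PER_K_HZ
            - path_loss_db - other_losses_db - 10.0 * math.log10(bandwidth_hz))


path_loss = fspl_db(distance_km=41_417.91, freq_mhz=2_000.0)

for gain_dbi in (0.0, -5.5):
    snr = uplink_snr_db(tx_power_dbm=23.0, ue_antenna_gain_dbi=gain_dbi,
                        g_over_t_db_per_k=19.0, path_loss_db=path_loss,
                        other_losses_db=11.4, bandwidth_hz=3_750.0)
    print(f"UE antenna gain {gain_dbi:+.1f} dBi -> SNR {snr:.2f} dB")
```

With the parameters listed above, the results land close to the tabulated SNR values, and the two cases differ by exactly the 5.5 dB change in antenna gain because the gain enters the expression linearly in dB.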
Observations
Observation 1: The assumption for UE_Antenna_Gain (G_Tx) critically impacts the resulting SNR at the satellite receiver, which in turn affects conclusions on link-level results. Clarification is needed on whether to use 0 dBi, -5.5 dBi, or both values.
Observation 2: It is unlikely that an NB-IoT device would have superior antenna performance compared to an NR handheld device. Therefore, the UE_Antenna_Gain assumption should align with 5G NR-NTN specifications, which use -5.5 dBi.
Observation 3: RAN4 guidance (R1-2208353) explicitly recommends -5.5 dBi as a realistic UE antenna gain value, stating: "The UE antenna gain varies depending on the operating frequency and UE design. RAN4 thinks that a realistic UE antenna gain value would be -5.5 dBi. RAN4 would then recommend RAN1 to take this value as an assumption for their link budget evaluation."
Proposal
Proposal 1: For the support of voice-over-GEO in NB-IoT NTN, align the assumption on UE_Antenna_Gain (G_Tx) with 5G NR-NTN specifications, i.e., -5.5 dBi.
This alignment ensures:
- Consistency with existing 3GPP NTN specifications
- Realistic assumptions based on RAN4 recommendations
- Comparable link-budget evaluations across different contributions
- Appropriate performance targets for codec and system design
|
Proposal 1: For the support of voice-over-GEO in NB-IoT NTN, align the assumption on the "UE_Antenna_Gain (a.k.a. G_Tx)", corresponding to G in the FS_ULBC Pdoc, with the one of 5G NR-NTN (i.e., UE_Antenna_Gain (a.k.a. G_Tx): -5.5 dBi).
|
|
|
(pdf)
|
On reference code and model format |
Fraunhofer IIS, Apple Inc. |
3GPP Change Request Summary: S4-260233
Document Overview
This contribution proposes the use of ML model formats as intermediate representations (IR) for the ULBC (Ultra Low Bitrate Codec) reference implementation, rather than a pure C implementation. The document is structured as a proposed Change Request (pCR) to TR 26.940, introducing a new clause 6.4.2.
Main Technical Contributions
1. Problem Statement and Motivation (Goal Section)
The document identifies a fundamental question for ULBC standardization: whether to provide the entire codec reference implementation in C (including neural network components) or to define specific parts based on ML model formats (e.g., ONNX, PyTorch, TensorFlow).
Key concerns with pure C implementation:
- Limits UE vendors from leveraging custom architectures and optimizations
- UE vendors typically have custom optimization pipelines to port ML models to internal formats
- Pure C approach restricts full utilization of specialized hardware (NPUs, DSPs, TPUs)
2. Limitations of C-Based Reference Implementation (Clause 6.4.2.1)
Issues with existing WMC (Weighted Million Operations) tool for complexity measurement:
- Weights in Table 18.3 of G.191 do not account for vectorized implementations of matrix multiplications
- Theoretical complexity estimation does not reflect actual runtime complexity
- Does not account for diversity of target platforms
Additional limitations identified:
- Hardware/platform dependencies: C implementations may rely on platform-specific intrinsics and vectorization pragmas, limiting portability to NPUs
- Unoptimized reference code: May not be optimized for certain platforms
- Compiler dependencies: Intrinsics are compiler-specific
- Maintenance burden: Keeping C implementation updated with new ML operators and architectures is costly and error-prone
3. Definitions and Concepts (Clause 6.4.2.1 - Definitions)
The document establishes clear terminology:
- Graph format: Describes neural network as computational graph (structure only, no parameters)
- Model format: Combines graph representation, trained parameters (weights, biases), and metadata; self-contained and directly runnable
- Intermediate Representation (IR): Serves as bridge between high-level ML framework and execution runtimes
Note: PyTorch does not contain a graph format and requires the model definition as PyTorch code.
4. Advantages of Model Format Approach (Clause 6.4.3.2)
Platform Portability:
- Specifies what is computed, not how it's executed
- Framework-agnostic: models can be exported from different training frameworks
- Allows vendors to use custom toolchains for hardware-specific optimization
Hardware Evolution:
- Future-proof method to leverage latest AI processor developments
- Maintains compatibility with low maintenance effort
Combination with Standard C-code:
- ULBC can combine ML parts (as model format) with classic signal processing (in ANSI C)
- Backend runtime in C can integrate ML components
- Enables traditional 3GPP codec reference implementation structure
5. Comprehensive Model Format Analysis (Clause 6.4.3.3)
The document provides detailed comparison of major ML model formats:
| Format | Type | Key Advantages | Key Limitations |
|--------|------|----------------|-----------------|
| ONNX | Framework-agnostic IR | Cross-framework portability, wide runtime/hardware support, native OS support (Windows/Linux), dedicated C/C++ runtime | Operator coverage limitations, limited dynamic graph support |
| TensorFlow Lite (TFLite/LiteRT) | Edge/embedded-focused IR | Mobile/edge optimized, strong Android ecosystem, quantization tools, C/C++ runtime | TensorFlow-centric, partially vendor-specific maintenance |
| PyTorch/Python | torch.nn.Module + checkpoints | Easy prototyping, highly optimized conversion tools | Suboptimal for real-world testing, Python dependencies, no C/C++ runtime without Python |
| TorchScript | PyTorch-specific serialized IR | Static graph without Python dependencies, supports custom ops, LibTorch C++ runtime | PyTorch-specific, deprecated (being replaced by ExportedProgram) |
| ExportedProgram & ExecuTorch | Two IRs: ExportedProgram and .pte | Replaces TorchScript, canonical PyTorch export IR, dedicated C++ runtime | PyTorch-specific, requires compilation to another IR, pipeline not fully mature |
| OpenVINO IR | Intel/CPU-centric IR | Strong Intel CPU/GPU optimization | Not suitable for mobile SoCs, extra conversion step needed |
| Proprietary vendor IRs | Vendor-specific internal IR | Highly hardware-optimized | Not portable, requires conversion from open IR |
Key observations:
- PyTorch format provides maximum flexibility and transparency but may have long-term compatibility concerns due to format evolution
- ONNX and TFLite are designed for inference deployment and cross-platform compatibility, representing stable industry standards
- ULBC ML parts will likely be based on PyTorch format, convertible to stable formats like ONNX or TFLite
6. SoC AI Engine Support Analysis (Clause 6.4.3.4)
Hardware landscape:
- Major smartphone SoCs include NPUs, DSPs, TPUs, GPUs, and CPUs
- Vendors provide specialized runtime environments and SDKs
- Vendors use native/preferred internal model formats optimized for their architecture
Industry pattern:
- All major vendors provide conversion mechanisms from popular open-source formats
- Common supported formats: ONNX, TFLite, PyTorch, TensorFlow
- References provided for major vendors: Qualcomm, Apple, Samsung, MediaTek, Google, Huawei
7. Summary and Recommendations (Clause 6.4.3.4)
Advantages of model-format/IR-based reference implementation:
- Decouples algorithm definition from hardware-specific implementation
- Leverages existing SoC vendor compilers, AI accelerators, and runtimes
- Significantly more portable, maintainable, and future-proof
Recommended approach for ULBC reference implementation:
1. Base reference on ML model-format with auxiliary signal processing in C
2. Include both ONNX and PyTorch as ML model-formats
3. Define neural network model-format including operator set and version
4. Specify I/O interfaces of ML models and auxiliary signal processing steps in C
5. Use reference implementation for integration illustration, verification, and testing
Proposal
The document proposes:
1. Discussion and agreement on selection of one or more model formats for ULBC reference implementation
2. Agreement on principle of using model format as part of ULBC standardization reference model
3. Documentation of findings in TR 26.940 under new clause 6.4.2
Key Technical Impact
This contribution represents a significant departure from traditional 3GPP codec standardization approaches by advocating for ML model formats rather than pure C implementations. The proposal addresses practical deployment considerations for ML-based codecs while maintaining compatibility with 3GPP standardization practices through a hybrid approach that combines model formats with C code for signal processing components.
|
Proposals
The sources invite discussions regarding the selection of one or more model formats which can be used for a reference implementation in support of the standardization of ULBC. Furthermore, the principle of using a model format as part of the ULBC standardization reference model should be agreed and documented.
It is proposed to document the findings on reference model formats into TR 26.940 under a new clause 6.4.2.
|
|
|
(pdf)
|
On the use of objective metrics in ULBC standardization |
Orange |
Summary of 3GPP Technical Document on Objective Metrics in ULBC Standardization
Introduction and Scope
This document addresses the Study on Ultra Low Bitrate Speech Codec (FS_ULBC), specifically focusing on performance requirements and test methodologies as defined in the WID. The contribution targets study objective 5 regarding speech quality, intelligibility, and conversational quality testing under various conditions (clean/noisy speech, tandeming with IMS codecs, clean/GEO channel conditions).
Main Technical Contributions
Test Methodologies (Clause 9)
Quality Impairments of Ultra-Low Bit Rate Speech Coding (9.1.1)
The document identifies specific impairment categories relevant to ULBC:
- Loss of listening-only audio quality
- Audio bandwidth loss
- Impaired intelligibility
- Impaired speaker identifiability
- Prosodic impairments
- Hallucination (word and phone confusions)
- Sensitivity to non-speech input (background noise, music, noisy speech, interfering talkers, reverberant speech)
Additionally notes that ULBC may incorporate speech enhancement algorithms (noise suppression, gain normalization).
Challenges of Quality Assessment (9.1.2)
The document highlights that ULBC testing introduces new challenges compared to signal processing-based codecs (AMR, AMR-WB, EVS):
Traditional 3GPP Approach:
- Historical reliance on ITU-T P.800 ACR (Absolute Category Rating) for clean speech
- P.800 DCR (Degradation Category Rating) for SWB clean speech, mixed-bandwidth, speech + background noise, and music/mixed content
- Previous codec standardizations did not focus on intelligibility, speaker identifiability, or prosodic impairments
ULBC-Specific Considerations:
- ML-based coding systems introduce new impairment types (e.g., hallucination) not present in signal-processing codecs
- ACR may not optimally quantify all impairments (hallucination, intelligibility, prosodic issues)
- DCR focuses on differences to reference, which may not directly impact conversational capability but affects aspects like identity recognition
Alternative Test Methodologies Listed:
- Diagnostic Rhyme Tests (DRT)
- Modified Rhyme Tests (MRT)
- MOS testing for speaker similarity
- Speaker verification/identification tests
- Prosodic naturalness MOS tests
- Intonation recognition tests
- Transcription tests for word and semantic equivalence
- Phoneme recognition tests
- Automatic speech recognition tests
- P.835 multi-dimensional rating scales for speech enhancement evaluation
Subjective Testing Considerations (9.1.3)
Robustness Related to Source Material (9.1.3.1):
- Multiple languages with diverse intonations
- Non-speech signals
- Various linguistic features and accents
- Wide range of speakers (different voice pitches, speaking styles)
- Overlapping talkers
Simulation of Real-world Acoustic Conditions (9.1.3.2):
- Clean environments (minimal background noise)
- Noisy environments (traffic, human chatter, vehicle)
- Various reverberation levels (RT60 ranging from 0.3s to 1.0s)
Tandeming and Compatibility Testing (9.1.3.3):
- Testing with speech previously encoded by ITU-T G.711, AMR, AMR-WB, and EVS
- Various input levels: -16dBov, -26dBov, and -36dBov
Conclusion (9.1.3.4):
- ITU-T P.800 ACR/DCR serves as backbone for most subjective testing
- Other methodologies may be considered
- Emphasis on diverse test material: multilingual/multi-speaker testing, real-world acoustic conditions, and tandeming
Objective Testing Considerations (9.1.4)
Correlation Analysis Results (9.1.4.1):
The document presents correlation analysis based on ACR experiments (clause 7.3.3) evaluating objective models:
Speech-oriented metrics: PESQ, POLQA, ViSQOL-S, WARP-Q, DNSMOS, NISQA, NORESQA, UTMOS, SCOREQ
General audio metrics: PEAQ, ViSQOL-A
Evaluation metrics used: Pearson correlation coefficient, RMSE, Kendall's Tau rank correlation coefficient
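The three evaluation metrics can be computed as in the following standard-library sketch. The score lists are hypothetical placeholders, not results from the document, and the Kendall variant shown is tau-a (no tie correction), which is an assumption since the summary does not say which variant was used.

```python
import math


def pearson(x, y):
    """Pearson linear correlation coefficient."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def rmse(x, y):
    """Root-mean-square error between two score lists."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))


def kendall_tau_a(x, y):
    """Kendall rank correlation: (concordant - discordant) / number of pairs."""
    n = len(x)
    score = 0
    for i in range(n):
        for j in range(i + 1, n):
            product = (x[i] - x[j]) * (y[i] - y[j])
            score += (product > 0) - (product < 0)
    return score / (n * (n - 1) / 2)


# Hypothetical per-condition scores (subjective MOS vs. objective prediction).
subjective = [1.8, 2.5, 3.1, 3.9, 4.4]
objective = [2.0, 2.4, 3.5, 3.6, 4.5]

print(f"Pearson {pearson(subjective, objective):.3f}, "
      f"RMSE {rmse(subjective, objective):.3f}, "
      f"Kendall tau {kendall_tau_a(subjective, objective):.3f}")
```

Pearson and RMSE are sensitive to the scale of the objective scores, which is why a mapping step can change them, whereas Kendall's tau depends only on rank order and is unaffected by any monotonic mapping.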
Key Observations for Clean Speech:
- Best performing models (POLQA, UTMOS, PESQ, WARP-Q, SCOREQ) accurately predicted monotonic bitrate/quality behavior
- 16 kHz models (PESQ without mapping, UTMOS and WARP-Q with mapping) showed relatively good performance even for fullband codecs
- Mapping generally improves accuracy (RMSE) except for few models (PESQ, POLQA)
Correlation Analysis for Music/Mixed Content:
Based on DCR experiments (clause 7.3.4), evaluating: POLQA, PEMO-Q, ViSQOL-A, and 2f-model
Key Observations for Music/Mixed Content:
- POLQA (despite not being recommended for non-speech) showed best correlation results (Pearson, Kendall, RMSE after 3rd order mapping)
- 2f-model was second-best performing
- ViSQOL Audio, PEAQ, and PEMO-Q showed fair performance
- Correlation scores lower than clean speech, possibly due to more difficult task of predicting general audio quality and mismatch with DCR grading methodology
Discussion (9.1.4.2):
- P.862 (PESQ) officially "withdrawn" by ITU-T, cannot be considered valid standard
- P.863 remains main ITU-T standard, P.SAMD emerging as potential alternative
- Testing and parameter adjustment based on objective tools not recommended
- 3GPP TR 26.921 documented that tuning noise reduction based on PESQ should be avoided
Conclusion (9.1.4.3):
- Subjective testing remains "golden reference" for codec selection
- Objective metrics NOT recommended for codec selection criteria or codec tuning
- Correlation of subjective and objective metrics may be considered for codec characterization
- Objective metrics have merits in other tasks such as codec conformance testing
Proposed Changes to TR 26.940
The document proposes comprehensive revisions to TR 26.940 v0.5.1, specifically to Clause 9 (Test methodologies), incorporating all the analysis and recommendations detailed above regarding both subjective and objective testing approaches for ULBC standardization.
|
Proposals
The document contains no explicit, numbered proposals.
It has sections labeled "Proposal" and "Proposed revisions" (both on page 1), but these are section headers describing the general approach of the document rather than formal numbered or formatted proposals as typically found in 3GPP contributions.
The document primarily consists of:
- An introduction
- References
- An Annex containing proposed changes to TR 26.940 (marked as "pCR to TR 26.940")
- Technical content about test methodologies and objective metrics
The Annex contains technical content that could be considered proposed text for inclusion in TR 26.940, but it is not formatted as explicit "Proposal X:" or "Proposal:" statements.
|
|
|
(pdf)
|
On complexity estimation of ULBC |
Fraunhofer IIS |
Summary of S4-260241: On Complexity Estimation of ULBC
Document Overview
This contribution addresses the complexity measurement methodology for the Ultra-Low Bitrate Codec (ULBC) under development in 3GPP SA4. The document proposes a hybrid complexity metric that combines traditional DSP-based measurements with ML-specific metrics.
Background and Motivation
Multiple input documents [1-4] have previously discussed complexity measurement approaches:
- Documents [1] and [3] proposed using WMOPS (Weighted Million Operations Per Second), following conventional speech codec practices
- Document [2] suggested using MACs and a modified WMOPS version
- Document [4] emphasized model size considerations
The key challenge is that ULBC will operate on heterogeneous, non-fixed target hardware and processors, requiring a platform-agnostic complexity metric.
Main Technical Contributions
Proposed Hybrid Complexity Metric
The document proposes combining two complementary measurement approaches:
For DSP-based components:
- Use traditional WMOPS measurement
For ML-based components:
- Use MAC (Multiply-Accumulate) operations count
- Include parameter count for memory/model size considerations
Combined metric formula:
WMOPS + w · MACs
where w is an ML weighting factor (expected to be < 1) that reflects the vectorization capability of matrix multiplications.
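A minimal sketch of how the combined value could be computed is shown below. It assumes MACs are expressed in millions per second so that they are commensurate with WMOPS; the summary does not state the unit. The layer sizes, frame length, WMOPS figure, and the value of w are all hypothetical, and the helper names are not from the contribution.

```python
def dense_macs(in_features: int, out_features: int) -> int:
    """MACs for one fully connected layer applied to one input vector."""
    return in_features * out_features


def conv1d_macs(in_channels: int, out_channels: int, kernel_size: int,
                out_length: int) -> int:
    """MACs for one 1-D convolution layer producing `out_length` outputs."""
    return in_channels * out_channels * kernel_size * out_length


def mmacs_per_second(macs_per_frame: int, frame_ms: float) -> float:
    """Convert a per-frame MAC count into millions of MACs per second."""
    return macs_per_frame * (1_000.0 / frame_ms) / 1e6


def hybrid_complexity(wmops: float, mmacs: float, w: float) -> float:
    """Combined metric: WMOPS + w * MACs."""
    return wmops + w * mmacs


# Hypothetical ML part: two 256x256 dense layers run once per 20 ms frame.
ml_macs_per_frame = 2 * dense_macs(256, 256)
ml_mmacs = mmacs_per_second(ml_macs_per_frame, frame_ms=20.0)

# Hypothetical DSP part (5 WMOPS) and weighting factor (w = 0.25).
total = hybrid_complexity(wmops=5.0, mmacs=ml_mmacs, w=0.25)
print(f"ML load {ml_mmacs:.4f} MMAC/s, combined complexity {total:.4f}")
```

Because w scales only the ML term, the same limit can be met by shifting load between the DSP-based and ML-based parts, which is the flexibility the proposal aims for.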
Rationale for the Hybrid Approach
Limitations of WMOPS-only approach:
- WMOPS reflects complexity primarily for DSP operations
- Does not account for modern vectorization capabilities available even on modern DSPs
- Less relevant for non-DSP processor types
- The WMOPS toolbox doesn't reflect modern computational capabilities
ML-specific considerations:
- ML component complexity is dominated by matrix multiplications
- Inference time and energy consumption are highly platform-dependent
- MAC count provides architecture-agnostic computational load measurement
- Parameter count relates directly to model size, memory usage, and energy consumption
Advantages of the Proposed Metric
The hybrid approach provides:
1. Overall complexity estimate for hybrid DSP+ML codec designs
2. Avoids over-constraining codec design toward specific platforms (referenced S4-260233)
3. Allows UE vendors to leverage custom architectures and optimizations
4. Accounts for efficient vectorization of ML components
5. Enables flexible computational cost balancing between DSP-based and ML-based components
6. Maintains continuity with established practice while accommodating emerging ML-based designs
Vectorization Capability Reference Data
The document provides example processing units and their vectorization capabilities to inform the ML weighting factor w:
| Chip | Type | Vectorization Capabilities |
|------|------|---------------------------|
| HiFi 5s | DSP | 32×(8×8 bit MAC); 16×(32×16 bit MAC); 8×(32×32 bit MAC) |
| ARM Cortex A55 | CPU | 16×(8×8 MAC); 8×(16×16 MAC FP) |
Proposal
The source proposes to:
- Define computational complexity metric by counting:
- WMOPS for DSP-based components
- MAC for ML-based components
- Combine according to: WMOPS + w · MACs (where w is an ML weighting factor)
- Define a maximum value as the computational complexity limit in design constraints
- Apply similar principles for memory counting metrics
References
The document references four previous contributions [1-4] and two external technical specifications [5-6] for processor capabilities.
|
Proposals
Proposal
The source proposes to define a computational complexity metric by counting
- WMOPS for the DSP‑based components,
- MAC for the ML‑based components, and
- combine those to a common value according to WMOPS + w · MACs, where w is an ML weighting factor.
Finally, a maximum value needs to be defined as computational complexity limit in design constraints. Based on this principle, a similar metric can be defined for memory counting.
|
|
|
(pdf)
|
[FS_ULBC] ULBC Re-Focus Proposal |
Dolby Laboratories Inc., Nokia, Novamint |
ULBC Re-Focus Proposal
Background and Motivation
The FS_ULBC study item, initiated nearly a year ago, aims to establish a normative ULBC standard for voice communication over GEO within Rel-20. However, progress has been slow, with crucial issues such as end-to-end simulation parameters remaining unresolved. This contribution proposes a focused approach to meet 3GPP standardization timelines.
Core Proposal: Two-Phase Standardization Approach
The document proposes separating ULBC standardization into two distinct phases to ensure timely delivery while accommodating future enhancements:
Phase 1: Rel-20 ULBC Baseline
- Scope: GEO-focused functionality based strictly on stable Rel-19 service requirements
- Rationale: Ongoing 6G Media requirements in SA1, SA2, and SA4 have not yet produced consolidated or normative requirement sets suitable for codec design
- Principle: Following established 3GPP procedures, the ULBC work item shall not define new service requirements but rely on formally defined and stabilized upstream requirements
Phase 2: Rel-21 ULBC Advanced
- Scope: Extended functionality aligned with finalized 6G Media requirements
- Application scenarios: Beyond Rel-20 IMS Voice Call over GEO
- Compatibility: Should be a backward-compatible extension of the Rel-20 baseline
Technical Configuration Comparison
Application Scenarios
Baseline (Rel-20):
- IMS Voice Call over GEO based strictly on Rel-19 service requirements
Advanced (Rel-21):
- Multi-Party Voice Communication
- IMS Voice Call with ULBC over additional access types beyond GEO
GEO Channel Characteristics & Simulation
Baseline (Rel-20):
- Single baseline UE Tx/Rx capability
- Single CNR in UL and DL (e.g., UL single-tone 23 dBm: CNR=5.28 dB for SCS=3.75 kHz, CNR=-0.74 dB for 15 kHz; DL 12-tone single Rx: CNR=-0.61 dB)
- Single agreed target bitrate compatible with baseline UE capability enabling acceptable system capacity
- Reliance only on mandatory Rel-19 NB-IoT radio protocol features (except SPS)
- i.i.d. random block error patterns
- Single SPS/bundling period (160 ms)
Advanced (Rel-21):
- Advanced UE capabilities (e.g., increased Tx power, multiple Rx antennas)
- Multiple CNR assumptions in UL and DL
- Codec designers may choose optimal bitrate/TBS per CNR
- Allow reliance on expected Rel-20 and selected non-mandatory NB-IoT features
- Simulated block error patterns based on advanced features
- Additional SPS/bundling periods (e.g., 80 ms, 320 ms)
Design Constraints: Bitrate
Baseline (Rel-20):
- Single target bitrate derived from Rel-19 GEO IMS voice service requirements
- Example: TBS=208 with SPS period 160 ms, achieving 950 bps net bitrate
Advanced (Rel-21):
- Multiple target CNRs with bitrate as codec design choice
- Additional bitrates for future 6G-related scenarios
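The baseline example can be checked with simple arithmetic. The gross rate for TBS=208 over a 160 ms period is 1300 bps, so the quoted 950 bps net rate implies some per-packet overhead. The sketch below infers that overhead (56 bits) from the quoted numbers; the source does not state the overhead, so it should be read as an assumption, and the function names are hypothetical.

```python
def gross_bitrate_bps(tbs_bits: int, period_ms: int) -> float:
    """Bits per second if the whole transport block carried codec payload."""
    return tbs_bits * 1000 / period_ms


def net_bitrate_bps(tbs_bits: int, period_ms: int, overhead_bits: int) -> float:
    """Bits per second left for the codec after per-packet overhead."""
    payload = tbs_bits - overhead_bits
    if payload < 0:
        raise ValueError("overhead exceeds transport block size")
    return payload * 1000 / period_ms


def implied_overhead_bits(tbs_bits: int, period_ms: int, net_bps: float) -> float:
    """Overhead per packet that reconciles a TBS with a quoted net bitrate."""
    return tbs_bits - net_bps * period_ms / 1000


# Values quoted in the baseline example: TBS=208, SPS period 160 ms, 950 bps net.
gross = gross_bitrate_bps(208, 160)
overhead = implied_overhead_bits(208, 160, 950)   # inferred, not stated in the source
net = net_bitrate_bps(208, 160, 56)
print(gross, overhead, net)
```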
Design Constraints: Sample Rate and Audio Bandwidth
Baseline (Rel-20):
- Single sample rate: e.g., 16 kHz
- Audio bandwidth: up to WB
- Note: May depend on agreed target bitrate
Advanced (Rel-21):
- Input/output sampling rates: at least 8, 16, 32, 48 kHz
- Audio bandwidth unconstrained (codec design choice)
Design Constraints: Frame Length and Algorithmic Delay
Baseline (Rel-20):
- Corresponding to SPS/bundling period (160 ms) or sub-multiples thereof
- Algorithmic delay excl. framing: e.g., ≤80 ms (0.5 × SPS/bundling period)
Advanced (Rel-21):
- Frame structure and algorithmic delay aligned with advanced SPS/bundling options and future 6G Media requirements
Design Constraints: Complexity and Memory
Baseline (Rel-20):
- Limited; sufficiently low to not preclude deployment on current-generation smartphones
- TBD MMAC/s
- E.g., 3M parameters
Advanced (Rel-21):
- Relaxed, enabling multiple models
- Addressing future 6G Media requirements while leveraging new UE hardware trends
Design Constraints: Packet Loss Concealment
Baseline (Rel-20):
- Required; capable of addressing single agreed-upon target bit rate and operation point of IMS Voice Call over GEO
Advanced (Rel-21):
- Required; capable of supporting anticipated extended application scenarios beyond Rel-20 IMS Voice Call over GEO, while fulfilling potential 6G Media requirements
Design Constraints: Noise Suppression and Robustness
Baseline (Rel-20):
- No requirement to provide noise suppression
- Required capability to handle and reconstruct noisy speech input with moderate to high SNR
- Note: Noise reconstruction capability primarily enforced through performance requirements
Advanced (Rel-21):
- No requirement to provide noise suppression
- Required capability to handle speech and generic input anticipated in extended application scenarios
Design Constraints: DTX
Baseline (Rel-20):
- No requirement to support DTX
- Note: No separate DTX-related performance requirement
Advanced (Rel-21):
- DTX support may be required for certain extended application scenarios, depending on potential 6G Media requirements
Performance Requirements
Baseline (Rel-20):
- Requirements focusing on clean and noisy speech performance
- No Worse Than (NWT) AMR at 7.4 kbps or AMR-WB at 8.85 kbps, depending on target bandwidth, for:
- Clean speech
- Noisy speech (AMR/AMR-WB references operated with DTX on)
- Relevant transcoding cases with G.711, AMR, AMR-WB, EVS
Advanced (Rel-21):
- Complex set of requirements considering required capability to handle speech and generic input anticipated in extended application scenarios
Test Methodology and Test Plan
Baseline (Rel-20):
- Subjective: P.800 DCR
- Note: Test methodology and test plan should be conceptually aligned with corresponding EVS codec standardization Pdocs (e.g., DCR test design, applicable SNRs and types of noises for noisy speech test cases)
Advanced (Rel-21):
- Subjective: Suitable for critical evaluation of candidate codec(s) against expected complex set of performance requirements
Proposal
SA4 is asked to adopt this phased approach for ULBC standardization as a working assumption:
- Rel-20 ULBC Baseline: GEO-focused functionality based solely on Rel-19 service requirements and mandatory Rel-19 features (except SPS), enabling completion of a viable ULBC baseline standard within the Rel-20 schedule
- Rel-21 ULBC Advanced: Extended ULBC functionality aligned with finalized 6G Media requirements, supporting application scenarios beyond Rel-20 IMS Voice Call over GEO, possibly leveraging advanced UE capabilities, and providing a backward-compatible extension of the Rel-20 baseline
This approach ensures a deliverable ULBC baseline in Rel-20 while providing a clear and orderly path toward an enhanced ULBC design in Rel-21.
|
Proposals
Proposal: To facilitate the timely delivery of a practical ULBC solution for IMS Voice over GEO within the Rel‑20 timeline, initial standardization efforts should be confined to a baseline configuration that is strictly anchored in stable Rel‑19 service requirements and guaranteed Rel‑19 radio features (with SPS as the sole exception). It is also recommended to maintain a targeted approach regarding design constraints, performance requirements, and testing methodologies. Defining this focused scope is critical to ensure the successful completion of the anticipated Rel‑20 Work Item following the FS_ULBC Study Item, as prompt SI finalization is required before commencing any ensuing WI activities.
Concurrently, advanced features and expanded application scenarios dependent on evolving 6G Media requirements have been recognized. As these requirements are still under development and not yet established as normative, they are unsuitable for inclusion in the Rel‑20 baseline. These elements should instead be systematically addressed in a structured second phase during Rel‑21, subsequent to the stabilization of pertinent 6G requirements.
It is therefore proposed to standardize ULBC in two phases:
- Rel‑20 ULBC Baseline: GEO‑focused functionality based solely on Rel‑19 service requirements and mandatory Rel‑19 features (except SPS), enabling completion of a viable ULBC baseline standard within the Rel‑20 schedule.
- Rel‑21 ULBC Advanced: Extended ULBC functionality aligned with finalized 6G Media requirements and supporting application scenarios beyond Rel‑20 IMS Voice Call over GEO, possibly also leveraging advanced UE capabilities.
This two‑phase approach ensures a deliverable ULBC baseline in Rel‑20 while providing a clear and orderly path toward an enhanced ULBC design in Rel‑21. To ensure full alignment of the ULBC systems designed in the two phases, Rel-21 ULBC Advanced should be a backward compatible extension of the Rel‑20 ULBC Baseline.
The source suggests that SA4 adopt this phased approach for ULBC standardization as a working assumption.
|
|
|
(pdf)
|
[FS_ULBC] Feasible TBS values and packet loss traces for 80ms bundling period for ULBC over NB-IoT NTN GEO channel |
Qualcomm Incorporated |
Feasible TBS Values and Packet Loss Traces for 80ms Bundling Period for ULBC over NB-IoT NTN GEO Channel
1. Background and Scope
This contribution presents simulation results for 80ms bundling period following the Simulation One ("target QoS based simulation") methodology. The document provides:
- Feasible TBS values
- Packet loss traces for optimal configurations
- System capacity analysis
2. Simulation Parameters and Trace Labeling
The simulations cover the following parameter ranges:
- Direction: UL/DL
- TBS: 144, 256, 328, 424 bits (for 80ms bundling)
- Bundling period: 80ms (focus of this paper)
- Doppler spread: 1Hz, 5Hz
- Number of RX: UL: 1, DL: 1, 2
- SCS: UL: 3.75kHz, 15kHz; DL: 15kHz
- Number of tones: UL: 1 for 3.75kHz SCS; 1, 3, 6, 12 for 15kHz SCS; DL: 15
- BLER targets: 1%, 2%, 6%, 10%
- UE power class: 23dBm, 26dBm, 31dBm
A trace file naming convention is established for both UL and DL scenarios.
3. Optimal Configurations
3.1 TBS 144, 1 UE RX
Optimality criterion: Tradeoff between per-UE performance (TBS and BLER) and system capacity.
1% BLER
- DL: 16ms NPDSCH (N_SF=4, N_rep=4), required SNR: -4.6dB, capacity: 5 UEs
- UL: 48ms NPUSCH (N_RU=6, SCS 15kHz, 1 tone), required SNR: 0.0dB, UE TX power: 26.4dBm
- Feasibility: Only with 31dBm UE power class
2% BLER
- DL: 12ms NPDSCH (N_SF=3, N_rep=4), required SNR: -4.1dB, capacity: 6 UEs
- UL: 64ms NPUSCH (N_RU=8, SCS 15kHz, 1 tone), required SNR: -2.2dB, UE TX power: 24.2dBm
- Feasibility: 26dBm and 31dBm UE power classes
6% BLER
- DL: 8ms NPDSCH (N_SF=4, N_rep=2), required SNR: -4.0dB, capacity: 10 UEs
- UL: 64ms NPUSCH (N_RU=8, SCS 15kHz, 1 tone), required SNR: -3.2dB, UE TX power: 23.2dBm
- Feasibility: 26dBm and 31dBm UE power classes
10% BLER
- DL: 6ms NPDSCH (N_SF=3, N_rep=2), required SNR: -3.4dB, capacity: 12 UEs (limited by UL)
- UL: 64ms NPUSCH (N_RU=8, SCS 15kHz, 1 tone), required SNR: -3.7dB, UE TX power: 22.7dBm
- Feasibility: All power classes (23dBm, 26dBm, 31dBm)
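The NPUSCH durations and feasibility statements above follow a simple pattern, sketched below. The resource-unit durations are the NB-IoT NPUSCH format 1 values as understood from TS 36.211, and the feasibility rule (a power class is feasible when its maximum output power is at least the required UE TX power) is an assumption that happens to reproduce the outcomes listed above; neither is spelled out in the summary. All names are hypothetical.

```python
# Resource-unit duration in ms, keyed by (subcarrier spacing, number of tones).
# Assumed NPUSCH format 1 values; check against TS 36.211 before relying on them.
RU_DURATION_MS = {
    ("3.75kHz", 1): 32,
    ("15kHz", 1): 8,
    ("15kHz", 3): 4,
    ("15kHz", 6): 2,
    ("15kHz", 12): 1,
}

POWER_CLASSES_DBM = (23, 26, 31)


def npusch_duration_ms(n_ru: int, scs: str, tones: int, n_rep: int = 1) -> int:
    """Transmission time = resource units x RU duration x repetitions."""
    return n_ru * RU_DURATION_MS[(scs, tones)] * n_rep


def feasible_power_classes(required_tx_dbm: float) -> list:
    """Power classes whose maximum output power covers the required TX power."""
    return [pc for pc in POWER_CLASSES_DBM if pc >= required_tx_dbm]


# UL configurations quoted for TBS 144 with 1 UE RX: (BLER, N_RU, required TX power).
for bler, n_ru, tx_dbm in (("1%", 6, 26.4), ("2%", 8, 24.2),
                           ("6%", 8, 23.2), ("10%", 8, 22.7)):
    duration = npusch_duration_ms(n_ru, "15kHz", 1)
    print(bler, duration, "ms", feasible_power_classes(tx_dbm))
```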
3.2 TBS 144, 2 UE RX
For 23dBm UE Power Class
- Only 10% BLER achievable with system capacity: 12 UEs
- Uses 64ms NPUSCH (N_RU=8, SCS 15kHz, 1 tone) and 3ms NPDSCH (N_SF=3, N_rep=1)
For 26dBm and 31dBm UE Power Classes
- 1% BLER: 20 UEs, uses 64ms NPUSCH (N_RU=2, SCS 3.75kHz, 1 tone), UE TX power: 25.6dBm, 4ms NPDSCH
- 2% BLER: 20 UEs, uses 64ms NPUSCH (N_RU=2, SCS 3.75kHz, 1 tone), UE TX power: 25.0dBm, 4ms NPDSCH
- 6% BLER: 20 UEs, uses 64ms NPUSCH (N_RU=2, SCS 3.75kHz, 1 tone), UE TX power: 23.9dBm, 4ms NPDSCH
- 10% BLER: 26 UEs, uses 64ms NPUSCH (N_RU=2, SCS 3.75kHz, 1 tone), UE TX power: 23.4dBm, 3ms NPDSCH
Key observation: 3.75kHz SCS configuration becomes optimal for higher power classes due to better coding rate.
3.3 TBS 256, 2 UE RX
For 26dBm UE Power Class
- 10% BLER: 12 UEs, 64ms NPUSCH (N_RU=8, SCS 15kHz, 1 tone), UE TX power: 24.8dBm
- 6% BLER: 12 UEs, 64ms NPUSCH (N_RU=8, SCS 15kHz, 1 tone), UE TX power: 25.3dBm
- 2% and 1% BLER: Infeasible
For 31dBm UE Power Class
- 10% BLER: 16 UEs, 64ms NPUSCH (N_RU=2, SCS 3.75kHz, 1 tone), UE TX power: 27.2dBm, 5ms NPDSCH
- 6% BLER: 16 UEs, 64ms NPUSCH (N_RU=2, SCS 3.75kHz, 1 tone), UE TX power: 27.8dBm, 5ms NPDSCH
- 2% BLER: 10 UEs, 64ms NPUSCH (N_RU=8, SCS 15kHz, 1 tone), UE TX power: 26.3dBm, 8ms NPDSCH
- 1% BLER: 10 UEs, 64ms NPUSCH (N_RU=8, SCS 15kHz, 1 tone), UE TX power: 26.8dBm, 8ms NPDSCH
3.4 TBS 328, 2 UE RX
For 26dBm UE Power Class
- Only 10% BLER achievable: 12 UEs, 64ms NPUSCH (N_RU=8, SCS 15kHz, 1 tone), UE TX power: 25.88dBm
For 31dBm UE Power Class
- 10% BLER: 13 UEs, 64ms NPUSCH (N_RU=2, SCS 3.75kHz, 1 tone), UE TX power: 30.5dBm, 6ms NPDSCH
- 6% BLER: 10 UEs, 64ms NPUSCH (N_RU=4, N_rep=2, SCS 15kHz, 1 tone), UE TX power: 26.4dBm, 8ms NPDSCH
- 2% BLER: 10 UEs, 64ms NPUSCH (N_RU=4, N_rep=2, SCS 15kHz, 1 tone), UE TX power: 27.5dBm, 8ms NPDSCH
- 1% BLER: 8 UEs, 64ms NPUSCH (N_RU=4, N_rep=2, SCS 15kHz, 1 tone), UE TX power: 28.1dBm, 10ms NPDSCH
3.5 TBS 424, 2 UE RX
Note: Coarse 5ms granularity for NPDSCH time-domain configuration.
For 31dBm UE Power Class
- 10% BLER: 4 UEs, 40ms NPUSCH (N_RU=5, SCS 15kHz, 3 tones), UE TX power: 29.10dBm, 10ms NPDSCH
- 6% BLER: 4 UEs, 40ms NPUSCH (N_RU=5, SCS 15kHz, 3 tones), UE TX power: 29.73dBm, 10ms NPDSCH
- 2% BLER: 4 UEs, 40ms NPUSCH (N_RU=5, SCS 15kHz, 3 tones), UE TX power: 30.96dBm, 10ms NPDSCH
- 1% BLER: Infeasible
4. Feasible TBS Values
Observation: For 80ms bundling period with UE power class up to 31dBm:
- All TBS values (144, 256, 328, 424) are feasible for BLERs 1%, 2%, 6%, and 10%
- Exception: TBS 424 is not feasible at 1% BLER
5. Packet Loss Traces
299,391 traces provided in attached zip file for all 4 TBS values (144, 256, 328, 424).
6. Proposal
Proposal: Include clauses 2 through 5 in the PD or TR to provide a workable example of determining configurations based on the optimal tradeoff between per-UE performance and system capacity.
|
Extracted Proposals
Proposal: include clauses 2 through 5 above in the PD or TR.
|
|
|
(pdf)
|
[FS_ULBC] ULBC Performance Requirements |
Apple Inc. |
Summary of S4-260271: ULBC Performance Requirements
Document Information
- Source: Apple Inc.
- Meeting: 3GPP TSG SA WG4#135, Goa, India (09-13 February 2026)
- Type: Discussion and Agreement
- Revision: Revision of S4aA250135 addressing comments from SA4 #134 post-adhoc telco (Dec 02)
Main Technical Contributions
Performance Requirements Framework
The document proposes establishing minimum performance requirements for the Ultra-Low Bitrate Codec (ULBC) based on the following rationale:
- ULBC targets IMS voice service over GEO and NGSO satellite systems (per Clause 4, TR 26.940)
- Quality must be consistent with deployed VoLTE IMS voice services
- Current TBS discussions center on bitrates in the 1-3 kbps range
- AMR-WB 12.65kbps and EVS-SWB 13.2kbps are commonly deployed VoLTE operating points
Proposed Minimum Performance Benchmarks
The document establishes two key performance anchors:
- At lowest operating range (~1 kbps):
  - ULBC shall provide speech quality No Worse Than (NWT) AMR-WB @12.65kbps
  - Applies to: clean speech, noisy speech, and packet loss conditions
- At higher operating range (~3 kbps):
  - ULBC shall provide speech quality No Worse Than (NWT) EVS-SWB @13.2kbps
  - Applies to: clean speech, noisy speech, and packet loss conditions
Reference Codecs and Operating Points for Testing
The document proposes a comprehensive list of reference codecs and operating points for ToR comparison testing in subjective evaluation:
- AMR: 12.2kbps
- AMR-WB: 8.85kbps, 12.65kbps, 23.85kbps
- EVS AMR-WB-IO: 8.85kbps, 12.65kbps, 23.85kbps
- EVS-WB/SWB: 7.2kbps, 8kbps, 9.6kbps, 13.2kbps, 13.2kbps CA, 24.4kbps
Text Proposal
The document proposes updates to Clause 8 (Performance requirements) of TR 26.940, adding:
- New Clause 8.1 (General) containing the performance requirements framework and minimum benchmarks
- New Clause 8.1.1 (A List of Reference Codecs and Operating Points) containing the reference codec list for subjective evaluation
|
Proposal
It is proposed to update Clause 8 in TR 26.940 to reflect the changes below.
|
|
|
(pdf)
|
[FS_ULBC] ULBC Codec Testing in Background Noise |
Apple Inc. |
Summary of S4-260272: ULBC Codec Testing in Background Noise
Document Overview
This contribution proposes a testing framework for the Ultra-Low Bitrate Codec (ULBC) in noisy conditions, drawing from EVS codec testing methodologies. The document is a revision of S4-251786 from SA4#134 and proposes updates to TR 26.940 Clause 9.
Background and Motivation
Noise Suppression Considerations
The document argues against mandating NS algorithms within the codec specification based on several key considerations:
- Device-Specific Optimization: NS algorithms are typically optimized for specific device microphone array configurations. A generic NS algorithm applied uniformly could result in suboptimal performance across different device types.
- Codec Robustness vs. NS Artifacts: Testing ULBC with clean, noisy, and optionally NS-processed speech provides better understanding of the codec's inherent robustness. NS algorithms may introduce speech distortions that could bias codec testing results.
- Emergency Call Requirements: For emergency calls, preserving background noise is critical as it may contain important contextual information (alarms, traffic, voices) that helps identify the caller's environment or ongoing danger.
- Complexity and Latency Concerns: ML-based NS algorithms can be computationally complex, increasing power consumption and end-to-end latency. Mandating complex NS could burden some devices inefficiently.
The document advocates for flexibility in NS implementation to enable manufacturers to develop device-specific solutions.
Proposed Testing Framework
Core Testing Scenarios (Table 9.1.4.1)
Following EVS codec testing principles (TR 26.952), the proposal includes:
| Source Material | Noise Type | SNR | Test Methodology |
|----------------|------------|-----|------------------|
| Clean speech | - | - | ITU-T P.800 ACR and/or DCR |
| Speech + Noise | Stationary (car, etc.) | 15 dB | ITU-T P.800 DCR |
| Speech + Noise | Non-stationary (street, babble, etc.) | 20-25 dB | ITU-T P.800 DCR |
This framework aligns with EVS testing which used:
- Car noise at 15 dB
- Street noise at 20 dB
- Office/babble noise at 20 dB
- ITU-T P.800 DCR methodology ("Degradation of Speech in Noise" DMOS test)
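To make the SNR conditions concrete, the following sketch shows one way a "Speech + Noise" item could be generated at a target SNR by scaling the noise before adding it. It uses plain mean power over the whole signal, which is a simplification: test plans typically measure active speech level (for example per ITU-T P.56). The signals are synthetic placeholders and all names are hypothetical.

```python
import math
import random


def mean_power(x):
    """Average power of a sequence of samples."""
    return sum(v * v for v in x) / len(x)


def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that speech power / noise power equals `snr_db`, then add."""
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    p_noise = mean_power(noise)
    if p_noise == 0:
        raise ValueError("noise must not be silent")
    gain = math.sqrt(mean_power(speech) / (p_noise * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)], gain


# Synthetic stand-ins: a 200 Hz tone for "speech", Gaussian noise for "car noise".
rng = random.Random(0)
speech = [math.sin(2 * math.pi * 200 * n / 16000) for n in range(16000)]
noise = [rng.gauss(0.0, 1.0) for _ in range(16000)]

noisy, gain = mix_at_snr(speech, noise, 15.0)   # 15 dB, as for stationary noise
achieved = 10 * math.log10(mean_power(speech) / mean_power([gain * v for v in noise]))
print(round(achieved, 6))
```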
Optional Extended Testing for Low SNR (Table 9.1.4.2)
To characterize ULBC robustness in challenging low SNR conditions:
| Source Material | Noise Type | SNR | Test Methodology |
|----------------|------------|-----|------------------|
| Speech + Noise | Stationary (car, etc.) | 5-10 dB | ITU-T P.800 DCR |
| Speech + Noise | Non-stationary (street, babble, etc.) | 10-15 dB | ITU-T P.800 DCR |
| NS processed speech + Noise | Stationary (car, etc.) | 5-10 dB | ITU-T P.800 DCR |
| NS processed speech + Noise | Non-stationary (street, babble, etc.) | 10-15 dB | ITU-T P.800 DCR |
Key Notes:
- To avoid bias, a common NS processing tool should be used for generating NS-processed speech
- Selection of specific noise types and the NS processing tool is FFS
- Reference is made to TR 26.989 v19.0.0 (MCPTT work) where EVS was evaluated in siren noise at 5 dB SNR
Proposed Specification Changes
The document proposes adding new Clause 9.1.4 to TR 26.940 with two subclauses:
- 9.1.4.1 Background: Captures the rationale for flexible NS implementation
- 9.1.4.2 Recommendations for ULBC Codec Testing: Defines the testing framework with Tables 9.1.4.1 and 9.1.4.2
Action Requested
The document seeks Discussion and Agreement on:
1. The proposed testing framework for ULBC in noisy conditions
2. Updates to TR 26.940 Clause 9 as specified in the text proposal
|
No explicitly formatted proposals (e.g., "Proposal 1:" or "Proposal:") were found in the document.
The document contains a section titled "4. Proposal" which states "It is proposed to update Clause 9 in TR 26.940 to reflect the changes below." This is followed by the proposed text changes to be inserted into the technical report rather than by standalone proposals.
The document is structured as a discussion paper with a proposed update to TR 26.940.
|
|
|
(pdf)
|
[FS_ULBC] On device capability diversity |
Dolby Laboratories Inc., Nokia, Novamint |
Summary of S4-250275: On Device Capability Diversity for ULBC
Overview
This document (revision of S4aA260006) addresses UE capability diversity in NB-IoT NTN deployments for ULBC voice services. It proposes a capability-aware system design approach rather than assuming uniform baseline UE capabilities, accompanied by a pCR to TR 26.940.
Key Technical Contributions
1. UE Capability Diversity Framework
Identified Capability Dimensions:
- Transmit Power Classes:
  - Baseline: PC3 (23 dBm)
  - Enhanced: PC2 (26 dBm) or PC1 (31 dBm)
  - Future: up to 37 dBm under study for Rel-20
- Receive Antenna Configurations:
  - Standard: Single RX antenna
  - Enhanced: Dual RX antennas (providing ~3 dB gain)
- Advanced Features:
  - Improved RF sensitivity
  - Multi-tone NPUSCH transmission capability
Key Insight: These capabilities are optional and vary across device categories, market segments, and implementations.
2. Benefits of Enhanced Capabilities
Enhanced UE capabilities enable:
- Reduced time-domain resource usage in half-duplex NB-IoT transmission
- Overcoming limitations of 80 ms SPS periods (excessive BLER and capacity constraints)
- Multi-tone NPUSCH transmission for:
- Higher ULBC bitrates
- Reduced time-domain resource usage
- Improved link robustness (reduced packet error rates)
3. Capability-Aware Multi-User SPS Scheduling
Proposed Scheduling Strategy:
- Dynamic SPS Assignment: Enhanced UEs use shorter SPS periods (80 ms) while baseline UEs use longer periods (160/320 ms)
- Multi-Tone Transmission: Enhanced UEs utilize multi-tone NPUSCH formats
- Load Balancing: Resource allocation prioritized based on UE capability
- Service Differentiation: Three-tier service model:
- Baseline Service: Conservative configurations (long SPS, single-tone NPUSCH)
- Intermediate Service: Moderate enhancements (shorter SPS, possible multi-tone)
- Enhanced Service: Higher bitrates and reduced latency (shortest SPS, multi-tone, dual RX)
Practical Example (Figure 1):
- UE Type A (Baseline): 160 ms SPS, 128 ms single-tone NPUSCH, 950 bits/s net bitrate (TBS 208 bits)
- UE Type B (Intermediate): 80 ms SPS, 64 ms NPUSCH (higher TX power), 1100 bits/s net bitrate (TBS 144 bits)
- UE Type C (Enhanced): 80 ms SPS, 64 ms NPUSCH, reduced NPDSCH duration (dual RX), 1100 bits/s net bitrate (TBS 144 bits)
4. ULBC Bitrate Differentiation
Proposed Approach:
- Leverage UE capability diversity for bitrate differentiation
- Align bitrates with service tiers and UE capabilities
- Recommended minimum set of 3 ULBC target bitrates:
- Basic tier: [600 - 1000] bits/s
- Intermediate tier: [1000 - 1800] bits/s
- Enhanced tier: [1800 - 3000] bits/s
- Higher bitrates may be considered in second ULBC standardization phase
Note: Actual bitrates subject to ongoing TBS discussions; values >3000 bits/s may become relevant.
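The tier ranges above share their end points (1000 and 1800 bits/s appear in two ranges each), so any mapping from bitrate to tier has to pick a convention. The sketch below assumes half-open ranges with the upper bound of the last tier included; this convention and the names used are assumptions made for illustration, not something the contribution specifies.

```python
# (tier name, lower bound in bits/s, upper bound in bits/s)
TIERS = (
    ("Basic", 600, 1000),
    ("Intermediate", 1000, 1800),
    ("Enhanced", 1800, 3000),
)


def bitrate_tier(bps: float):
    """Return the tier name for a bitrate, or None if it is outside all ranges.

    Assumed convention: lower bound inclusive, upper bound exclusive,
    except that the top of the last tier (3000 bits/s) is included.
    """
    for name, low, high in TIERS:
        if low <= bps < high:
            return name
    if bps == TIERS[-1][2]:
        return TIERS[-1][0]
    return None


# The two net bitrates quoted in the practical example (Figure 1).
for bps in (950, 1100):
    print(bps, bitrate_tier(bps))
```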
Proposed Changes to TR 26.940 (pCR)
Section 5.2.4: New Clause on UE Capabilities
Documents capability variations for NB-IoT NTN:
- Transmit Power Classes: PC3/PC5 (Rel-18), PC1/PC2 (Rel-19), potential >31 dBm (Rel-20)
- Receive Antennas: Single (typical) vs. dual (enhanced)
- Enhanced Capabilities: Higher TX power, improved RF sensitivity
Section 5.2.5: Enhanced Multi-User Considerations
Replaces assumption of uniform UE configuration with capability-aware scheduling:
- Capability-Aware Resource Allocation: Different SPS periods based on UE capabilities
- Multi-Tone Transmission for Enhanced UEs: Increased bitrate and/or reduced resource usage
- Dynamic Load Balancing: Optimized capacity through capability-based prioritization
- Service Level Differentiation: Three-tier service model aligned with UE capabilities
Includes Figure 1 demonstrating practical multi-user scheduling scenario with three UE types.
Section 5.1.2.2: UE Delay Tables
Updates delay estimation tables (5.1.2-2, 5.1.3-1) to include:
- Voice bundling periods: 80, 160, 320 ms
- Codec frame sizes: 20, 40, 80, 160, 320 ms
- Mouth-to-ear delay estimates for GEO-TN and GEO-GEO scenarios
Recommendations
- Adopt capability-aware ULBC system design rather than assuming single baseline configuration
- Agree on minimum set of 3 ULBC target bitrates for codec evaluation (approximate ranges: [600-1000], [1000-1800], [1800-3000] bits/s)
- Document agreed ULBC target bitrates in Pdoc
- Consider higher bitrates in second ULBC standardization phase
References
Key dependencies: S4aA260006 (previous version), S4-260144 (TR 26.940 v0.5.1), S4-260255 (ULBC Re-Focus Proposal), TS 36.763 (UE radio transmission/reception), S4-251863 (system capacity), S4aA250112 (error trace methodology), S4aA250118 (RAN simulation results)
|
Extracted Proposals
No explicitly marked proposals (e.g., "Proposal X:" or "Proposal:") were found in the document.
The document contains technical discussions, a pCR (pseudo Change Request) to TR 26.940, and recommendations in its Conclusion section, but none of these is explicitly labeled as a "Proposal".
The Conclusion section includes suggestions and recommendations such as "it is suggested to agree on a minimum set of 3 ULBC target bitrates" and "it is proposed to document the ULBC target bit rates in the Pdoc".
|
|
|
(pdf)
|
On the use of objective metrics in ULBC standardization |
Orange, Dolby Laboratories Inc. |
Summary of 3GPP Technical Document on Objective Metrics in ULBC Standardization
Introduction and Scope
This document addresses the "Study on Ultra Low Bitrate Speech Codec" (FS_ULBC) approved at SA#107, specifically focusing on study objective 5 from the WID regarding performance requirements and test methodologies for speech quality, intelligibility, and conversational quality across various conditions (clean/noisy speech, tandeming with IMS codecs, clean/GEO channel conditions).
The contribution provides correlation analysis results of objective quality models as a complement to subjective test results on clean speech and music/mixed content in TR 26.940, building upon previous discussions in S4-251814.
Main Technical Contributions
Test Methodologies - General Considerations (Clause 9.1.1-9.1.2)
Quality Impairment Categories for ULBC:
- Loss of listening-only audio quality
- Audio bandwidth loss
- Impaired intelligibility
- Impaired speaker identifiability
- Prosodic impairments
- Hallucination (word and phone confusions)
- Sensitivity to non-speech input (background noise, music, noisy speech, interfering talkers, reverberant speech)
Testing Challenges:
- ML-based ULBC codecs introduce new impairment categories (e.g., hallucination) not present in signal-processing based codecs (AMR, AMR-WB, EVS)
- Traditional P.800 ACR methodology may not optimally quantify all potential impairments
- DCR methodology focuses on differences to reference, suitable for small impairments and prosodic differences
- Previous 3GPP codec standardization (AMR, AMR-WB, EVS) used ACR for clean speech and DCR for SWB, mixed-bandwidth, noisy speech, and music evaluations
Alternative Test Methods Listed:
- Diagnostic Rhyme Tests (DRT)
- Modified Rhyme Tests (MRT)
- MOS testing for speaker similarity
- Speaker verification/identification tests
- Prosodic naturalness MOS tests
- Intonation recognition tests
- Transcription tests for word and semantic equivalence
- Phoneme recognition tests
- Automatic speech recognition tests
- P.835 multi-dimensional rating scales for speech enhancement evaluation
Subjective Testing Considerations (Clause 9.1.3)
Source Material Robustness (9.1.3.1):
- Multiple languages with diverse intonations
- Various phonetic and linguistic environments
- Different voice pitches and speaking styles
- Overlapping talkers
Real-world Acoustic Conditions (9.1.3.2):
- Clean environments (minimal background noise)
- Noisy environments (traffic, human chatter, vehicle)
- Varying reverberation levels (RT60 ranging from 0.3s to 1.0s)
Tandeming and Compatibility Testing (9.1.3.3):
- Testing with speech previously encoded by ITU-T G.711, AMR, AMR-WB, and EVS
- Various input levels: -16dBov, -26dBov, and -36dBov
Conclusion:
- P.800 ACR/DCR serves as backbone for most subjective testing
- Other methodologies may be considered
- Emphasis on diverse test material covering multilingual/multi-speaker testing, real-world acoustic conditions, and tandeming
Objective Testing Considerations (Clause 9.1.4)
Correlation Analysis on Clean Speech (9.1.4.1):
Evaluated objective models from references [7-11]:
- Speech-oriented metrics: PESQ, POLQA, ViSQOL-S, WARP-Q, DNSMOS, NISQA, NORESQA, UTMOS
- General audio metrics: PEAQ, ViSQOL-A
- Additional metric: SCOREQ
Evaluation metrics used: Pearson correlation coefficient, RMSE, Kendall's Tau rank correlation coefficient
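For readers unfamiliar with these three figures of merit, the sketch below computes them for a pair of score lists in plain Python. The scores are invented for illustration and are not results from the contribution; the Kendall variant shown is tau-a (ties counted as neither concordant nor discordant), which may differ from the variant the authors used.

```python
import math


def pearson(x, y):
    """Pearson linear correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def rmse(x, y):
    """Root mean squared error between two score lists."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))


def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    n = len(x)
    score = 0
    for i in range(n):
        for j in range(i + 1, n):
            product = (x[i] - x[j]) * (y[i] - y[j])
            if product > 0:
                score += 1
            elif product < 0:
                score -= 1
    return score / (n * (n - 1) / 2)


# Hypothetical per-condition scores (e.g., subjective MOS vs. model prediction).
subjective = [1.8, 2.5, 3.1, 3.9, 4.4]
objective = [2.0, 2.4, 3.6, 3.4, 4.5]

print(pearson(subjective, objective))
print(rmse(subjective, objective))
print(kendall_tau(subjective, objective))
```

Pearson and RMSE measure linear agreement and absolute error, while Kendall's tau only checks whether the model ranks conditions in the same order as listeners, which is why the three are usually reported together.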
Key Observations (Clean Speech):
- Best performing models (POLQA, UTMOS, PESQ, WARP-Q, SCOREQ) accurately predicted monotonic bitrate/quality behavior of multirate codecs
- Models operating at 16 kHz (PESQ without mapping, UTMOS and WARP-Q with mapping) showed relatively good performance even for fullband codecs
- Mapping generally improves accuracy (RMSE) except for a few models (PESQ, POLQA)
Correlation Analysis on Music/Mixed Content:
Evaluated models from references [7-12]: POLQA, PEMO-Q, ViSQOL-A, and 2f-model
Key Observations (Music/Mixed Content):
- POLQA (despite not being recommended for non-speech signals) gave best correlation results (Pearson, Kendall, RMSE after 3rd order mapping)
- 2f model was second-best performing
- ViSQOL Audio, PEAQ, and PEMO-Q showed fair performance despite being adapted to music/mixed content
- Correlation scores lower than clean speech, possibly due to more difficult task of predicting quality for general audio and mismatch with DCR test methodology grading
Discussion (9.1.4.2):
- P.862 (PESQ) has been officially "withdrawn" by ITU-T and cannot be considered a valid standard
- P.863 (POLQA) remains the main ITU-T standard, with P.SAMD emerging as a potential alternative
- Testing and parameter adjustment based on objective tools not recommended
- 3GPP TR 26.921 documented that tuning noise reduction based on PESQ should be avoided
Conclusion (9.1.4.3):
- Subjective testing remains "golden reference" for codec selection
- Objective metrics NOT recommended for codec selection criteria or codec tuning
- Correlation of subjective/objective metrics may be considered for characterization of new codec
- Objective metrics have merits in other tasks such as codec conformance testing
Document Type
This is a pseudo Change Request (pCR) to TR 26.940, specifically targeting Clause 9 (Test methodologies) with additions to subclauses 9.1.1 through 9.1.4.
|
Proposals
No explicitly formatted proposals (e.g., "Proposal 1:" or "Proposal:") were found in this document.
The document contains a section labeled "Proposal" (on page 1) and "Conclusion" sections (9.1.3.4 and 9.1.4.3), but these are section headers rather than formal proposals.
The "Proposal" section on page 1 contains descriptive text about objective quality models and proposed revisions to TR 26.940. The "Conclusion" sections contain observations and recommendations.
|
|
|
(pdf)
|
[FS_ULBC] Analysis and recommended handling of the reply liaisons from other working groups to SA4 on ULBC |
Qualcomm Incorporated |
No summary available
|
No proposals available
|
|
|
(pdf)
|
[FS_ULBC] ULBC Re-Focus Proposal |
Dolby Laboratories Inc., Nokia, Novamint, Vivo |
No summary available
|
No proposals available
|
|
|
(pdf)
|
[FS_ULBC] Proposed ULBC design constraints living document v0.0.2 |
vivo Mobile Communication Co., Orange |
No summary available
|
No proposals available
|
|
|
|
[FS_ULBC] Analysis and recommended handling of the reply liaisons from other working groups to SA4 on ULBC |
Qualcomm Incorporated |
No summary available
|
No proposals available
|
|
|
(pdf)
|
On the use of objective metrics in ULBC standardization |
Orange, Dolby Laboratories Inc. |
No summary available
|
No proposals available
|
|
|
(pdf)
|
[FS_ULBC] On the scheduling timing uncertainty |
Qualcomm Incorporated, Ericsson LM |
No summary available
|
No proposals available
|
|
|
(pdf)
|
[FS_ULBC] Feasible bitrates for the NTN-TDL-C channel model with 10-degree elevation angle |
Qualcomm Incorporated, Xiaomi |
No summary available
|
No proposals available
|
|
|
(pdf)
|
[FS_ULBC] Feasible TBS values and packet loss traces for 80ms bundling period for ULBC over NB-IoT NTN GEO channel |
Qualcomm Incorporated |
No summary available
|
No proposals available
|
|
|
(pdf)
|
[FS_ULBC] On transmission delay for voice over NB-IoT NTN |
Qualcomm Incorporated |
No summary available
|
No proposals available
|
|
|
(pdf)
|
Reply LS on issues related to support of IMS voice over NB-IoT NTN connected to EPC |
Qualcomm Incorporated |
No summary available
|
No proposals available
|
|
|
(pdf)
|
Reply LS on the RAN simulation assumptions, bundling period and SPS for ULBC |
Qualcomm Incorporated |
No summary available
|
No proposals available
|
|
|
(pdf)
|
[FS_ULBC] On target platforms for ULBC |
Huawei Technologies Co., Ltd. |
No summary available
|
No proposals available
|
|
|
(pdf)
|
[FS_ULBC] On vehicle emergency call scenario for ULBC |
Huawei Technologies Co., Ltd. |
No summary available
|
No proposals available
|
|
|
(pdf)
|
[FS_ULBC] On Assumptions and Open Issues for NB-IoT GEO Simulation |
China Mobile Com. Corporation |
No summary available
|
No proposals available
|
|
|
(pdf)
|
[FS_ULBC] Updates of the permanent document based on 3GPP TR 23.700-19 |
vivo Mobile Communication Co., |
No summary available
|
No proposals available
|
|
|
(pdf)
|
[FS_ULBC] Analysis on complexity evaluation of ULBC with WMOPS |
Bytedance |
No summary available
|
No proposals available
|
|
|
(pdf)
|
On the use of objective metrics in ULBC standardization |
Orange, Dolby Laboratories Inc. |
No summary available
|
No proposals available
|
|
|
(pdf)
|
[FS_ULBC] ULBC Performance Requirements |
Apple Inc. |
No summary available
|
No proposals available
|
|
|
(pdf)
|
[FS_ULBC] ULBC Codec Testing in Background Noise |
Apple Inc., Lenovo |
No summary available
|
No proposals available
|
|
|
(pdf)
|
[FS_ULBC] TR 26.940 V 0.6.0 |
China Mobile Com. Corporation |
No summary available
|
No proposals available
|
|
|
(pdf)
|
On complexity estimation of ULBC |
Fraunhofer IIS |
No summary available
|
No proposals available
|
|
|
(pdf)
|
[FS_ULBC] Permanent Document v0.6.0 |
China Mobile Com. Corporation |
No summary available
|
No proposals available
|
|
|
(pdf)
|
[FS_ULBC] WorkPlan of FS_ULBC v0.6 |
China Mobile Com. Corporation |
No summary available
|
No proposals available
|
|
|
(pdf)
|
[FS_ULBC] On codec bitrate and capacity discussion for ULBC |
vivo, Samsung, Spreadtrum, MediaTek Inc. |
No summary available
|
No proposals available
|
|
|
(pdf)
|
[FS_ULBC] Alignment Analysis on Complexity of DAC model |
vivo Mobile Communication Co., |
No summary available
|
No proposals available
|
|
|
(pdf)
|
[FS_ULBC] Analysis of AI Codec Complexity Scaling |
vivo Mobile Communication Co., |
No summary available
|
No proposals available
|
|
|
(pdf)
|
[FS_ULBC] Analysis of AI Codec Real-Time Performance (RTF) and Complexity Scaling |
vivo Mobile Communication Co., Xiaomi Technology, Spreadtrum, Bytedance |
No summary available
|
No proposals available
|
|
|
(pdf)
|
On ULBC complexity and RTF analysis |
Dolby Laboratories Inc., Novamint, Nokia |
No summary available
|
No proposals available
|
|
|
(pdf)
|
[FS_ULBC] On device capability diversity |
Dolby Laboratories Inc., Nokia, Novamint |
No summary available
|
No proposals available
|
|
|
(pdf)
|
[FS_ULBC] On device capability diversity |
vivo Mobile Communication Co., |
No summary available
|
No proposals available
|
|
|
(pdf)
|
Reply LS on the RAN simulation assumptions, bundling period and SPS for ULBC |
Qualcomm Incorporated |
No summary available
|
No proposals available
|
|
|
(pdf)
|
[FS_ULBC] TR 26.940 V 0.6.1 |
China Mobile Com. Corporation |
No summary available
|
No proposals available
|
|
|
(pdf)
|
[FS_ULBC] WorkPlan of FS_ULBC v0.6 |
China Mobile Com. Corporation |
No summary available
|
No proposals available
|
|