All Summaries - Table View

Meeting: TSGS4_135_India | Agenda Item: 7.8

38 documents found

TDoc Number	Source	Title	Summary
S4-260126	Bytedance	[FS_ULBC] Analysis on complexity evaluation of ULBC with WMOPS	Analysis on Complexity Evaluation of ULBC with WMOPS 1. Introduction This contribution examines the use of WMOPS (Weighted Million Operations Per Second) as a complexity metric for ULBC (Ultra Low Bitrate Codec). WMOPS has been proposed as one of the possible complexity metrics and is traditionally used for evaluating the complexity of 3GPP speech codecs. The analysis focuses on the WMC tool used for automated WMOPS calculation with floating-point C code. 2. Technical Analysis: Discrepancies Between ITU-T Documentation and WMC Tool Implementation The source conducted systematic testing of the WMC tool against the examples provided in ITU-T standards documentation (specifically clause 18.12.7 and related tables in the ITU-T Software Tool Library 2024 User's Manual). Several discrepancies were identified: 2.1 'Move' Operator Counting Issue: Extra MOVE operations are counted by the WMC tool Expected behavior (per Table 18.4): Division by constant `b = a / L` should count as 1 MULT (since 1/L is constant, operation becomes multiplication) Actual WMC output: 1 MULT + 1 MOVE Discrepancy: 1 additional MOVE operation 2.2 Increment Operator ('++') Issue: Missing operations in WMC tool output Expected behavior (per Table 18.4): `(rnd_T0)++` should count as 1 ADD + 1 STORE (equivalent to `rnd_T0 = rnd_T0 + 1`) Actual WMC output: 0 operations counted Discrepancy: 1 ADD + 1 STORE not counted Note: This may not affect actual complexity on DSP implementations where pointer increment can be combined with other operations 2.3 Logical Operators ('AND/OR') Issue: Missing TEST operation counting Expected behavior (per Table 18.4): `if (a!=b || c==d)` should count as 2 ADD + 2 BRANCH + 1 TEST Actual WMC output: 2 ADD + 2 BRANCH Discrepancy: 1 TEST operation missing 2.4 Indirect Addressing Issue: Extra MOVE operation and incorrect INDIRECT counting Expected behavior (per Table 18.4): Double
indirection `Indice[0] = indirect_dico1[indice[0]]` should count as 2 INDIRECT Actual WMC output: 1 MOVE Discrepancy: 2 INDIRECT operations not counted, 1 MOVE added instead 2.5 Instrumentation with Array Subscripts Issue: Arithmetic operations inside array subscripts not counted Expected behavior (per Table 18.3): Operations like `a[i*2+1]` should count arithmetic operations within the subscript (1 INDIRECT + 1 MULT + 1 ADD) Actual WMC output: Only 1 MOVE counted Discrepancy: INDIRECT, MULT, and ADD operations inside array subscripts are not instrumented 3. Observations and Impact Assessment The source identifies three key observations: Systematic discrepancies exist between ITU-T standards documentation and WMC tool implementation, with both over-counting (e.g., extra MOVE operations) and under-counting (e.g., missing operations in array subscripts) observed Potential significance for AI codecs: Some discrepancies, particularly the counting of MOVE operators and instrumentation inside arrays, could significantly impact WMOPS measurements for AI-based codecs Need for clarification: If WMOPS is adopted as a complexity metric for ULBC, these differences must be carefully addressed and the calculation methodology must be clearly defined 4. Proposal The source proposes to document the findings from Clause 2 and Clause 3 in clause 6.2 of the permanent document to ensure proper consideration of these WMOPS calculation issues in the ULBC complexity evaluation framework.
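The discrepancies listed for S4-260126 can be tallied mechanically. The following is a minimal Python sketch, not part of the contribution: the operation counts are copied from the summary above, while the case labels and the helper `count_delta` are hypothetical names introduced for illustration. No WMOPS weights are applied, because the summary does not list them.

```python
from collections import Counter

# (expected per ITU-T documentation, actual WMC output), as quoted in the summary.
# The dictionary keys are illustrative labels, not names used by the source.
CASES = {
    "division by constant": (Counter(MULT=1), Counter(MULT=1, MOVE=1)),
    "increment operator": (Counter(ADD=1, STORE=1), Counter()),
    "logical AND/OR": (Counter(ADD=2, BRANCH=2, TEST=1), Counter(ADD=2, BRANCH=2)),
    "double indirection": (Counter(INDIRECT=2), Counter(MOVE=1)),
    "array subscript": (Counter(INDIRECT=1, MULT=1, ADD=1), Counter(MOVE=1)),
}


def count_delta(expected, actual):
    """Return actual minus expected per operation, omitting operations that match.

    Positive values mean over-counting by the tool, negative values under-counting.
    """
    ops = sorted(set(expected) | set(actual))
    return {op: actual[op] - expected[op] for op in ops if actual[op] != expected[op]}


if __name__ == "__main__":
    for label, (expected, actual) in CASES.items():
        print(f"{label}: {count_delta(expected, actual)}")
```

A positive entry (such as the extra MOVE) corresponds to the over-counting noted in the observations, and a negative entry to under-counting.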
S4-260128	Bytedance	[FS_ULBC] Influence of code optimization on WMOPS	Summary of S4-260128: Influence of Code Optimization on WMOPS 1. Introduction and Motivation This contribution investigates the impact of C code implementation choices on WMOPS (Weighted Million Operations Per Second) measurements for neural audio codecs, specifically in the context of the ULBC (Ultra-Low Bitrate Codec) study. The source examines whether WMOPS, traditionally used for 3GPP speech codec complexity evaluation, is suitable for neural audio codecs given that actual C implementation can significantly affect WMOPS measurements. 2. Experimental Analysis 2.1 Operator-Level Analysis The source conducted experiments on Conv1D and Conv1DTranspose operators, which are extensively used in DAC (Descript Audio Codec) for audio feature dimension manipulation: Non-optimized implementation: Naïve nested loop implementation Optimized implementation: Loop unrolling along kernel size dimension only Key Results: - Conv1D: 441 WOPS (non-optimized) → 301 WOPS (optimized) = 68.25% ratio - Conv1DTranspose: 554 WOPS (non-optimized) → 260 WOPS (optimized) = 46.93% ratio Finding: The same optimization strategy yields significantly different optimization ratios for different operators. 
2.2 Full-Model Level Analysis Using the optimized and non-optimized operator implementations, the source measured WMOPS for two DAC configurations (enc16dec384 and enc64dec1536) and compared against previously reported results [4]: Total WMOPS: - enc16dec384: 13,320.35 (non-opt) → 8,152.01 (opt) → 5,785.17 (reported in [4]) - enc64dec1536: 201,552.55 (non-opt) → 123,966.49 (opt) → 84,441.99 (reported in [4]) Encoder WMOPS: - enc16dec384: 3,411.08 (non-opt) → 2,621.98 (opt) → 1,060.79 (reported in [4]) - enc64dec1536: 50,089.70 (non-opt) → 39,604.59 (opt) → 13,675.30 (reported in [4]) Decoder WMOPS: - enc16dec384: 9,847.22 (non-opt) → 5,484.21 (opt) → 4,724.38 (reported in [4]) - enc64dec1536: 151,291.59 (non-opt) → 84,255.25 (opt) → 70,766.69 (reported in [4]) 3. Observations and Conclusions The source draws two critical observations: Code optimization sensitivity: Simple optimizations (e.g., single-layer loop unrolling) can result in widely different WMOPS results for the same model Inconsistent optimization impact: The same optimization strategy produces different WMOPS reduction ratios across different models Main Conclusion: If WMOPS is adopted as a complexity metric for ULBC, results will be highly influenced not only by model design but also by the actual C code implementation, potentially making comparisons between different codec proposals inconsistent. 4. Proposal The source proposes to document the experimental findings and observations as a new clause 7.6.5 "WMOPS analysis on DAC" in TR 26.940, with three sub-clauses: - 7.6.5.1: On operator level - 7.6.5.2: On full-model level - 7.6.5.3: Observation This would capture the implementation-dependency issues of WMOPS measurements for neural audio codecs in the technical report.
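The optimization ratios quoted for S4-260128 follow directly from the reported figures. As a minimal sketch (the figures are taken from the summary above; the function and dictionary names are hypothetical), the ratio is the optimized count expressed as a percentage of the non-optimized count:

```python
def ratio_percent(non_optimized, optimized):
    """Optimized cost as a percentage of the non-optimized cost."""
    return 100.0 * optimized / non_optimized


# Operator-level figures (non-optimized, optimized) from clause 2.1 of the summary.
OPERATOR_COUNTS = {
    "Conv1D": (441, 301),
    "Conv1DTranspose": (554, 260),
}

# Total WMOPS (non-optimized, optimized) from clause 2.2 of the summary.
TOTAL_WMOPS = {
    "enc16dec384": (13320.35, 8152.01),
    "enc64dec1536": (201552.55, 123966.49),
}

if __name__ == "__main__":
    for name, (before, after) in {**OPERATOR_COUNTS, **TOTAL_WMOPS}.items():
        print(f"{name}: {ratio_percent(before, after):.2f}%")
```

The same helper can be applied to the encoder and decoder figures to compare how the single optimization affects each part of the model.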
S4-260132	China Mobile Com. Corporation	[FS_ULBC] Discussion of FS_ULBC Objective Speech Quality Assessment Method	Summary of S4-260132: Discussion of FS_ULBC Objective Speech Quality Assessment Method Background This contribution addresses speech quality assessment challenges for ultra-low bitrate codecs (ULBC). While subjective testing remains the benchmark for ULBC codec selection, objective speech evaluation methods can serve as predictive tools during intermediate testing and parameter adjustment processes, enabling more convenient and efficient quality verification. Overview of Existing Speech Objective Quality Evaluation Methods The document provides a comprehensive comparison of available objective assessment tools: Standardized ITU-T Methods P.863 (POLQA): Full-reference method, widely adopted in ITU/3GPP, supports NB/WB/SWB, maintains performance below 4 kbps in SWB mode P.563: No-reference method suitable for real-time applications, but less accurate for extreme noise or complex distortions compared to full-reference methods Open Source Methods ViSQOL: Full-reference, performs well for low bitrates (under 8 kbps with good MOS correlation), but not formally standardized STOI/ESTOI: Full-reference, focuses on speech intelligibility, computationally efficient with high correlation to subjective tests in noisy conditions. 
ESTOI improves robustness to nonlinear distortions (e.g., neural codecs) SCOREQ: No-reference model with strong cross-domain robustness and improved correlation with human judgments Capabilities and Limitations for ULBC The document analyzes each method's suitability for ultra-low bitrate scenarios: P.863: Most widely adopted, broad bandwidth support, proven performance at low bitrates P.563: Limited adaptability to non-linear distortions from neural codecs ViSQOL: Good consistency with MOS at low bitrates but lacks formal standardization STOI/ESTOI: Effective for intelligibility assessment, robust to nonlinear distortions, but not ITU-T/3GPP standardized SCOREQ: Addresses domain-generalization shortcomings with improved out-of-domain robustness Proposal Recommended Objective Assessment Methods After excluding unsuitable methods, the contribution recommends considering P.863, ViSQOL, and ESTOI as potential objective quality assessment methods for ULBC. Text Proposal for TR 26.940 The document proposes a pCR to TR 26.940 Section 9 (Test methodologies) that includes: New Section 9.1.1: Typical Quality Impairments Identifies ULBC-specific impairment categories: - Loss of listening-only audio quality - Audio bandwidth loss - Impaired intelligibility - Impaired speaker identifiability - Prosodic impairments - Hallucination (word and phone confusions) - Sensitivity to non-speech input (background noise, music, reverberant speech) New Section 9.1.2: Challenges of Quality Assessment Addresses testing challenges specific to ULBC: Traditional 3GPP Practice: AMR/AMR-WB/EVS used P.800 ACR for clean speech and DCR for noisy/mixed content, but did not focus on intelligibility, speaker identifiability, or prosodic impairments ULBC-Specific Challenges: ML-based codecs introduce new impairment types (e.g., hallucination) requiring alternative test methods Additional Test Methodologies (non-exhaustive list): Diagnostic Rhyme Tests (DRT) Modified Rhyme Tests (MRT) MOS testing for 
speaker similarity Speaker verification/identification tests Prosodic naturalness MOS tests Intonation recognition tests Transcription tests for word/semantic equivalence Phoneme recognition tests Automatic speech recognition tests Objective Methods as Optional Tools: Proposes documenting that objective methods (P.863, ViSQOL, ESTOI, etc.) can be considered as optional tools for predicting speech quality during ULBC simulation testing and parameter optimization, acknowledging that subjective listening remains the most important evaluation method despite being time and resource-intensive Speech Enhancement Evaluation: Notes that P.835 multi-dimensional rating scales can be used for speech enhancement tools that may be part of ULBC Technical Contribution The main technical contribution is establishing a framework for objective quality assessment in ULBC standardization that: 1. Recognizes the unique challenges of ML-based codecs 2. Identifies suitable objective methods as predictive tools 3. Proposes their documentation as optional assessment methods in TR 26.940 4. Maintains subjective testing as the primary benchmark while enabling more efficient intermediate evaluation
S4-260136	Xiaomi Technology	Updates to the simulation results for FS_ULBC	Summary of S4-260136: Updates to Simulation Results for FS_ULBC 1. Introduction and Context This document presents updated link-level simulation (LLS) results for Ultra-Low Bitrate Communication (ULBC) over Non-Terrestrial Networks (NTN). The simulations follow the NTN-TDL-C channel model as specified in TS 36.102. This revision adds: - Missing simulation results for NTN-TDL-C 10 NPUSCH - New simulation results for NTN-TDL-C 10 NPDSCH - Updated TBS (Transport Block Size) values for both NPDSCH and NPUSCH with 10 degrees elevation angle The simulations are based on parameters discussed in S4aA250038 and follow agreements from previous meetings. 2. Channel Model Assumptions Key Parameters: - Satellite elevation angle: 12.5 degrees for link budget calculations - Channel model parameters (delay and power of each path) determined for 10 degrees elevation (approximation of 12.5 degrees) - Channel model: NTN-TDL-C from TS 36.102 3. 
Link Budget Analysis 3.1 CNR Baseline Values (from RAN1) Uplink: - CNR = 2.6 dB (0 dBi UE antenna gain, 3.75 kHz SCS, 1 tone, 23 dBm UE TX power) Downlink: - CNR = -3.3 dB (0 dBi UE antenna gain, 15 kHz SCS, 12 tones, 1 RX antenna, 7 dB noise figure) 3.2 Additional UL CNR Configurations
| Configuration | SCS | UE Power | UL CNR |
|---------------|-----|----------|---------|
| Config 1 | 3.75 kHz | 23 dBm | 2.6 dB |
| Config 2 | 15 kHz | 23 dBm | -3.42 dB |
| Config 3 | 3.75 kHz | 26 dBm | 5.6 dB |
| Config 4 | 15 kHz | 26 dBm | -0.42 dB |
| Config 5 | 3.75 kHz | 31 dBm | 10.6 dB |
| Config 6 | 15 kHz | 31 dBm | 4.58 dB |
3.3 Additional DL CNR Configurations
| Configuration | Number of RX | G/T value | DL CNR |
|---------------|--------------|-----------|---------|
| Config 1 | 1 | -31.6 | -3.3 dB |
| Config 2 | 2 | -31.6 | -0.3 dB |
| Config 3 | 1 | -28.6 | -0.3 dB |
| Config 4 | 2 | -28.6 | 2.7 dB |
4. NPUSCH Simulation Results 4.1 Common Simulation Parameters Scenario: GEO orbit, 10 degree elevation Carrier frequency: 2 GHz Channel model: NTN-TDL-C with 5 ns delay spread Physical channel: NPUSCH format 1 SCS: 3.75 kHz and 15 kHz Number of tones: Single tone Waveform: SC-FDMA MIMO: SISO (1T1R) DMRS: OS#4 per slot for 3.75 kHz, OS#3 per slot for 15 kHz UE velocity: 3 km/h Symbols per slot: 7 Slots per RU: 16 Modulation: π/4-QPSK Receiver: MMSE with real channel estimation Target BLER: 1%, 2%, 6%, 10% 4.2 Results Part 1 - Various TBS and Bundling Times 80 ms Bundling Time 144 bits (Cases 1-4): - Case 1: 15 kHz, MCS 2, 4 RUs, 2 reps → SNR: -4.97 dB (10% BLER) to -4.35 dB (1% BLER) - Case 2: 15 kHz, MCS 2, 2 RUs, 1 rep → SNR: 1.8 dB to 2.7 dB - Case 3: 3.75 kHz, MCS 10, 1 RU, 2 reps → SNR: 1.5 dB to 2.30 dB - Case 4: 15 kHz, MCS 10, 1 RU, 4 reps → SNR: -4.64 dB to -3.90 dB 256 bits (Cases 5-8): - SNR ranges from -2.84 dB to 5.9 dB depending on configuration 328 bits (Cases 9-11): - SNR ranges from 
-1.53 dB to 9.45 dB depending on configuration 424 bits (Case 12): - SNR: 1.44 dB (10% BLER) to 2.05 dB (1% BLER) 160 ms Bundling Time 208 bits (Cases 13-15): - SNR ranges from -5.56 dB to 1.53 dB 424 bits (Case 16): - SNR: -1.95 dB to -1.52 dB 600 bits (Case 17): - SNR: -1.38 dB to -0.97 dB 808 bits (Cases 18-19): - SNR ranges from -1.42 dB to 0.21 dB 320 ms Bundling Time 328 bits (Cases 20-25): - SNR ranges from -6.80 dB to -0.22 dB 776 bits (Cases 26-27): - SNR ranges from -2.48 dB to 6.46 dB 1000 bits (Cases 28-30): - SNR ranges from -1.95 dB to 7.47 dB 1544 bits (Case 31): - SNR: 0.48 dB to 0.76 dB 4.3 Results Part 2 - Additional Cases Covers Cases 32-46 with various TBS values (440, 584, 680, 936, 1096, 1384, 1736 bits) for 80 ms and 160 ms bundling times. SNR requirements range from -3.6 dB to 8.0 dB depending on configuration. 5. NPDSCH Simulation Results 5.1 Common Simulation Parameters Scenario: GEO orbit, 10 degree elevation Carrier frequency: 2 GHz Channel model: NTN-TDL-C with 5 Hz Doppler spread Physical channel: NPDSCH SCS: 15 kHz Number of tones: 12 Waveform: OFDM MIMO: SISO (1T1R or 1T2R) Symbols per subframe: 14 Modulation: QPSK Receiver: MMSE with NRS-bundling and real channel estimation Target BLER: 1%, 2%, 6%, 10% 5.2 Results Part 1 - Various TBS and Bundling Times 80 ms Bundling Time 144 bits (Case 0a): - 1R: SNR -6.6 dB to -5.3 dB - 2R: SNR -9.3 dB to -8.1 dB 256 bits (Case 0b): - 1R: SNR -4.3 dB to -3.1 dB - 2R: SNR -7.1 dB to -6.1 dB 328 bits (Cases 1-2): - SNR ranges from -11.8 dB to -4.0 dB 424 bits (Cases 3, 3b): - SNR ranges from -11.6 dB to -5.0 dB 160 ms Bundling Time 208 bits (Case 4): - SNR: -15.3 dB to -11.8 dB 424 bits (Cases 5, 5b): - SNR ranges from -14.6 dB to -8.0 dB 600 bits (Case 6): - SNR: -11.1 dB to -7.2 dB 808 bits (Cases 7, 7b, 8, 8b): - SNR ranges from -11.0 dB to -1.1 dB 320 ms Bundling Time 328 bits (Cases 9-11b): - SNR ranges from -17.7 dB to -8.1 dB 776 bits (Cases 12, 12b): - SNR ranges from -14.8 dB to -8.1 dB 
1000 bits (Case 13): - SNR: -10.3 dB to -6.4 dB 1544 bits (Cases 14, 14b): - SNR ranges from -11.7 dB to -5.0 dB 5.3 Results Part 2 - Additional Cases Covers Cases 15-46 with various TBS values (440, 584, 680, 936, 1096, 1384, 1736 bits) for 80 ms and 160 ms bundling times. 1T1R Results: - SNR requirements range from -10.9 dB to 1.1 dB 1T2R Results: - SNR requirements range from -13.6 dB to -1.9 dB - Approximately 3 dB gain compared to 1T1R configurations 6. Conclusions The document recommends considering these simulation results for determining design constraints for ULBC. The results demonstrate: - Performance across various TBS values (144 to 1736 bits) - Multiple bundling times (80, 160, 320 ms) - Different SCS configurations (3.75 kHz, 15 kHz) - Impact of repetitions on SNR requirements - Benefits of 2 RX antennas (approximately 3 dB gain)
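The additional UL and DL CNR configurations listed in clause 3 of the S4-260136 summary are numerically consistent with simple dB scaling from the RAN1 baselines. The Python sketch below reproduces the tabulated values under that assumption. The scaling rule is an inference from the numbers, not a method stated in the summary, and the function names are hypothetical.

```python
import math

# RAN1 baselines quoted in the summary.
UL_BASE_CNR_DB, UL_BASE_SCS_KHZ, UL_BASE_TX_DBM = 2.6, 3.75, 23.0
DL_BASE_CNR_DB, DL_BASE_NUM_RX, DL_BASE_GT = -3.3, 1, -31.6


def ul_cnr_db(scs_khz, tx_dbm):
    """Assumed rule: CNR rises 1 dB per dB of TX power and falls with the
    single-tone bandwidth ratio (10*log10 of the SCS ratio)."""
    bandwidth_penalty = 10.0 * math.log10(scs_khz / UL_BASE_SCS_KHZ)
    return UL_BASE_CNR_DB + (tx_dbm - UL_BASE_TX_DBM) - bandwidth_penalty


def dl_cnr_db(num_rx, g_over_t):
    """Assumed rule: CNR rises with 10*log10 of the RX antenna count and
    1 dB per dB of G/T improvement."""
    rx_gain = 10.0 * math.log10(num_rx / DL_BASE_NUM_RX)
    return DL_BASE_CNR_DB + rx_gain + (g_over_t - DL_BASE_GT)


if __name__ == "__main__":
    for scs, tx in [(3.75, 23), (15, 23), (3.75, 26), (15, 26), (3.75, 31), (15, 31)]:
        print(f"UL SCS {scs} kHz, {tx} dBm: {ul_cnr_db(scs, tx):.2f} dB")
    for num_rx, gt in [(1, -31.6), (2, -31.6), (1, -28.6), (2, -28.6)]:
        print(f"DL {num_rx} RX, G/T {gt}: {dl_cnr_db(num_rx, gt):.1f} dB")
```

Under this reading, the roughly 3 dB benefit of a second RX antenna in the DL CNR table matches the approximately 3 dB gain reported for 1T2R over 1T1R in the NPDSCH results.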
S4-260137	Huawei Technologies Co., Ltd.	[FS_ULBC] On eCall scenario for ULBC	Summary of S4-260137: On eCall Scenario for ULBC 1. Background This contribution addresses the eCall (emergency call) scenario for Ultra-Low Bitrate Codec (ULBC) work. Previous contributions (S4-251908, SA-251848, SA-251881) emphasized the importance of preserving background signals in emergency communications. China has developed a related national standard "On-Board Emergency Call System for Road Vehicles" expected to take effect on July 1, 2027. The document highlights that eCall scenarios have special requirements and different conditions compared to regular call scenarios, necessitating different design constraints and test methodologies. 2. eCall Scenario Description 2.1 System Overview The eCall system is an in-vehicle safety technology that: - Automatically dials emergency numbers (e.g., 112 in EU) upon severe collision detection - Sends minimum data set (MSD) including GPS location, VIN, collision direction and time - Can be triggered by built-in sensors or manual SOS button - Functions via GEO satellite even without terrestrial network coverage 2.2 Communication Architecture The bi-directional voice data flow involves: - Vehicle side: Integrated microphones and speakers communicating over GEO satellite network - Emergency response center: Connected via terrestrial mobile network (VoLTE, VoNR), fixed-line, or other IMS-supported platform - Key requirement: Background noise captured within vehicle must be delivered with fidelity to emergency response center - Asymmetric requirement: Noise preservation may not be required in the opposite direction (emergency center to vehicle) - Dedicated system: No mobile phones involved in the communication link 3. Key Observations Observation 1: eCall is a dedicated system between vehicles and emergency response centers. 
Speech codec designed for eCall is not necessarily the same as that for regular call scenarios, allowing for separate design constraints or performance requirements for ULBC-eCall. Observation 2: Vehicle and emergency response center have significantly different hardware capabilities compared to regular call scenarios: - Less sensitive to power consumption - Higher computational capability - Higher storage capability - This allows for relaxed design constraints and more critical performance requirements for ULBC-eCall 4. Proposed Changes to TR 26.940 4.1 New Clause 4.5: eCall Communication The contribution proposes adding a new scenario (Scenario 4) to TR 26.940 documenting: High-level Prerequisites for ULBC in eCall: Very low bitrate support Background noise preserved with no DTX during the call (at least for vehicle-to-emergency center direction) Error concealment Real-time implementation capability (encoding and decoding) Good audio quality for reasonable QoE Relaxed hardware constraints compared to mobile phones 4.2 Modified Clause 6.2: Design Constraint Parameters The contribution proposes creating separate design constraint columns in Table 6.2-1: - Design Constraint (regular call): Existing constraints - Design Constraint (eCall): New column with eCall-specific constraints Key Differences for eCall Design Constraints:
| Parameter | Regular Call | eCall |
|-----------|-------------|-------|
| Noise Suppression | Not required; noise suppression may be applied | Background noise preserved during call (at least vehicle-to-center direction); opposite direction may not require preservation |
| DTX Support | Support | No DTX support during call (at least vehicle-to-center direction) |
| Complexity/Memory | Standard mobile constraints | Relaxed constraints possible |
5. 
Technical Contributions The main technical contributions of this document are: Introduction of eCall as a distinct ULBC scenario with specific requirements different from regular call scenarios Identification of asymmetric requirements for noise preservation (required vehicle-to-center, optional center-to-vehicle) Proposal for relaxed design constraints based on different hardware capabilities of eCall endpoints Explicit requirement for background noise preservation and no DTX in critical direction Framework for separate performance requirements for eCall vs. regular call scenarios in TR 26.940 The document establishes that eCall scenarios justify different codec design approaches due to their dedicated nature, different hardware capabilities, and specific regulatory/safety requirements.
S4-260141	Huawei Technologies Co., Ltd.	[FS_ULBC] On target platforms for ULBC	Summary of S4-260141: Target Platforms for ULBC 1. Introduction and Motivation This contribution addresses a gap in TR 26.940 regarding target platforms for Ultra Low Bit rate Codec (ULBC) deployment. While the TR currently discusses NPU as a possible platform in clause 6.2.1.5.1, it lacks coverage of other non-NPU platforms. The document aims to complete this missing information, particularly focusing on DSP-enabled devices. 2. Technical Problem Statement The contribution identifies an inconsistency in TR 26.940: Clause 6.2.1.1 states that the codec must support real-time processing alongside other audio processing units and should fit real-time resource constraints of CPUs, potential accelerators, and DSPs across a range of devices Clause 6.2.1.5.1 currently only describes NPU as a target platform, omitting DSP and other non-NPU platforms The source references previous contributions (S4aA250267 and S4-251747) that discussed the need for DSP deployment and provided clarification on DSP-enabled UE devices. 3. DSP-Enabled Device Definition The contribution adopts the definition from S4-251747 for DSP-enabled UE devices: Devices with DSP only or devices with multiple computing units including DSP For multi-unit devices (with CPU/NPU/DSP), there remains a preference for DSP deployment due to: Lower power consumption Reduced heat generation Better battery life Target devices include vehicle-mounted devices, glasses, and mobile phones with low computational capability DSP refers to audio processing DSPs available in mobile phones or other devices for voice communication 4. 
Proposed Text Changes Main Technical Contribution The proposal adds a new paragraph to clause 6.2.1.5.1 that: Acknowledges vendor flexibility: Vendors may choose any computing unit to implement ULBC based on business needs or product constraints Highlights DSP advantages: Cheaper in terms of silicon real estate Less power hungry Less heat generation Typically single-threaded for synchronized real-time execution with low overhead Potentially wider range of product support Establishes DSP deployment requirement: ULBC should be deployable on DSP-enabled UE devices, including: Devices with DSP only Devices with multiple computing units including DSP Provides deployment rationale: Even when CPU or NPU are available, DSP may be preferred for power-sensitive applications (wearables, mobile phones) Defines DSP computational power: Audio processing DSPs typically range from several hundred to over a thousand MIPS Context Preservation The proposal maintains the existing text about: - NPU prevalence in modern smartphones - NPU being 5-20x more power efficient than CPUs for AI tasks - The note that ULBC may need to run on non-NPU platforms in certain configurations 5. Technical Impact This contribution ensures that TR 26.940 provides comprehensive guidance on target platforms for ULBC deployment, balancing the AI-optimized NPU approach with the power-efficient DSP approach, thereby supporting a wider range of device implementations and use cases.
S4-260142	Huawei Technologies Co., Ltd.	[FS_ULBC] On complexity and memory constraints for ULBC	Summary of S4-260142: On Complexity and Memory Constraints for ULBC Introduction This contribution addresses complexity and memory constraints for Ultra Low Bitrate Codec (ULBC) as part of the study in TR 26.940. The document aims to clarify previous discussions on measurement metrics and specific constraints, proposing concrete values for complexity, RAM, and ROM requirements. Main Technical Contributions Complexity Measurement Metrics The contribution proposes using both MACS (Million Multiply-Accumulate Operations per Second) and Codec/Model Size together to characterize ULBC complexity, rather than relying on a single metric: Codec/Model Size: Directly impacts memory requirements and power consumption (more memory footprint requires more frequent DRAM access, leading to higher power consumption) MACS: More suitable for guiding computing hardware unit selection These metrics do not necessarily correlate, as different model architectures can result in very different MACS for the same model size Memory Constraints Clarification The document clarifies confusion from previous contributions (S4aA250253 and S4-251807) regarding the 5-10M parameters proposal: ROM Constraints ROM characterized by overall Model Sizes across all operation modes Major impact is FLASH consumption in product design Minimal power consumption impact (only one model's parameters accessed at a time) Proposed constraint: < 15M parameters (relaxed from previously discussed 10M to support more operation modes) Enables support for ~5 operation modes (e.g., 2-3 bitrates for 2 different sampling rates) RAM Constraints RAM characterized by maximum single Model Size (assuming no switching between operation modes) Proposed constraint: < 3M parameters With 15M ROM, this allows 5 operation modes Whether switching between operation modes will be supported is FFS Complexity Constraints MACS Reference 
Point The contribution references the 2025 Low-Resource Audio Codec (LRAC) Challenge sponsored by Cisco Systems as a relevant benchmark: LRAC Challenge Requirements: - Sampling rate: 24 kHz - Mono audio input - Bitrate: up to 1 kbps (ultralow) and up to 6 kbps (low) - Latency: 30 ms (Track 1) or 50 ms (Track 2) - Compute complexity: ≤ 350 MMACS total; ≤ 150 MMACS receive-side - Winner (ByteDance) used ~4M parameters Proposed MACS Value While LRAC suggested 350 MMACS, the contribution proposes < 600 MMACS for ULBC Rationale: Slightly increased complexity enables better speech quality while remaining within target hardware (e.g., DSP) computational capacity Validation: Handcrafted 3M parameter codec (reduced from SoundStream architecture) achieved 600 MMACS Proposed Design Constraints Summary The contribution proposes the following specific constraints for ULBC: Complexity: Single Model Size < 3M parameters < 600 MMACS RAM: < 3M parameters (assuming no switching between operation modes) Whether switching will be supported is FFS ROM: < 15M parameters Text Proposal The contribution includes a change request to TR 26.940, Section 6.2 (Design Constraint Parameter), Table 6.2-1, adding the specific complexity and memory constraints detailed above to the "Complexity and memory demands" parameter row.
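The relationship between the proposed ROM, RAM, and complexity limits in S4-260142 can be expressed as a small check. This is a sketch under stated assumptions: the limits are the values quoted in the summary, while the `Candidate` class, the `check` helper, and the example codec figures are hypothetical and do not describe any real proposal.

```python
from dataclasses import dataclass
from typing import List

# Proposed limits quoted in the summary (all strict upper bounds).
ROM_LIMIT_PARAMS = 15_000_000   # total parameters across all operation modes
RAM_LIMIT_PARAMS = 3_000_000    # largest single model, no mode switching assumed
COMPLEXITY_LIMIT_MMACS = 600.0  # million multiply-accumulates per second


def max_operation_modes(rom_limit=ROM_LIMIT_PARAMS, per_mode=RAM_LIMIT_PARAMS):
    """Nominal number of maximum-size models that fit in the ROM budget."""
    return rom_limit // per_mode


@dataclass
class Candidate:
    """Hypothetical codec description used only for this illustration."""
    name: str
    params_per_mode: List[int]
    mmacs: float


def check(candidate):
    """Evaluate a candidate against the three proposed constraints."""
    return {
        "rom_ok": sum(candidate.params_per_mode) < ROM_LIMIT_PARAMS,
        "ram_ok": max(candidate.params_per_mode) < RAM_LIMIT_PARAMS,
        "complexity_ok": candidate.mmacs < COMPLEXITY_LIMIT_MMACS,
    }


if __name__ == "__main__":
    print("Nominal operation modes:", max_operation_modes())
    example = Candidate("example-codec", [2_900_000] * 5, 580.0)  # made-up figures
    print(example.name, check(example))
```

Because the limits are strict, a codec at exactly 3M parameters and 600 MMACS, like the handcrafted example mentioned in the summary, sits on the boundary rather than inside it.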
S4-260144	China Mobile Com. Corporation	[FS_ULBC]TR 26.940 V 0.5.1	3GPP TR 26.940 - Study on Ultra Low Bit rate Speech Codecs (Release 20) Document Overview This Technical Report documents the study on Ultra Low Bit rate Speech Codecs (ULBC) for 3GPP Release 20. The primary focus is on IMS voice services over Geostationary Orbit (GEO) satellite access, with additional consideration for multi-party voice communication and other access types. 1. Application Scenarios for Ultra-Low Bit Rate Communication Services 1.1 Scenario 1: IMS Voice Call over GEO (Primary Scenario) Background: - GEO satellites operate at 35,786 km altitude, resulting in ~285ms one-way propagation delay - TR 22.887 and TS 22.261 assume total transmission data rates of [1-3] kbit/s - Current 3GPP codecs (lowest: AMR at 4.75 kbit/s) cannot support these constraints Scenario Descriptions: Main Scenario (4.2.2.2): One UE connects via GEO satellite access - UE1: Phone supporting IMS voice over GEO satellite - UE2: Either "regular" phone (requiring transcoding in core network) or "upgraded" phone supporting ULBC over other access (enabling transcoder-free operation) Sub-Scenario (4.2.2.3): Both UEs connect via GEO satellite access - Less common but relevant for disaster/cyberattack scenarios - Even with transparent payload, voice packets transmit to ground before reaching other UE - May enable transcoder-free operation High-level Prerequisites: - Very low bitrate support - DTX support [TBC] - Error concealment - Real-time implementation on smartphones - Good audio quality for reasonable QoE 1.2 Scenario 2: Multi-Party Voice Communication Background: - Addresses poor/unstable network conditions in WLAN access - Network congestion during peak usage or in areas with limited infrastructure - Codec selection critical for maintaining quality under bandwidth constraints Scenario Description: - One participant (UE1) on unstable network using ULBC, other (UE2) on stable network with conventional codec 
(requires transcoding) - Both participants on unstable networks using ULBC simultaneously (no transcoding needed) High-level Prerequisites: - Ultra-low bitrate capability - Real-time operation on consumer devices (smartphones, laptops) - Audio quality matching or exceeding existing voice services 1.3 Scenario 3: IMS Voice Call with ULBC over Other Access Types Motivation: - ULBC may provide enhanced robustness against poor network conditions - Lower bit rates may benefit coverage/capacity - Reduces transcoding needs when calls bridge GEO and other access types Scenario Description: - Both UEs support ULBC but connect via 3GPP access other than GEO (LTE, NR, WLAN) 2. Channel Characteristics and Service-Related Dependencies 2.1 Mouth-to-Ear Delay Estimation for GEO Scenarios Delay Components: UE Delay (Table 5.1.2-2): - Depends on voice bundling period (80ms, 160ms, 320ms) and codec frame size (20-320ms) - Performance objective range: 268-1435ms (excluding solution-specific delay) - Maximum requirement range: 355-1435ms (excluding solution-specific delay) - Components: 2x voice bundling period + 2x vendor-specific encoder/decoder processing + vendor delay budget + JBM Core Network Delay: - Ground station to core network: [5-20]ms minimum, [200]ms maximum - eNodeB to core network: 5-20ms - Transcoding: 7ms (AMR/AMR-WB) to 14ms (EVS) GEO Transmission Delay: - Minimum: 248ms - Maximum: 280ms (per TS 22.261 KPI requirement) - Variation of 32ms depending on UE location within beam Mouth-to-Ear Delay Estimates (Table 5.1.3-1): For Main Scenario (GEO-TN): - 80ms bundling, 20ms frame: 548ms (lower) to 872ms (upper) + solution-specific delay X - 320ms bundling, 320ms frame: 1952ms (lower) to 2395ms (upper) + solution-specific delay X For Sub-Scenario (GEO-GEO): - 80ms bundling, 20ms frame: 804ms (lower) to 1315ms (upper) + solution-specific delay X - 320ms bundling, 320ms frame: 1952ms (lower) to 2395ms (upper) + solution-specific delay X 2.2 NB-IoT NTN System Parameters 
System Architecture: - Service link: UE to NTN payload - Feeder link: NTN payload to NTN Gateway RAN Parameters: - Channel coding: Turbo code (NPUSCH Format 1 uplink), TBCC (NPDSCH downlink) - MCS: pi/2 BPSK, pi/4 QPSK, QPSK, 16QAM - Subcarrier spacings: 3.75kHz and 15kHz for NPUSCH Format 1 - Resource unit (RU) duration varies with subcarrier spacing and number of tones QoS Characteristics: - Managed through QCI (QoS Class Identifier) - Same PELR (Packet Error Loss Rate) required for UL and DL - Suggests balanced UL/DL time-domain transmission resources 3. Design Constraints 3.1 Design Constraint Parameters (Table 6.2-1) Key parameters identified: - Bit rates: [TBD] - Sample rate and audio bandwidth: [TBD] - Frame length: [TBD] - Complexity and memory demands: [TBD] - Algorithmic delay: Frame size buffering + inherent codec delays (look-ahead, sample-rate conversion, post-processing) - Packet loss concealment (PLC): [TBD] - Noise suppression: [TBD] - Discontinuous transmission (DTX): Including VAD and comfort noise [TBD] - Robustness to non-speech input: [TBD] 3.2 Complexity and Memory Considerations Current Evaluation Analysis: - Codec must support real-time thread and concurrent processing - ML codecs with [5-10M] parameters considered for efficient operation within latency bounds - Must operate within compute constraints of devices for real-time voice communication Memory and Power Considerations: - Larger models require more DRAM access → higher power consumption - Memory footprint critical for device performance and usability Complexity Metrics for AI-Based Codecs: TOPS (Tera Operations Per Second): - TOPS = 2 × MAC unit count × Frequency / 1 trillion - Smartphone NPUs: 8-59 TOPS reported (varying precision: INT8, INT16, FP16) - TOPS/W (power efficiency): 2-15 TOPS/W for smartphone NPUs - Note: TOPS/W typically benchmarked under full-load; lighter workloads like audio codecs may show different characteristics Alternative Metrics: - MACs (Multiply-Accumulate 
operations): Practical for complexity assessment - RTF (Real-Time Factor): Ratio of frame length to encoding/decoding time; reliable but resource-intensive to measure - Model Size: Number of parameters and precision; directly impacts memory and power - Tools available: ptflops, torchinfo, fvcore for MAC counting Observations: - NPUs/TPUs significantly more power-efficient than CPUs for AI tasks (5-20x) - Actual NPU performance depends on computational graph structure - Irregular/sequential/unsupported operations may require CPU fallback - ULBC complexity constraints should be based on desired power consumption/computational performance, not relative to existing 3GPP codecs - Million MACs + model size provide first indication of complexity - RTF useful but requires standardized test benches - WMOPS not directly suitable for NPU-capable devices but mapping to TOPS/RTF beneficial Complexity Target Estimation: - Target devices: Modern smartphones with NPU components - Example: DAC codec estimated at ~150 Giga MAC/sec (~0.3 TOPS) - Actual power consumption on smartphone NPUs: TBD - Model size and architecture significantly impact DRAM operations and overall power consumption 3.3 Design Constraint Verification Editor's note: Algorithmic delay verification method for AI-based codecs required. 3.4 Additional Design Considerations Codec Parameters and Configuration: - Static parameters: Rarely changed, exchanged via SDP or predefined - Dynamic parameters: May change frequently, included in each packet/frame - Common static/dynamic parameters to be identified 4. Existing Technologies and Feasibility Evidence 4.1 Overview of Existing Codec Technologies (Table 7.1.1-1) Categories: 1. 3GPP IMS codecs: Reference conditions (AMR, AMR-WB, EVS) 2. Conventional Ultra Low Bitrate Codecs: DSP-based (MELP/MELPe, AMBE-LR, MPEG-HVXC, TWELP MR, Codec2) 3. AI-based postprocessor: Enhancement of conventional codec output 4. 
AI-based encoder/decoder: - Causal systems: Real-time capable (LPCNet, LyraV2, EnCodec, Mimi-Codec, TS3, TAAE, LMCodec2) - Non-causal systems: Non-real-time due to large look-ahead (DAC, DAC-IBM, SNAC, SpeechTokenizer, SemantiCodec, FunCodec, WavTokenizer, BigCodec, FocalCodec) Key Codec Properties: 3GPP IMS Codecs: - AMR: NB, 5ms delay, 20ms frame, 4.75 kbps - AMR-WB: WB, 5.9375ms delay, 20ms frame, 6.6 kbps - EVS: NB/WB/SWB, 12ms delay, 20ms frame, 7.2-9.6 kbps Conventional Ultra Low Bitrate: - MELP/MELPe: NB, 20-36ms delay, 22.5-90ms frame, 0.6-2.4 kbps - Codec2: NB, 40ms delay, 20-40ms frame, 0.45-2.4 kbps AI-based (Causal): - LPCNet: WB, 25ms delay, 40ms frame, 1.6 kbps - LyraV2: WB, [TBD] delay, 20ms frame, 3.2/6/9.2 kbps - Mimi-Codec: 24kHz, 0ms delay, 80ms frame, 0.55/1.1 kbps AI-based (Non-causal): - DAC: WB/24kHz, 244-366ms delay, 13.3-20ms frame, 0.5-3+ kbps - DAC-IBM: 24kHz, 366ms delay, 13.3ms frame, 0.75/1.5/3 kbps - SNAC: 24kHz, 1000ms delay, 80ms frame, 0.98 kbps 4.2 Observations on Codec Parameters Audio Bandwidth: - Conventional codecs: NB only - Modern AI codecs: WB or higher Algorithmic Codec Delay: - IMS codecs: 25-32ms - Conventional ultra-low: 60-126ms - Causal AI: 20-80ms - Non-causal AI: 500ms+ or full signal Frame Duration: - Conventional ultra-low: Increased vs. standard 20ms VoIP - Some AI codecs maintain 20ms, others increase (e.g., Mimi 80ms) Bitrate: - All listed codecs (except IMS and LyraV2) offer ≥1 mode <3 kbps Complexity: - AI codecs generally higher than IMS/conventional codecs - Exception: LyraV2 requires only 35% of ARM A53 core (RaspberryPi 3+) - RAM: AI codecs significantly higher (e.g., LyraV2: 54MB vs. 
EVS: 294KB) - ROM: AI codecs much higher (e.g., TAAE: 950M parameters ≈ 900MB @ 8-bit; SNAC: 19M ≈ 18MB @ 8-bit; EVS: ~2MB) 4.3 Performance Evaluation P.808 ACR Test Results (Figure 7.1.4-1): Test setup: - English clean speech (4 talkers × 6 samples) - 32kHz, SWB, normalized to -26 dBoV - 24 subjects Key Findings: - Codec2 (all rates) significantly worse than AMR 4.75 kbps - SemantiCodec, LyraV2, LPCNet, Mimi 0.55 kbps: comparable to AMR-WB 6.6 kbps - Three conditions on par or slightly better than EVS 9.6 kbps: - Mimi-Codec 1.1 kbps (causal) - DAC-IBM 1.5 kbps (non-causal) - SNAC 0.98 kbps (non-causal) - AI-based solutions show 2+ MOS improvement over conventional ultra-low bitrate codecs 4.4 Packet Loss Concealment (PLC) Experiments 4.4.1 PLC Experiment with DAC Test Configuration (Table 7.1.5.1-1): - Bitrates: 1, 2.5, 4.5, 6 kbps - Loss percentages: 1%, 6%, 10%, 20% - Frame size: 80ms - Based on NB-IoT NTN data at ~3dB CNR (SCS=15kHz) and 9dB (SCS=3.75kHz) Loss Simulation Methods: 1. Consecutive 4 blocks drop and repeat: Simulates 80ms packet loss 2. 
Interleaved drop and repeat: Spreads loss over 2 packets (adds latency) MUSHRA Test Results (8 listeners): - Despite higher loss percentage, 4.5 kbps and 6 kbps significantly better than 1 kbps and 2.5 kbps - 6 kbps @ 20% loss rated close to 4.5 kbps @ 10% loss - Interleaving benefit increases with error rate - Potential for improvement if model trained with random loss patterns 4.4.2 PLC Experiment with DAC and DAC-IBM Comparison: - DAC (default): 16kHz, general audio training, scalable bitrate - DAC-IBM: 24kHz, speech-specific training, fixed 1.5 kbps MUSHRA Test Results (8 listeners, resampled to 16kHz): - DAC-IBM 1.5 kbps @ 3% PLR significantly outperforms all other DAC conditions - DAC 4.5 kbps @ 10% PLR and 6 kbps @ 20% PLR show no significant improvement over DAC-IBM 1.5 kbps @ 3% PLR - Specific training for target bitrate crucial for optimal performance - Error resilience improvable through appropriate training/design choices Conclusions: - More design freedom needed in bitrate and BLER selection for optimal quality at given SNR - Optimal coding performance (even under errors) achieved with appropriate training strategy - Bitrate scalability (e.g., DAC) comes with significant performance cost, especially at lower bitrates - Dedicated training (e.g., DAC-IBM) much more efficient 4.5 Very Low Bitrate Listening Test Results Test Setup (Nokia): - Clean Finnish speech (3 males, 3 females, 4 sample pairs each) - Diotic presentation via Sennheiser HD650 headphones - Experienced listeners - Extended ACR5 scale (0.5-5.5) and DCR methodologies - Bandwidths tested: NB (4kHz), MB (6kHz), WB (8kHz), 10kHz, SSWB (12kHz), SWB (16kHz), FB (20kHz) Codecs Tested: - DSP: Codec2 (0.7, 1.3, 2.4, 3.2 kbps), MELP (2.4 kbps), MPEG4 HVXC (2.0, 4.0 kbps) - 3GPP: AMR, AMR-WB, EVS at various rates - ML: DAC 44k (0.9, 1.7, 2.6, 3.4, 6.9 kbps), TSAC 44k (0.6, 1.2, 2.5, 3.2, 5.9 kbps) Extended ACR5 Results (Figures 7.2.3-1, 7.2.3-2): - Increased bandwidth improves quality up to ~12kHz 
(saturation region) - 4kHz bandwidth significantly limits perceived quality - MELP 2.4k and MPEG4 HVXC perform better than Codec2 - 3GPP codecs perform as expected at lowest bitrates - TSAC and DAC show very good performance in clean speech - TSAC ≥1.2 kbps and DAC ≥1.7 kbps suitable as ML-based references - Both poor quality <1 kbps DCR Results (Figure 7.2.4-1): - Results align with ACR test - Exception: MELP preferred over HVXC 2.0 in DCR (full 4kHz bandwidth vs. ~3.7kHz) - Listeners more likely to notice degradations with reference available 4.6 Test Results on Clean Speech and Music/Mixed Content 4.6.1 DCR Test on Clean Speech (Figure 7.3.2-1) Test Setup: - French, 30 listeners (5 panels × 6) - 8sec double sentences, 3 male + 3 female - 20-20,000Hz bandpass, -26dB LKFS normalized Codecs: - Conventional: Opus (12, 16, 24 kbps), EVS-WB (7.2, 8 kbps), EVS-SWB (9.6, 13.2, 24.4 kbps) - AI: LPCNet (1.6), Lyra V2 (3.2, 6, 9.2), EnCodec (1.5, 3, 6, 12, 24), AudioCraft (1.5, 3, 6), AudioDec, DAC (1.7, 2.6, 5.2, 7.8) Key Findings: - DAC best DMOS among ~1.5 kbps codecs; approaches "Direct" quality <8 kbps - EnCodec doesn't achieve "Direct" quality even @ 24 kbps; below EVS/Opus at this rate - Lyra V2 (6, 9.2 kbps) on par with EVS-WB (7.2, 8 kbps) 4.6.2 ACR Test on Clean Speech (Figure 7.3.3-1) Same setup as DCR test, ACR methodology for better objective metric comparison. Same observations as DCR test. 
4.6.3 DCR Test on Music and Mixed Content (Figure 7.3.4-1) Test Setup: - 30 listeners (5 panels × 6) - 6 categories: instrumental/vocal classical, instrumental/vocal modern, captured mixed, artificial mixed (speech + music background) - 20-20,000Hz bandpass, -26dB LKFS Codecs: - Conventional: xHE-AAC (8, 12, 16, 24), Opus audio (16, 24), Opus voip (12, 16, 24), EVS-SWB (9.6, 13.2, 24.4) - AI: EnCodec (12, 24), DAC (4.3, 6, 7.8), HILCodec (4.5, 6, 9), SNAC (2.6), FlowDec (4.5, 6, 7.5) - Note: Many neural codecs pretested but excluded due to low quality (LPCNet, Lyra V2, AudioDec, FreqCodec, HifiCodec, Spectral Codecs, Vocos, DisCodec, Mimi, AudioCraft) Key Findings: - Best quality: EVS and xHE-AAC @ ~24 kbps - Neural codec advantage visible at low bitrates - No tested neural codec achieves quality close to "Direct" - FlowDec 7.5 kbps: 4.08 DMOS (best neural codec) - No tested AI codec provided reasonable quality for music/mixed content <2.6 kbps 4.7 Impact of Noise Suppression on AI-Based Codecs 4.7.1 Background on Existing Systems Classical Speech Coding: - Studies on MELPe and AMR show noise reduction preprocessing improves parameter extraction and decoded speech quality - Especially beneficial in noisy conditions and low SNRs - Improves intelligibility and perceptual quality - Integrated in 3GPP2 EVRC and VMR-WB standards Neural Speech Coding: - Known to be sensitive to noisy environments - Robustness influenced by training data diversity, low bitrates, capacity/complexity, quantization - Data-driven approaches make failure modes difficult to anticipate - Noise suppression can minimize issues and allow codec to focus on useful signal 4.7.2 Test Design Two Listening Tests (ITU-T P.808 ACR): Test 1 - High SNR: - Assumptions from 3GPP EVS characterization - SNRs: +15 to +20 dB (WB) - Noises: car, street, office (from ITU-T P.501 Annex B) - 24 pairs of sentences (8 pairs × 3 noises) - 20 listeners Test 2 - Low SNR: - More adversarial environments - SNRs: -5 to +15 dB 
- Noises: street, construction, metro, car, office, restaurant - 24 pairs of sentences (4 pairs × 6 noises) - 21 listeners Noise Suppression: - DeepFilterNet2: State-of-the-art DNN-based, operates at 48kHz - Applied as preprocessor before coding Mixing Procedure: - Loudness normalization using BS1770demo (ITU-T STL) - RMS long-term option for background noise level 4.7.3 Conditions Under Test Classical Codecs: - MELPe, AMR, AMR-WB, EVS Neural Codecs: - SNAC, MIMI, DAC_IBM (speech-trained, <2 kbps) - LyraV2 3.2 kbps (likely trained on diverse data including noisy speech) - DAC (original, 24kHz, 1.5/3/6 kbps) - Test 1 only All tested with and without noise suppression ("_nr" suffix). 4.7.4 High SNR Test Results (Figures 7.4.2.4-1, 7.4.2.4-2) Key Observations: - Listeners prefer uncoded denoised speech over uncoded noisy speech - Denoised speech as good as clean speech at high SNRs (minimal artifacts) - Noise suppression beneficial for all codecs except MELPe (already has noise reduction; benefit minimized at high SNRs) - Classical codecs: Benefit increases with bitrate/quality - Neural codecs: Greater benefit, >0.5 MOS improvement for several (SNAC, DAC_ibm, DAC @ 3 kbps) - DAC_ibm vs. DAC: Same architecture/complexity, very different behavior due to training data/target bitrate - Plain DAC @ 24kHz not competitive at 1.5 kbps - LyraV2: ~70x less complex than other neural codecs; @ 3.2 kbps performs worse except vs. 
DAC @ 3 kbps (on par) 4.7.5 Low SNR Test Results (Figures 7.4.2.5-1, 7.4.2.5-2) Key Observations: - Listeners strongly prefer uncoded denoised speech (~1 MOS difference) - All classical codecs benefit from denoising (<1 MOS improvement) - Neural codecs benefit even more (>1 MOS improvement possible) - Neural codecs at vastly lower bitrates can compete with conventional codecs under adverse conditions when combined with noise suppression - Generative-AI based codecs (e.g., DAC IBM) can improve absolute quality of input signal when coding denoised speech 4.7.6 Conclusions: - Speech coder performance in noisy conditions significantly enhanced by ensuring high SNR (e.g., via noise suppression) - Neural speech coders more sensitive to noisy environments; benefit more from noise suppression than traditional coders - High SNR enables improved performance at very low bitrates under both high and low SNR conditions - Noise suppression impact on delay/complexity requires further study - Note: Removing all background audio may not always be desirable (e.g., emergency calls where background contains relevant information) 4.8 Analysis of Existing AI Codec: Lyra V2 Key Characteristics: - Publicly reported: "38x faster than real-time" on high-end 2021 smartphone - Entirely CPU execution (no NPU/TPU) - Open-source under Apache 2.0 license (permissive for commercial/standardization) Code-Level Analysis: - Core components (LyraGanModel, SoundStreamEncoder) explicitly use CPU backend (XNNPACK via TensorFlow Lite) - Flag `use_xnn=true` directs to CPU execution - No hardware accelerator delegates (NNAPI, Hexagon, CoreML, TPU) - Single-threaded execution (threads explicitly set to 1) - Benchmark: Mean 0.525ms processing time for 20ms frame = ~38x real-time Conclusion: - Proves state-of-the-art low-bitrate AI speech codec can achieve/exceed real-time requirements on high-end 2021 smartphone CPU - Significant margin towards max RTF - CPU-only approach viable for ULBC 4.9 Complexity Analysis of 
Existing AI Codec: DAC Methodology: - ONNX Runtime library for execution - Tested on CPU backend and NNAPI backend (Android NPU interface) - Model: Unmodified pretrained DAC @ 44.1kHz, 8 kbps (from reference) - No quantization applied (original float model) - Metrics: Real-Time Factor (RTF) for end-to-end and individual components Theoretical Complexity Analysis (Figure 7.6.2-1): - Tools: ptflops v0.7.5, thop v2.0.17 (cross-verification) - Complexity scales with frame size: 1.4 GFLOP (20ms) to 31.6 GFLOP (320ms) - Model: 76.9M parameters, 293MB size - Note: Different library versions produce different results due to ConvTranspose1d calculation methodology changes Real-World Inference Performance: Test Platforms: 1. High-end desktop: AMD Ryzen 9 7950X (5.7GHz fixed) 2. High-end mobile: Qualcomm Snapdragon 8 Gen 2 Key Findings (Figures 7.6.4-1, 7.6.4-2, 7.6.4-3): Desktop CPU: - Single-threaded: NOT real-time (RTF 1.6-1.9) - Multi-threaded (4 threads): Real-time capable (RTF 0.67-0.86) - Still very slow for high-end desktop CPU Mobile SoC: - NO configuration achieves real-time performance - Best-case RTF: 2.125 (>2x slower than real-time) - Worst-case RTF: 5.884 (~6x slower than real-time) - NNAPI backend (NPU): Inconsistent results; sometimes helped slightly, sometimes significantly worse than CPU - Cannot assume NPU automatically improves performance; NPU-specific optimizations may be required Critical Gap: - Significant gap between theoretical NPU capacity and actual measured performance (RTF) - Model appearing suitable on paper (~2-5 GFLOP/frame) unable to run real-time on top-tier mobile phone - Real-world testing essential Editor's note: NNAPI may fallback to CPU for float models; impact needs verification. 5. 
Test Methodologies 5.1 General Considerations 5.1.1 Typical Quality Impairments of Ultra-Low Bit Rate Speech Coding Categories: - Loss of listening-only audio quality - Audio bandwidth loss - Impaired intelligibility - Impaired speaker identifiability - Prosodic impairments - Hallucination (word and phone confusions) - Sensitivity to non-speech input (background noise, music, noisy speech, interfering talker, reverberant speech) Additional Considerations: - Speech enhancement algorithms (noise suppression, gain normalization) may be part of ULBC 5.1.2 Challenges of Quality Assessment Traditional 3GPP Practice: - AMR, AMR-WB, EVS: Listening-only evaluations using P.800 ACR and modified DCR - ACR: Generally for clean speech - DCR: For SWB clean speech, mixed-bandwidth, speech + background noise, music/mixed content - Focus not on intelligibility, speaker identifiability, prosodic impairments ULBC Challenges: - May need to address additional aspects directly through dedicated tests - Hallucination: Specific to ML-based systems - ACR may not optimally quantify all impairments (hallucination, intelligibility, prosodic) Alternative Test Methods: - Automatic speech recognition - Modified rhyme tests - DCR tests (for prosodic differences) - Diagnostic Rhyme Tests (DRT) - Modified Rhyme Tests (MRT) - MOS testing for speaker similarity - Speaker verification/identification tests - Prosodic naturalness MOS tests - Intonation recognition tests - Transcription tests (word/semantic equivalence) - Phoneme recognition tests Noise Suppression Evaluation: - P.835: Multi-dimensional rating (speech quality and noise suppression capability separately) - Typically used for systems with noise suppression DCR Considerations: - Subjects may consider noise suppression as degradation when comparing to uncoded noisy reference 5.1.3 Subjective Testing Considerations
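The complexity figures quoted in this summary (TOPS from MAC counts, the Lyra V2 "38x real-time" benchmark, DAC model size) follow from simple arithmetic. The sketch below is illustrative only: the function names are hypothetical, one MAC is assumed to count as two operations (matching the factor 2 in the TOPS formula), and model size is assumed to be reported in binary megabytes. The summary appears to use "RTF" in two senses (frame length over processing time in the metric list, processing time over frame length in the DAC results, where RTF > 1 means slower than real time), so the two quantities are kept separate here.

```python
def tops_from_macs_per_second(macs_per_second):
    # Assumption: one multiply-accumulate counts as 2 operations.
    return 2.0 * macs_per_second / 1e12

def speedup_over_real_time(frame_ms, processing_ms):
    # "N x faster than real time": audio duration divided by processing time.
    return frame_ms / processing_ms

def rtf(frame_ms, processing_ms):
    # Convention used in the DAC results: RTF > 1 means slower than real time.
    return processing_ms / frame_ms

def model_size_mib(num_parameters, bits_per_parameter):
    # Assumption: sizes are expressed in binary megabytes (MiB).
    return num_parameters * bits_per_parameter / 8 / (1024 * 1024)

# DAC estimate: ~150 Giga MAC/s
print(f"DAC: {tops_from_macs_per_second(150e9):.2f} TOPS")
# Lyra V2 benchmark: 0.525 ms per 20 ms frame
print(f"Lyra V2 speedup: {speedup_over_real_time(20.0, 0.525):.1f}x")
print(f"Lyra V2 RTF (processing/frame): {rtf(20.0, 0.525):.5f}")
# DAC model: 76.9M float32 parameters
print(f"DAC model size: {model_size_mib(76.9e6, 32):.0f} MiB")
```

Under these assumptions the outputs line up with the quoted ~0.3 TOPS, ~38x real-time and 293MB figures. Measured RTF on a target device still has to come from real benchmarks, as the DAC results show.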
S4-260146 PDF Info	China Mobile Com. Corporation	[FS_ULBC] Permanent Document v0.5.0	Comprehensive Summary of 3GPP FS_ULBC Permanent Document Document Overview This permanent document (p-doc) version 0.45.0 supports the Study Item on Ultra Low Bitrate Speech Codec (FS_ULBC), focusing on developing recommendations for normative work on an ultra-low bit rate codec for voice over Geostationary Orbit (GEO) satellites. The document tracks agreements, open issues, and progress across the study objectives defined in the SID. 1. Introduction and Scope The study addresses nine key objectives: - Document application scenarios for ultra-low bit rate communication services - Study GEO channel characteristics and derive service-related dependencies - Identify relevant design constraints - Provide feasibility evidence - Define performance requirements and test methodologies - Identify/develop objective measures for design constraint verification - Identify reference codecs - Coordinate with other 3GPP groups (SA2, RAN, CT1) - Define potential normative work item objectives and timeline Working Procedure: - Maintains one TR and one p-doc - Contributions via pCRs - Brackets restricted to values only - Open issues documented in p-doc 2. 
Application Scenarios 2.1 Main Scenario: IMS Voice Call over GEO Key Technical Assumptions: UE1 Uplink (UE1 → GEO satellite → Ground station): - Transmission data rate significantly limited ([1-3] kbit/s) - Requires ultra-low bit rate codec fitting this transmission rate - Subject to transmission errors reflecting GEO satellite access - Delay greater than typical terrestrial networks UE1 Downlink (Ground station → GEO satellite → UE1): - Similarly limited transmission data rate - Subject to similar transmission errors and delay UE2 Connection (Core Network → UE2): - Regular TN network transmission data rate available - Could use existing IMS codec (with transcoding) or same ULBC (transcoding-free) - Transcoding functionality in core network likely needed for seamless communication across network types 2.2 Sub-Scenario Both connections (UE1 and UE2) via GEO satellite with significantly limited transmission data rate ([1-3] kbit/s), allowing both transcoded and transcoding-free operation. 3. Channel Characteristics and Service-Related Dependencies 3.1 End-to-End Simulation Model Methodology: - Reuses simulation model from TS 26.132 Annex E (LTE reference scenario) - Adapted for GEO access scenario with "new GEO channel" - Potential inclusion of Non-IP Data Delivery (NIDD) option Key Input Parameters: - BLER_tx/BLER_rx: Block error rates for uplink/downlink from RAN simulation - drx_cycle_length: DRX cycle duration (20-40ms for LTE; suitability for GEO TBC with RAN2) - mis_eNB1_eNB2: Scheduling time mis-alignment; determines buffer waiting time nFrames considerations: - Frame length: Maximum 80ms assumed for GEO (vs. 20ms for LTE) - Voice packet size: Depends on protocol overhead (user plane vs. control plane, IP vs. Non-IP NIDD) - RTP Payload Size: Product of frame length and codec bit rate Editor's Note: SA2 concluded in TR 23.700-19 that voice packets shall be transported over NB-IoT (GEO) user plane. 
3.2 RAN Simulation Model for Error Traces Objective: Generate multiple loss traces for combinations of: - Frame loss rate (target BLER) - Raw bitrate (TBS) - Voice bundling period - Doppler spread Simulation Parameters: - Number of seeds: 10 - Trace duration: 400 seconds (6.67 minutes) - Channel consistency: Same channel realizations across all combinations 3.2.1 Link Budget Analysis Baseline CNR values from TR 36.763: - UL CNR = 2.6dB (0dBi UE antenna gain, 3.75kHz SCS, 1 tone, 23dBm UE max TX power) - DL CNR = -3.3dB (0dBi UE antenna gain, 15kHz SCS, 12 tones, 1 UE receive antenna, 23dBm UE max TX power) 3.2.2 Uplink Simulation Parameters Channel model: NTN-TDL-C [38811] Elevation angle: 10 degrees (parameters specified in Table 5.2.2.2-1) Modulation: QPSK, π/2 BPSK Subcarrier Spacing (SCS): 3.75kHz, 15kHz Number of tones: 1 for both SCS values Voice bundling period: 80ms, 160ms, 320ms - Note: 40ms not considered due to insufficient time for DL transmissions with 3.75kHz SCS Doppler spread: 1Hz, 5Hz Target BLER: 1%, 2%, 6%, 10% (fixed target BLER is FFS) Maximum Achievable SNR: SNR = (3GPP SET-1 UL SNR) - 10×log₁₀(B/3.75) + (P - 23dBm) + G + [X] dB Where: - 3GPP SET-1 UL SNR = 2.6dB - B = bandwidth (3.75kHz or 15kHz) - P = max UE TX power (23, 26, 31 dBm) - G = UE antenna gain difference (0 to -5.5dBi) - X = TBD (accounts for lower loss, better satellite performance) TBS Values and PHY Bitrates: For 80ms bundling: - TBS: 144, 256, 328, 424 bits - PHY bitrate: 1.8, 3.2, 4.1, 5.3 kbps - Codec bitrate: 1.1, 2.5, 3.4, 4.6 kbps (assuming 7 bytes packet header) For 160ms bundling: - TBS: 208, 424, 600, 808 bits - PHY bitrate: 1.30, 2.65, 3.75, 5.05 kbps - Codec bitrate: 0.95, 2.30, 3.40, 4.70 kbps For 320ms bundling: - TBS: 328, 776, 1096, 1544 bits - PHY bitrate: 1.025, 2.425, 3.425, 4.825 kbps - Codec bitrate: 0.850, 2.250, 3.250, 4.650 kbps Notes: - Packet header counted once regardless of bundled frames - Loss of single TB means loss of multiple consecutive voice 
frames - Need for 320ms bundling to be revisited after channel simulation results 3.2.3 Downlink Simulation Parameters SCS: 15kHz Number of tones: 12 Achievable SNR: SNR = (3GPP SET-1 DL SNR) + G + [Y] dB Where: - 3GPP SET-1 DL SNR = -3.3dB - G = UE antenna gain difference (0 to -5.5dBi) - Y = TBD (accounts for 2 RX antennas providing up to 3dB gain, lower loss, better G/T values, better satellite performance) Editor's Note: Four companies reported Y=3 due to better G/T from field measurements (-28.6dB/K vs. -31.6dB/K assumed), but no RAN1 consensus reached. TBS values: Identical to uplink (Clause 5.2.2.2) 3.2.4 Frame Structure Dynamic Scheduling Example (80ms bundling, Half-duplex FDD): - NPDSCH duration: 4ms (variable depending on DL SNR) - UL frequency allocation options: 1, 3, 6, 12 tones with 15kHz per tone Semi-Persistent Scheduling (SPS): - If specified by RAN for NB-IoT NTN - NPDSCH can be anywhere in first 15ms (maintaining minimum 1ms gap to NPUSCH) - "Cell_specific_Koffset" approach proposed (not dependent on "TA report UE capability") Gap between DL and UL consists of: - Processing time + DL-to-UL switching (minimum 1ms for half-duplex device) - Max differential delay: [close to 0 to 10.3ms] (TBC) RAN1 Note: Example frame structures supportable in most scenarios but may not work for very large cells (>3000km) when UE doesn't support TA report and network doesn't support UE-specific K-offset. RAN1/2 have not yet designed SPS. 
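The uplink and downlink achievable-SNR expressions given in 3.2.2 and 3.2.3 can be evaluated directly. The following is a minimal sketch, not part of the p-doc: the function and parameter names are hypothetical, and because X and Y are still TBD they are exposed as arguments that default to 0 dB purely as placeholders.

```python
import math

# Baseline values quoted in the summary (from TR 36.763).
UL_SNR_SET1_DB = 2.6   # 3.75 kHz SCS, 1 tone, 23 dBm, 0 dBi UE antenna gain
DL_SNR_SET1_DB = -3.3  # 15 kHz SCS, 12 tones, 1 RX antenna, 0 dBi UE antenna gain

def max_ul_snr_db(bandwidth_khz, tx_power_dbm, gain_diff_db=0.0, x_db=0.0):
    # SNR = SET-1 UL SNR - 10*log10(B/3.75) + (P - 23 dBm) + G + X
    bandwidth_penalty_db = 10.0 * math.log10(bandwidth_khz / 3.75)
    return (UL_SNR_SET1_DB - bandwidth_penalty_db
            + (tx_power_dbm - 23.0) + gain_diff_db + x_db)

def max_dl_snr_db(gain_diff_db=0.0, y_db=0.0):
    # SNR = SET-1 DL SNR + G + Y
    return DL_SNR_SET1_DB + gain_diff_db + y_db

# X = 0 dB is a placeholder, not an agreed value.
for bandwidth_khz in (3.75, 15.0):
    for tx_power_dbm in (23, 26, 31):
        snr = max_ul_snr_db(bandwidth_khz, tx_power_dbm)
        print(f"UL B={bandwidth_khz} kHz, P={tx_power_dbm} dBm: {snr:+.2f} dB")

# Y = 3 dB is the value reported by some companies, without RAN1 consensus.
print(f"DL with Y=3 dB: {max_dl_snr_db(y_db=3.0):+.2f} dB")
```

Moving from 3.75kHz to 15kHz costs about 6 dB at the same transmit power, which is why the higher UE power classes matter for the 15kHz SCS configurations.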
3.3 Open Issues for NB-IoT GEO Simulation Issue 1 - UE Power Class: Whether to use specified 23dBm or broader range (26, 29, 31, 33 dBm) - Pending RAN input Issue 2 - Latitude-Dependent Loss: Scintillation loss (2.2dB or 0dB depending on latitude) - Solved (accounted via X term) Issue 3 - Elevation Angles: Keeping both 2.3° and 12.5° - Solved (accounted via X term) Issue 4 - UL/DL Guard Time: 1ms assumption - Pending RAN confirmation Issue 5 - Candidate TBS Values: Multiple proposals from companies - Unsolved Issue 6 - Approaches to Select TBS: Three approaches provided - Unsolved Issue 7 - Overall Simulation Methodology: High-level description needed - Unsolved (to be addressed after simulation completion) Issue 8 - Simulation Channel Model: NTN-TDL-C vs. NTN-TDL-C5 - Solved (NTN-TDL-C used) Issue 9 - Protocol Overhead: Clarify packet header for different transport options - Pending RAN2/SA2 confirmation Issue 10 - Repetition Numbers: Specify and report in simulation - Solved Issue 11 - RX G/T for Downlink: 3dB better value observed in field - Unsolved 3.4 Alternative Methodology for Determining ULBC Bit Rate Editor's Note: This methodology remains an open issue. Proposed Steps: 1. Agree on operation points: Set of maximum achievable receive SNRs covering marginal to error-free operation with NTN-TDL-C fading 2. Define performance requirements for each SNR operation point 3. Agree on source bit rates for each bundling time (80, 160, 320ms) based on transport formats (TBS, SCS, MCS, NRep) - Current range: 825-4650 bits/s - Granularity appears insufficient and unequal 4. Determine optimum transport format (SCS, MCS, NRep) for each source bit rate based on BLER vs. SNR curves 5. Produce packet loss patterns for each bundling time and source bit rate at relevant SNRs (unknown to proponents during selection) 6. Compare ULBC candidates based on performance requirements at relevant SNRs 
Example Workflow: - Proponent has design at 0.95 kbps and 3.4 kbps - For 160ms bundling with 7-byte overhead: - Low rate: TBS = 208 bits - High rate: TBS = 600 bits - Select best transport format configuration from available options - Generate BLER patterns for different UE TX powers (23, 26, 29, 31 dBm) - Run codec simulation with these patterns - Evaluate quality (e.g., listening test) with weighted averaging across power settings Note: Important to test candidates for other conditions beyond NTN NB-IoT (e.g., Terrestrial IMS with 1% BLER, OTT with 0% BLER, extreme conditions with 10% BLER or blockage losses) 3.5 Simulation Results Table 5-6 documents preliminary results: - 80ms bundling: Qualcomm submitted S4-251739 - Company A, B, C: TBD 4. Design Constraints 4.1 Complexity and Memory Demands Target Device Types: - Handheld mobile phones - Smart watches - Smart glasses/head mounted devices - TCU (Telematics Control Unit) - CPE (Customer Premises Equipment) - Vehicles - Other IoT devices Recommended Constraints: - Implementable on DSP/CPU/NPU enabled UE devices - For low-end DSP-only UEs: - Complexity: <500 WMOPS (measured on C reference code) - ROM memory: <20MB assuming 32bit/parameter (or 5M model parameters) Editor's Notes: - Definition of "DSP enabled UE devices" needs clarification - Exact complexity estimation metric and limits are TBD 4.2 Design Constraint Verification Complexity Verification: - Constraints may be based on platform-agnostic metrics: - MACs/FLOPs for AI-based components - WMOPS for traditional signal processing - Model size and precision - Verification process details and timing are FFS Algorithmic Delay: - Verification method for AI-based codecs required 5. 
Performance Requirements 5.1 Scope Define performance requirements and test methodologies for: - Speech quality, intelligibility, conversational quality - Clean speech and noisy speech - Tandeming with existing IMS voice codecs - Clean channel and GEO channel conditions - Identify relevant reference codecs 5.2 Status Tracking Core influencing factors identified: - DC: Sample rate and audio bandwidth - DC: Bitrates (External dependency) - DC: Frame length - DC: PLC (External dependency) - DC: Algorithmic Delay - DC: Complexity, Memory - Test Methodologies - DC: Noise suppression - DC: DTX/CNG - DC: Robust Non-Speech - Evidence DCs - Reference codec All items currently have open issues and progress TBD 6. Coordination and Dependencies 6.1 External Dependencies From RAN: - HARQ retransmission parameters (max_tx/max_rx) - DRX cycle length suitability for GEO - Scheduling parameters (dynamic vs. SPS) - Frame structure confirmation - UE power class - UL/DL guard time - Protocol overhead - G/T values for downlink From SA2: - Transport path for voice packets (user plane vs. control plane, IP vs. Non-IP NIDD) - Protocol overhead details - Transcoding functionality requirements From RAN2: - Dynamic scheduling vs. Semi-Persistent Scheduling - MAC header size (1-byte feasibility) - Timing parameters 7. 
Key Technical Contributions 7.1 Simulation Framework Establishment The document establishes a comprehensive RAN simulation framework for generating error traces: - Defined methodology using NTN-TDL-C channel model - Specified uplink and downlink parameters - Established TBS values and corresponding codec bitrates for multiple bundling periods - Defined channel consistency requirements across simulations 7.2 Link Budget Analysis Adopted baseline CNR values from TR 36.763 with provisions for: - Variable UE power classes - Latitude-dependent losses - Elevation angle variations - Better-than-assumed satellite performance 7.3 Bitrate Determination Methodology Proposed alternative methodology allowing proponents design freedom: - Operation point definition based on receive SNRs - Transport format optimization for each source bit rate - Packet loss pattern generation - Comparative evaluation framework 7.4 Frame Structure Definition Defined frame structures for: - Dynamic scheduling with Half-duplex FDD - Semi-Persistent Scheduling options - Cell_specific_Koffset approach for large cells 7.5 End-to-End Delay-Error Profile Model Adapted TS 26.132 Annex E model for GEO scenarios: - Identified required input parameters - Defined voice packet structure with protocol overhead - Established relationship between frame length, bundling, and packet loss 8. Open Issues Summary High Priority (Blocking): 1. Consensus on UE power class (23 dBm vs. higher values) 2. RAN confirmation on frame structures and scheduling 3. SA2/RAN2 confirmation on protocol overhead 4. Selection of candidate TBS values and selection methodology 5. Downlink RX G/T value consensus Medium Priority: 1. Fixed vs. variable target BLER 2. Need for 320ms bundling option 3. Complexity metric definition and limits 4. Algorithmic delay verification for AI codecs Lower Priority: 1. Overall simulation methodology description (after completion) 2. Definition of "DSP enabled UE devices" 9. 
Document Status Current Version: 0.45.0 (SA4#135, February 2026) Recent Updates: - Added 10-degree channel model parameters - Updated simulation parameters per multiple agreed TDOCs - Added company simulation results reporting - Clarified voice packet transport over user plane Working Status: - Active study phase - Collecting simulation results from companies - Coordinating with RAN and SA2 for parameter confirmation - Developing design constraints and performance requirements
S4-260148 PDF Info	China Mobile Com. Corporation	[FS_ULBC] WorkPlan of FS_ULBC v0.5	Timeplan for FS_ULBC Study Item 1. Introduction This document outlines the timeplan for the Feasibility Study on Ultra Low Bitrate Speech Codec (FS_ULBC). The study focuses on developing a codec for ultra-low bit rate communication services, particularly for IMS Voice Call Using GEO Access as documented in TR 22.887. Study Item Objectives The FS_ULBC study has nine main objectives: Application Scenarios: Document ultra-low bit rate communication service scenarios based on TR 22.887 use cases and requirements for IMS Voice Call Using GEO Access GEO Channel Characteristics: Study GEO channel characteristics and derive service-related dependencies (bitrates, mouth-to-ear delay, loss/delay/jitter profiles) Note: NB-IoT services impact is out of scope Design Constraints: Identify relevant design constraints in coordination with other WGs: Bit rates Sample rate and audio bandwidth Frame length Complexity and memory demands Algorithmic delay Packet loss concealment (PLC) Potential noise suppression integration Discontinuous transmission (DTX) including VAD and comfort noise Speech quality Robustness to non-speech input Feasibility Evidence: Provide evidence that design criteria can be met using existing reference codecs Performance Requirements: Define performance requirements and identify test methodologies for: Speech quality and intelligibility Conversational quality Clean and noisy speech conditions Tandeming with existing IMS voice codecs Clean channel and GEO channel conditions Objective Measures: Identify or develop objective measures to verify design constraints (e.g., complexity and memory measurements) Reference Codecs: Identify relevant reference codecs for comparison and evaluation Coordination: Coordinate with other 3GPP groups (SA2, RAN, CT1, etc.) Normative Work: Define potential normative work item objectives and timeline 2. 
Current Progress Status Application Scenarios (85% Complete) Scenario 1: IMS Voice Call over GEO (TR 4.2, P-doc 4.1) Scenario 2: Multi-Party Voice Communication (TR 4.3) Scenario 3: IMS Voice Call with ULBC over other access types than GEO (TR 4.4) Next Steps: Finalize high-level prerequisites and resolve ENs Dependencies: SA2, RAN, CT GEO Channel Characteristics & Simulation (75% Complete) NB-IoT system design and simulation parameters documented (TR 5.1.4, P-doc 5.2.2) SA4#135 Plans: Finalize remaining parameters Gather candidate Transport Block Sizes (TBS) Dependencies: SA2, RAN Simulation Methodology (60% Complete) SA4#135 Plans: Discuss and confirm simulation methodology Dependencies: RAN Company Simulation Results (40% Complete) Companies providing simulation results (P-doc 5.2.3) SA4#135 and Post Ad-hoc Plans: Select appropriate TBS Collect company simulation results SA4#136 Plans: Cross-check simulation results Finalize feasible TBS values and loss traces Mouth-to-Ear Delay (95% Complete) Documented in TR 5.1 Next Steps: Resolve ENs Design Constraints Progress Bit Rates (0% Complete) SA4#136 and Post Ad-hoc: Decide bit rates for ULBC (dependent on simulation results) Dependencies: RAN Sample Rate and Audio Bandwidth (5% Complete) SA4#135 and Post Ad-hoc: Discuss supported audio bandwidth SA4#136 and Post Ad-hoc: Decide supported audio bandwidth Frame Length (0% Complete) SA4#136 and Post Ad-hoc: Decide frame length for ULBC Complexity and Memory Demands (80% Complete) Documented in TR 6.2.1, P-doc 6.1.1 SA4#135 and Post Ad-hoc: Finalize complexity measurement metrics and resolve ENs Algorithmic Delay (0% Complete) SA4#135 and Post Ad-hoc: Discuss algorithm delay for ULBC Packet Loss Concealment (15% Complete) Documented in TR 7.1.5 SA4#135 and Post Ad-hoc: Discuss PLC for ULBC Noise Suppression (15% Complete) Documented in TR 7.4 SA4#135 and Post Ad-hoc: Discuss noise suppression for ULBC DTX (0% Complete) SA4#135 and Post Ad-hoc: Discuss DTX support for 
ULBC Design Constraint Verification (5% Complete) P-doc 6.3.1 Next Steps: Verify design constraints Other Considerations (5% Complete) TR 6.4.1 Next Steps: Document additional design considerations and resolve ENs Existing Codec Technologies (85% Complete) Reference codecs documented (e.g., DAC, Lyra) in TR 7 SA4#135 and Post Ad-hoc: Continue documenting evidence of existing technologies and resolve ENs Performance Requirements (0% Complete) SA4#136 and Post Ad-hoc: Define performance requirements Test Methodologies (50% Complete) Subjective test methodologies documented in TR 9.1.3 SA4#135 and Post Ad-hoc: Identify appropriate test methodologies Coordination with Other WGs Analysis of current liaisons from RAN, CT, and SA2 available (S4aA250139) Ongoing coordination as needed 3. Detailed Timeplan TSG SA#107 (March 12-14, 2025, Incheon, KR) Approval of FS_ULBC study item SA4#131-bis-e (April 11-17, 2025) Start documenting application scenarios for ultra-low bit rate communication services Start studying GEO channel characteristics and service-related dependencies Start identifying relevant reference codecs Start coordinating with other 3GPP groups Audio SWG Telco (May 5, 2025) Focus on application scenarios and technical contributions SA4#132 (May 19-23, 2025, Fukuoka, JP) Finalize application scenarios documentation Progress GEO channel characteristics study Progress reference codec identification Progress coordination with other WGs Start identifying/developing objective measures for design constraint verification Start identifying relevant design constraints (bit rates, sample rate, frame length, complexity, algorithmic delay, PLC, noise suppression, DTX, speech quality, robustness) Start providing feasibility evidence using existing reference codecs Start defining performance requirements and test methodologies for speech quality If time permits: Start documenting additional application scenarios Audio SWG Telcos (June-July 2025) Multiple telcos scheduled to: - 
Progress GEO channel characteristics study - Perform RAN-related simulations within SA4 - Align on RAN link-level simulations - Power to send LS to SA2 and RAN WGs if needed SA4#133-e (July 21-25, 2025) Progress all ongoing work items: GEO channel characteristics Coordination with other WGs Reference codec identification Objective measures development Design constraints identification Feasibility evidence Performance requirements and test methodologies If time permits: Progress additional application scenarios F2F Ad-hoc Meeting (September 23-25, 2025, Erlangen, Germany) Hosted by Fraunhofer IIS Electronic participation on best effort basis Audio SWG Telcos (October 2025) Opportunity for feedback from other WGs Progress work on: GEO channel characteristics study Existing technologies documentation Design constraints identification Performance requirements and test methodologies Application scenarios if time permits SA4#134 (November 17-21, 2025, Dallas, US) Major milestone meeting: - Finalize: - GEO channel characteristics study - Coordination with other WGs - Reference codec identification - Design constraints: bit rates, sample rate, audio bandwidth, frame length, PLC, noise suppression, DTX - Progress: - Feasibility evidence - Objective measures development - Design constraints: complexity, algorithmic delay, speech quality, robustness - Performance requirements and test methodologies - Start defining potential normative work item objectives and timeline - If time permits: Finalize additional application scenarios Audio SWG Telcos (December 2025 - January 2026) Finalize GEO channel characteristics study Progress: Simulation parameters for end-to-end simulation Existing technologies documentation Design constraints identification Performance requirements and test methodologies Application scenarios if time permits Power to send reply LS for incoming LS postponed during SA4#134 SA4#135 (February 9-13, 2026, India) Finalize objective measures for design constraint 
verification Progress: Design constraints: complexity, algorithmic delay, speech quality, robustness Feasibility evidence Performance requirements and test methodologies TSG SA#111 (March 10-13, 2026, Japan) TR for information SA4#136 (April 13-17, 2026) Finalize: Design constraints: complexity, algorithmic delay, robustness to non-speech input Feasibility evidence Progress: Design constraints: speech quality Performance requirements for speech quality SA4#137 (May 11-15, 2026) Final study meeting: - Finalize: - Design constraints: speech quality - Performance requirements and test methodologies (clean/noisy speech, tandeming, clean/GEO channel conditions) - Potential normative work item objectives and timeline TSG SA#112 (June 9-12, 2026, Singapore) TR for approval - Study completion
S4-260149 PDF Info	China Mobile Com. Corporation	[FS_ULBC] On Assumptions and Open Issues for NB-IoT GEO Simulation	Summary of S4-260149: Updates on Assumptions and Open Issues for NB-IoT GEO Simulation Document Overview This contribution from China Mobile addresses outstanding assumptions and open issues for NB-IoT GEO satellite simulation work within the ULBC (Ultra-Low Bitrate Codec) study. The document consolidates discussions from multiple Audio Ad-hoc meetings (June 4, June 17, and July 11) and proposes updates to TS 26.940 clause 5.2.2.4. Main Technical Contributions Status Updates on Simulation Parameters The document provides a comprehensive status table tracking 11 key simulation issues, with updates on their resolution status: Resolved Issues UE Power Class (Issue 1): Previously pending decision between 23 dBm (specified for NTN NB-IoT) vs. higher commercial values (26-33 dBm). Resolution: 37 dBm adopted based on RAN4 Reply LS S4aA250219. Note: 37 dBm is under study in ongoing RAN work. Latitude-Dependent Loss (Issue 2): Addressed scintillation loss variation (2.2 dB vs. 0 dB) based on latitude per TR 38.821. Resolution: Simulation accounts for latitude-dependent loss using X term. Additional note: New 10-degree channel model introduced, may increase feasible TBS. Elevation Angles (Issue 3): Proposal to maintain both 2.3° and 12.5° elevation angles for worst-case scenarios. Resolution: Simulation accounts for elevation angles using X term. Simulation Channel Model (Issue 8): Choice between NTN-TDL-C and NTN-TDL-C5. Resolution: NTN-TDL-C is used. Repetition Numbers (Issue 10): Proposal to specify and report repetition numbers in simulation. Resolution: Solved. Partially Resolved Issues Protocol Overhead (Issue 9): Requires clarification of packet header overhead for different protocol combinations (user plane, control plane, IP vs. non-IP). Partial Resolution: SA2 confirmed voice packets transported over User Plane. Still Pending: Overhead for User Plane (IP vs. 
Non-IP) needs RAN confirmation RX G/T for Downlink (Issue 11): Field measurements show 3dB better value than current RAN assumptions Status: Editorial note added in P-doc 5.2.2.3 to capture field-measured data Current Status: Listed as "Unsolved" in table Unresolved Issues UL/DL Guard Time (Issue 4): Current assumption: 1 ms guard time for UL/DL switching Status: Needs RAN confirmation on feasibility Determine Candidate TBS Values (Issue 5): Multiple proposals from different companies: Xiaomi (S4aA250035) Fraunhofer (S4aA250031) Skylo (S4-251540) Dolby (S4-251390) Huawei (S4aA250230) Qualcomm (S4-251548) vivo (S4aA250215) Status: Unsolved, requires further verification Approaches to Select TBS (Issue 6): Three approaches provided in S4aA250072 One approach detailed in clause 5.2.2.4.1 Status: Unsolved, requires further discussion Overall Simulation Methodology Description (Issue 7): Need for high-level description of simulation execution, including optimization parameters and result parameters General description documented in P-doc Clause 5.2.2 Status: Unsolved, to be addressed after all simulation work completed Proposal The document proposes to: 1. Update the P-doc (TS 26.940) based on the status updates provided 2. Continue tracking these issues until full resolution Key Dependencies The document highlights several dependencies on other working groups: - RAN4: UE power class confirmation - RAN: UL/DL guard time feasibility, protocol overhead confirmation - SA2: Protocol overhead for different transport configurations
S4-260150 PDF Info	vivo Mobile Communication Co.,	[FS_ULBC] Updates of the permanent document based on 3GPP TR 23.700-19	Summary of 3GPP Technical Document: Updates to FS_ULBC Permanent Document Document Overview This contribution updates the FS_ULBC (Ultra Low Bitrate Speech Codec) Permanent Document to align with SA2 conclusions on Key Issue #1 regarding IMS voice call support over NB-IoT via GEO satellite connecting to EPC, as documented in TR 23.700-19. Main Technical Contributions 1. Reference Updates The document adds critical new references to align with recent 3GPP work: TR 23.700-19 V1.2.0: Study on Integration of satellite components in the 5G architecture; Phase 4 S2-2509293: Interim conclusions on KI#1 Support of IMS voice call over NB-IoT NTN via GEO satellite connecting to EPC TR 36.763: Study on NB-IoT/eMTC support for Non-Terrestrial Networks R1-2506541: Reply LS on RAN simulation assumptions for ULBC 2. End-to-End Simulation Model Updates (Clause 5.2.1.3) 2.1 Architecture and Protocol Stack Changes The document introduces significant modifications to the end-to-end simulation model: New GEO Channel Model: Extends the reference LTE scenario (Annex E of TS 26.132) to accommodate GEO satellite access Three Architectural Scenarios Defined: Reference LTE VoLTE scenario (Figure 5.2.1.3-1) Main GEO scenario with IP transport (Figure 5.2.1.3-2) GEO scenario with Non-IP Data Delivery option (Figure 5.2.1.3-2a) 2.2 Transport Mechanism Agreements Based on SA2 conclusions in TR 23.700-19: User Plane Transport: Voice packets shall be transported over NB-IoT (GEO) user plane using DRB and S1-U Single PDN Connection: Both IMS signaling and IMS voice use a single PDN connection Mandatory Mechanism: Transport of IP packets (UP/IP) with RoHC recommended Optional Mechanism: Transport using removal and restoration of parts of RTP/UDP/IP headers (UP/non-IP) 2.3 Simulation Input Parameters Key parameters updated for GEO scenarios: BLER_tx/BLER_rx: Block error rates for 
UL/DL based on error traces from Clause 5.2.2 max_tx/max_rx: HARQ retransmissions (note: HARQ feedback suggested to be disabled for IMS voice over GEO per Release 18) drx_cycle_length: DRX cycle duration (LTE values 20-40 ms, suitability for GEO requires RAN2 confirmation) mis_eNB1_eNB2: Scheduling time misalignment between eNBs Speech sequence frame length: Maximum 80 ms frame length for GEO (vs. 20 ms for LTE) Voice packet size: Depends on protocol overhead, varies by transport mechanism 2.4 Protocol Overhead Considerations Two protocol overhead scenarios illustrated: UP/IP with RoHC (Figure 5.2.1.3-4 left): Mandatory mechanism UP/non-IP with header removal (Figure 5.2.1.3-4 right): Optional mechanism Editor's Note: Exact overhead for UDP/IP (SA2 scope) and RTP (SA4 scope) for the removal/restoration mechanism requires determination. 3. Simulation Assumptions and Open Issues (Clause 5.2.2.4) 3.1 Resolved Issues
| Issue | Resolution |
|-------|-----------|
| Latitude-Dependent Loss | Simulation accounts for latitude-dependent scintillation loss using X term (2.2 dB or 0 dB beyond ±20° latitude per TR 38.821) |
| Elevation Angles | Both 2.3° and 12.5° angles considered using X term for worst-case scenarios |
| Simulation Channel Model | NTN-TDL-C selected |
| Repetition Numbers | Specified and reported in simulation |
3.2 Pending Issues Requiring RAN Input UE Power Class: 23 dBm (specified for NTN NB-IoT) vs. 
commercial UE range (26-37 dBm) - requires RAN confirmation UL/DL Guard Time: 1ms assumption needs RAN verification RX G/T for Downlink: Field observations show 3dB better performance than current RAN assumptions 3.3 Unresolved Issues Candidate TBS Values: Multiple proposals from Xiaomi, Fraunhofer, Skylo, Dolby, Huawei, Qualcomm, and vivo require evaluation TBS Selection Approaches: Three approaches in S4aA250072 need discussion Overall Simulation Methodology: High-level description to be completed after simulation work Protocol Overhead for UP/non-IP: Exact overhead values for removal/restoration mechanism depend on specific RTP fields selected (SA4 decision) 3.4 Updated Understanding on Protocol Overhead Based on SA2 agreements: Control Plane transport excluded: Only User Plane transport considered Mandatory: UP/IP with RoHC recommended Optional: UP/non-IP with partial header removal/restoration Exact overhead values for optional mechanism pending SA4 decisions on RTP field selection Key Dependencies and Cross-WG Coordination The document identifies several inter-working group dependencies: RAN1: Physical layer timing, power class confirmation RAN2: HARQ configuration, DRX cycle parameters, scheduling mechanisms SA2: UDP/IP overhead for non-IP mechanism SA4: RTP overhead, frame length confirmation, RTP field selection for header removal Editor's Notes Two critical editor's notes remain: Whether the eNB1-eNB2 delay model for LTE scenarios accurately reflects GEO deployment delays Whether RTP payload size affects the delay-error profile
S4-260151 PDF Info	China Mobile Com. Corporation	[FS_ULBC]Considerations for ULBC Codec Selection Process	Comprehensive Summary: Considerations for ULBC Codec Selection Process Document Overview This document appears to be a presentation or discussion paper related to the ULBC (Ultra Low Bitrate Codec) selection process. However, the provided content is fragmentary and contains mixed language elements (English and Chinese), making comprehensive technical analysis challenging. Main Technical Areas Identified 1. ULBC Codec Selection Process The document's primary focus is on considerations for selecting codecs in the ULBC (Ultra Low Bitrate Codec) context. However, specific technical criteria, evaluation methodologies, or selection parameters are not detailed in the provided content. 2. JPEG AI Integration Overview: The document references JPEG AI as a relevant technology. JPEG AI appears to be considered as a potential codec or compression technology for the ULBC use case. Working Mechanism: A section is dedicated to JPEG AI's working mechanism. Specific technical details of the mechanism are not provided in the extracted content. Timeline: Timeline considerations for JPEG AI are mentioned. Specific milestones or deployment schedules are not detailed. 3. Cross-Working Group Coordination SA2 Related Work: SA2 has related work in Release 18 (R18) and Release 19 (R19). Key Issue Identified: Lack of unified architecture design. Requirements are coming from RAN but lack a unified architectural framework. This suggests fragmentation in approach across different scenarios. RAN Liaison Statements: The latest LS (Liaison Statement) from RAN concerns model transmission. This indicates coordination requirements between SA and RAN working groups. 4. 
Architecture Considerations Network Function Changes Reference to "NF变CN" (Network Function changes to Core Network) Suggests potential architectural modifications at the Core Network level Specific changes or proposals are not detailed in the provided content Open Questions The document includes an "Open Questions" section, indicating ongoing discussions and unresolved technical issues. However, the specific questions are not provided in the extracted content. Technical Gaps in Provided Content Due to the fragmentary nature of the document provided: - Specific codec selection criteria are not detailed - Technical evaluation parameters are missing - Comparison methodologies between candidate codecs are not present - Detailed architectural proposals are not included - Specific agreements or decisions are not documented Observations Multi-Release Scope: The work spans R18 and R19, indicating ongoing evolution Cross-WG Dependencies: Clear dependencies between SA2 and RAN work Architecture Fragmentation: Identified need for unified architecture design Emerging Technologies: JPEG AI considered as potential solution Core Network Impact: Potential changes to CN architecture implied Note: This summary is based on fragmentary content with significant portions in template format or non-English text. A complete technical analysis would require the full document with all technical details, agreements, and proposals.
S4-260152 PDF Info	vivo Mobile Communication Co., Nokia, Xiaomi Technology, Samsung, Spreadtrum, Bytedance	[FS_ULBC] Analyzing semantic intelligibility in lossy coded audio signals	Comprehensive Summary: Analyzing Semantic Intelligibility in Lossy Coded Audio Signals 1. Introduction and Objectives This contribution presents experimental evaluation results focusing on semantic intelligibility of audio codecs under Ultra Low Bitrate Codec (ULBC) constraints for GEO satellite communications. The primary objective is to quantify semantic preservation (listener's ability to accurately understand spoken content) using Automatic Speech Recognition (ASR) Word Error Rate (WER) as a proxy metric, rather than traditional perceptual quality (MOS) metrics. The study evaluates: Descript Audio Codec (DAC), an AI-based codec; and the Enhanced Voice Services (EVS) codec, the 3GPP standard reference. The analysis specifically investigates whether higher audio bandwidths (wideband vs. narrowband) improve or reduce intelligibility at very low bitrates, providing data-driven guidance for audio bandwidth design constraints and quality floor determination. 2. Background and Motivation 2.1 ULBC Context The ULBC study item targets voice over GEO satellite communications where balancing audio quality, robustness, and bit-efficiency is critical. At extremely low bitrates (< 3 kbps or ~1 kbps), a fundamental trade-off emerges: - Wideband audio (16 kHz) offers naturalness and perceptual quality - Bit allocation challenge: Allocating scarce bits to higher frequencies reduces the budget for core speech spectrum, potentially introducing artifacts that outweigh bandwidth benefits 2.2 Critical Communication Requirements For emergency rescue operations, semantic intelligibility is the highest priority. 
Key considerations include: - Wideband generally improves comfort and speaker identification, but its impact on speech understanding in "last resort" scenarios requires verification - System interoperability with legacy endpoints (PSTN, GSM fallback) remains important in remote areas - Need to balance modern expectations with legacy requirements and emergency scenarios 2.3 EVS as Reference Anchor EVS serves as a quality anchor and concrete standardized baseline for semantic preservation, enabling: - Practical quality floor definition for ULBC - Comparison against established carrier-grade standards - Isolation of bandwidth choice impact independent of codec architecture 3. Methodology 3.1 Evaluation Pipeline Dataset: LibriSpeech train-clean-100 subset (standard benchmark for high-quality read English speech) Sample size: 500 audio files randomly sampled across three seeds (101, 102, 103) Consistency: Same audio files used for all codec and bitrate configurations 3.2 Processing Chain Process input audio through target codecs (DAC and EVS) at various bitrates Transcribe processed audio using OpenAI Whisper model (large-v3) - selected for state-of-the-art performance and noise robustness Compare transcripts against LibriSpeech ground truth Calculate WER using jiwer library 4. Experimental Setup 4.1 Codec Configurations DAC model: Evaluated at three sampling rates - 16 kHz - 24 kHz - 44 kHz EVS codec: Evaluated in standard modes - Narrowband (NB) - Wideband (WB) Baseline: Uncompressed PCM audio (resampled from 48 kHz to NB and WB) 4.2 Observations on Baseline Variance NB PCM occasionally scored ~0.1% better than WB PCM Attributed to inherent ASR model variance rather than signal quality differences Explains why high-bitrate DAC models occasionally score slightly lower than WB baseline 4.3 Primary Metric WER (Word Error Rate): Lower percentage indicates better performance. Log-scale visualization employed to distinguish performance differences in the 3-5% WER range. 5. 
Results and Analysis 5.1 DAC Performance vs Bitrate Key Findings: - DAC achieves high efficiency at low bitrates (~2 kbps) - WER drops rapidly as bitrate increases, stabilizing around 3-4% - At 1.5 kbps: WER approximately 5.5% - Significant improvement observed in 1.5-3.0 kbps range Bandwidth Impact at Low Bitrates: - At low bitrates (1.5 kbps and 3 kbps), 16 kHz model outperforms 24 kHz model - With constant model size, 16 kHz model allocates more bits per spectral unit within narrower band - Results in better semantic preservation vs. 24 kHz model suffering from bit starvation 5.2 DAC 8 kHz Narrowband Model Analysis A dedicated 8 kHz sampling rate model was trained to investigate bandwidth impact at the lower bitrate bound. Model Configuration: - Sample rate: 8000 Hz - Encoder rates: [2, 4, 4, 8], dimension: 64 - Decoder rates: [8, 4, 4, 2], dimension: 1536 - Quantization: 6 codebooks, size 1024, dimension 36 - Training: 200,000 steps on VCTK corpus Critical Findings at Sub-1.5 kbps: - At ~1 kbps: - 8 kHz model (938 bps): WER 5.86% - 16 kHz model (1000 bps): WER 11.23% - Semantic penalty > 5 percentage points when forcing WB at 1 kbps At 1.5 kbps: 8 kHz model (1563 bps): WER 3.86% 16 kHz model (1500 bps): WER 5.46% Conclusion: At sub-2 kbps bitrates, available bit budget is insufficient to support wider bandwidth without degrading core spectral content required for intelligibility. Native Narrowband mode allows high-precision bit allocation to fundamental frequencies (0-4 kHz), preserving semantic content more effectively. 5.3 DAC vs EVS Comparison Competitive Advantage: - DAC achieves comparable WER scores at significantly lower bitrates than EVS - DAC 16 kHz performance curves converge towards high-quality PCM baselines faster than traditional codecs ULBC Application: For GEO scenarios in [1-3] kbps range, semantic preservation is critical for defining quality floor. 
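The 8 kHz model bitrates quoted in 5.2 can be cross-checked against the listed model configuration. The following is a minimal sketch, assuming the usual residual vector quantization accounting: one code per codebook per latent frame, log2(codebook size) bits per code, and a latent frame rate equal to the sample rate divided by the product of the encoder rates. It further assumes that the lower operating points are obtained by using fewer of the 6 codebooks. The function name `rvq_bitrate_bps` is illustrative and does not come from the contribution.

```python
import math


def rvq_bitrate_bps(sample_rate_hz, encoder_rates, codebook_size, n_codebooks):
    """Bitrate of an RVQ codec under the assumptions stated in the lead-in."""
    # Total downsampling factor of the encoder (samples per latent frame).
    hop = math.prod(encoder_rates)
    # Latent frames produced per second of audio.
    frame_rate = sample_rate_hz / hop
    # Bits needed to index one entry of one codebook.
    bits_per_code = math.log2(codebook_size)
    return frame_rate * n_codebooks * bits_per_code


if __name__ == "__main__":
    # Configuration from 5.2: 8000 Hz, encoder rates [2, 4, 4, 8], codebook size 1024.
    for n in (1, 3, 5, 6):
        print(n, rvq_bitrate_bps(8000, [2, 4, 4, 8], 1024, n))
```

Under these assumptions, 3 and 5 active codebooks give 937.5 bps and 1562.5 bps, consistent with the 938 bps and 1563 bps operating points in 5.2, and 1 to 6 codebooks span 312.5 to 1875 bps, consistent with the range reported for Table 1.b.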
5.4 EVS Narrowband vs Wideband Analysis Performance at Different Bitrates: At 13.2 kbps (highest tested): EVS-NB: 3.16% EVS-WB: 3.14% Nearly identical, indicating saturation point in semantic quality At 5.9-8.0 kbps range: EVS-WB maintains marginal advantage (e.g., at 8.0 kbps: WB 3.15% vs. NB 3.41%) Both modes provide sufficient basic audio quality At 9.6 kbps: EVS-NB: 3.19% EVS-WB: 3.23% NB performance very close to WB, difference within ASR model error margin Conclusion: For semantic understanding, NB bandwidth limitation is less critical than codec's bit allocation efficiency. 5.5 EVS Degradation Analysis Methodology: Calculated WER Degradation = (WER_coded - WER_baseline) / (100 - WER_baseline) to isolate codec processing impact from ASR model variance. Key Findings: - Semantic loss introduced by EVS in both NB and WB modes is minimal - Degradation metric confirms that pure coding loss of NB and WB is statistically indistinguishable when subtracting baseline PCM variance - Additional frequency content in wideband contributes negligible semantic information for machine understanding compared to core NB spectrum Strategic Implication: Robust NB mode is sufficient for intelligibility requirements of critical last resort communications, without bit starvation risk associated with wider bandwidths at low bitrates. 
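The degradation metric in 5.5 can be written as a small function. This is a sketch of the formula as summarized above, with WER values in percent; the function name and the baseline value in the usage example are hypothetical, since the PCM baseline figures are not reproduced in this summary.

```python
def wer_degradation(wer_coded_pct, wer_baseline_pct):
    """(WER_coded - WER_baseline) / (100 - WER_baseline), inputs in percent."""
    if not 0.0 <= wer_baseline_pct < 100.0:
        raise ValueError("baseline WER must be in [0, 100) percent")
    return (wer_coded_pct - wer_baseline_pct) / (100.0 - wer_baseline_pct)


if __name__ == "__main__":
    # 3.16 is the EVS-NB WER at 13.2 kbps from 5.4.
    # 3.00 is a HYPOTHETICAL baseline used only to exercise the formula.
    print(wer_degradation(3.16, 3.00))
```

A coded WER equal to the baseline yields zero degradation, and a coded WER below the baseline (as can occur from ASR variance) yields a small negative value.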
5.6 Summary of Findings Strategic Conclusions for ULBC Design: For ~1 kbps emergency/GEO scenarios: Semantic intelligibility is paramount; NB and WB offer comparable semantic preservation; enforcing wider bandwidth at extremely low bitrates is risky due to the limited bit budget; narrowband is the superior design choice at the lowest bitrates, allowing the encoder to focus bits on the basic voice quality foundation. AI-based codec sampling rate optimization: The DAC 16 kHz model provides a distinct advantage over the 24 kHz model at lower bitrates; the 8 kHz model (trained for only 200k steps) outperforms the official 16 kHz model at low bitrates; optimizing the sampling rate to match the available bit budget is critical for system design; performance of intermediate rates (e.g., 12 kHz) remains an open question. 6. Proposals Proposal: Include relevant content from Sections 3, 4, and 5 into TR 26.940, capturing: - Methodology - Experimental setup - Analysis of results concerning audio bandwidth impact on semantic intelligibility 7. Detailed Results Tables Complete experimental data provided in appendix tables covering: - Table 1.a: DAC Model Results (16/24/44 kHz) across bitrates 500-7751 bps - Table 1.b: DAC NB Model Results (8 kHz) across bitrates 312-1875 bps - Table 2: EVS & PCM Baseline Results for NB/WB modes at 5900-13200 bps
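For readers unfamiliar with the WER metric used throughout the S4-260152 summary, the sketch below computes it as the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words, expressed in percent. The contribution reports using the jiwer library; this standalone version is not jiwer, applies no text normalization, and uses made-up example sentences, so it illustrates the definition only.

```python
def word_error_rate(reference, hypothesis):
    """Word error rate in percent between two whitespace-tokenized strings."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference words consumed so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return 100.0 * prev[-1] / len(ref)


if __name__ == "__main__":
    # Illustrative sentences, not taken from LibriSpeech.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

In the example, one substitution and one deletion against six reference words give a WER of about 33.3%.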
S4-260154 PDF Info	China Mobile Com. Corporation	[FS_ULBC]pCR on Existing codec technologies	Summary of pCR on Existing Codec Technologies (S4-260154) Document Information Source: China Mobile Com. Corporation Specification: 3GPP TR 26.940 V0.5.1 Meeting: TSG-SA WG4 Meeting #135, Goa, India, 09-13 February 2026 Purpose and Scope This pCR proposes updates to Clause 7.1 of TR 26.940, which documents existing codec technologies for evidence that design criteria can be met and for comparison/evaluation purposes. The document adds information about recently emerged ultra-low bit-rate voice codecs (below 1 kbps) as reference for further work. Main Technical Contributions Expanded Codec Technology Reference Table The pCR significantly expands Table 7.1.1-1 "List of existing codec technologies" by adding multiple categories of codecs beyond the existing 3GPP IMS codecs. The table includes the following parameters for each codec: Source/Reference Audio bandwidth (NB/WB/SWB/FB) Codec delay (ms) Frame duration (ms) Bitrates (kbps) Specification access/software availability New Codec Categories Added 1. Conventional Ultra Low Bitrate Codecs MELP/MELPe: 0.6-2.4 kbps, NB, 22.5-90ms frame duration AMBE-LR: 1.6-1.8 kbps, NB MPEG-HVXC: 2-4 kbps, NB TWELP MR: 0.3-3.2 kbps, NB, various frame durations (40-120ms) Codec2: 0.45-2.4 kbps, NB, primarily 40ms frames 2. AI-Based Decoders WaveNet Codec2: 2.4 kbps, WB, 20ms frames CQNV Codec2: 1.0-1.1 kbps, WB, 40-60ms frames 3. AI-Based Encoder and Decoder (Causal) These codecs support real-time operation: - LPCNet: 1.6 kbps, WB, 40ms frames, 25ms delay - LyraV2 (SoundStream): 3.2-9.2 kbps, WB, 20ms frames - EnCodec: 1.5-24 kbps, 24kHz/FB, 0-1000ms delay, 13.3ms frames - Mimi-Codec: 0.55-1.1 kbps, 24kHz, 80ms frames, 0ms delay - TS3: 0.64-0.8 kbps, WB, 20ms frames, 0ms delay - TAAE: 0.4-0.7 kbps, WB, 20-40ms frames, 0ms delay - LMCodec2: Parameters TBD 4. 
AI-Based Encoder and Decoder (Non-Causal) These codecs are designed for offline/non-real-time applications: - DAC: 0.5-3 kbps, WB/24kHz, 244-366ms delay - DAC-IBM: 0.75-3 kbps, 24kHz, 366ms delay - SNAC: 0.98 kbps, 24kHz, 1000ms delay, 80ms frames - SpeechTokenizer: 0.5-1.0 kbps, WB, full-signal delay - SemantiCodec: 0.31-1.4 kbps, WB, 10-40ms frames, full-signal delay - FunCodec: 0.25-1.0+ kbps, WB, 20-40ms frames - WavTokenizer: 0.25-0.9 kbps, 24kHz, 25-40ms frames - BigCodec: 1.04 kbps, WB, 12.5ms frames - FocalCodec: 0.16-0.65 kbps, WB, 20-80ms frames - ALMTokenizer: 0.41 kbps, WB, 13.3ms frames - XY-Tokenizer: 1 kbps, WB, 20ms frames - LongCat-Audio-Codec: 0.43-0.87 kbps, WB, 60ms frames - AcademiCodec: Parameters TBD - MuCodec: 0.35-1.35 kbps, FB Additional Notes The pCR includes several important notes: Note 1: Some codecs may include noise suppression Note 2: MPEG-HVXC decoder and reference encoder available only to MPEG members Note 3: Codec2 uses 20ms overlapping FFT/iFFT with overlap-add Note 4: Some codecs only have non-causal versions publicly available Note 5: TWELP has a complete quality assessment testbench available despite lacking open reference implementation An editor's note indicates that more codecs may be added to the table in future revisions. Key Observations The pCR demonstrates significant industry progress in ultra-low bitrate speech coding, particularly: - Multiple AI-based solutions achieving sub-1 kbps bitrates - Wide range of delay characteristics (0ms to 1000ms) - Various bandwidth support (NB to FB) - Different availability levels for specifications and software implementations
S4-260155 PDF Info	vivo Mobile Communication Co., Xiaomi Technology, Spreadtrum, Bytedance	[FS_ULBC] Analysis of AI Codec Real-Time Performance (RTF) and Complexity Scaling	Analysis of AI Codec Real-Time Performance (RTF) and Complexity Scaling 1. Introduction and Motivation This contribution addresses a critical gap in the Ultra Low Bitrate Speech Codec (ULBC) study by moving beyond theoretical complexity metrics (FLOPs, WMOPS) to evaluate real-world performance on mobile devices. The key observation is that static metrics fail to capture system-level bottlenecks including memory bandwidth pressure and thermal constraints on mobile SoCs. The document presents a comprehensive RTF analysis of a neural audio codec (based on Descript Audio Codec architecture) across multiple model sizes and sample rates on representative mid-range mobile hardware. 2. Experimental Setup 2.1 Model Configuration Eight model variants were evaluated, ranging from enc8dec144 to enc64dec1536, with parameter counts spanning 1M to 74M: Architecture: Fully convolutional encoder-decoder with Residual Vector Quantization (RVQ) Frame length: 40ms (fixed across all variants) Total up/down-sampling factor: 320 (consistent across variants) Sample rates tested: 8 kHz (320 samples), 16 kHz (640 samples), 32 kHz (1280 samples) Export format: ONNX with Float32 precision Key complexity observations from Table 1: - Parameter counts range from 1.09M (enc8dec144) to 74.50M (enc64dec1536) - Model sizes range from 4.3 MB to 283.6 MB - Computational complexity scales proportionally with sample rate (e.g., enc32dec768: 4955.9 MFlops/s @ 8kHz, 9972.6 MFlops/s @ 16kHz, 20006.1 MFlops/s @ 32kHz) 2.2 Device Under Test (DUT) Environment Platform: MediaTek Dimensity 1200 (6nm) - representative mid-range SoC Inference engine: ONNX Runtime v1.14+ with CPU execution provider (single-threaded) CPU clusters tested: Efficiency cluster: Cortex-A55 Performance cluster: Cortex-A78 Prime core: Cortex-A78+ Methodology: 
Frequency-locked operation with disabled thermal services and power HALs to eliminate dynamic frequency scaling noise 3. Results and Analysis 3.1 Complexity Scaling vs. Bandwidth Critical finding: For a given model variant, computational complexity scales linearly with sample rate: - enc32dec768 example: - 8 kHz: ~0.20 GFLOP counts (4955.9 MFlops/s) - 16 kHz: ~0.40 GFLOP counts (9972.6 MFlops/s) - 2x increase - 32 kHz: ~0.80 GFLOP counts (20006.1 MFlops/s) - 4x increase Implication: Higher sampling rates incur proportional computational penalty. For resource-constrained devices (IoT, wearables), NB mode at 8 kHz is recommended. 3.2 Real-Time Factor (RTF) Analysis Across Three Frequency Tiers 3.2.1 Tier 1: Low Frequency (A55@750MHz, A78@902MHz, A78+@1.1GHz) Energy-conserving state with severe constraints: Cortex-A55 @ 750 MHz: Only smallest models (enc8dec144) maintain real-time at 8 kHz; 16/32 kHz unfeasible Cortex-A78 @ 902 MHz: 32 kHz: Limited to <3M parameters 16 kHz: Supports up to ~8M parameters 8 kHz: Supports up to ~10M parameters Cortex-A78+ @ 1.108 GHz: Similar to A78 but extends 16 kHz limit closer to 10M parameters 3.2.2 Tier 2: Mid Frequency (A55@1.0GHz, A78@1.16GHz, A78+@1.37GHz) Typical sustained workload state: Cortex-A55 @ 1.0 GHz: 8 kHz supports up to ~2M parameters; 16/32 kHz remain largely unfeasible Cortex-A78 @ 1.162 GHz: 32 kHz: ~5M parameter limit 16 kHz: ~10M parameters (covers "Low Complexity" profile) 8 kHz: Robust up to ~20M parameters Cortex-A78+ @ 1.37 GHz: Performance parity with A78 (clock speed is primary differentiator) 3.2.3 Tier 3: High Frequency (A55@1.73GHz, A78@1.45GHz, A78+@1.63GHz) High-performance state approaching sustained limits: Cortex-A55 @ 1.73 GHz: 8 kHz: ~3M parameters 16 kHz: ~2M parameters 32 kHz: ~1M parameters Cortex-A78 @ 1.451 GHz: 32 kHz: ~7M parameters 16 kHz: ~10M parameters 8 kHz: ~20M parameters Cortex-A78+ @ 1.632 GHz: Highest headroom 32 kHz: ~8M parameters 16 kHz: Comfortably supports 10M parameters 8 
kHz: ~20M parameters Key observation: Inverse relationship between sample rate and model size capacity is consistently demonstrated. 3.3 Maximum Performance Envelope Analysis at peak locked frequencies establishes absolute upper bounds: 3.3.1 Efficiency Core (Cortex-A55 @ 2.0 GHz) Even at peak frequency, A55 remains highly constrained. Models exceeding ~5M parameters (enc16dec384) fail real-time constraints at 8 kHz and above. Unsuitable for large weight matrices. 3.3.2 Performance Core (Cortex-A78 @ 2.6 GHz) Most relevant benchmark for ULBC - represents sustained compute capability of modern mobile devices. Critical "Complexity vs. Bandwidth" trade-off identified: 32 kHz: RTF crosses 1.0 near 10M parameters (enc24dec576 variant) Hard limit for High-Fidelity ULBC candidates 16 kHz: Feasible model size effectively doubles to ~20M parameters (enc32dec768 variant) enc40dec960 fails real-time constraints Linear relationship between bandwidth reduction and parameter capacity 8 kHz: Extends to ~39M parameters enc40dec960 (29M) is safe Trend suggests failure before enc64dec1536 3.3.3 Prime Core (Cortex-A78+ @ 3.0 GHz) Results mirror A78 trends with slight improvements due to higher clock frequency. The bandwidth bottleneck remains dominant - higher clock speed provides safety margin for borderline models (e.g., enc24dec576 @ 32kHz) but doesn't fundamentally shift feasible model size category. 4. 
Key Technical Contributions 4.1 Quantified Complexity-Bandwidth Trade-off Established precise inverse relationship: halving sample rate approximately doubles feasible parameter count on performance cores: - 32 kHz → 10M parameters - 16 kHz → 20M parameters - 8 kHz → 39M parameters 4.2 Real-World Performance Benchmarks Provided concrete RTF measurements across representative mobile hardware configurations, revealing that: - Theoretical complexity metrics (FLOPs) don't capture real-world bottlenecks - Memory bandwidth and thermal constraints significantly impact feasibility - Efficiency cores (A55) are unsuitable for neural codec workloads beyond minimal complexity 4.3 Practical Complexity Constraints for ULBC Identified 10M parameter hard limit for 32 kHz operation on mid-range mobile devices (A78 @ 2.6 GHz), providing concrete guidance for ULBC candidate selection. 5. Proposal The contribution proposes including these RTF analysis findings in TR 26.940 to inform complexity constraint selection for ULBC candidates, moving the standardization process toward real-world deployability considerations rather than purely theoretical metrics.
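As a rough illustration of the sample-rate scaling reported above for enc32dec768, the following Python sketch recomputes the samples per 40 ms frame and the MFlops/s ratio relative to 8 kHz. The figures are the ones quoted in this summary; the function names are hypothetical and not part of the contribution.

```python
# Figures quoted in the summary for enc32dec768 (40 ms frames).
FRAME_MS = 40
REPORTED_MFLOPS = {8000: 4955.9, 16000: 9972.6, 32000: 20006.1}


def samples_per_frame(sample_rate_hz, frame_ms=FRAME_MS):
    """Number of PCM samples in one codec frame."""
    return sample_rate_hz * frame_ms // 1000


def scaling_vs_baseline(reported, baseline_hz=8000):
    """Complexity at each sample rate relative to the baseline rate."""
    base = reported[baseline_hz]
    return {rate: value / base for rate, value in reported.items()}
```

Calling `scaling_vs_baseline(REPORTED_MFLOPS)` gives ratios close to 1, 2 and 4, which is the linear scaling with sample rate that the summary describes.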
S4-260157 PDF Info	vivo, Samsung, MediaTek Inc., Bytedance, Nokia, Xiaomi, Spreadtrum	[FS_ULBC] Discussion on Audio Bandwidth for ULBC	Technical Summary: Audio Bandwidth Requirements for ULBC 1. Introduction and Scope This contribution addresses audio bandwidth design constraints for the Ultra-Low Bitrate Codec (ULBC), targeting primarily voice over GEO satellite communications. The document argues against mandatory Wideband (WB) and Super-Wideband (SWB) support, proposing instead that Narrowband (NB) should be mandatory with WB as an enhancement. 2. Key Technical Arguments 2.1 Global NB Usage and System Efficiency Current Network Reality: - 2G/3G connections (primarily AMR-NB) still represent 20% of global technology mix (end of 2023) - Regional variations: 81% in Sub-Saharan Africa, 46% in Middle East and North Africa - NB serves as universal fallback for interoperability (CS fallback scenarios) System Inefficiency Without NB Mode: - WB ULBC to NB user calls waste upper frequency band (4-8 kHz) - Significant bitrate wasted transmitting data that recipient cannot hear - Over expensive, scarce satellite link, this inefficiency is unacceptable - Native NB mode provides most efficient solution for legacy network connectivity 2.2 User Expectations in "Last Resort" Scenarios Baseline Expectation Setting: - GEO call is final option after terrestrial network failure - Users typically experience AMR-NB fallback before resorting to GEO - ULBC must be at least as reliable as NB fallback to meet user expectations - WB-only ULBC failure in conditions where NB would work represents service failure 2.3 Primary Use Case: Emergency Communications Typical Deployment Scenario: - Rescue teams in remote areas (e.g., Himalayan mountains) - Mixed-connectivity environment: - Squad A: GEO-only (outside TN coverage) - Squad B: GSM fallback at coverage fringe - Base Camp: PSTN connection (NB service) Technical Implications: - Terminating endpoints predominantly NB - Emergency systems use 
traditional NB codecs (Codec2, MELP) for robustness - Transmitting WB over satellite to NB endpoint wastes critical resources in life-or-death situations - Real-world deployment example provided (China rescue missions) Evaluation Priority: - ULBC candidates should prioritize intelligibility and robustness testing in NB mode 2.4 Performance at Very Low Bitrates Quality vs. Bandwidth Trade-off: - Forcing wider bandwidth at very low bitrates spreads available data too thinly - Research shows lower sampling rates can achieve higher perceptual quality at very low bitrates - WB codec at ~1 kbps may compromise intelligibility, especially with packet loss - NB signal more robustly reconstructed under constrained conditions Analogy: "Spreading butter" - concentrating bits on narrower bandwidth preserves speech richness and intelligibility 2.5 Complexity and Power Consumption Computational Scaling Issues: - AI-based codec architectures don't scale gracefully - Doubling sampling rate (NB to WB): 2x to 4x complexity increase for CNN/Transformer models - WB-only mandate imposes unnecessary computational burden - Critical issue for power-constrained mobile devices - Native NB mode offers high-quality voice at significantly lower complexity/power budget 3. 
Experimental Analysis: Higher Bandwidth Inefficiency 3.1 Experiment Setup Test Configuration: - Codec: Descript Audio Codec (DAC) with pre-trained models - Sampling rates tested: 44.1 kHz, 24 kHz (SWB), 16 kHz (WB) - Test corpus: 100 clean speech samples from MS-SNSD dataset - Bitrate variation: 1-9 active quantization codebooks - Quality metric: ViSQOL algorithm (speech mode, MOS estimate) Model Specifications: \| Model \| Compression \| Frame Rate \| Codebooks \| Bitrate/Codebook \| \|-------\|-------------\|------------\|-----------\|------------------\| \| 16 kHz (WB) \| 320x [2,4,5,8] \| 50 Hz \| 12 (10-bit) \| 0.50 kbps \| \| 24 kHz (SWB) \| 320x [2,4,5,8] \| 75 Hz \| 32 (10-bit) \| 0.75 kbps \| \| 44.1 kHz \| 512x [2,4,8,8] \| ~86.1 Hz \| 9 (10-bit) \| ~0.86 kbps \| 3.2 Key Experimental Findings Quality vs. Bitrate Results: - WB (16 kHz): Achieves excellent quality (ViSQOL MOS > 4.0) at ~2.5 kbps - 24 kHz SWB: Requires higher bitrate to match WB quality - 44.1 kHz: Provides minimal perceptible improvement over 24 kHz SWB - Conclusion: Bitrate cost of SWB not justified by quality improvement for voice content Efficiency Analysis: - Clear trend: diminishing returns for bandwidth beyond WB - SWB/FB represents inefficient use of bandwidth for ULBC service 4. Proposed Design Constraints 4.1 Bandwidth Requirements Mandatory Support: 1. 8 kHz sampling rate (NB): 50-4000 Hz audio bandwidth 2. 16 kHz sampling rate (WB): 50-8000 Hz audio bandwidth - Enhanced quality where channel conditions and device capabilities permit - WB support can be limited to higher bitrates than NB operation Further Study: - Necessity and feasibility of SWB and FB support remains FFS 4.2 Text Proposal for TR 26.940 Change to Table 6.2-1 (Design Constraint Parameters): Sample rate and audio bandwidth: - The ultra low bitrate codec shall support sampling rates of 8kHz (NB) and 16kHz (WB) - Supported audio bandwidth: - NB: 50-4000 Hz - WB: 50-8000 Hz 5. 
Supporting Evidence Summary Quantitative Data: - 20% global 2G/3G connections (hundreds of millions of users) - Regional NB dominance: up to 81% in some areas - WB achieves MOS > 4.0 at 2.5 kbps - 2x-4x complexity increase for WB vs. NB in AI codecs Qualitative Arguments: - System efficiency (no wasted bandwidth to NB endpoints) - User expectation alignment (last resort reliability) - Emergency use case requirements - Computational/power constraints for mobile devices - Diminishing returns for SWB/FB at target bitrates
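The "Bitrate/Codebook" column in the model table of clause 3.1 follows from the frame rate and the 10-bit codebook size. A minimal Python sketch of that arithmetic is given here, assuming the frame rate is the sample rate divided by the total compression factor; the function names are illustrative only.

```python
def frame_rate_hz(sample_rate_hz, compression):
    """Latent frames per second: sample rate over total downsampling factor."""
    return sample_rate_hz / compression


def kbps_per_codebook(sample_rate_hz, compression, bits_per_code=10):
    """Bitrate contributed by one active codebook, in kbps."""
    return frame_rate_hz(sample_rate_hz, compression) * bits_per_code / 1000.0


def bitrate_kbps(sample_rate_hz, compression, active_codebooks, bits_per_code=10):
    """Total bitrate for a given number of active codebooks, in kbps."""
    return active_codebooks * kbps_per_codebook(
        sample_rate_hz, compression, bits_per_code
    )
```

Under this arithmetic, five active codebooks at 16 kHz correspond to 2.5 kbps, which matches the bitrate at which the summary reports ViSQOL MOS > 4.0 for WB.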
S4-260158 PDF Info	vivo Mobile Communication Co.,	[FS_ULBC] Analysis of AI Codec Complexity Scaling	Complexity Analysis of AI Codec Scaling for ULBC 1. Introduction This contribution addresses the need for establishing relevant complexity evaluation methods for the new ULBC codec standardization. Previous contributions (e.g., S4aA250264) highlighted potential gaps between theoretical complexity metrics (FLOPs) and practical on-device performance (Real-Time Factor). This document provides a complementary analysis focusing on how complexity metrics scale with AI model architecture itself. The analysis investigates the relationship between model architecture, theoretical complexity, and traditional metrics using the publicly available DAC codec as a test case. 2. Analysis of AI Codec Complexity Scaling 2.1. Methodology The analysis created seven "dummy" model variants based on the open-source DAC codec's 16kHz configuration. The approach: Base Configuration: Sample rate: 16kHz Encoder dimension: 64 Encoder rates: [2, 4, 5, 8] Decoder dimension: 1536 Decoder rates: [8, 5, 4, 2] Scaling Approach: Only `encoder_dim` and `decoder_dim` were modified Encoder/decoder rates kept constant across all variants Total up/down-sampling factor maintained at 320 (2×4×5×8 = 8×5×4×2) Frame size: 20ms (320 samples at 16kHz) Variant Configurations: enc8dec144 enc12dec288 enc16dec384 enc24dec576 enc32dec768 enc40dec960 enc64dec1536 Complexity Metrics Measured: Model Parameters (Millions): Total trainable parameters Theoretical Complexity (MFLOP/s): Calculated using thop profiling library (aligned with S4aA250264 and S4aA250231) WMOPS: Traditional methodology using ITU-T STL wmc_tool, measured separately for encoder and decoder Implementation Notes: - Each AI operation implemented in pure C - Source files annotated and compiled using wmc_tool - WMOPS highly sensitive to C implementation efficiency - Naive implementations can yield significantly higher counts than optimized versions 2.2. 
Complexity vs. Model Dimensions Key Findings: Clear non-linear relationship between latent dimensions and the resulting parameters/computational load Model parameters and MFLOP/s scale quadratically (or faster), not linearly, as encoder_dim and decoder_dim increase Results visualized in Figure 1 (Parameters vs. Dimension) and Figure 2 (MFLOP/s vs. Dimension) Encoder and decoder points are linked pairs corresponding to bundled setups 2.3. WMOPS vs. Model Parameters Key Finding: Clear relationship between AI model size (in millions of parameters) and traditional WMOPS complexity. Observations on DAC Model: Clear correlation between the number of model parameters and the resulting WMOPS when the same architecture and the same C optimization level are used Decoder complexity scales significantly faster and is substantially higher than encoder complexity for all variants (DAC allocates more parameters and complexity to the decoder to achieve better reconstructed audio quality) Growth in WMOPS appears linear relative to the increase in parameters for both encoder and decoder 2.4. 
Summary of Scaled Variants Complete complexity metrics for all seven DAC variants (16kHz, 20ms frame): \| Variant \| Enc Dim \| Dec Dim \| Params (M) \| GFLOP counts \| MFLOP/s \| WMOPS Enc \| WMOPS Dec \| \|---------\|---------\|---------\|------------\|--------------\|---------\|-----------\|-----------\| \| enc8dec144 \| 8 \| 144 \| 1.09 \| 0.009 \| 437.09 \| 333.92 \| 760.53 \| \| enc12dec288 \| 12 \| 288 \| 2.89 \| 0.028 \| 1397.63 \| 648.23 \| 2732.96 \| \| enc16dec384 \| 16 \| 384 \| 4.94 \| 0.050 \| 2481.98 \| 1060.79 \| 4724.38 \| \| enc24dec576 \| 24 \| 576 \| 10.76 \| 0.112 \| 5578.38 \| 2228.92 \| 10399.00 \| \| enc32dec768 \| 32 \| 768 \| 18.90 \| 0.198 \| 9911.72 \| 3693.56 \| 18093.30 \| \| enc40dec960 \| 40 \| 960 \| 29.34 \| 0.310 \| 15482.00 \| 5599.48 \| 28019.70 \| \| enc64dec1536 \| 64 \| 1536 \| 74.50 \| 0.792 \| 39614.50 \| 13675.30 \| 70766.69 \| Data demonstrates rapid scaling of all metrics as encoder and decoder dimensions increase. 3. Observations and Conclusions Based on the DAC model variant analysis: Linear Relationship: For the DAC model, there is a clear linear relationship between Theoretical Complexity (MFLOP/s), Model Parameters, and measured WMOPS. As MFLOP/s or parameter count increases, WMOPS increases linearly, provided C coding style remains consistent. Quadratic Growth: Increasing model's internal dimensions causes complexity to grow quadratically. Even small dimension increases lead to disproportionately large jumps in MFLOP/s and WMOPS. Implementation Dependency: WMOPS score depends heavily on source C code efficiency. 4. Proposal It is proposed to capture the above analysis into 3GPP TR 26.940.
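The linear relationship between parameter count and WMOPS stated in the conclusions can be checked directly against the table above. The sketch below computes a plain Pearson correlation in pure Python, using the seven table rows as input; it is an illustrative check, not the analysis method used in the contribution.

```python
from math import sqrt

# Values copied from the table of scaled DAC variants (16 kHz, 20 ms frame).
PARAMS_M = [1.09, 2.89, 4.94, 10.76, 18.90, 29.34, 74.50]
WMOPS_ENC = [333.92, 648.23, 1060.79, 2228.92, 3693.56, 5599.48, 13675.30]
WMOPS_DEC = [760.53, 2732.96, 4724.38, 10399.00, 18093.30, 28019.70, 70766.69]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)
```

A coefficient near 1 for both `pearson(PARAMS_M, WMOPS_ENC)` and `pearson(PARAMS_M, WMOPS_DEC)` is consistent with the linear trend described in the summary.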
S4-260159 PDF Info	vivo, Samsung, Spreadtrum, MediaTek Inc.	[FS_ULBC] On codec bitrate and capacity discussion for ULBC	Summary of 3GPP CR S4-260159: On Codec Bitrate and Capacity Discussion for ULBC 1. Introduction This CR addresses the TBS (Transport Block Size) and codec bitrate values for ULBC (Ultra Low Bitrate Codec) evaluation, which are currently noted as 'companies reported' in TR 26.940 v0.4.0. The contribution provides analysis on: - Multiplexed UE number analysis - Confirmation of TBS/Codec bitrate values 2. Technical Analysis 2.1 Multiplexed UE Number Analysis The document presents a methodology for calculating supported UE numbers considering: - TDM (Time Division Multiplexing): Both UL and DL can schedule different UEs in TDM manner - FDM (Frequency Division Multiplexing): UL can additionally use FDM since NPUSCH may occupy few subcarriers within 180kHz bandwidth - FDM capacity: 48 UEs for 3.75kHz SCS (single tone) - FDM capacity: 12 UEs for 15kHz SCS (single tone) - Bi-directional constraint: Final supported UE number is the minimum of UL and DL capacity 2.2 Capacity Evaluation Results Analysis conducted under 50-degree elevation channel model with 2% BLER: Key Observations: - Observation 1: Higher UE transmit power leads to higher capacity (multiplexed UE number) for a given codec bitrate - Observation 2: For codec bitrate of ~3kbps, capacity is limited to ~10 UEs with 31dBm UE power. 
Capacity further degrades with increased bitrate (e.g., ≤5 UEs for 4.5kbps) Performance characteristics: - 23 dBm UE power shows very poor capacity - Performance improves with higher power UEs (26 dBm, 31 dBm) - Capacity increases with ptime value 2.3 Benchmark Considerations SA1 assumes transmission data rate of 1-3kbps as benchmark Commercial GEO system (Tiantong) operates within 0.8-2.4kbps range (per clause 5.2.1.3 of reference [1]) Real-world deployments support focusing on bitrates below 3kbps Additional analysis provided in Annex assuming 1% BLER under 10-degree elevation channel model. 3. Proposed Changes to TR 26.940 3.1 TBS and Codec Bitrate Tables The CR proposes specific TBS values selected from TS 36.213 table 16.5.1.2-2 for NB-IoT NPUSCH, with corresponding PHY bitrates and codec bitrates calculated for each bundling period (assuming 7-byte packet header). Table 1: 80ms bundling - TBS range: 88-256 bits - PHY bitrate range: 1.1-3.2 kbps - Codec bitrate range: 0.4-2.5 kbps Table 2: 160ms bundling - TBS range: 120-424 bits - PHY bitrate range: 0.75-2.65 kbps - Codec bitrate range: 0.4-2.30 kbps Table 3: 320ms bundling - TBS range: 208-808 bits - PHY bitrate range: 0.65-2.52 kbps - Codec bitrate range: 0.475-2.35 kbps 3.2 Additional Notes NOTE 1: Final packet header size depends on SA2 and RAN conclusions, including feasibility of 1-byte MAC header NOTE 2: Packet header counted only once regardless of bundled voice frames NOTE 3: Relationship between voice frame duration and bundling time depends on RTP payload design. Loss of single TB means loss of multiple consecutive voice frames when bundled. 4. 
Proposals Proposal 1: Agree that the codec bitrate should be limited to less than 3kbps Proposal 2: Agree to the proposed changes to Section 5.2.2.2 (Uplink simulation parameters) of TR 26.940, including: - Updated TBS value and PHY bitrate tables - Voice bundling periods: 80ms, 160ms, 320ms (40ms excluded due to insufficient time for DL transmissions with 3.75kHz SCS) - Target BLER values: 1%, 2%, 6%, 10% - Maximum Achievable SNR formula incorporating UE power (23/26/31 dBm), bandwidth, and antenna gain variations 5. Supporting Information The Annex provides additional multiplexed UE number analysis for different codec bitrates and UE power levels under the 10-degree elevation channel model, supporting the main technical conclusions.
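The TBS, PHY bitrate and codec bitrate ranges listed for the three bundling periods are mutually consistent under a simple calculation. The Python sketch below assumes that the PHY bitrate is the TBS divided by the bundling period and that the 7-byte packet header is subtracted once per transport block (per NOTE 2); the function names are hypothetical.

```python
HEADER_BITS = 7 * 8  # 7-byte packet header assumed in the summary


def phy_bitrate_kbps(tbs_bits, bundling_ms):
    """PHY bitrate: one transport block per bundling period (bits/ms == kbps)."""
    return tbs_bits / bundling_ms


def codec_bitrate_kbps(tbs_bits, bundling_ms, header_bits=HEADER_BITS):
    """Codec bitrate after removing the header, counted once per block."""
    return (tbs_bits - header_bits) / bundling_ms
```

For example, a TBS of 88 bits with 80 ms bundling gives 1.1 kbps at the PHY and 0.4 kbps for the codec, and a TBS of 808 bits with 320 ms bundling gives about 2.52 kbps and 2.35 kbps, matching the ends of the ranges in Tables 1 and 3.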
S4-260165 PDF Info	Dolby Laboratories Inc., Novamint, Nokia	On ULBC complexity and RTF analysis	Summary of S4-260165: ULBC Complexity and RTF Analysis Background and Motivation This contribution addresses the need to finalize complexity and memory design constraints for the ULBC (Ultra-Low Bitrate Codec) study. Previous discussions at SA4 #133-e and the ULBC ADHOC meeting explored various complexity metrics and RTF performance data for existing AI codecs (DAC, Lyra v2, HIL). However, insufficient data exists to draw definitive conclusions on complexity constraints for ULBC. The document builds upon previous contribution S4-251844 with the following modifications: - Added CPU core information for experiments - Aligned RTF definition with TR 26.940 clause 7.5.3 - Focused on model sizes 3-20M parameters (more relevant to ULBC use cases) - Provided pCR for TR 26.940 - Removed large chunk-based processing experiments (not relevant for real-time voice communication) Experimental Setup and Methodology Model Configuration Modified DAC architecture with reduced parameters while maintaining general structure: - Model sizes: 20M, 15M, 9M, and 3M parameters (float32 precision) - Training: Optimized for ~1 kbps bitrate at 32 kHz sampling rate - Encoder rates: 4,4,8,10 for all models Complexity Analysis Theoretical Complexity (GMACS): - Computed using ptflops library - Results show linear relationship between model size and GMACS: - 20M: 5.14 GMACS - 15M: 4.03 GMACS - 9M: 2.39 GMACS - 3M: 0.79 GMACS RTF Testing Methodology PyTorch models converted to ONNX format ONNX runtime with XNNPACK execution provider Frame-by-frame processing (80 ms frames) Test duration: 2 minutes (1500 inferences per session) 5 repetitions per experiment Single-threaded execution RTF calculation: max(inference time / frame length) across all frames Experimental Results Test Devices Device 1 (2023): - Hexa-core CPU: 2×3.46 GHz (P core) + 4×2.02 GHz (E core) - Dynamic core switching observed between P and 
E cores Device 2 (2022): - Octa-core CPU: 1×3.00 GHz Cortex-X2 + 3×2.50 GHz Cortex-A710 + 4×1.80 GHz Cortex-A510 - Processing on Cortex-X2 with frequency switching between 2.4 GHz and 1.8 GHz RTF Performance Results \| Model Size \| Max RTF (High Performance) \| Max RTF (Power Efficient) \| \|------------\|---------------------------\|---------------------------\| \| 20M \| 0.39-0.63 \| 0.81-0.9 \| \| 15M \| 0.29-0.43 \| 0.66-0.74 \| \| 9M \| 0.19-0.29 \| 0.44-0.57 \| \| 3M \| 0.09-0.13 \| 0.18-0.31 \| Results demonstrate linear increase in RTF with model size across both performance modes. Key Observations All tested models achieve RTF < 1.0, indicating real-time capability Significant RTF variation between high-performance and power-efficient modes Dynamic CPU core/frequency switching impacts performance 20M model shows max RTF=0.63 (high performance) and RTF=0.9 (power efficient) Smaller models (3M-9M) provide substantial RTF headroom for real-time operation Proposed Text for TR 26.940 The contribution provides a comprehensive pCR adding new clause 6.2.1.7 "RTF and MACS analysis for AI based codecs" with detailed experimental results. 
Key additions to TR 26.940 include: Complexity Considerations (Clause 6.2.1) Real-time processing requirements for voice communication Model size considerations (5-10M parameters for efficient operation) Memory access and power consumption challenges with larger models Complexity Metrics (Clause 6.2.1.4) Discussion of NPU/TPU capabilities measured in TOPS TOPS/W as power efficiency metric (2-15 TOPS/W range for smartphones) MAC operations and MACS as practical complexity metrics RTF as reliable complexity assessment metric Comparison with traditional WMOPS metric Target Devices (Clause 6.2.1.5) NPUs present in most modern smartphones Theoretical max TOPS: 8-59 TOPS (varying precision) TOPS/W range: 2-15 TOPS/W DAC codec estimate: ~150 Giga MAC/sec (~0.3 TOPS) Note on DRAM operations significantly impacting power consumption Key Conclusions (Clause 6.2.1.6) ML codecs require careful model size and complexity optimization NPUs offer 5-20× power efficiency vs CPUs for AI tasks ULBC complexity constraints should not reference existing 3GPP speech codecs Million MACS + model size provide first-order complexity indication RTF useful but requires standardized test platforms WMOPS not directly suitable for NPU-based AI solutions Experimental Data (Clause 6.2.1.7) Complete documentation of DAC-like architecture experiments Detailed RTF and GMACS results for 3M-20M parameter models Device specifications and performance characteristics Proposal Document the experimental methodology, results, and observations in clause 6.2.1 of TR 26.940 as shown in the provided pCR.
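The RTF calculation described in the methodology, max(inference time / frame length) across all frames, can be sketched in a few lines of Python. This is a minimal illustration assuming per-frame inference times have already been measured in milliseconds; the helper names are not taken from the contribution.

```python
FRAME_MS = 80.0  # frame length used in the experiments


def max_rtf(inference_times_ms, frame_ms=FRAME_MS):
    """Worst-case real-time factor over a session of per-frame timings."""
    return max(t / frame_ms for t in inference_times_ms)


def inferences_per_session(duration_s, frame_ms=FRAME_MS):
    """Number of frame-by-frame inferences in a session of given duration."""
    return round(duration_s * 1000.0 / frame_ms)


def is_real_time(inference_times_ms, frame_ms=FRAME_MS):
    """A session is real-time capable if its worst-case RTF stays below 1.0."""
    return max_rtf(inference_times_ms, frame_ms) < 1.0
```

With 80 ms frames, a 2-minute session yields 1500 inferences, as stated in the test setup. Taking the maximum rather than the mean means a single slow frame, for instance during a core or frequency switch, determines the reported RTF.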
S4-260173 PDF Info	vivo Mobile Communication Co.,	[FS_ULBC] Discussion on Methodology for Delay & Error Trace Generation	Discussion on Methodology for Delay & Error Trace Generation for FS_ULBC Introduction This contribution addresses the ongoing debate within SA4 Audio SWG regarding the methodology for generating delay and error traces for Ultra Low Bitrate Codec (ULBC) evaluation under Non-Terrestrial Network (NTN) conditions. Two competing approaches have emerged: Fixed BLER / Target Error Rate: Prioritizes "realistic" channel behavior by fixing a target BLER (e.g., 2% or 10%) and finding feasible Transport Block Sizes (TBS) Fixed Resource / Link Budget: Prioritizes "fair resource usage" by fixing the SNR/Link Budget and allowing codec/modem to trade off bitrate against error robustness (Best Effort) The contribution proposes clarifying the purpose of these simulations by distinguishing between Design and Verification phases. Analysis of Current Approaches 2.1 The Precedent: LTE Simulation Methodology (TS 26.132) Legacy Mechanism (Trace Generation) The LTE MTSI testing methodology in TS 26.132 (Annex E and F) operated on "Stationary" conditions: Input: BLER_tx (e.g., 10%) was a fixed input parameter Process: The model assumed the network had converged to this average BLER with random error using `if (rand(1) < BLER_tx)` logic Output: Traces reflecting packet losses and delays based on re-transmissions Usage for Verification (Annex E & F) Critically, TS 26.132 defined these traces as verification tools (System Testing): UE Delay Verification (Annex E): Generated profiles verify UE can maintain synchronization and meet delay budgets under specific error conditions JBM and PLC Evaluation (Annex F): Profiles constructed with deliberate impairments to verify robustness: Jitter Bursts Packet Inversions Packet Duplication Key Finding: Profiles were treated as Test Vectors to verify robustness against defined impairments, not as "realistic channel recordings" to train 
codec design. The Shift for NTN NTN scenarios introduce challenges that invalidate the LTE approach: Cannot rely on simplistic i.i.d. (independent and identically distributed) random error models NTN channel impairments (shadowing, scintillation) introduce complex, non-stationary error patterns ULBC robustness directly influences tolerance levels, making fixed BLER targets inappropriate Must pivot from 'Assumed BLER' model toward 'Derived Performance' model 2.2 Analysis of Current Approaches for FS_ULBC Approach A: The "Realism" Perspective (Fixed BLER) Methodology: - Define TBSs for each candidate bitrate and bundling time - Traverse all link parameters (SCS, Tone, etc.) to evaluate if resulting link budgets satisfy predefined Target BLER - Generate error trace for each configuration meeting BLER threshold - Number of output traces = Number of defined Target BLERs (for each TBS) Underlying Assumption: AI-based Codecs (specifically PLC mechanisms) require specific "real" error patterns during training/design phase Observation: Limits testing scope to specific "safe" operating points, potentially overlooking codec behavior under unexpected channel degradation Approach B: The "Resource" Perspective (Fixed SNR) Methodology: - Normalize TBS across all candidate codec bitrates assuming consistent packet overhead - For each unique Link Budget (fixed SNR) derived from specific UE, satellite, and link parameters, generate dedicated error traces - Number of output traces = Number of unique Link Budgets (for each TBS) Underlying Assumption: Mimics "Best Effort" or competitive scenario similar to EVS selection, where end-to-end quality (MOS) matters more than intermediate BLER Observation: Logically sound for optimizing system performance, but implies vast search space potentially leading to unmanageable simulation workload 2.3 The Core Issue: Verification vs. 
Design The Logic Chain The standard workflow should be: Delay/Error Profiles Generation → Codec/PLC Verification → System Performance Evaluation Misalignment The current deadlock stems from treating RAN simulation outputs as Design Constraints (training data) rather than Verification Tools. Key Principles: Robustness over Overfitting: Robust Codec and PLC design should not rely on "learning" a specific channel trace from a specific simulator. Design should handle variety of harsh conditions (burst losses, high jitter, varying BLER). Data augmentation is standard practice for training robust AI models. The Role of Traces: As in TS 26.132 Annex F, generated traces serve as "Test Vectors" defining challenging conditions under which the Codec must survive. Whether traces represent 90% or 99% of real-world cases is secondary to sufficiently stress-testing JBM and PLC algorithms. Historical Practice: Delay/Error profiles officially generated by SA4 were never distributed to codec proponents for training purposes; they were solely used to verify codec candidates fulfill design constraints and performance requirements. 2.4 Proposal for the Way Forward Re-orient simulation efforts towards generating a Verification Suite rather than a "Perfect Reality Model": Avoid Excessive "Realism" Filtering: Do not discard simulation results simply because they don't meet strict low-BLER threshold. High BLER conditions are valid "Corner Cases" that ULBC must handle, especially in satellite scenarios with tight link budgets. Limit the Search Space: Select representative subset of challenging conditions (e.g., Deep Fading, High Doppler) at fixed SNR points resulting in range of BLERs (e.g., from <1% up to >10%). Verification Focus: Output traces should verify candidate codecs degrade gracefully under varied conditions. Burden is on Codec proponent to design PLC that works across these profiles, not on RAN simulation group to provide "training set" guaranteeing codec works. 
Proposal: Multi-point Fine-grained Trace Generation (MFTG) The MFTG methodology aims to decouple physical layer simulation assumptions from application-layer codec design by providing a high-resolution library of error traces rather than a single static operating point. Step 1: Resource Baseline Normalization (TBS Definition) Define a set of Reference Transport Block Sizes (TBS) based on unified packet overhead Keep TBS values consistent across all candidate codec bitrates to ensure fair comparison of resource efficiency Step 2: Link Budget Mapping and Granularity Setup Identify the target range of Link Budgets (SNR/CNR) based on realistic NTN deployment scenarios (e.g., LEO/GEO, UE power classes) Establish a fine-grained sampling interval (e.g., 1% BLER to 10% BLER in steps of 1% or 2% from the BLER perspective, or -5dB to 10dB in steps of 1dB from the SNR perspective) Step 3: Large-scale Link-Level Simulation (LLS) Execute Monte Carlo simulations for each defined TBS at every fine-grained sampling point Step 4: Flexible Trace Selection for Verification For Performance Comparison: Proponents selecting a specific source bitrate can identify and use the trace from the library whose SNR/BLER most closely matches their design's intended link budget For Robustness Testing: Proponents can select "stress-test" traces (e.g., those with higher BLER or specific jitter profiles) from the same library to verify PLC and JBM algorithms Conclusion While the source understands the rationale behind both the Fixed BLER approach and the Fixed Resource / Link Budget approach for GEO network simulation, a compromise solution is necessary for FS_ULBC to progress. MFTG is therefore proposed for consideration and agreement.
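Two ideas from this summary lend themselves to a short sketch: the legacy stationary error model with `if (rand(1) < BLER_tx)` logic, and the Step 4 selection of the library trace closest to a design's intended link budget. The Python below is a hedged illustration only; the library layout (a dict keyed by SNR in dB) and all function names are assumptions, and the i.i.d. generator stands in for the link-level simulation that MFTG would actually use.

```python
import random


def iid_error_trace(bler, num_blocks, seed=0):
    """Legacy-style stationary trace: 1 marks a lost block, 0 a received one.

    Each block is lost independently with probability `bler`, mirroring the
    `rand(1) < BLER_tx` logic; NTN channels are not expected to be i.i.d.
    """
    rng = random.Random(seed)
    return [1 if rng.random() < bler else 0 for _ in range(num_blocks)]


def select_trace(library, target_snr_db):
    """Pick the library entry whose SNR is closest to the target link budget.

    `library` is a hypothetical mapping from SNR (dB) to an error trace.
    """
    closest = min(library, key=lambda snr: abs(snr - target_snr_db))
    return closest, library[closest]


def loss_rate(trace):
    """Fraction of lost blocks in a trace."""
    return sum(trace) / len(trace)
```

A proponent would call `select_trace` once for performance comparison at the intended operating point, and again with a lower SNR to obtain a stress-test trace from the same library.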
S4-260175 PDF Info	vivo Mobile Communication Co.,	[FS_ULBC] Proposed ULBC design constraints living document	Candidate Convenor for 3GPP Systems Aspects TSG - ULBC Design Constraints Living Document 1. Scope This living document consolidates design constraints being considered within SA4 for FS_ULBC (Feasibility Study on Ultra-Low Bitrate Codec). Due to the working procedure requiring consensus agreements for design constraints to be integrated into ULBC PD or TR 26.940, and the lack of such consensus so far, this document captures the current status of design constraints even though some items are not fully agreed. 2. ULBC Design Constraints 2.1 Sampling Frequency and Audio Bandwidth Design Constraint: Support of [8, 16, 32] kHz / [NB, WB, SWB] required [1], [2] Editor's Notes and Open Issues: - Support of 8 kHz justified for interoperability; clarification needed on whether NB would be tested/supported "externally" based on external resampling - Support of 48 kHz may be considered at higher bitrate operation - Consideration of at least a single model (e.g., SWB) - Many neural codecs operate at 24 kHz; this specific sampling rate should be discussed - Complexity considerations associated with this parameter; joint decisions may be needed Reference: NB audio typically sampled at 8 kHz (100-3500 Hz), WB at 16 kHz (50-7000 Hz), SWB at 32 kHz (50-14000 Hz), FB up to 20000 Hz 2.2 Number of Audio Channels Design Constraint: ULBC candidate codecs shall support mono coding with one channel input and one channel output 2.3 Bit Rates Design Constraint: ULBC candidate codecs shall operate at bitrates lower than [3.00] kb/s [3] 2.4 Frame Length Design Constraint: Candidate codecs shall operate with a coding frame size of multiple of 20 ms Note: Since larger than 20ms bundling time periods will be used, codec proponents should be allowed to consider solutions with larger than 20ms frame sizes 2.5 Algorithmic Delay Design Constraint: Algorithmic delay shall be less than [coding 
frame size + x] ms 2.6 Complexity Design Constraint: Complexity limits applied according to categories. Computational complexity and program ROM (PROM) of candidate codecs for each category shall be measured with ITU-T STL2009 [4] as observed worst-case encoder + observed worst-case decoder complexity within the same category [5], [6] Categories: - Computational: wMOPS: Less than [x] wMOPS - Memory: RAM, ROM, Program ROM (values TBD) Editor's Notes: - Model size per operation mode is less than [5-10] million parameters - Total number of parameters is less than [Z] million - ULBC Codec should be implementable on mobile device using today's technology - Increased computational complexity and memory usage should be commensurate with gain in quality of user experience (e.g., higher audio bandwidth such as SWB or stereo) or increased efficiency (e.g., lower bit rate for same quality compared to reference codec) 2.7 Potential Use of Noise Suppression as Part of the Codec Design Constraint: If noise suppression is supported inside ULBC, there should be a mechanism to disable noise suppression in the codec [7], [8] Editor's Notes - Clarifications Needed: - Need to support noise suppression in ULBC? 
(typically vendor specific, defined outside the codec) - Impacts on test methodology, DTX operation/performance Motivations: - Disabling noise suppression required to test feature separately - Avoid tandeming in real operation - IMS voice communication defined in TS 22.228; GEO satellite access has no specific requirement on noise handling 2.8 Jitter Buffer Management (JBM) Design Constraint: A JBM solution conforming to requirements in TS 26.114, except for the functional requirement in sub-clause 8.2.2 of TS 26.114: "Speech JBM used in MTSI shall support all the codecs as defined in clause 5.2.1", shall be provided with candidate codecs 2.9 Rate Switching Design Constraint: Candidate codecs shall perform rate switching upon command to the encoder throughout the entire bit rate range at arbitrary frame boundaries. Rate switching may imply switching between different bandwidths Note: Due to the Bundling period and associated TBS, switching might have to happen at the boundary of bundling period 2.10 Packet Loss Concealment (PLC) Design Constraint: A PLC solution shall be provided by ULBC candidate codecs [9] Editor's Notes: - Typical loss profiles/characteristics to be clarified - Support of redundancy to be clarified - Need to be able to handle BLER up to [x%] 2.11 RTP Payload Format Design Constraint: Candidate codecs shall provide an RTP payload format specification supporting the full set of features and functionality of the ULBC candidate codecs 2.12 DTX Design Constraint: Candidate codecs shall provide a complete VAD/DTX/CNG framework. 
It shall be possible to operate the codec with DTX on or DTX off Editor's Note: Typical radio characteristics and optimizations (SPS, DRX, bitrate) to be clarified 2.13 Output Gain Limitation Design Constraint: ULBC candidate codecs shall not amplify the output signal relative to the input signal beyond limits Editor's Note: Similar limits and methodology to measure the amplification are described in the EVS-7a,b processing plan permanent document 3. References [1] S4-251794 - Discussion on Audio Bandwidth for ULBC (vivo, Samsung, MediaTek Inc., Bytedance, Nokia, Xiaomi, Spreadtrum) [2] S4-251808 - Pseudo-CR on Design Constraints of ULBC: Audio bandwidth (Fraunhofer IIS) [3] S4-251792 - On codec bitrate and capacity discussion for ULBC (vivo, Samsung, Spreadtrum, MediaTek Inc.) [4] ITU-T G.191 - Software tools for speech and audio coding standardization (March 2010) [5] S4-251747 - On complexity constraints for ULBC (Huawei Technologies Co., Ltd.) [6] S4-251807 - On complexity design constraints for ULBC (Fraunhofer IIS) [7] S4-251395 - Pseudo-CR on Design Constraints of ULBC: Noise suppression (Fraunhofer IIS) [8] S4-251748 - On noise suppression for ULBC (Huawei Technologies Co., Ltd.) [9] S4aA250268 - Packet Loss Concealment with existing AI based codec DAC (Dolby Laboratories Inc., Ericsson LM, Nokia, Novamint) Note: Items in light blue are candidates for agreement at SA4#135.
S4-260209 PDF Info	vivo Mobile Communication Co.,	[FS_ULBC] Alignment Analysis on Complexity of DAC model	Alignment Analysis on Complexity of DAC Model 1. Introduction This contribution addresses a significant discrepancy in complexity reporting for AI-based codecs in the ULBC study. Two contributions (S4-260165 from Dolby et al. and S4-260155 from vivo et al.) both reported models with approximately 3M parameters but showed substantially different complexity metrics: S4-260165: ~3M parameter model (32 kHz) requires 0.79 GMACS S4-260155: ~3M parameter model (32 kHz) requires approximately 1.41 GMACS (derived from 2821 MFlops/s) Notably, the S4-260165 model's complexity (0.79 GMACS) aligns more closely with the S4-260155 model operating at 16 kHz (~0.70 GMACS), despite the difference in sampling rate. The contribution demonstrates that Model Size (parameter count) is an insufficient metric for constraining complexity across different neural architectures, and proposes GMACS as a robust, architecture-agnostic metric that provides linear correlation with RTF. 2. 
Architectural Analysis and Discrepancy Resolution 2.1 The "Model Size" Trap A detailed breakdown comparison was performed between the two architectures to understand why models with similar parameter counts exhibit different computational footprints: \| Metric \| [2] (16k, ~3M) \| [1] (32k, ~3M) \| \|--------\|----------------\|----------------\| \| Input Rate \| 16,000 Hz \| 32,000 Hz \| \| Total Stride \| 320 (2×4×5×8) \| 1280 (4×4×8×10) \| \| Latent Rate \| 50.0 Hz \| 25.0 Hz \| \| Encoder MACs (M) \| 436.30 \| 461.92 \| \| Quantizer MACs (M) \| 2.25 \| 0.50 \| \| Decoder MACs (M) \| 984.50 \| 1037.12 \| \| Total MFlops/s \| 1423.05 \| 1499.54 \| Key Analysis: The S4-260165 (32k, ~3M) model runs at 2× higher input rate (32k vs 16k), increasing encoder computational cost The S4-260165 model uses 4× higher stride (1280 vs 320), reducing the latent rate to 25Hz (compared to standard 50Hz) The reduced latent rate significantly lowers decoder cost (fewer frames to upsample) Higher input cost balances with lower decoder/latent cost, resulting in comparable total MFlops/s Conclusion: Two models with identical parameter counts can have vastly different runtimes depending on parameter location (shallow vs. deep layers) and stride configuration. 2.2 Verification of Complexity Metrics Theoretical complexity (GMACS) was recalculated to validate the analysis: Using the standard conversion: GMACS ≈ MFlops/s / 1000 × 0.5 The S4-260165 (32k, ~3M) model at 32 kHz yields ~1,499.5 MFlops/s Calculated GMACS: 1499.5 / 1000 × 0.5 ≈ 0.75 GMACS This aligns closely with the reference value of 0.79 GMACS reported in S4-260165 3. GMACS as the Metric When RTF data from S4-260155 is plotted against GMACS (rather than Model Size), the data aligns consistently across architectures. 
Key Findings: RTF scales linearly with GMACS across different CPU tiers (Efficiency, Performance, Prime cores) A specific GMACS budget (e.g., 2.0 GMACS) yields predictable RTF on a target CPU core and frequency, regardless of architectural choices (high-sample-rate input vs. large parameter count in decoder) This metric decouples complexity constraint from specific architectural choices (stride, latent rate), allowing codec designers flexibility in optimization High-complexity validation: S4-260155's 20M model (~5.14 GMACS) demonstrates RTF of 0.9 in power-efficient execution mode on high-end 2023 device, aligning with mid-range Prime Core (3.0 GHz) trend where ~5.3 GMACS corresponds to RTF ≈ 1.0 4. Conclusion By adopting GMACS as the primary complexity metric, the apparent discrepancies between different contribution data are resolved. This enables a unified set of requirements that accurately reflects real-time capability of mobile devices. 5. Proposal Propose to include this analysis in 3GPP TR 26.940, specifically capturing: Model Size is not a consistent proxy for complexity across varying architectures (e.g., high-stride vs. low-stride configurations) GMACS/GFLOPs demonstrates strong linear correlation with real-time performance on mobile devices This analysis provides a solid basis for defining complexity constraints for ULBC candidates References [1] S4-260165, "[FS_ULBC] On ULBC complexity and RTF analysis" [2] S4-260155, "[FS_ULBC] Analysis of AI Codec Real-Time Performance (RTF) and Complexity Scaling"
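Worked check of the figures quoted in the S4-260209 summary: the short Python sketch below recomputes the latent rate from the stride factors and applies the stated conversion GMACS ≈ MFlops/s / 1000 × 0.5. The function names are illustrative only; the input numbers are those given in the summary.

```python
from math import prod


def latent_rate_hz(sample_rate_hz, strides):
    """Latent frame rate = input sample rate / total stride."""
    return sample_rate_hz / prod(strides)


def gmacs_from_mflops(mflops_per_s):
    """Conversion used in the summary: one MAC counted as two FLOPs."""
    return mflops_per_s / 1000 * 0.5


print(prod((2, 4, 5, 8)), latent_rate_hz(16_000, (2, 4, 5, 8)))     # 320 50.0
print(prod((4, 4, 8, 10)), latent_rate_hz(32_000, (4, 4, 8, 10)))   # 1280 25.0
print(round(gmacs_from_mflops(1499.54), 2))   # 0.75, vs 0.79 reported in S4-260165
print(round(gmacs_from_mflops(2821), 2))      # 1.41, as derived for S4-260155
```

The two models differ by a factor of two in latent rate while having a similar parameter count, which is the effect the contribution uses to argue that model size alone does not bound complexity.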
S4-260214 PDF Info	Qualcomm Incorporated, Xiaomi	[FS_ULBC] Feasible bitrates for the NTN-TDL-C channel model with 10-degree elevation angle	Summary of S4-260214: Feasible Bitrates for NTN-TDL-C Channel Model with 10-Degree Elevation Angle Background and Motivation This contribution addresses the determination of feasible Transport Block Sizes (TBS) for the newly agreed NTN-TDL-C channel model at 10-degree elevation angle, which was adopted at SA4 #134 (documented in S4-252108). Key observations include: Two channel models now exist: The original channel model from TR 38.811 Table 6.9.2-3 and the new NTN-TDL-C model for 10-degree elevation Channel model validation: The new channel model shows better correlation with field data from NB-IoT GSO service with handheld devices Field data shows 1-3 dB gap between 1st and 50th percentile SNR New channel model: ~1 dB gap (consistent with field data) Initial channel model: ~6 dB gap (less consistent) TBS table update requirement: The TBS values in permanent document tables (5.2.2.1-1, 5.2.2.1-2, 5.2.2.1-3) should reflect the union of supported TBS values for both channel models Simulation Methodology The contribution evaluates maximum feasible bitrates under worst-case conditions without DMRS bundling, considering two scenarios: Scenario 1: Ideal Timing 80ms bundling period, 2 UE RX, 15kHz SCS, 5Hz Doppler spread Without DMRS bundling No uncertainty in scheduling timing Scenario 2: Timing Uncertainty 80ms bundling period, 2 UE RX, 15kHz SCS, 5Hz Doppler spread Without DMRS bundling 10ms uncertainty in scheduling timing (relevant for large cells without UE-specific Koffset or TA report) Both scenarios target 2% BLER for uplink and downlink. 
Simulation Results Scenario 1 Results (No Timing Uncertainty) Maximum TBS: 936 bits Uplink: 15kHz SCS, 3 tones, 48ms (N_RU=6, N_rep=2) BLER: 1.5% at 31dBm UE TX power Downlink: 24ms (N_SF=6, N_rep=4) BLER: 1.1% at -3.3dB SNR Scenario 2 Results (10ms Timing Uncertainty) Maximum TBS: 680 bits Uplink: 15kHz SCS, 3 tones, 40ms (N_RU=5, N_rep=2) BLER: 0.2% at 31dBm UE TX power Downlink: 20ms (N_SF=5, N_rep=4) BLER: 0.5% at -3.3dB SNR Proposed Changes to Permanent Document TBS Table Updates The contribution proposes adding new TBS values to support the higher bitrates enabled by the new channel model: For 80ms Bundling Period (Table 5.2.2.1-1) Add TBS 936 bits (PHY bitrate: 11.7 kbps, Net bitrate: 11.0 kbps) Add TBS 680 bits (PHY bitrate: 8.5 kbps, Net bitrate: 7.8 kbps) Add intermediate value between 680 and current maximum (424) For 160ms Bundling Period (Table 5.2.2.1-2) Add corresponding new TBS values scaled appropriately For 320ms Bundling Period (Table 5.2.2.1-3) Add corresponding new TBS values scaled appropriately Terminology Change Change "codec bitrate" to "net bitrate" to clarify that this represents the bitrate available to the codec (after accounting for packet headers), not a required codec operating bitrate Updated Tables The proposed tables include: - Packet header assumption: 7 bytes (with note that final size depends on SA2/RAN conclusions on 1-byte MAC header feasibility) - Header counting: Packet header counted only once per bundling period, regardless of number of voice frames bundled - TBS values: Selected from TS 36.213 Table 16.5.1.2-2 for NB-IoT NPUSCH - Net bitrate calculation: PHY bitrate minus overhead from packet headers The complete updated tables show TBS values ranging from 144 to 936 bits for 80ms bundling, with corresponding PHY and net bitrates calculated for each bundling period configuration.
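Worked check of the bitrate figures in the S4-260214 summary: a minimal Python sketch, assuming the PHY bitrate is the TBS divided by the bundling period and the net bitrate subtracts the 7-byte packet header once per bundling period, as the summary describes. Function names are illustrative.

```python
HEADER_BYTES = 7  # packet header assumption stated in the summary


def phy_bitrate_kbps(tbs_bits, bundling_ms):
    """Bits per millisecond is numerically equal to kbit/s."""
    return tbs_bits / bundling_ms


def net_bitrate_kbps(tbs_bits, bundling_ms, header_bytes=HEADER_BYTES):
    """Bitrate left for the codec after one header per bundling period."""
    return (tbs_bits - 8 * header_bytes) / bundling_ms


for tbs in (936, 680):
    print(tbs,
          round(phy_bitrate_kbps(tbs, 80), 1),
          round(net_bitrate_kbps(tbs, 80), 1))
# 936 11.7 11.0
# 680 8.5 7.8
```

Both rows reproduce the PHY and net bitrates quoted for the 80 ms bundling period. The same functions can be called with 160 or 320 ms, but the TBS values for those tables are not given in the summary and are not guessed here.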
S4-260215 PDF Info	Qualcomm Incorporated, Ericsson LM	[FS_ULBC] On the scheduling timing uncertainty	Summary of 3GPP Technical Document: FS_ULBC - Scheduling Timing Uncertainty 1. Background and Motivation This contribution addresses ambiguities in interpreting RAN1 LS S4-251654 regarding uplink and downlink timing for NB-IoT NTN with GEO satellites. The interpretation of this LS has direct implications on: - Scheduling timing uncertainty assumptions - Link capacity calculations The document proposes clarifications to the Permanent Document (PD) version 0.4.0 to resolve these interpretation issues discussed at SA4 #133-e and subsequent meetings. 2. Main Technical Contributions 2.1 Frame Structure for Dynamic Scheduling The document maintains the existing frame structure example for Half-duplex FDD with 80ms bundling period: - NPDSCH duration: 4ms (variable depending on DL SNR) - Multiple UL frequency allocation options: 1, 3, 6, and 12 tones with 15 kHz per tone - Allocation choice depends on UL and DL channel capacity 2.2 Semi-Persistent Scheduling (SPS) Frame Structure Two SPS approaches are presented: Approach 1 (Figure 5.2.2.3-2): - NPDSCH can be positioned anywhere within first 15ms - Maintains minimum 1ms gap to NPUSCH Approach 2 (Figure 5.2.2.3-3): - Based on "Cell_specific_Koffset" approach - Does not depend on "TA report UE capability" 2.3 Gap Composition Between DL and UL The gap consists of: 1. Processing time + DL-to-UL switching: Minimum 1ms for half-duplex device switching 2. 
Max differential delay: Accounts for different round-trip delays of UEs in NTN cell - Typical range: close to 0 to 10.3ms depending on deployment 2.4 Baseline Assumptions for Codec Simulations Key Changes Proposed: For 80ms bundling: - Original assumption: "Max differential delay" is 10ms AND X + Y ≤ 68ms - Proposed change: Replace with reference to beam size no larger than 1500km - Note clarifies this corresponds to scenarios where difference between closest and farthest point to satellite is <1500km - Explicitly states codec can be deployed in scenarios not meeting these constraints For 160ms bundling: - Original assumption: "Max differential delay" is 10ms AND X + Y ≤ 148ms - Proposed change: Replace with reference to beam size no larger than 1500km - Same flexibility noted regarding deployment in other scenarios 2.5 Important Notes and Editor's Notes RAN1 LS Clarification: - Figure 5.2.2.3-1 supportable in most scenarios - May not be supportable when: - Cell is very large (e.g., >3000km) - UE does not support TA report - Network does not support UE-specific K-offset - Requires UE configuration with two HARQ processes and HARQ feedback disabled SPS Design Status: - RAN1/2 have not yet started SPS design work - RAN1 cannot currently confirm whether SPS frame structure examples (Figures 5.2.2.3-2 and associated text) will be supported Editor's Note: - Range of "Max differential delay" is TBC (To Be Confirmed) 3. Summary of Changes The primary technical contribution is replacing specific timing constraint assumptions (X + Y values and max differential delay) with a more practical reference based on beam size ≤ 1500km for codec simulation baseline, while explicitly allowing codec deployment in scenarios exceeding these reference conditions. This provides clearer guidance to SA4 while maintaining flexibility for various deployment scenarios.
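Illustrative relation between the 1500 km reference and the original 10 ms figure: the Python sketch below assumes that "max differential delay" is the difference in round-trip propagation time between the closest and farthest UE, i.e. twice the path-length difference divided by the speed of light. This interpretation and the function name are assumptions and are not stated in the summary.

```python
SPEED_OF_LIGHT_KM_PER_S = 299_792.458


def differential_rtt_ms(path_difference_km):
    """Round-trip delay difference for a given one-way path-length difference."""
    return 2 * path_difference_km / SPEED_OF_LIGHT_KM_PER_S * 1000


print(round(differential_rtt_ms(1500), 2))  # about 10 ms
print(round(differential_rtt_ms(3000), 2))  # about 20 ms
```

Under this assumption, a path-length difference below 1500 km keeps the differential delay at roughly 10 ms, which is consistent with the value the original baseline assumption used.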
S4-260216 PDF Info	Qualcomm Incorporated	[FS_ULBC] On transmission delay for voice over NB-IoT NTN	Summary of S4-260216: On Transmission Delay for Voice over NB-IoT NTN Document Overview This contribution from Qualcomm addresses gaps in TR 26.940's mouth-to-ear delay calculations for NB-IoT NTN systems, specifically highlighting the omission of NPUSCH/NPDSCH transmission durations and clarifying the distinction between propagation delay and transmission delay. Main Technical Issues Identified Problem Statement Missing Transmission Duration: TR 26.940 did not account for the duration of NPUSCH transmission or NPDSCH transmission, which can be significant for NB-IoT (e.g., 64ms for NPUSCH) Terminology Confusion: TR 26.940 confuses propagation delay with transmission delay, where: Transmission delay: The interval from when the first bit leaves a transmitter to when the last bit leaves the transmitter Propagation delay: The time for signal propagation through the medium Processing delay (up to 3ms) can be ignored in mouth-to-ear calculations Proposed Technical Changes 5.2.2.4 Propagation Delay Corrections Key Change: Renamed "Transmission delay" to "Propagation delay" for GEO satellite link Maximum propagation delay: 280ms (per KPI requirement in clause 7.4.2 of reference document) Minimum propagation delay: 248ms (280ms - 32ms, accounting for UE location within beam) Assumes no retransmissions over GEO satellite link 5.2.2.5 Transmission Delay (New Section) New Addition: Introduces proper definition and consideration of transmission delay Defines transmission delay as the interval from first bit to last bit leaving the transmitter Highlights significance for NB-IoT NTN (up to 64ms for NPUSCH in uplink) Must be accounted for in mouth-to-ear delay calculations Transmission delay for transport block size should be based on RAN simulation results 5.2.2.5/6 ULBC Delay Components Section renumbered from 5.2.2.5 to 5.2.2.6 References existing algorithmic delays for IMS 
codecs (AMR and EVS: 5ms to 12ms) Notes that ULBC may have different delay values from codec processing and algorithmic delays Marked as FFS (For Further Study) 5.1.3 Mouth-to-Ear Delay Estimation Updates Editorial Note Added: - Numbers in Table 5.1.3-1 will be updated once RAN simulation is completed to account for transmission delays in uplink and downlink - Current values assume AMR and EVS algorithmic delays - ULBC delay components still need to be addressed - Minimum Delay_GSCN assumed as 20ms Existing Table Structure Maintained: - Frame sizes: 20ms, 40ms, 80ms, 160ms, 320ms - Two scenarios: GEO-TN (main) and GEO-GEO (sub-scenario 1) - Lower and upper bounds for mouth-to-ear delay - Delay ranges from 428-712ms (20ms frame, GEO-TN) to 984-1455ms (320ms frame, GEO-GEO) Dependencies and Next Steps Awaiting RAN simulation results to determine actual transmission delay values for different transport block sizes ULBC-specific delay components require further study Terminology alignment needed with clause 4 "Application Scenario" Table 5.1.3-1 values pending update based on RAN simulation completion
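Illustrative sketch of how a transmission delay follows from the radio configuration: the Python below assumes the NPUSCH format 1 resource-unit durations for 15 kHz subcarrier spacing (8, 4, 2 and 1 ms for 1, 3, 6 and 12 tones) and 1 ms NPDSCH subframes. These durations are an assumption taken from general NB-IoT knowledge, not from the S4-260216 summary, and the function names are hypothetical. The example configurations are the ones reported in the S4-260214 summary.

```python
# Assumed NPUSCH format 1 resource-unit duration in ms at 15 kHz SCS, per tone count.
RU_DURATION_MS = {1: 8, 3: 4, 6: 2, 12: 1}


def npusch_duration_ms(tones, n_ru, n_rep):
    """Uplink transmission duration: RU duration x resource units x repetitions."""
    return RU_DURATION_MS[tones] * n_ru * n_rep


def npdsch_duration_ms(n_sf, n_rep):
    """Downlink transmission duration with 1 ms subframes."""
    return n_sf * n_rep


print(npusch_duration_ms(tones=3, n_ru=6, n_rep=2))  # 48 (TBS 936 uplink)
print(npusch_duration_ms(tones=3, n_ru=5, n_rep=2))  # 40 (TBS 680 uplink)
print(npdsch_duration_ms(n_sf=6, n_rep=4))           # 24 (TBS 936 downlink)
print(npdsch_duration_ms(n_sf=5, n_rep=4))           # 20 (TBS 680 downlink)
```

Durations of this size are a sizeable fraction of an 80 ms bundling period, which is why the contribution asks for them to be added to the mouth-to-ear delay on top of the propagation delay.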
S4-260217 PDF Info	Qualcomm Incorporated	[FS_ULBC] Support for Dual-Tone Multi-Frequency for IMS voice over NB-IoT NTN	Summary of S4-260217: Support for Dual-Tone Multi-Frequency for IMS Voice over NB-IoT NTN Background SA1 has mandated support for Dual-Tone Multi-Frequency (DTMF) for IMS voice over NB-IoT NTN. The document addresses the need to consider multiplexing of DTMF traffic with voice traffic in the system design, referencing RFC 4733 for DTMF payload formats. DTMF Use in IMS Voice Services and Traffic Characteristics DTMF Payload Types RFC 4733 defines two DTMF payload format types: - Telephone events: User button presses (0-9, *, #) during calls - Tones: Ringing tone, busy tone, etc. For IMS calls, tones are generated locally (e.g., "180 Ringing" or "486 Busy Here" SIP messages trigger local tone generation), so only telephone events need to be transported over the air. Technical Specifications RTP payload size: 4 bytes Telephone events: Standard DTMF digits and symbols Traffic Characteristics The document identifies key DTMF traffic characteristics: - DTMF packets are transmitted infrequently (only on button press) - Telephone events may or may not overlap with active voice activity - Multiple DTMF packets may be transmitted per button press, with the RTP marker bit indicating the first packet - RTP packets must differentiate between voice and DTMF packets for multiplexing Design Assumptions Three key assumptions are established: 1. DTMF packet size ≤ voice packet size 2. DTMF delay requirements are less stringent than voice service 3. 
DTMF takes priority over voice SPS Configuration Considerations When SPS (Semi-Persistent Scheduling) is configured for voice traffic with fixed TBS: - If DTMF packets don't overlap with active voice frames, they can be multiplexed with SID frames (smaller than active voice frames) and transmitted in SPS occasions - If overlapping occurs, the UE can puncture an active voice frame and send the DTMF frame instead - SA4 needs to coordinate with RAN1 and RAN2 on SPS design Proposals Proposal 1: Make DTMF support an integral part of IMS voice service over NB-IoT NTN Proposal 2: Design DTMF support based on the three assumptions: - DTMF packet size ≤ voice packet size - DTMF delay requirement less stringent than voice - DTMF priority over voice Proposal 3: SA4 to design mechanisms for voice and DTMF multiplexing for SPS and coordinate with RAN1 and RAN2
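Illustrative sketch of the 4-byte telephone-event payload: the Python below packs and unpacks the RFC 4733 layout (8-bit event code, end bit, reserved bit, 6-bit volume, 16-bit duration in RTP timestamp units). The helper names are hypothetical, only the digits, `*` and `#` are mapped, and the 8 kHz timestamp clock used in the example is an assumption.

```python
import struct

# RFC 4733 event codes for the keys mentioned in the summary.
EVENT_CODES = {**{str(d): d for d in range(10)}, "*": 10, "#": 11}


def pack_telephone_event(key, duration, volume=10, end=False):
    """Build the 4-byte payload: event, E|R|volume, duration (network order)."""
    if not 0 <= volume <= 63:
        raise ValueError("volume must fit in 6 bits")
    if not 0 <= duration <= 0xFFFF:
        raise ValueError("duration must fit in 16 bits")
    flags = (0x80 if end else 0x00) | volume  # reserved bit left at 0
    return struct.pack("!BBH", EVENT_CODES[key], flags, duration)


def unpack_telephone_event(payload):
    """Return (event code, end flag, volume, duration) from a 4-byte payload."""
    event, flags, duration = struct.unpack("!BBH", payload)
    return event, bool(flags & 0x80), flags & 0x3F, duration


# 20 ms of '#' at an assumed 8 kHz timestamp clock = 160 timestamp units.
payload = pack_telephone_event("#", duration=160, end=True)
print(len(payload), payload.hex(), unpack_telephone_event(payload))
```

At 4 bytes the payload is far smaller than any TBS discussed for voice, which is consistent with the first design assumption (DTMF packet size ≤ voice packet size).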
S4-260220 PDF Info	Nokia	Proposed design constraints for noise suppression, DTX, and non-speech inputs	Summary of S4-260220: Design Constraints for Noise Suppression, DTX, and Non-Speech Inputs 1. Background and Context This contribution addresses design constraints for the ULBC (Ultra Low Bitrate Codec) over GEO channel solution, building upon previous discussions from S4-251881 and S4-251786. The document focuses on three key areas: - Noise suppression handling - Discontinuous transmission (DTX) framework - Robustness to non-speech inputs Emergency Call Use Case The contribution emphasizes that emergency calls represent a critical use case for ULBC over GEO, particularly when terrestrial network (TN) service coverage is unavailable. Key considerations include: - Background signals may contain critical contextual information (e.g., voices, environmental sounds indicating danger) - Post-call analysis requirements (ASR transcripts, emergency response evaluation, criminal investigations) - Need for full situational awareness rather than aggressive noise suppression 2. 
Technical Analysis 2.1 Noise Suppression Trade-offs The document identifies several technical challenges: Performance requirements alone may be insufficient: Testing with background signals (even using ITU-T P.800 DCR methodology) may not prevent systems from employing aggressive noise suppression that removes critical background information Ultra-low bit rate optimization: At very low bit rates, there exists an unknown trade-off between: Applying noise suppression Accepting more coding artifacts Potentially reduced intelligibility in presence of background signals Device-specific processing: Acknowledges that device-specific noise suppression is standard practice and will likely be applied before ULBC encoding 2.2 Updated Approach The contribution updates the original proposal from S4-251881 by: - Maintaining the requirement for disableable noise suppression within the codec - Adding specific SNR ranges for stationary (5-15 dB) and non-stationary (10-25 dB) noise - Deferring specific noise type definitions for future discussion - Linking noise suppression behavior primarily to performance requirements 3. Proposed Design Constraints The document proposes updates to Table 6.2-1 in draft TR 26.940 with three new/modified constraint parameters: 3.1 Noise Suppression Constraint Requirement: If noise suppression is supported as part of the candidate codec, it must be possible to disable it to preserve background signals. Editor's Notes: - EN1: Requirement to disable may be considered in connection with specific operating bit rate(s) - EN2: Solution behavior w.r.t. 
potential noise suppression is primarily enforced via performance requirements; default operation for tests is with noise suppression disabled 3.2 DTX Framework Constraint Requirement: The candidate codec shall provide a framework for: - Voice Activity Detection (VAD) - Discontinuous Transmission (DTX) - Comfort Noise Generation (CNG) - Operation with DTX on or DTX off Editor's Note: Operation relating to DTX on and disabling/enabling potential noise suppressor may need clarification 3.3 Robustness to Non-Speech Input Requirement: The candidate codec shall be robust to: - Noisy speech with stationary noise (5-15 dB SNR) - Noisy speech with non-stationary noise (10-25 dB SNR) - Background signals during and between speech segments - Other non-speech input signals Editor's Notes: - EN1: May need to be in performance requirements - EN2: Relevant background signals to be further defined as part of performance requirements, including both stationary and non-stationary types 4. Key Technical Contributions Balanced approach to noise suppression: Recognizes both the need for flexibility in noise suppression (for speech quality) and the critical requirement to preserve background signals (for emergency use cases) Mandatory DTX framework: Establishes VAD/DTX/CNG as a required feature rather than optional, with explicit on/off control Quantified robustness requirements: Provides specific SNR ranges for different noise conditions that the codec must handle Testing methodology guidance: Proposes default testing with noise suppression disabled, while allowing performance requirements to govern overall behavior 5. Open Issues Several editor's notes indicate areas requiring further work: - Specific operating bit rates where noise suppression disable requirement applies - Clarification of DTX and noise suppression interaction - Final placement of robustness requirements (design constraints vs. 
performance requirements) - Definition of specific background signal types for testing - Speech quality requirements (to be addressed separately in performance requirements)
S4-260223 PDF Info	Ericsson LM	UE Antenna Gain in link-budget evaluations	UE Antenna Gain in Link-Budget Evaluations Introduction This contribution addresses the need to establish common assumptions for UE Antenna Gain in link-budget evaluations for FS_ULBC (Ultra Low Bitrate Speech Codec). The document highlights that different assumptions on UE Antenna Gain lead to significantly different conclusions on suitable radio configurations, and proposes alignment with existing 5G NR-NTN assumptions. Problem Statement The current FS_ULBC Pdoc references TR 36.763 with UE Antenna Gain assumptions ranging between 0 dBi and -5.5 dBi. Previous SA4 contributions on link level simulations have shown divergent assumptions regarding achievable link level performance, leading to inconsistent conclusions. The lack of a common assumption for UE Antenna Gain (G_Tx) significantly impacts: Link-budget results Performance references for link-level evaluations Overall system design conclusions Link-Budget Analysis Comparative Evaluation The document presents a detailed side-by-side comparison of link-budget calculations for GEO satellite uplink with two different UE Antenna Gain assumptions: Scenario Parameters (Common): - Satellite Orbit: GEO - Link Direction: Uplink - Device Type: Handheld - Satellite Elevation Angle: 2.3 degrees - Satellite Altitude: 35,786 km - Slant Range: 41,417.91 km - Carrier Frequency: 2000 MHz - Free Space Path Loss (FSPL): 190.8 dB - UE Transmit Power: 23 dBm - Receive Antenna Gain: 51 dBi - Satellite G/T: 19 dB/K - Bandwidth: 3750 Hz - Various losses (atmospheric, shadow fading, scintillation, polarization, additional): 11.4 dB total Key Results: \| UE Antenna Gain \| Received Power \| Noise Power \| SNR at Satellite Receiver \| \|-----------------\|----------------\|-------------\|---------------------------\| \| 0 dBi \| -135.58 dBm \| -138.23 dBm \| 2.66 dB \| \| -5.5 dBi \| -141 dBm \| -138.23 dBm \| -2.84 dB \| The difference in UE Antenna Gain 
assumption results in a 5.5 dB difference in SNR, which is highly significant for link-level performance evaluation and system design. Observations Observation 1: The assumption for UE_Antenna_Gain (G_Tx) critically impacts the resulting SNR at the satellite receiver, which in turn affects conclusions on link-level results. Clarification is needed on whether to use 0 dBi, -5.5 dBi, or both values. Observation 2: It is unlikely that an NB-IoT device would have superior antenna performance compared to an NR handheld device. Therefore, the UE_Antenna_Gain assumption should align with 5G NR-NTN specifications, which use -5.5 dBi. Observation 3: RAN4 guidance (R1-2208353) explicitly recommends -5.5 dBi as a realistic UE antenna gain value, stating: "The UE antenna gain varies depending on the operating frequency and UE design. RAN4 thinks that a realistic UE antenna gain value would be -5.5 dBi. RAN4 would then recommend RAN1 to take this value as an assumption for their link budget evaluation." Proposal Proposal 1: For the support of voice-over-GEO in NB-IoT NTN, align the assumption on UE_Antenna_Gain (G_Tx) with 5G NR-NTN specifications, i.e., -5.5 dBi. This alignment ensures: - Consistency with existing 3GPP NTN specifications - Realistic assumptions based on RAN4 recommendations - Comparable link-budget evaluations across different contributions - Appropriate performance targets for codec and system design
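Worked check of the SNR column in the S4-260223 summary: the summary does not state which formula was used, so the Python sketch below assumes the usual G/T form, SNR = EIRP + G/T − k − FSPL − losses − 10·log10(B), with k = −228.6 dBW/K/Hz. Under that assumption the listed parameters reproduce both SNR values; the received-power and noise-power columns are not recomputed, and the listed 51 dBi receive antenna gain is not needed in this form. Function names are illustrative.

```python
import math

BOLTZMANN_DBW_PER_K_HZ = -228.6


def fspl_db(distance_km, freq_mhz):
    """Free-space path loss for distance in km and frequency in MHz."""
    return 32.45 + 20 * math.log10(distance_km) + 20 * math.log10(freq_mhz)


def uplink_snr_db(tx_power_dbm, ue_gain_dbi, g_over_t_db_k,
                  path_loss_db, other_losses_db, bandwidth_hz):
    """SNR at the satellite receiver using the G/T formulation."""
    eirp_dbw = tx_power_dbm - 30 + ue_gain_dbi
    return (eirp_dbw + g_over_t_db_k - BOLTZMANN_DBW_PER_K_HZ
            - path_loss_db - other_losses_db - 10 * math.log10(bandwidth_hz))


print(round(fspl_db(41_417.91, 2000), 1))  # 190.8
for gain_dbi in (0.0, -5.5):
    snr = uplink_snr_db(23, gain_dbi, 19, 190.8, 11.4, 3750)
    print(gain_dbi, round(snr, 2))
# 0.0 2.66
# -5.5 -2.84
```

Since the UE antenna gain enters the budget as a single additive term, the 5.5 dB change in gain carries over one-to-one into the SNR, as the contribution observes.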
S4-260233 PDF Info	Fraunhofer IIS, Apple Inc.	On reference code and model format	3GPP Change Request Summary: S4-260233 Document Overview This contribution proposes the use of ML model formats as intermediate representations (IR) for the ULBC (Ultra Low Bitrate Codec) reference implementation, rather than a pure C implementation. The document is structured as a proposed Change Request (pCR) to TR 26.940, introducing a new clause 6.4.2. Main Technical Contributions 1. Problem Statement and Motivation (Goal Section) The document identifies a fundamental question for ULBC standardization: whether to provide the entire codec reference implementation in C (including neural network components) or to define specific parts based on ML model formats (e.g., ONNX, PyTorch, TensorFlow). Key concerns with pure C implementation: - Limits UE vendors from leveraging custom architectures and optimizations - UE vendors typically have custom optimization pipelines to port ML models to internal formats - Pure C approach restricts full utilization of specialized hardware (NPUs, DSPs, TPUs) 2. Limitations of C-Based Reference Implementation (Clause 6.4.2.1) Issues with existing WMC (Weighted Million Operations) tool for complexity measurement: - Weights in Table 18.3 of G.191 do not account for vectorized implementations of matrix multiplications - Theoretical complexity estimation does not reflect actual runtime complexity - Does not account for diversity of target platforms Additional limitations identified: - Hardware/platform dependencies: C implementations may rely on platform-specific intrinsics and vectorization pragmas, limiting portability to NPUs - Unoptimized reference code: May not be optimized for certain platforms - Compiler dependencies: Intrinsics are compiler-specific - Maintenance burden: Keeping C implementation updated with new ML operators and architectures is costly and error-prone 3. 
Definitions and Concepts (Clause 6.4.2.1 - Definitions) The document establishes clear terminology: Graph format: Describes neural network as computational graph (structure only, no parameters) Model format: Combines graph representation, trained parameters (weights, biases), and metadata; self-contained and directly runnable Intermediate Representation (IR): Serves as bridge between high-level ML framework and execution runtimes Note: PyTorch does not contain a graph format and requires model definition as Torchcode. 4. Advantages of Model Format Approach (Clause 6.4.3.2) Platform Portability: - Specifies what is computed, not how it's executed - Framework-agnostic: models can be exported from different training frameworks - Allows vendors to use custom toolchains for hardware-specific optimization Hardware Evolution: - Future-proof method to leverage latest AI processor developments - Maintains compatibility with low maintenance effort Combination with Standard C-code: - ULBC can combine ML parts (as model format) with classic signal processing (in ANSI C) - Backend runtime in C can integrate ML components - Enables traditional 3GPP codec reference implementation structure 5. 
Comprehensive Model Format Analysis (Clause 6.4.3.3) The document provides detailed comparison of major ML model formats: \| Format \| Type \| Key Advantages \| Key Limitations \| \|--------\|------\|----------------\|-----------------\| \| ONNX \| Framework-agnostic IR \| Cross-framework portability, wide runtime/hardware support, native OS support (Windows/Linux), dedicated C/C++ runtime \| Operator coverage limitations, limited dynamic graph support \| \| TensorFlow Lite (TFLite/LiteRT) \| Edge/embedded-focused IR \| Mobile/edge optimized, strong Android ecosystem, quantization tools, C/C++ runtime \| TensorFlow-centric, partially vendor-specific maintenance \| \| PyTorch/Python \| Torch.nn.Module + checkpoints \| Easy prototyping, highly optimized conversion tools \| Suboptimal for real-world testing, Python dependencies, no C/C++ runtime without Python \| \| TorchScript \| PyTorch-specific serialized IR \| Static graph without Python dependencies, supports custom ops, LibTorch C++ runtime \| PyTorch-specific, deprecated (being replaced by ExportedProgram) \| \| ExportedProgram & ExecuTorch \| Two IRs: ExportedProgram and .pte \| Replaces TorchScript, canonical PyTorch export IR, dedicated C++ runtime \| PyTorch-specific, requires compilation to another IR, pipeline not fully mature \| \| OpenVINO IR \| Intel/CPU-centric IR \| Strong Intel CPU/GPU optimization \| Not suitable for mobile SoCs, extra conversion step needed \| \| Proprietary vendor IRs \| Vendor-specific internal IR \| Highly hardware-optimized \| Not portable, requires conversion from open IR \| Key observations: - PyTorch format provides maximum flexibility and transparency but may have long-term compatibility concerns due to format evolution - ONNX and TFLite are designed for inference deployment and cross-platform compatibility, representing stable industry standards - ULBC ML parts will likely be based on PyTorch format, convertible to stable formats like ONNX or TFLite 6. 
SoC AI Engine Support Analysis (Clause 6.4.3.4) Hardware landscape: - Major smartphone SoCs include NPUs, DSPs, TPUs, GPUs, and CPUs - Vendors provide specialized runtime environments and SDKs - Vendors use native/preferred internal model formats optimized for their architecture Industry pattern: - All major vendors provide conversion mechanisms from popular open-source formats - Common supported formats: ONNX, TFLite, PyTorch, TensorFlow - References provided for major vendors: Qualcomm, Apple, Samsung, MediaTek, Google, Huawei 7. Summary and Recommendations (Clause 6.4.3.4) Advantages of model-format/IR-based reference implementation: - Decouples algorithm definition from hardware-specific implementation - Leverages existing SoC vendor compilers, AI accelerators, and runtimes - Significantly more portable, maintainable, and future-proof Recommended approach for ULBC reference implementation: 1. Base reference on ML model-format with auxiliary signal processing in C 2. Include both ONNX and PyTorch as ML model-formats 3. Define neural network model-format including operator set and version 4. Specify I/O interfaces of ML models and auxiliary signal processing steps in C 5. Use reference implementation for integration illustration, verification, and testing Proposal The document proposes: 1. Discussion and agreement on selection of one or more model formats for ULBC reference implementation 2. Agreement on principle of using model format as part of ULBC standardization reference model 3. Documentation of findings in TR 26.940 under new clause 6.4.2 Key Technical Impact This contribution represents a significant departure from traditional 3GPP codec standardization approaches by advocating for ML model formats rather than pure C implementations. 
The proposal addresses practical deployment considerations for ML-based codecs while remaining compatible with 3GPP standardization practice through a hybrid approach that combines model formats for the ML components with C code for the signal processing components.
S4-260235 PDF Info	Orange	On the use of objective metrics in ULBC standardization	Summary of 3GPP Technical Document on Objective Metrics in ULBC Standardization Introduction and Scope This document addresses the Study on Ultra Low Bitrate Speech Codec (FS_ULBC), specifically focusing on performance requirements and test methodologies as defined in the WID. The contribution targets study objective 5 regarding speech quality, intelligibility, and conversational quality testing under various conditions (clean/noisy speech, tandeming with IMS codecs, clean/GEO channel conditions). Main Technical Contributions Test Methodologies (Clause 9) Quality Impairments of Ultra-Low Bit Rate Speech Coding (9.1.1) The document identifies specific impairment categories relevant to ULBC: - Loss of listening-only audio quality - Audio bandwidth loss - Impaired intelligibility - Impaired speaker identifiability - Prosodic impairments - Hallucination (word and phone confusions) - Sensitivity to non-speech input (background noise, music, noisy speech, interfering talkers, reverberant speech) Additionally notes that ULBC may incorporate speech enhancement algorithms (noise suppression, gain normalization). 
Challenges of Quality Assessment (9.1.2) The document highlights that ULBC testing introduces new challenges compared to signal processing-based codecs (AMR, AMR-WB, EVS): Traditional 3GPP Approach: - Historical reliance on ITU-T P.800 ACR (Absolute Category Rating) for clean speech - P.800 DCR (Degradation Category Rating) for SWB clean speech, mixed-bandwidth, speech + background noise, and music/mixed content - Previous codec standardizations did not focus on intelligibility, speaker identifiability, or prosodic impairments ULBC-Specific Considerations: - ML-based coding systems introduce new impairment types (e.g., hallucination) not present in signal-processing codecs - ACR may not optimally quantify all impairments (hallucination, intelligibility, prosodic issues) - DCR focuses on differences to reference, which may not directly impact conversational capability but affects aspects like identity recognition Alternative Test Methodologies Listed: - Diagnostic Rhyme Tests (DRT) - Modified Rhyme Tests (MRT) - MOS testing for speaker similarity - Speaker verification/identification tests - Prosodic naturalness MOS tests - Intonation recognition tests - Transcription tests for word and semantic equivalence - Phoneme recognition tests - Automatic speech recognition tests - P.835 multi-dimensional rating scales for speech enhancement evaluation Subjective Testing Considerations (9.1.3) Robustness Related to Source Material (9.1.3.1): - Multiple languages with diverse intonations - Non-speech signals - Various linguistic features and accents - Wide range of speakers (different voice pitches, speaking styles) - Overlapping talkers Simulation of Real-world Acoustic Conditions (9.1.3.2): - Clean environments (minimal background noise) - Noisy environments (traffic, human chatter, vehicle) - Various reverberation levels (RT60 ranging from 0.3s to 1.0s) Tandeming and Compatibility Testing (9.1.3.3): - Testing with speech previously encoded by ITU-T G.711, AMR, AMR-WB, and 
EVS - Various input levels: -16dBov, -26dBov, and -36dBov Conclusion (9.1.3.4): - ITU-T P.800 ACR/DCR serves as backbone for most subjective testing - Other methodologies may be considered - Emphasis on diverse test material: multilingual/multi-speaker testing, real-world acoustic conditions, and tandeming Objective Testing Considerations (9.1.4) Correlation Analysis Results (9.1.4.1): The document presents correlation analysis based on ACR experiments (clause 7.3.3) evaluating objective models: Speech-oriented metrics: PESQ, POLQA, ViSQOL-S, WARP-Q, DNSMOS, NISQA, NORESQA, UTMOS, SCOREQ General audio metrics: PEAQ, ViSQOL-A Evaluation metrics used: Pearson correlation coefficient, RMSE, Kendall's Tau rank correlation coefficient Key Observations for Clean Speech: - Best performing models (POLQA, UTMOS, PESQ, WARP-Q, SCOREQ) accurately predicted monotonic bitrate/quality behavior - 16 kHz models (PESQ without mapping, UTMOS and WARP-Q with mapping) showed relatively good performance even for fullband codecs - Mapping generally improves accuracy (RMSE) except for few models (PESQ, POLQA) Correlation Analysis for Music/Mixed Content: Based on DCR experiments (clause 7.3.4), evaluating: POLQA, PEMO-Q, ViSQOL-A, and 2f-model Key Observations for Music/Mixed Content: - POLQA (despite not being recommended for non-speech) showed best correlation results (Pearson, Kendall, RMSE after 3rd order mapping) - 2f-model was second-best performing - ViSQOL Audio, PEAQ, and PEMO-Q showed fair performance - Correlation scores lower than clean speech, possibly due to more difficult task of predicting general audio quality and mismatch with DCR grading methodology Discussion (9.1.4.2): - P.862 (PESQ) officially "withdrawn" by ITU-T, cannot be considered valid standard - P.863 remains main ITU-T standard, P.SAMD emerging as potential alternative - Testing and parameter adjustment based on objective tools not recommended - 3GPP TR 26.921 documented that tuning noise reduction based on 
PESQ should be avoided Conclusion (9.1.4.3): - Subjective testing remains "golden reference" for codec selection - Objective metrics NOT recommended for codec selection criteria or codec tuning - Correlation of subjective and objective metrics may be considered for codec characterization - Objective metrics have merits in other tasks such as codec conformance testing Proposed Changes to TR 26.940 The document proposes comprehensive revisions to TR 26.940 v0.5.1, specifically to Clause 9 (Test methodologies), incorporating all the analysis and recommendations detailed above regarding both subjective and objective testing approaches for ULBC standardization.
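The three evaluation metrics named in the S4-260235 summary (Pearson correlation, RMSE, Kendall's Tau) can be sketched as follows. This is an illustration only: the score lists are hypothetical, the Kendall variant shown is tau-a (the summary does not say which variant was used), and the optional mapping step applied before RMSE in the contribution is omitted.

```python
import math

def pearson(x, y):
    """Pearson linear correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def rmse(x, y):
    """Root mean square error between two score lists."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))

def kendall_tau_a(x, y):
    """Kendall tau-a: (concordant - discordant) / number of pairs; ties count as neither."""
    n = len(x)
    s = 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (x[i] - x[j]) * (y[i] - y[j])
            s += (prod > 0) - (prod < 0)
    return s / (n * (n - 1) / 2)

# Hypothetical per-condition scores, NOT data from the contribution.
subjective_mos = [1.8, 2.4, 3.1, 3.6, 4.2]
objective_score = [2.0, 2.3, 3.3, 3.5, 4.4]

r = pearson(subjective_mos, objective_score)
err = rmse(subjective_mos, objective_score)
tau = kendall_tau_a(subjective_mos, objective_score)
print(f"Pearson={r:.3f} RMSE={err:.3f} Kendall tau-a={tau:.3f}")
```

Pearson and RMSE reward numerical agreement, while Kendall's Tau only checks whether the objective model ranks conditions in the same order as listeners, which is the property behind the "monotonic bitrate/quality behavior" observation.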
S4-260241 PDF Info	Fraunhofer IIS	On complexity estimation of ULBC	Summary of S4-260241: On Complexity Estimation of ULBC Document Overview This contribution addresses the complexity measurement methodology for the Ultra-Low Bitrate Codec (ULBC) under development in 3GPP SA4. The document proposes a hybrid complexity metric that combines traditional DSP-based measurements with ML-specific metrics. Background and Motivation Multiple input documents [1-4] have previously discussed complexity measurement approaches: - Documents [1] and [3] proposed using WMOPS (Weighted Million Operations Per Second), following conventional speech codec practices - Document [2] suggested using MACs and a modified WMOPS version - Document [4] emphasized model size considerations The key challenge is that ULBC will operate on heterogeneous, non-fixed target hardware and processors, requiring a platform-agnostic complexity metric. Main Technical Contributions Proposed Hybrid Complexity Metric The document proposes combining two complementary measurement approaches: For DSP-based components: - Use traditional WMOPS measurement For ML-based components: - Use MAC (Multiply-Accumulate) operations count - Include parameter count for memory/model size considerations Combined metric formula: `WMOPS + w · MACs` where `w` is an ML weighting factor (expected to be < 1) that reflects the vectorization capability of matrix multiplications. 
Rationale for the Hybrid Approach Limitations of WMOPS-only approach: - WMOPS reflects complexity primarily for DSP operations - Does not account for modern vectorization capabilities available even on modern DSPs - Less relevant for non-DSP processor types - The WMOPS toolbox doesn't reflect modern computational capabilities ML-specific considerations: - ML component complexity is dominated by matrix multiplications - Inference time and energy consumption are highly platform-dependent - MAC count provides architecture-agnostic computational load measurement - Parameter count relates directly to model size, memory usage, and energy consumption Advantages of the Proposed Metric The hybrid approach provides: 1. Overall complexity estimate for hybrid DSP+ML codec designs 2. Avoids over-constraining codec design toward specific platforms (referenced S4-260233) 3. Allows UE vendors to leverage custom architectures and optimizations 4. Accounts for efficient vectorization of ML components 5. Enables flexible computational cost balancing between DSP-based and ML-based components 6. 
Maintains continuity with established practice while accommodating emerging ML-based designs Vectorization Capability Reference Data The document provides example processing units and their vectorization capabilities to inform the ML weighting factor `w`: \| Chip \| Type \| Vectorization Capabilities \| \|------\|------\|---------------------------\| \| HiFi 5s \| DSP \| 32×(8×8 bit MAC), 16×(32×16 bit MAC), 8×(32×32 bit MAC) \| \| ARM Cortex A55 \| CPU \| 16×(8×8 MAC), 8×(16×16 MAC FP) \| Proposal The source proposes to: - Define the computational complexity metric by counting WMOPS for DSP-based components and MACs for ML-based components - Combine the two according to `WMOPS + w · MACs` (where `w` is an ML weighting factor) - Define a maximum value as the computational complexity limit in the design constraints - Apply similar principles for memory counting metrics References The document references four previous contributions [1-4] and two external technical specifications [5-6] for processor capabilities.
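The combined metric `WMOPS + w · MACs` from the S4-260241 summary can be illustrated with a short sketch. All numbers below are hypothetical, and expressing the MAC count in millions of MACs per second (so that both terms share a per-second scale) is an assumption of this example rather than something the summary specifies.

```python
def hybrid_complexity(dsp_wmops, ml_mmacs_per_s, w):
    """Combined complexity: WMOPS + w * MACs, with w the ML weighting factor."""
    if w < 0:
        raise ValueError("w must be non-negative")
    return dsp_wmops + w * ml_mmacs_per_s

def within_limit(dsp_wmops, ml_mmacs_per_s, w, limit):
    """Check a candidate against a maximum complexity limit (design constraint)."""
    return hybrid_complexity(dsp_wmops, ml_mmacs_per_s, w) <= limit

# Hypothetical candidate: 20 WMOPS of classic DSP code, 400 MMAC/s of ML inference.
DSP_WMOPS = 20.0
ML_MMACS = 400.0
W = 0.25          # hypothetical weighting factor; the summary only expects w < 1
LIMIT = 150.0     # hypothetical design-constraint limit

total = hybrid_complexity(DSP_WMOPS, ML_MMACS, W)
ok = within_limit(DSP_WMOPS, ML_MMACS, W, LIMIT)
print(f"combined complexity = {total:.1f}, within limit: {ok}")
```

With `w` below 1, a MAC inside a vectorizable matrix multiplication costs less than a generic weighted operation, which is how the metric lets a codec shift load between DSP and ML parts under a single limit.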
S4-260255 PDF Info	Dolby Laboratories Inc., Nokia, Novamint	[FS_ULBC] ULBC Re-Focus Proposal	ULBC Re-Focus Proposal Background and Motivation The FS_ULBC study item, initiated nearly a year ago, aims to establish a normative ULBC standard for voice communication over GEO within Rel-20. However, progress has been slow, with crucial issues such as end-to-end simulation parameters remaining unresolved. This contribution proposes a focused approach to meet 3GPP standardization timelines. Core Proposal: Two-Phase Standardization Approach The document proposes separating ULBC standardization into two distinct phases to ensure timely delivery while accommodating future enhancements: Phase 1: Rel-20 ULBC Baseline Scope: GEO-focused functionality based strictly on stable Rel-19 service requirements Rationale: Ongoing 6G Media requirements in SA1, SA2, and SA4 have not yet produced consolidated or normative requirement sets suitable for codec design Principle: Following established 3GPP procedures, the ULBC work item shall not define new service requirements but rely on formally defined and stabilized upstream requirements Phase 2: Rel-21 ULBC Advanced Scope: Extended functionality aligned with finalized 6G Media requirements Application scenarios: Beyond Rel-20 IMS Voice Call over GEO Compatibility: Should be backward compatible extension of Rel-20 baseline Technical Configuration Comparison Application Scenarios Baseline (Rel-20): - IMS Voice Call over GEO based strictly on Rel-19 service requirements Advanced (Rel-21): - Multi-Party Voice Communication - IMS Voice Call with ULBC over additional access types beyond GEO GEO Channel Characteristics & Simulation Baseline (Rel-20): - Single baseline UE Tx/Rx capability - Single CNR in UL and DL (e.g., UL single-tone 23 dBm: CNR=5.28 dB for SCS=3.75 kHz, CNR=-0.74 dB for 15 kHz; DL 12-tone single Rx: CNR=-0.61 dB) - Single agreed target bitrate compatible with baseline UE capability enabling acceptable system capacity - 
Reliance only on mandatory Rel-19 NB-IoT radio protocol features (except SPS) - i.i.d. random block error patterns - Single SPS/bundling period (160 ms) Advanced (Rel-21): - Advanced UE capabilities (e.g., increased Tx power, multiple Rx antennas) - Multiple CNR assumptions in UL and DL - Codec designers may choose optimal bitrate/TBS per CNR - Allow reliance on expected Rel-20 and selected non-mandatory NB-IoT features - Simulated block error patterns based on advanced features - Additional SPS/bundling periods (e.g., 80 ms, 320 ms) Design Constraints: Bitrate Baseline (Rel-20): - Single target bitrate derived from Rel-19 GEO IMS voice service requirements - Example: TBS=208 with SPS period 160 ms, achieving 950 bps net bitrate Advanced (Rel-21): - Multiple target CNRs with bitrate as codec design choice - Additional bitrates for future 6G-related scenarios Design Constraints: Sample Rate and Audio Bandwidth Baseline (Rel-20): - Single sample rate: e.g., 16 kHz - Audio bandwidth: up to WB - Note: May depend on agreed target bitrate Advanced (Rel-21): - Input/output sampling rates: at least 8, 16, 32, 48 kHz - Audio bandwidth unconstrained (codec design choice) Design Constraints: Frame Length and Algorithmic Delay Baseline (Rel-20): - Corresponding to SPS/bundling period (160 ms) or sub-multiples thereof - Algorithmic delay excl. 
framing: e.g., ≤80 ms (0.5 × SPS/bundling period) Advanced (Rel-21): - Frame structure and algorithmic delay aligned with advanced SPS/bundling options and future 6G Media requirements Design Constraints: Complexity and Memory Baseline (Rel-20): - Limited; sufficiently low to not preclude deployment on current-generation smartphones - TBD MMAC/s - E.g., 3M parameters Advanced (Rel-21): - Relaxed, enabling multiple models - Addressing future 6G Media requirements while leveraging new UE hardware trends Design Constraints: Packet Loss Concealment Baseline (Rel-20): - Required; capable of addressing single agreed-upon target bit rate and operation point of IMS Voice Call over GEO Advanced (Rel-21): - Required; capable of supporting anticipated extended application scenarios beyond Rel-20 IMS Voice Call over GEO, while fulfilling potential 6G Media requirements Design Constraints: Noise Suppression and Robustness Baseline (Rel-20): - No requirement to provide noise suppression - Required capability to handle and reconstruct noisy speech input with moderate to high SNR - Note: Noise reconstruction capability primarily enforced through performance requirements Advanced (Rel-21): - No requirement to provide noise suppression - Required capability to handle speech and generic input anticipated in extended application scenarios Design Constraints: DTX Baseline (Rel-20): - No requirement to support DTX - Note: No separate DTX-related performance requirement Advanced (Rel-21): - DTX support may be required for certain extended application scenarios, depending on potential 6G Media requirements Performance Requirements Baseline (Rel-20): - Requirements focusing on clean and noisy speech performance - NWT AMR7.4 or NWT AMR-WB8.85 depending on target bandwidth for: - Clean speech - Noisy speech (AMR/AMR-WB references operated with DTX on) - Relevant transcoding cases with G.711, AMR, AMR-WB, EVS Advanced (Rel-21): - Complex set of requirements considering required capability to 
handle speech and generic input anticipated in extended application scenarios Test Methodology and Test Plan Baseline (Rel-20): - Subjective: P.800 DCR - Note: Test methodology and test plan should be conceptually aligned with corresponding EVS codec standardization Pdocs (e.g., DCR test design, applicable SNRs and types of noises for noisy speech test cases) Advanced (Rel-21): - Subjective: Suitable for critical evaluation of candidate codec(s) against expected complex set of performance requirements Proposal SA4 is asked to adopt this phased approach for ULBC standardization as working assumption: Rel-20 ULBC Baseline: GEO-focused functionality based solely on Rel-19 service requirements and mandatory Rel-19 features (except SPS), enabling completion of viable ULBC baseline standard within Rel-20 schedule Rel-21 ULBC Advanced: Extended ULBC functionality aligned with finalized 6G Media requirements, supporting application scenarios beyond Rel-20 IMS Voice Call over GEO, possibly leveraging advanced UE capabilities, and providing backward compatible extension of Rel-20 baseline This approach ensures deliverable ULBC baseline in Rel-20 while providing clear and orderly path toward enhanced ULBC design in Rel-21.
S4-260256 PDF Info	Qualcomm Incorporated	[FS_ULBC] Feasible TBS values and packet loss traces for 80ms bundling period for ULBC over NB-IoT NTN GEO channel	Feasible TBS Values and Packet Loss Traces for 80ms Bundling Period for ULBC over NB-IoT NTN GEO Channel 1. Background and Scope This contribution presents simulation results for 80ms bundling period following the Simulation One ("target QoS based simulation") methodology. The document provides: Feasible TBS values Packet loss traces for optimal configurations System capacity analysis 2. Simulation Parameters and Trace Labeling The simulations cover the following parameter ranges: Direction: UL/DL TBS: 144, 256, 328, 424 bits (for 80ms bundling) Bundling period: 80ms (focus of this paper) Doppler spread: 1Hz, 5Hz Number of RX: UL: 1, DL: 1, 2 SCS: UL: 3.75kHz, 15kHz; DL: 15kHz Number of tones: UL: 1 for 3.75kHz SCS; 1, 3, 6, 12 for 15kHz SCS; DL: 15 BLER targets: 1%, 2%, 6%, 10% UE power class: 23dBm, 26dBm, 31dBm Trace file naming convention established for both UL and DL scenarios. 3. Optimal Configurations 3.1 TBS 144, 1 UE RX Optimality criterion: Tradeoff between per-UE performance (TBS and BLER) and system capacity. 
1% BLER DL: 16ms NPDSCH (N_SF=4, N_rep=4), required SNR: -4.6dB, capacity: 5 UEs UL: 48ms NPUSCH (N_RU=6, SCS 15kHz, 1 tone), required SNR: 0.0dB, UE TX power: 26.4dBm Feasibility: Only with 31dBm UE power class 2% BLER DL: 12ms NPDSCH (N_SF=3, N_rep=4), required SNR: -4.1dB, capacity: 6 UEs UL: 64ms NPUSCH (N_RU=8, SCS 15kHz, 1 tone), required SNR: -2.2dB, UE TX power: 24.2dBm Feasibility: 26dBm and 31dBm UE power classes 6% BLER DL: 8ms NPDSCH (N_SF=4, N_rep=2), required SNR: -4.0dB, capacity: 10 UEs UL: 64ms NPUSCH (N_RU=8, SCS 15kHz, 1 tone), required SNR: -3.2dB, UE TX power: 23.2dBm Feasibility: 26dBm and 31dBm UE power classes 10% BLER DL: 6ms NPDSCH (N_SF=3, N_rep=2), required SNR: -3.4dB, capacity: 12 UEs (limited by UL) UL: 64ms NPUSCH (N_RU=8, SCS 15kHz, 1 tone), required SNR: -3.7dB, UE TX power: 22.7dBm Feasibility: All power classes (23dBm, 26dBm, 31dBm) 3.2 TBS 144, 2 UE RX For 23dBm UE Power Class Only 10% BLER achievable with system capacity: 12 UEs Uses 64ms NPUSCH (N_RU=8, SCS 15kHz, 1 tone) and 3ms NPDSCH (N_SF=3, N_rep=1) For 26dBm and 31dBm UE Power Classes 1% BLER: 20 UEs, uses 64ms NPUSCH (N_RU=2, SCS 3.75kHz, 1 tone), UE TX power: 25.6dBm, 4ms NPDSCH 2% BLER: 20 UEs, uses 64ms NPUSCH (N_RU=2, SCS 3.75kHz, 1 tone), UE TX power: 25.0dBm, 4ms NPDSCH 6% BLER: 20 UEs, uses 64ms NPUSCH (N_RU=2, SCS 3.75kHz, 1 tone), UE TX power: 23.9dBm, 4ms NPDSCH 10% BLER: 26 UEs, uses 64ms NPUSCH (N_RU=2, SCS 3.75kHz, 1 tone), UE TX power: 23.4dBm, 3ms NPDSCH Key observation: 3.75kHz SCS configuration becomes optimal for higher power classes due to better coding rate. 
3.3 TBS 256, 2 UE RX For 26dBm UE Power Class 10% BLER: 12 UEs, 64ms NPUSCH (N_RU=8, SCS 15kHz, 1 tone), UE TX power: 24.8dBm 6% BLER: 12 UEs, 64ms NPUSCH (N_RU=8, SCS 15kHz, 1 tone), UE TX power: 25.3dBm 2% and 1% BLER: Infeasible For 31dBm UE Power Class 10% BLER: 16 UEs, 64ms NPUSCH (N_RU=2, SCS 3.75kHz, 1 tone), UE TX power: 27.2dBm, 5ms NPDSCH 6% BLER: 16 UEs, 64ms NPUSCH (N_RU=2, SCS 3.75kHz, 1 tone), UE TX power: 27.8dBm, 5ms NPDSCH 2% BLER: 10 UEs, 64ms NPUSCH (N_RU=8, SCS 15kHz, 1 tone), UE TX power: 26.3dBm, 8ms NPDSCH 1% BLER: 10 UEs, 64ms NPUSCH (N_RU=8, SCS 15kHz, 1 tone), UE TX power: 26.8dBm, 8ms NPDSCH 3.4 TBS 328, 2 UE RX For 26dBm UE Power Class Only 10% BLER achievable: 12 UEs, 64ms NPUSCH (N_RU=8, SCS 15kHz, 1 tone), UE TX power: 25.88dBm For 31dBm UE Power Class 10% BLER: 13 UEs, 64ms NPUSCH (N_RU=2, SCS 3.75kHz, 1 tone), UE TX power: 30.5dBm, 6ms NPDSCH 6% BLER: 10 UEs, 64ms NPUSCH (N_RU=4, N_rep=2, SCS 15kHz, 1 tone), UE TX power: 26.4dBm, 8ms NPDSCH 2% BLER: 10 UEs, 64ms NPUSCH (N_RU=4, N_rep=2, SCS 15kHz, 1 tone), UE TX power: 27.5dBm, 8ms NPDSCH 1% BLER: 8 UEs, 64ms NPUSCH (N_RU=4, N_rep=2, SCS 15kHz, 1 tone), UE TX power: 28.1dBm, 10ms NPDSCH 3.5 TBS 424, 2 UE RX Note: Coarse 5ms granularity for NPDSCH time-domain configuration. For 31dBm UE Power Class 10% BLER: 4 UEs, 40ms NPUSCH (N_RU=5, SCS 15kHz, 3 tones), UE TX power: 29.10dBm, 10ms NPDSCH 6% BLER: 4 UEs, 40ms NPUSCH (N_RU=5, SCS 15kHz, 3 tones), UE TX power: 29.73dBm, 10ms NPDSCH 2% BLER: 4 UEs, 40ms NPUSCH (N_RU=5, SCS 15kHz, 3 tones), UE TX power: 30.96dBm, 10ms NPDSCH 1% BLER: Infeasible 4. Feasible TBS Values Observation: For 80ms bundling period with UE power class up to 31dBm: - All TBS values (144, 256, 328, 424) are feasible for BLERs 1%, 2%, 6%, and 10% - Exception: TBS 424 is not feasible at 1% BLER 5. Packet Loss Traces 299,391 traces provided in attached zip file for all 4 TBS values (144, 256, 328, 424). 6. 
Proposal Proposal: Include clauses 2 through 5 in the PD or TR to provide a workable example of determining configurations based on the optimal tradeoff between per-UE performance and system capacity.
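The feasibility statements in clause 3.1 of the S4-260256 summary are consistent with a simple rule: a configuration is feasible for a power class when the required UE TX power does not exceed that class's nominal power. The rule is an assumption inferred from the listed results, not a method stated in the summary; the required-power values are the ones quoted for TBS 144 with 1 UE RX.

```python
POWER_CLASSES_DBM = (23.0, 26.0, 31.0)  # UE power classes listed in the summary

def feasible_power_classes(required_tx_power_dbm):
    """Power classes whose nominal power covers the required TX power (assumed rule)."""
    return [pc for pc in POWER_CLASSES_DBM if pc >= required_tx_power_dbm]

# Required UL TX power per BLER target, TBS 144, 1 UE RX (values from the summary).
required_by_bler = {"1%": 26.4, "2%": 24.2, "6%": 23.2, "10%": 22.7}

feasibility = {bler: feasible_power_classes(p) for bler, p in required_by_bler.items()}
for bler, classes in feasibility.items():
    print(f"BLER {bler}: feasible with {classes} dBm")
```

Under this assumed rule the output matches the summary: 1% BLER needs the 31 dBm class, 2% and 6% work with 26 dBm and 31 dBm, and 10% works with all three classes.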
S4-260271 PDF Info	Apple Inc.	[FS_ULBC] ULBC Performance Requirements	Summary of S4-260271: ULBC Performance Requirements Document Information Source: Apple Inc. Meeting: 3GPP TSG SA WG4#135, Goa, India (09-13 February 2026) Type: Discussion and Agreement Revision: Revision of S4aA250135 addressing comments from SA4 #134 post-adhoc telco (Dec 02) Main Technical Contributions Performance Requirements Framework The document proposes establishing minimum performance requirements for the Ultra-Low Bitrate Codec (ULBC) based on the following rationale: ULBC targets IMS voice service over GEO and NGSO satellite systems (per Clause 4, TR 26.940) Quality must be consistent with deployed VoLTE IMS voice services Current TBS discussions center on bitrates in the 1-3 kbps range AMR-WB 12.65kbps and EVS-SWB 13.2kbps are commonly deployed VoLTE operating points Proposed Minimum Performance Benchmarks The document establishes two key performance anchors: At lowest operating range (~1 kbps): ULBC shall provide speech quality No Worse Than (NWT) AMR-WB @12.65kbps Applies to: clean speech, noisy speech, and packet loss conditions At higher operating range (~3 kbps): ULBC shall provide speech quality No Worse Than (NWT) EVS-SWB @13.2kbps Applies to: clean speech, noisy speech, and packet loss conditions Reference Codecs and Operating Points for Testing The document proposes a comprehensive list of reference codecs and operating points for ToR comparison testing in subjective evaluation: AMR: 12.2kbps AMR-WB: 8.85kbps, 12.65kbps, 23.85kbps EVS AMR-WB-IO: 8.85kbps, 12.65kbps, 23.85kbps EVS-WB/SWB: 7.2kbps, 8kbps, 9.6kbps, 13.2kbps, 13.2kbps CA, 24.4kbps Text Proposal The document proposes updates to Clause 8 (Performance requirements) of TR 26.940, adding: - New Clause 8.1 (General) containing the performance requirements framework and minimum benchmarks - New Clause 8.1.1 (A List of Reference Codecs and Operating Points) containing the reference codec list for subjective evaluation
S4-260272 PDF Info	Apple Inc.	[FS_ULBC] ULBC Codec Testing in Background Noise	Summary of S4-260272: ULBC Codec Testing in Background Noise Document Overview This contribution proposes a testing framework for the Ultra-Low Bitrate Codec (ULBC) in noisy conditions, drawing from EVS codec testing methodologies. The document is a revision of S4-251786 from SA4#134 and proposes updates to TR 26.940 Clause 9. Background and Motivation Noise Suppression Considerations The document argues against mandating NS algorithms within the codec specification based on several key considerations: Device-Specific Optimization: NS algorithms are typically optimized for specific device microphone array configurations. A generic NS algorithm applied uniformly could result in suboptimal performance across different device types. Codec Robustness vs. NS Artifacts: Testing ULBC with clean, noisy, and optionally NS-processed speech provides better understanding of the codec's inherent robustness. NS algorithms may introduce speech distortions that could bias codec testing results. Emergency Call Requirements: For emergency calls, preserving background noise is critical as it may contain important contextual information (alarms, traffic, voices) that helps identify the caller's environment or ongoing danger. Complexity and Latency Concerns: ML-based NS algorithms can be computationally complex, increasing power consumption and end-to-end latency. Mandating complex NS could burden some devices inefficiently. The document advocates for flexibility in NS implementation to enable manufacturers to develop device-specific solutions. Proposed Testing Framework Core Testing Scenarios (Table 9.1.4.1) Following EVS codec testing principles (TR 26.952), the proposal includes: \| Source Material \| Noise Type \| SNR \| Test Methodology \| \|----------------\|------------\|-----\|------------------\| \| Clean speech \| - \| - \| ITU-T P.800 ACR and/or DCR \| \| Speech + Noise \| Stationary (car, etc.) 
\| 15 dB \| ITU-T P.800 DCR \| \| Speech + Noise \| Non-stationary (street, babble, etc.) \| 20-25 dB \| ITU-T P.800 DCR \| This framework aligns with EVS testing which used: - Car noise at 15 dB - Street noise at 20 dB - Office/babble noise at 20 dB - ITU-T P.800 DCR methodology ("Degradation of Speech in Noise" DMOS test) Optional Extended Testing for Low SNR (Table 9.1.4.2) To characterize ULBC robustness in challenging low SNR conditions: \| Source Material \| Noise Type \| SNR \| Test Methodology \| \|----------------\|------------\|-----\|------------------\| \| Speech + Noise \| Stationary (car, etc.) \| 5-10 dB \| ITU-T P.800 DCR \| \| Speech + Noise \| Non-stationary (street, babble, etc.) \| 10-15 dB \| ITU-T P.800 DCR \| \| NS processed speech + Noise \| Stationary (car, etc.) \| 5-10 dB \| ITU-T P.800 DCR \| \| NS processed speech + Noise \| Non-stationary (street, babble, etc.) \| 10-15 dB \| ITU-T P.800 DCR \| Key Notes: - To avoid bias, a common NS processing tool should be used for generating NS-processed speech - Selection of specific noise types and the NS processing tool is FFS - Reference is made to TR 26.989 v19.0.0 (MCPTT work) where EVS was evaluated in siren noise at 5 dB SNR Proposed Specification Changes The document proposes adding new Clause 9.1.4 to TR 26.940 with two subclauses: 9.1.4.1 Background: Captures the rationale for flexible NS implementation 9.1.4.2 Recommendations for ULBC Codec Testing: Defines the testing framework with Tables 9.1.4.1 and 9.1.4.2 Action Requested The document seeks Discussion and Agreement on: 1. The proposed testing framework for ULBC in noisy conditions 2. Updates to TR 26.940 Clause 9 as specified in the text proposal
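Preparing the "Speech + Noise" items in the tables above amounts to scaling a noise recording so that the mixture reaches a target SNR. A minimal sketch, assuming plain RMS levels and synthetic stand-in signals; formal test plans usually define levels more carefully (for example via active speech level measurement), so this only illustrates the arithmetic.

```python
import math
import random

def rms(x):
    """Root mean square level of a sample list."""
    return math.sqrt(sum(v * v for v in x) / len(x))

def mix_at_snr(speech, noise, snr_db):
    """Scale noise so that rms(speech) / rms(scaled noise) equals the target SNR."""
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    gain = rms(speech) / (rms(noise) * 10 ** (snr_db / 20))
    mixed = [s + gain * n for s, n in zip(speech, noise)]
    return mixed, gain

# Synthetic stand-ins (hypothetical): a 200 Hz tone as "speech", uniform noise as "car noise".
FS = 16000
rng = random.Random(0)
speech = [0.1 * math.sin(2 * math.pi * 200 * t / FS) for t in range(FS)]
noise = [rng.uniform(-1.0, 1.0) for _ in range(FS)]

mixed, gain = mix_at_snr(speech, noise, snr_db=15.0)
achieved_snr_db = 20 * math.log10(rms(speech) / rms([gain * n for n in noise]))
print(f"achieved SNR: {achieved_snr_db:.2f} dB")
```

The same function covers the optional low-SNR conditions by passing 5 to 10 dB instead of 15 dB.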
S4-260275 PDF Info	Dolby Laboratories Inc., Nokia, Novamint	[FS_ULBC] On device capability diversity	Summary of S4-260275: On Device Capability Diversity for ULBC Overview This document (revision of S4aA260006) addresses UE capability diversity in NB-IoT NTN deployments for ULBC voice services. It proposes a capability-aware system design approach rather than assuming uniform baseline UE capabilities, accompanied by a pCR to TR 26.940. Key Technical Contributions 1. UE Capability Diversity Framework Identified Capability Dimensions: - Transmit Power Classes: Baseline: PC3 (23 dBm); Enhanced: PC2 (26 dBm) or PC1 (31 dBm); Future: up to 37 dBm under study for Rel-20 - Receive Antenna Configurations: Standard: single RX antenna; Enhanced: dual RX antennas (providing ~3 dB gain) - Advanced Features: improved RF sensitivity, multi-tone NPUSCH transmission capability Key Insight: These capabilities are optional and vary across device categories, market segments, and implementations. 2. Benefits of Enhanced Capabilities Enhanced UE capabilities enable: - Reduced time-domain resource usage in half-duplex NB-IoT transmission - Overcoming limitations of 80 ms SPS periods (excessive BLER and capacity constraints) - Multi-tone NPUSCH transmission for higher ULBC bitrates, reduced time-domain resource usage, and improved link robustness (reduced packet error rates) 3. 
Capability-Aware Multi-User SPS Scheduling Proposed Scheduling Strategy: Dynamic SPS Assignment: Enhanced UEs use shorter SPS periods (80 ms) while baseline UEs use longer periods (160/320 ms) Multi-Tone Transmission: Enhanced UEs utilize multi-tone NPUSCH formats Load Balancing: Resource allocation prioritized based on UE capability Service Differentiation: Three-tier service model: Baseline Service: Conservative configurations (long SPS, single-tone NPUSCH) Intermediate Service: Moderate enhancements (shorter SPS, possible multi-tone) Enhanced Service: Higher bitrates and reduced latency (shortest SPS, multi-tone, dual RX) Practical Example (Figure 1): UE Type A (Baseline): 160 ms SPS, 128 ms single-tone NPUSCH, 950 bits/s net bitrate (TBS 208 bits) UE Type B (Intermediate): 80 ms SPS, 64 ms NPUSCH (higher TX power), 1100 bits/s net bitrate (TBS 144 bits) UE Type C (Enhanced): 80 ms SPS, 64 ms NPUSCH, reduced NPDSCH duration (dual RX), 1100 bits/s net bitrate (TBS 144 bits) 4. ULBC Bitrate Differentiation Proposed Approach: Leverage UE capability diversity for bitrate differentiation Align bitrates with service tiers and UE capabilities Recommended minimum set of 3 ULBC target bitrates: Basic tier: [600 - 1000] bits/s Intermediate tier: [1000 - 1800] bits/s Enhanced tier: [1800 - 3000] bits/s Higher bitrates may be considered in second ULBC standardization phase Note: Actual bitrates subject to ongoing TBS discussions; values >3000 bits/s may become relevant. Proposed Changes to TR 26.940 (pCR) Section 5.2.4: New Clause on UE Capabilities Documents capability variations for NB-IoT NTN: Transmit Power Classes: PC3/PC5 (Rel-18), PC1/PC2 (Rel-19), potential >31 dBm (Rel-20) Receive Antennas: Single (typical) vs. 
dual (enhanced) Enhanced Capabilities: Higher TX power, improved RF sensitivity Section 5.2.5: Enhanced Multi-User Considerations Replaces assumption of uniform UE configuration with capability-aware scheduling: Capability-Aware Resource Allocation: Different SPS periods based on UE capabilities Multi-Tone Transmission for Enhanced UEs: Increased bitrate and/or reduced resource usage Dynamic Load Balancing: Optimized capacity through capability-based prioritization Service Level Differentiation: Three-tier service model aligned with UE capabilities Includes Figure 1 demonstrating practical multi-user scheduling scenario with three UE types. Section 5.1.2.2: UE Delay Tables Updates delay estimation tables (5.1.2-2, 5.1.3-1) to include: Voice bundling periods: 80, 160, 320 ms Codec frame sizes: 20, 40, 80, 160, 320 ms Mouth-to-ear delay estimates for GEO-TN and GEO-GEO scenarios Recommendations Adopt capability-aware ULBC system design rather than assuming single baseline configuration Agree on minimum set of 3 ULBC target bitrates for codec evaluation (approximate ranges: [600-1000], [1000-1800], [1800-3000] bits/s) Document agreed ULBC target bitrates in Pdoc Consider higher bitrates in second ULBC standardization phase References Key dependencies: S4aA260006 (previous version), S4-260144 (TR 26.940 v0.5.1), S4-260255 (ULBC Re-Focus Proposal), TS 36.763 (UE radio transmission/reception), S4-251863 (system capacity), S4aA250112 (error trace methodology), S4aA250118 (RAN simulation results)
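The Figure 1 example above (UE Types A to C) quotes a TBS, an SPS period, and a net bitrate for each UE type. The sketch below works through that arithmetic. It assumes one transport block per SPS period; the per-block overhead it derives is inferred from the quoted figures and is not stated in the summary, and the tier boundaries (lower bound inclusive, upper bound exclusive) are likewise an assumption, since the summary's ranges share their endpoints.

```python
def gross_bitrate(tbs_bits, sps_period_ms):
    """Bits per second if one transport block is sent every SPS period."""
    return tbs_bits * 1000 / sps_period_ms


def implied_overhead_bits(tbs_bits, sps_period_ms, net_bitrate):
    """Bits per transport block not counted in the quoted net bitrate."""
    return tbs_bits - net_bitrate * sps_period_ms / 1000


def tier_for(net_bitrate):
    """Map a net bitrate to the proposed tier (boundary handling is assumed)."""
    if 600 <= net_bitrate < 1000:
        return "basic"
    if 1000 <= net_bitrate < 1800:
        return "intermediate"
    if 1800 <= net_bitrate <= 3000:
        return "enhanced"
    return "outside proposed ranges"


# (TBS in bits, SPS period in ms, quoted net bitrate in bits/s) from Figure 1.
ue_types = {
    "A": (208, 160, 950),
    "B": (144, 80, 1100),
    "C": (144, 80, 1100),
}

for name, (tbs, sps_ms, net) in ue_types.items():
    print(
        name,
        gross_bitrate(tbs, sps_ms),
        implied_overhead_bits(tbs, sps_ms, net),
        tier_for(net),
    )
```

For all three UE types the difference between TBS and net payload comes out at 56 bits per transport block (208 - 152 for Type A, 144 - 88 for Types B and C), so the quoted net bitrates are mutually consistent under this assumption.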
S4-260279 PDF Info	Orange, Dolby Laboratories Inc.	On the use of objective metrics in ULBC standardization	Summary of 3GPP Technical Document on Objective Metrics in ULBC Standardization

Introduction and Scope

This document addresses the "Study on Ultra Low Bitrate Speech Codec" (FS_ULBC) approved at SA#107, specifically focusing on study objective 5 from the WID regarding performance requirements and test methodologies for speech quality, intelligibility, and conversational quality across various conditions (clean/noisy speech, tandeming with IMS codecs, clean/GEO channel conditions). The contribution provides correlation analysis results of objective quality models as a complement to subjective test results on clean speech and music/mixed content in TR 26.940, building upon previous discussions in S4-251814.

Main Technical Contributions

Test Methodologies - General Considerations (Clause 9.1.1-9.1.2)

Quality Impairment Categories for ULBC:
- Loss of listening-only audio quality
- Audio bandwidth loss
- Impaired intelligibility
- Impaired speaker identifiability
- Prosodic impairments
- Hallucination (word and phone confusions)
- Sensitivity to non-speech input (background noise, music, noisy speech, interfering talkers, reverberant speech)

Testing Challenges:
- ML-based ULBC codecs introduce new impairment categories (e.g., hallucination) not present in signal-processing based codecs (AMR, AMR-WB, EVS)
- Traditional P.800 ACR methodology may not optimally quantify all potential impairments
- DCR methodology focuses on differences to reference, suitable for small impairments and prosodic differences
- Previous 3GPP codec standardization (AMR, AMR-WB, EVS) used ACR for clean speech and DCR for SWB, mixed-bandwidth, noisy speech, and music evaluations

Alternative Test Methods Listed:
- Diagnostic Rhyme Tests (DRT)
- Modified Rhyme Tests (MRT)
- MOS testing for speaker similarity
- Speaker verification/identification tests
- Prosodic naturalness MOS tests
- Intonation recognition tests
- Transcription tests for word and semantic equivalence
- Phoneme recognition tests
- Automatic speech recognition tests
- P.835 multi-dimensional rating scales for speech enhancement evaluation

Subjective Testing Considerations (Clause 9.1.3)

Source Material Robustness (9.1.3.1):
- Multiple languages with diverse intonations
- Various phonetic and linguistic environments
- Different voice pitches and speaking styles
- Overlapping talkers

Real-world Acoustic Conditions (9.1.3.2):
- Clean environments (minimal background noise)
- Noisy environments (traffic, human chatter, vehicle)
- Varying reverberation levels (RT60 ranging from 0.3 s to 1.0 s)

Tandeming and Compatibility Testing (9.1.3.3):
- Testing with speech previously encoded by ITU-T G.711, AMR, AMR-WB, and EVS
- Various input levels: -16 dBov, -26 dBov, and -36 dBov

Conclusion:
- P.800 ACR/DCR serves as backbone for most subjective testing
- Other methodologies may be considered
- Emphasis on diverse test material covering multilingual/multi-speaker testing, real-world acoustic conditions, and tandeming

Objective Testing Considerations (Clause 9.1.4)

Correlation Analysis on Clean Speech (9.1.4.1):

Evaluated objective models from references [7-11]:
- Speech-oriented metrics: PESQ, POLQA, ViSQOL-S, WARP-Q, DNSMOS, NISQA, NORESQA, UTMOS
- General audio metrics: PEAQ, ViSQOL-A
- Additional metric: SCOREQ

Evaluation metrics used: Pearson correlation coefficient, RMSE, Kendall's Tau rank correlation coefficient

Key Observations (Clean Speech):
- Best performing models (POLQA, UTMOS, PESQ, WARP-Q, SCOREQ) accurately predicted monotonic bitrate/quality behavior of multirate codecs
- Models operating at 16 kHz (PESQ without mapping, UTMOS and WARP-Q with mapping) showed relatively good performance even for fullband codecs
- Mapping generally improves accuracy (RMSE) except for a few models (PESQ, POLQA)

Correlation Analysis on Music/Mixed Content:

Evaluated models from references [7-12]: POLQA, PEMO-Q, ViSQOL-A, and 2f-model

Key Observations (Music/Mixed Content):
- POLQA (despite not being recommended for non-speech signals) gave best correlation results (Pearson, Kendall, RMSE after 3rd order mapping)
- 2f model was second-best performing
- ViSQOL Audio, PEAQ, and PEMO-Q showed fair performance despite being adapted to music/mixed content
- Correlation scores lower than clean speech, possibly due to more difficult task of predicting quality for general audio and mismatch with DCR test methodology grading

Discussion (9.1.4.2):
- P.862 (PESQ) officially "withdrawn" by ITU-T, cannot be considered valid standard
- P.863 remains main ITU-T standard, P.SAMD emerging as potential alternative
- Testing and parameter adjustment based on objective tools not recommended
- 3GPP TR 26.921 documented that tuning noise reduction based on PESQ should be avoided

Conclusion (9.1.4.3):
- Subjective testing remains "golden reference" for codec selection
- Objective metrics NOT recommended for codec selection criteria or codec tuning
- Correlation of subjective/objective metrics may be considered for characterization of new codec
- Objective metrics have merits in other tasks such as codec conformance testing

Document Type

This is a proposed Change Request (pCR) to TR 26.940, specifically targeting Clause 9 (Test methodologies) with additions to subclauses 9.1.1 through 9.1.4.
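The correlation analysis above scores each objective model with the Pearson correlation coefficient, RMSE, and Kendall's Tau. A minimal sketch of those three figures follows, assuming per-condition subjective and objective scores on the same scale. The score values are made-up placeholders rather than results from the contribution, the Kendall variant shown is tau-a without tie handling, and the 3rd order mapping step mentioned in the summary is not implemented here.

```python
import math


def pearson(x, y):
    """Pearson linear correlation coefficient between two score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def rmse(x, y):
    """Root mean square error between two score lists."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))


def kendall_tau(x, y):
    """Kendall rank correlation (tau-a); assumes no tied scores."""
    n = len(x)
    balance = 0
    for i in range(n):
        for j in range(i + 1, n):
            product = (x[i] - x[j]) * (y[i] - y[j])
            # +1 for a concordant pair, -1 for a discordant pair.
            balance += (product > 0) - (product < 0)
    return balance / (n * (n - 1) / 2)


# Hypothetical per-condition scores, for illustration only.
subjective_mos = [1.8, 2.5, 3.1, 3.9, 4.4]
objective_score = [2.0, 2.4, 3.4, 3.6, 4.5]

print("Pearson:", pearson(subjective_mos, objective_score))
print("RMSE:", rmse(subjective_mos, objective_score))
print("Kendall tau:", kendall_tau(subjective_mos, objective_score))
```

Kendall's Tau depends only on rank order, so it reaches 1.0 whenever a model ranks the conditions exactly as the listeners did, while RMSE still penalizes any offset in absolute score. This is why a mapping step can change RMSE without changing the rank correlation.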

Total Summaries: 38 | PDFs Available: 38

Analysis on Complexity Evaluation of ULBC with WMOPS

1. Introduction

2. Technical Analysis: Discrepancies Between ITU-T Documentation and WMC Tool Implementation

2.1 'Move' Operator Counting

2.2 Increment Operator ('++')

2.3 Logical Operators ('AND/OR')

2.4 Indirect Addressing

2.5 Instrumentation with Array Subscripts

3. Observations and Impact Assessment

4. Proposal

Summary of S4-260128: Influence of Code Optimization on WMOPS

1. Introduction and Motivation

2. Experimental Analysis

2.1 Operator-Level Analysis

2.2 Full-Model Level Analysis

3. Observations and Conclusions

4. Proposal

Summary of S4-260132: Discussion of FS_ULBC Objective Speech Quality Assessment Method

Background

Overview of Existing Speech Objective Quality Evaluation Methods

Standardized ITU-T Methods

Open Source Methods

Capabilities and Limitations for ULBC

Proposal

Recommended Objective Assessment Methods

Text Proposal for TR 26.940

New Section 9.1.1: Typical Quality Impairments

New Section 9.1.2: Challenges of Quality Assessment

Technical Contribution

Summary of S4-260136: Updates to Simulation Results for FS_ULBC

1. Introduction and Context

2. Channel Model Assumptions

3. Link Budget Analysis

3.1 CNR Baseline Values (from RAN1)

3.2 Additional UL CNR Configurations

3.3 Additional DL CNR Configurations

4. NPUSCH Simulation Results

4.1 Common Simulation Parameters

4.2 Results Part 1 - Various TBS and Bundling Times

80 ms Bundling Time

160 ms Bundling Time

320 ms Bundling Time

4.3 Results Part 2 - Additional Cases

5. NPDSCH Simulation Results

5.1 Common Simulation Parameters

5.2 Results Part 1 - Various TBS and Bundling Times

80 ms Bundling Time

160 ms Bundling Time

320 ms Bundling Time

5.3 Results Part 2 - Additional Cases

6. Conclusions

Summary of S4-260137: On eCall Scenario for ULBC

1. Background

2. eCall Scenario Description

2.1 System Overview

2.2 Communication Architecture

3. Key Observations

4. Proposed Changes to TR 26.940

4.1 New Clause 4.5: eCall Communication

High-level Prerequisites for ULBC in eCall:

4.2 Modified Clause 6.2: Design Constraint Parameters

Key Differences for eCall Design Constraints:

5. Technical Contributions

Summary of S4-260141: Target Platforms for ULBC

1. Introduction and Motivation

2. Technical Problem Statement

3. DSP-Enabled Device Definition

4. Proposed Text Changes

Main Technical Contribution

Context Preservation

5. Technical Impact

Summary of S4-260142: On Complexity and Memory Constraints for ULBC

Introduction

Main Technical Contributions

Complexity Measurement Metrics

Memory Constraints Clarification

ROM Constraints

RAM Constraints

Complexity Constraints