Meeting: TSGS4_135_India | Agenda Item: 7.8
[FS_ULBC] Analysis on complexity evaluation of ULBC with WMOPS
This contribution examines the use of WMOPS (Weighted Million Operations Per Second) as a complexity metric for ULBC (Ultra Low Bitrate Codec). WMOPS has been proposed as one of the possible complexity metrics and is traditionally used for evaluating 3GPP speech codec complexity. The analysis focuses on the WMC tool used for automated WMOPS calculation with floating-point C code.
The source conducted systematic testing of the WMC tool against the examples provided in ITU-T standards documentation (specifically clause 18.12.7 and related tables in the ITU-T Software Tool Library 2024 User's Manual). Several discrepancies were identified:
Issue: Extra MOVE operations are counted by the WMC tool
- `b = a / L` should count as 1 MULT (since 1/L is constant, the operation becomes a multiplication)

Issue: Missing operations in WMC tool output
- `(*rnd_T0)++` should count as 1 ADD + 1 STORE (equivalent to `*rnd_T0 = *rnd_T0 + 1`)

Issue: Missing TEST operation counting
- `if (a!=b || c==d)` should count as 2 ADD + 2 BRANCH + 1 TEST

Issue: Extra MOVE operation and incorrect INDIRECT counting
- `Indice[0] = indirect_dico1[indice[0]]` should count as 2 INDIRECT

Issue: Arithmetic operations inside array subscripts not counted
- `a[i*2+1]` should count the arithmetic operations within the subscript (1 INDIRECT + 1 MULT + 1 ADD)

The source identifies three key observations:
Systematic discrepancies exist between ITU-T standards documentation and WMC tool implementation, with both over-counting (e.g., extra MOVE operations) and under-counting (e.g., missing operations in array subscripts) observed
Potential significance for AI codecs: Some discrepancies, particularly the counting of MOVE operators and instrumentation inside arrays, could significantly impact WMOPS measurements for AI-based codecs
Need for clarification: If WMOPS is adopted as a complexity metric for ULBC, these differences must be carefully addressed and the calculation methodology must be clearly defined
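To make the discrepancies concrete, the expected counts from the cited ITU-T examples can be tabulated. A minimal Python sketch follows; the dictionary and helper are purely illustrative and are not part of the WMC tool's API:

```python
# Expected basic-operator counts for the C expressions discussed above,
# following the ITU-T STL examples cited by the contribution.
# Illustrative only; not the WMC tool's data structures.
EXPECTED = {
    "b = a / L": {"MULT": 1},                # 1/L is a constant
    "(*rnd_T0)++": {"ADD": 1, "STORE": 1},   # *rnd_T0 = *rnd_T0 + 1
    "if (a!=b || c==d)": {"ADD": 2, "BRANCH": 2, "TEST": 1},
    "Indice[0] = indirect_dico1[indice[0]]": {"INDIRECT": 2},
    "a[i*2+1]": {"INDIRECT": 1, "MULT": 1, "ADD": 1},
}

def total_ops(expr):
    """Unweighted total of the expected operator counts for one expression."""
    return sum(EXPECTED[expr].values())
```

For example, `total_ops("a[i*2+1]")` gives 3, whereas per the contribution the WMC tool omits the arithmetic inside the subscript.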
The source proposes to document the findings from Clause 2 and Clause 3 in clause 6.2 of the permanent document to ensure proper consideration of these WMOPS calculation issues in the ULBC complexity evaluation framework.
[FS_ULBC] Influence of code optimization on WMOPS
This contribution investigates the impact of C code implementation choices on WMOPS (Weighted Million Operations Per Second) measurements for neural audio codecs, specifically in the context of the ULBC (Ultra-Low Bitrate Codec) study. The source examines whether WMOPS, traditionally used for 3GPP speech codec complexity evaluation, is suitable for neural audio codecs given that actual C implementation can significantly affect WMOPS measurements.
The source conducted experiments on Conv1D and Conv1DTranspose operators, which are extensively used in DAC (Descript Audio Codec) for audio feature dimension manipulation:
Key Results: - Conv1D: 441 WOPS (non-optimized) → 301 WOPS (optimized) = 68.25% ratio - Conv1DTranspose: 554 WOPS (non-optimized) → 260 WOPS (optimized) = 46.93% ratio
Finding: The same optimization strategy yields significantly different optimization ratios for different operators.
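The reported ratios can be reproduced directly from the per-operator WOPS figures; a quick Python check:

```python
# Optimization ratio (optimized / non-optimized WOPS), using the
# per-operator figures reported in the contribution.
def opt_ratio(non_opt_wops, opt_wops):
    return round(100 * opt_wops / non_opt_wops, 2)

print(opt_ratio(441, 301))  # 68.25  (Conv1D)
print(opt_ratio(554, 260))  # 46.93  (Conv1DTranspose)
```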
Using the optimized and non-optimized operator implementations, the source measured WMOPS for two DAC configurations (enc16dec384 and enc64dec1536) and compared against previously reported results [4]:
| WMOPS | Configuration | Non-optimized | Optimized | Reported in [4] |
|-------|---------------|---------------|-----------|-----------------|
| Total | enc16dec384 | 13,320.35 | 8,152.01 | 5,785.17 |
| Total | enc64dec1536 | 201,552.55 | 123,966.49 | 84,441.99 |
| Encoder | enc16dec384 | 3,411.08 | 2,621.98 | 1,060.79 |
| Encoder | enc64dec1536 | 50,089.70 | 39,604.59 | 13,675.30 |
| Decoder | enc16dec384 | 9,847.22 | 5,484.21 | 4,724.38 |
| Decoder | enc64dec1536 | 151,291.59 | 84,255.25 | 70,766.69 |
The source draws two critical observations:
Main Conclusion: If WMOPS is adopted as a complexity metric for ULBC, results will be highly influenced not only by model design but also by the actual C code implementation, potentially making comparisons between different codec proposals inconsistent.
The source proposes to document the experimental findings and observations as a new clause 7.6.5 "WMOPS analysis on DAC" in TR 26.940, with three sub-clauses:
- 7.6.5.1: On operator level
- 7.6.5.2: On full-model level
- 7.6.5.3: Observation
This would capture the implementation-dependency issues of WMOPS measurements for neural audio codecs in the technical report.
[FS_ULBC] Discussion of FS_ULBC Objective Speech Quality Assessment Method
This contribution addresses speech quality assessment challenges for ultra-low bitrate codecs (ULBC). While subjective testing remains the benchmark for ULBC codec selection, objective speech evaluation methods can serve as predictive tools during intermediate testing and parameter adjustment processes, enabling more convenient and efficient quality verification.
The document provides a comprehensive comparison of available objective assessment tools:
The document analyzes each method's suitability for ultra-low bitrate scenarios:
After excluding unsuitable methods, the contribution recommends considering P.863, ViSQOL, and ESTOI as potential objective quality assessment methods for ULBC.
The document proposes a pCR to TR 26.940 Section 9 (Test methodologies) that includes:
Identifies ULBC-specific impairment categories: - Loss of listening-only audio quality - Audio bandwidth loss - Impaired intelligibility - Impaired speaker identifiability - Prosodic impairments - Hallucination (word and phone confusions) - Sensitivity to non-speech input (background noise, music, reverberant speech)
Addresses testing challenges specific to ULBC:
Traditional 3GPP Practice: AMR/AMR-WB/EVS used P.800 ACR for clean speech and DCR for noisy/mixed content, but did not focus on intelligibility, speaker identifiability, or prosodic impairments
ULBC-Specific Challenges: ML-based codecs introduce new impairment types (e.g., hallucination) requiring alternative test methods
Additional Test Methodologies (non-exhaustive list):
Automatic speech recognition tests
Objective Methods as Optional Tools: Proposes documenting that objective methods (P.863, ViSQOL, ESTOI, etc.) can be considered as optional tools for predicting speech quality during ULBC simulation testing and parameter optimization, acknowledging that subjective listening remains the most important evaluation method despite being time and resource-intensive
Speech Enhancement Evaluation: Notes that P.835 multi-dimensional rating scales can be used for speech enhancement tools that may be part of ULBC
The main technical contribution is establishing a framework for objective quality assessment in ULBC standardization that:
1. Recognizes the unique challenges of ML-based codecs
2. Identifies suitable objective methods as predictive tools
3. Proposes their documentation as optional assessment methods in TR 26.940
4. Maintains subjective testing as the primary benchmark while enabling more efficient intermediate evaluation
Updates to the simulation results for FS_ULBC
This document presents updated link-level simulation (LLS) results for ULBC (Ultra Low Bitrate Codec) over Non-Terrestrial Networks (NTN). The simulations follow the NTN-TDL-C channel model as specified in TS 36.102. This revision adds: - Missing simulation results for NTN-TDL-C 10 NPUSCH - New simulation results for NTN-TDL-C 10 NPDSCH - Updated TBS (Transport Block Size) values for both NPDSCH and NPUSCH with 10 degrees elevation angle
The simulations are based on parameters discussed in S4aA250038 and follow agreements from previous meetings.
Key Parameters: - Satellite elevation angle: 12.5 degrees for link budget calculations - Channel model parameters (delay and power of each path) determined for 10 degrees elevation (approximation of 12.5 degrees) - Channel model: NTN-TDL-C from TS 36.102
Uplink: - CNR = 2.6 dB (0 dBi UE antenna gain, 3.75 kHz SCS, 1 tone, 23 dBm UE TX power)
Downlink: - CNR = -3.3 dB (0 dBi UE antenna gain, 15 kHz SCS, 12 tones, 1 RX antenna, 7 dB noise figure)
| Configuration | SCS | UE Power | UL CNR |
|---------------|----------|--------|----------|
| Config 1 | 3.75 kHz | 23 dBm | 2.6 dB |
| Config 2 | 15 kHz | 23 dBm | -3.42 dB |
| Config 3 | 3.75 kHz | 26 dBm | 5.6 dB |
| Config 4 | 15 kHz | 26 dBm | -0.42 dB |
| Config 5 | 3.75 kHz | 31 dBm | 10.6 dB |
| Config 6 | 15 kHz | 31 dBm | 4.58 dB |
| Configuration | Number of RX | G/T value (dB/K) | DL CNR |
|---------------|--------------|------------------|---------|
| Config 1 | 1 | -31.6 | -3.3 dB |
| Config 2 | 2 | -31.6 | -0.3 dB |
| Config 3 | 1 | -28.6 | -0.3 dB |
| Config 4 | 2 | -28.6 | 2.7 dB |
144 bits (Cases 1-4): - Case 1: 15 kHz, MCS 2, 4 RUs, 2 reps → SNR: -4.97 dB (10% BLER) to -4.35 dB (1% BLER) - Case 2: 15 kHz, MCS 2, 2 RUs, 1 rep → SNR: 1.8 dB to 2.7 dB - Case 3: 3.75 kHz, MCS 10, 1 RU, 2 reps → SNR: 1.5 dB to 2.30 dB - Case 4: 15 kHz, MCS 10, 1 RU, 4 reps → SNR: -4.64 dB to -3.90 dB
256 bits (Cases 5-8): - SNR ranges from -2.84 dB to 5.9 dB depending on configuration
328 bits (Cases 9-11): - SNR ranges from -1.53 dB to 9.45 dB depending on configuration
424 bits (Case 12): - SNR: 1.44 dB (10% BLER) to 2.05 dB (1% BLER)
208 bits (Cases 13-15): - SNR ranges from -5.56 dB to 1.53 dB
424 bits (Case 16): - SNR: -1.95 dB to -1.52 dB
600 bits (Case 17): - SNR: -1.38 dB to -0.97 dB
808 bits (Cases 18-19): - SNR ranges from -1.42 dB to 0.21 dB
328 bits (Cases 20-25): - SNR ranges from -6.80 dB to -0.22 dB
776 bits (Cases 26-27): - SNR ranges from -2.48 dB to 6.46 dB
1000 bits (Cases 28-30): - SNR ranges from -1.95 dB to 7.47 dB
1544 bits (Case 31): - SNR: 0.48 dB to 0.76 dB
Covers Cases 32-46 with various TBS values (440, 584, 680, 936, 1096, 1384, 1736 bits) for 80 ms and 160 ms bundling times. SNR requirements range from -3.6 dB to 8.0 dB depending on configuration.
144 bits (Case 0a): - 1R: SNR -6.6 dB to -5.3 dB - 2R: SNR -9.3 dB to -8.1 dB
256 bits (Case 0b): - 1R: SNR -4.3 dB to -3.1 dB - 2R: SNR -7.1 dB to -6.1 dB
328 bits (Cases 1-2): - SNR ranges from -11.8 dB to -4.0 dB
424 bits (Cases 3, 3b): - SNR ranges from -11.6 dB to -5.0 dB
208 bits (Case 4): - SNR: -15.3 dB to -11.8 dB
424 bits (Cases 5, 5b): - SNR ranges from -14.6 dB to -8.0 dB
600 bits (Case 6): - SNR: -11.1 dB to -7.2 dB
808 bits (Cases 7, 7b, 8, 8b): - SNR ranges from -11.0 dB to -1.1 dB
328 bits (Cases 9-11b): - SNR ranges from -17.7 dB to -8.1 dB
776 bits (Cases 12, 12b): - SNR ranges from -14.8 dB to -8.1 dB
1000 bits (Case 13): - SNR: -10.3 dB to -6.4 dB
1544 bits (Cases 14, 14b): - SNR ranges from -11.7 dB to -5.0 dB
Covers Cases 15-46 with various TBS values (440, 584, 680, 936, 1096, 1384, 1736 bits) for 80 ms and 160 ms bundling times.
1T1R Results: - SNR requirements range from -10.9 dB to 1.1 dB
1T2R Results: - SNR requirements range from -13.6 dB to -1.9 dB - Approximately 3 dB gain compared to 1T1R configurations
The document recommends considering these simulation results for determining design constraints for ULBC. The results demonstrate: - Performance across various TBS values (144 to 1736 bits) - Multiple bundling times (80, 160, 320 ms) - Different SCS configurations (3.75 kHz, 15 kHz) - Impact of repetitions on SNR requirements - Benefits of 2 RX antennas (approximately 3 dB gain)
[FS_ULBC] On eCall scenario for ULBC
This contribution addresses the eCall (emergency call) scenario for Ultra Low Bitrate Codec (ULBC) work. Previous contributions (S4-251908, S4-251848, S4-251881) emphasized the importance of preserving background signals in emergency communications. China has developed a related national standard "On-Board Emergency Call System for Road Vehicles" expected to take effect on July 1, 2027.
The document highlights that eCall scenarios have special requirements and different conditions compared to regular call scenarios, necessitating different design constraints and test methodologies.
The eCall system is an in-vehicle safety technology that: - Automatically dials emergency numbers (e.g., 112 in EU) upon severe collision detection - Sends minimum data set (MSD) including GPS location, VIN, collision direction and time - Can be triggered by built-in sensors or manual SOS button - Functions via GEO satellite even without terrestrial network coverage
The bi-directional voice data flow involves: - Vehicle side: Integrated microphones and speakers communicating over GEO satellite network - Emergency response center: Connected via terrestrial mobile network (VoLTE, VoNR), fixed-line, or other IMS-supported platform - Key requirement: Background noise captured within vehicle must be delivered with fidelity to emergency response center - Asymmetric requirement: Noise preservation may not be required in the opposite direction (emergency center to vehicle) - Dedicated system: No mobile phones involved in the communication link
Observation 1: eCall is a dedicated system between vehicles and emergency response centers. Speech codec designed for eCall is not necessarily the same as that for regular call scenarios, allowing for separate design constraints or performance requirements for ULBC-eCall.
Observation 2: Vehicle and emergency response center have significantly different hardware capabilities compared to regular call scenarios: - Less sensitive to power consumption - Higher computational capability - Higher storage capability - This allows for relaxed design constraints and more critical performance requirements for ULBC-eCall
The contribution proposes adding a new scenario (Scenario 4) to TR 26.940 documenting:
The contribution proposes creating separate design constraint columns in Table 6.2-1: - Design Constraint (regular call): Existing constraints - Design Constraint (eCall): New column with eCall-specific constraints
| Parameter | Regular Call | eCall |
|-----------|--------------|-------|
| Noise Suppression | Not required; noise suppression may be applied | Background noise preserved during call (at least vehicle-to-center direction); opposite direction may not require preservation |
| DTX Support | Support | No DTX support during call (at least vehicle-to-center direction) |
| Complexity/Memory | Standard mobile constraints | Relaxed constraints possible |
The main technical contribution of this document is establishing that eCall scenarios justify different codec design approaches due to their dedicated nature, different hardware capabilities, and specific regulatory/safety requirements.
[FS_ULBC] On target platforms for ULBC
This contribution addresses a gap in TR 26.940 regarding target platforms for Ultra Low Bit rate Codec (ULBC) deployment. While the TR currently discusses NPU as a possible platform in clause 6.2.1.5.1, it lacks coverage of other non-NPU platforms. The document aims to complete this missing information, particularly focusing on DSP-enabled devices.
The contribution identifies an inconsistency in TR 26.940:
The source references previous contributions (S4aA250267 and S4-251747) that discussed the need for DSP deployment and provided clarification on DSP-enabled UE devices.
The contribution adopts the definition from S4-251747 for DSP-enabled UE devices:
The proposal adds a new paragraph to clause 6.2.1.5.1 that:
- Acknowledges vendor flexibility: vendors may choose any computing unit to implement ULBC based on business needs or product constraints
- Highlights DSP advantages: potentially wider range of product support
- Establishes DSP deployment requirement: ULBC should be deployable on DSP-enabled UE devices, including devices with multiple computing units including DSP
- Provides deployment rationale: even when CPU or NPU are available, DSP may be preferred for power-sensitive applications (wearables, mobile phones)
- Defines DSP computational power: audio processing DSPs typically range from several hundred to over a thousand MIPS
The proposal maintains the existing text about: - NPU prevalence in modern smartphones - NPU being 5-20x more power efficient than CPUs for AI tasks - The note that ULBC may need to run on non-NPU platforms in certain configurations
This contribution ensures that TR 26.940 provides comprehensive guidance on target platforms for ULBC deployment, balancing the AI-optimized NPU approach with the power-efficient DSP approach, thereby supporting a wider range of device implementations and use cases.
[FS_ULBC] On complexity and memory constraints for ULBC
This contribution addresses complexity and memory constraints for Ultra Low Bitrate Codec (ULBC) as part of the study in TR 26.940. The document aims to clarify previous discussions on measurement metrics and specific constraints, proposing concrete values for complexity, RAM, and ROM requirements.
The contribution proposes using both MACS (Million Multiply-Accumulate Operations per Second) and Codec/Model Size together to characterize ULBC complexity, rather than relying on a single metric:
The document clarifies confusion from previous contributions (S4aA250253 and S4-251807) regarding the 5-10M parameters proposal:
The contribution references the 2025 Low-Resource Audio Codec (LRAC) Challenge sponsored by Cisco Systems as a relevant benchmark:
LRAC Challenge Requirements: - Sampling rate: 24 kHz - Mono audio input - Bitrate: up to 1 kbps (ultralow) and up to 6 kbps (low) - Latency: 30 ms (Track 1) or 50 ms (Track 2) - Compute complexity: ≤ 350 MMACS total; ≤ 150 MMACS receive-side - Winner (ByteDance) used ~4M parameters
The contribution proposes the following specific constraints for ULBC:
- Complexity: < 600 MMACS
- RAM:
  - Whether switching will be supported is FFS
- ROM:
The contribution includes a change request to TR 26.940, Section 6.2 (Design Constraint Parameter), Table 6.2-1, adding the specific complexity and memory constraints detailed above to the "Complexity and memory demands" parameter row.
[FS_ULBC] TR 26.940 V 0.5.1
This Technical Report documents the study on Ultra Low Bit rate Speech Codecs (ULBC) for 3GPP Release 20. The primary focus is on IMS voice services over Geostationary Orbit (GEO) satellite access, with additional consideration for multi-party voice communication and other access types.
Background: - GEO satellites operate at 35,786 km altitude, resulting in ~285ms one-way propagation delay - TR 22.887 and TS 22.261 assume total transmission data rates of [1-3] kbit/s - Current 3GPP codecs (lowest: AMR at 4.75 kbit/s) cannot support these constraints
Scenario Descriptions:
Main Scenario (4.2.2.2): One UE connects via GEO satellite access - UE1: Phone supporting IMS voice over GEO satellite - UE2: Either "regular" phone (requiring transcoding in core network) or "upgraded" phone supporting ULBC over other access (enabling transcoder-free operation)
Sub-Scenario (4.2.2.3): Both UEs connect via GEO satellite access - Less common but relevant for disaster/cyberattack scenarios - Even with transparent payload, voice packets transmit to ground before reaching other UE - May enable transcoder-free operation
High-level Prerequisites: - Very low bitrate support - DTX support [TBC] - Error concealment - Real-time implementation on smartphones - Good audio quality for reasonable QoE
Background: - Addresses poor/unstable network conditions in WLAN access - Network congestion during peak usage or in areas with limited infrastructure - Codec selection critical for maintaining quality under bandwidth constraints
Scenario Description: - One participant (UE1) on unstable network using ULBC, other (UE2) on stable network with conventional codec (requires transcoding) - Both participants on unstable networks using ULBC simultaneously (no transcoding needed)
High-level Prerequisites: - Ultra-low bitrate capability - Real-time operation on consumer devices (smartphones, laptops) - Audio quality matching or exceeding existing voice services
Motivation: - ULBC may provide enhanced robustness against poor network conditions - Lower bit rates may benefit coverage/capacity - Reduces transcoding needs when calls bridge GEO and other access types
Scenario Description: - Both UEs support ULBC but connect via 3GPP access other than GEO (LTE, NR, WLAN)
Delay Components:
UE Delay (Table 5.1.2-2): - Depends on voice bundling period (80ms, 160ms, 320ms) and codec frame size (20-320ms) - Performance objective range: 268-1435ms (excluding solution-specific delay) - Maximum requirement range: 355-1435ms (excluding solution-specific delay) - Components: 2x voice bundling period + 2x vendor-specific encoder/decoder processing + vendor delay budget + JBM
Core Network Delay: - Ground station to core network: [5-20]ms minimum, [200]ms maximum - eNodeB to core network: 5-20ms - Transcoding: 7ms (AMR/AMR-WB) to 14ms (EVS)
GEO Transmission Delay: - Minimum: 248ms - Maximum: 280ms (per TS 22.261 KPI requirement) - Variation of 32ms depending on UE location within beam
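As a sanity check on these figures, a physical lower bound follows from two hops at GEO altitude; this is a simplification, since real service and feeder links are slant paths longer than the altitude:

```python
# Back-of-envelope lower bound for the GEO transmission delay above,
# assuming bent-pipe relay (UE -> satellite -> ground) with both hops
# at exactly GEO altitude; real slant paths are longer.
C = 299_792_458.0         # speed of light, m/s
GEO_ALT_M = 35_786_000.0  # GEO altitude, m

nadir_ms = 2 * GEO_ALT_M / C * 1e3
print(round(nadir_ms))    # ~239 ms at nadir; slant geometry pushes this
                          # toward the 248-280 ms range given above
```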
Mouth-to-Ear Delay Estimates (Table 5.1.3-1):
For Main Scenario (GEO-TN): - 80ms bundling, 20ms frame: 548ms (lower) to 872ms (upper) + solution-specific delay X - 320ms bundling, 320ms frame: 1952ms (lower) to 2395ms (upper) + solution-specific delay X
For Sub-Scenario (GEO-GEO): - 80ms bundling, 20ms frame: 804ms (lower) to 1315ms (upper) + solution-specific delay X - 320ms bundling, 320ms frame: 1952ms (lower) to 2395ms (upper) + solution-specific delay X
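Assuming the lower estimates stack the minimum component values from the preceding clauses (an assumption, since the TR's exact composition is not reproduced here), the non-itemized components can be backed out for the Main Scenario:

```python
# Rough decomposition of the Main Scenario lower estimate (80 ms bundling,
# 20 ms frame). Assumption: the 548 ms figure stacks the minimum GEO and
# core-network delays; the residual then lumps together the components not
# itemized here (encoder/decoder processing, vendor delay budget, JBM).
bundling_ms = 80    # counted twice: send and receive side
geo_min_ms = 248    # minimum GEO transmission delay
core_min_ms = 5     # minimum ground-station-to-core delay
m2e_lower_ms = 548  # Table 5.1.3-1, lower estimate

residual_ms = m2e_lower_ms - (2 * bundling_ms + geo_min_ms + core_min_ms)
print(residual_ms)  # 135 ms left for processing, vendor budget and JBM
```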
System Architecture: - Service link: UE to NTN payload - Feeder link: NTN payload to NTN Gateway
RAN Parameters: - Channel coding: Turbo code (NPUSCH Format 1 uplink), TBCC (NPDSCH downlink) - MCS: pi/2 BPSK, pi/4 QPSK, QPSK, 16QAM - Subcarrier spacings: 3.75kHz and 15kHz for NPUSCH Format 1 - Resource unit (RU) duration varies with subcarrier spacing and number of tones
QoS Characteristics: - Managed through QCI (QoS Class Identifier) - Same PELR (Packet Error Loss Rate) required for UL and DL - Suggests balanced UL/DL time-domain transmission resources
Key parameters identified: - Bit rates: [TBD] - Sample rate and audio bandwidth: [TBD] - Frame length: [TBD] - Complexity and memory demands: [TBD] - Algorithmic delay: Frame size buffering + inherent codec delays (look-ahead, sample-rate conversion, post-processing) - Packet loss concealment (PLC): [TBD] - Noise suppression: [TBD] - Discontinuous transmission (DTX): Including VAD and comfort noise [TBD] - Robustness to non-speech input: [TBD]
Current Evaluation Analysis: - Codec must support real-time thread and concurrent processing - ML codecs with [5-10M] parameters considered for efficient operation within latency bounds - Must operate within compute constraints of devices for real-time voice communication
Memory and Power Considerations: - Larger models require more DRAM access → higher power consumption - Memory footprint critical for device performance and usability
Complexity Metrics for AI-Based Codecs:
TOPS (Tera Operations Per Second): - TOPS = 2 × MAC unit count × Frequency / 1 trillion - Smartphone NPUs: 8-59 TOPS reported (varying precision: INT8, INT16, FP16) - TOPS/W (power efficiency): 2-15 TOPS/W for smartphone NPUs - Note: TOPS/W typically benchmarked under full-load; lighter workloads like audio codecs may show different characteristics
Alternative Metrics: - MACs (Multiply-Accumulate operations): Practical for complexity assessment - RTF (Real-Time Factor): Ratio of frame length to encoding/decoding time; reliable but resource-intensive to measure - Model Size: Number of parameters and precision; directly impacts memory and power - Tools available: ptflops, torchinfo, fvcore for MAC counting
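For simple layers the MAC count can also be done by hand rather than with the listed tools; a sketch for a 1-D convolution using the standard formula (parameter names are generic and not tied to any particular tool):

```python
# MACs for one Conv1D layer: every output sample of every output channel
# accumulates kernel_size taps over c_in/groups input channels.
def conv1d_macs(c_in, c_out, kernel_size, l_out, groups=1):
    return (c_in // groups) * c_out * kernel_size * l_out
```

For example, a hypothetical 64-in/64-out, kernel-7 layer over 100 output samples costs `conv1d_macs(64, 64, 7, 100)` = 2,867,200 MACs; a depthwise variant (`groups=64`) costs 64× fewer.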
Observations: - NPUs/TPUs significantly more power-efficient than CPUs for AI tasks (5-20x) - Actual NPU performance depends on computational graph structure - Irregular/sequential/unsupported operations may require CPU fallback - ULBC complexity constraints should be based on desired power consumption/computational performance, not relative to existing 3GPP codecs - Million MACs + model size provide first indication of complexity - RTF useful but requires standardized test benches - WMOPS not directly suitable for NPU-capable devices but mapping to TOPS/RTF beneficial
Complexity Target Estimation: - Target devices: Modern smartphones with NPU components - Example: DAC codec estimated at ~150 Giga MAC/sec (~0.3 TOPS) - Actual power consumption on smartphone NPUs: TBD - Model size and architecture significantly impact DRAM operations and overall power consumption
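The TOPS arithmetic above is simple to sketch in Python — both the peak-hardware formula from the metrics clause and the DAC workload estimate (the 4096-unit/1 GHz example NPU is hypothetical):

```python
def peak_tops(mac_units, freq_hz):
    """Peak TOPS = 2 x MAC-unit count x clock frequency / 1e12 (TR formula)."""
    return 2 * mac_units * freq_hz / 1e12

def workload_tops(mac_per_sec):
    """Sustained ops for a model: each MAC counts as two operations."""
    return 2 * mac_per_sec / 1e12

print(peak_tops(4096, 1e9))  # 8.192 -- hypothetical NPU, within the
                             # 8-59 TOPS smartphone range cited above
print(workload_tops(150e9))  # 0.3   -- the DAC estimate (~150 Giga MAC/s)
```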
Editor's note: Algorithmic delay verification method for AI-based codecs required.
Codec Parameters and Configuration: - Static parameters: Rarely changed, exchanged via SDP or predefined - Dynamic parameters: May change frequently, included in each packet/frame - Common static/dynamic parameters to be identified
Categories: 1. 3GPP IMS codecs: Reference conditions (AMR, AMR-WB, EVS) 2. Conventional Ultra Low Bitrate Codecs: DSP-based (MELP/MELPe, AMBE-LR, MPEG-HVXC, TWELP MR, Codec2) 3. AI-based postprocessor: Enhancement of conventional codec output 4. AI-based encoder/decoder: - Causal systems: Real-time capable (LPCNet, LyraV2, EnCodec, Mimi-Codec, TS3, TAAE, LMCodec2) - Non-causal systems: Non-real-time due to large look-ahead (DAC, DAC-IBM, SNAC, SpeechTokenizer, SemantiCodec, FunCodec, WavTokenizer, BigCodec, FocalCodec)
Key Codec Properties:
3GPP IMS Codecs: - AMR: NB, 5ms delay, 20ms frame, 4.75 kbps - AMR-WB: WB, 5.9375ms delay, 20ms frame, 6.6 kbps - EVS: NB/WB/SWB, 12ms delay, 20ms frame, 7.2-9.6 kbps
Conventional Ultra Low Bitrate: - MELP/MELPe: NB, 20-36ms delay, 22.5-90ms frame, 0.6-2.4 kbps - Codec2: NB, 40ms delay, 20-40ms frame, 0.45-2.4 kbps
AI-based (Causal): - LPCNet: WB, 25ms delay, 40ms frame, 1.6 kbps - LyraV2: WB, [TBD] delay, 20ms frame, 3.2/6/9.2 kbps - Mimi-Codec: 24kHz, 0ms delay, 80ms frame, 0.55/1.1 kbps
AI-based (Non-causal): - DAC: WB/24kHz, 244-366ms delay, 13.3-20ms frame, 0.5-3+ kbps - DAC-IBM: 24kHz, 366ms delay, 13.3ms frame, 0.75/1.5/3 kbps - SNAC: 24kHz, 1000ms delay, 80ms frame, 0.98 kbps
Audio Bandwidth: - Conventional codecs: NB only - Modern AI codecs: WB or higher
Algorithmic Codec Delay: - IMS codecs: 25-32ms - Conventional ultra-low: 60-126ms - Causal AI: 20-80ms - Non-causal AI: 500ms+ or full signal
Frame Duration: - Conventional ultra-low: Increased vs. standard 20ms VoIP - Some AI codecs maintain 20ms, others increase (e.g., Mimi 80ms)
Bitrate: - All listed codecs (except IMS and LyraV2) offer ≥1 mode <3 kbps
Complexity: - AI codecs generally higher than IMS/conventional codecs - Exception: LyraV2 requires only 35% of ARM A53 core (RaspberryPi 3+) - RAM: AI codecs significantly higher (e.g., LyraV2: 54MB vs. EVS: 294KB) - ROM: AI codecs much higher (e.g., TAAE: 950M parameters ≈ 900MB @ 8-bit; SNAC: 19M ≈ 18MB @ 8-bit; EVS: ~2MB)
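The ROM figures above follow from parameter count times weight precision; a quick Python check (reported in MiB, i.e. 2^20 bytes, matching the rough "MB" figures):

```python
# ROM footprint from parameter count and weight precision, matching the
# rough estimates above (1 byte per parameter at 8-bit quantization).
def rom_mib(n_params, bits_per_weight=8):
    return n_params * bits_per_weight / 8 / 2**20

print(round(rom_mib(950e6)))  # ~906 MiB -- TAAE, "~900MB" above
print(round(rom_mib(19e6)))   # ~18 MiB  -- SNAC
```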
P.808 ACR Test Results (Figure 7.1.4-1):
Test setup: - English clean speech (4 talkers × 6 samples) - 32kHz, SWB, normalized to -26 dBoV - 24 subjects
Key Findings: - Codec2 (all rates) significantly worse than AMR 4.75 kbps - SemantiCodec, LyraV2, LPCNet, Mimi 0.55 kbps: comparable to AMR-WB 6.65 kbps - Three conditions on par or slightly better than EVS 9.6 kbps: - Mimi-Codec 1.1 kbps (causal) - DAC-ibm 1.5 kbps (non-causal) - SNAC 0.98 kbps (non-causal) - AI-based solutions show 2+ MOS improvement over conventional ultra-low bitrate codecs
Test Configuration (Table 7.1.5.1-1): - Bitrates: 1, 2.5, 4.5, 6 kbps - Loss percentages: 1%, 6%, 10%, 20% - Frame size: 80ms - Based on NB-IoT NTN data at ~3dB CNR (SCS=15kHz) and 9dB (SCS=3.75kHz)
Loss Simulation Methods: 1. Consecutive 4 blocks drop and repeat: Simulates 80ms packet loss 2. Interleaved drop and repeat: Spreads loss over 2 packets (adds latency)
MUSHRA Test Results (8 listeners): - Despite higher loss percentage, 4.5 kbps and 6 kbps significantly better than 1 kbps and 2.5 kbps - 6 kbps @ 20% loss rated close to 4.5 kbps @ 10% loss - Interleaving benefit increases with error rate - Potential for improvement if model trained with random loss patterns
Comparison: - DAC (default): 16kHz, general audio training, scalable bitrate - DAC-IBM: 24kHz, speech-specific training, fixed 1.5 kbps
MUSHRA Test Results (8 listeners, resampled to 16kHz): - DAC-IBM 1.5 kbps @ 3% PLR significantly outperforms all other DAC conditions - DAC 4.5 kbps @ 10% PLR and 6 kbps @ 20% PLR show no significant improvement over DAC-IBM 1.5 kbps @ 3% PLR - Specific training for target bitrate crucial for optimal performance - Error resilience improvable through appropriate training/design choices
Conclusions: - More design freedom needed in bitrate and BLER selection for optimal quality at given SNR - Optimal coding performance (even under errors) achieved with appropriate training strategy - Bitrate scalability (e.g., DAC) comes with significant performance cost, especially at lower bitrates - Dedicated training (e.g., DAC-IBM) much more efficient
Test Setup (Nokia): - Clean Finnish speech (3 males, 3 females, 4 sample pairs each) - Diotic presentation via Sennheiser HD650 headphones - Experienced listeners - Extended ACR5 scale (0.5-5.5) and DCR methodologies - Bandwidths tested: NB (4kHz), MB (6kHz), WB (8kHz), 10kHz, SSWB (12kHz), SWB (16kHz), FB (20kHz)
Codecs Tested: - DSP: Codec2 (0.7, 1.3, 2.4, 3.2 kbps), MELP (2.4 kbps), MPEG4 HVXC (2.0, 4.0 kbps) - 3GPP: AMR, AMR-WB, EVS at various rates - ML: DAC 44k (0.9, 1.7, 2.6, 3.4, 6.9 kbps), TSAC 44k (0.6, 1.2, 2.5, 3.2, 5.9 kbps)
Extended ACR5 Results (Figures 7.2.3-1, 7.2.3-2): - Increased bandwidth improves quality up to ~12kHz (saturation region) - 4kHz bandwidth significantly limits perceived quality - MELP 2.4k and MPEG4 HVXC perform better than Codec2 - 3GPP codecs perform as expected at lowest bitrates - TSAC and DAC show very good performance in clean speech - TSAC ≥1.2 kbps and DAC ≥1.7 kbps suitable as ML-based references - Both poor quality <1 kbps
DCR Results (Figure 7.2.4-1): - Results align with ACR test - Exception: MELP preferred over HVXC 2.0 in DCR (full 4kHz bandwidth vs. ~3.7kHz) - Listeners more likely to notice degradations with reference available
Test Setup: - French, 30 listeners (5 panels × 6) - 8sec double sentences, 3 male + 3 female - 20-20,000Hz bandpass, -26dB LKFS normalized
Codecs: - Conventional: Opus (12, 16, 24 kbps), EVS-WB (7.2, 8 kbps), EVS-SWB (9.6, 13.2, 24.4 kbps) - AI: LPCNet (1.6), Lyra V2 (3.2, 6, 9.2), EnCodec (1.5, 3, 6, 12, 24), AudioCraft (1.5, 3, 6), AudioDec, DAC (1.7, 2.6, 5.2, 7.8)
Key Findings: - DAC best DMOS among ~1.5 kbps codecs; approaches "Direct" quality <8 kbps - EnCodec doesn't achieve "Direct" quality even @ 24 kbps; below EVS/Opus at this rate - Lyra V2 (6, 9.2 kbps) on par with EVS-WB (7.2, 8 kbps)
Same setup as DCR test, ACR methodology for better objective metric comparison. Same observations as DCR test.
Test Setup: - 30 listeners (5 panels × 6) - 6 categories: instrumental/vocal classical, instrumental/vocal modern, captured mixed, artificial mixed (speech + music background) - 20-20,000Hz bandpass, -26dB LKFS
Codecs: - Conventional: xHE-AAC (8, 12, 16, 24), Opus audio (16, 24), Opus voip (12, 16, 24), EVS-SWB (9.6, 13.2, 24.4) - AI: EnCodec (12, 24), DAC (4.3, 6, 7.8), HILCodec (4.5, 6, 9), SNAC (2.6), FlowDec (4.5, 6, 7.5) - Note: Many neural codecs pretested but excluded due to low quality (LPCNet, Lyra V2, AudioDec, FreqCodec, HifiCodec, Spectral Codecs, Vocos, DisCodec, Mimi, AudioCraft)
Key Findings: - Best quality: EVS and xHE-AAC @ ~24 kbps - Neural codec advantage visible at low bitrates - No tested neural codec achieves quality close to "Direct" - FlowDec 7.5 kbps: 4.08 DMOS (best neural codec) - No tested AI codec provided reasonable quality for music/mixed content <2.6 kbps
Classical Speech Coding: - Studies on MELPe and AMR show noise reduction preprocessing improves parameter extraction and decoded speech quality - Especially beneficial in noisy conditions and low SNRs - Improves intelligibility and perceptual quality - Integrated in 3GPP2 EVRC and VMR-WB standards
Neural Speech Coding: - Known to be sensitive to noisy environments - Robustness influenced by training data diversity, low bitrates, capacity/complexity, quantization - Data-driven approaches make failure modes difficult to anticipate - Noise suppression can minimize issues and allow codec to focus on useful signal
Two Listening Tests (ITU-T P.808 ACR):
Test 1 - High SNR: - Assumptions from 3GPP EVS characterization - SNRs: +15 to +20 dB (WB) - Noises: car, street, office (from ITU-T P.501 Annex B) - 24 pairs of sentences (8 pairs × 3 noises) - 20 listeners
Test 2 - Low SNR: - More adversarial environments - SNRs: -5 to +15 dB - Noises: street, construction, metro, car, office, restaurant - 24 pairs of sentences (4 pairs × 6 noises) - 21 listeners
Noise Suppression: - DeepFilterNet2: State-of-the-art DNN-based, operates at 48kHz - Applied as preprocessor before coding
Mixing Procedure: - Loudness normalization using BS1770demo (ITU-T STL) - RMS long-term option for background noise level
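The mixing procedure above can be sketched in a few lines. This is a minimal illustration only: it uses plain RMS as a stand-in for the BS.1770-based long-term loudness actually used via the BS1770demo tool, and the signal data and function name are our own.

```python
import math

def mix_at_snr(speech, noise, target_snr_db):
    """Scale `noise` so that the speech-to-noise RMS ratio equals
    `target_snr_db` (in dB), then add it to the speech sample-by-sample.
    Plain RMS stands in here for the BS.1770 long-term loudness measure."""
    rms = lambda x: math.sqrt(sum(s * s for s in x) / len(x))
    # Gain that places the noise target_snr_db below the speech level
    gain = rms(speech) / (rms(noise) * 10 ** (target_snr_db / 20))
    return [s + gain * n for s, n in zip(speech, noise)]

# Toy example: mix a synthetic tone ("speech") with pseudo-noise at +15 dB SNR
speech = [math.sin(2 * math.pi * 0.1 * i) for i in range(480)]
noise = [((i * 2654435761) % 1000) / 500.0 - 1.0 for i in range(480)]
mixed = mix_at_snr(speech, noise, 15.0)
```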
Classical Codecs: - MELPe, AMR, AMR-WB, EVS
Neural Codecs: - SNAC, MIMI, DAC_IBM (speech-trained, <2 kbps) - LyraV2 3.2 kbps (likely trained on diverse data including noisy speech) - DAC (original, 24kHz, 1.5/3/6 kbps) - Test 1 only
All tested with and without noise suppression ("_nr" suffix).
Key Observations: - Listeners prefer uncoded denoised speech over uncoded noisy speech - Denoised speech as good as clean speech at high SNRs (minimal artifacts) - Noise suppression beneficial for all codecs except MELPe (already has noise reduction; benefit minimized at high SNRs) - Classical codecs: Benefit increases with bitrate/quality - Neural codecs: Greater benefit, >0.5 MOS improvement for several (SNAC, DAC_IBM, DAC @ 3 kbps) - DAC_IBM vs. DAC: Same architecture/complexity, very different behavior due to training data/target bitrate - Plain DAC @ 24kHz not competitive at 1.5 kbps - LyraV2: ~70x less complex than other neural codecs; @ 3.2 kbps performs worse except vs. DAC @ 3 kbps (on par)
Key Observations: - Listeners strongly prefer uncoded denoised speech (~1 MOS difference) - All classical codecs benefit from denoising (<1 MOS improvement) - Neural codecs benefit even more (>1 MOS improvement possible) - Neural codecs at vastly lower bitrates can compete with conventional codecs under adverse conditions when combined with noise suppression - Generative-AI based codecs (e.g., DAC IBM) can improve absolute quality of input signal when coding denoised speech
Key Characteristics: - Publicly reported: "38x faster than real-time" on high-end 2021 smartphone - Entirely CPU execution (no NPU/TPU) - Open-source under Apache 2.0 license (permissive for commercial/standardization)
Code-Level Analysis:
- Core components (LyraGanModel, SoundStreamEncoder) explicitly use CPU backend (XNNPACK via TensorFlow Lite)
- Flag use_xnn=true directs to CPU execution
- No hardware accelerator delegates (NNAPI, Hexagon, CoreML, TPU)
- Single-threaded execution (threads explicitly set to 1)
- Benchmark: Mean 0.525ms processing time for 20ms frame = ~38x real-time
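The benchmark arithmetic can be checked directly. A small sketch (the function name is ours; the 0.525 ms and 20 ms figures are from the text):

```python
def real_time_factor(proc_time_ms, frame_ms):
    """RTF = processing time / audio duration. RTF < 1 means faster than
    real time; 1/RTF is the speed-up factor usually reported."""
    return proc_time_ms / frame_ms

# Figures from the text: mean 0.525 ms to process one 20 ms frame
rtf = real_time_factor(0.525, 20.0)
speedup = 1.0 / rtf  # ~38x faster than real time
```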
Conclusion: - Demonstrates that a state-of-the-art low-bitrate AI speech codec can achieve/exceed real-time requirements on a high-end 2021 smartphone CPU - Significant margin relative to the maximum acceptable RTF - CPU-only approach viable for ULBC
Methodology: - ONNX Runtime library for execution - Tested on CPU backend and NNAPI backend (Android NPU interface) - Model: Unmodified pretrained DAC @ 44.1kHz, 8 kbps (from reference) - No quantization applied (original float model) - Metrics: Real-Time Factor (RTF) for end-to-end and individual components
Theoretical Complexity Analysis (Figure 7.6.2-1): - Tools: ptflops v0.7.5, thop v2.0.17 (cross-verification) - Complexity scales with frame size: 1.4 GFLOP (20ms) to 31.6 GFLOP (320ms) - Model: 76.9M parameters, 293MB size - Note: Different library versions produce different results due to ConvTranspose1d calculation methodology changes
Real-World Inference Performance:
Test Platforms: 1. High-end desktop: AMD Ryzen 9 7950X (5.7GHz fixed) 2. High-end mobile: Qualcomm Snapdragon 8 Gen 2
Key Findings (Figures 7.6.4-1, 7.6.4-2, 7.6.4-3):
Desktop CPU: - Single-threaded: NOT real-time (RTF 1.6-1.9) - Multi-threaded (4 threads): Real-time capable (RTF 0.67-0.86) - Still very slow for high-end desktop CPU
Mobile SoC: - NO configuration achieves real-time performance - Best-case RTF: 2.125 (>2x slower than real-time) - Worst-case RTF: 5.884 (~6x slower than real-time) - NNAPI backend (NPU): Inconsistent results; sometimes helped slightly, sometimes significantly worse than CPU - Cannot assume NPU automatically improves performance; NPU-specific optimizations may be required
Critical Gap: - Significant gap between theoretical NPU capacity and actual measured performance (RTF) - Model appearing suitable on paper (~2-5 GFLOP/frame) unable to run real-time on top-tier mobile phone - Real-world testing essential
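The "suitable on paper" figure can be made concrete: converting per-frame complexity into the sustained throughput needed for RTF = 1. A minimal sketch (function name is ours; the 2-5 GFLOP/frame range is from the text, and the 20 ms frame length is an assumption):

```python
def required_gflops(gflop_per_frame, frame_ms):
    """Sustained compute (GFLOPS) needed for real-time decoding (RTF = 1),
    ignoring memory-bandwidth and thermal effects, which the text notes
    are exactly what static metrics miss."""
    return gflop_per_frame / (frame_ms / 1000.0)

# The ~2-5 GFLOP/frame range quoted in the text, assuming a 20 ms frame:
low, high = required_gflops(2.0, 20.0), required_gflops(5.0, 20.0)
# i.e. 100-250 GFLOPS sustained, well within a mobile NPU's nominal capacity,
# yet the measured RTF on a top-tier phone is still > 2
```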
Editor's note: NNAPI may fall back to CPU for float models; impact needs verification.
Categories: - Loss of listening-only audio quality - Audio bandwidth loss - Impaired intelligibility - Impaired speaker identifiability - Prosodic impairments - Hallucination (word and phone confusions) - Sensitivity to non-speech input (background noise, music, noisy speech, interfering talker, reverberant speech)
Additional Considerations: - Speech enhancement algorithms (noise suppression, gain normalization) may be part of ULBC
Traditional 3GPP Practice: - AMR, AMR-WB, EVS: Listening-only evaluations using P.800 ACR and modified DCR - ACR: Generally for clean speech - DCR: For SWB clean speech, mixed-bandwidth, speech + background noise, music/mixed content - Focus not on intelligibility, speaker identifiability, prosodic impairments
ULBC Challenges: - May need to address additional aspects directly through dedicated tests - Hallucination: Specific to ML-based systems - ACR may not optimally quantify all impairments (hallucination, intelligibility, prosodic)
Alternative Test Methods: - Automatic speech recognition - Modified rhyme tests - DCR tests (for prosodic differences) - Diagnostic Rhyme Tests (DRT) - Modified Rhyme Tests (MRT) - MOS testing for speaker similarity - Speaker verification/identification tests - Prosodic naturalness MOS tests - Intonation recognition tests - Transcription tests (word/semantic equivalence) - Phoneme recognition tests
Noise Suppression Evaluation: - P.835: Multi-dimensional rating (speech quality and noise suppression capability separately) - Typically used for systems with noise suppression
DCR Considerations: - Subjects may consider noise suppression as degradation when comparing to uncoded noisy reference
[FS_ULBC] Permanent Document v0.5.0
This permanent document (p-doc) version 0.45.0 supports the Study Item on Ultra Low Bitrate Speech Codec (FS_ULBC), focusing on developing recommendations for normative work on an ultra-low bit rate codec for voice over Geostationary Orbit (GEO) satellites. The document tracks agreements, open issues, and progress across the study objectives defined in the SID.
The study addresses nine key objectives: - Document application scenarios for ultra-low bit rate communication services - Study GEO channel characteristics and derive service-related dependencies - Identify relevant design constraints - Provide feasibility evidence - Define performance requirements and test methodologies - Identify/develop objective measures for design constraint verification - Identify reference codecs - Coordinate with other 3GPP groups (SA2, RAN, CT1) - Define potential normative work item objectives and timeline
Working Procedure: - Maintains one TR and one p-doc - Contributions via pCRs - Brackets restricted to values only - Open issues documented in p-doc
Key Technical Assumptions:
UE1 Uplink (UE1 → GEO satellite → Ground station): - Transmission data rate significantly limited ([1-3] kbit/s) - Requires ultra-low bit rate codec fitting this transmission rate - Subject to transmission errors reflecting GEO satellite access - Delay greater than typical terrestrial networks
UE1 Downlink (Ground station → GEO satellite → UE1): - Similarly limited transmission data rate - Subject to similar transmission errors and delay
UE2 Connection (Core Network → UE2): - Regular TN network transmission data rate available - Could use existing IMS codec (with transcoding) or same ULBC (transcoding-free) - Transcoding functionality in core network likely needed for seamless communication across network types
Both connections (UE1 and UE2) via GEO satellite with significantly limited transmission data rate ([1-3] kbit/s), allowing both transcoded and transcoding-free operation.
Methodology: - Reuses simulation model from TS 26.132 Annex E (LTE reference scenario) - Adapted for GEO access scenario with "new GEO channel" - Potential inclusion of Non-IP Data Delivery (NIDD) option
Key Input Parameters:
BLER_tx/BLER_rx: Block error rates for uplink/downlink from RAN simulation
drx_cycle_length: DRX cycle duration (20-40ms for LTE; suitability for GEO TBC with RAN2)
mis_eNB1_eNB2: Scheduling time mis-alignment; determines buffer waiting time
nFrames considerations: - Frame length: Maximum 80ms assumed for GEO (vs. 20ms for LTE) - Voice packet size: Depends on protocol overhead (user plane vs. control plane, IP vs. Non-IP NIDD) - RTP Payload Size: Product of frame length and codec bit rate
Editor's Note: SA2 concluded in TR 23.700-19 that voice packets shall be transported over NB-IoT (GEO) user plane.
Objective: Generate multiple loss traces for combinations of: - Frame loss rate (target BLER) - Raw bitrate (TBS) - Voice bundling period - Doppler spread
Simulation Parameters: - Number of seeds: 10 - Trace duration: 400 seconds (6.67 minutes) - Channel consistency: Same channel realizations across all combinations
Baseline CNR values from TR 36.763: - UL CNR = 2.6dB (0dBi UE antenna gain, 3.75kHz SCS, 1 tone, 23dBm UE max TX power) - DL CNR = -3.3dB (0dBi UE antenna gain, 15kHz SCS, 12 tones, 1 UE receive antenna, 23dBm UE max TX power)
Channel model: NTN-TDL-C [38811]
Elevation angle: 10 degrees (parameters specified in Table 5.2.2.2-1)
Modulation: QPSK, π/2 BPSK
Subcarrier Spacing (SCS): 3.75kHz, 15kHz
Number of tones: 1 for both SCS values
Voice bundling period: 80ms, 160ms, 320ms - Note: 40ms not considered due to insufficient time for DL transmissions with 3.75kHz SCS
Doppler spread: 1Hz, 5Hz
Target BLER: 1%, 2%, 6%, 10% (fixed target BLER is FFS)
Maximum Achievable SNR: SNR = (3GPP SET-1 UL SNR) - 10×log₁₀(B/3.75) + (P - 23dBm) + G + [X] dB
Where: - 3GPP SET-1 UL SNR = 2.6dB - B = bandwidth (3.75kHz or 15kHz) - P = max UE TX power (23, 26, 31 dBm) - G = UE antenna gain difference (0 to -5.5dBi) - X = TBD (accounts for lower loss, better satellite performance)
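The formula above can be written as a small helper (the function name and the worked cases are ours; the constants follow the text):

```python
import math

def max_ul_snr_db(bandwidth_khz, tx_power_dbm, antenna_gain_db, x_db=0.0,
                  set1_ul_snr_db=2.6):
    """Maximum achievable UL SNR per the formula above:
    SNR = SET-1 UL SNR - 10*log10(B/3.75) + (P - 23 dBm) + G + [X]."""
    return (set1_ul_snr_db
            - 10 * math.log10(bandwidth_khz / 3.75)
            + (tx_power_dbm - 23.0)
            + antenna_gain_db
            + x_db)

# Baseline case: 3.75 kHz SCS, 23 dBm, 0 dBi, X = 0 -> the SET-1 value itself
baseline = max_ul_snr_db(3.75, 23.0, 0.0)
# 15 kHz SCS at the same power spreads energy over 4x bandwidth,
# losing 10*log10(4) ~ 6 dB
wide = max_ul_snr_db(15.0, 23.0, 0.0)
```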
TBS Values and PHY Bitrates:
For 80ms bundling: - TBS: 144, 256, 328, 424 bits - PHY bitrate: 1.8, 3.2, 4.1, 5.3 kbps - Codec bitrate: 1.1, 2.5, 3.4, 4.6 kbps (assuming 7 bytes packet header)
For 160ms bundling: - TBS: 208, 424, 600, 808 bits - PHY bitrate: 1.30, 2.65, 3.75, 5.05 kbps - Codec bitrate: 0.95, 2.30, 3.40, 4.70 kbps
For 320ms bundling: - TBS: 328, 776, 1096, 1544 bits - PHY bitrate: 1.025, 2.425, 3.425, 4.825 kbps - Codec bitrate: 0.850, 2.250, 3.250, 4.650 kbps
Notes: - Packet header counted once regardless of bundled frames - Loss of single TB means loss of multiple consecutive voice frames - Need for 320ms bundling to be revisited after channel simulation results
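The TBS-to-bitrate relationship used in the tables above can be sketched as follows (a minimal illustration; the 7-byte header counted once per transport block follows the text, while the function name is ours):

```python
def phy_and_codec_kbps(tbs_bits, bundling_ms, header_bytes=7):
    """PHY bitrate = TBS / bundling period; the codec bitrate additionally
    subtracts one packet header per transport block (counted once,
    regardless of how many voice frames are bundled).
    bits/ms is numerically equal to kbit/s."""
    phy = tbs_bits / bundling_ms
    codec = (tbs_bits - 8 * header_bytes) / bundling_ms
    return phy, codec

# Reproduce the first 80 ms row: TBS 144 -> 1.8 kbps PHY, 1.1 kbps codec
phy_80, codec_80 = phy_and_codec_kbps(144, 80)
# And the first 160 ms row: TBS 208 -> 1.30 kbps PHY, 0.95 kbps codec
phy_160, codec_160 = phy_and_codec_kbps(208, 160)
```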
SCS: 15kHz
Number of tones: 12
Achievable SNR: SNR = (3GPP SET-1 DL SNR) + G + [Y] dB
Where: - 3GPP SET-1 DL SNR = -3.3dB - G = UE antenna gain difference (0 to -5.5dBi) - Y = TBD (accounts for 2 RX antennas providing up to 3dB gain, lower loss, better G/T values, better satellite performance)
Editor's Note: Four companies reported Y=3 due to better G/T from field measurements (-28.6dB/K vs. -31.6dB/K assumed), but no RAN1 consensus reached.
TBS values: Identical to uplink (Clause 5.2.2.2)
Dynamic Scheduling Example (80ms bundling, Half-duplex FDD): - NPDSCH duration: 4ms (variable depending on DL SNR) - UL frequency allocation options: 1, 3, 6, 12 tones with 15kHz per tone
Semi-Persistent Scheduling (SPS): - If specified by RAN for NB-IoT NTN - NPDSCH can be anywhere in first 15ms (maintaining minimum 1ms gap to NPUSCH) - "Cell_specific_Koffset" approach proposed (not dependent on "TA report UE capability")
Gap between DL and UL consists of: - Processing time + DL-to-UL switching (minimum 1ms for half-duplex device) - Max differential delay: [close to 0 to 10.3ms] (TBC)
RAN1 Note: Example frame structures supportable in most scenarios but may not work for very large cells (>3000km) when UE doesn't support TA report and network doesn't support UE-specific K-offset. RAN1/2 have not yet designed SPS.
Issue 1 - UE Power Class: Whether to use specified 23dBm or broader range (26, 29, 31, 33 dBm) - Pending RAN input
Issue 2 - Latitude-Dependent Loss: Scintillation loss (2.2dB or 0dB depending on latitude) - Solved (accounted via X term)
Issue 3 - Elevation Angles: Keeping both 2.3° and 12.5° - Solved (accounted via X term)
Issue 4 - UL/DL Guard Time: 1ms assumption - Pending RAN confirmation
Issue 5 - Candidate TBS Values: Multiple proposals from companies - Unsolved
Issue 6 - Approaches to Select TBS: Three approaches provided - Unsolved
Issue 7 - Overall Simulation Methodology: High-level description needed - Unsolved (to be addressed after simulation completion)
Issue 8 - Simulation Channel Model: NTN-TDL-C vs. NTN-TDL-C5 - Solved (NTN-TDL-C used)
Issue 9 - Protocol Overhead: Clarify packet header for different transport options - Pending RAN2/SA2 confirmation
Issue 10 - Repetition Numbers: Specify and report in simulation - Solved
Issue 11 - RX G/T for Downlink: 3dB better value observed in field - Unsolved
Editor's Note: This methodology remains an open issue.
Proposed Steps:
Agree on operation points: Set of maximum achievable receive SNRs covering marginal to error-free operation with NTN-TDL-C fading
Define performance requirements for each SNR operation point
Agree on source bit rates for each bundling time (80, 160, 320ms) based on transport formats (TBS, SCS, MCS, NRep)
Note: The granularity of these source bit rates appears insufficient and unequal
Determine optimum transport format (SCS, MCS, NRep) for each source bit rate based on BLER vs. SNR curves
Produce packet loss patterns for each bundling time and source bit rate at relevant SNRs (unknown to proponents during selection)
Compare ULBC candidates based on performance requirements at relevant SNRs
Example Workflow: - Proponent has design at 0.95 kbps and 3.4 kbps - For 160ms bundling with 7-byte overhead: - Low rate: TBS = 208 bits - High rate: TBS = 600 bits - Select best transport format configuration from available options - Generate BLER patterns for different UE TX powers (23, 26, 29, 31 dBm) - Run codec simulation with these patterns - Evaluate quality (e.g., listening test) with weighted averaging across power settings
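The final evaluation step of the workflow (weighted averaging across power settings) can be sketched as below. The MOS values and the uniform weighting are hypothetical; the text does not specify the weights.

```python
def weighted_quality(mos_by_power, weights):
    """Weighted average MOS across UE TX power settings, a minimal sketch
    of the last step in the example workflow. `weights` must sum to 1."""
    assert abs(sum(weights.values()) - 1.0) < 1e-9
    return sum(weights[p] * mos_by_power[p] for p in mos_by_power)

# Hypothetical listening-test MOS per UE TX power (dBm), uniform weighting
mos = {23: 3.1, 26: 3.5, 29: 3.8, 31: 4.0}
avg = weighted_quality(mos, {p: 0.25 for p in mos})
```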
Note: Important to test candidates for other conditions beyond NTN NB-IoT (e.g., Terrestrial IMS with 1% BLER, OTT with 0% BLER, extreme conditions with 10% BLER or blockage losses)
Table 5-6 documents preliminary results: - 80ms bundling: Qualcomm submitted S4-251739 - Company A, B, C: TBD
Target Device Types: - Handheld mobile phones - Smart watches - Smart glasses/head mounted devices - TCU (Telematics Control Unit) - CPE (Customer Premises Equipment) - Vehicles - Other IoT devices
Recommended Constraints: - Implementable on DSP/CPU/NPU enabled UE devices - For low-end DSP-only UEs: - Complexity: <500 WMOPS (measured on C reference code) - ROM memory: <20MB assuming 32bit/parameter (or 5M model parameters)
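The ROM figure follows directly from parameter count and precision. A small sketch (function name is ours; the 5M-parameter / 32-bit / 20 MB relationship is from the text, and the int8 case is our own illustration):

```python
def model_rom_mb(num_params, bits_per_param=32):
    """ROM footprint of stored model weights in megabytes (1 MB = 1e6 bytes)."""
    return num_params * bits_per_param / 8 / 1e6

# 5M parameters at 32 bit/parameter -> the 20 MB budget in the text
rom = model_rom_mb(5_000_000)
# Hypothetically, 8-bit quantization would fit the same model in 5 MB
rom_int8 = model_rom_mb(5_000_000, bits_per_param=8)
```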
Editor's Notes: - Definition of "DSP enabled UE devices" needs clarification - Exact complexity estimation metric and limits are TBD
Complexity Verification: - Constraints may be based on platform-agnostic metrics: - MACs/FLOPs for AI-based components - WMOPS for traditional signal processing - Model size and precision - Verification process details and timing are FFS
Algorithmic Delay: - Verification method for AI-based codecs required
Define performance requirements and test methodologies for: - Speech quality, intelligibility, conversational quality - Clean speech and noisy speech - Tandeming with existing IMS voice codecs - Clean channel and GEO channel conditions - Identify relevant reference codecs
Core influencing factors identified: - DC: Sample rate and audio bandwidth - DC: Bitrates (External dependency) - DC: Frame length - DC: PLC (External dependency) - DC: Algorithmic Delay - DC: Complexity, Memory - Test Methodologies - DC: Noise suppression - DC: DTX/CNG - DC: Robust Non-Speech - Evidence DCs - Reference codec
All items currently have open issues and progress TBD
From RAN: - HARQ retransmission parameters (max_tx/max_rx) - DRX cycle length suitability for GEO - Scheduling parameters (dynamic vs. SPS) - Frame structure confirmation - UE power class - UL/DL guard time - Protocol overhead - G/T values for downlink
From SA2: - Transport path for voice packets (user plane vs. control plane, IP vs. Non-IP NIDD) - Protocol overhead details - Transcoding functionality requirements
From RAN2: - Dynamic scheduling vs. Semi-Persistent Scheduling - MAC header size (1-byte feasibility) - Timing parameters
The document establishes a comprehensive RAN simulation framework for generating error traces: - Defined methodology using NTN-TDL-C channel model - Specified uplink and downlink parameters - Established TBS values and corresponding codec bitrates for multiple bundling periods - Defined channel consistency requirements across simulations
Adopted baseline CNR values from TR 36.763 with provisions for: - Variable UE power classes - Latitude-dependent losses - Elevation angle variations - Better-than-assumed satellite performance
Proposed alternative methodology allowing proponents design freedom: - Operation point definition based on receive SNRs - Transport format optimization for each source bit rate - Packet loss pattern generation - Comparative evaluation framework
Defined frame structures for: - Dynamic scheduling with Half-duplex FDD - Semi-Persistent Scheduling options - Cell_specific_Koffset approach for large cells
Adapted TS 26.132 Annex E model for GEO scenarios: - Identified required input parameters - Defined voice packet structure with protocol overhead - Established relationship between frame length, bundling, and packet loss
High Priority (Blocking): 1. Consensus on UE power class (23 dBm vs. higher values) 2. RAN confirmation on frame structures and scheduling 3. SA2/RAN2 confirmation on protocol overhead 4. Selection of candidate TBS values and selection methodology 5. Downlink RX G/T value consensus
Medium Priority: 1. Fixed vs. variable target BLER 2. Need for 320ms bundling option 3. Complexity metric definition and limits 4. Algorithmic delay verification for AI codecs
Lower Priority: 1. Overall simulation methodology description (after completion) 2. Definition of "DSP enabled UE devices"
Current Version: 0.45.0 (SA4#135, February 2026)
Recent Updates: - Added 10-degree channel model parameters - Updated simulation parameters per multiple agreed TDOCs - Added company simulation results reporting - Clarified voice packet transport over user plane
Working Status: - Active study phase - Collecting simulation results from companies - Coordinating with RAN and SA2 for parameter confirmation - Developing design constraints and performance requirements
[FS_ULBC] WorkPlan of FS_ULBC v0.5
This document outlines the timeplan for the Feasibility Study on Ultra Low Bitrate Speech Codec (FS_ULBC). The study focuses on developing a codec for ultra-low bit rate communication services, particularly for IMS Voice Call Using GEO Access as documented in TR 22.887.
The FS_ULBC study has nine main objectives:
Multiple telcos scheduled to: - Progress GEO channel characteristics study - Perform RAN-related simulations within SA4 - Align on RAN link-level simulations - Power to send LS to SA2 and RAN WGs if needed
Major milestone meeting: - Finalize: - GEO channel characteristics study - Coordination with other WGs - Reference codec identification - Design constraints: bit rates, sample rate, audio bandwidth, frame length, PLC, noise suppression, DTX - Progress: - Feasibility evidence - Objective measures development - Design constraints: complexity, algorithmic delay, speech quality, robustness - Performance requirements and test methodologies - Start defining potential normative work item objectives and timeline - If time permits: Finalize additional application scenarios
Final study meeting: - Finalize: - Design constraints: speech quality - Performance requirements and test methodologies (clean/noisy speech, tandeming, clean/GEO channel conditions) - Potential normative work item objectives and timeline
[FS_ULBC] On Assumptions and Open Issues for NB-IoT GEO Simulation
This contribution from China Mobile addresses outstanding assumptions and open issues for NB-IoT GEO satellite simulation work within the ULBC (Ultra-Low Bitrate Codec) study. The document consolidates discussions from multiple Audio Ad-hoc meetings (June 4, June 17, and July 11) and proposes updates to TS 26.940 clause 5.2.2.4.
The document provides a comprehensive status table tracking 11 key simulation issues, with updates on their resolution status:
Note: 37 dB is under study in ongoing RAN work
Latitude-Dependent Loss (Issue 2):
Additional note: New 10-degree channel model introduced, may increase feasible TBS
Elevation Angles (Issue 3):
Resolution: Simulation accounts for elevation angles using X term
Simulation Channel Model (Issue 8):
Resolution: NTN-TDL-C is used
Repetition Numbers (Issue 10):
Still Pending: Overhead for User Plane (IP vs. Non-IP) needs RAN confirmation
RX G/T for Downlink (Issue 11):
Status: Needs RAN confirmation on feasibility
Determine Candidate TBS Values (Issue 5):
Status: Unsolved, requires further verification
Approaches to Select TBS (Issue 6):
Overall Simulation Methodology Description (Issue 7):
The document proposes to: 1. Update the P-doc (TS 26.940) based on the status updates provided 2. Continue tracking these issues until full resolution
The document highlights several dependencies on other working groups: - RAN4: UE power class confirmation - RAN: UL/DL guard time feasibility, protocol overhead confirmation - SA2: Protocol overhead for different transport configurations
[FS_ULBC] Updates of the permanent document based on 3GPP TR 23.700-19
This contribution updates the FS_ULBC (Ultra Low Bitrate Speech Codec) Permanent Document to align with SA2 conclusions on Key Issue #1 regarding IMS voice call support over NB-IoT via GEO satellite connecting to EPC, as documented in TR 23.700-19.
The document adds critical new references to align with recent 3GPP work:
The document introduces significant modifications to the end-to-end simulation model:
Based on SA2 conclusions in TR 23.700-19:
Key parameters updated for GEO scenarios:
Two protocol overhead scenarios illustrated:
Editor's Note: Exact overhead for UDP/IP (SA2 scope) and RTP (SA4 scope) for the removal/restoration mechanism requires determination.
| Issue | Resolution | |-------|-----------| | Latitude-Dependent Loss | Simulation accounts for latitude-dependent scintillation loss using X term (2.2 dB or 0 dB beyond ±20° latitude per TR 38.821) | | Elevation Angles | Both 2.3° and 12.5° angles considered using X term for worst-case scenarios | | Simulation Channel Model | NTN-TDL-C selected | | Repetition Numbers | Specified and reported in simulation |
Based on SA2 agreements:
The document identifies several inter-working group dependencies:
Two critical editor's notes remain:
[FS_ULBC] Considerations for ULBC Codec Selection Process
This document appears to be a presentation or discussion paper on the ULBC (Ultra Low Bitrate Codec) codec selection process. However, the provided content is fragmentary and mixes English and Chinese, making comprehensive technical analysis challenging.
The document's primary focus is considerations for selecting codecs for ULBC. Specific technical criteria, evaluation methodologies, and selection parameters are not detailed in the provided content.
The document includes an "Open Questions" section, indicating ongoing discussions and unresolved technical issues. However, the specific questions are not provided in the extracted content.
Due to the fragmentary nature of the document provided: - Specific codec selection criteria are not detailed - Technical evaluation parameters are missing - Comparison methodologies between candidate codecs are not present - Detailed architectural proposals are not included - Specific agreements or decisions are not documented
Note: This summary is based on fragmentary content with significant portions in template format or non-English text. A complete technical analysis would require the full document with all technical details, agreements, and proposals.
[FS_ULBC] Analyzing semantic intelligibility in lossy coded audio signals
This contribution presents experimental evaluation results focusing on semantic intelligibility of audio codecs under Ultra Low Bitrate Speech Codec (ULBC) constraints for GEO satellite communications. The primary objective is to quantify semantic preservation (a listener's ability to accurately understand spoken content) using Automatic Speech Recognition (ASR) Word Error Rate (WER) as a proxy metric, rather than traditional perceptual quality (MOS) metrics.
The study evaluates: - Descript Audio Codec (DAC) - AI-based codec - Enhanced Voice Services (EVS) codec - 3GPP standard reference
The analysis specifically investigates whether higher audio bandwidths (wideband vs. narrowband) improve or reduce intelligibility at very low bitrates, providing data-driven guidance for audio bandwidth design constraints and quality floor determination.
The ULBC study item targets voice over GEO satellite communications where balancing audio quality, robustness, and bit-efficiency is critical. At extremely low bitrates (below 3 kbps, down to ~1 kbps), a fundamental trade-off emerges: - Wideband audio (16 kHz) offers naturalness and perceptual quality - Bit allocation challenge: Allocating scarce bits to higher frequencies reduces the budget for the core speech spectrum, potentially introducing artifacts that outweigh the bandwidth benefits
For emergency rescue operations, semantic intelligibility is the highest priority. Key considerations include: - Wideband generally improves comfort and speaker identification, but its impact on speech understanding in "last resort" scenarios requires verification - System interoperability with legacy endpoints (PSTN, GSM fallback) remains important in remote areas - Need to balance modern expectations with legacy requirements and emergency scenarios
EVS serves as a quality anchor and concrete standardized baseline for semantic preservation, enabling: - Practical quality floor definition for ULBC - Comparison against established carrier-grade standards - Isolation of bandwidth choice impact independent of codec architecture
DAC model: Evaluated at three sampling rates - 16 kHz - 24 kHz - 44 kHz
EVS codec: Evaluated in standard modes - Narrowband (NB) - Wideband (WB)
Baseline: Uncompressed PCM audio (resampled from 48 kHz to NB and WB)
WER (Word Error Rate): Lower percentage indicates better performance. Log-scale visualization employed to distinguish performance differences in the 3-5% WER range.
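WER is the standard word-level edit-distance metric. A self-contained sketch of how it is computed (the example sentences are ours, chosen to show one deletion plus one substitution; this is the textbook definition, not the contribution's specific ASR pipeline):

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length,
    via word-level Levenshtein dynamic programming, as a percentage."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution / match
    return 100.0 * d[len(ref)][len(hyp)] / len(ref)

# One deletion ("the") and one substitution ("ridge" -> "bridge") in 6 words
wer = word_error_rate("send rescue to the north ridge",
                      "send rescue to north bridge")
```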
Key Findings: - DAC achieves high efficiency at low bitrates (~2 kbps) - WER drops rapidly as bitrate increases, stabilizing around 3-4% - At 1.5 kbps: WER approximately 5.5% - Significant improvement observed in 1.5-3.0 kbps range
Bandwidth Impact at Low Bitrates: - At low bitrates (1.5 kbps and 3 kbps), 16 kHz model outperforms 24 kHz model - With constant model size, 16 kHz model allocates more bits per spectral unit within narrower band - Results in better semantic preservation vs. 24 kHz model suffering from bit starvation
A dedicated 8 kHz sampling rate model was trained to investigate bandwidth impact at the lower bitrate bound.
Model Configuration: - Sample rate: 8000 Hz - Encoder rates: [2, 4, 4, 8], dimension: 64 - Decoder rates: [8, 4, 4, 2], dimension: 1536 - Quantization: 6 codebooks, size 1024, dimension 36 - Training: 200,000 steps on VCTK corpus
Critical Findings at Sub-1.5 kbps: - At ~1 kbps: - 8 kHz model (938 bps): WER 5.86% - 16 kHz model (1000 bps): WER 11.23% - Semantic penalty > 5 percentage points when forcing WB at 1 kbps
Conclusion: At sub-2 kbps bitrates, available bit budget is insufficient to support wider bandwidth without degrading core spectral content required for intelligibility. Native Narrowband mode allows high-precision bit allocation to fundamental frequencies (0-4 kHz), preserving semantic content more effectively.
Competitive Advantage: - DAC achieves comparable WER scores at significantly lower bitrates than EVS - DAC 16 kHz performance curves converge towards high-quality PCM baselines faster than traditional codecs
ULBC Application: For GEO scenarios in [1-3] kbps range, semantic preservation is critical for defining quality floor.
Performance at Different Bitrates (EVS NB vs. WB): - At 5.9-8.0 kbps: Both modes provide sufficient basic audio quality - At 9.6 kbps: Nearly identical, indicating a saturation point in semantic quality
Conclusion: For semantic understanding, NB bandwidth limitation is less critical than codec's bit allocation efficiency.
Methodology: Calculated WER Degradation = (WER_coded - WER_baseline) / (100 - WER_baseline) to isolate codec processing impact from ASR model variance.
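The degradation formula above normalizes coding loss against the ASR model's own error floor. A minimal sketch (the 3.0% baseline and 5.5% coded WER are hypothetical figures, not the contribution's data):

```python
def wer_degradation(wer_coded, wer_baseline):
    """Normalized degradation from the methodology above:
    (WER_coded - WER_baseline) / (100 - WER_baseline),
    i.e. the fraction of the remaining accuracy headroom lost to coding."""
    return (wer_coded - wer_baseline) / (100.0 - wer_baseline)

# Hypothetical figures: a 3% ASR error floor and 5.5% WER after coding
deg = wer_degradation(5.5, 3.0)
```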
Key Findings: - Semantic loss introduced by EVS in both NB and WB modes is minimal - Degradation metric confirms that pure coding loss of NB and WB is statistically indistinguishable when subtracting baseline PCM variance - Additional frequency content in wideband contributes negligible semantic information for machine understanding compared to core NB spectrum
Strategic Implication: Robust NB mode is sufficient for intelligibility requirements of critical last resort communications, without bit starvation risk associated with wider bandwidths at low bitrates.
Strategic Conclusions for ULBC Design:
Narrowband is superior design choice at lowest bitrates, allowing encoder to focus bits on basic voice quality foundation
AI-based codec sampling rate optimization:
Proposal: Include relevant content from Sections 3, 4, and 5 into TR 26.940, capturing: - Methodology - Experimental setup - Analysis of results concerning audio bandwidth impact on semantic intelligibility
Complete experimental data provided in appendix tables covering:
- Table 1.a: DAC Model Results (16/24/44 kHz) across bitrates 500-7751 bps
- Table 1.b: DAC NB Model Results (8 kHz) across bitrates 312-1875 bps
- Table 2: EVS & PCM Baseline Results for NB/WB modes at 5900-13200 bps
[FS_ULBC] pCR on Existing codec technologies
This pCR proposes updates to Clause 7.1 of TR 26.940, which documents existing codec technologies for evidence that design criteria can be met and for comparison/evaluation purposes. The document adds information about recently emerged ultra-low bit-rate voice codecs (below 1 kbps) as reference for further work.
The pCR significantly expands Table 7.1.1-1 "List of existing codec technologies" by adding multiple categories of codecs beyond the existing 3GPP IMS codecs. The table includes the following parameters for each codec:
These codecs support real-time operation: - LPCNet: 1.6 kbps, WB, 40ms frames, 25ms delay - LyraV2 (SoundStream): 3.2-9.2 kbps, WB, 20ms frames - EnCodec: 1.5-24 kbps, 24kHz/FB, 0-1000ms delay, 13.3ms frames - Mimi-Codec: 0.55-1.1 kbps, 24kHz, 80ms frames, 0ms delay - TS3: 0.64-0.8 kbps, WB, 20ms frames, 0ms delay - TAAE: 0.4-0.7 kbps, WB, 20-40ms frames, 0ms delay - LMCodec2: Parameters TBD
These codecs are designed for offline/non-real-time applications:
- DAC: 0.5-3 kbps, WB/24kHz, 244-366ms delay
- DAC-IBM: 0.75-3 kbps, 24kHz, 366ms delay
- SNAC: 0.98 kbps, 24kHz, 1000ms delay, 80ms frames
- SpeechTokenizer: 0.5-1.0 kbps, WB, full-signal delay
- SemantiCodec: 0.31-1.4 kbps, WB, 10-40ms frames, full-signal delay
- FunCodec: 0.25-1.0+ kbps, WB, 20-40ms frames
- WavTokenizer: 0.25-0.9 kbps, 24kHz, 25-40ms frames
- BigCodec: 1.04 kbps, WB, 12.5ms frames
- FocalCodec: 0.16-0.65 kbps, WB, 20-80ms frames
- ALMTokenizer: 0.41 kbps, WB, 13.3ms frames
- XY-Tokenizer: 1 kbps, WB, 20ms frames
- LongCat-Audio-Codec: 0.43-0.87 kbps, WB, 60ms frames
- AcademiCodec: Parameters TBD
- MuCodec: 0.35-1.35 kbps, FB
The pCR includes several important notes:
An editor's note indicates that more codecs may be added to the table in future revisions.
The pCR demonstrates significant industry progress in ultra-low bitrate speech coding, particularly:
- Multiple AI-based solutions achieving sub-1 kbps bitrates
- Wide range of delay characteristics (0ms to 1000ms)
- Various bandwidth support (NB to FB)
- Different availability levels for specifications and software implementations
[FS_ULBC] Analysis of AI Codec Real-Time Performance (RTF) and Complexity Scaling
This contribution addresses a critical gap in the Ultra Low Bitrate Speech Codec (ULBC) study by moving beyond theoretical complexity metrics (FLOPs, WMOPS) to evaluate real-world performance on mobile devices. The key observation is that static metrics fail to capture system-level bottlenecks including memory bandwidth pressure and thermal constraints on mobile SoCs. The document presents a comprehensive RTF analysis of a neural audio codec (based on Descript Audio Codec architecture) across multiple model sizes and sample rates on representative mid-range mobile hardware.
Eight model variants were evaluated, ranging from enc8dec144 to enc64dec1536, with parameter counts spanning 1M to 74M:
Key complexity observations from Table 1:
- Parameter counts range from 1.09M (enc8dec144) to 74.50M (enc64dec1536)
- Model sizes range from 4.3 MB to 283.6 MB
- Computational complexity scales proportionally with sample rate (e.g., enc32dec768: 4955.9 MFlops/s @ 8 kHz, 9972.6 MFlops/s @ 16 kHz, 20006.1 MFlops/s @ 32 kHz)
Critical finding: For a given model variant, computational complexity scales linearly with sample rate. For the enc32dec768 variant:
- 8 kHz: ~0.20 GFLOP counts (4955.9 MFlops/s)
- 16 kHz: ~0.40 GFLOP counts (9972.6 MFlops/s), a 2x increase
- 32 kHz: ~0.80 GFLOP counts (20006.1 MFlops/s), a 4x increase
Implication: Higher sampling rates incur proportional computational penalty. For resource-constrained devices (IoT, wearables), NB mode at 8 kHz is recommended.
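The linear-scaling claim can be sanity-checked directly from the reported MFlops/s figures. The sketch below (the helper name is illustrative, not from any cited tool) predicts the 16 kHz and 32 kHz values from the 8 kHz measurement to within about 1%:

```python
# Sanity check: for a fixed model (enc32dec768), per-second complexity
# scales linearly with sample rate. The measured values are the MFlops/s
# figures reported in this contribution; predict_mflops is illustrative.

def predict_mflops(base_mflops: float, base_rate: int, target_rate: int) -> float:
    """Linear scaling of per-second complexity with sample rate."""
    return base_mflops * target_rate / base_rate

measured = {8_000: 4955.9, 16_000: 9972.6, 32_000: 20006.1}  # MFlops/s

for rate, mflops in measured.items():
    predicted = predict_mflops(measured[8_000], 8_000, rate)
    assert abs(predicted - mflops) / mflops < 0.01  # within ~1%
```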
Energy-conserving state with severe constraints:
Typical sustained workload state:
High-performance state approaching sustained limits:
Key observation: Inverse relationship between sample rate and model size capacity is consistently demonstrated.
Analysis at peak locked frequencies establishes absolute upper bounds:
Even at peak frequency, the A55 remains highly constrained. Models exceeding ~5M parameters (enc16dec384) fail real-time constraints at 8 kHz and above; the core is unsuitable for large weight matrices.
Most relevant benchmark for ULBC - represents sustained compute capability of modern mobile devices.
Critical "Complexity vs. Bandwidth" trade-off identified:
Results mirror A78 trends with slight improvements due to higher clock frequency. The bandwidth bottleneck remains dominant - higher clock speed provides safety margin for borderline models (e.g., enc24dec576 @ 32kHz) but doesn't fundamentally shift feasible model size category.
Established precise inverse relationship: halving sample rate approximately doubles feasible parameter count on performance cores:
- 32 kHz → 10M parameters
- 16 kHz → 20M parameters
- 8 kHz → 39M parameters
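The inverse rule above can be written as a one-line relation: the feasible parameter count roughly doubles each time the sample rate is halved. The 32 kHz anchor (10M parameters) is taken from this contribution; the helper itself is an illustrative sketch, not a measured model:

```python
# Sketch of the reported inverse relationship between sample rate and
# feasible model size on an A78-class performance core. The 32 kHz
# anchor point (10M parameters) comes from this contribution.

def feasible_params_m(sample_rate_hz: int) -> float:
    """Feasible parameter count (millions) under the inverse-scaling rule."""
    ANCHOR_RATE, ANCHOR_PARAMS_M = 32_000, 10.0
    return ANCHOR_PARAMS_M * ANCHOR_RATE / sample_rate_hz

assert feasible_params_m(32_000) == 10.0
assert feasible_params_m(16_000) == 20.0
assert feasible_params_m(8_000) == 40.0  # reported: ~39M, so approximate
```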
Provided concrete RTF measurements across representative mobile hardware configurations, revealing that:
- Theoretical complexity metrics (FLOPs) don't capture real-world bottlenecks
- Memory bandwidth and thermal constraints significantly impact feasibility
- Efficiency cores (A55) are unsuitable for neural codec workloads beyond minimal complexity
Identified 10M parameter hard limit for 32 kHz operation on mid-range mobile devices (A78 @ 2.6 GHz), providing concrete guidance for ULBC candidate selection.
The contribution proposes including these RTF analysis findings in TR 26.940 to inform complexity constraint selection for ULBC candidates, moving the standardization process toward real-world deployability considerations rather than purely theoretical metrics.
[FS_ULBC] Discussion on Audio Bandwidth for ULBC
This contribution addresses audio bandwidth design constraints for the Ultra-Low Bitrate Codec (ULBC), targeting primarily voice over GEO satellite communications. The document argues against mandatory Wideband (WB) and Super-Wideband (SWB) support, proposing instead that Narrowband (NB) should be mandatory with WB as an enhancement.
Current Network Reality:
- 2G/3G connections (primarily AMR-NB) still represent 20% of the global technology mix (end of 2023)
- Regional variations: 81% in Sub-Saharan Africa, 46% in Middle East and North Africa
- NB serves as a universal fallback for interoperability (CS fallback scenarios)
System Inefficiency Without NB Mode:
- WB ULBC to NB user calls waste the upper frequency band (4-8 kHz)
- Significant bitrate is wasted transmitting data the recipient cannot hear
- Over an expensive, scarce satellite link, this inefficiency is unacceptable
- A native NB mode provides the most efficient solution for legacy network connectivity
Baseline Expectation Setting:
- A GEO call is the final option after terrestrial network failure
- Users typically experience AMR-NB fallback before resorting to GEO
- ULBC must be at least as reliable as the NB fallback to meet user expectations
- A WB-only ULBC failing in conditions where NB would work represents a service failure
Typical Deployment Scenario:
- Rescue teams in remote areas (e.g., Himalayan mountains)
- Mixed-connectivity environment:
  - Squad A: GEO-only (outside TN coverage)
  - Squad B: GSM fallback at coverage fringe
  - Base Camp: PSTN connection (NB service)
Technical Implications:
- Terminating endpoints are predominantly NB
- Emergency systems use traditional NB codecs (Codec2, MELP) for robustness
- Transmitting WB over satellite to an NB endpoint wastes critical resources in life-or-death situations
- A real-world deployment example is provided (China rescue missions)
Evaluation Priority: - ULBC candidates should prioritize intelligibility and robustness testing in NB mode
Quality vs. Bandwidth Trade-off:
- Forcing wider bandwidth at very low bitrates spreads the available data too thinly
- Research shows lower sampling rates can achieve higher perceptual quality at very low bitrates
- A WB codec at ~1 kbps may compromise intelligibility, especially with packet loss
- An NB signal is more robustly reconstructed under constrained conditions
Analogy: "Spreading butter" - concentrating bits on narrower bandwidth preserves speech richness and intelligibility
Computational Scaling Issues:
- AI-based codec architectures don't scale gracefully
- Doubling the sampling rate (NB to WB) brings a 2x to 4x complexity increase for CNN/Transformer models
- A WB-only mandate imposes an unnecessary computational burden
- This is a critical issue for power-constrained mobile devices
- A native NB mode offers high-quality voice at a significantly lower complexity/power budget
Test Configuration:
- Codec: Descript Audio Codec (DAC) with pre-trained models
- Sampling rates tested: 44.1 kHz, 24 kHz (SWB), 16 kHz (WB)
- Test corpus: 100 clean speech samples from the MS-SNSD dataset
- Bitrate variation: 1-9 active quantization codebooks
- Quality metric: ViSQOL algorithm (speech mode, MOS estimate)
Model Specifications:
| Model | Compression | Frame Rate | Codebooks | Bitrate/Codebook |
|-------|-------------|------------|-----------|------------------|
| 16 kHz (WB) | 320x [2,4,5,8] | 50 Hz | 12 (10-bit) | 0.50 kbps |
| 24 kHz (SWB) | 320x [2,4,5,8] | 75 Hz | 32 (10-bit) | 0.75 kbps |
| 44.1 kHz | 512x [2,4,8,8] | ~86.1 Hz | 9 (10-bit) | ~0.86 kbps |
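The frame rates and per-codebook bitrates in the specification table follow arithmetically from the sampling rate, the total stride, and the 10-bit codebooks. A brief check (helper names are illustrative):

```python
# Frame rate = sample rate / total stride; each 10-bit codebook then
# costs frame_rate * 10 bits per second.

def frame_rate_hz(sample_rate: float, total_stride: int) -> float:
    return sample_rate / total_stride

def kbps_per_codebook(frame_rate: float, bits: int = 10) -> float:
    return frame_rate * bits / 1000.0

assert frame_rate_hz(16_000, 320) == 50.0        # WB model
assert kbps_per_codebook(50.0) == 0.5            # 0.50 kbps per codebook
assert frame_rate_hz(24_000, 320) == 75.0        # SWB model -> 0.75 kbps
assert round(frame_rate_hz(44_100, 512), 1) == 86.1
```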
Quality vs. Bitrate Results:
- WB (16 kHz): achieves excellent quality (ViSQOL MOS > 4.0) at ~2.5 kbps
- 24 kHz SWB: requires a higher bitrate to match WB quality
- 44.1 kHz: provides minimal perceptible improvement over 24 kHz SWB
- Conclusion: the bitrate cost of SWB is not justified by the quality improvement for voice content
Efficiency Analysis: - Clear trend: diminishing returns for bandwidth beyond WB - SWB/FB represents inefficient use of bandwidth for ULBC service
Mandatory Support:
1. 8 kHz sampling rate (NB): 50-4000 Hz audio bandwidth
2. 16 kHz sampling rate (WB): 50-8000 Hz audio bandwidth
   - Enhanced quality where channel conditions and device capabilities permit
   - WB support can be limited to higher bitrates than NB operation
Further Study: - Necessity and feasibility of SWB and FB support remains FFS
Change to Table 6.2-1 (Design Constraint Parameters):
Sample rate and audio bandwidth:
- The ultra low bitrate codec shall support sampling rates of 8 kHz (NB) and 16 kHz (WB)
- Supported audio bandwidth:
  - NB: 50-4000 Hz
  - WB: 50-8000 Hz
Quantitative Data:
- 20% of global connections are 2G/3G (hundreds of millions of users)
- Regional NB dominance: up to 81% in some areas
- WB achieves MOS > 4.0 at 2.5 kbps
- 2x-4x complexity increase for WB vs. NB in AI codecs
Qualitative Arguments:
- System efficiency (no wasted bandwidth to NB endpoints)
- User expectation alignment (last-resort reliability)
- Emergency use case requirements
- Computational/power constraints for mobile devices
- Diminishing returns for SWB/FB at target bitrates
[FS_ULBC] Analysis of AI Codec Complexity Scaling
This contribution addresses the need for establishing relevant complexity evaluation methods for the new ULBC codec standardization. Previous contributions (e.g., S4aA250264) highlighted potential gaps between theoretical complexity metrics (FLOPs) and practical on-device performance (Real-Time Factor).
This document provides a complementary analysis focusing on how complexity metrics scale with AI model architecture itself. The analysis investigates the relationship between model architecture, theoretical complexity, and traditional metrics using the publicly available DAC codec as a test case.
The analysis created seven "dummy" model variants based on the open-source DAC codec's 16kHz configuration. The approach:
Decoder rates: [8, 5, 4, 2]
Scaling Approach:
- encoder_dim and decoder_dim were modified
- Frame size: 20 ms (320 samples at 16 kHz)
Variant Configurations:
Complexity Metrics Measured:
Implementation Notes:
- Each AI operation implemented in pure C
- Source files annotated and compiled using wmc_tool
- WMOPS is highly sensitive to C implementation efficiency
- Naive implementations can yield significantly higher counts than optimized versions
Key Findings:
Key Finding: Clear relationship between AI model size (in millions of parameters) and traditional WMOPS complexity.
Observations on DAC Model:
Complete complexity metrics for all seven DAC variants (16kHz, 20ms frame):
| Variant | Enc Dim | Dec Dim | Params (M) | GFLOP counts | MFLOP/s | WMOPS Enc | WMOPS Dec |
|---------|---------|---------|------------|--------------|---------|-----------|-----------|
| enc8dec144 | 8 | 144 | 1.09 | 0.009 | 437.09 | 333.92 | 760.53 |
| enc12dec288 | 12 | 288 | 2.89 | 0.028 | 1397.63 | 648.23 | 2732.96 |
| enc16dec384 | 16 | 384 | 4.94 | 0.050 | 2481.98 | 1060.79 | 4724.38 |
| enc24dec576 | 24 | 576 | 10.76 | 0.112 | 5578.38 | 2228.92 | 10399.00 |
| enc32dec768 | 32 | 768 | 18.90 | 0.198 | 9911.72 | 3693.56 | 18093.30 |
| enc40dec960 | 40 | 960 | 29.34 | 0.310 | 15482.00 | 5599.48 | 28019.70 |
| enc64dec1536 | 64 | 1536 | 74.50 | 0.792 | 39614.50 | 13675.30 | 70766.69 |
Data demonstrates rapid scaling of all metrics as encoder and decoder dimensions increase.
Based on the DAC model variant analysis:
Linear Relationship: For the DAC model, there is a clear linear relationship between Theoretical Complexity (MFLOP/s), Model Parameters, and measured WMOPS. As MFLOP/s or parameter count increases, WMOPS increases linearly, provided C coding style remains consistent.
Quadratic Growth: Increasing the model's internal dimensions causes complexity to grow quadratically. Even small dimension increases lead to disproportionately large jumps in MFLOP/s and WMOPS.
Implementation Dependency: WMOPS score depends heavily on source C code efficiency.
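The quadratic growth noted above follows from basic layer arithmetic: the weights of a 1-D convolution (or dense layer) scale with channels_in × channels_out, so scaling both internal dimensions by a factor s scales parameters and MACs by roughly s². A small illustrative check against the variant table (the kernel size 7 is an assumption for illustration, not taken from the DAC source):

```python
# Doubling both channel dimensions of a conv layer quadruples its weights,
# matching the observed jump from enc32dec768 to enc64dec1536.

def conv1d_params(c_in: int, c_out: int, kernel: int) -> int:
    """Weight count of a 1-D convolution (bias ignored)."""
    return c_in * c_out * kernel

small = conv1d_params(32, 768, 7)
large = conv1d_params(64, 1536, 7)
assert large == 4 * small  # doubling both dims quadruples the layer

# The reported totals behave the same way (boundary layers keep the
# ratio slightly below 4): 74.50M / 18.90M ~ 3.94
assert abs(74.50 / 18.90 - 4.0) < 0.1
```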
It is proposed to capture the above analysis into 3GPP TR 26.940.
[FS_ULBC] On codec bitrate and capacity discussion for ULBC
This CR addresses the TBS (Transport Block Size) and codec bitrate values for ULBC (Ultra Low Bitrate Codec) evaluation, which are currently noted as 'companies reported' in TR 26.940 v0.4.0. The contribution provides analysis on:
- Multiplexed UE number analysis
- Confirmation of TBS/codec bitrate values
The document presents a methodology for calculating supported UE numbers considering:
- TDM (Time Division Multiplexing): both UL and DL can schedule different UEs in a TDM manner
- FDM (Frequency Division Multiplexing): UL can additionally use FDM since NPUSCH may occupy few subcarriers within the 180 kHz bandwidth
- FDM capacity: 48 UEs for 3.75 kHz SCS (single tone)
- FDM capacity: 12 UEs for 15 kHz SCS (single tone)
- Bi-directional constraint: the final supported UE number is the minimum of UL and DL capacity
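The single-tone FDM capacities quoted above follow directly from the 180 kHz NB-IoT carrier bandwidth divided by the subcarrier spacing. A one-line check (helper name is illustrative):

```python
# Single-tone FDM capacity = carrier bandwidth / subcarrier spacing.

NBIOT_BW_KHZ = 180.0

def single_tone_ues(scs_khz: float) -> int:
    return int(NBIOT_BW_KHZ / scs_khz)

assert single_tone_ues(3.75) == 48   # 3.75 kHz SCS
assert single_tone_ues(15.0) == 12   # 15 kHz SCS
```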
Analysis conducted under 50-degree elevation channel model with 2% BLER:
Key Observations:
- Observation 1: Higher UE transmit power leads to higher capacity (multiplexed UE number) for a given codec bitrate
- Observation 2: For a codec bitrate of ~3 kbps, capacity is limited to ~10 UEs with 31 dBm UE power. Capacity further degrades with increased bitrate (e.g., ≤5 UEs for 4.5 kbps)
Performance characteristics:
- 23 dBm UE power shows very poor capacity
- Performance improves with higher-power UEs (26 dBm, 31 dBm)
- Capacity increases with the ptime value
Additional analysis provided in Annex assuming 1% BLER under 10-degree elevation channel model.
The CR proposes specific TBS values selected from TS 36.213 table 16.5.1.2-2 for NB-IoT NPUSCH, with corresponding PHY bitrates and codec bitrates calculated for each bundling period (assuming 7-byte packet header).
Table 1: 80ms bundling
- TBS range: 88-256 bits
- PHY bitrate range: 1.1-3.2 kbps
- Codec bitrate range: 0.4-2.5 kbps

Table 2: 160ms bundling
- TBS range: 120-424 bits
- PHY bitrate range: 0.75-2.65 kbps
- Codec bitrate range: 0.4-2.30 kbps

Table 3: 320ms bundling
- TBS range: 208-808 bits
- PHY bitrate range: 0.65-2.52 kbps
- Codec bitrate range: 0.475-2.35 kbps
NOTE 1: Final packet header size depends on SA2 and RAN conclusions, including feasibility of 1-byte MAC header
NOTE 2: Packet header counted only once regardless of bundled voice frames
NOTE 3: Relationship between voice frame duration and bundling time depends on RTP payload design. Loss of single TB means loss of multiple consecutive voice frames when bundled.
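The PHY and codec bitrates in Tables 1-3 can be reproduced from the TBS, the bundling period, and the 7-byte header counted once per bundle. A minimal reconstruction (helper names are illustrative):

```python
# PHY bitrate = TBS / bundling period; codec bitrate subtracts the
# 7-byte packet header, which is counted once per bundling period.

HEADER_BITS = 7 * 8  # 7-byte packet header

def phy_kbps(tbs_bits: int, bundling_ms: int) -> float:
    return tbs_bits / bundling_ms          # bits per ms == kbps

def codec_kbps(tbs_bits: int, bundling_ms: int) -> float:
    return (tbs_bits - HEADER_BITS) / bundling_ms

# Table 1 (80 ms bundling): TBS 88 -> 1.1 kbps PHY, 0.4 kbps codec
assert phy_kbps(88, 80) == 1.1
assert codec_kbps(88, 80) == 0.4
# TBS 256 -> 3.2 kbps PHY, 2.5 kbps codec
assert phy_kbps(256, 80) == 3.2
assert codec_kbps(256, 80) == 2.5
# Table 2 (160 ms bundling): TBS 424 -> 2.65 kbps PHY, 2.30 kbps codec
assert phy_kbps(424, 160) == 2.65
assert codec_kbps(424, 160) == 2.3
```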
Proposal 1: Agree that codec bitrate should be bound to be less than 3kbps
Proposal 2: Agree to the proposed changes to Section 5.2.2.2 (Uplink simulation parameters) of TR 26.940, including:
- Updated TBS values and PHY bitrates tables
- Voice bundling periods: 80ms, 160ms, 320ms (40ms excluded due to insufficient time for DL transmissions with 3.75 kHz SCS)
- Target BLER values: 1%, 2%, 6%, 10%
- Maximum achievable SNR formula incorporating UE power (23/26/31 dBm), bandwidth, and antenna gain variations
The Annex provides additional multiplexed UE number analysis for different codec bitrates and UE power levels under 10-degree elevation channel model, supporting the main technical conclusions.
On ULBC complexity and RTF analysis
This contribution addresses the need to finalize complexity and memory design constraints for the ULBC (Ultra-Low Bitrate Codec) study. Previous discussions at SA4 #133-e and the ULBC ADHOC meeting explored various complexity metrics and RTF performance data for existing AI codecs (DAC, Lyra v2, HIL). However, insufficient data exists to draw definitive conclusions on complexity constraints for ULBC.
The document builds upon previous contribution S4-251844 with the following modifications:
- Added CPU core information for experiments
- Aligned RTF definition with TR 26.940 clause 7.5.3
- Focused on model sizes of 3-20M parameters (more relevant to ULBC use cases)
- Provided a pCR for TR 26.940
- Removed large chunk-based processing experiments (not relevant for real-time voice communication)
Modified DAC architecture with reduced parameters while maintaining the general structure:
- Model sizes: 20M, 15M, 9M, and 3M parameters (float32 precision)
- Training: optimized for ~1 kbps bitrate at 32 kHz sampling rate
- Encoder rates: 4,4,8,10 for all models
Theoretical Complexity (GMACS):
- Computed using the ptflops library
- Results show a linear relationship between model size and GMACS:
  - 20M: 5.14 GMACS
  - 15M: 4.03 GMACS
  - 9M: 2.39 GMACS
  - 3M: 0.79 GMACS
Device 1 (2023):
- Hexa-core CPU: 2×3.46 GHz (P core) + 4×2.02 GHz (E core)
- Dynamic core switching observed between P and E cores

Device 2 (2022):
- Octa-core CPU: 1×3.00 GHz Cortex-X2 + 3×2.50 GHz Cortex-A710 + 4×1.80 GHz Cortex-A510
- Processing on Cortex-X2 with frequency switching between 2.4 GHz and 1.8 GHz
| Model Size | Max RTF (High Performance) | Max RTF (Power Efficient) |
|------------|---------------------------|---------------------------|
| 20M | 0.39-0.63 | 0.81-0.9 |
| 15M | 0.29-0.43 | 0.66-0.74 |
| 9M | 0.19-0.29 | 0.44-0.57 |
| 3M | 0.09-0.13 | 0.18-0.31 |
Results demonstrate linear increase in RTF with model size across both performance modes.
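RTF here is understood as processing time divided by audio duration, so real-time operation requires RTF < 1 (the exact definition used follows TR 26.940 clause 7.5.3). A minimal measurement sketch, using a stand-in workload since no codec model is bundled here; all names are illustrative:

```python
# Measure RTF = wall-clock processing time / duration of audio processed.
# process_frame is a stand-in for one 20 ms frame of codec processing.

import time

def measure_rtf(process_frame, frame_s: float = 0.020, n_frames: int = 100) -> float:
    start = time.perf_counter()
    for _ in range(n_frames):
        process_frame()
    elapsed = time.perf_counter() - start
    return elapsed / (n_frames * frame_s)

# Example with a dummy workload (a real measurement would call the
# codec's per-frame inference here):
rtf = measure_rtf(lambda: sum(range(1000)))
assert rtf >= 0.0
```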
The contribution provides a comprehensive pCR adding new clause 6.2.1.7 "RTF and MACS analysis for AI based codecs" with detailed experimental results. Key additions to TR 26.940 include:
Document the experimental methodology, results, and observations in clause 6.2.1 of TR 26.940 as shown in the provided pCR.
[FS_ULBC] Discussion on Methodology for Delay & Error Trace Generation
This contribution addresses the ongoing debate within SA4 Audio SWG regarding the methodology for generating delay and error traces for Ultra Low Bitrate Codec (ULBC) evaluation under Non-Terrestrial Network (NTN) conditions. Two competing approaches have emerged:
The contribution proposes clarifying the purpose of these simulations by distinguishing between Design and Verification phases.
The LTE MTSI testing methodology in TS 26.132 (Annex E and F) operated on "Stationary" conditions:
Packets were dropped using simple random logic of the form `if (rand(1) < BLER_tx)`. Critically, TS 26.132 defined these traces as verification tools (System Testing):
Key Finding: Profiles were treated as Test Vectors to verify robustness against defined impairments, not as "realistic channel recordings" to train codec design.
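The stationary drop logic quoted above amounts to an independent Bernoulli trial per packet. A minimal trace-generation sketch (seeded for reproducibility; names are illustrative, not from TS 26.132):

```python
# Generate a stationary error trace: each packet is lost independently
# with probability equal to the target BLER.

import random

def generate_error_trace(n_packets: int, bler: float, seed: int = 0) -> list[int]:
    """1 marks a lost packet, 0 a received one."""
    rng = random.Random(seed)
    return [1 if rng.random() < bler else 0 for _ in range(n_packets)]

trace = generate_error_trace(10_000, bler=0.02)
loss_rate = sum(trace) / len(trace)
assert abs(loss_rate - 0.02) < 0.01  # empirical rate near the target BLER
```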
NTN scenarios introduce challenges that invalidate the LTE approach:
Methodology:
- Define TBSs for each candidate bitrate and bundling time
- Traverse all link parameters (SCS, Tone, etc.) to evaluate whether the resulting link budgets satisfy the predefined Target BLER
- Generate an error trace for each configuration meeting the BLER threshold
- Number of output traces = number of defined Target BLERs (for each TBS)
Underlying Assumption: AI-based Codecs (specifically PLC mechanisms) require specific "real" error patterns during training/design phase
Observation: Limits testing scope to specific "safe" operating points, potentially overlooking codec behavior under unexpected channel degradation
Methodology:
- Normalize TBS across all candidate codec bitrates assuming consistent packet overhead
- For each unique Link Budget (fixed SNR) derived from specific UE, satellite, and link parameters, generate dedicated error traces
- Number of output traces = number of unique Link Budgets (for each TBS)
Underlying Assumption: Mimics "Best Effort" or competitive scenario similar to EVS selection, where end-to-end quality (MOS) matters more than intermediate BLER
Observation: Logically sound for optimizing system performance, but implies vast search space potentially leading to unmanageable simulation workload
The standard workflow should be:
Delay/Error Profiles Generation → Codec/PLC Verification → System Performance Evaluation
The current deadlock stems from treating RAN simulation outputs as Design Constraints (training data) rather than Verification Tools.
Key Principles:
Robustness over Overfitting: Robust Codec and PLC design should not rely on "learning" a specific channel trace from a specific simulator. Design should handle variety of harsh conditions (burst losses, high jitter, varying BLER). Data augmentation is standard practice for training robust AI models.
The Role of Traces: As in TS 26.132 Annex F, generated traces serve as "Test Vectors" defining challenging conditions under which the Codec must survive. Whether traces represent 90% or 99% of real-world cases is secondary to sufficiently stress-testing JBM and PLC algorithms.
Historical Practice: Delay/Error profiles officially generated by SA4 were never distributed to codec proponents for training purposes; they were solely used to verify codec candidates fulfill design constraints and performance requirements.
Re-orient simulation efforts towards generating a Verification Suite rather than a "Perfect Reality Model":
Avoid Excessive "Realism" Filtering: Do not discard simulation results simply because they don't meet strict low-BLER threshold. High BLER conditions are valid "Corner Cases" that ULBC must handle, especially in satellite scenarios with tight link budgets.
Limit the Search Space: Select representative subset of challenging conditions (e.g., Deep Fading, High Doppler) at fixed SNR points resulting in range of BLERs (e.g., from <1% up to >10%).
Verification Focus: Output traces should verify candidate codecs degrade gracefully under varied conditions. Burden is on Codec proponent to design PLC that works across these profiles, not on RAN simulation group to provide "training set" guaranteeing codec works.
The MFTG methodology aims to decouple physical layer simulation assumptions from application-layer codec design by providing a high-resolution library of error traces rather than a single static operating point.
For Performance Comparison: Proponents selecting specific source bitrate can identify and utilize trace from library whose SNR/BLER most closely matches their design's intended link budget
For Robustness Testing: Proponents can select "stress-test" traces (e.g., those with higher BLER or specific jitter profiles) from same library to verify PLC and JBM algorithms
While the source understands the rationale behind both the Fixed BLER approach and the Fixed Resource / Link Budget approach for GEO network simulation, a compromise solution is necessary for FS_ULBC to progress. MFTG is therefore proposed for consideration and agreement.
[FS_ULBC] Proposed ULBC design constraints living document
This living document consolidates design constraints being considered within SA4 for FS_ULBC (Feasibility Study on Ultra-Low Bitrate Codec). Due to the working procedure requiring consensus agreements for design constraints to be integrated into ULBC PD or TR 26.940, and the lack of such consensus so far, this document captures the current status of design constraints even though some items are not fully agreed.
Design Constraint: Support of [8, 16, 32] kHz / [NB, WB, SWB] required [1], [2]
Editor's Notes and Open Issues:
- Support of 8 kHz justified for interoperability; clarification needed on whether NB would be tested/supported "externally" based on external resampling
- Support of 48 kHz may be considered at higher bitrate operation
- Consideration of at least a single model (e.g., SWB)
- Many neural codecs operate at 24 kHz; this specific sampling rate should be discussed
- Complexity considerations associated with this parameter; joint decisions may be needed
Reference: NB audio typically sampled at 8 kHz (100-3500 Hz), WB at 16 kHz (50-7000 Hz), SWB at 32 kHz (50-14000 Hz), FB up to 20000 Hz
Design Constraint: ULBC candidate codecs shall support mono coding with one channel input and one channel output
Design Constraint: ULBC candidate codecs shall operate at bitrates lower than [3.00] kb/s [3]
Design Constraint: Candidate codecs shall operate with a coding frame size of multiple of 20 ms
Note: Since larger than 20ms bundling time periods will be used, codec proponents should be allowed to consider solutions with larger than 20ms frame sizes
Design Constraint: Algorithmic delay shall be less than [coding frame size + x] ms
Design Constraint: Complexity limits applied according to categories. Computational complexity and program ROM (PROM) of candidate codecs for each category shall be measured with ITU-T STL2009 [1] as observed worst-case encoder + observed worst-case decoder complexity within the same category [5], [6]
Categories:
- Computational: wMOPS: less than [x] wMOPS
- Memory: RAM, ROM, Program ROM (values TBD)
Editor's Notes:
- Model size per operation mode is less than [5-10] million parameters
- Total number of parameters is less than [Z] million
- The ULBC codec should be implementable on a mobile device using today's technology
- Increased computational complexity and memory usage should be commensurate with the gain in quality of user experience (e.g., higher audio bandwidth such as SWB or stereo) or increased efficiency (e.g., lower bit rate for the same quality compared to a reference codec)
Design Constraint: If noise suppression is supported inside ULBC, there should be a mechanism to disable noise suppression in the codec [7], [8]
Editor's Notes - Clarifications Needed:
- Need to support noise suppression in ULBC? (typically vendor specific, defined outside the codec)
- Impacts on test methodology, DTX operation/performance
Motivations:
- Disabling noise suppression is required to test the feature separately
- Avoid tandeming in real operation
- IMS voice communication is defined in TS 22.228; GEO satellite access has no specific requirement on noise handling
Design Constraint: A JBM solution conforming to requirements in TS 26.114, except for the functional requirement in sub-clause 8.2.2 of TS 26.114: "Speech JBM used in MTSI shall support all the codecs as defined in clause 5.2.1", shall be provided with candidate codecs
Design Constraint: Candidate codecs shall perform rate switching upon command to the encoder throughout the entire bit rate range at arbitrary frame boundaries. Rate switching may imply switching between different bandwidths
Note: Due to the Bundling period and associated TBS, switching might have to happen at the boundary of bundling period
Design Constraint: A PLC solution shall be provided by ULBC candidate codecs [9]
Editor's Notes:
- Typical loss profiles/characteristics to be clarified
- Support of redundancy to be clarified
- Need to be able to handle BLER up to [x%]
Design Constraint: Candidate codecs shall provide an RTP payload format specification supporting the full set of features and functionality of the ULBC candidate codecs
Design Constraint: Candidate codecs shall provide a complete VAD/DTX/CNG framework. It shall be possible to operate the codec with DTX on or DTX off
Editor's Note: Typical radio characteristics and optimizations (SPS, DRX, bitrate) to be clarified
Design Constraint: ULBC candidate codecs shall not amplify the output signal relative to the input signal beyond limits
Editor's Note: Similar limits and methodology to measure the amplification are described in the EVS-7a,b processing plan permanent document
[1] S4-251794 - Discussion on Audio Bandwidth for ULBC (vivo, Samsung, MediaTek Inc., Bytedance, Nokia, Xiaomi, Spreadtrum)
[2] S4-251808 - Pseudo-CR on Design Constraints of ULBC: Audio bandwidth (Fraunhofer IIS)
[3] S4-251792 - On codec bitrate and capacity discussion for ULBC (vivo, Samsung, Spreadtrum, MediaTek Inc.)
[4] ITU-T G.191 - Software tools for speech and audio coding standardization (March 2010)
[5] S4-251747 - On complexity constraints for ULBC (Huawei Technologies Co., Ltd.)
[6] S4-251807 - On complexity design constraints for ULBC (Fraunhofer IIS)
[7] S4-251395 - Pseudo-CR on Design Constraints of ULBC: Noise suppression (Fraunhofer IIS)
[8] S4-251748 - On noise suppression for ULBC (Huawei Technologies Co., Ltd.)
[9] S4aA250268 - Packet Loss Concealment with existing AI based codec DAC (Dolby Laboratories Inc., Ericsson LM, Nokia, Novamint)
Note: Items in light blue are candidates for agreement at SA4#135.
[FS_ULBC] Alignment Analysis on Complexity of DAC model
This contribution addresses a significant discrepancy in complexity reporting for AI-based codecs in the ULBC study. Two contributions (S4-260165 from Dolby et al. and S4-260155 from vivo et al.) both reported models with approximately 3M parameters but showed substantially different complexity metrics:
Notably, the S4-260165 model's complexity (0.79 GMACS) aligns more closely with the S4-260155 model operating at 16 kHz (~0.70 GMACS), despite the difference in sampling rate.
The contribution demonstrates that Model Size (parameter count) is an insufficient metric for constraining complexity across different neural architectures, and proposes GMACS as a robust, architecture-agnostic metric that provides linear correlation with RTF.
A detailed breakdown comparison was performed between the two architectures to understand why models with similar parameter counts exhibit different computational footprints:
| Metric | [2] (16k, ~3M) | [1] (32k, ~3M) |
|--------|----------------|----------------|
| Input Rate | 16,000 Hz | 32,000 Hz |
| Total Stride | 320 (2×4×5×8) | 1280 (4×4×8×10) |
| Latent Rate | 50.0 Hz | 25.0 Hz |
| Encoder MACs (M) | 436.30 | 461.92 |
| Quantizer MACs (M) | 2.25 | 0.50 |
| Decoder MACs (M) | 984.50 | 1037.12 |
| Total MFlops/s | 1423.05 | 1499.54 |
Key Analysis:
Conclusion: Two models with identical parameter counts can have vastly different runtimes depending on parameter location (shallow vs. deep layers) and stride configuration.
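The parameter-location effect can be illustrated with a single 1-D convolution layer: its weight count is fixed, but its MACs per second scale with the rate at which it produces output frames, which the stride configuration determines. The numbers below are illustrative, not taken from either contribution:

```python
# Two layers with identical parameter counts can cost very different
# MACs/s, because per-second work = weights * output frame rate.

def conv1d_weights(c_in: int, c_out: int, k: int) -> int:
    return c_in * c_out * k

def macs_per_second(c_in: int, c_out: int, k: int, out_rate_hz: float) -> float:
    """MACs/s of a 1-D conv producing out_rate_hz output frames."""
    return conv1d_weights(c_in, c_out, k) * out_rate_hz

# Same weights, but one layer runs at a 50 Hz latent rate and the
# other at 25 Hz: identical "model size", half the compute.
assert macs_per_second(256, 256, 7, 50.0) == 2 * macs_per_second(256, 256, 7, 25.0)
```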
Theoretical complexity (GMACS) was recalculated to validate the analysis:
When RTF data from S4-260155 is plotted against GMACS (rather than Model Size), the data aligns consistently across architectures.
Key Findings:
By adopting GMACS as the primary complexity metric, the apparent discrepancies between different contribution data are resolved. This enables a unified set of requirements that accurately reflects real-time capability of mobile devices.
Propose to include this analysis in 3GPP TR 26.940, specifically capturing:
[1] S4-260165, "[FS_ULBC] On ULBC complexity and RTF analysis"
[2] S4-260155, "[FS_ULBC] Analysis of AI Codec Real-Time Performance (RTF) and Complexity Scaling"
[FS_ULBC] Feasible bitrates for the NTN-TDL-C channel model with 10-degree elevation angle
This contribution addresses the determination of feasible Transport Block Sizes (TBS) for the newly agreed NTN-TDL-C channel model at 10-degree elevation angle, which was adopted at SA4 #134 (documented in S4-252108). Key observations include:
The contribution evaluates maximum feasible bitrates under worst-case conditions without DMRS bundling, considering two scenarios:
Both scenarios target 2% BLER for uplink and downlink.
The contribution proposes adding new TBS values to support the higher bitrates enabled by the new channel model:
The proposed tables include:
- Packet header assumption: 7 bytes (with a note that the final size depends on SA2/RAN conclusions on 1-byte MAC header feasibility)
- Header counting: packet header counted only once per bundling period, regardless of the number of voice frames bundled
- TBS values: selected from TS 36.213 Table 16.5.1.2-2 for NB-IoT NPUSCH
- Net bitrate calculation: PHY bitrate minus overhead from packet headers
The complete updated tables show TBS values ranging from 144 to 936 bits for 80ms bundling, with corresponding PHY and net bitrates calculated for each bundling period configuration.
[FS_ULBC] On the scheduling timing uncertainty
This contribution addresses ambiguities in interpreting RAN1 LS S4-251654 regarding uplink and downlink timing for NB-IoT NTN with GEO satellites. The interpretation of this LS has direct implications on: - Scheduling timing uncertainty assumptions - Link capacity calculations
The document proposes clarifications to the Permanent Document (PD) version 0.4.0 to resolve these interpretation issues discussed at SA4 #133-e and subsequent meetings.
The document maintains the existing frame structure example for Half-duplex FDD with 80ms bundling period: - NPDSCH duration: 4ms (variable depending on DL SNR) - Multiple UL frequency allocation options: 1, 3, 6, and 12 tones with 15 kHz per tone - Allocation choice depends on UL and DL channel capacity
Two SPS approaches are presented:
Approach 1 (Figure 5.2.2.3-2): - NPDSCH can be positioned anywhere within first 15ms - Maintains minimum 1ms gap to NPUSCH
Approach 2 (Figure 5.2.2.3-3): - Based on "Cell_specific_Koffset" approach - Does not depend on "TA report UE capability"
The gap consists of: 1. Processing time + DL-to-UL switching: Minimum 1ms for half-duplex device switching 2. Max differential delay: Accounts for different round-trip delays of UEs in NTN cell - Typical range: close to 0 to 10.3ms depending on deployment
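The gap budget above can be sketched as a simple sum. The 1 ms switching time is from the contribution; converting a near/far path difference to a differential delay via the speed of light is a standard free-space assumption:

```python
# Sketch of the scheduling-gap budget: DL-to-UL switching time plus the
# max differential delay across the NTN cell.

C = 299_792_458.0  # speed of light, m/s

def differential_delay_ms(path_difference_km):
    """Delay spread between the closest and farthest UE in the cell."""
    return path_difference_km * 1e3 / C * 1e3

def min_gap_ms(path_difference_km, switching_ms=1.0):
    return switching_ms + differential_delay_ms(path_difference_km)

# A <1500 km near/far path difference corresponds to ~5 ms differential delay:
print(round(differential_delay_ms(1500), 1))  # 5.0
print(round(min_gap_ms(1500), 1))             # 6.0
```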
Key Changes Proposed:
For 80ms bundling: - Original assumption: "Max differential delay" is 10ms AND X + Y ≤ 68ms - Proposed change: Replace with reference to beam size no larger than 1500km - Note clarifies this corresponds to scenarios where difference between closest and farthest point to satellite is <1500km - Explicitly states codec can be deployed in scenarios not meeting these constraints
For 160ms bundling:
- Original assumption: "Max differential delay" is 10ms AND X + Y ≤ 148ms
- Proposed change: Replace with reference to beam size no larger than 1500km
- Same flexibility noted regarding deployment in other scenarios
RAN1 LS Clarification: - Figure 5.2.2.3-1 supportable in most scenarios - May not be supportable when: - Cell is very large (e.g., >3000km) - UE does not support TA report - Network does not support UE-specific K-offset - Requires UE configuration with two HARQ processes and HARQ feedback disabled
SPS Design Status: - RAN1/2 have not yet started SPS design work - RAN1 cannot currently confirm whether SPS frame structure examples (Figures 5.2.2.3-2 and associated text) will be supported
Editor's Note: - Range of "Max differential delay" is TBC (To Be Confirmed)
The primary technical contribution is replacing specific timing constraint assumptions (X + Y values and max differential delay) with a more practical reference based on beam size ≤ 1500km for codec simulation baseline, while explicitly allowing codec deployment in scenarios exceeding these reference conditions. This provides clearer guidance to SA4 while maintaining flexibility for various deployment scenarios.
[FS_ULBC] On transmission delay for voice over NB-IoT NTN
This contribution from Qualcomm addresses gaps in TR 26.940's mouth-to-ear delay calculations for NB-IoT NTN systems, specifically highlighting the omission of NPUSCH/NPDSCH transmission durations and clarifying the distinction between propagation delay and transmission delay.
Missing Transmission Duration: TR 26.940 did not account for the duration of NPUSCH transmission or NPDSCH transmission, which can be significant for NB-IoT (e.g., 64ms for NPUSCH)
Terminology Confusion: TR 26.940 confuses propagation delay with transmission delay: propagation delay is the time for the signal to traverse the link distance, while transmission delay is the time needed to transmit the transport block over the air (e.g., the NPUSCH duration).
Key Change: Renamed "Transmission delay" to "Propagation delay" for GEO satellite link
New Addition: Introduces proper definition and consideration of transmission delay
Editorial Note Added: - Numbers in Table 5.1.3-1 will be updated once RAN simulation is completed to account for transmission delays in uplink and downlink - Current values assume AMR and EVS algorithmic delays - ULBC delay components still need to be addressed - Minimum Delay_GSCN assumed as 20ms
Existing Table Structure Maintained: - Frame sizes: 20ms, 40ms, 80ms, 160ms, 320ms - Two scenarios: GEO-TN (main) and GEO-GEO (sub-scenario 1) - Lower and upper bounds for mouth-to-ear delay - Delay ranges from 428-712ms (20ms frame, GEO-TN) to 984-1455ms (320ms frame, GEO-GEO)
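The propagation-vs-transmission distinction drawn by this contribution can be sketched numerically. The 64 ms NPUSCH duration is the example cited above; the nadir-distance simplification is an assumption for illustration:

```python
# Sketch: propagation delay depends on link distance, transmission delay
# on how long the transport block occupies the air (e.g., a 64 ms NPUSCH).

C = 299_792_458.0  # speed of light, m/s

def propagation_delay_ms(distance_km):
    return distance_km * 1e3 / C * 1e3

def one_way_link_delay_ms(distance_km, tx_duration_ms):
    """Transmission delay adds on top of propagation delay."""
    return propagation_delay_ms(distance_km) + tx_duration_ms

geo_altitude_km = 35_786  # GEO altitude; UE at nadir (best case, assumption)
print(round(propagation_delay_ms(geo_altitude_km)))       # ~119 ms
print(round(one_way_link_delay_ms(geo_altitude_km, 64)))  # ~183 ms
```

Omitting the 64 ms transmission component, as the contribution notes TR 26.940 did, understates the one-way delay by roughly a third in this example.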
[FS_ULBC] Support for Dual-Tone Multi-Frequency for IMS voice over NB-IoT NTN
SA1 has mandated support for Dual-Tone Multi-Frequency (DTMF) for IMS voice over NB-IoT NTN. The document addresses the need to consider multiplexing of DTMF traffic with voice traffic in the system design, referencing RFC 4733 for DTMF payload formats.
RFC 4733 defines two DTMF payload format types: - Telephone events: User button presses (0-9, *, #) during calls - Tones: Ringing tone, busy tone, etc.
For IMS calls, tones are generated locally (e.g., "180 Ringing" or "486 Busy Here" SIP messages trigger local tone generation), so only telephone events need to be transported over the air.
The document identifies key DTMF traffic characteristics: - DTMF packets are transmitted infrequently (only on button press) - Telephone events may or may not overlap with active voice activity - Multiple DTMF packets may be transmitted per button press, with the RTP marker bit indicating the first packet - RTP packets must differentiate between voice and DTMF packets for multiplexing
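The voice/DTMF differentiation via the RTP payload type and the marker-bit convention above can be sketched with a minimal RTP header parse (RFC 3550 header layout). The payload type value 101 is a common dynamic assignment used here as an assumption, not a value from the contribution:

```python
# Sketch: separating RFC 4733 telephone-event (DTMF) packets from voice
# packets in one RTP stream using the payload type; the marker bit flags
# the first packet of a button press.

import struct

DTMF_PT = 101  # dynamic payload type negotiated via SDP (assumption)

def parse_rtp(packet: bytes):
    """Return (marker, payload_type) from a minimal 12-byte RTP header."""
    b0, b1 = packet[0], packet[1]
    assert (b0 >> 6) == 2, "RTP version 2 expected"
    marker = (b1 >> 7) & 0x01   # set on the first packet of a telephone event
    payload_type = b1 & 0x7F    # distinguishes voice vs DTMF
    return marker, payload_type

def is_dtmf(packet: bytes) -> bool:
    return parse_rtp(packet)[1] == DTMF_PT

# First packet of a button press: marker set, PT = 101, toy seq/ts/SSRC.
header = struct.pack("!BBHII", 0x80, 0x80 | DTMF_PT, 1, 0, 0x1234)
print(is_dtmf(header), parse_rtp(header)[0])  # True 1
```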
Three key assumptions are established: 1. DTMF packet size ≤ voice packet size 2. DTMF delay requirements are less stringent than voice service 3. DTMF takes priority over voice
When SPS (Semi-Persistent Scheduling) is configured for voice traffic with fixed TBS: - If DTMF packets don't overlap with active voice frames, they can be multiplexed with SID frames (smaller than active voice frames) and transmitted in SPS occasions - If overlapping occurs, the UE can puncture an active voice frame and send the DTMF frame instead - SA4 needs to coordinate with RAN1 and RAN2 on SPS design
Proposal 1: Make DTMF support an integral part of IMS voice service over NB-IoT NTN
Proposal 2: Design DTMF support based on the three assumptions: - DTMF packet size ≤ voice packet size - DTMF delay requirement less stringent than voice - DTMF priority over voice
Proposal 3: SA4 to design mechanisms for voice and DTMF multiplexing for SPS and coordinate with RAN1 and RAN2
Proposed design constraints for noise suppression, DTX, and non-speech inputs
This contribution addresses design constraints for the ULBC (Ultra-Low Bit-rate Communication) over GEO channel solution, building upon previous discussions from S4-251881 and S4-251786. The document focuses on three key areas: - Noise suppression handling - Discontinuous transmission (DTX) framework - Robustness to non-speech inputs
The contribution emphasizes that emergency calls represent a critical use case for ULBC over GEO, particularly when terrestrial network (TN) service coverage is unavailable. Key considerations include: - Background signals may contain critical contextual information (e.g., voices, environmental sounds indicating danger) - Post-call analysis requirements (ASR transcripts, emergency response evaluation, criminal investigations) - Need for full situational awareness rather than aggressive noise suppression
The document identifies several technical challenges:
The contribution updates the original proposal from S4-251881 by: - Maintaining the requirement for disableable noise suppression within the codec - Adding specific SNR ranges for stationary (5-15 dB) and non-stationary (10-25 dB) noise - Deferring specific noise type definitions for future discussion - Linking noise suppression behavior primarily to performance requirements
The document proposes updates to Table 6.2-1 in draft TR 26.940 with three new/modified constraint parameters:
Requirement: If noise suppression is supported as part of the candidate codec, it must be possible to disable it to preserve background signals.
Editor's Notes: - EN1: Requirement to disable may be considered in connection with specific operating bit rate(s) - EN2: Solution behavior w.r.t. potential noise suppression is primarily enforced via performance requirements; default operation for tests is with noise suppression disabled
Requirement: The candidate codec shall provide a framework for: - Voice Activity Detection (VAD) - Discontinuous Transmission (DTX) - Comfort Noise Generation (CNG) - Operation with DTX on or DTX off
Editor's Note: Operation relating to DTX on and disabling/enabling potential noise suppressor may need clarification
Requirement: The candidate codec shall be robust to: - Noisy speech with stationary noise (5-15 dB SNR) - Noisy speech with non-stationary noise (10-25 dB SNR) - Background signals during and between speech segments - Other non-speech input signals
Editor's Notes: - EN1: May need to be in performance requirements - EN2: Relevant background signals to be further defined as part of performance requirements, including both stationary and non-stationary types
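Preparing noisy-speech test items at the SNR ranges above (5-15 dB stationary, 10-25 dB non-stationary) reduces to scaling the noise to a target SNR. The sketch below uses standard power/gain math with toy signals; it is not a test-plan procedure from the contribution:

```python
# Sketch: scale a noise signal so the speech-to-noise ratio hits a target
# SNR in dB. Pure Python, toy data.

import math

def power(x):
    return sum(s * s for s in x) / len(x)

def scale_noise_to_snr(speech, noise, snr_db):
    """Return noise scaled so that 10*log10(Ps/Pn') equals snr_db."""
    gain = math.sqrt(power(speech) / (power(noise) * 10 ** (snr_db / 10)))
    return [gain * n for n in noise]

speech = [math.sin(0.1 * i) for i in range(1000)]                   # toy tone
noise = [((i * 2654435761) % 1000) / 500 - 1 for i in range(1000)]  # toy noise

scaled = scale_noise_to_snr(speech, noise, 10.0)
snr = 10 * math.log10(power(speech) / power(scaled))
print(round(snr, 6))  # 10.0
```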
Balanced approach to noise suppression: Recognizes both the need for flexibility in noise suppression (for speech quality) and the critical requirement to preserve background signals (for emergency use cases)
Mandatory DTX framework: Establishes VAD/DTX/CNG as a required feature rather than optional, with explicit on/off control
Quantified robustness requirements: Provides specific SNR ranges for different noise conditions that the codec must handle
Testing methodology guidance: Proposes default testing with noise suppression disabled, while allowing performance requirements to govern overall behavior
Several editor's notes indicate areas requiring further work: - Specific operating bit rates where noise suppression disable requirement applies - Clarification of DTX and noise suppression interaction - Final placement of robustness requirements (design constraints vs. performance requirements) - Definition of specific background signal types for testing - Speech quality requirements (to be addressed separately in performance requirements)
UE Antenna Gain in link-budget evaluations
This contribution addresses the need to establish common assumptions for UE Antenna Gain in link-budget evaluations for FS_ULBC (Ultra Low Bitrate Speech Codec). The document highlights that different assumptions on UE Antenna Gain lead to significantly different conclusions on suitable radio configurations, and proposes alignment with existing 5G NR-NTN assumptions.
The current FS_ULBC Pdoc references TR 36.763 with UE Antenna Gain assumptions ranging between 0 dBi and -5.5 dBi. Previous SA4 contributions on link level simulations have shown divergent assumptions regarding achievable link level performance, leading to inconsistent conclusions. The lack of a common assumption for UE Antenna Gain (G_Tx) significantly impacts:
The document presents a detailed side-by-side comparison of link-budget calculations for GEO satellite uplink with two different UE Antenna Gain assumptions:
Scenario Parameters (Common): - Satellite Orbit: GEO - Link Direction: Uplink - Device Type: Handheld - Satellite Elevation Angle: 2.3 degrees - Satellite Altitude: 35,786 km - Slant Range: 41,417.91 km - Carrier Frequency: 2000 MHz - Free Space Path Loss (FSPL): 190.8 dB - UE Transmit Power: 23 dBm - Receive Antenna Gain: 51 dBi - Satellite G/T: 19 dB/K - Bandwidth: 3750 Hz - Various losses (atmospheric, shadow fading, scintillation, polarization, additional): 11.4 dB total
Key Results:
| UE Antenna Gain | Received Power | Noise Power | SNR at Satellite Receiver |
|-----------------|----------------|-------------|---------------------------|
| 0 dBi | -135.58 dBm | -138.23 dBm | 2.66 dB |
| -5.5 dBi | -141 dBm | -138.23 dBm | -2.84 dB |
The difference in UE Antenna Gain assumption results in a 5.5 dB difference in SNR, which is highly significant for link-level performance evaluation and system design.
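Two of the quoted link-budget figures can be reproduced from first principles. The noise figure matches kTB with T = 290 K and B = 3750 Hz, which is an assumption inferred from the quoted value; the one-for-one effect of G_Tx on SNR follows because all other terms are held fixed:

```python
# Sketch reproducing the 190.8 dB FSPL and -138.23 dBm noise power from
# the scenario parameters above.

import math

C = 299_792_458.0          # speed of light, m/s
K_BOLTZMANN = 1.380649e-23  # J/K

def fspl_db(distance_m, freq_hz):
    return 20 * math.log10(4 * math.pi * distance_m * freq_hz / C)

def noise_power_dbm(bandwidth_hz, temp_k=290.0):
    return 10 * math.log10(K_BOLTZMANN * temp_k * bandwidth_hz) + 30

print(round(fspl_db(41_417_910, 2e9), 1))  # 190.8 (slant range, 2000 MHz)
print(round(noise_power_dbm(3750), 2))     # -138.23

# With all other terms fixed, the SNR gap equals the antenna-gain gap:
print(0.0 - (-5.5))  # 5.5 dB
```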
Observation 1: The assumption for UE_Antenna_Gain (G_Tx) critically impacts the resulting SNR at the satellite receiver, which in turn affects conclusions on link-level results. Clarification is needed on whether to use 0 dBi, -5.5 dBi, or both values.
Observation 2: It is unlikely that an NB-IoT device would have superior antenna performance compared to an NR handheld device. Therefore, the UE_Antenna_Gain assumption should align with 5G NR-NTN specifications, which use -5.5 dBi.
Observation 3: RAN4 guidance (R1-2208353) explicitly recommends -5.5 dBi as a realistic UE antenna gain value, stating: "The UE antenna gain varies depending on the operating frequency and UE design. RAN4 thinks that a realistic UE antenna gain value would be -5.5 dBi. RAN4 would then recommend RAN1 to take this value as an assumption for their link budget evaluation."
Proposal 1: For the support of voice-over-GEO in NB-IoT NTN, align the assumption on UE_Antenna_Gain (G_Tx) with 5G NR-NTN specifications, i.e., -5.5 dBi.
This alignment ensures: - Consistency with existing 3GPP NTN specifications - Realistic assumptions based on RAN4 recommendations - Comparable link-budget evaluations across different contributions - Appropriate performance targets for codec and system design
On reference code and model format
This contribution proposes the use of ML model formats as intermediate representations (IR) for the ULBC (Ultra Low Bitrate Codec) reference implementation, rather than a pure C implementation. The document is structured as a proposed Change Request (pCR) to TR 26.940, introducing a new clause 6.4.2.
The document identifies a fundamental question for ULBC standardization: whether to provide the entire codec reference implementation in C (including neural network components) or to define specific parts based on ML model formats (e.g., ONNX, PyTorch, TensorFlow).
Key concerns with pure C implementation: - Limits UE vendors from leveraging custom architectures and optimizations - UE vendors typically have custom optimization pipelines to port ML models to internal formats - Pure C approach restricts full utilization of specialized hardware (NPUs, DSPs, TPUs)
Issues with existing WMC (Weighted Million Operations) tool for complexity measurement: - Weights in Table 18.3 of G.191 do not account for vectorized implementations of matrix multiplications - Theoretical complexity estimation does not reflect actual runtime complexity - Does not account for diversity of target platforms
Additional limitations identified: - Hardware/platform dependencies: C implementations may rely on platform-specific intrinsics and vectorization pragmas, limiting portability to NPUs - Unoptimized reference code: May not be optimized for certain platforms - Compiler dependencies: Intrinsics are compiler-specific - Maintenance burden: Keeping C implementation updated with new ML operators and architectures is costly and error-prone
The document establishes clear terminology:
Note: PyTorch does not contain a graph format and requires the model definition as Torch code.
Platform Portability: - Specifies what is computed, not how it's executed - Framework-agnostic: models can be exported from different training frameworks - Allows vendors to use custom toolchains for hardware-specific optimization
Hardware Evolution: - Future-proof method to leverage latest AI processor developments - Maintains compatibility with low maintenance effort
Combination with Standard C-code: - ULBC can combine ML parts (as model format) with classic signal processing (in ANSI C) - Backend runtime in C can integrate ML components - Enables traditional 3GPP codec reference implementation structure
The document provides detailed comparison of major ML model formats:
| Format | Type | Key Advantages | Key Limitations |
|--------|------|----------------|-----------------|
| ONNX | Framework-agnostic IR | Cross-framework portability, wide runtime/hardware support, native OS support (Windows/Linux), dedicated C/C++ runtime | Operator coverage limitations, limited dynamic graph support |
| TensorFlow Lite (TFLite/LiteRT) | Edge/embedded-focused IR | Mobile/edge optimized, strong Android ecosystem, quantization tools, C/C++ runtime | TensorFlow-centric, partially vendor-specific maintenance |
| PyTorch/Python | Torch.nn.Module + checkpoints | Easy prototyping, highly optimized conversion tools | Suboptimal for real-world testing, Python dependencies, no C/C++ runtime without Python |
| TorchScript | PyTorch-specific serialized IR | Static graph without Python dependencies, supports custom ops, LibTorch C++ runtime | PyTorch-specific, deprecated (being replaced by ExportedProgram) |
| ExportedProgram & ExecuTorch | Two IRs: ExportedProgram and .pte | Replaces TorchScript, canonical PyTorch export IR, dedicated C++ runtime | PyTorch-specific, requires compilation to another IR, pipeline not fully mature |
| OpenVINO IR | Intel/CPU-centric IR | Strong Intel CPU/GPU optimization | Not suitable for mobile SoCs, extra conversion step needed |
| Proprietary vendor IRs | Vendor-specific internal IR | Highly hardware-optimized | Not portable, requires conversion from open IR |
Key observations: - PyTorch format provides maximum flexibility and transparency but may have long-term compatibility concerns due to format evolution - ONNX and TFLite are designed for inference deployment and cross-platform compatibility, representing stable industry standards - ULBC ML parts will likely be based on PyTorch format, convertible to stable formats like ONNX or TFLite
Hardware landscape: - Major smartphone SoCs include NPUs, DSPs, TPUs, GPUs, and CPUs - Vendors provide specialized runtime environments and SDKs - Vendors use native/preferred internal model formats optimized for their architecture
Industry pattern: - All major vendors provide conversion mechanisms from popular open-source formats - Common supported formats: ONNX, TFLite, PyTorch, TensorFlow - References provided for major vendors: Qualcomm, Apple, Samsung, MediaTek, Google, Huawei
Advantages of model-format/IR-based reference implementation: - Decouples algorithm definition from hardware-specific implementation - Leverages existing SoC vendor compilers, AI accelerators, and runtimes - Significantly more portable, maintainable, and future-proof
Recommended approach for ULBC reference implementation: 1. Base reference on ML model-format with auxiliary signal processing in C 2. Include both ONNX and PyTorch as ML model-formats 3. Define neural network model-format including operator set and version 4. Specify I/O interfaces of ML models and auxiliary signal processing steps in C 5. Use reference implementation for integration illustration, verification, and testing
The document proposes: 1. Discussion and agreement on selection of one or more model formats for ULBC reference implementation 2. Agreement on principle of using model format as part of ULBC standardization reference model 3. Documentation of findings in TR 26.940 under new clause 6.4.2
This contribution represents a significant departure from traditional 3GPP codec standardization approaches by advocating for ML model formats rather than pure C implementations. The proposal addresses practical deployment considerations for ML-based codecs while maintaining compatibility with 3GPP standardization practices through a hybrid approach that combines model formats with C code for the signal processing components.
On the use of objective metrics in ULBC standardization
This document addresses the Study on Ultra Low Bitrate Speech Codec (FS_ULBC), specifically focusing on performance requirements and test methodologies as defined in the WID. The contribution targets study objective 5 regarding speech quality, intelligibility, and conversational quality testing under various conditions (clean/noisy speech, tandeming with IMS codecs, clean/GEO channel conditions).
The document identifies specific impairment categories relevant to ULBC: - Loss of listening-only audio quality - Audio bandwidth loss - Impaired intelligibility - Impaired speaker identifiability - Prosodic impairments - Hallucination (word and phone confusions) - Sensitivity to non-speech input (background noise, music, noisy speech, interfering talkers, reverberant speech)
Additionally notes that ULBC may incorporate speech enhancement algorithms (noise suppression, gain normalization).
The document highlights that ULBC testing introduces new challenges compared to signal processing-based codecs (AMR, AMR-WB, EVS):
Traditional 3GPP Approach: - Historical reliance on ITU-T P.800 ACR (Absolute Category Rating) for clean speech - P.800 DCR (Degradation Category Rating) for SWB clean speech, mixed-bandwidth, speech + background noise, and music/mixed content - Previous codec standardizations did not focus on intelligibility, speaker identifiability, or prosodic impairments
ULBC-Specific Considerations: - ML-based coding systems introduce new impairment types (e.g., hallucination) not present in signal-processing codecs - ACR may not optimally quantify all impairments (hallucination, intelligibility, prosodic issues) - DCR focuses on differences to reference, which may not directly impact conversational capability but affects aspects like identity recognition
Alternative Test Methodologies Listed: - Diagnostic Rhyme Tests (DRT) - Modified Rhyme Tests (MRT) - MOS testing for speaker similarity - Speaker verification/identification tests - Prosodic naturalness MOS tests - Intonation recognition tests - Transcription tests for word and semantic equivalence - Phoneme recognition tests - Automatic speech recognition tests - P.835 multi-dimensional rating scales for speech enhancement evaluation
Robustness Related to Source Material (9.1.3.1): - Multiple languages with diverse intonations - Non-speech signals - Various linguistic features and accents - Wide range of speakers (different voice pitches, speaking styles) - Overlapping talkers
Simulation of Real-world Acoustic Conditions (9.1.3.2): - Clean environments (minimal background noise) - Noisy environments (traffic, human chatter, vehicle) - Various reverberation levels (RT60 ranging from 0.3s to 1.0s)
Tandeming and Compatibility Testing (9.1.3.3): - Testing with speech previously encoded by ITU-T G.711, AMR, AMR-WB, and EVS - Various input levels: -16dBov, -26dBov, and -36dBov
Conclusion (9.1.3.4): - ITU-T P.800 ACR/DCR serves as backbone for most subjective testing - Other methodologies may be considered - Emphasis on diverse test material: multilingual/multi-speaker testing, real-world acoustic conditions, and tandeming
Correlation Analysis Results (9.1.4.1):
The document presents correlation analysis based on ACR experiments (clause 7.3.3) evaluating objective models:
Speech-oriented metrics: PESQ, POLQA, ViSQOL-S, WARP-Q, DNSMOS, NISQA, NORESQA, UTMOS, SCOREQ
General audio metrics: PEAQ, ViSQOL-A
Evaluation metrics used: Pearson correlation coefficient, RMSE, Kendall's Tau rank correlation coefficient
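Two of the evaluation metrics named above (Pearson correlation and RMSE between subjective MOS and objective model scores) can be sketched with stdlib Python. The score arrays are toy data, not results from the contribution:

```python
# Sketch of the correlation-analysis metrics: Pearson correlation
# coefficient and RMSE between subjective and objective scores.

import math

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def rmse(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))

mos       = [1.8, 2.5, 3.1, 3.9, 4.4]  # toy subjective MOS per condition
predicted = [2.0, 2.4, 3.3, 3.8, 4.5]  # toy objective-model outputs

print(round(pearson(mos, predicted), 3), round(rmse(mos, predicted), 3))
```

A third-order polynomial mapping, as referenced above for RMSE after mapping, would be fitted on `predicted` before computing RMSE.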
Key Observations for Clean Speech: - Best performing models (POLQA, UTMOS, PESQ, WARP-Q, SCOREQ) accurately predicted monotonic bitrate/quality behavior - 16 kHz models (PESQ without mapping, UTMOS and WARP-Q with mapping) showed relatively good performance even for fullband codecs - Mapping generally improves accuracy (RMSE) except for few models (PESQ, POLQA)
Correlation Analysis for Music/Mixed Content:
Based on DCR experiments (clause 7.3.4), evaluating: POLQA, PEMO-Q, ViSQOL-A, and 2f-model
Key Observations for Music/Mixed Content: - POLQA (despite not being recommended for non-speech) showed best correlation results (Pearson, Kendall, RMSE after 3rd order mapping) - 2f-model was second-best performing - ViSQOL Audio, PEAQ, and PEMO-Q showed fair performance - Correlation scores lower than clean speech, possibly due to more difficult task of predicting general audio quality and mismatch with DCR grading methodology
Discussion (9.1.4.2): - P.862 (PESQ) officially "withdrawn" by ITU-T, cannot be considered valid standard - P.863 remains main ITU-T standard, P.SAMD emerging as potential alternative - Testing and parameter adjustment based on objective tools not recommended - 3GPP TR 26.921 documented that tuning noise reduction based on PESQ should be avoided
Conclusion (9.1.4.3): - Subjective testing remains "golden reference" for codec selection - Objective metrics NOT recommended for codec selection criteria or codec tuning - Correlation of subjective and objective metrics may be considered for codec characterization - Objective metrics have merits in other tasks such as codec conformance testing
The document proposes comprehensive revisions to TR 26.940 v0.5.1, specifically to Clause 9 (Test methodologies), incorporating all the analysis and recommendations detailed above regarding both subjective and objective testing approaches for ULBC standardization.
On complexity estimation of ULBC
This contribution addresses the complexity measurement methodology for the Ultra-Low Bitrate Codec (ULBC) under development in 3GPP SA4. The document proposes a hybrid complexity metric that combines traditional DSP-based measurements with ML-specific metrics.
Multiple input documents [1-4] have previously discussed complexity measurement approaches: - Documents [1] and [3] proposed using WMOPS (Weighted Million Operations Per Second), following conventional speech codec practices - Document [2] suggested using MACs and a modified WMOPS version - Document [4] emphasized model size considerations
The key challenge is that ULBC will operate on heterogeneous, non-fixed target hardware and processors, requiring a platform-agnostic complexity metric.
The document proposes combining two complementary measurement approaches:
For DSP-based components: - Use traditional WMOPS measurement
For ML-based components: - Use MAC (Multiply-Accumulate) operations count - Include parameter count for memory/model size considerations
Combined metric formula:
WMOPS + w · MACs
where w is an ML weighting factor (expected to be < 1) that reflects the vectorization capability of matrix multiplications.
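The combined metric can be sketched as below. The layer shapes, frame rate, WMOPS figure, and the w = 1/16 choice are illustrative assumptions (w < 1 reflecting, e.g., 16-way MAC vectorization), not values proposed by the contribution:

```python
# Sketch of the hybrid metric WMOPS + w * MACs for a DSP+ML codec.

def macs_per_second(layers, frames_per_second):
    """MACs/s of a stack of dense layers given (inputs, outputs) shapes."""
    macs_per_frame = sum(c_in * c_out for c_in, c_out in layers)
    return macs_per_frame * frames_per_second

def hybrid_complexity_wmops(dsp_wmops, ml_macs_per_s, w):
    """WMOPS + w * MACs, with MACs expressed in millions per second."""
    return dsp_wmops + w * ml_macs_per_s / 1e6

layers = [(256, 512), (512, 512), (512, 64)]  # toy ML component
macs = macs_per_second(layers, frames_per_second=50)

# w = 1/16: assume 16-way MAC vectorization on the target processor
print(round(hybrid_complexity_wmops(10.0, macs, 1 / 16), 2))
```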
Limitations of WMOPS-only approach: - WMOPS reflects complexity primarily for DSP operations - Does not account for vectorization capabilities available even on modern DSPs - Less relevant for non-DSP processor types - The WMOPS toolbox doesn't reflect modern computational capabilities
ML-specific considerations: - ML component complexity is dominated by matrix multiplications - Inference time and energy consumption are highly platform-dependent - MAC count provides architecture-agnostic computational load measurement - Parameter count relates directly to model size, memory usage, and energy consumption
The hybrid approach provides: 1. Overall complexity estimate for hybrid DSP+ML codec designs 2. Avoids over-constraining codec design toward specific platforms (referenced S4-260233) 3. Allows UE vendors to leverage custom architectures and optimizations 4. Accounts for efficient vectorization of ML components 5. Enables flexible computational cost balancing between DSP-based and ML-based components 6. Maintains continuity with established practice while accommodating emerging ML-based designs
The document provides example processing units and their vectorization capabilities to inform the ML weighting factor w:
| Chip | Type | Vectorization Capabilities |
|------|------|----------------------------|
| HiFi 5s | DSP | 32×(8×8 bit MAC), 16×(32×16 bit MAC), 8×(32×32 bit MAC) |
| ARM Cortex A55 | CPU | 16×(8×8 MAC), 8×(16×16 MAC FP) |
The source proposes to:
1. Combine according to: WMOPS + w · MACs (where w is an ML weighting factor)
2. Define a maximum value as the computational complexity limit in design constraints
3. Apply similar principles for memory counting metrics
The document references four previous contributions [1]-[4] and two external technical specifications [5]-[6] for processor capabilities.
[FS_ULBC] ULBC Re-Focus Proposal
The FS_ULBC study item, initiated nearly a year ago, aims to establish a normative ULBC standard for voice communication over GEO within Rel-20. However, progress has been slow, with crucial issues such as end-to-end simulation parameters remaining unresolved. This contribution proposes a focused approach to meet 3GPP standardization timelines.
The document proposes separating ULBC standardization into two distinct phases to ensure timely delivery while accommodating future enhancements:
Baseline (Rel-20): - IMS Voice Call over GEO based strictly on Rel-19 service requirements
Advanced (Rel-21): - Multi-Party Voice Communication - IMS Voice Call with ULBC over additional access types beyond GEO
Baseline (Rel-20): - Single baseline UE Tx/Rx capability - Single CNR in UL and DL (e.g., UL single-tone 23 dBm: CNR=5.28 dB for SCS=3.75 kHz, CNR=-0.74 dB for 15 kHz; DL 12-tone single Rx: CNR=-0.61 dB) - Single agreed target bitrate compatible with baseline UE capability enabling acceptable system capacity - Reliance only on mandatory Rel-19 NB-IoT radio protocol features (except SPS) - i.i.d. random block error patterns - Single SPS/bundling period (160 ms)
Advanced (Rel-21): - Advanced UE capabilities (e.g., increased Tx power, multiple Rx antennas) - Multiple CNR assumptions in UL and DL - Codec designers may choose optimal bitrate/TBS per CNR - Allow reliance on expected Rel-20 and selected non-mandatory NB-IoT features - Simulated block error patterns based on advanced features - Additional SPS/bundling periods (e.g., 80 ms, 320 ms)
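The roughly 6 dB gap between the two quoted UL single-tone CNR values (5.28 dB at 3.75 kHz SCS vs. -0.74 dB at 15 kHz) follows from noise-bandwidth scaling at fixed received power. This consistency check is an illustration under that assumption, not part of the contribution:

```python
# Sketch: narrowing the subcarrier spacing reduces noise bandwidth and
# raises CNR by 10*log10(BW_old/BW_new) at fixed signal power.

import math

def cnr_after_bandwidth_change(cnr_db, bw_old_hz, bw_new_hz):
    """CNR change from moving the noise bandwidth at fixed signal power."""
    return cnr_db + 10 * math.log10(bw_old_hz / bw_new_hz)

# Starting from the quoted 3.75 kHz figure, recover the 15 kHz one:
print(round(cnr_after_bandwidth_change(5.28, 3750, 15000), 2))  # -0.74
```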
Baseline (Rel-20): - Single target bitrate derived from Rel-19 GEO IMS voice service requirements - Example: TBS=208 with SPS period 160 ms, achieving 950 bps net bitrate
Advanced (Rel-21): - Multiple target CNRs with bitrate as codec design choice - Additional bitrates for future 6G-related scenarios
Baseline (Rel-20): - Single sample rate: e.g., 16 kHz - Audio bandwidth: up to WB - Note: May depend on agreed target bitrate
Advanced (Rel-21): - Input/output sampling rates: at least 8, 16, 32, 48 kHz - Audio bandwidth unconstrained (codec design choice)
| Aspect | Baseline (Rel-20) | Advanced (Rel-21) |
|--------|-------------------|-------------------|
| Frame structure and algorithmic delay | Corresponding to SPS/bundling period (160 ms) or sub-multiples thereof; algorithmic delay excl. framing e.g. ≤80 ms (0.5 × SPS/bundling period) | Frame structure and algorithmic delay aligned with advanced SPS/bundling options and future 6G Media requirements |
| Complexity | Limited; sufficiently low to not preclude deployment on current-generation smartphones; TBD MMAC/s; e.g. 3M parameters | Relaxed, enabling multiple models; addressing future 6G Media requirements while leveraging new UE hardware trends |
| Bit rate / operating point | Required; capable of addressing single agreed-upon target bit rate and operation point of IMS Voice Call over GEO | Required; capable of supporting anticipated extended application scenarios beyond Rel-20 IMS Voice Call over GEO, while fulfilling potential 6G Media requirements |
| Noise handling | No requirement to provide noise suppression; required capability to handle and reconstruct noisy speech input with moderate to high SNR (noise reconstruction capability primarily enforced through performance requirements) | No requirement to provide noise suppression; required capability to handle speech and generic input anticipated in extended application scenarios |
| DTX | No requirement to support DTX (no separate DTX-related performance requirement) | DTX support may be required for certain extended application scenarios, depending on potential 6G Media requirements |
| Performance requirements | Requirements focusing on clean and noisy speech performance; NWT AMR 7.4 or NWT AMR-WB 8.85 depending on target bandwidth for clean speech, noisy speech (AMR/AMR-WB references operated with DTX on), and relevant transcoding cases with G.711, AMR, AMR-WB, EVS | Complex set of requirements considering required capability to handle speech and generic input anticipated in extended application scenarios |
| Test methodology | Subjective: P.800 DCR; test methodology and test plan should be conceptually aligned with corresponding EVS codec standardization Pdocs (e.g., DCR test design, applicable SNRs and types of noises for noisy speech test cases) | Subjective: suitable for critical evaluation of candidate codec(s) against the expected complex set of performance requirements |
SA4 is asked to adopt this phased approach for ULBC standardization as working assumption:
Rel-20 ULBC Baseline: GEO-focused functionality based solely on Rel-19 service requirements and mandatory Rel-19 features (except SPS), enabling completion of viable ULBC baseline standard within Rel-20 schedule
Rel-21 ULBC Advanced: Extended ULBC functionality aligned with finalized 6G Media requirements, supporting application scenarios beyond Rel-20 IMS Voice Call over GEO, possibly leveraging advanced UE capabilities, and providing backward compatible extension of Rel-20 baseline
This approach ensures deliverable ULBC baseline in Rel-20 while providing clear and orderly path toward enhanced ULBC design in Rel-21.
[FS_ULBC] Feasible TBS values and packet loss traces for 80ms bundling period for ULBC over NB-IoT NTN GEO channel
This contribution presents simulation results for the 80 ms bundling period following the Simulation One ("target QoS based simulation") methodology. The document provides:
The simulations cover the following parameter ranges:
Trace file naming convention established for both UL and DL scenarios.
Optimality criterion: Tradeoff between per-UE performance (TBS and BLER) and system capacity.
Key observation: The 3.75 kHz SCS configuration becomes optimal for higher power classes due to the better coding rate.
Note: Coarse 5ms granularity for NPDSCH time-domain configuration.
Observation: For the 80 ms bundling period with UE power class up to 31 dBm:
- All TBS values (144, 256, 328, 424) are feasible for BLERs 1%, 2%, 6%, and 10%
- Exception: TBS 424 is not feasible at 1% BLER
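As a rough cross-check of what these transport block sizes mean for the codec, the gross payload rate follows directly from TBS and bundling period. A sketch assuming one transport block delivered per 80 ms bundling period (header overhead and retransmissions ignored):

```python
# Sketch: gross payload rate per feasible TBS, assuming one transport
# block per 80 ms bundling period (overheads/retransmissions ignored).
BUNDLING_PERIOD_S = 0.080  # 80 ms bundling period from the simulations

def payload_bitrate(tbs_bits: int, period_s: float = BUNDLING_PERIOD_S) -> float:
    """Bits delivered per second if one TBS-sized block arrives per period."""
    return tbs_bits / period_s

for tbs in (144, 256, 328, 424):  # TBS values reported as feasible
    print(f"TBS {tbs:3d} bits -> {payload_bitrate(tbs):6.0f} bit/s")
```

Under this assumption TBS 256 already exceeds 3 kbit/s, consistent with the note elsewhere that bitrate values above 3000 bit/s may become relevant.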
299,391 traces are provided in the attached zip file for all 4 TBS values (144, 256, 328, 424).
Proposal: Include clauses 2 through 5 in the PD or TR to provide a workable example of determining configurations based on the optimal tradeoff between per-UE performance and system capacity.
[FS_ULBC] ULBC Performance Requirements
The document proposes establishing minimum performance requirements for the Ultra-Low Bitrate Codec (ULBC) based on the following rationale:
The document establishes two key performance anchors:
Applies to: clean speech, noisy speech, and packet loss conditions
At higher operating range (~3 kbps):
The document proposes a comprehensive list of reference codecs and operating points for ToR comparison testing in subjective evaluation:
The document proposes updates to Clause 8 (Performance requirements) of TR 26.940, adding:
- New Clause 8.1 (General) containing the performance requirements framework and minimum benchmarks
- New Clause 8.1.1 (A List of Reference Codecs and Operating Points) containing the reference codec list for subjective evaluation
[FS_ULBC] ULBC Codec Testing in Background Noise
This contribution proposes a testing framework for the Ultra-Low Bitrate Codec (ULBC) in noisy conditions, drawing from EVS codec testing methodologies. The document is a revision of S4-251786 from SA4#134 and proposes updates to TR 26.940 Clause 9.
The document argues against mandating NS algorithms within the codec specification based on several key considerations:
Device-Specific Optimization: NS algorithms are typically optimized for specific device microphone array configurations. A generic NS algorithm applied uniformly could result in suboptimal performance across different device types.
Codec Robustness vs. NS Artifacts: Testing ULBC with clean, noisy, and optionally NS-processed speech provides better understanding of the codec's inherent robustness. NS algorithms may introduce speech distortions that could bias codec testing results.
Emergency Call Requirements: For emergency calls, preserving background noise is critical as it may contain important contextual information (alarms, traffic, voices) that helps identify the caller's environment or ongoing danger.
Complexity and Latency Concerns: ML-based NS algorithms can be computationally complex, increasing power consumption and end-to-end latency. Mandating complex NS could burden some devices inefficiently.
The document advocates for flexibility in NS implementation to enable manufacturers to develop device-specific solutions.
Following EVS codec testing principles (TR 26.952), the proposal includes:
| Source Material | Noise Type | SNR | Test Methodology |
|-----------------|------------|-----|------------------|
| Clean speech | - | - | ITU-T P.800 ACR and/or DCR |
| Speech + Noise | Stationary (car, etc.) | 15 dB | ITU-T P.800 DCR |
| Speech + Noise | Non-stationary (street, babble, etc.) | 20-25 dB | ITU-T P.800 DCR |
This framework aligns with EVS testing which used:
- Car noise at 15 dB
- Street noise at 20 dB
- Office/babble noise at 20 dB
- ITU-T P.800 DCR methodology ("Degradation of Speech in Noise" DMOS test)
To characterize ULBC robustness in challenging low SNR conditions:
| Source Material | Noise Type | SNR | Test Methodology |
|-----------------|------------|-----|------------------|
| Speech + Noise | Stationary (car, etc.) | 5-10 dB | ITU-T P.800 DCR |
| Speech + Noise | Non-stationary (street, babble, etc.) | 10-15 dB | ITU-T P.800 DCR |
| NS processed speech + Noise | Stationary (car, etc.) | 5-10 dB | ITU-T P.800 DCR |
| NS processed speech + Noise | Non-stationary (street, babble, etc.) | 10-15 dB | ITU-T P.800 DCR |
Key Notes:
- To avoid bias, a common NS processing tool should be used for generating NS-processed speech
- Selection of specific noise types and the NS processing tool is FFS
- Reference is made to TR 26.989 v19.0.0 (MCPTT work), where EVS was evaluated in siren noise at 5 dB SNR
The document proposes adding new Clause 9.1.4 to TR 26.940 with two subclauses:
The document seeks Discussion and Agreement on: 1. The proposed testing framework for ULBC in noisy conditions 2. Updates to TR 26.940 Clause 9 as specified in the text proposal
[FS_ULBC] On device capability diversity
This document (revision of S4aA260006) addresses UE capability diversity in NB-IoT NTN deployments for ULBC voice services. It proposes a capability-aware system design approach rather than assuming uniform baseline UE capabilities, accompanied by a pCR to TR 26.940.
Identified Capability Dimensions:
Future: up to 37 dBm under study for Rel-20
Receive Antenna Configurations:
Enhanced: Dual RX antennas (providing ~3 dB gain)
Advanced Features:
Key Insight: These capabilities are optional and vary across device categories, market segments, and implementations.
Enhanced UE capabilities enable:
Proposed Scheduling Strategy:
Practical Example (Figure 1):
Proposed Approach:
Note: Actual bitrates subject to ongoing TBS discussions; values >3000 bits/s may become relevant.
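The capability-aware idea can be illustrated with a toy selection rule. In this sketch each optional capability contributes link margin that unlocks a larger TBS; all margin and threshold numbers are hypothetical placeholders except the ~3 dB dual-RX figure cited in the contribution:

```python
# Toy sketch of capability-aware configuration: optional UE capabilities
# contribute link margin, and more margin unlocks a larger TBS.
# Only the ~3 dB dual-RX gain comes from the contribution; the 8 dB
# high-power figure and all thresholds are illustrative assumptions.
MARGIN_DB = {"dual_rx": 3.0, "hpue_31dBm": 8.0}
TBS_BY_MARGIN = [(0.0, 144), (3.0, 256), (8.0, 328), (11.0, 424)]

def select_tbs(capabilities):
    """Pick the largest TBS whose margin threshold the UE meets."""
    margin = sum(MARGIN_DB[c] for c in capabilities)
    return max(tbs for threshold, tbs in TBS_BY_MARGIN if margin >= threshold)
```

A baseline UE would stay at the smallest TBS, while a dual-RX high-power UE could be scheduled with the largest, which is the per-UE/system-capacity tradeoff the contribution describes.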
Documents capability variations for NB-IoT NTN:
Replaces assumption of uniform UE configuration with capability-aware scheduling:
Includes Figure 1 demonstrating practical multi-user scheduling scenario with three UE types.
Updates delay estimation tables (5.1.2-2, 5.1.3-1) to include:
Key dependencies: S4aA260006 (previous version), S4-260144 (TR 26.940 v0.5.1), S4-260255 (ULBC Re-Focus Proposal), TS 36.763 (UE radio transmission/reception), S4-251863 (system capacity), S4aA250112 (error trace methodology), S4aA250118 (RAN simulation results)
On the use of objective metrics in ULBC standardization
This document addresses the "Study on Ultra Low Bitrate Speech Codec" (FS_ULBC) approved at SA#107, specifically focusing on study objective 5 from the WID regarding performance requirements and test methodologies for speech quality, intelligibility, and conversational quality across various conditions (clean/noisy speech, tandeming with IMS codecs, clean/GEO channel conditions).
The contribution provides correlation analysis results of objective quality models as a complement to subjective test results on clean speech and music/mixed content in TR 26.940, building upon previous discussions in S4-251814.
Quality Impairment Categories for ULBC: - Loss of listening-only audio quality - Audio bandwidth loss - Impaired intelligibility - Impaired speaker identifiability - Prosodic impairments - Hallucination (word and phone confusions) - Sensitivity to non-speech input (background noise, music, noisy speech, interfering talkers, reverberant speech)
Testing Challenges: - ML-based ULBC codecs introduce new impairment categories (e.g., hallucination) not present in signal-processing based codecs (AMR, AMR-WB, EVS) - Traditional P.800 ACR methodology may not optimally quantify all potential impairments - DCR methodology focuses on differences to reference, suitable for small impairments and prosodic differences - Previous 3GPP codec standardization (AMR, AMR-WB, EVS) used ACR for clean speech and DCR for SWB, mixed-bandwidth, noisy speech, and music evaluations
Alternative Test Methods Listed: - Diagnostic Rhyme Tests (DRT) - Modified Rhyme Tests (MRT) - MOS testing for speaker similarity - Speaker verification/identification tests - Prosodic naturalness MOS tests - Intonation recognition tests - Transcription tests for word and semantic equivalence - Phoneme recognition tests - Automatic speech recognition tests - P.835 multi-dimensional rating scales for speech enhancement evaluation
Source Material Robustness (9.1.3.1): - Multiple languages with diverse intonations - Various phonetic and linguistic environments - Different voice pitches and speaking styles - Overlapping talkers
Real-world Acoustic Conditions (9.1.3.2): - Clean environments (minimal background noise) - Noisy environments (traffic, human chatter, vehicle) - Varying reverberation levels (RT60 ranging from 0.3s to 1.0s)
Tandeming and Compatibility Testing (9.1.3.3): - Testing with speech previously encoded by ITU-T G.711, AMR, AMR-WB, and EVS - Various input levels: -16dBov, -26dBov, and -36dBov
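The listed input-level conditions can be produced by gain-scaling the source items to the target dBov value. A simplified sketch, using plain RMS relative to digital full scale as a stand-in for the ITU-T P.56 active speech level normally used for dBov targets:

```python
import math

def scale_to_level(samples, target_dbov, full_scale=32768.0):
    """Scale `samples` so their RMS sits `target_dbov` dB below digital
    full scale (simplified stand-in for P.56-based level alignment)."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    gain = (full_scale * 10 ** (target_dbov / 20.0)) / rms
    return [s * gain for s in samples]
```

Applying this with targets of -16, -26 and -36 yields the three input levels from the list above.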
Conclusion: - P.800 ACR/DCR serves as backbone for most subjective testing - Other methodologies may be considered - Emphasis on diverse test material covering multilingual/multi-speaker testing, real-world acoustic conditions, and tandeming
Correlation Analysis on Clean Speech (9.1.4.1):
Evaluated objective models from references [7-11]: - Speech-oriented metrics: PESQ, POLQA, ViSQOL-S, WARP-Q, DNSMOS, NISQA, NORESQA, UTMOS - General audio metrics: PEAQ, ViSQOL-A - Additional metric: SCOREQ
Evaluation metrics used: Pearson correlation coefficient, RMSE, Kendall's Tau rank correlation coefficient
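The three figures of merit named above can be sketched in a few lines of pure Python; in practice `scipy.stats.pearsonr` and `kendalltau` would be used, and this version implements the tau-a variant without tie correction:

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson linear correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / (math.sqrt(sum((a - mx) ** 2 for a in x)) *
                  math.sqrt(sum((b - my) ** 2 for b in y)))

def rmse(x, y):
    """Root-mean-square error between predicted and subjective scores."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))

def kendall_tau(x, y):
    """Kendall's tau-a rank correlation (no tie correction)."""
    pairs = list(combinations(range(len(x)), 2))
    concordance = sum(
        (1 if d > 0 else -1 if d < 0 else 0)
        for i, j in pairs
        for d in [(x[i] - x[j]) * (y[i] - y[j])]
    )
    return concordance / len(pairs)
```

Pearson and RMSE measure agreement on the MOS scale (typically after a monotonic mapping), while Kendall's tau checks only rank ordering and is insensitive to such mappings.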
Key Observations (Clean Speech): - Best performing models (POLQA, UTMOS, PESQ, WARP-Q, SCOREQ) accurately predicted monotonic bitrate/quality behavior of multirate codecs - Models operating at 16 kHz (PESQ without mapping, UTMOS and WARP-Q with mapping) showed relatively good performance even for fullband codecs - Mapping generally improves accuracy (RMSE) except for few models (PESQ, POLQA)
Correlation Analysis on Music/Mixed Content:
Evaluated models from references [7-12]: POLQA, PEMO-Q, ViSQOL-A, and 2f-model
Key Observations (Music/Mixed Content): - POLQA (despite not being recommended for non-speech signals) gave the best correlation results (Pearson, Kendall, RMSE after 3rd-order mapping) - The 2f-model was second-best performing - ViSQOL Audio, PEAQ, and PEMO-Q showed only fair performance despite being adapted to music/mixed content - Correlation scores were lower than for clean speech, possibly due to the more difficult task of predicting quality for general audio and a mismatch with DCR test methodology grading
Discussion (9.1.4.2): - P.862 (PESQ) officially "withdrawn" by ITU-T, cannot be considered valid standard - P.863 remains main ITU-T standard, P.SAMD emerging as potential alternative - Testing and parameter adjustment based on objective tools not recommended - 3GPP TR 26.921 documented that tuning noise reduction based on PESQ should be avoided
Conclusion (9.1.4.3): - Subjective testing remains "golden reference" for codec selection - Objective metrics NOT recommended for codec selection criteria or codec tuning - Correlation of subjective/objective metrics may be considered for characterization of new codec - Objective metrics have merits in other tasks such as codec conformance testing
This is a proposed Change Request (pCR) to TR 26.940, specifically targeting Clause 9 (Test methodologies) with additions to subclauses 9.1.1 through 9.1.4.