Meeting: TSGS4_135_India | Agenda Item: 7.8
38 documents found
| TDoc Number | Source | Title | Summary |
|---|---|---|---|
| Bytedance |
[FS_ULBC] Analysis on complexity evaluation of ULBC with WMOPS
|
Analysis on Complexity Evaluation of ULBC with WMOPS
1. Introduction
This contribution examines the use of WMOPS (Weighted Million Operations Per Second) as a complexity metric for ULBC (Ultra Low Bitrate Codec). WMOPS has been proposed as one possible complexity metric and is traditionally used to evaluate the complexity of 3GPP speech codecs. The analysis focuses on the WMC tool used for automated WMOPS calculation with floating-point C code.
2. Technical Analysis: Discrepancies Between ITU-T Documentation and WMC Tool Implementation
The source conducted systematic testing of the WMC tool against the examples provided in ITU-T standards documentation (specifically clause 18.12.7 and related tables in the ITU-T Software Tool Library 2024 User's Manual). Several discrepancies were identified:
2.1 'Move' Operator Counting
Issue: Extra MOVE operations are counted by the WMC tool
2.2 Increment Operator ('++')
Issue: Missing operations in WMC tool output
2.3 Logical Operators ('AND/OR')
Issue: Missing TEST operation counting
2.4 Indirect Addressing
Issue: Extra MOVE operation and incorrect INDIRECT counting
2.5 Instrumentation with Array Subscripts
Issue: Arithmetic operations inside array subscripts not counted
3. Observations and Impact Assessment
The source identifies three key observations:
4. Proposal
The source proposes to document the findings from Clause 2 and Clause 3 in clause 6.2 of the permanent document, so that these WMOPS calculation issues are properly considered in the ULBC complexity evaluation framework. |
|
| Bytedance |
[FS_ULBC] Influence of code optimization on WMOPS
|
Summary of S4-260128: Influence of Code Optimization on WMOPS
1. Introduction and Motivation
This contribution investigates the impact of C code implementation choices on WMOPS (Weighted Million Operations Per Second) measurements for neural audio codecs, specifically in the context of the ULBC (Ultra-Low Bitrate Codec) study. The source examines whether WMOPS, traditionally used for 3GPP speech codec complexity evaluation, is suitable for neural audio codecs, given that the actual C implementation can significantly affect WMOPS measurements.
2. Experimental Analysis
2.1 Operator-Level Analysis
The source conducted experiments on Conv1D and Conv1DTranspose operators, which are used extensively in DAC (Descript Audio Codec) for audio feature dimension manipulation:
Key Results:
- Conv1D: 441 WOPS (non-optimized) → 301 WOPS (optimized) = 68.25% ratio
- Conv1DTranspose: 554 WOPS (non-optimized) → 260 WOPS (optimized) = 46.93% ratio
Finding: The same optimization strategy yields significantly different optimization ratios for different operators.
2.2 Full-Model Level Analysis
Using the optimized and non-optimized operator implementations, the source measured WMOPS for two DAC configurations (enc16dec384 and enc64dec1536) and compared them against previously reported results [4]:
Total WMOPS:
- enc16dec384: 13,320.35 (non-opt) → 8,152.01 (opt) → 5,785.17 (reported in [4])
- enc64dec1536: 201,552.55 (non-opt) → 123,966.49 (opt) → 84,441.99 (reported in [4])
Encoder WMOPS:
- enc16dec384: 3,411.08 (non-opt) → 2,621.98 (opt) → 1,060.79 (reported in [4])
- enc64dec1536: 50,089.70 (non-opt) → 39,604.59 (opt) → 13,675.30 (reported in [4])
Decoder WMOPS:
- enc16dec384: 9,847.22 (non-opt) → 5,484.21 (opt) → 4,724.38 (reported in [4])
- enc64dec1536: 151,291.59 (non-opt) → 84,255.25 (opt) → 70,766.69 (reported in [4])
3. Observations and Conclusions
The source draws two critical observations:
Main Conclusion: If WMOPS is adopted as a complexity metric for ULBC, results will be strongly influenced not only by model design but also by the actual C code implementation, potentially making comparisons between different codec proposals inconsistent.
4. Proposal
The source proposes to document the experimental findings and observations as a new clause 7.6.5 "WMOPS analysis on DAC" in TR 26.940, with three sub-clauses:
- 7.6.5.1: On operator level
- 7.6.5.2: On full-model level
This would capture the implementation-dependency issues of WMOPS measurements for neural audio codecs in the technical report. |
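The optimization ratios quoted in the S4-260128 summary can be reproduced with a few lines of arithmetic. The sketch below is only an illustration: the numbers are copied from the summary, while the function name `optimization_ratio` and the dictionary layout are assumptions made for this example and do not come from the contribution.

```python
# Values copied from the S4-260128 summary: (non-optimized, optimized).
# The names below are illustrative assumptions, not taken from the source.
OPERATOR_COST = {
    "Conv1D": (441, 301),
    "Conv1DTranspose": (554, 260),
}

# Total WMOPS per DAC configuration: (non-optimized, optimized, reported in [4]).
MODEL_TOTAL_WMOPS = {
    "enc16dec384": (13320.35, 8152.01, 5785.17),
    "enc64dec1536": (201552.55, 123966.49, 84441.99),
}


def optimization_ratio(non_optimized: float, optimized: float) -> float:
    """Return the optimized cost as a percentage of the non-optimized cost."""
    return 100.0 * optimized / non_optimized


if __name__ == "__main__":
    for name, (non_opt, opt) in OPERATOR_COST.items():
        print(f"{name}: {optimization_ratio(non_opt, opt):.2f}% of non-optimized cost")
    for name, (non_opt, opt, reported) in MODEL_TOTAL_WMOPS.items():
        print(
            f"{name}: optimized = {optimization_ratio(non_opt, opt):.2f}%, "
            f"reported = {optimization_ratio(non_opt, reported):.2f}% of non-optimized total"
        )
```

Running the same function over the encoder and decoder figures shows how far the ratio moves between components of one model, which is the implementation-dependency concern the contribution raises.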
|
| China Mobile Com. Corporation |
[FS_ULBC] Discussion of FS_ULBC Objective Speech Quality Assessment Method
|
Summary of S4-260132: Discussion of FS_ULBC Objective Speech Quality Assessment Method
Background
This contribution addresses speech quality assessment challenges for ultra-low bitrate codecs (ULBC). While subjective testing remains the benchmark for ULBC codec selection, objective speech evaluation methods can serve as predictive tools during intermediate testing and parameter adjustment, enabling more convenient and efficient quality verification.
Overview of Existing Speech Objective Quality Evaluation Methods
The document provides a comprehensive comparison of available objective assessment tools:
Standardized ITU-T Methods
Open Source Methods
Capabilities and Limitations for ULBC
The document analyzes each method's suitability for ultra-low bitrate scenarios:
Proposal
Recommended Objective Assessment Methods
After excluding unsuitable methods, the contribution recommends considering P.863, ViSQOL, and ESTOI as potential objective quality assessment methods for ULBC.
Text Proposal for TR 26.940
The document proposes a pCR to TR 26.940 Section 9 (Test methodologies) that includes:
New Section 9.1.1: Typical Quality Impairments
Identifies ULBC-specific impairment categories:
- Loss of listening-only audio quality
- Audio bandwidth loss
- Impaired intelligibility
- Impaired speaker identifiability
- Prosodic impairments
- Hallucination (word and phone confusions)
- Sensitivity to non-speech input (background noise, music, reverberant speech)
New Section 9.1.2: Challenges of Quality Assessment
Addresses testing challenges specific to ULBC:
Technical Contribution
The main technical contribution is establishing a framework for objective quality assessment in ULBC standardization that:
1. Recognizes the unique challenges of ML-based codecs
2. Identifies suitable objective methods as predictive tools
3. Proposes their documentation as optional assessment methods in TR 26.940
4. Maintains subjective testing as the primary benchmark while enabling more efficient intermediate evaluation |
|
| Xiaomi Technology |
Updates to the simulation results for FS_ULBC
|
Summary of S4-260136: Updates to Simulation Results for FS_ULBC
1. Introduction and Context
This document presents updated link-level simulation (LLS) results for Ultra-Low Bitrate Communication (ULBC) over Non-Terrestrial Networks (NTN). The simulations follow the NTN-TDL-C channel model as specified in TS 36.102. This revision adds:
- Missing simulation results for NTN-TDL-C 10 NPUSCH
- New simulation results for NTN-TDL-C 10 NPDSCH
- Updated TBS (Transport Block Size) values for both NPDSCH and NPUSCH with 10 degrees elevation angle
The simulations are based on parameters discussed in S4aA250038 and follow agreements from previous meetings.
2. Channel Model Assumptions
Key Parameters:
- Satellite elevation angle: 12.5 degrees for link budget calculations
- Channel model parameters (delay and power of each path) determined for 10 degrees elevation (approximation of 12.5 degrees)
- Channel model: NTN-TDL-C from TS 36.102
3. Link Budget Analysis
3.1 CNR Baseline Values (from RAN1)
Uplink:
- CNR = 2.6 dB (0 dBi UE antenna gain, 3.75 kHz SCS, 1 tone, 23 dBm UE TX power)
Downlink:
- CNR = -3.3 dB (0 dBi UE antenna gain, 15 kHz SCS, 12 tones, 1 RX antenna, 7 dB noise figure)
3.2 Additional UL CNR Configurations
| Configuration | SCS | UE Power | UL CNR |
|---------------|-----|----------|---------|
| Config 1 | 3.75 kHz | 23 dBm | 2.6 dB |
| Config 2 | 15 kHz | 23 dBm | -3.42 dB |
| Config 3 | 3.75 kHz | 26 dBm | 5.6 dB |
| Config 4 | 15 kHz | 26 dBm | -0.42 dB |
| Config 5 | 3.75 kHz | 31 dBm | 10.6 dB |
| Config 6 | 15 kHz | 31 dBm | 4.58 dB |
3.3 Additional DL CNR Configurations
| Configuration | Number of RX | G/T value | DL CNR |
|---------------|--------------|-----------|---------|
| Config 1 | 1 | -31.6 | -3.3 dB |
| Config 2 | 2 | -31.6 | -0.3 dB |
| Config 3 | 1 | -28.6 | -0.3 dB |
| Config 4 | 2 | -28.6 | 2.7 dB |
4. NPUSCH Simulation Results
4.1 Common Simulation Parameters
4.2 Results Part 1 - Various TBS and Bundling Times
80 ms Bundling Time
144 bits (Cases 1-4):
- Case 1: 15 kHz, MCS 2, 4 RUs, 2 reps → SNR: -4.97 dB (10% BLER) to -4.35 dB (1% BLER)
- Case 2: 15 kHz, MCS 2, 2 RUs, 1 rep → SNR: 1.8 dB to 2.7 dB
- Case 3: 3.75 kHz, MCS 10, 1 RU, 2 reps → SNR: 1.5 dB to 2.30 dB
- Case 4: 15 kHz, MCS 10, 1 RU, 4 reps → SNR: -4.64 dB to -3.90 dB
256 bits (Cases 5-8):
- SNR ranges from -2.84 dB to 5.9 dB depending on configuration
328 bits (Cases 9-11):
- SNR ranges from -1.53 dB to 9.45 dB depending on configuration
424 bits (Case 12):
- SNR: 1.44 dB (10% BLER) to 2.05 dB (1% BLER)
160 ms Bundling Time
208 bits (Cases 13-15):
- SNR ranges from -5.56 dB to 1.53 dB
424 bits (Case 16):
- SNR: -1.95 dB to -1.52 dB
600 bits (Case 17):
- SNR: -1.38 dB to -0.97 dB
808 bits (Cases 18-19):
- SNR ranges from -1.42 dB to 0.21 dB
320 ms Bundling Time
328 bits (Cases 20-25):
- SNR ranges from -6.80 dB to -0.22 dB
776 bits (Cases 26-27):
- SNR ranges from -2.48 dB to 6.46 dB
1000 bits (Cases 28-30):
- SNR ranges from -1.95 dB to 7.47 dB
1544 bits (Case 31):
- SNR: 0.48 dB to 0.76 dB
4.3 Results Part 2 - Additional Cases
Covers Cases 32-46 with various TBS values (440, 584, 680, 936, 1096, 1384, 1736 bits) for 80 ms and 160 ms bundling times. SNR requirements range from -3.6 dB to 8.0 dB depending on configuration.
5. NPDSCH Simulation Results
5.1 Common Simulation Parameters
5.2 Results Part 1 - Various TBS and Bundling Times
80 ms Bundling Time
144 bits (Case 0a):
- 1R: SNR -6.6 dB to -5.3 dB
- 2R: SNR -9.3 dB to -8.1 dB
256 bits (Case 0b):
- 1R: SNR -4.3 dB to -3.1 dB
- 2R: SNR -7.1 dB to -6.1 dB
328 bits (Cases 1-2):
- SNR ranges from -11.8 dB to -4.0 dB
424 bits (Cases 3, 3b):
- SNR ranges from -11.6 dB to -5.0 dB
160 ms Bundling Time
208 bits (Case 4):
- SNR: -15.3 dB to -11.8 dB
424 bits (Cases 5, 5b):
- SNR ranges from -14.6 dB to -8.0 dB
600 bits (Case 6):
- SNR: -11.1 dB to -7.2 dB
808 bits (Cases 7, 7b, 8, 8b):
- SNR ranges from -11.0 dB to -1.1 dB
320 ms Bundling Time
328 bits (Cases 9-11b):
- SNR ranges from -17.7 dB to -8.1 dB
776 bits (Cases 12, 12b):
- SNR ranges from -14.8 dB to -8.1 dB
1000 bits (Case 13):
- SNR: -10.3 dB to -6.4 dB
1544 bits (Cases 14, 14b):
- SNR ranges from -11.7 dB to -5.0 dB
5.3 Results Part 2 - Additional Cases
Covers Cases 15-46 with various TBS values (440, 584, 680, 936, 1096, 1384, 1736 bits) for 80 ms and 160 ms bundling times.
1T1R Results:
- SNR requirements range from -10.9 dB to 1.1 dB
1T2R Results:
- SNR requirements range from -13.6 dB to -1.9 dB
- Approximately 3 dB gain compared to 1T1R configurations
6. Conclusions
The document recommends considering these simulation results for determining design constraints for ULBC. The results demonstrate:
- Performance across various TBS values (144 to 1736 bits)
- Multiple bundling times (80, 160, 320 ms)
- Different SCS configurations (3.75 kHz, 15 kHz)
- Impact of repetitions on SNR requirements
- Benefits of 2 RX antennas (approximately 3 dB gain) |
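The UL and DL CNR configurations listed in the S4-260136 summary are numerically consistent with simple decibel scaling from the two RAN1 baseline values. The sketch below reproduces the table entries under assumptions that are not stated in the summary: that UL noise power scales with the subcarrier spacing of the single tone, that UL CNR tracks UE TX power dB for dB, and that DL CNR gains 10·log10(N) dB from N RX antennas and tracks G/T dB for dB. The function names are hypothetical, and the source may have derived its values differently.

```python
import math

# Baselines copied from the summary (clause 3.1).
UL_BASE_CNR_DB = 2.6      # 3.75 kHz SCS, 1 tone, 23 dBm UE TX power
UL_BASE_SCS_KHZ = 3.75
UL_BASE_TX_DBM = 23.0
DL_BASE_CNR_DB = -3.3     # 1 RX antenna, G/T = -31.6
DL_BASE_G_OVER_T = -31.6


def ul_cnr_db(scs_khz: float, tx_power_dbm: float) -> float:
    """UL CNR assuming noise bandwidth scales with SCS (assumption)."""
    bandwidth_penalty_db = 10.0 * math.log10(scs_khz / UL_BASE_SCS_KHZ)
    power_gain_db = tx_power_dbm - UL_BASE_TX_DBM
    return UL_BASE_CNR_DB + power_gain_db - bandwidth_penalty_db


def dl_cnr_db(num_rx: int, g_over_t: float) -> float:
    """DL CNR assuming ideal combining gain of 10*log10(num_rx) (assumption)."""
    combining_gain_db = 10.0 * math.log10(num_rx)
    return DL_BASE_CNR_DB + combining_gain_db + (g_over_t - DL_BASE_G_OVER_T)


if __name__ == "__main__":
    for scs, power in [(3.75, 23), (15, 23), (3.75, 26), (15, 26), (3.75, 31), (15, 31)]:
        print(f"UL: SCS {scs} kHz, {power} dBm -> {ul_cnr_db(scs, power):.2f} dB")
    for rx, gt in [(1, -31.6), (2, -31.6), (1, -28.6), (2, -28.6)]:
        print(f"DL: {rx} RX, G/T {gt} -> {dl_cnr_db(rx, gt):.1f} dB")
```

Under these assumptions, moving from 3.75 kHz to 15 kHz costs about 6.02 dB and a second RX antenna gains about 3.01 dB, which matches the roughly 3 dB 1T2R advantage reported in the NPDSCH results.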
|
| Huawei Technologies Co., Ltd. |
[FS_ULBC] On eCall scenario for ULBC
|
Summary of S4-260137: On eCall Scenario for ULBC
1. Background
This contribution addresses the eCall (emergency call) scenario for the Ultra-Low Bitrate Codec (ULBC) work. Previous contributions (S4-251908, SA-251848, SA-251881) emphasized the importance of preserving background signals in emergency communications. China has developed a related national standard, "On-Board Emergency Call System for Road Vehicles", expected to take effect on July 1, 2027. The document highlights that eCall scenarios have special requirements and different conditions compared to regular call scenarios, necessitating different design constraints and test methodologies.
2. eCall Scenario Description
2.1 System Overview
The eCall system is an in-vehicle safety technology that:
- Automatically dials emergency numbers (e.g., 112 in EU) upon severe collision detection
- Sends minimum data set (MSD) including GPS location, VIN, collision direction and time
- Can be triggered by built-in sensors or manual SOS button
- Functions via GEO satellite even without terrestrial network coverage
2.2 Communication Architecture
The bi-directional voice data flow involves:
- Vehicle side: Integrated microphones and speakers communicating over GEO satellite network
- Emergency response center: Connected via terrestrial mobile network (VoLTE, VoNR), fixed-line, or other IMS-supported platform
- Key requirement: Background noise captured within vehicle must be delivered with fidelity to emergency response center
- Asymmetric requirement: Noise preservation may not be required in the opposite direction (emergency center to vehicle)
- Dedicated system: No mobile phones involved in the communication link
3. Key Observations
Observation 1: eCall is a dedicated system between vehicles and emergency response centers. A speech codec designed for eCall is not necessarily the same as one for regular call scenarios, which allows separate design constraints or performance requirements for ULBC-eCall.
Observation 2: Vehicle and emergency response center have significantly different hardware capabilities compared to regular call scenarios:
- Less sensitive to power consumption
- Higher computational capability
- Higher storage capability
- This allows for relaxed design constraints and more critical performance requirements for ULBC-eCall
4. Proposed Changes to TR 26.940
4.1 New Clause 4.5: eCall Communication
The contribution proposes adding a new scenario (Scenario 4) to TR 26.940 documenting:
High-level Prerequisites for ULBC in eCall:
4.2 Modified Clause 6.2: Design Constraint Parameters
The contribution proposes creating separate design constraint columns in Table 6.2-1:
- Design Constraint (regular call): Existing constraints
- Design Constraint (eCall): New column with eCall-specific constraints
Key Differences for eCall Design Constraints:
| Parameter | Regular Call | eCall |
|-----------|-------------|-------|
| Noise Suppression | Not required; noise suppression may be applied | Background noise preserved during call (at least vehicle-to-center direction); opposite direction may not require preservation |
| DTX Support | Support | No DTX support during call (at least vehicle-to-center direction) |
| Complexity/Memory | Standard mobile constraints | Relaxed constraints possible |
5. Technical Contributions
The main technical contributions of this document are:
The document establishes that eCall scenarios justify different codec design approaches due to their dedicated nature, different hardware capabilities, and specific regulatory/safety requirements. |
|
| Huawei Technologies Co., Ltd. |
[FS_ULBC] On target platforms for ULBC
|
Summary of S4-260141: Target Platforms for ULBC
1. Introduction and Motivation
This contribution addresses a gap in TR 26.940 regarding target platforms for Ultra Low Bit rate Codec (ULBC) deployment. While the TR currently discusses the NPU as a possible platform in clause 6.2.1.5.1, it lacks coverage of non-NPU platforms. The document aims to fill in this missing information, focusing in particular on DSP-enabled devices.
2. Technical Problem Statement
The contribution identifies an inconsistency in TR 26.940:
The source references previous contributions (S4aA250267 and S4-251747) that discussed the need for DSP deployment and provided clarification on DSP-enabled UE devices.
3. DSP-Enabled Device Definition
The contribution adopts the definition from S4-251747 for DSP-enabled UE devices:
4. Proposed Text Changes
Main Technical Contribution
The proposal adds a new paragraph to clause 6.2.1.5.1 that:
Context Preservation
The proposal maintains the existing text about:
- NPU prevalence in modern smartphones
- NPU being 5-20x more power efficient than CPUs for AI tasks
- The note that ULBC may need to run on non-NPU platforms in certain configurations
5. Technical Impact
This contribution ensures that TR 26.940 provides comprehensive guidance on target platforms for ULBC deployment, balancing the AI-optimized NPU approach with the power-efficient DSP approach, thereby supporting a wider range of device implementations and use cases. |
|
| Huawei Technologies Co., Ltd. |
[FS_ULBC] On complexity and memory constraints for ULBC
|
Summary of S4-260142: On Complexity and Memory Constraints for ULBC
Introduction
This contribution addresses complexity and memory constraints for Ultra Low Bitrate Codec (ULBC) as part of the study in TR 26.940. The document aims to clarify previous discussions on measurement metrics and specific constraints, proposing concrete values for complexity, RAM, and ROM requirements.
Main Technical Contributions
Complexity Measurement Metrics
The contribution proposes using both MACS (Million Multiply-Accumulate Operations per Second) and Codec/Model Size together to characterize ULBC complexity, rather than relying on a single metric:
Memory Constraints Clarification
The document clarifies confusion from previous contributions (S4aA250253 and S4-251807) regarding the 5-10M parameters proposal:
ROM Constraints
RAM Constraints
Complexity Constraints
MACS Reference Point
The contribution references the 2025 Low-Resource Audio Codec (LRAC) Challenge sponsored by Cisco Systems as a relevant benchmark:
LRAC Challenge Requirements:
- Sampling rate: 24 kHz
- Mono audio input
- Bitrate: up to 1 kbps (ultralow) and up to 6 kbps (low)
- Latency: 30 ms (Track 1) or 50 ms (Track 2)
- Compute complexity: ≤ 350 MMACS total; ≤ 150 MMACS receive-side
- Winner (ByteDance) used ~4M parameters
Proposed MACS Value
Proposed Design Constraints Summary
The contribution proposes the following specific constraints for ULBC:
Text Proposal
The contribution includes a change request to TR 26.940, Section 6.2 (Design Constraint Parameter), Table 6.2-1, adding the specific complexity and memory constraints detailed above to the "Complexity and memory demands" parameter row. |
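To give a feel for how a parameter count and a MACS budget of the kind discussed in S4-260142 translate into numbers, the sketch below estimates weight storage from parameter count and precision, and checks a codec against the LRAC compute limits quoted in the summary. It is a first-order illustration only. The helper names are hypothetical, the storage estimate covers weights alone (no code, tables, or runtime buffers), and treating "receive-side" as the decoder is an assumption of this example. The ROM and RAM values actually proposed by the contribution are not reproduced here.

```python
# LRAC Challenge compute limits as quoted in the summary (MMACS).
LRAC_TOTAL_LIMIT_MMACS = 350.0
LRAC_RECEIVE_LIMIT_MMACS = 150.0


def weight_storage_mib(num_params: int, bits_per_param: int) -> float:
    """Storage needed for the model weights alone, in MiB (2**20 bytes)."""
    return num_params * bits_per_param / 8 / 2**20


def within_lrac_budget(encoder_mmacs: float, decoder_mmacs: float) -> bool:
    """Check both LRAC limits, assuming receive-side means the decoder."""
    total = encoder_mmacs + decoder_mmacs
    return total <= LRAC_TOTAL_LIMIT_MMACS and decoder_mmacs <= LRAC_RECEIVE_LIMIT_MMACS


if __name__ == "__main__":
    # Parameter counts from the summary: the 5-10M range and the ~4M LRAC winner.
    for params in (4_000_000, 5_000_000, 10_000_000):
        for bits in (8, 16):
            size = weight_storage_mib(params, bits)
            print(f"{params / 1e6:.0f}M parameters @ {bits}-bit: {size:.1f} MiB")
    # Hypothetical encoder/decoder split, for illustration only.
    print(within_lrac_budget(encoder_mmacs=200.0, decoder_mmacs=150.0))
```

The same estimate shows why precision matters as much as parameter count: halving the bits per parameter halves the weight storage.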
|
| China Mobile Com. Corporation |
[FS_ULBC]TR 26.940 V 0.5.1
|
3GPP TR 26.940 - Study on Ultra Low Bit rate Speech Codecs (Release 20)
Document Overview
This Technical Report documents the study on Ultra Low Bit rate Speech Codecs (ULBC) for 3GPP Release 20. The primary focus is on IMS voice services over Geostationary Orbit (GEO) satellite access, with additional consideration for multi-party voice communication and other access types.
1. Application Scenarios for Ultra-Low Bit Rate Communication Services
1.1 Scenario 1: IMS Voice Call over GEO (Primary Scenario)
Background:
- GEO satellites operate at 35,786 km altitude, resulting in ~285ms one-way propagation delay
- TR 22.887 and TS 22.261 assume total transmission data rates of [1-3] kbit/s
- Current 3GPP codecs (lowest: AMR at 4.75 kbit/s) cannot support these constraints
Scenario Descriptions:
Main Scenario (4.2.2.2): One UE connects via GEO satellite access
- UE1: Phone supporting IMS voice over GEO satellite
- UE2: Either "regular" phone (requiring transcoding in core network) or "upgraded" phone supporting ULBC over other access (enabling transcoder-free operation)
Sub-Scenario (4.2.2.3): Both UEs connect via GEO satellite access
- Less common but relevant for disaster/cyberattack scenarios
- Even with transparent payload, voice packets transmit to ground before reaching other UE
- May enable transcoder-free operation
High-level Prerequisites:
- Very low bitrate support
- DTX support [TBC]
- Error concealment
- Real-time implementation on smartphones
- Good audio quality for reasonable QoE
1.2 Scenario 2: Multi-Party Voice Communication
Background:
- Addresses poor/unstable network conditions in WLAN access
- Network congestion during peak usage or in areas with limited infrastructure
- Codec selection critical for maintaining quality under bandwidth constraints
Scenario Description:
- One participant (UE1) on unstable network using ULBC, other (UE2) on stable network with conventional codec (requires transcoding)
- Both participants on unstable networks using ULBC simultaneously (no transcoding needed)
High-level Prerequisites:
- Ultra-low bitrate capability
- Real-time operation on consumer devices (smartphones, laptops)
- Audio quality matching or exceeding existing voice services
1.3 Scenario 3: IMS Voice Call with ULBC over Other Access Types
Motivation:
- ULBC may provide enhanced robustness against poor network conditions
- Lower bit rates may benefit coverage/capacity
- Reduces transcoding needs when calls bridge GEO and other access types
Scenario Description:
- Both UEs support ULBC but connect via 3GPP access other than GEO (LTE, NR, WLAN)
2. Channel Characteristics and Service-Related Dependencies
2.1 Mouth-to-Ear Delay Estimation for GEO Scenarios
Delay Components:
UE Delay (Table 5.1.2-2):
- Depends on voice bundling period (80ms, 160ms, 320ms) and codec frame size (20-320ms)
- Performance objective range: 268-1435ms (excluding solution-specific delay)
- Maximum requirement range: 355-1435ms (excluding solution-specific delay)
- Components: 2x voice bundling period + 2x vendor-specific encoder/decoder processing + vendor delay budget + JBM
Core Network Delay:
- Ground station to core network: [5-20]ms minimum, [200]ms maximum
- eNodeB to core network: 5-20ms
- Transcoding: 7ms (AMR/AMR-WB) to 14ms (EVS)
GEO Transmission Delay:
- Minimum: 248ms
- Maximum: 280ms (per TS 22.261 KPI requirement)
- Variation of 32ms depending on UE location within beam
Mouth-to-Ear Delay Estimates (Table 5.1.3-1):
For Main Scenario (GEO-TN):
- 80ms bundling, 20ms frame: 548ms (lower) to 872ms (upper) + solution-specific delay X
- 320ms bundling, 320ms frame: 1952ms (lower) to 2395ms (upper) + solution-specific delay X
For Sub-Scenario (GEO-GEO):
- 80ms bundling, 20ms frame: 804ms (lower) to 1315ms (upper) + solution-specific delay X
- 320ms bundling, 320ms frame: 1952ms (lower) to 2395ms (upper) + solution-specific delay X
2.2 NB-IoT NTN System Parameters
System Architecture:
- Service link: UE to NTN payload
- Feeder link: NTN payload to NTN Gateway
RAN Parameters:
- Channel coding: Turbo code (NPUSCH Format 1 uplink), TBCC (NPDSCH downlink)
- MCS: pi/2 BPSK, pi/4 QPSK, QPSK, 16QAM
- Subcarrier spacings: 3.75kHz and 15kHz for NPUSCH Format 1
- Resource unit (RU) duration varies with subcarrier spacing and number of tones
QoS Characteristics:
- Managed through QCI (QoS Class Identifier)
- Same PELR (Packet Error Loss Rate) required for UL and DL
- Suggests balanced UL/DL time-domain transmission resources
3. Design Constraints
3.1 Design Constraint Parameters (Table 6.2-1)
Key parameters identified:
- Bit rates: [TBD]
- Sample rate and audio bandwidth: [TBD]
- Frame length: [TBD]
- Complexity and memory demands: [TBD]
- Algorithmic delay: Frame size buffering + inherent codec delays (look-ahead, sample-rate conversion, post-processing)
- Packet loss concealment (PLC): [TBD]
- Noise suppression: [TBD]
- Discontinuous transmission (DTX): Including VAD and comfort noise [TBD]
- Robustness to non-speech input: [TBD]
3.2 Complexity and Memory Considerations
Current Evaluation Analysis:
- Codec must support real-time thread and concurrent processing
- ML codecs with [5-10M] parameters considered for efficient operation within latency bounds
- Must operate within compute constraints of devices for real-time voice communication
Memory and Power Considerations:
- Larger models require more DRAM access → higher power consumption
- Memory footprint critical for device performance and usability
Complexity Metrics for AI-Based Codecs:
TOPS (Tera Operations Per Second):
- TOPS = 2 × MAC unit count × Frequency / 1 trillion
- Smartphone NPUs: 8-59 TOPS reported (varying precision: INT8, INT16, FP16)
- TOPS/W (power efficiency): 2-15 TOPS/W for smartphone NPUs
- Note: TOPS/W typically benchmarked under full-load; lighter workloads like audio codecs may show different characteristics
Alternative Metrics:
- MACs (Multiply-Accumulate operations): Practical for complexity assessment
- RTF (Real-Time Factor): Ratio of frame length to encoding/decoding time; reliable but resource-intensive to measure
- Model Size: Number of parameters and precision; directly impacts memory and power
- Tools available: ptflops, torchinfo, fvcore for MAC counting
Observations:
- NPUs/TPUs significantly more power-efficient than CPUs for AI tasks (5-20x)
- Actual NPU performance depends on computational graph structure
- Irregular/sequential/unsupported operations may require CPU fallback
- ULBC complexity constraints should be based on desired power consumption/computational performance, not relative to existing 3GPP codecs
- Million MACs + model size provide first indication of complexity
- RTF useful but requires standardized test benches
- WMOPS not directly suitable for NPU-capable devices but mapping to TOPS/RTF beneficial
Complexity Target Estimation:
- Target devices: Modern smartphones with NPU components
- Example: DAC codec estimated at ~150 Giga MAC/sec (~0.3 TOPS)
- Actual power consumption on smartphone NPUs: TBD
- Model size and architecture significantly impact DRAM operations and overall power consumption
3.3 Design Constraint Verification
Editor's note: Algorithmic delay verification method for AI-based codecs required.
3.4 Additional Design Considerations
Codec Parameters and Configuration:
- Static parameters: Rarely changed, exchanged via SDP or predefined
- Dynamic parameters: May change frequently, included in each packet/frame
- Common static/dynamic parameters to be identified
4. Existing Technologies and Feasibility Evidence
4.1 Overview of Existing Codec Technologies (Table 7.1.1-1)
Categories:
1. 3GPP IMS codecs: Reference conditions (AMR, AMR-WB, EVS)
2. Conventional Ultra Low Bitrate Codecs: DSP-based (MELP/MELPe, AMBE-LR, MPEG-HVXC, TWELP MR, Codec2)
3. AI-based postprocessor: Enhancement of conventional codec output
4. 
AI-based encoder/decoder:
- Causal systems: Real-time capable (LPCNet, LyraV2, EnCodec, Mimi-Codec, TS3, TAAE, LMCodec2)
- Non-causal systems: Non-real-time due to large look-ahead (DAC, DAC-IBM, SNAC, SpeechTokenizer, SemantiCodec, FunCodec, WavTokenizer, BigCodec, FocalCodec)
Key Codec Properties:
3GPP IMS Codecs:
- AMR: NB, 5ms delay, 20ms frame, 4.75 kbps
- AMR-WB: WB, 5.9375ms delay, 20ms frame, 6.6 kbps
- EVS: NB/WB/SWB, 12ms delay, 20ms frame, 7.2-9.6 kbps
Conventional Ultra Low Bitrate:
- MELP/MELPe: NB, 20-36ms delay, 22.5-90ms frame, 0.6-2.4 kbps
- Codec2: NB, 40ms delay, 20-40ms frame, 0.45-2.4 kbps
AI-based (Causal):
- LPCNet: WB, 25ms delay, 40ms frame, 1.6 kbps
- LyraV2: WB, [TBD] delay, 20ms frame, 3.2/6/9.2 kbps
- Mimi-Codec: 24kHz, 0ms delay, 80ms frame, 0.55/1.1 kbps
AI-based (Non-causal):
- DAC: WB/24kHz, 244-366ms delay, 13.3-20ms frame, 0.5-3+ kbps
- DAC-IBM: 24kHz, 366ms delay, 13.3ms frame, 0.75/1.5/3 kbps
- SNAC: 24kHz, 1000ms delay, 80ms frame, 0.98 kbps
4.2 Observations on Codec Parameters
Audio Bandwidth:
- Conventional codecs: NB only
- Modern AI codecs: WB or higher
Algorithmic Codec Delay:
- IMS codecs: 25-32ms
- Conventional ultra-low: 60-126ms
- Causal AI: 20-80ms
- Non-causal AI: 500ms+ or full signal
Frame Duration:
- Conventional ultra-low: Increased vs. standard 20ms VoIP
- Some AI codecs maintain 20ms, others increase (e.g., Mimi 80ms)
Bitrate:
- All listed codecs (except IMS and LyraV2) offer ≥1 mode <3 kbps
Complexity:
- AI codecs generally higher than IMS/conventional codecs
- Exception: LyraV2 requires only 35% of ARM A53 core (RaspberryPi 3+)
- RAM: AI codecs significantly higher (e.g., LyraV2: 54MB vs. EVS: 294KB)
- ROM: AI codecs much higher (e.g., TAAE: 950M parameters ≈ 900MB @ 8-bit; SNAC: 19M ≈ 18MB @ 8-bit; EVS: ~2MB)
4.3 Performance Evaluation
P.808 ACR Test Results (Figure 7.1.4-1):
Test setup:
- English clean speech (4 talkers × 6 samples)
- 32kHz, SWB, normalized to -26 dBoV
- 24 subjects
Key Findings:
- Codec2 (all rates) significantly worse than AMR 4.75 kbps
- SemantiCodec, LyraV2, LPCNet, Mimi 0.55 kbps: comparable to AMR-WB 6.6 kbps
- Three conditions on par or slightly better than EVS 9.6 kbps:
  - Mimi-Codec 1.1 kbps (causal)
  - DAC-ibm 1.5 kbps (non-causal)
  - SNAC 0.98 kbps (non-causal)
- AI-based solutions show 2+ MOS improvement over conventional ultra-low bitrate codecs
4.4 Packet Loss Concealment (PLC) Experiments
4.4.1 PLC Experiment with DAC
Test Configuration (Table 7.1.5.1-1):
- Bitrates: 1, 2.5, 4.5, 6 kbps
- Loss percentages: 1%, 6%, 10%, 20%
- Frame size: 80ms
- Based on NB-IoT NTN data at ~3dB CNR (SCS=15kHz) and 9dB (SCS=3.75kHz)
Loss Simulation Methods:
1. Consecutive 4 blocks drop and repeat: Simulates 80ms packet loss
2. 
Interleaved drop and repeat: Spreads loss over 2 packets (adds latency)
MUSHRA Test Results (8 listeners):
- Despite higher loss percentage, 4.5 kbps and 6 kbps significantly better than 1 kbps and 2.5 kbps
- 6 kbps @ 20% loss rated close to 4.5 kbps @ 10% loss
- Interleaving benefit increases with error rate
- Potential for improvement if model trained with random loss patterns
4.4.2 PLC Experiment with DAC and DAC-IBM
Comparison:
- DAC (default): 16kHz, general audio training, scalable bitrate
- DAC-IBM: 24kHz, speech-specific training, fixed 1.5 kbps
MUSHRA Test Results (8 listeners, resampled to 16kHz):
- DAC-IBM 1.5 kbps @ 3% PLR significantly outperforms all other DAC conditions
- DAC 4.5 kbps @ 10% PLR and 6 kbps @ 20% PLR show no significant improvement over DAC-IBM 1.5 kbps @ 3% PLR
- Specific training for target bitrate crucial for optimal performance
- Error resilience improvable through appropriate training/design choices
Conclusions:
- More design freedom needed in bitrate and BLER selection for optimal quality at given SNR
- Optimal coding performance (even under errors) achieved with appropriate training strategy
- Bitrate scalability (e.g., DAC) comes with significant performance cost, especially at lower bitrates
- Dedicated training (e.g., DAC-IBM) much more efficient
4.5 Very Low Bitrate Listening Test Results
Test Setup (Nokia):
- Clean Finnish speech (3 males, 3 females, 4 sample pairs each)
- Diotic presentation via Sennheiser HD650 headphones
- Experienced listeners
- Extended ACR5 scale (0.5-5.5) and DCR methodologies
- Bandwidths tested: NB (4kHz), MB (6kHz), WB (8kHz), 10kHz, SSWB (12kHz), SWB (16kHz), FB (20kHz)
Codecs Tested:
- DSP: Codec2 (0.7, 1.3, 2.4, 3.2 kbps), MELP (2.4 kbps), MPEG4 HVXC (2.0, 4.0 kbps)
- 3GPP: AMR, AMR-WB, EVS at various rates
- ML: DAC 44k (0.9, 1.7, 2.6, 3.4, 6.9 kbps), TSAC 44k (0.6, 1.2, 2.5, 3.2, 5.9 kbps)
Extended ACR5 Results (Figures 7.2.3-1, 7.2.3-2):
- Increased bandwidth improves quality up to ~12kHz (saturation region)
- 4kHz bandwidth significantly limits perceived quality
- MELP 2.4k and MPEG4 HVXC perform better than Codec2
- 3GPP codecs perform as expected at lowest bitrates
- TSAC and DAC show very good performance in clean speech
- TSAC ≥1.2 kbps and DAC ≥1.7 kbps suitable as ML-based references
- Both poor quality <1 kbps
DCR Results (Figure 7.2.4-1):
- Results align with ACR test
- Exception: MELP preferred over HVXC 2.0 in DCR (full 4kHz bandwidth vs. ~3.7kHz)
- Listeners more likely to notice degradations with reference available
4.6 Test Results on Clean Speech and Music/Mixed Content
4.6.1 DCR Test on Clean Speech (Figure 7.3.2-1)
Test Setup:
- French, 30 listeners (5 panels × 6)
- 8sec double sentences, 3 male + 3 female
- 20-20,000Hz bandpass, -26dB LKFS normalized
Codecs:
- Conventional: Opus (12, 16, 24 kbps), EVS-WB (7.2, 8 kbps), EVS-SWB (9.6, 13.2, 24.4 kbps)
- AI: LPCNet (1.6), Lyra V2 (3.2, 6, 9.2), EnCodec (1.5, 3, 6, 12, 24), AudioCraft (1.5, 3, 6), AudioDec, DAC (1.7, 2.6, 5.2, 7.8)
Key Findings:
- DAC best DMOS among ~1.5 kbps codecs; approaches "Direct" quality <8 kbps
- EnCodec doesn't achieve "Direct" quality even @ 24 kbps; below EVS/Opus at this rate
- Lyra V2 (6, 9.2 kbps) on par with EVS-WB (7.2, 8 kbps)
4.6.2 ACR Test on Clean Speech (Figure 7.3.3-1)
Same setup as DCR test, ACR methodology for better objective metric comparison. Same observations as DCR test.
4.6.3 DCR Test on Music and Mixed Content (Figure 7.3.4-1)Test Setup: - 30 listeners (5 panels × 6) - 6 categories: instrumental/vocal classical, instrumental/vocal modern, captured mixed, artificial mixed (speech + music background) - 20-20,000Hz bandpass, -26dB LKFS Codecs: - Conventional: xHE-AAC (8, 12, 16, 24), Opus audio (16, 24), Opus voip (12, 16, 24), EVS-SWB (9.6, 13.2, 24.4) - AI: EnCodec (12, 24), DAC (4.3, 6, 7.8), HILCodec (4.5, 6, 9), SNAC (2.6), FlowDec (4.5, 6, 7.5) - Note: Many neural codecs pretested but excluded due to low quality (LPCNet, Lyra V2, AudioDec, FreqCodec, HifiCodec, Spectral Codecs, Vocos, DisCodec, Mimi, AudioCraft) Key Findings: - Best quality: EVS and xHE-AAC @ ~24 kbps - Neural codec advantage visible at low bitrates - No tested neural codec achieves quality close to "Direct" - FlowDec 7.5 kbps: 4.08 DMOS (best neural codec) - No tested AI codec provided reasonable quality for music/mixed content <2.6 kbps 4.7 Impact of Noise Suppression on AI-Based Codecs4.7.1 Background on Existing SystemsClassical Speech Coding: - Studies on MELPe and AMR show noise reduction preprocessing improves parameter extraction and decoded speech quality - Especially beneficial in noisy conditions and low SNRs - Improves intelligibility and perceptual quality - Integrated in 3GPP2 EVRC and VMR-WB standards Neural Speech Coding: - Known to be sensitive to noisy environments - Robustness influenced by training data diversity, low bitrates, capacity/complexity, quantization - Data-driven approaches make failure modes difficult to anticipate - Noise suppression can minimize issues and allow codec to focus on useful signal 4.7.2 Test DesignTwo Listening Tests (ITU-T P.808 ACR): Test 1 - High SNR: - Assumptions from 3GPP EVS characterization - SNRs: +15 to +20 dB (WB) - Noises: car, street, office (from ITU-T P.501 Annex B) - 24 pairs of sentences (8 pairs × 3 noises) - 20 listeners Test 2 - Low SNR: - More adversarial environments - SNRs: -5 to +15 dB - 
Noises: street, construction, metro, car, office, restaurant - 24 pairs of sentences (4 pairs × 6 noises) - 21 listeners Noise Suppression: - DeepFilterNet2: State-of-the-art DNN-based, operates at 48kHz - Applied as preprocessor before coding Mixing Procedure: - Loudness normalization using BS1770demo (ITU-T STL) - RMS long-term option for background noise level 4.7.3 Conditions Under TestClassical Codecs: - MELPe, AMR, AMR-WB, EVS Neural Codecs: - SNAC, MIMI, DAC_IBM (speech-trained, <2 kbps) - LyraV2 3.2 kbps (likely trained on diverse data including noisy speech) - DAC (original, 24kHz, 1.5/3/6 kbps) - Test 1 only All tested with and without noise suppression ("_nr" suffix). 4.7.4 High SNR Test Results (Figures 7.4.2.4-1, 7.4.2.4-2)Key Observations: - Listeners prefer uncoded denoised speech over uncoded noisy speech - Denoised speech as good as clean speech at high SNRs (minimal artifacts) - Noise suppression beneficial for all codecs except MELPe (already has noise reduction; benefit minimized at high SNRs) - Classical codecs: Benefit increases with bitrate/quality - Neural codecs: Greater benefit, >0.5 MOS improvement for several (SNAC, DAC_ibm, DAC @ 3 kbps) - DAC_ibm vs. DAC: Same architecture/complexity, very different behavior due to training data/target bitrate - Plain DAC @ 24kHz not competitive at 1.5 kbps - LyraV2: ~70x less complex than other neural codecs; @ 3.2 kbps performs worse except vs. 
DAC @ 3 kbps (on par) 4.7.5 Low SNR Test Results (Figures 7.4.2.5-1, 7.4.2.5-2)Key Observations: - Listeners strongly prefer uncoded denoised speech (~1 MOS difference) - All classical codecs benefit from denoising (<1 MOS improvement) - Neural codecs benefit even more (>1 MOS improvement possible) - Neural codecs at vastly lower bitrates can compete with conventional codecs under adverse conditions when combined with noise suppression - Generative-AI based codecs (e.g., DAC IBM) can improve absolute quality of input signal when coding denoised speech 4.7.6 Conclusions
4.8 Analysis of Existing AI Codec: Lyra V2Key Characteristics: - Publicly reported: "38x faster than real-time" on high-end 2021 smartphone - Entirely CPU execution (no NPU/TPU) - Open-source under Apache 2.0 license (permissive for commercial/standardization) Code-Level Analysis:
- Core components (LyraGanModel, SoundStreamEncoder) explicitly use CPU backend (XNNPACK via TensorFlow Lite)
- Flag Conclusion: - Proves state-of-the-art low-bitrate AI speech codec can achieve/exceed real-time requirements on high-end 2021 smartphone CPU - Significant margin towards max RTF - CPU-only approach viable for ULBC 4.9 Complexity Analysis of Existing AI Codec: DACMethodology: - ONNX Runtime library for execution - Tested on CPU backend and NNAPI backend (Android NPU interface) - Model: Unmodified pretrained DAC @ 44.1kHz, 8 kbps (from reference) - No quantization applied (original float model) - Metrics: Real-Time Factor (RTF) for end-to-end and individual components Theoretical Complexity Analysis (Figure 7.6.2-1): - Tools: ptflops v0.7.5, thop v2.0.17 (cross-verification) - Complexity scales with frame size: 1.4 GFLOP (20ms) to 31.6 GFLOP (320ms) - Model: 76.9M parameters, 293MB size - Note: Different library versions produce different results due to ConvTranspose1d calculation methodology changes Real-World Inference Performance: Test Platforms: 1. High-end desktop: AMD Ryzen 9 7950X (5.7GHz fixed) 2. High-end mobile: Qualcomm Snapdragon 8 Gen 2 Key Findings (Figures 7.6.4-1, 7.6.4-2, 7.6.4-3): Desktop CPU: - Single-threaded: NOT real-time (RTF 1.6-1.9) - Multi-threaded (4 threads): Real-time capable (RTF 0.67-0.86) - Still very slow for high-end desktop CPU Mobile SoC: - NO configuration achieves real-time performance - Best-case RTF: 2.125 (>2x slower than real-time) - Worst-case RTF: 5.884 (~6x slower than real-time) - NNAPI backend (NPU): Inconsistent results; sometimes helped slightly, sometimes significantly worse than CPU - Cannot assume NPU automatically improves performance; NPU-specific optimizations may be required Critical Gap: - Significant gap between theoretical NPU capacity and actual measured performance (RTF) - Model appearing suitable on paper (~2-5 GFLOP/frame) unable to run real-time on top-tier mobile phone - Real-world testing essential Editor's note: NNAPI may fallback to CPU for float models; impact needs verification. 5. 
Test Methodologies5.1 General Considerations5.1.1 Typical Quality Impairments of Ultra-Low Bit Rate Speech CodingCategories: - Loss of listening-only audio quality - Audio bandwidth loss - Impaired intelligibility - Impaired speaker identifiability - Prosodic impairments - Hallucination (word and phone confusions) - Sensitivity to non-speech input (background noise, music, noisy speech, interfering talker, reverberant speech) Additional Considerations: - Speech enhancement algorithms (noise suppression, gain normalization) may be part of ULBC 5.1.2 Challenges of Quality AssessmentTraditional 3GPP Practice: - AMR, AMR-WB, EVS: Listening-only evaluations using P.800 ACR and modified DCR - ACR: Generally for clean speech - DCR: For SWB clean speech, mixed-bandwidth, speech + background noise, music/mixed content - Focus not on intelligibility, speaker identifiability, prosodic impairments ULBC Challenges: - May need to address additional aspects directly through dedicated tests - Hallucination: Specific to ML-based systems - ACR may not optimally quantify all impairments (hallucination, intelligibility, prosodic) Alternative Test Methods: - Automatic speech recognition - Modified rhyme tests - DCR tests (for prosodic differences) - Diagnostic Rhyme Tests (DRT) - Modified Rhyme Tests (MRT) - MOS testing for speaker similarity - Speaker verification/identification tests - Prosodic naturalness MOS tests - Intonation recognition tests - Transcription tests (word/semantic equivalence) - Phoneme recognition tests Noise Suppression Evaluation: - P.835: Multi-dimensional rating (speech quality and noise suppression capability separately) - Typically used for systems with noise suppression DCR Considerations: - Subjects may consider noise suppression as degradation when comparing to uncoded noisy reference 5.1.3 Subjective Testing Considerations |
|
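The real-time factor (RTF) figures and per-frame GFLOP counts reported in clauses 4.8 and 4.9 of the summary above can be related with a few lines of arithmetic. The sketch below assumes the usual convention that RTF is processing time divided by audio duration, which matches how the summary uses the term (values below 1 are real-time capable). The function names are illustrative and do not come from the contribution.

```python
def real_time_factor(processing_s: float, audio_s: float) -> float:
    """RTF = time spent processing / duration of the audio processed."""
    return processing_s / audio_s


def is_real_time(rtf: float) -> bool:
    """A codec keeps up with the input only if RTF is below 1."""
    return rtf < 1.0


def rtf_from_speedup(speedup: float) -> float:
    """Convert an 'N times faster than real-time' claim into an RTF."""
    return 1.0 / speedup


def sustained_gflops(gflop_per_frame: float, frame_ms: float) -> float:
    """Compute rate (GFLOP/s) needed to process frames as fast as they arrive."""
    return gflop_per_frame / (frame_ms / 1000.0)


if __name__ == "__main__":
    # Values quoted in the summary: Lyra V2 speed-up and DAC per-frame complexity.
    print("Lyra V2 RTF:", round(rtf_from_speedup(38), 4))
    print("DAC, 20 ms frames, GFLOP/s:", round(sustained_gflops(1.4, 20), 2))
    print("DAC, 320 ms frames, GFLOP/s:", round(sustained_gflops(31.6, 320), 2))
    print("Mobile best case real-time?", is_real_time(2.125))
```

Expressing the per-frame complexity as a sustained rate makes it easier to compare against a platform's nominal throughput, although the summary itself stresses that nominal capacity did not predict the measured RTF.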
| China Mobile Com. Corporation |
[FS_ULBC] Permanent Document v0.5.0
|
Comprehensive Summary of 3GPP FS_ULBC Permanent DocumentDocument OverviewThis permanent document (p-doc) version 0.45.0 supports the Study Item on Ultra Low Bitrate Speech Codec (FS_ULBC), focusing on developing recommendations for normative work on an ultra-low bit rate codec for voice over Geostationary Orbit (GEO) satellites. The document tracks agreements, open issues, and progress across the study objectives defined in the SID. 1. Introduction and ScopeThe study addresses nine key objectives: - Document application scenarios for ultra-low bit rate communication services - Study GEO channel characteristics and derive service-related dependencies - Identify relevant design constraints - Provide feasibility evidence - Define performance requirements and test methodologies - Identify/develop objective measures for design constraint verification - Identify reference codecs - Coordinate with other 3GPP groups (SA2, RAN, CT1) - Define potential normative work item objectives and timeline Working Procedure: - Maintains one TR and one p-doc - Contributions via pCRs - Brackets restricted to values only - Open issues documented in p-doc 2. 
Application Scenarios2.1 Main Scenario: IMS Voice Call over GEOKey Technical Assumptions: UE1 Uplink (UE1 → GEO satellite → Ground station): - Transmission data rate significantly limited ([1-3] kbit/s) - Requires ultra-low bit rate codec fitting this transmission rate - Subject to transmission errors reflecting GEO satellite access - Delay greater than typical terrestrial networks UE1 Downlink (Ground station → GEO satellite → UE1): - Similarly limited transmission data rate - Subject to similar transmission errors and delay UE2 Connection (Core Network → UE2): - Regular TN network transmission data rate available - Could use existing IMS codec (with transcoding) or same ULBC (transcoding-free) - Transcoding functionality in core network likely needed for seamless communication across network types 2.2 Sub-ScenarioBoth connections (UE1 and UE2) via GEO satellite with significantly limited transmission data rate ([1-3] kbit/s), allowing both transcoded and transcoding-free operation. 3. Channel Characteristics and Service-Related Dependencies3.1 End-to-End Simulation ModelMethodology: - Reuses simulation model from TS 26.132 Annex E (LTE reference scenario) - Adapted for GEO access scenario with "new GEO channel" - Potential inclusion of Non-IP Data Delivery (NIDD) option Key Input Parameters: BLER_tx/BLER_rx: Block error rates for uplink/downlink from RAN simulation drx_cycle_length: DRX cycle duration (20-40ms for LTE; suitability for GEO TBC with RAN2) mis_eNB1_eNB2: Scheduling time mis-alignment; determines buffer waiting time nFrames considerations: - Frame length: Maximum 80ms assumed for GEO (vs. 20ms for LTE) - Voice packet size: Depends on protocol overhead (user plane vs. control plane, IP vs. Non-IP NIDD) - RTP Payload Size: Product of frame length and codec bit rate Editor's Note: SA2 concluded in TR 23.700-19 that voice packets shall be transported over NB-IoT (GEO) user plane. 
3.2 RAN Simulation Model for Error TracesObjective: Generate multiple loss traces for combinations of: - Frame loss rate (target BLER) - Raw bitrate (TBS) - Voice bundling period - Doppler spread Simulation Parameters: - Number of seeds: 10 - Trace duration: 400 seconds (6.67 minutes) - Channel consistency: Same channel realizations across all combinations 3.2.1 Link Budget AnalysisBaseline CNR values from TR 36.763: - UL CNR = 2.6dB (0dBi UE antenna gain, 3.75kHz SCS, 1 tone, 23dBm UE max TX power) - DL CNR = -3.3dB (0dBi UE antenna gain, 15kHz SCS, 12 tones, 1 UE receive antenna, 23dBm UE max TX power) 3.2.2 Uplink Simulation ParametersChannel model: NTN-TDL-C [38811] Elevation angle: 10 degrees (parameters specified in Table 5.2.2.2-1) Modulation: QPSK, π/2 BPSK Subcarrier Spacing (SCS): 3.75kHz, 15kHz Number of tones: 1 for both SCS values Voice bundling period: 80ms, 160ms, 320ms - Note: 40ms not considered due to insufficient time for DL transmissions with 3.75kHz SCS Doppler spread: 1Hz, 5Hz Target BLER: 1%, 2%, 6%, 10% (fixed target BLER is FFS) Maximum Achievable SNR: SNR = (3GPP SET-1 UL SNR) - 10×log₁₀(B/3.75) + (P - 23dBm) + G + [X] dB Where: - 3GPP SET-1 UL SNR = 2.6dB - B = bandwidth (3.75kHz or 15kHz) - P = max UE TX power (23, 26, 31 dBm) - G = UE antenna gain difference (0 to -5.5dBi) - X = TBD (accounts for lower loss, better satellite performance) TBS Values and PHY Bitrates: For 80ms bundling: - TBS: 144, 256, 328, 424 bits - PHY bitrate: 1.8, 3.2, 4.1, 5.3 kbps - Codec bitrate: 1.1, 2.5, 3.4, 4.6 kbps (assuming 7 bytes packet header) For 160ms bundling: - TBS: 208, 424, 600, 808 bits - PHY bitrate: 1.30, 2.65, 3.75, 5.05 kbps - Codec bitrate: 0.95, 2.30, 3.40, 4.70 kbps For 320ms bundling: - TBS: 328, 776, 1096, 1544 bits - PHY bitrate: 1.025, 2.425, 3.425, 4.825 kbps - Codec bitrate: 0.850, 2.250, 3.250, 4.650 kbps Notes: - Packet header counted once regardless of bundled frames - Loss of single TB means loss of multiple consecutive voice 
frames - Need for 320ms bundling to be revisited after channel simulation results 3.2.3 Downlink Simulation ParametersSCS: 15kHz Number of tones: 12 Achievable SNR: SNR = (3GPP SET-1 DL SNR) + G + [Y] dB Where: - 3GPP SET-1 DL SNR = -3.3dB - G = UE antenna gain difference (0 to -5.5dBi) - Y = TBD (accounts for 2 RX antennas providing up to 3dB gain, lower loss, better G/T values, better satellite performance) Editor's Note: Four companies reported Y=3 due to better G/T from field measurements (-28.6dB/K vs. -31.6dB/K assumed), but no RAN1 consensus reached. TBS values: Identical to uplink (Clause 5.2.2.2) 3.2.4 Frame StructureDynamic Scheduling Example (80ms bundling, Half-duplex FDD): - NPDSCH duration: 4ms (variable depending on DL SNR) - UL frequency allocation options: 1, 3, 6, 12 tones with 15kHz per tone Semi-Persistent Scheduling (SPS): - If specified by RAN for NB-IoT NTN - NPDSCH can be anywhere in first 15ms (maintaining minimum 1ms gap to NPUSCH) - "Cell_specific_Koffset" approach proposed (not dependent on "TA report UE capability") Gap between DL and UL consists of: - Processing time + DL-to-UL switching (minimum 1ms for half-duplex device) - Max differential delay: [close to 0 to 10.3ms] (TBC) RAN1 Note: Example frame structures supportable in most scenarios but may not work for very large cells (>3000km) when UE doesn't support TA report and network doesn't support UE-specific K-offset. RAN1/2 have not yet designed SPS. 
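The TBS, PHY bitrate and codec bitrate values listed in clause 3.2.2 are linked by simple arithmetic: the PHY bitrate is the transport block size divided by the bundling period, and the codec bitrate is what remains once the 7-byte packet header (counted once per bundle) is subtracted. The Python sketch below reproduces that relationship together with the uplink SNR expression. The function names are illustrative and not taken from the permanent document, and X is left as a caller-supplied value because it is still TBD.

```python
import math

HEADER_BITS = 7 * 8  # 7-byte packet header, counted once per bundle


def phy_bitrate_kbps(tbs_bits: int, bundling_ms: int) -> float:
    """Bits per millisecond is numerically equal to kbit/s."""
    return tbs_bits / bundling_ms


def codec_bitrate_kbps(tbs_bits: int, bundling_ms: int,
                       header_bits: int = HEADER_BITS) -> float:
    """Bitrate left for the codec payload after removing the header."""
    return (tbs_bits - header_bits) / bundling_ms


def required_tbs_bits(codec_kbps: float, bundling_ms: int,
                      header_bits: int = HEADER_BITS) -> int:
    """Inverse mapping: TBS needed to carry a given codec bitrate."""
    return round(codec_kbps * bundling_ms) + header_bits


def uplink_snr_db(bandwidth_khz: float, tx_power_dbm: float,
                  antenna_gain_db: float = 0.0, x_db: float = 0.0) -> float:
    """SNR = 2.6 - 10*log10(B/3.75) + (P - 23) + G + X, all in dB."""
    return (2.6 - 10.0 * math.log10(bandwidth_khz / 3.75)
            + (tx_power_dbm - 23.0) + antenna_gain_db + x_db)


if __name__ == "__main__":
    for tbs in (144, 256, 328, 424):  # TBS values listed for 80 ms bundling
        print(tbs, phy_bitrate_kbps(tbs, 80), codec_bitrate_kbps(tbs, 80))
    print(required_tbs_bits(0.95, 160), required_tbs_bits(3.4, 160))
    print(round(uplink_snr_db(15.0, 23.0), 2))
```

The inverse mapping corresponds to the example workflow in clause 3.4, where codec designs at 0.95 kbps and 3.4 kbps with 160 ms bundling map to TBS values of 208 and 600 bits.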
3.3 Open Issues for NB-IoT GEO SimulationIssue 1 - UE Power Class: Whether to use specified 23dBm or broader range (26, 29, 31, 33 dBm) - Pending RAN input Issue 2 - Latitude-Dependent Loss: Scintillation loss (2.2dB or 0dB depending on latitude) - Solved (accounted via X term) Issue 3 - Elevation Angles: Keeping both 2.3° and 12.5° - Solved (accounted via X term) Issue 4 - UL/DL Guard Time: 1ms assumption - Pending RAN confirmation Issue 5 - Candidate TBS Values: Multiple proposals from companies - Unsolved Issue 6 - Approaches to Select TBS: Three approaches provided - Unsolved Issue 7 - Overall Simulation Methodology: High-level description needed - Unsolved (to be addressed after simulation completion) Issue 8 - Simulation Channel Model: NTN-TDL-C vs. NTN-TDL-C5 - Solved (NTN-TDL-C used) Issue 9 - Protocol Overhead: Clarify packet header for different transport options - Pending RAN2/SA2 confirmation Issue 10 - Repetition Numbers: Specify and report in simulation - Solved Issue 11 - RX G/T for Downlink: 3dB better value observed in field - Unsolved 3.4 Alternative Methodology for Determining ULBC Bit RateEditor's Note: This methodology remains an open issue. Proposed Steps:
Example Workflow: - Proponent has design at 0.95 kbps and 3.4 kbps - For 160ms bundling with 7-byte overhead: - Low rate: TBS = 208 bits - High rate: TBS = 600 bits - Select best transport format configuration from available options - Generate BLER patterns for different UE TX powers (23, 26, 29, 31 dBm) - Run codec simulation with these patterns - Evaluate quality (e.g., listening test) with weighted averaging across power settings Note: Important to test candidates for other conditions beyond NTN NB-IoT (e.g., Terrestrial IMS with 1% BLER, OTT with 0% BLER, extreme conditions with 10% BLER or blockage losses) 3.5 Simulation ResultsTable 5-6 documents preliminary results: - 80ms bundling: Qualcomm submitted S4-251739 - Company A, B, C: TBD 4. Design Constraints4.1 Complexity and Memory DemandsTarget Device Types: - Handheld mobile phones - Smart watches - Smart glasses/head mounted devices - TCU (Telematics Control Unit) - CPE (Customer Premises Equipment) - Vehicles - Other IoT devices Recommended Constraints: - Implementable on DSP/CPU/NPU enabled UE devices - For low-end DSP-only UEs: - Complexity: <500 WMOPS (measured on C reference code) - ROM memory: <20MB assuming 32bit/parameter (or 5M model parameters) Editor's Notes: - Definition of "DSP enabled UE devices" needs clarification - Exact complexity estimation metric and limits are TBD 4.2 Design Constraint VerificationComplexity Verification: - Constraints may be based on platform-agnostic metrics: - MACs/FLOPs for AI-based components - WMOPS for traditional signal processing - Model size and precision - Verification process details and timing are FFS Algorithmic Delay: - Verification method for AI-based codecs required 5. 
Performance Requirements5.1 ScopeDefine performance requirements and test methodologies for: - Speech quality, intelligibility, conversational quality - Clean speech and noisy speech - Tandeming with existing IMS voice codecs - Clean channel and GEO channel conditions - Identify relevant reference codecs 5.2 Status TrackingCore influencing factors identified: - DC: Sample rate and audio bandwidth - DC: Bitrates (External dependency) - DC: Frame length - DC: PLC (External dependency) - DC: Algorithmic Delay - DC: Complexity, Memory - Test Methodologies - DC: Noise suppression - DC: DTX/CNG - DC: Robust Non-Speech - Evidence DCs - Reference codec All items currently have open issues and progress TBD 6. Coordination and Dependencies6.1 External DependenciesFrom RAN: - HARQ retransmission parameters (max_tx/max_rx) - DRX cycle length suitability for GEO - Scheduling parameters (dynamic vs. SPS) - Frame structure confirmation - UE power class - UL/DL guard time - Protocol overhead - G/T values for downlink From SA2: - Transport path for voice packets (user plane vs. control plane, IP vs. Non-IP NIDD) - Protocol overhead details - Transcoding functionality requirements From RAN2: - Dynamic scheduling vs. Semi-Persistent Scheduling - MAC header size (1-byte feasibility) - Timing parameters 7. 
Key Technical Contributions7.1 Simulation Framework EstablishmentThe document establishes a comprehensive RAN simulation framework for generating error traces: - Defined methodology using NTN-TDL-C channel model - Specified uplink and downlink parameters - Established TBS values and corresponding codec bitrates for multiple bundling periods - Defined channel consistency requirements across simulations 7.2 Link Budget AnalysisAdopted baseline CNR values from TR 36.763 with provisions for: - Variable UE power classes - Latitude-dependent losses - Elevation angle variations - Better-than-assumed satellite performance 7.3 Bitrate Determination MethodologyProposed alternative methodology allowing proponents design freedom: - Operation point definition based on receive SNRs - Transport format optimization for each source bit rate - Packet loss pattern generation - Comparative evaluation framework 7.4 Frame Structure DefinitionDefined frame structures for: - Dynamic scheduling with Half-duplex FDD - Semi-Persistent Scheduling options - Cell_specific_Koffset approach for large cells 7.5 End-to-End Delay-Error Profile ModelAdapted TS 26.132 Annex E model for GEO scenarios: - Identified required input parameters - Defined voice packet structure with protocol overhead - Established relationship between frame length, bundling, and packet loss 8. Open Issues SummaryHigh Priority (Blocking): 1. Consensus on UE power class (23 dBm vs. higher values) 2. RAN confirmation on frame structures and scheduling 3. SA2/RAN2 confirmation on protocol overhead 4. Selection of candidate TBS values and selection methodology 5. Downlink RX G/T value consensus Medium Priority: 1. Fixed vs. variable target BLER 2. Need for 320ms bundling option 3. Complexity metric definition and limits 4. Algorithmic delay verification for AI codecs Lower Priority: 1. Overall simulation methodology description (after completion) 2. Definition of "DSP enabled UE devices" 9. 
Document StatusCurrent Version: 0.45.0 (SA4#135, February 2026) Recent Updates: - Added 10-degree channel model parameters - Updated simulation parameters per multiple agreed TDOCs - Added company simulation results reporting - Clarified voice packet transport over user plane Working Status: - Active study phase - Collecting simulation results from companies - Coordinating with RAN and SA2 for parameter confirmation - Developing design constraints and performance requirements |
|
| China Mobile Com. Corporation |
[FS_ULBC] WorkPlan of FS_ULBC v0.5
|
Timeplan for FS_ULBC Study Item1. IntroductionThis document outlines the timeplan for the Feasibility Study on Ultra Low Bitrate Speech Codec (FS_ULBC). The study focuses on developing a codec for ultra-low bit rate communication services, particularly for IMS Voice Call Using GEO Access as documented in TR 22.887. Study Item ObjectivesThe FS_ULBC study has nine main objectives:
2. Current Progress StatusApplication Scenarios (85% Complete)
GEO Channel Characteristics & Simulation (75% Complete)
Simulation Methodology (60% Complete)
Company Simulation Results (40% Complete)
Mouth-to-Ear Delay (95% Complete)
Design Constraints ProgressBit Rates (0% Complete)
Sample Rate and Audio Bandwidth (5% Complete)
Frame Length (0% Complete)
Complexity and Memory Demands (80% Complete)
Algorithmic Delay (0% Complete)
Packet Loss Concealment (15% Complete)
Noise Suppression (15% Complete)
DTX (0% Complete)
Design Constraint Verification (5% Complete)
Other Considerations (5% Complete)
Existing Codec Technologies (85% Complete)
Performance Requirements (0% Complete)
Test Methodologies (50% Complete)
Coordination with Other WGs
3. Detailed TimeplanTSG SA#107 (March 12-14, 2025, Incheon, KR)
SA4#131-bis-e (April 11-17, 2025)
Audio SWG Telco (May 5, 2025)
SA4#132 (May 19-23, 2025, Fukuoka, JP)
Audio SWG Telcos (June-July 2025)Multiple telcos scheduled to: - Progress GEO channel characteristics study - Perform RAN-related simulations within SA4 - Align on RAN link-level simulations - Power to send LS to SA2 and RAN WGs if needed SA4#133-e (July 21-25, 2025)
F2F Ad-hoc Meeting (September 23-25, 2025, Erlangen, Germany)
Audio SWG Telcos (October 2025)
SA4#134 (November 17-21, 2025, Dallas, US)Major milestone meeting: - Finalize: - GEO channel characteristics study - Coordination with other WGs - Reference codec identification - Design constraints: bit rates, sample rate, audio bandwidth, frame length, PLC, noise suppression, DTX - Progress: - Feasibility evidence - Objective measures development - Design constraints: complexity, algorithmic delay, speech quality, robustness - Performance requirements and test methodologies - Start defining potential normative work item objectives and timeline - If time permits: Finalize additional application scenarios Audio SWG Telcos (December 2025 - January 2026)
SA4#135 (February 9-13, 2026, India)
TSG SA#111 (March 10-13, 2026, Japan)
SA4#136 (April 13-17, 2026)
SA4#137 (May 11-15, 2026)Final study meeting: - Finalize: - Design constraints: speech quality - Performance requirements and test methodologies (clean/noisy speech, tandeming, clean/GEO channel conditions) - Potential normative work item objectives and timeline TSG SA#112 (June 9-12, 2026, Singapore)
|
|
| China Mobile Com. Corporation |
[FS_ULBC] On Assumptions and Open Issues for NB-IoT GEO Simulation
|
Summary of S4-260149: Updates on Assumptions and Open Issues for NB-IoT GEO Simulation
Document Overview
This contribution from China Mobile addresses outstanding assumptions and open issues for NB-IoT GEO satellite simulation work within the ULBC (Ultra-Low Bitrate Codec) study. The document consolidates discussions from multiple Audio Ad-hoc meetings (June 4, June 17, and July 11) and proposes updates to TS 26.940 clause 5.2.2.4.
Main Technical Contributions
Status Updates on Simulation Parameters
The document provides a status table tracking 11 key simulation issues, with updates on their resolution status:
Resolved Issues
Partially Resolved Issues
Unresolved Issues
Proposal
The document proposes to:
1. Update the P-doc (TS 26.940) based on the status updates provided
2. Continue tracking these issues until full resolution
Key Dependencies
The document highlights several dependencies on other working groups:
- RAN4: UE power class confirmation
- RAN: UL/DL guard time feasibility, protocol overhead confirmation
- SA2: Protocol overhead for different transport configurations |
|
| vivo Mobile Communication Co., |
[FS_ULBC] Updates of the permanent document based on 3GPP TR 23.700-19
|
Summary of 3GPP Technical Document: Updates to FS_ULBC Permanent DocumentDocument OverviewThis contribution updates the FS_ULBC (Ultra Low Bitrate Speech Codec) Permanent Document to align with SA2 conclusions on Key Issue #1 regarding IMS voice call support over NB-IoT via GEO satellite connecting to EPC, as documented in TR 23.700-19. Main Technical Contributions1. Reference UpdatesThe document adds critical new references to align with recent 3GPP work:
2. End-to-End Simulation Model Updates (Clause 5.2.1.3)2.1 Architecture and Protocol Stack ChangesThe document introduces significant modifications to the end-to-end simulation model:
2.2 Transport Mechanism AgreementsBased on SA2 conclusions in TR 23.700-19:
2.3 Simulation Input ParametersKey parameters updated for GEO scenarios:
2.4 Protocol Overhead ConsiderationsTwo protocol overhead scenarios illustrated:
Editor's Note: Exact overhead for UDP/IP (SA2 scope) and RTP (SA4 scope) for the removal/restoration mechanism requires determination.
3. Simulation Assumptions and Open Issues (Clause 5.2.2.4)
3.1 Resolved Issues
| Issue | Resolution |
|-------|-----------|
| Latitude-Dependent Loss | Simulation accounts for latitude-dependent scintillation loss using X term (2.2 dB or 0 dB beyond ±20° latitude per TR 38.821) |
| Elevation Angles | Both 2.3° and 12.5° angles considered using X term for worst-case scenarios |
| Simulation Channel Model | NTN-TDL-C selected |
| Repetition Numbers | Specified and reported in simulation |
3.2 Pending Issues Requiring RAN Input
3.3 Unresolved Issues
3.4 Updated Understanding on Protocol OverheadBased on SA2 agreements:
Key Dependencies and Cross-WG CoordinationThe document identifies several inter-working group dependencies:
Editor's NotesTwo critical editor's notes remain:
|
|
| China Mobile Com. Corporation |
[FS_ULBC] Considerations for ULBC Codec Selection Process
|
Comprehensive Summary: Considerations for ULBC Codec Selection Process
Document Overview
This document appears to be a presentation or discussion paper on the ULBC (Ultra Low Bitrate Codec) selection process. However, the provided content is fragmentary and mixes English and Chinese, which limits the technical analysis possible.
Main Technical Areas Identified
1. ULBC Codec Selection Process
The document's primary focus is on considerations for selecting a codec in the ULBC (Ultra Low Bitrate Codec) context. However, specific technical criteria, evaluation methodologies, or selection parameters are not detailed in the provided content.
2. JPEG AI Integration
Overview
Working Mechanism
Timeline
3. Cross-Working Group CoordinationSA2 Related Work
RAN Liaison Statements
4. Architecture ConsiderationsNetwork Function Changes
Open Questions
The document includes an "Open Questions" section, indicating ongoing discussions and unresolved technical issues. However, the specific questions are not provided in the extracted content.
Technical Gaps in Provided Content
Due to the fragmentary nature of the document provided:
- Specific codec selection criteria are not detailed
- Technical evaluation parameters are missing
- Comparison methodologies between candidate codecs are not present
- Detailed architectural proposals are not included
- Specific agreements or decisions are not documented
Observations
Note: This summary is based on fragmentary content with significant portions in template format or non-English text. A complete technical analysis would require the full document with all technical details, agreements, and proposals. |
|
| vivo Mobile Communication Co., Nokia, Xiaomi Technology, Samsung, Spreadtrum, Bytedance |
[FS_ULBC] Analyzing semantic intelligibility in lossy coded audio signals
|
Comprehensive Summary: Analyzing Semantic Intelligibility in Lossy Coded Audio Signals
1. Introduction and Objectives
This contribution presents experimental evaluation results focusing on semantic intelligibility of audio codecs under Ultra Low Bitrate Codec (ULBC) constraints for GEO satellite communications. The primary objective is to quantify semantic preservation (the listener's ability to accurately understand spoken content) using Automatic Speech Recognition (ASR) Word Error Rate (WER) as a proxy metric, rather than traditional perceptual quality (MOS) metrics.
The study evaluates:
- Descript Audio Codec (DAC): AI-based codec
- Enhanced Voice Services (EVS) codec: 3GPP standard reference
The analysis specifically investigates whether higher audio bandwidths (wideband vs. narrowband) improve or reduce intelligibility at very low bitrates, providing data-driven guidance for audio bandwidth design constraints and quality floor determination.
2. Background and Motivation
2.1 ULBC Context
The ULBC study item targets voice over GEO satellite communications where balancing audio quality, robustness, and bit-efficiency is critical. At extremely low bitrates (< 3 kbps or ~1 kbps), a fundamental trade-off emerges:
- Wideband audio (16 kHz) offers naturalness and perceptual quality
- Bit allocation challenge: Allocating scarce bits to higher frequencies reduces the budget for core speech spectrum, potentially introducing artifacts that outweigh bandwidth benefits
2.2 Critical Communication Requirements
For emergency rescue operations, semantic intelligibility is the highest priority.
Key considerations include: - Wideband generally improves comfort and speaker identification, but its impact on speech understanding in "last resort" scenarios requires verification - System interoperability with legacy endpoints (PSTN, GSM fallback) remains important in remote areas - Need to balance modern expectations with legacy requirements and emergency scenarios 2.3 EVS as Reference AnchorEVS serves as a quality anchor and concrete standardized baseline for semantic preservation, enabling: - Practical quality floor definition for ULBC - Comparison against established carrier-grade standards - Isolation of bandwidth choice impact independent of codec architecture 3. Methodology3.1 Evaluation Pipeline
3.2 Processing Chain
4. Experimental Setup4.1 Codec ConfigurationsDAC model: Evaluated at three sampling rates
- 16 kHz
- 24 kHz EVS codec: Evaluated in standard modes - Narrowband (NB) - Wideband (WB) Baseline: Uncompressed PCM audio (resampled from 48 kHz to NB and WB) 4.2 Observations on Baseline Variance
4.3 Primary Metric
WER (Word Error Rate): Lower percentage indicates better performance. Log-scale visualization employed to distinguish performance differences in the 3-5% WER range.
5. Results and Analysis
5.1 DAC Performance vs Bitrate
Key Findings:
- DAC achieves high efficiency at low bitrates (~2 kbps)
- WER drops rapidly as bitrate increases, stabilizing around 3-4%
- At 1.5 kbps: WER approximately 5.5%
- Significant improvement observed in 1.5-3.0 kbps range
Bandwidth Impact at Low Bitrates:
- At low bitrates (1.5 kbps and 3 kbps), 16 kHz model outperforms 24 kHz model
- With constant model size, 16 kHz model allocates more bits per spectral unit within narrower band
- Results in better semantic preservation vs. 24 kHz model suffering from bit starvation
5.2 DAC 8 kHz Narrowband Model Analysis
A dedicated 8 kHz sampling rate model was trained to investigate bandwidth impact at the lower bitrate bound.
Model Configuration:
- Sample rate: 8000 Hz
- Encoder rates: [2, 4, 4, 8], dimension: 64
- Decoder rates: [8, 4, 4, 2], dimension: 1536
- Quantization: 6 codebooks, size 1024, dimension 36
- Training: 200,000 steps on VCTK corpus
Critical Findings at Sub-1.5 kbps:
- At ~1 kbps:
  - 8 kHz model (938 bps): WER 5.86%
  - 16 kHz model (1000 bps): WER 11.23%
- Semantic penalty > 5 percentage points when forcing WB at 1 kbps
Conclusion: At sub-2 kbps bitrates, available bit budget is insufficient to support wider bandwidth without degrading core spectral content required for intelligibility. Native Narrowband mode allows high-precision bit allocation to fundamental frequencies (0-4 kHz), preserving semantic content more effectively.
5.3 DAC vs EVS Comparison
Competitive Advantage:
- DAC achieves comparable WER scores at significantly lower bitrates than EVS
- DAC 16 kHz performance curves converge towards high-quality PCM baselines faster than traditional codecs
ULBC Application: For GEO scenarios in [1-3] kbps range, semantic preservation is critical for defining quality floor.
5.4 EVS Narrowband vs Wideband Analysis
Performance at Different Bitrates:
Conclusion: For semantic understanding, NB bandwidth limitation is less critical than codec's bit allocation efficiency.
5.5 EVS Degradation Analysis
Methodology: Calculated WER Degradation = (WER_coded - WER_baseline) / (100 - WER_baseline) to isolate codec processing impact from ASR model variance.
Key Findings:
- Semantic loss introduced by EVS in both NB and WB modes is minimal
- Degradation metric confirms that pure coding loss of NB and WB is statistically indistinguishable when subtracting baseline PCM variance
- Additional frequency content in wideband contributes negligible semantic information for machine understanding compared to core NB spectrum
Strategic Implication: Robust NB mode is sufficient for intelligibility requirements of critical last resort communications, without bit starvation risk associated with wider bandwidths at low bitrates.
5.6 Summary of Findings
Strategic Conclusions for ULBC Design:
6. Proposals
Proposal: Include relevant content from Sections 3, 4, and 5 into TR 26.940, capturing:
- Methodology
- Experimental setup
- Analysis of results concerning audio bandwidth impact on semantic intelligibility
7. Detailed Results Tables
Complete experimental data provided in appendix tables covering:
- Table 1.a: DAC Model Results (16/24/44 kHz) across bitrates 500-7751 bps
- Table 1.b: DAC NB Model Results (8 kHz) across bitrates 312-1875 bps |
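The two metrics used in this summary can be sketched in a few lines of Python. This is a minimal illustration, not the contributors' evaluation pipeline: the function names are hypothetical, WER is computed with the usual word-level edit distance (an assumption, since the summary does not describe the scoring tool), and the baseline WER passed to the degradation formula is a placeholder because the summary lists no baseline values.

```python
def word_error_rate(reference, hypothesis):
    """WER in percent: word-level edit distance divided by reference length.

    Assumes a non-empty reference transcript.
    """
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] = edit distance between the ref prefix processed so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return 100.0 * prev[-1] / len(ref)


def wer_degradation(wer_coded, wer_baseline):
    """WER Degradation = (WER_coded - WER_baseline) / (100 - WER_baseline).

    Both inputs are percentages, as in section 5.5.
    """
    return (wer_coded - wer_baseline) / (100.0 - wer_baseline)


# Semantic penalty at ~1 kbps in percentage points (WER values from section 5.2).
penalty_pp = 11.23 - 5.86

# Degradation against a HYPOTHETICAL 3% baseline WER (not a value from the source).
example_degradation = wer_degradation(5.86, 3.0)

# Toy transcript pair: one dropped word out of five reference words.
toy_wer = word_error_rate("send help to base camp", "send help to camp")
```

The penalty evaluates to about 5.37 percentage points, consistent with the "> 5 percentage points" statement for forcing WB at 1 kbps.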
|
| China Mobile Com. Corporation |
[FS_ULBC]pCR on Existing codec technologies
|
Summary of pCR on Existing Codec Technologies (S4-260154)
Document Information
Purpose and Scope
This pCR proposes updates to Clause 7.1 of TR 26.940, which documents existing codec technologies for evidence that design criteria can be met and for comparison/evaluation purposes. The document adds information about recently emerged ultra-low bit-rate voice codecs (below 1 kbps) as reference for further work.
Main Technical Contributions
Expanded Codec Technology Reference Table
The pCR significantly expands Table 7.1.1-1 "List of existing codec technologies" by adding multiple categories of codecs beyond the existing 3GPP IMS codecs. The table includes the following parameters for each codec:
New Codec Categories Added
1. Conventional Ultra Low Bitrate Codecs
2. AI-Based Decoders
3. AI-Based Encoder and Decoder (Causal)
These codecs support real-time operation:
- LPCNet: 1.6 kbps, WB, 40ms frames, 25ms delay
- LyraV2 (SoundStream): 3.2-9.2 kbps, WB, 20ms frames
- EnCodec: 1.5-24 kbps, 24kHz/FB, 0-1000ms delay, 13.3ms frames
- Mimi-Codec: 0.55-1.1 kbps, 24kHz, 80ms frames, 0ms delay
- TS3: 0.64-0.8 kbps, WB, 20ms frames, 0ms delay
- TAAE: 0.4-0.7 kbps, WB, 20-40ms frames, 0ms delay
- LMCodec2: Parameters TBD
4. AI-Based Encoder and Decoder (Non-Causal)
These codecs are designed for offline/non-real-time applications:
- DAC: 0.5-3 kbps, WB/24kHz, 244-366ms delay
- DAC-IBM: 0.75-3 kbps, 24kHz, 366ms delay
- SNAC: 0.98 kbps, 24kHz, 1000ms delay, 80ms frames
- SpeechTokenizer: 0.5-1.0 kbps, WB, full-signal delay
- SemantiCodec: 0.31-1.4 kbps, WB, 10-40ms frames, full-signal delay
- FunCodec: 0.25-1.0+ kbps, WB, 20-40ms frames
- WavTokenizer: 0.25-0.9 kbps, 24kHz, 25-40ms frames
- BigCodec: 1.04 kbps, WB, 12.5ms frames
- FocalCodec: 0.16-0.65 kbps, WB, 20-80ms frames
- ALMTokenizer: 0.41 kbps, WB, 13.3ms frames
- XY-Tokenizer: 1 kbps, WB, 20ms frames
- LongCat-Audio-Codec: 0.43-0.87 kbps, WB, 60ms frames
- AcademiCodec: Parameters TBD
- MuCodec: 0.35-1.35 kbps, FB
Additional Notes
The pCR includes several important notes:
An editor's note indicates that more codecs may be added to the table in future revisions.
Key Observations
The pCR demonstrates significant industry progress in ultra-low bitrate speech coding, particularly:
- Multiple AI-based solutions achieving sub-1 kbps bitrates
- Wide range of delay characteristics (0ms to 1000ms)
- Various bandwidth support (NB to FB)
- Different availability levels for specifications and software implementations |
|
| vivo Mobile Communication Co., Xiaomi Technology, Spreadtrum, Bytedance |
[FS_ULBC] Analysis of AI Codec Real-Time Performance (RTF) and Complexity Scaling
|
Analysis of AI Codec Real-Time Performance (RTF) and Complexity Scaling
1. Introduction and Motivation
This contribution addresses a critical gap in the Ultra Low Bitrate Speech Codec (ULBC) study by moving beyond theoretical complexity metrics (FLOPs, WMOPS) to evaluate real-world performance on mobile devices. The key observation is that static metrics fail to capture system-level bottlenecks including memory bandwidth pressure and thermal constraints on mobile SoCs. The document presents a comprehensive RTF analysis of a neural audio codec (based on Descript Audio Codec architecture) across multiple model sizes and sample rates on representative mid-range mobile hardware.
2. Experimental Setup
2.1 Model Configuration
Eight model variants were evaluated, ranging from enc8dec144 to enc64dec1536, with parameter counts spanning 1M to 74M:
Key complexity observations from Table 1:
- Parameter counts range from 1.09M (enc8dec144) to 74.50M (enc64dec1536)
- Model sizes range from 4.3 MB to 283.6 MB
- Computational complexity scales proportionally with sample rate (e.g., enc32dec768: 4955.9 MFlops/s @ 8kHz, 9972.6 MFlops/s @ 16kHz, 20006.1 MFlops/s @ 32kHz)
2.2 Device Under Test (DUT) Environment
3. Results and Analysis
3.1 Complexity Scaling vs. Bandwidth
Critical finding: For a given model variant, computational complexity scales linearly with sample rate:
- enc32dec768 example:
  - 8 kHz: ~0.20 GFLOP counts (4955.9 MFlops/s)
  - 16 kHz: ~0.40 GFLOP counts (9972.6 MFlops/s) - 2x increase
  - 32 kHz: ~0.80 GFLOP counts (20006.1 MFlops/s) - 4x increase
Implication: Higher sampling rates incur proportional computational penalty. For resource-constrained devices (IoT, wearables), NB mode at 8 kHz is recommended.
3.2 Real-Time Factor (RTF) Analysis Across Three Frequency Tiers
3.2.1 Tier 1: Low Frequency (A55@750MHz, A78@902MHz, A78+@1.1GHz)
Energy-conserving state with severe constraints:
3.2.2 Tier 2: Mid Frequency (A55@1.0GHz, A78@1.16GHz, A78+@1.37GHz)
Typical sustained workload state:
3.2.3 Tier 3: High Frequency (A55@1.73GHz, A78@1.45GHz, A78+@1.63GHz)
High-performance state approaching sustained limits:
Key observation: Inverse relationship between sample rate and model size capacity is consistently demonstrated.
3.3 Maximum Performance Envelope
Analysis at peak locked frequencies establishes absolute upper bounds:
3.3.1 Efficiency Core (Cortex-A55 @ 2.0 GHz)
Even at peak frequency, A55 remains highly constrained. Models exceeding ~5M parameters (enc16dec384) fail real-time constraints at 8 kHz and above. Unsuitable for large weight matrices.
3.3.2 Performance Core (Cortex-A78 @ 2.6 GHz)
Most relevant benchmark for ULBC: represents sustained compute capability of modern mobile devices. Critical "Complexity vs. Bandwidth" trade-off identified:
3.3.3 Prime Core (Cortex-A78+ @ 3.0 GHz)
Results mirror A78 trends with slight improvements due to higher clock frequency. The bandwidth bottleneck remains dominant: higher clock speed provides safety margin for borderline models (e.g., enc24dec576 @ 32kHz) but doesn't fundamentally shift feasible model size category.
4. Key Technical Contributions
4.1 Quantified Complexity-Bandwidth Trade-off
Established precise inverse relationship: halving sample rate approximately doubles feasible parameter count on performance cores:
- 32 kHz → 10M parameters
- 16 kHz → 20M parameters
4.2 Real-World Performance Benchmarks
Provided concrete RTF measurements across representative mobile hardware configurations, revealing that:
- Theoretical complexity metrics (FLOPs) don't capture real-world bottlenecks
- Memory bandwidth and thermal constraints significantly impact feasibility
- Efficiency cores (A55) are unsuitable for neural codec workloads beyond minimal complexity
4.3 Practical Complexity Constraints for ULBC
Identified 10M parameter hard limit for 32 kHz operation on mid-range mobile devices (A78 @ 2.6 GHz), providing concrete guidance for ULBC candidate selection.
5. Proposal
The contribution proposes including these RTF analysis findings in TR 26.940 to inform complexity constraint selection for ULBC candidates, moving the standardization process toward real-world deployability considerations rather than purely theoretical metrics. |
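The scaling claims in this summary can be checked with a short Python sketch. It is illustrative only: the helper names are hypothetical, the RTF definition (processing time divided by audio duration) is the common one and is assumed here because the summary does not spell it out, and the parameter-budget function simply encodes the quoted inverse relationship anchored at the 32 kHz → 10M point.

```python
# MFlops/s figures for enc32dec768 quoted in section 2.1 of the summary.
MFLOPS_BY_RATE_KHZ = {8: 4955.9, 16: 9972.6, 32: 20006.1}


def scaling_vs_reference(mflops_by_rate, reference_khz=8):
    """Complexity at each sample rate relative to the reference rate."""
    ref = mflops_by_rate[reference_khz]
    return {khz: value / ref for khz, value in mflops_by_rate.items()}


def real_time_factor(processing_time_s, audio_duration_s):
    """RTF = processing time / audio duration (assumed definition).

    Values below 1.0 mean the codec keeps up with real time.
    """
    return processing_time_s / audio_duration_s


def feasible_params_millions(sample_rate_khz, anchor_khz=32, anchor_params_m=10.0):
    """Inverse-proportional parameter budget anchored at 32 kHz -> 10M."""
    return anchor_params_m * anchor_khz / sample_rate_khz


ratios = scaling_vs_reference(MFLOPS_BY_RATE_KHZ)
budget_16k = feasible_params_millions(16)   # matches the quoted 16 kHz -> 20M
rtf_example = real_time_factor(0.010, 0.020)  # hypothetical 10 ms to process 20 ms
```

The ratios come out close to 2x and 4x, in line with the linear-scaling finding. Applying the budget function to sample rates other than the two quoted points would be an extrapolation that the summary does not confirm.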
|
| vivo, Samsung, MediaTek Inc., Bytedance, Nokia, Xiaomi, Spreadtrum |
[FS_ULBC] Discussion on Audio Bandwidth for ULBC
|
Technical Summary: Audio Bandwidth Requirements for ULBC
1. Introduction and Scope
This contribution addresses audio bandwidth design constraints for the Ultra-Low Bitrate Codec (ULBC), targeting primarily voice over GEO satellite communications. The document argues against mandatory Wideband (WB) and Super-Wideband (SWB) support, proposing instead that Narrowband (NB) should be mandatory with WB as an enhancement.
2. Key Technical Arguments
2.1 Global NB Usage and System Efficiency
Current Network Reality:
- 2G/3G connections (primarily AMR-NB) still represent 20% of global technology mix (end of 2023)
- Regional variations: 81% in Sub-Saharan Africa, 46% in Middle East and North Africa
- NB serves as universal fallback for interoperability (CS fallback scenarios)
System Inefficiency Without NB Mode:
- WB ULBC to NB user calls waste upper frequency band (4-8 kHz)
- Significant bitrate wasted transmitting data that recipient cannot hear
- Over expensive, scarce satellite link, this inefficiency is unacceptable
- Native NB mode provides most efficient solution for legacy network connectivity
2.2 User Expectations in "Last Resort" Scenarios
Baseline Expectation Setting:
- GEO call is final option after terrestrial network failure
- Users typically experience AMR-NB fallback before resorting to GEO
- ULBC must be at least as reliable as NB fallback to meet user expectations
- WB-only ULBC failure in conditions where NB would work represents service failure
2.3 Primary Use Case: Emergency Communications
Typical Deployment Scenario:
- Rescue teams in remote areas (e.g., Himalayan mountains)
- Mixed-connectivity environment:
  - Squad A: GEO-only (outside TN coverage)
  - Squad B: GSM fallback at coverage fringe
  - Base Camp: PSTN connection (NB service)
Technical Implications:
- Terminating endpoints predominantly NB
- Emergency systems use traditional NB codecs (Codec2, MELP) for robustness
- Transmitting WB over satellite to NB endpoint wastes critical resources in life-or-death situations
- Real-world deployment example provided (China rescue missions)
Evaluation Priority:
- ULBC candidates should prioritize intelligibility and robustness testing in NB mode
2.4 Performance at Very Low Bitrates
Quality vs. Bandwidth Trade-off:
- Forcing wider bandwidth at very low bitrates spreads available data too thinly
- Research shows lower sampling rates can achieve higher perceptual quality at very low bitrates
- WB codec at ~1 kbps may compromise intelligibility, especially with packet loss
- NB signal more robustly reconstructed under constrained conditions
Analogy: "Spreading butter" - concentrating bits on narrower bandwidth preserves speech richness and intelligibility
2.5 Complexity and Power Consumption
Computational Scaling Issues:
- AI-based codec architectures don't scale gracefully
- Doubling sampling rate (NB to WB): 2x to 4x complexity increase for CNN/Transformer models
- WB-only mandate imposes unnecessary computational burden
- Critical issue for power-constrained mobile devices
- Native NB mode offers high-quality voice at significantly lower complexity/power budget
3. Experimental Analysis: Higher Bandwidth Inefficiency
3.1 Experiment Setup
Test Configuration:
- Codec: Descript Audio Codec (DAC) with pre-trained models
- Sampling rates tested: 44.1 kHz, 24 kHz (SWB), 16 kHz (WB)
- Test corpus: 100 clean speech samples from MS-SNSD dataset
- Bitrate variation: 1-9 active quantization codebooks
- Quality metric: ViSQOL algorithm (speech mode, MOS estimate)
Model Specifications:
| Model | Compression | Frame Rate | Codebooks | Bitrate/Codebook |
|-------|-------------|------------|-----------|------------------|
| 16 kHz (WB) | 320x [2,4,5,8] | 50 Hz | 12 (10-bit) | 0.50 kbps |
| 24 kHz (SWB) | 320x [2,4,5,8] | 75 Hz | 32 (10-bit) | 0.75 kbps |
| 44.1 kHz | 512x [2,4,8,8] | ~86.1 Hz | 9 (10-bit) | ~0.86 kbps |
3.2 Key Experimental Findings
Quality vs. Bitrate Results:
- WB (16 kHz): Achieves excellent quality (ViSQOL MOS > 4.0) at ~2.5 kbps
- 24 kHz SWB: Requires higher bitrate to match WB quality
- 44.1 kHz: Provides minimal perceptible improvement over 24 kHz SWB
- Conclusion: Bitrate cost of SWB not justified by quality improvement for voice content
Efficiency Analysis:
- Clear trend: diminishing returns for bandwidth beyond WB
- SWB/FB represents inefficient use of bandwidth for ULBC service
4. Proposed Design Constraints
4.1 Bandwidth Requirements
Mandatory Support:
1. 8 kHz sampling rate (NB): 50-4000 Hz audio bandwidth
2. 16 kHz sampling rate (WB): 50-8000 Hz audio bandwidth
   - Enhanced quality where channel conditions and device capabilities permit
   - WB support can be limited to higher bitrates than NB operation
Further Study:
- Necessity and feasibility of SWB and FB support remains FFS
4.2 Text Proposal for TR 26.940
Change to Table 6.2-1 (Design Constraint Parameters):
Sample rate and audio bandwidth:
- The ultra low bitrate codec shall support sampling rates of 8kHz (NB) and 16kHz (WB)
- Supported audio bandwidth:
  - NB: 50-4000 Hz
  - WB: 50-8000 Hz
5. Supporting Evidence Summary
Quantitative Data:
- 20% global 2G/3G connections (hundreds of millions of users)
- Regional NB dominance: up to 81% in some areas
- WB achieves MOS > 4.0 at 2.5 kbps
- 2x-4x complexity increase for WB vs. NB in AI codecs
Qualitative Arguments:
- System efficiency (no wasted bandwidth to NB endpoints)
- User expectation alignment (last resort reliability)
- Emergency use case requirements
- Computational/power constraints for mobile devices
- Diminishing returns for SWB/FB at target bitrates |
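The frame rate and per-codebook bitrate columns of the Model Specifications table follow from the compression factor and the 10-bit codebooks. The Python sketch below reproduces them under that reading; the dictionary layout and function names are hypothetical, and the interpretation that the bracketed values are stride factors whose product gives the compression ratio is an assumption that happens to match the table.

```python
import math

# Values copied from the Model Specifications table.
MODELS = {
    "16 kHz (WB)": {"sample_rate_hz": 16000, "strides": [2, 4, 5, 8], "bits": 10},
    "24 kHz (SWB)": {"sample_rate_hz": 24000, "strides": [2, 4, 5, 8], "bits": 10},
    "44.1 kHz": {"sample_rate_hz": 44100, "strides": [2, 4, 8, 8], "bits": 10},
}


def frame_rate_hz(sample_rate_hz, strides):
    """Latent frames per second: sample rate divided by total compression."""
    return sample_rate_hz / math.prod(strides)


def per_codebook_kbps(sample_rate_hz, strides, bits):
    """Bitrate contributed by one active codebook, in kbps."""
    return frame_rate_hz(sample_rate_hz, strides) * bits / 1000.0


def total_kbps(model, active_codebooks):
    """Total bitrate when only the first `active_codebooks` codebooks are used."""
    return per_codebook_kbps(**model) * active_codebooks


wb_per_codebook = per_codebook_kbps(**MODELS["16 kHz (WB)"])     # 0.50 kbps
swb_per_codebook = per_codebook_kbps(**MODELS["24 kHz (SWB)"])   # 0.75 kbps
fb_per_codebook = per_codebook_kbps(**MODELS["44.1 kHz"])        # ~0.86 kbps
wb_five_codebooks = total_kbps(MODELS["16 kHz (WB)"], 5)         # 2.5 kbps
```

Under this arithmetic, five active codebooks of the 16 kHz model correspond to the ~2.5 kbps operating point quoted for ViSQOL MOS > 4.0, although the summary does not state which codebook count was used there.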
|
| vivo Mobile Communication Co., |
[FS_ULBC] Analysis of AI Codec Complexity Scaling
|
Complexity Analysis of AI Codec Scaling for ULBC
1. Introduction
This contribution addresses the need for establishing relevant complexity evaluation methods for the new ULBC codec standardization. Previous contributions (e.g., S4aA250264) highlighted potential gaps between theoretical complexity metrics (FLOPs) and practical on-device performance (Real-Time Factor). This document provides a complementary analysis focusing on how complexity metrics scale with AI model architecture itself. The analysis investigates the relationship between model architecture, theoretical complexity, and traditional metrics using the publicly available DAC codec as a test case.
2. Analysis of AI Codec Complexity Scaling
2.1. Methodology
The analysis created seven "dummy" model variants based on the open-source DAC codec's 16kHz configuration. The approach:
Complexity Metrics Measured:
Implementation Notes:
- Each AI operation implemented in pure C
- Source files annotated and compiled using wmc_tool
- WMOPS highly sensitive to C implementation efficiency
- Naive implementations can yield significantly higher counts than optimized versions
2.2. Complexity vs. Model Dimensions
Key Findings:
2.3. WMOPS vs. Model Parameters
Key Finding: Clear relationship between AI model size (in millions of parameters) and traditional WMOPS complexity.
Observations on DAC Model:
2.4. Summary of Scaled Variants
Complete complexity metrics for all seven DAC variants (16kHz, 20ms frame):
| Variant | Enc Dim | Dec Dim | Params (M) | GFLOP counts | MFLOP/s | WMOPS Enc | WMOPS Dec |
|---------|---------|---------|------------|--------------|---------|-----------|-----------|
| enc8dec144 | 8 | 144 | 1.09 | 0.009 | 437.09 | 333.92 | 760.53 |
| enc12dec288 | 12 | 288 | 2.89 | 0.028 | 1397.63 | 648.23 | 2732.96 |
| enc16dec384 | 16 | 384 | 4.94 | 0.050 | 2481.98 | 1060.79 | 4724.38 |
| enc24dec576 | 24 | 576 | 10.76 | 0.112 | 5578.38 | 2228.92 | 10399.00 |
| enc32dec768 | 32 | 768 | 18.90 | 0.198 | 9911.72 | 3693.56 | 18093.30 |
| enc40dec960 | 40 | 960 | 29.34 | 0.310 | 15482.00 | 5599.48 | 28019.70 |
| enc64dec1536 | 64 | 1536 | 74.50 | 0.792 | 39614.50 | 13675.30 | 70766.69 |
Data demonstrates rapid scaling of all metrics as encoder and decoder dimensions increase.
3. Observations and Conclusions
Based on the DAC model variant analysis:
4. Proposal
It is proposed to capture the above analysis into 3GPP TR 26.940. |
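The relationships between the columns of the table in section 2.4 can be explored with a short Python sketch. Two points are assumptions rather than statements from the summary: that the "GFLOP counts" column is per 20 ms frame (which reproduces the MFLOP/s column to within the rounding of the three-decimal GFLOP values), and that summing encoder and decoder WMOPS is a meaningful total. The function names are hypothetical.

```python
FRAME_S = 0.020  # 20 ms frame, as stated for the table

# (variant, params_M, gflop_per_frame, mflop_per_s, wmops_enc, wmops_dec)
VARIANTS = [
    ("enc8dec144", 1.09, 0.009, 437.09, 333.92, 760.53),
    ("enc12dec288", 2.89, 0.028, 1397.63, 648.23, 2732.96),
    ("enc16dec384", 4.94, 0.050, 2481.98, 1060.79, 4724.38),
    ("enc24dec576", 10.76, 0.112, 5578.38, 2228.92, 10399.00),
    ("enc32dec768", 18.90, 0.198, 9911.72, 3693.56, 18093.30),
    ("enc40dec960", 29.34, 0.310, 15482.00, 5599.48, 28019.70),
    ("enc64dec1536", 74.50, 0.792, 39614.50, 13675.30, 70766.69),
]


def mflops_from_gflop_per_frame(gflop_per_frame, frame_s=FRAME_S):
    """Convert a per-frame GFLOP count into MFLOP/s (assumes per-frame counts)."""
    return gflop_per_frame * 1000.0 / frame_s


def total_wmops_per_mparam(params_m, wmops_enc, wmops_dec):
    """Encoder + decoder WMOPS divided by parameter count in millions."""
    return (wmops_enc + wmops_dec) / params_m


for name, params_m, gflop, mflops, w_enc, w_dec in VARIANTS:
    derived = mflops_from_gflop_per_frame(gflop)
    ratio = total_wmops_per_mparam(params_m, w_enc, w_dec)
    print(f"{name:13s} derived {derived:9.1f} MFLOP/s (reported {mflops:9.2f}), "
          f"{ratio:7.1f} WMOPS per M params")
```

For all seven variants the ratio stays within a fairly narrow band (roughly 1000 to 1200 WMOPS per million parameters), which is one way to read the "clear relationship" noted in section 2.3.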
|
| vivo, Samsung, Spreadtrum, MediaTek Inc. |
[FS_ULBC] On codec bitrate and capacity discussion for ULBC
|
Summary of 3GPP CR S4-260159: On Codec Bitrate and Capacity Discussion for ULBC
1. Introduction
This CR addresses the TBS (Transport Block Size) and codec bitrate values for ULBC (Ultra Low Bitrate Codec) evaluation, which are currently noted as 'companies reported' in TR 26.940 v0.4.0. The contribution provides analysis on:
- Multiplexed UE number analysis
- Confirmation of TBS/Codec bitrate values
2. Technical Analysis
2.1 Multiplexed UE Number Analysis
The document presents a methodology for calculating supported UE numbers considering:
- TDM (Time Division Multiplexing): Both UL and DL can schedule different UEs in TDM manner
- FDM (Frequency Division Multiplexing): UL can additionally use FDM since NPUSCH may occupy few subcarriers within 180kHz bandwidth
  - FDM capacity: 48 UEs for 3.75kHz SCS (single tone)
  - FDM capacity: 12 UEs for 15kHz SCS (single tone)
- Bi-directional constraint: Final supported UE number is the minimum of UL and DL capacity
2.2 Capacity Evaluation Results
Analysis conducted under 50-degree elevation channel model with 2% BLER:
Key Observations:
- Observation 1: Higher UE transmit power leads to higher capacity (multiplexed UE number) for a given codec bitrate
- Observation 2: For codec bitrate of ~3kbps, capacity is limited to ~10 UEs with 31dBm UE power. Capacity further degrades with increased bitrate (e.g., ≤5 UEs for 4.5kbps)
Performance characteristics:
- 23 dBm UE power shows very poor capacity
- Performance improves with higher power UEs (26 dBm, 31 dBm)
- Capacity increases with ptime value
2.3 Benchmark Considerations
Additional analysis provided in Annex assuming 1% BLER under 10-degree elevation channel model.
3. Proposed Changes to TR 26.940
3.1 TBS and Codec Bitrate Tables
The CR proposes specific TBS values selected from TS 36.213 table 16.5.1.2-2 for NB-IoT NPUSCH, with corresponding PHY bitrates and codec bitrates calculated for each bundling period (assuming 7-byte packet header).
Table 1: 80ms bundling
- TBS range: 88-256 bits
- PHY bitrate range: 1.1-3.2 kbps
- Codec bitrate range: 0.4-2.5 kbps
Table 2: 160ms bundling
- TBS range: 120-424 bits
- PHY bitrate range: 0.75-2.65 kbps
- Codec bitrate range: 0.4-2.30 kbps
Table 3: 320ms bundling
- TBS range: 208-808 bits
- PHY bitrate range: 0.65-2.52 kbps
- Codec bitrate range: 0.475-2.35 kbps
3.2 Additional Notes
NOTE 1: Final packet header size depends on SA2 and RAN conclusions, including feasibility of 1-byte MAC header
NOTE 2: Packet header counted only once regardless of bundled voice frames
NOTE 3: Relationship between voice frame duration and bundling time depends on RTP payload design. Loss of single TB means loss of multiple consecutive voice frames when bundled.
4. Proposals
Proposal 1: Agree that the codec bitrate should be bounded to less than 3 kbps
Proposal 2: Agree to the proposed changes to Section 5.2.2.2 (Uplink simulation parameters) of TR 26.940, including:
- Updated TBS values and PHY bitrates tables
- Voice bundling periods: 80ms, 160ms, 320ms (40ms excluded due to insufficient time for DL transmissions with 3.75kHz SCS)
- Target BLER values: 1%, 2%, 6%, 10%
- Maximum Achievable SNR formula incorporating UE power (23/26/31 dBm), bandwidth, and antenna gain variations
5. Supporting Information
The Annex provides additional multiplexed UE number analysis for different codec bitrates and UE power levels under 10-degree elevation channel model, supporting the main technical conclusions. |
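The PHY and codec bitrate ranges in Tables 1 to 3 can be reproduced from the TBS ranges with simple arithmetic, sketched below in Python. The function names are hypothetical; the sketch assumes one transport block per bundling period and a 7-byte (56-bit) header counted once per bundle, as stated in the summary and NOTE 2.

```python
HEADER_BITS = 7 * 8  # 7-byte packet header, counted once per bundle

# Bundling period in ms -> (minimum TBS, maximum TBS) in bits, from Tables 1-3.
TBS_RANGES = {80: (88, 256), 160: (120, 424), 320: (208, 808)}


def phy_bitrate_kbps(tbs_bits, bundling_ms):
    """PHY bitrate: one transport block per bundling period (bits/ms = kbps)."""
    return tbs_bits / bundling_ms


def codec_bitrate_kbps(tbs_bits, bundling_ms, header_bits=HEADER_BITS):
    """Codec bitrate: payload left after removing the packet header."""
    return (tbs_bits - header_bits) / bundling_ms


for bundling_ms, (tbs_min, tbs_max) in TBS_RANGES.items():
    print(f"{bundling_ms} ms bundling: "
          f"PHY {phy_bitrate_kbps(tbs_min, bundling_ms):.3f}-"
          f"{phy_bitrate_kbps(tbs_max, bundling_ms):.3f} kbps, "
          f"codec {codec_bitrate_kbps(tbs_min, bundling_ms):.3f}-"
          f"{codec_bitrate_kbps(tbs_max, bundling_ms):.3f} kbps")
```

All three tables are consistent with this calculation, so every listed codec bitrate stays below the 3 kbps bound of Proposal 1. A smaller header, such as the 1-byte MAC header mentioned in NOTE 1, would raise the codec bitrate available for a given TBS.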
|
| Dolby Laboratories Inc., Novamint, Nokia |
On ULBC complexity and RTF analysis
|
Summary of S4-260165: ULBC Complexity and RTF Analysis
Background and Motivation
This contribution addresses the need to finalize complexity and memory design constraints for the ULBC (Ultra-Low Bitrate Codec) study. Previous discussions at SA4 #133-e and the ULBC ADHOC meeting explored various complexity metrics and RTF performance data for existing AI codecs (DAC, Lyra v2, HIL). However, insufficient data exists to draw definitive conclusions on complexity constraints for ULBC.
The document builds upon previous contribution S4-251844 with the following modifications:
- Added CPU core information for experiments
- Aligned RTF definition with TR 26.940 clause 7.5.3
- Focused on model sizes 3-20M parameters (more relevant to ULBC use cases)
- Provided pCR for TR 26.940
- Removed large chunk-based processing experiments (not relevant for real-time voice communication)
Experimental Setup and Methodology
Model Configuration
Modified DAC architecture with reduced parameters while maintaining general structure:
- Model sizes: 20M, 15M, 9M, and 3M parameters (float32 precision)
- Training: Optimized for ~1 kbps bitrate at 32 kHz sampling rate
- Encoder rates: 4,4,8,10 for all models
Complexity Analysis
Theoretical Complexity (GMACS):
- Computed using ptflops library
- Results show linear relationship between model size and GMACS:
  - 20M: 5.14 GMACS
  - 15M: 4.03 GMACS
  - 9M: 2.39 GMACS
  - 3M: 0.79 GMACS
RTF Testing Methodology
Experimental Results
Test Devices
Device 1 (2023):
- Hexa-core CPU: 2×3.46 GHz (P core) + 4×2.02 GHz (E core)
- Dynamic core switching observed between P and E cores
Device 2 (2022):
- Octa-core CPU: 1×3.00 GHz Cortex-X2 + 3×2.50 GHz Cortex-A710 + 4×1.80 GHz Cortex-A510
- Processing on Cortex-X2 with frequency switching between 2.4 GHz and 1.8 GHz
RTF Performance Results
| Model Size | Max RTF (High Performance) | Max RTF (Power Efficient) |
|------------|---------------------------|---------------------------|
| 20M | 0.39-0.63 | 0.81-0.9 |
| 15M | 0.29-0.43 | 0.66-0.74 |
| 9M | 0.19-0.29 | 0.44-0.57 |
| 3M | 0.09-0.13 | 0.18-0.31 |
Results demonstrate linear increase in RTF with model size across both performance modes.
Key Observations
Proposed Text for TR 26.940
The contribution provides a comprehensive pCR adding new clause 6.2.1.7 "RTF and MACS analysis for AI based codecs" with detailed experimental results. Key additions to TR 26.940 include:
Complexity Considerations (Clause 6.2.1)
Complexity Metrics (Clause 6.2.1.4)
Target Devices (Clause 6.2.1.5)
Key Conclusions (Clause 6.2.1.6)
Experimental Data (Clause 6.2.1.7)
Proposal
Document the experimental methodology, results, and observations in clause 6.2.1 of TR 26.940 as shown in the provided pCR. |
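The linear relationship between model size and GMACS reported in this summary can be checked with a least-squares line through the origin. The Python sketch below is illustrative only: the fitting choice (no intercept) and the function names are assumptions, and the RTF threshold of 1.0 reflects the common reading that values below 1 mean real-time operation, which the summary does not state explicitly.

```python
# Model sizes (millions of parameters) and GMACS from the complexity analysis.
PARAMS_M = [20.0, 15.0, 9.0, 3.0]
GMACS = [5.14, 4.03, 2.39, 0.79]

# Upper end of the Max RTF ranges in power-efficient mode, from the results table.
MAX_RTF_POWER_EFFICIENT = {"20M": 0.9, "15M": 0.74, "9M": 0.57, "3M": 0.31}


def slope_through_origin(xs, ys):
    """Least-squares slope of y = a * x (no intercept)."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


def is_real_time(rtf, limit=1.0):
    """True when the measured RTF stays below the assumed real-time limit."""
    return rtf < limit


slope = slope_through_origin(PARAMS_M, GMACS)  # GMACS per million parameters
residuals = [y - slope * x for x, y in zip(PARAMS_M, GMACS)]
all_real_time = all(is_real_time(r) for r in MAX_RTF_POWER_EFFICIENT.values())
```

The fitted slope is about 0.26 GMACS per million parameters, with residuals of roughly 0.1 GMACS or less, which supports the linear description. Even the largest listed RTF value (0.9 for the 20M model in power-efficient mode) remains below 1.0, though with little margin.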
|
| vivo Mobile Communication Co., |
[FS_ULBC] Discussion on Methodology for Delay & Error Trace Generation
|
Discussion on Methodology for Delay & Error Trace Generation for FS_ULBC
Introduction
This contribution addresses the ongoing debate within SA4 Audio SWG regarding the methodology for generating delay and error traces for Ultra Low Bitrate Codec (ULBC) evaluation under Non-Terrestrial Network (NTN) conditions. Two competing approaches have emerged:
The contribution proposes clarifying the purpose of these simulations by distinguishing between Design and Verification phases.
Analysis of Current Approaches
2.1 The Precedent: LTE Simulation Methodology (TS 26.132)
Legacy Mechanism (Trace Generation)
The LTE MTSI testing methodology in TS 26.132 (Annex E and F) operated on "Stationary" conditions:
Usage for Verification (Annex E & F)
Critically, TS 26.132 defined these traces as verification tools (System Testing):
Key Finding: Profiles were treated as Test Vectors to verify robustness against defined impairments, not as "realistic channel recordings" to train codec design.
The Shift for NTN
NTN scenarios introduce challenges that invalidate the LTE approach:
2.2 Analysis of Current Approaches for FS_ULBC
Approach A: The "Realism" Perspective (Fixed BLER)
Methodology:
- Define TBSs for each candidate bitrate and bundling time
- Traverse all link parameters (SCS, Tone, etc.) to evaluate if resulting link budgets satisfy predefined Target BLER
- Generate error trace for each configuration meeting BLER threshold
- Number of output traces = Number of defined Target BLERs (for each TBS)
Underlying Assumption: AI-based Codecs (specifically PLC mechanisms) require specific "real" error patterns during training/design phase
Observation: Limits testing scope to specific "safe" operating points, potentially overlooking codec behavior under unexpected channel degradation
Approach B: The "Resource" Perspective (Fixed SNR)
Methodology:
- Normalize TBS across all candidate codec bitrates assuming consistent packet overhead
- For each unique Link Budget (fixed SNR) derived from specific UE, satellite, and link parameters, generate dedicated error traces
- Number of output traces = Number of unique Link Budgets (for each TBS)
Underlying Assumption: Mimics "Best Effort" or competitive scenario similar to EVS selection, where end-to-end quality (MOS) matters more than intermediate BLER
Observation: Logically sound for optimizing system performance, but implies vast search space potentially leading to unmanageable simulation workload
2.3 The Core Issue: Verification vs. Design
The Logic Chain
The standard workflow should be: Delay/Error Profiles Generation → Codec/PLC Verification → System Performance Evaluation
Misalignment
The current deadlock stems from treating RAN simulation outputs as Design Constraints (training data) rather than Verification Tools.
Key Principles:
2.4 Proposal for the Way Forward
Re-orient simulation efforts towards generating a Verification Suite rather than a "Perfect Reality Model":
Proposal: Multi-point Fine-grained Trace Generation (MFTG)
The MFTG methodology aims to decouple physical layer simulation assumptions from application-layer codec design by providing a high-resolution library of error traces rather than a single static operating point.
Step 1: Resource Baseline Normalization (TBS Definition)
Step 2: Link Budget Mapping and Granularity Setup
Step 3: Large-scale Link-Level Simulation (LLS)
Step 4: Flexible Trace Selection for Verification
Conclusion
While the source understands the rationale behind both the Fixed BLER approach and the Fixed Resource / Link Budget approach for GEO network simulation, a compromise solution is necessary for FS_ULBC to progress. MFTG is therefore proposed for consideration and agreement. |
|
| vivo Mobile Communication Co., |
[FS_ULBC] Proposed ULBC design constraints living document
|
Candidate Convenor for 3GPP Systems Aspects TSG - ULBC Design Constraints Living Document
1. Scope
This living document consolidates design constraints being considered within SA4 for FS_ULBC (Feasibility Study on Ultra-Low Bitrate Codec). Due to the working procedure requiring consensus agreements for design constraints to be integrated into ULBC PD or TR 26.940, and the lack of such consensus so far, this document captures the current status of design constraints even though some items are not fully agreed.
2. ULBC Design Constraints
2.1 Sampling Frequency and Audio Bandwidth
Design Constraint: Support of [8, 16, 32] kHz / [NB, WB, SWB] required [1], [2]
Editor's Notes and Open Issues:
- Support of 8 kHz justified for interoperability; clarification needed on whether NB would be tested/supported "externally" based on external resampling
- Support of 48 kHz may be considered at higher bitrate operation
- Consideration of at least a single model (e.g., SWB)
- Many neural codecs operate at 24 kHz; this specific sampling rate should be discussed
- Complexity considerations associated with this parameter; joint decisions may be needed
Reference: NB audio typically sampled at 8 kHz (100-3500 Hz), WB at 16 kHz (50-7000 Hz), SWB at 32 kHz (50-14000 Hz), FB up to 20000 Hz
2.2 Number of Audio Channels
Design Constraint: ULBC candidate codecs shall support mono coding with one channel input and one channel output
2.3 Bit Rates
Design Constraint: ULBC candidate codecs shall operate at bitrates lower than [3.00] kb/s [3]
2.4 Frame Length
Design Constraint: Candidate codecs shall operate with a coding frame size of multiple of 20 ms
Note: Since larger than 20ms bundling time periods will be used, codec proponents should be allowed to consider solutions with larger than 20ms frame sizes
2.5 Algorithmic Delay
Design Constraint: Algorithmic delay shall be less than [coding frame size + x] ms
2.6 Complexity
Design Constraint: Complexity limits applied according to categories. Computational complexity and program ROM (PROM) of candidate codecs for each category shall be measured with ITU-T STL2009 [4] as observed worst-case encoder + observed worst-case decoder complexity within the same category [5], [6]
Categories:
- Computational: wMOPS: Less than [x] wMOPS
- Memory: RAM, ROM, Program ROM (values TBD)
Editor's Notes:
- Model size per operation mode is less than [5-10] million parameters
- Total number of parameters is less than [Z] million
- ULBC Codec should be implementable on mobile device using today's technology
- Increased computational complexity and memory usage should be commensurate with gain in quality of user experience (e.g., higher audio bandwidth such as SWB or stereo) or increased efficiency (e.g., lower bit rate for same quality compared to reference codec)
2.7 Potential Use of Noise Suppression as Part of the Codec
Design Constraint: If noise suppression is supported inside ULBC, there should be a mechanism to disable noise suppression in the codec [7], [8]
Editor's Notes - Clarifications Needed:
- Need to support noise suppression in ULBC? (typically vendor specific, defined outside the codec)
- Impacts on test methodology, DTX operation/performance
Motivations:
- Disabling noise suppression required to test feature separately
- Avoid tandeming in real operation
- IMS voice communication defined in TS 22.228; GEO satellite access has no specific requirement on noise handling
2.8 Jitter Buffer Management (JBM)
Design Constraint: A JBM solution conforming to requirements in TS 26.114, except for the functional requirement in sub-clause 8.2.2 of TS 26.114: "Speech JBM used in MTSI shall support all the codecs as defined in clause 5.2.1", shall be provided with candidate codecs
2.9 Rate Switching
Design Constraint: Candidate codecs shall perform rate switching upon command to the encoder throughout the entire bit rate range at arbitrary frame boundaries. Rate switching may imply switching between different bandwidths
Note: Due to the Bundling period and associated TBS, switching might have to happen at the boundary of bundling period
2.10 Packet Loss Concealment (PLC)
Design Constraint: A PLC solution shall be provided by ULBC candidate codecs [9]
Editor's Notes:
- Typical loss profiles/characteristics to be clarified
- Support of redundancy to be clarified
- Need to be able to handle BLER up to [x%]
2.11 RTP Payload Format
Design Constraint: Candidate codecs shall provide an RTP payload format specification supporting the full set of features and functionality of the ULBC candidate codecs
2.12 DTX
Design Constraint: Candidate codecs shall provide a complete VAD/DTX/CNG framework. It shall be possible to operate the codec with DTX on or DTX off
Editor's Note: Typical radio characteristics and optimizations (SPS, DRX, bitrate) to be clarified
2.13 Output Gain Limitation
Design Constraint: ULBC candidate codecs shall not amplify the output signal relative to the input signal beyond limits
Editor's Note: Similar limits and methodology to measure the amplification are described in the EVS-7a,b processing plan permanent document
3. References
[1] S4-251794 - Discussion on Audio Bandwidth for ULBC (vivo, Samsung, MediaTek Inc., Bytedance, Nokia, Xiaomi, Spreadtrum)
[2] S4-251808 - Pseudo-CR on Design Constraints of ULBC: Audio bandwidth (Fraunhofer IIS)
[3] S4-251792 - On codec bitrate and capacity discussion for ULBC (vivo, Samsung, Spreadtrum, MediaTek Inc.)
[4] ITU-T G.191 - Software tools for speech and audio coding standardization (March 2010)
[5] S4-251747 - On complexity constraints for ULBC (Huawei Technologies Co., Ltd.)
[6] S4-251807 - On complexity design constraints for ULBC (Fraunhofer IIS)
[7] S4-251395 - Pseudo-CR on Design Constraints of ULBC: Noise suppression (Fraunhofer IIS)
[8] S4-251748 - On noise suppression for ULBC (Huawei Technologies Co., Ltd.)
[9] S4aA250268 - Packet Loss Concealment with existing AI based codec DAC (Dolby Laboratories Inc., Ericsson LM, Nokia, Novamint)
Note: Items in light blue are candidates for agreement at SA4#135. |
|
| vivo Mobile Communication Co., |
[FS_ULBC] Alignment Analysis on Complexity of DAC model
|
Alignment Analysis on Complexity of DAC Model
1. Introduction
This contribution addresses a significant discrepancy in complexity reporting for AI-based codecs in the ULBC study. Two contributions (S4-260165 from Dolby et al. and S4-260155 from vivo et al.) both reported models with approximately 3M parameters but showed substantially different complexity metrics:
Notably, the S4-260165 model's complexity (0.79 GMACS) aligns more closely with the S4-260155 model operating at 16 kHz (~0.70 GMACS), despite the difference in sampling rate. The contribution demonstrates that Model Size (parameter count) is an insufficient metric for constraining complexity across different neural architectures, and proposes GMACS as a robust, architecture-agnostic metric that provides linear correlation with RTF.
2. Architectural Analysis and Discrepancy Resolution
2.1 The "Model Size" Trap
A detailed breakdown comparison was performed between the two architectures to understand why models with similar parameter counts exhibit different computational footprints:
| Metric | [2] (16k, ~3M) | [1] (32k, ~3M) |
|--------|----------------|----------------|
| Input Rate | 16,000 Hz | 32,000 Hz |
| Total Stride | 320 (2×4×5×8) | 1280 (4×4×8×10) |
| Latent Rate | 50.0 Hz | 25.0 Hz |
| Encoder MACs (M) | 436.30 | 461.92 |
| Quantizer MACs (M) | 2.25 | 0.50 |
| Decoder MACs (M) | 984.50 | 1037.12 |
| Total MFlops/s | 1423.05 | 1499.54 |
Key Analysis:
Conclusion: Two models with identical parameter counts can have vastly different runtimes depending on parameter location (shallow vs. deep layers) and stride configuration.
2.2 Verification of Complexity Metrics
Theoretical complexity (GMACS) was recalculated to validate the analysis:
3. GMACS as the Metric
When RTF data from S4-260155 is plotted against GMACS (rather than Model Size), the data aligns consistently across architectures.
Key Findings:
4. Conclusion
By adopting GMACS as the primary complexity metric, the apparent discrepancies between different contribution data are resolved. This enables a unified set of requirements that accurately reflects real-time capability of mobile devices.
5. Proposal
Propose to include this analysis in 3GPP TR 26.940, specifically capturing:
References
[1] S4-260165, "[FS_ULBC] On ULBC complexity and RTF analysis"
[2] S4-260155, "[FS_ULBC] Analysis of AI Codec Real-Time Performance (RTF) and Complexity Scaling" |
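The stride, latent-rate and total figures in the comparison table of clause 2.1 can be cross-checked with a few lines of Python. This is only a sketch: the dictionary layout and function names are illustrative and do not come from either contribution, and the per-module values are copied from the table as given.

```python
from math import prod

# Values copied from the comparison table; the structure and names are illustrative.
MODELS = {
    "S4-260155 (16 kHz)": {
        "sample_rate_hz": 16_000,
        "strides": (2, 4, 5, 8),
        "module_macs_m": (436.30, 2.25, 984.50),   # encoder, quantizer, decoder
    },
    "S4-260165 (32 kHz)": {
        "sample_rate_hz": 32_000,
        "strides": (4, 4, 8, 10),
        "module_macs_m": (461.92, 0.50, 1037.12),  # encoder, quantizer, decoder
    },
}


def total_stride(strides):
    """Overall downsampling factor: the product of the per-layer strides."""
    return prod(strides)


def latent_rate_hz(sample_rate_hz, strides):
    """Latent frames per second = input sample rate / total stride."""
    return sample_rate_hz / total_stride(strides)


def table_total(module_values):
    """Sum of the encoder, quantizer and decoder rows of the table."""
    return sum(module_values)


if __name__ == "__main__":
    for name, m in MODELS.items():
        print(name,
              total_stride(m["strides"]),
              latent_rate_hz(m["sample_rate_hz"], m["strides"]),
              round(table_total(m["module_macs_m"]), 2))
```

The 32 kHz model has twice the input rate but four times the total stride, so its latent rate is half that of the 16 kHz model, which is consistent with the two totals in the table being close despite the different sampling rates.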
|
| Qualcomm Incorporated, Xiaomi |
[FS_ULBC] Feasible bitrates for the NTN-TDL-C channel model with 10-degree elevation angle
|
Summary of S4-260214: Feasible Bitrates for NTN-TDL-C Channel Model with 10-Degree Elevation AngleBackground and MotivationThis contribution addresses the determination of feasible Transport Block Sizes (TBS) for the newly agreed NTN-TDL-C channel model at 10-degree elevation angle, which was adopted at SA4 #134 (documented in S4-252108). Key observations include:
Simulation MethodologyThe contribution evaluates maximum feasible bitrates under worst-case conditions without DMRS bundling, considering two scenarios: Scenario 1: Ideal Timing
Scenario 2: Timing Uncertainty
Both scenarios target 2% BLER for uplink and downlink. Simulation ResultsScenario 1 Results (No Timing Uncertainty)
Scenario 2 Results (10ms Timing Uncertainty)
Proposed Changes to Permanent DocumentTBS Table UpdatesThe contribution proposes adding new TBS values to support the higher bitrates enabled by the new channel model: For 80ms Bundling Period (Table 5.2.2.1-1)
For 160ms Bundling Period (Table 5.2.2.1-2)
For 320ms Bundling Period (Table 5.2.2.1-3)
Terminology Change
Updated Tables
The proposed tables include:
- Packet header assumption: 7 bytes (with a note that the final size depends on SA2/RAN conclusions on 1-byte MAC header feasibility)
- Header counting: Packet header counted only once per bundling period, regardless of the number of voice frames bundled
- TBS values: Selected from TS 36.213 Table 16.5.1.2-2 for NB-IoT NPUSCH
- Net bitrate calculation: PHY bitrate minus overhead from packet headers
The complete updated tables show TBS values ranging from 144 to 936 bits for 80ms bundling, with corresponding PHY and net bitrates calculated for each bundling period configuration. |
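The net bitrate rule described for the updated tables can be sketched as follows. The sketch assumes that the PHY bitrate is the TBS divided by the bundling period, which the summary does not state explicitly, and it uses the 7-byte header counted once per bundling period. Function names are illustrative.

```python
HEADER_BYTES = 7  # packet header assumption quoted in the summary


def phy_bitrate_kbps(tbs_bits, bundling_period_ms):
    """PHY bitrate in kb/s, assuming one transport block per bundling period.

    Bits per millisecond are numerically equal to kb/s.
    """
    return tbs_bits / bundling_period_ms


def net_bitrate_kbps(tbs_bits, bundling_period_ms, header_bytes=HEADER_BYTES):
    """Net bitrate in kb/s after removing one packet header per bundling period."""
    payload_bits = tbs_bits - 8 * header_bytes
    if payload_bits < 0:
        raise ValueError("TBS is smaller than the packet header")
    return payload_bits / bundling_period_ms


if __name__ == "__main__":
    # Smallest and largest TBS values quoted for the 80 ms bundling period.
    for tbs in (144, 936):
        print(tbs, phy_bitrate_kbps(tbs, 80), net_bitrate_kbps(tbs, 80))
```

Under these assumptions the 56 header bits cost 0.7 kb/s at 80 ms bundling, and the same header costs proportionally less at 160 ms and 320 ms because it is counted only once per bundling period.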
|
| Qualcomm Incorporated, Ericsson LM |
[FS_ULBC] On the scheduling timing uncertainty
|
Summary of 3GPP Technical Document: FS_ULBC - Scheduling Timing Uncertainty
1. Background and Motivation
This contribution addresses ambiguities in interpreting RAN1 LS S4-251654 regarding uplink and downlink timing for NB-IoT NTN with GEO satellites. The interpretation of this LS has direct implications on:
- Scheduling timing uncertainty assumptions
- Link capacity calculations
The document proposes clarifications to the Permanent Document (PD) version 0.4.0 to resolve these interpretation issues discussed at SA4 #133-e and subsequent meetings.
2. Main Technical Contributions
2.1 Frame Structure for Dynamic Scheduling
The document maintains the existing frame structure example for Half-duplex FDD with 80ms bundling period:
- NPDSCH duration: 4ms (variable depending on DL SNR)
- Multiple UL frequency allocation options: 1, 3, 6, and 12 tones with 15 kHz per tone
- Allocation choice depends on UL and DL channel capacity
2.2 Semi-Persistent Scheduling (SPS) Frame Structure
Two SPS approaches are presented:
Approach 1 (Figure 5.2.2.3-2):
- NPDSCH can be positioned anywhere within first 15ms
- Maintains minimum 1ms gap to NPUSCH
Approach 2 (Figure 5.2.2.3-3):
- Based on "Cell_specific_Koffset" approach
- Does not depend on "TA report UE capability"
2.3 Gap Composition Between DL and UL
The gap consists of:
1. Processing time + DL-to-UL switching: Minimum 1ms for half-duplex device switching
2. Max differential delay: Accounts for different round-trip delays of UEs in NTN cell
   - Typical range: close to 0 to 10.3ms depending on deployment
2.4 Baseline Assumptions for Codec Simulations
Key Changes Proposed:
For 80ms bundling:
- Original assumption: "Max differential delay" is 10ms AND X + Y ≤ 68ms
- Proposed change: Replace with reference to beam size no larger than 1500km
- Note clarifies this corresponds to scenarios where difference between closest and farthest point to satellite is <1500km
- Explicitly states codec can be deployed in scenarios not meeting these constraints
For 160ms bundling:
- Original assumption: "Max differential delay" is 10ms AND X + Y ≤ 148ms
2.5 Important Notes and Editor's Notes
RAN1 LS Clarification:
- Figure 5.2.2.3-1 supportable in most scenarios
- May not be supportable when:
  - Cell is very large (e.g., >3000km)
  - UE does not support TA report
  - Network does not support UE-specific K-offset
- Requires UE configuration with two HARQ processes and HARQ feedback disabled
SPS Design Status:
- RAN1/2 have not yet started SPS design work
- RAN1 cannot currently confirm whether SPS frame structure examples (Figures 5.2.2.3-2 and associated text) will be supported
Editor's Note:
- Range of "Max differential delay" is TBC (To Be Confirmed)
3. Summary of Changes
The primary technical contribution is replacing specific timing constraint assumptions (X + Y values and max differential delay) with a more practical reference based on beam size ≤ 1500km for codec simulation baseline, while explicitly allowing codec deployment in scenarios exceeding these reference conditions. This provides clearer guidance to SA4 while maintaining flexibility for various deployment scenarios. |
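The link between the 1500km reference and the 10ms "Max differential delay" can be illustrated with a short calculation. This is a sketch under a stated assumption: the differential delay is taken as the round-trip propagation time over the difference in path length between the closest and farthest UE, which is one reading of the summary rather than a formula quoted from the contribution. Function names are illustrative.

```python
SPEED_OF_LIGHT_KM_PER_MS = 299.792458  # exact value of c expressed in km per millisecond


def one_way_delay_ms(distance_km):
    """Free-space propagation time over the given distance."""
    return distance_km / SPEED_OF_LIGHT_KM_PER_MS


def round_trip_differential_delay_ms(path_difference_km):
    """Round-trip delay difference between the closest and farthest UE (assumed model)."""
    return 2.0 * one_way_delay_ms(path_difference_km)


def dl_ul_gap_ms(path_difference_km, switching_ms=1.0):
    """DL-to-UL gap: minimum 1 ms processing/switching time plus the differential delay."""
    return switching_ms + round_trip_differential_delay_ms(path_difference_km)


if __name__ == "__main__":
    print(round_trip_differential_delay_ms(1500.0))
    print(dl_ul_gap_ms(1500.0))
```

Under this assumption a 1500km path difference corresponds to a round-trip differential delay of about 10ms, which matches the value in the original assumption, and the resulting gap is about 11ms once the 1ms switching time is added.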
|
| Qualcomm Incorporated |
[FS_ULBC] On transmission delay for voice over NB-IoT NTN
|
Summary of S4-260216: On Transmission Delay for Voice over NB-IoT NTNDocument OverviewThis contribution from Qualcomm addresses gaps in TR 26.940's mouth-to-ear delay calculations for NB-IoT NTN systems, specifically highlighting the omission of NPUSCH/NPDSCH transmission durations and clarifying the distinction between propagation delay and transmission delay. Main Technical Issues IdentifiedProblem Statement
Proposed Technical Changes5.2.2.4 Propagation Delay CorrectionsKey Change: Renamed "Transmission delay" to "Propagation delay" for GEO satellite link
5.2.2.5 Transmission Delay (New Section)New Addition: Introduces proper definition and consideration of transmission delay
5.2.2.5/6 ULBC Delay Components
5.1.3 Mouth-to-Ear Delay Estimation UpdatesEditorial Note Added: - Numbers in Table 5.1.3-1 will be updated once RAN simulation is completed to account for transmission delays in uplink and downlink - Current values assume AMR and EVS algorithmic delays - ULBC delay components still need to be addressed - Minimum Delay_GSCN assumed as 20ms Existing Table Structure Maintained: - Frame sizes: 20ms, 40ms, 80ms, 160ms, 320ms - Two scenarios: GEO-TN (main) and GEO-GEO (sub-scenario 1) - Lower and upper bounds for mouth-to-ear delay - Delay ranges from 428-712ms (20ms frame, GEO-TN) to 984-1455ms (320ms frame, GEO-GEO) Dependencies and Next Steps
|
|
| Qualcomm Incorporated |
[FS_ULBC] Support for Dual-Tone Multi-Frequency for IMS voice over NB-IoT NTN
|
Summary of S4-260217: Support for Dual-Tone Multi-Frequency for IMS Voice over NB-IoT NTN
Background
SA1 has mandated support for Dual-Tone Multi-Frequency (DTMF) for IMS voice over NB-IoT NTN. The document addresses the need to consider multiplexing of DTMF traffic with voice traffic in the system design, referencing RFC 4733 for DTMF payload formats.
DTMF Use in IMS Voice Services and Traffic Characteristics
DTMF Payload Types
RFC 4733 defines two DTMF payload format types:
- Telephone events: User button presses (0-9, *, #) during calls
- Tones: Ringing tone, busy tone, etc.
For IMS calls, tones are generated locally (e.g., "180 Ringing" or "486 Busy Here" SIP messages trigger local tone generation), so only telephone events need to be transported over the air.
Technical Specifications
Traffic Characteristics
The document identifies key DTMF traffic characteristics:
- DTMF packets are transmitted infrequently (only on button press)
- Telephone events may or may not overlap with active voice activity
- Multiple DTMF packets may be transmitted per button press, with the RTP marker bit indicating the first packet
- RTP packets must differentiate between voice and DTMF packets for multiplexing
Design Assumptions
Three key assumptions are established:
1. DTMF packet size ≤ voice packet size
2. DTMF delay requirements are less stringent than voice service
3. DTMF takes priority over voice
SPS Configuration Considerations
When SPS (Semi-Persistent Scheduling) is configured for voice traffic with fixed TBS:
- If DTMF packets don't overlap with active voice frames, they can be multiplexed with SID frames (smaller than active voice frames) and transmitted in SPS occasions
- If overlapping occurs, the UE can puncture an active voice frame and send the DTMF frame instead
- SA4 needs to coordinate with RAN1 and RAN2 on SPS design
Proposals
Proposal 1: Make DTMF support an integral part of IMS voice service over NB-IoT NTN
Proposal 2: Design DTMF support based on the three assumptions:
- DTMF packet size ≤ voice packet size
- DTMF delay requirement less stringent than voice
- DTMF priority over voice
Proposal 3: SA4 to design mechanisms for voice and DTMF multiplexing for SPS and coordinate with RAN1 and RAN2 |
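To give a feel for how small a telephone-event payload is compared with a voice frame, the sketch below packs the 4-byte RFC 4733 telephone-event payload (event code, end bit, reserved bit, 6-bit volume, 16-bit duration in timestamp units). The function name and default volume are illustrative choices, and the RTP header, which carries the marker bit mentioned above, is not modelled.

```python
import struct

# RFC 4733 event codes for the keys named in the summary: digits 0-9, then * and #.
EVENT_CODES = {str(d): d for d in range(10)}
EVENT_CODES.update({"*": 10, "#": 11})


def pack_telephone_event(key, duration, end=False, volume=10):
    """Build the 4-byte telephone-event payload for one key press.

    duration is in RTP timestamp units; volume is the 6-bit power level field
    (the default of 10 is an arbitrary illustrative value).
    """
    if key not in EVENT_CODES:
        raise ValueError(f"unsupported key: {key!r}")
    if not 0 <= volume <= 63:
        raise ValueError("volume must fit in 6 bits")
    if not 0 <= duration <= 0xFFFF:
        raise ValueError("duration must fit in 16 bits")
    # Second byte: E bit (0x80), reserved bit left at 0, then the 6-bit volume.
    flags = (0x80 if end else 0x00) | volume
    return struct.pack("!BBH", EVENT_CODES[key], flags, duration)


if __name__ == "__main__":
    first = pack_telephone_event("5", duration=160)
    last = pack_telephone_event("5", duration=800, end=True)
    print(first.hex(), last.hex(), len(first))
```

A 4-byte payload is consistent with the first design assumption above (DTMF packet size ≤ voice packet size), although the comparison in practice also depends on the header overhead, which this sketch leaves out.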
|
| Nokia |
Proposed design constraints for noise suppression, DTX, and non-speech inputs
|
Summary of S4-260220: Design Constraints for Noise Suppression, DTX, and Non-Speech Inputs
1. Background and Context
This contribution addresses design constraints for the ULBC (Ultra Low Bitrate Codec) over GEO channel solution, building upon previous discussions from S4-251881 and S4-251786. The document focuses on three key areas:
- Noise suppression handling
- Discontinuous transmission (DTX) framework
- Robustness to non-speech inputs
Emergency Call Use Case
The contribution emphasizes that emergency calls represent a critical use case for ULBC over GEO, particularly when terrestrial network (TN) service coverage is unavailable. Key considerations include:
- Background signals may contain critical contextual information (e.g., voices, environmental sounds indicating danger)
- Post-call analysis requirements (ASR transcripts, emergency response evaluation, criminal investigations)
- Need for full situational awareness rather than aggressive noise suppression
2. Technical Analysis
2.1 Noise Suppression Trade-offs
The document identifies several technical challenges:
2.2 Updated ApproachThe contribution updates the original proposal from S4-251881 by: - Maintaining the requirement for disableable noise suppression within the codec - Adding specific SNR ranges for stationary (5-15 dB) and non-stationary (10-25 dB) noise - Deferring specific noise type definitions for future discussion - Linking noise suppression behavior primarily to performance requirements 3. Proposed Design ConstraintsThe document proposes updates to Table 6.2-1 in draft TR 26.940 with three new/modified constraint parameters: 3.1 Noise Suppression ConstraintRequirement: If noise suppression is supported as part of the candidate codec, it must be possible to disable it to preserve background signals. Editor's Notes: - EN1: Requirement to disable may be considered in connection with specific operating bit rate(s) - EN2: Solution behavior w.r.t. potential noise suppression is primarily enforced via performance requirements; default operation for tests is with noise suppression disabled 3.2 DTX Framework ConstraintRequirement: The candidate codec shall provide a framework for: - Voice Activity Detection (VAD) - Discontinuous Transmission (DTX) - Comfort Noise Generation (CNG) - Operation with DTX on or DTX off Editor's Note: Operation relating to DTX on and disabling/enabling potential noise suppressor may need clarification 3.3 Robustness to Non-Speech InputRequirement: The candidate codec shall be robust to: - Noisy speech with stationary noise (5-15 dB SNR) - Noisy speech with non-stationary noise (10-25 dB SNR) - Background signals during and between speech segments - Other non-speech input signals Editor's Notes: - EN1: May need to be in performance requirements - EN2: Relevant background signals to be further defined as part of performance requirements, including both stationary and non-stationary types 4. Key Technical Contributions
5. Open IssuesSeveral editor's notes indicate areas requiring further work: - Specific operating bit rates where noise suppression disable requirement applies - Clarification of DTX and noise suppression interaction - Final placement of robustness requirements (design constraints vs. performance requirements) - Definition of specific background signal types for testing - Speech quality requirements (to be addressed separately in performance requirements) |
|
| Ericsson LM |
UE Antenna Gain in link-budget evaluations
|
UE Antenna Gain in Link-Budget EvaluationsIntroductionThis contribution addresses the need to establish common assumptions for UE Antenna Gain in link-budget evaluations for FS_ULBC (Ultra Low Bitrate Speech Codec). The document highlights that different assumptions on UE Antenna Gain lead to significantly different conclusions on suitable radio configurations, and proposes alignment with existing 5G NR-NTN assumptions. Problem StatementThe current FS_ULBC Pdoc references TR 36.763 with UE Antenna Gain assumptions ranging between 0 dBi and -5.5 dBi. Previous SA4 contributions on link level simulations have shown divergent assumptions regarding achievable link level performance, leading to inconsistent conclusions. The lack of a common assumption for UE Antenna Gain (G_Tx) significantly impacts:
Link-Budget Analysis
Comparative Evaluation
The document presents a detailed side-by-side comparison of link-budget calculations for GEO satellite uplink with two different UE Antenna Gain assumptions:
Scenario Parameters (Common):
- Satellite Orbit: GEO
- Link Direction: Uplink
- Device Type: Handheld
- Satellite Elevation Angle: 2.3 degrees
- Satellite Altitude: 35,786 km
- Slant Range: 41,417.91 km
- Carrier Frequency: 2000 MHz
- Free Space Path Loss (FSPL): 190.8 dB
- UE Transmit Power: 23 dBm
- Receive Antenna Gain: 51 dBi
- Satellite G/T: 19 dB/K
- Bandwidth: 3750 Hz
- Various losses (atmospheric, shadow fading, scintillation, polarization, additional): 11.4 dB total
Key Results:
| UE Antenna Gain | Received Power | Noise Power | SNR at Satellite Receiver |
|-----------------|----------------|-------------|---------------------------|
| 0 dBi | -135.58 dBm | -138.23 dBm | 2.66 dB |
| -5.5 dBi | -141 dBm | -138.23 dBm | -2.84 dB |
The difference in UE Antenna Gain assumption results in a 5.5 dB difference in SNR, which is highly significant for link-level performance evaluation and system design.
Observations
Observation 1: The assumption for UE_Antenna_Gain (G_Tx) critically impacts the resulting SNR at the satellite receiver, which in turn affects conclusions on link-level results. Clarification is needed on whether to use 0 dBi, -5.5 dBi, or both values.
Observation 2: It is unlikely that an NB-IoT device would have superior antenna performance compared to an NR handheld device. Therefore, the UE_Antenna_Gain assumption should align with 5G NR-NTN specifications, which use -5.5 dBi.
Observation 3: RAN4 guidance (R1-2208353) explicitly recommends -5.5 dBi as a realistic UE antenna gain value, stating: "The UE antenna gain varies depending on the operating frequency and UE design. RAN4 thinks that a realistic UE antenna gain value would be -5.5 dBi. RAN4 would then recommend RAN1 to take this value as an assumption for their link budget evaluation."
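The SNR values in the Key Results table can be reproduced from the listed scenario parameters with the usual G/T-based link-budget expression, SNR = EIRP + G/T - k - FSPL - other losses - 10·log10(B). The summary does not state which formula the contribution used, so the expression, the rounded Boltzmann term of -228.6 dBW/K/Hz, and the function names below are assumptions made for illustration.

```python
import math

# 10*log10 of Boltzmann's constant in dBW/K/Hz, rounded as is common in link budgets (assumption).
BOLTZMANN_DBW_PER_K_HZ = -228.6
SPEED_OF_LIGHT_M_S = 299_792_458.0


def fspl_db(distance_km, frequency_mhz):
    """Free-space path loss, 20*log10(4*pi*d*f/c)."""
    distance_m = distance_km * 1e3
    frequency_hz = frequency_mhz * 1e6
    return 20.0 * math.log10(4.0 * math.pi * distance_m * frequency_hz / SPEED_OF_LIGHT_M_S)


def snr_db(ue_antenna_gain_dbi,
           tx_power_dbm=23.0,
           g_over_t_db_k=19.0,
           path_loss_db=190.8,
           other_losses_db=11.4,
           bandwidth_hz=3750.0):
    """SNR at the satellite receiver from EIRP, G/T, losses and bandwidth.

    Defaults are the common scenario parameters listed in the summary.
    """
    eirp_dbw = tx_power_dbm - 30.0 + ue_antenna_gain_dbi
    noise_bandwidth_db = 10.0 * math.log10(bandwidth_hz)
    return (eirp_dbw + g_over_t_db_k - BOLTZMANN_DBW_PER_K_HZ
            - path_loss_db - other_losses_db - noise_bandwidth_db)


if __name__ == "__main__":
    print(f"FSPL: {fspl_db(41_417.91, 2000.0):.2f} dB")
    for gain in (0.0, -5.5):
        print(f"G_Tx = {gain:+.1f} dBi -> SNR = {snr_db(gain):.2f} dB")
```

With the listed parameters this yields 2.66 dB for 0 dBi and -2.84 dB for -5.5 dBi. Because the antenna gain enters the expression only through the EIRP term, the 5.5 dB gain difference carries over one-to-one to the SNR.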
ProposalProposal 1: For the support of voice-over-GEO in NB-IoT NTN, align the assumption on UE_Antenna_Gain (G_Tx) with 5G NR-NTN specifications, i.e., -5.5 dBi. This alignment ensures: - Consistency with existing 3GPP NTN specifications - Realistic assumptions based on RAN4 recommendations - Comparable link-budget evaluations across different contributions - Appropriate performance targets for codec and system design |
|
| Fraunhofer IIS, Apple Inc. |
On reference code and model format
|
3GPP Change Request Summary: S4-260233Document OverviewThis contribution proposes the use of ML model formats as intermediate representations (IR) for the ULBC (Ultra Low Bitrate Codec) reference implementation, rather than a pure C implementation. The document is structured as a proposed Change Request (pCR) to TR 26.940, introducing a new clause 6.4.2. Main Technical Contributions1. Problem Statement and Motivation (Goal Section)The document identifies a fundamental question for ULBC standardization: whether to provide the entire codec reference implementation in C (including neural network components) or to define specific parts based on ML model formats (e.g., ONNX, PyTorch, TensorFlow). Key concerns with pure C implementation: - Limits UE vendors from leveraging custom architectures and optimizations - UE vendors typically have custom optimization pipelines to port ML models to internal formats - Pure C approach restricts full utilization of specialized hardware (NPUs, DSPs, TPUs) 2. Limitations of C-Based Reference Implementation (Clause 6.4.2.1)Issues with existing WMC (Weighted Million Operations) tool for complexity measurement: - Weights in Table 18.3 of G.191 do not account for vectorized implementations of matrix multiplications - Theoretical complexity estimation does not reflect actual runtime complexity - Does not account for diversity of target platforms Additional limitations identified: - Hardware/platform dependencies: C implementations may rely on platform-specific intrinsics and vectorization pragmas, limiting portability to NPUs - Unoptimized reference code: May not be optimized for certain platforms - Compiler dependencies: Intrinsics are compiler-specific - Maintenance burden: Keeping C implementation updated with new ML operators and architectures is costly and error-prone 3. Definitions and Concepts (Clause 6.4.2.1 - Definitions)The document establishes clear terminology:
Note: PyTorch does not contain a graph format and requires model definition as Torch code.
4. Advantages of Model Format Approach (Clause 6.4.3.2)
Platform Portability:
- Specifies what is computed, not how it's executed
- Framework-agnostic: models can be exported from different training frameworks
- Allows vendors to use custom toolchains for hardware-specific optimization
Hardware Evolution:
- Future-proof method to leverage latest AI processor developments
- Maintains compatibility with low maintenance effort
Combination with Standard C-code:
- ULBC can combine ML parts (as model format) with classic signal processing (in ANSI C)
- Backend runtime in C can integrate ML components
- Enables traditional 3GPP codec reference implementation structure
5. Comprehensive Model Format Analysis (Clause 6.4.3.3)
The document provides detailed comparison of major ML model formats:
| Format | Type | Key Advantages | Key Limitations |
|--------|------|----------------|-----------------|
| ONNX | Framework-agnostic IR | Cross-framework portability, wide runtime/hardware support, native OS support (Windows/Linux), dedicated C/C++ runtime | Operator coverage limitations, limited dynamic graph support |
| TensorFlow Lite (TFLite/LiteRT) | Edge/embedded-focused IR | Mobile/edge optimized, strong Android ecosystem, quantization tools, C/C++ runtime | TensorFlow-centric, partially vendor-specific maintenance |
| PyTorch/Python | torch.nn.Module + checkpoints | Easy prototyping, highly optimized conversion tools | Suboptimal for real-world testing, Python dependencies, no C/C++ runtime without Python |
| TorchScript | PyTorch-specific serialized IR | Static graph without Python dependencies, supports custom ops, LibTorch C++ runtime | PyTorch-specific, deprecated (being replaced by ExportedProgram) |
| ExportedProgram & ExecuTorch | Two IRs: ExportedProgram and .pte | Replaces TorchScript, canonical PyTorch export IR, dedicated C++ runtime | PyTorch-specific, requires compilation to another IR, pipeline not fully mature |
| OpenVINO IR | Intel/CPU-centric IR | Strong Intel CPU/GPU optimization | Not suitable for mobile SoCs, extra conversion step needed |
| Proprietary vendor IRs | Vendor-specific internal IR | Highly hardware-optimized | Not portable, requires conversion from open IR |
Key observations:
- PyTorch format provides maximum flexibility and transparency but may have long-term compatibility concerns due to format evolution
- ONNX and TFLite are designed for inference deployment and cross-platform compatibility, representing stable industry standards
- ULBC ML parts will likely be based on PyTorch format, convertible to stable formats like ONNX or TFLite
6. SoC AI Engine Support Analysis (Clause 6.4.3.4)
Hardware landscape:
- Major smartphone SoCs include NPUs, DSPs, TPUs, GPUs, and CPUs
- Vendors provide specialized runtime environments and SDKs
- Vendors use native/preferred internal model formats optimized for their architecture
Industry pattern:
- All major vendors provide conversion mechanisms from popular open-source formats
- Common supported formats: ONNX, TFLite, PyTorch, TensorFlow
- References provided for major vendors: Qualcomm, Apple, Samsung, MediaTek, Google, Huawei
7. Summary and Recommendations (Clause 6.4.3.4)
Advantages of model-format/IR-based reference implementation:
- Decouples algorithm definition from hardware-specific implementation
- Leverages existing SoC vendor compilers, AI accelerators, and runtimes
- Significantly more portable, maintainable, and future-proof
Recommended approach for ULBC reference implementation:
1. Base reference on ML model-format with auxiliary signal processing in C
2. Include both ONNX and PyTorch as ML model-formats
3. Define neural network model-format including operator set and version
4. Specify I/O interfaces of ML models and auxiliary signal processing steps in C
5. Use reference implementation for integration illustration, verification, and testing
Proposal
The document proposes:
1. Discussion and agreement on selection of one or more model formats for ULBC reference implementation
2. Agreement on principle of using model format as part of ULBC standardization reference model
3. Documentation of findings in TR 26.940 under new clause 6.4.2
Key Technical Impact
This contribution represents a significant departure from traditional 3GPP codec standardization approaches by advocating for ML model formats rather than pure C implementations. The proposal addresses practical deployment considerations for ML-based codecs while maintaining compatibility with 3GPP standardization practices through a hybrid approach that combines model formats with C code for signal processing components. |
|
| Orange |
On the use of objective metrics in ULBC standardization
|
Summary of 3GPP Technical Document on Objective Metrics in ULBC StandardizationIntroduction and ScopeThis document addresses the Study on Ultra Low Bitrate Speech Codec (FS_ULBC), specifically focusing on performance requirements and test methodologies as defined in the WID. The contribution targets study objective 5 regarding speech quality, intelligibility, and conversational quality testing under various conditions (clean/noisy speech, tandeming with IMS codecs, clean/GEO channel conditions). Main Technical ContributionsTest Methodologies (Clause 9)Quality Impairments of Ultra-Low Bit Rate Speech Coding (9.1.1)The document identifies specific impairment categories relevant to ULBC: - Loss of listening-only audio quality - Audio bandwidth loss - Impaired intelligibility - Impaired speaker identifiability - Prosodic impairments - Hallucination (word and phone confusions) - Sensitivity to non-speech input (background noise, music, noisy speech, interfering talkers, reverberant speech) Additionally notes that ULBC may incorporate speech enhancement algorithms (noise suppression, gain normalization). 
Challenges of Quality Assessment (9.1.2)
The document highlights that ULBC testing introduces new challenges compared to signal processing-based codecs (AMR, AMR-WB, EVS):
Traditional 3GPP Approach:
- Historical reliance on ITU-T P.800 ACR (Absolute Category Rating) for clean speech
- P.800 DCR (Degradation Category Rating) for SWB clean speech, mixed-bandwidth, speech + background noise, and music/mixed content
- Previous codec standardizations did not focus on intelligibility, speaker identifiability, or prosodic impairments
ULBC-Specific Considerations:
- ML-based coding systems introduce new impairment types (e.g., hallucination) not present in signal-processing codecs
- ACR may not optimally quantify all impairments (hallucination, intelligibility, prosodic issues)
- DCR focuses on differences to the reference, which may not directly impact conversational capability but affect aspects like identity recognition
Alternative Test Methodologies Listed:
- Diagnostic Rhyme Tests (DRT)
- Modified Rhyme Tests (MRT)
- MOS testing for speaker similarity
- Speaker verification/identification tests
- Prosodic naturalness MOS tests
- Intonation recognition tests
- Transcription tests for word and semantic equivalence
- Phoneme recognition tests
- Automatic speech recognition tests
- P.835 multi-dimensional rating scales for speech enhancement evaluation
Subjective Testing Considerations (9.1.3)
Robustness Related to Source Material (9.1.3.1):
- Multiple languages with diverse intonations
- Non-speech signals
- Various linguistic features and accents
- Wide range of speakers (different voice pitches, speaking styles)
- Overlapping talkers
Simulation of Real-world Acoustic Conditions (9.1.3.2):
- Clean environments (minimal background noise)
- Noisy environments (traffic, human chatter, vehicle)
- Various reverberation levels (RT60 ranging from 0.3 s to 1.0 s)
Tandeming and Compatibility Testing (9.1.3.3):
- Testing with speech previously encoded by ITU-T G.711, AMR, AMR-WB, and EVS
- Various input levels: -16 dBov, -26 dBov, and -36 dBov
Conclusion (9.1.3.4):
- ITU-T P.800 ACR/DCR serves as the backbone for most subjective testing
- Other methodologies may be considered
- Emphasis on diverse test material: multilingual/multi-speaker testing, real-world acoustic conditions, and tandeming
Objective Testing Considerations (9.1.4)
Correlation Analysis Results (9.1.4.1):
The document presents correlation analysis based on ACR experiments (clause 7.3.3) evaluating objective models:
- Speech-oriented metrics: PESQ, POLQA, ViSQOL-S, WARP-Q, DNSMOS, NISQA, NORESQA, UTMOS, SCOREQ
- General audio metrics: PEAQ, ViSQOL-A
Evaluation metrics used: Pearson correlation coefficient, RMSE, Kendall's Tau rank correlation coefficient
Key Observations for Clean Speech:
- Best performing models (POLQA, UTMOS, PESQ, WARP-Q, SCOREQ) accurately predicted monotonic bitrate/quality behavior
- 16 kHz models (PESQ without mapping, UTMOS and WARP-Q with mapping) showed relatively good performance even for fullband codecs
- Mapping generally improves accuracy (RMSE) except for a few models (PESQ, POLQA)
Correlation Analysis for Music/Mixed Content:
Based on DCR experiments (clause 7.3.4), evaluating: POLQA, PEMO-Q, ViSQOL-A, and 2f-model
Key Observations for Music/Mixed Content:
- POLQA (despite not being recommended for non-speech) showed the best correlation results (Pearson, Kendall, RMSE after 3rd order mapping)
- 2f-model was second-best performing
- ViSQOL Audio, PEAQ, and PEMO-Q showed fair performance
- Correlation scores were lower than for clean speech, possibly due to the more difficult task of predicting general audio quality and a mismatch with the DCR grading methodology
Discussion (9.1.4.2):
- P.862 (PESQ) has been officially "withdrawn" by ITU-T and cannot be considered a valid standard
- P.863 remains the main ITU-T standard, with P.SAMD emerging as a potential alternative
- Testing and parameter adjustment based on objective tools is not recommended
- 3GPP TR 26.921 documented that tuning noise reduction based on PESQ should be avoided
Conclusion (9.1.4.3):
- Subjective testing remains the "golden reference" for codec selection
- Objective metrics are NOT recommended for codec selection criteria or codec tuning
- Correlation of subjective and objective metrics may be considered for codec characterization
- Objective metrics have merits in other tasks such as codec conformance testing
Proposed Changes to TR 26.940
The document proposes comprehensive revisions to TR 26.940 v0.5.1, specifically to Clause 9 (Test methodologies), incorporating all the analysis and recommendations detailed above regarding both subjective and objective testing approaches for ULBC standardization. |
|
| Fraunhofer IIS |
On complexity estimation of ULBC
|
Summary of S4-260241: On Complexity Estimation of ULBC
Document Overview
This contribution addresses the complexity measurement methodology for the Ultra-Low Bitrate Codec (ULBC) under development in 3GPP SA4. The document proposes a hybrid complexity metric that combines traditional DSP-based measurements with ML-specific metrics.
Background and Motivation
Multiple input documents [1-4] have previously discussed complexity measurement approaches:
- Documents [1] and [3] proposed using WMOPS (Weighted Million Operations Per Second), following conventional speech codec practices
- Document [2] suggested using MACs and a modified WMOPS version
- Document [4] emphasized model size considerations
The key challenge is that ULBC will operate on heterogeneous, non-fixed target hardware and processors, requiring a platform-agnostic complexity metric.
Main Technical Contributions
Proposed Hybrid Complexity Metric
The document proposes combining two complementary measurement approaches:
For DSP-based components:
- Use traditional WMOPS measurement
For ML-based components:
- Use MAC (Multiply-Accumulate) operations count
- Include parameter count for memory/model size considerations
Combined metric formula:
Rationale for the Hybrid Approach
Limitations of WMOPS-only approach:
- WMOPS reflects complexity primarily for DSP operations
- Does not account for the vectorization capabilities available even on modern DSPs
- Less relevant for non-DSP processor types
- The WMOPS toolbox does not reflect modern computational capabilities
ML-specific considerations:
- ML component complexity is dominated by matrix multiplications
- Inference time and energy consumption are highly platform-dependent
- MAC count provides architecture-agnostic computational load measurement
- Parameter count relates directly to model size, memory usage, and energy consumption
Advantages of the Proposed Metric
The hybrid approach:
1. Provides an overall complexity estimate for hybrid DSP+ML codec designs
2. Avoids over-constraining codec design toward specific platforms (with reference to S4-260233)
3. Allows UE vendors to leverage custom architectures and optimizations
4. Accounts for efficient vectorization of ML components
5. Enables flexible computational cost balancing between DSP-based and ML-based components
6. Maintains continuity with established practice while accommodating emerging ML-based designs
Vectorization Capability Reference Data
The document provides example processing units and their vectorization capabilities to inform the ML weighting factor:
| Chip | Type | Vectorization Capabilities |
|------|------|---------------------------|
| HiFi 5s | DSP | 32×(8×8 bit MAC) |
Proposal
The source proposes to:
References
The document references four previous contributions [1-4] and two external technical specifications [5-6] for processor capabilities. |
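
The MAC and parameter counts that the hybrid metric above relies on can be derived directly from layer dimensions. The following is a minimal sketch, assuming an unpadded, undilated 1-D convolution; the `Conv1dLayer` class, the layer sizes, and the 20 ms frame are hypothetical illustrations and are not taken from the contribution. How MAC counts are weighted against WMOPS is not given in the summary and is deliberately not modeled here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Conv1dLayer:
    """Hypothetical layer description (not from any ULBC candidate)."""
    in_channels: int
    out_channels: int
    kernel_size: int
    stride: int = 1
    bias: bool = True

    def param_count(self) -> int:
        # One weight per (input channel, output channel, kernel tap), plus biases.
        weights = self.in_channels * self.out_channels * self.kernel_size
        return weights + (self.out_channels if self.bias else 0)

    def output_length(self, input_length: int) -> int:
        # Assumption: no padding and no dilation.
        return (input_length - self.kernel_size) // self.stride + 1

    def macs(self, input_length: int) -> int:
        # One MAC per weight per output time step; bias additions are not counted.
        per_step = self.in_channels * self.out_channels * self.kernel_size
        return per_step * self.output_length(input_length)


def mmac_per_second(macs_per_frame: int, frame_ms: float) -> float:
    """Convert a per-frame MAC count into millions of MACs per second."""
    return macs_per_frame * (1000.0 / frame_ms) / 1e6


# Illustrative numbers only.
layer = Conv1dLayer(in_channels=64, out_channels=128, kernel_size=7, stride=2)
frame_macs = layer.macs(input_length=320)
rate = mmac_per_second(frame_macs, frame_ms=20.0)
print(layer.param_count(), frame_macs, rate)
```

Counting at the layer level keeps the figure independent of any particular processor, which is the property the contribution attributes to MAC counts.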
|
| Dolby Laboratories Inc., Nokia, Novamint |
[FS_ULBC] ULBC Re-Focus Proposal
|
ULBC Re-Focus Proposal
Background and Motivation
The FS_ULBC study item, initiated nearly a year ago, aims to establish a normative ULBC standard for voice communication over GEO within Rel-20. However, progress has been slow, with crucial issues such as end-to-end simulation parameters remaining unresolved. This contribution proposes a focused approach to meet 3GPP standardization timelines.
Core Proposal: Two-Phase Standardization Approach
The document proposes separating ULBC standardization into two distinct phases to ensure timely delivery while accommodating future enhancements:
Phase 1: Rel-20 ULBC Baseline
Phase 2: Rel-21 ULBC Advanced
Technical Configuration Comparison
Application Scenarios
Baseline (Rel-20):
- IMS Voice Call over GEO based strictly on Rel-19 service requirements
Advanced (Rel-21):
- Multi-Party Voice Communication
- IMS Voice Call with ULBC over additional access types beyond GEO
GEO Channel Characteristics & Simulation
Baseline (Rel-20):
- Single baseline UE Tx/Rx capability
- Single CNR in UL and DL (e.g., UL single-tone 23 dBm: CNR=5.28 dB for SCS=3.75 kHz, CNR=-0.74 dB for 15 kHz; DL 12-tone single Rx: CNR=-0.61 dB)
- Single agreed target bitrate compatible with baseline UE capability enabling acceptable system capacity
- Reliance only on mandatory Rel-19 NB-IoT radio protocol features (except SPS)
- i.i.d. random block error patterns
- Single SPS/bundling period (160 ms)
Advanced (Rel-21):
- Advanced UE capabilities (e.g., increased Tx power, multiple Rx antennas)
- Multiple CNR assumptions in UL and DL
- Codec designers may choose optimal bitrate/TBS per CNR
- Allow reliance on expected Rel-20 and selected non-mandatory NB-IoT features
- Simulated block error patterns based on advanced features
- Additional SPS/bundling periods (e.g., 80 ms, 320 ms)
Design Constraints: Bitrate
Baseline (Rel-20):
- Single target bitrate derived from Rel-19 GEO IMS voice service requirements
- Example: TBS=208 with SPS period 160 ms, achieving 950 bps net bitrate
Advanced (Rel-21):
- Multiple target CNRs with bitrate as codec design choice
- Additional bitrates for future 6G-related scenarios
Design Constraints: Sample Rate and Audio Bandwidth
Baseline (Rel-20):
- Single sample rate: e.g., 16 kHz
- Audio bandwidth: up to WB
- Note: May depend on agreed target bitrate
Advanced (Rel-21):
- Input/output sampling rates: at least 8, 16, 32, 48 kHz
- Audio bandwidth unconstrained (codec design choice)
Design Constraints: Frame Length and Algorithmic Delay
Baseline (Rel-20):
- Corresponding to SPS/bundling period (160 ms) or sub-multiples thereof
- Algorithmic delay excl. framing: e.g., ≤80 ms (0.5 × SPS/bundling period)
Advanced (Rel-21):
- Frame structure and algorithmic delay aligned with advanced SPS/bundling options and future 6G Media requirements
Design Constraints: Complexity and Memory
Baseline (Rel-20):
- Limited; sufficiently low to not preclude deployment on current-generation smartphones
- TBD MMAC/s
- E.g., 3M parameters
Advanced (Rel-21):
- Relaxed, enabling multiple models
- Addressing future 6G Media requirements while leveraging new UE hardware trends
Design Constraints: Packet Loss Concealment
Baseline (Rel-20):
- Required; capable of addressing the single agreed-upon target bitrate and operation point of IMS Voice Call over GEO
Advanced (Rel-21):
- Required; capable of supporting anticipated extended application scenarios beyond Rel-20 IMS Voice Call over GEO, while fulfilling potential 6G Media requirements
Design Constraints: Noise Suppression and Robustness
Baseline (Rel-20):
- No requirement to provide noise suppression
- Required capability to handle and reconstruct noisy speech input with moderate to high SNR
- Note: Noise reconstruction capability primarily enforced through performance requirements
Advanced (Rel-21):
- No requirement to provide noise suppression
- Required capability to handle speech and generic input anticipated in extended application scenarios
Design Constraints: DTX
Baseline (Rel-20):
- No requirement to support DTX
- Note: No separate DTX-related performance requirement
Advanced (Rel-21):
- DTX support may be required for certain extended application scenarios, depending on potential 6G Media requirements
Performance Requirements
Baseline (Rel-20):
- Requirements focusing on clean and noisy speech performance
- NWT (not worse than) AMR7.4 or NWT AMR-WB8.85, depending on target bandwidth, for:
  - Clean speech
  - Noisy speech (AMR/AMR-WB references operated with DTX on)
  - Relevant transcoding cases with G.711, AMR, AMR-WB, EVS
Advanced (Rel-21):
- Complex set of requirements considering required capability to handle speech and generic input anticipated in extended application scenarios
Test Methodology and Test Plan
Baseline (Rel-20):
- Subjective: P.800 DCR
- Note: Test methodology and test plan should be conceptually aligned with corresponding EVS codec standardization Pdocs (e.g., DCR test design, applicable SNRs and types of noises for noisy speech test cases)
Advanced (Rel-21):
- Subjective: Suitable for critical evaluation of candidate codec(s) against the expected complex set of performance requirements
Proposal
SA4 is asked to adopt this phased approach for ULBC standardization as a working assumption:
This approach ensures a deliverable ULBC baseline in Rel-20 while providing a clear and orderly path toward an enhanced ULBC design in Rel-21. |
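
The baseline bitrate example above (TBS=208, SPS period 160 ms, 950 bps net) can be checked with simple arithmetic. The sketch below uses only the figures quoted in the summary; the function and variable names are illustrative. The 56-bit difference between the transport block and the codec payload is merely what those two figures imply, and the summary does not say what it is spent on.

```python
def gross_bitrate_bps(tbs_bits: int, period_ms: float) -> float:
    """Bits carried per second if one transport block is sent per period."""
    return tbs_bits * 1000.0 / period_ms


def payload_bits(net_bitrate_bps: float, period_ms: float) -> float:
    """Codec bits available per period at a given net bitrate."""
    return net_bitrate_bps * period_ms / 1000.0


# Figures quoted in the Rel-20 baseline example.
TBS_BITS = 208
SPS_PERIOD_MS = 160.0
NET_BITRATE_BPS = 950.0

gross = gross_bitrate_bps(TBS_BITS, SPS_PERIOD_MS)
codec_bits = payload_bits(NET_BITRATE_BPS, SPS_PERIOD_MS)
overhead_bits = TBS_BITS - codec_bits  # implied by the two figures, not itemized in the source

# Baseline delay example: at most half of the SPS/bundling period.
max_algorithmic_delay_ms = 0.5 * SPS_PERIOD_MS

print(gross, codec_bits, overhead_bits, max_algorithmic_delay_ms)
```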
|
| Qualcomm Incorporated |
[FS_ULBC] Feasible TBS values and packet loss traces for 80ms bundling period for ULBC over NB-IoT NTN GEO channel
|
Feasible TBS Values and Packet Loss Traces for 80ms Bundling Period for ULBC over NB-IoT NTN GEO Channel
1. Background and Scope
This contribution presents simulation results for an 80 ms bundling period following the Simulation One ("target QoS based simulation") methodology. The document provides:
2. Simulation Parameters and Trace Labeling
The simulations cover the following parameter ranges:
A trace file naming convention is established for both UL and DL scenarios.
3. Optimal Configurations
3.1 TBS 144, 1 UE RX
Optimality criterion: Tradeoff between per-UE performance (TBS and BLER) and system capacity.
1% BLER
2% BLER
6% BLER
10% BLER
3.2 TBS 144, 2 UE RX
For 23 dBm UE Power Class
For 26 dBm and 31 dBm UE Power Classes
Key observation: The 3.75 kHz SCS configuration becomes optimal for higher power classes due to better coding rate.
3.3 TBS 256, 2 UE RX
For 26 dBm UE Power Class
For 31 dBm UE Power Class
3.4 TBS 328, 2 UE RX
For 26 dBm UE Power Class
For 31 dBm UE Power Class
3.5 TBS 424, 2 UE RX
Note: Coarse 5 ms granularity for NPDSCH time-domain configuration.
For 31 dBm UE Power Class
4. Feasible TBS Values
Observation: For an 80 ms bundling period with UE power class up to 31 dBm:
- All TBS values (144, 256, 328, 424) are feasible for BLERs 1%, 2%, 6%, and 10%
- Exception: TBS 424 is not feasible at 1% BLER
5. Packet Loss Traces
299,391 traces provided in attached zip file for all 4 TBS values (144, 256, 328, 424).
6. Proposal
Proposal: Include clauses 2 through 5 in the PD or TR to provide a workable example of determining configurations based on the optimal tradeoff between per-UE performance and system capacity. |
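
The feasibility observation above can be captured as a small lookup, together with the gross bitrate each TBS yields when one transport block is sent per 80 ms bundling period. This is a sketch under that one-block-per-period assumption; the names are illustrative, and the per-configuration details (SCS, repetitions, power class) behind the observation are not modeled.

```python
BUNDLING_PERIOD_MS = 80
TBS_VALUES = (144, 256, 328, 424)
BLER_TARGETS_PERCENT = (1, 2, 6, 10)


def is_feasible(tbs: int, bler_percent: int) -> bool:
    """Encode the stated observation: everything is feasible except TBS 424 at 1% BLER."""
    if tbs not in TBS_VALUES or bler_percent not in BLER_TARGETS_PERCENT:
        raise ValueError("combination not covered by the reported simulations")
    return not (tbs == 424 and bler_percent == 1)


# Gross bitrate per TBS, assuming one transport block per bundling period.
GROSS_BITRATE_BPS = {tbs: tbs * 1000 / BUNDLING_PERIOD_MS for tbs in TBS_VALUES}

feasible = [
    (tbs, bler)
    for tbs in TBS_VALUES
    for bler in BLER_TARGETS_PERCENT
    if is_feasible(tbs, bler)
]

for tbs in TBS_VALUES:
    ok = [bler for t, bler in feasible if t == tbs]
    print(f"TBS {tbs}: {GROSS_BITRATE_BPS[tbs]:.0f} bps gross, feasible BLER targets (%): {ok}")
```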
|
| Apple Inc. |
[FS_ULBC] ULBC Performance Requirements
|
Summary of S4-260271: ULBC Performance Requirements
Document Information
Main Technical Contributions
Performance Requirements Framework
The document proposes establishing minimum performance requirements for the Ultra-Low Bitrate Codec (ULBC) based on the following rationale:
Proposed Minimum Performance Benchmarks
The document establishes two key performance anchors:
Reference Codecs and Operating Points for Testing
The document proposes a comprehensive list of reference codecs and operating points for ToR comparison testing in subjective evaluation:
Text Proposal
The document proposes updates to Clause 8 (Performance requirements) of TR 26.940, adding:
- New Clause 8.1 (General) containing the performance requirements framework and minimum benchmarks
- New Clause 8.1.1 (A List of Reference Codecs and Operating Points) containing the reference codec list for subjective evaluation |
|
| Apple Inc. |
[FS_ULBC] ULBC Codec Testing in Background Noise
|
Summary of S4-260272: ULBC Codec Testing in Background Noise
Document Overview
This contribution proposes a testing framework for the Ultra-Low Bitrate Codec (ULBC) in noisy conditions, drawing from EVS codec testing methodologies. The document is a revision of S4-251786 from SA4#134 and proposes updates to TR 26.940 Clause 9.
Background and Motivation
Noise Suppression Considerations
The document argues against mandating noise suppression (NS) algorithms within the codec specification based on several key considerations:
The document advocates for flexibility in NS implementation to enable manufacturers to develop device-specific solutions.
Proposed Testing Framework
Core Testing Scenarios (Table 9.1.4.1)
Following EVS codec testing principles (TR 26.952), the proposal includes:
| Source Material | Noise Type | SNR | Test Methodology |
|----------------|------------|-----|------------------|
| Clean speech | - | - | ITU-T P.800 ACR and/or DCR |
| Speech + Noise | Stationary (car, etc.) | 15 dB | ITU-T P.800 DCR |
| Speech + Noise | Non-stationary (street, babble, etc.) | 20-25 dB | ITU-T P.800 DCR |
This framework aligns with EVS testing, which used:
- Car noise at 15 dB
- Street noise at 20 dB
Optional Extended Testing for Low SNR (Table 9.1.4.2)
To characterize ULBC robustness in challenging low SNR conditions:
| Source Material | Noise Type | SNR | Test Methodology |
|----------------|------------|-----|------------------|
| Speech + Noise | Stationary (car, etc.) | 5-10 dB | ITU-T P.800 DCR |
| Speech + Noise | Non-stationary (street, babble, etc.) | 10-15 dB | ITU-T P.800 DCR |
| NS processed speech + Noise | Stationary (car, etc.) | 5-10 dB | ITU-T P.800 DCR |
| NS processed speech + Noise | Non-stationary (street, babble, etc.) | 10-15 dB | ITU-T P.800 DCR |
Key Notes:
- To avoid bias, a common NS processing tool should be used for generating NS-processed speech
- Selection of specific noise types and the NS processing tool is FFS
- Reference is made to TR 26.989 v19.0.0 (MCPTT work) where EVS was evaluated in siren noise at 5 dB SNR
Proposed Specification Changes
The document proposes adding new Clause 9.1.4 to TR 26.940 with two subclauses:
Action Requested
The document seeks Discussion and Agreement on:
1. The proposed testing framework for ULBC in noisy conditions
2. Updates to TR 26.940 Clause 9 as specified in the text proposal |
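
Preparing the "Speech + Noise" conditions in the tables above amounts to scaling a noise recording so that the mixture reaches a target SNR. The sketch below shows that scaling using mean power over the whole signal, which is a simplifying assumption: an actual test plan would typically measure active speech level (e.g., per ITU-T P.56) and apply calibrated input levels. The sine and Gaussian signals are synthetic stand-ins, and all function names are illustrative.

```python
import math
import random


def mean_power(samples):
    """Average power of a sequence of samples."""
    return sum(v * v for v in samples) / len(samples)


def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that speech-to-noise power ratio equals `snr_db`, then add it."""
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    p_speech = mean_power(speech)
    p_noise = mean_power(noise)
    if p_speech == 0 or p_noise == 0:
        raise ValueError("signals must have non-zero power")
    gain = math.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    scaled_noise = [gain * n for n in noise]
    mixture = [s + n for s, n in zip(speech, scaled_noise)]
    return mixture, scaled_noise


def measure_snr_db(speech, noise):
    """SNR in dB between two separately available components."""
    return 10 * math.log10(mean_power(speech) / mean_power(noise))


# Synthetic stand-ins: 1 s of a 200 Hz tone and white noise at 16 kHz.
rng = random.Random(0)
speech = [math.sin(2 * math.pi * 200 * t / 16000) for t in range(16000)]
noise = [rng.gauss(0.0, 1.0) for _ in range(16000)]

mixture, scaled_noise = mix_at_snr(speech, noise, snr_db=15.0)
achieved = measure_snr_db(speech, scaled_noise)
print(f"achieved SNR: {achieved:.3f} dB")
```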
|
| Dolby Laboratories Inc., Nokia, Novamint |
[FS_ULBC] On device capability diversity
|
Summary of S4-250275: On Device Capability Diversity for ULBC
Overview
This document (revision of S4aA260006) addresses UE capability diversity in NB-IoT NTN deployments for ULBC voice services. It proposes a capability-aware system design approach rather than assuming uniform baseline UE capabilities, accompanied by a pCR to TR 26.940.
Key Technical Contributions
1. UE Capability Diversity Framework
Identified Capability Dimensions:
Key Insight: These capabilities are optional and vary across device categories, market segments, and implementations.
2. Benefits of Enhanced Capabilities
Enhanced UE capabilities enable:
3. Capability-Aware Multi-User SPS Scheduling
Proposed Scheduling Strategy:
Practical Example (Figure 1):
4. ULBC Bitrate Differentiation
Proposed Approach:
Note: Actual bitrates subject to ongoing TBS discussions; values >3000 bits/s may become relevant.
Proposed Changes to TR 26.940 (pCR)
Section 5.2.4: New Clause on UE Capabilities
Documents capability variations for NB-IoT NTN:
Section 5.2.5: Enhanced Multi-User Considerations
Replaces assumption of uniform UE configuration with capability-aware scheduling:
Includes Figure 1 demonstrating practical multi-user scheduling scenario with three UE types.
Section 5.1.2.2: UE Delay Tables
Updates delay estimation tables (5.1.2-2, 5.1.3-1) to include:
Recommendations
References
Key dependencies: S4aA260006 (previous version), S4-260144 (TR 26.940 v0.5.1), S4-260255 (ULBC Re-Focus Proposal), TS 36.763 (UE radio transmission/reception), S4-251863 (system capacity), S4aA250112 (error trace methodology), S4aA250118 (RAN simulation results) |
|
| Orange, Dolby Laboratories Inc. |
On the use of objective metrics in ULBC standardization
|
Summary of 3GPP Technical Document on Objective Metrics in ULBC Standardization
Introduction and Scope
This document addresses the "Study on Ultra Low Bitrate Speech Codec" (FS_ULBC) approved at SA#107, specifically focusing on study objective 5 from the WID regarding performance requirements and test methodologies for speech quality, intelligibility, and conversational quality across various conditions (clean/noisy speech, tandeming with IMS codecs, clean/GEO channel conditions). The contribution provides correlation analysis results of objective quality models as a complement to subjective test results on clean speech and music/mixed content in TR 26.940, building upon previous discussions in S4-251814.
Main Technical Contributions
Test Methodologies - General Considerations (Clause 9.1.1-9.1.2)
Quality Impairment Categories for ULBC:
- Loss of listening-only audio quality
- Audio bandwidth loss
- Impaired intelligibility
- Impaired speaker identifiability
- Prosodic impairments
- Hallucination (word and phone confusions)
- Sensitivity to non-speech input (background noise, music, noisy speech, interfering talkers, reverberant speech)
Testing Challenges:
- ML-based ULBC codecs introduce new impairment categories (e.g., hallucination) not present in signal-processing based codecs (AMR, AMR-WB, EVS)
- Traditional P.800 ACR methodology may not optimally quantify all potential impairments
- DCR methodology focuses on differences to the reference and is suitable for small impairments and prosodic differences
- Previous 3GPP codec standardization (AMR, AMR-WB, EVS) used ACR for clean speech and DCR for SWB, mixed-bandwidth, noisy speech, and music evaluations
Alternative Test Methods Listed:
- Diagnostic Rhyme Tests (DRT)
- Modified Rhyme Tests (MRT)
- MOS testing for speaker similarity
- Speaker verification/identification tests
- Prosodic naturalness MOS tests
- Intonation recognition tests
- Transcription tests for word and semantic equivalence
- Phoneme recognition tests
- Automatic speech recognition tests
- P.835 multi-dimensional rating scales for speech enhancement evaluation
Subjective Testing Considerations (Clause 9.1.3)
Source Material Robustness (9.1.3.1):
- Multiple languages with diverse intonations
- Various phonetic and linguistic environments
- Different voice pitches and speaking styles
- Overlapping talkers
Real-world Acoustic Conditions (9.1.3.2):
- Clean environments (minimal background noise)
- Noisy environments (traffic, human chatter, vehicle)
- Varying reverberation levels (RT60 ranging from 0.3 s to 1.0 s)
Tandeming and Compatibility Testing (9.1.3.3):
- Testing with speech previously encoded by ITU-T G.711, AMR, AMR-WB, and EVS
- Various input levels: -16 dBov, -26 dBov, and -36 dBov
Conclusion:
- P.800 ACR/DCR serves as the backbone for most subjective testing
- Other methodologies may be considered
- Emphasis on diverse test material covering multilingual/multi-speaker testing, real-world acoustic conditions, and tandeming
Objective Testing Considerations (Clause 9.1.4)
Correlation Analysis on Clean Speech (9.1.4.1):
Evaluated objective models from references [7-11]:
- Speech-oriented metrics: PESQ, POLQA, ViSQOL-S, WARP-Q, DNSMOS, NISQA, NORESQA, UTMOS
- General audio metrics: PEAQ, ViSQOL-A
- Additional metric: SCOREQ
Evaluation metrics used: Pearson correlation coefficient, RMSE, Kendall's Tau rank correlation coefficient
Key Observations (Clean Speech):
- Best performing models (POLQA, UTMOS, PESQ, WARP-Q, SCOREQ) accurately predicted monotonic bitrate/quality behavior of multirate codecs
- Models operating at 16 kHz (PESQ without mapping, UTMOS and WARP-Q with mapping) showed relatively good performance even for fullband codecs
- Mapping generally improves accuracy (RMSE) except for a few models (PESQ, POLQA)
Correlation Analysis on Music/Mixed Content:
Evaluated models from references [7-12]: POLQA, PEMO-Q, ViSQOL-A, and 2f-model
Key Observations (Music/Mixed Content):
- POLQA (despite not being recommended for non-speech signals) gave the best correlation results (Pearson, Kendall, RMSE after 3rd order mapping)
- 2f-model was second-best performing
- ViSQOL Audio, PEAQ, and PEMO-Q showed fair performance despite being adapted to music/mixed content
- Correlation scores were lower than for clean speech, possibly due to the more difficult task of predicting quality for general audio and a mismatch with the DCR test methodology grading
Discussion (9.1.4.2):
- P.862 (PESQ) has been officially "withdrawn" by ITU-T and cannot be considered a valid standard
- P.863 remains the main ITU-T standard, with P.SAMD emerging as a potential alternative
- Testing and parameter adjustment based on objective tools is not recommended
- 3GPP TR 26.921 documented that tuning noise reduction based on PESQ should be avoided
Conclusion (9.1.4.3):
- Subjective testing remains the "golden reference" for codec selection
- Objective metrics are NOT recommended for codec selection criteria or codec tuning
- Correlation of subjective/objective metrics may be considered for characterization of a new codec
- Objective metrics have merits in other tasks such as codec conformance testing
Document Type
This is a proposed Change Request (pCR) to TR 26.940, specifically targeting Clause 9 (Test methodologies) with additions to subclauses 9.1.1 through 9.1.4. |
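
The three evaluation metrics named in the correlation analysis above (Pearson correlation, RMSE, Kendall's Tau) can be computed as follows. This is a minimal pure-Python sketch: the score lists are made-up illustrative values, not data from TR 26.940; the tau variant shown is tau-a, which assumes no tied scores; and the third-order mapping the summary mentions before computing RMSE is not applied here.

```python
import math


def pearson(x, y):
    """Pearson linear correlation coefficient."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def rmse(x, y):
    """Root-mean-square error between two score lists on the same scale."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))


def kendall_tau_a(x, y):
    """Kendall rank correlation (tau-a): (concordant - discordant) / total pairs."""
    n = len(x)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            sign = (x[i] - x[j]) * (y[i] - y[j])
            if sign > 0:
                concordant += 1
            elif sign < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


# Illustrative values only: per-condition subjective MOS and an objective model's predictions.
subjective_mos = [1.8, 2.4, 3.1, 3.9, 4.3]
objective_score = [2.0, 2.2, 3.6, 3.4, 4.5]

r = pearson(subjective_mos, objective_score)
err = rmse(subjective_mos, objective_score)
tau = kendall_tau_a(subjective_mos, objective_score)
print(f"Pearson={r:.3f}  RMSE={err:.3f}  Kendall tau={tau:.3f}")
```

Pearson and RMSE are sensitive to the score scale and to linearity, whereas Kendall's Tau depends only on rank order, which is why a model can rank conditions well and still show a large RMSE before any mapping.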