3GPP TR 26.940 V0.5.1 (FS_ULBC)
This Technical Report documents the study on Ultra Low Bit rate Speech Codecs (ULBC) for 3GPP Release 20. The primary focus is on IMS voice services over Geostationary Orbit (GEO) satellite access, with additional consideration for multi-party voice communication and other access types.
Background:
- GEO satellites operate at 35,786 km altitude, resulting in ~285ms one-way propagation delay
- TR 22.887 and TS 22.261 assume total transmission data rates of [1-3] kbit/s
- Current 3GPP codecs (lowest: AMR at 4.75 kbit/s) cannot support these constraints
Scenario Descriptions:
Main Scenario (4.2.2.2): One UE connects via GEO satellite access
- UE1: Phone supporting IMS voice over GEO satellite
- UE2: Either "regular" phone (requiring transcoding in core network) or "upgraded" phone supporting ULBC over other access (enabling transcoder-free operation)
Sub-Scenario (4.2.2.3): Both UEs connect via GEO satellite access
- Less common but relevant for disaster/cyberattack scenarios
- Even with a transparent (bent-pipe) payload, voice packets must travel to the ground segment before reaching the other UE
- May enable transcoder-free operation
High-level Prerequisites:
- Very low bitrate support
- DTX support [TBC]
- Error concealment
- Real-time implementation on smartphones
- Good audio quality for reasonable QoE
Background:
- Addresses poor/unstable network conditions in WLAN access
- Network congestion during peak usage or in areas with limited infrastructure
- Codec selection critical for maintaining quality under bandwidth constraints
Scenario Description:
- One participant (UE1) on unstable network using ULBC, other (UE2) on stable network with conventional codec (requires transcoding)
- Both participants on unstable networks using ULBC simultaneously (no transcoding needed)
High-level Prerequisites:
- Ultra-low bitrate capability
- Real-time operation on consumer devices (smartphones, laptops)
- Audio quality matching or exceeding existing voice services
Motivation:
- ULBC may provide enhanced robustness against poor network conditions
- Lower bit rates may benefit coverage/capacity
- Reduces transcoding needs when calls bridge GEO and other access types
Scenario Description:
- Both UEs support ULBC but connect via 3GPP access other than GEO (LTE, NR, WLAN)
Delay Components:
UE Delay (Table 5.1.2-2):
- Depends on voice bundling period (80ms, 160ms, 320ms) and codec frame size (20-320ms)
- Performance objective range: 268-1435ms (excluding solution-specific delay)
- Maximum requirement range: 355-1435ms (excluding solution-specific delay)
- Components: 2x voice bundling period + 2x vendor-specific encoder/decoder processing + vendor delay budget + JBM (jitter buffer management)
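As a sanity check on the Table 5.1.2-2 breakdown, the components can be summed directly. A minimal sketch; the per-component values below are illustrative placeholders, not figures from the TR:

```python
# Illustrative sum of the UE delay components from Table 5.1.2-2.
# All per-component values used in the example are placeholders.

def ue_delay_ms(bundling_ms, codec_proc_ms, vendor_budget_ms, jbm_ms):
    """2x voice bundling + 2x encoder/decoder processing + vendor budget + JBM."""
    return 2 * bundling_ms + 2 * codec_proc_ms + vendor_budget_ms + jbm_ms

# e.g. 80 ms bundling, hypothetical 10 ms per codec pass,
# 20 ms vendor delay budget, 60 ms jitter buffer depth:
print(ue_delay_ms(80, 10, 20, 60))  # 260
```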
Core Network Delay:
- Ground station to core network: [5-20]ms minimum, [200]ms maximum
- eNodeB to core network: 5-20ms
- Transcoding: 7ms (AMR/AMR-WB) to 14ms (EVS)
GEO Transmission Delay:
- Minimum: 248ms
- Maximum: 280ms (per TS 22.261 KPI requirement)
- Variation of 32ms depending on UE location within beam
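The physical propagation floor behind these figures can be reproduced from geometry alone; the TR's 248-280ms range evidently includes margin beyond pure free-space propagation. The Earth-radius and elevation-angle assumptions below are mine:

```python
import math

C_KM_S = 299_792.458    # speed of light, km/s
R_EARTH_KM = 6_371.0    # mean Earth radius (assumption)
H_GEO_KM = 35_786.0     # GEO altitude

def bent_pipe_one_way_ms(slant_range_km):
    """One-way UE -> satellite -> ground delay for a transparent payload."""
    return 2 * slant_range_km / C_KM_S * 1000

# Shortest path: UE at the sub-satellite point.
d_min = bent_pipe_one_way_ms(H_GEO_KM)                                # ~238.8 ms
# Longest path: UE at 0 deg elevation angle (maximum slant range).
slant_max = math.sqrt((R_EARTH_KM + H_GEO_KM) ** 2 - R_EARTH_KM ** 2)
d_max = bent_pipe_one_way_ms(slant_max)                               # ~278 ms
print(round(d_min, 1), round(d_max, 1))
```

The ~278ms geometric maximum lines up well with the 280ms KPI ceiling, and the min/max spread matches the ~32ms location-dependent variation noted above.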
Mouth-to-Ear Delay Estimates (Table 5.1.3-1):
For Main Scenario (GEO-TN):
- 80ms bundling, 20ms frame: 548ms (lower) to 872ms (upper) + solution-specific delay X
- 320ms bundling, 320ms frame: 1952ms (lower) to 2395ms (upper) + solution-specific delay X
For Sub-Scenario (GEO-GEO):
- 80ms bundling, 20ms frame: 804ms (lower) to 1315ms (upper) + solution-specific delay X
- 320ms bundling, 320ms frame: 1952ms (lower) to 2395ms (upper) + solution-specific delay X
System Architecture:
- Service link: UE to NTN payload
- Feeder link: NTN payload to NTN Gateway
RAN Parameters:
- Channel coding: Turbo code (NPUSCH Format 1 uplink), tail-biting convolutional code (TBCC, NPDSCH downlink)
- MCS: pi/2 BPSK, pi/4 QPSK, QPSK, 16QAM
- Subcarrier spacings: 3.75kHz and 15kHz for NPUSCH Format 1
- Resource unit (RU) duration varies with subcarrier spacing and number of tones
QoS Characteristics:
- Managed through QCI (QoS Class Identifier)
- Same PELR (Packet Error Loss Rate) required for UL and DL
- Suggests balanced UL/DL time-domain transmission resources
Key parameters identified:
- Bit rates: [TBD]
- Sample rate and audio bandwidth: [TBD]
- Frame length: [TBD]
- Complexity and memory demands: [TBD]
- Algorithmic delay: Frame size buffering + inherent codec delays (look-ahead, sample-rate conversion, post-processing)
- Packet loss concealment (PLC): [TBD]
- Noise suppression: [TBD]
- Discontinuous transmission (DTX): Including VAD and comfort noise [TBD]
- Robustness to non-speech input: [TBD]
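The algorithmic-delay item above can be expressed as a simple sum. A sketch with hypothetical component values (only the AMR-like 20ms frame + 5ms look-ahead pairing is taken from the codec tables later in this report):

```python
def algorithmic_delay_ms(frame_ms, lookahead_ms=0.0,
                         resampling_ms=0.0, postproc_ms=0.0):
    """Frame buffering plus the inherent codec delays listed above."""
    return frame_ms + lookahead_ms + resampling_ms + postproc_ms

# e.g. a 20 ms frame with 5 ms look-ahead (AMR-like figures) gives 25 ms total
print(algorithmic_delay_ms(20, lookahead_ms=5))  # 25.0
```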
Current Evaluation Analysis:
- Codec must support real-time thread and concurrent processing
- ML codecs with [5-10M] parameters considered for efficient operation within latency bounds
- Must operate within compute constraints of devices for real-time voice communication
Memory and Power Considerations:
- Larger models require more DRAM access → higher power consumption
- Memory footprint critical for device performance and usability
Complexity Metrics for AI-Based Codecs:
TOPS (Tera Operations Per Second):
- TOPS = (2 × MAC unit count × clock frequency) / 10^12
- Smartphone NPUs: 8-59 TOPS reported (varying precision: INT8, INT16, FP16)
- TOPS/W (power efficiency): 2-15 TOPS/W for smartphone NPUs
- Note: TOPS/W typically benchmarked under full-load; lighter workloads like audio codecs may show different characteristics
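The TOPS formula above is a one-liner; the NPU configuration in this example (MAC count, clock) is hypothetical, not taken from any reported device:

```python
def peak_tops(mac_units, clock_hz):
    """Peak throughput: 2 ops (multiply + accumulate) per MAC unit per cycle."""
    return 2 * mac_units * clock_hz / 1e12

# Hypothetical NPU: 8192 MAC units clocked at 1.5 GHz
print(peak_tops(8192, 1.5e9))  # 24.576
```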
Alternative Metrics:
- MACs (Multiply-Accumulate operations): Practical for complexity assessment
- RTF (Real-Time Factor): Ratio of encoding/decoding time to frame length (RTF < 1 indicates real-time capability); reliable but resource-intensive to measure
- Model Size: Number of parameters and precision; directly impacts memory and power
- Tools available: ptflops, torchinfo, fvcore for MAC counting
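A sketch of the RTF convention as used in clause 7.6 of this report (processing time over audio duration, so RTF < 1 is real-time); the measurement value here is hypothetical:

```python
def rtf(processing_ms, frame_ms):
    """RTF = processing time / audio duration; RTF < 1 means real-time capable."""
    return processing_ms / frame_ms

# Hypothetical measurement: 15 ms to decode one 20 ms frame
r = rtf(15.0, 20.0)
print(r, r < 1.0)  # 0.75 True
```

Note that an "N times faster than real time" claim corresponds to the reciprocal, 1/RTF.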
Observations:
- NPUs/TPUs significantly more power-efficient than CPUs for AI tasks (5-20x)
- Actual NPU performance depends on computational graph structure
- Irregular/sequential/unsupported operations may require CPU fallback
- ULBC complexity constraints should be based on desired power consumption/computational performance, not relative to existing 3GPP codecs
- Million MACs + model size provide first indication of complexity
- RTF useful but requires standardized test benches
- WMOPS not directly suitable for NPU-capable devices but mapping to TOPS/RTF beneficial
Complexity Target Estimation:
- Target devices: Modern smartphones with NPU components
- Example: DAC codec estimated at ~150 Giga MAC/sec (~0.3 TOPS)
- Actual power consumption on smartphone NPUs: TBD
- Model size and architecture significantly impact DRAM operations and overall power consumption
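The DAC estimate above can be put against reported NPU capacity with a unit conversion (2 ops per MAC); the 8 TOPS capacity figure is simply the low end of the smartphone NPU range quoted earlier, and says nothing about achievable utilization:

```python
def gmacs_to_tops(giga_macs_per_second):
    """Convert a sustained MAC rate to TOPS (2 ops per MAC)."""
    return 2 * giga_macs_per_second / 1000

demand = gmacs_to_tops(150)   # DAC estimate above: ~0.3 TOPS
capacity = 8.0                # low end of reported smartphone NPU peak TOPS
print(demand, demand / capacity)  # 0.3 0.0375
```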
Editor's note: Algorithmic delay verification method for AI-based codecs required.
Codec Parameters and Configuration:
- Static parameters: Rarely changed, exchanged via SDP or predefined
- Dynamic parameters: May change frequently, included in each packet/frame
- Common static/dynamic parameters to be identified
Categories:
1. 3GPP IMS codecs: Reference conditions (AMR, AMR-WB, EVS)
2. Conventional Ultra Low Bitrate Codecs: DSP-based (MELP/MELPe, AMBE-LR, MPEG-HVXC, TWELP MR, Codec2)
3. AI-based postprocessor: Enhancement of conventional codec output
4. AI-based encoder/decoder:
- Causal systems: Real-time capable (LPCNet, LyraV2, EnCodec, Mimi-Codec, TS3, TAAE, LMCodec2)
- Non-causal systems: Non-real-time due to large look-ahead (DAC, DAC-IBM, SNAC, SpeechTokenizer, SemantiCodec, FunCodec, WavTokenizer, BigCodec, FocalCodec)
Key Codec Properties (the delay figures below appear to be in addition to the frame duration, consistent with the totals in the "Algorithmic Codec Delay" summary):
3GPP IMS Codecs:
- AMR: NB, 5ms delay, 20ms frame, 4.75 kbps
- AMR-WB: WB, 5.9375ms delay, 20ms frame, 6.6 kbps
- EVS: NB/WB/SWB, 12ms delay, 20ms frame, 7.2-9.6 kbps
Conventional Ultra Low Bitrate:
- MELP/MELPe: NB, 20-36ms delay, 22.5-90ms frame, 0.6-2.4 kbps
- Codec2: NB, 40ms delay, 20-40ms frame, 0.45-2.4 kbps
AI-based (Causal):
- LPCNet: WB, 25ms delay, 40ms frame, 1.6 kbps
- LyraV2: WB, [TBD] delay, 20ms frame, 3.2/6/9.2 kbps
- Mimi-Codec: 24kHz, 0ms delay, 80ms frame, 0.55/1.1 kbps
AI-based (Non-causal):
- DAC: WB/24kHz, 244-366ms delay, 13.3-20ms frame, 0.5-3+ kbps
- DAC-IBM: 24kHz, 366ms delay, 13.3ms frame, 0.75/1.5/3 kbps
- SNAC: 24kHz, 1000ms delay, 80ms frame, 0.98 kbps
Audio Bandwidth:
- Conventional codecs: NB only
- Modern AI codecs: WB or higher
Algorithmic Codec Delay:
- IMS codecs: 25-32ms
- Conventional ultra-low: 60-126ms
- Causal AI: 20-80ms
- Non-causal AI: 500ms+ or full signal
Frame Duration:
- Conventional ultra-low: Increased vs. standard 20ms VoIP
- Some AI codecs maintain 20ms, others increase (e.g., Mimi 80ms)
Bitrate:
- All listed codecs (except the IMS codecs and LyraV2) offer at least one mode below 3 kbps
Complexity:
- AI codecs generally higher than IMS/conventional codecs
- Exception: LyraV2 requires only 35% of an ARM Cortex-A53 core (Raspberry Pi 3+)
- RAM: AI codecs significantly higher (e.g., LyraV2: 54MB vs. EVS: 294KB)
- ROM: AI codecs much higher (e.g., TAAE: 950M parameters ≈ 900MB @ 8-bit; SNAC: 19M ≈ 18MB @ 8-bit; EVS: ~2MB)
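The ROM figures above follow directly from parameter count times weight precision; note the "≈900MB" for TAAE matches 950M 8-bit weights when expressed in binary MiB:

```python
def weight_storage_mib(num_params, bits_per_weight=8):
    """Approximate model weight storage in MiB (binary megabytes)."""
    return num_params * bits_per_weight / 8 / 2**20

print(round(weight_storage_mib(950e6)))    # 906  (TAAE, "~900MB @ 8-bit")
print(round(weight_storage_mib(19e6), 1))  # 18.1 (SNAC, "~18MB @ 8-bit")
```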
P.808 ACR Test Results (Figure 7.1.4-1):
Test setup:
- English clean speech (4 talkers × 6 samples)
- 32kHz, SWB, normalized to -26 dBoV
- 24 subjects
Key Findings:
- Codec2 (all rates) significantly worse than AMR 4.75 kbps
- SemantiCodec, LyraV2, LPCNet, Mimi 0.55 kbps: comparable to AMR-WB 6.65 kbps
- Three conditions on par or slightly better than EVS 9.6 kbps:
- Mimi-Codec 1.1 kbps (causal)
- DAC-ibm 1.5 kbps (non-causal)
- SNAC 0.98 kbps (non-causal)
- AI-based solutions show 2+ MOS improvement over conventional ultra-low bitrate codecs
Test Configuration (Table 7.1.5.1-1):
- Bitrates: 1, 2.5, 4.5, 6 kbps
- Loss percentages: 1%, 6%, 10%, 20%
- Frame size: 80ms
- Based on NB-IoT NTN data at ~3dB CNR (SCS=15kHz) and 9dB (SCS=3.75kHz)
Loss Simulation Methods:
1. Consecutive drop and repeat: four consecutive 20ms blocks dropped and the preceding ones repeated, simulating an 80ms packet loss
2. Interleaved drop and repeat: loss spread over two packets (at the cost of added latency)
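A frame-level sketch of the first method, with packets represented as abstract labels rather than coded audio; how the actual test handled a loss before any packet was received is an assumption here (filled with silence):

```python
# "Drop and repeat": each lost 80 ms packet is replaced by repeating the
# last successfully received packet.

def drop_and_repeat(packets, loss_mask, fill="silence"):
    out, last = [], fill
    for pkt, lost in zip(packets, loss_mask):
        if lost:
            out.append(last)      # conceal by repetition
        else:
            out.append(pkt)
            last = pkt
    return out

print(drop_and_repeat(["p0", "p1", "p2", "p3"], [False, True, False, True]))
# ['p0', 'p0', 'p2', 'p2']
```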
MUSHRA Test Results (8 listeners):
- Despite higher loss percentage, 4.5 kbps and 6 kbps significantly better than 1 kbps and 2.5 kbps
- 6 kbps @ 20% loss rated close to 4.5 kbps @ 10% loss
- Interleaving benefit increases with error rate
- Potential for improvement if model trained with random loss patterns
Comparison:
- DAC (default): 16kHz, general audio training, scalable bitrate
- DAC-IBM: 24kHz, speech-specific training, fixed 1.5 kbps
MUSHRA Test Results (8 listeners, resampled to 16kHz):
- DAC-IBM 1.5 kbps @ 3% PLR significantly outperforms all other DAC conditions
- DAC 4.5 kbps @ 10% PLR and 6 kbps @ 20% PLR show no significant improvement over DAC-IBM 1.5 kbps @ 3% PLR
- Specific training for target bitrate crucial for optimal performance
- Error resilience improvable through appropriate training/design choices
Conclusions:
- More design freedom needed in bitrate and BLER selection for optimal quality at given SNR
- Optimal coding performance (even under errors) achieved with appropriate training strategy
- Bitrate scalability (e.g., DAC) comes with significant performance cost, especially at lower bitrates
- Dedicated training (e.g., DAC-IBM) much more efficient
Test Setup (Nokia):
- Clean Finnish speech (3 males, 3 females, 4 sample pairs each)
- Diotic presentation via Sennheiser HD650 headphones
- Experienced listeners
- Extended ACR5 scale (0.5-5.5) and DCR methodologies
- Bandwidths tested: NB (4kHz), MB (6kHz), WB (8kHz), 10kHz, SSWB (12kHz), SWB (16kHz), FB (20kHz)
Codecs Tested:
- DSP: Codec2 (0.7, 1.3, 2.4, 3.2 kbps), MELP (2.4 kbps), MPEG4 HVXC (2.0, 4.0 kbps)
- 3GPP: AMR, AMR-WB, EVS at various rates
- ML: DAC 44k (0.9, 1.7, 2.6, 3.4, 6.9 kbps), TSAC 44k (0.6, 1.2, 2.5, 3.2, 5.9 kbps)
Extended ACR5 Results (Figures 7.2.3-1, 7.2.3-2):
- Increased bandwidth improves quality up to ~12kHz (saturation region)
- 4kHz bandwidth significantly limits perceived quality
- MELP 2.4k and MPEG4 HVXC perform better than Codec2
- 3GPP codecs perform as expected at lowest bitrates
- TSAC and DAC show very good performance in clean speech
- TSAC ≥1.2 kbps and DAC ≥1.7 kbps suitable as ML-based references
- Both poor quality <1 kbps
DCR Results (Figure 7.2.4-1):
- Results align with ACR test
- Exception: MELP preferred over HVXC 2.0 in DCR (full 4kHz bandwidth vs. ~3.7kHz)
- Listeners more likely to notice degradations with reference available
Test Setup:
- French, 30 listeners (5 panels × 6)
- 8sec double sentences, 3 male + 3 female
- 20-20,000Hz bandpass, normalized to -26 LKFS
Codecs:
- Conventional: Opus (12, 16, 24 kbps), EVS-WB (7.2, 8 kbps), EVS-SWB (9.6, 13.2, 24.4 kbps)
- AI: LPCNet (1.6), Lyra V2 (3.2, 6, 9.2), EnCodec (1.5, 3, 6, 12, 24), AudioCraft (1.5, 3, 6), AudioDec, DAC (1.7, 2.6, 5.2, 7.8)
Key Findings:
- DAC: best DMOS among ~1.5 kbps codecs; approaches "Direct" quality at bitrates below 8 kbps
- EnCodec doesn't achieve "Direct" quality even @ 24 kbps; below EVS/Opus at this rate
- Lyra V2 (6, 9.2 kbps) on par with EVS-WB (7.2, 8 kbps)
Same setup as the DCR test, but with ACR methodology to enable better comparison against objective metrics. Observations match those of the DCR test.
Test Setup:
- 30 listeners (5 panels × 6)
- 6 categories: instrumental/vocal classical, instrumental/vocal modern, captured mixed, artificial mixed (speech + music background)
- 20-20,000Hz bandpass, -26 LKFS
Codecs:
- Conventional: xHE-AAC (8, 12, 16, 24), Opus audio (16, 24), Opus voip (12, 16, 24), EVS-SWB (9.6, 13.2, 24.4)
- AI: EnCodec (12, 24), DAC (4.3, 6, 7.8), HILCodec (4.5, 6, 9), SNAC (2.6), FlowDec (4.5, 6, 7.5)
- Note: Many neural codecs pretested but excluded due to low quality (LPCNet, Lyra V2, AudioDec, FreqCodec, HifiCodec, Spectral Codecs, Vocos, DisCodec, Mimi, AudioCraft)
Key Findings:
- Best quality: EVS and xHE-AAC @ ~24 kbps
- Neural codec advantage visible at low bitrates
- No tested neural codec achieves quality close to "Direct"
- FlowDec 7.5 kbps: 4.08 DMOS (best neural codec)
- No tested AI codec provided reasonable quality for music/mixed content <2.6 kbps
Classical Speech Coding:
- Studies on MELPe and AMR show noise reduction preprocessing improves parameter extraction and decoded speech quality
- Especially beneficial in noisy conditions and low SNRs
- Improves intelligibility and perceptual quality
- Integrated in 3GPP2 EVRC and VMR-WB standards
Neural Speech Coding:
- Known to be sensitive to noisy environments
- Robustness influenced by training data diversity, low bitrates, capacity/complexity, quantization
- Data-driven approaches make failure modes difficult to anticipate
- Noise suppression can minimize issues and allow codec to focus on useful signal
Two Listening Tests (ITU-T P.808 ACR):
Test 1 - High SNR:
- Assumptions from 3GPP EVS characterization
- SNRs: +15 to +20 dB (WB)
- Noises: car, street, office (from ITU-T P.501 Annex B)
- 24 pairs of sentences (8 pairs × 3 noises)
- 20 listeners
Test 2 - Low SNR:
- More adversarial environments
- SNRs: -5 to +15 dB
- Noises: street, construction, metro, car, office, restaurant
- 24 pairs of sentences (4 pairs × 6 noises)
- 21 listeners
Noise Suppression:
- DeepFilterNet2: State-of-the-art DNN-based, operates at 48kHz
- Applied as preprocessor before coding
Mixing Procedure:
- Loudness normalization using BS1770demo (ITU-T STL)
- RMS long-term option for background noise level
Classical Codecs:
- MELPe, AMR, AMR-WB, EVS
Neural Codecs:
- SNAC, MIMI, DAC_IBM (speech-trained, <2 kbps)
- LyraV2 3.2 kbps (likely trained on diverse data including noisy speech)
- DAC (original, 24kHz, 1.5/3/6 kbps) - Test 1 only
All tested with and without noise suppression ("_nr" suffix).
Key Observations:
- Listeners prefer uncoded denoised speech over uncoded noisy speech
- Denoised speech as good as clean speech at high SNRs (minimal artifacts)
- Noise suppression beneficial for all codecs except MELPe (already has noise reduction; benefit minimized at high SNRs)
- Classical codecs: Benefit increases with bitrate/quality
- Neural codecs: Greater benefit, >0.5 MOS improvement for several (SNAC, DAC_ibm, DAC @ 3 kbps)
- DAC_ibm vs. DAC: Same architecture/complexity, very different behavior due to training data/target bitrate
- Plain DAC @ 24kHz not competitive at 1.5 kbps
- LyraV2: ~70x less complex than other neural codecs; @ 3.2 kbps performs worse except vs. DAC @ 3 kbps (on par)
Key Observations:
- Listeners strongly prefer uncoded denoised speech (~1 MOS difference)
- All classical codecs benefit from denoising (improvements just below 1 MOS)
- Neural codecs benefit even more (>1 MOS improvement possible)
- Neural codecs at vastly lower bitrates can compete with conventional codecs under adverse conditions when combined with noise suppression
- Generative-AI based codecs (e.g., DAC IBM) can improve absolute quality of input signal when coding denoised speech
Key Characteristics:
- Publicly reported: "38x faster than real-time" on high-end 2021 smartphone
- Entirely CPU execution (no NPU/TPU)
- Open-source under Apache 2.0 license (permissive for commercial/standardization)
Code-Level Analysis:
- Core components (LyraGanModel, SoundStreamEncoder) explicitly use CPU backend (XNNPACK via TensorFlow Lite)
- Flag use_xnn=true directs to CPU execution
- No hardware accelerator delegates (NNAPI, Hexagon, CoreML, TPU)
- Single-threaded execution (threads explicitly set to 1)
- Benchmark: Mean 0.525ms processing time for 20ms frame = ~38x real-time
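The "~38x real-time" figure follows directly from the benchmark numbers quoted above:

```python
frame_ms = 20.0          # audio duration per frame
mean_proc_ms = 0.525     # reported mean processing time per frame
speedup = frame_ms / mean_proc_ms   # "x times faster than real time"
rtf = mean_proc_ms / frame_ms       # conventional real-time factor
print(round(speedup, 1), round(rtf, 3))  # 38.1 0.026
```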
Conclusion:
- Proves state-of-the-art low-bitrate AI speech codec can achieve/exceed real-time requirements on high-end 2021 smartphone CPU
- Significant margin below the maximum allowable RTF
- CPU-only approach viable for ULBC
Methodology:
- ONNX Runtime library for execution
- Tested on CPU backend and NNAPI backend (Android NPU interface)
- Model: Unmodified pretrained DAC @ 44.1kHz, 8 kbps (from reference)
- No quantization applied (original float model)
- Metrics: Real-Time Factor (RTF) for end-to-end and individual components
Theoretical Complexity Analysis (Figure 7.6.2-1):
- Tools: ptflops v0.7.5, thop v2.0.17 (cross-verification)
- Complexity scales with frame size: 1.4 GFLOP (20ms) to 31.6 GFLOP (320ms)
- Model: 76.9M parameters, 293MB size
- Note: Different library versions produce different results due to ConvTranspose1d calculation methodology changes
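Converting the per-frame figures above to sustained throughput shows that the larger frame sizes actually demand more compute per second, not less; this derivation uses only the two endpoint values quoted from Figure 7.6.2-1:

```python
# Per-frame FLOP figures from Figure 7.6.2-1 converted to sustained FLOP/s.
per_frame_flop = {20: 1.4e9, 320: 31.6e9}  # frame length (ms) -> FLOP/frame
for frame_ms, flop in per_frame_flop.items():
    gflops = flop / (frame_ms / 1000) / 1e9
    print(f"{frame_ms} ms frame: {gflops:.2f} GFLOP/s sustained")
# 20 ms -> 70.00 GFLOP/s; 320 ms -> 98.75 GFLOP/s
```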
Real-World Inference Performance:
Test Platforms:
1. High-end desktop: AMD Ryzen 9 7950X (5.7GHz fixed)
2. High-end mobile: Qualcomm Snapdragon 8 Gen 2
Key Findings (Figures 7.6.4-1, 7.6.4-2, 7.6.4-3):
Desktop CPU:
- Single-threaded: NOT real-time (RTF 1.6-1.9)
- Multi-threaded (4 threads): Real-time capable (RTF 0.67-0.86)
- Still very slow for high-end desktop CPU
Mobile SoC:
- NO configuration achieves real-time performance
- Best-case RTF: 2.125 (>2x slower than real-time)
- Worst-case RTF: 5.884 (~6x slower than real-time)
- NNAPI backend (NPU): Inconsistent results; sometimes helped slightly, sometimes significantly worse than CPU
- Cannot assume NPU automatically improves performance; NPU-specific optimizations may be required
Critical Gap:
- Significant gap between theoretical NPU capacity and actual measured performance (RTF)
- Model appearing suitable on paper (~2-5 GFLOP/frame) unable to run real-time on top-tier mobile phone
- Real-world testing essential
Editor's note: NNAPI may fallback to CPU for float models; impact needs verification.
Categories:
- Loss of listening-only audio quality
- Audio bandwidth loss
- Impaired intelligibility
- Impaired speaker identifiability
- Prosodic impairments
- Hallucination (word and phone confusions)
- Sensitivity to non-speech input (background noise, music, noisy speech, interfering talker, reverberant speech)
Additional Considerations:
- Speech enhancement algorithms (noise suppression, gain normalization) may be part of ULBC
Traditional 3GPP Practice:
- AMR, AMR-WB, EVS: Listening-only evaluations using P.800 ACR and modified DCR
- ACR: Generally for clean speech
- DCR: For SWB clean speech, mixed-bandwidth, speech + background noise, music/mixed content
- Focus not on intelligibility, speaker identifiability, prosodic impairments
ULBC Challenges:
- May need to address additional aspects directly through dedicated tests
- Hallucination: Specific to ML-based systems
- ACR may not optimally quantify all impairments (hallucination, intelligibility, prosodic)
Alternative Test Methods:
- Automatic speech recognition (ASR) tests
- DCR tests (for prosodic differences)
- Diagnostic Rhyme Tests (DRT)
- Modified Rhyme Tests (MRT)
- MOS testing for speaker similarity
- Speaker verification/identification tests
- Prosodic naturalness MOS tests
- Intonation recognition tests
- Transcription tests (word/semantic equivalence)
- Phoneme recognition tests
Noise Suppression Evaluation:
- P.835: Multi-dimensional rating (speech quality and noise suppression capability separately)
- Typically used for systems with noise suppression
DCR Considerations:
- Subjects may consider noise suppression as degradation when comparing to uncoded noisy reference