3GPP TR 26.940 V0.5.1 (FS_ULBC)
This Technical Report documents the study on Ultra Low Bit rate Speech Codecs (ULBC) for 3GPP Release 20. The primary focus is on IMS voice services over Geostationary Orbit (GEO) satellite access, with additional consideration for multi-party voice communication and other access types.
Background:
- GEO satellites operate at 35,786 km altitude, resulting in ~285ms one-way propagation delay
- TR 22.887 and TS 22.261 assume total transmission data rates of [1-3] kbit/s
- Current 3GPP codecs (lowest: AMR at 4.75 kbit/s) cannot support these constraints
Scenario Descriptions:
Main Scenario (4.2.2.2): One UE connects via GEO satellite access
- UE1: Phone supporting IMS voice over GEO satellite
- UE2: Either "regular" phone (requiring transcoding in core network) or "upgraded" phone supporting ULBC over other access (enabling transcoder-free operation)
Sub-Scenario (4.2.2.3): Both UEs connect via GEO satellite access
- Less common but relevant for disaster/cyberattack scenarios
- Even with a transparent (bent-pipe) payload, voice packets must travel to the ground segment before reaching the other UE
- May enable transcoder-free operation
High-level Prerequisites:
- Very low bitrate support
- DTX support [TBC]
- Error concealment
- Real-time implementation on smartphones
- Good audio quality for reasonable QoE
Background:
- Addresses poor/unstable network conditions in WLAN access
- Network congestion during peak usage or in areas with limited infrastructure
- Codec selection critical for maintaining quality under bandwidth constraints
Scenario Description:
- One participant (UE1) on unstable network using ULBC, other (UE2) on stable network with conventional codec (requires transcoding)
- Both participants on unstable networks using ULBC simultaneously (no transcoding needed)
High-level Prerequisites:
- Ultra-low bitrate capability
- Real-time operation on consumer devices (smartphones, laptops)
- Audio quality matching or exceeding existing voice services
Motivation:
- ULBC may provide enhanced robustness against poor network conditions
- Lower bit rates may benefit coverage/capacity
- Reduces transcoding needs when calls bridge GEO and other access types
Scenario Description:
- Both UEs support ULBC but connect via 3GPP access other than GEO (LTE, NR, WLAN)
Delay Components:
UE Delay (Table 5.1.2-2):
- Depends on voice bundling period (80ms, 160ms, 320ms) and codec frame size (20-320ms)
- Performance objective range: 268-1435ms (excluding solution-specific delay)
- Maximum requirement range: 355-1435ms (excluding solution-specific delay)
- Components: 2x voice bundling period + 2x vendor-specific encoder/decoder processing + vendor delay budget + JBM (jitter buffer management)
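As a sanity check on the Table 5.1.2-2 breakdown, the components can be summed directly. A minimal sketch; the per-component values below are illustrative placeholders, not figures from the TR:

```python
# Illustrative sum of the UE delay components from Table 5.1.2-2.
# All per-component values used in the example are placeholders.

def ue_delay_ms(bundling_ms, codec_proc_ms, vendor_budget_ms, jbm_ms):
    """2x voice bundling + 2x encoder/decoder processing + vendor budget + JBM."""
    return 2 * bundling_ms + 2 * codec_proc_ms + vendor_budget_ms + jbm_ms

# e.g. 80 ms bundling, hypothetical 10 ms per codec pass,
# 20 ms vendor delay budget, 60 ms jitter buffer depth:
print(ue_delay_ms(80, 10, 20, 60))  # 260
```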
Core Network Delay:
- Ground station to core network: [5-20]ms minimum, [200]ms maximum
- eNodeB to core network: 5-20ms
- Transcoding: 7ms (AMR/AMR-WB) to 14ms (EVS)
GEO Transmission Delay:
- Minimum: 248ms
- Maximum: 280ms (per TS 22.261 KPI requirement)
- Variation of 32ms depending on UE location within beam
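The physical propagation floor behind these figures can be reproduced from geometry alone; the TR's 248-280ms range evidently includes margin beyond pure free-space propagation. The Earth-radius and elevation-angle assumptions below are mine:

```python
import math

C_KM_S = 299_792.458    # speed of light, km/s
R_EARTH_KM = 6_371.0    # mean Earth radius (assumption)
H_GEO_KM = 35_786.0     # GEO altitude

def bent_pipe_one_way_ms(slant_range_km):
    """One-way UE -> satellite -> ground delay for a transparent payload."""
    return 2 * slant_range_km / C_KM_S * 1000

# Shortest path: UE at the sub-satellite point.
d_min = bent_pipe_one_way_ms(H_GEO_KM)                                # ~238.8 ms
# Longest path: UE at 0 deg elevation angle (maximum slant range).
slant_max = math.sqrt((R_EARTH_KM + H_GEO_KM) ** 2 - R_EARTH_KM ** 2)
d_max = bent_pipe_one_way_ms(slant_max)                               # ~278 ms
print(round(d_min, 1), round(d_max, 1))
```

The ~278ms geometric maximum lines up well with the 280ms KPI ceiling, and the min/max spread matches the ~32ms location-dependent variation noted above.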
Mouth-to-Ear Delay Estimates (Table 5.1.3-1):
For Main Scenario (GEO-TN):
- 80ms bundling, 20ms frame: 548ms (lower) to 872ms (upper) + solution-specific delay X
- 320ms bundling, 320ms frame: 1952ms (lower) to 2395ms (upper) + solution-specific delay X
For Sub-Scenario (GEO-GEO):
- 80ms bundling, 20ms frame: 804ms (lower) to 1315ms (upper) + solution-specific delay X
- 320ms bundling, 320ms frame: 1952ms (lower) to 2395ms (upper) + solution-specific delay X
System Architecture:
- Service link: UE to NTN payload
- Feeder link: NTN payload to NTN Gateway
RAN Parameters:
- Channel coding: Turbo code (NPUSCH Format 1 uplink), tail-biting convolutional code (TBCC, NPDSCH downlink)
- MCS: pi/2 BPSK, pi/4 QPSK, QPSK, 16QAM
- Subcarrier spacings: 3.75kHz and 15kHz for NPUSCH Format 1
- Resource unit (RU) duration varies with subcarrier spacing and number of tones
QoS Characteristics:
- Managed through QCI (QoS Class Identifier)
- Same PELR (Packet Error Loss Rate) required for UL and DL
- Suggests balanced UL/DL time-domain transmission resources
Key parameters identified:
- Bit rates: [TBD]
- Sample rate and audio bandwidth: [TBD]
- Frame length: [TBD]
- Complexity and memory demands: [TBD]
- Algorithmic delay: Frame size buffering + inherent codec delays (look-ahead, sample-rate conversion, post-processing)
- Packet loss concealment (PLC): [TBD]
- Noise suppression: [TBD]
- Discontinuous transmission (DTX): Including VAD and comfort noise [TBD]
- Robustness to non-speech input: [TBD]
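The algorithmic-delay item above can be expressed as a simple sum. A sketch with hypothetical component values (only the AMR-like 20ms frame + 5ms look-ahead pairing is taken from the codec tables later in this report):

```python
def algorithmic_delay_ms(frame_ms, lookahead_ms=0.0,
                         resampling_ms=0.0, postproc_ms=0.0):
    """Frame buffering plus the inherent codec delays listed above."""
    return frame_ms + lookahead_ms + resampling_ms + postproc_ms

# e.g. a 20 ms frame with 5 ms look-ahead (AMR-like figures) gives 25 ms total
print(algorithmic_delay_ms(20, lookahead_ms=5))  # 25.0
```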
Current Evaluation Analysis:
- Codec must support real-time thread and concurrent processing
- ML codecs with [5-10M] parameters considered for efficient operation within latency bounds
- Must operate within compute constraints of devices for real-time voice communication
Memory and Power Considerations:
- Larger models require more DRAM access → higher power consumption
- Memory footprint critical for device performance and usability
Complexity Metrics for AI-Based Codecs:
TOPS (Tera Operations Per Second):
- TOPS = (2 × MAC unit count × clock frequency) / 10^12
- Smartphone NPUs: 8-59 TOPS reported (varying precision: INT8, INT16, FP16)
- TOPS/W (power efficiency): 2-15 TOPS/W for smartphone NPUs
- Note: TOPS/W typically benchmarked under full-load; lighter workloads like audio codecs may show different characteristics
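The TOPS formula above is a one-liner; the NPU configuration in this example (MAC count, clock) is hypothetical, not taken from any reported device:

```python
def peak_tops(mac_units, clock_hz):
    """Peak throughput: 2 ops (multiply + accumulate) per MAC unit per cycle."""
    return 2 * mac_units * clock_hz / 1e12

# Hypothetical NPU: 8192 MAC units clocked at 1.5 GHz
print(peak_tops(8192, 1.5e9))  # 24.576
```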
Alternative Metrics:
- MACs (Multiply-Accumulate operations): Practical for complexity assessment
- RTF (Real-Time Factor): Ratio of encoding/decoding time to frame length (RTF < 1 indicates real-time capability); reliable but resource-intensive to measure
- Model Size: Number of parameters and precision; directly impacts memory and power
- Tools available: ptflops, torchinfo, fvcore for MAC counting
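A sketch of the RTF convention as used in clause 7.6 of this report (processing time over audio duration, so RTF < 1 is real-time); the measurement value here is hypothetical:

```python
def rtf(processing_ms, frame_ms):
    """RTF = processing time / audio duration; RTF < 1 means real-time capable."""
    return processing_ms / frame_ms

# Hypothetical measurement: 15 ms to decode one 20 ms frame
r = rtf(15.0, 20.0)
print(r, r < 1.0)  # 0.75 True
```

Note that an "N times faster than real time" claim corresponds to the reciprocal, 1/RTF.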
Observations:
- NPUs/TPUs significantly more power-efficient than CPUs for AI tasks (5-20x)
- Actual NPU performance depends on computational graph structure
- Irregular/sequential/unsupported operations may require CPU fallback
- ULBC complexity constraints should be based on desired power consumption/computational performance, not relative to existing 3GPP codecs
- Million MACs + model size provide first indication of complexity
- RTF useful but requires standardized test benches
- WMOPS not directly suitable for NPU-capable devices but mapping to TOPS/RTF beneficial
Complexity Target Estimation:
- Target devices: Modern smartphones with NPU components
- Example: DAC codec estimated at ~150 Giga MAC/sec (~0.3 TOPS)
- Actual power consumption on smartphone NPUs: TBD
- Model size and architecture significantly impact DRAM operations and overall power consumption
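The DAC estimate above can be put against reported NPU capacity with a unit conversion (2 ops per MAC); the 8 TOPS capacity figure is simply the low end of the smartphone NPU range quoted earlier, and says nothing about achievable utilization:

```python
def gmacs_to_tops(giga_macs_per_second):
    """Convert a sustained MAC rate to TOPS (2 ops per MAC)."""
    return 2 * giga_macs_per_second / 1000

demand = gmacs_to_tops(150)   # DAC estimate above: ~0.3 TOPS
capacity = 8.0                # low end of reported smartphone NPU peak TOPS
print(demand, demand / capacity)  # 0.3 0.0375
```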
Editor's note: Algorithmic delay verification method for AI-based codecs required.
Codec Parameters and Configuration:
- Static parameters: Rarely changed, exchanged via SDP or predefined
- Dynamic parameters: May change frequently, included in each packet/frame
- Common static/dynamic parameters to be identified
Categories:
1. 3GPP IMS codecs: Reference conditions (AMR, AMR-WB, EVS)
2. Conventional Ultra Low Bitrate Codecs: DSP-based (MELP/MELPe, AMBE-LR, MPEG-HVXC, TWELP MR, Codec2)
3. AI-based postprocessor: Enhancement of conventional codec output
4. AI-based encoder/decoder:
- Causal systems: Real-time capable (LPCNet, LyraV2, EnCodec, Mimi-Codec, TS3, TAAE, LMCodec2)
- Non-causal systems: Non-real-time due to large look-ahead (DAC, DAC-IBM, SNAC, SpeechTokenizer, SemantiCodec, FunCodec, WavTokenizer, BigCodec, FocalCodec)
Key Codec Properties (the delay figures below appear to be in addition to the frame duration, consistent with the totals in the "Algorithmic Codec Delay" summary):
3GPP IMS Codecs:
- AMR: NB, 5ms delay, 20ms frame, 4.75 kbps
- AMR-WB: WB, 5.9375ms delay, 20ms frame, 6.6 kbps
- EVS: NB/WB/SWB, 12ms delay, 20ms frame, 7.2-9.6 kbps
Conventional Ultra Low Bitrate:
- MELP/MELPe: NB, 20-36ms delay, 22.5-90ms frame, 0.6-2.4 kbps
- Codec2: NB, 40ms delay, 20-40ms frame, 0.45-2.4 kbps
AI-based (Causal):
- LPCNet: WB, 25ms delay, 40ms frame, 1.6 kbps
- LyraV2: WB, [TBD] delay, 20ms frame, 3.2/6/9.2 kbps
- Mimi-Codec: 24kHz, 0ms delay, 80ms frame, 0.55/1.1 kbps
AI-based (Non-causal):
- DAC: WB/24kHz, 244-366ms delay, 13.3-20ms frame, 0.5-3+ kbps
- DAC-IBM: 24kHz, 366ms delay, 13.3ms frame, 0.75/1.5/3 kbps
- SNAC: 24kHz, 1000ms delay, 80ms frame, 0.98 kbps
Audio Bandwidth:
- Conventional codecs: NB only
- Modern AI codecs: WB or higher
Algorithmic Codec Delay:
- IMS codecs: 25-32ms
- Conventional ultra-low: 60-126ms
- Causal AI: 20-80ms
- Non-causal AI: 500ms+ or full signal
Frame Duration:
- Conventional ultra-low: Increased vs. standard 20ms VoIP
- Some AI codecs maintain 20ms, others increase (e.g., Mimi 80ms)
Bitrate:
- All listed codecs (except the IMS codecs and LyraV2) offer at least one mode below 3 kbps
Complexity:
- AI codecs generally higher than IMS/conventional codecs
- Exception: LyraV2 requires only 35% of an ARM Cortex-A53 core (Raspberry Pi 3+)
- RAM: AI codecs significantly higher (e.g., LyraV2: 54MB vs. EVS: 294KB)
- ROM: AI codecs much higher (e.g., TAAE: 950M parameters ≈ 900MB @ 8-bit; SNAC: 19M ≈ 18MB @ 8-bit; EVS: ~2MB)
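The ROM figures above follow directly from parameter count times weight precision; note the "≈900MB" for TAAE matches 950M 8-bit weights when expressed in binary MiB:

```python
def weight_storage_mib(num_params, bits_per_weight=8):
    """Approximate model weight storage in MiB (binary megabytes)."""
    return num_params * bits_per_weight / 8 / 2**20

print(round(weight_storage_mib(950e6)))    # 906  (TAAE, "~900MB @ 8-bit")
print(round(weight_storage_mib(19e6), 1))  # 18.1 (SNAC, "~18MB @ 8-bit")
```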
P.808 ACR Test Results (Figure 7.1.4-1):
Test setup:
- English clean speech (4 talkers × 6 samples)
- 32kHz, SWB, normalized to -26 dBoV
- 24 subjects
Key Findings:
- Codec2 (all rates) significantly worse than AMR 4.75 kbps
- SemantiCodec, LyraV2, LPCNet, Mimi 0.55 kbps: comparable to AMR-WB 6.65 kbps
- Three conditions on par or slightly better than EVS 9.6 kbps:
- Mimi-Codec 1.1 kbps (causal)
- DAC-ibm 1.5 kbps (non-causal)
- SNAC 0.98 kbps (non-causal)
- AI-based solutions show 2+ MOS improvement over conventional ultra-low bitrate codecs
Test Configuration (Table 7.1.5.1-1):
- Bitrates: 1, 2.5, 4.5, 6 kbps
- Loss percentages: 1%, 6%, 10%, 20%
- Frame size: 80ms
- Based on NB-IoT NTN data at ~3dB CNR (SCS=15kHz) and 9dB (SCS=3.75kHz)
Loss Simulation Methods:
1. Consecutive drop and repeat: four consecutive 20ms blocks dropped and the preceding ones repeated, simulating an 80ms packet loss
2. Interleaved drop and repeat: loss spread over two packets (at the cost of added latency)
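A frame-level sketch of the first method, with packets represented as abstract labels rather than coded audio; how the actual test handled a loss before any packet was received is an assumption here (filled with silence):

```python
# "Drop and repeat": each lost 80 ms packet is replaced by repeating the
# last successfully received packet.

def drop_and_repeat(packets, loss_mask, fill="silence"):
    out, last = [], fill
    for pkt, lost in zip(packets, loss_mask):
        if lost:
            out.append(last)      # conceal by repetition
        else:
            out.append(pkt)
            last = pkt
    return out

print(drop_and_repeat(["p0", "p1", "p2", "p3"], [False, True, False, True]))
# ['p0', 'p0', 'p2', 'p2']
```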
MUSHRA Test Results (8 listeners):
- Despite higher loss percentage, 4.5 kbps and 6 kbps significantly better than 1 kbps and 2.5 kbps
- 6 kbps @ 20% loss rated close to 4.5 kbps @ 10% loss
- Interleaving benefit increases with error rate
- Potential for improvement if model trained with random loss patterns
Comparison:
- DAC (default): 16kHz, general audio training, scalable bitrate
- DAC-IBM: 24kHz, speech-specific training, fixed 1.5 kbps
MUSHRA Test Results (8 listeners, resampled to 16kHz):
- DAC-IBM 1.5 kbps @ 3% PLR significantly outperforms all other DAC conditions
- DAC 4.5 kbps @ 10% PLR and 6 kbps @ 20% PLR show no significant improvement over DAC-IBM 1.5 kbps @ 3% PLR
- Specific training for target bitrate crucial for optimal performance
- Error resilience improvable through appropriate training/design choices
Conclusions:
- More design freedom needed in bitrate and BLER selection for optimal quality at given SNR
- Optimal coding performance (even under errors) achieved with appropriate training strategy
- Bitrate scalability (e.g., DAC) comes with significant performance cost, especially at lower bitrates
- Dedicated training (e.g., DAC-IBM) much more efficient
Test Setup (Nokia):
- Clean Finnish speech (3 males, 3 females, 4 sample pairs each)
- Diotic presentation via Sennheiser HD650 headphones
- Experienced listeners
- Extended ACR5 scale (0.5-5.5) and DCR methodologies
- Bandwidths tested: NB (4kHz), MB (6kHz), WB (8kHz), 10kHz, SSWB (12kHz), SWB (16kHz), FB (20kHz)
Codecs Tested:
- DSP: Codec2 (0.7, 1.3, 2.4, 3.2 kbps), MELP (2.4 kbps), MPEG4 HVXC (2.0, 4.0 kbps)
- 3GPP: AMR, AMR-WB, EVS at various rates
- ML: DAC 44k (0.9, 1.7, 2.6, 3.4, 6.9 kbps), TSAC 44k (0.6, 1.2, 2.5, 3.2, 5.9 kbps)
Extended ACR5 Results (Figures 7.2.3-1, 7.2.3-2):
- Increased bandwidth improves quality up to ~12kHz (saturation region)
- 4kHz bandwidth significantly limits perceived quality
- MELP 2.4k and MPEG4 HVXC perform better than Codec2
- 3GPP codecs perform as expected at lowest bitrates
- TSAC and DAC show very good performance in clean speech
- TSAC ≥1.2 kbps and DAC ≥1.7 kbps suitable as ML-based references
- Both poor quality <1 kbps
DCR Results (Figure 7.2.4-1):
- Results align with ACR test
- Exception: MELP preferred over HVXC 2.0 in DCR (full 4kHz bandwidth vs. ~3.7kHz)
- Listeners more likely to notice degradations with reference available
Test Setup:
- French, 30 listeners (5 panels × 6)
- 8sec double sentences, 3 male + 3 female
- 20-20,000Hz bandpass, normalized to -26 LKFS
Codecs:
- Conventional: Opus (12, 16, 24 kbps), EVS-WB (7.2, 8 kbps), EVS-SWB (9.6, 13.2, 24.4 kbps)
- AI: LPCNet (1.6), Lyra V2 (3.2, 6, 9.2), EnCodec (1.5, 3, 6, 12, 24), AudioCraft (1.5, 3, 6), AudioDec, DAC (1.7, 2.6, 5.2, 7.8)
Key Findings:
- DAC: best DMOS among ~1.5 kbps codecs; approaches "Direct" quality at bitrates below 8 kbps
- EnCodec doesn't achieve "Direct" quality even @ 24 kbps; below EVS/Opus at this rate
- Lyra V2 (6, 9.2 kbps) on par with EVS-WB (7.2, 8 kbps)
Same setup as the DCR test, but with ACR methodology to enable better comparison against objective metrics. Observations match those of the DCR test.
Test Setup:
- 30 listeners (5 panels × 6)
- 6 categories: instrumental/vocal classical, instrumental/vocal modern, captured mixed, artificial mixed (speech + music background)
- 20-20,000Hz bandpass, -26 LKFS
Codecs:
- Conventional: xHE-AAC (8, 12, 16, 24), Opus audio (16, 24), Opus voip (12, 16, 24), EVS-SWB (9.6, 13.2, 24.4)
- AI: EnCodec (12, 24), DAC (4.3, 6, 7.8), HILCodec (4.5, 6, 9), SNAC (2.6), FlowDec (4.5, 6, 7.5)
- Note: Many neural codecs pretested but excluded due to low quality (LPCNet, Lyra V2, AudioDec, FreqCodec, HifiCodec, Spectral Codecs, Vocos, DisCodec, Mimi, AudioCraft)
Key Findings:
- Best quality: EVS and xHE-AAC @ ~24 kbps
- Neural codec advantage visible at low bitrates
- No tested neural codec achieves quality close to "Direct"
- FlowDec 7.5 kbps: 4.08 DMOS (best neural codec)
- No tested AI codec provided reasonable quality for music/mixed content <2.6 kbps
Classical Speech Coding:
- Studies on MELPe and AMR show noise reduction preprocessing improves parameter extraction and decoded speech quality
- Especially beneficial in noisy conditions and low SNRs
- Improves intelligibility and perceptual quality
- Integrated in 3GPP2 EVRC and VMR-WB standards
Neural Speech Coding:
- Known to be sensitive to noisy environments
- Robustness influenced by training data diversity, low bitrates, capacity/complexity, quantization
- Data-driven approaches make failure modes difficult to anticipate
- Noise suppression can minimize issues and allow codec to focus on useful signal
Two Listening Tests (ITU-T P.808 ACR):
Test 1 - High SNR:
- Assumptions from 3GPP EVS characterization
- SNRs: +15 to +20 dB (WB)
- Noises: car, street, office (from ITU-T P.501 Annex B)
- 24 pairs of sentences (8 pairs × 3 noises)
- 20 listeners
Test 2 - Low SNR:
- More adversarial environments
- SNRs: -5 to +15 dB
- Noises: street, construction, metro, car, office, restaurant
- 24 pairs of sentences (4 pairs × 6 noises)
- 21 listeners
Noise Suppression:
- DeepFilterNet2: State-of-the-art DNN-based, operates at 48kHz
- Applied as preprocessor before coding
Mixing Procedure:
- Loudness normalization using BS1770demo (ITU-T STL)
- RMS long-term option for background noise level
Classical Codecs:
- MELPe, AMR, AMR-WB, EVS
Neural Codecs:
- SNAC, MIMI, DAC_IBM (speech-trained, <2 kbps)
- LyraV2 3.2 kbps (likely trained on diverse data including noisy speech)
- DAC (original, 24kHz, 1.5/3/6 kbps) - Test 1 only
All tested with and without noise suppression ("_nr" suffix).
Key Observations:
- Listeners prefer uncoded denoised speech over uncoded noisy speech
- Denoised speech as good as clean speech at high SNRs (minimal artifacts)
- Noise suppression beneficial for all codecs except MELPe (already has noise reduction; benefit minimized at high SNRs)
- Classical codecs: Benefit increases with bitrate/quality
- Neural codecs: Greater benefit, >0.5 MOS improvement for several (SNAC, DAC_ibm, DAC @ 3 kbps)
- DAC_ibm vs. DAC: Same architecture/complexity, very different behavior due to training data/target bitrate
- Plain DAC @ 24kHz not competitive at 1.5 kbps
- LyraV2: ~70x less complex than other neural codecs; @ 3.2 kbps performs worse except vs. DAC @ 3 kbps (on par)
Key Observations:
- Listeners strongly prefer uncoded denoised speech (~1 MOS difference)
- All classical codecs benefit from denoising (improvements just below 1 MOS)
- Neural codecs benefit even more (>1 MOS improvement possible)
- Neural codecs at vastly lower bitrates can compete with conventional codecs under adverse conditions when combined with noise suppression
- Generative-AI based codecs (e.g., DAC IBM) can improve absolute quality of input signal when coding denoised speech
Key Characteristics:
- Publicly reported: "38x faster than real-time" on high-end 2021 smartphone
- Entirely CPU execution (no NPU/TPU)
- Open-source under Apache 2.0 license (permissive for commercial/standardization)
Code-Level Analysis:
- Core components (LyraGanModel, SoundStreamEncoder) explicitly use CPU backend (XNNPACK via TensorFlow Lite)
- Flag use_xnn=true directs to CPU execution
- No hardware accelerator delegates (NNAPI, Hexagon, CoreML, TPU)
- Single-threaded execution (threads explicitly set to 1)
- Benchmark: Mean 0.525ms processing time for 20ms frame = ~38x real-time
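The "~38x real-time" figure follows directly from the benchmark numbers quoted above:

```python
frame_ms = 20.0          # audio duration per frame
mean_proc_ms = 0.525     # reported mean processing time per frame
speedup = frame_ms / mean_proc_ms   # "x times faster than real time"
rtf = mean_proc_ms / frame_ms       # conventional real-time factor
print(round(speedup, 1), round(rtf, 3))  # 38.1 0.026
```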
Conclusion:
- Proves state-of-the-art low-bitrate AI speech codec can achieve/exceed real-time requirements on high-end 2021 smartphone CPU
- Significant margin below the maximum allowable RTF
- CPU-only approach viable for ULBC
Methodology:
- ONNX Runtime library for execution
- Tested on CPU backend and NNAPI backend (Android NPU interface)
- Model: Unmodified pretrained DAC @ 44.1kHz, 8 kbps (from reference)
- No quantization applied (original float model)
- Metrics: Real-Time Factor (RTF) for end-to-end and individual components
Theoretical Complexity Analysis (Figure 7.6.2-1):
- Tools: ptflops v0.7.5, thop v2.0.17 (cross-verification)
- Complexity scales with frame size: 1.4 GFLOP (20ms) to 31.6 GFLOP (320ms)
- Model: 76.9M parameters, 293MB size
- Note: Different library versions produce different results due to ConvTranspose1d calculation methodology changes
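Converting the per-frame figures above to sustained throughput shows that the larger frame sizes actually demand more compute per second, not less; this derivation uses only the two endpoint values quoted from Figure 7.6.2-1:

```python
# Per-frame FLOP figures from Figure 7.6.2-1 converted to sustained FLOP/s.
per_frame_flop = {20: 1.4e9, 320: 31.6e9}  # frame length (ms) -> FLOP/frame
for frame_ms, flop in per_frame_flop.items():
    gflops = flop / (frame_ms / 1000) / 1e9
    print(f"{frame_ms} ms frame: {gflops:.2f} GFLOP/s sustained")
# 20 ms -> 70.00 GFLOP/s; 320 ms -> 98.75 GFLOP/s
```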
Real-World Inference Performance:
Test Platforms:
1. High-end desktop: AMD Ryzen 9 7950X (5.7GHz fixed)
2. High-end mobile: Qualcomm Snapdragon 8 Gen 2
Key Findings (Figures 7.6.4-1, 7.6.4-2, 7.6.4-3):
Desktop CPU:
- Single-threaded: NOT real-time (RTF 1.6-1.9)
- Multi-threaded (4 threads): Real-time capable (RTF 0.67-0.86)
- Still very slow for high-end desktop CPU
Mobile SoC:
- NO configuration achieves real-time performance
- Best-case RTF: 2.125 (>2x slower than real-time)
- Worst-case RTF: 5.884 (~6x slower than real-time)
- NNAPI backend (NPU): Inconsistent results; sometimes helped slightly, sometimes significantly worse than CPU
- Cannot assume NPU automatically improves performance; NPU-specific optimizations may be required
Critical Gap:
- Significant gap between theoretical NPU capacity and actual measured performance (RTF)
- Model appearing suitable on paper (~2-5 GFLOP/frame) unable to run real-time on top-tier mobile phone
- Real-world testing essential
Editor's note: NNAPI may fallback to CPU for float models; impact needs verification.
Categories:
- Loss of listening-only audio quality
- Audio bandwidth loss
- Impaired intelligibility
- Impaired speaker identifiability
- Prosodic impairments
- Hallucination (word and phone confusions)
- Sensitivity to non-speech input (background noise, music, noisy speech, interfering talker, reverberant speech)
Additional Considerations:
- Speech enhancement algorithms (noise suppression, gain normalization) may be part of ULBC
Traditional 3GPP Practice:
- AMR, AMR-WB, EVS: Listening-only evaluations using P.800 ACR and modified DCR
- ACR: Generally for clean speech
- DCR: For SWB clean speech, mixed-bandwidth, speech + background noise, music/mixed content
- Focus not on intelligibility, speaker identifiability, prosodic impairments
ULBC Challenges:
- May need to address additional aspects directly through dedicated tests
- Hallucination: Specific to ML-based systems
- ACR may not optimally quantify all impairments (hallucination, intelligibility, prosodic)
Alternative Test Methods:
- Automatic speech recognition (ASR) tests
- DCR tests (for prosodic differences)
- Diagnostic Rhyme Tests (DRT)
- Modified Rhyme Tests (MRT)
- MOS testing for speaker similarity
- Speaker verification/identification tests
- Prosodic naturalness MOS tests
- Intonation recognition tests
- Transcription tests (word/semantic equivalence)
- Phoneme recognition tests
Noise Suppression Evaluation:
- P.835: Multi-dimensional rating (speech quality and noise suppression capability separately)
- Typically used for systems with noise suppression
DCR Considerations:
- Subjects may consider noise suppression as degradation when comparing to uncoded noisy reference