[FS_ULBC] Analyzing semantic intelligibility in lossy coded audio signals
This contribution presents experimental results on the semantic intelligibility of audio codecs under Ultra Low Bitrate Codec (ULBC) constraints for GEO satellite communications. The primary objective is to quantify semantic preservation (a listener's ability to accurately understand spoken content) using Automatic Speech Recognition (ASR) Word Error Rate (WER) as a proxy metric, rather than traditional perceptual quality metrics such as Mean Opinion Score (MOS).
The study evaluates:
- Descript Audio Codec (DAC) - AI-based codec
- Enhanced Voice Services (EVS) codec - 3GPP standard reference
The analysis specifically investigates whether higher audio bandwidths (wideband vs. narrowband) improve or reduce intelligibility at very low bitrates, providing data-driven guidance for audio bandwidth design constraints and quality floor determination.
The ULBC study item targets voice over GEO satellite communications where balancing audio quality, robustness, and bit-efficiency is critical. At extremely low bitrates (below 3 kbps, down to about 1 kbps), a fundamental trade-off emerges:
- Wideband audio (16 kHz sampling rate) offers greater naturalness and perceptual quality
- Bit allocation challenge: Allocating scarce bits to higher frequencies reduces the budget for core speech spectrum, potentially introducing artifacts that outweigh bandwidth benefits
For emergency rescue operations, semantic intelligibility is the highest priority. Key considerations include:
- Wideband generally improves comfort and speaker identification, but its impact on speech understanding in "last resort" scenarios requires verification
- System interoperability with legacy endpoints (PSTN, GSM fallback) remains important in remote areas
- Need to balance modern expectations with legacy requirements and emergency scenarios
EVS serves as a quality anchor and concrete standardized baseline for semantic preservation, enabling:
- Practical quality floor definition for ULBC
- Comparison against established carrier-grade standards
- Isolation of bandwidth choice impact independent of codec architecture
DAC model: Evaluated at three sampling rates
- 16 kHz
- 24 kHz
- 44 kHz
EVS codec: Evaluated in standard modes
- Narrowband (NB)
- Wideband (WB)
Baseline: Uncompressed PCM audio (resampled from 48 kHz to NB and WB)
WER (Word Error Rate): Lower percentage indicates better performance. Log-scale visualization employed to distinguish performance differences in the 3-5% WER range.
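As a reminder of what the metric measures, WER is conventionally the word-level edit distance (substitutions, deletions, insertions) between the ASR transcript and the reference, divided by the number of reference words. The sketch below is a minimal illustration under that standard definition. The function name `word_error_rate` and the lowercase, whitespace-based tokenization are assumptions; the contribution does not state which ASR model or text normalization was used.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER in percent using word-level Levenshtein distance."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference prefix seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(prev[j] + 1,          # deletion (reference word missing)
                          curr[j - 1] + 1,      # insertion (extra hypothesis word)
                          prev[j - 1] + cost)   # substitution or exact match
        prev = curr
    return 100.0 * prev[-1] / len(ref)


# Illustrative sentences only; not taken from the test corpus.
print(word_error_rate("send help to the north ridge",
                      "send help to north bridge"))
```

In the illustrative pair, one word is dropped and one is substituted, so two errors are counted against six reference words.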
Key Findings:
- DAC achieves high efficiency at low bitrates (~2 kbps)
- WER drops rapidly as bitrate increases, stabilizing around 3-4%
- At 1.5 kbps: WER approximately 5.5%
- Significant improvement observed in 1.5-3.0 kbps range
Bandwidth Impact at Low Bitrates:
- At low bitrates (1.5 kbps and 3 kbps), the 16 kHz model outperforms the 24 kHz model
- With model size held constant, the 16 kHz model allocates more bits per spectral unit within its narrower band
- This yields better semantic preservation than the 24 kHz model, which suffers from bit starvation
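A rough way to see the bit-starvation argument is to divide the bitrate by the codable audio bandwidth, taken here as the Nyquist limit (half the sampling rate). This is only a back-of-the-envelope sketch: neural codecs do not allocate bits uniformly across frequency, and `bits_per_khz` is a hypothetical helper, not a quantity reported in the contribution.

```python
def bits_per_khz(bitrate_bps: float, sample_rate_hz: float) -> float:
    """Average bits per second available per kHz of audio bandwidth."""
    nyquist_khz = sample_rate_hz / 2 / 1000  # upper bound on codable bandwidth
    return bitrate_bps / nyquist_khz


for bitrate in (1500, 3000):
    for sample_rate in (16000, 24000):
        density = bits_per_khz(bitrate, sample_rate)
        print(f"{bitrate} bps @ {sample_rate} Hz: {density:.1f} bps per kHz")
```

Under this simplification, the 16 kHz model has 1.5 times the bit density of the 24 kHz model at any given bitrate.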
A dedicated 8 kHz sampling rate model was trained to investigate bandwidth impact at the lower bitrate bound.
Model Configuration:
- Sample rate: 8000 Hz
- Encoder rates: [2, 4, 4, 8], dimension: 64
- Decoder rates: [8, 4, 4, 2], dimension: 1536
- Quantization: 6 codebooks, size 1024, dimension 36
- Training: 200,000 steps on VCTK corpus
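The bitrates reported for the 8 kHz model are consistent with this configuration under the usual residual vector quantization arithmetic: the encoder strides multiply to the hop size, each codebook of size 1024 contributes log2(1024) = 10 bits per frame, and the number of active codebooks sets the operating point. The sketch below works through that arithmetic. The function name is hypothetical, and the assumption that lower rates are obtained by using fewer codebooks is not stated in the contribution.

```python
import math


def rvq_bitrate_bps(sample_rate_hz, encoder_rates, codebook_size, n_codebooks):
    """Bitrate of a residual-VQ codec: frames per second x bits per frame."""
    hop = math.prod(encoder_rates)                # samples per latent frame
    frame_rate = sample_rate_hz / hop             # latent frames per second
    bits_per_codebook = math.log2(codebook_size)  # bits per codebook index
    return frame_rate * bits_per_codebook * n_codebooks


# Configuration values from the list above; codebook count is swept 1..6.
for n in range(1, 7):
    bps = rvq_bitrate_bps(8000, [2, 4, 4, 8], 1024, n)
    print(f"{n} codebook(s): {bps:.1f} bps")
```

With these assumptions, one codebook gives 312.5 bps and six give 1875 bps, which lines up with the 312-1875 bps range listed for Table 1.b, and three codebooks give 937.5 bps, close to the reported 938 bps operating point.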
Critical Findings at Sub-1.5 kbps:
- At ~1 kbps:
  - 8 kHz model (938 bps): WER 5.86%
  - 16 kHz model (1000 bps): WER 11.23%
- Semantic penalty > 5 percentage points when forcing WB at 1 kbps
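The size of the penalty can be checked directly from the two WER values reported at ~1 kbps. The snippet below computes both the absolute gap in percentage points and the ratio between the two error rates; the variable names are illustrative.

```python
wer_nb_8khz = 5.86    # % WER, 8 kHz model at 938 bps
wer_wb_16khz = 11.23  # % WER, 16 kHz model at 1000 bps

absolute_gap = wer_wb_16khz - wer_nb_8khz  # in percentage points
ratio = wer_wb_16khz / wer_nb_8khz         # WB error rate relative to NB

print(f"Absolute gap: {absolute_gap:.2f} percentage points")
print(f"WB WER is {ratio:.2f}x the NB WER")
```

The 16 kHz operating point uses slightly more bits (1000 bps vs. 938 bps) yet shows the higher WER, so the gap is not explained by a bitrate mismatch in favor of the 8 kHz model.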
Conclusion: At sub-2 kbps bitrates, available bit budget is insufficient to support wider bandwidth without degrading core spectral content required for intelligibility. Native Narrowband mode allows high-precision bit allocation to fundamental frequencies (0-4 kHz), preserving semantic content more effectively.
Competitive Advantage:
- DAC achieves comparable WER scores at significantly lower bitrates than EVS
- DAC 16 kHz performance curves converge towards high-quality PCM baselines faster than traditional codecs
ULBC Application: For GEO scenarios in [1-3] kbps range, semantic preservation is critical for defining quality floor.
Performance at Different Bitrates (EVS NB vs. WB):
- At 5.9-8.0 kbps: both modes provide sufficient basic audio quality
- At 9.6 kbps: results are nearly identical, indicating a saturation point in semantic quality
Conclusion: For semantic understanding, NB bandwidth limitation is less critical than codec's bit allocation efficiency.
Methodology: Calculated WER Degradation = (WER_coded - WER_baseline) / (100 - WER_baseline) to isolate codec processing impact from ASR model variance.
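The normalization can be sketched as follows. The function mirrors the formula above with WER expressed in percent, so the result is roughly the share of the baseline's correctly recognized words that is lost after coding. The input values in the example are hypothetical placeholders, not results from the contribution.

```python
def wer_degradation(wer_coded: float, wer_baseline: float) -> float:
    """(WER_coded - WER_baseline) / (100 - WER_baseline), with WER in percent."""
    if not 0.0 <= wer_baseline < 100.0:
        raise ValueError("baseline WER must be in [0, 100)")
    return (wer_coded - wer_baseline) / (100.0 - wer_baseline)


# Hypothetical values for illustration only (not taken from the tables).
example = wer_degradation(wer_coded=4.0, wer_baseline=3.0)
print(f"WER degradation: {example:.4f}")
```

A value of zero means the coded signal is recognized as well as the uncompressed baseline at the same bandwidth, which is what allows NB and WB results to be compared on codec loss alone.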
Key Findings:
- Semantic loss introduced by EVS in both NB and WB modes is minimal
- Degradation metric confirms that pure coding loss of NB and WB is statistically indistinguishable when subtracting baseline PCM variance
- Additional frequency content in wideband contributes negligible semantic information for machine understanding compared to core NB spectrum
Strategic Implication: Robust NB mode is sufficient for intelligibility requirements of critical last resort communications, without bit starvation risk associated with wider bandwidths at low bitrates.
Strategic Conclusions for ULBC Design:
Narrowband is the superior design choice at the lowest bitrates, allowing the encoder to focus its bits on the foundation of basic voice quality
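AI-based codec sampling rate optimization: in these tests, lower sampling rate models preserved semantics better at low bitrates (16 kHz outperformed 24 kHz at 1.5 and 3 kbps; 8 kHz outperformed 16 kHz at ~1 kbps)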
Proposal: Include the relevant content from Sections 3, 4, and 5 in TR 26.940, capturing:
- Methodology
- Experimental setup
- Analysis of results concerning audio bandwidth impact on semantic intelligibility
Complete experimental data provided in appendix tables covering:
- Table 1.a: DAC Model Results (16/24/44 kHz) across bitrates 500-7751 bps
- Table 1.b: DAC NB Model Results (8 kHz) across bitrates 312-1875 bps
- Table 2: EVS & PCM Baseline Results for NB/WB modes at 5900-13200 bps