S4-260152 - AI Summary

[FS_ULBC] Analyzing semantic intelligibility in lossy coded audio signals

Comprehensive Summary: Analyzing Semantic Intelligibility in Lossy Coded Audio Signals

1. Introduction and Objectives

This contribution presents experimental results on the semantic intelligibility of audio codecs under ultra-low-bitrate speech coding (ULBC) constraints for GEO satellite communications. The primary objective is to quantify semantic preservation, i.e. the listener's ability to accurately understand spoken content, using Automatic Speech Recognition (ASR) Word Error Rate (WER) as a proxy metric rather than traditional perceptual quality metrics such as MOS.

The study evaluates:
- Descript Audio Codec (DAC) - AI-based codec
- Enhanced Voice Services (EVS) codec - 3GPP standard reference

The analysis specifically investigates whether higher audio bandwidths (wideband vs. narrowband) improve or reduce intelligibility at very low bitrates, providing data-driven guidance for audio bandwidth design constraints and quality floor determination.

2. Background and Motivation

2.1 ULBC Context

The ULBC study item targets voice over GEO satellite communications, where balancing audio quality, robustness, and bit-efficiency is critical. At extremely low bitrates (below 3 kbps, down to about 1 kbps), a fundamental trade-off emerges:
- Wideband audio (16 kHz sampling) offers naturalness and perceptual quality
- Bit allocation challenge: Allocating scarce bits to higher frequencies reduces the budget for the core speech spectrum, potentially introducing artifacts that outweigh the bandwidth benefits

2.2 Critical Communication Requirements

For emergency rescue operations, semantic intelligibility is the highest priority. Key considerations include:
- Wideband generally improves comfort and speaker identification, but its impact on speech understanding in "last resort" scenarios requires verification
- System interoperability with legacy endpoints (PSTN, GSM fallback) remains important in remote areas
- Need to balance modern expectations with legacy requirements and emergency scenarios

2.3 EVS as Reference Anchor

EVS serves as a quality anchor and concrete standardized baseline for semantic preservation, enabling:
- Practical quality floor definition for ULBC
- Comparison against established carrier-grade standards
- Isolation of bandwidth choice impact independent of codec architecture

3. Methodology

3.1 Evaluation Pipeline

  • Dataset: LibriSpeech train-clean-100 subset (standard benchmark for high-quality read English speech)
  • Sample size: 500 audio files randomly sampled across three seeds (101, 102, 103)
  • Consistency: Same audio files used for all codec and bitrate configurations
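
The seeded sampling described above can be sketched as follows. This is a minimal illustration, not the contributors' script: the function name, the placeholder utterance IDs, and the reading that each seed draws its own 500-file subset are all assumptions.

```python
import random

def sample_utterances(utterance_ids, seed, k=500):
    """Draw k utterance IDs without replacement, reproducibly for a given seed."""
    rng = random.Random(seed)  # private RNG, so global random state is untouched
    # Sort first so the result does not depend on directory listing order.
    return rng.sample(sorted(utterance_ids), k)

# Hypothetical corpus index standing in for the LibriSpeech file list.
corpus = [f"utt-{i:05d}" for i in range(10_000)]

# Assumption: one 500-file subset per seed.
subsets = {seed: sample_utterances(corpus, seed) for seed in (101, 102, 103)}
```

Because the subset depends only on the seed and the sorted ID list, the same files can be regenerated for every codec and bitrate configuration.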

3.2 Processing Chain

  1. Process input audio through target codecs (DAC and EVS) at various bitrates
  2. Transcribe processed audio using OpenAI Whisper model (large-v3) - selected for state-of-the-art performance and noise robustness
  3. Compare transcripts against LibriSpeech ground truth
  4. Calculate WER using jiwer library

4. Experimental Setup

4.1 Codec Configurations

DAC model: Evaluated at three sampling rates
- 16 kHz
- 24 kHz
- 44 kHz

EVS codec: Evaluated in standard modes
- Narrowband (NB)
- Wideband (WB)

Baseline: Uncompressed PCM audio (resampled from 48 kHz to NB and WB)

4.2 Observations on Baseline Variance

  • NB PCM occasionally scored ~0.1% better than WB PCM
  • Attributed to inherent ASR model variance rather than signal quality differences
  • Explains why high-bitrate DAC models occasionally score slightly lower than WB baseline

4.3 Primary Metric

WER (Word Error Rate): Lower percentage indicates better performance. Log-scale visualization employed to distinguish performance differences in the 3-5% WER range.

5. Results and Analysis

5.1 DAC Performance vs Bitrate

Key Findings:
- DAC achieves high efficiency at low bitrates (~2 kbps)
- WER drops rapidly as bitrate increases, stabilizing around 3-4%
- At 1.5 kbps: WER approximately 5.5%
- Significant improvement observed in 1.5-3.0 kbps range

Bandwidth Impact at Low Bitrates:
- At low bitrates (1.5 kbps and 3 kbps), 16 kHz model outperforms 24 kHz model
- With constant model size, 16 kHz model allocates more bits per spectral unit within narrower band
- Results in better semantic preservation vs. 24 kHz model suffering from bit starvation
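
The bit-starvation argument can be made concrete with a rough calculation of bitrate per kHz of representable audio bandwidth. This is a deliberately crude proxy introduced here for illustration: a neural codec does not allocate bits per frequency band explicitly, and the helper name is hypothetical.

```python
def bits_per_khz(bitrate_bps: float, sample_rate_hz: float) -> float:
    """Bits per second available per kHz of audio bandwidth (Nyquist = fs / 2)."""
    bandwidth_khz = sample_rate_hz / 2 / 1000
    return bitrate_bps / bandwidth_khz

# Same 1.5 kbps budget spread over 8 kHz vs. 12 kHz of audio bandwidth.
for fs in (16000, 24000):
    print(fs, bits_per_khz(1500, fs))
```

At the same bitrate, the 24 kHz model has two thirds of the per-kHz budget of the 16 kHz model, which is consistent with the direction of the reported WER difference.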

5.2 DAC 8 kHz Narrowband Model Analysis

A dedicated 8 kHz sampling rate model was trained to investigate bandwidth impact at the lower bitrate bound.

Model Configuration:
- Sample rate: 8000 Hz
- Encoder rates: [2, 4, 4, 8], dimension: 64
- Decoder rates: [8, 4, 4, 2], dimension: 1536
- Quantization: 6 codebooks, size 1024, dimension 36
- Training: 200,000 steps on VCTK corpus
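
The configuration above appears to determine the bitrates reported for this model. A minimal sketch, assuming the bitrate is the residual vector quantizer token rate (frame rate × bits per code × codebooks in use) and that lower bitrates are obtained by using fewer codebooks:

```python
import math

def rvq_bitrate_bps(sample_rate_hz, encoder_rates, codebook_size, n_codebooks):
    """Token bitrate of a residual VQ codec in bits per second."""
    hop = math.prod(encoder_rates)          # total downsampling factor (samples per frame)
    frame_rate = sample_rate_hz / hop       # latent frames per second
    bits_per_code = math.log2(codebook_size)
    return frame_rate * bits_per_code * n_codebooks

# 8 kHz model: hop = 2*4*4*8 = 256, so 31.25 frames/s at 10 bits per code.
for n in range(1, 7):
    print(n, rvq_bitrate_bps(8000, [2, 4, 4, 8], 1024, n))
```

Under these assumptions, one to six codebooks give 312.5 to 1875 bps, with three and five codebooks giving 937.5 and 1562.5 bps. These agree, after rounding, with the 312-1875 bps range and the 938 bps and 1563 bps operating points quoted elsewhere in this summary.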

Critical Findings at Sub-1.5 kbps:
- At ~1 kbps:
  • 8 kHz model (938 bps): WER 5.86%
  • 16 kHz model (1000 bps): WER 11.23%
  • Semantic penalty > 5 percentage points when forcing WB at 1 kbps
- At 1.5 kbps:
  • 8 kHz model (1563 bps): WER 3.86%
  • 16 kHz model (1500 bps): WER 5.46%
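
The size of the wideband penalty can be checked directly from the reported figures. The helper below is illustrative only; note that the paired bitrates are close but not identical (938 vs. 1000 bps, 1563 vs. 1500 bps), so the comparison is approximate.

```python
# WER (%) values as reported above for the DAC 8 kHz and 16 kHz models.
DAC_WER = {
    "~1 kbps": {"8 kHz": 5.86, "16 kHz": 11.23},
    "1.5 kbps": {"8 kHz": 3.86, "16 kHz": 5.46},
}

def wideband_penalty(table):
    """Map operating point -> (absolute gap in points, 16 kHz / 8 kHz WER ratio)."""
    return {
        point: (
            round(wer["16 kHz"] - wer["8 kHz"], 2),
            round(wer["16 kHz"] / wer["8 kHz"], 2),
        )
        for point, wer in table.items()
    }

for point, (gap, ratio) in wideband_penalty(DAC_WER).items():
    print(f"{point}: +{gap} points, {ratio}x the 8 kHz WER")
```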

Conclusion: At sub-2 kbps bitrates, available bit budget is insufficient to support wider bandwidth without degrading core spectral content required for intelligibility. Native Narrowband mode allows high-precision bit allocation to fundamental frequencies (0-4 kHz), preserving semantic content more effectively.

5.3 DAC vs EVS Comparison

Competitive Advantage:
- DAC achieves comparable WER scores at significantly lower bitrates than EVS
- DAC 16 kHz performance curves converge towards high-quality PCM baselines faster than traditional codecs

ULBC Application: For GEO scenarios in the 1-3 kbps range, semantic preservation is critical for defining the quality floor.

5.4 EVS Narrowband vs Wideband Analysis

Performance at Different Bitrates:

- At 13.2 kbps (highest tested):
  • EVS-NB: 3.16%
  • EVS-WB: 3.14%
  • Nearly identical, indicating saturation point in semantic quality
- At 5.9-8.0 kbps range:
  • EVS-WB maintains marginal advantage (e.g., at 8.0 kbps: WB 3.15% vs. NB 3.41%)
  • Both modes provide sufficient basic audio quality
- At 9.6 kbps:
  • EVS-NB: 3.19%
  • EVS-WB: 3.23%
  • NB performance very close to WB, difference within ASR model error margin

Conclusion: For semantic understanding, NB bandwidth limitation is less critical than codec's bit allocation efficiency.
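
One way to read the EVS figures in this subsection is to compare each NB-minus-WB gap against the ~0.1 point NB/WB spread observed on uncompressed PCM. Using that spread as a noise floor is an assumption made here for illustration; it is not a statistical test from the contribution.

```python
# WER (%) values as reported in this subsection, keyed by bitrate in kbps.
EVS_WER = {
    8.0: {"NB": 3.41, "WB": 3.15},
    9.6: {"NB": 3.19, "WB": 3.23},
    13.2: {"NB": 3.16, "WB": 3.14},
}

# Assumption: ~0.1 points, the NB/WB spread seen on the PCM baseline.
NOISE_FLOOR = 0.1

def nb_wb_gaps(table, floor=NOISE_FLOOR):
    """Map bitrate -> (NB minus WB gap in points, True if the gap exceeds the floor)."""
    gaps = {}
    for kbps, wer in table.items():
        gap = round(wer["NB"] - wer["WB"], 2)
        gaps[kbps] = (gap, abs(gap) > floor)
    return gaps

for kbps, (gap, significant) in nb_wb_gaps(EVS_WER).items():
    print(kbps, gap, "above floor" if significant else "within floor")
```

Under this reading, only the 8.0 kbps point shows an NB/WB gap larger than the baseline spread.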

5.5 EVS Degradation Analysis

Methodology: Calculated WER Degradation = (WER_coded - WER_baseline) / (100 - WER_baseline) to isolate codec processing impact from ASR model variance.
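
The degradation metric can be expressed as a small helper. The baseline value in the usage line is hypothetical, since this summary does not list the PCM baseline WERs; only the formula is taken from the text.

```python
def wer_degradation(wer_coded: float, wer_baseline: float) -> float:
    """(WER_coded - WER_baseline) / (100 - WER_baseline), with WERs in percent.

    The result is the fraction of the baseline's correctly recognized word
    share that is lost after coding.
    """
    if not 0.0 <= wer_baseline < 100.0:
        raise ValueError("baseline WER must be in [0, 100)")
    return (wer_coded - wer_baseline) / (100.0 - wer_baseline)

# Hypothetical baseline of 3.00% paired with the reported EVS-NB 13.2 kbps WER of 3.16%.
print(wer_degradation(3.16, 3.00))
```

Normalizing by 100 minus the baseline puts NB and WB on a common scale even when their PCM baselines differ slightly.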

Key Findings:
- Semantic loss introduced by EVS in both NB and WB modes is minimal
- Degradation metric confirms that pure coding loss of NB and WB is statistically indistinguishable when subtracting baseline PCM variance
- Additional frequency content in wideband contributes negligible semantic information for machine understanding compared to core NB spectrum

Strategic Implication: Robust NB mode is sufficient for intelligibility requirements of critical last resort communications, without bit starvation risk associated with wider bandwidths at low bitrates.

5.6 Summary of Findings

Strategic Conclusions for ULBC Design:

  1. For ~1 kbps emergency/GEO scenarios:
     - Semantic intelligibility is paramount
     - NB and WB offer comparable semantic preservation
     - Enforcing wider bandwidth at extremely low bitrates is risky due to limited bit budget
     - Narrowband is the superior design choice at the lowest bitrates, allowing the encoder to focus bits on the basic voice quality foundation

  2. AI-based codec sampling rate optimization:
     - DAC 16 kHz model provides a distinct advantage over the 24 kHz model at lower bitrates
     - 8 kHz model (trained for only 200k steps) outperforms the official 16 kHz model at low bitrates
     - Optimizing the sampling rate to match the available bit budget is critical for system design
     - Performance of intermediate rates (e.g., 12 kHz) remains an open question

6. Proposals

Proposal: Include relevant content from Sections 3, 4, and 5 into TR 26.940, capturing:
- Methodology
- Experimental setup
- Analysis of results concerning audio bandwidth impact on semantic intelligibility

7. Detailed Results Tables

Complete experimental data provided in appendix tables covering:
- Table 1.a: DAC Model Results (16/24/44 kHz) across bitrates 500-7751 bps
- Table 1.b: DAC NB Model Results (8 kHz) across bitrates 312-1875 bps
- Table 2: EVS & PCM Baseline Results for NB/WB modes at 5900-13200 bps

Document Information
Source: vivo Mobile Communication Co., Nokia, Xiaomi Technology, Samsung, Spreadtrum, Bytedance
Type: pCR
For: Agreement
Title: [FS_ULBC] Analyzing semantic intelligibility in lossy coded audio signals
Agenda item: 7.8
Agenda item description: FS_ULBC (Study on Ultra Low Bitrate Speech Codec)
Abstract: This document presents the results of an experimental evaluation focusing on the semantic intelligibility of audio codecs under Ultra-Low Speech Bitrate (ULBC) [1] constraints. While traditional quality metrics often focus on perceptual quality (MOS), the primary objective of this paper is to quantify semantic preservation (the ability of the listener to accurately understand the spoken content) using automatic speech recognition (ASR) word error rate (WER) as a proxy.
Release: Rel-20
Specification: 26.940
Version: 0.4.0
Related WIs: FS_ULBC
Contact: Wang Dong
Uploaded: 2026-02-03T13:43:09.937000
Contact ID: 107237
TDoc Status: noted
Reservation date: 03/02/2026 12:33:20
Agenda item sort order: 20