# Comprehensive Summary: Analyzing Semantic Intelligibility in Lossy Coded Audio Signals

## 1. Introduction and Objectives

This contribution presents experimental results on the **semantic intelligibility** of audio codecs under the constraints of the Ultra Low Bitrate Speech Codec (ULBC) study for GEO satellite communications. The primary objective is to quantify **semantic preservation** (how accurately a listener can understand the spoken content) using **Automatic Speech Recognition (ASR) Word Error Rate (WER)** as a proxy metric, rather than traditional perceptual quality metrics such as MOS.

The study evaluates:
- **Descript Audio Codec (DAC)** - AI-based codec
- **Enhanced Voice Services (EVS)** codec - 3GPP standard reference

The analysis specifically investigates whether higher audio bandwidths (wideband vs. narrowband) improve or reduce intelligibility at very low bitrates, providing data-driven guidance for audio bandwidth design constraints and quality floor determination.

## 2. Background and Motivation

### 2.1 ULBC Context
The ULBC study item targets voice over GEO satellite communications where balancing audio quality, robustness, and bit-efficiency is critical. At extremely low bitrates (below 3 kbps, down to about 1 kbps), a fundamental trade-off emerges:
- **Wideband audio (16 kHz sampling rate)** offers naturalness and perceptual quality
- **Bit allocation challenge**: Allocating scarce bits to higher frequencies reduces the budget for core speech spectrum, potentially introducing artifacts that outweigh bandwidth benefits

### 2.2 Critical Communication Requirements
For emergency rescue operations, **semantic intelligibility** is the highest priority. Key considerations include:
- Wideband generally improves comfort and speaker identification, but its impact on speech understanding in "last resort" scenarios requires verification
- System interoperability with legacy endpoints (PSTN, GSM fallback) remains important in remote areas
- Need to balance modern expectations with legacy requirements and emergency scenarios

### 2.3 EVS as Reference Anchor
EVS serves as a **quality anchor** and concrete standardized baseline for semantic preservation, enabling:
- Practical quality floor definition for ULBC
- Comparison against established carrier-grade standards
- Isolation of bandwidth choice impact independent of codec architecture

## 3. Methodology

### 3.1 Evaluation Pipeline
- **Dataset**: LibriSpeech train-clean-100 subset (standard benchmark for high-quality read English speech)
- **Sample size**: 500 audio files randomly sampled across three seeds (101, 102, 103)
- **Consistency**: Same audio files used for all codec and bitrate configurations
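
A minimal sketch of how such a seeded, reproducible file selection could be set up. The text does not say whether 500 files are drawn per seed or in total; the sketch assumes 500 per seed. The `corpus` list and the `draw_subset` helper are hypothetical names used only for illustration.

```python
import random

SEEDS = (101, 102, 103)   # seeds quoted in the text
FILES_PER_DRAW = 500      # assumption: 500 files are drawn per seed


def draw_subset(all_files, seed, k=FILES_PER_DRAW):
    """Return a reproducible random subset of k files for one seed."""
    rng = random.Random(seed)
    # Sort first so the draw does not depend on directory listing order.
    return rng.sample(sorted(all_files), k)


# Hypothetical file list standing in for the LibriSpeech utterances.
corpus = [f"utt_{i:05d}.flac" for i in range(2000)]
subsets = {seed: draw_subset(corpus, seed) for seed in SEEDS}
```

Because each subset depends only on the seed and the sorted file list, every codec and bitrate configuration can be run on exactly the same files.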

### 3.2 Processing Chain
1. Process input audio through target codecs (DAC and EVS) at various bitrates
2. Transcribe processed audio using **OpenAI Whisper model (large-v3)** - selected for state-of-the-art performance and noise robustness
3. Compare transcripts against LibriSpeech ground truth
4. Calculate WER using jiwer library
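
The contribution computes WER with the jiwer library. The sketch below is a dependency-free stand-in that illustrates the same metric (word-level edit distance divided by the number of reference words); it is not the authors' script. The `normalize` step is an assumption, since the text normalization applied before scoring is not described.

```python
def normalize(text):
    """Lowercase and replace punctuation with spaces (assumed normalization)."""
    kept = "".join(
        ch if ch.isalnum() or ch.isspace() or ch == "'" else " "
        for ch in text.lower()
    )
    return kept.split()


def word_error_rate(reference, hypothesis):
    """WER in percent: word-level edit distance / number of reference words."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Levenshtein distance over words, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return 100.0 * prev[-1] / len(ref)


# One substitution and one deletion against a six-word reference.
print(word_error_rate("The cat sat on the mat.", "the cat sit on mat"))
```

For a corpus-level figure, the edit counts would be summed over all utterances before dividing by the total number of reference words, rather than averaging per-utterance percentages.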

## 4. Experimental Setup

### 4.1 Codec Configurations
**DAC model**: Evaluated at three sampling rates
- 16 kHz
- 24 kHz
- 44.1 kHz

**EVS codec**: Evaluated in standard modes
- Narrowband (NB)
- Wideband (WB)

**Baseline**: Uncompressed PCM audio (resampled from 48 kHz to NB and WB)

### 4.2 Observations on Baseline Variance
- The NB PCM baseline occasionally yields a WER ~0.1% lower (better) than the WB PCM baseline
- Attributed to inherent ASR model variance rather than signal quality differences
- The same variance explains why high-bitrate DAC models occasionally report a slightly lower WER than the WB baseline

### 4.3 Primary Metric
**WER (Word Error Rate)**: Lower percentage indicates better performance. Log-scale visualization employed to distinguish performance differences in the 3-5% WER range.

## 5. Results and Analysis

### 5.1 DAC Performance vs Bitrate

**Key Findings**:
- DAC achieves high efficiency at low bitrates (~2 kbps)
- WER drops rapidly as bitrate increases, stabilizing around 3-4%
- At **1.5 kbps**: WER approximately **5.5%** (16 kHz model)
- Significant improvement observed in 1.5-3.0 kbps range

**Bandwidth Impact at Low Bitrates**:
- At low bitrates (1.5 kbps and 3 kbps), **16 kHz model outperforms 24 kHz model**
- With constant model size, 16 kHz model allocates more bits per spectral unit within narrower band
- Results in better semantic preservation vs. 24 kHz model suffering from **bit starvation**
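
The bit-starvation argument can be made concrete with a crude proxy: the bitrate divided by the Nyquist bandwidth (half the sampling rate). This is only an illustration. A neural codec does not spread bits uniformly over frequency, and the `bits_per_khz` helper is a hypothetical name, not something used in the contribution.

```python
def bits_per_khz(bitrate_bps, sample_rate_hz):
    """Bitrate divided by the Nyquist bandwidth (sample_rate / 2), in bps per kHz."""
    nyquist_khz = sample_rate_hz / 2 / 1000
    return bitrate_bps / nyquist_khz


for sample_rate in (16000, 24000):
    density = bits_per_khz(1500, sample_rate)
    print(f"{sample_rate} Hz model at 1500 bps: {density:.1f} bps per kHz")
```

At the same 1.5 kbps budget, the 16 kHz model has 187.5 bps per kHz of representable bandwidth against 125 bps per kHz for the 24 kHz model.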

### 5.2 DAC 8 kHz Narrowband Model Analysis

A dedicated **8 kHz sampling rate model** was trained to investigate bandwidth impact at the lower bitrate bound.

**Model Configuration**:
- Sample rate: 8000 Hz
- Encoder rates: [2, 4, 4, 8], dimension: 64
- Decoder rates: [8, 4, 4, 2], dimension: 1536
- Quantization: 6 codebooks, size 1024, dimension 36
- Training: 200,000 steps on VCTK corpus
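
The bitrates quoted for this model follow from its configuration if one assumes residual vector quantization, where each active codebook contributes log2(codebook size) bits per latent frame and the frame rate is the sampling rate divided by the product of the encoder strides. The sketch below works through that arithmetic; the function name is hypothetical.

```python
import math


def dac_bitrate_bps(sample_rate_hz, encoder_rates, codebook_size, n_codebooks):
    """Bitrate of an RVQ codec when only n_codebooks codebooks are transmitted."""
    hop = math.prod(encoder_rates)              # samples per latent frame
    frames_per_second = sample_rate_hz / hop
    bits_per_code = math.log2(codebook_size)    # 10 bits for size 1024
    return frames_per_second * bits_per_code * n_codebooks


# 8 kHz model from the configuration above, using 1 to 6 codebooks.
for n in range(1, 7):
    print(n, dac_bitrate_bps(8000, [2, 4, 4, 8], 1024, n))
```

With a hop of 256 samples the model produces 31.25 frames per second, so each codebook adds 312.5 bps. Three codebooks give 937.5 bps and five give 1562.5 bps, matching the 938 bps and 1563 bps operating points reported below after rounding, and six give the 1875 bps upper bound.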

**Critical Findings at and below 1.5 kbps**:
- **At ~1 kbps**:
  - 8 kHz model (938 bps): **WER 5.86%**
  - 16 kHz model (1000 bps): **WER 11.23%**
  - **Semantic penalty > 5 percentage points** when forcing WB at 1 kbps

- **At 1.5 kbps**:
  - 8 kHz model (1563 bps): **WER 3.86%**
  - 16 kHz model (1500 bps): **WER 5.46%**

**Conclusion**: At sub-2 kbps bitrates, the available bit budget is insufficient to support a wider bandwidth without degrading the core spectral content required for intelligibility. A native narrowband model can spend its bits on the core speech band (0-4 kHz), preserving semantic content more effectively.
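
The size of the wideband penalty can be checked directly from the WER figures in this section. The sketch below computes the absolute gap in percentage points and the gap relative to the 8 kHz result; the helper name is hypothetical.

```python
# WER in percent, copied from the findings in this section.
results = {
    "~1 kbps": {"8 kHz": 5.86, "16 kHz": 11.23},
    "1.5 kbps": {"8 kHz": 3.86, "16 kHz": 5.46},
}


def wideband_penalty(wer_nb, wer_wb):
    """Return (absolute gap in points, gap relative to the NB WER in percent)."""
    absolute = wer_wb - wer_nb
    relative = 100.0 * absolute / wer_nb
    return round(absolute, 2), round(relative, 1)


for label, wer in results.items():
    print(label, wideband_penalty(wer["8 kHz"], wer["16 kHz"]))
```

At ~1 kbps the gap is 5.37 percentage points, close to double the 8 kHz WER; at 1.5 kbps it narrows to 1.6 points.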

### 5.3 DAC vs EVS Comparison

**Competitive Advantage**:
- DAC achieves comparable WER scores at **significantly lower bitrates** than EVS
- DAC 16 kHz performance curves converge towards high-quality PCM baselines faster than traditional codecs

**ULBC Application**: For GEO scenarios in the 1-3 kbps range, semantic preservation is critical for defining the quality floor.

### 5.4 EVS Narrowband vs Wideband Analysis

**Performance at Different Bitrates**:

- **At 13.2 kbps** (highest tested):
  - EVS-NB: **3.16%**
  - EVS-WB: **3.14%**
  - Nearly identical, indicating saturation point in semantic quality

- **At 9.6 kbps**:
  - EVS-NB: **3.19%**
  - EVS-WB: **3.23%**
  - NB performance very close to WB, difference within ASR model error margin

- **At 5.9-8.0 kbps range**:
  - EVS-WB maintains marginal advantage (e.g., at 8.0 kbps: WB 3.15% vs. NB 3.41%)
  - Both modes provide sufficient basic audio quality

**Conclusion**: For semantic understanding, NB bandwidth limitation is less critical than codec's bit allocation efficiency.

### 5.5 EVS Degradation Analysis

**Methodology**: Calculated WER Degradation = (WER_coded - WER_baseline) / (100 - WER_baseline) to isolate codec processing impact from ASR model variance.
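
A small sketch of the degradation formula as written, with WER values in percent. The EVS figures are the 13.2 kbps results from Section 5.4; the PCM baseline values are hypothetical placeholders, since the baseline WERs are only given in the results tables and not in this summary.

```python
def wer_degradation(wer_coded, wer_baseline):
    """(WER_coded - WER_baseline) / (100 - WER_baseline), with WER in percent."""
    if not 0.0 <= wer_baseline < 100.0:
        raise ValueError("baseline WER must be in [0, 100)")
    return (wer_coded - wer_baseline) / (100.0 - wer_baseline)


EVS_13K2 = {"NB": 3.16, "WB": 3.14}               # from Section 5.4
HYPOTHETICAL_BASELINE = {"NB": 3.00, "WB": 3.10}  # placeholder values only

for mode in ("NB", "WB"):
    d = wer_degradation(EVS_13K2[mode], HYPOTHETICAL_BASELINE[mode])
    print(f"EVS-{mode}: degradation = {d:.5f}")
```

The result is the fraction of words the baseline recognized correctly that are lost after coding, so a value of 0 means the codec adds no errors beyond the uncompressed baseline.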

**Key Findings**:
- Semantic loss introduced by EVS in both NB and WB modes is **minimal**
- Degradation metric confirms that pure coding loss of NB and WB is **statistically indistinguishable** when subtracting baseline PCM variance
- Additional frequency content in wideband contributes **negligible semantic information** for machine understanding compared to core NB spectrum

**Strategic Implication**: Robust NB mode is sufficient for intelligibility requirements of critical last resort communications, without bit starvation risk associated with wider bandwidths at low bitrates.

### 5.6 Summary of Findings

**Strategic Conclusions for ULBC Design**:

1. **For ~1 kbps emergency/GEO scenarios**: 
   - Semantic intelligibility is paramount
   - EVS NB and WB offer comparable semantic preservation at 5.9-13.2 kbps, whereas at ~1 kbps the DAC 8 kHz model clearly outperforms the 16 kHz model (WER 5.86% vs. 11.23%)
   - Enforcing wider bandwidth at extremely low bitrates is risky due to limited bit budget
   - **Narrowband is the superior design choice** at the lowest bitrates, allowing the encoder to concentrate bits on the core speech band

2. **AI-based codec sampling rate optimization**:
   - DAC 16 kHz model provides distinct advantage over 24 kHz model at lower bitrates
   - 8 kHz model (trained for only 200k steps) outperforms the official 16 kHz model at ~1 kbps and 1.5 kbps
   - **Optimizing sampling rate to match available bit budget is critical** for system design
   - Performance of intermediate rates (e.g., 12 kHz) remains open question

## 6. Proposals

**Proposal**: Include relevant content from Sections 3, 4, and 5 into **TR 26.940**, capturing:
- Methodology
- Experimental setup
- Analysis of results concerning audio bandwidth impact on semantic intelligibility

## 7. Detailed Results Tables

Complete experimental data provided in appendix tables covering:
- **Table 1.a**: DAC Model Results (16/24/44.1 kHz) across bitrates 500-7751 bps
- **Table 1.b**: DAC NB Model Results (8 kHz) across bitrates 312-1875 bps
- **Table 2**: EVS & PCM Baseline Results for NB/WB modes at 5900-13200 bps