# Summary of 3GPP Technical Document on Objective Metrics in ULBC Standardization

## Introduction and Scope

This document addresses the Study on Ultra Low Bitrate Speech Codec (FS_ULBC), focusing on performance requirements and test methodologies as defined in the WID. The contribution targets study objective 5, which covers testing of speech quality, intelligibility, and conversational quality under various conditions (clean and noisy speech, tandeming with IMS codecs, clean and GEO channel conditions).

## Main Technical Contributions

### Test Methodologies (Clause 9)

#### Quality Impairments of Ultra-Low Bit Rate Speech Coding (9.1.1)

The document identifies specific impairment categories relevant to ULBC:
- Loss of listening-only audio quality
- Audio bandwidth loss
- Impaired intelligibility
- Impaired speaker identifiability
- Prosodic impairments
- Hallucination (word and phone confusions)
- Sensitivity to non-speech input (background noise, music, noisy speech, interfering talkers, reverberant speech)

The document additionally notes that ULBC may incorporate speech enhancement algorithms (noise suppression, gain normalization).

#### Challenges of Quality Assessment (9.1.2)

The document highlights that ULBC testing introduces new challenges compared to testing of signal-processing-based codecs (AMR, AMR-WB, EVS):

**Traditional 3GPP Approach:**
- Historical reliance on ITU-T P.800 ACR (Absolute Category Rating) for clean speech
- P.800 DCR (Degradation Category Rating) for SWB clean speech, mixed-bandwidth, speech + background noise, and music/mixed content
- Previous codec standardizations did not focus on intelligibility, speaker identifiability, or prosodic impairments

**ULBC-Specific Considerations:**
- ML-based coding systems introduce new impairment types (e.g., hallucination) not present in signal-processing codecs
- ACR may not optimally quantify all impairments (hallucination, intelligibility, prosodic issues)
- DCR focuses on differences from the reference; such differences may not directly impair conversational capability but can affect aspects such as identity recognition

**Alternative Test Methodologies Listed:**
- Diagnostic Rhyme Tests (DRT)
- Modified Rhyme Tests (MRT)
- MOS testing for speaker similarity
- Speaker verification/identification tests
- Prosodic naturalness MOS tests
- Intonation recognition tests
- Transcription tests for word and semantic equivalence
- Phoneme recognition tests
- Automatic speech recognition tests
- P.835 multi-dimensional rating scales for speech enhancement evaluation
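
The transcription and automatic speech recognition tests listed above need some way to compare a transcript of the decoded speech against a transcript of the original. The source does not name a scoring measure; the sketch below assumes word error rate (WER), a common choice, computed as a word-level edit distance. The function name `word_error_rate` and the example sentences are hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] is the edit distance between the reference prefix handled so far
    # and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing in hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one substituted word out of six.
original = "the cat sat on the mat"
decoded = "the cat sat on a mat"
print(f"WER = {word_error_rate(original, decoded):.3f}")
```

A word substituted by a plausible but wrong word, as in a hallucination, counts as one error here. WER alone does not capture semantic equivalence, which the listed transcription tests also target.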

#### Subjective Testing Considerations (9.1.3)

**Robustness Related to Source Material (9.1.3.1):**
- Multiple languages with diverse intonations
- Non-speech signals
- Various linguistic features and accents
- Wide range of speakers (different voice pitches, speaking styles)
- Overlapping talkers

**Simulation of Real-world Acoustic Conditions (9.1.3.2):**
- Clean environments (minimal background noise)
- Noisy environments (traffic, human chatter, vehicle)
- Various reverberation levels (RT60 ranging from 0.3 s to 1.0 s)

**Tandeming and Compatibility Testing (9.1.3.3):**
- Testing with speech previously encoded by ITU-T G.711, AMR, AMR-WB, and EVS
- Various input levels: -16 dBov, -26 dBov, and -36 dBov
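
To illustrate how test material might be prepared at the listed input levels, the following minimal sketch scales a signal to a target level in dBov. It makes two assumptions that are not in the source: the level is measured as plain RMS over the whole signal (formal tests typically measure active speech level instead), and samples are floats with a full-scale amplitude of 1.0. The helper names `rms_level_dbov` and `scale_to_level` are hypothetical.

```python
import math


def rms_level_dbov(samples, full_scale=1.0):
    """RMS level relative to the overload point (0 dBov = full-scale square wave)."""
    if not samples:
        raise ValueError("signal must contain at least one sample")
    mean_square = sum(s * s for s in samples) / len(samples)
    if mean_square == 0.0:
        return float("-inf")
    return 10.0 * math.log10(mean_square / (full_scale * full_scale))


def scale_to_level(samples, target_dbov, full_scale=1.0):
    """Apply a constant gain so that the RMS level equals target_dbov."""
    current = rms_level_dbov(samples, full_scale)
    if current == float("-inf"):
        raise ValueError("cannot scale an all-zero signal")
    gain = 10.0 ** ((target_dbov - current) / 20.0)
    return [s * gain for s in samples]


# Stand-in test signal: one second of a full-scale 440 Hz tone at 16 kHz.
tone = [math.sin(2.0 * math.pi * 440.0 * n / 16000.0) for n in range(16000)]
for target in (-16.0, -26.0, -36.0):
    scaled = scale_to_level(tone, target)
    print(f"target {target:6.1f} dBov -> measured {rms_level_dbov(scaled):6.2f} dBov")
```

The sketch does not check for clipping; speech scaled to the highest level can have peaks that exceed full scale and would need separate handling.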

**Conclusion (9.1.3.4):**
- ITU-T P.800 ACR/DCR serve as the backbone for most subjective testing
- Other methodologies may be considered
- Emphasis on diverse test material: multilingual/multi-speaker testing, real-world acoustic conditions, and tandeming

#### Objective Testing Considerations (9.1.4)

**Correlation Analysis Results (9.1.4.1):**

The document presents correlation analysis based on ACR experiments (clause 7.3.3) evaluating objective models:

*Speech-oriented metrics:* PESQ, POLQA, ViSQOL-S, WARP-Q, DNSMOS, NISQA, NORESQA, UTMOS, SCOREQ

*General audio metrics:* PEAQ, ViSQOL-A

Evaluation metrics used: Pearson correlation coefficient, RMSE, Kendall's Tau rank correlation coefficient
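
The three evaluation metrics can be sketched in a few lines. This is an illustration only: the score lists are made-up placeholders rather than data from the document, and the Kendall variant shown is tau-a (ties count as neither concordant nor discordant), since the source does not say which variant was used.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson linear correlation coefficient."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def rmse(x, y):
    """Root-mean-square error between two equally long score lists."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))


def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        product = (x1 - x2) * (y1 - y2)
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    n = len(x)
    return (concordant - discordant) / (n * (n - 1) / 2)


# Placeholder per-condition scores (hypothetical, not from the document).
subjective_mos = [1.8, 2.6, 3.4, 4.1, 4.5]
objective_score = [2.0, 2.4, 3.6, 3.9, 4.6]
print(f"Pearson = {pearson(objective_score, subjective_mos):.3f}")
print(f"RMSE    = {rmse(objective_score, subjective_mos):.3f}")
print(f"Kendall = {kendall_tau_a(objective_score, subjective_mos):.3f}")
```

Pearson measures linear agreement, Kendall's tau measures agreement in rank order only, and RMSE measures absolute prediction error on the MOS scale. This is why RMSE is the metric that depends on whether a mapping has been applied.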

**Key Observations for Clean Speech:**
- Best performing models (POLQA, UTMOS, PESQ, WARP-Q, SCOREQ) accurately predicted monotonic bitrate/quality behavior
- 16 kHz models (PESQ without mapping, UTMOS and WARP-Q with mapping) showed relatively good performance even for fullband codecs
- Mapping generally improves accuracy (RMSE), except for a few models (PESQ, POLQA)

**Correlation Analysis for Music/Mixed Content:**

Based on DCR experiments (clause 7.3.4), evaluating: POLQA, PEMO-Q, ViSQOL-A, and 2f-model

**Key Observations for Music/Mixed Content:**
- POLQA (despite not being recommended for non-speech) showed the best correlation results (Pearson, Kendall, RMSE after third-order mapping)
- 2f-model was second-best performing
- ViSQOL Audio, PEAQ, and PEMO-Q showed fair performance
- Correlation scores were lower than for clean speech, possibly due to the more difficult task of predicting general audio quality and a mismatch with the DCR grading methodology
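
The third-order mapping mentioned above can be sketched as a least-squares cubic fit from objective scores to subjective scores, applied before computing RMSE. The code below is a minimal illustration under stated assumptions: the scores are placeholders, the fit and the evaluation use the same data, and monotonicity of the mapping is not enforced, which a formal evaluation may require. The function names are hypothetical.

```python
import math


def fit_polynomial(x, y, order=3):
    """Least-squares fit; returns c such that y ~ c[0] + c[1]*x + ... + c[order]*x**order."""
    size = order + 1
    # Normal equations A c = b with A[i][j] = sum(x^(i+j)) and b[i] = sum(y * x^i).
    a = [[sum(v ** (i + j) for v in x) for j in range(size)] for i in range(size)]
    b = [sum(t * v ** i for v, t in zip(x, y)) for i in range(size)]
    # Gaussian elimination with partial pivoting.
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            raise ValueError("singular system: need more distinct objective scores")
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for row in range(col + 1, size):
            factor = a[row][col] / a[col][col]
            for k in range(col, size):
                a[row][k] -= factor * a[col][k]
            b[row] -= factor * b[col]
    # Back substitution.
    coeffs = [0.0] * size
    for row in reversed(range(size)):
        tail = sum(a[row][k] * coeffs[k] for k in range(row + 1, size))
        coeffs[row] = (b[row] - tail) / a[row][row]
    return coeffs


def apply_polynomial(coeffs, x):
    """Evaluate the fitted polynomial at each value in x."""
    return [sum(c * v ** i for i, c in enumerate(coeffs)) for v in x]


def rmse(x, y):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(x, y)) / len(x))


# Placeholder per-condition scores (hypothetical, not from the document).
objective = [1.2, 1.9, 2.3, 3.1, 3.4, 4.0, 4.4]
subjective = [1.5, 2.4, 2.6, 3.6, 3.7, 4.3, 4.4]
mapped = apply_polynomial(fit_polynomial(objective, subjective), objective)
print(f"RMSE without mapping: {rmse(objective, subjective):.3f}")
print(f"RMSE with mapping:    {rmse(mapped, subjective):.3f}")
```

Because the identity mapping is itself a cubic polynomial, a least-squares fit evaluated on its own fitting data cannot increase RMSE. Cases where a mapping does not help therefore presumably reflect a different fitting procedure or dataset than this sketch assumes.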

**Discussion (9.1.4.2):**
- P.862 (PESQ) has been officially "withdrawn" by ITU-T and cannot be considered a valid standard
- P.863 (POLQA) remains the main ITU-T standard, with P.SAMD emerging as a potential alternative
- Testing and parameter adjustment based on objective tools are not recommended
- 3GPP TR 26.921 documented that tuning noise reduction based on PESQ should be avoided

**Conclusion (9.1.4.3):**
- Subjective testing remains the "golden reference" for codec selection
- Objective metrics are NOT recommended as codec selection criteria or for codec tuning
- Correlation of subjective and objective metrics may be considered for codec characterization
- Objective metrics have merits in other tasks such as codec conformance testing

## Proposed Changes to TR 26.940

The document proposes comprehensive revisions to TR 26.940 v0.5.1, specifically to Clause 9 (Test methodologies), incorporating all the analysis and recommendations detailed above regarding both subjective and objective testing approaches for ULBC standardization.