# Summary of 3GPP Technical Document on Objective Metrics in ULBC Standardization

## Introduction and Scope

This document addresses the "Study on Ultra Low Bitrate Speech Codec" (FS_ULBC) approved at SA#107. It focuses on study objective 5 from the WID: performance requirements and test methodologies for speech quality, intelligibility, and conversational quality across various conditions (clean and noisy speech, tandeming with IMS codecs, clean and GEO channel conditions).

The contribution provides correlation analysis results of objective quality models as a complement to subjective test results on clean speech and music/mixed content in TR 26.940, building upon previous discussions in S4-251814.

## Main Technical Contributions

### Test Methodologies - General Considerations (Clauses 9.1.1-9.1.2)

**Quality Impairment Categories for ULBC:**
- Loss of listening-only audio quality
- Audio bandwidth loss
- Impaired intelligibility
- Impaired speaker identifiability
- Prosodic impairments
- Hallucination (word and phone confusions)
- Sensitivity to non-speech input (background noise, music, noisy speech, interfering talkers, reverberant speech)

**Testing Challenges:**
- ML-based ULBC codecs introduce new impairment categories (e.g., hallucination) not present in signal-processing based codecs (AMR, AMR-WB, EVS)
- Traditional P.800 ACR methodology may not optimally quantify all potential impairments
- DCR methodology focuses on differences from the reference, which makes it suitable for small impairments and prosodic differences
- Previous 3GPP codec standardization (AMR, AMR-WB, EVS) used ACR for clean speech and DCR for SWB, mixed-bandwidth, noisy speech, and music evaluations

**Alternative Test Methods Listed:**
- Diagnostic Rhyme Tests (DRT)
- Modified Rhyme Tests (MRT)
- MOS testing for speaker similarity
- Speaker verification/identification tests
- Prosodic naturalness MOS tests
- Intonation recognition tests
- Transcription tests for word and semantic equivalence
- Phoneme recognition tests
- Automatic speech recognition tests
- P.835 multi-dimensional rating scales for speech enhancement evaluation
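
Transcription and automatic speech recognition tests from the list above are commonly scored with a word error rate (WER). The following is a minimal sketch, assuming a WER defined as the word-level edit distance divided by the number of reference words; the function name `word_error_rate` and the example sentences are illustrative and do not come from the source document.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] is the edit distance between the reference words processed
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: transcript of the original vs. transcript of decoded speech.
original = "the cat sat on the mat"
decoded = "the bat sat on a mat"
print(f"WER = {word_error_rate(original, decoded):.3f}")
```

The sketch splits on whitespace and lowercases only; a real test would also need to agree on punctuation and text normalization before comparing transcripts.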

### Subjective Testing Considerations (Clause 9.1.3)

**Source Material Robustness (9.1.3.1):**
- Multiple languages with diverse intonations
- Various phonetic and linguistic environments
- Different voice pitches and speaking styles
- Overlapping talkers

**Real-world Acoustic Conditions (9.1.3.2):**
- Clean environments (minimal background noise)
- Noisy environments (traffic, human chatter, vehicle)
- Varying reverberation levels (RT60 ranging from 0.3 s to 1.0 s)

**Tandeming and Compatibility Testing (9.1.3.3):**
- Testing with speech previously encoded by ITU-T G.711, AMR, AMR-WB, and EVS
- Various input levels: -16 dBov, -26 dBov, and -36 dBov
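
As an illustration of how test material might be brought to the input levels above, here is a minimal sketch. It assumes floating-point samples normalized so that the overload point is 1.0, and it uses the plain RMS level over the whole signal; actual codec tests typically measure an active speech level instead, so the helper names and the RMS simplification are assumptions, not the procedure of the source document.

```python
import math


def rms_level_dbov(samples, full_scale=1.0):
    """RMS level in dB relative to the overload point (assumed at full_scale)."""
    if not samples:
        raise ValueError("samples must not be empty")
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    if rms == 0.0:
        raise ValueError("cannot measure the level of an all-zero signal")
    return 20.0 * math.log10(rms / full_scale)


def scale_to_dbov(samples, target_dbov, full_scale=1.0):
    """Apply a constant gain so that the RMS level equals target_dbov."""
    gain_db = target_dbov - rms_level_dbov(samples, full_scale)
    gain = 10.0 ** (gain_db / 20.0)
    return [s * gain for s in samples]


# Hypothetical test tone: 1 s of a 440 Hz sine at 16 kHz sampling rate.
tone = [0.5 * math.sin(2 * math.pi * 440 * n / 16000) for n in range(16000)]
for target in (-16.0, -26.0, -36.0):
    scaled = scale_to_dbov(tone, target)
    print(target, round(rms_level_dbov(scaled), 2))
```

The sketch does not guard against clipping when the gain is positive; that check would be needed before writing the scaled signal to a fixed-point file.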

**Conclusion:**
- P.800 ACR/DCR serves as the backbone for most subjective testing
- Other methodologies may be considered
- Emphasis on diverse test material covering multilingual/multi-speaker testing, real-world acoustic conditions, and tandeming

### Objective Testing Considerations (Clause 9.1.4)

**Correlation Analysis on Clean Speech (9.1.4.1):**

Evaluated objective models from references [7-11]:
- **Speech-oriented metrics:** PESQ, POLQA, ViSQOL-S, WARP-Q, DNSMOS, NISQA, NORESQA, UTMOS
- **General audio metrics:** PEAQ, ViSQOL-A
- **Additional metric:** SCOREQ

Evaluation metrics used: Pearson correlation coefficient, root-mean-square error (RMSE), and Kendall's Tau rank correlation coefficient
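
The three evaluation metrics can be sketched in plain Python as follows. The source does not state which Kendall variant was used; this sketch assumes tau-b, which accounts for ties. The function names and the score values are hypothetical and serve only to show the calculation.

```python
import math


def pearson(x, y):
    """Pearson linear correlation coefficient."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def rmse(x, y):
    """Root-mean-square error between two equally long score lists."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))


def kendall_tau_b(x, y):
    """Kendall's tau-b rank correlation (assumed variant, handles ties)."""
    concordant = discordant = tied_x_only = tied_y_only = 0
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied in both variables: counted in neither factor
            if dx == 0:
                tied_x_only += 1
            elif dy == 0:
                tied_y_only += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    untied_x = concordant + discordant + tied_y_only
    untied_y = concordant + discordant + tied_x_only
    return (concordant - discordant) / math.sqrt(untied_x * untied_y)


# Hypothetical per-condition scores (not data from the contribution).
subjective_mos = [1.8, 2.5, 3.1, 3.9, 4.4]
objective_score = [2.0, 2.4, 3.5, 3.6, 4.5]
print("Pearson:", pearson(subjective_mos, objective_score))
print("RMSE:", rmse(subjective_mos, objective_score))
print("Kendall tau-b:", kendall_tau_b(subjective_mos, objective_score))
```

Pearson measures linear agreement, Kendall's Tau measures agreement in rank order only, and RMSE measures the absolute prediction error on the MOS scale, which is why the three are reported together.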

**Key Observations (Clean Speech):**
- The best-performing models (POLQA, UTMOS, PESQ, WARP-Q, SCOREQ) accurately predicted the monotonic bitrate/quality behavior of multirate codecs
- Models operating at 16 kHz (PESQ without mapping, UTMOS and WARP-Q with mapping) showed relatively good performance even for fullband codecs
- Mapping generally improves accuracy (RMSE), except for a few models (PESQ, POLQA)
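
The first observation concerns whether a model's predicted score rises with bitrate for a multirate codec. One simple way to check this is sketched below, assuming predictions are stored per bitrate; the function name, the optional tolerance, and all numbers are hypothetical.

```python
def is_monotonic_in_bitrate(scores_by_bitrate, tolerance=0.0):
    """True if the score never drops by more than tolerance as bitrate rises."""
    ordered = [score for _, score in sorted(scores_by_bitrate.items())]
    return all(
        later >= earlier - tolerance
        for earlier, later in zip(ordered, ordered[1:])
    )


# Hypothetical predictions for one codec, keyed by bitrate in kbit/s.
model_a = {1.2: 2.1, 2.4: 3.0, 4.8: 3.6}
model_b = {1.2: 2.8, 2.4: 2.5, 4.8: 3.4}
print(is_monotonic_in_bitrate(model_a))
print(is_monotonic_in_bitrate(model_b))
```

A small tolerance can be useful when neighboring bitrates are expected to be statistically indistinguishable in the subjective test as well.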

**Correlation Analysis on Music/Mixed Content:**

Evaluated models from references [7-12]: POLQA, PEAQ, PEMO-Q, ViSQOL-A, and 2f-model

**Key Observations (Music/Mixed Content):**
- POLQA, although not recommended for non-speech signals, gave the best correlation results (Pearson, Kendall's Tau, and RMSE after third-order mapping)
- The 2f-model was the second-best performer
- ViSQOL-A, PEAQ, and PEMO-Q showed only fair performance despite being adapted to music/mixed content
- Correlation scores were lower than for clean speech, possibly because predicting quality for general audio is a more difficult task and because of a mismatch with the grading used in the DCR test methodology
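
The RMSE "after mapping" mentioned above refers to fitting a mapping from objective scores to subjective scores before computing the error. The sketch below assumes the mapping is a third-order polynomial fitted by ordinary least squares; it does not enforce monotonicity, which a real evaluation would likely want to verify. All function names and score values are hypothetical.

```python
import math


def fit_polynomial(x, y, order=3):
    """Least-squares fit; returns coefficients c with y ~ sum(c[i] * x**i)."""
    size = order + 1
    # Normal equations A c = b with A[i][j] = sum(x^(i+j)), b[i] = sum(y * x^i).
    a = [[sum(v ** (i + j) for v in x) for j in range(size)] for i in range(size)]
    b = [sum(w * v ** i for v, w in zip(x, y)) for i in range(size)]
    # Gaussian elimination with partial pivoting.
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for row in range(col + 1, size):
            factor = a[row][col] / a[col][col]
            for k in range(col, size):
                a[row][k] -= factor * a[col][k]
            b[row] -= factor * b[col]
    # Back substitution.
    coeffs = [0.0] * size
    for row in range(size - 1, -1, -1):
        tail = sum(a[row][k] * coeffs[k] for k in range(row + 1, size))
        coeffs[row] = (b[row] - tail) / a[row][row]
    return coeffs


def apply_polynomial(coeffs, value):
    """Evaluate the fitted polynomial at a single value."""
    return sum(c * value ** i for i, c in enumerate(coeffs))


def rmse(x, y):
    """Root-mean-square error between two equally long score lists."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(x, y)) / len(x))


# Hypothetical per-condition scores (not data from the contribution).
objective = [1.5, 2.2, 2.9, 3.4, 4.0, 4.6]
subjective = [1.9, 2.4, 3.3, 3.6, 4.3, 4.5]

coeffs = fit_polynomial(objective, subjective, order=3)
mapped = [apply_polynomial(coeffs, v) for v in objective]
raw_rmse = rmse(objective, subjective)
mapped_rmse = rmse(mapped, subjective)
print("RMSE without mapping:", raw_rmse)
print("RMSE after third-order mapping:", mapped_rmse)
```

Because the identity mapping is itself a third-order polynomial, a least-squares fit on the same data cannot give a higher RMSE than the unmapped scores; the mapping removes systematic offset and scale differences between the model and the listening test.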

**Discussion (9.1.4.2):**
- P.862 (PESQ) has been officially "withdrawn" by ITU-T and cannot be considered a valid standard
- P.863 (POLQA) remains the main ITU-T standard, with P.SAMD emerging as a potential alternative
- Testing and parameter adjustment based on objective tools are not recommended
- 3GPP TR 26.921 documented that tuning noise reduction based on PESQ should be avoided

**Conclusion (9.1.4.3):**
- **Subjective testing remains "golden reference" for codec selection**
- **Objective metrics NOT recommended for codec selection criteria or codec tuning**
- Correlation of subjective/objective metrics may be considered for characterization of a new codec
- Objective metrics have merits in other tasks such as codec conformance testing

## Document Type

This is a **proposed Change Request (pCR) to TR 26.940**, specifically targeting Clause 9 (Test methodologies) with additions to subclauses 9.1.1 through 9.1.4.