S4-260235 - AI Summary

On the use of objective metrics in ULBC standardization

Summary of 3GPP Technical Document on Objective Metrics in ULBC Standardization

Introduction and Scope

This document addresses the Study on Ultra Low Bitrate Speech Codec (FS_ULBC), specifically focusing on performance requirements and test methodologies as defined in the WID. The contribution targets study objective 5 regarding speech quality, intelligibility, and conversational quality testing under various conditions (clean/noisy speech, tandeming with IMS codecs, clean/GEO channel conditions).

Main Technical Contributions

Test Methodologies (Clause 9)

Quality Impairments of Ultra-Low Bit Rate Speech Coding (9.1.1)

The document identifies specific impairment categories relevant to ULBC:
- Loss of listening-only audio quality
- Audio bandwidth loss
- Impaired intelligibility
- Impaired speaker identifiability
- Prosodic impairments
- Hallucination (word and phone confusions)
- Sensitivity to non-speech input (background noise, music, noisy speech, interfering talkers, reverberant speech)

The document additionally notes that a ULBC codec may incorporate speech enhancement algorithms (noise suppression, gain normalization).

Challenges of Quality Assessment (9.1.2)

The document highlights that ULBC testing introduces new challenges compared to signal processing-based codecs (AMR, AMR-WB, EVS):

Traditional 3GPP Approach:
- Historical reliance on ITU-T P.800 ACR (Absolute Category Rating) for clean speech
- P.800 DCR (Degradation Category Rating) for SWB clean speech, mixed-bandwidth, speech + background noise, and music/mixed content
- Previous codec standardizations did not focus on intelligibility, speaker identifiability, or prosodic impairments

ULBC-Specific Considerations:
- ML-based coding systems introduce new impairment types (e.g., hallucination) not present in signal-processing codecs
- ACR may not optimally quantify all impairments (hallucination, intelligibility, prosodic issues)
- DCR focuses on differences from the reference; such differences may not directly impair conversational capability but can affect aspects such as speaker identity recognition

Alternative Test Methodologies Listed:
- Diagnostic Rhyme Tests (DRT)
- Modified Rhyme Tests (MRT)
- MOS testing for speaker similarity
- Speaker verification/identification tests
- Prosodic naturalness MOS tests
- Intonation recognition tests
- Transcription tests for word and semantic equivalence
- Phoneme recognition tests
- Automatic speech recognition tests
- P.835 multi-dimensional rating scales for speech enhancement evaluation
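
Transcription and automatic speech recognition tests are often scored with a word error rate, which is one possible way to quantify word confusions such as hallucinated words. The following is a minimal sketch and is not taken from the source document, which does not specify a scoring rule. The function name `word_error_rate` and the example sentences are hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must not be empty")
    # prev[j] = edit distance between the reference words processed so far
    # (initially none) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing in hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one substituted word out of five reference words.
reference = "please call the front desk"
decoded = "please call the front deck"
print(word_error_rate(reference, decoded))
```

A word error rate only captures word-level equivalence; judging semantic equivalence, as the list above also mentions, would need a separate criterion.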

Subjective Testing Considerations (9.1.3)

Robustness Related to Source Material (9.1.3.1):
- Multiple languages with diverse intonations
- Non-speech signals
- Various linguistic features and accents
- Wide range of speakers (different voice pitches, speaking styles)
- Overlapping talkers

Simulation of Real-world Acoustic Conditions (9.1.3.2):
- Clean environments (minimal background noise)
- Noisy environments (traffic, human chatter, vehicle)
- Various reverberation levels (RT60 ranging from 0.3 s to 1.0 s)
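
One way to picture the RT60 range above is a synthetic room impulse response whose envelope has dropped by 60 dB after RT60 seconds. The sketch below is an assumption for illustration only: the source does not say how reverberant material would be produced, and measured room responses may well be preferred. The function names, the decaying-noise model, and the 16 kHz sample rate are all hypothetical choices.

```python
import random
from typing import List


def decay_envelope(t: float, rt60: float) -> float:
    """Amplitude envelope that has fallen by 60 dB (factor 1000) at t = rt60."""
    return 10.0 ** (-3.0 * t / rt60)


def synthetic_rir(rt60: float, sample_rate: int = 16000, seed: int = 0) -> List[float]:
    """Exponentially decaying Gaussian noise, truncated at rt60 seconds."""
    rng = random.Random(seed)
    n = round(rt60 * sample_rate)
    return [rng.gauss(0.0, 1.0) * decay_envelope(i / sample_rate, rt60)
            for i in range(n)]


def convolve(signal: List[float], rir: List[float]) -> List[float]:
    """Direct-form convolution; adequate only for short illustrative signals."""
    out = [0.0] * (len(signal) + len(rir) - 1)
    for i, s in enumerate(signal):
        if s == 0.0:
            continue
        for j, h in enumerate(rir):
            out[i + j] += s * h
    return out


# Impulse responses for the two ends of the RT60 range quoted above.
for rt60 in (0.3, 1.0):
    print(rt60, len(synthetic_rir(rt60)))
```

Reverberant test items would then be obtained by convolving clean speech with such a response; for full-length recordings an FFT-based convolution would be the practical choice.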

Tandeming and Compatibility Testing (9.1.3.3):
- Testing with speech previously encoded by ITU-T G.711, AMR, AMR-WB, and EVS
- Various input levels: -16 dBov, -26 dBov, and -36 dBov
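
The input levels above are expressed in dBov, that is, relative to the overload point of the digital system. The sketch below shows how a signal could be scaled to those levels. It assumes samples normalised to [-1.0, 1.0] and a plain RMS measurement over the whole signal, whereas codec test plans typically measure the active speech level (ITU-T P.56). The function names and the test tone are hypothetical.

```python
import math
from typing import List


def rms_level_dbov(samples: List[float]) -> float:
    """RMS level in dBov, where a full-scale square wave sits at 0 dBov."""
    mean_square = sum(s * s for s in samples) / len(samples)
    if mean_square == 0.0:
        return float("-inf")
    return 10.0 * math.log10(mean_square)


def scale_to_dbov(samples: List[float], target_dbov: float) -> List[float]:
    """Apply a constant gain so that the RMS level equals target_dbov."""
    current = rms_level_dbov(samples)
    if math.isinf(current):
        raise ValueError("cannot scale a silent signal")
    gain = 10.0 ** ((target_dbov - current) / 20.0)
    return [s * gain for s in samples]


# Hypothetical input: one second of a full-scale 1 kHz tone sampled at 16 kHz.
tone = [math.sin(2.0 * math.pi * 1000.0 * n / 16000.0) for n in range(16000)]

for target in (-16.0, -26.0, -36.0):
    scaled = scale_to_dbov(tone, target)
    print(target, round(rms_level_dbov(scaled), 3))
```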

Conclusion (9.1.3.4):
- ITU-T P.800 ACR/DCR serves as the backbone for most subjective testing
- Other methodologies may be considered
- Emphasis on diverse test material: multilingual/multi-speaker testing, real-world acoustic conditions, and tandeming

Objective Testing Considerations (9.1.4)

Correlation Analysis Results (9.1.4.1):

The document presents correlation analysis based on ACR experiments (clause 7.3.3) evaluating objective models:

Speech-oriented metrics: PESQ, POLQA, ViSQOL-S, WARP-Q, DNSMOS, NISQA, NORESQA, UTMOS, SCOREQ

General audio metrics: PEAQ, ViSQOL-A

Evaluation metrics used: Pearson correlation coefficient, RMSE, Kendall's Tau rank correlation coefficient
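
The three evaluation metrics can be sketched in a few lines. The code below is an illustration under stated assumptions, not the analysis script behind the contribution: the function names are hypothetical, Kendall's Tau is computed in its tau-b form (which accounts for ties), and the score lists are made-up numbers, not results from the document.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson linear correlation coefficient."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def rmse(x: Sequence[float], y: Sequence[float]) -> float:
    """Root-mean-square error between two score lists."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))


def kendall_tau_b(x: Sequence[float], y: Sequence[float]) -> float:
    """Kendall rank correlation (tau-b) by counting all pairs."""
    concordant = discordant = ties_x = ties_y = 0
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied in both rankings: excluded from every count
            if dx == 0:
                ties_x += 1
            elif dy == 0:
                ties_y += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    ranked = concordant + discordant
    return (concordant - discordant) / math.sqrt((ranked + ties_x) * (ranked + ties_y))


# Made-up per-condition scores, for illustration only.
subjective_mos = [1.8, 2.4, 3.1, 3.6, 4.2]
objective_score = [2.0, 2.3, 3.3, 3.4, 4.5]

print("Pearson:", pearson(subjective_mos, objective_score))
print("RMSE:", rmse(subjective_mos, objective_score))
print("Kendall tau-b:", kendall_tau_b(subjective_mos, objective_score))
```

Pearson and RMSE reward accurate score values, while Kendall's Tau only checks whether conditions are ranked in the same order, which is why the three are usually reported together.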

Key Observations for Clean Speech:
- Best performing models (POLQA, UTMOS, PESQ, WARP-Q, SCOREQ) accurately predicted monotonic bitrate/quality behavior
- 16 kHz models (PESQ without mapping, UTMOS and WARP-Q with mapping) showed relatively good performance even for fullband codecs
- Mapping generally improves accuracy (RMSE), except for a few models (PESQ, POLQA)

Correlation Analysis for Music/Mixed Content:

Based on DCR experiments (clause 7.3.4), evaluating: POLQA, PEMO-Q, ViSQOL-A, and 2f-model

Key Observations for Music/Mixed Content:
- POLQA, although not recommended for non-speech content, showed the best correlation results (Pearson, Kendall, and RMSE after third-order mapping)
- The 2f-model performed second best
- ViSQOL-A, PEAQ, and PEMO-Q showed fair performance
- Correlation scores were lower than for clean speech, possibly because predicting general audio quality is a harder task and because of a mismatch with the DCR grading methodology
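
The "mapping" applied before computing RMSE can be pictured as a third-order polynomial fitted from raw objective scores to subjective scores. The sketch below is an assumption about the general technique, not the procedure used in the contribution: it fits an unconstrained least-squares cubic with the standard library only (in practice `numpy.polyfit` would be the usual tool), it does not enforce monotonicity of the fitted curve, and the function names and scores are made up.

```python
import math
from typing import List, Sequence


def fit_polynomial(x: Sequence[float], y: Sequence[float], order: int = 3) -> List[float]:
    """Least-squares coefficients c[0] + c[1]*x + ... via the normal equations."""
    size = order + 1
    # Augmented matrix [A^T A | A^T y] for the Vandermonde matrix A.
    aug = [[sum(xi ** (r + c) for xi in x) for c in range(size)]
           + [sum(yi * xi ** r for xi, yi in zip(x, y))]
           for r in range(size)]
    # Gaussian elimination with partial pivoting.
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, size):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, size + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back substitution.
    coeffs = [0.0] * size
    for r in range(size - 1, -1, -1):
        tail = sum(aug[r][c] * coeffs[c] for c in range(r + 1, size))
        coeffs[r] = (aug[r][size] - tail) / aug[r][r]
    return coeffs


def apply_polynomial(coeffs: Sequence[float], x: Sequence[float]) -> List[float]:
    """Evaluate the fitted polynomial at each raw objective score."""
    return [sum(c * xi ** k for k, c in enumerate(coeffs)) for xi in x]


def rmse(a: Sequence[float], b: Sequence[float]) -> float:
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)) / len(a))


# Made-up scores, for illustration only.
objective_raw = [1.2, 1.9, 2.5, 3.0, 3.4, 3.9, 4.3, 4.6]
subjective = [1.5, 2.0, 2.9, 3.4, 3.9, 4.2, 4.4, 4.5]

coeffs = fit_polynomial(objective_raw, subjective, order=3)
mapped = apply_polynomial(coeffs, objective_raw)
print("RMSE before mapping:", rmse(objective_raw, subjective))
print("RMSE after mapping:", rmse(mapped, subjective))
```

Because the identity function is itself a cubic, a least-squares fit evaluated on the same data cannot yield a higher RMSE than the unmapped scores, so any improvement it shows says nothing about how the mapping generalises to other experiments.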

Discussion (9.1.4.2):
- P.862 (PESQ) has been officially "withdrawn" by ITU-T and cannot be considered a valid standard
- P.863 (POLQA) remains the main ITU-T standard; P.SAMD is emerging as a potential alternative
- Testing and parameter adjustment based on objective tools are not recommended
- 3GPP TR 26.921 documented that tuning noise reduction based on PESQ should be avoided

Conclusion (9.1.4.3):
- Subjective testing remains the "golden reference" for codec selection
- Objective metrics are NOT recommended as codec selection criteria or for codec tuning
- Correlation of subjective and objective metrics may be considered for codec characterization
- Objective metrics have merits in other tasks such as codec conformance testing

Proposed Changes to TR 26.940

The document proposes comprehensive revisions to TR 26.940 v0.5.1, specifically to Clause 9 (Test methodologies), incorporating all the analysis and recommendations detailed above regarding both subjective and objective testing approaches for ULBC standardization.

Document Information

TDoc: S4-260235

Source: Orange

Type: pCR

For: Agreement

Agenda item: 7.8

Agenda item description: FS_ULBC (Study on Ultra Low Bitrate Speech Codec)

Release: Rel-20

Specification: 26.940

Version: 0.5.1

Related WIs: FS_ULBC

Contact: Stephane Ragot

Uploaded: 2026-02-03T22:43:44.333000

Contact ID: 32055

Revised to: S4-260279

TDoc Status: revised

Reservation date: 03/02/2026 20:49:26

Agenda item sort order: 20