S4-260279 - AI Summary

On the use of objective metrics in ULBC standardization

Summary of 3GPP Technical Document on Objective Metrics in ULBC Standardization

Introduction and Scope

This document addresses the "Study on Ultra Low Bitrate Speech Codec" (FS_ULBC) approved at SA#107. It focuses on study objective 5 from the WID, which covers performance requirements and test methodologies for speech quality, intelligibility, and conversational quality across various conditions (clean and noisy speech, tandeming with IMS codecs, clean and GEO channel conditions).

The contribution provides correlation analysis results of objective quality models as a complement to subjective test results on clean speech and music/mixed content in TR 26.940, building upon previous discussions in S4-251814.

Main Technical Contributions

Test Methodologies - General Considerations (Clause 9.1.1-9.1.2)

Quality Impairment Categories for ULBC:
- Loss of listening-only audio quality
- Audio bandwidth loss
- Impaired intelligibility
- Impaired speaker identifiability
- Prosodic impairments
- Hallucination (word and phone confusions)
- Sensitivity to non-speech input (background noise, music, noisy speech, interfering talkers, reverberant speech)

Testing Challenges:
- ML-based ULBC codecs introduce new impairment categories (e.g., hallucination) not present in signal-processing-based codecs (AMR, AMR-WB, EVS)
- Traditional P.800 ACR methodology may not optimally quantify all potential impairments
- DCR methodology focuses on differences from the reference, which makes it suitable for small impairments and prosodic differences
- Previous 3GPP codec standardization (AMR, AMR-WB, EVS) used ACR for clean speech and DCR for SWB, mixed-bandwidth, noisy speech, and music evaluations

Alternative Test Methods Listed:
- Diagnostic Rhyme Tests (DRT)
- Modified Rhyme Tests (MRT)
- MOS testing for speaker similarity
- Speaker verification/identification tests
- Prosodic naturalness MOS tests
- Intonation recognition tests
- Transcription tests for word and semantic equivalence
- Phoneme recognition tests
- Automatic speech recognition tests
- P.835 multi-dimensional rating scales for speech enhancement evaluation
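
Transcription and ASR-based tests compare what was understood from the decoded speech against what was actually said. The summary does not name a scoring measure, so the sketch below assumes word error rate (WER), a common choice. The function name `word_error_rate` is hypothetical and the transcripts in the usage note are invented for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must not be empty")
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # reference word deleted
                curr[j - 1] + 1,     # extra word inserted
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)
```

For example, `word_error_rate("the cat sat on the mat", "the bat sat on mat")` counts one substitution and one deletion against six reference words. A hallucinated word can sound perfectly natural, which is why a transcript-level comparison of this kind can expose errors that a listening-quality score would miss.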

Subjective Testing Considerations (Clause 9.1.3)

Source Material Robustness (9.1.3.1):
- Multiple languages with diverse intonations
- Various phonetic and linguistic environments
- Different voice pitches and speaking styles
- Overlapping talkers

Real-world Acoustic Conditions (9.1.3.2):
- Clean environments (minimal background noise)
- Noisy environments (traffic, human chatter, vehicle)
- Varying reverberation levels (RT60 ranging from 0.3 s to 1.0 s)
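
RT60 is the time it takes reverberant sound to decay by 60 dB. As a rough illustration of what the 0.3 s to 1.0 s range means, the sketch below builds a synthetic room impulse response from exponentially decaying noise. This is an assumption made for illustration only: the summary does not say how reverberant test material is produced, and the names `amplitude_envelope` and `synthetic_rir` are hypothetical.

```python
import math
import random

def amplitude_envelope(rt60_s: float, t_s: float) -> float:
    """Linear amplitude at time t_s for a decay that reaches -60 dB at rt60_s."""
    if rt60_s <= 0:
        raise ValueError("rt60_s must be positive")
    # -60 dB in amplitude is a factor of 10**-3, reached when t_s == rt60_s.
    return 10.0 ** (-3.0 * t_s / rt60_s)

def synthetic_rir(rt60_s: float, sample_rate: int = 16000, seed: int = 0) -> list:
    """White Gaussian noise shaped by the RT60 envelope, one RT60 long."""
    rng = random.Random(seed)  # seeded so the response is reproducible
    n = round(rt60_s * sample_rate)
    return [
        rng.gauss(0.0, 1.0) * amplitude_envelope(rt60_s, i / sample_rate)
        for i in range(n)
    ]

# Impulse responses spanning the RT60 range quoted above.
rirs = {rt60: synthetic_rir(rt60) for rt60 in (0.3, 0.6, 1.0)}
```

Convolving dry speech with such a response would give a crude reverberant signal. Measured room impulse responses are generally a more realistic choice for formal testing.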

Tandeming and Compatibility Testing (9.1.3.3):
- Testing with speech previously encoded by ITU-T G.711, AMR, AMR-WB, and EVS
- Various input levels: -16 dBov, -26 dBov, and -36 dBov
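
The input levels above are expressed in dBov, that is, relative to the overload point of the digital system. The sketch below scales a signal to a target level under two stated assumptions: samples are floats normalised so that full scale is 1.0, and the level is the plain RMS over the whole signal. Formal tests typically measure active speech level instead (for example per ITU-T P.56), so this is only an approximation. The function names are hypothetical.

```python
import math

def rms_dbov(samples):
    """RMS level in dB relative to full scale (assumed to be 1.0)."""
    rms = math.sqrt(sum(x * x for x in samples) / len(samples))
    if rms == 0.0:
        raise ValueError("signal is silent; level is undefined")
    return 20.0 * math.log10(rms)

def scale_to_dbov(samples, target_dbov):
    """Apply one constant gain so the RMS level equals target_dbov."""
    gain = 10.0 ** ((target_dbov - rms_dbov(samples)) / 20.0)
    return [x * gain for x in samples]

# One second of a 440 Hz tone at 16 kHz, scaled to the three listed levels.
tone = [0.5 * math.sin(2.0 * math.pi * 440.0 * n / 16000.0) for n in range(16000)]
levelled = {level: scale_to_dbov(tone, level) for level in (-16.0, -26.0, -36.0)}
```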

Conclusion:
- P.800 ACR/DCR serve as the backbone for most subjective testing
- Other methodologies may be considered
- Emphasis on diverse test material covering multilingual/multi-speaker testing, real-world acoustic conditions, and tandeming

Objective Testing Considerations (Clause 9.1.4)

Correlation Analysis on Clean Speech (9.1.4.1):

Evaluated objective models from references [7-11]:
- Speech-oriented metrics: PESQ, POLQA, ViSQOL-S, WARP-Q, DNSMOS, NISQA, NORESQA, UTMOS
- General audio metrics: PEAQ, ViSQOL-A
- Additional metric: SCOREQ

Evaluation metrics used: Pearson correlation coefficient, RMSE, Kendall's Tau rank correlation coefficient
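
The three evaluation metrics can be sketched in a few lines of plain Python. The Kendall variant below is tau-b (with tie correction), which is an assumption, since the summary does not say which variant was used. The score lists are invented placeholders, not results from the contribution.

```python
import math

def pearson(x, y):
    """Pearson linear correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def rmse(x, y):
    """Root-mean-square error between paired scores."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))

def kendall_tau(x, y):
    """Kendall's tau-b rank correlation, computed over all pairs."""
    concordant = discordant = only_x_tied = only_y_tied = 0
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied in both: contributes to neither term
            if dx == 0:
                only_x_tied += 1
            elif dy == 0:
                only_y_tied += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    pairs = concordant + discordant
    denom = math.sqrt((pairs + only_x_tied) * (pairs + only_y_tied))
    return (concordant - discordant) / denom

# Hypothetical per-condition scores (subjective MOS vs. model prediction).
subjective = [1.8, 2.4, 3.1, 3.6, 4.2]
objective = [2.0, 2.3, 3.3, 3.5, 4.4]
report = {
    "pearson": pearson(subjective, objective),
    "rmse": rmse(subjective, objective),
    "kendall": kendall_tau(subjective, objective),
}
```

Pearson measures linear agreement, Kendall's Tau measures whether conditions are ranked in the same order, and RMSE measures absolute error on the rating scale, so a model can do well on one and poorly on another.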

Key Observations (Clean Speech):
- Best performing models (POLQA, UTMOS, PESQ, WARP-Q, SCOREQ) accurately predicted monotonic bitrate/quality behavior of multirate codecs
- Models operating at 16 kHz (PESQ without mapping, UTMOS and WARP-Q with mapping) showed relatively good performance even for fullband codecs
- Mapping generally improves accuracy (RMSE), except for a few models (PESQ, POLQA)
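
"Mapping" here means fitting a function from raw objective scores to the subjective scale before computing RMSE. The summary mentions a 3rd-order mapping for the music/mixed results but does not detail the procedure, so the following is only a minimal sketch. It assumes an unconstrained least-squares cubic fit solved through the normal equations, with no monotonicity constraint, which formal evaluations may impose. All names and data are hypothetical.

```python
def fit_polynomial(x, y, order=3):
    """Least-squares coefficients c[0] + c[1]*x + ... + c[order]*x**order."""
    m = order + 1
    if len(x) < m:
        raise ValueError("need at least order + 1 points")
    # Normal equations A c = b, A[i][j] = sum(x**(i+j)), b[i] = sum(y * x**i).
    a = [[sum(v ** (i + j) for v in x) for j in range(m)] for i in range(m)]
    b = [sum(w * v ** i for v, w in zip(x, y)) for i in range(m)]
    # Gaussian elimination with partial pivoting.
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            raise ValueError("singular system; x values may not be distinct enough")
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, m):
            factor = a[r][col] / a[col][col]
            for c in range(col, m):
                a[r][c] -= factor * a[col][c]
            b[r] -= factor * b[col]
    # Back substitution.
    coeffs = [0.0] * m
    for i in reversed(range(m)):
        tail = sum(a[i][j] * coeffs[j] for j in range(i + 1, m))
        coeffs[i] = (b[i] - tail) / a[i][i]
    return coeffs

def apply_mapping(coeffs, x):
    """Evaluate the fitted polynomial at each raw score."""
    return [sum(c * v ** k for k, c in enumerate(coeffs)) for v in x]

# Hypothetical raw model scores and subjective MOS for six conditions.
raw = [1.2, 1.9, 2.4, 3.0, 3.9, 4.6]
mos = [1.5, 2.6, 3.1, 3.5, 4.0, 4.2]
mapped = apply_mapping(fit_polynomial(raw, mos), raw)
```

RMSE would then be computed between `mapped` and `mos` instead of between `raw` and `mos`. The fit in this sketch is evaluated on the same data it was fitted to, so it says nothing about how the mapping generalises to other test sets.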

Correlation Analysis on Music/Mixed Content:

Evaluated models from references [7-12]: POLQA, PEAQ, PEMO-Q, ViSQOL-A, and 2f-model

Key Observations (Music/Mixed Content):
- POLQA, despite not being recommended for non-speech signals, gave the best correlation results (Pearson, Kendall, and RMSE after 3rd-order mapping)
- The 2f-model was the second-best performer
- ViSQOL-A, PEAQ, and PEMO-Q showed fair performance despite being adapted to music/mixed content
- Correlation scores were lower than for clean speech, possibly because predicting quality for general audio is a harder task and because of a mismatch with the grading used in the DCR test methodology

Discussion (9.1.4.2):
- P.862 (PESQ) has been officially withdrawn by ITU-T and cannot be considered a valid standard
- P.863 (POLQA) remains the main ITU-T standard; P.SAMD is emerging as a potential alternative
- Testing and parameter adjustment based on objective tools are not recommended
- 3GPP TR 26.921 documented that tuning noise reduction based on PESQ should be avoided

Conclusion (9.1.4.3):
- Subjective testing remains the "golden reference" for codec selection
- Objective metrics are NOT recommended as codec selection criteria or for codec tuning
- Correlation between subjective and objective metrics may be considered for characterization of a new codec
- Objective metrics have merits in other tasks, such as codec conformance testing

Document Type

This is a pseudo Change Request (pCR) to TR 26.940, targeting Clause 9 (Test methodologies) with additions to subclauses 9.1.1 through 9.1.4.

Document Information
Source: Orange, Dolby Laboratories Inc.
Type: pCR
For: Agreement
Title: On the use of objective metrics in ULBC standardization
Agenda item: 7.8
Agenda item description: FS_ULBC (Study on Ultra Low Bitrate Speech Codec)
Release: Rel-20
Specification: 26.940
Version: 0.5.1
Related WIs: FS_ULBC
Contact: Stephane Ragot
Uploaded: 2026-02-03T22:54:57.073000
Contact ID: 32055
Revised to: S4-260305
TDoc Status: revised
Is revision of: S4-260235
Reservation date: 03/02/2026 22:47:28
Agenda item sort order: 20