On the use of objective metrics in ULBC standardization
This document addresses the "Study on Ultra Low Bitrate Speech Codec" (FS_ULBC), approved at SA#107. It focuses on study objective 5 of the WID, which covers performance requirements and test methodologies for speech quality, intelligibility, and conversational quality under various conditions (clean and noisy speech, tandeming with IMS codecs, clean and GEO channel conditions).
The contribution provides correlation analysis results of objective quality models as a complement to subjective test results on clean speech and music/mixed content in TR 26.940, building upon previous discussions in S4-251814.
Quality Impairment Categories for ULBC:
- Loss of listening-only audio quality
- Audio bandwidth loss
- Impaired intelligibility
- Impaired speaker identifiability
- Prosodic impairments
- Hallucination (word and phone confusions)
- Sensitivity to non-speech input (background noise, music, noisy speech, interfering talkers, reverberant speech)
Testing Challenges:
- ML-based ULBC codecs introduce new impairment categories (e.g., hallucination) that are not present in signal-processing-based codecs (AMR, AMR-WB, EVS)
- Traditional P.800 ACR methodology may not optimally quantify all potential impairments
- DCR methodology focuses on differences from the reference and is suitable for small impairments and prosodic differences
- Previous 3GPP codec standardization (AMR, AMR-WB, EVS) used ACR for clean speech and DCR for SWB, mixed-bandwidth, noisy speech, and music evaluations
Alternative Test Methods Listed:
- Diagnostic Rhyme Tests (DRT)
- Modified Rhyme Tests (MRT)
- MOS testing for speaker similarity
- Speaker verification/identification tests
- Prosodic naturalness MOS tests
- Intonation recognition tests
- Transcription tests for word and semantic equivalence
- Phoneme recognition tests
- Automatic speech recognition tests
- P.835 multi-dimensional rating scales for speech enhancement evaluation
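The transcription and automatic speech recognition tests listed above are usually scored by comparing a transcript of the decoded speech against a transcript of the original. A minimal sketch of one such score, the word error rate (WER), is shown here. The function name and the simple lower-casing and whitespace tokenization are assumptions for illustration; the source does not prescribe a scoring procedure.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # reference word deleted
                curr[j - 1] + 1,     # extra word inserted
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)
```

A hallucinated word in the decoded speech appears as a substitution or insertion and raises the score. WER alone does not distinguish semantically harmless substitutions from harmful ones.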
Source Material Robustness (9.1.3.1):
- Multiple languages with diverse intonations
- Various phonetic and linguistic environments
- Different voice pitches and speaking styles
- Overlapping talkers
Real-world Acoustic Conditions (9.1.3.2):
- Clean environments (minimal background noise)
- Noisy environments (traffic, human chatter, vehicle)
- Varying reverberation levels (RT60 ranging from 0.3s to 1.0s)
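To make the RT60 range concrete: RT60 is the time the reverberant energy takes to decay by 60 dB, which corresponds to an amplitude envelope of 10^(-3t/RT60). The sketch below generates a synthetic room impulse response as exponentially decaying Gaussian noise. This is a simplified model chosen for illustration only; the function names, the 16 kHz default, and the truncation at t = RT60 are assumptions, and the source does not say whether measured or synthetic impulse responses are intended.

```python
import math
import random


def decay_envelope(t: float, rt60: float) -> float:
    """Amplitude envelope that has fallen by 60 dB at t = rt60 seconds."""
    return 10.0 ** (-3.0 * t / rt60)


def synthetic_rir(rt60: float, fs: int = 16000, seed: int = 0) -> list[float]:
    """Exponentially decaying Gaussian noise, truncated at t = rt60."""
    rng = random.Random(seed)  # fixed seed keeps the output reproducible
    n = round(rt60 * fs)
    return [rng.gauss(0.0, 1.0) * decay_envelope(i / fs, rt60) for i in range(n)]
```

Convolving clean speech with `synthetic_rir(0.3)` and `synthetic_rir(1.0)` would approximate the two ends of the range given above.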
Tandeming and Compatibility Testing (9.1.3.3):
- Testing with speech previously encoded by ITU-T G.711, AMR, AMR-WB, and EVS
- Various input levels: -16dBov, -26dBov, and -36dBov
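The input levels above are expressed in dBov, that is, relative to the overload point of the digital system, so a full-scale sine wave sits at about -3.01 dBov. The following sketch shows how a signal could be scaled to one of these levels. It assumes floating-point samples with a full scale of 1.0 and uses a plain RMS over the whole signal. Codec test plans typically normalize on the active speech level (ITU-T P.56) instead, which this sketch does not implement. The function names are illustrative.

```python
import math


def rms_level_dbov(samples: list[float], full_scale: float = 1.0) -> float:
    """RMS level of the whole signal in dB relative to the overload point."""
    rms = math.sqrt(sum(x * x for x in samples) / len(samples))
    if rms == 0.0:
        raise ValueError("cannot measure the level of an all-zero signal")
    return 20.0 * math.log10(rms / full_scale)


def scale_to_dbov(samples: list[float], target_dbov: float,
                  full_scale: float = 1.0) -> list[float]:
    """Apply a constant gain so that the RMS level equals target_dbov."""
    gain_db = target_dbov - rms_level_dbov(samples, full_scale)
    gain = 10.0 ** (gain_db / 20.0)
    return [x * gain for x in samples]
```

For example, `scale_to_dbov(signal, -26.0)` produces the nominal-level version of a test item, and -16.0 and -36.0 produce the high-level and low-level versions.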
Conclusion:
- P.800 ACR/DCR serves as the backbone for most subjective testing
- Other methodologies may be considered
- Emphasis on diverse test material covering multilingual/multi-speaker testing, real-world acoustic conditions, and tandeming
Correlation Analysis on Clean Speech (9.1.4.1):
Evaluated objective models from references [7-11]:
- Speech-oriented metrics: PESQ, POLQA, ViSQOL-S, WARP-Q, DNSMOS, NISQA, NORESQA, UTMOS
- General audio metrics: PEAQ, ViSQOL-A
- Additional metric: SCOREQ
Evaluation metrics used: Pearson correlation coefficient, RMSE, Kendall's Tau rank correlation coefficient
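The three figures of merit named above can be computed per model from paired subjective and objective scores, one pair per test condition. The following is a minimal sketch in plain Python, not the evaluation code behind the reported results. The function names are illustrative, and the Kendall variant shown (tau-b, which corrects for ties) is an assumption, since the source does not state which variant was used.

```python
import math


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson linear correlation coefficient."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def rmse(x: list[float], y: list[float]) -> float:
    """Root-mean-square error between paired scores."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))


def kendall_tau_b(x: list[float], y: list[float]) -> float:
    """Kendall rank correlation (tau-b), counted over all pairs of conditions."""
    concordant = discordant = tied_x_only = tied_y_only = 0
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            dx = x[i] - x[j]
            dy = y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied in both rankings: excluded from both factors
            if dx == 0:
                tied_x_only += 1
            elif dy == 0:
                tied_y_only += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    untied = concordant + discordant
    denom = math.sqrt((untied + tied_x_only) * (untied + tied_y_only))
    return (concordant - discordant) / denom
```

Pearson measures linear agreement, Kendall's tau measures agreement in rank order only, and RMSE measures absolute accuracy on the MOS scale. A model can therefore rank codecs correctly (high tau) while still showing a large RMSE.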
Key Observations (Clean Speech):
- Best performing models (POLQA, UTMOS, PESQ, WARP-Q, SCOREQ) accurately predicted monotonic bitrate/quality behavior of multirate codecs
- Models operating at 16 kHz (PESQ without mapping, UTMOS and WARP-Q with mapping) showed relatively good performance even for fullband codecs
- Mapping generally improves accuracy (RMSE) except for a few models (PESQ, POLQA)
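The "mapping" referred to above fits a function from raw objective scores to the subjective scale before the RMSE is computed. A minimal sketch of a third-order polynomial mapping, fitted by ordinary least squares in plain Python, is shown below. The function names are illustrative. The sketch does not enforce monotonicity, which evaluation practice often requires of such mappings, and the source does not say how the mapping was fitted.

```python
import math


def fit_polynomial(x: list[float], y: list[float], order: int = 3) -> list[float]:
    """Least-squares coefficients c[0] + c[1]*x + ... + c[order]*x**order."""
    m = order + 1
    # Normal equations A c = b with A[r][k] = sum(x^(r+k)), b[r] = sum(y * x^r).
    a = [[sum(xi ** (r + k) for xi in x) for k in range(m)] for r in range(m)]
    b = [sum(yi * xi ** r for xi, yi in zip(x, y)) for r in range(m)]

    # Gaussian elimination with partial pivoting.
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, m):
            factor = a[r][col] / a[col][col]
            for k in range(col, m):
                a[r][k] -= factor * a[col][k]
            b[r] -= factor * b[col]

    # Back substitution.
    coeffs = [0.0] * m
    for r in range(m - 1, -1, -1):
        partial = sum(a[r][k] * coeffs[k] for k in range(r + 1, m))
        coeffs[r] = (b[r] - partial) / a[r][r]
    return coeffs


def apply_polynomial(coeffs: list[float], x: list[float]) -> list[float]:
    """Evaluate the fitted polynomial at each raw objective score."""
    return [sum(c * xi ** k for k, c in enumerate(coeffs)) for xi in x]


def rmse(x: list[float], y: list[float]) -> float:
    """Root-mean-square error between paired scores."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(x, y)) / len(x))
```

When the mapping is fitted and evaluated on the same data, the least-squares fit cannot increase the RMSE relative to the unmapped scores, because the identity mapping is itself a polynomial. An observation that mapping does not help a given model would therefore suggest a different fitting procedure, such as a constrained fit or separate fitting and evaluation data; the source does not specify which applies.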
Correlation Analysis on Music/Mixed Content:
Evaluated models from references [7-12]: POLQA, PEMO-Q, ViSQOL-A, and 2f-model
Key Observations (Music/Mixed Content):
- POLQA, despite not being recommended for non-speech signals, gave the best correlation results (Pearson, Kendall, and RMSE after third-order mapping)
- The 2f-model was the second-best performer
- ViSQOL-A, PEAQ, and PEMO-Q showed fair performance despite being adapted to music/mixed content
- Correlation scores were lower than for clean speech, possibly because predicting quality for general audio is a more difficult task and because of a mismatch with the DCR test methodology grading
Discussion (9.1.4.2):
- P.862 (PESQ) is officially "withdrawn" by ITU-T and cannot be considered a valid standard
- P.863 (POLQA) remains the main ITU-T standard; P.SAMD is emerging as a potential alternative
- Testing and parameter adjustment based on objective tools is not recommended
- 3GPP TR 26.921 documented that tuning noise reduction based on PESQ should be avoided
Conclusion (9.1.4.3):
- Subjective testing remains the "golden reference" for codec selection
- Objective metrics are NOT recommended as codec selection criteria or for codec tuning
- Correlation between subjective and objective metrics may be considered for characterization of the new codec
- Objective metrics have merits in other tasks such as codec conformance testing
This is a proposed Change Request (pCR) to TR 26.940, specifically targeting Clause 9 (Test methodologies) with additions to subclauses 9.1.1 through 9.1.4.