On the use of objective metrics in ULBC standardization
This document addresses the Study on Ultra Low Bitrate Speech Codec (FS_ULBC), specifically focusing on performance requirements and test methodologies as defined in the WID. The contribution targets study objective 5 regarding speech quality, intelligibility, and conversational quality testing under various conditions (clean/noisy speech, tandeming with IMS codecs, clean/GEO channel conditions).
The document identifies specific impairment categories relevant to ULBC:
- Loss of listening-only audio quality
- Audio bandwidth loss
- Impaired intelligibility
- Impaired speaker identifiability
- Prosodic impairments
- Hallucination (word and phone confusions)
- Sensitivity to non-speech input (background noise, music, noisy speech, interfering talkers, reverberant speech)
The document additionally notes that ULBC may incorporate speech enhancement algorithms (noise suppression, gain normalization).
The document highlights that ULBC testing introduces new challenges compared to signal processing-based codecs (AMR, AMR-WB, EVS):
Traditional 3GPP Approach:
- Historical reliance on ITU-T P.800 ACR (Absolute Category Rating) for clean speech
- P.800 DCR (Degradation Category Rating) for SWB clean speech, mixed-bandwidth, speech + background noise, and music/mixed content
- Previous codec standardizations did not focus on intelligibility, speaker identifiability, or prosodic impairments
ULBC-Specific Considerations:
- ML-based coding systems introduce new impairment types (e.g., hallucination) not present in signal-processing codecs
- ACR may not optimally quantify all impairments (hallucination, intelligibility, prosodic issues)
- DCR focuses on differences from the reference; such differences may not directly affect conversational capability but can affect aspects such as identity recognition
Alternative Test Methodologies Listed:
- Diagnostic Rhyme Tests (DRT)
- Modified Rhyme Tests (MRT)
- MOS testing for speaker similarity
- Speaker verification/identification tests
- Prosodic naturalness MOS tests
- Intonation recognition tests
- Transcription tests for word and semantic equivalence
- Phoneme recognition tests
- Automatic speech recognition tests
- P.835 multi-dimensional rating scales for speech enhancement evaluation
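The source does not say how the transcription and automatic speech recognition tests listed above would be scored. One common choice is the word error rate (WER). The following is a minimal sketch under that assumption; the function name `word_error_rate` is hypothetical and is not taken from the contribution.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumption: whitespace tokenization and case folding are sufficient;
    a real test plan would define its own text normalization.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)
```

Comparing the transcript of the decoded signal against the transcript of the uncoded reference, rather than against ground-truth text alone, would help separate codec-induced word confusions from recognizer errors.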
Robustness Related to Source Material (9.1.3.1):
- Multiple languages with diverse intonations
- Non-speech signals
- Various linguistic features and accents
- Wide range of speakers (different voice pitches, speaking styles)
- Overlapping talkers
Simulation of Real-world Acoustic Conditions (9.1.3.2):
- Clean environments (minimal background noise)
- Noisy environments (traffic, human chatter, vehicle)
- Various reverberation levels (RT60 ranging from 0.3s to 1.0s)
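As an illustration of what the RT60 range above means, RT60 is the time the reverberant energy takes to decay by 60 dB, which corresponds to an amplitude factor of 1/1000. The sketch below generates a crude synthetic impulse response from exponentially decaying noise. This is an assumption for illustration only; the source does not say whether measured or simulated room responses would be used, and the names `decay_envelope` and `synthetic_rir` are hypothetical.

```python
import random
from typing import List


def decay_envelope(t_s: float, rt60_s: float) -> float:
    """Amplitude envelope that has fallen by 60 dB (factor 1000) at t = rt60_s."""
    return 10.0 ** (-3.0 * t_s / rt60_s)


def synthetic_rir(rt60_s: float, sample_rate_hz: int = 16000, seed: int = 0) -> List[float]:
    """Gaussian noise shaped by the RT60 envelope, truncated at t = rt60_s.

    Assumption: this ignores early reflections and frequency-dependent decay,
    so it is only a rough stand-in for a measured room impulse response.
    """
    rng = random.Random(seed)
    n_samples = int(round(rt60_s * sample_rate_hz))
    return [
        rng.gauss(0.0, 1.0) * decay_envelope(n / sample_rate_hz, rt60_s)
        for n in range(n_samples)
    ]
```

Convolving clean speech with such a response for RT60 values of 0.3 s and 1.0 s would cover the two ends of the range listed above.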
Tandeming and Compatibility Testing (9.1.3.3):
- Testing with speech previously encoded by ITU-T G.711, AMR, AMR-WB, and EVS
- Various input levels: -16dBov, -26dBov, and -36dBov
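The input levels above are expressed in dBov, that is, relative to the overload point of the digital system. The following minimal sketch scales a signal to a target level. It assumes floating-point samples with an overload point of ±1.0 and uses a plain RMS measurement; formal test preparation would more likely use an active speech level measurement, so treat this only as an approximation. The function names are hypothetical.

```python
import math
from typing import List


def rms_level_dbov(samples: List[float]) -> float:
    """RMS level in dBov, assuming the overload point is +/-1.0.

    With this convention a full-scale square wave is at 0 dBov and a
    full-scale sine wave is at about -3.01 dBov.
    """
    mean_square = sum(x * x for x in samples) / len(samples)
    if mean_square == 0.0:
        return float("-inf")
    return 10.0 * math.log10(mean_square)


def scale_to_dbov(samples: List[float], target_dbov: float) -> List[float]:
    """Apply a constant gain so that the RMS level equals target_dbov."""
    current = rms_level_dbov(samples)
    if math.isinf(current):
        raise ValueError("cannot scale a silent signal")
    gain = 10.0 ** ((target_dbov - current) / 20.0)
    return [x * gain for x in samples]
```

For example, `scale_to_dbov(signal, -26.0)` would produce the nominal-level condition, and -16.0 and -36.0 the high-level and low-level conditions.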
Conclusion (9.1.3.4):
- ITU-T P.800 ACR/DCR serve as the backbone for most subjective testing
- Other methodologies may be considered
- Emphasis on diverse test material: multilingual/multi-speaker testing, real-world acoustic conditions, and tandeming
Correlation Analysis Results (9.1.4.1):
The document presents correlation analysis based on ACR experiments (clause 7.3.3) evaluating objective models:
Speech-oriented metrics: PESQ, POLQA, ViSQOL-S, WARP-Q, DNSMOS, NISQA, NORESQA, UTMOS, SCOREQ
General audio metrics: PEAQ, ViSQOL-A
Evaluation metrics used: Pearson correlation coefficient, RMSE, Kendall's Tau rank correlation coefficient
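The three evaluation metrics can be sketched as follows, taking per-condition subjective MOS values and the corresponding objective predictions as inputs. This is a minimal illustration rather than the contribution's actual evaluation code. The function names are hypothetical, and the Kendall variant shown is tau-a (no tie correction), which is an assumption because the source does not state which variant was used.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient (strength of the linear relationship)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def rmse(x: Sequence[float], y: Sequence[float]) -> float:
    """Root mean square error (absolute prediction accuracy)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))


def kendall_tau_a(x: Sequence[float], y: Sequence[float]) -> float:
    """Kendall's tau-a: (concordant - discordant) pairs over all pairs.

    Tied pairs count as neither concordant nor discordant; the tau-b
    variant would additionally correct the denominator for ties.
    """
    n = len(x)
    score = 0
    for i in range(n):
        for j in range(i + 1, n):
            product = (x[i] - x[j]) * (y[i] - y[j])
            if product > 0:
                score += 1
            elif product < 0:
                score -= 1
    return score / (n * (n - 1) / 2)
```

Pearson and Kendall are insensitive to a constant offset or scale difference between the objective and subjective scores, whereas RMSE is not. This is why a mapping step matters mainly for RMSE.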
Key Observations for Clean Speech:
- Best performing models (POLQA, UTMOS, PESQ, WARP-Q, SCOREQ) accurately predicted monotonic bitrate/quality behavior
- 16 kHz models (PESQ without mapping, UTMOS and WARP-Q with mapping) showed relatively good performance even for fullband codecs
- Mapping generally improves accuracy (RMSE), except for a few models (PESQ, POLQA)
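The mapping step referred to above fits a function from objective scores to subjective scores before RMSE is computed. The source mentions a 3rd order mapping for the music/mixed content results; the sketch below assumes the same form here, fitted by ordinary least squares. All function names are hypothetical, and the fit is unconstrained, so monotonicity is not enforced.

```python
import math
from typing import List, Sequence


def solve_linear_system(a: List[List[float]], b: List[float]) -> List[float]:
    """Gaussian elimination with partial pivoting for a small square system."""
    n = len(b)
    # Augmented matrix [a | b], copied so the inputs are not modified.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(col + 1, n):
            factor = m[row][col] / m[col][col]
            for k in range(col, n + 1):
                m[row][k] -= factor * m[col][k]
    # Back substitution.
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(m[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (m[row][n] - tail) / m[row][row]
    return x


def fit_polynomial_mapping(objective: Sequence[float],
                           subjective: Sequence[float],
                           order: int = 3) -> List[float]:
    """Least-squares coefficients c[0] + c[1]*x + ... + c[order]*x**order."""
    size = order + 1
    # Normal equations (V^T V) c = V^T y for the Vandermonde matrix V.
    normal = [[sum(x ** (i + j) for x in objective) for j in range(size)]
              for i in range(size)]
    rhs = [sum(y * x ** i for x, y in zip(objective, subjective))
           for i in range(size)]
    return solve_linear_system(normal, rhs)


def apply_mapping(coeffs: Sequence[float], x: float) -> float:
    """Evaluate the fitted polynomial at one objective score."""
    return sum(c * x ** k for k, c in enumerate(coeffs))


def rmse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Root mean square error between two equally long sequences."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred))
```

Because the identity function is itself a polynomial of order at most 3, the RMSE after this least-squares mapping cannot exceed the unmapped RMSE on the data used for the fit. If mapping worsens RMSE for some models, as reported above, a different mapping procedure was presumably used (for example a constrained or cross-validated fit); the source does not specify.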
Correlation Analysis for Music/Mixed Content:
Based on DCR experiments (clause 7.3.4), evaluating: POLQA, PEMO-Q, ViSQOL-A, and 2f-model
Key Observations for Music/Mixed Content:
- POLQA, despite not being recommended for non-speech signals, showed the best correlation results (Pearson, Kendall, and RMSE after 3rd order mapping)
- 2f-model was second-best performing
- ViSQOL-A, PEAQ, and PEMO-Q showed fair performance
- Correlation scores were lower than for clean speech, possibly because predicting general audio quality is a more difficult task and because the models are mismatched with the DCR grading methodology
Discussion (9.1.4.2):
- ITU-T has officially "withdrawn" P.862 (PESQ), so it cannot be considered a valid standard
- P.863 (POLQA) remains the main ITU-T standard, with P.SAMD emerging as a potential alternative
- Testing and parameter adjustment based on objective tools are not recommended
- 3GPP TR 26.921 documented that tuning noise reduction based on PESQ should be avoided
Conclusion (9.1.4.3):
- Subjective testing remains the "golden reference" for codec selection
- Objective metrics are NOT recommended as codec selection criteria or for codec tuning
- Correlation of subjective and objective metrics may be considered for codec characterization
- Objective metrics have merits in other tasks such as codec conformance testing
The document proposes comprehensive revisions to TR 26.940 v0.5.1, specifically to Clause 9 (Test methodologies), incorporating all the analysis and recommendations detailed above regarding both subjective and objective testing approaches for ULBC standardization.