S4-260121

[FS_Avatar_Ph2_MED] Avatar Evaluation Framework and Objective Metrics

Source: Qualcomm Atheros, Inc.
Meeting: TSGS4_135_India
Agenda Item: 9.8

All Metadata
Agenda item description: FS_Avatar_Ph2_MED (Study on Avatar communication Phase 2)
Doc type: discussion
Contact: Imed Bouazizi (Contact ID 84417)
Uploaded: 2026-02-03T21:49:01
Revised to: S4-260355
TDoc Status: revised
Reservation date: 2026-02-03 05:48:54
Agenda item sort order: 43
Review Comments
manager - 2026-02-11 07:04


  1. [Technical] The “black-box evaluation from rendered video” principle is not sufficient for cross-vendor comparability unless the contribution also standardizes the avatar asset (mesh/topology, textures, rig), camera pose, lighting model, tone mapping, and renderer settings; otherwise PSNR/SSIM and landmark-based errors will mostly measure rendering/asset differences rather than animation/transport performance.
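One way to make this concrete is a shared render-configuration manifest that pins every rendering degree of freedom before metrics are computed. All field names and values below are hypothetical illustrations, not taken from the contribution:

```python
# Illustrative manifest pinning everything that must be identical across
# vendors so that PSNR/SSIM and landmark errors measure animation/transport
# quality rather than rendering or asset differences.
# Every key and value here is a hypothetical example.
RENDER_CONFIG = {
    "avatar_asset": {"mesh_sha256": "<hash>", "texture_set": "<id>", "rig": "<id>"},
    "camera": {"position_m": [0.0, 1.6, 1.2], "look_at_m": [0.0, 1.6, 0.0],
               "fov_deg": 40.0, "resolution": [1920, 1080]},
    "lighting": {"model": "image_based", "environment_map": "<id>", "exposure_ev": 0.0},
    "tone_mapping": {"operator": "ACES", "gamma": 2.2},
    "renderer": {"name": "<renderer-id>", "version": "<pinned>", "samples_per_pixel": 64},
}
```

Without such a manifest (or an equivalent normative table), two vendors rendering the same animation stream can legitimately produce different pixels.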
  2. [Technical] The proposal to “include PSNR/SSIM” as visual quality metrics is weak for avatar content because these metrics are highly sensitive to small viewpoint/lighting differences and do not correlate well with perceptual quality for faces; the document should justify their relevance or add more appropriate perceptual metrics (e.g., VMAF/LPIPS or face-region weighted metrics) and define ROI handling.
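As a sketch of the ROI handling the comment asks for, a face-region-weighted PSNR could look like the following; the weighting scheme and threshold are assumptions for illustration, not a proposed standard:

```python
import numpy as np

def roi_weighted_psnr(ref, impaired, roi_mask, roi_weight=4.0, peak=255.0):
    """PSNR with the face region up-weighted; a sketch, not a standardized metric.

    ref, impaired : arrays of identical shape (H, W) or (H, W, C).
    roi_mask      : boolean array (H, W), True inside the face region.
    roi_weight    : relative weight of ROI pixels (illustrative default).
    """
    ref = np.asarray(ref, dtype=np.float64)
    impaired = np.asarray(impaired, dtype=np.float64)
    w = np.where(roi_mask, roi_weight, 1.0)
    if ref.ndim == 3:                       # broadcast mask over color channels
        w = w[..., None]
    mse = np.sum(w * (ref - impaired) ** 2) / np.sum(w * np.ones_like(ref))
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(peak ** 2 / mse)
```

Errors concentrated in the face region then lower the score more than equal-magnitude background errors, which plain PSNR cannot distinguish.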
  3. [Technical] The animation metrics (LVE/FDD/MVE) rely on extracting landmarks/skeletons from rendered video, but no standardized detector/model, confidence handling, occlusion policy, or failure mode is defined—making results non-repeatable and vendor-dependent despite the stated reproducibility goal.
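A minimal example of the kind of confidence/occlusion policy the contribution leaves undefined; the threshold and the requirement to report the valid fraction are assumptions for illustration:

```python
import numpy as np

def masked_landmark_error(ref, detected, conf, conf_thresh=0.5):
    """Mean per-landmark Euclidean error, ignoring low-confidence detections.

    ref, detected : (N, 2) landmark coordinates from reference and test video.
    conf          : (N,) detector confidences for the test video.
    Returns (error, valid_fraction). Reporting both matters: a detector that
    silently drops hard landmarks would otherwise look artificially good.
    This is a sketch of one possible policy, not a standardized one.
    """
    valid = np.asarray(conf) >= conf_thresh
    if not np.any(valid):
        return float("nan"), 0.0        # explicit failure mode, not silence
    d = np.linalg.norm(np.asarray(ref)[valid] - np.asarray(detected)[valid], axis=-1)
    return float(d.mean()), float(valid.mean())
```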
  4. [Technical] Units for LVE/FDD/MVE are given as “pixels/mm” without defining the pixel-to-mm mapping, camera calibration, depth assumptions, or whether errors are computed in 2D image space vs 3D space; this ambiguity will lead to incomparable results across resolutions/FOVs.
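Under a pinhole camera model, the pixel-to-mm mapping the comment asks for requires at least the calibrated focal length and the landmark depth, neither of which the document pins down. A sketch of the conversion, assuming those quantities are available from the test setup:

```python
def pixels_to_mm(error_px, depth_mm, focal_px):
    """Convert a 2D image-space error to millimetres under a pinhole model.

    error_px : error magnitude measured in pixels at the landmark.
    depth_mm : camera-to-landmark distance along the optical axis.
    focal_px : focal length in pixels (fx from the intrinsic matrix).
    Only valid when depth and intrinsics are fixed by the test setup;
    without them the "pixels/mm" unit in the contribution is undefined.
    """
    return error_px * depth_mm / focal_px
```

For example, the same 5 px error at 600 mm depth corresponds to 3 mm when fx = 1000 px but 1.5 mm when fx = 2000 px (e.g. doubled resolution at the same field of view), so pixel-unit results are not comparable across resolutions or FOVs.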
  5. [Technical] “Reference rendered video from both high-quality reference pipeline and source capture” is internally inconsistent: if the reference is a rendered video, it bakes in a specific renderer/asset; if the reference is source capture, it is not directly comparable to a stylized avatar render—this needs a clear definition of the reference signal per metric.
  6. [Technical] The framework mixes evaluation of animation technique quality with transport/network impairment testing, but does not specify how to isolate network effects from renderer/animation effects (e.g., fixed decoder, fixed renderer, controlled impairment injection points), risking confounded conclusions.
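One concrete way to define the impairment injection point is a dedicated network-emulation hop between sender and receiver while decoder and renderer stay fixed, e.g. with Linux `tc`/netem; the interface name and values below are illustrative only:

```shell
# Illustrative controlled impairment injection (not from the contribution):
# emulate 100 ms delay with 20 ms jitter and 1% loss on the transport path,
# leaving decoder and renderer configurations untouched.
tc qdisc add dev eth0 root netem delay 100ms 20ms loss 1%
```

With the impairment isolated to one hop and everything downstream pinned, metric deltas can be attributed to transport rather than to the animation or rendering stack.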
  7. [Technical] The “normative capture workflow using lossless/visually lossless compression” is underspecified: “visually lossless” is subjective, and without mandated codec/settings, color format, bit depth, and color management (RGB/YUV, transfer function), objective metrics will vary significantly.
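As an illustration of the level of detail needed, a capture pinning could mandate something like the following FFmpeg invocation; the specific codec, pixel format, and color tags here are assumptions, not a proposal from the document:

```shell
# Illustrative capture pinning: mathematically lossless H.264 (-qp 0),
# 10-bit 4:4:4, explicit BT.709 tagging, so every lab's metrics engine
# sees identical pixels regardless of capture host.
ffmpeg -framerate 60 -i render_%06d.png \
       -c:v libx264 -qp 0 -preset veryslow \
       -pix_fmt yuv444p10le \
       -color_primaries bt709 -color_trc bt709 -colorspace bt709 \
       capture_lossless.mkv
```

Whatever settings are chosen, they must be mandated exactly; "visually lossless" leaves codec, chroma subsampling, bit depth, and color management as free variables that move the objective scores.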
  8. [Technical] Temporal metrics are deferred to a “second phase,” yet the proposals ask to include the metric set in TR 26.813; this is incomplete for Phase 2 objectives where latency/sync are central, and at minimum measurement definitions and required instrumentation should be provided now.
  9. [Technical] Motion-to-photon and end-to-end latency measurement cannot be derived reliably from frame timestamps alone; the contribution needs a concrete method (e.g., LED/photodiode, timecode watermarking, event markers) and a definition of clock synchronization across stimulus, renderer, and capture.
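A sketch of the event-marker approach: pair each injected stimulus event (e.g. a pose step that toggles an on-screen patch) with the first photodiode edge after it, after correcting for a separately measured clock offset. All names and the pairing rule are illustrative assumptions:

```python
import numpy as np

def motion_to_photon_ms(stimulus_ts_ms, photodiode_ts_ms, clock_offset_ms=0.0):
    """Per-event latency: first photodiode edge at or after each stimulus event.

    stimulus_ts_ms   : event times on the stimulus-player clock (ms).
    photodiode_ts_ms : detected display edges on the capture clock (ms).
    clock_offset_ms  : measured capture-minus-stimulus clock offset, e.g. from
                       a shared timecode -- the piece frame timestamps alone
                       cannot provide.
    """
    edges = np.asarray(photodiode_ts_ms, dtype=np.float64) - clock_offset_ms
    latencies = []
    for t in stimulus_ts_ms:
        later = edges[edges >= t]
        if later.size:                      # unmatched events are dropped here;
            latencies.append(float(later[0] - t))  # a real method must report them
    return latencies
```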
  10. [Technical] Dropped frame ratio based on “missing or repeated frame indices” assumes access to frame indices or deterministic numbering; in a pure black-box capture, repeated frames must be detected via content analysis, which is not specified and is error-prone for low-motion scenes.
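A minimal content-analysis repeat detector, illustrating why this is error-prone: a single mean-absolute-difference threshold cannot distinguish a repeated frame from a genuinely static low-motion scene. The threshold value is an arbitrary placeholder:

```python
import numpy as np

def count_repeated_frames(frames, diff_thresh=1.0):
    """Count frames nearly identical to their predecessor (candidate repeats).

    frames      : iterable of (H, W) or (H, W, C) pixel arrays.
    diff_thresh : mean-absolute-difference below which a frame is flagged as
                  a repeat. A fixed global threshold is exactly the fragility
                  the comment points out: low-motion content produces false
                  positives unless the method is calibrated or uses markers.
    """
    repeats = 0
    prev = None
    for f in frames:
        f = np.asarray(f, dtype=np.float64)
        if prev is not None and np.mean(np.abs(f - prev)) < diff_thresh:
            repeats += 1
        prev = f
    return repeats
```

A standardized method would likely need embedded frame markers (e.g. a corner test pattern encoding a frame counter) rather than raw content differencing.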
  11. [Technical] Audio-visual sync via cross-correlation between mouth motion and audio is not robust across phonemes, silence, or expressive motion; the document should define the feature extraction (viseme classifier vs lip aperture signal), windowing, and acceptable error bounds.
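The cross-correlation itself is simple; the undefined parts are the feature signals and windowing. A sketch assuming both a lip-aperture signal and an audio envelope are already extracted at the video frame rate (circular correlation is used here purely for brevity):

```python
import numpy as np

def av_offset_ms(lip_aperture, audio_envelope, fps, max_lag_frames=15):
    """Estimate AV offset by cross-correlating lip aperture with audio envelope.

    Both inputs are sampled at the video frame rate; the result is positive
    when audio leads video. This sketch omits the windowing, silence handling,
    and feature definitions (viseme classifier vs. lip aperture) the comment
    says must be specified for robust use.
    """
    a = np.asarray(lip_aperture, dtype=np.float64)
    b = np.asarray(audio_envelope, dtype=np.float64)
    a -= a.mean()
    b -= b.mean()
    lags = list(range(-max_lag_frames, max_lag_frames + 1))
    scores = [np.sum(a * np.roll(b, lag)) for lag in lags]
    best_lag = lags[int(np.argmax(scores))]
    return 1000.0 * best_lag / fps
```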
  12. [Editorial] The “four key principles” section lists only three principles; either the count should be corrected or the missing principle should be added.
  13. [Editorial] Several terms are introduced without alignment to existing 3GPP/3GPP2/ITU terminology (e.g., “stimulus player,” “render configuration,” “metrics engine”); mapping to TR 26.813 structure and definitions is needed to avoid creating parallel vocabulary.
  14. [Editorial] The proposal to “define normative capture workflow” is potentially inappropriate for a TR (typically informative); the document should clarify whether it intends normative text in a TS or provide informative guidance with clearly stated assumptions and limitations.