Meeting: TSGS4_135_India | Agenda Item: 9.8
[FS_Avatar_Ph2_MED] Avatar Evaluation Framework and Objective Metrics
Qualcomm Atheros, Inc.
discussion
| TDoc | S4-260121 |
| Title | [FS_Avatar_Ph2_MED] Avatar Evaluation Framework and Objective Metrics |
| Source | Qualcomm Atheros, Inc. |
| Agenda item | 9.8 |
| Agenda item description | FS_Avatar_Ph2_MED (Study on Avatar communication Phase 2) |
| Doc type | discussion |
| Download URL | https://www.3gpp.org/ftp/tsg_sa/WG4_CODEC/TSGS4_135_India/Docs/S4-260121.zip |
| Contact | Imed Bouazizi |
| Uploaded | 2026-02-03T21:49:01.090000 |
| Contact ID | 84417 |
| Revised to | S4-260355 |
| TDoc Status | revised |
| Reservation date | 03/02/2026 05:48:54 |
| Agenda item sort order | 43 |
[Technical] The “black-box evaluation from rendered video” principle is not sufficient for cross-vendor comparability unless the contribution also standardizes the avatar asset (mesh/topology, textures, rig), camera pose, lighting model, tone mapping, and renderer settings; otherwise PSNR/SSIM and landmark-based errors will mostly measure rendering/asset differences rather than animation/transport performance.
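To make the scope of this concern concrete, the following is a minimal sketch of the kind of render-configuration manifest the framework would have to pin down; all field names and values are illustrative, not proposed text:

```python
# Hypothetical render-configuration manifest. Every field below must be
# identical across vendors before PSNR/SSIM deltas can be attributed to
# animation/transport rather than to rendering or asset differences.
RENDER_CONFIG = {
    "avatar_asset": {
        "mesh_sha256": "<hash of canonical mesh/topology>",
        "textures_sha256": "<hash of canonical texture set>",
        "rig_sha256": "<hash of canonical rig>",
    },
    "camera": {
        "resolution": [1920, 1080],
        "intrinsics_px": {"fx": 1170.0, "fy": 1170.0, "cx": 960.0, "cy": 540.0},
        "pose": {"position_m": [0.0, 1.6, 1.2], "look_at_m": [0.0, 1.6, 0.0]},
    },
    "lighting": {"model": "image_based", "environment_map_sha256": "<hash>"},
    "tone_mapping": {"curve": "bt709", "exposure_ev": 0.0},
    "renderer": {"name": "<pinned reference renderer>", "version": "<pinned>"},
}
```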
[Technical] The proposal to “include PSNR/SSIM” as visual quality metrics is weak for avatar content because these metrics are highly sensitive to small viewpoint/lighting differences and do not correlate well with perceptual quality for faces; the document should justify their relevance or add more appropriate perceptual metrics (e.g., VMAF/LPIPS or face-region weighted metrics) and define ROI handling.
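As an illustration of the missing ROI handling, a minimal sketch of a face-region-weighted PSNR, assuming a per-frame face mask is available (the mask source itself would also need to be standardized):

```python
import numpy as np

def roi_weighted_psnr(ref, test, face_mask, peak=255.0):
    """PSNR computed only over a boolean face mask (H, W) applied to
    (H, W, C) frames. A sketch of ROI handling; the mask source (e.g. a
    face parser) and its version would also need to be standardized."""
    err = (ref.astype(np.float64) - test.astype(np.float64)) ** 2
    mse = err[face_mask].mean()                       # ROI pixels only
    return 10.0 * np.log10(peak ** 2 / max(mse, 1e-12))
```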
[Technical] The animation metrics (LVE/FDD/MVE) rely on extracting landmarks/skeletons from rendered video, but no standardized detector/model, confidence handling, occlusion policy, or failure-handling behavior is defined; this makes results non-repeatable and vendor-dependent despite the stated reproducibility goal.
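A minimal sketch of the kind of confidence/occlusion policy that would need to be specified, assuming hypothetical (T, N, 2) landmark tracks with per-landmark detector confidences:

```python
import numpy as np

def lve_with_gating(ref_lm, test_lm, ref_conf, test_conf, conf_thresh=0.8):
    """Landmark error over frames, keeping only landmarks whose detector
    confidence exceeds a threshold in BOTH sequences, and reporting the
    discarded fraction. A sketch only: the detector, threshold, and
    occlusion handling all remain open choices in the contribution.

    ref_lm, test_lm: (T, N, 2) tracks; ref_conf, test_conf: (T, N)."""
    valid = (ref_conf >= conf_thresh) & (test_conf >= conf_thresh)
    err = np.linalg.norm(ref_lm - test_lm, axis=-1)   # (T, N) per-landmark error
    lve = err[valid].mean()
    discarded = 1.0 - valid.mean()                    # must be reported alongside LVE
    return lve, discarded
```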
[Technical] Units for LVE/FDD/MVE are given as “pixels/mm” without defining the pixel-to-mm mapping, camera calibration, depth assumptions, or whether errors are computed in 2D image space vs 3D space; this ambiguity will lead to incomparable results across resolutions/FOVs.
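For reference, under a pinhole camera model the pixel-to-mm mapping depends on both the focal length in pixels and the scene depth, neither of which the contribution fixes; a minimal sketch:

```python
def px_error_to_mm(err_px, depth_mm, focal_px):
    """Pinhole model: x_px = f_px * X_mm / Z_mm, so an image-plane error
    e_px at scene depth Z_mm corresponds to e_mm = e_px * Z_mm / f_px.
    Reporting 'pixels/mm' is ambiguous unless the depth (per landmark or
    per face) and the camera calibration f_px are fixed by the framework."""
    return err_px * depth_mm / focal_px

# e.g. a 2 px error at 1.2 m with f = 1170 px is ~2.05 mm:
# px_error_to_mm(2.0, depth_mm=1200.0, focal_px=1170.0)
```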
[Technical] “Reference rendered video from both high-quality reference pipeline and source capture” is internally inconsistent: if the reference is a rendered video, it bakes in a specific renderer/asset; if the reference is source capture, it is not directly comparable to a stylized avatar render—this needs a clear definition of the reference signal per metric.
[Technical] The framework mixes evaluation of animation technique quality with transport/network impairment testing, but does not specify how to isolate network effects from renderer/animation effects (e.g., fixed decoder, fixed renderer, controlled impairment injection points), risking confounded conclusions.
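One way to express the needed isolation is a test matrix in which renderer, asset, and decoder are pinned and only the impairment profile varies; a sketch with illustrative values:

```python
from itertools import product

# Sketch of a confound-free test matrix: renderer and decoder are held
# fixed so that metric deltas are attributable to transport impairments
# alone. Profile values and sequence names are illustrative.
FIXED = {"renderer": "ref-renderer@pinned", "decoder": "ref-decoder@pinned"}
IMPAIRMENTS = [
    {"loss_pct": 0.0, "delay_ms": 0,   "jitter_ms": 0},    # baseline
    {"loss_pct": 1.0, "delay_ms": 50,  "jitter_ms": 10},
    {"loss_pct": 5.0, "delay_ms": 100, "jitter_ms": 30},
]
SEQUENCES = ["talking_head_A", "expressive_B"]

test_matrix = [dict(FIXED, impairment=imp, sequence=seq)
               for imp, seq in product(IMPAIRMENTS, SEQUENCES)]
```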
[Technical] The “normative capture workflow using lossless/visually lossless compression” is underspecified: “visually lossless” is subjective, and without mandated codec/settings, color format, bit depth, and color management (RGB/YUV, transfer function), objective metrics will vary significantly.
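A sketch of the parameters such a capture workflow would have to pin down before “visually lossless” becomes testable; all values are illustrative, not proposals:

```python
# Parameters a capture-workflow definition must fix before objective
# metrics become comparable; values illustrative (FFV1 is cited only as
# one example of a mathematically lossless codec).
CAPTURE_SPEC = {
    "codec": "FFV1",
    "pixel_format": "yuv444p10le",     # chroma format and packing
    "bit_depth": 10,
    "color_primaries": "bt709",
    "transfer_function": "bt709",
    "matrix_coefficients": "bt709",
    "range": "full",
    "frame_rate_fps": 60,
}
```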
[Technical] Temporal metrics are deferred to a “second phase,” yet the proposals ask to include the metric set in TR 26.813; this is incomplete for Phase 2 objectives, where latency/sync are central, and at minimum the measurement definitions and required instrumentation should be specified now.
[Technical] Motion-to-photon and end-to-end latency measurement cannot be derived reliably from frame timestamps alone; the contribution needs a concrete method (e.g., LED/photodiode, timecode watermarking, event markers) and a definition of clock synchronization across stimulus, renderer, and capture.
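A minimal sketch of an LED/photodiode-style method, assuming the stimulus and capture clocks are already synchronized (itself something the contribution would need to specify):

```python
import numpy as np

def led_onset_latency_ms(stimulus_ts_ms, brightness, capture_ts_ms, k=5.0):
    """Motion-to-photon sketch: a stimulus event drives an LED in view of
    the capture camera; latency is the gap between the logged stimulus
    timestamp and the first captured frame whose LED-patch brightness
    jumps above baseline. Assumes the LED is off for the first 10 frames
    and that stimulus/capture clocks share a common time base.

    brightness: per-frame mean intensity of the LED patch, aligned with
    capture_ts_ms."""
    base, sigma = brightness[:10].mean(), brightness[:10].std() + 1e-9
    onset = np.argmax(brightness > base + k * sigma)   # first bright frame
    return capture_ts_ms[onset] - stimulus_ts_ms
```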
[Technical] Dropped frame ratio based on “missing or repeated frame indices” assumes access to frame indices or deterministic numbering; in a pure black-box capture, repeated frames must be detected via content analysis, which is not specified and is error-prone for low-motion scenes.
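A sketch of what content-based repeat detection would look like, illustrating why a defined threshold-calibration procedure is needed:

```python
import numpy as np

def repeated_frame_flags(frames, thresh=0.5):
    """Black-box repeated-frame detection by content analysis: flag a
    frame as a repeat when its mean absolute difference to the previous
    frame falls below a threshold. Error-prone for low-motion content,
    which is exactly why a defined method and threshold calibration are
    needed; the threshold here is illustrative.

    frames: (T, H, W) or (T, H, W, C) uint8 array."""
    f = frames.astype(np.float32)
    mad = np.abs(np.diff(f, axis=0)).mean(axis=tuple(range(1, f.ndim)))
    return np.concatenate([[False], mad < thresh])     # frame 0 never a repeat
```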
[Technical] Audio-visual sync via cross-correlation between mouth motion and audio is not robust across phonemes, silence, or expressive motion; the document should define the feature extraction (viseme classifier vs lip aperture signal), windowing, and acceptable error bounds.
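A minimal sketch of the cross-correlation approach, assuming a lip-aperture signal per video frame and an audio envelope resampled to the frame rate; the feature extraction and silence handling are exactly the undefined parts:

```python
import numpy as np

def av_offset_ms(lip_aperture, audio_env, fps):
    """AV-sync sketch: cross-correlate a lip-aperture signal (e.g. mouth
    landmark height per frame) with the audio amplitude envelope at the
    same rate. Assumes same-length, speech-active signals; behaviour over
    silence, phoneme variation, and expressive motion is undefined here,
    which is the gap the contribution needs to close."""
    a = lip_aperture - lip_aperture.mean()
    b = audio_env - audio_env.mean()
    xcorr = np.correlate(a, b, mode="full")
    lag_frames = np.argmax(xcorr) - (len(b) - 1)       # positive: video trails audio
    return 1000.0 * lag_frames / fps
```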
[Editorial] The “four key principles” section lists only three principles; this undermines clarity, and either the count should be corrected or the missing principle added.
[Editorial] Several terms are introduced without alignment to existing 3GPP/3GPP2/ITU terminology (e.g., “stimulus player,” “render configuration,” “metrics engine”); mapping to TR 26.813 structure and definitions is needed to avoid creating parallel vocabulary.
[Editorial] The proposal to “define normative capture workflow” is potentially inappropriate for a TR (typically informative); the document should clarify whether it intends normative text in a TS or provide informative guidance with clearly stated assumptions and limitations.