S4-260121

[FS_Avatar_Ph2_MED] Avatar Evaluation Framework and Objective Metrics

Source: Qualcomm Atheros, Inc.
Meeting: TSGS4_135_India
Agenda Item: 9.8

All Metadata
Agenda item description: FS_Avatar_Ph2_MED (Study on Avatar communication Phase 2)
Doc type: discussion
Contact: Imed Bouazizi (Contact ID 84417)
Uploaded: 2026-02-03T21:49:01
Revised to: S4-260355
TDoc Status: revised
Reservation date: 2026-02-03 05:48:54
Agenda item sort order: 43
Review Comments
manager - 2026-02-11 07:04


  1. [Technical] The “black-box evaluation from rendered video” principle is not sufficient for cross-vendor comparability unless the contribution also standardizes the avatar asset (mesh/topology, textures, rig), camera pose, lighting model, tone mapping, and renderer settings; otherwise PSNR/SSIM and landmark-based errors will mostly measure rendering/asset differences rather than animation/transport performance.
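One way to make this concrete is a shared render-configuration manifest that pins every rendering degree of freedom before metrics are computed. All field names and values below are hypothetical illustrations, not taken from the contribution:

```python
# Illustrative manifest pinning everything that must be identical across
# vendors so that PSNR/SSIM and landmark errors measure animation/transport
# quality rather than rendering or asset differences.
# Every key and value here is a hypothetical example.
RENDER_CONFIG = {
    "avatar_asset": {"mesh_sha256": "<hash>", "texture_set": "<id>", "rig": "<id>"},
    "camera": {"position_m": [0.0, 1.6, 1.2], "look_at_m": [0.0, 1.6, 0.0],
               "fov_deg": 40.0, "resolution": [1920, 1080]},
    "lighting": {"model": "image_based", "environment_map": "<id>", "exposure_ev": 0.0},
    "tone_mapping": {"operator": "ACES", "gamma": 2.2},
    "renderer": {"name": "<renderer-id>", "version": "<pinned>", "samples_per_pixel": 64},
}
```

Without such a manifest (or an equivalent normative table), two vendors rendering the same animation stream can legitimately produce different pixels.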
  2. [Technical] The proposal to “include PSNR/SSIM” as visual quality metrics is weak for avatar content because these metrics are highly sensitive to small viewpoint/lighting differences and do not correlate well with perceptual quality for faces; the document should justify their relevance or add more appropriate perceptual metrics (e.g., VMAF/LPIPS or face-region weighted metrics) and define ROI handling.
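As a sketch of the ROI handling the comment asks for, a face-region-weighted PSNR could look like the following; the weighting scheme and threshold are assumptions for illustration, not a proposed standard:

```python
import numpy as np

def roi_weighted_psnr(ref, impaired, roi_mask, roi_weight=4.0, peak=255.0):
    """PSNR with the face region up-weighted; a sketch, not a standardized metric.

    ref, impaired : arrays of identical shape (H, W) or (H, W, C).
    roi_mask      : boolean array (H, W), True inside the face region.
    roi_weight    : relative weight of ROI pixels (illustrative default).
    """
    ref = np.asarray(ref, dtype=np.float64)
    impaired = np.asarray(impaired, dtype=np.float64)
    w = np.where(roi_mask, roi_weight, 1.0)
    if ref.ndim == 3:                       # broadcast mask over color channels
        w = w[..., None]
    mse = np.sum(w * (ref - impaired) ** 2) / np.sum(w * np.ones_like(ref))
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(peak ** 2 / mse)
```

Errors concentrated in the face region then lower the score more than equal-magnitude background errors, which plain PSNR cannot distinguish.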
  3. [Technical] The animation metrics (LVE/FDD/MVE) rely on extracting landmarks/skeletons from rendered video, but no standardized detector/model, confidence handling, occlusion policy, or failure mode is defined—making results non-repeatable and vendor-dependent despite the stated reproducibility goal.
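A minimal example of the kind of confidence/occlusion policy the contribution leaves undefined; the threshold and the requirement to report the valid fraction are assumptions for illustration:

```python
import numpy as np

def masked_landmark_error(ref, detected, conf, conf_thresh=0.5):
    """Mean per-landmark Euclidean error, ignoring low-confidence detections.

    ref, detected : (N, 2) landmark coordinates from reference and test video.
    conf          : (N,) detector confidences for the test video.
    Returns (error, valid_fraction). Reporting both matters: a detector that
    silently drops hard landmarks would otherwise look artificially good.
    This is a sketch of one possible policy, not a standardized one.
    """
    valid = np.asarray(conf) >= conf_thresh
    if not np.any(valid):
        return float("nan"), 0.0        # explicit failure mode, not silence
    d = np.linalg.norm(np.asarray(ref)[valid] - np.asarray(detected)[valid], axis=-1)
    return float(d.mean()), float(valid.mean())
```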
  4. [Technical] Units for LVE/FDD/MVE are given as “pixels/mm” without defining the pixel-to-mm mapping, camera calibration, depth assumptions, or whether errors are computed in 2D image space vs 3D space; this ambiguity will lead to incomparable results across resolutions/FOVs.
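Under a pinhole camera model, the pixel-to-mm mapping the comment asks for requires at least the calibrated focal length and the landmark depth, neither of which the document pins down. A sketch of the conversion, assuming those quantities are available from the test setup:

```python
def pixels_to_mm(error_px, depth_mm, focal_px):
    """Convert a 2D image-space error to millimetres under a pinhole model.

    error_px : error magnitude measured in pixels at the landmark.
    depth_mm : camera-to-landmark distance along the optical axis.
    focal_px : focal length in pixels (fx from the intrinsic matrix).
    Only valid when depth and intrinsics are fixed by the test setup;
    without them the "pixels/mm" unit in the contribution is undefined.
    """
    return error_px * depth_mm / focal_px
```

For example, the same 5 px error at 600 mm depth corresponds to 3 mm when fx = 1000 px but 1.5 mm when fx = 2000 px (e.g. doubled resolution at the same field of view), so pixel-unit results are not comparable across resolutions or FOVs.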
  5. [Technical] “Reference rendered video from both high-quality reference pipeline and source capture” is internally inconsistent: if the reference is a rendered video, it bakes in a specific renderer/asset; if the reference is source capture, it is not directly comparable to a stylized avatar render—this needs a clear definition of the reference signal per metric.
  6. [Technical] The framework mixes evaluation of animation technique quality with transport/network impairment testing, but does not specify how to isolate network effects from renderer/animation effects (e.g., fixed decoder, fixed renderer, controlled impairment injection points), risking confounded conclusions.
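One concrete way to define the impairment injection point is a dedicated network-emulation hop between sender and receiver while decoder and renderer stay fixed, e.g. with Linux `tc`/netem; the interface name and values below are illustrative only:

```shell
# Illustrative controlled impairment injection (not from the contribution):
# emulate 100 ms delay with 20 ms jitter and 1% loss on the transport path,
# leaving decoder and renderer configurations untouched.
tc qdisc add dev eth0 root netem delay 100ms 20ms loss 1%
```

With the impairment isolated to one hop and everything downstream pinned, metric deltas can be attributed to transport rather than to the animation or rendering stack.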
  7. [Technical] The “normative capture workflow using lossless/visually lossless compression” is underspecified: “visually lossless” is subjective, and without mandated codec/settings, color format, bit depth, and color management (RGB/YUV, transfer function), objective metrics will vary significantly.
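As an illustration of the level of detail needed, a capture pinning could mandate something like the following FFmpeg invocation; the specific codec, pixel format, and color tags here are assumptions, not a proposal from the document:

```shell
# Illustrative capture pinning: mathematically lossless H.264 (-qp 0),
# 10-bit 4:4:4, explicit BT.709 tagging, so every lab's metrics engine
# sees identical pixels regardless of capture host.
ffmpeg -framerate 60 -i render_%06d.png \
       -c:v libx264 -qp 0 -preset veryslow \
       -pix_fmt yuv444p10le \
       -color_primaries bt709 -color_trc bt709 -colorspace bt709 \
       capture_lossless.mkv
```

Whatever settings are chosen, they must be mandated exactly; "visually lossless" leaves codec, chroma subsampling, bit depth, and color management as free variables that move the objective scores.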
  8. [Technical] Temporal metrics are deferred to a “second phase,” yet the proposals ask to include the metric set in TR 26.813; this is incomplete for Phase 2 objectives where latency/sync are central, and at minimum measurement definitions and required instrumentation should be provided now.
  9. [Technical] Motion-to-photon and end-to-end latency measurement cannot be derived reliably from frame timestamps alone; the contribution needs a concrete method (e.g., LED/photodiode, timecode watermarking, event markers) and a definition of clock synchronization across stimulus, renderer, and capture.
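A sketch of the event-marker approach: pair each injected stimulus event (e.g. a pose step that toggles an on-screen patch) with the first photodiode edge after it, after correcting for a separately measured clock offset. All names and the pairing rule are illustrative assumptions:

```python
import numpy as np

def motion_to_photon_ms(stimulus_ts_ms, photodiode_ts_ms, clock_offset_ms=0.0):
    """Per-event latency: first photodiode edge at or after each stimulus event.

    stimulus_ts_ms   : event times on the stimulus-player clock (ms).
    photodiode_ts_ms : detected display edges on the capture clock (ms).
    clock_offset_ms  : measured capture-minus-stimulus clock offset, e.g. from
                       a shared timecode -- the piece frame timestamps alone
                       cannot provide.
    """
    edges = np.asarray(photodiode_ts_ms, dtype=np.float64) - clock_offset_ms
    latencies = []
    for t in stimulus_ts_ms:
        later = edges[edges >= t]
        if later.size:                      # unmatched events are dropped here;
            latencies.append(float(later[0] - t))  # a real method must report them
    return latencies
```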
  10. [Technical] Dropped frame ratio based on “missing or repeated frame indices” assumes access to frame indices or deterministic numbering; in a pure black-box capture, repeated frames must be detected via content analysis, which is not specified and is error-prone for low-motion scenes.
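A minimal content-analysis repeat detector, illustrating why this is error-prone: a single mean-absolute-difference threshold cannot distinguish a repeated frame from a genuinely static low-motion scene. The threshold value is an arbitrary placeholder:

```python
import numpy as np

def count_repeated_frames(frames, diff_thresh=1.0):
    """Count frames nearly identical to their predecessor (candidate repeats).

    frames      : iterable of (H, W) or (H, W, C) pixel arrays.
    diff_thresh : mean-absolute-difference below which a frame is flagged as
                  a repeat. A fixed global threshold is exactly the fragility
                  the comment points out: low-motion content produces false
                  positives unless the method is calibrated or uses markers.
    """
    repeats = 0
    prev = None
    for f in frames:
        f = np.asarray(f, dtype=np.float64)
        if prev is not None and np.mean(np.abs(f - prev)) < diff_thresh:
            repeats += 1
        prev = f
    return repeats
```

A standardized method would likely need embedded frame markers (e.g. a corner test pattern encoding a frame counter) rather than raw content differencing.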
  11. [Technical] Audio-visual sync via cross-correlation between mouth motion and audio is not robust across phonemes, silence, or expressive motion; the document should define the feature extraction (viseme classifier vs lip aperture signal), windowing, and acceptable error bounds.
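The cross-correlation itself is simple; the undefined parts are the feature signals and windowing. A sketch assuming both a lip-aperture signal and an audio envelope are already extracted at the video frame rate (circular correlation is used here purely for brevity):

```python
import numpy as np

def av_offset_ms(lip_aperture, audio_envelope, fps, max_lag_frames=15):
    """Estimate AV offset by cross-correlating lip aperture with audio envelope.

    Both inputs are sampled at the video frame rate; the result is positive
    when audio leads video. This sketch omits the windowing, silence handling,
    and feature definitions (viseme classifier vs. lip aperture) the comment
    says must be specified for robust use.
    """
    a = np.asarray(lip_aperture, dtype=np.float64)
    b = np.asarray(audio_envelope, dtype=np.float64)
    a -= a.mean()
    b -= b.mean()
    lags = list(range(-max_lag_frames, max_lag_frames + 1))
    scores = [np.sum(a * np.roll(b, lag)) for lag in lags]
    best_lag = lags[int(np.argmax(scores))]
    return 1000.0 * best_lag / fps
```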
  12. [Editorial] The “four key principles” section lists only three principles; either the count should be corrected or the missing principle should be added.
  13. [Editorial] Several terms are introduced without alignment to existing 3GPP/3GPP2/ITU terminology (e.g., “stimulus player,” “render configuration,” “metrics engine”); mapping to TR 26.813 structure and definitions is needed to avoid creating parallel vocabulary.
  14. [Editorial] The proposal to “define normative capture workflow” is potentially inappropriate for a TR (typically informative); the document should clarify whether it intends normative text in a TS or provide informative guidance with clearly stated assumptions and limitations.