Summary of S4-260121: Avatar Evaluation Framework and Objective Metrics
Introduction
This contribution addresses Objectives 2 and 3 of the Avatar Communication Phase 2 SID (SP-251663), which concern QoE metrics, evaluation frameworks, and evaluation criteria for animation techniques. The document proposes a practical evaluation methodology designed to deliver repeatable, automated, and vendor-neutral results based on a core principle: evaluate what the user actually sees by measuring quality from rendered video output rather than internal system parameters.
Evaluation Framework
Design Principles
The framework is built on three key principles:
- Black-box evaluation: Metrics computed from rendered output video, not internal system states, ensuring cross-vendor comparability
- Reproducibility: Fixed test content, deterministic rendering conditions, and standardized capture workflows for consistent results
- Automation: All metrics computable without human intervention for large-scale testing
Testbed Architecture
The proposed testbed comprises five key components:
- Stimulus player: Feeds the avatar system under test with animation streams (blendshape weights, landmarks, joint poses)
- Render configuration: Locks camera intrinsics, lighting, background, and resolution to eliminate variability (a configuration sketch follows the list)
- Capture module: Records rendered frames using lossless/visually lossless compression with frame-accurate timestamps
- Network emulator: Applies controlled latency, jitter, bandwidth limits, and packet loss for transport testing
- Metrics engine: Computes frame-level and clip-level objective metrics from captured assets
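To make the "locked" render configuration concrete, the following minimal Python sketch serializes a fixed rendering setup that every system under test would load. All field names and values here are hypothetical illustrations, not taken from the contribution:

```python
from dataclasses import dataclass, asdict
import json

@dataclass(frozen=True)
class RenderConfig:
    """Fixed rendering conditions shared by all systems under test (hypothetical fields)."""
    resolution: tuple          # (width, height) in pixels
    fps: float                 # target rendering frame rate
    focal_length_px: float     # camera intrinsic: focal length in pixels
    principal_point: tuple     # camera intrinsic: (cx, cy) in pixels
    lighting_preset: str       # identifier of a standardized lighting setup
    background: str            # e.g. a fixed solid color or environment map ID

# One shared configuration, written to disk and distributed to every vendor.
config = RenderConfig(
    resolution=(1920, 1080),
    fps=30.0,
    focal_length_px=1400.0,
    principal_point=(960.0, 540.0),
    lighting_preset="studio_neutral_01",
    background="#808080",
)

with open("render_config.json", "w") as f:
    json.dump(asdict(config), f, indent=2)
```

Distributing one such file to all vendors ensures that metric differences reflect the animation pipeline rather than rendering conditions.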
Objective Metrics for Avatar Evaluation
The contribution proposes metrics across three quality dimensions:
Visual Quality Metrics
- PSNR (dB): Peak signal-to-noise ratio between reference and test frames
- SSIM (0-1): Structural similarity index between reference and test frames (a computation sketch for both metrics follows)
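A minimal Python sketch of how frame-level PSNR and SSIM could be computed with scikit-image, assuming the reference and test clips have already been decoded into aligned uint8 RGB frame arrays; the function names are illustrative, not part of the contribution:

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def frame_psnr_ssim(ref: np.ndarray, test: np.ndarray) -> tuple[float, float]:
    """Compute PSNR (dB) and SSIM (0-1) for one aligned pair of uint8 RGB frames."""
    psnr = peak_signal_noise_ratio(ref, test, data_range=255)
    ssim = structural_similarity(ref, test, channel_axis=-1, data_range=255)
    return psnr, ssim

def clip_scores(ref_frames, test_frames) -> tuple[float, float]:
    """Average frame-level scores into clip-level scores."""
    scores = [frame_psnr_ssim(r, t) for r, t in zip(ref_frames, test_frames)]
    psnrs, ssims = zip(*scores)
    return float(np.mean(psnrs)), float(np.mean(ssims))
```

Averaging frame-level values into clip-level scores matches the framework's split between frame-level and clip-level objective metrics.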
Animation Quality Metrics
These metrics are computed from the rendered video by extracting facial landmarks and body skeletons (a computation sketch follows the list):
- Lip Vertex Error (LVE) (pixels or mm): RMS error of mouth landmarks; critical for lip-sync evaluation
- Facial Distance Deviation (FDD) (pixels or mm): Deviation of expression-related landmark distances; measures facial expression accuracy
- Motion Vertex Error (MVE) (pixels or mm): RMS error of body joint positions; evaluates full-body animation fidelity
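All three metrics reduce to RMS or mean deviations over landmark and joint sets extracted from the rendered frames. A minimal numpy sketch under that assumption; the array layouts and landmark index sets are hypothetical and depend on the detector actually used:

```python
import numpy as np

def rms_landmark_error(ref_pts: np.ndarray, test_pts: np.ndarray) -> float:
    """RMS Euclidean distance between matched landmark sets.

    ref_pts, test_pts: (num_frames, num_points, 2) arrays in pixels
    (or mm, if a calibrated camera model is available).
    """
    dists = np.linalg.norm(ref_pts - test_pts, axis=-1)  # per-point distances
    return float(np.sqrt(np.mean(dists ** 2)))

# Hypothetical mouth landmark indices; the actual set depends on the
# landmark model (e.g. indices 48-67 in a 68-point face annotation).
MOUTH_IDX = list(range(48, 68))

def lip_vertex_error(ref_face: np.ndarray, test_face: np.ndarray) -> float:
    """LVE: RMS error over mouth landmarks only."""
    return rms_landmark_error(ref_face[:, MOUTH_IDX], test_face[:, MOUTH_IDX])

def facial_distance_deviation(ref_face, test_face, pairs) -> float:
    """FDD: mean absolute deviation of expression-related landmark distances.

    pairs: list of (i, j) landmark index pairs chosen to reflect expressions,
    e.g. eye-opening or mouth-corner spans (hypothetical choice).
    """
    def dists(pts):
        return np.stack([np.linalg.norm(pts[:, i] - pts[:, j], axis=-1)
                         for i, j in pairs], axis=-1)
    return float(np.mean(np.abs(dists(ref_face) - dists(test_face))))

def motion_vertex_error(ref_joints: np.ndarray, test_joints: np.ndarray) -> float:
    """MVE: RMS error over body joint positions."""
    return rms_landmark_error(ref_joints, test_joints)
```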
Temporal and Synchronization Metrics
Proposed for a second evaluation phase due to measurement complexity; a measurement sketch follows the list:
- Rendering Frame Rate (FPS): Computed from frame timestamp deltas
- Dropped Frame Ratio (%): Percentage of missing or repeated frame indices
- Motion-to-Photon Latency (ms): Time from input motion event to visible response
- End-to-End Latency (ms): Total delay from sender capture to receiver presentation
- Audio-Visual Sync Offset (ms): Offset between mouth motion and corresponding audio via cross-correlation
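A minimal numpy sketch of how the simpler temporal metrics might be computed from captured timestamps and signals. The function names are illustrative; motion-to-photon and end-to-end latency need instrumented event timestamps at sender and receiver and are omitted here:

```python
import numpy as np

def rendering_fps(timestamps_ms: np.ndarray) -> float:
    """Mean frame rate from frame-accurate capture timestamps (ms)."""
    deltas = np.diff(timestamps_ms)
    return 1000.0 / float(np.mean(deltas))

def dropped_frame_ratio(frame_indices: np.ndarray, expected_count: int) -> float:
    """Fraction of expected frames that are missing or repeated
    (multiply by 100 for the percentage reported by the metric)."""
    unique = np.unique(frame_indices)
    return 1.0 - len(unique) / expected_count

def av_sync_offset_ms(mouth_motion: np.ndarray, audio_envelope: np.ndarray,
                      sample_period_ms: float) -> float:
    """AV sync offset via cross-correlation of a mouth-opening signal
    (e.g. lip landmark distance per frame) and the audio amplitude
    envelope, both resampled to the same rate."""
    a = mouth_motion - mouth_motion.mean()
    b = audio_envelope - audio_envelope.mean()
    xcorr = np.correlate(a, b, mode="full")
    lag = int(np.argmax(xcorr)) - (len(b) - 1)  # sign shows which stream leads
    return lag * sample_period_ms
```

The cross-correlation approach assumes both signals are resampled to a common rate before comparison; the sign of the resulting lag indicates whether video leads or trails the audio.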
Test Content
Standardized animation streams should cover:
- Neutral speech: Clear visemes and steady head motion for baseline lip sync
- Expressive speech: Emotions (happiness, surprise, concern) for facial expression testing
- Conversational turn-taking: Gaze shifts, nods, backchannel gestures
- Non-verbal body motion: Pointing, waving, posture changes, idle animation
Each test set should contain reference audio, reference animation streams, and reference rendered video from both a high-quality reference pipeline and the source capture.
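As an illustration, one such test set could be described by a manifest like the following; the file names and fields are hypothetical, not defined by the contribution:

```python
import json

# Hypothetical manifest for one standardized test set.
test_set = {
    "id": "expressive_speech_01",
    "category": "expressive_speech",  # one of the four categories above
    "reference_audio": "audio/ref_48k.wav",
    "animation_streams": {
        "blendshape_weights": "anim/blendshapes.csv",
        "landmarks": "anim/landmarks.csv",
        "joint_poses": "anim/joints.csv",
    },
    "reference_video": {
        "reference_pipeline": "video/ref_pipeline.mkv",  # high-quality reference render
        "source_capture": "video/source_capture.mkv",    # original capture footage
    },
}

with open("test_set_manifest.json", "w") as f:
    json.dump(test_set, f, indent=2)
```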
Proposals
The contribution proposes to:
- Adopt an objective evaluation approach based on rendered video output as the primary evaluation method for reproducibility
- Include the proposed metric set (visual quality, animation fidelity, temporal performance) in TR 26.813
- Define a normative capture workflow using lossless recording, timecode embedding, and reference alignment for consistent metric computation across implementations