# Summary of S4-260121: Avatar Evaluation Framework and Objective Metrics

## Introduction

This contribution addresses Objectives 2 and 3 of the Avatar Communication Phase 2 SID (SP-251663), which concern QoE metrics, evaluation frameworks, and evaluation criteria for animation techniques. The document proposes a practical evaluation methodology designed to deliver repeatable, automated, and vendor-neutral results based on a core principle: **evaluate what the user actually sees** by measuring quality from rendered video output rather than internal system parameters.

## Evaluation Framework

### Design Principles

The framework is built on three key principles:

1. **Black-box evaluation**: Metrics are computed from the rendered output video, not internal system states, ensuring cross-vendor comparability
2. **Reproducibility**: Fixed test content, deterministic rendering conditions, and standardized capture workflows yield consistent results
3. **Automation**: All metrics are computable without human intervention, enabling large-scale testing

### Testbed Architecture

The proposed testbed comprises five key components:

- **Stimulus player**: Feeds avatar system with animation streams (blendshape weights, landmarks, joint poses)
- **Render configuration**: Locks camera intrinsics, lighting, background, and resolution to eliminate variability
- **Capture module**: Records rendered frames using lossless/visually lossless compression with frame-accurate timestamps
- **Network emulator**: Applies controlled latency, jitter, bandwidth limits, and packet loss for transport testing
- **Metrics engine**: Computes frame-level and clip-level objective metrics from captured assets
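The render-configuration and network-emulator components can be thought of as frozen parameter sets that together define one test condition. The following sketch uses hypothetical field names (the contribution does not specify exact parameters):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RenderConfig:
    """Locked rendering parameters so every run is directly comparable."""
    resolution: tuple = (1920, 1080)          # output width x height, pixels
    focal_length_px: float = 1400.0           # camera intrinsic
    principal_point: tuple = (960.0, 540.0)   # camera intrinsic
    lighting_preset: str = "studio_neutral"
    background: str = "mid_grey"
    target_fps: int = 30

@dataclass(frozen=True)
class NetworkProfile:
    """Controlled impairments applied by the network emulator."""
    latency_ms: float = 0.0
    jitter_ms: float = 0.0
    bandwidth_kbps: int = 0        # 0 = unconstrained
    packet_loss_pct: float = 0.0

# A test condition pairs one locked render configuration with one
# impairment profile; the baseline condition applies no impairments.
baseline = (RenderConfig(), NetworkProfile())
lossy = (RenderConfig(), NetworkProfile(latency_ms=80.0, jitter_ms=20.0,
                                        packet_loss_pct=1.0))
```

Freezing both objects makes conditions hashable and tamper-proof, so each captured clip can be keyed unambiguously to the condition that produced it.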

## Objective Metrics for Avatar Evaluation

The contribution proposes metrics across three quality dimensions:

### Visual Quality Metrics

- **PSNR** (dB): Peak signal-to-noise ratio between reference and test frames
- **SSIM** (0-1): Structural similarity index
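Both metrics are computable directly from captured frames with NumPy. The sketch below implements PSNR and a simplified single-window SSIM; a production metric would instead average SSIM over sliding local windows. Function names are illustrative:

```python
import numpy as np

def psnr(ref: np.ndarray, test: np.ndarray, max_val: float = 255.0) -> float:
    """Peak signal-to-noise ratio (dB) between two equally sized frames."""
    mse = np.mean((ref.astype(np.float64) - test.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(max_val ** 2 / mse)

def ssim_global(ref: np.ndarray, test: np.ndarray,
                max_val: float = 255.0) -> float:
    """Single-window SSIM over the whole frame (simplified; the full index
    averages over sliding Gaussian-weighted local windows)."""
    ref = ref.astype(np.float64)
    test = test.astype(np.float64)
    c1, c2 = (0.01 * max_val) ** 2, (0.03 * max_val) ** 2
    mu_x, mu_y = ref.mean(), test.mean()
    cov = ((ref - mu_x) * (test - mu_y)).mean()
    num = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (ref.var() + test.var() + c2)
    return num / den
```

Clip-level scores would typically be reported as the mean over all captured frames.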

### Animation Quality Metrics

These metrics are computed from the rendered video by extracting facial landmarks and body skeletons, keeping the evaluation black-box:

- **Lip Vertex Error (LVE)** (pixels or mm): RMS error of mouth landmarks; critical for lip-sync evaluation
- **Facial Distance Deviation (FDD)** (pixels or mm): Deviation of expression-related landmark distances; measures facial-expression accuracy
- **Motion Vertex Error (MVE)** (pixels or mm): RMS error of body joint positions; evaluates full-body animation fidelity
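All three animation metrics reduce to an RMS error over corresponding point sets extracted from the rendered video; only the point selection differs. A minimal sketch (function name illustrative):

```python
import numpy as np

def landmark_rms_error(ref_pts: np.ndarray, test_pts: np.ndarray) -> float:
    """RMS Euclidean error over corresponding landmarks.

    ref_pts, test_pts: arrays of shape (N, 2) in pixels, or (N, 3) in mm
    when a metric reconstruction is available. Both point sets are assumed
    to be extracted with the same detector, so detector bias largely
    cancels in the comparison.
    """
    d = np.linalg.norm(ref_pts - test_pts, axis=1)  # per-landmark distance
    return float(np.sqrt(np.mean(d ** 2)))

# LVE restricts the point set to mouth landmarks; FDD applies the same idea
# to expression-related landmark distances, and MVE to body joint positions.
```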

### Temporal and Synchronization Metrics

Proposed for a **second evaluation phase** due to measurement complexity:

- **Rendering Frame Rate** (FPS): Computed from frame timestamp deltas
- **Dropped Frame Ratio** (%): Percentage of missing or repeated frame indices
- **Motion-to-Photon Latency** (ms): Time from input motion event to visible response
- **End-to-End Latency** (ms): Total delay from sender capture to receiver presentation
- **Audio-Visual Sync Offset** (ms): Offset between mouth motion and corresponding audio via cross-correlation
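Although deferred to a later phase, the frame-timing metrics follow directly from frame-accurate timestamps and indices. The sketch below assumes per-frame timestamps in seconds and integer frame indices; the A/V sync estimate cross-correlates a per-frame mouth-motion energy signal with a frame-aligned audio envelope (both hypothetical inputs):

```python
import numpy as np

def rendering_fps(timestamps_s) -> float:
    """Mean frame rate from captured frame timestamps (seconds)."""
    deltas = np.diff(np.asarray(timestamps_s, dtype=np.float64))
    return float(1.0 / deltas.mean())

def dropped_frame_ratio(frame_indices) -> float:
    """Percentage of missing or repeated frames relative to the expected
    contiguous index range [first, last]."""
    idx = np.asarray(frame_indices)
    expected = int(idx.max() - idx.min() + 1)
    unique = np.unique(idx).size
    missing = expected - unique      # gaps in the index sequence
    repeated = idx.size - unique     # duplicated indices
    return 100.0 * (missing + repeated) / expected

def av_sync_offset_ms(mouth_energy, audio_energy, fps: float) -> float:
    """Estimate A/V offset via cross-correlation of per-frame mouth-motion
    energy against a frame-aligned audio envelope; positive = video lags."""
    m = np.asarray(mouth_energy, dtype=np.float64)
    a = np.asarray(audio_energy, dtype=np.float64)
    corr = np.correlate(m - m.mean(), a - a.mean(), mode="full")
    lag = int(np.argmax(corr)) - (len(a) - 1)  # lag in frames
    return 1000.0 * lag / fps
```

Resolution of the sync estimate is limited to one frame period; finer estimates would require interpolating the correlation peak.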

## Test Content

Standardized animation streams should cover:

- **Neutral speech**: Clear visemes and steady head motion for baseline lip sync
- **Expressive speech**: Emotions (happiness, surprise, concern) for facial expression testing
- **Conversational turn-taking**: Gaze shifts, nods, backchannel gestures
- **Non-verbal body motion**: Pointing, waving, posture changes, idle animation

Each test set should contain reference audio, reference animation streams, and reference rendered video from both a high-quality reference pipeline and the original source capture.

## Proposals

The contribution proposes to:

1. **Adopt** the objective evaluation approach based on rendered video output as the primary evaluation method, for reproducibility
2. **Include** the proposed metric set (visual quality, animation fidelity, temporal performance) in TR 26.813
3. **Define** a normative capture workflow using lossless recording, timecode embedding, and reference alignment so that metric computation is consistent across implementations