# Technical Summary: AVATAR - Update to Section 6.3.4

## Overview

This document provides a comprehensive update to section 6.3.4 concerning the MPEG Avatar Representation Format (ARF), now standardized as ISO/IEC 23090-39. The document reflects the progression of the standard from its initial development phase to the Committee Draft International Standard (CDIS) stage.

## MPEG Avatar Representation Format Development

### Scope and Objectives

The MPEG WG03 (Systems) working group is developing a new standard for an avatar representation format with the following scope:

- Develop an interchange representation format for computer-generated avatars and associated containers
- Define an animation stream format to represent avatar dynamics and time-based information
- Include geometrical models and all associated data (blendshapes, skeleton, normals, textures, maps, metadata)
- Provide a streamable format for dynamics (animation parameters, tracking information, contextual data)
- Ensure interoperability between existing models and formats

### Requirements and Priorities

The Phase 1 requirements are categorized with three priority levels (High, Medium, Low) across multiple categories:

**High Priority Requirements:**
- Suitable exchange format for conversion between avatar representation formats
- Mesh-based format for representation and animation
- Signal coding format
- Semantic and signal representation
- Multiple levels of detail for geometry
- Facial and body animation
- Delay-sensitive animation streams
- Partial transport of base avatar
- Various storage and transport capabilities

**Medium Priority Requirements:**
- DRM protection support
- Integration into scene description
- Avatar authenticity and user association protection

**Low Priority Requirements:**
- Avatar-avatar, user-avatar, avatar-scene interactions
- Storage and replay of animation streams

## ARF Data Model and Structure

### Core Components

The ARF data model (Figure 12) includes the following components:

**Preamble Section:**
- Signature string for unique document identification
- Version string tied to specific ARF revision
- Optional authenticationFeatures (encrypted facial and voice feature vectors with public key URI)
- supportedAnimations object specifying compatible animation frameworks (facial, body, hand, landmark, texture)
- Optional proprietaryAnimations for vendor-specific schemes
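
The preamble fields above can be pictured as a small structured document. The sketch below expresses them as a Python dict; the field names mirror the list above, but the concrete values and the exact schema shown here are illustrative assumptions, not the normative ARF syntax.

```python
# Illustrative ARF preamble as a Python dict. Field names follow the
# components listed above; values and nesting are hypothetical examples,
# not the normative schema from ISO/IEC 23090-39.
preamble = {
    "signature": "ARF",            # signature string identifying the document
    "version": "1.0",              # tied to a specific ARF revision
    "authenticationFeatures": {    # optional authentication block
        "encryptedFaceFeatures": "<encrypted facial feature vector>",
        "encryptedVoiceFeatures": "<encrypted voice feature vector>",
        "publicKeyUri": "https://example.com/keys/avatar.pub",  # hypothetical URI
    },
    "supportedAnimations": {       # compatible animation frameworks
        "facial": True,
        "body": True,
        "hand": False,
        "landmark": True,
        "texture": False,
    },
}
```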

**Metadata Object:**
- Avatar-level descriptive information (name, unique identifier, age, gender)
- Used for experience adaptation and policy/access control

**Components Section:**
- **Skeleton**: Defines joints with inverse bind matrices, optional animationInfo
- **Node**: Scene graph objects with names, IDs, parent/child relations, semantic mappings, TRS or 4×4 matrix transformations
- **Skin**: Links mesh to skeleton, optional blendshape/landmark/texture sets, per-vertex joint weights
- **Mesh**: Geometric primitives with name, ID, optional path, geometry data items
- **BlendshapeSets**: Shape targets for base mesh with optional animationInfo
- **LandmarkSets**: Vertex/face indices with barycentric weights for tracked landmarks
- **TextureSets**: Material resources with texture targets and animation links

### Container Formats

Two container formats are supported:

1. **ISOBMFF containers** (ISO/IEC 14496-12):
   - ARF document in ISOBMFF item in top-level MetaBox
   - Additional items for each component
   - May include animation track with time-based samples

2. **Zip-based containers** (ISO/IEC 21320-1):
   - Top-level ARF document
   - Component files referenced relative to document location

Both formats support partial access to avatar components.
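
Since ISO/IEC 21320-1 containers are zip archives, the zip-based case can be sketched with Python's standard `zipfile` module. The entry name `avatar.arf` is an assumption for illustration; the normative naming and referencing rules come from ISO/IEC 23090-39.

```python
import zipfile

def read_arf_document(container):
    """Read the top-level ARF document from a zip-based ARF container.

    The entry name 'avatar.arf' is a hypothetical example; component files
    would be resolved relative to this document's location in the archive.
    """
    with zipfile.ZipFile(container) as zf:
        # Partial access: only the requested entry is decompressed,
        # not the whole container.
        return zf.read("avatar.arf")
```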

## Integration with MPEG Scene Description

### Scene Description Integration

- ARF designed to work with MPEG Scene Description (ISO/IEC 23090-14) based on glTF
- Not limited to MPEG Scene Description; can integrate with any scene description solution
- ISO/IEC 23090-14 defines MPEG_node_avatar extension
- ISO/IEC 23090-39 extends MPEG_node_avatar for better ARF integration

### Reference Client Architecture

The reference architecture (Figure 13) includes:

- **Avatar Pipeline**: Part of Media Access Function (MAF)
  - Retrieves avatar model and associated information
  - Fetches ARF container and animation streams
  - Animates and reconstructs avatar
  - Provides reconstructed avatar to Presentation Engine through buffers containing 3D mesh components

## Animation Bitstream Format

### Avatar Animation Units (AAUs)

The animation stream format uses AAUs as the fundamental structure (Figure 14):

**AAU Structure:**
- Header:
  - AAU type (7-bit code)
  - AAU payload length (bytes)
- Payload:
  - 32-bit timestamp in "ticks"
  - Type-specific data
- Optional padding for byte alignment

**AAU Types:**
- AAU_CONFIG: Configuration unit
- AAU_BLENDSHAPE: Facial animation sample
- AAU_JOINT: Body/hand joint animation sample
- AAU_LANDMARK: Landmark animation sample
- AAU_TEXTURE: Texture animation sample
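
A parser for this structure might look like the sketch below. The standard specifies a 7-bit type code, a payload length, and a 32-bit timestamp, but the exact field widths and byte order assumed here (one header byte with the type in the low 7 bits, a 4-byte big-endian length, a 4-byte timestamp leading the payload) are illustrative assumptions, not the normative bitstream syntax.

```python
import struct

# Hypothetical numeric values for the AAU type codes, for illustration only.
AAU_CONFIG, AAU_BLENDSHAPE, AAU_JOINT, AAU_LANDMARK, AAU_TEXTURE = range(5)

def parse_aau(buf, offset=0):
    """Parse one AAU under the assumed layout.

    Returns (aau_type, timestamp_ticks, type_specific_data, next_offset).
    """
    aau_type = buf[offset] & 0x7F                       # 7-bit type code
    (length,) = struct.unpack_from(">I", buf, offset + 1)   # payload length in bytes
    payload = buf[offset + 5 : offset + 5 + length]
    (timestamp,) = struct.unpack_from(">I", payload)    # 32-bit timestamp in ticks
    return aau_type, timestamp, payload[4:], offset + 5 + length
```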

### Configuration Units

Configuration AAUs communicate stream-level parameters:
- Animation profile string (UTF-8 encoded)
- Timescale value (32-bit float, ticks per second)
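
The timescale maps AAU timestamps onto wall-clock time: a timestamp in ticks divided by the ticks-per-second timescale yields seconds. A minimal sketch:

```python
def ticks_to_seconds(timestamp_ticks, timescale):
    """Convert an AAU timestamp from ticks to seconds.

    `timescale` is the ticks-per-second value carried by the
    configuration AAU.
    """
    return timestamp_ticks / timescale
```

For example, with a 60.0 ticks-per-second timescale, a timestamp of 90 ticks corresponds to 1.5 seconds into the stream.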

### Facial Animation Samples (AAU_BLENDSHAPE)

Structure includes:
- Target blendshape set identifier
- Per-blendshape confidence flag
- Number of blendshape entries
- For each entry: blendshape index, weight (32-bit float), optional confidence (32-bit float)

**Deformation Formula:**
```
v = v₀ + Σₖ wₖ · Δvₖ
```
Where:
- v₀: base vertex position
- Δvₖ: offset for blendshape k
- wₖ: transmitted blendshape weight
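
The deformation formula can be applied directly over whole vertex arrays. The sketch below assumes NumPy for the vector math; array shapes are a modeling choice, not part of the standard.

```python
import numpy as np

def apply_blendshapes(base_vertices, blendshape_offsets, weights):
    """Apply the blendshape deformation v = v0 + sum_k(w_k * delta_v_k).

    base_vertices:      (V, 3) rest positions v0
    blendshape_offsets: (K, V, 3) per-blendshape offsets delta_v_k
    weights:            (K,) transmitted blendshape weights w_k
    """
    # Weighted sum over the K blendshapes, yielding a (V, 3) offset array.
    offsets = np.tensordot(weights, blendshape_offsets, axes=1)
    return base_vertices + offsets
```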

### Joint Animation Samples (AAU_JOINT)

Structure includes:
- Target joint set identifier
- Per-joint velocity flag
- Number of joint entries
- For each entry: joint index, 4×4 transformation matrix (16 floats), optional 4×4 velocity matrix

**Linear Blend Skinning (LBS) Formula:**
```
vᵢ = Σⱼ wᵢⱼ · Mⱼ · vᵢ⁰
```
Where:
- wᵢⱼ: weight of joint j on vertex i
- Mⱼ: global transformation matrix for joint j
- vᵢ⁰: rest position of vertex i
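
The LBS formula above can be sketched with NumPy as follows. The joint matrices passed in are assumed to be the global transforms already composed with the inverse bind matrices from the skeleton component; the array shapes are illustrative.

```python
import numpy as np

def linear_blend_skinning(rest_vertices, joint_matrices, skin_weights):
    """Linear blend skinning: v_i = sum_j(w_ij * M_j * v_i0).

    rest_vertices:  (V, 3) rest positions v_i0
    joint_matrices: (J, 4, 4) global joint transforms M_j (assumed to
                    already include the inverse bind matrices)
    skin_weights:   (V, J) per-vertex joint weights w_ij, rows summing to 1
    """
    n = rest_vertices.shape[0]
    # Homogeneous coordinates so the 4x4 matrices apply translation too.
    homogeneous = np.hstack([rest_vertices, np.ones((n, 1))])             # (V, 4)
    transformed = np.einsum("jab,vb->vja", joint_matrices, homogeneous)   # (V, J, 4)
    blended = np.einsum("vj,vja->va", skin_weights, transformed)          # (V, 4)
    return blended[:, :3]
```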

### Landmark Animation Samples (AAU_LANDMARK)

Structure includes:
- Landmark set ID
- Velocity and confidence flags
- Dimensionality flag (2D vs. 3D)
- Number of landmarks
- For each landmark: index, coordinates (2D or 3D), optional velocity and confidence

Use cases: facial tracking overlays, sensor-mesh registration, animation data calibration
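
Because LandmarkSets anchor landmarks to mesh faces via barycentric weights, a landmark's 3D position follows the deforming surface. A minimal sketch of that evaluation, assuming NumPy and triangle faces:

```python
import numpy as np

def landmark_position(vertices, face, barycentric):
    """Locate a tracked landmark on a mesh surface.

    vertices:    (V, 3) current mesh vertex positions
    face:        (3,) vertex indices of the triangle carrying the landmark
    barycentric: (3,) barycentric weights, summing to 1
    """
    # Weighted combination of the triangle's three corner positions.
    return np.asarray(barycentric) @ vertices[np.asarray(face)]
```

Evaluating this after skinning/blendshape deformation keeps overlays and sensor-mesh registration aligned with the animated avatar.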

### Texture Animation Samples (AAU_TEXTURE)

Structure analogous to blendshape samples but applied to texture targets:
- Controls parametric texture effects (micro-geometry patterns, makeup, dynamic material variations)

### Animation Stream Delivery

Dual delivery modes:
1. **Live transmission**: Sequences of AAUs for real-time avatar driving
2. **Stored format**: Avatar animation tracks in ISOBMFF-based ARF container with sample grouping for pre-recorded sequences ("smile," "wave," "dance")

## Ongoing Exploration Experiments

The group continues exploration on:

1. **Compression for Animation Streams**: Methods for compressing facial and body animation streams
2. **Integrating Geometry Data Components**: Specifying integration of avatar data into interoperable container format
3. **Animation Sample Formats**: Developing structures for various animation data types (blend shapes, facial landmarks, animation controllers, joint transforms)
4. **Content Discovery and Partial Access**: Solutions for content discovery and partial access
5. **Animation Controllers**: Study on combining blend shape and joint animation