Avatar update to section 6.3.4
This document provides a comprehensive update to section 6.3.4 concerning the MPEG Avatar Representation Format (ARF), now standardized as ISO/IEC 23090-39. It reflects the standard's progression from its initial development phase to the Committee Draft International Standard (CDIS) stage.
The MPEG WG03 (Systems) working group is developing a new standard for an avatar representation format.
The Phase 1 requirements are categorized with three priority levels (High, Medium, Low) across multiple categories:
High Priority Requirements:
- Suitable exchange format for conversion between avatar representation formats
- Mesh-based format for representation and animation
- Signal coding format
- Semantic and signal representation
- Multiple levels of detail for geometry
- Facial and body animation
- Delay-sensitive animation streams
- Partial transport of base avatar
- Various storage and transport capabilities
Medium Priority Requirements:
- DRM protection support
- Integration into scene description
- Avatar authenticity and user association protection
Low Priority Requirements:
- Avatar-avatar, user-avatar, avatar-scene interactions
- Storage and replay of animation streams
The ARF data model (Figure 12) includes the following components:
Preamble Section:
- Signature string for unique document identification
- Version string tied to specific ARF revision
- Optional authenticationFeatures (encrypted facial and voice feature vectors with public key URI)
- supportedAnimations object specifying compatible animation frameworks (facial, body, hand, landmark, texture)
- Optional proprietaryAnimations for vendor-specific schemes
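To make the preamble fields concrete, here is a hypothetical sketch in JSON. The field names beyond those listed above, the example values, and the serialization itself are assumptions for illustration only; the standard defines the normative schema.

```json
{
  "signature": "ARF",
  "version": "1.0",
  "authenticationFeatures": {
    "facialFeatures": "<encrypted feature vector>",
    "voiceFeatures": "<encrypted feature vector>",
    "publicKeyUri": "https://example.com/keys/avatar.pub"
  },
  "supportedAnimations": {
    "facial": true,
    "body": true,
    "hand": false,
    "landmark": true,
    "texture": false
  },
  "proprietaryAnimations": []
}
```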
Metadata Object:
- Avatar-level descriptive information (name, unique identifier, age, gender)
- Used for experience adaptation and policy/access control
Components Section:
- Skeleton: Defines joints with inverse bind matrices, optional animationInfo
- Node: Scene graph objects with names, IDs, parent/child relations, semantic mappings, TRS or 4×4 matrix transformations
- Skin: Links mesh to skeleton, optional blendshape/landmark/texture sets, per-vertex joint weights
- Mesh: Geometric primitives with name, ID, optional path, geometry data items
- BlendshapeSets: Shape targets for base mesh with optional animationInfo
- LandmarkSets: Vertex/face indices with barycentric weights for tracked landmarks
- TextureSets: Material resources with texture targets and animation links
Two container formats are supported:
- ISOBMFF-based containers, which may include an animation track with time-based samples
- Zip-based containers (ISO/IEC 21320-1)
Both formats support partial access to avatar components.
The standard also defines a reference architecture (Figure 13).
The animation stream format uses AAUs as the fundamental structure (Figure 14):
AAU Structure:
- Header:
  - AAU type (7-bit code)
  - AAU payload length (bytes)
- Payload:
  - 32-bit timestamp in "ticks"
  - Type-specific data
  - Optional padding for byte alignment
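The AAU framing above can be sketched as a small parser. The byte order, the width of the length field, and the packing of the 7-bit type code into a single byte are assumptions made here for illustration; the standard fixes the actual bitstream layout.

```python
import struct

def parse_aau(buf: bytes, offset: int = 0):
    """Parse one AAU; returns (aau_type, timestamp_ticks, body, next_offset).

    Assumed layout (illustrative, not normative):
      1 byte  : AAU type in the low 7 bits
      4 bytes : payload length in bytes (big-endian)
      payload : 4-byte big-endian timestamp in ticks, then type-specific data
    """
    aau_type = buf[offset] & 0x7F
    (length,) = struct.unpack_from(">I", buf, offset + 1)
    payload = buf[offset + 5 : offset + 5 + length]
    (timestamp,) = struct.unpack_from(">I", payload, 0)
    body = payload[4:]
    return aau_type, timestamp, body, offset + 5 + length
```

Because each header carries the payload length, a reader can skip AAU types it does not understand by jumping to `next_offset`.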
AAU Types:
- AAU_CONFIG: Configuration unit
- AAU_BLENDSHAPE: Facial animation sample
- AAU_JOINT: Body/hand joint animation sample
- AAU_LANDMARK: Landmark animation sample
- AAU_TEXTURE: Texture animation sample
Configuration AAUs communicate stream-level parameters:
- Animation profile string (UTF-8 encoded)
- Timescale value (32-bit float, ticks per second)
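Decoding a configuration payload could then look like the following sketch. The 2-byte length prefix for the profile string is an assumption; only the UTF-8 encoding and the 32-bit float timescale are stated above.

```python
import struct

def parse_config_payload(body: bytes):
    """Decode a hypothetical AAU_CONFIG body.

    Assumed layout (illustrative): 2-byte big-endian string length,
    UTF-8 profile bytes, then a 32-bit big-endian float giving the
    timescale in ticks per second.
    """
    (n,) = struct.unpack_from(">H", body, 0)
    profile = body[2 : 2 + n].decode("utf-8")
    (timescale,) = struct.unpack_from(">f", body, 2 + n)
    return profile, timescale
```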
Blendshape animation samples (AAU_BLENDSHAPE) include:
- Target blendshape set identifier
- Per-blendshape confidence flag
- Number of blendshape entries
- For each entry: blendshape index, weight (32-bit float), optional confidence (32-bit float)
Deformation Formula:
v = v₀ + Σₖ wₖ · Δvₖ
Where:
- v₀: base vertex position
- Δvₖ: offset for blendshape k
- wₖ: transmitted blendshape weight
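The deformation formula can be applied directly with NumPy. This is a minimal sketch; the function and argument names are illustrative.

```python
import numpy as np

def apply_blendshapes(base: np.ndarray, deltas: np.ndarray,
                      weights: np.ndarray) -> np.ndarray:
    """v = v0 + sum_k w_k * delta_v_k, vectorized over all vertices.

    base:    (V, 3) rest-pose vertex positions
    deltas:  (K, V, 3) per-blendshape vertex offsets
    weights: (K,) transmitted blendshape weights
    """
    # Weighted sum over the blendshape axis k yields a (V, 3) offset field.
    return base + np.tensordot(weights, deltas, axes=1)
```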
Joint animation samples (AAU_JOINT) include:
- Target joint set identifier
- Per-joint velocity flag
- Number of joint entries
- For each entry: joint index, 4×4 transformation matrix (16 floats), optional 4×4 velocity matrix
Linear Blend Skinning (LBS) Formula:
vᵢ = Σⱼ wᵢⱼ · Mⱼ · vᵢ⁰
Where:
- wᵢⱼ: weight of joint j on vertex i
- Mⱼ: global transformation matrix for joint j
- vᵢ⁰: rest position of vertex i
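The LBS formula can likewise be sketched in NumPy. This assumes the inverse bind matrices from the skeleton component have already been folded into the global joint transforms; names are illustrative.

```python
import numpy as np

def linear_blend_skinning(rest: np.ndarray, joints: np.ndarray,
                          weights: np.ndarray) -> np.ndarray:
    """v_i = sum_j w_ij * M_j * v_i^0, for all vertices at once.

    rest:    (V, 3) rest-pose vertex positions
    joints:  (J, 4, 4) global joint transforms (inverse bind matrices
             assumed pre-multiplied in)
    weights: (V, J) per-vertex joint weights, each row summing to 1
    """
    V = rest.shape[0]
    homo = np.hstack([rest, np.ones((V, 1))])          # (V, 4) homogeneous
    # Transform every vertex by every joint: (J, V, 4)
    per_joint = np.einsum("jab,vb->jva", joints, homo)
    # Blend the per-joint results with the skinning weights: (V, 4)
    blended = np.einsum("vj,jva->va", weights, per_joint)
    return blended[:, :3]
```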
Landmark animation samples (AAU_LANDMARK) include:
- Landmark set ID
- Velocity and confidence flags
- Dimensionality flag (2D vs. 3D)
- Number of landmarks
- For each landmark: index, coordinates (2D or 3D), optional velocity and confidence
Use cases: facial tracking overlays, sensor-mesh registration, animation data calibration
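A landmark defined by a face index and barycentric weights (as in the LandmarkSets component) can be resolved to a 3D point on the mesh. This sketch assumes triangular faces; names are illustrative.

```python
import numpy as np

def landmark_position(vertices: np.ndarray, face: np.ndarray,
                      bary: np.ndarray) -> np.ndarray:
    """Resolve a tracked landmark to a point on the mesh surface.

    vertices: (V, 3) mesh vertex positions
    face:     (3,) vertex indices of the triangle holding the landmark
    bary:     (3,) barycentric weights, summing to 1
    """
    # Weighted combination of the triangle's three corner positions.
    return bary @ vertices[face]
```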
Texture animation samples (AAU_TEXTURE) mirror the blendshape sample structure but apply to texture targets:
- Controls parametric texture effects (micro-geometry patterns, makeup, dynamic material variations)
Dual delivery modes:
1. Live transmission: Sequences of AAUs for real-time avatar driving
2. Stored format: Avatar animation tracks in ISOBMFF-based ARF container with sample grouping for pre-recorded sequences ("smile," "wave," "dance")
The group continues its exploration of further aspects of the standard.