S4-260120 - AI Summary

[FS_Avatar_Ph2_MED] 3D Gaussian Splatting Avatar Methods for Real-Time Communication


Introduction

This contribution surveys 3D Gaussian Splatting (3DGS) methods for avatar representation in the context of the Avatar Communication Phase 2 study (FS_Avatar_Ph2_MED, SP-251663), specifically addressing Objective 3 on animation techniques for avatar reconstruction and rendering. The document evaluates 3DGS methods for real-time communication scenarios and their compatibility with MPEG Avatar Representation Format (ARF, ISO/IEC 23090-39).

Key Technical Background:
- 3DGS represents objects as sets of anisotropic 3D Gaussians (splats)
- Each Gaussian stores: 3D mean position, oriented covariance (ellipsoidal footprint), opacity, and appearance parameters (RGB or spherical harmonics coefficients)
- Rendering projects 3D Gaussians into screen space as 2D Gaussians with depth-ordered alpha compositing
- Achieves real-time rendering at 100-370 FPS on desktop GPUs with quality comparable to neural radiance fields
- Maps well to GPU compute and graphics pipelines
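The depth-ordered alpha compositing step above can be sketched per pixel as follows. This is a minimal NumPy illustration of front-to-back compositing over depth-sorted splats, not the optimized tile-based rasterizer used by 3DGS implementations:

```python
import numpy as np

def composite_front_to_back(colors, alphas):
    """Front-to-back alpha compositing of depth-sorted splats at one pixel.

    colors: (N, 3) RGB of the splats, nearest first
    alphas: (N,) per-splat opacity after evaluating its projected 2D Gaussian
    Returns the composited RGB and the remaining transmittance.
    """
    out = np.zeros(3)
    transmittance = 1.0  # fraction of light still unoccluded
    for c, a in zip(colors, alphas):
        out += transmittance * a * c
        transmittance *= (1.0 - a)
        if transmittance < 1e-4:  # early termination, as in typical rasterizers
            break
    return out, transmittance
```

Because the loop is a pure weighted sum, the result is deterministic for a fixed splat ordering, which is the property the interoperability discussion below relies on.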

Critical Question for Avatar Communication: how Gaussians deform under animation, either by binding to parametric meshes (FLAME for faces, SMPL/SMPL-X for bodies) or by using small neural networks to predict residual motion.

Survey of 3DGS Avatar Methods

Head and Face Avatar Methods

Key Differentiation Axes

Head/face methods differ along three practical dimensions:

  1. Binding domain: mesh surface anchors, UV space anchors, or volumetric anchors
  2. Runtime neural inference: whether MLPs are required at runtime
  3. Non-FLAME component handling: hair, teeth, tongue, and eye occlusion

Most interoperable approaches: fully explicit runtimes in which Gaussians are driven by the same blendshape and skeletal parameters as mesh renderers.

Method Comparison

| Method | FPS | Gaussians | Parametric Model | Runtime MLP | Key Feature |
|--------|-----|-----------|------------------|-------------|-------------|
| GaussianBlendshape | 370 | 70K | Custom blendshapes | No | Linear blending identical to mesh blendshapes, 32-39 dB PSNR |
| SplattingAvatar | 300+ | ~100K | FLAME mesh | No | Mesh-embedded via barycentric coords, 30 FPS on iPhone 13 |
| FlashAvatar | 300 | 10-50K | FLAME | Small MLP | UV-based init on FLAME, small MLPs for expression offsets |
| GaussianAvatars | 90-100 | ~100K | FLAME | No | FLAME-rigged, multi-view training, explicit binding |
| HHAvatar | ~100 | ~150K | FLAME | Temporal modules | First method for dynamic hair physics modeling |
| MeGA | ~90 | ~200K | FLAME (face) + 3DGS (hair) | No | Hybrid mesh+Gaussian, occlusion-aware blending, editable |

Standout Methods for Real-Time Communication: GaussianBlendshape and SplattingAvatar use purely explicit representations with no runtime neural networks, enabling deterministic rendering and direct ARF compatibility.

Mesh-Embedded Gaussian Splatting

Technical Approach:
- Each Gaussian anchored to animatable mesh supporting standard blendshapes and skeletal skinning
- Parameterization: triangle index + barycentric coordinates + optional offset vector in local tangent-normal frame
- Runtime: receiver deforms mesh using joint transforms and blendshape weights, then reconstructs Gaussian center from animated triangle vertices using barycentric weights
- No per-frame neural inference required - purely algebraic reconstruction ensures deterministic motion

Orientation and Footprint Handling:
- Gaussians stored in local frame aligned to triangle (per-axis scales + local rotation)
- Local-to-world transform from animated triangle frame transports covariance
- Keeps projected splat stable under motion, avoids jitter
- Appearance parameters (opacity, color coefficients) remain static unless dynamic effects explicitly modeled
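The algebraic reconstruction described above can be sketched in a few lines. This is an illustrative NumPy version of the barycentric anchoring plus local tangent-normal frame (the frame construction is one common convention, not a normative definition):

```python
import numpy as np

def gaussian_center_from_triangle(v0, v1, v2, bary, offset):
    """Reconstruct a Gaussian center from an animated mesh triangle.

    v0, v1, v2: animated vertex positions, each shape (3,)
    bary:       barycentric coordinates (3,), summing to 1
    offset:     stored displacement in the triangle's local tangent-normal frame
    """
    # Point on the deformed surface via barycentric interpolation
    p = bary[0] * v0 + bary[1] * v1 + bary[2] * v2
    # Orthonormal local frame: tangent along one edge, normal from the cross product
    t = v1 - v0
    t = t / np.linalg.norm(t)
    n = np.cross(v1 - v0, v2 - v0)
    n = n / np.linalg.norm(n)
    b = np.cross(n, t)
    R = np.stack([t, b, n], axis=1)  # local-to-world rotation of the triangle frame
    # Apply the stored local offset in the animated frame; the same R also
    # transports the Gaussian's local rotation/covariance to world space
    return p + R @ offset
```

Since only the vertex positions change per frame, the receiver re-runs this transform each frame with no neural inference, which is what makes the motion deterministic.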

Standardization Advantages:
- Reuses same animation signals as mesh avatar
- Enables graceful fallback: mesh-only renderers can ignore Gaussian extension and still animate
- 3DGS-capable renderers can render Gaussians alone or hybrid mesh+Gaussians composition

Limitation: A coarse driving mesh can restrict fine-scale effects (lip roll, eyelid thickness, hair motion). This can be addressed with higher-resolution parametric meshes, local offsets, or dedicated Gaussian subsets for non-mesh components.

Gaussian Blendshapes

Technical Approach:
- Mirrors classical mesh blendshape animation
- Each Gaussian has neutral parameters + per-expression deltas for center position, scale, opacity
- Runtime computes linear combination identical to mesh blendshape pipeline
- Key advantage: Determinism and ARF-friendly control - same blendshape weight stream drives both mesh vertices and Gaussian deltas
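The linear combination above is literally the mesh blendshape formula applied to Gaussian attributes. A minimal NumPy sketch, with the attribute packing (center/scale/opacity per row) chosen here for illustration:

```python
import numpy as np

def blend_gaussians(neutral, deltas, weights):
    """Linear blendshape combination applied to Gaussian attributes.

    neutral: (G, D) neutral attribute matrix per Gaussian
             (e.g. center, scale, opacity packed into D columns)
    deltas:  (E, G, D) per-expression attribute deltas
    weights: (E,) blendshape weights streamed per frame
    """
    # Identical arithmetic to mesh blendshapes: neutral + sum_e w_e * delta_e
    return neutral + np.tensordot(weights, deltas, axes=1)
```

The same weight vector that deforms the mesh vertices drives this sum, so no additional animation signaling is needed beyond what ARF already defines.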

Hybrid Methods with Small MLPs

Technical Approach:
- Parametric model for global control + small neural modules outputting residual offsets conditioned on expression, pose, or time
- Improves fine detail and handles effects difficult to capture with purely linear blendshapes

Tradeoff: Runtime inference and model distribution become part of the interoperability surface (model versioning, determinism, platform-specific performance)
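The "small neural module" in these hybrid methods is typically on the order of a two-layer MLP. A deliberately tiny sketch of the residual-offset idea, with all shapes and weight names invented for illustration:

```python
import numpy as np

def residual_offsets(expr, W1, b1, W2, b2):
    """Tiny MLP predicting per-Gaussian residual center offsets.

    expr:           (E,) expression/pose conditioning vector per frame
    W1, b1, W2, b2: fixed weights shipped once with the base avatar
    Returns (G, 3) offsets added on top of the parametric deformation.
    """
    h = np.maximum(0.0, W1 @ expr + b1)   # ReLU hidden layer
    return (W2 @ h + b2).reshape(-1, 3)   # one 3-vector offset per Gaussian
```

Even a network this small raises the conformance questions listed later: the operator set (matmul, ReLU) and numerical tolerances must be pinned down for two decoders to produce the same offsets.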

Full-Body Avatar Methods

Full-body methods have converged on SMPL/SMPL-X parametric body models, enabling compatibility with standard skeletal animation systems.

Method Comparison

| Method | FPS | Gaussians | Body Model | Training | Key Feature |
|--------|-----|-----------|------------|----------|-------------|
| GauHuman | 189 | ~13K | SMPL | 1-2 min | Fastest training, ~3.5 MB storage, KL divergence split/clone |
| HUGS | 60 | ~200K | SMPL | 30 min | Disentangles human/scene |
| ASH | ~60 | ~100K | SMPL | ~1 hour | 2D texture-space parameterization, Dual Quaternion skinning, motion retargeting |
| GART | >150 | ~50K | SMPL | sec-min | Latent bones for non-rigid deformations (dresses, loose clothing) |
| ExAvatar | ~60 | ~150K | SMPL-X | ~2 hours | Only SMPL-X method with unified body/face/hand animation |

Standout Methods:
- GauHuman: Best combination of minimal storage (~3.5 MB) and fast training (1-2 min)
- ExAvatar: Only method providing unified body/face/hand animation through SMPL-X - critical for immersive communication

Animation Architecture:
- Body model provides compact, standardized animation interface
- Base avatar: static set of Gaussians + binding metadata
- Runtime: joint transforms from SMPL/SMPL-X pose parameters deform body via skinning
- Gaussian propagation: surface anchoring (barycentric/UV coordinates) or direct skinning weights per Gaussian
- Enables motion retargeting by sending only pose stream while keeping high-fidelity Gaussian appearance fixed
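For the "direct skinning weights per Gaussian" option above, propagation is plain linear blend skinning applied to Gaussian centers instead of vertices. A NumPy sketch, assuming posed 4x4 joint matrices as produced by an SMPL/SMPL-X layer:

```python
import numpy as np

def skin_gaussian_centers(rest_centers, skin_weights, joint_transforms):
    """Propagate skeletal animation to Gaussians via linear blend skinning.

    rest_centers:     (G, 3) Gaussian centers in the rest pose
    skin_weights:     (G, J) per-Gaussian skinning weights (rows sum to 1)
    joint_transforms: (J, 4, 4) posed joint matrices from the body model
    """
    G = rest_centers.shape[0]
    homo = np.concatenate([rest_centers, np.ones((G, 1))], axis=1)  # (G, 4)
    # Blend the joint matrices per Gaussian, then apply to the rest center
    blended = np.einsum('gj,jab->gab', skin_weights, joint_transforms)
    posed = np.einsum('gab,gb->ga', blended, homo)
    return posed[:, :3]
```

Only the pose stream changes per frame; `rest_centers` and `skin_weights` are part of the static base avatar, which is what makes motion retargeting with a fixed appearance possible.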

Non-Rigid Effects Challenge:
- Clothing, long hair, accessories don't follow body surface with rigid skinning
- Solutions: latent bones or local deformation modules (additional control points beyond SMPL skeleton)
- ARF integration consideration: Distinguish between body-locked Gaussians (fully driven by standardized skeleton) and secondary Gaussians (may require optional control signals or local simulation)

Distribution Size Considerations:
- Full-body avatars require tens to hundreds of thousands of Gaussians
- Each Gaussian includes geometry and appearance attributes
- Compression and level-of-detail essential for real deployments
- Practical ARF profile should specify default Gaussian count budget and allow progressive refinement layers for high-end devices
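A back-of-the-envelope size estimate makes the compression need concrete. The attribute layout below (degree-3 spherical harmonics color, float32 storage) is a common 3DGS convention assumed here for illustration, not a figure from the contribution:

```python
# Rough uncompressed size for a full-body Gaussian avatar.
# Assumed layout: position (3) + rotation quaternion (4) + scale (3)
# + opacity (1) + degree-3 SH color coefficients (48) = 59 float32 values.
floats_per_gaussian = 3 + 4 + 3 + 1 + 48
bytes_per_gaussian = floats_per_gaussian * 4          # float32 = 4 bytes
num_gaussians = 100_000
total_mb = num_gaussians * bytes_per_gaussian / 1e6
print(f"{bytes_per_gaussian} B/Gaussian -> {total_mb:.1f} MB uncompressed")
```

At roughly 24 MB uncompressed for 100K Gaussians, compressed footprints like GauHuman's ~3.5 MB show why a default count budget plus progressive refinement layers is a sensible profile shape.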

Animation Compatibility Classification

Methods classified into three categories based on runtime architecture:

1. Purely Explicit (no MLPs)

Methods: SplattingAvatar, GaussianBlendshape, GaussianAvatars
- Performance: 300-370 FPS
- ARF Compatibility: Direct mapping
- Animation: Driven entirely by standard skeletal joints and blendshape weights
- Fully compatible with ARF Animation Stream Format

2. Hybrid (small MLPs)

Methods: 3DGS-Avatar, FlashAvatar, HUGS
- Performance: 50-100 FPS (near-real-time)
- Architecture: Small MLPs add expression-dependent offsets without fundamentally changing animation interface
- ARF Integration: Can still be driven by blendshape parameters with MLP weights distributed as part of base avatar

3. Fully Neural

Methods: Gaussian Head Avatar, GaussianHead
- Training: 1-2 days
- Latency: Higher
- ARF Integration: May be integrated into ARF containers as proprietary customized models

Interoperability Key Question: Not whether an MLP exists, but whether the animation interface remains the same. If the avatar is driven solely by joints and blendshape weights, the ARF Animation Stream Format remains sufficient and the decoder only needs to choose a renderer.

Determinism Considerations:
- Explicit methods: Naturally deterministic given fixed floating-point rules, no platform-specific neural inference dependency
- Hybrid methods: Viable if MLP is small and shipped as part of base avatar, but conformance should define fixed operator sets and numerical tolerances
- Fully neural pipelines: Better treated as optional proprietary components inside ARF container rather than baseline interoperable tool

Proposed Architecture for ARF Integration

Four-Step Integration Approach

Step 1: Storage
- Store mesh-embedded Gaussians as auxiliary data within glTF/ARF containers
- Parameterization: relative to mesh surface using barycentric coordinates (SplattingAvatar) or linear blendshape offsets (GaussianBlendshape)
- Preserves backward compatibility with mesh-only renderers
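As a shape for Step 1, the Gaussian data could ride alongside the mesh as a glTF extension payload referencing accessors. The extension name and field names below are purely illustrative assumptions; neither ARF nor MPEG has defined them:

```python
import json

# Hypothetical glTF extension payload for mesh-embedded Gaussians.
# All names here are illustrative, not defined by ARF (ISO/IEC 23090-39).
gaussian_extension = {
    "extensions": {
        "EXT_gaussian_splats": {        # hypothetical extension name
            "binding": "barycentric",    # anchored to mesh triangles
            "triangleIndices": 0,        # accessor indices (illustrative)
            "barycentricCoords": 1,
            "localOffsets": 2,
            "attributes": 3,             # scale / rotation / opacity / SH color
        }
    }
}
print(json.dumps(gaussian_extension, indent=2))
```

Because the payload lives entirely inside `extensions`, a mesh-only renderer that does not recognize the name simply ignores it, which is the backward-compatibility property Step 1 requires.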

Step 2: Animation
- Animate via standard skeletal and blendshape parameters already defined in ARF Animation Stream Format
- No changes to animation stream required - Gaussian positions derived from same joint transforms and blendshape weights used for mesh animation

Step 3: Compression
- Apply GS compression for Gaussian attributes within base avatar to minimize distribution size

Step 4: Streaming
- Stream only animation access units (AAUs) at approximately 40 KB/s for real-time animation
- Base avatar (including compressed Gaussian data) distributed once at session establishment
- Enables high-quality Gaussian splatting rendering on capable devices while maintaining mesh-based rendering compatibility on constrained devices

Deployment Requirements

Capability Exchange:
- Endpoints signal support for 3DGS rendering
- Supported attribute sets
- Supported Gaussian count budgets
- Fallback to mesh rendering if 3DGS not supported or resources constrained
- Avoids ecosystem fragmentation and maintains backward compatibility
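The capability exchange above amounts to intersecting what both endpoints support and falling back to mesh otherwise. A sketch of that negotiation logic, with all field names assumed for illustration rather than taken from any defined signaling:

```python
from dataclasses import dataclass

@dataclass
class SplatCapability:
    """Illustrative per-endpoint capability record exchanged at session setup.
    Field names are assumptions, not defined by the study."""
    supports_3dgs: bool
    max_gaussians: int
    attribute_sets: tuple  # e.g. ("rgb",) or ("rgb", "sh3")

def negotiate(local: SplatCapability, remote: SplatCapability):
    """Pick a rendering mode both endpoints can honor; fall back to mesh."""
    if not (local.supports_3dgs and remote.supports_3dgs):
        return {"mode": "mesh"}
    budget = min(local.max_gaussians, remote.max_gaussians)
    common = tuple(a for a in local.attribute_sets if a in remote.attribute_sets)
    if budget == 0 or not common:
        return {"mode": "mesh"}
    return {"mode": "3dgs", "budget": budget, "attributes": common}
```

Taking the minimum budget and the attribute intersection keeps a single base avatar usable across heterogeneous devices, which is the stated fragmentation concern.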

Proposals

The document proposes that SA4 consider the following for the FS_Avatar_Ph2_MED study:

  1. Acknowledge 3D Gaussian Splatting as a viable rendering primitive for avatar communication

  2. Coordinate with MPEG on integration of Gaussian splatting data within ARF Base Avatar Format (ISO/IEC 23090-39)

  3. Evaluate compression techniques (SPZ, L-GSC, HAC++, Compact3D) for inclusion in study of static and animation data compression (Objective 7)

  4. Define capability signaling and conformance points for 3DGS avatar rendering:
     - Supported Gaussian count budgets
     - Supported attribute sets
     - Required numerical tolerances for determinism

  5. Study hybrid approaches with small MLPs - whether they warrant an optional ARF profile, and if so, constrain operator sets and model sizes to preserve portability

Document Information

Source: Qualcomm Atheros, Inc.
Type: discussion
Original Document: View on 3GPP
Title: [FS_Avatar_Ph2_MED] 3D Gaussian Splatting Avatar Methods for Real-Time Communication
Agenda item: 9.8
Agenda item description: FS_Avatar_Ph2_MED (Study on Avatar communication Phase 2)
Doc type: discussion
Contact: Imed Bouazizi
Uploaded: 2026-02-03T21:49:01.057000
Contact ID: 84417
Revised to: S4-260353
TDoc Status: revised
Reservation date: 03/02/2026 05:29:47
Agenda item sort order: 43