[FS_Avatar_Ph2_MED] 3D Gaussian Splatting Avatar Methods for Real-Time Communication
Source: Qualcomm Atheros, Inc.
Meeting: TSGS4_135_India
Agenda Item: 9.8
| Agenda item description | FS_Avatar_Ph2_MED (Study on Avatar communication Phase 2) |
|---|---|
| Doc type | discussion |
| Contact | Imed Bouazizi |
| Uploaded | 2026-02-03T21:49:01.057000 |
| Contact ID | 84417 |
| Revised to | S4-260353 |
| TDoc Status | revised |
| Reservation date | 03/02/2026 05:29:47 |
| Agenda item sort order | 43 |
Review Comments
[Technical] The claim “No changes to animation stream required” (Proposed Architecture Step 2) is not substantiated for all cited methods: mesh-embedded Gaussians may need additional per-Gaussian binding metadata (triangle ID, barycentric coords, local frame/covariance transport rules) and potentially additional animation parameters for non-mesh components (hair/teeth/tongue/eyes), which are not clearly covered by the existing ARF Animation Stream Format.
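To make the gap concrete, the following sketch shows the kind of per-Gaussian binding record that mesh-embedded methods implicitly rely on but the ARF Animation Stream Format does not currently define. All field names and the transport rule are illustrative assumptions, not taken from any specification.

```python
from dataclasses import dataclass

@dataclass
class GaussianBinding:
    """Hypothetical per-Gaussian binding record (illustrative fields):
    ties one Gaussian to a driving-mesh triangle so the splat can
    follow mesh animation without changes to the animation stream."""
    triangle_id: int                          # index into the driving mesh
    barycentric: tuple                        # (u, v, w) position in the triangle
    normal_offset: float                      # displacement along the triangle normal
    # Rule for transporting the local frame/covariance under deformation,
    # e.g. rotate with the triangle's tangent frame vs. re-fit per frame.
    frame_transport: str = "tangent_frame"

def bind_point(p0, p1, p2, bary):
    """Reconstruct a bound position from triangle vertices and
    barycentric coordinates -- the core of mesh-embedded binding."""
    return tuple(
        bary[0] * a + bary[1] * b + bary[2] * c
        for a, b, c in zip(p0, p1, p2)
    )
```

A decoder that lacks this metadata cannot re-attach the Gaussians after the mesh deforms, which is why the "no changes required" claim needs qualification.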
[Technical] Backward compatibility via “store mesh-embedded Gaussians as auxiliary data within glTF/ARF containers” (Step 1) is underspecified: ARF/glTF needs a normative extension mechanism (schema, MIME/box, or glTF extension) defining attribute semantics, coordinate frames, units, and default behaviors; otherwise different decoders will interpret the same auxiliary data differently.
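As an illustration of the missing normative detail, a minimal sketch of what a glTF extension payload would have to pin down is given below. The extension name and every key are hypothetical, invented for this example only; the point is that semantics, coordinate frame, units, and fallback behavior must be stated somewhere for decoders to agree.

```python
import json

# Hypothetical glTF extension payload (the extension name and all keys
# below are illustrative, not defined by glTF or ARF), showing the kind
# of normative detail a Gaussian-splat extension would need to fix.
gaussian_extension = {
    "extensions": {
        "EXT_gaussian_splats_EXAMPLE": {             # hypothetical name
            "coordinateFrame": "right-handed-y-up",  # must be normative
            "units": "meters",
            "attributes": {                          # accessor indices; the
                "POSITION": 0,                       # semantics of each must
                "SCALE": 1,                          # be defined by the spec
                "ROTATION": 2,                       # incl. quaternion order
                "OPACITY": 3,
                "SH_COEFFICIENTS": 4,                # incl. SH basis/order
            },
            "shDegree": 3,
            "defaultBehavior": "ignore-if-unsupported",
        }
    }
}

print(json.dumps(gaussian_extension, indent=2))
```

Without an agreed schema of this kind, two decoders reading the same auxiliary data can legitimately disagree on quaternion component order, SH basis, or even handedness.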
[Technical] Determinism is overstated: “Explicit methods naturally deterministic given fixed floating-point rules” ignores that GPU raster/compute pipelines, floating-point contraction, sorting ties in depth-ordered alpha compositing, and parallel reduction order can yield non-bit-exact results across vendors; conformance would need explicit ordering rules and error tolerances, not just “fixed floating-point rules.”
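The non-associativity point can be demonstrated in a few lines: IEEE-754 addition depends on evaluation order, so a GPU's parallel reduction and a sequential CPU sum can legitimately disagree even under identical rounding rules.

```python
# Demonstrates why "fixed floating-point rules" alone do not guarantee
# bit-exact results: IEEE-754 addition is not associative, so the
# reduction order (e.g. a GPU's pairwise tree sum vs. a sequential
# loop) changes the answer.
vals = [1e16, 1.0, 1.0, -1e16]

left_to_right = 0.0
for v in vals:
    left_to_right += v          # the 1.0s are absorbed by 1e16

pairwise = (vals[1] + vals[2]) + (vals[0] + vals[3])

print(left_to_right, pairwise)  # 0.0 vs 2.0
```

A conformance regime therefore has to specify either a canonical summation/compositing order or a numeric tolerance; "deterministic given fixed floating-point rules" covers neither.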
[Technical] The proposed “depth-ordered alpha compositing” rendering model is central to 3DGS but no interoperability-critical details are given (sorting key definition, handling of equal depths, tile-based sorting, prefiltering, blending equation, color space), making it hard to assess whether ARF can standardize a decoder-independent rendering outcome.
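The sketch below shows, under stated assumptions, the minimum that would have to be normative for depth-ordered compositing to be decoder-independent: the sort key, a deterministic tie-break, and the blending equation. The field names and the tie-break choice (stable order on `(depth, id)`) are illustrative, not a proposal from the document.

```python
# Minimal front-to-back "over" compositing with an explicit sort key and
# tie-break rule -- exactly the interoperability detail the contribution
# omits. Scalar 'color' is used for brevity; all names are illustrative.

def composite(splats):
    """splats: list of dicts with 'depth', 'id', 'color', 'alpha'.
    Ordering: primary key view depth (near first), ties broken
    deterministically by splat id."""
    ordered = sorted(splats, key=lambda s: (s["depth"], s["id"]))
    color, transmittance = 0.0, 1.0
    for s in ordered:
        color += transmittance * s["alpha"] * s["color"]
        transmittance *= (1.0 - s["alpha"])
    return color

splats = [
    {"id": 1, "depth": 2.0, "color": 1.0, "alpha": 0.5},
    {"id": 0, "depth": 1.0, "color": 0.0, "alpha": 0.5},
    {"id": 2, "depth": 1.0, "color": 0.5, "alpha": 0.5},  # depth tie with id 0
]
print(composite(splats))
```

Note that swapping the tie-break rule changes the output for the two equal-depth splats above, which is why equal-depth handling cannot be left implementation-defined.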
[Technical] The document asserts “direct ARF compatibility” for GaussianBlendshape/SplattingAvatar, but does not map their control parameters to specific ARF constructs (e.g., which blendshape set, naming/ID mapping, ranges, neutral definition, coordinate conventions), risking a mismatch between research model parameters (FLAME/SMPL-X) and ARF-defined animation semantics.
[Technical] The “40 KB/s for real-time animation” streaming figure (Step 4) is presented without stating its assumptions (number of joints, blendshape count, sampling rate, quantization, overhead, RTP/transport framing) and may be misleading, since typical face blendshape streams can exceed this figure depending on rate and precision.
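A back-of-envelope check illustrates why the assumptions matter. Every parameter below is an illustrative assumption (an ARKit-style 52-blendshape face set, a 25-joint body skeleton, float32 without quantization, rough RTP/UDP/IP framing), not a value from the contribution; even this modest configuration already exceeds 40 KB/s.

```python
# Back-of-envelope bitrate for an avatar animation stream under one
# explicit set of assumptions (all values are illustrative).

blendshapes = 52          # ARKit-style face set (assumption)
joints = 25               # body skeleton (assumption)
floats_per_joint = 7      # quaternion (4) + translation (3)
bytes_per_float = 4       # float32, no quantization (assumption)
rate_hz = 60
overhead_per_packet = 40  # rough RTP/UDP/IP framing (assumption)

payload = (blendshapes + joints * floats_per_joint) * bytes_per_float
total_bps = (payload + overhead_per_packet) * rate_hz
print(f"{payload} B/packet -> {total_bps / 1000:.1f} KB/s")
```

With these assumptions the stream runs at roughly 57 KB/s, while quantizing to 16 bits or dropping the rate to 30 Hz brings it well under 40 KB/s, so the figure is meaningless without the parameter set attached.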
[Technical] Compression proposals (SPZ, L-GSC, HAC++, Compact3D) are listed without clarifying whether they are geometry-only, attribute-aware (SH coefficients, opacity), support random access/partial decode, or preserve required precision for stable splat rendering; Objective 7 evaluation needs criteria tied to ARF use cases (latency, progressive LOD, error metrics).
[Technical] The “graceful fallback: mesh-only renderers can ignore Gaussian extension and still animate” is only valid if the base avatar always includes a complete mesh representation; several 3DGS approaches are not mesh-complete (e.g., hair volumes), so the fallback behavior and minimum mesh requirements should be stated.
[Technical] Non-rigid elements (hair/clothing/accessories) are acknowledged as a challenge, but the proposed ARF integration does not define how “secondary Gaussians” are driven (extra bones, physics, per-frame deltas, or optional streams), which is likely the dominant interoperability gap for full-body avatars.
[Technical] The classification “Hybrid methods… can still be driven by blendshape parameters with MLP weights distributed as part of base avatar” glosses over runtime dependencies: even small MLPs require a standardized inference graph, activation functions, quantization, and tensor layouts; without aligning to an existing standardized neural model format/profile, “portable” decoding is not ensured.
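Even a toy forward pass makes the dependency list visible. The sketch below is pure Python and entirely illustrative; the point is that layer sizes, weight layout (row- vs. column-major), the activation function, and where it is (not) applied must all be fixed by a spec or profile before two decoders can produce the same output from "MLP weights distributed as part of the base avatar".

```python
import math

def mlp_forward(x, weights, biases, activation=math.tanh):
    """Tiny MLP inference. weights[i] is a row-major matrix (list of
    rows); a portable profile would have to state this layout, the
    activation, the numeric type, and the no-activation-on-output rule
    assumed here."""
    for i, (W, b) in enumerate(zip(weights, biases)):
        x = [sum(w * xi for w, xi in zip(row, x)) + bi
             for row, bi in zip(W, b)]
        if i < len(weights) - 1:       # no activation on the output layer
            x = [activation(v) for v in x]
    return x

# Illustrative 2 -> 2 -> 1 network with fixed weights.
W = [[[1.0, 0.0], [0.0, 1.0]], [[0.5, 0.5]]]
b = [[0.0, 0.0], [0.0]]
print(mlp_forward([1.0, -1.0], W, b))
```

Flipping any one of these conventions (column-major weights, ReLU instead of tanh, activation on the last layer) yields a different avatar appearance from byte-identical weights, which is the interoperability risk the comment flags.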
[Editorial] Several performance/quality numbers (FPS, PSNR, training time, storage like “~3.5 MB”) are presented as facts but lack citations, test conditions, and hardware baselines; SA4 contributions typically need references or at least a consistent evaluation setup to avoid cherry-picked comparisons.
[Editorial] Terminology is inconsistent and sometimes ambiguous (e.g., “Gaussian Head Avatar” vs “GaussianHead”; “3DGS-Avatar” appears once without definition; “AAUs” is used without expansion in this document), which will hinder readers trying to relate items to known papers/spec terms.
[Editorial] The document repeatedly states “ARF compatibility” but does not reference specific clauses of ISO/IEC 23090-39 or the corresponding 3GPP study text (FS_Avatar_Ph2_MED Objective 3/7) where gaps exist; adding explicit clause-level mapping would make the contribution actionable for SA4.