S4-260096

Survey of Native AI formats for multi-modal AI

Source: Huawei Tech. (UK) Co., Ltd
Meeting: TSGS4_135_India
Agenda Item: 11.1

All Metadata
Agenda item description: FS_6G_MED (Study on Media aspects for 6G System)
Doc type: pCR
For: Agreement
Release: Rel-20
Specification: 26.87
Version: 0.0.1
Related WIs: FS_6G_MED
Contact: Rufail Mekuria (Contact ID 104180)
Uploaded: 2026-02-03T08:50:05.997000
Revised to: S4aP260009
TDoc Status: noted
Reservation date: 02/02/2026 13:45:39
Review Comments
manager - 2026-02-09 04:25


  1. [Technical] The contribution proposes adding a broad “Native AI Formats” clause but provides no 3GPP-relevant characterization (e.g., bitrate ranges, token rates, latency/jitter sensitivity, burstiness, uplink/downlink asymmetry), so it is unclear how it concretely supports “AI traffic characteristics” work in FS_6G_MED.
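To illustrate the missing characterization, a back-of-envelope sketch of the kind of figure the study would need (all numbers below are illustrative assumptions, not taken from the contribution): a discrete token stream maps to a bitrate via tokens per second times bits per token.

```python
# Hypothetical token-stream bitrate estimate. Every constant here is an
# assumed example value, not a figure from the contribution.
codebook_size = 4096          # bits per token = log2(codebook size)
tokens_per_frame = 256        # e.g. an assumed image tokenizer output
frames_per_second = 30

bits_per_token = codebook_size.bit_length() - 1   # log2(4096) = 12
bitrate_bps = tokens_per_frame * frames_per_second * bits_per_token
# -> 256 * 30 * 12 = 92160 bps, i.e. roughly 92 kbit/s for this stream
```

Even a rough table of such estimates per format would let SA4 judge latency, burstiness, and uplink/downlink asymmetry.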




  2. [Technical] The “General AI Processing Architecture” (Input→Encoder→Latent z→Quantization→Decoder→Output) is presented as generic, but many cited formats used for comprehension/IR/recommendation do not include a decoder or reconstruction objective; the clause should distinguish generative tokenizers/codecs vs embedding-only representations to avoid misleading conclusions.
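The distinction the comment asks for can be made concrete with a minimal sketch (random weights and shapes chosen for illustration only): a generative tokenizer quantizes the latent to transferable token ids and reconstructs, while an embedding-only encoder has no decoder and its continuous vector is the payload itself.

```python
# Minimal sketch contrasting a generative tokenizer with an
# embedding-only encoder. All shapes/weights are illustrative.
import numpy as np

rng = np.random.default_rng(0)

def encoder(x, W_enc):
    return x @ W_enc                      # continuous latent z

def quantize(z, codebook):
    # Nearest-codeword lookup -> discrete token ids (the transferable part)
    dists = np.linalg.norm(z[:, None, :] - codebook[None, :, :], axis=-1)
    return np.argmin(dists, axis=1)

def decoder(tokens, codebook, W_dec):
    return codebook[tokens] @ W_dec       # reconstruction path

x = rng.normal(size=(4, 8))
W_enc = rng.normal(size=(8, 2))
W_dec = rng.normal(size=(2, 8))
codebook = rng.normal(size=(16, 2))

# Generative path: x -> z -> token ids -> x_hat; a "native AI format"
# for transport would carry the token ids.
tokens = quantize(encoder(x, W_enc), codebook)
x_hat = decoder(tokens, codebook, W_dec)

# Embedding-only path (CLIP-style): x -> z, no quantization, no decoder;
# the continuous vector itself is what downstream tasks consume.
z_embed = encoder(x, W_enc)
```

Only the first path fits the proposed Input→Encoder→Latent→Quantization→Decoder diagram; the second does not, which is why the clause should separate the two.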




  3. [Technical] The stated “Alternative split inference approach” (“AI native format generation and AI pre-training instead of model-splitting”) is not technically substantiated: pre-training is offline and not a split-inference partitioning method, and the document does not define where the split occurs (UE, edge, network) nor the standardized interface implications.
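What a substantiated split-inference definition would have to pin down can be shown in a toy sketch (layer count, split point, and shapes are all assumed for illustration): which layers run on the UE, which run network-side, and what tensor crosses the interface.

```python
# Toy model-splitting sketch; all shapes and the split point are
# illustrative assumptions, not a proposal from the contribution.
import numpy as np

rng = np.random.default_rng(1)
layers = [rng.normal(size=(8, 8)) for _ in range(4)]

def run(x, ws):
    for w in ws:
        x = np.maximum(x @ w, 0.0)        # ReLU MLP stage
    return x

split = 2                                  # layers 0..1 on UE, 2..3 network-side
x = rng.normal(size=(1, 8))

intermediate = run(x, layers[:split])      # UE side; this tensor is the
                                           # intermediate representation whose
                                           # size and rate drive traffic
y_split = run(intermediate, layers[split:])  # network side
y_full = run(x, layers)                    # monolithic reference

assert np.allclose(y_split, y_full)        # split must preserve the output
```

Pre-training, by contrast, happens offline before any such partitioning and defines no interface, which is the comment's point.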




  4. [Technical] Several items in Table 1 are not “native AI formats” in the sense of a transferable discrete representation (e.g., CLIP is primarily continuous embeddings; many “N/A quantization” entries), so the proposed clause risks conflating embeddings, tokenizers, and codecs without defining the format properties that matter for transport and interoperability.




  5. [Technical] The “Reasons for AI split processing” list mixes privacy, compute offload, and “LLM compatibility” but does not address key constraints for 3GPP (security of intermediate representations, reversibility/leakage, integrity, model/version coupling), which are central if SA4 is to consider such formats for media services.




  6. [Technical] Quantization technique descriptions are inaccurate/unclear: “Level Wise Quantization (RQ)” is described as value-magnitude dependent error, whereas residual quantization is typically multi-stage codebooks on residuals; “FSQ projects vector to few dimensions” is not generally correct (FSQ is scalar quantization with finite levels per dimension).
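The corrected definitions can be illustrated with a short sketch (codebook sizes, level counts, and the tanh bounding are illustrative assumptions): residual quantization applies a fresh codebook at each stage to the residual left by the previous one, while FSQ rounds each dimension of a bounded vector to a small finite set of levels, with no learned codebook and no dimensionality reduction.

```python
# Illustrative RQ and FSQ sketches; sizes and the bounding function are
# assumptions for demonstration, not from any cited format.
import numpy as np

rng = np.random.default_rng(2)

def rq_encode(z, codebooks):
    # Residual quantization: each stage quantizes the residual left by
    # the previous stage with its own codebook.
    ids, resid = [], z.copy()
    for cb in codebooks:
        d = np.linalg.norm(resid[:, None, :] - cb[None, :, :], axis=-1)
        i = np.argmin(d, axis=1)
        ids.append(i)
        resid = resid - cb[i]
    return ids

def rq_decode(ids, codebooks):
    # Reconstruction is the sum of the selected codewords across stages.
    return sum(cb[i] for i, cb in zip(ids, codebooks))

def fsq(z, levels=5):
    # Finite scalar quantization: bound each dimension, then round it to
    # one of `levels` values per dimension -- no codebook lookup.
    half = (levels - 1) / 2
    return np.round(np.tanh(z) * half) / half

z = rng.normal(size=(4, 6))
codebooks = [rng.normal(size=(8, 6)) for _ in range(3)]
ids = rq_encode(z, codebooks)
z_rq = rq_decode(ids, codebooks)
z_fsq = fsq(z)                             # same dimensionality as z
```

Note that `z_fsq` keeps all six dimensions, contradicting the "projects vector to few dimensions" description.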




  7. [Technical] The proposal to add JPEG AI as an “AI-based codec” example is plausible, but the document does not explain how JPEG AI bitstreams relate to “native AI formats” for LLMs (tokens/latents) versus conventional decoded pixels, which affects whether it belongs in the same clause and what traffic characteristics apply.




  8. [Technical] The contribution does not specify whether the “native AI format” is intended to be standardized as a bitstream, a feature tensor, or an application-layer payload; without a clear abstraction boundary, it is hard to align with SA4 scope and existing 3GPP media frameworks.




  9. [Technical] The table includes many proprietary or poorly specified items (“Cosmos [NVIDIA, 2025]”, “Deep Render codec”, “Ming-univision”) without stable normative references, which undermines the feasibility of adding them as TR references and risks rapid obsolescence.




  10. [Editorial] The document claims a “comprehensive survey table (Table 1)” but the actual table is not included (only bullet lists), making it impossible to review completeness, columns, definitions, or consistency with the proposed TR insertion.




  11. [Editorial] Reference placeholders “[x1] through [x10]” are not provided with full bibliographic details, and several in-text citations are incomplete or inconsistent (e.g., “O’Shea 2015” for CNNs, “ISO/IEC 6048-1” for JPEG AI), which would block TR integration.




  12. [Editorial] Terminology is inconsistent and sometimes incorrect for the target audience: “MLM compatibility” is used where “multimodal LLM” is intended, and “native AI format” vs “tokenizer” vs “codec” are used interchangeably without definitions.




  13. [Technical] The “Supervision” subsection implies reconstruction loss (L2) as a general mechanism, but many modern tokenizers use perceptual/adversarial losses and many comprehension/IR tokenizers are trained with contrastive or task losses; the oversimplification could mislead SA4 conclusions about QoS/traffic (e.g., token stability vs task accuracy).
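The difference between the two training regimes can be sketched in a few lines (batch sizes, dimensions, and the temperature are illustrative assumptions): an L2 reconstruction loss scores sample-level fidelity, whereas a contrastive InfoNCE-style loss only asks that matched pairs score higher than mismatched ones, with no reconstruction target at all.

```python
# Illustrative loss sketches; shapes and temperature are assumptions.
import numpy as np

rng = np.random.default_rng(3)

def l2_reconstruction(x, x_hat):
    # Sample-level fidelity objective typical of generative tokenizers.
    return float(np.mean((x - x_hat) ** 2))

def info_nce(z_a, z_b, tau=0.1):
    # Contrastive objective typical of comprehension/IR encoders:
    # row i of z_a should match row i of z_b better than any other row.
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / tau
    logits = logits - logits.max(axis=1, keepdims=True)   # stable softmax
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

x = rng.normal(size=(4, 8))
loss_rec = l2_reconstruction(x, x + 0.1 * rng.normal(size=x.shape))
loss_con = info_nce(rng.normal(size=(4, 16)), rng.normal(size=(4, 16)))
```

Token streams from the two regimes behave differently under loss or re-quantization, which is exactly the QoS/traffic nuance the comment says the subsection glosses over.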




  14. [Editorial] Several modality descriptions are overly generic or contain questionable statements (e.g., “Transformers handle large parameter sizes efficiently”), which reads like tutorial material rather than TR-ready text tied to 3GPP study objectives.


