Meeting: TSGS4_135_India | Agenda Item: 11.1
Survey of Native AI formats for multi-modal AI
Huawei Tech. (UK) Co., Ltd
pCR
Agreement
| TDoc | S4-260096 |
| Title | Survey of Native AI formats for multi-modal AI |
| Source | Huawei Tech. (UK) Co., Ltd |
| Agenda item | 11.1 |
| Agenda item description | FS_6G_MED (Study on Media aspects for 6G System) |
| Doc type | pCR |
| For action | Agreement |
| Release | Rel-20 |
| Specification | 26.87 |
| Version | 0.0.1 |
| Related WIs | FS_6G_MED |
| download_url | https://www.3gpp.org/ftp/tsg_sa/WG4_CODEC/TSGS4_135_India/Docs/S4-260096.zip |
| Contact | Rufail Mekuria |
| Uploaded | 2026-02-03T08:50:05.997000 |
| Contact ID | 104180 |
| Revised to | S4aP260009 |
| TDoc Status | noted |
| Reservation date | 02/02/2026 13:45:39 |
| Agenda item sort order | 60 |
[Technical] The contribution proposes adding a broad “Native AI Formats” clause but provides no 3GPP-relevant characterization (e.g., bitrate ranges, token rates, latency/jitter sensitivity, burstiness, uplink/downlink asymmetry), so it is unclear how it concretely supports “AI traffic characteristics” work in FS_6G_MED.
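To make the missing characterization concrete, a minimal back-of-envelope sketch of the kind of traffic bound the comment asks for (all numbers are hypothetical, not taken from the contribution): for a fixed-rate discrete token stream, an upper-bound bitrate follows from the token rate and codebook size alone.

```python
import math

def token_stream_bitrate(token_rate_hz: int, codebook_size: int,
                         tokens_per_frame: int = 1) -> int:
    """Upper-bound bitrate in bit/s for a fixed-rate discrete token stream:
    token_rate * tokens_per_frame * ceil(log2(codebook_size))."""
    bits_per_token = math.ceil(math.log2(codebook_size))
    return token_rate_hz * tokens_per_frame * bits_per_token

# Hypothetical audio tokenizer: 50 Hz frame rate, 1024-entry codebook,
# 8 residual codebooks per frame.
print(token_stream_bitrate(50, 1024, 8))  # → 4000 bit/s
```

Even this trivial bound already yields the bitrate-range and burstiness inputs (fixed-rate vs on-demand token emission) that the FS_6G_MED traffic-characteristics objective would need.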
[Technical] The “General AI Processing Architecture” (Input→Encoder→Latent z→Quantization→Decoder→Output) is presented as generic, but many cited formats used for comprehension/IR/recommendation do not include a decoder or reconstruction objective; the clause should distinguish generative tokenizers/codecs vs embedding-only representations to avoid misleading conclusions.
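The distinction the comment asks the clause to draw can be stated in a few lines of toy code (interfaces and values are illustrative assumptions, not from the contribution): a generative tokenizer round-trips content through a discrete, transferable representation, while an embedding-only model has no decoder or reconstruction path at all.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class GenerativeTokenizer:
    """Encoder + quantizer + decoder: tokens form a transferable bitstream."""
    codebook: List[float]

    def encode(self, x: List[float]) -> List[int]:
        # Nearest-codeword index per sample (toy 1-D quantizer).
        return [min(range(len(self.codebook)),
                    key=lambda i: abs(self.codebook[i] - v)) for v in x]

    def decode(self, tokens: List[int]) -> List[float]:
        return [self.codebook[t] for t in tokens]

class EmbeddingModel:
    """Encoder only: continuous vector, no reconstruction objective."""
    def encode(self, x: List[float]) -> List[float]:
        return [sum(x) / len(x)]  # toy pooled embedding

tok = GenerativeTokenizer(codebook=[-1.0, 0.0, 1.0])
print(tok.decode(tok.encode([0.9, -0.8, 0.1])))  # round-trip: [1.0, -1.0, 0.0]
```

Only the first class fits the proposed Input→Encoder→Latent→Quantization→Decoder→Output pipeline; CLIP-style comprehension/IR models match the second and would need a separate characterization.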
[Technical] The stated “Alternative split inference approach” (“AI native format generation and AI pre-training instead of model-splitting”) is not technically substantiated: pre-training is an offline procedure, not a split-inference partitioning method, and the document defines neither where the split occurs (UE, edge, network) nor what the standardized interface implications would be.
[Technical] Several items in Table 1 are not “native AI formats” in the sense of a transferable discrete representation (e.g., CLIP is primarily continuous embeddings; many “N/A quantization” entries), so the proposed clause risks conflating embeddings, tokenizers, and codecs without defining the format properties that matter for transport and interoperability.
[Technical] The “Reasons for AI split processing” list mixes privacy, compute offload, and “LLM compatibility” but does not address key constraints for 3GPP (security of intermediate representations, reversibility/leakage, integrity, model/version coupling), which are central if SA4 is to consider such formats for media services.
[Technical] Quantization technique descriptions are inaccurate/unclear: “Level Wise Quantization (RQ)” is described as value-magnitude dependent error, whereas residual quantization is typically multi-stage codebooks on residuals; “FSQ projects vector to few dimensions” is not generally correct (FSQ is scalar quantization with finite levels per dimension).
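The correction in this comment can be illustrated with a toy sketch (1-D codebooks, no learned projections; all values assumed for illustration): residual quantization applies successive codebooks to the remaining residual, whereas FSQ simply rounds each dimension to one of a finite set of scalar levels, with no codebook lookup and no dimensionality projection.

```python
def rq_encode(v, codebooks):
    """Multi-stage residual quantization of a scalar: each stage quantizes
    what the previous stages left over."""
    indices, residual = [], v
    for cb in codebooks:
        i = min(range(len(cb)), key=lambda k: abs(cb[k] - residual))
        indices.append(i)
        residual -= cb[i]
    return indices

def rq_decode(indices, codebooks):
    return sum(cb[i] for i, cb in zip(indices, codebooks))

def fsq_quantize(vec, levels):
    """Finite scalar quantization: round each dimension independently to
    `levels` uniform levels in [-1, 1]; no codebook, no projection."""
    step = 2.0 / (levels - 1)
    return [round((x + 1.0) / step) * step - 1.0 for x in vec]

coarse = [-1.0, 0.0, 1.0]
fine = [-0.25, 0.0, 0.25]
idx = rq_encode(0.8, [coarse, fine])
print(rq_decode(idx, [coarse, fine]))        # 0.75: coarse 1.0 + fine -0.25
print(fsq_quantize([0.8, -0.3], levels=5))   # [1.0, -0.5]
```

Neither behavior matches the contribution's descriptions (value-magnitude-dependent error for RQ; dimensionality reduction for FSQ), which is why the clause text needs correcting before TR inclusion.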
[Technical] The proposal to add JPEG AI as an “AI-based codec” example is plausible, but the document does not explain how JPEG AI bitstreams relate to “native AI formats” for LLMs (tokens/latents) versus conventional decoded pixels, which affects whether it belongs in the same clause and what traffic characteristics apply.
[Technical] The contribution does not specify whether the “native AI format” is intended to be standardized as a bitstream, a feature tensor, or an application-layer payload; without a clear abstraction boundary, it is hard to align with SA4 scope and existing 3GPP media frameworks.
[Technical] The table includes many proprietary or poorly specified items (“Cosmos [NVIDIA, 2025]”, “Deep Render codec”, “Ming-univision”) without stable normative references, which undermines the feasibility of adding them as TR references and risks rapid obsolescence.
[Editorial] The document claims a “comprehensive survey table (Table 1)” but the actual table is not included (only bullet lists), making it impossible to review completeness, columns, definitions, or consistency with the proposed TR insertion.
[Editorial] Reference placeholders “[x1] through [x10]” are not provided with full bibliographic details, and several in-text citations are incomplete or inconsistent (e.g., “O’Shea 2015” for CNNs, “ISO/IEC 6048-1” for JPEG AI), which would block TR integration.
[Editorial] Terminology is inconsistent and sometimes incorrect for the target audience: “MLM compatibility” is used where “multimodal LLM” is intended, and “native AI format” vs “tokenizer” vs “codec” are used interchangeably without definitions.
[Technical] The “Supervision” subsection implies reconstruction loss (L2) as a general mechanism, but many modern tokenizers use perceptual/adversarial losses and many comprehension/IR tokenizers are trained with contrastive or task losses; the oversimplification could mislead SA4 conclusions about QoS/traffic (e.g., token stability vs task accuracy).
[Editorial] Several modality descriptions are overly generic or contain questionable statements (e.g., “Transformers handle large parameter sizes efficiently”), which reads like tutorial material rather than TR-ready text tied to 3GPP study objectives.