S4-260096

Survey of Native AI formats for multi-modal AI

Source: Huawei Tech. (UK) Co., Ltd
Meeting: TSGS4_135_India
Agenda Item: 11.1

All Metadata
Agenda item description: FS_6G_MED (Study on Media aspects for 6G System)
Doc type: pCR
For: Agreement
Release: Rel-20
Specification: 26.87
Version: 0.0.1
Related WIs: FS_6G_MED
Contact: Rufail Mekuria (Contact ID 104180)
Uploaded: 2026-02-03T08:50:05.997000
Revised to: S4aP260009
TDoc Status: noted
Reservation date: 02/02/2026 13:45:39
Review Comments
manager - 2026-02-09 04:25


  1. [Technical] The contribution proposes adding a broad “Native AI Formats” clause but provides no 3GPP-relevant characterization (e.g., bitrate ranges, token rates, latency/jitter sensitivity, burstiness, uplink/downlink asymmetry), so it is unclear how it concretely supports “AI traffic characteristics” work in FS_6G_MED.
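To illustrate the missing characterization, a back-of-envelope sketch of the kind of figure the study would need (all numbers below are illustrative assumptions, not taken from the contribution): a discrete token stream maps to a bitrate via tokens per second times bits per token.

```python
# Hypothetical token-stream bitrate estimate. Every constant here is an
# assumed example value, not a figure from the contribution.
codebook_size = 4096          # bits per token = log2(codebook size)
tokens_per_frame = 256        # e.g. an assumed image tokenizer output
frames_per_second = 30

bits_per_token = codebook_size.bit_length() - 1   # log2(4096) = 12
bitrate_bps = tokens_per_frame * frames_per_second * bits_per_token
# -> 256 * 30 * 12 = 92160 bps, i.e. roughly 92 kbit/s for this stream
```

Even a rough table of such estimates per format would let SA4 judge latency, burstiness, and uplink/downlink asymmetry.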




  2. [Technical] The “General AI Processing Architecture” (Input→Encoder→Latent z→Quantization→Decoder→Output) is presented as generic, but many cited formats used for comprehension/IR/recommendation do not include a decoder or reconstruction objective; the clause should distinguish generative tokenizers/codecs vs embedding-only representations to avoid misleading conclusions.
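The distinction the comment asks for can be made concrete with a minimal sketch (random weights and shapes chosen for illustration only): a generative tokenizer quantizes the latent to transferable token ids and reconstructs, while an embedding-only encoder has no decoder and its continuous vector is the payload itself.

```python
# Minimal sketch contrasting a generative tokenizer with an
# embedding-only encoder. All shapes/weights are illustrative.
import numpy as np

rng = np.random.default_rng(0)

def encoder(x, W_enc):
    return x @ W_enc                      # continuous latent z

def quantize(z, codebook):
    # Nearest-codeword lookup -> discrete token ids (the transferable part)
    dists = np.linalg.norm(z[:, None, :] - codebook[None, :, :], axis=-1)
    return np.argmin(dists, axis=1)

def decoder(tokens, codebook, W_dec):
    return codebook[tokens] @ W_dec       # reconstruction path

x = rng.normal(size=(4, 8))
W_enc = rng.normal(size=(8, 2))
W_dec = rng.normal(size=(2, 8))
codebook = rng.normal(size=(16, 2))

# Generative path: x -> z -> token ids -> x_hat; a "native AI format"
# for transport would carry the token ids.
tokens = quantize(encoder(x, W_enc), codebook)
x_hat = decoder(tokens, codebook, W_dec)

# Embedding-only path (CLIP-style): x -> z, no quantization, no decoder;
# the continuous vector itself is what downstream tasks consume.
z_embed = encoder(x, W_enc)
```

Only the first path fits the proposed Input→Encoder→Latent→Quantization→Decoder diagram; the second does not, which is why the clause should separate the two.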




  3. [Technical] The stated “Alternative split inference approach” (“AI native format generation and AI pre-training instead of model-splitting”) is not technically substantiated: pre-training is offline and not a split-inference partitioning method, and the document does not define where the split occurs (UE, edge, network) nor the standardized interface implications.
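What a substantiated split-inference definition would have to pin down can be shown in a toy sketch (layer count, split point, and shapes are all assumed for illustration): which layers run on the UE, which run network-side, and what tensor crosses the interface.

```python
# Toy model-splitting sketch; all shapes and the split point are
# illustrative assumptions, not a proposal from the contribution.
import numpy as np

rng = np.random.default_rng(1)
layers = [rng.normal(size=(8, 8)) for _ in range(4)]

def run(x, ws):
    for w in ws:
        x = np.maximum(x @ w, 0.0)        # ReLU MLP stage
    return x

split = 2                                  # layers 0..1 on UE, 2..3 network-side
x = rng.normal(size=(1, 8))

intermediate = run(x, layers[:split])      # UE side; this tensor is the
                                           # intermediate representation whose
                                           # size and rate drive traffic
y_split = run(intermediate, layers[split:])  # network side
y_full = run(x, layers)                    # monolithic reference

assert np.allclose(y_split, y_full)        # split must preserve the output
```

Pre-training, by contrast, happens offline before any such partitioning and defines no interface, which is the comment's point.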




  4. [Technical] Several items in Table 1 are not “native AI formats” in the sense of a transferable discrete representation (e.g., CLIP is primarily continuous embeddings; many “N/A quantization” entries), so the proposed clause risks conflating embeddings, tokenizers, and codecs without defining the format properties that matter for transport and interoperability.




  5. [Technical] The “Reasons for AI split processing” list mixes privacy, compute offload, and “LLM compatibility” but does not address key constraints for 3GPP (security of intermediate representations, reversibility/leakage, integrity, model/version coupling), which are central if SA4 is to consider such formats for media services.




  6. [Technical] Quantization technique descriptions are inaccurate/unclear: “Level Wise Quantization (RQ)” is described as value-magnitude dependent error, whereas residual quantization is typically multi-stage codebooks on residuals; “FSQ projects vector to few dimensions” is not generally correct (FSQ is scalar quantization with finite levels per dimension).
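The corrected definitions can be illustrated with a short sketch (codebook sizes, level counts, and the tanh bounding are illustrative assumptions): residual quantization applies a fresh codebook at each stage to the residual left by the previous one, while FSQ rounds each dimension of a bounded vector to a small finite set of levels, with no learned codebook and no dimensionality reduction.

```python
# Illustrative RQ and FSQ sketches; sizes and the bounding function are
# assumptions for demonstration, not from any cited format.
import numpy as np

rng = np.random.default_rng(2)

def rq_encode(z, codebooks):
    # Residual quantization: each stage quantizes the residual left by
    # the previous stage with its own codebook.
    ids, resid = [], z.copy()
    for cb in codebooks:
        d = np.linalg.norm(resid[:, None, :] - cb[None, :, :], axis=-1)
        i = np.argmin(d, axis=1)
        ids.append(i)
        resid = resid - cb[i]
    return ids

def rq_decode(ids, codebooks):
    # Reconstruction is the sum of the selected codewords across stages.
    return sum(cb[i] for i, cb in zip(ids, codebooks))

def fsq(z, levels=5):
    # Finite scalar quantization: bound each dimension, then round it to
    # one of `levels` values per dimension -- no codebook lookup.
    half = (levels - 1) / 2
    return np.round(np.tanh(z) * half) / half

z = rng.normal(size=(4, 6))
codebooks = [rng.normal(size=(8, 6)) for _ in range(3)]
ids = rq_encode(z, codebooks)
z_rq = rq_decode(ids, codebooks)
z_fsq = fsq(z)                             # same dimensionality as z
```

Note that `z_fsq` keeps all six dimensions, contradicting the "projects vector to few dimensions" description.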




  7. [Technical] The proposal to add JPEG AI as an “AI-based codec” example is plausible, but the document does not explain how JPEG AI bitstreams relate to “native AI formats” for LLMs (tokens/latents) versus conventional decoded pixels, which affects whether it belongs in the same clause and what traffic characteristics apply.




  8. [Technical] The contribution does not specify whether the “native AI format” is intended to be standardized as a bitstream, a feature tensor, or an application-layer payload; without a clear abstraction boundary, it is hard to align with SA4 scope and existing 3GPP media frameworks.




  9. [Technical] The table includes many proprietary or poorly specified items (“Cosmos [NVIDIA, 2025]”, “Deep Render codec”, “Ming-univision”) without stable normative references, which undermines the feasibility of adding them as TR references and risks rapid obsolescence.




  10. [Editorial] The document claims a “comprehensive survey table (Table 1)” but the actual table is not included (only bullet lists), making it impossible to review completeness, columns, definitions, or consistency with the proposed TR insertion.




  11. [Editorial] Reference placeholders “[x1] through [x10]” are not provided with full bibliographic details, and several in-text citations are incomplete or inconsistent (e.g., “O’Shea 2015” for CNNs, “ISO/IEC 6048-1” for JPEG AI), which would block TR integration.




  12. [Editorial] Terminology is inconsistent and sometimes incorrect for the target audience: “MLM compatibility” is used where “multimodal LLM” is intended, and “native AI format” vs “tokenizer” vs “codec” are used interchangeably without definitions.




  13. [Technical] The “Supervision” subsection implies reconstruction loss (L2) as a general mechanism, but many modern tokenizers use perceptual/adversarial losses and many comprehension/IR tokenizers are trained with contrastive or task losses; the oversimplification could mislead SA4 conclusions about QoS/traffic (e.g., token stability vs task accuracy).
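The difference between the two training regimes can be sketched in a few lines (batch sizes, dimensions, and the temperature are illustrative assumptions): an L2 reconstruction loss scores sample-level fidelity, whereas a contrastive InfoNCE-style loss only asks that matched pairs score higher than mismatched ones, with no reconstruction target at all.

```python
# Illustrative loss sketches; shapes and temperature are assumptions.
import numpy as np

rng = np.random.default_rng(3)

def l2_reconstruction(x, x_hat):
    # Sample-level fidelity objective typical of generative tokenizers.
    return float(np.mean((x - x_hat) ** 2))

def info_nce(z_a, z_b, tau=0.1):
    # Contrastive objective typical of comprehension/IR encoders:
    # row i of z_a should match row i of z_b better than any other row.
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / tau
    logits = logits - logits.max(axis=1, keepdims=True)   # stable softmax
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

x = rng.normal(size=(4, 8))
loss_rec = l2_reconstruction(x, x + 0.1 * rng.normal(size=x.shape))
loss_con = info_nce(rng.normal(size=(4, 16)), rng.normal(size=(4, 16)))
```

Token streams from the two regimes behave differently under loss or re-quantization, which is exactly the QoS/traffic nuance the comment says the subsection glosses over.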




  14. [Editorial] Several modality descriptions are overly generic or contain questionable statements (e.g., “Transformers handle large parameter sizes efficiently”), which reads like tutorial material rather than TR-ready text tied to 3GPP study objectives.


