Survey of Native AI Formats for Multi-modal AI
1. Introduction
This document surveys AI native formats for addressing generic AI-related tasks including generation, comprehension, information retrieval, and recommendation in advanced multimedia use cases (multi-modal AI).
Key Differences from 5G Work
- Broader scope: Covers generic applications beyond detection/segmentation/tracking, focusing on reconstruction, comprehension, recommendation, and information retrieval
- Multi-Modal LLM support: Enables use of Multi-Modal Large Language Models
- Generic multi-modal formats: Combines text, image, video, and audio modalities
- Alternative split inference approach: Uses AI native format generation and AI pre-processing instead of model splitting
Standardization Considerations
The document acknowledges that standardization of such formats is challenging due to:
- Constantly evolving field
- Task-specific nature of AI native formats
However, SA4 should:
- Document and study these formats in FS_6G_MED
- Track progress in this area
- Understand characteristics and QoS requirements relevant to 3GPP networks
- Consider these formats when analyzing AI traffic characteristics
2. AI Processing and Related Native AI Formats
Overview
Recent advances in AI, particularly Large Language Models (LLMs) and Multi-Modal LLMs, enable new applications in generation, comprehension, information retrieval, and recommendation. Multi-modal LLMs require AI-related pre-processing to create AI native formats.
Reasons for AI Split Processing and Native AI Formatting
- Workload distribution: Offload privacy-sensitive and computationally intensive parts (similar to 5G split inferencing)
- LLM/MLLM compatibility: Enable input suitable for auto-regressive models (discrete token vectors rather than raw continuous signals)
- Modality combination: Merge text, image, video into relevant features
- Data reduction: Reduce data size, latency, and bandwidth requirements
- Task optimization: Optimize for different tasks (reconstruction vs. comprehension)
General AI Processing Architecture
The document presents a comprehensive survey based on [Jian Jia et al. 2025], extended with 2025 techniques, showing:
Input → Encoder → Latent Vector (z) → Quantization → Decoder → Output
With supervision feedback loop for training.
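The pipeline above can be sketched in a few lines of numpy. The weights, codebook size, and dimensions below are illustrative placeholders standing in for trained networks, not any specific tokenizer:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear "networks" standing in for trained encoder/decoder models
# (hypothetical shapes: 16-dim input, 4-dim latent).
W_enc = rng.normal(size=(16, 4))
W_dec = rng.normal(size=(4, 16))

# Codebook of 8 latent vectors used by the quantization stage.
codebook = rng.normal(size=(8, 4))

def encode(x):
    return x @ W_enc                      # Input -> Latent vector (z)

def quantize(z):
    # Replace z by its nearest codebook entry (vector quantization);
    # the entry index is the discrete "native AI format" token.
    idx = int(np.argmin(np.linalg.norm(codebook - z, axis=1)))
    return idx, codebook[idx]

def decode(z_q):
    return z_q @ W_dec                    # Quantized latent -> Output

x = rng.normal(size=16)
z = encode(x)
token_id, z_q = quantize(z)
x_hat = decode(z_q)

# Supervision: minimize the error between decoder output and encoder input.
loss = np.mean((x - x_hat) ** 2)
```

In a real system the encoder, decoder, and codebook are learned jointly by backpropagating this loss; the sketch only shows the data flow.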
Encoder Techniques
Transformer [Vaswani et al 2017]
- Attention-based architecture that substantially improved sequence-modeling performance
- Text: Processed by tokenization then fed to transformers
- 2D input: Segmented into patches, treated as sequences
- 3D input: Sliced temporally, 2D patches represented as 3D tubes [Wang et al 2024b]
- Scales efficiently to large parameter counts and is increasingly popular as an encoder choice
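The patch-based treatment of 2D input can be illustrated as follows; the image and patch sizes are arbitrary choices for the example, and extending patches along the temporal axis yields the 3D tubes used for video:

```python
import numpy as np

def image_to_patches(img, patch):
    """Split an HxWxC image into non-overlapping flattened patches,
    producing the token sequence a vision transformer consumes."""
    H, W, C = img.shape
    assert H % patch == 0 and W % patch == 0
    return (img.reshape(H // patch, patch, W // patch, patch, C)
               .transpose(0, 2, 1, 3, 4)        # group pixels per patch
               .reshape(-1, patch * patch * C)) # one row per patch

img = np.zeros((32, 32, 3))                     # toy 32x32 RGB image
tokens = image_to_patches(img, patch=8)         # (32/8)*(32/8) = 16 patches
```

Each row of `tokens` is then linearly projected and fed to the transformer as one sequence element.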
Convolutional Neural Network (CNN) [O'Shea 2015]
- Popular for image feature extraction (e.g., UNet [Ronneberger et al 2015])
- Used for audio intermediate formats
- Extends to video via 3D-CNN incorporating temporal dimension
Multi-Layer Perceptron (MLP)
- Earlier architecture for embeddings in recommender systems
- Used for latent space mapping [Rajput et al. 2023; Singh et al. 2024]
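A minimal sketch of such an MLP latent-space mapping, with hypothetical layer sizes and random weights standing in for a trained recommender encoder:

```python
import numpy as np

rng = np.random.default_rng(1)

def relu(x):
    return np.maximum(x, 0.0)

# Hypothetical two-layer MLP mapping a sparse item-interaction vector
# (100 catalogue items) to a dense 8-dim latent for later quantization.
W1, b1 = rng.normal(size=(100, 32)), np.zeros(32)
W2, b2 = rng.normal(size=(32, 8)), np.zeros(8)

def mlp_embed(user_history):
    h = relu(user_history @ W1 + b1)
    return h @ W2 + b2

history = np.zeros(100)
history[[3, 17, 42]] = 1.0      # items the user has interacted with
z = mlp_embed(history)          # dense latent embedding
```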
Decoder Processing
- Applies related transforms for reconstruction
- May include different models for specific tasks (generation, recommendation, information retrieval)
- Can be jointly optimized with encoder
Supervision
- Minimizes error between decoder output and encoder input
- For reconstruction: Uses metrics like L2 norm distance
- For comprehension/information retrieval: Requires additional ground truth information
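The two supervision regimes can be contrasted with toy loss functions (a simplified sketch; real systems add codebook and regularization terms):

```python
import numpy as np

def l2_loss(x, x_hat):
    # Reconstruction supervision: squared L2 distance between the
    # decoder output and the original encoder input.
    return float(np.sum((x - x_hat) ** 2))

def cross_entropy(logits, label):
    # Comprehension supervision: needs ground-truth information beyond
    # the input itself (here, the index of the correct class label).
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return float(-np.log(p[label]))

x = np.array([1.0, 2.0, 3.0])
recon_err = l2_loss(x, x + 0.1)                         # small error
cls_err = cross_entropy(np.array([2.0, 0.5, -1.0]), 0)  # label 0 is true
```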
Application Types
- Generation: Reconstruction or generating related content
- Comprehension: Understanding input (textual description, labeling)
- Information Retrieval: Retrieve related documents using semantic features
- Recommendation: Provide recommendations (mainly based on historical behavior)
Different applications use different decoder models and encoder processing, making native AI formats often task-specific.
Quantization Techniques
Following [Jian Jia et al. 2025], the survey identifies:
- Vector Quantization (VQ): Vanilla vector quantization [Juang and Gray, 1982]; replaces the latent with its minimum-distance codebook entry
- Residual (level-wise) Quantization (RQ): Each level quantizes the residual error left by the previous level, refining the approximation level by level
- Group-wise Residual Vector Quantization (GRVQ): Splits the vector into sub-components and quantizes each separately
- Lookup-Free Quantization (LFQ): Quantizes each latent dimension independently (e.g., to binary values), removing the need for a codebook lookup table
- Finite Scalar Quantization (FSQ): Projects the vector to a few dimensions, each bounded and rounded to a small fixed set of levels
Note: Some tokenizers do not apply quantization and instead output continuous floating-point latents.
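As a sketch of how these schemes differ, the following minimal numpy example implements VQ, RQ, and FSQ; the codebook size, latent dimension, and level counts are illustrative assumptions, not taken from any cited tokenizer:

```python
import numpy as np

rng = np.random.default_rng(2)
codebook = rng.normal(size=(16, 4))   # 16 entries, 4-dim latents

def vq(z):
    # Vector quantization: nearest codebook entry by Euclidean distance.
    idx = int(np.argmin(np.linalg.norm(codebook - z, axis=1)))
    return idx, codebook[idx]

def rq(z, levels=3):
    # Residual quantization: each level quantizes the error left by the
    # previous one, so the approximation refines level by level.
    ids, approx = [], np.zeros_like(z)
    for _ in range(levels):
        idx, q = vq(z - approx)
        ids.append(idx)
        approx += q
    return ids, approx

def fsq(z, levels=5):
    # Finite scalar quantization: bound each dimension with tanh and
    # round it to one of `levels` values -- no codebook lookup needed.
    half = levels // 2
    return np.round(np.tanh(z) * half) / half

z = rng.normal(size=4)
vq_id, q_vq = vq(z)       # one token per vector
rq_ids, q_rq = rq(z)      # one token per level
q_fsq = fsq(z)            # per-dimension rounded values
```

Note how VQ emits a single codebook index, RQ a short list of indices, and FSQ a rounded vector with no codebook at all.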
AI-Based Codecs
Native AI formats have been used to develop codecs:
- JPEG AI [ISO/IEC 6048-1]
- Deep Render codec (from InterDigital, available on the FFmpeg and VLC platforms)
3. Survey of AI Processing for Native AI Formats
Comprehensive Survey Table
The document provides an extensive table (Table 1) surveying AI pre-processing techniques with the following characteristics:
Image Modality Techniques
- VQVAE [Van Den Oord et al., 2017]: CNN encoder, VQ, Generation
- VQGAN [Esser et al., 2021]: CNN encoder, VQ, Generation
- ViT-VQGAN [Yu et al., 2022]: Transformer encoder, VQ, Generation
- RQVAE [Lee et al., 2022]: CNN encoder, RQ, Generation
- LQAE [Liu et al., 2023]: CNN encoder, VQ, Generation
- SEED [Ge et al., 2024]: Transformer encoder, VQ, Generation & Comprehension
- TiTok [Yu et al., 2024b]: Transformer encoder, VQ, Generation
- Spectral image tokenizer [Esteves et al 2025]: Transformer encoder, VQ, Generation
- Subobject-level [Chen et al 2024]: Transformer/CNN encoder, VQ, Generation
- One-d-piece [Miwa et al 2025]: Transformer encoder, VQ, Generation & Comprehension
- Semhitok [Chen et al 2025]: Transformer encoder, LFQ, Generation & Comprehension
- Ming-univision [Z Huang et al]: Transformer encoder, N/A, Generation & Comprehension
- SetTok [Geng et al 2025]: Transformer encoder, LFQ, Generation
- UniTok [Chuofan Ma et al 2020]: CNN encoder, LFQ, Generation & Comprehension
- GloTok [Zhao et al 2025]: Transformer encoder, VQ, Generation
- GaussianToken [Jiajun et al 2025]: CNN encoder, VQ, Generation
- CAT [Shen et al 2025]: Transformer/CNN encoder, VQ, Generation
- OpenAI CLIP: CNN encoder, N/A, Generation & Comprehension
- JPEG AI [ISO/IEC 6048-1]: CNN/Transformer encoder, VQ, Reconstruction & Comprehension
Image & Video Modality Techniques
- MAGVIT [Yu et al., 2023]: CNN encoder, VQ, Generation
- MAGVIT-v2 [Yu et al., 2024a]: CNN encoder, LFQ, Generation
- OmniTokenizer [Wang et al., 2024b]: Transformer encoder, VQ, Generation
- SweetTokenizer [Tan et al., 2024]: Transformer encoder, VQ, Generation & Comprehension
- Atoken [J Lu et al 2025]: Transformer encoder, FSQ, Generation & Comprehension
- HieraTok [Chen et al 2025]: Transformer encoder, VQ, Generation
Video Modality Techniques
- Cosmos [NVIDIA, 2025]: CNN/Transformer encoder, FSQ, Generation
- VidTok [Tang et al., 2024]: CNN encoder, FSQ, Generation
- Video-LaViT [Jin et al., 2024b]: Transformer encoder, VQ, Generation & Comprehension
- Grace [Cheng et al. 2024]: CNN encoder, VQ, Reconstruction
- MPEG FCM [Eimond et al 2025]: CNN encoder, VQ, Comprehension
Multi-Modal (Image/Audio/Text) Techniques
- TEAL [Yang et al., 2023b]: Transformer encoder, VQ, Comprehension
- AnyGPT [Zhan et al., 2024]: Transformer encoder, VQ, Generation & Comprehension
- LaViT [Jin et al., 2024c]: Transformer encoder, VQ, Generation & Comprehension
- ElasticTok [Yan et al., 2024]: Transformer encoder, VQ/FSQ, Generation & Comprehension
- Chameleon [Team, 2024]: CNN/Transformer encoder, VQ, Generation & Comprehension
- ShowO [Xie et al., 2024]: CNN/Transformer encoder, LFQ, Generation & Comprehension
Audio Modality Techniques
- SoundStream [Zeghidour et al., 2021]: CNN encoder, RQ, Generation
- HiFiCodec [Yang et al., 2023a]: CNN encoder, GRVQ, Generation
- RepCodec [Huang et al., 2024]: CNN/Transformer encoder, RQ, Comprehension
- SpeechTokenizer [Zhang et al., 2024]: CNN/Transformer encoder, RQ, Generation & Comprehension
- NeuralSpeech-3 [Ju et al., 2024]: CNN/Transformer encoder, VQ, Generation & Comprehension
- iRVQGAN [Kumar et al., 2024]: CNN encoder, RQ, Generation
Text-Based Recommendation Systems
- TIGER [Rajput et al., 2023]: MLP encoder, RQ, Recommendation
- SPM-SID [Singh et al., 2024]: MLP encoder, RQ, Recommendation
- TokenRec [Qu et al., 2024]: MLP encoder, VQ, Recommendation
- VQ-Rec [Hou et al., 2023]: MLP encoder, RQ, Recommendation
- LC-Rec [Zheng et al., 2024]: MLP encoder, RQ, Recommendation
- LETTER [Wang et al., 2024c]: MLP encoder, RQ, Recommendation
- CoST [Zhu et al., 2024]: MLP encoder, RQ, Recommendation
- ColaRec [Wang et al., 2024d]: MLP encoder, VQ, Recommendation
- SEATER [Si et al., 2024]: MLP encoder, VQ, Recommendation
- QARM [Luo et al., 2024]: MLP encoder, VQ, Recommendation
Text-Based Information Retrieval
- DSI [Tay et al., 2022]: Transformer encoder, VQ, Information Retrieval
- Ultron [Zhou et al., 2022]: Transformer encoder, RQ, Information Retrieval
- GenRet [Sun et al., 2024]: Transformer encoder, VQ, Information Retrieval
- LMINDEXER [Jin et al., 2024a]: Transformer encoder, VQ, Information Retrieval
- RIPOR [Zeng et al., 2024]: Transformer encoder, RQ, Information Retrieval
4. Conclusions and Proposals
Proposals
a) AI Traffic Characteristics: Take this information into account when developing the overview of AI traffic characteristics, covering native AI formats and codecs alongside traditional codec options.
b) 6G Split Inferencing: Consider that the split operation may include AI processing/formatting in addition to the traditional model splitting considered in 5G.
c) TR Update: Add text and diagram based on clause 2 to TR for FS_6G_MED.
Proposed Change Requests
Change 1: References Addition
Add references [x1] through [x10] to the TR, including key papers on:
- Discrete tokenizers survey
- JPEG AI standard
- Quantization techniques
- Transformer architectures
- CNN architectures
- Specific implementations
Change 2: New Clause 6.2.4.X - Native AI Formats
Add comprehensive new clause under "AI Traffic Characteristics" covering:
- Overview of multi-modal AI and native formats
- Reasons for AI split processing
- Encoder techniques (Transformer, CNN, MLP)
- Decoder processing
- Supervision methods
- Application types
- Quantization techniques
- AI-based codecs
This clause provides the technical foundation for understanding native AI formats in the context of 6G media services.