# Survey of Native AI Formats for Multi-modal AI

## 1. Introduction

This document surveys native AI formats for addressing generic AI-related tasks, including generation, comprehension, information retrieval, and recommendation, in advanced multimedia use cases (multi-modal AI).

### Key Differences from 5G Work

- **Broader scope**: Covers generic applications beyond detection/segmentation/tracking, focusing on reconstruction, comprehension, recommendation, and information retrieval
- **Multi-Modal LLM support**: Enables use of Multi-Modal Large Language Models
- **Generic multi-modal formats**: Combines text, image, video, and audio modalities
- **Alternative split inference approach**: Uses AI native format generation and AI pre-training instead of model-splitting

### Standardization Considerations

The document acknowledges that standardization of such formats is challenging due to:
- Constantly evolving field
- Task-specific nature of AI native formats

However, SA4 should:
- Document and study these formats in FS_6G_MED
- Track progress in this area
- Understand characteristics and QoS requirements relevant to 3GPP networks
- Consider these formats when analyzing AI traffic characteristics

## 2. AI Processing and Related Native AI Formats

### Overview

Recent advances in AI, particularly Large Language Models (LLMs) and Multi-Modal LLMs, enable new applications in generation, comprehension, information retrieval, and recommendation. Multi-modal LLMs require AI-related pre-processing of their inputs to create native AI formats.

### Reasons for AI Split Processing and Native AI Formatting

1. **Workload distribution**: Offload privacy-sensitive and computationally intensive parts (similar to 5G split inferencing)
2. **LLM/MLLM compatibility**: Enable input suitable for auto-regressive models (discrete tokens represented as vectors)
3. **Modality combination**: Merge text, image, video into relevant features
4. **Data reduction**: Reduce data size, latency, and bandwidth requirements
5. **Task optimization**: Optimize for different tasks (reconstruction vs. comprehension)
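To make the data-reduction argument concrete, the back-of-envelope calculation below compares a raw image with a tokenized representation. All numbers (image size, token count, codebook size) are illustrative, not figures from any specific tokenizer:

```python
# Illustrative arithmetic only: the image size, token count, and codebook
# size are hypothetical examples, not normative values.

raw_bits = 256 * 256 * 3 * 8      # 256x256 RGB image, 8 bits per channel
print(raw_bits)                   # 1572864 bits (~192 KiB)

num_tokens = 256                  # e.g. a 16x16 grid of patch tokens
codebook_size = 16384             # 2**14 codebook entries
bits_per_token = 14               # log2(codebook_size)
token_bits = num_tokens * bits_per_token
print(token_bits)                 # 3584 bits (~448 bytes)

print(raw_bits / token_bits)      # ~439x reduction, before any entropy coding
```

Even with generous token budgets, the tokenized form is orders of magnitude smaller than raw media, which is what makes it attractive for split operation over a network.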

### General AI Processing Architecture

The document presents a comprehensive survey based on [Jian Jia et al. 2025], extended with 2025 techniques, showing:

**Input → Encoder → Latent Vector (z) → Quantization → Decoder → Output**

With supervision feedback loop for training.
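The pipeline above can be sketched in a few lines of NumPy. The linear "encoder"/"decoder" and the random codebook below are hypothetical stand-ins for trained models; only the data flow (input → encoder → latent → quantization → decoder → output) matches the diagram:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy dimensions: 64-dim input, 8-dim latent, 32-entry codebook.
W_enc = rng.normal(size=(8, 64)) * 0.1   # stand-in for a trained encoder
W_dec = rng.normal(size=(64, 8)) * 0.1   # stand-in for a trained decoder
codebook = rng.normal(size=(32, 8))      # learned in practice; random here

x = rng.normal(size=64)                  # input sample

z = W_enc @ x                            # Encoder -> latent vector z
dists = np.linalg.norm(codebook - z, axis=1)
token = int(np.argmin(dists))            # Quantization: nearest codebook entry
z_q = codebook[token]                    # quantized latent
x_hat = W_dec @ z_q                      # Decoder -> output

# The "native AI format" exchanged between split endpoints is the token
# index (given a shared codebook), not the raw input.
print(token, x_hat.shape)
```

In training, the supervision loop would compare `x_hat` against `x` and update the encoder, decoder, and codebook accordingly.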

### Encoder Techniques

#### Transformer [Vaswani et al 2017]
- Attention-based model with significant performance enhancement
- **Text**: Processed by tokenization then fed to transformers
- **2D input**: Segmented into patches, treated as sequences
- **3D input**: Sliced temporally, 2D patches represented as 3D tubes [Wang et al 2024b]
- Handles large parameter sizes efficiently
- Increasingly popular
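The patch-based treatment of 2D input can be sketched as follows. Patch and image sizes are illustrative; real ViT-style tokenizers additionally apply a learned linear projection and position embeddings to each patch:

```python
import numpy as np

def patchify(image: np.ndarray, patch: int) -> np.ndarray:
    """Split an HxWxC image into a sequence of flattened patches."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0
    # Reshape into a (grid_h, grid_w) grid of patch x patch x c blocks...
    grid = image.reshape(h // patch, patch, w // patch, patch, c)
    grid = grid.transpose(0, 2, 1, 3, 4)
    # ...then flatten each patch into one vector of the sequence.
    return grid.reshape(-1, patch * patch * c)

img = np.zeros((32, 32, 3))
seq = patchify(img, patch=16)
print(seq.shape)   # (4, 768): four 16x16x3 patches, treated as a sequence
```

3D (video) input follows the same idea with a temporal axis, turning spatio-temporal tubes into sequence elements.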

#### Convolutional Neural Network (CNN) [O'Shea 2015]
- Popular for image feature extraction (e.g., UNet [Ronneberger et al 2015])
- Used for audio intermediate formats
- Extends to video via 3D-CNN incorporating temporal dimension

#### Multi-Layer Perceptron (MLP)
- Earlier architecture for embeddings in recommender systems
- Used for latent space mapping [Rajput et al. 2023; Singh et al. 2024]

### Decoder Processing

- Applies the corresponding inverse transforms for reconstruction
- May include different models for specific tasks (generation, recommendation, information retrieval)
- Can be jointly optimized with encoder

### Supervision

- Minimizes error between decoder output and encoder input
- For reconstruction: Uses metrics like L2 norm distance
- For comprehension/information retrieval: Requires additional ground truth information
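For the reconstruction case, the supervision signal reduces to a distance between decoder output and encoder input; a minimal example of the L2 criterion mentioned above (function name and toy values are illustrative):

```python
import numpy as np

def l2_reconstruction_loss(x: np.ndarray, x_hat: np.ndarray) -> float:
    """Mean squared L2 distance between encoder input and decoder output."""
    return float(np.mean((x - x_hat) ** 2))

x = np.array([1.0, 2.0, 3.0])
print(l2_reconstruction_loss(x, x))        # 0.0: perfect reconstruction
print(l2_reconstruction_loss(x, x + 0.5))  # 0.25: constant 0.5 offset
```

Comprehension, retrieval, and recommendation tasks replace `x` with task-specific ground truth (labels, relevance judgements, user interactions), which is why their native AI formats tend to diverge from reconstruction-oriented ones.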

### Application Types

1. **Generation**: Reconstruction or generating related content
2. **Comprehension**: Understanding input (textual description, labeling)
3. **Information Retrieval**: Retrieve related documents using semantic features
4. **Recommendation**: Provide recommendations (mainly based on historical behavior)

Different applications use different decoder models and encoder processing, making native AI formats often task-specific.

### Quantization Techniques

Following [Jian Jia et al. 2025], the survey identifies:

- **Vector Quantization (VQ)**: Vanilla vector quantization [Juang and Gray, 1982]; each vector is mapped to the minimum-distance codebook entry
- **Residual Quantization (RQ)**: Level-wise quantization in which each level quantizes the residual error left by the previous level, yielding progressively smaller errors
- **Group-wise Residual Vector Quantization (GRVQ)**: Splits the vector into sub-vectors and quantizes each separately
- **Lookup-Free Quantization (LFQ)**: Quantizes each latent dimension independently, removing the need for an explicit codebook lookup
- **Finite Scalar Quantization (FSQ)**: Projects the latent vector to a few dimensions and rounds each to a small set of fixed levels

**Note**: Some tokenizers may not use quantization and rely on floating-point arithmetic.
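The two ends of this design space, codebook-based VQ and codebook-free FSQ, can be contrasted in a few lines. This is a simplified sketch: the codebook size, level count, and the `tanh` bounding used for FSQ are illustrative choices, not the exact formulation of any cited paper:

```python
import numpy as np

def vq(z: np.ndarray, codebook: np.ndarray) -> int:
    """Vanilla VQ: index of the minimum-distance codebook entry."""
    return int(np.argmin(np.linalg.norm(codebook - z, axis=1)))

def fsq(z: np.ndarray, levels: int = 5) -> np.ndarray:
    """FSQ sketch: bound each dimension, then round to `levels` fixed values."""
    half = (levels - 1) / 2
    return np.round(np.tanh(z) * half) / half  # values in {-1,-0.5,0,0.5,1}

rng = np.random.default_rng(0)
codebook = rng.normal(size=(16, 4))            # illustrative 16-entry codebook
z = np.array([0.3, -1.2, 0.0, 2.5])

print(vq(z, codebook))   # a single token index into the shared codebook
print(fsq(z))            # a per-dimension code; no lookup table required
```

RQ and GRVQ sit between these extremes: RQ applies `vq` repeatedly to the residual of each level, while GRVQ applies it per sub-vector.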

### AI-Based Codecs

Native AI formats have been used to develop codecs:
- **JPEG AI** [ISO/IEC 6048-1]
- **Deep Render codec** (from InterDigital, available on FFMPEG and VLC platforms)

## 3. Survey of AI Processing for Native AI Formats

### Comprehensive Survey Table

The document provides an extensive table (Table 1) surveying AI pre-processing techniques with the following characteristics:

#### Image Modality Techniques
- **VQVAE** [Van Den Oord et al., 2017]: CNN encoder, VQ, Generation
- **VQGAN** [Esser et al., 2021]: CNN encoder, VQ, Generation
- **ViT-VQGAN** [Yu et al., 2022]: Transformer encoder, VQ, Generation
- **RQVAE** [Lee et al., 2022]: CNN encoder, RQ, Generation
- **LQAE** [Liu et al., 2023]: CNN encoder, VQ, Generation
- **SEED** [Ge et al., 2024]: Transformer encoder, VQ, Generation & Comprehension
- **TiTok** [Yu et al., 2024b]: Transformer encoder, VQ, Generation
- **Spectral image tokenizer** [Esteves et al 2025]: Transformer encoder, VQ, Generation
- **Subobject-level** [Chen et al 2024]: Transformer/CNN encoder, VQ, Generation
- **One-d-piece** [Miwa et al 2025]: Transformer encoder, VQ, Generation & Comprehension
- **Semhitok** [Chen et al 2025]: Transformer encoder, LFQ, Generation & Comprehension
- **Ming-univision** [Z Huang et al]: Transformer encoder, N/A, Generation & Comprehension
- **SetTok** [Geng et al 2025]: Transformer encoder, LFQ, Generation
- **UniTok** [Chuofan Ma et al 2020]: CNN encoder, LFQ, Generation & Comprehension
- **GloTok** [Zhao et al 2025]: Transformer encoder, VQ, Generation
- **GaussianToken** [Jiajun et al 2025]: CNN encoder, VQ, Generation
- **CAT** [Shen et al 2025]: Transformer/CNN encoder, VQ, Generation
- **OpenAI CLIP**: CNN encoder, N/A, Generation & Comprehension
- **JPEG AI** [ISO/IEC 6048-1]: CNN/Transformer encoder, VQ, Reconstruction & Comprehension

#### Image & Video Modality Techniques
- **MAGVIT** [Yu et al., 2023]: CNN encoder, VQ, Generation
- **MAGVIT-v2** [Yu et al., 2024a]: CNN encoder, LFQ, Generation
- **OmniTokenizer** [Wang et al., 2024b]: Transformer encoder, VQ, Generation
- **SweetTokenizer** [Tan et al., 2024]: Transformer encoder, VQ, Generation & Comprehension
- **Atoken** [J Lu et al 2025]: Transformer encoder, FSQ, Generation & Comprehension
- **HieraTok** [Chen et al 2025]: Transformer encoder, VQ, Generation

#### Video Modality Techniques
- **Cosmos** [NVIDIA, 2025]: CNN/Transformer encoder, FSQ, Generation
- **VidTok** [Tang et al., 2024]: CNN encoder, FSQ, Generation
- **Video-LaViT** [Jin et al., 2024b]: Transformer encoder, VQ, Generation & Comprehension
- **Grace** [Cheng et al. 2024]: CNN encoder, VQ, Reconstruction
- **MPEG FCM** [Eimond et al 2025]: CNN encoder, VQ, Comprehension

#### Multi-Modal (Image/Audio/Text) Techniques
- **TEAL** [Yang et al., 2023b]: Transformer encoder, VQ, Comprehension
- **AnyGPT** [Zhan et al., 2024]: Transformer encoder, VQ, Generation & Comprehension
- **LaViT** [Jin et al., 2024c]: Transformer encoder, VQ, Generation & Comprehension
- **ElasticTok** [Yan et al., 2024]: Transformer encoder, VQ/FSQ, Generation & Comprehension
- **Chameleon** [Team, 2024]: CNN/Transformer encoder, VQ, Generation & Comprehension
- **ShowO** [Xie et al., 2024]: CNN/Transformer encoder, LFQ, Generation & Comprehension

#### Audio Modality Techniques
- **SoundStream** [Zeghidour et al., 2021]: CNN encoder, RQ, Generation
- **HiFiCodec** [Yang et al., 2023a]: CNN encoder, GRVQ, Generation
- **RepCodec** [Huang et al., 2024]: CNN/Transformer encoder, RQ, Comprehension
- **SpeechTokenizer** [Zhang et al., 2024]: CNN/Transformer encoder, RQ, Generation & Comprehension
- **NaturalSpeech 3** [Ju et al., 2024]: CNN/Transformer encoder, VQ, Generation & Comprehension
- **iRVQGAN** [Kumar et al., 2024]: CNN encoder, RQ, Generation

#### Text-Based Recommendation Systems
- **TIGER** [Rajput et al., 2023]: MLP encoder, RQ, Recommendation
- **SPM-SID** [Singh et al., 2024]: MLP encoder, RQ, Recommendation
- **TokenRec** [Qu et al., 2024]: MLP encoder, VQ, Recommendation
- **VQ-Rec** [Hou et al., 2023]: MLP encoder, RQ, Recommendation
- **LC-Rec** [Zheng et al., 2024]: MLP encoder, RQ, Recommendation
- **LETTER** [Wang et al., 2024c]: MLP encoder, RQ, Recommendation
- **CoST** [Zhu et al., 2024]: MLP encoder, RQ, Recommendation
- **ColaRec** [Wang et al., 2024d]: MLP encoder, VQ, Recommendation
- **SEATER** [Si et al., 2024]: MLP encoder, VQ, Recommendation
- **QARM** [Luo et al., 2024]: MLP encoder, VQ, Recommendation

#### Text-Based Information Retrieval
- **DSI** [Tay et al., 2022]: Transformer encoder, VQ, Information Retrieval
- **Ultron** [Zhou et al., 2022]: Transformer encoder, RQ, Information Retrieval
- **GenRet** [Sun et al., 2024]: Transformer encoder, VQ, Information Retrieval
- **LMINDEXER** [Jin et al., 2024a]: Transformer encoder, VQ, Information Retrieval
- **RIPOR** [Zeng et al., 2024]: Transformer encoder, RQ, Information Retrieval

## 4. Conclusions and Proposals

### Proposals

**a) AI Traffic Characteristics**: Take this information into account when developing the overview of AI traffic characteristics, covering native AI formats and AI-based codecs alongside traditional codec options.

**b) 6G Split Inferencing**: Consider that the split operation may include AI processing/formatting in addition to the traditional model splitting considered in 5G.

**c) TR Update**: Add text and diagram based on clause 2 to TR for FS_6G_MED.

### Proposed Change Requests

#### Change 1: References Addition
Add references [x1] through [x10] to the TR, including key papers on:
- Discrete tokenizers survey
- JPEG AI standard
- Quantization techniques
- Transformer architectures
- CNN architectures
- Specific implementations

#### Change 2: New Clause 6.2.4.X - Native AI Formats
Add comprehensive new clause under "AI Traffic Characteristics" covering:
- Overview of multi-modal AI and native formats
- Reasons for AI split processing
- Encoder techniques (Transformer, CNN, MLP)
- Decoder processing
- Supervision methods
- Application types
- Quantization techniques
- AI-based codecs

This clause provides the technical foundation for understanding native AI formats in the context of 6G media services.