Survey of Native AI Formats for Multi-modal AI
1. Introduction
This document surveys AI native formats for addressing generic AI-related tasks including generation, comprehension, information retrieval, and recommendation in advanced multimedia use cases (multi-modal AI).
Key Differences from 5G Work
- Broader scope: Covers generic applications beyond detection/segmentation/tracking, focusing on reconstruction, comprehension, recommendation, and information retrieval
- Multi-Modal LLM support: Enables use of Multi-Modal Large Language Models
- Generic multi-modal formats: Combines text, image, video, and audio modalities
- Alternative split inference approach: Uses AI native format generation and AI pre-processing instead of model splitting
Standardization Considerations
The document acknowledges that standardization of such formats is challenging due to:
- Constantly evolving field
- Task-specific nature of AI native formats
However, SA4 should:
- Document and study these formats in FS_6G_MED
- Track progress in this area
- Understand characteristics and QoS requirements relevant to 3GPP networks
- Consider these formats when analyzing AI traffic characteristics
2. AI Processing and Related Native AI Formats
Overview
Recent advances in AI, particularly Large Language Models (LLMs) and Multi-Modal LLMs, enable new applications in generation, comprehension, information retrieval, and recommendation. Multi-modal LLMs require AI-related pre-processing to create AI native formats.
Reasons for AI Split Processing and Native AI Formatting
- Workload distribution: Offload privacy-sensitive and computationally intensive parts (similar to 5G split inferencing)
- LLM/MLLM compatibility: Enable input suitable for auto-regressive models (discrete token vectors rather than raw continuous signals)
- Modality combination: Merge text, image, video into relevant features
- Data reduction: Reduce data size, latency, and bandwidth requirements
- Task optimization: Optimize for different tasks (reconstruction vs. comprehension)
General AI Processing Architecture
The document presents a comprehensive survey based on [Jian Jia et al. 2025], extended with 2025 techniques, showing:
Input → Encoder → Latent Vector (z) → Quantization → Decoder → Output
With supervision feedback loop for training.
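The pipeline above can be sketched in a few lines of numpy. The weights, codebook size, and dimensions below are illustrative placeholders standing in for trained networks, not any specific tokenizer:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear "networks" standing in for trained encoder/decoder models
# (hypothetical shapes: 16-dim input, 4-dim latent).
W_enc = rng.normal(size=(16, 4))
W_dec = rng.normal(size=(4, 16))

# Codebook of 8 latent vectors used by the quantization stage.
codebook = rng.normal(size=(8, 4))

def encode(x):
    return x @ W_enc                      # Input -> Latent vector (z)

def quantize(z):
    # Replace z by its nearest codebook entry (vector quantization);
    # the entry index is the discrete "native AI format" token.
    idx = int(np.argmin(np.linalg.norm(codebook - z, axis=1)))
    return idx, codebook[idx]

def decode(z_q):
    return z_q @ W_dec                    # Quantized latent -> Output

x = rng.normal(size=16)
z = encode(x)
token_id, z_q = quantize(z)
x_hat = decode(z_q)

# Supervision: minimize the error between decoder output and encoder input.
loss = np.mean((x - x_hat) ** 2)
```

In a real system the encoder, decoder, and codebook are learned jointly by backpropagating this loss; the sketch only shows the data flow.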
Encoder Techniques
Transformer [Vaswani et al 2017]
- Attention-based architecture that substantially improved sequence-modeling performance
- Text: Processed by tokenization then fed to transformers
- 2D input: Segmented into patches, treated as sequences
- 3D input: Sliced temporally, 2D patches represented as 3D tubes [Wang et al 2024b]
- Scales efficiently to large parameter counts and is increasingly popular as an encoder choice
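The patch-based treatment of 2D input can be illustrated as follows; the image and patch sizes are arbitrary choices for the example, and extending patches along the temporal axis yields the 3D tubes used for video:

```python
import numpy as np

def image_to_patches(img, patch):
    """Split an HxWxC image into non-overlapping flattened patches,
    producing the token sequence a vision transformer consumes."""
    H, W, C = img.shape
    assert H % patch == 0 and W % patch == 0
    return (img.reshape(H // patch, patch, W // patch, patch, C)
               .transpose(0, 2, 1, 3, 4)        # group pixels per patch
               .reshape(-1, patch * patch * C)) # one row per patch

img = np.zeros((32, 32, 3))                     # toy 32x32 RGB image
tokens = image_to_patches(img, patch=8)         # (32/8)*(32/8) = 16 patches
```

Each row of `tokens` is then linearly projected and fed to the transformer as one sequence element.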
Convolutional Neural Network (CNN) [O'Shea 2015]
- Popular for image feature extraction (e.g., UNet [Ronneberger et al 2015])
- Used for audio intermediate formats
- Extends to video via 3D-CNN incorporating temporal dimension
Multi-Layer Perceptron (MLP)
- Earlier architecture for embeddings in recommender systems
- Used for latent space mapping [Rajput et al. 2023; Singh et al. 2024]
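A minimal sketch of such an MLP latent-space mapping, with hypothetical layer sizes and random weights standing in for a trained recommender encoder:

```python
import numpy as np

rng = np.random.default_rng(1)

def relu(x):
    return np.maximum(x, 0.0)

# Hypothetical two-layer MLP mapping a sparse item-interaction vector
# (100 catalogue items) to a dense 8-dim latent for later quantization.
W1, b1 = rng.normal(size=(100, 32)), np.zeros(32)
W2, b2 = rng.normal(size=(32, 8)), np.zeros(8)

def mlp_embed(user_history):
    h = relu(user_history @ W1 + b1)
    return h @ W2 + b2

history = np.zeros(100)
history[[3, 17, 42]] = 1.0      # items the user has interacted with
z = mlp_embed(history)          # dense latent embedding
```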
Decoder Processing
- Applies related transforms for reconstruction
- May include different models for specific tasks (generation, recommendation, information retrieval)
- Can be jointly optimized with encoder
Supervision
- Minimizes error between decoder output and encoder input
- For reconstruction: Uses metrics like L2 norm distance
- For comprehension/information retrieval: Requires additional ground truth information
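The two supervision regimes can be contrasted with toy loss functions (a simplified sketch; real systems add codebook and regularization terms):

```python
import numpy as np

def l2_loss(x, x_hat):
    # Reconstruction supervision: squared L2 distance between the
    # decoder output and the original encoder input.
    return float(np.sum((x - x_hat) ** 2))

def cross_entropy(logits, label):
    # Comprehension supervision: needs ground-truth information beyond
    # the input itself (here, the index of the correct class label).
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return float(-np.log(p[label]))

x = np.array([1.0, 2.0, 3.0])
recon_err = l2_loss(x, x + 0.1)                         # small error
cls_err = cross_entropy(np.array([2.0, 0.5, -1.0]), 0)  # label 0 is true
```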
Application Types
- Generation: Reconstruction or generating related content
- Comprehension: Understanding input (textual description, labeling)
- Information Retrieval: Retrieve related documents using semantic features
- Recommendation: Provide recommendations (mainly based on historical behavior)
Different applications use different decoder models and encoder processing, making native AI formats often task-specific.
Quantization Techniques
Following [Jian Jia et al. 2025], the survey identifies:
- Vector Quantization (VQ): Vanilla vector quantization [Juang and Gray, 1982]; replaces the latent with its minimum-distance codebook entry
- Residual (level-wise) Quantization (RQ): Each level quantizes the residual error left by the previous level, refining the approximation level by level
- Group-wise Residual Vector Quantization (GRVQ): Splits the vector into sub-components and quantizes each separately
- Lookup-Free Quantization (LFQ): Quantizes each latent dimension independently (e.g., to binary values), removing the need for a codebook lookup table
- Finite Scalar Quantization (FSQ): Projects the vector to a few dimensions, each bounded and rounded to a small fixed set of levels
Note: Some tokenizers do not apply quantization and instead output continuous floating-point latents.
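As a sketch of how these schemes differ, the following minimal numpy example implements VQ, RQ, and FSQ; the codebook size, latent dimension, and level counts are illustrative assumptions, not taken from any cited tokenizer:

```python
import numpy as np

rng = np.random.default_rng(2)
codebook = rng.normal(size=(16, 4))   # 16 entries, 4-dim latents

def vq(z):
    # Vector quantization: nearest codebook entry by Euclidean distance.
    idx = int(np.argmin(np.linalg.norm(codebook - z, axis=1)))
    return idx, codebook[idx]

def rq(z, levels=3):
    # Residual quantization: each level quantizes the error left by the
    # previous one, so the approximation refines level by level.
    ids, approx = [], np.zeros_like(z)
    for _ in range(levels):
        idx, q = vq(z - approx)
        ids.append(idx)
        approx += q
    return ids, approx

def fsq(z, levels=5):
    # Finite scalar quantization: bound each dimension with tanh and
    # round it to one of `levels` values -- no codebook lookup needed.
    half = levels // 2
    return np.round(np.tanh(z) * half) / half

z = rng.normal(size=4)
vq_id, q_vq = vq(z)       # one token per vector
rq_ids, q_rq = rq(z)      # one token per level
q_fsq = fsq(z)            # per-dimension rounded values
```

Note how VQ emits a single codebook index, RQ a short list of indices, and FSQ a rounded vector with no codebook at all.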
AI-Based Codecs
Native AI formats have been used to develop codecs:
- JPEG AI [ISO/IEC 6048-1]
- Deep Render codec (from InterDigital, available on the FFmpeg and VLC platforms)
3. Survey of AI Processing for Native AI Formats
Comprehensive Survey Table
The document provides an extensive table (Table 1) surveying AI pre-processing techniques with the following characteristics:
Image Modality Techniques
- VQVAE [Van Den Oord et al., 2017]: CNN encoder, VQ, Generation
- VQGAN [Esser et al., 2021]: CNN encoder, VQ, Generation
- ViT-VQGAN [Yu et al., 2022]: Transformer encoder, VQ, Generation
- RQVAE [Lee et al., 2022]: CNN encoder, RQ, Generation
- LQAE [Liu et al., 2023]: CNN encoder, VQ, Generation
- SEED [Ge et al., 2024]: Transformer encoder, VQ, Generation & Comprehension
- TiTok [Yu et al., 2024b]: Transformer encoder, VQ, Generation
- Spectral image tokenizer [Esteves et al 2025]: Transformer encoder, VQ, Generation
- Subobject-level [Chen et al 2024]: Transformer/CNN encoder, VQ, Generation
- One-d-piece [Miwa et al 2025]: Transformer encoder, VQ, Generation & Comprehension
- Semhitok [Chen et al 2025]: Transformer encoder, LFQ, Generation & Comprehension
- Ming-univision [Z Huang et al]: Transformer encoder, N/A, Generation & Comprehension
- SetTok [Geng et al 2025]: Transformer encoder, LFQ, Generation
- UniTok [Chuofan Ma et al 2020]: CNN encoder, LFQ, Generation & Comprehension
- GloTok [Zhao et al 2025]: Transformer encoder, VQ, Generation
- GaussianToken [Jiajun et al 2025]: CNN encoder, VQ, Generation
- CAT [Shen et al 2025]: Transformer/CNN encoder, VQ, Generation
- OpenAI CLIP: CNN encoder, N/A, Generation & Comprehension
- JPEG AI [ISO/IEC 6048-1]: CNN/Transformer encoder, VQ, Reconstruction & Comprehension
Image & Video Modality Techniques
- MAGVIT [Yu et al., 2023]: CNN encoder, VQ, Generation
- MAGVIT-v2 [Yu et al., 2024a]: CNN encoder, LFQ, Generation
- OmniTokenizer [Wang et al., 2024b]: Transformer encoder, VQ, Generation
- SweetTokenizer [Tan et al., 2024]: Transformer encoder, VQ, Generation & Comprehension
- Atoken [J Lu et al 2025]: Transformer encoder, FSQ, Generation & Comprehension
- HieraTok [Chen et al 2025]: Transformer encoder, VQ, Generation
Video Modality Techniques
- Cosmos [NVIDIA, 2025]: CNN/Transformer encoder, FSQ, Generation
- VidTok [Tang et al., 2024]: CNN encoder, FSQ, Generation
- Video-LaViT [Jin et al., 2024b]: Transformer encoder, VQ, Generation & Comprehension
- Grace [Cheng et al. 2024]: CNN encoder, VQ, Reconstruction
- MPEG FCM [Eimond et al 2025]: CNN encoder, VQ, Comprehension
Multi-Modal (Image/Audio/Text) Techniques
- TEAL [Yang et al., 2023b]: Transformer encoder, VQ, Comprehension
- AnyGPT [Zhan et al., 2024]: Transformer encoder, VQ, Generation & Comprehension
- LaViT [Jin et al., 2024c]: Transformer encoder, VQ, Generation & Comprehension
- ElasticTok [Yan et al., 2024]: Transformer encoder, VQ/FSQ, Generation & Comprehension
- Chameleon [Team, 2024]: CNN/Transformer encoder, VQ, Generation & Comprehension
- ShowO [Xie et al., 2024]: CNN/Transformer encoder, LFQ, Generation & Comprehension
Audio Modality Techniques
- SoundStream [Zeghidour et al., 2021]: CNN encoder, RQ, Generation
- HiFiCodec [Yang et al., 2023a]: CNN encoder, GRVQ, Generation
- RepCodec [Huang et al., 2024]: CNN/Transformer encoder, RQ, Comprehension
- SpeechTokenizer [Zhang et al., 2024]: CNN/Transformer encoder, RQ, Generation & Comprehension
- NeuralSpeech-3 [Ju et al., 2024]: CNN/Transformer encoder, VQ, Generation & Comprehension
- iRVQGAN [Kumar et al., 2024]: CNN encoder, RQ, Generation
Text-Based Recommendation Systems
- TIGER [Rajput et al., 2023]: MLP encoder, RQ, Recommendation
- SPM-SID [Singh et al., 2024]: MLP encoder, RQ, Recommendation
- TokenRec [Qu et al., 2024]: MLP encoder, VQ, Recommendation
- VQ-Rec [Hou et al., 2023]: MLP encoder, RQ, Recommendation
- LC-Rec [Zheng et al., 2024]: MLP encoder, RQ, Recommendation
- LETTER [Wang et al., 2024c]: MLP encoder, RQ, Recommendation
- CoST [Zhu et al., 2024]: MLP encoder, RQ, Recommendation
- ColaRec [Wang et al., 2024d]: MLP encoder, VQ, Recommendation
- SEATER [Si et al., 2024]: MLP encoder, VQ, Recommendation
- QARM [Luo et al., 2024]: MLP encoder, VQ, Recommendation
Text-Based Information Retrieval
- DSI [Tay et al., 2022]: Transformer encoder, VQ, Information Retrieval
- Ultron [Zhou et al., 2022]: Transformer encoder, RQ, Information Retrieval
- GenRet [Sun et al., 2024]: Transformer encoder, VQ, Information Retrieval
- LMINDEXER [Jin et al., 2024a]: Transformer encoder, VQ, Information Retrieval
- RIPOR [Zeng et al., 2024]: Transformer encoder, RQ, Information Retrieval
4. Conclusions and Proposals
Proposals
a) AI Traffic Characteristics: Take this information into account when developing the overview of AI traffic characteristics, covering native AI formats and codecs alongside traditional codec options.
b) 6G Split Inferencing: Consider that the split operation may include AI processing/formatting in addition to the traditional model splitting considered in 5G.
c) TR Update: Add text and diagram based on clause 2 to TR for FS_6G_MED.
Proposed Change Requests
Change 1: References Addition
Add references [x1] through [x10] to the TR, including key papers on:
- Discrete tokenizers survey
- JPEG AI standard
- Quantization techniques
- Transformer architectures
- CNN architectures
- Specific implementations
Change 2: New Clause 6.2.4.X - Native AI Formats
Add comprehensive new clause under "AI Traffic Characteristics" covering:
- Overview of multi-modal AI and native formats
- Reasons for AI split processing
- Encoder techniques (Transformer, CNN, MLP)
- Decoder processing
- Supervision methods
- Application types
- Quantization techniques
- AI-based codecs
This clause provides the technical foundation for understanding native AI formats in the context of 6G media services.