S4-260096 - AI Summary

Survey of Native AI formats for multi-modal AI


1. Introduction

This document surveys AI native formats for addressing generic AI-related tasks including generation, comprehension, information retrieval, and recommendation in advanced multimedia use cases (multi-modal AI).

Key Differences from 5G Work

  • Broader scope: Covers generic applications beyond detection/segmentation/tracking, focusing on reconstruction, comprehension, recommendation, and information retrieval
  • Multi-Modal LLM support: Enables use of Multi-Modal Large Language Models
  • Generic multi-modal formats: Combines text, image, video, and audio modalities
  • Alternative split inference approach: Uses AI native format generation and AI pre-training instead of model-splitting

Standardization Considerations

The document acknowledges that standardization of such formats is challenging due to:
- Constantly evolving field
- Task-specific nature of AI native formats

However, SA4 should:
- Document and study these formats in FS_6G_MED
- Track progress in this area
- Understand characteristics and QoS requirements relevant to 3GPP networks
- Consider these formats when analyzing AI traffic characteristics

2. AI Processing and Related Native AI Formats

Overview

Recent advances in AI, particularly Large Language Models (LLMs) and Multi-Modal LLMs, enable new applications in generation, comprehension, information retrieval, and recommendation. Multi-modal LLMs require AI-related pre-processing to create AI native formats.

Reasons for AI Split Processing and Native AI Formatting

  1. Workload distribution: Offload privacy-sensitive and computationally intensive parts (similar to 5G split inferencing)
  2. LLM/MLLM compatibility: Enable input suitable for auto-regressive models (discrete information carried in token vectors)
  3. Modality combination: Merge text, image, video into relevant features
  4. Data reduction: Reduce data size, latency, and bandwidth requirements
  5. Task optimization: Optimize for different tasks (reconstruction vs. comprehension)
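As a back-of-the-envelope illustration of the data-reduction point (the sizes below are illustrative assumptions, not figures from the document), compare a raw RGB image against a tokenized representation:

```python
# Illustrative sizes only; not taken from the document.
raw_bytes = 256 * 256 * 3                  # uncompressed 256x256 RGB image
tokens, bits_per_token = 256, 16           # e.g. 256 tokens from a 65536-entry codebook
token_bytes = tokens * bits_per_token // 8

print(raw_bytes, token_bytes, raw_bytes // token_bytes)  # 196608 512 384
```

Even before any entropy coding, the token stream is orders of magnitude smaller than the raw signal, which is what drives the latency and bandwidth benefits listed above.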

General AI Processing Architecture

The document presents a comprehensive survey based on [Jian Jia et al. 2025], extended with 2025 techniques, showing:

Input → Encoder → Latent Vector (z) → Quantization → Decoder → Output

With supervision feedback loop for training.
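The flow above can be sketched numerically. The snippet below is a minimal illustration: random linear maps stand in for the learned encoder/decoder networks, and a crude scalar quantizer stands in for the quantization stage; all dimensions and the step size are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_latent = 64, 8                     # toy dimensions

# Random linear maps stand in for the trained encoder/decoder networks.
W_enc = rng.normal(scale=0.1, size=(d_latent, d_in))
W_dec = rng.normal(scale=0.1, size=(d_in, d_latent))

def encode(x):
    return W_enc @ x                       # Input -> latent vector z

def quantize(z, step=0.05):
    return np.round(z / step) * step       # crude scalar quantizer

def decode(z_q):
    return W_dec @ z_q                     # quantized latent -> output

x = rng.normal(size=d_in)
z_q = quantize(encode(x))
x_hat = decode(z_q)

# The supervision feedback loop would minimize this reconstruction
# error during training (an L2-style objective; see "Supervision" below).
loss = float(np.mean((x - x_hat) ** 2))
```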

Encoder Techniques

Transformer [Vaswani et al 2017]

  • Attention-based architecture that delivered a significant performance improvement over earlier sequence models
  • Text: Processed by tokenization then fed to transformers
  • 2D input: Segmented into patches, treated as sequences
  • 3D input: Sliced temporally, 2D patches represented as 3D tubes [Wang et al 2024b]
  • Handles large parameter sizes efficiently
  • Increasingly popular
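The 2D patching step above can be illustrated as follows. The patch size and image shape are arbitrary assumptions, and real tokenizers follow the patching with a learned linear projection before the transformer:

```python
import numpy as np

def patchify(img, p):
    """Split an HxWxC image into a sequence of flattened p x p patches,
    the standard pre-step before feeding 2D input to a transformer."""
    H, W, C = img.shape
    assert H % p == 0 and W % p == 0, "image dims must be divisible by patch size"
    patches = img.reshape(H // p, p, W // p, p, C)
    patches = patches.transpose(0, 2, 1, 3, 4)       # (H/p, W/p, p, p, C)
    return patches.reshape(-1, p * p * C)            # one row per patch token

img = np.arange(4 * 4 * 3, dtype=float).reshape(4, 4, 3)
seq = patchify(img, 2)
print(seq.shape)  # (4, 12): four 2x2 patches, 2*2*3 values each
```

The same idea extends to 3D input by slicing temporally and stacking patches into tubes.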

Convolutional Neural Network (CNN) [O'Shea 2015]

  • Popular for image feature extraction (e.g., UNet [Ronneberger et al 2015])
  • Used for audio intermediate formats
  • Extends to video via 3D-CNN incorporating temporal dimension
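A minimal single-channel "valid" convolution, the core operation CNN encoders stack (with learned kernels, pooling, and nonlinearities) to extract image features; the kernel and input here are purely illustrative:

```python
import numpy as np

def conv2d(img, kernel):
    """Naive 'valid' 2D convolution (really cross-correlation, as in
    most deep-learning frameworks) for a single-channel image."""
    H, W = img.shape
    kh, kw = kernel.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

edge = np.array([[1.0, 0.0, -1.0]] * 3)      # crude vertical-edge detector
img = np.tile([0.0, 0.0, 1.0, 1.0], (4, 1))  # 4x4 image with a vertical step
out = conv2d(img, edge)
print(out.shape)  # (2, 2)
```

A 3D-CNN applies the same sliding-window idea with an extra temporal axis on the kernel.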

Multi-Layer Perceptron (MLP)

  • Earlier architecture for embeddings in recommender systems
  • Used for latent space mapping [Rajput et al. 2023; Singh et al. 2024]

Decoder Processing

  • Applies related transforms for reconstruction
  • May include different models for specific tasks (generation, recommendation, information retrieval)
  • Can be jointly optimized with encoder

Supervision

  • Minimizes error between decoder output and encoder input
  • For reconstruction: Uses metrics like L2 norm distance
  • For comprehension/information retrieval: Requires additional ground truth information

Application Types

  1. Generation: Reconstruction or generating related content
  2. Comprehension: Understanding input (textual description, labeling)
  3. Information Retrieval: Retrieve related documents using semantic features
  4. Recommendation: Provide recommendations (mainly based on historical behavior)

Different applications use different decoder models and encoder processing, making native AI formats often task-specific.

Quantization Techniques

Following [Jian Jia et al. 2025], the survey identifies:

  • Vector Quantization (VQ): Vanilla vector quantization [Juang and Gray, 1982] replacing the vector with its minimum-distance codebook entry
  • Residual Quantization (RQ): Quantizes level by level, with each level encoding the residual error left by the previous one (successive levels carry progressively smaller corrections)
  • Group-wise Residual Vector Quantization (GRVQ): Splits the vector into sub-components (groups) and quantizes each separately
  • Lookup-Free Quantization (LFQ): Quantizes each latent dimension directly, avoiding an explicit codebook lookup table
  • Finite Scalar Quantization (FSQ): Projects the vector to a few dimensions, each rounded to a small fixed set of levels

Note: Some tokenizers may not use quantization and rely on floating-point arithmetic.
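The main quantizer families above can be sketched in a few lines. The codebook size, vector dimension, and level counts below are toy assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
codebook = rng.normal(size=(16, 4))        # toy codebook: 16 entries of dim 4

def vq(z, cb):
    """Vanilla VQ: replace z with its minimum-distance codebook entry."""
    idx = int(np.argmin(np.linalg.norm(cb - z, axis=1)))
    return idx, cb[idx]

def rq(z, cb, levels=3):
    """Residual quantization: each level quantizes the error left by the
    previous one, so successive levels encode smaller corrections."""
    residual, ids = z.copy(), []
    for _ in range(levels):
        i, q = vq(residual, cb)
        ids.append(i)
        residual = residual - q
    return ids, z - residual               # indices + reconstruction

def fsq(z, levels=5):
    """Finite scalar quantization: bound each dimension with tanh, then
    round it to a small fixed set of levels -- no codebook lookup."""
    half = (levels - 1) / 2
    return np.round(np.tanh(z) * half) / half

z = rng.normal(size=4)
_, z_vq = vq(z, codebook)
ids, z_rq = rq(z, codebook)
z_fsq = fsq(z)
```

GRVQ can be seen as running `rq` independently on each group of a split vector, and LFQ as the codebook-free limit where each dimension is quantized on its own.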

AI-Based Codecs

Native AI formats have been used to develop codecs:
- JPEG AI [ISO/IEC 6048-1]
- Deep Render codec (from InterDigital, available on FFMPEG and VLC platforms)

3. Survey of AI Processing for Native AI Formats

Comprehensive Survey Table

The document provides an extensive table (Table 1) surveying AI pre-processing techniques with the following characteristics:

Image Modality Techniques

  • VQVAE [Van Den Oord et al., 2017]: CNN encoder, VQ, Generation
  • VQGAN [Esser et al., 2021]: CNN encoder, VQ, Generation
  • ViT-VQGAN [Yu et al., 2022]: Transformer encoder, VQ, Generation
  • RQVAE [Lee et al., 2022]: CNN encoder, RQ, Generation
  • LQAE [Liu et al., 2023]: CNN encoder, VQ, Generation
  • SEED [Ge et al., 2024]: Transformer encoder, VQ, Generation & Comprehension
  • TiTok [Yu et al., 2024b]: Transformer encoder, VQ, Generation
  • Spectral image tokenizer [Esteves et al 2025]: Transformer encoder, VQ, Generation
  • Subobject-level [Chen et al 2024]: Transformer/CNN encoder, VQ, Generation
  • One-d-piece [Miwa et al 2025]: Transformer encoder, VQ, Generation & Comprehension
  • Semhitok [Chen et al 2025]: Transformer encoder, LFQ, Generation & Comprehension
  • Ming-univision [Z Huang et al]: Transformer encoder, N/A, Generation & Comprehension
  • SetTok [Geng et al 2025]: Transformer encoder, LFQ, Generation
  • UniTok [Chuofan Ma et al 2020]: CNN encoder, LFQ, Generation & Understanding
  • GloTok [Zhao et al 2025]: Transformer encoder, VQ, Generation
  • GaussianToken [Jiajun et al 2025]: CNN encoder, VQ, Generation
  • CAT [Shen et al 2025]: Transformer/CNN encoder, VQ, Generation
  • OpenAI CLIP: CNN encoder, N/A, Generation & Comprehension
  • JPEG AI [ISO/IEC 6048-1]: CNN/Transformer encoder, VQ, Reconstruction & Comprehension

Image & Video Modality Techniques

  • MAGVIT [Yu et al., 2023]: CNN encoder, VQ, Generation
  • MAGVIT-v2 [Yu et al., 2024a]: CNN encoder, LFQ, Generation
  • OmniTokenizer [Wang et al., 2024b]: Transformer encoder, VQ, Generation
  • SweetTokenizer [Tan et al., 2024]: Transformer encoder, VQ, Generation & Comprehension
  • Atoken [J Lu et al 2025]: Transformer encoder, FSQ, Generation & Comprehension
  • HieraTok [Chen et al 2025]: Transformer encoder, VQ, Generation

Video Modality Techniques

  • Cosmos [NVIDIA, 2025]: CNN/Transformer encoder, FSQ, Generation
  • VidTok [Tang et al., 2024]: CNN encoder, FSQ, Generation
  • Video-LaViT [Jin et al., 2024b]: Transformer encoder, VQ, Generation & Comprehension
  • Grace [Cheng et al. 2024]: CNN encoder, VQ, Reconstruction
  • MPEG FCM [Eimond et al 2025]: CNN encoder, VQ, Comprehension

Multi-Modal (Image/Audio/Text) Techniques

  • TEAL [Yang et al., 2023b]: Transformer encoder, VQ, Comprehension
  • AnyGPT [Zhan et al., 2024]: Transformer encoder, VQ, Generation & Comprehension
  • LaViT [Jin et al., 2024c]: Transformer encoder, VQ, Generation & Comprehension
  • ElasticTok [Yan et al., 2024]: Transformer encoder, VQ/FSQ, Generation & Comprehension
  • Chameleon [Team, 2024]: CNN/Transformer encoder, VQ, Generation & Comprehension
  • ShowO [Xie et al., 2024]: CNN/Transformer encoder, LFQ, Generation & Comprehension

Audio Modality Techniques

  • SoundStream [Zeghidour et al., 2021]: CNN encoder, RQ, Generation
  • HiFiCodec [Yang et al., 2023a]: CNN encoder, GRVQ, Generation
  • RepCodec [Huang et al., 2024]: CNN/Transformer encoder, RQ, Comprehension
  • SpeechTokenizer [Zhang et al., 2024]: CNN/Transformer encoder, RQ, Generation & Comprehension
  • NaturalSpeech 3 [Ju et al., 2024]: CNN/Transformer encoder, VQ, Generation & Comprehension
  • iRVQGAN [Kumar et al., 2024]: CNN encoder, RQ, Generation

Text-Based Recommendation Systems

  • TIGER [Rajput et al., 2023]: MLP encoder, RQ, Recommendation
  • SPM-SID [Singh et al., 2024]: MLP encoder, RQ, Recommendation
  • TokenRec [Qu et al., 2024]: MLP encoder, VQ, Recommendation
  • VQ-Rec [Hou et al., 2023]: MLP encoder, RQ, Recommendation
  • LC-Rec [Zheng et al., 2024]: MLP encoder, RQ, Recommendation
  • LETTER [Wang et al., 2024c]: MLP encoder, RQ, Recommendation
  • CoST [Zhu et al., 2024]: MLP encoder, RQ, Recommendation
  • ColaRec [Wang et al., 2024d]: MLP encoder, VQ, Recommendation
  • SEATER [Si et al., 2024]: MLP encoder, VQ, Recommendation
  • QARM [Luo et al., 2024]: MLP encoder, VQ, Recommendation

Text-Based Information Retrieval

  • DSI [Tay et al., 2022]: Transformer encoder, VQ, Information Retrieval
  • Ultron [Zhou et al., 2022]: Transformer encoder, RQ, Information Retrieval
  • GenRet [Sun et al., 2024]: Transformer encoder, VQ, Information Retrieval
  • LMINDEXER [Jin et al., 2024a]: Transformer encoder, VQ, Information Retrieval
  • RIPOR [Zeng et al., 2024]: Transformer encoder, RQ, Information Retrieval

4. Conclusions and Proposals

Proposals

a) AI Traffic Characteristics: Take this information into account when developing the overview of AI traffic characteristics, covering native AI formats and AI codecs alongside traditional codec options.

b) 6G Split Inferencing: Consider that the split operation may include AI processing/formatting in addition to the traditional model splitting considered in 5G.

c) TR Update: Add text and diagram based on clause 2 to TR for FS_6G_MED.

Proposed Change Requests

Change 1: References Addition

Add references [x1] through [x10] to the TR, including key papers on:
- Discrete tokenizers survey
- JPEG AI standard
- Quantization techniques
- Transformer architectures
- CNN architectures
- Specific implementations

Change 2: New Clause 6.2.4.X - Native AI Formats

Add comprehensive new clause under "AI Traffic Characteristics" covering:
- Overview of multi-modal AI and native formats
- Reasons for AI split processing
- Encoder techniques (Transformer, CNN, MLP)
- Decoder processing
- Supervision methods
- Application types
- Quantization techniques
- AI-based codecs

This clause provides the technical foundation for understanding native AI formats in the context of 6G media services.

Document Information
Source: Huawei Tech. (UK) Co., Ltd
Type: pCR
For: Agreement
Title: Survey of Native AI formats for multi-modal AI
Agenda item: 11.1 (FS_6G_MED, Study on Media aspects for 6G System)
Release: Rel-20
Specification: 26.87
Version: 0.0.1
Related WIs: FS_6G_MED
Contact: Rufail Mekuria (Contact ID: 104180)
Uploaded: 2026-02-03T08:50:05.997000
Revised to: S4aP260009
TDoc Status: noted
Reservation date: 02/02/2026 13:45:39
Agenda item sort order: 60