# LLM-based AI Services for 6G Media Study

## Introduction

This contribution addresses Work Task 2 objective (d) of FS_6G_MED, which focuses on media communication for emerging AI services. The objective aims to:

- Collect and study AI representation formats and traffic characteristics for AI-related media services
- Examine use cases including agents, multi-modal large language models, and diffusion models
- Identify gaps in 3GPP specifications (e.g., QoS requirements, dynamic traffic characteristics, AI-representation formats)

The contribution notes that SA1 TR 22.870 contains over 60 AI-related use cases, many referencing "tokens" as basic units for Gen-AI models. While tokenized traffic over networks is not yet widely deployed, the fast pace of research warrants SA4's attention in elaborating these terms and establishing a framework.

## Generic Workflow and Architecture for LLMs

### Background on LLMs and MLLMs

The document proposes a more generic architecture than the voice translation-specific model in TR 26.847. Key definitions:

- **Large Language Models (LLM)**: AI systems capable of processing and generating natural language, based on transformer architecture with self-attention
- **Generative Pre-trained Transformers (GPT)**: A type of LLM forming the basis of modern AI systems (ChatGPT, Gemini, DeepSeek, Claude)
- **Multimodal Large Language Models (MLLM)**: Models processing multiple input/output modalities (text, images, audio, video) with learned cross-modal alignment

### Proposed Generic Architecture

The contribution presents a generic (M)LLM architecture (Figure X.1) with the following components:

**Input Processing:**
- **Tokenizer**: Function that converts data of a particular modality into tokens (e.g., words, image patches)
- **Modality Encoder**: AI/ML model that encodes tokens into token embeddings (e.g., OpenAI's CLIP for images and text)
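The split between tokenizer and modality encoder can be illustrated with a minimal sketch. The whitespace tokenizer and hash-derived pseudo-embeddings below are purely hypothetical stand-ins: production systems use learned subword tokenizers (e.g., byte-pair encoding) and learned neural encoders such as CLIP, not anything resembling this toy code.

```python
import hashlib

def tokenize_text(prompt: str) -> list[str]:
    # Toy tokenizer: split on whitespace. Real tokenizers use subword
    # schemes (e.g., byte-pair encoding) with vocabularies of ~100k entries.
    return prompt.split()

def encode_token(token: str, dim: int = 8) -> list[float]:
    # Toy "modality encoder": derive a deterministic pseudo-embedding from
    # a hash of the token, scaled into [0, 1]. A real encoder is a learned
    # model producing embeddings of several hundred dimensions.
    digest = hashlib.sha256(token.encode()).digest()
    return [b / 255.0 for b in digest[:dim]]

tokens = tokenize_text("media over 6G networks")
embeddings = [encode_token(t) for t in tokens]
print(len(tokens), len(embeddings[0]))  # 4 8
```

The point of the sketch is the data flow, not the functions themselves: raw modality data becomes a sequence of discrete tokens, and each token then becomes a dense numerical embedding.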

**Processing:**
- **Combination Layer**: Combines input token embeddings with contextual token embeddings, potentially using techniques like RAG for context window management
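How a combination layer might fit retrieved context into a bounded context window can be sketched as follows. The function and the 3-dimensional embeddings are hypothetical illustrations, not part of the contribution; they only show the budget arithmetic of prepending RAG-retrieved context embeddings to the user's input embeddings.

```python
def combine(input_emb: list[list[float]],
            context_emb: list[list[float]],
            context_window: int) -> list[list[float]]:
    # Toy combination layer: prepend retrieved context embeddings (e.g.,
    # the output of a RAG retrieval step) to the input embeddings, truncating
    # the context so the total sequence fits the model's context window.
    budget = context_window - len(input_emb)
    if budget < 0:
        raise ValueError("input alone exceeds the context window")
    return context_emb[:budget] + input_emb

# Hypothetical 3-dimensional embeddings for illustration.
user = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
retrieved = [[0.9, 0.9, 0.9]] * 5
seq = combine(user, retrieved, context_window=4)
print(len(seq))  # 4: two context embeddings plus the two input embeddings
```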

**Output Processing:**
- **Media Decoder/Generator**: Processes LLM output token embeddings into desired format (e.g., natural language)

### Key Definitions

**Tokens**: Discrete units of information in a given modality (words in text, audio frames, image patches) representing meaningful components of AI/ML data with clearly defined boundaries.

**Token embeddings (or embeddings)**: Dense numerical tensors encoding semantic properties, relationships, and contextual meaning of tokens. They transform discrete tokens into points in a continuous mathematical space where semantic relationships can be computed through vector operations.
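The phrase "semantic relationships computed through vector operations" can be made concrete with a standard example: cosine similarity between embeddings. The 3-dimensional vectors below are invented for illustration; real models use embeddings of hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(u: list[float], v: list[float]) -> float:
    # Semantic relatedness measured as the cosine of the angle between
    # two embedding vectors: 1.0 for parallel vectors, 0.0 for orthogonal.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Hypothetical embeddings: semantically close tokens sit near each other.
king = [0.9, 0.8, 0.1]
queen = [0.85, 0.82, 0.15]
apple = [0.1, 0.2, 0.9]
print(cosine_similarity(king, queen) > cosine_similarity(king, apple))  # True
```

This is the property that makes embeddings useful for the combination layer above: retrieval and alignment across modalities reduce to nearest-neighbour searches in the embedding space.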

### Important Notes

**NOTE 1**: Current popular AI applications do not generate network traffic composed of token embeddings. Feasibility of such transport using existing protocols is FFS.

**NOTE 2**: Modern LLM services charge based on number of tokens processed (outcome of modality encoding and combination layers), but user input consists of traditional media (text, images, audio) in the form of prompts.

**NOTE 3**: In current AI applications, all components to the right of the dashed line in the architecture figure run on the server.

## Abbreviations

- **LLM**: Large Language Model
- **MLLM**: Multimodal Large Language Model
- **RAG**: Retrieval-Augmented Generation

## Proposal

The document proposes to discuss and agree on the generic architecture and definitions for LLM-based AI applications in Clause 3 as a basis for further work in the study.