[FS_6G_MED] LLM-based AI services
This contribution addresses Work Task 2 objective (d) of FS_6G_MED, which focuses on media communication for emerging AI services.
The contribution notes that SA1 TR 22.870 contains over 60 AI-related use cases, many of which reference "tokens" as the basic units processed by generative AI models. While tokenized traffic over networks is not yet widely deployed, the fast pace of research warrants SA4 attention to elaborating these terms and establishing a framework.
The document proposes a more generic architecture than the voice-translation-specific model in TR 26.847, together with key definitions.
The contribution presents a generic (M)LLM architecture (Figure X.1) with the following components:
Input Processing:
- Tokenizer: Function that converts data of a particular modality into tokens (e.g., words, image patches)
- Modality Encoder: AI/ML model that encodes tokens into token embeddings (e.g., OpenAI's CLIP for images and text)
Processing:
- Combination Layer: Combines input token embeddings with contextual token embeddings, potentially using techniques like RAG for context window management
Output Processing:
- Media Decoder/Generator: Processes LLM output token embeddings into desired format (e.g., natural language)
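The processing chain above can be sketched end to end. This is an illustrative sketch only; the function names, the toy vocabulary, and the deterministic embedding are assumptions for illustration, not part of the proposed architecture.

```python
# Toy sketch of the generic (M)LLM pipeline: Tokenizer -> Modality Encoder
# -> Combination Layer -> Media Decoder/Generator. All values are invented.
from typing import List

VOCAB = {"hello": 0, "world": 1, "<unk>": 2}  # toy vocabulary (assumption)
EMBED_DIM = 4

def tokenize(text: str) -> List[str]:
    """Tokenizer: convert text-modality data into discrete tokens (words)."""
    return text.lower().split()

def encode(tokens: List[str]) -> List[List[float]]:
    """Modality Encoder: map each token to a dense token embedding."""
    def embed(token_id: int) -> List[float]:
        # Deterministic toy embedding; a real encoder is a trained AI/ML model.
        return [(token_id + 1) * (i + 1) / 10.0 for i in range(EMBED_DIM)]
    return [embed(VOCAB.get(t, VOCAB["<unk>"])) for t in tokens]

def combine(input_emb: List[List[float]],
            context_emb: List[List[float]]) -> List[List[float]]:
    """Combination Layer: prepend contextual embeddings (e.g., RAG results)."""
    return context_emb + input_emb

def decode(output_emb: List[List[float]]) -> str:
    """Media Decoder/Generator: render output embeddings in the target format."""
    return f"<generated text from {len(output_emb)} output embeddings>"

tokens = tokenize("Hello world")
combined = combine(encode(tokens), encode(tokenize("context")))
print(decode(combined))
```

The sketch only shows the data flow between the four components; the LLM itself, which maps combined input embeddings to output embeddings, is omitted.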
Tokens: Discrete units of information in a given modality (words in text, audio frames, image patches) representing meaningful components of AI/ML data with clearly defined boundaries.
Token embeddings (or embeddings): Dense numerical tensors encoding semantic properties, relationships, and contextual meaning of tokens. They transform discrete tokens into continuous mathematical spaces where semantic relationships can be computed through vector operations.
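The "vector operations" in the definition above can be made concrete with cosine similarity; the three-dimensional vectors below are invented toy values, not the output of any real model.

```python
# Semantically related tokens lie closer together in the continuous
# embedding space, which can be measured with cosine similarity.
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

king = [0.9, 0.8, 0.1]    # toy embeddings (assumption)
queen = [0.85, 0.75, 0.2]
apple = [0.1, 0.2, 0.9]

print(cosine_similarity(king, queen))  # high: semantically related
print(cosine_similarity(king, apple))  # low: semantically distant
```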
NOTE 1: Current popular AI applications do not generate network traffic composed of token embeddings. Feasibility of such transport using existing protocols is FFS.
NOTE 2: Modern LLM services charge based on number of tokens processed (outcome of modality encoding and combination layers), but user input consists of traditional media (text, images, audio) in the form of prompts.
NOTE 3: In current AI applications, all components shown to the right of the dashed line in the architecture figure are executed on the server.
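The per-token billing model mentioned in NOTE 2 can be sketched as follows. The whitespace tokenizer and the price constant are assumptions: real services use subword tokenizers (e.g., BPE), so actual token counts and costs differ.

```python
# Rough sketch of per-token billing; numbers and tokenization are assumptions.
PRICE_PER_1K_TOKENS = 0.002  # hypothetical price in USD

def count_tokens(prompt: str) -> int:
    # Naive whitespace approximation of tokenization.
    return len(prompt.split())

def billed_cost(prompt: str) -> float:
    return count_tokens(prompt) / 1000 * PRICE_PER_1K_TOKENS

prompt = "Translate the following sentence into French"
print(count_tokens(prompt), billed_cost(prompt))
```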
The document proposes to discuss and agree on the generic architecture and definitions for LLM-based AI applications in Clause 3 as a basis for further work in the study.