S4-260273 - AI Summary

6GMedia – AI Terminology for Media Communication Services

Overview

This contribution addresses the need for standardized AI terminology in the context of the 6GMedia work. The document notes that terms such as "token" and "embedding" are not understood consistently across delegates and 3GPP working groups, and it proposes definitions for AI representation formats to enable assessment of their traffic characteristics and impact on SA4 specifications.

Main Technical Contributions

1. Core AI Representation Terminology

The document proposes comprehensive definitions for fundamental AI representation concepts:

Token Definitions

  • Token (general): Unit of representation processed sequentially or symbolically by a model, which can be discrete (hard token) or continuous (soft token)
  • Hard token: Discrete symbolic representation selected from a predefined vocabulary or codebook, processed as an atomic unit within a model's input or output sequence. Has explicit identifier without partial or weighted selection. Examples include text symbols, discrete audio or visual codewords
  • Soft token: Continuous vector representation that functionally replaces or augments a hard token, not corresponding to a single discrete vocabulary element. Typically learned and processed similarly to hard tokens
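The hard/soft distinction above can be sketched in a few lines of Python. This is purely illustrative: the vocabulary, function names, and vector values are invented for the example and are not part of the contribution.

```python
# Minimal sketch contrasting hard and soft tokens (illustrative only).

# Hard tokens: discrete identifiers selected from a predefined vocabulary,
# each processed as an atomic unit in a model's input/output sequence.
VOCAB = {"<pad>": 0, "media": 1, "over": 2, "6g": 3}

def to_hard_tokens(words):
    """Map each symbol to its explicit vocabulary identifier."""
    return [VOCAB[w] for w in words]

# Soft tokens: continuous vectors that replace or augment hard tokens.
# They have no single vocabulary index; in practice they are learned
# (e.g., prompt/prefix vectors). A constant vector stands in here.
def make_soft_token(dim=4, value=0.5):
    """A continuous token-like vector; a placeholder for a learned one."""
    return [value] * dim

hard = to_hard_tokens(["media", "over", "6g"])  # discrete IDs: [1, 2, 3]
soft = make_soft_token()                        # continuous: [0.5, 0.5, 0.5, 0.5]
```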

Other Core Representations

  • Embedding: Continuous vector representation encoding semantic, perceptual, or structural characteristics of content for comparison, retrieval, alignment, or conditioning. Not inherently part of a token sequence
  • Latent representation (latent): Continuous representation capturing compressed, abstract, or generative state during processing or synthesis
  • Hierarchical latent representation: Latent representation structured across multiple levels, with higher levels capturing coarse/global/semantic information and lower levels capturing fine-grained/local/residual information
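The "comparison, retrieval, alignment" role of embeddings can be illustrated with a small similarity search. The vectors and clip names below are invented for the example; real embeddings would be produced by a trained model.

```python
import math

def cosine_similarity(a, b):
    """Compare two embedding vectors (dot product over the norm product)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Retrieval: pick the stored item whose embedding is closest to the query.
query = [1.0, 0.0, 1.0]
candidates = {"clipA": [1.0, 0.1, 0.9], "clipB": [0.0, 1.0, 0.0]}
best = max(candidates, key=lambda k: cosine_similarity(query, candidates[k]))
# best == "clipA"
```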

2. Media-Specific and Exchange Format Terminology

Compression and Exchange Formats

  • Learned-based media compression (representation): Discrete, syntax-defined coded form derived from a latent representation after quantization and entropy coding (e.g., JPEG AI, MPEG AI-PCC)
  • Model exchange representation: Defines network topology, operators, parameters, and data types for interoperable deployment across distributed systems (e.g., ONNX, NNEF, GGUF)
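The four aspects a model exchange representation must cover (topology, operators, parameters, data types) can be sketched as a serializable structure. This is a hypothetical, simplified layout for illustration, not actual ONNX, NNEF, or GGUF syntax.

```python
import json

# Hypothetical, simplified model-exchange description covering topology,
# operators, parameters, and data types. NOT a real ONNX/NNEF/GGUF format.
model_exchange = {
    "topology": [
        {"from": "input", "to": "dense0"},
        {"from": "dense0", "to": "output"},
    ],
    "operators": [{"name": "dense0", "op": "MatMul"}],
    "parameters": {"dense0.weight": [[0.1, 0.2], [0.3, 0.4]]},
    "data_types": {"dense0.weight": "float32"},
}

# External/exchangeable form: serialized for transfer between entities.
serialized = json.dumps(model_exchange)
# Receiving side: restore for interoperable deployment.
restored = json.loads(serialized)
```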

Inference-Related Representations

  • Feature/descriptor: Structured representation of content characteristics, extracted for description, comparison, indexing, or retrieval (e.g., MPEG-7 descriptors)
  • Inference results: Output representation produced by AI system as result of applying model to input data, conveying decision, prediction, estimation, or detection outcome. May include labels, scores, confidence values, coordinates, masks, or structured outputs (e.g., W3C Media Annotations)
  • Intermediate data: Representation of partially processed information exchanged between inference components, enabling continuation of inference by downstream entity (includes intermediate coded representation, feature representation or descriptors). References TR 26.927 definition
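An inference result bundling the kinds of fields listed above (label, score, coordinates) might look like the following. The container and its field names are illustrative assumptions, not a schema from 3GPP or W3C Media Annotations.

```python
from dataclasses import dataclass

# Hypothetical container for an inference result, mirroring the field
# types named in the summary (labels, confidence scores, coordinates).
@dataclass
class InferenceResult:
    label: str           # decision/detection outcome, e.g. a class label
    score: float         # confidence value in [0, 1]
    bbox: tuple = None   # optional (x, y, w, h) coordinates for detections

result = InferenceResult(label="person", score=0.93, bbox=(12, 30, 64, 128))
```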

3. Applicability Matrix to Media Types/Modalities

The document provides a comprehensive mapping of representation types across different media modalities:

Text

  • Prevalent method: Hard tokens
  • Hard tokens: Words, subwords, characters, punctuation symbols
  • Soft tokens: Learned prompt or prefix vectors controlling task or style
  • Embeddings: Vector encoding semantic meaning of sentence or document
  • Latent representation: Internal hidden states during text generation or transformation
  • Learned compression: Text compression remains predominantly symbolic, with limited use of learned latents

Audio

  • Prevalent method: Latents + embeddings
  • Hard tokens: Discrete speech units, phoneme identifiers, quantized audio codes
  • Soft tokens: Continuous conditioning vectors controlling speaker, emotion, or prosody
  • Embeddings: Vector encoding speaker identity, acoustic similarity, or musical features
  • Latent representation: Audio latents capturing timbre, rhythm, or spectral structure
  • Learned compression: Applicable

Image

  • Prevalent method: Latents
  • Hard tokens: Discrete visual codewords or quantized image patches
  • Soft tokens: Learned visual token vectors participating in image models
  • Embeddings: Vector encoding image content, objects, or style
  • Latent representation: Image latents used in generative or transformational models
  • Learned compression: Applicable (prevalent)

Video

  • Prevalent method: Latents (hierarchical)
  • Hard tokens: Discrete frame-level or spatiotemporal codewords
  • Soft tokens: Continuous vectors controlling temporal or contextual behavior
  • Embeddings: Vector encoding semantic or perceptual characteristics of video segment
  • Latent representation: Spatiotemporal latents capturing motion and scene dynamics
  • Learned compression: Applicable (prevalent)

Multimodal

  • Prevalent method: Mixed (tokens + embeddings + latents)
  • Hard tokens: Aligned discrete tokens from multiple media types
  • Soft tokens: Continuous token-equivalent vectors enabling cross-modal conditioning
  • Embeddings: Shared vector representation aligning content across modalities
  • Latent representation: Individual or multimodal latents supporting joint generation or transformation
  • Learned compression: Usually applied per modality but could be applied on shared latents
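The prevalent-method column of the applicability matrix above can be condensed into a small lookup table; the keys and values are taken from the summary, and the function name is illustrative.

```python
# Prevalent representation method per modality, per the applicability matrix.
PREVALENT = {
    "text": "hard tokens",
    "audio": "latents + embeddings",
    "image": "latents",
    "video": "latents (hierarchical)",
    "multimodal": "mixed (tokens + embeddings + latents)",
}

def prevalent_method(modality):
    """Look up the prevalent representation method for a media modality."""
    return PREVALENT[modality.lower()]
```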

4. Internal vs. External Representation Framework

The document proposes distinguishing between:
- Internal representation: Representation used by the model or agent for its internal processing
- Exchangeable/external representation: Format exchanged between two entities (e.g., over the uplink (UL) or downlink (DL))

A matrix is provided mapping representation formats to internal/external usage, with most entries marked as FFS (For Further Study), except:
- Learned-based compressed representation: Not internal; external examples include JPEG AI, MPEG AI-PCC
- Model exchange representation: Not internal, external examples include ONNX, NNEF, GGUF, NNC

Proposal

The contribution proposes to include sections 1 to 3 in a relevant section of TR 26.870.

Document Information
Source: InterDigital New York
Type: discussion
For: Agreement
Title: 6GMedia - AI terminology
Agenda item: 11.1
Agenda item description: FS_6G_MED (Study on Media aspects for 6G System)
Release: Rel-20
Specification: 26.87
Contact: Gaelle Martin-Cocher
Uploaded: 2026-02-03T22:05:34.897000
Contact ID: 91571
TDoc Status: noted
Reservation date: 03/02/2026 21:55:32
Agenda item sort order: 60