S4-260273 - AI Summary

6GMedia – AI Terminology for Media Communication Services

Overview

This contribution addresses the need for standardized AI terminology in the context of the 6GMedia work. The document notes that terms such as "token" and "embedding" are not understood consistently across delegates and 3GPP working groups, and it proposes definitions for AI representation formats to enable assessment of their traffic characteristics and impact on SA4 specifications.

Main Technical Contributions

1. Core AI Representation Terminology

The document proposes comprehensive definitions for fundamental AI representation concepts:

Token Definitions

  • Token (general): Unit of representation processed sequentially or symbolically by a model, which can be discrete (hard token) or continuous (soft token)
  • Hard token: Discrete symbolic representation selected from a predefined vocabulary or codebook, processed as an atomic unit within a model's input or output sequence. Has explicit identifier without partial or weighted selection. Examples include text symbols, discrete audio or visual codewords
  • Soft token: Continuous vector representation that functionally replaces or augments a hard token, not corresponding to a single discrete vocabulary element. Typically learned and processed similarly to hard tokens
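The hard/soft distinction above can be sketched in a few lines of Python. This is purely illustrative: the vocabulary, function names, and vector values are invented for the example and are not part of the contribution.

```python
# Minimal sketch contrasting hard and soft tokens (illustrative only).

# Hard tokens: discrete identifiers selected from a predefined vocabulary,
# each processed as an atomic unit in a model's input/output sequence.
VOCAB = {"<pad>": 0, "media": 1, "over": 2, "6g": 3}

def to_hard_tokens(words):
    """Map each symbol to its explicit vocabulary identifier."""
    return [VOCAB[w] for w in words]

# Soft tokens: continuous vectors that replace or augment hard tokens.
# They have no single vocabulary index; in practice they are learned
# (e.g., prompt/prefix vectors). A constant vector stands in here.
def make_soft_token(dim=4, value=0.5):
    """A continuous token-like vector; a placeholder for a learned one."""
    return [value] * dim

hard = to_hard_tokens(["media", "over", "6g"])  # discrete IDs: [1, 2, 3]
soft = make_soft_token()                        # continuous: [0.5, 0.5, 0.5, 0.5]
```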

Other Core Representations

  • Embedding: Continuous vector representation encoding semantic, perceptual, or structural characteristics of content for comparison, retrieval, alignment, or conditioning. Not inherently part of a token sequence
  • Latent representation (latent): Continuous representation capturing compressed, abstract, or generative state during processing or synthesis
  • Hierarchical latent representation: Latent representation structured across multiple levels, with higher levels capturing coarse/global/semantic information and lower levels capturing fine-grained/local/residual information
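The "comparison, retrieval, alignment" role of embeddings can be illustrated with a small similarity search. The vectors and clip names below are invented for the example; real embeddings would be produced by a trained model.

```python
import math

def cosine_similarity(a, b):
    """Compare two embedding vectors (dot product over the norm product)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Retrieval: pick the stored item whose embedding is closest to the query.
query = [1.0, 0.0, 1.0]
candidates = {"clipA": [1.0, 0.1, 0.9], "clipB": [0.0, 1.0, 0.0]}
best = max(candidates, key=lambda k: cosine_similarity(query, candidates[k]))
# best == "clipA"
```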

2. Media-Specific and Exchange Format Terminology

Compression and Exchange Formats

  • Learned-based media compression (representation): Discrete, syntax-defined coded form derived from a latent representation after quantization and entropy coding (e.g., JPEG AI, MPEG AI-PCC)
  • Model exchange representation: Defines network topology, operators, parameters, and data types for interoperable deployment across distributed systems (e.g., ONNX, NNEF, GGUF)
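The four aspects a model exchange representation must cover (topology, operators, parameters, data types) can be sketched as a serializable structure. This is a hypothetical, simplified layout for illustration, not actual ONNX, NNEF, or GGUF syntax.

```python
import json

# Hypothetical, simplified model-exchange description covering topology,
# operators, parameters, and data types. NOT a real ONNX/NNEF/GGUF format.
model_exchange = {
    "topology": [
        {"from": "input", "to": "dense0"},
        {"from": "dense0", "to": "output"},
    ],
    "operators": [{"name": "dense0", "op": "MatMul"}],
    "parameters": {"dense0.weight": [[0.1, 0.2], [0.3, 0.4]]},
    "data_types": {"dense0.weight": "float32"},
}

# External/exchangeable form: serialized for transfer between entities.
serialized = json.dumps(model_exchange)
# Receiving side: restore for interoperable deployment.
restored = json.loads(serialized)
```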

Inference-Related Representations

  • Feature/descriptor: Structured representation of content characteristics, extracted for description, comparison, indexing, or retrieval (e.g., MPEG-7 descriptors)
  • Inference results: Output representation produced by AI system as result of applying model to input data, conveying decision, prediction, estimation, or detection outcome. May include labels, scores, confidence values, coordinates, masks, or structured outputs (e.g., W3C Media Annotations)
  • Intermediate data: Representation of partially processed information exchanged between inference components, enabling continuation of inference by downstream entity (includes intermediate coded representation, feature representation or descriptors). References TR 26.927 definition
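An inference result bundling the kinds of fields listed above (label, score, coordinates) might look like the following. The container and its field names are illustrative assumptions, not a schema from 3GPP or W3C Media Annotations.

```python
from dataclasses import dataclass

# Hypothetical container for an inference result, mirroring the field
# types named in the summary (labels, confidence scores, coordinates).
@dataclass
class InferenceResult:
    label: str           # decision/detection outcome, e.g. a class label
    score: float         # confidence value in [0, 1]
    bbox: tuple = None   # optional (x, y, w, h) coordinates for detections

result = InferenceResult(label="person", score=0.93, bbox=(12, 30, 64, 128))
```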

3. Applicability Matrix to Media Types/Modalities

The document provides a comprehensive mapping of representation types across different media modalities:

Text

  • Prevalent method: Hard tokens
  • Hard tokens: Words, subwords, characters, punctuation symbols
  • Soft tokens: Learned prompt or prefix vectors controlling task or style
  • Embeddings: Vector encoding semantic meaning of sentence or document
  • Latent representation: Internal hidden states during text generation or transformation
  • Learned compression: Text compression remains predominantly symbolic, with limited use of learned latents

Audio

  • Prevalent method: Latents + embeddings
  • Hard tokens: Discrete speech units, phoneme identifiers, quantized audio codes
  • Soft tokens: Continuous conditioning vectors controlling speaker, emotion, or prosody
  • Embeddings: Vector encoding speaker identity, acoustic similarity, or musical features
  • Latent representation: Audio latents capturing timbre, rhythm, or spectral structure
  • Learned compression: Applicable

Image

  • Prevalent method: Latents
  • Hard tokens: Discrete visual codewords or quantized image patches
  • Soft tokens: Learned visual token vectors participating in image models
  • Embeddings: Vector encoding image content, objects, or style
  • Latent representation: Image latents used in generative or transformational models
  • Learned compression: Applicable (prevalent)

Video

  • Prevalent method: Latents (hierarchical)
  • Hard tokens: Discrete frame-level or spatiotemporal codewords
  • Soft tokens: Continuous vectors controlling temporal or contextual behavior
  • Embeddings: Vector encoding semantic or perceptual characteristics of video segment
  • Latent representation: Spatiotemporal latents capturing motion and scene dynamics
  • Learned compression: Applicable (prevalent)

Multimodal

  • Prevalent method: Mixed (tokens + embeddings + latents)
  • Hard tokens: Aligned discrete tokens from multiple media types
  • Soft tokens: Continuous token-equivalent vectors enabling cross-modal conditioning
  • Embeddings: Shared vector representation aligning content across modalities
  • Latent representation: Individual or multimodal latents supporting joint generation or transformation
  • Learned compression: Usually applied per modality but could be applied on shared latents
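The prevalent-method column of the applicability matrix above can be condensed into a small lookup table; the keys and values are taken from the summary, and the function name is illustrative.

```python
# Prevalent representation method per modality, per the applicability matrix.
PREVALENT = {
    "text": "hard tokens",
    "audio": "latents + embeddings",
    "image": "latents",
    "video": "latents (hierarchical)",
    "multimodal": "mixed (tokens + embeddings + latents)",
}

def prevalent_method(modality):
    """Look up the prevalent representation method for a media modality."""
    return PREVALENT[modality.lower()]
```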

4. Internal vs. External Representation Framework

The document proposes distinguishing between:
- Internal representation: Representation used by the model or agent for its internal processing
- Exchangeable/external representation: Format exchanged between two entities (e.g., over the uplink (UL) or downlink (DL))

A matrix is provided mapping representation formats to internal/external usage, with most entries marked as FFS (For Further Study), except:
- Learned-based compressed representation: Not internal; external examples include JPEG AI, MPEG AI-PCC
- Model exchange representation: Not internal, external examples include ONNX, NNEF, GGUF, NNC

Proposal

The contribution proposes to include sections 1 to 3 in a relevant section of TR 26.870.

Document Information
Source: InterDigital New York
Type: discussion
For: Agreement
Title: 6GMedia - AI terminology
Agenda item: 11.1
Agenda item description: FS_6G_MED (Study on Media aspects for 6G System)
Release: Rel-20
Specification: 26.87
Contact: Gaelle Martin-Cocher
Uploaded: 2026-02-03T22:05:34.897000
Contact ID: 91571
TDoc Status: noted
Reservation date: 03/02/2026 21:55:32
Agenda item sort order: 60