Meeting: TSGS4_135_India | Agenda Item: 11.1
| TDoc | S4-260273 |
| Title | 6GMedia - AI terminology |
| Source | InterDigital New York |
| Agenda item | 11.1 |
| Agenda item description | FS_6G_MED (Study on Media aspects for 6G System) |
| Doc type | discussion |
| For action | Agreement |
| Release | Rel-20 |
| Specification | 26.87 |
| download_url | https://www.3gpp.org/ftp/tsg_sa/WG4_CODEC/TSGS4_135_India/Docs/S4-260273.zip |
| Contact | Gaelle Martin-Cocher |
| Uploaded | 2026-02-03T22:05:34.897000 |
| Contact ID | 91571 |
| TDoc Status | noted |
| Reservation date | 03/02/2026 21:55:32 |
| Agenda item sort order | 60 |
[Technical] The proposal is not framed as normative 3GPP terminology: there is no alignment with the TR 21.905 definition style and no indication of scope or authority. Inserting these definitions into TR 26.870 therefore risks creating “official” definitions that conflict with existing SA4/SA2 terms (e.g., “feature”, “descriptor”, “intermediate data”) without a clear governance statement.
[Technical] “Soft token” is defined as a continuous vector that “replaces or augments a hard token” and is “processed similarly”. In many architectures, however, tokens remain discrete indices while only the embedding is continuous; the current text blurs the token/embedding distinction and will confuse traffic-characterization discussions, since the bits on the wire are typically hard-token indices or coded latents, not “soft tokens”.
[Technical] The definition “Embedding… Not inherently part of a token sequence” is incorrect, or at least incomplete, for common transformer pipelines, where embeddings are exactly the per-token continuous representations that form the input sequence. This contradiction undermines the intended distinction between “token”, “embedding”, and “latent”.
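To make the token-vs-embedding distinction concrete, a minimal sketch (the vocabulary, embedding values, and function names are all illustrative, not taken from the contribution):

```python
# Toy illustration of the hard-token vs embedding distinction.
# Hard tokens: discrete integer indices into a vocabulary -- this is what is
# typically serialized "on the wire".
# Embeddings: the continuous per-token vectors looked up from those indices,
# which do form the model's input sequence.

vocab = {"media": 0, "over": 1, "6g": 2}   # hypothetical vocabulary
embedding_table = [                         # one continuous vector per token ID
    [0.10, -0.30],
    [0.25, 0.05],
    [-0.40, 0.80],
]

def tokenize(words):
    """Map words to discrete hard-token IDs."""
    return [vocab[w] for w in words]

def embed(token_ids):
    """Look up the continuous embedding for each discrete token ID."""
    return [embedding_table[t] for t in token_ids]

token_ids = tokenize(["media", "over", "6g"])   # discrete: [0, 1, 2]
vectors = embed(token_ids)                      # continuous per-token sequence
```

The point for traffic characterization: `token_ids` is the compact, discrete object exchanged between entities, while `vectors` is the continuous per-token sequence inside the model.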
[Technical] “Learned based media compression (representation)” is described as a “syntax-defined coded form derived from latent representation after quantization and entropy coding”. This excludes important learned codecs that do not use explicit entropy coding, or that use arithmetic coding over discrete tokens; the definition should be generalized, or explicitly scoped, so that it is not wrong for major classes of neural codecs.
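As a toy illustration of why the definition needs generalizing: some learned codecs quantize a continuous latent and then entropy-code the symbols, while token-based codecs transmit discrete codebook indices with no explicit latent entropy-coding stage. A hedged sketch (the latent values, quantization step, and codebook are invented for illustration):

```python
# Two simplified "coded form" paths for learned media compression.
# Path A: continuous latent -> scalar quantization -> (optional) entropy coding.
# Path B: discrete tokens from a learned codebook, transmitted as indices.

latent = [0.93, -1.42, 0.07, 2.61]    # hypothetical continuous latent

def quantize(values, step=0.5):
    """Path A: uniform scalar quantization to integer symbols."""
    return [round(v / step) for v in values]

symbols = quantize(latent)            # integer symbols, ready for entropy coding

# Path B: a token-based codec sends nearest-codebook-entry indices instead.
codebook = [-1.0, 0.0, 1.0, 2.5]      # hypothetical learned codebook

def to_tokens(values):
    """Map each latent value to the index of the nearest codebook entry."""
    return [min(range(len(codebook)), key=lambda i: abs(codebook[i] - v))
            for v in values]

tokens = to_tokens(latent)            # discrete indices, no latent entropy model
```

Both paths yield a transmissible coded form, but only Path A matches the contribution's “quantization and entropy coding” wording, which is the gap the comment above flags.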
[Technical] The “Model exchange representation” examples include GGUF, which is primarily a model file/container format for specific inference stacks rather than an interoperable operator-graph exchange format like ONNX/NNEF; mixing these may mislead SA4 on what is realistically exchangeable across vendors.
[Technical] The “Internal vs external representation” matrix is largely FFS and asserts “Model exchange representation: Not internal”, yet model formats can be internal to a system component boundary (e.g., between an orchestrator and an accelerator runtime). The internal/external dichotomy needs a 3GPP entity/interface context (UE, AF, AS, NEF, etc.) to be meaningful.
[Technical] “Intermediate data” is said to “include intermediate coded representation, feature representation or descriptors” and references TR 26.927, but the contribution neither quotes the exact TR 26.927 definition nor ensures consistency with it. This risks redefining an existing SA4 term; the wording should be cross-checked verbatim against TR 26.927.
[Technical] The applicability matrix makes strong modality claims (e.g., “Text: prevalent method hard tokens”, “Audio: prevalent method latents + embeddings”) that are architecture-dependent and not stable enough for a TR unless clearly labeled as informative examples; otherwise it may bias later requirements/traffic models incorrectly.
[Technical] “Inference results” includes “W3C Media Annotations” as an example, but that is a metadata framework rather than an AI inference output format; the example set should be constrained to representations relevant to 3GPP media workflows (e.g., bounding boxes, masks, captions) and to what is exchanged over 3GPP interfaces.
[Editorial] Several terms are introduced without consistent naming/grammar (“Learned based…” vs “Learned-based…”, “latent representation (latent)”, “exchangeable/external representation”), which will read poorly in TR 26.870 and complicate cross-referencing.
[Editorial] The contribution proposes “include sections 1 to 3” but does not specify the exact target clause/subclause in TR 26.870, nor provide proposed text with numbering and definitions formatting; this makes it hard to assess integration impact and creates editorial ambiguity for rapporteurs.
[Editorial] Examples mix standards and non-standards inconsistently (JPEG AI, MPEG AI-PCC, ONNX, NNEF, GGUF, NNC) without citations; TR text should either cite stable references or avoid listing volatile ecosystem artifacts that may date quickly.