Meeting: TSGS4_135_India | Agenda Item: 11.1
[FS_6G_MED] LLM-based AI services
Nokia
discussion
| Field | Value |
|---|---|
| TDoc | S4-260108 |
| Title | [FS_6G_MED] LLM-based AI services |
| Source | Nokia |
| Agenda item | 11.1 |
| Agenda item description | FS_6G_MED (Study on Media aspects for 6G System) |
| Doc type | discussion |
| download_url | https://www.3gpp.org/ftp/tsg_sa/WG4_CODEC/TSGS4_135_India/Docs/S4-260108.zip |
| Type | discussion |
| Contact | Saba Ahsan |
| Uploaded | 2026-02-03T19:37:43.990000 |
| Contact ID | 81411 |
| Revised to | S4aP260018 |
| TDoc Status | noted |
| Reservation date | 02/02/2026 17:47:14 |
| Agenda item sort order | 60 |
[Technical] The proposed “Tokenizer” definition as converting any modality into tokens (including “audio frames” and “image patches”) conflates model-internal tokenization with generic media segmentation; in practice audio/image tokenization is highly model- and codec-dependent and not a stable unit suitable for SA4 normative terminology without tighter scoping.
[Technical] “Tokens … with clearly defined boundaries” is not generally true for common LLM tokenizers (e.g., BPE/WordPiece) where boundaries are algorithmic subword units and vary by vocabulary/version; the definition risks being misleading for traffic characterization and charging discussions.
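To make the boundary-instability point concrete, here is a minimal sketch (a toy greedy longest-match tokenizer with invented vocabularies, not any deployed BPE/WordPiece implementation): the same string yields different token boundaries and different token counts under two hypothetical vocabulary versions.

```python
def greedy_subword_tokenize(text: str, vocab: set[str]) -> list[str]:
    """Toy greedy longest-match subword tokenizer (stand-in for BPE/WordPiece)."""
    tokens, i = [], 0
    while i < len(text):
        # Try the longest matching vocabulary entry starting at position i.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])  # fall back to a single character
            i += 1
    return tokens

# Two hypothetical vocabularies, e.g. different model/tokenizer versions:
vocab_v1 = {"token", "ization"}
vocab_v2 = {"tok", "en", "ization"}

print(greedy_subword_tokenize("tokenization", vocab_v1))  # ['token', 'ization']
print(greedy_subword_tokenize("tokenization", vocab_v2))  # ['tok', 'en', 'ization']
```

The token count for the identical input differs (2 vs. 3), which is exactly why "tokens" are not a stable unit for traffic characterization or charging across models and versions.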
[Technical] The architecture mixes functional blocks in a way that is inconsistent with typical MLLM pipelines: CLIP is cited as a “Modality Encoder” producing “token embeddings,” but CLIP commonly produces fixed-length embeddings (or patch embeddings internally) and is not representative of many current MLLMs; the document should avoid naming specific models or clarify the abstraction level.
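The abstraction-level distinction can be shown at the level of output shapes alone (a sketch with invented numbers, no real model involved): a pooled encoder emits one fixed-length vector per image, while a patch-sequence encoder emits one vector per patch, and the two have very different transport and traffic implications.

```python
# Shape-level illustration only; values and dimensions are invented.
def pooled_embedding(image_patches: list[list[float]]) -> list[float]:
    """Pooled output: one fixed-length vector per image (mean over patches)."""
    n, d = len(image_patches), len(image_patches[0])
    return [sum(p[j] for p in image_patches) / n for j in range(d)]

patches = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # 3 patches, embedding dim d=2

pooled = pooled_embedding(patches)  # shape [d]    -> one vector per image
sequence = patches                  # shape [n, d] -> one vector per patch
print(len(pooled), len(sequence))   # 2 3
```

A definition of "Modality Encoder" that does not fix which of these output shapes it means cannot support precise statements about token counts or embedding transport.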
[Technical] The “Combination Layer” description (“combines input token embeddings with contextual token embeddings, potentially using RAG for context window management”) is conceptually muddled: RAG is an external retrieval + prompt construction mechanism, not a layer combining embeddings inside the model; this should be separated into “context retrieval/augmentation” vs “model inference.”
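The separation the comment calls for can be sketched as follows (all function names, corpus contents, and the scoring heuristic are illustrative, not any product's pipeline): retrieval and prompt augmentation happen in text space before tokenization and inference, not as an embedding-combining layer inside the model.

```python
def retrieve_context(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """External retrieval step: rank passages by naive word overlap with the query."""
    q = set(query.lower().split())
    ranked = sorted(corpus, key=lambda p: len(q & set(p.lower().split())), reverse=True)
    return ranked[:k]

def build_prompt(query: str, passages: list[str]) -> str:
    """Augmentation step: context is concatenated in text space, before tokenization."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Context:\n{context}\n\nQuestion: {query}"

def model_inference(prompt: str) -> str:
    """Placeholder for the LLM call; the model only ever sees the assembled prompt."""
    return f"<answer conditioned on {len(prompt)} prompt characters>"

corpus = [
    "6G targets very low latency for some services.",
    "SA4 studies media aspects of the 6G system.",
    "Tokenizers map text to subword units.",
]
query = "What does SA4 study?"
prompt = build_prompt(query, retrieve_context(query, corpus))
print(model_inference(prompt))
```

Framed this way, "context retrieval/augmentation" and "model inference" are cleanly separable functions, which is the distinction the proposed "Combination Layer" text currently blurs.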
[Technical] NOTE 2 claims token charging is based on “outcome of modality encoding and combination layers,” which is inaccurate for most services (charging is typically based on input/output token counts at the text tokenization interface, not internal embeddings); this undermines the traffic/charging motivation.
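As a hedged illustration of where charging typically attaches (the tokenizer, rates, and texts below are all invented): billed units are input and output token counts measured at the text tokenization interface, with no reference to internal embeddings.

```python
def charge_for_call(input_text: str, output_text: str,
                    tokenize, rate_in: float, rate_out: float) -> float:
    """Charging attaches at the tokenization interface: billed units are
    input/output token counts, not the outcome of internal embedding layers."""
    n_in = len(tokenize(input_text))
    n_out = len(tokenize(output_text))
    return n_in * rate_in + n_out * rate_out

# Toy whitespace tokenizer and illustrative per-token rates (invented):
toy_tokenize = str.split
cost = charge_for_call("summarize this document please",            # 4 input tokens
                       "here is a short summary of the document",   # 8 output tokens
                       toy_tokenize, rate_in=0.001, rate_out=0.002)
print(f"{cost:.3f}")  # 0.020
```

This is why NOTE 2's claim that charging follows the "outcome of modality encoding and combination layers" misstates the interface at which the charging motivation actually applies.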
[Technical] The proposal does not connect the architecture/definitions to SA4-relevant study outputs (e.g., concrete traffic models, latency/jitter constraints, conversational turn-taking, uplink/downlink asymmetry, streaming token generation), so it’s unclear how Clause 3 definitions will enable gap analysis in QoS or media formats.
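One concrete shape such a study output could take (a sketch under invented parameters, not measurements): mapping a streamed token response to a downlink bitrate and deriving the uplink/downlink asymmetry that a conversational LLM session would impose.

```python
def token_stream_bitrate(tokens_per_s: float, bytes_per_token: float) -> float:
    """Bitrate (bit/s) of a streamed token flow."""
    return tokens_per_s * bytes_per_token * 8

# Illustrative numbers only (invented for the sketch):
uplink_bps = token_stream_bitrate(tokens_per_s=3.0, bytes_per_token=4.0)     # short user prompts
downlink_bps = token_stream_bitrate(tokens_per_s=40.0, bytes_per_token=4.0)  # streamed answer
print(uplink_bps, downlink_bps, downlink_bps / uplink_bps)
```

Even this trivial model surfaces SA4-relevant parameters (token rate, token size, burstiness, turn-taking) that the proposed Clause 3 definitions would need to support for a meaningful QoS gap analysis.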
[Technical] NOTE 1 introduces “transport of token embeddings” as FFS, but the contribution does not justify why embedding transport is in scope for 3GPP media work versus transporting conventional media + text; without a clear use case, this risks steering the study toward non-deployable assumptions.
[Technical] The statement in NOTE 3 that “all components on the server side run on the server” is not universally true even today (on-device encoders, hybrid edge inference, split computing); if kept, it should be framed as “typical today” and extended with the split/edge deployment variants relevant to 6G.
[Editorial] The document references “Figure X.1” and a “dashed line” but provides no figure; key architectural claims cannot be reviewed or agreed without the actual diagram and its boundary conditions.
[Editorial] The contribution proposes adding content to “Clause 3” but does not specify which 3GPP document/TS/TR and which clause numbering (FS_6G_MED deliverable vs TR 26.847 vs a new SA4 TR), making the proposal non-actionable.
[Editorial] Several terms are introduced without alignment to existing 3GPP terminology (e.g., “Media Decoder/Generator” vs codec/renderer concepts, “token embeddings” vs feature vectors), and no mapping is provided to existing SA4 definitions, risking inconsistent vocabulary across the study.
[Editorial] The introduction cites TR 22.870 and “over 60 AI-related use cases” but does not cite specific use cases relevant to media communication nor extract requirements; adding a small set of representative use cases and their media/traffic implications would strengthen the contribution.