Meeting: TSGS4_135_India | Agenda Item: 11.1
[FS_6G_MED] LLM-based AI services
Nokia
discussion
| Field | Value |
|---|---|
| TDoc | S4-260108 |
| Title | [FS_6G_MED] LLM-based AI services |
| Source | Nokia |
| Agenda item | 11.1 |
| Agenda item description | FS_6G_MED (Study on Media aspects for 6G System) |
| Doc type | discussion |
| download_url | https://www.3gpp.org/ftp/tsg_sa/WG4_CODEC/TSGS4_135_India/Docs/S4-260108.zip |
| Type | discussion |
| Contact | Saba Ahsan |
| Uploaded | 2026-02-03T19:37:43.990000 |
| Contact ID | 81411 |
| Revised to | S4aP260018 |
| TDoc Status | noted |
| Reservation date | 02/02/2026 17:47:14 |
| Agenda item sort order | 60 |
[Technical] The proposed “Tokenizer” definition as converting any modality into tokens (including “audio frames” and “image patches”) conflates model-internal tokenization with generic media segmentation; in practice audio/image tokenization is highly model- and codec-dependent and not a stable unit suitable for SA4 normative terminology without tighter scoping.
[Technical] “Tokens … with clearly defined boundaries” is not generally true for common LLM tokenizers (e.g., BPE/WordPiece) where boundaries are algorithmic subword units and vary by vocabulary/version; the definition risks being misleading for traffic characterization and charging discussions.
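To make the boundary-instability point concrete, here is a minimal sketch (a toy greedy longest-match tokenizer with invented vocabularies, not any deployed BPE/WordPiece implementation): the same string yields different token boundaries and different token counts under two hypothetical vocabulary versions.

```python
def greedy_subword_tokenize(text: str, vocab: set[str]) -> list[str]:
    """Toy greedy longest-match subword tokenizer (stand-in for BPE/WordPiece)."""
    tokens, i = [], 0
    while i < len(text):
        # Try the longest matching vocabulary entry starting at position i.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])  # fall back to a single character
            i += 1
    return tokens

# Two hypothetical vocabularies, e.g. different model/tokenizer versions:
vocab_v1 = {"token", "ization"}
vocab_v2 = {"tok", "en", "ization"}

print(greedy_subword_tokenize("tokenization", vocab_v1))  # ['token', 'ization']
print(greedy_subword_tokenize("tokenization", vocab_v2))  # ['tok', 'en', 'ization']
```

The token count for the identical input differs (2 vs. 3), which is exactly why "tokens" are not a stable unit for traffic characterization or charging across models and versions.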
[Technical] The architecture mixes functional blocks in a way that is inconsistent with typical MLLM pipelines: CLIP is cited as a “Modality Encoder” producing “token embeddings,” but CLIP commonly produces fixed-length embeddings (or patch embeddings internally) and is not representative of many current MLLMs; the document should avoid naming specific models or clarify the abstraction level.
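The abstraction-level distinction can be shown at the level of output shapes alone (a sketch with invented numbers, no real model involved): a pooled encoder emits one fixed-length vector per image, while a patch-sequence encoder emits one vector per patch, and the two have very different transport and traffic implications.

```python
# Shape-level illustration only; values and dimensions are invented.
def pooled_embedding(image_patches: list[list[float]]) -> list[float]:
    """Pooled output: one fixed-length vector per image (mean over patches)."""
    n, d = len(image_patches), len(image_patches[0])
    return [sum(p[j] for p in image_patches) / n for j in range(d)]

patches = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # 3 patches, embedding dim d=2

pooled = pooled_embedding(patches)  # shape [d]    -> one vector per image
sequence = patches                  # shape [n, d] -> one vector per patch
print(len(pooled), len(sequence))   # 2 3
```

A definition of "Modality Encoder" that does not fix which of these output shapes it means cannot support precise statements about token counts or embedding transport.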
[Technical] The “Combination Layer” description (“combines input token embeddings with contextual token embeddings, potentially using RAG for context window management”) is conceptually muddled: RAG is an external retrieval + prompt construction mechanism, not a layer combining embeddings inside the model; this should be separated into “context retrieval/augmentation” vs “model inference.”
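The separation the comment calls for can be sketched as follows (all function names, corpus contents, and the scoring heuristic are illustrative, not any product's pipeline): retrieval and prompt augmentation happen in text space before tokenization and inference, not as an embedding-combining layer inside the model.

```python
def retrieve_context(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """External retrieval step: rank passages by naive word overlap with the query."""
    q = set(query.lower().split())
    ranked = sorted(corpus, key=lambda p: len(q & set(p.lower().split())), reverse=True)
    return ranked[:k]

def build_prompt(query: str, passages: list[str]) -> str:
    """Augmentation step: context is concatenated in text space, before tokenization."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Context:\n{context}\n\nQuestion: {query}"

def model_inference(prompt: str) -> str:
    """Placeholder for the LLM call; the model only ever sees the assembled prompt."""
    return f"<answer conditioned on {len(prompt)} prompt characters>"

corpus = [
    "6G targets very low latency for some services.",
    "SA4 studies media aspects of the 6G system.",
    "Tokenizers map text to subword units.",
]
query = "What does SA4 study?"
prompt = build_prompt(query, retrieve_context(query, corpus))
print(model_inference(prompt))
```

Framed this way, "context retrieval/augmentation" and "model inference" are cleanly separable functions, which is the distinction the proposed "Combination Layer" text currently blurs.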
[Technical] NOTE 2 claims token charging is based on “outcome of modality encoding and combination layers,” which is inaccurate for most services (charging is typically based on input/output token counts at the text tokenization interface, not internal embeddings); this undermines the traffic/charging motivation.
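As a hedged illustration of where charging typically attaches (the tokenizer, rates, and texts below are all invented): billed units are input and output token counts measured at the text tokenization interface, with no reference to internal embeddings.

```python
def charge_for_call(input_text: str, output_text: str,
                    tokenize, rate_in: float, rate_out: float) -> float:
    """Charging attaches at the tokenization interface: billed units are
    input/output token counts, not the outcome of internal embedding layers."""
    n_in = len(tokenize(input_text))
    n_out = len(tokenize(output_text))
    return n_in * rate_in + n_out * rate_out

# Toy whitespace tokenizer and illustrative per-token rates (invented):
toy_tokenize = str.split
cost = charge_for_call("summarize this document please",            # 4 input tokens
                       "here is a short summary of the document",   # 8 output tokens
                       toy_tokenize, rate_in=0.001, rate_out=0.002)
print(f"{cost:.3f}")  # 0.020
```

This is why NOTE 2's claim that charging follows the "outcome of modality encoding and combination layers" misstates the interface at which the charging motivation actually applies.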
[Technical] The proposal does not connect the architecture/definitions to SA4-relevant study outputs (e.g., concrete traffic models, latency/jitter constraints, conversational turn-taking, uplink/downlink asymmetry, streaming token generation), so it’s unclear how Clause 3 definitions will enable gap analysis in QoS or media formats.
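One concrete shape such a study output could take (a sketch under invented parameters, not measurements): mapping a streamed token response to a downlink bitrate and deriving the uplink/downlink asymmetry that a conversational LLM session would impose.

```python
def token_stream_bitrate(tokens_per_s: float, bytes_per_token: float) -> float:
    """Bitrate (bit/s) of a streamed token flow."""
    return tokens_per_s * bytes_per_token * 8

# Illustrative numbers only (invented for the sketch):
uplink_bps = token_stream_bitrate(tokens_per_s=3.0, bytes_per_token=4.0)     # short user prompts
downlink_bps = token_stream_bitrate(tokens_per_s=40.0, bytes_per_token=4.0)  # streamed answer
print(uplink_bps, downlink_bps, downlink_bps / uplink_bps)
```

Even this trivial model surfaces SA4-relevant parameters (token rate, token size, burstiness, turn-taking) that the proposed Clause 3 definitions would need to support for a meaningful QoS gap analysis.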
[Technical] NOTE 1 introduces “transport of token embeddings” as FFS, but the contribution does not justify why embedding transport is in scope for 3GPP media work versus transporting conventional media + text; without a clear use case, this risks steering the study toward non-deployable assumptions.
[Technical] The statement in NOTE 3 that “all components on the server side run on the server” is not universally true even today (on-device encoders, hybrid edge inference, split computing); if kept, it should be framed as “typical today” and extended with the split/edge deployment variants relevant to 6G.
[Editorial] The document references “Figure X.1” and a “dashed line” but provides no figure; key architectural claims cannot be reviewed or agreed without the actual diagram and its boundary conditions.
[Editorial] The contribution proposes adding content to “Clause 3” but does not specify which 3GPP document/TS/TR and which clause numbering (FS_6G_MED deliverable vs TR 26.847 vs a new SA4 TR), making the proposal non-actionable.
[Editorial] Several terms are introduced without alignment to existing 3GPP terminology (e.g., “Media Decoder/Generator” vs codec/renderer concepts, “token embeddings” vs feature vectors), and no mapping is provided to existing SA4 definitions, risking inconsistent vocabulary across the study.
[Editorial] The introduction cites TR 22.870 and “over 60 AI-related use cases” but does not cite specific use cases relevant to media communication nor extract requirements; adding a small set of representative use cases and their media/traffic implications would strengthen the contribution.