TDoc: S4-260108

Meeting: TSGS4_135_India | Agenda Item: 11.1

Document Information
Title: [FS_6G_MED] LLM-based AI services
Source: Nokia
Type: discussion
Agenda item: 11.1 (FS_6G_MED, Study on Media aspects for 6G System)
Contact: Saba Ahsan (Contact ID 81411)
Download URL: https://www.3gpp.org/ftp/tsg_sa/WG4_CODEC/TSGS4_135_India/Docs/S4-260108.zip
Uploaded: 2026-02-03T19:37:43.990000
Reservation date: 02/02/2026 17:47:14
Revised to: S4aP260018
TDoc Status: noted
Agenda item sort order: 60
Comments

Previous Comments:
manager, 2026-02-09 04:31:02:

  1. [Technical] The proposed “Tokenizer” definition as converting any modality into tokens (including “audio frames” and “image patches”) conflates model-internal tokenization with generic media segmentation; in practice audio/image tokenization is highly model- and codec-dependent and not a stable unit suitable for SA4 normative terminology without tighter scoping.

  2. [Technical] “Tokens … with clearly defined boundaries” is not generally true for common LLM tokenizers (e.g., BPE/WordPiece) where boundaries are algorithmic subword units and vary by vocabulary/version; the definition risks being misleading for traffic characterization and charging discussions.
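
To make the boundary point concrete, here is a minimal stand-alone sketch (a toy greedy BPE; both merge tables are invented for illustration and do not correspond to any real model's vocabulary) in which the same string splits into different subword units under two vocabulary versions:

```python
# Toy illustration: the same string yields different subword boundaries
# under two hypothetical BPE merge tables (i.e. two "vocabulary versions").

def bpe_tokenize(text, merges):
    """Greedy BPE: repeatedly merge the highest-priority adjacent pair."""
    tokens = list(text)
    while True:
        pairs = [(merges[(a, b)], i)
                 for i, (a, b) in enumerate(zip(tokens, tokens[1:]))
                 if (a, b) in merges]
        if not pairs:
            return tokens
        _, i = min(pairs)  # lowest priority value merges first
        tokens = tokens[:i] + [tokens[i] + tokens[i + 1]] + tokens[i + 2:]

# Two vocabulary versions with different merge priorities.
merges_v1 = {("t", "o"): 0, ("to", "k"): 1, ("e", "n"): 2}
merges_v2 = {("k", "e"): 0, ("ke", "n"): 1, ("t", "o"): 2}

print(bpe_tokenize("token", merges_v1))  # ['tok', 'en']
print(bpe_tokenize("token", merges_v2))  # ['to', 'ken']
```

Neither segmentation aligns with a word or media-unit boundary, which is the reason subword tokens are a shaky unit for traffic characterization.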

  3. [Technical] The architecture mixes functional blocks in a way that is inconsistent with typical MLLM pipelines: CLIP is cited as a “Modality Encoder” producing “token embeddings,” but CLIP commonly produces fixed-length embeddings (or patch embeddings internally) and is not representative of many current MLLMs; the document should avoid naming specific models or clarify the abstraction level.

  4. [Technical] The “Combination Layer” description (“combines input token embeddings with contextual token embeddings, potentially using RAG for context window management”) is conceptually muddled: RAG is an external retrieval + prompt construction mechanism, not a layer combining embeddings inside the model; this should be separated into “context retrieval/augmentation” vs “model inference.”
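
The separation being argued for can be sketched as follows; all names (`retrieve`, `build_prompt`, `model_infer`) and the word-overlap ranking are illustrative stand-ins, not from the contribution or any 3GPP spec:

```python
# Sketch of "context retrieval/augmentation" vs "model inference" as distinct
# stages: RAG builds the prompt *before* inference and is not a layer inside
# the model.

def retrieve(query, corpus, k=2):
    """Context retrieval: rank documents by naive word overlap
    (a stand-in for a vector store lookup)."""
    scored = sorted(corpus,
                    key=lambda d: -len(set(query.split()) & set(d.split())))
    return scored[:k]

def build_prompt(query, context_docs):
    """Context augmentation: splice retrieved text into the prompt."""
    return "Context:\n" + "\n".join(context_docs) + "\nQuestion: " + query

def model_infer(prompt):
    """Model inference: the only stage that tokenizes and runs the LLM
    (stubbed here)."""
    return f"<answer based on {len(prompt)} prompt characters>"

corpus = ["6G media transport study",
          "tokenizer definitions for SA4",
          "unrelated text"]
query = "media transport in 6G"
prompt = build_prompt(query, retrieve(query, corpus))
print(model_infer(prompt))
```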

  5. [Technical] NOTE 2 claims token charging is based on “outcome of modality encoding and combination layers,” which is inaccurate for most services (charging is typically based on input/output token counts at the text tokenization interface, not internal embeddings); this undermines the traffic/charging motivation.
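
A hedged sketch of how commercial LLM APIs typically meter usage makes the point: charging keys off input/output token counts at the text tokenization interface, with internal embeddings never appearing in the formula. The rates and counts below are invented for illustration:

```python
# Typical per-token metering: separate input and output rates, applied to
# token *counts* at the tokenization interface. Modality-encoder or
# combination-layer internals never enter this calculation.

def token_charge(input_tokens, output_tokens,
                 rate_in_per_1k=0.001, rate_out_per_1k=0.003):
    """Charge = input tokens at one per-1k rate + generated tokens at another."""
    return (input_tokens / 1000 * rate_in_per_1k
            + output_tokens / 1000 * rate_out_per_1k)

# e.g. a 2,000-token prompt producing a 500-token answer:
print(f"${token_charge(2000, 500):.6f}")  # 2 * 0.001 + 0.5 * 0.003 = $0.003500
```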

  6. [Technical] The proposal does not connect the architecture/definitions to SA4-relevant study outputs (e.g., concrete traffic models, latency/jitter constraints, conversational turn-taking, uplink/downlink asymmetry, streaming token generation), so it’s unclear how Clause 3 definitions will enable gap analysis in QoS or media formats.
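
As an example of the kind of concrete traffic model the study could target, here is a back-of-envelope downlink estimate for streamed token generation; all parameters (tokens/s, bytes per token, per-event framing overhead) are assumptions for illustration, not measurements:

```python
# Illustrative downlink traffic estimate for streamed token generation:
# payload is the token text, overhead is per-event transport/framing cost
# (e.g. one streaming chunk per event).

def streaming_downlink_bps(tokens_per_sec, avg_bytes_per_token=4,
                           framing_overhead_bytes=120, events_per_sec=10):
    payload = tokens_per_sec * avg_bytes_per_token
    overhead = events_per_sec * framing_overhead_bytes
    return (payload + overhead) * 8  # bits per second

# e.g. 30 tokens/s streamed as 10 chunks/s:
print(streaming_downlink_bps(30))  # (30*4 + 10*120) * 8 = 10560 bps
```

Even this crude model exposes SA4-relevant features: low average bitrate, bursty per-chunk arrivals, and strong uplink/downlink asymmetry during a conversational turn.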

  7. [Technical] NOTE 1 introduces “transport of token embeddings” as FFS, but the contribution does not justify why embedding transport is in scope for 3GPP media work versus transporting conventional media + text; without a clear use case, this risks steering the study toward non-deployable assumptions.
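
A quick size comparison shows why embedding transport needs explicit justification: per token, a raw embedding is orders of magnitude larger than the text it represents. The embedding dimension and element width below are assumptions chosen for illustration:

```python
# Back-of-envelope comparison: bytes on the wire per token, raw embedding
# vs text. Assumes a 4096-dimensional embedding in float16 (2 bytes/element)
# and ~4 bytes of UTF-8 text per token.

def bytes_per_token_embedding(dim=4096, bytes_per_element=2):
    return dim * bytes_per_element

def expansion_factor(avg_text_bytes_per_token=4, dim=4096, bytes_per_element=2):
    return bytes_per_token_embedding(dim, bytes_per_element) / avg_text_bytes_per_token

print(bytes_per_token_embedding())  # 8192 bytes per token
print(expansion_factor())           # 2048.0x larger than the text itself
```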

  8. [Technical] The statement in NOTE 3 that “all components on the server side run on the server” is already not universally true (on-device encoders, hybrid edge inference, split computing); if kept, it should be framed as “typical today” and include split/edge variants relevant to 6G.

  9. [Editorial] The document references “Figure X.1” and a “dashed line” but provides no figure; key architectural claims cannot be reviewed or agreed without the actual diagram and its boundary conditions.

  10. [Editorial] The contribution proposes adding content to “Clause 3” but does not specify which 3GPP document/TS/TR and which clause numbering (FS_6G_MED deliverable vs TR 26.847 vs a new SA4 TR), making the proposal non-actionable.

  11. [Editorial] Several terms are introduced without alignment to existing 3GPP terminology (e.g., “Media Decoder/Generator” vs codec/renderer concepts, “token embeddings” vs feature vectors), and no mapping is provided to existing SA4 definitions, risking inconsistent vocabulary across the study.

  12. [Editorial] The introduction cites TR 22.870 and “over 60 AI-related use cases” but does not cite specific use cases relevant to media communication nor extract requirements; adding a small set of representative use cases and their media/traffic implications would strengthen the contribution.