S4-260161

[FS_6G_MED]pCR on Embodied Video for 6G Media

Source: China Mobile Com. Corporation
Meeting: TSGS4_135_India
Agenda Item: 11.1

All Metadata
Agenda item description FS_6G_MED (Study on Media aspects for 6G System)
Doc type pCR
For action Agreement
Release Rel-20
Specification 26.870
Version 0.0.1
Related WIs FS_6G_MED
For Agreement
Spec 26.870
Type pCR
Contact Jiayi Xu
Uploaded 2026-02-03T12:59:10.947000
Contact ID 89460
TDoc Status noted
Reservation date 03/02/2026 12:50:08
Agenda item sort order 60
Review Comments
manager - 2026-02-10 04:17


  1. [Technical] The proposed new use case “Embodied Video Internet (EVI)” is not clearly mapped to the existing TR 26.870 study objectives and terminology; it reads like a new umbrella concept rather than a media-centric use case, and the pCR should explicitly justify why it belongs in TR 26.870 (SA4 media study) versus remaining in the SA1/SA2 domain.

  2. [Technical] Several KPI values appear internally inconsistent or insufficiently specified for media work: e.g., “6x 1080p @ 15Hz → 20 Mbps” and “compression ratio 240:1” are asserted without stating codec, chroma format, bit depth, target quality, or whether “Hz” means fps, making the derived bitrates non-reproducible and potentially misleading for SA4 conclusions.
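For illustration, the implied compression ratio swings widely with the unstated assumptions; a back-of-envelope check (pixel formats below are assumed, none are given in the pCR):

```python
# Back-of-envelope check of the "6x 1080p @ 15 fps -> 20 Mbps" KPI under
# different (unstated) pixel-format assumptions; figures are illustrative only.
def raw_bitrate_mbps(width, height, fps, bits_per_pixel, streams=1):
    """Uncompressed aggregate bitrate in Mbps."""
    return width * height * fps * bits_per_pixel * streams / 1e6

target_mbps = 20.0  # aggregate compressed bitrate asserted in the contribution
for label, bpp in [("YUV 4:2:0, 8-bit", 12),
                   ("YUV 4:2:2, 10-bit", 20),
                   ("RGB, 8-bit", 24)]:
    raw = raw_bitrate_mbps(1920, 1080, 15, bpp, streams=6)
    print(f"{label}: raw {raw:,.0f} Mbps -> compression {raw / target_mbps:.0f}:1")
```

None of these assumption sets reproduces the asserted 240:1 (they give roughly 112:1, 187:1 and 224:1), which is exactly why the codec, chroma format, and bit depth need to be stated.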

  3. [Technical] The latency requirement “E2E RTT 100–300 ms” for “real-time” embodied control/offloading is not reconciled with the tighter control-loop needs implied elsewhere (e.g., 10 ms sensor intervals, motion control), and the text should distinguish clearly between (a) media transport latency for perception streams and (b) closed-loop control latency/reliability requirements.
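The mismatch can be made concrete (the arithmetic below uses only the figures quoted in the comment; the framing is illustrative):

```python
# A control loop fed by 10 ms sensor intervals cannot tolerate the stated
# 100-300 ms "E2E RTT"; a perception/media stream can. Illustrative only.
sensor_interval_ms = 10.0       # sensor reporting interval cited in the text
stated_rtt_ms = (100.0, 300.0)  # "E2E RTT" range as stated
for rtt in stated_rtt_ms:
    spanned = rtt / sensor_interval_ms
    print(f"RTT {rtt:.0f} ms spans {spanned:.0f} sensor intervals; "
          f"control input would be {spanned:.0f} samples stale")
```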

  4. [Technical] The contribution mixes “video” requirements with non-media payloads (LiDAR, point clouds, sensor data) but does not define the scope boundary for TR 26.870 (media codecs/protocols/QoE); without scoping, the clause risks driving requirements that are more appropriate for generic data transport or edge computing studies.

  5. [Technical] “AI codec with error-tolerant capabilities (Grace method)” is introduced as a requirement but is not defined, referenced, or aligned with ongoing 3GPP/MPEG terminology (e.g., neural codecs, feature/latent compression, ROI coding); as written it is not actionable and could conflict with existing codec evaluation frameworks.

  6. [Technical] “AI-native Video Protocol” is proposed as a key requirement without identifying what is missing in existing protocol stacks (RTP/RTCP, QUIC, DASH/CMAF, WebRTC, 5G media streaming) or what protocol functions are uniquely required (e.g., semantic prioritization, multi-stream synchronization, in-network adaptation), so it reads as a vague solution statement rather than a requirement.

  7. [Technical] Reliability targets such as “>99.99%” are stated for UAV inspection and robot sensor/LiDAR traffic without defining the reliability metric (packet success probability, frame delivery, application-level inference success, within what time bound), which is critical for translating to media-layer mechanisms.
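The gap is easy to illustrate: under an independent-loss assumption (introduced here for illustration), a 99.99% per-packet success probability does not yield 99.99% frame delivery:

```python
# Application-level frame reliability implied by a per-packet success
# probability, assuming independent losses and no FEC/retransmission.
def frame_success(p_packet: float, packets_per_frame: int) -> float:
    return p_packet ** packets_per_frame

p = 0.9999             # ">99.99%" read as per-packet success probability
for n in (1, 10, 30):  # assumed packets per video/LiDAR frame
    print(f"{n:2d} packets/frame -> frame delivery {frame_success(p, n):.4%}")
```

At 30 packets per frame, frame-level delivery drops to roughly 99.70%, so the metric (packet vs frame vs inference success, and the time bound) materially changes the requirement.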

  8. [Technical] The UAV “event security” latency “≤10 ms” for 1K/4K video at ≥5/≥25 Mbps is extremely stringent and likely infeasible end-to-end for typical video pipelines (capture/encode/packetize/decode/render), unless it refers to one-way transport only; the clause should clarify the latency definition and include processing components or explicitly exclude them.
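A rough pipeline sum (per-stage delays below are assumed for illustration, none are from the contribution) shows why “≤10 ms” can only plausibly refer to one-way transport:

```python
# Assumed per-stage delays for a 4K camera-to-display pipeline; illustrative
# figures only, not values taken from the contribution.
stages_ms = {
    "capture (one frame interval @ 30 fps)": 33.3,
    "encode": 15.0,
    "packetize + send": 2.0,
    "network one-way": 10.0,
    "receiver jitter buffer": 20.0,
    "decode + render": 10.0,
}
total = sum(stages_ms.values())
print(f"pipeline total = {total:.1f} ms vs. stated <= 10 ms")
```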

  9. [Technical] Multi-camera scenarios (6–8 cameras, mixed 1080p/4K, 15/30/60 fps) are listed, but there is no requirement discussion on synchronization (inter-camera time alignment), multi-stream correlation, or joint encoding/transport, which are central media issues for embodied perception and 3D reconstruction.
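As a toy illustration of the missing synchronization requirement (camera timestamps and the tolerance value are invented):

```python
# Group frames from multiple cameras by capture index and flag groups whose
# timestamp spread exceeds a sync tolerance. All values are invented.
tolerance_ms = 5.0
frames = {  # capture timestamps (ms) for three hypothetical cameras
    "cam0": [0.0, 33.3, 66.7],
    "cam1": [1.2, 34.0, 72.5],
    "cam2": [0.4, 35.1, 66.9],
}
for i in range(3):
    ts = [frames[cam][i] for cam in frames]
    spread = max(ts) - min(ts)
    verdict = "OK" if spread <= tolerance_ms else "OUT OF SYNC"
    print(f"frame group {i}: spread {spread:.1f} ms -> {verdict}")
```

Without a stated inter-camera alignment bound of this kind, joint 3D reconstruction requirements cannot be derived from the listed camera configurations.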

  10. [Technical] The “QoE model” section is too generic and user-centric for a machine-consumer scenario; embodied AI often optimizes task success (e.g., mAP, tracking stability, control error) rather than human QoE, so the clause should introduce task-oriented QoS/QoE (QoTask) metrics and how they relate to media impairments.

  11. [Technical] The contribution cites SA1 TR 22.870 use cases but does not ensure consistent numbering/traceability (e.g., “Use Case 6.28/6.19/6.48/6.11”) to the exact clauses/tables in TR 22.870; without precise references, the extracted KPIs risk being challenged as non-authoritative.

  12. [Editorial] Clause/table numbering appears inconsistent in the summary (“Table 2.1.3-1/2.1.3-2” under “Clause 6.1.3”), suggesting the pCR may introduce numbering conflicts or incorrect cross-references in TR 26.870; numbering should follow the target document’s clause structure.

  13. [Editorial] Terms are used inconsistently or non-standardly (“15Hz” for frame rate, “E2E RTT” vs “E2E latency”, “1K” resolution), and should be normalized to 3GPP style (fps, one-way latency vs RTT, explicit pixel dimensions).

  14. [Editorial] The text frequently shifts from requirements to solution proposals (“new protocol design”, “AI codec technology”) without using normative/requirements language appropriate for a TR study clause (e.g., “may need”, “is expected to”), which could be seen as over-prescriptive for a study item.
