S4-260097 - AI Summary

Embodied AI use case and related requirements


Comprehensive Summary of S4-260097: Embodied AI Use Case and Requirements

1. Introduction and Background

This document builds upon previous work from SA4#134 (S4-251826) and TR 22.870 clause 6.28, which established the importance of embodied AI for the FS_6G_MED study. The paper represents a paradigm shift from static observation sensors (fixed cameras with limited fields of view) to mobile embodied sensors (robots, UAVs) that actively interact with and explore physical environments. This shift is aligned with recent industry developments including NVIDIA's Isaac GR00T project and ITU-T SG21 workshop discussions.

The core use case involves devices equipped with multiple cameras capturing and uploading multi-modal concurrent data streams (video, point clouds) for network-based AI inference supporting tasks like multi-modal perception, 3D digital twin modeling, trajectory planning, and task orchestration across educational, home, industrial, and hazardous environments.

2. Example Embodied AI Tasks

The document provides detailed descriptions of four state-of-the-art embodied AI tasks based on current research:

2.1 Task Example 1: Explore and Explain

An autonomous agent explores previously unknown environments while providing natural language descriptions at key moments. The approach uses:
- Curiosity-based exploration using forward/inverse dynamics models with neural network embeddings
- Surprisal value (L2 norm between predicted and actual embeddings) as reward function
- Speaker policy triggered by depth or curiosity thresholds
- Transformer-based captioning model with self-attention

Evaluation metrics: Average surprisal score, coverage measure (intersection with ground-truth semantic classes), diversity score for consecutive captions.
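To make the surprisal-based trigger concrete, here is a minimal sketch (not from the source document) of the reward and speaker-policy threshold described above; the function names and the threshold value are illustrative assumptions.

```python
import numpy as np

def surprisal(predicted: np.ndarray, actual: np.ndarray) -> float:
    """Surprisal value: L2 norm between the predicted and actual embeddings."""
    return float(np.linalg.norm(predicted - actual))

def should_caption(predicted: np.ndarray, actual: np.ndarray,
                   threshold: float = 1.0) -> bool:
    """Speaker policy fires a caption when surprisal exceeds a threshold.
    The threshold value here is an arbitrary placeholder."""
    return surprisal(predicted, actual) > threshold
```

The same surprisal value doubles as the intrinsic reward for the exploration policy, so a single forward pass of the dynamics model serves both purposes.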

2.2 Task Example 2: Spot the Difference

The agent identifies differences between an outdated map and the current environment state, combining exploration with spatial reasoning.

Evaluation metrics:
- Percentage of navigable area seen (Seen%)
- Detection accuracy (Acc%)
- Intersection over Union (IoU) for changed elements
- Separate IoU+ (added objects) and IoU- (removed objects)
- mAcc and mIoU (computed only on visited space)
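A minimal sketch of the IoU+/IoU- split over visited space, assuming boolean occupancy masks on a grid map (function and variable names are illustrative, not from the source):

```python
import numpy as np

def iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """IoU between boolean masks of predicted and ground-truth change cells."""
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # both masks empty: trivially perfect agreement
    return float(np.logical_and(pred, gt).sum() / union)

def change_ious(pred_added, gt_added, pred_removed, gt_removed, visited):
    """Separate IoU+ (added objects) and IoU- (removed objects),
    restricted to cells the agent actually visited (mIoU-style)."""
    iou_plus = iou(pred_added & visited, gt_added & visited)
    iou_minus = iou(pred_removed & visited, gt_removed & visited)
    return iou_plus, iou_minus
```

Restricting the masks to visited cells is what distinguishes mAcc/mIoU from the plain variants: the agent is only scored on space it had a chance to observe.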

2.3 Task Example 3: Indoor Exploration

A fundamental task for acquiring spatial information, using deep reinforcement learning with intrinsic rewards (curiosity, novelty, coverage). The architecture comprises:
- CNN-based mapper
- Pose estimator
- Hierarchical navigation policy

Evaluation metrics: IoU between reconstructed and ground-truth maps, map accuracy (m²), area seen (AS), free/occupied space metrics (FIoU, OIoU, FAS, OAS), mean positioning error.
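A minimal sketch of the free/occupied-space metrics on a grid map. The 5 cm cell resolution is an assumption for illustration; the source does not specify one.

```python
import numpy as np

CELL_AREA_M2 = 0.05 * 0.05  # assumed 5 cm grid resolution

def map_metrics(pred_free, pred_occ, gt_free, gt_occ):
    """FIoU/OIoU against the ground-truth map, plus area seen (FAS/OAS) in m^2."""
    def iou(a, b):
        union = np.logical_or(a, b).sum()
        return float(np.logical_and(a, b).sum() / union) if union else 1.0
    fiou = iou(pred_free, gt_free)   # free-space IoU
    oiou = iou(pred_occ, gt_occ)     # occupied-space IoU
    fas = float(pred_free.sum() * CELL_AREA_M2)  # free area seen
    oas = float(pred_occ.sum() * CELL_AREA_M2)   # occupied area seen
    return fiou, oiou, fas, oas
```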

2.4 Task Example 4: Vision and Language Navigation (VLN)

Agent navigates to target destination guided only by natural language instructions, using:
- 360° panorama encoding in 12×3 grid with 2048-dimensional feature maps
- Attention mechanisms for instruction interpretation
- Low-level actions (rotate, tilt, step ahead)

Evaluation metrics: Navigation error (NE), oracle success rate (OSR), success rate (SR), success rate weighted by path length (SPL).
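The SPL metric can be sketched as follows, using the standard definition (success weighted by the ratio of shortest-path length to the greater of the taken and shortest paths); this is a general formulation, not code from the source:

```python
def spl(successes, shortest_lengths, path_lengths):
    """Success rate weighted by Path Length.

    successes: 1 if the episode reached the target, else 0.
    shortest_lengths: geodesic shortest-path length per episode.
    path_lengths: length of the path the agent actually took.
    """
    total = 0.0
    for s, l, p in zip(successes, shortest_lengths, path_lengths):
        total += s * l / max(p, l)  # max() guards against p < l degeneracies
    return total / len(successes)
```

An agent that succeeds but wanders twice the shortest path scores 0.5 on that episode, which is why SPL is preferred over plain SR for comparing navigation efficiency.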

3. Key Observations from Task Analysis

Observation 1: AI processing may occur at a cloud/server, requiring transmission of either raw visual data (with standard compression) or pre-processed data (embeddings).

Observation 2: Cloud-based implementation requires low latency connectivity and error resilience for real-time navigation and environmental interaction.

Observation 3: Evaluation methods are highly task-dependent with different metrics for different tasks.

4. Deployment Scenarios and Rationale for Cloud Offloading

The document identifies specific scenarios where cloud-based AI processing is preferable:

  • Hazardous environments: Keep robots simple/light to reduce vulnerability
  • Industrial environments: Centralize AI processing for multiple robots using similar processing
  • Home settings: Centralize processing at cloud or local gateway for multiple coupled devices
  • Educational settings: Centralized AI models serving many students/learners

5. Transmission Formats and Network Requirements

5.1 Current Requirements from TR 22.870

For 6-8 cameras using 3GPP codecs (e.g., HEVC):
- Peak data rates: 20-100 Mbit/s
- Direction: Uplink
- Characteristics: Bursty, ultra-low latency

Observation 4: Offloaded embodied AI may demand uplink bit-rates of 20-100 Mbit/s.
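A back-of-envelope check of the TR 22.870 figure. The per-camera resolution, frame rate, and compressed bits-per-pixel below are assumptions for illustration only; the source gives just the aggregate 20-100 Mbit/s range for 6-8 cameras.

```python
def hevc_rate_mbps(width: int, height: int, fps: float,
                   bits_per_pixel: float) -> float:
    """Rough compressed bitrate estimate in Mbit/s from pixel rate and
    an assumed post-HEVC bits-per-pixel figure."""
    return width * height * fps * bits_per_pixel / 1e6

# Assumed: 1080p at 30 fps, ~0.05 bit/pixel after HEVC compression
per_cam = hevc_rate_mbps(1920, 1080, 30, 0.05)  # ~3.1 Mbit/s per camera
total_6 = 6 * per_cam                           # ~18.7 Mbit/s aggregate
```

Six such cameras land near the 20 Mbit/s low end of the stated range; higher resolutions, frame rates, or lower compression efficiency push the aggregate toward 100 Mbit/s.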

5.2 Alternative Transmission Options

The document presents three transmission format categories:

| Transmission Format | UE Requirements | Network Requirements |
|---------------------|-----------------|----------------------|
| 3GPP codec (HEVC) | Support HEVC encoding and transmission | ~20-100 Mbit/s peak, bursty, uplink, ultra-low latency |
| Standardized Feature map/codec (MPEG VCM/FCM, JPEG AI) | Support standard-based feature/image codec | Unknown peak bit-rate, bursty, ultra-low latency uplink |
| Proprietary/open source (embeddings, tokenizers) | Compute representation in software and transmit | Unknown, bursty, ultra-low latency uplink, efficient transmission support needed |
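To see why the peak bit-rate for the feature-map options is "unknown", consider the raw payload of the VLN panorama encoding from task example 4 (12x3 grid of 2048-dimensional features) if sent uncompressed; the numeric precision choices below are assumptions, since no feature codec is fixed.

```python
def feature_payload_bytes(grid_w: int = 12, grid_h: int = 3,
                          dim: int = 2048, bytes_per_value: int = 4) -> int:
    """Raw (pre-codec) payload of one panorama's feature grid."""
    return grid_w * grid_h * dim * bytes_per_value

fp32 = feature_payload_bytes()                    # 294,912 bytes per frame
fp16 = feature_payload_bytes(bytes_per_value=2)   # half that, before any codec
```

Until a feature codec (MPEG VCM/FCM, JPEG AI, or a proprietary tokenizer) and its compression ratio are fixed, the resulting uplink rate cannot be pinned down, which motivates Observation 5.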

Observation 5: More investigation of proprietary and standardized feature map codecs is needed to support this use case.

6. Proposed Text for TR (CHANGE 1)

The document proposes new clause 4.2.2.X "Requirements for embodied AI" incorporating:
- Summary of example tasks (explore and explain, spot the difference, indoor exploration, vision and language navigation)
- All five observations regarding cloud processing scenarios, latency requirements, task-dependent evaluation, uplink bit-rate demands, and codec investigation needs
- Complete transmission format comparison table
- Rationale for cloud offloading in different scenarios

7. Proposals

  1. Take embodied AI requirements into account in FS_6G_MED, particularly real-time AI inference requirements
  2. Document the simplified embodied AI use case and related requirements based on the proposed text in clause 8

Technical Contributions Summary

This document makes significant contributions by:
- Providing concrete, research-backed examples of embodied AI tasks with detailed technical descriptions
- Establishing task-specific evaluation methodologies and metrics
- Identifying network requirements for cloud-offloaded embodied AI (20-100 Mbit uplink, ultra-low latency, error resilience)
- Analyzing alternative transmission formats beyond traditional video codecs (feature maps, embeddings)
- Justifying cloud-based processing for specific deployment scenarios
- Proposing specific text additions to the TR for FS_6G_MED study

Document Information
Source:
Huawei Tech. (UK) Co., Ltd
Type:
pCR
For:
Agreement
Agenda item: 11.1
Agenda item description: FS_6G_MED (Study on Media aspects for 6G System)
Release: Rel-20
Specification: 26.87
Version: 0.0.1
Related WIs: FS_6G_MED
Contact: Rufail Mekuria
Uploaded: 2026-02-03T08:50:06.013000
Contact ID: 104180
TDoc Status: noted
Reservation date: 02/02/2026 13:48:07
Agenda item sort order: 60