# Comprehensive Summary of S4-260097: Embodied AI Use Case and Requirements

## 1. Introduction and Background

This document builds upon previous work from SA4#134 (S4-251826) and TR 22.870 clause 6.28, which established the importance of embodied AI for the FS_6G_MED study. The paper represents a paradigm shift from **static observation sensors** (fixed cameras with limited fields of view) to **mobile embodied sensors** (robots, UAVs) that actively interact with and explore physical environments. This shift is aligned with recent industry developments including NVIDIA's Isaac GR00T project and ITU-T SG21 workshop discussions.

The core use case involves devices equipped with multiple cameras that capture and upload concurrent multi-modal data streams (video, point clouds) for network-based AI inference, supporting tasks such as multi-modal perception, 3D digital twin modeling, trajectory planning, and task orchestration across educational, home, industrial, and hazardous environments.

## 2. Example Embodied AI Tasks

The document provides detailed descriptions of four state-of-the-art embodied AI tasks based on current research:

### 2.1 Task Example 1: Explore and Explain

An autonomous agent explores previously unknown environments while providing natural language descriptions at key moments. The approach uses:
- **Curiosity-based exploration** using forward/inverse dynamics models with neural network embeddings
- **Surprisal value** (L2 norm between predicted and actual embeddings) as reward function
- **Speaker policy** triggered by depth or curiosity thresholds
- **Transformer-based captioning** model with self-attention

**Evaluation metrics**: Average surprisal score, coverage measure (intersection with ground-truth semantic classes), diversity score for consecutive captions.
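The curiosity signal above can be sketched in a few lines. This is a minimal illustration, assuming the forward model's predicted embedding and the encoder's actual embedding are already available as vectors; the threshold value and function names are hypothetical.

```python
import numpy as np

def surprisal(predicted_embedding: np.ndarray, actual_embedding: np.ndarray) -> float:
    """Curiosity reward: L2 norm of the difference between the forward
    model's predicted next-state embedding and the observed embedding."""
    return float(np.linalg.norm(predicted_embedding - actual_embedding))

def should_speak(predicted: np.ndarray, actual: np.ndarray, threshold: float = 1.0) -> bool:
    """A speaker policy of the kind described could trigger a caption
    when the surprisal (curiosity) value exceeds a threshold."""
    return surprisal(predicted, actual) > threshold
```

A high surprisal value means the environment deviated from the agent's prediction, which is exactly when a natural-language description is most informative.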

### 2.2 Task Example 2: Spot the Difference

Agent identifies differences between an outdated map and current environment state, combining exploration with spatial reasoning.

**Evaluation metrics**: 
- Percentage of navigable area seen (Seen%)
- Detection accuracy (Acc%)
- Intersection over Union (IoU) for changed elements
- Separate IoU+ (added objects) and IoU- (removed objects)
- mAcc and mIoU (computed only on visited space)
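The IoU-based change-detection metrics can be sketched on binary grid masks. This is an illustrative computation only, assuming predicted and ground-truth change maps are boolean arrays over the same grid; the helper names are hypothetical.

```python
import numpy as np

def iou(pred_mask: np.ndarray, gt_mask: np.ndarray) -> float:
    """Intersection over Union between two binary masks of changed cells."""
    inter = np.logical_and(pred_mask, gt_mask).sum()
    union = np.logical_or(pred_mask, gt_mask).sum()
    return float(inter / union) if union else 1.0

def change_ious(pred_added, gt_added, pred_removed, gt_removed):
    """IoU+ scores added objects, IoU- scores removed objects, as
    separate masks rather than one combined change map."""
    return iou(pred_added, gt_added), iou(pred_removed, gt_removed)
```

Restricting the masks to visited cells before calling these functions would yield the mAcc/mIoU variants computed only on explored space.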

### 2.3 Task Example 3: Indoor Exploration

Fundamental task for acquiring spatial information using deep reinforcement learning with intrinsic rewards (curiosity, novelty, coverage). Architecture comprises:
- CNN-based mapper
- Pose estimator
- Hierarchical navigation policy

**Evaluation metrics**: IoU between reconstructed and ground-truth maps, map accuracy (m²), area seen (AS), free/occupied space metrics (FIoU, OIoU, FAS, OAS), mean positioning error.
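The free/occupied-space metrics can be illustrated on an occupancy grid. This is a sketch under stated assumptions: maps are integer grids with free/occupied/unknown labels, and the 5 cm cell size (0.0025 m² per cell) is a hypothetical resolution, not a value from the source.

```python
import numpy as np

FREE, OCC, UNK = 0, 1, 2  # assumed cell labels

def map_metrics(recon: np.ndarray, gt: np.ndarray, cell_area_m2: float = 0.0025):
    """Return (FIoU, OIoU, area seen in m^2) for a reconstructed map
    against ground truth: per-class IoU for free and occupied space,
    plus the explored area (cells not marked unknown)."""
    def class_iou(c):
        inter = np.logical_and(recon == c, gt == c).sum()
        union = np.logical_or(recon == c, gt == c).sum()
        return float(inter / union) if union else 1.0
    area_seen = float((recon != UNK).sum()) * cell_area_m2
    return class_iou(FREE), class_iou(OCC), area_seen
```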

### 2.4 Task Example 4: Vision and Language Navigation (VLN)

Agent navigates to target destination guided only by natural language instructions, using:
- 360° panorama encoding in 12×3 grid with 2048-dimensional feature maps
- Attention mechanisms for instruction interpretation
- Low-level actions (rotate, tilt, step ahead)

**Evaluation metrics**: Navigation error (NE), oracle success rate (OSR), success rate (SR), success rate weighted by path length (SPL).
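Of these, SPL is the least self-explanatory; a minimal sketch of the standard formulation (mean over episodes of success times shortest-path length over the longer of the shortest and actual path lengths) is shown below. The function name and argument layout are illustrative.

```python
def spl(successes, shortest_lengths, actual_lengths):
    """Success weighted by Path Length: mean over episodes of
    S_i * l_i / max(p_i, l_i), where S_i is 0/1 success, l_i the
    shortest-path length, and p_i the path length actually taken."""
    total = 0.0
    for s, l, p in zip(successes, shortest_lengths, actual_lengths):
        total += s * l / max(p, l)
    return total / len(successes)
```

An agent that succeeds but takes twice the shortest path scores 0.5 for that episode; a failed episode scores 0 regardless of path length.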

## 3. Key Observations from Task Analysis

**Observation 1**: AI processing may occur at cloud/server, requiring transmission of either raw visual data (with standard compression) or pre-processed data (embeddings).

**Observation 2**: Cloud-based implementation requires low latency connectivity and error resilience for real-time navigation and environmental interaction.

**Observation 3**: Evaluation methods are highly task-dependent with different metrics for different tasks.

## 4. Deployment Scenarios and Rationale for Cloud Offloading

The document identifies specific scenarios where cloud-based AI processing is preferable:

- **Hazardous environments**: Keep robots simple/light to reduce vulnerability
- **Industrial environments**: Centralize AI processing for multiple robots that share similar processing pipelines
- **Home settings**: Centralize processing at cloud or local gateway for multiple coupled devices
- **Educational settings**: Centralized AI models serving many students/learners

## 5. Transmission Formats and Network Requirements

### 5.1 Current Requirements from TR 22.870

For 6-8 cameras using 3GPP codecs (e.g., HEVC):
- **Peak data rates**: 20-100 Mbit/s
- **Direction**: Uplink
- **Characteristics**: Bursty, ultra-low latency

**Observation 4**: Offloaded embodied AI may demand uplink bit-rates of 20-100 Mbit/s.
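A back-of-envelope check shows how 6-8 concurrent HEVC camera streams span this range. The per-stream rates below are assumptions for illustration, not figures from TR 22.870.

```python
def aggregate_uplink_mbps(num_cameras: int, per_stream_mbps: float) -> float:
    """Aggregate uplink demand when each camera stream is HEVC-encoded
    and uploaded concurrently (per-stream rate is an assumed value)."""
    return num_cameras * per_stream_mbps

# e.g. 6 cameras at ~3.5 Mbit/s each -> ~21 Mbit/s (low end of the range);
#      8 cameras at ~12.5 Mbit/s each -> 100 Mbit/s (high end).
```

The bursty character follows from frame-synchronized capture: all streams produce their largest packets (e.g., intra-coded frames) at roughly the same instants.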

### 5.2 Alternative Transmission Options

The document presents three transmission format categories:

| Transmission Format | UE Requirements | Network Requirements |
|---------------------|-----------------|----------------------|
| **3GPP codec (HEVC)** | Support HEVC encoding and transmission | ~20-100 Mbit/s peak, bursty, uplink, ultra-low latency |
| **Standardized Feature map/codec** (MPEG VCM/FCM, JPEG AI) | Support standard-based feature/image codec | Unknown peak bit-rate, bursty, ultra-low latency uplink |
| **Proprietary/open source** (embeddings, tokenizers) | Compute representation in software and transmit | Unknown, bursty, ultra-low latency uplink, efficient transmission support needed |

**Observation 5**: More investigation of proprietary and standardized feature map codecs is needed to support this use case.

## 6. Proposed Text for TR (CHANGE 1)

The document proposes new clause 4.2.2.X "Requirements for embodied AI" incorporating:
- Summary of example tasks (explore and explain, spot the difference, indoor exploration, vision and language navigation)
- All five observations regarding cloud processing scenarios, latency requirements, task-dependent evaluation, uplink bit-rate demands, and codec investigation needs
- Complete transmission format comparison table
- Rationale for cloud offloading in different scenarios

## 7. Proposals

1. **Take embodied AI requirements into account in FS_6G_MED**, particularly real-time AI inference requirements
2. **Document the simplified embodied AI use case and related requirements** based on the proposed text in clause 8

## Technical Contributions Summary

This document makes significant contributions by:
- Providing concrete, research-backed examples of embodied AI tasks with detailed technical descriptions
- Establishing task-specific evaluation methodologies and metrics
- Identifying network requirements for cloud-offloaded embodied AI (20-100 Mbit/s uplink, ultra-low latency, error resilience)
- Analyzing alternative transmission formats beyond traditional video codecs (feature maps, embeddings)
- Justifying cloud-based processing for specific deployment scenarios
- Proposing specific text additions to the TR for FS_6G_MED study