# Network, QoS and UE Considerations for Client Side Inferencing AIML/IMS

## 1. Introduction

This contribution addresses network-related issues in the previously discussed call flow for client/UE side inferencing (S4aR260004a). The main concerns relate to steps 12-16 of the draft call flow, which involve model download and deployment for UE-based AI inferencing.

## 2. Network Related Issues

### 2.1 Model Size

**Problem Identification:**
- TR 26.927 indicates models are approximately 40 MB (Table 6.6.2-1)
- Publicly available models for practical use cases can be far larger (100+ GB)
- Example: the Hunyuan image generation model set is 169 GB (available on Hugging Face)
- Even simple language models (e.g., single-language translation) are approximately 100 MB

**Required Action:**
Define the supported model sizes and the required response times.

### 2.2 Network QoS Support

**Problem Identification:**
- For real-time request-response operation (a response within 500 ms or even 1000 ms), current mobile networks cannot deliver the bit-rates needed to download such models
- Example calculation: a 100 GB model with a 1000 ms response time requires ~800 Gbps (100 GB × 8 bit/byte ÷ 1 s)
- Such bit-rates are not realistic in current mobile networks
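
The example figure follows from dividing the model size in bits by the allowed transfer time. A minimal sketch of that arithmetic, assuming decimal units (1 GB = 10^9 bytes) and no protocol overhead, so the result is a lower bound; the function name `required_bitrate_gbps` is illustrative and not taken from the contribution:

```python
def required_bitrate_gbps(model_size_gb: float, response_time_ms: float) -> float:
    """Sustained bit-rate (Gbps) needed to transfer a model within a deadline.

    Assumes decimal units (1 GB = 1e9 bytes, 1 Gbps = 1e9 bit/s) and
    ignores protocol overhead, so this is a lower bound.
    """
    if response_time_ms <= 0:
        raise ValueError("response_time_ms must be positive")
    size_gbit = model_size_gb * 8            # bytes -> bits
    seconds = response_time_ms / 1000.0      # ms -> s
    return size_gbit / seconds


# Cases taken from sections 2.1 and 2.2: 100 GB and 100 MB models.
for size_gb, deadline_ms in [(100, 1000), (100, 500), (0.1, 1000)]:
    rate = required_bitrate_gbps(size_gb, deadline_ms)
    print(f"{size_gb} GB in {deadline_ms} ms -> {rate:g} Gbps")
```

Even the 100 MB model from section 2.1 needs about 0.8 Gbps sustained for a 1000 ms deadline under these assumptions.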

**Required Actions:**
- Define supported model size and transfer time requirements
- Identify appropriate QoS profile (5QI)
- If no suitable 5QI exists, request SA2 to update 5QI specifications for this use case

### 2.3 Compression and UE Support

**Problem Identification:**
- TR 26.927 describes neural network (NN) compression with compression ratios of 2-20%
- Even with compression, resulting bit-rates remain infeasible for mobile networks
- No UE capabilities for NN codec support have been defined
- UE support for such capabilities therefore cannot be assumed
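
To illustrate why compression alone does not close the gap, the sketch below scales the 100 GB, 1000 ms example from section 2.2 by the quoted compression range. It assumes the 2-20% figure means compressed size as a fraction of the original size; that reading is an assumption made here, not something the contribution states. The function name is hypothetical.

```python
def compressed_bitrate_gbps(model_size_gb: float,
                            response_time_ms: float,
                            compressed_fraction: float) -> float:
    """Bit-rate (Gbps) needed after compression.

    compressed_fraction is compressed size / original size
    (ASSUMED interpretation of the 2-20% ratio). Decimal units,
    no protocol overhead, decode time on the UE ignored.
    """
    if not 0 < compressed_fraction <= 1:
        raise ValueError("compressed_fraction must be in (0, 1]")
    compressed_gbit = model_size_gb * compressed_fraction * 8
    return compressed_gbit / (response_time_ms / 1000.0)


best = compressed_bitrate_gbps(100, 1000, 0.02)    # 2% of original size
worst = compressed_bitrate_gbps(100, 1000, 0.20)   # 20% of original size
print(f"best case: {best:g} Gbps, worst case: {worst:g} Gbps")
```

Under this reading, the best case still requires about 16 Gbps and the worst case about 160 Gbps for the 100 GB example.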

**Required Action:**
Clarify whether NNC is required for client-side inferencing and document related requirements.

### 2.4 Protocol Support Issue

**Problem Identification:**
- S4aR260004a mentions HTTP for download
- HTTP over TCP is poorly suited to large, latency-bound downloads because of:
  - TCP slow start
  - Congestion control introducing additional latency
  - Tail latency from head-of-line blocking
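
The slow-start effect can be illustrated with an idealized model. The sketch below counts the round trips needed to deliver a payload when the congestion window doubles every RTT. It assumes an initial window of 10 segments, a 1460-byte MSS, no loss, and no link-rate or receiver-window cap; these are simplifying assumptions chosen for illustration, not values from S4aR260004a, and real TCP stacks will behave differently.

```python
def slow_start_rtts(payload_bytes: int,
                    mss: int = 1460,
                    init_cwnd_segments: int = 10) -> int:
    """Round trips to deliver payload_bytes under idealized slow start.

    Idealization (ASSUMED): cwnd doubles every RTT, no loss, no ssthresh,
    unlimited link capacity. The result is a lower bound on ramp-up time.
    """
    if payload_bytes <= 0:
        return 0
    remaining = (payload_bytes + mss - 1) // mss   # segments, rounded up
    cwnd = init_cwnd_segments
    rtts = 0
    while remaining > 0:
        remaining -= cwnd     # one window's worth of segments per RTT
        cwnd *= 2             # exponential growth in slow start
        rtts += 1
    return rtts


rtt_ms = 50  # hypothetical round-trip time, for illustration only
n = slow_start_rtts(100_000_000)  # 100 MB model from section 2.1
print(f"{n} RTTs, about {n * rtt_ms} ms at {rtt_ms} ms RTT")
```

In this model a 100 MB model needs 13 round trips, so at the assumed 50 ms RTT the ramp-up alone takes about 650 ms, regardless of how much capacity the link offers.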

**Proposed Solutions:**
- Consider alternative protocols:
  - RTP with 3GPP burst QoS
  - QUIC (has bindings to the 5G XRM framework for improved QoS support)
- Leverage 3GPP XRM QoS support for bursty data transfer (HTTP/3 over QUIC, or RTP)

### 2.5 Caching and Bandwidth Wastage

**Problem Identification:**
- The current call flow implies a model download for every request
- It defines no explicit caching or model update mechanism
- This results in:
  - Substantial wasted bandwidth
  - Bit-rate requirements that current mobile networks cannot meet

**Required Action:**
Include model update and caching mechanisms in the call flow, rather than requesting a new model from the network for each request.
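
One possible shape for such a mechanism, sketched under assumptions: the UE keeps a local cache keyed by a model identifier and downloads only when the version advertised by the network differs from the cached one. The names `ModelCache`, `model_id`, `advertised_version`, and the `download` callback are all hypothetical and do not come from S4aR260004a or any 3GPP specification.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Tuple


@dataclass
class ModelCache:
    """In-memory stand-in for UE-side model storage (hypothetical)."""
    store: Dict[str, Tuple[str, bytes]] = field(default_factory=dict)
    downloads: int = 0  # number of actual network fetches

    def get(self, model_id: str, advertised_version: str,
            download: Callable[[str, str], bytes]) -> bytes:
        cached = self.store.get(model_id)
        if cached is not None and cached[0] == advertised_version:
            return cached[1]                 # cache hit: no transfer needed
        # Cache miss or stale version: fetch once and remember it.
        data = download(model_id, advertised_version)
        self.downloads += 1
        self.store[model_id] = (advertised_version, data)
        return data


def fake_download(model_id: str, version: str) -> bytes:
    """Placeholder for the real model transfer (hypothetical)."""
    return f"{model_id}:{version}".encode()


cache = ModelCache()
cache.get("translate", "v1", fake_download)   # first request: downloads
cache.get("translate", "v1", fake_download)   # same version: served locally
cache.get("translate", "v2", fake_download)   # model update: downloads again
print(cache.downloads)
```

With a scheme of this kind, the large transfer is paid once per model version rather than once per inference request, which relaxes the per-request latency budget for the download.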

## 3. Suggested Way Forward

The contribution emphasizes that the intention is not to exclude UE inferencing (as agreed for the work item), but to clarify limitations and requirements before agreeing to a CR detailing such call flows.

### Proposed Actions:

1. **Scope Limitation:** Add a note that client-side inferencing is only feasible for simple cases:
   - Explicitly exclude complex VLMs/LLMs
   - Define maximum model size limits
   - Specify applicable use cases for smaller models

2. **Latency Requirements:** Clarify the end-to-end latency requirements and derive the required bit-rate, latency, and loss profiles

3. **Protocol Clarification:** Clarify which protocol should be used (typically not HTTP over TCP) to support the use case within the required latency

4. **SA2 Coordination:** Ask SA2:
   - How such bursts can be supported
   - Whether a new QoS profile is needed or existing profiles suffice

5. **Codec Support:** Clarify required neural network codec support (if any) for the UE

6. **Caching Mechanism:** Add caching and model update mechanisms to the call flow to avoid downloading the model for each task
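
For actions 1 and 2, a maximum model size can be derived once a sustained bit-rate and a transfer-time budget are fixed. A minimal sketch of that inverse calculation; the bit-rates used below are placeholder assumptions for illustration, not measured or specified figures, and the function name is hypothetical:

```python
def max_model_size_mb(bitrate_mbps: float, budget_ms: float) -> float:
    """Largest model (MB) transferable within budget_ms at bitrate_mbps.

    Assumes decimal units (1 MB = 1e6 bytes, 1 Mbps = 1e6 bit/s), a
    constant sustained rate, and no protocol overhead (upper bound).
    """
    if bitrate_mbps < 0 or budget_ms < 0:
        raise ValueError("inputs must be non-negative")
    megabits = bitrate_mbps * (budget_ms / 1000.0)
    return megabits / 8                      # bits -> bytes


# Placeholder operating points (ASSUMED, for illustration only).
for rate_mbps, budget_ms in [(100, 1000), (1000, 500)]:
    size = max_model_size_mb(rate_mbps, budget_ms)
    print(f"{rate_mbps} Mbps for {budget_ms} ms -> at most {size:g} MB")
```

A table of such limits for agreed operating points could serve as the basis for the maximum model size note proposed in action 1.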