Network, QoS and UE Considerations for Client-Side Inferencing (AI/ML for IMS)
This contribution addresses network-related issues in the previously discussed call flow for client/UE-side inferencing (S4aR260004a). The main concerns relate to steps 12-16 of the draft call flow, which cover model download and deployment for UE-based AI inferencing.
Problem Identification: Model Size
- TR 26.927 indicates models are approximately 40 MB (Table 6.6.2-1)
- Current publicly available models for practical use cases are significantly larger (100+ GB)
- Example: Hunyuan Image generation model set is 169 GB (available on Hugging Face)
- Even simple language models (e.g., single-language translation) are approximately 100 MB
Required Action:
Details on supported model sizes and required response times need to be defined.
Problem Identification: Bit-Rate and QoS
- For real-time request-response operation (500 ms or even 1000 ms), current mobile networks cannot support the required bit-rates
- Example calculation: 100 GB model with 1000 ms response time requires ~800 Gbps
- Such bit-rates are not realistic in current mobile networks
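The calculation above can be sketched as a small helper (an illustrative sketch; the function name and the 40 MB / 500 ms data point from TR 26.927 are chosen here for illustration):

```python
# Sustained bit-rate needed to transfer a model of a given size
# within a target response time (decimal units: 1 GB = 8e9 bits).
def required_bitrate_gbps(model_size_gb: float, transfer_time_ms: float) -> float:
    bits = model_size_gb * 8e9          # GB -> bits
    seconds = transfer_time_ms / 1000.0
    return bits / seconds / 1e9         # bits/s -> Gbps

# The example from the contribution: 100 GB in 1000 ms
print(required_bitrate_gbps(100, 1000))   # 800.0 Gbps
# The ~40 MB model size of TR 26.927 in 500 ms
print(required_bitrate_gbps(0.04, 500))   # ~0.64 Gbps, which is feasible
```

This makes the gap explicit: the TR's assumed model size is deliverable over existing radio bearers, while current publicly available model sizes are three to four orders of magnitude beyond them.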
Required Actions:
- Define supported model size and transfer time requirements
- Identify appropriate QoS profile (5QI)
- If no suitable 5QI exists, request SA2 to update 5QI specifications for this use case
Problem Identification: Neural Network Compression
- TR 26.927 details NN compression with 2-20% compression ratios
- Even with compression, resulting bit-rates remain infeasible for mobile networks
- No UE capabilities for NN codec support have been defined
- Cannot assume UE support for such capabilities
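The infeasibility claim can be checked numerically. A brief sketch (the 100 GB model size and 1000 ms transfer time are carried over from the earlier example; the 2-20% ratios are those cited from TR 26.927):

```python
# Apply the 2-20% compression ratios to a 100 GB model and compute the
# bit-rate still needed for a 1000 ms transfer.
MODEL_SIZE_GB = 100.0
TRANSFER_TIME_S = 1.0
for ratio in (0.02, 0.20):              # compressed size = ratio * original
    compressed_gb = MODEL_SIZE_GB * ratio
    gbps = compressed_gb * 8 / TRANSFER_TIME_S
    print(f"{ratio:.0%} -> {compressed_gb:.0f} GB, {gbps:.0f} Gbps")
# 2%  ->  2 GB ->  16 Gbps
# 20% -> 20 GB -> 160 Gbps
```

Even at the best-case 2% ratio, the required 16 Gbps remains far beyond typical mobile link rates, which supports the point that compression alone does not solve the problem.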
Required Action:
Clarify whether Neural Network Coding (NNC) is required for client-side inferencing and document the related requirements.
Problem Identification: Transport Protocol
- S4aR260004a mentions HTTP for download
- HTTP/TCP is suboptimal for large, time-constrained downloads due to:
  - TCP slow start
  - Congestion control introducing additional latency
  - Tail latency from head-of-line blocking
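The slow-start cost can be estimated with a simplified model (an illustrative sketch only: it assumes an initial congestion window of 10 segments, a 1460-byte MSS, window doubling every RTT, no loss, and ignores ssthresh):

```python
# Count the RTTs during which the congestion window, rather than the
# link rate, caps a transfer of `total_bytes`.
def slow_start_rtts(total_bytes: int, init_cwnd_segments: int = 10,
                    mss: int = 1460) -> int:
    cwnd = init_cwnd_segments * mss     # bytes sendable in the first RTT
    sent, rtts = 0, 0
    while sent < total_bytes:
        sent += cwnd
        cwnd *= 2                       # exponential growth per RTT
        rtts += 1
    return rtts

print(slow_start_rtts(100_000_000))     # 13 RTTs for a 100 MB model
```

Under these assumptions, even a modest 100 MB model spends 13 round trips ramping up; at a 50 ms RTT that is roughly 650 ms of latency before the link is fully utilized, independent of the available bit-rate.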
Proposed Solutions:
- Consider alternative protocols:
  - RTP with 3GPP burst QoS
  - QUIC (which has bindings to the 5G XRM framework for improved QoS support)
- Leverage 3GPP XRM QoS support for bursty data transfer (HTTP/3 over QUIC, or RTP)
Problem Identification: Caching and Model Updates
- The current call flow indicates a model download for every request
- No explicit caching or model update mechanism is defined
- This results in:
  - Huge bandwidth wastage
  - Network bit-rate requirements that are impossible to meet in current mobile networks
Required Action:
Include model update and caching mechanisms in the call flow rather than requesting a new model from the network each time.
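The intended behavior can be sketched as a version check on the UE side (a hypothetical sketch: the class names, the model ID "asr-en" and the version fields are assumptions for illustration, not part of the draft call flow):

```python
# Hypothetical UE-side model cache: download only when the model is
# absent or the network advertises a different version.
from dataclasses import dataclass

@dataclass
class CachedModel:
    model_id: str
    version: str
    path: str                            # location of the model blob on the UE

class ModelCache:
    def __init__(self) -> None:
        self._models: dict[str, CachedModel] = {}

    def needs_download(self, model_id: str, advertised_version: str) -> bool:
        cached = self._models.get(model_id)
        return cached is None or cached.version != advertised_version

    def store(self, model: CachedModel) -> None:
        self._models[model.model_id] = model

cache = ModelCache()
print(cache.needs_download("asr-en", "v2"))   # True: nothing cached yet
cache.store(CachedModel("asr-en", "v2", "/models/asr-en.bin"))
print(cache.needs_download("asr-en", "v2"))   # False: cache hit, no download
print(cache.needs_download("asr-en", "v3"))   # True: model update needed
```

Only the version-mismatch case triggers a transfer, so the impossible per-request bit-rate requirement collapses to an occasional update download.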
The contribution emphasizes that the intention is not to exclude UE inferencing (as agreed for the work item), but to clarify limitations and requirements before agreeing to a CR detailing such call flows.
Proposed Way Forward:
- Model Size: Specify applicable use cases for smaller models
- Latency Requirements: Clarify end-to-end latency requirements and derive the required bit-rate, latency and loss profiles
- Protocol Clarification: Clarify correct protocol usage (typically not HTTP/TCP) to support the use case with the required latency
- SA2 Coordination: Ask SA2 whether a new QoS profile is needed or whether existing profiles suffice
- Codec Support: Clarify required neural network codec support (if any) for the UE
- Caching Mechanism: Add caching and model update mechanisms to the call flow to avoid downloading the model for each task