[AIML_IMS-MED] Call flow for split inferencing
This change request proposes updates to the AIML call flow for split inferencing in IMS-based media services. It revises the previously agreed device-inferencing call flow (S4aR260014) to accommodate split-inferencing scenarios in which AI model execution is partitioned between the UE and a network-based DCAS (Data Channel Application Server).
Key Addition:
- The UE now indicates split inferencing availability in the application request message sent to the MF (Media Function) when requesting the application list via the Bootstrap Data Channel (BDC)
- This allows the network to understand the UE's capability to participate in distributed AI inference
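As a sketch of the new indication, the UE's application-list request over the BDC could carry a split-inferencing flag like the following. This is a hypothetical JSON shape; the CR does not define a concrete message syntax, and all field names here are assumptions.

```python
import json

def build_application_list_request(split_inferencing: bool) -> str:
    """Hypothetical application-list request sent from the UE to the MF
    over the Bootstrap Data Channel (BDC)."""
    request = {
        "type": "application-list-request",
        # New indication proposed by the CR: the UE advertises whether it
        # can participate in split inferencing with the network-side DCAS.
        "split_inferencing_supported": split_inferencing,
    }
    return json.dumps(request)

print(build_application_list_request(True))
```

The MF can then use this flag to decide whether to include split-inference-capable applications in the returned list.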
Application Metadata Enhancements:
- Application-related metadata now includes:
  - Generic app information (description, app ID, URL)
  - AI-specific information, including AI feature tags indicating AI requirements
  - AI task-related descriptions for user-informed selection
Task Metadata:
- AI task metadata is delivered with the application, potentially expressed as a task manifest
- Task list presented to users includes annotations from AI task metadata
- Execution endpoints supported by each task and subtask are now exposed to enable split inference decisions
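The task manifest could be represented as in the sketch below, which ties together the task annotations shown to the user and the execution endpoints exposed per subtask. The dataclass and field names are illustrative assumptions; the CR only enumerates the kinds of information carried.

```python
from dataclasses import dataclass, field

@dataclass
class SubtaskInfo:
    subtask_id: str
    execution_endpoints: list  # e.g. ["UE"], ["network"], or ["UE", "network"]

@dataclass
class AITaskInfo:
    task_id: str
    description: str       # annotation shown to the user in the task list
    ai_feature_tags: list  # AI requirements, e.g. ["segmentation"]
    subtasks: list = field(default_factory=list)

# Hypothetical manifest for one application
manifest = [
    AITaskInfo(
        task_id="task-1",
        description="Background segmentation for video call",
        ai_feature_tags=["segmentation"],
        subtasks=[
            SubtaskInfo("task-1/backbone", ["UE", "network"]),
            SubtaskInfo("task-1/head", ["UE"]),
        ],
    )
]

# A subtask exposing both endpoints is a candidate for split inference.
splittable = [s.subtask_id for t in manifest for s in t.subtasks
              if {"UE", "network"} <= set(s.execution_endpoints)]
print(splittable)  # ['task-1/backbone']
```

Exposing the endpoints per subtask is what lets the UE (and the user) reason about where each piece of the task can run before any model is downloaded.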
Partitioning List Introduction:
The CR introduces a comprehensive partitioning framework:
Request Phase (Step 10):
- UE requests both a model list and a partitioning list from DCAS
- UE provides its capability metadata to enable appropriate partitioning options
Partitioning Metadata Definition:
The partitioning list (i.e., the submodel partitioning metadata) specifies:
- Submodel identifiers - unique identification of model partitions
- Execution endpoints - where each submodel executes (UE vs. network)
- Input/output tensor characteristics - data interfaces between submodels
- Operational characteristics - performance and resource requirements
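The four information elements above could be encoded per submodel as sketched below. Field names, shapes, and the operational-characteristics keys are assumptions for illustration; note how the output tensors of one submodel define the data interface to the next.

```python
from dataclasses import dataclass

@dataclass
class TensorSpec:
    name: str
    shape: tuple
    dtype: str

@dataclass
class SubmodelEntry:
    submodel_id: str         # unique identification of the model partition
    execution_endpoint: str  # "UE" or "network"
    inputs: list             # input tensor characteristics
    outputs: list            # output tensor characteristics
    operational: dict        # performance and resource requirements

# Hypothetical two-way partition of a single model
partition = [
    SubmodelEntry(
        submodel_id="model-a/part0",
        execution_endpoint="UE",
        inputs=[TensorSpec("image", (1, 3, 224, 224), "float32")],
        outputs=[TensorSpec("features", (1, 1024, 14, 14), "float32")],
        operational={"peak_memory_mb": 120, "est_tops": 0.9},
    ),
    SubmodelEntry(
        submodel_id="model-a/part1",
        execution_endpoint="network",
        inputs=[TensorSpec("features", (1, 1024, 14, 14), "float32")],
        outputs=[TensorSpec("logits", (1, 1000), "float32")],
        operational={"peak_memory_mb": 300, "est_tops": 3.2},
    ),
]

# Consistency check: each submodel's output interface must match the
# input interface of the submodel that consumes it across the split.
assert partition[0].outputs[0].shape == partition[1].inputs[0].shape
```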
Download Phase (Step 12):
- UE downloads both the model list and partitioning list corresponding to its capabilities
Selection Criteria (Step 13):
- User is presented with lists of both models and partitions supported by the UE
- User selects desired AI model(s) and partition
- Partition selection may be based on:
  - Load distribution preferences
  - Battery impact considerations
  - Other task execution preferences
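A partition-selection step driven by those preferences could look like the following heuristic. The preference names and the scoring by on-device compute are assumptions, not part of the CR, which leaves the selection policy to the UE/user.

```python
def select_partition(partitions, preference="battery"):
    """Pick a partition from those the UE supports, per user preference."""
    if preference == "battery":
        # Minimise estimated on-device compute to reduce battery impact.
        key = lambda p: p["ue_compute_tops"]
    else:
        # "load": keep more work on the device to offload the network.
        key = lambda p: -p["ue_compute_tops"]
    return min(partitions, key=key)

# Hypothetical partitions offered for one model
partitions = [
    {"partition_id": "p-early-split", "ue_compute_tops": 0.5},
    {"partition_id": "p-late-split", "ue_compute_tops": 2.1},
]
print(select_partition(partitions, "battery")["partition_id"])
```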
Configuration Phase (Step 14):
- UE configures split inference with DCAS by selecting:
  - A specific model
  - A specific partition
- From these selections, the corresponding submodel(s) to be executed are derived
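The derivation in Step 14 can be sketched as follows: the UE names only a model and a partition, and the per-endpoint submodels fall out of the partitioning metadata downloaded earlier. Message type and field names are hypothetical.

```python
# Hypothetical partitioning metadata downloaded in Step 12,
# keyed by (model_id, partition_id).
partitioning_metadata = {
    ("model-a", "p-early-split"): [
        {"submodel_id": "model-a/part0", "endpoint": "UE"},
        {"submodel_id": "model-a/part1", "endpoint": "network"},
    ],
}

def build_split_config(model_id, partition_id):
    """Build the Step 14 configuration request sent to the DCAS."""
    submodels = partitioning_metadata[(model_id, partition_id)]
    return {
        "type": "split-inference-config",
        "model_id": model_id,
        "partition_id": partition_id,
        # Derived from the selection: which submodel(s) run where.
        "ue_submodels": [s["submodel_id"] for s in submodels
                         if s["endpoint"] == "UE"],
        "dcas_submodels": [s["submodel_id"] for s in submodels
                           if s["endpoint"] == "network"],
    }

config = build_split_config("model-a", "p-early-split")
print(config["ue_submodels"], config["dcas_submodels"])
```

The DCAS side of Steps 15-16 would then register the `dcas_submodels` entries and confirm (or reject) this configuration.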
Server-Side Preparation (Step 15):
- DCAS prepares the server-side execution context
- DCAS registers the submodel(s) and associated metadata for the selected partitioning
Configuration Confirmation (Step 16):
- DCAS indicates whether the requested configuration is accepted
- DCAS confirms readiness to execute the server-side submodel(s)
Submodel Deployment (Steps 17-18):
- Selected tasks/models and corresponding AI submodels are communicated to DCAS
- UE downloads the AI submodel(s) corresponding to subtasks to be executed on the device side
Execution (Step 19):
- Tasks identified for split inference between UE and DCAS are executed in a distributed manner
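The distributed execution in Step 19 reduces to the pattern below: the UE runs its submodel locally and forwards the intermediate tensor to the DCAS, which runs the server-side submodel and returns the result. The "models" here are stub functions, and transport over the application data channel is abstracted away.

```python
def ue_submodel(frame):
    # Device-side partition (e.g. feature extraction); stand-in arithmetic.
    return [x * 2 for x in frame]

def dcas_submodel(intermediate):
    # Network-side partition (e.g. classification head); stand-in arithmetic.
    return sum(intermediate)

def run_split_inference(frame):
    intermediate = ue_submodel(frame)   # executed on the UE
    # In the real flow the intermediate tensor traverses the data channel
    # between the UE and the DCAS.
    return dcas_submodel(intermediate)  # executed on the DCAS

print(run_split_inference([1, 2, 3]))  # 12
```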
The main distinctions from pure device inferencing are the UE's split-inferencing indication, the exchange of the partitioning list and UE capability metadata, the split-inference configuration with DCAS (including server-side submodel registration), and the distributed execution of submodels across the UE and the network.
The document notes one FFS (For Further Study) item:
- How device capabilities are sent to obtain an accurate list of models (noted after Step 6)