[FS_AVFOPS_MED] New scenario: Video with semantic segmentation map
This CR proposes adding a new scenario to TS 26.966 addressing video with semantic segmentation maps, progressing objective 1 on identifying relevant new representation formats not yet documented in TS 26.265.
Semantic Segmentation Fundamentals
- Technique where every pixel in an image is assigned to one of a set of semantic classes
- Example classes from Android ARCore Scene Semantics API: sky, building, tree, road, vehicle, sidewalk, terrain, structure, water, object, person
- Enables AR applications with advanced video processing (sky replacement, realistic lighting effects)
Mobile Implementation Context
- Real-time capture of segmentation maps alongside camera view is commonly available on recent mobile devices
- Leverages high-capacity camera/video pipelines and AI frameworks with hardware optimizations
- Specialized models exist for specific content types (e.g., multi-class selfie segmentation)
Multi-class Selfie Segmentation Model
- Provides 6 classes for selfie shots:
  - Background
  - Hair
  - Body-skin
  - Face-skin
  - Clothes
  - Others (accessories)
Use Cases
- Video effects (hair replacement, face filtering)
- Video indexing
- AI search
Processing Pipeline
Three main steps identified:
1. Frame acquisition
2. AI inference
3. Generation of segmentation map
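The three steps above can be sketched as follows. This is a minimal illustration, not the actual device pipeline: the inference step is a stub returning random per-pixel class scores, standing in for a real segmentation model, and the function names are hypothetical.

```python
import numpy as np

NUM_CLASSES = 6  # multi-class selfie segmentation

def acquire_frame(height=256, width=256):
    """Step 1: frame acquisition (stub: a random RGB camera frame)."""
    return np.random.randint(0, 256, (height, width, 3), dtype=np.uint8)

def run_inference(frame):
    """Step 2: AI inference (stub: random per-pixel class scores).

    A real implementation would run a hardware-accelerated model here.
    """
    h, w, _ = frame.shape
    return np.random.rand(h, w, NUM_CLASSES).astype(np.float32)

def to_segmentation_map(scores):
    """Step 3: generate the segmentation map by taking the most likely
    class per pixel, stored as a 2D array of unsigned 8-bit integers."""
    return scores.argmax(axis=-1).astype(np.uint8)

frame = acquire_frame()
seg_map = to_segmentation_map(run_inference(frame))
assert seg_map.shape == frame.shape[:2] and seg_map.dtype == np.uint8
```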
Implementation Details
- Uses the Google MediaPipe framework API for image segmentation
- AI model performs inference on camera frames
- Output format: 2D array of unsigned 8-bit integers
- Each value represents estimated category for each input pixel
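Given a category mask in this output format, a per-class boolean mask can be extracted for downstream effects such as hair replacement. A small sketch with a hypothetical hand-written mask (the `class_mask` helper is illustrative, not part of the MediaPipe API):

```python
import numpy as np

HAIR = 1  # class identifier from the multi-class selfie mapping

def class_mask(seg_map, class_id):
    """Return a boolean mask selecting the pixels of one class,
    e.g. for hair replacement or background blurring."""
    return seg_map == class_id

# Hypothetical 2x3 category mask: background (0) with three hair (1) pixels.
seg_map = np.array([[0, 1, 1],
                    [0, 0, 1]], dtype=np.uint8)
mask = class_mask(seg_map, HAIR)
assert mask.sum() == 3
```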
Class Identifier Mapping
For multi-class selfie segmentation:
- 0: background
- 1: hair
- 2: body-skin
- 3: face-skin
- 4: clothes
- 5: others (accessories)
Efficiency Considerations
- Direct class identifier representation is inefficient (only 6 of the 256 possible 8-bit values are used)
- Mapping class identifiers to sample value ranges improves:
  - Transport efficiency
  - Robustness to encoding artifacts
Example Mapping Table
| Class ID | Assigned Value | Sample Range |
|----------|---------------|--------------|
| 0 | 21 | 0-42 |
| 1 | 64 | 43-85 |
| 2 | 107 | 86-128 |
| 3 | 150 | 129-171 |
| 4 | 193 | 172-214 |
| 5 | 235 | 215-255 |
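A sketch of how the example mapping table could be applied and inverted. The assigned values come straight from the table; the decode step exploits the fact that, in this example, each class owns a band of roughly 43 sample values, so integer division by 43 (clamped to the last class) recovers the class identifier even when decoded samples are distorted within their band. The function names are illustrative, not from any standard.

```python
import numpy as np

# Assigned values from the example mapping table (class ID -> sample value).
ASSIGNED = np.array([21, 64, 107, 150, 193, 235], dtype=np.uint8)

def encode_map(seg_map):
    """Map class identifiers to mid-range sample values before video encoding."""
    return ASSIGNED[seg_map]

def decode_map(samples):
    """Recover class identifiers from (possibly distorted) decoded samples.

    Each class owns a band of ~43 sample values (0-42, 43-85, ...),
    so integer division by 43, clamped to class 5, inverts the mapping."""
    return np.minimum(samples.astype(np.int32) // 43, 5).astype(np.uint8)

seg_map = np.array([[0, 1, 2], [3, 4, 5]], dtype=np.uint8)
assert np.array_equal(decode_map(encode_map(seg_map)), seg_map)

# Robustness: samples distorted by up to +/-10 still decode to the same class.
noisy = np.clip(encode_map(seg_map).astype(np.int32) + 10, 0, 255).astype(np.uint8)
assert np.array_equal(decode_map(noisy), seg_map)
```

With direct class identifiers (0-5), a distortion of a few code values would move a pixel to a different class; with mid-range assigned values, distortions of up to about 21 code values are absorbed.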
Current Status in JVET
- Encoding of semantic segmentation maps is not currently enabled by the MV-HEVC standard
- JVET is developing a possible MV-HEVC extension with:
  - A new auxiliary layer type called "segmentation plane"
  - A picture segmentation information SEI message for interpreting decoded samples as class identifiers
- Reference: JVET-AN2032 (40th Meeting, Geneva, October 2025)
Assessment Framework
Two aspects for functional analysis:
Option b: No existing hardware support (provide justification/description of expected implementation impact)
Codec Capabilities