Evaluation Metrics
Limitations of COCO Metrics for Limb Deficiency
Standard pose estimation metrics like those used in COCO have several limitations when evaluating models on individuals with limb deficiencies:
- COCO metrics assume all keypoints are present, which is not true for individuals with limb differences
- They don't account for the varying degrees of visibility in residual limbs
- Standard OKS (Object Keypoint Similarity) doesn't properly handle cases where keypoints are physically absent
- Existing metrics may penalize models that correctly identify missing limbs as "not present"
Limb-specific Confidence Consistency (LiCC)
We introduce the Limb-specific Confidence Consistency (LiCC) score to measure how well the confidence scores predicted by a model respect the expected exclusivity among anatomically linked keypoints. LiCC is defined as:

LiCC = (1 / |V|) Σ_{i ∈ V} 1( s_i > max_{j ∈ M(i)} s_j )

where:
- V is the set of all ground-truth keypoints with visibility v ≥ 1
- M(i) is the set of keypoints that are mutually exclusive with keypoint i ∈ V; for example, if the residual wrist keypoint is visible, neither the residual elbow keypoint nor the normal wrist landmark can be present
- s_i is the predicted confidence for keypoint i, and max_{j ∈ M(i)} s_j is the highest confidence among its mutually exclusive partners
- 1(·) is the indicator function

A higher LiCC indicates stronger consistency: visible keypoints are assigned higher confidence than any anatomically impossible alternatives.
LiCC ranges from 0 to 1, where 1 indicates perfect confidence calibration for both present and absent limbs.
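As a concrete illustration, the sketch below computes LiCC as defined above. The function name, the list-like inputs, the `exclusive_sets` mapping, and the convention that a keypoint with an empty M(i) counts as consistent are illustrative assumptions, not the benchmark's reference implementation.

```python
def licc_score(confidences, visibility, exclusive_sets):
    """Sketch of the LiCC computation described above.

    confidences: length-K sequence of predicted confidences, one per keypoint.
    visibility: length-K sequence of ground-truth visibility flags (v >= 1 means annotated).
    exclusive_sets: dict mapping keypoint index i -> list of indices M(i)
        that are anatomically incompatible with keypoint i.
    """
    visible = [i for i in range(len(confidences)) if visibility[i] >= 1]
    if not visible:
        return float("nan")  # LiCC is undefined when no keypoints are annotated

    consistent = 0
    for i in visible:
        rivals = exclusive_sets.get(i, [])
        # A visible keypoint is "consistent" if its confidence exceeds that of
        # every anatomically impossible alternative (vacuously true if M(i) is empty).
        if not rivals or confidences[i] > max(confidences[j] for j in rivals):
            consistent += 1
    return consistent / len(visible)
```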
Average Precision (AP)
Measures the area under the precision-recall curve. Following the COCO protocol, we report AP averaged over OKS (Object Keypoint Similarity) thresholds from 0.50 to 0.95 in steps of 0.05.
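For reference, the sketch below shows the standard COCO-style OKS computation that these thresholds are applied to. The function name, the per-keypoint falloff constants `k`, and the treatment of residual-limb keypoints are assumptions here; the benchmark may adapt OKS for its extended keypoint set.

```python
import numpy as np

def object_keypoint_similarity(pred, gt, visibility, area, k):
    """COCO-style OKS between one predicted and one ground-truth pose (sketch).

    pred, gt: (K, 2) arrays of keypoint coordinates.
    visibility: (K,) array of ground-truth visibility flags (v >= 1 means annotated).
    area: ground-truth instance area, used as the scale factor s**2.
    k: (K,) array of per-keypoint falloff constants.
    """
    mask = visibility >= 1
    if not np.any(mask):
        return float("nan")  # no annotated keypoints to compare against
    d2 = np.sum((pred[mask] - gt[mask]) ** 2, axis=-1)        # squared pixel distances
    e = d2 / (2.0 * area * k[mask] ** 2 + np.finfo(float).eps)
    return float(np.mean(np.exp(-e)))                          # average over annotated keypoints

# AP is then the mean of the per-threshold APs over the OKS thresholds
# 0.50, 0.55, ..., 0.95 referenced above.
OKS_THRESHOLDS = np.arange(0.50, 1.00, 0.05)
```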
LiCC Score
Our proposed metric that evaluates how well a model's confidence scores align with actual limb presence/absence.
Benchmark Results
The following table compares the performance of state-of-the-art pose estimation models on our dataset:

Figure 1: Performance comparison of different models on InclusiveVidPose
Metric columns are grouped by training → evaluation setting, indicated in the header: trained and evaluated on InclusiveVidPose (IVP→IVP), trained on InclusiveVidPose + COCO and evaluated on InclusiveVidPose (IVP+COCO→IVP), and trained on COCO and evaluated on InclusiveVidPose (COCO→IVP) and on COCO (COCO→COCO).

| Method | Backbone | Input Size | AP (IVP→IVP) | AP50 | AP75 | AR | AR50 | AR75 | LiCC | AP (IVP+COCO→IVP) | AP50 | AP75 | AR | AR50 | AR75 | LiCC | AP (COCO→IVP) | AR | AP (COCO→COCO) | AR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ResNet¹ | ResNet50 | 256×192 | 66.0 | 80.8 | 72.4 | 72.1 | 84.5 | 77.5 | 54.4 | 67.1 | 81.8 | 73.6 | 72.7 | 85.0 | 78.3 | 67.7 | 70.0 | 73.4 | 71.8 | 77.4 |
|  | ResNet101 | 256×192 | 66.3 | 80.1 | 71.3 | 71.3 | 83.5 | 75.8 | 59.1 | 66.7 | 81.7 | 72.4 | 72.0 | 84.2 | 76.8 | 36.7 | 68.8 | 90.5 | 72.8 | 78.3 |
|  | ResNet152 | 256×192 | 66.4 | 79.9 | 71.8 | 71.6 | 83.9 | 76.3 | 55.7 | 66.5 | 81.5 | 70.8 | 71.9 | 84.3 | 75.8 | 48.4 | 69.5 | 72.8 | 73.6 | 79.1 |
| HRNet² | HRNet-W32 | 256×192 | 69.0 | 81.2 | 71.8 | 74.3 | 84.2 | 77.3 | 59.9 | 68.9 | 80.2 | 72.5 | 74.2 | 83.7 | 77.3 | 41.9 | 73.6 | 76.4 | 74.9 | 80.4 |
|  | HRNet-W32 | 384×288 | 71.0 | 82.0 | 74.2 | 76.2 | 85.3 | 79.3 | 52.4 | 72.3 | 82.5 | 75.1 | 76.8 | 85.5 | 79.3 | 33.1 | 73.6 | 76.4 | 76.1 | 81.1 |
|  | HRNet-W48 | 256×192 | 61.0 | 77.2 | 63.4 | 64.9 | 80.1 | 69.8 | 49.7 | 69.8 | 81.2 | 73.8 | 74.8 | 84.0 | 78.7 | 37.4 | 73.0 | 75.9 | 75.6 | 80.9 |
|  | HRNet-W48 | 384×288 | 61.9 | 78.9 | 65.7 | 67.5 | 82.7 | 71.7 | 48.5 | 68.5 | 80.4 | 72.7 | 73.5 | 83.7 | 77.3 | 53.0 | 63.5 | 67.1 | 76.7 | 81.7 |
| Swin Transformer³ | Swin-T | 256×192 | 69.5 | 82.5 | 74.8 | 74.6 | 85.7 | 79.0 | 42.2 | 68.6 | 82.5 | 73.0 | 74.2 | 85.8 | 78.2 | 58.8 | 68.9 | 72.4 | 72.4 | 78.2 |
|  | Swin-B | 256×192 | 67.2 | 80.9 | 72.2 | 73.0 | 84.2 | 77.5 | 45.4 | 67.5 | 81.2 | 72.4 | 72.9 | 84.2 | 77.7 | 47.5 | 64.8 | 68.4 | 73.7 | 79.4 |
|  | Swin-L | 256×192 | 69.1 | 81.4 | 74.7 | 74.8 | 84.8 | 79.0 | 43.2 | 67.4 | 80.6 | 71.8 | 72.9 | 83.8 | 76.5 | 41.0 | 65.8 | 69.4 | 74.3 | 79.8 |
| RTMPose⁴ | RTMPose-T | 256×192 | 61.2 | 76.3 | 64.5 | 68.2 | 74.0 | 73.5 | 44.2 | 51.9 | 74.1 | 54.6 | 57.9 | 79.1 | 58.9 | 62.7 | 60.4 | 63.1 | 68.2 | 73.6 |
|  | RTMPose-S | 256×192 | 64.2 | 78.3 | 68.1 | 70.8 | 77.4 | 73.9 | 51.3 | 56.1 | 76.4 | 61.0 | 61.4 | 80.2 | 66.7 | 64.0 | 64.2 | 67.5 | 71.6 | 76.8 |
|  | RTMPose-M | 256×192 | 69.7 | 81.5 | 73.0 | 74.1 | 84.3 | 77.3 | 43.5 | 69.1 | 82.0 | 73.8 | 74.2 | 85.3 | 78.0 | 53.6 | 73.3 | 76.2 | 74.6 | 79.5 |
|  | RTMPose-L | 256×192 | 70.7 | 81.6 | 74.4 | 75.0 | 84.3 | 78.0 | 58.1 | 70.9 | 82.2 | 74.0 | 75.6 | 85.3 | 78.0 | 64.5 | 75.1 | 77.8 | 75.8 | 80.6 |
| ViTPose⁵ | ViT-S | 256×192 | 67.6 | 81.0 | 71.6 | 72.6 | 84.0 | 76.2 | 57.2 | 67.5 | 81.8 | 72.8 | 73.0 | 85.2 | 77.5 | 42.4 | 67.7 | 71.3 | 73.9 | 79.2 |
|  | ViT-B | 256×192 | 69.4 | 81.7 | 73.4 | 75.0 | 85.2 | 78.3 | 56.2 | 68.3 | 81.9 | 72.5 | 74.2 | 85.5 | 77.5 | 57.4 | 73.4 | 76.5 | 75.7 | 81.0 |
|  | ViT-L | 256×192 | 71.1 | 82.9 | 75.0 | 77.2 | 86.2 | 80.8 | 55.1 | 73.1 | 83.7 | 76.2 | 78.0 | 86.8 | 80.5 | 49.2 | 78.5 | 81.4 | 78.2 | 83.4 |
|  | ViT-H | 256×192 | 71.2 | 83.1 | 75.2 | 77.1 | 86.8 | 80.3 | 45.9 | 73.4 | 83.2 | 76.9 | 78.4 | 86.8 | 81.8 | 65.1 | 78.4 | 81.2 | 78.8 | 83.9 |

¹ He et al., CVPR 2016. ² Sun et al., CVPR 2019. ³ Liu et al., ICCV 2021. ⁴ He et al., CBM 2024. ⁵ Xu et al., NeurIPS 2022.
Analysis
Our evaluation shows that while existing models perform reasonably well, there's significant room for improvement, particularly in handling diverse limb deficiencies. Key findings include:
- Models perform better on upper body keypoints compared to lower body keypoints
- Performance degrades for individuals with multiple limb deficiencies
- Challenging poses and occlusions remain difficult for all models
- Our proposed architecture shows consistent improvements across all metrics

Figure 2: Qualitative results showing model predictions (green) vs ground truth (red)
Submission
To submit your model's results to our benchmark, please follow these steps:
- Train your model on our training set
- Generate predictions on our test set
- Format your results according to our submission guidelines (a rough example is sketched below)
- Submit via our evaluation server (coming soon)
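As a rough, non-authoritative illustration of the formatting step, keypoint benchmarks in the COCO family typically accept a JSON list of per-person results like the sketch below. The field names, keypoint ordering, and file name here are assumptions that the official submission guidelines will supersede.

```python
import json

# Illustrative only: one COCO-style keypoint result entry. The actual schema
# and keypoint ordering are defined by the benchmark's submission guidelines.
example_result = {
    "image_id": 42,          # hypothetical test-set image identifier
    "category_id": 1,        # person
    # Flattened (x, y, confidence) triplets, one per keypoint in the
    # benchmark's keypoint ordering (truncated here for brevity).
    "keypoints": [364.5, 128.0, 0.97, 371.2, 121.4, 0.95, 358.8, 121.9, 0.94],
    "score": 0.93,           # overall confidence for this person instance
}

with open("inclusivevidpose_results.json", "w") as f:
    json.dump([example_result], f)
```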