Evaluation Framework

Evaluation Metrics

Limitations of COCO Metrics for Limb Deficiency

Standard pose estimation metrics like those used in COCO have several limitations when evaluating models on individuals with limb deficiencies:

  • COCO metrics assume all keypoints are present, which is not true for individuals with limb differences
  • They don't account for the varying degrees of visibility in residual limbs
  • Standard OKS (Object Keypoint Similarity) doesn't properly handle cases where keypoints are physically absent
  • Existing metrics may penalize models that correctly identify missing limbs as "not present"

Limb-specific Confidence Consistency (LiCC)

We introduce the Limb-specific Confidence Consistency (LiCC) metric to measure how well a model's predicted confidence scores respect the mutual exclusivity between anatomically incompatible keypoints. LiCC is defined as:

\mathrm{LiCC} = \frac{1}{|V|} \sum_{i \in V} \mathbb{1}\left( s_i > \max_{j \in M(i)} s_j \right)
  • V is the set of all ground-truth keypoints with visibility v ≥ 1
  • For each keypoint i ∈ V, M(i) denotes its set of mutually exclusive keypoints; for example, if the residual wrist keypoint is visible, neither the residual elbow keypoint nor the standard wrist keypoint can be present
  • s_i is the predicted confidence for keypoint i
  • max_{j ∈ M(i)} s_j is the highest confidence among its mutually exclusive partners

where 1(·) is the indicator function. A higher LiCC indicates stronger consistency: visible keypoints are assigned higher confidence than any impossible alternatives.

LiCC ranges from 0 to 1; a value of 1 means every visible keypoint is scored above all of its mutually exclusive alternatives, i.e., the model's confidences are fully consistent for both present and absent limbs.
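To make the definition concrete, here is a minimal NumPy sketch of how LiCC could be computed for a single annotated person. The array layout, the excl map of mutually exclusive keypoints, and the choice to skip keypoints without exclusive partners are illustrative assumptions, not the official evaluation code.

```python
import numpy as np

def licc(scores: np.ndarray, vis: np.ndarray, excl: dict) -> float:
    """Minimal LiCC sketch for one annotated person.

    scores: (K,) predicted confidence per keypoint
    vis:    (K,) ground-truth visibility flags (v >= 1 means the keypoint exists)
    excl:   dict mapping keypoint index i to the list M(i) of keypoints that
            are mutually exclusive with i
    """
    # V: visible keypoints; this sketch also skips keypoints with no exclusive partners
    valid = [i for i in range(len(scores)) if vis[i] >= 1 and excl.get(i)]
    if not valid:
        return float("nan")
    # Indicator: the visible keypoint must outscore all of its impossible alternatives
    hits = sum(scores[i] > max(scores[j] for j in excl[i]) for i in valid)
    return hits / len(valid)

# Toy usage: keypoint 0 = standard wrist, keypoint 1 = residual wrist (mutually exclusive)
scores = np.array([0.12, 0.87])
vis = np.array([0, 2])              # only the residual wrist exists
print(licc(scores, vis, {1: [0]}))  # 1.0: the visible keypoint outscores its alternative
```

In practice, per-person values like this would be aggregated over the whole dataset.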

Average Precision (AP)

The area under the precision-recall curve. We report AP at OKS thresholds from 0.50 to 0.95 in steps of 0.05.
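Since AP is computed from OKS, a minimal sketch of the standard COCO-style OKS computation is shown below for reference; the variable names and the epsilon term are placeholders, not our evaluation code.

```python
import numpy as np

def oks(pred_kpts, gt_kpts, gt_vis, area, sigmas):
    """Standard COCO-style Object Keypoint Similarity for one person.

    pred_kpts, gt_kpts: (K, 2) arrays of (x, y) keypoint coordinates
    gt_vis:             (K,) ground-truth visibility flags
    area:               object scale term (segment or box area)
    sigmas:             (K,) per-keypoint falloff constants (COCO uses k_i = 2 * sigma_i)
    """
    d2 = np.sum((np.asarray(pred_kpts) - np.asarray(gt_kpts)) ** 2, axis=-1)
    e = d2 / (2.0 * area * (2.0 * np.asarray(sigmas)) ** 2 + np.spacing(1))
    labeled = np.asarray(gt_vis) > 0          # unlabeled keypoints are ignored entirely
    if not labeled.any():
        return 0.0
    return float(np.exp(-e)[labeled].mean())
```

Note that keypoints with v = 0 simply drop out of the average, so OKS-based AP by itself cannot credit a model for correctly suppressing confidence on an absent limb; this is the gap that LiCC is meant to cover.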

LiCC Score

Our proposed metric that evaluates how well a model's confidence scores align with actual limb presence/absence.

Benchmark Results

The following table compares the performance of state-of-the-art pose estimation models on our dataset:

Figure 1: Performance comparison of different models on InclusiveVidPose

Main experimental results on pose estimation algorithms. We evaluate top-down heatmap models with ResNet, HRNet, and Swin Transformer backbones, along with RTMPose and ViTPose. "InclusiveVidPose → InclusiveVidPose" reports training on our training set and evaluation on our validation and test splits. "InclusiveVidPose + COCO → InclusiveVidPose/COCO" reports training on both datasets, validating on our validation split and testing on our test split and COCO validation. "COCO → COCO" reports COCO training and COCO validation performance (from MMPose).
Method | Backbone | Input Size | InclusiveVidPose → InclusiveVidPose: AP AP50 AP75 AR AR50 AR75 LiCC | InclusiveVidPose + COCO → InclusiveVidPose: AP AP50 AP75 AR AR50 AR75 LiCC | InclusiveVidPose + COCO → COCO: AP AR | COCO → COCO: AP AR
ResNet [1] | ResNet50 | 256×192 | 66.0 80.8 72.4 72.1 84.5 77.5 54.4 | 67.1 81.8 73.6 72.7 85.0 78.3 67.7 | 70.0 73.4 | 71.8 77.4
ResNet [1] | ResNet101 | 256×192 | 66.3 80.1 71.3 71.3 83.5 75.8 59.1 | 66.7 81.7 72.4 72.0 84.2 76.8 36.7 | 68.8 90.5 | 72.8 78.3
ResNet [1] | ResNet152 | 256×192 | 66.4 79.9 71.8 71.6 83.9 76.3 55.7 | 66.5 81.5 70.8 71.9 84.3 75.8 48.4 | 69.5 72.8 | 73.6 79.1
HRNet [2] | HRNet-W32 | 256×192 | 69.0 81.2 71.8 74.3 84.2 77.3 59.9 | 68.9 80.2 72.5 74.2 83.7 77.3 41.9 | 73.6 76.4 | 74.9 80.4
HRNet [2] | HRNet-W32 | 384×288 | 71.0 82.0 74.2 76.2 85.3 79.3 52.4 | 72.3 82.5 75.1 76.8 85.5 79.3 33.1 | 73.6 76.4 | 76.1 81.1
HRNet [2] | HRNet-W48 | 256×192 | 61.0 77.2 63.4 64.9 80.1 69.8 49.7 | 69.8 81.2 73.8 74.8 84.0 78.7 37.4 | 73.0 75.9 | 75.6 80.9
HRNet [2] | HRNet-W48 | 384×288 | 61.9 78.9 65.7 67.5 82.7 71.7 48.5 | 68.5 80.4 72.7 73.5 83.7 77.3 53.0 | 63.5 67.1 | 76.7 81.7
Swin Transformer [3] | Swin-T | 256×192 | 69.5 82.5 74.8 74.6 85.7 79.0 42.2 | 68.6 82.5 73.0 74.2 85.8 78.2 58.8 | 68.9 72.4 | 72.4 78.2
Swin Transformer [3] | Swin-B | 256×192 | 67.2 80.9 72.2 73.0 84.2 77.5 45.4 | 67.5 81.2 72.4 72.9 84.2 77.7 47.5 | 64.8 68.4 | 73.7 79.4
Swin Transformer [3] | Swin-L | 256×192 | 69.1 81.4 74.7 74.8 84.8 79.0 43.2 | 67.4 80.6 71.8 72.9 83.8 76.5 41.0 | 65.8 69.4 | 74.3 79.8
RTMPose [4] | RTMPose-T | 256×192 | 61.2 76.3 64.5 68.2 74.0 73.5 44.2 | 51.9 74.1 54.6 57.9 79.1 58.9 62.7 | 60.4 63.1 | 68.2 73.6
RTMPose [4] | RTMPose-S | 256×192 | 64.2 78.3 68.1 70.8 77.4 73.9 51.3 | 56.1 76.4 61.0 61.4 80.2 66.7 64.0 | 64.2 67.5 | 71.6 76.8
RTMPose [4] | RTMPose-M | 256×192 | 69.7 81.5 73.0 74.1 84.3 77.3 43.5 | 69.1 82.0 73.8 74.2 85.3 78.0 53.6 | 73.3 76.2 | 74.6 79.5
RTMPose [4] | RTMPose-L | 256×192 | 70.7 81.6 74.4 75.0 84.3 78.0 58.1 | 70.9 82.2 74.0 75.6 85.3 78.0 64.5 | 75.1 77.8 | 75.8 80.6
ViTPose [5] | ViT-S | 256×192 | 67.6 81.0 71.6 72.6 84.0 76.2 57.2 | 67.5 81.8 72.8 73.0 85.2 77.5 42.4 | 67.7 71.3 | 73.9 79.2
ViTPose [5] | ViT-B | 256×192 | 69.4 81.7 73.4 75.0 85.2 78.3 56.2 | 68.3 81.9 72.5 74.2 85.5 77.5 57.4 | 73.4 76.5 | 75.7 81.0
ViTPose [5] | ViT-L | 256×192 | 71.1 82.9 75.0 77.2 86.2 80.8 55.1 | 73.1 83.7 76.2 78.0 86.8 80.5 49.2 | 78.5 81.4 | 78.2 83.4
ViTPose [5] | ViT-H | 256×192 | 71.2 83.1 75.2 77.1 86.8 80.3 45.9 | 73.4 83.2 76.9 78.4 86.8 81.8 65.1 | 78.4 81.2 | 78.8 83.9

Analysis

Our evaluation shows that while existing models perform reasonably well, there's significant room for improvement, particularly in handling diverse limb deficiencies. Key findings include:

  • Models perform better on upper body keypoints compared to lower body keypoints
  • Performance degrades for individuals with multiple limb deficiencies
  • Challenging poses and occlusions remain difficult for all models
  • Our proposed architecture shows consistent improvements across all metrics

Qualitative Results

Figure 2: Qualitative results showing model predictions (green) vs ground truth (red)

Submission

To submit your model's results to our benchmark, please follow these steps:

  1. Train your model on our training set
  2. Generate predictions on our test set
  3. Format your results according to our submission guidelines (see the hypothetical example after this list)
  4. Submit via our evaluation server (coming soon)
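
Until the submission guidelines and evaluation server are live, the exact file format is not fixed. As a rough, hypothetical illustration, a COCO-style keypoint results file (one JSON record per detected person) could look like the sketch below; the field names follow the common COCO results convention and are assumptions rather than the final schema.

```python
import json

# Hypothetical sketch of a COCO-style keypoint results file; the actual schema,
# keypoint order, and file naming will be defined by the submission guidelines.
results = [
    {
        "image_id": 123,        # test-set image identifier (placeholder value)
        "category_id": 1,       # person
        # Flattened (x, y, confidence) triplets, one per keypoint in the dataset's order
        "keypoints": [412.0, 203.5, 0.98,   # e.g. a visible landmark
                      405.2, 231.0, 0.12],  # e.g. an anatomically absent landmark
        "score": 0.91,          # instance-level detection confidence
    }
]

with open("predictions.json", "w") as f:
    json.dump(results, f)
```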