Evaluation Framework

Evaluation Metrics

Limitations of COCO Metrics for Limb Deficiency

Standard pose estimation metrics like those used in COCO have several limitations when evaluating models on individuals with limb deficiencies:

  • COCO metrics assume all keypoints are present, which is not true for individuals with limb differences
  • They don't account for the varying degrees of visibility in residual limbs
  • Standard OKS (Object Keypoint Similarity) doesn't properly handle cases where keypoints are physically absent
  • Existing metrics may penalize models that correctly identify missing limbs as "not present"

Limb-specific Confidence Consistency (LiCC)

We introduce the Limb-specific Confidence Consistency (LiCC) metric to measure how well a model's predicted confidence scores respect the mutual exclusivity between anatomically incompatible keypoints. LiCC is defined as:

\mathrm{LiCC} = \frac{1}{|V|} \sum_{i \in V} \mathbb{1}\left( s_i > \max_{j \in M(i)} s_j \right)
  • V is the set of all ground-truth keypoints with visibility v ≥ 1
  • For each keypoint i ∈ V, M(i) denotes its set of mutually exclusive keypoints; for example, if the residual wrist keypoint is visible, neither the residual elbow keypoint nor the standard wrist keypoint can be present
  • s_i is the predicted confidence for keypoint i
  • max_{j ∈ M(i)} s_j is the highest confidence among its mutually exclusive partners

where 1(·) is the indicator function. A higher LiCC indicates stronger consistency: visible keypoints are assigned higher confidence than any impossible alternatives.

LiCC ranges from 0 to 1; a value of 1 means every visible keypoint is scored above all of its mutually exclusive alternatives, i.e., the model's confidences are fully consistent for both present and absent limbs.
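To make the definition concrete, here is a minimal NumPy sketch of how LiCC could be computed for a single annotated person. The array layout, the excl map of mutually exclusive keypoints, and the choice to skip keypoints without exclusive partners are illustrative assumptions, not the official evaluation code.

```python
import numpy as np

def licc(scores: np.ndarray, vis: np.ndarray, excl: dict) -> float:
    """Minimal LiCC sketch for one annotated person.

    scores: (K,) predicted confidence per keypoint
    vis:    (K,) ground-truth visibility flags (v >= 1 means the keypoint exists)
    excl:   dict mapping keypoint index i to the list M(i) of keypoints that
            are mutually exclusive with i
    """
    # V: visible keypoints; this sketch also skips keypoints with no exclusive partners
    valid = [i for i in range(len(scores)) if vis[i] >= 1 and excl.get(i)]
    if not valid:
        return float("nan")
    # Indicator: the visible keypoint must outscore all of its impossible alternatives
    hits = sum(scores[i] > max(scores[j] for j in excl[i]) for i in valid)
    return hits / len(valid)

# Toy usage: keypoint 0 = standard wrist, keypoint 1 = residual wrist (mutually exclusive)
scores = np.array([0.12, 0.87])
vis = np.array([0, 2])              # only the residual wrist exists
print(licc(scores, vis, {1: [0]}))  # 1.0: the visible keypoint outscores its alternative
```

In practice, per-person values like this would be aggregated over the whole dataset.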

Average Precision (AP)

The area under the precision-recall curve. We report AP at OKS thresholds from 0.50 to 0.95 in steps of 0.05.
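Since AP is computed from OKS, a minimal sketch of the standard COCO-style OKS computation is shown below for reference; the variable names and the epsilon term are placeholders, not our evaluation code.

```python
import numpy as np

def oks(pred_kpts, gt_kpts, gt_vis, area, sigmas):
    """Standard COCO-style Object Keypoint Similarity for one person.

    pred_kpts, gt_kpts: (K, 2) arrays of (x, y) keypoint coordinates
    gt_vis:             (K,) ground-truth visibility flags
    area:               object scale term (segment or box area)
    sigmas:             (K,) per-keypoint falloff constants (COCO uses k_i = 2 * sigma_i)
    """
    d2 = np.sum((np.asarray(pred_kpts) - np.asarray(gt_kpts)) ** 2, axis=-1)
    e = d2 / (2.0 * area * (2.0 * np.asarray(sigmas)) ** 2 + np.spacing(1))
    labeled = np.asarray(gt_vis) > 0          # unlabeled keypoints are ignored entirely
    if not labeled.any():
        return 0.0
    return float(np.exp(-e)[labeled].mean())
```

Note that keypoints with v = 0 simply drop out of the average, so OKS-based AP by itself cannot credit a model for correctly suppressing confidence on an absent limb; this is the gap that LiCC is meant to cover.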

LiCC Score

Our proposed metric that evaluates how well a model's confidence scores align with actual limb presence/absence.

Benchmark Results

The following table compares the performance of state-of-the-art pose estimation models on our dataset:

Figure 1: Performance comparison of different models on InclusiveVidPose

Main experimental results on pose estimation algorithms. We evaluate top-down heatmap models with ResNet, HRNet, and Swin Transformer backbones, along with RTMPose and ViTPose. "InclusiveVidPose → InclusiveVidPose" reports training on our training set and evaluation on our validation and test splits. "InclusiveVidPose + COCO → InclusiveVidPose/COCO" reports training on both datasets, validating on our validation split and testing on our test split and COCO validation. "COCO → COCO" reports COCO training and COCO validation performance (from MMPose).
Method | Backbone | Input Size | InclusiveVidPose → InclusiveVidPose: AP AP50 AP75 AR AR50 AR75 LiCC | InclusiveVidPose + COCO → InclusiveVidPose: AP AP50 AP75 AR AR50 AR75 LiCC | InclusiveVidPose + COCO → COCO: AP AR | COCO → COCO: AP AR
ResNet [1] | ResNet50 | 256×192 | 66.0 80.8 72.4 72.1 84.5 77.5 54.4 | 67.1 81.8 73.6 72.7 85.0 78.3 67.7 | 70.0 73.4 | 71.8 77.4
ResNet [1] | ResNet101 | 256×192 | 66.3 80.1 71.3 71.3 83.5 75.8 59.1 | 66.7 81.7 72.4 72.0 84.2 76.8 36.7 | 68.8 90.5 | 72.8 78.3
ResNet [1] | ResNet152 | 256×192 | 66.4 79.9 71.8 71.6 83.9 76.3 55.7 | 66.5 81.5 70.8 71.9 84.3 75.8 48.4 | 69.5 72.8 | 73.6 79.1
HRNet [2] | HRNet-W32 | 256×192 | 69.0 81.2 71.8 74.3 84.2 77.3 59.9 | 68.9 80.2 72.5 74.2 83.7 77.3 41.9 | 73.6 76.4 | 74.9 80.4
HRNet [2] | HRNet-W32 | 384×288 | 71.0 82.0 74.2 76.2 85.3 79.3 52.4 | 72.3 82.5 75.1 76.8 85.5 79.3 33.1 | 73.6 76.4 | 76.1 81.1
HRNet [2] | HRNet-W48 | 256×192 | 61.0 77.2 63.4 64.9 80.1 69.8 49.7 | 69.8 81.2 73.8 74.8 84.0 78.7 37.4 | 73.0 75.9 | 75.6 80.9
HRNet [2] | HRNet-W48 | 384×288 | 61.9 78.9 65.7 67.5 82.7 71.7 48.5 | 68.5 80.4 72.7 73.5 83.7 77.3 53.0 | 63.5 67.1 | 76.7 81.7
Swin Transformer [3] | Swin-T | 256×192 | 69.5 82.5 74.8 74.6 85.7 79.0 42.2 | 68.6 82.5 73.0 74.2 85.8 78.2 58.8 | 68.9 72.4 | 72.4 78.2
Swin Transformer [3] | Swin-B | 256×192 | 67.2 80.9 72.2 73.0 84.2 77.5 45.4 | 67.5 81.2 72.4 72.9 84.2 77.7 47.5 | 64.8 68.4 | 73.7 79.4
Swin Transformer [3] | Swin-L | 256×192 | 69.1 81.4 74.7 74.8 84.8 79.0 43.2 | 67.4 80.6 71.8 72.9 83.8 76.5 41.0 | 65.8 69.4 | 74.3 79.8
RTMPose [4] | RTMPose-T | 256×192 | 61.2 76.3 64.5 68.2 74.0 73.5 44.2 | 51.9 74.1 54.6 57.9 79.1 58.9 62.7 | 60.4 63.1 | 68.2 73.6
RTMPose [4] | RTMPose-S | 256×192 | 64.2 78.3 68.1 70.8 77.4 73.9 51.3 | 56.1 76.4 61.0 61.4 80.2 66.7 64.0 | 64.2 67.5 | 71.6 76.8
RTMPose [4] | RTMPose-M | 256×192 | 69.7 81.5 73.0 74.1 84.3 77.3 43.5 | 69.1 82.0 73.8 74.2 85.3 78.0 53.6 | 73.3 76.2 | 74.6 79.5
RTMPose [4] | RTMPose-L | 256×192 | 70.7 81.6 74.4 75.0 84.3 78.0 58.1 | 70.9 82.2 74.0 75.6 85.3 78.0 64.5 | 75.1 77.8 | 75.8 80.6
ViTPose [5] | ViT-S | 256×192 | 67.6 81.0 71.6 72.6 84.0 76.2 57.2 | 67.5 81.8 72.8 73.0 85.2 77.5 42.4 | 67.7 71.3 | 73.9 79.2
ViTPose [5] | ViT-B | 256×192 | 69.4 81.7 73.4 75.0 85.2 78.3 56.2 | 68.3 81.9 72.5 74.2 85.5 77.5 57.4 | 73.4 76.5 | 75.7 81.0
ViTPose [5] | ViT-L | 256×192 | 71.1 82.9 75.0 77.2 86.2 80.8 55.1 | 73.1 83.7 76.2 78.0 86.8 80.5 49.2 | 78.5 81.4 | 78.2 83.4
ViTPose [5] | ViT-H | 256×192 | 71.2 83.1 75.2 77.1 86.8 80.3 45.9 | 73.4 83.2 76.9 78.4 86.8 81.8 65.1 | 78.4 81.2 | 78.8 83.9

Analysis

Our evaluation shows that while existing models perform reasonably well, there's significant room for improvement, particularly in handling diverse limb deficiencies. Key findings include:

  • Models perform better on upper body keypoints compared to lower body keypoints
  • Performance degrades for individuals with multiple limb deficiencies
  • Challenging poses and occlusions remain difficult for all models
  • Our proposed architecture shows consistent improvements across all metrics

Qualitative Results

Figure 2: Qualitative results showing model predictions (green) vs ground truth (red)

Submission

To submit your model's results to our benchmark, please follow these steps:

  1. Train your model on our training set
  2. Generate predictions on our test set
  3. Format your results according to our submission guidelines (see the hypothetical example after this list)
  4. Submit via our evaluation server (coming soon)
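
Until the submission guidelines and evaluation server are live, the exact file format is not fixed. As a rough, hypothetical illustration, a COCO-style keypoint results file (one JSON record per detected person) could look like the sketch below; the field names follow the common COCO results convention and are assumptions rather than the final schema.

```python
import json

# Hypothetical sketch of a COCO-style keypoint results file; the actual schema,
# keypoint order, and file naming will be defined by the submission guidelines.
results = [
    {
        "image_id": 123,        # test-set image identifier (placeholder value)
        "category_id": 1,       # person
        # Flattened (x, y, confidence) triplets, one per keypoint in the dataset's order
        "keypoints": [412.0, 203.5, 0.98,   # e.g. a visible landmark
                      405.2, 231.0, 0.12],  # e.g. an anatomically absent landmark
        "score": 0.91,          # instance-level detection confidence
    }
]

with open("predictions.json", "w") as f:
    json.dump(results, f)
```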