HOnnotate: A method for 3D Annotation of Hand and Object Poses


We propose a method for annotating images of a hand manipulating an object with the 3D poses of both the hand and the object, together with a dataset created using this method. Annotated real images for this problem are currently lacking, because estimating the 3D poses is challenging, mostly due to the mutual occlusions between the hand and the object. To tackle this challenge, we capture sequences with one or several RGB-D cameras and jointly optimize the 3D hand and object poses over all the frames simultaneously. This method allows us to automatically annotate each frame with accurate estimates of the poses, despite large mutual occlusions. With this method, we created HO-3D, the first markerless dataset of color images with 3D annotations of both hand and object. This dataset currently consists of 80,000 frames, 65 sequences, 10 persons, and 10 objects. We also use it to train a deep network to perform RGB-based single-frame hand pose estimation and provide a baseline on our dataset.



HO-3D is a dataset with 3D pose annotations for hands and objects under severe mutual occlusions. The 68 sequences in the dataset show 10 different persons manipulating 10 different objects, which are taken from the YCB object dataset. The dataset currently contains annotations for 77,558 images, which are split into 66,034 training images (from 55 sequences) and 11,524 evaluation images (from 13 sequences). The evaluation sequences are carefully selected to address the following scenarios:
  • Seen object and seen hand: Sequences SM1, SB11 and SB13 contain hands and objects which are also used in the training set.
  • Unseen object and seen hand: Sequences AP10, AP11, AP12, AP13 and AP14 contain the 019_pitcher_base object, which is not used in the training set.
  • Seen object and unseen hand: Sequences MPM10, MPM11, MPM12, MPM13 and MPM14 contain a subject with a different hand shape and color who is not part of the training set.
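The split described above can be written down as a small lookup table, which is handy when filtering sequences by scenario. The following is a minimal sketch, not part of the released tooling: the scenario labels, constant names, and helper functions are hypothetical, while the sequence names and counts are copied from the text.

```python
# Evaluation sequences grouped by scenario, as listed on this page.
# The dictionary keys are hypothetical labels chosen for this sketch.
EVAL_SEQUENCES = {
    "seen object, seen hand": ["SM1", "SB11", "SB13"],
    "unseen object, seen hand": ["AP10", "AP11", "AP12", "AP13", "AP14"],
    "seen object, unseen hand": ["MPM10", "MPM11", "MPM12", "MPM13", "MPM14"],
}

# Counts quoted in the dataset description.
TRAIN_IMAGES, EVAL_IMAGES, TOTAL_IMAGES = 66_034, 11_524, 77_558
TRAIN_SEQUENCE_COUNT, EVAL_SEQUENCE_COUNT, TOTAL_SEQUENCE_COUNT = 55, 13, 68


def scenario_of(sequence_name):
    """Return the scenario label of an evaluation sequence, or None if the
    name is not one of the listed evaluation sequences."""
    for scenario, names in EVAL_SEQUENCES.items():
        if sequence_name in names:
            return scenario
    return None


def split_is_consistent():
    """Check that the sequence and image counts quoted on the page add up."""
    listed_eval = sum(len(names) for names in EVAL_SEQUENCES.values())
    return (
        listed_eval == EVAL_SEQUENCE_COUNT
        and TRAIN_SEQUENCE_COUNT + EVAL_SEQUENCE_COUNT == TOTAL_SEQUENCE_COUNT
        and TRAIN_IMAGES + EVAL_IMAGES == TOTAL_IMAGES
    )
```

For example, `scenario_of("AP12")` returns the "unseen object, seen hand" label, and `split_is_consistent()` confirms that the three scenario lists together account for all 13 evaluation sequences.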
To evaluate different methods for hand pose estimation from RGB/depth images on our dataset under a common protocol, we have launched a CodaLab competition. The online competition server evaluates hand pose estimation results using three standard metrics (see our paper) and can be used to compare against other submissions. The hand pose annotations for the evaluation split are withheld, while the object pose annotations are made public. Additional information is provided for the evaluation split to aid the pose estimation task (see the evaluation page in the competition). The evaluation scripts used in the challenge are available in the GitHub repository provided below.
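One of the metrics reported on this page is the joint error after scale and translation alignment. The exact procedure is defined in the paper and the released evaluation scripts; the sketch below shows only one plausible reading and is not the official implementation. It assumes that the predicted joints are rescaled so that one reference bone has the same length as in the ground truth, and then translated so that the root joints coincide. The function names and the `root` and `ref` joint indices are hypothetical.

```python
import math


def align_scale_translation(pred, gt, root=0, ref=1):
    """Rescale and translate `pred` so that its root joint and the length of
    the bone (root, ref) match `gt`.

    `pred` and `gt` are equal-length sequences of (x, y, z) joint positions.
    `root` and `ref` are hypothetical joint indices chosen for this sketch.
    """
    bone_pred = math.dist(pred[ref], pred[root])
    bone_gt = math.dist(gt[ref], gt[root])
    # Fall back to no rescaling if the predicted bone is degenerate.
    scale = bone_gt / bone_pred if bone_pred > 0 else 1.0
    aligned = []
    for joint in pred:
        aligned.append(tuple(
            g0 + scale * (c - p0)
            for c, p0, g0 in zip(joint, pred[root], gt[root])
        ))
    return aligned


def mean_joint_error(pred, gt):
    """Mean Euclidean distance between corresponding joints."""
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(gt)
```

With this alignment, a prediction that differs from the ground truth only by a global scale and a translation yields zero error, so the metric measures the articulation error rather than the absolute position or hand size.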


We provide baseline results for hand pose estimation from a single RGB image on our HO-3D dataset. Please refer to the paper for more details.

Method    Mesh error (Procrustes alignment, cm)  F@5mm  F@15mm  Joint error (scale and translation alignment, cm)
baseline  1.07                                   0.51   0.94    3.04
[1]       1.10                                   0.46   0.93    3.18

[1] Hasson et al. Learning Joint Reconstruction of Hands and Manipulated Objects. CVPR'19
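The F@5mm and F@15mm columns report F-scores at distance thresholds of 5 mm and 15 mm. This page does not spell out the definition, so the following is a minimal sketch assuming the commonly used point-set F-score: the harmonic mean of precision (the fraction of predicted points within the threshold of some ground-truth point) and recall (the fraction of ground-truth points within the threshold of some predicted point). The function name is hypothetical, and the brute-force nearest-neighbour search is for illustration only.

```python
import math


def f_score(pred_points, gt_points, threshold):
    """F-score between two 3D point sets at a given distance threshold.

    Points are (x, y, z) tuples; `threshold` is in the same unit as the points.
    """
    def fraction_within(source, target):
        # Fraction of source points whose nearest target point is closer
        # than the threshold.
        hits = sum(
            1 for s in source
            if min(math.dist(s, t) for t in target) < threshold
        )
        return hits / len(source)

    precision = fraction_within(pred_points, gt_points)
    recall = fraction_within(gt_points, pred_points)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

With points expressed in meters, `f_score(pred, gt, 0.005)` and `f_score(pred, gt, 0.015)` would correspond to the two thresholds in the table. For full hand meshes, a spatial index such as a k-d tree would be a more practical choice than the quadratic search used here.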

This work was supported by the Christian Doppler Laboratory for Semantic 3D Computer Vision, funded in part by Qualcomm Inc.

Team Lepetit
Hampali Shivakumar