We propose an entirely data-driven approach to estimating the 3D pose of a hand given a depth image. We show that we can correct the mistakes made by a Convolutional Neural Network trained to predict an estimate of the 3D pose by using a feedback loop. The components of this feedback loop are also Deep Networks, optimized using training data. They remove the need for fitting a 3D model to the input data, which requires both a carefully designed fitting function and algorithm. We show that our approach outperforms state-of-the-art methods, and is efficient as our implementation runs at over 400 fps on a single GPU.
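The idea can be summarized as iterative refinement: an initial network predicts a pose, a synthesizer network renders a depth image from that pose, and an updater network compares the synthesized image with the observed one to produce a pose correction. Below is a minimal sketch of such a loop; the component names (`predictor`, `synthesizer`, `updater`) and the number of iterations are illustrative placeholders, not our actual implementation.

```python
import numpy as np

def estimate_pose(depth_image, predictor, synthesizer, updater, n_iter=3):
    """Iterative pose refinement with a learned feedback loop (sketch).

    predictor:   maps a depth image to an initial pose estimate
    synthesizer: renders a depth image from a pose estimate
    updater:     maps (observed image, synthesized image, current pose)
                 to a pose update
    """
    pose = predictor(depth_image)                  # initial estimate
    for _ in range(n_iter):
        synthesized = synthesizer(pose)            # render depth from current pose
        pose = pose + updater(depth_image, synthesized, pose)  # learned correction
    return pose
```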
Our results: Each line contains the estimated hand pose for one frame. The pose is parametrized by the locations of the joints in (u, v, d) coordinates, i.e., image coordinates and depth. The coordinates of the joints are stored in sequential order.
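For reference, a file in this format can be loaded with a few lines of Python; the file name and the joint count below are placeholders that depend on the actual sequence.

```python
import numpy as np

def load_poses(filename, num_joints):
    """Load poses stored one frame per line, each line containing the
    (u, v, d) coordinates of all joints in sequential order."""
    data = np.loadtxt(filename)             # shape: (num_frames, 3 * num_joints)
    return data.reshape(-1, num_joints, 3)  # shape: (num_frames, num_joints, 3)

# Example (hypothetical file name and joint count):
# poses = load_poses('subject1_joints.txt', num_joints=21)
```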
While many recent hand pose estimation methods critically rely on a training set of labeled frames, the creation of such a dataset is a challenging task that has been overlooked so far. As a result, existing datasets are limited to a few sequences and individuals, with limited accuracy, and this prevents these methods from delivering their full potential. We propose a semi-automated method for efficiently and accurately labeling each frame of a hand depth video with the corresponding 3D locations of the joints: The user is asked to provide only an estimate of the 2D reprojections of the visible joints in some reference frames, which are automatically selected to minimize the labeling work by efficiently optimizing a submodular loss function. We then exploit spatial, temporal, and appearance constraints to retrieve the full 3D poses of the hand over the complete sequence. We show that this data can be used to train a recent state-of-the-art hand pose estimation method, leading to increased accuracy.
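Selecting reference frames by optimizing a submodular objective admits a simple greedy strategy with well-known approximation guarantees. The sketch below illustrates the general idea with a generic facility-location-style coverage gain; the frame descriptors, distance measure, and budget are illustrative assumptions, not the exact criterion used in the paper.

```python
import numpy as np

def select_reference_frames(features, num_refs):
    """Greedy selection of reference frames (illustrative sketch).

    features: (num_frames, dim) array of per-frame descriptors.
    Greedily picks frames so that every frame is close to some selected
    reference, a standard facility-location-style submodular objective.
    """
    dists = np.linalg.norm(features[:, None] - features[None, :], axis=-1)
    selected = []
    coverage = np.full(len(features), np.inf)   # distance to closest reference so far
    for _ in range(num_refs):
        # marginal gain of adding each candidate frame as a reference
        gains = (coverage[None, :] - np.minimum(coverage[None, :], dists)).sum(axis=1)
        best = int(np.argmax(gains))
        selected.append(best)
        coverage = np.minimum(coverage, dists[best])
    return selected
```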
Presentation: CVPR'16 poster
Multi-User Egocentric Datasets:
Using our annotation tool, we created a large dataset with 3D hand pose annotations. This dataset targets hand pose estimation from an egocentric viewpoint. It was captured with an RGBD camera mounted on a tripod at head height, facing away from the subject. The subject stood behind the camera, so the viewpoint is equivalent to that of a camera mounted on an HMD.
The subjects were asked to perform common hand articulations, as well as typical articulations for AR/VR interaction. For each subject we recorded 19 sequences: 18 contain the same hand articulations performed by every subject, and 1 contains articulations specific to that subject.
We collected data from 4 subjects (1 male, 3 female), approximately 63k RGBD frames in total (around 15k per subject). Data and annotations can be downloaded below. More annotations will be released soon :)
| Subject   | RGB+Depth Data | Hand 3D detections |
|-----------|----------------|--------------------|
| Subject 1 | 6.6 GB         | 96 KB              |
| Subject 2 | 7.9 GB         | 96 KB              |
| Subject 3 | 7.2 GB         | 92 KB              |
| Subject 4 | 8.0 GB         | 120 KB             |
Here you can find the code for our CVPR'16 paper "Efficiently Creating 3D Training Data for Fine Hand Pose Estimation". It is distributed as a single package, SemiAutoAnno, under GPLv3. The code can be run out of the box with our synthetic dataset.
There is no proper documentation yet, but a basic README file and a short manual on how to use the GUI are included. If you have questions, please do not hesitate to contact us.
If you use the code, please cite us (see below).