Detecting poorly textured objects and estimating their 3D pose reliably is still a very challenging problem. We introduce a simple but powerful approach to computing descriptors for object views that efficiently capture both the object identity and 3D pose.
By contrast with previous manifold-based approaches, we can rely on the Euclidean distance to evaluate the similarity between descriptors, and therefore use scalable Nearest Neighbor search methods to efficiently handle a large number of objects under a large range of poses. To achieve this, we train a Convolutional Neural Network to compute these descriptors by enforcing simple similarity and dissimilarity constraints between the descriptors.
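The similarity and dissimilarity constraints mentioned above can be illustrated with a standard margin-based triplet term plus a pair term; this is a minimal numpy sketch of that general idea, not the paper's exact formulation (function names and the margin value are illustrative):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    # Squared Euclidean distances between descriptors
    d_pos = np.sum((anchor - positive) ** 2)
    d_neg = np.sum((anchor - negative) ** 2)
    # A similar view (positive) should be closer to the anchor than a
    # dissimilar one (different object, or distant pose) by a margin
    return max(0.0, d_pos - d_neg + margin)

def pair_loss(a, b):
    # Descriptors of the same view under different imaging conditions
    # (e.g. synthetic vs. real) should coincide
    return np.sum((a - b) ** 2)
```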
We show that our constraints nicely untangle the images from different objects and different views into clusters that are not only well-separated but also structured like the corresponding sets of poses: the Euclidean distance between descriptors is large when the descriptors come from different objects, and directly related to the distance between the poses when they come from the same object. These important properties allow us to outperform state-of-the-art object view representations on challenging RGB and RGB-D data.
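Because plain Euclidean distance is meaningful between these descriptors, retrieval reduces to off-the-shelf nearest-neighbor search. A small sketch over a hypothetical descriptor database (random data stands in for real CNN outputs):

```python
import numpy as np
from scipy.spatial import cKDTree

# Hypothetical database: 1000 object-view templates, 32-D descriptors
rng = np.random.default_rng(0)
db = rng.normal(size=(1000, 32)).astype(np.float32)

# Scalable Euclidean nearest-neighbor structure over the templates
tree = cKDTree(db)

# A query descriptor very close to template 42 retrieves that template
query = db[42] + 0.001 * rng.normal(size=32).astype(np.float32)
dist, idx = tree.query(query, k=1)
print(idx)  # -> 42
```

In a real system the matched template's object label gives the identity and its known rendering pose gives the 3D pose estimate.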
Additionally, you can download the Blender file we used to render the synthetic data.
Update (Oct 15th 2015):
As stated above, the data was taken from the LineMOD dataset, where the origin of each object was defined as the central point the object stands on (more precisely, the center of the marker board on which the objects were captured). For cropping the training and test images, we defined the "center point" the camera looks at to be (0, 0, 5) (in cm, i.e. 5 cm above the ground).
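To obtain such a crop, the 3D center point is transformed into the camera frame with the ground-truth pose and projected with the camera intrinsics. A sketch, assuming a pinhole model; the intrinsic values below are placeholders to be replaced with the actual calibration:

```python
import numpy as np

# Placeholder intrinsics (fx, fy, cx, cy) -- substitute the real calibration
K = np.array([[572.4, 0.0, 325.3],
              [0.0, 573.6, 242.0],
              [0.0, 0.0, 1.0]])

def crop_center(pose, K, look_at=(0.0, 0.0, 0.05)):
    """Project the 3D 'center point' (here 5 cm above the marker board,
    given in meters) into the image, using a 4x4 world-to-camera pose."""
    p_world = np.append(look_at, 1.0)   # homogeneous world point
    p_cam = (pose @ p_world)[:3]        # camera coordinates
    u, v, w = K @ p_cam                 # pinhole projection
    return u / w, v / w                 # pixel location of the crop center
```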
Recently, however, for further work Wadim Kehl decided to go for a more practical scheme and centered all objects. You can download the updated Blender file and the corresponding ground-truth poses for the real-world data sequences here. The poses archive contains one poseXXXX.txt file per image, holding a homogeneous 4x4 transformation matrix that maps world coordinates to camera coordinates. The translation part of these matrices is in meters!
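Loading and applying such a pose file can be sketched as follows, assuming the 16 matrix entries are stored in row-major order (worth verifying against your download):

```python
import numpy as np

def load_pose(path):
    """Read a poseXXXX.txt file: a 4x4 homogeneous matrix mapping world
    coordinates to camera coordinates (assumed row-major on disk)."""
    return np.loadtxt(path).reshape(4, 4)

def world_to_camera(pose, point_world):
    """Map a 3D world point into the camera frame; translation and the
    returned coordinates are in meters."""
    p = np.append(point_world, 1.0)  # homogeneous coordinates
    return (pose @ p)[:3]
```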
Texture-Less Object Tracking with Online Training using an RGB-D Camera. Youngmin Park, Vincent Lepetit, and Woontack Woo. In Proceedings of the International Symposium on Mixed and Augmented Reality, 2011.
Dominant Orientation Templates for Real-Time Detection of Texture-Less Objects. Stefan Hinterstoisser, Vincent Lepetit, Slobodan Ilic, Pascal Fua, and Nassir Navab. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2010.