Detecting poorly textured objects and estimating their 3D pose reliably is still a very challenging problem. We introduce a simple but powerful approach to computing descriptors for object views that efficiently capture both the object identity and 3D pose.
By contrast with previous manifold-based approaches, we can rely on the Euclidean distance to evaluate the similarity between descriptors, and therefore use scalable Nearest Neighbor search methods to efficiently handle a large number of objects under a large range of poses. To achieve this, we train a Convolutional Neural Network to compute these descriptors by enforcing simple similarity and dissimilarity constraints between the descriptors.
We show that our constraints nicely untangle the images from different objects and different views into clusters that are not only well-separated but also structured like the corresponding sets of poses: the Euclidean distance between descriptors is large when the descriptors come from different objects, and directly related to the distance between the poses when the descriptors come from the same object. These important properties allow us to outperform state-of-the-art object view representations on challenging RGB and RGB-D data.
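These constraints can be illustrated with a minimal sketch of the two loss terms, assuming a triplet-style formulation; the function names, the margin value, and the exact loss form below are illustrative placeholders, not the paper's precise training objective:

```python
import numpy as np

def pair_loss(anchor, same_pose):
    # Similarity constraint: descriptors of the same object under
    # (nearly) the same pose should be close in Euclidean distance.
    return np.sum((anchor - same_pose) ** 2)

def triplet_loss(anchor, puller, pusher, margin=0.01):
    # Dissimilarity constraint: the anchor descriptor should be closer
    # to a same-object, similar-pose view (puller) than to a view of a
    # different object or a very different pose (pusher).
    d_pos = np.sum((anchor - puller) ** 2)
    d_neg = np.sum((anchor - pusher) ** 2)
    return max(0.0, 1.0 - d_neg / (d_pos + margin))
```

With descriptors trained under such constraints, recognition and pose retrieval reduce to plain Euclidean nearest-neighbor search over a database of stored view descriptors.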
There is no documentation yet other than the readme file, which explains the basics. We will provide more details over time as we receive feedback. Meanwhile, if you have questions, just contact us.
The data used in the paper is essentially the LineMOD dataset created by Stefan Hinterstoisser. However, we have our own way to render the synthetic images with Blender and a median-inpainting-filter for the real-world Kinect depth data. Thus, here we provide our version of the data for unmodified use with the code above: ape, benchviseblue, bowl, cam, can, cat, cup, driller, duck, eggbox, glue, holepuncher, iron, lamp, phone, camOrientations.txt, camPositionsElAz.txt.
Additionally, you can download the Blender file we used to render the synthetic data.
Update (Oct 15th 2015):
As stated above, the data was taken from the LineMOD dataset. There, the origin of each object was defined as a central point the object stands on (or, really, the center of the marker-board on which the objects were captured). For cropping the training and test images, we defined the "center point" at which the camera looks to be the point (0,0,5) (in cm, i.e. 5cm above the ground).
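As a rough illustration of this cropping convention, the center point can be projected into the image with standard pinhole geometry; the intrinsics K and the world-to-camera pose below are made-up placeholders, and the point is expressed in meters here, i.e. (0, 0, 0.05):

```python
import numpy as np

def project_center(K, R, t, center=(0.0, 0.0, 0.05)):
    # Project the 3D "center point" (5 cm above the ground-plane origin,
    # expressed here in meters) into the image; the crop of a training
    # or test image would then be centered on the returned pixel.
    p_cam = R @ np.asarray(center) + t   # world -> camera coordinates
    u, v, w = K @ p_cam                  # pinhole projection
    return np.array([u / w, v / w])
```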
Recently, however, for further work Wadim Kehl decided to go for a more practical scheme and centered all objects. You can download the updated Blender file and the corresponding ground-truth poses for the real-world data sequences here. The poses archive contains a poseXXXX.txt file for each image, holding a homogeneous 4x4 transformation matrix that maps world coordinates to camera coordinates. The translation part of these matrices is in meters!
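A minimal sketch of reading and applying one of these pose files, assuming each poseXXXX.txt stores the 4x4 matrix as whitespace-separated rows (the exact file layout is an assumption; check the readme):

```python
import numpy as np

def load_pose(path):
    # Load a poseXXXX.txt file as a 4x4 homogeneous matrix mapping
    # world coordinates to camera coordinates (translation in meters).
    T = np.loadtxt(path)
    assert T.shape == (4, 4)
    return T

def world_to_camera(T, p_world):
    # Apply the transform to a 3D point given in world coordinates.
    p_hom = np.append(p_world, 1.0)  # homogeneous coordinates
    return (T @ p_hom)[:3]
```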
Also, if you do not want to render the images yourself but want to use the images exactly as we cropped them, you can download the set of images cropped at the ground-truth locations and rescaled to 64x64.
TODO: New code to work with this data will follow. Again, the readme file contains information about how to use this data with the code.
Multimodal Templates for Real-Time Detection of Texture-Less Objects in Heavily Cluttered Scenes Stefan Hinterstoisser, Stefan Holzer, Cedric Cagniart, Slobodan Ilic, Kurt Konolige, Nassir Navab, and Vincent Lepetit In Proceedings of the International Conference on Computer Vision, 2011.
Texture-Less Object Tracking with Online Training using an RGB-D Camera Youngmin Park, Vincent Lepetit, and Woontack Woo In Proceedings of the International Symposium on Mixed and Augmented Reality, 2011.
Dominant Orientation Templates for Real-Time Detection of Texture-Less Objects Stefan Hinterstoisser, Vincent Lepetit, Slobodan Ilic, Pascal Fua, and Nassir Navab In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2010.