Institute of Computer Graphics and Vision

Multiple Target Detection and Tracking

Our previous approaches achieved high frame rates by performing either detection or tracking, but not both at the same time. The detector requires about 30-50 ms per frame on a mobile phone, while the tracker needs only 6-10 ms per tracked object per frame. Tracking at 30 Hz or more is therefore not a problem, but once a target (object) has been detected and is being tracked, no further targets can be detected without reducing the overall frame rate. We therefore developed new techniques that allow tracking and detection to run simultaneously at high frame rates. The following figure shows a high-level overview of our new approach:

For every frame, the system first processes all currently tracked objects in the tracking sub-system. The tracked areas are entered into a mask that is then forwarded to the detection system. This mask is used to prevent keypoint detection on areas that are already tracked and do not need to be processed during detection. The following figure shows the tracking mask for three tracked targets.
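The masking step can be illustrated with a minimal sketch. It assumes that tracked targets are represented as axis-aligned rectangles and that keypoint candidates arrive as pixel coordinates; both are simplifications chosen for illustration, and the names `build_tracking_mask` and `keep_unmasked` are hypothetical, not taken from our system. A real detector would skip masked pixels during detection rather than filter candidates afterwards.

```python
from typing import List, Tuple

# Hypothetical representation: (x, y, width, height) in pixels.
Rect = Tuple[int, int, int, int]
Point = Tuple[int, int]


def build_tracking_mask(width: int, height: int,
                        tracked: List[Rect]) -> List[List[bool]]:
    """Return a height x width mask; True marks pixels covered by a tracked target."""
    mask = [[False] * width for _ in range(height)]
    for x, y, w, h in tracked:
        # Clip each rectangle to the image bounds before filling it in.
        for row in range(max(0, y), min(height, y + h)):
            for col in range(max(0, x), min(width, x + w)):
                mask[row][col] = True
    return mask


def keep_unmasked(candidates: List[Point],
                  mask: List[List[bool]]) -> List[Point]:
    """Keep only candidates that fall outside the tracked areas."""
    return [(x, y) for x, y in candidates if not mask[y][x]]


# One tracked target covering columns 2..4 and rows 1..2 of an 8x6 image.
mask = build_tracking_mask(8, 6, [(2, 1, 3, 2)])
candidates = [(0, 0), (2, 1), (4, 2), (5, 2), (7, 5)]
remaining = keep_unmasked(candidates, mask)
```

In this toy example, the two candidates inside the tracked rectangle are dropped and the other three are passed on to the detection stage.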

The following figure shows the detected keypoints for the same camera image. In the left image, 486 keypoints were detected, whereas in the right image, because the tracked target is masked out, only 70 keypoints were found.

However, the detector is still too slow to run at high frame rates (20 fps or more) in addition to the tracker on a mobile phone. We therefore split the detection task over multiple frames. Careful timing ensures that a predefined budget is not exceeded. For example, if the target frame rate is 20 frames per second, the overall budget is 50 ms per frame minus the time for camera image acquisition, rendering, system overhead and tracking, which usually leaves less than 20 ms for the detection task. The following images show how CPU time is spent while tracking more and more targets. The timing graph uses the following color coding (from bottom to top): tracking (blue), keypoint detection (red), keypoint description (green), matching (purple), and outlier removal and pose estimation (cyan). Black represents the time left for the rest of the system.
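The budget arithmetic and the idea of spreading detection over several frames can be sketched as follows. This is an illustration under stated assumptions, not our implementation: the function names, the stage costs and the overhead figure are hypothetical, and the sketch assumes each unit of detection work is small enough to fit into a single frame's budget.

```python
from collections import deque
from typing import Deque, List, Tuple

# Hypothetical work unit: (stage name, estimated cost in milliseconds).
Task = Tuple[str, float]


def detection_budget_ms(target_fps: float, overhead_ms: float,
                        tracking_ms: float) -> float:
    """Time left for detection in one frame after overhead and tracking."""
    frame_ms = 1000.0 / target_fps
    return max(0.0, frame_ms - overhead_ms - tracking_ms)


def run_detection_slice(pending: Deque[Task], budget_ms: float) -> List[str]:
    """Run pending detection work in order until the next unit no longer fits.

    Unfinished work stays in the queue and resumes in the next frame.
    Assumes every unit costs no more than a full frame's budget.
    """
    done = []
    remaining = budget_ms
    while pending and pending[0][1] <= remaining:
        name, cost = pending.popleft()
        remaining -= cost
        done.append(name)
    return done


# 20 fps target, 20 ms assumed overhead (camera, rendering, system),
# two tracked targets at an assumed 8 ms each.
budget = detection_budget_ms(20, overhead_ms=20.0, tracking_ms=2 * 8.0)

# Assumed stage costs, for illustration only.
pending = deque([("detect", 12.0), ("describe", 10.0),
                 ("match", 8.0), ("pose", 6.0)])
frames = []
while pending:
    frames.append(run_detection_slice(pending, budget))
```

With these assumed numbers the budget is 14 ms per frame, and one full detection pass is spread over three frames: detection in the first, description in the second, and matching plus pose estimation in the third.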

More details on our approach can be found in our publication: Daniel Wagner, Dieter Schmalstieg, Horst Bischof, "Multiple Target Detection and Tracking with Guaranteed Framerates on Mobile Phones", ISMAR 2009.
The following video shows detection and tracking of up to six targets at a frame rate of 23Hz on a mobile phone.
The following video shows a sequence similar to the one in the video above, but also renders annotations that highlight our method of combined detection and tracking.