

 

Hand Tracking Using a Tree-Based Filter

 
 
Project Members

Björn Stenger
Arasanathan Thayananthan (Nanthan)
Paulo Mendonça
Philip Torr
Roberto Cipolla

 
 
Publications

see here

 
 
 
Overview
[Figure: corresponding 3D hand model]
3D hand tracking has great potential as a tool for better human-computer interaction. Tracking hands, and articulated finger motion in particular, is a challenging problem because the motion exhibits many degrees of freedom. Typically hand motion can be characterized by 27 degrees of freedom: 21 for the joint angles and 6 for orientation and location. Estimation in this high-dimensional state space given only an image (or video sequence) of a hand is difficult. Other obstacles that have limited the use of hand trackers in real applications are the handling of self-occlusion (very common in hand motion), tracking in cluttered backgrounds, and automatic tracker initialization. Note that 3D tracking is different from gesture recognition, where only a limited set of hand poses needs to be recognized.

The presented algorithm uses a tree of templates, generated from a 3D geometric hand model. The hand model is built from truncated quadrics, and its contours can be projected into the image plane while handling self-occlusion. Articulated hand motion is learned from training data collected with a data glove, leading to a lower-dimensional representation of finger motion. The likelihood cost function is based on the chamfer distance between projected contours and edges in the image. Additionally, edge orientation and skin colour information are used, making the matching more robust in cluttered backgrounds. The problem of tracker initialisation is solved by searching the tree in the first frame without the use of any prior information.

At the heart of the tracker is the tree-based filter, which approximates the optimal Bayesian filtering equations. We propose a tree-based representation of the posterior distribution, where the leaves define a partition of the state space with piecewise constant density. The advantage of this representation is that regions with low probability mass can be rapidly discarded in a hierarchical search, and the distribution can be approximated to arbitrary precision.


 
 
The Hand Model

[Figures: 3D hand model; projected contour]
The hand model is built from a set of truncated quadrics, including ellipsoids, cones and cylinders. The advantages of this representation are that the geometry is described by only a few parameters and that the contours can be generated easily using projective geometry. The projection of a quadric contour into an image is a conic. For example, the projection of an ellipsoid is an ellipse, and the projection of a cone is a pair of lines. Self-occlusion is also handled when projecting the contours, yielding usable templates. A default shape is first obtained by taking measurements from a real hand. Given the image data, shape matching can be used to estimate a set of shape parameters, including finger lengths and a width parameter.
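
As a rough illustration of how a quadric outline can be obtained with projective geometry, the sketch below uses the standard relation between a dual quadric and its dual outline conic, C* = P Q* P^T. It is not the project's code: the camera matrix, the sphere and the function names are assumptions chosen for the example, and it covers only a single non-degenerate quadric (no cones, truncation or self-occlusion).

```python
import numpy as np

def sphere_quadric(centre, radius):
    """4x4 matrix Q with X^T Q X = 0 for homogeneous points X on the sphere."""
    c = np.asarray(centre, dtype=float)
    Q = np.eye(4)
    Q[:3, 3] = -c
    Q[3, :3] = -c
    Q[3, 3] = c @ c - radius ** 2
    return Q

def project_quadric(Q, P):
    """Outline conic C (3x3) of a non-degenerate quadric Q under the 3x4 camera P."""
    Q_dual = np.linalg.inv(Q)      # dual quadric (defined up to scale)
    C_dual = P @ Q_dual @ P.T      # dual conic of the image outline
    C = np.linalg.inv(C_dual)      # point conic: x^T C x = 0 on the outline
    return C / C[0, 0]             # remove the arbitrary overall scale

# Hypothetical camera at the origin looking along +z with focal length 1.
P = np.hstack([np.eye(3), np.zeros((3, 1))])

# Unit sphere five units in front of the camera (illustrative values).
C = project_quadric(sphere_quadric((0.0, 0.0, 5.0), 1.0), P)

# For a circle centred on the principal point, C = diag(1, 1, -rho^2).
outline_radius = np.sqrt(-C[2, 2])
print(outline_radius)
```

The outline here is a circle because the sphere lies on the optical axis; an off-axis sphere or a general ellipsoid gives an ellipse from the same two matrix products.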

The model has 27 degrees of freedom: 6 for the global pose, 4 for the pose of each finger, and 5 for the pose of the thumb. However, hand motion is constrained, as each joint can only move within certain limits. Furthermore, the motion of different joints is correlated; for example, most people find it difficult to bend the middle finger and keep the ring finger extended at the same time. Go on, try it yourself. Alternatively, try bending your little finger while keeping the ring finger extended. Thus hand articulation can be expected to lie in a compact region within the high-dimensional angle space. In addition, analysis of data captured with a data glove showed that in most cases 95% of the variance is captured by the first 8 principal components. This can be exploited to reduce the dimensionality of the search space.
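
The dimensionality reduction described above can be sketched with a plain principal component analysis. The data below is synthetic and purely illustrative (21 correlated "joint angles" driven by 4 hidden factors), so the number of components it reports says nothing about real data glove recordings; the function names are likewise assumptions.

```python
import numpy as np

def pca(data):
    """Return the mean, principal axes (rows) and per-component variances."""
    mean = data.mean(axis=0)
    _, s, vt = np.linalg.svd(data - mean, full_matrices=False)
    variances = s ** 2 / (len(data) - 1)
    return mean, vt, variances

def components_for_variance(variances, fraction=0.95):
    """Smallest number of leading components explaining `fraction` of the variance."""
    cumulative = np.cumsum(variances) / variances.sum()
    return int(np.searchsorted(cumulative, fraction) + 1)

# Synthetic stand-in for glove data: 500 samples of 21 correlated joint angles.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 4))
mixing = rng.normal(size=(4, 21))
angles = latent @ mixing + 0.05 * rng.normal(size=(500, 21))

mean, axes, variances = pca(angles)
k = components_for_variance(variances, 0.95)

# Project onto the first k components and reconstruct the joint angles.
coeffs = (angles - mean) @ axes[:k].T
reconstructed = mean + coeffs @ axes[:k]
rel_error = np.linalg.norm(angles - reconstructed) / np.linalg.norm(angles - mean)
print(k, rel_error)
```

Searching over the `k` coefficients instead of all joint angles is what reduces the dimensionality of the search space.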


 
 
The Tree-Based Filter

[Figures: tree illustration (left); approximation to posterior pdf (right)]
Filtering is the problem of estimating a (hidden) state of a system given a history of observations. In our specific application the state describes the current pose of the hand (location, orientation, joint angle parameters) and the observation is the image at a particular time (or some set of features extracted from that image). A Bayesian approach to the filtering problem yields an estimate of the posterior distribution (the distribution of the state parameters given the observations) at each time step. The equations are fairly easy to derive, but they can be evaluated in closed form only in simple cases (for example, the Kalman filter applies when the dynamics and observation models are linear with Gaussian noise). Monte Carlo methods such as particle filters, known in the vision community as the Condensation algorithm, are one way to evaluate the filtering equations approximately. Particle filters go beyond the uni-modal assumption by approximating arbitrary distributions with random samples.
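
For reference, the Bayesian filtering recursion mentioned above can be written as a prediction step followed by an update step. The notation is introduced here for illustration and is not taken from the page: x_t is the state (hand pose) at time t and z_{1:t} are the observations up to time t.

```latex
% Prediction: propagate the previous posterior through the dynamics
p(x_t \mid z_{1:t-1}) = \int p(x_t \mid x_{t-1}) \, p(x_{t-1} \mid z_{1:t-1}) \, dx_{t-1}

% Update: weight the prediction by the likelihood of the new observation
p(x_t \mid z_{1:t}) = \frac{p(z_t \mid x_t) \, p(x_t \mid z_{1:t-1})}{p(z_t \mid z_{1:t-1})}
```

The integral in the prediction step and the normalising constant in the update step are what make the recursion hard to evaluate exactly for general distributions.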

We use an alternative method for evaluating the filtering equations: partitioning the state space at multiple resolutions. We propose a tree-based representation of the distribution, where the leaves define a partition of the state space with piecewise constant density. Consider the schematic example above:

left - Associated with the nodes at each level is a non-overlapping set in the state space, defining a partition of the state space (here rotation angle). The posterior distribution for each node is evaluated using the center of each set, depicted by a hand rotated by a specific angle. Sub-trees of nodes with low posterior probability are not further evaluated.

right - Corresponding posterior density (continuous) and the piecewise constant approximation using tree-based estimation. The modes of the distribution are approximated with higher precision at each level.
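
The coarse-to-fine evaluation in the schematic can be sketched for a one-dimensional rotation angle as follows. This is a simplified, single-frame illustration with a uniform prior, not the tracker itself: the likelihood function, the pruning rule (`keep_ratio`), the branching factor and all parameter values are assumptions made for the example.

```python
import math

def likelihood(angle_deg, true_angle=100.0, sigma=15.0):
    """Hypothetical observation likelihood, peaked at `true_angle`."""
    d = (angle_deg - true_angle + 180.0) % 360.0 - 180.0   # wrapped difference
    return math.exp(-0.5 * (d / sigma) ** 2)

def tree_search(lo=0.0, hi=360.0, coarse_cells=8, branching=3,
                levels=4, keep_ratio=0.1):
    """Return leaves (lo, hi, density) of a piecewise constant posterior."""
    width = (hi - lo) / coarse_cells
    cells = [(lo + i * width, lo + (i + 1) * width) for i in range(coarse_cells)]
    leaves = []
    for level in range(levels):
        # Evaluate each cell at its centre, as in the schematic.
        scored = [(a, b, likelihood(0.5 * (a + b))) for a, b in cells]
        best = max(s for _, _, s in scored)
        last = level == levels - 1
        cells = []
        for a, b, s in scored:
            if last or s < keep_ratio * best:
                leaves.append((a, b, s))        # pruned sub-tree or finest level
            else:
                step = (b - a) / branching      # refine promising cells only
                cells.extend((a + k * step, a + (k + 1) * step)
                             for k in range(branching))
    # Normalise so that the piecewise constant density integrates to one.
    z = sum((b - a) * s for a, b, s in leaves)
    return [(a, b, s / z) for a, b, s in leaves]

leaves = tree_search()
best_leaf = max(leaves, key=lambda leaf: leaf[2])
print(len(leaves), best_leaf)
```

With these illustrative settings the leaves still cover the full 360 degrees, but far fewer cells are evaluated than the 8 x 3^3 = 216 a uniform grid at the finest resolution would need.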


 
 
The Likelihood Function

[Figures: cost function (left); image showing global minimum (right)]
The likelihood function is at the heart of any estimation algorithm, as it relates the observations to the unknown state. Ideally the chosen observations should yield a likelihood with high discriminative power for detecting a hand, with as few local minima as possible. Furthermore, it should be possible to compute the likelihood (and features, if needed) with little computational overhead. For hand tracking, finding good features and a suitable likelihood function is challenging, since there are few good features which can be detected and tracked reliably (unlike faces, for example). Color values and edge contours seem to be suitable and have been used in other trackers in the past. In our case we therefore assume that the data is taken from two sets of observations: edge data and color data.

The term for the edge data is based on a chamfer distance function. The chamfer distance is the mean (or root mean square) of the distances between each point in the model point set and its closest point in the edge point set. The chamfer distance between two shapes can be efficiently computed using a distance transform (DT). This transformation takes a binary feature image as input and assigns to each pixel in the image the distance to its nearest feature. The distance between a template and an edge map can then be computed as the mean of the DT values at the template point coordinates. Edge orientation and color normal to the edge are also taken into account.
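
A minimal sketch of the chamfer cost via a distance transform is given below. It uses a brute-force distance transform that is only practical for tiny images (in practice a linear-time routine such as `scipy.ndimage.distance_transform_edt` would be the usual choice), and it omits the edge orientation and color terms. The toy edge image, the template and the function names are assumptions for illustration.

```python
import numpy as np

def distance_transform(edges):
    """Brute-force Euclidean DT: distance from each pixel to the nearest edge pixel."""
    feats = np.argwhere(edges).astype(float)                       # (F, 2) row, col
    grid = np.argwhere(np.ones(edges.shape, dtype=bool)).astype(float)  # (N, 2)
    d = np.sqrt(((grid[:, None, :] - feats[None, :, :]) ** 2).sum(axis=2))
    return d.min(axis=1).reshape(edges.shape)

def chamfer_cost(dt, template_points, offset=(0, 0)):
    """Mean DT value at the (row, col) template points shifted by `offset`."""
    pts = np.asarray(template_points) + np.asarray(offset)
    return float(dt[pts[:, 0], pts[:, 1]].mean())

# Toy edge map: the outline of a square in a 20 x 20 image.
edges = np.zeros((20, 20), dtype=bool)
edges[5, 5:15] = True
edges[14, 5:15] = True
edges[5:15, 5] = True
edges[5:15, 14] = True

dt = distance_transform(edges)        # computed once per image
template = np.argwhere(edges)         # template identical to the edge outline

cost_aligned = chamfer_cost(dt, template)                  # template on the edges
cost_shifted = chamfer_cost(dt, template, offset=(0, 2))   # shifted by 2 pixels
print(cost_aligned, cost_shifted)
```

Because the distance transform is computed once per image, each additional template or translation costs only a lookup and a mean, which is what makes matching many templates affordable.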

The color term is based on the skin color distribution. The RGB values are intensity normalized and skin color is modeled as a Gaussian distribution in this normalized space. For background pixels a uniform distribution is assumed.
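
The color term can be sketched as a per-pixel comparison of a Gaussian skin model against a uniform background model. Several choices below are assumptions, since the page does not specify them: intensity normalisation is taken to mean r = R/(R+G+B), g = G/(R+G+B); the skin mean and covariance are made-up values; and the background is taken to be uniform over the triangle of valid (r, g) values.

```python
import numpy as np

def normalised_rg(rgb):
    """Intensity-normalised chromaticity (r, g); assumed normalisation."""
    rgb = np.asarray(rgb, dtype=float)
    s = rgb.sum(axis=-1, keepdims=True)
    s[s == 0] = 1.0                      # avoid division by zero for black pixels
    return (rgb / s)[..., :2]

def skin_log_likelihood(rgb, mean, cov):
    """Log density of a 2D Gaussian skin model in normalised (r, g) space."""
    x = normalised_rg(rgb) - mean
    maha = np.einsum('...i,ij,...j->...', x, np.linalg.inv(cov), x)
    log_norm = -0.5 * np.log((2.0 * np.pi) ** 2 * np.linalg.det(cov))
    return log_norm - 0.5 * maha

# Uniform background density over the triangle r, g >= 0, r + g <= 1 (area 1/2).
BACKGROUND_LOG_LIKELIHOOD = np.log(2.0)

# Hypothetical skin model parameters, for illustration only.
skin_mean = np.array([0.45, 0.31])
skin_cov = np.diag([0.002, 0.001])

skin_pixel = skin_log_likelihood([200, 140, 110], skin_mean, skin_cov)
blue_pixel = skin_log_likelihood([30, 60, 200], skin_mean, skin_cov)
print(skin_pixel > BACKGROUND_LOG_LIKELIHOOD, blue_pixel > BACKGROUND_LOG_LIKELIHOOD)
```

In a real tracker the mean and covariance would be estimated from labelled skin pixels rather than set by hand.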

left - Surface described by the negative log-likelihood function when searching the scale and angle space, matching a hand template with the input image on the right.

right - The superimposed template corresponds to the global minimum, but there are many local minima.


 
 
 
Results

[Figures: pointing hand sequence (left); turning hand sequence (middle); opening and closing hand sequence (right)]
In two sequences we track the global 3D motion of the hand without finger articulation. The 3D rotations are limited to a hemisphere. At the leaf level, the tree has the following resolutions: 15 degrees in two 3D rotations, 10 degrees in image rotation and 5 different scales. These 12,960 templates are then combined with a search at 2-pixel resolution in the image translation space. The left and middle figures show results from tracking a pointing and an open hand, respectively, through their global motions.
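
One breakdown that reproduces the stated template count is sketched below. The angular ranges are assumptions: the page gives only the step sizes and the hemisphere restriction, so the 180 and 90 degree ranges for the two 3D rotations and the full 360 degree range for image rotation are inferred rather than quoted.

```python
def template_count(steps):
    """Number of templates as the product of the samples along each dimension."""
    total = 1
    for n in steps.values():
        total *= n
    return total

# Assumed ranges divided by the stated step sizes.
steps = {
    "rotation_1": 180 // 15,      # 12 samples (assumed 180 degree range)
    "rotation_2": 90 // 15,       # 6 samples (assumed 90 degree range)
    "image_rotation": 360 // 10,  # 36 samples (assumed full circle)
    "scales": 5,                  # stated on the page
}

print(template_count(steps))
```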

In the third sequence (right) tracking is demonstrated for global hand motion together with finger articulation. For this sequence the range of global hand motion is restricted to a smaller region, but in total 8 DOF are tracked. At the leaf level 35,000 templates are used.

In all three cases the hand model is automatically initialized by searching the complete tree in the first frame of the sequence. The images are shown with projected model contours superimposed.