|Arasanathan Thayananthan (Nanthan)|
|3D hand tracking has great potential
as a tool for better human-computer interaction. Tracking hands, in particular
articulated finger motion, is a challenging problem because the motion
exhibits many degrees of freedom. Typically, hand motion is characterized
by 27 degrees of freedom: 21 for the joint angles and 6 for orientation
and location. Estimation in this high-dimensional state space given only
an image (or video sequence) of a hand is rather difficult. Other obstacles
which have limited the use of hand trackers in real applications are the
handling of self-occlusion (very common in hand motion), tracking in cluttered
backgrounds, and automatic tracker initialization. Note that 3D tracking
is different from gesture recognition, where there is a limited set of
hand poses which need to be recognized.
The presented algorithm uses a tree of templates, generated from a 3D geometric hand model. The hand model is built from truncated quadrics, and its contours can be projected into the image plane while handling self-occlusion. Articulated hand motion is learned from training data collected with a data glove, leading to a lower-dimensional representation of finger motion. The likelihood cost function is based on the chamfer distance between projected contours and edges in the image. Additionally, edge orientation and skin color information is used, making the matching more robust in cluttered backgrounds. The problem of tracker initialization is solved by searching the tree in the first frame without the use of any prior information.
At the heart of the tracker is the tree-based filter, which approximates the optimal Bayesian filtering equations. We propose a tree-based representation of the posterior distribution, where the leaves define a partition of the state space with piecewise constant density. The advantage of this representation is that regions with low probability mass can be rapidly discarded in a hierarchical search, and the distribution can be approximated to arbitrary precision.
|The Hand Model
|The hand model is built from a set of
truncated quadrics, including ellipsoids, cones and cylinders. The advantages
of this representation are that the geometry is represented with only a few
parameters and that the contours can be generated easily using projective
geometry. The projection of a quadric contour into an image is a
conic. For example, the projection of an ellipsoid is an ellipse, and the
projection of a cone is a pair of lines. Self-occlusion is also handled
when projecting the contours, yielding usable templates. A default shape
is first obtained by taking measurements from a real hand. Given
the image data, shape matching can be used to estimate a set of shape parameters,
including finger lengths and a width parameter.
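To make the projection step concrete, here is a minimal numerical sketch (our own illustration, not the authors' code) of the standard dual-quadric identity: the outline of a quadric Q under a camera P is the conic C whose dual is P Q* Pᵀ, where * denotes the dual (adjoint) matrix:

```python
import numpy as np

def project_quadric_outline(Q, P):
    """Outline conic of a quadric surface under a pinhole camera.

    Q: 4x4 symmetric matrix of a (nonsingular) quadric, P: 3x4 camera matrix.
    Uses the dual form C* = P Q* P^T; here the dual is computed via the
    inverse, which equals the adjoint up to scale for nonsingular matrices.
    """
    Q_dual = np.linalg.inv(Q)           # dual quadric, up to scale
    C_dual = P @ Q_dual @ P.T           # dual of the outline conic
    C = np.linalg.inv(C_dual)           # outline conic, up to scale
    return C / np.linalg.norm(C)

# Example: a unit sphere viewed from distance 2 along the optical axis
# projects to a circle of radius tan(30 deg) = 1/sqrt(3).
Q = np.diag([1.0, 1.0, 1.0, -1.0])                        # x^2 + y^2 + z^2 = 1
P = np.array([[1.0, 0, 0, 0], [0, 1.0, 0, 0], [0, 0, 1.0, 2.0]])
C = project_quadric_outline(Q, P)
p = np.array([1.0 / np.sqrt(3.0), 0.0, 1.0])              # point on the outline
# p lies on the conic, i.e. p^T C p = 0
```

For an ellipsoid the same computation yields an ellipse, matching the conic types listed above; self-occlusion handling between quadric segments is a separate step not shown here.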
The model has 27 degrees of freedom: 6 for the global pose, 4 for the pose of each finger, and 5 for the pose of the thumb. However, hand motion is constrained, as each joint can only move within certain limits. Furthermore, the motion of different joints is correlated; for example, most people find it difficult to bend the middle finger while keeping the ring finger extended at the same time. Go on, try it yourself. Alternatively, try bending your little finger while keeping the ring finger extended. Thus hand articulation can be expected to lie in a compact region within the high-dimensional angle space. Analysis of data captured with a data glove shows that in most cases 95% of the variance is captured by the first 8 principal components. This can be exploited to reduce the dimensionality of the search space.
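The dimensionality reduction described above is principal component analysis on the joint-angle data; a minimal sketch (our own illustration, with synthetic data standing in for glove recordings) might look like:

```python
import numpy as np

def fit_angle_subspace(angles, var_target=0.95):
    """PCA on joint-angle training data (e.g. from a data glove).

    angles: (N, 21) array of joint-angle vectors.
    Returns the mean and the principal components needed to capture
    var_target of the variance (around 8 components in the experiments).
    """
    mean = angles.mean(axis=0)
    centered = angles - mean
    # SVD of the centered data: rows of vt are the principal directions
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    var = s ** 2 / (len(angles) - 1)
    cum = np.cumsum(var) / var.sum()
    k = int(np.searchsorted(cum, var_target)) + 1
    return mean, vt[:k]

def project(angles, mean, components):
    """Map 21-D joint angles to the low-dimensional subspace and back."""
    coeffs = (angles - mean) @ components.T
    return coeffs, coeffs @ components + mean
```

The tree-based search then operates on the low-dimensional coefficients rather than on the 21 raw joint angles, which is what makes articulated search tractable.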
|The Tree-Based Filter
|Filtering is the problem of estimating
a (hidden) state of a system given a history of observations. In our specific
application the state describes the current pose of the hand (location,
orientation, joint angle parameters) and the observation is the image at
a particular time (or some set of features extracted from that image).
A Bayesian approach to the filtering problem yields an estimate of the
posterior distribution (the distribution of the state parameters given
the observations) at each time step. The equations are fairly easy to derive,
however, they are hard to evaluate in practice for all but simple cases
(e.g. the Kalman filter, which applies only to linear models with Gaussian
distributions). Monte Carlo methods such as particle filters, also known
in the vision community as the Condensation algorithm, are one way to
evaluate the filtering equations.
Particle filters go beyond the uni-modal assumption by approximating arbitrary
distributions with random samples.
We use an alternative method for evaluating the filtering equations: partitioning the state space at multiple resolutions. The posterior is represented by a tree, whose leaves define a partition of the state space with piecewise constant density. Consider the schematic example below:
left - Associated with the nodes at each level is a non-overlapping set in the state space, defining a partition of the state space (here a rotation angle).
right - Corresponding posterior density (continuous) and the piecewise constant approximation obtained by tree-based estimation. The modes of the distribution are approximated with higher precision at each level.
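The discard-and-refine idea can be sketched in one dimension. This greedy pruning sketch is our simplification (the actual filter propagates probability mass through all surviving nodes, not just the best path), but it shows how low-probability regions are dropped early while the mode is resolved at finer and finer resolution:

```python
import numpy as np

def tree_search(log_post, lo, hi, depth=8, branch=2, keep_frac=0.5):
    """Hierarchical partition refinement (illustrative 1-D sketch).

    At each level the surviving intervals are split into `branch` children,
    each child is scored by the (unnormalized) log-posterior at its centre,
    and only the best-scoring fraction is refined further.
    """
    nodes = [(lo, hi)]
    for _ in range(depth):
        children = []
        for a, b in nodes:
            w = (b - a) / branch
            children += [(a + i * w, a + (i + 1) * w) for i in range(branch)]
        scores = [log_post(0.5 * (a + b)) for a, b in children]
        order = np.argsort(scores)[::-1]
        keep = max(1, int(len(children) * keep_frac))
        nodes = [children[i] for i in order[:keep]]
    # return the centre of the best surviving leaf
    best = max(nodes, key=lambda ab: log_post(0.5 * (ab[0] + ab[1])))
    return 0.5 * (best[0] + best[1])
```

With depth 8 the mode of a unimodal posterior on [0, 1] is localized to within about 1/256, while only a small fraction of the leaf-level cells is ever evaluated.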
|The Likelihood Function
|The likelihood function is at the
heart of any estimation algorithm, as it relates the observations to the
unknown state. Ideally the chosen observations should yield a likelihood
with high discriminative power for detecting a hand with as few local minima
as possible. Furthermore it should be possible to compute the likelihood
(and features, if needed) with little computational overhead. For hand
tracking, finding good features and a suitable likelihood function is challenging,
since there are few good features which can be detected and tracked reliably
(unlike faces, for example). Color values and edge contours seem to be
suitable and have been used in other trackers in the past. In our
case we therefore assume that the data is taken from two sets of observations:
edge data and color data.
The term for the edge data is based on a chamfer distance function. The chamfer distance is the mean (or root mean squared average) of the distances between each point in the model point set and its closest point in the edge point set. The chamfer distance between two shapes can be efficiently computed using a distance transform (DT). This transformation takes a binary feature image as input and assigns to each pixel the distance to its nearest feature. The distance between a template and an edge map can then be computed as the mean of the DT values at the template point coordinates. Edge orientation and the color normal to the edge are also taken into account.
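A sketch of the chamfer cost computed via a distance transform (using SciPy's Euclidean DT; the function and variable names are our own, and the orientation and color terms mentioned above are omitted):

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def chamfer_cost(edge_map, template_points):
    """Mean distance-transform value sampled at template contour points.

    edge_map: boolean array, True at detected edge pixels.
    template_points: (N, 2) integer array of (row, col) contour coordinates.
    """
    # The DT assigns to each pixel the distance to the nearest edge pixel,
    # so we transform the complement of the edge map (edges become zeros).
    dt = distance_transform_edt(~edge_map)
    rows, cols = template_points[:, 0], template_points[:, 1]
    return dt[rows, cols].mean()
```

The key efficiency point is that the DT is computed once per frame, after which evaluating any of the thousands of templates costs only one array lookup per contour point.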
The color term is based on the skin color distribution. The RGB values are intensity normalized and skin color is modeled as a Gaussian distribution in this normalized space. For background pixels a uniform distribution is assumed.
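A per-pixel sketch of this color term (our own illustration; the Gaussian parameters below are placeholder assumptions, where a real tracker would fit them to labelled skin pixels):

```python
import numpy as np

def skin_log_likelihood_ratio(rgb, mean, cov):
    """Per-pixel log-likelihood ratio of skin vs. uniform background,
    in intensity-normalized (r, g) chromaticity space.

    rgb: (H, W, 3) float image; mean: (2,) and cov: (2, 2) define the
    Gaussian skin-color model in normalized (r, g) coordinates.
    """
    s = rgb.sum(axis=2, keepdims=True) + 1e-8
    rg = (rgb / s)[..., :2]               # normalized (r, g); b = 1 - r - g
    d = rg - mean
    inv = np.linalg.inv(cov)
    maha = np.einsum('...i,ij,...j->...', d, inv, d)
    log_norm = -np.log(2.0 * np.pi * np.sqrt(np.linalg.det(cov)))
    log_skin = log_norm - 0.5 * maha      # 2-D Gaussian log-density
    log_bg = 0.0                          # uniform background density
    return log_skin - log_bg
```

Positive values indicate pixels better explained by the skin model than by the uniform background model; the intensity normalization makes the term largely insensitive to shading.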
left - Surface described by the negative log-likelihood function when searching the scale and angle space, matching a hand template with the input image on the right.
right - The superimposed template corresponds to the global minimum, but there are many local minima.
|In two sequences we track the global
3D motion of the hand without finger articulation. The 3D rotations are
limited to a hemisphere. At the leaf
level, the tree has the following resolutions: 15 degrees in the two 3D rotation angles, 10 degrees in image rotation, and 5 different scales. These 12,960 templates are then combined with a search at 2-pixel resolution in the image translation space. Figures (left) and (middle) show results from tracking a pointing hand and an open hand, respectively, through their global motions.
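As a quick consistency check on the leaf count (the individual step counts below are our assumptions; the text states only the angular resolutions and the total):

```python
# One factorization consistent with the stated resolutions:
# 12 steps of 15 degrees in each of the two 3D rotation angles
# covering the hemisphere, 18 steps of 10-degree image rotation,
# and 5 scales.
leaf_templates = 12 * 12 * 18 * 5
print(leaf_templates)  # 12960, matching the stated number of templates
```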
In the third sequence (right), tracking is demonstrated for global hand motion together with finger articulation. For this sequence the range of global hand motion is restricted to a smaller region, but in total 8 DOF are tracked, using 35,000 templates at the leaf level.
In all three cases the hand model is automatically
initialized by searching the complete tree in the first frame of
the sequence. The images are shown with projected model contours superimposed.