| Project Members
|
| Björn Stenger |
| Arasanathan Thayananthan (Nanthan) |
| Paulo Mendonça |
| Philip Torr |
| Roberto Cipolla |
| Publications
|
| see here |
| Overview
|
![]() ![]() |
| 3D hand tracking has great potential
as a tool for better human-computer interaction. Tracking hands, in particular
articulated finger motion, is a challenging problem because the motion
exhibits many degrees of freedom. Typically, hand motion can be characterized
by 27 degrees of freedom: 21 for the joint angles and 6 for orientation
and location. Estimation in this high-dimensional state space given only
an image (or video sequence) of a hand is difficult. Other obstacles
which have limited the use of hand trackers in real applications are the
handling of self-occlusion (very common in hand motion), tracking in cluttered
backgrounds, and automatic tracker initialization. Note that 3D tracking
is different from gesture recognition, where there is a limited set of
hand poses which need to be recognized.
The presented algorithm uses a tree of templates, generated from a 3D geometric hand model. The hand model is built from truncated quadrics and its contours can be projected into the image plane while handling self-occlusion. Articulated hand motion is learned from training data collected with a data glove, leading to a lower dimensional representation of finger motion.
The likelihood cost function is based on the chamfer distance between projected contours and edges in the image. Additionally, edge orientation and skin colour information are used, making the matching more robust in cluttered backgrounds. The problem of tracker initialisation is solved by searching the tree in the first frame without the use of any prior information.
At the heart of the tracker is the tree-based filter, which approximates the optimal Bayesian filtering equations. We propose a tree-based representation of the posterior distribution, where the leaves define a partition of the state space with piecewise constant density. The advantage of this representation is that regions with low probability mass can be rapidly discarded in a hierarchical search, and the distribution can be approximated to arbitrary precision. |
| The Tree-Based Filter
|
![]() ![]() |
| Filtering is the problem of estimating
a (hidden) state of a system given a history of observations. In our specific
application the state describes the current pose of the hand (location,
orientation, joint angle parameters) and the observation is the image at
a particular time (or some set of features extracted from that image).
A Bayesian approach to the filtering problem yields an estimate of the
posterior distribution (the distribution of the state parameters given
the observations) at each time step. The equations are fairly easy to derive,
but they are hard to evaluate in practice for all but simple cases
(e.g. the Kalman filter, which is exact for linear systems with normal distributions). Monte Carlo
methods such as particle filters, known in the vision community as
the Condensation algorithm, are one way to evaluate the filtering equations.
Particle filters go beyond the uni-modal assumption by approximating arbitrary
distributions with random samples.
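For reference, the filtering equations mentioned above are commonly written as a prediction step followed by an update step. The notation is an assumption and is not taken from this page: x_t denotes the hand state at time t, z_t the observation, and z_{1:t} all observations up to time t. The sketch also assumes a first-order Markov state and observations that are conditionally independent given the state.

```latex
\begin{align}
% Prediction: propagate the previous posterior through the dynamics
p(x_t \mid z_{1:t-1}) &= \int p(x_t \mid x_{t-1})\, p(x_{t-1} \mid z_{1:t-1})\, \mathrm{d}x_{t-1} \\
% Update: weight the prediction by the likelihood of the new observation
p(x_t \mid z_{1:t}) &= \frac{p(z_t \mid x_t)\, p(x_t \mid z_{1:t-1})}
                           {\int p(z_t \mid x_t)\, p(x_t \mid z_{1:t-1})\, \mathrm{d}x_t}
\end{align}
```

The integrals are what make the equations hard to evaluate in general: they have closed forms only in special cases, which is why sampling or partitioning schemes are used instead.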
We use an alternative method for evaluating the filtering equations: partitioning the state space at multiple resolutions. We propose a tree-based representation of the distribution, where the leaves define a partition of the state space with piecewise constant density. Consider the schematic example above.
left - Each node at a given level is associated with a set in the state space; the sets at one level do not overlap and together define a partition of the state space (here rotation angle).
right - Corresponding posterior density (continuous) and the piecewise constant approximation using tree-based estimation. The modes of the distribution are approximated with higher precision at each level. |
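The coarse-to-fine idea can be illustrated with a minimal sketch on a one-dimensional state space (a rotation angle). This is not the tracker's implementation: it covers only the piecewise constant representation at a single time step, with no dynamics. The function name `tree_posterior`, the centre-point evaluation, the `keep_ratio` pruning rule, and the bimodal `example_density` are all assumptions made for illustration.

```python
import math


def tree_posterior(density, lo, hi, coarse_cells=12, levels=4, branch=2, keep_ratio=0.05):
    """Approximate `density` on [lo, hi) by a piecewise constant function.

    Returns a list of (left, right, mass) leaves that partition [lo, hi);
    the masses are normalized to sum to 1.
    """
    width = (hi - lo) / coarse_cells
    cells = [(lo + i * width, lo + (i + 1) * width) for i in range(coarse_cells)]
    leaves = []
    for level in range(levels):
        # Score each cell by the density at its centre times the cell width.
        scored = [(a, b, density(0.5 * (a + b)) * (b - a)) for a, b in cells]
        best = max(mass for _, _, mass in scored)
        cells = []
        for a, b, mass in scored:
            if level < levels - 1 and mass >= keep_ratio * best:
                # Promising region: split into `branch` children for the next level.
                step = (b - a) / branch
                cells.extend((a + i * step, a + (i + 1) * step) for i in range(branch))
            else:
                # Low-mass region (or finest level): keep it as a leaf.
                leaves.append((a, b, mass))
    total = sum(mass for _, _, mass in leaves)
    if total == 0.0:
        raise ValueError("density is zero at every evaluated point")
    return [(a, b, mass / total) for a, b, mass in leaves]


def example_density(angle):
    """Hypothetical bimodal, unnormalized density over an angle in degrees."""
    return (0.7 * math.exp(-0.5 * ((angle - 100.0) / 12.0) ** 2)
            + 0.3 * math.exp(-0.5 * ((angle - 250.0) / 15.0) ** 2))


leaves = tree_posterior(example_density, 0.0, 360.0)
for left, right, mass in sorted(leaves):
    print(f"[{left:7.2f}, {right:7.2f})  mass={mass:.4f}")
```

Cells far from both modes stay at the coarse resolution, while cells near the modes are subdivided at every level. Evaluating only the cell centre can miss a narrow mode that falls between centres, so the coarse partition has to be fine enough for the densities expected in practice.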
| The Likelihood Function
|
![]() ![]() |
| The likelihood function is at the
heart of any estimation algorithm, as it relates the observations to the
unknown state. Ideally the chosen observations should yield a likelihood
with high discriminative power for detecting a hand with as few local minima
as possible. Furthermore, it should be possible to compute the likelihood
(and features, if needed) with little computational overhead. For hand
tracking, finding good features and a suitable likelihood function is challenging,
since there are few good features which can be detected and tracked reliably
(unlike faces, for example). Color values and edge contours seem to be
suitable and have been used in other trackers in the past. In our
case we therefore assume that the data is taken from two sets of observations:
edge data and color data.
The term for the edge data is based on a chamfer distance function. The chamfer distance is the mean (or root mean square) of the distances between each point in the model point set and its closest point in the edge point set. The chamfer distance between two shapes can be efficiently computed using a distance transform (DT). This transformation takes a binary feature image as input and assigns to each pixel in the image the distance to its nearest feature. The distance between a template and an edge map can then be computed as the mean of the DT values at the template point coordinates. Edge orientation and color normal to the edge are also taken into account.
The color term is based on the skin color distribution. The RGB values are intensity normalized and skin color is modeled as a Gaussian distribution in this normalized space. For background pixels a uniform distribution is assumed.
left - Surface described by the negative log-likelihood function when searching the scale and angle space, matching a hand template with the input image on the right.
right - The superimposed template corresponds to the global minimum, but there are many local minima. |
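The chamfer matching step described above can be sketched in a few lines. This is an illustration under stated assumptions, not the tracker's code: the distance transform is computed by brute force in pure Python (a real system would use an optimized routine), the edge map and template are small hypothetical examples, and the search covers translations only. The function names `distance_transform` and `chamfer_distance` are chosen for this sketch.

```python
import math


def distance_transform(edge_map):
    """Brute-force Euclidean distance transform of a binary edge map.

    `edge_map` is a list of rows; truthy entries mark edge pixels. Each output
    pixel holds the distance to its nearest edge pixel.
    """
    edges = [(r, c) for r, row in enumerate(edge_map) for c, v in enumerate(row) if v]
    if not edges:
        raise ValueError("edge map contains no edge pixels")
    return [[min(math.hypot(r - er, c - ec) for er, ec in edges)
             for c in range(len(row))]
            for r, row in enumerate(edge_map)]


def chamfer_distance(dt, template_points, offset=(0, 0)):
    """Mean DT value at the template points (row, col) shifted by `offset`."""
    dr, dc = offset
    total = 0.0
    for r, c in template_points:
        rr, cc = r + dr, c + dc
        if not (0 <= rr < len(dt) and 0 <= cc < len(dt[0])):
            raise ValueError("template point falls outside the image")
        total += dt[rr][cc]
    return total / len(template_points)


# Hypothetical 8x8 edge map with a vertical edge segment in column 5, rows 2-5.
edge_map = [[1 if (c == 5 and 2 <= r <= 5) else 0 for c in range(8)] for r in range(8)]
dt = distance_transform(edge_map)

# Template: a vertical segment of four points, anchored at the origin.
template = [(r, 0) for r in range(4)]

# Exhaustive search over translations that keep the template inside the image.
offsets = [(dr, dc) for dr in range(5) for dc in range(8)]
best_offset = min(offsets, key=lambda o: chamfer_distance(dt, template, o))
print("best offset:", best_offset)
```

The distance transform is computed once per image, after which each candidate template costs only one lookup per template point. That is what makes it practical to evaluate many templates against the same frame.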