As part of our ongoing effort to investigate computationally efficient models for real-time recognition of human activities at the QMW Machine Vision Research Group, we have developed an integrated system platform, VIGOUR (Visual Interaction based on Gestures and behaviOUR).
VIGOUR provides a graphical user interface for real-time image
sequence input, from either pre-recorded video on disk or live camera
streams, and for processed output on the X11 display. VIGOUR integrates
modules from several other research projects conducted at QMW. This
configuration allows rapid prototyping of software for research
experiments and real-time system demonstrations. The system is flexibly
configurable, so there is no single fixed task that VIGOUR performs;
rather, the software is configured for a given experiment or task.
Three Methodologies that form the basis for VIGOUR
The system design is based on three core methodologies used widely at QMW:
The cues and perceptual modules integrated by VIGOUR are:
An example of this integration is shown in the data-flow diagram below, which applies when VIGOUR is used for real-time camera control in visually mediated interaction.
Image sequences are taken from a camera, or from a pre-recorded video
sequence on disk. Frame differencing is used to detect motion, while a
Gaussian mixture colour model is used to segment skin-coloured regions
in each image frame. Initially the binary skin image is clustered
into regions that are searched for a near-frontal face using a support
vector machine. Detected faces are used to initialise head and body
trackers. The tracker is used for camera orientation and extraction
of object-centred features. The features are then used to form a
spatio-temporal trajectory of the individual's behaviour over time.
View-based gesture models, which have been previously trained using VIGOUR,
are matched to an accumulated temporal trajectory of the extracted features.
With a heuristic mechanism for detecting whether a gesture has occurred,
the probabilities can be plotted along with an indicator that a gesture
has occurred. The gesture events are interpreted to control the camera.
This processing is performed in real time at approximately 4 frames per
second on a 330 MHz Pentium II machine.
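The front end of the pipeline above can be sketched as follows. This is a minimal illustration, not VIGOUR's actual code: the choice of normalised (r, g) chromaticity for the colour space, the difference threshold, and the mixture parameters are all assumptions made for the example.

```python
import numpy as np

def motion_mask(prev, curr, thresh=20):
    """Frame differencing: flag pixels whose grey-level change exceeds thresh."""
    return np.abs(curr.astype(int) - prev.astype(int)) > thresh

def skin_mask(rgb, means, covs, weights, thresh=1e-4):
    """Evaluate a Gaussian mixture colour model and threshold the likelihood
    to produce the binary skin image.  Normalised (r, g) chromaticity is a
    common choice of colour space; the space VIGOUR used is an assumption here."""
    s = rgb.sum(axis=-1, keepdims=True) + 1e-9
    rg = (rgb / s)[..., :2]                     # (H, W, 2) chromaticities
    lik = np.zeros(rg.shape[:2])
    for w, mu, cov in zip(weights, means, covs):
        d = rg - mu
        inv = np.linalg.inv(cov)
        md = np.einsum('...i,ij,...j->...', d, inv, d)   # Mahalanobis distance
        norm = 1.0 / (2 * np.pi * np.sqrt(np.linalg.det(cov)))
        lik += w * norm * np.exp(-0.5 * md)
    return lik > thresh
```

The two binary images produced by these functions feed the clustering and face-detection stages described above.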
System Working Examples
Here we illustrate two different real-time applications of VIGOUR: face tracking and Visually Mediated Interaction.
Face tracking can be performed in two different ways. In both examples, colour and motion are fused at the pixel level with an AND rule, and clustering is performed on the resulting regions of "moving skin".
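The AND-rule fusion and the clustering of "moving skin" pixels can be sketched as below. The flood-fill labelling is a generic stand-in for whatever clustering VIGOUR actually used:

```python
import numpy as np
from collections import deque

def moving_skin(skin, motion):
    # AND-rule fusion: a pixel survives only if it is skin-coloured AND moving
    return skin & motion

def cluster(mask):
    """Label 4-connected regions of a binary mask with a simple BFS flood fill;
    returns the label image and the number of regions found."""
    labels = np.zeros(mask.shape, int)
    count = 0
    for start in zip(*np.nonzero(mask)):
        if labels[start]:
            continue
        count += 1
        labels[start] = count
        q = deque([start])
        while q:
            y, x = q.popleft()
            for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                if (0 <= ny < mask.shape[0] and 0 <= nx < mask.shape[1]
                        and mask[ny, nx] and not labels[ny, nx]):
                    labels[ny, nx] = count
                    q.append((ny, nx))
    return labels, count
```

Each labelled region is then a candidate for the face search or tracker initialisation.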
In the first example, tracking is based only on skin colour, and is therefore tolerant to large pose changes. Occlusions are dealt with by temporal prediction using a Kalman filter.
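Temporal prediction through an occlusion can be illustrated with a constant-velocity Kalman filter over the region centroid. The state layout and noise settings here are illustrative assumptions, not VIGOUR's actual values; the key point is that during an occlusion one simply keeps predicting without a measurement update:

```python
import numpy as np

class CentroidKalman:
    """Constant-velocity Kalman filter over (x, y, vx, vy)."""
    def __init__(self, x, y, q=1.0, r=4.0):
        self.s = np.array([x, y, 0.0, 0.0])
        self.P = np.eye(4) * 10.0                       # initial uncertainty
        self.F = np.array([[1, 0, 1, 0], [0, 1, 0, 1],
                           [0, 0, 1, 0], [0, 0, 0, 1]], float)
        self.H = np.array([[1, 0, 0, 0], [0, 1, 0, 0]], float)
        self.Q = np.eye(4) * q                          # process noise
        self.R = np.eye(2) * r                          # measurement noise
    def predict(self):
        self.s = self.F @ self.s
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.s[:2]
    def update(self, z):
        y = np.asarray(z, float) - self.H @ self.s      # innovation
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.s = self.s + K @ y
        self.P = (np.eye(4) - K @ self.H) @ self.P
        return self.s[:2]
```

While the target is visible, each frame runs predict then update; during an occlusion only predict is called, so the estimate coasts along the learned velocity.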
Click for MPEG example (1.2 MB)
In the following two MPEG examples, the clustered regions (shown in green) are searched for faces at multiple scales and positions using a Support Vector Machine face detector at every frame. Note that the detector is tolerant to some deviation from frontal pose.
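A multi-scale, multi-position window search of this kind can be sketched as follows. The trained SVM itself is not reproduced; `score_fn` is a placeholder for its decision function, and the window size, step, and subsampling scheme are assumptions for the example:

```python
import numpy as np

def multiscale_face_search(image, score_fn, win=8, step=4, strides=(1, 2), thresh=0.0):
    """Slide a fixed-size window over the image at several scales (coarser
    scales via integer subsampling) and report every window whose classifier
    score exceeds thresh, mapped back to original-image coordinates."""
    hits = []
    for k in strides:
        small = image[::k, ::k]                  # crude nearest-neighbour downscale
        H, W = small.shape
        for y in range(0, H - win + 1, step):
            for x in range(0, W - win + 1, step):
                if score_fn(small[y:y + win, x:x + win]) > thresh:
                    hits.append((y * k, x * k, win * k))
    return hits
```

In a real detector `score_fn` would evaluate the SVM on the normalised patch; here any scoring function with the same shape will do.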
Click for MPEG example (5.6 MB)
Click for MPEG example (3.6 MB)
Visually Mediated Interaction
In this example VIGOUR has been used for real-time intentional tracking, scene interpretation and camera control. The video conferencing scenario is illustrated below. In one room (left) there are two people communicating with a third in a different room. There is a video link from the two-person room to the single person, and an audio link via a speaker phone. A single pan-tilt-zoom camera provides all visual information in the first room, and zooms in on the current speaker. The cropped face of the speaker is transmitted to the third person.
The communicants can wave to get a close up, perform a dismissive gesture to get a zoomed-out head-and-shoulders shot, and point to cause the camera to pan to the other speaker. The system automatically detects people and zooms in on them initially.
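The mapping from recognised gestures to camera actions described above can be written as a simple dispatch. The event and command names here are invented for illustration and mirror the behaviour described, not VIGOUR's actual control code:

```python
def camera_command(gesture, current_target, other_target):
    """Map a detected gesture event to a pan-tilt-zoom camera command.
    Event/command names are assumptions; the mapping follows the
    behaviour described in the text."""
    if gesture == "wave":
        return ("zoom_in", current_target)          # close-up on the gesturer
    if gesture == "dismiss":
        return ("zoom_out", current_target)         # head-and-shoulders shot
    if gesture == "point":
        return ("pan_to", other_target)             # hand over to the other person
    return ("hold", current_target)                 # no gesture: keep current shot
```

The system's initial behaviour, detecting people and zooming in on them, would issue the same commands from the detection stage rather than from a gesture event.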
Below is an AVI example of the system output during real-time operation, using object-centred features for gesture recognition. In the video window, the magenta box frames the tracked head, the green box is the subject's left hand, and the red box the right hand. The green binary image is frame differencing output, the amber binary image shows skin-coloured pixels. The central sub-window shows gesture detections and the small face view is the cropped close-up shown to the remote communicant. The lower part of the screen shows gesture model likelihoods plotted versus time. The frame rate is approx. 4.5 fps.
AVI example of real-time VMI (47.6 MB!!!)
Here is another example showing earlier results, using global features extracted from the motion frame for gesture recognition. The small white squares show the x-y centroid of the binary motion image, which are used as features for gesture classification.
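The global feature used in this earlier version, the x-y centroid of the binary motion image, is straightforward to compute:

```python
import numpy as np

def motion_centroid(mask):
    """x-y centroid of a binary motion image, used as a global feature
    for gesture classification; None when no motion pixels are present."""
    ys, xs = np.nonzero(mask)
    if xs.size == 0:
        return None
    return float(xs.mean()), float(ys.mean())
```

A trajectory of such centroids over successive frames forms the input to the gesture classifier.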
MPEG example of real-time VMI, earlier results (1.5 MB)
Other examples may be found on the ISCANIT-related pages.
Publications
Jamie Sherrah and Shaogang Gong, "VIGOUR: A System for Tracking and Recognition of Multiple People and their Activities", to appear in Proceedings of the International Conference on Pattern Recognition, Barcelona, Spain, September 2000.
Jamie Sherrah, Shaogang Gong, Jonathan Howell and Hilary Buxton, "Interpretation of Group Behaviour in Visually Mediated Interaction", to appear in Proceedings of the International Conference on Pattern Recognition, Barcelona, Spain, September 2000.