VIGOUR: An Integrated Vision System for Research in Human Recognition


As part of our ongoing effort at the QMW Machine Vision Research Group to investigate computationally efficient models for real-time recognition of human activities, we have developed an integrated system platform, VIGOUR (Visual Interaction based on Gestures and behaviOUR).

VIGOUR provides a graphical user interface for real-time image sequence input, either from pre-recorded video on disk or from live capture, and displays processed output on the X11 display. It integrates modules from several other research projects conducted at QMW, allowing rapid prototyping of software for research experiments and real-time system demonstrations. The system is flexibly configurable: rather than performing one fixed task, it is configured for a given experiment or demonstration.
 

Three Methodologies that form the basis for VIGOUR

The system design is based on three core philosophies used widely at QMW:

  1. view-based approach: all input comes from a single camera view, rather than from 3D or stereo sensors.

  2. minimalist modelling: only the bare minimum of computation or model complexity should be used to achieve real-time performance.  For example, tracking a face may require only that a skin-coloured cluster be followed, rather than face detection at each frame.

  3. perceptual integration: exploit the complementary qualities of different simple visual cues, such as colour and motion, rather than relying on a single modality with complicated processing.  The simplicity of these cues facilitates real-time processing, but the challenge is in integrating them effectively.


Perceptual Integration 

The cues and perceptual modules integrated by VIGOUR are:

  1. skin colour: calculated using Gaussian mixture models in hue-saturation space (see the sketch after this list)
  2. motion: estimated roughly using frame differencing
  3. face detection: for near frontal views, using a Support Vector Machine
  4. clustering: to identify regions of interest, such as skin-coloured body parts
  5. head pose estimation: using similarity measures and the CONDENSATION algorithm
  6. head tracker: tracks the head based on skin colour
  7. human body modelling: a Bayesian network-based tracker that can robustly cope with discontinuous motion and occlusions.
  8. feature extraction: either motion-based global features or object-centred features from the body model
  9. gesture recognition: models previously trained from example sequences using VIGOUR are matched against the extracted features to recognise gestures performed by tracked subjects.
  10. behaviour interpretation: simple rules to control the pan-tilt-zoom camera in response to gestures.
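As a concrete illustration of the first cue in the list above, the sketch below shows how a skin-colour probability map might be evaluated with a Gaussian mixture model in hue-saturation space. The mixture parameters and the likelihood threshold are purely illustrative placeholders; in VIGOUR they would be learned from labelled skin pixels.

```python
import numpy as np

def skin_probability(hs_pixels, weights, means, covs):
    """Evaluate a Gaussian mixture over hue-saturation pixels.

    hs_pixels : (N, 2) array of (hue, saturation) values in [0, 1]
    weights   : (K,)   mixture weights, summing to 1
    means     : (K, 2) component means
    covs      : (K, 2, 2) component covariances
    Returns an (N,) array of skin likelihoods.
    """
    prob = np.zeros(hs_pixels.shape[0])
    for w, mu, cov in zip(weights, means, covs):
        diff = hs_pixels - mu                              # (N, 2)
        inv = np.linalg.inv(cov)
        norm = 1.0 / (2.0 * np.pi * np.sqrt(np.linalg.det(cov)))
        mahal = np.einsum('ni,ij,nj->n', diff, inv, diff)  # squared Mahalanobis distance
        prob += w * norm * np.exp(-0.5 * mahal)
    return prob

# Example: two-component skin model (parameter values are illustrative only).
weights = np.array([0.6, 0.4])
means = np.array([[0.05, 0.45], [0.08, 0.60]])
covs = np.array([[[0.002, 0.0], [0.0, 0.02]],
                 [[0.003, 0.0], [0.0, 0.03]]])
pixels = np.random.rand(1000, 2)                           # stand-in for an HS image
skin_mask = skin_probability(pixels, weights, means, covs) > 1.0
```

In practice the map would be computed once per frame over the whole image and thresholded to give the binary skin image used by the later stages.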


An example of this integration is shown in the data flow diagram below, which applies when VIGOUR is used for real-time camera control in visually mediated interaction.


(Click to enlarge)

Image sequences are taken either from a camera or from a pre-recorded video sequence on disk. Frame differencing is used to detect motion, while a Gaussian mixture colour model segments skin-coloured regions in each image frame.  Initially, the binary skin image is clustered into regions that are searched for a near-frontal face using a support vector machine.  Detected faces are used to initialise head and body trackers.  The tracker is used for camera orientation and for the extraction of object-centred features, which are accumulated over time into a spatio-temporal trajectory describing the individual's behaviour.  View-based gesture models, previously trained using VIGOUR, are matched against this trajectory.  Using a heuristic mechanism for deciding whether a gesture has occurred, the model probabilities can be plotted together with an indicator of each detection.  The gesture events are then interpreted to control the camera.  This processing runs in real time at approximately 4 frames per second on a PII 330 MHz machine.
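The final stage of the pipeline above, matching accumulated feature trajectories against gesture models and flagging gesture events, could take roughly the following form. This is only a sketch of one plausible heuristic: the function name, the windowed log-likelihood smoothing and the threshold values are assumptions, not the mechanism actually implemented in VIGOUR.

```python
import numpy as np

def gesture_events(loglik, threshold=-5.0, window=10, gap=15):
    """Heuristic gesture detector over per-frame model log-likelihoods.

    loglik : (T, G) array, per-frame log-likelihood of each of G gesture models.
    A gesture is reported when a model's mean log-likelihood over the last
    `window` frames exceeds `threshold`, with at least `gap` frames between
    successive detections.  All constants are illustrative.
    """
    events = []
    last = -gap
    for t in range(window, loglik.shape[0]):
        recent = loglik[t - window:t].mean(axis=0)       # smooth over the window
        best = int(np.argmax(recent))
        if recent[best] > threshold and t - last >= gap:
            events.append((t, best))                     # (frame index, gesture id)
            last = t
    return events

# Example with synthetic scores for three gesture models.
T, G = 200, 3
scores = np.full((T, G), -20.0) + np.random.randn(T, G)
scores[80:100, 1] = -2.0                                 # a strong match between frames 80-100
print(gesture_events(scores))
```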
 

System Working Examples

Here we illustrate two different real-time applications of VIGOUR: face tracking and Visually Mediated Interaction.

Face Tracking

Here are two different ways of performing face tracking.  In both examples, colour and motion are fused at the pixel level with an AND rule, and the resulting "moving skin" pixels are then clustered into regions.
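A minimal sketch of this pixel-level AND fusion and clustering step is given below, using connected-component labelling to stand in for whatever clustering scheme VIGOUR uses; the function name and the minimum cluster size are illustrative.

```python
import numpy as np
from scipy import ndimage

def moving_skin_regions(skin_mask, motion_mask, min_pixels=50):
    """Fuse colour and motion at the pixel level with an AND rule, then cluster.

    skin_mask, motion_mask : boolean arrays of the same shape.
    Returns bounding boxes (top, left, bottom, right) of connected
    "moving skin" regions containing at least `min_pixels` pixels.
    """
    fused = skin_mask & motion_mask                       # pixel-level AND
    labels, n = ndimage.label(fused)                      # connected components
    boxes = []
    for i, sl in enumerate(ndimage.find_objects(labels), start=1):
        if sl is None:
            continue
        if (labels[sl] == i).sum() >= min_pixels:         # discard small clusters
            rows, cols = sl
            boxes.append((rows.start, cols.start, rows.stop, cols.stop))
    return boxes

# Example on random masks standing in for real skin and motion images.
skin = np.random.rand(120, 160) > 0.7
motion = np.random.rand(120, 160) > 0.7
print(moving_skin_regions(skin, motion, min_pixels=5))
```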

In the first example, tracking is based only on skin colour, and is therefore tolerant to large pose changes.  Occlusions are dealt with by temporal prediction using a Kalman filter.


Click for MPEG example (1.2 MB)
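The temporal prediction mentioned above could be realised with a constant-velocity Kalman filter over the cluster centroid, as in the following sketch. The state layout and noise settings are assumptions for illustration, not the actual tracker parameters.

```python
import numpy as np

class CentroidKalman:
    """Constant-velocity Kalman filter over a 2D centroid (x, y, vx, vy).

    A minimal sketch: when the skin cluster is occluded, calling predict()
    without update() lets the track coast on its velocity estimate.
    """
    def __init__(self, x, y, q=1.0, r=4.0):
        self.state = np.array([x, y, 0.0, 0.0])
        self.P = np.eye(4) * 10.0                          # state covariance
        self.F = np.array([[1, 0, 1, 0],                   # constant-velocity dynamics
                           [0, 1, 0, 1],
                           [0, 0, 1, 0],
                           [0, 0, 0, 1]], dtype=float)
        self.H = np.array([[1, 0, 0, 0],                   # we only observe position
                           [0, 1, 0, 0]], dtype=float)
        self.Q = np.eye(4) * q
        self.R = np.eye(2) * r

    def predict(self):
        self.state = self.F @ self.state
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.state[:2]

    def update(self, measured_xy):
        z = np.asarray(measured_xy, dtype=float)
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.state = self.state + K @ (z - self.H @ self.state)
        self.P = (np.eye(4) - K @ self.H) @ self.P

# Usage: predict every frame; update only when the skin cluster is visible.
kf = CentroidKalman(80, 60)
for frame_idx in range(5):
    pred = kf.predict()
    occluded = frame_idx in (2, 3)                         # pretend frames 2-3 are occluded
    if not occluded:
        kf.update((80 + frame_idx, 60))
```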

In the following two MPEG examples, the clustered regions (shown in green) are searched at every frame for faces at multiple scales and positions using a Support Vector Machine face detector.  Note that the detector is tolerant to some deviation from frontal pose.


Click for MPEG example (5.6 MB)


Click for MPEG example (3.6 MB)
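The multi-scale, multi-position search inside each clustered region might look roughly like the sketch below. The SVM itself is left as a `classify` parameter (here replaced by a dummy scorer), and the patch size, scales and step are illustrative assumptions.

```python
import numpy as np

def search_region_for_face(image, box, classify, patch=19,
                           scales=(1.0, 1.5, 2.0), step=4):
    """Multi-scale, multi-position face search inside one clustered region.

    image    : 2D grey-level array
    box      : (top, left, bottom, right) region from the moving-skin clustering
    classify : callable mapping a (patch, patch) window to a score; in VIGOUR
               this would be the SVM face detector, here it is a parameter
    Returns the best-scoring window as (score, top, left, size), or None.
    """
    top, left, bottom, right = box
    best = None
    for s in scales:
        size = int(round(patch * s))
        for r in range(top, bottom - size + 1, step):
            for c in range(left, right - size + 1, step):
                window = image[r:r + size, c:c + size]
                # Resample the window to the canonical patch size (nearest neighbour).
                ri = np.arange(patch) * size // patch
                ci = np.arange(patch) * size // patch
                score = classify(window[np.ix_(ri, ci)])
                if best is None or score > best[0]:
                    best = (score, r, c, size)
    return best

# Example with a dummy scorer standing in for the SVM decision function.
img = np.random.rand(120, 160)
print(search_region_for_face(img, (20, 30, 100, 120), classify=lambda w: w.mean()))
```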

Visually Mediated Interaction

In this example VIGOUR has been used for real-time intentional tracking, scene interpretation and camera control.  The video conferencing scenario is illustrated below.  In one room (left) there are two people communicating with a third person in a different room.  There is a video link from the two-person room to the single person, and an audio link via a speakerphone.  A single pan-tilt-zoom camera provides all visual information in the first room and zooms in on the current speaker.  The cropped face of the speaker is transmitted to the third person.


(Click to enlarge)

The communicants can wave to get a close up, perform a dismissive gesture to get a zoomed-out head-and-shoulders shot, and point to cause the camera to pan to the other speaker.  The system automatically detects people and zooms in on them initially.
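The behaviour-interpretation rules amount to a small mapping from gesture events to camera commands, sketched below. The command names and dispatch interface are hypothetical, not the real camera-control protocol.

```python
# A minimal sketch of the behaviour-interpretation rules described above.
def camera_command(gesture, tracked_people, current_target):
    if gesture == "wave":
        return ("zoom_in_on", current_target)             # close-up of the waver
    if gesture == "dismiss":
        return ("zoom_out_head_and_shoulders", current_target)
    if gesture == "point":
        others = [p for p in tracked_people if p != current_target]
        if others:
            return ("pan_to", others[0])                  # switch to the other speaker
    return None

print(camera_command("point", ["alice", "bob"], "alice"))
```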

Below is an AVI example of the system output during real-time operation, using object-centred features for gesture recognition.  In the video window, the magenta box frames the tracked head, the green box is the subject's left hand, and the red box the right hand.  The green binary image is frame differencing output, the amber binary image shows skin-coloured pixels.  The central sub-window shows gesture detections and the small face view is the cropped close-up shown to the remote communicant.  The lower part of the screen shows gesture model likelihoods plotted versus time.  The frame rate is approx. 4.5 fps.


AVI example of real-time VMI  (47.6 MB!!!)
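One plausible form of the object-centred features used in this example is the position of each hand relative to the tracked head, normalised by the head size, as in the sketch below; the exact feature set used by VIGOUR may differ.

```python
import numpy as np

def object_centred_features(head_box, left_box, right_box):
    """Hand positions expressed relative to the tracked head.

    Each box is (top, left, bottom, right) in image coordinates.  The features
    are hand-centre offsets normalised by the head size.
    """
    def centre(b):
        return np.array([(b[1] + b[3]) / 2.0, (b[0] + b[2]) / 2.0])  # (x, y)

    head_c = centre(head_box)
    head_size = max(head_box[2] - head_box[0], head_box[3] - head_box[1])
    feats = []
    for hand in (left_box, right_box):
        feats.extend((centre(hand) - head_c) / head_size)
    return np.array(feats)                                # [dxL, dyL, dxR, dyR]

# Example with made-up head and hand boxes.
print(object_centred_features((20, 70, 50, 100), (60, 30, 80, 50), (55, 110, 75, 130)))
```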

Here is another example showing earlier results, using global features extracted from the motion frame for gesture recognition.  The small white squares show the x-y centroid of the binary motion image, which is used as the feature for gesture classification.


MPEG example of real-time VMI, earlier results (1.5 MB)
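Computing this global feature is straightforward; the sketch below returns the centroid of the binary motion image for one frame.

```python
import numpy as np

def motion_centroid(motion_mask):
    """x-y centroid of the binary motion image, used as a global feature."""
    ys, xs = np.nonzero(motion_mask)
    if xs.size == 0:
        return None                                       # no motion this frame
    return float(xs.mean()), float(ys.mean())

# Example: a synthetic mask with motion in the upper-left quadrant.
mask = np.zeros((120, 160), dtype=bool)
mask[10:40, 20:60] = True
print(motion_centroid(mask))                              # approx. (39.5, 24.5)
```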

Other examples may be found on the ISCANIT-related research pages.
 
 

Relevant Publications

Jamie Sherrah and Shaogang Gong, "VIGOUR: A System for Tracking and Recognition of Multiple People and their Activities", to appear in Proceedings of the International Conference on Pattern Recognition, September 2000, Barcelona Spain.

Jamie Sherrah, Shaogang Gong, Jonathan Howell and Hilary Buxton, "Interpretation of Group Behaviour in Visually Mediated Interaction", to appear in Proceedings of the International Conference on Pattern Recognition, Barcelona, Spain, September 2000.