|
This page describes work done for a PhD thesis at Columbia University.
We have built a working system in which hand gestures replace many of the tasks performed by a mouse in a typical Graphical User Interface (GUI). The user can select on-screen objects by pointing, move or resize them with subsequent gestures, or bring up menus and select items from them. The system is built on a standard desktop workstation with a camera below the screen looking up. This gives an image where the hand looms large, making segmentation and vertical positioning easier. No special lighting or background is needed, but long-sleeved shirts are required.
In the first step of the process, the hand is segmented from the background on the basis of color. During initialization, the user watches the live camera image and places their hand within a boundary overlaid on the image. A table-based HSI color predicate is then generated from the flesh tone found in this representative image. During operation, this predicate quickly marks hand vs. non-hand pixels, and the centroid of the largest resulting binary blob is used as the position of the hand.
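The segmentation step can be illustrated with a minimal sketch. It is not the thesis implementation: the table resolution (`H_BINS`, `S_BINS`), the function names, and the use of 4-connectivity are assumptions, and Python's standard `colorsys` module provides HSV rather than HSI, so HSV stands in here for the color space.

```python
import colorsys
from collections import deque

H_BINS, S_BINS = 16, 8  # hypothetical table resolution (hue x saturation)

def bin_of(rgb):
    """Quantize an (r, g, b) pixel (0-255) to a (hue, saturation) table cell."""
    r, g, b = (c / 255.0 for c in rgb)
    h, s, _v = colorsys.rgb_to_hsv(r, g, b)  # HSV used as a stand-in for HSI
    return min(int(h * H_BINS), H_BINS - 1), min(int(s * S_BINS), S_BINS - 1)

def train_predicate(sample_pixels):
    """Mark every table cell that occurs among the sampled hand pixels."""
    return {bin_of(p) for p in sample_pixels}

def segment(image, predicate):
    """Binary mask: True where the pixel's table cell is marked as hand."""
    return [[bin_of(p) in predicate for p in row] for row in image]

def largest_blob_centroid(mask):
    """Centroid (x, y) of the largest 4-connected True region, or None."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    best = []
    for y0 in range(h):
        for x0 in range(w):
            if not mask[y0][x0] or seen[y0][x0]:
                continue
            blob, queue = [], deque([(x0, y0)])
            seen[y0][x0] = True
            while queue:  # breadth-first flood fill of one blob
                x, y = queue.popleft()
                blob.append((x, y))
                for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
                    if (0 <= nx < w and 0 <= ny < h
                            and mask[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((nx, ny))
            if len(blob) > len(best):
                best = blob
    if not best:
        return None
    n = len(best)
    return (sum(x for x, _ in best) / n, sum(y for _, y in best) / n)
```

Keeping only the largest blob means that small patches of skin-colored background do not pull the estimated hand position away from the hand.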
In the next step, the image position of the hand is mapped to a target screen position for the pointer (mouse). To compensate for the viewing angle of the camera, a geometric correction function must be applied. A perspective transformation can be trained using only the four coordinate pairs obtained by asking the user to point to the four corners of the screen, so training takes only a few seconds. This approach works as long as the physical geometry stays relatively constant. Minor changes, such as the user sitting differently or holding their hand differently to point, generally have only minor effects on tracking accuracy, and the user easily compensates for them. Larger changes, such as the user standing rather than sitting, require retraining.
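A perspective transformation has eight free parameters, and each of the four corner correspondences contributes two linear equations, which is why four points suffice. The sketch below shows one standard way to fit and apply such a transform. The function names and the plain Gaussian-elimination solver are illustrative assumptions, not the code used in the system.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

def fit_perspective(image_pts, screen_pts):
    """Fit h0..h7 so that u = (h0 x + h1 y + h2) / (h6 x + h7 y + 1)
    and v = (h3 x + h4 y + h5) / (h6 x + h7 y + 1) for all four pairs."""
    a, b = [], []
    for (x, y), (u, v) in zip(image_pts, screen_pts):
        a.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        b.append(u)
        a.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        b.append(v)
    return solve(a, b)

def to_screen(h, x, y):
    """Map an image position of the hand to a screen position."""
    w = h[6] * x + h[7] * y + 1.0
    return ((h[0] * x + h[1] * y + h[2]) / w,
            (h[3] * x + h[4] * y + h[5]) / w)
```

In use, `fit_perspective` would be called once after the user has pointed at the four screen corners, and `to_screen` once per frame on the hand centroid.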
Sometimes it is necessary to classify the pose of the user's hand. To do this, a tight window is extracted from around the current hand position. Some additional morphology and enhancement steps are performed to reduce noise and compensate for variations in lighting. The image is then converted to gray scale and resampled to a fixed resolution of 26 by 39 pixels. Several examples of this processing are shown above. During operation, hand poses are classified by a neural net with 1014 input nodes, one for each pixel, and 20 hidden nodes. There is one output node for each possible pose. This network is trained off-line using example images of each pose at various orientations.
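The shape of the classifier can be sketched as a forward pass through a network with 26 x 39 = 1014 inputs and 20 hidden nodes. Everything beyond those sizes is an assumption made for illustration: the sigmoid activations, the number of poses (`N_POSES`), the pixel scaling, and the random, untrained weights. A real network would load weights produced by the off-line training.

```python
import math
import random

WIDTH, HEIGHT = 26, 39
N_INPUT = WIDTH * HEIGHT  # 1014 input nodes, one per pixel
N_HIDDEN = 20
N_POSES = 4               # hypothetical number of poses

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def make_layer(n_in, n_out, rng):
    """One weight row per output node; the last entry of each row is the bias."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(n_in + 1)]
            for _ in range(n_out)]

def forward(layer, inputs):
    """Sigmoid of the weighted sum plus bias, for every node in the layer."""
    return [sigmoid(row[-1] + sum(w * x for w, x in zip(row, inputs)))
            for row in layer]

def classify(window, hidden, output):
    """Return (pose index, output scores) for a gray-scale window whose
    rows hold pixel values in the range 0-255."""
    inputs = [p / 255.0 for row in window for p in row]
    if len(inputs) != N_INPUT:
        raise ValueError("expected a 26 by 39 pixel window")
    scores = forward(output, forward(hidden, inputs))
    return max(range(len(scores)), key=scores.__getitem__), scores

# Untrained placeholder weights, only to make the sketch runnable.
rng = random.Random(0)
hidden = make_layer(N_INPUT, N_HIDDEN, rng)
output = make_layer(N_HIDDEN, N_POSES, rng)
```

The pose reported is simply the output node with the highest activation.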
At the system's core is a loop which navigates a finite state machine (FSM) describing the user interaction language (details). There are essentially three types of nodes in the FSM: motion feature nodes, pose feature nodes, and menu nodes. If the current node is a motion node, the system snaps an image of the user's hand, determines where they are pointing, extracts any motion features from the path history of the cursor, and branches based on those features. If the current node is a pose node, the system examines the current or saved image of the user's hand, classifies the pose, and branches accordingly. If the current node is a menu node, the system brings up a menu with one item for each possible link out of the node, allows the user to select an item, and branches accordingly. Every cycle, the system parses the current node of the FSM, determines which outgoing link to follow by watching the user, and executes any actions associated with the destination node. Actions can include simple interface events such as displaying the cursor, complex interface events such as interactively moving a window, and system actions such as saving the current image for later use. The language encoded by this FSM was designed using studies of natural human gesticulation, which suggest, among other things, that there is a natural exclusivity between information conveyed by motion and pose. That is, when the hand is moving, little or no information is conveyed by the pose. This leads to a natural alternation between motion and pose in a gesture. |
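The control loop described above can be sketched as a small state machine. The node names, link labels, and the `observers` callables below are hypothetical placeholders for the vision components, and the choice to stay on the current node when no link matches is an assumption, not documented behavior.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional

@dataclass
class Node:
    name: str
    kind: str  # "motion", "pose" or "menu"
    links: Dict[str, str] = field(default_factory=dict)  # label -> destination
    action: Optional[Callable[[], None]] = None           # run on arrival

def step(nodes, current, observers):
    """Run one cycle: observe the user, follow a link, run the new node's action."""
    node = nodes[current]
    if node.kind == "menu":
        # One menu item per outgoing link; the observer returns the chosen item.
        label = observers["menu"](sorted(node.links))
    else:
        # Motion and pose observers return a feature or pose label.
        label = observers[node.kind]()
    if label not in node.links:
        return current  # assumption: no matching link means no transition
    dest = node.links[label]
    if nodes[dest].action is not None:
        nodes[dest].action()
    return dest

# A tiny hypothetical language: track the pointer, check the pose, offer a menu.
log = []
nodes = {
    "track": Node("track", "motion", {"dwell": "pose"},
                  lambda: log.append("show cursor")),
    "pose": Node("pose", "pose", {"fist": "track", "open": "menu"},
                 lambda: log.append("save image")),
    "menu": Node("menu", "menu", {"move": "track", "resize": "track"},
                 lambda: log.append("show menu")),
}
```

Because each node knows only its own kind and outgoing links, the interaction language can be changed by editing the node table without touching the loop.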
| Contact: Rick Kjeldsen | Last updated: 6/12/02 | ||