Visual Interpretation of Hand Gestures as a Practical Interface Modality
Fredrick C. M. Kjeldsen
Ph.D. thesis, Department of Computer Science, Columbia University, 1997.
This dissertation describes a user interface in which many tasks traditionally performed with a mouse are instead performed using visual recognition of hand gestures. The goals are to explore both how a vision system should be designed to recognize hand gestures and how gestures are best used in a general-purpose interface. Observed by a camera below the screen, the user manipulates objects directly with gestures that incorporate both motion and pose. Task and domain knowledge provide context, allowing real-time recognition on standard PC hardware. A color-based algorithm is trained to segment the user's hands from complex backgrounds without visual aids. Training uses a novel combination of positive and negative data to improve segmentation quality. The apparent path of the hand is smoothed with an algorithm that reduces the types of noise inherent in the domain while leaving on-screen cursor motion that feels natural to the user. Salient features of the motion are extracted, including a newly discovered natural gesture (a "Comma"), which helps provide punctuation for each gestural sentence. Neural networks are trained to classify the pose of the user's hand from cropped and preprocessed images. The nets correctly classify 90-95% of the hand images in real time. A transition network encodes the interaction language. It controls the application of feature extraction operators and interprets their results to determine when to perform actions on the user's behalf. The style of interaction is based on studies of natural gesticulation and incorporates various features designed to make it natural and easy for the user to remember. The system demonstrates an 80-90% success rate on most tasks. Object selection time for large objects is demonstrated to be equal to or better than that of a mouse. Object selection performance is modeled accurately by augmenting Fitts' Law with terms for lag and random cursor noise.
Finally, the suitability of gesture for this type of task is considered. Various interaction styles are examined, and problems specific to hand gesture are discussed.
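As an illustration of the object selection model mentioned in the abstract, the following is a minimal Python sketch of Fitts' Law with extra terms for lag and cursor noise. The baseline is the standard Shannon formulation, MT = a + b * log2(D/W + 1). The way lag and noise enter the formula here is an assumption made for illustration only, since the abstract does not give the dissertation's actual equation: lag is assumed to add a cost per bit of difficulty, and noise is assumed to shrink the usable target width. The names `augmented_time`, `k_lag`, and `noise_sd` are hypothetical.

```python
import math


def fitts_time(distance, width, a, b):
    """Standard Fitts' Law (Shannon formulation): a + b * log2(D/W + 1)."""
    return a + b * math.log2(distance / width + 1.0)


def augmented_time(distance, width, a, b, lag=0.0, noise_sd=0.0, k_lag=1.0):
    """Hypothetical augmentation, NOT the dissertation's fitted model.

    Assumptions (illustrative only):
      * random cursor noise with standard deviation `noise_sd` reduces the
        usable target width by two standard deviations;
      * system lag (in seconds) adds a cost proportional to the index of
        difficulty, scaled by the hypothetical constant `k_lag`.
    """
    # Clamp so the logarithm stays defined when noise exceeds the target size.
    effective_width = max(width - 2.0 * noise_sd, 1e-6)
    index_of_difficulty = math.log2(distance / effective_width + 1.0)
    return a + (b + k_lag * lag) * index_of_difficulty


if __name__ == "__main__":
    # Compare a large and a small target with and without lag and noise.
    for width in (100.0, 20.0):
        base = fitts_time(300.0, width, a=0.1, b=0.2)
        aug = augmented_time(300.0, width, a=0.1, b=0.2, lag=0.1, noise_sd=5.0)
        print(f"W={width:5.1f}  baseline={base:.3f}s  augmented={aug:.3f}s")
```

Under these assumptions the penalty from lag and noise grows as targets shrink, which is at least consistent with the abstract's observation that gesture matches the mouse only for large objects. The constants `a`, `b`, and `k_lag` would have to be fitted to measured selection times.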