Wednesday, January 30, 2008

A Dynamic Gesture Interface for Virtual Environments Based on Hidden Markov Models (Chen, El-Sawah, Joslin & Georganas)

Summary:

This paper describes an HMM-based system for continuous dynamic gesture recognition, motivated by natural interaction in virtual environments. The authors review the major points of an HMM, collect data from a CyberGlove, and use three different dynamic gestures to control a cube's rotation. They use a multi-dimensional HMM, and instead of requiring pauses in gesturing to split the data into meaningful pieces, they segment it using the standard deviation of the angle variation for each finger joint. They collected 10 data sets for each of the three gestures they wanted to recognize in order to train the HMMs, and they render a 3D hand bone-structure model to give extra feedback and show what the glove data looks like.
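To make the segmentation idea concrete for myself, here is a rough sketch of how a moving standard deviation over joint-angle changes could mark gesture boundaries. The window size, threshold, and the way I average across joints are my own guesses for illustration, not values from the paper.

```python
import numpy as np

def segment_by_joint_variation(joint_angles, window=10, threshold=2.0):
    """Split a stream of joint-angle frames into candidate gesture segments.

    joint_angles: (T, J) array of finger-joint angles over time.
    A frame counts as 'active' when the per-joint standard deviation of the
    angle changes over the trailing window is, on average, above a threshold.
    Window size, threshold, and the averaging are illustrative guesses,
    not values from the paper.
    """
    deltas = np.abs(np.diff(joint_angles, axis=0))          # frame-to-frame change, (T-1, J)
    active = np.array([
        deltas[max(0, t - window + 1):t + 1].std(axis=0).mean() > threshold
        for t in range(len(deltas))
    ])

    segments, start = [], None
    for t, flag in enumerate(active):                        # collect contiguous active runs
        if flag and start is None:
            start = t
        elif not flag and start is not None:
            segments.append((start, t))
            start = None
    if start is not None:
        segments.append((start, len(active)))
    return segments
```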


Discussion:

The difference between this paper and others we've recently read is that it deals with continuous gestures rather than requiring one brief gesture at a time with a pause before the next. From the figure, I find it hard to tell exactly what the chosen gestures are, or to see any intuitive connection between them and rotating a 3D cube, though it seems the rotating cube is only there to visualize how the commands are being recognized.

The idea of a repetitive, continuous gesture is something we haven't considered very much so far. Is it useful to be able to break up a graph and look for repetition, like we do with overtraced circles and spirals? Are there many natural gestures that are repetitive and continuous like this? Waving to instruct somebody to move or be quiet might fall under this pattern, but what other things are there?

Online, Interactive Learning of Gestures for Human/Robot Interfaces (Lee & Xu)

This paper presents a system that uses HMMs to recognize gestures and to learn new ones online from one or two examples. Their procedure is: (1) the user makes a series of gestures; (2) the system segments the data into separate gestures, then either reacts to a gesture it recognizes or asks the user for clarification; and (3) the system adds the new example to the list of examples it has seen and retrains the HMM on all the data so far using the Baum-Welch algorithm. They represent gestures by reducing the data to a one-dimensional sequence of symbols, after resampling it at even intervals, dividing it into time windows, and applying vector quantization. The codebook for the quantization is generated offline with the LBG algorithm. Their segmentation process requires that the hand be still for a short time between gestures, though they suggest an acceleration threshold would be useful if the hand does not stop. A simple function gives a confidence measure for each gesture's classification. They tested the system on 14 letters of the sign language alphabet, chosen because they are not ambiguous without hand orientation data, and found 1%-2.4% error after 2 examples and close to none after 4 or 6 examples in their two tests. Their future goals include increasing vocabulary size by using 5 dimensions of symbols (one per finger).
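Since the paper leans on vector quantization, here is a minimal sketch of the LBG splitting procedure and the run-time symbol mapping, assuming Euclidean distance and a power-of-two codebook size; the function names and parameters are mine, not from the paper.

```python
import numpy as np

def lbg_codebook(vectors, size=32, eps=0.01, iters=20):
    """Grow a vector-quantization codebook by LBG splitting: start from the
    global centroid, split each code vector into a perturbed pair, and refine
    with a few k-means passes. Assumes 'size' is a power of two; the exact
    parameters here are illustrative."""
    vectors = np.asarray(vectors, dtype=float)
    codebook = vectors.mean(axis=0, keepdims=True)
    while len(codebook) < size:
        codebook = np.vstack([codebook * (1 + eps), codebook * (1 - eps)])
        for _ in range(iters):                                   # k-means refinement
            d = np.linalg.norm(vectors[:, None] - codebook[None], axis=2)
            nearest = d.argmin(axis=1)
            for k in range(len(codebook)):
                members = vectors[nearest == k]
                if len(members):
                    codebook[k] = members.mean(axis=0)
    return codebook

def quantize(vectors, codebook):
    """Map each feature vector to the index of its nearest code vector,
    turning a gesture into a one-dimensional symbol sequence for the HMM."""
    d = np.linalg.norm(np.asarray(vectors, dtype=float)[:, None] - codebook[None], axis=2)
    return d.argmin(axis=1)
```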

Discussion:

I am curious how natural pausing between gestures will feel, and in how many applications. As we've discussed in class, applications like sign language might use a very fluid series of gestures. But for some kinds of commands, pauses are probably very natural, unless you want to issue a fast sequence of commands without waiting for confirmation of comprehension between them. I can imagine "corner finding" based on direction and speed could be another useful tool for segmenting gestures into more manageable pieces.

I think resampling at even intervals as in this paper will be a very good thing to keep in mind, along with jitter reduction.
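For reference, a minimal sketch of resampling a trajectory at even time intervals by linear interpolation; the sample count and the choice to interpolate over time (rather than, say, arc length) are my own assumptions.

```python
import numpy as np

def resample_even(points, timestamps, n_samples=64):
    """Resample a variable-rate trajectory at evenly spaced points in time
    using linear interpolation. n_samples is an arbitrary choice; resampling
    by arc length would be another option."""
    points = np.asarray(points, dtype=float)             # (T, D)
    t = np.asarray(timestamps, dtype=float)
    even_t = np.linspace(t[0], t[-1], n_samples)
    return np.column_stack(
        [np.interp(even_t, t, points[:, d]) for d in range(points.shape[1])]
    )
```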

Monday, January 28, 2008

The HoloSketch VR Sketching System (Deering)

The creators of the HoloSketch system wanted a high-quality, high-accuracy VR display system that lets non-programmers create accurate virtual objects more quickly than with more common 3D drawing systems. Their 3D sketching system uses a CRT monitor display, a 3D mouse that works like a one-fingered data glove, and a set of glasses that enables head tracking. They implemented a "fade-up" menu: after a small delay on a wand button press, the current view fades into a pie menu, with sub-menu capability. Drawing is done by holding down a different wand button and using the wand tip as a sort of 3D pen tip. Prepackaged objects are created by selecting one from the menu, pushing the wand button to create a new instance, and moving the wand to set the object's size, shape, or form. Other object types leave a trail of material behind the wand, like wire-frame lines or virtual "toothpaste". HoloSketch supports editing operations on a selected object; selection is done by moving the wand tip inside the object and pressing the middle wand button, and complex operations require keyboard keys to be held down at the same time so that stray wand button presses don't trigger anything accidentally. They have a "10X reduction mode" (meta key) to deal with jitter. The system also allows animations to be added to objects.

They found that it is "surprisingly quick and easy to create complex forms in [3D] using HoloSketch." The most common mistake they observed is users not expecting head motion to change the display, because standard computer use has trained them to expect otherwise. They had a single non-programmer artist use the system in trials over several months. They found that elbow support was necessary for extended work, and that as an expert user she could draw two-handed to rapidly create complex shapes with the toothpaste primitive.

Discussion:

I wonder what the cost of reproducing a system like this would be today. It seems like it would be easier to work with a flat-panel monitor than a CRT, because the screen is flat, there is no layer of thick glass, and no corrections for that should be needed.

I imagine pie menus would be even more useful in 3D space than in 2D with a pen, because if the user is just pointing, without physical contact to steady her hand, it would be hard to hit a specific list item easily (I have noticed this with the Wii, doing tasks like "typing" a name by pointing and clicking at a keyboard image). It would be nice to better understand how they do jitter reduction (their "10X reduction mode" -- see page 5, or 58) -- I think this is very important in any hand-tracking-based application requiring precision.
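Deering doesn't spell out how the 10X reduction mode works, so the following is just one common approach I could imagine (not necessarily his): scale wand displacement about the point where the mode was engaged, and lightly smooth the result. The scale factor only echoes the "10X" name; the filter constant is my guess.

```python
def reduced_motion(raw_positions, anchor, scale=0.1, alpha=0.3):
    """Illustrative precision mode: scale wand displacement about an anchor
    point by 1/10 (echoing the '10X reduction' name) and apply light
    exponential smoothing. Not Deering's actual method; both constants are
    my own guesses."""
    smoothed, out = None, []
    for p in raw_positions:                              # p is an (x, y, z) tuple
        scaled = [a + scale * (x - a) for x, a in zip(p, anchor)]
        if smoothed is None:
            smoothed = scaled
        else:
            smoothed = [alpha * new + (1 - alpha) * old
                        for new, old in zip(scaled, smoothed)]
        out.append(tuple(smoothed))
    return out
```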

An Architecture for Gesture-Based Control of Mobile Robots (Iba, Weghe, Paredis, Khosla)

This paper describes an approach for controlling mobile robots with hand gestures. The authors believe that capturing the intent of the user's 3D motions is a challenging but important goal for improving interaction between human and machine, and their gesture-based programming research is aimed at this long-term goal. They built a gesture spotting and recognition algorithm based on an HMM. Their setup includes a mobile robot, a CyberGlove, a Polhemus 6-DOF position sensor, and a geolocation system to track the position and orientation of the robot. The robot is directed with hand gestures: opening and closing (moving between a flat hand and a closed fist), pointing, and waving left or right. Closing slows and eventually stops the robot; opening maintains its current state; pointing makes the robot accelerate forward (local control) or "go there" (global control); waving left or right increases the robot's rotational velocity in that direction (local) or sends it in the direction the hand is waving (global).
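A tiny sketch of how the local-control gesture-to-command mapping might look in code; the velocity increments and the command representation are my own illustrative choices, not the paper's.

```python
def update_command(gesture, cmd):
    """Map a recognized gesture to an updated (linear, rotational) velocity
    command, following the local-control behavior described in the paper.
    The step size of 0.1 is arbitrary."""
    v, w = cmd
    if gesture == "close":          # flat hand -> fist: slow down, eventually stop
        v = max(0.0, v - 0.1)
    elif gesture == "open":         # maintain current state
        pass
    elif gesture == "point":        # accelerate forward
        v += 0.1
    elif gesture == "wave_left":    # increase rotational velocity to the left
        w += 0.1
    elif gesture == "wave_right":   # increase rotational velocity to the right
        w -= 0.1
    return (v, w)
```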

To improve speed and performance, they pre-process the data by reducing each 18-dimensional glove vector to a 10-dimensional feature vector, augmenting it with its derivatives, and mapping the result to an integer codeword -- they chose to use 32 codewords. They partitioned a data set of 5,000 measurements covering the whole range of possible hand movements into 32 clusters and calculated a centroid for each, so at run time a feature vector can be mapped to the nearest codeword and a gesture can be treated as a sequence of codewords.
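Here is a rough sketch of that run-time pipeline as I understand it; the projection matrix stands in for whatever dimensionality reduction the authors actually used, and the centroids would come from their offline clustering.

```python
import numpy as np

def to_codewords(glove_frames, projection, centroids):
    """Illustrative run-time pipeline: reduce each 18-D glove frame to a 10-D
    feature vector, append its time derivative, and map the result to the
    nearest of 32 centroids. 'projection' (18 x 10) stands in for whatever
    reduction the authors used, and 'centroids' (32 x 20) would come from
    their offline clustering of 5000 measurements."""
    feats = np.asarray(glove_frames, dtype=float) @ projection       # (T, 18) -> (T, 10)
    derivs = np.vstack([np.zeros((1, feats.shape[1])), np.diff(feats, axis=0)])
    augmented = np.hstack([feats, derivs])                           # (T, 20)
    dists = np.linalg.norm(augmented[:, None] - centroids[None], axis=2)
    return dists.argmin(axis=1)                                      # gesture = codeword sequence
```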

Discussion:

I think the most potentially useful part of this paper is the idea of reducing gestures to a sequence of codewords, since that would simplify the data we'd have to deal with by a lot. However, I wonder if they really got as thorough a sampling as they think they did, and if it allows for subtly different gestures. I don't buy that 32 codewords would be enough for all conceivable gestures, especially for something complex -- obviously ASL has 24 different static gestures that mean different things, and I'd bet we could find 9 more distinct gestures. Maybe increasing the number of codewords would help, but I'd still be wary.

I also think that while their system is a good step toward robot control using hand tracking, I'm not convinced that hand gestures are a good way to control robots doing important things that require much precision. I've played games where a character is controlled by using the up arrow to walk forward from the character's perspective and the right and left keys to turn, and that is hard enough to control without the input being interpreted as adding velocity rather than just turning or moving at a fixed speed. As for the global controls, pointing is imprecise, especially at a distance. Think about trying to have someone point out a constellation to you: their arm is neither aligned with their line of sight nor yours. Waving seems even harder to be precise about. For circumstances that require accuracy in the robot's movement, I think a different set of controls would be necessary, though it might still be possible to do it with a glove or hand tracking.

Also of note, the 8th reference in this paper refers to public domain HMM C++ code by Myers and Whitson, which might be nice to look at if we need HMM code.

Wednesday, January 23, 2008

An Introduction to Hidden Markov Models (Rabiner, Juang)

(note: this paper is not haptics-specific)

Summary:

This paper is intended as a tutorial to help readers understand the theory of Markov models and how they have been applied to speech recognition problems. The authors describe how some data might follow a simple model like a sine wave, but noise might cause its parameters to vary over time. Within a "short time" period, part of the data can often be modeled with a simpler function than the whole set. The whole signal can be modeled by stringing together a series of these pieces, but it can be more efficient to use a common short-time model for the well-behaved parts of the signal along with a characterization of how one part evolves into the next. This leads to hidden Markov models (HMMs). Three problems are posed: (1) how to identify a well-behaved period, (2) how to characterize the sequentially evolving nature of these periods, and (3) what typical or common short-time models should be chosen for each period. The paper focuses on what an HMM is, when it is appropriate, and how to use it.

They define an HMM as a doubly stochastic process (one with randomness at two levels), where one of the processes is hidden and can only be observed through another set of processes that produce the output sequence. They illustrate this with a coin-toss example, where only the result of each toss in a series is known. They give examples with fair coins (50-50 odds) and biased coins (weighted towards heads or tails) to illustrate that, given nothing besides the output sequence, the important considerations are how to determine the number of states in the model, how to choose the model parameters, and how large the training data set needs to be.
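A toy version of the coin-toss example, just to show the "doubly stochastic" idea: one random process (hidden) picks the coin, another produces the observed heads or tails. The switching and bias probabilities here are arbitrary illustration values.

```python
import random

def sample_coin_hmm(n_tosses, switch_prob=0.3, heads_prob=(0.5, 0.8)):
    """Toy doubly stochastic process in the spirit of the paper's coin-toss
    example: a hidden process decides which coin (fair or biased) is in play,
    and only the heads/tails outcomes are observed. All probabilities here
    are arbitrary illustration values."""
    coin, observations = 0, []
    for _ in range(n_tosses):
        if random.random() < switch_prob:   # hidden state transition
            coin = 1 - coin
        observations.append("H" if random.random() < heads_prob[coin] else "T")
    return observations                     # the coin sequence itself stays hidden

print("".join(sample_coin_hmm(20)))
```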

They define the elements, mechanics, and notation of HMMs so that they can pose sample problems and solutions. The first problem is, given an observation sequence and a model, how to compute the probability of the observation sequence (that is, to evaluate or score it). The second is how to choose a state sequence that is optimal in some meaningful sense, given an observation sequence (how to uncover the hidden part of the model). The third is how to adjust the model parameters to maximize the probability of the observation sequence (to describe how the training sequence came about via the parameters). They explain how to solve each of these problems, with formulae, and note some issues to keep in mind when applying an HMM.
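For the first problem, the standard solution is the forward algorithm; here is an unscaled sketch of it (in practice a scaled version is used to avoid numerical underflow). The example numbers at the bottom are arbitrary.

```python
import numpy as np

def forward_probability(obs, A, B, pi):
    """Problem 1 via the forward algorithm: likelihood of an observation
    sequence given an HMM. A[i, j] is the transition probability from state
    i to j, B[i, k] the probability of emitting symbol k in state i, and pi
    the initial state distribution. Unscaled for clarity; real implementations
    rescale at each step to avoid underflow."""
    alpha = pi * B[:, obs[0]]              # initialization
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]      # induction
    return alpha.sum()                     # termination: sum over final states

# Tiny two-state, two-symbol example (all numbers arbitrary):
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])
B = np.array([[0.9, 0.1],
              [0.2, 0.8]])
pi = np.array([0.6, 0.4])
print(forward_probability([0, 1, 0], A, B, pi))
```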

They give an example of using HMMs for isolated word recognition with a vocabulary of V words, a training set for each word, and an independent testing set. To recognize a word, they build an HMM for each word in the vocabulary, compute the probability of the observed sequence under each word's model, and choose the word with the highest probability.
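Putting that together, isolated word recognition reduces to scoring the observation sequence against each word's model and taking the best score; this sketch reuses the forward_probability function above and assumes each model is an (A, B, pi) tuple.

```python
def recognize_word(obs, word_models):
    """Score the observation sequence under each word's HMM and return the
    most likely word. word_models maps a word to an (A, B, pi) tuple; scoring
    reuses forward_probability from the sketch above."""
    scores = {word: forward_probability(obs, *model)
              for word, model in word_models.items()}
    return max(scores, key=scores.get)

# e.g. recognize_word(observed_symbols, {"yes": (A_yes, B_yes, pi_yes),
#                                        "no":  (A_no,  B_no,  pi_no)})
```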

Discussion:

I like that the HMM is designed around the idea of a degree of randomness and the idea that you can't directly see the cause of the output that you can see -- this makes it seem like a good tool to apply to situations that match these criteria. I can see some parallels in human perception and recognition: we try to make sense of events that happen, but sometimes there are causes we can't see, and we're less likely to acquire an incorrect belief if we're open to the idea that we can't know everything.

I'm not yet convinced as to how quick and easy it would be to implement the math, but this seems like a fairly beginner-friendly reference for it.

American Sign Language Finger Spelling Recognition System (Allen, Pierre, Foulds)

Summary:

This short paper describes a system built to recognize finger-spelled letters in ASL, motivated by helping integrate deaf communities into mainstream society, especially those who are better at ASL than at reading and writing English. They used a CyberGlove to capture hand position data -- it's not clear to me whether they use a stream of data or just a snapshot of the glove's positions at one point in time. They left out J and Z because those letters involve motion in 3D space. They used Matlab with the Neural Networks toolbox for letter recognition. They found 90% accuracy on the 24 letters they used, but only for the person whose data was used to train the neural network.
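As a rough stand-in for their Matlab neural network (not their actual setup), a small multilayer perceptron over glove joint-angle snapshots would look something like this; the data here is a random placeholder just so the sketch runs, and the layer size is my own choice.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Illustrative stand-in for the paper's Matlab neural network: a small
# multilayer perceptron over 18-D glove joint-angle snapshots, one class per
# static letter (J and Z excluded). The training data below is random
# placeholder just so the sketch runs; real data would come from the glove.
rng = np.random.default_rng(0)
train_angles = rng.uniform(0, 90, size=(240, 18))              # 10 samples per letter
train_letters = np.repeat(list("ABCDEFGHIKLMNOPQRSTUVWXY"), 10)  # 24 static letters

clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=2000)
clf.fit(train_angles, train_letters)
print(clf.predict(train_angles[:5]))
```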

Discussion:

This is an interesting paper, but as the authors say, it is only a beginning step. Being able to recognize 24 letters of the alphabet is nice, but it isn't nearly sufficient for any kind of normal-speed conversation. Would you ask a hearing person to spell out their sentences instead of speaking in words? Setting aside the issues of natural language processing and assuming there is a distinct gesture for every word, and supposing our only goal is to recognize words so as to translate them directly, there remains a huge number of gestures to be learned in order to facilitate comfortable, natural conversation. The 90% accuracy rate given in the paper for the alphabet is obviously going to go down as the search space gets larger. Based on this paper alone, sign language is not a solved problem -- I'd expect it to be a pretty hard problem.

Additionally, it might be important to see if wearing a glove affects hand movements, in case anyone ever tries to apply glove data to training vision-based recognition systems.

Tuesday, January 22, 2008

Flexible Gesture Recognition for Immersive Virtual Environments (Deller, Ebert, Bender, Hagen)

Summary:

This paper argues that using all three dimensions in a computer interface is more immersive than a traditional desktop setup, but not yet common because adequate interfaces are lacking. The authors believe that a good interface will allow manipulation using intuitive hand gestures, as this will be most natural and easy to learn. They review related work and find that there is still a need for gesture recognition that handles varying conditions, such as different users and different hardware, and that works in real time for a variety of gestures.

The authors used a P5 Glove with their gesture recognition engine to get position and orientation data. Tracking is cut off if the back of the hand is angled away from the receptor and conceals too many reflectors, which is acceptable for the P5's intended use while sitting in front of a desktop computer. To reduce computation time, they define gestures as sequences of postures with specific positions and orientations of the user's hand, rather than as motion over a period of time. A posture in the recognition engine consists of the flexion values of the fingers, the orientation data of the hand, and a value representing how relevant the orientation is for that posture. Postures are taught to the system by performing them, and the system can similarly be trained for a given user.

The recognition engine has a data acquisition thread that constantly checks whether data received from the glove matches anything in its gesture manager component. Data is filtered to reduce noise, marked as a candidate gesture if the gesture manager finds a likely match, and marked as recognized if held for a minimum time span (300-600 milliseconds by their tests). The gesture manager keeps a list of known postures along with functions to manage it. A known gesture is stored with a "finger constellation" made up of the five fingers' bend values. If the current data is within some minimum recognition distance, the orientation is checked similarly, and a likelihood of a match is given based on these comparisons.
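My reading of the matching step, as a sketch: compare the current finger-bend constellation to each stored posture, and only check orientation when the posture marks it as relevant. The distance thresholds and the data layout are my own assumptions, not values from the paper.

```python
import numpy as np

def match_posture(bend_values, orientation, postures,
                  max_bend_dist=0.15, max_orient_dist=0.3):
    """Illustrative posture matcher in the spirit of the paper's gesture
    manager. 'postures' is a list of dicts with 'name', 'bends' (5 values),
    'orientation' (3 values), and 'orientation_relevant' (bool); the
    thresholds are my own guesses."""
    best, best_dist = None, None
    for p in postures:
        d = np.linalg.norm(np.asarray(bend_values) - np.asarray(p["bends"]))
        if d > max_bend_dist:
            continue                                  # finger constellation too far off
        if p["orientation_relevant"]:
            od = np.linalg.norm(np.asarray(orientation) - np.asarray(p["orientation"]))
            if od > max_orient_dist:
                continue                              # orientation matters and doesn't match
            d += od
        if best_dist is None or d < best_dist:
            best, best_dist = p["name"], d
    return best   # candidate only; the engine confirms it after a ~300-600 ms hold
```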

They find that the system works well with applications like American Sign Language and letting the user make a "click" gesture by pointing and "tapping" at an object. Their sample implementation uses a virtual desktop where the user can select a document by moving his hand over it and making a fist to represent grabbing it. They have other gestures for opening and browsing through documents.

Discussion:

I wonder if there might be a good way to enable a wider range of hand motion to be recognized by using more than one tracker location. If there were views from the side, could they be incorporated into the big picture for a more complete view, or could the system just use whichever view sees the largest number of reflectors? I don't know if this is feasible or worthwhile with the equipment we have, or if users are even really bothered by needing to learn how to hold their hands.

I also imagine that there may be a tradeoff between ease of use and allowing intuitive, natural gestures -- the gestures they describe for browsing are not entirely intuitive to me, and I would not likely guess how to open a document without being shown how to. However, without tactile feedback, making the same gesture to open a book that I would make in physical reality could be just as difficult to accomplish, as some sort of sweeping gesture could be interpreted as moving the book instead.


Deller, M., Ebert, A., et al. (2006). "Flexible Gesture Recognition for Immersive Virtual Environments." Tenth International Conference on Information Visualization (IV 2006).

Environmental Technology: Making the Real World Virtual (Myron W. Krueger)

Krueger, M. W. (1993). "Environmental technology: making the real world virtual." Commun. ACM 36(7): 36-37.

Summary:

This article describes research and applications that focus on letting humans use natural human gestures to communicate with computers. A number of examples are given: pressure sensors in the floor track a human's movements around the room or trigger musical tones (1969). A video-based telecommunication space was found to work best by superimposing images of hands into a computer-graphic image (1970). The 2D VIDEOPLACE medium (1974-1985) serves as an interface to 2D and 3D applications, such as one that allows the user to fly around a virtual landscape by holding out their hands and leaning in the direction they want to go. Practical applications include therapeutic analysis of body motion, language instruction, and virtual exploration of other planets. 3D sculpture can be done using thumbs and forefingers. The primary use the author foresees for this technology is simply teleconferencing to discuss traditional documents.

Discussion:

This paper provides a nice introduction to some of the kinds of things that have been done in this field without giving many technical details, only describing interface design choices.

I can imagine some of these applications meshing well with the head-tracking technique using the Wii components, as in the "Head Tracking for Desktop VR Displays using the WiiRemote" video. Space exploration, for example, might work well with a screen representing a window out of a spaceship. Combining this with use of gloves to control movement might make for a fun, immersive experience, though I'm not certain it would be an improvement on a setting with VR goggles aside from lower cost and greater accessibility to average consumers. I like the idea of using a simple hand gesture to specify movement and change of direction. Maybe there would be some intuitive gesture to switch between movement and manipulation of the environment, like some sort of reaching forward to put on gloves, which could make images of gloves/hands appear in the scene.