Wednesday, April 2, 2008

Activity Recognition using Visual Tracking and RFID (Krahnstoever, Rittscher, Tu, Chean & Tomlinson)

Summary:

This paper discusses the combination of RFID and video tracking to help determine what people are doing in their interactions with tagged objects. Their system uses camera data together with mathematical models of the human body to track a human's head and hands in 3D space. An RFID tracking unit detects the presence, location, and orientation in 3D space of RFID tags attached to objects. They believe that by combining these sources of information, it should be possible to recognize activity such as theft or tampering with retail items.


Discussion:

This paper has a decent, if fairly straightforward, idea. Yes, combining different types of data should allow for better recognition. A more thorough user study would be nice.

Reading about the way the system uses visual tracking made me think more about the situations where a glove would be more useful than cameras. For casual, everyday activities, where you might want to track the motion of one user in several locations, a glove would be a better choice if it could be wireless and send a signal to a recording device on a more stable part of a person's body. It would have to be voluntary, so it's probably not ever going to be very useful for security purposes (though it might be possible to identify when a person takes a glove off based on glove data.)

Wednesday, March 19, 2008

TIKL: Development of a Wearable Vibrotactile Feedback Suit for Improved Human Motor Learning (Lieberman & Breazeal)

Summary:

This work is motivated by the idea that motor learning can be improved by feedback over all joints, which is not available from a human instructor, especially in a large class setting. Their system includes optical tracking (based on reflective markers on the suit), tactile actuators (vibrotactile feedback -- by pulsing a series of actuators around a joint, a rotation direction is communicated), feedback software (which triggers feedback when the error between the target joint angles and the current angles is high enough), and hardware for output control. Their user study compared people learning an action with only video instruction to people learning it with both video and tactile feedback from the suit. They found that subjects' error in performing the motions was reduced by a statistically significant amount (around 20%).
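
As I understand the feedback software, the core of it is thresholding joint-angle error and choosing a rotation direction; here is a rough sketch of that loop for a single joint (the threshold, saturation value, and proportional-intensity choice are my assumptions, not details from the paper):

# Sketch of error-thresholded vibrotactile feedback for one joint.
# ERROR_THRESHOLD, MAX_ERROR and the proportional intensity mapping are
# assumptions for illustration, not values from the TIKL paper.

ERROR_THRESHOLD = 5.0   # degrees of joint-angle error before feedback fires
MAX_ERROR = 45.0        # error at which vibration saturates

def feedback_command(target_angle, current_angle):
    """Return (direction, intensity) for one joint, or None if no feedback."""
    error = target_angle - current_angle
    if abs(error) < ERROR_THRESHOLD:
        return None                      # close enough: stay silent
    direction = "rotate_positive" if error > 0 else "rotate_negative"
    intensity = min(abs(error) / MAX_ERROR, 1.0)   # scale pulse strength 0..1
    return direction, intensity

# Example: learner's elbow is 20 degrees short of the demonstrated pose.
print(feedback_command(target_angle=90.0, current_angle=70.0))
# -> ('rotate_positive', 0.44...)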

Discussion:

I like the idea of using tactile feedback to help learn actions that involve the whole body, like dancing, but I wonder if there would be a problem with not emphasizing the important joints more than the unimportant ones. It seems to me that in dance, some amount of stylistic variation is acceptable (though I haven't spoken to any professional dancers about how important it is to have every joint bent at exactly the correct angle). Or a short person engaged in ballroom dancing with a tall person would have to position at least the arms differently than a pair of dancers of the same height. It might be possible to offset this by training the system with several different experts, and it might be slightly less of an issue with a solo activity with no props.

I also think it would be very interesting to see, as they ask when discussing future work, whether human attention can take in information from more joints of the body all at once and react effectively to it.

Wiizards: 3D Gesture Recognition for Game Play Input (Kratz, Smith & Lee)

This paper describes a dueling-wizards game where players make gestures to cast spells against opponents. Different gestures can be combined in series to make different sorts of spells. They use the Wii controller and feed its accelerometer data into an HMM gesture recognition package. They had 7 users perform each gesture over 40 times, and they found that using 10 states per gesture in the HMM gave over 90% accuracy. With 20 training gestures from one user, recognition was over 95% (or 80% with 10 training gestures). Recognition without user-dependent training data was around 50%. Gesture evaluation in their system runs at near real time for 250 gestures, but training the HMMs takes more time.
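
To get a feel for the pipeline they describe -- per-gesture HMMs over accelerometer data, then picking the model with the highest likelihood -- here is a small sketch using the hmmlearn library as a stand-in for their HMM package; the 10-state choice mirrors the paper, but the data shapes, gesture names, and everything else are my assumptions:

# Sketch of per-gesture HMM training/recognition on accelerometer data,
# roughly in the spirit of the Wiizards setup (hmmlearn is my stand-in for
# whatever HMM package they used; data shapes are invented for illustration).
import numpy as np
from hmmlearn import hmm

def train_gesture_model(examples, n_states=10):
    """examples: list of (T_i, 3) arrays of (x, y, z) accelerometer samples."""
    X = np.vstack(examples)
    lengths = [len(e) for e in examples]
    model = hmm.GaussianHMM(n_components=n_states, covariance_type="diag",
                            n_iter=20)
    model.fit(X, lengths)
    return model

def recognize(sample, models):
    """Pick the gesture whose HMM gives the new sample the highest likelihood."""
    scores = {name: m.score(sample) for name, m in models.items()}
    return max(scores, key=scores.get)

# Toy usage with random data standing in for recorded gestures.
rng = np.random.default_rng(0)
models = {
    "fireball": train_gesture_model([rng.normal(0, 1, (60, 3)) for _ in range(20)]),
    "shield":   train_gesture_model([rng.normal(2, 1, (60, 3)) for _ in range(20)]),
}
print(recognize(rng.normal(2, 1, (60, 3)), models))  # likely "shield"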

Discussion:
This sounds like it would be an interesting game to try, though I don't know how well you could convince all players to input a lot of training data so as to get remotely palatable recognition rates. It could work for a single-player game, if the design of the training session was sufficiently clever and fun, but since the focus is on being a multiplayer game, you don't want to make players who have played before wait for an hour while the new player goes through tons of sample gestures. It might be worth identifying important features of a given gesture that can be communicated to the user -- e.g. the most important features of this spiral gesture are that it is circular, stays in a plane, and never crosses itself -- and since the user is playing a game, they might be okay with learning these things. Then recognition could be based in significant part on these purposely chosen features as well.

Friday, February 29, 2008

Temporal Classification: Extending the Classification Paradigm to Multivariate Time Series (Kadous)

Summary (ONLY intro and 6.3.2 - Auslan):

Kadous' thesis deals with machine learning in domains where values vary over time, including gesture recognition. His method involves using metafeatures of the data, which capture properties including temporal structure, so that, for example, a local maximum in hand height would be noted as an event. Metafeatures are combined into synthetic events, or "interesting examples" -- a local maximum occurring near the nose versus near the chin can lead to two very different classifications.
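
My rough understanding of the metafeature idea, in code form: scan a channel (say, hand height) for local maxima and record them as events that later stages can combine into synthetic events like "maximum near the nose". This is only my own illustration of the concept, with made-up height bands, not Kadous' implementation:

# Toy illustration of a "local maximum" metafeature: turn a height signal
# into (time, value) events. The nose/chin bands are made-up numbers, just
# to show how events could feed classification; none of this is Kadous' code.

def local_maxima(signal):
    """Return (index, value) pairs where the signal has a local maximum."""
    return [(i, signal[i]) for i in range(1, len(signal) - 1)
            if signal[i - 1] < signal[i] >= signal[i + 1]]

def label_event(value, nose_band=(1.50, 1.60), chin_band=(1.35, 1.45)):
    """Map a height maximum to a coarse body region (synthetic event)."""
    if nose_band[0] <= value <= nose_band[1]:
        return "max_near_nose"
    if chin_band[0] <= value <= chin_band[1]:
        return "max_near_chin"
    return "max_elsewhere"

hand_height = [1.0, 1.2, 1.42, 1.38, 1.30, 1.44, 1.56, 1.51, 1.2]
events = [(t, label_event(v)) for t, v in local_maxima(hand_height)]
print(events)   # [(2, 'max_near_chin'), (6, 'max_near_nose')]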

One application he tests his classifier on is Auslan (Australian Sign Language). He acknowledges that fingerspelling is not sufficient for proper communication, so he focuses on whole signs. A sign is made up of handshape, location, orientation, movement, and expression (such as raising eyebrows to make a question).

He tested his classifier, which tries many different machine learning methods, on data from a Nintendo Powerglove as well as on data from 5DT gloves with a Flock-of-Birds tracker for each hand. The Powerglove data was one-handed, came from one user, and totalled 1900 signs. The Flock data was two-handed, gave more data per hand, had significantly less noise, and was collected from one native Auslan signer over a period of 9 weeks, with 27 samples per sign, making 2565 signs.

Short notes:
- Powerglove data: low accuracy overall; HMMs did best.
- Flock data: good accuracy; AdaBoost did best, and HMMs may have suffered from the much larger number of data channels.
- Rules were formed for classification.

Discussion:

It was nice to see data from a native signer, although it sounded a bit more exciting before I realized it was only one signer. I think this paper includes some interesting machine learning techniques, though I didn't look too closely at the parts of the thesis focusing on that rather than the sign language study. His data set is available at http://archive.ics.uci.edu/ml/datasets/Australian+Sign+Language+signs+(High+Quality) which could be nice to have in order to compare different, new techniques to published results.

Wednesday, February 27, 2008

American Sign Language Recognition in Game Development for Deaf Children (Brashear, Henderson, Park, Hamilton, Lee, Starner)

Summary:

This work discusses CopyCat, a game intended to help teach ASL to young deaf children. It is aimed at ages 6-11, and encourages signing in complete phrases using gesture recognition technology. They used a Wizard of Oz method to collect relevant data, since no sufficient ASL recognizer existed for their use. They use a camera and colored gloves, with a different color on each fingertip, to find hand position. The game has a character, Iris, who must be woken up with a click and then instructed by ASL gestures, and then the interaction is ended with another click. Sample phrases used for the game include "go chase snake" and "orange kitten in flowers". They use HSV histograms and a Bayes classifier to make a binary mask from the video data. They have wearable, wireless accelerometers that provide extra data to be combined with video data. They collected data with five children and, after removing samples not signed correctly according to gameplay and samples with problems like fidgeting or poor form, they had 541 signed sentences and 1959 individual signs. They used GT^2k for gesture recognition, and found 93% accuracy for a user-dependent model and 86% accuracy for user-independent models.
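
Here is my rough sketch of the HSV-histogram-plus-Bayes-classifier step as I picture it; the bin counts, the prior, and the use of plain numpy histograms are my assumptions, not details from the paper:

# A rough sketch of the HSV-histogram + Bayes-classifier idea for producing a
# binary glove/background mask. Bin counts, the prior, and plain numpy
# histograms are my assumptions; the paper's pipeline is surely more involved.
import numpy as np

BINS = 32  # histogram bins per channel (hue, saturation)

def hs_histogram(hs_pixels):
    """Normalized 2D histogram over (hue, saturation) pixel samples (N, 2)."""
    hist, _, _ = np.histogram2d(hs_pixels[:, 0], hs_pixels[:, 1],
                                bins=BINS, range=[[0, 180], [0, 256]])
    return (hist + 1e-6) / hist.sum()   # smooth so ratios stay finite

def binary_mask(hs_image, glove_hist, background_hist, prior_glove=0.3):
    """Label each pixel glove (1) or background (0) by Bayes' rule.
    hs_image: array whose [..., 0] channel is hue and [..., 1] is saturation."""
    h_bin = np.clip((hs_image[..., 0] / 180.0 * BINS).astype(int), 0, BINS - 1)
    s_bin = np.clip((hs_image[..., 1] / 256.0 * BINS).astype(int), 0, BINS - 1)
    p_glove = glove_hist[h_bin, s_bin] * prior_glove
    p_back = background_hist[h_bin, s_bin] * (1.0 - prior_glove)
    return (p_glove > p_back).astype(np.uint8)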


Discussion:

I think this paper is well-motivated: it would be hard as a hearing parent to raise an unexpectedly deaf child well, and even if the parents try to learn sign language to speak to the child, most parents probably will have some difficulty approaching fluency in it. I remember from psychology classes that being exposed to language in a non-interactive format like television is not sufficient for a child to learn language (so neglected children may not learn language, even if often left alone in a room with a television playing.) An interactive game, then, seems more likely to assist in language acquisition (though I wouldn't trust it as the only source of teaching, no matter how good the recognition is.)

I also think the data samples with "problems" like false starts or fidgeting could be useful in the future as recognition becomes more refined. It would be valuable for a computer to be able to deal with such input, at least to identify it as noise, but maybe also to extract extra information like how nervous the user is or how fluent a signer they are -- similarly, it would be useful for a verbal speech recognition system to handle stuttering and self-corrections.

Monday, February 25, 2008

Georgia Tech Gesture Toolkit: Supporting Experiments in Gesture Recognition (Westeyn, Brashear, Atrash, Starner)

Summary:

This paper describes a toolkit, GT^2k, to aid in gesture recognition via HMMs; it builds on an existing speech recognition toolkit, HTK. GT^2k allows for training models with both real-time and off-line recognition. They discuss four sample applications to which their toolkit has been applied. A gesture panel in automobiles recognizes simple gestures from camera data and gets 99.2% accuracy. A security system recognizes patterned blinking, to go with face recognition, so that the system can't be fooled with a still photograph of an authorized person; it gets 89.6% accuracy. TeleSign does mobile sign language recognition; real-time recognition was not yet implemented, but they got 90% accuracy by combining vision and accelerometer data. A workshop activity recognition system recognizes actions made while constructing an object in a workshop, such as hammering or sawing, based on accelerometer data, and they found 93% accuracy, though again not in real time.

Discussion:

The blinking idea is interesting, but I have a mental image of someone with holes cut in a photograph, maybe with fake eyelids attached that blink when a string is pulled. Plus, from an interface standpoint, it seems like most people who have already been trained by today's hand-operated devices would be more comfortable using their hands to input a code. Maybe an input pad hanging on a cord that the user can pull close and hide from people standing behind them would be a simpler solution.

It does seem like a valuable idea to have a toolkit that can be used in multiple gesture recognition applications, and anything that could improve accuracy could benefit a lot of people.

Computer Vision-based Gesture Recognition for an Augmented Reality Interface (Storring, Moeslund, Liu, Granum)

This paper discusses wearable computing and augmented reality, which the authors believe should include gesture recognition as a way for the user to input commands to interact with the interface. Their focus here is on building towards a multi-user system that allows for a round-table meeting; since a table can be assumed and most of the expected gestures involve pointing, they believe that it is reasonable to restrict the gesture set to six gestures, these being a fist, and 1 to 5 fingers extended. Then all gestures are in a plane, and the recognition problem is reduced to 2D. Their system does low level pre-segmentation of the image of a hand, using pixel color (skin color) to find where the hand is, by changing the non-hand part of the image to black, and the hand shape to white. They look for the palm to be matched up to a circle and the fingers as rectangles centered at the circle, and they differentiate gestures by how many fingers are extended. They give no numerical results, but say that users adapted quickly to using the system and that the recognition rate was high enough that users found the gesture interface useful for the AR round-table architecture planning application.
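
The circle-plus-rectangles model made me want to see how simple a finger counter could be. Here is a sketch that gets the palm circle from a distance transform and counts leftover blobs as fingers -- the palm-scale factor and minimum blob size are guesses of mine, and this is only loosely inspired by (not taken from) their method:

# Sketch of a palm-circle approach to counting extended fingers from a binary
# hand mask (white hand on black), loosely inspired by the circle-plus-
# rectangles model in the paper. The 1.4 palm-radius factor and the minimum
# blob size are guesses of mine, not values from the paper.
import cv2
import numpy as np

def count_extended_fingers(mask, palm_scale=1.4, min_finger_area=80):
    """mask: uint8 image, 255 = hand, 0 = background."""
    # Palm center/radius = peak of the distance transform inside the hand.
    dist = cv2.distanceTransform(mask, cv2.DIST_L2, 5)
    _, palm_radius, _, palm_center = cv2.minMaxLoc(dist)

    # Erase a slightly enlarged palm disc; what's left should be fingers (and
    # maybe part of the wrist, which a real system would have to exclude).
    fingers_only = mask.copy()
    cv2.circle(fingers_only, (int(palm_center[0]), int(palm_center[1])),
               int(palm_radius * palm_scale), 0, -1)

    # Count sufficiently large leftover blobs as extended fingers.
    n_labels, labels = cv2.connectedComponents(fingers_only)
    count = 0
    for label in range(1, n_labels):        # label 0 is the background
        if np.count_nonzero(labels == label) >= min_finger_area:
            count += 1
    return count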

Discussion:

I wonder how well their system does with fingers that are not held apart -- it would probably be possible to deal with recognizing multi-finger blobs when a human could infer that hey, that one "finger" is twice the width of two other fingers, or half as wide as the palm, so maybe it is really two fingers together. But it might be more of a pain to tweak the system to deal with those harder cases than to move to another recognition method -- and then of course there is the issue of moving to any more complicated gesture set that doesn't translate well to 2D.

Also, how did they find the thumb movement to be the most natural choice for a click gesture? Was that their design choice, or did they decide on it from user feedback?

Thursday, February 21, 2008

Television Control by Hand Gestures (Freeman, Weissman)

Summary:

This paper deals with studying how a person can control a TV set via hand gestures, focusing on two issues: (1) how can there be a large set of commands without requiring the users to learn difficult-to-remember gestures, and (2) how can the computer recognize commands in a complex vision-based setting? They chose gestures as the method of input because they anticipated problems with voice fatigue and awkwardness in incremental changes of parameters (like volume). Their approach is to have the user hold out an open hand to trigger a control mode, wherein a hand icon is displayed on the screen whose movements mirror the user's hand movements, and then there are command icons activated by mouseover and sliders which the hand icon can manipulate for things like volume. They use normalized correlation/template matching for recognition. They do not process objects that are stationary for some time. They created a prototype and found that users were excited about using gestures to control the TV, but weren't sure if this was just due to novelty.
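
For reference, normalized-correlation template matching is easy to sketch; this is just the generic technique (with OpenCV as a stand-in for their implementation and a made-up threshold), not their exact tracker:

# Minimal sketch of normalized-correlation template matching for finding the
# open hand: slide a hand template over the frame and take the best match if
# its score clears a threshold. The 0.7 threshold is my guess, and OpenCV's
# matchTemplate is a stand-in for whatever correlation code the authors used.
import cv2

MATCH_THRESHOLD = 0.7

def find_hand(frame_gray, template_gray):
    """Return (x, y) of the best template match, or None if below threshold."""
    scores = cv2.matchTemplate(frame_gray, template_gray, cv2.TM_CCOEFF_NORMED)
    _, best_score, _, best_loc = cv2.minMaxLoc(scores)
    return best_loc if best_score >= MATCH_THRESHOLD else None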

Discussion:

The idea of giving the TV a camera with which to watch the viewer seems both interesting and unsettling. Maybe if it were combined with recognition of what the user is doing, this sort of technology could adjust the output of the television: turning the volume down if the user is focused on reading material or a phone conversation, or up if the user is leaning forward with an intent expression as if trying to hear. I'm not sure I would want such technology in my home at all, because I think there is something to be said for predictable manual control and a sense of privacy. Maybe if users can be convinced that the system only does real-time analysis of motion and that nothing is being recorded anywhere, they will be more likely to accept the technology. Perhaps having similar monitoring around the home could also be useful for detecting injuries and illnesses that leave a person incapacitated and needing help: someone falls to the floor, a voice prompt asks, "Are you all right?", and without a verbal response, the system calls for outside help.

I like the idea of having a hand icon appear on the display when the interactive mode is initiated by the user holding up a hand. This visual feedback, in sync with the user's motions, should help make the system easier and less frustrating to use. It might become annoying if people watching TV together gesture while talking about the show and trigger the full control display, so maybe the interface could stay unobtrusive at first -- showing only a couple of sliders, or allowing a few "quick command" gestures -- and only pop up full command instructions after a couple of seconds with no significant change in input. I'm not sure whether this would actually be better. Nor am I sure that a camera is a better solution to losing the remote in the first place than gluing a string to the remote and attaching it somewhere near where you watch TV.

Monday, February 18, 2008

A Survey of Hand Posture and Gesture Recognition Techniques and Technology (LaViola)

Summary (section 3):

This section of the paper summarizes a number of algorithmic techniques that have been applied to recognition of hand postures and gestures.

They cover feature extraction, statistics, and models, which includes template matching* (simple, accurate over a small set, small amount of calibration needed, recommended for postures not gestures), feature extraction* (a statistical technique to reduce data's dimensionality, handles gestures as well as postures, slow if many features), active shape models, principal components* (recognizes around 30 postures, requires lots of training by multiple users, requires normalization), linear fingertip models, and causal analysis.

They discuss three learning algorithms: neural networks* and HMMs* (both can recognize large posture/gesture sets, good accuracy given extensive training), and instance-based learning* (relatively simple to implement, moderately high accuracy for large set of postures, provides continuous training, memory/time intensive as data set grows, not well-researched for this application).

They also discuss three miscellaneous techniques: Linguistic approach* (uses formal grammar to represent posture & gesture set, simple approach with so-far low accuracy), appearance-based motion analysis, and spatio-temporal vector analysis.

*can be done using glove data rather than only applying to vision data.

Summary (section 4):

This section discusses applications that use hand postures and gestures: sign language, gesture-to-speech, presentations, virtual environments, 3D modeling, multimodal interaction, human/robot interaction, and television control.

Discussion:

This seems like it could be a very useful paper for an introduction to haptics -- the reader would get an idea of what kinds of tools have been used for recognition and their strengths and weaknesses. The linguistic approach sounds like what we were discussing in class regarding an evolution of LADDER, and while it's slightly worrying that this paper says the accuracy found in other implementations of it so far has been low, it could also be a good contribution to the field if we are in fact able to make it work well. It might be worth looking at the 1994 paper they cite ("A Linguistic Approach to the Recognition of Hand Gestures" -- Hand, Sexton & Mullan).

Wednesday, February 13, 2008

Shape Your Imagination: Iconic Gestural-Based Interaction (Marsh, Watt)

This work revolves around a user study to determine how people make iconic hand gestures to convey spatial information, with the expectation that 3D gestures can be used to create/sculpt 3D objects in a virtual space. They asked whether people use iconic hand gestures when trying to communicate shapes and objects without speaking, and if so, how frequently iconic gestures occur, what types are used, and whether people prefer to make these gestures with one or two hands. They chose 15 objects for people to describe nonverbally, including primitives such as circle and pyramid, as well as complex and compound objects like chair, French baguette, and house. They found that users did use iconic hand gestures to describe all shapes. With primitives, subjects preferred to use two-handed virtual depiction (tracing an outline in space). With complex shapes, subjects used iconic two-handed gestures, falling back on pantomimic, deictic, and body gestures when iconic gestures were not sufficient. Users had no trouble recalling gestures they'd made.

Discussion:

This is an interesting study, but I think there would be more information they could have gotten at the same time. I think it would be interesting to know to what extent users made similar gestures, or whether they could recognize each other's gestures without seeing the card with the object name. I suppose with just the motivation of allowing intuitive 3D sculpting, rather than recognition of gestures, their study makes pretty good sense.

A Dynamic Gesture Recognition System for the Korean Sign Language (KSL) (Kim, Jang & Bien)

Summary:

This work describes a method for online gesture recognition with a fuzzy min-max neural network. They focus on recognizing Korean Sign Language (KSL) gestures, which are generally two-handed; most of the 6000 gestures in the language are made up of combinations of basic gestures, so they chose 25 important gestures for their study. They use a VPL DataGlove, which provides 10 finger flex angles per hand, (x, y, z) position, and roll, pitch, and yaw. They identified 10 basic direction types of motion patterns, most of which are straight-line motions, plus one arc and one circle. Using the fuzzy min-max neural network to recognize gestures, they get nearly 85% accuracy.
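
To make the "basic direction types" idea concrete, here is a toy sketch that buckets a hand displacement into eight straight-line directions; the paper's ten types also include an arc and a circle, which this ignores, and the axis conventions are my own:

# Toy sketch of classifying a hand displacement into one of eight straight-line
# direction primitives (the paper's ten types also include an arc and a circle,
# which this doesn't attempt). Axis conventions and the 8-way split are mine.
import math

DIRECTIONS = ["right", "up-right", "up", "up-left",
              "left", "down-left", "down", "down-right"]

def direction_type(start, end):
    """start, end: (x, y) hand positions; returns a coarse direction label."""
    dx, dy = end[0] - start[0], end[1] - start[1]
    angle = math.degrees(math.atan2(dy, dx)) % 360.0
    return DIRECTIONS[int((angle + 22.5) // 45) % 8]

print(direction_type((0.0, 0.0), (0.1, 0.9)))   # -> "up"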

Discussion:

The idea of 10 basic types of direction is interesting from a 3D gesture-based LADDER equivalent idea. Does this hold true in ASL, that there tends to be only a certain set of general, simple motions, rather than complicated curves, that could be used to describe most gestures? Then there would seem to be hope for creating a system that lets a user describe gestures in terms of those simple motions.

Wednesday, February 6, 2008

A Similarity Measure for Motion Stream Segmentation and Recognition (Li & Prabhakaran)

This paper deals with recognizing motion streams, as generated by human body motions like sign language, using an SVD-based similarity measure. They represent motion with a matrix: columns represent positions of different joints, and rows represent different instants in time. Two motions can be compared only if their matrices have the same number of attributes (columns); the matrices may have different numbers of rows, since a fast gesture can have the same meaning as a slow one. They discuss how SVD exposes the geometric structure of a matrix and can be used to compare two matrices. They performed a study with a CyberGlove, covering 18 motions (ASL for Goodbye, Idiom, 35, etc.), generating 24 motion streams with 5 to 10 motions in each stream. This data allowed consideration of segmentation issues. They also got motion capture data from 62 isolated motions from dance movements (each repeated 5 times). They find near 100% accuracy for isolated motions and around 94% accuracy in motion streams with their kWAS algorithm (looking at the first k eigenvectors, with k=6). It is much more accurate and faster than EROS and slightly better than MAS, the other algorithms they compare against.
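
Here is my rough reading of the SVD-based comparison, as a sketch; treat it as the flavor of the measure rather than the paper's exact kWAS formula, and the weighting scheme is my guess:

# My rough reading of the kWAS idea, as a sketch: compare two motion matrices
# (rows = time samples, columns = joint attributes) by how well their first k
# right singular vectors line up, weighting by the singular values. This is a
# paraphrase of the flavor of the measure, not the paper's exact formula.
import numpy as np

def kwas_similarity(A, B, k=6):
    """A, B: (frames x attributes) arrays with the same number of columns."""
    _, sa, Va = np.linalg.svd(A, full_matrices=False)   # rows of Va = right SVs
    _, sb, Vb = np.linalg.svd(B, full_matrices=False)
    k = min(k, Va.shape[0], Vb.shape[0])
    weights = (sa[:k] / sa.sum() + sb[:k] / sb.sum()) / 2.0
    cosines = np.abs(np.sum(Va[:k] * Vb[:k], axis=1))   # |cos| of vector pairs
    return float(np.sum(weights * cosines))

# Toy check: a motion compared with a slower (resampled) copy of itself should
# score higher than against an unrelated motion.
rng = np.random.default_rng(1)
motion = rng.normal(size=(100, 22))
slower = np.repeat(motion, 2, axis=0)      # same motion, twice as many frames
other = rng.normal(size=(100, 22))
print(kwas_similarity(motion, slower), kwas_similarity(motion, other))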


Discussion:

I appreciate that they used words that would be used in regular ASL rather than just letters, since having a broad selection of recognized words will probably be more useful than fingerspelling for native speakers of ASL. It's also nice to see an unusual application like dance, even if it does include only isolated motions. I'm curious how complex these repeated short motions were -- whether each was a basic step that a whole dance might be built from, the kind someone might practice repeatedly to master, or just a brief segment of the dance that isn't commonly practiced by itself. It might also be interesting to see how well a practice step's motion data corresponds to the data for that same step integrated into a more complicated set of dance motions, and whether the difference between these cases is greater for expert versus novice dancers.

Cyber Composer: Hand Gesture-Driven Intelligent Music Composition and Generation (Ip, Law, & Kwong)

Summary:

This work describes a system designed to allow both experienced musicians and novices to compose music by using hand gestures. The authors explain automated music generation in terms of music theory, discussing tonality, chord progression, closure of musical phrases (cadence), and generation of a melody that follows the chords, and how all of these can be somewhat automated based on general rules for what makes a coherent piece of music. Then they describe their system architecture and implementation: a pair of CyberGloves with Polhemus 3D position trackers; MIDI to synthesize musical sound; a music interface that converts musical expressions to MIDI signals; background music generated according to music theory and user-defined parameters of tempo and key; and melody generated according to hand signals, music theory, and a style template.

They describe the specific gesture mapping they chose for the system in depth, based on five guidelines: (1) Musical expressions should be intuitive; (2) those requiring fine control should be mapped to agile parts of the hands; (3) the most important expressions should be easily triggered; (4) no two gestures should be too similar; (5) accidental triggering should be avoided. They map rhythm to wrist flexion because it is very important but doesn't require fine movement. Pitch is important, so they map it to the relative height of the right hand, though it resets at a new bar of music. Pitch shifting of melody notes also occurs if the right hand is moved far enough relative to its position for the previous melody note. Dynamics (how strongly the note is played) and volume are controlled by right-hand finger flexion: extended fingers mean a stronger note. Lifting the left hand higher than the right hand adds a second instrument, which plays in unison or harmonizes two notes higher. Cadence occurs when the left-hand fingers completely bend, and keeping the hand closed stops the music.
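
Just to make the mapping concrete for myself, here is a toy sketch of turning hand state into musical parameters along the lines they describe; all the ranges and formulas are invented, and the real system's MIDI generation is far more involved:

# A little sketch of the kind of gesture-to-music mapping the paper describes
# (wrist flexion -> rhythm, right-hand height -> pitch, right finger flexion ->
# dynamics, left hand raised above right -> second instrument, left fist ->
# cadence). All ranges and formulas here are invented for illustration.

def music_parameters(right_hand, left_hand):
    """Each hand is a dict with 'height' (m), 'wrist_flex' and 'finger_flex'
    in 0..1 (0 = fully extended, 1 = fully bent)."""
    return {
        "rhythm_density": 1 + round(right_hand["wrist_flex"] * 3),  # notes/beat
        "pitch_offset": int((right_hand["height"] - 1.0) * 12),     # semitones
        "velocity": int(127 * (1.0 - right_hand["finger_flex"])),   # dynamics
        "second_instrument": left_hand["height"] > right_hand["height"],
        "cadence": left_hand["finger_flex"] > 0.9,                  # fist closed
    }

print(music_parameters(
    right_hand={"height": 1.4, "wrist_flex": 0.5, "finger_flex": 0.2},
    left_hand={"height": 1.0, "wrist_flex": 0.0, "finger_flex": 0.1}))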

The GUI lets the user choose an instrument, key, tonality, and tempo from drop-down menus (presumably with a mouse) before beginning composition with the CyberGloves.

Discussion:

Due to my lack of knowledge of composing music, I'm not sure I understood all of the automated music generation section, but if I did, it seems as though this could limit how much variety of music can be composed with this system. Then again, I could possibly be convinced that working within the rules still leaves more than enough flexibility to create interesting, original music, and it makes sense that it would be easier to build a system that automates some things so the user can adjust the big picture with gestures. It does seem like their system might be difficult enough to learn as it is, without also requiring the user to specify every detail of the music via hand movements, though maybe this could be alleviated with a sufficiently informative and usable GUI.

I think the part of the paper that is most likely to apply to other applications is their list of guidelines for determining gesture-to-meaning mapping, if we go to create our own gesture set. They seem kind of obvious, but in designing a system and writing a paper about it, it would be good to have a list of rules like that to compare our choices against.

Monday, February 4, 2008

A Multi-Class Pattern Recognition System for Practical Finger Spelling Translation (Hernandez-Rebollar, Linderman, Kyriakopoulos)

Summary:

This paper describes a system for recognizing the 26 letters of the ASL alphabet, along with two other signs (space and enter), and also describes an attempt at an affordable alternative to other data gloves. They call their device the Accele Glove; it uses a microcontroller and dual-axis accelerometers. They placed the dual-axis sensors on the middle joints of the fingers and the distal joint of the thumb to eliminate ambiguity in the ASL alphabet. The accelerometers can track joint flexion and hand roll or yaw, or individual finger abduction, because they are suspended by springs.

They acquired data with five people each signing all 26 letters ten times, with J & Z only at their final position. For classification, they divide gestures into three subclasses: vertical, horizontal, or closed, defined by dividing the 3D space with planes whose locations are based on index finger position. They use a decision tree whose first division is based on these subclasses and further divisions are based on features of the gestures, like "flat" vs. "rolled" postures. They found that 21/26 letters reached 100% recognition, and that R, U & V could not be distinguished with their system.
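
The two-stage idea (a coarse subclass from the index finger, then a per-subclass tree) is easy to sketch; the thresholds and dispatch below are placeholders of my own, not the paper's actual decision tree:

# Sketch of the two-stage decision idea: first bucket a posture into one of
# the three subclasses (vertical / horizontal / closed) from the index-finger
# reading, then hand off to a per-subclass classifier. The thresholds and the
# per-subclass classifiers here are placeholders, not the paper's actual tree.

def subclass(index_pitch, index_flex):
    """index_pitch: tilt of the index finger in degrees (0 = pointing forward,
    90 = pointing up); index_flex: 0 = straight .. 1 = fully curled."""
    if index_flex > 0.7:
        return "closed"            # fingers curled toward the palm (A, E, S, ...)
    if index_pitch > 45.0:
        return "vertical"          # fingers pointing up (B, D, F, ...)
    return "horizontal"            # fingers pointing forward (G, H, ...)

def classify_letter(sample, per_subclass_classifiers):
    """sample: dict of per-finger readings; dispatch to the right sub-tree."""
    bucket = subclass(sample["index_pitch"], sample["index_flex"])
    return per_subclass_classifiers[bucket](sample)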


Discussion:

This seems like a fairly reasonable method, and it seems like it should be able to be tweaked to recognize the letters they can't recognize. Of course, then there's always the question of whether it extends to other kinds of gestures without more and more tweaking.

At one point I misread their discussion of a voice synthesizer and thought they meant they were taking in voice data as well, which seemed like an interesting idea on its own. Having the person speak aloud while making gestures during a user study could be useful later when we are trying to figure out what gestures they intended to make. On the other hand, it might have a negative effect if the person talks more, or more slowly, than she would ordinarily make the accompanying gestures, or if the act of talking distracts the user from making gestures, sort of like how people may feel awkward describing what they are doing aloud in studies that test product usability.

Saturday, February 2, 2008

Hand Tension as a Gesture Segmentation Cue (Harling & Edwards)

Summary:

This paper focuses on "the segmentation problem" -- discriminating between two or more fluidly connected gestures. The authors emphasize that their approach is recognition-led: rather than looking at what gestures would be useful for a particular interface and creating a recognizer for just those, they are making recognizers that could possibly be incorporated into various interfaces. They divide gestures into postures and gestures (static & dynamic), and each of these groups is divided by whether or not hand motion and orientation are considered (giving categories like Static Posture Static Location: SPSL -- similarly, DPSL, SPDL, DPDL, in order of complexity). Segmenting gestures from a less complex class is easier than from a more complex class. They suggest that fingertip acceleration maxing away from the body may indicate an intention to produce another gesture. They also suggest considering the minima on the hand tension graph or other changes in the graph's shape. They give an equation to model finger tension based on finger-joint angles. Tension is considered as a sum of the tension in each finger.

They tested the hand tension model on two sets of data using a Mattel Power Glove, which scores finger bend from 1-4 on 4 fingers. They tried two BSL sentence fragments: "MY NAME" and "MY NAME ME". The graphs indicate that tension is maximized where the intentional postures occur, with minima between them. They admit that more data is a necessary next step before firm conclusions can be made.

Discussion:

This paper seems to focus on the equivalent of recognizing sketched geometric primitives so as to eventually be able to recognize complex shapes made up of them, which makes it an excellent follow-up to recent class discussion. Supposing that they didn't choose an overly easy pair of sample sentence fragments to segment, their approach seems pretty promising, and even if most gestures don't turn out to divide well based on tension, it seems likely to be worth including in some way for the cases where it is useful. I wonder if it would be worth putting an actual tension sensor in the glove, supposing the glove fits well -- maybe an elastic string along the inside/outside of each finger that either stretches and puts pressure on a sensor or falls slack and doesn't -- and whether this would compare favorably to the kind of angle-based tension they are talking about, or just be redundant and unnecessary.

Wednesday, January 30, 2008

A Dynamic Gesture Interface for Virtual Environments Based on Hidden Markov Models (Chen, El-Sawah, Joslin & Georganas)

Summary:

This paper describes a system based on HMMs to do continuous dynamic gesture recognition, motivated by natural interaction in virtual environments. They review the major points of an HMM. They collect data from a CyberGlove and use three different dynamic gestures to control a cube's rotation. They use a multi-dimensional HMM and use the standard deviation of the angle variation for each finger joint as an alternative to requiring pauses in gesturing to split the data into meaningful pieces. They collected 10 data sets for each of the three gestures they wanted to recognize in order to train the HMMs. They have a 3D hand bone structure model to give extra feedback and show what the data from the glove looks like.


Discussion:

The difference between this paper and others we've recently read is that it deals with continuous gestures rather than requiring a single brief gesture at a time with a pause before the next. From the image, I find it hard to tell exactly what gestures they chose, or to see any intuitive connection between the gestures and a rotating 3D cube, though it did seem they only used the rotating cube as a visualization of how the commands are being recognized.

The idea of a repetitive, continuous gesture is something we haven't considered very much so far. Is it useful to be able to break up a graph and look for repetition, like we do with overtraced circles and spirals? Are there many natural gestures that are repetitive and continuous like this? Waving to instruct somebody to move or be quiet might fall under this pattern, but what other things are there?

Online, Interactive Learning of Gestures for Human/Robot Interfaces (Lee & Xu)

This paper presents a system that can recognize gestures and learn new ones online from one or two examples, using HMMs. They base their idea on the following procedure: (1) the user makes a series of gestures; (2) the system segments the data into separate gestures, then either reacts to a gesture if it recognizes it or asks the user for clarification; and (3) the system adds the new example to the list of examples it has seen and retrains the HMM on the data so far using the Baum-Welch algorithm. They represent gestures by reducing the data to a one-dimensional sequence of symbols: the data is resampled at even intervals, divided into time windows, and vector quantized. They generate the codebook for this offline using the LBG algorithm. Their segmentation process requires that the hand be still for a short time between gestures, though they believe an acceleration threshold would be useful if the hand does not stop. They have a simple function that gives a confidence measure for each gesture's classification, and they tested the system on 14 letters of the sign language alphabet, chosen for not being ambiguous without hand orientation data. They found 1%-2.4% error after 2 examples and close to none after 4 or 6 examples in their two tests. Their future goals include increasing vocabulary size by using 5 dimensions of symbols (one per finger).
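
The symbol-sequence representation is the part I'd most likely reuse, so here is a sketch of it with k-means standing in for the LBG algorithm; the window size, feature dimension, and codebook size are assumptions of mine:

# Sketch of the symbol-sequence representation: build a codebook offline (here
# with k-means as a stand-in for the LBG algorithm), then at run time map each
# time window's feature vector to its nearest codeword, giving a 1-D sequence
# of symbols to feed an HMM. Window, dimension, and codebook size are guesses.
import numpy as np
from sklearn.cluster import KMeans

def build_codebook(feature_vectors, n_codewords=32):
    """feature_vectors: (N, D) array pooled from many recorded gestures."""
    return KMeans(n_clusters=n_codewords, n_init=10).fit(feature_vectors)

def to_symbols(gesture, codebook, window=5):
    """gesture: (T, D) evenly resampled glove samples -> list of codeword ids."""
    windows = [gesture[i:i + window].mean(axis=0)
               for i in range(0, len(gesture) - window + 1, window)]
    return codebook.predict(np.array(windows)).tolist()

rng = np.random.default_rng(2)
training_pool = rng.normal(size=(5000, 8))          # pooled feature vectors
codebook = build_codebook(training_pool)
print(to_symbols(rng.normal(size=(60, 8)), codebook))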

Discussion:

I am curious how natural pausing between gestures will be in how many applications. As we've discussed in class, applications like sign language might use a very fluid series of gestures. But in the case of some kinds of commands, pauses are probably very natural, unless you want to do a fast sequence of commands and not have to wait for confirmation of comprehension between them. I can imagine "corner finding" based on direction and speed could be another useful tool to segment gestures into more manageable pieces.

I think resampling at even intervals as in this paper will be a very good thing to keep in mind, along with jitter reduction.

Monday, January 28, 2008

The HoloSketch VR Sketching System (Deering)

The creators of the HoloSketch system wanted a high-quality, high-accuracy VR display system that allows non-programmers to create accurate virtual objects more quickly than with more common 3D drawing systems. Their 3D sketching system uses a CRT monitor display, a 3D mouse which works like a one-fingered data glove, and a set of glasses that enables head tracking. They implemented a "fade-up" menu: after a small delay on a wand button press, the current view is replaced (with a fade effect) by a pie menu, with sub-menu capability. Drawing is done by holding down a different wand button and using the wand tip as a sort of 3D pen tip. Prepackaged objects can be created by using the menu to select one, pushing the wand button to create a new instance, and moving the wand to affect the size, shape, or form of the object. Other kinds of objects are made by leaving a trail of material behind the wand, like wire-frame lines or virtual "toothpaste". HoloSketch supports editing operations once an object is selected by moving the wand tip inside it and pressing the middle wand button, and to enable complex operations without accidental wand button presses triggering unintended actions, keyboard buttons must be held down at the same time. They have a "10X reduction mode" (meta key) to reduce problems with jitter. The system also allows animations to be added to objects.

They found that it is "surprisingly quick and easy to create complex forms in [3D] using HoloSketch." They find that the most common mistake is people not expecting head motion to make a difference in the display because people have been trained to expect otherwise by standard computer use. They had a single non-programmer artist use trials of the system over several months. They found that elbow support was necessary for extended work, and that as an expert user she could draw two-handed to rapidly create complex shapes with the toothpaste primitive.

Discussion:

I wonder what the cost of reproducing a system like this would be today. It seems like it would be easier to work with a flat-panel LCD monitor than a CRT, because the screen is flat, there is no layer of thick glass, and no corrections should be needed for that.

I imagine pie menus would be even more useful in 3D space than in 2D with a pen, because if the user is just pointing without physical contact to steady her hand it would be hard to point to a specific list item easily (I have noticed this with the Wii and doing tasks like "typing" a name by pointing and clicking at a keyboard image.) It would be nice to better understand what they are talking about in this paper when they discuss how they do jitter reduction (their "10X reduction mode" -- see page 5, or 58) -- I think this is very important in any hand-tracking-based application requiring precision.

An Architecture for Gesture-Based Control of Mobile Robots (Iba, Weghe, Paredis, Khosla)

This paper describes an approach for controlling mobile robots with hand gestures. The authors believe that capturing the intent of the user's 3D motions is a challenging but important goal for improving interaction between human and machine, and their gesture-based programming research is aimed at this long-term goal. They have made a gesture spotting and recognition algorithm based on an HMM. Their system setup includes a mobile robot, a CyberGlove, a Polhemus 6DOF position sensor, and a geolocation system to track the position and orientation of the robot. The robot is directed by hand gestures: opening and closing (going between a flat hand and a closed fist), pointing, and waving left or right. Closing slows and eventually stops the robot; opening maintains its current state; pointing makes the robot accelerate forwards (local control) or "go there" (global control); waving left or right makes the robot increase its rotational velocity in that direction (local) or go in the direction the hand is waving (global).

They pre-process the data to improve speed and performance by reducing the glove data from an 18-dimensional vector to a 10-dimensional feature vector, augmenting it with its derivatives, and mapping the input vector to an integer codeword -- they chose to use 32 codewords. They partitioned a data set of 5000 measurements covering the whole range of possible hand movements into 32 clusters and calculated a centroid for each, so at run time a feature vector can be mapped to a codeword and a gesture can be treated as a sequence of codewords.

Discussion:

I think the most potentially useful part of this paper is the idea of reducing gestures to a sequence of codewords, since that would simplify the data we'd have to deal with by a lot. However, I wonder if they really got as thorough a sampling as they think they did, and if it allows for subtly different gestures. I don't buy that 32 codewords would be enough for all conceivable gestures, especially for something complex -- obviously ASL has 24 different static gestures that mean different things, and I'd bet we could find 9 more distinct gestures. Maybe increasing the number of codewords would help, but I'd still be wary.

I also think that while their system is a good step toward robot control using hand tracking, it isn't convincing that hand gestures are a good way to control robots doing important things that require much precision. I think I've played games where a character is controlled by using the up arrow to walk forward from the character's perspective and right and left keys to turn the character to his left or right, and that is hard enough to control without the input being interpreted as adding velocity rather than just turning or moving at a fixed speed. As for the global controls, pointing is imprecise, especially at a distance. Think about trying to have people point out a constellation to you: their arm is neither aligned with their line of sight nor yours. Waving seems even harder to be precise about. For circumstances that require accuracy in the robot's movement, I think a different set of controls would be necessary, though it might still be possible to do it using a glove or hand tracking.

Also of note, the 8th reference in this paper refers to public domain HMM C++ code by Myers and Whitson, which might be nice to look at if we need HMM code.

Wednesday, January 23, 2008

An Introduction to Hidden Markov Models (Rabiner, Juang)

(note: this paper is not haptics-specific)

Summary:

This paper is intended as a tutorial to help readers understand the theory of Markov models and how they have been applied to speech recognition problems. The authors describe how some data might be modeled according to a simple function like a sine wave, but noise might cause it to vary over time in some variable. Within a "short time" period, part of the data might be able to be modeled with a simpler function than the whole set. The whole can be modeled by stringing together a bunch of these pieces, but it can be more efficient to use a common short-time model for each well-behaved part of the data signal along with a characterization of how one of these parts evolves to the next. This leads to hidden Markov models (HMMs). Three problems to be addressed are given: (1) how to identify a well-behaved period, (2) how the sequentially evolving nature of these periods can be characterized, and (3) what typical or common short-time models should be chosen for each period. This paper focuses on what an HMM is, when it is appropriate, and how to use it.

They define an HMM as a doubly stochastic process -- a process with two layers of randomness -- where one of the processes is hidden and can only be observed through another set of processes that produce the output sequence. They illustrate this with a coin toss example, where only the result of each toss in a series is known. They give examples with fair coins (50-50 odds) and biased coins (weighted towards heads or tails) to illustrate that, knowing nothing besides the output sequence, it is hard to determine the number of states of the model, how to choose the model parameters, and how large the training data set needs to be.

They define elements, mechanism, and notation for HMMs in their paper, so as to give sample problems and solutions using HMMs. The first problem is, given an observation sequence and a model, how to compute the probability of the observation sequence (or evaluate and give it a score). The second is how to choose a state sequence which is optimal in some meaningful sense, given an observation sequence (how to uncover the hidden model). The third is how to adjust model parameters to maximize the probability of the observation sequence (to describe how the training sequence came about via the parameters). They give explanations of how to solve these problems, with formulae. They give some issues related to HMMs that should be kept in mind when applying one.

They give an example of use of an HMM in isolated word recognition with a vocabulary of V words, a training set for each word, and an independent testing set. To enable word recognition, they build an HMM for each word, calculate a probability for each word model, and choose the one with the highest probability.
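
The word-recognition recipe is basically problem 1 (scoring) run once per word model, so here is a bare-bones forward algorithm for a discrete-observation HMM (real implementations work in log space or rescale to avoid underflow):

# Bare-bones forward algorithm for a discrete-observation HMM: computes
# P(observations | model), which is what the isolated-word recognizer evaluates
# for every word model before picking the highest-scoring word.
import numpy as np

def forward_probability(obs, pi, A, B):
    """obs: sequence of observation symbol indices
    pi: (N,) initial state distribution
    A:  (N, N) transition matrix, A[i, j] = P(state j at t+1 | state i at t)
    B:  (N, M) emission matrix, B[i, k] = P(symbol k | state i)"""
    alpha = pi * B[:, obs[0]]                  # alpha_1(i)
    for symbol in obs[1:]:
        alpha = (alpha @ A) * B[:, symbol]     # induction step
    return alpha.sum()                         # P(O | model)

def recognize_word(obs, word_models):
    """word_models: dict word -> (pi, A, B); choose the most likely word."""
    return max(word_models,
               key=lambda w: forward_probability(obs, *word_models[w]))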

Discussion:

I like that the HMM is designed around the idea of a degree of randomness and the idea that you can't directly see the cause of the output that you can see -- this makes it seem like a good tool to apply to situations that match these criteria. I can see some parallels in human perception and recognition: we try to make sense of events that happen, but sometimes there are causes we can't see, and we're less likely to acquire an incorrect belief if we're open to the idea that we can't know everything.

I'm not yet convinced as to how quick and easy it would be to implement the math, but this seems like a fairly beginner-friendly reference for it.

American Sign Language Finger Spelling Recognition System (Allen, Pierre, Foulds)

Summary:

This short paper describes a system built to recognize fingerspelled letters in ASL, motivated by helping integrate deaf communities into mainstream society, especially those who are better at ASL than at reading and writing English. They used a CyberGlove to capture hand position data -- it's not clear to me whether they use a stream of data or just take a snapshot of the glove's positions at one point in time. They left out J and Z because those letters involve motion in 3D space. They used Matlab with the Neural Networks toolbox for letter recognition. They found 90% accuracy on the 24 letters they used, but only for the person whose data was used to train the neural network.
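
My mental model of the setup, as a sketch: one glove snapshot per letter fed to a small neural network classifier. I'm using scikit-learn's MLPClassifier as a stand-in for their Matlab Neural Networks toolbox, and the sensor count, network size, and toy data are all assumptions:

# Sketch of static fingerspelling recognition: a snapshot of glove sensor
# values per letter, fed to a small neural network. MLPClassifier stands in
# for the Matlab toolbox; feature count and network size are assumptions.
import numpy as np
from sklearn.neural_network import MLPClassifier

N_SENSORS = 18   # e.g. CyberGlove joint-angle channels (assumed)

def train_letter_classifier(samples, labels):
    """samples: (N, N_SENSORS) glove snapshots; labels: letters ('a'..'y')."""
    clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=2000)
    clf.fit(samples, labels)
    return clf

# Toy usage with fabricated data standing in for recorded snapshots.
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(i, 0.3, (10, N_SENSORS)) for i in range(3)])
y = ["a"] * 10 + ["b"] * 10 + ["c"] * 10
clf = train_letter_classifier(X, y)
print(clf.predict(rng.normal(1, 0.3, (1, N_SENSORS))))   # likely ['b']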

Discussion:

This is an interesting paper, but as the authors say, it is only a beginning step. Being able to recognize 24 letters of the alphabet is nice, but it isn't nearly sufficient for any kind of normal-speed conversation. Would you ask a hearing person to spell out their sentences instead of speaking in words? Setting aside the issues of natural language processing and assuming there is a distinct gesture for every word, and supposing our only goal is to recognize words so as to translate them directly, there remains a huge number of gestures to be learned in order to facilitate comfortable, natural conversation. The 90% accuracy rate given in the paper for the alphabet is obviously going to go down as the search space gets larger. Based on this paper alone, sign language is not a solved problem -- I'd expect it to be a pretty hard problem.

Additionally, it might be important to see if wearing a glove affects hand movements, in case anyone ever tries to apply glove data to training vision-based recognition systems.

Tuesday, January 22, 2008

Flexible Gesture Recognition for Immersive Virtual Environments (Deller, Ebert, Bender, Hagen)

Summary:

This paper presents the use of all three dimensions in a computer interface as more immersive than a traditional desktop setup, but not yet widespread because adequate interfaces are lacking. The authors believe that a good interface will allow manipulation using intuitive hand gestures, as this will be most natural and easy to learn. They review related work and find that there is still a need for gesture recognition that can cope with varying conditions like different users and different hardware, and that can work in real time for a variety of gestures.

The authors used a P5 Glove with their gesture recognition engine to get position and orientation data. Tracking is cut off if the back of the hand is angled away from the receptor and conceals too many reflectors, which is okay for the P5's intended purpose of being used sitting in front of a desktop computer. To reduce computation time, they define gestures as a sequence of postures with specific positions and orientations of the user's hand, rather than as motion over a period of time. Postures for the recognition engine are made up of flexion values of the fingers, orientation data of the hand, and a value representing the relevance of the orientation for the posture. Postures are taught to the system by performing them, and the system can similarly be trained for a given user. The recognition engine has a data acquisition thread constantly checking whether received data from the glove matches anything from its gesture manager component. Data is filtered to reduce noise, marked as a candidate for a gesture if the gesture manager finds a likely match, and marked recognized if held for a minimum time span (300-600 milliseconds by their tests). The gesture manager keeps a list of known postures and functions to manage it. A known posture is stored with a "finger constellation", a 5D vector of the fingers' bend values. If the current data is within some minimum recognition distance, the orientation is checked similarly. Likelihood of a match is given based on these comparisons.
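
My reading of the posture matcher, in miniature: compare the current bend values to each stored finger constellation, check orientation only when the stored posture says it matters, and only report a posture once it has been held long enough. The thresholds and data layout below are my assumptions, not the paper's values:

# Miniature sketch of finger-constellation posture matching with a hold-time
# requirement. Thresholds and data layout are assumptions for illustration.
import math
import time

def distance(v1, v2):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v1, v2)))

def match_posture(bend, orientation, postures,
                  bend_threshold=0.15, orient_threshold=0.3):
    """bend: 5 finger flexion values in 0..1; orientation: e.g. (roll, pitch, yaw).
    postures: dict name -> {'bend': [...], 'orientation': [...], 'use_orientation': bool}"""
    best_name, best_dist = None, bend_threshold
    for name, p in postures.items():
        if p["use_orientation"] and distance(orientation, p["orientation"]) > orient_threshold:
            continue
        d = distance(bend, p["bend"])
        if d < best_dist:
            best_name, best_dist = name, d
    return best_name

def recognize_held(read_glove, postures, hold_seconds=0.4):
    """Report a posture only after it has been held for hold_seconds."""
    candidate, since = None, None
    while True:
        bend, orientation = read_glove()
        name = match_posture(bend, orientation, postures)
        if name != candidate:
            candidate, since = name, time.time()
        elif name is not None and time.time() - since >= hold_seconds:
            return name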

They find that the system works well with applications like American Sign Language and letting the user make a "click" gesture by pointing and "tapping" at an object. Their sample implementation uses a virtual desktop where the user can select a document by moving his hand over it and making a fist to represent grabbing it. They have other gestures for opening and browsing through documents.

Discussion:

I wonder if there might be a good way to enable a wider range of hand motion to be recognized by using more than one tracker location. If there were views from the side, could they be incorporated into the big picture for a more complete view, or perhaps whichever view sees the largest number of reflectors? I don't know if this is feasible or worthwhile with the equipment we have, or if users are even really bothered by needing to learn how to hold their hands.

I also imagine that there may be a tradeoff between ease of use and allowing intuitive, natural gestures -- the gestures they describe for browsing are not entirely intuitive to me, and I would not likely guess how to open a document without being shown how to. However, without tactile feedback, making the same gesture to open a book that I would make in physical reality could be just as difficult to accomplish, as some sort of sweeping gesture could be interpreted as moving the book instead.


Deller, M., A. Ebert, et al. (2006). "Flexible Gesture Recognition for Immersive Virtual Environments." Tenth International Conference on Information Visualization (IV 2006).

Environmental Technology: Making the Real World Virtual (Myron W. Krueger)

Krueger, M. W. (1993). "Environmental technology: making the real world virtual." Commun. ACM 36(7): 36-37.

Summary:

This article describes research and applications that focus on letting humans act using natural human gestures in order to communicate with computers. A number of examples are given: pressure sensors in the floor track a human's movements around the room or trigger musical tones (1969). A video-based telecommunication space was found to work best by superimposing images of hands into a computer-graphic image (1970). A 2D VIDEOPLACE medium (1974-1985) serves as an interface to 2D and 3D applications, such as one that allows the user to fly around a virtual landscape by holding out their hands and leaning in the direction they want to go. Practical applications include therapeutic analysis of body motion, language instruction, and virtual exploration of other planets. 3D sculpture can be done using thumbs and forefingers. The primary use the author foresees of this technology is simply teleconferencing to discuss traditional documents.

Discussion:

This paper provides a nice introduction to some of the kinds of things that have been done in this field without giving many technical details, only describing interface design choices.

I can imagine some of these applications meshing well with the head-tracking technique using the Wii components, as in the "Head Tracking for Desktop VR Displays using the WiiRemote" video. Space exploration, for example, might work well with a screen representing a window out of a spaceship. Combining this with use of gloves to control movement might make for a fun, immersive experience, though I'm not certain it would be an improvement on a setting with VR goggles aside from lower cost and greater accessibility to average consumers. I like the idea of using a simple hand gesture to specify movement and change of direction. Maybe there would be some intuitive gesture to switch between movement and manipulation of the environment, like some sort of reaching forward to put on gloves, which could make images of gloves/hands appear in the scene.