Friday, February 29, 2008

Temporal Classification: Extending the Classification Paradigm to Multivariate Time Series (Kadous)

Summary (ONLY intro and 6.3.2 - Auslan):

Kadous' thesis deals with machine learning in domains where values vary over time, including gesture recognition. His method involves extracting metafeatures of the data, which capture temporal properties -- for example, a local maximum in height might be noted as an event. He then considers synthetic events, or "interesting examples", built from these metafeatures, so that a local maximum occurring near the nose versus one near the chin can lead to two very different classifications.
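As a concrete illustration of the metafeature idea, here is a small sketch (entirely my own, not Kadous' code) that extracts "local maximum" events from one channel of a multivariate time series; the window size and prominence threshold are made-up parameters.

    import numpy as np

    def local_max_events(signal, times, w=5, min_prominence=0.05):
        """Return (time, value) pairs where the channel has a local maximum.

        A toy stand-in for one kind of metafeature: each event records when and
        at what height a peak occurred, so a later learner can ask questions
        like "is there a peak near the nose rather than near the chin?"
        """
        events = []
        for i in range(w, len(signal) - w):
            window = signal[i - w:i + w + 1]
            if signal[i] == window.max() and signal[i] - window.min() >= min_prominence:
                events.append((times[i], signal[i]))
        return events

    # toy example: hand height over one second with a single smooth bump
    t = np.linspace(0.0, 1.0, 50)
    height = 0.5 + 0.3 * np.sin(2 * np.pi * t)
    print(local_max_events(height, t))   # one event near t = 0.25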

One application he tests his classifier on is Auslan (Australian Sign Language). He acknowledges that fingerspelling is not sufficient for proper communication, so he focuses on whole signs. A sign is made up of handshape, location, orientation, movement, and expression (such as raising the eyebrows to mark a question).

He tested his classifier, which tries many different machine learning methods, on data from a Nintendo Powerglove as well as on data from 5DT gloves with a Flock-of-Birds tracker for each hand. The Powerglove data was one-handed, came from a single user, and totalled 1900 signs. The Flock data was two-handed, gave more data per hand, and had significantly less noise; it was collected from one native Auslan signer over a period of 9 weeks, with 27 samples per sign, for a total of 2565 signs.

Short notes:
- Powerglove: low accuracy overall; HMMs did best
- Flock: good accuracy; AdaBoost did best, and HMMs may have struggled with the larger number of data channels
- Classification rules were also formed

Discussion:

It was nice to see data from a native signer, although it sounded a bit more exciting before I realized it was only one signer. I think this paper includes some interesting machine learning techniques, though I didn't look too closely at the parts of the thesis that focus on those rather than on the sign language study. His data set is available at http://archive.ics.uci.edu/ml/datasets/Australian+Sign+Language+signs+(High+Quality) which would be nice to have for comparing new techniques against published results.

Wednesday, February 27, 2008

American Sign Language Recognition in Game Development for Deaf Children (Brashear, Henderson, Park, Hamilton, Lee, Starner)

Summary:

This work discusses CopyCat, a game intended to help teach ASL to young deaf children. It is aimed at ages 6-11 and encourages signing in complete phrases using gesture recognition technology. The authors used a Wizard of Oz method to collect relevant data, since no sufficient ASL recognizer existed for their use. They use a camera and colored gloves, with a different color on each fingertip, to find hand position. The game has a character, Iris, who is woken up with a click, instructed through ASL gestures, and then dismissed with another click to end the interaction. Sample phrases used in the game include "go chase snake" and "orange kitten in flowers". They use HSV histograms and a Bayes classifier to produce a binary mask from the video data, and wearable wireless accelerometers provide extra data to be combined with the video. They collected data from five children and, after removing samples not signed correctly according to gameplay and samples with problems like fidgeting or poor form, had 541 signed sentences and 1959 individual signs. They used GT^2k for gesture recognition and found 93% accuracy for a user-dependent model and 86% accuracy for user-independent models.
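The HSV-histogram/Bayes segmentation step can be sketched roughly as below. This is my own simplified illustration, not the authors' code; the bin count, the prior, and the assumption of hand-labeled glove and background pixels for training are all mine.

    import numpy as np

    def fit_histograms(glove_hs, background_hs, bins=32):
        """Build normalized 2D hue/saturation histograms for glove and background pixels.

        glove_hs and background_hs are (N, 2) arrays of hue/saturation values in
        [0, 1], taken from hand-labeled training frames (a simplifying assumption).
        """
        rng = [[0, 1], [0, 1]]
        h_glove, _, _ = np.histogram2d(glove_hs[:, 0], glove_hs[:, 1],
                                       bins=bins, range=rng, density=True)
        h_bg, _, _ = np.histogram2d(background_hs[:, 0], background_hs[:, 1],
                                    bins=bins, range=rng, density=True)
        return h_glove + 1e-6, h_bg + 1e-6   # smooth so no bin is exactly zero

    def glove_mask(hs_image, h_glove, h_bg, prior_glove=0.2, bins=32):
        """Label each pixel glove / not-glove with a naive Bayes decision rule.

        hs_image is an (H, W, 2) array of hue/saturation values in [0, 1];
        the result is a boolean binary mask.
        """
        idx = np.clip((hs_image * bins).astype(int), 0, bins - 1)
        like_glove = h_glove[idx[..., 0], idx[..., 1]]
        like_bg = h_bg[idx[..., 0], idx[..., 1]]
        posterior = (like_glove * prior_glove /
                     (like_glove * prior_glove + like_bg * (1 - prior_glove)))
        return posterior > 0.5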


Discussion:

I think this paper is well-motivated: it would be hard for hearing parents to raise an unexpectedly deaf child well, and even if the parents try to learn sign language to communicate with the child, most will probably have difficulty approaching fluency. I remember from psychology classes that being exposed to language in a non-interactive format like television is not sufficient for a child to learn language (so neglected children may not learn language, even if often left alone in a room with a television playing). An interactive game, then, seems more likely to assist in language acquisition (though I wouldn't trust it as the only source of teaching, no matter how good the recognition is).

I also think the data samples with "problems" like false starts or fidgeting could be useful in the future as recognition becomes more refined. It would be valuable for a computer to be able to deal with such input, at least to identify it as noise, but perhaps also to extract extra information, like how nervous the user is or how fluent they are. Similarly, it would be useful for a speech recognition system to handle stuttering and self-corrections.

Monday, February 25, 2008

Georgia Tech Gesture Toolkit: Supporting Experiments in Gesture Recognition (Westeyn, Brashear, Atrash, Starner)

Summary:

This paper describes a toolkit, GT^2k, that supports HMM-based gesture recognition and is built on top of HTK, an existing speech recognition toolkit. GT^2k allows for training models with both real-time and off-line recognition. They discuss four sample applications to which the toolkit has been applied. A gesture panel in automobiles recognizes simple gestures from camera data with 99.2% accuracy. A security system recognizes patterned blinking to complement face recognition, so that the system can't be fooled by a still photograph of an authorized person; it achieves 89.6% accuracy. TeleSign is a mobile sign language recognition system; real-time recognition was not yet implemented, but they achieved 90% accuracy by combining vision and accelerometer data. A workshop activity recognition system recognizes actions made while constructing an object in a workshop, such as hammering or sawing, from accelerometer data, and achieves 93% accuracy, though again not in real time.
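GT^2k itself wraps HTK, but the general recipe -- train one HMM per gesture class, then classify a new sequence by whichever model assigns it the highest likelihood -- can be sketched with the hmmlearn library. This is my own illustration under that assumption, not GT^2k's actual API.

    import numpy as np
    from hmmlearn import hmm   # stand-in for HTK, which GT^2k actually wraps

    def train_models(training_data, n_states=5):
        """training_data maps each gesture label to a list of (T_i, D) feature arrays."""
        models = {}
        for label, sequences in training_data.items():
            X = np.vstack(sequences)               # stack all sequences for this gesture
            lengths = [len(s) for s in sequences]  # so the HMM knows where each one ends
            m = hmm.GaussianHMM(n_components=n_states, covariance_type="diag", n_iter=20)
            m.fit(X, lengths)
            models[label] = m
        return models

    def classify(models, sequence):
        """Return the label whose HMM gives the (T, D) observation sequence the
        highest log-likelihood."""
        return max(models, key=lambda label: models[label].score(sequence))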

Discussion:

The blinking idea is interesting, but I have a mental image of someone with holes cut in a photograph, maybe with fake eyelids attached that blink when a string is pulled. Plus, from an interface standpoint, it seems like most people who have already been trained by today's hand-operated devices would be more comfortable using their hands to input a code. Maybe an input pad hanging on a cord that the user can pull close and hide from people standing behind them would be a simpler solution.

It does seem like a valuable idea to have a toolkit that can be used in multiple gesture recognition applications, and anything that could improve accuracy could benefit a lot of people.

Computer Vision-based Gesture Recognition for an Augmented Reality Interface (Storring, Moeslund, Liu, Granum)

Summary:

This paper discusses wearable computing and augmented reality, which the authors believe should include gesture recognition as a way for the user to input commands to the interface. Their focus is on building toward a multi-user system that allows for a round-table meeting; since a table can be assumed and most of the expected gestures involve pointing, they consider it reasonable to restrict the gesture set to six gestures: a fist, and 1 to 5 fingers extended. All gestures then lie in a plane, and the recognition problem is reduced to 2D. The system does low-level pre-segmentation of the hand image using pixel (skin) color to find the hand, turning the non-hand part of the image black and the hand shape white. The palm is matched to a circle and the fingers to rectangles centered at the circle, and gestures are differentiated by how many fingers are extended. They give no numerical results, but report that users adapted quickly to the system and that the recognition rate was high enough that users found the gesture interface useful for the AR round-table architecture planning application.
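As a rough illustration of the circle-plus-rectangles idea (not the authors' algorithm), one could take the centroid of the binary mask as the palm center and count how many finger-like arcs of hand pixels cross a circle drawn just outside the palm. The radius scale is a made-up parameter, and a real system would also have to exclude the wrist.

    import numpy as np

    def count_fingers(mask, radius_scale=1.4, samples=360):
        """Count extended fingers in a boolean hand mask (True = hand pixel).

        Estimate the palm center as the mask centroid and the palm radius from
        the hand area, then walk a circle slightly outside the palm and count
        contiguous arcs of hand pixels; each arc is treated as one finger.
        """
        ys, xs = np.nonzero(mask)
        cy, cx = ys.mean(), xs.mean()             # palm center (very rough)
        palm_r = np.sqrt(mask.sum() / np.pi)      # radius of an equal-area circle
        r = radius_scale * palm_r                 # sampling circle outside the palm

        angles = np.linspace(0, 2 * np.pi, samples, endpoint=False)
        py = np.clip((cy + r * np.sin(angles)).astype(int), 0, mask.shape[0] - 1)
        px = np.clip((cx + r * np.cos(angles)).astype(int), 0, mask.shape[1] - 1)
        on_circle = mask[py, px]

        # each background-to-hand transition along the circle starts one finger arc
        # (the wrist would also count, so it would need to be masked out in practice)
        crossings = np.sum(on_circle & ~np.roll(on_circle, 1))
        return int(crossings)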

Discussion:

I wonder how well their system does with fingers that are not held apart -- it would probably be possible to deal with recognizing multi-finger blobs when a human could infer that hey, that one "finger" is twice the width of two other fingers, or half as wide as the palm, so maybe it is really two fingers together. But it might be more of a pain to tweak the system to deal with those harder cases than to move to another recognition method -- and then of course there is the issue of moving to any more complicated gesture set that doesn't translate well to 2D.

Also, how did they find the thumb movement to be the most natural choice for a click gesture? Was that their design choice, or did they decide on it from user feedback?

Thursday, February 21, 2008

Television Control by Hand Gestures (Freeman, Weissman)

Summary:

This paper deals with how a person can control a TV set via hand gestures, focusing on two issues: (1) how can there be a large set of commands without requiring users to learn difficult-to-remember gestures, and (2) how can the computer recognize commands in a complex vision-based setting? They chose gestures as the input method because they anticipated problems with voice fatigue and awkwardness in making incremental changes to parameters (like volume) by voice. Their approach is to have the user hold out an open hand to trigger a control mode, in which a hand icon whose movements mirror the user's hand movements is displayed on the screen; command icons are activated by mouseover, and sliders that the hand icon can manipulate control things like volume. They use normalized correlation (template matching) for recognition, and they do not process objects that have been stationary for some time. They built a prototype and found that users were excited about controlling the TV with gestures, though the authors weren't sure whether this was just due to novelty.
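Normalized correlation template matching is simple enough to sketch. The brute-force version below is a generic illustration, not Freeman and Weissman's implementation; it slides a hand template over a grayscale image and reports the best-matching location.

    import numpy as np

    def normalized_correlation(image, template):
        """Return the (row, col) of the best match and its correlation score in [-1, 1].

        Brute-force normalized cross-correlation: at each offset, compare the
        mean-removed template with the mean-removed image patch. Slow but clear;
        a real system would use FFTs or restrict the search to a tracking window.
        """
        ih, iw = image.shape
        th, tw = template.shape
        t = template - template.mean()
        t_norm = np.sqrt((t * t).sum()) + 1e-9

        best_score, best_pos = -1.0, (0, 0)
        for r in range(ih - th + 1):
            for c in range(iw - tw + 1):
                patch = image[r:r + th, c:c + tw]
                p = patch - patch.mean()
                score = (p * t).sum() / (np.sqrt((p * p).sum()) * t_norm + 1e-9)
                if score > best_score:
                    best_score, best_pos = score, (r, c)
        return best_pos, best_score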

Discussion:

The idea of giving the TV a camera with which to watch the viewer seems both interesting and unsettling. Maybe if combined with recognition of what the user is doing, this sort of technology could adjust the output of the television, adjusting volume down if the user is focused on reading material or a phone conversation, or up if the user is leaning forward with an intent expression as if trying to hear. I'm not sure I would want such technology in my home at all, because I think there is something to be said for predictable manual control and a sense of privacy. Maybe if the user can be convinced that the system only allows real-time analysis of motion and is not being recorded somewhere, they will be more likely to accept the technology, and perhaps having similar monitoring around the home could be useful for detecting injuries and illnesses that leave a person incapacitated and needing help. Someone falls to the floor, a voice prompt asks, "Are you all right?", and without a verbal response, the system calls for outside help.

I like the idea of having a hand icon appear on the display when the interactive mode is initiated by the user holding up a hand. This visual feedback, in sync with the user's motions, should help make the system easier and less frustrating to use. It might become annoying, though, if people watching TV together and gesturing while talking about the show keep triggering an extensive display. Maybe the hand icon could stay unobtrusive for a few seconds, with only a couple of sliders shown or a few "quick command" gestures allowed, and the full command instructions could pop up only after a couple of seconds with no significant change in input. I'm not sure whether this would actually be better. Nor am I sure that a camera is a better solution to losing the remote control than gluing a string to your remote and attaching it somewhere near where you watch TV.

Monday, February 18, 2008

A Survey of Hand Posture and Gesture Recognition Techniques and Technology (LaViola)

Summary (section 3):

This section of the paper summarizes a number of algorithmic techniques that have been applied to recognition of hand postures and gestures.

The first group, feature extraction, statistics, and models, includes template matching* (simple, accurate over a small set, small amount of calibration needed, recommended for postures rather than gestures), feature extraction* (a statistical technique to reduce the data's dimensionality; handles gestures as well as postures; slow if there are many features), active shape models, principal components analysis* (recognizes around 30 postures, requires lots of training by multiple users, requires normalization), linear fingertip models, and causal analysis.

They discuss three learning algorithms: neural networks* and HMMs* (both can recognize large posture/gesture sets with good accuracy given extensive training), and instance-based learning* (relatively simple to implement, moderately high accuracy for a large set of postures, provides continuous training, memory/time intensive as the data set grows, not well-researched for this application; a small 1-NN sketch follows below).

They also discuss three miscellaneous techniques: Linguistic approach* (uses formal grammar to represent posture & gesture set, simple approach with so-far low accuracy), appearance-based motion analysis, and spatio-temporal vector analysis.

*can be done using glove data rather than only applying to vision data.
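As a tiny illustration of the instance-based learning entry above, a 1-nearest-neighbor posture classifier over fixed-length glove feature vectors might look like this (my own sketch, not code from the survey):

    import numpy as np

    def nearest_neighbor_classify(train_X, train_y, query):
        """Classify a posture by its single nearest training example.

        train_X is an (N, D) array of joint-angle vectors and train_y the N labels.
        Instance-based learning keeps every example around: adding new training
        data is free, but classification cost grows with the size of the data set.
        """
        dists = np.linalg.norm(train_X - query, axis=1)
        return train_y[int(np.argmin(dists))]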

Summary (section 4):

This section discusses applications that use hand postures and gestures: sign language, gesture-to-speech, presentations, virtual environments, 3D modeling, multimodal interaction, human/robot interaction, and television control.

Discussion:

This seems like it could be a very useful paper for an introduction to haptics -- the reader would get an idea of what kinds of tools have been used for recognition and their strengths and weaknesses. The linguistic approach sounds like what we were discussing in class regarding an evolution of LADDER, and while it's slightly worrying that this paper says the accuracy found in other implementations of it so far has been low, it could also be a good contribution to the field if we are in fact able to make it work well. It might be worth looking at the 1994 paper they cite ("A Linguistic Approach to the Recognition of Hand Gestures" -- Hand, Sexton & Mullan).

Wednesday, February 13, 2008

Shape Your Imagination: Iconic Gestural-Based Interaction (Marsh, Watt)

Summary:

This work revolves around a user study to determine how people make iconic hand gestures to convey spatial information, with the expectation that 3D gestures can be used to create and sculpt 3D objects in a virtual space. The authors asked whether people use iconic hand gestures when trying to communicate shapes and objects without speaking, and if so, how frequently iconic gestures occur, what types are used, and whether people prefer to make these gestures with one hand or two. They chose 15 objects for people to describe nonverbally, including primitives such as a circle and a pyramid, as well as complex and compound objects like a chair, a French baguette, and a house. They found that users did use iconic hand gestures to describe all shapes. With primitives, subjects preferred two-handed virtual depiction (tracing an outline in space). With complex shapes, subjects used iconic two-handed gestures as well as pantomimic, deictic, and body gestures when iconic gestures were not sufficient. Users had no trouble recalling gestures they'd made.

Discussion:

This is an interesting study, but I think there is more information they could have gathered at the same time. It would be interesting to know to what extent users made similar gestures, or whether they could recognize each other's gestures without seeing the card with the object name. I suppose that with just the motivation of allowing intuitive 3D sculpting, rather than recognition of a fixed gesture set, their study makes pretty good sense.

A Dynamic Gesture Recognition System for the Korean Sign Language (KSL) (Kim, Jang & Bien)

Summary:

This work describes a method for online gesture recognition using a fuzzy min-max neural network. The authors focus on recognizing Korean Sign Language (KSL) gestures, which are generally two-handed; most of the 6000 gestures in the language are made up of combinations of basic gestures, so they chose 25 important gestures for their study. They use a VPL Data-Glove, which measures 10 finger flex angles plus (x, y, z) position and roll, pitch, and yaw. They identify 10 basic direction types of motion patterns, most of which are straight-line motions, plus one arc and one circle. Using the fuzzy min-max neural network to recognize gestures, they achieve nearly 85% accuracy.
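As a toy illustration of the "10 basic direction types" idea (my own, not the paper's fuzzy min-max network), one could quantize a gesture's overall displacement into a small set of straight-line direction labels; the label set below is a hypothetical subset and ignores the arc and circle types.

    import numpy as np

    # hypothetical subset of straight-line direction labels; the paper's full set
    # of 10 types also includes an arc and a circle, which this toy version ignores
    DIRECTIONS = {
        "right": (1, 0, 0), "left": (-1, 0, 0),
        "up": (0, 1, 0), "down": (0, -1, 0),
        "forward": (0, 0, 1), "backward": (0, 0, -1),
    }

    def direction_type(positions):
        """Label a gesture's overall motion by the closest basic direction.

        positions is a (T, 3) array of (x, y, z) hand positions over the gesture.
        """
        disp = positions[-1] - positions[0]
        disp = disp / (np.linalg.norm(disp) + 1e-9)
        return max(DIRECTIONS, key=lambda name: np.dot(disp, DIRECTIONS[name]))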

Discussion:

The idea of 10 basic direction types is interesting from the perspective of a 3D, gesture-based equivalent of LADDER. Does this hold true in ASL -- is there only a certain set of general, simple motions, rather than complicated curves, that could describe most gestures? If so, there would seem to be hope for creating a system that lets a user describe gestures in terms of those simple motions.

Wednesday, February 6, 2008

A Similarity Measure for Motion Stream Segmentation and Recognition (Li & Prabhakaran)

Summary:

This paper deals with recognizing motion streams, as generated by human body motions like sign language, using an SVD-based similarity measure. Each motion is represented by a matrix whose columns correspond to the positions of different joints and whose rows correspond to instants in time. Two motions are comparable if their matrices have the same number of attributes (columns); they may have different numbers of rows, since a fast gesture can have the same meaning as a slow one. The authors discuss how SVD reveals the geometric structure of a matrix and can be used to compare two matrices. They performed a study with a CyberGlove covering 18 motions (ASL signs for Goodbye, Idiom, 35, etc.), generating 24 motion streams with 5 to 10 motions per stream; this data allowed them to consider segmentation issues. They also obtained motion capture data for 62 isolated dance motions (each repeated 5 times). Their kWAS algorithm (which looks at the first k eigenvectors, with k=6) achieves nearly 100% accuracy on isolated motions and around 94% accuracy within motion streams. It is much more accurate and faster than EROS and slightly better than MAS, the other algorithms they compare against.
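The flavor of an SVD-based similarity can be sketched as below; this is my rough reading of the kWAS idea (compare the first k right singular vectors of the two motion matrices, weighted by their singular values), not a faithful reimplementation of the published algorithm.

    import numpy as np

    def kwas_like_similarity(motion_a, motion_b, k=6):
        """SVD-based similarity between two motions with the same number of columns.

        motion_a and motion_b are (T_a, D) and (T_b, D) matrices (rows = time,
        columns = joints). Compares the first k right singular vectors, weighting
        each pair by the normalized singular values, so dominant motion directions
        count the most. A sketch of the general idea only.
        """
        _, s_a, vt_a = np.linalg.svd(motion_a, full_matrices=False)
        _, s_b, vt_b = np.linalg.svd(motion_b, full_matrices=False)
        k = min(k, vt_a.shape[0], vt_b.shape[0])
        w = 0.5 * (s_a[:k] / s_a.sum() + s_b[:k] / s_b.sum())
        # right singular vectors are sign-ambiguous, so compare absolute inner products
        return float(np.sum(w * np.abs(np.sum(vt_a[:k] * vt_b[:k], axis=1))))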


Discussion:

I appreciate that they used words that would appear in regular ASL rather than just letters, since having a broad selection of recognized words will probably be more useful to native signers than fingerspelling. It's also nice to see an unusual application like dance, even if it only includes isolated motions. I'm curious how complex these repeated short motions were -- whether each amounts to a basic step that the dance as a whole is built from, the kind of thing someone might practice repeatedly to master, or whether it is just a brief segment of the dance that would not commonly be practiced by itself. It might also be interesting to see how well a practiced step's motion data corresponds to the data for that same step embedded in a more complicated sequence of dance motions, and whether the difference between these cases is greater for expert versus novice dancers.

Cyber Composer: Hand Gesture-Driven Intelligent Music Composition and Generation (Ip, Law, & Kwong)

Summary:

This work describes a system designed to allow both experienced musicians and novices to compose music using hand gestures. The authors explain automated music generation in terms of music theory, discussing tonality, chord progression, closure of musical phrases (cadence), and generation of a melody that follows the chords, and how all of these can be partly automated based on general rules for what makes a coherent piece of music. They then describe their system architecture and implementation, which uses a pair of CyberGloves with Polhemus 3D position trackers; MIDI is used to synthesize musical sound; a music interface converts musical expressions to MIDI signals; background music is generated according to music theory and user-defined parameters of tempo and key; and melody is generated according to hand signals, music theory, and a style template.

They describe the specific gesture mapping they chose for the system in depth, based on five guidelines: (1) musical expressions should be intuitive; (2) those requiring fine control should be mapped to agile parts of the hands; (3) the most important expressions should be easily triggered; (4) no two gestures should be too similar; and (5) accidental triggering should be avoided. They map rhythm to wrist flexion because it is very important but doesn't require fine movement. Pitch is also important, so they map it to the relative height of the right hand, though it resets at each new bar of music. Pitch shifting of melody notes also occurs if the right hand moves far enough relative to its position at the previous melody note. Dynamics (how strongly a note is played) and volume are controlled by right-hand finger flexion: extended fingers mean a stronger note. Lifting the left hand higher than the right adds a second instrument, which plays in unison or harmonizes two notes higher. A cadence occurs when the left-hand fingers bend completely, and keeping that hand closed stops the music.
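To make the mapping concrete, here is a toy sketch of how glove readings might be turned into musical events in the spirit of those guidelines; the field names, thresholds, and structure are all my own, not the paper's implementation.

    from dataclasses import dataclass

    @dataclass
    class HandState:
        """Simplified glove reading; the field names are mine, not the paper's."""
        wrist_flexion: float      # 0 = straight, 1 = fully flexed
        height: float             # hand height in metres
        finger_flexion: float     # mean finger flexion, 0 = extended, 1 = closed

    def map_gestures(right: HandState, left: HandState, prev_right_height: float):
        """Toy gesture-to-music mapping in the spirit of the paper's guidelines."""
        events = {}
        if right.wrist_flexion > 0.7:
            events["trigger_note"] = True                      # rhythm from wrist flexion
        events["pitch_shift"] = right.height - prev_right_height  # relative height -> pitch
        events["dynamics"] = 1.0 - right.finger_flexion        # extended fingers -> stronger note
        events["second_instrument"] = left.height > right.height
        events["cadence"] = left.finger_flexion > 0.9          # closed left hand -> cadence
        return events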

The GUI lets the user choose an instrument, key, tonality, and tempo from drop-down menus (presumably with a mouse) before beginning composition with the CyberGloves.

Discussion:

Due to my lack of knowledge of music composition, I'm not sure I understood all of the automated music generation section; if I did, it seems as though this could limit how much variety of music can be composed with the system. Then again, I could probably be convinced that working within the rules still leaves more than enough flexibility to create interesting, original music, and it makes sense that automating some things would make it easier for the user to adjust the big picture with gestures. It does seem like their system might be difficult enough to learn without also requiring the user to specify every detail of the music via hand movements, though maybe that could be alleviated with a sufficiently informative and usable GUI.

I think the part of the paper most likely to apply to other applications is their list of guidelines for choosing a gesture-to-meaning mapping, if we go on to create our own gesture set. The guidelines seem somewhat obvious, but when designing a system and writing a paper about it, it would be good to have a list of rules like that to compare our choices against.

Monday, February 4, 2008

A Multi-Class Pattern Recognition System for Practical Finger Spelling Translation (Hernandez-Rebollar, Linderman, Kyriakopoulos)

Summary:

This paper describes a system for recognizing the 26 letters of the ASL alphabet, along with two other signs (space and enter), and also describes an attempt at an affordable alternative to other data gloves. They call their device the Accele Glove; it uses a microcontroller and dual-axis accelerometers. They placed the dual-axis sensors on the middle joints of the fingers and the distal joint of the thumb to eliminate ambiguity among letters of the ASL alphabet. Because each accelerometer is suspended by springs, it can track joint flexion as well as hand roll, yaw, or individual finger abduction.

They acquired data with five people each signing all 26 letters ten times, with J & Z only at their final position. For classification, they divide gestures into three subclasses: vertical, horizontal, or closed, defined by dividing the 3D space with planes whose locations are based on index finger position. They use a decision tree whose first division is based on these subclasses and further divisions are based on features of the gestures, like "flat" vs. "rolled" postures. They found that 21/26 letters reached 100% recognition, and that R, U & V could not be distinguished with their system.
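The three-way first split of the decision tree can be sketched as a simple threshold test on the index-finger reading; the thresholds and field names below are hypothetical, since the paper derives its dividing planes from the data.

    def hand_subclass(index_pitch_g, vertical_thresh=0.5, closed_thresh=-0.5):
        """Classify a posture as 'vertical', 'horizontal', or 'closed' from the
        index finger's accelerometer reading (gravity component along the finger, in g).

        Toy version of the first split in the paper's decision tree; further
        splits would then look at the remaining fingers' readings.
        """
        if index_pitch_g > vertical_thresh:
            return "vertical"      # index finger pointing up
        if index_pitch_g < closed_thresh:
            return "closed"        # finger curled down toward the palm
        return "horizontal"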


Discussion:

This seems like a fairly reasonable method, and it seems like it could be tweaked to recognize the letters it currently cannot. Of course, there's always the question of whether it extends to other kinds of gestures without more and more tweaking.

At one point I misread their discussion of a voice synthesizer and thought they were taking in voice data as well, which seemed like an interesting idea on its own. Having voice output while a person makes gestures during a user study could be useful later when we are trying to figure out what gestures they intended to make. On the other hand, it might have a negative effect if the person talks more, or more slowly, than she would ordinarily make the accompanying gestures, or if the act of talking distracts the user from making gestures, sort of like how people may feel awkward describing aloud what they are doing in studies that test product usability.

Saturday, February 2, 2008

Hand Tension as a Gesture Segmentation Cue (Harling & Edwards)

Summary:

This paper focuses on "the segmentation problem" -- discriminating between two or more fluidly connected gestures. The authors emphasize that their approach is recognition-led: rather than looking at what gestures would be useful for a particular interface and creating a recognizer for just those, they are building recognizers that could be incorporated into various interfaces. They divide hand gestures into postures (static) and gestures (dynamic), and each of these groups is further divided by whether or not hand motion and orientation are considered, giving categories like Static Posture Static Location (SPSL) and, similarly, DPSL, SPDL, and DPDL, in order of complexity. Segmenting gestures from a less complex class is easier than from a more complex class. They suggest that fingertip acceleration reaching a maximum away from the body may indicate an intention to produce another gesture. They also suggest considering the minima of the hand tension graph, or other changes in the graph's shape, as segmentation cues. They give an equation to model finger tension based on finger-joint angles, and the hand's tension is taken as the sum of the tension in each finger.

They tested the hand tension model on two data sets collected with a Mattel Power Glove, which scores the bentness of 4 fingers on a scale of 1-4. They tried two BSL sentence fragments: "MY NAME" and "MY NAME ME". The graphs indicate that tension is maximized where the intentional postures occur, with minima between them. They admit that more data is a necessary next step before firm conclusions can be drawn.
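To illustrate the overall shape of the approach (not their actual tension equation, which I am not reproducing here), a toy hand-tension measure summed over fingers, with segmentation proposed at local minima of the tension curve, might look like this:

    import numpy as np

    def hand_tension(joint_angles, relaxed=0.4):
        """Toy hand-tension measure for one time step.

        joint_angles is a (fingers, joints) array of normalized bends
        (0 = straight, 1 = fully bent). Each finger's tension grows with how far
        its joints are from a relaxed bend, and the hand's tension is the sum
        over fingers -- the 'sum over fingers' idea, not the paper's equation.
        """
        return float(np.sum((joint_angles - relaxed) ** 2))

    def segment_at_minima(tension_series, window=3):
        """Propose segmentation points at local minima of the tension time series."""
        cuts = []
        for i in range(window, len(tension_series) - window):
            neighborhood = tension_series[i - window:i + window + 1]
            if tension_series[i] == min(neighborhood):
                cuts.append(i)
        return cuts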

Discussion:

This paper seems to focus on the equivalent of recognizing sketched geometric primitives so as to eventually be able to recognize complex shapes made up of them, which makes it an excellent followup to recent class discussion. Supposing they didn't choose an overly easy pair of sample sentence fragments to segment, their approach seems pretty promising, and even if most gestures don't turn out to divide well based on tension, it seems likely to be worth including in some way for the cases where it is useful. I wonder whether it would be worth having a physical tension sensor in the glove (supposing the glove fits well), maybe using an elastic string along the inside or outside of each finger that either stretches and presses on a sensor or falls slack, and whether that would compare favorably to the kind of angle-based tension they describe or just be redundant and unnecessary.