Wednesday, April 2, 2008

Activity Recognition using Visual Tracking and RFID (Krahnstoever, Rittscher, Tu, Chean & Tomlinson)

Summary:

This paper discusses combining RFID and video tracking to help determine what people are doing when they interact with tagged objects. Their system uses camera data together with mathematical models of the human body to track a person's head and hands in 3D space. An RFID tracking unit detects the presence, location, and orientation in 3D space of RFID tags attached to objects. They believe that by combining these sources of information, it should be possible to recognize activities such as theft or tampering with retail items.
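As a rough way to picture the fusion idea (this is not the paper's actual probabilistic model; the function names, data format, and thresholds below are my own assumptions), you could flag a possible interaction whenever a tracked hand stays near a tagged object's estimated position for a sustained stretch of frames:

```python
# Minimal sketch, not the paper's method: flag a possible person-object
# interaction when a tracked hand stays close to an RFID-tagged object
# for a sustained stretch of time. Thresholds are invented.

import numpy as np

def detect_interactions(hand_xyz, tag_xyz, dist_thresh=0.15, min_frames=10):
    """hand_xyz, tag_xyz: (T, 3) arrays of per-frame 3D positions (meters).
    Returns (start_frame, end_frame) spans where the hand is within
    dist_thresh of the tag for at least min_frames consecutive frames."""
    close = np.linalg.norm(hand_xyz - tag_xyz, axis=1) < dist_thresh
    spans, start = [], None
    for t, c in enumerate(close):
        if c and start is None:
            start = t
        elif not c and start is not None:
            if t - start >= min_frames:
                spans.append((start, t))
            start = None
    if start is not None and len(close) - start >= min_frames:
        spans.append((start, len(close)))
    return spans

# Example: a hand that hovers near a stationary tagged item for ~1 second at 30 fps.
T = 90
rng = np.random.default_rng(0)
tag = np.tile([1.0, 0.5, 1.2], (T, 1))                    # stationary tagged item
hand = tag + rng.normal(0, 0.5, (T, 3))                   # hand mostly far away
hand[30:60] = tag[30:60] + rng.normal(0, 0.03, (30, 3))   # hovers near the tag
print(detect_interactions(hand, tag))
```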


Discussion:

This paper has a decent, if fairly straightforward, idea. Yes, combining different types of data should allow for better recognition. A more thorough user study would be nice.

Reading about the way the system uses visual tracking made me think more about situations where a glove would be more useful than cameras. For casual, everyday activities, where you might want to track the motion of one user across several locations, a glove would be a better choice if it could be wireless and send its signal to a recording device on a more stable part of the person's body. It would have to be voluntary, so it's probably never going to be very useful for security purposes (though it might be possible to identify when a person takes the glove off based on the glove data).

Wednesday, March 19, 2008

TIKL: Development of a Wearable Vibrotactile Feedback Suit for Improved Human Motor Learning (Lieberman & Breazeal)

Summary:

This work is motivated by the idea that motor learning can be improved by feedback over all joints, which is not available from a human instructor, especially in a large class setting. Their system includes optical tracking (based on reflective markers on the suit), tactile actuators (vibrotactile feedback: by pulsing a series of actuators around a joint, a direction of rotation is communicated), feedback software (which triggers feedback when the error between the taught joint angles and the current angles is high enough), and hardware for output control. Their user study compared people learning an action with only video instruction to learning it with both video and tactile feedback from the suit. They found that subjects' error in performing the motions was reduced by a statistically significant amount (on the order of 20%).
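For a single joint, the threshold-triggered feedback loop described above might look roughly like this (the actuator interface, dead band, and gain are invented for illustration; the paper's actual control software differs):

```python
# Rough sketch of threshold-triggered vibrotactile feedback for one joint.
# All constants are assumptions, not values from the paper.

def joint_feedback(current_deg, target_deg, dead_band_deg=5.0, gain=0.02):
    """Return (direction, intensity) for a vibrotactile cue.
    direction: +1 / -1 to indicate which way to rotate, 0 for no cue.
    intensity: 0..1, scaled with the size of the error."""
    error = target_deg - current_deg
    if abs(error) < dead_band_deg:           # close enough: stay silent
        return 0, 0.0
    direction = 1 if error > 0 else -1
    intensity = min(1.0, abs(error) * gain)  # saturate at full strength
    return direction, intensity

# e.g. elbow at 70 degrees while the demonstrated pose has it at 95:
print(joint_feedback(70.0, 95.0))   # -> (1, 0.5): rotate in the positive direction, medium buzz
```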

Discussion:

I like the idea of using tactile feedback to help learn actions that involve the whole body, like dancing, but I wonder if there would be a problem with not emphasizing the important joints more than the unimportant ones. It seems to me that in dance, some amount of stylistic variation is acceptable (though I haven't spoken to any professional dancers about how important it is to have every joint bent at exactly the correct angle). For example, a short person ballroom dancing with a tall partner would have to position at least the arms differently than a pair of dancers of the same height. It might be possible to offset this by training the system with several different experts, and it might be less of an issue for a solo activity with no props.

I also think it would be very interesting to see, as they ask when discussing future work, whether human attention can take in information from more joints of the body all at once and react effectively to it.

Wiizards: 3D Gesture Recognition for Game Play Input (Kratz, Smith & Lee)

Summary:

This paper describes a game with a dueling-wizards theme, where players make gestures to cast spells against opponents. Different gestures can be combined in series to make different sorts of spells. They use the Wii controller and feed its accelerometer data into an HMM gesture recognition package. They had 7 users perform each gesture over 40 times, and they found that using 10 states per gesture in the HMM gave over 90% accuracy. With 20 training gestures from one user, recognition is over 95% (or 80% with 10 gestures). Recognition without user-dependent training data was around 50%. Evaluating gestures in their system runs at near real-time for 250 gestures, but training the HMMs takes more time.
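The paper doesn't publish its pipeline as code, but the basic train-one-HMM-per-gesture idea can be sketched with the hmmlearn library (my stand-in for whatever package they used; the data format and parameters below are assumptions):

```python
# Sketch, not the authors' code: one Gaussian HMM per gesture class trained
# on accelerometer sequences; a new sequence is classified by picking the
# model with the highest log-likelihood.

import numpy as np
from hmmlearn.hmm import GaussianHMM

def train_models(gesture_data, n_states=10):
    """gesture_data: dict mapping gesture name -> list of (T_i, 3) arrays
    of (x, y, z) accelerometer samples. Returns one trained HMM per gesture."""
    models = {}
    for name, seqs in gesture_data.items():
        X = np.vstack(seqs)                 # hmmlearn wants concatenated samples
        lengths = [len(s) for s in seqs]    # plus the length of each sequence
        m = GaussianHMM(n_components=n_states, covariance_type="diag", n_iter=20)
        m.fit(X, lengths)
        models[name] = m
    return models

def classify(models, seq):
    """seq: (T, 3) array. Return the gesture whose HMM scores it highest."""
    return max(models, key=lambda name: models[name].score(seq))
```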

Discussion:
This sounds like it would be an interesting game to try, though I don't know how well you could convince all players to input enough training data to get remotely palatable recognition rates. It could work for a single-player game, if the design of the training session were sufficiently clever and fun, but since the focus is on being a multiplayer game, you don't want to make returning players wait an hour while a new player goes through tons of sample gestures. It might be worth identifying important features of a given gesture that can be communicated to the user -- e.g., the most important features of this spiral gesture are that it is circular, lies in a plane, and never crosses itself -- and since the user is playing a game, they might be okay with learning these things. Recognition could then be based in significant part on these purposely chosen features as well.
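To make that concrete, here is a rough sketch of how those three checks might look for a recorded 3D trajectory; the thresholds are arbitrary guesses, not tuned values from any paper:

```python
# Sketch of simple geometric feature checks for a "circular, planar,
# non-self-crossing" gesture trajectory. Thresholds are invented.

import numpy as np

def is_planar(points, tol=0.1):
    """points: (N, 3). Fit a plane by SVD; planar if the out-of-plane
    spread is small relative to the largest in-plane spread."""
    centered = points - points.mean(axis=0)
    s = np.linalg.svd(centered, compute_uv=False)   # singular values, descending
    return s[2] < tol * s[0]

def _project_to_plane(points):
    centered = points - points.mean(axis=0)
    _, _, vt = np.linalg.svd(centered)
    return centered @ vt[:2].T                      # 2D coordinates in the best-fit plane

def is_circular(points, tol=0.25):
    """Roughly constant distance from the centroid in the best-fit plane."""
    p = _project_to_plane(points)
    r = np.linalg.norm(p - p.mean(axis=0), axis=1)
    return r.std() < tol * r.mean()

def crosses_itself(points):
    """Check the projected polyline for any pair of non-adjacent segments
    that intersect (simple O(N^2) 2D test, ignores degenerate collinear cases)."""
    p = _project_to_plane(points)
    def ccw(a, b, c):
        return (c[1] - a[1]) * (b[0] - a[0]) > (b[1] - a[1]) * (c[0] - a[0])
    def segs_intersect(a, b, c, d):
        return ccw(a, c, d) != ccw(b, c, d) and ccw(a, b, c) != ccw(a, b, d)
    n = len(p) - 1
    for i in range(n):
        for j in range(i + 2, n):
            if segs_intersect(p[i], p[i + 1], p[j], p[j + 1]):
                return True
    return False
```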

Friday, February 29, 2008

Temporal Classification: Extending the Classification Paradigm to Multivariate Time Series (Kadous)

Summary (ONLY intro and 6.3.2 - Auslan):

Kadous' thesis deals with machine learning in domains where values vary over time, including gesture recognition. His method involves extracting metafeatures of the data, which capture properties such as temporal structure, so that, for example, a local maximum in height might be noted as an event. He then considers synthetic events built from these metafeatures, or "interesting examples", such as a local maximum occurring near the nose versus near the chin, which can lead to two very different classifications.
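As a toy illustration of how I read the metafeature idea (the region boundaries and numbers here are made up, not taken from the thesis): scan one channel for local maxima, emit them as events, and label each event by where it happened.

```python
# Toy metafeature sketch: local-maximum events in a height channel,
# labelled by invented "chin"/"nose" height bands.

import numpy as np

def local_maxima(signal):
    """Return (index, value) pairs where the signal has a local maximum."""
    s = np.asarray(signal)
    idx = np.where((s[1:-1] > s[:-2]) & (s[1:-1] > s[2:]))[0] + 1
    return [(int(i), float(s[i])) for i in idx]

def label_events(events, regions):
    """regions: list of (name, low, high) height bands.
    Attach a region label to each (time, height) event."""
    labelled = []
    for t, h in events:
        name = next((n for n, lo, hi in regions if lo <= h < hi), "other")
        labelled.append((t, h, name))
    return labelled

height = [0.9, 1.0, 1.2, 1.45, 1.4, 1.3, 1.55, 1.6, 1.5, 1.2]
events = local_maxima(height)                            # peaks at t=3 and t=7
regions = [("chin", 1.35, 1.5), ("nose", 1.5, 1.65)]     # invented bands (meters)
print(label_events(events, regions))                     # one "chin" event, one "nose" event
```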

One application he tests his classifier on is Auslan (Australian Sign Language). He acknowledges that fingerspelling is not sufficient for proper communication, so he focuses on whole signs. A sign is made up of handshape, location, orientation, movement, and expression (such as raising the eyebrows to form a question).

He tested his classifier, which tries many different machine learning methods, on data from a Nintendo Powerglove as well as on data from 5DT gloves with a Flock-of-Birds tracker for each hand. The Powerglove data was one-handed, came from one user, and totalled 1900 signs. The Flock data was two-handed, gave more data per hand, and had significantly less noise; it was collected from one native Auslan signer over a period of 9 weeks, with 27 samples per sign, for a total of 2565 signs.

short notes:
- Powerglove: low accuracy; HMM performed best
- Flock: good accuracy; AdaBoost performed best -- the HMM may have suffered from the many extra channels of data
- rules were formed for classification

Discussion:

It was nice to see data from a native signer, although it sounded a bit more exciting before I realized it was only one signer. I think this paper includes some interesting machine learning techniques, though I didn't look too closely at the parts of the thesis focusing on those rather than on the sign language study. His data set is available at http://archive.ics.uci.edu/ml/datasets/Australian+Sign+Language+signs+(High+Quality), which could be nice to have in order to compare new techniques against published results.

Wednesday, February 27, 2008

American Sign Language Recognition in Game Development for Deaf Children (Brashear, Henderson, Park, Hamilton, Lee, Starner)

Summary:

This work discusses CopyCat, a game intended to help teach ASL to young deaf children. It is aimed at ages 6-11 and encourages signing in complete phrases using gesture recognition technology. They used a Wizard of Oz method to collect relevant data, since no existing ASL recognizer was sufficient for their purposes. They use a camera and colored gloves, with a different color on each fingertip, to find hand position. The game has a character, Iris, who must be woken up with a click and then instructed with ASL signs; the interaction is ended with another click. Sample phrases used in the game include "go chase snake" and "orange kitten in flowers". They use HSV histograms and a Bayes classifier to make a binary mask from the video data, and wearable, wireless accelerometers provide extra data to be combined with the video data. They collected data with five children and, after removing samples not signed correctly according to gameplay and samples with problems like fidgeting or poor form, they had 541 signed sentences and 1959 individual signs. They used GT^2k for gesture recognition and found 93% accuracy for a user-dependent model and 86% accuracy for user-independent models.
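The paper's mask comes from HSV histograms and a Bayes classifier; as a much simpler stand-in, fixed HSV thresholds can illustrate the "colored glove to binary mask" step (the color ranges below are placeholders I made up, not values from the paper):

```python
# Simplified stand-in for the paper's HSV-histogram + Bayes-classifier mask:
# fixed HSV ranges per fingertip color, OR'd into one binary mask.

import cv2
import numpy as np

# One (lower, upper) HSV range per fingertip color -- invented values.
GLOVE_RANGES = [
    ((100, 120, 70), (130, 255, 255)),   # blue-ish fingertip
    ((40, 80, 70), (80, 255, 255)),      # green-ish fingertip
]

def glove_mask(frame_bgr):
    """Return a binary mask (255 = glove pixel) for one video frame."""
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    mask = np.zeros(hsv.shape[:2], dtype=np.uint8)
    for lower, upper in GLOVE_RANGES:
        mask |= cv2.inRange(hsv, np.array(lower), np.array(upper))
    # Clean up speckle so downstream tracking sees solid blobs.
    kernel = np.ones((5, 5), np.uint8)
    return cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
```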


Discussion:

I think this paper is well-motivated: it would be hard as a hearing parent to raise an unexpectedly deaf child well, and even if the parents try to learn sign language to speak to the child, most parents probably will have some difficulty approaching fluency in it. I remember from psychology classes that being exposed to language in a non-interactive format like television is not sufficient for a child to learn language (so neglected children may not learn language, even if often left alone in a room with a television playing.) An interactive game, then, seems more likely to assist in language acquisition (though I wouldn't trust it as the only source of teaching, no matter how good the recognition is.)

I also think the data samples with "problems" like false starts or fidgeting could be useful in the future as recognition becomes more refined. It would be valuable for a computer to be able to deal with such samples, at least to identify them as noise, but maybe also to extract extra information like how nervous the user is or how fluent a signer they are; similarly, it would be useful for a verbal speech recognition system to handle stuttering and self-corrections.

Monday, February 25, 2008

Georgia Tech Gesture Toolkit: Supporting Experiments in Gesture Recognition (Westeyn, Brashear, Atrash, Starner)

Summary:

This paper describes a toolkit, GT^2k, to aid in gesture recognition via HMMs; it makes use of an existing toolkit, HTK, that supports speech recognition. GT^2k allows for training models with both real-time and off-line recognition. They discuss four sample applications to which their toolkit has been applied. A gesture panel in automobiles recognizes simple gestures from camera data and gets 99.2% accuracy. A security system recognizes patterned blinking, to go with face recognition so that the system can't be fooled by a still photograph of an authorized person, and it gets 89.6% accuracy. TeleSign is a system for mobile sign language recognition; real-time recognition was not yet implemented, but they got 90% accuracy by combining vision and accelerometer data. A workshop activity recognition system recognizes actions made while constructing an object in a workshop, such as hammering or sawing, based on accelerometer data, and they found 93% accuracy, though again not in real-time.
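GT^2k and HTK have their own interfaces, which I won't guess at; the sketch below only illustrates the off-line versus real-time distinction using a generic classifier callable (all names here are hypothetical, e.g. the HMM argmax scorer sketched earlier could be plugged in):

```python
# Generic sketch of off-line vs. real-time recognition around any
# batch classifier (sequence -> label). Not GT^2k or HTK code.

from collections import deque

class StreamingRecognizer:
    """Real-time mode: run a batch classifier over a sliding window of
    the most recent sensor samples."""
    def __init__(self, classify, window=60):
        self.classify = classify              # callable: list of samples -> label
        self.buffer = deque(maxlen=window)

    def push(self, sample):
        """Feed one new sample; return a label once the window is full."""
        self.buffer.append(sample)
        if len(self.buffer) == self.buffer.maxlen:
            return self.classify(list(self.buffer))
        return None

def recognize_offline(classify, segmented_sequences):
    """Off-line mode: each pre-segmented gesture is classified as a whole."""
    return [classify(seq) for seq in segmented_sequences]
```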

Discussion:

The blinking idea is interesting, but I have a mental image of someone with holes cut in a photograph, maybe with fake eyelids attached that blink when a string is pulled. Plus, from an interface standpoint, it seems like most people who have already been trained by today's hand-operated devices would be more comfortable using their hands to input a code. Maybe an input pad hanging on a cord that the user can pull close and hide from people standing behind them would be a simpler solution.

It does seem like a valuable idea to have a toolkit that can be used in multiple gesture recognition applications, and anything that could improve accuracy could benefit a lot of people.

Computer Vision-based Gesture Recognition for an Augmented Reality Interface (Storring, Moeslund, Liu, Granum)

Summary:

This paper discusses wearable computing and augmented reality, which the authors believe should include gesture recognition as a way for the user to input commands to the interface. Their focus is on building towards a multi-user system that supports a round-table meeting; since a table can be assumed and most of the expected gestures involve pointing, they believe it is reasonable to restrict the gesture set to six gestures: a fist, and 1 to 5 fingers extended. All gestures are then in a plane, and the recognition problem is reduced to 2D. Their system does low-level pre-segmentation of the hand image, using pixel (skin) color to find where the hand is and producing a binary image with the non-hand pixels black and the hand shape white. They model the palm as a circle and the fingers as rectangles radiating from that circle, and they differentiate gestures by how many fingers are extended. They give no numerical results, but say that users adapted quickly and that the recognition rate was high enough that users found the gesture interface useful for the AR round-table architecture-planning application.
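This is not the authors' implementation, but a crude version of the "palm as a circle, fingers radiating outward" idea can be sketched from a binary mask like the one they describe; the constants here are guesses:

```python
# Crude sketch (not the paper's algorithm): find the palm as the deepest
# point of the distance transform, then walk a circle at ~1.5x the palm
# radius and count how many separate white arcs it crosses.

import cv2
import numpy as np

def count_fingers(mask):
    """mask: uint8 binary image, 255 = hand. Returns an estimated finger count."""
    dist = cv2.distanceTransform(mask, cv2.DIST_L2, 5)
    _, palm_radius, _, (cx, cy) = cv2.minMaxLoc(dist)    # deepest interior point
    r = 1.5 * palm_radius
    angles = np.linspace(0, 2 * np.pi, 360, endpoint=False)
    xs = np.clip((cx + r * np.cos(angles)).astype(int), 0, mask.shape[1] - 1)
    ys = np.clip((cy + r * np.sin(angles)).astype(int), 0, mask.shape[0] - 1)
    ring = mask[ys, xs] > 0                              # hand / not-hand around the ring
    # Count 0 -> 1 transitions, i.e. how many distinct white arcs the ring crosses.
    transitions = np.sum(~ring & np.roll(ring, -1))
    # One of the arcs is usually the wrist/forearm, so subtract it.
    return max(0, int(transitions) - 1)
```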

Discussion:

I wonder how well their system does with fingers that are not held apart -- it would probably be possible to deal with recognizing multi-finger blobs when a human could infer that hey, that one "finger" is twice the width of two other fingers, or half as wide as the palm, so maybe it is really two fingers together. But it might be more of a pain to tweak the system to deal with those harder cases than to move to another recognition method -- and then of course there is the issue of moving to any more complicated gesture set that doesn't translate well to 2D.
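Here is a sketch of the width heuristic I'm imagining (entirely my speculation, with made-up numbers): if one "finger" blob is about twice as wide as the typical finger, count it as two fingers held together.

```python
# Speculative width heuristic: estimate fingers per blob by dividing each
# blob's width by the typical single-finger width.

import numpy as np

def estimate_fingers(arc_widths_deg):
    """arc_widths_deg: angular widths of the white arcs a sampling circle
    crosses (excluding the wrist arc). Returns an estimated finger count."""
    if not arc_widths_deg:
        return 0
    widths = np.asarray(arc_widths_deg, dtype=float)
    typical = np.median(widths)                          # assume most arcs are single fingers
    per_arc = np.maximum(1, np.round(widths / typical).astype(int))
    return int(per_arc.sum())

# e.g. three arcs, one twice as wide as the others -> probably 4 fingers:
print(estimate_fingers([18, 20, 38]))   # -> 4
```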

Also, how did they find the thumb movement to be the most natural choice for a click gesture? Was that their design choice, or did they decide on it from user feedback?