Tuesday, January 22, 2008

Flexible Gesture Recognition for Immersive Virtual Environments (Deller, Ebert, Bender, Hagen)

Summary:

This paper presents the use of all three dimensions in a computer interface as more immersive than a traditional desktop setup, but still uncommon because adequate interfaces for it are lacking. The authors believe that a good interface will allow manipulation using intuitive hand gestures, since these are the most natural and easiest to learn. They review related work and find that there is still a need for gesture recognition that copes with varying conditions such as different users and different hardware, and that runs in real time for a variety of gestures.

The authors used a P5 Glove with their gesture recognition engine to get position and orientation data. Tracking cuts out if the back of the hand is angled away from the receptor and conceals too many reflectors, which is acceptable for the P5's intended use while sitting in front of a desktop computer. To reduce computation time, they define gestures as a sequence of postures with specific positions and orientations of the user's hand, rather than as motion over a period of time. A posture for the recognition engine consists of the flexion values of the fingers, the orientation of the hand, and a value representing how relevant the orientation is for that posture. Postures are taught to the system by performing them, and the system can be trained for a given user in the same way.

The recognition engine has a data acquisition thread that constantly checks whether incoming glove data matches anything in its gesture manager component. The data is filtered to reduce noise, marked as a candidate for a gesture if the gesture manager finds a likely match, and marked recognized if it is held for a minimum time span (300-600 milliseconds in their tests). The gesture manager keeps a list of known postures along with functions to manage it. A known posture is stored with a "finger constellation", a 5D vector holding each finger's bend value. If the current data is within some minimum recognition distance of a stored constellation, the orientation is checked in the same way, and the likelihood of a match is computed from these comparisons.
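To make the matching process more concrete, here is a minimal Python sketch of how I picture the gesture manager working. The class names, the thresholds, and the exact distance measure are my own assumptions for illustration, not details taken from the paper.

```python
# Hypothetical sketch of the posture-matching idea described above; names,
# thresholds, and distance measures are my own guesses, not the authors'.
import time
import numpy as np

class Posture:
    def __init__(self, name, flexion, orientation, orientation_relevance):
        self.name = name
        self.flexion = np.asarray(flexion, dtype=float)          # 5 finger bend values ("finger constellation")
        self.orientation = np.asarray(orientation, dtype=float)  # e.g. yaw/pitch/roll of the hand
        self.orientation_relevance = orientation_relevance       # 0 = ignore orientation, 1 = fully weighted

class GestureManager:
    """Keeps the list of known postures and scores incoming glove samples."""

    def __init__(self, postures, flexion_threshold=0.25,
                 orientation_threshold=0.5, hold_time=0.45):
        self.postures = postures
        self.flexion_threshold = flexion_threshold
        self.orientation_threshold = orientation_threshold
        self.hold_time = hold_time          # ~300-600 ms dwell before a candidate counts as recognized
        self._candidate = None
        self._candidate_since = None

    def best_match(self, flexion, orientation):
        """Return the closest known posture, or None if nothing is close enough."""
        best, best_score = None, float("inf")
        for p in self.postures:
            finger_dist = np.linalg.norm(p.flexion - flexion)    # distance in the 5D finger constellation
            if finger_dist > self.flexion_threshold:
                continue
            orient_dist = np.linalg.norm(p.orientation - orientation) * p.orientation_relevance
            if orient_dist > self.orientation_threshold:
                continue
            score = finger_dist + orient_dist
            if score < best_score:
                best, best_score = p, score
        return best

    def update(self, flexion, orientation):
        """Feed one (already filtered) glove sample; return a posture name once it has been held long enough."""
        match = self.best_match(np.asarray(flexion, float), np.asarray(orientation, float))
        now = time.monotonic()
        if match is None or match is not self._candidate:
            # New candidate (or none): restart the dwell timer.
            self._candidate, self._candidate_since = match, now
            return None
        if now - self._candidate_since >= self.hold_time:
            return match.name          # recognized: candidate held for the minimum time span
        return None
```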

They find that the system works well for applications like American Sign Language recognition, and for letting the user "click" an object by pointing at it and making a "tapping" motion. Their sample implementation is a virtual desktop where the user selects a document by moving his hand over it and grabs it by making a fist; they have other gestures for opening and browsing through documents.
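As an illustration of how recognized postures might drive such a virtual desktop, here is a small continuation of the sketch above. The posture names and the VirtualDesktop interface are hypothetical; the paper does not describe its mapping in these terms.

```python
# Hypothetical mapping of recognized postures to virtual-desktop actions,
# continuing the GestureManager sketch above. Posture names and the
# VirtualDesktop interface are illustrative, not the authors' actual API.
class VirtualDesktop:
    def document_under_hand(self):
        """Return the document currently under the hand cursor (stub)."""
        ...

    def highlight(self, doc): ...
    def open(self, doc): ...
    def grab(self, doc): ...

def dispatch(posture_name, desktop):
    """Translate a recognized posture into a desktop action."""
    doc = desktop.document_under_hand()
    if doc is None:
        return
    if posture_name == "point":          # hovering with an extended index finger
        desktop.highlight(doc)
    elif posture_name == "point_tap":    # the pointing "click" gesture
        desktop.open(doc)
    elif posture_name == "fist":         # closing the hand to grab the document
        desktop.grab(doc)
```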

Discussion:

I wonder if there might be a good way to recognize a wider range of hand motion by using more than one tracker location. If there were views from the side, could they be incorporated into the big picture for a more complete view, or could the system simply use whichever view sees the largest number of reflectors? I don't know if this is feasible or worthwhile with the equipment we have, or whether users are even really bothered by needing to learn how to hold their hands.

I also imagine that there may be a tradeoff between ease of use and allowing intuitive, natural gestures -- the gestures they describe for browsing are not entirely intuitive to me, and I would not likely guess how to open a document without being shown how to. However, without tactile feedback, making the same gesture to open a book that I would make in physical reality could be just as difficult to accomplish, as some sort of sweeping gesture could be interpreted as moving the book instead.


Deller, M., A. Ebert, et al. (2006). Flexible Gesture Recognition for Immersive Virtual Environments. Proceedings of the Tenth International Conference on Information Visualization (IV 2006).

4 comments:

Brandon said...

Your idea about using multiple trackers is interesting. It would be possible to use two trackers (we have two birds in our flock of birds system). I'm not sure how much extra information we could gain from tracking the hand from two different positions since a single tracker already gives X,Y,Z values. I do think tracking from two different positions would be beneficial for a system that uses cameras that can't easily capture the Z position.

- D said...

The "hardware group" for the class has actually been discussing using several Wii-motes to triangulate position, since we're fairly certain the Wii can only do 2D and uses the sensor bar's two IR sources to get depth. Triangulation would also help in noise reduction.

Grandmaster Mash said...

Intuitive gestures are tricky, since what is intuitive to me might not be intuitive to another. For instance, I use mouse gestures with my internet browser where I slash forward and back to perform the corresponding functions. It takes slightly more time than hitting a button if you have a billion-button mouse, but it is also more intuitive to me. Waving my hand at the screen might be intuitive as well, but having to hold my hand still for half a second might not.

Paul Taele said...

As Aaron brought up, intuitiveness is a tricky affair. The way they defined the gestures for their particular application was most likely what felt natural to them. The next logical step in this research would be a user study on what gestures would be appropriate for their system. Oh, and publishable results. That'd be nice, too.