This paper deals with recognizing motion streams, such as those generated by human body motions like sign language, using an SVD-based similarity measure. They represent a motion as a matrix: columns hold the positions of different joints, and rows correspond to instants in time. Two motions are considered similar if their matrices are similar; the matrices must have the same number of attributes (columns), but they may have different numbers of rows, since a fast gesture can have the same meaning as a slow one. They discuss how SVD exposes the geometric structure of a matrix and can be used to compare two matrices. They performed a study with a CyberGlove covering 18 motions (ASL for "Goodbye," "Idiom," "35," etc.), generating 24 motion streams with 5 to 10 motions per stream; this data let them consider segmentation issues. They also obtained motion capture data for 62 isolated motions from dance movements (each repeated 5 times). They find nearly 100% accuracy on isolated motions and around 94% accuracy on motion streams with their kWAS algorithm (which looks at the first k eigenvectors, with k = 6). kWAS is much more accurate and faster than EROS, and slightly better than MAS, the other algorithms they compare against.
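Since the core idea is comparing motion matrices through their SVDs, here is a rough sketch of a kWAS-style similarity in Python. The function name, the weighting scheme, and the demo data are my own simplifications for illustration, not the authors' exact formula:

```python
import numpy as np

def kwas_similarity(A, B, k=6):
    """Similarity in roughly [0, 1] between motion matrices A and B.

    A and B must have the same number of columns (joints) but may
    have different numbers of rows (frames), since similar motions
    can be performed at different speeds.
    """
    # The right singular vectors capture the geometric structure of
    # the joint space regardless of how many frames the motion spans.
    _, sA, VtA = np.linalg.svd(A, full_matrices=False)
    _, sB, VtB = np.linalg.svd(B, full_matrices=False)
    # Weight each of the first k vector pairs by the product of the
    # corresponding singular values, so dominant directions matter most
    # (my own simplification of the paper's weighting).
    w = sA[:k] * sB[:k]
    w = w / w.sum()
    # |cosine| of the angle between corresponding singular vectors;
    # abs() because the sign of a singular vector is arbitrary.
    cosines = np.abs(np.sum(VtA[:k] * VtB[:k], axis=1))
    return float(np.dot(w, cosines))

# Two renditions of the "same" gesture at different speeds score high:
rng = np.random.default_rng(0)
base = rng.standard_normal((50, 18))   # 50 frames, 18 joint angles
slow = np.repeat(base, 2, axis=0)      # same motion, twice as slow
other = rng.standard_normal((50, 18))  # unrelated motion
sim_same = kwas_similarity(base, slow)   # close to 1.0
sim_diff = kwas_similarity(base, other)  # noticeably lower
```

Duplicating every frame leaves the right singular vectors unchanged (the Gram matrix just doubles), which is why the slow rendition still matches, mirroring the paper's point that tempo should not change a motion's identity.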
Discussion:
I appreciate that they used words that appear in regular ASL rather than just letters, since having a broad selection of recognized words will probably be more useful to native signers than fingerspelling. It's also nice to see an unusual application like dance, even if it includes only isolated motions. I'm curious how complex these repeated short motions were: were they basic steps that a whole dance might be built from, the kind someone would drill repeatedly to master, or just brief segments of the dance that are not commonly practiced in isolation? It might also be interesting to see how well a practiced step's motion data corresponds to the data for that same step embedded in a more complicated sequence of dance movements, and whether the difference between these cases is greater for novice dancers than for experts.
1 comment:
I like the relative simplicity of this approach. We're not using a super complicated HMM that we have to train with a bunch of data. Instead, we're just using template matching. The templates and the data are expressed using eigenvectors, and the similarity metric is just cosine of the angle between them. Beautiful. This was one paper I actually really liked, as it does both recognition and segmentation (with that cheesy moving window of frames thing, which has problems all its own).
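The "moving window" the comment refers to can be sketched in a few lines. The window length, step size, threshold, and the toy similarity below are my own guesses at the general idea, not the paper's actual segmentation procedure:

```python
import numpy as np

def cosine_sim(A, B):
    """Toy stand-in similarity: cosine between the mean joint-angle
    vectors (the paper instead compares singular vectors of the
    motion matrices)."""
    a, b = A.mean(axis=0), B.mean(axis=0)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def segment_stream(stream, templates, window=50, step=5, thresh=0.95):
    """Slide a fixed-size window over an unsegmented stream; yield
    (start_frame, label) wherever the best-matching template scores
    above the threshold."""
    for start in range(0, len(stream) - window + 1, step):
        chunk = stream[start:start + window]
        label, score = max(
            ((name, cosine_sim(chunk, tmpl)) for name, tmpl in templates.items()),
            key=lambda pair: pair[1],
        )
        if score >= thresh:
            yield start, label
```

One problem the comment alludes to is visible right away: a fixed window length sits awkwardly with the paper's own point that the same gesture can be performed fast or slow.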