Techniques for Facial Expression Recognition Using the Kinect

TR Number
Journal Title
Journal ISSN
Volume Title
Virginia Tech

Facial expressions convey non-verbal cues. Humans use facial expressions to show emotions, which play an important role in interpersonal relations and can be of use in many applications involving psychology, human-computer interaction, health care, e-commerce, and many others. Although humans recognize facial expressions in a scene with little or no effort, reliable expression recognition by machine is still a challenging problem.

Automatic facial expression recognition (FER) has several related problems: face detection, face representation, extraction of the facial expression information, and classification of expressions, particularly under conditions of input data variability such as illumination and pose variation. A system that performs these operations accurately and in realtime would be a major step forward in achieving a human-like interaction between the man and machine.

This document introduces novel approaches for the automatic recognition of the basic facial expressions, namely, happiness, surprise, sadness, fear, disgust, anger, and neutral using relatively low-resolution noisy sensor such as the Microsoft Kinect. Such sensors are capable of fast data collection, but the low-resolution noisy data present unique challenges when identifying subtle changes in appearance. This dissertation will present the work that has been done to address these challenges and the corresponding results. The lack of Kinect-based FER datasets motivated this work to build two Kinect-based RGBD+time FER datasets that include facial expressions of adults and children. To the best of our knowledge, they are the first FER-oriented datasets that include children. Availability of children data is important for research focused on children (e.g., psychology studies on facial expressions of children with autism), and also allows researchers to do deeper studies on automatic FER by analyzing possible differences between data coming from adults and children.

The key contributions of this dissertation are both empirical and theoretical. The empirical contributions include the design and successful test of three FER systems that outperform existing FER systems either when tested on public datasets or in realtime. One proposed approach automatically tunes itself to the given 3D data by identifying the best distance metric that maximizes the system accuracy. Compared to traditional approaches where a fixed distance metric is employed for all classes, the presented adaptive approach had better recognition accuracy especially in non-frontal poses. Another proposed system combines high dimensional feature vectors extracted from 2D and 3D modalities via a novel fusion technique. This system achieved 80% accuracy which outperforms the state of the art on the public VT-KFER dataset by more than 13%. The third proposed system has been designed and successfully tested to recognize the six basic expressions plus neutral in realtime using only 3D data captured by the Kinect. When tested on a public FER dataset, it achieved 67% (7% higher than other 3D-based FER systems) in multi-class mode and 89% (i.e., 9% higher than the state of the art) in binary mode. When the system was tested in realtime on 20 children, it achieved over 73% on a reduced set of expressions. To the best of our knowledge, this is the first known system that has been tested on relatively large dataset of children in realtime. The theoretical contributions include 1) the development of a novel feature selection approach that ranks the features based on their class separability, and 2) the development of the Dual Kernel Discriminant Analysis (DKDA) feature fusion algorithm. This later approach addresses the problem of fusing high dimensional noisy data that are highly nonlinear distributed.

Facial Expression Recognition, FACS, Dual KDA, EMFACS, Action Units, ASD