If there is a root domain to the recent explosion in deep learning, it’s certainly computer vision, the analysis of image and video data. So it came as no surprise that we tried our luck with computer vision techniques while studying deep learning. Long story short, my partner (Maximiliane Uhlich) and I decided to apply this form of deep learning to images of romantic couples. Specifically, we wanted to find out whether we could accurately tell, from an image or video of any given couple, if they are happy in their relationship. Turns out, we can! With a classification accuracy of nearly 97 percent, our final model (which we dubbed DeepConnection) was able to clearly differentiate between unhappy and happy couples. You can read the full story in our preprint; what follows here is a rough sketch of what we did.
The mainstay of deep learning in computer vision is the convolutional neural network (CNN), which, briefly, applies different filters to an image to efficiently detect hierarchies of features. So, of course, we used a CNN for the base level of our model. With a twist. Another popular architecture is the residual neural network (ResNet), basically a CNN with skip connections that let the network bypass layers that don’t benefit classification performance. By choosing such a ResNet base model, pretrained on millions of diverse images, we only had to fine-tune it to get a pretty good idea of what happy and unhappy couples look like.
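To make the transfer-learning idea concrete, here is a minimal sketch in PyTorch/torchvision. The specific ResNet variant, the freezing scheme, and the learning rate are illustrative assumptions, not the exact choices from our preprint:

```python
import torch
import torch.nn as nn
from torchvision import models

# Load a ResNet pretrained on ImageNet (millions of diverse images).
base = models.resnet34(weights=models.ResNet34_Weights.DEFAULT)

# Swap the original 1000-class head for a two-class head: happy vs. unhappy.
base.fc = nn.Linear(base.fc.in_features, 2)

# Fine-tune: freeze the pretrained backbone, train only the new head at first.
for param in base.parameters():
    param.requires_grad = False
for param in base.fc.parameters():
    param.requires_grad = True

optimizer = torch.optim.Adam(
    (p for p in base.parameters() if p.requires_grad), lr=1e-3
)
criterion = nn.CrossEntropyLoss()
```

The payoff of this setup is that the backbone already knows generic visual features (edges, textures, faces), so only the new head has to learn the happy/unhappy distinction from our comparatively small dataset.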
But DeepConnection goes further than this! We took the features produced by the ResNet base and fed them into a spatial pyramid pooling (SPP) layer. Here, the processed feature map is divided into grids of different sizes, and only the maximum value of each grid cell is passed on for further analysis. This lets the model focus on the most important features, makes it robust to different image sizes, and more resistant to image perturbations. After that, we placed a power mean transformation (PMT) layer, which applies several power-mean functions to the data to introduce additional nonlinearities and allow DeepConnection to capture more complex relationships in the data. Everything then flows into the final network layers, which compute two scores for any given image: one for "happy" and one for "unhappy", with the larger score giving the model’s prediction.
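For readers who want to see these two building blocks spelled out, here is a hedged sketch in PyTorch. The pyramid levels and exponents are placeholder values, and the power transformation below is a simplified stand-in for the PMT described in the preprint, not a faithful reimplementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialPyramidPooling(nn.Module):
    """Max-pool the feature map over grids of several sizes and concatenate
    the results into one fixed-length vector, independent of input size."""
    def __init__(self, levels=(1, 2, 4)):
        super().__init__()
        self.levels = levels

    def forward(self, x):  # x: (batch, channels, height, width)
        pooled = [
            F.adaptive_max_pool2d(x, output_size=level).flatten(start_dim=1)
            for level in self.levels
        ]
        # Output length is channels * sum(level**2), whatever H and W were.
        return torch.cat(pooled, dim=1)

def power_transform(x, powers=(1.0, 2.0, 3.0)):
    """Apply several power functions to the features and concatenate them,
    injecting extra nonlinearity before the final classification layers."""
    x = F.relu(x)  # keep values non-negative so the powers stay well-behaved
    return torch.cat([x.pow(p) for p in powers], dim=1)
```

The SPP output feeds into `power_transform`, and the enlarged feature vector then goes through fully connected layers that emit the two class scores.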
The interesting part was then to look at what the model had actually learned. We generated heatmaps for several images to find out which regions of an image were most important for the model’s decision. Perhaps unsurprisingly, facial features were the most essential aspects for classification. As humans, our faces are very expressive and a potent means of communication. But other aspects, such as body posture, were relevant for the prediction as well. Even the presence of other people who were not part of the couple didn’t unduly influence DeepConnection. That’s why we think this model has potential for applications in public, in private, and in the context of couple therapy for real-time feedback on the happiness state of a couple. But you should really read the preprint; there’s much more to discover!
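For the curious: Grad-CAM is one common technique for producing such heatmaps (check the preprint for the exact method we used). A minimal sketch for a PyTorch CNN, assuming `model.layer4` as the last convolutional block of a torchvision ResNet:

```python
import torch
import torch.nn.functional as F

def grad_cam(model, image, target_class, target_layer):
    """Return an (H, W) heatmap of how much each region drove the class score."""
    model.eval()
    activations, gradients = [], []
    fwd = target_layer.register_forward_hook(
        lambda module, inputs, output: activations.append(output))
    bwd = target_layer.register_full_backward_hook(
        lambda module, grad_in, grad_out: gradients.append(grad_out[0]))
    try:
        inp = image.unsqueeze(0).requires_grad_(True)  # (1, 3, H, W)
        score = model(inp)[0, target_class]
        model.zero_grad()
        score.backward()
    finally:
        fwd.remove()
        bwd.remove()
    acts = activations[0].detach()                  # (1, C, h, w)
    grads = gradients[0].detach()                   # (1, C, h, w)
    weights = grads.mean(dim=(2, 3), keepdim=True)  # per-channel importance
    cam = F.relu((weights * acts).sum(dim=1))       # weighted sum: (1, h, w)
    cam = cam / (cam.max() + 1e-8)                  # normalize to [0, 1]
    # Upsample to the input resolution so it can be overlaid on the photo.
    return F.interpolate(cam.unsqueeze(0), size=image.shape[1:],
                         mode="bilinear", align_corners=False)[0, 0]
```

Calling, say, `grad_cam(base, img, target_class=0, target_layer=base.layer4)` yields a map you can overlay on the original photo to see which regions, such as faces or posture, carried the decision.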
Update: For an overview, you can also have a look at an explainer article about this work at Towards Data Science.