The Perfect Dataset

Smart devices, digital assistants, and service chatbots are becoming ubiquitous, but they don’t yet have the emotional capacity they need to fully understand how we communicate. Emotional AI models should be able to detect specific emotions and act with empathy, understand societal norms, and recognize nuances of communication such as irony and humor. Datasets of emotional data exist, but they are too small and lack realistic context. Fortunately, there’s a dataset that can satisfy all these requirements: movies.

Cinema is the holy grail of emotional and anthropological data, for these five reasons:

Film Data is Self-Labelling

Movies contain multiple “dimensions” (e.g. word choice, facial expression, etc.) of emotional data, which can be used in tandem as “self-labelling” data. For example, what if we want to know the emotional impact of the phrase “I never want to see you again”? We can look across every film to see a character’s reaction when the phrase is said to them - the reactions to this phrase are already labeled.

Most of the time, the character’s face will fall into a frown. Sometimes, the character will laugh. Never will the character’s head fall off. A film is a reproduction of how we interact with each other, and the data is valid because movies depict dramatizations of society. A model trained on movies will therefore find only plausible reactions to this phrase. And these are only two dimensions. There are many more, such as voice tone, word choice, and the expression of the character delivering the phrase.
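As a rough illustration of what “self-labelling” could look like in practice, here is a minimal Python sketch. It assumes each film has already been parsed into records that pair a spoken line with the listener’s reaction. The `Exchange` schema, the reaction labels, and the toy records are all hypothetical, invented here for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Exchange:
    """One spoken line paired with the listener's reaction (hypothetical schema)."""
    film: str
    line: str
    listener_reaction: str  # assumed to come from, e.g., a facial-expression model


def reaction_distribution(exchanges, phrase):
    """Return the share of each listener reaction to lines containing `phrase`."""
    needle = phrase.lower()
    counts = Counter(
        e.listener_reaction for e in exchanges if needle in e.line.lower()
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    return {reaction: n / total for reaction, n in counts.items()}


# Made-up toy records; a real corpus would span many films.
exchanges = [
    Exchange("Film A", "I never want to see you again.", "frown"),
    Exchange("Film B", "I never want to see you again!", "frown"),
    Exchange("Film C", "Fine. I never want to see you again.", "frown"),
    Exchange("Film D", "I never want to see you again.", "laugh"),
    Exchange("Film E", "See you tomorrow.", "smile"),
]

dist = reaction_distribution(exchanges, "I never want to see you again")
print(dist)
```

The reaction column acts as the label for the line column, so no human annotator is needed for this step.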

A Movie is 90 Minutes of Cause-and-Effect

Existing datasets lack a clear emotional antecedent: the stimulus that causes a change in behavior. RAVDESS has actors delivering lines such as “Kids are talking by the door”, but the line has no actual context, because there are no kids. The direct, literal antecedent of original-content datasets is “I am demonstrating how this sentence is delivered in several different emotional contexts.”

In cinema, there’s cause-and-effect for everything. Abstractly, a two-character dialogue scene is just a layering of conversational antecedents atop one another. A character says something, which causes the other character to respond, which causes the first character to respond, and so on. What was said just before to make a character declare “Prison changes you”?
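The layering of conversational antecedents can be sketched in a few lines of Python, assuming a scene is available as an ordered list of (speaker, text) turns. The function names and the short scene are invented for illustration; only the target phrase comes from the text above.

```python
def antecedent_pairs(turns):
    """Pair every line after the first with the line spoken just before it."""
    return list(zip(turns, turns[1:]))


def antecedents_of(turns, phrase):
    """Return the turns that immediately precede any line containing `phrase`."""
    needle = phrase.lower()
    return [prev for prev, cur in antecedent_pairs(turns) if needle in cur[1].lower()]


# Invented two-character scene, for illustration only.
scene = [
    ("ANNA", "You used to laugh at my jokes."),
    ("BEN", "Prison changes you."),
    ("ANNA", "So does waiting."),
]

causes = antecedents_of(scene, "Prison changes you")
print(causes)
```

This only looks one turn back. A fuller treatment would likely need a longer window of preceding turns, since the real cause of a line may sit several exchanges earlier.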

At a plot level, we can track characters’ motivations and what’s important to them, which allows us to interpret their reactions to antecedents on a broader scale. If we know a character is excited about a job interview, how do they react when they hear they didn’t get the job?

Cinema is a Document of Societal Norms

Going further into the contextual benefits of cinematic expression, movies offer real-world societal context. Consider an average movie scene that takes place in a restaurant. In a restaurant, there are a number of rules that diners follow. What usually happens when a server asks “What can I get you?” or “Can I take your order?” And a common dynamic in restaurant scenes involves one character starting to lose their cool and yelling, and the other calming them down and asking them to be quiet. It’s an unspoken rule to not make a scene in restaurants.

Airport scenes are typically filled with goodbyes. Driving scenes depict characters seated in a specific pattern, facing forward. Birthday celebrations show a number of rituals, including cake, candles, and a song. Although these settings aren’t necessarily important to the plot of the film (e.g. a restaurant dialogue could be adapted to take place anywhere else), they provide important societal context, reflecting the typical anthropological rules and rituals of how people actually act in specific locations and situations.

Understanding Human Behavior Requires Multiple Streams of Emotional Data

Many datasets focus on only one specific type of data: facial expression, word-choice sentiment, voice tone, etc. They all get the job done in their respective fields, but some societal norms are too nuanced to be quantified by a single type of emotional data. One of the biggest examples, sarcasm, has notoriously stumped AI models. In a scene from The Simpsons, Comic Book Guy sarcastically muses on how (un-)useful a sarcasm detector would be.

Sarcasm often fools AI models that track word-choice sentiment, because positive words are used in a negative manner. But if we looked across all of cinema for positive word choice combined with a deadpan voice tone and a neutral facial expression, we could reliably find instances of characters being sarcastic. A number of other mannerisms also require context from other streams, such as facial models detecting crying (tears of heartbreak vs. tears of joy) and body language models detecting clapping (a crowd cheering vs. a villain emerging from the shadows).
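The combination of streams described above can be sketched as a simple rule in Python. This assumes three separate models have already scored each moment of a film; the `Moment` fields, the label names, and the 0.5 threshold are assumptions made for illustration, and the sample lines are invented.

```python
from dataclasses import dataclass


@dataclass
class Moment:
    """Per-stream readings for one delivered line (hypothetical schema)."""
    text: str
    text_sentiment: float   # assumed range -1.0 (negative) to 1.0 (positive)
    voice_tone: str         # assumed label from a voice model, e.g. "deadpan"
    facial_expression: str  # assumed label from a facial model, e.g. "neutral"


def looks_sarcastic(moment, positive_threshold=0.5):
    """Flag positive words delivered with a deadpan voice and a neutral face."""
    return (
        moment.text_sentiment >= positive_threshold
        and moment.voice_tone == "deadpan"
        and moment.facial_expression == "neutral"
    )


# Invented sample moments, for illustration only.
moments = [
    Moment("Oh, great. Another meeting.", 0.8, "deadpan", "neutral"),
    Moment("Oh, great! You made it!", 0.8, "warm", "smile"),
    Moment("This is terrible.", -0.7, "deadpan", "neutral"),
]

flags = [looks_sarcastic(m) for m in moments]
print(flags)
```

A text-only model would score the first two moments identically, and only the voice and face streams separate them. A hard rule like this is a simplification; a trained model would more plausibly learn how to weigh the streams.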

It’s Already There

Hundreds of thousands of movies already exist, ready to be parsed as structured data. They contain lots of emotional cause-and-effect information, and exist as a mirror image of how society perceives itself. Existing datasets are limited in size, may only focus on specific streams of emotional data, are devoid of societal context, and lack the behavioral antecedents needed to truly understand emotional response. Hiring actors to deliver lines or paying people to watch and label clips of TV requires lots of human effort (which translates to a monetary cost). Again, movies are already there, ready to be turned into self-labelling, contextually-rich data.