Turns out AI is really bad at picking up on social cues

Ernest Hemingway had an influential theory about fiction that might explain a lot about a particular weakness of artificial intelligence, or AI. In Hemingway’s opinion, the best stories are like icebergs: what characters actually say and do sits above the surface but makes up only a fraction of the unfolding action. The rest of the story (the characters’ motivations, feelings, and understanding of the world) ideally resides beneath the surface, like the bulk of an iceberg, serving as unarticulated subtext for all that transpires.

Perhaps the reason Hemingway’s theory struck a chord is that human beings are like icebergs, too. Whatever people say or do at any given moment is undergirded by reams of nonverbal context that exists beyond the cold, hard facts of what may appear to be happening. What does it look like when there’s tension between two people, or supreme comfort? What kind of face does someone make when they’re desperately trying to end a conversation? These are things humans come to understand intuitively. According to a new study from Johns Hopkins University, though, AI remains hopelessly out of its depth at interpreting such things.

“I don’t think humans even have a full understanding of how we pick up on nonverbal social cues in the moment, but the idea behind most modern AI systems is that they should just be able to pick up on it from all of the data they’re trained on,” says Leyla Isik, lead author of the study.

Isik is a cognitive scientist whose work centers on human vision and social perception. She had recently read a good deal of scientific work suggesting that current AI models are adept at discerning human behavior when categorizing objects in static images. Since much of the AI of the near future won’t be parsing static images, though, but rather processing dynamic action in real time, Isik set out to determine whether AI could correctly identify what is happening in videos of people engaged in different social interactions with each other.

It’s the kind of thing a person would want their self-driving car to excel at before trusting it to correctly size up, say, whether two people are having a heated exchange on a nearby sidewalk, and whether one of them seems one harsh word away from bolting into the crosswalk.

Isik’s team asked a group of people to watch three-second video clips of humans either engaging with each other or doing independent activities near each other, and to interpret what the clips portrayed. Sourced from a computer-vision data set, the clips included everyday actions ranging from driving to cooking to dancing. The researchers then fed the same short clips to 350 AI language, video, and image models and asked them to predict what humans would say and feel about them. All of the videos were soundless, so neither humans nor AI models could use vocal tone, pitch, or dialogue to contextualize what they were taking in.
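To make that comparison concrete, here is a minimal Python sketch of how a model’s per-clip predictions could be scored against a human consensus, the general approach behind studies like this one. Everything in it is an illustrative assumption rather than a detail from the paper: the clip and rater counts, the 1-to-5 rating scale, and the random numbers standing in for real ratings and predictions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes and scale -- not the study's actual numbers.
n_clips, n_raters = 200, 25
human_ratings = rng.uniform(1, 5, size=(n_raters, n_clips))   # stand-in for real 1-5 ratings
model_predictions = rng.uniform(1, 5, size=n_clips)           # stand-in for one model's answers

# Human "consensus" for each clip is the mean rating across raters.
consensus = human_ratings.mean(axis=0)

# Score the model by correlating its predictions with that consensus.
model_vs_human = np.corrcoef(model_predictions, consensus)[0, 1]
print(f"model-human agreement (Pearson r): {model_vs_human:.2f}")

# A useful ceiling: how well one half of the raters predicts the other half.
half_a = human_ratings[: n_raters // 2].mean(axis=0)
half_b = human_ratings[n_raters // 2 :].mean(axis=0)
human_vs_human = np.corrcoef(half_a, half_b)[0, 1]
print(f"human split-half agreement:        {human_vs_human:.2f}")
```

With random placeholder data both numbers hover near zero; the point of the sketch is the structure of the comparison, in which the interesting question is how close each model gets to the human-to-human ceiling.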

The results were conclusive: while human participants were overwhelmingly in agreement about what was happening in the videos, the AI models were not.

To be clear, the AI models were able to determine some aspects of what transpired in the clips. The scientists asked questions about things like whether a video was taking place indoors or outdoors, and in a small enclosed space or a large open setting. The AI always matched humans on those kinds of questions.

They were less successful, however, at peering beneath the surface details.

“Pretty much everything else, we found that most AI models struggled at some subset of it,” Isik says. “Including questions as simple as ‘Are these two people in the video facing each other or not?’ All the way up to higher level questions like, ‘Are these people communicating?’ and ‘Does this video seem like it’s depicting a positive or negative interaction?’”

The researchers asked, in particular, about both the emotional valence of a scene (whether it appeared to be positive or negative) and its level of arousal (how intense or engaging the action in the video seemed). While the human participants couldn’t always pin down exactly what was being communicated in a video, they could still judge whether a scene seemed, say, intensely positive or mildly negative. The AI models, though, could not read the subtext in those nonverbal cues.

This disparity is likely due, the study suggests, to the fact that AI systems are largely built on neural networks inspired by the part of the brain that processes static images, rather than the parts that process dynamic social interactions. Most AI models are trained to see an image and recognize objects and faces, but not relationships, context, or social dynamics. They may be trained on data sets that encompass movies, YouTube clips, or Zoom calls, and they may have encountered labels explaining what smiles, crossed arms, or furrowed brows mean. But they lack the accumulated experience of years and decades spent constantly encountering those cues and cultivating an intuitive sense of how to navigate them in real time.
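One way to see the frame-centric problem in miniature: a system that summarizes a video by averaging features computed on each frame throws away the order of events, and order is often where the social meaning lives. The numpy sketch below is a toy illustration of that single point, with made-up frame counts and feature sizes; it is not a description of the architectures the study actually tested.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical per-frame features for a three-second clip at 30 fps.
n_frames, feat_dim = 90, 512
frame_features = rng.normal(size=(n_frames, feat_dim))

clip_forward = frame_features          # e.g. two people walking toward each other
clip_reversed = frame_features[::-1]   # the same frames played in reverse

# A purely frame-based "video embedding": average the per-frame features.
emb_forward = clip_forward.mean(axis=0)
emb_reversed = clip_reversed.mean(axis=0)

# The two embeddings are identical, so anything built on top of them cannot
# tell approaching apart from retreating -- the temporal structure carrying
# the social meaning has been averaged away.
print(np.allclose(emb_forward, emb_reversed))   # True
```

Architectures with explicit temporal components do keep track of frame order, but the study’s results suggest that handling time by itself is not the same as having the social intuition people bring to these clips.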

Another line of research in Isik’s lab at Johns Hopkins is developing models that build more human-centered priorities into modern AI systems, so perhaps her work will eventually help close some of these gaps.

If so, it won’t be a second too soon, as the AI boom continues to expand into therapy, AI companions, and other areas that rely on nonverbal cues and everything else lurking beneath the surface.

“Any time you want assistive AI or certainly assistive robots in the workplace or in the home, you’re gonna want it to be able to pick up on these subtle nonverbal cues,” Isik says. “More basically, though, you also just want it to know what people are doing with each other. And I think this study highlights that we’re still pretty far from that reality with a lot of these systems.”

https://www.fastcompany.com/91324372/ai-is-really-bad-at-picking-up-on-social-cues-artificial-intelligence-study
