New Technique Overcomes Spurious Correlations Problem in AI

AI models often rely on “spurious correlations,” making decisions based on unimportant and potentially misleading information. Researchers have now discovered these learned spurious correlations can be traced to a very small subset of the training data and have demonstrated a technique that overcomes the problem.

“This technique is novel in that it can be used even when you have no idea what spurious correlations the AI is relying on,” says Jung-Eun Kim, corresponding author of a paper on the work and an assistant professor of computer science at North Carolina State University. “If you already have a good idea of what the spurious features are, our technique is an efficient and effective way to address the problem. However, even if you are simply having performance issues, but don’t understand why, you could still use our technique to determine whether a spurious correlation exists and resolve that issue.”

Spurious correlations are generally caused by simplicity bias during AI training. Practitioners use data sets to train AI models to perform specific tasks. For example, an AI model could be trained to identify photographs of dogs. The training data set would include pictures of dogs labeled as containing a dog. During the training process, the AI begins identifying specific features it can use to recognize dogs. However, if many of the dogs in the photos are wearing collars, and collars are visually simpler features than ears or fur, the AI may learn to use collars as a shortcut for identifying dogs, as the toy example below illustrates. This is how simplicity bias can cause spurious correlations.
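To make that mechanism concrete, here is a minimal, hypothetical sketch in plain Python/NumPy. It builds a toy "dog detector" in which a simple collar indicator co-occurs with the label in 95% of training examples while the genuine feature is noisy; the feature names, noise levels, and correlation strengths are invented for illustration, not taken from the paper. The trained classifier leans on the collar bit and loses accuracy once that correlation breaks at test time.

```python
# Toy illustration of simplicity bias: a simple spurious feature (a "collar" bit)
# co-occurs with the label in most training examples, while the genuine feature
# is noisy. A linear classifier trained on this data relies on the collar and
# fails when the correlation no longer holds. All numbers here are illustrative.
import numpy as np

rng = np.random.default_rng(0)

def make_data(n, collar_correlation):
    y = rng.integers(0, 2, size=n)                       # 1 = dog, 0 = not a dog
    core = y + rng.normal(0, 1.5, size=n)                # genuine but noisy feature (fur/ears)
    collar = np.where(rng.random(n) < collar_correlation, y, 1 - y)  # spurious feature
    X = np.column_stack([core, collar.astype(float)])
    return X, y

def train_logreg(X, y, lr=0.1, steps=2000):
    w = np.zeros(X.shape[1]); b = 0.0
    for _ in range(steps):
        p = 1 / (1 + np.exp(-(X @ w + b)))
        w -= lr * (X.T @ (p - y)) / len(y)
        b -= lr * np.mean(p - y)
    return w, b

def accuracy(w, b, X, y):
    return np.mean(((X @ w + b) > 0).astype(int) == y)

# Training data: collars and dog labels co-occur 95% of the time.
Xtr, ytr = make_data(5000, collar_correlation=0.95)
w, b = train_logreg(Xtr, ytr)
print("learned weights (core, collar):", np.round(w, 2))

# Test data where the collar no longer predicts the label (e.g., cats wearing collars).
Xte, yte = make_data(5000, collar_correlation=0.5)
print("test accuracy without the shortcut:", round(accuracy(w, b, Xte, yte), 3))
```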

“And if the AI uses collars as the factor it uses to identify dogs, the AI may identify cats wearing collars as dogs,” Kim says.

Conventional techniques for addressing problems caused by spurious correlations rely on practitioners being able to identify the spurious features that are causing the problem. They can then address this by modifying the data sets used to train the AI model. For example, practitioners might increase the weight given to photos in the data set that include dogs that are not wearing collars.
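As an illustration of this conventional, spurious-feature-aware approach, the hedged PyTorch sketch below upweights the training samples that lack a known spurious feature. The per-sample "has_collar" flag, the dataset, and the weights are placeholders invented for the example; the point is simply that the fix requires already knowing which samples carry the spurious feature.

```python
# Conventional remedy sketch: if practitioners can flag which samples contain the
# spurious feature, they can upweight the samples without it so the model sees
# proportionally more collar-free dogs. Dataset and flags below are placeholders.
import torch
from torch.utils.data import TensorDataset, DataLoader, WeightedRandomSampler

# Placeholder data: features, labels, and a flag marking the known spurious feature.
features = torch.randn(1000, 32)
labels = torch.randint(0, 2, (1000,))
has_collar = torch.rand(1000) < 0.9          # 90% of samples carry the spurious feature

dataset = TensorDataset(features, labels)

# Upweight the rare collar-free samples so each group is sampled roughly equally.
weights = torch.where(has_collar, torch.tensor(1.0), torch.tensor(9.0))
sampler = WeightedRandomSampler(weights.double(), num_samples=len(dataset), replacement=True)

loader = DataLoader(dataset, batch_size=64, sampler=sampler)
# ...train the model on `loader` as usual; minority-group samples now appear about
# as often as majority-group ones, weakening the shortcut the model can exploit.
```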

However, in their new work, the researchers demonstrate that it is not always possible to identify the spurious features that are causing problems – making conventional techniques for addressing spurious correlations ineffective.

“Our goal with this work was to develop a technique that allows us to sever spurious correlations even when we know nothing about those spurious features,” Kim says.

The new technique relies on removing a small portion of the data used to train the AI model.

“There can be significant variation in the data samples included in training data sets,” Kim says. “Some of the samples can be very simple, while others may be very complex. And we can measure how ‘difficult’ each sample is based on how the model behaved during training.

“Our hypothesis was that the most difficult samples in the data set can be noisy and ambiguous, and are most likely to force a network to rely on irrelevant information that hurts a model’s performance,” Kim explains. “By eliminating a small sliver of the training data that is difficult to understand, you are also eliminating the hard data samples that contain spurious features. This elimination overcomes the spurious correlations problem, without causing significant adverse effects.”
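In code, this idea reduces to scoring each training sample's difficulty during an initial training run, dropping the hardest few percent, and retraining on what remains. The PyTorch sketch below is a hypothetical illustration under two assumptions not stated in this article: it uses average per-sample training loss as the difficulty score and prunes 3% of the data; the paper's actual metric and pruning fraction may differ, and the model and data here are placeholders.

```python
# Hedged sketch of difficulty-based data pruning, assuming per-sample average
# training loss as the difficulty score and a 3% pruning fraction (both are
# illustrative assumptions). Flow: (1) train once while recording each sample's
# loss, (2) drop the hardest sliver, (3) retrain on the pruned set.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Subset, TensorDataset

def train_and_score(model, dataset, epochs=5, batch_size=64):
    """Train the model and return each sample's average loss across epochs."""
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=False)
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    loss_fn = nn.CrossEntropyLoss(reduction="none")
    scores = torch.zeros(len(dataset))
    for _ in range(epochs):
        for i, (x, y) in enumerate(loader):
            losses = loss_fn(model(x), y)                # one loss value per sample
            losses.mean().backward()
            opt.step(); opt.zero_grad()
            start = i * batch_size
            scores[start:start + len(y)] += losses.detach() / epochs
    return scores

def prune_hardest(dataset, scores, prune_fraction=0.03):
    """Remove the highest-difficulty sliver of the training set."""
    n_keep = int(len(dataset) * (1 - prune_fraction))
    keep_idx = torch.argsort(scores)[:n_keep]            # keep the easiest samples
    return Subset(dataset, keep_idx.tolist())

# Placeholder data and model, standing in for a real image dataset and network.
data = TensorDataset(torch.randn(2000, 32), torch.randint(0, 2, (2000,)))
model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 2))

scores = train_and_score(model, data)
pruned = prune_hardest(data, scores)

# Retrain a fresh model on the pruned data; the hardest samples, which are most
# likely to carry noisy or spurious features, are no longer in the training set.
final_model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 2))
_ = train_and_score(final_model, pruned)
```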

The researchers demonstrated that the new technique achieves state-of-the-art results – improving performance even when compared to previous work on models where the spurious features were identifiable.

The peer-reviewed paper, “Severing Spurious Correlations with Data Pruning,” will be presented at the International Conference on Learning Representations (ICLR), being held in Singapore from April 24-28. First author of the paper is Varun Mulchandani, a Ph.D. student at NC State.

“Severing Spurious Correlations with Data Pruning”

Authors: Varun Mulchandani and Jung-Eun Kim, North Carolina State University

Presented: April 24-28, ICLR 2025
Regions: North America, United States
Keywords: Applied science, Artificial Intelligence, Computing, Engineering, Technology
