In recent years, foundation Vision-Language Models (VLMs) such as CLIP [1], which enable zero-shot transfer to a wide variety of domains without fine-tuning, have led to a significant shift in machine learning systems. Their success can primarily be credited to three factors: web-scale multimodal data, contrastive training objectives, and the rise of the Transformer architecture. Despite these impressive capabilities, it is concerning that VLMs are prone to inheriting biases from the uncurated datasets scraped from the Internet [4–8]. These biases can be examined from three perspectives: (1) Label bias: certain classes (words) appear more frequently than others in the pre-training data. (2) Spurious correlation: non-target features, e.g., image backgrounds, are correlated with labels, resulting in poor group robustness. (3) Social bias: a special form of spurious correlation that concerns societal harm, where unaudited image-text pairs may encode human prejudices, e.g., regarding gender, ethnicity, and age, that become correlated with targets. These biases are subsequently propagated to downstream tasks, leading to biased predictions.
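As a concrete illustration of label bias in zero-shot classification (not the paper's method), the sketch below uses the Hugging Face transformers CLIP interface to count how often each class is predicted on a set of unlabeled images; the class names, prompts, and image paths are illustrative assumptions. On a roughly balanced image set, a heavily skewed prediction histogram would suggest a label-frequency bias inherited from pre-training.

```python
# Minimal sketch: probing label bias in CLIP zero-shot classification by
# inspecting the prediction histogram over unlabeled images.
# Class names and image paths below are placeholders for illustration.
from collections import Counter

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

class_names = ["dog", "cat", "horse", "sheep"]         # hypothetical label set
prompts = [f"a photo of a {c}" for c in class_names]   # standard CLIP-style prompts
image_paths = ["img_001.jpg", "img_002.jpg"]           # placeholder image files

counts = Counter()
for path in image_paths:
    image = Image.open(path).convert("RGB")
    inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits_per_image = model(**inputs).logits_per_image  # shape: (1, num_classes)
    pred = logits_per_image.argmax(dim=-1).item()
    counts[class_names[pred]] += 1

# If one class dominates the predictions despite a roughly balanced image set,
# the skew hints at label bias inherited from the pre-training data.
print(counts)
```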
A research team provides an overview of the three biases prevalent in visual classification with VLMs, along with strategies to mitigate them, in Frontiers of Computer Science, co-published by Higher Education Press and Springer Nature.
Currently, most VLM debiasing methods focus on discriminative tasks, such as image classification, while generative tasks such as image captioning and image generation have received little attention in terms of debiasing. This could become a significant research direction in the future.
DOI: 10.1007/s11704-024-40051-3