In recent years, deep learning has grown rapidly, giving rise to large pre-trained models. ChatGPT understands and responds to human language inputs, and DALL-E generates images from textual descriptions in a zero-shot manner. Another notable model is CLIP (Contrastive Language-Image Pre-Training), which learns representations that bridge images and text and can also perform classification in a zero-shot manner. Trained on a diverse array of images and natural language descriptions readily available on the internet, CLIP can interpret natural language instructions to carry out a wide range of classification tasks without being optimized for any of them. These models have proven effective in many real-world applications, even without training on task-specific data. Notably, CLIP achieved a zero-shot accuracy of 76.2% on the ImageNet dataset. However, a pressing question remains within the machine learning community: Does CLIP know everything?
To address this question, a research team led by Da-Wei Zhou published new research on 15 October 2024 in Frontiers of Computer Science, co-published by Higher Education Press and Springer Nature.
This question is pivotal. If a model could truly understand and react to all information, exploring alternative models might become redundant. In reality, no model, including CLIP, possesses complete knowledge. Our world is in constant flux, with new data, objects, categories, and information emerging regularly. For instance, ChatGPT's knowledge of world events, such as political changes, is limited to what appears in its training data, and CLIP cannot recognize images of products that appeared after its training data was collected, such as the Apple Vision Pro, announced in 2023.
This paper focuses on identifying datasets unknown to CLIP, a task of considerable importance. Given CLIP's training on the extensive LAION dataset, identifying such datasets not only facilitates the application of transfer learning to downstream tasks but also provides a way to evaluate CLIP's ability to detect out-of-distribution or novel instances and to learn continually. This is particularly relevant to addressing the hallucination issues prevalent in large models. To advance research in this area, we introduce TV100, a dataset of TV series released after 2021, to explore CLIP's performance further.
To investigate whether a pre-trained CLIP knows these images, we compare its zero-shot performance with its performance after finetuning. We find that a pre-trained CLIP cannot recognize any classes from the dataset. By contrast, once the CLIP model is finetuned on the images, performance improves drastically, indicating that the dataset is learnable and separable. The dataset therefore has significant potential for use in various research areas, including the evaluation of incremental learning, novel class discovery, and long-tailed learning.
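To make the zero-shot evaluation concrete, the sketch below shows the general idea behind CLIP-style zero-shot classification: an image is assigned to the class whose text embedding has the highest cosine similarity with the image embedding. This is an illustration under stated assumptions, not the authors' evaluation code. The function names (`normalize`, `zero_shot_predict`, `accuracy`) and the toy three-dimensional embeddings are hypothetical; in practice the embeddings would come from CLIP's image and text encoders.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def zero_shot_predict(image_emb, class_text_embs):
    """Return the index of the class whose text embedding best matches the image."""
    img = normalize(image_emb)
    scores = [
        sum(a * b for a, b in zip(img, normalize(text_emb)))
        for text_emb in class_text_embs
    ]
    return max(range(len(scores)), key=scores.__getitem__)


def accuracy(image_embs, labels, class_text_embs):
    """Fraction of images whose predicted class index equals the true label."""
    correct = sum(
        zero_shot_predict(emb, class_text_embs) == label
        for emb, label in zip(image_embs, labels)
    )
    return correct / len(labels)


# Hypothetical toy embeddings standing in for real encoder outputs.
class_text_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # one per class name
image_embs = [[0.9, 0.1, 0.0], [0.2, 0.8, 0.1], [0.6, 0.5, 0.0]]
labels = [0, 1, 1]

print(accuracy(image_embs, labels, class_text_embs))
```

If the class names describe concepts the model never saw during pre-training, the text and image embeddings need not align, and accuracy computed this way can be very low, which is the behavior reported for the pre-trained model on TV100.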
DOI: 10.1007/s11704-024-40217-z