The Neglected Tails of Vision Language Models

1Texas A&M University
2Carnegie Mellon University
3University of Macau
4Zhejiang Lab

*Indicates Equal Contribution

CVPR 2024


Vision-language models (VLMs) excel in zero-shot recognition but their performance varies greatly across different visual concepts. For example, although CLIP achieves impressive accuracy on ImageNet (60-80%), its performance drops below 10% for more than ten concepts like night snake, presumably due to their limited presence in the pretraining data. However, measuring the frequency of concepts in VLMs’ large-scale datasets is challenging. We address this by using large language models (LLMs) to count the number of pretraining texts that contain synonyms of these concepts. Our analysis confirms that popular datasets, such as LAION, exhibit a long-tailed concept distribution, yielding biased performance in VLMs. We also find that downstream applications of VLMs, including visual chatbots (e.g., GPT-4V) and text-to-image models (e.g., Stable Diffusion), often fail to recognize or generate images of rare concepts identified by our method. To mitigate the imbalanced performance of zero-shot VLMs, we propose REtrieval-Augmented Learning (REAL). First, instead of prompting VLMs using the original class names, REAL uses their most frequent synonyms found in pretraining texts. This simple change already outperforms costly human-engineered and LLM-enriched prompts over nine benchmark datasets. Second, REAL trains a linear classifier on a small yet balanced set of pretraining data retrieved using concept synonyms. REAL surpasses the previous zero-shot SOTA, using 400× less storage and 10,000× less training time!

Measuring Concept Frequency in Pretraining Data


We use LLMs such as ChatGPT to help count texts relevant to the concept of interest, as visually illustrated above for the concept of "tiger".
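The counting step can be sketched as follows. The synonym list and captions below are hypothetical stand-ins for the LLM-generated synonyms and LAION captions, and plain substring matching is a simplification of the paper's LLM-aided filtering of irrelevant texts.

```python
import re

# Hypothetical synonyms an LLM might return for the concept "tiger".
SYNONYMS = {"tiger": ["tiger", "panthera tigris", "bengal tiger"]}

def concept_frequency(concept, captions, synonyms=SYNONYMS):
    """Count pretraining texts that mention any synonym of the concept.

    Whole-word matching is a simplification; the method additionally uses
    an LLM to filter captions where the word carries a different sense
    (e.g., "Tiger Woods" for the animal "tiger").
    """
    patterns = [re.compile(r"\b" + re.escape(s) + r"\b")
                for s in synonyms[concept]]
    return sum(
        1 for cap in captions
        if any(p.search(cap.lower()) for p in patterns)
    )

captions = [
    "a bengal tiger resting in the grass",
    "tiger walking through the jungle",
    "a cute cat on a sofa",
]
print(concept_frequency("tiger", captions))  # 2
```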

Our Findings

VLMs show imbalanced performance due to a long-tailed concept distribution

Left: By analyzing the texts of large-scale pretraining datasets such as LAION-400M (blue) and LAION-2B (orange), we show that these pretraining datasets exhibit a long-tailed distribution for visual concepts defined in a variety of downstream tasks.

Right: We show that for zero-shot recognition, OpenCLIP models trained on LAION-400M (blue) and LAION-2B (orange) respectively yield per-class accuracies that strongly correlate with the long-tailed distribution of concept frequencies (binned on a log scale). Interestingly, other VLMs such as CLIP (red) and MetaCLIP (green), trained on private data, also show similarly imbalanced performance.
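The log-scale binning used in this analysis can be sketched as follows; the per-class frequencies and accuracies are hypothetical numbers standing in for the measured LAION caption counts and OpenCLIP accuracies.

```python
import math
from collections import defaultdict

def accuracy_by_log_freq_bin(freqs, accs):
    """Group per-class accuracies by order of magnitude of concept
    frequency and average within each bin."""
    bins = defaultdict(list)
    for f, a in zip(freqs, accs):
        bins[int(math.log10(max(f, 1)))].append(a)
    return {b: sum(v) / len(v) for b, v in sorted(bins.items())}

# Hypothetical per-class caption counts and zero-shot accuracies.
freqs = [8, 120, 950, 15_000, 2_000_000]
accs  = [0.05, 0.30, 0.45, 0.70, 0.90]
print(accuracy_by_log_freq_bin(freqs, accs))
```

Plotting mean accuracy per bin against the bin's frequency range reproduces the correlation described above: head concepts score high, tail concepts score low.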

Multi-Modal Systems Fail on Rare Concepts

Our analysis of LAION-400M and LAION-2B helps us identify visual concepts that are under-represented in the pretraining datasets of vision-language models. Further, we show that state-of-the-art multi-modal systems, including GPT-4V, LLaVA, DALL-E 3, and SD-XL, all fail to recognize or generate these rare concepts. More examples are shown in our paper.


To mitigate the imbalanced performance of VLMs, we propose a novel prompting solution (REAL-Prompt) and a retrieval-augmented strategy (REAL-Linear).


Showcasing the effectiveness of REAL-Prompt.

REAL-Prompt replaces the given concept names with their most frequent synonyms (in the pretraining data of VLMs) and constructs prompts for zero-shot recognition. We display some concepts from ImageNet alongside their most frequent synonyms, their frequencies in LAION-400M, and their per-class accuracies. This simple change in prompts significantly improves zero-shot recognition.
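A minimal sketch of REAL-Prompt is below; the synonym-frequency table is hypothetical (the real counts come from matching synonyms against LAION captions), and the prompt template is one standard CLIP-style template.

```python
# Hypothetical synonym frequencies; in REAL these are counted
# from the VLM's pretraining captions (e.g., LAION-400M).
FREQ = {
    "night snake": {"night snake": 12, "hypsiglena": 3},
    "rock beauty": {"rock beauty": 40, "rock beauty fish": 85},
}

def real_prompt(class_name, template="a photo of a {}."):
    """Build a zero-shot prompt using the concept's most frequent
    synonym in the pretraining texts."""
    synonyms = FREQ[class_name]
    best = max(synonyms, key=synonyms.get)
    return template.format(best)

print(real_prompt("rock beauty"))  # a photo of a rock beauty fish.
print(real_prompt("night snake"))  # a photo of a night snake.
```

When the original name is already the most frequent synonym, the prompt is unchanged, so the method can only help or match the standard prompt.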

We show some tailed concepts for which DALL-E 3 and SD-XL fail to generate correct images when prompted with the original concept name. However, REAL-Prompt, which constructs prompts using the most frequent synonym, helps produce correct images.


Image: DALL-E 3 and SD-XL generations using original concept names vs. REAL-Prompt synonyms.

To further improve the zero-shot recognition performance of VLMs, we propose REAL-Linear, a lightweight yet powerful retrieval-augmented solution. REAL-Linear uses all synonyms of the given concepts to retrieve a class-balanced subset of pretraining images (e.g., 500 images per class from LAION-400M), then trains a linear classifier on their features. These steps are illustrated above.
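The two steps can be sketched as follows. The toy captions, synonyms, and Gaussian features are hypothetical stand-ins for LAION texts and frozen CLIP image embeddings; the real method retrieves around 500 images per class and fits the linear probe on their CLIP features.

```python
import numpy as np

def retrieve_balanced(captions, synonyms, per_class=2):
    """Retrieve up to `per_class` caption indices per concept,
    matching any of the concept's synonyms (class-balanced subset)."""
    subset = {}
    for cls, syns in synonyms.items():
        hits = [i for i, c in enumerate(captions)
                if any(s in c.lower() for s in syns)]
        subset[cls] = hits[:per_class]
    return subset

def train_linear_probe(X, y, num_classes, lr=0.5, steps=200):
    """Multinomial logistic regression on frozen image features,
    trained by full-batch gradient descent."""
    W = np.zeros((X.shape[1], num_classes))
    onehot = np.eye(num_classes)[y]
    for _ in range(steps):
        logits = X @ W
        p = np.exp(logits - logits.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)          # softmax
        W -= lr * X.T @ (p - onehot) / len(X)      # gradient step
    return W

# Toy retrieval: captions stand in for LAION texts.
captions = ["a tiger in the jungle", "bengal tiger portrait",
            "a night snake on a rock", "sunset over the sea"]
synonyms = {"tiger": ["tiger"], "night snake": ["night snake"]}
subset = retrieve_balanced(captions, synonyms)

# Toy training: two Gaussian clusters stand in for CLIP embeddings.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([2, 0], 0.1, (20, 2)),
               rng.normal([0, 2], 0.1, (20, 2))])
y = np.array([0] * 20 + [1] * 20)
W = train_linear_probe(X, y, num_classes=2)
pred = (X @ W).argmax(axis=1)
```

Because the probe is a single linear layer over cached features, the retrieved subset and training run stay tiny, which is where the storage and compute savings over full retrieval-augmented finetuning come from.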

Benchmarking REAL

Within the zero-shot prompting paradigm, our REAL-Prompt outperforms existing prompting approaches such as DCLIP and CuPL. Next, we show that our REAL-Linear rivals the recent retrieval-augmented method REACT. The best numbers are highlighted in bold; the second best are underlined.

Table: benchmark results comparing REAL-Prompt and REAL-Linear with prior prompting and retrieval-augmented methods.

We show that, for ImageNet, REAL-Linear uses only 5% of the retrieved images and 1% of the compute compared to the recent state-of-the-art REACT.

Table: retrieval and compute costs of REAL-Linear vs. REACT on ImageNet.


@article{parashar2024neglected,
        title={The Neglected Tails of Vision-Language Models},
        author={Parashar, Shubham and Lin, Zhiqiu and Liu, Tian and Dong, Xiangjue and Li, Yanan and Ramanan, Deva and Caverlee, James and Kong, Shu},
        journal={arXiv preprint arXiv:2401.12425},
        year={2024}
}