The Neglected Tails of Vision Language Models

1Texas A&M University
2Carnegie Mellon University
3University of Macau
4Zhejiang Lab

*Indicates Equal Contribution

CVPR 2024

Abstract

Vision-language models (VLMs) excel in zero-shot recognition, but their performance varies greatly across different visual concepts. For example, although CLIP achieves impressive accuracy on ImageNet (60-80%), its performance drops below 10% for more than ten concepts like night snake, presumably due to their limited presence in the pretraining data. However, measuring the frequency of concepts in VLMs’ large-scale pretraining datasets is challenging. We address this by using large language models (LLMs) to count the number of pretraining texts that contain synonyms of these concepts. Our analysis confirms that popular datasets, such as LAION, exhibit a long-tailed concept distribution, yielding biased performance in VLMs. We also find that downstream applications of VLMs, including visual chatbots (e.g., GPT-4V) and text-to-image models (e.g., Stable Diffusion), often fail to recognize or generate images of rare concepts identified by our method. To mitigate the imbalanced performance of zero-shot VLMs, we propose REtrieval-Augmented Learning (REAL). First, instead of prompting VLMs with the original class names, REAL uses their most frequent synonyms found in the pretraining texts. This simple change already outperforms costly human-engineered and LLM-enriched prompts across nine benchmark datasets. Second, REAL trains a linear classifier on a small yet balanced set of pretraining data retrieved using concept synonyms. REAL surpasses the previous zero-shot SOTA while using 400× less storage and 10,000× less training time!

Measuring Concept Frequency in Pretraining Data


We use LLMs such as ChatGPT to help count texts relevant to the concept of interest, as visually illustrated above for the concept of "tiger".
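Below is a minimal sketch of this counting step, assuming the pretraining captions are available as a local text file (laion400m_captions.txt is a placeholder path) and that an LLM has already supplied the synonym list; the full pipeline additionally relies on the LLM to judge whether a matched caption is actually relevant to the visual concept.

import re

concept = "tiger"
# In the actual pipeline, an LLM such as ChatGPT proposes the synonyms and helps
# resolve ambiguous senses; the list below is an illustrative placeholder.
synonyms = ["tiger", "panthera tigris", "bengal tiger", "siberian tiger"]

# One regex with word boundaries, so "tiger" does not match "tigerlily".
pattern = re.compile(
    r"\b(" + "|".join(re.escape(s) for s in synonyms) + r")\b",
    re.IGNORECASE,
)

count = 0
with open("laion400m_captions.txt", encoding="utf-8") as f:  # placeholder path
    for caption in f:
        if pattern.search(caption):
            count += 1

print(f"Estimated pretraining frequency of '{concept}': {count}")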

Our Findings

VLMs show imbalanced performance due to a long-tailed concept distribution

Left: By analyzing the texts of large-scale pretraining datasets such as LAION-400M (blue) and LAION-2B (orange), we show that these pretraining datasets exhibit a long-tailed distribution for visual concepts defined in a variety of downstream tasks.

Right: We show that for zero-shot recognition, OpenCLIP models trained on LAION-400M (blue) and LAION-2B (orange), respectively, yield per-class accuracies that strongly correlate with the long-tailed distribution of concept frequencies (binned on a log scale). Interestingly, other VLMs such as CLIP (red) and MetaCLIP (green), trained on private data, also show similarly imbalanced performance.
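The binned accuracy-versus-frequency analysis amounts to a few lines of NumPy, as in the sketch below; the arrays are synthetic placeholders standing in for the estimated concept frequencies and the measured per-class zero-shot accuracies.

import numpy as np

rng = np.random.default_rng(0)
# Synthetic placeholders: real values come from the frequency estimation above
# and from evaluating a CLIP/OpenCLIP model on each downstream class.
freqs = rng.lognormal(mean=6.0, sigma=2.0, size=1000)                  # texts per concept
accs = np.clip(0.1 * np.log(freqs) + rng.normal(0, 0.15, size=1000), 0.0, 1.0)

# Bin concepts by frequency on a log scale and report mean accuracy per bin.
edges = np.logspace(np.log10(freqs.min()), np.log10(freqs.max()), num=9)
bin_ids = np.digitize(freqs, edges[1:-1])
for b in range(len(edges) - 1):
    mask = bin_ids == b
    if mask.any():
        print(f"freq bin {b}: {mask.sum():4d} concepts, "
              f"mean accuracy {accs[mask].mean():.1%}")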

Multi-Modal Systems Fail on Rare Concepts

Our analysis of LAION-400M and LAION-2B helps us identify visual concepts that are under-represented in the pretraining datasets of vision-language models. Further, we show that state-of-the-art multi-modal systems, including GPT-4V, LLaVA, DALL-E 3, and SD-XL, all fail to recognize or generate these rare concepts. More examples are shown in our paper.

Solutions

To mitigate the imbalanced performance of VLMs, we propose a novel prompting solution (REAL-Prompt) and a retrieval-augmented strategy (REAL-Linear).

REAL-Prompt

Showcasing the effectiveness of REAL-Prompt.

REAL-Prompt replaces the given concept names with their most frequent synonyms (in the pretraining data of VLMs) and constructs prompts for zero-shot recognition. We display some concepts from ImageNet, their most frequent synonyms, their frequencies in LAION-400M, and their per-class accuracies. Clearly, this simple change in prompts significantly improves zero-shot recognition.
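A minimal sketch of REAL-Prompt-style zero-shot classification with OpenCLIP is shown below; the synonym mapping is an illustrative placeholder, since the real mapping comes from the pretraining-text frequency counts described above.

import torch
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion400m_e32"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

# Original class name -> most frequent synonym in the pretraining captions.
# These entries are illustrative placeholders, not the paper's actual mapping.
most_frequent_synonym = {
    "cuirass": "breastplate",
    "night snake": "night snake",
}

prompts = [f"a photo of a {syn}" for syn in most_frequent_synonym.values()]
with torch.no_grad():
    text_feats = model.encode_text(tokenizer(prompts))
    text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)

# `text_feats` acts as the zero-shot classifier: encode an image with
# model.encode_image(preprocess(img).unsqueeze(0)), normalize it, and take the
# argmax of its cosine similarities against `text_feats`.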

We show some tailed concepts for which DALL-E 3 and SD-XL fail to generate correct images when prompted with the original concept name. However, REAL-Prompt, which constructs prompts with the most frequent synonym, helps produce correct images.
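As a sketch of the same idea for text-to-image generation (an assumed usage pattern, not the authors' exact setup), the snippet below swaps a rare concept name for a placeholder "most frequent synonym" before calling SD-XL through the diffusers library.

import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

concept = "cuirass"        # original (rare) concept name
synonym = "breastplate"    # placeholder synonym from the frequency analysis

# Prompt with the frequent synonym instead of the original concept name.
image = pipe(prompt=f"a photo of a {synonym}").images[0]
image.save("real_prompt_sdxl.png")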

REAL-Linear


To further improve the zero-shot recognition performance of VLMs, we propose REAL-Linear, a lightweight yet powerful retrieval-augmented solution. REAL-Linear uses all synonyms of the given concepts to retrieve a class-balanced subset of pretraining images (e.g., 500 images per class from LAION-400M) and then trains a linear classifier on the VLM's visual features of this retrieved data. These steps are illustrated in the figure above.
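The sketch below outlines a REAL-Linear-style pipeline under simplifying assumptions: the class-balanced images are assumed to be already retrieved (blank placeholder images stand in for them here), features come from a frozen OpenCLIP encoder, and a plain logistic-regression probe stands in for the paper's exact classifier recipe.

import numpy as np
import torch
import open_clip
from PIL import Image
from sklearn.linear_model import LogisticRegression

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion400m_e32"
)
model.eval()

# Placeholder for the retrieval step: in practice these are LAION images whose
# captions match each concept's synonyms (e.g., up to 500 per class).
images_per_class = {
    "night snake": [Image.new("RGB", (224, 224)) for _ in range(4)],
    "tiger": [Image.new("RGB", (224, 224)) for _ in range(4)],
}

def encode(images):
    # Extract L2-normalized image features from the frozen VLM encoder.
    batch = torch.stack([preprocess(im) for im in images])
    with torch.no_grad():
        feats = model.encode_image(batch)
    return (feats / feats.norm(dim=-1, keepdim=True)).numpy()

features, labels = [], []
for label, (name, images) in enumerate(images_per_class.items()):
    feats = encode(images)
    features.append(feats)
    labels.extend([label] * len(images))

# Linear probe over frozen VLM features on the retrieved, class-balanced subset.
classifier = LogisticRegression(max_iter=1000).fit(np.concatenate(features), labels)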

Benchmarking REAL

Within the zero-shot prompting paradigm, our REAL-Prompt outperforms existing prompting approaches such as DCLIP and CuPL. Next, we show that our REAL-Linear rivals the recent retrieval-augmented method REACT. The best numbers are highlighted in bold; the second best are underlined.


We show that on ImageNet, REAL-Linear uses only 5% of the retrieved images and 1% of the compute compared to the recent state-of-the-art REACT.


BibTeX

@article{parashar2024neglected,
  title={The Neglected Tails of Vision-Language Models},
  author={Parashar, Shubham and Lin, Zhiqiu and Liu, Tian and Dong, Xiangjue and Li, Yanan and Ramanan, Deva and Caverlee, James and Kong, Shu},
  journal={arXiv preprint arXiv:2401.12425},
  year={2024}
}