Towards Generative Class Prompt Learning for Fine-grained Visual Recognition

1 Department of Computer Science, University of North Carolina at Chapel Hill, USA · 2 Computer Vision Center & Computer Science Department, Universitat Autònoma de Barcelona, Spain · 3 MICC, University of Florence, Italy

BMVC 2024 (Oral)

Overview of our approach compared to existing VLM adaptation methods. (a) Zero-shot inference with CLIP; (b) Contextual prompt token learning; (c) Adapter-based tuning with handcrafted prompts on frozen CLIP representations; (d) Our setup (GCPL): generatively learning the [CLASS] token by prompting a frozen text-to-image latent diffusion model.

Abstract

Although foundational vision-language models (VLMs) have proven very successful for various semantic discrimination tasks, they still struggle to perform faithfully on fine-grained categorization. Moreover, foundational models trained on one domain do not generalize well to a different domain without fine-tuning. We attribute these shortcomings to the limitations of the VLM's semantic representations and attempt to improve its fine-grained visual awareness using generative modeling. Specifically, we propose two novel methods: Generative Class Prompt Learning (GCPL) and Contrastive Multi-class Prompt Learning (CoMPLe). Utilizing text-to-image diffusion models, GCPL significantly improves the visio-linguistic synergy in class embeddings by conditioning on few-shot exemplars with learnable class prompts. CoMPLe builds on this foundation by introducing a contrastive learning component that encourages inter-class separation during the generative optimization process. Our empirical results demonstrate that such a generative class prompt learning approach substantially outperforms existing methods, offering a better alternative for few-shot image recognition.
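To make the idea concrete, below is a minimal PyTorch-style sketch of one GCPL optimization step. It assumes a hypothetical wrapper frozen_ldm around a frozen text-to-image latent diffusion model (e.g. Stable Diffusion); the helper names encode_to_latents, add_noise, embed_prompt and predict_noise are illustrative placeholders, not the released code or any specific library API.

import torch
import torch.nn.functional as F

def gcpl_step(frozen_ldm, class_embed, images, optimizer):
    # Encode the few-shot exemplars with the frozen VAE encoder.
    latents = frozen_ldm.encode_to_latents(images)
    # Sample noise and a random timestep, then run the forward diffusion.
    noise = torch.randn_like(latents)
    t = torch.randint(0, frozen_ldm.num_timesteps, (latents.size(0),),
                      device=latents.device)
    noisy_latents = frozen_ldm.add_noise(latents, noise, t)
    # Condition the frozen U-Net on a prompt containing the learnable [CLASS] token.
    cond = frozen_ldm.embed_prompt("a photo of a [CLASS]", class_embed)
    noise_pred = frozen_ldm.predict_noise(noisy_latents, t, cond)
    # Standard LDM denoising objective; gradients flow only into class_embed.
    loss = F.mse_loss(noise_pred, noise)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# One learnable [CLASS] embedding per category (the dimension is illustrative);
# everything else in the diffusion model stays frozen.
class_embed = torch.nn.Parameter(0.02 * torch.randn(768))
optimizer = torch.optim.AdamW([class_embed], lr=1e-3)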

Proposed Methodology

Contrastive multi-class prompt learning (CoMPLe) framework. CoMPLe learns class prompts by optimizing the LDM loss over the trainable class token: the noise-reconstruction error is minimized for the ground-truth class while being maximized for the other classes. Red arrows denote "maximize" and blue arrows denote "minimize". At inference, the diffusion classifier uses our few-shot learned [CLASS] embeddings.
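A corresponding sketch of the CoMPLe objective, under the same hypothetical frozen_ldm interface as above, is shown below. It computes the per-sample denoising error under every learned [CLASS] prompt, minimizes the error for the ground-truth class, and pushes up the errors under the other class prompts; the weighting factor lam is an illustrative choice, not necessarily the paper's exact formulation.

import torch
import torch.nn.functional as F

def comple_loss(frozen_ldm, class_embeds, images, labels, lam=0.1):
    # Forward diffusion on the exemplar latents (same as in GCPL).
    latents = frozen_ldm.encode_to_latents(images)
    noise = torch.randn_like(latents)
    t = torch.randint(0, frozen_ldm.num_timesteps, (latents.size(0),),
                      device=latents.device)
    noisy_latents = frozen_ldm.add_noise(latents, noise, t)

    # Per-sample denoising error under each class's learnable [CLASS] prompt.
    per_class_errs = []
    for class_embed in class_embeds:
        cond = frozen_ldm.embed_prompt("a photo of a [CLASS]", class_embed)
        err = F.mse_loss(frozen_ldm.predict_noise(noisy_latents, t, cond),
                         noise, reduction="none")
        per_class_errs.append(err.flatten(1).mean(dim=1))
    errs = torch.stack(per_class_errs, dim=1)              # (batch, num_classes)

    # Minimize the ground-truth class error, maximize the mean error of the rest.
    pos = errs.gather(1, labels.unsqueeze(1)).squeeze(1)
    neg = (errs.sum(dim=1) - pos) / (errs.size(1) - 1)
    return (pos - lam * neg).mean()

At inference, the diffusion classifier scores a test image by its expected denoising error under each learned [CLASS] prompt (averaged over sampled timesteps and noise) and predicts the class with the lowest error.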
 

Results

Empirical comparison with prior state-of-the-art methods on few-shot image recognition via adapting vision-language models: (a) zero-shot, (b) adapter tuning, (c) prompt learning, and (d) our proposed GCPL and CoMPLe methods. We report the mean and standard deviation over 5 runs.
Ablation over different numbers of shots across various datasets.

BibTeX

@inproceedings{chattopadhyay2024towards,
    title={Towards Generative Class Prompt Learning for Fine-grained Visual Recognition},
    author={Chattopadhyay, Soumitri and Biswas, Sanket and Vivoli, Emanuele and Lladós, Josep},
    booktitle={British Machine Vision Conference (BMVC)},
    year={2024}
}

© Soumitri Chattopadhyay | Last updated: 22 November 2024 | Good artists copy, great artists steal.