Soumitri Chattopadhyay

I am a 3rd-year Ph.D. student in the Department of Computer Science and Engineering at UC San Diego. I am part of the Biomedical Image Analysis Group, advised by Prof. Marc Niethammer. Currently, I work on multimodal learning and foundational models for 3D medical image analysis.

My research interests broadly encompass multimodal learning (Vision + X), medical image segmentation, and fine-grained representation learning. I transferred to UCSD from UNC Chapel Hill in Fall '25, where I was previously a CS Ph.D. student. In Summer '24, I was a research intern at SRI International, working with Anirban Roy on multimodal diffusion models for decoding fMRI signals into visual content. I also collaborate with Prof. Praneeth Chakravarthula on unified image restoration and with Prof. Josep Lladós on multimodal document understanding.

I have also been fortunate to collaborate with several research groups and academic institutions during my undergraduate years: Prof. Umapada Pal and Prof. Saumik Bhattacharya at the CVPR Unit, Indian Statistical Institute, Kolkata; Prof. Jose Dolz at ETS Montreal; and Prof. Yi-Zhe Song at the SketchX Lab, CVSSP, University of Surrey.

I have published at top-tier computer vision and speech/signal processing conferences such as CVPR, BMVC, ICIP, ICASSP, and INTERSPEECH.

Email  /  Google Scholar  /  Blogs  /  GitHub  /  LinkedIn

Note: I am actively looking for a research internship position for Summer 2026. Please feel free to contact me if you have an opening!

News
  • New!! [Sept '25]  Transferred to UC San Diego to continue my PhD at the Computer Science and Engineering department!
  • New!! [Jul '24]  One paper accepted at BMVC 2024 as an Oral presentation!
  • [May '24]  Joined SRI International as a research intern! Working with Anirban Roy on multimodal diffusion models for brain decoding.
  • [Apr '24]  Received summer internship offers from SRI International and Amazon Science!
  • [Aug '23]  One paper accepted at BMVC 2023!
  • [Aug '23]  Started my PhD in Computer Science at UNC Chapel Hill!
Recent Preprints (* denotes equal contribution)
Unpaired Multimodal Medical Image Segmentation via Cross-Modal Prompt-driven 3D Foundational Models

Soumitri Chattopadhyay, Basar Demir, Marc Niethammer

[In Submission]

We address the practical challenge of cross-modal 3D medical image segmentation under extreme annotation constraints, assuming only a handful of labeled examples in a single source modality and none in the target modality. The support and query volumes are unpaired and may originate from distinct patients or imaging protocols. To tackle this, we propose a training-free, plug-and-play framework that leverages general-purpose 3D foundation models for segmentation. Our approach explores two complementary strategies for cross-modal prompting: (i) deformable spatial alignment via off-the-shelf multimodal registration, and (ii) semantic-aware feature similarity computation between cross-modal support-query volumes, transformed into soft prompts in logit space. These dense or sparse prompts condition frozen foundational segmentation backbones (such as SAM-Med3D), requiring no additional training or fine-tuning. We further extend the framework with single-image test-time adaptation, refining prompts in a self-supervised manner to improve generalization. We validate our methods on three publicly available 3D abdominal organ segmentation datasets encompassing both CT and MR modalities and, by suitably designing cross-modal settings, show that our methods are promising and perform competitively with oracle upper bounds.

UniCoRN: Latent Diffusion-based Unified Controllable Image Restoration Network across Multiple Degradations

Debabrata Mandal, Soumitri Chattopadhyay, Guansen Tong, Praneeth Chakravarthula

Project Page

Image restoration is essential for enhancing degraded images across computer vision tasks. However, most existing methods address only a single type of degradation (e.g., blur, noise, or haze) at a time, limiting their real-world applicability where multiple degradations often occur simultaneously. In this paper, we propose UniCoRN, a unified image restoration approach capable of handling multiple degradation types simultaneously using a multi-head diffusion model. Specifically, we uncover the potential of low-level visual cues extracted from images to guide a controllable diffusion model for real-world image restoration, and we design a multi-head control network adaptable via a mixture-of-experts strategy. We train our model without any prior assumption of specific degradations, through a carefully designed curriculum learning recipe. We further introduce MetaRestore, a metalens imaging benchmark containing images with multiple degradations and artifacts. Extensive evaluations on several challenging datasets, including our benchmark, demonstrate that our method achieves significant performance gains and can robustly restore images with severe degradations.

Zero-shot Domain Generalization of Foundational Models for 3D Medical Image Segmentation: An Experimental Study

Soumitri Chattopadhyay, Basar Demir, Marc Niethammer

Domain shift, caused by variations in imaging modalities and acquisition protocols, limits model generalization in medical image segmentation. While foundation models (FMs) trained on diverse large-scale data hold promise for zero-shot generalization, their application to volumetric medical data remains underexplored. In this study, we examine their zero-shot domain generalization (DG) ability by conducting a comprehensive experimental study encompassing 6 medical segmentation FMs and 12 public datasets spanning multiple modalities and anatomies. Our findings reveal the potential of promptable FMs in bridging the domain gap via smart prompting techniques. Additionally, by probing into multiple facets of zero-shot DG, we offer valuable insights into the viability of foundation models for generalizing across domains and identify promising avenues for future research.

Downstream Analysis of Foundational Medical Vision Models for Disease Progression

Basar Demir, Soumitri Chattopadhyay, Thomas Hastings Greer, Boqi Chen, Marc Niethammer

Medical vision foundational models are used for a wide variety of tasks, including medical image segmentation and registration. This work evaluates the ability of these models to predict disease progression using a simple linear probe. We hypothesize that intermediate layer features of segmentation models capture structural information, while those of registration models encode knowledge of change over time. Beyond demonstrating that these features are useful for disease progression prediction, we also show that registration model features do not require spatially aligned input images. However, for segmentation models, spatial alignment is essential for optimal performance. Our findings highlight the importance of spatial alignment and the utility of foundation model features for image registration.
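
For the curious, here is a minimal sketch of the linear-probe setup described above, assuming a frozen feature extractor and scikit-learn; the backbone and data arguments are hypothetical placeholders, not the paper's actual code:

    import torch
    from sklearn.linear_model import LogisticRegression

    def linear_probe_accuracy(backbone, train_x, train_y, test_x, test_y):
        """Fit a linear classifier on frozen intermediate features (illustrative only)."""
        @torch.no_grad()
        def extract(volumes):
            backbone.eval()
            feats = backbone(volumes)                      # intermediate-layer activations
            return feats.flatten(start_dim=1).cpu().numpy()

        probe = LogisticRegression(max_iter=1000)          # the "simple linear probe"
        probe.fit(extract(train_x), train_y)               # labels: disease progression
        return probe.score(extract(test_x), test_y)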


Selected Publications (* denotes equal contribution)
2024
Towards Generative Class Prompt Learning for Fine-grained Visual Recognition

Soumitri Chattopadhyay, Sanket Biswas, Emanuele Vivoli, Josep Lladós
British Machine Vision Conference (BMVC), 2024 (Oral presentation)

arXiv / Code / Project Page

Although foundational vision-language models (VLMs) have proven to be very successful for various semantic discrimination tasks, they still struggle to perform faithfully for fine-grained categorization. Moreover, foundational models trained on one domain do not generalize well to a different domain without fine-tuning. We attribute these to the limitations of the VLM's semantic representations and attempt to improve their fine-grained visual awareness using generative modeling. Specifically, we propose two novel methods: Generative Class Prompt Learning (GCPL) and Contrastive Multi-class Prompt Learning (CoMPLe). Utilizing text-to-image diffusion models, GCPL significantly improves the visio-linguistic synergy in class embeddings by conditioning on few-shot exemplars with learnable class prompts. CoMPLe builds on this foundation by introducing a contrastive learning component that encourages inter-class separation during the generative optimization process. Our empirical results demonstrate that such a generative class prompt learning approach substantially outperforms existing methods, offering a better alternative to few-shot image recognition challenges.

2023
Active Learning for Fine-Grained Sketch-Based Image Retrieval

Himanshu Thakur*, Soumitri Chattopadhyay*
British Machine Vision Conference (BMVC), 2023

arXiv / Poster / Video

The ability to retrieve a photo by mere free-hand sketching highlights the immense potential of fine-grained sketch-based image retrieval (FG-SBIR). However, its rapid practical adoption, as well as scalability, is limited by the expense of acquiring faithful sketches for easily available photo counterparts. A solution to this problem is active learning, which could minimise the need for labeled sketches while maximising performance. Despite extensive studies in the field, there exists no work that utilises it to reduce sketching effort in FG-SBIR tasks. To this end, we propose a novel active learning sampling technique that drastically minimises the need to draw photo sketches. Our proposed approach tackles the trade-off between uncertainty and diversity by utilising the relationship between existing photo-sketch pairs and a photo that does not yet have a sketch, and by augmenting this relation with its intermediate representations. Since our approach relies only on the underlying data distribution, it is agnostic to the modelling approach and hence is applicable to other cross-modal instance-level retrieval tasks as well. Through experiments on two publicly available fine-grained SBIR datasets, ChairV2 and ShoeV2, we validate our approach and reveal its superiority over adapted baselines.

Exploring Self-Supervised Representation Learning For Low-Resource Medical Image Analysis

Soumitri Chattopadhyay, Soham Ganguly*, Sreejit Chaudhury*, Sayan Nag*, Samiran Chattopadhyay
IEEE International Conference on Image Processing (ICIP), 2023

arXiv / Code / Video

The success of self-supervised learning (SSL) has mostly been attributed to the availability of unlabeled yet large-scale datasets. However, in a specialized domain such as medical imaging, which differs substantially from natural images, the assumption of data availability is unrealistic and impractical, as the data itself is scarce and found in small databases collected for specific prognosis tasks. To this end, we investigate the applicability of self-supervised learning algorithms on small-scale medical imaging datasets. In particular, we evaluate 4 state-of-the-art SSL methods on three publicly accessible small medical imaging datasets. Our investigation reveals that in-domain low-resource SSL pre-training can yield competitive performance to transfer learning from large-scale datasets (such as ImageNet). Furthermore, we extensively analyse our empirical findings to provide valuable insights that can motivate further research toward circumventing the need for pre-training on a large image corpus. To the best of our knowledge, this is the first attempt to holistically explore self-supervision on low-resource medical datasets.

BeAts: Bengali Speech Acts Recognition using Multimodal Attention Fusion

Ahana Deb*, Sayan Nag*, Ayan Mahapatra*, Soumitri Chattopadhyay*, Aritra Marik*, Pijush Kanti Gayen, Shankha Sanyal, Archi Banerjee, Samir Karmakar
INTERSPEECH 2023 (Oral presentation)

arXiv / Project Page

Spoken languages often utilise intonation, rhythm, intensity, and structure to communicate intention, and these cues can be interpreted differently depending on how an utterance is delivered. These speech acts provide the foundation of communication and are unique in expression to the language. Recent advancements in attention-based models, which have demonstrated the ability to learn powerful representations from multilingual datasets, have performed well in speech tasks and are well suited to modelling specific tasks in low-resource languages. Here, we develop a novel multimodal approach combining two models, wav2vec2.0 for audio and MarianMT for text translation, using multimodal attention fusion to predict speech acts in our prepared Bengali speech corpus. We also show that our model BeAts (Bengali speech acts recognition using Multimodal Attention Fusion) significantly outperforms both a unimodal baseline using only speech data and a simpler bimodal fusion using both speech and text data.

DeCAtt: Efficient Vision Transformers with Decorrelated Attention Heads

Mayukh Bhattacharyya*, Soumitri Chattopadhyay*, Sayan Nag*
CVPR 2023 Workshop on Efficient Deep Learning for Computer Vision
(Oral presentation)

Paper / Code / Slides

The advent of Vision Transformers (ViT) has led to significant performance gains across various computer vision tasks over the last few years, surpassing the de facto standard CNN architectures. However, most of the prominent variants of Vision Transformers are resource-intensive architectures with huge parameter counts. They are known to be data-hungry and to overfit quickly on comparatively smaller datasets. Consequently, this holds back their widespread usage in low-resource settings, which brings forth the need to develop resource-efficient vision transformers. To this end, we introduce a regularization loss that prioritizes efficient utilization of model parameters by decorrelating the heads of a multi-headed attention block in a vision transformer. This forces the heads to learn distinct features rather than focus on the same ones. As we show in our experiments, this loss provides consistent performance improvements across a wide range of models and datasets, demonstrating its effectiveness.
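
A rough sketch of the decorrelation idea (purely illustrative; the tensor layout, normalization, and weighting are assumptions rather than the paper's exact formulation):

    import torch
    import torch.nn.functional as F

    def head_decorrelation_loss(head_outputs):
        """Penalize similarity between attention heads so they learn distinct features.

        head_outputs: (batch, num_heads, tokens, dim) -- assumed layout.
        """
        b, h, t, d = head_outputs.shape
        flat = F.normalize(head_outputs.reshape(b, h, t * d), dim=-1)
        sim = torch.einsum('bhx,bgx->bhg', flat, flat)      # pairwise head similarity
        off_diag = sim - torch.eye(h, device=sim.device)    # ignore self-similarity
        return (off_diag ** 2).mean()

    # total_loss = task_loss + reg_weight * head_decorrelation_loss(heads)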

Exploiting Unlabelled Photos for Stronger Fine-Grained SBIR

Aneeshan Sain, Ayan Kumar Bhunia, Subhadeep Koley, Pinaki Nath Chowdhury, Soumitri Chattopadhyay, Tao Xiang, Yi-Zhe Song
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023

arXiv / Project Page / Video

This paper advances the fine-grained sketch-based image retrieval (FG-SBIR) literature by putting forward a strong baseline that overshoots prior state-of-the-art by ~11%. This is not achieved via complicated design, but by addressing two critical issues facing the community: (i) the gold-standard triplet loss does not enforce holistic latent space geometry, and (ii) there are never enough sketches to train a high-accuracy model. For the former, we propose a simple modification to the standard triplet loss that explicitly enforces separation amongst photo/sketch instances. For the latter, we put forward a novel knowledge distillation module that can leverage photo data for model training. Both modules are then plugged into a novel plug-n-playable training paradigm that allows for more stable training. More specifically, for (i) we employ an intra-modal triplet loss amongst sketches to bring sketches of the same instance closer to each other and away from others, and another amongst photos to push away different photo instances while bringing closer a structurally augmented version of the same photo (offering a gain of 4-6%). To tackle (ii), we first pre-train a teacher on the large set of unlabelled photos using the aforementioned intra-modal photo triplet loss. Then we distill the contextual similarity present amongst the instances in the teacher's embedding space into the student's embedding space, by matching the distribution over inter-feature distances of respective samples in both embedding spaces (delivering a further gain of 4-5%). Apart from outperforming prior art significantly, our model also yields satisfactory results when generalising to new classes.

IDEAL: Improved DEnse locAL Contrastive Learning for Semi-Supervised Medical Image Segmentation

Hritam Basak, Soumitri Chattopadhyay*, Rohit Kundu*, Sayan Nag*, Rammohan Mallipeddi
IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP), 2023

arXiv / Code / Project Page

Due to the scarcity of labeled data, contrastive self-supervised learning (SSL) frameworks have lately shown great potential in several medical image analysis tasks. However, existing contrastive mechanisms are sub-optimal for dense pixel-level segmentation tasks due to their inability to mine local features. To this end, we extend the concept of metric learning to the segmentation task, using dense (dis)similarity learning to pre-train a deep encoder network and a semi-supervised paradigm to fine-tune it for the downstream task. Specifically, we propose a simple convolutional projection head for obtaining dense pixel-level features, and a new contrastive loss that utilizes these dense projections, thereby improving the local representations. A bidirectional consistency regularization mechanism involving two-stream model training is devised for the downstream task. Our IDEAL method outperforms state-of-the-art methods by clear margins on cardiac MRI segmentation.
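
A simplified sketch of a dense projection head with a per-pixel contrastive loss, in the spirit of the description above (the InfoNCE form, 1x1-conv head, and shapes are generic stand-ins, not the paper's exact design):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class DenseProjectionHead(nn.Module):
        """1x1 convolutional head that keeps spatial resolution for pixel-level features."""
        def __init__(self, in_ch, proj_ch=128):
            super().__init__()
            self.proj = nn.Sequential(
                nn.Conv2d(in_ch, in_ch, kernel_size=1), nn.ReLU(inplace=True),
                nn.Conv2d(in_ch, proj_ch, kernel_size=1),
            )

        def forward(self, feat_map):                        # (B, C, H, W)
            return F.normalize(self.proj(feat_map), dim=1)  # unit-norm per-pixel embeddings

    def dense_contrastive_loss(z1, z2, temperature=0.1):
        """Treat matching pixel locations across two views as positives (illustrative)."""
        b, d, h, w = z1.shape
        z1 = z1.permute(0, 2, 3, 1).reshape(-1, d)          # (B*H*W, D)
        z2 = z2.permute(0, 2, 3, 1).reshape(-1, d)
        logits = z1 @ z2.t() / temperature
        targets = torch.arange(z1.size(0), device=z1.device)
        return F.cross_entropy(logits, targets)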

2022
SWIS: Self-Supervised Representation Learning For Writer Independent Offline Signature Verification

Siladittya Manna, Soumitri Chattopadhyay, Saumik Bhattacharya, Umapada Pal
IEEE International Conference on Image Processing (ICIP), 2022
(Oral presentation)

arXiv / Code / Slides

Writer-independent offline signature verification is one of the most challenging tasks in pattern recognition, as there is often a scarcity of training data. To handle this data scarcity problem, we propose a novel self-supervised learning (SSL) framework for writer-independent offline signature verification. To our knowledge, this is the first attempt to utilize a self-supervised setting for the signature verification task. Self-supervised representation learning from signature images is achieved by minimizing the cross-covariance between two random variables belonging to different feature directions while ensuring a positive cross-covariance between random variables denoting the same feature direction. This ensures that the features are linearly decorrelated and redundant information is discarded. Experiments on different datasets yield encouraging results.
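
An illustrative sketch of the cross-covariance objective described above (redundancy-reduction style; the normalization and off-diagonal weight are assumptions, not the paper's exact loss):

    import torch

    def cross_covariance_loss(z1, z2, off_diag_weight=5e-3):
        """z1, z2: (batch, dim) embeddings of two views of the same signature image."""
        z1 = (z1 - z1.mean(0)) / (z1.std(0) + 1e-6)
        z2 = (z2 - z2.mean(0)) / (z2.std(0) + 1e-6)
        c = (z1.t() @ z2) / z1.size(0)                      # cross-covariance matrix
        on_diag = ((torch.diagonal(c) - 1) ** 2).sum()      # same feature direction -> positive
        off_diag = (c - torch.diag(torch.diagonal(c))).pow(2).sum()  # other directions -> 0
        return on_diag + off_diag_weight * off_diag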

SURDS: Self-Supervised Attention-guided Reconstruction and Dual Triplet Loss for Writer Independent Offline Signature Verification

Soumitri Chattopadhyay, Siladittya Manna, Saumik Bhattacharya, Umapada Pal
International Conference on Pattern Recognition (ICPR), 2022

arXiv / Code / Video

Offline Signature Verification (OSV) is a fundamental biometric task across various forensic, commercial, and legal applications. The underlying task is to carefully model fine-grained features of signatures to distinguish between genuine and forged ones, which differ only in minute deformities. This makes OSV more challenging than other verification problems. In this work, we propose a two-stage deep learning framework that leverages self-supervised representation learning as well as metric learning for writer-independent OSV. First, we train an image reconstruction network on signature image patches, using an encoder-decoder architecture augmented by a 2D spatial attention mechanism. Next, the trained encoder backbone is fine-tuned with a projector head using a supervised metric learning framework, whose objective is to optimize a novel dual triplet loss by sampling negatives both from within the same writer class and from other writers. The intuition is to ensure that a signature sample lies closer to its positive counterpart than to negative samples from both intra-writer and cross-writer sets, resulting in robust discriminative learning of the embedding space. To the best of our knowledge, this is the first work to use self-supervised learning frameworks for OSV. The proposed two-stage framework has been evaluated on two publicly available offline signature datasets and compared with various state-of-the-art methods, providing promising results that outperform several existing works.
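
A hedged sketch of the dual triplet idea (the margin, distance metric, and how the two terms combine are illustrative assumptions, not the paper's implementation):

    import torch.nn.functional as F

    def dual_triplet_loss(anchor, positive, neg_intra_writer, neg_cross_writer, margin=0.2):
        """Pull the anchor toward its positive and away from negatives sampled both
        from the same writer and from other writers (illustrative)."""
        d_pos = F.pairwise_distance(anchor, positive)
        loss_intra = F.relu(d_pos - F.pairwise_distance(anchor, neg_intra_writer) + margin)
        loss_cross = F.relu(d_pos - F.pairwise_distance(anchor, neg_cross_writer) + margin)
        return (loss_intra + loss_cross).mean()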

Academic Services
  • Conference Reviewing:
    - CVPR 2025, ICCV 2025, ACM MM 2025, WACV 2026
    - CVPR Workshops: ECV @ CVPR'23,'24; T4V @ CVPR'24
    - ICCV Workshops: WiCV @ ICCV'23
    - ICPR 2022, 2024

  • Journal Reviewing:
    - IEEE Access ('23-Present)
    - Cluster Computing ('24-Present)
    - Engineering Applications of Artificial Intelligence, Elsevier ('23-Present)
    - Computers in Biology and Medicine, Elsevier ('21-Present)
    - International Journal of Intelligent Systems, Wiley ('22-Present)
    - Informatics in Medicine Unlocked, Elsevier ('22-Present)
© Soumitri Chattopadhyay | Last updated: September 2025 | Thanks to Jon Barron for sharing this awesome template!