Soumitri Chattopadhyay

I am a 3rd-year Ph.D. student in the Department of Computer Science and Engineering at UC San Diego. I am part of the Biomedical Image Analysis Group, advised by Prof. Marc Niethammer. Currently, I work on multimodal learning and foundational models for 3D medical image analysis.

My research interests broadly encompass multimodal learning (Vision + X), medical image segmentation, and fine-grained representation learning. I transferred to UCSD from UNC Chapel Hill in Fall '25, where I was previously a CS Ph.D. student. In Summer '24, I was a research intern at SRI International, working with Anirban Roy on multimodal diffusion models for decoding fMRI signals into visual content. I also collaborate with Prof. Praneeth Chakravarthula on unified image restoration and with Prof. Josep Lladós on multimodal document understanding.

I have also been fortunate to collaborate with several research groups and academic institutions during my undergraduate years: Prof. Umapada Pal and Prof. Saumik Bhattacharya at the CVPR Unit, Indian Statistical Institute, Kolkata; Prof. Jose Dolz at ETS Montreal; and Prof. Yi-Zhe Song at the SketchX Lab, CVSSP, University of Surrey.

I have published at top-tier computer vision and speech/signal processing conferences such as CVPR, BMVC, ICIP, ICASSP, and INTERSPEECH.

Email  /  Google Scholar  /  Blogs  /  GitHub  /  LinkedIn

Note: I am actively looking for a research internship position for Summer 2026. Please feel free to contact me if you have an opening!

News
  • New!! [Sept '25]  Transferred to UC San Diego to continue my PhD at the Computer Science and Engineering department!
  • New!! [Jul '24]  One paper accepted at BMVC 2024 as an Oral presentation!
  • [May '24]  Joined SRI International as a research intern! Working with Anirban Roy on multimodal diffusion models for brain decoding.
  • [Apr '24]  Received summer internship offers from SRI International and Amazon Science!
  • [Aug '23]  One paper accepted at BMVC 2023!
  • [Aug '23]  Started my PhD in Computer Science at UNC Chapel Hill!
Recent Preprints (* denotes equal contribution)
Unpaired Multimodal Medical Image Segmentation via Cross-Modal Prompt-driven 3D Foundational Models

Soumitri Chattopadhyay, Basar Demir, Marc Niethammer

[In Submission]

We address the practical challenge of cross-modal 3D medical image segmentation under extreme annotation constraints, assuming only a handful of labeled examples in a single source modality and none in the target modality. The support and query volumes are unpaired and may originate from distinct patients or imaging protocols. To tackle this, we propose a training-free, plug-and-play framework that leverages general-purpose 3D foundation models for segmentation. Our approach explores two complementary strategies for cross-modal prompting: (i) deformable spatial alignment via off-the-shelf multimodal registration, and (ii) semantic-aware feature similarity computation between cross-modal support-query volumes, transformed into soft prompts in logit space. These dense or sparse prompts condition frozen foundational segmentation backbones (such as SAM-Med3D), requiring no additional training or fine-tuning. We further extend the framework with single-image test-time adaptation, refining prompts in a self-supervised manner to improve generalization. We validate our methods on three publicly available 3D abdominal organ segmentation datasets encompassing both CT and MR modalities and, by suitably designing cross-modal settings, show that our methods are promising and perform competitively with oracle upper bounds.

UniCoRN: Latent Diffusion-based Unified Controllable Image Restoration Network across Multiple Degradations

Debabrata Mandal, Soumitri Chattopadhyay, Guansen Tong, Praneeth Chakravarthula

Project Page

Image restoration is essential for enhancing degraded images across computer vision tasks. However, most existing methods address only a single type of degradation (e.g., blur, noise, or haze) at a time, limiting their real-world applicability where multiple degradations often occur simultaneously. In this paper, we propose UniCoRN, a unified image restoration approach capable of handling multiple degradation types simultaneously using a multi-head diffusion model. Specifically, we uncover the potential of low-level visual cues extracted from images to guide a controllable diffusion model for real-world image restoration, and we design a multi-head control network adaptable via a mixture-of-experts strategy. We train our model without any prior assumption of specific degradations, through a carefully designed curriculum learning recipe. We further introduce MetaRestore, a metalens imaging benchmark containing images with multiple degradations and artifacts. Extensive evaluations on several challenging datasets, including our benchmark, demonstrate that our method achieves significant performance gains and can robustly restore images with severe degradations.

Zero-shot Domain Generalization of Foundational Models for 3D Medical Image Segmentation: An Experimental Study

Soumitri Chattopadhyay, Basar Demir, Marc Niethammer

Domain shift, caused by variations in imaging modalities and acquisition protocols, limits model generalization in medical image segmentation. While foundation models (FMs) trained on diverse large-scale data hold promise for zero-shot generalization, their application to volumetric medical data remains underexplored. In this study, we examine their zero-shot domain generalization (DG) ability by conducting a comprehensive experimental study encompassing 6 medical segmentation FMs and 12 public datasets spanning multiple modalities and anatomies. Our findings reveal the potential of promptable FMs in bridging the domain gap via smart prompting techniques. Additionally, by probing into multiple facets of zero-shot DG, we offer valuable insights into the viability of foundation models for generalizing across domains and identify promising avenues for future research.

Downstream Analysis of Foundational Medical Vision Models for Disease Progression

Basar Demir, Soumitri Chattopadhyay, Thomas Hastings Greer, Boqi Chen, Marc Niethammer

Medical vision foundational models are used for a wide variety of tasks, including medical image segmentation and registration. This work evaluates the ability of these models to predict disease progression using a simple linear probe. We hypothesize that intermediate layer features of segmentation models capture structural information, while those of registration models encode knowledge of change over time. Beyond demonstrating that these features are useful for disease progression prediction, we also show that registration model features do not require spatially aligned input images. However, for segmentation models, spatial alignment is essential for optimal performance. Our findings highlight the importance of spatial alignment and the utility of foundation model features for image registration.
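
For the curious, here is a minimal sketch of the linear-probe setup described above, assuming a frozen feature extractor and scikit-learn; the backbone and data arguments are hypothetical placeholders, not the paper's actual code:

    import torch
    from sklearn.linear_model import LogisticRegression

    def linear_probe_accuracy(backbone, train_x, train_y, test_x, test_y):
        """Fit a linear classifier on frozen intermediate features (illustrative only)."""
        @torch.no_grad()
        def extract(volumes):
            backbone.eval()
            feats = backbone(volumes)                      # intermediate-layer activations
            return feats.flatten(start_dim=1).cpu().numpy()

        probe = LogisticRegression(max_iter=1000)          # the "simple linear probe"
        probe.fit(extract(train_x), train_y)               # labels: disease progression
        return probe.score(extract(test_x), test_y)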


Selected Publications (* denotes equal contribution)
2024
Towards Generative Class Prompt Learning for Fine-grained Visual Recognition

Soumitri Chattopadhyay, Sanket Biswas, Emanuele Vivoli, Josep Lladós
British Machine Vision Conference (BMVC), 2024 (Oral presentation)

arXiv / Code / Project Page

Although foundational vision-language models (VLMs) have proven to be very successful for various semantic discrimination tasks, they still struggle to perform faithfully for fine-grained categorization. Moreover, foundational models trained on one domain do not generalize well to a different domain without fine-tuning. We attribute these to the limitations of the VLM's semantic representations and attempt to improve their fine-grained visual awareness using generative modeling. Specifically, we propose two novel methods: Generative Class Prompt Learning (GCPL) and Contrastive Multi-class Prompt Learning (CoMPLe). Utilizing text-to-image diffusion models, GCPL significantly improves the visio-linguistic synergy in class embeddings by conditioning on few-shot exemplars with learnable class prompts. CoMPLe builds on this foundation by introducing a contrastive learning component that encourages inter-class separation during the generative optimization process. Our empirical results demonstrate that such a generative class prompt learning approach substantially outperforms existing methods, offering a better alternative to few-shot image recognition challenges.

2023
Active Learning for Fine-Grained Sketch-Based Image Retrieval

Himanshu Thakur*, Soumitri Chattopadhyay*
British Machine Vision Conference (BMVC), 2023

arXiv / Poster / Video

The ability to retrieve a photo by mere free-hand sketching highlights the immense potential of fine-grained sketch-based image retrieval (FG-SBIR). However, its rapid practical adoption, as well as scalability, is limited by the expense of acquiring faithful sketches for easily available photo counterparts. A solution to this problem is active learning, which could minimise the need for labeled sketches while maximising performance. Despite extensive studies in the field, there exists no work that utilises it to reduce sketching effort in FG-SBIR tasks. To this end, we propose a novel active learning sampling technique that drastically minimises the need to draw photo sketches. Our proposed approach tackles the trade-off between uncertainty and diversity by utilising the relationship between existing photo-sketch pairs and a photo that does not yet have a sketch, and by augmenting this relation with its intermediate representations. Since our approach relies only on the underlying data distribution, it is agnostic to the modelling approach and hence is applicable to other cross-modal instance-level retrieval tasks as well. Through experiments on two publicly available fine-grained SBIR datasets, ChairV2 and ShoeV2, we validate our approach and reveal its superiority over adapted baselines.

Exploring Self-Supervised Representation Learning For Low-Resource Medical Image Analysis

Soumitri Chattopadhyay, Soham Ganguly*, Sreejit Chaudhury*, Sayan Nag*, Samiran Chattopadhyay
IEEE International Conference on Image Processing (ICIP), 2023

arXiv / Code / Video

The success of self-supervised learning (SSL) has mostly been attributed to the availability of unlabeled yet large-scale datasets. However, in a specialized domain such as medical imaging, which differs substantially from natural images, the assumption of data availability is unrealistic and impractical, as the data itself is scarce and found in small databases collected for specific prognosis tasks. To this end, we investigate the applicability of self-supervised learning algorithms on small-scale medical imaging datasets. In particular, we evaluate 4 state-of-the-art SSL methods on three publicly accessible small medical imaging datasets. Our investigation reveals that in-domain low-resource SSL pre-training can yield competitive performance to transfer learning from large-scale datasets (such as ImageNet). Furthermore, we extensively analyse our empirical findings to provide valuable insights that can motivate further research toward circumventing the need for pre-training on a large image corpus. To the best of our knowledge, this is the first attempt to holistically explore self-supervision on low-resource medical datasets.

BeAts: Bengali Speech Acts Recognition using Multimodal Attention Fusion

Ahana Deb*, Sayan Nag*, Ayan Mahapatra*, Soumitri Chattopadhyay*, Aritra Marik*, Pijush Kanti Gayen, Shankha Sanyal, Archi Banerjee, Samir Karmakar
INTERSPEECH 2023 (Oral presentation)

arXiv / Project Page

Spoken languages often utilise intonation, rhythm, intensity, and structure to communicate intention, and these cues can be interpreted differently depending on how an utterance is delivered. These speech acts provide the foundation of communication and are unique in expression to the language. Recent advancements in attention-based models, which have demonstrated the ability to learn powerful representations from multilingual datasets, have performed well in speech tasks and are well suited to modelling specific tasks in low-resource languages. Here, we develop a novel multimodal approach combining two models, wav2vec2.0 for audio and MarianMT for text translation, using multimodal attention fusion to predict speech acts in our prepared Bengali speech corpus. We also show that our model BeAts (Bengali speech acts recognition using Multimodal Attention Fusion) significantly outperforms both a unimodal baseline using only speech data and a simpler bimodal fusion using both speech and text data.

DeCAtt: Efficient Vision Transformers with Decorrelated Attention Heads

Mayukh Bhattacharyya*, Soumitri Chattopadhyay*, Sayan Nag*
CVPR 2023 Workshop on Efficient Deep Learning for Computer Vision
(Oral presentation)

Paper / Code / Slides

The advent of Vision Transformers (ViT) has led to significant performance gains across various computer vision tasks over the last few years, surpassing the de facto standard CNN architectures. However, most of the prominent variants of Vision Transformers are resource-intensive architectures with huge parameter counts. They are known to be data-hungry and to overfit quickly on comparatively smaller datasets. Consequently, this holds back their widespread usage in low-resource settings, which brings forth the need to develop resource-efficient vision transformers. To this end, we introduce a regularization loss that prioritizes efficient utilization of model parameters by decorrelating the heads of a multi-headed attention block in a vision transformer. This forces the heads to learn distinct features rather than focus on the same ones. As we show in our experiments, this loss provides consistent performance improvements across a wide range of models and datasets, demonstrating its effectiveness.
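
A rough sketch of the decorrelation idea (purely illustrative; the tensor layout, normalization, and weighting are assumptions rather than the paper's exact formulation):

    import torch
    import torch.nn.functional as F

    def head_decorrelation_loss(head_outputs):
        """Penalize similarity between attention heads so they learn distinct features.

        head_outputs: (batch, num_heads, tokens, dim) -- assumed layout.
        """
        b, h, t, d = head_outputs.shape
        flat = F.normalize(head_outputs.reshape(b, h, t * d), dim=-1)
        sim = torch.einsum('bhx,bgx->bhg', flat, flat)      # pairwise head similarity
        off_diag = sim - torch.eye(h, device=sim.device)    # ignore self-similarity
        return (off_diag ** 2).mean()

    # total_loss = task_loss + reg_weight * head_decorrelation_loss(heads)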

Exploiting Unlabelled Photos for Stronger Fine-Grained SBIR

Aneeshan Sain, Ayan Kumar Bhunia, Subhadeep Koley, Pinaki Nath Chowdhury, Soumitri Chattopadhyay, Tao Xiang, Yi-Zhe Song
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023

arXiv / Project Page / Video

This paper advances the fine-grained sketch-based image retrieval (FG-SBIR) literature by putting forward a strong baseline that overshoots prior state-of-the-art by ~11%. This is not achieved via complicated design, but by addressing two critical issues facing the community: (i) the gold-standard triplet loss does not enforce holistic latent space geometry, and (ii) there are never enough sketches to train a high-accuracy model. For the former, we propose a simple modification to the standard triplet loss that explicitly enforces separation amongst photo/sketch instances. For the latter, we put forward a novel knowledge distillation module that can leverage photo data for model training. Both modules are then plugged into a novel plug-n-playable training paradigm that allows for more stable training. More specifically, for (i) we employ an intra-modal triplet loss amongst sketches to bring sketches of the same instance closer to each other and away from others, and another amongst photos to push away different photo instances while bringing closer a structurally augmented version of the same photo (offering a gain of 4-6%). To tackle (ii), we first pre-train a teacher on the large set of unlabelled photos using the aforementioned intra-modal photo triplet loss. Then we distill the contextual similarity present amongst the instances in the teacher's embedding space into the student's embedding space, by matching the distribution over inter-feature distances of respective samples in both embedding spaces (delivering a further gain of 4-5%). Apart from outperforming prior art significantly, our model also yields satisfactory results when generalising to new classes.

IDEAL: Improved DEnse locAL Contrastive Learning for Semi-Supervised Medical Image Segmentation

Hritam Basak, Soumitri Chattopadhyay*, Rohit Kundu*, Sayan Nag*, Rammohan Mallipeddi
IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP), 2023

arXiv / Code / Project Page

Due to the scarcity of labeled data, contrastive self-supervised learning (SSL) frameworks have lately shown great potential in several medical image analysis tasks. However, existing contrastive mechanisms are sub-optimal for dense pixel-level segmentation tasks due to their inability to mine local features. To this end, we extend the concept of metric learning to the segmentation task, using dense (dis)similarity learning to pre-train a deep encoder network and a semi-supervised paradigm to fine-tune it for the downstream task. Specifically, we propose a simple convolutional projection head for obtaining dense pixel-level features, and a new contrastive loss that utilizes these dense projections, thereby improving the local representations. A bidirectional consistency regularization mechanism involving two-stream model training is devised for the downstream task. Our IDEAL method outperforms state-of-the-art methods by clear margins on cardiac MRI segmentation.
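
A simplified sketch of a dense projection head with a per-pixel contrastive loss, in the spirit of the description above (the InfoNCE form, 1x1-conv head, and shapes are generic stand-ins, not the paper's exact design):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class DenseProjectionHead(nn.Module):
        """1x1 convolutional head that keeps spatial resolution for pixel-level features."""
        def __init__(self, in_ch, proj_ch=128):
            super().__init__()
            self.proj = nn.Sequential(
                nn.Conv2d(in_ch, in_ch, kernel_size=1), nn.ReLU(inplace=True),
                nn.Conv2d(in_ch, proj_ch, kernel_size=1),
            )

        def forward(self, feat_map):                        # (B, C, H, W)
            return F.normalize(self.proj(feat_map), dim=1)  # unit-norm per-pixel embeddings

    def dense_contrastive_loss(z1, z2, temperature=0.1):
        """Treat matching pixel locations across two views as positives (illustrative)."""
        b, d, h, w = z1.shape
        z1 = z1.permute(0, 2, 3, 1).reshape(-1, d)          # (B*H*W, D)
        z2 = z2.permute(0, 2, 3, 1).reshape(-1, d)
        logits = z1 @ z2.t() / temperature
        targets = torch.arange(z1.size(0), device=z1.device)
        return F.cross_entropy(logits, targets)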

2022
SWIS: Self-Supervised Representation Learning For Writer Independent Offline Signature Verification

Siladittya Manna, Soumitri Chattopadhyay, Saumik Bhattacharya, Umapada Pal
IEEE International Conference on Image Processing (ICIP), 2022
(Oral presentation)

arXiv / Code / Slides

Writer-independent offline signature verification is one of the most challenging tasks in pattern recognition, as there is often a scarcity of training data. To handle this data scarcity problem, we propose a novel self-supervised learning (SSL) framework for writer-independent offline signature verification. To our knowledge, this is the first attempt to utilize a self-supervised setting for the signature verification task. Self-supervised representation learning from signature images is achieved by minimizing the cross-covariance between two random variables belonging to different feature directions while ensuring a positive cross-covariance between random variables denoting the same feature direction. This ensures that the features are linearly decorrelated and redundant information is discarded. Experiments on different datasets yield encouraging results.
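
An illustrative sketch of the cross-covariance objective described above (redundancy-reduction style; the normalization and off-diagonal weight are assumptions, not the paper's exact loss):

    import torch

    def cross_covariance_loss(z1, z2, off_diag_weight=5e-3):
        """z1, z2: (batch, dim) embeddings of two views of the same signature image."""
        z1 = (z1 - z1.mean(0)) / (z1.std(0) + 1e-6)
        z2 = (z2 - z2.mean(0)) / (z2.std(0) + 1e-6)
        c = (z1.t() @ z2) / z1.size(0)                      # cross-covariance matrix
        on_diag = ((torch.diagonal(c) - 1) ** 2).sum()      # same feature direction -> positive
        off_diag = (c - torch.diag(torch.diagonal(c))).pow(2).sum()  # other directions -> 0
        return on_diag + off_diag_weight * off_diag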

SURDS: Self-Supervised Attention-guided Reconstruction and Dual Triplet Loss for Writer Independent Offline Signature Verification

Soumitri Chattopadhyay, Siladittya Manna, Saumik Bhattacharya, Umapada Pal
International Conference on Pattern Recognition (ICPR), 2022

arXiv / Code / Video

Offline Signature Verification (OSV) is a fundamental biometric task across various forensic, commercial, and legal applications. The underlying task is to carefully model fine-grained features of signatures to distinguish between genuine and forged ones, which differ only in minute deformities. This makes OSV more challenging than other verification problems. In this work, we propose a two-stage deep learning framework that leverages self-supervised representation learning as well as metric learning for writer-independent OSV. First, we train an image reconstruction network on signature image patches, using an encoder-decoder architecture augmented by a 2D spatial attention mechanism. Next, the trained encoder backbone is fine-tuned with a projector head using a supervised metric learning framework, whose objective is to optimize a novel dual triplet loss by sampling negatives both from within the same writer class and from other writers. The intuition is to ensure that a signature sample lies closer to its positive counterpart than to negative samples from both intra-writer and cross-writer sets, resulting in robust discriminative learning of the embedding space. To the best of our knowledge, this is the first work to use self-supervised learning frameworks for OSV. The proposed two-stage framework has been evaluated on two publicly available offline signature datasets and compared with various state-of-the-art methods, providing promising results that outperform several existing works.
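
A hedged sketch of the dual triplet idea (the margin, distance metric, and how the two terms combine are illustrative assumptions, not the paper's implementation):

    import torch.nn.functional as F

    def dual_triplet_loss(anchor, positive, neg_intra_writer, neg_cross_writer, margin=0.2):
        """Pull the anchor toward its positive and away from negatives sampled both
        from the same writer and from other writers (illustrative)."""
        d_pos = F.pairwise_distance(anchor, positive)
        loss_intra = F.relu(d_pos - F.pairwise_distance(anchor, neg_intra_writer) + margin)
        loss_cross = F.relu(d_pos - F.pairwise_distance(anchor, neg_cross_writer) + margin)
        return (loss_intra + loss_cross).mean()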

Academic Services
  • Conference Reviewing:
    - CVPR 2025, ICCV 2025, ACM MM 2025, WACV 2026
    - CVPR Workshops: ECV @ CVPR'23,'24; T4V @ CVPR'24
    - ICCV Workshops: WiCV @ ICCV'23
    - ICPR 2022, 2024

  • Journal Reviewing:
    - IEEE Access ('23-Present)
    - Cluster Computing ('24-Present)
    - Engineering Applications of Artificial Intelligence, Elsevier ('23-Present)
    - Computers in Biology and Medicine, Elsevier ('21-Present)
    - International Journal of Intelligent Systems, Wiley ('22-Present)
    - Informatics in Medicine Unlocked, Elsevier ('22-Present)
© Soumitri Chattopadhyay | Last updated: September 2025 | Thanks to Jon Barron for sharing this awesome template!