Soumitri Chattopadhyay's Homepage

Soumitri Chattopadhyay

I am a first year Computer Science Ph.D. student at the University of North Carolina at Chapel Hill. I work on diffusion models for downstream computer vision tasks.

I completed my Bachelor's (B.E.) in June '23, majoring in Information Technology at Jadavpur University, India. During my undergrad, I was a research intern at the CVPR Unit, Indian Statistical Institute, Kolkata, under Prof. Umapada Pal and Prof. Saumik Bhattacharya, working on self-supervised learning for various computer vision problems. As a recipient of the Mitacs Globalink Research Internship, I spent the Summer of '22 at Ecole de technologie superieure (ETS) Montreal, where I worked with Prof. Jose Dolz on unpaired multi-modal medical image segmentation. I have also worked as a research assistant at SketchX Lab, University of Surrey, advised by Prof. Yi-Zhe Song, on fine-grained sketch-based image retrieval under limited data constraints. Prior to these, I worked with Prof. Pawan Kumar Singh and Prof. Ram Sarkar on nature-inspired optimization for medical imaging and speech emotion recognition.

I have published at top-tier computer vision/signal processing conferences such as CVPR, BMVC, ICIP, ICASSP and INTERSPEECH.

Email / CV / Google Scholar / Blogs / GitHub / LinkedIn

News

[Aug '23] One paper accepted at BMVC 2023!
[Jun '23] One paper accepted at IEEE ICIP 2023!
[May '23] One paper accepted at INTERSPEECH 2023 for Oral presentation!
[Apr '23] One paper accepted at ECV Workshop @ CVPR 2023 for Oral presentation!
[Feb '23] One paper accepted at CVPR 2023!
[Feb '23] One paper accepted at ICASSP 2023!
[Dec '22] Selected for participation at Research Week with Google 2023!
[June '22] One paper accepted at IEEE ICIP 2022 for Oral presentation!
[May '22] One paper accepted at ICPR 2022!
[May '22] Started my Research Internship at LIVIA, ETS Montreal under Prof. Jose Dolz.
[March '22] Won Best Paper Award (Track: Deep Learning) at MISP 2022!
[Dec '21] Selected for Mitacs Globalink Research Internship for Summer of 2022!

Selected Publications (* denotes equal contribution)

2023

Active Learning for Fine-Grained Sketch-Based Image Retrieval

Himanshu Thakur*, Soumitri Chattopadhyay*
British Machine Vision Conference (BMVC), 2023 (Accepted)

arXiv

The ability to retrieve a photo by mere free-hand sketching highlights the immense potential of Fine-grained sketch-based image retrieval (FG-SBIR). However, its rapid practical adoption, as well as scalability, is limited by the expense of acquiring faithful sketches for easily available photo counterparts. A solution to this problem is Active Learning, which could minimise the need for labeled sketches while maximising performance. Despite extensive studies in the field, there exists no work that utilises it for reducing sketching effort in FG-SBIR tasks. To this end, we propose a novel active learning sampling technique that drastically minimises the need for drawing photo sketches. Our proposed approach tackles the trade-off between uncertainty and diversity by utilising the relationship between the existing photo-sketch pair to a photo that does not have its sketch and augmenting this relation with its intermediate representations. Since our approach relies only on the underlying data distribution, it is agnostic of the modelling approach and hence is applicable to other cross-modal instance-level retrieval tasks as well. With experimentation over two publicly available fine-grained SBIR datasets ChairV2 and ShoeV2, we validate our approach and reveal its superiority over adapted baselines.

Exploring Self-Supervised Representation Learning For Low-Resource Medical Image Analysis

Soumitri Chattopadhyay, Soham Ganguly*, Sreejit Chaudhury*, Sayan Nag*, Samiran Chattopadhyay
IEEE International Conference on Image Processing (ICIP), 2023 (Accepted)

arXiv / Code / Slides

The success of self-supervised learning (SSL) has mostly been attributed to the availability of unlabeled yet large-scale datasets. However, in a specialized domain such as medical imaging, which differs considerably from natural images, the assumption of data availability is unrealistic and impractical, as the data itself is scarce and found in small databases collected for specific prognosis tasks. To this end, we seek to investigate the applicability of self-supervised learning algorithms on small-scale medical imaging datasets. In particular, we evaluate four state-of-the-art SSL methods on three publicly accessible small medical imaging datasets. Our investigation reveals that in-domain low-resource SSL pre-training can yield competitive performance to transfer learning from large-scale datasets (such as ImageNet). Furthermore, we extensively analyse our empirical findings to provide valuable insights that can motivate further research towards circumventing the need for pre-training on a large image corpus. To the best of our knowledge, this is the first attempt to holistically explore self-supervision on low-resource medical datasets.

BeAts: Bengali Speech Acts Recognition using Multimodal Attention Fusion

Ahana Deb*, Sayan Nag*, Ayan Mahapatra*, Soumitri Chattopadhyay*, Aritra Marik*, Pijush Kanti Gayen, Shankha Sanyal, Archi Banerjee, Samir Karmakar
INTERSPEECH 2023 (Oral presentation)

arXiv / Project Page

Spoken languages often utilise intonation, rhythm, intensity, and structure, to communicate intention, which can be interpreted differently depending on the rhythm of speech of their utterance. These speech acts provide the foundation of communication and are unique in expression to the language. Recent advancements in attention-based models, demonstrating their ability to learn powerful representations from multilingual datasets, have performed well in speech tasks and are ideal to model specific tasks in low resource languages. Here, we develop a novel multimodal approach combining two models, wav2vec2.0 for audio and MarianMT for text translation, by using multimodal attention fusion to predict speech acts in our prepared Bengali speech corpus. We also show that our model BeAts (Bengali speech acts recognition using Multimodal Attention Fusion) significantly outperforms both the unimodal baseline using only speech data and a simpler bimodal fusion using both speech and text data.

DeCAtt: Efficient Vision Transformers with Decorrelated Attention Heads

Mayukh Bhattacharyya*, Soumitri Chattopadhyay*, Sayan Nag*
CVPR 2023 Workshop -- Efficient Deep Learning for Computer Vision
(Oral presentation)

Paper / Code / Slides

The advent of Vision Transformers (ViT) has led to significant performance gains across various computer vision tasks over the last few years, surpassing the de facto standard CNN architectures. However, most of the prominent variations of Vision Transformers are resource-intensive architectures with huge parameter sizes. They are known to be data-hungry and overfit quickly on comparatively smaller datasets. Consequently, this holds back their widespread usage across low-resource settings, which brings forth the need to develop resource-efficient vision transformers. To this end, we introduce a regularization loss that prioritizes efficient utilization of model parameters by decorrelating the heads of a multi-headed attention block in a vision transformer. This forces the heads to learn distinct features rather than focus on the same ones. Using this loss provides a consistent performance improvement over a wide range of varying scenarios of models and datasets as we show in our experiments, which proves its superior effectiveness.
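The idea of penalizing redundancy between attention heads can be illustrated with a small sketch. This is not the loss from the paper: it assumes, purely for illustration, that each head's output has been flattened into one feature vector and that redundancy is measured as the mean squared cosine similarity over all pairs of distinct heads. The function name head_decorrelation_penalty is hypothetical.

```python
import math


def head_decorrelation_penalty(head_outputs):
    """Mean squared cosine similarity over all pairs of distinct heads.

    head_outputs: list of H flat feature vectors (one per attention head),
    each a list of floats of equal length. Illustrative formulation only,
    not the regularizer defined in the DeCAtt paper.
    """
    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm_u = math.sqrt(sum(a * a for a in u))
        norm_v = math.sqrt(sum(b * b for b in v))
        return dot / (norm_u * norm_v + 1e-12)  # epsilon avoids division by zero

    num_heads = len(head_outputs)
    pairs = [(i, j) for i in range(num_heads) for j in range(i + 1, num_heads)]
    if not pairs:
        return 0.0
    total = sum(cosine(head_outputs[i], head_outputs[j]) ** 2 for i, j in pairs)
    return total / len(pairs)


# Toy inputs (made up): one redundant pair of heads, one orthogonal pair.
redundant_heads = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0]]
distinct_heads = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
print(head_decorrelation_penalty(redundant_heads))
print(head_decorrelation_penalty(distinct_heads))
```

Under these assumptions the penalty is close to one when a head is a scaled copy of another and zero when heads are orthogonal, so adding it to a task loss would push heads toward distinct features.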

Exploiting Unlabelled Photos for Stronger Fine-Grained SBIR

Aneeshan Sain, Ayan Kumar Bhunia, Subhadeep Koley, Pinaki Nath Chowdhury, Soumitri Chattopadhyay, Tao Xiang, Yi-Zhe Song
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023

arXiv / Project Page / Video

This paper advances the fine-grained sketch-based image retrieval (FG-SBIR) literature by putting forward a strong baseline that overshoots prior state-of-the-art by ~11%. This is not via complicated design though, but by addressing two critical issues facing the community: (i) the gold standard triplet loss does not enforce holistic latent space geometry, and (ii) there are never enough sketches to train a high accuracy model. For the former, we propose a simple modification to the standard triplet loss that explicitly enforces separation amongst photos/sketch instances. For the latter, we put forward a novel knowledge distillation module that can leverage photo data for model training. Both modules are then plugged into a novel plug-n-playable training paradigm that allows for more stable training. More specifically, for (i) we employ an intra-modal triplet loss amongst sketches to bring sketches of the same instance closer together and away from others, and one more amongst photos to push away different photo instances while bringing closer a structurally augmented version of the same photo (offering a gain of 4-6%). To tackle (ii), we first pre-train a teacher on the large set of unlabelled photos over the aforementioned intra-modal photo triplet loss. Then we distill the contextual similarity present amongst the instances in the teacher's embedding space to that in the student's embedding space, by matching the distribution over inter-feature distances of respective samples in both embedding spaces (delivering a further gain of 4-5%). Apart from outperforming prior art significantly, our model also yields satisfactory results on generalising to new classes.

IDEAL: Improved DEnse locAL Contrastive Learning for Semi-Supervised Medical Image Segmentation

Hritam Basak, Soumitri Chattopadhyay*, Rohit Kundu*, Sayan Nag*, Rammohan Mallipeddi
IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP), 2023

arXiv / Code / Project Page

Due to the scarcity of labeled data, Contrastive Self-Supervised Learning (SSL) frameworks have lately shown great potential in several medical image analysis tasks. However, the existing contrastive mechanisms are sub-optimal for dense pixel-level segmentation tasks due to their inability to mine local features. To this end, we extend the concept of metric learning to the segmentation task, using a dense (dis)similarity learning for pre-training a deep encoder network, and employing a semi-supervised paradigm to fine-tune for the downstream task. Specifically, we propose a simple convolutional projection head for obtaining dense pixel-level features, and a new contrastive loss to utilize these dense projections thereby improving the local representations. A bidirectional consistency regularization mechanism involving two-stream model training is devised for the downstream task. Upon comparison, our IDEAL method outperforms the SoTA methods by fair margins on cardiac MRI segmentation.

2022

SWIS: Self-Supervised Representation Learning For Writer Independent Offline Signature Verification

Siladittya Manna, Soumitri Chattopadhyay, Saumik Bhattacharya, Umapada Pal
IEEE International Conference on Image Processing (ICIP), 2022
(Oral presentation)

arXiv / Code / Slides

Writer independent offline signature verification is one of the most challenging tasks in pattern recognition as there is often a scarcity of training data. To handle this data scarcity problem, in this paper, we propose a novel self-supervised learning (SSL) framework for writer independent offline signature verification. To our knowledge, this is the first attempt to utilize a self-supervised setting for the signature verification task. The objective of self-supervised representation learning from the signature images is achieved by minimizing the cross-covariance between two random variables belonging to different feature directions and ensuring a positive cross-covariance between the random variables denoting the same feature direction. This ensures that the features are decorrelated linearly and the redundant information is discarded. Experiments on different data sets yield encouraging results.
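The cross-covariance objective described above can be sketched with a Barlow Twins-style stand-in. This is an assumption made for illustration, not the exact SWIS loss: embeddings of two augmented views are standardized per feature dimension, their cross-correlation matrix is computed over the batch, diagonal entries are pulled towards one and off-diagonal entries towards zero. The function names and the off_diag_weight value are hypothetical.

```python
import math


def standardize(batch):
    """Scale each feature dimension to zero mean and unit variance over the batch."""
    n, d = len(batch), len(batch[0])
    out = [[0.0] * d for _ in range(n)]
    for j in range(d):
        column = [row[j] for row in batch]
        mean = sum(column) / n
        std = math.sqrt(sum((x - mean) ** 2 for x in column) / n) + 1e-12
        for i in range(n):
            out[i][j] = (batch[i][j] - mean) / std
    return out


def cross_correlation_loss(view_a, view_b, off_diag_weight=0.005):
    """Barlow Twins-style loss over two batches of embeddings (N x D lists).

    Same feature direction (i == j): push correlation towards 1.
    Different feature directions (i != j): push correlation towards 0.
    """
    za, zb = standardize(view_a), standardize(view_b)
    n, d = len(za), len(za[0])
    loss = 0.0
    for i in range(d):
        for j in range(d):
            c_ij = sum(za[k][i] * zb[k][j] for k in range(n)) / n
            if i == j:
                loss += (1.0 - c_ij) ** 2
            else:
                loss += off_diag_weight * c_ij ** 2
    return loss


# Toy embeddings (made up): four samples, two feature dimensions.
view = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
swapped = [[row[1], row[0]] for row in view]  # feature directions exchanged
print(cross_correlation_loss(view, view))
print(cross_correlation_loss(view, swapped))
```

In this toy setup, two identical views with already decorrelated features give a loss near zero, while exchanging the feature directions in one view is penalized on both the diagonal and the off-diagonal terms.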

SURDS: Self-Supervised Attention-guided Reconstruction and Dual Triplet Loss for Writer Independent Offline Signature Verification

Soumitri Chattopadhyay, Siladittya Manna, Saumik Bhattacharya, Umapada Pal
International Conference on Pattern Recognition (ICPR), 2022

arXiv / Code / Video

Offline Signature Verification (OSV) is a fundamental biometric task across various forensic, commercial and legal applications. The underlying task at hand is to carefully model fine-grained features of the signatures to distinguish between genuine and forged ones, which differ only in minute deformities. This makes OSV more challenging compared to other verification problems. In this work, we propose a two-stage deep learning framework that leverages self-supervised representation learning as well as metric learning for writer-independent OSV. First, we train an image reconstruction network using an encoder-decoder architecture that is augmented by a 2D spatial attention mechanism using signature image patches. Next, the trained encoder backbone is fine-tuned with a projector head using a supervised metric learning framework, whose objective is to optimize a novel dual triplet loss by sampling negative samples from both within the same writer class as well as from other writers. The intuition behind this is to ensure that a signature sample lies closer to its positive counterpart compared to negative samples from both intra-writer and cross-writer sets. This results in robust discriminative learning of the embedding space. To the best of our knowledge, this is the first work to use self-supervised learning frameworks for OSV. The proposed two-stage framework has been evaluated on two publicly available offline signature datasets and compared with various state-of-the-art methods. The proposed method provides promising results, outperforming several existing approaches.
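The dual triplet idea described above can be sketched as the sum of two standard triplet terms that share one anchor-positive pair. This is a minimal illustration under assumed choices (Euclidean distance, a single shared margin, equal weighting of the two terms); the paper's exact formulation may differ, and the function names are hypothetical.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def dual_triplet_loss(anchor, positive, intra_negative, cross_negative, margin=1.0):
    """Sum of two hinge triplet terms sharing the same anchor-positive pair.

    intra_negative: negative sample drawn from the same writer class.
    cross_negative: negative sample drawn from a different writer.
    """
    d_pos = euclidean(anchor, positive)
    intra_term = max(0.0, d_pos - euclidean(anchor, intra_negative) + margin)
    cross_term = max(0.0, d_pos - euclidean(anchor, cross_negative) + margin)
    return intra_term + cross_term


# Toy 2-D embeddings (made up): the intra-writer negative sits close to the
# anchor, the cross-writer negative sits far away.
loss = dual_triplet_loss(
    anchor=[0.0, 0.0],
    positive=[0.1, 0.0],
    intra_negative=[0.5, 0.0],
    cross_negative=[3.0, 0.0],
)
print(loss)
```

With these toy values only the intra-writer term is active, since the cross-writer negative already lies beyond the margin; this reflects the intuition that negatives from the same writer class are the harder ones to separate.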

2021

Pneumonia Detection from Lung X-ray Images using Local Search Aided Sine Cosine Algorithm based Deep Feature Selection Method

Soumitri Chattopadhyay, Rohit Kundu, Pawan Kumar Singh, Seyedali Mirjalili, Ram Sarkar
International Journal of Intelligent Systems, Wiley, 2021 (IF: 8.993)

Paper / Code

Pneumonia is a major cause of death among children below the age of 5 years, globally. It is especially prevalent in developing and underdeveloped nations where the risk factors for the disease such as unhygienic living conditions, high levels of pollution and overcrowding are higher. Radiological examination (usually X-ray scans) is conducted to detect pneumonia, yet it is prone to subjective variability and can lead to disagreements among different radiologists. To detect traces of pneumonia from X-ray images, a more robust method is therefore required, which can be achieved by using a computer-aided diagnosis (CAD) system. In this study, we develop a two-stage framework, using the combination of deep learning and optimization algorithms, which is both accurate and time-efficient. In its first stage, the proposed framework extracts features using a customized deep learning model called DenseNet-201 following the concept of transfer learning to cope with the scanty available data. In the second stage, we then reduce the feature dimension using an improved sine cosine algorithm equipped with an adaptive beta hill climbing-based local search algorithm. The optimized feature subset is utilized for the classification of “Pneumonia” and “Normal” X-ray images using a support vector machine classifier. Upon an evaluation on a publicly available data set, the proposed method demonstrates the highest accuracy of 98.36% and sensitivity of 98.79% with a feature reduction of 85.55% (74 features selected out of 512), using a five-fold cross-validation scheme. Extensive additional experiments on continuous benchmark functions as well as the CEC-2017 test suite further showcase the superiority and suitability of our proposed approach in application to real-valued optimization problems.

Academic Services

Conference Reviewing:

Journal Reviewing: