Dataset Information

Enhancing surgical instrument segmentation: integrating vision transformer insights with adapter.

ABSTRACT:

Purpose

In surgical image segmentation, a major challenge is the extensive time and resources required to gather large-scale annotated datasets. Given the scarcity of annotated data in this field, our work aims to develop a model that achieves competitive performance with training on limited datasets, while also enhancing model robustness in various surgical scenarios.

Methods

We propose a method that harnesses the strengths of pre-trained Vision Transformers (ViTs) and data efficiency of convolutional neural networks (CNNs). Specifically, we demonstrate how a CNN segmentation model can be used as a lightweight adapter for a frozen ViT feature encoder. Our novel feature adapter uses cross-attention modules that merge the multiscale features derived from the CNN encoder with feature embeddings from ViT, ensuring integration of the global insights from ViT along with local information from CNN.

Results

Extensive experiments demonstrate our method outperforms current models in surgical instrument segmentation. Specifically, it achieves superior performance in binary segmentation on the Robust-MIS 2019 dataset, as well as in multiclass segmentation tasks on the EndoVis 2017 and EndoVis 2018 datasets. It also showcases remarkable robustness through cross-dataset validation across these 3 datasets, along with the CholecSeg8k and AutoLaparo datasets. Ablation studies based on the datasets prove the efficacy of our novel adapter module.

Conclusion

In this study, we presented a novel approach integrating ViT and CNN. Our unique feature adapter successfully combines the global insights of ViT with the local, multi-scale spatial capabilities of CNN. This integration effectively overcomes data limitations in surgical instrument segmentation. The source code is available at: https://github.com/weimengmeng1999/AdapterSIS.git .

SUBMITTER: Wei M

PROVIDER: S-EPMC11230947 | biostudies-literature | 2024 Jul

REPOSITORIES: biostudies-literature

ACCESS DATA

Publications

Enhancing surgical instrument segmentation: integrating vision transformer insights with adapter.

Wei Meng M Shi Miaojing M Vercauteren Tom T

International journal of computer assisted radiology and surgery 20240508 7

<h4>Purpose</h4>In surgical image segmentation, a major challenge is the extensive time and resources required to gather large-scale annotated datasets. Given the scarcity of annotated data in this field, our work aims to develop a model that achieves competitive performance with training on limited datasets, while also enhancing model robustness in various surgical scenarios.<h4>Methods</h4>We propose a method that harnesses the strengths of pre-trained Vision Transformers (ViTs) and data effic ...[more]

PMID: 38717737

Similar Datasets

Project description:Transformers have demonstrated significant promise for computer vision tasks. Particularly noteworthy is SwinUNETR, a model that employs vision transformers, which has made remarkable advancements in improving the process of segmenting medical images. Nevertheless, the efficacy of training process of SwinUNETR has been constrained by an extended training duration, a limitation primarily attributable to the integration of the attention mechanism within the architecture. In this article, to address this limitation, we introduce a novel framework, called the MetaSwin model. Drawing inspiration from the MetaFormer concept that uses other token mix operations, we propose a transformative modification by substituting attention-based components within SwinUNETR with a straightforward yet impactful spatial pooling operation. Additionally, we incorporate of Squeeze-and-Excitation (SE) blocks after each MetaSwin block of the encoder and into the decoder, which aims at segmentation performance. We evaluate our proposed MetaSwin model on two distinct medical datasets, namely BraTS 2023 and MICCAI 2015 BTCV, and conduct a comprehensive comparison with the two baselines, i.e., SwinUNETR and SwinUNETR+SE models. Our results emphasize the effectiveness of MetaSwin, showcasing its competitive edge against the baselines, utilizing a simple pooling operation and efficient SE blocks. MetaSwin's consistent and superior performance on the BTCV dataset, in comparison to SwinUNETR, is particularly significant. For instance, with a model size of 24, MetaSwin outperforms SwinUNETR's 76.58% Dice score using fewer parameters (15,407,384 vs 15,703,304) and a substantially reduced training time (300 vs 467 mins), achieving an improved Dice score of 79.12%. This research highlights the essential contribution of a simplified transformer framework, incorporating basic elements such as pooling and SE blocks, thus emphasizing their potential to guide the progression of medical segmentation models, without relying on complex attention-based mechanisms.

Dataset Information

Enhancing surgical instrument segmentation: integrating vision transformer insights with adapter.

Purpose

Methods

Results

Conclusion

Publications

Enhancing surgical instrument segmentation: integrating vision transformer insights with adapter.

Similar Datasets

OmicsDI is part of the ELIXIR infrastructure

Tweets