Dataset Information

Hint-Based Image Colorization Based on Hierarchical Vision Transformer.

ABSTRACT: Hint-based image colorization is an image-to-image translation task that aims at creating a full-color image from an input luminance image when a small set of color values for some pixels are given as hints. Though traditional deep-learning-based methods have been proposed in the literature, they are based on convolution neural networks (CNNs) that have strong spatial locality due to the convolution operations. This often causes non-trivial visual artifacts in the colorization results, such as false color and color bleeding artifacts. To overcome this limitation, this study proposes a vision transformer-based colorization network. The proposed hint-based colorization network has a hierarchical vision transformer architecture in the form of an encoder-decoder structure based on transformer blocks. As the proposed method uses the transformer blocks that can learn rich long-range dependency, it can achieve visually plausible colorization results, even with a small number of color hints. Through the verification experiments, the results reveal that the proposed transformer model outperforms the conventional CNN-based models. In addition, we qualitatively analyze the effect of the long-range dependency of the transformer model on hint-based image colorization.

SUBMITTER: Lee S

PROVIDER: S-EPMC9570951 | biostudies-literature | 2022 Sep

REPOSITORIES: biostudies-literature

ACCESS DATA

Publications

Hint-Based Image Colorization Based on Hierarchical Vision Transformer.

Lee Subin S Jung Yong Ju YJ

Sensors (Basel, Switzerland) 20220929 19

Hint-based image colorization is an image-to-image translation task that aims at creating a full-color image from an input luminance image when a small set of color values for some pixels are given as hints. Though traditional deep-learning-based methods have been proposed in the literature, they are based on convolution neural networks (CNNs) that have strong spatial locality due to the convolution operations. This often causes non-trivial visual artifacts in the colorization results, such as f ...[more]

PMID: 36236517

Similar Datasets

Project description:Transformers have demonstrated significant promise for computer vision tasks. Particularly noteworthy is SwinUNETR, a model that employs vision transformers, which has made remarkable advancements in improving the process of segmenting medical images. Nevertheless, the efficacy of training process of SwinUNETR has been constrained by an extended training duration, a limitation primarily attributable to the integration of the attention mechanism within the architecture. In this article, to address this limitation, we introduce a novel framework, called the MetaSwin model. Drawing inspiration from the MetaFormer concept that uses other token mix operations, we propose a transformative modification by substituting attention-based components within SwinUNETR with a straightforward yet impactful spatial pooling operation. Additionally, we incorporate of Squeeze-and-Excitation (SE) blocks after each MetaSwin block of the encoder and into the decoder, which aims at segmentation performance. We evaluate our proposed MetaSwin model on two distinct medical datasets, namely BraTS 2023 and MICCAI 2015 BTCV, and conduct a comprehensive comparison with the two baselines, i.e., SwinUNETR and SwinUNETR+SE models. Our results emphasize the effectiveness of MetaSwin, showcasing its competitive edge against the baselines, utilizing a simple pooling operation and efficient SE blocks. MetaSwin's consistent and superior performance on the BTCV dataset, in comparison to SwinUNETR, is particularly significant. For instance, with a model size of 24, MetaSwin outperforms SwinUNETR's 76.58% Dice score using fewer parameters (15,407,384 vs 15,703,304) and a substantially reduced training time (300 vs 467 mins), achieving an improved Dice score of 79.12%. This research highlights the essential contribution of a simplified transformer framework, incorporating basic elements such as pooling and SE blocks, thus emphasizing their potential to guide the progression of medical segmentation models, without relying on complex attention-based mechanisms.

Project description:Breast cancer is a significant global public health concern, with various treatment options available based on tumor characteristics. Pathological examination of excision specimens after surgery provides essential information for treatment decisions. However, the manual selection of representative sections for histological examination is laborious and subjective, leading to potential sampling errors and variability, especially in carcinomas that have been previously treated with chemotherapy. Furthermore, the accurate identification of residual tumors presents significant challenges, emphasizing the need for systematic or assisted methods to address this issue. In order to enable the development of deep-learning algorithms for automated cancer detection on radiology images, it is crucial to perform radiology-pathology registration, which ensures the generation of accurately labeled ground truth data. The alignment of radiology and histopathology images plays a critical role in establishing reliable cancer labels for training deep-learning algorithms on radiology images. However, aligning these images is challenging due to their content and resolution differences, tissue deformation, artifacts, and imprecise correspondence. We present a novel deep learning-based pipeline for the affine registration of faxitron images, the x-ray representations of macrosections of ex-vivo breast tissue, and their corresponding histopathology images of tissue segments. The proposed model combines convolutional neural networks and vision transformers, allowing it to effectively capture both local and global information from the entire tissue macrosection as well as its segments. This integrated approach enables simultaneous registration and stitching of image segments, facilitating segment-to-macrosection registration through a puzzling-based mechanism. To address the limitations of multi-modal ground truth data, we tackle the problem by training the model using synthetic mono-modal data in a weakly supervised manner. The trained model demonstrated successful performance in multi-modal registration, yielding registration results with an average landmark error of 1.51 mm (±2.40), and stitching distance of 1.15 mm (±0.94). The results indicate that the model performs significantly better than existing baselines, including both deep learning-based and iterative models, and it is also approximately 200 times faster than the iterative approach. This work bridges the gap in the current research and clinical workflow and has the potential to improve efficiency and accuracy in breast cancer evaluation and streamline pathology workflow.

Project description:PurposeA reliable and comprehensive cancer prognosis model for oropharyngeal cancer (OPC) could better assist in personalizing treatment. In this work, we developed a vision transformer-based (ViT-based) multilabel model with multimodal input to learn complementary information from available pretreatment data and predict multiple associated endpoints for radiation therapy for patients with OPC.Methods and materialsA publicly available data set of 512 patients with OPC was used for both model training and evaluation. Planning computed tomography images, primary gross tumor volume masks, and 16 clinical variables representing patient demographics, diagnosis, and treatment were used as inputs. To extract deep image features with global attention, we used a ViT module. Clinical variables were concatenated with the learned image features and fed into fully connected layers to incorporate cross-modality features. To learn the mapping between the features and correlated survival outcomes, including overall survival, local failure-free survival, regional failure-free survival, and distant failure-free survival, we employed 4 multitask logistic regression layers. The proposed model was optimized by combining the multitask logistic regression negative-log likelihood losses of different prediction targets.ResultsWe employed the C-index and area under the curve metrics to assess the performance of our model for time-to-event prediction and time-specific binary prediction, respectively. Our proposed model outperformed corresponding single-modality and single-label models on all prediction labels, achieving C-indices of 0.773, 0.765, 0.776, and 0.773 for overall survival, local failure-free survival, regional failure-free survival, and distant failure-free survival, respectively. The area under the curve values ranged between 0.799 and 0.844 for different tasks at different time points. Using the medians of predicted risks as the thresholds to identify high-risk and low-risk patient groups, we performed the log-rank test, the results of which showed significantly larger separations in different event-free survivals.ConclusionWe developed the first model capable of predicting multiple labels for OPC simultaneously. Our model demonstrated better prognostic ability for all the prediction targets compared with corresponding single-modality models and single-label models.

Dataset Information

Hint-Based Image Colorization Based on Hierarchical Vision Transformer.

Publications

Hint-Based Image Colorization Based on Hierarchical Vision Transformer.

Similar Datasets

OmicsDI is part of the ELIXIR infrastructure

Tweets