Project description: Cryo-electron microscopy (cryo-EM) is a powerful technique for determining the structures of biological macromolecular complexes. Picking single-protein particles from cryo-EM micrographs is a crucial step in reconstructing protein structures. However, the widely used template-based particle picking process is labor-intensive and time-consuming. Though machine learning and artificial intelligence (AI) based particle picking can potentially automate the process, its development is hindered by the lack of large, high-quality labelled training data. To address this bottleneck, we present CryoPPP, a large, diverse, expert-curated cryo-EM image dataset for protein particle picking and analysis. It consists of labelled cryo-EM micrographs (images) of 34 representative protein datasets selected from the Electron Microscopy Public Image Archive (EMPIAR). The dataset is 2.6 terabytes and includes 9,893 high-resolution micrographs with labelled protein particle coordinates. The labelling process was rigorously validated through both 2D particle class validation and 3D density map validation against the gold standard. The dataset is expected to greatly facilitate the development of both AI and classical methods for automated cryo-EM protein particle picking.
Project description: The inherent low contrast of electron microscopy (EM) datasets presents a significant challenge for rapid segmentation of cellular ultrastructures from EM data. This challenge is particularly prominent when working with the high-resolution, large datasets that are now acquired using electron tomography and serial block-face imaging techniques. Deep learning (DL) methods offer an exciting opportunity to automate the segmentation process by learning from manual annotations of a small sample of EM data. While many DL methods are being rapidly adopted to segment EM data, no benchmark analysis of these methods has been conducted to date. We present EM-stellar, a platform hosted on Google Colab that can be used to benchmark the performance of a range of state-of-the-art DL methods on user-provided datasets. Using EM-stellar, we show that the performance of any DL method depends on the properties of the images being segmented. It also follows that no single DL method performs consistently across all performance evaluation metrics. EM-stellar (code and data) is written in Python and is freely available under the MIT license on GitHub (https://github.com/cellsmb/em-stellar). Supplementary data are available at Bioinformatics online.
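As an illustration of the kind of multi-metric comparison described above, the sketch below scores binary segmentation masks from several methods against a ground-truth mask across several metrics; the metric set and function names are illustrative assumptions, not the EM-stellar API.

```python
# Illustrative sketch (not the EM-stellar API): score segmentation masks
# from several DL methods against one ground-truth mask, using multiple
# metrics, since no single metric tells the whole story.
import numpy as np

def dice(pred, gt):
    """Dice coefficient between two binary masks."""
    inter = np.logical_and(pred, gt).sum()
    return 2.0 * inter / (pred.sum() + gt.sum() + 1e-8)

def iou(pred, gt):
    """Intersection-over-union (Jaccard index)."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / (union + 1e-8)

def precision(pred, gt):
    tp = np.logical_and(pred, gt).sum()
    return tp / (pred.sum() + 1e-8)

def recall(pred, gt):
    tp = np.logical_and(pred, gt).sum()
    return tp / (gt.sum() + 1e-8)

METRICS = {"dice": dice, "iou": iou, "precision": precision, "recall": recall}

def benchmark(predictions, gt):
    """predictions: dict mapping method name -> binary mask array."""
    gt = gt.astype(bool)
    return {name: {m: fn(p.astype(bool), gt) for m, fn in METRICS.items()}
            for name, p in predictions.items()}
```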
Project description: This study addresses high-frequency ultrasound image quality assessment for computer-aided diagnosis of the skin. In recent decades, high-frequency ultrasound imaging has opened up new opportunities in dermatology, utilizing the most recent deep learning-based algorithms for automated image analysis. An individual dermatological examination contains either a single image, a few images, or an image series acquired during the probe movement. The estimated skin parameters may depend on the probe position, orientation, or acquisition setup; consequently, the more images analyzed, the more precise the obtained measurements. Therefore, for automated measurements, the best choice is to acquire an image series and then analyze its parameters statistically. However, besides the correctly received images, the resulting series contains plenty of non-informative data: images with various artifacts or noise, and images acquired at time stamps when the ultrasound probe had no contact with the patient's skin. All of these influence further analysis, leading to misclassification or incorrect image segmentation. Therefore, an automated image selection step is crucial. To meet this need, we collected and shared 17,425 high-frequency images of facial skin from 516 measurements of 44 patients. Two experts annotated each image as correct or not. The proposed framework utilizes a deep convolutional neural network followed by a fuzzy reasoning system to automatically assess the quality of the acquired data. Different approaches to binary and multi-class image analysis, based on the VGG-16 model, were developed and compared. The best classification results reach 91.7% accuracy for the binary analysis and 82.3% for the multi-class analysis.
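A minimal sketch of the binary quality classifier described above, assuming a VGG-16 backbone fine-tuned with a new classification head; the head sizes and training settings are assumptions rather than the authors' exact configuration, and the downstream fuzzy reasoning stage is omitted.

```python
# Minimal sketch: VGG-16 backbone fine-tuned to label each ultrasound
# frame as "correct" (informative) or not. Head architecture and
# hyperparameters are illustrative assumptions.
import tensorflow as tf
from tensorflow.keras import layers, models

base = tf.keras.applications.VGG16(
    weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False  # start by training only the new head

model = models.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(256, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(1, activation="sigmoid"),  # P(image is correct)
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
# model.fit(train_ds, validation_data=val_ds, epochs=10)
```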
Project description: Cryo-electron microscopy (cryo-EM) is currently the most powerful technique for determining the structures of large protein complexes and assemblies. Picking single-protein particles from cryo-EM micrographs (images) is a key step in reconstructing protein structures. However, the widely used template-based particle picking process is labor-intensive and time-consuming. Though the emerging machine learning-based particle picking can potentially automate the process, its development is severely hindered by the lack of large, high-quality, manually labelled training data. Here, we present CryoPPP, a large, diverse, expert-curated cryo-EM image dataset for single protein particle picking and analysis to address this bottleneck. It consists of manually labelled cryo-EM micrographs of 32 non-redundant, representative protein datasets selected from the Electron Microscopy Public Image Archive (EMPIAR). It includes 9,089 diverse, high-resolution micrographs (∼300 cryo-EM images per EMPIAR dataset) in which the coordinates of protein particles were labelled by human experts. The protein particle labelling process was rigorously validated by both 2D particle class validation and 3D density map validation against the gold standard. The dataset is expected to greatly facilitate the development of machine learning and artificial intelligence methods for automated cryo-EM protein particle picking. The dataset and data processing scripts are available at https://github.com/BioinfoMachineLearning/cryoppp.
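To illustrate how such labelled coordinates might be consumed, the hypothetical sketch below crops fixed-size particle patches out of a micrograph; the file layout (an MRC micrograph plus a CSV of x/y coordinates) is an assumption for illustration, and the GitHub repository should be consulted for the dataset's actual structure.

```python
# Hypothetical usage sketch: crop labelled particles out of a cryo-EM
# micrograph. File layout (MRC image + CSV with "x"/"y" columns) is an
# assumption, not CryoPPP's documented format.
import mrcfile
import numpy as np
import pandas as pd

def crop_particles(mrc_path, coords_csv, box=256):
    with mrcfile.open(mrc_path, permissive=True) as mrc:
        img = np.asarray(mrc.data)
    coords = pd.read_csv(coords_csv)  # assumed columns: "x", "y"
    half = box // 2
    patches = []
    for _, row in coords.iterrows():
        x, y = int(row["x"]), int(row["y"])
        # skip particles whose box would fall outside the micrograph
        if half <= y < img.shape[0] - half and half <= x < img.shape[1] - half:
            patches.append(img[y - half:y + half, x - half:x + half])
    return np.stack(patches) if patches else np.empty((0, box, box))
```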
Project description: This paper contains datasets related to "Efficient Deep Learning Models for Categorizing Chenopodiaceae in the Wild" (Heidary-Sharifabad et al., 2021). There are about 1,500 species of Chenopodiaceae, which are spread worldwide and are often ecologically important. Biodiversity conservation of these species is critical due to the destructive effects of human activities on them. For this purpose, identification and surveillance of Chenopodiaceae species in their natural habitat are necessary and can be facilitated by deep learning. The feasibility of applying deep learning algorithms to identify Chenopodiaceae species depends on access to an appropriate dataset. Therefore, the ACHENY dataset was collected from natural habitats of different bushes of Chenopodiaceae species, under real-world conditions in desert and semi-desert areas of the Yazd province of Iran. This imbalanced dataset comprises 27,030 RGB color images of 30 Chenopodiaceae species, with 300-1,461 images per species. Images were captured from multiple bushes per species, with different camera-to-target distances, viewpoints, and angles, under natural sunlight in November and December. The collected images are not pre-processed; they are only resized to 224 × 224 pixels, a size usable by many successful deep learning models, and grouped into their respective classes. The images in each class are split into 10% for testing, 18% for validation, and 72% for training. Test images were mostly selected manually from plant bushes different from those in the training set; training and validation images were then randomly drawn from the remaining images in each category. Small-sized images with 64 × 64 dimensions are also included in ACHENY for use with other deep models.
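The sketch below reproduces a 72/18/10 train/validation/test split over per-class image folders. Note that it splits purely at random, whereas ACHENY's test images were largely hand-selected from separate bushes, and the directory layout is an assumption; ACHENY already ships with its own split.

```python
# Sketch of a 72/18/10 split per class, assuming one folder of images
# per species. Purely random; ACHENY's actual test set was largely
# hand-selected from separate bushes.
import random
from pathlib import Path

def split_class(image_dir, seed=0):
    files = sorted(Path(image_dir).glob("*.jpg"))
    random.Random(seed).shuffle(files)
    n = len(files)
    n_test, n_val = int(0.10 * n), int(0.18 * n)
    return {
        "test": files[:n_test],
        "val": files[n_test:n_test + n_val],
        "train": files[n_test + n_val:],
    }
```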
Project description: This data article provides details of the RDD2020 dataset, comprising 26,336 road images from India, Japan, and the Czech Republic with more than 31,000 instances of road damage. The dataset captures four types of road damage: longitudinal cracks, transverse cracks, alligator cracks, and potholes; it is intended for developing deep learning-based methods to detect and classify road damage automatically. The images in RDD2020 were captured using vehicle-mounted smartphones, making the dataset useful for municipalities and road agencies developing methods for low-cost monitoring of road pavement surface conditions. Further, machine learning researchers can use the dataset to benchmark the performance of different algorithms on other problems of the same type (image classification, object detection, etc.). RDD2020 is freely available at [1]. The latest updates and the corresponding articles related to the dataset can be accessed at [2].
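For illustration, the sketch below parses one annotation file, assuming Pascal VOC-style XML and the D00/D10/D20/D40 damage codes commonly associated with the RDD releases; both assumptions should be verified against the actual download before use.

```python
# Illustrative parser for one annotation file, assuming Pascal VOC-style
# XML. The damage-code mapping below is an assumption to verify against
# the released dataset.
import xml.etree.ElementTree as ET

DAMAGE_TYPES = {"D00": "longitudinal crack", "D10": "transverse crack",
                "D20": "alligator crack", "D40": "pothole"}  # assumed codes

def read_annotation(xml_path):
    root = ET.parse(xml_path).getroot()
    boxes = []
    for obj in root.iter("object"):
        label = obj.findtext("name")
        bb = obj.find("bndbox")
        boxes.append({
            "label": DAMAGE_TYPES.get(label, label),
            "xmin": int(float(bb.findtext("xmin"))),
            "ymin": int(float(bb.findtext("ymin"))),
            "xmax": int(float(bb.findtext("xmax"))),
            "ymax": int(float(bb.findtext("ymax"))),
        })
    return boxes
```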
Project description: Recent advancements in image analysis and interpretation technologies using computer vision techniques have shown potential for novel applications in clinical microbiology laboratories, supporting task automation that aims for faster and more reliable diagnostics. Deep learning models can be a valuable tool in the screening process, helping technicians spend less time classifying no-growth results and quickly separating the categories of tests that deserve further analysis. In this context, creating datasets with correctly classified images is fundamental for developing and improving such models. Therefore, a dataset of images of urine test Petri dishes was collected following a standardized process, with controlled positioning and lighting conditions. Image acquisition was conducted using a hardware chamber equipped with an LED lighting source and a smartphone camera with 12 MP resolution. A software application was developed to support image classification and handling. Experienced microbiologists classified the images according to positive, negative, and uncertain test results. The resulting dataset contains a total of 1,500 images and can support the development of deep learning algorithms to classify urine exams according to their microbial growth.
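A minimal sketch of loading such a three-class (positive/negative/uncertain) image dataset for model development; the folder layout is an assumption for illustration, not the dataset's documented structure.

```python
# Minimal loading sketch with torchvision. Assumed (hypothetical) layout:
# urine_petri_dishes/positive/*.jpg, .../negative/*.jpg, .../uncertain/*.jpg
import torch
from torchvision import datasets, transforms

tfm = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])
ds = datasets.ImageFolder("urine_petri_dishes", transform=tfm)
loader = torch.utils.data.DataLoader(ds, batch_size=32, shuffle=True)
```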
Project description: Automated segmentation of cellular electron microscopy (EM) datasets remains a challenge. Supervised deep learning (DL) methods that rely on region-of-interest (ROI) annotations yield models that fail to generalize to unrelated datasets. Newer unsupervised DL algorithms require relevant pre-training images; however, pre-training on currently available EM datasets is computationally expensive and shows little value for unseen biological contexts, as these datasets are large and homogeneous. To address this issue, we present CEM500K, a nimble 25 GB dataset of 0.5 × 10⁶ unique 2D cellular EM images curated from nearly 600 three-dimensional (3D) and 10,000 two-dimensional (2D) images from >100 unrelated imaging projects. We show that models pre-trained on CEM500K learn features that are biologically relevant and resilient to meaningful image augmentations. Critically, we evaluate transfer learning from these pre-trained models on six publicly available benchmark segmentation tasks and one newly derived task, and report state-of-the-art results on each. We release the CEM500K dataset, pre-trained models, and curation pipeline for model building and further expansion by the EM community. Data and code are available at https://www.ebi.ac.uk/pdbe/emdb/empiar/entry/10592/ and https://git.io/JLLTz.
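A hedged sketch of the transfer-learning pattern described above: initialize an encoder from weights pre-trained on CEM500K, then fine-tune on a small annotated EM dataset. The ResNet-50 backbone and checkpoint filename are assumptions for illustration; see the project's released models for the actual details.

```python
# Sketch, assuming a ResNet-50 encoder and a raw state-dict checkpoint
# ("cem500k_pretrained.pth" is a hypothetical filename).
import torch
import torchvision

encoder = torchvision.models.resnet50()
state = torch.load("cem500k_pretrained.pth", map_location="cpu")
encoder.load_state_dict(state, strict=False)  # tolerate missing/extra heads

# Freeze the earliest layers; fine-tune the rest on the target
# segmentation task's small annotated dataset.
for name, p in encoder.named_parameters():
    if name.startswith(("conv1", "bn1", "layer1")):
        p.requires_grad = False
```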