Project description:BanglaLekha-Isolated, a Bangla handwritten isolated-character dataset, is presented in this article. The dataset contains 84 different characters: 50 Bangla basic characters, 10 Bangla numerals, and 24 selected compound characters. 2,000 handwriting samples for each of the 84 characters were collected, digitized, and pre-processed. After discarding mistakes and scribbles, 166,105 handwritten character images were included in the final dataset. The dataset also includes labels indicating the age and gender of the subjects from whom the samples were collected. It can be used not only for optical handwriting recognition research but also to explore the influence of gender and age on handwriting. The dataset is publicly available at https://data.mendeley.com/datasets/hf6sf8zrkc/2.
Project description:This article presents a Bangla handwriting dataset named BanglaWriting, which contains single-page handwriting samples from 260 individuals of varying personalities and ages. Each page includes bounding boxes that enclose each word, along with the Unicode representation of the writing. The dataset contains 21,234 words and 32,787 characters in total, including 5,470 unique words of the Bangla vocabulary. Apart from the usual words, it comprises 261 legible overwritings and 450 handwritten strike-throughs and mistakes. All bounding boxes and word labels were generated manually. The dataset can be used for complex optical character/word recognition, writer identification, handwritten word segmentation, and word generation. Furthermore, it is suitable for studying age-based and gender-based variation in handwriting.
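The word-level annotation scheme described above (a bounding box plus the Unicode text of each word) can be sketched as follows. This is a minimal illustration only: the field names and the crop helper are assumptions, not the dataset's actual schema.

```python
# Minimal sketch of working with word-level annotations such as those in
# BanglaWriting. The record fields ("x", "y", "w", "h", "label") are
# hypothetical; consult the dataset's documentation for the real schema.

def crop_word(page, box):
    """Crop a word region from a page image, given as a list of pixel rows."""
    x, y, w, h = box["x"], box["y"], box["w"], box["h"]
    return [row[x:x + w] for row in page[y:y + h]]

# A toy 6x8 "page" of pixels (0 = background, 1 = ink).
page = [[0] * 8 for _ in range(6)]
page[2][3] = page[2][4] = 1

# Hypothetical annotation record: bounding box plus Unicode word label.
annotation = {"x": 3, "y": 2, "w": 2, "h": 1, "label": "শব্দ"}

word_img = crop_word(page, annotation)
print(word_img)              # [[1, 1]]
print(annotation["label"])   # শব্দ
```

Cropping each labeled box this way yields (word image, Unicode text) pairs, the unit of data used in word recognition and segmentation experiments.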
Project description:Language is the method by which individuals express their thoughts, and each language has its own alphabet and numerals. Oral and written communication are both effective means of human interaction, and many languages also have a sign language equivalent, through which hearing-impaired and/or nonverbal individuals communicate. BDSL is the abbreviation for Bangla sign language, and this dataset contains images of Bangla hand signs. The collection covers 49 individual signs of the Bengali alphabet: BDSL49 is a set of 29,490 images with 49 labels. During data collection, images of fourteen distinct adults, each with a unique appearance and context, were captured, and during data preparation several strategies were applied to reduce noise. The dataset is freely available to researchers, who can use it to develop automated systems with machine learning, computer vision, and deep learning techniques. Moreover, two models were applied to this dataset: one for detection and one for identification.
Project description:According to the WHO, the number of patients with mental disorders, especially depression, has grown rapidly, and depression has become a leading contributor to the global burden of disease. With the rise of tools such as artificial intelligence, using physiological data to explore new physiological indicators of mental disorders and to create new applications for their diagnosis has become a hot research topic. We present a multi-modal open dataset for mental-disorder analysis. The dataset includes EEG and spoken-language recordings from clinically depressed patients and matched normal controls, who were carefully diagnosed and selected by professional psychiatrists in hospitals. The EEG data were collected using both a traditional elastic cap with 128 mounted electrodes and a wearable 3-electrode EEG collector intended for pervasive-computing applications. The 128-electrode EEG signals of 53 participants were recorded both in the resting state and during Dot Probe tasks; the 3-electrode EEG signals of 55 participants were recorded in the resting state; and the audio data of 52 participants were recorded during interviews, reading, and picture description.
Project description:We describe data acquired with multiple functional and structural neuroimaging modalities on the same nineteen healthy volunteers. The functional data include Electroencephalography (EEG), Magnetoencephalography (MEG) and functional Magnetic Resonance Imaging (fMRI) data, recorded while the volunteers performed multiple runs of hundreds of trials of a simple perceptual task on pictures of familiar, unfamiliar and scrambled faces during two visits to the laboratory. The structural data include T1-weighted MPRAGE, Multi-Echo FLASH and Diffusion-weighted MR sequences. Though only from a small sample of volunteers, these data can be used to develop methods for integrating multiple modalities from multiple runs on multiple participants, with the aim of increasing the spatial and temporal resolution above that of any one modality alone. They can also be used to integrate measures of functional and structural connectivity, and as a benchmark dataset to compare results across the many neuroimaging analysis packages. The data are freely available from https://openfmri.org/.
Project description:The prognostic value of mitotic figures in tumor tissue is well established for many tumor types, and automating their detection is of high research interest. However, deep learning-based methods in particular face performance deterioration in the presence of domain shifts, which may arise from different tumor types, slide preparation procedures, and digitization devices. We introduce the MIDOG++ dataset, an extension of the MIDOG 2021 and 2022 challenge datasets. We provide region-of-interest images from 503 histological specimens of seven tumor types with variable morphology, with labels for 11,937 mitotic figures in total: breast carcinoma, lung carcinoma, lymphosarcoma, neuroendocrine tumor, cutaneous mast cell tumor, cutaneous melanoma, and (sub)cutaneous soft tissue sarcoma. The specimens were processed in several laboratories using diverse scanners. We evaluated the extent of the domain shift using state-of-the-art approaches, observing notable performance differences under single-domain training; in a leave-one-domain-out setting, generalizability improved considerably. This is the first mitotic figure dataset to incorporate a wide domain shift based on different tumor types, laboratories, whole slide image scanners, and species.
Project description:Machine learning (ML) methods for the analysis of electrocardiography (ECG) data are gaining importance, substantially supported by the release of large public datasets. However, current datasets lack important derived descriptors such as ECG features, which have been devised over the past hundred years, still form the basis of most automatic ECG analysis algorithms, and are critical to cardiologists' decision processes. ECG features are available from sophisticated commercial software but are not accessible to the general public. To alleviate this issue, we add ECG features from two leading commercial algorithms and an open-source implementation, supplemented by a set of automatic diagnostic statements from a commercial ECG analysis software in preprocessed format. This allows the comparison of ML models trained on clinically versus automatically generated label sets. We provide an extensive technical validation of features and diagnostic statements for ML applications. We believe this release crucially enhances the usability of the PTB-XL dataset as a reference dataset for ML methods in the context of ECG data.
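One use the release enables is comparing clinically assigned against automatically generated diagnostic statements record by record. A toy sketch of such a comparison follows; the statement codes and records below are invented for illustration and are not PTB-XL's actual label vocabulary.

```python
# Hedged sketch: per-record agreement between two diagnostic label sets,
# e.g. cardiologist-assigned vs algorithm-generated statements.
# The label strings below are hypothetical examples, not PTB-XL codes.

def label_agreement(clinical, automatic):
    """Fraction of records where the two label sets assign the same statement."""
    if len(clinical) != len(automatic):
        raise ValueError("label lists must align record by record")
    matches = sum(c == a for c, a in zip(clinical, automatic))
    return matches / len(clinical)

clinical  = ["NORM", "MI", "NORM", "STTC"]   # hypothetical cardiologist labels
automatic = ["NORM", "MI", "STTC", "STTC"]   # hypothetical algorithm output
print(label_agreement(clinical, automatic))  # 0.75
```

The same pairing of label sets can serve as alternative training targets, which is what allows ML models trained on clinical versus automatic labels to be compared.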
Project description:News analysis is a popular task in Natural Language Processing (NLP). In particular, the problem of clickbait in news analysis has gained attention in recent years [1, 2]. However, the majority of this work has focused on English news, for which rich, representative resources already exist. For other languages, such as Indonesian, resources for clickbait tasks are still lacking. Therefore, we introduce the CLICK-ID dataset of Indonesian news headlines extracted from 12 Indonesian online news publishers. It comprises 15,000 headlines annotated with clickbait and non-clickbait labels. Using the CLICK-ID dataset, we then developed an Indonesian clickbait classification model that achieves favourable performance. We believe that this corpus will be useful for replicable experiments in clickbait detection or other experiments in NLP areas.
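A minimal pure-Python baseline for this kind of headline classification might look as follows. This is not the authors' model: the keyword-ratio scoring and the toy English headlines are stand-ins for illustration, assuming CLICK-ID-style (headline, label) pairs.

```python
# Toy clickbait baseline over (headline, label) pairs. Not the CLICK-ID
# authors' model; the examples below are invented stand-ins.
from collections import Counter

def train_keyword_scores(examples):
    """Score each token by how much more often it appears in clickbait headlines."""
    clickbait, normal = Counter(), Counter()
    for headline, label in examples:
        target = clickbait if label == "clickbait" else normal
        target.update(headline.lower().split())
    vocab = set(clickbait) | set(normal)
    # Laplace-smoothed ratio per token: > 1 leans clickbait, < 1 leans normal.
    return {t: (clickbait[t] + 1) / (normal[t] + 1) for t in vocab}

def predict(scores, headline, threshold=1.0):
    toks = headline.lower().split()
    avg = sum(scores.get(t, 1.0) for t in toks) / len(toks)
    return "clickbait" if avg > threshold else "non-clickbait"

examples = [  # hypothetical stand-in for CLICK-ID annotations
    ("you won't believe this trick", "clickbait"),
    ("shocking secret revealed", "clickbait"),
    ("government announces new budget", "non-clickbait"),
    ("president visits flood victims", "non-clickbait"),
]
scores = train_keyword_scores(examples)
print(predict(scores, "shocking trick revealed"))  # clickbait
print(predict(scores, "new budget announced"))     # non-clickbait
```

With 15,000 annotated headlines, even a simple lexical baseline like this gives a reference point against which stronger Indonesian clickbait models can be measured.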
Project description:This paper introduces the Welsh Advanced Neuroimaging Database (WAND), a multi-scale, multi-modal imaging dataset comprising in vivo brain data from 170 healthy volunteers (aged 18-63 years), including 3 Tesla (3 T) magnetic resonance imaging (MRI) with ultra-strong (300 mT/m) magnetic field gradients, structural and functional MRI and nuclear magnetic resonance spectroscopy at 3 T and 7 T, magnetoencephalography (MEG), and transcranial magnetic stimulation (TMS), together with trait questionnaire and cognitive data. Data are organised using the Brain Imaging Data Structure (BIDS). In addition to raw data, we provide brain-extracted T1-weighted images, and quality reports for diffusion, T1- and T2-weighted structural data, and blood-oxygen level dependent functional tasks. Reasons for participant exclusion are also included. Data are available for download through our GIN repository, a data access management system designed to reduce storage requirements. Users can interact with and retrieve data as needed, without downloading the complete dataset. Given the depth of neuroimaging phenotyping, leveraging ultra-high-gradient, high-field MRI, MEG and TMS, this dataset will facilitate multi-scale and multi-modal investigations of the healthy human brain.
Project description:WEMAC is a unique open multi-modal dataset comprising physiological, speech, and self-reported emotional data records from 100 women, targeting gender-based violence detection. Emotions were elicited by viewing a validated video set through an immersive virtual reality headset. The physiological signals captured during the experiment include blood volume pulse, galvanic skin response, and skin temperature. Speech was acquired immediately after each stimulus to capture the final traces of the perceived emotion. Subjects were asked to annotate among 12 categorical emotions, several dimensional emotions with a modified version of the Self-Assessment Manikin, and liking and familiarity labels. The technical validation shows that all targeted categorical emotions have a strong, statistically significant positive correlation with their corresponding reported ones, meaning that the videos elicited the desired emotions in most cases. Specifically, a negative correlation is found between fear and non-fear emotions, indicating that fear is a well-portrayed emotional dimension, which is a specific, though not exclusive, focus of WEMAC towards detecting gender-based violence.
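The kind of technical validation described above, correlating the intended (elicited) emotion intensities with the self-reported ones, can be sketched with a plain Pearson correlation. The numbers below are invented for the sketch; WEMAC's actual validation follows its own protocol and statistics.

```python
# Illustrative validation sketch: correlate target (elicited) emotion
# intensities with self-reported ones. All values here are hypothetical.

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

target_fear   = [0.9, 0.8, 0.1, 0.2, 0.7]  # hypothetical intended intensities
reported_fear = [0.8, 0.9, 0.2, 0.1, 0.6]  # hypothetical SAM-style self-reports
r = pearson(target_fear, reported_fear)
print(round(r, 2))  # 0.95, a strong positive correlation on this toy data
```

A strong positive r between targeted and reported intensities is exactly the evidence cited in the validation: the stimuli elicit the emotions they were designed to elicit.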