Dataset Information

Dataset of Karakalpak language stop words.

ABSTRACT: The dataset presented in this paper addresses the challenge of automatically extracting stop words in Natural Language Processing (NLP) for Karakalpak, a low-resource language spoken by approximately two million people in Uzbekistan. To this end, we created a corpus of 23 Karakalpak language school textbooks, named the Karakalpak Language School Corpus (KAASC). Using the KAASC corpus, we constructed lists of stop words with three methods based on Term Frequency-Inverse Document Frequency (TF-IDF): the unigram, bigram, and collocation methods. The resulting lists of stop words, together with the list of URLs used to construct the corpus, make up the dataset described in this paper.
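As a rough illustration of how a TF-IDF score can surface stop-word candidates, the following minimal Python sketch ranks unigrams or bigrams by their mean TF-IDF over the documents that contain them; n-grams that occur in nearly every document receive scores near zero. This is not the authors' implementation: the function names (`ngrams`, `stop_word_candidates`), the whitespace tokenizer, the averaging rule, and the `top_k` cut-off are all assumptions made for the example, and the collocation method is not covered.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return the n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def stop_word_candidates(documents, n=1, top_k=10):
    """Rank n-grams by ascending mean TF-IDF (hypothetical helper).

    Low scores mean the n-gram is spread across most documents, which is
    the property a TF-IDF-based stop-word list relies on.
    """
    # Assumption: naive lowercase + whitespace tokenization.
    tokenized = [doc.lower().split() for doc in documents]
    doc_terms = [Counter(ngrams(tokens, n)) for tokens in tokenized]
    num_docs = len(doc_terms)

    # Document frequency: number of documents containing each n-gram.
    doc_freq = Counter()
    for counts in doc_terms:
        doc_freq.update(counts.keys())

    scores = {}
    for term, df in doc_freq.items():
        idf = math.log(num_docs / df)  # 0.0 when the term is in every document
        # Relative term frequency in each document that contains the term.
        tf_values = [
            counts[term] / sum(counts.values())
            for counts in doc_terms
            if term in counts
        ]
        scores[term] = idf * sum(tf_values) / len(tf_values)

    # Sort by score, then alphabetically so that ties are deterministic.
    ranked = sorted(scores.items(), key=lambda item: (item[1], item[0]))
    return ranked[:top_k]
```

Calling `stop_word_candidates(texts, n=1)` would give unigram candidates and `n=2` bigram candidates; in practice the cut-off would need to be tuned against the corpus rather than fixed in advance.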

SUBMITTER: Madatov K

PROVIDER: S-EPMC10126844 | biostudies-literature | 2023 Jun

REPOSITORIES: biostudies-literature

Publications

Dataset of Karakalpak language stop words.

Madatov K, Bekchanov S, Vičič J

Data in Brief, 2023-04-05

PMID: 37113499

Similar Datasets

MyWSL: Malaysian words sign language dataset.

Project description:Deaf and hard-of-hearing individuals use sign language as a means of communication. However, those around them, especially family members like the children of deaf adults, may face communication challenges if they are unfamiliar with sign language. This issue has prompted numerous researchers to conduct studies on sign language translation and recognition. However, there is currently no publicly available dataset specifically for Malaysian sign language. This article introduces an image dataset of the Malaysian Sign Language (MySL) hand gestures used in everyday situations. The dataset, named MyWSL2023, comprises 3,500 images of ten static Malaysian sign language words collected from five participants (two males and three females) aged between 20 and 21 years old. The data collection took place indoors under normal lighting conditions. The MyWSL2023 dataset, which has been made freely accessible to all researchers, serves as a valuable resource for not only investigating and developing automated systems for hearing-impaired and deaf individuals but also gesture and sign language recognition using vision-based methods. The dataset can be accessed for free at https://data.mendeley.com/datasets/zvk55p7ktd.

| S-EPMC10439288 | biostudies-literature

BdSLW-11: Dataset of Bangladeshi sign language words for recognizing 11 daily useful BdSL words.

Project description:Datasets of Bangladeshi sign language words (BdSLW) are rare. Although there are many datasets of BdSL sign alphabets, numbers, and characters, there are not enough datasets of sign words. To the authors' knowledge, this is the first dataset of BdSL sign words. It is an image dataset of 1105 images covering 11 sign word categories chosen for their importance in daily life. The images were captured by camera from sign language users in Bangladesh: the authors visited individual users and photographed them with their permission. The images were then analyzed and segmented, keeping those of sufficient quality (no background, clear, bright, etc.). The dataset is intended for recognizing BdSL sign words.

| S-EPMC9679746 | biostudies-literature

Enhancing text pre-processing for Swahili language: Datasets for common Swahili stop-words, slangs and typos with equivalent proper words.

Project description:Natural Language Processing requires data to be pre-processed to guarantee quality models in different machine learning tasks. However, the Swahili language has been disadvantaged and is classified as a low-resource language because of inadequate data for NLP, especially basic textual datasets that are useful during the pre-processing stage. In this article we develop and contribute datasets of common Swahili stop-words, common Swahili slang, and common Swahili typos. The main source for these datasets was short Swahili messages collected from a Tanzanian platform that young people use to convey their opinions on things that matter to them. We derive the list of common Swahili stop-words by reviewing the most frequent words generated by a Python script from our corpus, review common slang and the corresponding proper words with the help of Swahili experts, and generate common Swahili typos by analysing the least frequent words generated by a Python script from the corpus. The datasets were exported into files for easy access and reuse. These datasets can be reused in natural language processing as resources in the pre-processing phase for Swahili textual data.

| S-EPMC7689026 | biostudies-literature

Hand gestures for emergency situations: A video dataset based on words from Indian sign language.

Project description:Automatic sign language recognition provides better services to the deaf, as it closes the existing communication gap between them and the rest of society. Hand gestures, the primary mode of sign language communication, play a key role in improving sign language recognition. This article presents a video dataset of the hand gestures of Indian sign language (ISL) words used in emergency situations. The videos of eight ISL words have been collected from 26 individuals (12 males and 14 females) in the age group of 22 to 26 years, with two samples from each individual, in an indoor environment with normal lighting conditions. Such a video dataset is highly needed for automatic recognition of emergency situations from sign language for the benefit of the deaf. The dataset is useful for researchers working on vision-based sign language recognition (SLR) as well as hand gesture recognition (HGR). Moreover, support vector machine based classification and deep learning based classification of the emergency gestures have been carried out, and the baseline classification performance shows that the database can be used as a benchmarking dataset for developing novel and improved techniques for recognizing the hand gestures of emergency words in Indian sign language.

| S-EPMC7378574 | biostudies-literature

Mark my words: High frequency marker words impact early stages of language learning.

Project description:High frequency words have been suggested to benefit both speech segmentation and grammatical categorization of the words around them. Despite utilizing similar information, these tasks are usually investigated separately in studies examining learning. We determined whether including high frequency words in continuous speech could support categorization when words are being segmented for the first time. We familiarized learners with continuous artificial speech comprising repetitions of target words, which were preceded by high-frequency marker words. Crucially, marker words distinguished targets into 2 distributionally defined categories. We measured learning with segmentation and categorization tests and compared performance against a control group that heard the artificial speech without these marker words (i.e., just the targets, with no cues for categorization). Participants segmented the target words from speech in both conditions, but critically when the marker words were present, they influenced acquisition of word-referent mappings in a subsequent transfer task, with participants demonstrating better early learning for mappings that were consistent (rather than inconsistent) with the distributional categories. We propose that high-frequency words may assist early grammatical categorization, while speech segmentation is still being learned. (PsycINFO Database Record (c) 2019 APA, all rights reserved).

| S-EPMC6746567 | biostudies-literature

Pashtu Language Digits Dataset.

Project description:Pashtu is a language spoken by 50 million people in the world [1]. It is the national language of Afghanistan and is also spoken in the two largest provinces of Pakistan. It is written in a complex way by calligraphers. Despite the enormous literature and research work on Optical Character Recognition for other languages of the world, this language still requires a mature optical character recognition system [2], [3]. A real dataset of Pashtu digits comprising 50000 scanned images is introduced and made publicly available in this paper. All the digits in the images are handwritten and were collected from faculty members, staff, and students of the Pak-Austria Fachhochschule, Institute of Applied Sciences and Technology, Pakistan. A total of 1250 participants wrote the text, half of them male and half female. The dataset will be publicly available for research purposes.

| S-EPMC9679712 | biostudies-literature

Emotion Words in Early Childhood: A Language Transcript Analysis.

Project description:Children learn the abstract, challenging categories of emotions from young ages, and it has recently been suggested that language (and more specifically emotion words) may aid this learning. To examine the language that young children hear and produce as they're learning emotion categories, the present study examined nearly 2,000 transcripts from 179 children ranging from 15- to 47-months from the Child Language Data Exchange System (CHILDES). Results provide key descriptive, developmental, and predictive information regarding child emotion language production, including the finding that child emotion word production was predicted by mothers' emotion word production (β=.21, p<.001), but not by child or mother language complexity (β=.01, p=.690; β=.00, p=.872). Frequency of specific emotion words are presented, as are developmental trends in early emotion language production and input. These results improve the understanding of children's daily emotional language environments and may inform theories of emotional development.

| S-EPMC8530275 | biostudies-literature

Ultraconserved words point to deep language ancestry across Eurasia.

Project description:The search for ever deeper relationships among the World's languages is bedeviled by the fact that most words evolve too rapidly to preserve evidence of their ancestry beyond 5,000 to 9,000 y. On the other hand, quantitative modeling indicates that some "ultraconserved" words exist that might be used to find evidence for deep linguistic relationships beyond that time barrier. Here we use a statistical model, which takes into account the frequency with which words are used in common everyday speech, to predict the existence of a set of such highly conserved words among seven language families of Eurasia postulated to form a linguistic superfamily that evolved from a common ancestor around 15,000 y ago. We derive a dated phylogenetic tree of this proposed superfamily with a time-depth of ~14,450 y, implying that some frequently used words have been retained in related forms since the end of the last ice age. Words used more than once per 1,000 in everyday speech were 7- to 10-times more likely to show deep ancestry on this tree. Our results suggest a remarkable fidelity in the transmission of some words and give theoretical justification to the search for features of language that might be preserved across wide spans of time and geography.

| S-EPMC3666749 | biostudies-literature

Dataset of sentiment tagged language resources for Bosnian language.

Project description:The Bosnian language holds significant importance as a member of the West-South Slavic subgroup within the Slavic branch of the Indo-European linguistic family. With approximately 2.5 million speakers in Europe, including 1.87 million individuals in Bosnia and Herzegovina alone, the Bosnian language constitutes the mother tongue for a considerable portion of the population. In Natural Language Processing (NLP) tasks related to the Bosnian language, besides removing stop words, it is important to consider the influence of other linguistic elements. Bosnian text contains words derived from diminishers, relative intensifiers, minimizers, maximizers, boosters, and approximators. These words contribute to the overall meaning and sentiment analysis of the text. By including these elements in NLP models and algorithms, researchers can achieve more accurate and nuanced analysis of Bosnian language data, enhancing the effectiveness of NLP applications. The two lists of sentiment-annotated words that form the core of the Bosnian sentiment-annotated lexicon, a list of stopwords, and a list of Affirmative and non-Affirmative words (AnAwords) composed mostly of intensifiers and diminishers were used to construct a dataset that provides the base for sentiment analysis in the Bosnian language.

| S-EPMC10964063 | biostudies-literature

TLFS23 Tamil language fingerspelling dataset.

Project description:Tamil is one of the oldest existing languages, spoken by around 65 million people across India, Sri Lanka and South-East Asia. Countries such as Fiji and South Africa also have a significant population with Tamil ancestry. Tamil is a complex language and has 247 characters. A labelled dataset for Tamil fingerspelling named TLFS23 has been created for research related to vision-based fingerspelling translators for the speech and hearing impaired. The dataset would open up avenues to develop automated systems as translators and interpreters for effective communication between fingerspelling language users and non-users, using computer vision and deep learning algorithms. One thousand images representing each unique finger flexion motion for every Tamil character were collected, constituting a large dataset with 248 classes and a total of 2,55,155 images. The images were contributed by 120 individuals from different age groups. The dataset is made publicly available at: https://data.mendeley.com/datasets/39kzs5pxmk/2.

| S-EPMC10790027 | biostudies-literature
