Project description:Neural keyword spotting could form the basis of a speech brain-computer interface for menu navigation if it can be performed with low latency and high specificity, comparable to the "wake-word" functionality of modern voice-activated AI assistants. This study investigated neural keyword spotting using motor representations of speech in invasively recorded electrocorticographic (ECoG) signals as a proof of concept. Neural matched filters were created from monosyllabic consonant-vowel utterances: one keyword utterance and 11 similar non-keyword utterances. These filters were used in an analog of the acoustic keyword spotting problem, applied for the first time to neural data. The filter templates were cross-correlated with the neural signal, capturing the temporal dynamics of neural activation across cortical sites. Neural voice activity detection (VAD) identified utterance times, and a discriminative classifier determined whether each utterance was the keyword or non-keyword speech. Model performance appeared to depend strongly on electrode placement and spatial density. Vowel height (/a/ vs. /i/) was poorly discriminated in recordings from sensorimotor cortex but was highly discriminable using neural features from superior temporal gyrus during self-monitoring. The best-performing neural keyword detection (five keyword detections with two false positives across 60 utterances) and neural VAD (100% sensitivity, ~1 false detection per 10 utterances) came from high-density (2 mm electrode diameter, 5 mm pitch) recordings from ventral sensorimotor cortex, suggesting that the spatial fidelity and extent of high-density ECoG arrays may be sufficient for speech brain-computer interfaces.
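The matched-filter step described above can be sketched as follows: a keyword template is slid across a multichannel neural feature trace and correlated at every lag, and peaks in the resulting score trace mark candidate keyword times. The array shapes, the mean-centering choice, and the function name are illustrative assumptions, not the study's exact preprocessing.

```python
import numpy as np

def matched_filter_scores(signal, template):
    """Cross-correlate a neural template with a multichannel signal.

    signal:   (n_channels, n_samples) array, e.g. high-gamma ECoG features
    template: (n_channels, n_template) keyword template; both shapes are
              illustrative, not the paper's exact pipeline.
    Returns a 1-D score trace whose peaks mark candidate keyword times.
    """
    n_ch, n_t = template.shape
    n_out = signal.shape[1] - n_t + 1
    scores = np.zeros(n_out)
    # Center the template once; then slide it over the signal, summing
    # per-channel correlations so that spatiotemporal structure across
    # electrodes contributes to a single detection score.
    tmpl = template - template.mean(axis=1, keepdims=True)
    for i in range(n_out):
        win = signal[:, i:i + n_t]
        win = win - win.mean(axis=1, keepdims=True)
        scores[i] = np.sum(tmpl * win)
    return scores
```

Thresholding the score trace (or feeding windows around its peaks to a discriminative classifier, as the study does) then yields keyword detections.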
Project description:With the ever-increasing abundance of biomedical articles, improving the accuracy of keyword search results is crucial for ensuring reproducible research. However, keyword extraction for biomedical articles is difficult because of obscure keywords and the lack of a comprehensive benchmark. PubMedAKE is an author-assigned keyword extraction dataset containing the title, abstract, and keywords of 843,269 articles from the PubMed open-access subset database. The dataset, publicly available on Zenodo, is the largest keyword extraction benchmark with sufficient samples to train neural networks. Experimental results using state-of-the-art baseline methods illustrate the need for better automatic keyword extraction methods for biomedical literature.
Project description:Scientific production has increased exponentially in recent years, so methodological strategies are needed to obtain holistic, macro-level views of the major research trends in specific fields. Data mining is a useful technique for this task. Our study presents a global analysis of the information generated over recent decades in the Sport Sciences Category (SSC) of the Web of Science database. We analysed the frequency of appearance and the dynamics of Author Keywords (AKs) over the last thirty years, along with the network of co-occurrences between words and the survival time of new words that have appeared since 2001. One of the main findings is the identification of six large thematic clusters in the SSC. Two major terms coexist ('REHABILITATION' and 'EXERCISE'); both show a high frequency of appearance and play a key role in the calculated co-occurrence networks. Another significant finding is that new AKs are readily accepted in the SSC, given the high percentage of new terms during 2001-2006, although these terms have a low survival period. These results support a multidisciplinary perspective within the Sport Sciences field and, according to our AK analysis, a colonization of the field by rehabilitation.
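The keyword co-occurrence network described above can be built by counting, for every article, each unordered pair of its author keywords. The toy records and the upper-casing normalization below are illustrative stand-ins for the study's Web of Science processing.

```python
from collections import Counter
from itertools import combinations

def keyword_cooccurrence(records):
    """Count keyword frequencies and pairwise co-occurrences.

    records: list of author-keyword lists, one per article (a toy stand-in
    for Web of Science Sport Sciences Category records).
    Returns (keyword frequency Counter, co-occurrence edge-weight Counter).
    """
    freq = Counter()
    edges = Counter()
    for kws in records:
        # Normalize case and deduplicate within an article; sorting makes
        # each undirected edge a canonical (A, B) tuple with A < B.
        kws = sorted({k.upper() for k in kws})
        freq.update(kws)
        edges.update(combinations(kws, 2))
    return freq, edges
```

The resulting edge weights can be loaded directly into a graph library to compute the clusters and centrality measures the study reports.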
Project description:A document's keywords provide high-level descriptions of its content, summarizing the document's central themes, concepts, ideas, or arguments. These descriptive phrases make it easier for algorithms to find relevant information quickly and efficiently, and they play a vital role in document processing tasks such as indexing, classification, clustering, and summarization. Traditional keyword extraction approaches rely largely on the statistical distributions of key terms in a document. Recent technological advances show that contextual information is critical in determining the semantics of text, so context-based features may likewise benefit keyword extraction; for example, the words immediately before and after a phrase of interest can describe its context. This research presents several experiments to validate that context-based keyword extraction outperforms traditional methods, and that the proposed KeyBERT-based methodology improves results further. The proposed approach identifies a group of important words or phrases in the document's content that reflect the authors' main ideas, concepts, or arguments, using contextual word embeddings to extract keywords. The findings are compared with those obtained using older approaches such as TextRank, RAKE, Gensim, YAKE, and TF-IDF. The Journals of Universal Computer (JUCS) dataset was employed, with keywords generated from abstracts only; the KeyBERT model outperformed traditional approaches in producing keywords similar to those provided by the authors, with an average similarity of 51% to the author-assigned keywords.
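The core KeyBERT idea can be sketched as follows: embed the whole document and each candidate phrase, then keep the candidates whose embeddings are most similar to the document's. Here a simple bag-of-words vector stands in for the contextual BERT embeddings the actual method uses; the function name and candidate phrases are illustrative.

```python
import math
import re
from collections import Counter

def extract_keywords(doc, candidates, top_n=3):
    """Rank candidate phrases by embedding similarity to the document.

    A bag-of-words count vector stands in for contextual BERT embeddings;
    the ranking logic (cosine similarity to the full document) mirrors
    the KeyBERT approach described above.
    """
    def embed(text):
        return Counter(re.findall(r"[a-z]+", text.lower()))

    def cosine(a, b):
        dot = sum(a[w] * b[w] for w in a)
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    doc_vec = embed(doc)
    ranked = sorted(candidates, key=lambda c: cosine(doc_vec, embed(c)), reverse=True)
    return ranked[:top_n]
```

With real sentence-transformer embeddings, the same ranking rewards phrases that are semantically, not just lexically, close to the document.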
Project description:Despite growing interest in archiving information in synthetic DNA to confront the data explosion, quantitatively querying data stored in DNA remains a challenge. Herein, we present Search Enabled by Enzymatic Keyword Recognition (SEEKER), which utilizes CRISPR-Cas12a to rapidly generate visible fluorescence when a DNA target corresponding to the keyword of interest is present. SEEKER achieves quantitative text searching because the growth rate of fluorescence intensity is proportional to keyword frequency. Compatible with SEEKER, we develop non-collision grouping coding, which reduces the size of the dictionary and enables lossless compression without disrupting the original order of texts. Using four queries, we correctly identify keywords in 40 files against a background of ~8000 irrelevant terms. Parallel searching with SEEKER can be performed on a 3D-printed microfluidic chip. Overall, SEEKER provides a quantitative approach to parallel searching over the complete content stored in DNA, with simple implementation and rapid result generation.
Project description:Interpreting and integrating results from omics studies typically requires a comprehensive and time-consuming survey of the extant literature. GeneCup is a literature-mining web service that retrieves sentences containing user-provided gene symbols and keywords from PubMed abstracts. The keywords are organized into an ontology and can be extended to include results from human genome-wide association studies; as an example, we provide a drug-addiction keyword ontology containing over 300 keywords. The literature search queries the PubMed server through a programming interface and then retrieves abstracts from a local copy of the PubMed archive. The main results presented to the user are sentences in which a gene symbol and a keyword co-occur, displayed through an interactive graphical interface or as tables; all results are linked to the original abstract in PubMed. In addition, a convolutional neural network is employed to distinguish sentences describing systemic stress from those describing cellular stress. The automated and comprehensive search strategy provided by GeneCup facilitates the integration of new discoveries from omics studies with the existing literature. GeneCup is free and open-source software. The source code and a link to a running instance are available at https://github.com/hakangunturkun/GeneCup.
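The central GeneCup output, sentences in which a gene symbol and an ontology keyword co-occur, can be sketched with a simplified matcher. The regex-based sentence split and case-insensitive keyword test below are stand-ins for the service's actual pipeline, not its implementation.

```python
import re

def cooccurring_sentences(abstract, gene, keywords):
    """Return sentences in which the gene symbol and at least one
    ontology keyword co-occur (the core GeneCup result unit).

    Sentence splitting and matching are deliberately simplified
    stand-ins for the pipeline described above.
    """
    hits = []
    for sent in re.split(r"(?<=[.!?])\s+", abstract):
        # Word-boundary match keeps 'TH' from matching inside 'THE'.
        if re.search(rf"\b{re.escape(gene)}\b", sent):
            if any(k.lower() in sent.lower() for k in keywords):
                hits.append(sent)
    return hits
```

Run over a local abstract archive, each hit would then be linked back to its PubMed record for display.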
Project description:Fraud detection through auditors' manual review of accounting and financial records has traditionally relied on human experience and intuition, but replicating this task with technological tools has been a challenge for information-security researchers. Natural language processing techniques such as topic modeling have been explored to extract information from, and categorize, large sets of documents. Topic models such as latent Dirichlet allocation (LDA) and non-negative matrix factorization (NMF) have recently gained popularity for discovering thematic structures in text collections, but unsupervised topic modeling may not produce the best results for specific tasks such as fraud detection. In the present work, we therefore propose semi-supervised topic modeling, which incorporates domain knowledge through seed keywords to learn latent topics related to fraud. By leveraging relevant keywords, the proposed approach aims to identify patterns related to the vertices of the fraud triangle theory, providing more consistent and interpretable results for fraud detection. The model's performance was evaluated by training on several datasets and testing on another that was not used in training. The results showed efficient average performance, with a 7% improvement over previous work. Overall, the study emphasizes the importance of deeper analysis of fraud behaviors and of strategies to identify fraud proactively.
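The keyword-seeding idea can be illustrated with a minimal sketch: each fraud-triangle vertex gets a seed-keyword set, and a document is scored against each vertex by counting seed hits. A full semi-supervised topic model instead biases topic-word priors toward the seeds; the seed lists and function below are hypothetical simplifications, not the study's actual lexicon.

```python
import re
from collections import Counter

# Illustrative (invented) seed keywords for the three fraud-triangle
# vertices; real seed lists are a domain-specific modeling choice.
SEED_TOPICS = {
    "pressure": {"debt", "quota", "deadline", "bonus"},
    "opportunity": {"access", "override", "unsupervised", "loophole"},
    "rationalization": {"deserve", "borrow", "temporary", "everyone"},
}

def seeded_topic_scores(text, seed_topics=SEED_TOPICS):
    """Score a document against each seeded topic by counting
    seed-keyword occurrences; this keeps only the seed-matching core
    of a semi-supervised topic model."""
    words = Counter(re.findall(r"[a-z]+", text.lower()))
    return {t: sum(words[w] for w in seeds) for t, seeds in seed_topics.items()}
```

In the full model, the seeds only initialize or constrain the topics, so related but unlisted words can still be learned into each vertex's topic.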
Project description:Background: The literature suggests that specific keywords included in summative rotation assessments might be an early indicator of abnormal progress or failure. Objective: This study aims to determine the possible relationship between specific keywords on in-training evaluation reports (ITERs) and subsequent abnormal progress or failure, with the goal of creating a functional algorithm to identify residents at risk of failure. Methods: A database of all ITERs from all residents training in accredited programs at Université Laval between 2001 and 2013 was created. An instructional designer reviewed all ITERs and proposed terms associated with reinforcing and underperformance feedback. An algorithm based on these keywords was constructed by recursive partitioning using classification and regression tree methods and tuned to achieve 100% sensitivity while maximizing specificity. Results: There were 41,618 ITERs for 3,292 registered residents. Residents with failure to progress were detected for family medicine (6%, 67 of 1,129) and 36 other specialties (4%, 78 of 2,163), with positive predictive values of 23.3% and 23.4%, respectively. The low positive predictive value may reflect residents improving their performance after receiving feedback, or a reluctance by supervisors to ascribe a "fail" or "in difficulty" score on the ITERs. Conclusions: Classification and regression trees may help identify pertinent keywords and create an algorithm that could be implemented in an electronic assessment system to detect residents at risk of poor performance.
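The "tune for 100% sensitivity, then maximize specificity" step can be illustrated on a single keyword-count feature. The study grows a full classification and regression tree over many keywords; the one-feature threshold, the scores, and the function name below are a hypothetical simplification of that tuning step.

```python
def tune_threshold(scores, labels):
    """Pick the keyword-score threshold that keeps sensitivity at 100%
    (every at-risk resident flagged) while maximizing specificity.

    scores: underperformance-keyword counts per resident (an illustrative
    single feature); labels: 1 = failure to progress, 0 = normal progress.
    Returns (threshold, specificity at that threshold).
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    # Flagging everyone with score >= threshold keeps sensitivity at 100%
    # exactly when the threshold does not exceed the lowest-scoring true
    # positive; the highest such threshold maximizes specificity.
    thr = min(positives)
    flagged = [s >= thr for s in scores]
    tn = sum(1 for f, y in zip(flagged, labels) if not f and y == 0)
    n_neg = sum(1 for y in labels if y == 0)
    specificity = tn / n_neg if n_neg else 1.0
    return thr, specificity
```

This trade-off is why the reported positive predictive value is low: guaranteeing that no at-risk resident is missed forces the algorithm to tolerate false positives.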