Project description:The evolution of Information and Communication Technologies (ICT) has changed the way we communicate. Access to the Internet and social networks has even changed the way we organise ourselves socially. Despite advances in this field, research on the use of social networks in political discourse and on citizens’ perceptions of public policy remains scarce. The empirical study of politicians’ discourse on social networks in relation to citizens’ perception of public and fiscal policies according to their political affinity is therefore of particular interest. The aim of the research is to analyse positioning from a dual perspective. Firstly, the study analyses the positioning in the discourse of the communication campaigns that Spain’s most prominent politicians posted on social networks. Secondly, it evaluates whether this positioning is reflected in citizens’ opinions about the public and fiscal policies being implemented in Spain. To this end, a qualitative semantic analysis and a positioning map are performed on a total of 1553 tweets published between 1 June and 31 July 2021 by the leaders of the top ten Spanish political parties. In parallel, a cross-sectional quantitative analysis is carried out, also through positioning analysis, based on the database of the Public Opinion and Fiscal Policy Survey of July 2021 by the Sociological Research Centre (CIS), whose sample is 2849 Spanish citizens. The results show a significant difference in the discourse of political leaders’ social network posts, which is more pronounced between right-wing and left-wing parties, and only some differences in citizens’ perception of public policies according to their political affinity. This work contributes to identifying the differentiation and positioning of the main parties and helps to guide the discourse of their posts.
Project description:Introduction: Twitter data have been used to surveil public sentiment about tobacco products; however, most tobacco-related Twitter research has been conducted with English-language posts. There is a gap in the literature on tobacco-related discussions on Twitter in languages other than English. This study summarized tobacco-related discussions in Spanish on Twitter. Methods: A set of Spanish terms reflecting electronic cigarettes (eg, "cigarillos electrónicos"), cigarettes (eg, "pitillo"), and cigars (eg, "cigaro") were identified. A content analysis of tweets (n = 1352) drawn from 2021 was performed to examine themes and sentiment. An initial codebook was developed in English then translated to Spanish and then translated back to English by a bilingual (Spanish and English) member of the research team. Two bilingual members of the research team coded the tweets into themes and sentiment. Results: Themes in the tweets included (1) product promotion (n = 168, 12.4%), (2) health warnings (n = 161, 11.9%), (3) tobacco use (n = 136, 10.1%), (4) health benefits of vaping (n = 58, 4.3%), (5) cannabis use (n = 50, 3.7%), (6) cessation (n = 47, 3.5%), (7) addiction (n = 33, 2.4%), (8) policy (n = 27, 2.0%), and (9) polysubstance use (n = 12, 0.9%). Neutral (n = 955, 70.6%) was the most common category of sentiment observed in the data. Conclusions: Tobacco products are discussed in multiple languages on Twitter and can be summarized by bilingual research teams. Future research should determine if Spanish-speaking individuals are frequently exposed to pro-tobacco content on social media and if such exposure increases susceptibility to use tobacco among never users or sustained use among current users. Implications: Spanish-language pro-tobacco content exists on Twitter, which has implications for Spanish-speaking individuals who may be exposed to this content.
Spanish-language pro-tobacco-related posts may help normalize tobacco use among Spanish-speaking populations. As a result, anti-tobacco tweets in Spanish may be necessary to counter areas of the online environment that can be considered pro-tobacco.
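The percentages reported in the results can be re-derived from the theme counts and the sample size (n = 1352). The short Python check below is only an arithmetic sketch; the dictionary keys are shorthand labels chosen here for illustration, not the study's codebook.

```python
# Counts as reported in the abstract; N_TWEETS is the number of coded tweets.
N_TWEETS = 1352

theme_counts = {
    "product promotion": 168,
    "health warnings": 161,
    "tobacco use": 136,
    "health benefits of vaping": 58,
    "cannabis use": 50,
    "cessation": 47,
    "addiction": 33,
    "policy": 27,
    "polysubstance use": 12,
}


def share(count, total=N_TWEETS):
    """Return count as a percentage of total, rounded to one decimal place."""
    return round(100 * count / total, 1)


# Percentage of all coded tweets assigned to each theme.
computed = {theme: share(count) for theme, count in theme_counts.items()}

# Share of tweets coded as neutral in sentiment (n = 955).
neutral_share = share(955)

for theme, pct in computed.items():
    print(f"{theme}: {pct}%")
print(f"neutral sentiment: {neutral_share}%")
```

Each computed value matches the figure given in the abstract when rounded to one decimal place.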
Project description:Communication is of paramount importance in responding to health crises. We studied the media messages put forth by different stakeholders in two Ebola vaccine trials that became controversial in Ghana. These interactions between health authorities, political actors, and public citizens can offer key lessons for future research. Through an analysis of online media, we analyse stakeholder concerns and incentives, and the phases of the dispute, to understand how the dispute evolved to the point of the trials being suspended, and analyse what steps might have been taken to avert this outcome. A web-based system was developed to download and analyse news reports relevant to Ebola vaccine trials. This included monitoring major online newspapers in each country with planned clinical trials, including Ghana. All news articles were downloaded, selecting out those containing variants of the words "Ebola," and "vaccine," which were analysed thematically by a team of three coders. Two types of themes were defined: critiques of the trials and rebuttals in favour of the trials. After reconciling differences between coders' results, the data were visualised and reviewed to describe and interpret the debate. A total of 27,460 articles, published between 1 May and 30 July 2015, were collected from nine different newspapers in Ghana, of which 139 articles contained the keywords and met the inclusion criteria. The final codebook included 27 themes, comprising 16 critiques and 11 rebuttals.
After coding and reconciliation, the main critiques (and their associated rebuttals) were selected for in-depth analysis, including statements about the trials being secret (mentioned in 21% of articles), claims that the vaccine trials would cause an Ebola outbreak in Ghana (33%), and the alleged impropriety of the incentives offered to participants (35%). Perceptions that the trials were "secret" arose from a combination of premature news reporting and the fact that the trials were prohibited from conducting any publicity before being approved at the time that the story came out, which created an impression of secrecy. Fears about Ebola being spread in Ghana appeared in two forms, the first alleging that scientists would intentionally infect Ghanaians with Ebola in order to test the vaccine, and the second suggesting that the vaccine might give trial participants Ebola as a side-effect - over the course of the debate, the latter became the more prominent of the two variants. The incentives were sometimes criticised for being coercively large, but were much more often criticised for being too small, which may have been related to a misperception that the incentives were meant as compensation for the trials' risks, which were themselves exaggerated. The rumours captured through this research indicate the variety of strong emotions drawn out by the trials, highlighting the importance of understanding the emotional and social context of such research. The uncertainty, fear, and distrust associated with the trials draw from the contemporary context of the Ebola outbreak, as well as longstanding historical issues in Ghana. By analysing the debate from its inception, we can see how the controversy unfolded, and identify points of concern that can inform health communication, suggesting that this tool may be valuable in future epidemics and crises.
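The keyword-selection step in the methods above (keeping only articles that contain variants of both "Ebola" and "vaccine") could be sketched as follows. This is a minimal illustration, not the authors' web-based system: the regular expressions, the function name `mentions_both`, and the sample texts are all assumptions, since the abstract does not list the exact variants that were matched.

```python
import re

# Hypothetical patterns; the actual keyword variants are not given in the abstract.
EBOLA = re.compile(r"\bebola\b", re.IGNORECASE)
VACCINE = re.compile(r"\bvaccin\w*", re.IGNORECASE)  # vaccine, vaccines, vaccination, ...


def mentions_both(text: str) -> bool:
    """Return True when the text contains a variant of both keywords."""
    return bool(EBOLA.search(text)) and bool(VACCINE.search(text))


# Invented sample texts, used only to exercise the filter.
articles = [
    "Health officials discuss an Ebola vaccine trial",
    "Ebola cases decline in the region",
    "New vaccination campaign against measles announced",
    "Local football results",
]

# Only articles mentioning both keywords go forward to thematic coding.
selected = [a for a in articles if mentions_both(a)]
print(selected)
```

In the study, a filter of this kind reduced 27,460 downloaded articles to the 139 that contained the keywords and met the inclusion criteria; the inclusion criteria themselves are a separate manual step not modelled here.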
Project description:Although the use of network simulators (NS) to predict the behavior of computer networks has increased, users often face a variety of challenges and share them on Stack Overflow (SO). However, the challenges that users deal with have not been studied. This paper presents an NS discussion dataset extracted from SOTorrent, which consists of 2,322 NS-related question posts spanning 17 features. Data collection was conducted in five steps: filtering the initial post dataset using simulator tags, discovering NS-related tags, collecting the tagged posts, extracting the post titles and preprocessing them for LDA (Latent Dirichlet Allocation), and finally applying LDA topic modeling to cluster the NS posts into eight named topics. We believe that this dataset will help the research community highlight issues faced by NS users.
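The first three collection steps described above (filtering by simulator tags, discovering NS-related tags, collecting the tagged posts) can be illustrated with a small sketch. The seed tags, the co-occurrence threshold `MIN_CO`, and the toy posts are assumptions for illustration; the abstract does not specify how NS-related tags were discovered, and co-occurrence counting is just one plausible approach.

```python
from collections import Counter

# Toy posts as (title, tags) pairs. Invented for illustration only.
posts = [
    ("How to trace packets in ns-3?", {"ns-3", "tracing"}),
    ("OMNeT++ module not found", {"omnet++", "inet"}),
    ("INET routing example fails", {"inet"}),
    ("OMNeT++ INET wireless setup", {"omnet++", "inet", "wireless"}),
    ("Python list comprehension", {"python"}),
]

SEED_TAGS = {"ns-3", "omnet++"}  # assumed simulator tags
MIN_CO = 2                       # assumed co-occurrence threshold

# Step 1: keep posts carrying at least one seed simulator tag.
seed_posts = [p for p in posts if p[1] & SEED_TAGS]

# Step 2: discover related tags that co-occur with seed tags often enough.
co_counts = Counter(tag for _, tags in seed_posts for tag in tags - SEED_TAGS)
related_tags = {tag for tag, n in co_counts.items() if n >= MIN_CO}

# Step 3: collect every post tagged with a seed tag or a discovered tag.
ns_tags = SEED_TAGS | related_tags
ns_posts = [p for p in posts if p[1] & ns_tags]

# Step 4 would start from the titles of the collected posts.
titles = [title for title, _ in ns_posts]
print(sorted(related_tags), len(ns_posts))
```

The expanded tag set picks up a post that carries no seed tag at all, which is the point of the tag-discovery step. Topic modeling of the extracted titles would follow with an LDA implementation and is not shown here.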
Project description:The rapid growth of technology has massively increased the amount of text data. The data can be mined and utilized for numerous natural language processing (NLP) tasks, particularly text classification. A core part of text classification is collecting the data needed to train a good model. This paper collects the Kurdish News Dataset Headlines (KNDH) for text classification. The dataset consists of 50000 news headlines equally distributed among five classes, with 10000 headlines for each class (Social, Sport, Health, Economic, and Technology). The share of headlines contributed by each channel differs, while the number of samples is equal for each category. There are 34 distinct channels that are used to collect the different headlines for each class, such as 8 channels for economics, 14 channels for health, 18 channels for science, 15 channels for social, and 5 channels for sport. The dataset is preprocessed using the Kurdish Language Processing Toolkit (KLPT) for tokenizing, spell-checking, stemming, and preprocessing.
Project description:This paper presents a dataset collected from social networks that are mostly used by the youth of Commonwealth of Independent States (CIS) countries. The data was collected from public accounts of the VKontakte social network by using VK.api and applying the most used keywords that would signify depressive mood. The collected data was classified by psychologists into two types: depressive and non-depressive. The dataset consists of 32 018 depressive posts and 32 021 non-depressive posts. Since Russian is the most commonly spoken language in CIS countries, the posts, and consequently the collected data, are in Russian. The data can mostly be useful for researchers who explore tendencies to depression in CIS countries. The dataset is important for the research community, as it was not only collected from open sources, but also labelled by our psychiatrists from the republican scientific and practical center of mental health. Since the dataset has very high validity, it can be used for further research in the field of mental health.
Project description:Every day thousands of news articles are published on the web, and filtering tools can be used to extract knowledge on specific topics. The categorization of news into a predefined set of topics is widely studied in the literature; however, most works are restricted to documents in English. In this work, we make two contributions. First, we introduce a Portuguese news dataset collected from WikiNews, an open-source media outlet that provides news from different sources. Since there is a lack of datasets for Portuguese, and an existing one is from a single news channel, we aim to introduce a dataset from different news channels. The availability of comprehensive datasets plays a key role in advancing research. Second, we compare different architectures for Portuguese news classification, exploring different text representations (BoW, TF-IDF, Embedding) and classification techniques (SVM, CNN, DJINN, BERT), covering classical methods and current technologies. We show the trade-off between accuracy and training time for this application. We aim to show the capabilities of available algorithms and the challenges faced in the area.
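To make the BoW and TF-IDF representations mentioned above concrete, here is a minimal pure-Python sketch. It uses one common TF-IDF variant (raw term frequency multiplied by log(N/df)); the paper's exact weighting and tokenization are not stated, so this is an assumption, and the toy documents are invented.

```python
import math
from collections import Counter

# Invented toy documents (accents omitted for simplicity).
docs = [
    "governo anuncia novo plano economico",
    "time vence jogo e lidera campeonato",
    "governo discute plano de saude",
]


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


# Bag of words: raw term counts per document.
bow = [Counter(tokenize(d)) for d in docs]

# Document frequency: number of documents that contain each term.
df = Counter(term for counts in bow for term in counts)
n_docs = len(docs)


def tfidf(counts):
    """Weight each term count by log(N / df), so rarer terms score higher."""
    return {term: tf * math.log(n_docs / df[term]) for term, tf in counts.items()}


vectors = [tfidf(counts) for counts in bow]

# "anuncia" occurs in one document, "governo" in two, so "anuncia" weighs more.
print(vectors[0]["anuncia"], vectors[0]["governo"])
```

Sparse vectors like these are the kind of input a classical classifier such as an SVM would consume, whereas the neural models compared in the paper typically operate on embeddings.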
Project description:Alongside traditional news publishing, news agencies now share news over the internet, since people nowadays prefer reading news online. Moreover, news media maintain YouTube channels to publish visual stories. Readers comment below the corresponding news item to share their opinions. These news items and comments have been a great source of information and research. However, there is a lack of research in the Bengali news context. This article presents a dataset containing 762,678 public comments and replies from 16,016 news videos published from 2017 to 2023 on a renowned Bengali news YouTube channel. The data contains 15 properties that include video URL, title, likes, views, date of publishing, hashtags, description, comment author, comment time, comment, likes in the comment, reply author, reply time, reply, and likes in the responses. To ensure privacy, commenters' names are encoded in the dataset. The dataset is open to use for researchers at https://data.mendeley.com/datasets/3c3j3bkxvn/4. A translated file for the raw dataset is also included. This data may help scholars to identify patterns in public opinion and analyze how public opinion changes over time.
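The 15 properties listed above suggest a flat record per comment or reply. The sketch below is one hypothetical way to model such a record in Python. The field names are illustrative and are not the column names of the published files, and the SHA-256 helper is only an assumed stand-in for the dataset's unspecified encoding of commenter names.

```python
import hashlib
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class CommentRecord:
    # Video-level properties (hypothetical field names).
    video_url: str
    title: str
    likes: int
    views: int
    published: str
    hashtags: str
    description: str
    # Comment-level properties.
    comment_author: str
    comment_time: str
    comment: str
    comment_likes: int
    # Reply-level properties; empty when the row holds a top-level comment.
    reply_author: Optional[str] = None
    reply_time: Optional[str] = None
    reply: Optional[str] = None
    reply_likes: Optional[int] = None


def encode_author(name: str) -> str:
    """Assumed pseudonymization: a truncated SHA-256 digest of the name."""
    return hashlib.sha256(name.encode("utf-8")).hexdigest()[:12]


FIELD_NAMES = [f.name for f in fields(CommentRecord)]
print(len(FIELD_NAMES), encode_author("example user"))
```

A deterministic encoding like this keeps all comments by the same author linkable without exposing the name, which is useful for analysing how an individual's opinions change over time.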
Project description:Topic modeling is an active research area with several unanswered questions. The focus of recent research in this area is on the use of a vector embedding representation of the input text with both generative and evolutionary topic modeling techniques. Unfortunately, it is hard to compare different techniques when the underlying data and preprocessing steps that were used to develop the models are not available. This paper presents two secondary datasets that can help address this gap. These datasets are derived from two primary datasets. The first consists of 8145 posts from the r/Cancer health forum and the second consists of 18,294 messages submitted to 20 different news groups. The same preprocessing procedure is applied to both datasets by removing punctuation, stop words and high frequency words. Each dataset is then clustered using three different topic modeling techniques (pPSO, ETM and NVDM) and three topic numbers (10, 20 and 30). In addition, for pPSO two text embedding representations are considered: sBERT and Skipgram. The secondary datasets were originally developed in support of a comparative analysis of the aforementioned topic modeling techniques in a study titled "Comparing PSO-based Clustering over Contextual Vector Embeddings to Modern Topic Modeling" submitted to the Journal of Information Processing and Management. The present paper provides a detailed description of the two secondary datasets, including the unique identifier that can be used to retrieve the original documents, the pre-processing scripts, and the topic keywords generated by the three topic modeling techniques with varying topic numbers and embedding representations. As such, the datasets allow direct comparison with other topic modeling techniques. To further facilitate this process, the algorithm underlying the evolutionary topic modeling technique, pPSO, proposed by the authors is also provided.
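The shared preprocessing described above (removing punctuation, stop words and high-frequency words) can be sketched as below. This is not the authors' script: the stop-word list, the document-frequency cutoff `max_doc_ratio`, and the sample texts are assumptions chosen for illustration.

```python
import string
from collections import Counter

STOP_WORDS = {"the", "a", "is", "and", "of", "to", "in"}  # assumed tiny list
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def tokenize(text):
    """Strip ASCII punctuation, lowercase, and split on whitespace."""
    return text.translate(PUNCT_TABLE).lower().split()


def preprocess(docs, max_doc_ratio=0.8):
    """Remove stop words, then terms found in more than max_doc_ratio of docs."""
    tokenized = [[t for t in tokenize(d) if t not in STOP_WORDS] for d in docs]
    # Document frequency of each remaining term.
    df = Counter(t for tokens in tokenized for t in set(tokens))
    limit = max_doc_ratio * len(docs)
    return [[t for t in tokens if df[t] <= limit] for tokens in tokenized]


# Invented sample texts.
docs = [
    "The treatment is hard, and the forum helps.",
    "Forum members share treatment tips.",
    "A new forum thread about diet.",
]

cleaned = preprocess(docs)
print(cleaned)
```

Here "forum" appears in every document and is dropped as a high-frequency word, while "treatment" (two of three documents) stays under the assumed cutoff. Applying one such routine to both corpora is what makes the resulting topic models comparable.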
Project description:Radon is, after tobacco, the most frequent cause of lung cancer. Communicating its risks in a didactic way, so that citizens become aware and take action to avoid radon, remains a challenge. This research is framed in Spain, where 17% of the territory exceeds the maximum radon limits allowed by the WHO, and aims to study the role and impact of the media in radon risk communication. A mixed methodological design is applied, combining content analysis of news published in the last two decades by local media in the most affected areas with interviews with journalists and a survey of citizens to provide a multi-perspective approach. The results show that, although news coverage of radon is becoming more frequent, it is a topic that fails to position itself on the agenda for effective communication. The media are the most frequent source of information on radon, although the public does not consider them the most trustworthy one. News stories about radon focus mainly on health and research to inform about the radon levels to which citizens are exposed and the risks associated with cancer. Collaborative strategies between the media, organizations, and public administration seem key to advancing the fight against radon.