Deriving Disinformation Insights from Geolocalized Twitter Callouts, Singapore
16 August 2021
David Tuxworth of the Crime and Security Research Institute (CSRI) presents research to the Association for Computing Machinery (ACM) Special Interest Group on Knowledge Discovery and Data Mining (KDD).
Based in Singapore, KDD 2021 is a premier data mining conference that is taking place virtually between the 14th and 18th of August, 2021. The event seeks to further the advancement, education, and adoption of the "science" of knowledge discovery and data mining from all types of data stored in computers and networks of computers.
The paper Deriving Disinformation Insights from Geolocalized Twitter Callouts, by authors in the CSRI, was recently presented at WIT: Workshop On Deriving Insights From User-Generated Text, part of KDD 2021. The research was shortlisted for the workshop because it helps address the challenges of harnessing text-heavy user-generated data, in particular converting unstructured text into a structured form from which insights can be obtained.
Presented by CSRI Research Associate David Tuxworth, the work utilises the rich stream of user-generated data that can be found on social media to derive insights into disinformation. The research makes a number of contributions, including a novel transformer-based geolocation method that performs in multiple languages, as well as an analytical method that uses lexical specificity and word embeddings to interrogate multilingual user-generated content with respect to mis/disinformation narratives. In addition, a dataset of 36 million disinformation-related tweets in English, French and Spanish has been made available to other researchers.
The study was conceived to aid the practical task of detecting, tracking and understanding disinformation operations in a variety of geopolitical contexts. To this end, Twitter data relating to misinformation, disinformation and related terms including propaganda and ‘fake news’ have been continuously collected since 2019 in multiple languages including English, French and Spanish, which are the languages of focus in the study.
The work shows that user-generated content in multiple languages can be used as a data source for deriving insights into disinformation. To achieve this, a transformer-based classifier is first trained on the 0.34% of the 87.9 million tweets that contain geolocation data. The classifier is then applied to the rest of the data, separating it into European and non-European tweets. This is done for two periods, 2019 and 2020, in English, French and Spanish, allowing for multiple types of comparative analysis. The research then demonstrates that monolingual classifiers trained and tested on data from the same year outperform multilingual classifiers.
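The train-on-the-geotagged-subset, label-the-rest workflow described above can be illustrated with a minimal sketch. This is not the authors' implementation: a small bag-of-words Naive Bayes model stands in for the transformer-based classifier so that the example runs with the Python standard library alone, and the function names, the `region` field and the example tweets are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(labelled):
    """Fit a bag-of-words Naive Bayes model on (text, label) pairs.

    Stand-in for the transformer-based classifier described in the article.
    """
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    vocab = set()
    for text, label in labelled:
        tokens = text.lower().split()
        word_counts[label].update(tokens)
        label_counts[label] += 1
        vocab.update(tokens)
    return {"word_counts": word_counts, "label_counts": label_counts, "vocab": vocab}


def predict(model, text):
    """Return the most probable label for text, using add-one smoothing."""
    tokens = text.lower().split()
    total_docs = sum(model["label_counts"].values())
    vocab_size = len(model["vocab"])
    best_label, best_score = None, -math.inf
    for label, n_docs in model["label_counts"].items():
        counts = model["word_counts"][label]
        total_tokens = sum(counts.values())
        score = math.log(n_docs / total_docs)
        for tok in tokens:
            score += math.log((counts[tok] + 1) / (total_tokens + vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def label_corpus(tweets):
    """Train on tweets that carry a region, then label every tweet.

    Geotagged tweets keep their known region; the rest get a predicted one.
    """
    geotagged = [(t["text"], t["region"]) for t in tweets if t.get("region")]
    model = train_naive_bayes(geotagged)
    return [t.get("region") or predict(model, t["text"]) for t in tweets]


# Invented toy tweets: four carry a (hypothetical) region field, two do not.
tweets = [
    {"text": "fake news about brexit in london", "region": "europe"},
    {"text": "propaganda ahead of the paris election", "region": "europe"},
    {"text": "fake news about the senate in washington", "region": "non_europe"},
    {"text": "propaganda on the texas border", "region": "non_europe"},
    {"text": "more brexit propaganda from london"},
    {"text": "washington senate disinformation"},
]
labels = label_corpus(tweets)
```

The property the sketch shares with the approach described in the article is that the training labels come from the corpus itself, so no external annotated data is needed as long as enough tweets are geotagged.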
The paper also illustrates how geolocation metadata from a relatively small subset of tweets can be used to classify the entire set. An advantage of this method is that the data used to train the classifier is self-contained and usable so long as there is a large enough volume of geolocated tweets to make machine learning methods viable. In addition, lexical specificity and word embeddings are used to explore the classified tweets and reveal insights into disinformation. For example, comparing the words most similar to a relevant keyword reveals conspiracy theories surrounding the origin of COVID-19.
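The two analytical tools mentioned above can be sketched in a few lines. The sketch assumes the common hypergeometric formulation of lexical specificity (how surprising a word's frequency in a subcorpus is, given its frequency in the whole corpus) and cosine similarity for ranking embedding neighbours; the paper's exact formulation may differ. The function names and the three-dimensional vectors are invented for illustration and are not results from the study.

```python
import math


def lexical_specificity(corpus_size, subcorpus_size, corpus_freq, subcorpus_freq):
    """Return -log10 P(X >= subcorpus_freq) for a hypergeometric X.

    X counts how often a word with corpus_freq occurrences in the whole corpus
    would land in a random subcorpus of subcorpus_size tokens.
    Exact integer arithmetic: fine for toy sizes, slow for very large corpora.
    """
    upper = min(corpus_freq, subcorpus_size)
    if not 0 <= subcorpus_freq <= upper:
        raise ValueError("subcorpus_freq must lie between 0 and min(corpus_freq, subcorpus_size)")
    tail = sum(
        math.comb(corpus_freq, i)
        * math.comb(corpus_size - corpus_freq, subcorpus_size - i)
        for i in range(subcorpus_freq, upper + 1)
    )
    return -math.log10(tail / math.comb(corpus_size, subcorpus_size))


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(word, embeddings, topn=3):
    """Rank every other word in embeddings by cosine similarity to word."""
    target = embeddings[word]
    scored = [
        (other, cosine(target, vec))
        for other, vec in embeddings.items()
        if other != word
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


# Made-up toy vectors; real embeddings would be trained on each tweet subset.
toy_embeddings = {
    "virus": [1.0, 0.8, 0.0],
    "lab": [0.9, 1.0, 0.1],
    "vote": [0.0, 0.1, 1.0],
}
neighbours = most_similar("virus", toy_embeddings, topn=2)
```

Under these assumptions, one would compute specificity scores separately for, say, the European and non-European subsets, and train one embedding model per subset, so that the neighbours of the same keyword can be compared side by side.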