Data update and extension (2021-02-01)

Tagged as: blog, data set
Group: G_20/21. Updating the data set by collecting the Twitter data three weeks after the last relevant date and adding the follower connections

With a buffer of three weeks after the end of 2020, we collected the politicians' Twitter data again. This allows us to analyse tweet and engagement data roughly evenly across our whole dataset. While scraping this updated data, we noticed that three politicians from our list had deleted their accounts and one had renamed his account since our first scraping session. We adjusted our list accordingly.

Additionally, we collected the follower connections between the politicians, as well as their relations to the Twitter accounts of the most important German news outlets and virologists. These additional accounts were selected by taking all entries with an active Twitter account from these lists of news outlets and virologists, respectively. With this information we can build follower networks to examine the relations between these different accounts. Together with the analysis of URLs and mentions in the tweet dataset, this gives us a good impression of which politicians quote and refer to which institutions.
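The follower connections can be represented as a simple directed edge set. A minimal sketch with hypothetical account names (the real data covers all politicians, news outlets and virologists from our lists):

```python
# Hypothetical sample of follower connections (follower, followed).
# The real edges come from the collected Twitter follower data.
follows = {
    ("pol_a", "tagesschau"),
    ("pol_a", "drosten"),
    ("pol_b", "tagesschau"),
    ("pol_b", "pol_a"),
}

NEWS_OUTLETS = {"tagesschau"}
VIROLOGISTS = {"drosten"}

def followed_accounts(politician, targets):
    """Which accounts from a target group does a politician follow?"""
    return {dst for src, dst in follows if src == politician and dst in targets}

def in_degree(account):
    """How many accounts in the sample follow this account?"""
    return sum(1 for _, dst in follows if dst == account)
```

From such an edge set, a network library can then build the full follower graph for further analysis.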

We updated the word list for matching Covid-19-related tweets by adding the plural and female forms of the words. After analyzing the collected tweets for word frequencies, we added further Covid-related hashtags to the word list. Random spot checks of the matching results showed that further adjustments were necessary. First, the matching was revised so that at least two words from the word list that cannot be clearly assigned to Covid-19 thematically must appear in a tweet for it to count as a Covid tweet. Second, the matching was extended with pattern matching, so that words from the word list can be matched with flexible endings. The process now takes place in three steps:

  1. The tweet is searched for unique matches with words from the word list
  2. Pattern matching for words with flexible endings
  3. Fuzzy matching, to catch possible spelling mistakes and similar variations
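A minimal sketch of this three-step matching. The word lists and patterns here are illustrative stand-ins (the real lists are much larger), and `difflib` stands in for whichever fuzzy-matching approach the project actually uses:

```python
import re
import difflib

# Hypothetical word lists; the real German lists are far larger.
UNIQUE_WORDS = {"covid", "corona", "lockdown"}      # clearly Covid-related
AMBIGUOUS_WORDS = {"impfung", "maske", "inzidenz"}  # need at least two hits
STEM_PATTERNS = [r"lockdown\w*", r"quarant\w+"]     # flexible word endings

def is_covid_tweet(tokens, fuzzy_cutoff=0.85):
    """Classify a pre-cleaned, tokenized tweet in three steps."""
    # Step 1: unique matches with clearly Covid-related words;
    # ambiguous words only count if at least two of them appear.
    if UNIQUE_WORDS & set(tokens):
        return True
    if len(AMBIGUOUS_WORDS & set(tokens)) >= 2:
        return True
    # Step 2: pattern matching for words with flexible endings
    for tok in tokens:
        if any(re.fullmatch(p, tok) for p in STEM_PATTERNS):
            return True
    # Step 3: fuzzy matching to catch spelling mistakes and the like
    for tok in tokens:
        if difflib.get_close_matches(tok, UNIQUE_WORDS, n=1, cutoff=fuzzy_cutoff):
            return True
    return False
```

For example, a tweet containing only "impfung" would not match (one ambiguous word), while one containing both "impfung" and "maske" would.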

Data clean up

To analyse the collected data we first need to clean it. For this step we use the libraries nltk1 and spacy2. Combined with regular expressions, we can remove stopwords, links, mentions, punctuation and smileys from the tweets before filtering for Covid tweets. The word list is also preprocessed before matching by removing numbers and punctuation. For some analyses, such as topic detection, we additionally reduce each tweet to its lemmatized nouns.
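The regex part of this cleaning can be sketched as follows. The stopword list here is a tiny illustrative stand-in; the actual pipeline uses the German stopword lists from nltk and spacy:

```python
import re

# Tiny illustrative stopword list; the project uses nltk's and
# spacy's German stopword lists instead.
STOPWORDS = {"der", "die", "das", "und", "wir", "ist"}

def clean_tweet(text):
    """Strip links, mentions, punctuation/smileys and stopwords from a tweet."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)   # remove links
    text = re.sub(r"@\w+", " ", text)           # remove mentions
    text = re.sub(r"[^a-zäöüß\s]", " ", text)   # drop punctuation, digits, smileys
    return [t for t in text.split() if t not in STOPWORDS]
```

The resulting token list is what the Covid-tweet matching then operates on.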

1https://pypi.org/project/nltk/

2https://pypi.org/project/spacy/