06 Building the corpus (2021-01-16)

Tagged as: blog,nlp,covid-19
Group: H_20/21
Current state of scraping and planned analysis.

In the meantime, we have begun scraping the Twitter content of our selected users on a daily basis. On 16th December 2020 we started collecting each user's tweets, the whole conversation attached to each tweet, as well as any quoted retweets. Since the Twitter API only allows retrieving tweets that are up to 7 days old, our data reaches back to 10th December, and we plan to keep scraping for six weeks, until 21st January 2021.
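
To give an idea of what such a daily scraping job can look like, here is a minimal sketch against the Twitter API v2 recent search endpoint. The user list, token handling, and storage step are placeholders for illustration, not our actual scripts.

```python
import os
import requests

# Illustrative sketch only: the endpoint and field names follow the
# Twitter API v2 recent search documentation; USERS and the bearer
# token variable are placeholders, not our real configuration.
BEARER_TOKEN = os.environ["TWITTER_BEARER_TOKEN"]
SEARCH_URL = "https://api.twitter.com/2/tweets/search/recent"
USERS = ["example_account"]  # hypothetical selected users

def fetch_recent_tweets(username):
    """Fetch the user's tweets from the last 7 days, including the
    conversation and quote information needed for follow-up requests."""
    params = {
        "query": f"from:{username}",
        "max_results": 100,
        "tweet.fields": "conversation_id,created_at,referenced_tweets",
    }
    headers = {"Authorization": f"Bearer {BEARER_TOKEN}"}
    response = requests.get(SEARCH_URL, headers=headers, params=params)
    response.raise_for_status()
    return response.json().get("data", [])

def fetch_conversation(conversation_id):
    """Fetch the replies that belong to one conversation thread."""
    params = {
        "query": f"conversation_id:{conversation_id}",
        "max_results": 100,
        "tweet.fields": "author_id,created_at,in_reply_to_user_id",
    }
    headers = {"Authorization": f"Bearer {BEARER_TOKEN}"}
    response = requests.get(SEARCH_URL, headers=headers, params=params)
    response.raise_for_status()
    return response.json().get("data", [])

if __name__ == "__main__":
    for user in USERS:
        for tweet in fetch_recent_tweets(user):
            replies = fetch_conversation(tweet["conversation_id"])
            # ...store the tweet, its replies, and quoted retweets in the corpus
```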

Additionally, we run a daily script that retrieves the number of Likes, Retweets and similar metrics for all tweets that are exactly 14 days old. This gives every tweet enough time to be discussed and shared, while also guaranteeing that we treat each tweet the same (otherwise a tweet from the beginning of the scraping period would have had much more time to be shared and liked than one from the end).
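
A sketch of this 14-day metrics job is shown below, assuming the scraped tweets can be filtered by their posting date. The `tweets_created_on` helper and the printed output are stand-ins for our actual corpus storage.

```python
import os
from datetime import date, timedelta

import requests

BEARER_TOKEN = os.environ["TWITTER_BEARER_TOKEN"]
LOOKUP_URL = "https://api.twitter.com/2/tweets"

def tweets_created_on(target_day):
    """Placeholder: return the IDs of corpus tweets posted on target_day.
    In practice this would query the stored corpus."""
    return []

def fetch_public_metrics(tweet_ids):
    """Look up like/retweet/reply/quote counts for a list of tweet IDs."""
    headers = {"Authorization": f"Bearer {BEARER_TOKEN}"}
    metrics = {}
    # The tweets lookup endpoint accepts at most 100 IDs per request.
    for start in range(0, len(tweet_ids), 100):
        batch = tweet_ids[start:start + 100]
        params = {"ids": ",".join(batch), "tweet.fields": "public_metrics"}
        response = requests.get(LOOKUP_URL, headers=headers, params=params)
        response.raise_for_status()
        for tweet in response.json().get("data", []):
            metrics[tweet["id"]] = tweet["public_metrics"]
    return metrics

if __name__ == "__main__":
    # Every tweet is measured exactly 14 days after it was posted,
    # so all tweets get the same amount of time to accumulate engagement.
    target_day = date.today() - timedelta(days=14)
    ids = tweets_created_on(target_day)
    print(fetch_public_metrics(ids))
```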

In parallel, we are starting to build a pipeline of scripts to analyze the corpus. We will apply several methods from German Natural Language Processing (NLP), e.g. Sentiment Analysis and Lexical Analysis, as well as a Network Analysis of the accounts that are active in the corpus we created. This will help us understand how and by whom the topics are discussed - stay tuned for further updates!
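
As a first impression of what the network analysis step could look like, the sketch below builds a directed reply graph with networkx and computes simple centrality scores. The edge list is made up purely for illustration and does not come from our corpus.

```python
import networkx as nx

# Minimal sketch, assuming the corpus has already been reduced to
# (author, replied-to author) pairs; these sample edges are invented.
reply_edges = [
    ("account_a", "account_b"),
    ("account_c", "account_b"),
    ("account_b", "account_a"),
]

G = nx.DiGraph()
G.add_edges_from(reply_edges)

# Centrality scores hint at which accounts drive or receive discussion.
print(nx.degree_centrality(G))
print(nx.pagerank(G))
```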