Initial Data Collection (2021-01-13)

Tagged as: blog, data collection, tweet scraping, web scraping
Group: G_20/21

A short overview of the initial scraping process, in which the tweets of 555 German politicians over the year 2020 were collected.

Data Collection

We used the previously mentioned tool TweetScraper to gather the tweets of the 555 German politicians we compiled. TweetScraper utilizes the Advanced Search functionality of the Twitter Web App to find the desired tweets. Unfortunately, native retweets cannot be accessed via this method, so we were only able to retrieve the politicians' original tweets and the tweets they quoted. Nonetheless, the result is still a sizeable dataset, containing a total of 295,588 tweets by 27,040 different Twitter users. The data covers the timespan from January 1 to December 23, 2020. These numbers match the results of tests we ran prior to the scraping process, in which we compared the number of tweets retrieved via our method to the numbers presented on the social media analysis website socialblade.com. In order to gather as many tweets as possible, we ran some more tests beforehand to determine the optimal number of requests per politician. Based on these tests, we generated one request for every politician in our list and every two-week window within the total timespan.
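As a rough sketch, the request generation could look like the following. The query operators (from:, since:, until:) are Twitter's Advanced Search syntax; the exact TweetScraper invocation and the handle used below are illustrative assumptions, not our actual setup.

```python
from datetime import date, timedelta

# Illustrative sketch: one request per politician per two-week window.
START, END = date(2020, 1, 1), date(2020, 12, 23)
WINDOW = timedelta(days=14)

def scrape_commands(handles):
    for handle in handles:
        begin = START
        while begin < END:
            stop = min(begin + WINDOW, END)
            query = f"from:{handle} since:{begin:%Y-%m-%d} until:{stop:%Y-%m-%d}"
            yield f'scrapy crawl TweetScraper -a query="{query}"'
            begin = stop

for cmd in scrape_commands(["jensspahn"]):  # hypothetical handle list
    print(cmd)
```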

First Analysis

First, we wanted to get an overview of the collected data. We created a bar chart showing how the tweets are distributed across the parties.

After removing the politicians without any tweets from the list, we also looked at the distribution of politicians per party.
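A minimal sketch of how such a chart can be produced, assuming a tweet dump and a politician list with `usernick`, `handle`, and `party` columns (these names are placeholders, not our actual schema):

```python
import pandas as pd
import matplotlib.pyplot as plt

tweets = pd.read_json("tweets.json", lines=True)
politicians = pd.read_csv("politicians.csv")

# Join each tweet to its author's party, then count tweets per party.
merged = tweets.merge(politicians, left_on="usernick", right_on="handle")
merged["party"].value_counts().plot.bar(title="Tweets per party")
plt.tight_layout()
plt.show()
```

The politicians-per-party chart works the same way, with `value_counts()` applied directly to the politician list.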

We further analysed quantitative data for the collected tweets of each politician on our list. This analysis is based on the metadata each tweet is annotated with. The obvious first metrics were followers, total tweets found, and tweets authored by the respective politician. We also looked at engagement numbers, represented by likes, retweets, and replies to a tweet. Combined with the ratio of original tweets to replies, these measures give a first indication of how the politicians use Twitter. Do they just post tweets, or are they participating in discussions? How popular are their tweets? How often do they tweet?
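A minimal sketch of the per-politician aggregation; the column names (`usernick`, `ID`, `nbr_favorite`, `nbr_retweet`, `nbr_reply`, `is_reply`) are assumptions about the scraped schema, not the actual field names:

```python
import pandas as pd

tweets = pd.read_json("tweets.json", lines=True)

per_politician = tweets.groupby("usernick").agg(
    tweets_authored=("ID", "count"),
    likes=("nbr_favorite", "sum"),
    retweets=("nbr_retweet", "sum"),
    replies=("nbr_reply", "sum"),
    reply_share=("is_reply", "mean"),  # share of tweets that are replies
)
print(per_politician.sort_values("likes", ascending=False).head(10))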

Lastly, the annotated languages of the collected tweets were also analysed. We found that the language annotations only really work for longer tweets, while shorter ones are often mislabeled or not annotated at all. Overall, the results of these tests show that the vast majority of the collected tweets are in German. We can therefore focus on language analysis tools specifically developed for German in the next steps.
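Counting the annotations is a one-liner; `lang` is an assumed name for the language field:

```python
import pandas as pd

tweets = pd.read_json("tweets.json", lines=True)
print(tweets["lang"].value_counts(dropna=False).head())
print("share labeled German:", (tweets["lang"] == "de").mean())
```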

The tweets are preprocessed in several steps. Initially, the tweet-preprocessor library is used to remove URLs, mentions, reserved words (RT, FAV), and smileys from the tweets. Further preprocessing is done with the help of NLTK: tokenization and removal of digits, stop words, and punctuation.
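A minimal sketch of such a pipeline (the example tweet and the exact filter order are illustrative):

```python
import string
import preprocessor as p  # PyPI package: tweet-preprocessor
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download("punkt")
nltk.download("stopwords")

# Strip URLs, mentions, reserved words (RT, FAV) and smileys.
p.set_options(p.OPT.URL, p.OPT.MENTION, p.OPT.RESERVED, p.OPT.SMILEY)
GERMAN_STOPWORDS = set(stopwords.words("german"))

def preprocess(tweet):
    cleaned = p.clean(tweet)
    tokens = word_tokenize(cleaned, language="german")
    return [
        t.lower()
        for t in tokens
        if t not in string.punctuation          # remove punctuation
        and not t.isdigit()                     # remove digits
        and t.lower() not in GERMAN_STOPWORDS   # remove stop words
    ]

print(preprocess("RT @user: Die Corona-Zahlen steigen https://t.co/x :-)"))
```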

To filter Covid-related tweets, we search for a match between a word from a word list and the words in the tweet. In order to get the best possible results, we tested three different approaches: stemming, lemmatization, and fuzzy matching.

  • Stemming: Reduction of a word to its stem. Because the stem is determined purely algorithmically, it may not be identical to the morphological root of the word.
  • Lemmatization: Reduction of the word to its linguistically correct base form, the so-called lemma.
  • Fuzzy matching: The similarity of two strings is determined on a character basis, and a match is accepted above a certain threshold. A high threshold reduces false positives, but also yields fewer hits.

Stemming and lemmatization look for an exact match between the normalized words, whereas fuzzy matching also allows approximate matches. This makes the approach robust against typos, and it works regardless of the language (see the sketch below).
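A minimal sketch comparing two of the strategies on a preprocessed token list. The word list and the threshold of 85 are illustrative choices, not the values used in the project; lemmatization works analogously with a German lemmatizer (e.g. spaCy's de_core_news_sm model).

```python
from nltk.stem.snowball import SnowballStemmer
from rapidfuzz import fuzz

WORDLIST = {"corona", "covid", "pandemie", "lockdown"}
stemmer = SnowballStemmer("german")
STEMMED = {stemmer.stem(w) for w in WORDLIST}

def matches_stemming(tokens):
    return any(stemmer.stem(t) in STEMMED for t in tokens)

def matches_fuzzy(tokens, threshold=85):
    return any(fuzz.ratio(t, w) >= threshold
               for t in tokens for w in WORDLIST)

tokens = ["corna", "zahlen"]       # note the typo in "corona"
print(matches_stemming(tokens))    # False: the typo breaks the exact match
print(matches_fuzzy(tokens))       # True: "corna" is close enough to "corona"
```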

  • Example: Jens Spahn

    Stemming   Lemmatization   Fuzzy-Matching   Total tweets
    116        124             142              314

A comparison of the approaches shows that fuzzy matching finds the most matches. Of a total of 254,388 tweets written by the politicians themselves, 28,807 are Covid-related.

Further Approach

Based on this, we want to optimize our word list and possibly add missing words by counting the words used across all collected tweets. After that, we want to filter the tweets again and select up to 100 important politicians. This selection will be based on either the number of Covid-related tweets, the share of Covid-related tweets among all tweets, or a combination of both (a sketch follows below).
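One hypothetical way to combine both criteria; `usernick` and the boolean flag `is_covid` are assumed columns, not our actual schema:

```python
import pandas as pd

tweets = pd.read_json("tweets.json", lines=True)

stats = tweets.groupby("usernick").agg(
    total=("is_covid", "size"),
    covid=("is_covid", "sum"),
)
stats["share"] = stats["covid"] / stats["total"]
# One possible combination: rank by both criteria, then sum the ranks.
stats["score"] = stats["covid"].rank() + stats["share"].rank()
top100 = stats.nlargest(100, "score")
print(top100.head())
```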