COVID-19FakeNews

Abstract

COVID-19 has spread through every country, in rural and urban areas alike. Since the start of the pandemic, facts and science have been politicized to align with party agendas, which has unfortunately left constituents misinformed about the dangerous virus. From early April 2020 to early May 2020, Dr. Mohamed Farag collected a large set of tweets from Twitter users. In these tweets, users shared thoughts, opinions, and factual claims about the virus. We aimed to filter these tweets, sort them into classes, and use machine learning to determine whether these tweets, and future ones, are a reliable source of accurate information.

Our goal in this project was to find rumors and false information being spread about COVID-19, as well as the perpetrators who spread them. As more people around the world gain access to the internet, more information is shared, resulting in an information “overload” in which fact and myth are intertwined and the public cannot easily tell which is which. The COVID19FakeNews team focused on giving the public clarity about which tweets spread dangerous falsehoods.

We received a one-terabyte file of tweets that Dr. Farag had collected. Using a Python script built on several Python libraries, we converted these tweets into a unified format and stored them as readable JSON. We then extracted the tweet IDs from the stored collection and, using the Twarc2 library, hydrated the tweets that still existed (i.e., had not been deleted). This step was crucial for identifying the tweets that are still visible, so that we could later sort them into categories (buckets).
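The ID-extraction step can be illustrated with a minimal sketch. It assumes the converted collection is stored as one JSON object per line and that each tweet carries its ID in an `id_str` or `id` field; the function name `extract_tweet_ids` is hypothetical and is not taken from the project's script. The resulting IDs could be written one per line to a text file and passed to the `twarc2 hydrate` command.

```python
import json
from typing import Iterable, Iterator


def extract_tweet_ids(lines: Iterable[str]) -> Iterator[str]:
    """Yield each distinct tweet ID found in an iterable of JSON lines.

    Assumption: every line holds one tweet object whose ID is stored
    under "id_str" (preferred) or "id". Blank or malformed lines are skipped.
    """
    seen = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            tweet = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON
        if not isinstance(tweet, dict):
            continue  # skip JSON values that are not objects
        tweet_id = tweet.get("id_str") or tweet.get("id")
        if tweet_id is None:
            continue
        tweet_id = str(tweet_id)
        if tweet_id not in seen:  # drop duplicates, keep first-seen order
            seen.add(tweet_id)
            yield tweet_id
```

Because the function is a generator that reads line by line, it does not need to hold the whole collection in memory, which matters for a file of this size.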

Once the tweets were hydrated, our team manually labeled a small sample into seven categories. We used these labels to train and test a machine learning model, a support vector machine (SVM) built with the sklearn Python library. The model was trained on enough labeled data for the team to be satisfied with its accuracy. We then ran the model on the remaining hydrated tweets to classify them. We created a front-end display that shows a timeline of when tweets of each class were published. The front end also shows statistics on the raw and cleaned datasets, as well as the users who regularly tweeted misinformation. Overall, this project should be useful to researchers conducting similar studies and to members of the public who are concerned about COVID-19.
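As a rough illustration of the classification step, the sketch below trains a linear SVM on TF-IDF features with sklearn. The choice of `TfidfVectorizer` and `LinearSVC`, and the function name `train_tweet_classifier`, are assumptions made for this example; the abstract states only that an SVM from sklearn was used and does not name the features, the kernel, or the seven categories.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC


def train_tweet_classifier(texts, labels):
    """Fit a TF-IDF + linear SVM pipeline on labeled tweet texts.

    Assumption: `texts` is a list of tweet strings and `labels` is a
    parallel list of category names assigned by human annotators.
    """
    model = make_pipeline(
        TfidfVectorizer(lowercase=True),  # turn each tweet into a sparse TF-IDF vector
        LinearSVC(),                      # linear-kernel SVM; multi-class via one-vs-rest
    )
    model.fit(texts, labels)
    return model
```

The fitted pipeline can then label unseen tweets with `model.predict(list_of_texts)`. In practice, part of the labeled sample would be held out to estimate accuracy before running the model on the remaining tweets.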

Keywords

Coronavirus, Machine Learning, SVM, Twitter, Angular
