Crisis Events Information Extraction

Abstract

Crises occur frequently throughout the world. In an increasingly digital age, where most news outlets post their articles online, a single event is often covered by tens or even hundreds of articles. Although these articles largely repeat the same information, some details appear only in a particular article or outlet. Because each outlet usually writes a lengthy article about each crisis event, it can be hard to quickly locate and learn the basic, important information about a given event.

This web app project aims to expedite this lengthy process by consolidating any number of articles about a crisis event into who, what, where, when, and how (WWWWH). The information extraction is accomplished using machine learning for named entity recognition and dependency parsing. The extracted WWWWH information is displayed in an easily digestible table, which allows users to quickly learn the essential information about any given crisis event. Both the user’s input and the output data are saved to a database, so that users can revisit their previous runs of the program at any time. Users currently must input web articles manually, either as links or as .txt files, but a web crawler could automate this initial article gathering in the future.
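The consolidation step described above can be illustrated with a minimal sketch. It is not the project's actual code: it assumes that named entity recognition has already produced (text, label) pairs for each article, that the labels follow the common PERSON/ORG/GPE/LOC/DATE/TIME scheme, and that a value is ranked by how many articles mention it. The mapping `LABEL_TO_FIELD`, the function `consolidate`, and the sample data are all hypothetical, and the "what" and "how" fields, which the project derives with dependency parsing, are left out.

```python
from collections import Counter

# Hypothetical mapping from NER labels to WWWWH table fields (an assumption,
# not taken from the project). "what" and "how" are not covered here.
LABEL_TO_FIELD = {
    "PERSON": "who", "ORG": "who",
    "GPE": "where", "LOC": "where",
    "DATE": "when", "TIME": "when",
}


def consolidate(articles, top_n=2):
    """Merge per-article (text, label) entity pairs into one table.

    Each value counts at most once per article, so a value ranks higher
    the more articles agree on it. Values are lowercased for matching.
    """
    counts = {field: Counter() for field in ("who", "where", "when")}
    for entities in articles:
        seen = set()  # (field, value) pairs already counted for this article
        for text, label in entities:
            field = LABEL_TO_FIELD.get(label)
            value = text.strip().lower()
            if field is None or (field, value) in seen:
                continue
            seen.add((field, value))
            counts[field][value] += 1
    return {
        field: [value for value, _ in counter.most_common(top_n)]
        for field, counter in counts.items()
    }


# Made-up entity lists standing in for three articles about one event.
sample_articles = [
    [("Relief Agency", "ORG"), ("Springfield", "GPE"), ("Monday", "DATE")],
    [("Relief Agency", "ORG"), ("Springfield", "GPE"), ("Shelbyville", "GPE"),
     ("Monday", "DATE"), ("Tuesday", "DATE"), ("Monday", "DATE")],
    [("springfield", "GPE"), ("Monday", "DATE"), ("City Council", "ORG")],
]

table = consolidate(sample_articles, top_n=1)
print(table)
```

Counting each value once per article keeps a single long article from dominating the table; other weighting schemes would be equally plausible.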

This application is built on the MERN stack (MongoDB, Express, React, Node.js). MongoDB was chosen for its flexible document structure. On the back end, Python handles natural language processing and Express/Node.js runs our server. The front end is built with React, which fetches our data and uses component libraries such as MUI for a consistent design language.

The deliverables for this project are our Final Presentation and Final Report, which show our progress throughout the development stages, and the code for the application. All of these are submitted to our professor and client, Mohamed Farag.

Description

The documents titled "Crisis Events Information Extraction Presentation", in both PDF and PPTX formats, are our final presentation. They show the progress achieved throughout the development stages, provide an overview of the problem, outline our solution strategy, and detail the functionality of our application. The documents titled "Crisis Events Information Extraction Report", in both PDF and DOCX formats, are our in-depth report, which covers everything a reader needs to know about this project: our abstract, introduction, requirements, design, implementation, testing, user's and developer's manuals, and lessons learned. The "Crisis Events Extraction Information Extraction Code" is a zip file that contains all the code needed for our application to run.

Keywords

Extraction, Natural Language Processing, WARC, Webpages, Archive, Python, JavaScript, React, MongoDB

Citation