Integrated Web App for Crisis Events Crawling

Abstract

Integrating a web crawler and a text classifier into a single web application is a practical advance in digital tools for retrieving and parsing information about crisis events. This project combines HTML text processing techniques with a priority-based web crawling algorithm to build a system that gathers and classifies web content relevant to a specific crisis event. Using the classifier project's model, trained on targeted data, the application improves the crawler's ability to identify and prioritize the content most pertinent to the crisis at hand. Moving the backend from Firebase to MongoDB provides a more flexible, accessible, and persistent database. The backend is also supported by a Flask API, which handles interaction between the frontend, the machine learning model, and the database. This setup streamlines data flow within the application and makes the system easier to maintain and scale. The integrated web app is intended as a tool for stakeholders in crisis management, such as journalists, first responders, and policy makers, giving them quick access to timely and relevant information. Development involved significant challenges in repairing the two original projects: neither was functional out of the box when obtained from its repository, and both had incomplete documentation, leaving much for our team to work out on its own. The result of our team's work is a redesigned frontend, a redesigned backend, and a local MongoDB database, brought together into one cohesive application.

Description

Two previous projects each built a web app for retrieving webpages about a crisis event. The first provided a web interface for building and using a one-class classifier to judge whether a webpage is related to a crisis event. The second provided a web interface for crawling the web for pages related to a crisis event. In this project, we merge the two web apps into one, with an integrated web interface for preparing the one-class classifier and then using it to crawl the web.
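
As a rough illustration of how a classifier score can drive a priority-based crawl, the following Python sketch keeps the crawl frontier in a priority queue and visits the most promising links first. It is not the project's actual code: the in-memory demo_web, the keyword_score function standing in for the trained one-class classifier, and every function name here are assumptions made for this example.

```python
import heapq
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical page representation: (page text, outgoing links).
Page = Tuple[str, List[str]]


def keyword_score(text: str, keywords: List[str]) -> float:
    """Stand-in for the one-class classifier: fraction of keywords present."""
    lowered = text.lower()
    hits = sum(1 for k in keywords if k in lowered)
    return hits / len(keywords) if keywords else 0.0


def priority_crawl(
    seeds: List[str],
    fetch: Callable[[str], Optional[Page]],
    score: Callable[[str], float],
    max_pages: int = 10,
) -> List[Tuple[str, float]]:
    """Visit pages in order of estimated relevance; return (url, score) pairs."""
    # heapq is a min-heap, so priorities are negated to pop the best first.
    frontier = [(-1.0, url) for url in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    results: List[Tuple[str, float]] = []
    while frontier and len(results) < max_pages:
        _, url = heapq.heappop(frontier)
        page = fetch(url)
        if page is None:
            continue  # unreachable page: skip it
        text, links = page
        relevance = score(text)
        results.append((url, relevance))
        for link in links:
            if link not in seen:
                seen.add(link)
                # A link inherits the relevance of the page that cites it.
                heapq.heappush(frontier, (-relevance, link))
    return results


# Tiny in-memory "web" so the sketch runs without network access.
demo_web: Dict[str, Page] = {
    "a": ("Flood in the city", ["b", "c"]),
    "b": ("Sports news", ["d"]),
    "c": ("Flood rescue and evacuations", ["e"]),
    "d": ("Weather", []),
    "e": ("Flood warning", []),
}

results = priority_crawl(
    seeds=["a"],
    fetch=demo_web.get,
    score=lambda text: keyword_score(text, ["flood", "rescue"]),
)
order = [url for url, _ in results]
```

In this toy run, page "e" is visited before page "d" because it was linked from a page that scored higher. A real system would replace demo_web.get with an HTTP fetcher and keyword_score with the trained model's prediction.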

Keywords

text classifier, text classification, web crawler, information retrieval, web crawling, crisis event

Citation