Title: Improving Text Classification Using Graph-based Methods
Author: Karajeh, Ola Abdel-Raheem Mohammed
Date accessioned: 2024-06-06
Date available: 2024-06-06
Date issued: 2024-06-05
Identifier: vt_gsexam:40027
URI: https://hdl.handle.net/10919/119309

Abstract:
Text classification is a fundamental natural language processing task. However, in real-world applications, class distributions are usually skewed, e.g., due to inherent class imbalance. In addition, task difficulty varies with the underlying language: when a language exhibits rich morphological structure and high ambiguity, natural language understanding becomes more challenging. For example, Arabic, ranked the fifth most widely used language, has a rich morphological structure and high ambiguity arising from its orthography, which makes Arabic natural language processing challenging. Several studies employ Long Short-Term Memory (LSTM) networks and Convolutional Neural Networks (CNNs), but Graph Convolutional Networks (GCNs) have not yet been investigated for the task. Sequence-based models can successfully capture the semantics of local, consecutive text sequences, whereas graph-based models can preserve global co-occurrences that capture non-consecutive, long-distance semantics. A text representation approach that combines local and global information can therefore enhance performance in practical class-imbalanced text classification scenarios. Yet multi-view graph-based text representations have received limited attention.

In this research, we first introduce the Multi-view Minority Class Text Graph Convolutional Network (MMCT-GCN), a transductive multi-view text classification model that captures textual graph representations for the minority class alongside sequence-based text representations. Experimental results show that MMCT-GCN obtains consistent improvements over baselines. Second, we develop the Arabic Bidirectional Encoder Representations from Transformers (BERT) Graph Convolutional Network (AraBERT-GCN), a hybrid model that combines large-scale pre-trained models, which encode local context and semantics, with graph-based features capable of extracting global word co-occurrences and non-consecutive, extended semantics within only one or two hops. Experimental results show that AraBERT-GCN outperforms the state of the art (SOTA) on our Arabic text datasets. Finally, we propose the Arabic Multidimensional Edge Graph Convolutional Network (AraMEGraph), designed for text classification, which encapsulates richer, context-aware representations of word and phrase relationships, thereby mitigating the impact of the complexity and ambiguity of the Arabic language.

Type: ETD; Dissertation
Language: en
Rights: In Copyright
Subjects: Graph convolutional networks; Text classification; Tweets; Imbalanced data; Arabic
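To make the hybrid local/global design described in the abstract concrete, the sketch below is a minimal illustration, not the dissertation's implementation: a standard two-layer GCN (two hops over a word/document co-occurrence graph, as in the "one or two hops" remark) whose document-node outputs are concatenated with sequence-based embeddings, such as BERT [CLS] vectors, before classification. All class names, dimensions, and the toy inputs are illustrative assumptions.

```python
# Minimal sketch (assumed, not the dissertation's code): fuse a two-hop GCN view
# of a word/document co-occurrence graph with a sequence-based embedding view.
import torch
import torch.nn as nn
import torch.nn.functional as F


def normalize_adjacency(adj: torch.Tensor) -> torch.Tensor:
    """Symmetric normalization D^{-1/2} (A + I) D^{-1/2} used by standard GCNs."""
    adj = adj + torch.eye(adj.size(0))            # add self-loops
    deg = adj.sum(dim=1)                          # node degrees (>= 1 after self-loops)
    d_inv_sqrt = torch.diag(deg.pow(-0.5))        # D^{-1/2}
    return d_inv_sqrt @ adj @ d_inv_sqrt


class HybridGCNClassifier(nn.Module):
    """Two GCN layers (two hops over co-occurrences) fused with sequence embeddings."""

    def __init__(self, n_node_feats: int, seq_dim: int, hidden: int, n_classes: int):
        super().__init__()
        self.gcn1 = nn.Linear(n_node_feats, hidden)   # first-hop propagation weights
        self.gcn2 = nn.Linear(hidden, hidden)         # second-hop propagation weights
        self.classifier = nn.Linear(hidden + seq_dim, n_classes)

    def forward(self, x, adj_norm, seq_emb, doc_idx):
        # Graph view: H^(l+1) = ReLU(A_norm H^(l) W^(l))
        h = F.relu(adj_norm @ self.gcn1(x))
        h = adj_norm @ self.gcn2(h)
        # Keep only document nodes and fuse with the sequence (local-context) view.
        fused = torch.cat([h[doc_idx], seq_emb], dim=-1)
        return self.classifier(fused)


if __name__ == "__main__":
    n_nodes, n_docs = 12, 4                        # e.g., 8 word nodes + 4 document nodes
    adj = (torch.rand(n_nodes, n_nodes) > 0.7).float()
    adj = ((adj + adj.t()) > 0).float()            # make the toy graph symmetric
    adj_norm = normalize_adjacency(adj)
    x = torch.eye(n_nodes)                         # one-hot node features, TextGCN-style
    seq_emb = torch.randn(n_docs, 16)              # stand-in for BERT [CLS] embeddings
    doc_idx = torch.arange(n_nodes - n_docs, n_nodes)
    model = HybridGCNClassifier(n_node_feats=n_nodes, seq_dim=16, hidden=32, n_classes=2)
    print(model(x, adj_norm, seq_emb, doc_idx).shape)   # torch.Size([4, 2])
```

In this toy setup the adjacency is random; in a TextGCN-style pipeline it would instead encode word-word PMI and word-document TF-IDF co-occurrence weights, and the minority-class or multi-view variants described above would build additional graphs over the same nodes.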