Mold Allergomics: Comparative and Machine Learning Approaches
Dang, Ha Xuan
Fungi are among the major groups of organisms that cause allergic disease in humans. A number of fungal proteins have been found to be allergenic or to possess immunostimulatory properties. Identifying and characterizing allergens from fungal genomes will facilitate our understanding of the mechanisms underlying host-pathogen interactions in allergic diseases. Currently, there is a lack of tools that can rapidly and accurately predict allergens from whole genomes. In the context of whole-genome annotation, allergens are rare compared to non-allergens, and thus the data are highly skewed. To obtain a confident set of predicted allergens from a genome, the false positive rate must be lowered. Current allergen prediction tools often produce many false positives when applied to large-scale data sets such as whole genomes, which lowers their precision. Moreover, the most accurate tools are relatively slow because they use sequence alignment to construct feature vectors for allergen classifiers. This dissertation presents computational approaches to characterizing the allergen repertoire in fungal genomes as part of whole-genome studies of Alternaria, an important allergenic and opportunistic human pathogenic fungus and necrotrophic plant parasite. In these studies, the genomes of multiple Alternaria species were characterized for the first time. Functional elements (e.g., genes and proteins) were first identified and annotated in these genomes using computational tools. Protein annotation and comparative genomics approaches revealed a link between Alternaria genotypes and the genus's prolific saprophytic lifestyle, which provides at least a partial explanation for the development of pathological relationships between Alternaria and humans. A machine learning-based tool (Allerdictor) was developed to address the neglected problem of allergen prediction in highly skewed large-scale data sets.
Allerdictor exhibited high precision at high recall and ran quickly, making it a more practical tool for large-scale allergen annotation than existing tools. Allerdictor was then used together with a comparative genomics approach to survey the allergen repertoire of known allergenic fungi. We predicted a number of mold allergens that have not been experimentally characterized. These predicted allergens are potential candidates for further experimental and clinical validation. Our approaches will not only facilitate the study of allergens in the increasing number of sequenced fungal genomes but will also be useful for allergen annotation in other species and for rapid prescreening of synthesized sequences for potential allergens.
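The abstract's point about skewed data can be illustrated with a short calculation: when allergens are rare, even a modest false positive rate produces far more false positives than true positives. The sketch below is illustrative only. The function name and all numeric values are hypothetical and are not taken from the dissertation or from Allerdictor.

```python
def expected_precision(prevalence, recall, false_positive_rate):
    """Expected precision of a classifier on data with the given class balance.

    prevalence:           fraction of sequences that are true allergens
    recall:               fraction of true allergens the classifier detects
    false_positive_rate:  fraction of non-allergens wrongly flagged
    """
    true_pos = prevalence * recall
    false_pos = (1.0 - prevalence) * false_positive_rate
    if true_pos + false_pos == 0:
        return 0.0  # nothing was predicted positive
    return true_pos / (true_pos + false_pos)


# Hypothetical classifier: 90% recall, 5% false positive rate.
# On a balanced benchmark (half allergens) it looks precise ...
balanced = expected_precision(prevalence=0.5, recall=0.9, false_positive_rate=0.05)

# ... but on a genome-like set where only 1% of proteins are allergens,
# most of its positive predictions are false.
skewed = expected_precision(prevalence=0.01, recall=0.9, false_positive_rate=0.05)

print(f"balanced precision: {balanced:.3f}")
print(f"skewed precision:   {skewed:.3f}")
```

Under these assumed numbers, the same classifier goes from mostly correct positive calls on the balanced set to mostly incorrect ones on the skewed set, which is why lowering the false positive rate matters for whole-genome annotation.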
- Doctoral Dissertations