mobileOG-db: a Manually Curated Database of Protein Families Mediating the Life Cycle of Bacterial Mobile Genetic Elements

ABSTRACT Bacterial mobile genetic elements (MGEs) encode functional modules that perform both core and accessory functions for the element, the latter of which are often only transiently associated with the element. The presence of these accessory genes, which are often close homologs of primarily immobile genes, incurs high rates of false positives and therefore limits the usability of existing MGE databases for MGE annotation. To overcome this limitation, we analyzed 10,776,849 protein sequences derived from eight MGE databases to compile a comprehensive set of 6,140 manually curated protein families that are linked to the “life cycle” (integration/excision, replication/recombination/repair, transfer, stability/transfer/defense, and phage-specific processes) of plasmids, phages, integrative, transposable, and conjugative elements. We overlay experimental information where available to create a tiered annotation scheme of high-quality annotations and annotations inferred exclusively through bioinformatic evidence. We additionally provide an MGE-class label for each entry (e.g., plasmid or integrative element) and assign to each entry a major and minor category. The resulting database, mobileOG-db (for mobile orthologous groups), comprises over 700,000 deduplicated sequences encompassing five major mobileOG categories and more than 50 minor categories, providing a structured language and interpretable basis for an array of MGE-centered analyses. mobileOG-db can be accessed at mobileogdb.flsi.cloud.vt.edu/, where users can select, refine, and analyze custom subsets of the dynamic mobilome.

IMPORTANCE The analysis of bacterial mobile genetic elements (MGEs) in genomic data is a critical step toward profiling the root causes of antibiotic resistance, phenotypic or metabolic diversity, and the evolution of bacterial genera. Existing methods for MGE annotation require substantial biological and computational expertise to use properly. To bridge this gap, we systematically analyzed 10,776,849 proteins derived from eight databases of MGEs to identify 6,140 MGE protein families that can serve as candidate hallmarks, i.e., proteins that can be used as “signatures” of MGEs to aid annotation. The resulting resource, mobileOG-db, provides a multilevel classification scheme that encompasses plasmid, phage, integrative, and transposable element protein families categorized into five major mobileOG categories and more than 50 minor categories. mobileOG-db thus provides a rich resource for simple and intuitive element annotation that can be integrated seamlessly into existing MGE detection pipelines and colocalization analyses.


SUPPLEMENTAL METHODS
(i) Annotation of accessory genes in mobile genetic element databases.
(ii) Example rationale for annotating proteins.
Figure S1. Example of incorrect annotation manually reconciled in mobileOG-db.
(iii) Description of the mobileOG-kyanite for autonomous element detection and classification.
Figure S2. Description of mobileOG.pl-kyanite, a preliminary pipeline for autonomous element detection and classification.

SUPPLEMENTARY DATA
Table S1. Keywords used to identify mobile genetic element abstracts in PubMed.
Table S2. Keywords and their associated categories created to identify putative MGE sequences that are associated with the target categories in the merged database.
Table S3. Evaluation of mobileOG-kyanite, a pipeline for identifying putative mobile element contigs. Attached as csv.
Table S4. Complete list of major and minor mobileOG category combinations. Attached as csv.
Table S5. CRISPR, BREX, and CBASS anti-phage system components present within mobileOG-db. Attached as csv.
Figure S3. Comparison of mobileOG-db.pl in classifying putative phages and prophages derived from wastewater metagenomes described in Brown & Keenum et al. 2021 [1]. Top panel: VirSorter assigns phage predictions in metagenomic data to categories reflecting different levels of confidence. "Confident phage" refers to the highest level of confidence in VirSorter (category 1); "Confident prophage" corresponds to category 4 (the highest confidence of a positive prophage identification); and "Likely phage" refers to category 2 (a "medium" level of confidence in phage identification). "Conservative Plasmids" refers to a more stringent cut-off selected in the mobileOG-db pipeline (k = 15 and purity ≥ 80%). Bottom panel: protein-coding gene content is consistent with a tentative annotation as plasmid fragments.

Supplemental Methods
(I) Annotation of accessory genes in public mobile genetic element databases.
Antibiotic resistance genes, metal resistance genes, and virulence factors were identified in public databases using DIAMOND blastp [2], with cut-offs of >90% sequence identity and >80% query coverage. Antibiotic resistance genes were annotated using CARD v. 3.0.7 [3], metal resistance genes were annotated using BacMet [4], and virulence factors were annotated using VFDB [5].
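The identity and coverage cut-offs described above can be illustrated with a minimal sketch that filters tabular alignment output. The column layout (qseqid, sseqid, pident, qcovhsp), the function name filter_hits, and the example rows are assumptions made for illustration; they are not taken from the original analysis.

```python
import csv
import io

# Assumed column layout, e.g. as produced by a tabular DIAMOND run with
# "--outfmt 6 qseqid sseqid pident qcovhsp" (an assumption, not the
# authors' documented command line).
COLUMNS = ("qseqid", "sseqid", "pident", "qcovhsp")


def filter_hits(tsv_text, min_identity=90.0, min_query_cover=80.0):
    """Return (query, subject) pairs that exceed both cut-offs.

    Comparisons are strict (>), following the ">90%" and ">80%"
    wording of the text.
    """
    reader = csv.DictReader(
        io.StringIO(tsv_text), fieldnames=COLUMNS, delimiter="\t"
    )
    kept = []
    for row in reader:
        identity = float(row["pident"])
        coverage = float(row["qcovhsp"])
        if identity > min_identity and coverage > min_query_cover:
            kept.append((row["qseqid"], row["sseqid"]))
    return kept


# Hypothetical alignment rows: only the first passes both cut-offs.
example = "\n".join([
    "protA\tARG_0001\t99.1\t95.0",   # passes
    "protB\tARG_0002\t85.0\t99.0",   # identity too low
    "protC\tVF_0003\t92.5\t60.0",    # query coverage too low
])
hits = filter_hits(example)
```

Applying the thresholds after alignment, rather than only on the command line, keeps the unfiltered hits available if the cut-offs need to be revisited.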
(II) Example rationale of protein annotations.
Protein families were included in mobileOG-db only if there was experimental evidence of their direct involvement with one of the targeted functions. Protein families with only indirect interactions with one of the target functions were not included unless they had been shown to be essential for element persistence or replication. For example, these criteria excluded ribonucleotide reductases found within many phage genomes [6], which only have an indirect impact on replication through nucleotide metabolism [7,8], except under conditions of anaerobic growth [7,9]. While these proteins are useful indicators of phage diversity [10,11], we were unable to find evidence of a direct role in replication other than nucleotide metabolism, and thus these proteins are not present in mobileOG-db. By contrast, phage-encoded thymidylate synthase homologs provide nucleotide substrates for replication and control levels of methyl- or hydroxymethyl-thymidine monophosphates [12]. These modified pyrimidines can then be further hypermodified [13] with additional functional moieties [12,14], which alter the steric properties of the nucleic acid of the viral genome. This process can therefore provide a phage genome with defense against host-encoded CRISPR [15] and restriction-modification systems [16-19]. Thus, thymidylate synthases were included in mobileOG-db and categorized in the replication/recombination/repair major category with minor categories stability and defense.
We also found several proteins whose names did not match the results of the abstract database and therefore had to be manually curated to reconcile the disagreement. For example:
tr|A0A2Z2Q3C7|A0A2Z2Q3C7_9RHIZ Polyamine ABC transporter ATP-binding protein OS=Agrobacterium larrymoorei OX=160699 GN=repB PE=3 SV=1
The protein repB was identified as a regulator of plasmid replication by the abstract analysis, and this sequence initially appeared to be an erroneous attribution of the name, or a protein with the same name but a different function. Upon further inspection, it became apparent that the header was not descriptive of the putative function of the protein (Figure S1).
Figure S1. Example of incorrect annotation manually reconciled in mobileOG-db.
Thus, this entry was included in the manually curated sequences as it had a positive association between name, literature, and putative function. UniProt was additionally contacted to seek a correction for this entry.
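A disagreement of this kind, between the gene name in a header and its free-text description, can in principle be flagged automatically before manual review. The following is a minimal sketch, assuming UniProt-style FASTA headers; the curated_functions mapping, its keywords, and the function names are hypothetical and do not represent the curation procedure actually used.

```python
import re

# Pattern for UniProt-style headers: db|accession|entry description
# OS=organism OX=taxid [GN=gene] ...
HEADER_RE = re.compile(
    r"^(?P<db>sp|tr)\|(?P<accession>[^|]+)\|(?P<entry>\S+)\s+"
    r"(?P<description>.*?)\s+OS=(?P<organism>.*?)\s+OX=(?P<taxid>\d+)"
    r"(?:\s+GN=(?P<gene>\S+))?"
)


def parse_uniprot_header(header):
    """Split a UniProt-style FASTA header into named fields."""
    match = HEADER_RE.match(header.lstrip(">"))
    if match is None:
        raise ValueError("not a UniProt-style header: " + header)
    return match.groupdict()


def needs_manual_review(header, curated_functions):
    """Flag headers whose description mentions none of the keywords
    expected for the gene name.

    curated_functions is a hypothetical mapping of gene name to
    keywords drawn from the literature.
    """
    fields = parse_uniprot_header(header)
    gene = fields["gene"]
    if gene is None or gene not in curated_functions:
        return False
    description = fields["description"].lower()
    return not any(word in description for word in curated_functions[gene])


header = (
    "tr|A0A2Z2Q3C7|A0A2Z2Q3C7_9RHIZ Polyamine ABC transporter "
    "ATP-binding protein OS=Agrobacterium larrymoorei OX=160699 "
    "GN=repB PE=3 SV=1"
)
# Hypothetical keywords for repB, reflecting its role in plasmid replication.
curated = {"repB": ("replication", "plasmid")}
fields = parse_uniprot_header(header)
flagged = needs_manual_review(header, curated)
```

A check like this only identifies candidates for inspection; deciding whether the name or the description is wrong still requires the manual reconciliation described above.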
Below are two examples of MGE gene names that also correspond to names of other genes and proteins. mobC is also the name of a gene encoding a mobilase associated with conjugal plasmid transfer [20]; motA also refers to a gene encoding a T4 phage transcriptional regulator [21].
(III) mobileOG-db.pl-kyanite, a preliminary pipeline to detect and classify genomic contigs or long reads as putative MGEs.
Figure S2. mobileOG-db.pl-kyanite takes genomic contigs as input, predicts open reading frames from the nucleotide sequences using Prodigal, then aligns the open reading frames against mobileOG-db. Different DIAMOND settings can be used; several were tested for recovering phages or plasmids from a test data set.
Figure S3. Comparison of mobileOG-db.pl-kyanite in classifying putative phages and prophages derived from wastewater metagenomes described in Brown & Keenum et al. 2021 [1]. Top panel: VirSorter [22] assigns phage predictions in metagenomic data to categories reflecting different levels of confidence. "Confident phage" refers to the highest level of confidence in VirSorter (category 1); "Confident prophage" corresponds to category 4 (the highest confidence of a positive prophage identification); and "Likely phage" refers to category 2 (a "medium" level of confidence in phage identification). "Conservative Plasmids" refers to a more stringent cut-off selected in the mobileOG-db pipeline (k = 15 and purity ≥ 80%).
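The "Conservative Plasmids" cut-off can be illustrated with a short sketch of a contig-level decision rule. The text does not define k or purity, so this example assumes that k is the minimum number of mobileOG-db hits on a contig and that purity is the fraction of those hits sharing the most common MGE-class label. Both interpretations, along with the function name classify_contig, are assumptions and may differ from the pipeline's actual behavior.

```python
from collections import Counter


def classify_contig(hit_classes, k=15, purity=0.80):
    """Assign a contig to an MGE class, or return None.

    hit_classes: one MGE-class label per open reading frame with a
    mobileOG-db hit on the contig.
    k: assumed to be the minimum number of hits required.
    purity: assumed to be the minimum fraction of hits that must
    share the dominant class label.
    """
    if len(hit_classes) < k:
        return None
    label, count = Counter(hit_classes).most_common(1)[0]
    if count / len(hit_classes) >= purity:
        return label
    return None


# Hypothetical contigs.
mostly_plasmid = ["plasmid"] * 16 + ["phage"] * 2    # 18 hits, 16/18 plasmid
too_few_hits = ["plasmid"] * 10                      # fewer than k hits
mixed = ["plasmid"] * 10 + ["phage"] * 10            # no dominant class

call_a = classify_contig(mostly_plasmid)
call_b = classify_contig(too_few_hits)
call_c = classify_contig(mixed)
```

Under these assumptions, raising k or purity trades recall for precision, which is consistent with the description of the cut-off as more stringent.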