Description
All associated data is available here, including all the PPK1 protein sequences, structures, and metadata we used, plus the MMseqs2 and Foldseek results, result tables, and files for phylogenetic inference.
Polyphosphate is an important polymer for diverse organisms, specifically for bacterial stress response, pathogen virulence, and basic metabolism. In wastewater treatment plants, specific microbial lineages remove phosphorus from the water by taking in orthophosphate [(PO₄)³⁻] and polymerizing it into chains of polyphosphate (polyP). At a later treatment stage, these phosphorus-filled cells are removed from the water. This process is crucial for preventing eutrophication of the downstream water and maintaining environmental standards. However, identifying which microbes perform specific polyP accumulation activities in wastewater is challenging. Namely, just because a given bacterium encodes enzymes that catalyze polyP formation does not mean that the bacterium contributes meaningfully to polyP accumulation in wastewater. This lack of predictability hinders rational engineering approaches to make the wastewater treatment process more reliable. While there could be many explanations for differing polyP accumulation phenotypes, we wondered if structural differences in polyP-polymerizing enzymes might explain this observation.
We recently developed a tool called ProteinCartography that uses protein structural similarity to identify homologous protein families [2], and we thought this polyP puzzle could be an interesting test case. We hypothesized that regardless of sequence divergence, bacteria with enhanced polyP accumulation would have highly similar structures of the polyphosphate kinase PPK1, which catalyzes polyP formation, since protein structure tends to be indicative of protein function [3]. We first used ProteinCartography to cluster all PPK1 structures and compare them to the PPK1 protein structure from a bacterium, Accumulibacter, that we know is important for polyP accumulation in wastewater. We then explored support for our hypothesis using different metrics and visualizations, such as comparing sequence and structural similarity and phylogenetic distance against the Accumulibacter PPK1 protein.
We found examples of high PPK1 protein structural similarity within pathogenic bacteria that are phylogenetically related to Accumulibacter, and which also display enhanced polyP accumulation as part of their virulence and stress response mechanisms. Additionally, we found examples of high PPK1 structural similarity between lineages that are distantly related and are either important or abundant in the wastewater treatment process. This suggests that this method could serve as an initial screening step to prioritize lineages to be tested for polyP activity. However, these PPK1 similarity trends weren’t universal compared to other experimentally verified polyP-accumulating organisms in wastewater. Overall, making useful inferences with this approach is highly dependent on curating polyP trait data, which is only available for a handful of bacterial lineages in wastewater. However, even based on this limited trait data, we were still able to come up with novel protein candidates and species that could be experimentally tested for validation purposes.
While we don’t have plans to follow up on these findings for translational purposes, we think these findings may be useful to groups specifically studying phosphorus removal in wastewater treatment plants, or more broadly, to those interested in general stress responses in bacteria. This work may also be interesting to those curious about the types of insights that can be gained by exploring structural homologs of a protein of interest.
This pub is part of the platform effort, “Annotation: Mapping the functional landscape of protein families across biology.” Visit the platform narrative for more background and context.
Data from this pub is available in Zenodo.
All associated code is available in this GitHub repository.
Share your thoughts!
Watch a video tutorial on making a PubPub account and commenting. Please feel free to add line-by-line comments anywhere within this text, provide overall feedback by commenting in the box at the bottom of the page, or use the URL for this page in a tweet about this work. Please make all feedback public so other readers can benefit from the discussion.
Inorganic polyphosphates (polyP) are polymers of orthophosphate [(PO₄)³⁻] and are ubiquitous across the tree of life, from bacteria to higher-order eukaryotes. Polyphosphates serve numerous essential functions in prokaryotes across varying contexts, such as involvement in basic metabolism, sensing and responding to environmental changes, stress responses, and virulence and host immune evasion [4][5]. Nearly all sequenced bacteria have the genetic repertoire for taking up inorganic phosphorus and forming chains of polyP, catalyzed by the PPK polyphosphate kinases [6]. Since most eukaryotes form polyP through different genetic pathways than prokaryotes do [7][8], the PPK enzymes have been of particular interest as an antibiotic target for pathogens such as Acinetobacter baumannii, Mycobacterium tuberculosis, and Pseudomonas aeruginosa [9][10][11]. Some archaea also possess PPK enzymes, but it is unknown if they contribute significantly to environmental polyP cycling [12].
Not only is polyphosphate accumulation important with respect to human pathogens, it also plays a critical role in the process of wastewater treatment. The goal of wastewater treatment is to remove inorganic nutrients such as nitrogen and phosphorus to prevent downstream eutrophication, where excessive nutrients lead to freshwater ecosystem imbalance and harmful algal blooms [13]. In modern-day wastewater treatment plants, this process depends on specific microbial lineages present in wastewater, which accumulate phosphorus and are eventually removed from the water [14].
Engineering these systems to improve efficiency of phosphorus removal is tricky because it’s not yet clear which microbes contribute the most to polyP accumulation. It’s not even clear how to predict whether a given microbe will accumulate a lot of polyP or very little — almost all bacteria have genes for phosphate polymerization machinery, but there isn’t a clear correlation between sequence and accumulation activity. That said, we do know about a few groups of bacteria that accumulate high levels of polyP. As its name suggests, Candidatus Accumulibacter phosphatis (hereafter referred to as Accumulibacter) is a model polyphosphate-accumulating organism in wastewater within Pseudomonadota (previously Proteobacteria). Tetrasphaera spp. within the Actinobacteria are also abundant in Danish wastewater treatment plants and contribute to polyphosphate cycling [15][16][17]. Many other microbes are important in wastewater treatment as a whole, but it's not known which participate heavily in phosphate accumulation. Additionally, outside of wastewater, certain bacterial lineages store substantial amounts of intracellular polyphosphate in response to stress [18][19].
Why some bacteria seem to be good at accumulating polyP and others aren’t remains an open question. While there could be numerous explanations for this, such as gene expression differences, copy number variation, metabolic dynamics, etc., we decided to explore this question through the lens of protein sequence and predicted protein structural similarity. We hypothesized that regardless of sequence divergence or phylogenetic distance, bacteria that exhibit enhanced polyphosphate accumulation in different contexts may have highly similar PPK1 protein structures. We decided to:
Compare the sequences and structures of approximately 28,000 PPK1 proteins to that of the Accumulibacter PPK1 protein (since we know this bacterial lineage has high levels of polyP accumulation).
Look for signatures of potential convergent evolution of protein structure, which could reveal mechanistic clues about phosphate polymerization. We sought to do this by searching for examples of high structural similarity of PPK1 proteins in taxa that are either distantly related to Accumulibacter, or that we do not expect to have high structural similarity based on phylogenetic distance.
Construct general frameworks for integrating protein sequence and structural similarity metrics with phylogenetic comparisons, so that in the longer-term, we might perform these types of analyses for other proteins in a high-throughput and reproducible fashion.
We used the PPK1 protein from Accumulibacter as a query to compare sequence and structural similarity to all other PPK1 proteins retrieved from UniProt. To assess how phylogenetic distance connects to both sequence and structural similarity, we inferred a phylogeny of PPK1 sequences from Pseudomonadota, the phylum in which Accumulibacter is classified. From this tree, we calculated the patristic (i.e. phylogenetic) distance and compared it among protein sequences and structures. By comparing phylogenetic distance to protein sequence and structural similarity, we sought to find proteins that were highly similar in structure (and presumably function), yet highly evolutionarily distant from the Accumulibacter PPK1. Species with such proteins may have thus convergently evolved the ability to accumulate polyP.
We first collected metadata for approximately 35,000 accessions annotated as PPK1 in bacteria and archaea in UniProt (Figure 1). This included information about protein length, assigned functional annotation, and taxonomic information for the organism. We then selected all proteins longer than 500 amino acids (AAs) to filter out short proteins such as incomplete clone sequences or incorrectly annotated sequences. We chose this cutoff after plotting the distribution of protein lengths for all PPK1 entries from UniProt; a length greater than 500 AAs was sufficient to remove these entries. This resulted in approximately 28,000 accessions that we were confident were correctly annotated as PPK1. We curated metadata with the tidyverse R package (version 2.0) [20]. For each accession, we downloaded the protein sequence from UniProt and the protein structure from the AlphaFold database (version 4) [21]. We’ve provided a TSV file of the metadata for the resulting ~28,000 accessions and gathered protein sequences and structures in this Zenodo archive [22].
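As a rough illustration of the length filter described above, here is a minimal Python sketch (the metadata curation itself was done in R with tidyverse). The column names `Entry`, `Length`, and `Phylum`, and the four example rows, are assumptions made up for illustration; they are not taken from the actual metadata file.

```python
import csv
import io

MIN_LENGTH = 500  # cutoff stated in the text: keep proteins longer than 500 AAs


def filter_by_length(tsv_text, min_length=MIN_LENGTH):
    """Return metadata rows whose 'Length' value is strictly greater than min_length."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [row for row in reader if int(row["Length"]) > min_length]


# Hypothetical metadata table; accessions and lengths are invented.
example_tsv = (
    "Entry\tLength\tPhylum\n"
    "ACC_A\t690\tPseudomonadota\n"
    "ACC_B\t212\tActinomycetota\n"
    "ACC_C\t500\tNitrospirota\n"
    "ACC_D\t701\tCyanobacteriota\n"
)

kept = filter_by_length(example_tsv)
print([row["Entry"] for row in kept])
```

Note that the comparison is strict, so a protein of exactly 500 AAs is dropped, matching the "greater than 500 AAs" wording.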
SHOW ME THE DATA: You can access all the PPK1 protein sequences, structures, and metadata that we used, plus the MMseqs2 and Foldseek results, result tables, and files for phylogenetic inference on Zenodo (DOI: 10.5281/zenodo.8378182).
Since Accumulibacter is a hallmark polyphosphate-accumulating organism in wastewater, we wanted to compare all PPK1 protein sequences and structures to the Accumulibacter PPK1. We used the PPK1 protein (UniProt accession A0A369XMZ4) from the Candidatus Accumulibacter phosphatis UW-LDO-IC strain, which is now reclassified as Candidatus Accumulibacter meliphilus UW-LDO [23][24] (GenBank genome accession GCA_003332265.1). First, we clustered all PPK1 structures using Foldseek (version 6.29) with foldseek easy-cluster [25] within the ProteinCartography pipeline [2]. We then created a Nextflow workflow that runs both mmseqs easy-search with MMseqs2 (version 14.7) [26] and foldseek easy-search, performs all-v-all pairwise sequence and structure comparisons for all PPK1 sequences or structures against the Accumulibacter PPK1, and plots the results.
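To make the two search steps concrete, the sketch below assembles the command lines in Python without running them. Both tools take positional arguments in the order query, target, result file, temporary directory. All file and directory names are hypothetical, and the `--format-output` column selection for Foldseek is an assumption about how a TM-score column could be requested, not necessarily what the Nextflow workflow in the repository does.

```python
import shlex


def mmseqs_easy_search_cmd(query_fasta, target_fasta, result_m8, tmp_dir):
    """Argument list for: mmseqs easy-search <query> <target> <result> <tmp>."""
    return ["mmseqs", "easy-search", query_fasta, target_fasta, result_m8, tmp_dir]


def foldseek_easy_search_cmd(query_pdb, target_dir, result_m8, tmp_dir):
    """Argument list for: foldseek easy-search <query> <target> <result> <tmp>."""
    return [
        "foldseek", "easy-search", query_pdb, target_dir, result_m8, tmp_dir,
        # Assumption: ask for identity and alignment TM-score columns explicitly.
        "--format-output", "query,target,fident,alntmscore",
    ]


# Hypothetical file names, for illustration only.
seq_cmd = mmseqs_easy_search_cmd(
    "accumulibacter_ppk1.fasta", "all_ppk1.fasta", "seq_hits.m8", "tmp_seq"
)
struct_cmd = foldseek_easy_search_cmd(
    "accumulibacter_ppk1.pdb", "all_ppk1_structures", "struct_hits.m8", "tmp_struct"
)

# Print shell-ready strings; passing the lists to subprocess.run would execute them.
print(shlex.join(seq_cmd))
print(shlex.join(struct_cmd))
```

Building argument lists rather than shell strings avoids quoting problems if paths contain spaces.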
We used results from mmseqs easy-search and foldseek easy-search to plot the comparison of protein sequence similarity to TM-score for all PPK1 proteins against the Accumulibacter PPK1 using the R packages tidyverse (version 2.0) and ggpubr (version 0.6.0) [27]. TM-score is a metric for measuring the topological similarity of two protein structures, where scores range from 0–1 and a score of 1 is a perfect match between the two structures [28]. We plotted and overlaid pairwise comparisons of protein sequence similarity and structural similarity for each PPK1 query compared to the Accumulibacter PPK1 with the corresponding phylum as the color.
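Before plotting, the two result tables have to be joined on the target protein. The following Python sketch shows one way to do that join (the plots themselves were made in R). It assumes tab-separated results where column 2 is the target identifier, the sequence table has identity in column 3, and the structure table has TM-score in column 4; this layout and all example values are assumptions for illustration.

```python
import csv
import io


def read_scores(tsv_text, value_col):
    """Map target identifier (second column) to the float found in value_col."""
    scores = {}
    for fields in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
        if not fields:  # skip blank lines
            continue
        scores[fields[1]] = float(fields[value_col])
    return scores


def merge_scores(seq_identity, tm_score):
    """Return (target, identity, TM-score) for targets present in both tables."""
    shared = sorted(set(seq_identity) & set(tm_score))
    return [(target, seq_identity[target], tm_score[target]) for target in shared]


# Invented example rows: query, target, identity[, TM-score].
seq_hits = (
    "query_ppk1\tprot_A\t0.91\n"
    "query_ppk1\tprot_B\t0.42\n"
    "query_ppk1\tprot_C\t0.38\n"
)
struct_hits = (
    "query_ppk1\tprot_A\t0.90\t0.99\n"
    "query_ppk1\tprot_B\t0.40\t0.97\n"
)

merged = merge_scores(read_scores(seq_hits, 2), read_scores(struct_hits, 3))
for target, identity, tm in merged:
    print(target, identity, tm)
```

Targets missing from either table (here `prot_C`) are dropped, so each plotted point has both a sequence identity and a TM-score.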
For highlighting specific comparisons to the Accumulibacter PPK1 structure, we used the notebook explore-ppk1-structures.ipynb to visualize the alignment of two protein structures with Biopython (version 1.81) [29] and the py3Dmol (version 2.0.1) package [30] using PDB files as inputs. For each comparison, we took screenshots of the structure alignment from the notebook.
To investigate the phylogenetic distribution of sequences within the Pseudomonadota phylum (in which Accumulibacter is classified), we inferred a phylogenetic tree of a reduced set of Pseudomonadota PPK1 sequences. To obtain this reduced set of PPK1 sequences, we first clustered sequences at 80% identity using mmseqs easy-cluster, then appended PPK1 sequences for Accumulibacter, Neisseria gonorrhoeae strain ATCC 700825 [Q5FAJ0], Pseudomonas aeruginosa strain ATCC 15692 [P0DP44], Acinetobacter baumannii 83444 [A0A829RFS7], and Ralstonia solanacearum strain UW386 [A0A5B7U1Z3]. We also included an outgroup PPK1 sequence from Streptomyces coelicolor to root the tree. We created an alignment of approximately 1,500 sequences with MUSCLE (version 5.1) [31] and inferred a phylogenetic tree with FastTree 2 (version 2.1.11) [31].
We inspected and rooted the tree using iTOL [32] and visualized it in Empress (version 1.2.0) [33]. In the HTML viewer of Empress, we added two metadata rings for each representative sequence to show sequence similarity and structure similarity (TM-score) for each query compared to the Accumulibacter PPK1. Finally, we compared phylogenetic distance for these representative sequences to pairwise sequence identity and TM-score compared to Accumulibacter PPK1. We read the tree in Newick format into R using the ape package (version 5.7) [34], calculated the patristic distance (the sum of branch lengths along the path connecting two tips through their most recent common ancestor) with the adephylo package (version 1.11) [35], and plotted the results as an interactive HTML plot with Plotly (version 4.10.2) [36].
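To illustrate what patristic distance measures, here is a small self-contained Python sketch. It is not the adephylo implementation used in the analysis; it simply walks from each tip up to the most recent common ancestor and adds the branch lengths. The tree is given as a hypothetical mapping from each node to its parent and branch length, and the toy tree and its values are invented.

```python
def path_to_root(tree, node):
    """Return {ancestor: cumulative branch length from node}, including node itself."""
    distances = {node: 0.0}
    total = 0.0
    while tree[node][0] is not None:
        parent, length = tree[node]
        total += length
        distances[parent] = total
        node = parent
    return distances


def patristic_distance(tree, tip_a, tip_b):
    """Sum of branch lengths on the path between two tips via their common ancestor."""
    up_from_a = path_to_root(tree, tip_a)
    node, total = tip_b, 0.0
    # Climb from tip_b until reaching a node that is also an ancestor of tip_a.
    while node not in up_from_a:
        parent, length = tree[node]
        total += length
        node = parent
    return up_from_a[node] + total


# Toy tree equivalent to the Newick string ((A:0.1,B:0.2)n1:0.3,C:0.5)root;
toy_tree = {
    "root": (None, 0.0),
    "n1": ("root", 0.3),
    "A": ("n1", 0.1),
    "B": ("n1", 0.2),
    "C": ("root", 0.5),
}

print(patristic_distance(toy_tree, "A", "B"))
print(patristic_distance(toy_tree, "A", "C"))
```

In the toy tree, A and B are separated only by their two terminal branches, whereas the path from A to C also crosses the internal branch leading to `n1`.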
We used ChatGPT to write and clean up code. We also used it to suggest wording ideas, then we picked which parts to use.
All the code we generated and used for the pub is available in this GitHub repository (DOI: 10.5281/zenodo.8412197), including a workflow for making protein sequence and structural comparisons to a query, a Jupyter notebook for overlaying structures, and visualization scripts.
We sought to test the hypothesis that phosphate-polymerizing PPK1 enzymes from bacteria that we know to be effective polyP accumulators have more similar protein structures than expected given their sequence divergence. If supported, this hypothesis would suggest that we may be able to predict whether uncharacterized species accumulate high levels of polyP. We predicted that we’d find proteins with divergent sequences that are still structurally similar to the Accumulibacter PPK1 protein.
We first clustered all ~28,000 PPK1 structures and labeled the clusters with phylum information (Figure 2). We inspected clusters that contain Accumulibacter PPK1 structures: SC59, SC21, SC13. We found a few proteins within those clusters that have high TM-scores (i.e. their structures are very similar to the Accumulibacter PPK1), but which come from other phyla. These include Nitrospira sp. [A0A3C1Z3C9], Gemmatimonadetes sp. [A0A7Y2B3S7] and Methanomassiliicoccus sp. [A0A847T1M7] (compare their structures in Figure 3). We were encouraged that the first two taxa are bacterial lineages that are either important or abundant in wastewater and freshwater [37][38]. Methanomassiliicoccus spp. are methanogenic archaea important for anaerobic wastewater treatment processes and production of methane. It is still largely unknown how or if methanogenic archaea contribute to polyphosphate accumulation in wastewater even though they have the genetic potential [12]. PPK1 proteins from additional microbes cluster with the Accumulibacter PPK1, but we don’t have data on their polyphosphate phenotypes. These results highlight that our approach could be useful in screening for candidate polyP-accumulating bacteria, which could then be verified through wet-lab experiments.
We were also interested in examples where proteins have high structural similarity but low sequence similarity, which could suggest convergent evolution of structure. Alternatively, this could suggest that structural similarity of PPK1 is dictated by local, rather than global, sequence similarity. To explore this, we compared all PPK1 protein sequences and structures to our model phosphate-polymerizing enzyme, the Accumulibacter PPK1 (Figure 4). We were reassured to find that all pairwise TM-score comparisons to the Accumulibacter PPK1 were 0.8 and above, as current practice is to treat a TM-score above 0.5 as sufficient for inferring the same fold and assigning an annotation to a protein [39]. This high structural conservation of all queries is likely due to our prefiltering for accessions longer than 500 AAs, which ensured that we made comparisons to correctly annotated PPK1 proteins.
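For readers unfamiliar with how a TM-score is computed, the Python sketch below evaluates the standard TM-score sum for one fixed superposition: each aligned residue pair at distance d contributes 1 / (1 + (d / d0)²), with d0 = 1.24 · (L − 15)^(1/3) − 1.8, and the sum is divided by the reference length L. The full TM-score is the maximum of this quantity over superpositions, which tools such as Foldseek search for; that optimization is not reproduced here. The distances in the example are invented, and which protein's length is used for normalization depends on the tool's settings.

```python
def tm_d0(reference_length):
    """Length-dependent distance scale d0 used to normalize residue distances."""
    return 1.24 * (reference_length - 15) ** (1.0 / 3.0) - 1.8


def tm_score_for_superposition(distances, reference_length):
    """TM-score sum for one superposition.

    distances: distances (in angstroms) between aligned residue pairs.
    reference_length: number of residues in the protein used for normalization.
    """
    d0 = tm_d0(reference_length)
    contributions = (1.0 / (1.0 + (d / d0) ** 2) for d in distances)
    return sum(contributions) / reference_length


# Invented examples for a 700-residue reference protein.
perfect = tm_score_for_superposition([0.0] * 700, 700)       # every residue matches
half_aligned = tm_score_for_superposition([0.0] * 350, 700)  # only half the residues align
print(perfect, half_aligned)
```

Because unaligned residues contribute nothing while the divisor stays fixed at the reference length, partial alignments are penalized even when the aligned portion superimposes well.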
As expected, the general trend is that with decreasing PPK1 sequence identity, protein structural alignment (represented by TM-score) also decreases. However, there is a plateau of decreasing protein sequence similarity but fairly high structural similarity, specifically for sequences within Pseudomonadota (Figure 4, grey points). This suggests that there are indeed proteins with similar protein structure despite dissimilar sequence composition.
To test if PPK1 structures convergently evolved among distantly related taxa, we inferred a tree for approximately 1,500 representative Pseudomonadota PPK1 sequences. We overlaid the phylogenetic tree with each PPK1 TM-score compared to the Accumulibacter PPK1 and labeled a handful of organisms known to exhibit enhanced polyP accumulation (Figure 5). We then used the phylogeny of PPK1 sequences to obtain the patristic distance among sequences, a measure of evolutionary distance defined as the sum of branch lengths separating two proteins in the tree. We compared the patristic distance to both the protein sequence identity and the structural alignment to the Accumulibacter PPK1 (Figure 6). Unsurprisingly, protein sequence similarity to the Accumulibacter PPK1 decreases consistently as phylogenetic distance increases (Figure 6). Notably, the shape of the pattern differs when we plot phylogenetic distance versus structural similarity (TM-score). That is, whereas sequence similarity drops off consistently with increasing phylogenetic distance before plateauing, protein structure is conserved at greater phylogenetic distances before eventually dropping off sharply (Figure 6). This aligns with the thinking that protein structures evolve more slowly and are overall more conserved than protein sequences, but it emphasizes a need for additional assessment of the extent to which we expect TM-score and sequence similarity to correspond.
Based on knowledge of human pathogens where polyphosphate accumulation is important for virulence and in looking at the results as a whole, the most striking data points were in Neisseria gonorrhoeae strain ATCC 700825 [Q5FAJ0], Pseudomonas aeruginosa strain ATCC 15692 [P0DP44], Acinetobacter baumannii 83444 [A0A829RFS7], and Ralstonia solanacearum strain UW386 [A0A5B7U1Z3] (Figure 5 and Figure 6), where each protein had a > 0.98 TM-score compared to the Accumulibacter PPK1. The first three organisms are human pathogens in which polyphosphate accumulation is linked to virulence. Some strains of Neisseria gonorrhoeae accumulate large amounts of polyphosphate granules on the exterior of the cell into a pseudo-capsule and this is connected to human immune system evasion [40]. Pseudomonas aeruginosa causes infections in immunocompromised individuals, and ppk1 knockouts lead to deficiencies in biofilm formation, motility, and quorum sensing [41]. Acinetobacter baumannii is a multi-drug resistant bacterium that causes nosocomial infections, and inhibition of PPK1 by repurposed drugs led to decreased biofilm formation, surface motility, and overall virulence [42]. Ralstonia solanacearum is a plant pathogen that causes bacterial wilt disease in crops like potatoes and tomatoes [43], where biofilm formation, motility, and quorum sensing are important virulence factors for surviving in the nutrient-poor xylem of plants [44][45].
Overall, these results highlight that this comparative approach to integrating protein structural predictions with phylogenetics could identify patterns of convergent evolution and functional importance across diverse bacterial lineages within the contexts of human health, agriculture, and biotechnological applications. Creating explicit statistical tests for correlating sequence and structural similarity and looking for phylogenetic outliers of this ratio will help us narrow down protein and species candidates for further validation.
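As one possible starting point for the outlier screen mentioned above, the Python sketch below computes the ratio of TM-score to sequence identity for each protein and flags proteins whose ratio sits well above the rest, using a simple z-score. This is an assumption-laden illustration, not a method used in this pub: the z-score cutoff, the function name, and the example values are all hypothetical, and the sketch ignores phylogenetic distance entirely.

```python
import statistics


def ratio_outliers(records, z_cutoff=2.0):
    """Flag proteins with unusually high structure-to-sequence similarity ratios.

    records: iterable of (name, sequence_identity, tm_score), identity > 0.
    Returns the sorted names whose ratio z-score exceeds z_cutoff.
    """
    ratios = {name: tm / identity for name, identity, tm in records}
    values = list(ratios.values())
    mean = statistics.mean(values)
    spread = statistics.pstdev(values)
    if spread == 0:
        return []  # all ratios identical, nothing stands out
    return sorted(name for name, r in ratios.items() if (r - mean) / spread > z_cutoff)


# Invented example: nine similar proteins plus one with low identity but high TM-score.
example_records = [(f"P{i}", 0.8, 0.96) for i in range(1, 10)]
example_records.append(("P10", 0.3, 0.96))

flagged = ratio_outliers(example_records)
print(flagged)
```

A real test would also need to account for the expected relationship between identity, TM-score, and patristic distance, since the ratio rises for any distantly related protein with a conserved fold.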
From these results, we’ve generated interesting hypotheses about the structural conservation of PPK1 across diverse bacteria, specifically in those that are known to accumulate large amounts of polyphosphate. Subsequent wet-lab experiments would be needed to validate whether protein structures with similar TM-scores indeed have similar activities or phenotypes related to polyphosphate accumulation, but this approach provides a starting place to test in the lab.
Interestingly, we did not find the same level of high similarity between PPK1 protein structures from Accumulibacter and Tetrasphaera spp. (average TM-score of 0.931 across five Tetrasphaera PPK1 proteins), even though these are the two main, experimentally verified bacterial lineages that contribute to polyphosphate accumulation in wastewater. If structural similarity and assessed PPK1 function were perfectly correlated, we would have expected that Tetrasphaera spp. would have the highest structural similarity to the Accumulibacter PPK1. However, the five Tetrasphaera spp. PPK1 proteins fell into the SC22, SC29, and SC39 clusters. These clusters also contained other lineages important in the wastewater treatment process, such as methanogenic archaeal lineages including Methanomicrobiales, and several Gemmatimonadetes spp. Additionally, the Tetrasphaera clusters contained several Cyanobacteria lineages such as the marine Prochlorococcus, Synechococcus, and Leptolyngbya. Although these lineages did not fall in the same clusters as Accumulibacter or have as much protein structure similarity to the Accumulibacter PPK1 as expected, this could suggest that several different protein structures evolved and converged in different lineages and could be connected to increased polyphosphate accumulation under certain conditions.
Additionally, we restricted our analysis to comparisons of only the PPK1 protein, but PPK2 or copy number variation of PPK family proteins can contribute to enhanced polyphosphate accumulation, as they do in Pseudomonas aeruginosa [46][47]. Follow-up to this work could include co-clustering of PPK1 along with PPK2 for bacterial lineages that contain both to connect to polyphosphate accumulation phenotypes.
Querying ~28,000 PPK1 proteins against the Accumulibacter PPK1 resulted in highly similar comparisons to PPK1 protein structures in other lineages important in the wastewater treatment process and human pathogens where polyphosphate accumulation is an important virulence trait.
Searching for examples of high structural similarity of PPK1 proteins in distantly related taxa provided cases to test for potential convergent evolution of the protein structure.
More broadly, we can start connecting protein structure and phylogenetic comparisons together to generate more informed hypotheses about the evolutionary patterns of protein families, as well as harnessing novel or efficient protein functions that can be re-engineered for biotechnological applications.
We believe that polyP accumulation and the PPK1 protein could be a good test case as we continue developing our platform, both computationally and in the lab. We could interrogate why certain proteins end up in certain structural clusters by performing domain analyses to look for common motifs within clusters. With more trait information, we could start to compare PPK1 structures from high vs. low polyP-accumulating bacteria to identify key structural features required for efficient polyP formation.
As we build out our platform workflows, we are actively looking for proteins that are biologically interesting and allow for quick experimental validation of our computational predictions. Since there are many existing assays for quantifying polyphosphate in the lab [48], we believe we could potentially build off our results with PPK to test subsequent in silico tools and eventually test hypotheses with wet-lab validation.
We’re curious to hear what tools and approaches you’d like to see us explore next for connecting protein structure comparisons to phylogenetic metrics, and we’re open to ideas for other proteins that could be better test cases for our development efforts.