How can we improve upon and expand the scope of our phylogenomic inferences?
We’re seeking feedback on NovelTree, our modular phylogenomic workflow. We’d appreciate your insights into how we can improve gene family inference, incorporate protein structure predictions, and expand to whole-genome data as input.
We recently released NovelTree — a modular Nextflow workflow that takes proteomes from diverse organisms as input and conducts phylogenetic inference for thousands of genes [1]. Since its release, we’ve started to consider alternative methodologies and input data to facilitate a range of new use cases for the method.
We’re seeking feedback from the community on how we might improve our approach to inferring gene families, perform protein structural phylogenetics, and conduct phylogenomic inference of not only coding sequences but genome-wide synteny. Whether you’re a phylogeny lover, develop phylogenetic methods, or apply them as a systematist or comparative evolutionary biologist, we’d love to hear from you!
Watch a video tutorial on making a PubPub account and commenting. Please feel free to add line-by-line comments anywhere within this text, provide overall feedback by commenting in the box at the bottom of the page, or use the URL for this page in a tweet about this work. Please make all feedback public so other readers can benefit from the discussion.
Background on the original pub
By performing phylogenetic inference for thousands of genes, we can both develop and test hypotheses about the role of proteins and their constituent gene families over evolutionary time scales, teaching us about the tempo and mode of evolution across the tree of life. To do this, we developed NovelTree — a modular Nextflow workflow that takes proteomes from diverse organisms as input and infers orthology, gene family trees, species trees, and gene family evolutionary dynamics [1]. The pipeline is a helpful tool to, for instance, associate gene family expansion or contraction with specific organismal traits and generate stronger hypotheses for the function of uncharacterized proteins.
Our latest questions
NovelTree has proven useful for many tasks — generating phylogenomic datasets, mapping genome-wide evolutionary patterns across broad portions of the tree of life, and more. However, there are still multiple areas wherein the framework could be optimized, improved, or expanded upon. Simultaneously, the tools, data types, and theoretical frameworks used in computational and comparative evolutionary research are rapidly changing. We’d therefore love to elicit feedback on the following questions. Rather than leading with our own thoughts on the matter, we’d like to hear your ideas, independent of what we’ve already begun to consider. We hope this can help us understand what may be useful to the community as a whole.
How can I weigh in?
We hope you’ll respond publicly to our questions below by selecting/highlighting the question you’d like to answer, clicking the comment icon, and typing in your thoughts! You’ll need a PubPub account to do this, but it’s free and quick to make one. Here’s a quick tutorial on how to comment.
What’s the best strategy for gene family inference?
Nearly all gene-based phylogenomic analyses rely on the accurate inference of gene families. Despite this, the methodology underlying gene family inference has historically received relatively little scrutiny compared to that for multiple sequence alignment or inference of phylogenetic trees. Right now, NovelTree uses a procedure based on OrthoFinder’s [2], clustering protein sequences into gene families (orthogroups) based on their sequence similarity. We extend this procedure by assessing the impact of the MCL clustering algorithm’s inflation parameter using the COGEQC functional annotation metric, which quantifies the extent to which biologically informative protein annotations are distributed within vs. split among gene families [3]. However, much can still be done to improve or extend how we assess the accuracy of orthogroup inference, or to rethink how we infer gene families altogether. We’re particularly interested in how we can perform gene family inference without relying on protein functional annotations, which are frequently unavailable for non-model organisms. Similar to how we’ve implemented various methods for multiple sequence alignment and gene/species tree inference, we’d like to have multiple gene family inference methods available for NovelTree users.
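To make the role of the inflation parameter concrete, here is a minimal sketch of Markov clustering (MCL) on a toy similarity graph. It is a simplified, dense-matrix illustration, not the implementation used by OrthoFinder or NovelTree. The function name `mcl_clusters`, the toy similarity matrix, and the iteration, tolerance, and membership-threshold settings are all assumptions made for this example.

```python
import numpy as np


def mcl_clusters(similarity, inflation=2.0, max_iter=100, tol=1e-9):
    """Bare-bones MCL on a symmetric, non-negative similarity matrix.

    Larger `inflation` values tend to yield smaller, tighter clusters.
    Returns a sorted list of tuples of node indices.
    """
    m = np.array(similarity, dtype=float)
    np.fill_diagonal(m, 1.0)             # add self-loops (assumes scores in [0, 1])
    m /= m.sum(axis=0, keepdims=True)    # make each column sum to 1
    for _ in range(max_iter):
        previous = m
        m = m @ m                        # expansion: flow spreads along paths
        m = m ** inflation               # inflation: strong flows are boosted
        m /= m.sum(axis=0, keepdims=True)
        if np.abs(m - previous).max() < tol:
            break
    # Rows that retain flow act as attractors; the columns they attract
    # form one cluster. Duplicate clusters are merged via the set.
    clusters = set()
    for row in m:
        members = tuple(int(j) for j in np.flatnonzero(row > 1e-3))
        if members:
            clusters.add(members)
    return sorted(clusters)


# Hypothetical similarities: two tight triplets joined by one weak edge (2-3).
toy_similarity = np.array([
    [0.0, 0.9, 0.9, 0.0, 0.0, 0.0],
    [0.9, 0.0, 0.9, 0.0, 0.0, 0.0],
    [0.9, 0.9, 0.0, 0.1, 0.0, 0.0],
    [0.0, 0.0, 0.1, 0.0, 0.9, 0.9],
    [0.0, 0.0, 0.0, 0.9, 0.0, 0.9],
    [0.0, 0.0, 0.0, 0.9, 0.9, 0.0],
])
print(mcl_clusters(toy_similarity, inflation=2.0))
```

On this toy graph, the two triplets should come out as separate clusters at an inflation of 2.0. Rerunning with other inflation values is a quick way to see why the choice of this parameter deserves an explicit quality check.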
Other than the COGEQC functional annotation metric, how might we assess the quality of gene family inference?
How might we improve our gene family inference procedure (e.g., using alternative methodology), and what types of data would be most suited to doing so?
Going a step further, how might we infer gene family evolutionary dynamics more efficiently, without loss of accuracy?
How can we level up our phylogenomic inferences using newly abundant protein structure predictions?
Our phylogenetic analyses are based on protein (i.e., amino acid) sequences, as these are readily available in various public databases. Yet given the recent advances in predicting protein structure with tools such as AlphaFold [4] and ESMFold [5], there are many opportunities to develop novel statistics accounting for both sequence and structural evolutionary patterns. One small example: it might be possible to infer shifts in the adaptive evolution of certain proteins by investigating discontinuities between sequence and structural similarity. Generating a theoretical and applied framework for this would open up new possibilities for identifying interesting evolutionary patterns.
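As a rough illustration of what "discontinuities between sequence and structural similarity" could look like in practice, the sketch below fits a linear trend between pairwise sequence identity and a structural similarity score, then flags protein pairs whose residuals are unusually large. This is only one possible approach and not an established statistic. The function name `flag_discordant_pairs`, the linear fit, the robust z-score cutoff of 3.5, and all numbers in the toy data are assumptions made for this example.

```python
import numpy as np


def flag_discordant_pairs(seq_identity, struct_similarity, z_cutoff=3.5):
    """Flag protein pairs whose structural similarity departs from the
    trend expected given their sequence identity.

    Returns (indices of flagged pairs, robust z-score for every pair).
    """
    x = np.asarray(seq_identity, dtype=float)
    y = np.asarray(struct_similarity, dtype=float)
    slope, intercept = np.polyfit(x, y, deg=1)   # ordinary least-squares line
    residuals = y - (slope * x + intercept)
    # Median/MAD scaling so a single extreme pair does not mask itself.
    center = np.median(residuals)
    mad = np.median(np.abs(residuals - center))
    if mad == 0:
        return np.array([], dtype=int), np.zeros_like(residuals)
    robust_z = 0.6745 * (residuals - center) / mad
    return np.flatnonzero(np.abs(robust_z) > z_cutoff), robust_z


# Hypothetical values for eight protein pairs within one gene family.
seq_identity = [0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
struct_similarity = [0.53, 0.57, 0.65, 0.40, 0.75, 0.83, 0.87, 0.95]

flagged, robust_z = flag_discordant_pairs(seq_identity, struct_similarity)
print(flagged, np.round(robust_z, 2))
```

In this made-up dataset, the fourth pair has moderate sequence identity but much lower structural similarity than the trend predicts, so it is the kind of pair such a screen would surface for closer inspection.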
How could we best incorporate protein structural predictions into phylogenomic analyses?
How can we move beyond just proteins and use whole genomes for phylogenomic analysis?
One of the biggest limitations of NovelTree and other related phylogenomic pipelines is their exclusive applicability to protein-coding sequences. Expanding the framework to accommodate genome-wide sequence data (e.g., whole-genome assemblies from multiple species) would empower us in several ways. Some example benefits include:
1) Expanding our scope beyond identifying orthologous genes and proteins to inferring synteny across the genome.
2) More exhaustively studying patterns of adaptive and non-adaptive molecular evolution of coding sequences (e.g., the ratio of non-synonymous to synonymous substitutions, dN/dS).
3) Developing tools to investigate the evolution of non-coding and regulatory regions of the genome.
Given these possibilities and numerous others, we’d love to explore means for incorporating whole-genome sequence data into our phylogenomic analyses.
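To illustrate why nucleotide-level coding sequences add information beyond proteins alone, the sketch below classifies codon differences between two aligned, in-frame coding sequences as synonymous or non-synonymous using the standard genetic code. It returns raw counts only and is not a dN/dS estimate: a proper estimate also normalizes by the number of synonymous and non-synonymous sites and corrects for multiple substitutions, as dedicated codon-model software does. The function name `count_codon_differences` and the example sequences are hypothetical.

```python
from itertools import product

# Standard genetic code, with codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def count_codon_differences(seq_a, seq_b):
    """Count (synonymous, non-synonymous) codon differences between two
    aligned, in-frame coding sequences of equal length."""
    if len(seq_a) != len(seq_b) or len(seq_a) % 3 != 0:
        raise ValueError("sequences must be aligned, in frame, and equal in length")
    synonymous = non_synonymous = 0
    for i in range(0, len(seq_a), 3):
        codon_a = seq_a[i:i + 3].upper()
        codon_b = seq_b[i:i + 3].upper()
        # Skip identical codons and codons with gaps or ambiguous bases.
        if codon_a == codon_b or codon_a not in CODON_TABLE or codon_b not in CODON_TABLE:
            continue
        if CODON_TABLE[codon_a] == CODON_TABLE[codon_b]:
            synonymous += 1
        else:
            non_synonymous += 1
    return synonymous, non_synonymous


# Hypothetical sequences: GCT -> GCC keeps alanine, AAA -> AGA swaps lysine for arginine.
print(count_codon_differences("ATGGCTAAA", "ATGGCCAGA"))
```

The synonymous change in this example is invisible in the translated proteins, which is the kind of signal a protein-only workflow cannot recover.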
How might we extend NovelTree to conduct truly whole-genome phylogenomics, reaching beyond the scope of coding sequences alone?
Let us know what you think!
We’ve outlined several outstanding questions and potential development opportunities that should inform our next steps in enhancing NovelTree and, hopefully, future phylogenomics tools from others. That said, this is not an exhaustive list of questions related to NovelTree or related applications of the evolutionary datasets it generates. We encourage public responses to the questions posed above — but we’d love to hear about anything else that came to mind while reading the pub!
Share your thoughts!
I’m wondering if there might be some way to incorporate alignment-free methods like min-hash here. They aim for blunt efficiency over explicit modeling, but I’m curious what a more nuanced version of their use might look like. For example, by partitioning the genome into some number of annotations, or even within gene families, as a comparative tool.
Austin H. Patton:
Hi Joel, thanks for the suggestion! I certainly think there is room for such approaches in a workflow like NovelTree that strives for highly efficient, scalable phylogenomic inference. This idea is similar to a question posed below by Jacek. I wonder to what extent approaches like these may be useful even just as a means to expedite and even improve MSA and MSA-based phylogenetic inference, perhaps by quickly generating a higher-quality starting tree to seed either MSA inference or likelihood-based tree inference?
Natasha Picciani:
Perhaps depending on the questions you’d like to address here, it would be useful to consider methods that retrieve subtrees of large and multicopy gene family trees (tree decomposition). That way, instead of relying on the delimitation of gene families, you could rely on single-copy clusters from all gene families. Check out Smith et al. MBE 2022.
Austin H. Patton:
I definitely agree! In fact, we implement this type of gene tree decomposition within the ASTEROID species tree inference module using DISCO (https://doi.org/10.1093/sysbio/syab070). Perhaps it might be useful for us to break that out into its own module, returning these decomposed, single-copy gene family trees as a separate output, independent of the (optional) ASTEROID module?
Adi Lavy:
There are several approaches that could be combined to improve phylogeny prediction. The first is to use an LLM as an improved method of predicting the ORFs of genes. This would alleviate the problem of mis-categorized proteins in existing databases. Second, in the case of microbial genomes, use gene clusters as a single unit, not only to identify similarities in a single gene, but also to explore differences in spacing and in the non-coding regions in between. Third, use the predicted structures of proteins and compare their 3D space, generating a score for similarities and differences in the space they occupy. The scoring should consider differences in the active vs. non-active regions of enzymes. Fourth, apply a whole-genome average nucleotide identity (ANI) test. Using a combination of these would improve our ability to infer phylogenetic relationships. Of course, there are also restrictions to some of these methods; for example, they will require a certain cutoff for genome completeness.
Austin H. Patton:
Thanks so much for sharing these ideas, Adi! We’ve actually been floating around some ideas related to a number of these, namely the potential value of incorporating LLMs and structural information/features as a complement to more traditional sequence-based methods. Although I’d be concerned about the sole use of one of these approaches in lieu of sequence-based approaches (due to the difficulty in distinguishing between homology and analogy), they are certain to improve recall of highly divergent sequences, and perhaps enable us to glean more insight into protein structure as you suggest!
Jacek Kominek:
I frequently found it a useful, interesting, surprisingly accurate and (of course) vastly more efficient alternative approach to skip MSAs entirely and use k-mers to build phylogenies. Granted, since this approach boils down to a distance matrix, it comes at the cost of explicit evolution models and intuitive interpretability of branch lengths, but I think it still can be very much worthwhile. Especially in either very straightforward cases, where the answer is obvious (e.g. pure descent, no dups, no loss, no xfers) or very difficult ones (so much noise that even 100s of millions of steps of Bayesian inference won’t do any good).
Austin H. Patton:
Yes, I think these types of approaches have the potential to be of great utility in phylogenomic inference, particularly as the size of datasets increases! This is certainly something worth looking into!
Akin to the k-mer based approaches (but still reliant on MSAs), recent approaches like PhyloFormer could have similar utility in this context. This approach uses one-hot encoded MSAs and a transformer with site-level and pair-level attention mechanisms to learn and predict evolutionary distance matrices, which in turn are used to infer phylogenies using FastME with impressive accuracy.
Jacek Kominek:
It could be beneficial for gene family inference to compare protein structures of the members, and either empirically determine some helpful heuristic cutoffs and/or create “structural HMMs” for a gene family, and use that to assess how consistent the members are. This approach would probably only help with gene families with significant levels of sequence divergence, but it still could be helpful with potential candidate members in the twilight zone, which can be very problematic.
Austin H. Patton:
I couldn’t agree more! We’re definitely excited about the prospect of incorporating protein structural information into our workflows and phylogenomic analyses, especially as these resources become increasingly available and efficient to use at scale. I agree that an approach like what you’ve outlined could be a great complement to traditional sequence-based methods to improve recovery of highly divergent proteins! One outstanding question that remains in this context is to what extent use of such structure-informed comparisons will identify or be able to distinguish between cases of homology vs analogy (or convergence, etc).
Jacek Kominek:
One way to incorporate structural information could be to use it to filter out or downweight sequence alignment sites that are very close to each other in 3D structures. The rationale is that diversity at such sites is likely to reflect co-evolution, with changes compensating for one another, rather than an independent evolutionary signal.
Austin H. Patton:
That’s an interesting thought! If these physically proximal sites do in fact typically co-evolve, then would cases wherein this co-evolution breaks down be particularly informative? My primary concern would be that down-weighting their contribution may lead to these major shifts being missed - but I’m not sure to what extent that concern is warranted!