How can we improve upon and expand the scope of our phylogenomic inferences?

Prachee Avasthi; Megan L. Hochstrasser; Jasmine Neal; Austin H. Patton; Ryan York

doi:10.57844/arcadia-5b41-637f

Open question Not actively updating Genetics: Decoding evolutionary drivers across biology

Published on Mar 05, 2024 by Arcadia Science

We’re seeking feedback on NovelTree, our modular phylogenomic workflow. We’d appreciate your insights into how we can improve gene family inference, incorporate protein structure predictions, and expand to whole-genome data as input.

Purpose

We recently released NovelTree — a modular Nextflow workflow that takes proteomes from diverse organisms as input and conducts phylogenetic inference for thousands of genes [1]. Since its release, we’ve started to consider alternative methodologies and input data to facilitate a range of new use cases for the method.

We’re seeking feedback from the community on how we might improve our approach to inferring gene families, perform protein structural phylogenetics, and conduct phylogenomic inference of not only coding sequences but genome-wide synteny. Whether you’re a phylogeny lover, develop phylogenetic methods, or apply them as a systematist or comparative evolutionary biologist, we’d love to hear from you!

This is a follow-up to work described in a prior pub, “NovelTree: Highly parallelized phylogenomic inference.” Visit that pub for complete background info and context.

Share your thoughts!

Feel free to provide feedback by commenting in the box at the bottom of this page or by posting about this work on social media. Please make all feedback public so other readers can benefit from the discussion.

Background on the original pub

By performing phylogenetic inference for thousands of genes, we can both develop and test hypotheses about the role of proteins and their constituent gene families over evolutionary time scales, teaching us about the tempo and mode of evolution across the tree of life. To do this, we developed NovelTree — a modular Nextflow workflow that takes proteomes from diverse organisms as input and infers orthology, gene family trees, species trees, and gene family evolutionary dynamics [1]. The pipeline is a helpful tool to, for instance, associate gene family expansion or contraction with specific organismal traits and generate stronger hypotheses for the function of uncharacterized proteins.

Our latest questions

NovelTree has proven useful for many tasks, including generating phylogenomic datasets and mapping genome-wide evolutionary patterns across broad portions of the tree of life. However, there are still multiple areas where the framework could be optimized, improved, or expanded. At the same time, the tools, data types, and theoretical frameworks used in computational and comparative evolutionary research are changing rapidly. We’d therefore love feedback on the following questions. Rather than leading with our own thoughts on the matter, we’d like to hear your ideas, independent of what we’ve already begun to consider. We hope this can help us understand what may be useful to the community as a whole.

How can I weigh in?
We hope you’ll respond publicly to our questions below by selecting/highlighting the question you’d like to answer, clicking the comment icon, and typing in your thoughts! You’ll need a PubPub account to do this, but it’s free and quick to make one.

What’s the best strategy for gene family inference?

Nearly all gene-based phylogenomic analyses rely on the accurate inference of gene families. Despite this, the methodology underlying gene family inference has historically received relatively little scrutiny compared to multiple sequence alignment or phylogenetic tree inference. Right now, NovelTree uses a procedure based on OrthoFinder’s [2], clustering protein sequences into gene families (orthogroups) based on their sequence similarity. We extend this procedure by assessing the impact of the MCL clustering algorithm’s inflation parameter using the COGEQC functional annotation metric, which quantifies the extent to which biologically informative protein annotations are distributed within vs. split among gene families [3]. However, much can still be done to improve or extend how we assess the accuracy of orthogroup inference, or how we infer gene families in the first place. We’re particularly interested in how we can perform gene family inference without relying on protein functional annotations, which are frequently unavailable for non-model organisms. Just as we’ve implemented various methods for multiple sequence alignment and gene/species tree inference, we’d like to have multiple gene family inference methods available for NovelTree users.
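To make the "within vs. split among gene families" idea concrete, here is a minimal Python sketch of one simplified score. It is not the COGEQC formula and not NovelTree code; the function names, the toy protein IDs, and the scoring rule (the fraction of proteins sharing an annotation that land in a single orthogroup) are assumptions made purely for illustration.

```python
from collections import Counter, defaultdict

def annotation_cohesion(assignments, annotations):
    """For each annotation term, return the fraction of its proteins that
    fall in the single orthogroup holding most of them (1.0 = never split).

    assignments: dict mapping protein ID -> orthogroup ID (hypothetical input)
    annotations: dict mapping protein ID -> annotation term (hypothetical input)
    """
    groups_by_term = defaultdict(list)
    for protein, term in annotations.items():
        if protein in assignments:  # ignore proteins that were not clustered
            groups_by_term[term].append(assignments[protein])
    return {
        term: Counter(groups).most_common(1)[0][1] / len(groups)
        for term, groups in groups_by_term.items()
    }

def mean_cohesion(assignments, annotations):
    """Average the per-term cohesion scores; NaN if nothing is annotated."""
    scores = annotation_cohesion(assignments, annotations)
    return sum(scores.values()) / len(scores) if scores else float("nan")

# Toy data: two annotation terms, clustered under two hypothetical settings.
annotations = {"p1": "kinase", "p2": "kinase", "p3": "kinase",
               "p4": "transporter", "p5": "transporter"}
coarse = {"p1": "OG1", "p2": "OG1", "p3": "OG1", "p4": "OG2", "p5": "OG2"}
fine = {"p1": "OG1", "p2": "OG1", "p3": "OG2", "p4": "OG3", "p5": "OG4"}

coarse_score = mean_cohesion(coarse, annotations)
fine_score = mean_cohesion(fine, annotations)
```

Note that this toy score only penalizes splitting: lumping every protein into one giant orthogroup would score perfectly. Any real metric needs a complementary term for over-merging, which is part of why we are asking about alternative quality measures.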

Other than the COGEQC functional annotation metric, how might we assess the quality of gene family inference?

How might we improve our gene family inference procedure (e.g., using alternative methodology), and what types of data would be most suited to doing so?

Going a step further, how might we infer gene family evolutionary dynamics more efficiently, without loss of accuracy?

How can we level up our phylogenomic inferences using newly abundant protein structure predictions?

Our phylogenetic analyses are based on protein (i.e. amino acid) sequences, as these are readily available in various public databases. Yet given the recent advances in predicting protein structure with tools such as AlphaFold [4] and ESMFold [5], there are many opportunities to develop novel statistics accounting for both sequence and structural evolutionary patterns. One small example: it might be possible to infer shifts in the adaptive evolution of certain proteins by investigating discontinuities between sequence and structural similarity. Generating a theoretical and applied framework for this would open up new possibilities for identifying interesting evolutionary patterns.

How could we best incorporate protein structural predictions into phylogenomic analyses?

How can we move beyond just proteins and use whole genomes for phylogenomic analysis?

One of the biggest limitations of NovelTree and other related phylogenomic pipelines is their exclusive applicability to protein-coding sequences. Expanding the framework to accommodate genome-wide sequence data (e.g., whole-genome assemblies from multiple species) would empower us in several ways. Some example benefits include: 1) Expanding our scope beyond identifying orthologous genes and proteins to inferring synteny across the genome. 2) More exhaustively studying patterns of adaptive and non-adaptive molecular evolution of coding sequences (e.g., the ratio of non-synonymous to synonymous substitution rates, dN/dS). 3) Developing tools to investigate the evolution of non-coding and regulatory regions of the genome.

Given these possibilities and numerous others, we’d love to explore means for incorporating whole-genome sequence data into our phylogenomic analyses.

How might we extend NovelTree to conduct truly whole-genome phylogenomics, reaching beyond the scope of coding sequences alone?

Let us know what you think!

We’ve outlined several outstanding questions and potential development opportunities that should inform our next steps in enhancing NovelTree and, hopefully, future phylogenomics tools from others. That said, this is not an exhaustive list of questions related to NovelTree or related applications of the evolutionary datasets it generates. We encourage public responses to the questions posed above — but we’d love to hear about anything else that came to mind while reading the pub!

This work is licensed under CC BY 4.0

Prachee Avasthi

Supervision

Megan L. Hochstrasser

Editing, Supervision

Jasmine Neal

Writing

Austin H. Patton

Conceptualization, Writing

Ryan York

Supervision, Writing

Jacek Kominek on Sep 07, 2024

Selection

ework for this would open up new possibilities for identifying interesting evolutionary patterns. How could we best incorporate protein structural predictions into phylogenomic analyses? How can we move beyond just proteins and use whole genomes for phylogenomic analysis? One of the

One way to incorporate structural information could be to use it to filter out or down-weight sequence alignment sites that are very close to each other in 3D structures. The rationale is that diversity at such sites likely reflects compensatory co-evolution rather than an independent evolutionary signal.
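A rough way to picture this suggestion is the Python sketch below. Everything in it is an assumption for illustration: one representative structure per gene family, ungapped alignment columns that map one-to-one onto residues, Cα coordinates in ångströms, an 8 Å contact cutoff, skipping residues within three positions in sequence (since neighbors along the chain are trivially close), and the weighting rule 1 / (1 + number of contacts). None of these choices come from the comment or from NovelTree.

```python
import math

def proximity_weights(coords, cutoff=8.0, min_separation=3):
    """Down-weight alignment columns whose residues have spatial contacts.

    coords: list of (x, y, z) C-alpha coordinates, one per alignment column
            (hypothetical: assumes an ungapped column-to-residue mapping).
    cutoff: distance in angstroms below which two residues count as in contact.
    min_separation: minimum sequence separation for a pair to be considered,
                    so that chain neighbors are not counted as contacts.
    Returns one weight per column: 1 / (1 + number of contacts).
    """
    n = len(coords)
    contacts = [0] * n
    for i in range(n):
        for j in range(i + min_separation, n):
            if math.dist(coords[i], coords[j]) <= cutoff:
                contacts[i] += 1
                contacts[j] += 1
    return [1.0 / (1 + c) for c in contacts]

# Toy coordinates (not a real protein): five residues along a line, plus a
# sixth placed close to the first two to mimic a fold bringing sites together.
toy_coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0),
              (11.4, 0.0, 0.0), (15.2, 0.0, 0.0), (0.0, 5.0, 0.0)]
weights = proximity_weights(toy_coords)
```

Weights like these could in principle be passed to any tree inference tool that accepts per-site weights. A hard filter would simply drop columns whose weight falls below some threshold.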

Austin H. Patton on Sep 11, 2024

That’s an interesting thought! If these physically proximal sites do in fact typically co-evolve, then would cases wherein this co-evolution breaks down be particularly informative? My primary concern would be that down-weighting their contribution may lead to these major shifts being missed - but I’m not sure to what extent that concern is warranted!

Byron Smith on Jan 18, 2025

The three key places where I see structural information being useful are (1) additional information for annotating function, (2) identification of distant homology, and (3) sequence alignment.

You touch on (1) in your discontinuity example.

I don’t see much discussion here or in the previous article about (2), the importance of finding distant homologs, but structural comparisons can sometimes be more sensitive across long evolutionary timescales. Adding a structural similarity index may further refine some of the family clustering. On the other hand, finding distant homology is probably not as important when you’re working at tree-of-life scales.

Regarding (3), I expect that the use of structure to refine alignments would reduce error/noise in MSAs—homologous residues would be in the same 3D position and interacting with the same neighbors. Positions that do not meet these criteria should probably be masked.

One tool to be aware of is Foldseek and its concept of a “3Di sequence”, which serves as a 1-dimensional proxy for full structural information. Aligning residues using the 3Di sequence instead of (or along with) amino-acids could be an efficient middle-ground between no structural information and full structural alignments.

Jacek Kominek on Sep 07, 2024

Selection

nference, we’d like to have multiple gene family inference methods available for NovelTree users. Other than the COGEQC functional annotation metric, how might we assess the quality of gene family inference? How might we improve our gene family inference procedure (e.g., using alternative methodology),

It could be beneficial for gene family inference to compare protein structures of the members, and either empirically determine some helpful heuristic cutoffs and/or create “structural HMMs” for a gene family, and use that to assess how consistent the members are. This approach would probably only help with gene families with significant levels of sequence divergence, but it still could be helpful with potential candidate members in the twilight zone, which can be very problematic.

Austin H. Patton on Sep 11, 2024

I couldn’t agree more! We’re definitely excited about the prospect of incorporating protein structural information into our workflows and phylogenomic analyses, especially as these resources become increasingly available and efficient to use at scale. I agree that an approach like what you’ve outlined could be a great complement to traditional sequence-based methods to improve recovery of highly divergent proteins! One outstanding question that remains in this context is to what extent use of such structure-informed comparisons will identify or be able to distinguish between cases of homology vs analogy (or convergence, etc).

Natasha Picciani on Sep 20, 2024

As it is, assessing the quality of gene family definitions sounds like a problem of orthogroup inference accuracy. It’s interesting to see that adjusting the MCL inflation value in OrthoFinder influences the orthogroup output (the author of OrthoFinder mentions that its “algorithm should be relatively resistant to changes in the MCL inflation value”; https://github.com/davidemms/OrthoFinder/issues/34). If changing the inflation value does have a strong effect on orthogroup inference, you could perhaps assess the accuracy of orthogroups using Orthobench (which is how they benchmarked OrthoFinder using default values in the first place).

Austin H. Patton on Oct 02, 2024

Thanks for the suggestion, Natasha! I agree completely that it’s noteworthy how strong of an impact the choice of inflation parameter has on orthogroup inference. So, I think your suggestion of benchmarking accuracy on different dataset subsets in Orthobench could provide some additional nuance to this interpretation! The difficulty in selecting the “best” inflation parameter is that it is likely to always be (to an extent) dependent on the dataset being analyzed. Use of Orthobench to benchmark seems like a great way to quantify this intuition though!

Jacek Kominek on Sep 07, 2024

Selection

e COGEQC functional annotation metric, how might we assess the quality of gene family inference? How might we improve our gene family inference procedure (e.g., using alternative methodology), and what types of data would be most suited to doing so? Going a step further, how might we infer gene family evolutionary dynamics more efficiently, wit

I frequently found it a useful, interesting, surprisingly accurate and (of course) vastly more efficient alternative approach to skip MSAs entirely and use k-mers to build phylogenies. Granted, since this approach boils down to a distance matrix, it comes at the cost of explicit evolutionary models and intuitive interpretability of branch lengths, but I think it can still be very much worthwhile. Especially in either very straightforward cases, where the answer is obvious (e.g., pure descent, no dups, no loss, no xfers), or very difficult ones (so much noise that even 100s of millions of steps of Bayesian inference won’t do any good).
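As a minimal sketch of the alignment-free idea described above, the Python below builds a pairwise distance matrix from shared k-mers. The choice of Jaccard distance, k = 3, the function names, and the toy sequences are all illustrative assumptions. The comment does not specify a particular distance, and real k-mer phylogeny tools use more refined estimators.

```python
from itertools import combinations

def kmer_set(seq, k=3):
    """Return the set of all overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def jaccard_distance(a, b):
    """1 - |A & B| / |A | B|; defined as 0.0 when both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)

def kmer_distance_matrix(seqs, k=3):
    """Build a symmetric dict-of-dicts distance matrix from named sequences."""
    profiles = {name: kmer_set(s, k) for name, s in seqs.items()}
    dist = {name: {name: 0.0} for name in profiles}
    for x, y in combinations(profiles, 2):
        d = jaccard_distance(profiles[x], profiles[y])
        dist[x][y] = dist[y][x] = d
    return dist

# Toy protein fragments (made up): sp_A and sp_B differ at one residue,
# while sp_C is much more divergent.
toy = {"sp_A": "MKTAYIAKQR", "sp_B": "MKTAYIAKQK", "sp_C": "MSTPYLGKNR"}
matrix = kmer_distance_matrix(toy, k=3)
```

A matrix like this would then be handed to a distance-based tree builder such as neighbor joining, which is where the loss of an explicit substitution model and of interpretable branch lengths comes in.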

Austin H. Patton on Sep 10, 2024

Yes, I think these types of approaches have the potential to be of great utility in phylogenomic inference, particularly as the size of datasets increases! This is certainly something worth looking into!

Akin to the k-mer based approaches (but still reliant on MSAs), recent approaches like PhyloFormer could have similar utility in this context. This approach uses one-hot encoded MSAs and a transformer with site-level and pair-level attention mechanisms to learn and predict evolutionary distance matrices, which in turn are used to infer phylogenies using FastME with impressive accuracy.

Adi Lavy on Sep 11, 2024

Selection

nome sequence data into our phylogenomic analyses. How might we extend NovelTree to conduct truly whole-genome phylogenomics, reaching beyond the scope of coding sequences alone? Let us know what you think! We’ve outlined several outstanding questions and potent

There are several approaches that could be combined to improve phylogeny prediction. The first is to use an LLM as an improved method of predicting the ORFs of genes. This would alleviate the problem of mis-categorized proteins in existing databases. Second, in the case of microbial genomes, gene clusters could be treated as a single unit, not only to identify similarities in a single gene but also to explore differences in spacing and in the non-coding regions in between. Third, use the predicted structures of proteins and compare their 3D space, generating a score for similarities and differences in the space they occupy. The scoring should consider differences in the active vs. non-active regions of enzymes. Fourth, apply a whole-genome average nucleotide identity (ANI) test. Using a combination of these would improve our ability to infer phylogenetic relationships. Of course, there are also restrictions to some of these methods; for example, they will require a certain cutoff for genome completeness.

Austin H. Patton on Oct 02, 2024

Thanks so much for sharing these ideas, Adi! We’ve actually been floating around some ideas related to a number of these, namely the potential value of incorporating LLMs and structural information/features as a complement to more traditional sequence based methods. Although I’d be concerned about the sole use of one of these approaches in lieu of sequence based approaches (due to the difficulty in distinguishing between homology and analogy), they are certain to improve recall of highly divergent sequences, and perhaps enable us to glean more insight into protein structure as you suggest!

Natasha Picciani on Sep 20, 2024

Selection

at types of data would be most suited to doing so? Going a step further, how might we infer gene family evolutionary dynamics more efficiently, without loss of accuracy? How can we level up our phyloge

Perhaps depending on the questions you’d like to address here, it would be useful to consider methods that retrieve subtrees of large and multicopy gene family trees (tree decomposition). That way, instead of relying on the delimitation of gene families, you could rely on single-copy clusters from all gene families. Check out Smith et al. MBE 2022.

Austin H. Patton on Oct 02, 2024

I definitely agree! In fact, we implement this type of gene tree decomposition within the ASTEROID species tree inference model using DISCO (https://doi.org/10.1093/sysbio/syab070). Perhaps it might be useful for us to break that out into its own module, returning these decomposed, single copy gene family trees as a separate output, independent of the (optional) ASTEROID module?

Joel Smith on Sep 21, 2024

Selection

ove to explore means for incorporating whole-genome sequence data into our phylogenomic analyses. How might we extend NovelTree to conduct truly whole-genome phylogenomics, reaching beyond the scope of coding sequences alone? Let us know what you think! We’ve outlined

I’m wondering if there might be some way to incorporate alignment-free methods like min-hash here. They aim for blunt efficiency over explicit modeling, but I’m curious what a more nuanced version of their use might look like. For example, by partitioning the genome into some number of annotated regions or even within gene families as a comparative tool.
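For readers less familiar with min-hash, the Python sketch below shows the core trick: summarize each sequence's k-mer set by the minimum value under many seeded hash functions, then estimate Jaccard similarity as the fraction of matching minima. The function names, k = 4, 64 hash functions, the use of salted BLAKE2b, and the toy sequences are assumptions for illustration; production tools use more efficient sketching schemes.

```python
import hashlib

def kmers(seq, k=4):
    """Return the set of all overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def minhash_signature(items, num_hashes=64):
    """For each seeded hash function, keep the minimum hash over all items."""
    if not items:
        raise ValueError("cannot sketch an empty set")
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")  # a distinct salt per hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(x.encode(), digest_size=8, salt=salt).digest(),
                "big",
            )
            for x in items
        ))
    return signature

def estimate_jaccard(sig_a, sig_b):
    """Fraction of positions where the two signatures share the same minimum."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)

# Toy nucleotide sequences (made up) that differ at a single position.
sig_a = minhash_signature(kmers("ACGTACGTTGCAAGCTTAGCCGATA"))
sig_b = minhash_signature(kmers("ACGTACGTTGCAAGCTTAGCCGTTA"))
similarity = estimate_jaccard(sig_a, sig_b)
```

The "more nuanced" use raised in the comment could amount to computing one signature per annotated region or per gene family rather than one per genome, so that similarities can be compared region by region.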

Austin H. Patton on Oct 02, 2024

Hi Joel, thanks for the suggestion! I certainly think there is room for such approaches in a workflow like NovelTree that strives for highly efficient, scalable phylogenomic inference. This idea is similar to a question posed below by Jacek. I wonder to what extent approaches like these may be useful even just as a means to expedite and even improve MSA and MSA-based phylogenetic inference, perhaps by quickly generating a higher-quality starting tree to seed either MSA inference or likelihood-based tree inference?
