Graph neural networks: A unifying predictive model architecture for evolutionary applications
Neural networks are increasingly used in evolutionary biology research. Despite this burgeoning interest, most work uses just a few model architectures. This bias matters: the alignment of data structure, task, and architecture influences predictive and explanatory outcomes.
We propose that graph neural networks (GNNs), a comparatively underutilized architecture, are uniquely well-suited for evolutionary applications. We detail how GNNs leverage relational structures embedded in evolutionary data where other architectures can’t. We review example applications and discuss promising avenues where GNNs could advance evolutionary research. Our goal is to highlight the value of GNNs and encourage other evolutionary biologists to leverage the full extent of their utility.
Feel free to provide feedback by commenting in the box at the bottom of this page or by posting about this work on social media. Please make all feedback public so other readers can benefit from the discussion.
Evolutionary biologists are driven to answer fundamental questions about how the world works. What led to the adaptive radiation of Darwin’s finches [1]? What facilitated the repeated speciation and parallel ecological divergence between limnetic and benthic freshwater threespine sticklebacks [2]? Does epistasis increase or decrease phenotypic diversity [3]? Evolution is, historically speaking, the domain of explanatory rather than predictive models.
For example, when studying macroevolution, it’s common to interpret real data by fitting idealized models of evolution (e.g., Brownian motion (BM) or Ornstein-Uhlenbeck (OU) [4]) to them. Doing so has helped advance our understanding of a number of phenomena, such as resolving how species diversification along ecological gradients can underlie adaptive radiations (e.g., Anolis lizards [5]). However, the features driving these models’ explanatory power also restrict their predictive utility.
Though explanatory models provide valuable biological insight, their design inherently limits their ability to predict unobserved or future outcomes. This mismatch between model intention and application isn't a shortcoming per se, since these models were never intended to enable accurate prediction. It does mean, however, that when explanatory models are applied to predictive tasks, they rely on overly simplistic assumptions that maintain interpretability yet harm predictive capabilities. This issue isn't unique to evolutionary biology (for discussions, see [6] & [7]). For instance, phylogenetic imputation methods use explanatory models like BM or OU to predict missing trait values, constrained by assumptions such as constant rates of trait evolution across lineages and through time [8]. Dedicated predictive modeling frameworks tailored to evolutionary biology are needed.
Accordingly, evolutionary biologists have increasingly turned to machine learning frameworks more amenable to predictive tasks, particularly neural networks (NNs) (Figure 1, A–B) [9][10][11]. By leveraging multiple interconnected layers of artificial neurons, NNs can learn complex, non-intuitive relationships within data [12][13]. Despite challenges to interpretability, NNs' predictive capabilities make them highly valuable statistical tools, especially given the intricate and subtle patterns often present in biological data.
Convolutional neural networks (CNNs: Figure 1, C–D) have become the dominant architecture used in evolutionary biology. CNNs specialize in grid-structured data, such as images and sequences, leveraging spatial autocorrelation through convolutional kernels. Somewhat famously, CNNs have been shown to be “unreasonably effective” for population genetics inference, matching or exceeding existing explanatory models [14].
However, only some biological data are structured appropriately for CNNs, and restructuring comes with trade-offs. For example, genetic data are often converted into 2D "images" despite biologically irrelevant structuring in one input dimension, potentially limiting predictive accuracy and efficiency. Data preprocessing such as this can have an outsized impact on CNN performance [15]. While 1D CNNs offer a more natural and appropriate fit for linear genomic data — and have been successfully applied across a range of population genetic tasks — both 1D and 2D CNNs require input to conform to a regular grid. This requirement restricts possible applications since biological systems are often better represented as irregular non-Euclidean relational structures. Thus, although effective in some cases, the widespread use of CNNs may reflect convenience and historical precedent as much as innate architectural suitability.
Figure 1. Trends in the use of neural networks (NNs) in ecology and evolution (data from [9]) through 2021.
(A–B) Count of publications using each architecture type, considering all data types.
(C–D) Count of publications using each architecture type, considering only studies using molecular data.
In all panels, any publication that used more than one architecture type is counted once per architecture. DNN: deep neural network, CNN: convolutional neural network, RNN: recurrent neural network, VAE: variational auto-encoder, GAN: generative adversarial network.
NOTE: The trends shown here are meant to be exemplars — we have not extended this literature review to the present day.
So, is there a model architecture better suited for evolutionary data? This is an important question. Model architectures often act like Bayesian priors, each with unique inductive biases. Architectures can impose constraints on what models expect to see and, ultimately, what and how they learn. Effective alignment can simplify the learning task and improve predictive performance, particularly in small datasets common in biology. CNNs have succeeded in population genetic applications because genetic autocorrelation is amenable to convolution. But is there an alternative architecture better suited to the relational structures that evolution produces?
We think the answer may be graphs. From phylogenies (bifurcating graphs) to ancestral recombination graphs (ARGs) to interaction networks and genotypic fitness landscapes, a vast swath of biology can be meaningfully represented as graphs. Moreover, graphs may provide the key to spanning from microevolution to macroevolution by drawing connections between biological scales. This raises the possibility of a universal evolutionary representation in which proteins, genes, species, and even ecological communities are each represented as hierarchically nested graphs (Figure 2).
Figure 2. Graphs are a unifying biological data structure across scales, from macroevolution to microevolution.
(Top) Species trees (phylogenies) are fully bifurcating graphs that represent the relationships among extant species (terminal nodes) and their common ancestors (internal nodes) through descent with modification.
(Middle) A gene family tree — structured similarly — depicts the relationships among homologous gene copies possessed by the same species as in the species tree.
(Bottom) Proteins encoded by each homologous gene copy (and their common ancestors) in this gene family can be meaningfully and richly represented as protein residue graphs, where nodes correspond to amino acids, and edges correspond to interacting or spatially proximal residues, capturing detailed structural and physicochemical information.
Why is this the case? Because evolution through descent with modification induces a graph-like relational structure in biological data. We often represent these relationships as phylogenetic trees wherein each species or gene corresponds to a node, with nodes interconnected through edges representing common ancestry. Ultimately, a fully resolved phylogeny is inherently a bifurcating tree, which is itself a type of graph. Similarly, genetic structures such as ARGs explicitly capture the complex histories of genomic segments across populations and recombination events [16]. Furthermore, ecological networks depicting species interactions (e.g., predation, mutualism, and competition) and gene regulatory networks depicting complex genetic pathways are also naturally expressed as graphs. This ubiquity underscores graph representations' inherent suitability and explanatory power for evolutionary and ecological questions.
Given the inherent suitability of graph structures to address questions in ecology and evolution, we’re thus prompted to ask: Is there a predictive model architecture capable not only of handling such non-Euclidean, graph-structured data but also managing — and even exploiting — the complex nested hierarchical structures induced by evolutionary processes? After all, it’s previously been shown that CNN architectures aligned to image data markedly outperform non-convolutional NNs [17][18], and architectures specialized for non-Euclidean data lead to improved outcomes by inherently respecting the data’s geometry [19]. Could leveraging graph-based approaches thus bridge explanatory and predictive paradigms, harnessing the inherent relational structure of evolutionary data to improve both biological understanding and predictive accuracy?
Yes! The solution we propose lies in graph neural networks (GNNs: [20]). Graph neural networks are exactly what they sound like — a neural network architecture specifically designed to process and learn from graph-structured data comprising nodes (individual entities or observations) and edges (Box 1). GNNs can be used for a variety of prediction tasks: node regression/classification (e.g., variant effect prediction), edge prediction (e.g., phylogenetic inference), and graph regression/classification (e.g., gene-regulatory network functional classification). Given that graph-structured data is abundant in biology, the potential of GNNs is vast.
Why might GNNs work so well? For one, all GNNs use message passing to aggregate information from neighboring nodes along edges, thus allowing the model to learn complex local relationships in the data in a manner explicitly informed by graph structure [21]. In effect, this assumes that nodes that are closer to and more connected to one another in a graph are more similar to each other. Why might we care about this as evolutionary biologists?
Because this message-passing mechanism functionally leverages something that's both the bane and boon of any evolutionary comparative study — evolutionary non-independence. Descent with modification renders biological samples statistically non-independent "evolutionary pseudoreplicates," as demonstrated compellingly in Felsenstein's seminal 1985 publication "Phylogenies and the Comparative Method" [22]. Thankfully, there now exists a wealth of statistical methods based on explanatory models that explicitly use the inferred phylogeny to account for evolutionary non-independence [4]. Just as accounting for evolutionary non-independence is essential to the adequacy and performance of explanatory models, so will it be for predictive models. In fact, we're likely to push these models even further by explicitly making the model aware of that evolutionary non-independence by baking it into the model architecture and data representation. GNNs provide us with the key to do so.
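To make the message-passing idea concrete, here is a minimal sketch in plain Python. It is not taken from any of the cited studies. The toy phylogeny, the trait values, and the choice of an unweighted mean as the aggregation function are all assumptions made for illustration; real GNN layers use learned weights and nonlinearities.

```python
# Toy phylogeny ((A,B),C): nodes 0=A, 1=B, 2=C, 3=ancestor of A and B, 4=root.
# The topology and the trait values below are hypothetical.
edges = [(3, 0), (3, 1), (4, 3), (4, 2)]  # (parent, child) pairs


def neighbors(edges, n_nodes):
    """Build an undirected neighbor list so messages can flow both ways."""
    nbrs = [[] for _ in range(n_nodes)]
    for parent, child in edges:
        nbrs[parent].append(child)
        nbrs[child].append(parent)
    return nbrs


def message_passing_round(features, nbrs):
    """Replace each node's feature with the mean over itself and its neighbors."""
    updated = []
    for node, value in enumerate(features):
        pool = [value] + [features[j] for j in nbrs[node]]
        updated.append(sum(pool) / len(pool))
    return updated


# Tips carry observed trait values; internal nodes start at 0.0 (an assumption).
features = [1.0, 1.2, 3.0, 0.0, 0.0]
nbrs = neighbors(edges, len(features))
one_round = message_passing_round(features, nbrs)
two_rounds = message_passing_round(one_round, nbrs)
```

After one round, each node has only seen its immediate neighbors. After two rounds, tip A's representation also reflects its sister tip B, two edges away. In this way, stacking rounds lets shared ancestry shape what the model learns about each sample.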
GNNs are also exceptionally flexible. For example, message-passing can incorporate convolution (as in graph convolutional networks; GCNs [23]) or attention mechanisms (as in graph attention networks; GATs [24]) to more fully learn complex relationships present in the data at both local and global scales. Furthermore, many common neural architectures are special cases of GNNs: CNNs are a special case of GCNs on regular grids, RNNs are a special case of GNNs on sequential chain graphs, and transformers [25] are a special case of GATs with fully connected attention graphs.
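One way to see the grid special case is to write a 1D convolution as message passing on a chain graph. The sketch below is illustrative only. It assumes a length-3 kernel, zero padding, and one weight per edge type (left or right neighbor). The basic GCN of [23] instead shares a single weight across all neighbors, so the edge-type weights are an assumption of this sketch.

```python
def conv1d(signal, kernel):
    """Zero-padded 1D cross-correlation with a (left, self, right) kernel."""
    w_left, w_self, w_right = kernel
    padded = [0.0] + list(signal) + [0.0]
    return [
        w_left * padded[i - 1] + w_self * padded[i] + w_right * padded[i + 1]
        for i in range(1, len(padded) - 1)
    ]


def chain_graph_message_passing(signal, kernel):
    """The same computation phrased as message passing on a chain graph."""
    w_left, w_self, w_right = kernel
    n = len(signal)
    # Each edge is (receiver, sender, weight); the weight depends on edge type.
    edges = [(i, i - 1, w_left) for i in range(1, n)]
    edges += [(i, i + 1, w_right) for i in range(n - 1)]
    out = [w_self * x for x in signal]  # message each node sends to itself
    for receiver, sender, weight in edges:
        out[receiver] += weight * signal[sender]
    return out


signal = [1.0, 2.0, 4.0, 8.0]   # hypothetical values along a sequence
kernel = (0.5, 1.0, -0.25)      # hypothetical kernel weights
grid_result = conv1d(signal, kernel)
graph_result = chain_graph_message_passing(signal, kernel)
```

Both functions return the same values for this input. The difference is that the graph version makes no assumption about the neighbor structure, so the chain could be swapped for a tree or an ARG without changing the update rule.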
Thus, while CNNs have been undeniably useful — particularly in population genetic contexts with spatially or sequentially structured (i.e., Euclidean) genomic data — GNNs offer even broader flexibility. This flexibility is reassuring. Biological data frequently exhibit relational complexity beyond simple adjacency or grid-like structures. GNNs inherently accommodate these complexities, making them highly versatile tools. In the following section, we discuss several applications that have been particularly fruitful and propose a couple of promising future applications.
Box 1. Useful GNN terminology.
GNNs are suited to multiple different levels of prediction tasks, ranging from node classification to link prediction, community detection/graph clustering, graph classification/regression, graph generation, and more. For a more detailed review of the architecture, please see [20].
Graphs are commonly represented by an adjacency matrix (or an equivalent edge list) and may either be undirected, meaning information may flow in either direction along an edge, or directed, meaning information flows only in one direction.
Graphs may be either homogeneous, meaning all nodes and edges are of the same type, or heterogeneous, meaning multiple types of nodes or edges may be represented in a single graph.
Nodes in a graph correspond to individual entities within a graph. They may be represented by a set of node features and belong to one or more classes — these could be anything from distinct species to genes, proteins, or amino acids.
Edges may be similarly characterized by edge attributes, representing branches in a phylogeny, orthologous relationships, or physical distances among atoms in a protein structure.
Graph-level attributes characterize properties of the graph as a whole, such as the identity of a given gene family, biological process, or protein activity.
Message passing [21] is the mechanism all GNNs use to aggregate information from each node's immediate neighbors to update node feature representations (i.e., a local neighborhood aggregation function).
Graph convolution [23] extends the convolution mechanism implemented in CNNs to non-Euclidean graph data.
Graph attention [24] leverages self-attention to allow the contribution of each node's neighbors to feature updates to vary, scaling according to their learned importance.
Transformers [25] are a special case of attention-based GNNs wherein global, multi-headed attention forms a fully connected graph, thus creating a global neighborhood aggregation function.
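The terms in Box 1 map onto a simple data structure. The following is a hedged sketch, not the API of any particular GNN library. The `Graph` class, its field names, and the example values are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Graph:
    node_features: list   # one feature vector per node
    edges: list           # (source, target) index pairs
    edge_attrs: list      # one attribute per edge, e.g., a branch length
    directed: bool = False
    graph_attrs: dict = field(default_factory=dict)  # graph-level attributes

    def adjacency(self):
        """Return the adjacency matrix as nested lists of 0/1."""
        n = len(self.node_features)
        matrix = [[0] * n for _ in range(n)]
        for source, target in self.edges:
            matrix[source][target] = 1
            if not self.directed:
                matrix[target][source] = 1  # information flows both ways
        return matrix


# Toy phylogeny ((A,B),C): nodes 0-2 are tips, 3 is an ancestor, 4 is the root.
phylogeny = Graph(
    node_features=[[1.0], [1.2], [3.0], [0.0], [0.0]],
    edges=[(3, 0), (3, 1), (4, 3), (4, 2)],
    edge_attrs=[1.0, 1.0, 1.0, 2.0],  # hypothetical branch lengths
    graph_attrs={"gene_family": "hypothetical_family_1"},
)
# The same tree treated as directed from ancestor to descendant.
rooted = Graph(
    phylogeny.node_features, phylogeny.edges, phylogeny.edge_attrs, directed=True
)
```

The undirected version yields a symmetric adjacency matrix, while the directed version only records ancestor-to-descendant edges. Which one is appropriate depends on whether information should be allowed to flow from descendants back to ancestors.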
As discussed previously, an early application of neural networks to evolution was the use of CNNs for population genetics. How do GNNs stack up here? Recent work [26] has found that a GCN matches and often exceeds CNN performance on population genetic tasks, particularly at identifying genomic regions under selective sweeps. Notably, the GCN achieves this performance with roughly two orders of magnitude fewer parameters than the CNN (~200 thousand parameters compared to ~21 million, about a 100-fold difference). This disconnect between model size and performance supports our earlier suspicion: that using an architecture aligned with the data structure indeed helps to learn more, and from less.
What about GNNs makes them suited for these tasks? The data used here, tree sequences, are highly efficient representations of genomic data that capture the changing evolutionary relationships among samples while walking along the genome [27]. These tree sequences approximate ARGs, complex graph structures capturing recombination and coalescent histories [16]. The message-passing framework inherent to GNNs allows for adaptive weighting of neighbors, enabling them to selectively integrate relevant local signals such as lineage-specific demographic events or recombination hotspots that are otherwise obscured by fixed receptive fields in CNNs. Indeed, recent studies have applied GNNs directly to ARGs, where they have proven helpful in estimating demographic histories and identifying regions subject to selection under complex population scenarios [28].
Thus, using evolutionarily meaningful, graph-structured data, GNNs can infer everything from demographic history to the genomic landscape of natural selection and introgression/horizontal gene flow. While CNNs remain useful for specific structured genomic data tasks, the flexibility and general applicability of GNNs position them as a potentially superior choice across a broader range of population genetic and evolutionary biology problems. Despite these initially promising demonstrations, we emphasize that we have only begun to scratch the surface of GNNs' potential for population genetic problems.
GNNs may also be helpful for the inference of diversification dynamics using phylogenetic trees (e.g., for understanding speciation/extinction or mapping pathogen transmission dynamics [29][30][31]). Historically, this work has disproportionately relied upon birth–death (BD) and coalescent models. Both model types are highly interpretable. For instance, BD models employ just two primary parameters: birth (λ), corresponding either to speciation or transmission events, and death (µ), corresponding either to extinction or loss of infected individuals, respectively.
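To illustrate how little machinery a two-parameter BD model needs, here is a minimal simulation sketch. It assumes constant per-lineage rates and tracks only the number of lineages through time, not the tree itself. The function name, arguments, and example rates are hypothetical, and this is not one of the simulators cited in this pub.

```python
import random


def simulate_birth_death(birth, death, t_max, n0=1, seed=None):
    """Simulate lineage counts under a constant-rate birth-death process.

    birth and death are per-lineage rates (lambda and mu). Returns a list of
    (time, number_of_lineages) pairs, one per event, starting at time 0.
    """
    if birth + death <= 0:
        raise ValueError("birth + death must be positive")
    rng = random.Random(seed)
    t, n = 0.0, n0
    trajectory = [(t, n)]
    while n > 0:
        # Waiting time to the next event across all n lineages.
        t += rng.expovariate(n * (birth + death))
        if t > t_max:
            break
        # The event is a birth with probability lambda / (lambda + mu).
        if rng.random() < birth / (birth + death):
            n += 1
        else:
            n -= 1
        trajectory.append((t, n))
    return trajectory


# Hypothetical rates: speciation (or transmission) 1.0, extinction (or loss) 0.5.
trajectory = simulate_birth_death(birth=1.0, death=0.5, t_max=3.0, seed=1)
final_time, final_count = trajectory[-1]
```

Under this model, the expected number of lineages at time t is n0 · exp((λ − µ)t), which is part of what makes the parameters so interpretable. The same simplicity is what limits the model when rates vary across lineages or through time.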
The interpretability of these models has had immense practical value. During the COVID-19 pandemic, BD and coalescent models applied to SARS-CoV-2 phylogenies provided early and critical insights into the epidemiology of this novel infectious disease, directly informing public health decisions [32][33][34][35]. Beyond COVID-19, the application of these models has long been a critical component of coordinated responses to infectious disease outbreaks [36]. For example, they've historically been instrumental in identifying emerging seasonal influenza strains around which vaccines are developed and assessing vaccine efficacy (e.g., [37]).
So, how can GNNs propel the field forward? Phylodynamics is a field where many explanatory models have been useful for prediction tasks almost by coincidence. We can move beyond this, however. For instance, GNNs could explicitly leverage the temporal structure of pathogen phylogenies to simultaneously model shifts in transmission dynamics and predict the emergence of epidemiologically important variants, something traditionally challenging for simpler models.
Initial applications of GNNs to phylodynamic problems have demonstrated substantial promise, notably in classifying transmission clusters [38][39]. However, there are several immediate areas where GNNs could be refined for this application, such as comprehensive epidemiological parameter estimation. Interestingly, a comparative study evaluating macroevolutionary diversification parameter estimates (speciation and extinction) noted that other neural network architectures often outperformed GNNs [40]. However, these GNNs lacked features that improve performance, such as skip connections or attention-based graph convolutional layers. Thus, given the inherent flexibility of GNN implementations, a more comprehensive exploration of possibilities will be of interest here (as elsewhere).
Finally, GNNs may be uniquely well-suited to common tasks in comparative biology, such as trait imputation and ancestral state reconstructions. For example, ancestral state reconstruction is one of the most common use cases of phylogenetic comparative methods in evolutionary studies. Writ large, this includes the inference of everything from geographic ranges [41] and quantitative or discrete phenotypes [42] to even protein sequences [43] of the common ancestors of extant species.
Many of these tasks are built on a common methodological approach we stereotype here (for a review, see [44]). First, an explanatory model of how a trait has evolved is fit to a reconstructed phylogeny and trait data for a set of species. The fitted model is then used to probabilistically reconstruct trait values at the internal (ancestral reconstruction) or terminal (phylogenetic imputation) nodes, returning the most likely values based on the model parameters. Although intuitive, this approach can lead to biased or incorrect trait estimates, as commonly used models make unrealistic assumptions, such as constant evolutionary rates through time and shared rates across species.
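The stereotyped approach above can be sketched for the simplest case: estimating the root state of a continuous trait under BM. The sketch uses the recursion from Felsenstein's independent contrasts algorithm, in which each ancestor is a weighted average of its two descendants, with weights inversely proportional to branch length. The `Tip` and `Node` classes, the toy tree, and the trait values are hypothetical. The code assumes a single constant rate across the tree, which is exactly the kind of assumption discussed above.

```python
from typing import NamedTuple, Union


class Tip(NamedTuple):
    value: float          # observed trait value


class Node(NamedTuple):
    left: "Tree"
    t_left: float         # branch length leading to the left child
    right: "Tree"
    t_right: float        # branch length leading to the right child


Tree = Union[Tip, Node]


def bm_ancestral_estimate(node):
    """Return (estimate, extra_variance) for the subtree rooted at node.

    The estimate uses only the descendants of the node. At the root, it
    equals the maximum likelihood root state under constant-rate BM.
    """
    if isinstance(node, Tip):
        return float(node.value), 0.0
    x_left, extra_left = bm_ancestral_estimate(node.left)
    x_right, extra_right = bm_ancestral_estimate(node.right)
    # Uncertainty in each child's estimate lengthens its effective branch.
    v_left = node.t_left + extra_left
    v_right = node.t_right + extra_right
    estimate = (x_left / v_left + x_right / v_right) / (1 / v_left + 1 / v_right)
    extra = v_left * v_right / (v_left + v_right)
    return estimate, extra


# Toy tree ((A:1, B:1):1, C:2) with trait values A=1, B=3, C=5.
tree = Node(Node(Tip(1.0), 1.0, Tip(3.0), 1.0), 1.0, Tip(5.0), 2.0)
root_estimate, _ = bm_ancestral_estimate(tree)
```

For this tree, the root estimate is 23/7 (about 3.29), which matches the generalized least squares estimate computed from the tips' phylogenetic covariance matrix. Every branch contributes according to the same fixed rule, so there is no room for lineage-specific rates unless the model is extended.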
GNNs, on the other hand, have the potential to model more realistic evolutionary scenarios. For instance, using a combination of graph convolution and graph attention, GNNs may be capable of flexibly and accurately modeling the underlying heterogeneity of evolutionary rates. Additionally, if modeling the evolution of multiple traits, GNNs may be able to capture additional complexity and nuance in patterns of correlated trait evolution that are typically out of reach of standard models. Last, mechanisms like jumping knowledge [45] may help GNNs to flexibly integrate information from both local and global phylogenetic neighborhoods to model and learn where saltational jumps in trait evolution occur. Fortunately, sophisticated simulation tools are readily available, enabling researchers to create realistic evolutionary scenarios for effective GNN training (e.g., [46][47][48][49]). Thus, while simulation quality remains essential, GNNs are an optimally structured architecture to handle these predictive tasks efficiently and accurately.
We have only begun to scratch the surface of the potential utility of GNNs for application to questions and subjects in evolutionary biology. Entire publications could be written about each. From the potential of GNNs to directly infer phylogenetic trees themselves (e.g., [50]) from genetic sequence data to predicting protein–protein interactions (e.g., [51]) and facilitating the inference of orthology at deep evolutionary time scales, the number and diversity of prospective use cases are vast. Excitingly, in many cases, we're beginning to see this exploration unfold, though we emphasize that it's just that — only the beginning. Ultimately, the creativity of implementation and thoughtful application, more than innate architectural limitations, will likely determine the success of GNNs in evolutionary biology.
Although the details are outside the scope of this pub, we encourage readers to familiarize themselves with how GNNs are implemented and how individual architectural components may play key roles in their success and performance for any given application [20]. For instance, just as we've seen with the widespread success of the transformer architecture [25] in the context of large language models, it seems likely that GNN architectures that incorporate some form of attention mechanism will be important for capturing the complexity inherent to biological data. Furthermore, we emphasize that models needn't rely on a single architectural type. For instance, one recent study successfully combined protein language models with GNNs to enable the prediction of essential genes in metazoans [52].
In many cases, the primary utility of GNNs may be in bridging across architectures — explicitly building in the hierarchical relationships induced by evolution through descent with modification (e.g., Figure 2). Building sophisticated, complex hierarchical models such as these spanning biological scales is undoubtedly challenging, but GNNs present an explicit means by which to do so (e.g., [53][54]). Still, the value gained from more completely building in the evolutionary structure we know to exist in our models may be transformative. Ultimately, the boundary to GNN success in evolutionary biology lies primarily in our creativity and ingenuity in leveraging this powerful architecture.
We downloaded the supplementary table from Borowiec et al., 2022 [9] and converted it to a tab-separated text file. We loaded these data into R (v4.3.3), processed them, and visualized outputs with the following packages: readr (v2.1.5), dplyr (v1.1.4), tidyr (v1.3.1), stringr (v1.5.1), ggplot2 (v3.5.1) [55], reshape2 (v1.4.4) [56], cowplot (v1.1.3), and arcadiathemeR (v0.1.0) [57]. We excluded publications for which the entry for Architecture (i.e., the NN architecture used in the study) was “NA,” as these corresponded to review articles, as well as studies for which the architecture was "unknown." We counted each type once when multiple architecture types were used in a single study. For example, if a study used both a convolutional neural network and a recurrent neural network, we incremented the count for both architectures by one for that year.
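The counting rule described above can be sketched as follows. This is an illustrative Python version, not the R code used for the analysis (which is in the linked repository). The record layout, the comma separator for multiple architectures, and the example records are assumptions.

```python
from collections import Counter

EXCLUDED = {"na", "unknown"}


def count_architectures(records):
    """Count publications per (year, architecture).

    records is an iterable of (year, architecture_string) pairs, where a
    study using several architectures lists them separated by commas
    (the separator is an assumption of this sketch).
    """
    counts = Counter()
    for year, architecture in records:
        if architecture is None or architecture.strip().lower() in EXCLUDED:
            continue  # reviews ("NA") and studies with unknown architecture
        # A set ensures each architecture is counted once per study.
        labels = {label.strip() for label in architecture.split(",")}
        labels.discard("")
        for label in labels:
            counts[(year, label)] += 1
    return counts


# Hypothetical records for illustration only.
records = [
    (2019, "CNN, RNN"),
    (2019, "CNN"),
    (2020, "NA"),
    (2020, "unknown"),
    (2020, "CNN, CNN"),
]
counts = count_architectures(records)
```

With these example records, the 2019 study that used both a CNN and an RNN adds one to each architecture's count for that year, and the two excluded entries contribute nothing.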
We used ChatGPT to help write code and provide suggestions to restructure writing.
Code and data are available in our GitHub repo (DOI: 10.5281/zenodo.15693531).