Description
A core focus of genetics is understanding the relationship between genetic variation (genotypes) and biological traits (phenotypes). Efforts as diverse as tracing the evolution of complex phenotypes, identifying disease-causing genes, and understanding how organisms are built are all contingent on deciphering the mapping between genotype and phenotype.
Our results show that assumptions underlying many current genotype-phenotype models (namely that genotypes are additive and linear) do not reflect the nonlinearities present in biology. Non-additive relationships between genes are well known — one gene can influence the effects of another (epistasis), and some genes have multiple phenotypic effects (pleiotropy). By accounting for such nonlinear interactions between genes and phenotypes, we show that we can accurately predict suites of simulated phenotypes.
These findings should be of interest to anyone whose work relies on accurately modeling genotype-phenotype relationships, especially those in the fields of quantitative, population, and human genetics. Additionally, we are excited to get feedback on how this work might help contribute to these fields and possible refinements or extensions of its utility.
This pub is part of the platform effort, “Genetics: Decoding evolutionary drivers across biology.” Visit the platform narrative for more background and context.
All associated code is available in this GitHub repository.
Data from this pub, including empirical and simulated phenotypes, are available on Zenodo.
For several centuries now, scientists have attempted to decode how biology emerges from genetic information. Some of the models that have come out of this assume that traits are associated with infinitesimally complex genetic bases [1]; others hold that distinct clusters of few, but highly impactful genes drive each aspect of an organism’s growth and function [2]. Many (if not most) of these models share a common feature: no matter their complexity, phenotypes can be understood by simply adding up the effects of their genetic contributors.
A single human trait best demonstrates this tendency: height. Dozens, if not hundreds, of genetic studies have been conducted with the goal of mining ever greater amounts of genetic “gold dust” [2] predicting variation in height [3][4]. Through these efforts, two things have become clear: 1) height is highly heritable (> 80% by a recent study) [4] but 2) so much of the genome is involved that identifying a discrete molecular basis seems extremely unlikely.
In response to findings such as these (along with those gleaned from other human traits), some researchers have begun favoring the use of “polygenic (risk) scores” [5]. By aggregating the effects of many genomic loci, these scores can explain increasing amounts of trait variation (albeit at the cost of biological interpretability). Similarly, the “omnigenic model” [6] proposes that small sets of core, trait-determining genes work in parallel with many other “peripheral” genes. Through their sheer number, these peripheral loci thus also substantially contribute to trait variation. Importantly, in this model, contributions of core and peripheral genes are entangled and can’t be distinguished, ultimately implicating vast swaths of the genome. Both polygenic risk scores and the omnigenic model assume the same inevitable conclusion: identifying the molecular drivers of complex traits is exceedingly difficult, if not impossible [7]. But what if the problem isn’t how we are conceiving of the genetic bases of complex traits; what if the problem actually has to do with how we are thinking of traits themselves?
Height really is a complex trait, one that involves a myriad of interacting biological processes (development, metabolism, physiology, and so on). Each process is, in turn, regulated by its own sets of genes and it is likely that at least some of these gene sets relate to each other in complicated ways. Indeed, many complex phenotypes result from a variety of interconnected, nonlinear biological networks and genotypes that can relate to each other in a variety of directions (e.g., linear, nonlinear) and manners (e.g., additive, subtractive, dominant) [8][9].
We believe that these points, if considered seriously, may have substantial implications for genetics. How often are phenotypes and genotypes nonlinearly correlated relative to being purely additive? Can information about any one phenotype help to predict another? Does accounting for phenotypic relationships increase predictive power? With these questions in mind, we decided to see if an approach explicitly capturing the complex relationships among phenotypes might provide some useful insights for genetic analysis writ large. Specifically, we focus on simultaneous analysis of groups of multiple phenotypes (here referred to as “polyphenotypes”). We examine how levels of pleiotropy (the impact of single genes on multiple phenotypes) and gene-gene interaction (the non-linear impact of combinations of genes on phenotypes) structure polyphenotypes.
Polyphenotype
Any grouping of multiple organismal phenotypes. Here, we argue that polyphenotypes have multiple uses. On one hand, they can be used to understand the relationships between multiple phenotypes. At the same time, they are also useful for developing accurate predictions of any single phenotype (by allowing one to control for potential false positives/negatives arising from phenotypic relationships).
For reasons both causal and correlative, phenotypes co-vary. For example, as referenced above, height is likely correlated with other phenotypes such as mass or metabolic rate. In this pub, we quantify the nature and prevalence of phenotype-phenotype relationships within large groups of phenotypes (what we refer to as polyphenotypes) to gain insight into the processes that cause these phenotypes. We find nonlinear relationships among phenotypes in natural populations to be widespread. Furthermore, we find that, in simulations, the degree of nonlinearity among phenotypes is modulated by the degree of gene-gene interaction and pleiotropy. We then demonstrate that, where present, phenotype-phenotype relationships can be leveraged to increase prediction accuracy for individual phenotypes.
All the data we used to study empirical variation across sets of phenotypes are publicly available. Sources and details for these data are available in Table 1. We chose data sets on the basis of phenotype number, sample size, and population type. We sought data sets in which a minimum of 15 phenotypes were measured for at least 100 individuals of the same species or interspecies cross. We also generated a set of random, unrelated phenotypes to compare with the observed phenotypic relationships contained within these data sets. To do so, for a single “phenotype”, we randomly generated integer values (values could be any integer between 1 and 1,000) 600 times. This process thus resulted in 600 simulated “individuals”, each with a randomly chosen phenotypic value. This was repeated to ultimately generate 30 simulated phenotypes, each composed of 600 individual observations. After filtering on data completeness (see below for details on data set-specific filtering) we imputed missing values using the mean value for each phenotype and then performed rank normalization using the R function `RankNorm` from the package RNOmni.
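The random-phenotype, imputation, and normalization steps described above can be sketched in Python as follows. This is a minimal illustration and not the pub's R code: the helper names (`average_ranks`, `impute_mean`, `rank_normalize`) are hypothetical, and the offset `k = 0.375` is an assumption about the rank-based inverse normal transform rather than a setting reported in the text.

```python
import numpy as np
from statistics import NormalDist

def average_ranks(x):
    """Ranks starting at 1; tied values share their average rank."""
    order = np.argsort(x, kind="mergesort")
    ordinal = np.empty(len(x), dtype=float)
    ordinal[order] = np.arange(1, len(x) + 1)
    _, inverse, counts = np.unique(x, return_inverse=True, return_counts=True)
    rank_sums = np.bincount(inverse, weights=ordinal)
    return (rank_sums / counts)[inverse]

def impute_mean(column):
    """Replace missing values (NaN) with the mean of the observed values."""
    column = np.asarray(column, dtype=float)
    return np.where(np.isnan(column), np.nanmean(column), column)

def rank_normalize(column, k=0.375):
    """Rank-based inverse normal transform; the offset k is an assumption."""
    n = len(column)
    quantiles = (average_ranks(column) - k) / (n - 2 * k + 1)
    inv_cdf = NormalDist().inv_cdf
    return np.array([inv_cdf(q) for q in quantiles])

# 30 random "phenotypes" for 600 "individuals", integers from 1 to 1,000.
rng = np.random.default_rng(0)
random_phenotypes = rng.integers(1, 1001, size=(600, 30)).astype(float)

# Impute, then normalize, one phenotype (column) at a time.
normalized = np.column_stack(
    [rank_normalize(impute_mean(col)) for col in random_phenotypes.T]
)
```

Each column of `normalized` is approximately standard normal regardless of the scale of the raw phenotype, which keeps phenotypes measured in different units comparable in the later pairwise model fits.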
Below are descriptions of data set-specific filtering. We tailored filtering parameters to each study given variation in sample size and the rate of missing data.
Arabidopsis: We excluded individuals if they had NAs for more than 20 phenotypic measurements. Similarly, we excluded phenotypes with more than 20 NAs. In addition, we removed non-continuous phenotypes (at least five unique values required per phenotype).
Yeast: We removed non-continuous phenotypes (at least five unique values required per phenotype).
C. elegans: We removed non-continuous phenotypes (at least five unique values required per phenotype).
Mouse (AIL): We removed non-continuous phenotypes (at least five unique values required per phenotype).
Mouse (JAX): We excluded samples that were missing more than 100 measurements. Similarly, we excluded phenotypes missing more than 100 measurements. In addition, we removed non-continuous phenotypes (at least five unique values required per phenotype).
Fruit fly: We excluded samples that were missing more than 50 measurements. Similarly, we excluded phenotypes missing more than 50 measurements. We reduced the dimensionality of gene expression values from Huang et al. 2015 [10] using PCA (we extracted the first 30 PCs). In addition, we removed non-continuous phenotypes (at least five unique values required per phenotype).
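The filtering rules listed above share a common shape, which might be sketched as below. The function name `filter_phenotypes`, the order of operations (individuals first, then phenotypes, then the continuity check), and the toy matrix are all assumptions made for illustration; the thresholds are passed in per data set.

```python
import numpy as np

def filter_phenotypes(data, max_missing=None, min_unique=5):
    """data: individuals x phenotypes array, with NaN marking missing values."""
    data = np.asarray(data, dtype=float)
    if max_missing is not None:
        # Drop individuals (rows) with too many missing measurements.
        data = data[np.isnan(data).sum(axis=1) <= max_missing]
        # Drop phenotypes (columns) with too many missing measurements.
        data = data[:, np.isnan(data).sum(axis=0) <= max_missing]
    # Keep only phenotypes with at least `min_unique` distinct observed values.
    n_unique = np.array(
        [np.unique(col[~np.isnan(col)]).size for col in data.T]
    )
    return data[:, n_unique >= min_unique]

# Toy example (not real data): 10 individuals, 4 phenotypes.
rng = np.random.default_rng(0)
toy = rng.normal(size=(10, 4))
toy[:, 1] = np.tile([0.0, 1.0], 5)  # a binary (non-continuous) phenotype
toy[:8, 2] = np.nan                 # a phenotype with many missing values
toy[0, [0, 3]] = np.nan             # an individual with many missing values
filtered = filter_phenotypes(toy, max_missing=2)
```

In this toy case the first individual, the mostly missing phenotype, and the binary phenotype are all removed, leaving nine individuals and two phenotypes.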
Name | Main reference | Type | N samples | N phenos |
---|---|---|---|---|
Arabidopsis | [11] | Natural strains (accessions) | 514 | 110 |
Yeast | [12] | F1 segregant | 13,950 | 40 |
C. elegans | [13] | Recombinant inbred lines (RIL) | 2,017 | 19 |
Mouse (AIL) | [14] | Advanced intercross line (AIL) | 1,063 | 133 |
Mouse (JAX) | [15][16][17] | Laboratory strains | 106 | 271 |
Fruit fly | [18] | Inbred lines | 147 | 270 |
Random | This pub | – | 600 | 30 |
Table 1. Data sets we used for empirical analyses of nonlinearity.
All of these data are available on Zenodo.
We simulated 100 phenotypes for 121 populations (N individuals per population = 500). These populations were created by first simulating genetic data and deriving the phenotypes from these genotypes. For each individual, we randomly assigned one of three allelic states at each of 300 loci (e.g., homozygous reference, heterozygous, homozygous alternate). Then, we generated a genetic architecture for each phenotype by randomly assigning 100 loci to that phenotype and giving each possible allele at each locus a weight of influence between zero and 10.
We modeled effects of pleiotropy and gene-gene interaction on phenotypes, varying the impact of each systematically across populations such that each population had a unique pairing of the probability of pleiotropy and the probability of gene-gene interactions. These probabilities were per gene-phenotype pair or gene-gene pair and ranged from 0–1 in increments of 0.1, thus forming an 11 by 11 grid with one simulated population for each pairing. For example: population 1 has probabilities P(pleiotropy) = 0, P(epistasis) = 0; population 2 has probabilities P(pleiotropy) = 0.1, P(epistasis) = 0; and so on.
To model pleiotropy, within each population and for each phenotype, we assigned each locus already determined to influence a phenotype (100 loci per phenotype, 9,900 locus-phenotype pairs) to be involved in pleiotropy with the population-specific probability defined above. If we determined the locus-phenotype pair to be involved in pleiotropy, the weights assigned to that locus were included in the calculation of that phenotype. Similarly, to create gene-gene interactions (e.g., epistasis) that varied across populations, we assigned each gene-gene pair (N = 4,950) to be involved in an interaction with the population-specific probability defined above. If we determined a locus was involved in an interaction, we randomly assigned that interaction to one of the six possible pairs of alleles (i.e., interaction among loci here occurs only between single pairs of alleles). We then multiplied the weights of those alleles. Finally, we calculated the phenotypes for each individual by summing the weights at loci influencing that phenotype.
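To make the simulation design more concrete, here is a simplified Python sketch of the probability grid, the genotype matrix, and a single phenotype with gene-gene interactions. It is one possible reading of the description, not the pub's generator: `simulate_phenotype` is a hypothetical name, the interacting allelic states are drawn uniformly from the three states at each locus, the product of weights is added as an extra term, and the pleiotropy step is not modeled here.

```python
import numpy as np
from itertools import combinations

N_INDIVIDUALS, N_LOCI, LOCI_PER_PHENOTYPE = 500, 300, 100

# One population per (P(pleiotropy), P(epistasis)) pair: 11 x 11 = 121.
steps = [round(0.1 * i, 1) for i in range(11)]
grid = [(p_pleio, p_epi) for p_epi in steps for p_pleio in steps]

# Genotypes: one of three allelic states (coded 0, 1, 2) per individual and locus.
rng = np.random.default_rng(1)
genotypes = rng.integers(0, 3, size=(N_INDIVIDUALS, N_LOCI))

def simulate_phenotype(genotypes, p_interaction, rng):
    """Additive phenotype plus pairwise interaction terms (simplified assumption)."""
    loci = rng.choice(genotypes.shape[1], size=LOCI_PER_PHENOTYPE, replace=False)
    # One weight between 0 and 10 for each allelic state at each locus.
    weights = rng.uniform(0, 10, size=(LOCI_PER_PHENOTYPE, 3))
    # Look up each individual's weight at each locus from its allelic state.
    per_locus = weights[np.arange(LOCI_PER_PHENOTYPE), genotypes[:, loci]]
    additive = per_locus.sum(axis=1)

    interaction = np.zeros(len(genotypes))
    for a, b in combinations(range(LOCI_PER_PHENOTYPE), 2):  # 4,950 pairs
        if rng.random() < p_interaction:
            # Assumption: the interaction applies to one allelic state per locus.
            state_a, state_b = rng.integers(0, 3, size=2)
            carriers = (genotypes[:, loci[a]] == state_a) & (
                genotypes[:, loci[b]] == state_b
            )
            interaction += carriers * weights[a, state_a] * weights[b, state_b]
    return additive + interaction

purely_additive = simulate_phenotype(genotypes, 0.0, np.random.default_rng(2))
fully_interacting = simulate_phenotype(genotypes, 1.0, np.random.default_rng(2))
```

Reusing the same seed for both calls keeps the loci and weights identical, so the difference between the two phenotypes comes only from the interaction terms.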
We implemented a neural network called a denoising autoencoder to test the utility of examining multiple phenotypes for phenotypic prediction [19]. Autoencoders consist of two networks: first, an encoder that forces the data through an information bottleneck, and then a decoder that takes information compressed through that bottleneck and tries to reconstruct the data. Accurate reconstruction of the data following the compression through the information bottleneck suggests that the network has learned a representation of that data. During training, noise is added to the input data. This helps prevent a common failure mode in which the learned representation does not extrapolate to data that was not in the training set (that is, the learned model fails to generalize).
All code associated with this pub — including analysis notebooks, the synthetic phenotype generator, and the autoencoder model — is available in this GitHub repository (DOI: 10.5281/zenodo.8371249).
Briefly, our autoencoder consisted of an encoder with two fully connected rectified linear layers that we subjected to batch normalization and a similarly structured decoder. The latent space separating the two networks contained 32 nodes. We conducted the training with phenotypic data from 80% of the simulated individuals over 100 epochs with a batch size of 16. To all training data, we added 0.1 standard deviation of noise. Following training, we predicted phenotypes on the remaining 20% of the data. To evaluate the utility of increasing the number of phenotypes under different values of pleiotropy and interaction, we trained individual models using 5, 10, 20, and 30 phenotypes, and evaluated the model accuracy on five phenotypes. We calculated prediction error as the mean absolute percentage error. We implemented the autoencoder in PyTorch [20].
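The architecture and training loop described above might look roughly like the following PyTorch sketch. It follows the stated settings (two fully connected ReLU layers with batch normalization per network, a 32-node latent space, an 80/20 split, batch size 16, noise with standard deviation 0.1), but the hidden width, the layer ordering, the Adam optimizer, the learning rate, the mean squared error loss, and the choice of the first five input columns as targets are all assumptions. The actual implementation lives in the pub's repository.

```python
import torch
from torch import nn

class DenoisingAutoencoder(nn.Module):
    def __init__(self, n_in, n_out, hidden=64, latent=32):
        super().__init__()
        # Encoder: two fully connected ReLU layers with batch normalization.
        self.encoder = nn.Sequential(
            nn.Linear(n_in, hidden), nn.BatchNorm1d(hidden), nn.ReLU(),
            nn.Linear(hidden, latent), nn.BatchNorm1d(latent), nn.ReLU(),
        )
        # Decoder: similar structure, ending in a linear output layer.
        self.decoder = nn.Sequential(
            nn.Linear(latent, hidden), nn.BatchNorm1d(hidden), nn.ReLU(),
            nn.Linear(hidden, n_out),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

def train_and_evaluate(inputs, targets, epochs=100, batch_size=16,
                       noise_sd=0.1, seed=0):
    torch.manual_seed(seed)
    n = inputs.shape[0]
    order = torch.randperm(n)
    n_train = int(0.8 * n)  # 80% of individuals for training
    train_idx, test_idx = order[:n_train], order[n_train:]

    model = DenoisingAutoencoder(inputs.shape[1], targets.shape[1])
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # assumed optimizer
    loss_fn = nn.MSELoss()                                      # assumed loss

    model.train()
    for _ in range(epochs):
        for batch in train_idx[torch.randperm(n_train)].split(batch_size):
            if len(batch) < 2:  # batch norm needs more than one sample in training
                continue
            # Denoising: corrupt the inputs, but score against clean targets.
            noisy = inputs[batch] + noise_sd * torch.randn_like(inputs[batch])
            optimizer.zero_grad()
            loss = loss_fn(model(noisy), targets[batch])
            loss.backward()
            optimizer.step()

    model.eval()
    with torch.no_grad():
        predicted = model(inputs[test_idx])
    truth = targets[test_idx]
    mape = (100 * (predicted - truth).abs() / truth.abs()).mean().item()
    return model, mape

# Toy usage on random positive data (illustrative only; not the pub's data).
torch.manual_seed(0)
toy = 10 + torch.rand(200, 10)
model, toy_mape = train_and_evaluate(toy, toy[:, :5], epochs=3)
```

Changing the number of input columns while keeping the same five target columns mirrors the comparison of the 5-, 10-, 20-, and 30-phenotype models.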
After filtering, imputation, and rank normalization (see “Data collection/generation” section) we computed the frequency of nonlinear phenotypic relationships for each data set. To do so, we fit a linear and a nonlinear model for all possible phenotypic pairs within the data set. We generated the linear model with a linear regression (`lm` function in R). We generated the nonlinear model using a generalized additive model (`gam` function in the R package mgcv) with a single smoothing spline term (via the mgcv function `s`). We compared model fits using the Akaike information criterion (AIC) and considered three possible outcomes: a tie (equal AIC), the nonlinear model is a better fit (nonlinear = lower AIC), or the linear model is a better fit (linear = lower AIC). We then calculated the frequency of nonlinearity as the number of cases in which the nonlinear model had lower AIC divided by the total number of phenotypic comparisons.
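The pairwise model comparison can be illustrated with the Python sketch below. Note the substitution: instead of a GAM with a smoothing spline, the nonlinear model here is a cubic polynomial fit with NumPy, chosen only to keep the example dependency-free. The function names, the Gaussian AIC formula (up to an additive constant), and fitting each unordered pair in a single direction are assumptions.

```python
import numpy as np
from itertools import combinations

def gaussian_aic(y, fitted, n_coefficients):
    """AIC of a least-squares fit with Gaussian errors, up to a constant."""
    n = len(y)
    rss = float(np.sum((y - fitted) ** 2))
    return n * np.log(rss / n) + 2 * (n_coefficients + 1)  # +1 for error variance

def pair_is_nonlinear(x, y, degree=3):
    """True when the curved fit has strictly lower AIC than the straight line."""
    linear = np.polyval(np.polyfit(x, y, 1), x)
    curved = np.polyval(np.polyfit(x, y, degree), x)  # stand-in for the GAM
    return gaussian_aic(y, curved, degree + 1) < gaussian_aic(y, linear, 2)

def nonlinearity_rate(data):
    """Fraction of phenotype pairs for which the nonlinear model wins on AIC."""
    pairs = list(combinations(range(data.shape[1]), 2))
    wins = sum(pair_is_nonlinear(data[:, i], data[:, j]) for i, j in pairs)
    return wins / len(pairs)

# Toy example: the second phenotype is a noisy quadratic function of the first.
rng = np.random.default_rng(0)
x = rng.normal(size=300)
toy = np.column_stack([
    x,
    x ** 2 + rng.normal(scale=0.1, size=300),
    2 * x + rng.normal(scale=0.1, size=300),
])
rate = nonlinearity_rate(toy)
```

Ties and wins for the linear model both count as "not nonlinear" under the strict comparison, which matches the ratio described above.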
Given the possible diversity of phenotypic relationships within any given data set, and to facilitate the measurement of variance in nonlinearity rates, we used a permutation-based approach to calculate nonlinearity across subsets of each data set. To do so, we calculated the nonlinearity rate for 1,000 random sets of phenotypes for each data set (data proportion per random set = 0.25). We visualized this distribution using violin plots (as in Figure 1, A). We then measured the variation of these permutation distributions using the R function `var` (as in Figure 1, B) and calculated the correlation between all phenotype pairs using Pearson’s correlation (as in Figure 1, C).
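A sketch of the subsampling procedure follows. It assumes that the 0.25 proportion refers to the share of phenotypes drawn per random set, and it uses a placeholder statistic (mean absolute Pearson correlation) where the nonlinearity rate would go. Both function names are hypothetical.

```python
import numpy as np

def subsampled_statistic(data, statistic, n_sets=1000, proportion=0.25, seed=0):
    """Evaluate `statistic` on random subsets of phenotypes (columns)."""
    rng = np.random.default_rng(seed)
    n_phenotypes = data.shape[1]
    set_size = max(2, int(round(proportion * n_phenotypes)))
    values = np.empty(n_sets)
    for s in range(n_sets):
        columns = rng.choice(n_phenotypes, size=set_size, replace=False)
        values[s] = statistic(data[:, columns])
    return values

def mean_abs_correlation(block):
    """Placeholder statistic: mean absolute Pearson correlation across pairs."""
    corr = np.corrcoef(block, rowvar=False)
    upper = corr[np.triu_indices_from(corr, k=1)]
    return float(np.mean(np.abs(upper)))

# Toy example with 20 unrelated phenotypes for 200 individuals.
rng = np.random.default_rng(1)
toy = rng.normal(size=(200, 20))
values = subsampled_statistic(toy, mean_abs_correlation, n_sets=100)
spread = np.var(values, ddof=1)  # sample variance, as R's var computes
```

Passing a nonlinearity-rate function as `statistic` would yield the distribution that the violin plots summarize.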
To further dissect patterns of phenotypic nonlinearity, we generated 10,201 phenotypic matrices spanning possible combinations of gene-gene interaction and pleiotropy probabilities (each ranging from zero to one in increments of 0.01). We measured the nonlinearity rate of each phenotypic matrix using the same approach outlined above. We visualized the distribution of nonlinearity as a function of gene-gene interaction and pleiotropy by creating a generalized additive model (GAM). Here, nonlinearity was treated as a response variable predicted by gene-gene interaction and pleiotropy and was implemented using the `gam` function in the R package mgcv (as in Figure 2, B). The predicted nonlinearity values are visualized in two dimensions, representing all possible combinations of gene-gene interaction and pleiotropy probabilities.
We next wanted to characterize the entropy of full phenotypic data sets. Taking influence from the phenotypic integration literature [21], we first calculated the “generalized variance” (the determinant of the variance-covariance matrix) for each phenotypic matrix. Generalized variance is a useful measure in that it allows us to directly compare phenotypic data sets with different dimensionalities [21]. To extract a single-vector descriptor, we then calculated the eigenvector of the generalized variance matrix using spectral decomposition (`eigen` R function). We then calculated the entropy of the leading eigenvector using the R function `entropy.empirical` from the R package entropy [22]. We could thus use the resulting entropy estimate to infer the overall information contained among an arbitrarily large set of phenotypic measurements.
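As a rough Python analogue of these steps, the sketch below computes the generalized variance and a plug-in entropy of the leading eigenvector. Two assumptions are worth flagging: the eigendecomposition is applied to the variance-covariance matrix, and the leading eigenvector is turned into frequencies by taking absolute values and normalizing them to sum to one. How the original analysis handles negative eigenvector entries is not stated in the text, and the function names are hypothetical.

```python
import numpy as np

def generalized_variance(data):
    """Determinant of the variance-covariance matrix of the phenotypes."""
    return float(np.linalg.det(np.cov(data, rowvar=False)))

def leading_eigenvector_entropy(data):
    """Plug-in entropy (in nats) of the normalized leading eigenvector."""
    cov = np.cov(data, rowvar=False)
    # eigh returns eigenvalues in ascending order; the last column is leading.
    _, eigenvectors = np.linalg.eigh(cov)
    leading = np.abs(eigenvectors[:, -1])  # assumption: drop the sign
    p = leading / leading.sum()
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

# Toy example: four phenotypes that are near-copies of one shared signal.
rng = np.random.default_rng(0)
shared = rng.normal(size=500)
toy = shared[:, None] + 0.01 * rng.normal(size=(500, 4))
gv = generalized_variance(toy)
h = leading_eigenvector_entropy(toy)
```

For these near-identical phenotypes the leading eigenvector loads almost equally on all four columns, so `h` sits close to its maximum of ln(4).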
We next developed a method to infer the correlational structure of a phenotypic set by calculating entropy across increasingly large, random subsets of phenotypes. Broadly, this method sweeps through pre-set portions of a data set, randomly selects a set of phenotypes for each portion, and calculates entropy using the method described above. We applied this test to phenotypic matrices with varying probabilities of pleiotropy (probability zero to one, 0.01 increments) by calculating entropy for increasingly large proportions of samples (10% to 90%, 10% increments). For each portion, we analyzed 10 permuted sets of phenotypes and calculated their mean entropy. The results of this analysis appear in Figure 3, A. We extracted slopes of the resulting entropy distributions from a linear regression (`lm` function in R) comparing pleiotropy probability and entropy (Figure 3, B).
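The mechanics of the sweep can be sketched as follows. This example shows only the procedure (portions, permutations, mean entropy, slope) and makes no claim about reproducing the reported patterns. It assumes that each portion is a share of phenotypes (columns), that the slope comes from regressing mean entropy on portion size, and that entropy is computed from the normalized absolute leading eigenvector. All names are hypothetical.

```python
import numpy as np

def eigenvector_entropy(block):
    """Plug-in entropy (nats) of the normalized leading eigenvector."""
    cov = np.atleast_2d(np.cov(block, rowvar=False))
    leading = np.abs(np.linalg.eigh(cov)[1][:, -1])
    p = leading / leading.sum()
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

def entropy_sweep(data, n_permutations=10, seed=0):
    """Mean entropy for random phenotype subsets covering 10% to 90% of columns."""
    rng = np.random.default_rng(seed)
    proportions = np.arange(1, 10) / 10
    n_phenotypes = data.shape[1]
    means = []
    for proportion in proportions:
        size = max(1, int(round(proportion * n_phenotypes)))
        entropies = [
            eigenvector_entropy(
                data[:, rng.choice(n_phenotypes, size=size, replace=False)]
            )
            for _ in range(n_permutations)
        ]
        means.append(np.mean(entropies))
    means = np.array(means)
    # Slope of a straight-line fit of mean entropy against portion size.
    slope = float(np.polyfit(proportions, means, 1)[0])
    return proportions, means, slope

# Toy example: 20 unrelated phenotypes for 300 individuals.
rng = np.random.default_rng(2)
toy = rng.normal(size=(300, 20))
proportions, means, slope = entropy_sweep(toy)
```

Running the same function on matrices simulated under different pleiotropy probabilities gives one slope per matrix, which can then be compared against pleiotropy.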
SHOW ME THE DATA: You can find all the data used in this pub, including empirical and simulated phenotypes, on Zenodo (DOI: 10.5281/zenodo.8298808).
We calculated autoencoder prediction error as the mean absolute percent error between prediction and ground truth. We conducted autoencoder training using 80% of the individuals in the data set and evaluated accuracy on the remaining 20%.
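For reference, the error metric can be written as a short function. This is the standard definition of mean absolute percentage error; the function name and the toy numbers are illustrative and are not taken from the pub's results.

```python
import numpy as np

def mean_absolute_percentage_error(truth, prediction):
    """100 * mean(|truth - prediction| / |truth|); truth must be nonzero."""
    truth = np.asarray(truth, dtype=float)
    prediction = np.asarray(prediction, dtype=float)
    return float(100 * np.mean(np.abs((truth - prediction) / truth)))

# Worked example: each prediction is off by 10% of its true value.
example_error = mean_absolute_percentage_error([100.0, 200.0], [110.0, 180.0])
```

Because each error is scaled by the true value, the metric is undefined when a true phenotype value is zero and becomes unstable when true values are very close to zero.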
We calculated entropy as above using the same parameters (portions: 10% to 90% of samples in 10% increments; 10 permutations per portion).
To our knowledge, it remains unclear just how common additivity and linearity are in genetic systems. To address this, we compiled a data set of “polyphenotypes” (see definition above) from a diverse set of interbreeding species populations (see Approach for details). We reasoned that inferring the rate of nonlinear phenotypic relationships would allow us to glean how well linear/additive models would fit these populations.
Using a simple test (see Approach) we determined the (non)linearity of pairwise phenotypic relationships within each species population. We found that all species display rates of nonlinearity that are significantly greater than expected by chance (p < 0.001, Kruskal-Wallis test) (Figure 1, A), ranging from 43.5% (fruit flies) to 84.4% (Arabidopsis) (Figure 1, A). These observations support the idea that nonlinearity is a prevalent feature of biological phenotypes and contributes to a substantial portion of species’ phenotypic relationships.
The range of nonlinearity also differs greatly across populations. For example, nematodes display more than 6× the variation in phenotypic relationships seen in fruit flies (C. elegans = 0.19, fruit flies = 0.029; mean normalized standard deviation) (Figure 1, B), while randomly generated data display the greatest degree of relative variation (0.39; mean normalized standard deviation). Interestingly, these randomly generated data should be largely independent of each other and, thus, may be considered representative of a set of non-pleiotropic, additive traits. Supporting this idea, we found that the mean pairwise correlation of the random phenotypes is significantly less than that of the species data (Figure 1, C). Overall, these observations suggest that complex aspects of phenotypic relationships may be inferred using a set of relatively simple descriptive statistics.
However, because these data sets are heterogeneous, it is hard to use them to determine how epistasis and pleiotropy might affect the frequency of phenotypic nonlinearity. Some data come from advanced genetic crosses (e.g., the DGRP and JAX data) while other data sets sampled variants from a diverse natural population (Arabidopsis). In addition, the polyphenotypes reflect the interests of the original studies and, therefore, occupy somewhat random and undetermined regions of phenotypic space. Therefore, while it is apparent that nonlinearity exists in a variety of quantitative genetic data sets, it is difficult to use these data to develop strong intuitions about its sources.
With this in mind, we sought to control many of these covariates to allow more direct interrogation of pleiotropy, epistasis, and phenotypic structure. Using a novel approach, we generated a series of polyphenotypes from simulated genotypic data. Briefly, we generated n random genotypes from a probability distribution for a given number of individuals (see Approach for a more in-depth description). Each genotype could influence an output phenotype given a set probability distribution and could interact with others via a predefined probability. We also allowed genes to influence more than one phenotype with a set probability, letting us vary the amount of epistasis (probability of gene-gene interactions) and pleiotropy (probability of phenotype-phenotype interactions) in the data. Using this approach, we generated a data set in which all combinations of epistatic and pleiotropic probabilities were considered (from P = 0 to P = 1, 0.01 increments). This produced a final set of 10,201 polyphenotypes, each containing 20 synthetic phenotypes measured across 600 simulated individuals.
A main goal in generating this synthetic data set was for it to capture a broad range of nonlinear relationships. To assess how well the data set accomplishes this, we used the same test as above (see Figure 1; Approach) to calculate the rate of nonlinearity for each of the 10,201 polyphenotypes. Notably, these percentages span the values observed among the empirical phenotypes, with a mean nonlinearity rate of 47.53% (min = 16%, max = 100%; Figure 2, A; Figure 1, A). In addition, nonlinearity varies smoothly across the distribution of pleiotropy/gene interaction probabilities (Figure 2, B). Together, these observations suggest that our data generation approach successfully produced a naturalistic range of nonlinearity from which to sample.
How can we best identify drivers of biological nonlinearity? Can we statistically decouple the effects of different genetic and phenotypic interactions? Motivated by our efforts to apply information theory to genetics (more on this in a companion pub coming soon), we hypothesized that we may start to untangle some of the drivers of nonlinearity by measuring the information content (or “entropy” as defined in information theory) of polyphenotypes. We reasoned that entropy may be informative in multiple ways. First, overall entropy reflects the interrelatedness of polyphenotypes. Lower values may reflect a set of phenotypes that are driven by the same underlying biology (e.g., multiple, correlated measurements of a trait such as finger length). On the other hand, higher values may indicate that polyphenotype data contain measurements from multiple, orthogonal features of biology (e.g., finger length and education level). The second way entropy may be informative is through its distribution across different portions of a polyphenotype. Consider the case of completely orthogonal phenotypes. If we select random combinations of orthogonal phenotypes and measure their entropy, it should be the case that entropy proportionally increases as we analyze larger and larger sets of phenotypes (i.e., more new information is being added with each increase in the number of randomly chosen phenotypes). In contrast, for a set of strongly correlated phenotypes (e.g., in the case of pleiotropy), one should expect entropy to stay constant as we analyze larger sets of the phenotypes (i.e., no new information is added).
Applying this framework to all 10,201 polyphenotypes, we calculated entropy across increasing proportions of randomly selected phenotypes (see Approach). We found that entropy distributions strongly vary with the probability of pleiotropy. Increasing pleiotropy equates with a flattening of the distribution (Figure 3, A). This point is further demonstrated by an extremely strong relationship between entropy distribution slopes and pleiotropy (Pearson’s r = −0.95; Figure 3, B). These observations support the notion that entropy is a reliable measure of the interrelatedness of a set of phenotypes. What’s more, this suggests that, by analyzing the within-polyphenotype distribution of entropy, we may infer the amount of phenotypic pleiotropy with minimal knowledge of the underlying genetics.
Are similar measures available for determining the frequency of gene-gene interactions? Taking a hint from the previously identified relationship between nonlinearity and gene-gene interactions/pleiotropy in Figure 2, we found that nonlinearity varies strongly in the absence of gene-gene interactions but decreases in dynamic range as interaction probability increases (Figure 3, C). Furthermore, comparing the relationship between the entropy slope and percentage of nonlinearity reveals an interesting trade-off between gene-gene interactions and pleiotropy (Figure 3, D). There is an overall negative relationship (Pearson’s r = −0.82) suggesting that, as pleiotropy increases, so too do nonlinear phenotypic interactions. Furthermore, as gene-gene interactions increase (as indicated by point color in Figure 3, D), the variance of entropy/nonlinearity relationships decreases.
Taken together, these results suggest that pleiotropy leads to increasingly nonlinear phenotypic relationships, especially in the absence of genetic interactions. Furthermore, we can study this trade-off via entropy and nonlinearity, which are both non-genetic measures. Finally, these patterns indicate that phenotypic nonlinearity — like that observed both here and among real phenotypes (Figure 1) — also reflects genetic nonlinearities, hinting at potential insufficiencies of additive/linear models for capturing the genetic components of biological traits.
We indicate pleiotropy probability (A and B) and the probability of genetic interactions (C and D) using a color scale.
If genetic nonlinearities are truly prevalent across many types of biological traits, which types of models might be better suited for capturing their effects? Neural network-based strategies are enticing options for several reasons. Neural networks inherently learn nonlinearities across their layers, letting them model complex interactions between inputs and outputs (e.g., between phenotypes and genotypes). In addition, they can model multiple inputs and outputs, facilitating nonlinear mapping of multiple phenotypes at once. We therefore hypothesized that neural network strategies might help us determine the benefit of accounting for complex, nonlinear interactions between phenotypes.
To do this, we constructed an autoencoder for modeling and predicting phenotypic relationships (see Approach; Figure 4, A). Taking simulated polyphenotypes as input, the model encoded phenotypic relationships into a lower-dimensional latent space and generated predictions via a decoder (Figure 4, A). We used this strategy to predict aspects of all 10,201 polyphenotypes. Specifically, we generated four sets of predictions for each, varying the number of input phenotypes (n = 5, 10, 20, 30) used to predict an output set (n = 5; Figure 4, A). We then assessed the accuracy of each model by calculating the percent error between observed and predicted phenotypes.
We found that the autoencoder approach predicted phenotypes with extremely high accuracy (mean error = 1.76%; Figure 4, B) and that models become more accurate when the number of input phenotypes increases (Figure 4, B). In fact, all models using more than five input phenotypes display significantly decreased error distributions (Kruskal-Wallis test followed by Dunn’s test; Figure 4, B). It’s also apparent that the error distributions themselves display a degree of heterogeneity, with some clear outliers displaying error percentages above 4% (Figure 4, B). Plotting percent error as a function of pleiotropy shows that these outliers are associated with cases in which the probability of pleiotropic interactions was very low (Figure 4, C), indicating that pleiotropic interactions can help increase the accuracy of multi-phenotype models. More broadly, it’s apparent that by accounting for nonlinearities such as pleiotropy, autoencoder strategies are able to predict individual phenotypes with great accuracy.
Finally, we explored whether these models could generate realistic polyphenotypes. To do so, we measured the entropy of each set of predicted phenotypes. Overall entropy decreases as the probability of pleiotropy increases (Figure 5, A), reflecting the patterns observed among the input phenotypes (Figure 4, A). There is also a similarly tight relationship between percent error and entropy across all models (Pearson’s r = 0.78; Figure 5, B). When comparing mean percent error and entropy slope, we found a strong separation between the five-phenotype model and all others (Figure 5, C). Models with more input phenotypes display lower entropy slopes and greater degrees of model accuracy. Furthermore, the 10-phenotype model displayed the lowest error rate and entropy slope (Figure 5, C), a pattern that is also apparent in the comparisons of percent error across the models (Figure 4, B). It is interesting to consider that this may reflect something important about the structure of the synthetic phenotypes. Specifically, the 10-phenotype model may represent a better trade-off between input phenotype information content and the overall model complexity (i.e., the 20- and 30-phenotype models may just be adding redundant information).
In total, these findings suggest that the autoencoder did indeed create realistic polyphenotypes with expected entropy distributions. Given this, we conclude that models 1) accommodating polyphenotypic designs and 2) accounting for biological nonlinearity provide opportunities to greatly increase the predictive capacity of genetic analysis.
Nonlinearity is a prevalent feature of biological phenotypes (Figure 1)
Phenotypic nonlinearity varies as a function of genetic and phenotypic interactions (Figure 2)
Measures from information theory, such as entropy, can quantify the structure of phenotypic interactions (Figure 3)
Models that account for nonlinearity and phenotypic interactions have increased predictive potential and improve as the number of phenotypes increases (Figure 4)
Model accuracy varies as a function of the information content of phenotype sets (Figure 5)
Biology is in an age of increasingly large, high-dimensional, and complex data sets. Endeavors such as AlphaFold are attempting to map the full universe of protein structures [23][24]. Similarly, a number of multi-team efforts are characterizing human cell type diversity via a host of omics and cell biological data types [25]. These data sets — and others like them — contain (or will contain) a diversity of phenotypic measurements possessing unknown and complex inter-relationships. A goal for many of these efforts will be to identify these relationships and, ultimately, use them to decode the function of complex biological systems (e.g., identifying how RNA expression, chromatin accessibility, and cell morphology interact to generate a cell type). This undertaking butts up against a statistical sampling problem: is there enough data available to power such analysis for the system you’re interested in? Put another way, have you sampled enough of “phenotypic space” to account for the biology in question?
These are hard questions to answer a priori. However, asking them is useful. If it’s possible to efficiently sample phenotypic space, minimizing measurement redundancy, then scalability and cost-effectiveness would correspondingly increase. It’s interesting to consider how the aggregated results of this pub may help. Our framework predicts that samplings of different parts of phenotypic space should be associated with correspondingly variable parameter combinations (Figure 6). We’d predict that sampling a single phenotype would be associated with uniformly low values of entropy, nonlinearity, and predictiveness (Figure 6, A). On the other hand, if multiple correlated phenotypes (perhaps due to pleiotropy) were sampled, nonlinearity and predictiveness would increase, but entropy would not (Figure 6, B; Figure 4, C). If multiple orthogonal phenotypes were measured, we’d find increased entropy and predictiveness and a modest amount of nonlinearity (Figure 6, C; Figure 4, C). Because entropy and nonlinearity are numerically general measures, it should be possible to compute them for most (if not all) polyphenotypes.
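The three sampling scenarios above can be illustrated with a small sketch that covers the entropy axis only (not nonlinearity or predictiveness). It uses synthetic Gaussian data and an assumed entropy measure, the Shannon entropy of the normalized eigenvalue spectrum of the phenotype correlation matrix, which may differ from the measure used in this pub. The function name `spectral_entropy` and all data below are hypothetical.

```python
import numpy as np

def spectral_entropy(phenotypes):
    """Shannon entropy (bits) of the normalized eigenvalue spectrum of the
    phenotype correlation matrix. ASSUMPTION: an illustrative measure only."""
    phenotypes = np.asarray(phenotypes, dtype=float)
    if phenotypes.ndim == 1 or phenotypes.shape[1] == 1:
        return 0.0  # a single phenotype has a one-eigenvalue spectrum
    corr = np.corrcoef(phenotypes, rowvar=False)
    eigvals = np.clip(np.linalg.eigvalsh(corr), 0.0, None)
    p = eigvals / eigvals.sum()
    p = p[p > 0]  # drop zeros so log2 is defined
    return float(-(p * np.log2(p)).sum())

rng = np.random.default_rng(1)
n = 1000
shared = rng.normal(size=(n, 1))  # one underlying generative process

scenarios = {
    # One phenotype sampled
    "single": shared,
    # Five phenotypes driven by the same process (e.g., via pleiotropy)
    "correlated": shared + 0.3 * rng.normal(size=(n, 5)),
    # Five phenotypes driven by independent processes
    "orthogonal": rng.normal(size=(n, 5)),
}

results = {name: spectral_entropy(data) for name, data in scenarios.items()}
for name, value in results.items():
    print(f"{name:>10}: {value:.2f} bits")
```

Under these assumptions, the single and correlated sets stay at low entropy, while the orthogonal set approaches the maximum for five phenotypes (log2 of 5, about 2.32 bits).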
Given this, we propose that these measures may be implemented as a general-purpose toolkit for inferring the phenotypic structure, predictiveness, and even genetic patterns (i.e., gene-gene interactions and pleiotropy) associated with a given polyphenotypic data set. There are likely many useful extensions of this. For one, we can use phenotypic entropy to measure the complexity of a phenotypic data set. We could therefore determine if ongoing collection is adding new or informative dimensions to a data set. Similarly, we may use entropy and nonlinearity to estimate the number of generative biological processes associated with a polyphenotype (if one, then entropy should be low and nonlinearity high; if multiple, entropy will increase). Indeed, the entropy of a polyphenotype is the number of bits of information necessary to capture the phenotypic structure and, as a result, the generative processes that drive that structure. With these measures in hand, it’s possible to hypothesize, a priori, the structure of genetic mapping results and, by factoring in these patterns, design studies around minimally necessary and maximally informative polyphenotypes.
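As a rough illustration of checking whether ongoing collection adds informative dimensions, the sketch below compares the entropy of a phenotype set before and after a candidate phenotype is added. It is not part of the pub's toolkit: the entropy measure (spectral entropy of the correlation matrix), the helper names `spectral_entropy` and `entropy_gain`, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np

def spectral_entropy(phenotypes):
    """Shannon entropy (bits) of the normalized eigenvalue spectrum of the
    phenotype correlation matrix. ASSUMPTION: an illustrative measure only."""
    corr = np.corrcoef(phenotypes, rowvar=False)
    eigvals = np.clip(np.linalg.eigvalsh(corr), 0.0, None)
    p = eigvals / eigvals.sum()
    p = p[p > 0]  # drop zeros so log2 is defined
    return float(-(p * np.log2(p)).sum())

def entropy_gain(existing, candidate):
    """Change in entropy (bits) when a candidate phenotype column is added."""
    combined = np.hstack([existing, candidate])
    return spectral_entropy(combined) - spectral_entropy(existing)

rng = np.random.default_rng(2)
n = 1000
existing = rng.normal(size=(n, 3))  # three independent phenotypes

# Candidate that nearly duplicates an existing phenotype
redundant = existing[:, [0]] + 0.05 * rng.normal(size=(n, 1))
# Candidate that is independent of everything measured so far
novel = rng.normal(size=(n, 1))

gain_redundant = entropy_gain(existing, redundant)
gain_novel = entropy_gain(existing, novel)
print(f"gain from redundant phenotype: {gain_redundant:+.2f} bits")
print(f"gain from novel phenotype:     {gain_novel:+.2f} bits")
```

With this particular measure, a redundant phenotype yields a change near zero (it can even be slightly negative, since the spectrum becomes more concentrated), whereas an independent phenotype yields a clear positive gain.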
More generally, a “phenotype-forward” framework that allows for complex nonlinear relationships between traits (as we suggest here) could capture organismal structure that is likely missed when we examine phenotypes individually. For example, modeling height and weight simultaneously likely provides more biological insight and predictive ability than modeling them independently, as some of their causal mechanisms are shared. The neural network method we use here explicitly captures these relationships and can potentially “encode” the generative processes for sets of phenotypes with at least partially overlapping causes.
Treating the organism as a system in this way has the potential to answer more complex questions than modeling individual phenotypes alone. Such approaches may prove critical to leveraging the increasing amount of phenotypic data to achieve better biological understanding and outcomes across a host of problems.