David G. Mets



# Purpose

# Historical background

# Entropy, divergence, and mutual information

## Entropy

### Example 1: Coin tosses

### Example 2: Allelic state at a single locus

### Example 3: Single phenotype

### Joint Entropy

### Conditional entropy

## Mutual information

### Conditional mutual information

## Kullback-Leibler divergence

## Equivalencies

## Extension to multiple genes and multiple phenotypes

### Chain rule for entropy

### Chain rule for mutual information

# Applying information theory to genetics

## Polyphenotypic analysis

### Examination of many phenotypes likely provides information about any one phenotype

### Theorem:

### Proof:

### Pleiotropy decreases total trait entropy

### Theorem:

### Proof:

# Key takeaways

# What’s next?

##### Contributors

(A–Z)

Applying information theory to genetics can better explain biological phenomena

Genetic models of complex traits often rely on incorrect assumptions that drivers of trait variation are additive and independent. An information theoretic framework for analyzing trait variation can better capture phenomena like allelic dominance and gene-gene interaction.

Published on Sep 27, 2023


Genetic analysis has been one of the most powerful tools for biological discovery, providing insight into almost every aspect of biology, ranging from identifying mechanisms supporting the cell cycle [1][2], to guiding selective breeding for agriculture [3], and identifying targets for disease treatment [4]. While phenotypes can be simple (e.g., a single gene can cause differences in pea color or lead to a genetic disease), the vast majority are subject to more elaborate causal mechanisms involving many genetic and non-genetic factors. Researchers studying these phenotypes (often called "complex" phenotypes) have relied on assumptions of additivity and independence among the elements driving individual-to-individual phenotypic variation. It’s widely appreciated that, in real data, these assumptions are often violated, potentially limiting the utility and accuracy of some analyses [5]. However, this broad framework is retained for both historical and practical reasons [6][7]. Here, we explore a different and complementary mathematical framework that makes no assumptions about the drivers of phenotypic variation: we apply information theory to genetic questions with the objective of conducting system-wide analysis of large sets of genes, phenotypes, and environmental data.

This pub is intended as a regularly updated document covering how we are applying information theory to broad questions in genetics. As time progresses, and we release empirical studies of different topics, we will add sections here covering the information theory relevant to those studies. This work should be of interest to both geneticists and information theorists, but is primarily intended to formalize an information theoretic approach to genetic problems and make that approach available to geneticists. Accordingly, the first section after the introduction is a primer on major concepts in information theory intended for geneticists. The subsequent sections contain information theoretic definitions for genetic concepts and demonstrations of how these definitions provide insight into genetic processes.

This pub is part of the **platform effort**, "Genetics: Decoding evolutionary drivers across biology." Visit the platform narrative for more background and context.


Contemporary quantitative genetics treats genetic influences on phenotypes as additive and independent of one another, and generally assumes that any one phenotype is separate from others [6]. The reasons for this are both historical and practical. Just prior to the turn of the twentieth century, the study of human phenotypic variation (biometrics) was at its peak. Early biometric studies observed that phenotypic distributions among humans were often continuous and, across generations, appeared to vary gradually and not in jumps (e.g., [8]). Therefore, the field assumed that drivers of this continuous variation were themselves continuous, a model consistent with the then-new theory of evolution, in which phenotypes were expected to change gradually across generations. The tools and principles developed during the period (e.g., the mixture model, the t-distribution, the chi-square distribution) reflect these assumptions and, ultimately, came to form much of the theoretical backing of modern statistical genetics [9].

Around the same time, von Tschermak, de Vries, Spillman, and Correns "rediscovered" the work of Gregor Mendel [10]. Mendel’s observations contradicted the dogma of continuity developed by biometrics. Through now-famous sets of experiments, Mendel found that phenotypes can in fact vary discretely within populations and across generations. For example, the hybridization of a yellow and a green pea plant could produce offspring that were either yellow or green, but not a combination of the two. Thus, some of the inherited drivers of phenotypic differences were discrete and not continuous. Subsequent experimental work in a variety of different organisms has strongly reinforced this view [10] and ultimately led to the coining of the term "gene" to describe the indivisible unit of heritable variation [11].

The presence of discrete units of inheritance (genes) and, in some settings, dramatic phenotypic change across generations led to a "non-gradualistic" view of inheritance (e.g., [12] and [13]). The "gradualists" and the "non-gradualists" were divided by a fundamental problem: how could phenotypes — often continuous and only gradually changing — be caused by discrete units of inheritance? Ronald Fisher provided a reconciliation in 1918. Through groundbreaking theoretical work, Fisher demonstrated that many discrete, additive, independent units of inheritance of small effect could generate continuously varying phenotypes within a population [14]. Furthermore, these assumptions were consistent with Mendel’s results. Fisher suggested that each trait (and the factors influencing that trait) could segregate independently following mating. By elegantly providing a resolution to the continuous/discrete paradox, Fisher thus forged the fundamental assumptions for genetic analysis that we still rely on today [7].

However, in the following decades, extensive work on the function and inheritance of genes established clear violations of additivity and independence [10]. Instead, modern biology has demonstrated that genes and their products are highly interactive and involved in complicated, nonlinear processes such as physical complexes, regulatory circuits, and metabolic circuits. Furthermore, these complex interactions may drive phenotypic variation across individuals via dependent and non-additive relationships between genes.

A clear example of such nonlinear relationships is epistasis [15], in which the effect of one gene can mask or modify the phenotypic impact of another. Epistasis is a common feature of genetic systems and is so prevalent that researchers began to use it to identify functionally related genes [10]. Genes that, when combined, caused no different phenotype than the individual genes alone were called "epistasis groups." For example, in *Saccharomyces cerevisiae*, the members of the RAD52 epistasis group were all individually sensitive to irradiation, and when combined, were no more sensitive than any one mutant. This suggests a functional relationship between the individual genes; if a mutation in any one of the genes disrupts the "functional unit," then further mutations in other members of that unit will not change the phenotype [16]. Many epistasis groups were identified through mutagenesis, but naturally occurring epistasis is prevalent and important for evolution [17]. Fisher’s initial reconciliation assumed no epistasis, an assumption that largely remains in contemporary models [7]. Given the complexity of biological systems, the resulting potential for phenomena like epistasis, and empirical evidence that such phenomena exist, a modeling framework that does not include gene-gene interactions (as is common in quantitative genetics) will likely fail to account for key aspects of the genotype-phenotype map. Indeed, in recent years many studies have explicitly demonstrated this problem [18].

To date, the solution has not been obvious. If we use the same statistical framework that’s been applied historically, capturing nonlinear relationships among genes would require data from an enormous number of individuals. Including interactions in traditional linear models (e.g., genome-wide association studies) would require the number of model parameters to scale with the square of the number of genetic or environmental factors. It’s common to conduct human genetic analysis using hundreds of thousands of genetic loci. Capturing interactions between even 100,000 loci would require a model with 10 billion parameters. Fitting such a model would require data from more humans than exist. As a result, despite increasing computational power, the utility of these models to effectively capture nonlinearity will always be limited by the available data.
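The scaling argument above can be made concrete with a short calculation. The sketch below is only illustrative: the function name `interaction_parameter_counts` is hypothetical, and it assumes one parameter per locus plus one per pairwise interaction. Counting each unordered pair of distinct loci once gives roughly half of the "square of the number of factors" figure, but both counts are in the billions for 100,000 loci.

```python
from math import comb


def interaction_parameter_counts(n_loci: int) -> dict:
    """Count parameters for a linear model with main effects and pairwise interactions.

    Hypothetical helper for illustration only; it assumes one parameter per
    locus and one per pair of loci.
    """
    return {
        "main_effects": n_loci,
        # Each pair of distinct loci counted once: n * (n - 1) / 2.
        "unordered_pairs": comb(n_loci, 2),
        # Upper bound that scales with the square of the number of loci.
        "ordered_pairs": n_loci ** 2,
    }


counts = interaction_parameter_counts(100_000)
for name, value in counts.items():
    print(f"{name}: {value:,}")
```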

We suggest using information theory to quantify the drivers of trait variation. Information theory was originally developed to formalize thinking about encoding schemes for communication [19], and to provide answers to questions like, "What’s the minimal amount of information required to encode a message?" or "How many bits of information are required to store this text document?" Since its inception, information theory has become very broad. Importantly for genetic analysis, we can use it to partition and quantify the impact of factors driving variation in a set of data. This allows us to answer questions like, "How much better can I predict the phenotype of an individual if I know that individual’s genotype?" or "How much information does genetic data contain about disease state?" In contrast to methods traditionally used in quantitative genetics, it makes no assumptions about the nature of factors impacting variation, so it may enable new, tractable, analyses capturing nonlinear relationships and lead to better mappings between genotypes and phenotypes.

In this section, we review some fundamental components of information theory and provide examples of how we might apply them to genetic data. In subsequent sections, we’ll expand on these examples and contrast genetically relevant information theoretic measures to similar measures from classical statistical genetics.

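Entropy, $\mathrm H$, quantifies the uncertainty about the state of a random variable.

For a random variable $X$ that takes states $x \in \mathcal{X}$ with probability $p(x)$, entropy is defined as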

$\begin{aligned}
\mathrm H(X) := -\sum\limits_{x\in \mathcal{X}} p(x) \log_{2} p(x)
\end{aligned}$

Consider two coins: one fair, Pr{heads} = 0.5, and one biased, Pr{heads} = 0.9. The degree of uncertainty about the outcome of a coin toss is higher for the fair coin than for the biased coin. A toss of the fair coin is equally likely to result in heads or tails. The biased coin is more likely to turn up heads. Entropy captures this intuition. The entropy for the fair coin is

$\begin{aligned}
\mathrm H(X) &= -\sum_{i=1}^{n} p(x_{i}) \log_{2} p(x_{i})\\
&= -\sum_{i=1}^{2} 0.5 \log_{2} 0.5\\
&= -\sum_{i=1}^{2} 0.5 \cdot (-1)\\
&= 0.5+0.5 = 1
\end{aligned}$

Whereas the entropy of the biased coin is

$\begin{aligned}
\mathrm H(X) &= -\sum_{i=1}^{n} p(x_{i}) \log_{2} p(x_{i})\\
&= -0.9 \cdot \log_{2} (0.9) - 0.1 \cdot \log_{2} (0.1)\\
&\approx 0.137+0.332 \approx 0.469
\end{aligned}$

Thus, entropy is lower for the more predictable (biased) coin than for the less predictable (fair) coin. Indeed, the fair coin, with equal probability for all states, has the maximum entropy (1 bit) for a random variable with two states. For any random variable, a probability distribution that is uniform across states results in the maximal entropy.
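The two coin calculations can be reproduced with a few lines of code. This is a minimal sketch: the `entropy` helper is our own illustrative function, not part of any library, and it assumes the distribution is supplied as a list of probabilities that sum to 1.

```python
from math import log2


def entropy(probs):
    """Shannon entropy, in bits, of a discrete distribution."""
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    # States with zero probability contribute nothing (0 * log 0 is taken as 0).
    return -sum(p * log2(p) for p in probs if p > 0)


fair_coin = entropy([0.5, 0.5])
biased_coin = entropy([0.9, 0.1])
print(round(fair_coin, 3), round(biased_coin, 3))  # 1.0 0.469
```

The same function applies unchanged to variables with more than two states, such as the three allelic states of a gene.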

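Now consider two different genes, $A$ and $B$, each with three possible allelic states in a population. If the three states of $A$ ($AA$, $Aa$, and $aa$) are equally common, each with probability $\frac{1}{3}$, the entropy of $A$ is

$\begin{aligned}
\mathrm H(A) &= -\sum_{i=1}^{n} p(a_{i}) \log_{2} p(a_{i})\\
&= -p(a_{AA}) \log_{2} p(a_{AA})-p(a_{Aa}) \log_{2} p(a_{Aa})-p(a_{aa}) \log_{2} p(a_{aa})\\
&= -\sum_{i=1}^{3} \frac{1}{3} \log_{2} \frac{1}{3}\\
&\approx -\sum_{i=1}^{3} \frac{1}{3} \cdot (-1.58)\\
&\approx 0.528+0.528+0.528 \approx 1.58
\end{aligned}$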

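As compared to the fair coin, with only two possible outcomes, the "fair" (equal probability of each allelic state across individuals) gene, with three possible states, has an increase in entropy. For gene $B$, whose states $BB$, $Bb$, and $bb$ occur with probabilities 0.8, 0.1, and 0.1, the entropy is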

$\begin{aligned}
\mathrm H(B) &= -\sum_{i=1}^{n} p(b_{i}) \log_{2} p(b_{i})\\
&= -p(b_{BB}) \log_{2} p(b_{BB})-p(b_{Bb}) \log_{2} p(b_{Bb})-p(b_{bb}) \log_{2} p(b_{bb})\\
&= -0.8 \cdot \log_{2} (0.8) - 0.1 \cdot \log_{2} (0.1)- 0.1 \cdot \log_{2} (0.1)\\
&\approx 0.258+0.332+0.332 \approx0.922
\end{aligned}$

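Thus, the difference between $\mathrm H(A)$ and $\mathrm H(B)$ reflects how evenly allelic states are distributed: the skewed distribution of $B$ is more predictable and therefore has lower entropy.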

Similar to allelic state, we can calculate the entropy of a phenotype in a population. Unlike allelic state, phenotypes are often continuous (e.g., height) and not discrete (e.g., disease state). Throughout this work, for simplicity of exposition, we will only examine equations for discrete phenotypes. However, there are tools for estimating the information theoretic values we describe for continuous variables as well. Consider a disease trait, $T$, with two states, $t_T$ and $t_t$, that occur with probabilities 0.9 and 0.1. Its entropy is

$\begin{aligned}
\mathrm H(T) &= -\sum_{i=1}^{n} p(t_{i}) \log_{2} p(t_{i})\\
&= -p(t_{T}) \log_{2} p(t_{T})-p(t_{t}) \log_{2} p(t_{t})\\
&\approx 0.137+0.332 \approx 0.469
\end{aligned}$

We can extend the definition of entropy stated above to more than one random variable. Given genes $A$ and $B$, the joint entropy is defined as

$\begin{aligned}
&\mathrm H(A,B) := -\sum\limits_{a\in A}\sum\limits_{b\in B} p(a,b) \log_{2} p(a,b)
\end{aligned}$

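where the joint entropy is less than or equal to the maximal entropy of $A$ and $B$, the sum of their individual entropies: $\mathrm H(A,B) \le \mathrm H(A) + \mathrm H(B)$.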

As we will discuss later, the difference between the maximal entropy and the joint entropy of a set of variables (such as phenotypes) quantifies the decrease in randomness caused by relatedness among those variables. For a pair of traits, $T_1$ and $T_2$, this difference is $\mathrm H(T_1) + \mathrm H(T_2) - \mathrm H(T_1,T_2)$.
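To make the comparison between joint and maximal entropy concrete, the sketch below computes both for a small joint distribution. The probabilities are hypothetical values chosen only for illustration, and `entropy` and `marginal` are our own helper names rather than functions from the pub.

```python
from math import log2


def entropy(probs):
    """Shannon entropy in bits; zero-probability states contribute nothing."""
    return -sum(p * log2(p) for p in probs if p > 0)


def marginal(joint, axis):
    """Sum a joint distribution {(a, b): p} down to a single variable."""
    out = {}
    for states, p in joint.items():
        out[states[axis]] = out.get(states[axis], 0.0) + p
    return out


# Hypothetical joint distribution p(a, b) for two loci with two states each,
# constructed so that the states tend to co-occur.
joint = {
    ("A", "B"): 0.4, ("A", "b"): 0.1,
    ("a", "B"): 0.1, ("a", "b"): 0.4,
}

h_joint = entropy(joint.values())
h_max = entropy(marginal(joint, 0).values()) + entropy(marginal(joint, 1).values())
structure = h_max - h_joint  # reduction in uncertainty due to dependence

print(f"H(A,B) = {h_joint:.3f} bits, H(A) + H(B) = {h_max:.3f} bits")
```

If the four joint probabilities were all 0.25, the two variables would be independent and the two quantities would coincide.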

For two variables $A$ and $B$, the conditional entropy of $A$ given $B$ is defined as

$\begin{aligned}
&\mathrm H(A|B) := -\sum\limits_{a\in \mathcal{A}}\sum\limits_{b\in \mathcal{B}} p(a,b) \log_{2} p(a|b)
\end{aligned}$

If

Furthermore,

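Mutual information, $\mathrm I$, quantifies the information shared between random variables.

For two random variables $A$ and $B$, mutual information is defined as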

$\begin{aligned}
&\mathrm{I}(A;B) := \sum_{a \in \mathcal{A}} \sum_{b \in \mathcal{B}} p(a,b) \log_{2} \frac{p(a,b)}{p(a)p(b)}
\end{aligned}$

$\begin{aligned}
&\mathrm{I}(A;B) := \mathrm H(A) + \mathrm H(B) - \mathrm H(A,B)
\end{aligned}$

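In other words, it is the degree to which dependency between $A$ and $B$ reduces their joint entropy relative to the maximal entropy, $\mathrm H(A) + \mathrm H(B)$.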

For three random variables, $A$, $B$, and $C$, the conditional mutual information between $A$ and $B$ given $C$ is defined as

$\begin{aligned}
&\mathrm I(A;B|C) := \mathrm H(A|C)-\mathrm H(A|B,C)
\end{aligned}$

*causes* all of the variation in

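Kullback-Leibler divergence, $\mathrm D_{kl}$, measures how much one probability distribution, $p$, differs from another, $q$, defined over the same set of states. It is defined as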

$\begin{aligned}
&\mathrm D_{kl}(p||q) := \sum\limits_{a\in \mathcal{A}} p(a) \log_{2} \frac{p(a)}{q(a)}
\end{aligned}$

$\begin{aligned}
&\mathrm I (A;B) = \mathrm D_{kl}(p(a,b)||p(a)p(b))
\end{aligned}$

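In other words, the mutual information between $A$ and $B$ is the divergence of their joint distribution from the product of their marginal distributions.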
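The relationship between mutual information and Kullback-Leibler divergence can be checked numerically. The sketch below is illustrative only: the joint distribution is hypothetical, and `kl_divergence`, `marginal`, and `entropy` are our own helper names. It computes mutual information once as a divergence and once from entropies and compares the two.

```python
from math import log2


def entropy(probs):
    """Shannon entropy in bits."""
    return -sum(p * log2(p) for p in probs if p > 0)


def marginal(joint, axis):
    """Marginal distribution of one variable from a joint {(a, b): p}."""
    out = {}
    for states, p in joint.items():
        out[states[axis]] = out.get(states[axis], 0.0) + p
    return out


def kl_divergence(p, q):
    """D_kl(p || q) in bits; p and q are dicts over the same states."""
    return sum(pv * log2(pv / q[s]) for s, pv in p.items() if pv > 0)


# Hypothetical joint distribution p(a, b) over two binary variables.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

p_a, p_b = marginal(joint, 0), marginal(joint, 1)
# Distribution the pair would follow if A and B were independent.
independent = {(a, b): p_a[a] * p_b[b] for a in p_a for b in p_b}

mi_from_kl = kl_divergence(joint, independent)
mi_from_entropies = (
    entropy(p_a.values()) + entropy(p_b.values()) - entropy(joint.values())
)
print(f"{mi_from_kl:.4f} {mi_from_entropies:.4f}")
```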

We note here a series of useful equivalencies. Throughout the rest of this pub, we will use $G$ to denote genes and $T$ to denote traits (phenotypes).

$\begin{aligned}
&\mathrm I(G;T) = \mathrm H(G)+\mathrm H(T)- \mathrm H(G,T)\\
&\mathrm I(G;T) = \mathrm I(T;G)\\
&\mathrm I(G;T) = \mathrm H(G)-\mathrm H(G|T)\\
&\mathrm I(G;T) = \mathrm H(T)-\mathrm H(T|G)
\end{aligned}$
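These equivalencies can be verified on a toy example. In the sketch below, the joint distribution of genotype and trait is hypothetical, and the conditional entropies are computed directly from their definition rather than from the identities being checked. All helper names are our own.

```python
from math import log2


def entropy(probs):
    """Shannon entropy in bits."""
    return -sum(p * log2(p) for p in probs if p > 0)


def marginal(joint, axis):
    """Marginal distribution of one variable from a joint {(g, t): p}."""
    out = {}
    for states, p in joint.items():
        out[states[axis]] = out.get(states[axis], 0.0) + p
    return out


def conditional_entropy(joint, given):
    """Entropy of the other variable given the variable at index `given`,
    computed from the definition: -sum p(x, y) log2 p(x | y)."""
    p_given = marginal(joint, given)
    return -sum(
        p * log2(p / p_given[states[given]])
        for states, p in joint.items()
        if p > 0
    )


# Hypothetical joint distribution p(g, t): genotype g, binary trait t.
joint = {
    ("AA", 1): 0.20, ("AA", 0): 0.05,
    ("Aa", 1): 0.25, ("Aa", 0): 0.25,
    ("aa", 1): 0.05, ("aa", 0): 0.20,
}

h_g = entropy(marginal(joint, 0).values())
h_t = entropy(marginal(joint, 1).values())
h_gt = entropy(joint.values())

mi_joint = h_g + h_t - h_gt                                # H(G) + H(T) - H(G,T)
mi_given_t = h_g - conditional_entropy(joint, given=1)     # H(G) - H(G|T)
mi_given_g = h_t - conditional_entropy(joint, given=0)     # H(T) - H(T|G)
print(f"{mi_joint:.4f} {mi_given_t:.4f} {mi_given_g:.4f}")
```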

Thus far we have mostly discussed individual random variables (e.g., single genes or phenotypes), but we can extend entropy, mutual information, and Kullback-Leibler divergence to cover the joint distribution of many variables, like a set of genetic loci or phenotypes. This results from the chain rule for probability and is most readily seen for entropy, where we have already defined joint and conditional entropy.

The joint entropy of $A$ and $B$ is the entropy of $A$ plus the conditional entropy of $B$ given $A$:

$\begin{aligned}
&\mathrm H(A,B)= \mathrm H(A)+\mathrm H(B|A)
\end{aligned}$

Or, for the joint entropy of three or more variables:

$\begin{aligned}
\mathrm H(A,B,C) &= \mathrm H(A)+\mathrm H(B|A)+\mathrm H(C|B,A)\\
&\;\;\vdots\\
\mathrm H(A_1,A_2,\dots,A_n) &= \sum_{i=1}^n \mathrm H(A_i|A_{i-1},\dots,A_1)
\end{aligned}$
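The chain rule for entropy can be checked numerically for three variables. The sketch below uses a hypothetical joint distribution over three binary variables and computes each conditional entropy from its definition before summing. The helper names are our own.

```python
from itertools import product
from math import log2


def marginal(joint, axes):
    """Marginal distribution over the variables at the given indices."""
    out = {}
    for states, p in joint.items():
        key = tuple(states[i] for i in axes)
        out[key] = out.get(key, 0.0) + p
    return out


def entropy(dist):
    """Shannon entropy in bits of a distribution stored as {state: p}."""
    return -sum(p * log2(p) for p in dist.values() if p > 0)


def conditional_entropy(joint, target, given):
    """H(target | given) from the definition: -sum p(x) log2 p(target | given)."""
    p_both = marginal(joint, given + target)
    p_given = marginal(joint, given)
    total = 0.0
    for states, p in p_both.items():
        if p > 0:
            total -= p * log2(p / p_given[states[: len(given)]])
    return total


# Hypothetical joint distribution over three binary variables (A, B, C).
weights = [4, 1, 1, 2, 1, 2, 2, 3]
joint = {s: w / sum(weights) for s, w in zip(product((0, 1), repeat=3), weights)}

h_joint = entropy(joint)
# H(A) + H(B|A) + H(C|A,B); conditioning on nothing gives the plain entropy.
chain_sum = sum(
    conditional_entropy(joint, (i,), tuple(range(i))) for i in range(3)
)
print(f"{h_joint:.4f} {chain_sum:.4f}")
```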

In other words, the joint entropy of a set of variables is the sum of their conditional entropies. For $A$, $B$, and $C$,

$\begin{aligned}
\mathrm H(A,B,C) = \mathrm H(A) + \mathrm H(B) +\mathrm H(C)
\end{aligned}$

if $A$, $B$, and $C$ are mutually independent.

We can apply a similar chain rule for mutual information, letting us extend to multiple random variables. We will not expand on this here, but, essentially, the variable expansion done previously to define conditional mutual information can be repeatedly applied to show that

$\begin{aligned}
&\mathrm I(A_1,A_2,\dots,A_n;B) = \sum_{i=1}^n \mathrm I(A_i;B|A_{i-1},\dots,A_1)
\end{aligned}$

Essentially, the mutual information between a set of variables and another variable is the sum of the conditional mutual information values.
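For two variables on the left-hand side, the chain rule reads $\mathrm I(A_1,A_2;B) = \mathrm I(A_1;B) + \mathrm I(A_2;B|A_1)$. The sketch below illustrates this on a hypothetical joint distribution, writing each (conditional) mutual information in terms of entropies of marginal distributions. The helper names are our own, and the example is an illustration of the identity rather than an independent proof.

```python
from itertools import product
from math import log2


def marginal(joint, axes):
    """Marginal distribution over the variables at the given indices."""
    out = {}
    for states, p in joint.items():
        key = tuple(states[i] for i in axes)
        out[key] = out.get(key, 0.0) + p
    return out


def entropy(dist):
    """Shannon entropy in bits of a distribution stored as {state: p}."""
    return -sum(p * log2(p) for p in dist.values() if p > 0)


def cond_mutual_info(joint, x, y, z=()):
    """I(X;Y|Z) = H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z); z=() gives I(X;Y)."""
    def h(axes):
        return entropy(marginal(joint, axes))
    return h(x + z) + h(y + z) - h(x + y + z) - h(z)


# Hypothetical joint distribution over (A1, A2, B), all binary.
weights = [4, 1, 1, 2, 1, 2, 2, 3]
joint = {s: w / sum(weights) for s, w in zip(product((0, 1), repeat=3), weights)}

lhs = cond_mutual_info(joint, (0, 1), (2,))             # I(A1, A2; B)
rhs = (
    cond_mutual_info(joint, (0,), (2,))                 # I(A1; B)
    + cond_mutual_info(joint, (1,), (2,), (0,))         # I(A2; B | A1)
)
print(f"{lhs:.4f} {rhs:.4f}")
```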

Given the ability to extend these measures to an arbitrary number of variables, we will indicate sets of variables with a sub bar. For example, we will denote sets of genes, phenotypes (or traits), and environments as $\underline{G}$, $\underline{T}$, and $\underline{E}$, respectively.

Having established some of the fundamental measures in information theory and examples of their application, we now expand on these definitions and apply them to broader genetic questions. Where appropriate, we compare the information theory-based assessments with classical statistical genetic measures.

Genetic analysis has most often focused on individual phenotypes, e.g., "How tall are the members of a population?", or, "Do cells pause at a particular stage of the cell cycle?" But considering multiple phenotypes simultaneously may provide more insight into overall organismal features than focusing on any one phenotype. For example, an organism’s height is likely linked to other organismal features (e.g., mass and metabolic rate) both causally and otherwise, so studying both height and metabolic rate together may enable more accurate predictions than studying height alone. However, the quantitative genetic infrastructure for simultaneous analysis of multiple phenotypes is poorly developed.

In a companion pub [20], we argue that examining multiple phenotypes simultaneously can provide better insight into the nature of individual phenotypes. Across a population, phenotypes are often correlated. That correlation could result from shared, causal, genetic variation, or from non-causal correlation like genetic drift or migration. We’ve shown that incorporating the correlational relationships between phenotypes into predictive models can increase prediction accuracy. We further showed empirically that increasing pleiotropy among a fixed set of genes (

$\begin{aligned}
\mathrm H(\underline{T}) \le \sum_{i=1}^n \mathrm H(T_i)
\end{aligned}$

with equality if, and only if, all phenotypes are independent of one another. Thus, the difference between the maximal phenotypic entropy and the total joint phenotypic entropy is the reduction in uncertainty caused by correlations (additive or otherwise) across phenotypes. In other words, we can quantify the amount of phenotype-phenotype structure by estimating the difference between the joint entropy and the maximal entropy. Importantly, this quantification provides examination of the relatedness (or lack thereof) among phenotypes without genetic or environmental information. Phenotypes with maximal entropy share no common cause or non-causal drivers of correlation. Thus, absent environmental variation or phenotypic correlations that are created by population structure, pairs of phenotypes with less than maximum entropy share a cause and those causes are, to some degree, epistatic.

Given dependence among phenotypes, examining one phenotype should provide information about other phenotypes. In other words, conditioning the entropy of one set of phenotypes, $\underline{T}$, on another phenotype, $T_i$, cannot increase that entropy:

$\begin{aligned}
\mathrm H(\underline{T}|T_{i}) \le \mathrm H(\underline{T})
\end{aligned}$

This follows because mutual information is non-negative:

$\begin{aligned}
&\mathrm I(\underline{T}; T_{i}) \ge 0\\
&\mathrm H(\underline{T})- \mathrm H(\underline{T}|T_{i}) \ge 0\\
&\mathrm H(\underline{T}) \ge \mathrm H(\underline{T}|T_{i})\\
&\mathrm H(\underline{T}|T_{i}) \le \mathrm H(\underline{T})
\end{aligned}$

This shows that, given some correlated structure among traits, examining many phenotypes will be useful in predicting any one phenotype; something we have empirically demonstrated in our companion pub [20]. Furthermore, in the same pub we show that examining increasing numbers of phenotypes doesn’t reduce the amount of information about any one phenotype. However, we often estimate information theoretic values using numerical methods and, as a result, there is a limit to the number of phenotypes it is practical to examine.

Pleiotropy is the observation that allelic state at any one genomic location impacts multiple phenotypes. Intuitively, for any fixed set of phenotypes and genes impacting those phenotypes, increasing pleiotropy will increase co-variation among phenotypes and thus decrease the total trait entropy. For traits $T_1$ and $T_2$ and gene $G$, we define pleiotropy as

$\begin{aligned}
Pleio(T_1,T_2,G) = \mathrm I(T_1;T_2)-\mathrm I(T_1;T_2|G)
\end{aligned}$

This is the amount of information shared between $T_1$ and $T_2$ that is accounted for by $G$.

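If pleiotropy is positive then, because conditional mutual information is non-negative ($\mathrm I(T_1;T_2|G) \ge 0$), the joint entropy of the traits is lower than their maximal entropy: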

$\begin{aligned}
&Pleio(T_1,T_2,G) >0\\
&\mathrm I(T_1;T_2)-\mathrm I(T_1;T_2|G)>0\\
&\mathrm I(T_1;T_2)>0\\
&\mathrm H(T_1)+\mathrm H(T_2)-\mathrm H(T_1,T_2)>0\\
&\mathrm H(T_1)+\mathrm H(T_2)>\mathrm H(T_1,T_2)
\end{aligned}$

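where $\mathrm H(T_1)+\mathrm H(T_2)$ is the maximal entropy of the two traits and $\mathrm H(T_1,T_2)$ is their joint entropy.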
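The pleiotropy measure defined above can be evaluated on two toy scenarios. Both joint distributions below are hypothetical and deliberately extreme: in the first, one gene fully determines both traits; in the second, the gene determines only the first trait and the second trait is independent. The helper names, including `pleio`, are our own.

```python
from math import log2


def marginal(joint, axes):
    """Marginal distribution over the variables at the given indices."""
    out = {}
    for states, p in joint.items():
        key = tuple(states[i] for i in axes)
        out[key] = out.get(key, 0.0) + p
    return out


def entropy(dist):
    """Shannon entropy in bits of a distribution stored as {state: p}."""
    return -sum(p * log2(p) for p in dist.values() if p > 0)


def cond_mutual_info(joint, x, y, z=()):
    """I(X;Y|Z) = H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z); z=() gives I(X;Y)."""
    def h(axes):
        return entropy(marginal(joint, axes))
    return h(x + z) + h(y + z) - h(x + y + z) - h(z)


def pleio(joint):
    """Pleio(T1, T2, G) = I(T1;T2) - I(T1;T2|G) for states ordered (g, t1, t2)."""
    return cond_mutual_info(joint, (1,), (2,)) - cond_mutual_info(
        joint, (1,), (2,), (0,)
    )


# Scenario 1 (hypothetical): G determines both T1 and T2.
shared = {(0, 0, 0): 0.5, (1, 1, 1): 0.5}
# Scenario 2 (hypothetical): G determines T1 only; T2 is an independent fair coin.
separate = {(g, g, t2): 0.25 for g in (0, 1) for t2 in (0, 1)}

pleio_shared = pleio(shared)
pleio_separate = pleio(separate)
print(pleio_shared, pleio_separate)

# Joint trait entropy versus maximal trait entropy in scenario 1.
h_joint_traits = entropy(marginal(shared, (1, 2)))
h_max_traits = entropy(marginal(shared, (1,))) + entropy(marginal(shared, (2,)))
```

In the first scenario the joint trait entropy falls below the sum of the individual trait entropies, as the derivation above requires when the measure is positive.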

In this section, we’ve shown several ways in which we can apply information theory to the analysis of multiple phenotypes. First, we showed that the deviation between the maximal phenotypic entropy and the joint phenotypic entropy quantifies the relational structure of a set of phenotypes, which may result from shared causes. Importantly, we can use this to show that some phenotypes are unrelated to others, a situation that would only result if there were no shared causation among those phenotypes. Second, we showed that increasing the number of phenotypes in an analysis should increase our understanding of other phenotypes. Finally, we provided a mathematical definition of pleiotropy and showed that increasing pleiotropy should, in some circumstances, decrease overall phenotypic entropy. We also demonstrate these findings empirically in a companion pub [20], but the formalisms here provide certain guarantees about such analyses.

We provide formalisms for the analysis of cohorts of phenotypes ("polyphenotypes") using information theory.

Analysis of individual phenotypes will benefit from examining a polyphenotype.

Polyphenotypic analysis does not require genetic or other causal information.

We can identify sets of phenotypes that are causally independent.

We’ve presented a few examples of information theory applied to genetic questions. We view this as a work in progress and will, along with empirical and numerical studies in other pubs, expand these ideas into other areas of genetics and genetic analysis as our work progresses.

