Arcadia aims to identify biology’s greatest innovations and decode the principles that generate them. To do this, we need to embrace exploration across the tree of life. However, for this exploration to happen at all, we need to identify phenomena that are broadly shared by living things. Luckily, biology is built upon a single unifying feature, one that we are now able to compare and dissect at unprecedented scale: genomes.
Across biology, genomes perform a set of universal tasks: transmitting information between generations, generating phenotypes, and orchestrating functions over a lifetime. This is as true for viruses — which employ admirably minimal genetic toolkits — as it is for the rare flower Paris japonica, whose 149 billion base pairs outstrip the human genome 50 times over.
Our goal is to use this universality to accelerate exploration at the scale of the tree of life: from comparisons of kingdoms that diverged billions of years ago to rapidly evolving populations that we generated last week. Through these efforts, we are developing a variety of tools to help us identify hotspots of biological novelty, make strong hypotheses about their causal basis, and ultimately predict traits in the previously inaccessible reaches of the tree of life.
Identify transformative biological innovations, agnostically.
Why: We’ll never find the next transformative biological insight by looking under the lamppost of what we already know. Much of the tree of life awaits exploration, and it is through the holistic lens of phylogenetics that we can begin to do so efficiently, scalably, and agnostically.
How: We are leveraging publicly available genomic data to generate what we are calling “phylogenetic discovery platforms.” These platforms combine phylogenetic trees with a variety of other data (e.g., protein structure, environment, organismal phenotypes) to map key evolutionary innovations. We are generating platforms for a variety of taxa in order to generate novel hypotheses, identify undiscovered phenomena, and select organisms for study at Arcadia.
Rapidly dissect phenotypes across the tree of life.
Why: After identifying a strong hypothesis and the right organism(s) to study it, we want to quickly get into the lab and start identifying key elements that we can usefully employ. However, most of the organisms we work on at Arcadia are uncharacterized.
How: We’re developing a toolkit of sequencing, phenotyping, and analysis approaches that will allow us to connect genetic variation (natural or induced) with measurable traits. With this toolkit, we aim to quickly generate genotype-phenotype maps and infer molecular hypotheses in uncharacterized organisms.
Move from descriptions to predictions.
Why: Arcadia is searching for deep principles underlying biology. We don’t want to just describe phenomena; we want to predict the when, where, and how of biology. It is this shift, from description to prediction, that will let us truly engineer serendipity.
How: We are using a “phenotype-forward” approach to generate models that learn, and then genetically map, the structure of biological systems. We want to get to the point where we can figure out how a majority of organismal processes are encoded in the genome. To do so, we first need to reliably (and scalably) map the structure of phenotypic space. To this end, we are currently exploring a number of methods for high-dimensional phenotyping and their utility for mapping phenotypic space in a way that is quantitative, high-throughput, and broadly applicable to diverse organisms.
We are committed to developing tools that the broadest possible range of scientists can employ rapidly and cost-effectively. Given the sheer breadth of life’s diversity, coupled with the size of the data sets produced by modern imaging and sequencing, no single organization will be able to tackle the depths of a given biological question alone. To this end, we are constantly re-analyzing and integrating our results to identify the minimum amount of data needed to conduct these experiments and to optimize for cost-effective solutions that can be applied in resource-limited contexts. Through these efforts, we hope to reduce wasted effort and empower community use, independent of funding level or institutional affiliation.
To decode the tree of life, we need to be able to understand the products of genomes — phenotypes — in their full richness across diverse organisms. We therefore started by focusing on creating computational frameworks and identifying experimental technologies that will allow us to measure, dissect, and compare complex phenotypes across the tree of life. These efforts will ultimately allow us to use these high-dimensional phenotypes to decode patterns of genetic variation.
One of our central goals is to scan the tree of life for biological innovation. Critical to this effort is the inference of evolutionary relationships, whether among species or among their genes, via phylogenetic methods. To work at the scale we are aiming for, our methods must be highly efficient, scalable, and broadly applicable. That is, they must empower us to infer the evolutionary histories of all gene families, not just those of a handful of genes of special interest.
With this in mind, we developed our first “phylogenetic discovery platform” — NovelTree, a Nextflow workflow that carries out all essential steps of standard phylogenomic analysis. This method infers not only the relationships among species, but also the relationships among all of their gene copies, for any number of gene families. NovelTree goes a step further still, inferring the history of gene duplication, transfer, and loss for each gene family and species.
To demonstrate NovelTree’s utility, we applied the method to a dataset composed of 36 species belonging to the four eukaryotic supergroups that comprise the TSAR clade: Telonemia, Stramenopila, Alveolata, and Rhizaria. We highlight key outputs, point to potential future research directions, and provide resources to facilitate summarization, visualization, and downstream analysis of the evolutionary datasets produced by NovelTree.
These efforts have resulted in our first “phylogenomic library” — a vast evolutionary data set that we can mine to pinpoint when and where evolutionary innovations have arisen in this group of organisms. This resource is only the first of many. We anticipate that its utility will only increase as we generate additional libraries across branches of the tree of life, allowing us to integrate these explorations and expand our mapping of the biological universe.
Learn more about this tool and how to use it here:
Inspired by our ongoing work on improving and extending our phylogenomic capabilities with NovelTree, we released a short pub to spur engagement with the broader community about future directions in phylogenomic inference and method development.
Read it and weigh in!
Unicellular organisms make up a substantial portion of the tree of life. To survive, many of these organisms are obligated to move around in the world via swimming, crawling, gliding, and even walking. These diverse, and sometimes ingenious, motility types are supported by complex and multifaceted biological processes. Given this, we want to see if we can leverage the diversity of unicellular motility to gain insight into its molecular and cellular underpinnings.
A key first step here is deciding how to quantitatively represent different modes of cellular movement. While swimming, gliding, and walking all share features with other types of organismal movement (e.g., the swimming of protists and models such as zebrafish may be treated similarly), crawling can be difficult to model since it involves active changes to the cell’s shape.
To address this, we developed a computational framework for processing, representing, and comparing images of cells crawling. We found that we could capture the movement dynamics of diverse cell types in a single ‘movement space.’ Using this space, we were able to discover that crawling varies broadly across multiple dimensions. Furthermore, we developed a simple statistic that can measure the relationship between variation in cell shape and the types of movement that are generated. This work lays the foundation for identifying the mechanistic bases of cellular movement across large evolutionary distances.
Learn more about this tool and how to use it here:
The vast majority of organisms we are interested in studying lack genetic and molecular toolkits. Most do not have genome sequences. For some, we aren’t even sure if they are unique species or not. We are interested in identifying tools that allow us to rapidly, and comparably, measure and monitor biological processes across taxonomic groups without needing to develop species-specific tools. Label-free imaging methods using vibrational spectroscopy, such as Raman imaging, offer promising alternatives for addressing a number of basic problems in biology. These methods can be non-destructive and do not require dyes, labels, or a priori knowledge. By detecting the presence of a diversity of chemical bonds, they can provide rich molecular fingerprints that can reflect metabolic or physiological state, cell type, or even species of origin.
Given this last point, we were interested in exploring which aspects of these signals, if any, correlate with phylogenetic relationships between species. We hypothesized that, if we were able to find such associations, then it might be possible to leverage Raman spectra for a variety of uses in our comparative work. To test this idea, we analyzed a publicly available dataset of Raman spectra measured from a variety of clinically isolated bacteria and fungi. As hypothesized, we found that specific portions of the spectra correlated with species relationships. Furthermore, these regions overlapped with variation in the abundance of nucleic acids across the species’ genomes, suggesting that this technology provides a potentially interesting way to measure the relationship between genetic and molecular components of a biological sample.
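The kind of comparison described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the analysis from the pub: the four species, their spectra, the phylogenetic distances, and the helper names (`pearson_r`, `bin_correlation`) are all hypothetical. For each spectral bin, it asks whether pairs of species that are further apart on the tree also differ more in intensity.

```python
import math
from itertools import combinations

# Hypothetical pairwise phylogenetic distances among four species (A-D).
phylo_dist = {
    ("A", "B"): 0.1, ("A", "C"): 0.6, ("A", "D"): 0.7,
    ("B", "C"): 0.6, ("B", "D"): 0.7, ("C", "D"): 0.2,
}

# Hypothetical mean Raman intensities per species at three spectral bins.
spectra = {
    "A": [1.0, 0.50, 0.20],
    "B": [1.1, 0.10, 0.25],
    "C": [1.6, 0.45, 0.80],
    "D": [1.7, 0.15, 0.90],
}


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    var_x = sum((xi - mx) ** 2 for xi in x)
    var_y = sum((yi - my) ** 2 for yi in y)
    return cov / math.sqrt(var_x * var_y)


# All unordered species pairs, in the same order as the keys of phylo_dist.
pairs = list(combinations(sorted(spectra), 2))
phylo = [phylo_dist[pair] for pair in pairs]


def bin_correlation(bin_index):
    """Correlate per-pair intensity differences in one bin with phylogenetic distance."""
    spectral = [abs(spectra[a][bin_index] - spectra[b][bin_index]) for a, b in pairs]
    return pearson_r(spectral, phylo)


correlations = [bin_correlation(i) for i in range(3)]
for i, r in enumerate(correlations):
    print(f"bin {i}: r = {r:.2f}")
```

In this made-up data, bins 0 and 2 track the tree closely while bin 1 does not. Because pairwise distances are not independent of one another, a real analysis would need a permutation-based procedure (such as a Mantel test) rather than an ordinary significance test on the correlation.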
Learn more about this effort:
Genetic variation is the raw material of evolution. Analyses of variation within naturally interbreeding populations therefore make it possible to identify genetic sources generating phenotypic diversity. However, populations of organisms can vary in many ways at the same time (e.g., in their shapes, diets, reproductive strategies, and so on). To understand the diverse contributions of variants across the genome, it makes sense that we first need to capture the myriad phenotypes they affect.
With this in mind, we have begun deeply characterizing the phenotypes of two species of unicellular algae: Chlamydomonas reinhardtii and Chlamydomonas smithii. These species can interbreed, and we plan to mate them to generate clonal libraries that we can use to correlate genetic variation with high-dimensional phenotypes. C. reinhardtii is a well-known cell biological model, while the biology of C. smithii is relatively uncharacterized. To empower precision genetic mapping in crosses of these two species, we are beginning to chart their differences across a variety of phenotypes. Our initial efforts have identified differences related to morphology, growth, and physiology. We will be adding to this pub as we measure and compare more phenotypes.
Read more on this work:
One of our major efforts is the development of a “phenotype-forward” approach to genetic analysis. Biologists have long appreciated that organismal phenotypes can be interrelated, being the result of complex, nonlinear biological networks. Though that is the case, many common genetic approaches treat phenotypes as singular and independent. We are interested in reconciling these views by building phenotypic complexity into our genetic toolkit.
As a first step, we were interested in characterizing the nature of phenotypic relationships across biology. Analyzing data from yeast, fruit flies, mice, nematodes, and Arabidopsis, we found that ~40–80% of phenotypes display correlated, yet nonlinear relationships with each other. Using simulated phenotypic data, we discovered that these patterns may be influenced by the presence of gene-gene (i.e., epistasis) and phenotype-phenotype interactions (i.e., pleiotropy). We can usefully describe the effects of these interactions by measuring phenotypic entropy. Finally, we found that we could predict individual phenotypes with great accuracy using an autoencoder model that accounts for phenotypic nonlinearities. These efforts lay the groundwork for developing a new generation of quantitative approaches that leverage the interrelatedness of phenotypes, potentially accelerating the downstream discovery of causal genetics.
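One way to see why nonlinear relationships matter is to compare a linear statistic with an information-theoretic one on the same data. The sketch below is illustrative only and is not the analysis from the pub: the two toy phenotypes, the equal-width binning, and the function names (`shannon_entropy`, `mutual_information`, `pearson_r`, `discretize`) are all assumptions made for the example.

```python
import math
from collections import Counter


def shannon_entropy(labels):
    """Shannon entropy (in bits) of a sequence of discrete labels."""
    n = len(labels)
    counts = Counter(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def mutual_information(a, b):
    """I(A;B) = H(A) + H(B) - H(A,B), in bits, for discrete labels."""
    return shannon_entropy(a) + shannon_entropy(b) - shannon_entropy(list(zip(a, b)))


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    var_x = sum((xi - mx) ** 2 for xi in x)
    var_y = sum((yi - my) ** 2 for yi in y)
    return cov / math.sqrt(var_x * var_y)


def discretize(values, n_bins):
    """Assign each value to one of n_bins equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


# Hypothetical phenotypes: trait_b depends on trait_a through a symmetric,
# nonlinear (quadratic) relationship.
trait_a = [i / 50 - 1 for i in range(101)]  # evenly spaced on [-1, 1]
trait_b = [a ** 2 for a in trait_a]

r = pearson_r(trait_a, trait_b)
mi = mutual_information(discretize(trait_a, 8), discretize(trait_b, 8))
print(f"Pearson r = {r:.3f}, mutual information = {mi:.3f} bits")
```

Because the relationship is symmetric around zero, the Pearson correlation is essentially zero even though one trait fully determines the other, while the mutual information is clearly positive. The estimate depends on the number of bins, so the binning choice here should be read as a placeholder.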
Read more on this work:
For nearly 100 years, genetic analysis of complex traits such as height and disease risk has relied on the assumption that drivers of individual-to-individual trait variation are additive and independent. In real data, these assumptions are often violated. Furthermore, research into the molecular processes of biology has repeatedly demonstrated complex, interconnected systems, suggesting that genetic variation at one location in the genome may interact in nonlinear ways with variation at other locations. In an attempt to capture these relationships and avoid the assumptions of additivity and independence, we sought to apply information theory (a framework that makes no assumptions about the relationships between drivers of variation) to genetic analysis.
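To make the contrast with additive models concrete, the sketch below builds a fully hypothetical two-locus example in which the phenotype is the exclusive-or of two biallelic loci, an extreme case of interaction between loci. The population, the phenotype rule, and the function names (`entropy_bits`, `mutual_information_bits`) are assumptions chosen for illustration and do not come from the pub.

```python
import math
from collections import Counter
from itertools import product


def entropy_bits(labels):
    """Shannon entropy (in bits) of a sequence of hashable labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def mutual_information_bits(x, y):
    """I(X;Y) = H(X) + H(Y) - H(X,Y), in bits."""
    return entropy_bits(x) + entropy_bits(y) - entropy_bits(list(zip(x, y)))


# Hypothetical balanced population: each of the four two-locus haploid
# genotypes (0,0), (0,1), (1,0), (1,1) appears 25 times.
genotypes = list(product([0, 1], repeat=2)) * 25
locus_1 = [g[0] for g in genotypes]
locus_2 = [g[1] for g in genotypes]

# Hypothetical purely interactive trait: 1 only when the two alleles differ.
phenotype = [a ^ b for a, b in genotypes]

mi_locus_1 = mutual_information_bits(locus_1, phenotype)
mi_locus_2 = mutual_information_bits(locus_2, phenotype)
mi_joint = mutual_information_bits(genotypes, phenotype)

print(f"locus 1 alone: {mi_locus_1:.3f} bits")
print(f"locus 2 alone: {mi_locus_2:.3f} bits")
print(f"both loci:     {mi_joint:.3f} bits")
```

In this constructed case, each locus on its own shares zero bits with the phenotype, so a one-locus-at-a-time scan would find nothing, while the pair of loci shares a full bit and determines the trait completely. Real data are noisier and rarely this clean, so the example only shows why a measure that does not assume additivity can detect what single-locus tests miss.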
In this initial work, we provide an outline of the history that has led to contemporary genetic analysis, argue that, in some contexts, the application of information theory to genetics will provide important insights, and provide an initial application to high-dimensional phenotypes that is complementary to the companion pub listed above. The work is intended to bring this discussion to geneticists — accordingly, we provide a primer on key components of information theory and how we might apply them to genetic questions.
This work is intended as an ongoing, regularly updated document to provide theoretical backing to our empirical applications of information theory to genetics.
Read more on this work:
We’ve developed theory and empirical evidence suggesting that simultaneous analysis of many high-dimensional phenotypes will improve our understanding of phenotypic variation and of the generative processes that create phenotypes. However, most studies focus on individual phenotypic measurements. This is partly a matter of historical bias, but the major cause is the effort required to collect phenotypic data. Therefore, we created a simple, inexpensive, and flexible imaging system that lets us rapidly collect phenotypic data.
We designed our system, the phenotype-o-mat, to collect image-based phenotypes from microtiter plates or Petri dishes. The system is capable of trans-illumination, incident illumination with up to four wavelengths selected by the experimenter, and light filtration for fluorescence. It’s built around the Blackfly S series of cameras from Teledyne, which have a wide range of available sensors in the same form factor and are all compatible with our data collection software. This gives the experimenter flexibility in a wide range of imaging properties, like frame rate, quantum efficiency, and pixel size.
We’re currently using this system to collect large high-dimensional sets of phenotypic data in populations with genetic diversity to enhance our understanding of genotype-to-phenotype relationships.
Learn more:
Now that we’ve started making progress on dissecting complex phenotypes, we’re excited to intersect these efforts with genomic data across many different biological scales.
On one hand, we’re searching for biological innovation across billions of years of evolutionary time by generating novel phylogenomic libraries. On the other, we are developing next-generation tools for predicting phenotypes from genotypes (and vice versa).
Longer-term, we are excited to apply these tools to a variety of use cases by identifying which organisms we can leverage, generating robust evolution-informed hypotheses, and, in the process, refining our abilities to ask and answer transformative biological questions.