DNA sequencing methods are rapidly advancing and producing mountains of high-quality protein sequence data. However, protein sequences are frequently only as useful as their functional annotations, and predicting a protein’s function remains a bottleneck in the meaningful analysis of protein sequence data [1].
Cellular functions can define protein identity.
We are developing a framework to computationally predict and validate annotations based on protein functions. These annotations should help scientists uncover novelty and generate hypotheses.
Most annotations are based on information gathered in a small selection of organisms. In fact, 85% of Gene Ontology annotations are based on information from just ten species, including humans and other typical model organisms [1]. Even in some of the most well-studied organisms, annotation remains difficult. For example, in E. coli, a third of the genome remains un-annotated [2] and in both fission and budding yeast, roughly 20% of the genome remains un-annotated [3].
Researchers are now developing computational approaches that take into account protein sequence, structure, and even interactions with other proteins or co-expression patterns to annotate new protein sequences [4]. Although they improve on past methods that rely solely on sequence similarity, these newer methods often still return false positives, missed annotations, and uninformative annotations such as “hypothetical protein” or “protein of unknown function.” In practice, this leads scientists to 1) waste time and resources investigating proteins that look similar to a known protein of interest but lack functional conservation, 2) miss interesting proteins because they weren’t properly annotated, or 3) invest considerable upfront effort characterizing protein functions before they can move into the discovery-based phase of a project.
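To make the "uninformative annotation" problem concrete, here is a minimal Python sketch that separates informative from uninformative descriptions in an annotation table. It is an illustration only, not part of our pipeline: the protein IDs and descriptions are invented, the function names are hypothetical, and the phrase list is an assumption (the first two phrases come from the text above, and "uncharacterized protein" is added as another common placeholder).

```python
# Phrases treated as uninformative. The first two are quoted in the text;
# "uncharacterized protein" is an added assumption.
UNINFORMATIVE_PHRASES = (
    "hypothetical protein",
    "protein of unknown function",
    "uncharacterized protein",
)


def is_uninformative(description):
    """Return True if a description is empty or contains a placeholder phrase."""
    text = description.strip().lower()
    return text == "" or any(phrase in text for phrase in UNINFORMATIVE_PHRASES)


def split_annotations(annotations):
    """Split {protein_id: description} into (informative, uninformative) dicts."""
    informative, uninformative = {}, {}
    for protein_id, description in annotations.items():
        target = uninformative if is_uninformative(description) else informative
        target[protein_id] = description
    return informative, uninformative


# Invented toy data; the IDs and descriptions are hypothetical.
toy_annotations = {
    "prot_001": "Actin, cytoplasmic",
    "prot_002": "hypothetical protein",
    "prot_003": "Protein of unknown function",
    "prot_004": "",
}

informative, uninformative = split_annotations(toy_annotations)
print("informative:", sorted(informative))
print("needs annotation:", sorted(uninformative))
```

A simple substring match like this only finds the gaps. It says nothing about false positives, which need experimental or structural evidence to detect.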
At Arcadia, we want to foster an organism-agnostic way of doing science, and this is limited by the current annotations. To overcome these obstacles, we are building a customizable protein annotation framework that combines experimental data with computational predictions to provide higher-quality annotation of new, un-annotated, or under-annotated genomes. Our approach will employ automated annotation of proteins, annotation transfer from species with high information content, and functional annotation of specific protein families. Our hypothesis is that using this type of functional annotation framework can identify interesting proteins with unusual characteristics that may have novel functions and interactions.
Our overall goal is to create a customizable workflow that can annotate protein families in diverse organisms in a deep and functionally meaningful way to empower scientists to answer their research questions in the best way possible using the right organisms and the right proteins.
Computational prediction
We aim to use pre-existing tools like eggNOG-mapper [5][6] and OrthoFinder [7] for general annotation, to incorporate platforms being developed within Arcadia, and to develop new ways to make all of these software packages more accessible and interpretable to biologists. We also aim to bring something new to the field by focusing on specific protein families: annotating proteins across organisms with predictive models built from existing knowledge, and exploring feature-agnostic prediction using machine learning approaches.
Biochemical validation
We will validate these predictions by purifying a subset of proteins from different organisms and testing each for the predicted function of interest. For example, if our prediction is that an enzyme binds a ligand, we might use binding assays to measure that interaction.
In vivo validation
Finally, we will investigate how these individual protein functions lead to cellular changes in vivo.
Hypothesis generation and testing
We believe that this new annotation framework will help us identify interesting proteins with unusual characteristics that may have novel functions and interactions. It could also help us choose the best organisms for answering biological questions in an efficient and informed way. This will help us generate new hypotheses about how proteins function, how proteins are regulated in the cell, and how proteins work together, and it will help us determine strategies for interrogating those hypotheses.
As an initial use case of this strategy, we wanted to use a protein family that meets the following criteria:
It’s responsible for a wide range of cellular functions and likely to have functions and interactions that we don’t fully understand.
It’s studied enough that we know or can predict how it performs its most basic functions, like binding to other proteins.
It’s broad and diverse, leaving plenty of room for discovery.
We settled on actin because it met each of our criteria.
Actin is important for a number of essential cellular functions, including maintaining cell shape, cell motility, cell division, intracellular trafficking, signaling, organellar regulation, membrane remodeling, and many others [8]. It’s also present throughout the tree of life — even archaea and bacteria have proteins similar to actin [9][10].
Because actin is so important and so widely expressed, it is well-studied. We know that actin monomers must associate and form long filaments to carry out many of actin’s broader functions in the cell, and we know that actin functions as an ATPase [8]. We also know which residues are critical for these functions from experimentally determined structures and biochemical studies [11].
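One way to use known critical residues is to check whether a candidate protein retains them once it is aligned to a well-characterized reference. The Python sketch below illustrates that idea under stated assumptions: the two sequences are already aligned (equal length, "-" for gaps), positions are 1-based in the ungapped reference, and a residue counts as conserved only if it is identical. The sequences and positions are invented toy values, not real actin residues, and the function names are hypothetical rather than taken from our pipeline.

```python
def column_for_position(aligned_ref, position):
    """Map a 1-based ungapped reference position to a 0-based alignment column."""
    seen = 0
    for column, residue in enumerate(aligned_ref):
        if residue != "-":
            seen += 1
            if seen == position:
                return column
    raise ValueError(f"reference has no position {position}")


def check_key_residues(aligned_ref, aligned_query, positions):
    """Return {position: (ref_residue, query_residue, is_identical)}."""
    if len(aligned_ref) != len(aligned_query):
        raise ValueError("aligned sequences must have the same length")
    report = {}
    for position in positions:
        column = column_for_position(aligned_ref, position)
        ref_residue = aligned_ref[column]
        query_residue = aligned_query[column]
        report[position] = (ref_residue, query_residue, ref_residue == query_residue)
    return report


def fraction_conserved(report):
    """Fraction of checked positions where the query matches the reference."""
    return sum(1 for _, _, same in report.values() if same) / len(report)


# Invented toy alignment and positions (assumptions, not real actin data).
aligned_ref = "MK-DGEA"
aligned_query = "MKTDAE-"
key_positions = [2, 3, 4]

report = check_key_residues(aligned_ref, aligned_query, key_positions)
for position, (ref_residue, query_residue, same) in report.items():
    status = "conserved" if same else "changed"
    print(f"position {position}: {ref_residue} -> {query_residue} ({status})")
print(f"fraction conserved: {fraction_conserved(report):.2f}")
```

Requiring exact identity is the strictest possible criterion. A more realistic check might accept chemically similar substitutions or weigh structural context, which is part of what this project aims to work out.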
Most of what we know about actin comes from a fairly small selection of cells — mostly mammalian cells with actins that are highly conserved, leaving a large gap in our understanding of the actin cytoskeleton, how it’s regulated, how it functions, and how it evolved.
Thus, using actin as a test case can tell us about a long list of cellular functions that we already know about, but could also help us answer fundamental biological questions about what else actin might be doing in cells. Beyond that, this work will help us define what really must be conserved for proteins to share functions (sequences, domains, structure, etc.) and help us create a framework that we and others can use for higher-throughput identification of proteins based on functional characteristics.
Now that we have created an initial pipeline for understanding what makes an actin an actin (step 1 in our general strategy), our next steps are to validate the results of the pipeline in vivo and biochemically, and then to refine the pipeline based on those results. The actin pipeline in its current form is available on GitHub and Binder if you want to run your own actin sequences!
Additionally, we want to use this general framework to create modules to investigate protein families beyond actin. For example, it might be useful to look at specific receptors and their ability to bind ligands, or it might be interesting to see how DNA-modifying enzymes interact with DNA. We will decide exactly where to look by considering where we can add value to other projects at Arcadia. We’ll use the results of this expanded effort to inform and further refine our pipeline design.