Nature contains remarkable diversity, and the answers to many impactful biological questions are likely hidden within this diversity. By comparing the shared building blocks that make up all organisms, including the DNA, protein, and other components of cells, and identifying similarities and differences, we can make hypotheses about the answers to those questions.
Our team is particularly interested in the comparative biology of proteins because DNA sequencing methods are rapidly advancing and producing mountains of high-quality protein sequence data. However, protein sequences are frequently only as useful as their functional annotations, and predicting a protein’s function remains a bottleneck in the meaningful analysis of protein sequence data [1].
Most annotations are based on information gathered in a small selection of organisms. In fact, 85% of Gene Ontology annotations are based on information from just ten species, including humans and other typical model organisms [1]. Even in some of the most well-studied organisms, annotation remains difficult. For example, in E. coli, a third of the genome remains un-annotated [2] and in both fission and budding yeast, roughly 20% of the genome remains un-annotated [3]. This means that existing gene annotation tools could be propagating incorrect or incomplete annotations to new data sets and potentially assigning functions to new sequences based solely on large-scale screens and tenuous evolutionary relationships.
Researchers are now developing computational approaches that take into account protein sequence, structure, and even interactions with other proteins or co-expression patterns to annotate new protein sequences [4]. Although an improvement upon past methods that rely solely upon sequence similarity, these newer methods often still return false positives, missed annotations, and uninformative annotations such as “hypothetical protein” or “protein of unknown function.” Practically, this leads scientists to 1) waste time and resources investigating proteins that look similar to a known protein of interest but lack functional conservation, 2) miss out on really interesting proteins because they weren’t properly annotated, or 3) invest considerable upfront effort characterizing the functions of proteins before they can move into the discovery-based phase of a project.
To overcome these obstacles, we’re building tools that combine experimental data with computational predictions to compare proteins and provide functional predictions to inform our annotations. Our hypothesis is that using this type of comparative analysis and functional annotation framework can help us identify interesting proteins with unusual characteristics that may have novel functions and interactions.
Our overall goal is to create user-friendly, customizable workflows and tools that compare protein families in diverse organisms in a deep and functionally meaningful way. This will empower scientists to answer their research questions in the best way possible using the right organisms and the right proteins.
We can break this down into a handful of guiding principles:
Our work should be usable by researchers both at Arcadia and outside Arcadia. Whether we are creating computational workflows or biochemical studies, we are developing tools with users and their questions in mind.
We want our work to foster organism inclusivity by working equally well for any organism of interest. In other words, we want to use all available data to create a space that is as representative of the tree of life as possible.
This work should enable scientists to learn more about the most important part of proteins — how they function.
We want our tools to be hypothesis-generation machines. We want to leverage comparative biology and capture the dark areas in protein space to find the novelty that can lead to the next great ideas.
Our first dive into the functional annotation space was to test how reliable existing annotations are and to show that one way to deeply understand a protein family is to combine sequence, structural, and functional comparisons.
To do this, we chose to use actin, which is important for a number of essential cellular functions [5] and is present throughout the tree of life [6][7]. It is well-studied, and we could use this to our advantage. Finally, most of what we know about actin comes from a fairly small selection of cells — mostly mammalian cells with actins that are highly conserved — leaving a large gap in our understanding of the actin cytoskeleton, how it’s regulated, how it functions, and how it evolved.
Thus, this test case can tell us about a long list of cellular functions, but could also help us answer fundamental biological questions about what else actin might be doing in cells. Beyond that, this work helped us define what really must be conserved for proteins to share functions (sequences, domains, structure, etc.) and helped us create a framework that we and others can use for higher-throughput identification of proteins based on functional characteristics. Finally, this demonstrates that annotations even for well-studied families are not always reliable.
Learn more about this effort:
In this pub, we use protein structural comparisons to explore protein families with the goal of helping users generate new hypotheses about what features could be driving functional differences within protein families and predict which proteins might be especially interesting for further analysis.
We created ProteinCartography, a Snakemake pipeline that compares protein structures at the family level for rapid and intuitive analysis. The pipeline starts with your favorite protein(s), identifies similar proteins using sequence- and structure-based searches [8][9], compares the AlphaFold-predicted structures of every protein to every other protein [10][11], and builds a network to identify clusters [12]. It then creates interactive maps with a number of overlays that allow you to evaluate relationships between clusters, metadata information, and more, all in one interactive plot. With these outputs, you can make hypotheses about the differences between clusters, identify outlier proteins, and even predict functional information about proteins.
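To make the "compare everything, build a network, find clusters" idea concrete, here is a minimal sketch. It is not the ProteinCartography implementation: the pairwise scores are made up, the `threshold` cutoff and protein names are hypothetical, and the grouping step is plain connected components on a thresholded graph, which is a simpler stand-in for whatever clustering method the pipeline itself uses.

```python
from collections import defaultdict


def cluster_by_similarity(scores, threshold=0.5):
    """Group proteins into connected components of a thresholded similarity graph.

    scores: dict mapping (protein_a, protein_b) pairs to a similarity in [0, 1].
    threshold: hypothetical cutoff; pairs scoring at or above it are linked.
    """
    neighbors = defaultdict(set)
    proteins = set()
    for (a, b), score in scores.items():
        proteins.update((a, b))
        if score >= threshold:
            neighbors[a].add(b)
            neighbors[b].add(a)

    clusters = []
    seen = set()
    for start in sorted(proteins):
        if start in seen:
            continue
        # Depth-first walk collects everything reachable from this protein.
        stack, members = [start], set()
        while stack:
            node = stack.pop()
            if node in members:
                continue
            members.add(node)
            stack.extend(neighbors[node] - members)
        seen |= members
        clusters.append(sorted(members))
    return clusters


# Made-up pairwise structural similarity scores for five hypothetical proteins.
example_scores = {
    ("actin_1", "actin_2"): 0.92,
    ("actin_1", "arp_1"): 0.41,
    ("actin_2", "arp_1"): 0.38,
    ("arp_1", "arp_2"): 0.81,
    ("actin_1", "outlier"): 0.12,
}
print(cluster_by_similarity(example_scores))
```

In this toy input, the two actin-like entries end up together, the two Arp-like entries end up together, and the fifth protein sits alone, which is the kind of outlier the maps are meant to surface.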
This pub contains the initial version of the pipeline with core functionality, but we will continue to improve it, and we welcome any feedback, either directly on the pub or in the GitHub repository referenced therein.
To gain confidence in our tools, we’ve used several protein families as test cases. These analyses have each told us something new about how our tools work and they’ve generated a number of hypotheses that could tell us a lot about these protein families.
One of the first use cases of this tool was to investigate why some bacterial species accumulate more polyphosphate than others. This is an important area of interest for wastewater treatment, and so far it has been difficult to predict which bacteria will be better at accumulating polyphosphate. We used ProteinCartography, along with other tools, to test whether protein structural similarity among PPK1 enzymes, which catalyze polyphosphate formation, might help answer this question.
In addition to helping us learn about polyphosphate accumulation and PPKs, this work also helped us learn more about how to incorporate bacterial analyses into ProteinCartography, which was originally more eukaryote-focused. It also helped us connect phylogenetic comparisons and protein structure analysis to generate more informed predictions.
Inspired by a comment on that work, we released an “open question” pub to spur engagement with the broader community about future directions in polyphosphate accumulation.
In another use case, we combined ProteinCartography and the results of our analysis looking at important functional residues in the actin family. In our “Defining actin” pub [13], we used BLAST to generate a list of the 50,000 protein sequences most similar to human β-actin. In the pub below, we comparatively analyzed their structures using ProteinCartography [14]. We looked at the resulting map of similarly structured protein clusters and overlaid data [13] about the conservation of actin’s important functions — polymerization and ATP binding — in each protein across the space.
Because the actin family is so well studied, this was a useful test case for ProteinCartography. We were able to show that proteins generally sorted into their expected subfamilies (actins clustered together, actin-related proteins clustered together, etc.). However, there’s plenty of room for discovery even in this well-studied protein family, and we developed a list of testable hypotheses. We pursued one of these hypotheses in “A structurally divergent actin conserved in fungi has no association with specific traits” (described next), but we’re sharing the rest here in the hope that the community will run with them.
As described above, applying ProteinCartography to the actin family suggested several avenues for follow-up study. One intriguing observation was a cluster of actin-like proteins that, in addition to being structurally divergent from canonical actin, are primarily found in fungal species. We wondered if these structural differences were related to functional differences linked to a specific fungal trait.
We first confirmed that these divergent actins are indeed mainly present in fungi. We then applied phylogenetic trait mapping to investigate the relationship between their presence and specific fungal traits. Any correlation could suggest a specific function for these proteins. We explored six traits, which spanned ecology, fungal structure, and genetics. The presence or absence of this divergent actin did not correlate with any of the traits we analyzed.
That said, trait data was limited, constraining the scope and power of our analysis. We remain optimistic about the potential for phylogenetic trait mapping as a tool for functional annotation and discovery in future studies.
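As a rough illustration of what "presence or absence did not correlate with a trait" can mean, the sketch below runs a two-sided Fisher's exact test on a 2x2 table of species counts. This is a deliberately simplified stand-in: it treats species as independent and ignores the phylogeny entirely, so it is not phylogenetic trait mapping and not the analysis from the pub. The counts in the example are invented.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Rows: divergent actin present / absent.
    Columns: trait present / absent.
    """
    row1, row2, col1, total = a + b, c + d, a + c, a + b + c + d
    denom = comb(total, col1)

    def prob(x):
        # Hypergeometric probability of x species in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Sum every possible table that is at least as unlikely as the observed one.
    return sum(
        p
        for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= observed * (1 + 1e-9)
    )


# Invented counts: species with the protein (trait yes / no),
# then species without the protein (trait yes / no).
p_value = fisher_exact_two_sided(6, 5, 7, 6)
print(f"p = {p_value:.3f}")
```

A large p-value here would be consistent with "no association", but with sparse trait data a test like this has little power, which mirrors the limitation noted above.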
This set of use cases is designed to help us validate our protein functional prediction tools, especially ProteinCartography, in vitro. ProteinCartography, a tool for structure-based clustering of protein families, could be useful for predicting protein function. However, we first need to test two hypotheses to be confident in these predictions: 1) proteins clustering together based on structure have similar functions and 2) proteins in different structure-based clusters have different functions.
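The two hypotheses can be phrased as two numbers to measure once lab results exist. The sketch below assumes each tested protein has a cluster label and a single categorical function label; both dictionaries, the protein names, and the labels are hypothetical, and real assay data would likely be quantitative rather than categorical.

```python
from itertools import combinations


def pairwise_agreement(cluster_of, function_of):
    """Return (within, between) for all protein pairs.

    within: fraction of same-cluster pairs that share a function (hypothesis 1).
    between: fraction of different-cluster pairs that differ in function (hypothesis 2).
    """
    same_total = same_match = diff_total = diff_mismatch = 0
    for p, q in combinations(sorted(cluster_of), 2):
        shares_function = function_of[p] == function_of[q]
        if cluster_of[p] == cluster_of[q]:
            same_total += 1
            same_match += shares_function
        else:
            diff_total += 1
            diff_mismatch += not shares_function
    within = same_match / same_total if same_total else float("nan")
    between = diff_mismatch / diff_total if diff_total else float("nan")
    return within, between


# Hypothetical cluster assignments and assay outcomes for four proteins.
cluster_of = {"p1": "A", "p2": "A", "p3": "B", "p4": "B"}
function_of = {"p1": "active", "p2": "active", "p3": "inactive", "p4": "active"}
print(pairwise_agreement(cluster_of, function_of))
```

Values near 1.0 for both numbers would support using structure-based clusters as a proxy for function; values near chance would argue against it.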
Our overall strategy for in vitro validation of ProteinCartography requires answering four questions: how we select protein families, how we select protein clusters, how we select individual proteins, and which functions to test in the lab.
For our first round of validation, we’ve selected protein families that have been previously characterized biochemically and that produce ProteinCartography results that have clearly defined clusters we can use to test our overall hypotheses. We’ve selected two protein families: Ras GTPase and deoxycytidine kinase.
Ras GTPases are an extensively studied superfamily of small monomeric GTPases. They’re involved in many signaling pathways in the cell and have been implicated in a number of cancers and other diseases. In addition to their critical biological roles, they’re small, structured, and have assays available to test biochemical functions. We’re looking for feedback on how we narrow down clusters and individual proteins to focus on for our biochemical analysis and how we might functionally characterize proteins from this superfamily.
Deoxycytidine kinases are a group of well-studied proteins involved in the nucleoside salvage pathway. They’ve been used to help with cancer and viral therapies. Additionally, these proteins are small, structured, and have commercially available assays that produce a wealth of data we can use to test our hypotheses. As with Ras GTPases, we’re seeking feedback on how to narrow down which clusters and individual proteins to focus on.
We will continue to improve both our actin prediction and ProteinCartography pipelines. Specifically, we want to add broad software improvements, general validation, and new analyses and features. We also want to incorporate the information from our pipelines with other software packages and resources in development at Arcadia. More specific plans are listed in each pub.
Additionally, we plan to apply the pipelines to work happening at Arcadia. Using these pipelines will help us identify new areas for refinement and new features to add. We hope that you will use the pipelines as well, and let us know if you have thoughts for the future of the work by commenting on our pubs. To try out the actin pipeline, visit the GitHub repository here, and to try out the ProteinCartography pipeline, visit the GitHub repository here.