Nature contains remarkable diversity, and the answers to many impactful biological questions are likely hidden within this diversity. By comparing the shared building blocks that make up all organisms, including the DNA, protein, and other components of cells, and identifying similarities and differences, we can make hypotheses about the answers to those questions.
Our team is particularly interested in the comparative biology of proteins because DNA sequencing methods are rapidly advancing and producing mountains of high-quality protein sequence data. However, protein sequences are frequently only as useful as their functional annotations, and predicting a protein’s function remains a bottleneck in the meaningful analysis of protein sequence data.
Most annotations are based on information gathered in a small selection of organisms. In fact, 85% of Gene Ontology annotations are based on information from just ten species, including humans and other typical model organisms. Even in some of the most well-studied organisms, annotation remains difficult. For example, in E. coli, a third of the genome remains un-annotated, and in both fission and budding yeast, roughly 20% of the genome remains un-annotated. This means that existing gene annotation tools could be propagating incorrect or incomplete annotations to new data sets and potentially assigning functions to new sequences based solely on large-scale screens and tenuous evolutionary relationships.
Researchers are now developing computational approaches that take into account protein sequence, structure, and even interactions with other proteins or co-expression patterns to annotate new protein sequences. Although an improvement upon past methods that rely solely upon sequence similarity, these newer methods often still return false positives, missed annotations, and uninformative annotations such as “hypothetical protein” or “protein of unknown function.” Practically, this leads to scientists 1) wasting time and resources investigating proteins that look similar to a known protein of interest but lack functional conservation, 2) missing out on interesting proteins because they weren’t properly annotated, or 3) investing considerable upfront effort characterizing the functions of proteins before being able to move into the discovery-based phase of a project.
To overcome these obstacles, we’re building tools that combine experimental data with computational predictions to compare proteins and provide functional predictions to inform our annotations. Our hypothesis is that using this type of comparative analysis and functional annotation framework can help us identify interesting proteins with unusual characteristics that may have novel functions and interactions.
Our overall goal is to create user-friendly, customizable workflows and tools that compare protein families in diverse organisms in a deep and functionally meaningful way. This will empower scientists to answer their research questions in the best way possible using the right organisms and the right proteins.
We can break this down into a handful of guiding principles:
Our work should be usable by researchers both at Arcadia and outside Arcadia. Whether we are creating computational workflows or biochemical studies, we are developing tools with users and their questions in mind.
We want our work to foster organism inclusivity by working equally well for any organism of interest. In other words, we want to use all available data to create a space that is as representative of the tree of life as possible.
This work should enable scientists to learn more about the most important part of proteins — how they function.
We want our tools to be hypothesis-generation machines. We want to leverage comparative biology and capture the dark areas in protein space to find the novelty that can lead to the next great ideas.
Our first dive into the functional annotation space was to test how reliable existing annotations are and to show that one way to deeply understand a protein family is to combine sequence, structural, and functional comparisons.
To do this, we chose to use actin, which is important for a number of essential cellular functions and is present throughout the tree of life. It is well studied, which we could use to our advantage. Finally, most of what we know about actin comes from a fairly small selection of cells (mostly mammalian cells with actins that are highly conserved), leaving a large gap in our understanding of the actin cytoskeleton, how it’s regulated, how it functions, and how it evolved.
Thus, this test case can tell us about a long list of cellular functions, but could also help us answer fundamental biological questions about what else actin might be doing in cells. Beyond that, this work helped us define what really must be conserved for proteins to share functions (sequences, domains, structure, etc.) and helped us create a framework that we and others can use for higher-throughput identification of proteins based on functional characteristics. Finally, this demonstrates that annotations even for well-studied families are not always reliable.
Learn more about this effort:
In this pub, we use protein structural comparisons to explore protein families with the goal of helping users generate new hypotheses about what features could be driving functional differences within protein families and predict which proteins might be especially interesting for further analysis.
We created ProteinCartography, a Snakemake pipeline that compares protein structures at the family level for rapid and intuitive analysis. The pipeline starts with your favorite protein(s), identifies similar proteins using sequence- and structure-based searches, compares the AlphaFold-predicted structures of every protein to every other protein, and builds a network to identify clusters. It then creates interactive maps with a number of overlays that allow you to evaluate relationships between clusters, metadata information, and more, all in one interactive plot. With these outputs, you can make hypotheses about the differences between clusters, identify outlier proteins, and even predict functional information about proteins.
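As a rough illustration of the compare-and-cluster idea, the sketch below takes a small table of pairwise structural similarity scores, links any two proteins whose score passes a cutoff, and reports the connected groups as clusters. Everything in it is an assumption made for illustration: the protein names, the scores, the 0.5 cutoff, and the use of connected components as a simple stand-in for whatever clustering method the pipeline actually applies. It is not the ProteinCartography implementation.

```python
def cluster_by_similarity(scores, cutoff=0.5):
    """Group proteins into clusters from pairwise similarity scores.

    scores: dict mapping (protein_a, protein_b) -> similarity in [0, 1].
    cutoff: hypothetical threshold; pairs scoring at or above it are linked.
    Returns a list of clusters (sorted lists of names), largest first.
    """
    # Build an undirected graph; every protein gets a node even if unlinked.
    neighbors = {}
    for (a, b), score in scores.items():
        neighbors.setdefault(a, set())
        neighbors.setdefault(b, set())
        if a != b and score >= cutoff:
            neighbors[a].add(b)
            neighbors[b].add(a)

    # Find connected components with an iterative depth-first search.
    seen = set()
    clusters = []
    for start in sorted(neighbors):
        if start in seen:
            continue
        stack = [start]
        members = []
        seen.add(start)
        while stack:
            node = stack.pop()
            members.append(node)
            for nxt in neighbors[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        clusters.append(sorted(members))

    # Largest clusters first; ties broken alphabetically for stable output.
    return sorted(clusters, key=lambda c: (-len(c), c))


# Made-up scores for five hypothetical proteins (not real data).
example_scores = {
    ("actin_A", "actin_B"): 0.92,
    ("actin_A", "arp_C"): 0.41,
    ("actin_B", "arp_C"): 0.38,
    ("arp_C", "arp_D"): 0.77,
    ("actin_A", "outlier_E"): 0.12,
}

clusters = cluster_by_similarity(example_scores)
# Single-member clusters are candidate outliers worth a closer look.
outliers = [c[0] for c in clusters if len(c) == 1]
```

In this toy input, the two actin-like entries group together, the two Arp-like entries group together, and the fifth protein is left on its own, which mirrors the kind of "identify outlier proteins" reading the maps are meant to support.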
This pub contains the initial version of the pipeline with core functionality, but we will continue to improve it, and we welcome any feedback either directly on the pub or in the GitHub repository referenced therein.
One of the first use cases of this tool was to investigate why some bacterial species accumulate more polyphosphate than others. This is an important area of interest for wastewater treatment, and so far it has been difficult to predict which bacteria will be better at accumulating polyphosphates. We used ProteinCartography, along with other tools, to test whether protein structural similarity among PPK1 enzymes, which catalyze polyphosphate formation, might help answer this question.
In addition to helping us learn about polyphosphate accumulation and PPKs, this work also helped us learn more about how to incorporate bacterial analyses into ProteinCartography, which was originally more eukaryote-focused. It also helped us connect phylogenetic comparisons and protein structure analysis to generate more informed predictions.
Inspired by a comment on that work, we released an “open question” pub to spur engagement with the broader community about future directions in polyphosphate accumulation.
In another use case, we combined ProteinCartography with the results of our analysis of important functional residues in the actin family. In our “Defining actin” pub, we used BLAST to generate a list of the 50,000 protein sequences most similar to human β-actin. In the pub below, we comparatively analyzed their structures using ProteinCartography. We looked at the resulting map of similarly structured protein clusters and overlaid data about the conservation of actin’s important functions (polymerization and ATP binding) in each protein across the space.
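To make the idea of a per-protein conservation overlay concrete, here is a minimal sketch that scores how many residues at a set of key alignment columns each candidate shares with a reference sequence. The sequences, the column indices, and the feature names are all hypothetical toy values, not real actin residues, and exact-match scoring is an assumed simplification rather than the method used in the pubs.

```python
def residue_conservation(reference, candidate, key_positions):
    """Fraction of key reference residues that the candidate preserves.

    reference, candidate: rows from the same alignment (equal length,
        '-' marks a gap).
    key_positions: 0-based alignment columns of interest (hypothetical).
    """
    if len(reference) != len(candidate):
        raise ValueError("sequences must come from the same alignment")
    if not key_positions:
        raise ValueError("need at least one key position")
    matches = sum(
        1
        for i in key_positions
        if reference[i] != "-" and reference[i] == candidate[i]
    )
    return matches / len(key_positions)


# Toy aligned sequences; these are NOT real actin sequences.
reference = "MDGKS-TAV"
candidates = {
    "homolog_1": "MDGKSATAV",
    "homolog_2": "MEG-S-TLV",
}

# Hypothetical columns standing in for functionally important residues.
features = {
    "atp_binding": [1, 2, 3],
    "polymerization": [6, 7],
}

# One score per protein per feature, ready to color points on a map.
overlay = {
    name: {
        feature: residue_conservation(reference, seq, cols)
        for feature, cols in features.items()
    }
    for name, seq in candidates.items()
}
```

Each protein ends up with one score per feature between 0 and 1, which is the kind of value that can be layered onto a cluster map to see where a function appears to be retained or lost.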
Because the actin family is so well studied, this was a useful test case for ProteinCartography. We were able to show that proteins generally sorted into their expected subfamilies (actins clustered together, actin-related proteins clustered together, etc.). However, there’s plenty of room for discovery even in this well-studied protein family, and we developed a list of testable hypotheses. We pursue one of these hypotheses in a forthcoming pub, but we’re sharing the rest here with hopes that the community will run with them.
We will continue to improve both our actin prediction and ProteinCartography pipelines. Specifically, we want to add broad software improvements, general validation, and new analyses and features. We also want to incorporate the information from our pipelines with other software packages and resources in development at Arcadia. More specific plans are listed in each pub.
Additionally, we plan to apply the pipelines to work happening at Arcadia. Using these pipelines will help us identify new areas for refinement and new features to add. We hope that you will use the pipelines as well, and let us know if you have thoughts for the future of the work by commenting on our pubs. To try out the actin pipeline, visit the GitHub repository here, and to try out the ProteinCartography pipeline, visit the GitHub repository here.
Finally, we want to bring the results of our computational analyses to the bench and see if we’re able to identify functional differences within protein families using biochemical analysis. We hope that the in-lab analyses will be useful in helping us refine our pipelines moving forward.