A strategy to validate protein function predictions in vitro

Prachee Avasthi; Audrey Bell; Brae M. Bigge; Megan L. Hochstrasser; Atanas Radkov; Dennis A. Sun; Harper Wood; Ryan York

doi:10.57844/arcadia-cae9-96c4

Idea Not actively updating Annotation: Mapping the functional landscape of protein families across biology

Published on May 30, 2024 by Arcadia Science

We aim to validate ProteinCartography, a tool for structure-based protein clustering, by evaluating two foundational hypotheses: that proteins within a cluster have similar functions and proteins in different clusters have differing functions.

Purpose

In this pub, we outline a path for validating ProteinCartography, a computational tool for comparative analysis of protein structures across species [1]. ProteinCartography produces an interactive map of protein families with individual proteins separated into clusters based on their structural similarity. Our foundational hypotheses are that functionally similar proteins cluster together while proteins with distinct functions cluster separately. We plan to assess this using a couple of test protein families.

We’ve selected protein families for in vitro validation, and that’s mostly what we’ll focus on in this pub. We started with a list of the most-studied human proteins in the Protein Data Bank [2] and narrowed it down using the criteria outlined below. We selected two candidate protein families: Ras GTPases and deoxycytidine kinases.

Now we face the challenge of selecting individual clusters and proteins to focus on. We go into much more depth on how we’re thinking about this for each family in our accompanying dCK and Ras GTPase pubs. Head there for specific information (and to provide family-specific feedback!). We’ll update this pub with generalizable takeaways from our studies of each protein family to build a roadmap for validation.

This pub is part of the platform effort, “Functional annotation: mapping the functional landscape of proteins across biology.” Visit the platform narrative for more background and context.
The accompanying pubs, “How can we biochemically validate ProteinCartography with the deoxycytidine kinase family?” and “How can we biochemically validate ProteinCartography with the Ras GTPase family?”, present ProteinCartography data and follow-up testing options for our two chosen protein families.
The ProteinCartography pipeline used to run these analyses is available in this GitHub repo.
The data associated with this pub, including ProteinCartography results for the 30 proteins we ran, can be found in this Zenodo repository. An additional four from previous ProteinCartography runs can be found in this Zenodo repo.

Share your thoughts!

Feel free to provide feedback by commenting in the box at the bottom of this page or by posting about this work on social media. Please make all feedback public so other readers can benefit from the discussion.

Motivation

What is ProteinCartography?

We previously introduced a tool for structural comparison of protein families: ProteinCartography [1]. ProteinCartography identifies proteins similar to an input using sequence- and structure-based searches. It aligns the structure of each protein to every other protein to generate a structural similarity score, or TM-score (template modeling score), for each pair of proteins in the analysis [3]. It uses these scores to populate a similarity matrix, then uses this matrix to cluster proteins into groups of similar structures and to create interactive maps (UMAP or t-SNE) for easy visualization [4][5][6].
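To make these steps concrete, here is a minimal Python sketch of the matrix-building and grouping stages. It is not the ProteinCartography implementation: the protein names and scores are hypothetical, real scores would come from structural alignment, and the simple threshold grouping is a stand-in for the Leiden clustering the pipeline uses [4]. The function names and the 0.5 cutoff are assumptions chosen for illustration.

```python
from itertools import combinations


def build_similarity_matrix(ids, pair_scores):
    """Return a symmetric matrix (list of lists) of pairwise similarity scores.

    pair_scores maps frozenset({a, b}) to a score in [0, 1].
    The diagonal is set to 1.0 (each protein compared with itself).
    """
    index = {pid: i for i, pid in enumerate(ids)}
    n = len(ids)
    matrix = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for a, b in combinations(ids, 2):
        score = pair_scores[frozenset((a, b))]
        matrix[index[a]][index[b]] = score
        matrix[index[b]][index[a]] = score
    return matrix


def threshold_clusters(ids, matrix, threshold=0.5):
    """Group proteins into connected components of the graph whose edges
    are pairs scoring at or above the threshold.

    This is a deliberately simple stand-in for Leiden clustering.
    """
    n = len(ids)
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and matrix[i][j] >= threshold:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return dict(zip(ids, labels))


# Hypothetical proteins and scores, for illustration only.
ids = ["protA", "protB", "protC", "protD"]
pair_scores = {
    frozenset(("protA", "protB")): 0.91,
    frozenset(("protA", "protC")): 0.32,
    frozenset(("protA", "protD")): 0.28,
    frozenset(("protB", "protC")): 0.35,
    frozenset(("protB", "protD")): 0.30,
    frozenset(("protC", "protD")): 0.88,
}

matrix = build_similarity_matrix(ids, pair_scores)
clusters = threshold_clusters(ids, matrix)
```

With these toy scores, protA and protB fall into one group and protC and protD into another. The 2D map (UMAP or t-SNE) would be computed from the same matrix and is omitted here.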

The outputs of this analysis can be useful for making predictions about which proteins within families might be structurally similar or identifying which proteins might have novel structural features. Because structure and function are closely related, we hope that this analysis will also let us generate hypotheses about protein function.

Our foundational hypotheses

As we use ProteinCartography’s results to infer functional relationships, we want to biochemically validate ProteinCartography to show that the structure-based clustering can really give insights into protein function. To this end, we have two main hypotheses to test (Figure 1):

1 — Proteins within the same cluster have similar biochemical functions.
2 — Proteins in different clusters have functional differences.

We plan to test these hypotheses using candidate protein families that we can assess biochemically. For our first round of validation, we’re aiming for a couple of protein families that are easy to work with in vitro and that produce ProteinCartography results with clearly defined clusters, giving us many opportunities to test our hypotheses.
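One simple way to frame the two hypotheses quantitatively is to compare how much a measured activity differs between pairs of proteins in the same cluster versus pairs in different clusters. The sketch below is an illustration only: the protein names, cluster labels, and activity values are hypothetical, and a real analysis would need replicates and an appropriate statistical test.

```python
from itertools import combinations
from statistics import mean


def pairwise_differences(activity, cluster_of):
    """Split absolute activity differences into within- and between-cluster lists."""
    within, between = [], []
    for a, b in combinations(sorted(activity), 2):
        diff = abs(activity[a] - activity[b])
        if cluster_of[a] == cluster_of[b]:
            within.append(diff)
        else:
            between.append(diff)
    return within, between


# Hypothetical normalized activity values and cluster labels.
activity = {"p1": 1.00, "p2": 0.90, "p3": 0.20, "p4": 0.30}
cluster_of = {"p1": "A", "p2": "A", "p3": "B", "p4": "B"}

within, between = pairwise_differences(activity, cluster_of)
mean_within = mean(within)
mean_between = mean(between)

# If both hypotheses hold, within-cluster differences should be
# smaller than between-cluster differences.
supports_hypotheses = mean_within < mean_between
```

In this toy case the within-cluster differences are much smaller than the between-cluster differences, which is the pattern both hypotheses predict.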

**Figure 1. Foundational hypotheses we intend to test via biochemical validation.**
The ProteinCartography-generated t-SNE map for MAPK10 (UniProt ID: P53779), with examples of our hypotheses indicated. This analysis was originally done in our initial ProteinCartography pub; full data can be found there and in our Zenodo repo.

The plan

As we work toward validating ProteinCartography, we’ll go through the following steps. We’ll update this pub so that it can serve as a roadmap for future validation plans and for how one might follow up on ProteinCartography results.

So far, we’ve selected two protein families for initial validation using a strategy discussed below. If you’d like to read about these protein families and see some practical examples of this process, visit the pubs for our candidate protein families: deoxycytidine kinase or Ras GTPase.

Step 1: Decide which protein families to focus on

To test these hypotheses, we first had to identify protein families to work with. For our initial analyses, we aimed for families that are easy to work with in vitro and that had ProteinCartography outputs with defined clusters and functions that we can realistically assay in the lab. We came up with a list of criteria that we thought were essential (Table 1, column 1).

Rather than considering the entire protein universe, we decided to start with a more tractable number of protein families to choose from. We turned to the list of the 200 most-studied human proteins in the Protein Data Bank (PDB) [7]. These proteins have many experimentally determined structures, which means the proteins have likely been purified. Note that because these proteins have been deeply studied and are easy to work with, they may represent a class of proteins that’s more likely to validate ProteinCartography. However, for this first round of validation, we wanted to aim for lower-hanging fruit. For future validations, we may use protein families that more thoroughly stress-test the tool to find the edges of its functionality.

To narrow this down further, we went through our criteria from Table 1 and eliminated proteins in a stepwise manner. From our list of 200 purifiable proteins, we eliminated any that didn’t have commercially available assay kits or that hadn’t been previously purified from a bacterial host. We also eliminated proteins that fell outside our standard length and structural confidence (mean pLDDT) criteria for the ProteinCartography pipeline [1]. Because the AlphaFold database uses a length cutoff of 1,280 amino acids, we eliminated any proteins over this length. We also eliminated any proteins with a significant amount of disorder (mean pLDDT < 80) [8][9], as they’re not well-suited for structural comparisons [1][9]. At the time of writing, the AlphaFold database FAQ describes both the length cutoff and the limitations for significantly disordered proteins. This left us with 34 proteins, listed in Table 2.
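These eliminations are simple threshold checks, and the sketch below shows one way to express them in Python. The `Candidate` fields and the two failing entries are hypothetical. The lengths and mean pLDDT values for deoxycytidine kinase and GTPase HRas come from Table 2, and their assay and purification flags are set to true on the assumption that any protein listed in Table 2 passed those checks.

```python
from dataclasses import dataclass

MAX_LENGTH = 1280       # length cutoff quoted in the text (amino acids)
MIN_MEAN_PLDDT = 80.0   # mean pLDDT threshold quoted in the text


@dataclass(frozen=True)
class Candidate:
    name: str
    length: int
    mean_plddt: float
    has_assay_kit: bool            # hypothetical field name
    purified_from_bacteria: bool   # hypothetical field name


def passes_filters(candidate: Candidate) -> bool:
    """Apply the assay, purification, length, and confidence criteria."""
    return (
        candidate.has_assay_kit
        and candidate.purified_from_bacteria
        and candidate.length <= MAX_LENGTH
        and candidate.mean_plddt >= MIN_MEAN_PLDDT
    )


candidates = [
    # Length and mean pLDDT taken from Table 2.
    Candidate("Deoxycytidine kinase", 260, 89, True, True),
    Candidate("GTPase HRas", 189, 93, True, True),
    # Hypothetical entries that illustrate each structural cutoff.
    Candidate("hypothetical long protein", 1500, 90, True, True),
    Candidate("hypothetical disordered protein", 300, 62, True, True),
]

kept = [c.name for c in candidates if passes_filters(c)]
```

Only the two Table 2 entries survive; the hypothetical long protein fails the length cutoff and the hypothetical disordered protein fails the pLDDT threshold.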

We ran ProteinCartography [1] using the standard parameters (searching for 5,000 hits total) on those 34 proteins. We looked for maps with well-defined clusters that appeared to contain representatives from multiple broad taxonomic groups. Using those criteria, we narrowed those 34 protein families down to 14. We dug deep on our top five: HRas/KRas GTPases (UniProt IDs: P01112 and P01116), glycogen synthase kinase 3 beta (GSK3β) (P49841), lysozyme C (P61626), a tyrosine kinase (P43405), and deoxycytidine kinase (P27707). For these five protein families, we scaled up our ProteinCartography runs, asking the pipeline to fetch 10,000 total similar proteins for each family to capture additional protein diversity. We found that lysozyme C lacked taxonomic diversity, GSK3β returned many hits with low-confidence predicted structures, and the tyrosine kinase lacked annotation diversity (all proteins had similar existing annotations). Other families had similar issues. While these are ProteinCartography outputs that we would eventually like to dive deeper into, for this round of validation, we chose protein families that would help us test the clustering outputs in the simplest possible manner. We chose two protein families so we can test our hypotheses through orthogonal experiments and rely on just one of the families if in-lab analyses prove challenging for the other.

SHOW ME THE DATA: The data associated with this pub, including ProteinCartography results for the 30 proteins we ran, can be found in this Zenodo repository (DOI: 10.5281/zenodo.11264123). An additional four from previous ProteinCartography runs can be found in this Zenodo repo (DOI: 10.5281/zenodo.8377393).

The families we settled on are deoxycytidine kinases and Ras GTPases. For both families, we have open questions for which we’re seeking feedback. Visit the pubs to learn more and provide your feedback!

Criteria	How we met this criterion	Number of proteins after filtering
Protein must be purifiable	Started with a list of previously purified proteins	200
Protein must meet standard length and pLDDT criteria for ProteinCartography	Eliminated any proteins over 1,280 amino acids or with an average pLDDT under 80	34
Protein activity must be assayable	Eliminated any proteins that didn’t have a commercially available kit	34
Standard ProteinCartography outputs must present testable hypotheses	Eliminated any proteins that didn’t have well-defined clusters representing a broad taxonomic range	14
Scaled-up ProteinCartography outputs must present testable hypotheses	Chose the top two most interesting	2

Table 1. Criteria for protein family selection.

Protein	UniProt ID	Data source	Length	Average pLDDT
Superoxide dismutase [Cu-Zn]	P00441	[1]	154	98
Peptidyl-prolyl cis-trans isomerase A	P62937	[1]	165	98
Glutathione S-transferase P	P09211	This study	210	98
Carbonic anhydrase 2	P00918	This study	260	97
Pancreatic alpha-amylase	P04746	This study	511	97
Dihydrofolate reductase	P00374	[1]	187	96
Histone deacetylase 8	Q9BY41	This study	377	95
Lysozyme C	P61626	This study	148	94
Transforming protein RhoA	P61586	This study	193	94
DNA polymerase beta	P06746	This study	335	94
Nicotinamide phosphoribosyltransferase	P43490	This study	491	94
⭐ GTPase HRas	P01112	This study	189	93
Hypoxia-inducible factor 1-alpha inhibitor	Q9NWT6	This study	349	93
⭐ GTPase KRas	P01116	This study	189	92
Fibroblast growth factor 1	P05230	This study	155	91
Interstitial collagenase	P03956	This study	469	91
Serine/threonine-protein kinase pim-1	P11309	This study	313	90
⭐ Deoxycytidine kinase	P27707	This study	260	89
Glycogen synthase kinase-3 beta	P49841	[1]	420	89
Cyclin-dependent kinase 2	P24941	This study	298	88
Beta-secretase 1	P56817	This study	501	88
Caspase-3	P42574	This study	277	86
Vitamin D3 receptor	P11473	This study	427	85
Serine/threonine-protein kinase PLK1	P53350	This study	603	85
Tyrosine-protein kinase Lck	P06239	This study	509	84
Tyrosine-protein kinase SYK	P43405	This study	635	84
Urokinase-type plasminogen activator	P00749	This study	431	82
Aldo-keto reductase family 1 member B1	P15121	This study	316	98
Casein kinase II subunit alpha	P68400	This study	391	91
Mitogen-activated protein kinase 1	P28482	This study	360	91
Mitogen-activated protein kinase 14	Q16539	This study	360	89
Macrophage metalloprotease	P39900	This study	470	88
Peptidyl-prolyl cis-trans isomerase NIMA-interacting 1	Q13526	This study	163	93
Renin	P00797	This study	406	86

Table 2. Proteins we analyzed with ProteinCartography.

Proteins we moved forward with for validation are indicated with stars (⭐️).

Future directions

Now that we’ve selected which protein families to focus on for our initial validation, we’re seeking feedback on how we decide which protein clusters to focus on and how to select individual proteins from within clusters. Additionally, we’re beginning to plan how we’ll actually assay biochemical functions for our protein families.

Step 2: Select clusters to focus on

Our ProteinCartography runs for Ras GTPase and dCK generated 12 clusters for each protein family. For our first round of validation, we want to test our foundational hypotheses on only a handful of clusters. We can identify appropriate clusters based on additional information from ProteinCartography. In addition to the Leiden cluster overlay shown in Figure 1, we also get metadata overlays, including overlays that can tell us about the broad taxonomy of the proteins, characteristics like length, and how similar the proteins are to our input proteins. Additionally, we get an analysis that tells us more about the UniProt annotations for proteins in our space, called a semantic analysis.

In accompanying pubs, we outline all of this data for both deoxycytidine kinases and Ras GTPases. We’ve selected clusters that we find interesting based on these analyses and request your feedback on deciding which ones to use for our initial validation.

Step 3: Pick individual proteins to bring into the lab

Once we select which clusters to focus on, we’ll need a plan for selecting individual proteins to bring into the lab. A typical cluster can contain hundreds of individual proteins. Our goal for this first round of validation is to keep the number of proteins we analyze relatively low, so we want to be thoughtful about picking proteins. We’d love your input on ways that we might tackle this challenge.
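As one illustration of what such a selection could look like (an assumption for discussion, not a settled plan), the sketch below samples proteins at increasing distance from a cluster's center. The coordinates and protein names are hypothetical, `pick_by_distance` is a made-up helper, and distances in a t-SNE or UMAP map are only a rough proxy for structural similarity, so the same idea could instead be applied to TM-score-based distances.

```python
from math import ceil, dist


def pick_by_distance(coords, n_bins=3, per_bin=2):
    """Pick up to per_bin proteins from each of up to n_bins distance bands.

    coords maps protein ID to an (x, y) position for one cluster.
    Proteins are ranked by distance from the cluster centroid, split into
    equal-sized rank bands, and the nearest members of each band are returned.
    """
    if not coords:
        raise ValueError("coords must contain at least one protein")
    ids = sorted(coords)
    center = (
        sum(coords[pid][0] for pid in ids) / len(ids),
        sum(coords[pid][1] for pid in ids) / len(ids),
    )
    # Break distance ties by ID so the result is deterministic.
    ranked = sorted(ids, key=lambda pid: (dist(coords[pid], center), pid))
    bin_size = ceil(len(ranked) / n_bins)
    return [
        ranked[start:start + per_bin]
        for start in range(0, len(ranked), bin_size)
    ]


# Hypothetical map coordinates for six proteins in one cluster.
coords = {
    "p1": (1.0, 0.0),
    "p2": (-1.0, 0.0),
    "p3": (0.0, 2.0),
    "p4": (0.0, -2.0),
    "p5": (3.0, 0.0),
    "p6": (-3.0, 0.0),
}

picks = pick_by_distance(coords, n_bins=3, per_bin=1)
```

Here `picks` holds one protein from the core of the cluster, one from the middle band, and one from the periphery, which keeps the number of proteins low while still spanning the cluster.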

Step 4: Biochemically analyze function across proteins

We have plans in place for purification and simple activity assays, but we’d love to know if there are additional ways to evaluate biochemical or protein-level function that might be useful for validating ProteinCartography.

Summary

We’re working toward validating our ProteinCartography tool by testing two foundational hypotheses:

Proteins within the same cluster have similar biochemical functions.
Proteins in different clusters have functional differences.

We’re sharing our strategy for validation as we generate it to gather feedback from the community, but also to provide a roadmap for future validation and for how one might use ProteinCartography results.

So far, we’ve addressed our first open question — how to select protein families for validation. Further analyses of these protein families can be found in the accompanying pubs:

How can we biochemically validate protein function predictions with the…
deoxycytidine kinase family? [10]
Ras GTPase family? [11]

Next, we’ll work to answer the remaining questions, including how we select clusters to test, how we select individual proteins, and how we go about biochemical analyses of these proteins.

References

1. Avasthi P, Bigge BM, Celebi FM, Cheveralls K, Gehring J, McGeever E, Mishne G, Radkov A, Sun DA. (2024). ProteinCartography: Comparing proteins with structure-based maps for interactive exploration. https://doi.org/10.57844/ARCADIA-A5A6-1068

2. Berman H, Henrick K, Nakamura H. (2003). Announcing the worldwide Protein Data Bank. https://doi.org/10.1038/nsb1203-980

3. Zhang Y, Skolnick J. (2004). Scoring function for automated assessment of protein structure template quality. https://doi.org/10.1002/prot.20264

4. Traag VA, Waltman L, van Eck NJ. (2019). From Louvain to Leiden: guaranteeing well-connected communities. https://doi.org/10.1038/s41598-019-41695-z

5. Belkina AC, Ciccolella CO, Anno R, Halpert R, Spidlen J, Snyder-Cappione JE. (2019). Automated optimized parameters for T-distributed stochastic neighbor embedding improve visualization and analysis of large datasets. https://doi.org/10.1038/s41467-019-13055-y

6. McInnes L, Healy J, Saul N, Großberger L. (2018). UMAP: Uniform Manifold Approximation and Projection. https://doi.org/10.21105/joss.00861

7. Li Z, Buck M. (2021). Beyond history and “on a roll”: The list of the most well‐studied human protein structures and overall trends in the protein data bank. https://doi.org/10.1002/pro.4038

8. Varadi M, Anyango S, Deshpande M, Nair S, Natassia C, Yordanova G, Yuan D, Stroe O, Wood G, Laydon A, Žídek A, Green T, Tunyasuvunakool K, Petersen S, Jumper J, Clancy E, Green R, Vora A, Lutfi M, Figurnov M, Cowie A, Hobbs N, Kohli P, Kleywegt G, Birney E, Hassabis D, Velankar S. (2021). AlphaFold Protein Structure Database: massively expanding the structural coverage of protein-sequence space with high-accuracy models. https://doi.org/10.1093/nar/gkab1061

9. Jumper J, Evans R, Pritzel A, Green T, Figurnov M, Ronneberger O, Tunyasuvunakool K, Bates R, Žídek A, Potapenko A, Bridgland A, Meyer C, Kohl SAA, Ballard AJ, Cowie A, Romera-Paredes B, Nikolov S, Jain R, Adler J, Back T, Petersen S, Reiman D, Clancy E, Zielinski M, Steinegger M, Pacholska M, Berghammer T, Bodenstein S, Silver D, Vinyals O, Senior AW, Kavukcuoglu K, Kohli P, Hassabis D. (2021). Highly accurate protein structure prediction with AlphaFold. https://doi.org/10.1038/s41586-021-03819-2

10. Avasthi P, Bigge BM, Radkov A, Wood H, York R. (2024). How can we biochemically validate protein function predictions with the deoxycytidine kinase family? https://doi.org/10.57844/ARCADIA-1E5D-E272

11. Avasthi P, Bigge BM, Radkov A, Wood H, York R. (2024). How can we biochemically validate protein function predictions with the Ras GTPase family? https://doi.org/10.57844/ARCADIA-74AD-345F

Prachee Avasthi

Conceptualization, Supervision

Audrey Bell

Visualization

Brae M. Bigge

Conceptualization, Formal Analysis, Investigation, Supervision, Writing

Megan L. Hochstrasser

Editing

Atanas Radkov

Editing, Formal Analysis, Investigation

Dennis A. Sun

Critical Feedback

Harper Wood

Editing, Formal Analysis, Investigation

Ryan York

Supervision

Erle Holgersen on Sep 10, 2024

This was a great read, thanks for sharing!

For the selection of proteins within each cluster, my suggestion would be to aim for sequence diversity. If the shortest protein in the cluster and other proteins with relatively low sequence similarity are shown to have the same function as the human protein, I’d be inclined to think that the clustering is picking up the key function of the proteins. The risk is you might end up picking cluster outliers and falsely debunking the hypothesis, so maybe a compromise could be to select a couple of proteins with low similarity to human, and 1-2 that are mostly similar.

Brae M. Bigge on Sep 10, 2024

Thanks for the comment! The idea of trying to capture some sequence diversity with the proteins we select to bring in the lab is a great idea. We have a new pub coming soon where we’ve tried to do something like this with the deoxycytidine kinase family, so stay tuned for that work!

Adam Pratt on Oct 04, 2024

What are the implications of the recent paper “The known protein universe is phylogenetically biased” for this work? Given that it is now known that AlphaFold is phylogenetically biased and ProteinCartography relies fundamentally on AlphaFold, I would think there would be fundamental biases present (perhaps detectable?) in ProteinCartography. One possible solution to this issue is the final suggestion in the more recent paper: incorporating phylogenetic data into the models. Incorporating such data into AlphaFold would likely solve the problem without modifying ProteinCartography.

Brae M. Bigge on Oct 04, 2024

Thanks, Adam! This is a great point. Because ProteinCartography relies on predicted structures, as you mentioned, it likely would benefit by having more phylogenetically balanced models predict the compared structures. This is definitely something to keep in mind when using tools like ProteinCartography that utilize predicted structures. But, I think even with the bias, we can still learn useful things from these tools, as long as we remember that the results are predictions/hypotheses themselves and are meant to be used for exploration along with other analyses (like phylogenetic analysis). We’re hoping to demonstrate that here and in some upcoming pubs where we pair ProteinCartography results with biochemical analysis.

I do think one interesting experiment related to this could be to try using an AlphaFold-type model with phylogenetic data incorporated to fold proteins and run ProteinCartography from “Cluster mode”. Comparing this to a typical ProteinCartography analysis using regular AlphaFold could be an interesting way to evaluate the effects of incorporating phylogenetic data into structure prediction models, especially at the protein family level.

Leah Schaffer on Oct 08, 2024

I would be interested how these clusters overlap with known structures; it would be interesting to focus follow-up analyses on proteins that are novel members of known complexes. I would also be interested in how other data for protein structure and function overlap with these results, such as AP-MS protein interaction network or CRISPR-based functional screens.

Brae M. Bigge on Oct 08, 2024

Thanks for the comment, Leah! We are interested in seeing how known structures cluster alongside the AlphaFold structures ProteinCartography currently uses. I think that could help us gain confidence in the AlphaFold structures themselves. We’re also very interested in protein-protein interactions in general. To look into this in the current version of ProteinCartography, we can create custom overlays for any data that we want, including things like AP-MS data or functional screen data. While they aren’t related to protein-protein interactions, there are some examples of these custom overlays in our pubs about actin, deoxycytidine kinases, and Ras GTPases. The problem that we often run into is data availability, but if we chose a good starting protein family, it could be cool to overlay experimental data related to protein-protein interactions.

Kyle Glockzin on Oct 09, 2024

Can the ProteinCartography system differentiate promiscuous proteins that have multiple functions? If so, how effective is this system in separating these types of proteins?

Brae M. Bigge on Oct 09, 2024

Thanks for the question! I don’t know that we’ve done any experiments to specifically test this idea, but it could be a great test of what ProteinCartography can do. Do you have any suggestions for short, structured promiscuous proteins that might be a good starting point?

Kyle Glockzin on Oct 22, 2024

One example that I can think of is the EcSPDS2 (EcSPMT) enzyme from Erythroxylum coca. There is a PNAS paper that presents the enzyme’s dual activity. Here is the link to the paper: https://doi.org/10.1073/pnas.2215372119 (Elucidation of tropane alkaloid biosynthesis in Erythroxylum coca using a microbial pathway discovery platform). I am unaware if the structure of the protein has been solved, but the sequence should be available.

Steven Strutt on Nov 13, 2024

Towards pressure testing the clusters generated by the ProteinCartography method, have you considered generating the complementary clusters through sequence homology alone (or other method of comparison) and testing protein sequences that cluster differentially between the methods? Are there super/subpopulations of proteins that are revealed uniquely by the ProteinCartography approach? I would imagine information here could also inform the genetic/structural constraints on protein evolution.

Brae M. Bigge on Nov 13, 2024

Thanks for the comment! We have thought quite a bit about comparing structure-based clustering to clustering based on other things, like sequence. I really like the idea of evaluating the proteins that cluster differently between methods. I think that could tell us a lot about sequence-structure-function relationships and could help us uncover some of the benefits and drawbacks to particular methods.

Lina Schmidt on Nov 15, 2024

One area that could be especially impactful for drug discovery and repurposing is how ProteinCartography could help in identifying potential off-target interactions for existing drugs. By clustering proteins not only by their structural similarity but also by their functional characteristics, could this toolkit be applied to predict proteins that may have similar functional roles to known drug targets?

Taken together, the computational approach you’ve outlined seems like it could significantly speed up the process of identifying new potential therapeutic targets by narrowing down functional similarities across diverse proteins. I'm curious to hear your thoughts on how ProteinCartography could be leveraged for this purpose, especially in a high-throughput setting.

Congratulations on the impressive work! I am looking forward to seeing how this methodology evolves!

Brae M. Bigge on Nov 15, 2024

Thanks for the comment, Lina! Identifying novel drug targets was one of the guiding ideas behind ProteinCartography, and I think clustering proteins based on additional functional characteristics could provide useful insight. While we haven’t done this, we have overlaid functional information on the map. For example, in this pub, we looked at the conservation of the residues involved in polymerization. This kind of analysis could be used to look for proteins that have similar binding sites or active sites to a target of interest.

Cody C on Nov 29, 2024

What considerations are being made for accessory domains? Many families of paralogs display discrete functions dictated by their multidomain architecture. Further, are you devising strategies to differentiate true enzymes from pseudo-enzymes in your structure-based clusters? This can be tricky as the catalytic motifs can be cryptic. Good luck with this very useful tool!

Brae M. Bigge on Dec 02, 2024

Hi Cody! One of the limitations of ProteinCartography is that it doesn’t work great for multi-domain proteins because it uses TM-align, which performs global alignment of structures. We’ve considered maybe using different protein representations, like embeddings or shape-mers, which might help us get more info about accessory domains. However, in its current state, ProteinCartography can be used on a pre-defined set of proteins or even a pre-defined set of domains. So, if you had a family of interest, you could obtain the proteins, truncate them to just the domain you’d like to compare and run ProteinCartography on that data to compare the domains of interest. As for your other question, I imagine that the strategy to differentiate true enzymes from pseudo-enzymes will change from family to family. We do often look at binding or active site conservation, but even that can only get you so far. Maybe when it comes to pseudo-enzymes, ProteinCartography is a good starting point, but additional analyses or data should be layered on top. Thanks for your great questions!

Afroza Akhtar on Feb 05, 2025

This is great work and well-written documentation about clustering proteins based on their structures, which give rise to their functions. This idea will help future drug discovery efforts minimize off-target effects, particularly for kinases and Ras GTPases.

Brae M. Bigge on Feb 05, 2025

Thanks Afroza!

Philipp Ross on Jan 14, 2025

The best way that I know is to just look through the literature at what’s been recombinantly purified before and just build off those protocols.

As described in this paper, predicting protein expression is hard. But there is major opportunity in being really good at it!

Philipp Ross on Jan 14, 2025

I also wonder if a cell free protein expression system might be a reasonable expression system to try considering that it takes less time and is easier to scale.

Brae M. Bigge on Jan 15, 2025

Thanks Philipp! I like the idea of setting ourselves up for success by looking at proteins that have been previously purified. I also think that cell-free expression could be a great tool for validating these types.

Philipp Ross on Jan 14, 2025

Cellular signaling assays are a pretty common alternative when the number of variants to test is high and/or purification of even a small number of variants is extremely cumbersome or impossible. And by this, I mean assaying for the downstream build up or depletion of some other protein or molecule as a result of enzyme activity. This would avoid the pitfalls likely faced in using probes or antibodies designed to detect particular proteins (probably human variants) that might be problematic when assaying evolutionarily distant homologs.

For example, in order to assay downstream activation following cytokine incubation and Janus Kinase phosphorylation, we measure the abundance of phosphorylated STAT transcription factor.

Not sure if this is possible for the systems proposed here, but just a thought!

Brae M. Bigge on Jan 15, 2025

We have commercially available assays for Ras GTPase and for dCK (our two proposed families), but this is definitely something to consider for future analyses where there might not be a simple or commercially available assay.

Richard Sobe on Jan 15, 2025

I suggest that selecting 2-3 proteins at each of 2-3 increasing distances from the cluster core would be a reasonable approach for validation. This should give a reasonable picture of how likely proteins in a given cluster are to be correctly predicted to have the same function.

Brae M. Bigge on Jan 15, 2025

Thanks for the suggestion, Richard!

Ronnie Bourland on Jan 23, 2025

This is a well-written publication, and I completely agree with using protein structural similarities to determine function due to their known relationship. The ProteinCartography selection criteria are sound, and deoxycytidine kinases and Ras GTPases appear to be a good starting point due to the chromophore characteristics of the predicted ligands for biochemical characterization.

In reference to how to select clusters to test in Table 2, searching for a protein cluster where at least one protein has a PDB-deposited structure with ligands bound in the active site is critical to understanding the structure-function relationship. Optimal ligands bound in the active site would be chromophores that can provide a readout that many spectroscopic instruments can detect. One would search within that cluster for proteins with an uncharacterized function and a highly similar yet different active site composition/structure to learn something new about the different structural feature(s). This protein with an uncharacterized function would be the protein to test biochemically. A tool like HHpred uses a protein’s primary sequence to search for structurally similar PDB structures, which can provide additional protein expression/purification and ligand information. This information is paramount for biochemically characterizing the unknown protein’s ultimate function to associate it with the unique structural feature(s).

I have used a similar but different methodology in my previous biochemical studies to characterize multi-functional bacterial enzymes with some success. I’m sure you have already thought about most if not all of this information, but I find this topic very interesting. I would welcome further discussion on the topic. Great work and good luck!

Brae M. Bigge on Jan 24, 2025

Thanks for the comment and suggestion, Ronnie! We agree that starting with a protein family or gene cluster that has at least one experimentally determined PDB is a useful strategy not only for the reasons you mention but also because it gives us a starting point for purifications. We are also interested in looking at active site conservation across proteins in our analyses, which could definitely be useful for selecting proteins.

Ronnie Bourland on Jan 24, 2025

Thank you for getting back to me so quickly! That was the best part, the expression and purification conditions as well as biochemical assay conditions in the publication associated with the deposited structures. Active site conservation is a great parameter to look for and the orientation of the catalytic residues as well as the residues that stabilize ligand binding allow for an optimal structure to function interpretation.

I looked at the other two publications on Ras GTPases and deoxycytidine kinases. In reference to your first protein family, Ras GTPases, I have previous experience characterizing XTPases biochemically, but I used ATP, CTP, and UTP instead of GTP. I took two different biochemical approaches. Depending on your equipment, simple anion exchange chromatography on an HPLC instrument will monitor the change from GTP to GDP based on the two molecules’ different negative charge, which leads to different elution profiles (both absorb at 260 nm). If you only have microplate readers, you could use the highly robust malachite green, or a derivative chemical, to chelate inorganic phosphate released from the reaction and read absorbance at 420 nm. The HPLC method would be ideal because malachite green reagents are highly sensitive, and the signal is easily saturated. If the protein itself doesn’t express well, attach a maltose binding protein (MBP) tag to the N-terminus to increase expression and therefore yield. If protein solubility is the problem, try a cell type/expression system with an additional transfected plasmid encoding GroEL and GroES chaperone homologs to facilitate proper folding, and introduce detergents into the enzyme purification buffer below the critical micelle concentration (CMC). These same types of experiments can be applied to the deoxycytidine kinase family of proteins. Anion exchange HPLC would be easiest because the molecules absorb at 260 nm and are easily separated based on charge; alternatively, you could use the microplate reader method with a coupled enzyme reaction that produces a molecule absorbing or fluorescing at a different wavelength.
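One way to handle the saturation concern in a plate-reader phosphate assay is to convert each absorbance reading to a phosphate concentration with a standard curve and flag any reading that falls outside the range the standards cover. The sketch below assumes a linear standard curve. The function names and all standard values are hypothetical illustrations, not measured data or a kit protocol.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def phosphate_from_absorbance(reading, standards):
    """Estimate phosphate (same units as the standards) from one absorbance reading.

    standards: list of (phosphate_concentration, absorbance) pairs from a
               phosphate standard series run on the same plate (assumed input).
    Returns (estimate, in_range). in_range is False when the reading lies
    outside the absorbances spanned by the standards, for example when the
    signal is saturated, so the estimate should not be trusted.
    """
    concentrations = [conc for conc, _ in standards]
    absorbances = [absorbance for _, absorbance in standards]
    slope, intercept = fit_line(concentrations, absorbances)
    in_range = min(absorbances) <= reading <= max(absorbances)
    return (reading - intercept) / slope, in_range


# Made-up standard series: (phosphate in uM, absorbance).
toy_standards = [(0, 0.05), (10, 0.25), (20, 0.45), (40, 0.85)]
estimate, ok = phosphate_from_absorbance(0.45, toy_standards)
_, saturated_ok = phosphate_from_absorbance(1.20, toy_standards)
```

Out-of-range wells would typically be diluted and re-read rather than extrapolated.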

I’m sure you all have already implemented such experiments, but this research aligns well with my past research experience and is therefore highly invigorating to me on a personal level. I know this type of work is the way of the future, and I look forward to seeing the results from your dedication and effort.

Brae M. Bigge on Jan 24, 2025

This is great! We have a pub coming out soon where we’ve investigated the function of some dCK proteins. We just did a simple commercial, plate reader-based assay using dC, dT, dU, dA, and dG that seemed to work well for our needs. However, we’ve paused on the Ras GTPase work for now while we wrap up the dCK analysis and get moving on some other things, so if/when we get back to it, this will be super useful. Thanks so much for your interest in this work and for your helpful and kind words!

Ronnie Bourland on Jan 24, 2025

Excellent! The plate reader-based assays are great for this type of biochemical work. I’m glad it all worked out! No problem, happy to help. Good luck with your future work!
