
A strategy to validate protein function predictions in vitro

We aim to validate ProteinCartography, a tool for structure-based protein clustering, by evaluating two foundational hypotheses: that proteins within a cluster have similar functions and proteins in different clusters have differing functions.
Published on May 31, 2024

Purpose

In this pub, we outline a path for validating ProteinCartography, a computational tool for comparative analysis of protein structures across species [1]. ProteinCartography produces an interactive map of protein families with individual proteins separated into clusters based on their structural similarity. Our foundational hypotheses are that functionally similar proteins cluster together while proteins with distinct functions cluster separately. We plan to assess this using a couple of test protein families.

We’ve selected protein families for in vitro validation, and that’s mostly what we’ll focus on in this pub. We started with a list of the most-studied human proteins in the Protein Data Bank [2] and narrowed it down using the criteria outlined below. We selected two candidate protein families: Ras GTPases and deoxycytidine kinases (dCK).

Now we face the challenge of selecting individual clusters and proteins to focus on. We go into much more depth on how we’re thinking about this for each family in our accompanying dCK and Ras GTPase pubs. Head there for specific information (and to provide family-specific feedback!). We’ll update this pub with generalizable takeaways from our studies of each protein family to build a roadmap for validation.

Share your thoughts!

Watch a video tutorial on making a PubPub account and commenting. Please feel free to add line-by-line comments anywhere within this text, provide overall feedback by commenting in the box at the bottom of the page, or use the URL for this page in a tweet about this work. Please make all feedback public so other readers can benefit from the discussion.

Motivation

What is ProteinCartography? 

We previously introduced a tool for structural comparison of protein families: ProteinCartography [1]. ProteinCartography identifies proteins similar to an input using sequence- and structure-based searches. It aligns the structure of each protein to every other protein to generate a structural similarity score, or TM-score (template modeling score), for each pair of proteins in the analysis [3]. It uses these scores to populate a similarity matrix. It then uses this matrix to cluster proteins into similar groups and to create interactive maps (UMAP or t-SNE) for easy visualization [4][5][6].
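As a rough illustration of the similarity-matrix step, here is a minimal sketch that assembles pairwise scores into a symmetric matrix and converts it to distances. It assumes one TM-score per protein pair has already been computed by an external aligner; the protein IDs, the scores, and the helper name `build_similarity_matrix` are all hypothetical, and ProteinCartography's own implementation may differ.

```python
def build_similarity_matrix(ids, pair_scores):
    """Return a symmetric matrix (list of lists) of pairwise similarity scores.

    pair_scores maps an (id_a, id_b) tuple to a single score in [0, 1].
    Pairs that are missing from pair_scores are left at 0.0.
    """
    index = {protein_id: i for i, protein_id in enumerate(ids)}
    n = len(ids)
    # A structure compared with itself gets the maximum score of 1.0.
    matrix = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for (a, b), score in pair_scores.items():
        matrix[index[a]][index[b]] = score
        matrix[index[b]][index[a]] = score
    return matrix


# Hypothetical proteins and scores, for illustration only.
ids = ["protA", "protB", "protC"]
pair_scores = {
    ("protA", "protB"): 0.88,
    ("protA", "protC"): 0.42,
    ("protB", "protC"): 0.43,
}

similarity = build_similarity_matrix(ids, pair_scores)
# Clustering and embedding tools usually expect distances, so invert the scores.
distance = [[1.0 - score for score in row] for row in similarity]
```

A matrix like `distance` is the kind of input that clustering and dimensionality-reduction methods can consume when they accept precomputed distances.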

The outputs of this analysis can be useful for making predictions about which proteins within families might be structurally similar or identifying which proteins might have novel structural features. Because structure and function are closely related, we hope that this analysis will also let us generate hypotheses about protein function.

Our foundational hypotheses

As we use ProteinCartography’s results to infer functional relationships, we want to biochemically validate ProteinCartography to show that the structure-based clustering can really give insights into protein function. To this end, we have two main hypotheses to test (Figure 1):

1 — Proteins within the same cluster have similar biochemical functions.
2 — Proteins in different clusters have functional differences.

We plan to test these hypotheses using candidate protein families that we can assess biochemically. For our first round of validation, we’re aiming for a couple of protein families that are easy to work with in vitro and that produce ProteinCartography results with clearly defined clusters that present many opportunities to test our hypotheses.
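One way to think about the two hypotheses quantitatively is to compare how much a measured activity differs between proteins in the same cluster versus proteins in different clusters. The sketch below is only an illustration of that idea, not our analysis plan: the activity values, protein names, and cluster labels are hypothetical, and a real comparison would need replicates and an appropriate statistical test.

```python
from itertools import combinations
from statistics import mean


def within_between_differences(activity, cluster_of):
    """Return (mean within-cluster, mean between-cluster) absolute differences.

    activity maps protein -> measured value; cluster_of maps protein -> label.
    Raises statistics.StatisticsError if either group of pairs is empty.
    """
    within, between = [], []
    for a, b in combinations(sorted(activity), 2):
        diff = abs(activity[a] - activity[b])
        if cluster_of[a] == cluster_of[b]:
            within.append(diff)
        else:
            between.append(diff)
    return mean(within), mean(between)


# Hypothetical activity measurements (arbitrary units) and cluster labels.
activity = {"p1": 10.0, "p2": 12.0, "p3": 50.0, "p4": 54.0}
cluster_of = {"p1": "LC00", "p2": "LC00", "p3": "LC01", "p4": "LC01"}

within, between = within_between_differences(activity, cluster_of)
print(within, between)
```

If both hypotheses hold for a given assay, the within-cluster differences should be small relative to the between-cluster differences.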

Figure 1

Foundational hypotheses we intend to test via biochemical validation.

The ProteinCartography-generated t-SNE map for MAPK10 (UniProt ID: P53779), with examples of our hypotheses indicated. We originally ran this analysis for our initial ProteinCartography pub; the full data can be found there and in our Zenodo repo.

The plan

As we work toward validating ProteinCartography, we’ll go through the following steps. We’ll update this pub so that it can serve as a roadmap for future validation plans and for how one might follow up on ProteinCartography results.

So far, we’ve selected two protein families for initial validation using a strategy discussed below. If you’d like to read about these protein families and see some practical examples of this process, visit the pubs for our candidate protein families: deoxycytidine kinase or Ras GTPase.

Step 1: Decide which protein families to focus on

To test these hypotheses, we first had to identify protein families to work with. For our initial analyses, we aimed for families that are easy to work with in vitro and that had ProteinCartography outputs with defined clusters and functions that we can realistically assay in the lab. We came up with a list of criteria that we thought were essential (Table 1, column 1). 

Rather than considering the entire protein universe, we decided to start with a more tractable number of protein families to choose from. We turned to the list of the 200 most-studied human proteins in the Protein Data Bank (PDB) [7]. These proteins have many experimentally determined structures, which means they have likely been purified. Note that because these proteins have been deeply studied and are easy to work with, they may represent a class of proteins that’s more likely to validate ProteinCartography. However, for this first round of validation, we wanted to aim for lower-hanging fruit. For future validations, we may use protein families that more thoroughly stress-test the tool to find the edges of its functionality.

To narrow this down further, we went through our criteria from Table 1 and eliminated proteins in a stepwise manner. From our list of 200 purifiable proteins, we eliminated any that didn’t have commercially available assay kits or that hadn’t previously been purified from a bacterial host. We also eliminated proteins that fell outside our standard length and structural confidence (mean pLDDT) criteria for the ProteinCartography pipeline [1]. Specifically, because the AlphaFold database uses a length cutoff of 1,280 amino acids, we eliminated any proteins over this length. We also eliminated any proteins with a significant amount of disorder (mean pLDDT < 80) [8][9], as they’re not well-suited for structural comparisons [1][9]. At the time of writing, the AlphaFold database FAQ describes both the length cutoff and the limitations associated with significant disorder. This left us with 34 proteins, listed in Table 2.
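The length and confidence filters can be sketched as follows. This is a minimal illustration rather than the ProteinCartography pipeline's code: it assumes a PDB-format AlphaFold model, where the per-residue pLDDT is stored in the B-factor field, and it counts residues by their CA atoms. The function names and the toy structures generated by `fake_ca_line` are hypothetical.

```python
MAX_LENGTH = 1280       # length cutoff stated in the text
MIN_MEAN_PLDDT = 80.0   # mean pLDDT threshold stated in the text


def ca_plddts(pdb_text):
    """Collect one pLDDT value per residue from the B-factor field of CA atoms."""
    values = []
    for line in pdb_text.splitlines():
        # PDB fixed-width columns: atom name in 13-16, B-factor in 61-66.
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            values.append(float(line[60:66]))
    return values


def passes_structure_filters(pdb_text):
    """Return True if the model is short enough and confident enough."""
    plddts = ca_plddts(pdb_text)
    if not plddts:
        return False
    mean_plddt = sum(plddts) / len(plddts)
    return len(plddts) <= MAX_LENGTH and mean_plddt >= MIN_MEAN_PLDDT


def fake_ca_line(i, plddt):
    """Build a toy CA ATOM record (hypothetical data) in PDB column layout."""
    return (
        f"ATOM  {i:5d}  CA  ALA A{i:4d}    "
        f"{0.0:8.3f}{0.0:8.3f}{0.0:8.3f}{1.0:6.2f}{plddt:6.2f}"
    )


confident = "\n".join(fake_ca_line(i, 92.5) for i in range(1, 6))
disordered = "\n".join(fake_ca_line(i, 45.0) for i in range(1, 6))

print(passes_structure_filters(confident))
print(passes_structure_filters(disordered))
```

Reading the B-factor from CA atoms only works here because AlphaFold assigns the same pLDDT to every atom in a residue; for experimentally determined structures the same field holds temperature factors instead.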

We ran ProteinCartography [1] using the standard parameters (searching for 5,000 hits total) on those 34 proteins. We looked for maps with well-defined clusters that appeared to contain representatives from multiple broad taxonomic groups. Using those criteria, we narrowed the 34 protein families down to 14. We dug deep on our top five: HRas/KRas GTPases (UniProt IDs: P01112 and P01116), glycogen synthase kinase 3 beta (GSK3β) (P49841), lysozyme C (P61626), a tyrosine kinase (P43405), and deoxycytidine kinase (P27707). For these five protein families, we scaled up our ProteinCartography runs, asking the tool to fetch 10,000 total similar proteins for each family to capture additional protein diversity. We found that lysozyme C lacked taxonomic diversity, GSK3β returned many hits with low-confidence predicted structures, and the tyrosine kinase lacked annotation diversity (all proteins had similar annotations). Other families had similar issues. While these are ProteinCartography outputs that we would eventually like to dive deeper into, for this round of validation, we chose protein families that would help us test the clustering outputs in the simplest possible manner. We chose two protein families so we can test our hypotheses through orthogonal experiments and rely on just one of the families if in-lab analyses prove challenging for the other.
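We judged taxonomic diversity by looking at the maps, but the same idea can be expressed as a simple count. The sketch below tallies how many broad taxonomic groups appear in each cluster and reports the fraction of clusters with more than one group. The record layout, cluster labels, taxon names, and the `min_groups` threshold are hypothetical choices for illustration.

```python
from collections import defaultdict


def taxa_per_cluster(hits):
    """Map each cluster label to the set of broad taxonomic groups it contains."""
    taxa = defaultdict(set)
    for hit in hits:
        taxa[hit["cluster"]].add(hit["taxon"])
    return dict(taxa)


def fraction_diverse(taxa_by_cluster, min_groups=2):
    """Fraction of clusters containing at least min_groups taxonomic groups."""
    diverse = [c for c, groups in taxa_by_cluster.items() if len(groups) >= min_groups]
    return len(diverse) / len(taxa_by_cluster)


# Hypothetical hits, for illustration only.
hits = [
    {"cluster": "LC00", "taxon": "Mammalia"},
    {"cluster": "LC00", "taxon": "Fungi"},
    {"cluster": "LC01", "taxon": "Bacteria"},
    {"cluster": "LC01", "taxon": "Bacteria"},
]

by_cluster = taxa_per_cluster(hits)
print(fraction_diverse(by_cluster))
```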

SHOW ME THE DATA: The data associated with this pub, including ProteinCartography results for the 30 proteins we ran, can be found in this Zenodo repository (DOI: 10.5281/zenodo.11264123). Results for an additional four proteins from previous ProteinCartography runs can be found in this Zenodo repo (DOI: 10.5281/zenodo.8377393).

The families we settled on are deoxycytidine kinases and Ras GTPases. For both families, we have open questions for which we’re seeking feedback. Visit the pubs to learn more and provide your feedback!

| Criteria | How we met this criterion | Number of proteins after filtering |
| --- | --- | --- |
| Protein must be purifiable | Started with a list of previously purified proteins | 200 |
| Protein must meet standard length and pLDDT criteria for ProteinCartography | Eliminated any proteins over 1,280 amino acids or with an average pLDDT under 80 | 34 |
| Protein activity must be assayable | Eliminated any proteins that didn’t have a commercially available kit | 34 |
| Standard ProteinCartography outputs must present testable hypotheses | Eliminated any proteins that didn’t have well-defined clusters representing a broad taxonomic range | 14 |
| Scaled-up ProteinCartography outputs must present testable hypotheses | Chose the top two most interesting | 2 |

Table 1. Criteria for protein family selection.

| Protein | UniProt ID | Data source | Length | Average pLDDT |
| --- | --- | --- | --- | --- |
| Superoxide dismutase [Cu-Zn] | P00441 | [1] | 154 | 98 |
| Peptidyl-prolyl cis-trans isomerase A | P62937 | [1] | 165 | 98 |
| Glutathione S-transferase P | P09211 | This study | 210 | 98 |
| Carbonic anhydrase 2 | P00918 | This study | 260 | 97 |
| Pancreatic alpha-amylase | P04746 | This study | 511 | 97 |
| Dihydrofolate reductase | P00374 | [1] | 187 | 96 |
| Histone deacetylase 8 | Q9BY41 | This study | 377 | 95 |
| Lysozyme C | P61626 | This study | 148 | 94 |
| Transforming protein RhoA | P61586 | This study | 193 | 94 |
| DNA polymerase beta | P06746 | This study | 335 | 94 |
| Nicotinamide phosphoribosyltransferase | P43490 | This study | 491 | 94 |
| GTPase HRas ⭐️ | P01112 | This study | 189 | 93 |
| Hypoxia-inducible factor 1-alpha inhibitor | Q9NWT6 | This study | 349 | 93 |
| GTPase KRas ⭐️ | P01116 | This study | 189 | 92 |
| Fibroblast growth factor 1 | P05230 | This study | 155 | 91 |
| Interstitial collagenase | P03956 | This study | 469 | 91 |
| Serine/threonine-protein kinase pim-1 | P11309 | This study | 313 | 90 |
| Deoxycytidine kinase ⭐️ | P27707 | This study | 260 | 89 |
| Glycogen synthase kinase-3 beta | P49841 | [1] | 420 | 89 |
| Cyclin-dependent kinase 2 | P24941 | This study | 298 | 88 |
| Beta-secretase 1 | P56817 | This study | 501 | 88 |
| Caspase-3 | P42574 | This study | 277 | 86 |
| Vitamin D3 receptor | P11473 | This study | 427 | 85 |
| Serine/threonine-protein kinase PLK1 | P53350 | This study | 603 | 85 |
| Tyrosine-protein kinase Lck | P06239 | This study | 509 | 84 |
| Tyrosine-protein kinase SYK | P43405 | This study | 635 | 84 |
| Urokinase-type plasminogen activator | P00749 | This study | 431 | 82 |
| Aldo-keto reductase family 1 member B1 | P15121 | This study | 316 | 98 |
| Casein kinase II subunit alpha | P68400 | This study | 391 | 91 |
| Mitogen-activated protein kinase 1 | P28482 | This study | 360 | 91 |
| Mitogen-activated protein kinase 14 | Q16539 | This study | 360 | 89 |
| Macrophage metalloprotease | P39900 | This study | 470 | 88 |
| Peptidyl-prolyl cis-trans isomerase NIMA-interacting 1 | Q13526 | This study | 163 | 93 |
| Renin | P00797 | This study | 406 | 86 |

Table 2. Proteins we analyzed with ProteinCartography.
Proteins we moved forward with for validation are indicated with stars (⭐️).

Future directions

Now that we’ve selected which protein families to focus on for our initial validation, we’re seeking feedback on how we decide which protein clusters to focus on and how to select individual proteins from within clusters. Additionally, we’re beginning to plan how we’ll actually assay biochemical functions for our protein families.

Step 2: Select clusters to focus on

Our ProteinCartography runs for Ras GTPase and dCK generated 12 clusters for each protein family. For our first round of validation, we want to test our foundational hypotheses on only a handful of clusters. We can identify appropriate clusters based on additional information from ProteinCartography. In addition to the Leiden cluster overlay shown in Figure 1, we also get metadata overlays, including overlays that can tell us about the broad taxonomy of the proteins, characteristics like length, and how similar the proteins are to our input proteins. Additionally, we get an analysis that tells us more about the UniProt annotations for proteins in our space, called a semantic analysis.
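To compare clusters side by side, the per-protein metadata behind these overlays can be condensed into one summary row per cluster. The sketch below assumes each protein has a cluster label, a length, and a TM-score to the input protein; the field names and values are hypothetical and do not reflect ProteinCartography's actual output columns.

```python
from collections import defaultdict
from statistics import mean


def summarize_clusters(rows):
    """Return per-cluster size, mean length, and mean TM-score to the input."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["cluster"]].append(row)
    summary = {}
    for cluster, members in grouped.items():
        summary[cluster] = {
            "n_proteins": len(members),
            "mean_length": mean(m["length"] for m in members),
            "mean_tm_to_input": mean(m["tm_to_input"] for m in members),
        }
    return summary


# Hypothetical metadata rows, for illustration only.
rows = [
    {"cluster": "LC00", "length": 260, "tm_to_input": 0.95},
    {"cluster": "LC00", "length": 250, "tm_to_input": 0.85},
    {"cluster": "LC01", "length": 400, "tm_to_input": 0.50},
]

summary = summarize_clusters(rows)
for cluster, stats in sorted(summary.items()):
    print(cluster, stats)
```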

In accompanying pubs, we outline all of this data for both deoxycytidine kinases and Ras GTPases. We’ve selected clusters that we find interesting based on these analyses and request your feedback on deciding which ones to use for our initial validation.

Step 3: Pick individual proteins to bring into the lab 

Once we select which clusters to focus on, we’ll need a plan for selecting individual proteins to bring into the lab. A typical cluster can contain hundreds of individual proteins. Our goal for this first round of validation is to keep the number of proteins we analyze relatively low, so we want to be thoughtful about picking proteins. We’d love your input on ways that we might tackle this challenge.
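As one possible starting point for discussion, and not a method we have settled on, a cluster could be represented by its medoid (the member most similar on average to the other members) plus the member least similar to the input protein. The sketch below assumes pairwise similarity scores are available; the protein names, scores, and function names are hypothetical.

```python
def symmetric(pairs):
    """Expand {(a, b): score} into a nested dict that can be read in both directions."""
    sim = {}
    for (a, b), score in pairs.items():
        sim.setdefault(a, {})[b] = score
        sim.setdefault(b, {})[a] = score
    return sim


def mean_similarity(protein, members, similarity):
    """Mean similarity of one protein to every other member of its cluster."""
    others = [m for m in members if m != protein]
    if not others:
        return 1.0  # a single-member cluster is trivially its own medoid
    return sum(similarity[protein][m] for m in others) / len(others)


def pick_representatives(members, similarity, reference):
    """Return (medoid, member least similar to the reference protein)."""
    medoid = max(members, key=lambda p: mean_similarity(p, members, similarity))
    most_divergent = min(members, key=lambda p: similarity[reference][p])
    return medoid, most_divergent


# Hypothetical cluster members and scores, for illustration only.
similarity = symmetric({
    ("A", "B"): 0.90, ("A", "C"): 0.80, ("B", "C"): 0.60,
    ("REF", "A"): 0.70, ("REF", "B"): 0.50, ("REF", "C"): 0.65,
})

medoid, most_divergent = pick_representatives(["A", "B", "C"], similarity, "REF")
print(medoid, most_divergent)
```

Pairing a central protein with a divergent one tests the cluster at both its core and its edge, at the risk that the divergent pick is an outlier.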

Step 4: Biochemically analyze function across proteins

We have plans in place for purification and simple activity assays, but we’d love to know if there are additional ways to evaluate biochemical or protein-level function that might be useful for validating ProteinCartography.

Summary

We’re working toward validating our ProteinCartography tool by testing two foundational hypotheses: 

  1. Proteins within the same cluster have similar biochemical functions.

  2. Proteins in different clusters have functional differences.

We’re sharing our strategy for validation as we generate it to gather feedback from the community, but also to provide a roadmap for future validation and for how one might use ProteinCartography results.

So far, we’ve addressed our first open question — how to select protein families for validation. Further analyses of these protein families can be found in the accompanying pubs: 

How can we biochemically validate protein function predictions with the…

deoxycytidine kinase family? [10]
Ras GTPase family? [11]

Next, we’ll work to answer the remaining questions, including how we select clusters to test, how we select individual proteins, and how we go about biochemical analyses of these proteins.


Contributors
(A–Z)
Conceptualization, Supervision
Visualization
Conceptualization, Formal Analysis, Investigation, Supervision, Writing
Editing, Formal Analysis, Investigation
Critical Feedback
Editing, Formal Analysis, Investigation
Supervision
Comments (7)
Cody C:

What considerations are being made for accessory domains? Many families of paralogs display discrete functions dictated by their multidomain architecture. Further, are you devising strategies to differentiate true enzymes from pseudo-enzymes from your structure-based clusters? This can be tricky as the catalytic motifs can be cryptic. Good luck with this very useful tool!

Brae M. Bigge:

Hi Cody! One of the limitations of ProteinCartography is that it doesn’t work great for multi-domain proteins because it uses TM-align, which performs global alignment of structures. We’ve considered maybe using different protein representations, like embeddings or shape-mers, which might help us get more info about accessory domains. However, in its current state, ProteinCartography can be used on a pre-defined set of proteins or even a pre-defined set of domains. So, if you had a family of interest, you could obtain the proteins, truncate them to just the domain you’d like to compare and run ProteinCartography on that data to compare the domains of interest. As for your other question, I imagine that the strategy to differentiate true enzymes from pseudo-enzymes will change from family to family. We do often look at binding or active site conservation, but even that can only get you so far. Maybe when it comes to pseudo-enzymes, ProteinCartography is a good starting point, but additional analyses or data should be layered on top. Thanks for your great questions!

Lina Schmidt:

One area that could be especially impactful for drug discovery and repurposing is how ProteinCartography could help in identifying potential off-target interactions for existing drugs. By clustering proteins not only by their structural similarity but also by their functional characteristics, could this toolkit be applied to predict proteins that may have similar functional roles to known drug targets?

Taken together, the computational approach you’ve outlined seems like it could significantly speed up the process of identifying new potential therapeutic targets by narrowing down functional similarities across diverse proteins. I'm curious to hear your thoughts on how ProteinCartography could be leveraged for this purpose, especially in a high-throughput setting.

Congratulations on the impressive work! I am looking forward to seeing how this methodology evolves!

Brae M. Bigge:

Thanks for the comment, Lina! Identifying novel drug targets was one of the guiding ideas behind ProteinCartography, and I think clustering proteins based on additional functional characteristics could provide useful insight. While we haven’t done this, we have overlaid functional information on the map. For example, in this pub, we looked at the conservation of the residues involved in polymerization. This kind of analysis could be used to look for proteins that have similar binding sites or active sites to a target of interest.

Steven Strutt:

Towards pressure testing the clusters generated by the ProteinCartography method, have you considered generating the complementary clusters through sequence homology alone (or other method of comparison) and testing protein sequences that cluster differentially between the methods? Are there super/subpopulations of proteins that are revealed uniquely by the ProteinCartography approach? I would imagine information here could also inform the genetic/structural constraints on protein evolution.

Brae M. Bigge:

Thanks for the comment! We have thought quite a bit about comparing structure-based clustering to clustering based on other things, like sequence. I really like the idea of evaluating the proteins that cluster differently between methods. I think that could tell us a lot about sequence-structure-function relationships and could help us uncover some of the benefits and drawbacks to particular methods.

Kyle Glockzin:

Can the ProteinCartography system differentiate promiscuous proteins that have multiple functions? If so, how effective is this system in separating these types of proteins?

Brae M. Bigge:

Thanks for the question! I don’t know that we’ve done any experiments to specifically test this idea, but it could be a great test of what ProteinCartography can do. Do you have any suggestions for short, structured promiscuous proteins that might be a good starting point?

Leah Schaffer:

I would be interested how these clusters overlap with known structures; it would be interesting to focus follow-up analyses on proteins that are novel members of known complexes. I would also be interested in how other data for protein structure and function overlap with these results, such as AP-MS protein interaction network or CRISPR-based functional screens.

Brae M. Bigge:

Thanks for the comment, Leah! We are interested in seeing how known structures cluster alongside the AlphaFold structures ProteinCartography currently uses. I think that could help us gain confidence in the AlphaFold structures themselves. We’re also very interested in protein-protein interactions in general. To look into this in the current version of ProteinCartography, we can create custom overlays for any data that we want, including things like AP-MS data or functional screen data. While they aren’t related to protein-protein interactions, there are some examples of these custom overlays in our pubs about actin, deoxycytidine kinases, and Ras GTPases. The problem that we often run into is data availability, but if we chose a good starting protein family, it could be cool to overlay experimental data related to protein-protein interactions.

Adam Pratt:

What are the implications of the recent paper “The known protein universe is phylogenetically biased” for this work? Given that it is now known that AlphaFold is phylogenetically biased and ProteinCartography relies fundamentally on AlphaFold, I would think there would be fundamental biases present (perhaps detectable?) in ProteinCartography. One possible solution to this issue is the final suggestion in the more recent paper: incorporating phylogenetic data into the models. Incorporating such data into AlphaFold would likely solve the problem without modifying ProteinCartography.

Brae M. Bigge:

Thanks, Adam! This is a great point. Because ProteinCartography relies on predicted structures, as you mentioned, it likely would benefit by having more phylogenetically balanced models predict the compared structures. This is definitely something to keep in mind when using tools like ProteinCartography that utilize predicted structures. But, I think even with the bias, we can still learn useful things from these tools, as long as we remember that the results are predictions/hypotheses themselves and are meant to be used for exploration along with other analyses (like phylogenetic analysis). We’re hoping to demonstrate that here and in some upcoming pubs where we pair ProteinCartography results with biochemical analysis.

I do think one interesting experiment related to this could be to try using an AlphaFold-type model with phylogenetic data incorporated to fold proteins and run ProteinCartography from “Cluster mode”. Comparing this to a typical ProteinCartography analysis using regular AlphaFold could be an interesting way to evaluate the effects of incorporating phylogenetic data into structure prediction models, especially at the protein family level. 

Erle Holgersen:

This was a great read, thanks for sharing!

For the selection of proteins within each cluster, my suggestion would be to aim for sequence diversity. If the shortest protein in the cluster and other proteins with relatively low sequence similarity are shown to have the same function as the human protein, I’d be inclined to think that the clustering is picking up the key function of the proteins. The risk is you might end up picking cluster outliers and falsely debunking the hypothesis, so maybe a compromise could be to select a couple of proteins with low similarity to human, and 1-2 that are mostly similar.

Brae M. Bigge:

Thanks for the comment! Trying to capture some sequence diversity with the proteins we select to bring into the lab is a great idea. We have a new pub coming soon where we’ve tried to do something like this with the deoxycytidine kinase family, so stay tuned for that work!