Structure-based protein clustering sometimes, but not always, provides insight into protein function

Audrey Bell; Brae M. Bigge; Feridun Mert Celebi; Megan L. Hochstrasser; Atanas Radkov; Ryan York

doi:10.57844/arcadia-a757-3651

Published on Feb 14, 2025 by Arcadia Science

We asked whether ProteinCartography’s structure-based protein clustering reflects functional features of proteins. We found that proteins often clustered with functionally similar proteins, but not always.

Purpose

ProteinCartography is a tool for structurally comparing, clustering, and mapping protein families [1]. It relies on the idea that structure and function are closely linked, an idea that we tested in this analysis. Our foundational hypotheses are that ProteinCartography will cluster functionally similar proteins together while sorting functionally distinct proteins into different clusters based on structural similarities. Here, we test these hypotheses using in vitro data to help give ProteinCartography users some idea of how well clustering aligns with function and when they should confidently use ProteinCartography results.

Building on our previous work, we investigated this by analyzing biochemically characterized deoxycytidine kinases (dCK), proteins that convert deoxynucleosides to their monophosphate form. We evaluated publicly available biochemical data for 34 dCK homologs, and we biochemically characterized four novel proteins from two specific ProteinCartography clusters [2]. We tested the enzymatic activity of each protein against five different substrates. We also noted their general characteristics throughout the purification. We used these data to evaluate ProteinCartography, but we hope they are also useful to scientists studying dCK and related proteins.

We found that ProteinCartography, which uses global structural alignment, is able to sort proteins into clusters based on their enzymatic function, but it does not always do so. For example, proteins annotated as thymidine kinase that act on deoxythymidine all populate a single cluster. However, while proteins annotated as dCK all cluster together, they don’t share all functions. This is likely related to how ProteinCartography compares and clusters proteins, something that we’re interested in exploring more, and highlights the importance of combining analyses like these with other analyses to learn more about protein function.

This pub is part of the platform effort, “Annotation: Mapping the functional landscape of protein families across biology.” Visit the platform narrative for more background and context.
This pub is part of our validation strategy series of pubs that starts with “A strategy to validate protein function predictions in vitro” [3]. Our original ProteinCartography results for the deoxycytidine kinase family can be found in “How can we biochemically validate protein function predictions with the deoxycytidine kinase family?” [2].
Data from this pub, including ProteinCartography results, expression constructs, purification data and images, and individual protein selection data, is available on Zenodo.
All associated code is available in this GitHub repository.

Share your thoughts!

Feel free to provide feedback by commenting in the box at the bottom of this page or by posting about this work on social media. Please make all feedback public so other readers can benefit from the discussion.

Background

What is ProteinCartography?

ProteinCartography (RRID: SCR_027230) is a structure-based protein clustering tool designed to compare protein structures from a single family across multiple species [1]. It identifies proteins similar to an input and compares the structures to produce an interactive map with clustering information. To see whether the results of ProteinCartography can be used to infer functional relationships, we decided to test the two foundational hypotheses underlying ProteinCartography — that proteins within a cluster have similar functions, and proteins in different clusters have differing functions [3]. We chose two model protein families for biochemical testing, one of which is deoxycytidine kinase [2].

What is deoxycytidine kinase (dCK)?

The nucleoside kinase dCK is involved in producing DNA synthesis precursors [4]. It phosphorylates deoxycytidine (dC) into deoxycytidine monophosphate (dCMP) and can also convert deoxyadenosine (dA) and deoxyguanosine (dG) into their monophosphate forms [5]. Human dCK activates several nucleoside analog prodrugs, including anticancer and antiviral drugs [4]. While much is known about the human dCK, non-human homologs present an intriguing area of study due to their potentially distinct properties that could enhance anticancer and antiviral therapies.

Using ProteinCartography to investigate the dCK family

In a previous pub [2], we ran ProteinCartography using the human dCK (UniProt ID: P27707). The analysis produced well-defined clusters, with our input protein in cluster 4 (see Figures 1 and 2 in [2]). We identified three additional clusters that we thought were interesting — clusters 2, 8, and 9. We're excited about these clusters because cluster 2 contains almost exclusively plant proteins with long disordered regions, while the proteins in clusters 8 and 9 come from a diverse set of species but show a high structural divergence from the human dCK protein.

In this pub, we’ll tell you how and why we selected specific proteins and clusters for in vitro testing, we’ll correlate existing biochemical data for the dCK family with our ProteinCartography results, and we’ll present new activity data from the dCK enzymes in the clusters we selected. Finally, we’ll talk through the implications of our results for ProteinCartography predictions.

SHOW ME THE DATA: The data associated with this pub, including ProteinCartography results, expression constructs, purification data and images, and individual protein selection data, are in this Zenodo repository (DOI: 10.5281/zenodo.14517344).

The approach

To evaluate the ProteinCartography results for the dCK family, we compiled data on the activity of various dCK homologs. First, we looked to the literature, where we found activity data for 34 dCK proteins. Additionally, to test the results of ProteinCartography using data we generated, we selected a handful of proteins from two specific clusters to evaluate in vitro. Using the data from the literature and our generated data, we reviewed our original hypotheses related to ProteinCartography. The subsections below describe each of these steps.

Obtaining data from the literature

Before selecting individual proteins for laboratory study, we conducted a literature review to identify biochemically characterized dCK homologs that could help us evaluate ProteinCartography. We found a review article containing biochemical information for 34 dCK proteins [6] (Table 1 and Figure 1). We cross-referenced these proteins with our ProteinCartography clusters by re-running our initial analysis using “Cluster” mode and including the biochemically characterized proteins from the literature as key proteins. We used version 0.5.0 of the pipeline for this analysis. We generated a heatmap and Sankey plot to visualize the data (Figure 2).

The code for these visualizations is in this GitHub repository.

Selecting proteins for in vitro analysis

To select individual proteins to bring into the lab, we first identified representative proteins for each cluster. Using the all-v-all structural similarity matrix generated by ProteinCartography, we selected the protein from each cluster that had the highest similarity to every other protein in its cluster.
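
The selection rule above can be sketched in a few lines. This is a minimal illustration, not the pub's actual script: it reads "highest similarity to every other protein" as the highest mean similarity to the other cluster members (an assumption), and the function name `pick_representative`, the protein IDs, and the scores are all hypothetical.

```python
def pick_representative(similarity, members):
    """Return the member with the highest mean similarity to the other members.

    similarity: nested dict, similarity[a][b] -> score (assumed symmetric)
    members: list of protein IDs belonging to one cluster
    """
    if len(members) == 1:
        return members[0]

    def mean_to_others(protein):
        others = [m for m in members if m != protein]
        return sum(similarity[protein][o] for o in others) / len(others)

    return max(members, key=mean_to_others)


# Hypothetical toy scores, not real ProteinCartography output.
toy_similarity = {
    "protA": {"protA": 1.0, "protB": 0.8, "protC": 0.7},
    "protB": {"protA": 0.8, "protB": 1.0, "protC": 0.4},
    "protC": {"protA": 0.7, "protB": 0.4, "protC": 1.0},
}
representative = pick_representative(toy_similarity, ["protA", "protB", "protC"])
print(representative)
```

The same function could be applied to sub-cluster membership lists, since the rule described for sub-clusters is identical.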

Next, to ensure diversity in our representatives, we sub-clustered our clusters of interest. We used the Elbow method to determine the optimal number of clusters [7] and Scikit-Learn’s k-means clustering algorithm for sub-clustering. To find the representatives for each sub-cluster, we selected the protein with the highest similarity to every other protein within its sub-cluster.
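
As a rough sketch of the elbow step: the within-cluster sum of squares (available as the `inertia_` attribute after fitting scikit-learn's `KMeans`) is computed for a range of k values, and the k where the curve bends most sharply is chosen. The pub doesn't state how the elbow was read off, so the largest-second-difference heuristic below is an assumption, and the inertia values are hypothetical placeholders.

```python
def elbow_k(ks, inertias):
    """Pick the k with the sharpest bend in the inertia curve.

    Uses the largest second difference of inertia, which assumes the
    candidate k values are evenly spaced. This is one common heuristic,
    not necessarily the one used in the pub.
    """
    best_k, best_bend = None, float("-inf")
    for i in range(1, len(ks) - 1):
        bend = inertias[i - 1] - 2 * inertias[i] + inertias[i + 1]
        if bend > best_bend:
            best_k, best_bend = ks[i], bend
    return best_k


# Hypothetical inertia values for k = 1..6 (placeholders, not real data).
candidate_ks = [1, 2, 3, 4, 5, 6]
toy_inertias = [100.0, 60.0, 30.0, 25.0, 22.0, 20.0]
chosen_k = elbow_k(candidate_ks, toy_inertias)
print(chosen_k)
```

In practice the elbow is often confirmed by eye on a plot of inertia against k, since automated heuristics can disagree on noisy curves.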

Finally, we confirmed that the proteins we selected would be soluble using the web server for Protein-Sol [8] (Table 2).

All of the scripts are available in this GitHub repository and the resulting data are available in this Zenodo repository.

Purifying selected dCK proteins

We based our expression and purification protocol on a previously successful protocol for human dCK purification [9].

Cloning

We had the codon-optimized sequences for our proteins of interest synthesized and cloned into the pET28a(+) vector for E. coli expression by Twist Bioscience. The constructs include an N-terminal 6×His tag, as well as a Human Rhinovirus (HRV) 3C cut site. We ultimately didn’t use the cut site, as the purified proteins were active with the tag. We’ve included our protein expression constructs in this Zenodo repository.

Induction

We transformed the constructs into E. coli BL21 (DE3) cells (NEB - C2527H) and incubated overnight at 37 °C in 9 mL YT media with 50 µg/mL kanamycin. The following day, we added 10 mL of the overnight culture to 1 L (4 L for the bc-dNK) of fresh YT media with 50 µg/mL kanamycin. We incubated the cells with shaking until the OD600 reached about 0.6, after which we induced expression with 0.1 mM IPTG. After 4 hours of shaking at 37 °C, we collected the cells via centrifugation at 4,000 × g for 15 minutes and snap-froze the pellets in liquid nitrogen before storing them at −80 °C. Every protein other than human dCK gave a lower yield than human dCK. Yield could likely be increased by optimizing purification buffer conditions, for example, adding a protease inhibitor or altering the pH to account for the pI. The almond protein, pd-dNK, showed signs of degradation during purification, but we were able to get enough active protein to analyze.

Lysis

We thawed the cell pellets and resuspended them in a resuspension buffer containing 50 mM HEPES, pH 7.5, and 500 mM NaCl (pH adjusted with NaOH). We sonicated (VWR 76193-590) the cells, keeping them on ice, using a ⅜-inch horn at 50% amplitude for 2 minutes, with 20-second on and 20-second off intervals, for three total cycles. We clarified the lysate by centrifuging for 15 minutes (or longer if necessary) at 19,000 × g at 10 °C.

Purification

For affinity chromatography, we used a 1 mL HisTrapFF column (Cytiva - 17-5319-01), an AKTA system, and the resuspension buffer described above as our wash buffer and as our elution buffer (with 200 mM imidazole). We combined the elution fractions from the affinity run and concentrated them to 1 mL using a 10 kDa MWCO Pierce concentrator (ThermoFisher - 88528) before injecting the sample onto the size exclusion chromatography column (HiPrep 16/60 Sephacryl S-200 HR; Cytiva - 17116601). The buffer we used for size exclusion chromatography contained 20 mM HEPES, pH 7.5, 200 mM sodium citrate, 2 mM EDTA (pH adjusted with 10 M NaOH). Chromatograms (affinity and size exclusion) are in the Zenodo repository.

Checking concentration and purity

We evaluated protein concentration with a Bradford assay reagent kit (Thermofisher - 23236) and a SpectraMax iD3 plate reader at 595 nm. We prepared a standard curve of bovine serum albumin (Thermofisher - 23209) with eight different standard protein concentrations ranging from 0 µg/mL to 2,000 µg/mL.
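
To make the standard-curve step concrete, here is a minimal sketch that fits a straight line to the standards and reads an unknown off it. The intermediate standard concentrations and all absorbance values are hypothetical placeholders (only the 0 to 2,000 µg/mL range and the count of eight come from the text), and a linear fit is itself an assumption: Bradford responses often curve at the high end, so a different fit may be more appropriate for real data.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def concentration_from_absorbance(a595, slope, intercept):
    """Invert the standard curve to estimate concentration in µg/mL."""
    return (a595 - intercept) / slope


# Eight hypothetical BSA standards (µg/mL) and made-up A595 readings.
standards_ug_ml = [0, 125, 250, 500, 750, 1000, 1500, 2000]
absorbances = [0.30, 0.3625, 0.425, 0.55, 0.675, 0.80, 1.05, 1.30]

slope, intercept = fit_line(standards_ug_ml, absorbances)
estimate = concentration_from_absorbance(0.70, slope, intercept)
print(round(estimate, 1))
```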

We confirmed protein identity and purity with gel electrophoresis. We used Any kD™ Mini-PROTEAN® TGX™ precast protein Gels, 15 wells (Bio-Rad - 4569036) and the Precision Plus Protein™ Dual Color Standards as the molecular weight ladder (Bio-Rad - 1610394). We prepared the 1× running buffer from a commercial stock (10× Tris/Glycine/SDS buffer; Bio-Rad - 1610732). We ran the gels for 30 minutes at a constant voltage of 200 V in a tetra electrophoresis chamber (Bio-Rad - 1658004). We stained the gels using commercial Coomassie solutions (Bio-Rad - 1610436 and 1610438). We imaged the final destained gel using an Azure 600 gel imaging system.

For the western blot, we transferred protein to nitrocellulose membranes (Bio-Rad - 1620112) using a Trans-Blot Turbo transfer system (Bio-Rad - 1704150) and the built-in StandardSD method. We prepared the 1× transfer buffer from a commercial stock (10× Tris/Glycine buffer; Bio-Rad - 1610734), with 20% (vol/vol) final concentration of methanol (VWR - BDH1135-4LG). We blocked the membrane with a commercial casein solution (1× Tris Buffered Saline with 1% Casein; Bio-Rad - 1610782), containing 0.1% (vol/vol) Tween-20 (Bio-Rad - 1610781), for 30 minutes at room temperature with shaking. We incubated the blot at room temperature with shaking, first with a primary anti-His antibody at 1:2,000 dilution (Histidine Tag Antibody | AD1.1.10; Bio-Rad - MCA1396) and then with a secondary antibody at 1:5,000 dilution (Goat-anti-mouse IgG (H+L), HRP conjugate; Advansta - R-05071-500). After the transfer and the blocking step, between the two antibody incubation steps, and after the secondary antibody incubation, we rinsed the membrane several times with a 1× buffer, prepared from a commercial stock (10× Tris Buffered Saline; Bio-Rad - 1706435) containing 0.1% (vol/vol) final concentration Tween-20 (Bio-Rad - 1610781). We visualized the protein using the WesternBright ECL-HRP Substrate (Advansta - K-12045-D20) with the Azure 600 gel imaging system.

Assessing biochemical activity of dCK proteins

We assessed activity of the protein with the Kinase-Glo® luminescent kinase assay kit from Promega (V6071). We prepared the luminescence reagent that we added to each assay according to the manufacturer’s instructions. We did each assay in a 50 µl total volume, containing 40 µl of enzyme and 5 µl of dN substrate at 500 µM final concentration [Cayman Chemical; dC - 34708; dG - 9002864; dA - 27315; thymidine (dT) - 20519; deoxyuridine (dU) - 27803], and 5 µl of ATP, also at 500 µM final concentration (Cayman Chemical - 14498). For each assay, we used 0.4 mg/mL final protein concentration. We incubated the reactions at room temperature without shaking for 60 minutes, after which we added 50 µl of the luminescence reagent and incubated for 10 minutes at room temperature. We measured the outputs with a SpectraMax iD3 plate reader (integration: 1,000 ms and read height: 1 mm). We performed the assays for each protein and each deoxynucleoside in triplicate. We calculated enzyme activity as the luminescence signal per minute per mg protein and normalized the activity values so that the sample with the highest enzyme activity was set to 100%.
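
The arithmetic behind this setup can be sketched as follows. The stock concentrations are not stated in the text; they are what C1V1 = C2V2 implies if the quoted concentrations are final concentrations in the 50 µl reaction (an assumption). The Kinase-Glo readout reports remaining ATP, and the pub doesn't spell out here how raw luminescence was converted into a signal, so `signal` below is a placeholder and the example values are hypothetical.

```python
TOTAL_UL = 50.0  # total reaction volume from the text


def stock_needed(final_conc, added_ul, total_ul=TOTAL_UL):
    """Stock concentration required so added_ul gives final_conc in total_ul."""
    return final_conc * total_ul / added_ul


substrate_stock_um = stock_needed(500.0, 5.0)   # 5 µl of dN or ATP per reaction
enzyme_stock_mg_ml = stock_needed(0.4, 40.0)    # 40 µl of enzyme per reaction
protein_mg = 0.4 * TOTAL_UL / 1000.0            # mg of protein in one reaction


def activity(signal, minutes=60.0, mg=protein_mg):
    """Signal per minute per mg protein, as described in the text."""
    return signal / minutes / mg


def normalize_to_max(values):
    """Scale so the highest activity is 100%."""
    top = max(values.values())
    return {name: 100.0 * value / top for name, value in values.items()}


# Hypothetical signals for three substrates (placeholders, not measured data).
toy_signals = {"dC": 12000.0, "dA": 6000.0, "dG": 3000.0}
toy_activities = {name: activity(s) for name, s in toy_signals.items()}
toy_percent = normalize_to_max(toy_activities)
print(substrate_stock_um, enzyme_stock_mg_ml, protein_mg)
print(toy_percent)
```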

Additional methods

We used ChatGPT to write some of the text, as well as to suggest wording ideas and then chose which small phrases or sentence structure ideas to use. We also used ChatGPT to help critique, clarify, and streamline text that we wrote.

We generated figures in this pub using code in the arcadia-pycolor GitHub repo [10].

All code we generated and used for this pub is available in this GitHub repository (DOI: 10.5281/zenodo.14814709), including scripts used to evaluate the literature and identify representative proteins. Additionally, all data we generated for this pub is available in this Zenodo repository.

The results

We evaluated our ProteinCartography results with data from the literature and data we generated in-house.

SHOW ME THE DATA: The data associated with this pub, including ProteinCartography results, expression constructs, purification data and images, and individual protein selection data can be found in this Zenodo repository.

Biochemically characterized dCK proteins from the literature demonstrate the utility of ProteinCartography clustering

Before selecting dCK proteins to characterize in vitro, we searched the literature for pre-existing enzymatic data. The review article “Non-Viral Deoxyribonucleoside Kinases – Diversity and Practical Use” [6] includes biochemical data for 34 dCK enzymes, listed in Table 1, that we used to evaluate ProteinCartography (Figure 1).

ProteinCartography results for dCK family.

The interactive UMAP for the dCK family produced using ProteinCartography with the 34 additional proteins from the literature added. The human protein is in LC04.

Members of the dCK family are known to act on multiple deoxynucleosides (dNs), a distinguishing feature of proteins in this family (Figure 2). Sixteen of the 34 characterized enzymes are annotated as thymidine kinase (TK, TK1, TK1a, TK1b, TK2) and show high activity towards dT (Figure 2). The one exception is the Xenopus laevis enzyme, annotated as TK2, which shows the highest activity towards dC. Of the five biochemically characterized dCK proteins, three are annotated as dCK and have a dominant activity towards dC, while two are annotated as dCK2 and have a dominant activity towards dG (Figure 2). The dAKs and dGKs generally have matching annotations and biochemical activity (Figure 2). The dNKs show broad specificity towards two or more deoxynucleosides.

**ProteinCartography’s structure-based clustering largely reflects the function of characterized dCK proteins.** Sankey diagram showing the relationship between enzymatic activity, annotation, and ProteinCartography cluster for the proteins from the literature. In general, enzymatic activity appears to align well with the clusters.
Normalized enzymatic activity of characterized dCK proteins compiled in this review article [6] is shown in the heatmap on the left. Enzymatic activity refers to catalytic efficiency (k_cat/K_m). The data are normalized so that the highest activity for each protein is set at 100%. These proteins were sorted into ProteinCartography clusters on the right. A Sankey plot connects the enzymatic activity to the clustering. Each cluster is represented by a box, the color of which matches the cluster colors in Figure 1.

Organism	Annotation	UniProt ID	Cluster
Homo sapiens (human)	TK2	O00142	0
Xenopus laevis (African clawed frog)	TK2	Q8UVZ9	0
Bombyx mori (silk moth)	dNK	Q9BKL3	0
Arabidopsis thaliana (thale cress)	dNK	A0A654ENJ6	2
Anopheles gambiae (African malaria mosquito)	dNK	Q86LB8	3
Drosophila melanogaster (fruit fly)	dNK	Q9XZT6	3
Dictyostelium discoideum (social amoeba)	dAK	Q54YL2	3
Bacillus cereus	dGK	Q81JC3	3
Mycoplasma mycoides subsp. mycoides SC (mycoplasma)	dAK	Q93IG4	3
Flavobacterium psychrophilum	dAK	A6GWA3	3
Polaribacter sp. MED 152	dAK	A2U3R9	3
Homo sapiens (human)	dCK	P27707	4
Gallus gallus (chicken)	dCK	Q5ZMF3	4
Gallus gallus (chicken)	dCK2	Q5ZJM7	4
Xenopus laevis (African clawed frog)	dCK	A0A1L8HV70	4
Xenopus laevis (African clawed frog)	dCK2	Q6DD33	4
Homo sapiens (human)	dGK	Q16854	5
Xenopus laevis (African clawed frog)	dGK	Q6GPW6	5
Dictyostelium discoideum (social amoeba)	dGK	Q54UT2	6
Bacillus cereus	dAK	Q0H0H5	6
Homo sapiens (human)	TK1	P04183	7
Gallus gallus (chicken)	TK1	P04047	7
Xenopus tropicalis (Western clawed frog)	TK1	Q5I0A2	7
Caenorhabditis elegans (roundworm)	TK1	F3Y5P8	7
Arabidopsis thaliana (thale cress)	TK1a	Q9S750	7
Arabidopsis thaliana (thale cress)	TK1b	F4KBF5	7
Dictyostelium discoideum (social amoeba)	TK1	Q27564	7
Escherichia coli	TK1	P23331	7
Salmonella enterica	TK1	Q7CQF3	7
Bacillus anthracis	TK1	Q81JX0	7
Bacillus cereus	TK1	Q0H0H6	7
Ureaplasma parvum serovar 3 (mycoplasma)	TK1	Q9PPP5	7
Flavobacterium psychrophilum	TK1	A6GYI4	7
Polaribacter sp. MED 152	TK	A2TYX7	7

dCK proteins in the established dataset we further analyzed in this pub.

A similar table first appeared in this review article.

Next, we investigated where these proteins fall within our ProteinCartography map. Cluster 4, which contains the human dCK protein, also contains the other characterized dCK proteins (Figure 1, Figure 2, and Table 1). The two dCK proteins, dCK and dCK2, do act on different dNs according to the biochemical data, but perhaps share enough of their structure that they still cluster together. There’s a tight cluster of TK1 proteins in cluster 7, but proteins annotated as TK2 fall into a separate cluster (Figure 1, Figure 2, and Table 1). This aligns well with the function of these proteins as TK1s act almost exclusively on dTs, while TK2s act on dTs and dGs (Figure 1 and Figure 2). This suggests that in some cases ProteinCartography is able to sort proteins into structure-based clusters that do reflect some function.

The results are less straightforward for the dAKs and dGKs. Four of the five dAK proteins are in cluster 3, with the other dAK falling into another cluster (Figure 1, Figure 2, and Table 1). There are small structural differences between this dAK and the others in the cluster, mostly around the N and C termini, but we found no clear functional reason for them to cluster separately. Proteins annotated as dGK are distributed between three clusters with two proteins landing in cluster 5 (Figure 1, Figure 2, and Table 1). This is interesting as dGKs do seem to exclusively act on dG (Figure 1 and Figure 2). Perhaps there are structural differences in these proteins that don't affect substrate specificity. The dNKs, whose activity varies, are distributed into three clusters as well, with two landing together in cluster 3 with the dAKs (Figure 1, Figure 2, and Table 1). Interestingly, although the protein from Bombyx mori is annotated as a dNK, it’s sorted into cluster 0 with the TK2 proteins, which reflects its function (Figure 1 and Figure 2). Together, this provides evidence that while ProteinCartography can separate proteins based on function, it doesn't always do so. However, it's possible that there are other functional differences between these proteins, beyond enzymatic activity, that are not reflected in this analysis.
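
The cluster counts quoted in the last two paragraphs can be checked directly against Table 1. The sketch below transcribes only the annotation and cluster columns of that table and tallies them; the variable names are ours, and this is an illustration rather than the pub's analysis code.

```python
from collections import Counter, defaultdict

# (annotation, cluster) pairs transcribed from Table 1, in table order.
table1 = (
    [("TK2", 0), ("TK2", 0), ("dNK", 0), ("dNK", 2), ("dNK", 3), ("dNK", 3)]
    + [("dAK", 3), ("dGK", 3), ("dAK", 3), ("dAK", 3), ("dAK", 3)]
    + [("dCK", 4), ("dCK", 4), ("dCK2", 4), ("dCK", 4), ("dCK2", 4)]
    + [("dGK", 5), ("dGK", 5), ("dGK", 6), ("dAK", 6)]
    + [("TK1", 7)] * 11
    + [("TK1a", 7), ("TK1b", 7), ("TK", 7)]
)

# Tally how many proteins with each annotation land in each cluster.
clusters_by_annotation = defaultdict(Counter)
for annotation, cluster in table1:
    clusters_by_annotation[annotation][cluster] += 1

for annotation in sorted(clusters_by_annotation):
    print(annotation, dict(clusters_by_annotation[annotation]))
```

The tally shows the dAKs split four to one between clusters 3 and 6, the dGKs spread over clusters 3, 5, and 6, and the dNKs spread over clusters 0, 2, and 3, matching the description above.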

In vitro analysis of dCK proteins further highlights ProteinCartography’s utility

The previously characterized proteins show that ProteinCartography can sort proteins into clusters based on function, but they also show that there are cases when it doesn't. We wondered if there were additional functions beyond enzymatic activity that might better align with clustering. For the 34 proteins from the literature, we only had enzymatic data. We hoped working with the proteins ourselves might allow us to look at other functions that could lead to differences in structures and, therefore, clustering. Additionally, purifying and analyzing proteins in the lab allowed us to directly compare proteins purified in the same lab, using the same purification strategy and the exact same assay conditions. The next section describes how we chose which dCK enzymes to test ourselves.

Selecting clusters and individual proteins for biochemical characterization

We planned to directly compare proteins within the same cluster and proteins in different clusters. Which clusters we chose for this comparison wasn’t necessarily important, so we selected clusters that were interesting for reasons beyond ProteinCartography validation. We previously identified [2] three interesting clusters in addition to the cluster containing our input protein (cluster 4): clusters 2, 8, and 9 (see Figure 2 in “How can we biochemically validate protein function predictions with the deoxycytidine kinase family?”). We polled the Twitter/X community to help us select a single cluster. We decided to focus on LC02, the top choice in the Twitter/X poll, and the input-containing cluster, LC04, for our subsequent protein selection and validation studies of the ProteinCartography results. Thanks to all who voted!

Cluster 2 almost exclusively contains plant proteins that are longer than the human protein, with disordered regions at the N- or C-terminus. However, the part of the plant proteins that aligns well with the human dCK protein is well-structured and quite conserved. Only one of these proteins appears among the previously biochemically characterized proteins described in the previous section. We also selected proteins from cluster 4 because this cluster contains our input protein. We can test the hypothesis that proteins within a cluster function similarly by comparing the input human dCK to another protein in this cluster and by comparing the proteins in cluster 2 to each other. Comparing proteins from these two clusters should let us test our hypothesis that proteins from different clusters have distinct activities.

We selected representatives for each cluster by identifying the protein with the highest similarity to every other protein in the cluster. We also sub-clustered the clusters of interest to select additional proteins that are more representative of the diversity of the larger clusters. We evaluated the solubility and the predicted isoelectric point (pI) of each representative protein. The proteins in Table 2 are the representatives identified. One of the cluster 2 sub-clusters presented a representative that was predicted to be insoluble. Therefore, we substituted in the Rickettsiales protein for this sub-cluster, which is unique in that it’s one of the only proteins in this cluster not from plants.

UniProt ID	Cluster	Annotation	Organism	Predicted molecular weight	Predicted scaled solubility	Isoelectric point (pI)
A0A4Y1QVV5	LC02	P-loop containing nucleoside triphosphate hydrolases superfamily protein (pd-dNK)	Prunus dulcis (almond)	55.0 kDa	0.373	6.95
A0A3P6ASY1	LC02	Deoxynucleoside kinase domain-containing protein (bc-dNK)	Brassica campestris (field mustard)	27.5 kDa	0.484	6.45
A0A2A5BCG8	LC02	Deoxynucleoside kinase domain-containing protein (rb-dNK)	Unidentified Rickettsiales bacterium	23.2 kDa	0.465	5.37
A0A7J5YK87	LC04	Deoxynucleoside kinase domain-containing protein (dm-dCK)	Dissostichus mawsoni (Antarctic toothfish)	29.2 kDa	0.507	5.14
P27707	LC04	Deoxynucleoside kinase	Homo sapiens (human)	33.0 kDa	0.560	5.57

Selected proteins for in-lab analysis.

We determined the predicted scaled solubility on the Protein-Sol website, where higher values indicate higher predicted solubility. The average protein in E. coli has a predicted scaled solubility of 0.45.

The human dCK shares some, but not all, functions with a protein from its cluster

To start, we purified the human dCK protein using a published expression and purification protocol [9] and confirmed that it exists as a dimer in its native state (Figure 3, A and C). The kinase activity of our purified human dCK closely matched its reported activity, acting primarily on dC and less so on dA and dG [11] (Figure 3, B).

To determine if proteins within a structure-based cluster share biochemical functions, we purified and analyzed the Antarctic toothfish (Dissostichus mawsoni) dCK (dm-dCK), which resides in the same cluster as the human dCK (Figure 3, A). We found that dm-dCK, like human dCK, behaved as a dimer (Figure 3, C). If we think of assembly of monomers into an oligomeric form as another function of this protein, this can be counted as another instance where proteins within a ProteinCartography cluster share functions. The Antarctic toothfish protein, dm-dCK, showed similar activity to our input protein against a comparable selection of deoxynucleoside substrates, with the exception of the activity towards dT (Figure 3, B). While the human dCK enzyme didn't show any activity towards dT, the dm-dCK did (Figure 3, B).

These results lend support to our hypothesis while also generating some questions. Functions conserved between the two proteins from cluster 4, human dCK and dm-dCK, include behavior as a dimer, enriched activity towards dC, and lesser activity towards dA and dG. A function not conserved between the two proteins is the activity towards dT. This suggests that while ProteinCartography can separate proteins based on function, it doesn’t separate on every function. This is expected, as proteins are complex and perform many different functions.

**The human dCK and the Antarctic toothfish protein from cluster 4 share similar functions.** Protein purification and activity data for the proteins in cluster 4. (A) shows the UMAP with the selected proteins highlighted. (B) is a heatmap with the enzymatic activity of the human and Antarctic toothfish proteins showing that they both have the highest activity on dC. (C) contains size exclusion chromatography results for each protein demonstrating that they each exist as a dimer.
(A) We first analyzed the cluster containing our input protein, the human dCK (P27707). This cluster also contains the Antarctic toothfish protein (A0A7J5YK87).
(B) We measured kinase activity for the human dCK and the Antarctic toothfish protein using five substrates. We calculated enzyme activity as the luminescence signal per minute per mg protein. We set the highest activity to 100% and normalized the data accordingly. We also show that our measured human data match those from the literature.
(C) Size exclusion chromatography results show that both the human and Antarctic toothfish dCK proteins form dimers. The graph on the left shows the analyzed commercial standards that we used to estimate the weight of the purified human dCK protein. The size exclusion data and all accompanying gels and western blots from the purification are on Zenodo. Additionally, gels can be found in Supplementary Figure 1.

We also selected three proteins from cluster 2 to purify and analyze: the almond (Prunus dulcis) dNK (pd-dNK), the field mustard (Brassica campestris) dNK (bc-dNK), and a dNK from a Rickettsiales bacterium (rb-dNK) (Table 2). As we did with human dCK and dm-dCK, we compared the functions of these three proteins to test whether proteins in the same cluster share functions (Figure 4, A).

All three proteins from this cluster eluted from the size exclusion run at very high molecular weights, indicating that they either form multimers or aggregate, but each protein was active after purification (Figure 4, C). This oligomerization could be considered a function that’s shared by proteins within a cluster. Two of the proteins in cluster 2, rb-dNK and pd-dNK, had high activities towards all of the tested deoxynucleosides, including dU (Figure 4, B). The final protein from this cluster, bc-dNK, had the highest activity towards dG but also acted on dT, dC, and dU (Figure 4, B).

Similar to the comparison we made with the proteins in cluster 4, some functions are conserved between all three proteins, while some are not. All three proteins form some higher-order multimer and act on multiple deoxynucleosides. However, rb-dNK and pd-dNK seem to act less specifically than bc-dNK. These results support the idea that ProteinCartography can separate proteins based on some, but not all, of their functions.

**Three proteins from cluster 2 share similar functions.** Protein purification and activity data for the proteins in cluster 2. (A) shows the UMAP with the three selected proteins highlighted. (B) is a heatmap with the enzymatic activity of the proteins showing that they all act on multiple substrates, though not always the same ones. (C) contains size exclusion chromatography results for each protein demonstrating that they each exist as a multimer or aggregate of some sort.
(A) We next analyzed cluster 2, which contains primarily plant proteins. We specifically looked at the almond protein (A0A4Y1QVV5), the field mustard protein (A0A3P6ASY1), and the Rickettsiales protein (A0A2A5BCG8).
(B) We measured kinase activity for each enzyme using five substrates. We calculated enzyme activity as the luminescence signal per minute per mg protein. We set the highest activity to 100% and normalized the data accordingly.
(C) Size exclusion chromatography results show that all three proteins elute at a higher than expected molecular weight. The graph on the left shows the analyzed commercial standards. The size exclusion data and all accompanying gels and western blots from the purification are on Zenodo. Additionally, gels can be found in Supplementary Figure 1.
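
The activity calculation and normalization described in panel (B) can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the readings are made-up numbers rather than measurements from this pub, and we assume a single highest value across all proteins and substrates is set to 100%.

```python
def specific_activity(luminescence_per_min, protein_mg):
    """Luminescence signal per minute per mg protein."""
    if protein_mg <= 0:
        raise ValueError("protein_mg must be positive")
    return luminescence_per_min / protein_mg


def normalize_to_max(activities):
    """Scale a {(protein, substrate): activity} mapping so the top value is 100%."""
    top = max(activities.values())
    if top <= 0:
        raise ValueError("no positive activity to normalize against")
    return {key: 100.0 * value / top for key, value in activities.items()}


# Hypothetical readings as (luminescence per minute, mg protein); not real data.
raw = {
    ("protein_a", "dC"): (5000.0, 0.010),
    ("protein_a", "dG"): (2500.0, 0.010),
    ("protein_b", "dC"): (1000.0, 0.020),
}
activities = {key: specific_activity(*value) for key, value in raw.items()}
normalized = normalize_to_max(activities)

for (protein, substrate), percent in sorted(normalized.items()):
    print(f"{protein} {substrate}: {percent:.1f}%")
```

Normalizing per protein instead of across the whole panel would change how the heatmap reads, so the choice of reference value is worth stating alongside any such plot.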

Proteins in different clusters have some distinct biochemical features

To test if proteins in different clusters have different functions, we compared the proteins from cluster 4 to the proteins from cluster 2. First, the proteins in cluster 4, which contains the human dCK and dm-dCK, eluted as dimers from our size exclusion column, while the proteins in cluster 2 eluted as multimers larger than dimers. In every case that we’ve tested, this oligomerization “function” aligns with ProteinCartography clustering.

We previously established that functions aren’t totally conserved within the clusters. However, the activity profiles of proteins within a cluster are much more similar to each other than to the activity profiles of proteins from the other cluster (Figure 5). All proteins act on dC and dG to some degree, meaning that the two clusters do share functions, which isn’t totally unexpected as they’re all from the same family of proteins (Figure 5). We’re also able to identify functional differences between the proteins in the two clusters. The proteins in cluster 4 act primarily on dC, while the proteins in cluster 2 are generally less specific, also acting on deoxynucleosides that aren’t substrates for the proteins in cluster 4 (Figure 5).

Overall, the comparison between clusters 2 and 4 supports the idea that proteins from different structure-based clusters show at least some distinct functions. They form different higher-order structures and have differing substrate specificity.

Figure comparing enzymatic and purification data for proteins from cluster 2 and cluster 4 showing that the substrate specificity and oligomerization state of the two clusters differ. — **Proteins in cluster 2 and cluster 4 have different functions**.
We compare the activity and oligomerization state of proteins in cluster 2 and cluster 4. Proteins in cluster 2 tend to act on multiple substrates, while proteins in cluster 4 tend to act primarily on dC. Proteins in cluster 2 also form multimers, as demonstrated on the right, while proteins in cluster 4 form dimers.

Bringing it all together

In this pub, we presented data from the literature for 34 proteins related to dCK and generated our own data to add to that list. We found instances where function clearly aligned with cluster separation and instances where it was less clear. For example, the 14 TK1 proteins that act exclusively on dT all landed in cluster 7, supporting the idea that ProteinCartography can sort proteins into structural clusters based on their function (Figure 6). Similarly, the proteins in cluster 0 all act on both dT and dC, but not dA and dG, while all the proteins in cluster 5 act primarily on dG (Figure 6). However, the activities of dCK family proteins towards different substrates are more mixed for clusters like 3 and 6, suggesting that ProteinCartography doesn’t always separate proteins based on function. This is also a trend we see for the proteins we selected for our in-lab analysis. Most proteins in cluster 4, which contains the human dCK protein, act most strongly on dC, with the exception of the dCK2s (Figure 6). The proteins in cluster 2 seem to act on all substrates to some degree (Figure 6).

There are many possibilities for why ProteinCartography sometimes, but not always, sorts proteins based on function. First, ProteinCartography performs a global structural alignment, so perhaps in these cases there are subtle local structural differences between proteins that a global alignment doesn’t pick up. For example, we know that proteins in cluster 2 are generally longer with large disordered termini. ProteinCartography is much more likely to pick up these larger differences than subtle differences that might account for differences in enzymatic activity.

Looking at a ProteinCartography map is like taking a bird’s eye view of the similarities between proteins. ProteinCartography creates a continuous distribution that the clustering tries to discretize. On average, proteins in cluster 2 are likely more similar to other proteins in cluster 2 than to proteins in other clusters. However, upon closer evaluation, the reality is more nuanced. For example, two of the characterized proteins in cluster 3, Q86LB8 and Q9XZT6 from Drosophila and Anopheles, are actually structurally closer to the characterized proteins in cluster 0, which have similar functions, than to the other characterized proteins in cluster 3. These two proteins have an average TM-score of 0.80 compared to the rest of the characterized proteins in cluster 3, while they have an average TM-score of 0.90 compared to the characterized proteins in cluster 0. The functions of these proteins align with these findings, but this isn't always the case. The proteins in cluster 4, which have some functional diversity, have an average structural similarity of 0.94, while the proteins in the very tight cluster of TK1 proteins in cluster 7 that act exclusively on dT have an average TM-score of only 0.84.
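
A comparison like the one above, averaging TM-scores within a cluster and between clusters, can be sketched in a few lines. This is a minimal illustration, not the code used for the analysis: the function name, protein labels, and scores are all hypothetical, and it assumes the pairwise scores are already available as a symmetric lookup.

```python
from statistics import mean


def mean_tm_score(scores, group_a, group_b):
    """Average TM-score over all pairs drawn from two groups, skipping self-pairs."""
    values = [scores[a][b] for a in group_a for b in group_b if a != b]
    if not values:
        raise ValueError("no pairs to average")
    return mean(values)


# Toy symmetric TM-scores for hypothetical proteins; not values from this pub.
scores = {
    "x1": {"x1": 1.00, "x2": 0.90, "y1": 0.70, "y2": 0.80},
    "x2": {"x1": 0.90, "x2": 1.00, "y1": 0.60, "y2": 0.70},
    "y1": {"x1": 0.70, "x2": 0.60, "y1": 1.00, "y2": 0.95},
    "y2": {"x1": 0.80, "x2": 0.70, "y1": 0.95, "y2": 1.00},
}
cluster_x = ["x1", "x2"]
cluster_y = ["y1", "y2"]

within_x = mean_tm_score(scores, cluster_x, cluster_x)
x_vs_y = mean_tm_score(scores, cluster_x, cluster_y)
print(f"within cluster x: {within_x:.2f}, x vs y: {x_vs_y:.2f}")
```

If a protein's average score against another cluster exceeds its average against its own cluster, as with the two cluster 3 proteins discussed above, that is a hint the cluster boundary cuts through a continuous region of the map.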

Because clustering tries to sort a continuous distribution into discrete groups of proteins, there's no “correct” clustering, only clustering that's more or less reflective of the properties we care about. ProteinCartography uses the baseline Foldseek settings to create the all-vs-all similarity matrix, meaning that it only calculates structural similarity scores for the top 1,000 proteins for each protein based on an initial alignment of structure-representing 3Di sequences [12]. Changing the parameters does alter clustering, so tuning these parameters could help us get closer to clusters that reflect function. However, the truly optimal parameters for each protein family and use case are likely different, so perhaps this is something users should experiment with for their own use cases.
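
To make the effect of a per-protein hit limit concrete, here is a toy sketch of what happens to an all-vs-all matrix when only the top k matches per protein are scored. It is a simplification under several assumptions: the function name is hypothetical, hits are ranked by the final score rather than by an initial 3Di alignment, unscored pairs are filled with 0, and the numbers are invented.

```python
def top_k_matrix(full, k, fill=0.0):
    """Keep each row's k best off-diagonal scores and fill the rest with `fill`."""
    n = len(full)
    truncated = [[fill] * n for _ in range(n)]
    for i in range(n):
        truncated[i][i] = full[i][i]  # self-comparison is always kept
        others = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: full[i][j],
            reverse=True,
        )
        for j in others[:k]:
            truncated[i][j] = full[i][j]
    return truncated


# Toy similarity matrix for four hypothetical proteins; not real data.
full = [
    [1.0, 0.9, 0.5, 0.4],
    [0.9, 1.0, 0.6, 0.3],
    [0.5, 0.6, 1.0, 0.8],
    [0.4, 0.3, 0.8, 1.0],
]
truncated = top_k_matrix(full, k=1)

# Count off-diagonal pairs that lost their score to the limit.
missing = sum(
    1
    for i in range(len(full))
    for j in range(len(full))
    if i != j and truncated[i][j] == 0.0
)
print(f"{missing} of 12 off-diagonal entries were not scored")
```

In this toy case, distant pairs lose their scores entirely, which is one way a hit limit can reshape the downstream embedding and clustering for large families.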

One of the novelties of ProteinCartography is that it uses structural comparisons both to identify matches and to cluster them. This has some benefits. Because we use structure-based searches (in addition to sequence-based searches) to identify similar proteins, we’re able to cast a larger net. In this analysis, for example, structure-based searching identified proteins with as little as 8.5% sequence identity to our input protein. The Antarctic toothfish and human proteins share 70% sequence identity, so it’s not surprising to find them in the same cluster and with similar functions. However, the three proteins in cluster 2 have less than 40% sequence identity. Despite this, they share structural similarity and some functional similarity. On the flip side, because ProteinCartography is based on rigid, global structural alignment, it might not pick up on small changes, for example at the active site.
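
For readers who want to reproduce identity figures like these, the sketch below computes percent identity from two already-aligned sequences. It is an illustration only: the function name and toy sequences are hypothetical, and it assumes identity is counted over aligned (non-gap) positions. Other definitions divide by alignment length or by the shorter sequence's length, and we don't claim this is the definition used in the analysis above.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of aligned (non-gap) positions where both sequences match."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    matches = 0
    compared = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # skip positions where either sequence has a gap
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        raise ValueError("no aligned positions to compare")
    return 100.0 * matches / compared


# Toy aligned sequences; not real dCK family proteins.
seq_a = "MKT-LLV"
seq_b = "MRTALLI"
identity = percent_identity(seq_a, seq_b)
print(f"{identity:.1f}% identity over aligned positions")
```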

Finally, it’s typical to focus primarily on enzymatic activity when comparing enzymes, but we use the term “function” broadly. Other “functions” or functional properties we could look at include things like stability, tertiary structure, and other functions in the cell. These auxiliary functions should be considered. In our in-lab analysis we found differences in oligomerization states between proteins in cluster 2 and cluster 4, suggesting this could be another “function” that’s picked up by ProteinCartography for this family (Figure 6). It would be useful to apply or develop assays that can be generalized and used on multiple protein families quickly to gather more multi-dimensional data about proteins.

Overall, based on our results for the dCK family, ProteinCartography can be a useful tool for investigating protein function, but it should be used alongside other tools and analyses.

UMAP highlighting the proteins from the literature and the five proteins we analyzed. We see instances where activity aligns well with clustering, like the TKs in cluster 7, and instances where activity doesn’t align well with clustering. — **ProteinCartography sometimes but not always sorts proteins into structure-based clusters that reflect function**.
We bring together the existing literature data and our experimental data to look at how protein functions are distributed across the ProteinCartography map. Proteins analyzed in this study are represented as four-point stars in the map in the upper left.

Key takeaways

ProteinCartography separates proteins based on their global protein structures. We asked if these global protein structure relationships could be used to learn anything about the function of the proteins.
ProteinCartography can sort proteins based on their functions. However, it doesn’t always do so.
ProteinCartography can be used to learn more about protein function and to form hypotheses but should be used alongside other tools and analyses designed to study protein function.

References

Avasthi P, Bigge BM, Celebi FM, Cheveralls K, Gehring J, McGeever E, Mishne G, Radkov A, Sun DA. (2024). ProteinCartography: Comparing proteins with structure-based maps for interactive exploration. https://doi.org/10.57844/ARCADIA-A5A6-1068

Avasthi P, Bigge BM, Radkov A, Wood H, York R. (2024). How can we biochemically validate protein function predictions with the deoxycytidine kinase family? https://doi.org/10.57844/ARCADIA-1E5D-E272

Avasthi P, Bigge BM, Radkov A, Wood H, York R. (2024). A strategy to validate protein function predictions in vitro. https://doi.org/10.57844/ARCADIA-CAE9-96C4

Sabini E, Hazra S, Ort S, Konrad M, Lavie A. (2008). Structural Basis for Substrate Promiscuity of dCK. https://doi.org/10.1016/j.jmb.2008.02.061

Shewach DS, Reynolds KK, Hertel L. (1992). Nucleotide specificity of human deoxycytidine kinase. https://pubmed.ncbi.nlm.nih.gov/1406603/

Slot Christiansen L, Munch-Petersen B, Knecht W. (2015). Non-Viral Deoxyribonucleoside Kinases – Diversity and Practical Use. https://doi.org/10.1016/j.jgg.2015.01.003

Thorndike RL. (1953). Who Belongs in the Family? https://doi.org/10.1007/bf02289263

Hebditch M, Carballo-Amador MA, Charonis S, Curtis R, Warwicker J. (2017). Protein–Sol: a web tool for predicting protein solubility from sequence. https://doi.org/10.1093/bioinformatics/btx345

Sabini E, Hazra S, Konrad M, Lavie A. (2007). Nonenantioselectivity Property of Human Deoxycytidine Kinase Explained by Structures of the Enzyme in Complex with L- and D-Nucleosides. https://doi.org/10.1021/jm0700215

Arcadia Science. (2024). arcadia-pycolor. https://github.com/Arcadia-Science/arcadia-pycolor

Chottiner EG, Shewach DS, Datta NS, Ashcraft E, Gribbin D, Ginsburg D, Fox IH, Mitchell BS. (1991). Cloning and expression of human deoxycytidine kinase cDNA. https://doi.org/10.1073/pnas.88.4.1531

van Kempen M, Kim SS, Tumescheit C, Mirdita M, Lee J, Gilchrist CLM, Söding J, Steinegger M. (2023). Fast and accurate protein structure search with Foldseek. https://doi.org/10.1038/s41587-023-01773-0

This work is licensed under CC BY 4.0

Audrey Bell: Visualization

Brae M. Bigge: Conceptualization, Supervision, Writing

Feridun Mert Celebi: Validation

Megan L. Hochstrasser: Editing

Atanas Radkov: Formal Analysis, Investigation, Visualization, Writing

Ryan York: Supervision

Wijnand on Sep 16, 2025

Most protein language models are only trained on a fraction of available experimental 3D structures, and since most proteins contain intrinsically disordered regions - which often lack density and resolution in a PDB structure due to its inherent flexibility - relying solely on global structural information can limit the accuracy to predict protein function. Intrinsically disordered regions are crucial for protein folding, function etc..

Brae Bigge on Sep 16, 2025

Thanks for the comment! Intrinsically disordered regions are important for function and dynamics of the protein, and we agree that a caveat of ProteinCartography is that global structural comparisons mean you miss things like this and minor differences between proteins.

Gabriel Rosenfield on Jun 01, 2025

Bravo! This is a really interesting tool integrating sequence and structure database searches, structural predictions and comparison techniques, and clustering and visualization methods to suggest functional groupings of proteins. I wish I had access to tools like this during my PhD! Building off of comments by Benjamin Groves and Gabriel Butterfield, I think a great extension to ProteinCartography would be a "sliding window" mode, in which an input protein is broken down into segments before each segment is run through the existing pipeline. If a structure for the input sequence is provided, or if the sequence has features that can be confidently linked to structural or functional features (e.g. a high confidence BLAST match to a protein with a known structure, or high confidence conserved domain match, etc), the locations of these domains within the sequence could be used to define the sliding window size. If this information isn't supplied by the user and can't be confidently predicted, then some default sliding window size can be used (e.g. 100 residues). By comparing the clustering results across windows, it should be possible to identify the approximate borders between structurally independent elements of the input, to identify database hits that share either parts or all of the structural elements of the input, and, potentially, to improve functional suggestions for inputs that consist of multiple independent elements. The process could also be run iteratively, updating the window size each time to improve the separation of structural elements and functional predictions for each element. I imagine this could be especially useful in cases where proteins have evolved via gene fusion, or to suggest potential functions for proteins that contain novel or unstructured elements that may result in few or noisy matches and/or poor clustering with the current pipeline. But it might be computationally intensive...

Brae M Bigge on Jun 05, 2025

Thanks, Gabriel! I love these ideas to dig into the structures a bit more granularly. We don't have plans to continue working on ProteinCartography right now, but we're totally open to contributions if you want to take a look -> https://github.com/Arcadia-Science/ProteinCartography. I think you could even test this by just triggering the pipeline multiple times for your different windows/domains of your input. It will definitely be more computationally intensive than a single ProteinCartography run, but I'm sure there are tradeoffs that could be made for window size and number of windows depending on the protein.

Tyler Lazar on May 26, 2025

This is really so neat! I work in the Jewett Lab over at Stanford where one of our projects has been exploring engineering promiscuity of dNKs towards both canonical and noncanonical nucleosides; thus, coming across this work was very helpful and illuminating. On our end, we have been screening dNK homologues broadly representing a sequence similarity search where we have also seen trends in promiscuity/activity but with sequence clustering/phylogenetic similarity. I think it would be interesting to compare the sequence-based clustering of our experimental data with ProteinCartography along with ESM2-based clustering to determine how much sequence/structure/protein feature space differ from each other and which approach is the most predictable for our measured dNKs!

Although the amount of experimental data is relatively small, have you attempted to train a supervised model (linear or nonlinear) on the collected experimental data and determined its predictive strength? I assume it would be low, but it would be neat to also check the d-optimality of structures included in the clustering to determine how many experimentally validated sequences would be necessary to achieve reasonable feature coverage across the entire dataset. Pairing structure-based clustering with supervised learning could really help us effectively traverse natural fitness landscapes effectively!

Also on more of a tangent, do you intuitively think it would be viable to apply ProteinCartography to populate search space for sparsely characterized protein classes (i.e. generating 1000 sequences with the highest sequence similarity to the starting protein via BLAST, structural clustering via ProteinCartography, and then selecting + testing variants that cluster with the starting sequence)? There have been a few collaborations where we have identified proteins with unique activities that share little sequence similarity with other known proteins. In these instances, sequence-based clustering methods have really struggled to identify additional variants from nature with the desired function, and I wonder if a structure-based clustering workflow would be more optimal at identifying functional variants.

Tyler Lazar on Jun 06, 2025

I really appreciate the response Brae! I will definitely play around with ProteinCartography a bit and compare our experimental data to all these different clustering metrics; y'alls documentation is super extensive and well done, and I doubt it will be a burden on my end to boot it up!

As for your comments about the connection between structure and function along with utilizing ProteinCartography for effective selection of experimental variants, I completely agree! I think a lot about how we as protein scientists can effectively traverse diverse feature landscapes, either from nature or created through generative models, and I truly believe that structure-guided clustering, possibly integrated with sequence-based PLMs, could be a super effective tool at a) filtering and identifying candidates with a desired function and b) effectively sampling a library to maximize feature coverage and then train supervised models to effectively identify optimal candidates from the feature space. Really neat to see what y'all are developing!

Brae Bigge on Jun 03, 2025

Thanks for the comments and ideas, Tyler! It's really cool to hear that you all are working on exploring dNK homologs too. Comparing your sequence-based clustering to the structure-based ProteinCartography results would be really interesting, and adding in the experimental data would be a great way to evaluate the clustering. We did look a little at sequence similarity of the proteins we analyzed here, but a larger analysis to more broadly compare sequence and structure-based clustering could shed a lot of light on the results. If you'd like to give it a try yourself, all the data in this pub is available and you can run ProteinCartography here -> https://github.com/Arcadia-Science/ProteinCartography. We haven't tried training a model on this data, which was more of a test case or validation of ProteinCartography. We aren't actively working on following up with the dCK data. However, we are working on models using datasets with a bit more functional information. I do think that ProteinCartography could be useful for picking variants to analyze (we've even used it to evaluate variants we generated!). I always think adding the structural information is quite useful, especially if you're looking for things that might be more distant sequence-wise. Structure and function are closely related, and proteins that share the same fold, even if they're more distantly related based on sequence, can have shared functions. An example of this in our analysis is that the 3 proteins in cluster 2 only had about 40% sequence identity to each other, but they still clustered together and shared some functions.

Pam on May 22, 2025

Have you considered using ProteinCartography to identify structural motifs or features associated with thermostability across a family? For example, could you overlay thermostability data onto a ProteinCartography map to spot clusters or substructures linked to enhanced stability? Similar to how you identify the formation of complexes as a "function," could thermostability be a different "function" you tried to predict? It would be amazing if this kind of mapping could guide enzyme engineering by pinpointing stabilizing features in closely related, more robust proteins.

Pam on May 30, 2025

A lot of industrial applications prefer to run at higher temperatures to avoid contamination and speed up catalysis, so anything used in chemical or food applications might be interesting. Proteases, amylases, and glucose oxidases are pretty commonly used and well-characterized, so there would likely be sufficient data for any of these families. It would also be really interesting to see if trends could be detected across families. Intramolecular interactions and structural features seem to play important roles, so I wonder if these features are conserved well enough to detect signature motifs, or if evolution has revealed a number of ways to solve this problem.

Brae M Bigge on May 23, 2025

Thanks for the comment, Pam! We haven't done that, but it sounds like an interesting use case. Do you have ideas of protein families that might be particularly useful to look at this way? If you wanted to try it out yourself, the code is available too!

Shruti on May 20, 2025

The observation that some proteins cluster based on tertiary structure (oligomerization) rather than enzymatic activity suggests that ProteinCartography captures higher-order structural features effectively. Given this finding, could ProteinCartography be specifically optimized to predict protein complex formation or oligomerization tendencies across broader protein families?

Brae M Bigge on May 23, 2025

Thanks for the comment! It's definitely possible that ProteinCartography could be used to make hypotheses about oligomerization and complex formation. One way to optimize for this or strengthen these hypotheses is to include some kind of experimental data or even additional bioinformatic data. For example, because we have experimental data, one could hypothesize that other proteins in the cluster where the dCKs seem to be organizing into some higher order structure are also likely to form higher order structures. We also have an example where we actually looked at the conservation of residues involved in actin polymerization across the actin family, and that could be used to strengthen hypotheses about which proteins are likely to polymerize (https://research.arcadiascience.com/pub/result-actin-structural-clustering/release/5/). These would just be predictions/hypotheses, but it would be interesting to test them out in the lab!

Koko Mutai on Apr 07, 2025

This is a very interesting way of analyzing protein function using structural information. But are these clusters enriched in specific biological pathways? Are these clusters dynamic across different conditions such as proliferating versus quiescent cells? It's also my thinking that you might have utilized dCKs and ProteinCartography to display the power and limitations of ProteinCartography, but is the plan to use it in the future to identify other proteins across species that can be further clustered, functionally characterized, and eventually used in therapeutics?

Brae M. Bigge on May 19, 2025

Thanks for the comment! The clusters aren't specifically enriched for biological pathways or taking into account any conditions. We just used the human dCK to do sequence (BLAST) and structure (Foldseek) based searches and then clustered any proteins that were a match. One could overlay additional information about biological pathways or how things change in different conditions, but this is just the baseline clustering. Our hope is that ProteinCartography can be used for things like identifying novel proteins, functionally characterizing proteins, and other related things in the future!

Emilia Silverberg on Mar 26, 2025

Selection

midine all populate a single cluster. However, while proteins annotated as dCK all cluster together, they don’t share all functions. This is likely related to how ProteinCartography compares and clusters proteins, something that we’

This is expected as proteins often can function in multiple ways depending on their respective environments. I wonder how this program could incorporate assumptions of the environment to form multiple clusters, each based on different assumptions (i.e., pH, surrounding substrates). This would certainly complicate results, but may provide a more accurate, searchable database to predict enzymatic function based on protein sequences and their environments.

Brae M. Bigge on Mar 28, 2025

Thanks for the comment! You’re absolutely right, the environment can affect protein function, and considering that alongside ProteinCartography results could help make more accurate predictions about function. A relatively simple thing one could do to gain a little bit of insight is overlay some of the biophysical characteristics of the protein that might be able to give you an idea about where they function (isoelectric point, hydrophobicity, etc.) on the ProteinCartography map.

Gabriel Butterfield on Feb 18, 2025

Selection

% sequence identity. Despite this, they share structural similarity and some functional similarity. On the flip side, because ProteinCartography is based on rigid, global structural alignment, it might not pick up on small changes at the active site for example. Finally, it’s typical to focus primarily on enzymatic activity when comparing enzymes, but we us

Could structural alignment of only residues surrounding the active site increase the ability to discriminate enzymatic activities of structurally similar proteins?

Brae M. Bigge on Feb 18, 2025

Thanks for the comment, Gabriel! One of the biggest caveats with trying to compare potential activities of proteins with a tool like ProteinCartography is that we are aligning full structures so might not pick up on subtle differences. I think looking at just the structure of the active site and seeing how that changes the space would be interesting!

Gabriel Butterfield on Feb 18, 2025

Thanks for the reply, Brae! Your group got really nice results using full structures, so clearly your approach works well. I think focusing on the functional regions of protein families could be a fruitful area for future exploration though. I know that accurately assigning functional regions can be difficult but I think it could be worthwhile at least in some cases.

Stuart Adamson on Feb 18, 2025

Very thought-provoking publication! Brings up an interesting thought if the final form (structure) of a tool contains information that the blueprint (sequence) does not.

Gabriel has an interesting comment (but of course requires knowledge of the active site). In a similar vein, I wonder if integrating information of the evolutionary pressure of the protein sequence could help improve performance and filter out noise? The more the protein is evolutionary constrained the less likely it is to have “random” residue changes, especially around the active site.

Similarly, focusing analysis and follow up on highly constrained proteins could be a good way to focus the validation efforts.

Brae M. Bigge on Feb 18, 2025

Thanks for the comment, Stuart! Incorporating evolutionary info and sequence-based info is something that we’ve been thinking about for a while for this type of analysis, and I think, in general, it would improve the analysis. We don’t have plans to immediately follow up on this particular work, but we are incorporating evolutionary information as we move forward with different types of analyses and tools.

Raymondo Lopez on Feb 20, 2025

Great read and approach to leverage available data. This approach neatly clusters structurally related proteins and identifies differences within clusters. How do you think this workflow can inform rational protein design for functional analysis? It could be interesting to follow up on the cluster of proteins with disordered regions for example and investigate whether these regions confer formation of higher order structures and/or functional properties.

Brae M. Bigge on Feb 21, 2025

Hi Raymondo, thanks for the comment! ProteinCartography could be useful for a few aspects of protein design. First, it’s a tool for gathering related proteins, so it could be used to identify homologs and build a training set if you’re interested in a particular protein. Second, it can be used to cluster designed proteins to see how much of the space you’re covering. And I agree, it would be interesting to follow up on the cluster with the disordered regions to see if those regions are contributing to functions by maybe testing their functions with and without the disordered regions.

Alex Greenlon on Feb 21, 2025

Really exciting work! Between this and the validation work on actin and polyphosphate kinases (and other examples), it seems like there is strong reason to trust that ProteinCartography’s clusters carry useful biological information. Are you aware of any effort (at Arcadia or elsewhere) to use ProteinCartography to hypothesize the function of protein families with unknown function? Since the clustering uses protein structural information, and given advances in protein design, it seems like the pipeline could be adapted to look for enzyme families for substrates without known degradation pathways, or to inform study of “hypothetical proteins” that are often conserved in biosynthetic and other operons.

Brae M. Bigge on Feb 21, 2025

Thanks for the question, Alex! We don’t have any specific efforts aimed at whole protein families of unknown function, but it would be totally reasonable and interesting to use ProteinCartography for this purpose. I like your ideas of looking for enzyme families for substrates without known degradation pathways or using ProteinCartography to learn more about hypothetical proteins. Let us know if you try it out on any of these use cases!

Benjamin Groves on Feb 25, 2025

Along the lines of what Gabriel suggested, in addition to full length structural clustering, do you think it’d be interesting to break the input proteins up into domains and search/cluster based on those?

It would also be neat to flip this around a bit. Starting with several protein phenotype datasets (with measurements along a variety of axes), see what types of publicly available data (e.g. structure, sequence, protein language model embeddings) best capture the various types of differences. Clustering based on structure is very cool; but also having clusters based on sequence and protein language model embeddings seems potentially useful too - and having an easy-to-use pipeline available for that would (I think?) make people happy. Users might be able to compare how clusters differ between the various search/clustering methods.

Brae M. Bigge on Feb 25, 2025

Thanks for the questions, Benjamin! Currently, ProteinCartography clusters only on full proteins by default. You could use a single domain as an input for “Search” mode, but it will retrieve full-length proteins. For “Cluster” mode, you provide a folder of structures, and it clusters them, so you can do clustering on single domains using “Cluster” mode.

I like the idea of aligning publicly available data with clustering based on different protein features! As you mentioned, ProteinCartography only does structural comparisons. It then clusters using the all-v-all matrix generated from that comparison. We’ve talked about adding in comparison of different protein representations but have paused the development of that for now. However, you can still use ProteinCartography functionality on all-v-all matrices generated outside the pipeline to obtain comparable clustering for the different representations.
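
As a rough illustration of the last point, here is a minimal sketch of building an all-v-all matrix from a different protein representation (for example, language model embeddings). The function names, the toy vectors, and the choice of cosine similarity are all assumptions made for illustration. The file format ProteinCartography expects for externally generated matrices isn't described in this thread, so this only covers computing the matrix itself.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # treat an all-zero vector as dissimilar to everything
    return dot / (norm_u * norm_v)


def all_v_all(embeddings):
    """Return (ids, matrix), where matrix[i][j] compares ids[i] with ids[j]."""
    ids = sorted(embeddings)
    matrix = [
        [cosine_similarity(embeddings[a], embeddings[b]) for b in ids]
        for a in ids
    ]
    return ids, matrix


# Hypothetical toy embeddings; real ones would come from a protein language model.
toy_embeddings = {
    "protein_A": [1.0, 0.0, 0.0],
    "protein_B": [0.9, 0.1, 0.0],
    "protein_C": [0.0, 1.0, 0.0],
}
ids, matrix = all_v_all(toy_embeddings)
```

The resulting square matrix plays the same role as the structural all-v-all matrix, so the same downstream clustering step could in principle be applied to either one and the clusters compared.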

Taylor Anglen on Feb 25, 2025

I would be interested in the big-picture thoughts and discussions around how dCKs might be used therapeutically. How would these be delivered? What would be your first go-to diseases? What work has been done already in this direction? This is exciting work.

Brae M. Bigge on Feb 27, 2025

Thanks for the questions, Taylor! We didn’t select dCKs for this work based on their therapeutic potential, and we don’t have any plans to develop them for therapeutic purposes. For this work, we just wanted a protein that we were able to purify and assay easily and that had interesting ProteinCartography results. However, it’s worth noting that dCK does play an essential role in the nucleoside salvage pathway and has been used in therapeutics to activate nucleoside analog prodrugs (https://pubmed.ncbi.nlm.nih.gov/1406603/).

CJ San Felipe on Feb 26, 2025

Really interesting work! I had a couple questions:

For the cluster 2 proteins that purified as multimers, did you consider running an AlphaFold multimer prediction of what the complex could look like/if it correctly predicts multimeric assembly? Do you think there would be any clues to suggest why Rickettsiales and almond enzymatic activity towards all substrates appears to be much better than mustard if you align all the structures together?
For cluster 2, do these species encode multiple different nucleoside kinases or do they only encode 1? If there are multiple then do you see them represented within the same cluster?

Brae M. Bigge on Feb 27, 2025

Thanks for the questions CJ! I’ll answer them individually!

We haven’t tried running AlphaFold multimer on this, but that’s a great idea! I have compared the monomeric structures of the Rickettsiales, almond, and field mustard proteins but didn’t really find anything too obvious to explain the differences in substrate specificity. The core of the protein is pretty well conserved between the three, and the almond appears to have an added unstructured region that the other two don’t.
Great question! Within each individual species, all proteins in the analysis can be found in cluster 2. For example, there’s one Prunus dulcis protein in the space, and it’s in cluster 2. Other proteins from species in the genus Prunus are also in cluster 2. There are five proteins in cluster 2 for Brassica campestris, and those are the only ones in the analysis. For the Rickettsiales proteins, there’s only one protein in cluster 2 and only one protein in the analysis that’s specifically from “Rickettsiales bacterium”, but there are proteins from other Rickettsiales species that are outside of cluster 2.

Weijie Xu on Feb 27, 2025

Very interesting!! Are the structures used in this pub all from PDB? It would be great if predicted structures can also be sorted this way.

Brae M. Bigge on Feb 27, 2025

Thanks for the question! The structures in this pub are all predicted structures from the AlphaFold database.

Matt Davis on Feb 28, 2025

Thanks for sharing team, and I appreciated the ability to dig into your raw data and files. A few unrelated thoughts, questions, and feedback:

My experience with protein clustering has mostly been sequence similarity-based (for example tools from the Enzyme Function Initiative) and focused on identifying enzymes that are ‘best’ at a desired function such as total activity or substrate specificity. It would be interesting to see how structure-based clustering compares to other clustering approaches across different use cases.
I found it very interesting that dCK2 enzymes from X. laevis and G. gallus reside in cluster 4 with their dCK paralogs despite having significantly different substrate preferences. Do you think of these as ‘outliers’ due to arising from a genome duplication, or do you have thoughts on how your tools could be applied to studying the evolution of function after duplication? An example that comes to mind would be asking whether some protein families rapidly evolve in function after duplication while other families remain relatively conserved, and what factors may drive that difference.
Building on comments from others about active site residues, perhaps a useful cluster overlay could be whether or not canonical catalytic residues are present. This way no changes to global alignments or clustering are needed, but you could gain some high-level understanding of clusters/members that might have diverged in function.
A small request, but it would have been helpful to list the number of sequences within each cluster, or at least just the clusters you focused on. This was easy enough to find in the downloaded dataset, but useful context in my opinion.

Brae M. Bigge on Mar 03, 2025

Hi Matt! Thanks for the comments. I’ll respond to your thoughts and questions below.

We're definitely interested in comparing structure-based clustering to clustering based on other protein representations.
It is interesting that those proteins all ended up in the same cluster despite acting on different substrates. I don’t think of them as ‘outliers’ because the overall structural similarity, which is what ProteinCartography is based on, between all the proteins is above 0.9, meaning that they are structurally quite similar and the differences in substrate specificity are likely due to small changes not picked up by the global comparison. I like your idea for using ProteinCartography to look at functional evolution following gene duplication. It would be interesting to look across a bunch of different families with examples of gene duplication where the duplicates have varying levels of conservation.
Creating an overlay with the conservation of specific sites is something that we’ve done in previous pubs (https://research.arcadiascience.com/pub/result-actin-structural-clustering/release/3), so it’s totally something that’s doable, and it can provide lots of useful info. While I don’t have an overlay for this family, we did look at the catalytic residues, and the majority of them were well-conserved for our clusters of interest (2 and 4). In this case, it could also be useful to look at the substrate binding pocket since most of the difference that we’re seeing has to do with substrate specificity.
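
For anyone who wants to try the catalytic-residue overlay on their own family, here is a minimal sketch of the idea. It assumes you already have a multiple sequence alignment and a list of catalytic columns; the sequences, column indices, and function name below are hypothetical toy values and are not real dCK residues.

```python
def catalytic_conservation(aligned_seqs, catalytic_sites):
    """Fraction of expected catalytic residues present in each aligned sequence.

    aligned_seqs: dict of name -> aligned sequence (all the same length, '-' for gaps)
    catalytic_sites: dict of 0-based alignment column -> expected residue
    """
    if not catalytic_sites:
        raise ValueError("at least one catalytic site is required")
    if len({len(seq) for seq in aligned_seqs.values()}) > 1:
        raise ValueError("aligned sequences must all have the same length")
    overlay = {}
    for name, seq in aligned_seqs.items():
        matches = sum(
            1 for column, residue in catalytic_sites.items() if seq[column] == residue
        )
        overlay[name] = matches / len(catalytic_sites)
    return overlay


# Hypothetical toy alignment and catalytic columns, for illustration only.
toy_alignment = {
    "reference": "MKDERG",
    "homolog_1": "MKDERG",
    "homolog_2": "MKA-RG",
}
toy_sites = {2: "D", 3: "E", 4: "R"}
overlay = catalytic_conservation(toy_alignment, toy_sites)
```

The per-protein fractions could then be joined to the cluster assignments and used as a color overlay, without changing the global alignment or the clustering itself.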
Thanks for the suggestion. It’s helpful to know this would be useful!

Kartik Lakshmi Rallapalli on Mar 04, 2025

Thank you for providing a thorough analysis of the ProteinCartography tool and emphasizing the necessity for the experimental validation of its structure-based clusters. Recognizing that structural similarity does not always correspond to functional similarity, I am interested in exploring additional computational parameters to distinguish the specific enzymatic substrates and functions of enzymes with highly similar folds.
Specifically, have you considered employing high-dimensional embedding-based clustering for these dCK proteins? Utilizing protein language models to generate sequence embeddings can perhaps capture intricate sequence patterns beyond traditional alignments. These embeddings can then be clustered to identify functional subgroups within structurally similar enzyme clusters. It would be interesting to see the “concordance scores” between the embedding clusters and the structural clusters shown in this pub!
Let me know if you guys have tried this or any other alternative in silico characterization of enzymes to supplement ProteinCartography!

Brae M. Bigge on Mar 04, 2025

Thanks for the comment, Kartik! We have considered different protein representations, including sequence embeddings. I do recommend using additional in silico characterization to supplement ProteinCartography or using ProteinCartography alongside other tools, but we haven’t implemented anything like this within ProteinCartography.
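
One possible way to score concordance between two sets of clusters is sketched below, assuming each protein has one label from the structural clustering and one from the embedding clustering. The adjusted Rand index used here is just one common choice and is an assumption, not something implemented in ProteinCartography; the labels in the example are made up.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two cluster assignments of the same items."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both label lists must cover the same items")
    n = len(labels_a)
    if n < 2:
        return 1.0  # no pairs to disagree on
    # Count pairs of items that are grouped together in both, in A, and in B.
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = together_a * together_b / comb(n, 2)
    maximum = (together_a + together_b) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both partitions are a single cluster
    return (together_both - expected) / (maximum - expected)


# Hypothetical labels for four proteins under two clustering methods.
structure_clusters = [0, 0, 1, 1]
embedding_clusters = [1, 1, 0, 0]  # same grouping, different label names
ari_same = adjusted_rand_index(structure_clusters, embedding_clusters)
ari_crossed = adjusted_rand_index(structure_clusters, [0, 1, 0, 1])
```

A score of 1 means the two methods group the proteins identically regardless of label names, values near 0 mean agreement no better than chance, and negative values mean less agreement than chance.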

Wen Xiong on Mar 07, 2025

Very interesting study! As has been discussed, one potential explanation for the discrepancies between structure-based clustering and actual substrate specificity could be local differences in active sites or slight structural rearrangements that aren’t fully captured by global alignments. For instance, a short, non-domain structural motif might serve as an autoregulatory element—turning an enzyme ‘on’ or ‘off’ and thereby influencing which substrates it accepts. Additionally, post-translational modifications (PTMs), which often differ by cell type or species, could further modulate enzyme function in ways not readily apparent from static protein structures alone.

Brae M. Bigge on Mar 07, 2025

Hi Wen! Thanks for the comments. You’re totally correct about local differences in structure and PTMs, which could also be contributing to the differences we see.

Mitsu Raval on Mar 11, 2025

This insightful article vividly demonstrates the capabilities and limitations of using ProteinCartography for inferring protein functions from structural data. The discussion around the need for integrating local structural dynamics and post-translational modifications is particularly enlightening, underscoring the complex nature of protein function.

Building on this, I suggest that future iterations of ProteinCartography incorporate real-time dynamic data from molecular dynamics simulations to enhance its predictive accuracy. This integration could allow for a better understanding of transient structural states that play critical roles in enzyme activity and regulation.

Additionally, the application of network theory could offer novel insights. By treating structural similarities and functional annotations as a network, we can apply community detection algorithms to discover new functional clusters that might not be apparent through traditional clustering methods. This approach could also help in predicting the effects of mutations on protein function, particularly for proteins without extensive biochemical characterization.

As someone keen to contribute to innovative biological research at Arcadia, I am excited about the prospect of integrating these computational techniques to enhance the predictive power and utility of tools like ProteinCartography in your Pilot teams. This could accelerate the discovery and validation of biological insights, aligning with Arcadia's mission to transform evolutionary innovations into real-world solutions.

Brae M. Bigge on Mar 11, 2025

Thanks for the comment Mitsu! Proteins aren’t static, and incorporating dynamic information could be useful when considering transient structural states important for function. We’ve paused the development of ProteinCartography for now, but I’d be interested in talking more about how we might integrate some dynamic information into our various analyses, as well as learning more about your second suggestion. Currently, ProteinCartography takes proteins within a family, structurally compares them all to each other to create an all-v-all similarity matrix, and then does Leiden clustering, a form of community detection, which we did find tended to work better than other clustering methods (like Foldseek clustering) for our needs.

Shantanu Khatri on Mar 18, 2025

Selection

Cartography, but we hope this data is also useful to scientists studying dCK and related proteins. We found that ProteinCartography, which uses global structural alignment, is able to sort proteins into clusters based on their enzymatic function, but it does not always do so. For example, proteins annotated as thymidine kinase that act on deoxythymidine all populate a single cluster. However, while proteins annotated as dCK all cluster together, they don’t share all functions. This is likely related to how ProteinCartography compares and clusters proteins, something that we’re interested in exploring more, and highlights the importance of combining analyses like these with other analyses to learn more about protein function. This pub is part of the platform effort, “Annotation: Mapping the functional landscape of protei

I find it fascinating that dCK homologs from different species cluster together structurally but exhibit distinct functional behaviors. Have you explored whether evolutionary constraints (such as selective pressures on active site residues) correlate with functional divergence within clusters? It might be interesting to overlay evolutionary conservation scores onto your ProteinCartography clusters to see if functional outliers correspond to relaxed selection at key positions.

Brae M. Bigge on Mar 18, 2025

We haven’t looked at this yet, but we are interested in how evolutionary constraints might be playing a role. I think this could be really interesting!

Shantanu Khatri on Mar 18, 2025

Selection

n are closely linked, an idea that we tested in this analysis. Our foundational hypotheses are that ProteinCartography will cluster functionally similar proteins together while sorting functionally distinct proteins into different clusters based on structural similarities. Here, we test these hypotheses using in vitro data to help give ProteinCartography users some idea

What is the core idea behind such a clustering?

Brae M. Bigge on Mar 18, 2025

Thanks for the comments and questions! ProteinCartography sorts proteins into clusters based on their structural similarity. It uses TM-score to compare all structures to all other structures and then uses the resulting all-v-all matrix to do Leiden clustering. The idea behind our hypothesis is that because structure is closely related to function, we’ll be able to use structure-based clustering to make hypotheses about function.
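
To make the "all-v-all matrix, then cluster" idea concrete, here is a minimal sketch. It is not the ProteinCartography implementation: the similarity values are made-up stand-ins for TM-scores, the matrix is assumed to be symmetric, and Leiden clustering is swapped for a much simpler technique (connected components of a thresholded similarity graph) so that the example runs with only the standard library. The threshold and function name are assumptions.

```python
def cluster_by_threshold(ids, similarity, threshold):
    """Group proteins into connected components of a thresholded similarity graph.

    Two proteins share an edge when similarity[i][j] >= threshold. This is a
    simplified stand-in for Leiden community detection, not Leiden itself.
    """
    parent = list(range(len(ids)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(ids)):
        for j in range(i + 1, len(ids)):
            if similarity[i][j] >= threshold:
                parent[find(i)] = find(j)

    clusters = {}
    for i, name in enumerate(ids):
        clusters.setdefault(find(i), []).append(name)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical all-v-all similarity matrix (toy values, not real TM-scores).
protein_ids = ["p1", "p2", "p3", "p4"]
toy_matrix = [
    [1.00, 0.92, 0.40, 0.35],
    [0.92, 1.00, 0.38, 0.41],
    [0.40, 0.38, 1.00, 0.88],
    [0.35, 0.41, 0.88, 1.00],
]
clusters = cluster_by_threshold(protein_ids, toy_matrix, threshold=0.5)
```

Unlike this sketch, community detection methods such as Leiden work on the weighted graph and optimize a quality function, so they don't depend on a single hard cutoff.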

Shantanu Khatri on Mar 18, 2025

Selection

ty towards dA and dG. A function not conserved between the two proteins is the activity towards dT. This suggests that while ProteinCartography can separate proteins based on function, it doesn’t separate on every function. This is expected, as proteins are complex and perform many

That’s an important fact, as some proteins also undergo trimming and modification to achieve function.

Brae M. Bigge on Mar 18, 2025

Definitely! There are lots of factors that might be affecting function outside of just structure.
