How can we biochemically validate protein function predictions with the deoxycytidine kinase family?
The human deoxycytidine kinase, a member of the nucleoside salvage pathway, has been studied extensively. We’ll use this family to assess our structure-based protein clustering tool, ProteinCartography. We’d love feedback on how we might work with this protein for validation.
We aim to validate ProteinCartography, a tool for structure-based protein clustering, by evaluating two foundational hypotheses: that proteins within a cluster have similar functions and proteins in different clusters have differing functions.
Purpose
We created ProteinCartography to computationally compare protein structures from a single family across many different species [1]. ProteinCartography identifies proteins similar to an input and compares the structures of each protein to every other protein to produce an interactive map with clustering information overlaid. In a previous pub, we began formulating a plan to validate ProteinCartography by testing two foundational hypotheses: proteins within clusters will have similar functions and proteins in different clusters will have different functions [2].
In this pub, we outline our ProteinCartography results for one of the protein families we’ve chosen to use for validation, deoxycytidine kinases. We selected this family because it has previously been studied biochemically and because it produced results with many clear options for testing our hypotheses [2].
We’re seeking feedback regarding how we might approach in-lab validation in this family, especially from those who’ve previously worked with deoxycytidine kinase proteins.
The ProteinCartography pipeline used to run these analyses is available in this GitHub repo. To create the custom overlays, we used this notebook and added our custom color dictionaries, which can be found in the associated Zenodo repositories.
The data associated with this pub, including ProteinCartography results for the deoxycytidine kinase family, can be found in this Zenodo repository.
Share your thoughts!
Watch a video tutorial on making a PubPub account and commenting. Please feel free to add line-by-line comments anywhere within this text, provide overall feedback by commenting in the box at the bottom of the page, or use the URL for this page in a tweet about this work. Please make all feedback public so other readers can benefit from the discussion.
Background
Why use deoxycytidine kinases?
Our initial validation of ProteinCartography is intended to test the two foundational hypotheses that proteins in the same cluster have similar structures and functions and that proteins in different clusters have differing structures and functions. To do this rapidly and in a straightforward manner, we began with proteins that had been previously biochemically characterized. We started with the 200 most well-studied human proteins [4]. Other factors we considered in our protein selection decision were the length of proteins and the quality of the available AlphaFold structures. The pLDDT (predicted local distance difference test), computed by AlphaFold, is a per-residue measure of the confidence of a model structure [5]. This score ranges from 0 to 100, with higher scores indicating greater confidence. In our case, we focused on proteins shorter than 1,280 amino acids, a length limit set by AlphaFold, and proteins with a pLDDT score higher than 80. Model structures in this pLDDT score range are typically considered high-confidence.
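The two structure-based criteria above (length and pLDDT) amount to a simple filter. Here is a minimal sketch, assuming each candidate is summarized by its length and a single mean pLDDT value. The `Candidate` class, the function name, and the example records are hypothetical and are not taken from our actual selection workflow.

```python
from dataclasses import dataclass

# Thresholds taken from the text above.
MAX_LENGTH = 1280  # length limit, in amino acids
MIN_PLDDT = 80.0   # pLDDT cutoff (assumed here to be a per-protein mean)


@dataclass
class Candidate:
    name: str
    length: int        # number of amino acids
    mean_plddt: float  # 0 to 100, higher means more confident


def passes_structure_criteria(c: Candidate) -> bool:
    """True if the protein is under the length limit and above the pLDDT cutoff."""
    return c.length < MAX_LENGTH and c.mean_plddt > MIN_PLDDT


# Hypothetical records for illustration only; these are not real values.
candidates = [
    Candidate("protein_a", 260, 92.0),   # short and confident: kept
    Candidate("protein_b", 1500, 88.0),  # too long: dropped
    Candidate("protein_c", 400, 65.0),   # low confidence: dropped
]

selected = [c.name for c in candidates if passes_structure_criteria(c)]
print(selected)
```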
Taking into account each of our selection criteria [2], we chose to focus on the human deoxycytidine kinase. As of this writing, there are 47 Protein Data Bank (PDB) entries for this protein, which places it among the 200 human proteins with the most solved structures. Additionally, this protein family has commercially available assay kits and it produced ProteinCartography results with clearly defined clusters that would allow us to test our foundational hypotheses (Figure 1).
What do deoxycytidine kinases do and why are they important?
Deoxycytidine kinase (dCK) has an essential role as a nucleoside kinase, critical in producing precursors for DNA synthesis [6]. The enzyme is crucial in the nucleoside salvage pathway, primarily phosphorylating deoxycytidine and converting it into deoxycytidine monophosphate [7]. The enzyme can also convert the nucleosides deoxyadenosine and deoxyguanosine to their monophosphate forms, albeit at a lower rate [6]. In addition to these native substrates, the dCK enzyme is essential for activating several nucleoside analog prodrugs via phosphorylation. These analogs include anticancer drugs (cytarabine, gemcitabine, cladribine, and fludarabine) as well as antiviral drugs (lamivudine and emtricitabine) [6].
Very little is known about non-human dCK homologs, but they’re intriguing to investigate because they could have distinct properties that might improve cancer and antiviral therapies that rely on human dCK. There’s already evidence that non-human dCK homologs improve the efficacy of gene-directed enzyme prodrug therapies for cancer [9]. For example, a nucleoside kinase encoded by the fruit fly Drosophila melanogaster has broader substrate specificity, better catalytic efficiency, and improved stability [10] relative to its human counterpart. A truncated version of the fruit fly dCK successfully re-sensitized a drug-resistant breast cancer cell line to treatment with an anticancer nucleoside analog [10]. Another example is a tomato (Solanum lycopersicum) thymidine kinase that is highly active and less sensitive to negative feedback regulation by its reaction products [11]. Researchers used a combination of an anti-cancer prodrug and the tomato thymidine kinase to successfully treat malignant glioma (brain tumor) cells in vitro and brain tumors in mice [12].
Diving into the ProteinCartography results for the deoxycytidine kinase family
Running ProteinCartography on deoxycytidine kinases
To explore the biochemical function of non-human dCK homologs, we used the ProteinCartography pipeline to find proteins that are structurally similar to the human dCK protein and group them into clusters based on that similarity. ProteinCartography uses BLAST and Foldseek to identify proteins similar to the input [13][14]. It compares the structures of each protein to every other protein to produce TM-scores, or structural similarity scores where a “one” indicates identical proteins [15]. Using these scores, the pipeline performs Leiden clustering to separate similar proteins into clusters and reduces dimensionality to create interactive UMAP and t-SNE projections with overlays for further exploring the protein family [16][17][18].
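For intuition about what a TM-score measures, here is a minimal sketch of the standard TM-score formula applied to one fixed alignment. It assumes we already have the distances (in ångströms) between aligned residue pairs after superposition. A full TM-score calculation also searches for the superposition that maximizes this value, and that search is omitted here. The function name and inputs are illustrative, and this is not how the pipeline itself computes scores.

```python
def tm_score(distances, ref_length):
    """TM-score for one fixed alignment.

    distances  -- distances (in angstroms) between aligned residue pairs
    ref_length -- length of the reference protein used for normalization
    """
    if ref_length < 19:
        # Below this length the standard d0 formula is not positive.
        raise ValueError("ref_length must be at least 19 for this formula")
    # Length-dependent distance scale from the standard TM-score definition.
    d0 = 1.24 * (ref_length - 15) ** (1.0 / 3.0) - 1.8
    # Each aligned pair contributes between 0 and 1; unaligned residues add 0.
    total = sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances)
    return total / ref_length


# A perfect superposition of every residue gives a score of one.
perfect = tm_score([0.0] * 260, 260)
# Perfectly aligning only half of the reference gives 0.5.
half = tm_score([0.0] * 130, 260)
print(perfect, half)
```

Because the sum is normalized by the reference length, residues that fail to align lower the score. This is one reason a much longer or much shorter homolog can score lower even when its core fold matches.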
In our analysis, we used “search mode” with standard parameters and with the human dCK structure as input (UniProt ID: P27707). We requested 3,000 Foldseek hits and 7,000 BLAST hits — a total of 10,000 structures. Our run generated 2,418 unique structure hits that grouped into 12 clusters (LC00–LC11) (Figure 1). Our input protein, human dCK, is in LC04 (Figure 1 and Figure 2, A).
A full list of all the proteins in this analysis, plus all the aggregated information from the pipeline, can be found in the aggregated features file included with the data for this pub.
Assessing compactness and overall quality
We started our analysis by exploring the Leiden cluster similarity matrix (Figure 2, B) to evaluate the quality of the protein space ProteinCartography generated. The similarity matrix displays scores calculated by comparing the mean TM-score of every structure in each cluster to every other structure in the analysis [1]. The similarity scores along the diagonal of the matrix show how tightly grouped the proteins are within each individual cluster. The average of the diagonal values is a measure we’ve previously described as “cluster compactness” [1]. The clusters in our analysis had a mean compactness score of 0.73. Most of the individual clusters also appear compact (a score above 0.6). In particular, LC04 (score: 0.91; the cluster with our input protein), LC09 (score: 0.92), and LC11 (score: 0.94) had some of the highest compactness scores (Figure 2, B). Cluster compactness represents a basic quality-control check of how well the proteins have grouped. However, given its nonlinear relationship with a number of other ProteinCartography outputs, we decided to include several clusters with low compactness in our downstream analyses to better understand the utility of cluster compactness.
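The calculation described above can be sketched as follows. This is a minimal illustration, assuming a symmetric all-vs-all TM-score matrix and one cluster label per protein. It excludes self-comparisons when averaging within a cluster, which is our assumption for this sketch and may differ from the pipeline’s implementation. The function names, toy matrix, and cluster labels are hypothetical.

```python
from itertools import product
from statistics import mean


def cluster_similarity(tm, labels):
    """Mean TM-score between every pair of clusters.

    tm     -- square list of lists; tm[i][j] is the TM-score of protein i vs. j
    labels -- cluster label for each protein, in the same order as tm
    Returns a dict keyed by (cluster_a, cluster_b).
    """
    clusters = sorted(set(labels))
    members = {c: [i for i, lab in enumerate(labels) if lab == c] for c in clusters}
    sim = {}
    for a, b in product(clusters, repeat=2):
        # Skip i == j so a protein's comparison to itself does not inflate the diagonal.
        pairs = [tm[i][j] for i in members[a] for j in members[b] if i != j]
        sim[(a, b)] = mean(pairs) if pairs else float("nan")
    return sim


def mean_compactness(sim):
    """Average of the diagonal entries (within-cluster similarity)."""
    return mean(v for (a, b), v in sim.items() if a == b)


# Toy data: four proteins in two hypothetical clusters.
tm = [
    [1.0, 0.9, 0.4, 0.5],
    [0.9, 1.0, 0.5, 0.4],
    [0.4, 0.5, 1.0, 0.8],
    [0.5, 0.4, 0.8, 1.0],
]
labels = ["LC_a", "LC_a", "LC_b", "LC_b"]

sim = cluster_similarity(tm, labels)
print(sim[("LC_a", "LC_a")], sim[("LC_a", "LC_b")], mean_compactness(sim))
```

Note that a cluster with a single member has no within-cluster pairs, so its diagonal entry is `nan` in this sketch and would need to be handled before averaging.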
As a preliminary check of the quality of the structures, we explored the distribution of mean pLDDT scores (structural confidence) and TM-scores (structural similarity) across all clusters. The pLDDT scores tell us how confident the AlphaFold structural prediction is, and low scores often point to disordered regions. A score of 100 indicates the highest possible confidence [5]. The majority of the structures in our dCK analysis had a pLDDT score greater than 80, except for the structures in LC02, which we discuss further below (Figure 2, A). These high scores suggest that we can be confident in the accuracy of the structural predictions. When we looked at TM-scores, which tell us how similar two structures are to each other, we saw that some structures are very similar to the input protein (TM-scores close to one), but some structures are only distantly related (TM-scores between 0.4 and 0.5) (Figure 1 and Figure 2, A). This broad spectrum of relatedness enables us to more thoroughly investigate the relationship between structural similarity and function.
Exploring the data
To better understand the composition of our clusters and guide our selection process, we explored ProteinCartography’s metadata overlays (Figure 1 and Figure 2, A). The metadata that we found particularly interesting for our analysis shows the distribution of taxa (broad taxonomy overlay) (Figure 2, D), length of proteins (length overlay), TM-scores (TM-score_v_input overlay) (Figure 2, E), pLDDT scores (pLDDT overlay), and UniProt annotation scores (annotation overlay), across all of the proteins in each Leiden cluster (Figure 1).
In the following subsections, we walk through the most interesting clusters.
We began by analyzing the metadata overlays for LC04, which contains our input protein, to see whether the results seem reliable and match what we would expect for the cluster containing the input protein. We started with the broad taxonomic group overlay. ProteinCartography assigns proteins into taxonomic groups that allow for the best readability, but the taxonomic depth is not uniform. Cluster LC04 contains two dominant taxonomic groups, mammals and other vertebrates. Because our input protein is a human protein, this is reasonable. The mean length of proteins in LC04 is ~270 amino acids, which is very close to our input protein (260 amino acids), and the mean TM-score is 0.9, indicating that the proteins in this cluster adopt a fold that’s highly similar to our input protein (Figure 1 and Figure 2, A). The mean pLDDT score for proteins in LC04 is 87, which confirms that the quality of the structural predictions is high and that the proteins are generally well-structured (Figure 1 and Figure 2, A). Last, the most common annotation score in this cluster is two (132 proteins out of 233 total in LC04) followed by one (78 proteins) (Figure 1 and Figure 2, A), which both suggest that existing UniProt protein annotations are of low confidence. We often observe these two annotation scores as the most common because the majority of the proteins in the UniProt database have not been biochemically characterized. Overall, these results are fairly typical for a ProteinCartography run and there were no surprises, so we’re reasonably confident that the pipeline worked as we’d hoped.
LC02: Plant homologs close in structure to human dCK
By exploring the taxon distribution across the other clusters in our analysis, we found that all proteins in LC02 are in the clade Viridiplantae (Figure 1; Figure 2, A; and Figure 2, D). The proteins in this cluster are much longer on average (512 amino acids) than our input protein (260 amino acids) (Figure 1 and Figure 2, A). Even though the proteins in LC02 have a slightly lower mean TM-score (0.8), they should still adopt the same fold as our input protein [19] (Figure 1 and Figure 2, A). The extra length of the proteins in this cluster may contribute to their lower TM-score and lower mean pLDDT score of 67. We explored the structures of a few of the individual proteins and noticed that they all have a core region with a high pLDDT score (90) that structurally aligns well with our input protein. However, that core region is flanked by unstructured portions on both the N- and C-termini, which may also contribute to the low pLDDT score for the entire protein. Similar to LC04, almost all proteins in this cluster have an annotation score of one (317 proteins out of 321 total in LC02), indicating an overall poor quality of the annotations in this cluster (Figure 1 and Figure 2, A).
LC08 and LC09: Taxonomically diverse homologs that diverge in structure from human dCK
When we explored the broad taxonomy overlay for LC08 and LC09, we found that there are highly diverse taxa represented in LC08, including Vertebrata, Bacteria, Archaea, Viridiplantae, and Arthropoda, while LC09 contains exclusively bacterial proteins (Figure 1 and Figure 2, D). The proteins in LC08 are on average longer than our input protein (319 amino acids vs. 260 amino acids), and this cluster also contains some very long proteins (> 1,000 amino acids) (Figure 1 and Figure 2, A). The length of proteins in LC09 is very uniform, and most proteins are shorter than our input protein (mean of 220 amino acids vs. 260 amino acids). Finally, both LC08 and LC09 show low mean TM-scores of 0.5 and 0.4, respectively, suggesting that the proteins in these clusters have adopted a fold that is more distantly related to our input protein (Figure 1 and Figure 2, E). For both clusters, the structure quality is high, with mean pLDDT scores of 83 and 93 for LC08 and LC09, respectively, and the vast majority of the proteins (74%) have an annotation score of one or two (Figure 1 and Figure 2, A), so their annotations are lower confidence.
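The per-cluster comparisons in these subsections (mean length, mean pLDDT, and share of low annotation scores) are simple grouped summaries. Here is a minimal sketch, assuming each protein is a record with a cluster label and a few metadata fields. The field names, cluster labels, and values are hypothetical and do not reflect the actual column names in the aggregated features file.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rows for illustration only; not real data from this analysis.
rows = [
    {"cluster": "LC_a", "length": 250, "plddt": 90.0, "annotation_score": 2},
    {"cluster": "LC_a", "length": 270, "plddt": 86.0, "annotation_score": 1},
    {"cluster": "LC_b", "length": 500, "plddt": 70.0, "annotation_score": 1},
    {"cluster": "LC_b", "length": 520, "plddt": 64.0, "annotation_score": 5},
]


def summarize_clusters(rows, low_score_max=2):
    """Per-cluster size, mean length, mean pLDDT, and fraction of low annotation scores."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["cluster"]].append(row)

    summary = {}
    for cluster, members in sorted(grouped.items()):
        # Count proteins whose annotation score is at or below the cutoff.
        low = sum(1 for m in members if m["annotation_score"] <= low_score_max)
        summary[cluster] = {
            "n": len(members),
            "mean_length": mean(m["length"] for m in members),
            "mean_plddt": mean(m["plddt"] for m in members),
            "frac_low_annotation": low / len(members),
        }
    return summary


summary = summarize_clusters(rows)
for cluster, stats in summary.items():
    print(cluster, stats)
```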
Overlaying annotation data
In addition to all of the overlays that the ProteinCartography pipeline outputs automatically, we can also create custom overlays to display any metadata. We manually noted which type of deoxynucleoside or deoxynucleoside derivative each protein was annotated to act on in UniProt since we noticed that not all the proteins in our maps are kinases that are annotated as proteins that act on deoxycytidine. We overlaid this annotation data onto our Leiden cluster map (Figure 2, C).
We were curious to see if proteins annotated as acting on the same substrate would cluster together, or if perhaps proteins with certain annotations would be distributed across multiple clusters. In the case of LC04, the vast majority of the proteins were annotated as dCK (deoxycytidine kinase), the same annotation as our input protein (Figure 2, C). For LC02, the most prevalent annotation was the general annotation, “deoxynucleoside kinase,” or dNK, which could mean these proteins act on several nucleosides or that this broad annotation was used because the substrate specificity was unknown (Figure 2, C). While LC08 contained very mixed annotations, all of the proteins in LC09 were annotated as acting primarily on cytosine or cytosine derivatives (Figure 2, C). In addition to overlaying the protein annotations across the Leiden cluster map, we used ProteinCartography to generate a semantic analysis of the annotations that provides a more granular view of their distribution throughout clusters (Figure 3). For example, we can see that while the input cluster LC04 is primarily annotated as “deoxycytidine kinase,” LC09 is primarily “cytidylate kinase” (Figure 3). Additionally, we can get more detail about the mixed annotations in LC08, and see that the primary annotations are “dephospho-CoA kinase,” “uridine kinase,” and “guanylate kinase” (Figure 3).
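The question of whether one annotation dominates a cluster can be illustrated with a simple tally. This is a minimal sketch, assuming we have one (cluster, annotation) pair per protein. The function names and toy pairs are hypothetical, and this is not the semantic analysis that ProteinCartography itself performs.

```python
from collections import Counter, defaultdict


def annotation_profile(pairs):
    """Count how often each annotation appears in each cluster."""
    counts = defaultdict(Counter)
    for cluster, annotation in pairs:
        counts[cluster][annotation] += 1
    return counts


def dominant_annotation(counter):
    """Return the most common annotation and the fraction of the cluster it covers."""
    label, n = counter.most_common(1)[0]
    return label, n / sum(counter.values())


# Toy (cluster, annotation) pairs for illustration only.
pairs = [
    ("LC_a", "deoxycytidine kinase"),
    ("LC_a", "deoxycytidine kinase"),
    ("LC_a", "deoxycytidine kinase"),
    ("LC_a", "deoxynucleoside kinase"),
    ("LC_b", "cytidylate kinase"),
    ("LC_b", "cytidylate kinase"),
]

profile = annotation_profile(pairs)
for cluster in sorted(profile):
    print(cluster, dominant_annotation(profile[cluster]))
```

A cluster where the dominant annotation covers a small fraction of its members would correspond to the “very mixed” case described above for LC08.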
Summary
Aside from the cluster with our input protein, LC04, we find LC02, LC08, and LC09 most interesting because they contain proteins from diverse taxa and close, as well as distant, structural homologs of our input protein. We plan to use proteins from these clusters to test whether the two foundational hypotheses underlying ProteinCartography are accurate (that proteins with similar functions cluster together and those with dissimilar functions cluster separately), but we want to hear your thoughts!
What do you think?
Testing hypothesis 1: Do proteins within clusters function similarly?
Here are our ideas about how we might test this.
We could characterize uncharacterized proteins from the cluster containing our input protein to determine if they have the same function as the input protein (in LC04). Specifically, we plan to test the ability of proteins to phosphorylate deoxynucleoside substrates using ATP.
We could refine the current annotations of proteins that are annotated too broadly. In the cluster with our input protein, some proteins are annotated as the generic “deoxynucleoside kinase.” We could make this more specific by testing how these proteins interact with different substrates.
Do these seem like reasonable approaches to test this hypothesis?
Testing hypothesis 2: Do proteins in different clusters have different functions?
Here are the clusters we’re considering to test this question. Each seems distinct in a different way, so we suspect that we’ll find functional differences between proteins from these clusters and between these and our human input protein, which is in LC04.
LC02 contains exclusively plant proteins with an overall low quality of annotations. The proteins in this cluster are also longer than our input protein and contain a disordered region at each end. We could investigate whether there are functional differences between our input protein and proteins in LC02, which could be caused by the disordered region.
The proteins in LC08 span several distinct taxonomic clades and are only distantly related structural homologs of our input protein.
LC09 contains exclusively bacterial proteins that adopt a different fold from our input protein based on our structural comparisons.
How should we approach working with dCK proteins in vitro?
Once we select individual clusters and proteins, we’ll bring them into the lab for biochemical characterization. We plan to purify each protein we select and test its ability to act on its possible substrates.
Are there tips/tricks/challenges to biochemical analysis of dCK?
Do you have ideas for functions of dCK that we might want to test other than or in addition to its activity as a deoxynucleoside kinase?
Additional methods
We used ChatGPT to help critique, clarify, and streamline text that we wrote.
Next steps
Now that we’ve selected deoxycytidine kinases as a protein family to test, we hope readers will provide feedback on the interesting clusters we identified and how to choose individual proteins for further analysis. Once selected, we’ll bring these proteins into the lab for functional assays. We’re planning to purify our selected proteins and run basic activity assays on each one.
While our biochemical efforts are in progress, we have a few additional computational ideas to gain insights into what we can learn from ProteinCartography clustering. We discuss these potential next steps below.
Align functional data in the literature with ProteinCartography clustering
While we plan to directly compare the function of diverse proteins from each family in our own hands, we might also be able to check our ProteinCartography clustering against empirical functional data in the literature. Do proteins with similar functional profiles cluster together? Do those known to work differently cluster apart?
This analysis should be doable, as several homologs of the human dCK enzyme have biochemical data available, including proteins from chicken [20][21], frog [20][21][22], worm [23], arabidopsis [24], fruit fly [10][25], mosquito [26], moth [22], amoeba [27], and bacteria [28][29][30][31][32][33][34]. There’s also a review that summarizes the biochemical activity of enzymes from this family from multiple organisms [35].
Learn more about clusters and individual proteins by studying specific, conserved structural features
We’re broadly interested in leveraging comparative structural biology to annotate protein function. While ProteinCartography analyses rely on comparing the global protein structure, there are many other structure-based characteristics that we might consider in trying to predict function across protein families. Some of these features include secondary structural elements (like ɑ-helices or β-sheets), surface area, hydrophobicity, electrostatics, topology, inter-protein contact networks, active sites, and potentially predicted binding sites. We’re interested in comparing these features across proteins to provide more specific and accurate protein function predictions.
For example, if we start with the human dCK enzyme and determine the conservation of its structural features across many structural homologs, we may be able to predict with a higher accuracy which of these proteins have a similar function. We know that the human dCK enzyme acts not only on deoxycytidine (dC), but also on deoxyguanosine (dG) and deoxyadenosine (dA). Could we predict which other proteins act on these three nucleosides? Might we predict which proteins act on just one?
Summary
We hope that by combining our fold-based structural clustering, more specific information on structural features, and functional data from the literature, we can start to develop a more complete and predictive framework to understand protein function.
For hypothesis two, it might be good to look at active site conservation between clusters. Structure dictates function, but with enzymes the active site is probably what counts the most in terms of function. For example, DNA polymerases always have two conserved Aspartate residues that coordinate at least divalent metal ions at the active site, even when the overall structure may be radically different.
Brae M. Bigge:
Thanks, Sean! I agree that the active site is important, and we have looked at that in some upcoming work. However, because ProteinCartography does full structure comparisons, it’s much more likely to pick up large-scale structural changes than small 1-2 amino acid changes in the active site. I think both are important comparisons, particularly when thinking about function broadly (not just enzyme activity but also stability, substrate specificity, etc.).
Soumendranath Bhakat:
The biggest open question is: Do proteins in different clusters have inherently similar dynamics?
Dynamics of kinases play a key role in cellular signaling. It is emerging that different classes of kinases have inherently similar dynamics.
Example: Serine-threonine kinases remain in dynamic equilibrium between DFG-in and DFG-inter states, and the transition between DFG-in and DFG-inter states allosterically sends a signal to a distant part of the protein, which governs protein-protein interactions involved in inflammatory signaling.
Can we find something like that for each cluster in the case of deoxycytidine kinase family members? I strongly believe that using AI, molecular simulation, and machine learning we will be able to find “common structural features which govern protein dynamics unique to each cluster”.
Imagine the implications. We will have the first-ever dynamics data for deoxycytidine kinase family members. A great starting point to train new AI models.
Thanks for the comment! This is a great question. We often think of proteins as stationary objects, but we know that proteins are very dynamic and that their dynamics are important for their broader functions. For our starting point, we’re thinking of functions as those broader things that we could easily measure in the lab, like activity and stability, but looking at dynamics as another “function” is something that we could consider. If the proteins between clusters are quite structurally different, it might be hard to discriminate between structural differences and dynamics differences. But if we could do that or control for it, it would be cool if we could find differences in dynamics between clusters!
Patrick Kelly:
I find it particularly interesting that the plant homologs in LC02 have an amino acid chain length close to double that of the input protein. Considering dCK is a homodimer, I initially thought the plant homologs might be dimers presenting as a monomer. Since it was mentioned that a few of the examined plant proteins were seen to have a core region that structurally aligns with the input protein, I’m curious why plant dCKs seem to have such long C- and N-terminal regions. I’m especially interested in learning if these elongated termini contribute to the enzyme’s function/activity.
Brae M. Bigge:
Hi Patrick, thanks for your comment! We found this interesting too, and we actually did decide to investigate a bit further. In an upcoming pub, we look at the function of a couple of these plant proteins compared to the human dCK, so stay tuned! We found some cool differences in both their activity and how they eluted from the sizing column, which might provide some insight. However, even with that insight, it is still a bit of a mystery exactly why these proteins have such long disordered regions at the C and N termini. That could be a really cool problem for someone to dig into!
Seemay Chou:
Under this definition, do you consider “differing functions” to be that they have entirely different enzymatic or biochemical activities? Or that they may have a different ensemble of activities? For example, how do you bucket protein clusters that may have shared activity with another, but another additional activity appended to that?
Seemay Chou:
Related, do you anticipate running into interpretation challenges for the above with deoxycytidine kinases?