The ProteinCartography pipeline is available here.
In the ProteinCartography pipeline, we use protein structural comparisons to generate interactive maps of protein families for exploration and discovery. This kind of analysis can be useful for provoking hypotheses about what properties could be driving functional differences within protein families and identifying outlier proteins where innovations might be found.
We’re presenting our initial version of the pipeline, which contains the core functionality, but we intend to continue improving the pipeline itself and adding features in future versions. For additional information about what’s coming, jump to the Next Steps section. Check back for new releases and updates!
This pub is part of the platform effort, “Functional annotation: mapping the functional landscape of protein families across biology.” Visit the platform narrative for more background and context.
The ProteinCartography pipeline is available in this GitHub repository. Try it yourself and let us know what you think!
Share your thoughts!
Watch a video tutorial on making a PubPub account and commenting. Please feel free to add line-by-line comments anywhere within this text, provide overall feedback by commenting in the box at the bottom of the page, or use the URL for this page in a tweet about this work. Please make all feedback public so other readers can benefit from the discussion.
All organisms, from single-celled bacteria to multicellular animals, share common types of basic building blocks, including proteins. Comparing proteins across the tree of life can help us understand how different organisms have evolved distinct traits and discover novel biology. Recent tools that enable searches based on structural similarity, including Foldseek, have made it possible to compare proteins from diverse organisms in new, and perhaps more informative, ways . We developed a pipeline that facilitates comparative protein biology by leveraging these emerging tools to enable users to interactively explore protein families.
Comparative protein biology is an important and rapidly progressing field. Amino acid sequences are widely used for these analyses due to the abundance and ease of working with sequence data, but there are disadvantages to such methods. For example, small protein sequence changes can result in dramatic structural changes that alter the function of the protein, and conversely, proteins with low sequence similarity can have similar folds and perform similar functions . Comparisons of protein structure could overcome these limitations, as structures are generally more conserved than protein sequences and are more closely tied to protein function . Historically, researchers have been limited by the availability of experimentally determined structures, but recent advances in protein folding prediction tools, such as AlphaFold and ESMFold, and protein search tools like Foldseek have brought us into a new era of protein analysis .
Three main methods are typically applied to represent protein space: classification, networks, and maps . These can be created based on sequence, structure, or other characteristics. Classification sorts proteins into hierarchical categories. For example, SCOP (structural classification of proteins) and CATH (class, architecture, topology, homologous superfamily) databases sort protein domains into categories based on folds or structure . Networks represent proteins as nodes that are connected to related proteins by edges . The most common type of network is a sequence similarity network (SSN), where protein nodes are connected by edges that represent some sequence similarity threshold defined by the user. Networks are useful because they can be used to cluster proteins into sub-groups. Finally, maps visualize a high-dimensional protein space representing complex information (like protein structural characteristics) as a collection of points in a low-dimensional space, often generated via classic dimensionality reduction tools like principal component analysis (PCA) and multidimensional scaling (MDS) .
Many of the analyses done with these three methods are aimed at understanding the whole protein universe, or all protein structures that have been experimentally solved or predicted . While these analyses are extremely useful for understanding large-scale protein evolution and for understanding how proteins as a whole relate to each other, they are computationally complex and can be difficult to interpret if you want to know more about individual proteins or protein families.
We developed a pipeline to rapidly and intuitively identify and visualize groups of proteins with similar structures across user-defined protein families (Figure 1). The ProteinCartography pipeline uses a combination of networks for clustering analysis and maps for visualization, and focuses these protein space representations at the protein family level to allow for rapid and intuitive analyses. The pipeline starts with a protein of interest provided by the user and searches available sequence and structure databases. After obtaining the AlphaFold-predicted structures of each match, the pipeline uses Foldseek to perform all-v-all structural comparison, which it uses to generate a similarity network for identifying groups of structurally related proteins. The pipeline then performs dimensionality reduction to create a visual “map” for exploratory analysis. Informative protein features can be overlaid on the map, such as cluster association, taxonomy, sequence conservation to the query, and annotation information. This allows you to generate hypotheses about what properties could be driving functional differences within protein families and identify outlier proteins where innovations might lie across taxa.
In this pub, we’ll take you through some general uses of the ProteinCartography pipeline, as well as an example of how the ProteinCartography pipeline works, what the results look like, and how to analyze them. This is all contained in the “ProteinCartography in action” section directly after this paragraph. For more in-depth information about the limitations of ProteinCartography and individual steps and parameters of the pipeline, see the “Comprehensive overview of the pipeline” section. To learn more about plans we have for improving the pipeline see the “Next steps” section and to provide feedback check out the “What do you think?” section.
You can use the ProteinCartography pipeline to generate hypotheses and make predictions about individual proteins. For example, it can identify proteins that are structurally similar to an input protein, or it can identify outlier proteins. Downstream analyses could tell you which regions of the protein are important for function, and further investigation could determine whether these protein regions differ across clusters. Additionally, you could use the ProteinCartography pipeline to annotate proteins of unknown function or to provide support for annotation predictions.
The pipeline also lets you explore subfamilies within larger protein families. For example, it could be used to make hypotheses about whether distantly related proteins in the same family are members of the same subfamily. You could use it to identify especially interesting subfamilies for further examination (like subfamilies composed of only proteins from a particular taxonomic group). This is perhaps the most common use we’ve encountered so far, and we’ll dive into this use case more below. Importantly, we use the pipeline as a starting point to generate hypotheses and make predictions, but encourage users to test their hypotheses and predictions with additional analyses.
Before we discuss the results of the pipeline and how to interpret them, we provide a brief walkthrough of how a run of the ProteinCartography pipeline typically works. To jump to a detailed description of each step in the “Comprehensive overview of the pipeline” section, click the link at the beginning of each step below.
The pipeline generally starts with a protein of interest, or input protein, but it can start with multiple proteins. A PDB (structure) file and/or FASTA (sequence) file are required for each input protein. The package provides utilities to fetch these from UniProt or AlphaFold based on accession number, or to fold short sequences (less than 400 amino acids) using ESMFold . For proteins longer than 400 amino acids, you can use outside tools like ColabFold to fold your proteins and import them into the pipeline .
Sequence-based search: The pipeline performs a protein BLAST search against the NCBI non-redundant (nr) database to identify proteins based on sequence similarity to the input .
Structure-based search: The pipeline performs a Foldseek search against the AlphaFold/UniProt50 v4, AlphaFold/Swiss-Prot v4, and AlphaFold/Proteome v4 databases to identify proteins based on structural similarity to the input .
Aggregate and filter hits: The pipeline combines BLAST and Foldseek results and filters hits based on a user-defined length, fragment status, and whether the protein is marked as inactive.
Download hits: Using the combined hits list, the pipeline downloads PDB files from the AlphaFold database and metadata from UniProt for all hits. Metadata include fields such as taxonomic information, annotation information, protein characteristics, and others .
Construct all-v-all similarity matrix: Using the downloaded structure files, the pipeline compares the structure of every protein to the structure of every other protein to identify the best matches using
foldseek search. For those matches, the pipeline uses
foldseek aln2tmscore to calculate a similarity score, or a TM-score, where generally, a value of 1 means two structures are identical and values closer to zero mean the structures are less similar . These scores are aggregated in an all-v-all similarity matrix. For more information about TM-scores, see the “Construct all-v-all similarity matrix” section.
Cluster proteins: The all-v-all similarity matrix is then used to cluster proteins into groups of similar proteins. The pipeline uses two different clustering algorithms, Foldseek’s TM-align greedy set clustering and the Leiden algorithm . While both are provided in the final results, we default to Leiden clustering for visualizations. The Leiden algorithm is a clustering method that iteratively groups proteins, in this case attempting to optimize the modularity of the network . More information on clustering algorithms and why we default to Leiden clustering can be found in the detailed “Cluster proteins” section.
Calculate cross-cluster similarity matrix: Once clusters have been created, the pipeline performs cluster-related analyses, including calculating the cross-cluster similarity matrix. The cross-cluster similarity matrix is a heatmap representing the mean TM-score of all structures in each cluster versus all other proteins in each other cluster. The diagonal of this matrix tells us how similar all proteins are within a cluster. We average the values of the diagonal to determine a “cluster compactness” score that we use as a heuristic for how well-clustered the proteins are overall.
Perform semantic analysis: Using clustering information and metadata obtained from UniProt, the pipeline performs semantic analysis to evaluate and visualize the most frequently occurring existing annotations for each cluster . The pipeline determines the most common annotations per cluster and shows them as individual bar charts, as well as the most commonly used annotation words per cluster and shows them as word clouds.
Create maps: The pipeline uses the all-v-all similarity matrix of TM-scores to perform dimensionality reduction and create a map of the protein family. It uses the all-v-all similarity matrix to calculate a principal component analysis (PCA), which it then uses to calculate a t-distributed stochastic neighbor embedding (t-SNE) and a uniform manifold approximation projection (UMAP) . For more information on these dimensionality reduction techniques, see the “Create map” section. Both t-SNE and UMAP result in 2D maps that are meant for visualization.
Generate interactive plots: To make t-SNE and UMAP plots maximally useful, the pipeline uses the Plotly Python package  to create an interactive and navigable map with color overlays corresponding to protein metadata, including length, broad taxon, TM-score to input from Foldseek local search, source (Foldseek vs. BLAST), average pLDDT (structure quality), and others.
As an example, we ran mitogen-activated protein kinase 10 (MAPK10), also called c-Jun N-terminal kinase 3 (JNK3), one of the top 200 most-studied human proteins, through the ProteinCartography pipeline . We refer to this protein as MAPK10 throughout this pub and abbreviate it as MK10 in figures. For the input, we used P53779 (Figure 2, A). The pipeline carried out the steps listed above and produced the following outputs, which we explore further in the next section:
A UniProt features TSV file containing a summary of the UniProt metadata as well as clustering information:
An HTML file containing cross-cluster similarity matrix (Figure 3)
An HTML file containing interactive t-SNE plot with color overlays (Figure 4)
An HTML file containing interacting UMAP plot with color overlays (Figure 5)
An HTML file containing semantic analysis (Figure 6)
A summary of the outputs relevant to our interpretation is in Figure 2, but we will go through each output throughout the section.
Mitogen-activated protein kinase 10 (MAPK10), or c-Jun N-terminal kinase 3 (JNK3), is a serine/threonine kinase that is a member of the MAP kinase family. MAP kinases are involved in a number of cellular functions, including everything from proliferation to apoptosis . These proteins form signaling cascades, or chains of interactions that result in a final signal being delivered. This particular kinase, MAPK10, is a neuronal kinase that is often involved in stress response, where its activation results in phosphorylation of several transcription factors that result in neuronal apoptosis .
MAP kinases are found in almost all eukaryotic organisms, but individual MAP kinases are not always well conserved. In particular, MAP kinases in the JNK pathway seem to have emerged more recently in evolutionary time, as they are usually only described in vertebrates , with a somewhat similar pathway described in yeast as the “HOG pathway” . Using the ProteinCartography pipeline, we can ask whether MAPK10-like proteins exist in earlier-diverging organisms, like fungi and plants. Additionally, we can ask if these MAPK10-like proteins are structurally similar or distinct based on how they cluster, and we can identify clusters and proteins for further computational or experimental analysis.
First, to determine how well the clustering performed, we looked at the cross-cluster similarity matrix (Figure 2, B and Figure 3). The diagonal of the matrix shows how structurally similar proteins are within each cluster, or how compact the clusters are. We refer to the average value of the diagonal as “cluster compactness.” The goal of clustering is to group similar proteins together, so when clusters contain structures that are not very similar, it can suggest issues with the clustering that may have to do with the proteins themselves or the clustering parameters. Cluster compactness doesn’t take into account all the ways that the pipeline might fail or succeed, but it does give us a general idea of whether a run produced interpretable results. We dig more into this in the “Testing the limits of the pipeline section.” We also consider cluster distinctness (how similar each cluster is to other clusters) when evaluating clustering. In this case, the clusters are not very distinct, but we expect this is because we’re evaluating a family of closely related proteins (Figure 2, B and Figure 3).
For this MAPK10 map, the cluster compactness value is 0.63 (Figure 2, B and Figure 3), which is around average for the analyses conducted for this pub. Among the clusters in this analysis, Leiden clusters 11 (LC11) and 12 (LC12) drew our attention, as they are compact, suggesting that they might hold proteins that are well clustered. Looking beyond the diagonal, the similarity matrix also tells us which clusters are similar to each other. In this example, we see that LC02 and LC06 are quite similar and may be more related to each other than they are to other proteins in the map (Figure 2, B and Figure 3).
We next looked at the clusters themselves. Our protein of interest appears in LC02 (Four-pointed star in Figure 2, C–D, Figure 4, and Figure 5). In addition to looking at which cluster our protein belongs to, overlaying additional information on the map provides more insight. For example, the pipeline categorizes proteins based on their broad taxonomic grouping. By examining the taxonomic groupings of proteins in the neighborhood of our input protein, we observe that our protein and the surrounding proteins originate from mammals and other vertebrates (“Taxon” panels in Figure 2, C–D; “Broad taxon overlay” panels in Figure 4 and Figure 5). It’s important to note that the taxonomic depth is not uniform and is instead chosen to be generally interpretable and useful to people while staying within the limitation presented by the available number of colors. Advanced users can also customize the taxonomic groups and colors based on their organisms of interest. In this view, we observe that LC12, one of the tight clusters we saw in Figure 2, B, contains primarily fungal proteins, whereas LC11 is composed of primarily plant proteins (“Taxon” panels in Figure 2, D–F; “Broad taxon overlay” panels in Figure 4 and Figure 5).
Intrigued that we identified clusters of fungal and plant proteins in our MAPK10 analysis, we explored features calculated by the pipeline to determine if these proteins really are MAPK10 or JNK proteins. We looked at the quality of predicted structures (mean pLDDT), and found that the proteins in these clusters were high-quality, or closer to 100 (“pLDDT” panel in Figure 2, C–F; “pLDDT” overlay in Figure 4 and Figure 5). We also looked at the structural similarity to our input (TM-score to P53779) and saw that the structures in LC11 and LC12 were generally structurally related to our input protein (“TM-score” panel in Figure 2, C–F; “TM-score to input” overlay in Figure 4 and Figure 5). Note that the TM-score to input values shown in the maps are calculated during the all-v-all comparison step of the pipeline. During this step, TM-scores are only calculated for pairs of proteins that meet the default threshold. The rest of the proteins are marked as zero. For more information see the “Construct all-v-all similarity matrix” section.
To provide additional context, the pipeline generates simple semantic analysis visualizations that summarize existing annotation information for proteins retrieved from UniProt (Figure 6). Comparing the three clusters of interest, we see the top annotation for the cluster containing our input protein is “Stress-activated protein kinase JNK (EC 220.127.116.11)”, while the top annotation in LC11 is “Mitogen-activated protein kinase (EC 18.104.22.168)” and in LC12 is “Mitogen-activated protein kinase HOG1 (MAP kinase HOG1) (EC 22.214.171.124)” (Figure 2, D′, E′, F′, Figure 6). “EC 126.96.36.199” refers to the enzyme class to which these MAPK proteins belong, suggesting that they do all fit in the same enzyme class. The HOG pathway from yeast is similar to the JNK pathway , so it makes sense that LC12, which is primarily composed of yeast proteins, would contain many proteins annotated as a HOG1 protein. To determine how useful these annotations are, we can overlay the UniProt annotation score on the map (“Annotation score” panels in Figure 2, C–F; “Annotation score overlay” drop-down in Figure 4 and Figure 5). The annotation score is assigned by UniProt and ranges from 1 to 5, where a score of 5 means that the annotation is backed by experimental evidence and a score of 1 generally means that annotations were predicted or inferred . For both of these clusters, there are several proteins supported by an annotation score of 4 or 5, suggesting that at least some of the annotations in each cluster are likely backed by experimental evidence (“Annotation score overlay” panels in Figure 2, C–F; “Annotation score overlay” drop-down Figure 4, and Figure 5).
Thus, using the ProteinCartography pipeline, we could now hypothesize that there are MAPK10 or MAPK10-like proteins in fungi and plants. However, we would want to test these hypotheses with additional experiments.
From a large list of candidates, the pipeline helped us identify specific groups of proteins of interest in diverse taxonomic groups and make predictions about their function in relation to our input. In this case, it’s especially interesting to note that the sequence identity of the proteins identified in plants and fungal species is quite low (~30%) and these were identified via Foldseek, suggesting that the pipeline was able to identify relatives that would have been missed using a BLAST search alone (Figure 3, Figure 4). However, because the pipeline provides information based on predicted protein structures, further analysis would be necessary to draw definitive conclusions about protein function. For this example, downstream analyses like assessing the presence or absence of interacting proteins upstream and downstream of MAPK10 in the signaling cascade, looking for the conservation of specific and known catalytic residues, determining evolutionary history, or performing biochemical assays to directly test protein function, could be used to help determine the function of proteins of interest.
For users interested in learning more about the inner workings of the ProteinCartography pipeline, the following section dives into the details and parameters that we used to build it. It also provides additional in-depth information for each step and a meta-analysis that tests the limits of the pipeline. To jump straight to the next section, “Next steps,” click here.
To run the pipeline, clone the GitHub repository and follow the instructions there for installation. The current version of the pipeline takes around 30 to 90 minutes to run for small- to average-sized (< 400 amino acids) proteins, assuming default search parameters.
TRY IT: The ProteinCartography pipeline is available in this GitHub repository, along with instructions to get started.
To run the pipeline, you will need a FASTA and/or PDB file for your input protein(s). The pipeline accepts a single or multiple input proteins. Each input protein will be used to perform independent BLAST and Foldseek searches. You can fetch FASTA and PDB files for most proteins in UniProt. The pipeline can fold proteins less than 400 amino acids in length using ESMFold . You can generate PDB files for larger structures using tools such as ColabFold .
After running this pipeline on a variety of proteins, we noticed that certain proteins resulted in more interpretable and useful maps than others. For example, shorter proteins with high structure quality tended to have the best performance. While we are working to improve the pipeline to be effective at analyzing a broad diversity of proteins, we wanted to provide some guidance to users for what proteins might work well in this preliminary release. We therefore performed a meta-analysis to identify the limits of the pipeline.
To identify the characteristics that make proteins appropriate for the current version of the pipeline, we analyzed a set of proteins sampled from the list of the 200 most-studied human proteins, as reported in . We selected 25 proteins from a distribution of two features we noticed are important: protein length and structure quality, represented by average pLDDT  (Figure 7, Table 1). The pLDDT is a value that represents how a predicted structure will align with an experimental structure at each residue based on the distance difference, so by averaging this value across the length of the protein, we can determine how much of the protein lacks a defined structure. We used these 25 proteins as individual input proteins for 25 separate runs of the ProteinCartography pipeline.
ALK tyrosine kinase receptor
Bile acid receptor
Complement factor H
E3 ubiquitin-protein ligase XIAP
Glycogen synthase kinase-3 beta
Hemoglobin subunit beta
HLA class I histocompatibility antigen, A alpha chain
Immunoglobulin kappa constant
Macrophage migration inhibitory factor
Mitogen-activated protein kinase 10
Nuclear receptor coactivator 2
Parkinson disease protein 7
Peptidyl-prolyl cis-trans isomerase A
Peptidyl-prolyl cis-trans isomerase FKBP5
Phosphatidylinositol 3-kinase regulatory subunit alpha
Polycomb protein EED
Proteasome subunit beta type-5
Retinoic acid receptor RXR-alpha
Retinol-binding protein 2
Superoxide dismutase [Cu-Zn]
Ubiquitin carboxyl-terminal hydrolase 7
von Hippel-Lindau tumor suppressor
Table 1. The 25 proteins we sampled from the 200 most-studied human proteins.
We evaluated the quality of the clusters generated using the cluster compactness metric discussed above (Figure 8, A). In addition to protein length and average pLDDT, we explored how metrics such as the number of domains, the fraction sequence identity to the input, and the TM-score to the input impacted compactness (Figure 8, B–F). For each of these metrics, we determined the value for every protein in each map and examined how their distributions varied in comparison to compactness (Figure 8). Among the 25-protein sample, shorter proteins with higher pLDDT and fewer domains tended to result in maps with more compact clusters than longer proteins with lower average pLDDT values and more domains (Figure 8, B–D). Proteins with lower pLDDT values could be lower-confidence predictions, but they could also be proteins with more intrinsic disorder. In either case, these proteins might not be appropriate for ProteinCartography analysis using structural comparisons. Additionally, AlphaFold structures are treated as rigid bodies and are aligned as such in the alignment step of our analysis. Slight differences due to protein dynamics could therefore be missed by alignment. This is something to consider especially when working with proteins that contain multiple domains connected by flexible regions (Figure 8, D).
The TM-scores shown here are calculated during the all-v-all structural comparison, which does not evaluate every possible protein-protein comparison comprehensively and therefore may contain missing values. For more information see the “Construct all-v-all similarity matrix” section. This means that zeros don’t necessarily represent non-matches. They just represent less good matches relative to the rest of the proteins. To visualize the median without the presence of these zero-values, we supply a median TM-value for each analysis with (white circle) and without (white diamond) zeros (Figure 8, F).
There are also certain proteins that are not well represented in the AlphaFold database. For example, viral proteins have been excluded from the current draft of the AlphaFold database, meaning searches involving these proteins will be limited . We wanted to know whether the taxonomic diversity or total number of hits in an analysis impacted pipeline performance. The number of hit proteins didn’t correlate well with cluster compactness for our limited analysis (Figure 9, A, B, F). However, there are clear cases when having very few proteins resulted in poor analyses. For example, NCOA2 only had 93 proteins identified and had the lowest cluster compactness (Figure 9, D, F). We also observed that more diverse taxonomic distributions may result in more compact clusters, but need additional analyses to determine if this trend holds (Figure 9, C–F).
In summary, we advise you to consider the characteristics of each input protein when analyzing it with the ProteinCartography pipeline, as not all proteins are equally appropriate for structural comparison and could instead be evaluated using sequence, shapemer (or short fragment of a protein structure), or protein language model embedding comparisons. We will continue to use these 25 proteins throughout the following sections to evaluate our methods.
Once you’ve designated an input protein or proteins, the first step of the ProteinCartography pipeline involves searching protein structure and sequence databases.
The pipeline performs a sequence-based search for each input protein using NCBI Protein BLAST to search against the full NCBI non-redundant (nr) database . It runs BLAST using a query to the web API. You can customize the number of hits returned with a default cutoff of 3,000. In general, there is no taxonomic constraint applied. Because we’re querying such a large database and asking for relatively few hits, we also don’t generally use an E-value cutoff for our BLAST search. In our runs so far, the median sequence identities for our BLAST hits have been consistently higher than for our Foldseek hits, and generally above 50% (Figure 8, E). However, including quality cutoffs in our BLAST search or including the ability for users to set a cutoff for this is something that we hope to include in future versions.
From the BLAST search, the pipeline retrieves a list of RefSeq or GenBank identifiers. To retrieve predicted structures from AlphaFold and retrieve protein metadata, it maps these identifiers to UniProt accessions . At this step, some proteins are usually lost because not all proteins present in the non-redundant NCBI database are present in UniProt.
For the structure search, the pipeline uses a Foldseek web API query to search against the AlphaFold/UniProt50 v4, AlphaFold/Swiss-Prot v4, and AlphaFold/Proteome v4 databases using the 3Di search mode with no taxonomic restraints . The AlphaFold/UniProt50 v4 database uses MMseqs2 clustering at 50% sequence identity and returns a representative from each cluster that has the highest structure quality (average pLDDT) instead of all of the most closely related proteins . The AlphaFold/Swiss-Prot v4 database contains UniProt proteins with high-quality annotations . The AlphaFold/Proteome v4 database contains proteomes from a set of 48 model organisms and global health-related organisms . The pipeline currently runs the Foldseek structure-based search step using a query to the web API, and each database search returns a maximum of 1,000 sequences. This is a constraint set by Foldseek.
Using the combined set of BLAST and Foldseek hits, the pipeline queries the UniProt database to retrieve metadata for each protein, including the protein name, gene name, organism, protein length, cross-references to annotation databases such as Pfam and InterPro, and other metadata (see example TSV) . It then uses this metadata to filter and remove hits based on user-defined size cutoffs (if applied), whether the protein is marked as a fragment, and whether the protein is marked as inactive.
Next, the pipeline downloads the structure files (PDB files) for the proteins that are in the AlphaFold database, which includes only protein structures predicted using AlphaFold . We again lose some proteins at this step that were identified via BLAST but don’t have AlphaFolded structures, but one could use AlphaFold or ESMFold to fold these unfolded proteins .
Once all of the structures have been obtained and compiled in a single folder, the pipeline uses Foldseek to compare every protein structure to every other protein structure to create an all-v-all comparison for network analysis . This works by first performing
foldseek search on the user’s machine, this time searching each hit against every other hit that was downloaded from AlphaFold . From these alignments, the pipeline obtains E-values for each comparison. The pipeline uses the default E-value threshold of 0.001 set by Foldseek to determine which pairs of proteins to compare using TM-align. The pipeline then aligns any pairs of structures with E-values that satisfy the threshold using
foldseek aln2tmscore to obtain a TM-score (template modeling score). A TM-score is a metric for the structural similarity of protein structures that ranges from 0–1 . A TM-score of 1 means the compared structures are identical, while two protein structures with scores above 0.5 are usually similar and proteins with scores of 0.17 or lower are likely unrelated . Any comparisons that did not satisfy the threshold E-value do not return a TM-score; we set these missing values to 0 for the purposes of clustering. In addition to the E-value threshold,
foldseek search defaults to returning a maximum of 1,000 TM-scores for each protein analyzed, including the input. This means that not all comparisons will have calculated TM-scores. We also treat these missing values as 0 for the purposes of clustering.
This type of thresholding is common in network analyses to help slice the space into groups of differing depths , but we have not yet determined the optimal thresholding parameters for our analyses. Consequently, in our visualizations, there is an inflated number of comparisons with a TM-score of 0 — users should treat these zeros as missing values, rather than a true absence of structural similarity. In future versions of the pipeline, we plan to explore these thresholds in a more principled way to determine what is appropriate for different types of analysis and will provide configuration parameters for users to tune the stringency of filtering. We will also provide complete TM-score calculations for input proteins for visualization purposes.
Next, the ProteinCartography pipeline arranges TM-scores into an all-v-all similarity matrix. Additionally, the pipeline adds the score of each hit protein compared to the input protein(s) to the output TSV file and uses it in the final visualization steps.
After generating an all-v-all similarity matrix, the pipeline groups proteins into clusters based on how similar they are to each other. Originally, we used Foldseek’s TM-align greedy set clustering algorithm to generate structural clusters . Foldseek’s clustering algorithm utilizes Linclust and MMseqs2 . Briefly, protein structures are represented as 3Di sequences . Linclust extracts short sections of these sequences and uses them to sort the sequences into groups. The longest sequence in each group is identified as the representative. Foldseek’s structural clustering uses the representatives in an initial structural alignment, the output of which feeds into MMseqs2 clustering. During this step, MMseqs2 clusters representatives based on TM-score. MMseqs clustering is a greedy set-cover algorithm, meaning that it chooses a single representative structure and adds it into a new or existing cluster and repeats this until all sequences belong to clusters .
In our 25-protein analysis, we observed that many analyses only contained a small number of structural clusters as defined by Foldseek. 15 out of 25 maps contained fewer than five clusters (Figure 10). Moreover, these clusters didn’t tend to be very compact. While Foldseek’s clustering method may be appropriate for some uses, we wondered whether other clustering algorithms may be better suited for our intended use case.
In particular, we were interested in Leiden clustering, a method that has become popular for identifying groupings within single-cell expression networks . Leiden clustering is another algorithm used to identify communities, or preliminary clusters, within a pre-existing network in a three-phase process. First, proteins are grouped to find the highest quality community separations. Next, the algorithm undergoes a refinement step where proteins can be switched to other communities. Finally, the network is aggregated . This process is generally done several times. We used the implementation of Leiden clustering found in the popular single-cell RNA-seq analysis package Scanpy . This implementation takes a matrix of counts and performs principal component analysis (PCA), generating a neighborhood graph using an implementation of uniform manifold approximation projection (UMAP). We used parameters n_pcs = 30 and n_neighbors = 10 for this implementation of Leiden clustering, and performed Leiden clustering until optimized (scanpy default, n_iterations = −1).
To compare the clustering methods, we applied each to the 25 proteins we sampled in Figures 7–9 (Figure 10). To determine which algorithm better sorted the proteins into clusters, we measured cluster compactness and number of clusters in the resulting map (Figure 10). In all 25 cases, the structural similarity scores within clusters were higher when using Leiden clustering (Figure 10, A, C, C′) and there were more clusters (Figure 10, B, D, D′), suggesting that Leiden clustering might be more appropriate for identifying sub-groups within groups of structurally similar proteins. We hypothesize that Foldseek’s structural clustering might be better for larger-scale analyses looking across families, but did not test this.
The current version of the pipeline provides both structural clustering using Foldseek’s algorithm and Leiden clustering results in the final “aggregated_features.tsv” file, but it defaults to Leiden clustering in plots and analyses. We have not fully optimized the standard parameters and this could contribute to the differences in clustering quality for different protein families. In future iterations of the pipeline, we hope to experiment more with these parameters to develop more generalizable clustering approaches.
To allow for better understanding of the quality and content of the clusters, the pipeline calculates a cross-cluster similarity matrix (Figure 2, Figure 11). For each cluster, it calculates the mean TM-score of all structures in that cluster versus all other proteins in each other cluster. Clusters with a greater mean cross-cluster TM-score are more structurally similar (with a maximum value of 1). Within this visualization, the y-axis represents the protein that is treated as the query protein, while the x-axis represents the target protein in each comparison. The diagonal of the matrix represents the similarity of all proteins within a cluster, which we can use to assess cluster compactness (cmx), or the average value of the diagonal. Clusters with a low within-cluster mean TM-score are likely to contain assortments of unrelated or dissimilar proteins. The pipeline visualizes the results of this comparison using a heatmap.
While the cluster compactness score provides a general idea of the quality of the clustering, the heatmaps produced with each run of the pipeline can help users more thoroughly evaluate clustering effectiveness for their analyses. We show heatmaps from analysis of four example proteins, hemoglobin subunit beta (HBB), peptidyl-prolyl cis-trans isomerase A (PPIA), retinol-binding protein 2 (RET2), and complement factor h (CFAH) from our sample of most-studied human proteins (Figure 11). These examples reflect a range of potential outcomes for clustering. The first three proteins all have high cluster compactness (HBB: 0.79, PARK7: 0.71, RET2: 0.85), but their heatmaps reflect variable levels of interpretability. Clusters in the HBB analysis show a high level of compactness, with clusters 00, 02, 03, 04, 06, 07, and 08 showing mean TM-scores > 0.5 along the diagonal (Figure 11, A–B). However, examining the other cells of the matrix reveals that several of these clusters may not be very distinct — for example, clusters 00, 03, 04, and 08 show high levels of mutual similarity, suggesting that these clusters might be combined into a single larger cluster (Figure 11, A–B).
The PPIA analysis shows a similar result, where several clusters could potentially be fractions of a larger cluster (Figure 11, C–D). Particularly interesting with the PPIA matrix, there are strong horizontal lines that suggest there are some clusters that are similar to all other clusters (Figure 11, C–D). The RET2 analysis shows a more extreme example: most clusters, with the exception of 01 and 12 (and, to a lesser extent, 04 and 10) show relatively strong similarity to all other clusters, suggesting that the clustering analysis was not able to identify distinguishable sub-groupings (Figure 11, E–F). In some cases, such as for large proteins like CFAH, clustering does not appear to produce compact or distinct clusters (Figure 11, G–H). For these types of proteins, the large number of domains and lower overall pLDDT might impede structural comparisons due to the limitations of rigid-body structural comparison, and other types of comparison networks — such as sequence, shapemer, or protein language model embedding — might be more amenable to clustering analysis.
The pipeline retrieves protein metadata from UniProt, which can include gene and protein names . While not always reliable, especially for understudied organisms, these annotations can provide a more human-readable method of understanding what kinds of proteins exist in each cluster . To summarize protein annotations from each cluster, we implemented a visualization that we refer to as “simple semantic analysis.” For each Leiden cluster, the pipeline aggregates the most frequently occurring annotations and individual annotation words, and represents these as a ranked bar chart and proportional word cloud, respectively (Figure 5).
For example, the semantic analysis plot for the MAPK10 analysis is shown in Figure 5. Our input protein is in LC02, where the most represented annotation is “Stress-activated protein kinase JNK (EC 188.8.131.52)”. In the word cloud, we can also see,“JNK,” “Mitogen-activated,” and “MAPK,” all suggesting that clustering analysis correctly aggregated these proteins together, since we know that MAPK10 is a member of the JNK family. LC11, a cluster composed of primarily plant proteins, has the top annotation “Mitogen-activated protein kinase (EC 184.108.40.206),” suggesting it comes from the same enzyme class, EC 220.127.116.11. For LC12, a primarily fungal cluster, the top annotation is “Mitogen-activated protein kinase HOG1 (MAP kinase HOG1) (EC 18.104.22.168),” which is consistent with the literature showing that the JNK pathway is similar to the fungal HOG pathway . This analysis can provide broader biological context for the contents of each cluster.
The pipeline uses standard dimensionality reduction approaches to create a visual representation of protein space. It starts by using the original all-v-all similarity matrix to calculate a principal component analysis (PCA) with 30 components . The PCA results are then passed to an analysis to calculate the t-distributed stochastic neighbor embedding (t-SNE) and the uniform manifold approximation projection (UMAP) . For the t-SNE, it returns two components, the perplexity is set to 50, and the number of iterations to run is set to 2,000. For the UMAP, it returns two components, the number of neighbors is set to 80, and the minimum distance between neighbors is set to 0.5. The parameters used here are defaults used in other analogous analyses, but in the future, we plan to optimize them for our particular use cases.
Both UMAP and t-SNE are non-linear, graph-based methods for dimensionality reduction . They are both meant for visualization — we do not treat the 2D maps generated by these techniques as fully representative of the higher-dimensional relationships between proteins. They each follow the same general principle: create a high-dimensional graph, then reconstruct it in a lower-dimensional space while retaining the structure. t-SNE moves the graph from high dimension to lower dimension point by point, while UMAP compresses the high-dimensional graph . We provide both in this pipeline so that the user can choose which visualization is easier to navigate for each protein family. Often, t-SNE creates more space between clusters, while the UMAP plot appears more connected. However, users should not interpret distances in the 2D axis of UMAP or t-SNE plots as quantitative.
Finally, the pipeline uses all the data collected above to create an interactive and navigable map that you can use to explore the protein family (Figure 2, Figure 3). The pipeline produces HTML file maps that allow dynamic visualization, built using the Plotly Python package  This allows you to interact with graphs and apply multiple overlays as shown in the above examples and detailed more thoroughly below. A toggle button allows you to see the input protein(s) in the map as black, four-pointed star markers, and metadata for each protein is displayed in a tooltip when the mouse cursor hovers over a point. In addition to the interactive plot, the pipeline produces a file that contains all the UniProt features along with the information calculated throughout the pipeline for each protein.
To empower researchers and make these plots maximally useful, the pipeline has a color drop-down that colors points according to protein metadata. The default view colors each point by its Leiden cluster. Clustering (separating the protein structures into similar groups based on the all-v-all similarity matrix) and mapping (visualization via dimensionality reduction) take place independently in the pipeline. However, at this final step, these two representations of protein space are combined when the results of the clustering analysis are overlaid onto the map.
You can color points by the following:
Annotation score, a metric to measure the annotation content of a UniProt protein 
Broad taxon, which can be either eukaryotic- or bacterial-focused with this current version of the pipeline; see the GitHub README for how to customize the taxonomic groups
Source (Foldseek vs. BLAST)
TM-score vs. the input protein (this TM-score is the value from the Foldseek local search and may not reflect the true TM-score due to E-value thresholding)
Fraction sequence identity vs. the input protein
Average pLDDT of the protein
[Experimental] Concordance vs. the input protein (see below)
Using this visualization and the accompanying file containing this information in a tabular format, we can begin to make predictions and hypotheses about the relationship of proteins to each other and even how these proteins might function.
We used ChatGPT to write, clean up, and comment code. We also used it to suggest wording ideas that we edited extensively.
This pipeline is a work in progress — we are actively building and adding features. Many of the features we hope to improve and add are recorded as GitHub issues in the ProteinCartography GitHub repository. As we move forward, we hope to build in four important areas: broad software improvements, validation, new analysis features, and linkages with other software packages. We lay out our plans below, but would love feedback on what you’d like to see us tackle next.
We built the ProteinCartography pipeline using Snakemake, which lets us develop flexible workflows that can run on most computers. However, we plan to provide a Nextflow version of the pipeline in the future. Additionally, we plan to decrease our reliance on APIs in general, but in particular, we hope to avoid using the Foldseek API. This will also allow us to support different databases for comparisons. It’s really important to us that this pipeline is not just useful, but also usable, so we plan to work on increasing usability and adding features that allow researchers to more easily interpret the space.
To provide users with more definitive and useful information, we must first provide more validation and principled statistical analysis. We plan to develop a clearer understanding of what characteristics make proteins amenable to analysis using the pipeline by performing additional large-scale analyses of diverse proteins and performing statistical tests to understand how well different protein features correlate with pipeline results. To perform such validation, we also need to expand our metrics for evaluating pipeline performance from focusing on cluster compactness to also include cluster distinctness, evaluation of over- and under-clustering, annotation distinctness (how well annotations line up with clusters), and other measures. By assessing these metrics, we will also be able to develop automated methods for parameter selection to identify sensible defaults that work well across protein families of diverse size and composition.
Additionally, we plan to explore the current parameters of the pipeline and how we might be able to optimize them for different use cases.
Finally, we are currently working on in-lab biochemical validation to show that the interesting predictions and hypotheses we have been able to make about protein function based on the results of the pipeline are actually indicative of true functional differences. We’re looking for proteins with established purification protocols and assays that come from diverse and interesting protein families. If you have any suggestions for proteins that fit these criteria and would be especially useful for this type of biochemical validation, please let us know in a comment!
While our pipeline is able to aggregate results from sequence and structural searches and provide maps for exploration, the pipeline does not yet perform detailed analysis of the features within proteins that make them distinct from each other. We’d like to add analyses that allow us to identify the specific regions of proteins that result in differences between clusters.
We want to make overall exploration of the maps easier and more intuitive by employing analyses to identify interesting proteins or outliers that might be of particular interest. We also want to find easy ways of pointing out which areas of the maps have high-quality clusters and which areas of the map users should consider avoiding. We’d like to find ways to identify the specific structural features between clusters that make them unique.
For example, we are interested in identifying proteins that are convergently evolved — composed of divergent sequences but which fold into similar structures — as well as proteins with high sequence identity but apparently divergent structures. The current version of the pipeline provides a rudimentary measure of the relationship between sequence identity and structural similarity, which we call “concordance.” This is a simple measure that subtracts structural similarity from sequence identity. More positive values mark proteins with greater structural similarity than sequence identity, while negative values mark the opposite. While the current measure is not statistically principled, as TM-score and sequence identity do not follow the same linear scale, we are exploring methods to compare sequence identity to structural similarity to identify proteins that meaningfully diverge or converge from expectations.
FEATURE REQUESTS: Are there features that would be useful to you? Let us know in a comment on this pub!
We’d like to incorporate phylogenetic information and sequence information to complement the structural information that the pipeline provides. Additionally, we’d like to integrate this pipeline with other resources from Arcadia, including NovelTree and PreHGT .
TRY IT: The ProteinCartography pipeline is available here, along with instructions to get started.
If you use it, let us know in a comment on this pub! We’d love to hear your use case and what you learned from your own protein mapping. Additionally, we welcome outside suggestions as GitHub issues and contributions through pull requests.
We’d particularly interested in getting your feedback on the following:
Would this kind of pipeline be useful for your own work?
How could we make it more useful for you?
Do you have any recommendations for types of analyses or validation?
Do you have any recommendations for protein families that you think would be particularly interesting to look at?
Share your thoughts!
Watch a video tutorial on making a PubPub account and commenting. Please feel free to add line-by-line comments anywhere within this text, provide overall feedback by commenting in the box at the bottom of the page, or use the URL for this page in a tweet about this work. Please make all feedback public so other readers can benefit from the discussion.
Thank you to Milot Mirdita of the Foldseek team for his helpful comments and advice about the Foldseek Search Server and to the Martin Steinegger and Johannes Söding labs for developing this set of tools.
Brae M. Bigge
Conceptualization, Data Curation, Formal Analysis, Investigation, Supervision, Visualization, Writing
Adair L. Borges
Feridun Mert Celebi
Critical Feedback, Supervision, Validation
Rachel J. Dutton
Galo Garcia III
Megan L. Hochstrasser
Cameron Dale MacQuarrie
David Q. Matus
Elizabeth A. McDaniel
Critical Feedback, Validation
Critical Feedback, Writing
Critical Feedback, Validation
Michael E. Reitman
Dennis A. Sun
Conceptualization, Data Curation, Formal Analysis, Investigation, Software, Visualization, Writing
Emily C.P. Weiss
Critical Feedback, Validation