
Predicting bioactive peptides from transcriptome assemblies with the peptigate workflow

Peptigate predicts bioactive peptides from transcriptomes. It integrates existing tools to predict sORF-encoded peptides, cleavage peptides, and RiPPs, then annotates them for bioactivity and other properties. We welcome feedback on expanding its capabilities.
Published on Aug 08, 2024

Purpose

Peptides are small protein sequences (less than 100 amino acids in length) with significant therapeutic and biotechnological potential due to their small size and the wide variety of biological pathways they participate in. Despite these appealing traits, experimental discovery of peptides remains challenging, and computational tools suffer from false positives.

In this pub, we introduce peptigate (peptide + investigate), a workflow that predicts and annotates bioactive peptides from transcriptomes. Peptigate unites functions previously distributed across different tools. It predicts small open reading frame (sORF)-encoded proteins, cleavage peptides, and ribosomally synthesized and post-translationally modified peptides (RiPPs) from transcriptomes. Peptigate then annotates them for bioactivity, chemical properties, similarity to known sequences, and signal peptide presence.

We used peptigate to predict peptides in the human transcriptome, resulting in 2,949 distinct peptides. Comparing these predictions against experimental datasets, we validated 24% of peptides overall (49% general cleavage, 20% RiPPs, 22% sORF-encoded peptides). A major challenge during this project was the lack of gold-standard data for validation, as peptide annotations are incomplete even for humans. We used noisy and incomplete proxies like mass spectrometry peptidomics databases and ribosomal profiling. With only a quarter of our predictions confirmed, it's unclear whether mismatches arise from gaps in these data sources or incorrect predictions. We welcome suggestions for more reliable ground truth data to improve our pipeline's assessment.

We anticipate that peptigate may be a jumping-off point for new peptide discovery. For example, if a researcher is interested in identifying peptides in a tumor microenvironment, they might interact with peptigate as follows. First, the researcher would identify a transcriptome or group of transcriptomes from their tumor and non-tumor samples. Next, they’d run peptigate on the transcriptomes. Using the peptigate output, they’d filter to peptides that are only present in the tumor samples or perform differential expression analysis and retain transcripts that encode peptides that are differentially expressed in the tumor microenvironment. 

What else could they do with this information? The researcher could use the metadata reported by peptigate to form a hypothesis about the cellular role of these peptides. For example, if the peptides contain a secretory pathway-targeting signal peptide, they're likely secreted and interact with other cells. Using these predictions, the researcher could design wet-lab experiments to follow up on their research interests.

What do you think?

If you think this example resonates with work you’re doing, we’d love to hear about it and possibly help. We are also open to learning about other peptigate use cases that others come up with.


The context

Peptides are a diverse class of biological molecules present in all three domains of life. They participate in activities like cellular signaling [1], chemical messaging [2], and defense/immunity [3][4]. Peptide synthesis occurs via many pathways, including ribosomal synthesis of small open reading frames (sORFs) [5], cleavage from precursor proteins, and synthesis by non-ribosomal enzymes [6].

Due to their high specificity and potency, peptides are increasingly recognized for their therapeutic and biotechnological potential. When compared to small molecules, peptides offer the advantages of lower toxicity and relative ease of synthesis. However, they often face challenges such as a short half-life and the requirement for non-oral delivery methods to bypass digestive degradation and effectively reach target tissues [7]. In contrast to other biologics like monoclonal antibodies, peptides theoretically benefit from simpler synthesis processes, shorter research and development phases, and faster regulatory approvals. Despite these advantages, peptides generally exhibit lower stability during storage and handling, similar to the stability issues observed with biologics, necessitating advanced formulation strategies to ensure efficacy.

Our working definition of “peptide”

While the definition of a peptide varies, for this pub, we’ll define peptides as small polypeptides composed of 2–100 amino acids with standalone biological activity. We refer to these peptides as “bioactive” to denote their distinct physiological functions, unlike peptide fragments from protein degradation or those that don't function independently, such as intermediary or cleaved signal peptides [8][9].

The problem

Endogenous peptide discovery is difficult, especially when predicting many peptides from many species. Before the advent of DNA sequencing technologies, peptide discovery was primarily an experimental endeavor. Early discoveries in the first half of the 20th century focused on single peptides implicated in specific biological actions [10][11][12][13]. Advances in chromatography and mass spectrometry ushered in a high-throughput discovery era via peptidomics [14][15]. Discoveries facilitated by these technologies as well as genome sequencing highlighted the underappreciation of peptides as a biological class [16].

In the intervening decades, further refinement of these technologies and appreciation of different ways peptides are synthesized endogenously have led to more discoveries of peptides [17][18]. Even still, blind spots persist. Some peptides are only present under hyper-specific conditions [19], while peptidomics and ribosomal profiling require expensive infrastructure and expertise and may require sample-specific preparation techniques that limit usability for new sample types [20][21][22][23].

Computational tools address this experimental bottleneck by predicting peptides from genomes and transcriptomes [17]. Sequencing data in particular is amenable to peptide discovery because it can be analyzed in many different ways, which fits with the natural diversity of peptides themselves; multiple tools can detect different types of peptides. However, detecting peptides from sequencing data is still fairly challenging. Apart from the many different kinds of peptides, the short nature of peptide sequences makes them difficult to detect and makes detection susceptible to false positives [5].

Our solution

We introduce peptigate, a workflow that applies previously developed best-in-class tools to predict and annotate diverse bioactive peptides from transcriptomes (Figure 1). “Peptigate” is a portmanteau of peptide and investigate. Peptigate currently predicts sORF-encoded proteins, cleavage peptides, and ribosomally synthesized and post-translationally modified peptides (RiPPs). These peptides are then annotated to predict bioactivity, chemical properties, similarity to known peptide sequences, and the presence of a signal peptide. These functions were previously scattered in disparate tools; peptigate unites them to make diverse peptide prediction simpler.

For multiple reasons, we chose to use transcriptomes as the input. RNA-seq data, and thus transcriptome assemblies, are comparatively more available than genomes, especially for less developed research organisms. Transcriptomes are also smaller and have a higher ratio of gene content than genomes, which reduces false positives in peptide discovery. However, to make peptigate more flexible, we also provide a reduced pipeline that takes predicted protein sequences as input.

Figure 1

An overview of the peptigate workflow for predicting bioactive peptides.

Peptigate takes a transcriptome assembly and open reading frames (ORFs) predicted from the transcriptome contigs as input. It uses these files to predict sORF peptides with plm-utils, cleavage peptides with DeepPeptide, and RiPP peptides with NLPPrecursor. Predicted peptides are then annotated for bioactivity using AutoPeptideML, compared to known peptides in the metadatabase Peptipedia, annotated for signal peptides with DeepSig, and characterized for chemical properties with the Python package peptides.py. The peptide prediction and annotation outputs are reported in a pair of TSV files. The predicted peptide sequences are also provided in nucleotide (FFN) and amino acid (FAA) format for convenience. We’ve omitted many intermediate steps in the workflow to focus on the parts of the workflow that perform predictive tasks.

The resource

The peptigate pipeline is available in this GitHub repository (DOI: 10.5281/zenodo.12775316).

Peptigate is a Snakemake pipeline that combines existing tools to predict bioactive peptides from transcriptomes. Below, we highlight how each part of peptide prediction works, covering sORF, cleavage, and RiPP peptide prediction and annotation.

Predicting small open reading frames

Background on sORFs

Small open reading frames (sORFs) encode peptides that are short upon synthesis [24] (rather than cleaved later). They’re also known as "short" open reading frames [25]. The functional peptide products are referred to as sORF-encoded polypeptides (SEPs or sPEPs), microproteins, or micropeptides [26]. DNA transcription and ribosomal translation from open reading frames 300 nucleotides or shorter produces these peptides. Most sORFs use non-traditional start codons like UUG, CUG, GUG, and ACG, each of which differs by one nucleotide from the start codon AUG [27].

While most genomes contain many sORFs, only a few are actively transcribed and translated. Most transcribed sORFs lie within the 5′ or 3′ UTR of a transcript's primary coding sequence (uORFs and dORFs, respectively). They often play regulatory roles by influencing the translation of the mRNA [5]. However, some sORFs encode peptides that are translated into functional small proteins. The majority of these sORFs have been identified in what were presumed to be long non-coding RNAs [5][28].

How peptigate predicts sORFs

The peptigate pipeline targets sORFs on contigs without longer ORFs to identify sORFs that encode functional peptides (as opposed to translation regulators). The pipeline begins sORF prediction by removing contigs in the transcriptome assembly that have predicted open reading frames (supplied by the user). Next, the pipeline tries to remove fragmented contigs that likely contain longer open reading frames by comparing each remaining transcript against the UniRef50 database using DIAMOND blastx [29]. If a contig matches a UniRef50 protein longer than 100 amino acids (the equivalent of 300 nucleotides), we remove that contig. Peptigate then scans the remaining contigs for open reading frames using common sORF start codons (AUG, UUG, CUG, GUG, and ACG [27]) and retains all predicted ORFs 300 nucleotides or shorter. It then predicts whether each sequence is coding or non-coding using the Python package plm-utils [30]. Plm-utils uses latent information in the large protein language model Evolutionary Scale Modeling (ESM) [31][32] to determine whether a short open reading frame is coding or non-coding [30].
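The ORF-scanning step described above can be sketched in a few lines of Python. This is a minimal illustration rather than peptigate's actual implementation: the function name `find_sorfs` is hypothetical, it scans only the forward strand, and it keeps the longest ORF per stop codon, all of which are simplifying assumptions on our part.

```python
# Start codons listed in the text, written as DNA (U -> T).
START_CODONS = {"ATG", "TTG", "CTG", "GTG", "ACG"}
STOP_CODONS = {"TAA", "TAG", "TGA"}
MAX_SORF_NT = 300  # length cutoff from the text


def find_sorfs(contig: str, max_len: int = MAX_SORF_NT) -> list[tuple[int, int, str]]:
    """Return (start, end, sequence) for forward-strand ORFs of at most max_len nt.

    Each ORF runs from a start codon to the first in-frame stop codon
    (stop codon included). Only the longest ORF per stop codon is kept.
    Hypothetical helper for illustration; not part of peptigate.
    """
    seq = contig.upper().replace("U", "T")
    sorfs = []
    for frame in range(3):
        start = None  # first start codon seen since the last in-frame stop
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon in START_CODONS:
                start = i
            elif codon in STOP_CODONS:
                if start is not None and i + 3 - start <= max_len:
                    sorfs.append((start, i + 3, seq[start:i + 3]))
                start = None
    return sorfs


toy_contig = "CCATGAAACCCTAGGG"
toy_sorfs = find_sorfs(toy_contig)
```

In the real pipeline, candidates like these are then passed to plm-utils for coding versus non-coding classification; a length filter alone says nothing about whether an sORF is translated.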

Predicting cleavage peptides

Background on cleavage peptides

Cleavage peptides are generated by enzymatic cleavage (proteolysis) of precursor proteins. These peptides are initially ribosomally translated while embedded in the precursor protein and then cleaved to become biologically active. Peptides can be proteolytically released from proteins by specific [33] or general proteases [34] or receive additional modifications after cleavage [35]. Cleavage peptides participate in a variety of biological tasks, including stress response (corticotropin-releasing hormone), blood sugar regulation (insulin and glucagon), blood clotting (thrombin), inflammation (C3a), and phagocytosis (C3b).

Cleavage peptides are different from propeptides and proteolytic degradation products. Propeptides are parts of proteins that are cleaved during protein maturation and don’t have a biological function once cleaved. Similarly, proteolytic degradation products are generated by the ubiquitin or lysosomal pathways and mostly don't generate functional products, although individual amino acids are often recycled for new protein synthesis [36][37].

How peptigate predicts cleavage peptides

The peptigate pipeline predicts two classes of cleavage peptides: cleavage peptides with protease cut sites as well as ribosomally synthesized and post-translationally modified peptides (RiPPs). Peptigate uses the DeepPeptide tool to identify cleavage peptides with protease cut sites [38]. DeepPeptide is built atop the ESM2 large protein language model [31][32] and predicts peptides and propeptides from protein sequences. The peptides range in length from 5–50 amino acids.

Peptigate uses NLPPrecursor for RiPP prediction [39]. NLPPrecursor was trained using only bacterial RiPP sequences and thus may work best when run on bacterial protein sequences [39]. However, many cyclic eukaryotic peptides are RiPPs [40]. When run against eukaryotic protein sequences, we think it's possible that the RiPP peptides detected were once horizontally transferred from bacteria to eukaryotes; however, we haven't followed up on this hypothesis.

Annotating predicted peptide sequences

Functional annotation of peptide sequences is difficult for many reasons. Most protein functional annotation tools use sequence similarity or orthology to compare new protein sequences to proteins of known function. These methods often generate statistically unreliable results when applied to short sequences; for short sequences, sequence similarity comparisons typically only work to find matches that are very similar to sequences that have already been discovered [41]. In some species, peptides encoded by sORFs are under lower purifying selection [42], or they’re evolutionarily young so they’re not present in other closely related species [25], decreasing sequence similarity. 

Moreover, peptides can exhibit varied functions in different biological contexts due to their ability to adopt multiple conformations [43], which complicates functional annotation based solely on sequence similarity. Because some peptide functions, such as antimicrobial activity, are easier to assay, these functions may improperly propagate even though they don't reflect in vivo functions [44].

Peptigate attempts to overcome these challenges by annotating predicted peptide sequences using multiple approaches. First, peptigate compares against known peptide sequences by BLASTing each predicted peptide sequence against the Peptipedia database using DIAMOND blastp [29][45][46]. Peptipedia is a metadatabase with peptide sequences from 76 databases encompassing 213 bioactivities (as of March 23, 2024). Peptigate reports the top match for each peptide.
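To illustrate what "reporting the top match" can look like in practice, here is a small sketch that keeps the highest-bitscore hit per query from BLAST-style tabular output. The column order is the standard 12-column tabular format that DIAMOND writes with `--outfmt 6`; the function name `top_hits` and the toy identifiers are our own, and peptigate may select its top hit differently (for example, by e-value).

```python
import csv
import io

# Default column order of BLAST/DIAMOND tabular output (--outfmt 6).
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def top_hits(handle) -> dict[str, dict[str, str]]:
    """Keep the highest-bitscore hit for each query sequence.

    Hypothetical helper for illustration; not part of peptigate.
    """
    best: dict[str, dict[str, str]] = {}
    for row in csv.DictReader(handle, fieldnames=COLUMNS, delimiter="\t"):
        current = best.get(row["qseqid"])
        if current is None or float(row["bitscore"]) > float(current["bitscore"]):
            best[row["qseqid"]] = row
    return best


# Toy tabular output with made-up peptide and database identifiers.
example = io.StringIO(
    "pep1\tknownA\t100.0\t20\t0\t0\t1\t20\t1\t20\t1e-10\t45.1\n"
    "pep1\tknownB\t85.0\t20\t3\t0\t1\t20\t5\t24\t1e-05\t30.2\n"
    "pep2\tknownC\t90.0\t10\t1\t0\t1\t10\t1\t10\t1e-03\t22.0\n"
)
hits = top_hits(example)
```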

Next, peptigate annotates signal peptides in predicted peptide sequences using DeepSig [47]. Signal peptides are short peptide sequences (16–30 amino acids long) that mark proteins for secretion [48]. Signal peptides can provide clues as to the function of a protein depending on the presence and the class of the signal peptide [49].

Peptigate predicts the function of predicted peptide sequences using AutoPeptideML [50]. AutoPeptideML is a tool that allows users to build and use models for peptide bioactivity prediction through machine learning best practices. It uses the ESM large protein language model (ESM2-8M) internally to improve prediction accuracy [31][32]. Currently, peptigate uses 16 models built in the AutoPeptideML preprint (including antibiotic, anticancer, ACE inhibitor, antifungal, anti-MRSA, antimalarial, antimicrobial, antioxidant, antiparasitic, antiviral, blood-brain barrier crossing, neuropeptide, quorum sensing, toxic, and tumor T-cell antigen) [50]. However, the Matthews correlation coefficient of these models ranges from approximately 0.02 to 0.73, indicating a wide performance range and a general inability to predict peptide bioactivity. Nevertheless, this approach is state-of-the-art, so we’ve included it in the peptigate pipeline.
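For readers less familiar with the Matthews correlation coefficient (MCC), the sketch below computes it from the four cells of a binary confusion matrix. MCC ranges from -1 to 1, where 1 is perfect prediction and 0 is no better than chance, which is why values near 0.02 indicate very little predictive signal. The function name and the toy counts are ours and aren't taken from the AutoPeptideML preprint.

```python
import math


def matthews_corrcoef(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from confusion-matrix counts.

    MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)).
    Returns 0.0 when the denominator is zero (a common convention).
    """
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator


# Toy counts: a perfect classifier and a coin-flip classifier.
perfect = matthews_corrcoef(tp=50, tn=50, fp=0, fn=0)
chance = matthews_corrcoef(tp=25, tn=25, fp=25, fn=25)
```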

Peptigate also calculates peptide chemical characteristics using the Python package peptides.py. Peptigate calculates metrics like molecular weight, charge, and hydrophobicity. These attributes can be used to compare peptides or to assess whether a given peptide is suitable for a downstream task (e.g., removing hydrophobic peptides because they’ll be difficult to synthesize).
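As a rough illustration of the kind of filtering these metrics enable, the sketch below computes two simple descriptors by hand: mean Kyte-Doolittle hydropathy (often called GRAVY) and a crude net charge that counts K/R as +1 and D/E as -1. This is a stand-in, not the peptides.py API, which provides more rigorous versions of these calculations. The function names and the `max_gravy` threshold are hypothetical.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(peptide: str) -> float:
    """Mean Kyte-Doolittle hydropathy; higher values are more hydrophobic."""
    return sum(KYTE_DOOLITTLE[aa] for aa in peptide) / len(peptide)


def approx_net_charge(peptide: str) -> int:
    """Crude net charge near neutral pH: +1 per K/R, -1 per D/E."""
    positive = sum(peptide.count(aa) for aa in "KR")
    negative = sum(peptide.count(aa) for aa in "DE")
    return positive - negative


def drop_hydrophobic(peptides: list[str], max_gravy: float = 1.0) -> list[str]:
    """Keep peptides at or below a (hypothetical) hydropathy threshold."""
    return [p for p in peptides if gravy(p) <= max_gravy]


kept = drop_hydrophobic(["AAAA", "KKDE"])
```

The appropriate cutoff depends on the downstream synthesis or assay method, so the default here is only a placeholder.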

Last, peptigate determines the nucleotide sequences that encode the predicted peptide protein sequences. The nucleotide sequences are three times as long as the amino acid sequences, which can improve sequence searches against large databases and other comparisons. Peptigate doesn't use these sequences directly for annotation, but they're provided to the user as an output so they can be further analyzed (e.g., via sequence similarity clustering with MMseqs2 [51]). 
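The mapping from a peptide back to its coding nucleotides follows from the three-to-one codon relationship. The sketch below assumes we know the peptide's 0-based amino acid offset within its parent ORF and that the ORF nucleotide sequence is in frame; the function name is hypothetical, and peptigate's own bookkeeping may differ.

```python
def peptide_nucleotides(orf_nt: str, peptide_start_aa: int, peptide_len_aa: int) -> str:
    """Slice the nucleotides that encode a peptide out of its parent ORF.

    orf_nt: in-frame ORF nucleotide sequence.
    peptide_start_aa: 0-based position of the peptide's first residue in the ORF.
    peptide_len_aa: peptide length in amino acids.
    Hypothetical helper for illustration; not part of peptigate.
    """
    start = 3 * peptide_start_aa
    end = start + 3 * peptide_len_aa
    if peptide_start_aa < 0 or end > len(orf_nt):
        raise ValueError("peptide coordinates fall outside the ORF")
    return orf_nt[start:end]


# Toy ORF encoding M-A-K-F-stop; the peptide "AK" starts at residue index 1.
toy_orf = "ATGGCTAAATTTTGA"
toy_peptide_nt = peptide_nucleotides(toy_orf, peptide_start_aa=1, peptide_len_aa=2)
```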

We think there's still room for improvement in our approach to peptide annotation, especially for bioactivity prediction. We welcome feedback or suggestions on how to improve our approach.

Limitations of the peptigate pipeline

While we tried to generate a comprehensive tool, peptigate is still limited. Below, we outline specific tasks that peptigate doesn't yet perform and highlight why including these approaches is difficult.

Peptigate doesn't predict non-bioactive peptides. It's focused on predicting bioactive peptides, so it doesn't predict degradation products from the ubiquitin or lysosomal degradation pathways or digestion (e.g., tryptic cleavage). It also doesn't predict sORFs that occur in the 5′ or 3′ UTR of longer ORFs, as most of these sORFs regulate the translation of the transcripts they occur in and don't have bioactivity beyond this niche role [5].

There are also some classes of bioactive peptides that peptigate doesn't yet predict. In particular, peptigate doesn't predict nonribosomal peptides synthesized by nonribosomal peptide synthetase(s) (NRPSs). NRPSs synthesize peptides independent of messenger RNA and ribosomes. Each enzyme typically contains multiple catalytic domains that help accomplish a specific peptide synthesis step. Multiple NRPS enzymes are usually required to synthesize a peptide, and these enzymes are usually co-located together in the genome (and co-expressed on polycistronic transcripts in the case of bacteria). We didn't include NRPS prediction in peptigate because we were unsure how to identify which NRPS enzymes belong to a single NRPS peptide. We were also unsure if we'd be able to predict the peptide sequence generated through this mechanism.

There are also several annotation tasks that peptigate doesn't currently perform. In general, we omitted tools that are only accessible through a browser, don't have commercial-compatible licenses, or aren’t easily installable through a package manager or a container. We considered including the tools DeepLoc to predict the sub-cellular localization of a peptide [52], PeptideRanker to assess the likelihood that a peptide is bioactive [53], and PepScore to assess whether a peptide is stable in humans [54], but ultimately didn’t include them. We're also interested in predicting the immunogenicity of peptide predictions but didn't find an accurate tool for this.

Peptigate pipeline inputs and outputs

The peptigate pipeline takes three user-provided input files: a transcriptome assembly and annotated ORFs from that assembly in both amino acid and nucleotide format. These files are then used to predict sORF and cleavage peptides.

Peptigate also relies on databases and models. These are either packaged in the peptigate repository or the pipeline downloads them. The sORF prediction tool plm-utils, the cleavage peptide prediction tool NLPPrecursor, and the bioactivity annotation tool AutoPeptideML all require model weights. The plm-utils model is packaged in the peptigate GitHub repository, while the pipeline downloads the AutoPeptideML and NLPPrecursor models. Peptigate also downloads the two databases on which it depends, UniRef50 and Peptipedia. Once downloaded and prepared, the peptigate pipeline will use these same files repeatedly unless they're moved or changed.

Peptigate outputs six files: two FASTA files and four TSV files. The two main outputs are a pair of TSV files, “peptide_predictions.tsv” and “peptide_annotations.tsv.” The predictions file provides the peptide identifiers, sequences, and the tools that predicted each peptide. The annotations file provides information from each annotation approach discussed above. The FASTA files and the partner TSV files provide the predicted peptides’ amino acid and nucleotide sequences. 

We also adapted peptigate to run when the user only has protein sequences as input. In this scenario, peptigate predicts sORF proteins by keeping only proteins shorter than 100 amino acids. Cleavage peptide prediction and annotation proceed as in the main pipeline, although without nucleotide reporting.
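The length filter in the protein-only mode is simple enough to sketch directly. The example below parses FASTA text and keeps proteins shorter than 100 amino acids. The function names and the choice to strip a trailing stop symbol (`*`) before measuring length are our assumptions, not documented peptigate behavior.

```python
def read_fasta(text: str) -> dict[str, str]:
    """Parse FASTA-formatted text into {identifier: sequence}."""
    records: dict[str, str] = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]  # identifier is the first word of the header
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


def keep_short_proteins(records: dict[str, str], max_len: int = 100) -> dict[str, str]:
    """Keep proteins strictly shorter than max_len amino acids."""
    kept = {}
    for name, seq in records.items():
        seq = seq.rstrip("*")  # assumption: ignore a trailing stop symbol
        if len(seq) < max_len:
            kept[name] = seq
    return kept


example_fasta = ">short_protein\nMKTLLVAGA*\n>long_protein\n" + "M" + "A" * 120 + "\n"
short = keep_short_proteins(read_fasta(example_fasta))
```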

Evaluating the peptigate pipeline

The code and associated data we used to evaluate the peptigate pipeline are available in this GitHub repository (DOI: 10.5281/zenodo.13239486), including the results and evaluation of running peptigate on the human RefSeq transcriptome.

We used peptigate to predict peptides in the human transcriptome to understand the tool’s accuracy. Starting from the human RefSeq transcriptome (click here to download the transcriptome), we predicted open reading frames using TransDecoder. We recognize that this approach doesn't fully take advantage of existing annotations for the human transcriptome, but it matches our recommended preprocessing for peptigate. Peptigate predicted 4,235 distinct peptides in the human transcriptome (Table 1). After removing DeepPeptide-predicted propeptides — a part of a protein cleaved during activation or maturation that lacks independent function — 2,949 peptide sequences remained.

We next wanted to evaluate the accuracy of these predictions. Because not all human peptides have been annotated, we lacked a ground truth against which to compare our peptide predictions. We decided to compare the predicted peptide sequences against orthogonal data sources such as databases of previously observed peptides, public annotations, and ribosomal profiling data. We reasoned that if we observed matches between these data sources and our predictions, this would provide evidence that the peptide is likely real. However, this approach is flawed because any disagreement could mean that databases are incomplete, our predictions are at least partially wrong, or some combination of the two. Even still, we moved forward with this approach because we were unable to identify a better gold standard dataset for evaluation.

Prediction tool within peptigate pipeline | Number of predicted peptides | Peptipedia | NCBI metadata | RibORF | Total (distinct)
DeepPeptide (predicts cleavage peptides) | 263 | 130 | NA | NA | 130 (49%)
NLPPrecursor (predicts RiPPs) | 431 | 87 | NA | NA | 87 (20%)
plm-utils (predicts sORFs) | 2,255 | 291 | 287 | 288 | 486 (22%)
Total | 2,949 | 508 | 287 | 288 | 703 (24%)

Table 1. Summary of peptides predicted by peptigate and orthogonal validation information.
“NA” indicates that orthogonal information wasn't available. “Distinct” refers to distinct amino acid sequences; each sequence is counted once even if it’s validated by multiple datasets. The percentage in the “Total (distinct)” column is the fraction of predicted peptides validated by at least one orthogonal dataset.
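The percentages in Table 1 can be reproduced from the counts, and the "distinct" column behaves like a set union. The sketch below uses the counts reported in the table; the peptide identifiers in the union example are made up for illustration.

```python
# Counts taken from Table 1.
predicted = {"DeepPeptide": 263, "NLPPrecursor": 431, "plm-utils": 2255}
validated_distinct = {"DeepPeptide": 130, "NLPPrecursor": 87, "plm-utils": 486}


def percent(part: int, whole: int) -> int:
    """Percentage rounded to the nearest whole number."""
    return round(100 * part / whole)


per_tool = {tool: percent(validated_distinct[tool], predicted[tool]) for tool in predicted}
overall = percent(sum(validated_distinct.values()), sum(predicted.values()))

# "Distinct" means each peptide counts once, even with support from several
# datasets. Toy identifiers below are hypothetical.
peptipedia_hits = {"pepA", "pepB"}
ncbi_hits = {"pepB", "pepC"}
riborf_hits = {"pepC"}
distinct_validated = peptipedia_hits | ncbi_hits | riborf_hits
```

This is why the plm-utils total (486) is smaller than the sum of its three per-dataset counts: many sORF-encoded peptides are supported by more than one dataset.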

We started by comparing peptigate’s predictions to peptides in the Peptipedia database [46]. Peptipedia is a metadatabase composed of peptides from 76 databases, including human peptide-containing databases like Peptide Atlas [55]. Using the annotation results generated by the peptigate pipeline, we checked whether the predicted peptides had a hit against any peptide in Peptipedia. More cleavage peptides had hits to peptides in Peptipedia than sORF peptides: 49% of peptides predicted by DeepPeptide, 20% of peptides predicted by NLPPrecursor, and 13% of peptides predicted by plm-utils had hits against at least one peptide in the database (Table 1). Overall, 508 of the 2,949 predicted peptides (17%) had a Peptipedia hit, which suggests that at least this fraction of peptigate-predicted peptides is likely real.

View the analysis code we used to investigate peptide matches against the Peptipedia peptide database.

For cleavage peptides, we expected to predict more peptides than are present in databases because the DeepPeptide paper predicted 1.3× the known number of peptides in humans (352 in UniProt, 458 predicted) [38]. To determine whether predicted cleavage peptides that didn't have matches in the Peptipedia database might still be real, we looked for signals associated with cleavage peptides. For example, most (but not all) annotated cleavage peptides are cleaved from precursor proteins that contain an N-terminal signal peptide [56]. Signal peptides target a protein to the secretory pathway and allow cleaved peptides to reach their final destination [57]. Many cleavage peptides function as hormones or other signaling molecules, making export from the cell a key step in their biogenesis [57]. Of the 133 predicted peptides with no BLAST hit, 28 are predicted from precursor proteins with signal peptides. We also investigated whether the precursor proteins contained propeptides, as many precursor proteins contain these constructs that help with protein folding, stability, or targeting [58]. A further eight precursor proteins contained propeptides. These results suggest that some cleavage peptide predictions that didn't match known peptides are biologically plausible.

View the analysis code we used to identify signal peptides and propeptides in the precursor proteins of cleavage peptides.

We anticipated that sORF-encoded peptides would have a lower hit rate than cleavage peptides when compared against peptides in databases. While Peptipedia contains 76 databases, it doesn’t include dedicated sORF catalogs like sORFs.org [59]. Further, cleavage peptides were discovered many decades before sORF-encoded peptides [60][61][62][63], and so we expect more cleavage peptides to be annotated than sORFs. In addition, many sORFs are thought to be evolutionarily young [25], meaning we wouldn’t expect homology to peptides from other species. Even still, because so few sORF-encoded peptides had matches against the Peptipedia database, we next focused on validating this class of peptide predictions.

We first looked at the annotations for each transcript. Since we started with a RefSeq transcriptome, all transcripts are labeled as curated coding, curated non-coding, predicted coding, or predicted non-coding by their accession number. Of the 2,255 predicted sORF-encoded peptides, 13% are labeled as curated coding (Table 1). We anticipate that many more transcripts are actually coding; recent research has shown that many transcripts labeled as non-coding actually contain sORFs that encode peptides [64][65][66][67][68][69][70][71][72]. However, the observed overlap validates a subset of our sORF predictions and demonstrates that the Peptipedia database is partially incomplete with regard to sORF-encoded peptides with known coding potential. 
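RefSeq encodes these four labels in the accession prefix: NM_ for curated mRNAs, NR_ for curated non-coding RNAs, XM_ for predicted (model) mRNAs, and XR_ for predicted non-coding RNAs. A lookup like the one below is enough to recover the label from a transcript identifier. The function name is ours, and the accessions in the example are only used to show the format.

```python
# RefSeq RNA accession prefixes and the labels used in the text.
REFSEQ_PREFIXES = {
    "NM_": "curated coding",
    "NR_": "curated non-coding",
    "XM_": "predicted coding",
    "XR_": "predicted non-coding",
}


def refseq_label(accession: str) -> str:
    """Map a RefSeq transcript accession to its annotation category."""
    return REFSEQ_PREFIXES.get(accession[:3], "unknown")


labels = [refseq_label(acc) for acc in ("NM_000207.3", "NR_001564.2", "XM_011512345.1")]
```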

Given that Peptipedia is incomplete with regard to sORFs, we tested how many predicted sORF-encoded peptides are supported by ribosomal profiling data. Ribosome profiling data is generated by sequencing fragments of mRNA that are protected by ribosomes, offering a snapshot of translation in action [73]. If one of our predicted sORF-encoded peptides appears in a ribosome profiling dataset, it lends credence to the idea that this is a real, translated peptide. A recent set of papers developed a tool called RibORF that predicts open reading frames from ribosomal profiling data and used it to re-analyze over 600 ribosomal profiling datasets from humans [54][74]. We found that 13% of sORF-predicted peptides overlapped with RibORF predictions (Table 1), with 265 matches (189 canonical, 61 non-coding, nine extension, and six truncation). This overlap supports the idea that these sORFs are translated into proteins.

View the analysis code we used to compare sORF-encoded peptides against ribosomal profiling data.

The fraction of sORF-predicted peptides that appeared in ribosomal profiling data underwhelmed us, so we tried to validate these sequences using other orthogonal datasets. First, we checked whether peptigate predicted true non-coding RNAs as coding. Of the three we tested (XIST, HOTAIR, NEAT1), peptigate predicted none to be coding. Although three transcripts is a small test set, these findings suggest that peptigate can discriminate between coding and non-coding RNAs.

View the analysis code we used to search for non-coding RNAs in sORF-encoded peptides.

We next wanted to measure the relative translation potential of the predicted sORFs. If an sORF is able to recruit a ribosome for translation, it's potentially more likely to be translated into a protein. To estimate translation potential, we measured the Kozak sequence similarity score for each predicted sORF and compared the distribution against ORFs > 300 nucleotides in the human transcriptome. The Kozak consensus sequence functions as a translation initiation start site and enhances translation efficiency by directing ribosomes to the correct start codon [75]. The motif spans six bases upstream and one base downstream of the start codon in a transcript [75]. The exact sequence varies, so each Kozak sequence can be scored in comparison to the most common sequence motif [76]. We scored each Kozak sequence as performed in [76]: using the sequence motif GccA/GccAUGG, we designated upper-case letters as highly conserved (scored +3) and lower-case letters as common (scored +1). We didn't score the start codon (the AUG within the motif). The maximum score is 13. On average and across transcript types (inherited from RefSeq labels), sORFs have lower Kozak sequence scores than other transcripts (Welch’s two-sample t-test, estimate = 1.4, p < 0.001, 95% CI [0.8, 1.07]). However, the sORF Kozak sequence scores occurred within the same range as those of other transcripts, with both coding and non-coding sequences achieving the maximum Kozak sequence score of 13. Given the range of Kozak scores observed, these results suggest that some predicted sORFs are likely to recruit ribosomes and be translated into proteins.
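The scoring scheme above translates into a short function. This is a sketch of the scheme as described in this pub, not the analysis code itself: the function name is hypothetical, and we assume that positions falling off the end of a transcript simply contribute no points, which may differ from how [76] or our analysis handled truncated contexts.

```python
# (offset from the first base of the start codon, accepted bases, points)
# Motif: G c c A/G c c AUG G, with upper-case = +3 and lower-case = +1.
KOZAK_RULES = [
    (-6, "G", 3),
    (-5, "C", 1),
    (-4, "C", 1),
    (-3, "AG", 3),
    (-2, "C", 1),
    (-1, "C", 1),
    (3, "G", 3),  # the base immediately after the start codon
]
MAX_KOZAK_SCORE = sum(points for _, _, points in KOZAK_RULES)


def kozak_score(transcript: str, start: int) -> int:
    """Score the Kozak context around a start codon at 0-based index `start`.

    The start codon itself is not scored. Positions outside the transcript
    contribute zero points (an assumption for this sketch).
    """
    seq = transcript.upper().replace("U", "T")
    score = 0
    for offset, accepted, points in KOZAK_RULES:
        position = start + offset
        if 0 <= position < len(seq) and seq[position] in accepted:
            score += points
    return score


strong_context = kozak_score("GCCACCAUGG", start=6)
weak_context = kozak_score("UUUUUUAUGU", start=6)
```

Because the rules don't check the start codon, the same function works for non-AUG starts such as CUG or GUG.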

View the analysis code for calculating and comparing Kozak sequence scores in sORF-encoded peptides versus normal open reading frames.

Overall, we struggled to identify a gold-standard, ground-truth dataset to use when evaluating peptigate. It's unclear to us what a "good" expected hit rate is against different orthogonal datasets. We expect some hits, as we'd expect some fraction of our predicted peptides to have been previously discovered. However, it's unclear how many bioactive peptides exist or how many have been discovered. A peptidomics mass spectrometry and machine learning paper published in 2022 suggested that, to date, only 300 peptides in humans have confirmed bioactivity [56], so our predictions aren't many orders of magnitude away from what we might expect, and there may be room for new human peptide discovery. We welcome suggestions for different validation datasets that can be used to validate computational peptide predictions.

Additional methods

We used ChatGPT to help refactor some Python scripts executed by the Snakefile, write first drafts of docstrings, and shorten lines of code to under 100 characters. We also used ChatGPT and Notion AI to suggest wording ideas, and then we chose which small phrases or sentence structure ideas to use.

Key takeaways

  1. Peptigate is a workflow for predicting and annotating bioactive peptides from transcriptomes. It combines existing state-of-the-art tools to predict peptides encoded by small open reading frames and cleavage peptides. It annotates predicted peptides to provide insights into their potential function.

  2. Peptigate is designed to better inform researchers as they make decisions about follow-up functional studies. This may require multiple peptigate prediction runs across diverse transcriptomes or additional prediction tasks. For example, if a researcher is interested in a specific bioactivity that isn't tested in peptigate, it may be useful to build additional bioactivity prediction models with AutoPeptideML.

  3. Only about a quarter of peptigate predictions match peptides found in orthogonal datasets, highlighting a need for more comprehensive and reliable validation methods and datasets.

Next steps

  1. Identifying ground-truth data. One of the things we struggled with during this project was a lack of gold-standard data for validation. Given that peptide annotations are incomplete, even for the human genome and proteome, it wasn't clear what to use as ground truth, true positive, and true negative data. We used orthogonal datasets like mass spectrometry peptidomics databases and ribosomal profiling as proxies, but these datasets are noisy and incomplete. We'd love new ideas for ground truth data we can use to assess our pipeline.

  2. Improving bioactivity annotations. Bioactive peptides participate in almost all aspects of metabolism, making them interesting for both basic and translational research. Even if we can produce confident peptide sequence predictions, it’s difficult to computationally predict the bioactivity of those sequences because of their short length. We're interested in identifying new tools or orthogonal tests that we can incorporate into peptigate to improve bioactivity annotations.

  3. Including more tools for peptide prediction and annotation. As described in the “Limitations…” section above, peptigate doesn't predict all types of peptides or incorporate all possible annotation tools. We'd like to expand the types of peptides and annotations included if we can overcome the challenges outlined in the limitations section.

  4. Making the pipeline easier to use. We wrote peptigate as an experimental pipeline. While we tried to assemble a reasonable pipeline, we identified many areas where we could improve the quality of our software engineering. If peptigate proves useful, we plan to improve the quality of the software by adding things like installation from a package manager and automated tests.


Share your thoughts!

Watch a video tutorial on making a PubPub account and commenting. Feel free to add line-by-line comments anywhere within this text, provide overall feedback by commenting in the box at the bottom of the page, or post about this work on social media. Please make all feedback public so other readers can benefit from the discussion. 

Comments
Bruno Cuevas:

Hi! I heard about “encrypted peptides” (De La Fuente group, UPenn). Proteins seem to encode functional antimicrobial peptides in case they get proteolyzed. Have you considered using a similar protein-derived dataset to pre-train your peptide detection tool?

Taylor Reiter:

This is really cool! I hadn’t heard of encrypted peptides before. I like how their model relies on physicochemical characteristics. The scale of their discovery is also interesting—they found 2,603 peptides in just the secreted human proteins. I think this would be a super interesting dataset to compare against. Thank you for pointing us toward it!


Encrypted peptides publication: https://doi.org/10.1038/s41551-021-00801-1

Bridget Hansen:

This is an interesting point. antiSMASH does a decent job of predicting bioactive molecules from DNA and has pipelines for bacteria, fungi, and plants. It would be interesting to see what would come from reverse transcribing the transcriptome in silico and then running the fragments through the NRPS prediction algorithm they use.
The enzymatic site calling has become fairly decent (Stachelhaus code), and if your molecule has no modifications, you could in theory predict partial peptide sequences. But I agree, if the peptide requires multiple NRPSs (2-3) or has additional modifying enzymes, which it likely will, it will be difficult to predict bioactivity (because the modifications often tailor the compound, changing its structure, which changes its function).
I think for NRPS, DNA may ultimately be the way to go with regard to prediction.

However, it would be interesting to use this to understand which NRPS transcripts are being translated. One could combinatorially predict all possible outputs (ex: you have 2 NRPS transcripts which means you have 4 possible options for peptide sequence pre-modification) and then you could pair that analysis with a targeted LC/MS or MALDI run to see if you have those peptides in your samples - although there are plenty of reasons why you might not choose this route.

I think starting with a transcriptome in general is a really interesting starting point, especially when thinking about bioactive synthetic potential for a system or community. This could help with drug discovery by providing some leads without a genome sequenced - a common problem in drug discovery from diverse environments or microbes.

This is overall a really interesting approach to bioactive peptide prediction and I look forward to seeing how it grows.

Taylor Reiter:

Thank you, Bridget! These are really useful points about NRPSs. I think it could be interesting to play around with the tools/methods you list above in fungi or plants to see how far a transcriptome can get. If a transcriptome is useful, I think this could be added into the peptigate pipeline, but if it's lossy or presents an unpredictably biased picture, it might be worth generating a separate pipeline that also takes genome sequences as input.

William Connell:

This might be more appropriate for the plm-utils Pub, as it concerns model evaluation and tuning.

Currently, plm-utils is evaluated using accuracy, F1, precision, and recall. While these metrics are useful for benchmarking against alternatives, I believe considering task-specific error tradeoffs could help refine the model’s decision threshold when applied prospectively for discovery.

For discovery applications, maximizing recall may be best to make sure all true peptides are recovered. However, this may also depend on the throughput of the experimental validation method.

For instance, in a low-throughput validation scenario, prioritizing high precision to minimize false positives is helpful. Conversely, given a high-throughput method, prioritizing high recall will reduce false negatives and capture more true functional peptides.

Taylor Reiter:

Thanks, William. Is there a specific metric that would be useful to report or a user-supplied filtering parameter that would help you as a user to be able to implement the type of filtering you would be interested in?

William Connell:

While I'm relatively unfamiliar with peptide science, I recognize the strategic value in developing discovery tools, given the wealth of available mining data and the progress in computational protein characterization and design. Many of the tools coming online for protein structure/sequence design are likely applicable, or can be readily adapted, to peptide design. For example, I worked on biophysical characterization of a proinflammatory antimicrobial peptide (AMP) LL37. Curating a larger dataset of AMPs via peptigate could help rationally assemble a fine-tuning dataset for design purposes.