Description
We’re interested in detecting small open reading frame (sORF)-encoded bioactive peptides in transcriptomes. sORFs are open reading frames that contain fewer than 300 nucleotides and often use alternate start codons. Computationally distinguishing genuine sORFs from spurious ones is challenging, and we wanted to more accurately detect sORFs that encode functional peptides.
We hypothesized that the embeddings of large protein language models might contain latent information about the coding propensity of amino acid sequences, even though this wasn't the original use case for such models. We were encouraged in this line of thinking because other peptide classification tools that identify cleavage peptides [1] and predict peptide bioactivity [2] have successfully used large protein language models to improve classification accuracy.
We applied this approach to predicting sORF coding potential, tested it on multiple datasets, and observed increased accuracy over leading tools. On all transcripts from a set of 16 diverse research organisms, our tool performed comparably to the leading tool, RNAsamba. However, our method significantly outperformed that tool on short sequences. Additionally, on the RNAChallenge dataset, where most tools struggle, we achieved an accuracy of 33% compared to the average tool accuracy of 11%. While our approach improves accuracy on this challenging prediction task, the overall accuracy indicates that there's still work to be done.
We packaged our approach as a small Python package called “plm-utils.” Using the Python package infrastructure improved the usability and portability of our tool and will allow us to expand the package in the future if it proves useful.
The plm-utils Python package is available in this GitHub repository.
The code to train and evaluate the sORF plm-utils model is available here.
Share your thoughts!
Watch a video tutorial on making a PubPub account and commenting. Feel free to add line-by-line comments anywhere within this text, provide overall feedback by commenting in the box at the bottom of the page, or post about this work on social media. Please make all feedback public so other readers can benefit from the discussion.
Advances in ribosome profiling and mass spectrometry have experimentally demonstrated that some small open reading frames (sORFs) are not random sequences in genomes but lead to functional products [3]. This has spurred a greater appreciation for, and interest in, the coding potential of sORFs. Most genomes encode many sORFs, only a fraction of which are transcribed and even fewer of which are translated. Most transcribed sORFs occur in the 5′ or 3′ UTR of the main coding sequence in a transcript and perform a regulatory role by impacting the translation of the mRNA [4]. However, some sORFs encode translated peptides with functional roles as small proteins. The majority of sORFs that encode peptides have been identified in presumed long non-coding RNAs [4][5], although some have also been found upstream of, downstream of, and overlapping with transcripts that have longer coding sequences [4].
sORFs are alternatively referred to as “short” open reading frames [6], while their functional peptide products are referred to as sORF-encoded polypeptides (SEPs or sPEPs), microproteins, or micropeptides [7]. These peptides are genomically encoded by open reading frames of fewer than 300 nucleotides (100 codons) and are synthesized via DNA transcription and ribosomal translation. The majority of sORFs use start codons that differ from the traditional start codon (AUG) by one nucleotide (UUG, CUG, GUG, and ACG) [8].
Historically, sORFs have been underrepresented in protein annotations [9]. Putative sORFs occur frequently throughout genomes, so tools that predict protein-coding regions use heuristic filters to reduce false-positive annotations [7]. These filters include length cutoffs of 300 nucleotides [10] and the requirement of an AUG (ATG) start codon [11]. Both of these filters preclude the computational annotation of sORFs because sORFs are shorter than 300 nucleotides and the majority start with non-AUG start codons [8][9].
Many computational tools have been developed to identify sORFs in sequencing data. The majority perform sORF discovery on either genome or transcriptome assemblies. These tools generally use evolutionary signatures or sequence heuristics to classify coding versus non-coding sequences.
Tools like PhyloCSF [12], PhastCons [13], and micPDP [14] use genome alignments and codon substitution patterns to identify sORFs. The three main limitations of these tools are the requirement for a genome assembly, the lack of built-in evidence that a predicted sORF is transcribed, and the dependence on cross-species conservation (in some species, sORFs encode evolutionarily young proteins that aren’t conserved in other species [6]).
The other common strategy for computational sORF identification is to predict whether a transcript (or an ORF predicted on a transcript) is coding or non-coding. These tools either use heuristics like codon substitution and nucleotide composition or train machine learning algorithms to predict whether a transcript or an ORF contains a coding sequence. Some tools, like MiPepid or sORFfinder [15][16], are trained specifically on short sequences, while others, like RNAsamba and DeepCPP [17][18], are trained on all transcripts but perform well on sORFs. In both cases, these models often struggle with small training datasets or heuristics that do not generalize to other species or sequence types [19].
Deep learning models like AlphaFold2 [20] and the protein language model Evolutionary Scale Modeling (ESM) [21] have revolutionized computational approaches to protein research [22]. Protein language models are trained on large numbers of protein sequences, sometimes alongside other information like multiple sequence alignments or protein structures. After “learning” patterns in this training data, a model can ingest new protein sequences and relate them to the information already in the model in a process called embedding. In the case of ESM, an embedded protein sequence is represented as a numerical vector, typically a high-dimensional array of floating-point numbers. While embeddings are not directly interpretable by humans, they capture information about a protein’s structure, including orthogonal attributes that correlate with structure, like function [21][23][24].
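To make the embedding step concrete, here is a minimal sketch of embedding one peptide with the fair-esm package and mean-pooling the per-residue representations into a single fixed-length vector. The example sequence and the pooling choice are illustrative, not necessarily exactly what plm-utils does:

```python
import torch
import esm

# Load the smallest ESM2 model (6 layers, 8M parameters, 320-dim embeddings).
model, alphabet = esm.pretrained.esm2_t6_8M_UR50D()
model.eval()
batch_converter = alphabet.get_batch_converter()

data = [("example_peptide", "MKTLLVLAVCLAAAY")]
labels, strs, tokens = batch_converter(data)

with torch.no_grad():
    out = model(tokens, repr_layers=[6])
per_token = out["representations"][6]  # shape: (1, seq_len + 2, 320)

# Mean-pool over the real residues, skipping the BOS/EOS tokens,
# to get one fixed-length vector for the whole sequence.
embedding = per_token[0, 1 : len(strs[0]) + 1].mean(dim=0)
print(embedding.shape)  # torch.Size([320])
```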
We wrote a Python package called plm-utils (short for protein language model utilities) that provides a basic set of tools for generating and analyzing embeddings of protein sequences using pre-trained protein language models. It currently only works with ESM2 models, but we may expand it in the future if doing so opens new use cases. If other protein language models would be helpful in your research, please let us know by commenting here or posting an issue on the plm-utils GitHub repository.
While we set this package up as a general tool for using the information in protein language models to improve protein prediction tasks, we developed it specifically to predict whether a transcript is coding or non-coding. We posited that protein sequences contain latent information not currently used by other tools, and that protein language models could extract this information to help classify coding transcripts. We thought this might be particularly true for sORFs because protein language model embeddings have improved other difficult peptide tasks, like predicting cleavage peptides [1] and annotating peptide bioactivity [2].
The plm-utils Python package is available in this GitHub repository (DOI: 10.5281/zenodo.12775178).
Plm-utils is a small Python package that includes functions for working with protein language models. Currently, it contains code for building binary classifiers from labeled data using ESM2 embeddings. It also contains helper functions for our first use case, predicting whether a transcript is coding or non-coding.
For the task of classifying coding vs. non-coding transcripts, plm-utils first uses orfipy to find and translate the longest open reading frame on each contig [25]. Plm-utils considers multiple potential start codons (AUG, UUG, CUG, GUG, and ACG), since ORFs in general, and sORFs in particular, can use any of these [8][26]. After translating all possible ORFs, it retains only the longest putative ORF from each transcript, assuming that this ORF is the one most likely to be genuine and to encode a bioactive protein. Next, plm-utils embeds the translated putative ORF sequences in the ESM2 model embedding space (esm2_t6_8M_UR50D). Plm-utils then uses these embeddings and ground-truth labels to train a random forest classifier; in this case, the classifier is trained to predict whether each ORF was translated from a coding or non-coding transcript. This results in a model that predicts whether a given amino acid sequence represents a genuine ORF. In more precise terms, the model classifies a given amino acid sequence as “coding” or “non-coding” based on its similarity to the longest ORFs derived from the coding and non-coding transcripts in the training dataset. The primary output of plm-utils is a TSV file indicating whether each transcript is coding (positive) or non-coding (negative).
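The sketch below illustrates what this pipeline might look like in code. The orfipy invocation, the record-ID parsing, the `embed()` helper (the mean-pooled ESM2 embedding sketched above, here assumed to return a NumPy array), and the `labels` dictionary mapping transcript IDs to ground-truth labels are all illustrative stand-ins, not plm-utils’ actual internals:

```python
import subprocess

import numpy as np
from Bio import SeqIO
from sklearn.ensemble import RandomForestClassifier

# 1. Find and translate ORFs, allowing the alternative start codons.
subprocess.run(
    ["orfipy", "transcripts.fa", "--pep", "orfs.faa",
     "--start", "ATG,TTG,CTG,GTG,ACG", "--outdir", "orfipy_out"],
    check=True,
)

# 2. Keep only the longest translated ORF per transcript. orfipy writes
#    one record per ORF; the ID parsing here is illustrative.
longest = {}
for rec in SeqIO.parse("orfipy_out/orfs.faa", "fasta"):
    transcript = rec.id.split("_ORF")[0]
    if transcript not in longest or len(rec.seq) > len(longest[transcript].seq):
        longest[transcript] = rec

# 3. Embed each longest ORF and train a class-balanced random forest;
#    `labels` maps transcript ID -> 1 (coding) or 0 (non-coding).
X = np.stack([embed(str(rec.seq)) for rec in longest.values()])
y = np.array([labels[tx] for tx in longest])
clf = RandomForestClassifier(class_weight="balanced", random_state=0).fit(X, y)
```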
To apply this approach to sORFs specifically, we added a length filtering step after ORF prediction and before embedding and coding vs. non-coding classification.
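In code, this could be a single filter on translated ORF length inserted before the embedding step (continuing the illustrative sketch above):

```python
# Keep only transcripts whose longest putative ORF is an sORF
# (< 100 amino acids, i.e., encoded by < 300 nucleotides).
longest = {tx: rec for tx, rec in longest.items() if len(rec.seq) < 100}
```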
Throughout this pub, we evaluate the capacity of plm-utils to differentiate between coding and non-coding transcripts as a preliminary test of its effectiveness. To conduct a thorough assessment, we developed several models; the performance of each varies depending on the specific attributes of the training data. This means that a model trained on all coding ORFs in a transcriptome will perform differently than a model trained only on sORFs. However, every model created using plm-utils is compatible with any sequence that ESM can process (the maximum sequence length ESM2 can handle is 1,024 amino acids).
We set up the plm-utils Python package so that it can be easily adapted to future use cases. At the moment, we have building blocks in place for embedding sequences, training, and making predictions from a binary classifier. We think this setup could be well suited for predicting whether a protein has a specific function or for predicting traits of a protein such as whether it is membrane-bound. Users would first need to build a new model using labeled training data for new use cases. This model could then be used to predict the traits of new, unseen data.
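For instance, a classifier trained for a new binary trait could be persisted and reapplied to unseen sequences with standard scikit-learn tooling. This is a sketch in which `clf`, `embed()`, `unseen_sequences`, and the file name continue the hypothetical workflow above rather than describing plm-utils’ actual interface:

```python
import numpy as np
from joblib import dump, load

# Persist a classifier trained on labeled embeddings for a new trait
# (e.g., membrane-bound vs. not).
dump(clf, "membrane_bound_model.joblib")

# Later, load it and predict the trait for new, unseen sequences.
clf = load("membrane_bound_model.joblib")
predictions = clf.predict(np.stack([embed(s) for s in unseen_sequences]))
```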
If you have a use case that would require additional functions in the plm-utils package, we would love to hear your needs either as a comment on this pub or as an issue on the GitHub repository.
To assess whether protein language models improve the classification of coding sequences over existing tools, we first compared the performance of plm-utils models against RNAsamba models (version 0.2.5). We chose to compare against RNAsamba because it performed well across various prediction tasks when benchmarked against other tools [19].
To assess model performance with diverse sequencing data, we selected a set of 16 species (Table 1) for which high-quality annotated reference transcriptomes were available on Ensembl. We then trained models separately on transcriptomes from each of the 16 species using the above-mentioned procedure. We used each of the resulting 16 models to make predictions for each of the other 15 species. Note that we did not split the transcriptomes into training and test sets; we trained models on all transcripts from one species and then made predictions for all transcripts from each of the other species.
| Species | Abbreviation | Common name | Kingdom | Class |
|---|---|---|---|---|
| Apis mellifera | Amel | Honey bee | Animals | Insects |
| Arabidopsis thaliana | Atha | Thale cress | Plants | Eudicots |
| Caenorhabditis elegans | Cele | Roundworm | Animals | Chromadorea |
| Dictyostelium discoideum | Ddis | Slime mold | Protozoa | Mycetozoa |
| Drosophila melanogaster | Dmel | Fruit fly | Animals | Insects |
| Danio rerio | Drer | Zebrafish | Animals | Ray-finned fishes |
| Gallus gallus | Ggal | Chicken | Animals | Birds |
| Homo sapiens | Hsap | Human | Animals | Mammals |
| Mus musculus | Mmus | Mouse | Animals | Mammals |
| Oryza indica | Oind | Rice | Plants | Monocots |
| Rattus norvegicus | Rnor | Rat | Animals | Mammals |
| Saccharomyces cerevisiae | Scer | Baker’s yeast | Fungi | Saccharomycetes |
| Schizosaccharomyces pombe | Spom | Fission yeast | Fungi | Schizosaccharomycetes |
| Tetrahymena thermophila | Tthe | Ciliate protozoan | Protozoa | Ciliates |
| Xenopus tropicalis | Xtro | Western clawed frog | Animals | Amphibians |
| Zea mays | Zmay | Corn | Plants | Monocots |
Table 1. Species used to train and evaluate plm-utils and RNAsamba models in the task of predicting coding versus non-coding transcripts.
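Schematically, this train-on-one, predict-on-the-others procedure is an all-vs-all loop over species. `train_model()` and `predict()` below are hypothetical stand-ins for the plm-utils training and prediction steps sketched earlier, and the file names are illustrative:

```python
species = ["Amel", "Atha", "Cele", "Ddis", "Dmel", "Drer", "Ggal", "Hsap",
           "Mmus", "Oind", "Rnor", "Scer", "Spom", "Tthe", "Xtro", "Zmay"]

predictions = {}
for train_sp in species:
    # Train on all transcripts from one species...
    model = train_model(f"{train_sp}_transcripts.fa", f"{train_sp}_labels.tsv")
    for test_sp in species:
        if test_sp == train_sp:
            continue
        # ...then predict on all transcripts from each of the other 15.
        predictions[(train_sp, test_sp)] = predict(model, f"{test_sp}_transcripts.fa")
```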
We performed this procedure for both plm-utils models and RNAsamba models. We calculated the performance of each model using the Matthews correlation coefficient (MCC), a measure that quantifies the quality of binary classifications, ranging from −1 (perfectly wrong; worse than random) through 0 (no better than random) to +1 (perfect prediction). The RNAsamba models (average MCC 0.51) slightly outperformed the plm-utils models (average MCC 0.43) when trained and evaluated on all transcripts (Figure 1, A and C). However, the plm-utils models (average MCC 0.52) significantly outperformed the RNAsamba models (average MCC 0.15) when trained and evaluated only on transcripts whose longest putative ORF was an sORF (< 100 amino acids) (Figure 1, B and D).
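Each train/test pair’s MCC can be computed with scikit-learn; `y_true` and `y_pred` below stand in for the ground-truth coding/non-coding labels and the model’s calls for one species pair:

```python
from sklearn.metrics import matthews_corrcoef

# -1 = perfectly inverted, 0 = no better than random, +1 = perfect.
mcc = matthews_corrcoef(y_true, y_pred)
```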
Our experimental setup comparing plm-utils and RNAsamba has two differences that arise because the tools work differently. First, the models don’t use the same sequence data to make predictions. Although both models predict whether a transcript is coding or non-coding, the plm-utils models do so based on the amino-acid sequence of each transcript’s longest putative ORF. In contrast, the RNAsamba models do so based on the full nucleotide sequence of the transcript itself. Second, the plm-utils models incorporate a correction for class imbalance (unequal numbers of coding and non-coding transcripts in the training data) by using a balanced class weight in the random forest classifier. This ensures that both classes are treated equally despite their unequal proportions. The RNAsamba models don't include this correction. Because many species contain relatively few coding transcripts whose longest ORF is an sORF, this difference likely partially explains the difference in performance we observed between plm-utils and RNAsamba models trained only on transcripts whose longest putative ORF was an sORF. In addition, embedding sORFs using ESM allows plm-utils to take advantage of information in a larger corpus of protein sequences, even when there are very few input sequences.
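As an illustration of what balanced class weighting does, scikit-learn’s "balanced" option weights each class by n_samples / (n_classes × class_count), so a rare coding class counts as much as an abundant non-coding one. A small sketch with made-up counts:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# E.g., 200 coding sORF transcripts vs. 1,800 non-coding ones.
y = np.array([1] * 200 + [0] * 1800)
weights = compute_class_weight("balanced", classes=np.array([0, 1]), y=y)
print(dict(zip([0, 1], weights)))  # {0: ~0.56, 1: 5.0} -- rare class upweighted
```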
Next, we assessed how well plm-utils performs on a challenging prediction task compared to other tools. A recent large-scale benchmarking study identified 27,283 transcripts (16,243 coding; 11,040 non-coding) that many tools struggle to classify correctly as coding or non-coding [19]. The authors named this dataset “RNAChallenge” and observed an average tool accuracy of 10.8% on it (Figure 2). The protein-coding transcripts in this dataset are shorter than the average transcript: approximately 80% of the ORFs on the protein-coding transcripts were less than 300 nucleotides long, highlighting that most tools struggle to classify sORFs as coding or non-coding.
We first built a model to predict coding versus non-coding transcripts using input from diverse species. Using the same species listed in Table 1, we separated coding from non-coding transcripts. We reduced homology between our input sequences by clustering at 80% sequence identity using MMseqs2 (version 15.6f452) [27] (see the sketch below). We then used plm-utils to translate sequences, limiting to sORFs by filtering to transcripts whose longest predicted ORF was < 100 amino acids. We then embedded these sequences and trained a model.

We ran the plm-utils model on the RNAChallenge dataset and calculated its performance (Figure 2). The F1 score, a metric that balances precision (the accuracy of positive predictions) and recall (the ability to identify all actual positive cases), was highest for plm-utils. However, the RNAChallenge dataset contains some sequences that are highly similar to sequences we used to train the plm-utils model. While this was also true for the models and tools evaluated by the benchmark, we wanted to control for it in our evaluation. We therefore removed sequences from RNAChallenge that were at least 80% similar to sequences used during training. This reduced the RNAChallenge dataset to 16,180 sequences (8,847 coding; 7,333 non-coding). On this reduced dataset, the F1 score decreased by ~6%. Plm-utils still outperformed all but two tools covered in the benchmarking paper, longdist and NCResNet (Figure 2) [28][29]. Both of these tools performed poorly on other benchmarks that assessed their ability to predict coding vs. non-coding transcripts in non-human species (regardless of transcript or ORF length) [19]. This likely indicates over- or under-fitting to the RNAChallenge dataset and an inability to generalize across diverse biological datasets [19]. While we haven’t compared directly, we expect plm-utils to generalize better across species and sequencing contexts.
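One way to perform the 80%-identity clustering step referenced above is MMseqs2’s easy-cluster workflow; this is a sketch with illustrative file names, and the exact invocation used may differ:

```python
import subprocess

# Cluster input sequences at >= 80% identity; one representative per
# cluster is written to clusters_rep_seq.fasta.
subprocess.run(
    ["mmseqs", "easy-cluster", "input_transcripts.fasta",
     "clusters", "tmp", "--min-seq-id", "0.8"],
    check=True,
)
```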
Taken together, we find that a simple classifier built atop the ESM2 large protein language model improves the classification of sORFs as coding or non-coding. However, the overall performance of plm-utils on the RNAChallenge dataset is still low (~33% accuracy). This dataset is enriched in sORFs, highlighting that there’s still room for improvement on this classification task. We think using ESM embeddings captures information about the secondary structure of proteins, as large protein language models have previously been shown to achieve high accuracy on structural predictions for some peptides [30]. This may mean the method struggles with very short peptides or peptides with certain structural features. Indeed, evidence of functional sORFs in the 3–15 amino acid range suggests we may be missing this class of sORFs [6].
View the workflow code we used to build the plm-utils classification model and evaluate its performance on RNAChallenge.
We built a pipeline, peptigate, that predicts and annotates peptides from transcriptomes [31]. Peptigate uses plm-utils to predict sORF-encoded peptides. While evaluating this pipeline, we ran plm-utils on the human transcriptome to identify sORF-encoded peptides. We compared these results against orthogonal datasets like databases of known peptides, ribosome profiling, and the strength of translation initiation site sequences. We found orthogonal support for 22% of plm-utils predictions and didn’t detect any false positives (true non-coding sequences predicted to be coding). For more insights on plm-utils outputs and predictions, view those results here.
We used ChatGPT and GitHub Copilot to help write, clean up, and add comments to our code. We also used ChatGPT to suggest wording ideas and then chose which small phrases or sentence structure ideas to use.
The plm-utils Python package is available in this GitHub repository.
The plm-utils Python package encodes a set of helper functions for working with protein language models. It currently only works with Evolutionary Scale Modeling (ESM), a protein language model trained on millions of protein sequences. It has functions to embed sequences with ESM2, train a binary classifier using labeled protein sequences, and predict the classification of new proteins using that model.
Our first use case for plm-utils is predicting whether a transcript is coding or non-coding. We used plm-utils to predict coding versus non-coding transcripts in general, as well as when the transcript encodes an sORF. sORFs are small open reading frames of fewer than 300 nucleotides that sometimes encode peptides.
In a set of diverse research organisms, plm-utils is outperformed by the state-of-the-art tool RNAsamba for the general task of predicting coding versus non-coding transcripts [plm-utils Matthews correlation coefficient (MCC) = 0.43; RNAsamba MCC = 0.51]. However, plm-utils significantly improves prediction when the transcript encodes an sORF (plm-utils MCC = 0.52; RNAsamba MCC = 0.15). This is likely due, in part, to latent information captured by protein language models.
Plm-utils also improves the prediction accuracy (33%) on a challenging dataset, RNAChallenge, over most tools (average 11%).
While plm-utils improves prediction accuracy over most tools, predicting the coding potential of short sequences (< 100 amino acids) remains challenging. We would love feedback or ideas on how to improve accuracy in this task, with or without using protein language models.
We have several ideas to potentially improve accuracy:
Using larger models: We currently use the smallest model (esm2_t6_8M_UR50D), but larger models are available (esm2_t48_15B_UR50D, esm2_t36_3B_UR50D, esm2_t33_650M_UR50D, esm2_t30_150M_UR50D, and esm2_t12_35M_UR50D). Other peptide prediction tasks haven’t seen improved accuracy with larger ESM models [2]. In preliminary testing, we didn’t see an improvement in accuracy for sORF coding prediction either, but this should be more extensively tested and validated (see the sketch after this list).
Exploring ESM3: The newly released ESM3 model [32] may offer potential improvements. ESM2 was trained on 49.9 million protein sequences from UniRef50 [33][34], while ESM3 was trained on 2.78 billion protein sequences [32]. These new sequences may improve ESM’s ability to encode information about short sequences.
Refining training data sources: We used Ensembl for labeled training data (coding vs. non-coding transcripts). Some transcripts initially labeled as non-coding are later found to encode sORFs [35][36][37][38][39][40][41][42][43]. Building models that only include validated coding and non-coding transcripts from diverse sources could improve model accuracy. However, this type of curation task would likely take a substantial amount of time.
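For the first idea, swapping in a larger checkpoint with fair-esm is a one-line change (a sketch; note that the final layer index and embedding dimension differ from the 8M model’s):

```python
import esm

# Load a larger checkpoint in place of esm2_t6_8M_UR50D.
model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()  # 33 layers, 1,280-dim

# When embedding, request this model's final layer:
# out = model(tokens, repr_layers=[33])
```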
In the meantime, we plan to use plm-utils to identify sORF-encoded peptides in transcriptome assemblies using the peptigate pipeline [31].