Robust long-read saliva transcriptome and proteome from the lone star tick, Amblyomma americanum

Joan Wong; Juliana Gil; Elizabeth Tseng; MaryClare Rollins; Behnom Farboud; Peter S. Thuy-Boun; Kira E. Poskanzer; Tori Doran; Greg Huber; William Hatleberg; Megan L. Hochstrasser; Seemay Chou

doi:10.57844/arcadia-3hyh-3h83

Dataset Not actively updating Ticks as treasure troves: Molecular discovery in new organisms

Published on May 31, 2022 by Arcadia Science

The way you generate a reference database has a real impact on the completeness and results of proteomics experiments.

Purpose

At Arcadia, we’re studying diverse organisms and sharing both our discoveries and the tools we develop along the way. In one of our first efforts, we’re trying to understand how ticks manipulate their hosts. In this pub, we describe how we established a proteome reference for A. americanum ticks by collecting a long-read transcriptome from their salivary glands. We also show you where you can access all the data. We hope it will be useful for other tick researchers or anyone interested in doing omics in the absence of a complete genome.

This pub is part of the project, “Ticks as treasure troves: Molecular discovery in new organisms.” Visit the project page for more background and context.
Data from this pub can be found in the SRA (transcriptome) and in the PRIDE repository (proteome).
The method used to generate these data is more fully described in this pub.

Share your thoughts!

Feel free to provide feedback by commenting in the box at the bottom of this page or by posting about this work on social media. Please make all feedback public so other readers can benefit from the discussion.

Background and goals

These experiments are part of our effort to build omics tools to study ticks. Ticks feed on us and other animals for prolonged periods, which suggests that ticks have powerful means for suppressing host surveillance systems. We are identifying the components of the tick molecular toolkit and developing them into new therapies for patients living with otherwise intractable skin conditions.

To begin, we decided to take a peek at all the proteins we could find in the saliva of Amblyomma americanum (a.k.a. the lone star tick).

SHOW ME THE DATA: Access our transcriptomics data and proteomics data.

The approach

To begin unraveling the intricacies of tick saliva, we've chosen to examine the salivary proteome using tandem mass spectrometry-based proteomics as a key technology. For now, we’re taking the more straightforward bottom-up approach. Proteomics experiments come in many flavors, but they can be categorized into one of three bins according to the size of the peptide analytes being examined. 1) Top-down proteomics is generally concerned with the analysis of intact proteins and/or their complexes. 2) Bottom-up proteomics is generally concerned with the analysis of peptides generated by chemical or enzymatic digestion of parent proteins. 3) Middle-down proteomics takes an intermediate approach wherein parent proteins are minimally digested, creating peptides larger than those considered for bottom-up work but still smaller than intact proteins. Tools and techniques for bottom-up proteomics have been in development for much longer than those for the other two styles and are thus more reliable and accessible.

We hope that mass spectrometry will be advantageous in this context because it will enable the analysis of cell-free secretions. Importantly, it is suited for the detection of non-encoded molecules/modifications, including protein post-translational modifications (e.g., phosphorylation, sulfation, lipidation, and glycosylation), non-ribosomal peptides, and small molecules (metabolomics).

In order to interpret proteomic data from tandem mass spectrometry, we need a reference proteome, which can be inferred from genome and/or transcriptome sequencing efforts. Unfortunately, ticks aren't model organisms (yet) and apart from Ixodes scapularis (a.k.a. the deer tick), there are few previously deposited data sets for the other ~900 known tick species, including the tick species we’re studying, A. americanum (Figure 1). One notable exception is the combined short-read transcriptome and matched time-resolved salivary proteome deposited in the PRIDE repository by the Mulenga lab [1]. This rich data set serves as a great scientific resource.

**In contrast to *Ixodes scapularis*, *A. americanum* reference data sets are incompletely represented in public repositories.**
*A. americanum* genome and transcriptome assembly will enable the creation of a comprehensive proteome database for LC-MS/MS-based proteomics analysis. In this work, we focused on assembling a new transcriptome to inform our proteomic analysis.

Before we performed our mass spectrometry experiments, we decided to develop our own proteome database, adding to the Mulenga lab’s work and enriching the reference data available to the tick research community. We considered sequencing the A. americanum genome, but it would require more time, money, and expertise than RNA sequencing. We therefore decided to do long-read RNA sequencing (specifically PacBio’s HiFi Iso-seq methodology) because it can provide insights into full transcript structures. We figured it would provide a great complement to the Mulenga lab's short-read data set collected on the same tick species.

Our overall method is summarized in the text below and in Figure 2. For more information on why we took this approach, see our companion method piece. For a detailed, step-by-step protocol, see our protocols.io entry.

**Overview of the parallel transcriptomic (top) and proteomic (bottom) work streams**.

Sample collection and RNA preparation

We collected our tissue of interest by excising salivary glands [2] (which comprise a major mass fraction of the tick anatomy) from unfed female A. americanum ticks.

RNA extraction, processing, and sequencing

We pooled salivary gland tissue from about 10 ticks, homogenized it by bead beating, and obtained total RNA using a standard extraction kit. We collected electropherograms to calculate the RNA integrity number (RIN), a proxy for RNA quality that depends heavily on the 28S and 18S ribosomal RNA (rRNA) subunit peaks. We enriched mRNA via oligo-(dT) primers, which target mRNA containing poly-A tails. Finally, we submitted our RNA samples to the UC Berkeley QB3 genomics core for size-selection (>3 kb), PacBio's library preparation, Sequel II HiFi sequencing, and Iso-seq analysis.

A note on electropherograms from arthropod RNA:
We were surprised to find only one peak corresponding to the 18S subunit where we would normally see two peaks: one corresponding to the 18S subunit and one to the 28S subunit.
Some quick literature searches suggested that this is a commonly observed phenomenon with arthropod RNA. It's thought that arthropods’ 28S subunit can fragment (due to structural instability) during sample preparation, yielding two peaks that overlap with the 18S subunit’s peak [3][4].
We took a chance and proceeded with transcriptomic library preparation without a RIN readout for RNA quality. To ensure that future extractions are adequate before library preparation, we’d like to identify fast and easy alternative assays for RNA quality. Suggestions are highly appreciated.

Mass spectrometry

In parallel to the RNA processing and sequencing steps, we prepared tryptic peptides from homogenized A. americanum salivary gland tissue and analyzed them by data-dependent LC-MS/MS using a high resolution-high resolution strategy on an Orbitrap mass spectrometer.
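For readers new to bottom-up proteomics, the following is a minimal Python sketch of what "tryptic peptides" look like in silico. It assumes the commonly used rule that trypsin cleaves after lysine (K) or arginine (R) unless the next residue is proline (P). The function name, its parameters, and the example sequence are hypothetical illustrations and are not taken from the sample preparation or search software used in this work.

```python
import re


def tryptic_digest(protein, missed_cleavages=0, min_length=6):
    """Return in silico tryptic peptides for one protein sequence.

    Assumed rule: cleave after K or R, except when the next residue is P.
    """
    # Zero-width split: position follows K/R and is not followed by P.
    fragments = [f for f in re.split(r"(?<=[KR])(?!P)", protein) if f]
    peptides = []
    for i in range(len(fragments)):
        # Join up to `missed_cleavages` additional neighboring fragments.
        last = min(i + missed_cleavages, len(fragments) - 1)
        for j in range(i, last + 1):
            peptide = "".join(fragments[i:j + 1])
            if len(peptide) >= min_length:
                peptides.append(peptide)
    return peptides


# Hypothetical toy sequence, not a real tick protein.
toy_protein = "MAGTKSLLPRPEEVKAAR"
print(tryptic_digest(toy_protein, min_length=1))
# ['MAGTK', 'SLLPRPEEVK', 'AAR']
```

Note that the R in "SLLPRPEEVK" is not cleaved because it is followed by P. Real search engines apply the same kind of rule to every protein in the reference database, which is why the completeness of that database matters.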

Transcriptome and proteome processing and analysis

Our analysis process is summarized in Figure 3. We identified coding sequences in our transcriptome data using TransDecoder [5], CPAT [6], and ANGEL [7]. We collapsed sequences down by CD-HIT clustering [8] for subsequent proteomics mapping. We submitted clusters for InterProScan analysis [9] and BUSCO analysis [10] to identify protein families and assess the completeness of our transcriptome data set, respectively. We assigned fragmentation spectra with a basic proteomic search.

**Overview of data analysis workflow and tools.**
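To give a feel for what "identifying coding sequences" involves, here is a deliberately naive Python sketch that scans the three forward reading frames of a transcript for open reading frames (ATG to stop codon) above a minimum length. This is not how TransDecoder, CPAT, or ANGEL work internally; those tools use more sophisticated scoring. The function name, the `min_aa` parameter, and the toy sequence are assumptions made for illustration only.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(transcript, min_aa=100):
    """Return (start, end) nucleotide coordinates of forward-strand ORFs.

    An ORF here runs from the first ATG in a frame to the next in-frame
    stop codon. `min_aa` is the minimum number of codons before the stop.
    `end` is exclusive and includes the stop codon.
    """
    seq = transcript.upper()
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(seq) - 2, 3):
            codon = seq[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos
            elif start is not None and codon in STOP_CODONS:
                if (pos - start) // 3 >= min_aa:
                    orfs.append((start, pos + 3))
                start = None  # look for the next ORF in this frame
    return orfs


# Hypothetical toy transcript: 2 nt of 5' sequence, ATG, 3 codons, stop.
toy_transcript = "CC" + "ATG" + "GCT" * 3 + "TAA" + "G"
print(find_orfs(toy_transcript, min_aa=4))  # [(2, 17)]
print(find_orfs(toy_transcript, min_aa=5))  # []
```

The toy example shows why the length cutoff matters: lowering `min_aa` admits more candidate ORFs, some of which will be spurious.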

The results

Transcriptomic data

Once our transcriptome data arrived, we identified protein-coding sequences using TransDecoder, CPAT, and ANGEL. We combined our resultant protein output and collapsed sequences down by CD-HIT clustering with a similarity setting of 100% (c=1.0) to group redundant sequences, yielding 222,632 predicted proteins (down from a total of 307,541). We used these CD-HIT-collapsed non-redundant protein sequences for subsequent proteomics mapping.

For functional analysis, we reasoned that proteins with closely related sequences would likely have the same function. Thus, to reduce compute time, we grouped closely related protein sequences using CD-HIT, except this time with a similarity setting of 95% (c=0.95). This yielded 121,223 protein clusters (Figure 4, A). Each cluster contained one or more members and one representative sequence; for single-member clusters, one sequence is both a member and a representative. Representative sequences for each of these 95% cut-off clusters were submitted for InterProScan analysis to classify proteins into families and identify domains, resulting in annotation for 68,705 clusters (57%) but no annotation for 52,518 clusters (43%) (Figure 4, B).

**Processing and annotating protein clusters within our *A. americanum* transcriptome.**
(A) Overview of *A. americanum* long-read transcriptome data. Protein-coding sequences were predicted from poly-A-enriched and 5-kb-size-selected transcripts.
(B) Protein-coding sequences were clustered using CD-HIT, and functional annotation by InterProScan reveals a large subset (43%) of unannotated protein clusters.
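The idea of clusters, members, and representatives can be illustrated with a toy greedy clustering in Python. This sketch is a strong simplification and is not CD-HIT: it sorts sequences from longest to shortest and treats a sequence as redundant only if it is identical to, or fully contained in, an existing representative, which is an assumed stand-in for a 100% similarity setting. Lower thresholds such as 95% require alignment-based identity, which is omitted here. All names and sequences are hypothetical.

```python
def collapse_redundant(proteins):
    """Greedy toy clustering at 100% identity.

    `proteins` maps sequence IDs to amino acid strings. Returns a dict that
    maps each representative ID to the list of member IDs in its cluster
    (the representative is always the first member).
    """
    clusters = {}
    # Longest first, so representatives are the longest sequences.
    ordered = sorted(proteins.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    for seq_id, seq in ordered:
        for rep_id in clusters:
            if seq in proteins[rep_id]:  # identical or contained fragment
                clusters[rep_id].append(seq_id)
                break
        else:
            clusters[seq_id] = [seq_id]  # new single-member cluster
    return clusters


# Hypothetical predictions from different tools for the same transcripts.
toy = {
    "toolA_1": "MKTLLVAG",
    "toolB_1": "MKTLLVAG",  # exact duplicate
    "toolC_1": "KTLLV",     # fragment of the same protein
    "toolA_2": "MPPQ",      # unrelated sequence
}
print(collapse_redundant(toy))
# {'toolA_1': ['toolA_1', 'toolB_1', 'toolC_1'], 'toolA_2': ['toolA_2']}
```

This pairwise comparison scales quadratically, so it is only suitable for small examples; dedicated tools use heuristics to handle hundreds of thousands of sequences.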

In addition, BUSCO analysis, which assesses completeness of a transcriptome, revealed a slight gain in completeness compared to the previous short-read transcriptome (Figure 5, A). Finally, we compared our new long-read transcriptome database with the short-read Mulenga database and a database built from NCBI sequences (Figure 5, B). It’s striking how divergent our data set and the Mulenga data set appear to be, but the real test of our data set's usefulness will be the proteomics mapping results.

**Comparing the transcriptome generated through this method to previous resources for the same organism, *A. americanum*.**
(A) BUSCO analysis reveals our long-read transcriptome (“Arcadia”) is slightly more complete than the short-read transcriptome from the Mulenga lab.
(B) CD-HIT clustering reveals only a small overlap in protein cluster membership among the Arcadia, Mulenga, and NCBI proteomes.

Proteomic data from mass spectrometry

With a basic proteomics database search, we were able to assign approximately 40% of all collected fragmentation spectra using all databases combined. Our new database assigned 37% and the Mulenga database assigned 36%, with a fairly large overlap. With our database, we observe approximately 8% more peptide-spectrum matches (PSMs) and 9% more peptides than with the Mulenga transcriptome and NCBI databases alone (Figure 6).

**Comparison of LC-MS/MS-based proteomics mapping results when we used the Arcadia, Mulenga, and NCBI transcriptome-based proteomes as mapping databases**.
Venn diagrams depicting overlap at the peptide-spectrum match (PSM)-, peptide-, and protein cluster-level.
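The overlap counts behind a Venn diagram like this one can be computed with plain set operations. The Python sketch below uses made-up peptide identifiers and hypothetical function names. It also assumes one particular reading of "X% more": identifications found only with the new database, divided by the identifications found with the other databases combined. That denominator is an assumption, not a statement of how the percentages in this pub were calculated.

```python
from collections import Counter


def venn_regions(hits):
    """Count items by the exact combination of databases that found them.

    `hits` maps a database name to the set of items (PSMs, peptides, or
    clusters) identified with it.
    """
    regions = Counter()
    for item in set().union(*hits.values()):
        found_in = frozenset(name for name, items in hits.items() if item in items)
        regions[found_in] += 1
    return dict(regions)


def percent_gain(hits, new_db):
    """Percent of extra items contributed only by `new_db`.

    Assumed definition: items unique to `new_db`, relative to the union of
    items from all other databases.
    """
    others = set().union(*(items for name, items in hits.items() if name != new_db))
    if not others:
        raise ValueError("need at least one non-empty comparison database")
    return 100.0 * len(hits[new_db] - others) / len(others)


# Hypothetical toy data; the real counts are shown in the figure.
toy_hits = {
    "Arcadia": {"pep1", "pep2", "pep3"},
    "Mulenga": {"pep2", "pep3", "pep4"},
    "NCBI": {"pep3"},
}
for combo, count in venn_regions(toy_hits).items():
    print(sorted(combo), count)
print(round(percent_gain(toy_hits, "Arcadia"), 1))  # 33.3
```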

Background on peptide-spectrum matches vs. peptides:
During a proteomics database search, theoretical mass spectra are generated from peptides in our search database. We compare these theoretical mass spectra to experimental mass spectra, and when there is reasonable agreement during a comparison, we assign the matching peptide identity to the experimental mass spectrum. This assignment is called a peptide-spectrum match (PSM). During a tandem mass spectrometry run, many mass spectra are collected and often, several spectra of the same peptide are collected, especially if that peptide is abundantly represented in a mixture. Thus, it is possible for a single distinct peptide to be represented by many spectra.
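The distinction between PSMs and peptides comes down to counting rows versus counting distinct values. A minimal Python sketch, using a hypothetical list of (spectrum ID, peptide) assignments rather than real search output:

```python
from collections import Counter


def summarize_psms(psms):
    """Return (number of PSMs, number of distinct peptides, PSMs per peptide).

    `psms` is a list of (spectrum_id, peptide_sequence) pairs, one per
    assigned spectrum.
    """
    per_peptide = Counter(peptide for _spectrum_id, peptide in psms)
    return len(psms), len(per_peptide), per_peptide


# Hypothetical assignments: an abundant peptide is sampled three times.
toy_psms = [
    ("scan_0001", "LVNELTEFAK"),
    ("scan_0002", "LVNELTEFAK"),
    ("scan_0003", "LVNELTEFAK"),
    ("scan_0004", "YLYEIAR"),
]
n_psms, n_peptides, per_peptide = summarize_psms(toy_psms)
print(n_psms, n_peptides)          # 4 2
print(per_peptide["LVNELTEFAK"])   # 3
```

Here four PSMs correspond to only two distinct peptides, which is why PSM-level and peptide-level comparisons between databases can give different percentages.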

At the protein cluster level (CD-HIT clustering at 65% similarity cut-off; c=0.65), we observe a 38% increase in cluster detection. Interestingly, when we compare all database protein sequences against all protein sequences detected by proteomics, an unexpected distribution emerges revealing that proteins detected by our database tend to skew toward longer sequences (Figure 7). We hope that for further studies, having longer protein sequences will enable a more complete understanding of function.

**Histograms of protein sequence length distribution for all proteins (left) and only proteins with LC-MS/MS evidence (right)**.
Note that y-axes are different.
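A length comparison like this one can be reproduced from any protein FASTA plus a list of detected protein IDs. The Python sketch below bins sequence lengths and reports, per bin, the fraction of proteins with LC-MS/MS evidence. The function names, the bin width, and the toy values are hypothetical choices for illustration and do not reflect the settings used to draw the figure.

```python
from collections import Counter


def length_histogram(lengths, bin_width=100):
    """Count sequences per length bin; keys are the lower edge of each bin."""
    counts = Counter((n // bin_width) * bin_width for n in lengths)
    return dict(sorted(counts.items()))


def evidence_fraction_by_length(protein_lengths, detected_ids, bin_width=100):
    """Fraction of proteins with proteomic evidence in each length bin.

    `protein_lengths` maps protein ID to length in amino acids;
    `detected_ids` is the set of IDs supported by at least one peptide.
    """
    totals, detected = Counter(), Counter()
    for protein_id, length in protein_lengths.items():
        bin_start = (length // bin_width) * bin_width
        totals[bin_start] += 1
        if protein_id in detected_ids:
            detected[bin_start] += 1
    return {b: detected[b] / totals[b] for b in sorted(totals)}


# Hypothetical toy database of four proteins, two of them detected.
toy_lengths = {"p1": 50, "p2": 120, "p3": 180, "p4": 450}
toy_detected = {"p2", "p4"}
print(length_histogram(toy_lengths.values()))
# {0: 1, 100: 2, 400: 1}
print(evidence_fraction_by_length(toy_lengths, toy_detected))
# {0: 0.0, 100: 0.5, 400: 1.0}
```

Using narrower bins at short lengths would give a finer view of whether short predicted proteins are less often supported by peptide evidence.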

Key takeaways

To sum this up, it looks like our long-read transcriptome-based proteome database compares reasonably well with the Mulenga lab’s short-read transcriptome-based proteome database.

While our database enables the detection of approximately 8% more PSMs and 9% more peptides than the previous Mulenga database, it is in no way a replacement, as 5% of all PSMs are only detectable thanks to the Mulenga database. The short-protein skew of the Mulenga database appears to be complementary to the long-protein skew of our own database.

Finally, the assignment of 40% of all fragmentation spectra is reasonable, but there are likely many more assignable spectra awaiting deconvolution. Assigning more than 80% is highly unlikely based on many factors (and personal experience), but reaching a value between 40% and 80% may be achievable.

What’s next?

Building a more complete protein database will allow us to assign a greater percentage of fragmentation spectra. For this, we’d need a fully assembled A. americanum genome, which, to the best of our knowledge, is not yet available in a public repository. As such, assembling an A. americanum genome will probably be the next item on our checklist. We’re also still analyzing this data and specifically exploring post-translational modifications. Ultimately, we hope to identify active salivary molecules.

Acknowledgements
- Thank you to the QB3 Genomics Facility at UC Berkeley (RRID:SCR_022170) for RNA library prep and sequencing.

This work is licensed under CC BY 4.0

Seemay Chou

Conceptualization, Editing, Supervision

Resources

Critical Feedback

Critical Feedback

Visualization

Megan L. Hochstrasser

Editing, Visualization, Writing

Greg Huber

Conceptualization, Critical Feedback

Kira E. Poskanzer

Conceptualization, Supervision

MaryClare Rollins

Project Administration, Resources

Peter S. Thuy-Boun

Editing, Formal Analysis, Investigation, Methodology, Writing

Elizabeth Tseng

Conceptualization, Critical Feedback

Joan Wong

Critical Feedback

Kat katherinebaney@gmail.com on May 30, 2022

Selection

t is in no way a replacement, as 5% of all PSMs are only detectable thanks to the Mulenga database. The short-protein skew of the Mulenga database appears to be complementary to the long-protein skew of our own database. Finally, the assignment of 40% of all fragmentation spectra is reasonable but there are likely man

Does this mean ideal proteome database construction would include both short and long read databases?

Peter S. Thuy-Boun on Jun 01, 2022

Hey Kat, thanks for checking us out! Generally speaking, the best proteome database would contain as much information as possible about all protein sequences expressed by a biological system and this would include all protein isoforms. For bacteria with small genomes (or organisms of similar configuration), the choice of long- vs short-read might not be so important; in many cases, we can predict coding sequences directly from the genome and use those sequences directly for peptide mapping. For more complex organisms with large genomes (in this case ticks), identifying coding sequences from the genome may not be straightforward and transcriptomes make life a little easier. Given no constraints, having both short- and long-read transcriptomes would probably be ideal like you’re suggesting, especially considering that there are trade-offs between technologies.

Kat katherinebaney@gmail.com on May 31, 2022

Selection

mericanum genome, which, to the best of our knowledge, is not yet available in a public repository. As such, assembling an A. americanum genome will probably be the next item on our checklist. We’re also still analyzing this data and specifically exploring post-translational modifications. U

It seems like there are four datasets a scientist might consider building:

Proteome
Long Read Transcriptome
Short Read Transcriptome
Genome

how would you advise a scientist on which dataset they should put effort into if they wanted to begin characterizing a non model organism’s metabolites? This pub seems to recommend a transcriptome as satisfying replacement for a full genome, but here your next step is to get a genome.

Seemay Chou on May 31, 2022

Part of the reason we are doing the genome next is to ask: how much more could we get from adding this in, and how much more work would that require? I think we are going to learn from this comparison for Amblyomma how we might approach the other tick species based on the cost/benefit analysis.

Metabolites is a whole different ballgame since you can’t directly map small molecules against genes 1:1. We will probably be tackling this next. Out of curiosity Kat - what kind of metabolomic analyses are you interested in?

Austin H. Patton on Jun 03, 2022

Selection

ts SHOW ME THE DATA: Access our transcriptomics data and proteomics data. Transcriptomic data Once our transcriptome data arrived, we identified protein-coding sequences using TransDecoder, CPAT, and ANGEL. We combined our resultant protein output and collapsed sequences down by CD-HIT clustering with a similarity setting of 100% (c=1.0) to group redundant sequences, yielding 222,632 predicted proteins (down from a total of 307,541). We used these CD-HIT-collapsed non-redundant protein sequences for subsequent proteomics mapping. F

I noticed there seems to be a large number of duplicated BUSCOs (i.e. Fig. 5a). I wonder if this could be related to the use of these three independent methods (TransDecoder, CPAT, ANGEL) for CDS prediction?

Could there be some redundant predictions that persist even following the removal of duplicates using CD-HIT? I would be curious whether a less stringent cutoff (e.g. 99%) could achieve a similar level of BUSCO completeness, but with a greater proportion being single-copy?

Austin H. Patton on Jun 03, 2022

I included this in my comment on the method pub, but it looks like Cogent (https://github.com/Magdoll/Cogent) could be the perfect tool for this (designed for this exact use case - identifying non-redundant transcripts from Iso-seq data in the absence of a reference genome)

Peter S. Thuy-Boun on Jun 08, 2022

Austin, thanks for checking in and commenting!

I agree with you, there are a lot of duplicated BUSCOs likely because I cobbled together output from multiple CDS prediction tools. Sometimes the tools all converge on the same CDS and sometimes they produce different CDSs from a single transcript. Transdecoder was especially good at predicting divergent protein sequences as the protein length minimum cut-off was lowered (100->50->25 amino acid residues). The Transdecoder github warns that as prediction length minimums decrease below 100 amino acid residues, false positive predictions may increase dramatically. I’m certain that false positive protein sequences are generously incorporated in our protein database, but because we’re interested in identifying as many peptides (by mass spectrometry) as possible it’s a danger that we’ve chosen to live with for now. This is definitely a problem we’re looking to solve and I’m hoping to address it in the future by taking a reverse approach wherein proteomics mass spectrometry data might be used to identify novel protein coding sequences in transcripts (and possibly even in genome data).

To your comment about Cogent: yes that’s a great tool! I ran our data through it early on and it managed to condense our original 307,541 transcripts into 21,726 families with family membership ranging from 1-481 transcripts. I’m not sure I’ve made the best use of this output yet, let me know if you have any thoughts!

Austin H. Patton on Jun 12, 2022

Hi Peter, of course! Apologies for not getting back sooner - was caught up at a conference!

I’m glad to see we share the same intuition about this! I’m sure each of the three methods has its own complementary strengths, so I definitely agree that using a combination of approaches is the way to go. This is similar in a way to how de novo transcriptome assemblies can be improved (particularly with respect to completeness) through the merging of assemblies produced by different assemblers.

As for detecting whether shorter protein sequences with Transdecoder are enriched for false positives… I wonder if this could be detected by comparing the length distributions of protein sequences for which you’ve got LC-MS/MS evidence to the length distributions for those lacking evidence? A shift in the distribution towards shorter lengths in predicted protein sequences for which evidence is lacking might suggest this is happening - the similar length distributions between studies for sequences that do have LC-MS/MS evidence (right panel of figure 7) is promising though.

As for Cogent - that sounds like it worked super well! I know for vertebrates (which ticks obviously are not!) the number of gene families is fairly conserved, around 20,000 - I’m not sure what the expectation would be for ticks, but the 21,726 identified by Cogent certainly seems coherent. This paper (http://www.genome.org/cgi/doi/10.1101/gr.7046608) would suggest it’s at least within a similar ballpark? With respect to BUSCO, I think using a single representative sequence (e.g. the longest) for each gene family would directly deal with the inference of many duplicated orthologs, all while retaining the original completeness of the assembly.

Maybe just my own curiosity, but I’d be interested in what the “gene-family membership frequency spectrum” (or more simply just a histogram of family membership counts) looks like!

Peter S. Thuy-Boun on Jun 23, 2022

It might also be useful to visualize the percentage of predicted proteins possessing LC-MS/MS evidence binned by protein length (with more granularity in the 0-500 amino acid length range) to get a sense for false positive protein prediction as a function of length. I'm not sure that the single proteomics dataset we have right now surveys the proteome deeply enough for this work but it could be an interesting project for the future!

Also, thank you for the reference! It definitely makes me feel better about the dataset we have.

Austin H. Patton on Jun 03, 2022

Selection

ome database with the short-read Mulenga database and a database forged from NCBI sequences (Figure 5, B). It’s striking how divergent our data set and the Mulenga data set appear to be, but the real test of usefulness for our data set will be determined by proteomics mapping results. Proteomic data from mass spectrometry With a basic proteomics database search, we were able to a

I agree that the substantial differences between your dataset and the Mulenga database (Fig. 5b) are striking, particularly given the greater degree of overlap seen in Fig. 6. It makes me wonder whether there’s a high degree of sequence divergence of homologous genes between these two species, which may be expected given that it seems like they diverged quite some time ago (seems like ~60 mya? Estimate from Gou et al., 2013: https://doi.org/10.1002/ece3.685)?

Edit: I mistakenly was thinking the Mulenga transcriptomic dataset was from Ixodes - makes the lack of overlap even more surprising!

Peter S. Thuy-Boun on Jun 08, 2022

I wonder if the overlap could be increased by running the same CDS prediction tools on the Mulenga assembled transcripts?

Austin H. Patton on Jun 12, 2022

I was thinking the same thing! This way it would be more like comparing apples to apples, and the overlap could potentially even be compared at two scales - membership of all protein clusters from predicted CDS, or using only single representative sequences from each inferred protein cluster. I imagine the latter might point more towards broad-scale overlap across identified protein sequences, irrespective of differences in gene-family composition or identification of isoforms?

Peter S. Thuy-Boun on Jun 23, 2022

For sure, maybe this, the Cogent output (including the gene-family membership frequency plot), and BUSCO assessment of CD-HIT cluster representatives could be useful bits for a follow-up pub!

Taylor Reiter on Aug 03, 2023

Selection

erest by excising salivary glands (which comprise a major mass fraction of the tick anatomy) from unfed female A. americanum ticks. RNA extraction, processing, and sequencing We pooled salivary gland tissue from about 10 ticks,

Would you be willing to provide information on where the ticks were sourced from?

MaryClare Rollins on Aug 25, 2023

Yes! We purchase ticks from the National Tick Research and Education Resource at Oklahoma State University.
