
De novo assembly of a long-read Amblyomma americanum tick genome

We generated a whole-genome assembly for the lone star tick to serve as a reference for downstream efforts where whole-genome maps are required. We created our assembly using pooled DNA from salivary glands of 50 adult female ticks that we sequenced using PacBio HiFi reads.
Published on Mar 30, 2023

Purpose

This pub introduces our first public draft assembly of an Amblyomma americanum genome. We’ve continued to refine and explore this assembly in a variety of projects, including genome annotation, gene expression, and comparative genomics [1]. We’re sharing the work that led to our initial assembly in the hope that it will serve as a valuable reference dataset for the tick research community. We think there’s more work to be done with this genome. For example, we found sequences from the endosymbiont Coxiella in the salivary gland, consistent with previous reports [2], but haven’t followed up on this at all.

Share your thoughts!

Watch a video tutorial on making a PubPub account and commenting. Please feel free to add line-by-line comments anywhere within this text, provide overall feedback by commenting in the box at the bottom of the page, or use the URL for this page in a tweet about this work. Please make all feedback public so other readers can benefit from the discussion.

Background and goals

We want to understand how ticks manipulate humans, so we’ve been studying the lone star tick, Amblyomma americanum. Given that there are over 900 species of ticks identified so far, it’s easy to overlook A. americanum in particular. However, to many in the eastern and southern United States, A. americanum is a pest with a rapidly growing presence and predilection for humans as hosts [3][4].

One of our long-time pain points with respect to the study of A. americanum has been operating without a reference genome. Despite their relevance to humans, we suspect the deficit in publicly available genomic references for A. americanum might stem from their surprising genetic complexity. Based on flow cytometry estimates, the tick’s haploid genome is expected to be approximately 3 gigabases (Gb) [5]. For comparison, Escherichia coli (bacterium) have 5 megabase (Mb) genomes, Drosophila melanogaster (fruit fly) have 140 Mb genomes, Mus musculus (mouse) have 2.7 Gb genomes, and Homo sapiens (human) have 2.9 Gb genomes [6][7]. Based on these figures, one might not expect that tiny ticks could hold so much genetic information, but believe it or not, the tree of life is peppered with examples more extreme than this. Our ability to accurately and comprehensively map these edge cases with reasonably ordinary resources is a recent phenomenon.

Previously, we and others generated transcriptome assemblies using tissue extracted from A. americanum ticks [8][9]. These datasets provide snapshots of the genes being transcribed in the cells we collected. This information was instrumental for generating the protein databases we needed to do mass spectrometry-based proteomics on the same tick species. While these transcriptome datasets have proven useful, they can fall short when mass spectrometry detects peptide features that don’t correspond to reference transcripts. In these cases, the mass spectrometry features will likely go unidentified in analyses until more transcriptomic data becomes available.

Rather than iteratively sequence tick transcriptomes under various conditions (to capture broader transcript landscapes), we decided that a whole-genome assembly could give us more bang for our buck by capturing all genes encoded by A. americanum.

SHOW ME THE DATA: Access our raw genomic data on NCBI, our assembled pseudo-haploid genome in GenBank, and our full assembly on Zenodo.

The approach

We felt that the time might be perfect to generate an A. americanum genome because long-read DNA sequencing technologies have become more accessible in recent years, meeting the challenge of mapping large and complex genomes. These long reads should, in theory, make the assembly of large genomes less computationally expensive (compared to short reads) while reducing assembly errors. Many groups have rallied around long reads as a key technology for tick genome assembly in particular, with a notably high-quality Ixodes scapularis (deer tick) example unveiled very recently [10][11]. To the best of our knowledge, no such assembly exists for A. americanum.

Using long-read HiFi DNA sequencing from Pacific Biosciences (PacBio), we assembled an unphased diploid whole-genome map for female A. americanum ticks using the pooled DNA of approximately 50 individuals. We hope that our efforts will provide the research community with a useful resource for advancing work in this important tick species.

Detailed methods

Tick salivary gland extraction

We dissected 50 female ticks for DNA extraction. In an effort to reduce potential contamination from microbes that might inhabit the tick gut, we chose to isolate salivary glands, which incidentally comprise a major mass fraction of the internal tick organ system. We pooled these salivary glands in distilled water, chilled on ice, and extracted DNA immediately after dissection.

DNA extraction

We attempted high-molecular-weight DNA extraction using several different commercially available kits and found that, in our hands, the Circulomics high-molecular-weight DNA tissue kit was most consistent for isolating well-distributed, ≥ 30 kb genomic DNA fragments (as judged by Femto Pulse analysis).

Sequencing

We submitted raw genomic DNA to UC Berkeley’s QB3 genomics core for fragment analysis, shearing, and 12–17 kb size selection. Subsequently, we prepared HiFi libraries using PacBio’s SMRTbell prep kit 3.0. We sequenced these libraries using two SMRT Cells (8M) and a Sequel II instrument. UC Berkeley’s Vincent J. Coates genomics sequencing lab processed raw sequencing data into circular consensus (CCS) HiFi reads and sent us data in HiFi FASTQ format.

Data analysis

Our long-read assembly and assessment workflow is summarized in Figure 1. We tried assembling the genome with Shasta, Flye, and Hifiasm with default settings on a 10-core 3.7 GHz Xeon workstation containing 224 GB of RAM, and ultimately moved forward with Flye.

The following code blocks depict the command line scripts we used for the assembly and assessment processes:

Concatenate FASTQ files (optional):

cat *.fastq > concatenated_file_name.fastq

Run Flye assembler (v2.8.3; default settings):

flye --pacbio-hifi concatenated_file_name.fastq -g 3g -o output_directory -t 19 --min-overlap 5000

BUSCO assessment (v5.4.4):

busco -i assembly.fasta -l arachnida_odb10 -o output_directory -m genome

Purge_dups (v1.2.6):

Instructions for execution are available here.

All other code, including the Jupyter notebook we used for genome clean-up, is available at this GitHub repository (DOI: 10.5281/zenodo.7787240).


Data deposition

We deposited raw HiFi reads (FASTQ files) into NCBI (bioproject PRJNA932813) and our pseudo-haploid genome assembly (FASTA file) into NCBI/GenBank (accession GCA_030143305.1). We also uploaded our full, unphased assembly (FASTA file) to Zenodo (DOI: 10.5281/zenodo.7747102).

The data

Figure 1

Bioinformatics workflow.

We received a sufficient amount of data from each SMRT Cell 8M. Our combined yield totaled 3.3 million HiFi CCS reads, composed of approximately 44.5 billion HiFi CCS bases, for an average HiFi insert length of 13.6 kb. We expected this amount of data to provide approximately 15-fold coverage of the A. americanum genome. We subjected the data to a long-read assembly and assessment workflow (Figure 1) starting with some cursory test assemblies using Shasta, Flye, and Hifiasm with default settings on a 10-core 3.7 GHz Xeon workstation containing 224 GB of RAM [12][13][14]. We found that Flye and Hifiasm provided the most BUSCO complete assemblies using data from just one SMRT Cell 8M (Figure 2).
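The 15-fold figure follows from dividing the total HiFi bases by the expected genome size. The short sketch below reproduces that arithmetic using only the numbers quoted above; the function and variable names are ours for illustration, and the 3 Gb genome size is the flow cytometry estimate rather than a measured assembly length.

```python
# Figures quoted in the text; the genome size is an estimate, not a measurement.
total_hifi_bases = 44.5e9      # ~44.5 billion HiFi CCS bases from both SMRT Cells
expected_genome_size = 3.0e9   # ~3 Gb haploid genome (flow cytometry estimate)


def fold_coverage(total_bases: float, genome_size: float) -> float:
    """Return the expected average depth: sequenced bases / genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return total_bases / genome_size


coverage = fold_coverage(total_hifi_bases, expected_genome_size)
print(f"Expected coverage: {coverage:.1f}x")  # Expected coverage: 14.8x
```

Because the DNA came from a pool of roughly 50 individuals, this is an average depth across the pooled haplotypes and any single haplotype would likely be covered less deeply.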

Figure 2

BUSCO results for each assembly.

Note that we deposited the entry labeled “Cell 1+2:Flye, Purge_Dups (Hap)” in Zenodo and deposited the entry labeled “Cell 1+2:Flye, Purge_Dups (Purged)” at NCBI/GenBank.
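For readers who want to tabulate BUSCO results across assemblies themselves, BUSCO reports a one-line summary of the form C:..%[S:..%,D:..%],F:..%,M:..%,n:... The sketch below parses that line into numbers. The function name and the example string are hypothetical and the values are made up for illustration; they are not results from this study.

```python
import re
from typing import Dict

# Matches BUSCO's one-line summary, e.g. "C:80.0%[S:70.0%,D:10.0%],F:8.0%,M:12.0%,n:1000"
BUSCO_PATTERN = re.compile(
    r"C:(?P<complete>[\d.]+)%\[S:(?P<single>[\d.]+)%,D:(?P<duplicated>[\d.]+)%\],"
    r"F:(?P<fragmented>[\d.]+)%,M:(?P<missing>[\d.]+)%,n:(?P<total>\d+)"
)


def parse_busco_summary(line: str) -> Dict[str, float]:
    """Extract percentages and the BUSCO count from a one-line summary."""
    match = BUSCO_PATTERN.search(line)
    if match is None:
        raise ValueError(f"Not a BUSCO summary line: {line!r}")
    result = {name: float(value) for name, value in match.groupdict().items()}
    result["total"] = int(match.group("total"))  # number of BUSCOs searched
    return result


# Made-up example values, for illustration only.
example = "C:80.0%[S:70.0%,D:10.0%],F:8.0%,M:12.0%,n:1000"
parsed = parse_busco_summary(example)
print(parsed["complete"], parsed["duplicated"])
```

The duplicated percentage is the value to watch when comparing an unphased assembly with its purged counterpart, since retained haplotype copies tend to show up as duplicated BUSCOs.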

In our experience, Flye was the fastest assembler that produced reasonably (> 75%) complete assemblies. However, Hifiasm produced several assemblies and the largest (unitig) assembly contained the most duplicated BUSCO genes. Flye consumed approximately 1–2 days of processing time for one SMRT Cell worth of data and 3–4 days of processing time using the combined data from both SMRT Cells. Hifiasm consumed 1–2 days for one SMRT Cell and didn’t complete processing after three weeks for two SMRT Cells. We suspect that Hifiasm might have had trouble with our dataset because the genomic DNA we sequenced came from 50 ticks rather than one individual, which would have been the ideal scenario.

For our initial draft assembly, we chose to move forward with Flye due to its speed, convenience, and simplicity of output. However, the unitig assembly that Hifiasm produced is a bit larger and potentially more information-rich than both the Hifiasm contig assembly and the default Flye assembly. This could lead to higher read-mapping rates for RNA-seq data and more complete protein databases for proteomics.

During genome deposition at NCBI/GenBank, the raw assembly that Flye produced triggered a few automated error messages, indicating that our assembly needed some light cleanup. Specifically, we had several contigs of less than 200 nucleotides and several duplicate contigs that we needed to remove. We also had a contig containing an adaptor sequence requiring adaptor excision. We generated a Python-based Jupyter notebook to take care of these issues.
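As a rough illustration of the first two cleanup steps (dropping contigs under 200 nucleotides and removing duplicate contigs), here is a minimal sketch in plain Python. It is not the notebook from our repository: the function names are hypothetical, "duplicate" is assumed to mean an exact, case-insensitive sequence match, and adaptor excision is not covered.

```python
from typing import Iterable, Iterator, List, Tuple

MIN_CONTIG_LENGTH = 200  # contigs shorter than this were flagged during deposition


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header = None
    chunks: List[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def clean_contigs(
    records: Iterable[Tuple[str, str]], min_length: int = MIN_CONTIG_LENGTH
) -> List[Tuple[str, str]]:
    """Drop contigs shorter than min_length and exact duplicate sequences."""
    seen = set()
    kept = []
    for header, seq in records:
        if len(seq) < min_length:
            continue  # too short to deposit
        key = seq.upper()  # assumption: duplicates are exact, ignoring case
        if key in seen:
            continue  # keep only the first copy of each sequence
        seen.add(key)
        kept.append((header, seq))
    return kept


# Toy input: "b" is too short and "c" duplicates "a" (in lowercase).
toy = [">a", "ACGT" * 60, ">b", "ACGT", ">c", "acgt" * 60, ">d", "TTGCA" * 50]
kept = clean_contigs(parse_fasta(toy))
print([header for header, _ in kept])  # ['a', 'd']
```

For a real assembly, the same functions could be fed an open file handle in place of the toy list. Holding every sequence in a set is simple but memory-hungry at multi-gigabase scale, so hashing each sequence would be a reasonable substitution.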

All code, including the Jupyter notebook we used for clean-up, is available at this GitHub repository (DOI: 10.5281/zenodo.7787240).

The final issue, which a simple Python script could not resolve, was the fact that our assembly was too large compared to NCBI/GenBank estimates. To solve this issue, we used Purge_Dups to split our unphased assembly [15]. This generated a pseudo-haploid assembly which we then cleaned up using our aforementioned Python-based Jupyter notebook. We deposited the resultant assembly at NCBI/GenBank, available under accession GCA_030143305.1. We also deposited our unphased diploid genome into Zenodo for anyone interested in accessing our full dataset.

Key takeaways

We’ve assembled an 88% BUSCO-complete long-read pseudo-haploid Amblyomma americanum genome of approximately three gigabases from 50 individual female ticks. It’s available for download and use in GenBank. The unphased diploid genome is also available in a Zenodo repository.

Next steps

We’ve continued to make refinements to the Amblyomma americanum assembly. We identified sequences from a Coxiella-like endosymbiont in the salivary glands (as seen previously [2]), but haven’t done any work to analyze them beyond that. In a follow-up pub, we identified and removed those and other contaminant contigs [1] from our deposited assembly. We also performed gene-finding operations for this assembly, which are now available on GenBank (GCA_030143305.2) and UniProt (UP001321473). In the future, we plan to use this genome in research projects that involve expression profiling and comparative genomics.




Acknowledgements

Thank you to the QB3 Genomics Facility (RRID:SCR_022170) at UC Berkeley for raw DNA quality control, library preparation, and sequencing.


Comments
Yoonseong Park:

The current NCBI version (as of 11/17/2023), GCA_030143305.1, is found to be 84.2% BUSCO-complete (arthropoda), indicating that the final cleanup of the “Cell 1+2:Flye, Purge_Dups (Purged)” assembly may have removed a number of genes, including some from the arthropod BUSCO set.

Taylor Reiter:

Thank you for your comment, Yoonseong! Because we had to extract DNA from so many ticks, our assemblies ended up being pretty heterogeneous. The genome assembly we initially tried to upload to NCBI was 6 Gb and was flagged by NCBI’s quality control as being too large. We inferred from this that it was too large because it was diploid and used purge_dups to generate a pseudo-haploid genome, which then passed NCBI’s quality control.

It sounds like you have more systematically assessed the gene content of these assemblies than we did here, and have determined some genes that are missing in our pseudo haploid genome? We think there is definitely room for improvement on the completeness of the genome sequence and appreciate your efforts and would love to hear more.

Yoonseong Park:

I am looking forward to seeing this output. We are willing to collaborate with you by providing PacBio Iso-Seq data and manual annotations of a number of genes.

Taylor Reiter:

We’re glad to hear these outputs might be useful for your research! We are close to finishing our ORF annotation efforts for this genome and are looking forward to posting these outputs soon. While we’ll provide a more complete account shortly, we used EVM to produce gene models for this genome. We think our annotations are of similar completeness to the genome itself. Once we have this data available, we would love to hear if these gene model predictions are inclusive of those you found using your Iso-Seq data.

Gloria I. Giraldo-Calderón:

Thanks for generating a tick genome; we are much in need of representation from this group of (medically important) arthropods! It would be great if you could also assemble the scaffolds into chromosomes. You wrote as part of Next steps “in silico functional annotations”. Based on experience with other genomes, I strongly encourage you to request that NCBI annotate your genome with their pipeline: https://www.ncbi.nlm.nih.gov/genome/annotation_euk/process/. I have seen several people just give up on uploading their annotations to NCBI because the NCBI side did not provide clear indications of file errors and/or format problems during the submission process. Actually, it could be a great benefit to NCBI genome database users if “someone” could help them streamline the submission process for genome annotation files.

Elizabeth A. McDaniel:

Hi Gloria, thank you for your comment!

  1. We are in the process of annotating the draft genome and will have an updated version of the pub soon with links to our set of annotated genes/proteins!

  2. At this time we don’t have plans to scaffold the genome into chromosomes as the draft assembly is quite fragmented with just the PacBio HiFi data we had alone. If we decide to curate this genome further we will update the pub/data as we go!

Jonathan A. Eisen:

Did you look for microbial DNA data in the sequences? Did this eliminate all microbial contamination or was there anything that showed up even in these salivary samples?

Jonathan A. Eisen:

Actually, following up on this: I went to NCBI and did a blastn search against one of the SRA entries here, SRX19315096. I searched with the E. coli 16S rRNA gene and found many hits. I took the sequence of one of these hits and then reblasted it against NCBI.

The sequence I searched with is pasted below. Its top hit is to Coxiella-like endosymbionts of ticks, so I guess at least some of these are in the samples. Do you possibly have a whole genome of this symbiont in your assemblies?

    GGATTAGTTAGTTGGTGGGGTAATAGCCCACCAAGACGATGATCCGTAGCTGGTCTGAGAGGATGATCAGCCACACTGGGACTGAGACACGGCCCAGACTCCTATGGGAGGCAGCAGTGGGGAATATTGGACAATGGGGGAAACCCTGATCCAGCAATGCCGCGTGTGTGAAGAAGGCCTTCGGGTTGTAAAGCACTTTCAGCAGGAAAGAAAGTCTTAAAGTTAATACCTTTAATGAGTTGACGTTACCTGCAGAAGAAGCACTGGCTAACTCTGTGCCAGCAGCCGCGGTAATACAGAGAGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCGCGTAGGTGGATATTAAAGTCGGGTGTGAAAGCCCCGAGCTTAACTCGGGAATTGCGTTCGATACTGAGTATCTAGAGTATAGTAGAGGGAAGTGGAATTTCCGGTGTAGCGGTGAAATGCGTAGATATCGGAAGGAACACCAGTGGCGAAGGCGGCTTCCTGGACTAATACTGACACTGAGGTGCGAAAGCGTAGGGAGCAAACAGGATTAGAGACCCTGGTAGTCCACGCTGTCAACGATGAGAACTAGCTGTTAGAAAACTTGTTTCTTGGTAGCGAAGCTAACGCGTTAAGTTCTCCGCCTGGGGAGTACGACCGCAAGGTTAAAACTCAAAGAAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGATGCAACACGAAAAACCTTACCTACCCTTGACATCCTCGGAACTTGTCAGAGATGACTTGGTGCCTTCGGGAACCGAGTGACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGTAACGAGCGCAACCCTTGTCCTTAGTTGCCAGCGGTTCGGCCGGGAACTCTAAGGAGACTGCCGGTGATAAACCGGAGGAAGGTGAGGATGATGTCAAGTCATCATGGCCCTTATGGGTAGGGCTACACACGTGCTACAATGGGCAGTACAGAGGGTTGCCAAATCGTGAGGTGGAGCTAATCCCAGAAAGCTGCTCGTAGTCCGGATTGGAGTCTGCAACTCGACTCTATGAAGCTGGAATCGCTAGTAATCGCGAATCAGAATGTCGCGGTGAATACGTTCTCGGGCCTTGTACACACCGCCCGTCACACCATGGGAGTGGGCTGTACCAGAAGCGGGTAGGCTAACCTTATAGGAGGCCGCTCACCACGGTATGGTTCATGACTGGGGTGAAGTCGTAACAAGGTAGCCGTAGGGGAACCTGCGGCTGGATCACCTCCTTA
