De novo assembly of a long-read Amblyomma americanum tick genome

Seemay Chou; Tori Doran; Behnom Farboud; Megan L. Hochstrasser; Kira E. Poskanzer; Taylor Reiter; MaryClare Rollins; Peter S. Thuy-Boun

doi:10.57844/arcadia-9b6j-q683

Purpose

This pub introduces our first public draft assembly of an Amblyomma americanum genome. We’ve continued to refine and explore this assembly in a variety of projects, including for genome annotation, gene expression, and comparative genomics [1]. We’re sharing the work that led to our initial assembly in the hope that it will serve as a valuable reference dataset for the tick research community. We think there’s more work to be done with this genome. For example, we found sequences from the endosymbiont Coxiella in the salivary gland, consistent with previous reports [2], but haven’t followed up on this at all.

- This pub is part of the project, “Ticks as treasure troves: Molecular discovery in new organisms.” Visit the project narrative for more background and context.
- We followed up on this work in a subsequent pub, “Predicted genes from the Amblyomma americanum draft genome assembly.” We predicted proteins from our A. americanum assembly, which are now available on GenBank (GCA_030143305.2) and UniProt (UP001321473).
- Raw genomic data is available via NCBI at BioProject PRJNA932813.
- Our pseudo-haploid genome is in GenBank under accession GCA_030143305. We cover assembly version GCA_030143305.1 in this pub.
- Our full assembly is on Zenodo.
- Our code is available in this GitHub repository.

Share your thoughts!

Feel free to provide feedback by commenting in the box at the bottom of this page or by posting about this work on social media. Please make all feedback public so other readers can benefit from the discussion.

Background and goals

We want to understand how ticks manipulate humans, so we’ve been studying the lone star tick, Amblyomma americanum. Given that there are over 900 species of ticks identified so far, it’s easy to overlook A. americanum in particular. However, to many in the eastern and southern United States, A. americanum is a pest with a rapidly growing presence and predilection for humans as hosts [3][4].

One of our long-time pain points in studying A. americanum has been operating without a reference genome. Despite the species’ relevance to humans, publicly available genomic references for A. americanum are lacking, and we suspect this deficit might stem from the tick’s surprising genetic complexity. Based on flow cytometry estimates, the tick’s haploid genome is expected to be approximately 3 gigabases (Gb) [5]. For comparison, Escherichia coli (bacterium) has a 5 megabase (Mb) genome, Drosophila melanogaster (fruit fly) has a 140 Mb genome, Mus musculus (mouse) has a 2.7 Gb genome, and Homo sapiens (human) has a 2.9 Gb genome [6][7]. Based on these figures, one might not expect that tiny ticks could hold so much genetic information, but believe it or not, the tree of life is peppered with examples more extreme than this. Our ability to accurately and comprehensively map these edge cases with reasonably ordinary resources is a recent phenomenon.

Previously, we and others generated transcriptome assemblies using tissue extracted from A. americanum ticks [8][9]. These datasets provide snapshots of the genes being transcribed in the cells we collected. This information was instrumental for generating the protein databases we needed to do mass spectrometry-based proteomics on the same tick species. While these transcriptome datasets have proven useful, they can fall short when mass spectrometry detects peptide features that don’t correspond to reference transcripts. In these cases, the mass spectrometry features will likely go unidentified in analyses until more transcriptomic data becomes available.

Rather than iteratively sequence tick transcriptomes under various conditions (to capture broader transcript landscapes), we decided that a whole-genome assembly could give us more bang for our buck by capturing all genes encoded by A. americanum.

SHOW ME THE DATA: Access our raw genomic data on NCBI, our assembled pseudo-haploid genome in GenBank, and our full assembly on Zenodo.

The approach

We felt that the time might be perfect to generate an A. americanum genome because long-read DNA sequencing technologies have become more accessible in recent years, meeting the challenge of mapping large and complex genomes. These long reads should, in theory, make the assembly of large genomes less computationally expensive (compared to short reads) while reducing assembly errors. Many groups have rallied around long reads as a key technology for tick genome assembly in particular, with a notably high-quality Ixodes scapularis (deer tick) example unveiled very recently [10][11]. To the best of our knowledge, no such assembly exists for A. americanum.

Using long-read HiFi DNA sequencing from Pacific Biosciences (PacBio), we assembled an unphased diploid whole-genome map for female A. americanum ticks using the pooled DNA of approximately 50 individuals. We hope that our efforts will provide the research community with a useful resource for advancing work in this important tick species.

Detailed methods

Tick salivary gland extraction

We dissected 50 female ticks for DNA extraction. In an effort to reduce potential contamination from microbes that might inhabit the tick gut, we chose to isolate salivary glands, which incidentally comprise a major mass fraction of the internal tick organ system. We pooled these salivary glands in distilled water, chilled them on ice, and extracted DNA immediately after dissection.

DNA extraction

We attempted high-molecular-weight DNA extraction using several different commercially available kits and found that, in our hands, the Circulomics high-molecular-weight DNA tissue kit was the most consistent for isolating well-distributed genomic DNA fragments of ≥ 30 kb (as judged by Femto Pulse analysis).

Sequencing

We submitted raw genomic DNA to UC Berkeley’s QB3 genomics core for fragment analysis, shearing, and 12–17 kb size selection. Subsequently, we prepared HiFi libraries using PacBio’s SMRTbell prep kit 3.0. We sequenced these libraries using two SMRT Cells (8M) and a Sequel II instrument. UC Berkeley’s Vincent J. Coates genomics sequencing lab processed raw sequencing data into circular consensus sequencing (CCS) HiFi reads and sent us data in HiFi FASTQ format.

Data analysis

Our long-read assembly and assessment workflow is summarized in Figure 1. We tried assembling the genome with Shasta, Flye, and Hifiasm with default settings on a 10-core 3.7 GHz Xeon workstation containing 224 GB of RAM, and ultimately moved forward with Flye.

The following code blocks depict the command line scripts we used for the assembly and assessment processes:

Concatenate FASTQ files (optional):

cat *.fastq > concatenated_file_name.fastq
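If the concatenation is re-run in the same directory, the `*.fastq` glob will also match the previous output file. As a hedged alternative, here is a minimal Python sketch that leaves the output file out of the inputs. The function name `concatenate_fastq` and its arguments are illustrative assumptions and not part of the workflow described here:

```python
import shutil
from pathlib import Path


def concatenate_fastq(input_dir, output_path):
    """Concatenate every *.fastq file in input_dir into output_path.

    The output file itself is excluded from the inputs, so re-running
    the function in the same directory does not duplicate reads.
    Returns the list of input paths that were concatenated.
    """
    output_path = Path(output_path)
    inputs = sorted(
        path
        for path in Path(input_dir).glob("*.fastq")
        if path.resolve() != output_path.resolve()
    )
    with open(output_path, "wb") as out:
        for path in inputs:
            with open(path, "rb") as src:
                # Stream in chunks so large FASTQ files are never held in memory.
                shutil.copyfileobj(src, out)
    return inputs
```

Calling `concatenate_fastq(".", "concatenated_file_name.fastq")` would mirror the command above.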

Run Flye assembler (v2.8.3; default settings):

flye --pacbio-hifi concatenated_file_name.fastq -g 3g -o output_directory -t 19 --min-ovlp 5000

BUSCO assessment (v5.4.4):

busco -i assembly.fasta -l arachnida_odb10 -o output_directory -m genome

Purge_dups (v1.2.6):

Instructions for execution are available here.

All other code, including the Jupyter notebook we used for genome clean-up, is available at this GitHub repository (DOI: 10.5281/zenodo.7787240).

Data deposition

We deposited raw HiFi reads (FASTQ files) into NCBI (BioProject PRJNA932813) and our pseudo-haploid genome assembly (FASTA file) into NCBI/GenBank (accession GCA_030143305.1). We also uploaded our full, unphased assembly (FASTA file) to Zenodo (DOI: 10.5281/zenodo.7747102).

The data

We received a sufficient amount of data from each SMRT Cell 8M. Our combined yield totaled 3.3 million HiFi CCS reads, composed of approximately 44.5 billion HiFi CCS bases, for an average HiFi insert length of 13.6 kb. We expected this amount of data to provide approximately 15-fold coverage of the A. americanum genome. We subjected the data to a long-read assembly and assessment workflow (Figure 1), starting with some cursory test assemblies using Shasta, Flye, and Hifiasm with default settings on a 10-core 3.7 GHz Xeon workstation containing 224 GB of RAM [12][13][14]. We found that Flye and Hifiasm provided the most BUSCO-complete assemblies using data from just one SMRT Cell 8M (Figure 2).
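As a rough check on these figures, expected coverage is the total number of sequenced bases divided by the haploid genome size. The sketch below uses only the rounded numbers reported above. The function names are illustrative, and the 3 Gb genome size is the flow cytometry estimate cited earlier rather than a measured value:

```python
def mean_read_length(total_bases: float, n_reads: float) -> float:
    """Average read length in bases."""
    return total_bases / n_reads


def expected_coverage(total_bases: float, genome_size: float) -> float:
    """Fold coverage = sequenced bases / haploid genome size."""
    return total_bases / genome_size


# Figures as reported in the text (rounded).
TOTAL_BASES = 44.5e9  # ~44.5 billion HiFi CCS bases
N_READS = 3.3e6       # ~3.3 million HiFi CCS reads
GENOME_SIZE = 3e9     # ~3 Gb haploid estimate (assumption, from flow cytometry)

coverage = expected_coverage(TOTAL_BASES, GENOME_SIZE)
mean_len_kb = mean_read_length(TOTAL_BASES, N_READS) / 1e3

print(f"expected coverage: {coverage:.1f}x")
print(f"mean read length from rounded totals: {mean_len_kb:.1f} kb")
```

With the rounded totals, this works out to a little under 15-fold coverage and a mean read length of about 13.5 kb. That is close to the reported 13.6 kb, and the small gap is the kind of difference that rounding the read and base counts would produce.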

Note that we deposited the entry labeled “Cell 1+2:Flye, Purge_Dups (Hap)” in Zenodo and deposited the entry labeled “Cell 1+2:Flye, Purge_Dups (Purged)” at NCBI/GenBank.

In our experience, Flye was the fastest assembler that produced reasonably (> 75%) complete assemblies. Hifiasm, by contrast, produced several assemblies, and the largest (unitig) assembly contained the most duplicated BUSCO genes. Flye consumed approximately 1–2 days of processing time for one SMRT Cell’s worth of data and 3–4 days of processing time using the combined data from both SMRT Cells. Hifiasm consumed 1–2 days for one SMRT Cell and didn’t complete processing after three weeks for two SMRT Cells. We suspect that Hifiasm might have had trouble with our dataset because the genomic DNA we sequenced came from 50 ticks rather than one individual, which would have been the ideal scenario.
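To compare assemblies against a completeness cutoff like the one above, the BUSCO scores can be read programmatically. The following is a minimal sketch, assuming the one-line score string that BUSCO v5 writes to its short summary file (of the form `C:..%[S:..%,D:..%],F:..%,M:..%,n:..`). The function names and the example string are hypothetical, and the example numbers are made up for illustration rather than taken from our results:

```python
import re

# Matches a BUSCO one-line score string such as
# "C:80.0%[S:70.0%,D:10.0%],F:5.0%,M:15.0%,n:100".
BUSCO_LINE = re.compile(
    r"C:(?P<C>[\d.]+)%\[S:(?P<S>[\d.]+)%,D:(?P<D>[\d.]+)%\],"
    r"F:(?P<F>[\d.]+)%,M:(?P<M>[\d.]+)%,n:(?P<n>\d+)"
)


def parse_busco_summary(text: str) -> dict:
    """Return complete (C), single-copy (S), duplicated (D), fragmented (F),
    and missing (M) percentages plus the number of BUSCO groups searched (n)."""
    match = BUSCO_LINE.search(text)
    if match is None:
        raise ValueError("no BUSCO score line found")
    scores = {
        key: float(value)
        for key, value in match.groupdict().items()
        if key != "n"
    }
    scores["n"] = int(match.group("n"))
    return scores


def is_reasonably_complete(scores: dict, threshold: float = 75.0) -> bool:
    """Apply a completeness cutoff (75% here, as in the text)."""
    return scores["C"] > threshold


# Made-up score string, for illustration only.
example = "C:80.0%[S:70.0%,D:10.0%],F:5.0%,M:15.0%,n:100"
scores = parse_busco_summary(example)
print(scores["C"], scores["D"], is_reasonably_complete(scores))
```

The duplicated percentage (`D`) is the value to watch when comparing unphased assemblies, since retained haplotypes tend to inflate it.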

For our initial draft assembly, we chose to move forward with Flye due to its speed, convenience, and simplicity of output. However, the unitig assembly that Hifiasm produced is a bit larger and potentially more information-rich than the contig Hifiasm assembly and the default Flye assembly. This could lead to higher read-mapping rates for RNA-seq data and more complete protein databases for proteomics.

During genome deposition at NCBI/GenBank, the raw assembly that Flye produced triggered a few automated error messages, indicating that our assembly needed some light cleanup. Specifically, we had several contigs of less than 200 nucleotides and several duplicate contigs that we needed to remove. We also had a contig containing an adaptor sequence requiring adaptor excision. We generated a Python-based Jupyter notebook to take care of these issues.

All code, including the Jupyter notebook we used for clean-up, is available at this GitHub repository (DOI: 10.5281/zenodo.7787240).
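The notebook in the repository is the authoritative version of this clean-up. For readers who want the gist, here is a minimal standard-library sketch of the three steps described above: excising a flagged adaptor span, dropping contigs shorter than 200 nucleotides, and dropping duplicate contigs. The function names, the 0-based end-exclusive coordinate convention, and the choice to simply cut out the adaptor span are all assumptions for illustration and may differ from what the notebook does. The duplicate check only catches exact sequence matches and keeps every unique sequence in memory, so it is not tuned for a multi-gigabase assembly:

```python
from typing import Dict, Iterable, Iterator, Optional, Tuple

MIN_CONTIG_LEN = 200  # contigs shorter than this were flagged at deposition


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) pairs from FASTA-formatted lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def clean_contigs(
    records: Iterable[Tuple[str, str]],
    adaptor_spans: Optional[Dict[str, Tuple[int, int]]] = None,
    min_len: int = MIN_CONTIG_LEN,
) -> Iterator[Tuple[str, str]]:
    """Excise adaptor spans, then drop short and exact-duplicate contigs.

    adaptor_spans maps contig name -> (start, end), 0-based and end-exclusive
    (a hypothetical convention chosen for this sketch).
    """
    adaptor_spans = adaptor_spans or {}
    seen = set()
    for name, seq in records:
        if name in adaptor_spans:
            start, end = adaptor_spans[name]
            seq = seq[:start] + seq[end:]  # cut the span out, join the flanks
        if len(seq) < min_len:
            continue  # too short to deposit
        key = seq.upper()
        if key in seen:
            continue  # exact duplicate of an earlier contig
        seen.add(key)
        yield name, seq


# Tiny made-up assembly: one short contig, one duplicate, one adaptor hit.
fasta_text = """>contig_1
{a}
>contig_2
ACGT
>contig_3
{a}
>contig_4
{b}
""".format(a="ACGT" * 60, b="TTTT" * 30 + "GGGGGGGG" + "CCCC" * 30)

cleaned = list(
    clean_contigs(read_fasta(fasta_text.splitlines()), {"contig_4": (120, 128)})
)
print([name for name, _ in cleaned])
```

In this toy input, `contig_2` is removed for length, `contig_3` is removed as a duplicate of `contig_1`, and the eight-base stand-in adaptor is cut from `contig_4`.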

The final issue, which a simple Python script could not resolve, was the fact that our assembly was too large compared to NCBI/GenBank estimates. To solve this issue, we used Purge_Dups to split our unphased assembly [15]. This generated a pseudo-haploid assembly which we then cleaned up using our aforementioned Python-based Jupyter notebook. We deposited the resultant assembly at NCBI/GenBank, available under accession GCA_030143305.1. We also deposited our unphased diploid genome into Zenodo for anyone interested in accessing our full dataset.

Key takeaways

We’ve assembled an 88% BUSCO-complete long-read pseudo-haploid Amblyomma americanum genome of approximately three gigabases from 50 individual female ticks. It’s available for download and use in GenBank. The unphased diploid genome is also available in a Zenodo repository.

Next steps

We’ve continued to make refinements to the Amblyomma americanum assembly. We identified sequences from a Coxiella-like endosymbiont in the salivary glands (as seen previously [2]), but haven’t done any work to analyze them beyond that. In a follow-up pub, we identified and removed those and other contaminant contigs from our deposited assembly [1]. We also predicted genes for this assembly, and the predictions are now available on GenBank (GCA_030143305.2) and UniProt (UP001321473). In the future, we plan to use this genome in research projects that involve expression profiling and comparative genomics.

Acknowledgements
- Thank you to the QB3 Genomics Facility (RRID:SCR_022170) at UC Berkeley for raw DNA quality control, library preparation, and sequencing.

References

[1] Celebi FM, Chou S, McDaniel EA, Reiter T, Weiss ECP. (2024). Predicted genes from the Amblyomma americanum draft genome assembly. https://doi.org/10.57844/ARCADIA-9602-3351

[2] Klyachko O, Stein BD, Grindle N, Clay K, Fuqua C. (2007). Localization and Visualization of a Coxiella-Type Symbiont within the Lone Star Tick, Amblyomma americanum. https://doi.org/10.1128/aem.00537-07

[3] McClung KL, Little SE. (2023). Amblyomma americanum (Lone star tick). https://doi.org/10.1016/j.pt.2022.10.005

[4] Centers for Disease Control and Prevention. (2020). Guide to the Surveillance of Metastriate Ticks (Acari: Ixodidae) and their Pathogens in the United States. www.cdc.gov/ticks/surveillance

[5] Geraci NS, Spencer Johnston J, Paul Robinson J, Wikel SK, Hill CA. (2007). Variation in genome size of argasid and ixodid ticks. https://doi.org/10.1016/j.ibmb.2006.12.007

[6] Hotaling S, Kelley JL, Frandsen PB. (2021). Toward a genome sequence for every animal: Where are we now? https://doi.org/10.1073/pnas.2109019118

[7] Archer CT, Kim JF, Jeong H, Park JH, Vickers CE, Lee SY, Nielsen LK. (2011). The genome sequence of E. coli W (ATCC 9637): comparative genome analysis and an improved genome-scale reconstruction of E. coli. https://doi.org/10.1186/1471-2164-12-9

[8] Kim TK, Tirloni L, Pinto AFM, Diedrich JK, Moresco JJ, Yates JR, da Silva Vaz I, Mulenga A. (2020). Time-resolved proteomic profile of Amblyomma americanum tick saliva during feeding. https://doi.org/10.1371/journal.pntd.0007758

[9] Chou S, Poskanzer KE, Thuy-Boun PS. (2024). Robust long-read saliva transcriptome and proteome from the lone star tick, Amblyomma americanum. https://doi.org/10.57844/ARCADIA-3HYH-3H83

[10] De S, Kingan SB, Kitsou C, Portik DM, Foor SD, Frederick JC, Rana VS, Paulat NS, Ray DA, Wang Y, Glenn TC, Pal U. (2023). A high-quality Ixodes scapularis genome advances tick science. https://doi.org/10.1038/s41588-022-01275-w

[11] Gulia-Nuss M, Nuss AB, Meyer JM, Sonenshine DE, Roe RM, Waterhouse RM, Sattelle DB, de la Fuente J, Ribeiro JM, Megy K, Thimmapuram J, Miller JR, Walenz BP, Koren S, Hostetler JB, Thiagarajan M, Joardar VS, Hannick LI, Bidwell S, Hammond MP, Young S, Zeng Q, Abrudan JL, Almeida FC, Ayllón N, Bhide K, Bissinger BW, Bonzon-Kulichenko E, Buckingham SD, Caffrey DR, Caimano MJ, Croset V, Driscoll T, Gilbert D, Gillespie JJ, Giraldo-Calderón GI, Grabowski JM, Jiang D, Khalil SMS, Kim D, Kocan KM, Koči J, Kuhn RJ, Kurtti TJ, Lees K, Lang EG, Kennedy RC, Kwon H, Perera R, Qi Y, Radolf JD, Sakamoto JM, Sánchez-Gracia A, Severo MS, Silverman N, Šimo L, Tojo M, Tornador C, Van Zee JP, Vázquez J, Vieira FG, Villar M, Wespiser AR, Yang Y, Zhu J, Arensburger P, Pietrantonio PV, Barker SC, Shao R, Zdobnov EM, Hauser F, Grimmelikhuijzen CJP, Park Y, Rozas J, Benton R, Pedra JHF, Nelson DR, Unger MF, Tubio JMC, Tu Z, Robertson HM, Shumway M, Sutton G, Wortman JR, Lawson D, Wikel SK, Nene VM, Fraser CM, Collins FH, Birren B, Nelson KE, Caler E, Hill CA. (2016). Genomic insights into the Ixodes scapularis tick vector of Lyme disease. https://doi.org/10.1038/ncomms10507

[12] Shafin K, Pesout T, Lorig-Roach R, Haukness M, Olsen HE, Bosworth C, Armstrong J, Tigyi K, Maurer N, Koren S, Sedlazeck FJ, Marschall T, Mayes S, Costa V, Zook JM, Liu KJ, Kilburn D, Sorensen M, Munson KM, Vollger MR, Monlong J, Garrison E, Eichler EE, Salama S, Haussler D, Green RE, Akeson M, Phillippy A, Miga KH, Carnevali P, Jain M, Paten B. (2020). Nanopore sequencing and the Shasta toolkit enable efficient de novo assembly of eleven human genomes. https://doi.org/10.1038/s41587-020-0503-6

[13] Kolmogorov M, Yuan J, Lin Y, Pevzner PA. (2019). Assembly of long, error-prone reads using repeat graphs. https://doi.org/10.1038/s41587-019-0072-8

[14] Cheng H, Concepcion GT, Feng X, Zhang H, Li H. (2021). Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. https://doi.org/10.1038/s41592-020-01056-5

[15] Guan D, McCarthy SA, Wood J, Howe K, Wang Y, Durbin R. (2020). Identifying and removing haplotypic duplication in primary genome assemblies. https://doi.org/10.1093/bioinformatics/btaa025
