Description
This pub introduces our first public draft assembly of an Amblyomma americanum genome. We’ve continued to refine and explore this assembly in a variety of projects, including for genome annotation, gene expression, and comparative genomics [1]. We’re sharing the work that led to our initial assembly in the hope that it will serve as a valuable reference dataset for the tick research community. We think there’s more work to be done with this genome. For example, we found sequences from the endosymbiont Coxiella in the salivary gland, consistent with previous reports [2], but haven’t followed up on this at all.
This pub is part of the project, “Ticks as treasure troves: Molecular discovery in new organisms.” Visit the project narrative for more background and context.
We followed up on this work in a subsequent pub, “Predicted genes from the Amblyomma americanum draft genome assembly.” We predicted proteins from our A. americanum assembly, which are now available on GenBank (GCA_030143305.2) and UniProt (UP001321473).
Raw genomic data is available via NCBI at BioProject PRJNA932813.
Our pseudo-haploid genome is in GenBank under accession GCA_030143305. We cover assembly version GCA_030143305.1 in this pub.
Our full assembly is on Zenodo.
Our code is available in this GitHub repository.
Share your thoughts!
Watch a video tutorial on making a PubPub account and commenting. Please feel free to add line-by-line comments anywhere within this text, provide overall feedback by commenting in the box at the bottom of the page, or use the URL for this page in a tweet about this work. Please make all feedback public so other readers can benefit from the discussion.
We want to understand how ticks manipulate humans, so we’ve been studying the lone star tick, Amblyomma americanum. Given that there are over 900 species of ticks identified so far, it’s easy to overlook A. americanum in particular. However, to many in the eastern and southern United States, A. americanum is a pest with a rapidly growing presence and predilection for humans as hosts [3][4].
One of our long-time pain points in studying A. americanum has been operating without a reference genome. Despite the tick’s relevance to humans, we suspect the deficit in publicly available genomic references for A. americanum might stem from its surprising genetic complexity. Based on flow cytometry estimates, the tick’s haploid genome is expected to be approximately 3 gigabases (Gb) [5]. For comparison, Escherichia coli (bacterium) has a 5 megabase (Mb) genome, Drosophila melanogaster (fruit fly) has a 140 Mb genome, Mus musculus (mouse) has a 2.7 Gb genome, and Homo sapiens (human) has a 2.9 Gb genome [6][7]. Based on these figures, one might not expect that tiny ticks could hold so much genetic information, but believe it or not, the tree of life is peppered with examples more extreme than this. Our ability to accurately and comprehensively map these edge cases with reasonably ordinary resources is a recent phenomenon.
Previously, we and others generated transcriptome assemblies using tissue extracted from A. americanum ticks [8][9]. These datasets provide snapshots of the genes being transcribed in the cells we collected. This information was instrumental for generating the protein databases we needed to do mass spectrometry-based proteomics on the same tick species. While these transcriptome datasets have proven useful, they can fall short when mass spectrometry detects peptide features that don’t correspond to reference transcripts. In these cases, the mass spectrometry features will likely go unidentified in analyses until more transcriptomic data becomes available.
Rather than iteratively sequence tick transcriptomes under various conditions (to capture broader transcript landscapes), we decided that a whole-genome assembly could give us more bang for our buck by capturing all genes encoded by A. americanum.
SHOW ME THE DATA: Access our raw genomic data on NCBI, our assembled pseudo-haploid genome in GenBank, and our full assembly on Zenodo.
We felt that the time might be perfect to generate an A. americanum genome because long-read DNA sequencing technologies have become more accessible in recent years, meeting the challenge of mapping large and complex genomes. These long reads should, in theory, make the assembly of large genomes less computationally expensive (compared to short reads) while reducing assembly errors. Many groups have rallied around long reads as a key technology for tick genome assembly in particular, with a notably high-quality Ixodes scapularis (deer tick) example unveiled very recently [10][11]. To the best of our knowledge, no such assembly exists for A. americanum.
Using long-read HiFi DNA sequencing from Pacific Biosciences (PacBio), we assembled an unphased diploid whole-genome map for female A. americanum ticks using the pooled DNA of approximately 50 individuals. We hope that our efforts will provide the research community with a useful resource for advancing work in this important tick species.
We dissected 50 female ticks for DNA extraction. In an effort to reduce potential contamination from microbes that might inhabit the tick gut, we chose to isolate salivary glands, which incidentally comprise a major mass fraction of the internal tick organ system. We pooled these salivary glands in distilled water, chilled on ice, and extracted DNA immediately after dissection.
We attempted high-molecular-weight DNA extraction using several different commercially available kits and found that, in our hands, the Circulomics high-molecular-weight DNA tissue kit was the most consistent for isolating well-distributed, ≥ 30 kb genomic DNA fragments (as judged by Femto Pulse analysis).
We submitted raw genomic DNA to UC Berkeley’s QB3 genomics core for fragment analysis, shearing, and 12–17 kb size selection. Subsequently, we prepared HiFi libraries using PacBio’s SMRTbell prep kit 3.0. We sequenced these libraries using two SMRT Cells (8M) and a Sequel II instrument. UC Berkeley’s Vincent J. Coates genomics sequencing lab processed raw sequencing data into circular consensus (CCS) HiFi reads and sent us data in HiFi FASTQ format.
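As a rough illustration of how read count and total base yield can be tallied from HiFi FASTQ data, here is a minimal Python sketch. It is not the code used for this pub; the function name `fastq_stats` and the in-memory example reads are hypothetical, and it assumes the common four-lines-per-record FASTQ layout with no wrapped sequence lines.

```python
import io


def fastq_stats(handle):
    """Return (read_count, total_bases) for a 4-line-per-record FASTQ stream."""
    reads = 0
    bases = 0
    for i, line in enumerate(handle):
        if i % 4 == 1:  # the sequence is the second line of each record
            reads += 1
            bases += len(line.strip())
    return reads, bases


# Tiny made-up example standing in for a real HiFi FASTQ file
example = io.StringIO("@read1\nACGTACGT\n+\nIIIIIIII\n@read2\nACGT\n+\nIIII\n")
n_reads, n_bases = fastq_stats(example)
print(n_reads, n_bases, n_bases / n_reads)  # reads, bases, mean read length
```

For a real file, the same function could be called on an opened file object, for example `with open(path) as fh: fastq_stats(fh)`, where `path` is whatever FASTQ file is at hand.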
Our long-read assembly and assessment workflow is summarized in Figure 1. We tried assembling the genome with Shasta, Flye, and Hifiasm with default settings on a 10-core 3.7 GHz Xeon workstation containing 224 GB of RAM, and ultimately moved forward with Flye.
The following code blocks depict the command line scripts we used for the assembly and assessment processes:
Concatenate FASTQ files (optional):
cat *.fastq > concatenated_file_name.fastq
Run Flye assembler (v2.8.3; default settings):
flye --pacbio-hifi concatenated_file_name.fastq -g 3g -o output_directory -t 19 --min-overlap 5000
BUSCO assessment (v5.4.4):
busco -i assembly.fasta -l arachnida_odb10 -o output_directory -m genome
Purge_dups (v1.2.6):
Instructions for execution are available here.
All other code, including the Jupyter notebook we used for genome clean-up, is available at this GitHub repository (DOI: 10.5281/zenodo.7787240).
We deposited raw HiFi reads (FASTQ files) into NCBI (bioproject PRJNA932813) and our pseudo-haploid genome assembly (FASTA file) into NCBI/GenBank (accession GCA_030143305.1). We also uploaded our full, unphased assembly (FASTA file) to Zenodo (DOI: 10.5281/zenodo.7747102).
We received a sufficient amount of data from each SMRT Cell 8M. Our combined yield totaled 3.3 million HiFi CCS reads, composed of approximately 44.5 billion HiFi CCS bases, for an average HiFi insert length of 13.6 kb. We expected this amount of data to provide approximately 15-fold coverage of the A. americanum genome. We subjected the data to a long-read assembly and assessment workflow (Figure 1), starting with some cursory test assemblies using Shasta, Flye, and Hifiasm with default settings on a 10-core 3.7 GHz Xeon workstation containing 224 GB of RAM [12][13][14]. We found that Flye and Hifiasm provided the most BUSCO-complete assemblies using data from just one SMRT Cell 8M (Figure 2).
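The coverage estimate above is simple arithmetic on the rounded totals reported here. The following Python sketch reproduces it; the helper names `fold_coverage` and `mean_read_length` are hypothetical, and the 3 Gb genome size is the flow cytometry estimate cited earlier, not a measured assembly size.

```python
def fold_coverage(total_bases, genome_size):
    """Expected fold coverage: sequenced bases divided by haploid genome size."""
    return total_bases / genome_size


def mean_read_length(total_bases, read_count):
    """Average read length in bases."""
    return total_bases / read_count


hifi_reads = 3.3e6    # combined HiFi CCS reads (rounded, from the text)
hifi_bases = 44.5e9   # combined HiFi CCS bases (rounded, from the text)
genome_size = 3e9     # assumed haploid genome size (~3 Gb estimate)

print(f"coverage: {fold_coverage(hifi_bases, genome_size):.1f}x")
print(f"mean read length: {mean_read_length(hifi_bases, hifi_reads) / 1000:.1f} kb")
```

With these rounded inputs, the coverage works out to a little under 15-fold, and the mean read length lands within about 0.1 kb of the reported 13.6 kb; the small gap is presumably due to rounding in the reported totals.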
Note that we deposited the entry labeled “Cell 1+2:Flye, Purge_Dups (Hap)” in Zenodo and deposited the entry labeled “Cell 1+2:Flye, Purge_Dups (Purged)” at NCBI/GenBank.
In our experience, Flye was the fastest assembler that produced reasonably (> 75%) complete assemblies. Hifiasm produced several assemblies, the largest of which (the unitig assembly) contained the most duplicated BUSCO genes. Flye consumed approximately 1–2 days of processing time for one SMRT Cell’s worth of data and 3–4 days for the combined data from both SMRT Cells. Hifiasm consumed 1–2 days for one SMRT Cell but hadn’t finished processing after three weeks with data from two SMRT Cells. We suspect that Hifiasm might have had trouble with our dataset because the genomic DNA we sequenced came from 50 ticks rather than one individual, which would have been the ideal scenario.
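To compare assemblies by BUSCO completeness and duplication, it can help to parse BUSCO's one-line score string programmatically. The Python sketch below assumes the `C:..%[S:..%,D:..%],F:..%,M:..%,n:..` layout that BUSCO v5 writes to its short summary; other versions may add fields. The function name `parse_busco` and the scores in the example line are made up for illustration and are not results from this pub.

```python
import re

# Pattern for the one-line BUSCO score string (assumed v5-style layout)
BUSCO_LINE = re.compile(
    r"C:(?P<C>[\d.]+)%\[S:(?P<S>[\d.]+)%,D:(?P<D>[\d.]+)%\],"
    r"F:(?P<F>[\d.]+)%,M:(?P<M>[\d.]+)%,n:(?P<n>\d+)"
)


def parse_busco(line):
    """Return BUSCO percentages (C, S, D, F, M) and the marker count n."""
    match = BUSCO_LINE.search(line)
    if match is None:
        raise ValueError("no BUSCO score string found")
    scores = {key: float(value) for key, value in match.groupdict().items()}
    scores["n"] = int(scores["n"])  # n is a count, not a percentage
    return scores


# Hypothetical score line, not taken from our assemblies
example = "C:80.0%[S:70.0%,D:10.0%],F:5.0%,M:15.0%,n:1000"
scores = parse_busco(example)
print(scores["C"] > 75, scores["D"])  # passes a 75% completeness cutoff?
```

Collecting these dictionaries for each assembler and SMRT Cell combination would make it straightforward to rank assemblies by complete (C) and duplicated (D) scores.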
For our initial draft assembly, we chose to move forward with Flye due to its speed, convenience, and simplicity of output. However, the unitig assembly that Hifiasm produced is a bit larger and potentially more information-rich than both the Hifiasm contig assembly and the default Flye assembly. This could lead to higher read-mapping rates for RNA-seq and more complete protein databases for proteomics.
During genome deposition at NCBI/GenBank, the raw assembly that Flye produced triggered a few automated error messages, indicating that our assembly needed some light cleanup. Specifically, we had several contigs of less than 200 nucleotides and several duplicate contigs that we needed to remove. We also had a contig containing an adaptor sequence requiring adaptor excision. We generated a Python-based Jupyter notebook to take care of these issues.
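The three clean-up steps described above (dropping contigs shorter than 200 nucleotides, dropping duplicate contigs, and excising an adaptor) can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not the notebook from our repository: the function names, the `adaptor_spans` argument, and the toy contigs are hypothetical. It detects only exact sequence duplicates (not reverse complements), and it removes an adaptor by deleting the span and joining the flanks, whereas splitting the contig at the adaptor is another reasonable choice.

```python
def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:].split()[0], []
        elif line:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def clean_contigs(records, min_len=200, adaptor_spans=None):
    """Excise adaptor spans, then drop short and exact-duplicate contigs.

    adaptor_spans maps contig name -> (start, end), 0-based, end-exclusive.
    """
    adaptor_spans = adaptor_spans or {}
    seen = set()
    kept = []
    for name, seq in records:
        if name in adaptor_spans:
            start, end = adaptor_spans[name]
            seq = seq[:start] + seq[end:]  # delete the adaptor, join flanks
        if len(seq) < min_len:
            continue  # too short for deposition
        key = seq.upper()
        if key in seen:
            continue  # exact duplicate of an earlier contig
        seen.add(key)
        kept.append((name, seq))
    return kept


# Toy contigs (hypothetical): c2 is too short, c3 duplicates c1,
# and c4 carries a 4-base stand-in "adaptor" at positions 100-103
fasta = [">c1", "A" * 250, ">c2", "ACGT", ">c3", "a" * 250,
         ">c4", "C" * 100 + "GGGG" + "C" * 150]
cleaned = clean_contigs(read_fasta(fasta), adaptor_spans={"c4": (100, 104)})
print([name for name, _ in cleaned])
```

In practice, the adaptor coordinates would come from the contamination report returned during submission rather than being hard-coded.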
The final issue, which a simple Python script could not resolve, was the fact that our assembly was too large compared to NCBI/GenBank estimates. To solve this issue, we used Purge_Dups to split our unphased assembly [15]. This generated a pseudo-haploid assembly which we then cleaned up using our aforementioned Python-based Jupyter notebook. We deposited the resultant assembly at NCBI/GenBank, available under accession GCA_030143305.1. We also deposited our unphased diploid genome into Zenodo for anyone interested in accessing our full dataset.
We’ve assembled an 88% BUSCO-complete long-read pseudo-haploid Amblyomma americanum genome of approximately three gigabases from 50 individual female ticks. It’s available for download and use in GenBank. The unphased diploid genome is also available in a Zenodo repository.
We’ve continued to make refinements to the Amblyomma americanum assembly. We identified sequences from a Coxiella-like endosymbiont in the salivary glands (as seen previously [2]), but haven’t done any work to analyze them beyond that. In a follow-up pub, we identified and removed those and other contaminant contigs from our deposited assembly [1]. We also predicted genes for this assembly; the predicted proteins are now available on GenBank (GCA_030143305.2) and UniProt (UP001321473). In the future, we plan to use this genome in research projects that involve expression profiling and comparative genomics.
Acknowledgements
Thank you to the QB3 Genomics Facility (RRID:SCR_022170) at UC Berkeley for raw DNA quality control, library preparation, and sequencing.