
Streamlining genome assembly and QC with the reads2genome workflow

We want to swiftly generate genome assemblies and produce quality control statistics to gauge the need for more curation. We built a Nextflow pipeline that assembles Illumina, Nanopore, or PacBio sequencing reads for a single organism and runs QC checks on the resulting assembly.
Published on Apr 11, 2023

Purpose

We want to ensure that we assemble high-quality genomes in a reproducible manner. We built a Nextflow workflow, reads2genome, to assemble sequencing reads from Illumina, Nanopore, or PacBio HiFi technologies for a single organism and produce quality control statistics for the resulting assembly. The pipeline produces an assembly, mapped reads, and interactive visualizations reported with MultiQC. The final HTML report covers assembly quality, lineage-specific checks, and mapping statistics, which will help us make more informed decisions about downstream curation and functional annotation efforts.

We built this pipeline using open-source software and tools, and we hope others will shape and extend this resource to fit their needs.

Share your thoughts!

Watch a video tutorial on making a PubPub account and commenting. Please feel free to add line-by-line comments anywhere within this text, provide overall feedback by commenting in the box at the bottom of the page, or use the URL for this page in a tweet about this work. Please make all feedback public so other readers can benefit from the discussion.

The resource

The problem

Running the commands for assembly and quality control (QC) checks from sequencing efforts of single organisms can be fairly straightforward but repetitive, depending on the desired outcomes. We want to quickly generate assemblies and resulting statistics to decide if further curation is needed before moving forward with downstream steps.

Our solution

We previously developed the “hifi2genome” workflow (see the earlier version of this pub) for automating genome assembly and QC for PacBio HiFi sequencing efforts. Since releasing that workflow, we’ve expanded our genome sequencing efforts to include Illumina and Nanopore technologies. We therefore developed reads2genome, a computational resource that automates genome assembly and quality control checks for reads from Illumina, Nanopore, or PacBio HiFi technologies.

The reads2genome Nextflow workflow is available at this GitHub repository (DOI: 10.5281/zenodo.8240239).

An overview of the reads2genome workflow

The reads2genome pipeline ingests a sample sheet that lists, for each sample, the sample name and the local path, URL, or URI of the reads in FASTQ format (Figure 1).
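To make the sample sheet concrete, here is a minimal Python sketch that parses and sanity-checks one before a run. The column names `sample` and `reads`, the example paths, and the bucket name are hypothetical placeholders chosen for illustration; check the reads2genome repository for the header the workflow actually expects.

```python
import csv
import io

# File endings we treat as FASTQ in this sketch (an assumption, not the pipeline's own rule).
FASTQ_SUFFIXES = (".fastq", ".fq", ".fastq.gz", ".fq.gz")


def read_sample_sheet(text):
    """Parse sample sheet text and return a list of (sample, reads) pairs.

    Assumes a CSV with hypothetical 'sample' and 'reads' columns.
    """
    rows = []
    # Data rows start on line 2 because line 1 is the header.
    for line_no, row in enumerate(csv.DictReader(io.StringIO(text)), start=2):
        sample = (row.get("sample") or "").strip()
        reads = (row.get("reads") or "").strip()
        if not sample or not reads:
            raise ValueError(f"line {line_no}: sample and reads are both required")
        if not reads.lower().endswith(FASTQ_SUFFIXES):
            raise ValueError(f"line {line_no}: {reads!r} does not look like a FASTQ file")
        rows.append((sample, reads))
    return rows


# Hypothetical sheet mixing a local path and a cloud URI.
example = """sample,reads
ecoli_k12,/data/ecoli_k12.fastq.gz
bsubtilis,s3://example-bucket/bsubtilis.fastq.gz
"""
print(read_sample_sheet(example))
```

Catching a missing path or a non-FASTQ file at this stage is cheaper than discovering it after a cloud job has already been scheduled.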

Figure 1

Computational steps in the reads2genome workflow.

We designed the pipeline to process Illumina, Nanopore, or PacBio HiFi reads from a single organism separately, so researchers cannot currently use reads2genome for hybrid assembly or scaffolding approaches. We made this decision based on our most common internal use cases, which have expanded beyond sequencing single-organism genomes solely with PacBio HiFi (the previous version of this pub was limited to that use case). The user must therefore select the corresponding technology when launching the workflow by passing --platform illumina, --platform nanopore, or --platform pacbio.
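As an illustration of the platform choice, the following Python sketch builds a launch command and rejects any platform other than the three named above. The flags are the ones mentioned in this pub (`--input`, `--outdir`, `--platform`, `--lineage`, `-profile`); combining them in this order, the `launch_command` helper, and the `results` output directory are assumptions made for the example.

```python
import shlex

# The three values the text lists for --platform.
PLATFORMS = ("illumina", "nanopore", "pacbio")


def launch_command(samplesheet, outdir, platform, lineage, profile="docker"):
    """Return the nextflow command as a list of arguments (hypothetical helper)."""
    if platform not in PLATFORMS:
        raise ValueError(f"platform must be one of {PLATFORMS}, got {platform!r}")
    return [
        "nextflow", "run", "main.nf",
        "--input", samplesheet,
        "--outdir", outdir,
        "--platform", platform,
        "--lineage", lineage,
        "-profile", profile,
    ]


cmd = launch_command("samplesheet.csv", "results", "pacbio", "bacteria")
# Print a copy-pasteable shell command; subprocess.run(cmd) would execute it.
print(shlex.join(cmd))
```

Keeping the command as a list rather than a single string avoids shell-quoting problems if a path contains spaces.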

Key functions

Most of the tools downstream of read QC and assembly are the same for all technologies. We describe tools specific to Illumina, Nanopore, or PacBio HiFi technologies below. After inputting reads in FASTQ format, the pipeline performs basic read QC, adapter removal, and assembly, and then maps the reads back to the assembly with minimap2 [1]. Subsequent assembly checks run in parallel and generate QC statistics. Currently, the checks include 1) lineage-specific QC marker checks with BUSCO [2], 2) assembly quality statistics with QUAST [3], and 3) mapping rate stats with samtools stats [4][5].
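One of the contiguity statistics QUAST reports is N50. For readers less familiar with it, here is a minimal Python sketch of the usual definition: the contig length at which contigs of that length or longer cover at least half of the total assembly. This is an illustration for interpreting the report, not QUAST's own implementation, and the contig lengths are made up.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths."""
    if not contig_lengths:
        raise ValueError("need at least one contig length")
    total = sum(contig_lengths)
    running = 0
    # Walk from the longest contig down until half of the assembly is covered.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


# Made-up contig lengths: total is 54, and 10 + 9 + 8 = 27 reaches half of it.
print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))  # 8
```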

In addition to the sample sheet containing the path to the reads and the corresponding --platform selection, the only other input the user must provide is the closest BUSCO lineage of the target organism for calculating lineage-specific completeness and redundancy statistics.

The final step of the pipeline aggregates the results from QUAST, BUSCO, samtools stats, and the information about the pipeline run and software versions into an HTML report with MultiQC [5] (Figure 1). MultiQC can generate an HTML report from the log files of numerous bioinformatics programs, and you can use it with or without running a Nextflow pipeline. The MultiQC report currently outputs general information about the assemblies and mapping statistics, as well as more detailed information about each assembly from QUAST, including the distribution of sizes of contigs that were assembled, BUSCO lineage assessment results, and outputs from samtools stats, including percentages of the reads that mapped to each corresponding assembly and alignment metrics. The resulting MultiQC HTML report is emailed to the end user if SMTP credentials for the pipeline are configured.
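To show where a mapping percentage can come from, here is a small Python sketch that reads the summary-number (`SN`) lines of samtools stats output and computes the share of reads that mapped. The numbers in the example are invented, and the calculation (reads mapped divided by raw total sequences) is our assumption about a reasonable definition rather than a description of MultiQC's internals.

```python
def summary_numbers(stats_text):
    """Collect the tab-separated 'SN' lines of samtools stats output into a dict."""
    summary = {}
    for line in stats_text.splitlines():
        if not line.startswith("SN\t"):
            continue
        fields = line.split("\t")
        key = fields[1].rstrip(":")
        try:
            summary[key] = float(fields[2])
        except (IndexError, ValueError):
            # Skip anything that is not a simple numeric summary value.
            continue
    return summary


def percent_mapped(summary):
    """Percentage of reads mapped, relative to the raw total sequences."""
    total = summary["raw total sequences"]
    if total == 0:
        return 0.0
    return 100.0 * summary["reads mapped"] / total


# Invented excerpt in the layout samtools stats uses for its summary section.
example = "\n".join([
    "# This file was produced by samtools stats",
    "SN\traw total sequences:\t2000",
    "SN\treads mapped:\t1900",
    "SN\treads unmapped:\t100",
])
print(percent_mapped(summary_numbers(example)))  # 95.0
```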

View an example of the MultiQC HTML report from the pipeline below from a run on a publicly available dataset of PacBio HiFi sequencing of 24 microorganisms from the “Food safety and infectious microbes” dataset with nextflow run main.nf --input samplesheet.csv --outdir microbial_hifi_assemblies -profile docker --lineage bacteria:

Download the sample report:


Illumina-specific tools

When the user launches the workflow using --platform illumina, reads2genome filters each set of paired-end reads with fastp [6] and assembles them with SPAdes [7].

Nanopore- and PacBio HiFi-specific tools

When the user launches the workflow using --platform pacbio, reads2genome summarizes the quality stats of the reads with NanoPlot [8] and assembles them with Flye [9] using the --pacbio-hifi flag. When the user launches the workflow using --platform nanopore, it similarly summarizes reads with NanoPlot, but includes an adapter-trimming step with Porechop_ABI [10] before assembling with Flye using the --nano-hq flag. After assembly, the workflow polishes contigs using Medaka with default parameters [11].
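The long-read branch described above differs mainly in which read-type flag is passed to Flye. A minimal Python sketch of that mapping follows; the `flye_command` helper, the file names, and the thread count are hypothetical, and the workflow may pass additional options not shown here.

```python
# Flye read-type flag for each long-read platform, as described in the text.
FLYE_READ_FLAGS = {
    "pacbio": "--pacbio-hifi",
    "nanopore": "--nano-hq",
}


def flye_command(platform, reads, out_dir, threads=4):
    """Return a Flye command line as a list of arguments (hypothetical helper)."""
    try:
        read_flag = FLYE_READ_FLAGS[platform]
    except KeyError:
        # Illumina reads go to SPAdes instead, so they are rejected here.
        raise ValueError(f"Flye is only used for {sorted(FLYE_READ_FLAGS)}") from None
    return ["flye", read_flag, reads, "--out-dir", out_dir, "--threads", str(threads)]


print(" ".join(flye_command("nanopore", "trimmed.fastq.gz", "flye_out")))
```

For Nanopore data, the reads passed in would be the adapter-trimmed output of Porechop_ABI rather than the raw FASTQ.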

Deployment

We deploy the pipeline with continuous integration testing using subsampled reads for each sequencing technology, ensuring proper execution of the workflow as we add new features.

We are currently deploying all of our Nextflow workflows, including reads2genome, through Nextflow Tower using our AWS Batch setup [12]. The pipeline is still fully executable locally via the command line and works on diverse compute infrastructure setups.

We found that for organisms with small genomes, such as bacteria and archaea, reads2genome assembles the reads fairly quickly, so we can run these jobs on interruptible AWS EC2 spot instances and they complete successfully. However, for higher-order eukaryotes with larger genomes, like humans and ticks [13], assemblies might take multiple days, so we needed to reconfigure the Nextflow Tower queue directive settings to run these assemblies on on-demand instances, where they are not interrupted.


Next steps

This version of the reads2genome pipeline is a simple way to assemble reads obtained from a single organism using Illumina, Nanopore, or PacBio HiFi technologies and to provide QC stats for the resulting assembly. In the future, we would like to:

  • Provide the user with the option to use other assembly algorithms (such as Hifiasm) in place of or concurrently with Flye to compare assembly outputs for technologies such as PacBio HiFi.

  • Add an optional endosymbiont detection sub-workflow for pulling out contigs that do not belong to the host genome and are likely symbiont(s) sequences.

  • Configure Medaka to run on GPU instances.

For these efforts, we have created GitHub issues in the reads2genome GitHub repository and welcome outside suggestions and contributions through pull requests!




Comments
Arudhir Singh:

I think a really helpful addition, particularly as you expand this workflow with more assemblers, read preprocessing strategies, etc. would be some relatively objective measure of assembly accuracy.

I have found great utility in using QV scores to measure assembly accuracy: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-02134-9

QV scores are a Phred quality score for assemblies calculated by comparing k-mers from the short reads (accurate) to the assembly itself. It’s reference-free so it’ll work well for non-model organisms.

I’ve found it incredibly helpful as a way to quickly judge and compare different assembly techniques. I like it because it’s a single, highly interpretable number. You can quickly see its utility by comparing the Miniasm results with and without the long read correction. You can also use it to judge whether techniques that should increase accuracy (i.e., short read polishing) are worth the computational time they add on.

Also check out Unicycler for the bacterial assemblies, it’s my favorite hybrid read assembler. At its core it’s a short read SPAdes assembly, long read Miniasm, then a semi-global alignment to use the long read assembly to bridge the scaffolds from the SPAdes. It’s my favorite because it consistently gives the highest QV ;)

Taylor Reiter:

Thank you so much for these ideas! I’ve added the information here as an issue on the reads2genome GitHub repository so that if/when we expand and update this workflow, we include these tools. The QV scores are especially compelling, I hadn’t seen those before but I love the comparison against kmers in short reads, very clever.

https://github.com/Arcadia-Science/reads2genome/issues/16