Published on Apr 11, 2023 by Arcadia Science

Streamlining genome assembly and QC with the reads2genome workflow

We want to swiftly generate genome assemblies and produce quality control statistics to gauge the need for more curation. We built a Nextflow pipeline that assembles Illumina, Nanopore, or PacBio sequencing reads for a single organism and runs QC checks on the resulting assembly.


Purpose

We want to assemble high-quality genomes in a reproducible manner. We built a Nextflow workflow, hifi2genome, to assemble PacBio HiFi reads from a single organism and produce quality control statistics for the resulting assembly. The pipeline produces an assembly, mapped reads, and interactive visualizations reported with MultiQC. The final HTML report covers assembly quality, lineage-specific checks, and mapping statistics, which will help us make more informed decisions about downstream curation and functional annotation efforts.

We built this pipeline using open-source software and tools, and we hope others will shape and extend this resource to fit their needs.

Share your thoughts!

Feel free to provide feedback by commenting in the box at the bottom of this page or by posting about this work on social media. Please make all feedback public so other readers can benefit from the discussion.

The resource

The problem

Running the commands for assembly and quality control (QC) checks from sequencing efforts of single organisms can be fairly straightforward but repetitive, depending on the desired outcomes. We want to quickly generate assemblies and resulting statistics to decide if further curation is needed before moving forward with downstream steps.

Our solution

We’ve developed a computational resource that automates genome assembly and quality control checks from HiFi reads, called hifi2genome.

The hifi2genome Nextflow workflow is available at this GitHub repository (DOI: 10.5281/zenodo.7829602).

An overview of the hifi2genome workflow

The hifi2genome pipeline ingests a sample sheet that includes the sample name and the local path, URL, or URI of the HiFi reads in FASTQ format.
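As a rough illustration of the sample sheet described above, the sketch below writes a two-column CSV with one row per sample. The column names `sample` and `fastq`, the sample names, and the read locations are all hypothetical placeholders; the header the pipeline actually expects is defined in the hifi2genome repository and may differ.

```python
import csv
import io

# Hypothetical column names: check the hifi2genome repository for the real header.
FIELDNAMES = ["sample", "fastq"]


def build_samplesheet(samples):
    """Return CSV text with one row per (sample name, reads location) pair."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES, lineterminator="\n")
    writer.writeheader()
    for name, location in samples:
        writer.writerow({"sample": name, "fastq": location})
    return buffer.getvalue()


# Illustrative entries only: a local path and a URI, as the text allows either.
example_sheet = build_samplesheet(
    [
        ("ecoli_strain_a", "/data/reads/ecoli_strain_a.hifi.fastq.gz"),
        ("ecoli_strain_b", "s3://example-bucket/ecoli_strain_b.hifi.fastq.gz"),
    ]
)
print(example_sheet)
```

Writing the sheet to disk is then a matter of saving `example_sheet` as `samplesheet.csv` and passing that file to the pipeline.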

Figure 1. Computational steps in the hifi2genome workflow. Each box represents a process step, and the other text describes the inputs and outputs of the pipeline. File types are indicated in parentheses and software in brackets.

The first step in the pipeline runs Flye [1] to assemble the PacBio HiFi reads into contigs (Figure 1). Subsequent steps then run in parallel on the assembly to generate QC statistics. Currently, we include three checks: 1) assembly QC statistics with QUAST [2], 2) lineage-specific QC statistics with BUSCO [3], and 3) mapping statistics generated with SAMtools [4] after mapping the reads back to the assembly with minimap2 [5].
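For readers who want to see what these steps look like outside of Nextflow, here is a minimal sketch that builds one plausible command line per step. It is not taken from the pipeline's process definitions: the function name, output layout, thread count, and the exact flags chosen are assumptions, and the lineage value `bacteria_odb10` is only an example. The commands are returned as strings rather than executed.

```python
import shlex


def pipeline_commands(sample, reads, lineage, outdir="results", threads=8):
    """Build illustrative shell commands for assembly and the three QC checks.

    Hypothetical helper: the real pipeline may pass different options.
    """
    base = f"{outdir}/{sample}"
    flye_dir = f"{base}/flye"
    assembly = f"{flye_dir}/assembly.fasta"  # Flye's default assembly file name
    bam = f"{base}/mapping/{sample}.sorted.bam"
    stats = f"{base}/mapping/{sample}.stats.txt"

    flye = ["flye", "--pacbio-hifi", reads, "--out-dir", flye_dir,
            "--threads", str(threads)]
    quast = ["quast.py", assembly, "-o", f"{base}/quast"]
    busco = ["busco", "-i", assembly, "-l", lineage, "-m", "genome",
             "-o", f"{sample}_busco"]
    minimap2 = ["minimap2", "-ax", "map-hifi", "-t", str(threads), assembly, reads]
    sort = ["samtools", "sort", "-o", bam, "-"]
    samtools_stats = ["samtools", "stats", bam]

    return {
        "assemble": shlex.join(flye),
        "quast": shlex.join(quast),
        "busco": shlex.join(busco),
        # Map the reads back to the assembly and sort the alignments in one pipe.
        "map": shlex.join(minimap2) + " | " + shlex.join(sort),
        # Redirect the samtools stats text report to a file.
        "stats": shlex.join(samtools_stats) + " > " + shlex.quote(stats),
    }


commands = pipeline_commands("ecoli_strain_a", "reads.fastq.gz", "bacteria_odb10")
for step, command in commands.items():
    print(f"{step}: {command}")
```

In this sketch, only the assembly step has to finish before the others start; the QUAST, BUSCO, and mapping commands depend on the assembly but not on each other, which is why the workflow can run them in parallel.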

In addition to the sample sheet containing the path to the reads, the only other input the user must provide is the closest BUSCO lineage of the target organism for calculating lineage-specific completeness and redundancy statistics.

The final step of the pipeline aggregates the results from QUAST, BUSCO, and samtools stats, along with information about the pipeline run and software versions, into an HTML report with MultiQC [6] (Figure 1). MultiQC can generate an HTML report from the log files of numerous bioinformatics programs, and you can use it with or without running a Nextflow pipeline. The MultiQC report currently provides general information about the assemblies and mapping statistics, followed by more detailed results for each assembly. These include QUAST results such as the distribution of contig sizes, BUSCO lineage assessment results, and samtools stats output, such as the percentage of reads that mapped to each corresponding assembly and alignment metrics.
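To make the mapping metric concrete, the sketch below parses the summary-number (`SN`) lines of a samtools stats text report and computes the percentage of reads that mapped. MultiQC does this kind of parsing for you, so this is only an illustration of where the number comes from; the function names are hypothetical and the example report contains made-up values.

```python
def parse_summary_numbers(stats_text):
    """Collect the numeric 'SN' (summary number) lines of a samtools stats report."""
    summary = {}
    for line in stats_text.splitlines():
        if not line.startswith("SN\t"):
            continue
        fields = line.split("\t")
        key = fields[1].rstrip(":")
        try:
            summary[key] = float(fields[2])
        except (IndexError, ValueError):
            continue  # skip malformed or non-numeric entries
    return summary


def percent_mapped(summary):
    """Return mapped reads as a percentage of raw total sequences."""
    total = summary["raw total sequences"]
    return 100.0 * summary["reads mapped"] / total if total else 0.0


# Made-up excerpt in the tab-separated layout that samtools stats writes.
EXAMPLE_STATS = (
    "# This file was produced by samtools stats\n"
    "SN\traw total sequences:\t1000\t# excluding supplementary and secondary reads\n"
    "SN\treads mapped:\t950\n"
    "SN\treads unmapped:\t50\n"
)

mapped_pct = percent_mapped(parse_summary_numbers(EXAMPLE_STATS))
print(f"{mapped_pct:.1f}% of reads mapped")
```

A low percentage here can hint at contamination or an incomplete assembly, which is the kind of signal the report is meant to surface before further curation.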

View an example of the MultiQC HTML report below. We generated it by running the pipeline on a publicly available dataset of PacBio HiFi sequencing of 24 microorganisms from the “Food safety and infectious microbes” dataset with nextflow run main.nf --input samplesheet.csv --outdir microbial_hifi_assemblies -profile docker --lineage bacteria.

Download the sample report: hifi2genome_multiqc_report.html

Deployment

We deploy the pipeline with continuous integration testing, using subsampled PacBio HiFi reads of two strains of E. coli from the PacBio HiFi “Food safety and infectious microbes” dataset to ensure that the workflow executes properly as we add new features.

We are currently deploying all of our Nextflow workflows, including hifi2genome, through Nextflow Tower using our AWS Batch setup [7]. The pipeline is still fully executable locally via the command line and works on diverse compute infrastructure setups.

We found that for organisms with small genomes, such as bacteria and archaea, hifi2genome assembles the reads fairly quickly, so we can run these jobs on interruptible AWS EC2 spot instances and they complete successfully. However, for eukaryotes with larger genomes, like humans and ticks [8], assembly might take multiple days. For these, we needed to reconfigure the Nextflow Tower queue directive settings so that the assemblies run on on-demand instances and are not interrupted.


Next steps

This first version of the hifi2genome pipeline is a simple way to assemble PacBio HiFi reads and QC the resulting assembly. In the future, we would like to:

  • Provide the user with the option to use other assembly algorithms (such as Hifiasm) in place of Flye or concurrently to compare assembly outputs.
  • Add an optional endosymbiont detection subworkflow for pulling out contigs that do not belong to the host genome and are likely symbiont(s) sequences.
  • Extend the workflow or apply its methods to Nanopore- and Illumina-based single-organism assembly workflows.

For these efforts, we have created GitHub issues in the hifi2genome GitHub repository and welcome outside suggestions and contributions through pull requests!




Feridun Mert Celebi: Software, Supervision, Validation
Megan L. Hochstrasser: Editing, Visualization
Elizabeth A. McDaniel: Conceptualization, Software, Visualization, Writing
Taylor Reiter: Critical Feedback, Validation
Peter S. Thuy-Boun: Critical Feedback