Streamlining genome assembly and QC with the reads2genome workflow
We want to ensure that we assemble high-quality genomes in a reproducible manner. We built a Nextflow workflow, hifi2genome, to assemble PacBio HiFi reads from a single organism and produce quality control statistics for the resulting assembly. The pipeline produces an assembly, the reads mapped back to that assembly, and interactive visualizations reported with MultiQC. The final HTML report covers assembly quality, lineage-specific checks, and mapping statistics, which will help us make more informed decisions about downstream curation and functional annotation efforts.
We built this pipeline using open-source software and tools, and we hope others will shape and extend this resource to fit their needs.
Feel free to provide feedback by commenting in the box at the bottom of this page or by posting about this work on social media. Please make all feedback public so other readers can benefit from the discussion.
Running the commands for assembly and quality control (QC) checks on sequencing data from single organisms is fairly straightforward but repetitive. We want to quickly generate assemblies and their statistics so we can decide whether further curation is needed before moving on to downstream steps.
We’ve developed a computational resource that automates genome assembly and quality control checks from HiFi reads, called hifi2genome.
The hifi2genome Nextflow workflow is available at this GitHub repository (DOI: 10.5281/zenodo.7829602).
The hifi2genome pipeline ingests a sample sheet that lists, for each sample, the sample name and the local path, URL, or URI of the HiFi reads in FASTQ format.
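As a rough illustration, a sample sheet of this kind could be generated with a few lines of Python. The column names `sample` and `fastq`, the sample names, and the path and URL below are assumptions made for this sketch; the repository README is the authority on the exact header the pipeline expects.

```python
import csv

# Hypothetical samples: each row pairs a sample name with a local path,
# URL, or URI pointing to HiFi reads in FASTQ format.
samples = [
    ("ecoli_k12", "/data/hifi/ecoli_k12.fastq.gz"),
    ("ecoli_w", "https://example.org/hifi/ecoli_w.fastq.gz"),
]


def write_samplesheet(rows, path="samplesheet.csv"):
    """Write a two-column CSV; the header names are assumed, not confirmed."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["sample", "fastq"])  # assumed column names
        writer.writerows(rows)
    return path


write_samplesheet(samples)
```

The resulting `samplesheet.csv` has one header row followed by one row per sample.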
Figure 1. Computational steps in the hifi2genome workflow.
Each box represents a process step, and the remaining text describes the inputs and outputs of the pipeline. File types are indicated in parentheses and software in brackets.
The first step in the pipeline runs Flye [1] to assemble the PacBio HiFi reads into contigs (Figure 1). Subsequent steps then run in parallel on the assembly to generate QC statistics. Currently, the checks we include are 1) assembly QC statistics with QUAST [2], 2) lineage-specific QC statistics with BUSCO [3], and 3) mapping statistics generated with SAMtools [4] after mapping the reads back to the assembly with minimap2 [5].
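To make the order of the steps concrete, the sketch below builds the kind of command line each tool might be given for one sample. It only constructs and prints the commands; it does not run them. The exact options, thread counts, and output names that hifi2genome passes to each tool are not stated here, so everything in `qc_commands` (a hypothetical helper) is an assumption, including the `bacteria_odb10` lineage name.

```python
import shlex


def qc_commands(sample, reads, lineage, threads=8):
    """Return one shell command per step, in the order described above."""
    q = shlex.quote
    outdir = f"{sample}_flye"
    assembly = f"{outdir}/assembly.fasta"  # Flye writes assembly.fasta to --out-dir
    bam = f"{sample}.sorted.bam"
    return {
        # 1) Assemble HiFi reads into contigs.
        "flye": f"flye --pacbio-hifi {q(reads)} --out-dir {q(outdir)} --threads {threads}",
        # 2) Assembly QC statistics.
        "quast": f"quast.py {q(assembly)} -o {q(sample + '_quast')} --threads {threads}",
        # 3) Lineage-specific completeness and redundancy.
        "busco": f"busco -i {q(assembly)} -l {q(lineage)} -m genome -o {q(sample + '_busco')} -c {threads}",
        # 4) Map the reads back to the assembly and sort the alignments.
        "map": f"minimap2 -ax map-hifi -t {threads} {q(assembly)} {q(reads)} | samtools sort -o {q(bam)} -",
        # 5) Mapping statistics from the sorted BAM.
        "stats": f"samtools stats {q(bam)} > {q(sample + '.stats.txt')}",
    }


for step, command in qc_commands("ecoli_k12", "ecoli_k12.fastq.gz", "bacteria_odb10").items():
    print(f"{step}: {command}")
```

In the workflow itself, steps 2 through 5 depend only on the assembly and the reads, which is why they can run in parallel once Flye finishes.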
In addition to the sample sheet containing the path to the reads, the only other input the user must provide is the closest BUSCO lineage of the target organism for calculating lineage-specific completeness and redundancy statistics.
The final step of the pipeline aggregates the results from QUAST, BUSCO, samtools stats, and the information about the pipeline run and software versions into an HTML report with MultiQC [6] (Figure 1). MultiQC can generate an HTML report from the log files of numerous bioinformatics programs, and you can use it with or without running a Nextflow pipeline. The MultiQC report currently outputs general information about the assemblies and mapping statistics, as well as more detailed information about each assembly from QUAST, including the distribution of sizes of contigs that were assembled, BUSCO lineage assessment results, and outputs from samtools stats, including percentages of the reads that mapped to each corresponding assembly and alignment metrics.
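For readers who want to check a mapping percentage by hand, the following sketch parses the summary-number (`SN`) lines that samtools stats writes and computes the share of reads that mapped. The helper names and the excerpt are made up for illustration, and MultiQC's own calculation may differ in detail.

```python
def parse_summary_numbers(stats_text):
    """Collect the tab-separated 'SN' lines of samtools stats output."""
    summary = {}
    for line in stats_text.splitlines():
        if not line.startswith("SN\t"):
            continue  # skip comments and the other report sections
        fields = line.split("\t")
        key = fields[1].rstrip(":")
        try:
            summary[key] = float(fields[2])
        except ValueError:
            summary[key] = fields[2]
    return summary


def percent_mapped(summary):
    """Percentage of reads that mapped, or 0.0 if there were no reads."""
    total = summary["raw total sequences"]
    return 100.0 * summary["reads mapped"] / total if total else 0.0


# Made-up excerpt in the layout samtools stats uses for its SN section.
example = (
    "# This excerpt is illustrative, not real pipeline output\n"
    "SN\traw total sequences:\t2000\n"
    "SN\treads mapped:\t1900\n"
    "SN\taverage length:\t12500\n"
)

print(f"{percent_mapped(parse_summary_numbers(example)):.1f}% of reads mapped")
```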
As an example, we ran the pipeline on a publicly available dataset of PacBio HiFi sequencing of 24 microorganisms from the “Food safety and infectious microbes” dataset and generated a MultiQC HTML report with the following command:
nextflow run main.nf --input samplesheet.csv --outdir microbial_hifi_assemblies -profile docker --lineage bacteria
We test the pipeline with continuous integration, using subsampled PacBio HiFi reads of two strains of E. coli from the PacBio HiFi “Food safety and infectious microbes” dataset, to ensure that the workflow executes properly as new features are added.
We are currently deploying all of our Nextflow workflows, including hifi2genome, through Nextflow Tower using our AWS Batch setup [7]. The pipeline is still fully executable locally via the command line and works on diverse compute infrastructure setups.
We found that for organisms with small genomes, such as bacteria and archaea, hifi2genome assembles the reads fairly quickly, and we can run these jobs to completion on interruptible AWS EC2 spot instances. However, for higher-order eukaryotes with larger genomes, like humans and ticks [8], assembly might take multiple days, so we needed to reconfigure the Nextflow Tower queue directive settings so that these assemblies run on on-demand instances and are not interrupted.
This first version of the hifi2genome pipeline is a simple way to assemble PacBio HiFi reads and QC the resulting assembly. In the future, we would like to:
For these efforts, we have created GitHub issues in the hifi2genome GitHub repository and welcome outside suggestions and contributions through pull requests!