
Quickly preprocessing and profiling microbial community sequencing data with a Nextflow workflow for metagenomics

We want to seamlessly process and summarize metagenomics data from Illumina or Nanopore technologies. We built a Nextflow workflow that handles common metagenomics tasks and produces useful outputs and intuitive visualizations.
Published on May 26, 2023

Purpose

Metagenomic sequencing of microbial communities can provide evolutionary and ecological insights into uncultivated microbial lineages and their interactions. However, processing metagenomic sequencing data involves several preprocessing steps that can be repetitive and cumbersome. We built a Nextflow workflow, Arcadia-Science/metagenomics, to automate common metagenomics tasks and produce the visualizations and files needed for downstream decision-making. The pipeline produces interactive visualizations in an HTML report generated by MultiQC, along with assemblies, mapped reads, and several output files for assessing the taxonomic and functional composition of samples.

We built this pipeline using open-source software and tools, and we hope others will use and add to the tool to suit their own needs.

Share your thoughts!

Watch a video tutorial on making a PubPub account and commenting. Please feel free to add line-by-line comments anywhere within this text, provide overall feedback by commenting in the box at the bottom of the page, or use the URL for this page in a tweet about this work. Please make all feedback public so other readers can benefit from the discussion.

The strategy

The problem

Extracting valuable insights from metagenomic sequencing data first requires several preprocessing steps that are often repetitive and time-consuming. We want to quickly preprocess metagenomic sequences from either Illumina or Nanopore technologies by performing quality control (QC), assembly, and taxonomic profiling. This information will help us decide whether or not to move forward with particular microbial community samples for more involved downstream analyses and exploration. Although there are numerous existing solutions for processing metagenomics data, we sought a different approach that encourages the user to pause at critical decision points before moving forward with the analysis. 

Our solution

We developed a computational resource that automates QC, assembly, mapping, taxonomic profiling, and functional prediction from raw metagenomic reads obtained through either Illumina or Nanopore technologies. This resource is a Nextflow [2] workflow, named Arcadia-Science/metagenomics. We built this Nextflow workflow with a custom template based on the nf-core template [3].

The Arcadia-Science/metagenomics pipeline is available in this GitHub repository (DOI: 10.5281/zenodo.7972166).

The resource

An overview of the metagenomics workflow

The metagenomics pipeline ingests a sample sheet that includes the sample name and the local path, URL, or URI of either paired-end Illumina reads or Nanopore reads in FASTQ format (Figure 1).
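As a rough illustration of what such a sample sheet might look like, the sketch below parses and sanity-checks one before launch. The column names (`sample`, `fastq_1`, `fastq_2`, `fastq`), the sample names, and the paths are all hypothetical placeholders chosen for this example; the pipeline's repository documents the header it actually expects.

```python
import csv
import io

# Hypothetical headers for illustration only; the real pipeline may differ.
EXPECTED_COLUMNS = {
    "illumina": ["sample", "fastq_1", "fastq_2"],
    "nanopore": ["sample", "fastq"],
}
FASTQ_SUFFIXES = (".fastq", ".fastq.gz", ".fq", ".fq.gz")


def validate_sample_sheet(text, platform):
    """Parse CSV text and return its rows, raising ValueError if malformed."""
    if platform not in EXPECTED_COLUMNS:
        raise ValueError(f"unknown platform: {platform}")
    expected = EXPECTED_COLUMNS[platform]
    reader = csv.DictReader(io.StringIO(text))
    if reader.fieldnames != expected:
        raise ValueError(f"expected header {expected}, got {reader.fieldnames}")
    rows = list(reader)
    seen = set()
    for row in rows:
        name = row["sample"]
        if name in seen:
            raise ValueError(f"duplicate sample name: {name}")
        seen.add(name)
        # Each read column may hold a local path, URL, or URI to a FASTQ file.
        for column in expected[1:]:
            value = row[column] or ""
            if not value.endswith(FASTQ_SUFFIXES):
                raise ValueError(f"{name}: {column} does not look like FASTQ")
    return rows


# Made-up sample names and locations, for illustration only.
example = """sample,fastq_1,fastq_2
sample_A,reads/sample_A_R1.fastq.gz,reads/sample_A_R2.fastq.gz
sample_B,s3://example-bucket/sample_B_R1.fastq.gz,s3://example-bucket/sample_B_R2.fastq.gz
"""
rows = validate_sample_sheet(example, "illumina")
print(len(rows), "samples passed validation")
```

Checking the sheet up front is cheap compared with discovering a bad path after a long assembly job has been scheduled.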

Figure 1

An overview of the steps in the metagenomics workflow.

Users provide FASTQ reads from either Illumina or Nanopore technologies in a CSV sample sheet as the input to the workflow. Note that tools that apply only to either Illumina or Nanopore data are highlighted in different colors. The main parts of the workflow encompass common preprocessing steps, QC checks, and taxonomic/functional profiling.

We designed the pipeline to separately process Illumina or Nanopore metagenomic samples, and therefore this pipeline cannot be used to scaffold Illumina assemblies with Nanopore reads or polish Nanopore assemblies with Illumina reads. We made this decision based on our most common internal use case, where we need to separately process Illumina or Nanopore metagenomic experiments. Additionally, recent updates to Nanopore sequencing chemistries have improved the accuracy of reads and no longer necessarily require polishing with corresponding Illumina reads [4]. Therefore, the user has to select --platform illumina or --platform nanopore when launching the workflow. 
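To make the platform choice concrete, here is a minimal sketch that assembles a launch command as an argument list without running it. Only `--platform` comes from the text above. The `--input` and `--outdir` parameter names follow the common nf-core convention and are assumptions here, as are the file and directory names.

```python
import shlex


def build_launch_command(platform, sample_sheet, outdir, profile="docker"):
    """Return the argv list for a hypothetical launch of the workflow."""
    if platform not in ("illumina", "nanopore"):
        raise ValueError("platform must be 'illumina' or 'nanopore'")
    return [
        "nextflow", "run", "Arcadia-Science/metagenomics",
        "-profile", profile,        # Nextflow's own option (single dash)
        "--platform", platform,     # documented pipeline parameter
        "--input", sample_sheet,    # ASSUMED parameter name (nf-core style)
        "--outdir", outdir,         # ASSUMED parameter name (nf-core style)
    ]


command = build_launch_command("nanopore", "samplesheet.csv", "results")
print(shlex.join(command))
```

Because the two platforms are processed separately, a project with both data types would need two launches, each with its own sample sheet.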

Key functions

After ingesting either Illumina or Nanopore reads, the pipeline performs basic read QC and adapter removal, assembles the reads, maps the reads back to the assembly, and then reports statistics and information about the workflow run in an HTML file produced by MultiQC [5]. Tools specific to either the Illumina or Nanopore workflows are listed below in their respective sections.

Both workflows report assembly QC statistics with QUAST [6], summarize mapping rates and alignment statistics with samtools stats [7], generate coverage statistics with the jgi_summarize_bam_contig_depths program from MetaBAT2 [9], predict open reading frames and proteins from the assemblies with Prodigal [8], and compare these proteins to the UniProt UniRef90 database using DIAMOND [10].

The pipeline also launches a series of sourmash commands to produce signatures, compare the signatures against each other, compare the signatures to reference databases, and produce taxonomy summaries based on hits to reference databases [11]. For each sample, we apply these commands to both the reads and assemblies to produce files amenable to comparing samples against each other and exploring taxonomic compositions. We chose to implement sourmash in particular for generating taxonomic summaries due to the ability to rapidly search any sequence input against reference databases [12]. Additionally, we recently created an R package, sourmashconsumr [13], for working with the output files from sourmash and generating intuitive figures (see Figure 2 and Figure 3 below for examples).
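The idea behind signatures and their comparison can be sketched in a few lines. The toy code below is not sourmash's implementation: it uses a different hash function, ignores reverse complements, and uses tiny k-mer and scaling values so the example stays readable. It only illustrates the general approach of keeping a fixed fraction of k-mer hashes and comparing the resulting sets.

```python
import hashlib


def kmers(sequence, k):
    """Yield every overlapping k-mer in the sequence."""
    for i in range(len(sequence) - k + 1):
        yield sequence[i:i + k]


def sketch(sequence, k=5, scaled=1):
    """Keep hashes falling in the lowest 1/scaled of the 64-bit hash space."""
    max_hash = 2**64 // scaled
    kept = set()
    for kmer in kmers(sequence.upper(), k):
        digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
        value = int.from_bytes(digest, "big")
        if value < max_hash:
            kept.add(value)
    return kept


def jaccard(a, b):
    """Similarity of two sketches: shared hashes over all hashes."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def containment(query, reference):
    """Fraction of the query sketch that is found in the reference sketch."""
    return len(query & reference) / len(query) if query else 0.0


# Toy sequences invented for this example.
reference = sketch("ACGTTGCAAGGCTTAGC")
query = sketch("GTTGCAAGG")
print("jaccard:", jaccard(query, reference))
print("containment:", containment(query, reference))
```

Jaccard similarity suits sample-to-sample comparisons, while containment suits asking how much of a sample is explained by a reference, which is closer to what searching reads or contigs against a database needs.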

Illumina-specific processing 

When a user launches the Illumina workflow using --platform illumina, it filters each set of paired-end reads with fastp [14] and individually assembles them using SPAdes with the metagenomic option [15]. It then maps the corresponding reads back to each assembly using Bowtie2 [16].

See an example MultiQC report that this workflow generated from short-read Illumina data:

Nanopore-specific processing

When the user launches the Nanopore workflow using --platform nanopore, it summarizes each set of Nanopore reads in FASTQ format using NanoPlot [17], removes adapters using Porechop_ABI [18], and assembles each sample individually with Flye [19] and its --nano-hq option. The workflow then polishes assemblies using Medaka with default parameters [20] and maps reads back to each assembly using minimap2 [21].

Here’s a sample MultiQC report that the workflow output from long-read Nanopore data:

Deployment

We deploy the pipeline with continuous integration testing using subsampled metagenomic reads from Illumina and Nanopore sequencing efforts of cheese rind microbiomes from our “Paired long- and short-read metagenomics of cheese rind microbial communities at multiple time points” dataset [1]. This ensures that the workflow executes properly as we add new features over time. The pipeline can be run with conda, Docker, or Singularity, although we highly recommend using Docker when possible.

We are currently deploying all of our Nextflow workflows, including metagenomics, through Nextflow Tower using our AWS Batch setup [22]. The pipeline is still fully executable locally via the command line and works on diverse compute infrastructure setups.

For most steps in the workflow, we can take advantage of AWS EC2 spot instances to save cost. However, we found that for long-running jobs such as metagenomic assembly and Nanopore polishing, we needed to modify the workflow to run these processes via on-demand instances so they wouldn’t be interrupted. We configured this through setting up queue directives in Tower so that all processes except assembly and polishing will run on AWS EC2 spot instances. 

Example taxonomic insights from outputs of the workflow

Figure 2

Proportion of classified sequences in Illumina and Nanopore assemblies.

We compared our assembled contigs to several “cover databases” (simplified databases that contain each k-mer only once) and used the R package sourmashconsumr to depict the proportion of sequences that sourmash could classify in each database or that remain unclassified for (A) Illumina or (B) Nanopore assemblies. X-axis labels are abbreviated cheese sample names and aging durations (W = week; M = month).

In addition to insights from the MultiQC HTML report, we can use files generated from different sourmash subcommands to quickly inspect metagenomic reads and assemblies. The sourmashconsumr R package provides parsing, visualization, and analysis functions for working with the output files of sourmash. Below, we give two examples of how the outputs of sourmash gather and sourmash taxonomy summarize the proportion of sequences in either unassembled reads or assembled contigs that are assigned taxonomy based on comparison to a database (Figure 2) and the breakdown of those taxonomic classifications (Figure 3) using data from a prior cheese metagenomics study [1].
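The published figures were made in R with sourmashconsumr. Purely to show the arithmetic behind a "proportion classified" summary, the Python sketch below sums the per-match fractions in a gather-style CSV. It assumes the CSV has an `f_unique_to_query` column giving the fraction of the query uniquely assigned to each match; the rows and genome names are invented for illustration and are not results from this work.

```python
import csv
import io


def classified_fraction(gather_csv_text):
    """Sum the fraction of the query assigned to each non-overlapping match."""
    reader = csv.DictReader(io.StringIO(gather_csv_text))
    return sum(float(row["f_unique_to_query"]) for row in reader)


# Invented rows, for illustration only.
toy_gather_output = """name,f_unique_to_query
genome_A,0.5
genome_B,0.25
genome_C,0.125
"""

classified = classified_fraction(toy_gather_output)
unclassified = 1.0 - classified
print(f"classified: {classified:.3f}, unclassified: {unclassified:.3f}")
```

Because each match is credited only with the part of the query not already explained by earlier matches, the fractions can be added without double counting.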

The code we used for this taxonomic analysis and the resulting figures is available in this GitHub repository (DOI: 10.5281/zenodo.7972177), and the associated data is on Zenodo (DOI: 10.5281/zenodo.7968234).

Figure 3

Breakdown of classified sequences in all Nanopore assemblies.

We ran sourmash gather on all input sample metagenomic reads and assembled contigs against “cover databases” of archaea, bacteria, viruses, fungi, invertebrates, plants, protozoa, and vertebrates available in GenBank using a k-mer size of 31 (Figure 2). Cover databases “cover,” or contain, each k-mer in the full database only once [23]. To build a cover database, sourmash sequentially examines each sketch and retains only the hashes that have not been previously observed. This reduces the total database size, which in turn reduces search time and memory use. In practice, cover databases decreased RAM use by an order of magnitude (~124 GB to ~12 GB) and halved runtimes. Although cover databases are less computationally intensive, strain-level assignments made with them are likely inaccurate, so it might be necessary to summarize one level up in taxonomy (i.e., to species).
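The construction described above can be sketched with plain sets. This is a simplified illustration rather than sourmash's code: sketches are represented as sets of small integers, the genome names are hypothetical, and dropping sketches that end up empty is a choice made for this example.

```python
def build_cover(sketches):
    """Keep, for each sketch in order, only hashes not seen in earlier ones.

    sketches: list of (name, set_of_hashes) pairs in database order.
    """
    seen = set()
    cover = []
    for name, hashes in sketches:
        novel = hashes - seen
        if novel:  # assumption: sketches with nothing new are dropped
            cover.append((name, novel))
        seen |= novel
    return cover


# Toy database with overlapping content, invented for illustration.
database = [
    ("genome_A", {1, 2, 3, 4}),
    ("genome_B", {3, 4, 5}),   # close relative of genome_A
    ("genome_C", {1, 5}),      # fully covered by the first two
]

cover = build_cover(database)
original_size = sum(len(hashes) for _, hashes in database)
cover_size = sum(len(hashes) for _, hashes in cover)
print(cover)
print(f"hashes stored: {original_size} -> {cover_size}")
```

In this toy case genome_B keeps a single hash, so a query that shares hashes 3 and 4 with it would be attributed to genome_A instead. That order dependence among close relatives is one way to see why strain-level assignments become unreliable with cover databases.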

We compared our assembled contigs to several “cover databases” (simplified databases that contain each k-mer only once) and used the R package sourmashconsumr to depict the proportion of contigs that sourmash could classify in each database or that remain unclassified for (A) Illumina or (B) Nanopore assemblies.

Additional methods

We used ChatGPT to suggest wording ideas and then edited the AI-generated text.

Next steps

The first version of the metagenomics workflow performs common preprocessing tasks that are necessary for downstream steps and analyses of Illumina and Nanopore metagenomic samples. In the future, we would like to: 

  • Add support for reciprocal mapping of all reads to all assemblies for time-series metagenomics experiments.

  • Add support for preprocessing PacBio HiFi metagenomic reads.

  • Detect mobile elements such as plasmids and diverse phages beyond those contained in GenBank databases.

  • Automate sourmashconsumr reports for comparing samples and taxonomy summaries.

  • Build subsequent workflows for binning metagenomic contigs, multi-omics layering, etc.

For some of these efforts, we have created GitHub issues in the metagenomics workflow GitHub repository and welcome outside suggestions and contributions through pull requests!



  • Contributors
    (A–Z)

    • Adair L. Borges

      • Critical Feedback

    • Feridun Mert Celebi

      • Critical Feedback, Validation

    • Rachel J. Dutton

      • Supervision

    • Megan L. Hochstrasser

      • Editing, Visualization

    • Elizabeth A. McDaniel

      • Conceptualization, Formal Analysis, Software, Visualization, Writing

    • Manon Morin

      • Critical Feedback

    • Taylor Reiter

      • Critical Feedback, Validation

    • Emily C.P. Weiss

      • Critical Feedback
