This workflow lets you find potential circular DNA in your organism of interest using short-read, whole-genome sequencing data and a reference genome. We applied it to parasitoid wasps and some other parasites and found putative circular DNA.
We developed a computational method to identify circular DNA using short-read DNA sequencing data and reference genomes. We previously identified capsid-like proteins in some venomous and parasitic organisms [1]. Inspired by this work, we wanted to search across a broad range of parasitic organisms for circularized (and thus potentially packaged) DNA cargo that parasites might deliver to their hosts.
We figured that by mapping paired short reads to reference genomes and searching for unusually large apparent distances between them, we could find putative circular DNA. To test this approach, we used our workflow to find putative packaged circular DNA in parasitoid wasps and then applied it to a set of species that includes human-associated parasites. We identified clear patterns of large mapped distances and high coverage in parasitoid wasps. We also found putative circular DNA regions of interest in many human-associated parasite species in our example dataset, showcasing a use case for the workflow.
This method should be broadly applicable for circular DNA searches across organisms using standard short-read sequencing libraries, providing a fully computational, simple way to work with these data. It can also be a supplementary approach to current wet-lab sample processing methods, which often require time-consuming sequencing library enrichment steps. We hope that researchers looking for circular DNA in any organism will be able to apply this workflow as an early screening step.
The Nextflow pipeline, Python tools, and example use cases are in this GitHub repository.
Data for our two example results are available here.
Share your thoughts!
Watch a video tutorial on making a PubPub account and commenting. Feel free to add line-by-line comments anywhere within this text, provide overall feedback by commenting in the box at the bottom of the page, or post about this work on social media. Please make all feedback public so other readers can benefit from the discussion.
We’ve put this effort on ice! 🧊
#TranslationalMismatch #DeadEnd
We found interesting patterns suggesting that several parasite species circularize double-stranded DNA cargo for packaging into viral-like particles, but multiple barriers prevent us from pursuing this further. Disentangling putative circular DNA from false positives requires time-intensive manual checks, and validating if and how organisms of interest deliver DNA to their hosts would require significant additional research. In combination with the smaller market opportunity for dsDNA delivery modalities as compared to established areas like RNA delivery, these hurdles led us to discontinue this project.
Learn more about the Icebox and the different reasons we ice projects.
Background and goals
Parasitoid wasps deliver dsDNA-encoded virulence factors genes to their insect and arachnid hosts using endogenized viruses [2]. In past work, we looked across venomous species to see if they have endogenized viral capsids that may let them deliver cargo to the organisms they’re biting [1], finding putative capsids across several parasitic species. However, we didn’t have a clear way to identify the cargo these species deliver, if any. Because we were most interested in finding novel nucleic acid delivery systems for gene therapy applications and the parasitoid wasps that inspired this work circularize DNA cargo to package into capsids, we developed a method to identify circular DNA in sequencing data. We imagine that other organisms might use a similar approach to deliver genes to their hosts, making circularized DNA a potential hallmark of this host manipulation strategy.
We realized that our method might be broadly useful for researchers interested in exploring circular DNA in their organisms of interest. In this pub, we’re sharing the workflow we used to search for circular DNA in short-read DNA sequencing data and provide two example datasets where we’ve applied it.
The problem
We needed a way to search for circular DNA cargo that parasitic organisms might deliver to their hosts. In earlier work identifying DNA cargo in parasitoid wasps, researchers filtered, purified, and sequenced virus-like particles containing the cargo, a time-consuming and tedious process [3]. We wanted to take a computational approach that would let us search for potential DNA cargo across all sorts of parasitic organisms.
Several computational tools and workflows to detect circular DNA — in particular, extrachromosomal circular DNA (eccDNA) — already exist, like Circle-Map [4], ecc_finder [5], circlehunter [6], circdna [7], and eccDNA-pipe [8]. However, these tools primarily focus on human circular DNA and aren’t as well-suited for finding circular DNA across organisms, partially due to their sample preparation preferences. Some of these tools rely on samples pre-enriched for circular DNA (e.g., Circle-seq [9], Circulome-seq [10]), or are built specifically for long-read [11] or ATAC-seq [12] data. Generating these kinds of data is time-consuming and far less common for non-model organisms, so we wanted to create a method that works with short-read sequencing data and allows for rapid data exploration.
Our strategy
We developed a workflow that lets scientists explore short-read sequencing data for putative circular DNA without needing special library preparation or sequencing methods.
We split this approach into two distinct steps:
Finding circular DNA: A Nextflow pipeline downloads short-read DNA sequencing data and reference genomes, maps the reads, and extracts mapped reads with insert sizes > 1 kb. Additional files marking regions of high coverage depth are produced.
Learning about each circular DNA sequence: Python code directly takes the workflow's output and parses it into filtered mapped reads, coverage depth data, and gene annotation data that you can use to investigate the putative circular DNA segments.
DNA can be circularized at specific junctions, and some forward and reverse reads from a read pair might span the junction. When those reads are mapped back to the linear genome, the distance between the paired reads will be much larger than the expected distance for paired short reads. We chose our approach to rapidly scan for a signal of larger-than-expected insert sizes, with Python functions for further downstream exploration of coverage and gene annotations.
The method
We developed an approach to systematically identify positions in eukaryote genomes where paired short reads have a consistently larger mapped distance than expected, a hallmark of circularized DNA. This approach is usable as a Nextflow pipeline, and we’ve also created some Python tools to explore the putative circularized DNA.
Overarching strategy
We thought we might be able to identify circularized DNA by examining mapped reads for unusually large mapped distances (i.e., the insert sizes between the forward and reverse reads when mapped to the reference genome) between the forward and reverse reads of standard short-read DNA sequencing libraries (Figure 1). Typically, when preparing short-read DNA libraries, at least one size selection step ensures that sequenced segments are mostly between ~200–600 bp, centered around ~300 bp (Figure 1, step 1). However, if any reads straddle a recombination site in circular DNA, mapping those reads back to the linear genome will result in an apparent mapped distance of > 600 bp (Figure 1, step 2). In these cases, the mapped length will correspond to approximately the size of the dsDNA circle. Additionally, circularized DNA is generally present at a higher copy number than linear chromosomal DNA, which could appear as high coverage depth in read-mapping results [9]. We hypothesized that circles would be detectable through irregular distance distributions from reads mapped to the genome, and those segments would also likely have higher coverage depth than surrounding DNA (Figure 1, step 3).
To summarize, in scaffolds that don’t produce circular DNA, we'd expect a power law distribution (many read pairs with small mapped distances and relatively few with large mapped distances); in scaffolds that do, we'd expect peaks in the distribution at mapped distances that correspond to the length of the circular segment. This turns out to work fairly well, especially for organisms known to produce circular DNA (jump to “Example results…” to see the method in action). We’ve formalized the read-mapping approach and downstream filtering steps for coverage depth and annotation filtering into a Nextflow pipeline, which we deployed on Nextflow Tower using AWS Batch spot EC2 instances.
Mapping short reads to find circularized dsDNA
We structured the workflow to take in a sample sheet of reference genomes and corresponding short-read sequencing experiment accessions, structured as a three-column CSV file with a genome accession, the NCBI FTP path, and SRA run accession per line (Figure 2). It handles downloading all genomes and short-read sequencing files with wget or fasterq-dump from the SRA toolkit (version 3.07 [13]) and filtering reads with fastp (version 0.23.4 [14]). Next, the workflow maps short reads against the corresponding genome using minimap2 (version 2.28-r1209 [15]) and converts from SAM files into sorted BAM files with SAMtools (version 1.20 [16]). An awk command filters mapped read pairs to only those with mapped distances ≥ 1 kb. Coverage depth is calculated across every position using samtools depth, average coverage depth is calculated per scaffold, and positions on each scaffold with coverage depth ≥ 100× the average scaffold coverage depth are identified using awk. Then, using BEDTools (version 2.31.1 [17]), we merged positions within 100 bases of each other to pinpoint regions of extremely high coverage depth.
The workflow outputs several files usable for downstream analysis:
*.sorted.bam: A sorted BAM file of all mapped reads
*.large_inserts.bam: A filtered BAM file of only reads with mapped distances ≥ 1 kb
*.coverage.txt: A tab-delimited file of coverage depth for each position in the genome
*.average_coverage.txt: A tab-delimited file of per-scaffold average coverage depth
*.high_coverage_regions.txt: A tab-delimited file of positions with coverage depth ≥ 100× average scaffold coverage. Also available in BED format
*.high_coverage_region_sizes.txt: A tab-delimited file of high-coverage depth regions (≥ 100× average scaffold coverage depth, with positions within 100 bp merged into a region), with columns corresponding to scaffold name, start position, end position, and total region size (bp)
*.filtered_coverage.txt: A tab-delimited file of coverage depth only at positions where we identified mapped distances ≥ 1 kb
Considerations for applying the workflow
You may want to consider a more stringent short-read mapping algorithm like BWA [18], especially if you’re working with human data, since minimap2 [15] isn’t as sensitive to small variants and deals with repetitive sequence alignment differently. We used minimap2 primarily for its low-memory overhead and ease of use. Our Nextflow workflow is modular, so it’s pretty easy to swap in different programs based on your specific use case.
Importantly, if you suspect your circular DNA comes from multiple different segments or chromosomes of a genome (like in cancer-associated extrachromosomal circular DNAs), this tool won't accurately detect those sequences. We wouldn’t recommend using our method in those situations!
Parsing the mapped read outputs
When organisms don’t have high-quality genomes, the number of scaffolds and inserts this approach identifies is large. We decided to use the Pareto principle to filter scaffolds and inserts for a given genome of interest. The Pareto principle suggests that only a few contributors cause many outputs. In our case, we filter scaffolds to only those contributing to ≥ 80% of the mapped distance data. We summarize all mapped distances (rounded to the nearest kilobase and filtered below a maximum size threshold) and their positions (also rounded to the nearest kilobase) from the filtered scaffolds. We count the number of inserts per position and calculate z-scores to identify statistical outliers. In this case, statistical outliers are regions with many inserts at a given position. Additionally, we extract positions within 10 kb of the inserts. We use the outlier data and extended range to filter each genome's associated coverage depth and annotation data (if present) for easier downstream analysis.
This approach is flexible and usable as the GenomeInfo Python class, which uses the Polars package to efficiently parse large datasets. Users must provide the large insert BAM file and filtered coverage file as input to GenomeInfo, and gene annotations can optionally be provided in GFF format. Methods in the function will automatically load and parse the mapped distance data (.load_bam()), the coverage depth data (.load_coverage()), and the annotation data (.load_gff()) if provided. Users can provide a list of scaffold names to examine instead of relying on the Pareto principle for filtering or can adjust the Pareto cutoff from its default value of 0.8. Users can then:
Generate the mapped distance summary with a provided maximum mapped distance threshold to filter with (.generate_insert_summary(maximum_size_threshold = 100000))
Generate the extended range surrounding inserts with a user-provided base pair width to extend (.generate_insert_range(bp_width = 20000))
Deduplicate the insert data (.deduplicate_insert_summary(z_score_threshold = 3))
Finally, users can generate filtered coverage depth and annotation files using .filter_coverage() and .filter_gff(). We hope this framework is usable for genomes and short-read datasets from various organisms.
To identify proviral and non-proviral chromosomes in Microplitis demolitor, we downloaded and searched the latest genome assembly’s (iyMicDemo2.1a) annotations with the proviral genes annotated in Burke et al., 2018 [19]. For Hyposoter didymator, we filtered scaffolds for visualization to those with > 10,000 inserts since the genome was more fragmented. To visualize the example results, we used the R data.table (version 1.15.4 [20]) package for some preprocessing and the ggplot2 package (version 3.5.0 [21]) for visualization. We used the arcadiathemeR package (version 0.1.1) [22] to style our visualizations.
We used ChatGPT to help write and clean up code. We also used GitHub Copilot to help write code. Additionally, we used Grammarly Premium to reformat text according to a style guide, streamline and clarify text that we wrote, and suggest wording ideas from which we chose small phrases or sentence structures to use.
Example results from the method
We validated our read-mapping method in parasitoid wasps, organisms that we know deliver circular DNA to their hosts. We then applied the method to some human-associated parasites to demonstrate a use case for species that users of the workflow might be interested in.
Validating detection of circularized DNA in parasitoid wasps
We wanted to test our read-mapping method with organisms known to make circularized DNA. Parasitoid wasps in the Braconidae and Ichneumonidae families have co-opted viral machinery to manipulate their insect hosts [2]. They use integrated viral machinery from polydnaviruses to package circular double-stranded DNA inside of virus-like particles, which they then inject into hosts alongside their eggs. The genetic material inside the virus-like particles compromises host immune responses and significantly increases juvenile survival.
To make and circularize DNA, specific regions of the wasp genome known as “proviral regions” are massively amplified from the wasp genome. Within proviral regions, distinct replication units are individually amplified. Segments in the replication units are then excised, processed by integrase-mediated recombination to produce circularized segments, and packaged [23]. In parasitoid wasps, it’s relatively easy to identify replication units by examining mapped reads for regions of extraordinarily high coverage depth (> 100–20,000× the average coverage depth of the wasp genome) [24][25], which are signals of amplification. We developed a method to identify the segments within those replication units that are excised, circularized, and packaged, as described in the prior section. By looking at mapped distance distributions, we hypothesized that we could find unusually large distances in segments within regions of high coverage depth, a pattern indicative of circular DNA.
We analyzed four parasitoid wasp species to check that a distance pattern was only present in wasps that produce circular DNA. Only female wasps of post-reproductive maturity create circular DNA and virus-like particles, so we ran multiple samples per species (when available) to account for samples prepared from males, females, or mixtures.
We first focused on Microplitis demolitor, a well-studied braconid wasp with a high-quality genome (iyMicDemo2.1a) and male- and female-only short-read libraries (SRR1565751, SRR2011474), to validate our approach. Not every M. demolitor chromosome has proviral segments, so we expected to find peaks in the mapped distance distributions only in the chromosomes with those segments. Moreover, we wanted to make sure that the read mapping approach was identifying circularized DNA, not just proviral segments. Male wasps have the proviral segments in their genome but don’t circularize and package it; consequently, only mapped reads from female samples should show distance distribution peaks, while distributions from males should look similar to the non-proviral chromosome distance distributions.
Because we were able to use known M. demolitor bracovirus genes to identify proviral chromosomes, we didn't filter scaffolds using a Pareto cutoff since this would filter out the non-proviral chromosomes (see “Parsing the mapped read outputs” for more on this cutoff). In female wasps, we observed that chromosomes with known proviral segments had noticeable peaks in mapped distance distributions (Figure 3, A). The peaks are dissimilar across proviral chromosomes, matching the size of circularized DNA from different proviral segments. Except for one peak (corresponding to an unannotated region), we don’t see similar mapped distance distributions in mapped reads from male wasps. Overall, these results suggest we can indeed identify dsDNA circles in parasitoid wasps using this read-mapping approach.
Next, we wanted to verify that the peaks in mapped distance occurred within regions of high coverage. We merged M. demolitor’s read mapping data with the coverage depth data by chromosome position and examined if large distance peaks occurred at high coverage (Figure 3, B). We observe that high coverage corresponds with peaks in mapped distances, further supporting our ability to identify circular DNA from short-read sequencing libraries. Additionally, coverage depth differs across segments of large mapped distance, supporting the pattern of non-equimolar abundance of unique dsDNA circles found in parasitoid wasp virus-like particles [26].
To test our approach with a set of controls, we examined libraries from three other parasitoid wasps. As a negative control (no expectation of large mapped distances), we used a sample of Cotesia congregata male wasps, which has a similar bracovirus to that of Microplitis demolitor and well-defined replication units [27]. As a second negative control, we used a library from Venturia canescens, an ichneumonid wasp that has more recently acquired an ichnovirus that incorporates virulence proteins in its virus-like particles rather than DNA [28]. Finally, to test if we could find proviral sequences in an ichneumonid wasp, we looked at two female libraries from Hyposoter didymator, which has a dsDNA-encoding ichnovirus [29] and should exhibit the large mapped distance distribution pattern.
Our results were confirmatory: we found a clear mapped distance and coverage pattern in two short-read libraries from H. didymator (Figure 4) but no clear signature of circular DNA in either of the other species we'd included as negative controls. Since the latest version of H. didymator’s genome didn’t have publicly available annotations, we couldn’t confirm if the inserts we detected were proviral by leveraging existing annotation data as we’d done with M. demolitor. Instead, we first extracted the sequence from the most common mapped distance and position from each scaffold (12 total), plus 2 kb in each direction to capture flanking segments. For each, we then manually performed a BLASTx search against the NR database (June 2024). The top hits for 10 of the 12 segments were Hyposoter ichnovirus virulence-associated proteins (nine H. didymator, one H. fugitivus) and the other two homologs of another parasitoid wasp’s proteins (Table 1). These results are promising for detecting genes within inserts — even though H. didymator’s ichnovirus has been described by sequencing its virus-like particles [30], we were able to identify ichnovirus genes just by looking for locations with large inserts. In total, these results indicate that we can reliably identify circular DNA from parasitoid wasps using this read-mapping strategy.
Hyposoter didymator scaffold
BLASTx hit description
Hit scientific name
E-value
Percent Identity
Accession length
JBAJMU01000001.1
Vinx1
Hyposoter didymator ichnovirus
0.0
96.02
377
JBAJMU01000002.1
Repeat element protein-d7.3
Ichnoviriform fugitivi
6e−110
76.57
240
JBAJMU01000003.1
Vinx1
Hyposoter didymator ichnovirus
0.0
99.73
376
JBAJMU01000004.1
Vacuolar protein sorting-associated protein 18 homolog isoform X2
Table 1. Top BLASTx hits from the most common insert size and position of each Hyposoter didymator scaffold.
Searching for circularized DNA in human-associated parasites and related species
We wanted our method to be broadly helpful in finding and exploring circular DNA in diverse organisms, so we tested this by applying the Nextflow pipeline to 58 samples from 29 human-associated parasites and related species (12 Trichinella species, 6 tick species, and 11 other species, including kissing bugs, parasitic flies, non-parasitic flies, and tapeworms). The sample sheet we used for this example is provided on GitHub. We used multiple samples per species when possible to account for any differences in sex or reproductive maturity, even though we weren’t sure how relevant these traits were beyond parasitoid wasps for determining the presence of circular DNA.
Across the 29 species in our dataset, our workflow identified 24 with putative circular DNA (Figure 5, A). We briefly examined the available annotations of scaffolds with large inserts to find genes that might be present within the putative circular DNA segments using the methods provided in GenomeInfo. We found some genes implicated in host-parasite interactions [31] within the most common large insert in Trichinella spiralis (Figure 5, B). We checked that our workflow wasn’t just flagging this insert due to multiple gene copies by examining the count of other mapped distances > 1 kb in the same region. If reads were randomly mapping to different copies of the same gene within the segment, we'd expect relatively similar counts of mapped distances between other copies of the genes in the region; instead, we only found evidence of this insert. Additionally, the nucleotide sequences of this multi-copy gene are relatively different within the insert, and as such, we wouldn't expect to find reads mapping indiscriminately to different copies.
However, it's important to note that we also found many false positives within these annotations, including retrotransposons, ribosomal RNAs, and mitochondrial genes. Because this workflow looks for large mapped distances between forward and reverse reads, areas of the target organism’s genome with multiple copies of genes with near-identical sequences or repetitive elements could be detected. Users should carefully, manually inspect the flagged inserts from this workflow and validate with orthogonal methods, or consider implementing a supplementary mapping filter in the read-mapping step. Additionally, fragmented genome assemblies will be more difficult to analyze since small scaffolds with few inserts may still be considered outliers.
All output files, including the large insert BAM files, filtered coverage and annotation files (where available), and mapped distance and position summaries for the parasite dataset, are available on Zenodo.
Next steps
We’ve decided not to use this approach to pursue identifying dsDNA cargo in virus-like particles any further. Still, we believe our workflow is generally helpful for finding circular DNA across organisms. For researchers studying parasitoid wasps, whether for use as pest control or for more general ecological and evolutionary biology research, our method appears to reliably identify circularized and packaged DNA without needing to sequence virus-like particles. For scientists studying circular DNA more broadly or in specific target organisms, this workflow could be implemented to look for similar patterns of larger-than-expected mapped distances between paired reads. If you decide to use this workflow, we'd enjoy hearing how it goes here or on GitHub.
Watch a video tutorial on making a PubPub account and commenting. Feel free to add line-by-line comments anywhere within this text, provide overall feedback by commenting in the box at the bottom of the page, or post about this work on social media. Please make all feedback public so other readers can benefit from the discussion.
Thank you for the insightful perspective on discovering circular DNA using short-read sequencing data. Inspired by genetic anomalies in parasitic organisms, you developed a novel tool to identify circular DNA structures by detecting unusual gaps in sequencing data.
While the approach represents a notable step forward, it raises intriguing questions about its broader implications and potential limitations. The reliance on short-read sequencing data and computational analysis for identifying circular DNA marks a shift from traditional wet-lab methods. Still, it also introduces challenges related to false positives and the need for manual validation.
Given that the method is designed to work with standard short-read data, it is crucial to understand how its accuracy holds up across various types of genomes, especially those with complex or poorly annotated structures. This understanding is key to the method's successful application in diverse genetic research.
Another dimension worth exploring is the application of this method to study the functional implications of circular DNA in parasitic organisms. Circular DNA could play roles beyond gene delivery, potentially influencing pathogenicity, host interactions, or evolutionary strategies. Understanding these functions could open new avenues for targeted therapies or interventions in parasitic diseases.
Adair L. Borges:
Thanks for your comment and your interest in this work. While we are not following up on this work ourselves, we hope it gives other researchers the tools to explore this phenomenon more. If you are interested in using it in your own research, please let us know if you have any questions. We would love to hear about how it goes!