Spideog: a new tool to convert and combine Kraken reports
Introducing a new tool to convert and combine Kraken reports!
Quick links
- Repository: https://github.com/jeanmanguy/spideog
- README: https://github.com/jeanmanguy/spideog/blob/main/README.md
- Binaries: https://github.com/jeanmanguy/spideog/releases
Earlier this year I started working on a metagenomic project and also started to learn Rust. To classify and assign sequencing reads to taxa I use Kraken 2 [1] combined with Bracken [2].
Goals
I have some problems with the Kraken report format: I can't easily make plots in R from it. For example, I wanted to use the taxonomy data to draw trees with ggtree [3], but that is not easy to do because the taxonomic tree is encoded using indentation and is mixed with the abundance data. The format used by Metaphlan [4] is slightly better but has the same problems.
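To make the indentation problem concrete, here is a minimal Python sketch (not Spideog's actual Rust code) that recovers each taxon's parent from the indented name column. It assumes the standard six-column, tab-separated Kraken report layout with two spaces of indentation per level; the excerpt, its read counts, and the `parse_report` helper are made up for illustration.

```python
# Hypothetical excerpt of a Kraken-style report (tab-separated columns):
# percentage, clade reads, taxon reads, rank code, NCBI taxid, indented name.
REPORT = (
    "100.00\t1000\t0\tR\t1\troot\n"
    "99.00\t990\t10\tD\t2\t  Bacteria\n"
    "50.00\t500\t500\tP\t1224\t    Proteobacteria\n"
    "40.00\t400\t400\tP\t1239\t    Firmicutes\n"
)


def parse_report(text, indent_width=2):
    """Return one dict per taxon, with its parent recovered from indentation.

    indent_width=2 is an assumption about the report being parsed.
    """
    rows = []
    stack = []  # names of the ancestors of the current line, root first
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        raw_name = fields[5]
        # The depth in the tree is only available as leading spaces in the name.
        depth = (len(raw_name) - len(raw_name.lstrip(" "))) // indent_width
        name = raw_name.strip()
        del stack[depth:]  # keep only the ancestors strictly above this line
        rows.append({
            "name": name,
            "taxid": int(fields[4]),
            "rank": fields[3],
            "clade_reads": int(fields[1]),
            "depth": depth,
            "parent": stack[-1] if stack else None,
        })
        stack.append(name)
    return rows


if __name__ == "__main__":
    for row in parse_report(REPORT):
        print(row["depth"], row["name"], "<-", row["parent"])
```

The point is that the tree structure has to be rebuilt line by line before anything else can be done with the data, which is awkward to do from R.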
So, as I was learning Rust, I set myself the goal of writing a simple command-line tool in Rust to parse Kraken reports and transform them into standard, tidy text formats.
One of my goals was to combine data from multiple reports in order to ease the analysis of multiple samples. It was also important for me not to waste time on deployment and installation procedures. At the same time I was setting up a Nextflow pipeline to launch jobs on the university's HPC cluster "SONIC".
Implementation
I developed Spideog to read one or multiple Kraken reports and write one tree file or one CSV file. I use the simple Newick format for the tree, and a tidy format [5] for the abundance data. Newick trees are easy to read in R with the {ape} package [6] (and in other analysis languages), and tidy data is the standard for the Tidyverse [7], the easiest way to format data if you make plots with {ggplot2} [8]. These file formats can be combined to merge the results of multiple analyses.
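As an illustration of the two output formats, the following Python sketch serialises a small taxonomy into Newick and writes abundances from two samples as one tidy table (one row per sample and taxon). It is only a sketch under assumptions: the sample data, the helper names, and the CSV column names (`sample`, `taxon`, `clade_reads`) are hypothetical and are not necessarily what Spideog produces.

```python
import csv
import io

# Hypothetical pre-parsed taxa per sample, in report order:
# (depth in the tree, taxon name, clade read count).
SAMPLES = {
    "sample_a": [(0, "root", 1000), (1, "Bacteria", 990),
                 (2, "Proteobacteria", 500), (2, "Firmicutes", 400)],
    "sample_b": [(0, "root", 800), (1, "Bacteria", 790),
                 (2, "Proteobacteria", 300)],
}


def to_newick(taxa):
    """Serialise a depth-first list of (depth, name, ...) tuples into Newick."""
    children = {}  # index of a node -> indices of its children
    stack = []     # indices of the ancestors of the current node
    for i, (depth, *_rest) in enumerate(taxa):
        del stack[depth:]
        if stack:
            children.setdefault(stack[-1], []).append(i)
        stack.append(i)

    def render(i):
        # Unquoted Newick labels cannot contain spaces; use underscores.
        label = taxa[i][1].replace(" ", "_")
        kids = children.get(i, [])
        if not kids:
            return label
        return "(" + ",".join(render(k) for k in kids) + ")" + label

    return render(0) + ";"


def to_tidy_csv(samples):
    """Write one row per (sample, taxon) pair: the tidy 'long' layout."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["sample", "taxon", "clade_reads"])
    for sample, taxa in samples.items():
        for _depth, name, reads in taxa:
            writer.writerow([sample, name, reads])
    return out.getvalue()


if __name__ == "__main__":
    print(to_newick(SAMPLES["sample_a"]))
    print(to_tidy_csv(SAMPLES), end="")
```

Because every row carries its sample name, adding another sample only appends rows, which is what makes this layout convenient to join and plot later.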
Spideog is implemented in Rust. I set up continuous integration to build binaries for Linux, macOS, and Windows. No dependencies needed. No Docker container or Conda environment needed. No need to have Rust installed on the machine. I added the binary to my Nextflow pipeline (in the bin/ folder; you could make a container wrapper if you want to have everything in containers and not have binaries in your git repository). It works like a charm on the cluster, with no extra hassle.
Example
For development and testing purposes I manually crafted two Kraken reports with a few differences in read counts and with different species found.
Spideog's subcommand combine-trees will take all the reports you give it (thanks to glob patterns) and generate a single tree in Newick format that you can read with ape and ggtree. The subcommand combine-abundances will produce one CSV file that can be read with R. Then, using fortify on the tree object to get a data frame, you can join the two objects to plot heatmaps and other data visualisations with ggplot2 (or base R plot functions).