Platform Benchmarks Workflow
Recently I’ve been running a bunch of sequencing datasets through BEST to generate some basic run statistics. The plots I’ve been most interested in generating are:
Q versus emQ plots: which show how well calibrated Q scores are.
Q yield plots: how many bases fall into each “Q” bin.
Overall error statistics.
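To make the first of those concrete, here’s a minimal sketch (using toy data, not a real run) of how an empirical Q can be derived from per-base predicted Q scores and error flags:

```shell
# Sketch: bin bases by their predicted Q score, count errors per bin,
# and convert the observed error rate back to the Phred scale:
#   emQ = -10 * log10(errors / total)
# The two-column "predicted_q is_error" records piped in are toy data.
printf '10 1\n10 0\n10 0\n10 0\n10 0\n10 0\n10 0\n10 0\n10 0\n10 0\n' |
awk '{ total[$1]++; err[$1] += $2 }
     END { for (q in total) if (err[q])
             printf "Q%s emQ=%.1f\n", q, -10 * log(err[q] / total[q]) / log(10) }'
# A perfectly calibrated Q10 bin (1 error in 10 bases) gives emQ = 10.0
```

If the base caller is well calibrated, the points on a Q versus emQ plot sit close to the diagonal; systematic over- or under-confidence shows up as a consistent offset.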
Which Samples And References?
The first issue was figuring out which datasets to use. It turns out HG002, which is part of the Genome in a Bottle sample set, is the best-studied human genome we have. It fast seems to be becoming the standard sample to run when validating a new sequencing platform.
New human genome references have been developed for this sample. In fact, there are two: one each for the paternal and maternal chromosomes. The best reference to use seems to be the simple concatenation of these two individual reference genomes (as evaluated here).
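Building the combined reference is just a concatenation. A sketch of the mechanics, using placeholder filenames and tiny stand-in FASTA records rather than the real assemblies:

```shell
# Sketch: combine the maternal and paternal HG002 assemblies into one
# "diploid" reference. Filenames are placeholders, and the tiny FASTA
# records below stand in for the real multi-gigabase assemblies.
printf '>chr1_MATERNAL\nACGT\n' > hg002_maternal.fa
printf '>chr1_PATERNAL\nTGCA\n' > hg002_paternal.fa
cat hg002_maternal.fa hg002_paternal.fa > hg002_diploid.fa
grep -c '^>' hg002_diploid.fa   # contig count now spans both haplotypes
```

The aligner then sees both haplotypes at once, so each read can map to whichever parental copy it actually came from.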
With that process established we actually have to do the alignments. I used Dorado aligner (which uses minimap2) to align everything. This isn’t ideal. I suspect just using stock minimap2 would be fairer, and it would likely work on short-read data as well as long.
Which Datasets?
The next thing is to look for datasets. You might think this is as easy as grabbing them from whatever source and then aligning them… unfortunately, it’s not.
Many of the datasets I found had issues… In general it was best to convert whatever you’re given to a fastq and then use that as the alignment input.
Dorado aligner (and I assume minimap2) will work from BAMs (probably CRAMs too?). Theoretically you could use these… However, I found that the resulting BAMs would contain references to any original alignment1, and this would cause BEST to fall over (correctly) when it couldn’t find the referenced location in the reference.
Going back to fastq naturally strips out any alignment information and gives a clean input to work from2.
With a clean input you can align and run the output through BEST.
I’ve so far used BEST to perform this for the following datasets:
An Oxford Nanopore dataset called with the “fast” base caller model.
An Oxford Nanopore dataset called with the “sup” and duplex models.
These runs have all been fairly instructive! I hope that vendors start running their benchmark datasets through BEST and providing benchmarks based on this.
While these statistics don’t provide a complete picture of the performance of a particular sequencing platform, they do provide a valuable insight into the broad performance characteristics.
Perhaps a larger project where we look at how accuracy has evolved over time would be fun? Thoughts on future projects welcome on the Discord!
I guess perhaps reads where alignment failed retain their original alignment information? Anyway… this causes issues for BEST.
It would be quite fun to build an automated pipeline to throw all available HG002 datasets at, and I think this is the strategy I’d use here: grab the data; if it’s not a fastq already, attempt to use samtools fastq to get back to a fastq; then run subsequent steps from that.
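That strategy could be sketched as a small shell wrapper. This is a hypothetical outline, not a tested pipeline: the tool invocations are illustrative (samtools, dorado, and BEST would all need to be installed), and the reference filename and output prefix are assumptions.

```shell
# Hypothetical pipeline sketch: normalise any input to fastq, then
# align and summarise. Tool invocations and filenames are assumptions.
prepare_fastq() {
  local input="$1"
  case "$input" in
    *.fastq|*.fq|*.fastq.gz)
      # Already a fastq: use as-is
      echo "$input" ;;
    *.bam|*.cram)
      # Round-trip through fastq to strip any stale alignment info
      samtools fastq "$input" > "${input%.*}.fastq"
      echo "${input%.*}.fastq" ;;
    *)
      echo "unsupported input: $input" >&2
      return 1 ;;
  esac
}

run_benchmark() {
  local reads
  reads=$(prepare_fastq "$1") || return 1
  # Align against the combined diploid reference, then run BEST on it
  dorado aligner hg002_diploid.fa "$reads" > aligned.bam
  best aligned.bam hg002_diploid.fa best_stats
}
```

The key idea is just the first function: by always routing BAM/CRAM inputs through samtools fastq, every dataset enters the alignment step in the same clean state.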