Illumina (HiSeq!) reads are 99.88% accurate? (0.12% error rate)
I’ve been aligning various datasets and throwing them through BEST. GenomeMiner has given me free credit to do this on their platform, and set up the tools for me there. So I recommend checking it out.
I wanted to compare previous datasets to some Illumina data. So I grabbed a random HG002 FASTQ from the PanGenome project.
The results are largely unsurprising. But it was interesting to look at some HiSeq data and see all those lovely (reasonably well calibrated) Q scores:
Here are the yield plots. BEST assigns an emQ of 75 to perfect reads, which here is the bulk of the data:
Overall the error rate seems very low, slightly lower than the previous PacBio dataset:
The identity here is actually 0.9988. So 99.88% accurate, or a 0.12% error rate, which works out to somewhere around Q29.
More or less what I would expect to see!
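For anyone who wants to check the Q29 figure, here is a minimal sketch of the identity-to-Phred conversion, assuming the standard definition Q = -10 * log10(error rate). The function name `identity_to_phred` is my own for illustration and is not part of BEST.

```python
import math


def identity_to_phred(identity: float) -> float:
    """Convert a fractional identity (0..1) to a Phred-scaled quality.

    Assumes the standard Phred definition: Q = -10 * log10(error_rate),
    where error_rate = 1 - identity.
    """
    error_rate = 1.0 - identity
    if error_rate <= 0.0:
        # A perfect identity has no finite Phred score.
        raise ValueError("identity must be below 1.0 to give a finite Q")
    return -10.0 * math.log10(error_rate)


if __name__ == "__main__":
    # Identity reported above for the HiSeq reads.
    print(f"Q{identity_to_phred(0.9988):.1f}")
```

A perfect read has an error rate of zero and so no finite Phred score, which is presumably why a tool has to pick some fixed value for those reads.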
Also… Century of Biology has like… 100x the subscribers I have here… so please subscribe to bitsof.bio!
Datasets here:
HG002_HiSeq30x_subsampled_R1.fastq.gz
https://s3-us-west-2.amazonaws.com/human-pangenomics/index.html?prefix=NHGRI_UCSC_panel/HG002/hpp_HG002_NA24385_son_v1/ILMN/downsampled/
The notes there describe it as:
Whole-genome data, downsampled to ~30x PCR-free Illumina 150bp (HG002, HG003, and HG004) to match the expected production data available for the 1000 genome samples (produced at the NYGC: ~30x 150 bp PE Illumina reads from the 2504 1KG samples (using ~450bp fragment size))
I aligned only the first read (R1) and used the Dorado aligner (minimap2), which is consistent with previous runs.