The effects of bioinformatics preprocessing on cell-free DNA fragment analysis

DOI: 10.1093/gigascience/giaf139

Nucleic Acids Biomarkers

Abstract

Background:
While cell-free DNA (cfDNA) is a promising biomarker for cancer diagnosis and monitoring, there is limited agreement on optimal cfDNA collection and extraction protocols, as well as on analysis pipelines for the resulting cfDNA sequencing data. In this article, we address the latter by studying the effect of various bioinformatics preprocessing choices on derived genetic and epigenetic cfDNA features and by examining how the observed feature differences influence the downstream task of distinguishing healthy from cancer cfDNA samples.

Results:
Using low-pass whole-genome cfDNA sequencing data from 20 lung cancer and 20 healthy samples, we assessed the influence of various preprocessing settings, such as read trimming, filtering of secondary alignments, and choice of genome build, as well as practices such as downsampling or selecting for short fragments, on derived cfDNA features, including cfDNA fragment size, fragment end motifs, copy number alterations, and nucleosome footprints. Our results demonstrate that the analyzed features are robust to common preprocessing choices but exhibit variable sensitivity to sequencing coverage. Fragment length statistics and end motifs are the least affected by low coverage, whereas nucleosome footprint analysis is highly sensitive to it. Our findings confirm that selecting for shorter fragments enhances cancer-specific signals but, by removing data, also reduces signals in general. Interestingly, we find that fragment end motif analysis benefits the most from in silico size selection. We also observe that the filtering of low-quality and secondary alignments and the choice of genome build result in slight improvements in cancer classification performance based on nucleosome coverage and copy number features.

Conclusions:
Altogether, we conclude that cfDNA analysis is minimally affected by different bioinformatics preprocessing settings, but we describe some synergistic effects between analytical approaches, which can be leveraged to improve cancer detection.
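
To make the in silico size selection and fragment end motif analysis mentioned in the Results more concrete, the following is a minimal sketch and not the pipeline used in this study. It assumes fragments are already available as plain sequence strings. The function names `size_select` and `end_motif_frequencies`, the 90 to 150 bp window, and the 4-mer motif length are illustrative assumptions that are not taken from the abstract.

```python
from collections import Counter


def size_select(fragments, min_len=90, max_len=150):
    """Keep fragments whose length lies in [min_len, max_len].

    The default window is an assumed example range, not a value from the study.
    """
    return [f for f in fragments if min_len <= len(f) <= max_len]


def end_motif_frequencies(fragments, k=4):
    """Return the relative frequency of the first k bases (5' end) per fragment.

    k=4 is an assumed motif length chosen for illustration.
    """
    counts = Counter(f[:k].upper() for f in fragments if len(f) >= k)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {motif: n / total for motif, n in counts.items()}


# Toy fragments (synthetic sequences, for illustration only).
fragments = [
    "CCCA" + "A" * 96,   # 100 bp
    "CCCA" + "T" * 116,  # 120 bp
    "GGTT" + "C" * 163,  # 167 bp, removed by the size window
    "CCTG" + "G" * 136,  # 140 bp
]

selected = size_select(fragments)
freqs = end_motif_frequencies(selected)
```

In this toy input, size selection discards the 167 bp fragment, so the motif frequencies are computed from fewer fragments. This mirrors the trade-off described above: size selection can sharpen a signal while also removing data.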