Tired of looking at SV calls and thinking "WHY was this called, it's TERRIBLE?" Us too. So we trained a model. Gets rid of 47% of FPs. Keeps 97% of TPs. Quick writeup here https://www.biorxiv.org/content/10.1101/2020.05.22.111260v1 Full paper w/ major samplot upgrades soon https://github.com/ryanlayer/samplot 1/4
Lessons learned: ML architectures are easy thanks to amazing libraries. Training is hard in a surprising way. When picking a training set, think about how the model will be used. In SV detection MOST calls are FPs, so picking good negative examples is very important and kinda hard 2/4
A BAD way of picking negative SVs is to randomly select genomic regions that are not in your truth set, or to select a variable site but pick a sample that is hom ref. These sites are VERY clean. An SV detector will never call them, so your model will never need to filter them 3/4
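To make the problem concrete, here's a minimal sketch of that bad strategy: uniformly sampling random intervals that don't overlap the truth set. All names and sizes are illustrative (real code would work from BED intervals over a genome index), but it shows why the negatives come out so clean: nothing about the sampling seeks out confusing loci.

```python
import random

def naive_negatives(chrom_len, truth_intervals, n, size=500, seed=0):
    """Bad idea: uniformly sample intervals that avoid the truth set.

    These regions are mostly uneventful genome, so an SV caller would
    never emit a call there in the first place.
    """
    rng = random.Random(seed)
    negatives = []
    while len(negatives) < n:
        start = rng.randrange(chrom_len - size)
        interval = (start, start + size)
        # Reject anything overlapping a truth-set SV.
        if all(interval[1] <= s or interval[0] >= e for s, e in truth_intervals):
            negatives.append(interval)
    return negatives

# Toy chromosome with two truth-set SVs.
truth = [(1_000, 2_000), (5_000, 6_000)]
negs = naive_negatives(10_000, truth, n=5)
```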
Negatives need to be edge cases that confuse the detection algorithm. We could pick FPs from some caller vs. truth set, but truth sets have many false negatives, so apparent FPs may actually be real SVs. Our idea was to sample calls made in regions that are enriched for FPs. Any other ideas? 4/4
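One way to sketch that last idea (hypothetical names throughout, not the thread's actual pipeline): estimate a per-region FP rate from call vs. truth counts, then sample candidate negatives weighted by that enrichment, so segdup-style trouble spots dominate the negative set.

```python
import random

def fp_enrichment(calls_per_region, truth_per_region):
    """Per-region FP rate: fraction of calls absent from the truth set."""
    return {
        r: (calls_per_region[r] - truth_per_region.get(r, 0)) / calls_per_region[r]
        for r in calls_per_region
        if calls_per_region[r] > 0
    }

def sample_negatives(calls, enrichment, n, seed=0):
    """Sample candidate negatives, weighting by each region's FP rate."""
    rng = random.Random(seed)
    weights = [enrichment.get(region, 0.0) for region, _ in calls]
    return rng.choices(calls, weights=weights, k=n)

# Toy data: a segdup-heavy region produces mostly spurious calls.
calls_per_region = {"unique": 100, "segdup": 100}
truth_per_region = {"unique": 95, "segdup": 20}
enrich = fp_enrichment(calls_per_region, truth_per_region)
calls = [("unique", i) for i in range(100)] + [("segdup", i) for i in range(100)]
negs = sample_negatives(calls, enrich, n=50)
```

The weighting is the whole trick: the negative set skews toward regions where callers actually stumble, which is where the filter model needs signal.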