Tired of looking at SV calls and thinking "WHY was this called, it's TERRIBLE?" Us too. So we trained a model. Gets rid of 47% of FPs. Keeps 97% of TPs. Quick writeup here https://www.biorxiv.org/content/10.1101/2020.05.22.111260v1 Full paper w/ major samplot upgrades soon https://github.com/ryanlayer/samplot 1/4
Lessons learned: ML architectures are easy thanks to amazing libraries. Training is hard in a surprising way. When picking a training set, think about how the model will be used. In SV detection MOST calls are FPs, so picking good negative examples is very important and kinda hard 2/4
A BAD way of picking negative SVs is to randomly select genomic regions that are not in your truth set, or to select a variable site but pick a sample that is hom ref. These sites are VERY clean. An SV detector will never call them, so your model will never need to filter them 3/4
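To make the problem concrete, here's a minimal sketch of that bad strategy: uniformly sampling random intervals that don't overlap the truth set. All names and sizes are illustrative (real code would work from BED intervals over a genome index), but it shows why the negatives come out so clean: nothing about the sampling seeks out confusing loci.

```python
import random

def naive_negatives(chrom_len, truth_intervals, n, size=500, seed=0):
    """Bad idea: uniformly sample intervals that avoid the truth set.

    These regions are mostly uneventful genome, so an SV caller would
    never emit a call there in the first place.
    """
    rng = random.Random(seed)
    negatives = []
    while len(negatives) < n:
        start = rng.randrange(chrom_len - size)
        interval = (start, start + size)
        # Reject anything overlapping a truth-set SV.
        if all(interval[1] <= s or interval[0] >= e for s, e in truth_intervals):
            negatives.append(interval)
    return negatives

# Toy chromosome with two truth-set SVs.
truth = [(1_000, 2_000), (5_000, 6_000)]
negs = naive_negatives(10_000, truth, n=5)
```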
Negatives need to be edge cases that confuse the detection algorithm. We could pick FPs from some caller vs. truth set, but truth sets have many false negatives, so apparent FPs may actually be real SVs. Our idea was to sample calls made in regions that are enriched for FPs. Any other ideas? 4/4
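One way to sketch that last idea (hypothetical names throughout, not the thread's actual pipeline): estimate a per-region FP rate from call vs. truth counts, then sample candidate negatives weighted by that enrichment, so segdup-style trouble spots dominate the negative set.

```python
import random

def fp_enrichment(calls_per_region, truth_per_region):
    """Per-region FP rate: fraction of calls absent from the truth set."""
    return {
        r: (calls_per_region[r] - truth_per_region.get(r, 0)) / calls_per_region[r]
        for r in calls_per_region
        if calls_per_region[r] > 0
    }

def sample_negatives(calls, enrichment, n, seed=0):
    """Sample candidate negatives, weighting by each region's FP rate."""
    rng = random.Random(seed)
    weights = [enrichment.get(region, 0.0) for region, _ in calls]
    return rng.choices(calls, weights=weights, k=n)

# Toy data: a segdup-heavy region produces mostly spurious calls.
calls_per_region = {"unique": 100, "segdup": 100}
truth_per_region = {"unique": 95, "segdup": 20}
enrich = fp_enrichment(calls_per_region, truth_per_region)
calls = [("unique", i) for i in range(100)] + [("segdup", i) for i in range(100)]
negs = sample_negatives(calls, enrich, n=50)
```

The weighting is the whole trick: the negative set skews toward regions where callers actually stumble, which is where the filter model needs signal.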