I'm genuinely excited to see the extension of methods to discern clonal structure from mtDNA variants-- something that I've thought about over the years. However, I'm going to comment on a couple of points in this pre-print to hopefully show why mtDNA tracing isn't easy 1/n https://twitter.com/YuanhuaHuang/status/1376778067205443594
Disclosure: I am the author of mgatk, a method that this new approach (MQuad) is compared against, so I must admit that I was a bit peeved by the statement "there is a lack of effective computational methods to identify informative mtDNA variants" (to qualify this thread) 2/n
The authors utilize a binomial mixture model as a means of identifying high-confidence mitochondrial DNA variants. This at first glance makes a ton of sense. The use of the BIC and the knee-calling is in my opinion a very reasonable way to approach this problem. 3/n
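To make the model-selection idea concrete, here is a minimal sketch, not MQuad's actual code. It assumes that, for each variant, a one-component binomial fit is compared against a two-component binomial mixture using a BIC difference. The function names, the EM initialization, and the toy counts are all hypothetical, and the knee-calling step is left out.

```python
from math import lgamma, log, exp

def log_binom_pmf(k, n, p):
    """Log binomial probability of k alt reads out of n, with p clipped away from 0/1."""
    p = min(max(p, 1e-9), 1 - 1e-9)
    log_comb = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return log_comb + k * log(p) + (n - k) * log(1 - p)

def loglik_one(alt, depth):
    """Log-likelihood when every cell shares a single allele frequency (MLE)."""
    p = sum(alt) / sum(depth)
    return sum(log_binom_pmf(k, n, p) for k, n in zip(alt, depth))

def mixture_terms(k, n, w, p):
    """Per-component log(weight * pmf) for one cell."""
    return [log(w[j]) + log_binom_pmf(k, n, p[j]) for j in range(2)]

def loglik_two(alt, depth, n_iter=200):
    """Log-likelihood of a two-component binomial mixture fitted by plain EM."""
    w, p = [0.5, 0.5], [0.01, 0.5]  # fixed, hypothetical starting values
    for _ in range(n_iter):
        # E-step: responsibility of each component for each cell
        resp = []
        for k, n in zip(alt, depth):
            terms = mixture_terms(k, n, w, p)
            top = max(terms)
            e = [exp(t - top) for t in terms]
            resp.append([x / sum(e) for x in e])
        # M-step: update mixing weights and allele frequencies
        for j in range(2):
            w[j] = max(sum(r[j] for r in resp) / len(alt), 1e-9)
            denom = sum(r[j] * n for r, n in zip(resp, depth))
            if denom > 0:
                p[j] = sum(r[j] * k for r, k in zip(resp, alt)) / denom
    total = 0.0
    for k, n in zip(alt, depth):
        terms = mixture_terms(k, n, w, p)
        top = max(terms)
        total += top + log(sum(exp(t - top) for t in terms))
    return total

def delta_bic(alt, depth):
    """BIC(one component) - BIC(two components); larger favors the mixture."""
    n_cells = len(alt)
    bic_one = 1 * log(n_cells) - 2 * loglik_one(alt, depth)
    bic_two = 3 * log(n_cells) - 2 * loglik_two(alt, depth)
    return bic_one - bic_two

# Toy data: 20 cells at 50x; one variant confined to 5 cells, one uniform across cells
depth = [50] * 20
clonal_alt = [0] * 15 + [25, 30, 20, 28, 26]
uniform_alt = [5] * 20
print(delta_bic(clonal_alt, depth), delta_bic(uniform_alt, depth))
```

Note that a score like this only asks whether cells split into two heteroplasmy groups; it has no term for strand, sequence context, or RNA editing.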
The issue is that many of the variants in the real data are most likely artifacts of one or more modes of data noise that aren't accounted for... Here is my rationale for this claim: 4/n
In Figure 2a, we see the top 20 variants from MQuad. In red, I'm highlighting 3 'variants' that occur within 4 bp of one another and look identical in terms of single-cell heteroplasmy. I would argue that these are not 3 independent, perfectly concerted mutations 5/n
These are most likely due to non-mito originating SNPs (e.g. regions of the nuclear genome that have high homology to mitochondrial DNA and thus map onto chrM). These are problematic variants that I see more commonly in scRNA than scDNA/ATAC data 6/n
Another variant (2619A>T) is in the same position as an extremely reproducible RNA editing event that we've previously reported. (There are a few possibilities for why the alternate allele here is different than what we've previously reported as the editing event) 7/n
These are reasons why @LeifLudwig @julirsch and I strongly advocate for complementary (bulk or sc) DNA data to remove these false positives. Ultimately, I don't think that the proposed mixture model will catch/filter these, which are almost certainly not informative variants 8/n
Since these RNA-specific variants are highly reproducible from dataset to dataset, let's say that we've found all of them and just remove them automatically (@julirsch did this in fact and it works extremely well). Are we in the clear? Unfortunately not. 9/n
This brings me to the most important point of understanding noise in mtDNA variants-- the noise is NOT RANDOM. It is highly predictable based on the local sequence structure of the variant as well as the ref/alt nucleotide patterns. 10/n
Here's an example of a real region in mtDNA that is indicative of what happens. In the mitochondrial genome, one can observe many stretches of poly-Cs and/or poly-Gs. When our @illumina sequencers read these molecules in different directions, we observe huuuuge differences 11/n
Specifically, reads that are sequenced in the direction of the blue arrow show much higher "heteroplasmy" than those in the red direction due to a photobleaching effect. This is much worse on the NextSeq and NovaSeq platforms (which is basically what everyone uses today) 12/n
If you sequence the same cell line (here I'm showing two different cell lines on left and right), and look at the top strand sequence, one can compute a 'bias' of fw-rev. What you see are systematic differences as a function of sequencer and nucleotide composition 13/n
Seeing this, we designed mgatk to account for these massive strand differences. The x-axis shows the concordance (correlation) of the heteroplasmy when estimated using either strand. I believe this is the single most informative metric for finding real variants 14/n
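A rough sketch of the strand-concordance idea, not mgatk's actual implementation: estimate per-cell heteroplasmy separately from forward- and reverse-strand reads, then take the Pearson correlation of the two estimates across cells. The function names and all counts below are made up for illustration.

```python
from math import sqrt

def heteroplasmy(alt, ref):
    """Per-cell alt fraction; None where a cell has no coverage on this strand."""
    return [a / (a + r) if (a + r) > 0 else None for a, r in zip(alt, ref)]

def pearson(x, y):
    """Pearson correlation; NaN if either input has no variance."""
    n = len(x)
    if n < 2:
        return float("nan")
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / sqrt(sxx * syy)

def strand_concordance(fw_alt, fw_ref, rev_alt, rev_ref):
    """Correlation of forward- vs reverse-strand heteroplasmy across cells."""
    fw = heteroplasmy(fw_alt, fw_ref)
    rev = heteroplasmy(rev_alt, rev_ref)
    # Keep only cells covered on both strands
    pairs = [(f, r) for f, r in zip(fw, rev) if f is not None and r is not None]
    return pearson([f for f, _ in pairs], [r for _, r in pairs])

# Hypothetical variant supported by both strands in the same cells
real = strand_concordance(
    fw_alt=[0, 10, 20, 0, 15], fw_ref=[20, 10, 0, 20, 5],
    rev_alt=[1, 9, 19, 0, 14], rev_ref=[19, 11, 1, 20, 6],
)

# Hypothetical artifact: the alt allele shows up almost only on the forward strand
artifact = strand_concordance(
    fw_alt=[5, 0, 8, 2, 6], fw_ref=[15, 20, 12, 18, 14],
    rev_alt=[0, 1, 0, 1, 0], rev_ref=[20, 19, 20, 19, 20],
)
print(real, artifact)  # close to 1 for the first, negative for the second
```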
NB: GATK uses a Fisher's test to examine strand imbalance in the alternate allele, but since mtDNA coverage is so high, the Fisher's test winds up being significant for evvvvverything, so we use this correlation metric 15/n
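To illustrate why the test saturates at high depth, here is a small sketch with a hand-rolled two-sided Fisher's exact test on a strand-by-allele 2x2 table. This is not GATK's code, and the counts are hypothetical: the same 3% vs 2% alt fraction on the two strands, once at 100x per strand and once at 10,000x per strand.

```python
from math import lgamma, exp

def log_comb(n, k):
    """Log of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)

def fisher_two_sided(fw_alt, fw_ref, rev_alt, rev_ref):
    """Two-sided Fisher's exact p-value for a strand x allele 2x2 table."""
    row_fw = fw_alt + fw_ref
    row_rev = rev_alt + rev_ref
    col_alt = fw_alt + rev_alt
    total = row_fw + row_rev

    def log_p(x):
        # Hypergeometric log-probability of x forward-strand alt reads
        return (log_comb(row_fw, x) + log_comb(row_rev, col_alt - x)
                - log_comb(total, col_alt))

    observed = log_p(fw_alt)
    lo = max(0, col_alt - row_rev)
    hi = min(row_fw, col_alt)
    p_value = 0.0
    for x in range(lo, hi + 1):
        lp = log_p(x)
        # Sum every table at least as extreme as the observed one
        if lp <= observed + 1e-7:
            p_value += exp(lp)
    return min(1.0, p_value)

p_low_depth = fisher_two_sided(3, 97, 2, 98)          # 100x per strand
p_high_depth = fisher_two_sided(300, 9700, 200, 9800)  # 10,000x per strand
print(p_low_depth, p_high_depth)
```

The imbalance is identical in both tables; only the depth changes, and with it the p-value goes from unremarkable to very small.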
Now, this brings me to supplemental figure S1 in this paper, where the red dots show the MQuad variant calls. See all of the variants that have low and even negative correlation? These variants are almost certainly artifacts. Only one strand is contributing to heteroplasmy! 16/n
In the supplement of the mtscATAC-seq paper, we show what some of these variants look like. All of them have either a reference or an alternate C. The other giveaway is that these aren't transitions (C>T,G>A,A>G,T>C) which are far more common in mtDNA (at least in blood) 17/n
So how do we interpret this figure where MQuad is reportedly massively outperforming mgatk? The reason is that the simulation isn't taking into account the true nature of the errors: as far as I can tell from the methods, the simulated errors are just random mismatches 18/n
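The transition check is simple to encode. A small sketch with hypothetical helper names that labels a substitution as a transition or transversion and notes whether a C is involved on either side:

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def is_transition(ref, alt):
    """True for purine<->purine or pyrimidine<->pyrimidine substitutions."""
    ref, alt = ref.upper(), alt.upper()
    if ref == alt:
        raise ValueError("ref and alt must differ")
    return {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES

def describe(variant):
    """Parse a 'C>A'-style string into the two properties discussed above."""
    ref, alt = variant.upper().split(">")
    return {
        "transition": is_transition(ref, alt),
        "involves_C": "C" in (ref, alt),
    }

for v in ["C>T", "G>A", "C>A", "A>T"]:
    print(v, describe(v))
```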
However, the authors claim that "A high false positive rate was observed for mgatk... consistent with the comments from the original publication"

No. This was not the comment. 19/n
(For reasons mentioned above, scRNA-seq has a lot of errors that make this hard, so we emphasize proceeding with caution. This isn't mgatk's fault. However, I'd note @vangalenlab & @TylerEMiller's work that made huuuge strides in doing this better for 3' scRNA) 20/n
In fact, I'd say mgatk does pretty well. I agonized over this. For literally years. On real Smart-seq2 data, mgatk does pretty well when we know the true clonal variants (from @LeifLudwig's painstaking experiment) 21/n
Part of the reason that this works so well is the variance mean ratio statistic (Y axis), which helps identify known RNA editing loci (including 2619), as we show in the supplement. Not perfect, but for Smart-seq2 data, I believe mgatk is pretty good. 22/n
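A minimal sketch of a variance mean ratio computed on per-cell heteroplasmy, assuming population variance divided by the mean; this is an illustration with made-up values, not mgatk's exact code. A variant confined to a few cells scores high, while a site that is uniformly "heteroplasmic" in every cell (as an RNA editing locus would be) scores low.

```python
def variance_mean_ratio(het):
    """Population variance of per-cell heteroplasmy divided by its mean."""
    n = len(het)
    mean = sum(het) / n
    if mean == 0:
        return 0.0
    variance = sum((h - mean) ** 2 for h in het) / n
    return variance / mean

# Hypothetical clonal variant: high heteroplasmy in 2 of 20 cells, absent elsewhere
clonal = [0.0] * 18 + [0.9, 0.8]

# Hypothetical editing-like site: roughly 10% in every cell
editing_like = [0.10, 0.12, 0.09, 0.11] * 5

print(variance_mean_ratio(clonal), variance_mean_ratio(editing_like))
```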
Finally, I want to clarify that I really would love for a method to come along and outperform mgatk. I'm not invested in the method being the best-- I'm interested in *us* using the best tools to understand human tissue physiology via mtDNA tracing 23/n
Overall, @YuanhuaHuang, I think MQuad genuinely holds major promise. The integration with Vireo is an important step forward in combining mtDNA and nuclear mutation clonal tracing, which is exactly the direction that we need to head as a community 24/n
However, I felt like many points that were made in this preprint mischaracterized our prior work, so I felt compelled to write on a public forum. I'm more than happy to discuss mtDNA tracing/variant calling/etc. anytime. 25/25
You can follow @CalebLareau.