Excellent paper.

“we describe the first native RNA sequence of SARS-CoV-2 [using direct nanopore sequencing], detailing the coronaviral transcriptome [all the RNA, mRNA, sgRNA, made by the virus] and epitranscriptome [all marked up RNA used by the virus]” https://www.biorxiv.org/content/10.1101/2020.03.05.976167v2
“Many features of SARS-CoV-2 biology are captured in these direct RNA sequence data, including the transcriptome, as well as RNA base modifications or ‘epitranscriptome’.”

So there is a lot of data, but now you have to find the parts the CoV might actually use. So ...
Imagine a reel-to-reel magnetic tape.

On the reel the tape starts with a white header, then magnetic tape and finishes in a black trailer.

But as you spool the tape you notice that the magnetic tape is broken up into sections each with a header, more tape and a trailer.
Coronaviruses are large viruses in the order Nidovirales.

The name comes from nidus, Latin for nest, because their genome is “nested”: inside the whole genome, which has a header and a polyadenylated trailer (a long run of AAAAAAAAA), are long chunks whose parts also have headers and trailers.
This is (very roughly) what they’re talking about in lines 58-73 of the paper, and why conventional RT-PCR doesn’t collect all the data you’d need to understand the full transcriptome (all the expressed RNA with all its various uses).
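To make the tape analogy concrete, here is a toy sketch of a nested genome. Everything in it is made up for illustration: the `LEADER`, `BODY` and `START_SITES` values are hypothetical and far shorter than anything in a real coronavirus. The only idea it borrows is that each sub-message carries the same header, then the rest of the tape from some start point through to the shared poly-A trailer.

```python
# Toy model of a "nested" genome. All sequences and positions below are
# invented for illustration; none of this is real SARS-CoV-2 data.
LEADER = "ACGTTG"                 # hypothetical 5' leader (the "header")
POLY_A = "A" * 8                  # poly-A tail (the "trailer")
BODY = "CCCGGGTTTCCCGGGTTTCCC"    # hypothetical genome body
START_SITES = [6, 12, 18]         # hypothetical start points inside BODY


def full_genome():
    """Header, whole body, trailer."""
    return LEADER + BODY + POLY_A


def subgenomic_rnas():
    """Each sub-message: the shared header joined to the body from a
    start site through to the shared 3' end and trailer."""
    return [LEADER + BODY[start:] + POLY_A for start in START_SITES]


if __name__ == "__main__":
    genome = full_genome()
    print(genome)
    # Right-align so the shared 3' ends line up, showing the nesting.
    for rna in subgenomic_rnas():
        print(rna.rjust(len(genome)))
```

Printed right-aligned, the shorter messages sit inside the longer ones like nested boxes, each still starting with the same header.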
“cellular-derived material generated 680,347 reads, comprising 860Mb of sequence information. Aligning to the genome of the cultured SARS-CoV-2 isolate, a subset of reads were attributed to coronavirus sequences (28.9%), comprising 367Mb of sequences from the 30kbase genome.”
“Of these, a number had lengths >20,000 bases, capturing the majority of the SARS-CoV-2 genome on a single molecule. This direct RNA sequencing approach generated an average 12,230 fold coverage of the coronaviral genome”

It doesn’t get all of the genome all of the time!
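A quick back-of-envelope check of those quoted numbers, as a rough sketch only. It uses the rounded figures from the quotes above (so the result will not match the paper's 12,230 exactly), and the "mean viral read length" is my own derived estimate, assuming the 367Mb is spread over the 28.9% of reads attributed to the virus.

```python
# Rounded figures taken from the quoted text.
total_reads = 680_347
viral_fraction = 0.289      # 28.9% of reads attributed to coronavirus
viral_bases = 367e6         # 367 Mb of coronavirus sequence
genome_length = 30_000      # "30kbase genome" (rounded)

# Average fold coverage = viral bases sequenced / genome length.
mean_coverage = viral_bases / genome_length

# Derived estimate (assumption): bases per viral read.
viral_reads = total_reads * viral_fraction
mean_viral_read_length = viral_bases / viral_reads

print(f"viral reads      ~ {viral_reads:,.0f}")
print(f"fold coverage    ~ {mean_coverage:,.0f}x")
print(f"mean read length ~ {mean_viral_read_length:,.0f} bases")
```

On these rounded inputs the coverage lands close to the paper's figure, and the average viral read comes out at under 2,000 bases, far shorter than the genome. The >20,000-base reads are the exception.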
“To define the transcriptome, the shared 5’ leader sequence was used as a marker to identify intact transcripts, these corresponding to subgenomic mRNAs and having a low abundance in the virion-derived data”

Grep (search, for you non-Unix people) for the headers!
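As a minimal sketch of that "grep for the header" idea, assuming exact string matching on toy reads. The `LEADER` string, the read names and the `window` parameter are all hypothetical. A real analysis would more likely use an aligner that tolerates sequencing errors, so this is not the paper's method.

```python
# Hypothetical leader for illustration; this is not the real SARS-CoV-2 leader.
LEADER = "ACGTTG"


def has_leader(read, leader=LEADER, window=20):
    """Return True if the leader occurs within the first `window` bases.

    Exact matching is a simplification; real reads contain errors.
    """
    return leader in read[:window]


# Toy reads (invented).
reads = {
    "read1": "ACGTTGCCCGGGTTTAAAAAAAA",             # leader at the very start
    "read2": "GGGTTTCCCAAAAAAAA",                   # truncated, no leader
    "read3": "TTACGTTGTTTCCCAAAAAAAA",              # leader after 2 stray bases
    "read4": "CCCGGGTTTCCCGGGTTTCCCGGGACGTTGAAAA",  # leader-like run too deep
}

# Keep only reads that look intact at the 5' end.
intact = [name for name, seq in reads.items() if has_leader(seq)]
print(intact)
```

Restricting the search to the first few bases is a design choice: a leader-like run deep inside a read (as in `read4`) says nothing about whether the 5' end of the transcript was captured.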
“In SARS, ORF7a and ORF7b are encoded on a shared subgenomic mRNA, with translation of ORF7b being achieved through ribosome leaky scanning, explaining the absence of a dedicated ORF7b-encoding subgenomic mRNA”

It’s flaky, but it works. But ...
“ORF10 is the last predicted coding sequence upstream of the poly-A sequence, and the shortest of the predicted coding sequences at 117 bases in length. ORF10 also has no annotated function, and the encoded peptide does not appear in SARS-CoV-2 proteomes”

So is ORF10 not used?
You can follow @kevinpurcell.