The SARS-CoV-2 furin cleavage site is yet again in the news - this time because of a quote by Nobel laureate David Baltimore.
The site is not a "smoking gun", nor does it "make a powerful challenge to the idea of a natural origin".
Quite the opposite, so a little science


The furin cleavage site (FCS) / polybasic cleavage site is present in SARS-CoV-2 at the S1/S2 junction of the spike protein where it mediates the cutting (by the host protease furin, among others) of the spike, which is required for infection of cells.
The FCS was created by an out-of-frame insertion of "CTCCTCGGCGGG", creating the "(P)RRAR" amino acid sequence. This constitutes a suboptimal polybasic cleavage site that is important for expanding SARS-CoV-2's host range, its transmission, its pathogenesis, etc.
References for:
Possible host range expansion: https://jvi.asm.org/content/94/5/e01774-19
Transmission: https://www.nature.com/articles/s41564-021-00908-w
Pathogenesis: https://www.biorxiv.org/content/10.1101/2020.08.26.268854v1
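To make the "out-of-frame insertion" concrete, here is a minimal Python sketch that splices the 12-nucleotide insert into a short stretch of spike sequence and translates both versions. The flanking nucleotides and the insertion offset are assumptions chosen for illustration (they are not given in this thread), and `translate` is a hypothetical helper built on the standard genetic code.

```python
# Standard genetic code, with codons ordered T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate(nt):
    """Translate an in-frame nucleotide string, ignoring a trailing partial codon."""
    usable = len(nt) - len(nt) % 3
    return "".join(CODON_TABLE[nt[i:i + 3]] for i in range(0, usable, 3))


INSERT = "CTCCTCGGCGGG"  # the 12-nt insert quoted in the thread

# ASSUMPTION: illustrative flanks around the S1/S2 junction.
# UPSTREAM ends one base into a codon, so the insert lands out of frame.
UPSTREAM = "CAGACTAATT"
DOWNSTREAM = "CACGTAGTGTAGCT"

without_insert = UPSTREAM + DOWNSTREAM
with_insert = UPSTREAM + INSERT + DOWNSTREAM

print(translate(without_insert))  # QTNSRSVA
print(translate(with_insert))     # QTNSPRRARSVA
```

Under these assumed flanks, the insert starts after the first base of a serine codon, yet because it is 12 nt long the downstream reading frame is untouched and the net effect is four added residues, PRRA.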
FCSs are abundant, including being highly prevalent in coronaviruses. While SARS-CoV-2 is the first example of a SARSr virus with an FCS, other betacoronaviruses (the genus for SARS-CoV-2) have FCSs, including MERS and HKU1. https://www.sciencedirect.com/science/article/pii/S1873506120304165?via%3Dihub
There is nothing mysterious about having a "first example" of a virus with an FCS. Viruses sampled to date only give us a teeny-tiny fraction of all the viruses circulating in the wild. Fragments - such as the CTCCTCGGCGGG - come and go all the time. https://www.biorxiv.org/content/10.1101/2021.02.03.429646v1
How did SARS-CoV-2 acquire the FCS? We don't know. However, we do know that four main mechanisms often lead to insertions:
(1) mutation
(2) polymerase slippage
(3) template switching
(4) recombination
All of which play key roles in coronavirus (incl. SARS-CoV-2) evolution.
While we don't know for sure how SARS-CoV-2 acquired the FCS, template switching is a very likely explanation with a plausible mechanism: https://link.springer.com/article/10.1007%2Fs00705-020-04750-z
We also find insertions - albeit not FCSs (yet) - in highly related viruses, e.g., RmYN02: https://www.cell.com/current-biology/fulltext/S0960-9822(20)30662-X
Template switching likely also plays an important role during the ongoing evolution of SARS-CoV-2: https://www.biorxiv.org/content/10.1101/2021.04.23.441209v1
We need to see this in the context of the decades of evolution of the SARS-CoV-2 ancestor and related viruses in bats. It's safe to say indels come and go.
The FCS itself, (P)RRAR, is not an optimal site (for cleavage) and has never previously been used in CoV experiments to the best of my knowledge - unlike more optimal sites, which have been inserted into SARSr CoVs for basic research: https://www.sciencedirect.com/science/article/pii/S0042682206000900

Note that the site is not present in all closely related viruses and that there are plenty of indels around the site - just like when comparing SARS-CoV-2 with SARSr CoVs.
If we zoom in on the (P)RRAR site in SARS-CoV-2 and compare it to the one found in (some) FCoV sequences, we can see there's a fair bit of homology outside the FCS too - including likely O-linked glycans being conserved.
The (P)RRAR FCS isn't optimal, and while it's 'sufficient' for SARS-CoV-2's 'success' as a pandemic virus, it's not an ideal site as defined by the canonical R-X-K/R-R FCS seen in many proteins (viral and otherwise). https://onlinelibrary.wiley.com/doi/full/10.1002/cti2.1073
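One way to see why (P)RRAR falls short of the canonical motif is to write the motif as a regular expression and test it. The sketch below is only illustrative: the short peptide is an assumed stretch around the site, `find_motif` is a hypothetical helper, and the looser R-X-X-R pattern is included as the commonly cited minimal furin motif rather than anything stated in this thread.

```python
import re

# Canonical furin motif R-X-K/R-R and the looser minimal motif R-X-X-R.
CANONICAL = re.compile(r"R.[KR]R")
MINIMAL = re.compile(r"R..R")

# ASSUMPTION: a short illustrative stretch of spike around the S1/S2 site.
site = "SPRRARSV"


def find_motif(pattern, seq):
    """Return the first substring of seq matching pattern, or None if absent."""
    match = pattern.search(seq)
    return match.group(0) if match else None


print(find_motif(CANONICAL, site))  # None
print(find_motif(MINIMAL, site))    # RRAR
```

The alanine in RRAR sits where the canonical motif wants a K or R, so only the minimal pattern matches.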
The "P" from the (P)RRAR insert isn't directly part of the cleavage site itself, but, intriguingly, may regulate it via the nearby O-linked glycans.
This is seen in host proteins: https://www.jbc.org/article/S0021-9258(20)32890-8/fulltext,
but also in SARS-CoV-2: https://www.biorxiv.org/content/10.1101/2021.02.05.429982v1
Importantly, however, in recent months we have started seeing the "P" mutating towards residues creating more optimal furin sites - P681H and, especially, P681R, which can be found in B.1.1.7 and B.1.617.x - suggesting the virus may evolve towards more efficient usage of the site.

Now, the codons. Here, Baltimore is talking about the two codons coding for the first two arginines (R) following the P - CGG. The CGG codon is rare in viruses because it's an example of an unmethylated "CpG" site that can be bound by TLR9, leading to immune cell activation.

SARS: 5%
SARS2: 3%
SARSr: 2%
ccCoVs: 4%
HKU9: 7%
FCoV: 2%
Nothing unusual here.
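For anyone wanting to check numbers like these on their own sequences, here is a minimal sketch. It assumes the percentages are CGG's share of all arginine codons (the thread does not state the denominator), and `cgg_fraction` and the toy sequence are hypothetical, not real viral data.

```python
from collections import Counter

# The six codons that encode arginine in the standard genetic code.
ARG_CODONS = {"CGT", "CGC", "CGA", "CGG", "AGA", "AGG"}


def cgg_fraction(cds):
    """Fraction of arginine codons in an in-frame coding sequence that are CGG."""
    usable = len(cds) - len(cds) % 3
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    counts = Counter(c for c in codons if c in ARG_CODONS)
    total = sum(counts.values())
    return counts["CGG"] / total if total else 0.0


# ASSUMPTION: toy in-frame sequence, not a real gene.
toy = "ATGCGGAGACGTAGACGGAGGTAA"
print(f"{cgg_fraction(toy):.0%} of arginine codons are CGG")
```

Run over full coding sequences, the same function would give a per-virus figure that can be compared across genomes.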

We see CGG multiple times in different ways - here's an example comparing another "PR" stretch between SARS-CoV-2, RaTG13, and SARS-CoV in the N gene. Note how SARS-CoV-2 and RaTG13 both use CGG, while SARS-CoV uses CGC for the first R, and later R's are coded by CGT or AGA.
One final point about the CGG codons in the FCS - if they were somehow "unnatural", we'd see SARS-CoV-2 evolve away from "CGG" during the ongoing pandemic. We have more than a million genomes to analyze, so what do we find if we look at synonymous mutations at the "CGG_CGG" site?

This is *very* strong evidence that SARS-CoV-2 'prefers' CGG in these positions.
R is coded by six different codons, yet the simple single transition "CGA" is only observed in ~0.02% of sequences. The second most 'popular' codon at these sites is "CGT" (a transversion) at 0.11% frequency.
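The transition/transversion labels can be checked mechanically. This is a small sketch with a hypothetical helper, `classify_change`, that classifies a single-base change between two codons. A transition swaps purine for purine (A/G) or pyrimidine for pyrimidine (C/T), and anything else is a transversion.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_change(ref, alt):
    """Classify a single-base codon change as 'transition' or 'transversion'."""
    diffs = [(r, a) for r, a in zip(ref, alt) if r != a]
    if len(ref) != len(alt) or len(diffs) != 1:
        raise ValueError("expected codons differing at exactly one position")
    r, a = diffs[0]
    same_class = {r, a} <= PURINES or {r, a} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


# Synonymous third-position alternatives to CGG (all still encode arginine).
for alt in ("CGA", "CGT", "CGC"):
    print(alt, classify_change("CGG", alt))
```

CGG to CGA is a G-to-A transition, the kind of change that usually arises most easily, which is what makes its very low observed frequency notable.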
In other words - there is nothing unusual about the codons either.
So Baltimore's second point is also false, invalidating his hypothesis that the "FCS [...] with its arginine codons [...] was the smoking gun for the origin of the virus".
Baltimore does not provide any evidence to support his hypothesis and the data support a natural origin.
Does this disprove a lab leak? No. However, it disproves there being a "smoking gun" in the FCS and lends further evidence to natural emergence - but it also does not *prove* that scenario.
To this day, we have yet to see any scientific evidence supporting a lab leak.
A couple of other *key* references I did not get a chance to discuss:
https://virological.org/t/the-sarbecovirus-origin-of-sars-cov-2-s-furin-cleavage-site/536
https://virological.org/t/naturally-occurring-indels-in-multiple-coronavirus-spikes/560
https://virological.org/t/spike-protein-sequences-of-cambodian-thai-and-japanese-bat-sarbecoviruses-provide-insights-into-the-natural-evolution-of-the-receptor-binding-domain-and-s1-s2-cleavage-site/622
https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.3001115
What others did I miss?
Variants of
have come up - it's false. Specifically:
1. The events are not independent, hence the calculation is incorrect.
2. It's the same argument used by creationists about "irreducible complexity" - also false:
https://en.wikipedia.org/wiki/Irreducible_complexity
https://www.americanprogress.org/issues/religion/news/2006/04/10/1934/the-flaws-in-intelligent-design/

As to Richard's final point - well... #introspection