Researchers at Temple University have identified the first genome of the transmitted coronavirus.
In the field of molecular epidemiology, the scientific community around the world has worked hard to solve the riddle of the early history of SARS-CoV-2.
Since the first SARS-CoV-2 infection was detected in December 2019, tens of thousands of its genomes have been sequenced worldwide, revealing that the coronavirus is mutating, albeit slowly, at a rate of 25 mutations per genome per year.
But despite great efforts, no one has so far identified the first case of human transmission, or “patient zero,” in the COVID-19 pandemic. Finding such a case is necessary to better understand how the virus may have jumped from its original animal host to infect humans, as well as how the SARS-CoV-2 genome has changed over time and spread globally.
“The SARS-CoV-2 virus, which carries an RNA genome, has infected more than 35 million people worldwide,” said Sudhir Kumar, director of the Institute for Genomics and Evolutionary Medicine at Temple University. “We need to find this common ancestor, which we call the progenitor genome.”
This progenitor genome is the mother of all of the SARS-CoV-2 coronaviruses that infect humans today.
In the absence of a patient zero, Kumar and his team at Temple University may have found the next best thing to help with molecular epidemiological detective work around the world. “We set out to reconstruct the progenitor’s genome using a large dataset of coronavirus genomes obtained from infected individuals,” said Sayaka Miura, a senior author of the study.
They found the “mother” of all SARS-CoV-2 genomes, whose early offspring then mutated and spread to dominate the pandemic. “We have now reconstructed the progenitor genome and mapped where and when the earliest mutations occurred,” said Kumar, corresponding author of a preprint study.
In doing so, their work provided new insights into the early mutational history of SARS-CoV-2. For example, their study reported that one mutation of the SARS-CoV-2 spike protein (D614G), often associated with increased infectivity and spread, occurred after several other mutations, weeks after COVID-19 began. “It is almost always found alongside many other protein mutations, so its role in increasing infectivity is still difficult to determine,” said Sergei Pond, senior co-author of the study.
In addition to their findings on the early history of SARS-CoV-2, Kumar’s team developed mutational fingerprints to quickly recognize the strains and sub-strains infecting an individual or a region.
Ordering a pandemic
To identify the progenitor genome, they used a mutation order analysis technique, which is based on a clonal analysis of the mutant strains and the frequency at which pairs of mutations appear together in SARS-CoV-2 genomes.
First, Kumar’s team screened data on nearly 30,000 complete genomes of SARS-CoV-2, the virus that causes COVID-19. In total, they analyzed 29,681 SARS-CoV-2 genomes, each containing at least 28,000 bases of sequence data. These genomes were sampled from December 24, 2019 to July 7, 2020, representing 97 countries and regions around the world.
Many previous attempts at analyzing such large datasets, Kumar said, have not been successful because of the “focus on building the evolutionary tree of SARS-CoV-2.” “This coronavirus evolves too slowly, the number of genomes to be analyzed is too large, and the data quality of the genomes is highly variable. I immediately saw a similarity between the properties of this genetic data from the coronavirus and genetic data from the clonal spread of another nefarious disease, cancer.”
Kumar’s team has developed and studied many techniques to analyze genetic data from tumors in cancer patients. They adapted and refined those techniques and built a trail of mutations that automatically traces back to the progenitor. “Basically, the genome before the first mutation is the progenitor’s genome,” says Kumar. “The mutation tracking approach worked very well and predicted the phylogeny of the ‘major strains’ of SARS-CoV-2. It’s a great example of how big data combined with biologically informed data mining reveals important patterns.”
The progenitor genome
Kumar’s group uncovered a predicted sequence of the progenitor (mother) genome of all SARS-CoV-2 genomes (proCoV2). In the proCoV2 genome, they identified 170 non-synonymous substitutions (mutations that cause an amino acid change in a protein) and 958 synonymous substitutions relative to a closely related coronavirus, RaTG13, found in the bat Rhinolophus affinis. Although the intermediate host between bats and humans is unknown, this amounts to 96.12% sequence similarity between the proCoV2 and RaTG13 sequences.
Next, they identified 49 single nucleotide variants (SNVs) that occurred with a variant frequency greater than 1% in their data set. These were investigated further to examine their mutational patterns and global spread.
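As an illustration of the frequency cutoff described above, the following Python sketch keeps only variants seen in more than 1% of genomes. It is a toy example under stated assumptions, not the team’s pipeline: the function name `common_variants`, the pre-aligned equal-length input, and the choice to count all genomes in the denominator are hypothetical.

```python
from collections import Counter

def common_variants(genomes, reference, min_freq=0.01):
    """Return {(position, base): frequency} for variants above min_freq.

    genomes: list of aligned sequences, all the same length as `reference`
             (hypothetical toy input, not the study's data format).
    Positions are 0-based offsets into the alignment.
    """
    counts = Counter()
    for seq in genomes:
        for pos, (ref_base, base) in enumerate(zip(reference, seq)):
            # Skip matches and ambiguous or missing calls such as 'N' or '-'.
            if base != ref_base and base in "ACGT":
                counts[(pos, base)] += 1
    total = len(genomes)
    # Assumption: every genome counts toward the denominator.
    return {v: n / total for v, n in counts.items() if n / total > min_freq}

# Toy data: 100 four-base "genomes", two of which carry a T at offset 1.
reference = "ACGT"
genomes = ["ACGT"] * 97 + ["ATGT"] * 2 + ["ACGA"]
print(common_variants(genomes, reference))
```

In this toy data the variant seen in a single genome sits exactly at 1% and is dropped, because the cutoff is strictly greater than 1%.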
“The mutational tree predicts a strain tree,” says Kumar. “You can also do it the other way around: build the strain tree first and then predict the order of the mutations. However, that approach is greatly affected by the quality of the sequences. When the mutation rate is low, it is difficult to distinguish between a low-quality error and a real mutation. The approach we have taken is much more robust against sequencing errors because analyzing pairs of sites across genomes yields more information.”
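The idea of ordering mutations from how often pairs occur together can be sketched in a few lines of Python. This is a deliberately simplified stand-in, not the method used in the study: the function `infer_precedence`, the `tolerance` parameter, and the labels `m1`, `m2`, `m3` are all hypothetical. The rule assumed here is that variant A preceded variant B if B is almost never seen without A, while A is regularly seen without B.

```python
from itertools import permutations

def infer_precedence(genomes, tolerance=0.05):
    """Return a set of (earlier, later) variant pairs.

    genomes: list of sets, each holding the variant labels one genome carries.
    tolerance: share of carriers allowed to break the pattern, a crude
               allowance for sequencing errors (illustrative value).
    """
    variants = set().union(*genomes)
    order = set()
    for a, b in permutations(variants, 2):
        n_a = sum(1 for g in genomes if a in g)
        n_b = sum(1 for g in genomes if b in g)
        a_without_b = sum(1 for g in genomes if a in g and b not in g)
        b_without_a = sum(1 for g in genomes if b in g and a not in g)
        # A precedes B: A is common on its own, B is (almost) never alone.
        if a_without_b > tolerance * n_a and b_without_a <= tolerance * n_b:
            order.add((a, b))
    return order

# Toy data: m1 arose first, then m2, then m3; one genome is an error
# that shows m2 without m1.
genomes = (
    [{"m1"}] * 6
    + [{"m1", "m2"}] * 17
    + [{"m1", "m2", "m3"}] * 10
    + [{"m2"}]
)
print(sorted(infer_precedence(genomes)))
```

Because each decision pools counts across all genomes that carry a pair, a single erroneous sequence does not flip the inferred order, which is the intuition behind the robustness claim in the quote above.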
An earlier timeline emerges
When a comparison of the inferred proCoV2 sequence with the genomes in their collection revealed no complete matches at the nucleotide level, Kumar’s team knew that the assumed baseline of the pandemic was off.
“This progenitor genome has a different sequence than what some are calling the reference sequence, which was first observed in China and deposited into the GISAID SARS-CoV-2 database,” Kumar said.
The closest matches were to genomes sampled 12 days after the earliest sampled virus, which dates to December 24, 2019. Multiple matches were found on all sampled continents and were detected as late as April 2020 in Europe. Overall, 120 of the genomes that Kumar’s team analyzed contained only synonymous differences from proCoV2. That is, all of their proteins are identical in amino acid sequence to the corresponding proCoV2 proteins. Most of these protein-level matches (80 genomes) are from coronaviruses sampled in China and other Asian countries.
These spatial and temporal patterns suggest that proCoV2 already possessed the full set of protein sequences needed to infect, spread and persist in the global human population.
They found that the proCoV2 virus and its early descendants arose in China, based on the earliest mutations of proCoV2 and their locations. Furthermore, they demonstrated that a population of strains with up to six mutational differences from proCoV2 already existed when the first COVID-19 cases were detected in China. With SARS-CoV-2 estimated to mutate 25 times a year, this means that the virus must have been infecting people weeks before the cases in December 2019.
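The timing argument above is simple arithmetic, sketched here in Python. The helper `weeks_to_accumulate` is hypothetical, and the calculation assumes a constant rate with all differences accumulating along a single line of descent, so it should be read as a rough upper-end figure rather than the study’s estimate.

```python
def weeks_to_accumulate(differences, rate_per_year=25):
    """Weeks needed to accumulate `differences` mutations at a steady rate.

    Assumes a constant clock of `rate_per_year` mutations per genome per
    year (the figure quoted in the article) and 52 weeks per year.
    """
    return differences / rate_per_year * 52

# Six differences from proCoV2 at 25 mutations per year.
print(f"about {weeks_to_accumulate(6):.1f} weeks")
```

Strains with fewer than six differences would imply proportionally less time, which is why the article speaks only of “weeks” before the December 2019 cases.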
Since there is strong evidence of multiple mutations prior to those found in the reference genome, Kumar’s team had to come up with a new nomenclature of mutational signatures to classify SARS-CoV-2, and accounted for them by introducing a series of Greek letter symbols to represent each one.
For example, they found that the μ and α variants of SARS-CoV-2 genomes emerged before the first reports of COVID-19. This strongly implies the existence of some sequence diversity in ancestral SARS-CoV-2 populations. All 17 genomes sampled from China in December 2019, including the designated SARS-CoV-2 reference genome, carry all three μ and all three α variants. Interestingly, six genomes containing the μ variants but not the α variants were sampled in China and the United States in January 2020. Consequently, the earliest sampled genomes (including the designated reference) are not the progenitor strains.
The analysis also predicts that the progenitor genome had offspring that were spreading around the world during the earliest stages of COVID-19. It was ready to infect from the start.
“The progenitor had all the abilities it needed to spread,” said Sergei Pond. “There is very little evidence of selection on the lineage between bats and humans, despite strong selection on bat coronavirus lineages.”
Furthermore, they found confounding evidence that there is always another mutation accompanying the spike protein mutation D614G.
“Many people are interested in the mutation in the spike protein because of its functional properties,” says Kumar. “But what we are observing is that in addition to the spike protein, there are some additional changes in the genome that are always found with the change in the spike protein (D614G). We call this the beta group of mutations, and the spike is one of them. Whatever we think the spike is doing, it’s best not to forget that the other mutations could be involved as well. Also, these mutations could simply be hitchhiking together; we don’t know yet.”
“What’s also interesting is that the genome containing the spike protein mutation has undergone many other mutations. And what we call epsilon mutations (there are 3 of these) occur against the background of the spike, and they alter arginine residues in a very important protein, the nucleocapsid (N) protein. Epsilon mutations are common in Europe, and they are always found with the spike protein mutation. So the epsilon mutations have started a trend toward dominance in both Europe and Asia.”
Overall, they identified seven major evolutionary lineages that arose after the pandemic began, some of them in Europe and North America after the ancestral lineages emerged in China.
“The Asian strains founded the whole pandemic,” Kumar said. “But over time, it is the sub-strain containing the epsilon mutation, which may have emerged outside of China (first observed in the Middle East and Europe), that is infecting Asia much more.”
Their mutation-based analyses also showed that the coronaviruses in North America carried genomic signatures very different from those common in Europe and Asia.
“This is a dynamic process,” says Kumar. “Clearly, there are very different pictures of spread painted by the emergence of new mutations, the three epsilon, gamma and delta, which we found to occur after the spike protein change. We need to find out whether any functional properties of these mutations have accelerated the pandemic.”
Going forward, they will continue to refine their results as new data becomes available.
“There are currently more than 100,000 SARS-CoV-2 genomes that have been sequenced,” says Pond. “The power of this approach is that the more data you have, the more easily you can tell the precise frequencies of individual mutations and pairs of mutations,” says Kumar. “The resulting variants, the single nucleotide variants, or SNVs, and their frequency and history can be told very clearly with more data. Therefore, our analysis infers a credible root for the SARS-CoV-2 phylogeny.”
Their results are being automatically updated online as new genomes are reported (currently in excess of 50,000 samples) and can be found at http://igem.temple.edu/COVID-19.
“These findings and our mutational fingerprints of SARS-CoV-2 strains have overcome the tough challenges of developing a retrospective on how, when and why COVID-19 emerged and spread, which is a prerequisite for creating remedies for this pandemic through the efforts of science, technology, public policy and medicine,” Kumar said.
See: “An evolutionary portrait of the progenitor SARS-CoV-2 and its dominant offshoots in COVID-19 pandemic” by Sudhir Kumar, Qiqing Tao, Steven Weaver, Maxwell Sanderford, Marcos A. Caraballo-Ortiz, Sudip Sharma, Sergei L. K. Pond and Sayaka Miura, September 29, 2020, bioRxiv.
DOI: 10.1101/2020.09.24.311845