Errors found in coronavirus genetic sequences included in the world's largest database

A study led by the Institute for Integrative Systems Biology (I2SysBio) finds “artifacts” among sequences that show repair of deletions in the virus that causes COVID-19, a change that can affect infection and vaccine response. Many of the sequences with repair mutations in the spike protein, the virus's key to infecting human cells, were the result of data-processing errors.

A multidisciplinary team led by the Institute for Integrative Systems Biology (I2SysBio), a joint center of the Spanish National Research Council (CSIC) and the University of Valencia (UV), has published a study that offers a new perspective on the ability of the SARS-CoV-2 virus to mutate and infect humans. By reviewing the viral genome database most widely used during the pandemic, the research team found 'false positives' for deletion repair, a process that restores sections of the viral genome and affects the virus's ability to replicate or evade the host's immune system. The work, published in the journal Virus Evolution, involves researchers from the Instituto de Biomedicina de Valencia (IBV) of the CSIC and the Instituto de Investigación Sanitaria La Fe (IIS-La Fe).

The work led by the Pathogenomics group of I2SysBio in collaboration with the Viral Biology group of the same institute offers an innovative perspective on certain rare genetic changes in the spike protein of SARS-CoV-2, the 'key' used by the coronavirus to infect our cells. The research focused on so-called deletion repair events in this protein, in which the virus appears to correct its genome.

After large-scale data mining of GISAID, the SARS-CoV-2 genome database most widely used during the pandemic, the team discovered that several of the initial findings were likely due to errors introduced by data processing in large genetic databases. The computational methods used to analyze millions of viral sequences can introduce errors, creating the impression that the virus repairs its deletions more often than it actually does. By comparing this already processed data with information obtained directly from genome sequencing (sequencing reads), the researchers obtained a more realistic view of the genetic changes undergone by the virus.
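As a rough illustration of what checking a processed sequence against its sequencing reads can look like, here is a minimal Python sketch. It is not the method used in the study: the `ReadSupport` type, the `classify_repair` function, the category labels and both thresholds are hypothetical choices made for this example.

```python
from dataclasses import dataclass


@dataclass
class ReadSupport:
    """Read counts at one deletion site (hypothetical structure)."""
    with_deletion: int     # reads that skip the deleted bases
    without_deletion: int  # reads that contain the bases (apparent repair)


def classify_repair(
    consensus_shows_repair: bool,
    support: ReadSupport,
    min_depth: int = 20,         # assumed threshold, not from the study
    min_fraction: float = 0.8,   # assumed threshold, not from the study
) -> str:
    """Label a repair call in a processed sequence using the raw reads."""
    if not consensus_shows_repair:
        return "no repair reported"
    depth = support.with_deletion + support.without_deletion
    if depth < min_depth:
        # Too few reads cover the site to say anything either way.
        return "unverifiable"
    if support.without_deletion / depth >= min_fraction:
        return "confirmed"
    return "likely artifact"


# Hypothetical sites: the processed sequence reports a repair in each case.
print(classify_repair(True, ReadSupport(with_deletion=2, without_deletion=98)))
print(classify_repair(True, ReadSupport(with_deletion=95, without_deletion=5)))
print(classify_repair(True, ReadSupport(with_deletion=3, without_deletion=2)))
```

With these invented counts, the first site is labelled confirmed, the second a likely artifact, and the third unverifiable because too few reads cover it.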

Fewer than 60% of repair events confirmed

“Using the GISAID gene sequence repository, we estimated a very high frequency of these deletion repair events, which are expected to be rare,” explains Mireia Coscollá Devís, the CSIC researcher who led the study. “We realized that the sequences in the GISAID database are processed differently by each laboratory and contain many false positives for these types of markers. So, although in certain cases we were able to confirm that this was a real phenomenon, in most cases it was a consequence of the processing of the sequences,” she reveals.

Thus, “we saw that less than 60 percent of the deletion repair events could be confirmed. Although we have not been able to quantify it exactly for all of them, we can compare the proportions of the marker in various databases, and we see that it is 5 to 51 times less frequent than it appeared in the processed databases,” calculates the CSIC researcher.
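The comparison described in the quote comes down to a ratio of two proportions. A minimal Python sketch follows, using invented counts purely for illustration. None of these numbers come from the study, and `proportion` and `fold_reduction` are hypothetical helper names.

```python
from fractions import Fraction


def proportion(marker_count: int, total: int) -> Fraction:
    """Share of sequences that carry the marker."""
    if total <= 0:
        raise ValueError("total must be positive")
    return Fraction(marker_count, total)


def fold_reduction(p_processed: Fraction, p_reads: Fraction) -> float:
    """How many times less frequent the marker is in read-level data."""
    if p_reads == 0:
        raise ValueError("marker absent from read-level data; ratio undefined")
    return float(p_processed / p_reads)


# Hypothetical counts, chosen only to make the arithmetic easy to follow.
processed = proportion(500, 100_000)  # 0.5% of processed sequences
reads = proportion(10, 100_000)       # 0.01% of read-level datasets

print(fold_reduction(processed, reads))  # 50.0
```

In this made-up case the marker would be 50 times less frequent in the read-level data than in the processed database, which falls within the 5 to 51 range quoted above.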

Although these repair events are rare, the study shows that, when they occur, they can subtly affect the behavior of the virus. “For example, certain repairs can modify the way in which the virus enters cells or influence the response to antibodies generated by vaccination,” says Coscollá, something that the research team demonstrated through in vitro experiments.

Sharing pathogen genomic data

Thus, “our research highlights the importance of carefully examining genetic data to avoid erroneous conclusions,” says the CSIC researcher. The World Health Organization (WHO) recommends a policy of pathogen genomic data exchange to protect public health. However, in Spain there is no central collection of human, animal and environmental pathogen sequence data, nor is there a policy for the exchange of anonymized data between health and scientific institutions. This makes it difficult to monitor and respond to infectious diseases, including the monitoring of antimicrobial resistance, the researchers point out.

The work was funded by the Ministry of Science, Innovation and Universities and by the European Union with NextGenerationEU/PRTR funds through CSIC's PTI+ Global Health. It is also supported by the Generalitat Valenciana and the European Social Fund through grant CIACIF/2022/333. The computational work was performed on Garnatxa, the high-performance computing (HPC) cluster of I2SysBio.

 

Reference:

Miguel Álvarez-Herrera, Paula Ruiz-Rodriguez, Beatriz Navarro-Domínguez, Joao Zulaica, Brayan Grau, María Alma Bracho, Manuel Guerreiro, Cristóbal Aguilar-Gallardo, Fernando González-Candelas, Iñaki Comas, Ron Geller, Mireia Coscollá, Genome data artifacts and functional studies of deletion repair in the BA.1 SARS-CoV-2 spike protein, Virus Evolution, 2025; https://doi.org/10.1093/ve/veaf015

Image: After mining the SARS-CoV-2 database most used in the pandemic, errors were discovered. / Pixabay

Source: Delegación Institucional del CSIC

https://delegacion.comunitatvalenciana.csic.es/ca/troben-errors-en-les-sequencies-genetiques-del-coronavirus-incloses-en-la-major-base-de-dades-mundial/
