From masses of data previously uploaded to databases, scientists have identified nearly 162,000 new species of RNA virus
An artificial intelligence tool has helped scientists discover unknown virus species at an unprecedented speed from data previously uploaded to databases, according to a joint study by researchers in mainland China, Hong Kong and Australia.
The team said the discovery of nearly 162,000 new species of RNA virus in different environments – including in the atmosphere, hot springs and hydrothermal vents – highlighted their diversity and resilience in harsh conditions, while potentially offering clues to how viruses and other elemental life forms came to be.
By analysing previously unrecognised genetic sequence data in public databases, the machine learning tool identified viruses based on their sequences and on hidden structural information in the proteins that RNA viruses use for replication. It can determine whether a sequence represents an RNA virus species in one second or less.
The tool uses an algorithm developed by the Alibaba Cloud Intelligence team and was built in partnership with a team of virologists. Alibaba is the owner of the South China Morning Post.
“We developed a data-driven deep learning model that outperforms conventional methods in accuracy, efficiency and, most importantly, the breadth of virus diversity detected,” the team wrote in an article published in the peer-reviewed journal Cell on Wednesday.
They said the study was the largest virus species discovery ever published, measured by the number of species reported in a single paper.
Co-lead author Shi Mang, a virologist and professor at the Sun Yat-sen University School of Medicine, said the AI tool not only accelerated virus discovery, which would be tedious and time-consuming using traditional methods, but also enabled scientists to explore the realm of previously unknown viruses.
“All the viruses discovered in this study exist in the environment and were sequenced. Our previous methods were not able to identify them, leaving them as ‘dark matter’ to scientists,” he said, referring to sequences that cannot be isolated for study or found to be related to known viruses.
“The AI tool fills this gap for us with high accuracy comparable to conventional methods in bioinformatics. It can uncover ‘dark matter’ sequences, along with those more closely related to established viral groups,” he said.
Shi said the discovery would advance future research by laying a foundation for the study of virus diversity.
“For example, the study informs us of the existence of viruses in extreme conditions, such as hot springs. This knowledge will enable scientists to develop more comprehensive descriptions of the ecology in various ecosystems,” he said.
“As for pathogenicity of viruses, we will be able to study further how viruses interact with their hosts, as well as identify the groups of viruses that can infect a particular host.”
Co-lead author Li Zhaorong, who researches computational biology in Alibaba Cloud Intelligence’s Apsara Lab, said the study showed that the deep learning algorithm could effectively perform tasks in biological exploration.
He said the team continued to update the tool with advanced AI technologies such as new pre-trained models for nucleotide and protein analysis.
“We are also re-examining classic problems in virology with a new AI and data-driven mindset, such as the structure and functions of viruses and the relationship between viruses and humans and animals,” he said.
The team said it would continue to train the model to uncover more virus diversity, and that the same approach could be applied to identifying bacteria and parasites.