Viral metagenomics uses metagenomic technologies to detect viral genomic material in diverse environmental and clinical samples.[1] [2] Viruses are the most abundant biological entities and are extremely diverse; however, only a small fraction of viruses have been sequenced, and an even smaller fraction have been isolated and cultured.[3] Sequencing viruses can be challenging because viruses lack a universally conserved marker gene, which limits gene-based approaches.[4] Metagenomics can be used to study unculturable viruses and has been an important tool for understanding viral diversity and abundance and for discovering novel viruses.[5] [6] For example, metagenomic methods have been used to describe viruses associated with cancerous tumors and viruses in terrestrial ecosystems.[7]
Traditional methods for discovering, characterizing, and taxonomically classifying viruses were based on isolating the virus particle or its nucleic acid from samples.[8] Virus morphology could be visualized using electron microscopy, but only if the virus could be isolated at a high enough titer to be detected. The virus could be cultured in eukaryotic cell lines or bacteria, but only if the appropriate host cell type was known. The viral nucleic acid could be detected using PCR, but only if a consensus primer was known.
Metagenomics requires no prior knowledge of the viral genome: it does not depend on a universal marker gene or on primer or probe design. Because the method uses prediction tools to detect the viral content of a sample, it can identify new virus species or divergent members of known species.
The earliest metagenomic studies of viruses were carried out on ocean samples in 2002. The sequences that matched reference sequences were predominantly from double-stranded DNA bacteriophages and double-stranded DNA algal viruses.[9]
In 2016, the International Committee on Taxonomy of Viruses (ICTV) officially recognized that viral genomes assembled from metagenomic data can be classified using the same procedures as viruses isolated via classical virology approaches.[10]
In the 2002 metagenomic study, the researchers found that 65% of the sequences of DNA and RNA viruses had no matches in the reference databases. Viral sequences with no match in reference databases are common in viral metagenomic studies and are referred to as “viral dark matter”. The phenomenon is caused predominantly by the lack of complete viral genome sequences from diverse samples in reference databases and by the rapid rate of viral evolution.
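As a minimal illustration of how such an unmatched fraction is computed, the sketch below assumes that every sequence has already been searched against a reference database and is represented by the name of its best hit, or by `None` when nothing matched. The function name and the toy data are hypothetical and are not taken from the 2002 study.

```python
from typing import Optional, Sequence


def dark_matter_fraction(best_hits: Sequence[Optional[str]]) -> float:
    """Return the fraction of sequences that have no database match."""
    if not best_hits:
        raise ValueError("need at least one sequence")
    unmatched = sum(1 for hit in best_hits if hit is None)
    return unmatched / len(best_hits)


# Toy data (hypothetical): 20 sequences, 13 of which have no hit.
hits = ["phage_A"] * 5 + ["algal_virus_B"] * 2 + [None] * 13
print(f"{dark_matter_fraction(hits):.0%} of sequences are unmatched")
```

In practice the result depends heavily on the database version and on the similarity threshold used to decide what counts as a match.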
Adding to these challenges, the Baltimore classification system groups viruses into seven classes based on their genome structure and their manner of transcription: double-stranded DNA viruses, single-stranded DNA viruses, double-stranded RNA viruses, positive-sense single-stranded RNA viruses, negative-sense single-stranded RNA viruses, and two classes of reverse-transcribing viruses (one with single-stranded RNA genomes and one with double-stranded DNA genomes).[11] These different nucleic acid types need different sequencing approaches, and no marker gene is conserved across all virus types. Gene-based approaches can only target specific groups of viruses (such as RNA viruses that share a conserved RNA polymerase sequence).
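The seven Baltimore classes can be written down as a small lookup structure, which makes it easy to see which of them carry RNA genomes and therefore need a reverse transcription step before a DNA sequencing library can be prepared. The following is only a sketch; the names `BaltimoreGroup` and `requires_cdna_step` are hypothetical.

```python
from enum import Enum


class BaltimoreGroup(Enum):
    """The seven Baltimore classes, keyed by their Roman numeral."""
    I = "dsDNA"
    II = "ssDNA"
    III = "dsRNA"
    IV = "(+)ssRNA"
    V = "(-)ssRNA"
    VI = "ssRNA-RT"
    VII = "dsDNA-RT"


# Classes whose virions carry an RNA genome.
RNA_GENOME_GROUPS = {
    BaltimoreGroup.III,
    BaltimoreGroup.IV,
    BaltimoreGroup.V,
    BaltimoreGroup.VI,
}


def requires_cdna_step(group: BaltimoreGroup) -> bool:
    """True if the genome is RNA and must be reverse transcribed first."""
    return group in RNA_GENOME_GROUPS


for group in BaltimoreGroup:
    print(group.name, group.value, requires_cdna_step(group))
```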
Reference databases are still biased towards DNA viruses. Common reasons for this bias are that RNA viruses mutate more rapidly than DNA viruses, that DNA is easier to handle in samples whereas RNA is unstable, and that RNA metagenomic analysis needs additional steps such as reverse transcription.
Sequences can be contaminated with the host organism's sequences, which is particularly troublesome if the host organism of the virus is unknown. Contamination can also be introduced during nucleic acid extraction and PCR.
Metagenomic analysis uses whole genome shotgun sequencing to characterize microbial diversity in clinical and environmental samples. Total DNA and/or RNA is extracted from the samples and prepared as a DNA or RNA library for sequencing.[12] These methods have been used to sequence the whole genomes of Epstein–Barr virus (EBV) and hepatitis C virus (HCV); however, contaminating host nucleic acids can reduce sensitivity for the target viral genome, and the proportion of reads that derive from the target sequence is often low.[13] [14]
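The effect of a low on-target proportion can be illustrated with a simple mean-coverage estimate: the number of viral reads multiplied by the read length, divided by the genome length. The sketch below uses this textbook approximation; the function name and all numbers are illustrative assumptions (including the approximate EBV genome length of about 172 kb) and do not come from the cited studies.

```python
def expected_viral_coverage(
    total_reads: int,
    viral_fraction: float,
    read_length: int,
    genome_length: int,
) -> float:
    """Estimate mean depth of coverage over the viral genome.

    Assumes reads are spread uniformly over the genome, which real
    data sets only approximate.
    """
    if not 0.0 <= viral_fraction <= 1.0:
        raise ValueError("viral_fraction must be between 0 and 1")
    if total_reads < 0 or read_length <= 0 or genome_length <= 0:
        raise ValueError("counts and lengths must be positive")
    viral_reads = total_reads * viral_fraction
    return viral_reads * read_length / genome_length


# Hypothetical run: 10 million reads, 1 in 10,000 of them viral,
# 150-base reads, and a genome of roughly 172,000 bases.
coverage = expected_viral_coverage(10_000_000, 0.0001, 150, 172_000)
print(f"mean coverage: {coverage:.2f}x")
```

Under these assumptions even a large run yields less than one-fold coverage of the viral genome, which is why host depletion or enrichment steps are often considered.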
The IMG/VR system and IMG/VR v.2.0 are the largest interactive public virus databases, with over 760,000 metagenomic viral sequences and isolate viruses, and serve as a starting point for the sequence analysis of viral fragments derived from metagenomic samples.[15] [16]
While untargeted metagenomics and metatranscriptomics do not need a genetic marker, amplicon sequencing does. It uses a highly conserved gene as a genetic marker, but because of the varied nucleic acid types, the marker has to be chosen for specific groups of viruses. The marker is amplified by PCR using primers that are complementary to a known, highly conserved nucleotide sequence. PCR is then followed by whole genome sequencing methods; this approach has been used to track the Ebola virus,[17] Zika virus, and COVID-19[18] epidemics. PCR amplicon sequencing is more successful for whole genome sequencing of samples with low viral concentrations. However, for larger viral genomes and for heterogeneous RNA viruses, multiple overlapping primers may be required to amplify all genotypes. PCR amplicon sequencing requires prior knowledge of the viral genome and appropriate primers, and is highly dependent on viral titers; nevertheless, it is cheaper than metagenomic sequencing when studying known viruses with relatively small genomes.
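The idea of covering a whole genome with overlapping amplicons can be sketched as a simple tiling calculation. The example below only computes amplicon coordinates; it does not choose primer sequences or check their melting temperatures, and the function name and parameters are hypothetical.

```python
from typing import List, Tuple


def tile_amplicons(
    genome_length: int, amplicon_length: int, overlap: int
) -> List[Tuple[int, int]]:
    """Return half-open (start, end) intervals that cover the genome.

    Consecutive amplicons share `overlap` bases. The final amplicon
    is truncated at the end of the genome and may be shorter.
    """
    if genome_length <= 0 or amplicon_length <= 0:
        raise ValueError("lengths must be positive")
    if not 0 <= overlap < amplicon_length:
        raise ValueError("overlap must be in [0, amplicon_length)")
    step = amplicon_length - overlap
    amplicons = []
    start = 0
    while True:
        end = min(start + amplicon_length, genome_length)
        amplicons.append((start, end))
        if end == genome_length:
            break
        start += step
    return amplicons


# Hypothetical 1,000-base genome, 400-base amplicons, 100-base overlap.
tiles = tile_amplicons(1000, 400, 100)
print(tiles)
```

Each amplicon would need its own primer pair, so the number of primers grows with genome length, which is consistent with the approach being most economical for relatively small genomes.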
Target enrichment is a culture-independent method that sequences viral genomes directly from a sample using small RNA or DNA probes complementary to the pathogen's reference sequence. The probes, which can be bound to a solid phase, capture and pull down complementary DNA sequences in the sample. Using overlapping probes increases the tolerance for primer mismatches, but probe design is costly and time-consuming, which limits rapid response. DNA capture is followed by brief PCR cycling and shotgun sequencing. The success of this method depends on the availability of reference sequences from which to create the probes, so it is not suitable for characterizing novel viruses. This method has been used to characterize both large and small viruses, such as HCV, HSV-1,[19] and HCMV.[20]
Viral metagenomic methods can produce erroneous chimeric sequences.[21] [22] These can include in vitro artifacts from amplification and in silico artifacts from assembly.[22] Chimeras can form between unrelated viruses, as well as between viral and eukaryotic sequences.[22] The likelihood of errors is partially mitigated by greater sequencing depth, but chimeras can still form in areas of high coverage if the reads are highly fragmented.[21]
Plant viruses pose a global threat to crop production, but metagenomic sequencing and the creation of viral databases have made it possible to use modified plant viruses to aid plant immunity and to alter the physical appearance of plants.[23] Plant virus genome data obtained from metagenomic sequencing can be used to create clones of viruses with which to inoculate plants, allowing viral components to be studied and viral agents to be biologically characterized with increased reproducibility. Engineered mutant virus strains have been used to alter the coloration and size of various ornamental plants and to promote the health of crops.[24]
Viral metagenomics contributes to viral classification without the need for culture-based methods and has provided extensive insight into viral diversity across many systems. Metagenomics can be used to study viruses' effects on a given ecosystem and how they affect the microbiome, as well as to monitor viruses in an ecosystem for possible spillover into human populations. Within ecosystems, viruses can be studied to determine how they compete with each other and how they affect host functions. Viral metagenomics has been used to study unculturable viral communities in marine and soil ecosystems.[25]
Viral metagenomics is widely used to discover novel viruses, with a major focus on those that are zoonotic or pathogenic to humans. Viral databases built from metagenomic data provide rapid means of identifying viral infections and of detecting drug-resistant variants in clinical samples. The contributions of viral metagenomics to viral classification have aided pandemic surveillance efforts and made infectious disease surveillance and testing more affordable.[26] Since the majority of human pandemics are zoonotic in origin, metagenomic surveillance can provide faster identification of novel viruses and their reservoirs.
One such surveillance program is the Global Virome Project (GVP), an international collaborative research initiative based at the One Health Institute at the University of California, Davis.[27] [28] The GVP aims to boost infectious disease surveillance around the globe by using low-cost sequencing methods in high-risk countries to prevent future virus outbreaks.
Viral metagenomics has been used in clinical diagnostics to test for virus-related cancers and in difficult-to-diagnose cases.[29] This method is most often used when conventional and advanced molecular testing cannot find a causative agent of disease. Metagenomic sequencing can also be used to detect pathogenic viruses in clinical samples and to provide real-time data on a pathogen's presence in a population.
The methods used for clinical viral metagenomics are not standardized, but guidelines have been published by the European Society for Clinical Virology. A variety of sequencing platforms are used for clinical viral metagenomics, the most common being instruments from Illumina and Oxford Nanopore Technologies. Several different protocols are also in use, both for wet lab work and for bioinformatic analysis.