Whole Genome Sequence Analysis of Viruses; Moving Beyond Single/Partial Gene Based Phylogenies in Context of Epidemiology and Genetic Evolution

The enormous number of viral species in nature raises questions about their origin and makes it necessary to name them and organize them into hierarchically arranged systematic units. The rapid evolution of viruses, RNA viruses in particular, has led to the emergence of many new genotypes. Sequencing of whole genomes, genes or gene fragments is increasingly used to understand epidemiology. Most accepted modern phylogenies are derived from sequences of individual homologous genes, which may sometimes fail to produce the true phylogenetic tree. The first problem is that the evolutionary history of a particular gene is not necessarily the same as that of the virus in which it is observed. This may be due to duplication and deletion, or to horizontal gene transfer between different viruses. Secondly, it is not always possible to find genes that are sufficiently conserved across all viruses of interest to be identified reliably, yet sufficiently diverged to be useful for phylogenetic analysis. One of the most pervasive challenges in molecular phylogenetics is the incongruence between phylogenies obtained from different data sets, such as individual genes. Whole genome based phylogenetic analysis has helped to characterize novel viruses, uncover the population history of disease, elucidate virus-host interactions, estimate evolutionary rates, monitor gene reassortment and interspecies transmission between different viral strains, and produce estimates of epidemiological parameters. In this review, we discuss efforts that have been made to infer phylogenies by considering viruses at the genome level rather than at the level of individual genes.


A phylogeny inferred from a gene or protein sequence describes only the evolution of that particular gene or encoded protein. This sequence may evolve more or less rapidly than other genes in the genome, or may have a different evolutionary history from the rest of the genome owing to horizontal gene transfer events. Single gene phylogenies were used because genome sequence information was incomplete and the available computer programs had inherent limitations; however, many studies have shown that, within many viruses, the evolutionary histories of individual genes or genomic regions may differ from one another, which may be due to recombination, reassortment and selection pressure (Herniou et al., 2001; Magiorkinis et al., 2004; Olvera et al., 2007; Olvera et al., 2010; Anderson et al., 2010; Tatte et al., 2010). Earlier, the high cost of sequencing and less efficient platforms meant that complete genome sequencing was not feasible for every virus. The subsequent fall in cost and the advent of high-throughput sequencing have led to a surge of whole genomes in the public databases. Phylogenetic analysis based on a single conserved or similar gene is prone to ambiguity, because horizontal gene transfer (HGT) between viruses, gene duplication and gene capture from the host appear to have been frequent in large DNA viruses (Herniou et al., 2001; Filee et al., 2003; Shackelton et al., 2004). It is therefore important to consider the possibility of genetic recombination when evaluating apparent phylogenetic relationships between viral strains.

Complete genome sequences contain phylogenetic information at several levels. In addition to the nucleotide sequence and the amino acid sequences of the encoded proteins, the gene content and the order of genes on a genome may be phylogenetically informative (Koonin et al., 2000; Rokas et al., 2000). Gene content and gene order data sets are independent of the sequences of individual genes and should complement phylogenies based on nucleotide or amino acid sequences. Several attempts have been made to infer viral phylogeny from whole genomes in order to avoid the problems of gene rearrangement, gene loss, gene duplication and lateral gene transfer. Some of them infer the majority consensus tree of many individual gene trees or use the combined sequences of many shared genes. Others employ gene content or gene order methods, but the former must correct for the genome size effect and the latter can be hindered by a lack of synteny conservation or by variation in the rate at which synteny evolves between taxa (Montague et al., 2000; Gao et al., 2003; Harrison et al., 2003; Snel et al., 2005). Therefore, several methods, such as phylogenetic networks, have been developed to infer these evolutionary processes, including recombination events. However, all these analyses can provide accurate estimates only when whole-genome sequences are available (Olvera et al., 2010). Molecular epidemiology using whole genome sequences of pathogens reveals more precise phylogenetic relationships than gene or partial sequences, giving a more accurate picture of the geographical and evolutionary origin of viral isolates. Phylogenetic analysis is a prerequisite for virus tracing and thus allows more effective control measures to be implemented.
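The genome size effect that gene content methods must correct for can be illustrated with a minimal sketch. It assumes that each genome has already been reduced to a set of gene-family identifiers (the identifiers `g1` to `g7` and the three genomes below are hypothetical) and it normalises the number of shared genes by the size of the smaller genome. This is one possible correction, shown next to an uncorrected Jaccard distance for contrast; it is not necessarily the procedure used in the studies cited above.

```python
def gene_content_distance(genes_a, genes_b):
    """Size-corrected distance: 1 - shared genes / size of the smaller genome."""
    a, b = set(genes_a), set(genes_b)
    if not a or not b:
        raise ValueError("each genome needs at least one gene")
    return 1.0 - len(a & b) / min(len(a), len(b))


def jaccard_distance(genes_a, genes_b):
    """Uncorrected distance: 1 - shared genes / genes present in either genome."""
    a, b = set(genes_a), set(genes_b)
    if not a and not b:
        raise ValueError("both genomes are empty")
    return 1.0 - len(a & b) / len(a | b)


# Hypothetical gene-family sets for three genomes of different sizes.
genome_x = {"g1", "g2", "g3", "g4", "g5"}
genome_y = {"g1", "g2", "g3"}          # a strict subset of genome_x
genome_z = {"g1", "g4", "g6", "g7"}

# genome_y shares all of its genes with genome_x: the corrected distance is 0,
# whereas the Jaccard distance is inflated purely by the difference in size.
print(gene_content_distance(genome_x, genome_y), jaccard_distance(genome_x, genome_y))
print(gene_content_distance(genome_x, genome_z))
```

A matrix of such pairwise distances could then be passed to any distance-based tree-building method.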

NEXT GENERATION SEQUENCING
With the advent of next generation sequencing (NGS) technologies and the falling cost of sequencing, a paradigm shift has taken place from the traditional Sanger method to whole genome sequencing. Compared with the traditional Sanger capillary sequencer, next-generation sequencers are capable of massively parallel sequencing of millions of amplified DNA molecules in a single run, and they do not require conventional cloning and amplification. Next generation sequencing is currently driven by the 454 GS FLX Titanium (Roche), Genome Analyzer II (Illumina/Solexa), ABI SOLiD (Life Technologies/Applied Biosystems), Polonator G.007 (Danaher Motion), HeliScope (Helicos BioSciences), PacBio RS (Pacific Biosciences; SMRT, single molecule real time sequencing technology) and Ion Torrent (Life Technologies) platforms (Table 1). Despite their different configurations, next-generation sequencers share many common features: (1) a relatively small amount of starting DNA (a few micrograms) is needed, (2) fragmented DNA templates are ligated to specific adaptors at both ends, (3) multiple PCR amplification cycles are performed, (4) amplified DNA templates are attached to a solid support in a reaction chamber or a flow cell, (5) during the extension cycles, sequencing reagents are repeatedly applied and washed away and (6) the number of extension cycles is often limited, producing shorter reads of 35-250 bases compared with the 650-800 bases obtained by Sanger capillary sequencing. Because of their high-throughput capacity, next-generation sequencers are better suited to studies dealing with whole genomes and are replacing Sanger sequencing in many situations.

NGS PLATFORMS
The sequencing platforms most commonly used at present are the Roche 454 FLX pyrosequencer, the Illumina sequencer and the ABI SOLiD system. The Roche 454 sequencing technology combines the principles of emulsion PCR and pyrosequencing. The steps involve fragmentation of the template DNA, ligation to adaptors and clonal amplification of the DNA by emulsion PCR. The emulsion beads are then deposited in picotiter plate wells containing smaller beads that carry the sequencing enzymes and buffers required for iterative pyrosequencing, which translates each nucleotide incorporation event into a well-specific, pyrophosphate-driven luminescence signal. The newer and more robust 'Titanium' chemistry can generate 1 × 10⁶ sequence reads of longer length (≥400 bp), yielding 500 million bp of sequence per run.

The Illumina/Solexa Genome Analyzer was the first commercially available 'short read' sequencing platform; it performs sequencing-by-synthesis using reversible terminators. Fragmented ssDNA is hybridized to oligonucleotide anchors on a solid surface referred to as a 'flow cell'. Solid-phase bridge amplification of the DNA templates generates amplified clusters. Massively parallel sequencing of cleaved products from the amplified clusters is carried out using DNA polymerase and a set of four base-specific, color-coded reversible terminators that are incorporated into the growing oligonucleotide chains. This platform originally produced 35-bp reads and yielded 1 Gb of sequence output per 2-3-day run. Subsequent upgrades have increased both the density of clusters and the read lengths, so that the machine can currently yield 4 Gb of sequence output in a 2-3-day run.

The third commercially available platform is the ABI SOLiD platform, which uses hybridization-ligation methodologies for massively parallel sequencing. The initial emulsion PCR step is the same as that in the Roche 454 platform, except that the beads are only 1 μm in size. The amplified product on the beads is then covalently linked to a glass surface, and sequencing is carried out by hybridization-ligation with an octamer interrogation probe consisting of two probe-specific, three degenerate and three promiscuous bases. Each nucleotide position is ascertained using a four-dye encoding scheme, and each position is interrogated twice to distinguish sequencing errors from single nucleotide polymorphisms (SNPs). When first available (early 2007), this platform produced short reads of 35 bp and 1-3 Gb of sequence data per 8-day run. The current upgrade of this system (now the SOLiD 3) supports a much higher density of beads and has an output of 20-40 Gb per 8-10-day sequence run.

NGS DATA ANALYSIS
NGS experiments generate unprecedented volumes of data, which present challenges and opportunities for data management, storage and, most importantly, analysis. Data volumes generated during single runs of the 454 GS FLX, Illumina and SOLiD instruments are approximately 15 GB, 1 TB and 15 TB, respectively. A large variety of software programs for alignment and assembly have been developed and made available to the research community. Most run on the Linux operating system, and a few are available for Windows. Many require a 64-bit operating system and can use 16 GB of RAM and multiple central processing unit cores. The range of data volumes, hardware, software packages and settings leads to processing times from a few minutes to multiple hours, emphasizing the need for sufficient computational power. Although a growing number of alignment and assembly algorithms are available, there remains a trade-off between speed and accuracy: many, but not all, possible alignments are evaluated, and a balance has to be struck between the ideal alignment and computational efficiency.

SOFTWARE AND BIOINFORMATICS TOOLS FOR DATA ANALYSIS
A

TECHNICAL PROBLEMS AND LIMITATIONS OF NGS
Several technical problems are common to the various NGS platforms. The short reads produced by many NGS systems make assembly and mapping to reference sequences difficult, particularly in repetitive regions. Not all sequences are processed and sequenced equally, and GC-rich DNA regions are particularly prone to low coverage. For NGS platforms that rely on target amplification or enrichment, amplification bias may be introduced. Last but not least, sequencing errors are present in essentially all NGS platforms. Longer reads are prone to errors, particularly towards their ends. Repetitive sequences and homopolymers are also of concern for some third generation sequencers; however, rapid improvements have been made to overcome these problems. Increased coverage and deep sequencing are important for correcting some of these problems.
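The role of coverage can be made concrete with a short calculation. The sketch below uses the standard idealisation that reads are sampled uniformly at random, so that the mean depth is c = N × L / G (N reads of length L over a genome of length G) and the expected fraction of bases left uncovered is approximately e^(-c). The genome size and read numbers are hypothetical, and real data deviate from this model precisely because of the GC and amplification biases described above.

```python
import math


def mean_coverage(n_reads, read_length, genome_length):
    """Mean sequencing depth c = N * L / G under uniform random sampling."""
    if genome_length <= 0:
        raise ValueError("genome_length must be positive")
    return n_reads * read_length / genome_length


def expected_uncovered_fraction(coverage):
    """Poisson approximation: probability that a given base has zero reads."""
    return math.exp(-coverage)


# Hypothetical example: a 10 kb viral genome sequenced with 100-base reads.
for n_reads in (100, 500, 1000):
    c = mean_coverage(n_reads, 100, 10_000)
    print(f"{n_reads} reads: depth {c:.1f}x, "
          f"expected uncovered fraction {expected_uncovered_fraction(c):.2e}")
```

Under this model the uncovered fraction falls exponentially with depth, which is why modest increases in coverage close most random gaps but do not remove systematic low-coverage regions.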

WHOLE GENOME SEQUENCING OF VIRUSES OF VETERINARY IMPORTANCE
In recent years, several pathogens of veterinary importance have been sequenced worldwide. Whole-genome sequencing of microbes has revolutionized the methods by which these organisms are studied and has heightened expectations regarding the ability to predict potential targets for antimicrobial agents and vaccines.

METHODS OF PHYLOGENETIC TREE CONSTRUCTION
Neighbour-joining, maximum likelihood and Bayesian approaches are the methods most commonly used for constructing phylogenetic trees. The accuracy of a tree-building method depends on the assumptions on which it is based. Understanding these assumptions is the first step toward efficient use of these methods. The second step is understanding how the methods actually work and what intrinsic limitations they have. The third step is choosing suitable phylogenetic method(s) that can give a reasonably correct picture of a phylogenetic tree. Neighbour-joining is a distance-based method. It is extremely fast and has been advocated for the analysis of large datasets. However, recovery of the true tree is guaranteed only if the distance matrix is correct, and the calculation of genetic distances is complicated by biological processes such as rate heterogeneity. It is thus not recommended for finding the final tree.
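To show what a distance matrix entry involves, the following minimal sketch computes the uncorrected proportion of differing sites (p-distance) and the Jukes-Cantor (JC69) corrected distance, d = -(3/4) ln(1 - 4p/3), for a pair of aligned nucleotide sequences. The two sequences are hypothetical, sites containing gaps or ambiguity codes are simply skipped (an assumption of this sketch), and JC69 assumes equal rates across sites, so it does not address the rate heterogeneity mentioned above.

```python
import math


def p_distance(seq_a, seq_b):
    """Proportion of differing sites, ignoring sites that are not A, C, G or T."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(x, y) for x, y in zip(seq_a.upper(), seq_b.upper())
             if x in "ACGT" and y in "ACGT"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(x != y for x, y in pairs) / len(pairs)


def jc69_distance(seq_a, seq_b):
    """Jukes-Cantor distance; infinite when the sequences are saturated."""
    p = p_distance(seq_a, seq_b)
    if p >= 0.75:
        return math.inf
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


# Hypothetical aligned fragments differing at one of ten sites.
strain_1 = "ACGTACGTAC"
strain_2 = "ACGTACGTAT"
print(p_distance(strain_1, strain_2), jc69_distance(strain_1, strain_2))
```

The corrected distance is always at least as large as the p-distance because it accounts for multiple substitutions at the same site.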
Maximum likelihood (ML), a character-based method, has also been used for phylogenetic analysis, for example of pestiviruses. Under a given evolutionary model, the most probable tree is found by an optimality criterion based on the character (nucleotide) at each position of a set of sequences. The disadvantages of ML are that it is computationally intensive when dealing with many taxa and may yield unreliable results with complex parameter-rich models. The robustness of the so-called best tree can be estimated statistically by bootstrapping the original dataset (e.g. 1000 replicates), and a value of more than 70% is thought to indicate support for a group on the tree.

The Bayesian approach has been developed more recently for inferring phylogeny and has rapidly gained acceptance in phylogenetics. In contrast to the traditional ML method, which gives only the topology of a tree, Bayesian analysis produces both a tree estimate and a measure of uncertainty for the groups on the tree, thus providing a measure of support faster than ML bootstrapping. By using a Markov chain Monte Carlo (MCMC) algorithm, Bayesian phylogenetic inference allows complex parameter-rich evolutionary models to be implemented.

Phylogenetic tree reconstruction is not a trivial matter but a complicated process that often requires careful thought. Accuracy, reliability and computational speed are all major factors to consider when choosing a particular phylogenetic method. None of the three phylogenetic reconstruction methods is guaranteed to find the correct tree, and all three have the potential to produce erroneous trees. To minimize phylogenetic errors, it is recommended that at least two methods be used for any phylogenetic analysis, so that the consistency of the tree-building results can be checked.
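The bootstrap procedure referred to above can be sketched as follows. The example resamples alignment columns with replacement to create one pseudo-replicate and computes the percentage of replicate trees that contain a given group. Tree building itself is left to external software; the alignment, the strain names and the representation of each replicate tree as a set of clades are assumptions made only for illustration.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement (one bootstrap replicate)."""
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same aligned length")
    n_sites = lengths.pop()
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(seq[i] for i in columns)
            for name, seq in alignment.items()}


def clade_support(clade, replicate_clades):
    """Percentage of replicate trees (each a set of clades) containing `clade`."""
    hits = sum(frozenset(clade) in clades for clades in replicate_clades)
    return 100.0 * hits / len(replicate_clades)


# Hypothetical four-strain alignment.
alignment = {
    "strain_A": "ACGTACGT",
    "strain_B": "ACGTACGA",
    "strain_C": "ACTTACGA",
    "strain_D": "GCTTACGA",
}
replicate = bootstrap_alignment(alignment, random.Random(1))
print(replicate)
```

In practice, a tree would be inferred from each of, say, 1000 replicates, and `clade_support` would then be applied to the clades of those trees; groups recovered in more than 70% of replicates would be regarded as supported under the threshold quoted above.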

WHOLE GENOME BASED APPROACH FOR PHYLOGENETIC ANALYSIS
Complete genome approaches have recently been employed to infer the phylogeny of many viruses. Wang et al. (2013) identified discordance between the full-length genome tree and individual gene trees in phylogenetic analyses of hepatitis E virus (HEV), Japanese encephalitis virus (JEV), measles virus (MV) and porcine circovirus 2 (PCV2). For all four viruses, the individual gene trees differed in both topology and branch lengths, not only from the corresponding genome tree but also from the trees constructed from other genes in the same genome. In HEV, the trees of regions GO and KLY-B differed dramatically in topology from the genome tree, which resulted in misleading inferences about the genetic relationships of some strains. However, it was hard to determine which of the regions SGG-A, MJ-C and MXJ produced the tree most similar to the full genome tree. In JEV, the trees of the E, NS1 and NS5 genes agreed well with the genome tree. However, it was hard to assess which of the genes NS2a, NS2b, NS3, NS4b and PreM yielded a tree more concordant with the genome tree, because the discordances involved different virus strains. The cap gene tree displayed a topology that clearly disagreed with the genome tree and the other gene trees. The MV trees based on the P, M, V and C genes did not match the genome tree well, whereas the L, N, H and F gene trees reproduced the topology of the full-length genome tree more closely, with reliable bootstrap support values. For PCV2, the cap gene tree was more similar to the genome tree than the rep gene tree in both topology and branch length. The tree based on the rep gene showed considerable incongruence with the cap and complete genome trees, leading to drastically disordered groupings of viral strains.
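Topological discordance of this kind can be quantified. The sketch below counts the clades found in only one of two rooted trees, a rooted variant of the Robinson-Foulds distance; it is offered as one common way to compare topologies and is not claimed to be the measure used by Wang et al. Trees are written as nested tuples, and the strain names and both topologies are hypothetical.

```python
def clades(tree):
    """Return the leaf set below every internal node of a rooted tree.

    A tree is a nested tuple whose leaves are strings, e.g. (("a", "b"), "c").
    """
    found = set()

    def leaves(node):
        if isinstance(node, str):
            return frozenset([node])
        members = frozenset().union(*(leaves(child) for child in node))
        found.add(members)
        return members

    leaves(tree)
    return found


def rooted_rf_distance(tree_a, tree_b):
    """Number of clades present in exactly one of the two trees."""
    return len(clades(tree_a) ^ clades(tree_b))


# Hypothetical genome tree and single-gene tree for four strains.
genome_tree = ((("s1", "s2"), "s3"), "s4")
gene_tree = ((("s1", "s3"), "s2"), "s4")
print(rooted_rf_distance(genome_tree, gene_tree))
```

A distance of zero means the two trees imply the same groupings; larger values indicate more conflicting clades. The measure ignores branch lengths, so it captures only the topological part of the discordance described above.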

Historically, a comprehensive phylogenetic analysis of 22 complete JC virus (JCV) genomes was accomplished for the first time by Jobes et al. (1998) using neighbour-joining, UPGMA and maximum parsimony methods. European Type 1 strains were found to have diverged from the other subtypes during evolution. Previously, phylogenies were based on the small V-T intergenic region (610 bp), the most variable region, which nevertheless shows little variability between closely related JCV strains and may not provide enough informative sites (Sugimoto et al., 1997). Utilizing the whole JCV genome minus the regulatory region (4854 bp) substantially increases the number of phylogenetically informative sites and more adequately resolves relationships between the JCV genotypes. Parsimony analysis showed that, of the 611 total characters in the V-T region data set (a single gap was required in the alignment), 534 sites were invariant, 36 were phylogenetically uninformative and only 41 were informative. In contrast, of the 4856 characters in the whole genome data set, 4523 were invariant between the strains, 161 were uninformative and 172 were phylogenetically informative. The whole genome approach therefore provides a fourfold increase in informative sites over the V-T region alone. This increase translates into a much better resolved phylogeny for JCV. V-T region sequences placed strain Tai-3 with the Type 3 group and assigned Type 2 strain g224A an ambiguous and unresolved position in the UPGMA and neighbour-joining trees.
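The three site categories used in this comparison can be computed directly from an alignment. The sketch below applies the usual parsimony definitions: a site is invariant if all sequences share one state, informative if at least two states each occur in at least two sequences, and otherwise variable but uninformative. The four short sequences are hypothetical, and gap characters are treated as an ordinary state, which is a simplifying assumption.

```python
from collections import Counter


def classify_sites(sequences):
    """Count invariant, uninformative and parsimony-informative columns."""
    if len({len(seq) for seq in sequences}) != 1:
        raise ValueError("all sequences must have the same aligned length")
    counts = {"invariant": 0, "uninformative": 0, "informative": 0}
    for column in zip(*sequences):
        states = Counter(column)
        if len(states) == 1:
            counts["invariant"] += 1
        elif sum(1 for n in states.values() if n >= 2) >= 2:
            counts["informative"] += 1
        else:
            counts["uninformative"] += 1
    return counts


# Hypothetical four-sequence alignment.
toy_alignment = ["AAGTC", "AAGTC", "ATGTA", "ATGGC"]
print(classify_sites(toy_alignment))

# Consistency check on the figures reported in the text.
assert 534 + 36 + 41 == 611 and 4523 + 161 + 172 == 4856
print(round(172 / 41, 1))  # ratio of informative sites, whole genome vs V-T region
```

The reported counts sum to the stated alignment lengths, and the ratio 172/41 is about 4.2, consistent with the fourfold increase described above.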
Most analyses of baculovirus phylogeny have been based on the polyhedrin/granulin gene, but other genes have also been used (Bulach et al., 1999; Bideshi et al., 2000). Comparison of these analyses reveals that conflicts are often observed between phylogenies based on different genes. These conflicts could be due to erroneous phylogenetic inferences caused by unequal rates of evolution, to the lack of an unambiguous phylogenetic signal in the sequences, or to recombination. Exchange of genetic material is known to occur between co-infecting baculoviruses and between baculoviruses and their hosts (Fraser et al., 1995).

The complete genome of the reference strain of bluetongue virus (BTV) serotype 16 (strain RSArrrr/16) was sequenced by Maan et al. (2012). Previous phylogenetic comparisons show that BTV RNA sequences cluster according to the geographic origin of the virus isolate or lineage, identifying distinct BTV topotypes. Sequence comparisons of segments Seg-1 to Seg-10 show that RSArrrr/16 belongs to the major eastern topotype of BTV (BTV-16e) and can be regarded as a reference strain of BTV-16e for phylogenetic and molecular epidemiology studies. All 10 genome segments of RSArrrr/16 group closely with those of the vaccine strain of BTV-16 (RSAvvvv/16) that was derived from it, as well as with those recently published for a Chinese isolate of BTV-16 (Yang et al., 2011) (>99% nucleotide identity), suggesting a very recent common ancestry for all three viruses.
The evolutionary dynamics of influenza A virus are shaped by a complex interplay between rapid mutation, frequent reassortment, widespread gene flow, natural selection (occasionally generating genome-wide selective sweeps), functional interactions among segments, and global epidemiological dynamics. Large-scale phylogenetic analyses based on whole genomes play a pivotal role in understanding the reassortment and evolution of influenza A virus. The co-existence and circulation of different lineages is a major hurdle in understanding the epidemiology of the virus. Whole-genome analysis of human H3N2 influenza A viruses sampled during 1999-2004 revealed multiple persistent lineages and reassortment among recent H3N2 viruses, and identified two key evolutionary patterns (Holmes et al., 2005). First, although the majority of viruses isolated after 2002 fall into a single phylogenetic group (clade A), multiple co-circulating viral lineages are present at particular time points. The genetic diversity of influenza A virus is therefore not as restricted as previously suggested, particularly when genes other than that encoding HA are analysed. This co-circulation of lineages is most apparent in the identification of three clades of H3N2 viruses that appear to have infected the same populations until 2002, after which they acquired a common HA gene through reassortment. Second, and more dramatically, these multiple co-circulating lineages may have complex genealogical histories and interact through reassortment. Two reassortment events involved the HA gene of clade B: one in which it was acquired by the clade A viruses and another in which it was independently acquired by the isolates assigned to clade C.

These findings demonstrate the utility of whole-genome analyses of influenza A viruses and make clear that additional whole-genome analyses are required to understand fully the evolutionary mechanisms and epidemiological dynamics of this virus. While antigenic variation of HA is still the dominant selective pressure on human influenza A virus evolution, the finding that antigenically novel clades emerge by reassortment among persistent viral lineages rather than via antigenic drift is of major significance for vaccine strain selection.
Studies on the genetic diversity of rotaviruses have been based primarily on the genes encoding the antigenically significant VP7 and VP4 proteins. Since the rotavirus genome has 11 segments of RNA that are vulnerable to reassortment events, analyses of the VP7 and VP4 genes may not be sufficient to obtain conclusive data on the overall genetic diversity or true origin of strains. In the last few years, following the advent of the whole-genome-based genotype classification system, the whole genomes of at least 167 human group A rotavirus strains have been analysed, providing a plethora of new and important information on the complex origin of strains, inter- and intra-genogroup reassortment events, animal-human reassortment events, zoonosis, and genetic linkages involving different group A rotavirus gene segments (Ghosh et al., 2011; Wang et al., 2014; Thongprachum et al., 2013; Matthijnssens et al., 2008). Recently, Wang et al. (2014) carried out the first large-scale whole genome-based study to assess the long-term evolution of common human rotaviruses (G3P[8]) in an Asian country from 2000 through 2013, and concluded that Chinese G3P[8] rotavirus strains have evolved since 2000 by intra-genogroup reassortment with co-circulating strains, accumulating more reassorted genes over the years. The genetic information in that study is expected to serve as baseline data for understanding the long-term evolution of the rotavirus genome and for formulating policies on the use of rotavirus vaccines.

CONCLUSION, FUTURE ISSUES AND CHALLENGES AHEAD
The rapid development of whole genome phylogenetic analysis has brought a breakthrough in how viral diversity and evolution are perceived. Phylogenetic analysis based on a single gene is prone to ambiguity and may fail to reflect fully the current taxonomic classification of viruses. Single gene phylogenies reveal extensive incongruence and conflicting topologies. These conflicts could be due to erroneous phylogenetic inferences caused by unequal rates of evolution or to the lack of an unambiguous phylogenetic signal in the sequences. The availability of complete genome sequence data for several viruses, made possible by cheaper sequencing technologies, has led to an interest in the use of such data for phylogenetic reconstruction. Full-genome based analysis can provide more reliable information about the genetic relationships between different virus isolates and about the direction of virus migration from one country or continent to another. Whole-genome analysis is beneficial for molecular characterization and for understanding the evolution of the pathogen. It is also useful for monitoring gene reassortment and interspecies transmission between different viral strains. The genetic information obtained from whole genome based phylogenetic analysis is expected to serve as baseline data for understanding the long-term evolution of viral genomes and for formulating policies on the use of vaccines.

Phylogenetic inference using whole genome data poses tremendous statistical and computational challenges. There is a profound need to develop new models for the analysis of multigene or multipartition datasets that can accommodate factors such as the heterogeneity of the evolutionary process among genes or partitions. Improved statistical methods are needed that account for genomic variation in evolutionary rates, transition/transversion rate ratios and local gene trees. Moreover, there is an urgent need to develop efficient computer programs for the combined analysis of multipartition datasets, particularly programs suitable for parallel computer systems. Whole genome based phylogenetic analysis is therefore urgently needed for studying the molecular epidemiology and genetic evolution of animal viruses.
A key question is to what extent gene exchanges have shaped the phylogeny of viruses and whether it is possible to construct a single phylogenetic tree representing their evolutionary history. Despite the fluidity of the baculovirus genome, whole genome based methods are sufficiently powerful to unravel the underlying phylogeny of the viruses. Using whole genome based phylogeny, Herniou et al. (2001) highlighted the fluid nature of baculovirus genomes, with evidence of frequent genome rearrangements and multiple gene content changes during their evolution.

Advances in Animal and Veterinary Sciences August 2015 | Volume 3 | Issue 8 | Page 440

Table 1: Comparison of the next-generation DNA sequencing platforms

longer adequately capture the genetic diversity of BKV. VP2 gene region sequences could resolve the subtypes but not the subgroups. The VP3 gene is located within the VP2 gene and contains most of its informative sites. Phylogenetic trees based on small-T-antigen sequences can divide BKV into major subtypes and all subgroups of subtype I but cannot resolve subgroups of subtype IV.

The most recent common ancestor can be traced back only within the last 50 years for the Reston ebolavirus and Zaire ebolavirus species, which suggests that viruses within these species may have undergone recent genetic bottlenecks. Examination of the whole family suggests that members of the Filoviridae, including the recently described Lloviu virus, shared a most recent common ancestor approximately 10,000 years ago. These data will be valuable for understanding the evolution of filoviruses in the context of natural history as new reservoir hosts are identified and, further, for determining mechanisms of emergence, pathogenicity and the ongoing threat to public health.
elucidated the genetic diversity of Japanese encephalitis virus by whole genome based phylogenetic analysis using Bayesian Markov chain Monte Carlo simulations. The results showed that the time to the most recent common ancestor (TMRCA) for JEV was estimated to