Annotation & Prediction

Variation Data Sources

We display data from two sources. There will be some overlap between the sample sets used in the two analyses.


SARS-CoV-2 mutation data is made available as a phylogenetic tree by Nextstrain [Hadfield 2018], an open-source project to harness the scientific and public health potential of pathogen genome data. The Nextstrain project uses subsamples of the available genome sequence data to provide a real-time snapshot of evolving SARS-CoV-2 populations and creates visualisations to interrogate these over time and by region. The version of the data available on 2020-04-08 is used here.

The SARS-CoV-2 genome sequence data used in the Nextstrain analysis were collected and made available by GISAID. Many groups contributed pre-publication sequence data to GISAID; please see the GISAID site for a full list of attributions.

See Nextstrain for the latest data and analysis.


The European Nucleotide Archive team have developed a LoFreq-based pipeline to call variants from SARS-CoV-2 data submitted to the archive. The pipeline estimates the quantity of each variant allele within the sequenced sample. Here, we represent only the alleles seen in the virus in each individual.

Samples were called individually, and for display here it is assumed that any site not called as variant in a sample matches the reference. This assumption may not always hold, but it gives a better estimate of the frequency of each allele across the entire set of sequenced samples than counting only the samples with a variant call.
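The reference-fill assumption described above can be illustrated with a minimal Python sketch. It is not the pipeline's actual code: the function name, the data layout (a dictionary of per-sample variant calls keyed by site) and the example values are all hypothetical.

```python
from collections import Counter


def allele_frequencies(sample_calls, all_samples, site, ref_allele):
    """Estimate allele frequencies at one site across all samples.

    sample_calls: hypothetical layout, {sample_id: {site: called_allele}},
                  holding only the sites called as variant in each sample.
    all_samples:  every sample in the set, including those with no calls.
    A sample with no call at `site` is assumed to carry the reference allele.
    """
    if not all_samples:
        return {}
    counts = Counter()
    for sample in all_samples:
        # Fall back to the reference allele when the site was not called.
        allele = sample_calls.get(sample, {}).get(site, ref_allele)
        counts[allele] += 1
    total = len(all_samples)
    return {allele: n / total for allele, n in counts.items()}


# Made-up example: two of four samples carry a variant at site 241.
calls = {"S1": {241: "T"}, "S2": {}, "S3": {241: "T"}}
freqs = allele_frequencies(calls, ["S1", "S2", "S3", "S4"], 241, "C")
print(freqs)
```

Because every sample contributes to the denominator, a variant seen in only a few samples is reported at a low frequency rather than appearing fixed.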

This is an early version of the variant data, and some loci are flagged as a guide to quality.

  • Suspect reference location: site flagged by De Maio et al.
  • High strand bias: No sample has a strand bias of less than 10 at this location. This flags the most extreme 10% of the set.
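
As an illustration of the high strand bias rule, the following Python sketch flags a site when no sample's strand bias falls below the threshold. It assumes each sample has a numeric strand-bias score at the site; the function name, input layout and example scores are hypothetical and are not taken from the pipeline.

```python
def has_high_strand_bias(strand_bias_by_sample, threshold=10.0):
    """Return True if every sample's strand bias is at least `threshold`.

    strand_bias_by_sample: hypothetical layout, {sample_id: strand_bias_score}
                           for the samples with a call at this site.
    A site with no scores is not flagged.
    """
    scores = list(strand_bias_by_sample.values())
    if not scores:
        return False
    # "No sample has a strand bias of less than 10" means the minimum is >= 10.
    return min(scores) >= threshold


# Made-up scores: the first site is flagged, the second is not.
print(has_high_strand_bias({"S1": 14.0, "S2": 31.5}))
print(has_high_strand_bias({"S1": 14.0, "S2": 3.0}))
```

Checking the minimum means that a single sample with low strand bias is enough to leave the site unflagged.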