SARS-CoV-2 variation data

We display 17,1796 sequence variants from four different data sources. There is overlap between the sample sets used in variant identification, but different analysis methods were employed.

European Nucleotide Archive (ENA)

The ENA team have developed a LoFreq-based pipeline to call variants from SARS-CoV-2 data submitted to the archive. The pipeline estimates the proportion of each variant allele within the sequenced sample. 4,852 variants from the preliminary analysis of the first available 5,115 sequencing runs (2020/08/17) are displayed here.

Data filtering

As this is an early version of the ENA variant data, strict basic filters have been applied to reduce the proportion of lower-confidence sites. We have removed:

Sequencing runs with more than 40 variant calls
Variants where no sample has a frequency of 20% or more for the non-reference allele
Variants where all samples show strand bias
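
The three filters above can be sketched in Python as follows. This is an illustration only, not the pipeline used to produce the displayed data: the Call record, its field names and the order in which the filters are applied (run-level first, then site-level) are assumptions made for the example.

```python
from dataclasses import dataclass

MAX_CALLS_PER_RUN = 40    # runs with more variant calls than this are removed
MIN_ALT_FREQUENCY = 0.20  # at least one sample must reach this frequency


@dataclass(frozen=True)
class Call:
    """One variant call in one sequencing run (hypothetical record layout)."""
    run: str           # sequencing run identifier
    site: tuple        # (position, reference allele, alternate allele)
    alt_freq: float    # within-sample frequency of the non-reference allele
    strand_bias: bool  # True if the call shows strand bias in this sample


def filter_calls(calls):
    """Apply the three filters and return the surviving calls grouped by site."""
    # Filter 1: remove sequencing runs with more than 40 variant calls.
    calls_per_run = {}
    for call in calls:
        calls_per_run[call.run] = calls_per_run.get(call.run, 0) + 1
    kept = [c for c in calls if calls_per_run[c.run] <= MAX_CALLS_PER_RUN]

    # Group the remaining calls by variant site.
    by_site = {}
    for call in kept:
        by_site.setdefault(call.site, []).append(call)

    passed = {}
    for site, site_calls in by_site.items():
        # Filter 2: no sample reaches 20% for the non-reference allele.
        if max(c.alt_freq for c in site_calls) < MIN_ALT_FREQUENCY:
            continue
        # Filter 3: every sample carrying the variant shows strand bias.
        if all(c.strand_bias for c in site_calls):
            continue
        passed[site] = site_calls
    return passed
```

Applying the run-level filter first means that calls from a discarded run cannot rescue a site in the later checks; the text does not say whether the original analysis applies the filters in this order.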

Data representation

We represent only the alleles seen in each sample and not the frequencies observed within each sample. Variants were called for each sequencing run individually; for display, it is assumed that any site at which no variant was called in a sample matches the reference. This provides a more accurate estimate of the frequency of each allele across the entire sample set.
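
As a rough illustration of the assumption described above, the following Python sketch counts alleles at a single site across all sequencing runs, treating any run without a call at that site as carrying the reference allele. The function name, its arguments and the definition of frequency used here (the fraction of runs in which an allele was seen) are assumptions for the example rather than a description of the actual display code.

```python
from collections import Counter


def allele_frequencies(ref_allele, alleles_by_run, all_runs):
    """Fraction of runs in which each allele was seen at one site.

    alleles_by_run maps a run identifier to the set of alleles called at
    this site; runs missing from the mapping are assumed to match the
    reference.
    """
    counts = Counter()
    for run in all_runs:
        # A run with no call at this site is counted as reference.
        alleles = alleles_by_run.get(run) or {ref_allele}
        counts.update(alleles)
    n_runs = len(all_runs)
    return {allele: n / n_runs for allele, n in counts.items()}


# Four runs, only one of which has a variant call at this site.
freqs = allele_frequencies("C", {"run2": {"T"}}, ["run1", "run2", "run3", "run4"])
print(freqs)
```

Without the reference assumption, the three runs lacking a call would be ignored and the alternate allele would appear to be present in every sample.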

European Variation Archive (EVA)

14,806 variant loci from the first SARS-CoV-2 data release (2021/07/20) from the EVA are available. Variant records from different submissions have been clustered by location and type, and stable RefSNP accessions have been assigned. These records include the variant location, alleles and RefSNP identifier.

Nextstrain

SARS-CoV-2 mutation data is made available as a phylogenetic tree by Nextstrain [Hadfield 2018], an open-source project to harness the scientific and public health potential of pathogen genome data. The Nextstrain project uses subsamples of the available genome sequence data to provide a real-time snapshot of evolving SARS-CoV-2 populations and creates visualisations to interrogate these over time and by region. 2,332 variants from the data version available on 2020/04/08 are displayed here.

The SARS-CoV-2 genome sequence data used in the Nextstrain analysis were collected and made available by GISAID. Many groups contributed pre-publication sequence data to GISAID; please see the GISAID site for a full list of attributions.

See Nextstrain for the latest data and analysis.

COG-UK

The COG-UK Mutation Explorer provides information on sequence variants discovered in data generated by the COVID-19 Genomics UK (COG-UK) Consortium. Reports focus on mutations in the spike gene and other variants of known or potential importance, which are described by gene name and protein change. To link these variants to genomic information, we map the descriptions to genome locations. Variants that are novel are displayed as new records (291 variants); for the rest, the gene-protein based names are included as synonyms in records from other sources (1,097 variants). HGVS-style names such as "S:p.N501Y" can be entered into the search function on the SARS-CoV-2 species page to retrieve variant records. Data from 2021/04/14 is displayed here.
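
To illustrate what mapping a gene-protein description to a genome location involves, here is a minimal Python sketch that parses a simple substitution such as "S:p.N501Y" and computes the genomic range of the affected codon. It is not the mapping procedure used for the displayed data. The regular expression handles single amino acid substitutions only, and the coding start of the S gene (position 21563) is an assumption taken from the NC_045512.2 reference sequence, which the text above does not name.

```python
import re

# Matches simple substitutions only, e.g. "S:p.N501Y" (illustrative pattern).
HGVS_PROTEIN_SUBSTITUTION = re.compile(
    r"^(?P<gene>[A-Za-z0-9]+):p\.(?P<ref>[A-Z])(?P<pos>\d+)(?P<alt>[A-Z*])$"
)

# Assumption: 1-based start of the S coding sequence in NC_045512.2.
GENE_CDS_START = {"S": 21563}


def parse_protein_change(name):
    """Split a name like 'S:p.N501Y' into (gene, ref, residue, alt)."""
    match = HGVS_PROTEIN_SUBSTITUTION.match(name)
    if match is None:
        raise ValueError(f"not a simple protein substitution: {name!r}")
    return match["gene"], match["ref"], int(match["pos"]), match["alt"]


def codon_range(gene, residue):
    """Return the 1-based genomic (start, end) of the codon for a residue."""
    start = GENE_CDS_START[gene] + 3 * (residue - 1)
    return start, start + 2


gene, ref, residue, alt = parse_protein_change("S:p.N501Y")
print(gene, ref, residue, alt, codon_range(gene, residue))
```

A protein-level name identifies only the codon, not the nucleotide change within it, so a full mapping would also need to work out which base substitutions produce the stated amino acid change. Genes with more complex coding structures would need more than a single start coordinate.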

Variant reliability annotation

Some loci are flagged as a further guide to quality:

Variants seen in more than one sample in either set have an evidence status of 'Multiple observations'
Variants at sites recommended for masking by De Maio et al. (version 2020-07-29) have a flag of 'Suspect reference location'
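
The two annotations above amount to a simple rule per variant, sketched here in Python. The function and its arguments are hypothetical, and the masked positions in the usage lines are placeholders, not sites taken from the De Maio et al. list.

```python
def reliability_annotations(position, sample_count, masked_positions):
    """Return the reliability labels that apply to one variant."""
    labels = []
    if sample_count > 1:
        # Seen in more than one sample.
        labels.append("Multiple observations")
    if position in masked_positions:
        # Site recommended for masking.
        labels.append("Suspect reference location")
    return labels


# Placeholder positions for illustration only.
masked = {150, 11083}
print(reliability_annotations(11083, 12, masked))
print(reliability_annotations(2000, 1, masked))
```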
