SARS-CoV-2 assembly and gene annotation

Coronaviruses are a large family of enveloped, positive-sense, single-stranded RNA viruses that infect a broad range of vertebrates. This site represents the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) genome, responsible for the COVID-19 disease, embedded within the Ensembl suite of tools.

Assembly

The reference assembly for the Wuhan-Hu-1 isolate has been imported from ENA (ASM985889v3, GCA_009858895.3, MN908947.3).
Taxonomy ID: 2697049

Gene annotation

The protein-coding genes on this site were generated using a modified Ensembl genebuild supported by protein evidence.

The first ORF, which represents approximately 67% of the entire genome and encodes 16 non-structural proteins (nsps), has been split into two genes: ORF1a and ORF1ab. The remaining ORFs encode accessory proteins and four major structural proteins: spike surface glycoprotein (S), small envelope protein (E), matrix protein (M) and nucleocapsid protein (N) [Wu A et al. 2020].

Protein features have also been annotated using InterProScan (version 5.45-80.0), together with alignments to Rfam covariance models using cmscan (Rfam 14.2) and GO annotations imported from dedicated SARS-CoV-2 GPAD files from GOA (September 2020).

The gene annotation imported from ENA for this reference (submitted by the Shanghai Public Health Clinical Center & School of Public Health, Fudan University, Shanghai, China) can also be viewed as an additional track.

Variation data

We display 6,123 sequence variants derived by two different groups using two different analysis methods. There is overlap between the sample sets used in the two analyses.

Nextstrain

SARS-CoV-2 mutation data is made available as a phylogenetic tree by Nextstrain [Hadfield et al. 2018], an open-source project to harness the scientific and public health potential of pathogen genome data. The Nextstrain project uses subsamples of the available genome sequence data to provide a real-time snapshot of evolving SARS-CoV-2 populations and creates visualisations to interrogate these over time and by region. The version of the data available on 2020-04-08 is used here.

The SARS-CoV-2 genome sequence data used in the Nextstrain analysis were collected and made available by GISAID. Many groups contributed pre-publication sequence data to GISAID; please see the GISAID site for a full list of attributions.

See Nextstrain for the latest data and analysis.

European Nucleotide Archive (ENA)

The ENA team have developed a LoFreq-based pipeline to call variants from SARS-CoV-2 data submitted to the archive. This estimates the proportion of each variant allele within the sequenced sample. The results of a preliminary analysis of the first available 5,115 sequencing runs, created on 2020-08-17, are displayed here.

Data filtering

As this is an early version of the ENA variant data, strict basic filters have been applied to reduce the proportion of lower-confidence sites. We have removed:

  • Sequencing runs with more than 40 variant calls
  • Variants where no sample has a frequency of 20% or more for the non-reference allele
  • Variants where all samples show strand bias
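The three filters above can be sketched as follows. This is a minimal illustration and not the pipeline actually used: the `Call` record, its field names and the order in which the filters are applied are all assumptions made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Call:
    """One variant call in one sequencing run (hypothetical record layout)."""
    run: str           # sequencing run identifier
    pos: int           # 1-based position on the reference
    ref: str           # reference allele
    alt: str           # non-reference allele
    freq: float        # within-sample frequency of the non-reference allele
    strand_bias: bool  # True if this call shows strand bias


MAX_CALLS_PER_RUN = 40  # runs with more calls than this are removed
MIN_ALT_FREQ = 0.20     # at least one sample must reach this frequency


def filter_calls(calls):
    """Apply the three filters; run-level filtering first is an assumption."""
    # 1. Remove sequencing runs with more than 40 variant calls.
    calls_per_run = defaultdict(int)
    for call in calls:
        calls_per_run[call.run] += 1
    kept = [c for c in calls if calls_per_run[c.run] <= MAX_CALLS_PER_RUN]

    # Group the remaining calls by variant (position, ref, alt).
    by_variant = defaultdict(list)
    for call in kept:
        by_variant[(call.pos, call.ref, call.alt)].append(call)

    filtered = []
    for group in by_variant.values():
        # 2. Remove variants where no sample reaches 20% for the alt allele.
        if not any(c.freq >= MIN_ALT_FREQ for c in group):
            continue
        # 3. Remove variants where all samples show strand bias.
        if all(c.strand_bias for c in group):
            continue
        filtered.extend(group)
    return filtered
```

Note that filters 2 and 3 act on a variant as a whole: a low-frequency or strand-biased call is retained when another sample supports the same variant without that problem.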

Data representation

We represent only the alleles seen in each sample and not the frequencies observed within each sample. Variants were called for each sequencing run individually, and for display it is assumed that sites at which a variant was not called in a sample match the reference. This provides a more accurate estimate of the frequency of each allele across the entire sample set than counting only the samples in which a variant was called.
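The effect of the reference-match assumption can be illustrated with a short sketch. The function name, the input layout and the choice to count each allele once per sample are assumptions made for this example, not a description of the display code.

```python
from collections import Counter


def allele_frequencies(ref_allele, alleles_by_run, all_runs):
    """Estimate allele frequencies at one site across the whole sample set.

    alleles_by_run maps a run identifier to the set of alleles seen at the
    site in that run (hypothetical layout). Runs with no call at the site
    are assumed to carry the reference allele.
    """
    counts = Counter()
    for run in all_runs:
        # A missing or empty entry means no variant was called in this run.
        alleles = alleles_by_run.get(run) or {ref_allele}
        counts.update(alleles)
    total = sum(counts.values())
    return {allele: n / total for allele, n in counts.items()}


# One of four runs carries T at a site whose reference allele is C.
freqs = allele_frequencies("C", {"R1": {"T"}}, ["R1", "R2", "R3", "R4"])
```

Here `freqs` gives T a frequency of 0.25 and C a frequency of 0.75. Counting only the run with a call would instead report T at 100%.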

Some loci are flagged as a further guide to quality:

  • Variants seen in more than one sample in either set have an evidence status of 'Multiple observations'
  • Variants at sites recommended for masking by De Maio et al. (version 2020-07-29) have a flag of 'Suspect reference location'
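The two flags can be derived as in the sketch below. The function, its arguments and the example positions are hypothetical; in particular, the masked positions shown are placeholders and not taken from the De Maio et al. list.

```python
def quality_flags(samples_per_set, pos, masked_positions):
    """Return the quality flags for one variant.

    samples_per_set maps a data set name (for example "Nextstrain" or "ENA")
    to the number of samples in which the variant was seen.
    masked_positions is the set of positions recommended for masking.
    """
    flags = []
    # Literal reading: more than one sample within at least one of the sets.
    if any(n > 1 for n in samples_per_set.values()):
        flags.append("Multiple observations")
    if pos in masked_positions:
        flags.append("Suspect reference location")
    return flags


# Placeholder positions, for illustration only.
example_masked = {150, 153}
flags = quality_flags({"Nextstrain": 1, "ENA": 3}, 150, example_masked)
```

Under this literal reading, a variant seen once in each set does not receive 'Multiple observations'; whether the site treats that case differently is not stated above.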

References

Wu A, Peng Y, Huang B, Ding X, Wang X, Niu P et al. (2020) Genome Composition and Divergence of the Novel Coronavirus (2019-nCoV) Originating in China. Cell Host and Microbe 27(3):325-328. doi:10.1016/j.chom.2020.02.001

Wu F, Zhao S, Yu B, Chen YM, Wang W, Song ZG et al. (2020) A new coronavirus associated with human respiratory disease in China. Nature 2020 Mar;579(7798):265-269. doi:10.1038/s41586-020-2008-3

Hadfield J, Megill C, Bell SM, Huddleston J, Potter B, Callender C et al. (2018) Nextstrain: real-time tracking of pathogen evolution. Bioinformatics 2018 Dec 1;34(23):4121-4123. doi:10.1093/bioinformatics/bty407

Sagulenko P, Puller V and Neher R (2018) TreeTime: Maximum-likelihood phylodynamic analysis. Virus Evolution 2018 Jan;4(1). doi:10.1093/ve/vex042

Wilm A, Aw PP, Bertrand D, Yeo GH, Ong SH, Wong et al. (2012) LoFreq: a sequence-quality aware, ultra-sensitive variant caller for uncovering cell-population heterogeneity from high-throughput sequencing datasets. Nucleic Acids Research 40(22):11189-11201. doi:10.1093/nar/gks918

Statistics

Summary

Assembly: ASM985889v3, INSDC Assembly GCA_009858895.3, Jan 2020
Base Pairs: 29,903
Golden Path Length: 29,903
Annotation provider: Ensembl
Annotation method: Full genebuild
Genebuild started: Apr 2020
Genebuild released: Apr 2020
Genebuild last updated/patched:
Database version: 101.1

Gene counts

Coding genes: 12
Gene transcripts: 12

Other

Short Variants: 6,123
