SARS-CoV-2 assembly and gene annotation

Coronaviruses are a large family of enveloped, positive-sense, single-stranded RNA viruses that infect a broad range of vertebrates. This site presents the genome of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), the virus responsible for COVID-19, within the Ensembl suite of tools.


The reference assembly for the Wuhan-Hu-1 isolate has been imported from ENA (ASM985889v3, GCA_009858895.3, MN908947.3).
Taxonomy ID: 2697049
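
The assembly and sequence identifiers above (GCA_009858895.3, MN908947.3) carry a version suffix after the final dot. As a minimal sketch, assuming an identifier has the form "accession.version" with an integer version, the two parts can be separated as shown below. The names `VersionedAccession` and `split_accession` are hypothetical helpers introduced for illustration only and are not part of any Ensembl or ENA API.

```python
import re
from typing import NamedTuple


class VersionedAccession(NamedTuple):
    """Hypothetical container for an accession and its version number."""
    accession: str
    version: int


# Assumption: a versioned identifier looks like "<accession>.<integer version>".
_VERSIONED = re.compile(r"^(?P<accession>[A-Za-z0-9_]+)\.(?P<version>\d+)$")


def split_accession(identifier: str) -> VersionedAccession:
    """Split an identifier such as 'MN908947.3' into ('MN908947', 3)."""
    match = _VERSIONED.match(identifier.strip())
    if match is None:
        raise ValueError(f"not a versioned accession: {identifier!r}")
    return VersionedAccession(match["accession"], int(match["version"]))


if __name__ == "__main__":
    for identifier in ("GCA_009858895.3", "MN908947.3"):
        print(split_accession(identifier))
```

The assembly name ASM985889v3 does not follow the dotted pattern, so this sketch rejects it with a `ValueError` rather than guessing at its structure.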

Gene annotation

The protein-coding genes on this site were generated using a modified Ensembl genebuild supported by protein evidence.

The first ORF, which represents approximately 67% of the genome and encodes 16 non-structural proteins (nsps), has been split into two genes: ORF1a and ORF1ab. The remaining ORFs encode accessory proteins and four major structural proteins: spike surface glycoprotein (S), small envelope protein (E), matrix protein (M) and nucleocapsid protein (N) [Wu A et al. 2020].

Protein features were also annotated using InterProScan (version 5.43-78.1), together with alignments to Rfam covariance models using cmscan (Rfam 14.2) and a GO import from dedicated SARS-CoV-2 GPAD files provided by GOA.

The gene annotation imported from ENA for this reference (submitted by the Shanghai Public Health Clinical Center & School of Public Health, Fudan University, Shanghai, China) can also be viewed as an additional track.


SARS-CoV-2 mutation data are made available as a phylogenetic tree by Nextstrain [Hadfield 2018], an open-source project to harness the scientific and public health potential of pathogen genome data. The Nextstrain project uses subsamples of the available genome sequence data to provide a real-time snapshot of evolving SARS-CoV-2 populations and creates visualisations to interrogate these over time and by region. The version of the data available on 2020-04-08 is used here.
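
To illustrate how mutation data can be read from a phylogenetic tree, the sketch below walks a small nested tree and tallies the nucleotide mutations recorded on each branch. It assumes a layout modelled on Nextstrain's dataset JSON, with `children`, `branch_attrs`, `mutations` and `nuc` fields. Those field names are an assumption here, and this page does not describe how the Nextstrain data were actually imported. The tree, node names and mutation labels are invented toy values, not real SARS-CoV-2 data.

```python
from collections import Counter
from typing import Any, Dict, Iterator

Node = Dict[str, Any]


def iter_nodes(root: Node) -> Iterator[Node]:
    """Yield every node of the tree using an explicit stack (no recursion)."""
    stack = [root]
    while stack:
        node = stack.pop()
        yield node
        stack.extend(node.get("children", []))


def count_nuc_mutations(root: Node) -> Counter:
    """Count how many branches carry each nucleotide mutation label."""
    counts: Counter = Counter()
    for node in iter_nodes(root):
        # Assumed layout: node["branch_attrs"]["mutations"]["nuc"] is a list of labels.
        mutations = node.get("branch_attrs", {}).get("mutations", {})
        counts.update(mutations.get("nuc", []))
    return counts


# Toy tree with invented names and mutation labels, for illustration only.
toy_tree: Node = {
    "name": "root",
    "children": [
        {
            "name": "node_a",
            "branch_attrs": {"mutations": {"nuc": ["C100T"]}},
            "children": [
                {"name": "sample_1", "branch_attrs": {"mutations": {"nuc": ["G200A"]}}},
                {"name": "sample_2", "branch_attrs": {"mutations": {}}},
            ],
        },
        {"name": "sample_3", "branch_attrs": {"mutations": {"nuc": ["C100T"]}}},
    ],
}

if __name__ == "__main__":
    for label, n_branches in count_nuc_mutations(toy_tree).most_common():
        print(label, n_branches)
```

Counting per branch rather than per sample means a label that appears on two separate branches, as `C100T` does in the toy tree, is counted twice, which is one simple way to spot changes that may have arisen more than once.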

The SARS-CoV-2 genome sequence data used in the Nextstrain analysis were collected and made available by GISAID. Many groups contributed pre-publication sequence data to GISAID; please see the GISAID site for a full list of attributions.

See Nextstrain for the latest data and analysis.


Wu A, Peng Y, Huang B, Ding X, Wang X, Niu P et al. (2020) Genome Composition and Divergence of the Novel Coronavirus (2019-nCoV) Originating in China. Cell Host and Microbe 27(3):325-328. doi:10.1016/j.chom.2020.02.001

Wu F, Zhao S, Yu B, Chen YM, Wang W, Song ZG et al. (2020) A new coronavirus associated with human respiratory disease in China. Nature 579(7798):265-269. doi:10.1038/s41586-020-2008-3

Hadfield J, Megill C, Bell SM, Huddleston J, Potter B, Callender C et al. (2018) Nextstrain: real-time tracking of pathogen evolution. Bioinformatics 34(23):4121-4123. doi:10.1093/bioinformatics/bty407

Sagulenko P, Puller V and Neher R (2018) TreeTime: Maximum-likelihood phylodynamic analysis. Virus Evolution 4(1). doi:10.1093/ve/vex042



Assembly: ASM985889v3, INSDC Assembly GCA_009858895.3, Jan 2020
Base Pairs: 29,903
Golden Path Length: 29,903
Annotation method: Full genebuild
Genebuild started: Apr 2020
Genebuild released: Apr 2020
Genebuild last updated/patched:
Database version: 100.1

Gene counts

Coding genes: 12
Gene transcripts: 12


Short Variants: 2,332

