SARS-CoV-2 (ASM985889v3)

SARS-CoV-2 assembly and gene annotation

Coronaviruses are a large family of enveloped, positive-sense, single-stranded RNA viruses that infect a broad range of vertebrates. This site presents the genome of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), the virus responsible for COVID-19, embedded within the Ensembl suite of tools.


The reference assembly for the Wuhan-Hu-1 isolate has been imported from ENA (ASM985889v3, GCA_009858895.3, MN908947.3).
Taxonomy ID: 2697049
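
As a rough sketch of how the accessions above might be used programmatically, the Python snippet below builds a download URL for the reference sequence and parses a FASTA record so its length can be compared with the 29,903 bases reported for this assembly. The ENA Browser API URL pattern and the helper names (ena_fasta_url, parse_fasta, fetch_reference) are assumptions made for illustration; this page does not document them.

```python
from urllib.request import urlopen

ACCESSION = "MN908947.3"    # INSDC sequence accession quoted above
EXPECTED_LENGTH = 29_903    # base pairs, from the assembly statistics


def ena_fasta_url(accession):
    """Build a FASTA download URL (assumed ENA Browser API pattern)."""
    return f"https://www.ebi.ac.uk/ena/browser/api/fasta/{accession}"


def parse_fasta(text):
    """Parse FASTA text into a {header: sequence} dictionary."""
    chunks = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            chunks[header] = []
        elif header is not None:
            chunks[header].append(line)
    return {name: "".join(parts) for name, parts in chunks.items()}


def fetch_reference(accession=ACCESSION):
    """Download the reference and return its sequence (needs network access)."""
    with urlopen(ena_fasta_url(accession)) as response:
        text = response.read().decode("ascii")
    (sequence,) = parse_fasta(text).values()  # expect exactly one record
    return sequence
```

If the download succeeds, comparing len(fetch_reference()) with EXPECTED_LENGTH is a quick sanity check that the retrieved record matches the assembly described here.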

Gene annotation

The protein-coding genes on this site were annotated using a modified Ensembl genebuild supported by protein evidence.

The first ORF, which represents approximately 67% of the genome and encodes 16 non-structural proteins (nsps), has been split into two genes: ORF1a and ORF1ab. The remaining ORFs encode accessory proteins and four major structural proteins: spike surface glycoprotein (S), small envelope protein (E), matrix protein (M) and nucleocapsid protein (N) [Wu A et al. 2020].
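
To give a sense of scale, the approximate 67% share can be turned into a base count using the genome length reported for this assembly. This is a back-of-the-envelope sketch only: the function name is hypothetical, and the result is an estimate derived from the rounded percentage, not an annotated coordinate range.

```python
GENOME_LENGTH = 29_903   # base pairs, from the assembly statistics
ORF1_FRACTION = 0.67     # approximate share of the genome quoted in the text


def approx_span(genome_length, fraction):
    """Estimate how many bases a given fraction of the genome covers."""
    return round(genome_length * fraction)


orf1_bases = approx_span(GENOME_LENGTH, ORF1_FRACTION)
remaining_bases = GENOME_LENGTH - orf1_bases

print(f"First ORF: about {orf1_bases:,} bases")
print(f"Rest of the genome: about {remaining_bases:,} bases")
```

Under these assumptions the first ORF spans roughly 20,000 bases, leaving just under 10,000 bases for the structural and accessory ORFs and the untranslated regions.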

Protein features have also been annotated using InterProScan (version 5.45-80.0), alongside alignments to Rfam covariance models using cmscan (Rfam 14.2) and GO annotations imported from dedicated SARS-CoV-2 GPAD files from GOA (September 2020).

The gene annotation imported from ENA for this reference (submitted by the Shanghai Public Health Clinical Center & School of Public Health, Fudan University, Shanghai, China) can also be viewed as an additional track.

Variation annotation

We have imported variation data from a variety of sources, including ENA, EVA, Nextstrain and COG-UK. See the main SARS-CoV-2 variation page for information on how data from each source are selected and displayed.

Comparative annotation

We used Cactus to align SARS-CoV-2 with 60 publicly available virus genomes from the Orthocoronavirinae subfamily. As a result, 78% of the SARS-CoV-2 genome is aligned with at least one other genome, and 35% is aligned with the complete set of Orthocoronavirinae genomes. We have also applied our gene tree methods to group the protein-coding genes into families and to predict orthologous and paralogous relationships between genes.
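
The two percentages above are coverage figures: the share of reference positions aligned to at least one other genome, and the share aligned to every genome in the set. The sketch below shows one way such figures could be computed from per-genome aligned intervals on the reference. It is not the Ensembl pipeline or the Cactus output format; the function name, the interval representation and the toy data are all hypothetical.

```python
def coverage_fractions(genome_length, aligned_intervals):
    """Return (fraction aligned to >=1 genome, fraction aligned to all genomes).

    aligned_intervals holds one list per other genome; each list contains
    (start, end) pairs, 1-based and inclusive, on the reference genome.
    """
    n_genomes = len(aligned_intervals)
    if n_genomes == 0:
        return 0.0, 0.0

    depth = [0] * genome_length  # number of genomes aligned at each position
    for intervals in aligned_intervals:
        covered = set()
        for start, end in intervals:
            covered.update(range(start - 1, end))  # convert to 0-based
        for pos in covered:  # count each genome at most once per position
            depth[pos] += 1

    any_covered = sum(1 for d in depth if d >= 1)
    all_covered = sum(1 for d in depth if d == n_genomes)
    return any_covered / genome_length, all_covered / genome_length


# Toy example: a 100-base reference aligned against three hypothetical genomes.
toy_alignments = [
    [(1, 60)],
    [(41, 80)],
    [(51, 70), (65, 90)],
]
any_fraction, all_fraction = coverage_fractions(100, toy_alignments)
print(any_fraction, all_fraction)
```

Applied to the figures reported here, 78% of the 29,903-base genome corresponds to roughly 23,300 bases and 35% to roughly 10,500 bases.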


Wu A, Peng Y, Huang B, Ding X, Wang X, Niu P et al. (2020) Genome Composition and Divergence of the Novel Coronavirus (2019-nCoV) Originating in China. Cell Host and Microbe 27(3):325-328. doi:10.1016/j.chom.2020.02.001

Wu F, Zhao S, Yu B, Chen YM, Wang W, Song ZG et al. (2020) A new coronavirus associated with human respiratory disease in China. Nature 579(7798):265-269. doi:10.1038/s41586-020-2008-3

Hadfield J, Megill C, Bell SM, Huddleston J, Potter B, Callender C et al. (2018) Nextstrain: real-time tracking of pathogen evolution. Bioinformatics 34(23):4121-4123. doi:10.1093/bioinformatics/bty407

Sagulenko P, Puller V and Neher R (2018) TreeTime: Maximum-likelihood phylodynamic analysis. Virus Evolution 4(1). doi:10.1093/ve/vex042

Wilm A, Aw PP, Bertrand D, Yeo GH, Ong SH, Wong et al. (2012) LoFreq: a sequence-quality aware, ultra-sensitive variant caller for uncovering cell-population heterogeneity from high-throughput sequencing datasets. Nucleic Acids Research 40(22):11189-11201. doi:10.1093/nar/gks918



Assembly: ASM985889v3, INSDC Assembly GCA_009858895.3, Jan 2020
Base Pairs: 29,903
Golden Path Length: 29,903
Assembly provider: ENA
Annotation provider: Ensembl
Annotation method: Full genebuild
Genebuild started: Apr 2020
Genebuild released: Apr 2020
Genebuild last updated/patched
Database version: 104.1

Gene counts

Coding genes: 12
Gene transcripts: 12


Short Variants: 6,123
