SARS-CoV-2 assembly and gene annotation

Coronaviruses are a large family of enveloped, positive-sense, single-stranded RNA viruses that infect a broad range of vertebrates. This site presents the genome of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), the virus responsible for COVID-19, within the Ensembl suite of tools.

Assembly

The reference assembly for the Wuhan-Hu-1 isolate has been imported from ENA (ASM985889v3, GCA_009858895.3, MN908947.3).
Taxonomy ID: 2697049
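
As a quick sanity check on the imported assembly, the sequence length can be recomputed from a FASTA record. The sketch below is illustrative only: the helper names `fetch_fasta` and `sequence_length` are hypothetical, and the URL pattern is an assumption about ENA's browser API rather than something stated on this page.

```python
from urllib.request import urlopen

# Assumption: ENA's browser API serves FASTA records at this URL pattern.
ENA_FASTA_URL = "https://www.ebi.ac.uk/ena/browser/api/fasta/{accession}"


def fetch_fasta(accession: str) -> str:
    """Download a FASTA record as text (requires network access)."""
    with urlopen(ENA_FASTA_URL.format(accession=accession)) as response:
        return response.read().decode("utf-8")


def sequence_length(fasta_text: str) -> int:
    """Count sequence characters in a single-record FASTA string.

    Header lines (starting with '>') and blank lines are skipped.
    """
    return sum(
        len(line.strip())
        for line in fasta_text.splitlines()
        if line and not line.startswith(">")
    )
```

If the URL assumption holds, `sequence_length(fetch_fasta("MN908947.3"))` should agree with the Base Pairs value listed under Statistics.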

Gene annotation

The protein-coding genes on this site were annotated using a modified Ensembl genebuild supported by protein evidence.

The first ORF, which represents approximately 67% of the entire genome and encodes 16 non-structural proteins (nsps), has been split into two genes: ORF1a and ORF1ab. The remaining ORFs encode accessory proteins and four major structural proteins: spike surface glycoprotein (S), small envelope protein (E), matrix protein (M) and nucleocapsid protein (N) [Wu A et al. 2020].

The annotation also includes protein features from InterProScan (version 5.45-80.0), alignments to Rfam covariance models produced with cmscan (Rfam 14.2), and GO terms imported from dedicated SARS-CoV-2 GPAD files from GOA (September 2020).

The gene annotation imported from ENA for this reference (submitted by the Shanghai Public Health Clinical Center & School of Public Health, Fudan University, Shanghai, China) can also be viewed as an additional track.

Variation annotation

We have imported variation data from a variety of sources, including ENA, EVA, Nextstrain and COG-UK. See the main SARS-CoV-2 variation page for information on how data from each source are selected and displayed.

Comparative annotation

We used Cactus to align SARS-CoV-2 with 60 publicly available virus genomes from the Orthocoronavirinae subfamily. In the resulting alignment, 78% of the SARS-CoV-2 genome aligns with at least one other genome and 35% aligns with the complete set of Orthocoronavirinae genomes. We have also applied our gene tree methods to group the protein-coding genes into families and to predict orthologous and paralogous relationships between genes.
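
To give a sense of scale, the alignment coverage percentages can be converted into approximate base counts using the genome length reported under Statistics (29,903 bp). This is a minimal sketch; the function name `aligned_bases` is hypothetical, and because the published percentages are rounded, the results are estimates rather than exact alignment statistics.

```python
GENOME_LENGTH_BP = 29_903  # "Base Pairs" value from the Statistics summary


def aligned_bases(fraction: float, genome_length: int = GENOME_LENGTH_BP) -> int:
    """Approximate number of bases covered by a rounded fraction of the genome."""
    return round(fraction * genome_length)


at_least_one = aligned_bases(0.78)  # aligned with at least one other genome
complete_set = aligned_bases(0.35)  # aligned with the complete set of genomes
print(at_least_one, complete_set)   # 23324 10466
```

On these figures, roughly 23.3 kb of the genome aligns with at least one other genome and roughly 10.5 kb aligns across all of them.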

References

Wu A, Peng Y, Huang B, Ding X, Wang X, Niu P et al. (2020) Genome Composition and Divergence of the Novel Coronavirus (2019-nCoV) Originating in China. Cell Host and Microbe 27(3):325-328. doi:10.1016/j.chom.2020.02.001

Wu F, Zhao S, Yu B, Chen YM, Wang W, Song ZG et al. (2020) A new coronavirus associated with human respiratory disease in China. Nature 579(7798):265-269. doi:10.1038/s41586-020-2008-3

Hadfield J, Megill C, Bell SM, Huddleston J, Potter B, Callender C et al. (2018) Nextstrain: real-time tracking of pathogen evolution. Bioinformatics 34(23):4121-4123. doi:10.1093/bioinformatics/bty407

Sagulenko P, Puller V and Neher R (2018) TreeTime: Maximum-likelihood phylodynamic analysis. Virus Evolution 4(1). doi:10.1093/ve/vex042

Wilm A, Aw PP, Bertrand D, Yeo GH, Ong SH, Wong et al. (2012) LoFreq: a sequence-quality aware, ultra-sensitive variant caller for uncovering cell-population heterogeneity from high-throughput sequencing datasets. Nucleic Acids Research 40(22):11189-11201. doi:10.1093/nar/gks918

Statistics

Summary

Assembly	ASM985889v3, INSDC Assembly GCA_009858895.3, Jan 2020
Base Pairs	29,903
Golden Path Length	29,903
Assembly provider	ENA
Annotation provider	Ensembl
Annotation method	Full genebuild
Genebuild started	Apr 2020
Genebuild released	Apr 2020
Genebuild last updated/patched
Database version	104.1

Gene counts

Coding genes	12
Gene transcripts	12

Other

Short Variants	6,123
