Coronaviruses are a large family of enveloped, positive-sense, single-stranded RNA viruses that infect a broad range of vertebrates. This site presents the genome of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), the virus responsible for COVID-19, embedded within the Ensembl suite of tools.
Assembly
The reference assembly for the Wuhan-Hu-1 isolate has been imported from ENA (ASM985889v3, GCA_009858895.3, MN908947.3).
Taxonomy ID: 2697049
Gene annotation
The protein coding genes on this site have been generated using a modified Ensembl genebuild supported by protein evidence.
The first ORF, which represents approximately 67% of the genome and encodes 16 non-structural proteins (nsps), has been split into two genes: ORF1a and ORF1ab. The remaining ORFs encode accessory proteins and four major structural proteins: spike surface glycoprotein (S), small envelope protein (E), matrix protein (M) and nucleocapsid protein (N) [Wu A et al. 2020].
Protein features have also been annotated using InterProScan (version 5.45-80.0), alongside alignments to Rfam covariance models using cmscan (Rfam 14.2) and a GO import from dedicated SARS-CoV-2 GPAD files provided by GOA (September 2020).
The gene annotation imported from ENA for this reference (submitted by the Shanghai Public Health Clinical Center & School of Public Health, Fudan University, Shanghai, China) can also be viewed as an additional track.
Variation annotation
We have imported variation data from a variety of sources, including ENA, EVA, Nextstrain and COG-UK. See the main SARS-CoV-2 variation page for information on how data from each source are selected and displayed.
Comparative annotation
We used Cactus to align SARS-CoV-2 with 60 publicly available virus genomes from the Orthocoronavirinae subfamily. As a result, 78% of the SARS-CoV-2 genome is aligned with at least one other genome, and 35% is aligned with the complete set of Orthocoronavirinae genomes. We have also applied our gene tree methods to group the protein coding genes into families and to predict orthologous and paralogous relationships between genes.
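The two coverage figures quoted above ("aligned with at least one other genome" and "aligned with the complete set") can be illustrated with a small sketch. This is not the Ensembl Compara pipeline and does not read Cactus output; it assumes the alignment has already been reduced to 1-based, inclusive intervals on the SARS-CoV-2 reference, one list per aligned genome. The names `coverage_depth`, `coverage_fractions`, `genome_A` and `genome_B`, and the toy intervals, are hypothetical. Only the reference length of 29,903 bp comes from this page.

```python
from typing import Dict, List, Tuple

# 1-based, inclusive (start, end) interval on the reference genome.
Interval = Tuple[int, int]

# Reference length in base pairs, as listed in the Statistics summary.
GENOME_LENGTH = 29_903


def coverage_depth(blocks: Dict[str, List[Interval]], length: int) -> List[int]:
    """For each reference position, count how many genomes align to it."""
    depth = [0] * length
    for intervals in blocks.values():
        # Track coverage per genome first so that overlapping blocks from
        # the same genome are only counted once per position.
        covered = [False] * length
        for start, end in intervals:
            for pos in range(max(start, 1), min(end, length) + 1):
                covered[pos - 1] = True
        for i, is_covered in enumerate(covered):
            if is_covered:
                depth[i] += 1
    return depth


def coverage_fractions(
    blocks: Dict[str, List[Interval]], length: int
) -> Tuple[float, float]:
    """Return (fraction aligned to >= 1 genome, fraction aligned to all genomes)."""
    n_genomes = len(blocks)
    if n_genomes == 0:
        return 0.0, 0.0
    depth = coverage_depth(blocks, length)
    any_fraction = sum(1 for d in depth if d >= 1) / length
    all_fraction = sum(1 for d in depth if d == n_genomes) / length
    return any_fraction, all_fraction


# Hypothetical toy input: two genomes with partially overlapping alignments.
toy_blocks = {
    "genome_A": [(1, 20_000)],
    "genome_B": [(10_001, 25_000)],
}
any_frac, all_frac = coverage_fractions(toy_blocks, GENOME_LENGTH)
print(f"aligned to at least one genome: {any_frac:.1%}")
print(f"aligned to all genomes: {all_frac:.1%}")
```

In the toy input, positions 1 to 25,000 are covered by at least one genome and positions 10,001 to 20,000 by both, so the two fractions are 25,000/29,903 and 10,000/29,903. A per-position list is adequate for a genome of this size; an interval sweep would be the usual choice for larger references.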
References
Wu A, Peng Y, Huang B, Ding X, Wang X, Niu P et al. (2020) Genome Composition and Divergence of the Novel Coronavirus (2019-nCoV) Originating in China. Cell Host and Microbe 27(3):325-328. doi:10.1016/j.chom.2020.02.001
Wu F, Zhao S, Yu B, Chen YM, Wang W, Song ZG et al. (2020) A new coronavirus associated with human respiratory disease in China. Nature 579(7798):265-269. doi:10.1038/s41586-020-2008-3
Hadfield J, Megill C, Bell SM, Huddleston J, Potter B, Callender C et al. (2018) Nextstrain: real-time tracking of pathogen evolution. Bioinformatics 34(23):4121-4123. doi:10.1093/bioinformatics/bty407
Sagulenko P, Puller V and Neher R (2018) TreeTime: Maximum-likelihood phylodynamic analysis. Virus Evolution 4(1). doi:10.1093/ve/vex042
Wilm A, Aw PP, Bertrand D, Yeo GH, Ong SH, Wong et al. (2012) LoFreq: a sequence-quality aware, ultra-sensitive variant caller for uncovering cell-population heterogeneity from high-throughput sequencing datasets. Nucleic Acids Research 40(22):11189-11201. doi:10.1093/nar/gks918
Statistics
Summary
Assembly | ASM985889v3, INSDC Assembly GCA_009858895.3, Jan 2020 |
Base Pairs | 29,903 |
Golden Path Length | 29,903 |
Assembly provider | ENA |
Annotation provider | Ensembl |
Annotation method | Full genebuild |
Genebuild started | Apr 2020 |
Genebuild released | Apr 2020 |
Genebuild last updated/patched | |
Database version | 104.1 |
Gene counts
Coding genes | 12 |
Gene transcripts | 12 |
Other
Short Variants | 6,123 |