Entire databases can be downloaded from our FTP site in a variety of formats. Please be aware that some of these files can run to many gigabytes of data.
To facilitate storage and download, all databases are compressed with GNU Zip (gzip, *.gz).
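Because the dumps are gzip-compressed and can be very large, it is often convenient to stream them rather than decompress them to disk first. The following is a minimal sketch using only the Python standard library; the function name `read_fasta_gz` is hypothetical and not part of any Ensembl tool, and it assumes a conventional FASTA layout (a `>` header line followed by one or more sequence lines).

```python
import gzip
from typing import Iterator, Tuple


def read_fasta_gz(path: str) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from a gzip-compressed FASTA file.

    The file is decompressed on the fly, one line at a time, so only the
    record currently being assembled is held in memory.
    """
    header = None
    chunks = []
    # "rt" opens the gzip stream in text mode.
    with gzip.open(path, "rt") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line.startswith(">"):
                # A new header closes the previous record, if there was one.
                if header is not None:
                    yield header, "".join(chunks)
                header = line[1:]
                chunks = []
            elif header is not None:
                chunks.append(line.strip())
    # Emit the final record once the file is exhausted.
    if header is not None:
        yield header, "".join(chunks)
```

Calling `read_fasta_gz("example.dna.fa.gz")` (a placeholder file name) returns a generator, so records can be processed in a `for` loop without loading the whole file. A single very long record, such as a whole chromosome, is still assembled in memory before it is yielded.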
About the data
The following types of data dumps are available on the FTP site.
- FASTA sequence databases of Ensembl gene, transcript and protein model predictions. Since the FASTA format does not permit sequence annotation, these database files are mainly intended for use with local sequence similarity search algorithms. Each directory has a README file with a detailed description of the header line format and the file naming conventions.
- Masked and unmasked genome sequences associated with the assembly (contigs, chromosomes etc.).
- The header line in a FASTA dump file containing DNA sequence consists of the following attributes: coord_system:version:name:start:end:strand. This coordinate-system string is used in the Ensembl API to retrieve slices with the SliceAdaptor.
- Coding sequences for Ensembl or ab initio predicted genes.
- cDNA sequences for Ensembl or ab initio predicted genes.
- Protein sequences for Ensembl or ab initio predicted genes.
- Non-coding RNA gene predictions.
- Annotated sequence
- Flat files allow more extensive sequence annotation by means of feature tables and thus contain the genome sequence as annotated by the automated Ensembl genome annotation pipeline. Each nucleotide sequence record in a flat file represents a 1 Mb slice of the genome sequence. Flat files are broken into chunks of 1000 sequence records for easier downloading.
- All Ensembl MySQL databases are available in text format, as are the SQL table definition files. These can be imported into any SQL database for a local installation of a mirror site. Generally, the FTP directory tree contains one directory per database. For more information about these databases and their Application Programming Interfaces (APIs), see the API section.
- Gene sets for each species. These files include annotations of both coding and non-coding genes. This file format is described here.
- GFF3 provides access to all annotated transcripts which make up an Ensembl gene set. This file format is described here.
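The coordinate-system string used in DNA FASTA headers (coord_system:version:name:start:end:strand) can be split into its six attributes with a few lines of code. The sketch below is illustrative only: the `SliceLocation` class and `parse_location` function are hypothetical names, the example string is an illustrative value, and the code assumes that none of the six fields contains a colon. Where this string sits within the full header line is described in the README of each directory and is not handled here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SliceLocation:
    """The six attributes of a coord_system:version:name:start:end:strand string."""
    coord_system: str
    version: str
    name: str
    start: int
    end: int
    strand: int


def parse_location(location: str) -> SliceLocation:
    """Split a colon-separated location string into typed fields."""
    parts = location.strip().split(":")
    if len(parts) != 6:
        raise ValueError(
            f"expected 6 colon-separated fields, got {len(parts)}: {location!r}"
        )
    coord_system, version, name, start, end, strand = parts
    # start, end and strand are numeric; the other fields stay as text.
    return SliceLocation(coord_system, version, name, int(start), int(end), int(strand))


# Illustrative value only, not taken from a specific dump file.
loc = parse_location("chromosome:GRCh38:1:1:248956422:1")
print(loc.coord_system, loc.name, loc.end - loc.start + 1)
```

Keeping the fields in a small immutable record makes it straightforward to filter sequences by coordinate system or region name before passing the same string on to other tools.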