Variant Effect Predictor Annotation sources


VEP can use a variety of annotation sources to retrieve the transcript models used to predict consequence types.

  • Cache - a downloadable file containing all transcript models, regulatory features and variant data for a species
  • GFF or GTF - use transcript models defined in a tabix-indexed GFF or GTF file
  • Database - connect to a MySQL database server hosting Ensembl databases

Data from VCF, BED and bigWig files can also be incorporated by VEP's Custom annotation feature.

Using a cache is the most efficient way to use VEP; we would encourage you to use a cache wherever possible. Caches are easy to download and set up using the installer. Follow the tutorial for a simple guide.


Caches

Using a cache (--cache) is the fastest and most efficient way to use VEP, as in most cases only a single initial network connection is made and most data is read from local disk. Use offline mode to eliminate all network connections for speed and/or privacy.

Cache version

We strongly recommend that you download and use the VEP cache version that corresponds to your Ensembl VEP installation; for example, VEP cache version 104 should be used with Ensembl VEP tool version 104.

The VEP cache (both its data content and its structure) is regenerated for every Ensembl release to reflect that release's data and API updates. The cache data format may therefore differ between versions, and a cache may be incompatible with a newer version of the Ensembl VEP tool.
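
As a rough illustration, the check below looks for a cache directory whose version matches a given VEP release. It is a minimal sketch: the helper name has_matching_cache is hypothetical, and it assumes the <dir_cache>/<species>/<version>_<assembly> layout seen in paths such as ~/.vep/homo_sapiens/104_GRCh38.

```shell
#!/bin/sh
# Sketch only: has_matching_cache is a hypothetical helper, not part of VEP.
# Assumes caches live in <dir_cache>/<species>/<version>_<assembly>.

# Usage: has_matching_cache VERSION SPECIES ASSEMBLY [DIR_CACHE]
# Succeeds (exit 0) when the matching cache directory exists.
has_matching_cache() {
  version=$1
  species=$2
  assembly=$3
  dir_cache=${4:-$HOME/.vep}
  [ -d "$dir_cache/$species/${version}_${assembly}" ]
}

# Example: warn before running VEP release 104 without a release 104 cache.
if ! has_matching_cache 104 homo_sapiens GRCh38; then
  echo "No homo_sapiens 104_GRCh38 cache found under $HOME/.vep" >&2
fi
```

Pass a fourth argument if you keep caches somewhere other than $HOME/.vep (the directory you would give to --dir_cache).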


Downloading caches

Ensembl creates cache files for every species for each Ensembl release. They can be automatically downloaded and configured using INSTALL.pl.

If you are interested in RefSeq transcripts you may download an alternate cache file (e.g. homo_sapiens_refseq), or a merged file of RefSeq and Ensembl transcripts (e.g. homo_sapiens_merged); remember to specify --refseq or --merged when running VEP to use the relevant cache. See documentation for full details.


Manually downloading caches

It is also simple to download and set up caches without using the installer. By default, VEP searches for caches in $HOME/.vep; to use a different directory when running VEP, use --dir_cache.

FTP directories by species grouping:

Ensembl: Vertebrates (indexed) | Vertebrates
Ensembl Genomes: Bacteria | Fungi | Metazoa | Plants | Protists

NB: When using Ensembl Genomes caches, you should use the --cache_version option to specify the relevant Ensembl Genomes version number as these differ from the concurrent Ensembl/VEP version numbers.
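
A manual download for a vertebrate species might look like the sketch below. The helper name vep_cache_url is hypothetical, and the URL pattern is an assumption based on the layout of the Ensembl vertebrate FTP site for indexed caches; check the FTP directories listed above for the exact file name, especially for Ensembl Genomes species.

```shell
#!/bin/sh
# Sketch only: vep_cache_url is a hypothetical helper. The URL pattern is an
# assumption about the Ensembl (vertebrates, indexed) FTP layout.

# Usage: vep_cache_url SPECIES VERSION ASSEMBLY
vep_cache_url() {
  printf 'https://ftp.ensembl.org/pub/release-%s/variation/indexed_vep_cache/%s_vep_%s_%s.tar.gz\n' \
    "$2" "$1" "$2" "$3"
}

# Typical manual setup (left commented out because the archive is several GB):
#   mkdir -p "$HOME/.vep" && cd "$HOME/.vep"
#   curl -O "$(vep_cache_url homo_sapiens 104 GRCh38)"
#   tar xzf homo_sapiens_vep_104_GRCh38.tar.gz
```

If you unpack the archive somewhere other than $HOME/.vep, point VEP at that directory with --dir_cache.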


Data in the cache

The data content of VEP caches varies by species. This table shows the contents of the default human cache files in release 104.

Source                    Version (GRCh38)                                      Version (GRCh37)
Ensembl database version  104                                                   104
Genome assembly           GRCh38.p13                                            GRCh37.p13
GENCODE                   38                                                    19
RefSeq                    2020-12-10 (GCF_000001405.39_GRCh38.p13_genomic.gff)  2020-10-26 (GCF_000001405.25_GRCh37.p13_genomic.gff)
Regulatory build          1.0                                                   1.0
PolyPhen                  2.2.2                                                 2.2.2
SIFT                      5.2.2                                                 5.2.2
dbSNP                     154                                                   154
COSMIC                    92                                                    92
HGMD-PUBLIC               2020.4                                                2020.4
ClinVar                   2021-01-02                                            2020-12
1000 Genomes              Phase 3 (remapped)                                    Phase 3
NHLBI-ESP                 V2-SSA137 (remapped)                                  V2-SSA137
gnomAD                    r2.1.1, exomes only                                   r2.1, exomes only

Limitations of the cache

The cache stores the following information:

  • Transcript location, sequence, exons and other attributes
  • Gene, protein, HGNC and other identifiers for each transcript (where applicable, limitations apply to RefSeq caches)
  • Locations, alleles and frequencies of existing variants
  • Regulatory regions
  • Predictions and scores for SIFT, PolyPhen

The cache does not store any information pertaining to, and therefore cannot be used for, the following:

  • HGVS names (--hgvs, --hgvsg) - to retrieve these you must additionally point to a FASTA file containing the reference sequence for your species (--fasta)
  • Using HGVS notation as input (--format hgvs)
  • Using variant identifiers as input (--format id)
  • Finding overlapping structural variants (--check_sv)

Enabling one of these options with --cache will cause VEP to warn you in its status output with something like the following:

 2011-06-16 16:24:51 - INFO: Database will be accessed when using --hgvs 

Convert with tabix

If you have Bio::DB::HTS (as set up by INSTALL.pl) or tabix installed on your system, the speed of retrieving existing co-located variants can be greatly improved by converting the cache files using the supplied script, convert_cache.pl. This replaces the plain-text, chunked variant dumps with a single tabix-indexed file per chromosome. The script is simple to run:

perl convert_cache.pl -species [species] -version [vep_version]

To convert all species and all versions, use "all":

perl convert_cache.pl -species all -version all

A full description of the options can be seen using --help. Once the conversion is complete, VEP will automatically detect the converted cache and use it in place of the original.

Note that tabix and bgzip must be installed on your system to convert a cache. INSTALL.pl downloads these when setting up Bio::DB::HTS; to enable convert_cache.pl to find them, run:

export PATH=${PATH}:${PWD}/htslib
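
Before starting a conversion it can be worth confirming that both tools are actually visible on your PATH. The following is a minimal sketch; missing_tools is a hypothetical helper name, not something shipped with VEP.

```shell
#!/bin/sh
# Sketch only: missing_tools is a hypothetical helper.
# Prints the name of each command that cannot be found on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
  return 0
}

# Example: only convert when tabix and bgzip are both available.
#   if [ -z "$(missing_tools tabix bgzip)" ]; then
#     perl convert_cache.pl -species homo_sapiens -version 104
#   fi
```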


Data privacy and offline mode

When using the public database servers, VEP requests transcript and variation data that overlap the loci in your input file. As such, these coordinates are transmitted over the network to a public server, which may not be appropriate for the analysis of sensitive or private data.

Note

Only the coordinates are transmitted to the server; no other information is sent.

To run VEP in an offline mode that does not use any network connections, use the flag --offline.

The limitations described above apply absolutely when using offline mode. For example, if you specify --offline and --format id, VEP will report an error and refuse to run:

ERROR: Cannot use ID format in offline mode

All other features, including the ability to use custom annotations and plugins, are accessible in offline mode.
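
The combinations ruled out above can be caught before VEP is launched. The sketch below scans a proposed argument list and reports options from the cache limitations list that cannot work together with --offline. The helper name offline_conflicts is hypothetical, and the parsing is deliberately simple: it recognises only the double-dash spellings shown in this document.

```shell
#!/bin/sh
# Sketch only: offline_conflicts is a hypothetical pre-flight helper.
# It prints one line per option that needs a database connection and is
# therefore unusable with --offline (per the cache limitations listed above).
offline_conflicts() {
  offline=0
  prev=''
  found=''
  for arg in "$@"; do
    case "$arg" in
      --offline)                  offline=1 ;;
      --check_sv)                 found="$found --check_sv" ;;
      --format=id|--format=hgvs)  found="$found $arg" ;;
      id|hgvs)                    [ "$prev" = "--format" ] && found="$found --format=$arg" ;;
    esac
    prev=$arg
  done
  # Without --offline, VEP can fall back to a database, so nothing is reported.
  [ "$offline" -eq 1 ] || return 0
  for item in $found; do
    printf '%s\n' "$item"
  done
}

# Example:
#   offline_conflicts --offline --format id -i ids.txt    # prints --format=id
```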



GFF/GTF files

VEP can use transcript annotations defined in GFF or GTF files. The files must be bgzipped and indexed with tabix, and a FASTA file containing the genomic sequence is required in order to generate transcript models.

Your GFF or GTF file must be sorted in chromosomal order. VEP does not use header lines so it is safe to remove them.

grep -v "^#" data.gff | sort -k1,1 -k4,4n -k5,5n -t$'\t' | bgzip -c > data.gff.gz
tabix -p gff data.gff.gz
./vep -i input.vcf --gff data.gff.gz --fasta genome.fa.gz
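
To check whether an existing file already satisfies this ordering, something like the following can be used. It is a minimal sketch: gff_is_sorted is a hypothetical helper, it applies the same sort keys as the command above, and it fixes the locale to C, which is an assumption you may need to adjust if your chromosome names sort differently in your locale.

```shell
#!/bin/sh
# Sketch only: gff_is_sorted is a hypothetical helper.
# Succeeds (exit 0) when the non-header lines are ordered by sequence name,
# then start, then end, using the same keys as the sort command above.
gff_is_sorted() {
  tab=$(printf '\t')
  grep -v '^#' "$1" |
    LC_ALL=C sort -c -s -t "$tab" -k1,1 -k4,4n -k5,5n 2>/dev/null
}

# Example:
#   gff_is_sorted data.gff && echo "already sorted" || echo "needs sorting"
```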

You may use any number of GFF/GTF files in this way, providing they refer to the same genome. You may also use them in concert with annotations from a cache or database source; annotations are distinguished by the SOURCE field in the VEP output.

  • GFF file
    Example command line using the --gff flag:

    ./vep -i input.vcf --cache --gff data.gff.gz --fasta genome.fa.gz

    This functionality uses VEP's custom annotation feature, and the --gff flag is a shortcut to:

    --custom data.gff.gz,,gff

    NOTE: You should use the longer custom annotation form if you wish to customise the name of the GFF as it appears in the SOURCE field and VEP output header.

  • GTF file
    Example command line using the --gtf flag:

    ./vep -i input.vcf --cache --gtf data.gtf.gz --fasta genome.fa.gz

    This functionality uses VEP's custom annotation feature, and the --gtf flag is a shortcut to:

    --custom data.gtf.gz,,gtf

    NOTE: You should use the longer custom annotation form if you wish to customise the name of the GTF as it appears in the SOURCE field and VEP output header.


GFF format expectations

VEP has been tested on GFF files generated by Ensembl and NCBI (RefSeq). Due to inconsistency in the GFF specification and adherence to it, VEP may encounter problems parsing some GFF files. For the same reason, not all transcript biotypes defined in your GFF may be supported by VEP. VEP does not support GFF files with embedded FASTA sequence.


Column "type" (3rd column):

VEP supports only a defined set of entity/feature types in this column. Lines of other types will be ignored; if this leads to an incomplete transcript model, the whole transcript model may be discarded.


Expected parameters in the 9th column:

  • ID

    Required only for gene and transcript entities.

  • parent/Parent

    - Entities in the GFF are expected to be linked using a key named "parent" or "Parent" in the attributes (9th) column of the GFF.
    - Unlinked entities (i.e. those with no parents or children) are discarded.
    - Sibling entities (those that share the same parent) may have overlapping coordinates, e.g. for exon and CDS entities.

  • biotype

    Transcripts require a Sequence Ontology biotype to be defined in order to be parsed by VEP.
    The simplest way to define this is using an attribute named "biotype" on the transcript entity. Other configurations are supported in order for VEP to be able to parse GFF files from NCBI and other sources.

Here is an example:

##gff-version 3.2.1
##sequence-region 1 1 10000
1 Ensembl gene        1000  5000  . + . ID=gene1;Name=GENE1
1 Ensembl transcript  1100  4900  . + . ID=transcript1;Name=GENE1-001;Parent=gene1;biotype=protein_coding
1 Ensembl exon        1200  1300  . + . ID=exon1;Name=GENE1-001_1;Parent=transcript1
1 Ensembl exon        1500  3000  . + . ID=exon2;Name=GENE1-001_2;Parent=transcript1
1 Ensembl exon        3500  4000  . + . ID=exon3;Name=GENE1-001_3;Parent=transcript1
1 Ensembl CDS         1300  3800  . + . ID=cds1;Name=CDS0001;Parent=transcript1
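
Because entities are linked through their parent attribute, a Parent value that does not match any ID in the file breaks the link. The sketch below looks for that one problem. It is not a full validator: gff_orphans is a hypothetical helper, it assumes a real tab-delimited GFF file (the example above is shown with spaces for readability), and it handles only the ID and parent/Parent attributes.

```shell
#!/bin/sh
# Sketch only: gff_orphans is a hypothetical helper.
# Prints "type<TAB>ID<TAB>missing_parent" for every feature whose Parent
# attribute names an ID that is not defined anywhere in the file.
gff_orphans() {
  awk -F '\t' '
    /^#/ || NF < 9 { next }
    {
      id = ""; parents = ""
      n = split($9, attr, ";")
      for (i = 1; i <= n; i++) {
        if (attr[i] ~ /^ID=/)         id = substr(attr[i], 4)
        if (attr[i] ~ /^[Pp]arent=/)  parents = substr(attr[i], 8)
      }
      if (id != "") known[id] = 1
      if (parents != "") { want[NR] = parents; label[NR] = $3 "\t" id }
    }
    END {
      # A feature may list several parents separated by commas.
      for (rec in want) {
        m = split(want[rec], p, ",")
        for (j = 1; j <= m; j++)
          if (!(p[j] in known)) print label[rec] "\t" p[j]
      }
    }
  ' "$1"
}

# Example:
#   gff_orphans data.gff
```

An empty result means every Parent reference resolves; it does not guarantee that VEP will accept every transcript model.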

GTF format expectations

The following GTF entity types will be extracted:

  • cds (or CDS)
  • stop_codon
  • exon
  • gene
  • transcript

Entities are linked by an attribute named for the parent entity type e.g. exon is linked to transcript by transcript_id, transcript is linked to gene by gene_id.

Transcript biotypes are defined in attributes named "biotype", "transcript_biotype" or "transcript_type". If none of these exist, VEP will attempt to interpret the source field (2nd column) of the GTF as the biotype.

Here is an example:

1 Ensembl gene        1000  5000  . + . gene_id "gene1"; gene_name "GENE1";
1 Ensembl transcript  1100  4900  . + . gene_id "gene1"; transcript_id "transcript1"; gene_name "GENE1"; transcript_name "GENE1-001"; transcript_biotype "protein_coding";
1 Ensembl exon        1200  1300  . + . gene_id "gene1"; transcript_id "transcript1"; exon_number "exon1"; exon_id "GENE1-001_1";
1 Ensembl exon        1500  3000  . + . gene_id "gene1"; transcript_id "transcript1"; exon_number "exon2"; exon_id "GENE1-001_2";
1 Ensembl exon        3500  4000  . + . gene_id "gene1"; transcript_id "transcript1"; exon_number "exon3"; exon_id "GENE1-001_3";
1 Ensembl CDS         1300  3800  . + . gene_id "gene1"; transcript_id "transcript1"; exon_number "exon2"; ccds_id "CDS0001";
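
The biotype fallback described above can be previewed with a short script. This is a sketch of the idea rather than VEP's own parser. gtf_transcript_biotypes is a hypothetical helper, it assumes a tab-delimited file, and when more than one of the three attributes is present it takes the first one on the line, which is an assumption: the text above does not state a priority order.

```shell
#!/bin/sh
# Sketch only: gtf_transcript_biotypes is a hypothetical helper.
# For each transcript line, prints "transcript_id<TAB>biotype", taking the
# biotype from a biotype/transcript_biotype/transcript_type attribute and
# falling back to the source field (2nd column) when none is present.
gtf_transcript_biotypes() {
  awk -F '\t' '
    /^#/ || $3 != "transcript" { next }
    {
      tid = ""; biotype = ""
      n = split($9, attr, ";")
      for (i = 1; i <= n; i++) {
        sub(/^ +/, "", attr[i])
        if (attr[i] == "") continue
        key = attr[i]; sub(/ .*/, "", key)
        val = attr[i]; sub(/^[^ ]+ +/, "", val); gsub(/"/, "", val)
        if (key == "transcript_id") tid = val
        if (biotype == "" && (key == "biotype" || key == "transcript_biotype" || key == "transcript_type"))
          biotype = val
      }
      if (biotype == "") biotype = $2   # fall back to the source column
      print tid "\t" biotype
    }
  ' "$1"
}

# Example:
#   gtf_transcript_biotypes data.gtf | cut -f2 | sort | uniq -c
```

Attribute values that themselves contain semicolons are not handled by this simple split.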

Chromosome synonyms

If the chromosome names used in your GFF/GTF differ from those used in the FASTA or your input VCF, you may see warnings like this when running VEP:

WARNING: Chromosome 21 not found in annotation sources or synonyms on line 160

To circumvent this you may provide VEP with a synonyms file. A synonym file is included in VEP's cache files, so if you have one of these for your species you can use it as follows:

./vep -i input.vcf -cache -gff data.gff.gz -fasta genome.fa.gz -synonyms ~/.vep/homo_sapiens/104_GRCh38/chr_synonyms.txt
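
If no cache is available for your species, a synonyms file can be written by hand. The sketch below generates one for human-style names. It rests on an assumption that should be checked against the VEP options documentation: that the file is tab-delimited with one chromosome name and one synonym per line. The helper name make_chr_synonyms and the output file name are hypothetical.

```shell
#!/bin/sh
# Sketch only: make_chr_synonyms is a hypothetical helper.
# ASSUMPTION: the synonyms file is tab-delimited, one "name<TAB>synonym"
# pair per line. Verify this against the --synonyms documentation.
make_chr_synonyms() {
  for chr in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X Y; do
    printf '%s\tchr%s\n' "$chr" "$chr"
  done
  printf 'MT\tchrM\n'
}

# Example:
#   make_chr_synonyms > my_synonyms.txt
#   ./vep -i input.vcf --gff data.gff.gz --fasta genome.fa.gz --synonyms my_synonyms.txt
```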

Limitations of GFF/GTF files

Using a GFF or GTF file as VEP's annotation source limits access to some auxiliary information available when using a cache. Currently most external reference data such as gene symbols, transcript identifiers and protein domains are inaccessible when using only a GFF/GTF file.

VEP's flexibility does allow some annotation types to be replaced. The following table illustrates some examples and alternative means to retrieve equivalent data.

Data type                                            Alternative
SIFT and PolyPhen predictions (--sift, --polyphen)   Use the PolyPhen_SIFT VEP plugin
Co-located variants (--check_existing, --af* flags)  A couple of options are available:
                                                       1. Use a VCF with --custom to retrieve variant IDs, frequency and other data
                                                       2. Add --cache to use variants in the cache. *
Regulatory consequences (--regulatory)               Add --cache to use regulatory features in the cache. *

* Note this will also instruct VEP to annotate input variants against transcript models retrieved from the cache as well as those from the GFF/GTF file. It is possible to use --transcript_filter to include only the transcripts from your GFF/GTF file:

./vep -i input.vcf -cache -custom data.gff.gz,myGFF,gff -fasta genome.fa.gz -transcript_filter "_source_cache is myGFF"


FASTA files

By pointing VEP to a FASTA file (or directory containing several files), it is possible to retrieve reference sequence locally when using --cache or --offline. This enables VEP to retrieve HGVS notations (--hgvs), check the reference sequence given in input data (--check_ref), and construct transcript models from a GFF or GTF file without accessing a database.

FASTA files can be set up using the installer. Files set up this way are automatically detected by VEP when using --cache or --offline, so you should not need to specify them manually with --fasta.

To enable this VEP uses one of two modules:

  • The Bio::DB::HTS Perl XS module with HTSlib. This module uses compiled C code and can access compressed (bgzipped) or uncompressed FASTA files. It is set up by the VEP installer.
  • The Bio::DB::Fasta module. This may be used on systems where installation of the Bio::DB::HTS module has not been possible. It can access only uncompressed FASTA files. It is also set up by the VEP installer and comes as part of the BioPerl package.

The first time you run VEP with a specific FASTA file, an index will be built. This can take a few minutes, depending on the size of the FASTA file and the speed of your system. On subsequent runs the index does not need to be rebuilt (if the FASTA file has been modified, VEP will force a rebuild of the index).


FASTA FTP directories

Suitable reference FASTA files are available to download from the Ensembl FTP server. See the Downloads page for details.
You should preferably use the installer as described above to fetch these files; manual instructions are provided for reference. In most cases it is best to download the single large "primary_assembly" file for your species. You should use the unmasked (without "_rm" or "_sm" in the name) sequences.
Note that VEP requires the file to be either decompressed (for use with Bio::DB::Fasta) or decompressed and then recompressed with bgzip (for use with Bio::DB::HTS::Faidx); when decompressed, these files can be very large (25GB for human). An example set of commands for setting up the data for human follows:

curl -O https://ftp.ensembl.org/pub/release-104/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
gzip -d Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
bgzip Homo_sapiens.GRCh38.dna.primary_assembly.fa
./vep -i input.vcf --offline --hgvs --fasta Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
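
A .fa.gz file straight from the FTP site and one recompressed with bgzip share the same extension, so it can be useful to confirm which one you have. The sketch below inspects the first bytes of the file for the BGZF marker. is_bgzf is a hypothetical helper, and the check is a heuristic that assumes the "BC" extra subfield comes first in the gzip header, as it does in files written by the bgzip tool.

```shell
#!/bin/sh
# Sketch only: is_bgzf is a hypothetical helper.
# Succeeds (exit 0) when the file starts with a BGZF block header:
# gzip magic 1f 8b, method 08, FEXTRA flag 04, and "BC" (42 43) at bytes 12-13.
is_bgzf() {
  header=$(od -An -tx1 -N14 "$1" | tr -d ' \t\n')
  case "$header" in
    1f8b0804????????????????4243) return 0 ;;
    *)                            return 1 ;;
  esac
}

# Example:
#   is_bgzf Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz || echo "plain gzip: recompress with bgzip"
```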


Databases

VEP can use remote or local database servers to retrieve annotations.

  • Using --cache (without --offline) uses the local cache on disk to fetch most annotations, but allows database connections for some features (see cache limitations)
  • Using --database tells VEP to retrieve all annotations from the database. Please only use this for small input files or when using a local database server!

Public database servers

By default, VEP is configured to connect to the public Ensembl MySQL instance at ensembldb.ensembl.org. If you are in the USA (or geographically closer to the east coast of the USA than to the Ensembl data centre in Cambridge, UK), a mirror server is available at useastdb.ensembl.org. To use the mirror, use the flag --host useastdb.ensembl.org

Data for Ensembl Genomes species (e.g. plants, fungi, microbes) is available through a different public MySQL server. The appropriate connection parameters can be automatically loaded by using the flag --genomes

If you have a very small data set (100s of variants), using the public database servers should provide adequate performance. If you have larger data sets, or wish to use VEP in a batch manner, consider one of the alternatives below.


Using a local database

It is possible to set up a local MySQL mirror with the databases for your species of interest installed. For instructions on installing a local mirror, see here. You will need a MySQL server that you can connect to from the machine where you will run VEP (this can be the same machine). For most of the functionality of VEP, you will only need the Core database (e.g. homo_sapiens_core_104_38) installed. In order to find co-located variants or to use SIFT or PolyPhen, it is also necessary to install the relevant variation database (e.g. homo_sapiens_variation_104_38).

Note that unless you have custom data to insert in the database, in most cases it will be much more efficient to use a pre-built cache in place of a local database.

To connect to your mirror, you can either set the connection parameters using --host, --port, --user and --password, or use a registry file. Registry files contain all the connection parameters for your database, as well as any species aliases you wish to set up:

use Bio::EnsEMBL::DBSQL::DBAdaptor;
use Bio::EnsEMBL::Variation::DBSQL::DBAdaptor;
use Bio::EnsEMBL::Registry;

Bio::EnsEMBL::DBSQL::DBAdaptor->new(
  '-species' => "Homo_sapiens",
  '-group'   => "core",
  '-port'    => 5306,
  '-host'    => 'ensembldb.ensembl.org',
  '-user'    => 'anonymous',
  '-pass'    => '',
  '-dbname'  => 'homo_sapiens_core_104_38'
);

Bio::EnsEMBL::Variation::DBSQL::DBAdaptor->new(
  '-species' => "Homo_sapiens",
  '-group'   => "variation",
  '-port'    => 5306,
  '-host'    => 'ensembldb.ensembl.org',
  '-user'    => 'anonymous',
  '-pass'    => '',
  '-dbname'  => 'homo_sapiens_variation_104_38'
);

Bio::EnsEMBL::Registry->add_alias("Homo_sapiens","human");

For more information on the registry and registry files, see here.



Cache - technical information

The cache consists of compressed files containing listrefs of serialised objects. These objects are initially created from the database as if using the Ensembl API normally. In order to reduce the size of the cache and allow the serialisation to occur, some changes are made to the objects before they are dumped to disk. This means that, when writing code such as a plugin that uses the cache, these objects will not behave in exactly the same way as objects retrieved from the database.

The following hash keys are deleted from each transcript object:

  • analysis
  • created_date
  • dbentries : this contains the external references retrieved when calling $transcript->get_all_DBEntries(); hence this call on a cached object will return no entries
  • description
  • display_xref
  • edits_enabled
  • external_db
  • external_display_name
  • external_name
  • external_status
  • is_current
  • modified_date
  • status
  • transcript_mapper : used to convert between genomic, cdna, cds and protein coordinates. A copy of this is cached separately by VEP as

    $transcript->{_variation_effect_feature_cache}->{mapper}

As mentioned above, a special hash key "_variation_effect_feature_cache" is created on the transcript object. VEP uses it to cache data needed for predicting consequences that would otherwise have to be fetched from the database. Some of these entries are stored in place of equivalent keys that are deleted as described above. The following keys and data are stored:

  • introns : listref of intron objects for the transcript. The adaptor, analysis, dbID, next, prev and seqname keys are stripped from each intron object
  • translateable_seq : as returned by

    $transcript->translateable_seq

  • mapper : transcript mapper as described above
  • peptide : the translated sequence as a string, as returned by

    $transcript->translate->seq

  • protein_features : protein domains for the transcript's translation as returned by

    $transcript->translation->get_all_ProteinFeatures

    Each protein feature is stripped of all keys but: start, end, analysis, hseqname
  • codon_table : the codon table ID used to translate the transcript, as returned by

    $transcript->slice->get_all_Attributes('codon_table')->[0]

  • protein_function_predictions : a hashref containing the keys "sift" and "polyphen"; each one contains a protein function prediction matrix as returned by e.g.

    $protein_function_prediction_matrix_adaptor->fetch_by_analysis_translation_md5('sift', md5_hex($transcript->{_variation_effect_feature_cache}->{peptide}))

Similarly, some further data is cached directly on the transcript object under the following keys:

  • _gene : gene object. This object has all keys but the following deleted: start, end, strand, stable_id
  • _gene_symbol : the gene symbol
  • _ccds : the CCDS identifier for the transcript
  • _refseq : the "NM" RefSeq mRNA identifier for the transcript
  • _protein : the Ensembl stable identifier of the translation
  • _source_cache : the source of the transcript object. Only defined in the merged cache (values: Ensembl, RefSeq) or when using a GFF/GTF file (value: short name or filename)