Compara API Tutorial

Introduction

This tutorial is an introduction to the Ensembl Compara API. Knowledge of the Ensembl Core API and of the coding conventions used in the Ensembl APIs is assumed.

Documentation about the Compara database schema is available here, and while is not necessary for this tutorial, an understanding of the database tables may help as many of the adaptor modules are table-specific.

Species and databases in Ensembl Compara

The compara API can be used to access the main Ensembl database as well as the non-vertebrate Ensembl Genomes databases. Details of how to access these databases with your registry are found in the general instructions for the Ensembl APIs. There is no comparative genomics data in the Ensembl GRCh37 database.

When using Compara adaptors, we specify the species as 'Multi' for vertebrates; for non vertebrates, we name the division, such as 'Plants' or 'Fungi'. The pan-taxonomic comparative analysis can be accessed by specifying as 'pan_homology'.

Main Ensembl Compara objects

The Ensembl Compara databases uses a number of objects that share some similarities to objects in other Ensembl APIs, but are not exactly the same. Here are some examples of object types and how they differ from their counterparts in other APIs:

  • GenomeDB objects represent a genome in the database, usually a single species but it may also be a strain of a species. It contains information about the taxonomy and is usually fetched with the GenomeDBAdaptor and the species name. Unlike the Core, Variation and Regulation databases, where each genome has its own database and this is specified in the adaptor, the compara databases are multi-genome, which means each genome must be represented as an object within the database.
  • DnaFrag objects are similar to the seq_regions attached to Features in other databases, and represent chromosomes, contigs or scaffolds.
  • SpeciesSets are groups of species.
  • A Method is a type of analysis that can be done on a SpeciesSet.
  • A MethodLinkSpeciesSet combines a Method with a SpeciesSet to specify a particular analysis.

MethodLinkSpeciesSets

The compara API works by specifying which type of analysis and set of species you're working with using the MethodLinkSpeciesSet object. This is fetched using the MethodLinkSpeciesSetAdaptor.

A MethodLinkSpeciesSet is obtained using its species and its Method. The Method represents the type of analysis, and includes gene homology methods and whole genome alignment methods. Below is a non-exhaustive list of methods available:

Method nameDescription
LASTZ_NETPairwise whole genome alignment using LastZ
EPOMultiple whole genome alignment (WGA), with ancestral inference, using Enredo, Pecan and Ortheus
EPO_EXTENDEDMultiple WGA for lower quality genomes, extending EPO using LASTZ
PECANMultiple WGA using Mercator and Pecan
CACTUS_HALMultiple WGA using progressiveCactus (* additional setup required - see here for details)
ENSEMBL_ORTHOLOGUESOrthologue calls between a pair of species
ENSEMBL_PARALOGUESParalogue calls for a species
FAMILYProtein families for a set of species
PROTEIN_TREESGene trees for protein coding genes
NC_TREESGene trees for non-coding RNAs

For pairwise MethodLinkSpeciesSets, such as pairwise orthologues or LASTZ_NET pairwise whole genome alignments, you can get the MethodLinkSpeciesSet using the species name (known as the registry_alias), which can be the common name or the latin name, or by the GenomeDB.

For multiple alignment MethodLinkSpeciesSets, such as EPO, PECAN or CACTUS_HAL, these can be fetched using the taxonomy-based name of the group, such as "mammals", "amniotes" or "murinae". The documentation for the alignments includes the terms you need to fetch these alignments with the MethodLinkSpeciesSetAdaptor.

It is possible to find the MethodLinkSpeciesSet IDs, and fetch using these, but we do not recommend this, particularly for the multiple alignments, as when new species are added to these alignments, new IDs are assigned, so your scripts will not work in new releases.

The following script fetches the LastZ-net alignment between human and mouse, using the MethodLinkSpeciesSet and GenomeDBs:

use strict;
use warnings;
use Bio::EnsEMBL::Registry;

my $registry = 'Bio::EnsEMBL::Registry';

$registry->load_registry_from_db(
  -host => 'ensembldb.ensembl.org',
  -user => 'anonymous'
);

# Get the GenomeDB adaptor
my $genome_db_adaptor = $registry->get_adaptor( 'Multi', 'compara', 'GenomeDB' );

# Fetch GenomeDB objects for human and mouse:
my $human_genome_db = $genome_db_adaptor->fetch_by_name_assembly('homo_sapiens');
my $mouse_genome_db = $genome_db_adaptor->fetch_by_name_assembly('mus_musculus');

# Get the MethodLinkSpeciesSet adaptor
my $method_link_species_set_adaptor = $registry->get_adaptor( 'Multi', 'compara', 'MethodLinkSpeciesSet');

# Fetch the MethodLinkSpeciesSet object corresponding to LASTZ_NET alignments between human and mouse genomic sequences
my $human_mouse_lastz_net_mlss = $method_link_species_set_adaptor->fetch_by_method_link_type_GenomeDBs( "LASTZ_NET", [$human_genome_db, $mouse_genome_db] );
Used objects:

Registry

What is this object $registry? Make sure you have defined it in all your scripts. Learn more in the general instructions page.

Ensembl Compara Adaptors

Below is a non-exhaustive list of Ensembl Compara adaptors that are most often used:

Only some of these adaptors will be used for illustration as part of this tutorial through commented perl scripts code.

Whole Genome Alignments

The Compara database contains a number of different types of whole genome alignments, including multiple alignments and pairwise alignments.

GenomicAlignBlock objects

GenomicAlignBlocks are the preferred way to store and fetch genomic alignments. A GenomicAlignBlock contains several GenomicAlign objects. Every GenomicAlign object corresponds to a piece of genomic sequence from one genome aligned with another GenomicAlign from another genome in the same GenomicAlignBlock. A GenomicAlign object is always in relation with other GenomicAlign objects and this relation is defined through the GenomicAlignBlock object. Therefore the usual way to fetch genomic alignments is by fetching GenomicAlignBlock objects.

You can fetch GenomicAlignBlocks using Slice objects. When you fetch GenomicAlignBlocks this way, the blocks will not just represent your Slice, but the whole block of alignment. This means that you will need to restrict the block once you have fetched it. You will always get an array of blocks.

use strict;
use warnings;
use Bio::EnsEMBL::Registry;

my $registry = 'Bio::EnsEMBL::Registry';

$registry->load_registry_from_db(
  -host => 'ensembldb.ensembl.org',
  -user => 'anonymous'
);

# Define the query species and the coordinates of the Slice
my $query_species = 'human';
my $seq_region = '14';
my $seq_region_start = 75000000;
my $seq_region_end   = 75010000;

# Get the SliceAdaptor and fetch a slice
my $slice_adaptor = $registry->get_adaptor( $query_species, 'core', 'Slice' );
my $query_slice = $slice_adaptor->fetch_by_region( 'toplevel', $seq_region, $seq_region_start, $seq_region_end );

# Get the GenomeDB adaptor
my $genome_db_adaptor = $registry->get_adaptor( 'Multi', 'compara', 'GenomeDB' );

# Fetch GenomeDB objects for human and mouse:
my $human_genome_db = $genome_db_adaptor->fetch_by_name_assembly('homo_sapiens');
my $mouse_genome_db = $genome_db_adaptor->fetch_by_name_assembly('mus_musculus');

# Get the MethodLinkSpeciesSetAdaptor
my $method_link_species_set_adaptor = $registry->get_adaptor( 'Multi', 'compara', 'MethodLinkSpeciesSet');

# Fetch the MethodLinkSpeciesSet object corresponding to LASTZ_NET alignments between human and mouse genomic sequences
my $human_mouse_lastz_net_mlss = $method_link_species_set_adaptor->fetch_by_method_link_type_GenomeDBs( "LASTZ_NET", [$human_genome_db, $mouse_genome_db] );

# Get the GenomicAlignBlockAdaptor
my $genomic_align_block_adaptor = $registry->get_adaptor( 'Multi', 'compara', 'GenomicAlignBlock' );

# Fetch all the GenomicAlignBlocks corresponding to this Slice from the pairwise alignments (LASTZ_NET) between human and mouse
my @genomic_align_blocks = @{ $genomic_align_block_adaptor->fetch_all_by_MethodLinkSpeciesSet_Slice( $human_mouse_lastz_net_mlss, $query_slice ) };

# We will then (usually) need to restrict the blocks to the required positions in the reference sequence

foreach my $genomic_align_block( @genomic_align_blocks ) {
    my $restricted_gab = $genomic_align_block->restrict_between_reference_positions($seq_region_start, $seq_region_end);
}

Print alignment

Once you've retrieved you alignment, you can use modules in BioPerl to print your alignment, such as Bio::SimpleAlign, from the GenomicAlignBlocks. The following code can be added to the end of the previous script to get SimpleAligns and print them to a file:

my $all_aligns;

# Get a Bio::SimpleAlign object from every GenomicAlignBlock
foreach my $this_genomic_align_block (@genomic_align_blocks) {
	my $restricted_gab = $this_genomic_align_block->restrict_between_reference_positions($seq_region_start, $seq_region_end);
    my $simple_align = $restricted_gab->get_SimpleAlign;
    push(@$all_aligns, $simple_align);
}

# print all the genomic alignments using a Bio::AlignIO object
my $output_format = "clustalw";
my $alignIO = Bio::AlignIO->newFh(
    -interleaved => 0,
    -fh => \*STDOUT,
    -format => $output_format,
    -idlength => 10
);

foreach my $this_align (@$all_aligns) {
    print $alignIO $this_align;
}
Used objects:
Adaptor objects
Main objects

Print coordinates

From a GenomicAlign object, you can create a Slice for that species, which you can then use to fetch coordinates or sequence, or perform any other Slice functions, like find overlapping features. Taking again the script where we fetched the GenomicAlignBlocks, we can add the following code to get coordinates:

foreach my $genomic_align_block( @genomic_align_blocks ) {
    my $restricted_gab = $genomic_align_block->restrict_between_reference_positions($seq_region_start, $seq_region_end);
    
    # fetch the GenomicAligns and move through
    my @genomic_aligns = @ { $restricted_gab->get_all_GenomicAligns };
    foreach my $genomic_align (@genomic_aligns) {
    	my $species = $genomic_align->genome_db->get_scientific_name;
    	my $slice = $genomic_align->get_Slice;
    	print $species, "\t", $slice->seq_region_name, ":", $slice->seq_region_start, "-", $slice->seq_region_end, "\t";
    }
    print "\n";
}
Used objects:
Adaptor objects
Main objects

Gene trees, Homologies and Protein clusters

All the gene trees, homologies and families refer to GeneMembers and SeqMembers. Homology objects store orthologous and paralogous relationships between members, GeneTree objects represent the evolutionary history of a set of members, and Family objects are clusters of members.

*Member objects

A member represent either a gene (GeneMember) or a sequence-bearing locus, e.g. a protein or a transcript (SeqMember). Most of them are defined in the corresponding Ensembl core database. For instance, the sequence for the human gene ENSG00000004059 is stored in the human core database.

The fetch_by_stable_id method of the corresponding *MemberAdaptor returns Members by their stable_id. Here is a simple example:

use strict;
use warnings;
use Bio::EnsEMBL::Registry;

my $registry = 'Bio::EnsEMBL::Registry';

$registry->load_registry_from_db(
  -host => 'ensembldb.ensembl.org',
  -user => 'anonymous'
);

# get the MemberAdaptor
my $genemember_adaptor = $registry->get_adaptor('Multi','compara','GeneMember');

# fetch a Member
my $member = $genemember_adaptor->fetch_by_stable_id('ENSG00000004059');

# print out some information about the Member
print $member->source_name, " (", $member->dnafrag->name, ":", $member->dnafrag_start, "-", $member->dnafrag_end, "): ", $member->description, "\n";
Used objects:
Adaptor objects
Main objects

You can fetch the corresponding Ensembl Core API objects from all of the *Member objects, as well as coordinates. The taxon method returns an NCBITaxon object, which contains information about the species and taxonomy.

*Member objects have a source_name, which describes where the member comes from:

for GeneMember
  • ENSEMBLGENE, derived from an Ensembl gene
  • EXTERNALGENE, loaded from an external source (currently unused in the live databases)
for SeqMember
  • ENSEMBLPEP, derived from an Ensembl translation
  • ENSEMBLTRANS, derived from an Ensembl transcript
  • Uniprot/SWISSPROT, derived from a Uniprot/Swissprot entry
  • Uniprot/SPTREMBL, derived from a Uniprot/SP-TrEMBL entry
use strict;
use warnings;
use Bio::EnsEMBL::Registry;

my $registry = 'Bio::EnsEMBL::Registry';

$registry->load_registry_from_db(
  -host => 'ensembldb.ensembl.org',
  -user => 'anonymous'
);

# get the MemberAdaptor
my $genemember_adaptor = $registry->get_adaptor('Multi','compara','GeneMember');

# fetch a Member
my $member = $genemember_adaptor->fetch_by_stable_id('ENSG00000004059');

my $taxon = $member->taxon;
print "common_name ", $taxon->get_common_name,"\ngenus ", $taxon->genus,"\nspecies ", $taxon->species, "\nbinomial ", $taxon->scientific_name,  "\nclassification ", $taxon->classification,"\n";
Used objects:
Adaptor objects
Main objects

GeneTree Objects

GeneTree objects give us the phylogenetic context for a set of genes, as well as their alignment.

In general, you would want to fetch the gene tree for a given gene of interest. The GeneTreeAdaptor has a fetching method called fetch_all_by_Member(). You will need the GeneMember object for your query gene, therefore you will fetch the GeneMember first like in this example:

use strict;
use warnings;
use Bio::EnsEMBL::Registry;

my $registry = 'Bio::EnsEMBL::Registry';

$registry->load_registry_from_db(
  -host => 'ensembldb.ensembl.org',
  -user => 'anonymous'
);

# first, let's get our GeneMember of interest from a GeneMemberAdaptor
my $genemem_adapt = $registry->get_adaptor( 'Multi', 'compara', 'GeneMember' );
my $genemem = $genemem_adapt->fetch_by_stable_id('ENSG00000238344');

# next, set up a GeneTreeAdaptor and fetch the default tree for our GeneMember
my $genetree_adapt = $registry->get_adaptor( 'Multi', 'compara', 'GeneTree' );
my $genetree = $genetree_adapt->fetch_default_for_Member($genemem);

# look at all members of the tree
print "Members of tree:\n";
my @members = @{ $genetree->get_all_Members };
foreach my $m ( @members ) {
    print $m->name, "\n";
}

# print the full tree in Newick format
print $genetree->newick_format() . "\n";
Used objects:
Main objects

GeneTree objects not only hold the structure of the phylogeny, they also hold the gene alignment upon which the tree was based. This can be printed out using the following:

    $genetree->print_alignment_to_file('/path/to/file', -format=>'clustalw');

Homology Objects

An Homology object represents either an orthologous or paralogous relationships between two members.

Typically you want to get homologies for a given gene. As with the GeneTreeAdaptor, the HomologyAdaptor has a fetching method called fetch_all_by_Member(). This has optional arguments to only fetch by a particular Method, MethodLinkSpeciesSet and target species.

When you get all Homologies, you will get an array where each item is a Homology representing the relationship between exactly two genes. One is the query gene, the gene you used as input, and the other is the target gene, the homologue. Even if you specify a single species, if there is a one-to-many or many-to-many relationship, each of these will be one homology.

use strict;
use warnings;
use Bio::EnsEMBL::Registry;

my $registry = 'Bio::EnsEMBL::Registry';

$registry->load_registry_from_db(
  -host => 'ensembldb.ensembl.org',
  -user => 'anonymous'
);

# get a GeneMember object
my $gene_member_adaptor = $registry->get_adaptor('Multi', 'compara', 'GeneMember');
my $gene_member = $gene_member_adaptor->fetch_by_stable_id('ENSG00000004059');

# get the homologies where the member is involved
my $homology_adaptor = $registry->get_adaptor('Multi', 'compara', 'Homology');
my @homologies = @ { $homology_adaptor->fetch_all_by_Member($gene_member, -TARGET_SPECIES=>"mus_musculus") };

# the homology_adaptor will always return an array, even if it only has one homology in it, so we move through the array
foreach my $homology (@homologies) {
	
	# Get the GeneMembers and print their species and ID
	my @homologous_genes = @ { $homology->get_all_GeneMembers };
		foreach my $gene (@homologous_genes){
			print $gene->taxon->get_common_name, ": ", $gene->stable_id, "\n";
		}
	# Print the homology relationship
	print $homology->description," ", $homology->taxonomy_level,"\n";
}
Used objects:
Main objects

You can get the original alignment used to define an homology:

    $homology->print_alignment_to_file('/path/to/file', -format=>'fasta');

Family Objects

Families are clusters of proteins including all the Ensembl proteins plus all the metazoan UniProt/SwissProt and UniProt/Trembl entries. The object and the adaptor are really similar to the previous ones.

use strict;
use warnings;
use Bio::EnsEMBL::Registry;

my $registry = 'Bio::EnsEMBL::Registry';

$registry->load_registry_from_db(
  -host => 'ensembldb.ensembl.org',
  -user => 'anonymous'
);

my $gene_member_adaptor = $registry->get_adaptor('Multi', 'compara', 'GeneMember');
my $gene_member = $gene_member_adaptor->fetch_by_stable_id('ENSG00000004059');

my $family_adaptor = $registry->get_adaptor('Multi','compara','Family');
my $families = $family_adaptor->fetch_all_by_GeneMember($gene_member);

foreach my $family (@{$families}) {
    print join(" ", map { $family->$_ }  qw(description description_score))."\n";

    foreach my $member (@{$family->get_all_Members}) {
        print $member->stable_id," ",$member->taxon_id,"\n";
    }  
}
Used objects:
Main objects

Memory management and code efficiency

There are a myriad of ways to ensure optimal performance of your code. Here, we provide just a few tricks that pertain specifically to Ensembl APIs.

Fetching from adaptors

In all examples above, we've used a foreach to loop through the results of a fetch_all_by_YY call. This is fine for small datasets. However, adaptors can also support a while + shift combo for memory efficiency:

my $members = $genetree->get_all_Members;
while ( my $m = shift @$members ) {
    # do something with member
}

Releasing trees from memory

When working with many GeneTree objects, memory can quickly get out of hand. This is because our current object model uses a cyclic graph of Perl references. As a consequence, the usual garbage-collector is not able to release the memory used by a gene tree when you lose its reference (unlike most of the Ensembl objects). This means that you will have to call release_tree() on each tree after using it.

Preloading data

Most of the objects do lazy-loading of related objects via queries to the database. This system is sub-optimal when there are a lot of objects to fetch or if the server is distant. Our Bio::EnsEMBL::Compara::Utils::Preloader module provides several methods to do a bulk-loading of objects in a minimum number of queries. This will result in a higher memory usage, but faster processing of data.


Further help

For additional information or help mail the ensembl-dev mailing list. You will need to subscribe to this mailing list to use it. More information on subscribing to any Ensembl mailing list is available from the Ensembl Contacts page.