Ensembl Core API Tutorial
Introduction
This tutorial is an introduction to the Ensembl Core API. Knowledge of the coding conventions used in the Ensembl APIs is assumed.
The Core API provides access to the genome itself and the basic features of the genome. These include genes, transcripts, exons and repeats, and their position on the genome. It also includes proteins, including translations of transcripts and domains.
Slices: genomic regions
A Slice object represents a single continuous region of a genome. Slices can be used to obtain sequence, features or other information from a particular region of interest. To retrieve a slice it is first necessary to get a slice adaptor. Here we get such an adaptor for the latest version of the Human database.
my $slice_adaptor = $registry->get_adaptor( 'Human', 'Core', 'Slice' );
Registry
Fetch by region
The slice adaptor provides several ways to obtain slices, but we will start with the fetch_by_region() method which is the most commonly used. This method takes numerous arguments but most of them are optional. In order, the arguments are: coord_system_name, seq_region_name, start, end, strand, coord_system_version. The following are several examples of how to use the fetch_by_region() method:
# get a slice adaptor for the human core database my $slice_adaptor = $registry->get_adaptor( 'Human', 'Core', 'Slice' ); # Obtain a slice covering the entire chromosome X my $slice = $slice_adaptor->fetch_by_region( 'chromosome', 'X' ); # Obtain a slice covering the entire clone AL359765.6 $slice = $slice_adaptor->fetch_by_region( 'clone', 'AL359765.6' ); # Obtain a slice covering an entire scaffold $slice = $slice_adaptor->fetch_by_region( 'scaffold', 'KI270510.1' ); # Obtain a slice covering the region from 1MB to 2MB (inclusively) of # chromosome 20 $slice = $slice_adaptor->fetch_by_region( 'chromosome', '20', 1e6, 2e6 );
Fetch by gene
Another useful way to obtain a slice is with respect to a gene:
my $slice = $slice_adaptor->fetch_by_gene_stable_id( 'ENSG00000099889', 5e3 );
This will return a slice that contains the sequence of the gene specified by its stable Ensembl ID. It also returns 5000bp of flanking sequence at both the 5' and 3' ends, which is useful if you are interested in the environs that a gene inhabits. You needn't have the flanking sequence it you don't want it — in this case set the number of flanking bases to zero or simply omit the second argument entirely. Note that for historical reasons the fetch_by_gene_stable_id() method always returns a slice on the forward strand even if the gene is on the reverse strand.
Fetch all
To retrieve a set of slices from a particular coordinate system the fetch_all() method can be used:
# Retrieve slices of every chromosome in the database @slices = @{ $slice_adaptor->fetch_all('chromosome') }; # Retrieve slices of every BAC clone in the database @slices = @{ $slice_adaptor->fetch_all('clone') };
Split slices
For certain types of analysis it is necessary to break up regions into smaller manageable pieces. The method split_Slices() can be imported from the Bio::EnsEMBL::Utils::Slice module to break up larger slices into smaller component slices.
use Bio::EnsEMBL::Utils::Slice qw(split_Slices); # ... my $slices = $slice_adaptor->fetch_all('chromosome'); # Base pair overlap between returned slices my $overlap = 0; # Maximum size of returned slices my $max_size = 100_000; # Break chromosomal slices into smaller 100k component slices $slices = split_Slices( $slices, $max_size, $overlap );
Get information about the slice
To obtain sequence from a slice the seq() or subseq() methods can be used:
my $sequence = $slice->seq(); print $sequence, "\n"; $sequence = $slice->subseq( 100, 200 );
We can query the slice for information about itself:
# The method coord_system() returns a Bio::EnsEMBL::CoordSystem object my $coord_sys = $slice->coord_system()->name(); my $seq_region = $slice->seq_region_name(); my $start = $slice->start(); my $end = $slice->end(); my $strand = $slice->strand(); print "Slice: $coord_sys $seq_region $start-$end ($strand)\n";
Get features in a slice
Many object adaptors can provide a set of features which overlap a slice. The slice itself also provides a means to obtain features which overlap its region. The following are two ways to obtain a list of genes which overlap a slice:
@genes = @{ $gene_adaptor->fetch_all_by_Slice($slice) }; # Another way of doing the same thing: @genes = @{ $slice->get_all_Genes() };
Features
Features are objects in the database which have a defined location on the genome. All features in Ensembl inherit from the Bio::EnsEMBL::Feature class and have the following location defining attributes: start, end, strand, slice.
The start of any feature is always less than or equal to its end, regardless of the strand it is on. This means that for features on the reverse strand, the start is actually the 3' end, and the end of the 5' end.
In addition to locational attributes all features have internal database identifiers accessed via the method dbID(). All feature objects can be retrieved from their associated object adaptors using a slice object or the feature's internal identifier (dbID).
All features also have the methods transform(), transfer(), and project() which are described in detail in the Transform, Transfer and Project sections of this tutorial.
Slice and seq-region location
All features have the methods start, end and strand. These are defined in terms of the slice you obtain them from. If you want to obtain the feature's absolute position on the underlying sequence region (such as a chromosome or scaffold), you instead need to use the methods seq_region_start, seq_region_end and seq_region_strand. The name of the chromosome or scaffold itself is called using the seq_region_name method.
In the following example, we fetch the human genes in the region 10:3770000-3790000, then print their coordinates in terms of both the slice and the sequence region.
# Fetch the slice adaptor and slice my $slice_adaptor = $registry->get_adaptor( 'Human', 'Core', 'Slice' ); my $slice = $slice_adaptor->fetch_by_region( 'chromosome', '10', 3770000, 3790000 ); # Get all the genes in the slice my @genes = @{ $slice->get_all_Genes() }; foreach my $gene (@genes){ printf( "In terms of slice: %d-%d (%+d)\n", $gene->start(), $gene->end(), $gene->strand() ); printf( "In terms of seq_region: %d-%d (%+d)\n", $gene->seq_region_start(), $gene->seq_region_end(), $gene->seq_region_strand() ); }
Slices from features
The feature_Slice() method will return a slice which is the exact overlap of the feature the method was called on. This slice can then be used to obtain the underlying sequence of the feature or to retrieve other features that overlap the same region, etc.
$feature_slice = $feature->feature_Slice(); # Display the sequence of the feature region print $feature_slice->seq(), "\n"; # Display the sequence of the feature region + 5000bp flanking sequence print $feature_slice->expand( 5000, 5000 )->seq(), "\n"; # Get all genes which overlap the feature $genes = $feature_slice->get_all_Genes();
Genes, Transcripts, and Exons
Genes, exons and transcripts are the most-used features in the API and can be treated in the same way as any other feature within Ensembl. A transcript in Ensembl is a grouping of exons. A gene in Ensembl is a grouping of transcripts which share any overlapping (or partially overlapping) exons. Transcripts also have an associated Translation object which defines the UTR and CDS composition of the transcript. Introns are not defined explicitly in the database but can be obtained by the Transcript method get_all_Introns().
Like all Ensembl features the start of an exon is always less than or equal to the end of the exon, regardless of the strand it is on. The start of the transcript is the start of the first exon of a transcript on the forward strand or the end of the last exon of a transcript on the reverse strand. The start and end of a gene are defined to be the lowest start value of its transcripts and the highest end value respectively.
Genes, translations, transcripts and exons all have stable identifiers. These are identifiers that are assigned to Ensembl's predictions, and maintained in subsequent releases. For example, if a transcript (or a sufficiently similar transcript) is re-predicted in a future release then it will be assigned the same stable identifier as its predecessor.
The following is an example of the retrieval of a set of genes, transcripts and exons:
my $slice_adaptor = $registry->get_adaptor( 'Human', 'Core', 'Slice' ); my $slice = $slice_adaptor->fetch_by_region( 'chromosome', '10', 3770000, 3790000 ); foreach my $gene ( @{ $slice->get_all_Genes() } ) { print "Gene ID: ", $gene->stable_id, "/n"; foreach my $transcript ( @{ $gene->get_all_Transcripts() } ) { print "\tTranscript ID: ", $transcript->stable_id, "/n"; foreach my $exon ( @{ $transcript->get_all_Exons() } ) { print "\t\tExon ID: ", $exon->stable_id, "/n"; } } }
Sequences
Several methods can be used to obtain transcript related sequences. For historical reasons some of these methods return strings while others return Bio::Seq objects, using the BioPerl module. The following example demonstrates the use of some of these methods:
# The spliced_seq() method returns the concatenation of the exon # sequences. This is the cDNA of the transcript print "cDNA: ", $transcript->spliced_seq(), "\n"; # The translateable_seq() method returns only the CDS of the transcript print "CDS: ", $transcript->translateable_seq(), "\n"; # UTR sequences are obtained via the five_prime_utr() and # three_prime_utr() methods my $fiv_utr = $transcript->five_prime_utr(); my $thr_utr = $transcript->three_prime_utr(); print "5' UTR: ", ( defined $fiv_utr ? $fiv_utr->seq() : "None" ), "\n"; print "3' UTR: ", ( defined $thr_utr ? $thr_utr->seq() : "None" ), "\n"; # The protein sequence is obtained from the translate() method. If the # transcript is non-coding, undef is returned. my $protein = $transcript->translate(); print "Translation: ", ( defined $protein ? $protein->seq() : "None" ), "\n";
Names
To get the name or identifier of any particular feature, you can use the display_id(). For a gene or transcript this method would return the name/symbol.
print $gene->display_id(), "\n";
Translations and ProteinFeatures
Translation objects and protein sequence can be extracted from a Transcript object. It is important to remember that some Ensembl transcripts are non-coding (pseudo-genes, ncRNAs, etc.) and have no translation. The primary purpose of a Translation object is to define the CDS and UTRs of its associated Transcript object. Peptide sequence is obtained directly from a Transcript object not a Translation object as might be expected. Once you have a Translation you can go back to its Transcript. If you retrieved the Translation using a stable identifier then the API will fetch the appropriate Transcript automatically. The following example obtains the protein sequence of a Transcript and the Translation's stable identifier:
my $stable_id = 'ENST00000528762'; my $transcript_adaptor = $registry->get_adaptor( 'Human', 'Core', 'Transcript' ); my $transcript = $transcript_adaptor->fetch_by_stable_id($stable_id); print $transcript->translation()->stable_id(), "\n"; print $transcript->translate()->seq(), "\n"; print $transcript->translation()->transcript()->stable_id(), "\n";
ProteinFeatures are features which are on an amino acid sequence rather than a nucleotide sequence, such as domains. The method get_all_ProteinFeatures() can be used to obtain a set of protein features from a Translation object.
$translation = $transcript->translation(); my $pfeatures = $translation->get_all_ProteinFeatures(); while ( my $pfeature = shift @{$pfeatures} ) { my $logic_name = $pfeature->analysis()->logic_name(); printf( "%d-%d %s %s %s\n", $pfeature->start(), $pfeature->end(), $logic_name, $pfeature->interpro_ac(), $pfeature->idesc() ); }
If only the protein features created by a particular analysis are desired the name of the analysis can be provided as an argument. To obtain the subset of features which are considered to be 'domain' features the convenience method get_all_DomainFeatures() can be used:
my $seg_features = $translation->get_all_ProteinFeatures('Seg'); my $domain_features = $translation->get_all_DomainFeatures();
External References
Ensembl cross references its genes, transcripts and translations with identifiers from other databases. A DBEntry object represents a cross reference and is often referred to as an Xref. The following code snippet retrieves and prints DBEntries for a gene:
my $gene_adaptor = $registry->get_adaptor( 'Human', 'Core', 'Gene' ); # Get the 'COG6' gene from human my $gene = $gene_adaptor->fetch_by_display_label('COG6'); print "GENE ", $gene->stable_id(), "\n"; my @dbentries = @{ $gene->get_all_DBEntries() }; foreach my $dbe ( @dbentries ) { printf "\tXREF %s (%s)\n", $dbe->display_id(), $dbe->dbname(); }
Often it is useful to obtain all of the DBEntries associated with a gene and its associated transcripts and translations. To do this, the get_all_DBLinks() method can be used instead. The objects obtained through this method are still DBEntry objects. We can change the example above to:
my $gene_adaptor = $registry->get_adaptor( 'Human', 'Core', 'Gene' ); # Get the 'COG6' gene from human my $gene = $gene_adaptor->fetch_by_display_label('COG6'); print "GENE ", $gene->stable_id(), "\n"; my @dblinks = @{ $gene->get_all_DBLinks() }; foreach my $dbl ( @dblinks ) { printf "\tXREF %s (%s)\n", $dbl->display_id(), $dbl->dbname(); }
When fetching both DBEntries and DBLinks, you can restrict your query to a single external database. This uses SQL wildcards, so you don't have to know exactly how the external databases are named in the Ensembl database, and it allows you to retrieve multiple related external references. For example:
my @dblinks = @{ $gene->get_all_DBLinks('Uniprot%') };
This script would get you external references from both UniProt/SWISSProt and UniProt/TrEMBL.
Alignment Features
Two types of alignments are stored in the core Ensembl database: alignments of DNA sequence to the genome and alignments of protein sequence to the genome, retrieved as DnaDnaAlignFeatures and DnaPepAlignFeatures respectively. The alignments are the sequences used to annotate the genes and transcripts.
A single gapped alignment is represented by a single feature with a cigar line. A cigar line is a compact representation of a gapped alignment as single string containing letters M (match) D (deletion), and I (insertion) prefixed by integer lengths (the number may be omitted if it is 1). A gapped alignment feature can be broken into its component ungapped alignments by the method ungapped_features() which returns a list of FeaturePair objects. The following example shows the retrieval of some DNA alignment features:
# Retrieve dna-dna alignment features from the slice region my @features = @{ $slice->get_all_DnaAlignFeatures('Vertrna') }; foreach my $feature ( @features ) { printf( "%s %d-%d (%+d)\t=> %d-%d (%+d)\n", $feature->hseqname(), $feature->hstart(), $feature->hend(), $feature->hstrand(), $feature->start(), $feature->end(), $feature->strand() ); print "Percent identity: ", $feature->percent_id(), "\n"; print "Cigar string: ", $feature->cigar_string(), "\n"; my @ungapped = $feature->ungapped_features(); print "ungapped:\n"; print_feature_pairs( \@ungapped ); print "\n"; }
Repeats
Repetitive regions found by RepeatMasker and TRF (Tandem Repeat Finder) are represented in the Ensembl database as RepeatFeatures. Low-complexity regions are found by the program Dust and are also stored as RepeatFeatures. RepeatFeatures can be retrieved and used in the same way as other Ensembl features.
my @repeats = @{ $slice->get_all_RepeatFeatures() }; foreach my $repeat (@repeats) { printf( "%s %d-%d\n", $repeat->display_id(), $repeat->start(), $repeat->end() ); }
RepeatFeatures are used to perform repeat masking of the genomic sequence. Hard or soft-masked genomic sequence can be retrieved from Slice objects using the get_repeatmasked_seq() method. Hard-masking replaces sequence in repeat regions with Ns. Soft-masking replaces sequence in repeat regions with lower-case sequence.
my $unmasked_seq = $slice->seq(); my $hardmasked_seq = $slice->get_repeatmasked_seq()->seq(); my $softmasked_seq = $slice->get_repeatmasked_seq( undef, 1 )->seq(); # Soft-mask sequence using TRF results only my $tandem_masked_seq = $slice->get_repeatmasked_seq( ['TRF'], 1 )->seq();
Markers
Markers are imported into the Ensembl database from UniSTS and several other sources. A marker in Ensembl consists of a pair of primer sequences, an expected product size and a set of associated identifiers known as synonyms. Markers are placed on the genome electronically using an analysis program such as ePCR and their genomic positions are retrievable as MarkerFeatures. Map locations (genetic, radiation hybrid and in situ hybridisation) for markers obtained from actual experimental evidence are also accessible.
Markers can be fetched via their name from the MarkerAdaptor.
my $marker_adaptor = $registry->get_adaptor( 'Human', 'Core', 'Marker' ); # Obtain marker by synonym (this returns a list, but there's seldom more # than one marker in the list) my $marker = $marker_adaptor->fetch_all_by_synonym('D9S1038E')->[0]; # Display the various names associated with the same marker foreach my $synonym ( @{ $marker->get_all_MarkerSynonyms() } ) { if ( defined $synonym->source() ) { print $synonym->source(), ':'; } print $synonym->name(), ' '; } print "\n"; # Display the primer info printf( "left primer : %s\n", $marker->left_primer() ); printf( "right primer: %s\n", $marker->right_primer() ); printf( "product size: %d-%d\n", $marker->min_primer_dist(), $marker->max_primer_dist() ); # Display out genetic/RH/FISH map information print "Map locations:\n"; foreach my $map_loc ( @{ $marker->get_all_MapLocations() } ) { printf( "\t%s %s %s\n", $map_loc->map_name(), $map_loc->chromosome_name(), $map_loc->position() ); }
MarkerFeatures, which represent genomic positions of markers, can be retrieved and manipulated in the same way as other Ensembl features.
# Obtain the positions for an already retrieved marker foreach my $marker_feature ( @{ $marker->get_all_MarkerFeatures() } ) { printf( "%s %d-%d\n", $marker_feature->seq_region_name(), $marker_feature->start(), $marker_feature->end() ); } # Retrieve all marker features in a given region my $marker_features = $slice->get_all_MarkerFeatures(); while ( my $marker_feature = shift @{$marker_features} ) { printf( "%s %s %d-%d\n", $marker_feature->display_id(), $marker_feature->seq_region_name(), $marker_feature->start(), $marker_feature->end() ); }
MiscFeatures
MiscFeatures are features with arbitrary attributes which are placed into arbitrary groupings. MiscFeatures can be retrieved as any other feature and are classified into distinct sets by a set code. Generally it only makes sense to retrieve all features which have a particular set code because very diverse types of MiscFeatures are stored in the database.
MiscFeature attributes are represented by Attribute objects and can be retrieved via a get_all_Attributes() method.
The following example retrieves all MiscFeatures representing ENCODE regions on a given slice and prints out their attributes:
my $encode_regions = $slice->get_all_MiscFeatures('encode'); while ( my $encode_region = shift @{$encode_regions} ) { my @attributes = @{ $encode_region->get_all_Attributes() }; foreach my $attribute ( @attributes ) { printf "%s:%s\n", $attribute->name(), $attribute->value(); } }
This example retrieves all misc features representing a BAC clone via its name and prints out their location and other information:
my $mf_adaptor = $registry->get_adaptor( 'Human', 'Core', 'MiscFeature' ); my $clones = $mf_adaptor->fetch_all_by_attribute_type_value( 'clone_name', 'RP5-60P11' ); while ( my $clone = shift @{$clones} ) { my $slice = $clone->slice(); printf( "%s %s %d-%d\n", $slice->coord_system->name(), $slice->seq_region_name(), $clone->start(), $clone->end() ); my @attributes = @{ $clone->get_all_Attributes() }; foreach my $attribute ( @attributes ) { printf "\t%s:%s\n", $attribute->name(), $attribute->value(); } }
Coordinate Systems
Sequences stored in Ensembl are associated with coordinate systems. What the coordinate systems are varies from species to species. For example, the homo_sapiens database has the following coordinate systems: contig, clone, supercontig, chromosome. Sequence and features may be retrieved from any coordinate system despite the fact they are only stored internally in a single coordinate system. The database stores the relationship between these coordinate systems and the API provides means to convert between them. The API has a CoordSystem object and object adaptor, however, these are most often used internally. The following example fetches a chromosome coordinate system object from the database:
my $cs_adaptor = $registry->get_adaptor( 'Human', 'Core', 'CoordSystem' ); my $cs = $cs_adaptor->fetch_by_name('chromosome'); printf "Coordinate system: %s %s\n", $cs->name(), $cs->version();
A coordinate system is uniquely defined by its name and version. Most coordinate systems do not have a version, and the ones that do have a default version, so it is usually sufficient to use only the name when requesting a coordinate system. For example, chromosome coordinate systems have a version which is the assembly that defined the construction of the coordinate system. The version of the human chromosome coordinate system might be something like NCBI35 or NCBI36, depending on the version of the Core databases used.
Slice objects have an associated CoordSystem object and a seq_region_name() method that returns its name which uniquely defines the sequence that they are positioned on. You may have noticed that the coordinate system of the sequence region was specified when obtaining a slice in the fetch_by_region() method. Similarly the version may also be specified (though it can almost always be omitted):
$slice = $slice_adaptor->fetch_by_region( 'chromosome', 'X', 1e6, 10e6, '1', 'NCBI36' );
Sometimes it is useful to obtain full slices of every sequence in a given coordinate system; this may be done using the SliceAdaptor method fetch_all():
my @chromosomes = @{ $slice_adaptor->fetch_all('chromosome') }; my @clones = @{ $slice_adaptor->fetch_all('clone') };
Now suppose that you wish to write code which is independent of the species used. Not all species have the same coordinate systems; the available coordinate systems depends on the style of assembly used for that species (WGS, clone-based, etc.). You can obtain the list of available coordinate systems for a species using the CoordSystemAdaptor and there is also a special pseudo-coordinate system named toplevel. The toplevel coordinate system is not a real coordinate system, but is used to refer to the highest level coordinate system in a given region. The toplevel coordinate system is particularly useful in genomes that are incompletely assembled. For example, the latest zebrafish genome consists of a set of assembled chromosomes, and a set of supercontigs that are not part of any chromosome. In this example, the toplevel coordinate system sometimes refers to the chromosome coordinate system and sometimes to the supercontig coordinate system depending on the region it is used in.
# List all coordinate systems in this database: my @coord_systems = @{ $cs_adaptor->fetch_all() }; foreach $cs (@coord_systems) { printf "Coordinate system: %s %s\n", $cs->name(), $cs->version; } # Get all slices on the highest coordinate system: my @slices = @{ $slice_adaptor->fetch_all('toplevel') };
The Transform, Transfer and Project methods can all be used to convert between coordinate systems. This is useful if you are working with a particular coordinate system but you are interested in obtaining the features coordinates in another coordinate system.
Method | Converts from | Converts to |
---|---|---|
Transform | Feature in slice or coordinate system | Feature in a different coordinate system |
Transfer | Feature in slice or coordinate system | Feature in a different slice |
Project | Feature in slice or coordinate system | Feature spanning multiple seq_regions in a coordinate system |
Transform
The transform() method can be used to move a feature to any coordinate system which is in the database. The feature will be placed on a slice which spans the entire sequence that the feature is on in the requested coordinate system.
if ( my $new_feature = $feature->transform('clone') ) { printf( "Feature's clonal position is: %s %d-%d (%+d)\n", $new_feature->slice->seq_region_name(), $new_feature->start(), $new_feature->end(), $new_feature->strand() ); } else { print "Feature is not defined in clonal coordinate system\n"; }
The transform() method returns a copy of the original feature in the new coordinate system, or undef if the feature is not defined in that coordinate system. A feature is considered to be undefined in a coordinate system if it overlaps an undefined region or if it crosses a coordinate system boundary. Take for example the tiling path relationship between chromosome and contig coordinate systems:
|~~~~~~~| (Feature A) |~~~~| (Feature B) (ctg 1) [=============] (ctg 2) (------==========] (ctg 2) (ctg 3) (--============] (ctg3)
Both Feature A and Feature B are defined in the chromosomal coordinate system described by the tiling path of contigs. However, Feature A is not defined in the contig coordinate system because it spans both Contig 1 and Contig 2. Feature B, on the other hand, is still defined in the contig coordinate system.
The special toplevel coordinate system can also be used in this instance to move the feature to the highest possible coordinate system in a given region:
my $new_feature = $feature->transform('toplevel'); printf( "Feature's toplevel position is: %s %s %d-%d (%+d)\n", $new_feature->slice->coord_system->name(), $new_feature->slice->seq_region_name(), $new_feature->start(), $new_feature->end(), $new_feature->strand() );
Transfer
Another method that is available on all Ensembl features is the transfer() method. The transfer() method is similar to the previously described transform() method, but rather than taking a coordinate system argument it takes a slice argument. This is useful when you want a feature's coordinates to be relative to a certain region. Calling transform() on the feature will return a copy of the feature which is shifted onto the provided slice. If the feature would be placed on a gap or across a coordinate system boundary, then undef is returned instead. It is illegal to transfer a feature to a slice on a sequence region which it cannot be placed on. For example, a feature which is on chromosome X cannot be transferred to a slice on chromosome 20 and attempting to do so will raise an exception. It is legal to transfer a feature to a slice on which it has coordinates past the slice end or before the slice start. The following example illustrates the use of the transfer method:
my $slice = $slice_adaptor->fetch_by_region( 'chromosome', '2', 1e6, 2e6 ); my $new_slice = $slice_adaptor->fetch_by_region( 'chromosome', '2', 1.5e6, 2e6 ); foreach my $simple_feature ( @{ $slice->get_all_SimpleFeatures('Eponine') } ) { printf( "Before:\t%6d - %6d\n", $simple_feature->start(), $simple_feature->end() ); my $new_feature = $simple_feature->transfer($new_slice); if ( !defined $new_feature ) { print "Could not transfer feature\n"; } else { printf( "After:\t%6d - %6d\n", $new_feature->start(), $new_feature->end() ); } }
In the above example a slice from another coordinate system could also have been used, provided you had an idea about what sequence region the features would be mapped to.
Project
When moving features between coordinate systems it is usually sufficient to use the transfer() or transform() methods. Sometimes, however, it is necessary to obtain coordinates in a another coordinate system even when a coordinate system boundary is crossed. Even though the feature is considered to be undefined in this case, the feature's coordinates can still be obtained in the requested coordinate system using the project() method.
Both slices and features have their own project() methods, which take the same arguments and have the same return values. The project() method takes a coordinate system name as an argument and returns a reference to a list of ProjectionSegment objects. A projection segment has three attributes, accessible by the methods from_start(), from_end(), to_Slice(). The from_start() and from_end() methods return integers representing the part of the feature or slice that is used to form that part of the projection. The to_Slice() method returns a slice object representing the part of the region that the slice or feature was projected to. The following example illustrates the use of the project() method on a feature. The project() method on a slice can be used in the same way. As with the feature transform() method the pseudo coordinate system toplevel can be used to indicate you wish to project to the highest possible level.
printf( "Feature at: %s %d-%d (%+d) projects to\n", $feature->seq_region_name(), $feature->start(), $feature->end(), $feature->strand() ); my $projection = $feature->project('clone'); foreach my $segment ( @{$projection} ) { my $to_slice = $segment->to_Slice(); printf( "\t%s %d-%d (%+d)\n", $to_slice->seq_region_name(), $to_slice->start(), $to_slice->end(), $to_slice->strand() ); }
An example script and sample input for mapping coordinates between assembly versions can be found in the Assembly converter directory on the Ensembl GitHub website.
Further help
For additional information or help mail the ensembl-dev mailing list. You will need to subscribe to this mailing list to use it. More information on subscribing to any Ensembl mailing list is available from the Ensembl Contacts page.