Ensembl Regulation (funcgen) API Tutorial

Introduction

The Ensembl Regulation team deals with functional genomics data. The API and databases for Ensembl Regulation are called Funcgen.

This tutorial is an introduction to the Funcgen API. Knowledge of the Ensembl Core API and of the coding conventions used in the Ensembl APIs is assumed.

Documentation of the Regulation database schema is available on the Ensembl website. It is not required for this tutorial, but an understanding of the database tables may help, as many of the adaptor modules are table-specific.

Regulatory Features

RegulatoryFeatures are genomic regions predicted to play a role in gene regulation, such as:

  • Predicted promoters
  • Predicted promoter flanking regions
  • Predicted enhancer regions
  • CTCF binding sites
  • Transcription factor binding sites
  • Open chromatin regions

They are generated by the Ensembl Regulatory Build.

To fetch RegulatoryFeatures from the funcgen database, you need to use the corresponding adaptor. To obtain all the regulatory features present in a given region of the genome, use the adaptor method fetch_all_by_Slice:

use strict;
use warnings;
use Bio::EnsEMBL::Registry;

my $registry = 'Bio::EnsEMBL::Registry';

$registry->load_registry_from_db(
  -host => 'ensembldb.ensembl.org',
  -user => 'anonymous'
);

# get the SliceAdaptor and Slice
my $slice_adaptor = $registry->get_adaptor('Human', 'Core', 'Slice');
my $slice = $slice_adaptor->fetch_by_region('chromosome', 1, 54_960_000, 54_980_000);

# Get the RegulatoryFeatureAdaptor and fetch all RegulatoryFeatures by Slice
my $regulatory_feature_adaptor = $registry->get_adaptor('Human', 'Funcgen', 'RegulatoryFeature');
my @regulatory_features = @{$regulatory_feature_adaptor->fetch_all_by_Slice($slice)};

# Move through the regulatory features and print information about them
foreach my $current_regulatory_feature (@regulatory_features) {
  print $current_regulatory_feature->stable_id, "\t", $current_regulatory_feature->feature_type->name, "\n";
}

Registry

What is this object $registry? Make sure you have defined it in all your scripts. Learn more in the general instructions page.

You can also restrict the results to a particular FeatureType. To do this, fetch the FeatureType using the FeatureTypeAdaptor and pass it to fetch_all_by_Slice_FeatureType together with the Slice:

use strict;
use warnings;
use Bio::EnsEMBL::Registry;

my $registry = 'Bio::EnsEMBL::Registry';

$registry->load_registry_from_db(
  -host => 'ensembldb.ensembl.org',
  -user => 'anonymous'
);

# get the SliceAdaptor and Slice
my $slice_adaptor = $registry->get_adaptor('Human', 'Core', 'Slice');
my $slice = $slice_adaptor->fetch_by_region('chromosome', 17, 64000000, 64050000);

# Get the FeatureTypeAdaptor and specify the FeatureType
my $feature_type_adaptor = $registry->get_adaptor('Human', 'Funcgen', 'FeatureType');
my $feature_type = $feature_type_adaptor->fetch_by_name("Promoter");

# Get the RegulatoryFeatureAdaptor and fetch all RegulatoryFeatures of this FeatureType on the Slice
my $regulatory_feature_adaptor = $registry->get_adaptor('Human', 'Funcgen', 'RegulatoryFeature');
my @regulatory_features = @{$regulatory_feature_adaptor->fetch_all_by_Slice_FeatureType($slice, $feature_type)};

# Move through the regulatory features and print information about them
foreach my $current_regulatory_feature (@regulatory_features) {
    print $current_regulatory_feature->stable_id, "\t", $current_regulatory_feature->seq_region_start, "-", $current_regulatory_feature->seq_region_end, "\n";
}

Regulatory Activities

For every regulatory feature, the Ensembl Regulatory Build predicts its regulatory activity in each of the epigenomes of the regulatory build. In each epigenome, a regulatory feature has one of five possible activities:

  1. Active
  2. Poised (has both active and repressive marks, "ready to go")
  3. Inactive
  4. Repressed
  5. NA (no data available for this epigenome)

The regulatory activities have their own object and can be queried like this:

use strict;
use warnings;
use Bio::EnsEMBL::Registry;

my $registry = 'Bio::EnsEMBL::Registry';

$registry->load_registry_from_db(
  -host => 'ensembldb.ensembl.org',
  -user => 'anonymous'
);

# get the regulatory feature by ID
my $regulatory_feature_id = 'ENSR00000358244';
my $regulatory_feature_adaptor = $registry->get_adaptor('Human', 'Funcgen', 'RegulatoryFeature');
my $regulatory_feature = $regulatory_feature_adaptor->fetch_by_stable_id($regulatory_feature_id);

# print information about the feature
print "The ", $regulatory_feature->get_FeatureType->name, " with stable id: "  . $regulatory_feature->stable_id . " has the following activities: \n";

# Get the activity
my $regulatory_activity_adaptor = $registry->get_adaptor('homo_sapiens', 'funcgen', 'RegulatoryActivity');
my $regulatory_activity_list    = $regulatory_activity_adaptor->fetch_all_by_RegulatoryFeature($regulatory_feature);

# print the activity
foreach my $current_regulatory_activity (@$regulatory_activity_list) {
	print "\tIn the epigenome ", $current_regulatory_activity->get_Epigenome->short_name, " it is ", $current_regulatory_activity->activity, "\n";
}
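
When a regulatory build contains many epigenomes, a tally per activity can be easier to read than one line per epigenome. The following is a minimal sketch rather than part of the Funcgen API: count_activities and print_activity_summary are hypothetical helper names, and the only API call they rely on is the activity method used in the script above.

```perl
use strict;
use warnings;

# Count how many epigenomes fall into each activity.
# $activities is an array reference of RegulatoryActivity objects, such as
# the list returned by fetch_all_by_RegulatoryFeature in the script above.
# (count_activities is a hypothetical helper, not an API method.)
sub count_activities {
    my ($activities) = @_;
    my %count;
    foreach my $activity (@{$activities}) {
        $count{ $activity->activity }++;
    }
    return \%count;
}

# Print one "activity<TAB>count" line per activity, sorted by activity name.
sub print_activity_summary {
    my ($activities) = @_;
    my $count = count_activities($activities);
    foreach my $name (sort keys %{$count}) {
        print $name, "\t", $count->{$name}, "\n";
    }
    return;
}
```

Calling print_activity_summary($regulatory_activity_list) at the end of the script above would print the tally for that regulatory feature.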

Peaks: Enriched regions from ChIP-seq and other high throughput experiments

Regulatory Features are built from the results of experiments such as DNase I sensitivity assays (DNase-seq), which detect regions of open chromatin, and transcription factor binding assays such as chromatin immunoprecipitation coupled with high-throughput sequencing (ChIP-seq). ChIP-seq studies are also used to detect histone modifications (e.g. H3K36 trimethylation) and polymerase binding sites. Results from these experiments are stored as Peaks.

Peaks have these properties:

  • Score: an analysis-dependent value (e.g. the peak-caller score).
  • Summit: the precise 1 bp position within the peak that has the highest read density in a ChIP experiment. It depends on the analysis and may not always be present.

Peaks also link to an object called PeakCalling, which contains information about the Epigenome, FeatureType and Experiment.
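
The properties listed above can be combined into a one-line summary of a Peak. This is a sketch under assumptions: describe_peak is a hypothetical helper, and the accessors score and summit are assumed to be named after the properties they return, so check them against the Peak documentation for your API version. Because the summit may be absent, the helper prints NA in that case.

```perl
use strict;
use warnings;

# Build a tab-separated description of a Peak: location, score and summit.
# ASSUMPTION: the Peak object provides score and summit accessors; the
# location accessors are the ones used elsewhere in this tutorial.
# (describe_peak is a hypothetical helper, not an API method.)
sub describe_peak {
    my ($peak) = @_;

    my $location = sprintf(
        '%s:%d-%d',
        $peak->seq_region_name,
        $peak->seq_region_start,
        $peak->seq_region_end
    );

    # The summit depends on the analysis and may be undefined
    my $summit = $peak->summit;
    $summit = 'NA' unless defined $summit;

    return join("\t", $location, $peak->score, $summit);
}
```

With a list of Peaks fetched from a Slice, as in the next section, each one can be reported with print describe_peak($peak), "\n";.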

Fetch Peaks on a Slice

Here is an example of how Peaks can be fetched from a Slice and filtered by their Epigenome:

use strict;
use warnings;
use Bio::EnsEMBL::Registry;

my $registry = 'Bio::EnsEMBL::Registry';

$registry->load_registry_from_db(
  -host => 'ensembldb.ensembl.org',
  -user => 'anonymous'
);

# get the adaptors
my $peak_adaptor  = $registry->get_adaptor('homo_sapiens', 'funcgen', 'Peak');
my $slice_adaptor = $registry->get_adaptor('homo_sapiens', 'core', 'Slice');

# Fetch a Slice
my $slice = $slice_adaptor->fetch_by_region( 'chromosome', '17', 63_992_802, 64_038_237);

# Fetch all Peaks on the Slice
my @peaks = @{ $peak_adaptor->fetch_all_by_Slice($slice) };

# move through the Peaks and get the PeakCalling for them 
while (my $peak = shift @peaks) {
	my $peakcalling = $peak->get_PeakCalling;
	
	# get the Epigenome for the PeakCalling and filter by those found in placenta
	my $epigenome = $peakcalling->fetch_Epigenome->short_name;	
	if ($epigenome eq 'placenta') {
	
		# Print the FeatureType and location of each Peak
		print $peakcalling->fetch_FeatureType->name, "\t", $peak->seq_region_name, ":",  $peak->seq_region_start, "-",  $peak->seq_region_end, "\n";
	}
}

Motif Features: Transcription factor binding sites

Motif Features represent short genomic regions where a Transcription Factor is thought to be directly interacting with the DNA. These regions are called Transcription Factor binding sites. More information on how these sites are found in Ensembl is on the RegulatoryBuild page.

MotifFeatures can be fetched from a RegulatoryFeature or Peak using the method get_all_MotifFeatures. They also have their own Adaptor which you can use to fetch them by Slice, BindingMatrix and Epigenome.

Information about the transcription factors bound by a MotifFeature can be found using the BindingMatrix.
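
As a sketch of the first route, the helper below lists the locations of the MotifFeatures attached to a RegulatoryFeature or Peak. It is an illustration under assumptions: motif_feature_locations is a hypothetical name, and get_all_MotifFeatures is assumed to follow the usual Ensembl convention of returning an array reference.

```perl
use strict;
use warnings;

# Return "name:start-end" strings for the MotifFeatures attached to a feature.
# $feature is any object that provides get_all_MotifFeatures, which the text
# above says applies to RegulatoryFeatures and Peaks.
# ASSUMPTION: get_all_MotifFeatures returns an array reference.
# (motif_feature_locations is a hypothetical helper, not an API method.)
sub motif_feature_locations {
    my ($feature) = @_;

    my @locations;
    foreach my $motif (@{ $feature->get_all_MotifFeatures }) {
        push @locations, sprintf(
            '%s:%d-%d',
            $motif->seq_region_name,
            $motif->seq_region_start,
            $motif->seq_region_end
        );
    }
    return @locations;
}
```

For a RegulatoryFeature fetched by stable ID, as in the Regulatory Activities example, print "$_\n" for motif_feature_locations($regulatory_feature); would print one location per line.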

The following script fetches MotifFeatures from a Slice and prints their properties:

use strict;
use warnings;
use Bio::EnsEMBL::Registry;

my $registry = 'Bio::EnsEMBL::Registry';

$registry->load_registry_from_db(
  -host => 'ensembldb.ensembl.org',
  -user => 'anonymous'
);

# get the adaptors
my $motif_adaptor  = $registry->get_adaptor('homo_sapiens', 'funcgen', 'MotifFeature');
my $slice_adaptor = $registry->get_adaptor('homo_sapiens', 'core', 'Slice');

# Fetch a slice
my $slice = $slice_adaptor->fetch_by_region( 'chromosome', '13',32315400, 32315500);

# Fetch all motifs on the slice
my @motifs = @{ $motif_adaptor->fetch_all_by_Slice($slice) };

# move through the motifs
while (my $motif = shift @motifs) {
	
	# print the motif ID and location
	print $motif->stable_id, "\t", $motif->seq_region_name, ":", $motif->seq_region_start, "-", $motif->seq_region_end, "\n";
	
	# get the transcription factors associated with the MotifFeature, going via the BindingMatrix
	my @transcription_factors = @{ $motif->get_BindingMatrix->get_all_TranscriptionFactors };
	
	# create and add to an array of transcription factor names, then print it
	my @tfs;
	foreach my $tf (@transcription_factors) {
		push @tfs, $tf->name;
	}
	print join(", ", @tfs), "\n";
}

If a Peak for the same transcription factor overlaps a MotifFeature, the MotifFeature is classed as "experimentally validated" in that Epigenome. From the MotifFeature object, you can fetch the Epigenomes where it is validated and the Peaks that correspond to it. The MotifFeatureAdaptor will also allow you to fetch by Epigenome where the MotifFeature is validated.

use strict;
use warnings;
use Bio::EnsEMBL::Registry;

my $registry = 'Bio::EnsEMBL::Registry';

$registry->load_registry_from_db(
  -host => 'ensembldb.ensembl.org',
  -user => 'anonymous'
);

# get the adaptors
my $motif_adaptor  = $registry->get_adaptor('homo_sapiens', 'funcgen', 'MotifFeature');
my $epigenome_adaptor  = $registry->get_adaptor('homo_sapiens', 'funcgen', 'Epigenome');
my $slice_adaptor = $registry->get_adaptor('homo_sapiens', 'core', 'Slice');

# Fetch a slice and epigenome of interest
my $slice = $slice_adaptor->fetch_by_region( 'chromosome', '13',32315400, 32315500);
my $epigenome = $epigenome_adaptor->fetch_by_short_name('K562');

# Fetch all motifs on the slice
my @motifs = @{ $motif_adaptor->fetch_all_by_Slice($slice) };

# move through the motifs
while (my $motif = shift @motifs) {

	# filter motifs to find only those that have been verified in our epigenome of interest
	if ($motif->is_experimentally_verified_in_Epigenome($epigenome)){
	
		# print the motif ID and location
		print $motif->stable_id, "\t", $motif->seq_region_name, ":", $motif->seq_region_start, "-", $motif->seq_region_end, "\n";
		
		# get the peaks associated with the motif and epigenome, then print information about them
		my @peaks = @{ $motif->get_all_overlapping_Peaks_by_Epigenome($epigenome) };		
		foreach my $peak (@peaks){
			print $peak->get_PeakCalling->fetch_FeatureType->name, "\t", $peak->seq_region_name, ":",  $peak->seq_region_start, "-",  $peak->seq_region_end, "\n";
		}
	}
}

External Features: Externally curated data

There are some Feature Sets that are either entirely or partially curated by external groups. These are stored as ExternalFeatures and can be accessed using the ExternalFeatureAdaptor.

If you know the name of a feature set, you can use the name to fetch the data, using the FeatureSetAdaptor. For example, we store data from the Vista Enhancer Browser.

The following script fetches the Vista Enhancers for a Slice.

use strict;
use warnings;
use Bio::EnsEMBL::Registry;

my $registry = 'Bio::EnsEMBL::Registry';

$registry->load_registry_from_db(
  -host => 'ensembldb.ensembl.org',
  -user => 'anonymous'
);

# Get the Slice, FeatureSet and ExternalFeatureAdaptors
my $slice_adaptor = $registry->get_adaptor('homo_sapiens', 'core', 'Slice');
my $feature_set_adaptor  = $registry->get_adaptor('homo_sapiens', 'funcgen', 'FeatureSet');
my $ex_feat_adaptor  = $registry->get_adaptor('homo_sapiens', 'funcgen', 'ExternalFeature');

# Fetch a Slice and FeatureSet of interest
my $slice = $slice_adaptor->fetch_by_region( 'chromosome', '13');
my $vista_feature_set = $feature_set_adaptor->fetch_by_name('VISTA enhancer set');

# Use the ExternalFeatureAdaptor to fetch the Vista enhancers in the Slice
my @vistas = @{ $ex_feat_adaptor->fetch_all_by_Slice_FeatureSets($slice, [$vista_feature_set]) };

# Move through the Vista enhancers and print their locations
while (my $vista = shift @vistas) {
	print $vista->seq_region_name, ":", $vista->seq_region_start, "-", $vista->seq_region_end, "\n";
}

Feature Types

FeatureTypes provide a biological annotation for features. They are divided into classes that form biologically coherent groups (e.g. Transcription Factors). This is different from the FeatureSet class, which just states the origin of the data. FeatureTypes can be accessed using the FeatureTypeAdaptor.
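
A FeatureType can be summarised by its name, class and description. The following is a sketch under assumptions: describe_feature_type is a hypothetical helper, and the class and description accessors are assumed to exist on the FeatureType object (only name appears elsewhere in this tutorial).

```perl
use strict;
use warnings;

# Summarise a FeatureType as "name [class]: description".
# ASSUMPTION: the FeatureType object provides class and description
# accessors in addition to name.
# (describe_feature_type is a hypothetical helper, not an API method.)
sub describe_feature_type {
    my ($feature_type) = @_;

    # Fall back to a placeholder if no description is stored
    my $description = $feature_type->description;
    if (!defined $description || $description eq '') {
        $description = 'no description';
    }

    return sprintf(
        '%s [%s]: %s',
        $feature_type->name,
        $feature_type->class,
        $description
    );
}
```

For example, with the $feature_type_adaptor from the Regulatory Features section, print describe_feature_type($feature_type_adaptor->fetch_by_name('Promoter')), "\n"; would print the summary for the Promoter FeatureType.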

External FeatureTypes

FeatureTypes for ExternalFeatures have a meaning that is specific to the FeatureSet. For example, for features of the Vista FeatureSet, the feature type indicates if the feature was active or inactive in an experiment.

Microarrays and associated information

Some popular commercial microarrays are stored in the Ensembl database, along with their mappings to genomic regions and genes. The arrays themselves are stored as Array objects, which can be fetched with the ArrayAdaptor.

The following script fetches all the arrays for a species and prints information about them:

use strict;
use warnings;
use Bio::EnsEMBL::Registry;

my $registry = 'Bio::EnsEMBL::Registry';

$registry->load_registry_from_db(
  -host => 'ensembldb.ensembl.org',
  -user => 'anonymous'
);
  
# get the array adaptor
my $array_adaptor = $registry->get_adaptor('Human','Funcgen','Array');

# fetch all arrays and move through
my @arrays = @{ $array_adaptor->fetch_all };
foreach my $array (@arrays) {

	# Print some array info
	print "Array:\t", $array->name,"\nType:\t",  $array->type, "\nVendor:\t", $array->vendor, "\n";

	# Get some information about the array chips and print
	my @array_chips   = @{ $array->get_ArrayChips };
	foreach my $array_chip (@array_chips) {
		print "ArrayChip:\t", $array_chip->name, " DesignID:\t", $array_chip->design_id, "\n";
	}
	print "\n";
}

Fetch all Probe Features from a specific Array and Probe

Probes are stored as Probe objects, which represent the probe on the array, and ProbeFeature objects, which represent the mapping of the Probe to the genome.

In this example, a Probe from the WholeGenome_4x44k_v1 array is obtained.

use strict;
use warnings;
use Bio::EnsEMBL::Registry;

my $registry = 'Bio::EnsEMBL::Registry';

$registry->load_registry_from_db(
  -host => 'ensembldb.ensembl.org',
  -user => 'anonymous'
);

# get the ProbeAdaptor and use it to fetch a probe from the WholeGenome_4x44k_v1 array
my $probe_adaptor = $registry->get_adaptor('Human', 'Funcgen', 'Probe');
my $probe = $probe_adaptor->fetch_by_array_probe_probeset_name('WholeGenome_4x44k_v1', 'A_23_P18656');

# Fetch the features associated with this probe
my @probe_features = @{ $probe->get_all_ProbeFeatures };

# Print some info about the features
foreach my $probe_feature ( @probe_features ){
	print "ProbeFeature found at:\t", $probe_feature->feature_Slice->name, "\n";
}

Probe mappings to transcripts

ProbeSets represent groups of Probes, and are mapped to transcripts.

In this example, all ProbeSets that have been mapped to a FOXP2 transcript are fetched using the transcript's stable ID, and then printed.

use strict;
use warnings;
use Bio::EnsEMBL::Registry;

my $registry = 'Bio::EnsEMBL::Registry';

$registry->load_registry_from_db(
  -host => 'ensembldb.ensembl.org',
  -user => 'anonymous'
);

my $trans_id = "ENST00000393489";

# get the ProbeSetAdaptor
my $probe_set_adaptor  = $registry->get_adaptor("human", "Funcgen", "ProbeSet");

# Fetch ProbeSets associated with a transcript and move through
my @probesets  = @{ $probe_set_adaptor->fetch_all_by_transcript_stable_id($trans_id) };
foreach my $probeset (@probesets) {
	
	# get all the Arrays the ProbeSets are found on, then make an array of their names 
	my @arrays = @{ $probeset->get_all_Arrays };	
	my @arraynames;
	foreach my $array (@arrays) {
		push @arraynames, ($array->name);
	}
	
	# print information about the mapping
	print "Probeset ", $probeset->name, " on array(s) ", join(", ", @arraynames), " maps to ", $trans_id, ".\n";
}

Further help

For additional information or help, email the ensembl-dev mailing list. You will need to subscribe to this mailing list to use it. More information on subscribing to any Ensembl mailing list is available from the Ensembl Contacts page.