Using the Ensembl variant mapper API#

This code example illustrates how to use the version of the variant mapper that uses Ensembl to localise the variant in the genome and to get metadata for the variant. This functionality is provided by the gwas_norm.variants.mapper.EnsemblMapper class and this class is also used in the cmd-line program variant-mapper-ensembl. The code will also demonstrate how the gwas_norm.variants.mapper.EnsemblVariantMapper can be used to map variants contained in a pandas.DataFrame.

Basic Ensembl mapper usage#

This illustrates the most basic usage of the gwas_norm.variants.mapper.EnsemblMapper which is essentially just used to validate variants and provide mapping information (map_info) detailing how well the provided variant maps to a site and if any steps such as strand flipping or reference allele flipping had to occur to get the variant to map to the specific site. The map_info is a bitwise vector and is explained in greater detail below.

To interface with Ensembl, the mapper uses an ensembl_rest_client.lient.Rest object. This can take a URL argument that can point to either GRCh37 or GRCh38 assemblies.

[1]:
from gwas_norm.variants import mapper
from ensembl_rest_client import client
import pprint as pp
[2]:
# Create the rest client, it defaults to GRCh38
rc = client.Rest()
[3]:
# The chromosome is a string not an interger
chr_name = '1'
start_pos = 230710048
ref_allele = "A"
alt_allele = "G"

Now create a mapper and get a mapping for our variant. Notice that the alt_allele is an optional argument as more advanced usages of the mapper can allow of basic imputation of the alternate allele

[4]:
with mapper.EnsemblVariantMapper(rc) as emap:
    my_var = emap.map_variant(chr_name, start_pos, ref_allele, alt_allele=alt_allele)

The MappingResult object#

The returned variable (my_var) is a gwas_norm.variants.constants.MappingResult namedtuple. In Python namedtuples can behave like tuples and also have named attributes, similar to many objects. The MappingResult has the attributes:

  • source_coords - the coordinates and alleles of the variant being mapped

  • mapping_coords - the coordinates and alleles of the variant mapping to the source variant. If there is no mapping then this will be None

  • map_bits - The bitwise integer containing the information on the mapping

  • source_row - The row from the input source file. This will only really be used by the scan mapper that does not work in the same way to the EnsemblVariantMapper.

  • map_row - The full data from the mapping data source. For the EnsemblVariantMapper this will have list where the first 7 elements are the same order as a VCF file (for compatibility reasons) with the last element being the data for the variant from the Ensembl REST call.

  • errors - In the event that an error has occured during the mapping (i.e. many errors are caught and handled unless the user has done something really wrong). Then the error object will be stored here. In general, the mappers are designed not to error out unless it is something serious.

  • nsites - This is the total number of bi-allelic sites that overlap the source variant from which the mapping variant was choosen.

  • resolver - The resolver class associated with the mapper, in the output below we can see that we have an gwas_norm.variants.resolvers.BaseResolver object.

[5]:
# View the mapping variant data
my_var
[5]:
MappingResult(source_coords=MapCoord(chr_name='1', start_pos=230710048, strand=1, ref_allele='A', alt_allele='G'), mapping_coords=MapCoord(chr_name='1', start_pos=230710048, strand=1, ref_allele='A', alt_allele='G'), map_bits=28160, source_row=None, map_row=['1', 230710048, 'rs699', 'A', 'G', '.', '.', {'start': 230710048, 'consequence_type': 'missense_variant', 'id': 'rs699', 'end': 230710048, 'clinical_significance': ['benign', 'risk factor'], 'seq_region_name': '1', 'alleles': ['A', 'G'], 'source': 'dbSNP', 'assembly_name': 'GRCh38', 'feature_type': 'variation', 'strand': 1}], errors=None, nsites=3, resolver=<gwas_norm.variants.resolvers.BaseResolver object at 0x7ff619c774c0>)

As the MappingResult is a namedtuple we can use attribute names to access the data. Below we access the source variant coordinates and alleles. This is also modelled based on a namedtuple with the following attributes:

  • chr_name - the name of the chromosome, this is always treated as a string

  • start_pos - the integer start position in base pairs

  • strand - the strand modelled as a signed integer of either 1 (forward strand) or -1 (reverse strand). In general, within gwas-norm, if strand is not known, then the strand defaults to 1

  • ref_allele - The reference allele for the variant

  • alt_allele - the alternate allele for the variant, note that there will only ever be 1 alternate allele as gwas-norm and the variant mappers only handle bi-allelic variants, so multi-allelic sites are split into ref/A1, ref/A2, ref/A3, ref/AN…

[6]:
my_var.source_coords
[6]:
MapCoord(chr_name='1', start_pos=230710048, strand=1, ref_allele='A', alt_allele='G')

The mapping bits (map_bits)#

Dispite being a single integer, the map_bits variable contains a lot of information about how the variant was mapped into the source data. It can be easily (and efficiently) interigated using bitwise operations. This sounds more complicated than it is and in fact if you have subset a data frame in pandas or R you have probably used it before without realising.

[7]:
my_var.map_bits
[7]:
28160

The value 28160 is pretty meaningless as it stands. However, you can decode this into a human readable format using a static method from the mapper.

[8]:
# Make the map_bits human readable
mapper.EnsemblVariantMapper.decode_mapping_flags(my_var.map_bits)
[8]:
['CHR', 'START', 'STRAND', 'REF', 'ALT']

We have converted the single integer into a list of human readable strings and now we can see that the source variant maps against our maping variant based on CHR (chromosome), START (start position), STRAND (strand - using the default strand of 1), REF (reference allele matches), ALT (alternate allele matches). So all in all it is a good match!

Whilst decoding the mapping imformation is useful in some circumstances you might not want to use if for filtering purposes. You can also use bitwise operations.

So if we want to see if the strand has matched

[9]:
from gwas_norm.variants import constants as vc
[10]:
# my_var.map_bits & vc.STRAND.bits is the important part the == is just making boolean
my_var.map_bits & vc.STRAND.bits == vc.STRAND.bits
[10]:
True

If we want to see if we have had to flip the reference allele in order to get a mapping. In the bitwise AND (&) operation we are testing for the bit position respresenting a strand flip being set to 1 in both cases. It is in vc.STRAND_FLIP.bits but not in my_var.map_bits so the result is 0

[11]:
my_var.map_bits & vc.STRAND_FLIP.bits
[11]:
0

The bitwise information that is currently inplement in the variant mapper is:

  • NO_DATA - No mapping

  • CHR - mapping on chromosome

  • START - mapping on start position

  • STRAND - strand matching

  • REF - reference allele match

  • ALT - alternate allele match

  • REF_FLIP - The reference allele was flipped to obtain the reference/alternate allele match

  • STRAND_FLIP - The DNA strand was flipped to obtain the reference/alternate allele match

  • ALT_ALLELE_INFERRED - the alternate allele has been inferred (imputed), will be added if the source variant alt_allele is None.

  • NORMALISED - is added if the source variant is an indel and was adjusted to normal form prior to mapping

These mapping information flags are also implemented as namedtuples with the bit (integer) under the bits attribute as can be seen above and below:

[12]:
print(vc.NO_DATA)
print(vc.CHR)
print(vc.START)
print(vc.STRAND)
print(vc.REF)
print(vc.ALT)
print(vc.REF_FLIP)
print(vc.STRAND_FLIP)
print(vc.ALT_ALLELE_INFERRED)
print(vc.NORMALISED)
MappingFlag(name='NO_DATA', bits=0, description='A flag indicating a mapping has no data associated with it')
MappingFlag(name='CHR', bits=512, description='A flag indicating a variant has been mapped based on chromosome name')
MappingFlag(name='START', bits=2048, description='A flag indicating a mapping has no data associated with it')
MappingFlag(name='STRAND', bits=1024, description='A flag indicating a variant has been mapped based on strand')
MappingFlag(name='REF', bits=8192, description='A flag indicating a variant has been mapped based on reference allele')
MappingFlag(name='ALT', bits=16384, description='A flag indicating a variant has been mapped based on alternate(other) allele')
MappingFlag(name='REF_FLIP', bits=64, description='A flag indicating the variant reference allele has been flipped prior tomatching')
MappingFlag(name='STRAND_FLIP', bits=128, description='A flag indicating a variant strand has been flipped prior to matching')
MappingFlag(name='ALT_ALLELE_INFERRED', bits=32, description='A flag indicating the the alternate allele has been inferred for a variant')
MappingFlag(name='NORMALISED', bits=8, description='A flag indicating an insertion/deletion variant has normalised prior tomatching')

The resolver class#

We have seen above that the mapping result object contains an attribute for the resolver. The job a the resolver object is as follows.

  • Provide a resolution strategy for resolving ambigious mappings. Currently the implemented resolvers do not do this but that have a method that is called from the mapper that can be overidden.

  • Resolve mappings where the source variant has no alternate allele data. This feature is implemented in the gwas_norm.variants.resolvers.EnsemblResolver and gwas_norm.variants.resolvers.MappingFileResolver objects.

  • Provide access to metadata that is available in the data source used to localise the mapping variant.

The resolver also provides a mechanism where the user can define their own additional functionality and metadata extraction. The resolver used in the example above is a gwas_norm.variants.resolvers.BaseResolver object, it does the base minimum and has no alt imputation capabilities and minimal metadata extraction. It is not recomanded for general purpose use but is used here just to illustrate the starting point for a resolver class. Here is an illustration of some metadata extraction below. We will perform the same mapping as above however, we will flip the strand and the reference alleles as well, so will also demonstrate how the map_info changes.

[13]:
# We will use the same example as before but flip the strand and ref allele
# The chromosome is a string not an interger
chr_name = '1'
start_pos = 230710048
ref_allele = "C"
alt_allele = "T"
[14]:
with mapper.EnsemblVariantMapper(rc) as emap:
    my_var = emap.map_variant(chr_name, start_pos, ref_allele, alt_allele=alt_allele)
[15]:
my_var
[15]:
MappingResult(source_coords=MapCoord(chr_name='1', start_pos=230710048, strand=1, ref_allele='C', alt_allele='T'), mapping_coords=MapCoord(chr_name='1', start_pos=230710048, strand=1, ref_allele='A', alt_allele='G'), map_bits=27328, source_row=None, map_row=['1', 230710048, 'rs699', 'A', 'G', '.', '.', {'consequence_type': 'missense_variant', 'start': 230710048, 'feature_type': 'variation', 'assembly_name': 'GRCh38', 'source': 'dbSNP', 'seq_region_name': '1', 'clinical_significance': ['benign', 'risk factor'], 'end': 230710048, 'alleles': ['A', 'G'], 'strand': 1, 'id': 'rs699'}], errors=None, nsites=3, resolver=<gwas_norm.variants.resolvers.BaseResolver object at 0x7ff619c09af0>)

So for illustration we will make the map_bits human readable, so now we can see that we have a REF/ALT match after REF_FLIP and STRAND_FLIP. Also, in the process of flipping the strand our source strand went from the default 1, to -1, so no longer matched the strand in Ensembl.

[16]:
# Make the map_bits human readable
mapper.EnsemblVariantMapper.decode_mapping_flags(my_var.map_bits)
[16]:
['CHR', 'START', 'REF', 'ALT', 'REF_FLIP', 'STRAND_FLIP']

So, back to the resolver object. We have seen that it is available in our mapping result and it is also available via out mapper object i.e. emap.resolver. Given that the mapper object is now out of scope we will use the (idenitcal) version in the results object.

[17]:
rsvr = my_var.resolver
type(rsvr)
[17]:
gwas_norm.variants.resolvers.BaseResolver

So, it is important to note that the resolver object is somewhat tied to the datasource, so whilst BaseResolver will work with everything (because it does not do much), the EnsemblResolver will only work the the EnsemblMapper and the MappingFileResolver will only work with the mapping file detailed here. However, you can easily adapt the MappingFileResolver for other VCF files and hopefully in the future there will be a more generalisable VCF resolver. The resolvers basically allow some separation of the core mapping duties from the “other” information we might want from a mapped variant.

To get the most out of the EnsemblVariantMapper, you will need to use the EnsemblResolver

[18]:
# Lets import the resolvers module
from gwas_norm.variants import resolvers
[19]:
# Now we will create an ensembl resolver - this will need access to the rest client also
rsvr = resolvers.EnsemblResolver(rc)

# Now create the EnsemblVariantMapper giving it the resolver
with mapper.EnsemblVariantMapper(rc, resolver=rsvr) as emap:
    my_var = emap.map_variant(chr_name, start_pos, ref_allele, alt_allele=alt_allele)

We can use the EnsemblResolver to extract the metadata from our mapping result. Much of this is similar to the map_row in the MappingResult object but there are a few differences. Firstly, the clinical_significance key has been expanded to contain some namedtuples for the ClinVar data. Similarly the consequence_type has been mapped to give more information. We also, have an alt_allele_freq key and freq_pops key. These are not defined at the moment (freq_pops is empty). This is because we have not told our resolver what populations that we are interested in.

[20]:
rsvr.extract_metadata(my_var)
[20]:
{'alleles': ['A', 'G'],
 'end': 230710048,
 'strand': 1,
 'id': 'rs699',
 'seq_region_name': '1',
 'clinical_significance': [ClinVarSig(int=2, rank=1024, name='benign', display_name='Benign'),
  ClinVarSig(int=9, rank=8, name='risk_factor', display_name='Risk factor')],
 'assembly_name': 'GRCh38',
 'feature_type': 'variation',
 'source': 'dbSNP',
 'start': 230710048,
 'consequence_type': So(name='missense_variant', rank=1024, impact='MODERATE', id='SO:0001583', description='A sequence variant, that changes one or more bases, resulting in a different amino acid sequence but where the length is preserved', display_name='Missense variant'),
 'alt_allele_freq': 0,
 'freq_pops': []}

Specifying population groups#

Both the EnsemblResolver and the MappingFileResolver inherit from teh generic PopulationResolver. This is a base class that is able to parse population arguments and has some methods for calculating allele frequencies in different ways. This raises a couple of questions? How do we specify what populations we are interested in and how do we know what populations are available for the data source in question (in this case Ensembl).

We can get a list of the populations available in Ensembl from the resolver object and we can use the name to specify them in our resolver

[21]:
rsvr.list_populations()
[21]:
[{'name': '1000GENOMES:phase_3:ACB',
  'size': 96,
  'description': 'African Caribbean in Barbados'},
 {'name': '1000GENOMES:phase_3:AFR', 'size': 661, 'description': 'African'},
 {'size': 2504,
  'name': '1000GENOMES:phase_3:ALL',
  'description': 'All phase 3 individuals'},
 {'description': 'American', 'name': '1000GENOMES:phase_3:AMR', 'size': 347},
 {'description': 'African Ancestry in Southwest US',
  'size': 61,
  'name': '1000GENOMES:phase_3:ASW'},
 {'name': '1000GENOMES:phase_3:BEB',
  'size': 86,
  'description': 'Bengali in Bangladesh'},
 {'description': 'Chinese Dai in Xishuangbanna, China',
  'size': 93,
  'name': '1000GENOMES:phase_3:CDX'},
 {'description': 'Utah residents with Northern and Western European ancestry',
  'name': '1000GENOMES:phase_3:CEU',
  'size': 99},
 {'name': '1000GENOMES:phase_3:CHB',
  'size': 103,
  'description': 'Han Chinese in Bejing, China'},
 {'description': 'Southern Han Chinese, China',
  'size': 105,
  'name': '1000GENOMES:phase_3:CHS'},
 {'name': '1000GENOMES:phase_3:CLM',
  'size': 94,
  'description': 'Colombian in Medellin, Colombia'},
 {'name': '1000GENOMES:phase_3:EAS', 'size': 504, 'description': 'East Asian'},
 {'description': 'Esan in Nigeria',
  'name': '1000GENOMES:phase_3:ESN',
  'size': 99},
 {'name': '1000GENOMES:phase_3:EUR', 'size': 503, 'description': 'European'},
 {'description': 'Finnish in Finland',
  'size': 99,
  'name': '1000GENOMES:phase_3:FIN'},
 {'description': 'British in England and Scotland',
  'name': '1000GENOMES:phase_3:GBR',
  'size': 91},
 {'name': '1000GENOMES:phase_3:GIH',
  'size': 103,
  'description': 'Gujarati Indian in Houston, TX'},
 {'description': 'Gambian in Western Division, The Gambia',
  'size': 113,
  'name': '1000GENOMES:phase_3:GWD'},
 {'name': '1000GENOMES:phase_3:IBS',
  'size': 107,
  'description': 'Iberian populations in Spain'},
 {'description': 'Indian Telugu in the UK',
  'name': '1000GENOMES:phase_3:ITU',
  'size': 102},
 {'description': 'Japanese in Tokyo, Japan',
  'name': '1000GENOMES:phase_3:JPT',
  'size': 104},
 {'description': 'Kinh in Ho Chi Minh City, Vietnam',
  'name': '1000GENOMES:phase_3:KHV',
  'size': 99},
 {'name': '1000GENOMES:phase_3:LWK',
  'size': 99,
  'description': 'Luhya in Webuye, Kenya'},
 {'description': 'Mende in Sierra Leone',
  'size': 85,
  'name': '1000GENOMES:phase_3:MSL'},
 {'description': 'Mexican Ancestry in Los Angeles, California',
  'size': 64,
  'name': '1000GENOMES:phase_3:MXL'},
 {'name': '1000GENOMES:phase_3:PEL',
  'size': 85,
  'description': 'Peruvian in Lima, Peru'},
 {'description': 'Punjabi in Lahore, Pakistan',
  'size': 96,
  'name': '1000GENOMES:phase_3:PJL'},
 {'description': 'Puerto Rican in Puerto Rico',
  'name': '1000GENOMES:phase_3:PUR',
  'size': 104},
 {'size': 489,
  'name': '1000GENOMES:phase_3:SAS',
  'description': 'South Asian'},
 {'name': '1000GENOMES:phase_3:STU',
  'size': 102,
  'description': 'Sri Lankan Tamil in the UK'},
 {'description': 'Toscani in Italy',
  'name': '1000GENOMES:phase_3:TSI',
  'size': 107},
 {'description': 'Yoruba in Ibadan, Nigeria',
  'name': '1000GENOMES:phase_3:YRI',
  'size': 108},
 {'size': None, 'name': 'ESP6500:African_American', 'description': None},
 {'size': None, 'name': 'ESP6500:European_American', 'description': None},
 {'description': 'Population from the Gambian Genome Variation Project. Gambian in Western Division, The Gambia - Fula',
  'size': 100,
  'name': 'GGVP:GWF'},
 {'description': 'Population from the Gambian Genome Variation Project. Gambian in Western Division, The Gambia - Mandinka.',
  'size': 280,
  'name': 'GGVP:GWD'},
 {'description': 'Population from the Gambian Genome Variation Project. Gambian in Western Division, The Gambia - Wolof.',
  'size': 100,
  'name': 'GGVP:GWW'},
 {'name': 'GGVP:GWJ',
  'size': 100,
  'description': 'Population from the Gambian Genome Variation Project. Gambian in Western Division, The Gambia - Jola.'},
 {'size': 580,
  'name': 'GGVP:ALL',
  'description': 'All populations from the Gambian Genome Variation Project.'}]

So, lets suppose we want our allele frequency to be an average of a selection of West African population groups. We can run this and get an alt_allele_freq of 0.95 (2dp).

[22]:
# Define the populations in a list
pops = ['1000GENOMES:phase_3:MSL', '1000GENOMES:phase_3:GWD', '1000GENOMES:phase_3:ESN']

# Now we will create an ensembl resolver - this will need access to the rest client also
rsvr = resolvers.EnsemblResolver(rc, populations=pops)

# Now create the EnsemblVariantMapper giving it the resolver
with mapper.EnsemblVariantMapper(rc, resolver=rsvr) as emap:
    my_var = emap.map_variant(chr_name, start_pos, ref_allele, alt_allele=alt_allele)
rsvr.extract_metadata(my_var)
[22]:
{'clinical_significance': [ClinVarSig(int=2, rank=1024, name='benign', display_name='Benign'),
  ClinVarSig(int=9, rank=8, name='risk_factor', display_name='Risk factor')],
 'end': 230710048,
 'alleles': ['A', 'G'],
 'seq_region_name': '1',
 'consequence_type': So(name='missense_variant', rank=1024, impact='MODERATE', id='SO:0001583', description='A sequence variant, that changes one or more bases, resulting in a different amino acid sequence but where the length is preserved', display_name='Missense variant'),
 'start': 230710048,
 'id': 'rs699',
 'source': 'dbSNP',
 'feature_type': 'variation',
 'assembly_name': 'GRCh38',
 'strand': 1,
 'alt_allele_freq': 0.9466666666666668,
 'freq_pops': ['1000GENOMES:phase_3:MSL',
  '1000GENOMES:phase_3:GWD',
  '1000GENOMES:phase_3:ESN']}

Now lets suppose that we want to calculate the allele frequency for an admixed population where we expect 90% European and 10% East Asian. We can set a weighted average:

[23]:
# Define the populations in a list
pops = [('1000GENOMES:phase_3:EUR', 0.9), ('1000GENOMES:phase_3:EAS', 0.1)]

# Now we will create an ensembl resolver - this will need access to the rest client also
rsvr = resolvers.EnsemblResolver(rc, populations=pops)

# Now create the EnsemblVariantMapper giving it the resolver
with mapper.EnsemblVariantMapper(rc, resolver=rsvr) as emap:
    my_var = emap.map_variant(chr_name, start_pos, ref_allele, alt_allele=alt_allele)
rsvr.extract_metadata(my_var)
[23]:
{'feature_type': 'variation',
 'assembly_name': 'GRCh38',
 'strand': 1,
 'source': 'dbSNP',
 'id': 'rs699',
 'start': 230710048,
 'consequence_type': So(name='missense_variant', rank=1024, impact='MODERATE', id='SO:0001583', description='A sequence variant, that changes one or more bases, resulting in a different amino acid sequence but where the length is preserved', display_name='Missense variant'),
 'seq_region_name': '1',
 'alleles': ['A', 'G'],
 'end': 230710048,
 'clinical_significance': [ClinVarSig(int=2, rank=1024, name='benign', display_name='Benign'),
  ClinVarSig(int=9, rank=8, name='risk_factor', display_name='Risk factor')],
 'alt_allele_freq': 0.454,
 'freq_pops': ['1000GENOMES:phase_3:EUR', '1000GENOMES:phase_3:EAS']}

We can also have preferred and fallback population groups. So Imagine we wanted 90% Britsh and 10% Sierra Leone but if they did not exist then we would be happy with 90% CEU and 10% Yoruba. We can define hierarchical population groups.

[24]:
# Define the populations in a list
pops = [
    (
        ('1000GENOMES:phase_3:GBR', '1000GENOMES:phase_3:CEU'), 0.9
    ),
    (
        ('1000GENOMES:phase_3:MSL', '1000GENOMES:phase_3:YRI'), 0.1
    )
]

# Now we will create an ensembl resolver
# Note that we are supplying the populations and the method is set to hierarchy
rsvr = resolvers.EnsemblResolver(rc, populations=pops, allele_freq_method='hierarchy')

# Now create the EnsemblVariantMapper giving it the resolver
with mapper.EnsemblVariantMapper(rc, resolver=rsvr) as emap:
    my_var = emap.map_variant(chr_name, start_pos, ref_allele, alt_allele=alt_allele)
rsvr.extract_metadata(my_var)
[24]:
{'alleles': ['A', 'G'],
 'id': 'rs699',
 'source': 'dbSNP',
 'assembly_name': 'GRCh38',
 'seq_region_name': '1',
 'strand': 1,
 'consequence_type': So(name='missense_variant', rank=1024, impact='MODERATE', id='SO:0001583', description='A sequence variant, that changes one or more bases, resulting in a different amino acid sequence but where the length is preserved', display_name='Missense variant'),
 'clinical_significance': [ClinVarSig(int=2, rank=1024, name='benign', display_name='Benign'),
  ClinVarSig(int=9, rank=8, name='risk_factor', display_name='Risk factor')],
 'feature_type': 'variation',
 'end': 230710048,
 'start': 230710048,
 'alt_allele_freq': 0.43100000000000005,
 'freq_pops': ['1000GENOMES:phase_3:GBR', '1000GENOMES:phase_3:MSL']}

No mapping and errors#

So far we have used a simple example where we have a mapping against the reference Ensembl data. However, what happens when we have no mapping, or when our attempts at mapping produce an error. In these cases the mapper will still return a mapping however, the map_info will have either the NO_DATA or ERROR bits assigned. This is illustrated below.

[25]:
with mapper.EnsemblVariantMapper(rc) as emap:
    # A Made up indel site
    no_map = emap.map_variant('6', 31315407, 'A', alt_allele='ACT')
    # The same made up indel site with alleles changed
    error = emap.map_variant('6', 31315407, 'G', alt_allele='GCT')

The made up INDEL site no_map demonstrates the mapping_coords attribute being set to None in the event of no mapping and the map_bits are 0 which translates to NO_DATA. Using an INDEL example also reveals some of the inner workings of the EnsemblVariantMapper. The source variant had the alleles A/ACT - an insertion expressed in VCF format. However, the MappingResult has the alleles -/CT - these are expressed in ensembl/VEP format. You will also notice that the start_pos has been incremented by 1. So, Ensembl respresents INDELs differently to VCF files and the EnsemblVariantMapper will internally ensemblise any VCF INDELs, then query against Ensembl before re-normalising the match.

[26]:
print(no_map)
print(mapper.EnsemblVariantMapper.decode_mapping_flags(my_var.map_bits))
MappingResult(source_coords=MapCoord(chr_name='6', start_pos=31315408, strand=1, ref_allele='-', alt_allele='CT'), mapping_coords=None, map_bits=0, source_row=None, map_row=None, errors=None, nsites=1, resolver=None)
['CHR', 'START', 'REF', 'ALT', 'REF_FLIP', 'STRAND_FLIP']

The made up INDEL site error demonstrates a site that does not map to the reference data source, so therefore is normalised and aligned to the genome. The reference genome alignment failed and produced an error that is captured in the result errors=KeyError('REF allele not in reference assembly'). The map_bits are 1 which tranlates to ERROR.

[27]:
print(error)
print(mapper.EnsemblVariantMapper.decode_mapping_flags(my_var.map_bits))
MappingResult(source_coords=MapCoord(chr_name='6', start_pos=31315407, strand=1, ref_allele='G', alt_allele='GCT'), mapping_coords=None, map_bits=1, source_row=None, map_row=None, errors=KeyError('REF allele not in reference assembly'), nsites=0, resolver=None)
['CHR', 'START', 'REF', 'ALT', 'REF_FLIP', 'STRAND_FLIP']

Inferring ALT alleles#

The core variant mapping algorithm can be given sites with an unknown ALT allele. Just as with full sites, it will localise these to the reference dataset and measure the mapping information, for sites where the positional and reference allele information matches (even after “flipping”), the mapper will call on the resolver for help in identifing the localised matching variant that is most liekly to be the correct match. This ALT inferrence algorithm is entirely resolver dependent. The BaseResolver will just return a NO_DATA when asked to resolve the ALT allele, however, the EnsemblResolver will attempt to identify the ALT allele according to some simple heuristics. Importantly, the user can subclass a resolver and implement their own method for doing it if they are not happy with the default method.

We will start of using the same variant as before but we will not supply the ALT allele. We know from the initial examples that this variant overlaps 3 other variants at that site. We will also use the BaseResolver and see that it returns NO_DATA as it has no ability to infer the ALT allele.

[28]:
# The chromosome is a string not an interger
chr_name = '1'
start_pos = 230710048
ref_allele = "A"
[29]:
with mapper.EnsemblVariantMapper(rc) as emap:
    my_var = emap.map_variant(chr_name, start_pos, ref_allele, alt_allele=None)
mapper.EnsemblVariantMapper.decode_mapping_flags(my_var.map_bits)
[29]:
['NO_DATA']

Now we try the same with the EnsemblResolver

[30]:
rsvr = resolvers.EnsemblResolver(rc)

with mapper.EnsemblVariantMapper(rc, resolver=rsvr) as emap:
    my_var = emap.map_variant(chr_name, start_pos, ref_allele, alt_allele=None)
mapper.EnsemblVariantMapper.decode_mapping_flags(my_var.map_bits)
[30]:
['NO_DATA']

What happened here? The EnsemblResolver should have the ability to infer the ALT allele but it didn’t? This is because we did not supply any populations from which to calculate allele frequences. If the variant was unique at the site then this would not matter and it would be returned as the most liekly ALT allele if the ref allele could be matched, given that the assumption is that the data source has complete information. However, this genomic site has 3 variants localising to it so we need some population information to use to select the one with the highest MAF for our selected population(s). This is based on the assumption that GWAS will usually assay common variants. So lets try again supplying some population information.

[41]:
pops = ['1000GENOMES:phase_3:EUR']
rsvr = resolvers.EnsemblResolver(rc, populations=pops)

with mapper.EnsemblVariantMapper(rc, resolver=rsvr) as emap:
    my_var = emap.map_variant(chr_name, start_pos, ref_allele, alt_allele=None)
mapper.EnsemblVariantMapper.decode_mapping_flags(my_var.map_bits)
[41]:
['CHR', 'START', 'STRAND', 'REF', 'ALT_ALLELE_INFERRED']

We can see above that we have now managed to locate our variant and that is has mapped with CHR,START,STRAND,REF and that the ALT_ALLELE_INFERRED. Now lets take a look at the number of possible variants at that location.

[42]:
my_var.nsites
[42]:
2

We can see that is says 2, you may have noticed that the same location said 3 variant sites when we supplied the ALT allele in the beginning. The reason for the difference is that the nsites when the ALT allele was given is determined by the mapper and is all of the possible sites at that location - irrespective of allele matching. However, when the ALT allele is not given the nsites attribute is determined by the Ensemblresolver and this is looking closly at the 3 supplied variants at that location and determined that two of them had reference alleles that matched with one being selected based on the fact it had the higher MAF (and the only variant with a MAF > unsafe_alt_infer see the PopulationResolver API).

Ultimately, whether you trust the ALT imputed variants will depend on what your expect you dataset to contain. If it is an older GWAS then you might expect more common variants to be present and not many indels. However, if you expect there to be many sites in the GWAS from the same genomic location then you should be careful with the variants returned from ALT allele inferrence and perhaps filter your dataset for variants that have only a single possible site. In future updates, the mapper will also be able to accept existing variant IDs that it will pass to the resolver, this might also be able to help deal with multiple variants at a single site.

Ensemblisation#

This section is just to illustrate some of the current pitfalls with mapping against Ensembl (as implemented in the EnsemblVariantMapper). The term “ensemblisation” is used to describe turning a VCF INDEL into an INDEL as represented by Ensembl and the VEP. This was touched upon above but the EnsemblVariantMapper will:

  1. Ensemblise INDELs if they are not already in Ensembl format

  2. Map them against Ensembl

  3. Re-normalise the INDELs to VCF format

Below, is an illustration of how this can go (kind of) wrong. This is due the the EnsemblVariantMapping implementation and not Ensembl. For this, we will return the the same genomic location we have used up to now but just focus on the other variant that is co-located at the site.

[ ]:
chr_name = '1'
start_pos = 230710048
ref_allele = "A"
# This is a deletion represented in Ensembl format
alt_allele = '-'

Lets search using the same population we used when imputing the ALT allele

[43]:
pops = ['1000GENOMES:phase_3:EUR']
rsvr = resolvers.EnsemblResolver(rc, populations=pops)

with mapper.EnsemblVariantMapper(rc, resolver=rsvr) as emap:
    my_var = emap.map_variant(chr_name, start_pos, ref_allele, alt_allele=alt_allele)
mapper.EnsemblVariantMapper.decode_mapping_flags(my_var.map_bits)
[43]:
['CHR', 'START', 'STRAND', 'REF', 'ALT']

We can see that we have managed to map it to the genome. Now lets take a look at what it has maped to:

[44]:
my_var.mapping_coords
[44]:
MapCoord(chr_name='1', start_pos=230710047, strand=1, ref_allele='CA', alt_allele='C')

We can see that the mapping coordinates are normalised as you might expect in a VCF file with the site before the deletion represented and the start position adjusted accordingly. This is fine at the moment and the mapper is designed to give back normalised sites and the EnsemblVariantMapper can accept and handle Ensembl formatted sites. However, the EnsemblVariantMapper can also handle normalised VCF sites. So lets test it again and give it and VCF style site that is incorrect (the T should be a C) according to what we have seen above.

[45]:
chr_name = '1'
# Shifted the start position
start_pos = 230710047
ref_allele = "TA"
alt_allele = 'T'
[46]:
pops = ['1000GENOMES:phase_3:EUR']
rsvr = resolvers.EnsemblResolver(rc, populations=pops)

with mapper.EnsemblVariantMapper(rc, resolver=rsvr) as emap:
    my_var = emap.map_variant(chr_name, start_pos, ref_allele, alt_allele=alt_allele)
mapper.EnsemblVariantMapper.decode_mapping_flags(my_var.map_bits)
[46]:
['CHR', 'START', 'STRAND', 'REF', 'ALT']
[47]:
my_var.mapping_coords
[47]:
MapCoord(chr_name='1', start_pos=230710047, strand=1, ref_allele='CA', alt_allele='C')

So we can see that our incorrect alleles have perfectly mapped, and we can see that the alleles in the mapped site are different to the ones supplied. In reality this should have generated an error as the T allele is not present at that position in the reference genome. However, the reference genome is only checked on the event of no match (for efficiency) and the quality of the mapping is assessed based on the Ensemblised form of the INDEL. So please be aware of this issue, I will hopefully fix it soon.

Mapping pandas.DataFrame#

The EnsemblVariantMapper can also be used to validate all the variants held in an pandas.DataFrame. This section will illustrate that with a small example.

[49]:
import pandas as pd
[66]:
def create_df():
    # Create a data.frame
    return pd.DataFrame(
        [
            ['1', 230710048, 1, 'A', 'G', 'rs699'],
            ['2', 231454043, 1, 'C', 'G', 'rs16828074'],
            ['2', 231454043, 1, 'C', 'T', 'rs16828074'],
            ['2', 231454043, -1, 'A', 'G', 'rs16828074'],
            ['18', 68625022, 1, 'T', 'C', 'rs17232800'],
            ['18', 68625022, 1, 'T', 'G', 'rs17232800']
        ], columns=['CHR', 'START', 'STRAND', 'REF', 'ALT', 'VAR_ID']
    )

As before we will start off using the BaseResolver and progress to the EnsemblResolver

[60]:
df = create_df()

with mapper.EnsemblVariantMapper(rc) as emap:
    map_df = mapper.map_data_frame(
            df, emap, chr_name='CHR', start_pos='START', ref_allele='REF',
            alt_allele='ALT', strand='STRAND', decode_map_info=False
        )
[61]:
map_df
[61]:
CHR START STRAND REF ALT VAR_ID chr_name_mapper start_pos_mapper strand_mapper ref_allele_mapper alt_allele_mapper nsites map_info
0 1 230710048 1 A G rs699 1 230710048 1 A G 3 28160
1 2 231454043 1 C G rs16828074 2 231454043 1 C G 2 28160
2 2 231454043 1 C T rs16828074 2 231454043 1 C T 2 28160
3 2 231454043 -1 A G rs16828074 2 231454043 1 C T 2 27328
4 18 68625022 1 T C rs17232800 18 68625022 1 T C 2 28160
5 18 68625022 1 T G rs17232800 18 68625022 1 T G 2 28160

So now we can use the bitwise operation to filter out the site that has mapped via a strand flip

[64]:
map_df.loc[~((map_df.map_info & vc.STRAND_FLIP.bits) == vc.STRAND_FLIP.bits),]
[64]:
CHR START STRAND REF ALT VAR_ID chr_name_mapper start_pos_mapper strand_mapper ref_allele_mapper alt_allele_mapper nsites map_info
0 1 230710048 1 A G rs699 1 230710048 1 A G 3 28160
1 2 231454043 1 C G rs16828074 2 231454043 1 C G 2 28160
2 2 231454043 1 C T rs16828074 2 231454043 1 C T 2 28160
4 18 68625022 1 T C rs17232800 18 68625022 1 T C 2 28160
5 18 68625022 1 T G rs17232800 18 68625022 1 T G 2 28160

Now we can do the same using the EnsemblResolver

[67]:
df = create_df()
pops = ['1000GENOMES:phase_3:EUR']
rsvr = resolvers.EnsemblResolver(rc, populations=pops)

with mapper.EnsemblVariantMapper(rc, resolver=rsvr) as emap:
    map_df = mapper.map_data_frame(
            df, emap, chr_name='CHR', start_pos='START', ref_allele='REF',
            alt_allele='ALT', strand='STRAND', decode_map_info=False
        )
[68]:
df
[68]:
CHR START STRAND REF ALT VAR_ID chr_name_mapper start_pos_mapper strand_mapper ref_allele_mapper alt_allele_mapper nsites map_info alt_allele_freq used_pops var_id worst_consequence worst_clinvar
0 1 230710048 1 A G rs699 1 230710048 1 A G 3 28160 0.41 [1000GENOMES:phase_3:EUR] rs699 missense_variant risk_factor
1 2 231454043 1 C G rs16828074 2 231454043 1 C G 2 28160 NaN None rs16828074 3_prime_UTR_variant None
2 2 231454043 1 C T rs16828074 2 231454043 1 C T 2 28160 0.00 [1000GENOMES:phase_3:EUR] rs16828074 3_prime_UTR_variant None
3 2 231454043 -1 A G rs16828074 2 231454043 1 C T 2 27328 0.00 [1000GENOMES:phase_3:EUR] rs16828074 3_prime_UTR_variant None
4 18 68625022 1 T C rs17232800 18 68625022 1 T C 2 28160 NaN None rs17232800 intergenic_variant None
5 18 68625022 1 T G rs17232800 18 68625022 1 T G 2 28160 0.07 [1000GENOMES:phase_3:EUR] rs17232800 intergenic_variant None

Summary#

So this ends the brief walk throough the variant mappers with a focus on the EnsemblVariantMapper, much of the functionality you see here can be applied to the VcfTabixMapper as well.