Introduction to variant mapping#

Within gwas-norm, variant mapping describes the process of validating a genetic variant against known data sources and annotating it with basic information. The assumption behind this is that, since most of the variants used in GWAS are known, they should all be present in public data repositories, and those that are not are suspect. With the advent of sequencing projects this assumption may be eroded; however, variants that do not validate against known data sources are never excluded from the data sets, rather they are flagged with a NO_DATA flag.

Some general features and strategies used in variant mapping are outlined below for both the API and command line endpoints. Then separate sections are dedicated to the command line calls and API examples.

What are the data sources?#

The gwas_norm.variants Python API is designed to be flexible in terms of the data source, so it could be a VCF file of your choosing or a pre-built mapping file. If using your own custom VCF file you will need to write some code for metadata extraction from the VCF; however, if you only want the rsIDs then custom VCFs should work fine with the base-level API classes. There are also web-based solutions using the Ensembl REST API.

The Unix command line endpoints are designed to work with the pre-built mapping file or the Ensembl REST API. However, it is relatively straightforward to build your own command line program for any custom VCF using the Python API.

The genome assembly you will be mapping against is defined by your data source.

The variant mapping process#

Broadly speaking we can divide the variant mapping process up into three stages.

  1. Localisation - the process of defining the likely mapping variants in the genome.

  2. Mapping - determining which one(s) of the localised variants actually represent the query variant.

  3. Resolution - given a list of mapping variants refine to a single variant (or not) and extract any metadata associated with that variant.

Localisation#

The process of localisation will usually zero in on an area of the genome and extract sets of variants that have the same chromosome name and start position as the query variant. These represent the best chance of validating the query variant. However, for INDELs this might not necessarily be the case. The variant mapping API and command line programs offer three strategies for localisation; a minimal sketch of the tabix-based lookup is given after this list.

  1. Via an Ensembl REST query (REST data sources) ~ requires an internet connection and can process 5 variants/second. This is implemented in the command line endpoint variant-map-ensembl and in the API class gwas_norm.variants.mapper.EnsemblVariantMapper.

  2. Via a TABIX query (VCF data sources) ~ requires a tabix index and can process 100-200 variants/second. This is implemented in the command line endpoint variant-map-tabix and in the API class gwas_norm.variants.mapper.TabixVcfVariantMapper.

  3. Via a file scan/join (VCF data sources) ~ the input file and the VCF file need to be pre-sorted (using the same sort strategy), after which this will process 2000-5000 variants/second for a GWAS of 20M variants and a mapping file of common variants. This is implemented in the command line endpoint variant-map-scan and in the API class gwas_norm.variants.mapper.ScanVcfVariantMapper.
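To make the tabix strategy more concrete, the sketch below shows the kind of positional lookup that underlies it, written directly against pysam rather than the gwas-norm classes. The file name and coordinates are placeholders and the snippet is illustrative only:

# Illustration only: localise candidate variants from a tabix-indexed VCF
# with pysam; the gwas-norm mappers wrap this kind of query with scoring
# and resolution on top. "mapping.vcf.gz" is a placeholder path and needs
# a .tbi/.csi index.
import pysam

vcf = pysam.VariantFile("mapping.vcf.gz")

# Query variant position (1-based); fetch() uses 0-based half-open
# coordinates, hence the -1 on the start.
chrom, pos = "6", 99341898

candidates = [
    rec for rec in vcf.fetch(chrom, pos - 1, pos)
    if rec.pos == pos  # keep only records starting at the query position
]

for rec in candidates:
    print(rec.id, rec.chrom, rec.pos, rec.ref, rec.alts)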

Mapping#

Mapping broadly works the same irrespective of the localisation method. The core mapping process is outlined below.

  1. All the localised variants are first scored for how well they match the source variant. The scoring will:

    • Compare the chromosome/start position (even though it is likely that the potential mapping variants have been localised on these). If any of these do not match then no further checks are carried out for that localised variant.

    • If the chromosome/start position checks out then the strand is compared; strand mismatches do not stop the allele comparisons.

    • Make sure the ref/alt alleles are DNA, i.e. the characters ATCGatcg-; ambiguous codes will not be recognised as DNA. However, missing (i.e. None) ALT alleles are permitted. Also, I, D and R alleles are tolerated but will not be mapped.

    • Test for allele matches. This tests for an exact match, i.e. the source REF allele equals the localised REF allele and the same for the ALT allele (singular - the expectation is that we are bi-allelic). The flipped orientation is also tested, i.e. the source REF allele equals the localised ALT allele and the localised REF allele equals the source ALT allele. This is because we can’t guarantee which allele GWAS software will perform association tests on (a standalone sketch of these allele tests is given after the numbered list).

    • If no allele matches are detected, then the alleles are flipped onto the opposing strand and tested again. Failure to match here will terminate the scoring.

    • If no ALT allele is supplied, the same process as above is carried out just with the expectation that only a single allele will match.

    • Success or failure at any of the points above will result in different map_bits being assigned to the source/localised variant pair (see below).

    • Any allele errors detected during scoring will result in a no mapping; they are caught internally and a gwas_norm.variants.constants.MappingResult with the error contained within its error attribute is returned. This is not very Pythonic and I will probably add an option to raise the errors in future.

  2. The scored localised variants are then sorted so that the best matching variant is located at the start of the list of localised variants.

  3. If the best matching variant matches on at least CHR, START, REF/ALT (with REF/STRAND flipping allowed), then it will form the mapping result. Currently, ties are not tested for and it is difficult to see if we should expect them at all, other than if there is an error in the reference data source used to localise the variants.

  4. In the case where no variant matches, the resolve_poor_mapping method of the resolver is called. Currently, in all resolvers this will just return a no mapping. However, it is a route by which users can implement their own resolution strategies if they so wish.

  5. In the case where no ALT allele has been supplied then the impute_alt_allele method of the resolver is called. Currently, this method is implemented for all the key resolvers in gwas-norm. Once again, the user can override this method if they wish.

  6. For steps 3, 4 or 5, the variants are tested to see if any of their variant identifiers match any identifiers supplied by the user. Whilst an ID match does not define a successful mapping on its own, it may be used by the resolvers in the case of partial allele matches and is potentially useful information for the user to know.
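The allele comparison at the heart of the scoring can be illustrated with a small standalone sketch. This is not the gwas-norm implementation; the function name and return labels are invented for the example, and it only covers the exact-match, allele-swap and strand-flip tests described above:

# Simplified illustration of the allele matching tests described above;
# not the gwas-norm implementation, names and return labels are invented.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def flip_strand(allele):
    """Reverse complement an allele onto the opposing strand."""
    return allele.translate(COMPLEMENT)[::-1]

def match_alleles(src_ref, src_alt, loc_ref, loc_alt):
    """Describe how the source alleles match the localised alleles."""
    # Exact match: REF equals REF and ALT equals ALT
    if src_ref == loc_ref and src_alt == loc_alt:
        return "EXACT"
    # Flipped orientation: the GWAS may have tested the other allele
    if src_ref == loc_alt and src_alt == loc_ref:
        return "REF_FLIP"
    # Try again with the source alleles on the opposing strand
    fref, falt = flip_strand(src_ref), flip_strand(src_alt)
    if fref == loc_ref and falt == loc_alt:
        return "STRAND_FLIP"
    if fref == loc_alt and falt == loc_ref:
        return "STRAND_FLIP+REF_FLIP"
    return None  # no allele match at all

print(match_alleles("A", "G", "A", "G"))  # EXACT
print(match_alleles("A", "G", "G", "A"))  # REF_FLIP
print(match_alleles("A", "G", "T", "C"))  # STRAND_FLIP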

Specific mappers will either conduct some pre-mapping steps and/or post mapping steps in addition to the core mapping outlined above. These are outlined below.

Resolution#

In most cases there will either be a single mapping variant or no mapping variants at all. However, the mapper is able to accept variants where only a single allele is known (yes, many GWAS report this way!?) and the resolution stage is designed to help with that by extracting allele frequencies from the metadata and using them to assign the variant with the largest MAF. Specific resolvers are outlined in more detail below.

The mapping bits (map_bits)#

Despite being a single integer, the map_bits variable contains a lot of information about how the variant was mapped against the source data. It can be easily (and efficiently) interrogated using bitwise operations. This sounds more complicated than it is; in fact, if you have ever subset a data frame in pandas or R you have probably used them before without realising. A small example of such a check is given after the list below. In the command-line programs there is also an option to decode the value into a human readable string.

The bitwise information that is currently implemented in the variant mapper is:

  • NO_DATA - No mapping.

  • ID - The variant ID is matched between the source and the mapping variant.

  • CHR - mapping on chromosome.

  • START - mapping on start position.

  • STRAND - strand matching.

  • REF - reference allele match.

  • ALT - alternate allele match.

  • REF_FLIP - The reference allele was flipped to obtain the reference/alternate allele match.

  • STRAND_FLIP - The DNA strand was flipped to obtain the reference/alternate allele match.

  • ALT_ALLELE_INFERRED - the alternate allele has been inferred (imputed); this will be added if the source variant alt_allele is None.

  • NORMALISED - is added if the source variant is an indel and was adjusted to normal form prior to mapping.
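As an illustration of testing these flags with bitwise operations, here is a small sketch. The numeric bit values below are made up for the example; the real values are defined within gwas-norm and should be taken from there:

# Illustration only: the bit values below are invented for the example,
# the real values are defined within gwas-norm itself.
NO_DATA = 1 << 0
CHR = 1 << 1
START = 1 << 2
REF = 1 << 3
ALT = 1 << 4
STRAND_FLIP = 1 << 5

map_bits = CHR | START | REF | ALT | STRAND_FLIP

# Did the variant map on chromosome, start and both alleles?
wanted = CHR | START | REF | ALT
full_match = (map_bits & wanted) == wanted

# Was a strand flip needed to achieve the allele match?
was_flipped = bool(map_bits & STRAND_FLIP)

print(full_match, was_flipped)  # True True

# The same tests vectorise over a pandas column of results, e.g.
# df[(df["map_bits"] & STRAND_FLIP) > 0]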

Mapping using Ensembl as a data source#

This provides variant validation via an Ensembl REST query (REST data sources) ~ it requires an internet connection and can process 5 variants/second. It is implemented in the command line endpoint variant-map-ensembl and in the API class gwas_norm.variants.mapper.EnsemblVariantMapper.

Whilst it is slow, it is useful for ad-hoc tasks where you do not want masses of data stored locally. Some more detail on Ensembl specific functionality is listed below.

The EnsemblVariantMapper#

There are a number of additional complications with using Ensembl as a data source. Foremost is the way Ensembl represents INDEL alleles, which differs from the way they are represented in a VCF file. Both Ensembl and the Variant Effect Predictor (VEP) represent INDELs at the precise point at which they occur in the genome, with deletions having a - (minus sign) as the ALT allele. VCFs, however, reference INDELs from the base pair before the variant. Greater detail is given here. To be consistent, the EnsemblVariantMapper represents every variant in normalised form (as in a VCF); as such it has to perform some pre-mapping and post-mapping steps, which are highlighted below.

  1. Prior to localisation against Ensembl, the variant is Ensemblised - converted into Ensembl format (a simplified sketch of this conversion is given after this list). This only impacts INDELs; balanced polymorphisms such as single nucleotide polymorphisms (SNPs) remain unchanged, as do INDELs already in Ensembl format.

  2. If no ALT allele has been supplied then Ensemblisation will only occur if the variant has a deletion (-) for the REF allele.

  3. Localisation is then performed against Ensembl, gathering all the variants that overlap the (potentially adjusted) coordinates.

  4. The core mapping process outlined above is then performed.

  5. If a variant has been mapped, normalisation then occurs against the reference genome assembly. With the EnsemblVariantMapper, either a reference genome assembly can be supplied (queried with pysam) or REST API calls are used to gather DNA sequence information (the default).

  6. If no variant was mapped, and the source variant is an INDEL, then normalisation against the reference genome assembly is attempted and mapping is repeated. The rationale for this is that the source variant may not have been provided in normal form, which is a possible reason why no mapping was found. If the variant is not an INDEL, or nothing is found after this second attempt, then a no mapping is returned.
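The sketch below illustrates the kind of conversion performed during Ensemblisation. It is not the gwas-norm implementation: the function name is invented, the alleles are assumed to already be in minimal VCF form, and the start = end + 1 convention for insertions is the Ensembl/VEP default input format:

# Illustration only: convert a VCF-style variant (INDELs anchored on the
# base before the event) into an Ensembl/VEP-style representation, where
# the shared anchor base is dropped and an empty allele is written as "-".
def ensemblise(pos, ref, alt):
    """pos is the 1-based VCF position of ref; returns (start, end, ref, alt)."""
    if len(ref) == len(alt) or ref[0] != alt[0]:
        # balanced variants (e.g. SNPs) are left unchanged
        return pos, pos + len(ref) - 1, ref, alt
    # INDEL: drop the shared anchor base; an empty allele becomes "-"
    new_ref = ref[1:] or "-"
    new_alt = alt[1:] or "-"
    start = pos + 1
    # deletions span the deleted bases; for pure insertions the
    # Ensembl/VEP convention is start = end + 1
    end = pos + len(ref) - 1 if new_ref != "-" else pos
    return start, end, new_ref, new_alt

# Deletion: chr6:99341898 AATG>A  ->  99341899-99341901 ATG/-
print(ensemblise(99341898, "AATG", "A"))
# Insertion: chr6:99341898 A>AATG ->  99341899-99341898 -/ATG
print(ensemblise(99341898, "A", "AATG"))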

The EnsemblResolver#

Here the process of ALT allele inference is outlined. This is useful as some GWAS only provide the effect allele and not the non-effect allele (other_allele). It must be noted that the heuristic used to do this can’t perform miracles and ultimately it is up to the user whether they trust the ALT allele inference. The variant mappers in gwas-norm just aim to provide the data in a standardised way that is easy for the user to subset; they do not really attempt to make any analytical decisions.

If no ALT allele has been supplied then the following process is followed (a simplified sketch of this heuristic is given after the list):

  1. Localised and scored variants are passed to EnsemblResolver.impute_alt_allele.

  2. The localised variants are filtered for any variant matching at least with CHR, START, REF.

  3. The filtered list is assessed to see if any of the variants have matching IDs. If one or more variants have ID matches they are carried forward, if 0 variants have matching IDs then the list from step 2 is carried forward.

  4. If there is only a single variant at this stage then it is returned as the most likely variant; if > 1, then the MAF is assessed.

  5. If the user has supplied populations to the resolver (see example code), then MAFs are queried for each of the variants.

  6. If only a single variant has a MAF defined, or only a single variant has a MAF >= unsafe_alt_infer, then it is returned as the likely mapping variant.

  7. If > 1 variant has a MAF >= unsafe_alt_infer then a no mapping is returned.
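The selection logic above can be summarised in a small, simplified sketch. It is not the EnsemblResolver code: the dict-based records, field names and the default threshold value are invented for illustration, and the real resolver works on gwas-norm objects:

# Simplified illustration of the ALT allele inference heuristic described
# above; the records, field names and default threshold are invented.
def infer_alt(candidates, query_ids, unsafe_alt_infer=0.01):
    # steps 1-2: keep variants matching at least on CHR, START and REF
    matched = [c for c in candidates if c["chr_start_ref_match"]]
    # step 3: prefer variants whose identifiers match those supplied by the user
    with_id = [c for c in matched if c["id"] in query_ids]
    pool = with_id or matched
    # step 4: a single remaining candidate is returned directly
    if len(pool) == 1:
        return pool[0]
    # steps 5-7: otherwise fall back to minor allele frequencies
    with_maf = [c for c in pool if c.get("maf") is not None]
    if len(with_maf) == 1:
        return with_maf[0]
    above = [c for c in with_maf if c["maf"] >= unsafe_alt_infer]
    if len(above) == 1:
        return above[0]
    return None  # ambiguous or nothing suitable: a no mapping is returned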

Usage#
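A hypothetical sketch of calling the Ensembl-backed mapper from Python is shown below. Apart from the class path quoted above, the constructor arguments and the map_variant method are assumptions made purely for illustration; consult the gwas-norm API reference for the real signatures:

# Hypothetical usage sketch only: other than the class path, the
# constructor arguments and method name below are assumptions and may not
# match the real gwas-norm API.
from gwas_norm.variants.mapper import EnsemblVariantMapper

mapper = EnsemblVariantMapper(assembly="GRCh38")  # assumed constructor

# Map a single query variant (chromosome, 1-based position, REF, ALT);
# map_variant is an assumed method name.
result = mapper.map_variant("6", 99341898, "AATG", "A")

print(result.map_bits)  # bitwise summary of how the variant mapped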

Mapping using VCF files as a data source#

For any kind of performance it is preferable to map using locally available files rather than remote resources such as Ensembl. The variant mapper sub-package of gwas-norm provides two interfaces for mapping against VCF files. The first makes use of tabix indexing (TabixVcfVariantMapper) to localise variants and the second uses iteration and joining to localise the variants (ScanVcfVariantMapper). These can be combined with specific resolvers that handle metadata extraction and ALT allele inference in cases where the ALT allele is unknown.

Normalising variant alleles#

All the variant mappers implemented in gwas-norm have the ability to normalise the alleles of a variant should it not be possible to verify it against the mapping data source. The API to do this is also available for end users to use for other purposes. As with the variant mappers, this can be used offline via a reference DNA assembly or online via Ensembl. Some examples of usage are given in the documentation below.
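The normalisation itself follows the standard left-alignment algorithm described by Tan et al.: trim shared trailing bases, extending the alleles to the left with reference sequence whenever one would become empty, then trim shared leading bases. Below is a generic sketch, independent of the gwas-norm API; the get_base callable is a stand-in for whatever reference sequence access is available (e.g. a pysam FastaFile or an Ensembl REST sequence lookup), and the toy reference covers the rs16635 example from the next section:

# Generic sketch of VCF-style left alignment (after Tan et al.); not the
# gwas-norm implementation. get_base(chrom, pos) must return the 1-based
# reference base, e.g. via pysam: lambda c, p: fasta.fetch(c, p - 1, p).
# Alleles are assumed to be non-empty, VCF-style strings.
def normalise(chrom, pos, ref, alt, get_base):
    ref, alt = ref.upper(), alt.upper()
    # right-trim shared trailing bases, extending left with reference
    # sequence whenever an allele would otherwise become empty
    while ref[-1] == alt[-1]:
        if len(ref) == 1 or len(alt) == 1:
            pos -= 1
            base = get_base(chrom, pos).upper()
            ref, alt = base + ref, base + alt
        else:
            ref, alt = ref[:-1], alt[:-1]
    # left-trim shared leading bases while both alleles keep >= 2 bases
    while len(ref) >= 2 and len(alt) >= 2 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return chrom, pos, ref, alt

# Toy reference for chr6:99341890-99341910 (see the next section)
SEQ = "AAATTAAAAATGATAACAAAA"
get_base = lambda chrom, pos: SEQ[pos - 99341890]

print(normalise("6", 99341899, "ATGAT", "ATGATGAT", get_base))  # ('6', 99341898, 'A', 'AATG')
print(normalise("6", 99341899, "ATGAT", "AT", get_base))        # ('6', 99341898, 'AATG', 'A')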

Normalisation and multi-allelic sites#

The goal of normalising alleles in a format such as VCF is to make variants compatible across different samples. This is detailed in Tan et al. It mainly applies to INDELs but can also apply to balanced variation that is > length 1. It is worth considering this in the context of GWAS. A GWAS tests the association of one allele of a bi-allelic set of alleles. In resources such as dbSNP, variants are given reference SNP identifiers (rsIDs) and INDELs can often have > 1 potential allele. A GWAS might test bi-allelic combinations of these multi-allelic variants. Depending on the software, the allele that is tested (the effect allele) might be the alternate allele or the minor allele; the two are not necessarily the same. In addition, there may be post-GWAS processing of results that may also flip alleles to be in a certain direction of effect. This can present issues with certain multi-allelic INDEL sites, as illustrated below.

Consider the variant rs16635. In dbSNP this has the coordinates chr6:99341899 (1-based) and the alleles ATGAT/ATGATGAT/AT, with ATGAT being the reference and ATGATGAT and AT being two possible alternate alleles. Breaking it down, at rs16635 an insertion of GAT has been observed, as well as a deletion of GAT; therefore rs16635 is both an insertion and a deletion site. If we normalise this multi-allelic site into VCF format, it becomes chr6:99341898 with the alleles AATG, AATGATG, A, i.e. either an insertion of ATG or a deletion of ATG. The process of normalisation left aligns the site.

To put this in context, let's look at the site along with its flanking sequence. This is queried from a GRCh38 human genome reference sequence from the NIH, so NC_000006.12 is chromosome 6. The bases are lowercase as they are in a soft-masked repeat region (hard masked would be N):

$ samtools faidx nih_GRCh38_latest_genomic.fna.gz "NC_000006.12:99341890-99341910"
>NC_000006.12:99341890-99341910
aaattaaaaatgataacaaaa
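The same region can also be pulled out with pysam if you prefer to stay in Python; note that fetch() takes 0-based, half-open coordinates, unlike the 1-based samtools region string, and the file must have a faidx index as above:

# Equivalent query with pysam; fetch() uses 0-based, half-open coordinates.
import pysam

fasta = pysam.FastaFile("nih_GRCh38_latest_genomic.fna.gz")
print(fasta.fetch("NC_000006.12", 99341889, 99341910))  # aaattaaaaatgataacaaaa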

Now we can flag our two representations:

sequence: AAATTAAAAATGATAACAAAA
dbSNP:             -----
VCF:              ----

If we add in the two alternate alleles from both representations, first dbSNP:

sequence: AAATTAAAAATGAT***AACAAAA
dbSNP:             -----
dbSNP-ALT1:             GAT
dbSNP-ALT2:        --

Now VCF:

sequence: AAATTAAAAATG***ATAACAAAA
VCF:              ----
VCF-ALT1              ATG
VCF-ALT2          -

We can see that they are shifted. However, if we represent both ALT sequences as a whole, we can see that they are the same. The first ALT variant:

dbSNP-ALT1: AAATTAAAAATGATGATAACAAAA
VCF-ALT1:   AAATTAAAAATGATGATAACAAAA

The second ALT variant:

dbSNP-ALT2: AAATTAAAAATAACAAAA
VCF-ALT2:   AAATTAAAAATAACAAAA

Now suppose we converted the dbSNP site to two bi-allelic sites, so that would give:

chr6:99341899 - ATGAT/ATGATGAT
chr6:99341899 - ATGAT/AT

And then normalised to VCF format:

chr6:99341898 - A/AATG
chr6:99341898 - AATG/A

In the bi-allelic form, we now have mirror-image alleles. Both are correct, but this can be confusing. From a GWAS perspective, it is particularly problematic if we are working with a dataset where it is unclear which allele has been tested and what has happened to the data post-GWAS. This insertion/deletion example is a specific case of a general issue with bi-allelic sites in GWAS where the ALT allele aligns perfectly with the reference genome. In short, I would be careful with these sites, unless you know the provenance of the data. Of course, if the allele frequencies are known, it may be possible to resolve this. In future, the mapper will flag bi-allelic sites where the ALT allele aligns with the reference genome.