INDEL normalisation#
Insertions and deletions can be particularly problematic to validate because the same variant site can be represented in more than one way. Methods have been proposed to normalise an INDEL site; these are implemented in the variant mappers and are also available directly via the API. This short example demonstrates how to look up sequence in the reference genome assembly and perform variant normalisation using the Ensembl REST API. A similar interface is available for locally available indexed genome assemblies in .faa format.
[11]:
# Normalisation API
from gwas_norm.variants import norm
# Interaction with Ensembl
from ensembl_rest_client import client
import sys
Create the Ensembl REST client for interacting with the REST API and use it to create a normalisation object. The REST client will default to GRCh38.
[4]:
rc = client.Rest()
[5]:
n = norm.EnsemblRefNorm(rc)
Searching for DNA sequence#
Using the norm.EnsemblRefNorm (and also the norm.RefNorm) object, you can search for sections of DNA sequence by supplying the chromosome (treated as a string) and the start/end positions in base pairs (ints). Some examples are shown below.
[6]:
n.search_assembly('1', 1000000, 1000015)
[6]:
'GGTGGAGCGCGCCGCC'
[7]:
n.search_assembly('22', 12345678, 12345681)
[7]:
'GAAT'
[8]:
n.search_assembly('5', 138763326, 138763332)
[8]:
'TACATGC'
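The lookups above appear to use 1-based coordinates that are inclusive at both ends: the first query spans 1000000 to 1000015 and returns 16 bases. A minimal sketch of the equivalent lookup on an in-memory string follows. The names `fetch_region` and `CHR1_WINDOW` are hypothetical and are not part of gwas_norm; the sequence is copied from the first output above.

```python
def fetch_region(window, window_start, start, end):
    """Return bases start..end (1-based, inclusive) from a cached window.

    `window` holds reference bases beginning at genomic position
    `window_start`. Hypothetical helper, not the gwas_norm API.
    """
    if start < window_start or end < start:
        raise ValueError("region starts before the window or has end < start")
    lo = start - window_start
    hi = end - window_start + 1  # +1 because the end coordinate is inclusive
    if hi > len(window):
        raise ValueError("region extends past the end of the window")
    return window[lo:hi]


# Bases 1:1000000-1000015, as returned by the first lookup above
CHR1_WINDOW = "GGTGGAGCGCGCCGCC"

print(fetch_region(CHR1_WINDOW, 1000000, 1000000, 1000015))
print(fetch_region(CHR1_WINDOW, 1000000, 1000004, 1000006))  # GAG
```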
Normalising alleles#
Looking up DNA sequences is useful if we want to validate a reference allele against the genome, for example, after lifting over. However, the main purpose of the gwas_norm.variants.norm module is to normalise alleles. The values returned by normalise_alleles are:
The chromosome
The start position of the normalised reference allele
The normalised reference allele
The normalised alternate allele
A Boolean indicating if normalisation has taken place.
Some examples are shown below. We start with cases where no normalisation takes place because the alleles are already normalised.
[16]:
# Deletion
n.normalise_alleles('1', 1000008, 'GCGCCGC', 'G')
[16]:
('1', 1000008, 'GCGCCGC', 'G', False)
G/T is a single base pair balanced polymorphism. These are never normalised, but the reference allele is still checked against the reference assembly. The second example below illustrates the error raised when it does not match.
[9]:
n.normalise_alleles('1', 1000000, 'G', 'T')
[9]:
('1', 1000000, 'G', 'T', False)
[15]:
try:
    n.normalise_alleles('1', 1000000, 'T', 'G')
except KeyError as e:
    # Neater for notebooks than the full stacktrace
    print(e.args[0], file=sys.stderr)
REF allele not in reference assembly
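The reference allele check can be illustrated offline with a short sketch. The helper `check_ref_allele` below is hypothetical and only mimics the behaviour seen above (a KeyError carrying the same message); it is not the gwas_norm implementation, and it compares against a small in-memory window instead of querying Ensembl.

```python
def check_ref_allele(window, window_start, pos, ref):
    """Raise KeyError if `ref` is not found at `pos` in the window.

    Hypothetical helper: the exception type and message are copied from
    the notebook output above, everything else is an assumption.
    """
    lo = pos - window_start
    if lo < 0 or lo + len(ref) > len(window):
        raise ValueError("allele lies outside the reference window")
    if window[lo:lo + len(ref)] != ref:
        raise KeyError("REF allele not in reference assembly")
    return True


# Bases 1:1000000-1000015 from the lookup earlier in this example
CHR1_WINDOW = "GGTGGAGCGCGCCGCC"

print(check_ref_allele(CHR1_WINDOW, 1000000, 1000000, "G"))  # True
try:
    check_ref_allele(CHR1_WINDOW, 1000000, 1000000, "T")
except KeyError as e:
    print(e.args[0])
```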
There is a 3 base pair balanced polymorphism in the alleles (GAG/TCA), so normalisation should extract it and adjust the start position.
[9]:
# 1000000:GGTGGAGCGCGCCGCC:1000015
n.normalise_alleles('1', 1000000, 'GGTGGAGCGCG', 'GGTGTCACGCG')
[9]:
('1', 1000004, 'GAG', 'TCA', True)
This should left align by 1 base pair to give a deletion of GCGCCGC/G.
[11]:
# 1000000:GGTGGAGCGCGCCGCC:1000015
n.normalise_alleles('1', 1000009, 'CGCCGCC', 'C')
[11]:
('1', 1000008, 'GCGCCGC', 'G', True)
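To see why the input and output of this example describe the same variant, both representations can be applied to the reference window and the resulting sequences compared. This is a sketch on an in-memory string; `apply_variant` is a hypothetical helper, not part of gwas_norm.

```python
def apply_variant(window, window_start, pos, ref, alt):
    """Return the window sequence with REF at `pos` replaced by ALT."""
    lo = pos - window_start
    if lo < 0 or window[lo:lo + len(ref)] != ref:
        raise ValueError("REF allele does not match the window")
    return window[:lo] + alt + window[lo + len(ref):]


# Bases 1:1000000-1000015 from the lookup earlier in this example
CHR1_WINDOW = "GGTGGAGCGCGCCGCC"

as_supplied = apply_variant(CHR1_WINDOW, 1000000, 1000009, "CGCCGCC", "C")
normalised = apply_variant(CHR1_WINDOW, 1000000, 1000008, "GCGCCGC", "G")

# Both representations yield the same altered sequence
print(as_supplied, normalised, as_supplied == normalised)
```

Because the two representations produce identical sequence, comparing variants by position and alleles alone would wrongly treat them as different sites unless both are normalised first.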
The next two examples should each shift the start position right by 1 bp, because the shared leading base is trimmed.
[17]:
# 1000000:ACACTCTAATTTTGTA:1000015
n.normalise_alleles('10', 1000011, 'TTGTA', 'TT')
[17]:
('10', 1000012, 'TGTA', 'T', True)
[22]:
# 1000000:ACACTCTAATTTTGTA:1000015
n.normalise_alleles('10', 1000011, 'TT', 'TTG')
[22]:
('10', 1000012, 'T', 'TG', True)
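The results above are consistent with the commonly described trim-and-left-align procedure: drop shared trailing bases, extend to the left whenever an allele becomes empty, then drop shared leading bases while both alleles keep at least one base. The sketch below applies that procedure to in-memory windows. It is an illustration under those assumptions and not necessarily how gwas_norm implements normalise_alleles; `normalise` and the window constants are hypothetical names, and the sketch does not check REF against the reference.

```python
def normalise(window, window_start, pos, ref, alt):
    """Trim and left-align one REF/ALT pair against a reference window.

    Returns (pos, ref, alt, changed). Empty strings stand for empty alleles.
    """
    if ref == alt:
        raise ValueError("REF and ALT are identical")
    original = (pos, ref, alt)

    def base_at(position):
        index = position - window_start
        if index < 0 or index >= len(window):
            raise IndexError("position is outside the reference window")
        return window[index]

    changed = True
    while changed:
        changed = False
        # Drop a shared right-most base
        if ref and alt and ref[-1] == alt[-1]:
            ref, alt = ref[:-1], alt[:-1]
            changed = True
        # An empty allele needs an anchor base taken from the left
        if not ref or not alt:
            pos -= 1
            left = base_at(pos)
            ref, alt = left + ref, left + alt
            changed = True

    # Drop shared left-most bases while both alleles keep at least one base
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1

    return pos, ref, alt, (pos, ref, alt) != original


# Reference windows copied from the comments in the cells above
CHR1_WINDOW = "GGTGGAGCGCGCCGCC"   # 1:1000000-1000015
CHR10_WINDOW = "ACACTCTAATTTTGTA"  # 10:1000000-1000015

print(normalise(CHR1_WINDOW, 1000000, 1000009, "CGCCGCC", "C"))
print(normalise(CHR1_WINDOW, 1000000, 1000000, "GGTGGAGCGCG", "GGTGTCACGCG"))
print(normalise(CHR10_WINDOW, 1000000, 1000011, "TTGTA", "TT"))
```

On these inputs the sketch reproduces the positions and alleles shown in the outputs above.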
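Normalisation can also be used to de-Ensemblise deletions, i.e. to convert an Ensembl-style deletion that uses '-' for the missing allele into an anchored REF/ALT pair.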
[18]:
n.normalise_alleles('1', 1000009, 'CGCCGC', '-')
[18]:
('1', 1000008, 'GCGCCGC', 'G', True)
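The anchoring step in this conversion can be sketched as follows: the base immediately to the left of the deletion is prepended to both alleles and the start position moves back by one. `anchor_deletion` is a hypothetical helper working on an in-memory window and is not the gwas_norm implementation. It only anchors the deletion; any further left alignment would be a separate step.

```python
def anchor_deletion(window, window_start, pos, ref, alt):
    """Convert an Ensembl-style deletion (ALT == '-') to an anchored pair."""
    if alt != "-":
        # Not an Ensembl-style deletion, return unchanged
        return pos, ref, alt
    index = pos - 1 - window_start
    if index < 0:
        raise IndexError("anchor base is outside the reference window")
    anchor = window[index]  # base immediately left of the deleted bases
    return pos - 1, anchor + ref, anchor


# Bases 1:1000000-1000015 from the lookup earlier in this example
CHR1_WINDOW = "GGTGGAGCGCGCCGCC"

print(anchor_deletion(CHR1_WINDOW, 1000000, 1000009, "CGCCGC", "-"))
# (1000008, 'GCGCCGC', 'G')
```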
However, insertions need some work (this is currently an open issue).
[20]:
try:
    n.normalise_alleles('1', 1000009, '-', 'G')
except Exception as e:
    print(e.args[0], file=sys.stderr)
400 Client Error: Bad Request for url: https://rest.ensembl.org/sequence/region/human/1:1000009..1000008:1?slice_length=10000000.0&format=plain
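The region in the failing URL, 1000009..1000008, has an end coordinate one less than its start. One plausible reading, inferred only from that URL and not from the gwas_norm source, is that the inclusive end is derived from the reference allele length, so a '-' reference allele of zero length gives end = start - 1. The hypothetical helper `region_for_allele` below reproduces that arithmetic.

```python
def region_for_allele(chrom, start, ref):
    """Build a 'chrom:start..end' region spanning the reference allele.

    Hypothetical helper: treats '-' as a zero-length allele, which is an
    assumption made to illustrate the coordinates in the error above.
    """
    length = 0 if ref == "-" else len(ref)
    end = start + length - 1  # inclusive end coordinate
    return f"{chrom}:{start}..{end}"


print(region_for_allele("1", 1000009, "CGCCGC"))  # 1:1000009..1000014
print(region_for_allele("1", 1000009, "-"))       # 1:1000009..1000008
```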
Summary#
The example above shows a very simple interface to variant normalisation. It can be used via Ensembl with norm.EnsemblRefNorm or with a locally available reference sequence with norm.RefNorm. The interface between the two is the same.