Building the mapping files#

Introduction#

The mapping files are VCF files of coordinate data, functional annotations and rsIDs that can be built and then used to assign the latest rsIDs to GWAS variants, along with allele frequencies from a variety of population groups. They incorporate data from multiple available datasets. From end to end this is a manual process, although the component parts have various scripts associated with them. The whole process is documented here and the individual scripts used can be found in bash mapping scripts and python mapping scripts.

Version information#

The gwas-norm mapping files have the structure gwas_norm.v<version>.<optional:rare|common>.biallelic.vep.<genome_assembly:b36|b37|b38>.vcf.gz, where the version number is the date in YYYYMMDD form so that it will sort correctly. The data sources and basic statistics of the built versions are detailed below. The partition into common or rare is based on having a MAF >= 0.01 or a MAC >= 50. The only exception is v20210208, where the MAC threshold was >= 10.
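
For example, the common and rare GRCh37 mapping files for version v20220402 (built at the end of this document) are named:

gwas_norm.v20220402.common.biallelic.vep.b37.vcf.gz
gwas_norm.v20220402.rare.biallelic.vep.b37.vcf.gz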

Number of bi-allelic sites#

| Version | Genome assembly | #sites (all) | #sites (common) | #sites (rare) |
|---|---|---|---|---|
| v20210208 | GRCh37 | 1,161,556,872 | 99,419,831 (MAC>=10) | 1,062,137,041 (MAC<10) |
| v20210208 | GRCh38 | 1,199,931,094 | 101,770,179 (MAC>=10) | 1,098,160,915 (MAC<10) |
| v20220402 | GRCh37 | 1,215,008,330 | 81,947,534 | 1,133,060,793 |
| v20220402 | GRCh38 | 1,243,710,237 | TBC… | TBC… |

The data sources in each version are listed below:

Data sources#

| Data source | v20210208 (GRCh37) | v20210208 (GRCh38) | v20220402 (GRCh37) | v20220402 (GRCh38) |
|---|---|---|---|---|
| dbSNP | v154 | v154 | v155 | v155 |
| CADD | v1.6dev, all SNVs + gnomad2.1.1 indels | v1.6dev, all SNVs + gnomad3.0 indels | v1.6dev, all SNVs + gnomad2.1.1 indels | v1.6dev, all SNVs + gnomad3.0 indels |
| VEP | v102 | v102 | v105 | v105 |
| ALFA | 06/01/2021 (GRCh38->GRCh37 liftover) | 06/01/2021 | 06/01/2021 (GRCh38->GRCh37 liftover) | 06/01/2021 |
| Gnomad2 Exomes | – | – | v2.1.1 | v2.1.1 (GRCh37->GRCh38 liftover) |
| Gnomad3 | v3.1 (GRCh38->GRCh37 liftover) | v3.1 | v3.1 (GRCh38->GRCh37 liftover) | v3.1 |
| 1000genomes | v5a (02/05/2013) | v5a (05/04/2017) (GRCh37->GRCh38 liftover) | v5a (02/05/2013) | v5a (05/04/2017) (GRCh37->GRCh38 liftover) |
| UK BioBank | v3 imputed GBR unrelated QC | v3 imputed GBR unrelated QC (GRCh37->GRCh38 liftover) | v3 imputed GBR unrelated QC | v3 imputed GBR unrelated QC (GRCh37->GRCh38 liftover) |

The populations in the mapping file#

Below is a summary of each of the population groups in the mapping file.

Population groups#

| Column Name | Source | Description |
|---|---|---|
| UKBB_EUR | UK BioBank | The white GBR subset from UK BioBank imputed genotype files based on unrelateds |
| ALFA_EUR | ALFA 01/21 release | European |
| ALFA_AFO | ALFA 01/21 release | Individuals with African ancestry |
| ALFA_EAS | ALFA 01/21 release | East Asian (95%) |
| ALFA_AFA | ALFA 01/21 release | African American |
| ALFA_LAC | ALFA 01/21 release | Latin American individuals with Afro-Caribbean ancestry |
| ALFA_LEN | ALFA 01/21 release | Latin American individuals with mostly European and Native American ancestry |
| ALFA_OAS | ALFA 01/21 release | Asian individuals excluding South or East Asian |
| ALFA_SAS | ALFA 01/21 release | South Asian |
| ALFA_OTR | ALFA 01/21 release | The self-reported population is inconsistent with the GRAF-assigned population |
| ALFA_AFR | ALFA 01/21 release | All Africans |
| ALFA_ASN | ALFA 01/21 release | All Asian individuals (EAS and OAS) excluding South Asian (SAS) |
| ALFA_TOT | ALFA 01/21 release | Total (~global) across all populations |
| GNOMAD31_AFR | Gnomad 3.1 | African/African-American |
| GNOMAD31_AMI | Gnomad 3.1 | Amish |
| GNOMAD31_AMR | Gnomad 3.1 | Latino/Admixed American |
| GNOMAD31_ASJ | Gnomad 3.1 | Ashkenazi Jewish |
| GNOMAD31_EAS | Gnomad 3.1 | Japanese (jpn), Korean (kor), Other East Asian (oea) |
| GNOMAD31_FIN | Gnomad 3.1 | Finnish |
| GNOMAD31_MID | Gnomad 3.1 | Middle Eastern |
| GNOMAD31_NFE | Gnomad 3.1 | Non-Finnish Europeans. Bulgarian (bgr), Estonian (est), North-western European (nwe), Other non-Finnish European (onf), Southern European (seu), Swedish (swe) |
| GNOMAD31_OTH | Gnomad 3.1 | Other |
| GNOMAD31_RAW | Gnomad 3.1 | All? |
| GNOMAD31_SAS | Gnomad 3.1 | South Asian |
| GNOMAD2EX_AFR | Gnomad 2.1.1 exomes | African/African-American |
| GNOMAD2EX_AMR | Gnomad 2.1.1 exomes | Latino/Admixed American |
| GNOMAD2EX_ASJ | Gnomad 2.1.1 exomes | Ashkenazi Jewish |
| GNOMAD2EX_EAS | Gnomad 2.1.1 exomes | Japanese (jpn), Korean (kor), Other East Asian (oea) |
| GNOMAD2EX_FIN | Gnomad 2.1.1 exomes | Finnish |
| GNOMAD2EX_NFE | Gnomad 2.1.1 exomes | Non-Finnish Europeans. Bulgarian (bgr), Estonian (est), North-western European (nwe), Other non-Finnish European (onf), Southern European (seu), Swedish (swe) |
| GNOMAD2EX_OTH | Gnomad 2.1.1 exomes | Other |
| GNOMAD2EX_RAW | Gnomad 2.1.1 exomes | All? |
| GNOMAD2EX_SAS | Gnomad 2.1.1 exomes | South Asian |
| 1KG_AFR | 1000 genomes phase 3 | All African populations |
| 1KG_AMR | 1000 genomes phase 3 | Ad Mixed American |
| 1KG_EAS | 1000 genomes phase 3 | East Asian |
| 1KG_EUR | 1000 genomes phase 3 | All European populations |
| 1KG_SAS | 1000 genomes phase 3 | South Asian |
| 1KG_ACB | 1000 genomes phase 3 | African Caribbeans in Barbados |
| 1KG_ASW | 1000 genomes phase 3 | Americans of African Ancestry in SW USA |
| 1KG_BEB | 1000 genomes phase 3 | Bengali from Bangladesh |
| 1KG_CDX | 1000 genomes phase 3 | Chinese Dai in Xishuangbanna, China |
| 1KG_CEU | 1000 genomes phase 3 | Utah Residents (CEPH) with Northern and Western European Ancestry |
| 1KG_CHB | 1000 genomes phase 3 | Han Chinese in Beijing, China |
| 1KG_CHS | 1000 genomes phase 3 | Southern Han Chinese |
| 1KG_CLM | 1000 genomes phase 3 | Colombians from Medellin, Colombia |
| 1KG_ESN | 1000 genomes phase 3 | Esan in Nigeria |
| 1KG_FIN | 1000 genomes phase 3 | Finnish in Finland |
| 1KG_GBR | 1000 genomes phase 3 | British in England and Scotland |
| 1KG_GIH | 1000 genomes phase 3 | Gujarati Indian from Houston, Texas |
| 1KG_GWD | 1000 genomes phase 3 | Gambian in Western Divisions in the Gambia |
| 1KG_IBS | 1000 genomes phase 3 | Iberian Population in Spain |
| 1KG_ITU | 1000 genomes phase 3 | Indian Telugu from the UK |
| 1KG_JPT | 1000 genomes phase 3 | Japanese in Tokyo, Japan |
| 1KG_KHV | 1000 genomes phase 3 | Kinh in Ho Chi Minh City, Vietnam |
| 1KG_LWK | 1000 genomes phase 3 | Luhya in Webuye, Kenya |
| 1KG_MSL | 1000 genomes phase 3 | Mende in Sierra Leone |
| 1KG_MXL | 1000 genomes phase 3 | Mexican Ancestry from Los Angeles USA |
| 1KG_PEL | 1000 genomes phase 3 | Peruvians from Lima, Peru |
| 1KG_PJL | 1000 genomes phase 3 | Punjabi from Lahore, Pakistan |
| 1KG_PUR | 1000 genomes phase 3 | Puerto Ricans from Puerto Rico |
| 1KG_STU | 1000 genomes phase 3 | Sri Lankan Tamil from the UK |
| 1KG_TSI | 1000 genomes phase 3 | Toscani in Italia |
| 1KG_YRI | 1000 genomes phase 3 | Yoruba in Ibadan, Nigeria |

The mapping VCF file format#

The objective of the mapping file is to provide pre-calculated annotations for all known human variants. The annotations currently used in gwas-norm are:

  • rsID

  • Ensembl VEP prediction

  • SIFT score (where appropriate)

  • PolyPhen score (where appropriate)

  • CADD score (where available)

  • Allele counts from various populations (where available)

Ideally, a mapping file containing these annotations should be available for each genome assembly that you want to lift over to. If a mapping file is not available for any genome assembly that you are lifting to, then the mapping step is skipped. Equally, if any of the annotations listed above are not available for a variant in the mapping file, then they are omitted from the final normalised GWAS file (but the variant is still output). In essence, the pipeline is designed not to fail if there are missing annotation data; rather, appropriate flags are added to the row in the normalised GWAS file and it is then up to the user whether they want to exclude variants at analysis time, based on the presence or absence of flags.

Currently, the process below generates fully annotated mapping files for GRCh38 (referred to as b38 in the file names below), GRCh37 (b37) and a mapping file with only rsID and allele counts for NCBI36 (b36).

The mapping file is referred to as a “counts VCF file”. This is because it has some differences with a standard “sites” VCF file, namely the allele count information is placed in the sample field having the format AN:AC for each of the populations represented at the site, where AN is the total number of alleles observed at that site and AC is the number of each ALT allele observed at the site for the population. So in theory a site with two ALT alleles could have an AN:AC field that looks like 1284737:12,1, where 12 is the allele count for the first ALT allele and 1 is the allele count for the second ALT allele. In practice, all the sites in the mapping file are represented in bi-allelic form (see below), so this would be represented as two lines 1284737:12 and 1284737:1. The IDs for each of the populations occupy the sample fields after the FORMAT field and have the broad structure COHORT_POPULATION (see below). Any missing allele count is represented as .:., i.e. the allele number (AN) is missing (.) and the allele count (AC) is missing. This representation of the allele counts is similar to that used by the ALFA project.
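
As an illustration, the per-population counts can be pulled out of a mapping file with bcftools query. This is just a sketch; the file name is one of the final GRCh38 mapping files built below and the three populations are examples from the table above:

$ bcftools query -s ALFA_EUR,GNOMAD31_NFE,1KG_EUR \
    -f '%CHROM\t%POS\t%ID\t%REF\t%ALT[\t%AN:%AC]\n' \
    gwas_norm.v20220402.biallelic.vep.b38.vcf.gz | head

Missing counts come out as .:., matching the representation described above.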

Other than the AN:AC representation, there are two absolute requirements for the mapping file.

  1. All the variant sites should be represented in bi-allelic form. This makes the mapping process a bit easier.

  2. The chromosomes in the mapping file should always be represented in string sort order and not in the biological order. i.e. it might be natural to represent human chromosomes in the order: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, X, Y, MT . However, they should be represented in string sort order: 1, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 2, 20, 21, 22, 3, 4, 5, 6, 7, 8, 9, MT, X, Y . This can be achieved by stripping the VCF header and performing:

    zcat mapping_file.vcf.gz | sort -t$'\t' -k1,1 -k2n,2n | bgzip > sorted.vcf.gz
    

    However, it is probably easier to generate a separate VCF header with the contigs in string sort order and then use bcftools reheader/bcftools sort; a sketch of this is shown below. If the build procedure in this document is followed then you do not need to worry about this.
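
As a rough sketch of that reheader/sort approach (the file names here are illustrative, and it relies on bcftools sort following the contig order given in the header):

$ bcftools view -h mapping_file.vcf.gz | grep -v -e '^##contig' -e '^#CHROM' > new_header.txt
$ bcftools view -h mapping_file.vcf.gz | grep '^##contig' | sort >> new_header.txt
$ bcftools view -h mapping_file.vcf.gz | grep '^#CHROM' >> new_header.txt
$ bcftools reheader -h new_header.txt -o reheadered.vcf.gz mapping_file.vcf.gz
$ bcftools sort -T sort_tmp/ -O z -o sorted.vcf.gz reheadered.vcf.gz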

The build procedure outlined below will split all multi-allelic sites into multiple bi-allelic sites (bcftools norm -m"-any"), validate all the reference alleles against the respective reference genome and drop any sites that do not validate. It also normalises INDELs to a standard format (bcftools norm -c x -f "<PATH TO REF GENOME>"). In the final mapping files, only sites that overlap with dbSNP will have an rsID assigned. If a site had an rsID assigned in one of the non-dbSNP sources but does not overlap with a dbSNP site, then the rsID from the non-dbSNP source is set to missing (.). Therefore, in the final normalised GWAS files, the user can be sure that rsIDs are assigned only from dbSNP sites.
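
In bcftools terms, the normalisation applied during the build boils down to something like this (a sketch; the input and output file names are illustrative):

$ bcftools norm -m'-any' -c x -f "<PATH TO REF GENOME>" -O z -o normalised.vcf.gz input.vcf.gz

where -m'-any' splits multi-allelic sites into bi-allelic records and -c x drops records whose REF does not match the reference genome.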

Using the procedure below, the VEP fields are slightly truncated to save some space and only output the information used by the pipeline. The VEP string in the INFO field is: "Allele, Consequence, Gene, Feature_type, Feature, SIFT, PolyPhen", although the pipeline only uses Consequence, SIFT and PolyPhen. The structure of the VEP string is checked prior to the mapping annotation taking place. VEP strings are delimited by pipes |.

CADD annotations are not added using VEP (as it is far too slow); instead an additional script is used for this. This adds a CADD annotation to the INFO field, with two entries: CADD_PHRED, CADD_RAW. CADD strings are delimited by pipes |.
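
For orientation, the underlying lookup is against the tabix-indexed CADD files downloaded later (e.g. whole_genome_SNVs.tsv.gz, assuming its .tbi index is alongside); the query below is only an illustration, not the annotation script itself. The CADD files have the columns chrom, pos, ref, alt, raw score and PHRED score:

$ tabix whole_genome_SNVs.tsv.gz 1:10001-10001 | awk '$3 == "T" && $4 == "A" { print $5, $6 }'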

An excerpt from a final mapping file is shown below:

##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##VEP="v102" time="2021-01-24 20:35:18" cache="/lustre/projects/DTAdb/resources/vep/homo_sapiens/102_GRCh38" ensembl-funcgen=102.6bd93a0 ensembl-variation=102.2716d2e ensembl=102.347f9ed ensembl-io=102.ff1cf96 1000genomes="phase3" COSMIC="91" ClinVar="202006" ESP="V2-SSA137" HGMD-PUBLIC="20194" assembly="GRCh38.p13" dbSNP="153" gencode="GENCODE 36" genebuild="2014-07" gnomAD="r2.1" polyphen="2.2.2" regbuild="1.0" sift="sift5.2.2"
##contig=<ID=1,length=248956422>
##contig=<ID=10,length=133797422>
##contig=<ID=11,length=135086622>
##contig=<ID=12,length=133275309>
##contig=<ID=13,length=114364328>
##contig=<ID=14,length=107043718>
##contig=<ID=15,length=101991189>
##contig=<ID=16,length=90338345>
##contig=<ID=17,length=83257441>
##contig=<ID=18,length=80373285>
##contig=<ID=19,length=58617616>
##contig=<ID=2,length=242193529>
##contig=<ID=20,length=64444167>
##contig=<ID=21,length=46709983>
##contig=<ID=22,length=50818468>
##contig=<ID=3,length=198295559>
##contig=<ID=4,length=190214555>
##contig=<ID=5,length=181538259>
##contig=<ID=6,length=170805979>
##contig=<ID=7,length=159345973>
##contig=<ID=8,length=145138636>
##contig=<ID=9,length=138394717>
##contig=<ID=MT,length=16569>
##contig=<ID=X,length=156040895>
##contig=<ID=Y,length=57227415>
##INFO=<ID=dbSNPBuildID,Number=1,Type=Integer,Description="First dbSNP Build for RS">
##INFO=<ID=SAO,Number=1,Type=Integer,Description="Variant Allele Origin: 0 - unspecified, 1 - Germline, 2 - Somatic, 3 - Both">
##INFO=<ID=SSR,Number=1,Type=Integer,Description="Variant Suspect Reason Codes (may be more than one value added together) 0 - unspecified, 1 - Paralog, 2 - byEST, 4 - oldAlign, 8 - Para_EST, 16 - 1kg_failed, 1024 - other">
##INFO=<ID=CLNHGVS,Number=.,Type=String,Description="Variant names from HGVS.    The order of these variants corresponds to the order of the info in the other clinical  INFO tags.">
##INFO=<ID=CLNVI,Number=.,Type=String,Description="Variant Identifiers provided and maintained by organizations outside of NCBI, such as OMIM.  Source and id separated by colon (:).  Each identifier is separated by a vertical bar (|)">
##INFO=<ID=CLNORIGIN,Number=.,Type=String,Description="Allele Origin. One or more of the following values may be summed: 0 - unknown; 1 - germline; 2 - somatic; 4 - inherited; 8 - paternal; 16 - maternal; 32 - de-novo; 64 - biparental; 128 - uniparental; 256 - not-tested; 512 - tested-inconclusive; 1073741824 - other">
##INFO=<ID=CLNSIG,Number=.,Type=String,Description="Variant Clinical Significance, 0 - Uncertain significance, 1 - not provided, 2 - Benign, 3 - Likely benign, 4 - Likely pathogenic, 5 - Pathogenic, 6 - drug response, 8 - confers sensitivity, 9 - risk-factor, 10 - association, 11 - protective, 12 - conflict, 13 - affects, 255 - other">
##INFO=<ID=CLNDISDB,Number=.,Type=String,Description="Variant disease database name and ID, separated by colon (:)">
##INFO=<ID=CLNDN,Number=.,Type=String,Description="Preferred ClinVar disease name">
##INFO=<ID=CLNREVSTAT,Number=.,Type=String,Description="ClinVar Review Status: no_assertion - No asserition provided by submitter, no_criteria - No assertion criteria provided by submitter, single - Classified by single submitter, mult - Classified by multiple submitters, conf - Criteria provided conflicting interpretations, exp - Reviewed by expert panel, guideline - Practice guideline">
##INFO=<ID=CLNACC,Number=.,Type=String,Description="For each allele (comma delimited), this is a pipe-delimited list of the Clinvar RCV phenotype accession.version strings associated with that allele.">
##INFO=<ID=DS,Number=.,Type=String,Description="Datasets merged at the site">
##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence annotations from Ensembl VEP. Format: Allele|Consequence|Gene|Feature_type|Feature|SIFT|PolyPhen">
##INFO=<ID=CADD,Number=.,Type=String,Description="CADD Consequences. Format: CADD_PHRED,CADD_RAW">
##FORMAT=<ID=AN,Number=1,Type=Integer,Description="Total allele count for the population, including REF">
##FORMAT=<ID=AC,Number=A,Type=Integer,Description="Allele count for each ALT allele for the population">
##bcftools_viewVersion=1.11+htslib-1.11
##bcftools_viewCommand=view -h gwas_norm.biallelic.b38.1.1.10001-1377535.vcf.gz; Date=Mon Jan 25 10:00:19 2021
#CHROM       POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  UKBB_EUR        ALFA_EUR        ALFA_AFO        ALFA_EAS        ALFA_AFA        ALFA_LAC        ALFA_LEN        ALFA_OAS        ALFA_SAS        ALFA_OTR        ALFA_AFR        ALFA_ASN        ALFA_TOT        GNOMAD31_AFR    GNOMAD31_AMI    GNOMAD31_AMR    GNOMAD31_ASJ    GNOMAD31_EAS    GNOMAD31_FIN    GNOMAD31_MID    GNOMAD31_NFE    GNOMAD31_OTH    GNOMAD31_RAW    GNOMAD31_SAS    1KG_AFR 1KG_AMR 1KG_EAS 1KG_EUR 1KG_SAS 1KG_ACB 1KG_ASW 1KG_BEB 1KG_CDX 1KG_CEU 1KG_CHB 1KG_CHS 1KG_CLM 1KG_ESN 1KG_FIN 1KG_GBR 1KG_GIH 1KG_GWD 1KG_IBS 1KG_ITU 1KG_JPT 1KG_KHV 1KG_LWK 1KG_MSL 1KG_MXL 1KG_PEL 1KG_PJL 1KG_PUR 1KG_STU 1KG_TSI 1KG_YRI
1    10001   rs1570391677    T       A       .       .       dbSNPBuildID=154;SSR=0;DS=dbsnp;CSQ=A|upstream_gene_variant|ENSG00000223972|Transcript|ENST00000450305||,A|upstream_gene_variant|ENSG00000223972|Transcript|ENST00000456328||,A|downstream_gene_variant|ENSG00000227232|Transcript|ENST00000488147||;CADD=0.702541|8.478        AN:AC   .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.
1    10001   .       T       C       .       .       DS=alfa;CSQ=C|upstream_gene_variant|ENSG00000223972|Transcript|ENST00000450305||,C|upstream_gene_variant|ENSG00000223972|Transcript|ENST00000456328||,C|downstream_gene_variant|ENSG00000227232|Transcript|ENST00000488147||;CADD=0.750954|8.921        AN:AC   .:.     7618:0  108:0   84:0    2708:0  146:0   610:0   24:0    94:0    470:0   2816:0  108:0   11862:0 .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.
1    10002   rs1570391692    A       C       .       .       dbSNPBuildID=154;SSR=0;DS=dbsnp;CSQ=C|upstream_gene_variant|ENSG00000223972|Transcript|ENST00000450305||,C|upstream_gene_variant|ENSG00000223972|Transcript|ENST00000456328||,C|downstream_gene_variant|ENSG00000227232|Transcript|ENST00000488147||;CADD=0.713993|8.583        AN:AC   .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.
1    10003   rs1570391694    A       C       .       .       dbSNPBuildID=154;SSR=0;DS=dbsnp;CSQ=C|upstream_gene_variant|ENSG00000223972|Transcript|ENST00000450305||,C|upstream_gene_variant|ENSG00000223972|Transcript|ENST00000456328||,C|downstream_gene_variant|ENSG00000227232|Transcript|ENST00000488147||;CADD=0.714485|8.588        AN:AC   .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.
1    10007   .       T       C       .       .       DS=alfa;CSQ=C|upstream_gene_variant|ENSG00000223972|Transcript|ENST00000450305||,C|upstream_gene_variant|ENSG00000223972|Transcript|ENST00000456328||,C|downstream_gene_variant|ENSG00000227232|Transcript|ENST00000488147||;CADD=0.755890|8.967        AN:AC   .:.     7618:0  108:0   84:0    2708:0  146:0   610:0   24:0    94:0    470:0   2816:0  108:0   11862:0 .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.     .:.

Download reference genomes#

Many of the commands below will use bcftools norm to validate the reference allele against the reference genome and remove any variants where it does not validate. So, we will download the reference genomes for GRCh38, GRCh37 and NCBI36.

GRCh38:

wget ftp://ftp.ensembl.org/pub/release-102/fasta/homo_sapiens/dna_index/Homo_sapiens.GRCh38.dna.toplevel.fa.gz
wget ftp://ftp.ensembl.org/pub/release-102/fasta/homo_sapiens/dna_index/Homo_sapiens.GRCh38.dna.toplevel.fa.gz.fai
wget ftp://ftp.ensembl.org/pub/release-102/fasta/homo_sapiens/dna_index/Homo_sapiens.GRCh38.dna.toplevel.fa.gz.gzi

GRCh37, this was not indexed or bgzipped so I had to process it after download (this was also done for NCBI36):

wget ftp://ftp.ensembl.org/pub/grch37/current/fasta/homo_sapiens/dna/Homo_sapiens.GRCh37.dna.toplevel.fa.gz
mv Homo_sapiens.GRCh37.dna.toplevel.fa.gz hg.b37.fa.gz
zcat hg.b37.fa.gz | bgzip > Homo_sapiens.GRCh37.dna.toplevel.fa.gz
samtools faidx Homo_sapiens.GRCh37.dna.toplevel.fa.gz
rm hg.b37.fa.gz

Installing helper bash scripts for building the mapping files#

I hope to provide a download link for the mapping files. However, until then, they will need to be built using this procedure. It is fully documented and you do not have to use all the resources in the documentation. However, if you do follow the procedure then you will need to have the Python package installed as documented above and the bash scripts in ./resources/bin available in your PATH (where . is the root of the cloned git repository). These scripts also need the bash-helpers repo in your PATH, as well as shflags (which is a very nice BASH command line argument handler). Putting something in your PATH involves editing either your ~/.bashrc file or your ~/.bash_profile file (depending on what you use). For example, to add the bash scripts for building the mapping files you should add something like this:

PATH="/path/to/gwas-norm/resources/bin:${PATH}"
export PATH

where /path/to/ is where you cloned the repository. If you didn’t clone the repository and installed via conda, then it is easiest just to clone it (but not install it with pip) and just use the ./resources/bin scripts.

Mapping file data#

The process for incorporating various data resources into the mapping file is detailed below. Whilst you can follow it exactly and have a finished mapping file at the end, you might want to add your own resources. If that is the case, then please keep in mind the expected format of the mapping file and the requirements for sorting.

dbSNP#

The dbSNP content forms the backbone of the mapping file. It is provided in multiple formats, a JSON format and a VCF format. Currently only the VCF format is used but in future I will move to the JSON format as I think it might contain rsID synonyms (although I will have to research this a bit). The tabix indexes will not actually be used in the processing but will be downloaded for completeness.

To download all the files for GRCh38 version of dbSNP:

$ wget https://ftp.ncbi.nih.gov/snp/latest_release/VCF/GCF_000001405.39.gz
$ wget https://ftp.ncbi.nih.gov/snp/latest_release/VCF/GCF_000001405.39.gz.tbi
$ wget https://ftp.ncbi.nih.gov/snp/latest_release/VCF/GCF_000001405.39.gz.md5
$ wget https://ftp.ncbi.nih.gov/snp/latest_release/VCF/GCF_000001405.39.gz.tbi.md5

To download all the files for the GRCh37 version of dbSNP:

$ wget https://ftp.ncbi.nih.gov/snp/latest_release/VCF/GCF_000001405.25.gz
$ wget https://ftp.ncbi.nih.gov/snp/latest_release/VCF/GCF_000001405.25.gz.tbi
$ wget https://ftp.ncbi.nih.gov/snp/latest_release/VCF/GCF_000001405.25.gz.md5
$ wget https://ftp.ncbi.nih.gov/snp/latest_release/VCF/GCF_000001405.25.gz.tbi.md5

The chromosomes in the VCF file are expressed as NCBI assembly IDs, these will be processed into more familiar chr_name designations. For this, a conversion file will need to be downloaded (assembly mapper).

For GRCh38:

$ wget https://ftp.ncbi.nih.gov/genomes/refseq/vertebrate_mammalian/Homo_sapiens/latest_assembly_versions/GCF_000001405.39_GRCh38.p13/GCF_000001405.39_GRCh38.p13_assembly_report.txt

For GRCh37:

$ wget https://ftp.ncbi.nih.gov/genomes/refseq/vertebrate_mammalian/Homo_sapiens/all_assembly_versions/GCF_000001405.25_GRCh37.p13/GCF_000001405.25_GRCh37.p13_assembly_report.txt

There is a Python script for doing this. However, do not use this directly if you are using the process-dbsnp.sh bash wrapper script. See the documentation for more information.
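
For orientation, the RefSeq-accession to chromosome-name lookup that the conversion relies on can be pulled straight out of the assembly report (a sketch only; in these reports column 1 is the assigned sequence name, column 2 the sequence role and column 7 the RefSeq accession):

$ grep -v '^#' GCF_000001405.39_GRCh38.p13_assembly_report.txt \
    | awk -F'\t' '$2 == "assembled-molecule" { print $7 "\t" $1 }'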

For GRCh38, the following command was run:

$ process-dbsnp.sh --steps 1000000 -c -T ~/Scratch/tmp/ -v "GCF_000001405.39.gz" "GCF_000001405.39_GRCh38.p13_assembly_report.txt" ../../../../../human_genome/b38/Homo_sapiens.GRCh38.dna.toplevel.fa.gz "dbsnp.biallelic.v155.b38.vcf.gz"

For GRCh37, the following command was run:

$ process-dbsnp.sh --steps 1000000 -c -T ~/Scratch/tmp/ -v GCF_000001405.25.gz  GCF_000001405.25_GRCh37.p13_assembly_report.txt ../../../../../human_genome/b37/Homo_sapiens.GRCh37.dna.toplevel.fa.gz dbsnp.biallelic.v155.b37.vcf.gz

On the HPC (grid-engine) these were run with:

$ qsub -cwd -S "/bin/bash" -N "<NAME>" -l h_rt="10:00:00" -l mem="8G" -pe smp 2 -o ~/Scratch/' $JOB_NAME'.out -e ~/Scratch/'$JOB_NAME'.err <PROCESS COMMAND>

Where <NAME> was either dbsnp155_b38 or dbsnp155_b37 respectively and <PROCESS COMMAND> was one of the two commands above (but with the full path to process-dbsnp.sh).

Currently process-dbsnp.sh does not put the VCF file in the correct sort order, therefore, the output VCF files need to be sorted such that the chromosomes are in a string sort order. There is another bash script for doing this called vcf-sort.sh. For GRCh37, this job was run:

$ qsub -cwd -S "/bin/bash" -N "dbsnp_sort_b37" -l h_rt="24:00:00" -l mem="8G" -o ~/Scratch/' $JOB_NAME'.out -e ~/Scratch/'$JOB_NAME'.err vcf-sort.sh -T ~/Scratch/tmp/ dbsnp.biallelic.v155.b37.vcf.gz dbsnp_sorted_b37.vcf.gz
$ mv dbsnp_sorted_b37.vcf.gz dbsnp.biallelic.v155.b37.vcf.gz

For GRCh38:

$ qsub -cwd -S "/bin/bash" -N "dbsnp_sort_b38" -l h_rt="24:00:00" -l mem="8G" -o ~/Scratch/' $JOB_NAME'.out -e ~/Scratch/'$JOB_NAME'.err vcf-sort.sh -T ~/Scratch/tmp/ dbsnp.biallelic.v155.b38.vcf.gz dbsnp_sorted_b38.vcf.gz
$ mv dbsnp_sorted_b38.vcf.gz dbsnp.biallelic.v155.b38.vcf.gz

Both the GRCh38 and the GRCh37 output files were tabix indexed after completion:

$ tabix -p"vcf" dbsnp.biallelic.v155.b38.vcf.gz
$ tabix -p"vcf" dbsnp.biallelic.v155.b37.vcf.gz

The output report from bcftools norm for both genome builds and current/previous dbSNP versions is shown below:

  • GRCh38 (v155) - total/split/realigned/skipped: 1053471122/145710568/17985666/1591057

  • GRCh38 (v154) - total/split/realigned/skipped: 701206150/66148905/11210568/958052

  • GRCh37 (v155) - total/split/realigned/skipped: 1032379139/138117126/17956788/1505604

  • GRCh37 (v154) - total/split/realigned/skipped: 689821210/63751490/11199525/913045

ALFA#

Data from the ALFA project is also incorporated into the mapping file. ALFA is an Allele Frequency Aggregator that gathers allele frequencies of variants from populations used in GWAS analysis and partitions them into broad ethnic groups. The latest version (06-01-2021) provides a VCF file that contains ~900M rows and 12 population groups. The ALFA data is only available for genome assembly GRCh38, so this will be processed into the desired format and then lifted down to GRCh37.

The data was downloaded as follows

$ wget https://ftp.ncbi.nih.gov/snp/population_frequency/latest_release/freq.vcf.gz
$ wget https://ftp.ncbi.nih.gov/snp/population_frequency/latest_release/freq.vcf.gz.tbi
$ wget https://ftp.ncbi.nih.gov/snp/population_frequency/latest_release/freq.vcf.gz.md5
$ wget https://ftp.ncbi.nih.gov/snp/population_frequency/latest_release/freq.vcf.gz.tbi.md5

The tabix index will not actually be used in the processing but was downloaded for completeness.

Check the MD5:

$ md5sum -c freq.vcf.gz.md5
freq.vcf.gz: OK

$ md5sum -c freq.vcf.gz.tbi.md5
freq.vcf.gz.tbi: OK

As with dbSNP, the chromosomes in ALFA are represented in NCBI format so will need to be converted to the regular chromosome nomenclature. A helper python script (format-alfa) exists to remap the chromosomes to the standard identifiers, and to remap the ALFA sample names to something more human readable. However, this script is run via a bash wrapper script (process-alfa.sh) that also performs several other functions. See its documentation for details.

The command was run with:

$ process-alfa.sh -c -v -T ~/Scratch/tmp/ --steps 1000000 freq.vcf.gz GCF_000001405.39_GRCh38.p13_assembly_report.txt ../../../../../human_genome/b38/Homo_sapiens.GRCh38.dna.toplevel.fa.gz ../alfa0121.biallelic.b38.vcf.gz

And on the grid-engine HPC:

$ qsub -cwd -S "/bin/bash" -N "alfa_b38" -l h_rt="10:00:00" -l mem="8G" -pe smp 2 -o ~/Scratch/'$JOB_NAME'.out -e ~/Scratch/'$JOB_NAME'.err ~/bin/process-alfa.sh <ARGS AS ABOVE>

The bcftools norm output for ALFA is shown below:

  • GRCh38 (06-01-2021) - total/split/realigned/skipped: 904081572/108266225/13083399/1377859

Lifting ALFA to GRCh37#

ALFA is only available in assembly GRCh38; therefore, the processed file needs to be lifted down to GRCh37. I have a couple of scripts to help with this. They use CrossMap for the liftover but not for the VCF handling; see the documentation for details.

$ vcf-lift.sh -T ~/Scratch/tmp -o ../b37/alfa0121.biallelic.b37.vcf.gz -O 'z' alfa0121.biallelic.b38.vcf.gz ../../../../human_genome/b37/Homo_sapiens.GRCh37.dna.toplevel.fa.gz ../../../../chain_files/GRCh38_to_GRCh37.chain.gz
$ qsub -cwd -S "/bin/bash" -N "alfa_b37_lift" -l h_rt="10:00:00" -l mem="4G" -pe smp 4 -o ~/Scratch/'$JOB_NAME'.out -e ~/Scratch/'$JOB_NAME'.err ~/bin/process-alfa.sh <ARGS AS ABOVE>

GNOMAD2 exome allele counts#

Allele counts for variants in selected gnomad2 exome populations were also incorporated into the mapping file. For this, the chromosome-combined VCF file was downloaded from Gnomad and processed with a bash script (process-gnomad2.sh). These files are ~85Gb each and are available from Amazon, Google or Azure endpoints. The Google ones are shown below. Please note, the tabix index files are important as they are used to gather the contigs represented in the file.

For GRCh38:

$ wget https://storage.googleapis.com/gcp-public-data--gnomad/release/2.1.1/liftover_grch38/vcf/exomes/gnomad.exomes.r2.1.1.sites.liftover_grch38.vcf.bgz
$ wget https://storage.googleapis.com/gcp-public-data--gnomad/release/2.1.1/liftover_grch38/vcf/exomes/gnomad.exomes.r2.1.1.sites.liftover_grch38.vcf.bgz.tbi

For GRCh37:

$ wget https://storage.googleapis.com/gcp-public-data--gnomad/release/2.1.1/vcf/exomes/gnomad.exomes.r2.1.1.sites.vcf.bgz
$ wget https://storage.googleapis.com/gcp-public-data--gnomad/release/2.1.1/vcf/exomes/gnomad.exomes.r2.1.1.sites.vcf.bgz.tbi

To process the GRCh37 file:

$ process-gnomad2.sh -e -T ~/Scratch/tmp/ gnomad.exomes.r2.1.1.sites.vcf.bgz ../../../human_genome/b37/Homo_sapiens.GRCh37.dna.toplevel.fa.gz gnomad2ex.biallelic.b37.vcf.gz

And on the HPC (please note, I vastly overestimated the HPC time; the GRCh37 version completed in ~1.25h):

$ qsub -cwd -S "/bin/bash" -N "gnomad2_ex_b37" -e ~/Scratch/'$JOB_NAME'.err -o ~/Scratch/'$JOB_NAME'.out -pe smp 2 -l mem="4G" -l h_rt="48:00:00" process-gnomad2.sh -v -e -T ~/Scratch/tmp/ gnomad.exomes.r2.1.1.sites.vcf.bgz ../../../../../human_genome/b37/Homo_sapiens.GRCh37.dna.toplevel.fa.gz ../gnomad2ex.biallelic.b37.vcf.gz

To process the GRCh38 file, I first attempted to use the same approach, with the commands written below:

$ process-gnomad2.sh -e -T ~/Scratch/tmp/ gnomad.exomes.r2.1.1.sites.liftover_grch38.vcf.bgz ../../../human_genome/b38/Homo_sapiens.GRCh38.dna.toplevel.fa.gz gnomad2ex.biallelic.b38.vcf.gz

And on the HPC:

$ qsub -cwd -S "/bin/bash" -N "gnomad2_ex_b38" -e ~/Scratch/'$JOB_NAME'.err -o ~/Scratch/'$JOB_NAME'.out -pe smp 2 -l mem="4G" -l h_rt="48:00:00" process-gnomad2.sh -v -e -T ~/Scratch/tmp/ gnomad.exomes.r2.1.1.sites.liftover_grch38.vcf.bgz ../../../human_genome/b38/Homo_sapiens.GRCh38.dna.toplevel.fa.gz gnomad2ex.biallelic.b38.vcf.gz

However, these failed with the errors below. It seems bcftools, not sure if bcftools norm or bcftools sort (I am guessing sort), gives the error The sequence "1_KI270706v1_random" was not found and then the whole thing segfaults. This is a strange one: I am stripping the leading chr from all the chromosomes, but I wrote out some of the intermediate files and they have 7 instances of 1_KI270706v1_random in there. So I am assuming that bcftools norm is removing them as they do not align with the reference assembly; however, they are retained in the contigs in the header. As we are piping directly into sort, it errors out as they are in the header but no longer in the file. I could adapt the script to remove all chromosomes not in the reference assembly from the VCF header - this is one for the future. However, for now, I will work around it by simply lifting over the GRCh37 file (which is the original file anyway).

$ cat rerun_gnomad2_ex_b38.err

=== 690168 ===
[info] started at Fri  1 Apr 12:02:14 BST 2022
[info] temp directory: /home/rmjdcfi/Scratch/tmp/
[info] working temp directory: /home/rmjdcfi/Scratch/tmp/tmp.vpymnRBLaC
[info] input VCF: /lustre/projects/DTAdb/resources/mapping_files/source_files/gnomad2/b38/original_files/gnomad.exomes.r2.1.1.sites.liftover_grch38.vcf.bgz
[info] reference genome: /lustre/projects/DTAdb/resources/human_genome/b38/Homo_sapiens.GRCh38.dna.toplevel.fa.gz
[info] output file: /lustre/projects/DTAdb/resources/mapping_files/source_files/gnomad2/b38/gnomad2ex.biallelic.b38.vcf.gz
[info] will extract 9 pop(s): afr,amr,asj,eas,fin,nfe,oth,raw,sas
[info] processing gnomad2ex
[info] generating VCF header
[info] generating allele counts...
[info] normalising final VCF...
Writing to /home/rmjdcfi/Scratch/tmp/8yoRxV
[E::faidx_adjust_position] The sequence "1_KI270706v1_random" was not found
faidx_fetch_seq failed at 1_KI270706v1_random:45952
/var/opt/sge/node-h00a-037/job_scripts/690168: line 192: 55175 Broken pipe             zcat "$temp_header" "$temp_counts"
     55176 Exit 255                | bcftools norm -m'-any' -cx -f "$reference_genome_file"
     55177 Segmentation fault      (core dumped) | bcftools sort "${SORT_ARGS[@]}" -o "$temp_final" -O "z"
[info] removing /home/rmjdcfi/Scratch/tmp/tmp.vpymnRBLaC
[fatal] error on or near, '192'
!!!! ERROR EXIT !!!!
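
For reference, the header fix mentioned above could look something like this (a sketch, not part of the current scripts; the file names are illustrative). It keeps only the ##contig lines whose ID is present in the reference .fai:

cut -f1 Homo_sapiens.GRCh38.dna.toplevel.fa.gz.fai > ref_contigs.txt
bcftools view -h input.vcf.gz \
    | awk -v ref=ref_contigs.txt '
        BEGIN { while ((getline c < ref) > 0) keep[c] = 1 }
        /^##contig/ {
            id = $0; sub(/.*##contig=<ID=/, "", id); sub(/[,>].*/, "", id)
            if (!(id in keep)) next
        }
        { print }' > fixed_header.txt
bcftools reheader -h fixed_header.txt -o fixed.vcf.gz input.vcf.gz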

This is the command used for the liftover, it uses the vcf-lift.sh script that has been used previously.

$ qsub -cwd -S "/bin/bash" -N "gnomad2ex_lift_to_b38" -e ~/Scratch/'$JOB_NAME'.err -o ~/Scratch/'$JOB_NAME'.out -pe smp 4 -l mem="4G" -l h_rt="10:00:00" vcf-lift.sh -T ~/Scratch/tmp/ -o gnomad2ex.biallelic.b38.vcf.gz -O 'z' -f process_logs/liftover_fails.txt.gz ../b37/gnomad2ex.biallelic.b37.vcf.gz ../../../../human_genome/b38/Homo_sapiens.GRCh38.dna.toplevel.fa.gz ../../../../chain_files/GRCh37_to_GRCh38.chain.gz

GNOMAD3 allele counts#

Allele counts for variants in gnomad3 will also be incorporated into the mapping file. The gnomad3 data is split per chromosome and, due to its size, is particularly problematic to handle. There is a script, process-gnomad3.sh, to do this. It was run with the command below using the default populations (where gnomad_processed_files is the output directory):

$ process-gnomad3.sh -v human_genome/b38/Homo_sapiens.GRCh38.dna.toplevel.fa.gz gnomad_processed_files

However, due to the run time, process-gnomad3.sh was parallelised using the hpc-tools framework, running two concurrent tasks. This requires a job array file and it was created with the gnarly bash command below:

echo -e "rowidx\tinfile\toutfile\tchr_name" > gnomad3-1.ja;
idx=1;
for i in ${chrs[@]}; do
    echo -e "$idx\t"$(readlink -f human_genome/b38/Homo_sapiens.GRCh38.dna.toplevel.fa.gz)"\t"$(readlink -f ../gnomad3-1.chr${i}.biallelic.b38.vcf.gz)"\t"$i;
    idx=$(( idx + 1));
done >> gnomad3-1.ja

Then submitted with:

pyjasub -v --batch 2 --mem "8G" --time '15:00:00' --tmp_dir ~/Scratch/tmp/ --log_dir ~/Scratch/job_logs/gnomad3-1_proc --ja_file process_logs/gnomad3-1.ja --task_script ~/code/gwas_norm/resources/bin/gnomad.task.sh --infiles infile --outfiles outfile

Where gwas_norm/resources/bin/gnomad.task.sh is located in this repository. The individual chromosome data were concatenated using the vcf-cat.sh bash wrapper script. Note that gnomad_processed_files/*.vcf.gz gives the chromosome files in string sort order.

qsub -cwd -S "/bin/bash" -N "gnomad_cat" -e ~/Scratch/'$JOB_NAME'.err -o ~/Scratch/'$JOB_NAME'.out -pe smp 2 -l mem="4G" -l h_rt="10:00:00" ~/code/gwas_norm/resources/bin/vcf-cat.sh -v --delete -T ~/Scratch/tmp/ gnomad3-1.biallelic.b38.vcf.gz gnomad_processed_files/*.vcf.gz

The joined Gnomad data was also lifted down to GRCh37, using another bash wrapper vcf-lift.sh. And submitted to the HPC as follows:

qsub -cwd -S "/bin/bash" -N "gnomad_b37_lift" -e ~/Scratch/'$JOB_NAME'.err -o ~/Scratch/'$JOB_NAME'.out -pe smp 4 -l mem="4G" -l h_rt="10:00:00" vcf-lift.sh -v -T ~/Scratch/tmp -o b37/gnomad3-1.biallelic.b37.vcf.gz -O 'z' b38/gnomad3-1.biallelic.b38.vcf.gz /human_genome/b37/Homo_sapiens.GRCh37.dna.toplevel.fa.gz /chain_files/GRCh38_to_GRCh37.chain.gz

1000 genomes allele counts#

Allele counts are generated for each of the super and sub populations in the 1000 genomes data using the process-1000g.sh script. This will download and process all of the files. It was run on the HPC as follows:

For GRCh37:

$ qsub -cwd -S "/bin/bash" -N "thousand_g_b37" -l h_rt="36:00:00" -l mem="8G" -pe smp 2 -o ~/Scratch/'$JOB_NAME'.out -e ~/Scratch/'$JOB_NAME'.err ~/code/gwas_norm/resources/bin/process-1000g.sh -v --tmp-dir ~/Scratch/tmp/ --download-dir resources/mapping_files/source_files/1000g/b37/original_files/ --keep-vcf b37 resources/human_genome/b37/Homo_sapiens.GRCh37.dna.toplevel.fa.gz resources/mapping_files/source_files/1000g/b37

For GRCh38:

$ qsub -cwd -S "/bin/bash" -N "thousand_g_b38" -l h_rt="36:00:00" -l mem="8G" -pe smp 2 -o ~/Scratch/'$JOB_NAME'.out -e ~/Scratch/'$JOB_NAME'.err ~/code/gwas_norm/resources/bin/process-1000g.sh -v --tmp-dir ~/Scratch/tmp/ --download-dir resources/mapping_files/source_files/1000g/b38/original_files/ --keep-vcf b38 resources/human_genome/b38/Homo_sapiens.GRCh38.dna.toplevel.fa.gz resources/mapping_files/source_files/1000g/b38/

The processed 1000 genomes chromosomes were then merged into a single file using the vcf-cat.sh script. This is a wrapper around bcftools concat that performs some input file/output file checks and deletes the input files once they have been concatenated, if all is good. This was submitted with qsub as below:

For GRCh37:

$ qsub -cwd -S "/bin/bash" -N "tkg_b37_merge" -l h_rt="15:00:00" -l mem="4G" -pe smp 2 -o ~/Scratch/'$JOB_NAME'.out -e ~/Scratch/'$JOB_NAME'.err ~/code/gwas_norm/resources/bin/vcf-cat.sh --delete -T ~/Scratch/tmp/ -v 1kg.biallelic.b37.vcf.gz 1kg.chr*.vcf.gz

For GRCh38:

$ qsub -cwd -S "/bin/bash" -N "tkg_b38_merge" -l h_rt="15:00:00" -l mem="4G" -pe smp 2 -o ~/Scratch/'$JOB_NAME'.out -e ~/Scratch/'$JOB_NAME'.err ~/code/gwas_norm/resources/bin/vcf-cat.sh --delete -T ~/Scratch/tmp/ -v 1kg.biallelic.b38.vcf.gz 1kg.chr*.vcf.gz

UKBB SNPstats files#

Allele frequencies from the UK BioBank will be incorporated into the final allele counts VCF file. These have been pre-calculated from the BGEN files using SNPstats. A script (format-snpstats) was written that will take one or more SNPstats summary files and create an allele counts VCF file from them.

We also need the UK BioBank counts data to be represented on the GRCh38 assembly, so this was lifted up to GRCh38 using vcf-lift.sh:

$ qsub -cwd -S "/bin/bash" -N "ukbb_eur_b37_b38_liftover" -l h_rt="01:30:00" -l mem="4G" -pe smp 4 -o ~/Scratch/'$JOB_NAME'.out -e ~/Scratch/'$JOB_NAME'.err vcf-lift.sh -o"../b38/ukbb_eur.biallelic.b38.vcf.gz" -O"z" -v -T ~/Scratch/tmp -f"../b38/process_logs/liftover_files.ukbb_eur.biallelic.b37.b38.txt.gz" ukbb_eur.biallelic.b37.vcf.gz /human_genome/b38/Homo_sapiens.GRCh38.dna.toplevel.fa.gz /chain_files/GRCh37_to_GRCh38.chain.gz

Merging all the data sources#

The data sources were merged into a single VCF file using a python script, merge-count-vcfs. This is wrapped in a bash script, merge-count-vcfs.sh, which was used to actually run the merge. Before merging any of the VCFs, tabix indexes were created for all of them; this is to ensure that they all have the same sort order.
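
For example, the GRCh37 inputs could be indexed with something like this (a sketch; the files are the same sources passed to merge-count-vcfs.sh below):

for f in dbsnp.biallelic.v155.b37.vcf.gz ukbb_eur.biallelic.b37.vcf.gz \
         alfa0121.biallelic.b37.vcf.gz gnomad3-1.biallelic.b37.vcf.gz \
         gnomad2ex.biallelic.b37.vcf.gz 1kg.biallelic.b37.vcf.gz; do
    tabix -p"vcf" "$f"
done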

The GRCh37 data sources were merged with this:

$ qsub -cwd -S "/bin/bash" -N "b37_merge" -l h_rt="36:00:00" -l mem="4G" -pe smp 2 -o ~/Scratch/'$JOB_NAME'.out -e ~/Scratch/'$JOB_NAME'.err \
    merge-count-vcfs.sh -v -o gwas_norm.v20220402.biallelic.b37.vcf.gz --data-names "ukbb,alfa,gnomad3,gnomad2,1kg" --ref-name "dbsnp" --ref-genome ../../../human_genome/b37/Homo_sapiens.GRCh37.dna.toplevel.fa.gz -T ~/Scratch/tmp/ ../../source_files/dbsnp/b37/v155/dbsnp.biallelic.v155.b37.vcf.gz ../../source_files/ukbb_freq/b37/ukbb_eur.biallelic.b37.vcf.gz ../../source_files/alfa/b37/alfa0121.biallelic.b37.vcf.gz ../../source_files/gnomad31/b37/gnomad3-1.biallelic.b37.vcf.gz ../../source_files/gnomad2/b37/gnomad2ex.biallelic.b37.vcf.gz ../../source_files/1000g/b37/1kg.biallelic.b37.vcf.gz

The GRCh38 data sources were merged with this:

$ qsub -cwd -S "/bin/bash" -N "b37_merge" -l h_rt="36:00:00" -l mem="4G" -pe smp 2 -o ~/Scratch/'$JOB_NAME'.out -e ~/Scratch/'$JOB_NAME'.err \
    merge-count-vcfs.sh -v -o gwas_norm.v20220402.biallelic.b38.vcf.gz --data-names "ukbb,alfa,gnomad3,gnomad2,1kg" --ref-name "dbsnp" --ref-genome ../../../human_genome/b38/Homo_sapiens.GRCh38.dna.toplevel.fa.gz -T ~/Scratch/tmp/ ../../source_files/dbsnp/b38/v155/dbsnp.biallelic.v155.b38.vcf.gz ../../source_files/ukbb_freq/b38/ukbb_eur.biallelic.b38.vcf.gz ../../source_files/alfa/b38/alfa0121.biallelic.b38.vcf.gz ../../source_files/gnomad31/b38/gnomad3-1.biallelic.b38.vcf.gz ../../source_files/gnomad2/b38/gnomad2ex.biallelic.b38.vcf.gz ../../source_files/1000g/b38/1kg.biallelic.b38.vcf.gz

Running VEP#

The merged VCF data sources were then annotated with the Ensembl Variant Effect Predictor (VEP) and CADD. Due to the size of the merged files (over 1 billion rows), this was parallelised using ~25,000 tasks, each with about 50,000 variants to annotate. VEP was installed using these instructions. I was going to run using the CADD plugin but I had all sorts of issues with performance, so I wrote my own script to merge CADD data into the VEP output. I never managed to achieve the 100k variants/min that VEP supposedly can do.

Running VEP for GRCh37#

For GRCh37, the procedure is as follows. Each parallel task will extract its defined regions from the merged VCF file, therefore the merged file was tabix indexed first of all.

$ tabix -p"vcf" gwas_norm.v20220402.biallelic.b37.vcf.gz

Then a job array file was created that describes the arguments to be used in each task. This was done using a bash script, make-vep-ja.sh, which wraps a Python script (make-site-chunks) that creates genomic chunks containing a fixed number of variants.

$ qsub -cwd -S "/bin/bash" -N "vep_b37_job_array" -l h_rt="15:00:00" -l mem="4G" -o ~/Scratch/'$JOB_NAME'.out -e ~/Scratch/'$JOB_NAME'.err \
    make-vep-ja.sh -n 50000 -T ~/Scratch/tmp/ -o process_logs/gwas_norm_vep.ja gwas_norm.v20220402.biallelic.b37.vcf.gz vep_chunks/ GRCh37 ../../../cadd/b37/whole_genome_SNVs.tsv.gz ../../../cadd/b37/gnomad.genomes.r2.1.1.indel.tsv.gz

The tasks were submitted with the command below. This also shows the output of the job submission process (with the paths truncated).

$ pyjasub -v --mem "2G" --time "07:00:00" --tmp_dir ~/Scratch/tmp/ --log_dir ~/Scratch/job_logs/vep_b37 --task_script ~/code/gwas_norm/resources/tasks/vep.task.sh --ja_file process_logs/gwas_norm_vep.ja --infiles infile --outfiles outfile
=== pyjasub (hpc_tools v0.1.1dev1) ===
[info] step: 1
[info] batch: None
[info] mem: 2G
[info] shell: /bin/bash
[info] time: 07:00:00
[info] nodes: None
[info] tmp_dir: ~/Scratch/tmp/
[info] scratch_size: None
[info] tmp_size: None
[info] project: None
[info] paid: None
[info] log_dir: ~/Scratch/job_logs/vep_b37
[info] ja_file: process_logs/gwas_norm_vep.ja
[info] task_script: ~/code/gwas_norm/resources/tasks/vep.task.sh
[info] infiles: ['infile']
[info] outfiles: ['outfile']
[info] full job log directory '~/job_logs/vep_b37'
[info] run started at: '2022-04-03 17:17:35.518636'
[info] current run number: '1'
[info] run log path: '~/vep_b37/run.log'
[info] run directory: '~/vep_b37/2_1649002655'
[info] run out files: '~/vep_b37/2_1649002655/out/$TASK_ID_2_1649002655.out'
[info] run err files: '~/vep_b37/2_1649002655/err/$TASK_ID_2_1649002655.err'
[info] full job array path: 'process_logs/gwas_norm_vep.ja'
[info] job array md5: 'c7fc30dc6e78142651effb929c121134'
[info] full task script path: '~/code/gwas_norm/resources/tasks/vep.task.sh'
[info] task script md5: '7fadfd0e6101e35f9c32bfef4044f154'
[info] previous task script md5: '7fadfd0e6101e35f9c32bfef4044f154'
[info] copying task script: '~/vep_b37/2_1649002655/7fadfd0e6101e35f9c32bfef4044f154.task'
[info] run job array path: '~/vep_b37/2_1649002655/run_job_array.job'
[info] check results below:
[info] input file present: '24391'
[info] input file absent: '0'
[info] output file present: '0'
[info] output file absent: '24391'
[info] total tasks: '24391'
[info] tasks to be submitted: '24391'
[run] You are about to run '24391' tasks over '24391' rows in a job array with a step of '1', using the following command:
[run] qsub -t 1-24391:1 -e ~/vep_b37/2_1649002655/err/$TASK_ID_2_1649002655.err -o ~/job_logs/vep_b37/2_1649002655/out/$
TASK_ID_2_1649002655.out -l mem=2G -S /bin/bash -l h_rt=07:00:00 ~/code/gwas_norm/resources/tasks/vep.task.sh ~/vep_b37/2_1649002655/run_job_array.job ~/Scratch/tmp 1
[run] if all this looks correct, press [ENTER] to submit or [CTRL-C] to quit >

[run] submitting job...
[run] submitted array job with ID 706453
[info] run log details below...

...

[info] *** job submission finished ***
*** END ***

Surprisingly, all the tasks completed successfully on the first run:

+----------------------------+--------------+------+-------+-----+----------+-------------+-----------------------+--------------+----------------+---------------+-----------------+---------------+
| run_time                   | run_dir      | step | batch | mem | time     | total_tasks | no_of_tasks_submitted | infile_exist | infile_missing | outfile_exist | outfile_missing | job_id        |
+----------------------------+--------------+------+-------+-----+----------+-------------+-----------------------+--------------+----------------+---------------+-----------------+---------------+
| 2022-04-03 19:18:37.500647 | 1_1649009917 | 1    |       | 2G  | 07:00:00 | 24391       | 24391                 | 24391        | 0              | 0             | 24391           | 706762        |
| 2022-04-04 13:58:09.936763 | 2_1649077089 | 1    |       | 4G  | 15:00:00 | 24391       | 0                     | 24391        | 0              | 24391         | 0               | NOT_SUBMITTED |
+----------------------------+--------------+------+-------+-----+----------+-------------+-----------------------+--------------+----------------+---------------+-----------------+---------------+

The next step is to merge the 24391 VEP task outputs into a single file. bcftools does have a concatenation function, however, we do not need anything so sophisticated as we know the sorted file order from the job array file and that all of the VEP task VCFs will have the same header. Therefore, a simpler variant of vcf-cat.sh was used, called vcf-dumb-cat.sh; this script should only be used for jobs like this, where the origin and structure of the VCF files is well known.
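
In essence, the dumb concatenation amounts to the following (a rough sketch only; it assumes a plain text file, here called chunk_list.txt, listing the chunk VCFs in job-array order, which is not necessarily how the real script takes its arguments):

{
    zcat "$(head -n1 chunk_list.txt)" | grep '^#'
    while read -r f; do
        zcat "$f" | grep -v '^#'
    done < chunk_list.txt
} | bgzip > gwas_norm.v20220402.biallelic.vep.b37.vcf.gz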

$ qsub -cwd -S "/bin/bash" -N "vep_vcf_cat" -l h_rt="24:00:00" -l mem="2G" -o ~/Scratch/'$JOB_NAME'.out -e ~/Scratch/'$JOB_NAME'.err \
    vcf-dumb-cat.sh -v -T ~/Scratch/tmp/ -f process_logs/vcf-dumb-cat-files.txt gwas_norm.v20220402.biallelic.vep.b37.vcf.gz

Finally, tabix index the file:

$ tabix -p"vcf" gwas_norm.v20220402.biallelic.vep.b37.vcf.gz

Running VEP for GRCh38#

The process for GRCh38 was pretty much identical to GRCh37. First, tabix index the merged file:

$ tabix -p"vcf" gwas_norm.v20220402.biallelic.b38.vcf.gz

Then create the job array file:

$ qsub -cwd -S "/bin/bash" -N "vep_b38_job_array" -l h_rt="15:00:00" -l mem="4G" -o ~/Scratch/'$JOB_NAME'.out -e ~/Scratch/'$JOB_NAME'.err \
    make-vep-ja.sh -n 50000 -T ~/Scratch/tmp/ -o process_logs/gwas_norm_vep.ja gwas_norm.v20220402.biallelic.b38.vcf.gz vep_chunks/ GRCh38 ../../../cadd/b38/whole_genome_SNVs.tsv.gz ../../../cadd/b38/gnomad.genomes.r3.0.indel.tsv.gz

Then run the tasks - this time a step size of 2 was used for the first run:

$ pyjasub -v --mem "4G" --time "07:00:00" --tmp_dir ~/Scratch/tmp/ --log_dir ~/Scratch/job_logs/vep_b38 --task_script ~/code/gwas_norm/resources/tasks/vep.task.sh --ja_file /lustre/projects/DTAdb/resources/mapping_files/b38/v20220402/process_logs/gwas_norm_vep_b38.ja --infiles infile --outfiles outfile --step 2

Running VEP on the GRCh38 variants is more difficult than running on GRCh37. Some of the tasks took a very long time to run and 1 solitary task of ~50,000 variants did not finish within the 72h job time limit. This was for the region chr2:70020254-70125051. Therefore, this region was simply annotated with CADD scores and used as is. The final run log is seen below:

+----------------------------+--------------+------+-------+-----+----------+-------------+-----------------------+--------------+----------------+---------------+-----------------+---------------+
| run_time                   | run_dir      | step | batch | mem | time     | total_tasks | no_of_tasks_submitted | infile_exist | infile_missing | outfile_exist | outfile_missing | job_id        |
+----------------------------+--------------+------+-------+-----+----------+-------------+-----------------------+--------------+----------------+---------------+-----------------+---------------+
| 2022-04-05 14:13:41.229093 | 1_1649164421 | 2    |       | 4G  | 07:00:00 | 12440       | 12440                 | 24880        | 0              | 0             | 24880           | 716604        |
| 2022-04-06 08:16:22.372758 | 2_1649229382 | 1    |       | 4G  | 15:00:00 | 24880       | 6                     | 24880        | 0              | 24874         | 6               | 722148        |
| 2022-04-07 08:09:55.976248 | 3_1649315395 | 1    |       | 4G  | 72:00:00 | 24880       | 1                     | 24880        | 0              | 24879         | 1               | 730934        |
| 2022-04-21 09:19:59.978175 | 4_1650529199 | 1    |       | 4G  | 15:00:00 | 24880       | 0                     | 24880        | 0              | 24880         | 0               | NOT_SUBMITTED |
+----------------------------+--------------+------+-------+-----+----------+-------------+-----------------------+--------------+----------------+---------------+-----------------+---------------+

The CADD run to finish the 1 remaining task used the same job array but different task script:

pyjasub -v --mem "4G" --time "15:00:00" --tmp_dir ~/Scratch/tmp/ --log_dir ~/Scratch/job_logs/cadd_fill_b38 --task_script ~/code/gwas_norm/resources/tasks/cad.task.sh --ja_file

The issue of not being able to VEP annotate certain regions of GRCh38 has been observed before with v102 of VEP (v105 was used above). In that case a lot more regions were affected, including the region affected with v105. I am not sure why; there are some regions that take ages to annotate, and overall there were 14 regions that I was unable to annotate with VEP within the 72h job time limit (some of these had only annotated ~4000 variants in that time). There were other regions that only just managed to finish within that time frame as well. These are listed below:

bad VEP genomic regions#

| chr_name | start_pos | end_pos |
|---|---|---|
| 2 | 69948179 | 70062825 |
| 2 | 70062826 | 70172885 |
| 2 | 144689850 | 144821972 |
| 2 | 144821974 | 144949272 |
| 2 | 144949273 | 145078739 |
| 2 | 145078740 | 145209919 |
| 21 | 16393050 | 16512767 |
| 3 | 107740352 | 107874157 |
| 4 | 85967761 | 86100203 |
| 4 | 86100211 | 86228031 |
| 4 | 86228035 | 86364788 |
| 8 | 127857056 | 127979976 |
| 8 | 129289944 | 129410854 |
| 8 | 129410855 | 129536753 |

Rather than try to figure out what is going on, these regions were mapped to CADD only and incorporated into the mapping file un-VEPed. Note that these issues do not occur with GRCh37. A separate CADD-only task script was created and used to run these regions.

Within the mapping files, there were also some variants that were missing. These seemed to be a subset of variants that did not have an ALT allele (not sure why) and were probably skipped by VEP. I am not concerned about these.

The VEP annotated chunks were combined into a single file with:

$ qsub -cwd -S "/bin/bash" -N "vep_vcf_cat_b38" -l h_rt="24:00:00" -l mem="2G" -o ~/Scratch/'$JOB_NAME'.out -e ~/Scratch/'$JOB_NAME'.err vcf-dumb-cat.sh -v -T ~/Scratch/tmp/ -f process_logs/vcf_dumb_cat_files.b38.txt gwas_norm.v20220402.biallelic.vep.b38.vcf.gz

Splitting the mapping file into common/rare#

It makes sense from a GWAS perspective to have a common variant mapping file and a rare one. Most of the variants in the mapping file are rare but most of the GWAS variants will be common, so there is no point massively increasing the search space unnecessarily. So, the common file will be the main mapping file with the rare file (or the original file with all variants) being an occasional lookup.

For this, a python script, split-mapping-file, was used. The default arguments were used, namely that a variant site is classed as common if at least one of the populations has a MAF of >= 0.01 or a MAC of >= 50. The MAC criterion covers cases where the variant is rare but the total population size is large enough for there to be enough samples for it to be valid.
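
In terms of the AN:AC fields described earlier, the test amounts to something like the following (a sketch, not the split-mapping-file implementation; it prints a common/rare label per site):

bcftools query -f '%CHROM\t%POS[\t%AN\t%AC]\n' gwas_norm.v20220402.biallelic.vep.b37.vcf.gz \
    | awk '{
        common = 0
        for (i = 3; i < NF; i += 2) {
            an = $i; ac = $(i + 1)
            if (an == "." || ac == "." || an + 0 == 0) continue
            mac = (ac < an - ac) ? ac : an - ac   # minor allele count
            if (mac / an >= 0.01 || mac >= 50) common = 1
        }
        print $1, $2, (common ? "common" : "rare")
    }' | head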

For GRCh37, this command was used - note that the requested run time matters here; this takes a while as the script will also tabix index both output files:

$ qsub -cwd -S "/bin/bash" -N "split_mapping_b37" -l h_rt="48:00:00" -l mem="4G" -o ~/Scratch/'$JOB_NAME'.out -e ~/Scratch/'$JOB_NAME'.err \
    wrap split-mapping-file -T ~/Scratch/tmp/ gwas_norm.v20220402.biallelic.vep.b37.vcf.gz gwas_norm.v20220402.common.biallelic.vep.b37.vcf.gz gwas_norm.v20220402.rare.biallelic.vep.b37.vcf.gz

The variant synonyms file#

A flat file of current variant identifiers and their synonyms is also maintained. This is only used to normalise GWAS files that have only rsID information and no positional information. It is built jointly from Ensembl database tables and the dbSNP download file. TBC…