Building the mapping files#
Introduction#
The mapping files are VCF files of coordinate data, functional annotations and rsIDs that can be built and used to assign the latest rsIDs and allele frequencies from a variety of population groups to GWAS variants. They incorporate data from multiple available datasets. From end to end this is a manual process, although the component parts have various scripts associated with them. The whole process is documented here and the individual scripts used can be found in bash mapping scripts and python mapping scripts.
Version information#
The gwas-norm mapping files have the structure gwas_norm.v<version>.<optional:rare|common>.biallelic.vep.<genome_assembly:b36|b37|b38>.vcf.gz, where the version number is the date in YYYYMMDD form so that it will sort correctly; for example, gwas_norm.v20220402.common.biallelic.vep.b37.vcf.gz is the common-variant GRCh37 file built on 2 April 2022. The data sources and basic statistics of built versions are detailed below. The partition into common or rare is based on having a MAF >= 0.01 or a MAC >= 50. The only exception is v20210208, where the MAC threshold was >= 10.
| Version | Genome assembly | #sites (all) | #sites (common) | #sites (rare) |
|---|---|---|---|---|
| v20210208 | GRCh37 | 1,161,556,872 | 99,419,831 (MAC>=10) | 1,062,137,041 (MAC<10) |
| v20210208 | GRCh38 | 1,199,931,094 | 101,770,179 (MAC>=10) | 1,098,160,915 (MAC<10) |
| v20220402 | GRCh37 | 1,215,008,330 | 81,947,534 | 1,133,060,793 |
| v20220402 | GRCh38 | 1,243,710,237 | TBC… | TBC… |
The data sources in each version are listed below:
| Data source | v20210208 GRCh37 | v20210208 GRCh38 | v20220402 GRCh37 | v20220402 GRCh38 |
|---|---|---|---|---|
| dbSNP | v154 | v154 | v155 | v155 |
| CADD | v1.6dev, all SNVs + gnomad2.1.1 indels | v1.6dev, all SNVs + gnomad3.0 indels | v1.6dev, all SNVs + gnomad2.1.1 indels | v1.6dev, all SNVs + gnomad3.0 indels |
| VEP | v102 | v102 | v105 | v105 |
| ALFA | 06/01/2021 (GRCh38->GRCh37 liftover) | 06/01/2021 | 06/01/2021 (GRCh38->GRCh37 liftover) | 06/01/2021 |
| Gnomad2 Exomes | v2.1.1 | v2.1.1 (GRCh37->GRCh38 liftover) | | |
| Gnomad3 | v3.1 (GRCh38->GRCh37 liftover) | v3.1 | v3.1 (GRCh38->GRCh37 liftover) | v3.1 |
| 1000genomes | v5a (02/05/2013) | v5a (05/04/2017) (GRCh37->GRCh38 liftover) | v5a (02/05/2013) | v5a (05/04/2017) (GRCh37->GRCh38 liftover) |
| UK BioBank | v3 imputed GBR unrelated QC | v3 imputed GBR unrelated QC (GRCh37->GRCh38 liftover) | v3 imputed GBR unrelated QC | v3 imputed GBR unrelated QC (GRCh37->GRCh38 liftover) |
The populations in the mapping file#
Below is a summary of each of the population groups in the mapping file:

| Column Name | Source | Description |
|---|---|---|
| UKBB_EUR | UK BioBank | The white GBR subset from UK BioBank imputed genotype files, based on unrelateds |
| ALFA_EUR | ALFA 01/21 release | European |
| ALFA_AFO | ALFA 01/21 release | Individuals with African ancestry |
| ALFA_EAS | ALFA 01/21 release | East Asian (95%) |
| ALFA_AFA | ALFA 01/21 release | African American |
| ALFA_LAC | ALFA 01/21 release | Latin American individuals with Afro-Caribbean ancestry |
| ALFA_LEN | ALFA 01/21 release | Latin American individuals with mostly European and Native American ancestry |
| ALFA_OAS | ALFA 01/21 release | Asian individuals excluding South or East Asian |
| ALFA_SAS | ALFA 01/21 release | South Asian |
| ALFA_OTR | ALFA 01/21 release | The self-reported population is inconsistent with the GRAF-assigned population |
| ALFA_AFR | ALFA 01/21 release | All Africans |
| ALFA_ASN | ALFA 01/21 release | All Asian individuals (EAS and OAS) excluding South Asian (SAS) |
| ALFA_TOT | ALFA 01/21 release | Total (~global) across all populations |
| GNOMAD31_AFR | Gnomad 3.1 | African/African-American |
| GNOMAD31_AMI | Gnomad 3.1 | Amish |
| GNOMAD31_AMR | Gnomad 3.1 | Latino/Admixed American |
| GNOMAD31_ASJ | Gnomad 3.1 | Ashkenazi Jewish |
| GNOMAD31_EAS | Gnomad 3.1 | Japanese (jpn), Korean (kor), Other East Asian (oea) |
| GNOMAD31_FIN | Gnomad 3.1 | Finnish |
| GNOMAD31_MID | Gnomad 3.1 | Middle Eastern |
| GNOMAD31_NFE | Gnomad 3.1 | Non-Finnish Europeans. Bulgarian (bgr), Estonian (est), North-western European (nwe), Other non-Finnish European (onf), Southern European (seu), Swedish (swe) |
| GNOMAD31_OTH | Gnomad 3.1 | Other |
| GNOMAD31_RAW | Gnomad 3.1 | All? |
| GNOMAD31_SAS | Gnomad 3.1 | South Asian |
| GNOMAD2EX_AFR | Gnomad 2.1.1 exomes | African/African-American |
| GNOMAD2EX_AMR | Gnomad 2.1.1 exomes | Latino/Admixed American |
| GNOMAD2EX_ASJ | Gnomad 2.1.1 exomes | Ashkenazi Jewish |
| GNOMAD2EX_EAS | Gnomad 2.1.1 exomes | Japanese (jpn), Korean (kor), Other East Asian (oea) |
| GNOMAD2EX_FIN | Gnomad 2.1.1 exomes | Finnish |
| GNOMAD2EX_NFE | Gnomad 2.1.1 exomes | Non-Finnish Europeans. Bulgarian (bgr), Estonian (est), North-western European (nwe), Other non-Finnish European (onf), Southern European (seu), Swedish (swe) |
| GNOMAD2EX_OTH | Gnomad 2.1.1 exomes | Other |
| GNOMAD2EX_RAW | Gnomad 2.1.1 exomes | All? |
| GNOMAD2EX_SAS | Gnomad 2.1.1 exomes | South Asian |
| 1KG_AFR | 1000 genomes phase 3 | All African populations |
| 1KG_AMR | 1000 genomes phase 3 | Ad Mixed American |
| 1KG_EAS | 1000 genomes phase 3 | East Asian |
| 1KG_EUR | 1000 genomes phase 3 | All European populations |
| 1KG_SAS | 1000 genomes phase 3 | South Asian |
| 1KG_ACB | 1000 genomes phase 3 | African Caribbeans in Barbados |
| 1KG_ASW | 1000 genomes phase 3 | Americans of African Ancestry in SW USA |
| 1KG_BEB | 1000 genomes phase 3 | Bengali from Bangladesh |
| 1KG_CDX | 1000 genomes phase 3 | Chinese Dai in Xishuangbanna, China |
| 1KG_CEU | 1000 genomes phase 3 | Utah Residents (CEPH) with Northern and Western European Ancestry |
| 1KG_CHB | 1000 genomes phase 3 | Han Chinese in Beijing, China |
| 1KG_CHS | 1000 genomes phase 3 | Southern Han Chinese |
| 1KG_CLM | 1000 genomes phase 3 | Colombians from Medellin, Colombia |
| 1KG_ESN | 1000 genomes phase 3 | Esan in Nigeria |
| 1KG_FIN | 1000 genomes phase 3 | Finnish in Finland |
| 1KG_GBR | 1000 genomes phase 3 | British in England and Scotland |
| 1KG_GIH | 1000 genomes phase 3 | Gujarati Indian from Houston, Texas |
| 1KG_GWD | 1000 genomes phase 3 | Gambian in Western Divisions in the Gambia |
| 1KG_IBS | 1000 genomes phase 3 | Iberian Population in Spain |
| 1KG_ITU | 1000 genomes phase 3 | Indian Telugu from the UK |
| 1KG_JPT | 1000 genomes phase 3 | Japanese in Tokyo, Japan |
| 1KG_KHV | 1000 genomes phase 3 | Kinh in Ho Chi Minh City, Vietnam |
| 1KG_LWK | 1000 genomes phase 3 | Luhya in Webuye, Kenya |
| 1KG_MSL | 1000 genomes phase 3 | Mende in Sierra Leone |
| 1KG_MXL | 1000 genomes phase 3 | Mexican Ancestry from Los Angeles USA |
| 1KG_PEL | 1000 genomes phase 3 | Peruvians from Lima, Peru |
| 1KG_PJL | 1000 genomes phase 3 | Punjabi from Lahore, Pakistan |
| 1KG_PUR | 1000 genomes phase 3 | Puerto Ricans from Puerto Rico |
| 1KG_STU | 1000 genomes phase 3 | Sri Lankan Tamil from the UK |
| 1KG_TSI | 1000 genomes phase 3 | Toscani in Italia |
| 1KG_YRI | 1000 genomes phase 3 | Yoruba in Ibadan, Nigeria |
The mapping VCF file format#
The objective of the mapping file is to provide pre-calculated annotations for all known human variants. The annotations currently used in gwas-norm are:
rsID
Ensembl VEP prediction
SIFT score (where appropriate)
PolyPhen score (where appropriate)
CADD score (where available)
Allele counts from various populations (where available)
Ideally, a mapping file containing these annotations should be available for each genome assembly that you want to lift over to. If a mapping file is not available for a genome assembly that you are lifting to, then the mapping step is skipped. Equally, if any of the annotations listed above are not available for a variant in the mapping file then they will be skipped in the final normalised GWAS file (but the variant is still output). In essence, the pipeline is designed not to fail if there are missing annotation data; rather, appropriate flags are added to the row in the normalised GWAS file and it is then up to the user whether to exclude variants at analysis time, based on the presence or absence of flags.
Currently, the process below generates fully annotated mapping files for GRCh38 (referred to as b38 in the file names below), GRCh37 (b37) and a mapping file with only rsID and allele counts for NCBI36 (b36).
The mapping file is referred to as a “counts VCF file”. This is because it has some differences from a standard “sites” VCF file, namely that the allele count information is placed in the sample fields with the format AN:AC for each of the populations represented at the site, where AN is the total number of alleles observed at that site and AC is the count of each ALT allele observed at the site for the population. So in theory a site with two ALT alleles could have an AN:AC field that looks like 1284737:12,1, where 12 is the allele count for the first ALT allele and 1 is the allele count for the second ALT allele. In practice, all the sites in the mapping file are represented in bi-allelic form (see below), so this would be represented as two lines, 1284737:12 and 1284737:1. The IDs for each of the populations occupy the sample fields after the FORMAT field and have the broad structure COHORT_POPULATION (see below). Any missing allele count is represented as .:., i.e. the allele number (AN) is missing (.) and the allele count (AC) is missing. This representation of the allele counts is similar to that used by the ALFA project.
Other than the AN:AC representation, there are two absolute requirements for the mapping file.
1. All the variant sites should be represented in bi-allelic form. This makes the mapping process a bit easier.
2. The chromosomes in the mapping file should always be represented in string sort order and not in biological order. It might be natural to represent human chromosomes in the order 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, X, Y, MT; however, they should be represented in string sort order: 1, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 2, 20, 21, 22, 3, 4, 5, 6, 7, 8, 9, MT, X, Y. This can be achieved by stripping the VCF header and performing:
zcat mapping_file.vcf.gz | sort -t$'\t' -k1,1 -k2n,2n | bgzip > sorted.vcf.gz
However, it is probably easier to generate a separate VCF header with the contigs in string sort order and then use bcftools reheader/bcftools sort. If the procedure below is followed then you do not need to worry about this.
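A rough sketch of that reheader/sort approach is shown below (file names are placeholders; the build procedure below handles this for you):

# Pull out the current header, reorder the ##contig lines into string sort
# order (by hand or with a small script), then reheader and re-sort so the
# records follow the new contig order.
bcftools view -h mapping_file.vcf.gz > header.txt
# ... edit header.txt so the ##contig lines are in string sort order ...
bcftools reheader -h header.txt -o reheadered.vcf.gz mapping_file.vcf.gz
bcftools sort -T /path/to/tmp -O z -o sorted.vcf.gz reheadered.vcf.gz
tabix -p vcf sorted.vcf.gz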
The build procedure outlined below will split all multi-allelic sites into multiple bi-allelic sites (bcftools norm -m"-any"), validate all the reference alleles against the respective reference genome and drop any sites that do not validate. It also normalises INDELs to a standard format (bcftools norm -c x -f "<PATH TO REF GENOME>"). In the final mapping files, only sites that overlap with dbSNP will have an rsID assigned. If a site had an rsID assigned in one of the non-dbSNP sources but it does not overlap with a dbSNP site, then the rsID of the non-dbSNP source is set to missing (.). Therefore, in the final normalised GWAS files, the user can be sure that the rsIDs are assigned only from dbSNP sites.
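In outline, the normalisation that the build scripts perform internally amounts to something like the following (the input file name is a placeholder):

# Split multi-allelic sites into bi-allelic records, left-align/normalise
# INDELs against the reference genome and exclude (-c x) any site whose REF
# allele does not match the reference.
bcftools norm -m'-any' -c x -f Homo_sapiens.GRCh38.dna.toplevel.fa.gz \
    -O z -o normalised.vcf.gz input.vcf.gz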
Using the procedure below, the VEP fields are slightly truncated to save some space and only output the information used by the pipeline. The VEP string in the INFO field is: "Allele, Consequence, Gene, Feature_type, Feature, SIFT, PolyPhen", although the pipeline only uses Consequence, SIFT and PolyPhen, and the structure of the VEP string is checked prior to the mapping annotation taking place. VEP strings are delimited by pipes (|).
CADD annotations are not added using VEP (as it is far too slow); instead an additional script is used for this. It adds a CADD annotation to the INFO field, with two entries: CADD_PHRED, CADD_RAW. CADD strings are delimited by pipes (|).
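As an illustration (not part of the build itself), the CSQ and CADD INFO entries of a finished mapping file can be inspected with bcftools query:

# Print position, rsID, the truncated VEP string and the CADD string for the
# first few records of the mapping file.
$ bcftools query -f '%CHROM\t%POS\t%ID\t%INFO/CSQ\t%INFO/CADD\n' gwas_norm.v20220402.biallelic.vep.b38.vcf.gz | head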
An excerpt from a final mapping file is shown below:
##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##VEP="v102" time="2021-01-24 20:35:18" cache="/lustre/projects/DTAdb/resources/vep/homo_sapiens/102_GRCh38" ensembl-funcgen=102.6bd93a0 ensembl-variation=102.2716d2e ensembl=102.347f9ed ensembl-io=102.ff1cf96 1000genomes="phase3" COSMIC="91" ClinVar="202006" ESP="V2-SSA137" HGMD-PUBLIC="20194" assembly="GRCh38.p13" dbSNP="153" gencode="GENCODE 36" genebuild="2014-07" gnomAD="r2.1" polyphen="2.2.2" regbuild="1.0" sift="sift5.2.2"
##contig=<ID=1,length=248956422>
##contig=<ID=10,length=133797422>
##contig=<ID=11,length=135086622>
##contig=<ID=12,length=133275309>
##contig=<ID=13,length=114364328>
##contig=<ID=14,length=107043718>
##contig=<ID=15,length=101991189>
##contig=<ID=16,length=90338345>
##contig=<ID=17,length=83257441>
##contig=<ID=18,length=80373285>
##contig=<ID=19,length=58617616>
##contig=<ID=2,length=242193529>
##contig=<ID=20,length=64444167>
##contig=<ID=21,length=46709983>
##contig=<ID=22,length=50818468>
##contig=<ID=3,length=198295559>
##contig=<ID=4,length=190214555>
##contig=<ID=5,length=181538259>
##contig=<ID=6,length=170805979>
##contig=<ID=7,length=159345973>
##contig=<ID=8,length=145138636>
##contig=<ID=9,length=138394717>
##contig=<ID=MT,length=16569>
##contig=<ID=X,length=156040895>
##contig=<ID=Y,length=57227415>
##INFO=<ID=dbSNPBuildID,Number=1,Type=Integer,Description="First dbSNP Build for RS">
##INFO=<ID=SAO,Number=1,Type=Integer,Description="Variant Allele Origin: 0 - unspecified, 1 - Germline, 2 - Somatic, 3 - Both">
##INFO=<ID=SSR,Number=1,Type=Integer,Description="Variant Suspect Reason Codes (may be more than one value added together) 0 - unspecified, 1 - Paralog, 2 - byEST, 4 - oldAlign, 8 - Para_EST, 16 - 1kg_failed, 1024 - other">
##INFO=<ID=CLNHGVS,Number=.,Type=String,Description="Variant names from HGVS. The order of these variants corresponds to the order of the info in the other clinical INFO tags.">
##INFO=<ID=CLNVI,Number=.,Type=String,Description="Variant Identifiers provided and maintained by organizations outside of NCBI, such as OMIM. Source and id separated by colon (:). Each identifier is separated by a vertical bar (|)">
##INFO=<ID=CLNORIGIN,Number=.,Type=String,Description="Allele Origin. One or more of the following values may be summed: 0 - unknown; 1 - germline; 2 - somatic; 4 - inherited; 8 - paternal; 16 - maternal; 32 - de-novo; 64 - biparental; 128 - uniparental; 256 - not-tested; 512 - tested-inconclusive; 1073741824 - other">
##INFO=<ID=CLNSIG,Number=.,Type=String,Description="Variant Clinical Significance, 0 - Uncertain significance, 1 - not provided, 2 - Benign, 3 - Likely benign, 4 - Likely pathogenic, 5 - Pathogenic, 6 - drug response, 8 - confers sensitivity, 9 - risk-factor, 10 - association, 11 - protective, 12 - conflict, 13 - affects, 255 - other">
##INFO=<ID=CLNDISDB,Number=.,Type=String,Description="Variant disease database name and ID, separated by colon (:)">
##INFO=<ID=CLNDN,Number=.,Type=String,Description="Preferred ClinVar disease name">
##INFO=<ID=CLNREVSTAT,Number=.,Type=String,Description="ClinVar Review Status: no_assertion - No asserition provided by submitter, no_criteria - No assertion criteria provided by submitter, single - Classified by single submitter, mult - Classified by multiple submitters, conf - Criteria provided conflicting interpretations, exp - Reviewed by expert panel, guideline - Practice guideline">
##INFO=<ID=CLNACC,Number=.,Type=String,Description="For each allele (comma delimited), this is a pipe-delimited list of the Clinvar RCV phenotype accession.version strings associated with that allele.">
##INFO=<ID=DS,Number=.,Type=String,Description="Datasets merged at the site">
##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence annotations from Ensembl VEP. Format: Allele|Consequence|Gene|Feature_type|Feature|SIFT|PolyPhen">
##INFO=<ID=CADD,Number=.,Type=String,Description="CADD Consequences. Format: CADD_PHRED,CADD_RAW">
##FORMAT=<ID=AN,Number=1,Type=Integer,Description="Total allele count for the population, including REF">
##FORMAT=<ID=AC,Number=A,Type=Integer,Description="Allele count for each ALT allele for the population">
##bcftools_viewVersion=1.11+htslib-1.11
##bcftools_viewCommand=view -h gwas_norm.biallelic.b38.1.1.10001-1377535.vcf.gz; Date=Mon Jan 25 10:00:19 2021
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT UKBB_EUR ALFA_EUR ALFA_AFO ALFA_EAS ALFA_AFA ALFA_LAC ALFA_LEN ALFA_OAS ALFA_SAS ALFA_OTR ALFA_AFR ALFA_ASN ALFA_TOT GNOMAD31_AFR GNOMAD31_AMI GNOMAD31_AMR GNOMAD31_ASJ GNOMAD31_EAS GNOMAD31_FIN GNOMAD31_MID GNOMAD31_NFE GNOMAD31_OTH GNOMAD31_RAW GNOMAD31_SAS 1KG_AFR 1KG_AMR 1KG_EAS 1KG_EUR 1KG_SAS 1KG_ACB 1KG_ASW 1KG_BEB 1KG_CDX 1KG_CEU 1KG_CHB 1KG_CHS 1KG_CLM 1KG_ESN 1KG_FIN 1KG_GBR 1KG_GIH 1KG_GWD 1KG_IBS 1KG_ITU 1KG_JPT 1KG_KHV 1KG_LWK 1KG_MSL 1KG_MXL 1KG_PEL 1KG_PJL 1KG_PUR 1KG_STU 1KG_TSI 1KG_YRI
1 10001 rs1570391677 T A . . dbSNPBuildID=154;SSR=0;DS=dbsnp;CSQ=A|upstream_gene_variant|ENSG00000223972|Transcript|ENST00000450305||,A|upstream_gene_variant|ENSG00000223972|Transcript|ENST00000456328||,A|downstream_gene_variant|ENSG00000227232|Transcript|ENST00000488147||;CADD=0.702541|8.478 AN:AC .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:.
1 10001 . T C . . DS=alfa;CSQ=C|upstream_gene_variant|ENSG00000223972|Transcript|ENST00000450305||,C|upstream_gene_variant|ENSG00000223972|Transcript|ENST00000456328||,C|downstream_gene_variant|ENSG00000227232|Transcript|ENST00000488147||;CADD=0.750954|8.921 AN:AC .:. 7618:0 108:0 84:0 2708:0 146:0 610:0 24:0 94:0 470:0 2816:0 108:0 11862:0 .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:.
1 10002 rs1570391692 A C . . dbSNPBuildID=154;SSR=0;DS=dbsnp;CSQ=C|upstream_gene_variant|ENSG00000223972|Transcript|ENST00000450305||,C|upstream_gene_variant|ENSG00000223972|Transcript|ENST00000456328||,C|downstream_gene_variant|ENSG00000227232|Transcript|ENST00000488147||;CADD=0.713993|8.583 AN:AC .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:.
1 10003 rs1570391694 A C . . dbSNPBuildID=154;SSR=0;DS=dbsnp;CSQ=C|upstream_gene_variant|ENSG00000223972|Transcript|ENST00000450305||,C|upstream_gene_variant|ENSG00000223972|Transcript|ENST00000456328||,C|downstream_gene_variant|ENSG00000227232|Transcript|ENST00000488147||;CADD=0.714485|8.588 AN:AC .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:.
1 10007 . T C . . DS=alfa;CSQ=C|upstream_gene_variant|ENSG00000223972|Transcript|ENST00000450305||,C|upstream_gene_variant|ENSG00000223972|Transcript|ENST00000456328||,C|downstream_gene_variant|ENSG00000227232|Transcript|ENST00000488147||;CADD=0.755890|8.967 AN:AC .:. 7618:0 108:0 84:0 2708:0 146:0 610:0 24:0 94:0 470:0 2816:0 108:0 11862:0 .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:. .:.
Download reference genomes#
Many of the commands below will use bcftools norm to validate the reference allele against the reference genome and remove any variants where it does not validate. So, we will download the reference genomes for GRCh38, GRCh37 and NCBI36.
GRCh38:
wget ftp://ftp.ensembl.org/pub/release-102/fasta/homo_sapiens/dna_index/Homo_sapiens.GRCh38.dna.toplevel.fa.gz
wget ftp://ftp.ensembl.org/pub/release-102/fasta/homo_sapiens/dna_index/Homo_sapiens.GRCh38.dna.toplevel.fa.gz.fai
wget ftp://ftp.ensembl.org/pub/release-102/fasta/homo_sapiens/dna_index/Homo_sapiens.GRCh38.dna.toplevel.fa.gz.gzi
GRCh37: this was not indexed or bgzipped, so it had to be processed after download (the same was done for NCBI36):
wget ftp://ftp.ensembl.org/pub/grch37/current/fasta/homo_sapiens/dna/Homo_sapiens.GRCh37.dna.toplevel.fa.gz
mv Homo_sapiens.GRCh37.dna.toplevel.fa.gz hg.b37.fa.gz
zcat hg.b37.fa.gz | bgzip > Homo_sapiens.GRCh37.dna.toplevel.fa.gz
samtools faidx Homo_sapiens.GRCh37.dna.toplevel.fa.gz
rm hg.b37.fa.gz
Installing helper bash scripts for building the mapping files#
I hope to provide a download link for the mapping files. However, until then, they will need to be built using this procedure. It is fully documented and you do not have to use all the resources in the documentation. However, if you do follow the procedure then you will need to have the Python package installed as documented above and the bash scripts in ./resources/bin available in your PATH (where . is the root of the cloned git repository). These scripts also need the bash-helpers repo in your PATH, as well as shflags (which is a very nice bash command line argument handler). Putting something in your PATH involves editing either your ~/.bashrc file or your ~/.bash_profile file (depending on which you use). For example, to add the bash scripts for building the mapping files you should add something like this:
PATH="/path/to/gwas-norm/resources/bin:${PATH}"
export PATH
where /path/to/ is where you cloned the repository. If you did not clone the repository and instead installed via conda, then it is easiest just to clone it (but not install it with pip) and use the ./resources/bin scripts.
Mapping file data#
The process for incorporating various data resources into the mapping file is detailed below. Whilst you can follow it exactly and have a finished mapping file at the end, you might want to add your own resources; if that is the case, please keep in mind the expected format of the mapping file and the requirements for sorting.
dbSNP#
The dbSNP content forms the backbone of the mapping file. It is provided in multiple formats, a JSON format and a VCF format. Currently only the VCF format is used but in future I will move to the JSON format as I think it might contain rsID synonyms (although I will have to research this a bit). The tabix indexes will not actually be used in the processing but will be downloaded for completeness.
To download all the files for GRCh38 version of dbSNP:
$ wget https://ftp.ncbi.nih.gov/snp/latest_release/VCF/GCF_000001405.39.gz
$ wget https://ftp.ncbi.nih.gov/snp/latest_release/VCF/GCF_000001405.39.gz.tbi
$ wget https://ftp.ncbi.nih.gov/snp/latest_release/VCF/GCF_000001405.39.gz.md5
$ wget https://ftp.ncbi.nih.gov/snp/latest_release/VCF/GCF_000001405.39.gz.tbi.md5
To download all the files for the GRCh37 version of dbSNP:
$ wget https://ftp.ncbi.nih.gov/snp/latest_release/VCF/GCF_000001405.25.gz
$ wget https://ftp.ncbi.nih.gov/snp/latest_release/VCF/GCF_000001405.25.gz.tbi
$ wget https://ftp.ncbi.nih.gov/snp/latest_release/VCF/GCF_000001405.25.gz.md5
$ wget https://ftp.ncbi.nih.gov/snp/latest_release/VCF/GCF_000001405.25.gz.tbi.md5
The chromosomes in the VCF file are expressed as NCBI assembly IDs; these will be processed into more familiar chr_name designations. For this, a conversion file will need to be downloaded (assembly mapper).
For GRCh38:
$ wget https://ftp.ncbi.nih.gov/genomes/refseq/vertebrate_mammalian/Homo_sapiens/latest_assembly_versions/GCF_000001405.39_GRCh38.p13/GCF_000001405.39_GRCh38.p13_assembly_report.txt
For GRCh37:
$ wget https://ftp.ncbi.nih.gov/genomes/refseq/vertebrate_mammalian/Homo_sapiens/all_assembly_versions/GCF_000001405.25_GRCh37.p13/GCF_000001405.25_GRCh37.p13_assembly_report.txt
There is a Python script for doing this. However, do not use it directly if you are using the process-dbsnp.sh bash wrapper script. See the documentation for more information.
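For illustration only (this is not what the Python script or process-dbsnp.sh actually run), the assembly report can be turned into a two-column rename map and applied with bcftools annotate:

# Build a "RefSeq accession <TAB> sequence name" map from the assembly report
# (column 7 is the RefSeq accession, column 1 the assigned sequence name) and
# use it to rename the contigs in the dbSNP VCF.
grep -v '^#' GCF_000001405.39_GRCh38.p13_assembly_report.txt \
    | sed 's/\r$//' \
    | awk -F'\t' '{print $7"\t"$1}' > chr_rename.b38.txt
bcftools annotate --rename-chrs chr_rename.b38.txt -O z -o dbsnp_renamed.b38.vcf.gz GCF_000001405.39.gz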
For GRCh38, the following command was run:
$ process-dbsnp.sh --steps 1000000 -c -T ~/Scratch/tmp/ -v "GCF_000001405.39.gz" "GCF_000001405.39_GRCh38.p13_assembly_report.txt" ../../../../../human_genome/b38/Homo_sapiens.GRCh38.dna.toplevel.fa.gz "dbsnp.biallelic.v155.b38.vcf.gz"
For GRCh37, the following command was run:
$ process-dbsnp.sh --steps 1000000 -c -T ~/Scratch/tmp/ -v GCF_000001405.25.gz GCF_000001405.25_GRCh37.p13_assembly_report.txt ../../../../../human_genome/b37/Homo_sapiens.GRCh37.dna.toplevel.fa.gz dbsnp.biallelic.v155.b37.vcf.gz
On the HPC (grid-engine) these were run with:
$ qsub -cwd -S "/bin/bash" -N "<NAME>" -l h_rt="10:00:00" -l mem="8G" -pe smp 2 -o ~/Scratch/' $JOB_NAME'.out -e ~/Scratch/'$JOB_NAME'.err <PROCESS COMMAND>
Where <NAME> was either dbsnp155_b38 or dbsnp155_b37 respectively, and <PROCESS COMMAND> was one of the two commands above (but with the full path to process-dbsnp.sh).
Currently process-dbsnp.sh does not put the VCF file in the correct sort order; therefore, the output VCF files need to be sorted such that the chromosomes are in string sort order. There is another bash script for doing this called vcf-sort.sh. For GRCh37, this job was run:
$ qsub -cwd -S "/bin/bash" -N "dbsnp_sort_b37" -l h_rt="24:00:00" -l mem="8G" -o ~/Scratch/' $JOB_NAME'.out -e ~/Scratch/'$JOB_NAME'.err vcf-sort.sh -T ~/Scratch/tmp/ dbsnp.biallelic.v155.b37.vcf.gz dbsnp_sorted_b37.vcf.gz
$ mv dbsnp_sorted_b37.vcf.gz dbsnp.biallelic.v155.b37.vcf.gz
For GRCh38:
$ qsub -cwd -S "/bin/bash" -N "dbsnp_sort_b38" -l h_rt="24:00:00" -l mem="8G" -o ~/Scratch/' $JOB_NAME'.out -e ~/Scratch/'$JOB_NAME'.err vcf-sort.sh -T ~/Scratch/tmp/ dbsnp.biallelic.v155.b38.vcf.gz dbsnp_sorted_b38.vcf.gz
$ mv dbsnp_sorted_b38.vcf.gz dbsnp.biallelic.v155.b38.vcf.gz
Both the GRCh38 and the GRCh37 output files were tabix indexed after completion:
$ tabix -p"vcf" dbsnp.biallelic.v155.b38.vcf.gz
$ tabix -p"vcf" dbsnp.biallelic.v155.b37.vcf.gz
The output report from bcftools norm for both genome builds and current/previous dbSNP versions is shown below:
GRCh38 (v155) - total/split/realigned/skipped: 1053471122/145710568/17985666/1591057
GRCh38 (v154) - total/split/realigned/skipped: 701206150/66148905/11210568/958052
GRCh37 (v155) - total/split/realigned/skipped: 1032379139/138117126/17956788/1505604
GRCh37 (v154) - total/split/realigned/skipped: 689821210/63751490/11199525/913045
ALFA#
Data from the ALFA project is also incorporated into the mapping file. ALFA is an Allele Frequency Aggregator that gathers allele frequencies of variants from populations used in GWAS analyses and partitions them into broad ethnic groups. The latest version (06-01-2021) provides a VCF file that contains ~900M rows and 12 population groups. The ALFA data is only available on genome assembly GRCh38, so it will be processed into the desired format and then lifted down to GRCh37.
The data was downloaded as follows:
$ wget https://ftp.ncbi.nih.gov/snp/population_frequency/latest_release/freq.vcf.gz
$ wget https://ftp.ncbi.nih.gov/snp/population_frequency/latest_release/freq.vcf.gz.tbi
$ wget https://ftp.ncbi.nih.gov/snp/population_frequency/latest_release/freq.vcf.gz.md5
$ wget https://ftp.ncbi.nih.gov/snp/population_frequency/latest_release/freq.vcf.gz.tbi.md5
The tabix index will not actually be used in the processing but was downloaded for completeness.
Check the MD5:
$ md5sum -c freq.vcf.gz.md5
freq.vcf.gz: OK
$ md5sum -c freq.vcf.gz.tbi.md5
freq.vcf.gz.tbi: OK
As with dbSNP, the chromosomes in ALFA are represented in NCBI format and so will need to be converted to the regular chromosome nomenclature. A helper Python script (format-alfa) exists to remap the chromosomes to the standard identifiers and to remap the ALFA sample names to something more human readable. However, this script is run via a bash wrapper script (process-alfa.sh) that also performs several other functions. See its documentation for details.
The command was run with:
$ process-alfa.sh -c -v -T ~/Scratch/tmp/ --steps 1000000 freq.vcf.gz GCF_000001405.39_GRCh38.p13_assembly_report.txt ../../../../../human_genome/b38/Homo_sapiens.GRCh38.dna.toplevel.fa.gz ../alfa0121.biallelic.b38.vcf.gz
And on the grid-engine HPC:
$ qsub -cwd -S "/bin/bash" -N "alfa_b38" -l h_rt="10:00:00" -l mem="8G" -pe smp 2 -o ~/Scratch/'$JOB_NAME'.out -e ~/Scratch/'$JOB_NAME'.err ~/bin/process-alfa.sh <ARGS AS ABOVE>
The bcftools norm output for ALFA is shown below:
GRCh38 (06-01-2021) - total/split/realigned/skipped: 904081572/108266225/13083399/1377859
Lifting ALFA to GRCh37#
ALFA is only available on assembly GRCh38; therefore, the processed file needs to be lifted down to GRCh37. I have a couple of scripts to help with this. They use CrossMap for the liftover but not the VCF handling; see the documentation for details.
$ vcf-lift.sh -T ~/Scratch/tmp -o ../b37/alfa0121.biallelic.b37.vcf.gz -O 'z' alfa0121.biallelic.b38.vcf.gz ../../../../human_genome/b37/Homo_sapiens.GRCh37.dna.toplevel.fa.gz ../../../../chain_files/GRCh38_to_GRCh37.chain.gz
$ qsub -cwd -S "/bin/bash" -N "alfa_b37_lift" -l h_rt="10:00:00" -l mem="4G" -pe smp 4 -o ~/Scratch/'$JOB_NAME'.out -e ~/Scratch/'$JOB_NAME'.err ~/bin/vcf-lift.sh <ARGS AS ABOVE>
GNOMAD2 exome allele counts#
Allele counts for variants in selected gnomad2 exome populations were also incorporated into the mapping file. For this, the chromosome-combined VCF file was downloaded from Gnomad and processed with a bash script (process-gnomad2.sh). These files are about ~85Gb and are available from Amazon, Google or Azure endpoints; the Google ones are shown below. Please note, the tabix index files are important as they are used to gather the contigs represented in the file.
For GRCh38:
$ wget https://storage.googleapis.com/gcp-public-data--gnomad/release/2.1.1/liftover_grch38/vcf/exomes/gnomad.exomes.r2.1.1.sites.liftover_grch38.vcf.bgz
$ wget https://storage.googleapis.com/gcp-public-data--gnomad/release/2.1.1/liftover_grch38/vcf/exomes/gnomad.exomes.r2.1.1.sites.liftover_grch38.vcf.bgz.tbi
For GRCh37:
$ wget https://storage.googleapis.com/gcp-public-data--gnomad/release/2.1.1/vcf/exomes/gnomad.exomes.r2.1.1.sites.vcf.bgz
$ wget https://storage.googleapis.com/gcp-public-data--gnomad/release/2.1.1/vcf/exomes/gnomad.exomes.r2.1.1.sites.vcf.bgz.tbi
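For example, the contigs recorded in a downloaded file can be listed directly from its tabix index:

# List the chromosome/contig names present in the gnomad2 exomes VCF.
$ tabix -l gnomad.exomes.r2.1.1.sites.vcf.bgz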
To process the GRCh37 file:
$ process-gnomad2.sh -e -T ~/Scratch/tmp/ gnomad.exomes.r2.1.1.sites.vcf.bgz ../../../human_genome/b37/Homo_sapiens.GRCh37.dna.toplevel.fa.gz gnomad2ex.biallelic.b37.vcf.gz
And on the HPC (please note, I vastly overestimated the HPC time; the GRCh37 version completed in ~1.25h):
$ qsub -cwd -S "/bin/bash" -N "gnomad2_ex_b37" -e ~/Scratch/'$JOB_NAME'.err -o ~/Scratch/'$JOB_NAME'.out -pe smp 2 -l mem="4G" -l h_rt="48:00:00" process-gnomad2.sh -v -e -T ~/Scratch/tmp/ gnomad.exomes.r2.1.1.sites.vcf.bgz ../../../../../human_genome/b37/Homo_sapiens.GRCh37.dna.toplevel.fa.gz ../gnomad2ex.biallelic.b37.vcf.gz
To process the GRCh38 file, the same approach was first attempted with the commands below:
$ process-gnomad2.sh -e -T ~/Scratch/tmp/ gnomad.exomes.r2.1.1.sites.liftover_grch38.vcf.bgz ../../../human_genome/b38/Homo_sapiens.GRCh38.dna.toplevel.fa.gz gnomad2ex.biallelic.b38.vcf.gz
And on the HPC:
$ qsub -cwd -S "/bin/bash" -N "gnomad2_ex_b38" -e ~/Scratch/'$JOB_NAME'.err -o ~/Scratch/'$JOB_NAME'.out -pe smp 2 -l mem="4G" -l h_rt="48:00:00" process-gnomad2.sh -v -e -T ~/Scratch/tmp/ gnomad.exomes.r2.1.1.sites.liftover_grch38.vcf.bgz ../../../human_genome/b38/Homo_sapiens.GRCh38.dna.toplevel.fa.gz gnomad2ex.biallelic.b38.vcf.gz
However, these failed with the errors below. It seems bcftools, not sure if bcftools norm or bcftools sort (I am guessing sort), gives the error The sequence "1_KI270706v1_random" was not found and then the whole thing segfaults. This is a strange one: I am stripping the leading chr from all the chromosomes, but I wrote out some of the intermediate files and they have 7 instances of 1_KI270706v1_random in there. So I assume that bcftools norm is removing them as they do not align with the reference assembly; however, they are retained in the contigs in the header. As we are piping directly into sort, it errors out because they are in the header but no longer in the file. I could adapt the script to remove all chromosomes not in the reference assembly from the VCF header - this is one for the future. However, for now, I will work around it by simply lifting over the GRCh37 file (which is the original file anyway).
$ cat rerun_gnomad2_ex_b38.err
=== 690168 ===
[info] started at Fri 1 Apr 12:02:14 BST 2022
[info] temp directory: /home/rmjdcfi/Scratch/tmp/
[info] working temp directory: /home/rmjdcfi/Scratch/tmp/tmp.vpymnRBLaC
[info] input VCF: /lustre/projects/DTAdb/resources/mapping_files/source_files/gnomad2/b38/original_files/gnomad.exomes.r2.1.1.sites.liftover_grch38.vcf.bgz
[info] reference genome: /lustre/projects/DTAdb/resources/human_genome/b38/Homo_sapiens.GRCh38.dna.toplevel.fa.gz
[info] output file: /lustre/projects/DTAdb/resources/mapping_files/source_files/gnomad2/b38/gnomad2ex.biallelic.b38.vcf.gz
[info] will extract 9 pop(s): afr,amr,asj,eas,fin,nfe,oth,raw,sas
[info] processing gnomad2ex
[info] generating VCF header
[info] generating allele counts...
[info] normalising final VCF...
Writing to /home/rmjdcfi/Scratch/tmp/8yoRxV
[E::faidx_adjust_position] The sequence "1_KI270706v1_random" was not found
faidx_fetch_seq failed at 1_KI270706v1_random:45952
/var/opt/sge/node-h00a-037/job_scripts/690168: line 192: 55175 Broken pipe zcat "$temp_header" "$temp_counts"
55176 Exit 255 | bcftools norm -m'-any' -cx -f "$reference_genome_file"
55177 Segmentation fault (core dumped) | bcftools sort "${SORT_ARGS[@]}" -o "$temp_final" -O "z"
[info] removing /home/rmjdcfi/Scratch/tmp/tmp.vpymnRBLaC
[fatal] error on or near, '192'
!!!! ERROR EXIT !!!!
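For reference, the header fix alluded to above might look something like the sketch below (this was not used here; the liftover described next was used instead, and the file names are placeholders):

# Hypothetical workaround: keep only the ##contig lines whose ID is present in
# the reference .fai index, then reheader the intermediate VCF before sorting.
cut -f1 Homo_sapiens.GRCh38.dna.toplevel.fa.gz.fai > ref_contigs.txt
bcftools view -h intermediate.vcf.gz \
    | awk -F'[=,]' 'NR==FNR {ok[$1]=1; next} !/^##contig/ || ok[$3]' ref_contigs.txt - \
    > fixed_header.txt
bcftools reheader -h fixed_header.txt -o fixed.vcf.gz intermediate.vcf.gz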
This is the command used for the liftover; it uses the vcf-lift.sh script that has been used previously.
$ qsub -cwd -S "/bin/bash" -N "gnomad2ex_lift_to_b38" -e ~/Scratch/'$JOB_NAME'.err -o ~/Scratch/'$JOB_NAME'.out -pe smp 4 -l mem="4G" -l h_rt="10:00:00" vcf-lift.sh -T ~/Scratch/tmp/ -o gnomad2ex.biallelic.b38.vcf.gz -O 'z' -f process_logs/liftover_fails.txt.gz ../b37/gnomad2ex.biallelic.b37.vcf.gz ../../../../human_genome/b38/Homo_sapiens.GRCh38.dna.toplevel.fa.gz ../../../../chain_files/GRCh37_to_GRCh38.chain.gz
GNOMAD3 allele counts#
Allele counts for variants in gnomad3 will also be incorporated into the mapping file. The gnomad3 data is split per chromosome and, due to its size, it is particularly awkward to handle. There is a script, process-gnomad3.sh, to do this. It was run with the command below using the default populations (where gnomad_processed_files is the output directory):
$ process-gnomad3.sh -v human_genome/b38/Homo_sapiens.GRCh38.dna.toplevel.fa.gz gnomad_processed_files
However, due to the run time, process-gnomad3.sh was parallelised using the hpc-tools framework, running two concurrent tasks. This requires a job array file, which was created with the gnarly bash command below:
# chrs holds the chromosome names to process (an assumed example is given here):
chrs=(1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X Y MT)
echo -e "rowidx\tinfile\toutfile\tchr_name" > gnomad3-1.ja
idx=1
for i in "${chrs[@]}"; do
    echo -e "$idx\t$(readlink -f human_genome/b38/Homo_sapiens.GRCh38.dna.toplevel.fa.gz)\t$(readlink -f ../gnomad3-1.chr${i}.biallelic.b38.vcf.gz)\t$i"
    idx=$(( idx + 1 ))
done >> gnomad3-1.ja
Then submitted with:
pyjasub -v --batch 2 --mem "8G" --time '15:00:00' --tmp_dir ~/Scratch/tmp/ --log_dir ~/Scratch/job_logs/gnomad3-1_proc --ja_file process_logs/gnomad3-1.ja --task_script ~/code/gwas_norm/resources/bin/gnomad.task.sh --infiles infile --outfiles outfile
Where gwas_norm/resources/bin/gnomad.task.sh is located in this repository. The individual chromosome data were concatenated using the vcf-cat.sh bash wrapper script. Note that gnomad_processed_files/*.vcf.gz gives the chromosome files in string sort order.
qsub -cwd -S "/bin/bash" -N "gnomad_cat" -e ~/Scratch/'$JOB_NAME'.err -o ~/Scratch/'$JOB_NAME'.out -pe smp 2 -l mem="4G" -l h_rt="10:00:00" ~/code/gwas_norm/resources/bin/vcf-cat.sh -v --delete -T ~/Scratch/tmp/ gnomad3-1.biallelic.b38.vcf.gz gnomad_processed_files/*.vcf.gz
The joined Gnomad data was also lifted down to GRCh37 using another bash wrapper, vcf-lift.sh, and submitted to the HPC as follows:
qsub -cwd -S "/bin/bash" -N "gnomad_b37_lift" -e ~/Scratch/'$JOB_NAME'.err -o ~/Scratch/'$JOB_NAME'.out -pe smp 4 -l mem="4G" -l h_rt="10:00:00" vcf-lift.sh -v -T ~/Scratch/tmp -o b37/gnomad3-1.biallelic.b37.vcf.gz -O 'z' b38/gnomad3-1.biallelic.b38.vcf.gz /human_genome/b37/Homo_sapiens.GRCh37.dna.toplevel.fa.gz /chain_files/GRCh38_to_GRCh37.chain.gz
1000 genomes allele counts#
Allele counts are generated for each of the super- and sub-populations in the 1000 genomes using the process-1000g.sh script. This will download and process all of the files. It was run on the HPC as follows:
For GRCh37:
$ qsub -cwd -S "/bin/bash" -N "thousand_g_b37" -l h_rt="36:00:00" -l mem="8G" -pe smp 2 -o ~/Scratch/'$JOB_NAME'.out -e ~/Scratch/'$JOB_NAME'.err ~/code/gwas_norm/resources/bin/process-1000g.sh -v --tmp-dir ~/Scratch/tmp/ --download-dir resources/mapping_files/source_files/1000g/b37/original_files/ --keep-vcf b37 resources/human_genome/b37/Homo_sapiens.GRCh37.dna.toplevel.fa.gz resources/mapping_files/source_files/1000g/b37
For GRCh38:
$ qsub -cwd -S "/bin/bash" -N "thousand_g_b38" -l h_rt="36:00:00" -l mem="8G" -pe smp 2 -o ~/Scratch/'$JOB_NAME'.out -e ~/Scratch/'$JOB_NAME'.err ~/code/gwas_norm/resources/bin/process-1000g.sh -v --tmp-dir ~/Scratch/tmp/ --download-dir resources/mapping_files/source_files/1000g/b38/original_files/ --keep-vcf b38 resources/human_genome/b38/Homo_sapiens.GRCh38.dna.toplevel.fa.gz resources/mapping_files/source_files/1000g/b38/
The processed 1000 genomes chromosomes were then merged into a single file using the vcf-cat.sh script. This is a wrapper around bcftools concat that performs some input/output file checks and deletes the concatenated input files if all is good. This was submitted with qsub as below:
For GRCh37:
$ qsub -cwd -S "/bin/bash" -N "tkg_b37_merge" -l h_rt="15:00:00" -l mem="4G" -pe smp 2 -o ~/Scratch/'$JOB_NAME'.out -e ~/Scratch/'$JOB_NAME'.err ~/code/gwas_norm/resources/bin/vcf-cat.sh --delete -T ~/Scratch/tmp/ -v 1kg.biallelic.b37.vcf.gz 1kg.chr*.vcf.gz
For GRCh38:
$ qsub -cwd -S "/bin/bash" -N "tkg_b38_merge" -l h_rt="15:00:00" -l mem="4G" -pe smp 2 -o ~/Scratch/'$JOB_NAME'.out -e ~/Scratch/'$JOB_NAME'.err ~/code/gwas_norm/resources/bin/vcf-cat.sh --delete -T ~/Scratch/tmp/ -v 1kg.biallelic.b38.vcf.gz 1kg.chr*.vcf.gz
UKBB SNPstats files#
Allele frequencies from the UK BioBank will be incorporated into the final allele counts VCF file. These have been pre-calculated from the BGEN files using SNPstats. A script (format-snpstats) was written that will take one or more SNPSTATs summary files and create an allele counts VCF file from them.
We also need the UK BioBank counts data to be represented on the GRCh38 assembly, so this was lifted up to GRCh38 using vcf-lift.sh:
$ qsub -cwd -S "/bin/bash" -N "ukbb_eur_b37_b38_liftover" -l h_rt="01:30:00" -l mem="4G" -pe smp 4 -o ~/Scratch/'$JOB_NAME'.out -e ~/Scratch/'$JOB_NAME'.err vcf-lift.sh -o"../b38/ukbb_eur.biallelic.b38.vcf.gz" -O"z" -v -T ~/Scratch/tmp -f"../b38/process_logs/liftover_files.ukbb_eur.biallelic.b37.b38.txt.gz" ukbb_eur.biallelic.b37.vcf.gz /human_genome/b38/Homo_sapiens.GRCh38.dna.toplevel.fa.gz /chain_files/GRCh37_to_GRCh38.chain.gz
Merging all the data sources#
The data sources were merged into a single VCF file using a Python script, merge-count-vcfs. This is wrapped in a bash script, merge-count-vcfs.sh, which was used to actually run the merge. Before merging any of the VCFs, tabix indexes were created for all of them; this is to ensure that they all have the same sort order.
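The indexing itself is just a loop over the source counts files (paths shortened here for readability):

# Tabix index every source counts VCF before merging.
for f in dbsnp.biallelic.v155.b37.vcf.gz ukbb_eur.biallelic.b37.vcf.gz \
         alfa0121.biallelic.b37.vcf.gz gnomad3-1.biallelic.b37.vcf.gz \
         gnomad2ex.biallelic.b37.vcf.gz 1kg.biallelic.b37.vcf.gz; do
    tabix -p vcf "$f"
done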
The GRCh37 data sources were merged with this:
$ qsub -cwd -S "/bin/bash" -N "b37_merge" -l h_rt="36:00:00" -l mem="4G" -pe smp 2 -o ~/Scratch/'$JOB_NAME'.out -e ~/Scratch/'$JOB_NAME'.err \
merge-count-vcfs.sh -v -o gwas_norm.v20220402.biallelic.b37.vcf.gz --data-names "ukbb,alfa,gnomad3,gnomad2,1kg" --ref-name "dbsnp" --ref-genome ../../../human_genome/b37/Homo_sapiens.GRCh37.dna.toplevel.fa.gz -T ~/Scratch/tmp/ ../../source_files/dbsnp/b37/v155/dbsnp.biallelic.v155.b37.vcf.gz ../../source_files/ukbb_freq/b37/ukbb_eur.biallelic.b37.vcf.gz ../../source_files/alfa/b37/alfa0121.biallelic.b37.vcf.gz ../../source_files/gnomad31/b37/gnomad3-1.biallelic.b37.vcf.gz ../../source_files/gnomad2/b37/gnomad2ex.biallelic.b37.vcf.gz ../../source_files/1000g/b37/1kg.biallelic.b37.vcf.gz
The GRCh38 data sources were merged with this:
$ qsub -cwd -S "/bin/bash" -N "b37_merge" -l h_rt="36:00:00" -l mem="4G" -pe smp 2 -o ~/Scratch/'$JOB_NAME'.out -e ~/Scratch/'$JOB_NAME'.err \
merge-count-vcfs.sh -v -o gwas_norm.v20220402.biallelic.b38.vcf.gz --data-names "ukbb,alfa,gnomad3,gnomad2,1kg" --ref-name "dbsnp" --ref-genome ../../../human_genome/b38/Homo_sapiens.GRCh38.dna.toplevel.fa.gz -T ~/Scratch/tmp/ ../../source_files/dbsnp/b38/v155/dbsnp.biallelic.v155.b38.vcf.gz ../../source_files/ukbb_freq/b38/ukbb_eur.biallelic.b38.vcf.gz ../../source_files/alfa/b38/alfa0121.biallelic.b38.vcf.gz ../../source_files/gnomad31/b38/gnomad3-1.biallelic.b38.vcf.gz ../../source_files/gnomad2/b38/gnomad2ex.biallelic.b38.vcf.gz ../../source_files/1000g/b38/1kg.biallelic.b38.vcf.gz
Running VEP#
The merged VCF data sources were then annotated with the Ensembl Variant Effect Predictor (VEP) and CADD . Due to the size of the merged files (over 1 billion rows), this was parallelised using ~20,000 tasks each with about 50,000 variants to annotate. VEP was installed using these instructions. I was going to run using the CADD plugin but I had all sorts of issues with performance, so I wrote my own script to merge CADD data into VEP. I never managed to achieve the 100k variants/min that VEP supposedly can do.
Running VEP for GRCh37#
For GRCh37, the procedure is as follows: each parallel task will extract its defined regions from the merged VCF file, therefore the merged file was tabix indexed first of all.
$ tabix -p"vcf" gwas_norm.v20220402.biallelic.b37.vcf.gz
Then a job array file was created that describes the arguments to be used in each task. This was done using a bash script, make-vep-ja.sh, which wraps a Python script (make-site-chunks) that creates genomic chunks each containing a fixed number of variants.
$ qsub -cwd -S "/bin/bash" -N "vep_b37_job_array" -l h_rt="15:00:00" -l mem="4G" -o ~/Scratch/'$JOB_NAME'.out -e ~/Scratch/'$JOB_NAME'.err \
make-vep-ja.sh -n 50000 -T ~/Scratch/tmp/ -o process_logs/gwas_norm_vep.ja gwas_norm.v20220402.biallelic.b37.vcf.gz vep_chunks/ GRCh37 ../../../cadd/b37/whole_genome_SNVs.tsv.gz ../../../cadd/b37/gnomad.genomes.r2.1.1.indel.tsv.gz
The tasks were submitted with the command below. This also shows the output of the job submission process (with the paths truncated).
$ pyjasub -v --mem "2G" --time "07:00:00" --tmp_dir ~/Scratch/tmp/ --log_dir ~/Scratch/job_logs/vep_b37 --task_script ~/code/gwas_norm/resources/tasks/vep.task.sh --ja_file process_logs/gwas_norm_vep.ja --infiles infile --outfiles outfile
=== pyjasub (hpc_tools v0.1.1dev1) ===
[info] step: 1
[info] batch: None
[info] mem: 2G
[info] shell: /bin/bash
[info] time: 07:00:00
[info] nodes: None
[info] tmp_dir: ~/Scratch/tmp/
[info] scratch_size: None
[info] tmp_size: None
[info] project: None
[info] paid: None
[info] log_dir: ~/Scratch/job_logs/vep_b37
[info] ja_file: process_logs/gwas_norm_vep.ja
[info] task_script: ~/code/gwas_norm/resources/tasks/vep.task.sh
[info] infiles: ['infile']
[info] outfiles: ['outfile']
[info] full job log directory '~/job_logs/vep_b37'
[info] run started at: '2022-04-03 17:17:35.518636'
[info] current run number: '1'
[info] run log path: '~/vep_b37/run.log'
[info] run directory: '~/vep_b37/2_1649002655'
[info] run out files: '~/vep_b37/2_1649002655/out/$TASK_ID_2_1649002655.out'
[info] run err files: '~/vep_b37/2_1649002655/err/$TASK_ID_2_1649002655.err'
[info] full job array path: 'process_logs/gwas_norm_vep.ja'
[info] job array md5: 'c7fc30dc6e78142651effb929c121134'
[info] full task script path: '~/code/gwas_norm/resources/tasks/vep.task.sh'
[info] task script md5: '7fadfd0e6101e35f9c32bfef4044f154'
[info] previous task script md5: '7fadfd0e6101e35f9c32bfef4044f154'
[info] copying task script: '~/vep_b37/2_1649002655/7fadfd0e6101e35f9c32bfef4044f154.task'
[info] run job array path: '~/vep_b37/2_1649002655/run_job_array.job'
[info] check results below:
[info] input file present: '24391'
[info] input file absent: '0'
[info] output file present: '0'
[info] output file absent: '24391'
[info] total tasks: '24391'
[info] tasks to be submitted: '24391'
[run] You are about to run '24391' tasks over '24391' rows in a job array with a step of '1', using the following command:
[run] qsub -t 1-24391:1 -e ~/vep_b37/2_1649002655/err/$TASK_ID_2_1649002655.err -o ~/job_logs/vep_b37/2_1649002655/out/$
TASK_ID_2_1649002655.out -l mem=2G -S /bin/bash -l h_rt=07:00:00 ~/code/gwas_norm/resources/tasks/vep.task.sh ~/vep_b37/2_1649002655/run_job_array.job ~/Scratch/tmp 1
[run] if all this looks correct, press [ENTER] to submit or [CTRL-C] to quit >
[run] submitting job...
[run] submitted array job with ID 706453
[info] run log details below...
...
[info] *** job submission finished ***
*** END ***
Surprisingly, all the tasks completed successfully on the first run:
+----------------------------+--------------+------+-------+-----+----------+-------------+-----------------------+--------------+----------------+---------------+-----------------+---------------+
| run_time | run_dir | step | batch | mem | time | total_tasks | no_of_tasks_submitted | infile_exist | infile_missing | outfile_exist | outfile_missing | job_id |
+----------------------------+--------------+------+-------+-----+----------+-------------+-----------------------+--------------+----------------+---------------+-----------------+---------------+
| 2022-04-03 19:18:37.500647 | 1_1649009917 | 1 | | 2G | 07:00:00 | 24391 | 24391 | 24391 | 0 | 0 | 24391 | 706762 |
| 2022-04-04 13:58:09.936763 | 2_1649077089 | 1 | | 4G | 15:00:00 | 24391 | 0 | 24391 | 0 | 24391 | 0 | NOT_SUBMITTED |
+----------------------------+--------------+------+-------+-----+----------+-------------+-----------------------+--------------+----------------+---------------+-----------------+---------------+
The next step is to merge the 24391 VEP task outputs into a single file. bcftools does have a concatenation function; however, we do not need anything so sophisticated, as we know the sorted file order from the job array file and that all of the VEP task VCFs have the same header. Therefore, a simpler variant of vcf-cat.sh was used, called vcf-dumb-cat.sh. This script should only be used for jobs like this, where the origin and structure of the VCF files is well known.
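Conceptually, the “dumb” concatenation amounts to the sketch below (the chunk file names are hypothetical and assumed to sort in job-array order; the real script works from the file list recorded in the job array):

# Take the header from the first chunk, then append the data lines of every
# chunk in order and recompress.
{
    zcat chunk_00001.vcf.gz | grep '^#'
    for f in chunk_*.vcf.gz; do zcat "$f" | grep -v '^#'; done
} | bgzip > gwas_norm.v20220402.biallelic.vep.b37.vcf.gz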
$ qsub -cwd -S "/bin/bash" -N "vep_vcf_cat" -l h_rt="24:00:00" -l mem="2G" -o ~/Scratch/'$JOB_NAME'.out -e ~/Scratch/'$JOB_NAME'.err \
vcf-dumb-cat.sh -v -T ~/Scratch/tmp/ -f process_logs/vcf-dumb-cat-files.txt gwas_norm.v20220402.biallelic.vep.b37.vcf.gz
Finally, tabix index the file:
$ tabix -p"vcf" gwas_norm.v20220402.biallelic.vep.b37.vcf.gz
Running VEP for GRCh38#
The process for GRCh38 was pretty much identical to GRCh37; first, tabix index the merged file:
$ tabix -p"vcf" gwas_norm.v20220402.biallelic.b38.vcf.gz
Then create the job array file:
$ qsub -cwd -S "/bin/bash" -N "vep_b38_job_array" -l h_rt="15:00:00" -l mem="4G" -o ~/Scratch/'$JOB_NAME'.out -e ~/Scratch/'$JOB_NAME'.err \
make-vep-ja.sh -n 50000 -T ~/Scratch/tmp/ -o process_logs/gwas_norm_vep.ja gwas_norm.v20220402.biallelic.b38.vcf.gz vep_chunks/ GRCh38 ../../../cadd/b38/whole_genome_SNVs.tsv.gz ../../../cadd/b38/gnomad.genomes.r3.0.indel.tsv.gz
Then run the tasks; this time a step size of 2 was used for the first run:
$ pyjasub -v --mem "4G" --time "07:00:00" --tmp_dir ~/Scratch/tmp/ --log_dir ~/Scratch/job_logs/vep_b38 --task_script ~/code/gwas_norm/resources/tasks/vep.task.sh --ja_file /lustre/projects/DTAdb/resources/mapping_files/b38/v20220402/process_logs/gwas_norm_vep_b38.ja --infiles infile --outfiles outfile --step 2
Running VEP on the GRCh38 variants is more difficult than running on GRCh37. Some of the tasks took a very long time to run and one solitary task of ~50,000 variants did not finish within the 72h job time limit. This was for the region chr2:70020254-70125051. Therefore, this region was simply annotated with CADD scores and used as is. The final run log is shown below:
+----------------------------+--------------+------+-------+-----+----------+-------------+-----------------------+--------------+----------------+---------------+-----------------+---------------+
| run_time | run_dir | step | batch | mem | time | total_tasks | no_of_tasks_submitted | infile_exist | infile_missing | outfile_exist | outfile_missing | job_id |
+----------------------------+--------------+------+-------+-----+----------+-------------+-----------------------+--------------+----------------+---------------+-----------------+---------------+
| 2022-04-05 14:13:41.229093 | 1_1649164421 | 2 | | 4G | 07:00:00 | 12440 | 12440 | 24880 | 0 | 0 | 24880 | 716604 |
| 2022-04-06 08:16:22.372758 | 2_1649229382 | 1 | | 4G | 15:00:00 | 24880 | 6 | 24880 | 0 | 24874 | 6 | 722148 |
| 2022-04-07 08:09:55.976248 | 3_1649315395 | 1 | | 4G | 72:00:00 | 24880 | 1 | 24880 | 0 | 24879 | 1 | 730934 |
| 2022-04-21 09:19:59.978175 | 4_1650529199 | 1 | | 4G | 15:00:00 | 24880 | 0 | 24880 | 0 | 24880 | 0 | NOT_SUBMITTED |
+----------------------------+--------------+------+-------+-----+----------+-------------+-----------------------+--------------+----------------+---------------+-----------------+---------------+
The CADD run to finish the one remaining task used the same job array but a different task script:
pyjasub -v --mem "4G" --time "15:00:00" --tmp_dir ~/Scratch/tmp/ --log_dir ~/Scratch/job_logs/cadd_fill_b38 --task_script ~/code/gwas_norm/resources/tasks/cad.task.sh --ja_file
The issue of not being able to VEP-annotate certain regions of GRCh38 has been observed before with v102 of VEP (v105 was used above). In that case a lot more regions were affected, including the region affected with v105. I am not sure why; there are some regions that take ages to annotate, and overall there were 14 regions that I was unable to annotate with VEP within the 72h job time limit (some of these had only annotated ~4000 variants in that time). There were other regions that only just managed to finish within that time frame as well. These are listed below:
| chr_name | start_pos | end_pos |
|---|---|---|
| 2 | 69948179 | 70062825 |
| 2 | 70062826 | 70172885 |
| 2 | 144689850 | 144821972 |
| 2 | 144821974 | 144949272 |
| 2 | 144949273 | 145078739 |
| 2 | 145078740 | 145209919 |
| 21 | 16393050 | 16512767 |
| 3 | 107740352 | 107874157 |
| 4 | 85967761 | 86100203 |
| 4 | 86100211 | 86228031 |
| 4 | 86228035 | 86364788 |
| 8 | 127857056 | 127979976 |
| 8 | 129289944 | 129410854 |
| 8 | 129410855 | 129536753 |
Rather than try to figure out what is going on, these regions were annotated with CADD only and incorporated into the mapping file un-VEPed. Note that these issues do not occur with GRCh37. A separate CADD-only task was created and the regions were run with it.
Within the mapping files, there were also some variants that were missing. These seemed to be a subset of variants that did not have an ALT allele (not sure why) and were probably skipped by VEP. I am not concerned about these.
The VEP annotated chunks were combined into a single file with:
$ qsub -cwd -S "/bin/bash" -N "vep_vcf_cat_b38" -l h_rt="24:00:00" -l mem="2G" -o ~/Scratch/'$JOB_NAME'.out -e ~/Scratch/'$JOB_NAME'.err vcf-dumb-cat.sh -v -T ~/Scratch/tmp/ -f process_logs/vcf_dumb_cat_files.b38.txt gwas_norm.v20220402.biallelic.vep.b38.vcf.gz
Splitting the mapping file into common/rare#
It makes sense from a GWAS perspective to have a common-variant mapping file and a rare one. Most of the variants in the mapping file are rare, but most GWAS variants will be common, so there is no point massively increasing the search space unnecessarily. So, the common file will be the main mapping file, with the rare file (or the original file with all variants) being used for occasional lookups.
For this, a Python script was used, split-mapping-file. The default arguments were used, namely that a variant site is classed as common if at least one of the populations has a MAF >= 0.01 or a MAC >= 50. Both criteria are used to cover cases where the variant is rare but the population is large enough for there to be enough carriers for it to be a valid observation; for example, a population with AN = 10,000 and AC = 60 has MAF = 0.006 (below the 0.01 cut-off) but MAC = 60, so the site is still classed as common.
For GRCh37, this command was used. Note that the run time is important here; this takes a while as the script will also tabix index both output files:
$ qsub -cwd -S "/bin/bash" -N "split_mapping_b37" -l h_rt="48:00:00" -l mem="4G" -o ~/Scratch/'$JOB_NAME'.out -e ~/Scratch/'$JOB_NAME'.err \
wrap split-mapping-file -T ~/Scratch/tmp/ gwas_norm.v20220402.biallelic.vep.b37.vcf.gz gwas_norm.v20220402.common.biallelic.vep.b37.vcf.gz gwas_norm.v20220402.rare.biallelic.vep.b37.vcf.gz
The variant synonyms file#
A flat file of current variant identifiers and their synonyms is also maintained. This is only used to normalise GWAS files that have only rsID information and no positional information. It is built jointly from Ensembl database tables and the dbSNP download file. TBC…