BASH endpoints#

There are several bash scripts that are included in gwas_norm/resources/bin. Many of these are wrapping some of the Python scripts and integratig them with external programs. There functionality and usage is documented below.

Mapping file generation#

`process-dbsnp.sh`#

This performs start to finish processing of dbSNP for incorporating into a gwas-norm mapping file. This performs the following functions:

Assign regular chromosome names (using format-dbsnp)
Strip out unwanted INFO fields
Perform REF allele checks against the reference genome assembly - excluding if not matching
Make multi-allelic sites into bi-allelic sites
Normalise INDELs
Sort the output

For process-dbsnp.sh to run you will need to have bash > v4.02, shflags and also bash-helpers installed (by installed I mean in your PATH in your ~/.bashrc or ~/.bash_profile). A copy of bcftools will also need to be installed. Also, if you use the --hpc option you will need simple-progress installed.

$ process-dbsnp.sh --help
USAGE: /home/rmjdcfi/code/gwas_norm/resources/bin/process-dbsnp.sh [flags] <dnsnp vcf> <assembly map> <reference genome> <output file>
flags:
  -v,--[no]verbose:  give more output (default: false)
  -c,--[no]hpc:  changes the progress verbosity if running headless on an HPC (default: false)
  -T,--tmp-dir:  An alternative location for tmp files (also applied to bcftools sort) (default: '')
  -m,--sort-mem:  The memory usage for bcftools sort (default: '0')
  -s,--steps:  The progress steps to report when --hpc is used (default: 1000000)
  -h,--help:  show this help (default: false)

Unwanted INFO fields#

the following INFO fields will be removed from the dnSNP VCF file:

INFO/GENEINFO
INFO/PSEUDOGENEINFO,
INFO/VC,
INFO/RS,
INFO/PM,
INFO/NSF,
INFO/NSM,
INFO/NSN,
INFO/SYN,
INFO/U3,
INFO/U5
INFO/ASS,
INFO/DSS,
INFO/INT,
INFO/R3,
INFO/R5,
INFO/GNO,
INFO/PUB,
INFO/FREQ,
INFO/COMMON

Input vcf#

Below is an example of the input dbSNP VCF:

##fileformat=VCFv4.2
##fileDate=20200501
##source=dbSNP
##dbSNP_BUILD_ID=154
##reference=GRCh38.p12
##phasing=partial
##INFO=<ID=RS,Number=1,Type=Integer,Description="dbSNP ID (i.e. rs number)">
##INFO=<ID=GENEINFO,Number=1,Type=String,Description="Pairs each of gene symbol:gene id.  The gene symbol and id are delimited by a colon (:) and each pair is delimite
##INFO=<ID=PSEUDOGENEINFO,Number=1,Type=String,Description="Pairs each of pseudogene symbol:gene id.  The pseudogene symbol and id are delimited by a colon (:) and eac
##INFO=<ID=dbSNPBuildID,Number=1,Type=Integer,Description="First dbSNP Build for RS">
##INFO=<ID=SAO,Number=1,Type=Integer,Description="Variant Allele Origin: 0 - unspecified, 1 - Germline, 2 - Somatic, 3 - Both">
##INFO=<ID=SSR,Number=1,Type=Integer,Description="Variant Suspect Reason Codes (may be more than one value added together) 0 - unspecified, 1 - Paralog, 2 - byEST, 4 -
##INFO=<ID=VC,Number=1,Type=String,Description="Variation Class">
##INFO=<ID=PM,Number=0,Type=Flag,Description="Variant has associated publication">
##INFO=<ID=NSF,Number=0,Type=Flag,Description="Has non-synonymous frameshift A coding region variation where one allele in the set changes all downstream amino acids.
##INFO=<ID=NSM,Number=0,Type=Flag,Description="Has non-synonymous missense A coding region variation where one allele in the set changes protein peptide. FxnClass = 42
##INFO=<ID=NSN,Number=0,Type=Flag,Description="Has non-synonymous nonsense A coding region variation where one allele in the set changes to STOP codon (TER). FxnClass
##INFO=<ID=SYN,Number=0,Type=Flag,Description="Has synonymous A coding region variation where one allele in the set does not change the encoded amino acid. FxnCode = 3
##INFO=<ID=U3,Number=0,Type=Flag,Description="In 3' UTR Location is in an untranslated region (UTR). FxnCode = 53">
##INFO=<ID=U5,Number=0,Type=Flag,Description="In 5' UTR Location is in an untranslated region (UTR). FxnCode = 55">
##INFO=<ID=ASS,Number=0,Type=Flag,Description="In acceptor splice site FxnCode = 73">
##INFO=<ID=DSS,Number=0,Type=Flag,Description="In donor splice-site FxnCode = 75">
##INFO=<ID=INT,Number=0,Type=Flag,Description="In Intron FxnCode = 6">
##INFO=<ID=R3,Number=0,Type=Flag,Description="In 3' gene region FxnCode = 13">
##INFO=<ID=R5,Number=0,Type=Flag,Description="In 5' gene region FxnCode = 15">
##INFO=<ID=GNO,Number=0,Type=Flag,Description="Genotypes available.">
##INFO=<ID=PUB,Number=0,Type=Flag,Description="RefSNP or associated SubSNP is mentioned in a publication">
##INFO=<ID=FREQ,Number=.,Type=String,Description="An ordered list of allele frequencies as reported by various genomic studies, starting with the reference allele foll
##INFO=<ID=COMMON,Number=0,Type=Flag,Description="RS is a common SNP.  A common SNP is one that has at least one 1000Genomes population with a minor allele of frequenc
##INFO=<ID=CLNHGVS,Number=.,Type=String,Description="Variant names from HGVS.    The order of these variants corresponds to the order of the info in the other clinical
##INFO=<ID=CLNVI,Number=.,Type=String,Description="Variant Identifiers provided and maintained by organizations outside of NCBI, such as OMIM.  Source and id separated
##INFO=<ID=CLNORIGIN,Number=.,Type=String,Description="Allele Origin. One or more of the following values may be summed: 0 - unknown; 1 - germline; 2 - somatic; 4 - in
##INFO=<ID=CLNSIG,Number=.,Type=String,Description="Variant Clinical Significance, 0 - Uncertain significance, 1 - not provided, 2 - Benign, 3 - Likely benign, 4 - Lik
##INFO=<ID=CLNDISDB,Number=.,Type=String,Description="Variant disease database name and ID, separated by colon (:)">
##INFO=<ID=CLNDN,Number=.,Type=String,Description="Preferred ClinVar disease name">
##INFO=<ID=CLNREVSTAT,Number=.,Type=String,Description="ClinVar Review Status: no_assertion - No asserition provided by submitter, no_criteria - No assertion criteria
##INFO=<ID=CLNACC,Number=.,Type=String,Description="For each allele (comma delimited), this is a pipe-delimited list of the Clinvar RCV phenotype accession.version str
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO
NC_000001.11    10001   rs1570391677    T       A       .       .       RS=1570391677;dbSNPBuildID=154;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;GNO;FREQ=KOREAN:0.
NC_000001.11    10002   rs1570391692    A       C       .       .       RS=1570391692;dbSNPBuildID=154;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;GNO;FREQ=KOREAN:0.
NC_000001.11    10003   rs1570391694    A       C       .       .       RS=1570391694;dbSNPBuildID=154;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;GNO;FREQ=KOREAN:0.
NC_000001.11    10008   rs1570391698    A       G       .       .       RS=1570391698;dbSNPBuildID=154;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;GNO;FREQ=KOREAN:0.
NC_000001.11    10009   rs1570391702    A       G       .       .       RS=1570391702;dbSNPBuildID=154;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;GNO;FREQ=KOREAN:0.
NC_000001.11    10015   rs1570391706    A       G       .       .       RS=1570391706;dbSNPBuildID=154;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;GNO;FREQ=KOREAN:0.
NC_000001.11    10019   rs775809821     TA      T       .       .       RS=775809821;dbSNPBuildID=144;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=INDEL
NC_000001.11    10020   rs1570391708    A       C       .       .       RS=1570391708;dbSNPBuildID=154;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;GNO;FREQ=KOREAN:0.
NC_000001.11    10021   rs1570391710    A       G       .       .       RS=1570391710;dbSNPBuildID=154;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;GNO;FREQ=KOREAN:0.
NC_000001.11    10026   rs1570391712    A       C       .       .       RS=1570391712;dbSNPBuildID=154;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;GNO;FREQ=KOREAN:0.
NC_000001.11    10027   rs1570391716    A       C,G     .       .       RS=1570391716;dbSNPBuildID=154;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;GNO;FREQ=KOREAN:0.
NC_000001.11    10032   rs1570391720    A       C       .       .       RS=1570391720;dbSNPBuildID=154;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;GNO;FREQ=KOREAN:0.
NC_000001.11    10033   rs1570391722    A       G       .       .       RS=1570391722;dbSNPBuildID=154;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;GNO;FREQ=KOREAN:0.
NC_000001.11    10039   rs978760828     A       C       .       .       RS=978760828;dbSNPBuildID=150;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;GNO;FREQ=Siberian:0
NC_000001.11    10043   rs1008829651    T       A       .       .       RS=1008829651;dbSNPBuildID=150;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;GNO;FREQ=Siberian:
NC_000001.11    10045   rs1570391729    A       C,G     .       .       RS=1570391729;dbSNPBuildID=154;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;GNO;FREQ=KOREAN:0.
NC_000001.11    10051   rs1052373574    A       C,G     .       .       RS=1052373574;dbSNPBuildID=150;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;GNO;FREQ=KOREAN:0.
NC_000001.11    10051   rs1326880612    A       AC      .       .       RS=1326880612;dbSNPBuildID=151;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=INDEL
NC_000001.11    10055   rs768019142     T       TA      .       .       RS=768019142;dbSNPBuildID=144;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=INDEL
NC_000001.11    10055   rs892501864     T       A       .       .       RS=892501864;dbSNPBuildID=150;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;GNO;FREQ=Siberian:0
NC_000001.11    10056   rs1570391738    A       C       .       .       RS=1570391738;dbSNPBuildID=154;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;GNO;FREQ=KOREAN:0.
NC_000001.11    10057   rs1570391741    A       C,G     .       .       RS=1570391741;dbSNPBuildID=154;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;GNO;FREQ=KOREAN:0.
NC_000001.11    10059   rs1570391745    C       G       .       .       RS=1570391745;dbSNPBuildID=154;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;GNO;FREQ=KOREAN:0.
NC_000001.11    10063   rs1010989343    A       C,G     .       .       RS=1010989343;dbSNPBuildID=150;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;GNO;FREQ=KOREAN:0.
NC_000001.11    10067   rs1489251879    T       TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCC      .       .       RS=1489251879;dbSNPBuildID=151;SSR=0;PSEUDOGENEINFO=DDX
NC_000001.11    10069   rs1570391755    A       G       .       .       RS=1570391755;dbSNPBuildID=154;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;GNO;FREQ=KOREAN:0.
NC_000001.11    10075   rs1570391757    A       G       .       .       RS=1570391757;dbSNPBuildID=154;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;GNO;FREQ=KOREAN:0.
NC_000001.11    10077   rs1022805358    C       G       .       .       RS=1022805358;dbSNPBuildID=150;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;GNO;FREQ=Siberian:
NC_000001.11    10081   rs1570391762    A       G       .       .       RS=1570391762;dbSNPBuildID=154;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;GNO;FREQ=KOREAN:0.
NC_000001.11    10086   rs1570391767    A       C       .       .       RS=1570391767;dbSNPBuildID=154;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;GNO;FREQ=KOREAN:0.
NC_000001.11    10092   rs1570391770    A       C       .       .       RS=1570391770;dbSNPBuildID=154;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;GNO;FREQ=KOREAN:0.

Output VCF#

The an excerpt of the output of process-dbsnp.sh is shown below:

##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##fileDate=20200501
##source=dbSNP
##dbSNP_BUILD_ID=154
##reference=GRCh38.p12
##phasing=partial
##contig=<ID=1>
##contig=<ID=10>
##contig=<ID=11>
##contig=<ID=12>
##contig=<ID=13>
##contig=<ID=14>
##contig=<ID=15>
##contig=<ID=16>
##contig=<ID=17>
##contig=<ID=18>
##contig=<ID=19>
##contig=<ID=2>
##contig=<ID=20>
##contig=<ID=21>
##contig=<ID=22>
##contig=<ID=3>
##contig=<ID=4>
##contig=<ID=5>
##contig=<ID=6>
##contig=<ID=7>
##contig=<ID=8>
##contig=<ID=9>
##contig=<ID=MT>
##contig=<ID=X>
##contig=<ID=Y>
##INFO=<ID=dbSNPBuildID,Number=1,Type=Integer,Description="First dbSNP Build for RS">
##INFO=<ID=SAO,Number=1,Type=Integer,Description="Variant Allele Origin: 0 - unspecified, 1 - Germline, 2 - Somatic, 3 - Both">
##INFO=<ID=SSR,Number=1,Type=Integer,Description="Variant Suspect Reason Codes (may be more than one value added together) 0 - unspecified, 1 - Paralog, 2 - byEST, 4 -
##INFO=<ID=CLNHGVS,Number=.,Type=String,Description="Variant names from HGVS.    The order of these variants corresponds to the order of the info in the other clinical
##INFO=<ID=CLNVI,Number=.,Type=String,Description="Variant Identifiers provided and maintained by organizations outside of NCBI, such as OMIM.  Source and id separated
##INFO=<ID=CLNORIGIN,Number=.,Type=String,Description="Allele Origin. One or more of the following values may be summed: 0 - unknown; 1 - germline; 2 - somatic; 4 - in
##INFO=<ID=CLNSIG,Number=.,Type=String,Description="Variant Clinical Significance, 0 - Uncertain significance, 1 - not provided, 2 - Benign, 3 - Likely benign, 4 - Lik
##INFO=<ID=CLNDISDB,Number=.,Type=String,Description="Variant disease database name and ID, separated by colon (:)">
##INFO=<ID=CLNDN,Number=.,Type=String,Description="Preferred ClinVar disease name">
##INFO=<ID=CLNREVSTAT,Number=.,Type=String,Description="ClinVar Review Status: no_assertion - No asserition provided by submitter, no_criteria - No assertion criteria
##INFO=<ID=CLNACC,Number=.,Type=String,Description="For each allele (comma delimited), this is a pipe-delimited list of the Clinvar RCV phenotype accession.version str
##bcftools_annotateVersion=1.11+htslib-1.11
##bcftools_annotateCommand=annotate -xINFO/GENEINFO,INFO/PSEUDOGENEINFO,INFO/VC,INFO/RS,INFO/PM,INFO/NSF,INFO/NSM,INFO/NSN,INFO/SYN,INFO/U3,INFO/U5,INFO/ASS,INFO/DSS,I
##bcftools_normVersion=1.11+htslib-1.11
##bcftools_normCommand=norm -m-any -cx -f /lustre/projects/DTAdb/resources/human_genome/Homo_sapiens.GRCh38.dna.toplevel.fa.gz; Date=Fri Jan 15 14:24:00 2021
##bcftools_viewVersion=1.11+htslib-1.11
##bcftools_viewCommand=view -h dbsnp.biallelic.v154.b38.vcf.gz; Date=Mon Jan 18 21:53:16 2021
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO
1       10001   rs1570391677    T       A       .       .       dbSNPBuildID=154;SSR=0
1       10002   rs1570391692    A       C       .       .       dbSNPBuildID=154;SSR=0
1       10003   rs1570391694    A       C       .       .       dbSNPBuildID=154;SSR=0
1       10008   rs1570391698    A       G       .       .       dbSNPBuildID=154;SSR=0
1       10009   rs1570391702    A       G       .       .       dbSNPBuildID=154;SSR=0
1       10015   rs1570391706    A       G       .       .       dbSNPBuildID=154;SSR=0
1       10019   rs775809821     TA      T       .       .       dbSNPBuildID=144;SSR=0
1       10020   rs1570391708    A       C       .       .       dbSNPBuildID=154;SSR=0
1       10021   rs1570391710    A       G       .       .       dbSNPBuildID=154;SSR=0
1       10026   rs1570391712    A       C       .       .       dbSNPBuildID=154;SSR=0
1       10027   rs1570391716    A       C       .       .       dbSNPBuildID=154;SSR=0
1       10027   rs1570391716    A       G       .       .       dbSNPBuildID=154;SSR=0
1       10032   rs1570391720    A       C       .       .       dbSNPBuildID=154;SSR=0
1       10033   rs1570391722    A       G       .       .       dbSNPBuildID=154;SSR=0
1       10039   rs978760828     A       C       .       .       dbSNPBuildID=150;SSR=0
1       10043   rs1008829651    T       A       .       .       dbSNPBuildID=150;SSR=0
1       10045   rs1570391729    A       C       .       .       dbSNPBuildID=154;SSR=0
1       10045   rs1570391729    A       G       .       .       dbSNPBuildID=154;SSR=0
1       10051   rs1326880612    A       AC      .       .       dbSNPBuildID=151;SSR=0
1       10051   rs1052373574    A       C       .       .       dbSNPBuildID=150;SSR=0
1       10051   rs1052373574    A       G       .       .       dbSNPBuildID=150;SSR=0
1       10055   rs892501864     T       A       .       .       dbSNPBuildID=150;SSR=0
1       10055   rs768019142     T       TA      .       .       dbSNPBuildID=144;SSR=0
1       10056   rs1570391738    A       C       .       .       dbSNPBuildID=154;SSR=0
1       10057   rs1570391741    A       C       .       .       dbSNPBuildID=154;SSR=0
1       10057   rs1570391741    A       G       .       .       dbSNPBuildID=154;SSR=0
1       10059   rs1570391745    C       G       .       .       dbSNPBuildID=154;SSR=0
1       10063   rs1010989343    A       C       .       .       dbSNPBuildID=150;SSR=0
1       10063   rs1010989343    A       G       .       .       dbSNPBuildID=150;SSR=0
1       10067   rs1489251879    T       TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCC      .       .       dbSNPBuildID=151;SSR=0
1       10069   rs1570391755    A       G       .       .       dbSNPBuildID=154;SSR=0
1       10075   rs1570391757    A       G       .       .       dbSNPBuildID=154;SSR=0
1       10077   rs1022805358    C       G       .       .       dbSNPBuildID=150;SSR=0
1       10081   rs1570391762    A       G       .       .       dbSNPBuildID=154;SSR=0
1       10086   rs1570391767    A       C       .       .       dbSNPBuildID=154;SSR=0
1       10092   rs1570391770    A       C       .       .       dbSNPBuildID=154;SSR=0

`process-alfa.sh`#

This performs start to finish processing of the ALFA VCF file for incorporation into a gwas-norm mapping file. This performs the following functions:

Assign regular chromosome names (using format-alfa)
Perform REF allele checks against the reference genome assembly - excluding if not matching
Make multi-allelic sites into bi-allelic sites
Normalise INDELs
Fix the AN for split multi-allelic sites (using fix-vcf-allele-number)
Sort the output

For process-alfa.sh to run you will need to have bash > v4.2, shflags and also bash-helpers installed (by installed I mean in your PATH in your ~/.bashrc or ~/.bash_profile). A copy of bcftools will also need to be installed. Also, if you use the --hpc option you will need simple-progress installed.

$ process-alfa.sh --help
USAGE: process-alfa.sh [flags] <alfa vcf> <assembly map> <reference genome> <output file>
flags:
  -v,--[no]verbose:  give more output (default: false)
  -c,--[no]hpc:  changes the progress verbosity if running headless on an HPC (default: false)
  -T,--tmp-dir:  An alternative location for tmp files (also applied to bcftools sort) (default: '')
  -m,--sort-mem:  The memory usage for bcftools sort (not implemented yet) (default: '0')
  -s,--steps:  The progress steps to report when --hpc is used (default: 1000000)
  -h,--help:  show this help (default: false)

Input vcf#

Below is an example of the input ALFA VCF:

##fileformat=VCFv4.0
##build_id=20201027095038
##Population=https://www.ncbi.nlm.nih.gov/biosample/?term=GRAF-pop
##FORMAT=<ID=AN,Number=1,Type=Integer,Description="Total allele count for the population, including REF">
##FORMAT=<ID=AC,Number=A,Type=Integer,Description="Allele count for each ALT allele for the population">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  SAMN10492695    SAMN10492696    SAMN10492697    SAMN10492698    SAMN10492699    SAMN10492700    SAMN10492701    SAMN1
NC_000001.9     144135212       rs1553120241    G       A       .       .       .       AN:AC   8560:5387       8:8     256:224 336:288 32:24   170:117 32:24   18:13   20:15   344:296 288:2
NC_000001.9     144148243       rs2236566       G       T       .       .       .       AN:AC   5996:510        0:0     0:0     0:0     0:0     0:0     0:0     0:0     84:8    0:0     0:0
NC_000001.9     146267105       rs1553119693    T       G       .       .       .       AN:AC   37168:28800     36:22   56:44   1378:839        18:14   70:60   10:9    4836:3639       452:3
NC_000001.9     148488564       .       C       A       .       .       .       AN:AC   8552:0  8:0     256:0   338:0   32:0    170:0   32:0    16:0    20:0    346:0   288:0   9424:0
NC_000001.10    2701535 rs371068661     C       T       .       .       .       AN:AC   134:9   0:0     0:0     48:1    0:0     0:0     0:0     0:0     188:15  48:1    0:0     370:25
NC_000001.10    2701546 rs587702211     G       A       .       .       .       AN:AC   134:0   0:0     0:0     48:4    0:0     0:0     0:0     0:0     188:2   48:4    0:0     370:6
NC_000001.10    7426777 rs1553119850    GT      G       .       .       .       AN:AC   4473:4462       0:0     0:0     8:0     0:0     0:0     0:0     0:0     24:8    8:0     0:0     4505:
NC_000001.10    7426778 rs1553119849    T       C,G     .       .       .       AN:AC   4494:0,4483     0:0,0   2:0,2   32:0,24 8:0,8   6:0,6   2:0,2   0:0,0   304:0,288       32:0,24 4:0,4
NC_000001.10    12461010        rs762190215     T       TGC,TGCGCGCGC,TGCGCGC   .       .       .       AN:AC   4456:85,8,45    0:0,0,0 0:0,0,0 0:0,0,0 0:0,0,0 0:0,0,0 0:0,0,0 0:0,0,0 8:0,0
NC_000001.11    10001   .       T       C       .       .       .       AN:AC   7618:0  108:0   84:0    2708:0  146:0   610:0   24:0    94:0    470:0   2816:0  108:0   11862:0
NC_000001.11    10007   .       T       C,G     .       .       .       AN:AC   7618:0,0        108:0,0 84:0,0  2708:0,0        146:0,0 610:0,0 24:0,0  94:0,0  470:0,0 2816:0,0        108:0
NC_000001.11    10008   .       A       C,T     .       .       .       AN:AC   7618:0,0        108:0,0 84:0,0  2708:0,0        146:0,0 610:0,0 24:0,0  94:0,0  470:0,0 2816:0,0        108:0
NC_000001.11    10009   .       A       C,G     .       .       .       AN:AC   7616:0,0        108:0,0 84:0,0  2708:0,0        146:0,0 610:0,0 24:0,0  94:0,0  470:0,0 2816:0,0        108:0
NC_000001.11    10013   .       TA      T       .       .       .       AN:AC   6962:0  84:0    84:0    2210:0  146:0   610:0   24:0    94:0    466:0   2294:0  108:0   10680:0
NC_000001.11    10013   .       T       C,G     .       .       .       AN:AC   7618:0,0        108:0,0 84:0,0  2708:0,0        146:0,0 610:0,0 24:0,0  94:0,0  470:0,0 2816:0,0        108:0
NC_000001.11    10014   .       A       C,G,T   .       .       .       AN:AC   7618:0,0,0      108:0,0,0       84:0,0,0        2708:0,0,0      146:0,0,0       610:0,0,0       24:0,0,0
NC_000001.11    10015   .       A       C,G,T   .       .       .       AN:AC   7618:0,0,0      108:0,0,0       84:0,0,0        2708:0,0,0      146:0,0,0       610:0,0,0       24:0,0,0
NC_000001.11    10016   .       C       T       .       .       .       AN:AC   6962:0  84:0    84:0    2210:0  146:0   610:0   24:0    94:0    466:0   2294:0  108:0   10680:0
NC_000001.11    10020   .       A       C,G,T   .       .       .       AN:AC   7616:0,0,0      108:0,0,0       84:0,0,0        2708:0,0,0      146:0,0,0       610:0,0,0       24:0,0,0
NC_000001.11    10021   .       A       C,G     .       .       .       AN:AC   7618:0,0        108:0,0 84:0,0  2708:0,0        146:0,0 610:0,0 24:0,0  94:0,0  470:0,0 2816:0,0        108:0
NC_000001.11    10022   .       C       A,G     .       .       .       AN:AC   7618:0,0        108:0,0 84:0,0  2708:0,0        146:0,0 610:0,0 24:0,0  94:0,0  470:0,0 2816:0,0        108:0
NC_000001.11    10023   .       C       T       .       .       .       AN:AC   6962:0  84:0    84:0    2210:0  146:0   610:0   24:0    94:0    466:0   2294:0  108:0   10680:0
NC_000001.11    10024   .       C       CT      .       .       .       AN:AC   7618:0  108:0   84:0    2708:0  146:0   610:0   24:0    94:0    470:0   2816:0  108:0   11862:0

Output VCF#

The an excerpt of the output of process-alfa.sh is shown below. This runs format-alfa without the --ignore-chr-version enabled so older assembly chromosomes are removed - NC_000001.9 and NC_000001.10.

##fileformat=VCFv4.0
##FILTER=<ID=PASS,Description="All filters passed">
##build_id=20201027095038
##Population=https://www.ncbi.nlm.nih.gov/biosample/?term=GRAF-pop
##FORMAT=<ID=AN,Number=1,Type=Integer,Description="Total allele count for the population, including REF">
##FORMAT=<ID=AC,Number=A,Type=Integer,Description="Allele count for each ALT allele for the population">
##contig=<ID=1>
##contig=<ID=2>
##contig=<ID=3>
##contig=<ID=4>
##contig=<ID=5>
##contig=<ID=6>
##contig=<ID=7>
##contig=<ID=8>
##contig=<ID=9>
##contig=<ID=10>
##contig=<ID=11>
##contig=<ID=12>
##contig=<ID=13>
##contig=<ID=14>
##contig=<ID=15>
##contig=<ID=16>
##contig=<ID=17>
##contig=<ID=18>
##contig=<ID=19>
##contig=<ID=20>
##contig=<ID=21>
##contig=<ID=22>
##contig=<ID=X>
##contig=<ID=Y>
##contig=<ID=MT>
##bcftools_normVersion=1.15+htslib-1.15
##bcftools_normCommand=norm -m-any -cx -f ../../../../../human_genome/b38/Homo_sapiens.GRCh38.dna.toplevel.fa.gz; Date=Thu Mar 31 09:44:16 2022
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  ALFA_EUR        ALFA_AFO        ALFA_EAS        ALFA_AFA        ALFA_LAC        ALFA_LEN        ALFA_OAS        ALFA_
1       10001   .       T       C       .       .       .       AN:AC   7618:0  108:0   84:0    2708:0  146:0   610:0   24:0    94:0    470:0   2816:0  108:0   11862:0
1       10007   .       T       C       .       .       .       AN:AC   7618:0  108:0   84:0    2708:0  146:0   610:0   24:0    94:0    470:0   2816:0  108:0   11862:0
1       10007   .       T       G       .       .       .       AN:AC   7618:0  108:0   84:0    2708:0  146:0   610:0   24:0    94:0    470:0   2816:0  108:0   11862:0
1       10008   .       A       C       .       .       .       AN:AC   7618:0  108:0   84:0    2708:0  146:0   610:0   24:0    94:0    470:0   2816:0  108:0   11862:0
1       10008   .       A       T       .       .       .       AN:AC   7618:0  108:0   84:0    2708:0  146:0   610:0   24:0    94:0    470:0   2816:0  108:0   11862:0
1       10009   .       A       C       .       .       .       AN:AC   7616:0  108:0   84:0    2708:0  146:0   610:0   24:0    94:0    470:0   2816:0  108:0   11860:0
1       10009   .       A       G       .       .       .       AN:AC   7616:0  108:0   84:0    2708:0  146:0   610:0   24:0    94:0    470:0   2816:0  108:0   11860:0
1       10013   .       T       C       .       .       .       AN:AC   7618:0  108:0   84:0    2708:0  146:0   610:0   24:0    94:0    470:0   2816:0  108:0   11862:0
1       10013   .       T       G       .       .       .       AN:AC   7618:0  108:0   84:0    2708:0  146:0   610:0   24:0    94:0    470:0   2816:0  108:0   11862:0
1       10013   .       TA      T       .       .       .       AN:AC   6962:0  84:0    84:0    2210:0  146:0   610:0   24:0    94:0    466:0   2294:0  108:0   10680:0
1       10014   .       A       C       .       .       .       AN:AC   7618:0  108:0   84:0    2708:0  146:0   610:0   24:0    94:0    470:0   2816:0  108:0   11862:0
1       10014   .       A       G       .       .       .       AN:AC   7618:0  108:0   84:0    2708:0  146:0   610:0   24:0    94:0    470:0   2816:0  108:0   11862:0
1       10014   .       A       T       .       .       .       AN:AC   7618:0  108:0   84:0    2708:0  146:0   610:0   24:0    94:0    470:0   2816:0  108:0   11862:0
1       10015   .       A       C       .       .       .       AN:AC   7618:0  108:0   84:0    2708:0  146:0   610:0   24:0    94:0    470:0   2816:0  108:0   11862:0
1       10015   .       A       G       .       .       .       AN:AC   7618:0  108:0   84:0    2708:0  146:0   610:0   24:0    94:0    470:0   2816:0  108:0   11862:0
1       10015   .       A       T       .       .       .       AN:AC   7618:0  108:0   84:0    2708:0  146:0   610:0   24:0    94:0    470:0   2816:0  108:0   11862:0
1       10016   .       C       T       .       .       .       AN:AC   6962:0  84:0    84:0    2210:0  146:0   610:0   24:0    94:0    466:0   2294:0  108:0   10680:0
1       10020   .       A       C       .       .       .       AN:AC   7616:0  108:0   84:0    2708:0  146:0   610:0   24:0    94:0    470:0   2816:0  108:0   11860:0
1       10020   .       A       G       .       .       .       AN:AC   7616:0  108:0   84:0    2708:0  146:0   610:0   24:0    94:0    470:0   2816:0  108:0   11860:0
1       10020   .       A       T       .       .       .       AN:AC   7616:0  108:0   84:0    2708:0  146:0   610:0   24:0    94:0    470:0   2816:0  108:0   11860:0
1       10021   .       A       C       .       .       .       AN:AC   7618:0  108:0   84:0    2708:0  146:0   610:0   24:0    94:0    470:0   2816:0  108:0   11862:0
1       10021   .       A       G       .       .       .       AN:AC   7618:0  108:0   84:0    2708:0  146:0   610:0   24:0    94:0    470:0   2816:0  108:0   11862:0
1       10022   .       C       A       .       .       .       AN:AC   7618:0  108:0   84:0    2708:0  146:0   610:0   24:0    94:0    470:0   2816:0  108:0   11862:0
1       10022   .       C       G       .       .       .       AN:AC   7618:0  108:0   84:0    2708:0  146:0   610:0   24:0    94:0    470:0   2816:0  108:0   11862:0
1       10023   .       C       T       .       .       .       AN:AC   6962:0  84:0    84:0    2210:0  146:0   610:0   24:0    94:0    466:0   2294:0  108:0   10680:0
1       10024   .       C       CT      .       .       .       AN:AC   7618:0  108:0   84:0    2708:0  146:0   610:0   24:0    94:0    470:0   2816:0  108:0   11862:0

`process-gnomad2.sh`#

This is designed to take the Gnomad2 data and extract the AN:AC data on it for selected populations. Please note that the tabix index is also required as it provided the contig list. This script does the following:

Extracts rows that have the filter PASS
Extracts the AN:AC data for selected population groups (note these are not checked to see if they are present so if you input any incorrect groups you will probably get some bcftools errors.
Splits bi-allelic sites (although I think they are all bi-allelic in Gnomad any way)
Normalise INDELs
Performs a validation against the reference genome assembly
Sorts the final VCF so that the chromosomes are represented in string sort order.

The population groups that are extracted as a default are:

afr - African/African-American ancestry
amr - Latino ancestry
asj - Ashkenazi Jewish ancestry
eas - East Asian ancestry
fin - Finnish ancestry
nfe - Non-Finnish European ancestry
oth - Other ancestry
raw - All (before removing low-confidence genotypes)
sas - South Asian ancestry

Please note, that I think this should be able to handle the Gnomad2 whole genomes data but I have not tried it, it has only been tested with the exome data in GRCh37 and GRCh38 assemblies.

USAGE: process-gnomad2.sh [flags] <input vcf> <reference genome> <output vcf>
flags:
  -v,--[no]verbose:  give more output (default: false)
  -e,--[no]exome:  is the gnomad2 dataset the exome data? (default: false)
  -T,--tmp-dir:  An alternative location for tmp files (also applied to bcftools sort) (default: '')
  -p,--pop-names:  comma separated list of gnomad population names to extract from the VCFs (i.e. in the INFO field) (default: 'afr,amr,asj,eas,fin,nfe,oth,raw,sas')
  -m,--sort-mem:  The memory usage for bcftools sort (default: '0')
  -h,--help:  show this help (default: false)

`process-gnomad3.sh`#

This is designed to download and process the Gnomad3 whole genome data data and extract the AN:AC data from it for selected populations. The gnomad3 data is split per chromosome and due to the size it is particularly problematic for handling. So this script aims to download, extract what is needed into a smaller VCF file, and delete the original. Although the downloaded VCFs can retained with the --keep-vcf flag to process_gnomad.sh. By default, this script downloads from the Google endpoint as but the Amazon or Azure endpoints can be selected if needed. The population counts it extracts by default are listed below but other populations can be given if needed. Please note, all info field data other than the selected allele counts is stripped from the VCFs, so in summary this script will:

Extracts rows that have the filter PASS
Download the Gnomad3 chromosome VCFs (can be run for selected chromosomes)
Extracts the AN:AC data for selected population groups (note these are not checked to see if they are present so if you input any incorrect groups you will probably get some bcftools errors.
Splits bi-allelic sites (although I think they are all bi-allelic in Gnomad any way)
Normalise INDELs
Performs a validation against the reference genome assembly

The population groups that are extracted as a default are:

afr - African/African-American ancestry
ami - Amish ancestry
amr - Latino ancestry
asj - Ashkenazi Jewish ancestry
eas - East Asian ancestry
fin - Finnish ancestry
mid - Middle Eastern ancestry
nfe - Non-Finnish European ancestry
oth - Other ancestry
raw - All (before removing low-confidence genotypes)
sas - South Asian ancestry

The -help for process-gnomad3.sh is shown below:

$ process-gnomad3.sh --help
USAGE: /home/rmjdcfi/code/gwas_norm/resources/bin/process-gnomad.sh [flags] <reference genome> <output directory>
flags:
  -v,--[no]verbose:  give more output (default: false)
  -k,--[no]keep-vcf:  This ensures the downloaded VCF is not deleted after processing (default: false)
  -T,--tmp-dir:  An alternative location for tmp files (also applied to bcftools sort) (default: '')
  -e,--endpoint:  The download endpoint to use google, amazon or azure (default: 'google')
  -r,--chr-names:  comma separated list of chromosomes to download (no spaces) either 1-22 (separately i.e. 1,2,3 etc..) or X, Y (default: '1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,X,Y')
  -p,--pop-names:  comma separated list of gnomad population names to extract from the VCFs (i.e. in the INFO field) (default: 'afr,ami,amr,asj,eas,fin,mid,nfe,oth,raw,sas')
  -h,--help:  show this help (default: false)

To run this you will need to have bash > v4.2, shflags and also bash-helpers installed (by installed I mean in your PATH in your ~/.bashrc or ~/.bash_profile). A copy of bcftools will also need to be installed (tested with v1.11 and v1.15). As a guide, using the defaults. After extracting for each chromosome - the output files collectively contain ~644,268,002 rows. To join the individual chromosome files have a look at vcf-cat.sh

`process-1000g.sh`#

Generate allele counts for each of the super and sub populations in the 1000 genomes using. This script will download the vcf files (if they are not already downloaded), tabix index files and list of samples (and keep them in place). It uses bcftools with the +fill-tags plugin to generate allele numbers and counts for each population. You can test that the +fill-tags plugin works ok with:

$ bcftools +fill-tags -vv
plugin directory /software/bin/libexec/bcftools .. ok
/software/bin/libexec/bcftools/fill-tags.so:
        plugin open   .. ok
        init     .. ok

About:   Run user defined plugin
Usage:   bcftools plugin <name> [OPTIONS] <file> [-- PLUGIN_OPTIONS]
         bcftools +name [OPTIONS] <file>  [-- PLUGIN_OPTIONS]

VCF input options:
   -e, --exclude <expr>        exclude sites for which the expression is true
   -i, --include <expr>        select sites for which the expression is true
   -r, --regions <region>      restrict to comma-separated list of regions
   -R, --regions-file <file>   restrict to regions listed in a file
   -t, --targets <region>      similar to -r but streams rather than index-jumps
   -T, --targets-file <file>   similar to -R but streams rather than index-jumps
VCF output options:
       --no-version            do not append version and command line to the header
   -o, --output <file>         write output to a file [standard output]
   -O, --output-type <type>    'b' compressed BCF; 'u' uncompressed BCF; 'z' compressed VCF; 'v' uncompressed VCF [v]
       --threads <int>         use multithreading with <int> worker threads [0]
Plugin options:
   -h, --help                  list plugin's options
   -l, --list-plugins          list available plugins. See BCFTOOLS_PLUGINS environment variable and man page for details
   -v, --verbose               print verbose information, -vv increases verbosity
   -V, --version               print version string and exit

To summarise, the following actions are performed:

Extracts rows that have the filter PASS
Download the 1000 genomes chromosome VCFs (supports GRCh37 and GRCh38 versions and can be run for selected chromosomes)
Generates the AN:AC data for all sub and super population groups.
Splits bi-allelic sites
Normalise INDELs
Performs a validation against the reference genome assembly
Fixes the allele counts (AN) for split sites (see here)

The help for process-1000g.sh is shown below:

$ process-1000g.sh --help
USAGE: /home/rmjdcfi/code/gwas_norm/resources/bin/process-1000g.sh [flags] <download assembly> <reference genome> <output directory>
flags:
  -v,--[no]verbose:  give more output (default: false)
  -k,--[no]keep-vcf:  This ensures the downloaded VCF is not deleted after processing (default: false)
  -T,--tmp-dir:  An alternative location for tmp files (also applied to bcftools sort) (default: '')
  -d,--download-dir:  An alternative location to download the VCFs to the default is to use <outdir> (default: '')
  -r,--chr-names:  comma separated list of chromosomes to download (no spaces) either 1-22 (separately i.e. 1,2,3 etc..) or X, Y (default: '')
  -h,--help:  show this help (default: false)

The allele count VCFs are generated per chromosome and can be generated for GRCh37 or GRCh38, these are referred to as b37 and b38 respectively for the <download assembly> argument.

`process-snpstats.sh`#

To summarise, the following actions are performed:

Generates the AN:AC data for all SNPSTATs files.
Splits bi-allelic sites
Normalise INDELs
Performs a validation against the reference genome assembly
Fixes the allele counts (AN) for split sites (see here)

The help for process-snpstats.sh is shown below:

$ process-snpstats.sh --help
USAGE: process-snpstats.sh [flags] <reference genome> <snpstats files <snpstats files>...>
flags:
  -v,--[no]verbose:  give more output (default: false)
  -T,--tmp-dir:  An alternative location for tmp files (also applied to bcftools sort) (default: '')
  -m,--sort-mem:  The memory usage for bcftools sort (not implemented yet) (default: '0')
  -o,--output:  The output file name, defaults to <STDOUT> (default: '-')
  -O,--output-type:  The output type from bcftools, defaults to un-compressed vcf (default: 'v')
  -c,--count-col:  The name of the AN:AC count sample column that is generated from the snpstats files (default: 'ALLELE_COUNT1')
  -h,--help:  show this help (default: false)

`vcf-lift.sh`#

Perform a liftover of a VCF file. Whilst CrossMap will do this it is really slow for VCFs (but not for other files such as bed files). So this uses a python script (quick-lift) that I wrote to use the CrossMap “API” and treat VCFs as a flat file with the comment character ‘##’. So this will do the basic liftover and then the output is piped into bcftools to do the normalisation of the REF alleles and drop anything that does not match to the reference. This works out a lot faster.

USAGE: vcf-lift.sh [flags] <input vcf> <target reference genome> <chain file>
flags:
  -v,--[no]verbose:  give more output (default: false)
  -T,--tmp-dir:  An alternative location for tmp files (also applied to bcftools sort) (default: '')
  -m,--sort-mem:  The memory usage for bcftools sort (default: '0')
  -o,--output:  The output file name, defaults to <STDOUT> (default: '-')
  -O,--output-type:  The output type from bcftools, defaults to un-compressed vcf (default: 'v')
  -f,--fail-file:  The path to the liftover failure file (default: 'liftover_failures.txt.gz')
  -h,--help:  show this help (default: false)

Example usage#

Lift down from GRCh38 to GRCh37 and output to a gzipped compressed VCF file.

$ vcf-lift.sh -T ~/Scratch/tmp -o output.b37.vcf.gz -O 'z' input.b38.vcf.gz Homo_sapiens.GRCh37.dna.toplevel.fa.gz GRCh38_to_GRCh37.chain.gz

`vcf-sort.sh`#

A no frills wrapper around bcftools sort. This is designed to make sure the contigs that the VCF is sorted on are placed in a natural text sort order. This enables the VCF to be sorted as it would be with UNIX sort.

USAGE: /home/rmjdcfi/code/gwas_norm/resources/bin/vcf-sort.sh [flags] <input file> <output file>
flags:
  -v,--[no]verbose:  give more output (default: false)
  -r,--[no]reheader:  reheader before sorting (default: true)
  -T,--tmp-dir:  An alternative location for tmp files (also applied to bcftools sort)
                 (default: '')
  -m,--sort-mem:  The memory usage for bcftools sort (default: '0')
  -h,--help:  show this help (default: false)

`vcf-cat.sh`#

A no frills wrapper around bcftools concat that only performs a basic concatination but does line counting before and after and if they match up then it performs a delete of the original files (if –delete option is on). If --sort is set it will also use VCF tools to sort the concatinated VCF file.

$ vcf-cat.sh --help
USAGE: vcf-cat.sh [flags] <output file> <files to concat>
flags:
  -v,--[no]verbose:  give more output (default: false)
  -d,--[no]delete:  If the line counts in the concatenated VCF and the original VCFs match up then delete the concatenated files (default: false)
  -s,--[no]sort:  sort the VCF post concatenation (default: true)
  -T,--tmp-dir:  An alternative location for tmp files (also applied to bcftools sort) (default: '')
  -f,--file-list:  A list of files 1/line (passed through to bcftools) (default: '')
  -m,--sort-mem:  The memory usage for bcftools sort (default: '0')
  -h,--help:  show this help (default: false)

`vcf-dumb-cat.sh`#

A truly bruit force concatenation of vcf files. This does no checks on the header or order of the VCF files, it just outputs them and bgzips them. Only use this if you are 100% sure of the origin and order of the VCF files. For example, if they are an output from a parallel job. If files are given both on the command line and in a file list then the command line files are added to the end of the file list files.

$ vcf-dumb-cat.sh --help
USAGE: vcf-dumb-cat.sh [flags] <output file> <files to concat>
flags:
  -v,--[no]verbose:  give more output (default: false)
  -T,--tmp-dir:  An alternative location for tmp files (default: '')
  -f,--file-list:  A list of files 1/line (default: '')
  -h,--help:  show this help (default: false)

`merge-count-vcfs.sh`#

Merge a set of allele count VCF files into a single VCF file. An allele count VCF file is a VCF file of variant sites where the allele number (AN) and the allele count (AC) are stored where the samples usually are, they have a FORMAT of AN:AC. This script expects that the sites are bi-allelic and the VCF files are sorted with the chromosome name as a string and the position as an integer. The input VCFs must be tabix indexed, this is not usedin the merge but is used to verify the sort order of the VCF files. The reference VCF is used to supply the header and all of the ID fields. The output VCF will have a field called DS added to the INFO section - with the structure DS=<dataset_name2>,<dataset_name2>..., where the <dataset_name> are the names of the datasets input as -r and -d that contain the site in question.

$ merge-count-vcfs.sh --help
USAGE: merge-count-vcfs.sh [flags] <ref vcf> <merge vcf <merge vcf>...>
flags:
  -v,--[no]verbose:  give more output (default: false)
  -T,--tmp-dir:  An alternative location for tmp files (default: '')
  -o,--output:  The output file name, defaults to <STDOUT>  (if <STDOUT> no tabix index is created) (default: '-')
  -r,--ref-name:  The name for the reference dataset (default: '')
  -d,--data-names:  A coma separated list of names for the datasets being merged (default: '')
  -g,--ref-genome:  A reference genome to get the contigs from (default: '')
  -h,--help:  show this help (default: false)

To run this you will need to have bash > v4.2, shflags and also bash-helpers installed (by installed I mean in your PATH in your ~/.bashrc or ~/.bash_profile). This also calls the Python script merge-count-vcfs.

`make-vep-ja.sh`#

A helper script that will generate a job array file that can be used to submitting varent effect predictor (VEP) jobs. I have always had performance issues with VEP and the best way to get it running quickly is to parallelise, this sets up a list of genomic boundaries with a target number of variants in each one. Along side those are the VEP arguments and output directories. The <input file>, should be sorted by chr_name (using any sort order) and then start position (as an integer).

To run this you will need to have bash > v4.2, shflags and also bash-helpers installed (by installed I mean in your PATH in your ~/.bashrc.

$ make-vep-ja.sh --help
USAGE: make-vep-ja.sh [flags] <input file> <chunk out dir> <VEP genome assembly> <CADD files>
flags:
  -v,--[no]verbose:  give more output (default: false)
  -n,--nvars:  The number of variants required in each chunk (default: 50000)
  -T,--tmp-dir:  An alternative location for tmp files (default: '')
  -o,--outfile:  An output file for the job array, if not supplied output is to STDOUT (default: '')
  -h,--help:  show this help (default: false)

The output job array file will have the following columns and is suitable for use with pyjasub. It is also designed to with with the vep.task.sh script in gwas_norm/resources/tasks:

rowidx - A counter required by pyjasub.
region_idx - Used to order the regions by their genomic position in the input file, a region is the genomic coordinates encompassing --nvars variants.
chr_name - The chromosome name of the region.
start_pos - The start position of the region.
end_pos - The end position of the region.
nsites - The number os sites in the region. This may vary slighly from the number fo sites that has been requested as the underlying Python script that generates the chunks ensures that any genomic site is unique to a region, even if it is replicated in the input file.
infile - The input file name that the regions will be extracted from.
outfile - The output file name of the VEP’d file.
assembly - The genome assembly that VEP should use (i.e. GRCh37 or GRCh38). Note it is not possible to input the cache file in the job array.
cadd_files - The location of the CADD files, these will be | separated if there are >1 CADD file passed to the script.

To run this you will need to have bash > v4.2, shflags and also bash-helpers installed (by installed I mean in your PATH in your ~/.bashrc or ~/.bash_profile).

Example usage#

From vcf_to_chunk.vcf.gz make genomic chunk boundaries containing 500,000 variants and write the output to gwas_norm_vep.ja. Use the assembly GRCh38 and the cadd files whole_genome_SNVs.tsv.gz and gnomad.genomes.r3.0.indel.tsv.gz.

$ make-vep-ja.sh -n 500000 -T ~/Scratch/tmp/ -o gwas_norm_vep.ja vcf_to_chunk.vcf.gz vep_chunks/ GRCh38 cadd/b38/whole_genome_SNVs.tsv.gz cadd/b38/gnomad.genomes.r3.0.indel.tsv.gz

BASH endpoints#

Mapping file generation#

process-dbsnp.sh#

Unwanted INFO fields#

Input vcf#

Output VCF#

process-alfa.sh#

Input vcf#

Output VCF#

process-gnomad2.sh#

process-gnomad3.sh#

process-1000g.sh#

process-snpstats.sh#

vcf-lift.sh#

Example usage#

vcf-sort.sh#

vcf-cat.sh#

vcf-dumb-cat.sh#

merge-count-vcfs.sh#

make-vep-ja.sh#

Example usage#

`process-dbsnp.sh`#

`process-alfa.sh`#

`process-gnomad2.sh`#

`process-gnomad3.sh`#

`process-1000g.sh`#

`process-snpstats.sh`#

`vcf-lift.sh`#

`vcf-sort.sh`#

`vcf-cat.sh`#

`vcf-dumb-cat.sh`#

`merge-count-vcfs.sh`#

`make-vep-ja.sh`#