Python endpoints#

The gwas_norm package installs several Python command-line endpoints. Some provide the core functionality of the package; others are more tangential and are used to generate the mapping files.

Core scripts#

Mapping file generation#

dbsnp-download#

Download dbSNP JSON files and simultaneously process them into gzipped chunk files with a maximum number of rows per file. This makes parallel processing easier downstream.

The chunking process can take a while. Multiple processes can be assigned to it, although each process can only work on a single file. In future, I will leverage the bgzip2 format to define chunk positions within the files.
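The chunking idea described above can be sketched as follows. This is a minimal illustration only, not the package's implementation: the function name `write_chunks` and the output file naming scheme are assumptions made for the example.

```python
import gzip
import os


def write_chunks(lines, outdir, prefix, chunk_size=1_000_000):
    """Stream text rows into gzip files holding at most chunk_size rows each.

    Hypothetical helper: the name and the "<prefix>.<index>.json.gz" naming
    scheme are illustrative, not taken from gwas_norm.
    """
    paths = []
    handle = None
    try:
        for row_number, line in enumerate(lines):
            # Start a new chunk file every chunk_size rows.
            if row_number % chunk_size == 0:
                if handle is not None:
                    handle.close()
                path = os.path.join(outdir, f"{prefix}.{len(paths):05d}.json.gz")
                handle = gzip.open(path, "wt")
                paths.append(path)
            handle.write(line)
    finally:
        if handle is not None:
            handle.close()
    return paths
```

Because rows are streamed one at a time, memory use stays flat regardless of the chunk size, and each resulting file can be handed to a separate worker downstream.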

usage: dbsnp-download [-h] [--url URL] [-T TMP] [-u CHUNK_SIZE] [-p PROCESSES]
                      [-v]
                      outdir download_dir

Positional Arguments#

outdir

The output directory for processed chunk files.

download_dir

The directory for downloaded files.

Named Arguments#

--url

The URL (host) to download the dbSNP files from.

Default: “ftp.ncbi.nlm.nih.gov”

-T, --tmp

The location of the tmp directory; if not provided, the system tmp is used.

-u, --chunk-size

The max number of JSON rows to output into each file.

Default: 1000000

-p, --processes

The max number of processes to use for chunking files.

Default: 1

-v, --verbose

Log output to STDERR. Use -v to display file count progress and -vv for a download progress monitor.

format-dbsnp#

Reformat the dbSNP VCF to regular chromosome names.

usage: format-dbsnp [-h] [-v] [-c] infile assembly [outfile]

Positional Arguments#

infile

The input dbSNP VCF file (required).

assembly

An assembly chromosome mapper

outfile

An optional output file, if not provided output is to STDOUT

Named Arguments#

-v, --verbose

give more output

Default: False

-c, --ignore-chr-version

Ignore the chromosome version suffix, e.g. the .11 in NC_000001.11.

Default: False
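The renaming step and the effect of --ignore-chr-version can be sketched as below. This is an assumption-laden illustration: the format of the assembly mapper is not described here, so a plain dictionary (`ASSEMBLY_MAP`, showing only a few GRCh38 RefSeq accessions) and the helper `map_chrom` are hypothetical stand-ins.

```python
# Hypothetical, partial GRCh38 mapper from RefSeq accession to chromosome name.
ASSEMBLY_MAP = {
    "NC_000001.11": "1",
    "NC_000002.12": "2",
    "NC_000023.11": "X",
    "NC_012920.1": "MT",
}


def map_chrom(refseq_id, mapper=ASSEMBLY_MAP, ignore_version=False):
    """Return the plain chromosome name, or None when there is no match."""
    if not ignore_version:
        # Exact match: an accession from another assembly version is not mapped.
        return mapper.get(refseq_id)
    # Compare only the part before the ".<version>" suffix.
    stem = refseq_id.split(".", 1)[0]
    for accession, name in mapper.items():
        if accession.split(".", 1)[0] == stem:
            return name
    return None
```

With this sketch, `map_chrom("NC_000001.10")` finds no match, while `map_chrom("NC_000001.10", ignore_version=True)` maps to chromosome 1.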

Input VCF#

Below is an example of the input dbSNP VCF:

##fileformat=VCFv4.2
##fileDate=20200501
##source=dbSNP
##dbSNP_BUILD_ID=154
##reference=GRCh38.p12
##phasing=partial
##INFO=<ID=RS,Number=1,Type=Integer,Description="dbSNP ID (i.e. rs number)">
##INFO=<ID=GENEINFO,Number=1,Type=String,Description="Pairs each of gene symbol:gene id.  The gene symbol and id are delimited by a colon (:) and each pair is delimite
##INFO=<ID=PSEUDOGENEINFO,Number=1,Type=String,Description="Pairs each of pseudogene symbol:gene id.  The pseudogene symbol and id are delimited by a colon (:) and eac
##INFO=<ID=dbSNPBuildID,Number=1,Type=Integer,Description="First dbSNP Build for RS">
##INFO=<ID=SAO,Number=1,Type=Integer,Description="Variant Allele Origin: 0 - unspecified, 1 - Germline, 2 - Somatic, 3 - Both">
##INFO=<ID=SSR,Number=1,Type=Integer,Description="Variant Suspect Reason Codes (may be more than one value added together) 0 - unspecified, 1 - Paralog, 2 - byEST, 4 -
##INFO=<ID=VC,Number=1,Type=String,Description="Variation Class">
##INFO=<ID=PM,Number=0,Type=Flag,Description="Variant has associated publication">
##INFO=<ID=NSF,Number=0,Type=Flag,Description="Has non-synonymous frameshift A coding region variation where one allele in the set changes all downstream amino acids.
##INFO=<ID=NSM,Number=0,Type=Flag,Description="Has non-synonymous missense A coding region variation where one allele in the set changes protein peptide. FxnClass = 42
##INFO=<ID=NSN,Number=0,Type=Flag,Description="Has non-synonymous nonsense A coding region variation where one allele in the set changes to STOP codon (TER). FxnClass
##INFO=<ID=SYN,Number=0,Type=Flag,Description="Has synonymous A coding region variation where one allele in the set does not change the encoded amino acid. FxnCode = 3
##INFO=<ID=U3,Number=0,Type=Flag,Description="In 3' UTR Location is in an untranslated region (UTR). FxnCode = 53">
##INFO=<ID=U5,Number=0,Type=Flag,Description="In 5' UTR Location is in an untranslated region (UTR). FxnCode = 55">
##INFO=<ID=ASS,Number=0,Type=Flag,Description="In acceptor splice site FxnCode = 73">
##INFO=<ID=DSS,Number=0,Type=Flag,Description="In donor splice-site FxnCode = 75">
##INFO=<ID=INT,Number=0,Type=Flag,Description="In Intron FxnCode = 6">
##INFO=<ID=R3,Number=0,Type=Flag,Description="In 3' gene region FxnCode = 13">
##INFO=<ID=R5,Number=0,Type=Flag,Description="In 5' gene region FxnCode = 15">
##INFO=<ID=GNO,Number=0,Type=Flag,Description="Genotypes available.">
##INFO=<ID=PUB,Number=0,Type=Flag,Description="RefSNP or associated SubSNP is mentioned in a publication">
##INFO=<ID=FREQ,Number=.,Type=String,Description="An ordered list of allele frequencies as reported by various genomic studies, starting with the reference allele foll
##INFO=<ID=COMMON,Number=0,Type=Flag,Description="RS is a common SNP.  A common SNP is one that has at least one 1000Genomes population with a minor allele of frequenc
##INFO=<ID=CLNHGVS,Number=.,Type=String,Description="Variant names from HGVS.    The order of these variants corresponds to the order of the info in the other clinical
##INFO=<ID=CLNVI,Number=.,Type=String,Description="Variant Identifiers provided and maintained by organizations outside of NCBI, such as OMIM.  Source and id separated
##INFO=<ID=CLNORIGIN,Number=.,Type=String,Description="Allele Origin. One or more of the following values may be summed: 0 - unknown; 1 - germline; 2 - somatic; 4 - in
##INFO=<ID=CLNSIG,Number=.,Type=String,Description="Variant Clinical Significance, 0 - Uncertain significance, 1 - not provided, 2 - Benign, 3 - Likely benign, 4 - Lik
##INFO=<ID=CLNDISDB,Number=.,Type=String,Description="Variant disease database name and ID, separated by colon (:)">
##INFO=<ID=CLNDN,Number=.,Type=String,Description="Preferred ClinVar disease name">
##INFO=<ID=CLNREVSTAT,Number=.,Type=String,Description="ClinVar Review Status: no_assertion - No asserition provided by submitter, no_criteria - No assertion criteria
##INFO=<ID=CLNACC,Number=.,Type=String,Description="For each allele (comma delimited), this is a pipe-delimited list of the Clinvar RCV phenotype accession.version str
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO
NC_000001.11    10001   rs1570391677    T       A       .       .       RS=1570391677;dbSNPBuildID=154;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;GNO;FREQ=KOREAN:0.
NC_000001.11    10002   rs1570391692    A       C       .       .       RS=1570391692;dbSNPBuildID=154;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;GNO;FREQ=KOREAN:0.
NC_000001.11    10003   rs1570391694    A       C       .       .       RS=1570391694;dbSNPBuildID=154;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;GNO;FREQ=KOREAN:0.
NC_000001.11    10008   rs1570391698    A       G       .       .       RS=1570391698;dbSNPBuildID=154;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;GNO;FREQ=KOREAN:0.
NC_000001.11    10009   rs1570391702    A       G       .       .       RS=1570391702;dbSNPBuildID=154;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;GNO;FREQ=KOREAN:0.
NC_000001.11    10015   rs1570391706    A       G       .       .       RS=1570391706;dbSNPBuildID=154;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;GNO;FREQ=KOREAN:0.
NC_000001.11    10019   rs775809821     TA      T       .       .       RS=775809821;dbSNPBuildID=144;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=INDEL
NC_000001.11    10020   rs1570391708    A       C       .       .       RS=1570391708;dbSNPBuildID=154;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;GNO;FREQ=KOREAN:0.
NC_000001.11    10021   rs1570391710    A       G       .       .       RS=1570391710;dbSNPBuildID=154;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;GNO;FREQ=KOREAN:0.
NC_000001.11    10026   rs1570391712    A       C       .       .       RS=1570391712;dbSNPBuildID=154;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;GNO;FREQ=KOREAN:0.
NC_000001.11    10027   rs1570391716    A       C,G     .       .       RS=1570391716;dbSNPBuildID=154;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;GNO;FREQ=KOREAN:0.
NC_000001.11    10032   rs1570391720    A       C       .       .       RS=1570391720;dbSNPBuildID=154;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;GNO;FREQ=KOREAN:0.
NC_000001.11    10033   rs1570391722    A       G       .       .       RS=1570391722;dbSNPBuildID=154;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;GNO;FREQ=KOREAN:0.
NC_000001.11    10039   rs978760828     A       C       .       .       RS=978760828;dbSNPBuildID=150;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;GNO;FREQ=Siberian:0
NC_000001.11    10043   rs1008829651    T       A       .       .       RS=1008829651;dbSNPBuildID=150;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;GNO;FREQ=Siberian:
NC_000001.11    10045   rs1570391729    A       C,G     .       .       RS=1570391729;dbSNPBuildID=154;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;GNO;FREQ=KOREAN:0.
NC_000001.11    10051   rs1052373574    A       C,G     .       .       RS=1052373574;dbSNPBuildID=150;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;GNO;FREQ=KOREAN:0.
NC_000001.11    10051   rs1326880612    A       AC      .       .       RS=1326880612;dbSNPBuildID=151;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=INDEL
NC_000001.11    10055   rs768019142     T       TA      .       .       RS=768019142;dbSNPBuildID=144;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=INDEL
NC_000001.11    10055   rs892501864     T       A       .       .       RS=892501864;dbSNPBuildID=150;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;GNO;FREQ=Siberian:0
NC_000001.11    10056   rs1570391738    A       C       .       .       RS=1570391738;dbSNPBuildID=154;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;GNO;FREQ=KOREAN:0.
NC_000001.11    10057   rs1570391741    A       C,G     .       .       RS=1570391741;dbSNPBuildID=154;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;GNO;FREQ=KOREAN:0.
NC_000001.11    10059   rs1570391745    C       G       .       .       RS=1570391745;dbSNPBuildID=154;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;GNO;FREQ=KOREAN:0.
NC_000001.11    10063   rs1010989343    A       C,G     .       .       RS=1010989343;dbSNPBuildID=150;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;GNO;FREQ=KOREAN:0.
NC_000001.11    10067   rs1489251879    T       TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCC      .       .       RS=1489251879;dbSNPBuildID=151;SSR=0;PSEUDOGENEINFO=DDX
NC_000001.11    10069   rs1570391755    A       G       .       .       RS=1570391755;dbSNPBuildID=154;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;GNO;FREQ=KOREAN:0.
NC_000001.11    10075   rs1570391757    A       G       .       .       RS=1570391757;dbSNPBuildID=154;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;GNO;FREQ=KOREAN:0.
NC_000001.11    10077   rs1022805358    C       G       .       .       RS=1022805358;dbSNPBuildID=150;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;GNO;FREQ=Siberian:
NC_000001.11    10081   rs1570391762    A       G       .       .       RS=1570391762;dbSNPBuildID=154;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;GNO;FREQ=KOREAN:0.
NC_000001.11    10086   rs1570391767    A       C       .       .       RS=1570391767;dbSNPBuildID=154;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;GNO;FREQ=KOREAN:0.
NC_000001.11    10092   rs1570391770    A       C       .       .       RS=1570391770;dbSNPBuildID=154;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;GNO;FREQ=KOREAN:0.

Output VCF#

An excerpt of the output of process-dbsnp.sh is shown below. Note that the contigs are added blind, i.e. they are known from the dbSNP VCF file rather than calculated on the fly.

##fileformat=VCFv4.2
##fileDate=20200501
##source=dbSNP
##dbSNP_BUILD_ID=154
##reference=GRCh38.p12
##phasing=partial
##contig=<ID=1>
##contig=<ID=2>
##contig=<ID=3>
##contig=<ID=4>
##contig=<ID=5>
##contig=<ID=6>
##contig=<ID=7>
##contig=<ID=8>
##contig=<ID=9>
##contig=<ID=10>
##contig=<ID=11>
##contig=<ID=12>
##contig=<ID=13>
##contig=<ID=14>
##contig=<ID=15>
##contig=<ID=16>
##contig=<ID=17>
##contig=<ID=18>
##contig=<ID=19>
##contig=<ID=20>
##contig=<ID=21>
##contig=<ID=22>
##contig=<ID=X>
##contig=<ID=Y>
##contig=<ID=MT>
##INFO=<ID=RS,Number=1,Type=Integer,Description="dbSNP ID (i.e. rs number)">
##INFO=<ID=GENEINFO,Number=1,Type=String,Description="Pairs each of gene symbol:gene id.  The gene symbol and id are delimited by a colon (:) and each pair is delimited by a vertical bar (>
##INFO=<ID=PSEUDOGENEINFO,Number=1,Type=String,Description="Pairs each of pseudogene symbol:gene id.  The pseudogene symbol and id are delimited by a colon (:) and each pair is delimited b>
##INFO=<ID=dbSNPBuildID,Number=1,Type=Integer,Description="First dbSNP Build for RS">
##INFO=<ID=SAO,Number=1,Type=Integer,Description="Variant Allele Origin: 0 - unspecified, 1 - Germline, 2 - Somatic, 3 - Both">
##INFO=<ID=SSR,Number=1,Type=Integer,Description="Variant Suspect Reason Codes (may be more than one value added together) 0 - unspecified, 1 - Paralog, 2 - byEST, 4 - oldAlign, 8 - Para_E>
##INFO=<ID=VC,Number=1,Type=String,Description="Variation Class">
##INFO=<ID=PM,Number=0,Type=Flag,Description="Variant has associated publication">
##INFO=<ID=NSF,Number=0,Type=Flag,Description="Has non-synonymous frameshift A coding region variation where one allele in the set changes all downstream amino acids. FxnClass = 44">
##INFO=<ID=NSM,Number=0,Type=Flag,Description="Has non-synonymous missense A coding region variation where one allele in the set changes protein peptide. FxnClass = 42">
##INFO=<ID=NSN,Number=0,Type=Flag,Description="Has non-synonymous nonsense A coding region variation where one allele in the set changes to STOP codon (TER). FxnClass = 41">
##INFO=<ID=SYN,Number=0,Type=Flag,Description="Has synonymous A coding region variation where one allele in the set does not change the encoded amino acid. FxnCode = 3">
##INFO=<ID=U3,Number=0,Type=Flag,Description="In 3' UTR Location is in an untranslated region (UTR). FxnCode = 53">
##INFO=<ID=U5,Number=0,Type=Flag,Description="In 5' UTR Location is in an untranslated region (UTR). FxnCode = 55">
##INFO=<ID=ASS,Number=0,Type=Flag,Description="In acceptor splice site FxnCode = 73">
##INFO=<ID=DSS,Number=0,Type=Flag,Description="In donor splice-site FxnCode = 75">
##INFO=<ID=INT,Number=0,Type=Flag,Description="In Intron FxnCode = 6">
##INFO=<ID=R3,Number=0,Type=Flag,Description="In 3' gene region FxnCode = 13">
##INFO=<ID=R5,Number=0,Type=Flag,Description="In 5' gene region FxnCode = 15">
##INFO=<ID=GNO,Number=0,Type=Flag,Description="Genotypes available.">
##INFO=<ID=PUB,Number=0,Type=Flag,Description="RefSNP or associated SubSNP is mentioned in a publication">
##INFO=<ID=FREQ,Number=.,Type=String,Description="An ordered list of allele frequencies as reported by various genomic studies, starting with the reference allele followed by alternate all>
##INFO=<ID=COMMON,Number=0,Type=Flag,Description="RS is a common SNP.  A common SNP is one that has at least one 1000Genomes population with a minor allele of frequency >= 1% and for which>
##INFO=<ID=CLNHGVS,Number=.,Type=String,Description="Variant names from HGVS.    The order of these variants corresponds to the order of the info in the other clinical  INFO tags.">
##INFO=<ID=CLNVI,Number=.,Type=String,Description="Variant Identifiers provided and maintained by organizations outside of NCBI, such as OMIM.  Source and id separated by colon (:).  Each >
##INFO=<ID=CLNORIGIN,Number=.,Type=String,Description="Allele Origin. One or more of the following values may be summed: 0 - unknown; 1 - germline; 2 - somatic; 4 - inherited; 8 - paternal>
##INFO=<ID=CLNSIG,Number=.,Type=String,Description="Variant Clinical Significance, 0 - Uncertain significance, 1 - not provided, 2 - Benign, 3 - Likely benign, 4 - Likely pathogenic, 5 - P>
##INFO=<ID=CLNDISDB,Number=.,Type=String,Description="Variant disease database name and ID, separated by colon (:)">
##INFO=<ID=CLNDN,Number=.,Type=String,Description="Preferred ClinVar disease name">
##INFO=<ID=CLNREVSTAT,Number=.,Type=String,Description="ClinVar Review Status: no_assertion - No asserition provided by submitter, no_criteria - No assertion criteria provided by submitter>
##INFO=<ID=CLNACC,Number=.,Type=String,Description="For each allele (comma delimited), this is a pipe-delimited list of the Clinvar RCV phenotype accession.version strings associated with >
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO
1       10001   rs1570391677    T       A       .       .       RS=1570391677;dbSNPBuildID=154;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;GNO;FREQ=KOREAN:0.9891,0.0109|SGDP_PRJ:0,1;COMM>
1       10002   rs1570391692    A       C       .       .       RS=1570391692;dbSNPBuildID=154;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;GNO;FREQ=KOREAN:0.9944,0.005597
1       10003   rs1570391694    A       C       .       .       RS=1570391694;dbSNPBuildID=154;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;GNO;FREQ=KOREAN:0.9902,0.009763
1       10008   rs1570391698    A       G       .       .       RS=1570391698;dbSNPBuildID=154;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;GNO;FREQ=KOREAN:0.9969,0.003086
1       10009   rs1570391702    A       G       .       .       RS=1570391702;dbSNPBuildID=154;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;GNO;FREQ=KOREAN:0.9911,0.008916
1       10015   rs1570391706    A       G       .       .       RS=1570391706;dbSNPBuildID=154;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;GNO;FREQ=KOREAN:0.9942,0.005826
1       10019   rs775809821     TA      T       .       .       RS=775809821;dbSNPBuildID=144;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=INDEL
1       10020   rs1570391708    A       C       .       .       RS=1570391708;dbSNPBuildID=154;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;GNO;FREQ=KOREAN:0.9973,0.002742
1       10021   rs1570391710    A       G       .       .       RS=1570391710;dbSNPBuildID=154;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;GNO;FREQ=KOREAN:0.9942,0.005826
1       10026   rs1570391712    A       C       .       .       RS=1570391712;dbSNPBuildID=154;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;GNO;FREQ=KOREAN:0.9976,0.002399
1       10027   rs1570391716    A       C,G     .       .       RS=1570391716;dbSNPBuildID=154;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;GNO;FREQ=KOREAN:0.9942,0.001371,0.004455
1       10032   rs1570391720    A       C       .       .       RS=1570391720;dbSNPBuildID=154;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;GNO;FREQ=KOREAN:0.9966,0.003427
1       10033   rs1570391722    A       G       .       .       RS=1570391722;dbSNPBuildID=154;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;GNO;FREQ=KOREAN:0.9949,0.005141
1       10039   rs978760828     A       C       .       .       RS=978760828;dbSNPBuildID=150;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;GNO;FREQ=Siberian:0.5,0.5
1       10043   rs1008829651    T       A       .       .       RS=1008829651;dbSNPBuildID=150;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;GNO;FREQ=Siberian:0.5,0.5
1       10045   rs1570391729    A       C,G     .       .       RS=1570391729;dbSNPBuildID=154;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;GNO;FREQ=KOREAN:0.9897,0.005822,0.004452
1       10051   rs1052373574    A       C,G     .       .       RS=1052373574;dbSNPBuildID=150;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;GNO;FREQ=KOREAN:0.9945,0.005479,.|Siberian:0.5,>
1       10051   rs1326880612    A       AC      .       .       RS=1326880612;dbSNPBuildID=151;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=INDEL
1       10055   rs768019142     T       TA      .       .       RS=768019142;dbSNPBuildID=144;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=INDEL
1       10055   rs892501864     T       A       .       .       RS=892501864;dbSNPBuildID=150;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;GNO;FREQ=Siberian:0.5,0.5
1       10056   rs1570391738    A       C       .       .       RS=1570391738;dbSNPBuildID=154;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;GNO;FREQ=KOREAN:0.9945,0.005479
1       10057   rs1570391741    A       C,G     .       .       RS=1570391741;dbSNPBuildID=154;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;GNO;FREQ=KOREAN:0.9935,0.006507,.|SGDP_PRJ:0.5,>
1       10059   rs1570391745    C       G       .       .       RS=1570391745;dbSNPBuildID=154;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;GNO;FREQ=KOREAN:0.9997,0.0003425
1       10063   rs1010989343    A       C,G     .       .       RS=1010989343;dbSNPBuildID=150;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;GNO;FREQ=KOREAN:0.9928,0.004112,0.003084|Siberi>
1       10067   rs1489251879    T       TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCC      .       .       RS=1489251879;dbSNPBuildID=151;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=INDEL
1       10069   rs1570391755    A       G       .       .       RS=1570391755;dbSNPBuildID=154;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;GNO;FREQ=KOREAN:0.9966,0.003425
1       10075   rs1570391757    A       G       .       .       RS=1570391757;dbSNPBuildID=154;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;GNO;FREQ=KOREAN:0.9979,0.002055
1       10077   rs1022805358    C       G       .       .       RS=1022805358;dbSNPBuildID=150;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;GNO;FREQ=Siberian:0.5,0.5
1       10081   rs1570391762    A       G       .       .       RS=1570391762;dbSNPBuildID=154;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;GNO;FREQ=KOREAN:0.999,0.001027
1       10086   rs1570391767    A       C       .       .       RS=1570391767;dbSNPBuildID=154;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;GNO;FREQ=KOREAN:0.9993,0.0006849
1       10092   rs1570391770    A       C       .       .       RS=1570391770;dbSNPBuildID=154;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;GNO;FREQ=KOREAN:0.9993,0.0006849

format-alfa#

Reformat the ALFA VCF to regular chromosome names and better sample IDs.

usage: format-alfa [-h] [-v] [-c] infile assembly [outfile]

Positional Arguments#

infile

The input ALFA VCF file (required).

assembly

An assembly chromosome mapper

outfile

An optional output file, if not provided output is to STDOUT

Named Arguments#

-v, --verbose

give more output

Default: False

-c, --ignore-chr-version

Ignore the chromosome version suffix, e.g. the .11 in NC_000001.11.

Default: False

In addition to re-mapping the chromosome names, this will also adjust the sample identifiers as detailed in the table below:

ALFA population groups#

| ALFA Population ID | Short Description | Remapped ID | Long Description |
|---|---|---|---|
| SAMN10492696 | African Others | ALFA_AFO | Individuals with African ancestry |
| SAMN10492698 | African American | ALFA_AFA | African American |
| SAMN10492703 | African | ALFA_AFR | All Africans |
| SAMN10492695 | European | ALFA_EUR | European |
| SAMN10492699 | Latin American 1 | ALFA_LAC | Latin American individuals with Afro-Caribbean ancestry |
| SAMN10492700 | Latin American 2 | ALFA_LEN | Latin American individuals with mostly European and Native American ancestry |
| SAMN10492702 | South Asian | ALFA_SAS | South Asian |
| SAMN10492697 | East Asian | ALFA_EAS | East Asian (95%) |
| SAMN10492704 | Asian | ALFA_ASN | All Asian individuals (EAS and OAS) excluding South Asian (SAS) |
| SAMN10492701 | Other Asian | ALFA_OAS | Asian individuals excluding South or East Asian |
| SAMN11605645 | Other | ALFA_OTR | The self-reported population is inconsistent with the GRAF-assigned population |
| SAMN10492705 | Total | ALFA_TOT | Total (~global) across all populations |
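The sample renaming in the table above amounts to a lookup on the sample columns of the #CHROM header line. A minimal sketch, assuming a tab-delimited header; the names `ALFA_SAMPLE_IDS` and `rename_samples` are illustrative and not part of the package:

```python
# BioSample ID to remapped ALFA ID, copied from the table above.
ALFA_SAMPLE_IDS = {
    "SAMN10492696": "ALFA_AFO",
    "SAMN10492698": "ALFA_AFA",
    "SAMN10492703": "ALFA_AFR",
    "SAMN10492695": "ALFA_EUR",
    "SAMN10492699": "ALFA_LAC",
    "SAMN10492700": "ALFA_LEN",
    "SAMN10492702": "ALFA_SAS",
    "SAMN10492697": "ALFA_EAS",
    "SAMN10492704": "ALFA_ASN",
    "SAMN10492701": "ALFA_OAS",
    "SAMN11605645": "ALFA_OTR",
    "SAMN10492705": "ALFA_TOT",
}


def rename_samples(header_line):
    """Rewrite the sample columns of a '#CHROM ...' VCF header line."""
    fields = header_line.rstrip("\n").split("\t")
    # The first nine columns (#CHROM to FORMAT) are fixed; samples follow.
    fixed, samples = fields[:9], fields[9:]
    # Unknown sample IDs are passed through unchanged.
    renamed = [ALFA_SAMPLE_IDS.get(sample, sample) for sample in samples]
    return "\t".join(fixed + renamed) + "\n"
```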

Input VCF#

Below is an example of the input ALFA VCF:

##fileformat=VCFv4.0
##build_id=20201027095038
##Population=https://www.ncbi.nlm.nih.gov/biosample/?term=GRAF-pop
##FORMAT=<ID=AN,Number=1,Type=Integer,Description="Total allele count for the population, including REF">
##FORMAT=<ID=AC,Number=A,Type=Integer,Description="Allele count for each ALT allele for the population">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  SAMN10492695    SAMN10492696    SAMN10492697    SAMN10492698    SAMN10492699    SAMN10492700    SAMN10492701    SAMN1
NC_000001.9     144135212       rs1553120241    G       A       .       .       .       AN:AC   8560:5387       8:8     256:224 336:288 32:24   170:117 32:24   18:13   20:15   344:296 288:2
NC_000001.9     144148243       rs2236566       G       T       .       .       .       AN:AC   5996:510        0:0     0:0     0:0     0:0     0:0     0:0     0:0     84:8    0:0     0:0
NC_000001.9     146267105       rs1553119693    T       G       .       .       .       AN:AC   37168:28800     36:22   56:44   1378:839        18:14   70:60   10:9    4836:3639       452:3
NC_000001.9     148488564       .       C       A       .       .       .       AN:AC   8552:0  8:0     256:0   338:0   32:0    170:0   32:0    16:0    20:0    346:0   288:0   9424:0
NC_000001.10    2701535 rs371068661     C       T       .       .       .       AN:AC   134:9   0:0     0:0     48:1    0:0     0:0     0:0     0:0     188:15  48:1    0:0     370:25
NC_000001.10    2701546 rs587702211     G       A       .       .       .       AN:AC   134:0   0:0     0:0     48:4    0:0     0:0     0:0     0:0     188:2   48:4    0:0     370:6
NC_000001.10    7426777 rs1553119850    GT      G       .       .       .       AN:AC   4473:4462       0:0     0:0     8:0     0:0     0:0     0:0     0:0     24:8    8:0     0:0     4505:
NC_000001.10    7426778 rs1553119849    T       C,G     .       .       .       AN:AC   4494:0,4483     0:0,0   2:0,2   32:0,24 8:0,8   6:0,6   2:0,2   0:0,0   304:0,288       32:0,24 4:0,4
NC_000001.10    12461010        rs762190215     T       TGC,TGCGCGCGC,TGCGCGC   .       .       .       AN:AC   4456:85,8,45    0:0,0,0 0:0,0,0 0:0,0,0 0:0,0,0 0:0,0,0 0:0,0,0 0:0,0,0 8:0,0
NC_000001.11    10001   .       T       C       .       .       .       AN:AC   7618:0  108:0   84:0    2708:0  146:0   610:0   24:0    94:0    470:0   2816:0  108:0   11862:0
NC_000001.11    10007   .       T       C,G     .       .       .       AN:AC   7618:0,0        108:0,0 84:0,0  2708:0,0        146:0,0 610:0,0 24:0,0  94:0,0  470:0,0 2816:0,0        108:0
NC_000001.11    10008   .       A       C,T     .       .       .       AN:AC   7618:0,0        108:0,0 84:0,0  2708:0,0        146:0,0 610:0,0 24:0,0  94:0,0  470:0,0 2816:0,0        108:0
NC_000001.11    10009   .       A       C,G     .       .       .       AN:AC   7616:0,0        108:0,0 84:0,0  2708:0,0        146:0,0 610:0,0 24:0,0  94:0,0  470:0,0 2816:0,0        108:0
NC_000001.11    10013   .       TA      T       .       .       .       AN:AC   6962:0  84:0    84:0    2210:0  146:0   610:0   24:0    94:0    466:0   2294:0  108:0   10680:0
NC_000001.11    10013   .       T       C,G     .       .       .       AN:AC   7618:0,0        108:0,0 84:0,0  2708:0,0        146:0,0 610:0,0 24:0,0  94:0,0  470:0,0 2816:0,0        108:0
NC_000001.11    10014   .       A       C,G,T   .       .       .       AN:AC   7618:0,0,0      108:0,0,0       84:0,0,0        2708:0,0,0      146:0,0,0       610:0,0,0       24:0,0,0
NC_000001.11    10015   .       A       C,G,T   .       .       .       AN:AC   7618:0,0,0      108:0,0,0       84:0,0,0        2708:0,0,0      146:0,0,0       610:0,0,0       24:0,0,0
NC_000001.11    10016   .       C       T       .       .       .       AN:AC   6962:0  84:0    84:0    2210:0  146:0   610:0   24:0    94:0    466:0   2294:0  108:0   10680:0
NC_000001.11    10020   .       A       C,G,T   .       .       .       AN:AC   7616:0,0,0      108:0,0,0       84:0,0,0        2708:0,0,0      146:0,0,0       610:0,0,0       24:0,0,0
NC_000001.11    10021   .       A       C,G     .       .       .       AN:AC   7618:0,0        108:0,0 84:0,0  2708:0,0        146:0,0 610:0,0 24:0,0  94:0,0  470:0,0 2816:0,0        108:0
NC_000001.11    10022   .       C       A,G     .       .       .       AN:AC   7618:0,0        108:0,0 84:0,0  2708:0,0        146:0,0 610:0,0 24:0,0  94:0,0  470:0,0 2816:0,0        108:0
NC_000001.11    10023   .       C       T       .       .       .       AN:AC   6962:0  84:0    84:0    2210:0  146:0   610:0   24:0    94:0    466:0   2294:0  108:0   10680:0
NC_000001.11    10024   .       C       CT      .       .       .       AN:AC   7618:0  108:0   84:0    2708:0  146:0   610:0   24:0    94:0    470:0   2816:0  108:0   11862:0

Output VCF#

Below is an example of the output ALFA VCF. This run does not have --ignore-chr-version enabled, so chromosomes from older assemblies (NC_000001.9 and NC_000001.10) are removed:

##fileformat=VCFv4.0
##build_id=20201027095038
##Population=https://www.ncbi.nlm.nih.gov/biosample/?term=GRAF-pop
##FORMAT=<ID=AN,Number=1,Type=Integer,Description="Total allele count for the population, including REF">
##FORMAT=<ID=AC,Number=A,Type=Integer,Description="Allele count for each ALT allele for the population">
##contig=<ID=1>
##contig=<ID=2>
##contig=<ID=3>
##contig=<ID=4>
##contig=<ID=5>
##contig=<ID=6>
##contig=<ID=7>
##contig=<ID=8>
##contig=<ID=9>
##contig=<ID=10>
##contig=<ID=11>
##contig=<ID=12>
##contig=<ID=13>
##contig=<ID=14>
##contig=<ID=15>
##contig=<ID=16>
##contig=<ID=17>
##contig=<ID=18>
##contig=<ID=19>
##contig=<ID=20>
##contig=<ID=21>
##contig=<ID=22>
##contig=<ID=X>
##contig=<ID=Y>
##contig=<ID=MT>
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  ALFA_EUR        ALFA_AFO        ALFA_EAS        ALFA_AFA        ALFA_LAC        ALFA_LEN        ALFA_OAS        ALFA_
1       10001   .       T       C       .       .       .       AN:AC   7618:0  108:0   84:0    2708:0  146:0   610:0   24:0    94:0    470:0   2816:0  108:0   11862:0
1       10007   .       T       C,G     .       .       .       AN:AC   7618:0,0        108:0,0 84:0,0  2708:0,0        146:0,0 610:0,0 24:0,0  94:0,0  470:0,0 2816:0,0        108:0,0 11862
1       10008   .       A       C,T     .       .       .       AN:AC   7618:0,0        108:0,0 84:0,0  2708:0,0        146:0,0 610:0,0 24:0,0  94:0,0  470:0,0 2816:0,0        108:0,0 11862
1       10009   .       A       C,G     .       .       .       AN:AC   7616:0,0        108:0,0 84:0,0  2708:0,0        146:0,0 610:0,0 24:0,0  94:0,0  470:0,0 2816:0,0        108:0,0 11860
1       10013   .       TA      T       .       .       .       AN:AC   6962:0  84:0    84:0    2210:0  146:0   610:0   24:0    94:0    466:0   2294:0  108:0   10680:0
1       10013   .       T       C,G     .       .       .       AN:AC   7618:0,0        108:0,0 84:0,0  2708:0,0        146:0,0 610:0,0 24:0,0  94:0,0  470:0,0 2816:0,0        108:0,0 11862
1       10014   .       A       C,G,T   .       .       .       AN:AC   7618:0,0,0      108:0,0,0       84:0,0,0        2708:0,0,0      146:0,0,0       610:0,0,0       24:0,0,0        94:0,
1       10015   .       A       C,G,T   .       .       .       AN:AC   7618:0,0,0      108:0,0,0       84:0,0,0        2708:0,0,0      146:0,0,0       610:0,0,0       24:0,0,0        94:0,
1       10016   .       C       T       .       .       .       AN:AC   6962:0  84:0    84:0    2210:0  146:0   610:0   24:0    94:0    466:0   2294:0  108:0   10680:0
1       10020   .       A       C,G,T   .       .       .       AN:AC   7616:0,0,0      108:0,0,0       84:0,0,0        2708:0,0,0      146:0,0,0       610:0,0,0       24:0,0,0        94:0,
1       10021   .       A       C,G     .       .       .       AN:AC   7618:0,0        108:0,0 84:0,0  2708:0,0        146:0,0 610:0,0 24:0,0  94:0,0  470:0,0 2816:0,0        108:0,0 11862
1       10022   .       C       A,G     .       .       .       AN:AC   7618:0,0        108:0,0 84:0,0  2708:0,0        146:0,0 610:0,0 24:0,0  94:0,0  470:0,0 2816:0,0        108:0,0 11862
1       10023   .       C       T       .       .       .       AN:AC   6962:0  84:0    84:0    2210:0  146:0   610:0   24:0    94:0    466:0   2294:0  108:0   10680:0
1       10024   .       C       CT      .       .       .       AN:AC   7618:0  108:0   84:0    2708:0  146:0   610:0   24:0    94:0    470:0   2816:0  108:0   11862:0

format-snpstats#

Reformat one or more SNPSTATs files into a VCF format.

usage: format-snpstats [-h] [-o OUTFILE] [--reference-genome REFERENCE_GENOME]
                       [--count-col COUNT_COL] [-v]
                       infiles [infiles ...]

Positional Arguments#

infiles

One or more SNPstats files. Files should not be compressed.

Named Arguments#

-o, --outfile

An optional output file, if not provided output is to STDOUT

--reference-genome

An indexed fasta reference genome, if you want the VCF header to contain all the contigs in the reference genome. If not provided then chrs 1-22, X, Y, MT are used as a default.

--count-col

The name of the allele counts column that will be created in the VCF file

Default: “ALLELE_COUNT”

-v, --verbose

give more output

Default: False

fix-vcf-allele-number#

Adjust the allele number (AN) of allele counts that have been split with bcftools. When bcftools splits a multi-allelic site into multiple bi-allelic sites, it does a good job of putting the correct AC with the correct site but does not adjust the AN to account for the removed alleles. So a site that has AN:AC of 1000:100,5,2 (three ALT alleles) will be set to 1000:100, 1000:5 and 1000:2, which means that the reference allele count will vary across the bi-allelic forms of the site. The correct representation should be 993:100, 898:5 and 895:2. This gives a reference allele count of 893 for all forms of the site.
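The arithmetic behind this adjustment can be sketched in a few lines of Python. This is only an illustration of the calculation, not the script's actual implementation: the function name `adjust_allele_numbers` is hypothetical, and it starts from the original multi-allelic AN and AC values rather than from split VCF records.

```python
def adjust_allele_numbers(an, alt_counts):
    """Return one (AN, AC) pair per bi-allelic form of a multi-allelic site.

    an: the allele number of the original multi-allelic site.
    alt_counts: the allele count of each ALT allele at that site.
    """
    # The reference allele count is whatever is left after removing all ALTs
    ref_count = an - sum(alt_counts)
    # Each bi-allelic form only "sees" the reference allele and its own ALT
    return [(ref_count + ac, ac) for ac in alt_counts]


pairs = adjust_allele_numbers(1000, [100, 5, 2])
print(pairs)  # [(993, 100), (898, 5), (895, 2)]
```

In every returned pair, AN minus AC is 893, so the reference allele count is the same for all forms of the site.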

usage: fix-vcf-allele-number [-h] [-i INFILE] [-o OUTFILE] [-v]

Named Arguments#

-i, --infile

An optional input file, if not supplied then STDIN is used

-o, --outfile

An optional output file, if not provided output is to STDOUT

-v, --verbose

give more output

Default: False

quick-lift#

merge-count-vcfs#

Merge two or more allele count VCF files that have been pre-sorted on (chr_name, start_pos). Please do not use for general VCF merging; this is only for allele count mapping VCF files and should not be mistaken for a generalisable VCF merging script. The VCF files must have 1 or more “AN:AC” fields after FORMAT (and nothing else), where AN is the total allele number and AC is the count of each alternate allele. The VCF files must be sorted in the same way, which should be the natural string sort order of chromosome name and the numeric sort order of the start position. Also, it is assumed that the VCFs only portray bi-allelic variants. All the variant ID data and INFO fields are taken from the reference VCF file.

Please note that this will perform a system call to tabix, so it should be installed and in your path. Tabix is not used for the merge, only to verify the sort order of all the files being merged.
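The expected sort order can be illustrated with a small Python check. This is a sketch under assumptions, not what the script does (the script relies on tabix for verification): the function name `is_count_vcf_sorted` is hypothetical, and "natural string sort order" is read here as a plain string comparison of chromosome names, so "10" sorts before "2".

```python
def is_count_vcf_sorted(records):
    """Check (chr_name, start_pos) pairs are in the assumed merge order.

    Chromosome names are compared as strings and start positions as
    integers, which is one reading of the sort order described above.
    """
    keys = [(str(chr_name), int(start_pos)) for chr_name, start_pos in records]
    # Every record must sort at or after the record before it
    return all(prev <= curr for prev, curr in zip(keys, keys[1:]))


print(is_count_vcf_sorted([("1", 100), ("1", 200), ("10", 50), ("2", 10)]))  # True
print(is_count_vcf_sorted([("1", 100), ("2", 10), ("10", 50)]))  # False
```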

usage: merge-count-vcfs [-h] [-d DATA_NAMES [DATA_NAMES ...]] [-r REF_NAME]
                        [-g REF_GENOME] [-o OUTFILE] [-T TMP_DIR] [-v]
                        ref_file merge_files [merge_files ...]

Positional Arguments#

ref_file

A vcf to act as a reference file

merge_files

One or more input counts files to merge into ref_file

Named Arguments#

-d, --data-names

One or more dataset names; if not given these default to ds1, ds2, ds3. If given, the number of names must equal the number of merge files

-r, --ref-name

If you want rows from the reference file labelled in the output then supply a name for the reference

-g, --ref-genome

Path to a reference genome assembly, if provided the contigs from this are used in the output VCF

-o, --outfile

An output file, if provided will be written as a bgzipped file, if not provided then will output to STDOUT

-T, --tmp-dir

An alternate temp location to write to (default /tmp)

-v, --verbose

give more output

Default: False

See merge-count-vcfs.sh for a bash wrapper around this.

make-site-chunks#

Parse genomic coordinates from an input file and generate coordinate boundaries of regions that contain a target number of sites (rows) in the input file. This is designed so that the region boundaries can be tabix queried out to give subsets of the input file with defined size. The number of sites in a region may be greater than the target in instances where the target number of sites occurs in a region with multiple sites having the same co-ordinates. This is to ensure that the end coordinate of one region is different from the start coordinate of the subsequent region. The input file must be sorted on the chr_name, start_pos and end_pos columns.

Parse through the positional information of a file and produce an output of region boundaries that contain ~a target number of sites. The number of sites might not be exactly the same as the target if there are multiple sites with the same coordinate at the point where the target number is reached. In these cases all sites at those coordinates are included in the chunk so the start/end coordinate of all regions is unique.
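The region-building rule described above can be sketched as follows. This is a minimal illustration and not the script's own code: the function name `make_site_chunks` mirrors the endpoint but is hypothetical, each site is reduced to a single position (no separate end_pos), and it is an assumption that a region never spans two chromosomes.

```python
def make_site_chunks(sites, target):
    """Yield (chr_name, start_pos, end_pos, nsites) regions.

    sites: iterable of (chr_name, pos) tuples, sorted by chromosome and
    then position. target: the desired number of sites per region.
    """
    region = None  # [chr_name, start_pos, end_pos, nsites]
    for chr_name, pos in sites:
        if region is not None:
            same_coord = chr_name == region[0] and pos == region[2]
            full = region[3] >= target
            # Close the region on a chromosome change, or once it is full
            # and the incoming site sits at a new coordinate
            if chr_name != region[0] or (full and not same_coord):
                yield tuple(region)
                region = None
        if region is None:
            region = [chr_name, pos, pos, 0]
        region[2] = pos
        region[3] += 1
    if region is not None:
        yield tuple(region)


sites = [("1", 10), ("1", 20), ("1", 20), ("1", 20), ("1", 30), ("2", 5)]
print(list(make_site_chunks(sites, 2)))
# [('1', 10, 20, 4), ('1', 30, 30, 1), ('2', 5, 5, 1)]
```

The first region holds four sites rather than the target of two, because three sites share position 20 and all of them are kept together.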

usage: make-site-chunks [-h] [-i [INFILES ...]] [-o OUTFILE]
                        [--out-dir OUT_DIR] [--out-ext OUT_EXT] [-T TMP_DIR]
                        [-d DELIMITER] [-c COMMENT_CHAR] [-C CHR_NAME]
                        [-S START_POS] [-E END_POS] [-R REF_ALLELE] [-v]
                        [--bgen]
                        target

Positional Arguments#

target

The number of sites per interval

Named Arguments#

-i, --infiles

An input file, if not provided then STDIN is used. Must be sorted on chr_name, start_pos, end_pos, can be gzip compressed

-o, --outfile

An output file, if not provided then STDOUT is used

--out-dir

The output directory prefixed onto the outfile (if input is not STDIN)

--out-ext

The output file extension added onto the outfile (if input is not STDIN)

-T, --tmp-dir

A temp directory

-d, --delimiter

An input file delimiter (default: '\t')

Default: “\t”

-c, --comment-char

The comment character, lines starting with this are ignored (but still output) (default: ##)

Default: “##”

-C, --chr-name

The name of the chromosome column (default: #CHROM)

Default: “#CHROM”

-S, --start-pos

The name of the start position column (default: POS)

Default: “POS”

-E, --end-pos

The name of the end position column, if not there use --start-pos as --end-pos (default: POS)

-R, --ref-allele

The name of the reference allele column (if present). If this is defined then the end position is calculated as start position + length(ref) - 1

Default: “REF”

-v, --verbose

Give more output (to <STDERR>)

Default: False

--bgen

The input files are bgen format

Default: False

The script will output the following columns in all cases:

  1. rowidx - The row number (sequential count indexed from 1)

  2. region_idx - The region number (sequential count indexed from 1)

  3. chr_name - The chromosome name

  4. start_pos - The start position of the region

  5. end_pos - The end position of the region

  6. nsites - The number of sites present in the region

If input is from a file rather than <STDIN>, then additional columns will be added:

  1. infile - The input file name given to make-site-chunks

  2. outfile - A potential output file name based on the input file name and the data in columns 2-5. It has the structure: <root infile name>.<region_idx>.<chr_name>.<start_pos>-<end_pos>.<extension>
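The outfile structure above can be reproduced with a short helper. This is a sketch only: `chunk_outfile` and its parameter names are hypothetical, the root name is passed in directly because how the script derives it from the input file is not documented here, and joining `out_dir` with `os.path.join` is an assumption about how --out-dir is prefixed.

```python
import os


def chunk_outfile(root, region_idx, chr_name, start_pos, end_pos, ext,
                  out_dir=None):
    """Build <root>.<region_idx>.<chr_name>.<start_pos>-<end_pos>.<ext>."""
    name = f"{root}.{region_idx}.{chr_name}.{start_pos}-{end_pos}.{ext}"
    # Prefix the output directory only when one is supplied
    return os.path.join(out_dir, name) if out_dir else name


print(chunk_outfile("mapping", 3, "1", 10013, 20000, "vcf.gz"))
# mapping.3.1.10013-20000.vcf.gz
```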

merge-cadd#

Merge CADD data into a VCF file.

usage: merge-cadd [-h] [-o OUTFILE] [-T TMP_DIR] [-v]
                  vcf_file cadd_files [cadd_files ...]

Positional Arguments#

vcf_file

A vcf to merge into

cadd_files

One or more input CADD files to merge into vcf_file

Named Arguments#

-o, --outfile

An output file, if provided will be written as a bgzipped file, if not provided then will output to STDOUT

-T, --tmp-dir

An alternate temp location to write to (default /tmp)

-v, --verbose

give more output

Default: False

split-mapping-file#

Partition the mapping file into a common file and a rare file based on MAF and/or MAC (applied in an OR fashion).
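One reading of the "OR fashion" is that a variant is treated as common when either cutoff is met, and as rare only when it falls below both. The sketch below illustrates that reading; the function name `is_common` is hypothetical and the script's actual logic may differ. The default cutoffs match the --maf and --mac defaults.

```python
def is_common(maf, mac, maf_cutoff=0.01, mac_cutoff=50):
    """Return True if a variant would go to the common file.

    Assumes the cutoffs are combined with OR: meeting either the MAF or
    the MAC cutoff (>=) is enough to be classed as common.
    """
    return maf >= maf_cutoff or mac >= mac_cutoff


print(is_common(0.02, 10))   # True, MAF cutoff met
print(is_common(0.001, 60))  # True, MAC cutoff met
print(is_common(0.001, 10))  # False, below both cutoffs
```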

usage: split-mapping-file [-h] [-f MAF] [-c MAC] [-T TMP_DIR] [-v]
                          mapping_file common_out rare_out

Positional Arguments#

mapping_file

A vcf mapping file to partition

common_out

The name of the output file containing the common variants

rare_out

The name of the output file containing the rare variants

Named Arguments#

-f, --maf

The MAF cutoff: anything >= this is common, anything < this is rare

Default: 0.01

-c, --mac

The MAC cutoff: anything >= this is common, anything < this is rare

Default: 50

-T, --tmp-dir

An alternate temp location to write to

-v, --verbose

give more output

Default: False

Other admin scripts#

convert-xml#

Scripts and functions to convert old-style XML files to new-style ones. The average user will not need to use this as it was designed to convert old XMLs to new formats during development.

usage: convert-xml [-h] [-v] old_xml new_xml

Positional Arguments#

old_xml

The old XML file (can be gzip compressed)

new_xml

The new XML file; it will be compressed if the file extension is gz

Named Arguments#

-v, --verbose

give more output

Default: False

gwas-norm-test-data#

A tool to generate test data and result data. This is to simplify the process of generating end-to-end tests as they are a real pain to set up manually.

Currently, this can implement:

  1. study/analysis files

  2. study_file/key analysis

  3. Multiple genome assemblies

  4. Failed liftover files

  5. Top hits files.

  6. XML metadata generation

  7. Mapper file generation

  8. Duplicated variants in mapper.

  9. Flipping of effect alleles

  10. beta, log(or), log(rr), log(hr), or, rr, hr effect types

  11. different analysis types

Need to implement:

  1. Logged p-values in input

  2. Missing other allele column and adding rows to bad data

  3. Out of range p-value generation, including inf, missing etc…

  4. Different map info columns.

  5. Metadata tests and output files, include the probability for test failure.

  6. Other info column definitions and static data info.

  7. Missing effect sizes

  8. Exotic column types, CIs/chr-pos

  9. Proper population definitions, including error populations

usage: gwas-norm-test-data [-h] [--genomic-config GENOMIC_CONFIG]
                           [--tmpdir TMPDIR] [--mapper-name MAPPER_NAME]
                           [--refgen-name REFGEN_NAME] [--species SPECIES]
                           [--seed SEED] [-v]
                           outdir config

Positional Arguments#

outdir

An output directory name, will be created if it does not exist. If it exists it will be wiped. Several sub-directories and files will be created in here.

config

A test data setup config file. This is a config file that describes how the test should be setup

Named Arguments#

--genomic-config

The path to the genomic config file. This provides donor files for the test, such as a mapping file and chain files

--tmpdir

The path to a tempdir to create the files in. The finished files are then moved from here to the outdir

--mapper-name

The name of the mapper file in the genomic config to use for the source assembly in the test.

Default: “all”

--refgen-name

The name of the reference genome name to use for all required assemblies (source and target).

Default: “local”

--species

The name of the species to use (for genomic config queries).

Default: “human”

--seed

The random seed to use. If not set then no random seed is used.

-v, --verbose

give more output, use -vv for progress monitoring

Test config file#

The parameters for the test data are defined in a test configuration TOML file. Some examples can be seen in ./resources/test_config. They are also documented in the comments of the two example files below.

An example of a study/analysis file config:

[general]
# The total number of variants across all requested files.
nvariants = 100

# The secondary mapping file is a backup file that contains rare variants
# not in the primary mapping file. The idea is to speed up performance
# with most common variants in the primary file. This does not test
# performance but does test mapping from the secondary mapping file.
# The number of variants to be placed into a secondary mapping file, so
# variants in the primary mapping file are nvariants - secondary_mapper
secondary_mapper = 20

# Every 'mapper_dup_vars_idx' variant is duplicated, to enable testing
# of lack of mapping when no other allele is present
mapper_dup_vars_idx = 10

# This is the probability that the ref allele is flipped
prob_ref_flip=0.2

# This is the probability that a duplicated variant is ref flipped.
prob_dup_flip=0.8

# The mapper info fields to include in the output normalised file. If an
# empty list is given then none will be included. If this option is
# missed altogether then all will be included. They are all listed here
# for reference.
# obs (list): of mapping databases that the variant has been observed in
# idx (int): The row number in the input file that the variant occupies
# nsites (int): The number of variant sites that overlap the source variants
#               chromosome/position
# caddp (float): The CADD Phred score
# caddr (float): The CADD raw score
# sift (float): The SIFT score
# polyp (float): The polyphen score
# clinvar (str): The clinvar consequence
# vep (str): The variant effect predictor worst consequence for the variant.
mapper_info_fields=['obs', 'idx', 'nsites', 'caddp', 'caddr', 'sift', 'polyp',
                    'clinvar', 'vep']
# Default is 0.05
top_hits_pvalue=5E-05

# Files specified outside of an analysis apply to all analyses or study files
[files]
# Default is 1
nfiles = 4

# Files are 0-based indexed and fileN is a default that will apply to all
# files unless specified otherwise
[files.fileN]
header = true
delimiter = "\t"

# The column definitions for a file, Columns known by GWAS norm should have
# the name mapping, XML columns should be indicated (default is False)
[files.fileN.columns]
# Test input name = {name=<gwas-norm-name>}
CHROMO = {name="chr_name", xml=true}
POS = {name="start_pos", xml=true}
A1 = {name="effect_allele", xml=true}
A2 = {name="other_allele", xml=true}
P = {name="pvalue", xml=true}
EFFECT = {name="effect_size", xml=true}
SE = {name="standard_error", xml=true}

# In this case file3 is the fourth file, a file4 would raise an error as we
# are expecting four files not 5, they can have different columns and
# delimiters
[files.file3]
delimiter = ","

[files.file3.columns]
CHROMO = {name="chr_name", xml=true}
POS = {name="start_pos", xml=true}
A1 = {name="effect_allele", xml=true}
A2 = {name="other_allele", xml=true}
P = {name="pvalue", xml=true}
EFFECT = {name="effect_size", xml=true}
SE = {name="standard_error", xml=true}

# Studies are 0-based currently, only a single study is supported for each
# test data
[study0]
# Required
study_id = 1
# Optional
pubmed_id = 1
# Required
study_name = "test study"
# Required
source = "b37"
# Required
target = ["b36", "b37", "b38"]

# Not implemented yet
[study0.info]
[study0.info.defs]
A = {}

# Analyses are 0-based indexed and should be below studies
[study0.analysis0]
# Required, must be unique
analysis_id = 1
# Required
analysis_name = "chd"
# Required
phenotype = "CHD"
# Required
effect_type = 'or'
# Required
analysis_type = 'disease'
# These are not implemented yet
# af_pops = {}
# ld_pops = {}

[study0.analysis1]
analysis_id = 2
analysis_name = "cvd"
phenotype = "CVD"
effect_type = 'or'
analysis_type = 'disease'
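
The way the fileN defaults and the per-file overrides combine can be sketched in Python. This is an illustration rather than the tool's own code: `resolve_files` is a hypothetical name, the input is a plain dict shaped like the parsed `[files]` table, and the shallow merge (a per-file table replaces whole top-level keys such as `columns`) is an assumption.

```python
def resolve_files(files_section):
    """Return one settings dict per file index, 0-based.

    files_section mirrors the parsed [files] table: 'nfiles', an optional
    'fileN' table of defaults, and optional 'file<idx>' override tables.
    """
    nfiles = files_section.get("nfiles", 1)  # the documented default is 1
    # A file index at or beyond nfiles (e.g. file4 when nfiles = 4) is an error
    for key in files_section:
        suffix = key[4:]
        if key.startswith("file") and suffix.isdigit() and int(suffix) >= nfiles:
            raise ValueError(f"{key} defined but only {nfiles} files expected")
    defaults = files_section.get("fileN", {})
    resolved = []
    for idx in range(nfiles):
        settings = dict(defaults)  # start from the fileN defaults
        settings.update(files_section.get(f"file{idx}", {}))  # then override
        resolved.append(settings)
    return resolved


files = {
    "nfiles": 4,
    "fileN": {"header": True, "delimiter": "\t"},
    "file3": {"delimiter": ","},
}
for idx, settings in enumerate(resolve_files(files)):
    print(idx, repr(settings["delimiter"]))
```

With this input, files 0 to 2 keep the tab delimiter from fileN and file 3 uses a comma.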

An example of a study file/ key analysis file config, this is where multiple analyses share a single input file (or occur anywhere in a set of input files):

[general]
# The total number of variants across all requested files.
nvariants = 100

# The secondary mapping file is a backup file that contains rare variants
# not in the primary mapping file. The idea is to speed up performance
# with most common variants in the primary file. This does not test
# performance but does test mapping from the secondary mapping file.
# The number of variants to be placed into a secondary mapping file, so
# variants in the primary mapping file are nvariants - secondary_mapper
secondary_mapper = 20

# Every 'mapper_dup_vars_idx' variant is duplicated, to enable testing
# of lack of mapping when no other allele is present
mapper_dup_vars_idx = 10

# This is the probability that the ref allele is flipped
prob_ref_flip=0.2

# This is the probability that a duplicated variant is ref flipped.
prob_dup_flip=0.8

# The mapper info fields to include in the output normalised file. If an
# empty list is given then none will be included. If this option is
# missed altogether then all will be included. They are all listed here
# for reference.
# obs (list): of mapping databases that the variant has been observed in
# idx (int): The row number in the input file that the variant occupies
# nsites (int): The number of variant sites that overlap the source variants
#               chromosome/position
# caddp (float): The CADD Phred score
# caddr (float): The CADD raw score
# sift (float): The SIFT score
# polyp (float): The polyphen score
# clinvar (str): The clinvar consequence
# vep (str): The variant effect predictor worst consequence for the variant.
mapper_info_fields=['obs', 'idx', 'nsites', 'caddp', 'caddr', 'sift', 'polyp',
                    'clinvar', 'vep']
# Default is 0.05
top_hits_pvalue=5E-05

# Files specified outside of a study file apply to all analyses or study
# files, however, currently only a single study file is allowed so it makes
# very little difference at the moment.
[files]
# Default is 1
nfiles = 4

# Files are 0-based indexed and fileN is a default that will apply to all
# files unless specified otherwise
[files.fileN]
header = true
delimiter = "\t"

# The column definitions for a file, Columns known by GWAS norm should have
# the name mapping, XML columns should be indicated (default is False)
[files.fileN.columns]
# Test input name = {name=<gwas-norm-name>}
CHROMO = {name="chr_name", xml=true}
POS = {name="start_pos", xml=true}
A1 = {name="effect_allele", xml=true}
A2 = {name="other_allele", xml=true}
P = {name="pvalue", xml=true}
EFFECT = {name="effect_size", xml=true}
SE = {name="standard_error", xml=true}
# Key columns must be defined for study files only int or str are supported.
# The key values are generated automatically
key1 = {type='int', key=true}

# In this case file3 is the fourth file, a file4 would raise an error as we
# are expecting four files not 5, they can have different columns and
# delimiters
[files.file3]
delimiter = ","

[files.file3.columns]
CHROMO = {name="chr_name", xml=true}
POS = {name="start_pos", xml=true}
A1 = {name="effect_allele", xml=true}
A2 = {name="other_allele", xml=true}
P = {name="pvalue", xml=true}
EFFECT = {name="effect_size", xml=true}
SE = {name="standard_error", xml=true}
# Multiple key columns are supported as well as different keys in different
# files. Analyses are spread over files such that they only occur in files
# with the same key columns
key3 = {type='int', key=true}
key4 = {type='str', key=true}

# study file definition
[study_file0]
# Required
study_id = 1
# Optional
pubmed_id = 1
# Required
study_name = "test study"
# Required
source = "b37"
# Required
target = ["b36", "b37", "b38"]
# Required
effect_type ='beta'
# Required
analysis_type ='metabqtl'

[study_file0.info]

# Not implemented yet
[study_file0.info.defs]
A = {}

[study_file0.analysis0]
# Required
analysis_id = 1
# Required
analysis_name = "hdl-c"
# Required
phenotype = "HDL-C"
phenotype = "HDL-C"
# Not implemented yet
# af_pops = {}
# ld_pops = {}

[study_file0.analysis1]
analysis_id = 2
analysis_name = "vldl-c"
phenotype = "VLDL-C"
# af_pops = {}
# ld_pops = {}
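
The comment in the config about analyses only occurring in files with the same key columns implies that files are grouped by their key columns. A minimal sketch of such a grouping is below; `group_files_by_keys` is a hypothetical name, the input is a plain dict shaped like the resolved column tables, and how the tool actually assigns analyses to groups is not shown here.

```python
from collections import defaultdict


def group_files_by_keys(file_columns):
    """Group file names by the set of columns flagged with key=true.

    file_columns: {file_name: {column_name: column_definition_dict}}.
    Returns {tuple_of_key_column_names: [file_name, ...]}.
    """
    groups = defaultdict(list)
    for file_name, columns in file_columns.items():
        # Sort the key names so that column order does not split a group
        keys = tuple(sorted(
            name for name, definition in columns.items()
            if definition.get("key")
        ))
        groups[keys].append(file_name)
    return dict(groups)


file_columns = {
    "file0": {"P": {"name": "pvalue"}, "key1": {"type": "int", "key": True}},
    "file1": {"P": {"name": "pvalue"}, "key1": {"type": "int", "key": True}},
    "file3": {"key3": {"type": "int", "key": True},
              "key4": {"type": "str", "key": True}},
}
print(group_files_by_keys(file_columns))
# {('key1',): ['file0', 'file1'], ('key3', 'key4'): ['file3']}
```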