Python endpoints#
The gwas_norm package installs several Python command-line endpoints. Some provide the key functionality of the package; others are more tangential and are involved in generating the mapping files.
Core scripts#
Mapping file generation#
dbsnp-download
#
Download dbSNP JSON files and simultaneously process them into gzip chunk files with a maximum number of rows per file. This enables easier parallel processing downstream.
The chunking process can take a while. Multiple processes can be assigned to it, although each process can only work on a single file. In future, I will leverage the bgzip2 format to define chunk positions within the files.
usage: dbsnp-download [-h] [--url URL] [-T TMP] [-u CHUNK_SIZE] [-p PROCESSES]
[-v]
outdir download_dir
Positional Arguments#
- outdir
The output directory for processed chunk files.
- download_dir
The directory for downloaded files.
Named Arguments#
- --url
The host to download the dbSNP files from
Default: “ftp.ncbi.nlm.nih.gov”
- -T, --tmp
The location of the temporary directory; if not provided, the system temporary directory is used
- -u, --chunk-size
The max number of JSON rows to output into each file.
Default: 1000000
- -p, --processes
The max number of processes to use for chunking files.
Default: 1
- -v, --verbose
Log output to STDERR; use -v to display file count progress and -vv for a download progress monitor
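The chunking idea behind dbsnp-download can be illustrated with a minimal sketch. This is not the package's implementation: the function name `chunk_lines`, the output file naming pattern and the assumption of one JSON record per line are all hypothetical and only meant to show how a `--chunk-size` style limit splits a stream into gzip files.

```python
import gzip
from pathlib import Path


def chunk_lines(lines, outdir, prefix, chunk_size=1_000_000):
    """Write an iterable of text lines into gzip files holding at most
    ``chunk_size`` lines each and return the paths that were written.

    The file naming scheme (<prefix>.<index>.json.gz) is hypothetical.
    """
    outdir = Path(outdir)
    outdir.mkdir(parents=True, exist_ok=True)
    paths = []
    out = None
    try:
        for n, line in enumerate(lines):
            # Start a new chunk file every chunk_size lines
            if n % chunk_size == 0:
                if out is not None:
                    out.close()
                path = outdir / f"{prefix}.{len(paths) + 1:05d}.json.gz"
                out = gzip.open(path, "wt")
                paths.append(path)
            out.write(line)
    finally:
        if out is not None:
            out.close()
    return paths
```

Because each chunk is an independent gzip file, downstream jobs can process the chunks in parallel without coordinating with each other.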
format-dbsnp
#
Reformat the dbSNP VCF to regular chromosome names.
usage: format-dbsnp [-h] [-v] [-c] infile assembly [outfile]
Positional Arguments#
- infile
A required file
- assembly
An assembly chromosome mapper
- outfile
An optional output file, if not provided output is to STDOUT
Named Arguments#
- -v, --verbose
give more output
Default: False
- -c, --ignore-chr-version
Ignore the chromosome version i.e. .11
Default: False
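The chromosome renaming can be sketched as below. This is only an illustration under stated assumptions: the mapper is shown as a plain dictionary with a single entry (the real assembly mapper file format is not described here), the function names are hypothetical, and rows whose chromosome cannot be mapped are assumed to be dropped.

```python
# Hypothetical excerpt of an assembly mapper: RefSeq accession -> chromosome name
CHR_MAP = {"NC_000001.11": "1"}


def map_chrom(accession, mapper, ignore_version=False):
    """Return the regular chromosome name for an accession, or None.

    With ignore_version=True the suffix after the dot (e.g. ".11") is
    ignored, so NC_000001.10 would also map to chromosome 1.
    """
    if not ignore_version:
        return mapper.get(accession)
    base = accession.split(".", 1)[0]
    for key, name in mapper.items():
        if key.split(".", 1)[0] == base:
            return name
    return None


def rename_record(line, mapper, ignore_version=False):
    """Rewrite the #CHROM column of a tab-delimited VCF data line.

    Returns None when the chromosome is not in the mapper (assumed to
    mean the row is skipped).
    """
    chrom, rest = line.split("\t", 1)
    name = map_chrom(chrom, mapper, ignore_version)
    if name is None:
        return None
    return f"{name}\t{rest}"
```

For example, `rename_record("NC_000001.11\t10001\t...", CHR_MAP)` would start with `1` instead of the RefSeq accession.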
Input VCF#
Below is an example of the input dbSNP VCF
##fileformat=VCFv4.2
##fileDate=20200501
##source=dbSNP
##dbSNP_BUILD_ID=154
##reference=GRCh38.p12
##phasing=partial
##INFO=<ID=RS,Number=1,Type=Integer,Description="dbSNP ID (i.e. rs number)">
##INFO=<ID=GENEINFO,Number=1,Type=String,Description="Pairs each of gene symbol:gene id. The gene symbol and id are delimited by a colon (:) and each pair is delimite
##INFO=<ID=PSEUDOGENEINFO,Number=1,Type=String,Description="Pairs each of pseudogene symbol:gene id. The pseudogene symbol and id are delimited by a colon (:) and eac
##INFO=<ID=dbSNPBuildID,Number=1,Type=Integer,Description="First dbSNP Build for RS">
##INFO=<ID=SAO,Number=1,Type=Integer,Description="Variant Allele Origin: 0 - unspecified, 1 - Germline, 2 - Somatic, 3 - Both">
##INFO=<ID=SSR,Number=1,Type=Integer,Description="Variant Suspect Reason Codes (may be more than one value added together) 0 - unspecified, 1 - Paralog, 2 - byEST, 4 -
##INFO=<ID=VC,Number=1,Type=String,Description="Variation Class">
##INFO=<ID=PM,Number=0,Type=Flag,Description="Variant has associated publication">
##INFO=<ID=NSF,Number=0,Type=Flag,Description="Has non-synonymous frameshift A coding region variation where one allele in the set changes all downstream amino acids.
##INFO=<ID=NSM,Number=0,Type=Flag,Description="Has non-synonymous missense A coding region variation where one allele in the set changes protein peptide. FxnClass = 42
##INFO=<ID=NSN,Number=0,Type=Flag,Description="Has non-synonymous nonsense A coding region variation where one allele in the set changes to STOP codon (TER). FxnClass
##INFO=<ID=SYN,Number=0,Type=Flag,Description="Has synonymous A coding region variation where one allele in the set does not change the encoded amino acid. FxnCode = 3
##INFO=<ID=U3,Number=0,Type=Flag,Description="In 3' UTR Location is in an untranslated region (UTR). FxnCode = 53">
##INFO=<ID=U5,Number=0,Type=Flag,Description="In 5' UTR Location is in an untranslated region (UTR). FxnCode = 55">
##INFO=<ID=ASS,Number=0,Type=Flag,Description="In acceptor splice site FxnCode = 73">
##INFO=<ID=DSS,Number=0,Type=Flag,Description="In donor splice-site FxnCode = 75">
##INFO=<ID=INT,Number=0,Type=Flag,Description="In Intron FxnCode = 6">
##INFO=<ID=R3,Number=0,Type=Flag,Description="In 3' gene region FxnCode = 13">
##INFO=<ID=R5,Number=0,Type=Flag,Description="In 5' gene region FxnCode = 15">
##INFO=<ID=GNO,Number=0,Type=Flag,Description="Genotypes available.">
##INFO=<ID=PUB,Number=0,Type=Flag,Description="RefSNP or associated SubSNP is mentioned in a publication">
##INFO=<ID=FREQ,Number=.,Type=String,Description="An ordered list of allele frequencies as reported by various genomic studies, starting with the reference allele foll
##INFO=<ID=COMMON,Number=0,Type=Flag,Description="RS is a common SNP. A common SNP is one that has at least one 1000Genomes population with a minor allele of frequenc
##INFO=<ID=CLNHGVS,Number=.,Type=String,Description="Variant names from HGVS. The order of these variants corresponds to the order of the info in the other clinical
##INFO=<ID=CLNVI,Number=.,Type=String,Description="Variant Identifiers provided and maintained by organizations outside of NCBI, such as OMIM. Source and id separated
##INFO=<ID=CLNORIGIN,Number=.,Type=String,Description="Allele Origin. One or more of the following values may be summed: 0 - unknown; 1 - germline; 2 - somatic; 4 - in
##INFO=<ID=CLNSIG,Number=.,Type=String,Description="Variant Clinical Significance, 0 - Uncertain significance, 1 - not provided, 2 - Benign, 3 - Likely benign, 4 - Lik
##INFO=<ID=CLNDISDB,Number=.,Type=String,Description="Variant disease database name and ID, separated by colon (:)">
##INFO=<ID=CLNDN,Number=.,Type=String,Description="Preferred ClinVar disease name">
##INFO=<ID=CLNREVSTAT,Number=.,Type=String,Description="ClinVar Review Status: no_assertion - No asserition provided by submitter, no_criteria - No assertion criteria
##INFO=<ID=CLNACC,Number=.,Type=String,Description="For each allele (comma delimited), this is a pipe-delimited list of the Clinvar RCV phenotype accession.version str
#CHROM POS ID REF ALT QUAL FILTER INFO
NC_000001.11 10001 rs1570391677 T A . . RS=1570391677;dbSNPBuildID=154;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;GNO;FREQ=KOREAN:0.
NC_000001.11 10002 rs1570391692 A C . . RS=1570391692;dbSNPBuildID=154;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;GNO;FREQ=KOREAN:0.
NC_000001.11 10003 rs1570391694 A C . . RS=1570391694;dbSNPBuildID=154;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;GNO;FREQ=KOREAN:0.
NC_000001.11 10008 rs1570391698 A G . . RS=1570391698;dbSNPBuildID=154;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;GNO;FREQ=KOREAN:0.
NC_000001.11 10009 rs1570391702 A G . . RS=1570391702;dbSNPBuildID=154;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;GNO;FREQ=KOREAN:0.
NC_000001.11 10015 rs1570391706 A G . . RS=1570391706;dbSNPBuildID=154;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;GNO;FREQ=KOREAN:0.
NC_000001.11 10019 rs775809821 TA T . . RS=775809821;dbSNPBuildID=144;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=INDEL
NC_000001.11 10020 rs1570391708 A C . . RS=1570391708;dbSNPBuildID=154;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;GNO;FREQ=KOREAN:0.
NC_000001.11 10021 rs1570391710 A G . . RS=1570391710;dbSNPBuildID=154;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;GNO;FREQ=KOREAN:0.
NC_000001.11 10026 rs1570391712 A C . . RS=1570391712;dbSNPBuildID=154;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;GNO;FREQ=KOREAN:0.
NC_000001.11 10027 rs1570391716 A C,G . . RS=1570391716;dbSNPBuildID=154;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;GNO;FREQ=KOREAN:0.
NC_000001.11 10032 rs1570391720 A C . . RS=1570391720;dbSNPBuildID=154;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;GNO;FREQ=KOREAN:0.
NC_000001.11 10033 rs1570391722 A G . . RS=1570391722;dbSNPBuildID=154;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;GNO;FREQ=KOREAN:0.
NC_000001.11 10039 rs978760828 A C . . RS=978760828;dbSNPBuildID=150;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;GNO;FREQ=Siberian:0
NC_000001.11 10043 rs1008829651 T A . . RS=1008829651;dbSNPBuildID=150;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;GNO;FREQ=Siberian:
NC_000001.11 10045 rs1570391729 A C,G . . RS=1570391729;dbSNPBuildID=154;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;GNO;FREQ=KOREAN:0.
NC_000001.11 10051 rs1052373574 A C,G . . RS=1052373574;dbSNPBuildID=150;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;GNO;FREQ=KOREAN:0.
NC_000001.11 10051 rs1326880612 A AC . . RS=1326880612;dbSNPBuildID=151;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=INDEL
NC_000001.11 10055 rs768019142 T TA . . RS=768019142;dbSNPBuildID=144;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=INDEL
NC_000001.11 10055 rs892501864 T A . . RS=892501864;dbSNPBuildID=150;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;GNO;FREQ=Siberian:0
NC_000001.11 10056 rs1570391738 A C . . RS=1570391738;dbSNPBuildID=154;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;GNO;FREQ=KOREAN:0.
NC_000001.11 10057 rs1570391741 A C,G . . RS=1570391741;dbSNPBuildID=154;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;GNO;FREQ=KOREAN:0.
NC_000001.11 10059 rs1570391745 C G . . RS=1570391745;dbSNPBuildID=154;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;GNO;FREQ=KOREAN:0.
NC_000001.11 10063 rs1010989343 A C,G . . RS=1010989343;dbSNPBuildID=150;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;GNO;FREQ=KOREAN:0.
NC_000001.11 10067 rs1489251879 T TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCC . . RS=1489251879;dbSNPBuildID=151;SSR=0;PSEUDOGENEINFO=DDX
NC_000001.11 10069 rs1570391755 A G . . RS=1570391755;dbSNPBuildID=154;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;GNO;FREQ=KOREAN:0.
NC_000001.11 10075 rs1570391757 A G . . RS=1570391757;dbSNPBuildID=154;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;GNO;FREQ=KOREAN:0.
NC_000001.11 10077 rs1022805358 C G . . RS=1022805358;dbSNPBuildID=150;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;GNO;FREQ=Siberian:
NC_000001.11 10081 rs1570391762 A G . . RS=1570391762;dbSNPBuildID=154;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;GNO;FREQ=KOREAN:0.
NC_000001.11 10086 rs1570391767 A C . . RS=1570391767;dbSNPBuildID=154;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;GNO;FREQ=KOREAN:0.
NC_000001.11 10092 rs1570391770 A C . . RS=1570391770;dbSNPBuildID=154;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;GNO;FREQ=KOREAN:0.
Output VCF#
An excerpt of the output of process-dbsnp.sh is shown below. Note that the contigs are added blind, i.e. they are known from the dbSNP VCF file rather than calculated on the fly.
##fileformat=VCFv4.2
##fileDate=20200501
##source=dbSNP
##dbSNP_BUILD_ID=154
##reference=GRCh38.p12
##phasing=partial
##contig=<ID=1>
##contig=<ID=2>
##contig=<ID=3>
##contig=<ID=4>
##contig=<ID=5>
##contig=<ID=6>
##contig=<ID=7>
##contig=<ID=8>
##contig=<ID=9>
##contig=<ID=10>
##contig=<ID=11>
##contig=<ID=12>
##contig=<ID=13>
##contig=<ID=14>
##contig=<ID=15>
##contig=<ID=16>
##contig=<ID=17>
##contig=<ID=18>
##contig=<ID=19>
##contig=<ID=20>
##contig=<ID=21>
##contig=<ID=22>
##contig=<ID=X>
##contig=<ID=Y>
##contig=<ID=MT>
##INFO=<ID=RS,Number=1,Type=Integer,Description="dbSNP ID (i.e. rs number)">
##INFO=<ID=GENEINFO,Number=1,Type=String,Description="Pairs each of gene symbol:gene id. The gene symbol and id are delimited by a colon (:) and each pair is delimited by a vertical bar (>
##INFO=<ID=PSEUDOGENEINFO,Number=1,Type=String,Description="Pairs each of pseudogene symbol:gene id. The pseudogene symbol and id are delimited by a colon (:) and each pair is delimited b>
##INFO=<ID=dbSNPBuildID,Number=1,Type=Integer,Description="First dbSNP Build for RS">
##INFO=<ID=SAO,Number=1,Type=Integer,Description="Variant Allele Origin: 0 - unspecified, 1 - Germline, 2 - Somatic, 3 - Both">
##INFO=<ID=SSR,Number=1,Type=Integer,Description="Variant Suspect Reason Codes (may be more than one value added together) 0 - unspecified, 1 - Paralog, 2 - byEST, 4 - oldAlign, 8 - Para_E>
##INFO=<ID=VC,Number=1,Type=String,Description="Variation Class">
##INFO=<ID=PM,Number=0,Type=Flag,Description="Variant has associated publication">
##INFO=<ID=NSF,Number=0,Type=Flag,Description="Has non-synonymous frameshift A coding region variation where one allele in the set changes all downstream amino acids. FxnClass = 44">
##INFO=<ID=NSM,Number=0,Type=Flag,Description="Has non-synonymous missense A coding region variation where one allele in the set changes protein peptide. FxnClass = 42">
##INFO=<ID=NSN,Number=0,Type=Flag,Description="Has non-synonymous nonsense A coding region variation where one allele in the set changes to STOP codon (TER). FxnClass = 41">
##INFO=<ID=SYN,Number=0,Type=Flag,Description="Has synonymous A coding region variation where one allele in the set does not change the encoded amino acid. FxnCode = 3">
##INFO=<ID=U3,Number=0,Type=Flag,Description="In 3' UTR Location is in an untranslated region (UTR). FxnCode = 53">
##INFO=<ID=U5,Number=0,Type=Flag,Description="In 5' UTR Location is in an untranslated region (UTR). FxnCode = 55">
##INFO=<ID=ASS,Number=0,Type=Flag,Description="In acceptor splice site FxnCode = 73">
##INFO=<ID=DSS,Number=0,Type=Flag,Description="In donor splice-site FxnCode = 75">
##INFO=<ID=INT,Number=0,Type=Flag,Description="In Intron FxnCode = 6">
##INFO=<ID=R3,Number=0,Type=Flag,Description="In 3' gene region FxnCode = 13">
##INFO=<ID=R5,Number=0,Type=Flag,Description="In 5' gene region FxnCode = 15">
##INFO=<ID=GNO,Number=0,Type=Flag,Description="Genotypes available.">
##INFO=<ID=PUB,Number=0,Type=Flag,Description="RefSNP or associated SubSNP is mentioned in a publication">
##INFO=<ID=FREQ,Number=.,Type=String,Description="An ordered list of allele frequencies as reported by various genomic studies, starting with the reference allele followed by alternate all>
##INFO=<ID=COMMON,Number=0,Type=Flag,Description="RS is a common SNP. A common SNP is one that has at least one 1000Genomes population with a minor allele of frequency >= 1% and for which>
##INFO=<ID=CLNHGVS,Number=.,Type=String,Description="Variant names from HGVS. The order of these variants corresponds to the order of the info in the other clinical INFO tags.">
##INFO=<ID=CLNVI,Number=.,Type=String,Description="Variant Identifiers provided and maintained by organizations outside of NCBI, such as OMIM. Source and id separated by colon (:). Each >
##INFO=<ID=CLNORIGIN,Number=.,Type=String,Description="Allele Origin. One or more of the following values may be summed: 0 - unknown; 1 - germline; 2 - somatic; 4 - inherited; 8 - paternal>
##INFO=<ID=CLNSIG,Number=.,Type=String,Description="Variant Clinical Significance, 0 - Uncertain significance, 1 - not provided, 2 - Benign, 3 - Likely benign, 4 - Likely pathogenic, 5 - P>
##INFO=<ID=CLNDISDB,Number=.,Type=String,Description="Variant disease database name and ID, separated by colon (:)">
##INFO=<ID=CLNDN,Number=.,Type=String,Description="Preferred ClinVar disease name">
##INFO=<ID=CLNREVSTAT,Number=.,Type=String,Description="ClinVar Review Status: no_assertion - No asserition provided by submitter, no_criteria - No assertion criteria provided by submitter>
##INFO=<ID=CLNACC,Number=.,Type=String,Description="For each allele (comma delimited), this is a pipe-delimited list of the Clinvar RCV phenotype accession.version strings associated with >
#CHROM POS ID REF ALT QUAL FILTER INFO
1 10001 rs1570391677 T A . . RS=1570391677;dbSNPBuildID=154;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;GNO;FREQ=KOREAN:0.9891,0.0109|SGDP_PRJ:0,1;COMM>
1 10002 rs1570391692 A C . . RS=1570391692;dbSNPBuildID=154;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;GNO;FREQ=KOREAN:0.9944,0.005597
1 10003 rs1570391694 A C . . RS=1570391694;dbSNPBuildID=154;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;GNO;FREQ=KOREAN:0.9902,0.009763
1 10008 rs1570391698 A G . . RS=1570391698;dbSNPBuildID=154;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;GNO;FREQ=KOREAN:0.9969,0.003086
1 10009 rs1570391702 A G . . RS=1570391702;dbSNPBuildID=154;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;GNO;FREQ=KOREAN:0.9911,0.008916
1 10015 rs1570391706 A G . . RS=1570391706;dbSNPBuildID=154;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;GNO;FREQ=KOREAN:0.9942,0.005826
1 10019 rs775809821 TA T . . RS=775809821;dbSNPBuildID=144;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=INDEL
1 10020 rs1570391708 A C . . RS=1570391708;dbSNPBuildID=154;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;GNO;FREQ=KOREAN:0.9973,0.002742
1 10021 rs1570391710 A G . . RS=1570391710;dbSNPBuildID=154;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;GNO;FREQ=KOREAN:0.9942,0.005826
1 10026 rs1570391712 A C . . RS=1570391712;dbSNPBuildID=154;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;GNO;FREQ=KOREAN:0.9976,0.002399
1 10027 rs1570391716 A C,G . . RS=1570391716;dbSNPBuildID=154;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;GNO;FREQ=KOREAN:0.9942,0.001371,0.004455
1 10032 rs1570391720 A C . . RS=1570391720;dbSNPBuildID=154;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;GNO;FREQ=KOREAN:0.9966,0.003427
1 10033 rs1570391722 A G . . RS=1570391722;dbSNPBuildID=154;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;GNO;FREQ=KOREAN:0.9949,0.005141
1 10039 rs978760828 A C . . RS=978760828;dbSNPBuildID=150;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;GNO;FREQ=Siberian:0.5,0.5
1 10043 rs1008829651 T A . . RS=1008829651;dbSNPBuildID=150;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;GNO;FREQ=Siberian:0.5,0.5
1 10045 rs1570391729 A C,G . . RS=1570391729;dbSNPBuildID=154;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;GNO;FREQ=KOREAN:0.9897,0.005822,0.004452
1 10051 rs1052373574 A C,G . . RS=1052373574;dbSNPBuildID=150;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;GNO;FREQ=KOREAN:0.9945,0.005479,.|Siberian:0.5,>
1 10051 rs1326880612 A AC . . RS=1326880612;dbSNPBuildID=151;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=INDEL
1 10055 rs768019142 T TA . . RS=768019142;dbSNPBuildID=144;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=INDEL
1 10055 rs892501864 T A . . RS=892501864;dbSNPBuildID=150;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;GNO;FREQ=Siberian:0.5,0.5
1 10056 rs1570391738 A C . . RS=1570391738;dbSNPBuildID=154;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;GNO;FREQ=KOREAN:0.9945,0.005479
1 10057 rs1570391741 A C,G . . RS=1570391741;dbSNPBuildID=154;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;GNO;FREQ=KOREAN:0.9935,0.006507,.|SGDP_PRJ:0.5,>
1 10059 rs1570391745 C G . . RS=1570391745;dbSNPBuildID=154;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;GNO;FREQ=KOREAN:0.9997,0.0003425
1 10063 rs1010989343 A C,G . . RS=1010989343;dbSNPBuildID=150;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;GNO;FREQ=KOREAN:0.9928,0.004112,0.003084|Siberi>
1 10067 rs1489251879 T TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCC . . RS=1489251879;dbSNPBuildID=151;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=INDEL
1 10069 rs1570391755 A G . . RS=1570391755;dbSNPBuildID=154;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;GNO;FREQ=KOREAN:0.9966,0.003425
1 10075 rs1570391757 A G . . RS=1570391757;dbSNPBuildID=154;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;GNO;FREQ=KOREAN:0.9979,0.002055
1 10077 rs1022805358 C G . . RS=1022805358;dbSNPBuildID=150;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;GNO;FREQ=Siberian:0.5,0.5
1 10081 rs1570391762 A G . . RS=1570391762;dbSNPBuildID=154;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;GNO;FREQ=KOREAN:0.999,0.001027
1 10086 rs1570391767 A C . . RS=1570391767;dbSNPBuildID=154;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;GNO;FREQ=KOREAN:0.9993,0.0006849
1 10092 rs1570391770 A C . . RS=1570391770;dbSNPBuildID=154;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;GNO;FREQ=KOREAN:0.9993,0.0006849
format-alfa
#
Reformat the ALFA VCF to regular chromosome names and better sample IDs.
usage: format-alfa [-h] [-v] [-c] infile assembly [outfile]
Positional Arguments#
- infile
A required file
- assembly
An assembly chromosome mapper
- outfile
An optional output file, if not provided output is to STDOUT
Named Arguments#
- -v, --verbose
give more output
Default: False
- -c, --ignore-chr-version
Ignore the chromosome version i.e. .11
Default: False
In addition to re-mapping the chromosome names, this will also adjust the sample identifiers as detailed in the table below:
ALFA Population ID | Short Description | Remapped ID | Long Description |
---|---|---|---|
SAMN10492696 | African Others | ALFA_AFO | Individuals with African ancestry |
SAMN10492698 | African American | ALFA_AFA | African American |
SAMN10492703 | African | ALFA_AFR | All Africans |
SAMN10492695 | European | ALFA_EUR | European |
SAMN10492699 | Latin American 1 | ALFA_LAC | Latin American individuals with Afro-Caribbean ancestry |
SAMN10492700 | Latin American 2 | ALFA_LEN | Latin American individuals with mostly European and Native American ancestry |
SAMN10492702 | South Asian | ALFA_SAS | South Asian |
SAMN10492697 | East Asian | ALFA_EAS | East Asian (95%) |
SAMN10492704 | Asian | ALFA_ASN | All Asian individuals (EAS and OAS) excluding South Asian (SAS) |
SAMN10492701 | Other Asian | ALFA_OAS | Asian individuals excluding South or East Asian |
SAMN11605645 | Other | ALFA_OTR | The self-reported population is inconsistent with the GRAF-assigned population |
SAMN10492705 | Total | ALFA_TOT | Total (~global) across all populations |
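The sample renaming in the table above amounts to a lookup on the sample columns of the `#CHROM` header line. The sketch below is an illustration only; the function name `rename_samples` is hypothetical, and it assumes a tab-delimited header with the nine fixed VCF columns before the samples.

```python
# ALFA BioSample ID -> remapped sample ID, taken from the table above
ALFA_IDS = {
    "SAMN10492696": "ALFA_AFO",
    "SAMN10492698": "ALFA_AFA",
    "SAMN10492703": "ALFA_AFR",
    "SAMN10492695": "ALFA_EUR",
    "SAMN10492699": "ALFA_LAC",
    "SAMN10492700": "ALFA_LEN",
    "SAMN10492702": "ALFA_SAS",
    "SAMN10492697": "ALFA_EAS",
    "SAMN10492704": "ALFA_ASN",
    "SAMN10492701": "ALFA_OAS",
    "SAMN11605645": "ALFA_OTR",
    "SAMN10492705": "ALFA_TOT",
}


def rename_samples(header_line, id_map=ALFA_IDS):
    """Replace sample IDs in a '#CHROM ...' VCF header line.

    The first nine columns (#CHROM to FORMAT) are left untouched and
    unknown sample IDs are passed through unchanged.
    """
    fields = header_line.rstrip("\n").split("\t")
    fixed, samples = fields[:9], fields[9:]
    return "\t".join(fixed + [id_map.get(s, s) for s in samples])
```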
Input VCF#
Below is an example of the input ALFA VCF:
##fileformat=VCFv4.0
##build_id=20201027095038
##Population=https://www.ncbi.nlm.nih.gov/biosample/?term=GRAF-pop
##FORMAT=<ID=AN,Number=1,Type=Integer,Description="Total allele count for the population, including REF">
##FORMAT=<ID=AC,Number=A,Type=Integer,Description="Allele count for each ALT allele for the population">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMN10492695 SAMN10492696 SAMN10492697 SAMN10492698 SAMN10492699 SAMN10492700 SAMN10492701 SAMN1
NC_000001.9 144135212 rs1553120241 G A . . . AN:AC 8560:5387 8:8 256:224 336:288 32:24 170:117 32:24 18:13 20:15 344:296 288:2
NC_000001.9 144148243 rs2236566 G T . . . AN:AC 5996:510 0:0 0:0 0:0 0:0 0:0 0:0 0:0 84:8 0:0 0:0
NC_000001.9 146267105 rs1553119693 T G . . . AN:AC 37168:28800 36:22 56:44 1378:839 18:14 70:60 10:9 4836:3639 452:3
NC_000001.9 148488564 . C A . . . AN:AC 8552:0 8:0 256:0 338:0 32:0 170:0 32:0 16:0 20:0 346:0 288:0 9424:0
NC_000001.10 2701535 rs371068661 C T . . . AN:AC 134:9 0:0 0:0 48:1 0:0 0:0 0:0 0:0 188:15 48:1 0:0 370:25
NC_000001.10 2701546 rs587702211 G A . . . AN:AC 134:0 0:0 0:0 48:4 0:0 0:0 0:0 0:0 188:2 48:4 0:0 370:6
NC_000001.10 7426777 rs1553119850 GT G . . . AN:AC 4473:4462 0:0 0:0 8:0 0:0 0:0 0:0 0:0 24:8 8:0 0:0 4505:
NC_000001.10 7426778 rs1553119849 T C,G . . . AN:AC 4494:0,4483 0:0,0 2:0,2 32:0,24 8:0,8 6:0,6 2:0,2 0:0,0 304:0,288 32:0,24 4:0,4
NC_000001.10 12461010 rs762190215 T TGC,TGCGCGCGC,TGCGCGC . . . AN:AC 4456:85,8,45 0:0,0,0 0:0,0,0 0:0,0,0 0:0,0,0 0:0,0,0 0:0,0,0 0:0,0,0 8:0,0
NC_000001.11 10001 . T C . . . AN:AC 7618:0 108:0 84:0 2708:0 146:0 610:0 24:0 94:0 470:0 2816:0 108:0 11862:0
NC_000001.11 10007 . T C,G . . . AN:AC 7618:0,0 108:0,0 84:0,0 2708:0,0 146:0,0 610:0,0 24:0,0 94:0,0 470:0,0 2816:0,0 108:0
NC_000001.11 10008 . A C,T . . . AN:AC 7618:0,0 108:0,0 84:0,0 2708:0,0 146:0,0 610:0,0 24:0,0 94:0,0 470:0,0 2816:0,0 108:0
NC_000001.11 10009 . A C,G . . . AN:AC 7616:0,0 108:0,0 84:0,0 2708:0,0 146:0,0 610:0,0 24:0,0 94:0,0 470:0,0 2816:0,0 108:0
NC_000001.11 10013 . TA T . . . AN:AC 6962:0 84:0 84:0 2210:0 146:0 610:0 24:0 94:0 466:0 2294:0 108:0 10680:0
NC_000001.11 10013 . T C,G . . . AN:AC 7618:0,0 108:0,0 84:0,0 2708:0,0 146:0,0 610:0,0 24:0,0 94:0,0 470:0,0 2816:0,0 108:0
NC_000001.11 10014 . A C,G,T . . . AN:AC 7618:0,0,0 108:0,0,0 84:0,0,0 2708:0,0,0 146:0,0,0 610:0,0,0 24:0,0,0
NC_000001.11 10015 . A C,G,T . . . AN:AC 7618:0,0,0 108:0,0,0 84:0,0,0 2708:0,0,0 146:0,0,0 610:0,0,0 24:0,0,0
NC_000001.11 10016 . C T . . . AN:AC 6962:0 84:0 84:0 2210:0 146:0 610:0 24:0 94:0 466:0 2294:0 108:0 10680:0
NC_000001.11 10020 . A C,G,T . . . AN:AC 7616:0,0,0 108:0,0,0 84:0,0,0 2708:0,0,0 146:0,0,0 610:0,0,0 24:0,0,0
NC_000001.11 10021 . A C,G . . . AN:AC 7618:0,0 108:0,0 84:0,0 2708:0,0 146:0,0 610:0,0 24:0,0 94:0,0 470:0,0 2816:0,0 108:0
NC_000001.11 10022 . C A,G . . . AN:AC 7618:0,0 108:0,0 84:0,0 2708:0,0 146:0,0 610:0,0 24:0,0 94:0,0 470:0,0 2816:0,0 108:0
NC_000001.11 10023 . C T . . . AN:AC 6962:0 84:0 84:0 2210:0 146:0 610:0 24:0 94:0 466:0 2294:0 108:0 10680:0
NC_000001.11 10024 . C CT . . . AN:AC 7618:0 108:0 84:0 2708:0 146:0 610:0 24:0 94:0 470:0 2816:0 108:0 11862:0
Output VCF#
Below is an example of the output ALFA VCF. This does not have --ignore-chr-version enabled, so records on older assembly chromosomes (NC_000001.9 and NC_000001.10) are removed:
##fileformat=VCFv4.0
##build_id=20201027095038
##Population=https://www.ncbi.nlm.nih.gov/biosample/?term=GRAF-pop
##FORMAT=<ID=AN,Number=1,Type=Integer,Description="Total allele count for the population, including REF">
##FORMAT=<ID=AC,Number=A,Type=Integer,Description="Allele count for each ALT allele for the population">
##contig=<ID=1>
##contig=<ID=2>
##contig=<ID=3>
##contig=<ID=4>
##contig=<ID=5>
##contig=<ID=6>
##contig=<ID=7>
##contig=<ID=8>
##contig=<ID=9>
##contig=<ID=10>
##contig=<ID=11>
##contig=<ID=12>
##contig=<ID=13>
##contig=<ID=14>
##contig=<ID=15>
##contig=<ID=16>
##contig=<ID=17>
##contig=<ID=18>
##contig=<ID=19>
##contig=<ID=20>
##contig=<ID=21>
##contig=<ID=22>
##contig=<ID=X>
##contig=<ID=Y>
##contig=<ID=MT>
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT ALFA_EUR ALFA_AFO ALFA_EAS ALFA_AFA ALFA_LAC ALFA_LEN ALFA_OAS ALFA_
1 10001 . T C . . . AN:AC 7618:0 108:0 84:0 2708:0 146:0 610:0 24:0 94:0 470:0 2816:0 108:0 11862:0
1 10007 . T C,G . . . AN:AC 7618:0,0 108:0,0 84:0,0 2708:0,0 146:0,0 610:0,0 24:0,0 94:0,0 470:0,0 2816:0,0 108:0,0 11862
1 10008 . A C,T . . . AN:AC 7618:0,0 108:0,0 84:0,0 2708:0,0 146:0,0 610:0,0 24:0,0 94:0,0 470:0,0 2816:0,0 108:0,0 11862
1 10009 . A C,G . . . AN:AC 7616:0,0 108:0,0 84:0,0 2708:0,0 146:0,0 610:0,0 24:0,0 94:0,0 470:0,0 2816:0,0 108:0,0 11860
1 10013 . TA T . . . AN:AC 6962:0 84:0 84:0 2210:0 146:0 610:0 24:0 94:0 466:0 2294:0 108:0 10680:0
1 10013 . T C,G . . . AN:AC 7618:0,0 108:0,0 84:0,0 2708:0,0 146:0,0 610:0,0 24:0,0 94:0,0 470:0,0 2816:0,0 108:0,0 11862
1 10014 . A C,G,T . . . AN:AC 7618:0,0,0 108:0,0,0 84:0,0,0 2708:0,0,0 146:0,0,0 610:0,0,0 24:0,0,0 94:0,
1 10015 . A C,G,T . . . AN:AC 7618:0,0,0 108:0,0,0 84:0,0,0 2708:0,0,0 146:0,0,0 610:0,0,0 24:0,0,0 94:0,
1 10016 . C T . . . AN:AC 6962:0 84:0 84:0 2210:0 146:0 610:0 24:0 94:0 466:0 2294:0 108:0 10680:0
1 10020 . A C,G,T . . . AN:AC 7616:0,0,0 108:0,0,0 84:0,0,0 2708:0,0,0 146:0,0,0 610:0,0,0 24:0,0,0 94:0,
1 10021 . A C,G . . . AN:AC 7618:0,0 108:0,0 84:0,0 2708:0,0 146:0,0 610:0,0 24:0,0 94:0,0 470:0,0 2816:0,0 108:0,0 11862
1 10022 . C A,G . . . AN:AC 7618:0,0 108:0,0 84:0,0 2708:0,0 146:0,0 610:0,0 24:0,0 94:0,0 470:0,0 2816:0,0 108:0,0 11862
1 10023 . C T . . . AN:AC 6962:0 84:0 84:0 2210:0 146:0 610:0 24:0 94:0 466:0 2294:0 108:0 10680:0
1 10024 . C CT . . . AN:AC 7618:0 108:0 84:0 2708:0 146:0 610:0 24:0 94:0 470:0 2816:0 108:0 11862:0
format-snpstats
#
Reformat one or more SNPSTATs files into a VCF format.
usage: format-snpstats [-h] [-o OUTFILE] [--reference-genome REFERENCE_GENOME]
[--count-col COUNT_COL] [-v]
infiles [infiles ...]
Positional Arguments#
- infiles
One or more SNPstats files. Files should not be compressed.
Named Arguments#
- -o, --outfile
An optional output file, if not provided output is to STDOUT
- --reference-genome
An indexed fasta reference genome, if you want the VCF header to contain all the contigs in the reference genome. If not provided, then chrs 1-22, X, Y, MT are used as a default
- --count-col
The name of the allele counts column that will be created in the VCF file
Default: “ALLELE_COUNT”
- -v, --verbose
give more output
Default: False
fix-vcf-allele-number
#
Adjust the allele number from allele counts that have been split with bcftools. The issue is that when bcftools splits a multi-allelic site into multiple bi-allelic sites, it does a good job of putting the correct AC with the correct site but does not adjust the AN to account for the removed alleles. So a site that has AN:AC of 1000:100,5,2 (three ALT alleles) will be set to 1000:100, 1000:5 and 1000:2, which means that the reference allele count will vary across the bi-allelic forms of the site. The correct representation should be 993:100, 898:5 and 895:2. This gives a reference allele count of 893 for all forms of the site.
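The arithmetic can be checked with a short sketch. It assumes the ALT allele counts of all bi-allelic records split from the same site have already been gathered; the function name `fix_allele_number` is hypothetical and this is not the script's actual code.

```python
def fix_allele_number(an, alt_counts):
    """Return corrected (AN, AC) pairs for each bi-allelic record.

    an         -- the original AN shared by all split records
    alt_counts -- the AC of every ALT allele at the original site
    """
    # The reference allele count is whatever is left after all ALT alleles
    ref_count = an - sum(alt_counts)
    # Each bi-allelic record should only count REF plus its own ALT allele
    return [(ref_count + ac, ac) for ac in alt_counts]


print(fix_allele_number(1000, [100, 5, 2]))
# [(993, 100), (898, 5), (895, 2)]
```

In every corrected pair, AN minus AC is 893, so the reference allele count is the same for all forms of the site.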
usage: fix-vcf-allele-number [-h] [-i INFILE] [-o OUTFILE] [-v]
Named Arguments#
- -i, --infile
An optional input file; if not supplied, STDIN is used
- -o, --outfile
An optional output file, if not provided output is to STDOUT
- -v, --verbose
give more output
Default: False
quick-lift
#
merge-count-vcfs
#
Merge two or more allele count VCF files that have been pre-sorted on (chr_name, start_pos). Please do not use this for general VCF merging; it is only for allele count mapping VCF files and should not be mistaken for a generalisable VCF merging script. The VCF files must have one or more “AN:AC” fields after the FORMAT column (and nothing else), where AN is the total allele number and AC is the count of each alternate allele. The VCF files must be sorted in the same way, which should be the natural string sort order of chromosome name and the numeric sort order of the start position. Also, it is assumed that the VCFs only portray bi-allelic variants. All the variant ID data and INFO fields are taken from the reference VCF file.
Please note that this will perform a system call to tabix, so it should be installed and in your path. Tabix is not used for the merge, only to verify the sort order of all the files being merged.
usage: merge-count-vcfs [-h] [-d DATA_NAMES [DATA_NAMES ...]] [-r REF_NAME]
[-g REF_GENOME] [-o OUTFILE] [-T TMP_DIR] [-v]
ref_file merge_files [merge_files ...]
Positional Arguments#
- ref_file
A vcf to act as a reference file
- merge_files
One or more input counts files to merge into ref_file
Named Arguments#
- -d, --data-names
One or more dataset names; if not given, these default to ds1, ds2, ds3 and so on. If given, the number of names must equal the number of merge files
- -r, --ref-name
If you want rows from the reference file labelled in the output then supply a name for the reference
- -g, --ref-genome
Path to a reference genome assembly, if provided the contigs from this are used in the output VCF
- -o, --outfile
An output file, if provided will be written as a bgzipped file, if not provided then will output to STDOUT
- -T, --tmp-dir
An alternate temp location to write to (default /tmp)
- -v, --verbose
give more output
Default: False
See merge-count-vcfs.sh for a bash wrapper around this.
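The sort order required by merge-count-vcfs can be illustrated with a small check. This is a sketch only: the script itself uses tabix for verification, and here "natural string sort order" is assumed to mean a plain string comparison of chromosome names (so "10" sorts before "2"). The function name `is_sorted` is hypothetical.

```python
def is_sorted(records):
    """Check that (chr_name, start_pos) pairs are in merge order.

    Chromosome names are compared as strings and start positions as
    integers, so ("10", 5) is expected before ("2", 1).
    """
    prev = None
    for chr_name, start_pos in records:
        key = (str(chr_name), int(start_pos))
        if prev is not None and key < prev:
            return False
        prev = key
    return True
```

Two files that both pass a check like this can be walked in parallel with a single pass, which is what makes a streaming merge possible.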
make-site-chunks
#
Parse genomic coordinates from an input file and generate coordinate boundaries of regions that contain a target number of sites (rows) in the input file. This is designed so that the region boundaries can be tabix queried out to give subsets of the input file with defined size. The number of sites in a region may be greater than the target in instances where the target number of sites occurs in a region with multiple sites having the same coordinates. This is to ensure that the end coordinate of one region is different from the start coordinate of the subsequent region. The input file must be sorted on the chr_name, start_pos and end_pos columns.
Parse through the positional information of a file and produce an output of region boundaries that contain ~a target number of sites. The number of sites might not be exactly the same as the target if there are multiple sites with the same coordinate at the point where the target number is reached. In these cases all sites at those coordinates are included in the chunk so the start/end coordinate of all regions is unique.
usage: make-site-chunks [-h] [-i [INFILES ...]] [-o OUTFILE]
[--out-dir OUT_DIR] [--out-ext OUT_EXT] [-T TMP_DIR]
[-d DELIMITER] [-c COMMENT_CHAR] [-C CHR_NAME]
[-S START_POS] [-E END_POS] [-R REF_ALLELE] [-v]
[--bgen]
target
Positional Arguments#
- target
The number of sites per interval
Named Arguments#
- -i, --infiles
An input file, if not provided then STDIN is used. Must be sorted on chr_name, start_pos, end_pos, can be gzip compressed
- -o, --outfile
An output file, if not provided then STDOUT is used
- --out-dir
The output directory prefixed onto the outfile (if input is not STDIN)
- --out-ext
The output file extension added onto the outfile (if input is not STDIN)
- -T, --tmp-dir
A temp directory
- -d, --delimiter
An input file delimiter (default: tab)
Default: “\t”
- -c, --comment-char
The comment character, lines starting with this are ignored (but still output) (default: ##)
Default: “##”
- -C, --chr-name
The name of the chromosome column (default: #CHROM)
Default: “#CHROM”
- -S, --start-pos
The name of the start position column (default: POS)
Default: “POS”
- -E, --end-pos
The name of the end position column, if not there use --start-pos as --end-pos (default: POS)
- -R, --ref-allele
The name of the reference allele column (if present). If this is defined, the end position is calculated as start position + length(ref) - 1
Default: “REF”
- -v, --verbose
Give more output (to <STDERR>)
Default: False
- --bgen
The input files are bgen format
Default: False
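As a small worked example of the end position rule given for --ref-allele, the calculation is start position + length(ref) - 1. The helper name `end_position` is hypothetical and is not part of the package.

```python
def end_position(start_pos, ref_allele):
    """Return the 1-based inclusive end position covered by the reference allele."""
    return start_pos + len(ref_allele) - 1


# A single-base reference allele starts and ends on the same coordinate
print(end_position(100, "A"))
# A four-base reference allele (e.g. a deletion) covers 100..103
print(end_position(100, "ACGT"))
```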
The script will output the following columns in all cases:
rowidx
- The row number (sequential count indexed from 1)
region_idx
- The region number (sequential count indexed from 1)
chr_name
- The chromosome name
start_pos
- The start position of the region
end_pos
- The end position of the region
nsites
- The number of sites present in the region
If input is from a file rather than <STDIN>
, then additional columns will be added:
infile
- The input file name given to make-site-chunks
outfile
- A potential output file name based on the input file name and the data in columns 2-5. It has the structure: <root infile name>.<region_idx>.<chr_name>.<start_pos>-<end_pos>.<extension>
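The sketch below shows how one output row might be turned into a tabix region string and an output file name with the structure described above. Both helper names are hypothetical. The root name is passed in directly because how the script derives it from the input file name is not specified here, and the extension is assumed to be given without a leading dot.

```python
def tabix_region(chr_name, start_pos, end_pos):
    """Format a region in the chr:start-end form accepted by tabix."""
    return f"{chr_name}:{start_pos}-{end_pos}"


def region_outfile(root, region_idx, chr_name, start_pos, end_pos, extension):
    """Build <root>.<region_idx>.<chr_name>.<start_pos>-<end_pos>.<extension>."""
    return f"{root}.{region_idx}.{chr_name}.{start_pos}-{end_pos}.{extension}"


print(tabix_region("1", 100, 200))
print(region_outfile("study", 3, "1", 100, 200, "vcf.gz"))
```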
merge-cadd
#
Merge CADD data into a VCF file.
usage: merge-cadd [-h] [-o OUTFILE] [-T TMP_DIR] [-v]
vcf_file cadd_files [cadd_files ...]
Positional Arguments#
- vcf_file
A vcf to merge into
- cadd_files
One or more input CADD files to merge into vcf_file
Named Arguments#
- -o, --outfile
An output file, if provided will be written as a bgzipped file, if not provided then will output to STDOUT
- -T, --tmp-dir
An alternate temp location to write to (default /tmp)
- -v, --verbose
give more output
Default: False
split-mapping-file
#
Partition the mapping file into a common file and a rare file based on MAF and/or MAC (applied in an OR fashion).
usage: split-mapping-file [-h] [-f MAF] [-c MAC] [-T TMP_DIR] [-v]
mapping_file common_out rare_out
Positional Arguments#
- mapping_file
A vcf mapping file to partition
- common_out
The name of the output file containing the common variants
- rare_out
The name of the output file containing the rare variants
Named Arguments#
- -f, --maf
The MAF cutoff: anything >= this is common, anything < this is rare
Default: 0.01
- -c, --mac
The MAC cutoff: anything >= this is common, anything < this is rare
Default: 50
Default: 50
- -T, --tmp-dir
An alternate temp location to write to
- -v, --verbose
give more output
Default: False
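The partitioning rule for split-mapping-file can be sketched as below. This assumes one particular reading of "applied in an OR fashion": a variant counts as common if it meets either the MAF cutoff or the MAC cutoff, and as rare only if it fails both. The function name `is_common` is hypothetical and the actual script may combine the cutoffs differently.

```python
def is_common(maf, mac, maf_cutoff=0.01, mac_cutoff=50):
    """Return True if the variant passes the MAF cutoff OR the MAC cutoff.

    Assumption: passing either cutoff is enough to be called common.
    """
    return maf >= maf_cutoff or mac >= mac_cutoff


# Hypothetical (maf, mac) pairs partitioned with the default cutoffs
variants = [(0.02, 10), (0.001, 60), (0.001, 10)]
common = [v for v in variants if is_common(*v)]
rare = [v for v in variants if not is_common(*v)]
print(common, rare)
```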
Other admin scripts#
convert-xml
#
Scripts and functions to convert old-style XML files to new-style ones. The
average user will not need to use this as it was designed to convert old XMLs to new formats during development.
usage: convert-xml [-h] [-v] old_xml new_xml
Positional Arguments#
- old_xml
The old XML file (can be gzip compressed)
- new_xml
The new XML file, which will be compressed if the file extension is gz
Named Arguments#
- -v, --verbose
give more output
Default: False
gwas-norm-test-data
#
A tool to generate test data and result data. This is to simplify the process of generating end-to-end tests as they are a real pain to set up manually.
Currently, this can implement:
Study/analysis files
Study file/key analysis
Multiple genome assemblies
Failed liftover files
Top hits files
XML metadata generation
Mapper file generation
Duplicated variants in mapper
Flipping of effect alleles
beta, log(or), log(rr), log(hr), or, rr, hr effect types
Different analysis types
Need to implement:
Logged p-values in input
Missing other allele column and adding rows to bad data
Out of range p-value generation, including inf, missing etc…
Different map info columns.
Metadata tests and output files, include the probability for test failure.
Other info column definitions and static data info.
Missing effect sizes
Exotic column types, CIs/chr-pos
Proper population definitions, including error populations
usage: gwas-norm-test-data [-h] [--genomic-config GENOMIC_CONFIG]
[--tmpdir TMPDIR] [--mapper-name MAPPER_NAME]
[--refgen-name REFGEN_NAME] [--species SPECIES]
[--seed SEED] [-v]
outdir config
Positional Arguments#
- outdir
An output directory name, will be created if it does not exist. If it exists it will be wiped. Several sub-directories and files will be created in here.
- config
A test data setup config file. This is a config file that describes how the test should be set up
Named Arguments#
- --genomic-config
The path to the genomic config file. This provides donor files for the test, such as a mapping file and chain files
- --tmpdir
The path to a tempdir to create the files in. The finished files are then moved from here to the outdir
- --mapper-name
The name of the mapper file in the genomic config to use for the source assembly in the test.
Default: “all”
- --refgen-name
The name of the reference genome name to use for all required assemblies (source and target).
Default: “local”
- --species
The name of the species to use (for genomic config queries).
Default: “human”
- --seed
The random seed to use. If not set then no random seed is used.
- -v, --verbose
give more output, use -vv for progress monitoring
Test config file#
The parameters for the test data are defined in a test configuration TOML file. Some examples can be seen in ./resources/test_config
. They are also documented in the comments of the two example files below.
An example of a study/analysis file config:
[general]
# The total number of variants across all requested files.
nvariants = 100
# The secondary mapping file is a backup file that contains rare variants
# not in the primary mapping file. The idea is to speed up performance
# with most common variants in the primary file. This does not test
# performance but does test mapping from the secondary mapping file.
# The number of variants to be placed into a secondary mapping file, so
# variants in the primary mapping file are nvariants - secondary_mapper
secondary_mapper = 20
# Every 'mapper_dup_vars_idx' variant is duplicated, to enable testing
# of lack of mapping when no other allele is present
mapper_dup_vars_idx = 10
# This is the probability that the ref allele is flipped
prob_ref_flip=0.2
# This is the probability that a duplicated variant is ref flipped.
prob_dup_flip=0.8
# The mapper info fields to include in the output normalised file. If an
# empty list is given then none will be included. If this option is
# missed altogether then all will be included. They are all listed here
# for reference.
# obs (list): of mapping databases that the variant has been observed in
# idx (int): The row number in the input file that the variant occupies
# nsites (int): The number of variant sites that overlap the source variants
# chromosome/position
# caddp (float): The CADD Phred score
# caddr (float): The CADD raw score
# sift (float): The SIFT score
# polyp (float): The polyphen score
# clinvar (str): The clinvar consequence
# vep (str): The variant effect predictor worst consequence for the variant.
mapper_info_fields=['obs', 'idx', 'nsites', 'caddp', 'caddr', 'sift', 'polyp',
'clinvar', 'vep']
# Default is 0.05
top_hits_pvalue=5E-05
# Files specified outside of an analysis apply to all analyses or study files
[files]
# Default is 1
nfiles = 4
# Files are 0-based indexed and fileN is a default that will apply to all
# files unless specified otherwise
[files.fileN]
header = true
delimiter = "\t"
# The column definitions for a file, Columns known by GWAS norm should have
# the name mapping, XML columns should be indicated (default is False)
[files.fileN.columns]
# Test input name = {name=<gwas-norm-name>}
CHROMO = {name="chr_name", xml=true}
POS = {name="start_pos", xml=true}
A1 = {name="effect_allele", xml=true}
A2 = {name="other_allele", xml=true}
P = {name="pvalue", xml=true}
EFFECT = {name="effect_size", xml=true}
SE = {name="standard_error", xml=true}
# In this case file3 is the fourth file, a file4 would raise an error as we
# are expecting four files not five, they can have different columns and
# delimiters
[files.file3]
delimiter = ","
[files.file3.columns]
CHROMO = {name="chr_name", xml=true}
POS = {name="start_pos", xml=true}
A1 = {name="effect_allele", xml=true}
A2 = {name="other_allele", xml=true}
P = {name="pvalue", xml=true}
EFFECT = {name="effect_size", xml=true}
SE = {name="standard_error", xml=true}
# Studies are 0-based currently, only a single study is supported for each
# test data
[study0]
# Required
study_id = 1
# Optional
pubmed_id = 1
# Required
study_name = "test study"
# Required
source = "b37"
# Required
target = ["b36", "b37", "b38"]
# Not implemented yet
[study0.info]
[study0.info.defs]
A = {}
# Analyses are 0-based indexed and should be below studies
[study0.analysis0]
# Required, must be unique
analysis_id = 1
# Required
analysis_name = "chd"
# Required
phenotype = "CHD"
# Required
effect_type = 'or'
# Required
analysis_type = 'disease'
# These are not implemented yet
# af_pops = {}
# ld_pops = {}
[study0.analysis1]
analysis_id = 2
analysis_name = "cvd"
phenotype = "CVD"
effect_type = 'or'
analysis_type = 'disease'
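The way fileN defaults and per-file sections combine in the config above can be illustrated with plain dictionaries, laid out as the [files] table might look after the TOML has been parsed. This is a sketch under assumptions: the function name `resolve_file_configs` is hypothetical, and a shallow merge (a per-file key replaces the default key wholesale) is assumed rather than confirmed.

```python
def resolve_file_configs(files):
    """Return one settings dict per file, applying fileN defaults first.

    ``files`` mirrors the parsed [files] table. Files are 0-based, so with
    nfiles = 4 the valid explicit sections are file0 to file3.
    """
    nfiles = files.get("nfiles", 1)
    default = files.get("fileN", {})
    # Reject explicit sections that point beyond the requested file count
    for key in files:
        suffix = key[len("file"):]
        if key.startswith("file") and suffix.isdigit() and int(suffix) >= nfiles:
            raise ValueError(f"{key} defined but only {nfiles} files requested")
    resolved = []
    for idx in range(nfiles):
        settings = dict(default)  # start from the fileN defaults
        settings.update(files.get(f"file{idx}", {}))  # per-file overrides
        resolved.append(settings)
    return resolved


files_table = {
    "nfiles": 4,
    "fileN": {"header": True, "delimiter": "\t"},
    "file3": {"delimiter": ","},
}
for idx, settings in enumerate(resolve_file_configs(files_table)):
    print(idx, settings)
```

Here the first three files inherit the tab delimiter, while the fourth file (file3) switches to a comma but keeps the inherited header setting.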
An example of a study file/key analysis file config. This is where multiple analyses share a single input file (or occur anywhere in a set of input files):
[general]
# The total number of variants across all requested files.
nvariants = 100
# The secondary mapping file is a backup file that contains rare variants
# not in the primary mapping file. The idea is to speed up performance
# with most common variants in the primary file. This does not test
# performance but does test mapping from the secondary mapping file.
# The number of variants to be placed into a secondary mapping file, so
# variants in the primary mapping file are nvariants - secondary_mapper
secondary_mapper = 20
# Every 'mapper_dup_vars_idx' variant is duplicated, to enable testing
# of lack of mapping when no other allele is present
mapper_dup_vars_idx = 10
# This is the probability that the ref allele is flipped
prob_ref_flip=0.2
# This is the probability that a duplicated variant is ref flipped.
prob_dup_flip=0.8
# The mapper info fields to include in the output normalised file. If an
# empty list is given then none will be included. If this option is
# missed altogether then all will be included. They are all listed here
# for reference.
# obs (list): of mapping databases that the variant has been observed in
# idx (int): The row number in the input file that the variant occupies
# nsites (int): The number of variant sites that overlap the source variants
# chromosome/position
# caddp (float): The CADD Phred score
# caddr (float): The CADD raw score
# sift (float): The SIFT score
# polyp (float): The polyphen score
# clinvar (str): The clinvar consequence
# vep (str): The variant effect predictor worst consequence for the variant.
mapper_info_fields=['obs', 'idx', 'nsites', 'caddp', 'caddr', 'sift', 'polyp',
'clinvar', 'vep']
# Default is 0.05
top_hits_pvalue=5E-05
# Files specified outside of a study file apply to all analyses or study
# files, however, currently only a single study file is allowed so it makes
# very little difference at the moment.
[files]
# Default is 1
nfiles = 4
# Files are 0-based indexed and fileN is a default that will apply to all
# files unless specified otherwise
[files.fileN]
header = true
delimiter = "\t"
# The column definitions for a file, Columns known by GWAS norm should have
# the name mapping, XML columns should be indicated (default is False)
[files.fileN.columns]
# Test input name = {name=<gwas-norm-name>}
CHROMO = {name="chr_name", xml=true}
POS = {name="start_pos", xml=true}
A1 = {name="effect_allele", xml=true}
A2 = {name="other_allele", xml=true}
P = {name="pvalue", xml=true}
EFFECT = {name="effect_size", xml=true}
SE = {name="standard_error", xml=true}
# Key columns must be defined for study files, only int or str are supported.
# The key values are generated automatically
key1 = {type='int', key=true}
# In this case file3 is the fourth file, a file4 would raise an error as we
# are expecting four files not five, they can have different columns and
# delimiters
[files.file3]
delimiter = ","
[files.file3.columns]
CHROMO = {name="chr_name", xml=true}
POS = {name="start_pos", xml=true}
A1 = {name="effect_allele", xml=true}
A2 = {name="other_allele", xml=true}
P = {name="pvalue", xml=true}
EFFECT = {name="effect_size", xml=true}
SE = {name="standard_error", xml=true}
# Multiple key columns are supported as well as different keys in different
# files. Analyses are spread over files such that they only occur in files
# with the same key columns
key3 = {type='int', key=true}
key4 = {type='str', key=true}
# Study file definition
[study_file0]
# Required
study_id = 1
# Optional
pubmed_id = 1
# Required
study_name = "test study"
# Required
source = "b37"
# Required
target = ["b36", "b37", "b38"]
# Required
effect_type ='beta'
# Required
analysis_type ='metabqtl'
[study_file0.info]
# Not implemented yet
[study_file0.info.defs]
A = {}
[study_file0.analysis0]
# Required
analysis_id = 1
# Required
analysis_name = "hdl-c"
# Required
phenotype = "HDL-C"
# Not implemented yet
# af_pops = {}
# ld_pops = {}
[study_file0.analysis1]
analysis_id = 2
analysis_name = "vldl-c"
phenotype = "VLDL-C"
# af_pops = {}
# ld_pops = {}