Welcome to gwas-norm#
The gwas-norm repository and package aim to make it as simple as possible to standardise the format of GWAS summary statistics files, so they can be used in downstream applications in a uniform way.
There is a massive amount of GWAS summary level data in the public domain, covering many different phenotypes, from molecular traits such as eQTLs and pQTLs to disease traits. Unfortunately, there has never been an agreed standard for sharing GWAS summary level data. The available datasets are highly heterogeneous and of varying quality, with some being directly usable and others bordering on deliberate obfuscation of the data within them.
gwas-norm provides an interface where users can define the attributes of the dataset they want to normalise, and use that definition to perform the normalisation in a scalable way. With gwas-norm you can normalise a single dataset in an interactive shell, or you can scale up to many thousands of datasets (for example an eQTL/pQTL study) and run them in parallel with hpc-gwas-norm. It will handle some of the most common scenarios and produce a flat file with:
A standardised column order
Uniform genome assembly
Variant IDs
Basic functional annotations
Independent variant assessment (not implemented yet).
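To make the first item in the list above concrete, here is a minimal sketch of what imposing a standardised column order on a summary statistics file involves. It does not use the gwas-norm API: the function `standardise_columns`, the `STANDARD_ORDER` column names and the example input header are all hypothetical and chosen purely for illustration. The real output columns produced by gwas-norm may differ.

```python
import csv
import io

# Hypothetical target column order (an assumption for this sketch only).
STANDARD_ORDER = [
    "chr_name", "start_pos", "effect_allele", "other_allele",
    "effect_size", "standard_error", "pvalue",
]


def standardise_columns(in_handle, out_handle, column_map, delimiter="\t"):
    """Rewrite a delimited file so its columns follow STANDARD_ORDER.

    column_map maps each standard column name to the header name used in
    the source file. Returns the number of data rows written.
    """
    reader = csv.DictReader(in_handle, delimiter=delimiter)
    header = reader.fieldnames or []
    # Fail early if the source file lacks any column the mapping refers to.
    missing = [c for c in STANDARD_ORDER if column_map.get(c) not in header]
    if missing:
        raise KeyError(f"no source column found for: {missing}")

    writer = csv.writer(out_handle, delimiter="\t", lineterminator="\n")
    writer.writerow(STANDARD_ORDER)
    n_rows = 0
    for row in reader:
        writer.writerow([row[column_map[c]] for c in STANDARD_ORDER])
        n_rows += 1
    return n_rows


if __name__ == "__main__":
    # A made-up input file with an arbitrary column order.
    source = io.StringIO(
        "SNP\tP\tBETA\tSE\tA1\tA2\tCHR\tBP\n"
        "rs1\t0.01\t0.2\t0.05\tA\tG\t1\t12345\n"
    )
    mapping = {
        "chr_name": "CHR", "start_pos": "BP", "effect_allele": "A1",
        "other_allele": "A2", "effect_size": "BETA",
        "standard_error": "SE", "pvalue": "P",
    }
    target = io.StringIO()
    standardise_columns(source, target, mapping)
    print(target.getvalue(), end="")
```

The per-dataset mapping in this sketch plays the same role as the dataset definition described above: the user states once which source column holds which attribute, and the same routine can then be applied to any number of files.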
What is gwas-norm#
When you install gwas-norm, you get access to both programs that you can run from the Unix command line and a Python application programming interface (API). You can use the API to interact with generalisable components of the package, or to integrate GWAS summary statistics normalisation into your own pipelines.
Whilst it is anticipated that most users will want to use only the command line scripts, gwas-norm and hpc-gwas-norm (the code formatting of gwas-norm distinguishes the program from the package of the same name), both the command line endpoints and the API are documented.
Next steps…#
To get gwas-norm up and running you will want to:
Contents#
Setup
- Getting Started with GWAS Normalisation
- Setup a configuration file
- Environment variables
- Building the mapping files
- Introduction
- Version information
- The populations in the mapping file
- The mapping VCF file format
- Download reference genomes
- Installing helper bash scripts for building the mapping files
- Mapping file data
- ALFA
- GNOMAD2 exome allele counts
- GNOMAD3 allele counts
- 1000 genomes allele counts
- UKBB SNPstats files
- Merging all the data sources
- Running VEP
- Splitting the mapping file into common/rare
- The variant synonyms file
- Installing Ensembl VEP
- Updating Ensembl VEP
GWAS normalisation
- Overview and usage
- Key features
- Overview
- Usage
- The XML metadata file
- Prerequisites
- The XML elements
- The <gwas_data> root
- The <study>/<study_file> elements
- The <analysis>/<analysis_key> elements
- The <file> element
- The <phenotype> element
- The <caveat> element
- The <definition> element
- The <column> element
- The <info> element and info attributes
- The cohort elements
- The population elements
- The <test> element
- Example XML files
- Input file columns
- The GWAS effect type
- The GWAS analysis type
- The directory structure of the normalised data
- Output files
- The XML metadata file
Variant mapping
Programmer reference
- Command-line endpoints
- gwas-norm API
gwas_norm package
gwas_norm.gwas_norm
gwas_norm.processors
gwas_norm.config
gwas_norm.common
ChrPosSpec
Msg
add_column_name()
bsd_chksum_file()
bsd_chksum_str()
check_abs_path()
check_analysis_type()
check_effect_type()
check_parent()
compress_file()
convert()
count_lines()
create_chrpos_spec_str()
create_uni_id()
error_on_empty()
expand_relative_path()
get_column_name()
get_file_name()
get_old_analysis_id()
get_open_method()
get_tmp_file()
md5_file()
norm_name()
parse_bool()
parse_chrpos_spec_str()
passthrough()
safe_move()
stdopen()
gwas_norm.constants
gwas_norm.metadata sub-package
gwas_norm.utils sub-package
gwas_norm.variants sub-package
gwas_norm.variants.vcf_info
ALLELE
CADD_INFO_FIELD
CADD_KEY_LOOKUP
CADD_MAIN_DELIMITER
CADD_PHRED
CADD_RAW
CLINORIGIN_BI
CLINORIGIN_DEN
CLINORIGIN_GERM
CLINORIGIN_INC
CLINORIGIN_INH
CLINORIGIN_LOOKUP
CLINORIGIN_MAT
CLINORIGIN_NT
CLINORIGIN_OTH
CLINORIGIN_PAT
CLINORIGIN_SOM
CLINORIGIN_UNI
CLINORIGIN_UNKN
CLINSIG_AFF
CLINSIG_ASSC
CLINSIG_BEN
CLINSIG_CONF
CLINSIG_DRUG
CLINSIG_ERR
CLINSIG_LBEN
CLINSIG_LOOKUP
CLINSIG_LPATH
CLINSIG_NP
CLINSIG_OTH
CLINSIG_PATH
CLINSIG_PROT
CLINSIG_RISK
CLINSIG_SEN
CLINSIG_USIG
CLINVAR_ID_DELIMITER
CLINVAR_MAIN_DELIMITER
CLINVAR_MAPPED_FIELDS
CLINVAR_ORIGIN
CLINVAR_SIGNIF
CLINVAR_SUB_DELIMITER
CLNACC
CLNDISDB
CLNDN
CLNHGVS
CLNORIGIN
CLNREVSTAT
CLNSIG
CLNVI
CODING_SEQUENCE
CONSEQUENCE
CONSEQUENCES
CONSEQUENCE_LOOKUP
CaddKeys
ClinVarKeys
ClinVarOri
ClinVarSig
DOWNSTREAM
FEATURE
FEATURE_ELONGATION
FEATURE_TYPE
FEAT_TRUNCATION
FIVE_PRIME_UTR
FRAMESHIFT
GENE
INCOMPLETE_TERMINAL_CODON
INFRAME_DEL
INFRAME_INS
INTERGENIC
INTRON
MATURE_MIRNA
MISSENSE
NC_TRANS
NC_TRANS_EXON
NMD_TRANS
POLYPHEN
PROT_ALTERING
REG_REGION
REG_REGION_ABLATION
REG_REGION_AMP
SIFT
SPLICE_ACCEPTOR
SPLICE_DONOR
SPLICE_DONOR_FIFTH_BASE
SPLICE_DONOR_REGION
SPLICE_POLYPRIM_TRACT
SPLICE_REGION
START_LOST
START_RETAINED
STOP_GAINED
STOP_LOST
STOP_RETAINED
SYNONYMOUS
So
TFBS_ABLATION
TFBS_AMP
TF_BINDING_SITE
THREE_PRIME_UTR
TRANS_ABLATION
TRANS_AMP
UPSTREAM
VCF_MISSING
VEP_INFO_FIELD
VEP_KEY_LOOKUP
VEP_MAIN_DELIMITER
VepKeys
cadd_info_parser()
clinvar_most_significant()
parse_clinvar()
parse_clinvar_dbs()
parse_clinvar_delim()
parse_clinvar_disease_db()
parse_clinvar_origin()
parse_clinvar_significance()
parse_clinvar_var_id()
parse_float()
parse_none()
parse_return()
parse_vep_consequence()
validate_header_metadata()
vep_info_parser()
vep_worst_consequence()
gwas_norm.variants.norm
gwas_norm.variants.mapper
gwas_norm.variants.resolvers
gwas_norm.variants.constants
ALLELE_DELIMITER
DNA
ALLOWED_TYPES
BLANK_ALLELE
ID
CHR
START
END
STRAND
REF
ALT
STRAND_FLIP
REF_FLIP
PARTIAL_ALLELE_MATCH
COORD_OFFSET
MAPPING_DECODE
MAPPING_DECODE_STR
ALT
ALT_ALLELE_INFERRED
BALANCED
CHR
COMP_TRANSLATE
Column
DELETION
DNA_DEL_REGEX
DNA_DEL_STR
DNA_REGEX
DNA_STR
DataSet
END
ENSEMBL_DELETION
ENS_ID_REGEX
ENS_ID_STR
ERROR
FLAGS
ID
INSERTION
IS_PALINDROMIC
MapCoord
MappingFlag
MappingResult
NORMALISED
NO_DATA
PARTIAL_ALLELE_MATCH
POP_START_IDX
REF
REF_FLIP
RS_REGEX
RS_STR
START
STRAND
STRAND_FLIP
UNKNOWN_INDEL
VCF_FORMAT_IDX
VCF_ID_IDX
VCF_INFO_IDX
dataset_bits()
decode_mapping_flags()
main()
gwas_norm.variants.common
gwas_norm.variants.downloads sub-package
gwas_norm.hpc sub-package
- Testing in GWAS norm
Project admin