Getting Started with GWAS Normalisation#

version: 0.3.0a0

The gwas-norm package is a toolkit to normalise GWAS summary statistics flat files. It aims to handle some of the most common scenarios and produce a flat file with:

  1. A standardised column order

  2. Uniform genome assembly

  3. Variant IDs

  4. Basic functional annotations

There is online documentation for gwas-norm.

Installing the Python package#

At present, gwas-norm is under development and no packages exist yet on PyPI or in conda. Therefore it is recommended that you install it in one of the two ways listed below. First, clone this repository and then cd into the root of the repository.

git clone git@gitlab.com:cfinan/gwas-norm.git
cd gwas-norm

Installation using conda dependencies#

Conda environment yaml files are provided in the directory ./resources/conda/env. A new conda environment called gwas_norm_py3X can be built using the command:

# From the root of the gwas-norm repository
conda env create --file ./resources/conda/env/py39/conda_create.yaml
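
If you created a new environment, it will need activating before installing the package. Assuming the gwas_norm_py3X naming described above, the Python 3.9 environment would be activated with something like:

conda activate gwas_norm_py39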

To add to an existing environment use:

# From the root of the gwas-norm repository
conda env update --file ./resources/conda/env/py39/conda_update.yaml

There are also conda environments for Python v3.7, v3.8 and v3.10.
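
For example, a Python 3.10 environment can most likely be created in the same way; note that the py310 path below is an assumption based on the py39 layout shown above:

# From the root of the gwas-norm repository (py310 path assumed, mirroring the py39 layout)
conda env create --file ./resources/conda/env/py310/conda_create.yaml

Then to install gwas_norm you can either do: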

python -m pip install .

Or, for an editable (developer) install, run the command below from the root of the gwas-norm repository. The difference is that you can just do a git pull to update gwas-norm, or switch branches, without re-installing:

python -m pip install -e .
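
With an editable install, updating to the latest code later is then simply:

# From the root of the gwas-norm repository
git pull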

Installation not using any conda dependencies#

If you are not using conda in any way, then install the dependencies via pip and then install gwas_norm, also via pip:

Install dependencies:

python -m pip install --upgrade -r requirements.txt

Then to install gwas_norm you can either do:

python -m pip install .

Or, for an editable (developer) install, run the command below from the root of the gwas-norm repository. The difference is that you can just do a git pull to update gwas_norm, or switch branches, without re-installing:

python -m pip install -e .

Installing helper bash scripts for building the Mapping files#

I hope to provide a download link for the mapping files. However, until then, they will need to be built using this procedure. The procedure is fully documented and you do not have to use all the resources described. However, if you do follow it, you will need the Python package installed as documented above and the bash scripts in ./resources/bin available in your PATH (where . is the root of the cloned git repository). These scripts also need the bash-helpers repository in your PATH, as well as shflags (a very nice bash command line argument handler).

Putting something in your PATH involves editing either your ~/.bashrc or your ~/.bash_profile file (depending on which you use). For example, to add the bash scripts for building the mapping files you would add something like this:

PATH="/path/to/gwas-norm/resources/bin:${PATH}"
export PATH

where /path/to/ is where you cloned the repository. If you did not clone the repository and installed via conda, then it is easiest to clone it anyway (but not install it with pip) and just use the ./resources/bin scripts.
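
For example, if you have also cloned bash-helpers and downloaded shflags, your ~/.bashrc might end up looking something like the sketch below (the bash-helpers and shflags locations are hypothetical placeholders; adjust them to wherever you put each tool):

# the /path/to/ prefixes are placeholders for wherever each tool lives
PATH="/path/to/gwas-norm/resources/bin:${PATH}"
PATH="/path/to/bash-helpers:${PATH}"
PATH="/path/to/shflags:${PATH}"
export PATH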

Next steps…#

After installation you will want to:

  1. Setup your configuration file

  2. Build a mapping file

Command endpoints#

Installation makes the following command endpoints available. The usage of each can be listed with <COMMAND> --help.
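
For example, to see the options for the main normaliser:

gwas-norm --help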

For handling flat files#

  • gwas-norm - For performing full normalisation on one or more datasets

  • variant-map - A standalone variant ID mapper/annotator

  • extract-pvalue - For extracting variants exceeding a p-value cut point from flat files

  • quick-lift - A generic, no-frills flat file genomic liftover program (uses the CrossMap API)

For use with HPCs#

  • hpc-gwas-norm - Parallelise gwas-norm on an HPC

  • make-pvalue-filter-ja - For creating cluster job array files to perform large-scale p-value cutoff extractions

For handling VCF mapping files#

  • make-site-chunks - For creating genomic co-ordinate ranges that contain a defined number of variants.

  • split-map - For splitting a mapping file based on a MAF cutoff.

  • format-alfa - Parsing an ALFA project VCF file.

  • format-snpstats - Parsing a QCTool snpstats file into a VCF.

  • format-dbsnp - Parsing a dbSNP VCF file.

  • format-eqtlgen - Parsing eQTLgen allele frequencies into VCF format.

  • merge-cadd - Merge CADD scores into a VCF file.