Testing in GWAS norm#

GWAS norm contains a variety of unit tests and some larger end-to-end tests of the full pipeline. This document describes how to put together end-to-end tests of the full pipeline and covers scenarios that have tests and the scenarios that need testing.

GWAS norm uses pytest for all testing. If you clone the gwas-norm repository, all the pytest tests are located in ./tests, where . is the root of the repository. The end-to-end tests are located in the ./tests/test_gwas_norm.py file.

The data for an end-to-end test should be located within a sub-directory in gwas_norm/example_data/example_datasets/norm_data.

The ./gwas_norm/example_data/example_datasets directory is where the data distributed with the package is located. These datasets can be read into individual tests or pytest fixtures.

All of the data and code can be found in the updates branch of gwas-norm. I recommend you install in developer mode for the test development.

git clone git@gitlab.com:cfinan/gwas-norm.git
cd gwas-norm
git checkout updates
pip install -e .

The components of an end-to-end test#

The end-to-end test directory should contain a results directory, which mimics the expected result directory structure and output files from running the pipeline. Keep in mind that the output files need the same sort order as the pipeline outputs, i.e. chr_name (string), start_pos (integer). The root of the directory should contain one or more of the test input files and an XML file that describes the input files and test study metadata.
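
A minimal sketch of producing that sort order when assembling an expected results file by hand. It assumes a tab-delimited file with chr_name and start_pos columns (the column names come from the text above). The helper name sort_expected_rows is illustrative, and only the standard library is used, not the gwas_norm API. Note that a string sort places chromosome 10 before chromosome 2.

```python
import csv
import io


def sort_expected_rows(rows):
    """Sort rows by chr_name as a string, then start_pos as an integer."""
    return sorted(
        rows, key=lambda r: (str(r["chr_name"]), int(r["start_pos"]))
    )


# A small in-memory table standing in for an expected results file.
text = "chr_name\tstart_pos\n10\t500\n2\t1000\n2\t90\n1\t300\n"
rows = list(csv.DictReader(io.StringIO(text), delimiter="\t"))

sorted_rows = sort_expected_rows(rows)
order = [(r["chr_name"], int(r["start_pos"])) for r in sorted_rows]
print(order)
```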

An example of the directory structure from an existing test is shown below. There is also an Excel file in there, study1_00000001.results.xlsx, which was used to put together the expected results file. The expected results files can, however, be put together using any method that does not use any of the gwas_norm API.

$ tree ./gwas_norm/example_data/example_datasets/norm_data/study1/

gwas_norm/example_data/example_datasets/norm_data/study1/
├── results
│   ├── gwas_data
│   │   ├── b37
│   │   │   ├── data_files
│   │   │   │   └── test1_00000001.b37.gnorm
│   │   │   └── summary_files
│   │   │       ├── study1_00000001_bad_data.b37.txt
│   │   │       └── study1_00000001_top_hits.b37.gnorm
│   │   └── b38
│   │       ├── data_files
│   │       └── summary_files
│   └── original_files
├── study1_00000001.results.xlsx
├── test_gwas1.f1.txt.bz2
├── test_gwas1.f2.txt.bz2
├── test_gwas1.f3.txt.bz2
└── test_study1.xml

You will notice that the output files in the test results are uncompressed, whereas the pipeline would normally compress the output. This is intentional, as it is far easier to work with uncompressed files if you need to manually edit them.

The test will also need a mapping file. As the mapping file has to be distributed with the package, it can’t be large. I have put together some tabix-indexed test mapping files for GRCh37 and GRCh38 at these locations:

./gwas_norm/example_data/example_datasets/norm_data/test_mapper_b37.vcf.gz
./gwas_norm/example_data/example_datasets/norm_data/test_mapper_b38.vcf.gz

Each contains 540 variants, so you might want to base any tests around these and perhaps include some additional variants that you know will not be mapped.

You will also need a copy of the GRCh37 and GRCh38 reference genomes. These are not distributed with the package but there are functions to download and index them. This should happen automatically, but if not, try running the commands below. Note that the GRCh37 (b37) command will take a long time to run and it will appear to hang (but it isn’t). This is because the downloaded genome has to be re-compressed and indexed as the Ensembl download is not indexed for GRCh37 (I have no idea why):

from gwas_norm.example_data import examples

examples.get_data("ref_genome", assembly='b37')
examples.get_data("ref_genome", assembly='b38')

Finally you will need a genomic config file. However, you do not need to create one yourself. There is a function that will do this for you:

from gwas_norm.example_data import examples

examples.get_data("config_file", tmpdir)

This should work if you use the distributed mapping file, chain files and downloaded genome assemblies. Let me know if there are any issues.

Some additional input files#

I also put together some additional input files, for conducting tests on different aspects, such as effect types. These contain variants subset from the mapping files detailed above:

./gwas_norm/example_data/example_datasets/norm_data/test_gwas2.txt
./gwas_norm/example_data/example_datasets/norm_data/test_gwas3.txt
./gwas_norm/example_data/example_datasets/norm_data/test_gwas4.txt
./gwas_norm/example_data/example_datasets/norm_data/test_gwas5.txt
./gwas_norm/example_data/example_datasets/norm_data/test_gwas6.txt
./gwas_norm/example_data/example_datasets/norm_data/test_gwas7.txt
./gwas_norm/example_data/example_datasets/norm_data/test_gwas8.txt
./gwas_norm/example_data/example_datasets/norm_data/test_gwas9_chr12.txt
./gwas_norm/example_data/example_datasets/norm_data/test_gwas9_chr19.txt
./gwas_norm/example_data/example_datasets/norm_data/test_gwas9_chr3.txt
./gwas_norm/example_data/example_datasets/norm_data/test_gwas9_chr5.txt
./gwas_norm/example_data/example_datasets/norm_data/test_gwas9_chr6.txt

What needs testing#

So far, there is a single end-to-end test, which uses the study1 directory shown above. This tests a beta effect type, some re-calculation of very small p-values and standard errors, and that variant/effect type flipping is handled correctly. This is all for an analysis file setup. In general terms, things that still need to be tested via an end-to-end test in the analysis file setup are:

The beta effect type (GRCh38)
The odds ratio effect type (GRCh37, GRCh38)
The log odds ratio effect type (GRCh37, GRCh38)
The risk ratio effect type (GRCh37, GRCh38)
The log risk ratio effect type (GRCh37, GRCh38)
The hazard ratio effect type (GRCh37, GRCh38)
The log hazard ratio effect type (GRCh37, GRCh38)
Non-effect allele imputation (GRCh37, GRCh38)
logged/unlogged p-values (GRCh37, GRCh38)

See the effect type docs for more information.

For the study file setup:

The beta effect type (GRCh37, GRCh38)
The odds ratio effect type (GRCh37, GRCh38)
The log odds ratio effect type (GRCh37, GRCh38)
The risk ratio effect type (GRCh37, GRCh38)
The log risk ratio effect type (GRCh37, GRCh38)
The hazard ratio effect type (GRCh37, GRCh38)
The log hazard ratio effect type (GRCh37, GRCh38)
Non-effect allele imputation (GRCh37, GRCh38)
logged/unlogged p-values (GRCh37, GRCh38)

In addition, you might want to test differing compression of input files. For the study1 test, bzip2 was used. See defining file metadata for more information.
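
One way to produce differently compressed copies of the same input file is with the Python standard library. This is only a sketch: the helper name write_compressed_copies and the output file naming are illustrative assumptions, and which compression formats the pipeline accepts should be checked against the file metadata documentation.

```python
import bz2
import gzip
import os
import tempfile


def write_compressed_copies(src_path, out_dir):
    """Write gzip and bzip2 copies of a file and return their paths."""
    with open(src_path, "rb") as infile:
        data = infile.read()

    base = os.path.basename(src_path)
    paths = {
        "gzip": os.path.join(out_dir, base + ".gz"),
        "bzip2": os.path.join(out_dir, base + ".bz2"),
    }
    with gzip.open(paths["gzip"], "wb") as out:
        out.write(data)
    with bz2.open(paths["bzip2"], "wb") as out:
        out.write(data)
    return paths


# Demonstrate with a throwaway file in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "example_gwas.txt")
    with open(src, "w") as handle:
        handle.write("chr_name\tstart_pos\n1\t300\n")

    copies = write_compressed_copies(src, tmp)

    # Both copies should decompress to the same text.
    with gzip.open(copies["gzip"], "rt") as handle:
        gz_text = handle.read()
    with bz2.open(copies["bzip2"], "rt") as handle:
        bz_text = handle.read()
```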

Putting together the test data#

I used Excel to do this, but it is not ideal. However you do it, just make sure it is distinct from any code used by GWAS norm. The most fiddly part is the info field and calculating any allele frequency data.

In the existing tests, I had most of the allele frequency data derived from the input data. However, I deliberately excluded it for some rows to test the pipeline’s ability to re-calculate the allele frequencies based on the defined cohort definitions. I also tested the ability to include info data from the input file columns and statically defined info data.

As for creating the XML file, I did that programmatically, although you can do it by hand if you prefer. The script that was used is at ./resources/misc/make-test-data.py.

Re-using test data#

The most time-consuming part of implementing a test is putting together the input data and expected results. However, it will be possible to reuse some of these. For example, the tests for odds ratio, risk ratio and hazard ratio (and their logged counterparts) can use the same values and they are handled in the same way. Similarly, for testing non-effect allele imputation, you can put together a file with both effect and non-effect alleles and, for one of the tests, just not tell gwas-norm about the non-effect allele column. The same goes for logged/unlogged p-values.
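
As a sketch of that kind of reuse, the logged counterparts of a set of base values can be derived with the standard library alone. This assumes the natural log for ratio effect sizes and -log10 for p-values. Check the effect type docs for what the pipeline actually expects. The function and key names are illustrative.

```python
import math


def logged_variants(ratio_effect, pvalue):
    """Return the natural-log effect and -log10 p-value for one row."""
    return math.log(ratio_effect), -math.log10(pvalue)


# The same base values can serve both unlogged and logged test inputs.
base_rows = [
    {"effect": 1.0, "pvalue": 0.01},
    {"effect": 2.0, "pvalue": 1e-8},
]

logged_rows = []
for row in base_rows:
    log_effect, mlog10_p = logged_variants(row["effect"], row["pvalue"])
    logged_rows.append({"effect": log_effect, "pvalue": mlog10_p})
```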

However, when testing for non-effect allele imputation, it is a good idea to include some variants that have multiple alleles in the mapping file, as the pipeline should not be able to map these.
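
To find candidate variants for this, the sites with more than one alternate allele can be listed from a mapping VCF. The sketch below uses only the standard library. It assumes the standard VCF column order (CHROM, POS, ID, REF, ALT) and that multi-allelic sites appear either as comma-separated ALT values or as repeated CHROM/POS/REF rows. The function name is illustrative, and the tiny VCF written here is made-up demonstration data, not the distributed mapping file.

```python
import gzip
import os
import tempfile
from collections import defaultdict


def multiallelic_sites(vcf_path):
    """Return {(chrom, pos, ref): sorted ALTs} for sites with >1 ALT."""
    alts = defaultdict(set)
    with gzip.open(vcf_path, "rt") as vcf:
        for line in vcf:
            if line.startswith("#"):
                continue  # skip meta-information and header lines
            chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
            alts[(chrom, int(pos), ref)].update(alt.split(","))
    return {site: sorted(a) for site, a in alts.items() if len(a) > 1}


# Demonstrate on a tiny made-up VCF written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "tiny.vcf.gz")
    with gzip.open(path, "wt") as out:
        out.write("##fileformat=VCFv4.2\n")
        out.write("#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n")
        out.write("1\t100\trs1\tA\tG\t.\t.\t.\n")
        out.write("1\t200\trs2\tC\tT,G\t.\t.\t.\n")
        out.write("2\t300\trs3\tG\tA\t.\t.\t.\n")
        out.write("2\t300\trs4\tG\tC\t.\t.\t.\n")
    found = multiallelic_sites(path)
```

The same function could be pointed at one of the test mapper files listed earlier, since block-gzipped files can be read with the gzip module.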

In addition, for the study file setup tests, you might be able to combine several analysis files together to mimic a study file.

The mechanics of setting up a test#

Here are the implementation details you need to set up an end-to-end test.

Making the test data available#

To make the data available, you need a directory structure as above and a function in the ./gwas_norm/example_data/examples.py file to give the data to the test. The function should be tagged with a @dataset decorator. An example is shown below for the study1 test data:

@dataset
def study1_data():
    """Get the normalisation test files.

    Returns
    -------
    source_root_dir : `str`
        The source directory where all the source files are referenced from.
    study_dir : `str`
        The path to the output study directory.
    results_dir : `str`
        The path to the expected results directory.
    xml_file : `str`
        The path to the XML file.
    top_hits_pvalue : `float`
        The pvalue cutoff for top hit inclusion.
    target_assemblies : `list` of `str`
        The names of the target genome assemblies.
    """
    study_name = "study1"
    top_hits_pvalue = 0.05
    target_assemblies = ["GRCh37", "GRCh38"]
    return *(_get_study_dirs(study_name)), top_hits_pvalue, target_assemblies

This defines the study name (which is the same as the directory name for the test data), the top hits p-value cut-off and the target genome assemblies for the pipeline to test.

This function can then be used by:

from gwas_norm.example_data import examples

srd, sd, rd, xml, th, ta = examples.get_data("study1_data")

Implementing the actual test#

You can write your own functions for doing this in ./tests/test_gwas_norm.py, or you can use the function that already exists, which should do most of the work, although it will need editing for GRCh38 tests (see the line for ga, ga_dir in [('b37', gas['b37'])]).

This function is test_gwas_norm and it is parameterised, so you should just be able to add your test data name and a chunksize (use something small but > 1) and it should run your test alongside any of the others. The function definition is shown below:

import pytest


@pytest.mark.parametrize(
    "test_dataset,chunksize",
    (
        ("study1_data", 5),
        # <YOUR TEST GOES HERE>
    )
)
def test_gwas_norm(tmpdir, test_dataset, chunksize):
    pass

The tmpdir is a fixture implemented by pytest.
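
If you write your own test function, remember that the expected files are uncompressed while the pipeline normally compresses its output, so the comparison has to read both transparently. A sketch of such a comparison is below. It assumes gzip-compressed pipeline output, detected by its magic bytes. The compression actually used by the pipeline is an assumption here, and the helper names are hypothetical and not part of gwas_norm.

```python
import gzip


def read_lines(path):
    """Read text lines from a plain or gzip-compressed file."""
    with open(path, "rb") as handle:
        is_gzip = handle.read(2) == b"\x1f\x8b"  # gzip magic number
    opener = gzip.open if is_gzip else open
    with opener(path, "rt") as handle:
        return [line.rstrip("\n") for line in handle]


def assert_same_content(expected_path, observed_path):
    """Fail, reporting the first differing line, if the files differ."""
    expected = read_lines(expected_path)
    observed = read_lines(observed_path)
    for idx, (exp, obs) in enumerate(zip(expected, observed), start=1):
        assert exp == obs, f"line {idx}: expected {exp!r}, got {obs!r}"
    assert len(expected) == len(observed), "files differ in length"
```

Within a test, the observed path would be a file written under tmpdir and the expected path a file under the results directory of the test dataset.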

Running pytest#

You can run the tests by:

$ pytest -s ./tests/test_gwas_norm.py

The -s flag is helpful if you are using print debugging. If you have PyCharm, that has some powerful test debugging tools.

If you run all the tests:

$ pytest -s ./tests/

Then most of the XML file read/write tests will fail at the moment as I altered the XML file structure and have not updated the tests yet.

Summary#

If you have any issues, please contact me. It could be that some things are not as generalisable as I intended them to be.

Obviously, you do not have to stick to the mini-framework outlined above. You can implement something yourself if you want to, but it is a good starting point.