The directory structure of the normalised data#
Below the root of the study normalisation directory (study_norm_dir in the metadata) is a fixed directory structure that contains copies of the original data, the normalised data and various summary files generated during the normalisation.
gwas_data
- The name of the directory where the normalised GWAS data and log files are stored.
metadata
- Contains copies of the metadata XML files.
original_files
- Contains copies of the original un-normalised source data files. This may have additional directories depending on the directory structure below the study source data variable (study_source_dir in the metadata).
support_files
- Contains copies of any other support files that were defined in the study metadata file (metafiles in the metadata). As with the source files, this may have additional directories depending on the directory structure below the study source data variable.
The GWAS data directory (gwas_data)#
This directory contains all of the standardised GWAS data and the various log/summary files associated with the normalisation process. There will be a sub-directory for each target genome assembly that was requested. The precise names of these directories are determined by the genome assembly synonyms within your genomic config file. For example, if the assembly synonyms in your config file look something like this:
[assembly_synonyms.human]
grch38 = b38
b38 = b38
GRCh38 = b38
hg38 = b38
grch37 = b37
GRCh37 = b37
b37 = b37
hg19 = b37
and you requested the target genome assemblies GRCh37 and GRCh38, then there would be two sub-directories called b37 and b38. These are referred to as genome assembly specific directories and their structure is outlined below.
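The mapping from requested assemblies to directory names can be sketched as below. This is only an illustration and not the tool's own parser: it assumes the config fragment can be read as INI-style text with Python's configparser, and the function name `assembly_dir_names` is hypothetical. Key case is preserved, because the example config contains both grch38 and GRCh38 as separate synonyms.

```python
import configparser

# The example synonyms from the text above.
CONFIG_TEXT = """
[assembly_synonyms.human]
grch38 = b38
b38 = b38
GRCh38 = b38
hg38 = b38
grch37 = b37
GRCh37 = b37
b37 = b37
hg19 = b37
"""


def assembly_dir_names(config_text, species, requested):
    """Return the sorted, de-duplicated directory names for the requested assemblies.

    Hypothetical helper: the real tool may resolve synonyms differently.
    """
    parser = configparser.ConfigParser()
    # Keep keys case-sensitive; by default configparser lower-cases them,
    # which would make grch38 and GRCh38 collide.
    parser.optionxform = str
    parser.read_string(config_text)
    synonyms = parser[f"assembly_synonyms.{species}"]
    return sorted({synonyms[name] for name in requested})


print(assembly_dir_names(CONFIG_TEXT, "human", ["GRCh37", "GRCh38"]))
# prints ['b37', 'b38']
```

Requesting two synonyms of the same assembly (for example hg38 and GRCh38) gives a single directory name in this sketch, since both map to b38.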
Genome assembly specific directories#
Each genome assembly specific directory contains two sub directories:
data_files
- This contains the normalised GWAS summary statistics files. If the source was a study with separate analyses, then there will be a file for each analysis, along with a tabix index for each data file. The file names will have the structure <analysis_name>_<pubmed_id>.<genome_assembly>.gnorm.gz. If the pubmed ID is unknown then a dummy pubmed ID of 00000000 is used. The <genome_assembly> will be the same as the name of the genome assembly specific directory. If the source was a study file, where all of the analyses are combined in a single file, then the normalised file name will have the structure <study_name>_<pubmed_id>.<genome_assembly>.gnorm.gz.
summary_files
- This contains aggregated summary/log files for all of the analyses in the study and is documented in more detail below.
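The data file naming scheme can be sketched as a small helper. The function name `normalised_file_name` and the example analysis and study names are hypothetical. Only the file name pattern and the dummy pubmed ID come from the description above.

```python
DUMMY_PUBMED_ID = "00000000"  # used when the pubmed ID is unknown


def normalised_file_name(name, genome_assembly, pubmed_id=None):
    """Build <name>_<pubmed_id>.<genome_assembly>.gnorm.gz.

    `name` is the analysis name for per-analysis files, or the study
    name when all analyses are combined in a single study file.
    """
    pmid = str(pubmed_id) if pubmed_id else DUMMY_PUBMED_ID
    return f"{name}_{pmid}.{genome_assembly}.gnorm.gz"


print(normalised_file_name("ldl_cholesterol", "b38", 12345678))
# prints ldl_cholesterol_12345678.b38.gnorm.gz
print(normalised_file_name("my_study", "b37"))
# prints my_study_00000000.b37.gnorm.gz
```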
The summary files directory (summary_files)#
This can contain up to 4 different summary files plus some tabix indexes, depending on the type of summary file.
The top hits file. You can specify a p-value cutoff for a standardised row to appear in the summary top hits file; the default value is 5E-04. This is an aggregated file containing the top hits from all of the analyses. It will have the file name structure <study_name>_<pubmed_id>_top_hits.<genome_assembly>.gnorm.gz and it will be tabix indexed, so it will have an associated .tbi file. If no rows in any of the GWAS reach the specified p-value cutoff, then this file will not be created.
The failed liftover file. If you have requested a target genome assembly that is different from your source genome assembly, then any variants that could not be lifted over will be contained within this file. If the target genome assembly was the same as the source assembly, or everything lifted over successfully, then this file will not be created. If created, it will have the file name structure <study_name>_<pubmed_id>_failed_liftover.<genome_assembly>.gnorm.gz and will also be tabix indexed.
The bad data file. This contains any source GWAS summary data rows, from any of the analyses, that could not be normalised. The majority of bad data is rejected in the first stage of normalisation, prior to lifting over. Even so, a bad data file is output for each target genome assembly. However, for some of the more exotic effect types, bad data rows can be rejected after mapping, so the co-ordinate systems in this file can be mixed, although the assembly for each row is indicated. Additionally, depending on the file structure of the input, the rows may have many columns. For these reasons, the file is left unsorted and is not indexed. It will have the file name structure <study_name>_<pubmed_id>_bad_data.<genome_assembly>.txt.gz. If no bad data rows are found then it will not be created.
The test results file. If you have specified tests within your metadata file, then the results of all of the specified tests will be here. Even though tests are source genome assembly specific, they are lifted over to each target assembly and applied after the mapping phase. If you have specified no tests, then this file will not be created. If it is created, it will have the file name structure <study_name>_<pubmed_id>_tests.<genome_assembly>.txt.gz.
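Because each of these files is only created when it has content, it can be handy to check which ones exist for a given study. The sketch below is one way to do that. The helper names `summary_file_names` and `present_summary_files` are hypothetical. Only the file name patterns are taken from the description above.

```python
from pathlib import Path

# Summary file kinds and their extensions, as described above.
SUMMARY_EXTENSIONS = {
    "top_hits": "gnorm.gz",         # tabix indexed
    "failed_liftover": "gnorm.gz",  # tabix indexed
    "bad_data": "txt.gz",           # unsorted, not indexed
    "tests": "txt.gz",
}


def summary_file_names(study_name, pubmed_id, genome_assembly):
    """Map each summary file kind to the file name it would have."""
    return {
        kind: f"{study_name}_{pubmed_id}_{kind}.{genome_assembly}.{ext}"
        for kind, ext in SUMMARY_EXTENSIONS.items()
    }


def present_summary_files(summary_dir, study_name, pubmed_id, genome_assembly):
    """Return only the summary files that exist in summary_dir."""
    summary_dir = Path(summary_dir)
    names = summary_file_names(study_name, pubmed_id, genome_assembly)
    return {
        kind: summary_dir / name
        for kind, name in names.items()
        if (summary_dir / name).is_file()
    }
```

Following the layout described earlier, `summary_dir` would be the summary_files directory inside a genome assembly specific directory (for example b38) under gwas_data.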
While the normalisation is running, you will also see several directories present in the summary files directory. These are failed_liftover (if liftovers are happening), tests, top_hits and bad_data. They contain hidden log files and intermediate analysis-specific files that are merged together when the last normalisation process has run. Once the intermediate files are merged, these directories are deleted.