Output files#

This describes the various output files that are created by the GWAS normalisation pipeline.

In addition to the input files that are copied to the normalisation directory, the gwas-norm pipeline will create up to four different file types for each target assembly.

  1. The full standardised GWAS summary statistics files (.gnorm files).

  2. Top hits summary files. These are a subset of the full standardised GWAS summary statistics files, combined across all analyses (.gnorm files).

  3. Bad data files: These are combined rows from all analyses that could not be standardised for some reason. Their file format will depend partly on the format of the input files.

  4. Failed liftover files. These are combined across all analyses and contain rows that could not be lifted over from the source to the target assembly. If the source and target assemblies are the same, then these files will not be created (.gnorm files).

The full standardised GWAS files are the only files that are guaranteed to be produced. The top hits, bad data and failed liftover files are only produced if there are any rows that pass the top-hit p-value threshold, cannot be standardised, or cannot be lifted over, respectively.

Also, unless the GWAS input files are study-file-level summary statistics, there will be a single standardised GWAS file for each analysis defined in the metadata, whereas the top hits, bad data and failed liftover files are each a single file, aggregated over all analyses.

In addition, all of the .gnorm file types (full standardised data, top hits and failed liftovers) will be bgzipped and sorted on chromosome name in string sort order (C-locale), then on integer start position and end position. They will therefore also all have an associated tabix index file.

Please be aware that the sort order of these files is not the biological chromosome sort order. This is deliberate, as the sort order of the file can then always be known without being defined in the header (the sort order can also be extracted from the tabix index). In most cases, biological sort order is only required for visualisation and data presentation.
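The practical effect of the C-locale string sort can be seen with a short sketch (plain Python, no gwas-norm code involved; Python's default string comparison is byte-wise for ASCII names, matching the C locale):

```python
# Chromosome names are compared as strings, so "10" sorts before "2".
chrom_names = ["1", "2", "3", "10", "20", "MT", "X"]

# Byte-wise (C-locale style) sort, as used in the .gnorm output files.
string_sorted = sorted(chrom_names)
print(string_sorted)  # ['1', '10', '2', '20', '3', 'MT', 'X']
```

This is why a biological ordering (1, 2, 3, ..., 10, ...) should not be assumed when streaming through a .gnorm file.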

The .gnorm format#

The normalised GWAS files are a simple tab-delimited flat file format with a single header line. The co-ordinates are 1-based and sorted in chromosome string order and integer start/end position. The columns are defined below.

  1. chr_name (string) - The chromosome name, these are treated as strings. How the chromosomes are actually defined depends on any chromosome mappings you have defined in your genomic config file.

  2. start_pos (integer) - The start position of the effect allele.

  3. end_pos (integer) - The end position, this is the start position + effect allele length -1.

  4. effect_allele (string) - The effect allele. Currently gwas-norm only supports DNA alleles (ATCG), and alleles are represented in uppercase. If the variant site is known (in the mapping file), then the effect allele will always be the alternate allele in the reference genome. If the variant site is unknown, then the effect allele will always be the lowest allele in the sort order of effect_allele/other_allele.

  5. other_allele (string) - The non-effect allele. Currently gwas-norm only supports DNA alleles (ATCG), and alleles are represented in uppercase. If the variant site is known (in the mapping file), then the other allele will always be the reference allele in the reference genome. If the variant site is unknown, then the other allele will always be the highest allele in the sort order of effect_allele/other_allele.

  6. var_id (string) - The variant identifier, commonly these are rsIDs from dbSNP, but they do not have to be. The name of the identifier will be determined by the name in the mapping VCF file. If the variant is unmapped then the variant identifier will be set to . (missing).

  7. effect_allele_freq (float) - The effect allele frequency. If available in the input data then this is used. If not then one is calculated from the mapping file according to the allele frequency cohort definitions in the metadata files.

  8. effect_size (float) - The effect size with respect to the effect allele. The type of the effect size is given in the effect_type column.

  9. standard_error (float) - The standard error of the effect size.

  10. mlog10_pvalue (float) - The -log10(p-value). The pipeline will attempt to re-calculate any p-values that have been zeroed due to precision issues. The negative log10 representation will ensure that they do not need any special handling downstream.

  11. number_of_samples (integer) - The number of samples for the variant summary stats. If these are defined in the input data, then they are used. However, if not, then a total sample size will be extracted from any cohort definitions (if known) and a flag will be placed in the norm_info column to indicate that an aggregate sample number has been used for that row.

  12. effect_type (string) - The effect type, please see the GWAS effect types for more detailed information.

  13. analysis_type (string) - The analysis type, please see the GWAS analysis types for more detailed information.

  14. phenotype (string) - A phenotype indicator; this will be the same as the phenotype reference string that has been defined in the metadata files. It does not allow for as rich a definition as the metadata files, but it is useful for day-to-day purposes.

  15. caveat (string) - A caveat indicator, caveats are any co-variates that will change the interpretation of the effect size. As with the phenotype indicator, this will be the same as the caveat reference string that has been defined in the metadata files.

  16. study_id (integer) - A study ID value. These are defined in the metadata files. They can be user defined (recommended), defined via a config file, or they fall back to a randomly generated integer between 1 and 100000000.

  17. analysis_id (integer) - An analysis ID value. These are defined in the metadata files. They can be user defined (recommended), defined via a config file, or they fall back to a randomly generated integer between 1 and 100000000.

  18. uni_id (string) - A universal identifier. These can be used to match up variant sites if the var_id is unknown. They are universal only for the same strand, which should be the positive strand. They are defined as the <chr_name>_<start_pos>_<effect/other alleles in sort order separated by _>.

  19. eaf_populations (string) - The population identifiers used to derive the effect allele frequency. If the effect allele frequency was derived from the input files then this will be STUDY, if they have been calculated according to the cohort definitions then these will be the populations defined in the reference population tags.

  20. norm_info (integer) - A bit-wise integer used to store information about how the row was standardised. Specifically, the sample size. See below for more information.

  21. map_info (integer) - A bit-wise integer used to store information about how the variant was located in the mapping file. Please see mapping bits for more information.

  22. info (string) - An info field for additional information that is not critical to the interpretation of the GWAS summary statistics. These could be additional annotations such as VEP or specific columns you want to carry over from the input file. Some info is automatically added by gwas-norm during the standardisation/mapping, this is detailed below along with the format of the info field. Also see the metadata info tags for more information on defining info fields.
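As an illustration of the uni_id definition above, here is a minimal sketch that builds the identifier from its parts. The function name is hypothetical and not part of the gwas-norm API; it simply follows the `<chr_name>_<start_pos>_<alleles in sort order>` rule described in column 18:

```python
def make_uni_id(chr_name: str, start_pos: int,
                effect_allele: str, other_allele: str) -> str:
    """Build a universal identifier as described for the uni_id column.

    Alleles are uppercased and joined in string sort order, so the same
    site yields the same uni_id regardless of which allele is the
    effect allele (hypothetical helper, for illustration only).
    """
    alleles = sorted([effect_allele.upper(), other_allele.upper()])
    return "_".join([chr_name, str(start_pos)] + alleles)

print(make_uni_id("7", 12345, "T", "A"))  # 7_12345_A_T
print(make_uni_id("7", 12345, "A", "T"))  # 7_12345_A_T (same site, same ID)
```

Because the allele order is normalised, the identifier can be used to match variant sites between files even when the effect allele differs.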

The norm_info column#

The norm_info column contains a bit-wise integer that documents any steps that have occurred to standardise the row, or any flags that the user might want to be made aware of. Most of the time this will be 0, i.e. no special steps were taken to standardise the row. Currently only two flags are implemented, although more are planned for the future.

  • OK (0) - No special handling was performed on the row.

  • GLOBAL_SAMPLE_SIZE (1) - The sample size column data is based on a global sample size defined in the metadata.

The data in this column can be interrogated using bitwise operations.
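For example, a bitwise AND against the flag value tells you whether a given row carries that flag (a plain-Python sketch; the flag values are taken from the list above):

```python
# Flag values as documented for the norm_info column.
OK = 0
GLOBAL_SAMPLE_SIZE = 1

def used_global_sample_size(norm_info: int) -> bool:
    """True if the row's sample size came from the metadata, not the input."""
    return bool(norm_info & GLOBAL_SAMPLE_SIZE)

print(used_global_sample_size(0))  # False - no special handling
print(used_global_sample_size(1))  # True  - global sample size was used
```

Testing individual bits in this way will keep working as further flags are added, since each flag occupies its own bit.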

The info column#

The info column is essentially a dumping ground for any other data that is not critical for the interpretation of the GWAS summary statistics. It is conceptually similar to the INFO field in a VCF file although there are some key differences.

In order to maintain the .gnorm format as a flat file, the data types of the info field are inferred from the info format. During normalisation, these are checked to make sure everything is compatible. However, the convenience of command-line hacking comes with a downside: there is no info header in the file, so if the rows of a .gnorm file are chopped about on the command line, it is possible that incompatible data types will be placed together. This is why the critical data is defined within the other, non-info columns in the .gnorm file. However, each row should always be parseable, provided the format of the info field is correct.

The format of the info field is fairly simple. The info field names (keys) are defined followed by values after an equals sign (=), i.e. ensembl_gene_id="ENSG00000139618". If the values are quoted, then they are interpreted as strings; if not, they are interpreted as floats. Arrays have the format key=["element1"|"element2"|"element3"] or key=[0.123,0.456,0.789]. For strings, internal quoting is allowed and should not cause any issues. Each key/value pair should be separated with a semi-colon (;).

The pipeline, and the methods that control parsing the info fields to and from text, will sort the info field into alphabetical order based on the keys, although this is not strictly required.
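The grammar above can be sketched as a small parser. This is a naive illustration, not the gwas-norm implementation: it handles the quoted-string, float, and array cases described above but ignores internal quoting and escaped separators:

```python
def parse_info(info: str) -> dict:
    """Naive parser for the info field format described above.

    Quoted values become strings, unquoted values become floats;
    arrays use [..|..] for strings and [..,..] for floats.
    Internal quotes and escaped separators are NOT handled here.
    """
    out = {}
    for pair in info.split(";"):
        if not pair:
            continue
        key, _, raw = pair.partition("=")
        if raw.startswith("[") and raw.endswith("]"):
            body = raw[1:-1]
            if '"' in body:
                out[key] = [e.strip('"') for e in body.split("|")]
            else:
                out[key] = [float(e) for e in body.split(",")]
        elif raw.startswith('"') and raw.endswith('"'):
            out[key] = raw[1:-1]
        else:
            out[key] = float(raw)
    return out

print(parse_info('caddp=12.3;vep="missense_variant";obs=["gnomad"|"topmed"]'))
```

A robust implementation would need to respect the internal-quoting rule mentioned above; this sketch is only meant to make the type-inference rules concrete.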

Info fields added during normalisation#

The gwas-norm pipeline will add the following info fields during normalisation. Currently, there is no control over these additional fields being added, although that is planned for the future.

Added to all rows during normalisation:

  • idx (scalar integer) - The row number of the source row in the input data file. Note that if the input is spread over many files, then this value will be repeated for each input file, i.e. it is only unique within a file, not between them.

If the variant has been mapped and the data is available in the mapping file, these will be added:

  • caddp (scalar float) - The CADD PHRED-scaled score.

  • caddr (scalar float) - The CADD raw score.

  • polyp (scalar float) - The PolyPhen score.

  • sift (scalar float) - The SIFT score.

  • clinvar (scalar string) - The ClinVar annotation.

  • nsites (scalar float) - The number of other variant sites that overlap with the mapped variant site.

  • obs (string array) - The other datasets in the mapping file where the variant site has been observed.

  • vep (scalar string) - The worst VEP consequence for the variant site (according to this ranking).

The bad data file format#

The bad data files contain any rows that can’t be standardised across all analyses. If there are no bad rows then this file will not be present.

As the bad data file will potentially contain rows from multiple files, its file format will depend on the rows present across all the input files. An aggregated header is made from the headers of the input files, and the bad rows are aligned accordingly.

In addition, if the effect type is one of z_score_cc, z_score_log_or, direction_beta or direction_log_or, then the bad data file will contain the column names from the normalised file, as these effect types can fail after the initial row standardisation.

However, some initial columns are added to the bad row files that give an indication of the failed value, where it failed and the failure error message. These are outlined below:

  1. error_file_name (string) - The basename of the input file where the error row is located.

  2. source_row_idx (integer) - The row number (1-based) of the input file where the error row is located.

  3. error_stage (integer) - The stage in the pipeline where the error occurred; 1 is the first stage (standardisation), 2 is the second stage (after mapping).

  4. error_function (string) - The name of the function where the error occurred.

  5. error_message (string) - The specific error message that was thrown.

  6. error_value (string) - The specific value that caused the error.

<columns from input data>