# Input file columns
The GWAS normalisation pipeline knows how to handle multiple types of columns that can be available in the input GWAS files. It attempts to use the data in various ways to generate a uniform output file.
In essence, the input columns required by gwas-norm can vary, but they must collectively allow the minimal effective data to be available after the first stage of standardisation. The minimal effective data is defined as:
- chromosome name
- start position
- effect allele
- effect size
- standard error
- p-value
However, for certain effect types, this is relaxed:
- `z_score_cc`
- `z_score_log_or`
- `direction_beta`
- `direction_log_or`
These effect types require an effect allele frequency to be available, which may only be present after the second (variant mapping) phase, so the minimal effective data for them is relaxed to remove the effect size and standard error requirements.
In addition to the minimal effective data, an `other_allele` column must be available and populated after the mapping phase. Any rows that are missing the other allele will be removed at this stage.
Currently, the route taken to standardise a row is based on the available columns and not the data within those columns. For example, if you have defined both a `standard_error` column and a `ci_combined` column, then the `standard_error` column is used. If a row is missing a `standard_error` value, the pipeline will not fall back to the `ci_combined` column. This is because a parse tree is derived before the file is processed, for performance reasons. This may change in future.
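As a rough illustration of why this happens, here is a minimal Python sketch of a parse route being fixed up front from the defined columns. The function names (`build_se_parser`, `se_from_ci`), the fallback logic and the 95% normal-quantile conversion are assumptions for illustration only, not gwas-norm's actual code:

```python
import re

Z_95 = 1.959963984540054  # normal quantile for a 95% confidence interval


def se_from_ci(ci_combined: str) -> float:
    """Derive a standard error from a combined '<lower><delim><upper>' 95% CI field."""
    lower, upper = (float(x) for x in re.split(r"\s*[,;:_]\s*", ci_combined.strip()))
    return (upper - lower) / (2 * Z_95)


def build_se_parser(defined_columns):
    """Pick the standard-error route once, from the defined columns, not per row."""
    if "standard_error" in defined_columns:
        # The route is fixed here: a row with an empty standard_error will NOT
        # fall back to ci_combined, it simply yields no standard error.
        return lambda row: float(row["standard_error"]) if row.get("standard_error") else None
    if "ci_combined" in defined_columns:
        return lambda row: se_from_ci(row["ci_combined"])
    return lambda row: None


# Both columns are defined, so standard_error wins for every row.
parse_se = build_se_parser({"standard_error", "ci_combined"})
print(parse_se({"standard_error": "0.013", "ci_combined": "0.05, 0.12"}))  # 0.013
print(parse_se({"standard_error": "", "ci_combined": "0.05, 0.12"}))       # None
```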
## Allowed column definitions
The column types that are recognised by gwas-norm, and how they are used, are outlined below. In the file metadata, these are applied as elements within the `<columns>` element; for example, `chr_name` should be an XML tag `<chr_name>`.
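As a rough illustration, a `<columns>` element for a hypothetical input file with chromosome, position, allele, effect, standard error and p-value columns might look like the snippet below. The column header names used as tag contents, and anything outside the `<columns>` element, are assumptions for the example; check the metadata schema documentation for the exact structure expected by your version of gwas-norm.

```xml
<!-- Hypothetical input header: chrom  pos  ea  oa  beta  se  pval -->
<columns>
    <chr_name>chrom</chr_name>
    <start_pos>pos</start_pos>
    <effect_allele>ea</effect_allele>
    <other_allele>oa</other_allele>
    <effect_size>beta</effect_size>
    <standard_error>se</standard_error>
    <pvalue>pval</pvalue>
</columns>
```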
- `chrpos` - A column in which any of the `chr_name`, `start_pos`, `end_pos`, `effect_allele`, `other_allele` data can be defined. This must also be accompanied by a `chrpos_spec` definition in the file metadata or the API, which details the types and order of the data in the `chrpos` column.
- `chr_name` - The chromosome name column of the input file. Chromosome values are treated as strings. gwas-norm can also standardise chromosome names according to the user's preference; this is outlined in the chromosome mapping section. This column is required if a `<chrpos>` column is not defined with the `chr_name` in the `chrpos_spec`.
- `start_pos` - The 1-based start position column of the input file. The start position should be an integer in base pairs. This column is required if a `<chrpos>` column is not defined with the `start_pos` in the `chrpos_spec`.
- `end_pos` - The 1-based end position column of the input file. The end position should be an integer in base pairs. This column is not required; if absent, the `end_pos` is calculated from the `start_pos` and the `effect_allele` length - 1 (see the sketches after this list).
- `effect_allele` - The effect allele column of the input file. This should be a DNA string. Currently, non-DNA alleles are not supported. This column is required if a `<chrpos>` column is not defined with the `effect_allele` in the `chrpos_spec`.
- `other_allele` - The other allele (non-effect allele) column of the input file. This should be a DNA string. Currently, non-DNA alleles are not supported. Whilst this column is optional, it is a good idea to use it if you have it. If it is not supplied, the pipeline will attempt to impute the non-effect allele; if this fails, the row is excluded from the final standardised GWAS data. This can be defined as part of the `chrpos_spec` of a `chrpos` column.
- `minor_allele` - The allele of the variant site that has the lowest allele frequency.
- `number_of_samples` - The number of sample genotypes that were used in the association. This is carried over into the normalised file. If not provided, then a global sample number is used from the cohort definitions (if available).
- `number_of_cases` - The number of case samples that were used in the association. If this is supplied along with the `number_of_controls`, they are used in combination to generate the `number_of_samples`.
- `number_of_controls` - The number of control samples that were used in the association. If this is supplied along with the `number_of_cases`, they are used in combination to generate the `number_of_samples`.
- `effect_allele_count` - The allele count for the effect allele. If this is known along with the number of samples (either directly or indirectly via cases/controls or a global sample size), then it is used to calculate the effect allele frequency (see the sketches after this list).
- `minor_allele_count` - The allele count for the minor allele. If this, the `minor_allele` and the `number_of_samples` (either directly or indirectly via cases/controls or a global sample size) are known, then they are used to calculate the `effect_allele_freq`.
- `minor_allele_freq` - The allele frequency for the minor allele. If this and the `minor_allele` are known, then they are used to calculate the `effect_allele_freq`.
- `effect_allele_freq` - The frequency of the effect allele. This is taken through to the normalised data file.
- `effect_size` - The effect size for the effect allele. If this is not available, then the row is classified as a bad row and will not be present in the final file.
- `ci_lower` - The lower bound of the confidence interval. Currently the coverage for confidence intervals is fixed at 95%; attributes will be added in future to allow for different CI coverage.
- `ci_upper` - The upper bound of the confidence interval. Currently the coverage for confidence intervals is fixed at 95%; attributes will be added in future to allow for different CI coverage.
- `ci_combined` - The upper and lower confidence interval bounds in a single field. This should have the format `<lower CI><optional space><delimiter><optional space><upper CI>`, where the delimiter can be one of `,;:_`. Currently the coverage for confidence intervals is fixed at 95%; attributes will be added in future to allow for different CI coverage. This is used to derive the `standard_error` if it is not available.
- `standard_error` - The standard error for the effect size.
- `t_statistic` - The t-statistic of the effect size. This is used to derive the standard error if it is not available.
- `pvalue` - The p-value. If not available, then it will be re-calculated from the standard error and the effect size. If provided, then it can be -log10 transformed and there is a flag in the metadata `<file>` element to indicate that it is -log10 transformed.
- `var_id` - The variant identifier. Typically this is an rsID, although it does not have to be. If provided, then it can be used in the mapping process.
- `strand` - The strand for the variant position/alleles. This will be present in very few datasets. Allowed values (case-insensitive) for the forward strand are `f`, `forward`, `+`, `1`, `positive`, `plus`, and for the reverse strand `r`, `reverse`, `-`, `-1`, `negative`, `minus`. If provided, negative strand variants are set to the positive strand. If not provided, then it is assumed that everything is on the forward strand.
- `imputation_info` - The imputation info score. This is not used directly but can be defined with the `info="true"` attribute to add to the info column.
- `het_i_square` - The heterogeneity I-square value for meta-analysed GWAS. This is not used directly but can be defined with the `info="true"` attribute to add to the info column.
- `het_pvalue` - The heterogeneity p-value for meta-analysed GWAS. This is not used directly but can be defined with the `info="true"` attribute to add to the info column.
- `het_chi_square` - The heterogeneity chi-square value for meta-analysed GWAS. This is not used directly but can be defined with the `info="true"` attribute to add to the info column.
- `het_df` - The heterogeneity degrees of freedom value for meta-analysed GWAS. This is not used directly but can be defined with the `info="true"` attribute to add to the info column.
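To make the frequency-related derivations above concrete, here is a minimal Python sketch of the arithmetic. The function names are hypothetical, the count-to-frequency conversion assumes diploid samples (two alleles per sample), and gwas-norm's own handling of edge cases may differ:

```python
def effect_allele_freq_from_minor(minor_allele: str,
                                  effect_allele: str,
                                  other_allele: str,
                                  minor_allele_freq: float) -> float:
    """Derive the effect allele frequency from the minor allele and its frequency."""
    if minor_allele == effect_allele:
        return minor_allele_freq
    if minor_allele == other_allele:
        return 1.0 - minor_allele_freq
    raise ValueError("minor allele matches neither the effect nor the other allele")


def allele_freq_from_count(allele_count: int, number_of_samples: int) -> float:
    """Derive an allele frequency from an allele count, assuming diploid samples."""
    return allele_count / (2 * number_of_samples)


# e.g. minor allele G at 0.12, effect allele A, other allele G
print(effect_allele_freq_from_minor("G", "A", "G", 0.12))  # 0.88
print(allele_freq_from_count(240, 1000))                   # 0.12
```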
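Similarly, a small sketch of the `end_pos` calculation and the p-value re-calculation described above, assuming a two-sided test based on a normal (Wald) approximation. Again, this is illustrative only and not gwas-norm's actual code:

```python
import math


def end_pos(start_pos: int, effect_allele: str) -> int:
    """end_pos = start_pos + effect allele length - 1, as described above."""
    return start_pos + len(effect_allele) - 1


def pvalue_from_effect(effect_size: float, standard_error: float) -> float:
    """Two-sided p-value from a Wald z statistic (normal approximation)."""
    z = effect_size / standard_error
    return math.erfc(abs(z) / math.sqrt(2.0))


print(end_pos(1000, "AT"))              # 1001
print(pvalue_from_effect(0.1, 0.02))    # ~5.7e-07
```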