Input file columns

The GWAS normalisation pipeline knows how to handle multiple types of columns that can be available in the input GWAS files. It attempts to use the data in various ways to generate a uniform output file.

In essence, the input columns required by gwas-norm can vary but they must collectively allow the minimal effective data to be available after the first stage of standardisation. The minimal effective data is defined as:

chromosome name
start position
effect allele
effect size
standard error
p-value

However, for certain effect types, this is relaxed:

z_score_cc
z_score_log_or
direction_beta
direction_log_or

These effect types need an effect allele frequency, which may only become available after the second phase (variant mapping). For them, the minimal effective data is therefore relaxed to drop the effect size and standard error requirements.
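
The rule above can be expressed as a small check. This is only a sketch under stated assumptions: the function and constant names are hypothetical and not part of gwas-norm, the column names are taken from the column definitions in this document, and "beta" in the usage note is a placeholder for any effect type that is not in the relaxed list.

```python
# Hypothetical illustration only; these names are not gwas-norm's API.
MINIMAL_COLUMNS = {
    "chr_name", "start_pos", "effect_allele",
    "effect_size", "standard_error", "pvalue",
}

# Effect types for which effect size and standard error are not required
# after the first stage of standardisation.
RELAXED_EFFECT_TYPES = {
    "z_score_cc", "z_score_log_or", "direction_beta", "direction_log_or",
}


def required_columns(effect_type):
    """Return the columns that must be populated after the first stage."""
    if effect_type in RELAXED_EFFECT_TYPES:
        return MINIMAL_COLUMNS - {"effect_size", "standard_error"}
    return set(MINIMAL_COLUMNS)


def has_minimal_data(row, effect_type):
    """True if every required column holds a value that is not None or empty."""
    return all(
        row.get(name) not in (None, "")
        for name in required_columns(effect_type)
    )
```

With a row that carries only chr_name, start_pos, effect_allele and pvalue, has_minimal_data(row, "z_score_cc") is true, while has_minimal_data(row, "beta") is false because effect_size and standard_error are missing.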

In addition to the minimal effective data, an other_allele column must be available and populated after the mapping phase. Any rows that are missing the other allele will be removed at this stage.

Currently, the route taken to standardise a row is based on the available columns and not the data within those columns. For example, if you have defined both a standard_error column and a ci_combined column, then the standard_error column is used. If a row is missing a standard_error value, then the pipeline will not use the ci_combined column. This is because a parse tree is derived before the file is processed for performance reasons. This may change in future.
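
To make the consequence concrete, the sketch below chooses the source of the standard error once, from the defined columns, and then applies that choice to every row. It is a simplified illustration, not gwas-norm's parse tree: the function names are hypothetical, and only the preference of standard_error over ci_combined comes from the text above.

```python
# Hypothetical sketch; names are illustrative and not part of gwas-norm.
# Only the standard_error-before-ci_combined preference is taken from the text.
SE_SOURCES = ("standard_error", "ci_combined")


def choose_se_source(defined_columns):
    """Pick the standard error source once, from the column definitions."""
    for name in SE_SOURCES:
        if name in defined_columns:
            return name
    return None


def standard_error_inputs(rows, defined_columns):
    """Yield the raw value each row contributes towards the standard error."""
    # The route is fixed before any row is read.
    source = choose_se_source(defined_columns)
    for row in rows:
        # No per-row fallback: a missing value in the chosen column stays missing.
        yield row.get(source) if source else None
```

If both columns are defined and a row has an empty standard_error, the generator yields None for that row even when its ci_combined field is populated.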

Allowed column definitions

The column types that are recognised by gwas-norm, and how they are used, are outlined below. In the file metadata, these are applied as elements within the <columns> element. For example, chr_name should be an XML tag <chr_name>.
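
As a rough illustration of that layout, the sketch below builds a <columns> element with Python's standard xml.etree.ElementTree module. Two things here are assumptions rather than documented behaviour: that the text of each tag is the header name of the matching column in the input file, and the header names themselves (CHR, BP, A1, P), which are invented for the example. The helper name build_columns_element is also hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed mapping of column type -> header name in the input file.
# The header names are invented for illustration.
column_map = {
    "chr_name": "CHR",
    "start_pos": "BP",
    "effect_allele": "A1",
    "pvalue": "P",
}


def build_columns_element(column_map):
    """Build a <columns> element with one child tag per column type."""
    columns = ET.Element("columns")
    for column_type, header in column_map.items():
        child = ET.SubElement(columns, column_type)
        # Assumption: the tag content names the input file column.
        child.text = header
    return columns


xml_text = ET.tostring(build_columns_element(column_map), encoding="unicode")
```

Here xml_text is a single-line string that begins with <columns><chr_name>CHR</chr_name>.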

chrpos - A column where any of the chr_name, start_pos, end_pos, effect_allele, other_allele can be defined. This must also be accompanied by a chrpos_spec definition in the file metadata or the API. This details the types and order of the data in the chrpos column.
chr_name - The chromosome name column of the input file. Chromosome values are treated as strings. gwas-norm can also standardise chromosome names according to the user’s preference; this is outlined in the chromosome mapping section. This column is required if a <chrpos> column is not defined with the chr_name in the chrpos_spec.
start_pos - The 1-based start position column of the input file. The start position should be an integer in base pairs. This column is required if a <chrpos> column is not defined with the start_pos in the chrpos_spec.
end_pos - The 1-based end position column of the input file. The end position should be an integer in base pairs. This column is not required; if it is absent, the end_pos is calculated as the start_pos plus the effect_allele length, minus 1.
effect_allele - The effect allele column of the input file. This should be a DNA string. Currently, non-DNA alleles are not supported. This column is required if a <chrpos> column is not defined with the effect_allele in the chrpos_spec.
other_allele - The other allele (non-effect allele) column of the input file. This should be a DNA string. Currently, non-DNA alleles are not supported. This column is optional, but it is a good idea to use it if you have it. If it is not supplied, the pipeline will attempt to impute the non-effect allele; if this fails, the row is excluded from the final standardised GWAS data. This can be defined as part of the chrpos_spec of a chrpos column.
minor_allele - The allele of the variant site that has the lowest allele frequency.
number_of_samples - The number of sample genotypes that were used in the association. This is carried over into the normalised file. If not provided then a global sample number is used from the cohort definitions (if available).
number_of_cases - The number of case samples that were used in the association. If this is supplied along with the number_of_controls, these are used in combination to generate the number_of_samples.
number_of_controls - The number of control samples that were used in the association. If this is supplied along with the number_of_cases, these are used in combination to generate the number_of_samples.
effect_allele_count - The allele count for the effect allele. If this is known along with the number of samples (either directly or indirectly via case/controls or global sample size), then it is used to calculate effect allele frequency.
minor_allele_count - The allele count for the minor allele. If this, the minor_allele and the number_of_samples (either directly or indirectly via case/controls or global sample size) are known, then they are used to calculate the effect_allele_freq.
minor_allele_freq - The allele frequency for the minor allele. If this and the minor_allele are known, then they are used to calculate the effect_allele_freq.
effect_allele_freq - The frequency of the effect allele. This is taken through to the normalised data file.
effect_size - The effect size for the effect allele. If this is not available, then the row is classified as a bad row and will not be present in the final file.
ci_lower - The lower bound of the confidence interval. Currently, the coverage for confidence intervals is fixed at 95%; attributes will be added in future to allow for different CI coverage.
ci_upper - The upper bound of the confidence interval. Currently, the coverage for confidence intervals is fixed at 95%; attributes will be added in future to allow for different CI coverage.
ci_combined - The upper and lower confidence interval bounds in a single field. This should have the format <lower CI><optional space><delimiter><optional space><upper CI>, where the delimiter can be one of ,;:_. Currently, the coverage for confidence intervals is fixed at 95%; attributes will be added in future to allow for different CI coverage. This is used to derive the standard_error if it is not available.
standard_error - The standard error for the effect size.
t_statistic - The t-statistic of the effect size. This is used to derive the standard error if it is not available.
pvalue - The p-value. If not available, it will be re-calculated from the standard error and the effect size. If provided, it can be -log10 transformed; a flag in the metadata <file> element indicates that it is -log10 transformed.
var_id - The variant identifier. Typically this is an rsID, although it does not have to be. If provided, it can be used in the mapping process.
strand - The strand for the variant position/alleles. This will be present in very few datasets. Allowed values (case-insensitive) for the forward strand are f, forward, +, 1, positive, plus and for the reverse strand are r, reverse, -, -1, negative, minus. If provided, negative strand variants are set to the positive strand. If not provided, then it is assumed that everything is on the forward strand.
imputation_info - The imputation info score. This is not used directly but can be defined with the info="true" attribute to add to the info column.
het_i_square - The heterogeneity I-square value for meta-analysed GWAS. This is not used directly but can be defined with the info="true" attribute to add to the info column.
het_pvalue - The heterogeneity p-value for meta-analysed GWAS. This is not used directly but can be defined with the info="true" attribute to add to the info column.
het_chi_square - The heterogeneity chi-square value for meta-analysed GWAS. This is not used directly but can be defined with the info="true" attribute to add to the info column.
het_df - The heterogeneity degrees of freedom value for meta-analysed GWAS. This is not used directly but can be defined with the info="true" attribute to add to the info column.
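
Several of the definitions above describe values that are derived from other columns. The sketch below illustrates three of them: the end_pos calculation, parsing a ci_combined field into a standard error, and strand handling. All function and constant names are hypothetical and this is not gwas-norm's implementation. It also rests on assumptions that the text does not state: that the 95% interval is symmetric and normal on the scale of the effect size, so the standard error is the interval width divided by twice the 0.975 normal quantile, and that moving an allele to the forward strand means taking its reverse complement.

```python
import re
from statistics import NormalDist

# All names below are hypothetical; none of this is gwas-norm's own code.

# Normal quantile for a two-sided 95% interval (about 1.96).
Z_95 = NormalDist().inv_cdf(0.975)

# <lower CI><optional space><delimiter><optional space><upper CI>
CI_PATTERN = re.compile(r"^([^\s,;:_]+) ?[,;:_] ?([^\s,;:_]+)$")

FORWARD = {"f", "forward", "+", "1", "positive", "plus"}
REVERSE = {"r", "reverse", "-", "-1", "negative", "minus"}
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def end_pos(start_pos, effect_allele):
    """End position from the start position and the effect allele length."""
    return start_pos + len(effect_allele) - 1


def parse_ci_combined(text):
    """Split a ci_combined field into (lower, upper) floats."""
    match = CI_PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognised ci_combined value: {text!r}")
    return float(match.group(1)), float(match.group(2))


def se_from_ci(lower, upper):
    """Standard error from a 95% CI, assuming a symmetric normal interval."""
    return (upper - lower) / (2 * Z_95)


def normalise_strand(value):
    """Map an allowed strand value to '+' or '-'; missing means forward."""
    if value is None:
        return "+"
    key = str(value).strip().lower()
    if key in FORWARD:
        return "+"
    if key in REVERSE:
        return "-"
    raise ValueError(f"unrecognised strand value: {value!r}")


def to_forward(allele, strand):
    """Return the allele on the forward strand (assumed reverse complement)."""
    if normalise_strand(strand) == "-":
        return allele.translate(COMPLEMENT)[::-1]
    return allele
```

For example, end_pos(100, "ACG") gives 102, parse_ci_combined("0.1, 0.5") gives (0.1, 0.5), and to_forward("AAG", "reverse") gives "CTT". For ratio-type effects such as odds ratios, the interval bounds would normally need to be log transformed before applying se_from_ci; that step is left out here.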