The XML metadata file#
The process of normalising one or more GWAS data files requires an XML file to describe the study and the structure of the source data files. This document describes the structure and rationale behind the XML metadata file and will enable the user to put together the file correctly for their GWAS.
Hand crafting an XML file can be a tedious and error-prone process. It is doable for small numbers of GWAS. However, if you are standardising many small GWAS or a large pQTL GWAS, there are better options available to you.
For small GWAS there is a graphical user interface, built by Sandesh Chopade, that can be used to fill in the various fields and output a finished XML file.
For larger GWAS it is best to construct the metadata files programmatically using the available API. The API encapsulates all of the elements of the XML metadata. This allows easy programmatic handling of all aspects of the GWAS metadata through a set of modules located in the
gwas_norm.metadata
sub-package. There is also some example code illustrating this and full API documentation.
Prerequisites#
Before going through the elements in a GWAS norm XML file, it is important to understand the distinction between a GWAS study and a GWAS analysis as they have distinct meanings in gwas-norm.
In addition, an understanding of how to declare file paths in the XML files is useful.
GWAS studies and analyses#
In essence, there is a one-to-many relationship between studies and analyses; that is, a GWAS study can contain multiple GWAS analyses. In the vast majority of cases a study corresponds to a publication and an analysis is one of the phenotypes that has been analysed in that publication. However, there are instances where a study may not have been published, in which case it is up to the user to delineate what constitutes a study and an analysis. Two analyses could also be two different transformations of the same phenotype in the same cohort. Studies do not have phenotypes; they are simply identified via either a pubmed identifier (not a required field) or a study_name / study_id, which are required and should be unique for a study. However, there is no real way of determining if a study identifier is unique across distinct GWAS normalisation runs, although uniqueness is checked if multiple studies are included in the GWAS metadata XML file. If the study does not have a pubmed identifier, a dummy one is used with the value 00000000.
File and directory paths#
It is useful to understand how file paths are handled in the XML files before learning about the actual elements.
Input source files#
The name of the un-normalised source GWAS file is specified in the <file>
element (see below). This is always specified as a path relative to the directory stored within the <study_source_dir>
of the study.
The value of the <study_source_dir>
can either be a relative path or an absolute path, with designations such as ./
and ~/
being expanded and treated as absolute paths. If the <study_source_dir>
is a relative path, then it is relative to an absolute root path stored in the attribute GwasData.root_source_dir
in the API. The value of this attribute can be set by the user if using the API, or input on the command line. However, if you are using the same value all the time, there is also an environment variable GWAS_SOURCE_DATA_ROOT
, this can be set in your ~/.bashrc
or ~/.bash_profile
, i.e. export GWAS_SOURCE_DATA_ROOT="/path/to/dir/holding/un_normalised/studies"
. Note that the environment variable is not used if you manually specify the root_source_dir via the API or the command line. If <study_source_dir>
is a relative path and this is not set (either via GwasData.root_source_dir
, the command line or the environment variable GWAS_SOURCE_DATA_ROOT
) then an error will be raised.
This structure means that the XML files remain valid if the root paths have to be changed for some reason. However, if you do not care about this then you can set the <study_source_dir>
to the absolute path of your study directory.
If your input files are in sub-directories below the <study_source_dir>
, then you should specify the path in the input <file>
relative to the <study_source_dir>
.
To give an example, suppose you had two GWAS files located at the absolute paths:
/data/un_normalised_gwas/study1/analysis1/analysis1_input.txt.gz
/data/un_normalised_gwas/study1/analysis2/analysis2_input.txt.gz
There are two ways this could be represented, either with an absolute path to the study or a relative path to the study. If using a relative path, then the root_source_dir
will be set to /data/un_normalised_gwas
, either via the command line, the GWAS_SOURCE_DATA_ROOT
environment variable or the API. The <study_source_dir>
, will then be set to study1
and the file names will be set within the two <file>
tags to analysis1/analysis1_input.txt.gz
and analysis2/analysis2_input.txt.gz
respectively.
However, there is also the option of setting the <study_source_dir>
to an absolute path, /data/un_normalised_gwas/study1
. In this case the root_source_dir
does not need to be set and the relative file names will stay the same.
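The resolution logic described above can be sketched in a few lines of Python. This is purely illustrative; the resolve_source_path function name and its exact behaviour are assumptions based on the description, not the actual gwas_norm code.

```python
import os

def resolve_source_path(study_source_dir, file_relative_path,
                        root_source_dir=None):
    """Illustrative sketch of source path resolution (not gwas_norm code).

    ~/ paths are expanded; a relative study_source_dir is joined onto
    the root source directory given via the API/command line or the
    GWAS_SOURCE_DATA_ROOT environment variable.
    """
    study_dir = os.path.expanduser(study_source_dir)
    if not os.path.isabs(study_dir):
        # an explicit argument wins over the environment variable
        root = root_source_dir or os.environ.get("GWAS_SOURCE_DATA_ROOT")
        if root is None:
            raise ValueError("study_source_dir is relative but no root "
                             "source directory has been set")
        study_dir = os.path.join(root, study_dir)
    return os.path.join(study_dir, file_relative_path)
```

With the example paths above, resolving `study1` plus `analysis1/analysis1_input.txt.gz` against a root of `/data/un_normalised_gwas` gives back the first absolute path.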
Output normalised files#
The names of the output normalised files are not directly set by the user. The files are named from the analysis name concatenated with the pubmed ID and the genome assembly version. Normalised files are located within a standard directory structure. The root of that directory structure is the path defined in <study_norm_dir>
.
This works in a similar way to the study source directory, <study_norm_dir>
can either be relative or absolute. If it is relative, then it is relative to a path defined in either GwasData.root_norm_dir
or the environment variable GWAS_DEST_DATA_ROOT
.
The XML elements#
The various XML elements are described below. They are arranged approximately in their hierarchical order within the XML file. There are also some small example XMLs at the end of this document.
The <gwas_data>
root#
All of the GWAS metadata for one or more studies is contained in the <gwas_data></gwas_data>
XML element. The API class gwas_norm.metadata.GwasData
can be used to create an object that acts as a container for GWAS studies. There can be only one of these in the XML file.
The <study>
/ <study_file>
elements#
The study elements are the only children contained below <gwas_data>
. There can be more than one study within the <gwas_data>
, although, currently, the pipeline is limited to accepting only one study per <gwas_data>
. Typically, this is not a problem as it makes sense for XML metadata files to be centred around a study. However, this restriction will change in future after more testing.
A study can be thought of as a publication unit, for example, all the GWAS that are published from one publication. However, there is no requirement for this; a study can be anything you want, it is just a natural grouping for the individual analyses that are contained within it. Keep in mind, though, that the source reference genome assembly must be the same for each analysis within the study.
There are two possible ways to represent a study in the XML metadata, either with a <study>
or a <study_file>
element. The one you use will depend on the structure of the data within the GWAS summary statistics file. This is outlined in more detail below but, practically, each input file may contain data from a single analysis, i.e. a single disease; alternatively, all of the GWAS summary data can be contained in a single file, i.e. multiple analyses in a single file. This sort of arrangement is often found in large eQTL/pQTL data sets where 1Mbp around a cis locus is included in a single file. These arrangements are represented by the <study>
and <study_file>
elements respectively. However, both element types have multiple child elements in common:
- <study_name> - A free text string name that the study should be known by. Ideally, this should be unique for each study. The XML parsing API will convert the study name to lower case and any spaces will be replaced with underscores.
- <study_source_dir> - The root directory of the study that contains the un-normalised source files. This can be given as an absolute or relative path (see the file path section above).
- <source_genome_assembly> - The genome assembly of the study (and by extension, the analyses within the study).
- <study_norm_dir> - The root directory of the output study that contains the normalised files. This can be given as an absolute or relative path (see the file path section above). This is optional; if it is not provided then a relative directory name will be created for you from a concatenation of <study_name>_<pubmed_id>.
- <study_id> - An integer ID for the study. Whilst this is optional, ideally it will be set by the user. If it is not provided then one will be created for you; there are several ways this will happen and these are outlined in the study ID section below.
- <pubmed_id> - The pubmed identifier. This is optional. If not provided then a dummy pubmed ID of 00000000 will be used by the pipeline instead.
- <consortium> - Any consortium name for the study. This is optional.
- <url> - Any URLs for the study, i.e. a consortium website. This is optional.
- <metafile> - Any companion files associated with the study. These are not used directly in the pipeline; however, they are re-located into the final normalised directory structure located in the study normalised directory.
- <info> - Descriptions of either input columns or static data elements that will be added to the info column in the normalised GWAS file for every analysis within the study. This is a similar idea to the info field of a VCF but a much more stripped down implementation. This is optional and described in detail below.
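Putting the common child elements together, a study header might look like the following sketch. All of the values here are hypothetical, and the genome assembly label is an assumed format:

```xml
<study>
  <study_name>my example study</study_name>
  <study_source_dir>study1</study_source_dir>
  <source_genome_assembly>grch37</source_genome_assembly>
  <study_id>1</study_id>
  <pubmed_id>12345678</pubmed_id>
  <!-- analysis elements (or, for a study_file, key_analysis elements) -->
</study>
```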
The <study>
element#
The study element should be used for GWAS where the individual GWAS analyses are located in one or more files. That is, each file should only contain data from a single analysis. However, a single analysis may be split over several files, for example, a file for each chromosome. It is also possible that each file has a slightly different structure. In addition to the elements outlined above, the <study>
element must contain one or more <analysis>
elements, that describe the individual analyses and their source files (see below).
The <study_file>
element#
The study file element should be used for GWAS summary data where multiple analyses are located in a single file. For example, this situation exists for data like the GTEx data set, where each analysis is the data from a 1Mbp flank either side of a single gene (phenotype). The data can also be located in multiple files, as long as each file contains data from multiple analyses.
The data within study files must be keyed in some way that allows rows belonging to distinct analyses to be identified. The keys are defined by one or more columns in the input files. The key columns should be defined in the <file> element. The key values relating to each analysis are defined at the analysis level; this is documented in the relevant sections below.
The study file has a number of additional elements:
- <analysis_type> - The type of all the analyses within the study file. This can be one of eqtl, pqtl, mqtl, metabqtl, qtl, disease or trait; the analysis types are described in detail in the analysis type section. This is a required element.
- <effect_type> - The effect type of all the analyses within the study file. This can be one of or (odds ratio), rr (risk ratio), hr (hazard ratio), beta, log_or (log odds ratio), log_rr (log risk ratio), log_hr (log hazard ratio), z_score_cc (Z score that will be converted to a correlation coefficient), z_score_log_or (Z score that will be converted to a log OR), direction_beta (a unit effect direction, 1/-1, that will be used with other data, if available, to calculate a beta) or direction_log_or (a unit effect direction, 1/-1, that will be used with other data, if available, to calculate a log OR). The rationale and further details of these are described in the effect type section. This is required.
- <units> - The units of all the analyses within the study file. This is optional.
- cohort - A description of the cohort the study was performed in. This can be either <cohort>, <case_control_cohort> or <sample_cohort> (described below).
- <file> - One or more files that contain all the analyses data. Each file should contain one or more columns that act as key columns. These columns should collectively contain data that assign rows to specific analyses. This is required.
- <key_analysis> - The study file should contain one or more of these. These will contain descriptions of analysis phenotypes and caveats along with key values for the key columns.
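As a sketch, the overall shape of a <study_file> is shown below. The element names come from the lists above but all of the values are hypothetical:

```xml
<study_file>
  <study_name>my pqtl study</study_name>
  <study_source_dir>study1</study_source_dir>
  <source_genome_assembly>grch38</source_genome_assembly>
  <analysis_type>pqtl</analysis_type>
  <effect_type>beta</effect_type>
  <file>
    <!-- file details, including <keys> naming the key columns -->
  </file>
  <key_analysis>
    <!-- analysis description plus <key> values for the key columns -->
  </key_analysis>
</study_file>
```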
Setting the study_id
and analysis_id
#
Both the study ID and the analysis ID are integer identifier fields that are propagated into the final standardised files. There are several ways these can be set. Ideally, the user should set them according to their own criteria. However, if they are not set by the user then they will be set via the metadata API in one of the following ways.
Using an ID file#
It is possible to create a file on your system to act as the source for identifiers for studies and analyses. The path to this file should be set in the environment variable GWAS_ID_FILE
in your ~/.bashrc
. The ID file should have the following simple structure.
ANALYSIS_ID=1
STUDY_ID=1
If this file is found, then every time a study ID or an analysis ID is required, the file is opened, the IDs are read and incremented, and the file is re-written. This acts similarly to an auto-increment in a relational database.
The file is manipulated in a process-safe way, so it is locked by the process handling it. Therefore it can be used by multiple users providing the permissions are set correctly.
Whilst this works, it does have some drawbacks. Firstly, it requires multiple users not to manually change the file, so you need to trust them. It is also a relatively slow process to read, increment and write the ID file if many IDs need to be generated. Lastly, there is no rollback function for the IDs, so if you have any errors when using the API, the IDs in the file will still be incremented. So, whilst it will generate unique IDs, it is not ideal.
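The read-increment-rewrite cycle with a lock can be sketched as below. This illustrates the pattern, not the gwas_norm implementation; the next_id function name is hypothetical and the sketch is Unix-only because it uses fcntl.

```python
import fcntl

def next_id(id_file, field):
    """Return the current value of field (e.g. "STUDY_ID") from the ID
    file and increment it on disk, holding an exclusive lock throughout.
    """
    with open(id_file, "r+") as fh:
        fcntl.flock(fh, fcntl.LOCK_EX)  # other processes block here
        ids = dict(line.strip().split("=")
                   for line in fh if line.strip())
        current = int(ids[field])
        ids[field] = str(current + 1)
        fh.seek(0)
        fh.truncate()
        fh.writelines(f"{k}={v}\n" for k, v in ids.items())
        fcntl.flock(fh, fcntl.LOCK_UN)
    return current
```

Because the lock is held for the whole read/rewrite, two processes asking for an ID at the same time cannot both receive the same value.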
Randomly generated ID#
If the user has not set the ID and no ID file is available, then the fallback is to use a randomly generated integer ID between 1 and 9999999999. Whilst the probability of an ID collision is low, it is not zero, therefore this method should not be relied upon if the IDs are critical to your application.
The <analysis>
/ <key_analysis>
elements#
The analysis elements are child elements of the study and they broadly represent a single phenotype of that study, or more concretely a single analysis unit. In a similar way to the GWAS study, GWAS analyses are represented by two different element types, <analysis>
and <key_analysis>
. The <analysis>
elements are always children of <study>
elements and the <key_analysis>
elements are always children of the <study_file>
elements.
Both <analysis>
and <key_analysis>
elements share some of their attributes in common:
- <analysis_name> - A unique name for the analysis. Ideally it should be unique over all of your data but the pipeline can only enforce unique names within a study. This is made lowercase and has spaces replaced with underscores. It ends up forming a component of the output file name. This element is required.
- <analysis_id> - A unique identifier for the analysis. Ideally it should be unique over all of your data but the pipeline can only enforce unique identifiers within a study. This element is optional and if not provided then an ID will be created; this is outlined in the ID section.
- <phenotype> - A structured description of the analysis phenotype. This is required and is described below.
- <caveat> - A structured description of the analysis caveats. A caveat is any information that will alter your interpretation of the genetic association or effect size/direction, such as co-morbidities or other study co-variates/stratification, for example a sex stratified GWAS. This is optional and the structure of this element is described below.
- <tests> - Any known associations that can be compared against the normalised association to determine if the effect direction is correct, for example, variants from a source publication table. This is described in detail below. Tests are optional.
- <info> - Descriptions of either input columns or static data elements that will be added to the info column in the normalised GWAS file for the rows from a single analysis. This is a similar idea to the info field of a VCF but a much more stripped down implementation. This is optional and described in detail below.
The <analysis>
element#
The <analysis>
element is always a child of a <study>
element. In addition to the common attributes defined above it also contains the attributes that are associated with files (at least in the gwas-norm implementation). These are:
- <analysis_type> - The type of the analysis. This can be one of eqtl, pqtl, mqtl, metabqtl, qtl, disease or trait; analysis types are described in detail in the analysis type section. This is required.
- <effect_type> - The effect type of the analysis. These are described in the effect type documentation. This is required.
- <units> - The units of the analysis. This is optional.
- cohort - A description of the cohort the study was performed in. This can be either <cohort>, <case_control_cohort> or <sample_cohort> (described below).
- <file> - One or more files that contain the analysis data. The data within each file should be grouped by analysis. This is required.
The <key_analysis>
element#
The <key_analysis>
element is always a child of a <study_file>
element. As a reminder a <study_file>
is a study where all the GWAS analyses are combined together, i.e. similar to GTEx data. Therefore, there needs to be a mechanism to identify which study file rows belong to each analysis. This is the main job of the <key_analysis>
. Other than defining the common attributes above, its job is to provide key values that define the rows belonging to each analysis in the study. The rows are defined based on data values in one or more key columns of the input file(s). The data values and the columns they reside in are defined within one or more <key>
elements in the <key_analysis>
. Each <key>
element has a <column>
element and a <value>
element that contain the key column and text value data respectively.
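For example, if an input file keyed its rows on a hypothetical gene_id column, the key for one analysis might be declared like this. The element names come from the text above; the analysis name, phenotype and gene identifier are made up:

```xml
<key_analysis>
  <analysis_name>gene abc expression</analysis_name>
  <phenotype>
    <reference_string>ABC expression</reference_string>
    <definition map_to="text">ABC gene expression</definition>
  </phenotype>
  <key>
    <column>gene_id</column>
    <value>ENSG00000000001</value>
  </key>
</key_analysis>
```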
The <file>
element#
The file element details the characteristics of an input file. It is present in both the <study_file>
and the <analysis>
elements. It has the same structure in both, although, if used within <study_file>
elements, the <file>
element should also contain <key>
elements. If these are present in files added to <analysis>
elements, then they are ignored.
A pivotal job of the file element is to define column mappings between the names of core columns expected in a GWAS summary stats file and their corresponding names in gwas-norm. The mappings between the input file columns and the standard gwas-norm columns are located within a <columns>
child element. These are defined below in the input columns section.
The other elements below the <file>
element are described below.
- <relative_path> - The relative path to the input file. See the file paths <file_dir_paths> sections.
- <md5_chksum> - The MD5 checksum for the input file.
- <comment_char> - Any comment characters that occur before the file header in the input file.
- <skiplines> - A fixed number of lines to skip at the start of the file.
- <pvalue_logged> - Whether the p-value in the input file is -log10 transformed. Use true or false values.
- <compression> - The compression of the input file. Allowed values are infer, none, gzip, bzip2, xz, lzma. Any bgzipped files should be set to gzip. The compression values can be inferred, with gzip and bzip2 being inferred from the file content and xz and lzma being inferred from the file extensions. The compression value is always checked and warnings are given if it differs from the value the user provides.
- <encoding> - The encoding of the input file. The default is utf-8. This is passed directly to the open method for the compression.
- <chrpos_spec> - A string indicating the order and type of fields that are defined in any chrpos columns. These can be chr_name, start_pos, end_pos, effect_allele, other_allele. These should be delimited with a pipe |. In addition to the field names, the symbols ^ and $ are used to define start and end anchors respectively. For example, you might have the chrpos value 1:123456_C/G; for this you would define a chrpos_spec of ^chr_name|start_pos|effect_allele|other_allele$. However, if you had leading or trailing characters, such as grch38_1:123456_C/G_rs123456, then you should use chr_name|start_pos|effect_allele|other_allele, without the anchoring characters.
- <has_header> - An indicator for whether the file has column headings. Should be true or false. If not provided then the default is true. If false then columns should be referred to by 0-based column numbers.
- <columns> - Column mappings between the input file column names (or numbers if <has_header> is false) and the gwas-norm column definitions. See input columns for more details.
- <info> - Descriptions of either input columns or static data elements that will be added to the info column in the normalised GWAS file for rows derived from the input file covered by the <file> element only. This is a similar idea to the info field of a VCF but a much more stripped down implementation. This is optional and described in detail below.
- <keys> - Any key columns in the input file. These columns should describe the sequence of key values in the data that are used to delineate rows into their respective analyses. Only used in files defined in <study_file> definitions. The key definitions should be one or more <column> elements below the <keys> element.
- <doublequote> - The Python csv keyword argument doublequote. See the csv package for more details.
- <escapechar> - The Python csv keyword argument escapechar. See the csv package for more details.
- <quotechar> - The Python csv keyword argument quotechar. See the csv package for more details.
- <quoting> - The Python csv keyword argument quoting. See the csv package for more details.
- <skipinitialspace> - The Python csv keyword argument skipinitialspace. See the csv package for more details.
- <strict> - The Python csv keyword argument strict. See the csv package for more details.
- <lineterminator> - The Python csv keyword argument lineterminator. See the csv package for more details.
- <delimiter> - The Python csv keyword argument delimiter. See the csv package for more details. Unlike the csv package, the delimiter defaults to a tab \t.
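For the fully anchored case, the chrpos_spec behaviour can be illustrated with the sketch below. It assumes the chrpos value is split on common delimiter characters and the tokens are assigned to the spec fields in order; the parse_chrpos function and the exact delimiter handling are assumptions, not the gwas-norm implementation.

```python
import re

def parse_chrpos(value, spec):
    """Assign the pieces of a combined chrpos value to the field names
    in an anchored chrpos_spec (illustrative sketch only)."""
    fields = spec.strip("^$").split("|")
    tokens = re.split(r"[:_/-]", value)  # split on :, _, / and -
    if len(tokens) != len(fields):
        raise ValueError("chrpos value does not match the spec")
    return dict(zip(fields, tokens))
```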
Currently, the only csv keyword arguments that are acted upon in the metadata are <delimiter>
and <lineterminator>
, though this will change in future. The gwas-norm pipeline will attempt to detect them using the csv.Sniffer. It will do this regardless of them being defined and provide warnings if the detected values differ from what the user provides.
The values for the <delimiter>
and <lineterminator>
should be XML escaped values.
<file>
<relative_path>study1/my_gwas.txt.bz2</relative_path>
<md5_chksum>4f8fe1a3a1e3c1d8a1b4e4dba110dbbc</md5_chksum>
<compression>infer</compression>
<encoding>utf-8</encoding>
<pvalue_logged>false</pvalue_logged>
<has_header>true</has_header>
<delimiter>\t</delimiter>
<info>
<definition map_to="filename">my_gwas.txt.bz2</definition>
</info>
<columns>
<chr_name>CHOMO</chr_name>
<start_pos>POS</start_pos>
<other_allele>A2</other_allele>
<effect_allele>A1</effect_allele>
<minor_allele>MINOR_ALLELE</minor_allele>
<minor_allele_freq>MAF</minor_allele_freq>
<effect_size>BETA</effect_size>
<standard_error>SE</standard_error>
<pvalue>PVALUE</pvalue>
<number_of_samples>N</number_of_samples>
</columns>
</file>
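The <md5_chksum> value can be produced with the standard md5sum command line tool, or in Python with hashlib; the helper below is a generic sketch, not part of gwas-norm.

```python
import hashlib

def md5_checksum(path, block_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading 1 MiB at a time so
    that large GWAS files are not loaded into memory in one go."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()
```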
The <phenotype>
element#
The phenotype definition should be defined once in the <analysis>
or <key_analysis>
elements. This provides the opportunity to precisely define a phenotype using an arbitrary structure and also a simple text string that will be used directly in the normalised files.
A minimal phenotype will contain the following:
- <reference_string> - A text definition of the phenotype that will be used in the normalised file.
- <definition> - A formal phenotype definition that will reside in the metadata XML file.
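A minimal phenotype therefore looks like the sketch below (the trait name here is hypothetical):

```xml
<phenotype>
  <reference_string>type 2 diabetes</reference_string>
  <definition map_to="text">Type 2 diabetes</definition>
</phenotype>
```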
However, it is possible to place more complicated definitions in place of the single <definition>
element. So, for example, you could define an (incomplete) cardiovascular disease phenotype something like this.
<phenotype>
<reference_string>CVD</reference_string>
<or>
<synonym>
<definition map_to="read2">G30..14</definition>
<definition map_to="read2">G30..00</definition>
<definition map_to="ICD10">I21</definition>
<definition map_to="text">Acute Myocardial Infarction</definition>
</synonym>
<definition map_to="read2">G30..15</definition>
<definition map_to="read2">G30..11</definition>
<synonym>
<definition map_to="read2">G66..12</definition>
<definition map_to="ICD10">I64</definition>
<definition map_to="text">Stroke</definition>
</synonym>
</or>
</phenotype>
In addition to <or>
elements, there are also <and>
elements. The rules are that the <phenotype>
will only contain a single direct child element in addition to the <reference_string>
element. That can be a <definition>
, <synonym>
, <or>
or <and>
element. The map_to
attribute indicates the type of the definition value. Please see the definition documentation for the attributes of <definition>
tags.
The <caveat>
element#
A caveat is defined as anything that alters the inference of the effect size. This works in exactly the same way as the <phenotype>
definition above. However, whereas phenotypes are required, caveats are optional. Although, if a caveat is defined, then the <reference_string>
must be defined.
The <definition>
element#
Definition elements can occur in several places throughout the XML metadata. Their job is to hold static text definitions. A definition can have three different attributes.
- info - A flag indicating if the definition should be carried over into the info column of the final normalised file. The allowed values are true or false. If not defined then the default is false. If true, then it is a good idea to set the map_to attribute (see below), as this controls the info field name.
- map_to - The type of the definition value. This is also used to define the field name in an info field if info="true". If not defined then the default value of text is used.
- dtype - This defines the data type and data structure of any info fields. Allowed values for the data type are I for integer, F for float and S for string. Allowed values for the data structure are A for array and C for scalar (i.e. a single value). To give an example, the following definition defines a read2 code to be added to the info field: <definition info="true" map_to="read2" dtype="SC">G66..12</definition>. For more information on definitions and adding them to the info column see the info fields section.
The <column>
element#
The <column>
elements are used in several places within the XML metadata. Their job is to allow the user to describe column names that are required by the pipeline, either as key columns, as mapping columns in the <file>
element or within <info>
fields.
The column elements can accept the same attributes as the definition elements, namely, info
, map_to
and dtype
and they work in the same way. Similarly, the column mapping definitions in the <file>
elements behave like column elements even though they have other names, such as <pvalue>
.
For more information on using columns to add data to the info column, see the info fields section.
The <info>
element and info attributes#
The normalised GWAS files contain an info column. This is conceptually similar to the INFO
field in a VCF file.
In order to define the data that goes into the info column, the XML metadata provides <info>
elements that can be added to the study, analysis or file elements. These can contain either file columns that provide data values into the info field or static data definitions that are carried over to the info field.
When a definition or a column is used with the <info>
element, the value of the info attribute within the column/definition is always taken as true
, so it does not need to be set explicitly.
When the pipeline is run, all of the explicit <info>
elements that have been defined and any info="true"
attributes that have been defined in the various columns/definitions are evaluated in an InfoHandler object. This will make sure that the various definitions are compatible. For example, you may have mapped several columns and definitions to a single info field. If the data structure has not been set to array, then a warning is issued and they will be implicitly set as arrays. Similarly, the data types are compared: if some are set to integers and some to floats, then they will all be set to floats; if there are mixed types between floats and strings in an array, then they are all set to strings.
To give a further example, here is an excerpt from an XML file that incorporates our CVD phenotype shown in the phenotype section, only this time the definitions have been set to be included in the info field. There is also a separate analysis-level info field, and one of the column definitions for the file will also be included in the final info field. XML comments represent abridged sections of the XML file:
<gwas_data>
<study>
<!-- other study elements -->
<analysis>
<!-- other analysis elements -->
<file>
<columns>
<!-- other column mappings -->
<imputation_info info="true" map_to="info_score" dtype="CF">
info
</imputation_info>
</columns>
</file>
<info>
<column map_to="text" dtype="SA">gwas_phenotype</column>
<column map_to="flag" dtype="IC">bad_variant</column>
</info>
<phenotype>
<reference_string>CVD</reference_string>
<or>
<synonym>
<definition info="true" map_to="read2">G30..14</definition>
<definition info="true" map_to="read2">G30..00</definition>
<definition info="true" map_to="ICD10">I21</definition>
<definition info="true" map_to="text">
Acute Myocardial Infarction
</definition>
</synonym>
<definition info="true" map_to="read2">G30..15</definition>
<definition info="true" map_to="read2">G30..11</definition>
<synonym>
<definition info="true" map_to="read2">G66..12</definition>
<definition info="true" map_to="ICD10">I64</definition>
<definition info="true" map_to="text">Stroke</definition>
</synonym>
</or>
</phenotype>
</analysis>
</study>
</gwas_data>
Ideally, the phenotype definitions mapping to read2
, ICD10
and text
should have a dtype="SA"
, however, they have been omitted, therefore they will be implied by the fact that multiple columns and definitions are mapping to a single info field, so a warning will be issued. Let's suppose that a single row has the values 0.8, “cardiovascular disease”, 1 for the info
, gwas_phenotype
and bad_variant
columns respectively, then the final info column structure for these will look like:
flag=1.0;ICD10=["I21"|"I64"];info_score=0.8;read2=["G30..14"|"G30..00"|"G30..15"|"G30..11"|"G66..12"];text=["cardiovascular disease"|"Acute Myocardial Infarction"|"Stroke"];
Obviously, these info fields can increase your file size quite a bit, so keep that in mind, but they are a mechanism to allow the inclusion of non-core data fields into the final normalised data in a structured way.
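Based purely on the layout of the example above (semicolon separated name=value fields, array values in [..] delimited with pipes, strings double quoted), an info column entry can be unpacked with a sketch like the following; the real quoting and typing rules used by gwas-norm may differ.

```python
def parse_info(info):
    """Split an info column entry into a dict, turning [..] values into
    lists and stripping double quotes (illustrative sketch only)."""
    fields = {}
    for part in info.rstrip(";").split(";"):
        name, _, value = part.partition("=")
        if value.startswith("[") and value.endswith("]"):
            fields[name] = [v.strip('"') for v in value[1:-1].split("|")]
        else:
            fields[name] = value.strip('"')
    return fields
```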
The cohort elements#
The cohort element is the primary way of defining all of the population groups that contributed to the study or analysis. Therefore, a cohort can be considered to be a collection of populations that share the characteristic under study. The cohort is optional, but if used then there should be a single cohort element in each <analysis>
or <study_file>
.
The metadata provides several different ways to define cohorts and the ones to use depend upon how much is known about the population groups and the sample sizes that make up the study or analysis. The options are listed below. All of the various cohort element options can be given an optional overall text name and can accept one or more population elements, but the types population elements that can be used will depend on the cohort type used.
<cohort>#
If the individual sub-population groups that make up a cohort have a known sample size, then a <cohort> element should be used. Here the cohort acts as a simple container for the individual population groups. A <cohort> can accept one or more <population>, <case_control_population> or <sample_population> elements. The distinctions between these are outlined in the population elements section. The only stipulation is that <case_control_population> elements and <sample_population> elements cannot be mixed.
<cohort>
<name>main cohort</name>
<population>
<!-- population definition here -->
</population>
<sample_population>
<!-- sample population definition here -->
</sample_population>
</cohort>
<case_control_cohort>#
This can be used if very little is known about the sample sizes of the individual population groups that make up a cohort, but the overall numbers of cases and controls are known. Within a <case_control_cohort> element, an <n_cases> and an <n_controls> element should be defined, in addition to one or more <population> elements. <sample_population> and <case_control_population> elements are not allowed, since these are used to define the numbers in the individual population groups that make up a cohort.
<case_control_cohort>
<name>CC cohort</name>
<n_cases>10000</n_cases>
<n_controls>100000</n_controls>
<population>
<!-- population definition here -->
</population>
<population>
<!-- population definition here -->
</population>
</case_control_cohort>
<sample_cohort>#
This can be used if very little is known about the sample sizes of the individual population groups that make up a cohort, but the overall number of samples in the cohort is known. Within a <sample_cohort> element, an <n_samples> element should be defined, in addition to one or more <population> elements. <sample_population> and <case_control_population> elements are not allowed, since these are used to define the numbers in the individual population groups that make up a cohort.
<sample_cohort>
<name>total sample cohort</name>
<n_samples>50000</n_samples>
<population>
<!-- population definition here -->
</population>
<!-- more population definitions as needed -->
</sample_cohort>
The population elements#
All the cohort elements contain one or more population elements; the types of population element that can be assigned to each cohort element are detailed in the cohort section. The primary job of a population element is to define the reference population groups that can be used to represent the population, either for LD or for allele frequency calculations. In addition, where known, the population elements also define the number of samples or cases and controls.
There are three types of population elements that can be defined: <population>, <case_control_population> and <sample_population>. Regardless of type, all of these can define one or more LD reference populations (<ld_ref>) and allele frequency populations (<allele_freq_ref>). LD and allele frequency references are treated separately as there is often a greater choice of populations to use for an allele frequency reference than for an LD reference, where the population groups available might not be optimal. The structure of the <ld_ref> and <allele_freq_ref> elements is outlined below.
As with the cohorts, all of the population elements can define a name, with the sample and case-control populations also able to describe the number of samples or cases/controls, as shown below:
<population>
<name>african</name>
<!-- One or more LD reference elements <ld_ref> -->
<!-- One or more allele frequency reference elements <allele_freq_ref> -->
</population>
<case_control_population>
<name>east asian</name>
<n_cases>4000</n_cases>
<n_controls>100000</n_controls>
<!-- One or more LD reference elements <ld_ref> -->
<!-- One or more allele frequency reference elements <allele_freq_ref> -->
</case_control_population>
<sample_population>
<name>european</name>
<n_samples>400000</n_samples>
<!-- One or more LD reference elements <ld_ref> -->
<!-- One or more allele frequency reference elements <allele_freq_ref> -->
</sample_population>
LD references and allele frequency references#
The LD/allele frequency reference panels suitable for the GWAS are handled by one or more <ld_ref> and <allele_freq_ref> elements within the population element.
The data within the <ld_ref> element is not actually used in the final normalised file, but it exists as a record of the best populations to use for proxy LD measures for the GWAS. The reference populations you define here will depend on the data available to you. Whilst the name can be anything, the original intention was to use names that align with reference populations in the genomic-config file; if you are not using that, you can use whatever is relevant to your projects.
The allele frequency reference is used to calculate allele frequencies for the variants in the normalised GWAS files if they do not already have allele frequencies defined in the input files. The reference population definitions should match those available in the mapping file used for the mapping stage of the GWAS normalisation.
Both types can hold the same elements, which are outlined below:
<name> - A text name for the LD/allele frequency reference.
<weight> - A float giving the weight that the reference population contributes to the overall population. The weights from the individual <allele_freq_ref> elements defined in the population should sum to one, as should those defined for the <ld_ref> elements.
<ref_pop> - One or more reference population names that, in the case of <allele_freq_ref> elements, will be found in the mapping file. If there is more than one <ref_pop> element, they are treated hierarchically in the order they are defined: if the first <ref_pop> does not have an allele count defined in the mapping file, the second one is used, and so on. This is to allow for different coverage by different reference populations. If there are no redundant populations, as may be the case for some less studied population groups, you may only be able to supply a single <ref_pop> here.
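As the weight constraint is easy to get wrong when hand-crafting the XML, it can be checked programmatically. The sketch below uses only the Python standard library and the element names shown in this document; it is not part of gwas-norm.

```python
import xml.etree.ElementTree as ET

def check_weights(population_xml, tag="allele_freq_ref", tol=1e-6):
    """Return True if the <weight> values of the given reference
    elements within a population element sum to one."""
    root = ET.fromstring(population_xml)
    total = sum(float(e.findtext("weight")) for e in root.iter(tag))
    return abs(total - 1.0) <= tol

pop = """
<population>
<name>mixed</name>
<allele_freq_ref><name>EUR</name><weight>0.8</weight>
<ref_pop>UKBB_EUR</ref_pop></allele_freq_ref>
<allele_freq_ref><name>AFR</name><weight>0.2</weight>
<ref_pop>1KG_AFR</ref_pop></allele_freq_ref>
</population>
"""
print(check_weights(pop))  # prints: True
```

The same check can be run with tag="ld_ref" for the LD references.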
The XML excerpt below defines both LD references (<ld_ref>) and allele frequency references (<allele_freq_ref>). For each, two populations are defined: a European one weighted at 0.8 and an African one weighted at 0.2. One might imagine that this represents a multi-ancestry meta-analysis of 80% Europeans and 20% Africans.
For the allele frequency reference, in the European population the UK Biobank allele frequencies will be used if available; if not, it will fall back to the ALFA European estimates, followed by the various gnomAD estimates and finally 1000 Genomes. An analogous hierarchy exists for the Africans.
<ld_ref>
<name>EUR</name>
<weight>0.8</weight>
<ref_pop>UKBB_EUR</ref_pop>
<ref_pop>1KG_EUR</ref_pop>
</ld_ref>
<ld_ref>
<name>AFR</name>
<weight>0.2</weight>
<ref_pop>1KG_AFR</ref_pop>
</ld_ref>
<allele_freq_ref>
<name>EUR</name>
<weight>0.8</weight>
<ref_pop>UKBB_EUR</ref_pop>
<ref_pop>ALFA_EUR</ref_pop>
<ref_pop>GNOMAD31_EUR</ref_pop>
<ref_pop>GNOMAD2EX_EUR</ref_pop>
<ref_pop>1KG_EUR</ref_pop>
</allele_freq_ref>
<allele_freq_ref>
<name>AFR</name>
<weight>0.2</weight>
<ref_pop>GNOMAD31_AFR</ref_pop>
<ref_pop>GNOMAD2EX_AFR</ref_pop>
<ref_pop>ALFA_AFR</ref_pop>
<ref_pop>1KG_AFR</ref_pop>
</allele_freq_ref>
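The hierarchical fallback described above can be sketched as follows: given the ordered <ref_pop> names and the reference populations that actually have data for a variant in the mapping file, the first population with data wins. The frequencies below are invented purely for illustration.

```python
def pick_freq(ref_pops, available):
    """Return (ref_pop, freq) for the first reference population in
    the hierarchy that has an allele frequency, or None if none do."""
    for pop in ref_pops:
        if pop in available:
            return pop, available[pop]
    return None

# The EUR hierarchy from the example above
eur_order = ["UKBB_EUR", "ALFA_EUR", "GNOMAD31_EUR",
             "GNOMAD2EX_EUR", "1KG_EUR"]
# Suppose the mapping file has no UKBB or ALFA entry for this variant
available = {"GNOMAD31_EUR": 0.12, "1KG_EUR": 0.11}
print(pick_freq(eur_order, available))  # prints: ('GNOMAD31_EUR', 0.12)
```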
The <test> element#
The XML file can describe tests that should be carried out while the pipeline is normalising the data. These are located within <test></test> tags and compare expected values for attributes against observed values after normalisation. The attributes tested are:
<chr_name> - The chromosome name, tested for equality.
<start_pos> - The start position, tested for equality.
<effect_allele> - The effect allele, tested for equality.
<other_allele> - The non-effect allele, tested for equality.
<effect_size> - The effect point estimate, tested for values within 0.0001 (this default can be changed). The sign of the effect size is also tested for equality.
<standard_error> - The standard error, tested for values within 0.0001 (this default can be changed).
<pvalue> - The p-value; here the -log10(p-values) are tested for values within 0.2 (this default can be changed). If the p-value is already logged then you can set the <pvalue_logged> element to true.
<var_id> - The variant identifier, tested for equality.
A good source of expected values are the summary tables in published manuscripts. The pipeline will not fail if any of the tests fail; instead, a test report file is generated indicating the observed and expected values along with their differences. Note that if you are using published summary tables as tests there could be some differences. For example, a study might report combined estimates from discovery+replication but only make the discovery estimates available, so it is left up to the user to decide if they are happy with the result. However, the key attributes to check are the effect/other alleles and the effect sign, as these will indicate if there has been any mis-specification of the effect allele, which for some studies is not entirely clear.
The tests are also lifted over to the target genome assemblies used, so you will get a test report for each target genome assembly used.
<test>
<chr_name>10</chr_name>
<start_pos>662</start_pos>
<effect_type>log_or</effect_type>
<effect_size>-0.564</effect_size>
<effect_allele>A</effect_allele>
<other_allele>G</other_allele>
<standard_error>0.7534</standard_error>
<var_id>rs1256784</var_id>
<pvalue>7.301029995663981</pvalue>
<pvalue_logged>true</pvalue_logged>
</test>
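The comparison rules above can be sketched as follows. The tolerances mirror the defaults stated in the text, but the function name and dictionary layout are illustrative rather than part of the pipeline, and p-values are assumed to be unlogged here.

```python
import math

def compare_test(expected, observed,
                 effect_tol=0.0001, se_tol=0.0001, log10p_tol=0.2):
    """Compare expected vs. observed values for one variant after
    normalisation, returning a dict of attribute -> True/False."""
    results = {}
    # Attributes tested for simple equality
    for key in ("chr_name", "start_pos", "effect_allele",
                "other_allele", "var_id"):
        results[key] = expected[key] == observed[key]
    # Effect size: within tolerance AND the same sign
    diff_ok = abs(expected["effect_size"] - observed["effect_size"]) <= effect_tol
    sign_ok = (expected["effect_size"] >= 0) == (observed["effect_size"] >= 0)
    results["effect_size"] = diff_ok and sign_ok
    results["standard_error"] = (
        abs(expected["standard_error"] - observed["standard_error"]) <= se_tol)
    # P-values are compared on the -log10 scale
    results["pvalue"] = (
        abs(math.log10(expected["pvalue"]) - math.log10(observed["pvalue"]))
        <= log10p_tol)
    return results

expected = {"chr_name": "10", "start_pos": 662, "effect_allele": "A",
            "other_allele": "G", "var_id": "rs1256784",
            "effect_size": -0.564, "standard_error": 0.7534, "pvalue": 5e-8}
observed = dict(expected, effect_size=-0.56405, pvalue=6e-8)
print(all(compare_test(expected, observed).values()))  # prints: True
```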
Example XML files#
TBC…