`gwas_norm` package#

`gwas_norm.gwas_norm`#

`gwas_norm.processors`#

`gwas_norm.config`#

Handle all interaction with GwasNorm configuration options

gwas_norm.config.ASSEMBLY_PREFIX = 'assembly'#: The name of the genome assembly section in the configuration file (str)

gwas_norm.config.ASSEMBLY_SYN_SECTION = 'assembly.synonyms'#: The name of the assembly synonyms section in the gwas-norm config file (str)

gwas_norm.config.CHAIN_FILE_PREFIX = 'chain_files'#: The name of the chain file section in the configuration file (str)

gwas_norm.config.CONFIG_DEFAULT_NAME = '.gwas_norm.cnf'#: The default name for a gwas-norm configuration file (str)

gwas_norm.config.CONFIG_ENV = 'GWAS_NORM_CONFIG'#: The name of the shell environment variable that, if present, stores the location of the gwas-norm configuration file (str)

gwas_norm.config.DETAIL_DELIMITER = '.'#: The delimiter that separates detail fields in the gwas-norm configuration section names (str)

class gwas_norm.config.GwasNormConfig(config_file=None)#

Bases: object

Handles interaction with the gwas-norm configuration options (reading only)

Parameters:

config_file (str or NoneType, optional, default: NoneType) – A configuration file location. If this is provided, it is simply checked for existence and returned. If it is not provided, the default locations are checked and the located file is returned.

Raises:

FileNotFoundError – If a location for the default config file is not found
PermissionsError – If a location for the default config file can’t be read

as_dict()#

Return the configuration file as a nested dictionary. The returned dict is a deep copy of the actual configuration dict.

Returns:: config_dict – A dict of all the configuration parsed from the config file
Return type:: dict

get_chain_file(source_assembly, target_assembly)#

Return the path to the chain file that maps the source_assembly to the target_assembly.

Parameters:

source_assembly (str) – The source assembly name. This will be normalised and searched in the config file for a path to the target_assembly (which is also normalised).
target_assembly (str) – The target assembly name. This will be normalised and searched in the config file under all the paths associated with the source_assembly.

Returns:

chain_file – The path to the chain file that maps from the source assembly to the target assembly.

Return type:

str

Raises:

KeyError – If there are no chain files that map from the source assembly to the target assembly

classmethod get_defaults()#

Get an empty dictionary of configuration file sections.

Returns:: blank_config – The keys are the parsed config file sections, the values are empty dictionaries where parsed content can be placed.
Return type:: dict [str, dict]

get_mapping_file(assembly)#

Return the path to the mapping file that maps the assembly.

Parameters:: assembly (str) – The assembly name. This will be normalised and searched for a mapping file associated with it.
Returns:: mapping_file – The path to the mapping file.
Return type:: str
Raises:: KeyError – If there are no mapping files that are associated with the assembly

get_norm_assembly_name(assembly_name)#

When given an assembly name get the standardised version used in the config file.

Parameters:: assembly_name (str) – The assembly name to get the standard version of.
Returns:: norm_assembly – The standardised assembly name
Return type:: str
Raises:: KeyError – If the assembly name is not recognised

get_ref_assembly(assembly, species)#

Return the path to the reference assembly associated with the assembly.

Parameters:

assembly (str) – The assembly name. This will be normalised and searched for a reference assembly associated with it.
species (str) – The species for the assembly

Returns:

reference_assembly – The path to the reference assembly.

Return type:

str

Raises:

KeyError – If there is no reference assembly that is associated with the assembly

property name#: Return the file name of the config file

gwas_norm.config.MAPPING_FILE_SECTION = 'mapping_files'#: The name of the mapping files section in the gwas-norm config file (str)

gwas_norm.config.get_config_file(config_file=None)#

Attempt to locate and return the gwas-norm config file location if it has been defined either in the arguments or the environment.

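Parameters:

config_file (str or NoneType, optional, default: NoneType) – A configuration file location. If provided, it is checked for existence and returned.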

Returns:

config_path – The absolute path to the default config file, if found.

Return type:

str

Raises:

FileNotFoundError – If a location for the default config file is not found
PermissionsError – If a location for the default config file can’t be read

Notes

The order of return is as follows. If a config_file is provided, then it is returned (provided it exists). If not, then the environment variable named by CONFIG_ENV (GWAS_NORM_CONFIG) is checked to see if it is defined. If so, then it is returned if it exists. Finally, the root of the HOME environment variable is checked for a file named by CONFIG_DEFAULT_NAME (.gwas_norm.cnf). If that exists it is returned. If any of the defined paths do not exist then the relevant FileNotFoundError will be raised.
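The lookup order in the Notes can be sketched as follows. This is a minimal illustration and not the package's own code: `find_config_file` and its `environ` argument are hypothetical names, the environment variable and default file name are assumed to be the values of `CONFIG_ENV` and `CONFIG_DEFAULT_NAME` listed earlier, and the read-permission check is omitted.

```python
import os

# Assumed to match the module constants documented above.
CONFIG_ENV = "GWAS_NORM_CONFIG"
CONFIG_DEFAULT_NAME = ".gwas_norm.cnf"


def find_config_file(config_file=None, environ=None):
    """Hypothetical sketch of the documented lookup order."""
    environ = os.environ if environ is None else environ

    if config_file is not None:
        # 1. An explicitly supplied path takes priority.
        candidate = config_file
    elif CONFIG_ENV in environ:
        # 2. Otherwise use the environment variable, if it is defined.
        candidate = environ[CONFIG_ENV]
    else:
        # 3. Finally fall back to the default file in the HOME directory.
        candidate = os.path.join(environ.get("HOME", ""), CONFIG_DEFAULT_NAME)

    candidate = os.path.abspath(os.path.expanduser(candidate))
    if not os.path.isfile(candidate):
        raise FileNotFoundError(f"config file not found: {candidate}")
    return candidate
```

Passing `environ` explicitly is only a convenience for testing the sketch; it defaults to `os.environ`.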

`gwas_norm.common`#

class gwas_norm.common.ChrPosSpec(spec_columns, start_anchor, end_anchor)#

Bases: tuple

end_anchor#: Alias for field number 2

spec_columns#: Alias for field number 0

start_anchor#: Alias for field number 1

class gwas_norm.common.Msg(file=<_io.TextIOWrapper name='<stderr>' mode='w' encoding='utf-8'>, verbose=True, prefix=None)#

Bases: object

A class for outputting information based on verbosity

Parameters:

file (file-like, optional, default: sys.stderr) – The output location for the message, defaults to STDERR.
verbose (bool, optional, default: True) – Should messages be output.
prefix (str or NoneType, optional, default: NoneType) – Should messages be prefixed with some text.

msg_args(args, **kwargs)#

Output the values of command line arguments based on verbosity.

Parameters:

args (argparse.Namespace) – The arguments parsed out of the argument parser.
**kwargs – Keyword arguments to gwas_norm.common.Msg.msg

msg_prog(prog_name, package, version)#

Output the program name according to verbosity.

Parameters:

prog_name (str) – The program name to output.
package (str) – The package the program is within.
version (str) – The version number of the package the program is in.

set_file(file)#

Set the output location.

Parameters:: file (file-like) – The output location for the message, defaults to STDERR.

set_verbose(verbose)#

Set the verbosity.

Parameters:: verbose (bool) – Should messages be output.

gwas_norm.common.add_column_name(existing_header, column_name)#

Add a column to an existing header. The column is appended to the end of the header. This function ensures that the column_name is unique within the header. This is achieved by appending an integer suffix to the end of the column name until it is unique within the header

Parameters:

existing_header (list of str) – The header to add the column name to
column_name (str) – The ideal column name to add to the header. This will be appended with a suffix should column_name already exist in the header

Returns:

column_name – The final column name added to the header, may not be the same as what was passed to the function. The header list is not returned as the addition of the column name to the header happens in place.

Return type:

str
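The suffixing behaviour described for `add_column_name` can be sketched as below. The function name `add_unique_column` is hypothetical, and the exact suffix format (an underscore followed by a counter starting at 1) is an assumption, since the documentation only states that an integer suffix is appended.

```python
def add_unique_column(existing_header, column_name):
    """Append a unique column name to the header in place and return it."""
    candidate = column_name
    suffix = 1
    # Keep trying new integer suffixes until the name is not in the header.
    while candidate in existing_header:
        candidate = f"{column_name}_{suffix}"  # suffix format is an assumption
        suffix += 1
    existing_header.append(candidate)
    return candidate


header = ["chr_name", "start_pos"]
added = add_unique_column(header, "start_pos")
# The header list itself has been modified; only the final name is returned.
```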

gwas_norm.common.bsd_chksum_file(infile, chunksize=4096)#

Implement a simple BSD checksum of a file. This is the same as the UNIX sum program.

Parameters:: infile (str) – The input file to generate a checksum for. Note that this is opened in bytes mode.
Returns:: sum – A 5 character (castable to integer) BSD checksum. It will be 0 padded if needed
Return type:: str

gwas_norm.common.bsd_chksum_str(instr)#

Implement a simple BSD checksum of a string. This is the same as the UNIX sum program.

Parameters:: instr (str) – The input string to generate a checksum for. Note that this is converted to bytes internally
Returns:: sum – A 5 character (castable to integer) BSD checksum. It will be 0 padded if needed
Return type:: str
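For reference, the standard BSD checksum is a 16-bit sum in which the running total is rotated right by one bit before each byte is added. The sketch below shows that algorithm over an iterable of byte chunks, so it covers both the whole-string and the chunked-file case. The name `bsd_checksum` is hypothetical and this is not necessarily how the package implements it.

```python
def bsd_checksum(chunks):
    """Return the 16-bit BSD checksum of an iterable of bytes objects.

    The result is a zero-padded, 5 character decimal string.
    """
    checksum = 0
    for chunk in chunks:
        for byte in chunk:
            # Rotate the 16-bit running total right by one bit ...
            checksum = (checksum >> 1) + ((checksum & 1) << 15)
            # ... then add the next byte and keep only the low 16 bits.
            checksum = (checksum + byte) & 0xFFFF
    return f"{checksum:05d}"


# A string is encoded to bytes first; a file would be read in fixed-size chunks.
whole = bsd_checksum(["ab".encode("utf-8")])
chunked = bsd_checksum([b"a", b"b"])
```

Because the state carried between bytes is just the running total, splitting the input into chunks does not change the result.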

gwas_norm.common.check_abs_path(path, message, root)#

Check that the path is only an absolute path when the root is not set.

Parameters:

path (str) – The path to check.
message (str) – The name of the path; this will be used in any error message.
root (str or NoneType) – The root path to join to path if it is relative and root is not NoneType

Returns:

path – The checked absolute path. Paths starting with ~/, ../ or ./ are expanded and so count as absolute paths.

Return type:

str

Raises:

FileNotFoundError – If the path is relative and root is NoneType.

gwas_norm.common.check_analysis_type(analysis_type)#

Make sure the analysis_type is lowercase and one of the allowed analysis types: eqtl, sqtl, mqtl, metabqtl, trait, disease, pqtl.

Parameters:: analysis_type (str) – The analysis type to test
Returns:: analysis_type – The correct analysis type which will be lower case
Return type:: str
Raises:: ValueError – If the analysis_type is not one of: eqtl, sqtl, mqtl, metabqtl, trait, disease, pqtl

gwas_norm.common.check_effect_type(effect_type)#

Check that the effect_type is valid; this is one of: or, log_or, beta, and lowercase.

Parameters:: effect_type (str) – The effect type to test
Returns:: effect_type – The correct effect type which will be lower case
Return type:: str
Raises:: ValueError – If the effect_type is not one of: or, log_or, beta

gwas_norm.common.check_parent(obj)#

A helper function that checks to see if the object has a parent object. If not, it will raise an AttributeError. This happens if there is no parent attribute, or if there is a parent attribute and it is NoneType.

Parameters:: obj (Any) – Any object potentially with a parent attribute
Raises:: AttributeError – If there is no parent attribute or the parent attribute is NoneType

gwas_norm.common.compress_file(infile, chunksize=4096)#: GZIP compress a file

gwas_norm.common.convert(character)#

Convert raw string un-printables to printables and vice versa.

Parameters:: character (str) – Either a printable raw string or an unprintable \n, \t, \s.
Returns:: character – The printable or unprintable opposite of the character passed to the function.
Return type:: str

gwas_norm.common.count_lines(file_name, gzipped=False)#

Count the lines in a file

Parameters:

file_name (str) – The file name to open and count
gzipped (bool, optional, default: False) – Is the file to count compressed? If so, it will be opened with gzip.open and not open

Returns:

line_count – The number of lines in the file

Return type:

int

gwas_norm.common.create_chrpos_spec_str(chrpos_spec)#

Convert the chrpos spec named tuple into a string

Parameters:: chrpos_spec (ChrPosSpec) – A ChrPosSpec named tuple
Returns:: chrpos_spec – The chrpos spec as a string
Return type:: str

gwas_norm.common.create_uni_id(chr_name, start_pos, effect_allele, other_allele)#

Create a universal identifier based on coordinates and alleles.

Parameters:

chr_name (str) – The chromosome name.
start_pos (int) – The start position in base pairs.
effect_allele (str) – The effect allele.
other_allele (str) – The non-effect allele.

Returns:

uni_id – The universal identifier. This is the: chr_start_<alleles in sort order>, where the alleles are also separated by an underscore.

Return type:

str
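The identifier format described above can be sketched as follows. The name `make_uni_id` is hypothetical, and a plain lexicographic sort with no case normalisation is assumed; the package may treat alleles differently.

```python
def make_uni_id(chr_name, start_pos, effect_allele, other_allele):
    """Build chr_start_<alleles in sort order>, all joined by underscores."""
    # Sorting makes the ID independent of which allele is the effect allele.
    alleles = sorted([effect_allele, other_allele])
    return "_".join([str(chr_name), str(start_pos), *alleles])


uni_id = make_uni_id("1", 12345, "T", "A")
flipped = make_uni_id("1", 12345, "A", "T")
```

Sorting the alleles means the same variant gets the same identifier regardless of how a study has assigned the effect allele.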

gwas_norm.common.error_on_empty(value, value_type='value')#

Raise an error if a value is an empty string ‘’ or all spaces or NoneType or an empty list [].

Parameters:

value (Any) – The value to test
value_type (str, optional, default: 'value') – The name of type of the value, this is used in any error message raised if the value is empty

Returns:

value – The value is passed through if not empty

Return type:

Any defined value

Raises:

ValueError – If the value is an empty string ‘’ or all spaces or NoneType or an empty list []
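A minimal sketch of the emptiness rules listed above, using the hypothetical name `require_value`. The error message wording is an assumption.

```python
def require_value(value, value_type="value"):
    """Return value unchanged, or raise ValueError if it counts as empty."""
    is_blank_str = isinstance(value, str) and value.strip() == ""
    is_empty_list = isinstance(value, list) and len(value) == 0
    if value is None or is_blank_str or is_empty_list:
        # value_type only appears in the error message.
        raise ValueError(f"{value_type} is empty")
    return value
```

Note that under these rules values such as `0` or `False` are not treated as empty, which is why the check is spelled out rather than relying on truthiness.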

gwas_norm.common.expand_relative_path(path)#

This checks if a path is a relative path, that is, one that starts with ~/, ../ or ./. If so, then it is expanded.

Absolute paths and relative paths without leading relative symbols are NOT expanded (i.e. basenames or relative_dir/basename).

Parameters:: path (str) – A relative or absolute path or a basename.
Returns:: path – An absolute path or a basename.
Return type:: str
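The rule that only paths with a leading relative symbol are expanded can be sketched as below. The name `expand_if_relative` is hypothetical, and the use of `os.path.expanduser` followed by `os.path.abspath` is an assumption about how the expansion is performed.

```python
import os

# Only paths starting with one of these prefixes are expanded.
RELATIVE_PREFIXES = ("~/", "../", "./")


def expand_if_relative(path):
    """Expand ~/, ../ and ./ paths; leave everything else untouched."""
    if path.startswith(RELATIVE_PREFIXES):
        return os.path.abspath(os.path.expanduser(path))
    return path


unchanged = expand_if_relative("relative_dir/basename")
expanded = expand_if_relative("./basename")
```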

gwas_norm.common.get_column_name(existing_header, column_name)#

Add a column to an existing header. The column is appended to the end of the header. This function ensures that the column_name is unique within the header. This is achieved by appending an integer suffix to the end of the column name until it is unique within the header.

Parameters:

existing_header (list of str) – The header to add the column name to
column_name (str) – The ideal column name to add to the header. This will be appended with a suffix should column_name already exist in the header

Returns:

Return type:

str

gwas_norm.common.get_file_name(analysis, genome_assembly, working_dir='')#: Returns the file name for a final sorted file

gwas_norm.common.get_old_analysis_id(study_obj, analysis_obj)#: Return an auto generated analysis ID for a study/analysis pairing.

gwas_norm.common.get_open_method(infile, compression)#

Get the python file opening method based on the compression value.

Notes

Supported formats are no compression, infer, gzip, bz2, xz or lzma.

Raises:: ValueError – If the compression format can’t be determined.

gwas_norm.common.get_tmp_file(**kwargs)#

Initialise a temp file to work with. This differs from tempfile.mkstemp as the temp file is closed and only the file name is returned.

Parameters:: **kwargs – Any arguments usually passed to tempfile.mkstemp
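The difference from `tempfile.mkstemp` can be illustrated with a short sketch. The name `make_tmp_file` is hypothetical; `tempfile.mkstemp` returns an open OS-level file descriptor together with the path, so the descriptor is closed before the path is handed back.

```python
import os
import tempfile


def make_tmp_file(**kwargs):
    """Create a temp file, close its descriptor and return only the path."""
    handle, path = tempfile.mkstemp(**kwargs)
    # mkstemp leaves the descriptor open; close it so callers can reopen
    # the file by name with whatever open method they need.
    os.close(handle)
    return path


tmp_path = make_tmp_file(suffix=".txt")
```

The file still exists on disk after the call, so the caller remains responsible for deleting it.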

gwas_norm.common.md5_file(file_name, chunksize=4096)#

Get the MD5 of a file. This reads the file in chunks and accumulates the MD5 sum to avoid loading the whole file into memory. Taken from [here](https://stackoverflow.com/questions/3431825)

Parameters:

file_name (str) – A file name to check the MD5 sum
chunksize (int, optional) – The size of the chunks to read from the file (default=4096 bytes)
verbose (bool, optional) – If the file is huge then this could take a while. Setting verbose to True will output a remaining progress monitor if needed (default=False)

Returns:

md5sum – The md5 hash of the file (hex)

Return type:

str
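The chunked approach described above is the usual `hashlib` pattern, sketched here under the hypothetical name `md5_of_file`. Progress reporting is left out.

```python
import hashlib


def md5_of_file(file_name, chunksize=4096):
    """Return the hex MD5 digest of a file, reading it chunk by chunk."""
    md5 = hashlib.md5()
    with open(file_name, "rb") as infile:
        # iter() with a sentinel keeps calling read() until it returns b"".
        for chunk in iter(lambda: infile.read(chunksize), b""):
            md5.update(chunk)
    return md5.hexdigest()
```

Only one chunk is held in memory at a time, so the memory use does not grow with the file size.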

gwas_norm.common.norm_name(str_to_norm)#

Normalise a study or analysis name by converting it to lowercase and removing spaces.

Parameters:: str_to_norm (str) – The name to normalise
Returns:: norm_str – The normalised string
Return type:: str

gwas_norm.common.parse_bool(value)#

Parse a text based Boolean value into a python Boolean

Parameters:: value (str) – The string based Boolean to convert into a python Boolean
Returns:: boolean_value – The boolean value
Return type:: bool
Raises:: TypeError – If the value is not true or false
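A sketch of the text-to-Boolean conversion, using the hypothetical name `text_to_bool`. The documentation only mentions true and false, so case-insensitive matching and whitespace stripping are assumptions here.

```python
def text_to_bool(value):
    """Convert 'true' or 'false' text into a Python bool."""
    normalised = value.strip().lower()  # case handling is an assumption
    if normalised == "true":
        return True
    if normalised == "false":
        return False
    # The documented exception type is TypeError rather than ValueError.
    raise TypeError(f"not a Boolean string: {value!r}")
```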

gwas_norm.common.parse_chrpos_spec_str(chrpos_spec)#

Parse the chrpos spec column

Parameters:: chrpos_spec (str) – The chrpos column to parse
Returns:: chrpos_spec – A ChrPosSpec named tuple
Return type:: ChrPosSpec
Raises:: KeyError – If the chrpos_spec can’t be parsed

gwas_norm.common.passthrough(value)#

A dummy pass through method

Parameters:: value (Any) – The value to pass through
Returns:: value – The value to pass through
Return type:: Any

gwas_norm.common.safe_move(source, dest, force=True)#

Move the source file to the destination location. This will only happen if the source file is not present in the destination. If it is, then check that the bsd_chksum is the same; if not, raise an error. If force is True, then no error is raised, only a warning.

Parameters:

source (str) – The source file location
dest (str) – The destination file location

gwas_norm.common.stdopen(filename, mode='rt', method=<built-in function open>, use_tmp=False, tmp_dir=None, **kwargs)#

Provide either an opened file or STDIN/STDOUT if filename is not a file.

Parameters:

filename (str or sys.stdin or NoneType) – The filename to open. If sys.stdin, ‘-’, ‘’ or NoneType then sys.stdin is yielded otherwise the file is opened with method.
mode (str) – Should be the usual w/wt/wb/r/rt/rb is interpreted as read.
method (func) – The open method to use (uses the standard open as a default).
**kwargs – Any other kwargs passed to method.

Yields:

fobj (File or sys.stdin or sys.stdout) – A place to read or write depending on mode
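The yield-based behaviour can be sketched as a context manager. This is an illustration under assumptions: `open_or_std` is a hypothetical name, the choice of `sys.stdout` when the mode contains `w` is inferred from the Yields description, and the `use_tmp` and `tmp_dir` options are not modelled.

```python
import contextlib
import sys


@contextlib.contextmanager
def open_or_std(filename, mode="rt", method=open, **kwargs):
    """Yield an opened file, or STDIN/STDOUT when no file name is given."""
    if filename is sys.stdin or filename in (None, "", "-"):
        # Standard streams are yielded as-is and never closed here.
        yield sys.stdout if "w" in mode else sys.stdin
        return

    fobj = method(filename, mode, **kwargs)
    try:
        yield fobj
    finally:
        # Real files are always closed, even if the caller's block raises.
        fobj.close()
```

Any open function with the same call signature, such as `gzip.open`, could be passed as `method` in this sketch.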

`gwas_norm.constants`#

