gwas_norm package
gwas_norm.gwas_norm
gwas_norm.processors
gwas_norm.config
Handle all interaction with GwasNorm configuration options
- gwas_norm.config.ASSEMBLY_PREFIX = 'assembly'#
The name of the genome assembly section in the configuration file (str)
- gwas_norm.config.ASSEMBLY_SYN_SECTION = 'assembly.synonyms'#
The name of the assembly synonyms section in the configuration file (str)
- gwas_norm.config.CHAIN_FILE_PREFIX = 'chain_files'#
The name of the chain file section in the configuration file (str)
- gwas_norm.config.CONFIG_DEFAULT_NAME = '.gwas_norm.cnf'#
The default name for a gwas-norm configuration file (str)
- gwas_norm.config.CONFIG_ENV = 'GWAS_NORM_CONFIG'#
The name of the shell environment variable that, if present, stores the location of the gwas-norm configuration file (str)
- gwas_norm.config.DETAIL_DELIMITER = '.'#
The delimiter that separates detail fields in the configuration section names (str)
- class gwas_norm.config.GwasNormConfig(config_file=None)#
Bases:
object
Handles interaction with the gwas-norm configuration options (reading only)
- Parameters:
config_file (str or NoneType, optional, default: NoneType) – A configuration file location. If this is provided, it is simply checked for existence and returned. If it has not been provided, then the default locations are checked and returned.
- Raises:
FileNotFoundError – If a location for the default config file is not found
PermissionsError – If a location for the default config file can’t be read
- as_dict()#
Return the configuration file as a nested dictionary. The returned dict is a deep copy of the actual configuration dict.
- Returns:
config_dict – A dict of all the configuration parsed from the config file
- Return type:
dict
- get_chain_file(source_assembly, target_assembly)#
Return the path to the chain file that maps the source_assembly to the target_assembly.
- Parameters:
source_assembly (str) – The source assembly name. This will be normalised and searched in the config file for a path to the target_assembly (which is also normalised).
target_assembly (str) – The target assembly name. This will be normalised and searched in the config file under all the paths associated with the source_assembly.
- Returns:
chain_file – The path to the chain file that maps from the source assembly to the target assembly.
- Return type:
str
- Raises:
KeyError – If there are no chain files that map from the source assembly to the target assembly
- classmethod get_defaults()#
Get an empty dictionary of configuration file sections.
- Returns:
blank_config – The keys are the parsed config file sections, the values are empty dictionaries where parsed content can be placed.
- Return type:
dict [str, dict]
- get_mapping_file(assembly)#
Return the path to the mapping file associated with the assembly.
- Parameters:
assembly (str) – The assembly name. This will be normalised and searched for a mapping file associated with it.
- Returns:
mapping_file – The path to the mapping file.
- Return type:
str
- Raises:
KeyError – If there are no mapping files that are associated with the assembly
- get_norm_assembly_name(assembly_name)#
When given an assembly name get the standardised version used in the config file.
- Parameters:
assembly_name (str) – The assembly name to get the standard version of.
- Returns:
norm_assembly – The standardised assembly name
- Return type:
str
- Raises:
KeyError – If the assembly name is not recognised
- get_ref_assembly(assembly, species)#
Return the path to the reference assembly associated with the assembly.
- Parameters:
assembly (str) – The assembly name. This will be normalised and searched for a reference assembly associated with it.
species (str) – The species for the assembly
- Returns:
reference_assembly – The path to the reference assembly.
- Return type:
str
- Raises:
KeyError – If there is no reference assembly associated with the assembly
- property name#
Return the file name of the config file
- gwas_norm.config.MAPPING_FILE_SECTION = 'mapping_files'#
The name of the mapping files section in the configuration file (str)
- gwas_norm.config.get_config_file(config_file=None)#
Attempt to locate and return the gwas-norm config file location if it has been defined either in the arguments or the environment.
- Parameters:
config_file (str or NoneType, optional, default: NoneType) – A configuration file location. If this is provided, it is simply checked for existence and returned. If it has not been provided, then the default locations are checked and returned.
- Returns:
config_path – The absolute path to the default config file, if found.
- Return type:
str
- Raises:
FileNotFoundError – If a location for the default config file is not found
PermissionsError – If a location for the default config file can’t be read
Notes
The order of return is as follows: if a config_file is provided, then it is returned (provided it exists). If not, then the GWAS_NORM_CONFIG environment variable (see CONFIG_ENV) is checked to see if it is defined. If so, then it is returned if it exists. Finally, the root of the HOME environment variable is checked for a file named .gwas_norm.cnf (see CONFIG_DEFAULT_NAME). If that exists it is returned. If any of the defined paths do not exist then the relevant FileNotFoundError will be raised.
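The lookup order described in the Notes can be sketched as follows. This is an illustrative re-implementation, not the package's own code: the helper name `locate_config` and its `environ` argument are hypothetical, and only the two constant values are taken from this page.

```python
import os

# Values copied from the constants documented above.
CONFIG_ENV = "GWAS_NORM_CONFIG"
CONFIG_DEFAULT_NAME = ".gwas_norm.cnf"


def locate_config(config_file=None, environ=None):
    """Hypothetical helper that mirrors the documented lookup order."""
    environ = os.environ if environ is None else environ
    if config_file is not None:
        # 1. An explicitly supplied path wins.
        candidate = config_file
    elif CONFIG_ENV in environ:
        # 2. Otherwise fall back to the environment variable.
        candidate = environ[CONFIG_ENV]
    else:
        # 3. Finally look for the default file name in the root of HOME.
        candidate = os.path.join(environ.get("HOME", ""), CONFIG_DEFAULT_NAME)
    candidate = os.path.abspath(os.path.expanduser(candidate))
    if not os.path.isfile(candidate):
        raise FileNotFoundError(f"config file not found: {candidate}")
    return candidate
```

Passing `environ` explicitly is only a convenience for testing the sketch; the documented function takes just `config_file`.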
gwas_norm.common
- class gwas_norm.common.ChrPosSpec(spec_columns, start_anchor, end_anchor)#
Bases:
tuple
- end_anchor#
Alias for field number 2
- spec_columns#
Alias for field number 0
- start_anchor#
Alias for field number 1
- class gwas_norm.common.Msg(file=<_io.TextIOWrapper name='<stderr>' mode='w' encoding='utf-8'>, verbose=True, prefix=None)#
Bases:
object
A class for outputting information based on verbosity
- Parameters:
file (file-like, optional, default: sys.stderr) – The output location for the message, defaults to STDERR.
verbose (bool, optional, default: True) – Should messages be output.
prefix (str or NoneType, optional, default: NoneType) – Should messages be prefixed with some text.
- msg_args(args, **kwargs)#
Output the values of command line arguments based on verbosity.
- Parameters:
args (argparse.Namespace) – The arguments parsed out of the argument parser.
**kwargs – Keyword arguments to gwas_norm.common.Msg.msg
- msg_prog(prog_name, package, version)#
Output the program name according to verbosity.
- Parameters:
prog_name (str) – The program name to output.
package (str) – The package the program is within.
version (str) – The version number of the package the program is in.
- set_file(file)#
Set the output location.
- Parameters:
file (file-like) – The output location for the message, defaults to STDERR.
- set_verbose(verbose)#
Set the verbosity.
- Parameters:
verbose (bool) – Should messages be output.
- gwas_norm.common.add_column_name(existing_header, column_name)#
Add a column to an existing header. The column is appended to the end of the header. This function ensures that the column_name is unique within the header. This is achieved by appending an integer suffix to the end of the column name until it is unique within the header
- Parameters:
- Returns:
column_name – The final column name added to the header, may not be the same as what was passed to the function. The header list is not returned as the addition of the column name to the header happens in place.
- Return type:
str
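A minimal sketch of the uniqueness rule described for add_column_name. The exact suffix format is not documented here, so the underscore separator and the counter starting at 1 are assumptions, and `add_unique_column` is a hypothetical name.

```python
def add_unique_column(existing_header, column_name):
    """Append column_name to the header in place, suffixing it until unique."""
    candidate = column_name
    suffix = 1
    while candidate in existing_header:
        # Assumed format: <name>_<integer>; the real separator may differ.
        candidate = f"{column_name}_{suffix}"
        suffix += 1
    existing_header.append(candidate)
    return candidate
```

As in the description above, the header list is modified in place and only the final column name is returned.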
- gwas_norm.common.bsd_chksum_file(infile, chunksize=4096)#
Implement a simple BSD checksum of a file. This is the same as the UNIX sum program.
- gwas_norm.common.bsd_chksum_str(instr)#
Implement a simple BSD checksum of a string. This is the same as the UNIX sum program.
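For reference, the classic BSD checksum is a 16-bit rotate-right-and-add over the input bytes. The sketch below shows that algorithm with hypothetical function names; it is not the package's implementation, and how gwas_norm encodes strings before hashing is not documented here (UTF-8 is assumed).

```python
def bsd_update(checksum, data):
    """Fold a chunk of bytes into a running 16-bit BSD checksum."""
    for byte in data:
        # Rotate the 16-bit checksum right by one bit, then add the byte.
        checksum = (checksum >> 1) | ((checksum & 1) << 15)
        checksum = (checksum + byte) & 0xFFFF
    return checksum


def bsd_checksum_str(text, encoding="utf-8"):
    """Checksum of a string (the encoding is an assumption)."""
    return bsd_update(0, text.encode(encoding))


def bsd_checksum_file(path, chunksize=4096):
    """Checksum of a file, read in chunks to keep memory use flat."""
    checksum = 0
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunksize), b""):
            checksum = bsd_update(checksum, chunk)
    return checksum
```

Because the state is carried between chunks, the file and string variants agree for the same bytes regardless of the chunk size.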
- gwas_norm.common.check_abs_path(path, message, root)#
Check that the path is only an absolute path when the root is not set.
- Parameters:
path (str) – The path to check.
message (str) – The name of the path; this will be used in any error message.
root (str or NoneType) – The root path to join to path if it is relative and root is not NoneType
- Returns:
path – The absolute checked path. Paths starting with ~/, ../ or ./ are expanded, so they count as absolute paths.
- Return type:
str
- Raises:
FileNotFoundError – If the path is relative and root is NoneType.
- gwas_norm.common.check_analysis_type(analysis_type)#
Make sure the analysis_type is lowercase and one of the allowed analysis types: eqtl, sqtl, mqtl, metabqtl, trait, disease, pqtl
- Parameters:
analysis_type (str) – The analysis type to test
- Returns:
analysis_type – The correct analysis type which will be lower case
- Return type:
- Raises:
ValueError – If the analysis_type is not one of: eqtl, sqtl, mqtl, metabqtl, trait, disease, pqtl
- gwas_norm.common.check_effect_type(effect_type)#
Check the effect_type is valid. This is one of: or, log_or, beta, and lowercase
- Parameters:
effect_type (str) – The effect type to test
- Returns:
effect_type – The correct effect type which will be lower case
- Return type:
- Raises:
ValueError – If the effect_type is not one of: or, log_or, beta
- gwas_norm.common.check_parent(obj)#
A helper function that checks whether the object has a parent object. If not, it raises an AttributeError; this happens if there is no parent attribute, or if there is a parent attribute and it is NoneType.
- Parameters:
obj (Any) – Any object potentially with a parent attribute
- Raises:
AttributeError – If there is no parent attribute or the parent attribute is NoneType
- gwas_norm.common.compress_file(infile, chunksize=4096)#
GZIP compress a file
- gwas_norm.common.convert(character)#
Convert raw string un-printables to printables and vice versa.
- Parameters:
character (str) – Either a printable raw string or an unprintable \n, \t, \s.
- Returns:
character – The printable or unprintable opposite of the character passed to the function.
- Return type:
str
- gwas_norm.common.count_lines(file_name, gzipped=False)#
Count the lines in a file
- Parameters:
file_name (str) – The file name to open and count
gzipped (bool, optional, default: False) – Is the file to count compressed? If so, it will be opened with gzip.open and not open
- Returns:
line_count (int) – The number of lines in the file
- gwas_norm.common.create_chrpos_spec_str(chrpos_spec)#
Parse the chrpos spec named tuple into a string
- Parameters:
chrpos_spec (ChrPosSpec) – A ChrPosSpec named tuple
- Returns:
chrpos_spec – The chrpos column to parse
- Return type:
- gwas_norm.common.create_uni_id(chr_name, start_pos, effect_allele, other_allele)#
Create a universal identifier based on coordinates and alleles.
- Parameters:
chr_name (str) – The chromosome name.
start_pos (int) – The start position in base pairs.
effect_allele (str) – The effect allele.
other_allele (str) – The non-effect allele.
- Returns:
uni_id – The universal identifier. This is: chr_start_<alleles in sort order>, where the alleles are also separated by an underscore.
- Return type:
str
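The identifier layout described above can be sketched as below. The function name `make_uni_id` is hypothetical, and plain lexical sorting of the two alleles is an assumption; the package may normalise case or order alleles differently.

```python
def make_uni_id(chr_name, start_pos, effect_allele, other_allele):
    """Build chr_start_<alleles in sort order>, all joined by underscores."""
    alleles = sorted([effect_allele, other_allele])
    return "_".join([str(chr_name), str(start_pos)] + alleles)
```

Sorting the alleles means the identifier does not depend on which allele was labelled as the effect allele, so `make_uni_id("1", 12345, "T", "A")` and `make_uni_id("1", 12345, "A", "T")` produce the same string.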
- gwas_norm.common.error_on_empty(value, value_type='value')#
Raise an error if a value is an empty string ‘’, all spaces, NoneType or an empty list []; otherwise pass the value through.
- Parameters:
value (Any) – The value to test
value_type (str, optional, default: 'value') – The name of type of the value, this is used in any error message raised if the value is empty
- Returns:
value – The value is passed through if not empty
- Return type:
Any defined value
- Raises:
ValueError – If the value is an empty string ‘’ or all spaces or NoneType or an empty list []
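A short sketch of the emptiness rules listed above, using a hypothetical name `require_non_empty`. It treats only the four documented cases as empty; whether the real function also rejects other empty containers is not stated here.

```python
def require_non_empty(value, value_type="value"):
    """Return value unchanged, or raise ValueError if it counts as empty."""
    is_empty = (
        value is None
        or (isinstance(value, str) and value.strip() == "")
        or (isinstance(value, list) and len(value) == 0)
    )
    if is_empty:
        # value_type only labels the offending value in the message.
        raise ValueError(f"{value_type} is empty")
    return value
```

Note that falsy but defined values such as `0` or `False` pass through in this sketch.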
- gwas_norm.common.expand_relative_path(path)#
This checks if a path is a relative path, that is, one that starts with ~/, ../ or ./. If so, it is expanded.
Absolute paths and relative paths without leading relative symbols are NOT expanded (i.e. basenames or relative_dir/basename).
- Parameters:
path (str) – A relative or absolute path or a basename.
- Returns:
path – An absolute path or a basename.
- Return type:
str
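The expansion rule above can be illustrated with a few lines of standard-library code. The name `expand_leading_relative` is hypothetical, and expanding relative to the current working directory is an assumption.

```python
import os


def expand_leading_relative(path):
    """Expand paths that start with ~/, ../ or ./ and leave others alone."""
    if path.startswith(("~/", "../", "./")):
        # expanduser handles ~/, abspath resolves ./ and ../ against the cwd.
        return os.path.abspath(os.path.expanduser(path))
    return path
```

Basenames and paths such as `relative_dir/basename` are returned untouched, matching the behaviour described above.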
- gwas_norm.common.get_column_name(existing_header, column_name)#
Add a column to an existing header. The column is appended to the end of the header. This function ensures that the column_name is unique within the header. This is achieved by appending an integer suffix to the end of the column name until it is unique within the header
- Parameters:
- Returns:
column_name – The final column name added to the header, may not be the same as what was passed to the function. The header list is not returned as the addition of the column name to the header happens in place.
- Return type:
str
- gwas_norm.common.get_file_name(analysis, genome_assembly, working_dir='')#
Returns the file name for a final sorted file
- gwas_norm.common.get_old_analysis_id(study_obj, analysis_obj)#
Return an auto generated analysis ID for a study/analysis pairing.
- gwas_norm.common.get_open_method(infile, compression)#
Get the python file opening method based on the compression value.
Notes
Supported formats are no compression, infer, gzip, bz2, xz or lzma.
- Raises:
ValueError – If the compression format can’t be determined.
- gwas_norm.common.get_tmp_file(**kwargs)#
Initialise a temp file to work with. This differs from tempfile.mkstemp as the temp file is closed and only the file name is returned.
- Parameters:
**kwargs – Any arguments usually passed to tempfile.mkstemp
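The difference from calling tempfile.mkstemp directly can be sketched as below; `make_closed_temp_file` is a hypothetical name used only for illustration.

```python
import os
import tempfile


def make_closed_temp_file(**kwargs):
    """Create a temp file, close its descriptor, and return only the path."""
    # mkstemp returns an open OS-level file descriptor and the file path.
    descriptor, path = tempfile.mkstemp(**kwargs)
    os.close(descriptor)
    return path
```

The file still exists on disk after the call, so the caller is responsible for removing it when finished.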
- gwas_norm.common.md5_file(file_name, chunksize=4096)#
Get the MD5 of a file. This reads the file in chunks and accumulates the MD5 sum to avoid loading the whole file into memory. Taken from [here](https://stackoverflow.com/questions/3431825)
- Parameters:
file_name (str) – A file name to check the MD5 sum
chunksize (int, optional) – The size of the chunks to read from the file (default=4096 bytes)
verbose (bool, optional) – If the file is huge then this could take a while. Setting verbose to True will output a remaining progress monitor if needed (default=False)
- Returns:
md5sum – The md5 hash of the file (hex)
- Return type:
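Chunked hashing of this kind is usually written with hashlib along the following lines. This is a generic sketch under a hypothetical name, `md5_of_file`, and it omits the progress monitor mentioned above.

```python
import hashlib


def md5_of_file(file_name, chunksize=4096):
    """Return the hex MD5 digest of a file, reading chunksize bytes at a time."""
    digest = hashlib.md5()
    with open(file_name, "rb") as handle:
        # iter() with a sentinel keeps reading until an empty bytes object.
        for chunk in iter(lambda: handle.read(chunksize), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The result is identical to hashing the whole file at once; only the peak memory use changes.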
- gwas_norm.common.norm_name(str_to_norm)#
Normalise a study or analysis name by converting it to lowercase and removing spaces
- gwas_norm.common.parse_bool(value)#
Parse a text based Boolean value into a python Boolean
- gwas_norm.common.parse_chrpos_spec_str(chrpos_spec)#
Parse the chrpos spec column
- gwas_norm.common.passthrough(value)#
A dummy pass through method
- Parameters:
value (Any) – The value to pass through
- Returns:
value – The value to pass through
- Return type:
Any
- gwas_norm.common.safe_move(source, dest, force=True)#
Move the source file to the destination location. This will only happen if the source file is not already present in the destination. If it is, then the bsd_chksum of both files is compared and an error is raised if they differ. If force is True, then no error is raised, only a warning.
- gwas_norm.common.stdopen(filename, mode='rt', method=<built-in function open>, use_tmp=False, tmp_dir=None, **kwargs)#
Provide either an opened file or STDIN/STDOUT if filename is not a file.
- Parameters:
filename (str or sys.stdin or NoneType) – The filename to open. If sys.stdin, ‘-’, ‘’ or NoneType then sys.stdin is yielded, otherwise the file is opened with method.
mode (str) – Should be the usual w/wt/wb/r/rt/rb is interpreted as read.
method (func) – The open method to use (uses the standard open as a default).
**kwargs – Any other kwargs passed to method.
- Yields:
fobj (File or sys.stdin or sys.stdout) – A place to read or write depending on mode
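A file-or-standard-stream opener of this shape is commonly written as a context manager. The sketch below uses a hypothetical name, `open_or_std`, ignores the use_tmp and tmp_dir options, and assumes that a write mode selects sys.stdout while any other mode selects sys.stdin.

```python
import contextlib
import sys


@contextlib.contextmanager
def open_or_std(filename, mode="rt", method=open, **kwargs):
    """Yield an opened file, or a standard stream if filename is not a file."""
    is_std = (
        filename is None
        or filename is sys.stdin
        or filename is sys.stdout
        or filename in ("", "-")
    )
    if is_std:
        # Assumption: write modes map to STDOUT, everything else to STDIN.
        yield sys.stdout if "w" in mode else sys.stdin
    else:
        handle = method(filename, mode, **kwargs)
        try:
            yield handle
        finally:
            # Only real files are closed; standard streams are left open.
            handle.close()
```

A typical call would be `with open_or_std(path, "rt", method=gzip.open) as fobj:`, which lets the same code path serve both piped input and compressed files.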
gwas_norm.constants