gwas_norm.utils sub-package
gwas_norm.utils.genome_chunks
Parse the positional information in a file and output region boundaries that each contain approximately a target number of sites. The number of sites in a region can differ from the target when several sites share the coordinate at which the target is reached. In that case, all sites at that coordinate are included in the same chunk, so the start/end coordinates of every region are unique.
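The boundary rule described above can be sketched as follows. This is a minimal illustration, not the module's implementation: the `chunk_sites` function, the `(chromosome, position)` site tuple, and the decision to close a chunk at every chromosome change are all assumptions made for the example, and the input is assumed to be sorted.

```python
from typing import Iterable, Iterator, Tuple

Site = Tuple[str, int]  # hypothetical minimal site record: (chromosome, position)


def chunk_sites(sites: Iterable[Site], target: int) -> Iterator[Tuple[str, int, int, int]]:
    """Yield (chrom, start, end, n_sites) for coordinate-sorted sites.

    A chunk is closed once it holds at least `target` sites, but only when
    the next site has a different coordinate, so tied sites never straddle
    two chunks. Closing at chromosome changes is an assumption of this sketch.
    """
    chrom = start = end = None
    count = 0
    for site_chrom, pos in sites:
        if count:
            new_chrom = site_chrom != chrom
            full = count >= target and pos != end
            if new_chrom or full:
                yield chrom, start, end, count
                count = 0
        if count == 0:
            # Open a new chunk at the current site
            chrom, start = site_chrom, pos
        end = pos
        count += 1
    if count:
        # Flush the final, possibly short, chunk
        yield chrom, start, end, count
```

With a target of 2, three sites tied at position 200 all land in the first chunk, which therefore holds four sites rather than two.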
- gwas_norm.utils.genome_chunks.convert_bgi_to_text(infiles, outdir)
Convert a bgenix index into a compressed text file. bgenix index files are really SQLite3 databases.
- Parameters:
infiles (list of str) – The paths to one or more BGEN files (not the actual indexes). Index files are expected to have the same name with a .bgi extension.
outdir (str) – The path to an output directory where the converted indexes will be written.
- Returns:
outfiles – The paths to the converted index files.
- Return type:
list of str
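Because a bgenix index is an SQLite3 database, it can be read with Python's standard `sqlite3` module. The sketch below shows one way such an index could be dumped to a gzip-compressed, tab-delimited file. The `dump_variant_index` function is hypothetical, and the `Variant` table and its column names are assumed from the bgenix index layout; this is not necessarily how `convert_bgi_to_text` works or the output format it writes.

```python
import gzip
import sqlite3


def dump_variant_index(db_path, out_path, table="Variant"):
    """Write one tab-delimited, gzip-compressed row per index entry."""
    # ASSUMPTION: table and column names follow the bgenix index layout
    query = (
        "SELECT chromosome, position, rsid, allele1, allele2 "
        f"FROM {table} ORDER BY chromosome, position"
    )
    conn = sqlite3.connect(db_path)
    try:
        with gzip.open(out_path, "wt") as out:
            out.write("chromosome\tposition\trsid\tallele1\tallele2\n")
            for row in conn.execute(query):
                out.write("\t".join(str(value) for value in row) + "\n")
    finally:
        conn.close()
    return out_path
```

Sorting in the query means the text output can be streamed directly into a chunking step that expects coordinate-ordered input.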
- gwas_norm.utils.genome_chunks.create_chunk_file(infiles=None, outfile=None, target=10000, delimiter=b'\t', comment_char=b'##', tmp_dir=None, chr_name=b'#CHROM', start_pos=b'POS', end_pos=None, ref_allele='REF', verbose=False, proxy_infiles=None, out_dir=None, out_ext=None)
Create a chunk file from one or more input files.
This is the main entry point and can be used as an API call to create a chunk output file (or to write to STDOUT).
- Parameters:
infiles (list of str or NoneType, optional, default: NoneType) – One or more input files. If NoneType or an empty list, the input is assumed to come from STDIN.
outfile (str or NoneType, optional, default: NoneType) – The output file, if NoneType it is assumed that the output is to STDOUT.
target (int, optional, default: 10000) – The target number of variants in each chunk. Some chunks will end up with more if the coordinate boundaries at target would result in the same coordinates being on two separate chunks.
delimiter (bytes, optional, default: b'\t') – The delimiter of the input.
comment_char (bytes, optional, default: b'##') – Any comment characters that may be present before the header is read in.
tmp_dir (str or NoneType, optional, default: NoneType) – An alternative temp directory to use. If NoneType, then the system tmp directory will be used.
chr_name (bytes, optional, default: b'#CHROM') – The name of the chromosome column in the input files.
start_pos (bytes, optional, default: b'POS') – The name of the start position column in the input files.
end_pos (bytes or NoneType, optional, default: NoneType) – The name of the end position column in the input files.
ref_allele (bytes or NoneType, optional, default: NoneType) – The name of the reference allele column in the input files.
verbose (bool, optional, default: False) – Whether to output chunking progress.
proxy_infiles (NoneType or list of str, optional, default: NoneType) – This is present for when bgen files are being chunked, as the indexes are worked on rather than the files themselves. In this case the bgen file names can be supplied here and a text conversion of the indexes can be provided as infiles.
out_dir (str or NoneType, optional, default: NoneType) – Any output directories that need to be incorporated into the output file names.
out_ext (str or NoneType, optional, default: NoneType) – The output file extension to use.
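A call to this entry point might look like the sketch below. The `chunk_vcf` wrapper and the file paths are hypothetical; the keyword arguments are taken from the signature above, and the column names are left at their documented VCF-style defaults. Whether compressed input is accepted is not stated here, so the example assumes a plain-text file.

```python
def chunk_vcf(vcf_path, chunk_path, target=10000):
    """Hypothetical wrapper: chunk one VCF-style file via the documented API."""
    # Imported lazily so this sketch can be loaded without gwas_norm installed
    from gwas_norm.utils import genome_chunks

    genome_chunks.create_chunk_file(
        infiles=[vcf_path],      # input file(s); None or [] would mean STDIN
        outfile=chunk_path,      # None would mean STDOUT
        target=target,           # approximate number of variants per chunk
        delimiter=b"\t",
        comment_char=b"##",
        chr_name=b"#CHROM",
        start_pos=b"POS",
        verbose=True,
    )


# Example call (paths are placeholders):
# chunk_vcf("study.vcf", "study.chunks.txt", target=5000)
```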
- gwas_norm.utils.genome_chunks.find_in_header(header, chr_name, start_pos, end_pos, ref_allele)
Find the location of the chromosome name, start position and end position columns within the header.
- Parameters:
header (list of str) – The header row to search.
chr_name (str) – The name of the chromosome name column.
start_pos (str) – The name of the start position column.
end_pos (str) – The name of the end position column.
ref_allele (str) – The name of the reference allele column.
- Returns:
chr_name_idx (int) – The index of the chromosome name column
start_pos_idx (int) – The index of the start position column
end_pos_idx (int) – The index of the end position column or potentially the reference allele column (if no end position is defined).
end_pos_func (function) – A function to use to extract the end position. This will either be from an end position column or from the start position and the reference allele length.
- Raises:
ValueError – If any of the chr_name, start_pos or end_pos columns can’t be found
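The lookup can be sketched roughly as below. The `locate_columns` function is hypothetical and is not the module's code. To stay short it returns a boolean flag where the documented function returns an extraction function, and it assumes that a ValueError is also appropriate when neither an end position nor a reference allele column is given.

```python
def locate_columns(header, chr_name, start_pos, end_pos=None, ref_allele=None):
    """Return (chr_idx, start_idx, end_idx, uses_ref_allele)."""

    def index_of(name):
        try:
            return header.index(name)
        except ValueError:
            raise ValueError(f"column not found in header: {name!r}") from None

    chr_idx = index_of(chr_name)
    start_idx = index_of(start_pos)
    if end_pos is not None:
        # An explicit end position column takes priority
        return chr_idx, start_idx, index_of(end_pos), False
    if ref_allele is not None:
        # Fall back to the reference allele column; the caller derives the
        # end position from the start position and the allele length
        return chr_idx, start_idx, index_of(ref_allele), True
    # ASSUMPTION: with neither column named, no end position can be derived
    raise ValueError("either end_pos or ref_allele must be given")
```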
- gwas_norm.utils.genome_chunks.get_chunks(infile=None, target=10000, delimiter=b'\t', comment_char=b'##', chr_name=b'#CHROM', start_pos=b'POS', end_pos=None, ref_allele=None, idx=1, row_idx=1, out_dir=None, out_ext=None, proxy_infile=None)
Iterate through a file and yield genomic chunks.
- Parameters:
infile (str or NoneType, optional, default: NoneType) – An input file. If NoneType, the input is assumed to come from STDIN.
target (int, optional, default: 10000) – The target number of variants in each chunk. Some chunks will end up with more if the coordinate boundaries at target would result in the same coordinates being on two separate chunks.
delimiter (str, optional, default: b'\t') – The delimiter of the input.
comment_char (str, optional, default: b'##') – Any comment characters that may be present before the header is read in.
chr_name (str, optional, default: b'#CHROM') – The name of the chromosome column in the input files.
start_pos (str, optional, default: b'POS') – The name of the start position column in the input files.
end_pos (str or NoneType, optional, default: NoneType) – The name of the end position column in the input files.
ref_allele (str or NoneType, optional, default: NoneType) – The name of the reference allele column in the input files.
idx (int, optional, default: 1) – A counter for the number of sites passed through.
row_idx (int, optional, default: 1) – A counter for the number of regions output.
out_dir (str or NoneType, optional, default: NoneType) – Any output directory that needs to be incorporated into the output file name.
out_ext (str or NoneType, optional, default: NoneType) – The output file extension to use.
proxy_infile (NoneType or str, optional, default: NoneType) – This is present for when bgen files are being chunked, as the index is worked on rather than the files themselves. In this case the bgen file name can be supplied here and a text conversion of the index can be provided as infile.
- Yields:
outrow (list of bytes) – Output chunk ranges.
- gwas_norm.utils.genome_chunks.get_end_pos(chr_name, start_pos, end_pos)
Extract the positional data as integers.
- Parameters:
chr_name (str) – The chromosome name value.
start_pos (str) – The start position value.
end_pos (str) – The end position value.
- Returns:
chr_name (str) – The chromosome name value (unchanged).
start_pos (int) – The integer start position value.
end_pos (int) – The integer end position value.
- gwas_norm.utils.genome_chunks.get_ref_end_pos(chr_name, start_pos, ref_allele)
Extract the positional data as integers.
This uses the start position value and the reference allele length to calculate the end position.
- Parameters:
chr_name (str) – The chromosome name value.
start_pos (str) – The start position value.
ref_allele (str) – The reference allele value.
- Returns:
chr_name (str) – The chromosome name value (unchanged).
start_pos (int) – The integer start position value.
end_pos (int) – The integer end position value.
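The two extraction strategies can be illustrated with the sketch below. Both function names are hypothetical, and the end position formula assumes 1-based inclusive coordinates (as in VCF), so a single-base reference allele ends where it starts. The module's own arithmetic may differ.

```python
def end_from_column(chr_name, start_pos, end_pos):
    """Cast start and end positions to int; pass the chromosome through."""
    return chr_name, int(start_pos), int(end_pos)


def end_from_ref_allele(chr_name, start_pos, ref_allele):
    """Derive the end position from the start and the reference allele length."""
    start = int(start_pos)
    # ASSUMPTION: 1-based inclusive coordinates, so end = start + len(ref) - 1
    return chr_name, start, start + len(ref_allele) - 1
```

Both functions work on `str` or `bytes` field values, since `int()` and `len()` accept either.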