gwas_norm.variants.downloads sub-package#

gwas_norm.variants.downloads.format_dbsnp#

gwas_norm.variants.downloads.format_dbsnp.parse_dbsnp_vcf(infile, assembly_map, ignore_version=False)#

Parse the dbSNP VCF file into a simplified chromosome format.

Parameters:
  • infile (str) – The path to the input dbSNP VCF file. See dbSNP downloads

  • assembly_map (str) – The path to the input assembly map file. See assembly info downloads

  • ignore_version (bool, optional, default: False) – Ignore version numbers when converting the NCBI chromosome names to regular chromosome names. This strips the trailing \.\d+ from the value in the #CHROM field of the dbSNP VCF file.

Yields:

output_rows (str) – A processed row from the dbSNP VCF file. The header rows are also yielded, i.e. any yielded row is good for direct writing to file.

Raises:

ValueError – If the VCF column headings are not correct
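
The effect of ignore_version can be illustrated with a short regular expression. This is a minimal sketch of the stripping step only, not the library's implementation; strip_version is a hypothetical helper and the accession shown is only an example of the NCBI naming style.

```python
import re

# Hypothetical helper, not part of gwas_norm: removes a trailing
# ".<digits>" version suffix, i.e. the \.\d+ pattern described above.
VERSION_SUFFIX = re.compile(r"\.\d+$")


def strip_version(chrom):
    """Return chrom without a trailing version suffix such as '.11'."""
    return VERSION_SUFFIX.sub("", chrom)


print(strip_version("NC_000001.11"))  # NC_000001
print(strip_version("NC_000001"))     # NC_000001 (nothing to strip)
```

With the version removed, a lookup against the assembly map no longer depends on which accession version the VCF file was built against.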

gwas_norm.variants.downloads.format_alfa#

gwas_norm.variants.downloads.format_alfa.parse_alfa_vcf(infile, assembly_map, ignore_version=False)#

Parse the ALFA VCF file into a simplified format and alter the sample names.

Parameters:
  • infile (str) – The path to the input ALFA VCF file. See ALFA downloads

  • assembly_map (str) – The path to the input assembly map file. See assembly info downloads

  • ignore_version (bool, optional, default: False) – Ignore version numbers when converting the NCBI chromosome names to regular chromosome names. This strips the trailing \.\d+ from the value in the #CHROM field of the ALFA VCF file.

Yields:

output_rows (str) – A processed row from the ALFA VCF file. The header rows are also yielded, i.e. any yielded row is good for direct writing to file.

Raises:

ValueError – If the VCF column headings are not correct
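
Because every yielded row is ready for writing, the output of a parser can be streamed straight to a file. The sketch below assumes that each row is a str that already ends with a newline, which the documentation does not state explicitly. write_rows and example_rows are hypothetical names, and example_rows is a stand-in for a real parser generator with made-up rows.

```python
import gzip


def write_rows(rows, outpath):
    """Write an iterable of ready-to-write text rows to a gzip file.

    Assumes each row is a str that already ends with a newline.
    Returns the number of rows written.
    """
    count = 0
    with gzip.open(outpath, "wt") as outfile:
        for row in rows:
            outfile.write(row)
            count += 1
    return count


def example_rows():
    """Stand-in for a parser generator; the rows are illustrative only."""
    yield "##fileformat=VCFv4.2\n"
    yield "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    yield "1\t100\trs1\tA\tC\t.\t.\t.\n"
```

In real use, the first argument would be the generator returned by one of the parsing functions, for example write_rows(parse_alfa_vcf(infile, assembly_map), "alfa.vcf.gz").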

gwas_norm.variants.downloads.format_snpstats#

gwas_norm.variants.downloads.format_snpstats.parse_snpstats(files, count_col, reference_genome=None)#

Parse one or more SNPSTATs files into a single VCF file.

Parameters:
  • files (list of str) – The paths to one or more SNPSTATs files. Files should not be compressed.

  • count_col (str) – The name of the count column as it should appear in the output VCF row. It must contain only the characters A-Z, a-z, 0-9 or _ (no spaces).

  • reference_genome (str or NoneType, optional, default: NoneType) – The path to an indexed reference genome fasta file. If provided, the chromosome names are taken from this and placed as contig names in the VCF header.

Yields:

output_rows (str) – A processed row for the output VCF file. The header rows are also yielded, i.e. any yielded row is good for direct writing to file.

Raises:

ValueError – If the SNPSTATs file column headings are not correct or if the count_col name is not correct.
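
The naming rule for count_col can be checked before any files are parsed. This is a minimal sketch of such a check, assuming the rule is exactly "one or more of A-Z, a-z, 0-9 or _". check_count_col is a hypothetical helper and the error message is not the library's.

```python
import re

# One or more of the permitted characters and nothing else.
COUNT_COL_PATTERN = re.compile(r"[A-Za-z0-9_]+")


def check_count_col(count_col):
    """Return count_col if it is a valid name, otherwise raise ValueError."""
    if not COUNT_COL_PATTERN.fullmatch(count_col):
        raise ValueError(f"invalid count column name: {count_col!r}")
    return count_col


check_count_col("STUDY_AC")        # passes
# check_count_col("study count")   # would raise ValueError (contains a space)
```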

gwas_norm.variants.downloads.fix_split_counts#

gwas_norm.variants.downloads.fix_split_counts.fix_vcf_split_counts(infile, allow_seen=True)#

Fix the allele number count in an AN:AC VCF “sample”.

Parameters:
  • infile (str) – The path to the input dbSNP VCF file. The VCF file is expected to have bi-allelic sites, with the AN:AC fields in the sample columns (i.e. not in the INFO field). The file is also assumed to be sorted, so that split sites follow each other. If NoneType, input is read from STDIN.

  • allow_seen (bool, optional, default: True) – Allow previously seen variants to be processed as separate sites. If this is False and the site has already been seen within the last 100 sites, an IndexError is raised.

Yields:

output_rows (str) – A processed row from the VCF file. The header rows are also yielded, i.e. any yielded row is good for direct writing to file.

Raises:
  • ValueError – If the VCF column headings are not correct or if there is no data in the file.

  • IndexError – If the sites are out of sort order with recently processed sites (last 100).

Notes

The adjustment is only performed on variant sites that have the same AN number. This is a safeguard against any future fixes to bcftools. A site is defined as having the same chr_name, start_pos and ref_allele.
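
The site definition above can be sketched as a grouping of consecutive rows. The example only shows how adjacent split rows could be collected and checked for a shared AN; it does not reproduce the count adjustment itself. The dictionary keys alt_allele and an, the helper names and the example rows are all hypothetical.

```python
from itertools import groupby


def site_key(row):
    """Site identity: chromosome, start position and reference allele."""
    return row["chr_name"], row["start_pos"], row["ref_allele"]


def group_split_sites(rows):
    """Yield lists of consecutive rows that share the same site key.

    Relies on the input being sorted so that split rows are adjacent.
    """
    for _, group in groupby(rows, key=site_key):
        yield list(group)


def has_single_an(group):
    """Return True if every row in the group carries the same AN."""
    return len({row["an"] for row in group}) == 1


# Illustrative rows: one site split into two rows, then a second site.
EXAMPLE_ROWS = [
    {"chr_name": "1", "start_pos": 100, "ref_allele": "A",
     "alt_allele": "C", "an": 200},
    {"chr_name": "1", "start_pos": 100, "ref_allele": "A",
     "alt_allele": "G", "an": 200},
    {"chr_name": "1", "start_pos": 150, "ref_allele": "T",
     "alt_allele": "C", "an": 180},
]

for group in group_split_sites(EXAMPLE_ROWS):
    print(site_key(group[0]), len(group), has_single_an(group))
```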

gwas_norm.variants.downloads.merge_cadd#

gwas_norm.variants.downloads.merge_cadd.merge_cadd_files(vcf_file, cadd_files)#

Merge the matching sites in the CADD files into the merged VCF file.

Parameters:
  • vcf_file (str) – The path to the reference VCF.

  • cadd_files (list or str) – The paths to the CADD files being merged into the VCF file.

Yields:

row (list of str) – A row in the final merged VCF. Header rows are yielded as well as data rows.

gwas_norm.variants.downloads.merge_counts#

gwas_norm.variants.downloads.merge_counts.merge_count_files(ref_file, merge_files, ref_name=None, data_set_names=None, reference_genome=None)#

Merge “count” VCF files into a reference VCF.

Parameters:
  • ref_file (str) – The path to the reference VCF.

  • merge_files (list or str) – The paths to the files being merged into the reference file.

  • ref_name (str or NoneType, optional, default: NoneType) – The dataset name for the reference file. If it is NoneType then the name of the reference file is not added to the DS INFO field in the merged file.

  • data_set_names (list of str, optional, default: NoneType) – The names of the datasets being merged into ref_file. These are added to a DS INFO field in the final merged VCF file. If NoneType, dummy names are generated, i.e. ds1, ds2, ds3, ..., dsN. If supplied, there must be one name for each file in merge_files.

  • reference_genome (str or NoneType, optional, default: NoneType) – The path to the relevant FASTA reference genome assembly, which must have been indexed to create a .fai file. If provided, the contigs from it are used in the VCF header. If not, chromosomes 1-22, X, Y and MT are used.

Yields:

row (list of str) – A row in the final merged VCF.

Notes

This is the main API entry point. The reference file supplies all of the INFO fields and IDs in the output file; none are taken from the files being merged. All VCF files must be sorted on chromosome as a string and on position as an integer.
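
The sort requirement is easy to get wrong, because sorting chromosomes as strings places "10" before "2". The following is a minimal sketch of a check for that order, assuming sites are available as (chromosome, position) pairs. sort_key and is_sorted are hypothetical helpers, not library functions.

```python
def sort_key(chrom, pos):
    """Sort key: chromosome compared as a string, position as an integer."""
    return str(chrom), int(pos)


def is_sorted(sites):
    """Return True if (chrom, pos) pairs are in non-decreasing sort order."""
    previous = None
    for chrom, pos in sites:
        current = sort_key(chrom, pos)
        if previous is not None and current < previous:
            return False
        previous = current
    return True


# String order on chromosomes: "1" < "10" < "2".
print(is_sorted([("1", 5), ("10", 3), ("2", 1)]))  # True
# Integer order on positions: 9 < 10, even when read as text.
print(is_sorted([("1", "9"), ("1", "10")]))        # True
# Numeric chromosome order is not the expected order.
print(is_sorted([("2", 1), ("10", 1)]))            # False
```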

gwas_norm.variants.downloads.split_mapping#

gwas_norm.variants.downloads.split_mapping.split_mapping_file(mapping_file, common_out, rare_out, tmp_dir=None, maf=0.01, mac=50, verbose=False)#

Partition the input VCF mapping file into two files, common and rare, based on the maf or mac values.

Parameters:
  • mapping_file (str) – The path to the mapping file.

  • common_out (str) – The path to the common output VCF file.

  • rare_out (str) – The path to the rare output VCF file.

  • tmp_dir (str or NoneType, optional, default: NoneType) – An alternative temp directory to use, the default (NoneType) is to use the system temp location.

  • maf (float, optional, default: 0.01) – The minor allele frequency cutoff to use, values >= this are common.

  • mac (int, optional, default: 50) – The minor allele count cutoff to use, values >= this are common.

  • verbose (bool, optional, default: False) – Report progress through the file.

Notes

Output is written via a temporary file. A site is common if its value is >= maf or mac and rare if it is < maf or mac. Only one population has to reach the maf or mac cutoff for the site to be classed as common. The output files are tabix indexed after creation.
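
One reading of the common/rare rule is sketched below. It assumes that a site is common when any single population has a MAF at or above maf, or a MAC at or above mac, and rare otherwise. That interpretation, the is_common name and the (maf, mac) tuple layout are assumptions for illustration, not the library's data model.

```python
def is_common(populations, maf=0.01, mac=50):
    """Return True if any population reaches the maf or the mac cutoff.

    populations is an iterable of (minor allele frequency, minor allele
    count) pairs, one pair per population.
    """
    return any(
        pop_maf >= maf or pop_mac >= mac
        for pop_maf, pop_mac in populations
    )


print(is_common([(0.002, 10), (0.02, 30)]))   # True: one MAF >= 0.01
print(is_common([(0.001, 50)]))               # True: MAC >= 50
print(is_common([(0.001, 12), (0.005, 49)]))  # False: all below both cutoffs
```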