VCF tools
Functions for loading, processing, and transforming VCF (Variant Call Format) files. These tools read VCF files into PyRanges objects and allow flexible manipulation of their fields. For further explanation, see the “Dealing with VCF files” section of the tutorial.
- pyrangeyes.vcf.read_vcf(f: str | Path, nrows: bool | None = None)
Read a VCF (Variant Call Format) file and convert it into a PyRanges object.
This function processes a VCF file by reading the data, extracting the header and data lines, and creating a PyRanges object for genomic analysis. The metadata lines (lines starting with ‘##’) are ignored, and the column names are extracted from the header line (starting with ‘#CHROM’).
- Parameters:
f (str | Path) – The file path to the VCF file to be read.
nrows (bool | None, optional) – The number of rows to read from the file. If None, reads the entire file.
- Returns:
A PyRanges object containing the VCF data, with the following columns added:
- Chromosome: Chromosome names (from ‘CHROM’ in the VCF).
- Start: Start positions of variants (from ‘POS’ in the VCF).
- End: End positions of variants (calculated as Start + 1).
- Return type:
pr.PyRanges
- Raises:
FileNotFoundError – If the provided file path does not exist.
ValueError – If the VCF file is malformed or missing essential fields.
Notes
Missing quality scores (‘.’) are replaced with pandas.NA.
Large VCF files are read in chunks to limit memory usage.
Columns ‘CHROM’ and ‘POS’ are renamed to ‘Chromosome’ and ‘Start’ respectively, to align with PyRanges conventions.
Examples
>>> vcf_pyranges = pre.vcf.read_vcf("example.vcf")
>>> vcf_pyranges
  index  |    Chromosome      Start  ID          REF       ALT       QUAL      FILTER      INFO                           End
  int64  |    category        int32  category    object    object    object    category    object                       int32
-------  ---  ------------  -------  ----------  --------  --------  --------  ----------  -------------------------  -------
      0  |    1                 500  .           A         T         <NA>      PASS        TRANSCRIPT=t1;SECOND_ID=a      501
      1  |    1                3500  .           A         T         <NA>      PASS        TRANSCRIPT=t1;SECOND_ID=a     3501
      2  |    1                 300  .           A         T         <NA>      PASS        TRANSCRIPT=t2;SECOND_ID=a      301
      3  |    1                1300  .           A         T         <NA>      PASS        TRANSCRIPT=t2;SECOND_ID=a     1301
    ...  |    ...               ...  ...         ...       ...       ...       ...         ...                            ...
      5  |    1                4500  .           A         T         <NA>      PASS        TRANSCRIPT=t3;SECOND_ID=b     4501
      6  |    1                4900  .           A         T         <NA>      PASS        TRANSCRIPT=t3;SECOND_ID=b     4901
      7  |    1                5600  .           A         T         <NA>      PASS        TRANSCRIPT=t3;SECOND_ID=b     5601
      8  |    1                6000  .           A         T         <NA>      PASS        TRANSCRIPT=t4;SECOND_ID=b     6001
PyRanges with 9 rows, 9 columns, and 1 index columns.
Contains 1 chromosomes.
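The parsing steps described for read_vcf (skip ‘##’ metadata, take column names from the ‘#CHROM’ line, rename CHROM and POS, derive End as Start + 1) can be illustrated with a standard-library sketch. This is not the pyrangeyes implementation. The function name parse_vcf_lines is hypothetical, rows are returned as plain dictionaries instead of a PyRanges object, None stands in for pandas.NA, and nrows is assumed to be an integer row limit.

```python
import io


def parse_vcf_lines(handle, nrows=None):
    """Parse VCF text into a list of row dicts (hypothetical helper, illustration only)."""
    columns = None
    rows = []
    for line in handle:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # metadata lines are ignored
        if line.startswith("#CHROM"):
            columns = line.lstrip("#").split("\t")  # header line gives the column names
            continue
        if columns is None:
            raise ValueError("data line found before the #CHROM header")
        if nrows is not None and len(rows) >= nrows:
            break  # assumed meaning of nrows: stop after this many data rows
        record = dict(zip(columns, line.split("\t")))
        start = int(record.pop("POS"))
        # CHROM and POS become Chromosome and Start; remaining fields keep their names
        row = {"Chromosome": record.pop("CHROM"), "Start": start}
        row.update(record)
        if row.get("QUAL") == ".":
            row["QUAL"] = None  # stand-in for pandas.NA in this sketch
        row["End"] = start + 1
        rows.append(row)
    if columns is None:
        raise ValueError("no #CHROM header line found")
    return rows


example = io.StringIO(
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "1\t500\t.\tA\tT\t.\tPASS\tTRANSCRIPT=t1;SECOND_ID=a\n"
    "1\t3500\t.\tA\tT\t.\tPASS\tTRANSCRIPT=t1;SECOND_ID=a\n"
)
rows = parse_vcf_lines(example)
print(rows[0]["Chromosome"], rows[0]["Start"], rows[0]["End"])  # 1 500 501
```

The resulting key order (Chromosome, Start, ID, REF, ALT, QUAL, FILTER, INFO, End) mirrors the column order in the example output above.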
- pyrangeyes.vcf.split_fields(data, target_cols: str | list, field_sep: str, col_name_sep: str | None = None, col_names: list[str] | None = None, col_types: list[str] | None = None, keep_col: bool = False)
Splits a column or columns into multiple columns based on specified separators.
- Parameters:
data (pd.DataFrame) – The input DataFrame containing the columns to be split.
target_cols (str | list[str]) – Column name(s) in the DataFrame to be split. Can be a single column name or a list of column names.
field_sep (str) – Separator used to split the fields in the target column(s).
col_name_sep (str, default None) – If provided, this separator is used to split each field into a column name and value. For example, “key=value” will generate a column named key with the corresponding value. Defaults to None.
col_names (list[str], default None) – A list of names for the new columns. If not provided, column names are generated automatically based on the target column name and field index. If col_name_sep is specified, the column names can be inferred from the field keys. Defaults to None.
col_types (list[str], default None) – A list of data types for the new columns. If not provided, columns will retain their default inferred types. Defaults to None.
keep_col (bool, default False) – Whether to retain the original target column(s) in the output. If False, the original column(s) are removed.
- Returns:
A PyRanges object with the new columns added (and the target columns removed if keep_col is False).
- Return type:
PyRanges
- Raises:
ValueError – If any specified target_cols are not present in the DataFrame.
ValueError – If the number of provided col_names does not match the number of new columns generated.
ValueError – If the number of provided col_types does not match the number of new columns generated.
Example
>>> vcf = pre.example_data.ncbi_vcf()
>>> vcf
  index  |    Chromosome       Start  ID            REF       ALT       QUAL      FILTER      ...
  int64  |    object           int32  object        object    object    object    category    ...
-------  ---  ------------  --------  ------------  --------  --------  --------  ----------  -----
      0  |    1               943995  rs761448939   C         G,T       nan       .           ...
      1  |    1               964512  rs756054473   C         A,T       nan       .           ...
      2  |    1               976215  rs7417106     A         C,G,T     nan       .           ...
      3  |    1              1013983  rs1644247121  G         A         nan       .           ...
    ...  |    ...                ...  ...           ...       ...       ...       ...         ...
 242182  |    Y              2787592  rs104894975   A         T         nan       .           ...
 242183  |    Y              2787600  rs104894977   G         A         nan       .           ...
 242184  |    Y              7063898  rs199659121   A         T         nan       .           ...
 242185  |    Y             12735725  rs778145751   TAAGT     T         nan       .           ...
PyRanges with 242186 rows, 9 columns, and 1 index columns. (2 columns not shown: "INFO", "End").
Contains 25 chromosomes.
>>> pre.vcf.split_fields(vcf, target_cols="INFO", field_sep=";", col_name_sep="=")
  index  |    Chromosome       Start  ID            REF       ALT       QUAL      FILTER           End  INFO_0     TSA       INFO_2                  INFO_3                  ...
  int64  |    object           int32  object        object    object    object    category       int32  object     object    object                  object                  ...
-------  ---  ------------  --------  ------------  --------  --------  --------  ----------  --------  ---------  --------  ----------------------  ----------------------  -----
      0  |    1               943995  rs761448939   C         G,T       nan       .             943996  dbSNP_156  SNV       E_Freq                  E_Cited                 ...
      1  |    1               964512  rs756054473   C         A,T       nan       .             964513  dbSNP_156  SNV       E_Freq                  E_Cited                 ...
      2  |    1               976215  rs7417106     A         C,G,T     nan       .             976216  dbSNP_156  SNV       E_Freq                  E_1000G                 ...
      3  |    1              1013983  rs1644247121  G         A         nan       .            1013984  dbSNP_156  SNV       E_Phenotype_or_Disease  CLIN_pathogenic         ...
    ...  |    ...                ...  ...           ...       ...       ...       ...              ...  ...        ...       ...                     ...                     ...
 242182  |    Y              2787592  rs104894975   A         T         nan       .            2787593  dbSNP_156  SNV       E_Cited                 E_Phenotype_or_Disease  ...
 242183  |    Y              2787600  rs104894977   G         A         nan       .            2787601  dbSNP_156  SNV       E_Cited                 E_Phenotype_or_Disease  ...
 242184  |    Y              7063898  rs199659121   A         T         nan       .            7063899  dbSNP_156  SNV       E_Freq                  E_Cited                 ...
 242185  |    Y             12735725  rs778145751   TAAGT     T         nan       .           12735726  dbSNP_156  indel     E_Freq                  E_Cited                 ...
PyRanges with 242186 rows, 31 columns, and 1 index columns. (19 columns not shown: "INFO_4", "INFO_5", "INFO_6", ...).
Contains 25 chromosomes.
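The column naming seen in the output above (INFO_0, TSA, INFO_2, ...) suggests that a field containing col_name_sep becomes a column named after its key, while a field without it is named from the target column and its position. The following standard-library sketch applies that rule to a single string. It is an assumption inferred from the example, not the pyrangeyes implementation. The helper name split_field_string is hypothetical, and the INFO string is made up to be consistent with the first row shown.

```python
def split_field_string(value, target_col, field_sep, col_name_sep=None):
    """Split one field string into a {column_name: value} dict (hypothetical helper)."""
    result = {}
    for i, field in enumerate(value.split(field_sep)):
        if col_name_sep is not None and col_name_sep in field:
            # "key=value" style field: the key becomes the column name
            key, val = field.split(col_name_sep, 1)
            result[key] = val
        else:
            # bare field: name it after the target column and its position
            result[f"{target_col}_{i}"] = field
    return result


info = "dbSNP_156;TSA=SNV;E_Freq;E_Cited"  # hypothetical INFO value
print(split_field_string(info, "INFO", ";", "="))
# {'INFO_0': 'dbSNP_156', 'TSA': 'SNV', 'INFO_2': 'E_Freq', 'INFO_3': 'E_Cited'}
```

Without col_name_sep, every field is named positionally, which corresponds to the automatic naming described for col_names.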