VCF tools

Functions for loading, processing, and transforming VCF (Variant Call Format) files. These tools read VCF files efficiently into PyRanges objects and allow flexible manipulation of their fields. For further explanation, see the “Dealing with VCF files” section of the tutorial.

pyrangeyes.vcf.read_vcf(f: str | Path, nrows: bool | None = None)

Read a VCF (Variant Call Format) file and convert it into a PyRanges object.

This function processes a VCF file by reading the data, extracting the header and data lines, and creating a PyRanges object for genomic analysis. The metadata lines (lines starting with ‘##’) are ignored, and the column names are extracted from the header line (starting with ‘#CHROM’).
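The header handling described above can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the pyrangeyes implementation: the helper name parse_vcf_text and the in-memory example are hypothetical, and the sketch only covers skipping ‘##’ lines and taking column names from the ‘#CHROM’ line.

```python
import io


def parse_vcf_text(handle):
    """Return (column_names, rows) from a VCF-formatted text stream.

    Hypothetical helper for illustration only.
    """
    columns = None
    rows = []
    for line in handle:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # metadata lines are ignored
        if line.startswith("#CHROM"):
            # The header line provides the column names (without the '#').
            columns = line.lstrip("#").split("\t")
            continue
        if columns is None:
            raise ValueError("data line found before the #CHROM header")
        rows.append(dict(zip(columns, line.split("\t"))))
    if columns is None:
        raise ValueError("no #CHROM header line found")
    return columns, rows


# A tiny in-memory VCF (assumed content, modeled on the example below).
example = io.StringIO(
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "1\t500\t.\tA\tT\t.\tPASS\tTRANSCRIPT=t1;SECOND_ID=a\n"
)
columns, rows = parse_vcf_text(example)
```

Here columns holds the eight standard VCF column names and rows holds one dictionary per data line, with all values still stored as strings.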

Parameters:

f (str | Path) – The file path to the VCF file to be read.
nrows (bool | None, optional) – The number of rows to read from the file. If None, reads the entire file.

Returns:

A PyRanges object containing the VCF data, with the following columns added:
- Chromosome: Chromosome names (from ‘CHROM’ in the VCF).
- Start: Start positions of variants (from ‘POS’ in the VCF).
- End: End positions of variants (calculated as Start + 1).

Return type:

pr.PyRanges

Raises:

FileNotFoundError – If the provided file path does not exist.
ValueError – If the VCF file is malformed or missing essential fields.

Notes

Missing quality scores (‘.’) are replaced with pandas.NA.
For large VCF files, the function reads the file in chunks to limit memory usage.
Columns ‘CHROM’ and ‘POS’ are renamed to ‘Chromosome’ and ‘Start’ respectively, to align with PyRanges conventions.
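The column conventions in these notes can be illustrated on a single record. The sketch below makes several assumptions: to_range_record is a hypothetical helper, it works on a plain dictionary rather than a PyRanges object, and None stands in for pandas.NA so that the example needs only the standard library.

```python
def to_range_record(record):
    """Apply the renaming and End convention to one VCF record (a dict).

    Hypothetical helper; None is used here as a stand-in for pandas.NA.
    """
    out = {}
    for key, value in record.items():
        if key == "CHROM":
            out["Chromosome"] = value
        elif key == "POS":
            out["Start"] = int(value)
        elif key == "QUAL":
            # A '.' quality score means the value is missing.
            out["QUAL"] = None if value == "." else value
        else:
            out[key] = value
    # End is derived from Start, as described in the Returns section.
    out["End"] = out["Start"] + 1
    return out


record = to_range_record(
    {
        "CHROM": "1",
        "POS": "500",
        "ID": ".",
        "REF": "A",
        "ALT": "T",
        "QUAL": ".",
        "FILTER": "PASS",
        "INFO": "TRANSCRIPT=t1;SECOND_ID=a",
    }
)
```

The resulting keys appear in the same order as the columns in the example output below: the renamed Chromosome and Start first, the remaining VCF fields next, and End last.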

Examples

>>> vcf_pyranges = pre.vcf.read_vcf("example.vcf")
>>> vcf_pyranges
index    |    Chromosome    Start    ID          REF       ALT       QUAL      FILTER      INFO                       End
int64    |    category      int32    category    object    object    object    category    object                     int32
-------  ---  ------------  -------  ----------  --------  --------  --------  ----------  -------------------------  -------
0        |    1             500      .           A         T         <NA>      PASS        TRANSCRIPT=t1;SECOND_ID=a  501
1        |    1             3500     .           A         T         <NA>      PASS        TRANSCRIPT=t1;SECOND_ID=a  3501
2        |    1             300      .           A         T         <NA>      PASS        TRANSCRIPT=t2;SECOND_ID=a  301
3        |    1             1300     .           A         T         <NA>      PASS        TRANSCRIPT=t2;SECOND_ID=a  1301
...      |    ...           ...      ...         ...       ...       ...       ...         ...                        ...
5        |    1             4500     .           A         T         <NA>      PASS        TRANSCRIPT=t3;SECOND_ID=b  4501
6        |    1             4900     .           A         T         <NA>      PASS        TRANSCRIPT=t3;SECOND_ID=b  4901
7        |    1             5600     .           A         T         <NA>      PASS        TRANSCRIPT=t3;SECOND_ID=b  5601
8        |    1             6000     .           A         T         <NA>      PASS        TRANSCRIPT=t4;SECOND_ID=b  6001
PyRanges with 9 rows, 9 columns, and 1 index columns.
Contains 1 chromosomes.

pyrangeyes.vcf.split_fields(data, target_cols: str | list, field_sep: str, col_name_sep: str | None = None, col_names: list[str] | None = None, col_types: list[str] | None = None, keep_col: bool = False)

Split one or more columns into multiple columns based on the specified separators.

Parameters:

data (pd.DataFrame) – The input DataFrame containing the columns to be split.
target_cols (str | list) – Column name(s) in the DataFrame to be split. Can be a single column name (str) or a list of column names.
field_sep (str) – Separator used to split the fields in the target column(s).
col_name_sep (str, default None) – If provided, this separator is used to split each field into a column name and a value. For example, “key=value” generates a column named key holding the corresponding value.
col_names (list[str], default None) – A list of names for the new columns. If not provided, column names are generated automatically from the target column name and the field index. If col_name_sep is specified, column names can be inferred from the field keys.
col_types (list[str], default None) – A list of data types for the new columns. If not provided, the new columns keep their inferred types.
keep_col (bool, default False) – Whether to retain the original target column(s) in the output. If False, the original column(s) are removed.
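To make the roles of field_sep and col_name_sep concrete, the following sketch splits a single field string into named values. It is only an approximation of the behaviour suggested by the example output further down, not the library's code: split_field_string is a hypothetical helper, and the input string is an assumed INFO value reconstructed from that output.

```python
def split_field_string(value, target_col, field_sep, col_name_sep=None):
    """Split one cell value into a {column_name: value} mapping.

    Hypothetical helper for illustration only.
    """
    out = {}
    for i, field in enumerate(value.split(field_sep)):
        if col_name_sep is not None and col_name_sep in field:
            # "key=value" style field: the key becomes the column name.
            key, _, val = field.partition(col_name_sep)
            out[key] = val
        else:
            # No key available: name the column after the target column
            # and the position of the field.
            out[f"{target_col}_{i}"] = field
    return out


info = "dbSNP_156;TSA=SNV;E_Freq;E_Cited"  # assumed INFO value
split = split_field_string(info, "INFO", field_sep=";", col_name_sep="=")
# split == {"INFO_0": "dbSNP_156", "TSA": "SNV",
#           "INFO_2": "E_Freq", "INFO_3": "E_Cited"}
```

Fields that contain the name separator yield a column named after their key, while the remaining fields fall back to names built from the target column and the field index.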

Returns:

A PyRanges object with the new columns added (and the target columns removed if keep_col is False).

Return type:

PyRanges

Raises:

ValueError – If any specified target_cols are not present in the DataFrame.
ValueError – If the number of provided col_names does not match the number of new columns generated.
ValueError – If the number of provided col_types does not match the number of new columns generated.
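The three error conditions above amount to simple consistency checks on the arguments. A minimal sketch of such checks follows; check_split_arguments and its parameters are hypothetical names chosen for this illustration, and the actual checks and messages in pyrangeyes may differ.

```python
def check_split_arguments(available_cols, target_cols, n_new_cols,
                          col_names=None, col_types=None):
    """Validate split arguments and return target_cols as a list.

    Hypothetical helper mirroring the documented ValueError conditions.
    """
    if isinstance(target_cols, str):
        target_cols = [target_cols]  # accept a single column name
    missing = [col for col in target_cols if col not in available_cols]
    if missing:
        raise ValueError(f"target columns not found: {missing}")
    if col_names is not None and len(col_names) != n_new_cols:
        raise ValueError(
            f"expected {n_new_cols} column names, got {len(col_names)}"
        )
    if col_types is not None and len(col_types) != n_new_cols:
        raise ValueError(
            f"expected {n_new_cols} column types, got {len(col_types)}"
        )
    return target_cols


checked = check_split_arguments(
    ["Chromosome", "Start", "INFO"], "INFO", n_new_cols=2,
    col_names=["TRANSCRIPT", "SECOND_ID"],
)
```

Running the checks before any splitting is attempted means a mismatch is reported up front rather than after a partially built result.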

Example

>>> vcf = pre.example_data.ncbi_vcf()
>>> vcf
index    |    Chromosome    Start     ID            REF       ALT       QUAL      FILTER      ...
int64    |    object        int32     object        object    object    object    category    ...
-------  ---  ------------  --------  ------------  --------  --------  --------  ----------  -----
0        |    1             943995    rs761448939   C         G,T       nan       .           ...
1        |    1             964512    rs756054473   C         A,T       nan       .           ...
2        |    1             976215    rs7417106     A         C,G,T     nan       .           ...
3        |    1             1013983   rs1644247121  G         A         nan       .           ...
...      |    ...           ...       ...           ...       ...       ...       ...         ...
242182   |    Y             2787592   rs104894975   A         T         nan       .           ...
242183   |    Y             2787600   rs104894977   G         A         nan       .           ...
242184   |    Y             7063898   rs199659121   A         T         nan       .           ...
242185   |    Y             12735725  rs778145751   TAAGT     T         nan       .           ...
PyRanges with 242186 rows, 9 columns, and 1 index columns. (2 columns not shown: "INFO", "End").
Contains 25 chromosomes.
>>> pre.vcf.split_fields(vcf, target_cols="INFO", field_sep=";", col_name_sep="=")
index    |    Chromosome    Start     ID            REF       ALT       QUAL      FILTER      End       INFO_0     TSA       INFO_2                  INFO_3                  ...
int64    |    object        int32     object        object    object    object    category    int32     object     object    object                  object                  ...
-------  ---  ------------  --------  ------------  --------  --------  --------  ----------  --------  ---------  --------  ----------------------  ----------------------  -----
0        |    1             943995    rs761448939   C         G,T       nan       .           943996    dbSNP_156  SNV       E_Freq                  E_Cited                 ...
1        |    1             964512    rs756054473   C         A,T       nan       .           964513    dbSNP_156  SNV       E_Freq                  E_Cited                 ...
2        |    1             976215    rs7417106     A         C,G,T     nan       .           976216    dbSNP_156  SNV       E_Freq                  E_1000G                 ...
3        |    1             1013983   rs1644247121  G         A         nan       .           1013984   dbSNP_156  SNV       E_Phenotype_or_Disease  CLIN_pathogenic         ...
...      |    ...           ...       ...           ...       ...       ...       ...         ...       ...        ...       ...                     ...                     ...
242182   |    Y             2787592   rs104894975   A         T         nan       .           2787593   dbSNP_156  SNV       E_Cited                 E_Phenotype_or_Disease  ...
242183   |    Y             2787600   rs104894977   G         A         nan       .           2787601   dbSNP_156  SNV       E_Cited                 E_Phenotype_or_Disease  ...
242184   |    Y             7063898   rs199659121   A         T         nan       .           7063899   dbSNP_156  SNV       E_Freq                  E_Cited                 ...
242185   |    Y             12735725  rs778145751   TAAGT     T         nan       .           12735726  dbSNP_156  indel     E_Freq                  E_Cited                 ...
PyRanges with 242186 rows, 31 columns, and 1 index columns. (19 columns not shown: "INFO_4", "INFO_5", "INFO_6", ...).
Contains 25 chromosomes.