Tutorial

This tutorial assumes some familiarity with pyranges v1. If necessary, go through its tutorial first: https://pyranges1.readthedocs.io/

Getting started

The first compulsory step to obtain a plot is setting the engine, using function set_engine after importing. We also register the plot function using register_plot, which is optional but convenient: it allows to use the plot function directly from PyRanges objects (further explained later).

>>> import pyrangeyes as pre
>>> pre.set_engine("plotly")  # possible engines: "plotly" and "matplotlib"
>>> pre.register_plot()

Pyrangeyes centralizes the interface to producing graphics in the plot function. It offers plenty of options to customize the appearance of the plot, showcased in this tutorial. To that end, we will use some example data included in the Pyrangeyes package. Yet, any PyRanges object can be used, e.g. loaded from gff, gtf, bam files.

>>> p = pre.example_data.p1
>>> print(p)
  index  |      Chromosome  Strand      Start      End  transcript_id    feature1    feature2
  int64  |           int64  object      int64    int64  object           object      object
-------  ---  ------------  --------  -------  -------  ---------------  ----------  ----------
      0  |               1  +               1       11  t1               a           A
      1  |               1  +              40       60  t1               a           A
      2  |               2  -              10       25  t2               b           B
      3  |               2  -              70       80  t2               b           B
      4  |               2  +              85      100  t3               c           C
      5  |               2  +             110      115  t3               c           C
      6  |               2  +             150      180  t3               c           C
      7  |               3  +             140      152  t4               d           D
PyRanges with 8 rows, 7 columns, and 1 index columns.
Contains 3 chromosomes and 2 strands.

By default, plot produces an interactive plot. If the Matplotlib engine is selected, a window appears. If the Plotly engine is selected, a server is automatically opened, and an address is printed in the console. The plot can be accessed by opening this address in a browser.

>>> pre.plot(p)
_images/prp_rtd_01.png

Interactive navigation is intuitive:

  • Hover over intervals to see their details in a tooltip

  • Click and drag to zoom in on a region.

  • Double-click to reset the zoom level.

  • Inspect the rest of buttons on the top-right to see other available actions.

To create a pdf or png image file instead of opening an interactive plot, use the to_file parameter of plot.

>>> pre.plot(p, to_file="my_plot.png")

Because we registered the plot function, we can also invoke it like a method of the PyRanges object, as PyRanges.plot(...). This is equivalent to the previous code:

>>> p.plot(to_file="my_plot.png")

In the figure above, intervals are displayed individually, i.e. each PyRanges row is treated as a separate entity. To link the intervals instead, as to represent a transcript composed of exons, use the id_column parameter, indicating the column name that defines the groups of intervals.

>>> pre.plot(p, id_col="transcript_id")
_images/prp_rtd_02.png

Because the id_col parameter is used frequently, it can be set as default for all plots using function set_id_col. The following code is equivalent to the previous one:

>>> pre.set_id_col("transcript_id")
>>> pre.plot(p)

Selecting what to plot

The data above has only 4 interval groups (hereafter, “transcripts”) so all of them were included in the plot. By default, a maximum of 25 transcripts are plotted, customizable with the max_shown parameter of plot. Below, we can set the maximum number of transcripts show as 2. Note the warning shown:

>>> pre.plot(p, max_shown=2)
_images/prp_rtd_03.png

To plot only a subset of the data, use the Pandas/PyRanges object’s slicing capabilities. For example, this plots the intervals on chromosome 2, positive strand, between positions 100 and 200:

>>> (p.loci[2, '+', 100:200]).plot()

By default, the limits of plot coordinates are set to show all the data, and leave some margin at the edges. This is customizable with the limits parameter. The user can decide to change all or some of the coordinate limits leaving the rest as default if desired. The limits parameter accepts different input types:

  • Dictionary with chromosome names as keys, and a tuple of two integer numbers indicating the limits`` to leave as default).

  • Tuple of two integer numbers, which sets the same limits for all plotted chromosomes.

  • PyRanges object, wherein Start and End columns define the limits for the corresponding Chromosome.

    >>> pre.plot(p, limit, 100), 2: (60, 20})
    
_images/prp_rtd_04.png

To plot with specified limits, use the following code:

>>> pre.plot(p, limits=(0,300))
_images/prp_rtd_05.png

Coloring

By default, the intervals are colored according to the ID column (transcript_id in this case, previously set as default with set_id_col).

We can select any other column to color the intervals by using the color_col parameter of plot. For example, let’s color by the Strand column:

>>> pre.plot(p, color_col="Strand")
_images/prp_rtd_06.png

Now the “+” strand transcripts are displayed in one color and the ones on the “-” strand in another color. Note that pyrangeyes used its default color scheme, and mapped each value in the color_col column to a color.

The colormap parameter of plot centralizes coloring customization. It is a versatile parameter, accepting many different types of input. Using a dictionary allows to exert full control over the coloring, explicitly setting each value-color pair:

>>> pre.plot(p, color_col="Strand",
...          colormap={"+": "green", "-": "red"})
_images/prp_rtd_07.png

Alternatively, the user may just define the sequence of colors used (letting pyrangeyes pick which color to assign to each value). One can provide a list of colors in hex or rgb; or a string recognized as the name of an available Matplotlib or Plotly colormap; or an actual Matplotlib or Plotly colormap object. Below, we invoke the “Dark2” Matplotlib colormap:

>>> pre.plot(p, colormap="Dark2")
_images/prp_rtd_08.png

To improve the clarity of the plot, we can enable a legend that labels each color, making it easier to interpret the intervals based on their assigned colors. This can be done by setting the legend parameter of plot as True:

>>> pre.plot(p, colormap="Dark2", legend=True)
_images/prp_rtd_20.png

In this section, we have seen how to color intervals based on their attributes. Next, we will see how to customize the appearance of the plot itself.

Appearance customization options: cheatsheet

A wide range of options are available to customize appearance, as summarized below:

_images/options_fig_wm.png

These options can be provided as parameters to the plot function, or set as default beforehand. Let’s see an example of providing them as parameters:

>>> pre.plot(p, plot_bkg="rgb(173, 216, 230)", plot_border="#808080", title_color="magenta")
_images/prp_rtd_15.png

To instead set these options as default, use the set_options function:

>>> pre.set_options('plot_bkg', 'rgb(173, 216, 230)')
>>> pre.set_options('plot_border', '#808080')
>>> pre.set_options('title_color', 'magenta')
>>> pre.plot(p)  # this will now open a plot identical to the previous one

To inspect the current default options, use the print_options function. Note that any modified values from the built-in defaults will be marked with an asterisk (*):

>>> pre.print_options()
+------------------+--------------------+---------+--------------------------------------------------------------+
|     Feature      |       Value        | Edited? |                         Description                          |
+------------------+--------------------+---------+--------------------------------------------------------------+
|     colormap     |       popart       |         | Sequence of colors to assign to every group of intervals     |
|                  |                    |         | sharing the same “color_col” value. It can be provided as a  |
|                  |                    |         | Matplotlib colormap, a Plotly color sequence (built as       |
|                  |                    |         | lists), a string naming the previously mentioned color       |
|                  |                    |         | objects from Matplotlib and Plotly, or a dictionary with     |
|                  |                    |         | the following structure {color_column_value1: color1,        |
|                  |                    |         | color_column_value2: color2, ...}. When a specific           |
|                  |                    |         | color_col value is not specified in the dictionary it will   |
|                  |                    |         | be colored in black.                                         |
|   exon_border    |         |         | Color of the interval's rectangle border.                    |
|     fig_bkg      |       white        |         | Bakground color of the whole figure.                         |
|    grid_color    |     lightgrey      |         | Color of x coordinates grid lines.                           |
|     plot_bkg     | rgb(173, 216, 230) |    *    | Background color of the plots.                               |
|   plot_border    |      #808080       |    *    | Color of the line delimiting the plots.                      |
|    shrunk_bkg    |    lightyellow     |         | Color of the shrunk region background.                       |
|     tag_bkg      |        grey        |         | Background color of the tooltip annotation for the gene in   |
|                  |                    |         | Matplotlib.                                                  |
|   title_color    |      magenta       |    *    | Color of the plots' titles.                                  |
|    title_size    |         18         |         | Size of the plots' titles.                                   |
|     x_ticks      |         |         | Int, list or dict defining the x_ticks to be displayed.      |
|                  |                    |         | When int, number of ticks to be placed on each plot. When    |
|                  |                    |         | list, it corresponds to de values used as ticks. When dict,  |
|                  |                    |         | the keys must match the Chromosome values of the data,       |
|                  |                    |         | while the values can be either int or list of int; when int  |
|                  |                    |         | it corresponds to the number of ticks to be placed; when     |
|                  |                    |         | list of int it corresponds to de values used as ticks. Note  |
|                  |                    |         | that when the tick falls within a shrunk region it will not  |
|                  |                    |         | be diplayed.                                                 |
+------------------+--------------------+---------+--------------------------------------------------------------+
|   arrow_color    |        grey        |         | Color of the arrow indicating strand.                        |
| arrow_line_width |         1          |         | Line width of the arrow lines                                |
|    arrow_size    |       0.006        |         | Float corresponding to the fraction of the plot or int       |
|                  |                    |         | corresponding to the number of positions occupied by a       |
|                  |                    |         | direction arrow.                                             |
|   exon_height    |        0.6         |         | Height of the exon rectangle in the plot.                    |
|   intron_color   |         |         | Color of the intron lines, the color of the       |
|                  |                    |         | first interval will be used.                                 |
|     text_pad     |       0.005        |         | Space where the id annotation is placed beside the           |
|                  |                    |         | interval. When text_pad is float, it represents the          |
|                  |                    |         | percentage of the plot space, while an int pad represents    |
|                  |                    |         | number of positions or base pairs.                           |
|    text_size     |         10         |         | Fontsize of the text annotation beside the intervals.        |
|     v_spacer     |        0.5         |         | Vertical distance between the intervals and plot border.     |
+------------------+--------------------+---------+--------------------------------------------------------------+
|   plotly_port    |        8050        |         | Port to run plotly app.                                      |
| shrink_threshold |        0.01        |         | Minimum length of an intron or intergenic region in order    |
|                  |                    |         | for it to be shrunk while using the “shrink” feature. When   |
|                  |                    |         | threshold is float, it represents the fraction of the plot   |
|                  |                    |         | space, while an int threshold represents number of           |
|                  |                    |         | positions or base pairs.                                     |
+------------------+--------------------+---------+--------------------------------------------------------------+

To reset options to built-in defaults, use reset_options. By default, it will reset all options. Providing arguments, you can select which options to reset:

>>> pre.reset_options('plot_background')  # reset one feature
>>> pre.reset_options(['plot_border', 'title_color'])  # reset a few features
>>> pre.reset_options()  # reset all features

Built-in and custom themes

A pyrangeyes theme is a collection of options for appearance customization (those displayed above with print_options) each with a set value. Themes are implemented as dictionaries, that are passed to the set_theme function. In practice, setting a theme is equivalent to setting options like we did above with set_options, but with a single command.

For example, below we create a theme corresponding to the appearance of our last plot:

>>> my_theme = {
...     "plot_bkg": "rgb(173, 216, 230)",
...     "plot_border": "#808080",
...     "title_color": "magenta"
... }
>>> pre.set_theme(my_theme)
>>> pre.plot(p)  # this will now open a plot identical to the previous one

Pyrangeyes comes with a few built-in themes, listed in the set_theme function’s documentation. For example, here’s the “dark” theme:

>>> pre.set_theme('dark')
>>> pre.plot(p)
_images/prp_rtd_16.png

To reset the theme, you can resort again to reset_options.

Managing space: packed/unpacked, shrink

By default, pyrangeyes tries to save as much vertical space as possible, so the transcripts are placed one beside the other, in a “packed” disposition. To instead display one transcript per row, set the packed parameter as False:

pre.plot(p, packed=False, legend = False)
_images/prp_rtd_09.png

Pyrangeyes offers the option to reduce horizontal space, occupied by introns or intergenic regions, by activating the shrink parameter. The shrink_threshold determines the minimum length of a region without visible intervals to be shrunk. When a float is provided, it will be interpreted as a fraction of the visible coordinate limits, while when an int is given it will be interpreted as number of base pairs.

ppp = pre.example_data.p3
print(ppp)
index    |    Chromosome    Strand    Start    End      transcript_id
int64    |    object        object    int64    int64    object
-------  ---  ------------  --------  -------  -------  ---------------
0        |    1             +         90       92       t1
1        |    1             +         61       64       t1
2        |    1             +         104      113      t1
3        |    1             +         228      229      t1
...      |    ...           ...       ...      ...      ...
16       |    2             -         42       46       t5
17       |    2             -         37       40       t5
18       |    2             +         60       70       t6
19       |    2             +         80       90       t6
PyRanges with 20 rows, 5 columns, and 1 index columns.
Contains 2 chromosomes and 2 strands.
pre.plot(ppp, shrink=True)
_images/prp_rtd_13.png
pre.plot(ppp, shrink=True, shrink_threshold=0.2)
_images/prp_rtd_14.png

Showing mRNA structure

A familiar visualization to many bioinformaticians involves showing the mRNA structure with coding sequences (CDS) displayed thicker than UTR (untranslated) regions. This is achieved by setting the thick_cds parameter to True. Note that data must be coded like standard GFF/GTF files, with different rows for exons and for CDS, wherein CDS are subsets of exons. A “Feature” column must be present and contain “exon” or “CDS” values:

pp = pre.example_data.p2
print(pp)
index    |    Chromosome    Strand    Start    End      transcript_id    feature1    feature2    Feature
int64    |    int64         object    int64    int64    object           object      object      object
-------  ---  ------------  --------  -------  -------  ---------------  ----------  ----------  ---------
0        |    1             +         1        11       t1               1           A           exon
1        |    1             +         40       60       t1               1           A           exon
2        |    2             -         10       25       t2               1           B           CDS
3        |    2             -         70       80       t2               1           B           CDS
...      |    ...           ...       ...      ...      ...              ...         ...         ...
10       |    4             -         30500    30700    t5               2           E           CDS
11       |    4             -         30647    30700    t5               2           E           exon
12       |    4             +         29850    29900    t6               2           F           CDS
13       |    4             +         29970    30000    t6               2           F           CDS
PyRanges with 14 rows, 8 columns, and 1 index columns.
Contains 4 chromosomes and 2 strands.
pre.plot(pp, thick_cds=True)
_images/prp_rtd_12.png

Displaying multiple PyRanges objects

In some cases, the data intervals might overlap. An example could be when some intervals in the PyRanges object correspond to exons and others correspond to “GCA” appearances. For such cases, the thickness_col and depth_col parameters are implemented.

The plot function can accept more than one PyRanges object, provided as a list. In this case, pyrangeyes will display them in the same plot, one on top of the other, for each common chromosome. The intervals of different PyRanges object are separated by a vertical spacer.

Let’s see an example with two PyRanges objects, mapping the occurrences of two amino acids, alanine and cysteine:

p_ala = pre.example_data.p_ala
p_cys = pre.example_data.p_cys

print(p_ala)
print(p_cys)
index  |      Start      End    Chromosome  id        trait1    trait2      depth      thick
int64  |      int64    int64         int64  object    object    object      int64    float64
-------  ---  -------  -------  ------------  --------  --------  --------  -------  ---------
    0  |         10       20             1  gene1     exon      gene_1          0        0.3
    1  |         50       75             1  gene1     exon      gene_1          0        0.3
    2  |         90      130             1  gene1     exon      gene_1          0        0.3
    3  |         13       16             1  gene1     aa        Ala             1        0.6
    4  |         60       63             1  gene1     aa        Ala             1        0.6
    5  |         72       75             1  gene1     aa        Ala             1        0.6
    6  |        120      123             1  gene1     aa        Ala             1        0.6
PyRanges with 7 rows, 8 columns, and 1 index columns.
Contains 1 chromosomes.

index  |      Start      End    Chromosome  id        trait1    trait2      depth      thick
int64  |      int64    int64         int64  object    object    object      int64    float64
-------  ---  -------  -------  ------------  --------  --------  --------  -------  ---------
    0  |         10       20             1  gene1     exon      gene_1          0        0.3
    1  |         50       75             1  gene1     exon      gene_1          0        0.3
    2  |         90      130             1  gene1     exon      gene_1          0        0.3
    3  |         15       18             1  gene1     aa        Cys             1        0.6
    4  |         55       58             1  gene1     aa        Cys             1        0.6
    5  |         62       65             1  gene1     aa        Cys             1        0.6
    6  |        100      103             1  gene1     aa        Cys             1        0.6
    7  |        110      113             1  gene1     aa        Cys             1        0.6
PyRanges with 8 rows, 8 columns, and 1 index columns.
Contains 1 chromosomes.
pre.plot([p_ala, p_cys])
_images/prp_rtd_17.png

When providing multiple PyRanges objects, it is useful to differentiate them in the plot. The y_labels parameter allows to provide a list of strings, one for each PyRanges object, to be displayed on the left side of the plot:

pre.plot(
    [p_ala, p_cys],
    y_labels=["pr Alanine", "pr Cysteine"]
)
_images/prp_rtd_18.png

Customizing depth and thickness

When dealing with overlapping intervals (e.g. see data above), the default visualization may fail to show relevant information, because some intervals are hidden behind others. To address this, the depth_col parameter can be used to highlight overlapping intervals. This parameter accepts a column name from the PyRanges object, which must contain integer values. The higher the value, the closer the interval will be to the top of the plot, ensuring its visibility:

pre.plot(
    [p_ala, p_cys],
    id_col="id",
    y_labels=["pr Alanine", "pr Cysteine"],
    depth_col="depth"
)
_images/prp_rtd_19.png

Another way to highlight overlapping regions is by playing with the height (or thickness) of the blocks representing intervals. This is achieved by using the thickness_col parameter, which defines a data column name whose values determine thickness of the corresponding intervals:

pre.plot(
    [p_ala, p_cys],
    id_col="id",
    color_col="trait1",
    y_labels=["pr Alanine", "pr Cysteine"],
    thickness_col="thick",
)
_images/prp_rtd_11.png

Additional information: tooltips and titles

In interactive plots there is the option of showing information about the gene when the mouse is placed over its structure. This information always shows the gene’s strand if it exists, the start and end coordinates and the ID. To add information contained in other dataframe columns to the tooltip, a string should be given to the tooltip parameter. This string must contain the desired column names within curly brackets as shown below.

Similarly, the title of the chromosome plots can be customized giving the desired string to the title_chr parameter, where the correspondent chromosome value of the data is referred to as {chrom}. An example could be the following:

pre.plot(
    p,
    tooltip="first feature: {feature1}\nsecond feature: {feature2}",
    title_chr='Chr: {chrom}'
    )
_images/prp_rtd_10.png

Dealing with vcf files

While Pyrangeyes is widely recognized for its robust capabilities in visualizing and managing gene annotations, its functionality extends well beyond this. Pyrangeyes also provides versatile tools for working with Variant Call Format (VCF) files, a standard file format used for storing genetic variant information. This includes parsing VCF files, handling complex metadata and visualizing genetic variants alongside gene annotations.

To begin, we need to set Plotly as the rendering engine for visualizing the data. Then, we can load an example annotation in GFF3 format, which consists of a portion of the genome annotation of Homo sapiens chromosome 1:

>>> pre.set_engine("plotly")
>>> ann = pre.example_data.ncbi_gff()
>>> ann
index    |    Chromosome    Source         Feature     Start      End        Score     Strand      Frame     frame     ID                          logic_name           Name             ...
int64    |    category      object         category    int64      int64      object    category    object    object    object                      object               object           ...
-------  ---  ------------  -------------  ----------  ---------  ---------  --------  ----------  --------  --------  --------------------------  -------------------  ---------------  -----
0        |    1             havana         ncRNA_gene  173851423  173868940  .         -           .         .         gene:ENSG00000234741        havana_homo_sapiens  GAS5             ...
1        |    1             havana_tagene  lnc_RNA     173851423  173867989  .         -           .         .         transcript:ENST00000827943  nan                  GAS5-292         ...
2        |    1             havana_tagene  exon        173851423  173851602  .         -           .         .         nan                         nan                  ENSE00004240426  ...
3        |    1             havana_tagene  exon        173859207  173859305  .         -           .         .         nan                         nan                  ENSE00004240438  ...
...      |    ...           ...            ...         ...        ...        ...       ...         ...       ...       ...                         ...                  ...              ...
2009     |    1             havana         CDS         173947368  173947582  .         -           .         0         CDS:ENSP00000356667         nan                  nan              ...
2010     |    1             havana         lnc_RNA     173938575  173941449  .         -           .         .         transcript:ENST00000479099  nan                  RC3H1-203        ...
2011     |    1             havana         exon        173938575  173938871  .         -           .         .         nan                         nan                  ENSE00001445398  ...
2012     |    1             havana         exon        173941264  173941449  .         -           .         .         nan                         nan                  ENSE00001946317  ...
PyRanges with 2013 rows, 28 columns, and 1 index columns. (16 columns not shown: "biotype", "description", "gene_id", ...).
Contains 1 chromosomes and 1 strands.

Next, let’s load a VCF file, which contains variant information for Homo sapiens. This file is provided as part of the example dataset and can be loaded into memory as follows:

>>> vcf = pre.example_data.ncbi_vcf()
>>> vcf
index    |    Chromosome    Start     ID            REF       ALT       QUAL      FILTER      ...
int64    |    object        int32     object        object    object    object    category    ...
-------  ---  ------------  --------  ------------  --------  --------  --------  ----------  -----
0        |    1             943995    rs761448939   C         G,T       nan       .           ...
1        |    1             964512    rs756054473   C         A,T       nan       .           ...
2        |    1             976215    rs7417106     A         C,G,T     nan       .           ...
3        |    1             1013983   rs1644247121  G         A         nan       .           ...
...      |    ...           ...       ...           ...       ...       ...       ...         ...
242182   |    Y             2787592   rs104894975   A         T         nan       .           ...
242183   |    Y             2787600   rs104894977   G         A         nan       .           ...
242184   |    Y             7063898   rs199659121   A         T         nan       .           ...
242185   |    Y             12735725  rs778145751   TAAGT     T         nan       .           ...
PyRanges with 242186 rows, 9 columns, and 1 index columns. (2 columns not shown: "INFO", "End").
Contains 25 chromosomes.

Above, we leveraged the builtin example data. In real use cases, you would load data from a file, using read_vcf().

By default, read_vcf() generates a PyRanges object that includes all the columns extracted from the VCF file. Additionally, it adds or modifies the following three columns, required to be a Pyranges object:

  • Chromosome: The chromosome name.

  • Start: The start position of the variant.

  • End: The end position of the variant.

The INFO column in the VCF file contains a wealth of additional information, often encoded as key-value pairs separated by semicolons. However, in its current form, this column is not readily interpretable or easy to analyze due to its compact format. Fortunately, you can easily manipulate the INFO column to expand and extract this embedded information into separate, more accessible columns using the split_fields() function:

>>> vcf_split = pre.vcf.split_fields(vcf,target_cols="INFO",field_sep=";")
>>> vcf_split
index    |    Chromosome    Start     ID            REF       ALT       QUAL      FILTER      End       INFO_0     INFO_1     INFO_2                  INFO_3                  ...
int64    |    object        int32     object        object    object    object    category    int32     object     object     object                  object                  ...
-------  ---  ------------  --------  ------------  --------  --------  --------  ----------  --------  ---------  ---------  ----------------------  ----------------------  -----
0        |    1             943995    rs761448939   C         G,T       nan       .           943996    dbSNP_156  TSA=SNV    E_Freq                  E_Cited                 ...
1        |    1             964512    rs756054473   C         A,T       nan       .           964513    dbSNP_156  TSA=SNV    E_Freq                  E_Cited                 ...
2        |    1             976215    rs7417106     A         C,G,T     nan       .           976216    dbSNP_156  TSA=SNV    E_Freq                  E_1000G                 ...
3        |    1             1013983   rs1644247121  G         A         nan       .           1013984   dbSNP_156  TSA=SNV    E_Phenotype_or_Disease  CLIN_pathogenic         ...
...      |    ...           ...       ...           ...       ...       ...       ...         ...       ...        ...        ...                     ...                     ...
242182   |    Y             2787592   rs104894975   A         T         nan       .           2787593   dbSNP_156  TSA=SNV    E_Cited                 E_Phenotype_or_Disease  ...
242183   |    Y             2787600   rs104894977   G         A         nan       .           2787601   dbSNP_156  TSA=SNV    E_Cited                 E_Phenotype_or_Disease  ...
242184   |    Y             7063898   rs199659121   A         T         nan       .           7063899   dbSNP_156  TSA=SNV    E_Freq                  E_Cited                 ...
242185   |    Y             12735725  rs778145751   TAAGT     T         nan       .           12735726  dbSNP_156  TSA=indel  E_Freq                  E_Cited                 ...
PyRanges with 242186 rows, 28 columns, and 1 index columns. (16 columns not shown: "INFO_4", "INFO_5", "INFO_6", ...).
Contains 25 chromosomes.

Note that the column names generated when splitting the INFO column are assigned sequentially, prefixed with the name of the original column (e.g., INFO_0, INFO_1, and so on). If you prefer more descriptive column names, you have two options. You can use the col_name_sep parameter to automatically extract the column names written in the VCF file (e.g., key-value pairs like DP=10 will produce a column named DP). Alternatively, you can use the col_names parameter to manually specify each column name, giving you full control over the naming scheme. Both approaches allow you to tailor the resulting column names to your specific needs, enhancing the readability and usability of your data.In this case, we are going to use the col_name_sep parameter to extract column names directly from the VCF file:

>>> vcf_split = pre.vcf.split_fields(vcf,target_cols="INFO",field_sep=";",col_name_sep="=")
>>> vcf_split
index    |    Chromosome    Start     ID            REF       ALT       QUAL      FILTER      End       INFO_0     TSA       INFO_2                  INFO_3                  ...
int64    |    object        int32     object        object    object    object    category    int32     object     object    object                  object                  ...
-------  ---  ------------  --------  ------------  --------  --------  --------  ----------  --------  ---------  --------  ----------------------  ----------------------  -----
0        |    1             943995    rs761448939   C         G,T       nan       .           943996    dbSNP_156  SNV       E_Freq                  E_Cited                 ...
1        |    1             964512    rs756054473   C         A,T       nan       .           964513    dbSNP_156  SNV       E_Freq                  E_Cited                 ...
2        |    1             976215    rs7417106     A         C,G,T     nan       .           976216    dbSNP_156  SNV       E_Freq                  E_1000G                 ...
3        |    1             1013983   rs1644247121  G         A         nan       .           1013984   dbSNP_156  SNV       E_Phenotype_or_Disease  CLIN_pathogenic         ...
...      |    ...           ...       ...           ...       ...       ...       ...         ...       ...        ...       ...                     ...                     ...
242182   |    Y             2787592   rs104894975   A         T         nan       .           2787593   dbSNP_156  SNV       E_Cited                 E_Phenotype_or_Disease  ...
242183   |    Y             2787600   rs104894977   G         A         nan       .           2787601   dbSNP_156  SNV       E_Cited                 E_Phenotype_or_Disease  ...
242184   |    Y             7063898   rs199659121   A         T         nan       .           7063899   dbSNP_156  SNV       E_Freq                  E_Cited                 ...
242185   |    Y             12735725  rs778145751   TAAGT     T         nan       .           12735726  dbSNP_156  indel     E_Freq                  E_Cited                 ...
PyRanges with 242186 rows, 31 columns, and 1 index columns. (19 columns not shown: "INFO_4", "INFO_5", "INFO_6", ...).
Contains 25 chromosomes.

Let’s begin plotting! First, we’ll select a specific region to focus on and observe the genes within it. For this example, the chosen region is 173900000:173920000:

>>> reg = ann.loci["1","-",173900000:173920000]
>>> reg['ID'] = reg['Parent']
>>> reg
index    |    Chromosome    Source          Feature          Start      End        Score     Strand      Frame     frame     ID                          logic_name                        ...
int64    |    category      object          category         int64      int64      object    category    object    object    object                      object                            ...
-------  ---  ------------  --------------  ---------------  ---------  ---------  --------  ----------  --------  --------  --------------------------  --------------------------------  -----
1953     |    1             ensembl_havana  gene             173903799  173917327  .         -           .         .         nan        ensembl_havana_gene_homo_sapiens  ...
1954     |    1             ensembl_havana  mRNA             173903799  173917327  .         -           .         .         gene:ENSG00000117601  nan                               ...
1955     |    1             ensembl_havana  three_prime_UTR  173903799  173903888  .         -           .         .         transcript:ENST00000367698                         nan                               ...
1956     |    1             ensembl_havana  exon             173903799  173904065  .         -           .         .         transcript:ENST00000367698                         nan                               ...
...      |    ...           ...             ...              ...        ...        ...       ...         ...       ...       ...                         ...                               ...
1977     |    1             havana          exon             173911979  173912014  .         -           .         .         transcript:ENST00000494024                         nan                               ...
1978     |    1             havana          exon             173914552  173914919  .         -           .         .         transcript:ENST00000494024                         nan                               ...
1979     |    1             havana          exon             173915017  173915186  .         -           .         .         transcript:ENST00000494024                         nan                               ...
1980     |    1             havana          exon             173917218  173917316  .         -           .         .         transcript:ENST00000494024                         nan                               ...
PyRanges with 28 rows, 28 columns, and 1 index columns. (17 columns not shown: "Name", "biotype", "description", ...).
Contains 1 chromosomes and 1 strands.

Similarly, we need to focus on the SNPs within the selected region:

>>> coord_vcf = vcf_split.loci["1",173900000:173920000]
>>> coord_vcf
index    |    Chromosome    Start      ID            REF         ALT       QUAL      FILTER      End        INFO_0     TSA       INFO_2                  INFO_3                  ...
int64    |    object        int32      object        object      object    object    category    int32      object     object    object                  object                  ...
-------  ---  ------------  ---------  ------------  ----------  --------  --------  ----------  ---------  ---------  --------  ----------------------  ----------------------  -----
12765    |    1             173903891  rs1572084425  A           G         nan       .           173903892  dbSNP_156  SNV       E_Cited                 E_Phenotype_or_Disease  ...
12766    |    1             173903902  rs121909564   G           A         nan       .           173903903  dbSNP_156  SNV       E_Freq                  E_Cited                 ...
12767    |    1             173903902  rs2102772927  GGGTTGGCTA  G         nan       .           173903903  dbSNP_156  deletion  E_Cited                 E_Phenotype_or_Disease  ...
12768    |    1             173903908  rs1572084448  G           T         nan       .           173903909  dbSNP_156  SNV       E_Cited                 E_Phenotype_or_Disease  ...
...      |    ...           ...        ...           ...         ...       ...       ...         ...        ...        ...       ...                     ...                     ...
12856    |    1             173914920  rs1572092195  C           G         nan       .           173914921  dbSNP_156  SNV       E_Phenotype_or_Disease  CLIN_likely_pathogenic  ...
12857    |    1             173917217  rs199469508   A           G         nan       .           173917218  dbSNP_156  SNV       E_Phenotype_or_Disease  CLIN_pathogenic         ...
12858    |    1             173917231  rs61736655    G           T         nan       .           173917232  dbSNP_156  SNV       E_Freq                  E_1000G                 ...
12859    |    1             173917430  rs1658038847  G           C         nan       .           173917431  dbSNP_156  SNV       E_Freq                  E_Cited                 ...
PyRanges with 95 rows, 31 columns, and 1 index columns. (19 columns not shown: "INFO_4", "INFO_5", "INFO_6", ...).
Contains 1 chromosomes.

Finally, we are ready to visualize our data. By combining the gene annotation from the selected genomic region with the prepared PyRanges object representing the SNPs, we can generate an insightful plot that overlays both datasets. Using the pre.plot function, you can pass the gene annotations and the SNPs together to create a detailed visualization. or this, simply specify the id_col parameter to indicate the column containing unique identifiers, such as the SNP IDs. Here’s how you can do it:

>>> pre.plot([reg,coord_vcf],id_col='ID')
_images/prp_rtd_21.png

In the figure above, the text displaying the ID of each variant may be misinterpreted due to overlapping with other SNP labels. To address this, you can create an artificial column that selectively displays this text only for annotation data while omitting it for VCF data:

>>> reg["Text_col"]=reg["Parent"]
>>> coord_vcf['Text_col'] = ''
>>> pre.plot([reg,coord_vcf],id_col='ID',text = '{Text_col}')
_images/prp_rtd_22.png

However, genome variant analysis is not limited to simply identifying the positions of variants. You might also want to explore the distribution of variants by analyzing the number of variants at each position. With Pyrangeyes, you can achieve this by first creating a scatterplot that visualizes these counts, and then including it as input in the add_aligned_plots parameter:

>>> import plotly.graph_objects as go
>>> aligned_traces = [
...     (go.Scatter(
...         x=[173905000, 173905500, 173906000, 173906500, 173907000, 173907500, 173908000, 173908500, 173909000, 173909500],
...         y=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
...         mode='markers'
...     ),{'title': 'Scatterplot', 'title_size': 18, 'title_color': 'green'})
... ]
>>> pre.plot([reg,coord_vcf],id_col='ID',text = '{Text_col}',add_aligned_plots=aligned_traces)
_images/prp_rtd_23.png

Warning

Be careful! The add_aligned_plots parameter is currently only supported when your input data contains a single chromosome. If your dataset spans multiple chromosomes, you will need to filter it beforehand to focus on a specific chromosome for this feature to work correctly.

As you observed, the add_aligned_plots parameter accepts as input a list of tuples, where each tuple consists of two elements: the first is the scatterplot object, and the second is a dictionary for customizing the title of the aligned plot.This dictionary allows you to control three title parameters:

  • title: The text of the title.

  • title_size: The font size of the title.

  • title_color: The color of the title text.

  • y_space: Determines de distance between the main plot and the aligned plots

  • height: Determines the height of the added plot

We already used the options to customise the title., let’s now customise the y axis length and the space between these plots:

>>> aligned_traces = [
...          (go.Scatter(
...              x=[173905000, 173905500, 173906000, 173906500, 173907000, 173907500, 173908000, 173908500, 173909000, 173909500],
...              y=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
...              mode='markers'
...          ),{'title': 'Scatterplot', 'title_size': 18, 'title_color': 'green', 'height': 0.5, 'y_space': 0.5})
... ]
>>> pre.plot([reg,coord_vcf],id_col='ID',text = '{Text_col}',add_aligned_plots=aligned_traces)
_images/prp_rtd_24.png

If your dataset is too large to manually create a Plotly scatterplot, Pyrangeyes offers a convenient function called make_scatter(). This function allows you to automatically generate a scatterplot directly from your data, introducing the numeric column for the y axis.

First we will use Numpy to create a random Count column

>>> import numpy as np
>>> coord_vcf['Count']=coord_vcf.apply(lambda row: np.random.randint(0, 100), axis=1)

Next, we will use this column to define the y-axis for the plot:

>>> aligned = pre.make_scatter(coord_vcf, y='Count')
>>> pre.plot([reg,coord_vcf],id_col='ID',text = '{Text_col}',add_aligned_plots=[aligned])
_images/prp_rtd_25.png

The make_scatter() function includes several options that allow you to customize your plot to better fit your needs. For instance, you can use the following parameters:

  • color_by: Specify a column from your dataset to color the markers based on its values.

  • title: Set a custom title for your scatterplot.

  • title_size: Adjust the font size of the title for better visibility.

  • title_color: Change the color of the title text to match your design preferences.

  • size_by: Define a column to dynamically adjust the marker sizes based on its values.

  • y_space: Determines de distance between the main plot and the aligned plots

  • height: Determines the height of the added plot

These customization options make it easy to generate informative and visually appealing scatterplots tailored to your data. In our case we are going to color our genetic variants by its type (TSA column):

>>> aligned = pre.make_scatter(coord_vcf, y='Count',color_by="TSA", title="Human Variants", title_color="Magenta",title_size=18)
>>> pre.plot([reg,coord_vcf],id_col='ID',text = '{Text_col}',add_aligned_plots=[aligned])
_images/prp_rtd_26.png

Enhancing Pyrangeyes with External Visualizations

A typical genomic analysis often involves more than just visualizing genomic intervals. Researchers frequently need to incorporate additional plots—potentially using different axes or plot types—to provide context or enhance the interpretation of results. Pyrangeyes allows you to export your plot to a variable by using the return_plot parameter. This parameter accepts two values:

  • app: Returns a Dash object, which can be integrated into a custom dashboard.

  • fig: Returns the figure and axes of the data, enabling direct manipulation or combination with other Plotly figures.

Example:

>>> p = pre.plot([reg,coord_vcf],id_col='ID',text = '{Text_col}', return_plot='app')
>>> p
<dash.dash.Dash object at 0x73321d74e990>

Imagine you have your VCF plot and want to visualize how many variants are present in your dataset. For instance, first you can export the Pyrangeyes dash object and then you can create a pie chart to display the distribution of variants by type and seamlessly integrate it into the Pyrangeyes layout. Below is an example of a Pyrangeyes combined with a horizontally aligned pie chart:

from dash import Dash, html, dcc

p = pre.plot([reg,coord_vcf],id_col='ID',text = '{Text_col}', return_plot='app')

# Example additional data
variant_types = ["Missense", "Synonymous", "Nonsense", "Frameshift", "Splice Site"]
variant_counts = [30, 20, 10, 15, 25]  # Example counts or proportions

# Create a pie chart
pie_chart = go.Figure(
    go.Pie(
        labels=variant_types,
        values=variant_counts,
        hoverinfo="label+percent",
        textinfo="label+percent",
    )
)
pie_chart.update_layout(title={"text": "Variant Types", "font": {"color": "black", "size": 18}, "x": 0.5},
                    margin=dict(l=10, r=10, t=30, b=10))

# Access and extend the existing Dash app's layout
p.layout = html.Div(
    [
        html.Div(
            [
                html.Div([p.layout], style={"width": "70%", "display": "flex", "justify-content": "center"}),
                html.Div(
                    [dcc.Graph(figure=pie_chart)],
                    style={
                        "width": "70%",
                        "display": "flex",
                        "align-items": "center",
                        "justify-content": "center",
                    },
                ),
            ],
            style={"display": "flex", "flex-direction": "row"},  # Arrange side by side
        )
    ]
)

# Run the Dash app
if __name__ == "__main__":
    p.run_server(debug=True)
_images/prp_rtd_27.png

Warning

Hey! This code may cause issues if it is run in an IPython shell. For a smoother experience, consider using a Jupyter Notebook instead.

This layout can also be implemented vertically, allowing you to stack the Pyrangeyes and the pie chart for a clear and intuitive visualization. Here’s how you can achieve this configuration:

from dash import Dash, html, dcc

p = pre.plot([reg,p_vcf[0]],id_col='ID',text = '{Artificial_col}', return_plot='app')

# Example additional data
variant_types = ["Missense", "Synonymous", "Nonsense", "Frameshift", "Splice Site"]
variant_counts = [30, 20, 10, 15, 25]  # Example counts or proportions

# Create a pie chart
pie_chart = go.Figure(
    go.Pie(
        labels=variant_types,
        values=variant_counts,
        hoverinfo="label+percent",
        textinfo="label+percent",
    )
)
pie_chart.update_layout(title={"text": "Variant Types", "font": {"color": "black", "size": 18}, "x": 0.5},
                    margin=dict(l=10, r=10, t=30, b=10))


# Access and extend the existing Dash app's layout
p.layout = html.Div(
    [
        p.layout,  # Retain the existing layout from pre.plot
        html.Div(
            [
                dcc.Graph(figure=pie_chart, style={"margin-bottom": "20px"}),
            ],
            style={"display": "flex", "flex-direction": "column"}  # Arrange vertically
        )
    ]
)

# Run the Dash app
if __name__ == "__main__":
    p.run_server(debug=True)
_images/prp_rtd_28.png