.. _tutorial:

Tutorial
~~~~~~~~

This tutorial assumes some familiarity with pyranges v1.
If necessary, go through its tutorial first: https://pyranges1.readthedocs.io/

.. contents:: Contents of Tutorial
   :depth: 3


Getting started
---------------

The first compulsory step to obtain a plot is setting the **engine**, using function
:func:`set_engine <pyrangeyes.set_engine>` after importing. We also **register** the plot function
using :func:`register_plot <pyrangeyes.register_plot>`, which is optional but convenient:
it allows to use the plot function directly from PyRanges objects (further explained later).

    >>> import pyrangeyes as pre
    >>> pre.set_engine("plotly")  # possible engines: "plotly" and "matplotlib"
    >>> pre.register_plot()


Pyrangeyes centralizes the interface to producing graphics in
the :func:`plot <pyrangeyes.plot>` function. It offers plenty of options to
customize the appearance of the plot, showcased in this tutorial.
To that end, we will use some example data included in the Pyrangeyes package.
Yet, any PyRanges object can be used, e.g. loaded from gff, gtf, bam files.

    >>> p = pre.example_data.p1
    >>> print(p)
      index  |      Chromosome  Strand      Start      End  transcript_id    feature1    feature2
      int64  |           int64  object      int64    int64  object           object      object
    -------  ---  ------------  --------  -------  -------  ---------------  ----------  ----------
          0  |               1  +               1       11  t1               a           A
          1  |               1  +              40       60  t1               a           A
          2  |               2  -              10       25  t2               b           B
          3  |               2  -              70       80  t2               b           B
          4  |               2  +              85      100  t3               c           C
          5  |               2  +             110      115  t3               c           C
          6  |               2  +             150      180  t3               c           C
          7  |               3  +             140      152  t4               d           D
    PyRanges with 8 rows, 7 columns, and 1 index columns.
    Contains 3 chromosomes and 2 strands.

By default, :func:`plot <pyrangeyes.plot>` produces an interactive plot. If the Matplotlib engine is selected,
a window appears. If the Plotly engine is selected, a server is automatically opened, and
an address is printed in the console. The plot can be accessed by opening this address in a browser.

    >>> pre.plot(p)

.. image:: images/prp_rtd_01.png

Interactive navigation is intuitive:

* Hover over intervals to see their details in a **tooltip**
* Click and drag to zoom in on a region.
* Double-click to reset the zoom level.
* Inspect the rest of buttons on the top-right to see other available actions.

To create a pdf or png image file instead of opening an interactive plot,
use the ``to_file`` parameter of :func:`plot <pyrangeyes.plot>`.

    >>> pre.plot(p, to_file="my_plot.png")

Because we **registered** the plot function, we can also invoke it like a method of the PyRanges object, as
``PyRanges.plot(...)``. This is equivalent to the previous code:

    >>> p.plot(to_file="my_plot.png")

In the figure above, intervals are displayed individually, i.e. each PyRanges row is treated as a separate entity.
To link the intervals instead, as to represent a transcript composed of exons, use the ``id_column`` parameter,
indicating the column name that defines the groups of intervals.

    >>> pre.plot(p, id_col="transcript_id")

.. image:: images/prp_rtd_02.png

Because the ``id_col`` parameter is used frequently, it can be set as default for all plots using function
:func:`set_id_col <pyrangeyes.set_id_col>`. The following code is equivalent to the previous one:

    >>> pre.set_id_col("transcript_id")
    >>> pre.plot(p)


Selecting what to plot
----------------------
The data above has only 4 interval groups (hereafter, "transcripts") so all of them were included in the plot.
By default, a **maximum of 25 transcripts** are plotted, customizable with the ``max_shown`` parameter of
:func:`plot <pyrangeyes.plot>`.
Below, we can set the maximum number of transcripts show as 2. Note the warning shown:

    >>> pre.plot(p, max_shown=2)

.. image:: images/prp_rtd_03.png

To plot only a subset of the data, use the Pandas/PyRanges object's slicing capabilities.
For example, this plots the intervals on chromosome 2, positive strand, between positions 100 and 200:

    >>> (p.loci[2, '+', 100:200]).plot()

By default, the **limits of plot coordinates** are set to show all the data, and leave some margin at the edges.
This is customizable with the ``limits`` parameter.
The user can decide to change all or some of the coordinate limits leaving the rest as default if desired.
The ``limits`` parameter accepts different input types:

* Dictionary with chromosome names as keys, and a tuple of two integer numbers indicating the limits`` to leave as default).

* Tuple of two integer numbers, which sets the same limits for all plotted chromosomes.

* PyRanges object, wherein Start and End columns define the limits for the corresponding Chromosome.

    >>> pre.plot(p, limit, 100), 2: (60, 20})

.. image:: images/prp_rtd_04.png

To plot with specified limits, use the following code:

    >>> pre.plot(p, limits=(0,300))

.. image:: images/prp_rtd_05.png

Coloring
--------
By default, the intervals are **colored** according to the ID column
(``transcript_id`` in this case,  previously set as default with :func:`set_id_col <pyrangeyes.set_id_col>`).

We can select any other column to color the intervals by using the ``color_col`` parameter
of :func:`plot <pyrangeyes.plot>`.
For example, let's color by the Strand column:

    >>> pre.plot(p, color_col="Strand")

.. image:: images/prp_rtd_06.png

Now the "+" strand transcripts are displayed in one color and the ones on the "-" strand in another color.
Note that pyrangeyes used its default color scheme, and mapped each value in the  ``color_col`` column to a color.

The  **colormap** parameter of :func:`plot <pyrangeyes.plot>` centralizes coloring customization.
It is a versatile parameter, accepting many different types of input.
Using a dictionary allows to exert full control over the coloring, explicitly setting each value-color pair:

    >>> pre.plot(p, color_col="Strand",
    ...          colormap={"+": "green", "-": "red"})

.. image:: images/prp_rtd_07.png

Alternatively, the user may just define the sequence of colors used
(letting pyrangeyes pick which color to assign to each value).
One can provide a list of colors in hex or rgb; or a string recognized as the name of an available
Matplotlib or Plotly colormap;
or an actual Matplotlib or Plotly colormap object. Below, we invoke the "Dark2" Matplotlib colormap:

    >>> pre.plot(p, colormap="Dark2")

.. image:: images/prp_rtd_08.png

To improve the clarity of the plot, we can enable a legend that labels each color, making it easier 
to interpret the intervals based on their assigned colors. This can be done by setting the 
**legend** parameter of :func:`plot <pyrangeyes.plot>` as True:

    >>> pre.plot(p, colormap="Dark2", legend=True)

.. image:: images/prp_rtd_20.png

In this section, we have seen how to color intervals based on their attributes.
Next, we will see how to customize the appearance of the plot itself.


Appearance customization options: cheatsheet
--------------------------------------------

A wide range of **options** are available to customize appearance, as summarized below:

.. image:: images/options_fig_wm.png

These options can be provided as parameters to the :func:`plot <pyrangeyes.plot>` function, or
set as default beforehand. Let's see an example of providing them as parameters:

    >>> pre.plot(p, plot_bkg="rgb(173, 216, 230)", plot_border="#808080", title_color="magenta")

.. image:: images/prp_rtd_15.png

To instead set these options as default, use the :func:`set_options <pyrangeyes.set_options>` function:

    >>> pre.set_options('plot_bkg', 'rgb(173, 216, 230)')
    >>> pre.set_options('plot_border', '#808080')
    >>> pre.set_options('title_color', 'magenta')
    >>> pre.plot(p)  # this will now open a plot identical to the previous one

To inspect the current default options, use the
:func:`print_options <pyrangeyes.print_options>` function.
Note that any modified values from the built-in defaults will be marked with an asterisk (*):

    >>> pre.print_options()
    +------------------+--------------------+---------+--------------------------------------------------------------+
    |     Feature      |       Value        | Edited? |                         Description                          |
    +------------------+--------------------+---------+--------------------------------------------------------------+
    |     colormap     |       popart       |         | Sequence of colors to assign to every group of intervals     |
    |                  |                    |         | sharing the same “color_col” value. It can be provided as a  |
    |                  |                    |         | Matplotlib colormap, a Plotly color sequence (built as       |
    |                  |                    |         | lists), a string naming the previously mentioned color       |
    |                  |                    |         | objects from Matplotlib and Plotly, or a dictionary with     |
    |                  |                    |         | the following structure {color_column_value1: color1,        |
    |                  |                    |         | color_column_value2: color2, ...}. When a specific           |
    |                  |                    |         | color_col value is not specified in the dictionary it will   |
    |                  |                    |         | be colored in black.                                         |
    |   exon_border    |         |         | Color of the interval's rectangle border.                    |
    |     fig_bkg      |       white        |         | Bakground color of the whole figure.                         |
    |    grid_color    |     lightgrey      |         | Color of x coordinates grid lines.                           |
    |     plot_bkg     | rgb(173, 216, 230) |    *    | Background color of the plots.                               |
    |   plot_border    |      #808080       |    *    | Color of the line delimiting the plots.                      |
    |    shrunk_bkg    |    lightyellow     |         | Color of the shrunk region background.                       |
    |     tag_bkg      |        grey        |         | Background color of the tooltip annotation for the gene in   |
    |                  |                    |         | Matplotlib.                                                  |
    |   title_color    |      magenta       |    *    | Color of the plots' titles.                                  |
    |    title_size    |         18         |         | Size of the plots' titles.                                   |
    |     x_ticks      |         |         | Int, list or dict defining the x_ticks to be displayed.      |
    |                  |                    |         | When int, number of ticks to be placed on each plot. When    |
    |                  |                    |         | list, it corresponds to de values used as ticks. When dict,  |
    |                  |                    |         | the keys must match the Chromosome values of the data,       |
    |                  |                    |         | while the values can be either int or list of int; when int  |
    |                  |                    |         | it corresponds to the number of ticks to be placed; when     |
    |                  |                    |         | list of int it corresponds to de values used as ticks. Note  |
    |                  |                    |         | that when the tick falls within a shrunk region it will not  |
    |                  |                    |         | be diplayed.                                                 |
    +------------------+--------------------+---------+--------------------------------------------------------------+
    |   arrow_color    |        grey        |         | Color of the arrow indicating strand.                        |
    | arrow_line_width |         1          |         | Line width of the arrow lines                                |
    |    arrow_size    |       0.006        |         | Float corresponding to the fraction of the plot or int       |
    |                  |                    |         | corresponding to the number of positions occupied by a       |
    |                  |                    |         | direction arrow.                                             |
    |   exon_height    |        0.6         |         | Height of the exon rectangle in the plot.                    |
    |   intron_color   |         |         | Color of the intron lines, the color of the       |
    |                  |                    |         | first interval will be used.                                 |
    |     text_pad     |       0.005        |         | Space where the id annotation is placed beside the           |
    |                  |                    |         | interval. When text_pad is float, it represents the          |
    |                  |                    |         | percentage of the plot space, while an int pad represents    |
    |                  |                    |         | number of positions or base pairs.                           |
    |    text_size     |         10         |         | Fontsize of the text annotation beside the intervals.        |
    |     v_spacer     |        0.5         |         | Vertical distance between the intervals and plot border.     |
    +------------------+--------------------+---------+--------------------------------------------------------------+
    |   plotly_port    |        8050        |         | Port to run plotly app.                                      |
    | shrink_threshold |        0.01        |         | Minimum length of an intron or intergenic region in order    |
    |                  |                    |         | for it to be shrunk while using the “shrink” feature. When   |
    |                  |                    |         | threshold is float, it represents the fraction of the plot   |
    |                  |                    |         | space, while an int threshold represents number of           |
    |                  |                    |         | positions or base pairs.                                     |
    +------------------+--------------------+---------+--------------------------------------------------------------+

To reset options to built-in defaults,  use :func:`reset_options <pyrangeyes.reset_options>`.
By default, it will reset all options. Providing arguments, you can select which options to reset:

    >>> pre.reset_options('plot_background')  # reset one feature
    >>> pre.reset_options(['plot_border', 'title_color'])  # reset a few features
    >>> pre.reset_options()  # reset all features


Built-in and custom themes
--------------------------

A pyrangeyes **theme** is a collection of options for appearance customization (those displayed above
with :func:`print_options <pyrangeyes.print_options>`) each with a set value.
Themes are implemented as dictionaries, that are passed to the :func:`set_theme <pyrangeyes.set_theme>` function.
In practice, setting a theme is equivalent to setting options like we did above
with :func:`set_options <pyrangeyes.set_options>`, but with a single command.

For example, below we create a theme corresponding to the appearance of our last plot:

    >>> my_theme = {
    ...     "plot_bkg": "rgb(173, 216, 230)",
    ...     "plot_border": "#808080",
    ...     "title_color": "magenta"
    ... }
    >>> pre.set_theme(my_theme)
    >>> pre.plot(p)  # this will now open a plot identical to the previous one

Pyrangeyes comes with a few built-in themes, listed in the :func:`set_theme <pyrangeyes.set_theme>` function's
documentation. For example, here's the "dark" theme:

    >>> pre.set_theme('dark')
    >>> pre.plot(p)

.. image:: images/prp_rtd_16.png

To reset the theme, you can resort again to :func:`reset_options <pyrangeyes.reset_options>`.


Managing space: packed/unpacked, shrink
---------------------------------------

By default, pyrangeyes tries to save as much vertical space as possible,
so the transcripts are placed one beside the other, in a "packed" disposition.
To instead display one transcript per row, set the ``packed`` parameter as ``False``:

.. code-block::

    pre.plot(p, packed=False, legend = False)

.. image:: images/prp_rtd_09.png


Pyrangeyes offers the option to reduce horizontal space, occupied by introns or intergenic regions,
by activating the ``shrink`` parameter.
The  ``shrink_threshold`` determines the minimum length of a region without visible intervals to be shrunk.
When a float is provided, it will be interpreted as a fraction of the visible coordinate limits,
while when an int is given it will be interpreted as number of base pairs.

.. code-block::

    ppp = pre.example_data.p3
    print(ppp)


.. code-block::

    index    |    Chromosome    Strand    Start    End      transcript_id
    int64    |    object        object    int64    int64    object
    -------  ---  ------------  --------  -------  -------  ---------------
    0        |    1             +         90       92       t1
    1        |    1             +         61       64       t1
    2        |    1             +         104      113      t1
    3        |    1             +         228      229      t1
    ...      |    ...           ...       ...      ...      ...
    16       |    2             -         42       46       t5
    17       |    2             -         37       40       t5
    18       |    2             +         60       70       t6
    19       |    2             +         80       90       t6
    PyRanges with 20 rows, 5 columns, and 1 index columns.
    Contains 2 chromosomes and 2 strands.


.. code-block::

    pre.plot(ppp, shrink=True)

.. image:: images/prp_rtd_13.png

.. code-block::

    pre.plot(ppp, shrink=True, shrink_threshold=0.2)

.. image:: images/prp_rtd_14.png


Showing mRNA structure
----------------------

A familiar visualization to many bioinformaticians involves showing the mRNA structure with coding sequences (CDS)
displayed thicker than UTR (untranslated) regions. This is achieved by setting the ``thick_cds`` parameter to ``True``.
Note that data must be coded like standard GFF/GTF files,
with different rows for exons and for CDS, wherein CDS are subsets of exons. A "Feature" column must be present
and contain "exon" or "CDS" values:

.. code-block::

    pp = pre.example_data.p2
    print(pp)


.. code-block::

    index    |    Chromosome    Strand    Start    End      transcript_id    feature1    feature2    Feature
    int64    |    int64         object    int64    int64    object           object      object      object
    -------  ---  ------------  --------  -------  -------  ---------------  ----------  ----------  ---------
    0        |    1             +         1        11       t1               1           A           exon
    1        |    1             +         40       60       t1               1           A           exon
    2        |    2             -         10       25       t2               1           B           CDS
    3        |    2             -         70       80       t2               1           B           CDS
    ...      |    ...           ...       ...      ...      ...              ...         ...         ...
    10       |    4             -         30500    30700    t5               2           E           CDS
    11       |    4             -         30647    30700    t5               2           E           exon
    12       |    4             +         29850    29900    t6               2           F           CDS
    13       |    4             +         29970    30000    t6               2           F           CDS
    PyRanges with 14 rows, 8 columns, and 1 index columns.
    Contains 4 chromosomes and 2 strands.


.. code-block::

    pre.plot(pp, thick_cds=True)

.. image:: images/prp_rtd_12.png


Displaying multiple PyRanges objects
------------------------------------

In some cases, the data intervals might overlap. An example could be when some intervals in
the PyRanges object correspond to exons and others correspond to "GCA" appearances. For such
cases, the ``thickness_col`` and ``depth_col`` parameters are implemented.

The :func:`plot <pyrangeyes.plot>` function can accept more than one PyRanges object, provided as a list.
In this case, pyrangeyes will display them in the same plot, one on top of the other, for each common chromosome.
The intervals of different PyRanges object are separated by a vertical spacer.

Let's see an example with two PyRanges objects, mapping the occurrences of two amino acids, alanine and cysteine:

.. code-block::

    p_ala = pre.example_data.p_ala
    p_cys = pre.example_data.p_cys

    print(p_ala)
    print(p_cys)


.. code-block::

    index  |      Start      End    Chromosome  id        trait1    trait2      depth      thick
    int64  |      int64    int64         int64  object    object    object      int64    float64
    -------  ---  -------  -------  ------------  --------  --------  --------  -------  ---------
        0  |         10       20             1  gene1     exon      gene_1          0        0.3
        1  |         50       75             1  gene1     exon      gene_1          0        0.3
        2  |         90      130             1  gene1     exon      gene_1          0        0.3
        3  |         13       16             1  gene1     aa        Ala             1        0.6
        4  |         60       63             1  gene1     aa        Ala             1        0.6
        5  |         72       75             1  gene1     aa        Ala             1        0.6
        6  |        120      123             1  gene1     aa        Ala             1        0.6
    PyRanges with 7 rows, 8 columns, and 1 index columns.
    Contains 1 chromosomes.

    index  |      Start      End    Chromosome  id        trait1    trait2      depth      thick
    int64  |      int64    int64         int64  object    object    object      int64    float64
    -------  ---  -------  -------  ------------  --------  --------  --------  -------  ---------
        0  |         10       20             1  gene1     exon      gene_1          0        0.3
        1  |         50       75             1  gene1     exon      gene_1          0        0.3
        2  |         90      130             1  gene1     exon      gene_1          0        0.3
        3  |         15       18             1  gene1     aa        Cys             1        0.6
        4  |         55       58             1  gene1     aa        Cys             1        0.6
        5  |         62       65             1  gene1     aa        Cys             1        0.6
        6  |        100      103             1  gene1     aa        Cys             1        0.6
        7  |        110      113             1  gene1     aa        Cys             1        0.6
    PyRanges with 8 rows, 8 columns, and 1 index columns.
    Contains 1 chromosomes.


.. code-block::

    pre.plot([p_ala, p_cys])

.. image:: images/prp_rtd_17.png

When providing multiple PyRanges objects, it is useful to differentiate them in the plot. The ``y_labels`` parameter
allows to provide a list of strings, one for each PyRanges object, to be displayed on the left side of the plot:

.. code-block::

    pre.plot(
        [p_ala, p_cys],
        y_labels=["pr Alanine", "pr Cysteine"]
    )

.. image:: images/prp_rtd_18.png

Customizing depth and thickness
-------------------------------

When dealing with overlapping intervals (e.g. see data above), the default visualization may fail to show
relevant information, because some intervals are hidden behind others. To address this, the
``depth_col`` parameter can be used to highlight overlapping intervals. This parameter accepts a
column name from the PyRanges object, which must contain integer values. The higher the value, the
closer the interval will be to the top of the plot, ensuring its visibility:

.. code-block::

    pre.plot(
        [p_ala, p_cys],
        id_col="id",
        y_labels=["pr Alanine", "pr Cysteine"],
        depth_col="depth"
    )

.. image:: images/prp_rtd_19.png

Another way to highlight overlapping regions is by playing with the height (or thickness) of the blocks representing
intervals. This is achieved by using the ``thickness_col`` parameter, which defines a data column name whose values
determine thickness of the corresponding intervals:

.. code-block::

    pre.plot(
        [p_ala, p_cys],
        id_col="id",
        color_col="trait1",
        y_labels=["pr Alanine", "pr Cysteine"],
        thickness_col="thick",
    )

.. image:: images/prp_rtd_11.png


Additional information: tooltips and titles
-------------------------------------------

In interactive plots there is the option of showing information about the gene when the
mouse is placed over its structure. This information always shows the gene's strand if
it exists, the start and end coordinates and the ID. To add information contained in other
dataframe columns to the tooltip, a string should be given to the ``tooltip`` parameter. This
string must contain the desired column names within curly brackets as shown below.

Similarly, the title of the chromosome plots can be customized giving the desired string to
the ``title_chr`` parameter, where the correspondent chromosome value of the data is referred
to as {chrom}. An example could be the following:

.. code-block::

    pre.plot(
        p,
        tooltip="first feature: {feature1}\nsecond feature: {feature2}",
        title_chr='Chr: {chrom}'
        )

.. image:: images/prp_rtd_10.png

Dealing with vcf files
----------------------

While Pyrangeyes is widely recognized for its robust capabilities in visualizing and managing 
gene annotations, its functionality extends well beyond this. Pyrangeyes also provides 
versatile tools for working with Variant Call Format (VCF) files, a standard file format used 
for storing genetic variant information. This includes parsing VCF files, handling complex metadata 
and visualizing genetic variants alongside gene annotations.

To begin, we need to set **Plotly** as the rendering engine for visualizing the data. Then, we can load 
an example annotation in GFF3 format, which consists of a portion of the genome annotation of Homo 
sapiens chromosome 1:

.. code-block::

    >>> pre.set_engine("plotly")
    >>> ann = pre.example_data.ncbi_gff()
    >>> ann
    index    |    Chromosome    Source         Feature     Start      End        Score     Strand      Frame     frame     ID                          logic_name           Name             ...
    int64    |    category      object         category    int64      int64      object    category    object    object    object                      object               object           ...
    -------  ---  ------------  -------------  ----------  ---------  ---------  --------  ----------  --------  --------  --------------------------  -------------------  ---------------  -----
    0        |    1             havana         ncRNA_gene  173851423  173868940  .         -           .         .         gene:ENSG00000234741        havana_homo_sapiens  GAS5             ...
    1        |    1             havana_tagene  lnc_RNA     173851423  173867989  .         -           .         .         transcript:ENST00000827943  nan                  GAS5-292         ...
    2        |    1             havana_tagene  exon        173851423  173851602  .         -           .         .         nan                         nan                  ENSE00004240426  ...
    3        |    1             havana_tagene  exon        173859207  173859305  .         -           .         .         nan                         nan                  ENSE00004240438  ...
    ...      |    ...           ...            ...         ...        ...        ...       ...         ...       ...       ...                         ...                  ...              ...
    2009     |    1             havana         CDS         173947368  173947582  .         -           .         0         CDS:ENSP00000356667         nan                  nan              ...
    2010     |    1             havana         lnc_RNA     173938575  173941449  .         -           .         .         transcript:ENST00000479099  nan                  RC3H1-203        ...
    2011     |    1             havana         exon        173938575  173938871  .         -           .         .         nan                         nan                  ENSE00001445398  ...
    2012     |    1             havana         exon        173941264  173941449  .         -           .         .         nan                         nan                  ENSE00001946317  ...
    PyRanges with 2013 rows, 28 columns, and 1 index columns. (16 columns not shown: "biotype", "description", "gene_id", ...).
    Contains 1 chromosomes and 1 strands.

Next, let's load a VCF file, which contains variant information for Homo sapiens. This file is 
provided as part of the example dataset and can be loaded into memory as follows:

.. code-block::

    >>> vcf = pre.example_data.ncbi_vcf()
    >>> vcf
    index    |    Chromosome    Start     ID            REF       ALT       QUAL      FILTER      ...
    int64    |    object        int32     object        object    object    object    category    ...
    -------  ---  ------------  --------  ------------  --------  --------  --------  ----------  -----
    0        |    1             943995    rs761448939   C         G,T       nan       .           ...
    1        |    1             964512    rs756054473   C         A,T       nan       .           ...
    2        |    1             976215    rs7417106     A         C,G,T     nan       .           ...
    3        |    1             1013983   rs1644247121  G         A         nan       .           ...
    ...      |    ...           ...       ...           ...       ...       ...       ...         ...
    242182   |    Y             2787592   rs104894975   A         T         nan       .           ...
    242183   |    Y             2787600   rs104894977   G         A         nan       .           ...
    242184   |    Y             7063898   rs199659121   A         T         nan       .           ...
    242185   |    Y             12735725  rs778145751   TAAGT     T         nan       .           ...
    PyRanges with 242186 rows, 9 columns, and 1 index columns. (2 columns not shown: "INFO", "End").
    Contains 25 chromosomes.

Above, we leveraged the builtin example data. In real use cases, you would load data from a file, 
using :func:`read_vcf() <pyrangeyes.vcf.read_vcf>`.

By default, :func:`read_vcf() <pyrangeyes.vcf.read_vcf>` generates a PyRanges object that includes all the columns extracted 
from the VCF file. Additionally, it adds or modifies the following three columns, required to be a Pyranges object:

* **Chromosome**: The chromosome name.
* **Start**: The start position of the variant.
* **End**: The end position of the variant.

The INFO column in the VCF file contains a wealth of additional information, often encoded as key-value 
pairs separated by semicolons. However, in its current form, this column is not readily interpretable 
or easy to analyze due to its compact format. Fortunately, you can easily manipulate the INFO column to 
expand and extract this embedded information into separate, more accessible columns using the 
:func:`split_fields() <pyrangeyes.vcf.split_fields>` function:

.. code-block::

    >>> vcf_split = pre.vcf.split_fields(vcf,target_cols="INFO",field_sep=";")
    >>> vcf_split
    index    |    Chromosome    Start     ID            REF       ALT       QUAL      FILTER      End       INFO_0     INFO_1     INFO_2                  INFO_3                  ...
    int64    |    object        int32     object        object    object    object    category    int32     object     object     object                  object                  ...
    -------  ---  ------------  --------  ------------  --------  --------  --------  ----------  --------  ---------  ---------  ----------------------  ----------------------  -----
    0        |    1             943995    rs761448939   C         G,T       nan       .           943996    dbSNP_156  TSA=SNV    E_Freq                  E_Cited                 ...
    1        |    1             964512    rs756054473   C         A,T       nan       .           964513    dbSNP_156  TSA=SNV    E_Freq                  E_Cited                 ...
    2        |    1             976215    rs7417106     A         C,G,T     nan       .           976216    dbSNP_156  TSA=SNV    E_Freq                  E_1000G                 ...
    3        |    1             1013983   rs1644247121  G         A         nan       .           1013984   dbSNP_156  TSA=SNV    E_Phenotype_or_Disease  CLIN_pathogenic         ...
    ...      |    ...           ...       ...           ...       ...       ...       ...         ...       ...        ...        ...                     ...                     ...
    242182   |    Y             2787592   rs104894975   A         T         nan       .           2787593   dbSNP_156  TSA=SNV    E_Cited                 E_Phenotype_or_Disease  ...
    242183   |    Y             2787600   rs104894977   G         A         nan       .           2787601   dbSNP_156  TSA=SNV    E_Cited                 E_Phenotype_or_Disease  ...
    242184   |    Y             7063898   rs199659121   A         T         nan       .           7063899   dbSNP_156  TSA=SNV    E_Freq                  E_Cited                 ...
    242185   |    Y             12735725  rs778145751   TAAGT     T         nan       .           12735726  dbSNP_156  TSA=indel  E_Freq                  E_Cited                 ...
    PyRanges with 242186 rows, 28 columns, and 1 index columns. (16 columns not shown: "INFO_4", "INFO_5", "INFO_6", ...).
    Contains 25 chromosomes.

Note that the column names generated when splitting the INFO column are assigned sequentially, prefixed with 
the name of the original column (e.g., INFO_0, INFO_1, and so on). If you prefer more descriptive column names, 
you have two options. You can use the **col_name_sep** parameter to automatically extract the column names written 
in the VCF file (e.g., key-value pairs like DP=10 will produce a column named DP). Alternatively, you can use 
the **col_names** parameter to manually specify each column name, giving you full control over the naming scheme. 
Both approaches allow you to tailor the resulting column names to your specific needs, enhancing the readability 
and usability of your data.In this case, we are going to use the col_name_sep parameter to extract column names 
directly from the VCF file:

.. code-block::

    >>> vcf_split = pre.vcf.split_fields(vcf,target_cols="INFO",field_sep=";",col_name_sep="=")
    >>> vcf_split
    index    |    Chromosome    Start     ID            REF       ALT       QUAL      FILTER      End       INFO_0     TSA       INFO_2                  INFO_3                  ...
    int64    |    object        int32     object        object    object    object    category    int32     object     object    object                  object                  ...
    -------  ---  ------------  --------  ------------  --------  --------  --------  ----------  --------  ---------  --------  ----------------------  ----------------------  -----
    0        |    1             943995    rs761448939   C         G,T       nan       .           943996    dbSNP_156  SNV       E_Freq                  E_Cited                 ...
    1        |    1             964512    rs756054473   C         A,T       nan       .           964513    dbSNP_156  SNV       E_Freq                  E_Cited                 ...
    2        |    1             976215    rs7417106     A         C,G,T     nan       .           976216    dbSNP_156  SNV       E_Freq                  E_1000G                 ...
    3        |    1             1013983   rs1644247121  G         A         nan       .           1013984   dbSNP_156  SNV       E_Phenotype_or_Disease  CLIN_pathogenic         ...
    ...      |    ...           ...       ...           ...       ...       ...       ...         ...       ...        ...       ...                     ...                     ...
    242182   |    Y             2787592   rs104894975   A         T         nan       .           2787593   dbSNP_156  SNV       E_Cited                 E_Phenotype_or_Disease  ...
    242183   |    Y             2787600   rs104894977   G         A         nan       .           2787601   dbSNP_156  SNV       E_Cited                 E_Phenotype_or_Disease  ...
    242184   |    Y             7063898   rs199659121   A         T         nan       .           7063899   dbSNP_156  SNV       E_Freq                  E_Cited                 ...
    242185   |    Y             12735725  rs778145751   TAAGT     T         nan       .           12735726  dbSNP_156  indel     E_Freq                  E_Cited                 ...
    PyRanges with 242186 rows, 31 columns, and 1 index columns. (19 columns not shown: "INFO_4", "INFO_5", "INFO_6", ...).
    Contains 25 chromosomes.

Let's begin plotting! First, we'll select a specific region to focus on and observe the genes within it. For this 
example, the chosen region is 173900000:173920000:

.. code-block::

    >>> reg = ann.loci["1","-",173900000:173920000]
    >>> reg['ID'] = reg['Parent']
    >>> reg
    index    |    Chromosome    Source          Feature          Start      End        Score     Strand      Frame     frame     ID                          logic_name                        ...
    int64    |    category      object          category         int64      int64      object    category    object    object    object                      object                            ...
    -------  ---  ------------  --------------  ---------------  ---------  ---------  --------  ----------  --------  --------  --------------------------  --------------------------------  -----
    1953     |    1             ensembl_havana  gene             173903799  173917327  .         -           .         .         nan        ensembl_havana_gene_homo_sapiens  ...
    1954     |    1             ensembl_havana  mRNA             173903799  173917327  .         -           .         .         gene:ENSG00000117601  nan                               ...
    1955     |    1             ensembl_havana  three_prime_UTR  173903799  173903888  .         -           .         .         transcript:ENST00000367698                         nan                               ...
    1956     |    1             ensembl_havana  exon             173903799  173904065  .         -           .         .         transcript:ENST00000367698                         nan                               ...
    ...      |    ...           ...             ...              ...        ...        ...       ...         ...       ...       ...                         ...                               ...
    1977     |    1             havana          exon             173911979  173912014  .         -           .         .         transcript:ENST00000494024                         nan                               ...
    1978     |    1             havana          exon             173914552  173914919  .         -           .         .         transcript:ENST00000494024                         nan                               ...
    1979     |    1             havana          exon             173915017  173915186  .         -           .         .         transcript:ENST00000494024                         nan                               ...
    1980     |    1             havana          exon             173917218  173917316  .         -           .         .         transcript:ENST00000494024                         nan                               ...
    PyRanges with 28 rows, 28 columns, and 1 index columns. (17 columns not shown: "Name", "biotype", "description", ...).
    Contains 1 chromosomes and 1 strands.

Similarly, we need to focus on the SNPs within the selected region:

.. code-block::

    >>> coord_vcf = vcf_split.loci["1",173900000:173920000]
    >>> coord_vcf
    index    |    Chromosome    Start      ID            REF         ALT       QUAL      FILTER      End        INFO_0     TSA       INFO_2                  INFO_3                  ...
    int64    |    object        int32      object        object      object    object    category    int32      object     object    object                  object                  ...
    -------  ---  ------------  ---------  ------------  ----------  --------  --------  ----------  ---------  ---------  --------  ----------------------  ----------------------  -----
    12765    |    1             173903891  rs1572084425  A           G         nan       .           173903892  dbSNP_156  SNV       E_Cited                 E_Phenotype_or_Disease  ...
    12766    |    1             173903902  rs121909564   G           A         nan       .           173903903  dbSNP_156  SNV       E_Freq                  E_Cited                 ...
    12767    |    1             173903902  rs2102772927  GGGTTGGCTA  G         nan       .           173903903  dbSNP_156  deletion  E_Cited                 E_Phenotype_or_Disease  ...
    12768    |    1             173903908  rs1572084448  G           T         nan       .           173903909  dbSNP_156  SNV       E_Cited                 E_Phenotype_or_Disease  ...
    ...      |    ...           ...        ...           ...         ...       ...       ...         ...        ...        ...       ...                     ...                     ...
    12856    |    1             173914920  rs1572092195  C           G         nan       .           173914921  dbSNP_156  SNV       E_Phenotype_or_Disease  CLIN_likely_pathogenic  ...
    12857    |    1             173917217  rs199469508   A           G         nan       .           173917218  dbSNP_156  SNV       E_Phenotype_or_Disease  CLIN_pathogenic         ...
    12858    |    1             173917231  rs61736655    G           T         nan       .           173917232  dbSNP_156  SNV       E_Freq                  E_1000G                 ...
    12859    |    1             173917430  rs1658038847  G           C         nan       .           173917431  dbSNP_156  SNV       E_Freq                  E_Cited                 ...
    PyRanges with 95 rows, 31 columns, and 1 index columns. (19 columns not shown: "INFO_4", "INFO_5", "INFO_6", ...).
    Contains 1 chromosomes.

Finally, we are ready to visualize our data. By combining the gene annotation from the selected genomic region with 
the prepared PyRanges object representing the SNPs, we can generate an insightful plot that overlays both datasets. 
Using the pre.plot function, you can pass the gene annotations and the SNPs together to create a detailed visualization. 
or this, simply specify the id_col parameter to indicate the column containing unique identifiers, such as the SNP IDs. 
Here's how you can do it:

.. code-block::

    >>> pre.plot([reg,coord_vcf],id_col='ID')

.. image:: images/prp_rtd_21.png

In the figure above, the text displaying the ID of each variant may be misinterpreted due to overlapping with other SNP 
labels. To address this, you can create an artificial column that selectively displays this text only for annotation data 
while omitting it for VCF data:

.. code-block::

    >>> reg["Text_col"]=reg["Parent"]
    >>> coord_vcf['Text_col'] = ''
    >>> pre.plot([reg,coord_vcf],id_col='ID',text = '{Text_col}')

.. image:: images/prp_rtd_22.png

However, genome variant analysis is not limited to simply identifying the positions of variants. You might also want to 
explore the distribution of variants by analyzing the number of variants at each position. With Pyrangeyes, you can achieve 
this by first creating a scatterplot that visualizes these counts, and then including it as input in the **add_aligned_plots**
parameter:

.. code-block::

    >>> import plotly.graph_objects as go
    >>> aligned_traces = [
    ...     (go.Scatter(
    ...         x=[173905000, 173905500, 173906000, 173906500, 173907000, 173907500, 173908000, 173908500, 173909000, 173909500],
    ...         y=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    ...         mode='markers'
    ...     ),{'title': 'Scatterplot', 'title_size': 18, 'title_color': 'green'})
    ... ]
    >>> pre.plot([reg,coord_vcf],id_col='ID',text = '{Text_col}',add_aligned_plots=aligned_traces)

.. image:: images/prp_rtd_23.png

.. warning::

    Be careful! The add_aligned_plots parameter is currently only supported when your input data contains a single chromosome. 
    If your dataset spans multiple chromosomes, you will need to filter it beforehand to focus on a specific chromosome for this 
    feature to work correctly.

As you observed, the add_aligned_plots parameter accepts as input a list of tuples, where each tuple consists of two elements: 
the first is the scatterplot object, and the second is a dictionary for customizing the title of the aligned plot.This dictionary 
allows you to control three title parameters:

* title: The text of the title.
* title_size: The font size of the title.
* title_color: The color of the title text.
* y_space: Determines de distance between the main plot and the aligned plots
* height: Determines the height of the added plot

We already used the options to customise the title., let's now customise the y axis length and the space between these plots:

.. code-block::

    >>> aligned_traces = [
    ...          (go.Scatter(
    ...              x=[173905000, 173905500, 173906000, 173906500, 173907000, 173907500, 173908000, 173908500, 173909000, 173909500],
    ...              y=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    ...              mode='markers'
    ...          ),{'title': 'Scatterplot', 'title_size': 18, 'title_color': 'green', 'height': 0.5, 'y_space': 0.5})
    ... ]
    >>> pre.plot([reg,coord_vcf],id_col='ID',text = '{Text_col}',add_aligned_plots=aligned_traces)

.. image:: images/prp_rtd_24.png

If your dataset is too large to manually create a Plotly scatterplot, Pyrangeyes offers a convenient function called :func:`make_scatter() <pyrangeyes.make_scatter>`. 
This function allows you to automatically generate a scatterplot directly from your data, introducing the numeric column for the
y axis.

First we will use Numpy to create a random Count column

.. code-block::

    >>> import numpy as np
    >>> coord_vcf['Count']=coord_vcf.apply(lambda row: np.random.randint(0, 100), axis=1)

Next, we will use this column to define the y-axis for the plot:

.. code-block::

    >>> aligned = pre.make_scatter(coord_vcf, y='Count')
    >>> pre.plot([reg,coord_vcf],id_col='ID',text = '{Text_col}',add_aligned_plots=[aligned])

.. image:: images/prp_rtd_25.png

The :func:`make_scatter() <pyrangeyes.make_scatter>` function includes several options that allow you to customize your plot to better fit your needs. For instance, 
you can use the following parameters:

* color_by: Specify a column from your dataset to color the markers based on its values.
* title: Set a custom title for your scatterplot.
* title_size: Adjust the font size of the title for better visibility.
* title_color: Change the color of the title text to match your design preferences.
* size_by: Define a column to dynamically adjust the marker sizes based on its values.
* y_space: Determines de distance between the main plot and the aligned plots
* height: Determines the height of the added plot

These customization options make it easy to generate informative and visually appealing scatterplots tailored to your data.
In our case we are going to color our genetic variants by its type (**TSA** column):

.. code-block::

    >>> aligned = pre.make_scatter(coord_vcf, y='Count',color_by="TSA", title="Human Variants", title_color="Magenta",title_size=18)
    >>> pre.plot([reg,coord_vcf],id_col='ID',text = '{Text_col}',add_aligned_plots=[aligned])

.. image:: images/prp_rtd_26.png

Enhancing Pyrangeyes with External Visualizations
----------------------------------------------------

A typical genomic analysis often involves more than just visualizing genomic intervals. Researchers frequently need to incorporate additional 
plots—potentially using different axes or plot types—to provide context or enhance the interpretation of results. Pyrangeyes allows you 
to export your plot to a variable by using the **return_plot** parameter. This parameter accepts two values:

* app: Returns a Dash object, which can be integrated into a custom dashboard.
* fig: Returns the figure and axes of the data, enabling direct manipulation or combination with other Plotly figures.

Example:

.. code-block::

    >>> p = pre.plot([reg,coord_vcf],id_col='ID',text = '{Text_col}', return_plot='app')
    >>> p
    <dash.dash.Dash object at 0x73321d74e990>

Imagine you have your VCF plot and want to visualize how many variants are present in your dataset. For instance, first you can export 
the Pyrangeyes dash object and then you can create a pie chart to display the distribution of variants by type and seamlessly 
integrate it into the Pyrangeyes layout. Below is an example of a Pyrangeyes combined with a horizontally aligned pie chart:

.. code-block::
    
    from dash import Dash, html, dcc
    
    p = pre.plot([reg,coord_vcf],id_col='ID',text = '{Text_col}', return_plot='app')

    # Example additional data
    variant_types = ["Missense", "Synonymous", "Nonsense", "Frameshift", "Splice Site"]
    variant_counts = [30, 20, 10, 15, 25]  # Example counts or proportions

    # Create a pie chart
    pie_chart = go.Figure(
        go.Pie(
            labels=variant_types,
            values=variant_counts,
            hoverinfo="label+percent",
            textinfo="label+percent",
        )
    )
    pie_chart.update_layout(title={"text": "Variant Types", "font": {"color": "black", "size": 18}, "x": 0.5},
                        margin=dict(l=10, r=10, t=30, b=10))

    # Access and extend the existing Dash app's layout
    p.layout = html.Div(
        [
            html.Div(
                [
                    html.Div([p.layout], style={"width": "70%", "display": "flex", "justify-content": "center"}),
                    html.Div(
                        [dcc.Graph(figure=pie_chart)],
                        style={
                            "width": "70%",
                            "display": "flex",
                            "align-items": "center",
                            "justify-content": "center",
                        },
                    ),
                ],
                style={"display": "flex", "flex-direction": "row"},  # Arrange side by side
            )
        ]
    )

    # Run the Dash app
    if __name__ == "__main__":
        p.run_server(debug=True)

.. image:: images/prp_rtd_27.png

.. warning::
    Hey! This code may cause issues if it is run in an IPython shell. 
    For a smoother experience, consider using a Jupyter Notebook instead.

This layout can also be implemented vertically, allowing you to stack the Pyrangeyes and the pie chart for a clear and intuitive 
visualization. Here's how you can achieve this configuration:

.. code-block::

    from dash import Dash, html, dcc

    p = pre.plot([reg,p_vcf[0]],id_col='ID',text = '{Artificial_col}', return_plot='app')

    # Example additional data
    variant_types = ["Missense", "Synonymous", "Nonsense", "Frameshift", "Splice Site"]
    variant_counts = [30, 20, 10, 15, 25]  # Example counts or proportions

    # Create a pie chart
    pie_chart = go.Figure(
        go.Pie(
            labels=variant_types,
            values=variant_counts,
            hoverinfo="label+percent",
            textinfo="label+percent",
        )
    )
    pie_chart.update_layout(title={"text": "Variant Types", "font": {"color": "black", "size": 18}, "x": 0.5},
                        margin=dict(l=10, r=10, t=30, b=10))


    # Access and extend the existing Dash app's layout
    p.layout = html.Div(
        [
            p.layout,  # Retain the existing layout from pre.plot
            html.Div(
                [
                    dcc.Graph(figure=pie_chart, style={"margin-bottom": "20px"}),
                ],
                style={"display": "flex", "flex-direction": "column"}  # Arrange vertically
            )
        ]
    )

    # Run the Dash app
    if __name__ == "__main__":
        p.run_server(debug=True)

.. image:: images/prp_rtd_28.png