Calling DMRs with modality#

This notebooks illustrates how modality can call DMRs (differentially methylated regions). The code looks for specific group differences (e.g. healthy versus control, or sex). A brief outline of the code is described below:

We divide genome in smaller regions. Two options are available:

tiles of fixed length (e.g. 2kb)
segmentation based, in which we identify gaps in the genome where there are no CpG within a certain distance (e.g. 500bp) and divide accordingly. This is the default mode.

The contexts of each region are summed together to get the total number of methylated and unmethylated counts.
We use a logistic regression to model the methylation levels of each region. A DMR is defined as a region with a significant difference in methylation levels between two conditions. This is quantified by a p-value.
Based on this p-value, we can decide if the tile is a DMR. We also report a q-value, which is the p-value corrected using the Benjamini-Hochberg method.

from modality.analysis import call_dmrs
from modality import load_biomodal_dataset

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pyranges as pr
import seaborn as sns

Load the data#

For the purposes of this demo we can use a duet +modC dataset consisting of 14 Genome in a Bottle (GIAB) samples. The data are publicly available and can be loaded using the load_biomodal_dataset() function. This will pull the dataset and load it in to our session as a ContigDataset object.

The dataset consists of 14 samples, 2 replicates of 7 cell lines. In this notebook we will look for regions of genome that are differentially methylated between two families: the Han mother-father-son trio on one hand, and the Ashkenazi mother-father-son trio on the other hand.

We merge the 2 technical replicates, and we merge the CpGs on the two strands.

ds = load_biomodal_dataset(
    name="giab",
    merge_technical_replicates=True,
    merge_cpgs=True,
    )

ds

<xarray.Dataset> Size: 12GB
Dimensions:                         (pos: 29294625, sample_id: 7)
Coordinates:
    contig                          (pos) <U20 2GB dask.array<chunksize=(40000,), meta=np.ndarray>
    ref_position                    (pos) uint32 117MB dask.array<chunksize=(40000,), meta=np.ndarray>
    strand                          (pos) <U1 117MB dask.array<chunksize=(40000,), meta=np.ndarray>
    trinucleotide_context           (pos) uint8 29MB dask.array<chunksize=(40000,), meta=np.ndarray>
  * sample_id                       (sample_id) object 56B 'HG001' ... 'HG007'
    group                           (sample_id) object 56B 'HG001' ... 'HG007'
Dimensions without coordinates: pos
Data variables:
    num_c                           (pos, sample_id) uint64 2GB dask.array<chunksize=(40000, 1), meta=np.ndarray>
    num_modc                        (pos, sample_id) uint64 2GB dask.array<chunksize=(40000, 1), meta=np.ndarray>
    num_n                           (pos, sample_id) uint64 2GB dask.array<chunksize=(40000, 1), meta=np.ndarray>
    num_other                       (pos, sample_id) uint64 2GB dask.array<chunksize=(40000, 1), meta=np.ndarray>
    num_total                       (pos, sample_id) uint64 2GB dask.array<chunksize=(40000, 1), meta=np.ndarray>
    num_total_c                     (pos, sample_id) uint64 2GB dask.array<chunksize=(40000, 1), meta=np.ndarray>
    Input DNA Quantity (ng/sample)  (sample_id) int64 56B 80 80 80 80 80 80 80
    Protocol Version                (sample_id) object 56B '5-Letter v2.4' .....
    NA_id                           (sample_id) object 56B 'NA12878' ... 'NA2...
    family                          (sample_id) object 56B 'Utah' ... 'Han Ch...
    family_status                   (sample_id) object 56B 'Singleton' ... 'M...
    sex                             (sample_id) object 56B 'Female' ... 'Female'
    tech_replicate_number           (sample_id) int64 56B 1 1 1 1 1 1 1
Attributes: (100/205)
    context:                     CG
    context_encoding:            {'0': 'AAA', '1': 'AAC', '10': 'AGA', '100':...
    contigs:                     ['chr1', 'chr2', 'chr3', 'chr4', 'chr5', 'ch...
    coordinate_basis:            0
    description:                 A +modC dataset of human Genome in a bottle ...
    fasta_path:                  GRCh38Decoy.fa
    input_path:                  ['CEG1530-EL01-A1200-001.genome.GRCh38Decoy_...
    quant_subtype:               modC
    ref_name:                    GRCh38Decoy
    sample_ids:                  ['CEG1530-EL01-A1200-001', 'CEG1530-EL01-A12...
    slice_chr1:                  slice(0, 2375159, 1)
    slice_chr10:                 slice(14996202, 16385180, 1)
    slice_chr11:                 slice(16385180, 17718294, 1)
    slice_chr11_KI270721v1_ran:  slice(29173382, 29176776, 1)
    slice_chr12:                 slice(17718294, 19034262, 1)
    slice_chr13:                 slice(19034262, 19876731, 1)
    slice_chr14:                 slice(19876731, 20739159, 1)
    slice_chr14_GL000009v2_ran:  slice(29176776, 29178255, 1)
    slice_chr14_GL000194v1_ran:  slice(29183653, 29185755, 1)
    slice_chr14_GL000225v1_ran:  slice(29178255, 29182074, 1)
    slice_chr14_KI270722v1_ran:  slice(29182074, 29183653, 1)
    slice_chr14_KI270723v1_ran:  slice(29185755, 29186353, 1)
    slice_chr14_KI270724v1_ran:  slice(29186353, 29186907, 1)
    slice_chr14_KI270725v1_ran:  slice(29186907, 29188509, 1)
    slice_chr14_KI270726v1_ran:  slice(29188509, 29189025, 1)
    slice_chr15:                 slice(20739159, 21645185, 1)
    slice_chr15_KI270727v1_ran:  slice(29189025, 29192989, 1)
    slice_chr16:                 slice(21645185, 22796076, 1)
    slice_chr16_KI270728v1_ran:  slice(29192989, 29208664, 1)
    slice_chr17:                 slice(22796076, 24044404, 1)
    slice_chr17_GL000205v2_ran:  slice(29208664, 29210537, 1)
    slice_chr17_KI270729v1_ran:  slice(29210537, 29217219, 1)
    slice_chr17_KI270730v1_ran:  slice(29217219, 29219552, 1)
    slice_chr18:                 slice(24044404, 24800418, 1)
    slice_chr19:                 slice(24800418, 25857083, 1)
    slice_chr1_KI270706v1_rand:  slice(29152891, 29154480, 1)
    slice_chr1_KI270707v1_rand:  slice(29154480, 29154791, 1)
    slice_chr1_KI270708v1_rand:  slice(29154791, 29155844, 1)
    slice_chr1_KI270709v1_rand:  slice(29155844, 29157288, 1)
    slice_chr1_KI270710v1_rand:  slice(29157288, 29157505, 1)
    slice_chr1_KI270711v1_rand:  slice(29157505, 29157882, 1)
    slice_chr1_KI270712v1_rand:  slice(29157882, 29160999, 1)
    slice_chr1_KI270713v1_rand:  slice(29160999, 29161864, 1)
    slice_chr1_KI270714v1_rand:  slice(29161864, 29163019, 1)
    slice_chr2:                  slice(2375159, 4567829, 1)
    slice_chr20:                 slice(25857083, 26630560, 1)
    slice_chr21:                 slice(26630560, 27059406, 1)
    slice_chr22:                 slice(27059406, 27660298, 1)
    slice_chr22_KI270731v1_ran:  slice(29219552, 29222033, 1)
    slice_chr22_KI270732v1_ran:  slice(29222033, 29222518, 1)
    ...                          ...
    slice_chrUn_KI270510v1:      slice(29247760, 29247775, 1)
    slice_chrUn_KI270511v1:      slice(29248979, 29249024, 1)
    slice_chrUn_KI270512v1:      slice(29247837, 29248020, 1)
    slice_chrUn_KI270515v1:      slice(29249024, 29249065, 1)
    slice_chrUn_KI270516v1:      slice(29247829, 29247837, 1)
    slice_chrUn_KI270517v1:      slice(29249104, 29249116, 1)
    slice_chrUn_KI270518v1:      slice(29247795, 29247808, 1)
    slice_chrUn_KI270519v1:      slice(29248020, 29248909, 1)
    slice_chrUn_KI270521v1:      slice(29251667, 29251743, 1)
    slice_chrUn_KI270522v1:      slice(29248909, 29248979, 1)
    slice_chrUn_KI270528v1:      slice(29249140, 29249240, 1)
    slice_chrUn_KI270529v1:      slice(29249116, 29249140, 1)
    slice_chrUn_KI270530v1:      slice(29249240, 29249275, 1)
    slice_chrUn_KI270538v1:      slice(29249353, 29250059, 1)
    slice_chrUn_KI270539v1:      slice(29249275, 29249353, 1)
    slice_chrUn_KI270544v1:      slice(29250059, 29250078, 1)
    slice_chrUn_KI270548v1:      slice(29250078, 29250103, 1)
    slice_chrUn_KI270579v1:      slice(29250198, 29250411, 1)
    slice_chrUn_KI270580v1:      slice(29250129, 29250141, 1)
    slice_chrUn_KI270581v1:      slice(29250141, 29250198, 1)
    slice_chrUn_KI270582v1:      slice(29250753, 29250816, 1)
    slice_chrUn_KI270583v1:      slice(29250103, 29250114, 1)
    slice_chrUn_KI270584v1:      slice(29250726, 29250753, 1)
    slice_chrUn_KI270587v1:      slice(29250114, 29250129, 1)
    slice_chrUn_KI270588v1:      slice(29250816, 29250857, 1)
    slice_chrUn_KI270589v1:      slice(29250411, 29250700, 1)
    slice_chrUn_KI270590v1:      slice(29250700, 29250726, 1)
    slice_chrUn_KI270591v1:      slice(29250879, 29250912, 1)
    slice_chrUn_KI270593v1:      slice(29250857, 29250879, 1)
    slice_chrUn_KI270741v1:      slice(29263727, 29265416, 1)
    slice_chrUn_KI270742v1:      slice(29288641, 29290053, 1)
    slice_chrUn_KI270743v1:      slice(29267377, 29269464, 1)
    slice_chrUn_KI270744v1:      slice(29269464, 29271450, 1)
    slice_chrUn_KI270745v1:      slice(29271450, 29272082, 1)
    slice_chrUn_KI270746v1:      slice(29272082, 29272487, 1)
    slice_chrUn_KI270747v1:      slice(29272487, 29275904, 1)
    slice_chrUn_KI270748v1:      slice(29275904, 29277399, 1)
    slice_chrUn_KI270749v1:      slice(29277399, 29278817, 1)
    slice_chrUn_KI270750v1:      slice(29278817, 29280152, 1)
    slice_chrUn_KI270751v1:      slice(29280152, 29281268, 1)
    slice_chrUn_KI270752v1:      slice(29281268, 29281503, 1)
    slice_chrUn_KI270753v1:      slice(29281503, 29282300, 1)
    slice_chrUn_KI270754v1:      slice(29282300, 29283979, 1)
    slice_chrUn_KI270755v1:      slice(29283979, 29284530, 1)
    slice_chrUn_KI270756v1:      slice(29284530, 29286560, 1)
    slice_chrUn_KI270757v1:      slice(29286560, 29286986, 1)
    slice_chrX:                  slice(27660298, 28983007, 1)
    slice_chrY:                  slice(28983007, 29152456, 1)
    slice_chrY_KI270740v1_rand:  slice(29239647, 29239942, 1)
    zarr_version:                0.1

Select the two families#

First, we use the select_samples method to select only the samples corresponding to the two families of interest (i.e., the families that are not from the HG001/Utah cell line).

ds = ds.select_samples([i for i in ds.sample_id.values if i != "HG001"])

ds

<xarray.Dataset> Size: 11GB
Dimensions:                         (pos: 29294625, sample_id: 6)
Coordinates:
    contig                          (pos) <U20 2GB dask.array<chunksize=(40000,), meta=np.ndarray>
    ref_position                    (pos) uint32 117MB dask.array<chunksize=(40000,), meta=np.ndarray>
    strand                          (pos) <U1 117MB dask.array<chunksize=(40000,), meta=np.ndarray>
    trinucleotide_context           (pos) uint8 29MB dask.array<chunksize=(40000,), meta=np.ndarray>
  * sample_id                       (sample_id) object 48B 'HG002' ... 'HG007'
    group                           (sample_id) object 48B 'HG002' ... 'HG007'
Dimensions without coordinates: pos
Data variables:
    num_c                           (pos, sample_id) uint64 1GB dask.array<chunksize=(40000, 1), meta=np.ndarray>
    num_modc                        (pos, sample_id) uint64 1GB dask.array<chunksize=(40000, 1), meta=np.ndarray>
    num_n                           (pos, sample_id) uint64 1GB dask.array<chunksize=(40000, 1), meta=np.ndarray>
    num_other                       (pos, sample_id) uint64 1GB dask.array<chunksize=(40000, 1), meta=np.ndarray>
    num_total                       (pos, sample_id) uint64 1GB dask.array<chunksize=(40000, 1), meta=np.ndarray>
    num_total_c                     (pos, sample_id) uint64 1GB dask.array<chunksize=(40000, 1), meta=np.ndarray>
    Input DNA Quantity (ng/sample)  (sample_id) int64 48B 80 80 80 80 80 80
    Protocol Version                (sample_id) object 48B '5-Letter v2.4' .....
    NA_id                           (sample_id) object 48B 'NA24385' ... 'NA2...
    family                          (sample_id) object 48B 'Ashkenazi Jewish'...
    family_status                   (sample_id) object 48B 'Son' ... 'Mother'
    sex                             (sample_id) object 48B 'Male' ... 'Female'
    tech_replicate_number           (sample_id) int64 48B 1 1 1 1 1 1
Attributes: (100/205)
    context:                     CG
    context_encoding:            {'0': 'AAA', '1': 'AAC', '10': 'AGA', '100':...
    contigs:                     ['chr1', 'chr2', 'chr3', 'chr4', 'chr5', 'ch...
    coordinate_basis:            0
    description:                 A +modC dataset of human Genome in a bottle ...
    fasta_path:                  GRCh38Decoy.fa
    input_path:                  ['CEG1530-EL01-A1200-001.genome.GRCh38Decoy_...
    quant_subtype:               modC
    ref_name:                    GRCh38Decoy
    sample_ids:                  ['CEG1530-EL01-A1200-001', 'CEG1530-EL01-A12...
    slice_chr1:                  slice(0, 2375159, 1)
    slice_chr10:                 slice(14996202, 16385180, 1)
    slice_chr11:                 slice(16385180, 17718294, 1)
    slice_chr11_KI270721v1_ran:  slice(29173382, 29176776, 1)
    slice_chr12:                 slice(17718294, 19034262, 1)
    slice_chr13:                 slice(19034262, 19876731, 1)
    slice_chr14:                 slice(19876731, 20739159, 1)
    slice_chr14_GL000009v2_ran:  slice(29176776, 29178255, 1)
    slice_chr14_GL000194v1_ran:  slice(29183653, 29185755, 1)
    slice_chr14_GL000225v1_ran:  slice(29178255, 29182074, 1)
    slice_chr14_KI270722v1_ran:  slice(29182074, 29183653, 1)
    slice_chr14_KI270723v1_ran:  slice(29185755, 29186353, 1)
    slice_chr14_KI270724v1_ran:  slice(29186353, 29186907, 1)
    slice_chr14_KI270725v1_ran:  slice(29186907, 29188509, 1)
    slice_chr14_KI270726v1_ran:  slice(29188509, 29189025, 1)
    slice_chr15:                 slice(20739159, 21645185, 1)
    slice_chr15_KI270727v1_ran:  slice(29189025, 29192989, 1)
    slice_chr16:                 slice(21645185, 22796076, 1)
    slice_chr16_KI270728v1_ran:  slice(29192989, 29208664, 1)
    slice_chr17:                 slice(22796076, 24044404, 1)
    slice_chr17_GL000205v2_ran:  slice(29208664, 29210537, 1)
    slice_chr17_KI270729v1_ran:  slice(29210537, 29217219, 1)
    slice_chr17_KI270730v1_ran:  slice(29217219, 29219552, 1)
    slice_chr18:                 slice(24044404, 24800418, 1)
    slice_chr19:                 slice(24800418, 25857083, 1)
    slice_chr1_KI270706v1_rand:  slice(29152891, 29154480, 1)
    slice_chr1_KI270707v1_rand:  slice(29154480, 29154791, 1)
    slice_chr1_KI270708v1_rand:  slice(29154791, 29155844, 1)
    slice_chr1_KI270709v1_rand:  slice(29155844, 29157288, 1)
    slice_chr1_KI270710v1_rand:  slice(29157288, 29157505, 1)
    slice_chr1_KI270711v1_rand:  slice(29157505, 29157882, 1)
    slice_chr1_KI270712v1_rand:  slice(29157882, 29160999, 1)
    slice_chr1_KI270713v1_rand:  slice(29160999, 29161864, 1)
    slice_chr1_KI270714v1_rand:  slice(29161864, 29163019, 1)
    slice_chr2:                  slice(2375159, 4567829, 1)
    slice_chr20:                 slice(25857083, 26630560, 1)
    slice_chr21:                 slice(26630560, 27059406, 1)
    slice_chr22:                 slice(27059406, 27660298, 1)
    slice_chr22_KI270731v1_ran:  slice(29219552, 29222033, 1)
    slice_chr22_KI270732v1_ran:  slice(29222033, 29222518, 1)
    ...                          ...
    slice_chrUn_KI270510v1:      slice(29247760, 29247775, 1)
    slice_chrUn_KI270511v1:      slice(29248979, 29249024, 1)
    slice_chrUn_KI270512v1:      slice(29247837, 29248020, 1)
    slice_chrUn_KI270515v1:      slice(29249024, 29249065, 1)
    slice_chrUn_KI270516v1:      slice(29247829, 29247837, 1)
    slice_chrUn_KI270517v1:      slice(29249104, 29249116, 1)
    slice_chrUn_KI270518v1:      slice(29247795, 29247808, 1)
    slice_chrUn_KI270519v1:      slice(29248020, 29248909, 1)
    slice_chrUn_KI270521v1:      slice(29251667, 29251743, 1)
    slice_chrUn_KI270522v1:      slice(29248909, 29248979, 1)
    slice_chrUn_KI270528v1:      slice(29249140, 29249240, 1)
    slice_chrUn_KI270529v1:      slice(29249116, 29249140, 1)
    slice_chrUn_KI270530v1:      slice(29249240, 29249275, 1)
    slice_chrUn_KI270538v1:      slice(29249353, 29250059, 1)
    slice_chrUn_KI270539v1:      slice(29249275, 29249353, 1)
    slice_chrUn_KI270544v1:      slice(29250059, 29250078, 1)
    slice_chrUn_KI270548v1:      slice(29250078, 29250103, 1)
    slice_chrUn_KI270579v1:      slice(29250198, 29250411, 1)
    slice_chrUn_KI270580v1:      slice(29250129, 29250141, 1)
    slice_chrUn_KI270581v1:      slice(29250141, 29250198, 1)
    slice_chrUn_KI270582v1:      slice(29250753, 29250816, 1)
    slice_chrUn_KI270583v1:      slice(29250103, 29250114, 1)
    slice_chrUn_KI270584v1:      slice(29250726, 29250753, 1)
    slice_chrUn_KI270587v1:      slice(29250114, 29250129, 1)
    slice_chrUn_KI270588v1:      slice(29250816, 29250857, 1)
    slice_chrUn_KI270589v1:      slice(29250411, 29250700, 1)
    slice_chrUn_KI270590v1:      slice(29250700, 29250726, 1)
    slice_chrUn_KI270591v1:      slice(29250879, 29250912, 1)
    slice_chrUn_KI270593v1:      slice(29250857, 29250879, 1)
    slice_chrUn_KI270741v1:      slice(29263727, 29265416, 1)
    slice_chrUn_KI270742v1:      slice(29288641, 29290053, 1)
    slice_chrUn_KI270743v1:      slice(29267377, 29269464, 1)
    slice_chrUn_KI270744v1:      slice(29269464, 29271450, 1)
    slice_chrUn_KI270745v1:      slice(29271450, 29272082, 1)
    slice_chrUn_KI270746v1:      slice(29272082, 29272487, 1)
    slice_chrUn_KI270747v1:      slice(29272487, 29275904, 1)
    slice_chrUn_KI270748v1:      slice(29275904, 29277399, 1)
    slice_chrUn_KI270749v1:      slice(29277399, 29278817, 1)
    slice_chrUn_KI270750v1:      slice(29278817, 29280152, 1)
    slice_chrUn_KI270751v1:      slice(29280152, 29281268, 1)
    slice_chrUn_KI270752v1:      slice(29281268, 29281503, 1)
    slice_chrUn_KI270753v1:      slice(29281503, 29282300, 1)
    slice_chrUn_KI270754v1:      slice(29282300, 29283979, 1)
    slice_chrUn_KI270755v1:      slice(29283979, 29284530, 1)
    slice_chrUn_KI270756v1:      slice(29284530, 29286560, 1)
    slice_chrUn_KI270757v1:      slice(29286560, 29286986, 1)
    slice_chrX:                  slice(27660298, 28983007, 1)
    slice_chrY:                  slice(28983007, 29152456, 1)
    slice_chrY_KI270740v1_rand:  slice(29239647, 29239942, 1)
    zarr_version:                0.1

Restrict to chromosome 1#

In this notebook, we will look for DMRs on a subset of the whole genome, here chromosome 1.

substore = ds["chr1"].copy()

Filter by coverage#

It is good practice to remove sites of low coverage before calling DMRs. This can be done with the subset_bycoverage method of the ContigDataset. Here we call it with the method="any", which removes a position if ay of the samples have a coverage strictly lower than min_coverage at that CpG context. Other options are "all" and "mean".

substore = substore.subset_bycoverage(min_coverage=5, method="any")

Call DMRs#

Now we can call DMRs using the call_dmrs function. We need to specify the ContigDataset object we are going to use (here substore), as well as the count variables (here num_modc and num_total_c) and the condition (here we look for DMRs across two different families, so we set use family).

Finally, call_dmrs needs to know the regions where to look for DMRs:

We use the default mode, which is to pre-segment the CpGs, creating a new region every time two consecutive CpGs are separated by a distance larger than presegment_threshold (here we use the default presegment_threshold of 500bp). This determines the regions in which modality is going to look for DMRs.
Note we could instead pass a ranges argument, which can either be an integer (e.g. ranges=2000 will divide the genome in tiles of size 2kbp), or an actual set of genomic ranges. If such object is provided, modality will call DMRs on those ranges. This object should be a pyranges object. If you have a set of genomic ranges as a bed file, you can for instance use ranges = pr.read_bed(path_to_bedfile) and pass it to call_dmrs.

For this dataset, it takes about 1 minutes to call DMRs on chromosome 1 on a MacBook Pro M2.

Note that you don’t have to restrict to one chromosome at a time. You can use call_dmrs on your ContigDataset ds.

substore

<xarray.Dataset> Size: 845MB
Dimensions:                         (pos: 2241991, sample_id: 6)
Coordinates:
  * sample_id                       (sample_id) object 48B 'HG002' ... 'HG007'
    group                           (sample_id) object 48B 'HG002' ... 'HG007'
    contig                          (pos) <U20 179MB dask.array<chunksize=(40000,), meta=np.ndarray>
    ref_position                    (pos) uint32 9MB dask.array<chunksize=(40000,), meta=np.ndarray>
    strand                          (pos) <U1 9MB dask.array<chunksize=(40000,), meta=np.ndarray>
    trinucleotide_context           (pos) uint8 2MB dask.array<chunksize=(40000,), meta=np.ndarray>
Dimensions without coordinates: pos
Data variables:
    num_c                           (pos, sample_id) uint64 108MB dask.array<chunksize=(40000, 6), meta=np.ndarray>
    num_modc                        (pos, sample_id) uint64 108MB dask.array<chunksize=(40000, 6), meta=np.ndarray>
    num_n                           (pos, sample_id) uint64 108MB dask.array<chunksize=(40000, 6), meta=np.ndarray>
    num_other                       (pos, sample_id) uint64 108MB dask.array<chunksize=(40000, 6), meta=np.ndarray>
    num_total                       (pos, sample_id) uint64 108MB dask.array<chunksize=(40000, 6), meta=np.ndarray>
    num_total_c                     (pos, sample_id) uint64 108MB dask.array<chunksize=(40000, 6), meta=np.ndarray>
    Input DNA Quantity (ng/sample)  (sample_id) int64 48B 80 80 80 80 80 80
    Protocol Version                (sample_id) object 48B '5-Letter v2.4' .....
    NA_id                           (sample_id) object 48B 'NA24385' ... 'NA2...
    family                          (sample_id) object 48B 'Ashkenazi Jewish'...
    family_status                   (sample_id) object 48B 'Son' ... 'Mother'
    sex                             (sample_id) object 48B 'Male' ... 'Female'
    tech_replicate_number           (sample_id) int64 48B 1 1 1 1 1 1
Attributes:
    context:           CG
    context_encoding:  {'0': 'AAA', '1': 'AAC', '10': 'AGA', '100': 'TAA', '1...
    contigs:           ['chr1']
    coordinate_basis:  0
    description:       A +modC dataset of human Genome in a bottle (GIAB) sam...
    fasta_path:        GRCh38Decoy.fa
    input_path:        ['CEG1530-EL01-A1200-001.genome.GRCh38Decoy_primary_as...
    quant_subtype:     modC
    ref_name:          GRCh38Decoy
    sample_ids:        ['CEG1530-EL01-A1200-001', 'CEG1530-EL01-A1200-004', '...
    slice_chr1:        slice(0, 2241991, 1)
    zarr_version:      0.1

Now we are ready to call DMRs.

dmr_regions = call_dmrs(
    ds=substore,
    count_array_name="num_modc",
    total_counts_name="num_total_c",
    condition_array_name="family",
    presegment_threshold=500,
    condition_order=["Ashkenazi Jewish", "Han Chinese"],
    as_pyranges=True,
)

The output dmr_regions is a dataframe containing the following columns:

the chromosome, start, end (zero-based) of the region
the number of contexts in the region
the mean coverage of the region
the mean methylation fraction of each group
the methylation fold change (mean(group_2)/mean(group_1))
the methylation difference (mean(group_2)-mean(group_1))
the test statistic, in our case a Wald test
the p-value associated with the significance of the coefficient of the logistic regression associated with the group
the q-value (Benjamini-Hochberg corrected p-value)

Note that we could also ouptut a pandas Dataframe instead of a pyranges one by setting as_pyranges=False in the call_dmrs function.

dmr_regions.head()

	Chromosome	Start	End	num_contexts	mean_coverage	mean_mod_group_1	mean_mod_group_2	mod_fold_change	mod_difference	test_statistic	dmr_pvalue	dmr_qvalue
0	chr1	10578	10816	50	28.973333	0.966083	0.924386	0.956839	-0.041697	60.001180	9.480050e-15	3.337637e-14
1	chr1	14652	14791	9	70.907407	0.917142	0.934554	1.018985	0.017412	2.874834	8.997427e-02	1.071010e-01
2	chr1	16138	16869	8	81.333333	0.831444	0.841313	1.011870	0.009869	1.169600	2.794835e-01	3.111446e-01
3	chr1	17377	17613	10	129.400000	0.960567	0.936319	0.974757	-0.024247	22.613622	1.980753e-06	3.656151e-06
4	chr1	19159	19322	9	63.685185	0.921850	0.918903	0.996804	-0.002946	0.011737	9.137286e-01	9.236813e-01
5	chr1	29271	29435	25	25.933333	0.002069	0.002200	1.063500	0.000131	0.001907	9.651642e-01	9.692986e-01
6	chr1	102932	103249	8	117.750000	0.598732	0.724905	1.210733	0.126173	115.735209	5.431896e-27	4.532539e-26
7	chr1	108774	109575	6	107.555556	0.456157	0.347131	0.760990	-0.109026	21.070517	4.426873e-06	7.931364e-06

Even if you kept the pyranges object as the default type of the output, it is straightforward to convert this object to a pandas dataframe if we want:

dmr_regions = dmr_regions.df

It is also useful to associate a slice in the format “Chromosome:Start-End” to the dataframe, since our ContigDataset object can be sliced using such string. This will come in handy later on in this notebook when we plot DMRs.

dmr_regions["Slice"] = dmr_regions.apply(
    lambda row: f"{row['Chromosome']}:{row['Start']}-{row['End']}", axis=1
)    
dmr_regions.head()

	Chromosome	Start	End	num_contexts	mean_coverage	mean_mod_group_1	mean_mod_group_2	mod_fold_change	mod_difference	test_statistic	dmr_pvalue	dmr_qvalue	Slice
0	chr1	10578	10816	50	28.973333	0.966083	0.924386	0.956839	-0.041697	60.001180	9.480050e-15	3.337637e-14	chr1:10578-10816
1	chr1	14652	14791	9	70.907407	0.917142	0.934554	1.018985	0.017412	2.874834	8.997427e-02	1.071010e-01	chr1:14652-14791
2	chr1	16138	16869	8	81.333333	0.831444	0.841313	1.011870	0.009869	1.169600	2.794835e-01	3.111446e-01	chr1:16138-16869
3	chr1	17377	17613	10	129.400000	0.960567	0.936319	0.974757	-0.024247	22.613622	1.980753e-06	3.656151e-06	chr1:17377-17613
4	chr1	19159	19322	9	63.685185	0.921850	0.918903	0.996804	-0.002946	0.011737	9.137286e-01	9.236813e-01	chr1:19159-19322

Differentially Methylated Cytosines#

Instead of calling DMRs, you may be interested to know if your individual cytosines are differentially methyated or not (so called DMC’s).

In principle you could call DMCs genome-wide, but in practice it involves running one logistic regression per context. This task is time and memory intensive, and is not recommended. Instead, we can focus a small region of interest, for instance a gene or an enhancer, and look at differentially methylated contexts for this region only.

Let’s look at the GBA1 gene, associated with Gaucher disease, a genetic disease particularly common in the Ashkenazi population. The gene is on the negative strand of chr1, between positions 155234451 and 155244698 (human reference hg38). We can select this particular region in our dataset by using a slice string formatted as “Chromosome:Start-End”. So for our particular gene, we would do:

ds["chr1:155234451-155244698"]

<xarray.Dataset> Size: 74kB
Dimensions:                         (pos: 196, sample_id: 6)
Coordinates:
    contig                          (pos) <U20 16kB dask.array<chunksize=(196,), meta=np.ndarray>
    ref_position                    (pos) uint32 784B dask.array<chunksize=(196,), meta=np.ndarray>
    strand                          (pos) <U1 784B dask.array<chunksize=(196,), meta=np.ndarray>
    trinucleotide_context           (pos) uint8 196B dask.array<chunksize=(196,), meta=np.ndarray>
  * sample_id                       (sample_id) object 48B 'HG002' ... 'HG007'
    group                           (sample_id) object 48B 'HG002' ... 'HG007'
Dimensions without coordinates: pos
Data variables:
    num_c                           (pos, sample_id) uint64 9kB dask.array<chunksize=(196, 1), meta=np.ndarray>
    num_modc                        (pos, sample_id) uint64 9kB dask.array<chunksize=(196, 1), meta=np.ndarray>
    num_n                           (pos, sample_id) uint64 9kB dask.array<chunksize=(196, 1), meta=np.ndarray>
    num_other                       (pos, sample_id) uint64 9kB dask.array<chunksize=(196, 1), meta=np.ndarray>
    num_total                       (pos, sample_id) uint64 9kB dask.array<chunksize=(196, 1), meta=np.ndarray>
    num_total_c                     (pos, sample_id) uint64 9kB dask.array<chunksize=(196, 1), meta=np.ndarray>
    Input DNA Quantity (ng/sample)  (sample_id) int64 48B 80 80 80 80 80 80
    Protocol Version                (sample_id) object 48B '5-Letter v2.4' .....
    NA_id                           (sample_id) object 48B 'NA24385' ... 'NA2...
    family                          (sample_id) object 48B 'Ashkenazi Jewish'...
    family_status                   (sample_id) object 48B 'Son' ... 'Mother'
    sex                             (sample_id) object 48B 'Male' ... 'Female'
    tech_replicate_number           (sample_id) int64 48B 1 1 1 1 1 1
Attributes:
    context:           CG
    context_encoding:  {'0': 'AAA', '1': 'AAC', '10': 'AGA', '100': 'TAA', '1...
    contigs:           ['chr1']
    coordinate_basis:  0
    description:       A +modC dataset of human Genome in a bottle (GIAB) sam...
    fasta_path:        GRCh38Decoy.fa
    input_path:        ['CEG1530-EL01-A1200-001.genome.GRCh38Decoy_primary_as...
    quant_subtype:     modC
    ref_name:          GRCh38Decoy
    sample_ids:        ['CEG1530-EL01-A1200-001', 'CEG1530-EL01-A1200-004', '...
    zarr_version:      0.1
    slice_chr1:        slice(0, 196, 1)

Let’s look at DMCs in this gene. In order to achieve this, we run the call_dmrs function with the ranges=1 parameter. This creates ranges of size 1bp, so each context becomes its own genomic range on which we can call differential methylation.

dmc_calls = call_dmrs(
    ds=ds["chr1:155234451-155244698"],
    count_array_name="num_modc",
    total_counts_name="num_total_c",
    condition_array_name="family",
    condition_order=["Ashkenazi Jewish", "Han Chinese"],
    as_pyranges=True,
    ranges=1,
)

Let’s remove positions that do not contain a context, and sites of low coverage:

dmc_calls = dmc_calls[(dmc_calls.num_contexts>0) & (dmc_calls.mean_coverage>5)].df

The table below shows all the contexts in the gene, ordered by the p-value associated with their level of differential methylation.

dmc_calls.sort_values("dmr_pvalue")

	Chromosome	Start	End	num_contexts	mean_coverage	mean_mod_group_1	mean_mod_group_2	mod_fold_change	mod_difference	test_statistic	dmr_pvalue	dmr_qvalue
110	chr1	155241670	155241671	1	111.166667	0.236535	0.635130	2.685147	0.398595	101.937739	5.729453e-24	5.812530e-20
74	chr1	155239109	155239110	1	93.333333	0.520995	0.877068	1.683448	0.356073	75.437030	3.772482e-18	1.913591e-14
124	chr1	155243497	155243498	1	105.000000	0.423714	0.734200	1.732771	0.310485	67.912718	1.708953e-16	5.779111e-13
115	chr1	155242045	155242046	1	113.666667	0.677408	0.921636	1.360532	0.244228	56.317558	6.166232e-14	1.563911e-10
54	chr1	155237947	155237948	1	94.000000	0.665644	0.408962	0.614385	-0.256682	38.760004	4.792476e-10	9.723934e-07
...	...	...	...	...	...	...	...	...	...	...	...	...
59	chr1	155238289	155238290	1	64.500000	0.919168	0.917938	0.998663	-0.001229	0.001364	9.705439e-01	1.000000e+00
156	chr1	155244259	155244260	1	97.833333	0.010367	0.010317	0.995114	-0.000051	0.000862	9.765759e-01	1.000000e+00
64	chr1	155238682	155238683	1	106.000000	0.880094	0.880548	1.000515	0.000453	0.000854	9.766926e-01	1.000000e+00
17	chr1	155235813	155235814	1	99.833333	0.961497	0.963858	1.002456	0.002361	0.000584	9.807192e-01	1.000000e+00
93	chr1	155240047	155240048	1	80.333333	0.964794	0.966017	1.001268	0.001223	0.000207	9.885330e-01	1.000000e+00

186 rows × 12 columns

Let’s visualise the significant contexts as a Manhattan plot of their q-values:

dmc_calls["minus_log10_qvalue"] = -np.log10(dmc_calls["dmr_qvalue"])
g = sns.scatterplot(
    data=dmc_calls, 
    x="Start", 
    y="minus_log10_qvalue",
    edgecolor=None,
    s=5,
)

# Format axis labels
plt.xlabel("Genomic Position")
plt.ylabel("-log10(q-value)")
# add significance threshold line
plt.axhline(-np.log10(0.05), color='C2', linestyle='--', label="q-value = 0.05")

# Format x-axis labels
ticks = g.get_xticks()
xlabels = ["{:,.0f}".format(x) for x in ticks]
g.set_xticklabels(xlabels)
# Rotate x-axis labels
plt.xticks(rotation=45, ha='right')
# Only three ticks on x-axis
plt.locator_params(axis='x', nbins=3)
# Add legend
plt.legend()

plt.show()

_images/89d3a611dc08301f2c65d4903abbed0d3907e1700817a599fe2778bc96caa5b8.png

Volcano plot of DMRs#

Now let’s return to our region-based calls. Let’s visualise the q-values of our DMR calls against the mean methylation difference between the two groups.

dmr_regions["dmr_qvalue"] = dmr_regions["dmr_qvalue"].replace(0, 10**-300)
dmr_regions["-log10 dmr_qvalue"] = -np.log10(dmr_regions["dmr_qvalue"])

g = sns.scatterplot(
    x="mod_difference",
    y="-log10 dmr_qvalue",
    data=dmr_regions,
    edgecolor=None,
    s=5,
)
g.set_xlim(-1, 1)
g.axvline(0, color="black", linewidth=0.5)
g.axvline(np.log2(1./2), color="black", linewidth=0.5, linestyle="--")
g.axvline(np.log2(2.), color="black", linewidth=0.5, linestyle="--")

<matplotlib.lines.Line2D at 0x7f060ef51b10>

_images/10ead41ef42e5f3b657db886539f375eca92e79ce52a77a9849068b5a4a8bc94.png

The solid black line marks the transition from hypo to hyper methylated DMRs. Hence the right-hand side of that line indicate DMRs where the Ashkenazi family is less methylated on average than the Han family.

A closer look at some DMRs#

Let’s look at regions with low q-values, high number of CpGs, and significant methylation difference. These should be our strong DMR calls.

dmrs = dmr_regions.query(
    "dmr_qvalue < 0.01 & num_contexts > 10 & (mod_difference < -0.05 | mod_difference > 0.05)"
).reset_index(drop=True)
dmrs = dmrs.sort_values("dmr_pvalue")
dmrs

	Chromosome	Start	End	num_contexts	mean_coverage	mean_mod_group_1	mean_mod_group_2	mod_fold_change	mod_difference	test_statistic	dmr_pvalue	dmr_qvalue	Slice	-log10 dmr_qvalue
18059	chr1	244280325	244284050	63	96.089947	0.678418	0.849053	1.251519	0.170635	1424.550130	9.720008e-312	6.357954e-307	chr1:244280325-244284050	306.196683
241	chr1	23030631	23034399	35	106.876190	0.550629	0.788732	1.432420	0.238103	1415.936394	7.234870e-310	2.366200e-305	chr1:23030631-23034399	304.625948
18060	chr1	244284050	244287775	63	97.216931	0.646204	0.860158	1.331093	0.213954	2122.689492	2.225074e-308	7.660227e-305	chr1:244284050-244287775	304.115758
5223	chr1	143326766	143327537	84	48.507937	0.733338	0.938419	1.279654	0.205081	2150.004558	2.225074e-308	7.660227e-305	chr1:143326766-143327537	304.115758
165	chr1	16699390	16700179	89	81.932584	0.392937	0.155515	0.395776	-0.237422	3091.502387	2.225074e-308	7.660227e-305	chr1:16699390-16700179	304.115758
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
4987	chr1	121627144	121627659	12	33.152778	0.724702	0.783422	1.081027	0.058720	8.732711	3.125514e-03	4.345043e-03	chr1:121627144-121627659	2.362006
5427	chr1	149412235	149414853	20	39.116667	0.679546	0.618624	0.910349	-0.060922	8.229197	4.122176e-03	5.666162e-03	chr1:149412235-149414853	2.246711
5214	chr1	124807770	124808169	12	32.194444	0.665899	0.719740	1.080855	0.053841	7.787822	5.259957e-03	7.150764e-03	chr1:124807770-124808169	2.145648
5105	chr1	123652048	123652730	11	42.515152	0.586700	0.643289	1.096452	0.056589	7.782403	5.275756e-03	7.171200e-03	chr1:123652048-123652730	2.144408
5057	chr1	123112849	123113680	12	38.972222	0.661589	0.609712	0.921587	-0.051877	7.263604	7.036594e-03	9.445130e-03	chr1:123112849-123113680	2.024792

18301 rows × 14 columns

We can use the plot_methylation_tile method of the ContigDataset to plot visualise the DMRs.

In the two examples below, we can see tiles where the methylation is different between HG002, HG003, HG004 (the Ashkenazi trio) on one hand and HG005, HG006, HG007 (the Han trio) on the other hand. Each patch of color is a CpG. The dark patches correspond to CpGs with high methyation fraction, while the light patches correpond to a low methylation fraction. Grey patches (if any) are CpGs with low coverage. For each panel, the bottom plot is a locator which shows the genomic location of each CpG.

fig = substore[dmrs["Slice"].iloc[0]].plot_methylation_tile(
    numerator="num_modc",
    denominator="num_total_c",
    min_coverage=5,
)

_images/e5910f32c4f5d2ea53c4d33edcb2f16ec200b6e5df3da780ba3793761519202b.png

We can also plot the methylation trace. It shows the methylation fraction across the region of interest. In our case, it clearly shows a distinct pattern of methylation between the Ashkenazi group (blue-ish colours) and the Han group (red-ish colours).

kwargs = {
    "palette": ["#003B49", "#9CDBD9", "#668991", "#F87C56", "#fbb59f", "#f5430d"],
}

fig = substore[dmrs["Slice"].iloc[0]].plot_methylation_trace(
    numerators=["num_modc"],
    denominator="num_total_c",
    min_coverage=5,
    **kwargs
)

_images/d487db2dab4e4dbce0c88d589bcdda8f64d3458e42ff03d31a86a5671e792ef7.png

Let’s look at a second example:

fig = substore[dmrs["Slice"].iloc[1]].plot_methylation_tile(
    numerator="num_modc",
    denominator="num_total_c",
    min_coverage=5,
)

_images/1ff0113fa385a9da0e058d9e3349333e5918f91008a53720ba9edc1a78ab1369.png

fig = substore[dmrs["Slice"].iloc[1]].plot_methylation_trace(
    numerators=["num_modc"],
    denominator="num_total_c",
    min_coverage=5,
    **kwargs
)

_images/72a7ae3a8d0ba4123a351691cefdc31f40fee51b6dcacd797535239c2ea110c6.png

A non-DMR example#

In this example there is not clear difference in methylation between the two family trios, hence it was correctly not called as a DMR.

fig = substore[dmr_regions["Slice"].iloc[-7]].plot_methylation_tile(
    numerator="num_modc",
    denominator="num_total_c",
    min_coverage=5,
)

_images/69e53f63762ab9406a728540347143db98569083a3703c24846da748fa9ac322.png

fig = substore[dmr_regions["Slice"].iloc[-7]].plot_methylation_trace(
    numerators=["num_modc"],
    denominator="num_total_c",
    min_coverage=5,
    **kwargs
)

_images/9b633af90bf6f5f4048c57212fa2e41baf28fd073d8ed93da45ee38b6ce4f44d.png