Calling DMRs with modality#

This notebooks illustrates how modality can call DMRs (differentially methylated regions). The code looks for specific group differences (e.g. healthy versus control, or sex). A brief outline of the code is described below:

  1. We divide genome in smaller regions. Two options are available:

  • tiles of fixed length (e.g. 2kb)

  • segmentation based, in which we identify gaps in the genome where there are no CpG within a certain distance (e.g. 500bp) and divide accordingly. This is the default mode.

  1. The contexts of each region are summed together to get the total number of methylated and unmethylated counts.

  2. We use a logistic regression to model the methylation levels of each region. A DMR is defined as a region with a significant difference in methylation levels between two conditions. This is quantified by a p-value.

  3. Based on this p-value, we can decide if the tile is a DMR. We also report a q-value, which is the p-value corrected using the Benjamini-Hochberg method.

from modality.analysis import call_dmrs
from modality import load_biomodal_dataset
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pyranges as pr
import seaborn as sns

Load the data#

For the purposes of this demo we can use a duet +modC dataset consisting of 14 Genome in a Bottle (GIAB) samples. The data are publicly available and can be loaded using the load_biomodal_dataset() function. This will pull the dataset and load it in to our session as a ContigDataset object.

The dataset consists of 14 samples, 2 replicates of 7 cell lines. In this notebook we will look for regions of genome that are differentially methylated between two families: the Han mother-father-son trio on one hand, and the Ashkenazi mother-father-son trio on the other hand.

We merge the 2 technical replicates, and we merge the CpGs on the two strands.

ds = load_biomodal_dataset(
    name="giab",
    merge_technical_replicates=True,
    merge_cpgs=True,
    )
ds
<xarray.Dataset> Size: 12GB
Dimensions:                         (pos: 29294625, sample_id: 7)
Coordinates:
    contig                          (pos) <U20 2GB dask.array<chunksize=(40000,), meta=np.ndarray>
    ref_position                    (pos) uint32 117MB dask.array<chunksize=(40000,), meta=np.ndarray>
    strand                          (pos) <U1 117MB dask.array<chunksize=(40000,), meta=np.ndarray>
    trinucleotide_context           (pos) uint8 29MB dask.array<chunksize=(40000,), meta=np.ndarray>
  * sample_id                       (sample_id) object 56B 'HG001' ... 'HG007'
    group                           (sample_id) object 56B 'HG001' ... 'HG007'
Dimensions without coordinates: pos
Data variables:
    num_c                           (pos, sample_id) uint64 2GB dask.array<chunksize=(40000, 1), meta=np.ndarray>
    num_modc                        (pos, sample_id) uint64 2GB dask.array<chunksize=(40000, 1), meta=np.ndarray>
    num_n                           (pos, sample_id) uint64 2GB dask.array<chunksize=(40000, 1), meta=np.ndarray>
    num_other                       (pos, sample_id) uint64 2GB dask.array<chunksize=(40000, 1), meta=np.ndarray>
    num_total                       (pos, sample_id) uint64 2GB dask.array<chunksize=(40000, 1), meta=np.ndarray>
    num_total_c                     (pos, sample_id) uint64 2GB dask.array<chunksize=(40000, 1), meta=np.ndarray>
    Input DNA Quantity (ng/sample)  (sample_id) int64 56B 80 80 80 80 80 80 80
    Protocol Version                (sample_id) object 56B '5-Letter v2.4' .....
    NA_id                           (sample_id) object 56B 'NA12878' ... 'NA2...
    family                          (sample_id) object 56B 'Utah' ... 'Han Ch...
    family_status                   (sample_id) object 56B 'Singleton' ... 'M...
    sex                             (sample_id) object 56B 'Female' ... 'Female'
    tech_replicate_number           (sample_id) int64 56B 1 1 1 1 1 1 1
Attributes: (100/205)
    context:                     CG
    context_encoding:            {'0': 'AAA', '1': 'AAC', '10': 'AGA', '100':...
    contigs:                     ['chr1', 'chr2', 'chr3', 'chr4', 'chr5', 'ch...
    coordinate_basis:            0
    description:                 A +modC dataset of human Genome in a bottle ...
    fasta_path:                  GRCh38Decoy.fa
    input_path:                  ['CEG1530-EL01-A1200-001.genome.GRCh38Decoy_...
    quant_subtype:               modC
    ref_name:                    GRCh38Decoy
    sample_ids:                  ['CEG1530-EL01-A1200-001', 'CEG1530-EL01-A12...
    slice_chr1:                  slice(0, 2375159, 1)
    slice_chr10:                 slice(14996202, 16385180, 1)
    slice_chr11:                 slice(16385180, 17718294, 1)
    slice_chr11_KI270721v1_ran:  slice(29173382, 29176776, 1)
    slice_chr12:                 slice(17718294, 19034262, 1)
    slice_chr13:                 slice(19034262, 19876731, 1)
    slice_chr14:                 slice(19876731, 20739159, 1)
    slice_chr14_GL000009v2_ran:  slice(29176776, 29178255, 1)
    slice_chr14_GL000194v1_ran:  slice(29183653, 29185755, 1)
    slice_chr14_GL000225v1_ran:  slice(29178255, 29182074, 1)
    slice_chr14_KI270722v1_ran:  slice(29182074, 29183653, 1)
    slice_chr14_KI270723v1_ran:  slice(29185755, 29186353, 1)
    slice_chr14_KI270724v1_ran:  slice(29186353, 29186907, 1)
    slice_chr14_KI270725v1_ran:  slice(29186907, 29188509, 1)
    slice_chr14_KI270726v1_ran:  slice(29188509, 29189025, 1)
    slice_chr15:                 slice(20739159, 21645185, 1)
    slice_chr15_KI270727v1_ran:  slice(29189025, 29192989, 1)
    slice_chr16:                 slice(21645185, 22796076, 1)
    slice_chr16_KI270728v1_ran:  slice(29192989, 29208664, 1)
    slice_chr17:                 slice(22796076, 24044404, 1)
    slice_chr17_GL000205v2_ran:  slice(29208664, 29210537, 1)
    slice_chr17_KI270729v1_ran:  slice(29210537, 29217219, 1)
    slice_chr17_KI270730v1_ran:  slice(29217219, 29219552, 1)
    slice_chr18:                 slice(24044404, 24800418, 1)
    slice_chr19:                 slice(24800418, 25857083, 1)
    slice_chr1_KI270706v1_rand:  slice(29152891, 29154480, 1)
    slice_chr1_KI270707v1_rand:  slice(29154480, 29154791, 1)
    slice_chr1_KI270708v1_rand:  slice(29154791, 29155844, 1)
    slice_chr1_KI270709v1_rand:  slice(29155844, 29157288, 1)
    slice_chr1_KI270710v1_rand:  slice(29157288, 29157505, 1)
    slice_chr1_KI270711v1_rand:  slice(29157505, 29157882, 1)
    slice_chr1_KI270712v1_rand:  slice(29157882, 29160999, 1)
    slice_chr1_KI270713v1_rand:  slice(29160999, 29161864, 1)
    slice_chr1_KI270714v1_rand:  slice(29161864, 29163019, 1)
    slice_chr2:                  slice(2375159, 4567829, 1)
    slice_chr20:                 slice(25857083, 26630560, 1)
    slice_chr21:                 slice(26630560, 27059406, 1)
    slice_chr22:                 slice(27059406, 27660298, 1)
    slice_chr22_KI270731v1_ran:  slice(29219552, 29222033, 1)
    slice_chr22_KI270732v1_ran:  slice(29222033, 29222518, 1)
    ...                          ...
    slice_chrUn_KI270510v1:      slice(29247760, 29247775, 1)
    slice_chrUn_KI270511v1:      slice(29248979, 29249024, 1)
    slice_chrUn_KI270512v1:      slice(29247837, 29248020, 1)
    slice_chrUn_KI270515v1:      slice(29249024, 29249065, 1)
    slice_chrUn_KI270516v1:      slice(29247829, 29247837, 1)
    slice_chrUn_KI270517v1:      slice(29249104, 29249116, 1)
    slice_chrUn_KI270518v1:      slice(29247795, 29247808, 1)
    slice_chrUn_KI270519v1:      slice(29248020, 29248909, 1)
    slice_chrUn_KI270521v1:      slice(29251667, 29251743, 1)
    slice_chrUn_KI270522v1:      slice(29248909, 29248979, 1)
    slice_chrUn_KI270528v1:      slice(29249140, 29249240, 1)
    slice_chrUn_KI270529v1:      slice(29249116, 29249140, 1)
    slice_chrUn_KI270530v1:      slice(29249240, 29249275, 1)
    slice_chrUn_KI270538v1:      slice(29249353, 29250059, 1)
    slice_chrUn_KI270539v1:      slice(29249275, 29249353, 1)
    slice_chrUn_KI270544v1:      slice(29250059, 29250078, 1)
    slice_chrUn_KI270548v1:      slice(29250078, 29250103, 1)
    slice_chrUn_KI270579v1:      slice(29250198, 29250411, 1)
    slice_chrUn_KI270580v1:      slice(29250129, 29250141, 1)
    slice_chrUn_KI270581v1:      slice(29250141, 29250198, 1)
    slice_chrUn_KI270582v1:      slice(29250753, 29250816, 1)
    slice_chrUn_KI270583v1:      slice(29250103, 29250114, 1)
    slice_chrUn_KI270584v1:      slice(29250726, 29250753, 1)
    slice_chrUn_KI270587v1:      slice(29250114, 29250129, 1)
    slice_chrUn_KI270588v1:      slice(29250816, 29250857, 1)
    slice_chrUn_KI270589v1:      slice(29250411, 29250700, 1)
    slice_chrUn_KI270590v1:      slice(29250700, 29250726, 1)
    slice_chrUn_KI270591v1:      slice(29250879, 29250912, 1)
    slice_chrUn_KI270593v1:      slice(29250857, 29250879, 1)
    slice_chrUn_KI270741v1:      slice(29263727, 29265416, 1)
    slice_chrUn_KI270742v1:      slice(29288641, 29290053, 1)
    slice_chrUn_KI270743v1:      slice(29267377, 29269464, 1)
    slice_chrUn_KI270744v1:      slice(29269464, 29271450, 1)
    slice_chrUn_KI270745v1:      slice(29271450, 29272082, 1)
    slice_chrUn_KI270746v1:      slice(29272082, 29272487, 1)
    slice_chrUn_KI270747v1:      slice(29272487, 29275904, 1)
    slice_chrUn_KI270748v1:      slice(29275904, 29277399, 1)
    slice_chrUn_KI270749v1:      slice(29277399, 29278817, 1)
    slice_chrUn_KI270750v1:      slice(29278817, 29280152, 1)
    slice_chrUn_KI270751v1:      slice(29280152, 29281268, 1)
    slice_chrUn_KI270752v1:      slice(29281268, 29281503, 1)
    slice_chrUn_KI270753v1:      slice(29281503, 29282300, 1)
    slice_chrUn_KI270754v1:      slice(29282300, 29283979, 1)
    slice_chrUn_KI270755v1:      slice(29283979, 29284530, 1)
    slice_chrUn_KI270756v1:      slice(29284530, 29286560, 1)
    slice_chrUn_KI270757v1:      slice(29286560, 29286986, 1)
    slice_chrX:                  slice(27660298, 28983007, 1)
    slice_chrY:                  slice(28983007, 29152456, 1)
    slice_chrY_KI270740v1_rand:  slice(29239647, 29239942, 1)
    zarr_version:                0.1

Select the two families#

First, we use the select_samples method to select only the samples corresponding to the two families of interest (i.e., the families that are not from the HG001/Utah cell line).

ds = ds.select_samples([i for i in ds.sample_id.values if i != "HG001"])
ds
<xarray.Dataset> Size: 11GB
Dimensions:                         (pos: 29294625, sample_id: 6)
Coordinates:
    contig                          (pos) <U20 2GB dask.array<chunksize=(40000,), meta=np.ndarray>
    ref_position                    (pos) uint32 117MB dask.array<chunksize=(40000,), meta=np.ndarray>
    strand                          (pos) <U1 117MB dask.array<chunksize=(40000,), meta=np.ndarray>
    trinucleotide_context           (pos) uint8 29MB dask.array<chunksize=(40000,), meta=np.ndarray>
  * sample_id                       (sample_id) object 48B 'HG002' ... 'HG007'
    group                           (sample_id) object 48B 'HG002' ... 'HG007'
Dimensions without coordinates: pos
Data variables:
    num_c                           (pos, sample_id) uint64 1GB dask.array<chunksize=(40000, 1), meta=np.ndarray>
    num_modc                        (pos, sample_id) uint64 1GB dask.array<chunksize=(40000, 1), meta=np.ndarray>
    num_n                           (pos, sample_id) uint64 1GB dask.array<chunksize=(40000, 1), meta=np.ndarray>
    num_other                       (pos, sample_id) uint64 1GB dask.array<chunksize=(40000, 1), meta=np.ndarray>
    num_total                       (pos, sample_id) uint64 1GB dask.array<chunksize=(40000, 1), meta=np.ndarray>
    num_total_c                     (pos, sample_id) uint64 1GB dask.array<chunksize=(40000, 1), meta=np.ndarray>
    Input DNA Quantity (ng/sample)  (sample_id) int64 48B 80 80 80 80 80 80
    Protocol Version                (sample_id) object 48B '5-Letter v2.4' .....
    NA_id                           (sample_id) object 48B 'NA24385' ... 'NA2...
    family                          (sample_id) object 48B 'Ashkenazi Jewish'...
    family_status                   (sample_id) object 48B 'Son' ... 'Mother'
    sex                             (sample_id) object 48B 'Male' ... 'Female'
    tech_replicate_number           (sample_id) int64 48B 1 1 1 1 1 1
Attributes: (100/205)
    context:                     CG
    context_encoding:            {'0': 'AAA', '1': 'AAC', '10': 'AGA', '100':...
    contigs:                     ['chr1', 'chr2', 'chr3', 'chr4', 'chr5', 'ch...
    coordinate_basis:            0
    description:                 A +modC dataset of human Genome in a bottle ...
    fasta_path:                  GRCh38Decoy.fa
    input_path:                  ['CEG1530-EL01-A1200-001.genome.GRCh38Decoy_...
    quant_subtype:               modC
    ref_name:                    GRCh38Decoy
    sample_ids:                  ['CEG1530-EL01-A1200-001', 'CEG1530-EL01-A12...
    slice_chr1:                  slice(0, 2375159, 1)
    slice_chr10:                 slice(14996202, 16385180, 1)
    slice_chr11:                 slice(16385180, 17718294, 1)
    slice_chr11_KI270721v1_ran:  slice(29173382, 29176776, 1)
    slice_chr12:                 slice(17718294, 19034262, 1)
    slice_chr13:                 slice(19034262, 19876731, 1)
    slice_chr14:                 slice(19876731, 20739159, 1)
    slice_chr14_GL000009v2_ran:  slice(29176776, 29178255, 1)
    slice_chr14_GL000194v1_ran:  slice(29183653, 29185755, 1)
    slice_chr14_GL000225v1_ran:  slice(29178255, 29182074, 1)
    slice_chr14_KI270722v1_ran:  slice(29182074, 29183653, 1)
    slice_chr14_KI270723v1_ran:  slice(29185755, 29186353, 1)
    slice_chr14_KI270724v1_ran:  slice(29186353, 29186907, 1)
    slice_chr14_KI270725v1_ran:  slice(29186907, 29188509, 1)
    slice_chr14_KI270726v1_ran:  slice(29188509, 29189025, 1)
    slice_chr15:                 slice(20739159, 21645185, 1)
    slice_chr15_KI270727v1_ran:  slice(29189025, 29192989, 1)
    slice_chr16:                 slice(21645185, 22796076, 1)
    slice_chr16_KI270728v1_ran:  slice(29192989, 29208664, 1)
    slice_chr17:                 slice(22796076, 24044404, 1)
    slice_chr17_GL000205v2_ran:  slice(29208664, 29210537, 1)
    slice_chr17_KI270729v1_ran:  slice(29210537, 29217219, 1)
    slice_chr17_KI270730v1_ran:  slice(29217219, 29219552, 1)
    slice_chr18:                 slice(24044404, 24800418, 1)
    slice_chr19:                 slice(24800418, 25857083, 1)
    slice_chr1_KI270706v1_rand:  slice(29152891, 29154480, 1)
    slice_chr1_KI270707v1_rand:  slice(29154480, 29154791, 1)
    slice_chr1_KI270708v1_rand:  slice(29154791, 29155844, 1)
    slice_chr1_KI270709v1_rand:  slice(29155844, 29157288, 1)
    slice_chr1_KI270710v1_rand:  slice(29157288, 29157505, 1)
    slice_chr1_KI270711v1_rand:  slice(29157505, 29157882, 1)
    slice_chr1_KI270712v1_rand:  slice(29157882, 29160999, 1)
    slice_chr1_KI270713v1_rand:  slice(29160999, 29161864, 1)
    slice_chr1_KI270714v1_rand:  slice(29161864, 29163019, 1)
    slice_chr2:                  slice(2375159, 4567829, 1)
    slice_chr20:                 slice(25857083, 26630560, 1)
    slice_chr21:                 slice(26630560, 27059406, 1)
    slice_chr22:                 slice(27059406, 27660298, 1)
    slice_chr22_KI270731v1_ran:  slice(29219552, 29222033, 1)
    slice_chr22_KI270732v1_ran:  slice(29222033, 29222518, 1)
    ...                          ...
    slice_chrUn_KI270510v1:      slice(29247760, 29247775, 1)
    slice_chrUn_KI270511v1:      slice(29248979, 29249024, 1)
    slice_chrUn_KI270512v1:      slice(29247837, 29248020, 1)
    slice_chrUn_KI270515v1:      slice(29249024, 29249065, 1)
    slice_chrUn_KI270516v1:      slice(29247829, 29247837, 1)
    slice_chrUn_KI270517v1:      slice(29249104, 29249116, 1)
    slice_chrUn_KI270518v1:      slice(29247795, 29247808, 1)
    slice_chrUn_KI270519v1:      slice(29248020, 29248909, 1)
    slice_chrUn_KI270521v1:      slice(29251667, 29251743, 1)
    slice_chrUn_KI270522v1:      slice(29248909, 29248979, 1)
    slice_chrUn_KI270528v1:      slice(29249140, 29249240, 1)
    slice_chrUn_KI270529v1:      slice(29249116, 29249140, 1)
    slice_chrUn_KI270530v1:      slice(29249240, 29249275, 1)
    slice_chrUn_KI270538v1:      slice(29249353, 29250059, 1)
    slice_chrUn_KI270539v1:      slice(29249275, 29249353, 1)
    slice_chrUn_KI270544v1:      slice(29250059, 29250078, 1)
    slice_chrUn_KI270548v1:      slice(29250078, 29250103, 1)
    slice_chrUn_KI270579v1:      slice(29250198, 29250411, 1)
    slice_chrUn_KI270580v1:      slice(29250129, 29250141, 1)
    slice_chrUn_KI270581v1:      slice(29250141, 29250198, 1)
    slice_chrUn_KI270582v1:      slice(29250753, 29250816, 1)
    slice_chrUn_KI270583v1:      slice(29250103, 29250114, 1)
    slice_chrUn_KI270584v1:      slice(29250726, 29250753, 1)
    slice_chrUn_KI270587v1:      slice(29250114, 29250129, 1)
    slice_chrUn_KI270588v1:      slice(29250816, 29250857, 1)
    slice_chrUn_KI270589v1:      slice(29250411, 29250700, 1)
    slice_chrUn_KI270590v1:      slice(29250700, 29250726, 1)
    slice_chrUn_KI270591v1:      slice(29250879, 29250912, 1)
    slice_chrUn_KI270593v1:      slice(29250857, 29250879, 1)
    slice_chrUn_KI270741v1:      slice(29263727, 29265416, 1)
    slice_chrUn_KI270742v1:      slice(29288641, 29290053, 1)
    slice_chrUn_KI270743v1:      slice(29267377, 29269464, 1)
    slice_chrUn_KI270744v1:      slice(29269464, 29271450, 1)
    slice_chrUn_KI270745v1:      slice(29271450, 29272082, 1)
    slice_chrUn_KI270746v1:      slice(29272082, 29272487, 1)
    slice_chrUn_KI270747v1:      slice(29272487, 29275904, 1)
    slice_chrUn_KI270748v1:      slice(29275904, 29277399, 1)
    slice_chrUn_KI270749v1:      slice(29277399, 29278817, 1)
    slice_chrUn_KI270750v1:      slice(29278817, 29280152, 1)
    slice_chrUn_KI270751v1:      slice(29280152, 29281268, 1)
    slice_chrUn_KI270752v1:      slice(29281268, 29281503, 1)
    slice_chrUn_KI270753v1:      slice(29281503, 29282300, 1)
    slice_chrUn_KI270754v1:      slice(29282300, 29283979, 1)
    slice_chrUn_KI270755v1:      slice(29283979, 29284530, 1)
    slice_chrUn_KI270756v1:      slice(29284530, 29286560, 1)
    slice_chrUn_KI270757v1:      slice(29286560, 29286986, 1)
    slice_chrX:                  slice(27660298, 28983007, 1)
    slice_chrY:                  slice(28983007, 29152456, 1)
    slice_chrY_KI270740v1_rand:  slice(29239647, 29239942, 1)
    zarr_version:                0.1

Restrict to chromosome 1#

In this notebook, we will look for DMRs on a subset of the whole genome, here chromosome 1.

substore = ds["chr1"].copy()

Filter by coverage#

It is good practice to remove sites of low coverage before calling DMRs. This can be done with the subset_bycoverage method of the ContigDataset. Here we call it with the method="any", which removes a position if ay of the samples have a coverage strictly lower than min_coverage at that CpG context. Other options are "all" and "mean".

substore = substore.subset_bycoverage(min_coverage=5, method="any")

Call DMRs#

Now we can call DMRs using the call_dmrs function. We need to specify the ContigDataset object we are going to use (here substore), as well as the count variables (here num_modc and num_total_c) and the condition (here we look for DMRs across two different families, so we set use family).

Finally, call_dmrs needs to know the regions where to look for DMRs:

  • We use the default mode, which is to pre-segment the CpGs, creating a new region every time two consecutive CpGs are separated by a distance larger than presegment_threshold (here we use the default presegment_threshold of 500bp). This determines the regions in which modality is going to look for DMRs.

  • Note we could instead pass a ranges argument, which can either be an integer (e.g. ranges=2000 will divide the genome in tiles of size 2kbp), or an actual set of genomic ranges. If such object is provided, modality will call DMRs on those ranges. This object should be a pyranges object. If you have a set of genomic ranges as a bed file, you can for instance use ranges = pr.read_bed(path_to_bedfile) and pass it to call_dmrs.

For this dataset, it takes about 1 minutes to call DMRs on chromosome 1 on a MacBook Pro M2.

Note that you don’t have to restrict to one chromosome at a time. You can use call_dmrs on your ContigDataset ds.

substore
<xarray.Dataset> Size: 845MB
Dimensions:                         (pos: 2241991, sample_id: 6)
Coordinates:
  * sample_id                       (sample_id) object 48B 'HG002' ... 'HG007'
    group                           (sample_id) object 48B 'HG002' ... 'HG007'
    contig                          (pos) <U20 179MB dask.array<chunksize=(40000,), meta=np.ndarray>
    ref_position                    (pos) uint32 9MB dask.array<chunksize=(40000,), meta=np.ndarray>
    strand                          (pos) <U1 9MB dask.array<chunksize=(40000,), meta=np.ndarray>
    trinucleotide_context           (pos) uint8 2MB dask.array<chunksize=(40000,), meta=np.ndarray>
Dimensions without coordinates: pos
Data variables:
    num_c                           (pos, sample_id) uint64 108MB dask.array<chunksize=(40000, 6), meta=np.ndarray>
    num_modc                        (pos, sample_id) uint64 108MB dask.array<chunksize=(40000, 6), meta=np.ndarray>
    num_n                           (pos, sample_id) uint64 108MB dask.array<chunksize=(40000, 6), meta=np.ndarray>
    num_other                       (pos, sample_id) uint64 108MB dask.array<chunksize=(40000, 6), meta=np.ndarray>
    num_total                       (pos, sample_id) uint64 108MB dask.array<chunksize=(40000, 6), meta=np.ndarray>
    num_total_c                     (pos, sample_id) uint64 108MB dask.array<chunksize=(40000, 6), meta=np.ndarray>
    Input DNA Quantity (ng/sample)  (sample_id) int64 48B 80 80 80 80 80 80
    Protocol Version                (sample_id) object 48B '5-Letter v2.4' .....
    NA_id                           (sample_id) object 48B 'NA24385' ... 'NA2...
    family                          (sample_id) object 48B 'Ashkenazi Jewish'...
    family_status                   (sample_id) object 48B 'Son' ... 'Mother'
    sex                             (sample_id) object 48B 'Male' ... 'Female'
    tech_replicate_number           (sample_id) int64 48B 1 1 1 1 1 1
Attributes:
    context:           CG
    context_encoding:  {'0': 'AAA', '1': 'AAC', '10': 'AGA', '100': 'TAA', '1...
    contigs:           ['chr1']
    coordinate_basis:  0
    description:       A +modC dataset of human Genome in a bottle (GIAB) sam...
    fasta_path:        GRCh38Decoy.fa
    input_path:        ['CEG1530-EL01-A1200-001.genome.GRCh38Decoy_primary_as...
    quant_subtype:     modC
    ref_name:          GRCh38Decoy
    sample_ids:        ['CEG1530-EL01-A1200-001', 'CEG1530-EL01-A1200-004', '...
    slice_chr1:        slice(0, 2241991, 1)
    zarr_version:      0.1

Now we are ready to call DMRs.

dmr_regions = call_dmrs(
    ds=substore,
    count_array_name="num_modc",
    total_counts_name="num_total_c",
    condition_array_name="family",
    presegment_threshold=500,
    condition_order=["Ashkenazi Jewish", "Han Chinese"],
    as_pyranges=True,
)

The output dmr_regions is a dataframe containing the following columns:

  • the chromosome, start, end (zero-based) of the region

  • the number of contexts in the region

  • the mean coverage of the region

  • the mean methylation fraction of each group

  • the methylation fold change (mean(group_2)/mean(group_1))

  • the methylation difference (mean(group_2)-mean(group_1))

  • the test statistic, in our case a Wald test

  • the p-value associated with the significance of the coefficient of the logistic regression associated with the group

  • the q-value (Benjamini-Hochberg corrected p-value)

Note that we could also ouptut a pandas Dataframe instead of a pyranges one by setting as_pyranges=False in the call_dmrs function.

dmr_regions.head()
Chromosome Start End num_contexts mean_coverage mean_mod_group_1 mean_mod_group_2 mod_fold_change mod_difference test_statistic dmr_pvalue dmr_qvalue
0 chr1 10578 10816 50 28.973333 0.966083 0.924386 0.956839 -0.041697 60.001180 9.480050e-15 3.337637e-14
1 chr1 14652 14791 9 70.907407 0.917142 0.934554 1.018985 0.017412 2.874834 8.997427e-02 1.071010e-01
2 chr1 16138 16869 8 81.333333 0.831444 0.841313 1.011870 0.009869 1.169600 2.794835e-01 3.111446e-01
3 chr1 17377 17613 10 129.400000 0.960567 0.936319 0.974757 -0.024247 22.613622 1.980753e-06 3.656151e-06
4 chr1 19159 19322 9 63.685185 0.921850 0.918903 0.996804 -0.002946 0.011737 9.137286e-01 9.236813e-01
5 chr1 29271 29435 25 25.933333 0.002069 0.002200 1.063500 0.000131 0.001907 9.651642e-01 9.692986e-01
6 chr1 102932 103249 8 117.750000 0.598732 0.724905 1.210733 0.126173 115.735209 5.431896e-27 4.532539e-26
7 chr1 108774 109575 6 107.555556 0.456157 0.347131 0.760990 -0.109026 21.070517 4.426873e-06 7.931364e-06

Even if you kept the pyranges object as the default type of the output, it is straightforward to convert this object to a pandas dataframe if we want:

dmr_regions = dmr_regions.df

It is also useful to associate a slice in the format “Chromosome:Start-End” to the dataframe, since our ContigDataset object can be sliced using such string. This will come in handy later on in this notebook when we plot DMRs.

dmr_regions["Slice"] = dmr_regions.apply(
    lambda row: f"{row['Chromosome']}:{row['Start']}-{row['End']}", axis=1
)    
dmr_regions.head()
Chromosome Start End num_contexts mean_coverage mean_mod_group_1 mean_mod_group_2 mod_fold_change mod_difference test_statistic dmr_pvalue dmr_qvalue Slice
0 chr1 10578 10816 50 28.973333 0.966083 0.924386 0.956839 -0.041697 60.001180 9.480050e-15 3.337637e-14 chr1:10578-10816
1 chr1 14652 14791 9 70.907407 0.917142 0.934554 1.018985 0.017412 2.874834 8.997427e-02 1.071010e-01 chr1:14652-14791
2 chr1 16138 16869 8 81.333333 0.831444 0.841313 1.011870 0.009869 1.169600 2.794835e-01 3.111446e-01 chr1:16138-16869
3 chr1 17377 17613 10 129.400000 0.960567 0.936319 0.974757 -0.024247 22.613622 1.980753e-06 3.656151e-06 chr1:17377-17613
4 chr1 19159 19322 9 63.685185 0.921850 0.918903 0.996804 -0.002946 0.011737 9.137286e-01 9.236813e-01 chr1:19159-19322

Differentially Methylated Cytosines#

Instead of calling DMRs, you may be interested to know if your individual cytosines are differentially methyated or not (so called DMC’s).

In principle you could call DMCs genome-wide, but in practice it involves running one logistic regression per context. This task is time and memory intensive, and is not recommended. Instead, we can focus a small region of interest, for instance a gene or an enhancer, and look at differentially methylated contexts for this region only.

Let’s look at the GBA1 gene, associated with Gaucher disease, a genetic disease particularly common in the Ashkenazi population. The gene is on the negative strand of chr1, between positions 155234451 and 155244698 (human reference hg38). We can select this particular region in our dataset by using a slice string formatted as “Chromosome:Start-End”. So for our particular gene, we would do:

ds["chr1:155234451-155244698"]
<xarray.Dataset> Size: 74kB
Dimensions:                         (pos: 196, sample_id: 6)
Coordinates:
    contig                          (pos) <U20 16kB dask.array<chunksize=(196,), meta=np.ndarray>
    ref_position                    (pos) uint32 784B dask.array<chunksize=(196,), meta=np.ndarray>
    strand                          (pos) <U1 784B dask.array<chunksize=(196,), meta=np.ndarray>
    trinucleotide_context           (pos) uint8 196B dask.array<chunksize=(196,), meta=np.ndarray>
  * sample_id                       (sample_id) object 48B 'HG002' ... 'HG007'
    group                           (sample_id) object 48B 'HG002' ... 'HG007'
Dimensions without coordinates: pos
Data variables:
    num_c                           (pos, sample_id) uint64 9kB dask.array<chunksize=(196, 1), meta=np.ndarray>
    num_modc                        (pos, sample_id) uint64 9kB dask.array<chunksize=(196, 1), meta=np.ndarray>
    num_n                           (pos, sample_id) uint64 9kB dask.array<chunksize=(196, 1), meta=np.ndarray>
    num_other                       (pos, sample_id) uint64 9kB dask.array<chunksize=(196, 1), meta=np.ndarray>
    num_total                       (pos, sample_id) uint64 9kB dask.array<chunksize=(196, 1), meta=np.ndarray>
    num_total_c                     (pos, sample_id) uint64 9kB dask.array<chunksize=(196, 1), meta=np.ndarray>
    Input DNA Quantity (ng/sample)  (sample_id) int64 48B 80 80 80 80 80 80
    Protocol Version                (sample_id) object 48B '5-Letter v2.4' .....
    NA_id                           (sample_id) object 48B 'NA24385' ... 'NA2...
    family                          (sample_id) object 48B 'Ashkenazi Jewish'...
    family_status                   (sample_id) object 48B 'Son' ... 'Mother'
    sex                             (sample_id) object 48B 'Male' ... 'Female'
    tech_replicate_number           (sample_id) int64 48B 1 1 1 1 1 1
Attributes:
    context:           CG
    context_encoding:  {'0': 'AAA', '1': 'AAC', '10': 'AGA', '100': 'TAA', '1...
    contigs:           ['chr1']
    coordinate_basis:  0
    description:       A +modC dataset of human Genome in a bottle (GIAB) sam...
    fasta_path:        GRCh38Decoy.fa
    input_path:        ['CEG1530-EL01-A1200-001.genome.GRCh38Decoy_primary_as...
    quant_subtype:     modC
    ref_name:          GRCh38Decoy
    sample_ids:        ['CEG1530-EL01-A1200-001', 'CEG1530-EL01-A1200-004', '...
    zarr_version:      0.1
    slice_chr1:        slice(0, 196, 1)

Let’s look at DMCs in this gene. In order to achieve this, we run the call_dmrs function with the ranges=1 parameter. This creates ranges of size 1bp, so each context becomes its own genomic range on which we can call differential methylation.

dmc_calls = call_dmrs(
    ds=ds["chr1:155234451-155244698"],
    count_array_name="num_modc",
    total_counts_name="num_total_c",
    condition_array_name="family",
    condition_order=["Ashkenazi Jewish", "Han Chinese"],
    as_pyranges=True,
    ranges=1,
)

Let’s remove positions that do not contain a context, and sites of low coverage:

dmc_calls = dmc_calls[(dmc_calls.num_contexts>0) & (dmc_calls.mean_coverage>5)].df

The table below shows all the contexts in the gene, ordered by the p-value associated with their level of differential methylation.

dmc_calls.sort_values("dmr_pvalue")
Chromosome Start End num_contexts mean_coverage mean_mod_group_1 mean_mod_group_2 mod_fold_change mod_difference test_statistic dmr_pvalue dmr_qvalue
110 chr1 155241670 155241671 1 111.166667 0.236535 0.635130 2.685147 0.398595 101.937739 5.729453e-24 5.812530e-20
74 chr1 155239109 155239110 1 93.333333 0.520995 0.877068 1.683448 0.356073 75.437030 3.772482e-18 1.913591e-14
124 chr1 155243497 155243498 1 105.000000 0.423714 0.734200 1.732771 0.310485 67.912718 1.708953e-16 5.779111e-13
115 chr1 155242045 155242046 1 113.666667 0.677408 0.921636 1.360532 0.244228 56.317558 6.166232e-14 1.563911e-10
54 chr1 155237947 155237948 1 94.000000 0.665644 0.408962 0.614385 -0.256682 38.760004 4.792476e-10 9.723934e-07
... ... ... ... ... ... ... ... ... ... ... ... ...
59 chr1 155238289 155238290 1 64.500000 0.919168 0.917938 0.998663 -0.001229 0.001364 9.705439e-01 1.000000e+00
156 chr1 155244259 155244260 1 97.833333 0.010367 0.010317 0.995114 -0.000051 0.000862 9.765759e-01 1.000000e+00
64 chr1 155238682 155238683 1 106.000000 0.880094 0.880548 1.000515 0.000453 0.000854 9.766926e-01 1.000000e+00
17 chr1 155235813 155235814 1 99.833333 0.961497 0.963858 1.002456 0.002361 0.000584 9.807192e-01 1.000000e+00
93 chr1 155240047 155240048 1 80.333333 0.964794 0.966017 1.001268 0.001223 0.000207 9.885330e-01 1.000000e+00

186 rows × 12 columns

Let’s visualise the significant contexts as a Manhattan plot of their q-values:

dmc_calls["minus_log10_qvalue"] = -np.log10(dmc_calls["dmr_qvalue"])
g = sns.scatterplot(
    data=dmc_calls, 
    x="Start", 
    y="minus_log10_qvalue",
    edgecolor=None,
    s=5,
)

# Format axis labels
plt.xlabel("Genomic Position")
plt.ylabel("-log10(q-value)")
# add significance threshold line
plt.axhline(-np.log10(0.05), color='C2', linestyle='--', label="q-value = 0.05")

# Format x-axis labels
ticks = g.get_xticks()
xlabels = ["{:,.0f}".format(x) for x in ticks]
g.set_xticklabels(xlabels)
# Rotate x-axis labels
plt.xticks(rotation=45, ha='right')
# Only three ticks on x-axis
plt.locator_params(axis='x', nbins=3)
# Add legend
plt.legend()

plt.show()
_images/89d3a611dc08301f2c65d4903abbed0d3907e1700817a599fe2778bc96caa5b8.png

Volcano plot of DMRs#

Now let’s return to our region-based calls. Let’s visualise the q-values of our DMR calls against the mean methylation difference between the two groups.

dmr_regions["dmr_qvalue"] = dmr_regions["dmr_qvalue"].replace(0, 10**-300)
dmr_regions["-log10 dmr_qvalue"] = -np.log10(dmr_regions["dmr_qvalue"])
g = sns.scatterplot(
    x="mod_difference",
    y="-log10 dmr_qvalue",
    data=dmr_regions,
    edgecolor=None,
    s=5,
)
g.set_xlim(-1, 1)
g.axvline(0, color="black", linewidth=0.5)
g.axvline(np.log2(1./2), color="black", linewidth=0.5, linestyle="--")
g.axvline(np.log2(2.), color="black", linewidth=0.5, linestyle="--")
<matplotlib.lines.Line2D at 0x7f060ef51b10>
_images/10ead41ef42e5f3b657db886539f375eca92e79ce52a77a9849068b5a4a8bc94.png

The solid black line marks the transition from hypo to hyper methylated DMRs. Hence the right-hand side of that line indicate DMRs where the Ashkenazi family is less methylated on average than the Han family.

A closer look at some DMRs#

Let’s look at regions with low q-values, high number of CpGs, and significant methylation difference. These should be our strong DMR calls.

dmrs = dmr_regions.query(
    "dmr_qvalue < 0.01 & num_contexts > 10 & (mod_difference < -0.05 | mod_difference > 0.05)"
).reset_index(drop=True)
dmrs = dmrs.sort_values("dmr_pvalue")
dmrs
Chromosome Start End num_contexts mean_coverage mean_mod_group_1 mean_mod_group_2 mod_fold_change mod_difference test_statistic dmr_pvalue dmr_qvalue Slice -log10 dmr_qvalue
18059 chr1 244280325 244284050 63 96.089947 0.678418 0.849053 1.251519 0.170635 1424.550130 9.720008e-312 6.357954e-307 chr1:244280325-244284050 306.196683
241 chr1 23030631 23034399 35 106.876190 0.550629 0.788732 1.432420 0.238103 1415.936394 7.234870e-310 2.366200e-305 chr1:23030631-23034399 304.625948
18060 chr1 244284050 244287775 63 97.216931 0.646204 0.860158 1.331093 0.213954 2122.689492 2.225074e-308 7.660227e-305 chr1:244284050-244287775 304.115758
5223 chr1 143326766 143327537 84 48.507937 0.733338 0.938419 1.279654 0.205081 2150.004558 2.225074e-308 7.660227e-305 chr1:143326766-143327537 304.115758
165 chr1 16699390 16700179 89 81.932584 0.392937 0.155515 0.395776 -0.237422 3091.502387 2.225074e-308 7.660227e-305 chr1:16699390-16700179 304.115758
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
4987 chr1 121627144 121627659 12 33.152778 0.724702 0.783422 1.081027 0.058720 8.732711 3.125514e-03 4.345043e-03 chr1:121627144-121627659 2.362006
5427 chr1 149412235 149414853 20 39.116667 0.679546 0.618624 0.910349 -0.060922 8.229197 4.122176e-03 5.666162e-03 chr1:149412235-149414853 2.246711
5214 chr1 124807770 124808169 12 32.194444 0.665899 0.719740 1.080855 0.053841 7.787822 5.259957e-03 7.150764e-03 chr1:124807770-124808169 2.145648
5105 chr1 123652048 123652730 11 42.515152 0.586700 0.643289 1.096452 0.056589 7.782403 5.275756e-03 7.171200e-03 chr1:123652048-123652730 2.144408
5057 chr1 123112849 123113680 12 38.972222 0.661589 0.609712 0.921587 -0.051877 7.263604 7.036594e-03 9.445130e-03 chr1:123112849-123113680 2.024792

18301 rows × 14 columns

We can use the plot_methylation_tile method of the ContigDataset to plot visualise the DMRs.

In the two examples below, we can see tiles where the methylation is different between HG002, HG003, HG004 (the Ashkenazi trio) on one hand and HG005, HG006, HG007 (the Han trio) on the other hand. Each patch of color is a CpG. The dark patches correspond to CpGs with high methyation fraction, while the light patches correpond to a low methylation fraction. Grey patches (if any) are CpGs with low coverage. For each panel, the bottom plot is a locator which shows the genomic location of each CpG.

fig = substore[dmrs["Slice"].iloc[0]].plot_methylation_tile(
    numerator="num_modc",
    denominator="num_total_c",
    min_coverage=5,
)
_images/e5910f32c4f5d2ea53c4d33edcb2f16ec200b6e5df3da780ba3793761519202b.png

We can also plot the methylation trace. It shows the methylation fraction across the region of interest. In our case, it clearly shows a distinct pattern of methylation between the Ashkenazi group (blue-ish colours) and the Han group (red-ish colours).

kwargs = {
    "palette": ["#003B49", "#9CDBD9", "#668991", "#F87C56", "#fbb59f", "#f5430d"],
}

fig = substore[dmrs["Slice"].iloc[0]].plot_methylation_trace(
    numerators=["num_modc"],
    denominator="num_total_c",
    min_coverage=5,
    **kwargs
)
_images/d487db2dab4e4dbce0c88d589bcdda8f64d3458e42ff03d31a86a5671e792ef7.png

Let’s look at a second example:

fig = substore[dmrs["Slice"].iloc[1]].plot_methylation_tile(
    numerator="num_modc",
    denominator="num_total_c",
    min_coverage=5,
)
_images/1ff0113fa385a9da0e058d9e3349333e5918f91008a53720ba9edc1a78ab1369.png
fig = substore[dmrs["Slice"].iloc[1]].plot_methylation_trace(
    numerators=["num_modc"],
    denominator="num_total_c",
    min_coverage=5,
    **kwargs
)
_images/72a7ae3a8d0ba4123a351691cefdc31f40fee51b6dcacd797535239c2ea110c6.png

A non-DMR example#

In this example there is not clear difference in methylation between the two family trios, hence it was correctly not called as a DMR.

fig = substore[dmr_regions["Slice"].iloc[-7]].plot_methylation_tile(
    numerator="num_modc",
    denominator="num_total_c",
    min_coverage=5,
)
_images/69e53f63762ab9406a728540347143db98569083a3703c24846da748fa9ac322.png
fig = substore[dmr_regions["Slice"].iloc[-7]].plot_methylation_trace(
    numerators=["num_modc"],
    denominator="num_total_c",
    min_coverage=5,
    **kwargs
)
_images/9b633af90bf6f5f4048c57212fa2e41baf28fd073d8ed93da45ee38b6ce4f44d.png