Calling DMRs with modality#
This notebooks illustrates how modality
can call DMRs (differentially methylated regions). The code looks for specific group differences (e.g. healthy versus control, or sex). A brief outline of the code is described below:
We divide genome in smaller regions. Two options are available:
tiles of fixed length (e.g. 2kb)
segmentation based, in which we identify gaps in the genome where there are no CpG within a certain distance (e.g. 500bp) and divide accordingly. This is the default mode.
The contexts of each region are summed together to get the total number of methylated and unmethylated counts.
We use a logistic regression to model the methylation levels of each region. A DMR is defined as a region with a significant difference in methylation levels between two conditions. This is quantified by a p-value.
Based on this p-value, we can decide if the tile is a DMR. We also report a q-value, which is the p-value corrected using the Benjamini-Hochberg method.
from modality.analysis import call_dmrs
from modality import load_biomodal_dataset
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pyranges as pr
import seaborn as sns
Load the data#
For the purposes of this demo we can use a duet +modC dataset consisting of 14 Genome in a Bottle (GIAB) samples. The data are publicly available and can be loaded using the load_biomodal_dataset()
function. This will pull the dataset and load it in to our session as a ContigDataset
object.
The dataset consists of 14 samples, 2 replicates of 7 cell lines. In this notebook we will look for regions of genome that are differentially methylated between two families: the Han mother-father-son trio on one hand, and the Ashkenazi mother-father-son trio on the other hand.
We merge the 2 technical replicates, and we merge the CpGs on the two strands.
ds = load_biomodal_dataset(
name="giab",
merge_technical_replicates=True,
merge_cpgs=True,
)
ds
<xarray.Dataset> Size: 12GB Dimensions: (pos: 29294625, sample_id: 7) Coordinates: contig (pos) <U20 2GB dask.array<chunksize=(40000,), meta=np.ndarray> ref_position (pos) uint32 117MB dask.array<chunksize=(40000,), meta=np.ndarray> strand (pos) <U1 117MB dask.array<chunksize=(40000,), meta=np.ndarray> trinucleotide_context (pos) uint8 29MB dask.array<chunksize=(40000,), meta=np.ndarray> * sample_id (sample_id) object 56B 'HG001' ... 'HG007' group (sample_id) object 56B 'HG001' ... 'HG007' Dimensions without coordinates: pos Data variables: num_c (pos, sample_id) uint64 2GB dask.array<chunksize=(40000, 1), meta=np.ndarray> num_modc (pos, sample_id) uint64 2GB dask.array<chunksize=(40000, 1), meta=np.ndarray> num_n (pos, sample_id) uint64 2GB dask.array<chunksize=(40000, 1), meta=np.ndarray> num_other (pos, sample_id) uint64 2GB dask.array<chunksize=(40000, 1), meta=np.ndarray> num_total (pos, sample_id) uint64 2GB dask.array<chunksize=(40000, 1), meta=np.ndarray> num_total_c (pos, sample_id) uint64 2GB dask.array<chunksize=(40000, 1), meta=np.ndarray> Input DNA Quantity (ng/sample) (sample_id) int64 56B 80 80 80 80 80 80 80 Protocol Version (sample_id) object 56B '5-Letter v2.4' ..... NA_id (sample_id) object 56B 'NA12878' ... 'NA2... family (sample_id) object 56B 'Utah' ... 'Han Ch... family_status (sample_id) object 56B 'Singleton' ... 'M... sex (sample_id) object 56B 'Female' ... 'Female' tech_replicate_number (sample_id) int64 56B 1 1 1 1 1 1 1 Attributes: (100/205) context: CG context_encoding: {'0': 'AAA', '1': 'AAC', '10': 'AGA', '100':... contigs: ['chr1', 'chr2', 'chr3', 'chr4', 'chr5', 'ch... coordinate_basis: 0 description: A +modC dataset of human Genome in a bottle ... fasta_path: GRCh38Decoy.fa input_path: ['CEG1530-EL01-A1200-001.genome.GRCh38Decoy_... quant_subtype: modC ref_name: GRCh38Decoy sample_ids: ['CEG1530-EL01-A1200-001', 'CEG1530-EL01-A12... slice_chr1: slice(0, 2375159, 1) slice_chr10: slice(14996202, 16385180, 1) slice_chr11: slice(16385180, 17718294, 1) slice_chr11_KI270721v1_ran: slice(29173382, 29176776, 1) slice_chr12: slice(17718294, 19034262, 1) slice_chr13: slice(19034262, 19876731, 1) slice_chr14: slice(19876731, 20739159, 1) slice_chr14_GL000009v2_ran: slice(29176776, 29178255, 1) slice_chr14_GL000194v1_ran: slice(29183653, 29185755, 1) slice_chr14_GL000225v1_ran: slice(29178255, 29182074, 1) slice_chr14_KI270722v1_ran: slice(29182074, 29183653, 1) slice_chr14_KI270723v1_ran: slice(29185755, 29186353, 1) slice_chr14_KI270724v1_ran: slice(29186353, 29186907, 1) slice_chr14_KI270725v1_ran: slice(29186907, 29188509, 1) slice_chr14_KI270726v1_ran: slice(29188509, 29189025, 1) slice_chr15: slice(20739159, 21645185, 1) slice_chr15_KI270727v1_ran: slice(29189025, 29192989, 1) slice_chr16: slice(21645185, 22796076, 1) slice_chr16_KI270728v1_ran: slice(29192989, 29208664, 1) slice_chr17: slice(22796076, 24044404, 1) slice_chr17_GL000205v2_ran: slice(29208664, 29210537, 1) slice_chr17_KI270729v1_ran: slice(29210537, 29217219, 1) slice_chr17_KI270730v1_ran: slice(29217219, 29219552, 1) slice_chr18: slice(24044404, 24800418, 1) slice_chr19: slice(24800418, 25857083, 1) slice_chr1_KI270706v1_rand: slice(29152891, 29154480, 1) slice_chr1_KI270707v1_rand: slice(29154480, 29154791, 1) slice_chr1_KI270708v1_rand: slice(29154791, 29155844, 1) slice_chr1_KI270709v1_rand: slice(29155844, 29157288, 1) slice_chr1_KI270710v1_rand: slice(29157288, 29157505, 1) slice_chr1_KI270711v1_rand: slice(29157505, 29157882, 1) slice_chr1_KI270712v1_rand: slice(29157882, 29160999, 1) slice_chr1_KI270713v1_rand: slice(29160999, 29161864, 1) slice_chr1_KI270714v1_rand: slice(29161864, 29163019, 1) slice_chr2: slice(2375159, 4567829, 1) slice_chr20: slice(25857083, 26630560, 1) slice_chr21: slice(26630560, 27059406, 1) slice_chr22: slice(27059406, 27660298, 1) slice_chr22_KI270731v1_ran: slice(29219552, 29222033, 1) slice_chr22_KI270732v1_ran: slice(29222033, 29222518, 1) ... ... slice_chrUn_KI270510v1: slice(29247760, 29247775, 1) slice_chrUn_KI270511v1: slice(29248979, 29249024, 1) slice_chrUn_KI270512v1: slice(29247837, 29248020, 1) slice_chrUn_KI270515v1: slice(29249024, 29249065, 1) slice_chrUn_KI270516v1: slice(29247829, 29247837, 1) slice_chrUn_KI270517v1: slice(29249104, 29249116, 1) slice_chrUn_KI270518v1: slice(29247795, 29247808, 1) slice_chrUn_KI270519v1: slice(29248020, 29248909, 1) slice_chrUn_KI270521v1: slice(29251667, 29251743, 1) slice_chrUn_KI270522v1: slice(29248909, 29248979, 1) slice_chrUn_KI270528v1: slice(29249140, 29249240, 1) slice_chrUn_KI270529v1: slice(29249116, 29249140, 1) slice_chrUn_KI270530v1: slice(29249240, 29249275, 1) slice_chrUn_KI270538v1: slice(29249353, 29250059, 1) slice_chrUn_KI270539v1: slice(29249275, 29249353, 1) slice_chrUn_KI270544v1: slice(29250059, 29250078, 1) slice_chrUn_KI270548v1: slice(29250078, 29250103, 1) slice_chrUn_KI270579v1: slice(29250198, 29250411, 1) slice_chrUn_KI270580v1: slice(29250129, 29250141, 1) slice_chrUn_KI270581v1: slice(29250141, 29250198, 1) slice_chrUn_KI270582v1: slice(29250753, 29250816, 1) slice_chrUn_KI270583v1: slice(29250103, 29250114, 1) slice_chrUn_KI270584v1: slice(29250726, 29250753, 1) slice_chrUn_KI270587v1: slice(29250114, 29250129, 1) slice_chrUn_KI270588v1: slice(29250816, 29250857, 1) slice_chrUn_KI270589v1: slice(29250411, 29250700, 1) slice_chrUn_KI270590v1: slice(29250700, 29250726, 1) slice_chrUn_KI270591v1: slice(29250879, 29250912, 1) slice_chrUn_KI270593v1: slice(29250857, 29250879, 1) slice_chrUn_KI270741v1: slice(29263727, 29265416, 1) slice_chrUn_KI270742v1: slice(29288641, 29290053, 1) slice_chrUn_KI270743v1: slice(29267377, 29269464, 1) slice_chrUn_KI270744v1: slice(29269464, 29271450, 1) slice_chrUn_KI270745v1: slice(29271450, 29272082, 1) slice_chrUn_KI270746v1: slice(29272082, 29272487, 1) slice_chrUn_KI270747v1: slice(29272487, 29275904, 1) slice_chrUn_KI270748v1: slice(29275904, 29277399, 1) slice_chrUn_KI270749v1: slice(29277399, 29278817, 1) slice_chrUn_KI270750v1: slice(29278817, 29280152, 1) slice_chrUn_KI270751v1: slice(29280152, 29281268, 1) slice_chrUn_KI270752v1: slice(29281268, 29281503, 1) slice_chrUn_KI270753v1: slice(29281503, 29282300, 1) slice_chrUn_KI270754v1: slice(29282300, 29283979, 1) slice_chrUn_KI270755v1: slice(29283979, 29284530, 1) slice_chrUn_KI270756v1: slice(29284530, 29286560, 1) slice_chrUn_KI270757v1: slice(29286560, 29286986, 1) slice_chrX: slice(27660298, 28983007, 1) slice_chrY: slice(28983007, 29152456, 1) slice_chrY_KI270740v1_rand: slice(29239647, 29239942, 1) zarr_version: 0.1
Select the two families#
First, we use the select_samples
method to select only the samples corresponding to the two families of interest (i.e., the families that are not from the HG001/Utah cell line).
ds = ds.select_samples([i for i in ds.sample_id.values if i != "HG001"])
ds
<xarray.Dataset> Size: 11GB Dimensions: (pos: 29294625, sample_id: 6) Coordinates: contig (pos) <U20 2GB dask.array<chunksize=(40000,), meta=np.ndarray> ref_position (pos) uint32 117MB dask.array<chunksize=(40000,), meta=np.ndarray> strand (pos) <U1 117MB dask.array<chunksize=(40000,), meta=np.ndarray> trinucleotide_context (pos) uint8 29MB dask.array<chunksize=(40000,), meta=np.ndarray> * sample_id (sample_id) object 48B 'HG002' ... 'HG007' group (sample_id) object 48B 'HG002' ... 'HG007' Dimensions without coordinates: pos Data variables: num_c (pos, sample_id) uint64 1GB dask.array<chunksize=(40000, 1), meta=np.ndarray> num_modc (pos, sample_id) uint64 1GB dask.array<chunksize=(40000, 1), meta=np.ndarray> num_n (pos, sample_id) uint64 1GB dask.array<chunksize=(40000, 1), meta=np.ndarray> num_other (pos, sample_id) uint64 1GB dask.array<chunksize=(40000, 1), meta=np.ndarray> num_total (pos, sample_id) uint64 1GB dask.array<chunksize=(40000, 1), meta=np.ndarray> num_total_c (pos, sample_id) uint64 1GB dask.array<chunksize=(40000, 1), meta=np.ndarray> Input DNA Quantity (ng/sample) (sample_id) int64 48B 80 80 80 80 80 80 Protocol Version (sample_id) object 48B '5-Letter v2.4' ..... NA_id (sample_id) object 48B 'NA24385' ... 'NA2... family (sample_id) object 48B 'Ashkenazi Jewish'... family_status (sample_id) object 48B 'Son' ... 'Mother' sex (sample_id) object 48B 'Male' ... 'Female' tech_replicate_number (sample_id) int64 48B 1 1 1 1 1 1 Attributes: (100/205) context: CG context_encoding: {'0': 'AAA', '1': 'AAC', '10': 'AGA', '100':... contigs: ['chr1', 'chr2', 'chr3', 'chr4', 'chr5', 'ch... coordinate_basis: 0 description: A +modC dataset of human Genome in a bottle ... fasta_path: GRCh38Decoy.fa input_path: ['CEG1530-EL01-A1200-001.genome.GRCh38Decoy_... quant_subtype: modC ref_name: GRCh38Decoy sample_ids: ['CEG1530-EL01-A1200-001', 'CEG1530-EL01-A12... slice_chr1: slice(0, 2375159, 1) slice_chr10: slice(14996202, 16385180, 1) slice_chr11: slice(16385180, 17718294, 1) slice_chr11_KI270721v1_ran: slice(29173382, 29176776, 1) slice_chr12: slice(17718294, 19034262, 1) slice_chr13: slice(19034262, 19876731, 1) slice_chr14: slice(19876731, 20739159, 1) slice_chr14_GL000009v2_ran: slice(29176776, 29178255, 1) slice_chr14_GL000194v1_ran: slice(29183653, 29185755, 1) slice_chr14_GL000225v1_ran: slice(29178255, 29182074, 1) slice_chr14_KI270722v1_ran: slice(29182074, 29183653, 1) slice_chr14_KI270723v1_ran: slice(29185755, 29186353, 1) slice_chr14_KI270724v1_ran: slice(29186353, 29186907, 1) slice_chr14_KI270725v1_ran: slice(29186907, 29188509, 1) slice_chr14_KI270726v1_ran: slice(29188509, 29189025, 1) slice_chr15: slice(20739159, 21645185, 1) slice_chr15_KI270727v1_ran: slice(29189025, 29192989, 1) slice_chr16: slice(21645185, 22796076, 1) slice_chr16_KI270728v1_ran: slice(29192989, 29208664, 1) slice_chr17: slice(22796076, 24044404, 1) slice_chr17_GL000205v2_ran: slice(29208664, 29210537, 1) slice_chr17_KI270729v1_ran: slice(29210537, 29217219, 1) slice_chr17_KI270730v1_ran: slice(29217219, 29219552, 1) slice_chr18: slice(24044404, 24800418, 1) slice_chr19: slice(24800418, 25857083, 1) slice_chr1_KI270706v1_rand: slice(29152891, 29154480, 1) slice_chr1_KI270707v1_rand: slice(29154480, 29154791, 1) slice_chr1_KI270708v1_rand: slice(29154791, 29155844, 1) slice_chr1_KI270709v1_rand: slice(29155844, 29157288, 1) slice_chr1_KI270710v1_rand: slice(29157288, 29157505, 1) slice_chr1_KI270711v1_rand: slice(29157505, 29157882, 1) slice_chr1_KI270712v1_rand: slice(29157882, 29160999, 1) slice_chr1_KI270713v1_rand: slice(29160999, 29161864, 1) slice_chr1_KI270714v1_rand: slice(29161864, 29163019, 1) slice_chr2: slice(2375159, 4567829, 1) slice_chr20: slice(25857083, 26630560, 1) slice_chr21: slice(26630560, 27059406, 1) slice_chr22: slice(27059406, 27660298, 1) slice_chr22_KI270731v1_ran: slice(29219552, 29222033, 1) slice_chr22_KI270732v1_ran: slice(29222033, 29222518, 1) ... ... slice_chrUn_KI270510v1: slice(29247760, 29247775, 1) slice_chrUn_KI270511v1: slice(29248979, 29249024, 1) slice_chrUn_KI270512v1: slice(29247837, 29248020, 1) slice_chrUn_KI270515v1: slice(29249024, 29249065, 1) slice_chrUn_KI270516v1: slice(29247829, 29247837, 1) slice_chrUn_KI270517v1: slice(29249104, 29249116, 1) slice_chrUn_KI270518v1: slice(29247795, 29247808, 1) slice_chrUn_KI270519v1: slice(29248020, 29248909, 1) slice_chrUn_KI270521v1: slice(29251667, 29251743, 1) slice_chrUn_KI270522v1: slice(29248909, 29248979, 1) slice_chrUn_KI270528v1: slice(29249140, 29249240, 1) slice_chrUn_KI270529v1: slice(29249116, 29249140, 1) slice_chrUn_KI270530v1: slice(29249240, 29249275, 1) slice_chrUn_KI270538v1: slice(29249353, 29250059, 1) slice_chrUn_KI270539v1: slice(29249275, 29249353, 1) slice_chrUn_KI270544v1: slice(29250059, 29250078, 1) slice_chrUn_KI270548v1: slice(29250078, 29250103, 1) slice_chrUn_KI270579v1: slice(29250198, 29250411, 1) slice_chrUn_KI270580v1: slice(29250129, 29250141, 1) slice_chrUn_KI270581v1: slice(29250141, 29250198, 1) slice_chrUn_KI270582v1: slice(29250753, 29250816, 1) slice_chrUn_KI270583v1: slice(29250103, 29250114, 1) slice_chrUn_KI270584v1: slice(29250726, 29250753, 1) slice_chrUn_KI270587v1: slice(29250114, 29250129, 1) slice_chrUn_KI270588v1: slice(29250816, 29250857, 1) slice_chrUn_KI270589v1: slice(29250411, 29250700, 1) slice_chrUn_KI270590v1: slice(29250700, 29250726, 1) slice_chrUn_KI270591v1: slice(29250879, 29250912, 1) slice_chrUn_KI270593v1: slice(29250857, 29250879, 1) slice_chrUn_KI270741v1: slice(29263727, 29265416, 1) slice_chrUn_KI270742v1: slice(29288641, 29290053, 1) slice_chrUn_KI270743v1: slice(29267377, 29269464, 1) slice_chrUn_KI270744v1: slice(29269464, 29271450, 1) slice_chrUn_KI270745v1: slice(29271450, 29272082, 1) slice_chrUn_KI270746v1: slice(29272082, 29272487, 1) slice_chrUn_KI270747v1: slice(29272487, 29275904, 1) slice_chrUn_KI270748v1: slice(29275904, 29277399, 1) slice_chrUn_KI270749v1: slice(29277399, 29278817, 1) slice_chrUn_KI270750v1: slice(29278817, 29280152, 1) slice_chrUn_KI270751v1: slice(29280152, 29281268, 1) slice_chrUn_KI270752v1: slice(29281268, 29281503, 1) slice_chrUn_KI270753v1: slice(29281503, 29282300, 1) slice_chrUn_KI270754v1: slice(29282300, 29283979, 1) slice_chrUn_KI270755v1: slice(29283979, 29284530, 1) slice_chrUn_KI270756v1: slice(29284530, 29286560, 1) slice_chrUn_KI270757v1: slice(29286560, 29286986, 1) slice_chrX: slice(27660298, 28983007, 1) slice_chrY: slice(28983007, 29152456, 1) slice_chrY_KI270740v1_rand: slice(29239647, 29239942, 1) zarr_version: 0.1
Restrict to chromosome 1#
In this notebook, we will look for DMRs on a subset of the whole genome, here chromosome 1.
substore = ds["chr1"].copy()
Filter by coverage#
It is good practice to remove sites of low coverage before calling DMRs. This can be done with the subset_bycoverage
method of the ContigDataset. Here we call it with the method="any"
, which removes a position if ay of the samples have a coverage strictly lower than min_coverage at that CpG context. Other options are "all"
and "mean"
.
substore = substore.subset_bycoverage(min_coverage=5, method="any")
Call DMRs#
Now we can call DMRs using the call_dmrs
function. We need to specify the ContigDataset
object we are going to use (here substore
), as well as the count variables (here num_modc
and num_total_c
) and the condition (here we look for DMRs across two different families, so we set use family
).
Finally, call_dmrs
needs to know the regions where to look for DMRs:
We use the default mode, which is to pre-segment the CpGs, creating a new region every time two consecutive CpGs are separated by a distance larger than
presegment_threshold
(here we use the defaultpresegment_threshold
of 500bp). This determines the regions in whichmodality
is going to look for DMRs.Note we could instead pass a
ranges
argument, which can either be an integer (e.g.ranges=2000
will divide the genome in tiles of size 2kbp), or an actual set of genomic ranges. If such object is provided,modality
will call DMRs on those ranges. This object should be apyranges
object. If you have a set of genomic ranges as a bed file, you can for instance useranges = pr.read_bed(path_to_bedfile)
and pass it tocall_dmrs
.
For this dataset, it takes about 1 minutes to call DMRs on chromosome 1 on a MacBook Pro M2.
Note that you don’t have to restrict to one chromosome at a time. You can use call_dmrs
on your ContigDataset
ds
.
substore
<xarray.Dataset> Size: 845MB Dimensions: (pos: 2241991, sample_id: 6) Coordinates: * sample_id (sample_id) object 48B 'HG002' ... 'HG007' group (sample_id) object 48B 'HG002' ... 'HG007' contig (pos) <U20 179MB dask.array<chunksize=(40000,), meta=np.ndarray> ref_position (pos) uint32 9MB dask.array<chunksize=(40000,), meta=np.ndarray> strand (pos) <U1 9MB dask.array<chunksize=(40000,), meta=np.ndarray> trinucleotide_context (pos) uint8 2MB dask.array<chunksize=(40000,), meta=np.ndarray> Dimensions without coordinates: pos Data variables: num_c (pos, sample_id) uint64 108MB dask.array<chunksize=(40000, 6), meta=np.ndarray> num_modc (pos, sample_id) uint64 108MB dask.array<chunksize=(40000, 6), meta=np.ndarray> num_n (pos, sample_id) uint64 108MB dask.array<chunksize=(40000, 6), meta=np.ndarray> num_other (pos, sample_id) uint64 108MB dask.array<chunksize=(40000, 6), meta=np.ndarray> num_total (pos, sample_id) uint64 108MB dask.array<chunksize=(40000, 6), meta=np.ndarray> num_total_c (pos, sample_id) uint64 108MB dask.array<chunksize=(40000, 6), meta=np.ndarray> Input DNA Quantity (ng/sample) (sample_id) int64 48B 80 80 80 80 80 80 Protocol Version (sample_id) object 48B '5-Letter v2.4' ..... NA_id (sample_id) object 48B 'NA24385' ... 'NA2... family (sample_id) object 48B 'Ashkenazi Jewish'... family_status (sample_id) object 48B 'Son' ... 'Mother' sex (sample_id) object 48B 'Male' ... 'Female' tech_replicate_number (sample_id) int64 48B 1 1 1 1 1 1 Attributes: context: CG context_encoding: {'0': 'AAA', '1': 'AAC', '10': 'AGA', '100': 'TAA', '1... contigs: ['chr1'] coordinate_basis: 0 description: A +modC dataset of human Genome in a bottle (GIAB) sam... fasta_path: GRCh38Decoy.fa input_path: ['CEG1530-EL01-A1200-001.genome.GRCh38Decoy_primary_as... quant_subtype: modC ref_name: GRCh38Decoy sample_ids: ['CEG1530-EL01-A1200-001', 'CEG1530-EL01-A1200-004', '... slice_chr1: slice(0, 2241991, 1) zarr_version: 0.1
Now we are ready to call DMRs.
dmr_regions = call_dmrs(
ds=substore,
count_array_name="num_modc",
total_counts_name="num_total_c",
condition_array_name="family",
presegment_threshold=500,
condition_order=["Ashkenazi Jewish", "Han Chinese"],
as_pyranges=True,
)
The output dmr_regions
is a dataframe containing the following columns:
the chromosome, start, end (zero-based) of the region
the number of contexts in the region
the mean coverage of the region
the mean methylation fraction of each group
the methylation fold change
(mean(group_2)/mean(group_1))
the methylation difference
(mean(group_2)-mean(group_1))
the test statistic, in our case a Wald test
the p-value associated with the significance of the coefficient of the logistic regression associated with the group
the q-value (Benjamini-Hochberg corrected p-value)
Note that we could also ouptut a pandas
Dataframe instead of a pyranges
one by setting as_pyranges=False
in the call_dmrs
function.
dmr_regions.head()
Chromosome | Start | End | num_contexts | mean_coverage | mean_mod_group_1 | mean_mod_group_2 | mod_fold_change | mod_difference | test_statistic | dmr_pvalue | dmr_qvalue | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | chr1 | 10578 | 10816 | 50 | 28.973333 | 0.966083 | 0.924386 | 0.956839 | -0.041697 | 60.001180 | 9.480050e-15 | 3.337637e-14 |
1 | chr1 | 14652 | 14791 | 9 | 70.907407 | 0.917142 | 0.934554 | 1.018985 | 0.017412 | 2.874834 | 8.997427e-02 | 1.071010e-01 |
2 | chr1 | 16138 | 16869 | 8 | 81.333333 | 0.831444 | 0.841313 | 1.011870 | 0.009869 | 1.169600 | 2.794835e-01 | 3.111446e-01 |
3 | chr1 | 17377 | 17613 | 10 | 129.400000 | 0.960567 | 0.936319 | 0.974757 | -0.024247 | 22.613622 | 1.980753e-06 | 3.656151e-06 |
4 | chr1 | 19159 | 19322 | 9 | 63.685185 | 0.921850 | 0.918903 | 0.996804 | -0.002946 | 0.011737 | 9.137286e-01 | 9.236813e-01 |
5 | chr1 | 29271 | 29435 | 25 | 25.933333 | 0.002069 | 0.002200 | 1.063500 | 0.000131 | 0.001907 | 9.651642e-01 | 9.692986e-01 |
6 | chr1 | 102932 | 103249 | 8 | 117.750000 | 0.598732 | 0.724905 | 1.210733 | 0.126173 | 115.735209 | 5.431896e-27 | 4.532539e-26 |
7 | chr1 | 108774 | 109575 | 6 | 107.555556 | 0.456157 | 0.347131 | 0.760990 | -0.109026 | 21.070517 | 4.426873e-06 | 7.931364e-06 |
Even if you kept the pyranges
object as the default type of the output, it is straightforward to convert this object to a pandas
dataframe if we want:
dmr_regions = dmr_regions.df
It is also useful to associate a slice in the format “Chromosome:Start-End” to the dataframe, since our ContigDataset object can be sliced using such string. This will come in handy later on in this notebook when we plot DMRs.
dmr_regions["Slice"] = dmr_regions.apply(
lambda row: f"{row['Chromosome']}:{row['Start']}-{row['End']}", axis=1
)
dmr_regions.head()
Chromosome | Start | End | num_contexts | mean_coverage | mean_mod_group_1 | mean_mod_group_2 | mod_fold_change | mod_difference | test_statistic | dmr_pvalue | dmr_qvalue | Slice | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | chr1 | 10578 | 10816 | 50 | 28.973333 | 0.966083 | 0.924386 | 0.956839 | -0.041697 | 60.001180 | 9.480050e-15 | 3.337637e-14 | chr1:10578-10816 |
1 | chr1 | 14652 | 14791 | 9 | 70.907407 | 0.917142 | 0.934554 | 1.018985 | 0.017412 | 2.874834 | 8.997427e-02 | 1.071010e-01 | chr1:14652-14791 |
2 | chr1 | 16138 | 16869 | 8 | 81.333333 | 0.831444 | 0.841313 | 1.011870 | 0.009869 | 1.169600 | 2.794835e-01 | 3.111446e-01 | chr1:16138-16869 |
3 | chr1 | 17377 | 17613 | 10 | 129.400000 | 0.960567 | 0.936319 | 0.974757 | -0.024247 | 22.613622 | 1.980753e-06 | 3.656151e-06 | chr1:17377-17613 |
4 | chr1 | 19159 | 19322 | 9 | 63.685185 | 0.921850 | 0.918903 | 0.996804 | -0.002946 | 0.011737 | 9.137286e-01 | 9.236813e-01 | chr1:19159-19322 |
Differentially Methylated Cytosines#
Instead of calling DMRs, you may be interested to know if your individual cytosines are differentially methyated or not (so called DMC’s).
In principle you could call DMCs genome-wide, but in practice it involves running one logistic regression per context. This task is time and memory intensive, and is not recommended. Instead, we can focus a small region of interest, for instance a gene or an enhancer, and look at differentially methylated contexts for this region only.
Let’s look at the GBA1 gene, associated with Gaucher disease, a genetic disease particularly common in the Ashkenazi population. The gene is on the negative strand of chr1, between positions 155234451 and 155244698 (human reference hg38). We can select this particular region in our dataset by using a slice string formatted as “Chromosome:Start-End”. So for our particular gene, we would do:
ds["chr1:155234451-155244698"]
<xarray.Dataset> Size: 74kB Dimensions: (pos: 196, sample_id: 6) Coordinates: contig (pos) <U20 16kB dask.array<chunksize=(196,), meta=np.ndarray> ref_position (pos) uint32 784B dask.array<chunksize=(196,), meta=np.ndarray> strand (pos) <U1 784B dask.array<chunksize=(196,), meta=np.ndarray> trinucleotide_context (pos) uint8 196B dask.array<chunksize=(196,), meta=np.ndarray> * sample_id (sample_id) object 48B 'HG002' ... 'HG007' group (sample_id) object 48B 'HG002' ... 'HG007' Dimensions without coordinates: pos Data variables: num_c (pos, sample_id) uint64 9kB dask.array<chunksize=(196, 1), meta=np.ndarray> num_modc (pos, sample_id) uint64 9kB dask.array<chunksize=(196, 1), meta=np.ndarray> num_n (pos, sample_id) uint64 9kB dask.array<chunksize=(196, 1), meta=np.ndarray> num_other (pos, sample_id) uint64 9kB dask.array<chunksize=(196, 1), meta=np.ndarray> num_total (pos, sample_id) uint64 9kB dask.array<chunksize=(196, 1), meta=np.ndarray> num_total_c (pos, sample_id) uint64 9kB dask.array<chunksize=(196, 1), meta=np.ndarray> Input DNA Quantity (ng/sample) (sample_id) int64 48B 80 80 80 80 80 80 Protocol Version (sample_id) object 48B '5-Letter v2.4' ..... NA_id (sample_id) object 48B 'NA24385' ... 'NA2... family (sample_id) object 48B 'Ashkenazi Jewish'... family_status (sample_id) object 48B 'Son' ... 'Mother' sex (sample_id) object 48B 'Male' ... 'Female' tech_replicate_number (sample_id) int64 48B 1 1 1 1 1 1 Attributes: context: CG context_encoding: {'0': 'AAA', '1': 'AAC', '10': 'AGA', '100': 'TAA', '1... contigs: ['chr1'] coordinate_basis: 0 description: A +modC dataset of human Genome in a bottle (GIAB) sam... fasta_path: GRCh38Decoy.fa input_path: ['CEG1530-EL01-A1200-001.genome.GRCh38Decoy_primary_as... quant_subtype: modC ref_name: GRCh38Decoy sample_ids: ['CEG1530-EL01-A1200-001', 'CEG1530-EL01-A1200-004', '... zarr_version: 0.1 slice_chr1: slice(0, 196, 1)
Let’s look at DMCs in this gene. In order to achieve this, we run the call_dmrs
function with the ranges=1
parameter. This creates ranges of size 1bp, so each context becomes its own genomic range on which we can call differential methylation.
dmc_calls = call_dmrs(
ds=ds["chr1:155234451-155244698"],
count_array_name="num_modc",
total_counts_name="num_total_c",
condition_array_name="family",
condition_order=["Ashkenazi Jewish", "Han Chinese"],
as_pyranges=True,
ranges=1,
)
Let’s remove positions that do not contain a context, and sites of low coverage:
dmc_calls = dmc_calls[(dmc_calls.num_contexts>0) & (dmc_calls.mean_coverage>5)].df
The table below shows all the contexts in the gene, ordered by the p-value associated with their level of differential methylation.
dmc_calls.sort_values("dmr_pvalue")
Chromosome | Start | End | num_contexts | mean_coverage | mean_mod_group_1 | mean_mod_group_2 | mod_fold_change | mod_difference | test_statistic | dmr_pvalue | dmr_qvalue | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
110 | chr1 | 155241670 | 155241671 | 1 | 111.166667 | 0.236535 | 0.635130 | 2.685147 | 0.398595 | 101.937739 | 5.729453e-24 | 5.812530e-20 |
74 | chr1 | 155239109 | 155239110 | 1 | 93.333333 | 0.520995 | 0.877068 | 1.683448 | 0.356073 | 75.437030 | 3.772482e-18 | 1.913591e-14 |
124 | chr1 | 155243497 | 155243498 | 1 | 105.000000 | 0.423714 | 0.734200 | 1.732771 | 0.310485 | 67.912718 | 1.708953e-16 | 5.779111e-13 |
115 | chr1 | 155242045 | 155242046 | 1 | 113.666667 | 0.677408 | 0.921636 | 1.360532 | 0.244228 | 56.317558 | 6.166232e-14 | 1.563911e-10 |
54 | chr1 | 155237947 | 155237948 | 1 | 94.000000 | 0.665644 | 0.408962 | 0.614385 | -0.256682 | 38.760004 | 4.792476e-10 | 9.723934e-07 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
59 | chr1 | 155238289 | 155238290 | 1 | 64.500000 | 0.919168 | 0.917938 | 0.998663 | -0.001229 | 0.001364 | 9.705439e-01 | 1.000000e+00 |
156 | chr1 | 155244259 | 155244260 | 1 | 97.833333 | 0.010367 | 0.010317 | 0.995114 | -0.000051 | 0.000862 | 9.765759e-01 | 1.000000e+00 |
64 | chr1 | 155238682 | 155238683 | 1 | 106.000000 | 0.880094 | 0.880548 | 1.000515 | 0.000453 | 0.000854 | 9.766926e-01 | 1.000000e+00 |
17 | chr1 | 155235813 | 155235814 | 1 | 99.833333 | 0.961497 | 0.963858 | 1.002456 | 0.002361 | 0.000584 | 9.807192e-01 | 1.000000e+00 |
93 | chr1 | 155240047 | 155240048 | 1 | 80.333333 | 0.964794 | 0.966017 | 1.001268 | 0.001223 | 0.000207 | 9.885330e-01 | 1.000000e+00 |
186 rows × 12 columns
Let’s visualise the significant contexts as a Manhattan plot of their q-values:
dmc_calls["minus_log10_qvalue"] = -np.log10(dmc_calls["dmr_qvalue"])
g = sns.scatterplot(
data=dmc_calls,
x="Start",
y="minus_log10_qvalue",
edgecolor=None,
s=5,
)
# Format axis labels
plt.xlabel("Genomic Position")
plt.ylabel("-log10(q-value)")
# add significance threshold line
plt.axhline(-np.log10(0.05), color='C2', linestyle='--', label="q-value = 0.05")
# Format x-axis labels
ticks = g.get_xticks()
xlabels = ["{:,.0f}".format(x) for x in ticks]
g.set_xticklabels(xlabels)
# Rotate x-axis labels
plt.xticks(rotation=45, ha='right')
# Only three ticks on x-axis
plt.locator_params(axis='x', nbins=3)
# Add legend
plt.legend()
plt.show()

Volcano plot of DMRs#
Now let’s return to our region-based calls. Let’s visualise the q-values of our DMR calls against the mean methylation difference between the two groups.
dmr_regions["dmr_qvalue"] = dmr_regions["dmr_qvalue"].replace(0, 10**-300)
dmr_regions["-log10 dmr_qvalue"] = -np.log10(dmr_regions["dmr_qvalue"])
g = sns.scatterplot(
x="mod_difference",
y="-log10 dmr_qvalue",
data=dmr_regions,
edgecolor=None,
s=5,
)
g.set_xlim(-1, 1)
g.axvline(0, color="black", linewidth=0.5)
g.axvline(np.log2(1./2), color="black", linewidth=0.5, linestyle="--")
g.axvline(np.log2(2.), color="black", linewidth=0.5, linestyle="--")
<matplotlib.lines.Line2D at 0x7f060ef51b10>

The solid black line marks the transition from hypo to hyper methylated DMRs. Hence the right-hand side of that line indicate DMRs where the Ashkenazi family is less methylated on average than the Han family.
A closer look at some DMRs#
Let’s look at regions with low q-values, high number of CpGs, and significant methylation difference. These should be our strong DMR calls.
dmrs = dmr_regions.query(
"dmr_qvalue < 0.01 & num_contexts > 10 & (mod_difference < -0.05 | mod_difference > 0.05)"
).reset_index(drop=True)
dmrs = dmrs.sort_values("dmr_pvalue")
dmrs
Chromosome | Start | End | num_contexts | mean_coverage | mean_mod_group_1 | mean_mod_group_2 | mod_fold_change | mod_difference | test_statistic | dmr_pvalue | dmr_qvalue | Slice | -log10 dmr_qvalue | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
18059 | chr1 | 244280325 | 244284050 | 63 | 96.089947 | 0.678418 | 0.849053 | 1.251519 | 0.170635 | 1424.550130 | 9.720008e-312 | 6.357954e-307 | chr1:244280325-244284050 | 306.196683 |
241 | chr1 | 23030631 | 23034399 | 35 | 106.876190 | 0.550629 | 0.788732 | 1.432420 | 0.238103 | 1415.936394 | 7.234870e-310 | 2.366200e-305 | chr1:23030631-23034399 | 304.625948 |
18060 | chr1 | 244284050 | 244287775 | 63 | 97.216931 | 0.646204 | 0.860158 | 1.331093 | 0.213954 | 2122.689492 | 2.225074e-308 | 7.660227e-305 | chr1:244284050-244287775 | 304.115758 |
5223 | chr1 | 143326766 | 143327537 | 84 | 48.507937 | 0.733338 | 0.938419 | 1.279654 | 0.205081 | 2150.004558 | 2.225074e-308 | 7.660227e-305 | chr1:143326766-143327537 | 304.115758 |
165 | chr1 | 16699390 | 16700179 | 89 | 81.932584 | 0.392937 | 0.155515 | 0.395776 | -0.237422 | 3091.502387 | 2.225074e-308 | 7.660227e-305 | chr1:16699390-16700179 | 304.115758 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
4987 | chr1 | 121627144 | 121627659 | 12 | 33.152778 | 0.724702 | 0.783422 | 1.081027 | 0.058720 | 8.732711 | 3.125514e-03 | 4.345043e-03 | chr1:121627144-121627659 | 2.362006 |
5427 | chr1 | 149412235 | 149414853 | 20 | 39.116667 | 0.679546 | 0.618624 | 0.910349 | -0.060922 | 8.229197 | 4.122176e-03 | 5.666162e-03 | chr1:149412235-149414853 | 2.246711 |
5214 | chr1 | 124807770 | 124808169 | 12 | 32.194444 | 0.665899 | 0.719740 | 1.080855 | 0.053841 | 7.787822 | 5.259957e-03 | 7.150764e-03 | chr1:124807770-124808169 | 2.145648 |
5105 | chr1 | 123652048 | 123652730 | 11 | 42.515152 | 0.586700 | 0.643289 | 1.096452 | 0.056589 | 7.782403 | 5.275756e-03 | 7.171200e-03 | chr1:123652048-123652730 | 2.144408 |
5057 | chr1 | 123112849 | 123113680 | 12 | 38.972222 | 0.661589 | 0.609712 | 0.921587 | -0.051877 | 7.263604 | 7.036594e-03 | 9.445130e-03 | chr1:123112849-123113680 | 2.024792 |
18301 rows × 14 columns
We can use the plot_methylation_tile
method of the ContigDataset
to plot visualise the DMRs.
In the two examples below, we can see tiles where the methylation is different between HG002, HG003, HG004 (the Ashkenazi trio) on one hand and HG005, HG006, HG007 (the Han trio) on the other hand. Each patch of color is a CpG. The dark patches correspond to CpGs with high methyation fraction, while the light patches correpond to a low methylation fraction. Grey patches (if any) are CpGs with low coverage. For each panel, the bottom plot is a locator which shows the genomic location of each CpG.
fig = substore[dmrs["Slice"].iloc[0]].plot_methylation_tile(
numerator="num_modc",
denominator="num_total_c",
min_coverage=5,
)

We can also plot the methylation trace. It shows the methylation fraction across the region of interest. In our case, it clearly shows a distinct pattern of methylation between the Ashkenazi group (blue-ish colours) and the Han group (red-ish colours).
kwargs = {
"palette": ["#003B49", "#9CDBD9", "#668991", "#F87C56", "#fbb59f", "#f5430d"],
}
fig = substore[dmrs["Slice"].iloc[0]].plot_methylation_trace(
numerators=["num_modc"],
denominator="num_total_c",
min_coverage=5,
**kwargs
)

Let’s look at a second example:
fig = substore[dmrs["Slice"].iloc[1]].plot_methylation_tile(
numerator="num_modc",
denominator="num_total_c",
min_coverage=5,
)

fig = substore[dmrs["Slice"].iloc[1]].plot_methylation_trace(
numerators=["num_modc"],
denominator="num_total_c",
min_coverage=5,
**kwargs
)

A non-DMR example#
In this example there is not clear difference in methylation between the two family trios, hence it was correctly not called as a DMR.
fig = substore[dmr_regions["Slice"].iloc[-7]].plot_methylation_tile(
numerator="num_modc",
denominator="num_total_c",
min_coverage=5,
)

fig = substore[dmr_regions["Slice"].iloc[-7]].plot_methylation_trace(
numerators=["num_modc"],
denominator="num_total_c",
min_coverage=5,
**kwargs
)
