Working with genomic ranges in modality#
A common pattern when analysing methylation datasets is to perform a series of operations over specified genomic regions. modality
has extensive support for working with genomic ranges via pyranges
objects from the PyRanges
package. In this notebook we showcase some of the working patterns and analyses that are enabled by modality
when working with genomic ranges.
Load data#
For the purposes of this demo we can use a duet +modC dataset of Genome in a bottle (GIAB) samples. The data are publicly available and can be loaded using the load_biomodal_dataset()
function from the modality.datasets
module. This will pull the dataset and load it in to our session as a ContigDataset object.
from modality.datasets import load_biomodal_dataset
ds = load_biomodal_dataset("giab")
We will restrict to a single trio in the dataset here for the purposes of the demonstration. Once we have subset the dataset we use assign_fractions
method of the ContigDataset
to calculate methylation fractions.
sample_selection = ds["family"].values == "Ashkenazi Jewish"
ds = ds.sel(sample_id=sample_selection).copy()
ds.assign_fractions(
numerators="num_modc",
denominator="num_total_c",
min_coverage=10,
inplace=True,
)
Annotation via intervals#
Often we will want to overlay some annotation from external sources onto the genomic regions captured in the ContigDataset
. modality
provides a series of built-in functions to extract genomic annotations from UCSC
and GENECODE
. These functions are availble via the modality.annotation
module.
Using built-in functions for retrieving annotation#
from modality.annotation import (
get_cpg_islands,
get_cpg_shores,
get_cpg_shelves,
get_exons,
get_five_prime_utrs,
)
We can wrap multiple get_*
functions into lists or dictionaries and then use these for downstream analysis:
cpg_annotations = [
get_cpg_islands(contigs=ds.attrs["contigs"]),
get_cpg_shelves(contigs=ds.attrs["contigs"]),
get_cpg_shores(contigs=ds.attrs["contigs"]),
]
For example, the first item in the cpg_annotations
list here is a pyranges
object of CpG islands:
cpg_islands = cpg_annotations[0]
cpg_islands
Index | Bin | Chromosome | Start | End | Length | Cpgnum | Gcnum | Percpg | Pergc | Obsexp | Type | Ranges_ID | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 9 | 585 | chr1 | 28735 | 29737 | 1002 | 111 | 731 | 22.2 | 73.0 | 0.85 | cpg_island | 0 |
1 | 10 | 586 | chr1 | 135124 | 135563 | 439 | 30 | 295 | 13.7 | 67.2 | 0.64 | cpg_island | 1 |
2 | 11 | 586 | chr1 | 199251 | 200121 | 870 | 104 | 643 | 23.9 | 73.9 | 0.89 | cpg_island | 2 |
3 | 12 | 587 | chr1 | 368792 | 370063 | 1271 | 99 | 777 | 15.6 | 61.1 | 0.84 | cpg_island | 3 |
4 | 13 | 587 | chr1 | 381172 | 382185 | 1013 | 84 | 734 | 16.6 | 72.5 | 0.64 | cpg_island | 4 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
27944 | 189 | 779 | chrY | 25464370 | 25464941 | 571 | 51 | 403 | 17.9 | 70.6 | 0.72 | cpg_island | 27944 |
27945 | 190 | 786 | chrY | 26409388 | 26409785 | 397 | 32 | 252 | 16.1 | 63.5 | 0.82 | cpg_island | 27945 |
27946 | 191 | 788 | chrY | 26627168 | 26627397 | 229 | 25 | 172 | 21.8 | 75.1 | 0.78 | cpg_island | 27946 |
27947 | 192 | 1020 | chrY | 57067645 | 57068034 | 389 | 36 | 257 | 18.5 | 66.1 | 0.85 | cpg_island | 27947 |
27948 | 193 | 1021 | chrY | 57203115 | 57203423 | 308 | 29 | 190 | 18.8 | 61.7 | 0.99 | cpg_island | 27948 |
27949 rows × 13 columns
Providing your own set of genomic ranges#
You can also specify you own set of genomic ranges, either by loading in a file, for example a bedfile using pyranges.read_bed
or building a ranges object:
import pyranges as pr
custom_ranges = pr.PyRanges(
chromosomes=["chr1", "chr1", "chr1"],
starts=[28735, 135124, 199251],
ends=[29737, 135563, 200121],
)
custom_ranges
Chromosome | Start | End | |
---|---|---|---|
0 | chr1 | 28735 | 29737 |
1 | chr1 | 135124 | 135563 |
2 | chr1 | 199251 | 200121 |
Computing over genomic ranges#
Once you have a set of genomic ranges of interest we can use the reduce_byranges()
method of the ContigDataset
, to summarise methylation information across these ranges. In a bit more detail:
Broadly, from a set of N (chr, start, stop) intervals, applied to a dataset of C (contexts) x S (samples) we transform to an array of (start_ix, stop_ix) windows. When the set of intervals is applied, we select a function (numpy.mean
or numpy.sum
) that is applied to rows of the selected variable, and applied across axis 0, ie summarizing all values in the window.
Please note: currently we only support mean or sum reduction methods. We hope to address this in future releases
reduced_ds = ds.reduce_byranges(
ranges=cpg_annotations,
var="num_total_c",
)
This method returns an xarray.Dataset
with reduced variables now computed over the ranges we specified.
reduced_ds
<xarray.Dataset> Dimensions: (ranges: 137687, sample_id: 6) Coordinates: contig (ranges) <U5 'chr1' 'chr1' 'chr1' ... 'chrY' 'chrY' start (ranges) int64 28735 135124 ... 57201115 57203423 end (ranges) int64 29737 135563 ... 57203115 57205423 range_id (ranges) int64 0 1 2 3 ... 137684 137685 137686 num_contexts (ranges) int64 222 60 208 198 168 58 ... 36 0 0 0 0 range_length (ranges) int64 1002 439 870 1271 ... 2000 2000 2000 NA_id (sample_id) object 'NA24385' 'NA24149' ... 'NA24143' giab_id (sample_id) object 'HG002' 'HG003' ... 'HG004' * sample_id (sample_id) <U22 'CEG1530-EL01-A1200-004' ... 'CEG... Dimensions without coordinates: ranges Data variables: num_total_c_sum (ranges, sample_id) uint64 dask.array<chunksize=(34422, 2), meta=np.ndarray> num_total_c_mean (ranges, sample_id) float64 dask.array<chunksize=(34422, 2), meta=np.ndarray> num_total_c_cpg_count (ranges, sample_id) uint64 dask.array<chunksize=(34422, 2), meta=np.ndarray> Index (ranges) int64 9 10 11 12 13 ... 191 192 192 193 193 Bin (ranges) int64 585 586 586 587 ... 1020 1021 1021 Length (ranges) int64 1002 439 870 1271 ... 389 389 308 308 Cpgnum (ranges) int64 111 30 104 99 84 29 ... 25 36 36 29 29 Gcnum (ranges) int64 731 295 643 777 ... 257 257 190 190 Percpg (ranges) float64 22.2 13.7 23.9 ... 18.5 18.8 18.8 Pergc (ranges) float64 73.0 67.2 73.9 ... 66.1 61.7 61.7 Obsexp (ranges) float64 0.85 0.64 0.89 ... 0.85 0.99 0.99 Type (ranges) object 'cpg_island' ... 'shores'
Because we have now substantially reduced the size of the dataset by performing a series of range-based reductions we can now seemlessly integrate with other standard analysis packages such as pandas
.
Please Note: This transformation should only be performed when you have used modality
to subset or reduce your dataset - transforming to a pandas.DataFrame
before you have done so will likely crash your notebook!
reduced_ds_pandas = reduced_ds.to_dataframe()
reduced_ds_pandas
num_total_c_sum | num_total_c_mean | num_total_c_cpg_count | contig | start | end | range_id | num_contexts | range_length | NA_id | giab_id | Index | Bin | Length | Cpgnum | Gcnum | Percpg | Pergc | Obsexp | Type | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
ranges | sample_id | ||||||||||||||||||||
0 | CEG1530-EL01-A1200-004 | 367 | 1.653153 | 222 | chr1 | 28735 | 29737 | 0 | 222 | 1002 | NA24385 | HG002 | 9 | 585 | 1002 | 111 | 731 | 22.2 | 73.0 | 0.85 | cpg_island |
CEG1530-EL01-A1200-007 | 303 | 1.364865 | 222 | chr1 | 28735 | 29737 | 0 | 222 | 1002 | NA24149 | HG003 | 9 | 585 | 1002 | 111 | 731 | 22.2 | 73.0 | 0.85 | cpg_island | |
CEG1530-EL01-A1200-010 | 314 | 1.414414 | 222 | chr1 | 28735 | 29737 | 0 | 222 | 1002 | NA24143 | HG004 | 9 | 585 | 1002 | 111 | 731 | 22.2 | 73.0 | 0.85 | cpg_island | |
CEG1532-EL01-A1200-005 | 344 | 1.549550 | 222 | chr1 | 28735 | 29737 | 0 | 222 | 1002 | NA24385 | HG002 | 9 | 585 | 1002 | 111 | 731 | 22.2 | 73.0 | 0.85 | cpg_island | |
CEG1532-EL01-A1200-008 | 401 | 1.806306 | 222 | chr1 | 28735 | 29737 | 0 | 222 | 1002 | NA24149 | HG003 | 9 | 585 | 1002 | 111 | 731 | 22.2 | 73.0 | 0.85 | cpg_island | |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
137686 | CEG1530-EL01-A1200-007 | 0 | NaN | 0 | chrY | 57203423 | 57205423 | 137686 | 0 | 2000 | NA24149 | HG003 | 193 | 1021 | 308 | 29 | 190 | 18.8 | 61.7 | 0.99 | shores |
CEG1530-EL01-A1200-010 | 0 | NaN | 0 | chrY | 57203423 | 57205423 | 137686 | 0 | 2000 | NA24143 | HG004 | 193 | 1021 | 308 | 29 | 190 | 18.8 | 61.7 | 0.99 | shores | |
CEG1532-EL01-A1200-005 | 0 | NaN | 0 | chrY | 57203423 | 57205423 | 137686 | 0 | 2000 | NA24385 | HG002 | 193 | 1021 | 308 | 29 | 190 | 18.8 | 61.7 | 0.99 | shores | |
CEG1532-EL01-A1200-008 | 0 | NaN | 0 | chrY | 57203423 | 57205423 | 137686 | 0 | 2000 | NA24149 | HG003 | 193 | 1021 | 308 | 29 | 190 | 18.8 | 61.7 | 0.99 | shores | |
CEG1532-EL01-A1200-011 | 0 | NaN | 0 | chrY | 57203423 | 57205423 | 137686 | 0 | 2000 | NA24143 | HG004 | 193 | 1021 | 308 | 29 | 190 | 18.8 | 61.7 | 0.99 | shores |
826122 rows × 20 columns
Plotting features from a reduced dataset#
Having provided a set of genomic ranges of interest and then computing methylation information across these ranges and returning a new reduced dataset we can use some of modality's
plotting functions to visualise the results. Below we make use of catplot
and lmplot
from the modality.plot_features
module.
from modality.plot_features import catplot, lmplot
catplot(
reduced_ds,
col="sample_id",
y="num_total_c_mean",
hue="Type",
kind="violin",
col_wrap=3,
)
<seaborn.axisgrid.FacetGrid at 0x7f1bdf25c410>
lmplot(reduced_ds, x="Length", y="num_contexts", col="Type", sharey=False)
<seaborn.axisgrid.FacetGrid at 0x7f1bd73f4c50>
Worked example: Comparing exons to 5-prime UTRs#
As another example, we can quickly reduce the dataset over exons and 5-prime UTRs and compare the methylation fractions between these two classes:
exons_range = get_exons(
contig="chr1",
).unstrand()
five_prime_utr_range = get_five_prime_utrs(
contig="chr1",
).unstrand()
exons_utrs_reduced = ds.reduce_byranges(
ranges=[exons_range, five_prime_utr_range],
var="frac_modc",
)
exons_utrs_reduced
<xarray.Dataset> Dimensions: (ranges: 68925, sample_id: 6) Coordinates: contig (ranges) <U5 'chr1' 'chr1' ... 'chr1' 'chr1' start (ranges) int64 65418 65519 ... 248858917 248859014 end (ranges) int64 65432 65572 ... 248859084 248859084 range_id (ranges) int64 0 1 2 3 ... 68921 68922 68923 68924 num_contexts (ranges) int64 0 0 36 299 299 16 ... 10 0 0 38 20 range_length (ranges) int64 14 53 2548 1025 ... 11 11 167 70 NA_id (sample_id) object 'NA24385' ... 'NA24143' giab_id (sample_id) object 'HG002' 'HG003' ... 'HG004' * sample_id (sample_id) <U22 'CEG1530-EL01-A1200-004' ... '... Dimensions without coordinates: ranges Data variables: frac_modc_sum (ranges, sample_id) float64 dask.array<chunksize=(17232, 3), meta=np.ndarray> frac_modc_mean (ranges, sample_id) float64 dask.array<chunksize=(17232, 3), meta=np.ndarray> frac_modc_cpg_count (ranges, sample_id) uint64 dask.array<chunksize=(17232, 3), meta=np.ndarray> Source (ranges) object 'HAVANA' 'HAVANA' ... 'HAVANA' Type (ranges) object 'exon' 'exon' ... 'five_prime_UTR' Score (ranges) object '.' '.' '.' '.' ... '.' '.' '.' Phase (ranges) object '.' '.' '.' '.' ... '.' '.' '.' Id (ranges) object 'exon:ENST00000641515.2:1' ... ... Parent (ranges) object 'ENST00000641515.2' ... 'ENST00... Gene_id (ranges) object 'ENSG00000186092.7' ... 'ENSG00... Transcript_id (ranges) object 'ENST00000641515.2' ... 'ENST00... Gene_type (ranges) object 'protein_coding' ... 'protein_c... Gene_name (ranges) object 'OR4F5' 'OR4F5' ... 'ZNF692' Transcript_type (ranges) object 'protein_coding' ... 'protein_c... Transcript_name (ranges) object 'OR4F5-201' ... 'ZNF692-202' Exon_number (ranges) object '1' '2' '3' '1' ... '2' '1' '1' Exon_id (ranges) object 'ENSE00003812156.1' ... 'ENSE00... Level (ranges) object '2' '2' '2' '2' ... '2' '2' '2' Transcript_support_level (ranges) object '' '' '' '5' ... '1' '1' '1' '1' Tag (ranges) object 'RNA_Seq_supported_partial,basi... Havana_transcript (ranges) object 'OTTHUMT00000003223.4' ... 'OTT... Hgnc_id (ranges) object 'HGNC:14825' ... 'HGNC:26049' Ont (ranges) object '' '' '' '' '' ... nan nan nan nan Havana_gene (ranges) object 'OTTHUMG00000001094.4' ... 'OTT... Protein_id (ranges) object 'ENSP00000493376.2' ... 'ENSP00... Ccdsid (ranges) object 'CCDS30547.2' ... 'CCDS53487.1'
catplot(
exons_utrs_reduced,
kind='box',
hue='sample_id',
y='frac_modc_mean',
x='Type',
)
<seaborn.axisgrid.FacetGrid at 0x7f1bb3103490>
Subsetting the dataset based on genomic ranges#
As well as reducing the dataset over a series of ranges, modality
also supports subsetting the dataset, such that only CpGs that fall within a series of ranges are retained for downstream analysis. This can be very useful to restrict analyses to key regions for example. To do this we can use the subset_byranges()
method of the ContigDataset
. Below we use this along with a pca analysis (leveraging modality.pca
) to demonstrate this functionality and how one might use it:
from modality.pca import run_pca, plot_pca_scatter
islands_ds = ds.subset_byranges(
ranges=cpg_islands,
)
islands_ds
<xarray.Dataset> Dimensions: (pos: 4315720, sample_id: 6) Coordinates: NA_id (sample_id) object 'NA24385' ... 'NA24143' giab_id (sample_id) object 'HG002' ... 'HG004' group (sample_id) <U22 'CEG1530-EL01-A1200-004'... * sample_id (sample_id) <U22 'CEG1530-EL01-A1200-004'... contig (pos) <U5 dask.array<chunksize=(100000,), meta=np.ndarray> ref_position (pos) int64 dask.array<chunksize=(100000,), meta=np.ndarray> strand (pos) <U2 dask.array<chunksize=(100000,), meta=np.ndarray> Dimensions without coordinates: pos Data variables: num_c (pos, sample_id) uint16 dask.array<chunksize=(100000, 6), meta=np.ndarray> num_modc (pos, sample_id) uint16 dask.array<chunksize=(100000, 6), meta=np.ndarray> num_other (pos, sample_id) uint16 dask.array<chunksize=(100000, 6), meta=np.ndarray> num_total (pos, sample_id) uint16 dask.array<chunksize=(100000, 6), meta=np.ndarray> num_total_c (pos, sample_id) uint16 dask.array<chunksize=(100000, 6), meta=np.ndarray> frac_modc (pos, sample_id) float64 dask.array<chunksize=(100000, 6), meta=np.ndarray> Protocol Version (sample_id) object '5-Letter v2.4' ... '5... family (sample_id) object 'Ashkenazi Jewish' ...... family_status (sample_id) object 'Son' ... 'Mother' sex (sample_id) object 'Male' ... 'Female' tech_replicate_number (sample_id) int64 1 1 1 2 2 2 Input DNA Quantity (ng/sample) (sample_id) int64 80 80 80 80 80 80 Attributes: context: CG context_sampling: 1.0 contigs: ['chr1', 'chr2', 'chr3', 'chr4', 'chr5', 'chr6', 'chr... coordinate_basis: 0 description: A +modC dataset of human Genome in a bottle (GIAB) sa... fasta_path: GRCh38Decoy-ss-ctrls-v23.fa.gz frac_denominator: num_total_c frac_min_coverage: 10 input_path: ['CEG1530-EL01-A1200-001.genome.GRCh38Decoy.dedup.due... quant_type: quant5L ref_name: GRCh38Decoy sample_ids: ['CEG1530-EL01-A1200-001', 'CEG1530-EL01-A1200-004', ... slice_chr1: slice(0, 388278, 1) slice_chr10: slice(2006948, 2207692, 1) slice_chr11: slice(2207692, 2416334, 1) slice_chr12: slice(2416334, 2589760, 1) slice_chr13: slice(2589760, 2683988, 1) slice_chr14: slice(2683988, 2814352, 1) slice_chr15: slice(2814352, 2950224, 1) slice_chr16: slice(2950224, 3173146, 1) slice_chr17: slice(3173146, 3433468, 1) slice_chr18: slice(3433468, 3523034, 1) slice_chr19: slice(3523034, 3841880, 1) slice_chr2: slice(388278, 671346, 1) slice_chr20: slice(3841880, 3972924, 1) slice_chr21: slice(3972924, 4044738, 1) slice_chr22: slice(4044738, 4168048, 1) slice_chr3: slice(671346, 852496, 1) slice_chr4: slice(852496, 1020792, 1) slice_chr5: slice(1020792, 1214570, 1) slice_chr6: slice(1214570, 1404772, 1) slice_chr7: slice(1404772, 1638222, 1) slice_chr8: slice(1638222, 1813300, 1) slice_chr9: slice(1813300, 2006948, 1) slice_chrX: slice(4168048, 4307462, 1) slice_chrY: slice(4307462, 4315720, 1)
pca_object, transformed_data = run_pca(
input_array=islands_ds.frac_modc,
)
plot_pca_scatter(
pca_object=pca_object,
pca_result=transformed_data,
hue=islands_ds.family_status,
)
<seaborn.axisgrid.FacetGrid at 0x7f1bb4e8af90>