Predict Chromatin Accessibility in ES-E14 Cells, Part I: Feature Generation#
This is the first of two notebooks which use 6-base data to predict chromatin accessibility in mouse embryonic stem cells ES-E14. We aim to use methylation data in a simple subset of genomic regions around the TSS, and train a model to predict ATAC-seq data, using a public dataset of accessibility. The regions are based on the GENCODE annotations, for the mm10 mouse genome.
In this notebook, for each of the regions above, we compute a mean 5mC fraction, a mean 5hmC fraction, a record of the number of CpGs in the region (regardless of whether they are methylated or not), and the length of the region. Thus, we capture 4 features per region.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pyranges as pr
import seaborn as sns
modality
contains a set of functions to extract annotations from GENCODE. Note that GENCODE only provides annotations for the Human and Mouse genomes, and the annotation
module of modality
only supports the hg38 (GRCh38), mm10 (GRCm38) and mm39 (GRCm39) genomes. In this notebook, we use mm10.
from modality.datasets import load_biomodal_dataset
from modality.contig_dataset import ContigDataset
from modality.annotation import (
get_tss_region,
)
sns.set_theme()
sns.set_style("whitegrid")
biomodal_palette = ["#003B49", "#9CDBD9", "#F87C56", "#C0DF16", "#05868E"]
sns.set_palette(biomodal_palette)
Load the 6-base data#
Our data is stored in a compressed zarr store. modality
allows its users to load the zarr store into an object called a ContigDataset
, which is a subclass of xarray.Dataset. The ContigDataset
gives a useful multidimensional view of our data. In our case, our methylation data (for instance the number of methylated C’s at a given CpG) is represented along two dimensions: the CpG genomic position, and the sample ID.
The function below loads our public mouse dataset for the ES-E14 cell line, using the load_biomodal_dataset function.
The function below does several things:
load the data into a
ContigDataset
calledds
.Drop some variables that we won’t need.
Because the samples are all technical replicates, we sum over all 4 of them with
ds.sum
to gain some power.Now that we’ve summed over the samples,
ds
is a one-dimensional array along thepos
dimension. However, internally,modality
often has to assume thatsample_id
is a dimension ofds
. So we add it back withds.expand_dims(dim="sample_id", axis=1)
, and we give it a name withds.assign_coords(sample_id=["sample_0"])
.Finally we compute methylation fractions with
ds.assign_fractions
.
def load_data(
zarr_store_path,
sum_samples=True,
merge_contexts=False,
min_coverage=None,
):
if zarr_store_path == "esE14":
ds = load_biomodal_dataset("esE14")
ds = ds.drop_vars(["Input DNA Quantity (ng/sample)", "tech_replicate_number"])
else:
ds = ContigDataset.from_zarrz(zarr_store_path)
if sum_samples:
ds = ds.sum(dim="sample_id", keep_attrs=True)
ds = ds.expand_dims(dim="sample_id", axis=1)
ds = ds.assign_coords(sample_id=["sample_0"])
if merge_contexts:
ds = ds.merge_cpgs()
if min_coverage is not None:
ds = ds.subset_bycoverage(min_coverage=min_coverage)
ds.assign_fractions(
numerators=["num_modc", "num_mc", "num_hmc"],
denominator="num_total_c",
min_coverage=10,
inplace=True,
)
return ds
ds = load_data(
"esE14",
sum_samples=True,
merge_contexts=False,
min_coverage=10
)
Below is what our final ContigDataset
looks like. It has 26 million CpGs along the pos
dimension, and one sample along the sample_id
dimension.
print(ds)
<modality.contig_dataset.ContigDataset>
<xarray.Dataset>
Dimensions: (pos: 24830501, sample_id: 1)
Coordinates:
* sample_id (sample_id) <U8 'sample_0'
group (sample_id) <U8 'sample_0'
contig (pos) <U5 dask.array<chunksize=(100000,), meta=np.ndarray>
ref_position (pos) int64 dask.array<chunksize=(100000,), meta=np.ndarray>
strand (pos) <U2 dask.array<chunksize=(100000,), meta=np.ndarray>
Dimensions without coordinates: pos
Data variables:
num_c (pos, sample_id) uint64 dask.array<chunksize=(100000, 1), meta=np.ndarray>
num_hmc (pos, sample_id) uint64 dask.array<chunksize=(100000, 1), meta=np.ndarray>
num_mc (pos, sample_id) uint64 dask.array<chunksize=(100000, 1), meta=np.ndarray>
num_modc (pos, sample_id) uint64 dask.array<chunksize=(100000, 1), meta=np.ndarray>
num_other (pos, sample_id) uint64 dask.array<chunksize=(100000, 1), meta=np.ndarray>
num_total (pos, sample_id) uint64 dask.array<chunksize=(100000, 1), meta=np.ndarray>
num_total_c (pos, sample_id) uint64 dask.array<chunksize=(100000, 1), meta=np.ndarray>
frac_modc (pos, sample_id) float64 dask.array<chunksize=(100000, 1), meta=np.ndarray>
frac_mc (pos, sample_id) float64 dask.array<chunksize=(100000, 1), meta=np.ndarray>
frac_hmc (pos, sample_id) float64 dask.array<chunksize=(100000, 1), meta=np.ndarray>
Attributes:
context: CG
context_sampling: 1.0
contigs: ['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '...
coordinate_basis: 0
description: An evoC dataset of mouse ES-E14 cells, data contains ...
fasta_path: GRCm38-ss-ctrls-v23.fa.gz
input_path: ['CEG1485-EL01-D1115-001.genome.GRCm38.dedup.duet-mod...
quant_type: quant6L
ref_name: GRCm38
sample_ids: ['CEG1485-EL01-D1115-001', 'CEG1485-EL01-D1115-002', ...
slice_1: slice(0, 1686385, 1)
slice_10: slice(13470018, 14802757, 1)
slice_11: slice(14802757, 16308053, 1)
slice_12: slice(16308053, 17405694, 1)
slice_13: slice(17405694, 18562567, 1)
slice_14: slice(18562567, 19604813, 1)
slice_15: slice(19604813, 20653494, 1)
slice_16: slice(20653494, 21539973, 1)
slice_17: slice(21539973, 22549921, 1)
slice_18: slice(22549921, 23411019, 1)
slice_19: slice(23411019, 24108652, 1)
slice_2: slice(1686385, 3569945, 1)
slice_3: slice(3569945, 4912268, 1)
slice_4: slice(4912268, 6430526, 1)
slice_5: slice(6430526, 8053335, 1)
slice_6: slice(8053335, 9406418, 1)
slice_7: slice(9406418, 10762044, 1)
slice_8: slice(10762044, 12159643, 1)
slice_9: slice(12159643, 13470018, 1)
slice_MT: slice(24830068, 24830501, 1)
slice_X: slice(24108652, 24791522, 1)
slice_Y: slice(24791522, 24830068, 1)
frac_denominator: num_total_c
frac_min_coverage: 10
Get regions near the TSS#
We want to use methylation information around the TSS as our input data to predict ATAC-seq. modality
has a function called get_tss_region
, which uses an annotated genome from GENCODE to identify the region around the TSS. The two arguments start_offset
and span
allow the user to create regions that are off-centred, and which span a certain length. The positive (resp. negative) value of start_offset
indicate that we go downstream (resp. upstream) of the TSS.
starts = [-1000, -500, 0, 500, 1000]
span = 1000
reference_genome = "mm10"
regions_dict = {}
for start in starts:
print(f"Extracting TSS region for offset: {start} and span: {span}")
tss = get_tss_region(
contig=None,
start=None,
end=None,
reference=reference_genome,
protein_coding=True,
filterby=None,
start_offset=start,
span=span,
as_pyranges=True,
)
regions_dict[start] = tss.unstrand()
regions_dict[start].Region = str(start)
Extracting TSS region for offset: -1000 and span: 1000
Extracting TSS region for offset: -500 and span: 1000
Extracting TSS region for offset: 0 and span: 1000
Extracting TSS region for offset: 500 and span: 1000
Extracting TSS region for offset: 1000 and span: 1000
ranges = list(regions_dict.values())
Create features with reduce_byranges
#
We use a method of modality.ContigDataset
called reduce_byranges
which allows to reduce our contig dataset to summarise methylation information over a list of genomic ranges (in our case, the ranges that we created above). These ranges should be in the form of pyranges objects. In principle, a user could also pass their own ranges, for instance reading them from a bed or gff file - see pyranges documentation.
rdr = ds.reduce_byranges(
ranges=ranges,
var=["num_mc", "num_hmc", "num_modc", "num_total_c"],
)
rdr
<xarray.Dataset> Dimensions: (ranges: 108365, sample_id: 1) Coordinates: contig (ranges) <U2 '1' '1' '1' '1' '1' ... 'Y' 'Y' 'Y' 'Y' start (ranges) int64 4806787 4806891 ... 85528907 89743531 end (ranges) int64 4807787 4807891 ... 85529907 89744531 range_id (ranges) int64 0 1 2 3 ... 108362 108363 108364 num_contexts (ranges) int64 78 88 78 0 20 0 44 ... 0 0 0 0 0 0 2 range_length (ranges) int64 1000 1000 1000 1000 ... 1000 1000 1000 * sample_id (sample_id) <U8 'sample_0' Dimensions without coordinates: ranges Data variables: num_mc_sum (ranges, sample_id) uint64 dask.array<chunksize=(27092, 1), meta=np.ndarray> num_mc_mean (ranges, sample_id) float64 dask.array<chunksize=(27092, 1), meta=np.ndarray> num_mc_cpg_count (ranges, sample_id) uint64 dask.array<chunksize=(27092, 1), meta=np.ndarray> num_hmc_sum (ranges, sample_id) uint64 dask.array<chunksize=(27092, 1), meta=np.ndarray> num_hmc_mean (ranges, sample_id) float64 dask.array<chunksize=(27092, 1), meta=np.ndarray> num_hmc_cpg_count (ranges, sample_id) uint64 dask.array<chunksize=(27092, 1), meta=np.ndarray> num_modc_sum (ranges, sample_id) uint64 dask.array<chunksize=(27092, 1), meta=np.ndarray> num_modc_mean (ranges, sample_id) float64 dask.array<chunksize=(27092, 1), meta=np.ndarray> num_modc_cpg_count (ranges, sample_id) uint64 dask.array<chunksize=(27092, 1), meta=np.ndarray> num_total_c_sum (ranges, sample_id) uint64 dask.array<chunksize=(27092, 1), meta=np.ndarray> num_total_c_mean (ranges, sample_id) float64 dask.array<chunksize=(27092, 1), meta=np.ndarray> num_total_c_cpg_count (ranges, sample_id) uint64 dask.array<chunksize=(27092, 1), meta=np.ndarray> Source (ranges) object 'HAVANA' 'HAVANA' ... 'HAVANA' Type (ranges) object 'gene' 'gene' ... 'gene' 'gene' Score (ranges) object '.' '.' '.' '.' ... '.' '.' '.' '.' Phase (ranges) object '.' '.' '.' '.' ... '.' '.' '.' '.' Id (ranges) object 'ENSMUSG00000025903.14' ... 'ENSMU... Gene_id (ranges) object 'ENSMUSG00000025903.14' ... 'ENSMU... Gene_type (ranges) object 'protein_coding' ... 'protein_coding' Gene_name (ranges) object 'Lypla1' 'Gm37988' ... 'Gm21996' Level (ranges) object '2' '2' '2' '2' ... '2' '2' '2' '2' Mgi_id (ranges) object 'MGI:1344588' ... 'MGI:5440224' Havana_gene (ranges) object 'OTTMUSG00000021562.4' ... 'OTTMUS... Tag (ranges) object 'overlapping_locus' ... '' Region (ranges) object '-1000' '-1000' ... '1000' '1000'
The outcome of reduce_byranges
is an xarray.Dataset
object which contains summarised methylation information across a genomic range (mean, sum, and CpG count of the range) for each of the variables that we specified in var
(typically the number of modified C’s num_mc
, or the methylation fraction frac_mc
).
We use the number of mC (or hmc or modC) summed over the range, divided by the total number of C’s summed over that same range to get a mean methylation fraction over the range.
rdr = rdr.assign(
mean_mc=rdr["num_mc_sum"] / rdr["num_total_c_sum"],
mean_hmc=rdr["num_hmc_sum"] / rdr["num_total_c_sum"],
mean_modc=rdr["num_modc_sum"] / rdr["num_total_c_sum"],
)
Let’s turn this xarray dataset into a pandas dataframe, which will form the base of our feature set.
df = rdr.to_dataframe().reset_index(level="sample_id", drop=True)
df["cpg_count"] = df["num_total_c_cpg_count"]
df_pivot = df.pivot(
index=["Gene_id", "Gene_name", "contig"],
columns="Region",
values=["mean_mc", "mean_hmc", "mean_modc", "cpg_count", "range_length"],
)
df_features = df_pivot.copy()
df_features.columns = [" ".join(col).strip() for col in df_features.columns.values]
# replace white spaces with underscores in features
features = df_features.columns
features = [f.replace(" ", "_") for f in features]
df_features.columns = features
df_features = df_features.reset_index()
The above dataframe is our feature set, containing a series of features (mean 5mC, mean 5hmC, mean modC, CpG count, and region length) for each of the genomic regions of interest, and for each gene.
Write to file#
Finally, we write this feature file as a pickle file, which preserves the variable type, and is therefore prefered over a more basic text file.
df_features = df_features.reset_index()
df_features.head()
index | Gene_id | Gene_name | contig | mean_mc_-1000 | mean_mc_-500 | mean_mc_0 | mean_mc_1000 | mean_mc_500 | mean_hmc_-1000 | ... | cpg_count_-1000 | cpg_count_-500 | cpg_count_0 | cpg_count_1000 | cpg_count_500 | range_length_-1000 | range_length_-500 | range_length_0 | range_length_1000 | range_length_500 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | ENSMUSG00000000001.4 | Gnai3 | 3 | 0.072351 | 0.003064 | 0.003045 | 0.562945 | 0.296720 | 0.004368 | ... | 52.0 | 122.0 | 98.0 | 24.0 | 36.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 |
1 | 1 | ENSMUSG00000000003.15 | Pbsn | X | 0.650000 | 0.753479 | 0.750416 | NaN | 0.734694 | 0.008333 | ... | 2.0 | 10.0 | 12.0 | 0.0 | 2.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 |
2 | 2 | ENSMUSG00000000028.15 | Cdc45 | 16 | 0.004541 | 0.000480 | 0.003059 | 0.661132 | 0.101471 | 0.001090 | ... | 106.0 | 96.0 | 60.0 | 12.0 | 20.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 |
3 | 3 | ENSMUSG00000000037.17 | Scml2 | X | 0.693548 | 0.749503 | 0.802658 | 0.856354 | 0.849265 | 0.056452 | ... | 16.0 | 35.0 | 23.0 | 4.0 | 6.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 |
4 | 4 | ENSMUSG00000000049.11 | Apoh | 11 | 0.064476 | 0.003680 | 0.000332 | 0.345097 | 0.074608 | 0.009784 | ... | 92.0 | 146.0 | 164.0 | 48.0 | 128.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 |
5 rows × 29 columns
df_features.columns
Index(['index', 'Gene_id', 'Gene_name', 'contig', 'mean_mc_-1000',
'mean_mc_-500', 'mean_mc_0', 'mean_mc_1000', 'mean_mc_500',
'mean_hmc_-1000', 'mean_hmc_-500', 'mean_hmc_0', 'mean_hmc_1000',
'mean_hmc_500', 'mean_modc_-1000', 'mean_modc_-500', 'mean_modc_0',
'mean_modc_1000', 'mean_modc_500', 'cpg_count_-1000', 'cpg_count_-500',
'cpg_count_0', 'cpg_count_1000', 'cpg_count_500', 'range_length_-1000',
'range_length_-500', 'range_length_0', 'range_length_1000',
'range_length_500'],
dtype='object')
df_features.to_pickle("features_atacseq.pickle")
df_features
index | Gene_id | Gene_name | contig | mean_mc_-1000 | mean_mc_-500 | mean_mc_0 | mean_mc_1000 | mean_mc_500 | mean_hmc_-1000 | ... | cpg_count_-1000 | cpg_count_-500 | cpg_count_0 | cpg_count_1000 | cpg_count_500 | range_length_-1000 | range_length_-500 | range_length_0 | range_length_1000 | range_length_500 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | ENSMUSG00000000001.4 | Gnai3 | 3 | 0.072351 | 0.003064 | 0.003045 | 0.562945 | 0.296720 | 0.004368 | ... | 52.0 | 122.0 | 98.0 | 24.0 | 36.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 |
1 | 1 | ENSMUSG00000000003.15 | Pbsn | X | 0.650000 | 0.753479 | 0.750416 | NaN | 0.734694 | 0.008333 | ... | 2.0 | 10.0 | 12.0 | 0.0 | 2.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 |
2 | 2 | ENSMUSG00000000028.15 | Cdc45 | 16 | 0.004541 | 0.000480 | 0.003059 | 0.661132 | 0.101471 | 0.001090 | ... | 106.0 | 96.0 | 60.0 | 12.0 | 20.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 |
3 | 3 | ENSMUSG00000000037.17 | Scml2 | X | 0.693548 | 0.749503 | 0.802658 | 0.856354 | 0.849265 | 0.056452 | ... | 16.0 | 35.0 | 23.0 | 4.0 | 6.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 |
4 | 4 | ENSMUSG00000000049.11 | Apoh | 11 | 0.064476 | 0.003680 | 0.000332 | 0.345097 | 0.074608 | 0.009784 | ... | 92.0 | 146.0 | 164.0 | 48.0 | 128.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
21668 | 21668 | ENSMUSG00000118623.1 | AL935121.1 | 2 | 0.386207 | 0.577088 | 0.662259 | 0.628448 | 0.680887 | 0.006897 | ... | 5.0 | 16.0 | 22.0 | 10.0 | 20.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 |
21669 | 21669 | ENSMUSG00000118638.1 | AL805980.1 | X | 0.752593 | 0.802657 | 0.813023 | 0.655172 | 0.740448 | 0.019259 | ... | 14.0 | 20.0 | 20.0 | 8.0 | 14.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 |
21670 | 21670 | ENSMUSG00000118640.1 | AC167036.2 | 1 | 0.846839 | 0.803757 | 0.795282 | 0.734275 | 0.796283 | 0.017809 | ... | 12.0 | 30.0 | 34.0 | 8.0 | 18.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 |
21671 | 21671 | ENSMUSG00000118646.1 | AC160405.1 | 10 | 0.771987 | 0.793955 | 0.797990 | 0.761525 | 0.779537 | 0.020630 | ... | 10.0 | 26.0 | 30.0 | 16.0 | 16.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 |
21672 | 21672 | ENSMUSG00000118653.1 | AC159819.1 | 9 | 0.376609 | 0.304145 | 0.293673 | 0.692201 | 0.649180 | 0.113734 | ... | 12.0 | 63.0 | 57.0 | 6.0 | 12.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 |
21673 rows × 29 columns
For all 21,673 protein-coding genes in the mm10 genome, this file contains the ID, name, chromosome, start and end, as well as information about each sub-region around the TSS (i.e. mean methylation, cpg count, and length at each region).