Predict Chromatin Accessibility in ES-E14 Cells, Part I: Feature Generation

This is the first of two notebooks that use 6-base data to predict chromatin accessibility in the mouse embryonic stem cell line ES-E14. We aim to summarise methylation data over a simple set of genomic regions around the transcription start site (TSS) of each gene, and to train a model that predicts ATAC-seq signal from these summaries, using a public accessibility dataset. The regions are derived from the GENCODE annotation of the mm10 mouse genome.

In this notebook, for each of the regions above, we compute a mean 5mC fraction, a mean 5hmC fraction, a mean modC (modified C) fraction, the number of CpGs in the region (regardless of whether they are methylated or not), and the length of the region. Thus, we capture 5 features per region.

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pyranges as pr
import seaborn as sns

modality contains a set of functions to extract annotations from GENCODE. Note that GENCODE only provides annotations for the human and mouse genomes, and that the annotation module of modality only supports the hg38 (GRCh38), mm10 (GRCm38) and mm39 (GRCm39) genomes. In this notebook, we use mm10.

from modality import load_biomodal_dataset
from modality.annotation import (
    get_tss_region,
)

sns.set_theme()
sns.set_style("whitegrid")
biomodal_palette = ["#003B49", "#9CDBD9", "#F87C56", "#C0DF16", "#05868E"]
sns.set_palette(biomodal_palette)

Load the 6-base data

Our data is stored in a compressed zarr store. modality allows its users to load the zarr store into an object called a ContigDataset, which is a subclass of xarray.Dataset. The ContigDataset gives a useful multidimensional view of our data. In our case, our methylation data (for instance the number of methylated C’s at a given CpG) is represented along two dimensions: the CpG genomic position, and the sample ID.

The load_data function below loads our public mouse dataset for the ES-E14 cell line, using the load_biomodal_dataset function. It does several things:

Load the data into a ContigDataset called ds.
Merge the technical replicates. Because the samples are all technical replicates, we sum over all 4 of them to gain some power; the code requests this with merge_technical_replicates=True.
Keep sample_id as a dimension. Internally, modality often has to assume that sample_id is a dimension of ds, so the merged dataset still has a sample_id dimension, of length one.
Optionally, filter positions by coverage with ds.subset_bycoverage when min_coverage is given.
Finally, compute methylation fractions with ds.assign_fractions.

def load_data(
    merge_technical_replicates=True,
    merge_cpgs=False,
    min_coverage=None,
):
    ds = load_biomodal_dataset(
        "esE14",
        merge_technical_replicates=merge_technical_replicates,
        merge_cpgs=merge_cpgs,
    )

    if min_coverage is not None:
        ds = ds.subset_bycoverage(min_coverage=min_coverage)

    ds.assign_fractions(
        numerators=["num_modc", "num_mc", "num_hmc"],
        denominator="num_total_c",
        min_coverage=10,
        inplace=True,
    )
    return ds

ds = load_data(
    merge_technical_replicates=True,
    merge_cpgs=False,
    min_coverage=10
    )
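To make the replicate merging and the coverage threshold more concrete, here is a minimal sketch on a toy xarray.Dataset. It does not call modality: the counts, the replicate names and the masking rule (setting fractions to NaN below min_coverage) are assumptions made for illustration, not a description of what load_biomodal_dataset or assign_fractions do internally.

```python
import numpy as np
import xarray as xr

# Toy counts for 4 CpG positions across 4 technical replicates (made-up numbers).
toy = xr.Dataset(
    {
        "num_mc": (
            ("pos", "sample_id"),
            np.array([[3, 2, 4, 1], [0, 1, 0, 0], [5, 5, 6, 4], [1, 0, 0, 0]]),
        ),
        "num_total_c": (
            ("pos", "sample_id"),
            np.array([[4, 3, 5, 2], [6, 7, 5, 6], [5, 6, 6, 5], [2, 1, 1, 0]]),
        ),
    },
    coords={"sample_id": ["rep1", "rep2", "rep3", "rep4"]},
)

# Summing the counts over replicates removes the sample_id dimension.
merged = toy.sum(dim="sample_id")

# Add back a length-one sample_id dimension and give the merged sample a name.
merged = merged.expand_dims(dim="sample_id", axis=1)
merged = merged.assign_coords(sample_id=["sample"])

# Methylation fraction, masked (NaN) where the merged coverage is below the threshold.
min_coverage = 10
frac_mc = (merged["num_mc"] / merged["num_total_c"]).where(
    merged["num_total_c"] >= min_coverage
)
print(frac_mc.values.ravel())
```

In this toy example the last position has a merged coverage of 4, so its fraction is masked, while the other three positions only pass the threshold because the replicates were summed first.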

Below is what our final ContigDataset looks like. It has about 38.8 million cytosine positions in CpG context along the pos dimension, and one sample along the sample_id dimension.

print(ds)

<modality.ContigDataset>
<xarray.Dataset> Size: 7GB
Dimensions:                (pos: 38787463, sample_id: 1)
Coordinates:
  * sample_id              (sample_id) <U6 24B 'sample'
    group                  (sample_id) <U6 24B 'sample'
    contig                 (pos) <U20 3GB dask.array<chunksize=(80000,), meta=np.ndarray>
    ref_position           (pos) uint32 155MB dask.array<chunksize=(80000,), meta=np.ndarray>
    strand                 (pos) <U2 310MB dask.array<chunksize=(80000,), meta=np.ndarray>
    trinucleotide_context  (pos) uint8 39MB dask.array<chunksize=(80000,), meta=np.ndarray>
Dimensions without coordinates: pos
Data variables:
    num_c                  (pos, sample_id) uint64 310MB dask.array<chunksize=(80000, 1), meta=np.ndarray>
    num_hmc                (pos, sample_id) uint64 310MB dask.array<chunksize=(80000, 1), meta=np.ndarray>
    num_mc                 (pos, sample_id) uint64 310MB dask.array<chunksize=(80000, 1), meta=np.ndarray>
    num_modc               (pos, sample_id) uint64 310MB dask.array<chunksize=(80000, 1), meta=np.ndarray>
    num_n                  (pos, sample_id) uint64 310MB dask.array<chunksize=(80000, 1), meta=np.ndarray>
    num_other              (pos, sample_id) uint64 310MB dask.array<chunksize=(80000, 1), meta=np.ndarray>
    num_total              (pos, sample_id) uint64 310MB dask.array<chunksize=(80000, 1), meta=np.ndarray>
    num_total_c            (pos, sample_id) uint64 310MB dask.array<chunksize=(80000, 1), meta=np.ndarray>
    frac_modc              (pos, sample_id) float64 310MB dask.array<chunksize=(80000, 1), meta=np.ndarray>
    frac_mc                (pos, sample_id) float64 310MB dask.array<chunksize=(80000, 1), meta=np.ndarray>
    frac_hmc               (pos, sample_id) float64 310MB dask.array<chunksize=(80000, 1), meta=np.ndarray>
Attributes:
    context:            CG
    context_encoding:   {'0': 'AAA', '1': 'AAC', '10': 'AGA', '100': 'TAA', '...
    contigs:            ['chr1', 'chr2', 'chr3', 'chr4', 'chr5', 'chr6', 'chr...
    coordinate_basis:   0
    description:        An evoC dataset of mouse ES-E14 cells, data contains ...
    fasta_path:         GRCm38p6.fa
    input_path:         ['CEG1485-EL01-D1115-001.genome.GRCm38p6_primary_asse...
    quant_subtype:      evoC
    ref_name:           GRCm38p6
    sample_ids:         ['CEG1485-EL01-D1115-001', 'CEG1485-EL01-D1115-002', ...
    slice_GL456210.1:   slice(38760725, 38762983, 1)
    slice_GL456211.1:   slice(38762983, 38765606, 1)
    slice_GL456212.1:   slice(38765606, 38766985, 1)
    slice_GL456216.1:   slice(38766985, 38768808, 1)
    slice_GL456221.1:   slice(38768808, 38770731, 1)
    slice_GL456233.1:   slice(38770731, 38773440, 1)
    slice_GL456239.1:   slice(38773440, 38773885, 1)
    slice_GL456354.1:   slice(38773885, 38774243, 1)
    slice_GL456359.1:   slice(38774243, 38774482, 1)
    slice_GL456367.1:   slice(38774482, 38775051, 1)
    slice_GL456368.1:   slice(38775051, 38775163, 1)
    slice_GL456370.1:   slice(38775163, 38775657, 1)
    slice_GL456372.1:   slice(38775657, 38776011, 1)
    slice_GL456378.1:   slice(38776011, 38776272, 1)
    slice_GL456379.1:   slice(38776272, 38776958, 1)
    slice_GL456381.1:   slice(38776958, 38777177, 1)
    slice_GL456382.1:   slice(38777177, 38777397, 1)
    slice_GL456383.1:   slice(38777397, 38778447, 1)
    slice_GL456385.1:   slice(38778447, 38778745, 1)
    slice_GL456387.1:   slice(38778745, 38778960, 1)
    slice_GL456389.1:   slice(38778960, 38779732, 1)
    slice_GL456390.1:   slice(38779732, 38780195, 1)
    slice_GL456392.1:   slice(38780195, 38780936, 1)
    slice_GL456393.1:   slice(38780936, 38781730, 1)
    slice_GL456394.1:   slice(38781730, 38782021, 1)
    slice_GL456396.1:   slice(38782021, 38782780, 1)
    slice_JH584292.1:   slice(38782780, 38782845, 1)
    slice_JH584294.1:   slice(38782845, 38782849, 1)
    slice_JH584295.1:   slice(38782849, 38783058, 1)
    slice_JH584296.1:   slice(38783058, 38783059, 1)
    slice_JH584297.1:   slice(38783059, 38783079, 1)
    slice_JH584299.1:   slice(38783079, 38785871, 1)
    slice_JH584301.1:   slice(38785871, 38785873, 1)
    slice_JH584304.1:   slice(38785873, 38787463, 1)
    slice_chr1:         slice(0, 2683651, 1)
    slice_chr10:        slice(20816493, 22875812, 1)
    slice_chr11:        slice(22875812, 25073929, 1)
    slice_chr12:        slice(25073929, 26774508, 1)
    slice_chr13:        slice(26774508, 28571808, 1)
    slice_chr14:        slice(28571808, 30195909, 1)
    slice_chr15:        slice(30195909, 31779607, 1)
    slice_chr16:        slice(31779607, 33169140, 1)
    slice_chr17:        slice(33169140, 34705722, 1)
    slice_chr18:        slice(34705722, 36023233, 1)
    slice_chr19:        slice(36023233, 37051640, 1)
    slice_chr2:         slice(2683651, 5512322, 1)
    slice_chr3:         slice(5512322, 7684834, 1)
    slice_chr4:         slice(7684834, 10033881, 1)
    slice_chr5:         slice(10033881, 12510242, 1)
    slice_chr6:         slice(12510242, 14648397, 1)
    slice_chr7:         slice(14648397, 16758876, 1)
    slice_chr8:         slice(16758876, 18817679, 1)
    slice_chr9:         slice(18817679, 20816493, 1)
    slice_chrM:         slice(38760292, 38760725, 1)
    slice_chrX:         slice(37051640, 38619916, 1)
    slice_chrY:         slice(38619916, 38760292, 1)
    zarr_version:       0.1
    frac_denominator:   num_total_c
    frac_min_coverage:  10

Get regions near the TSS

We want to use methylation information around the TSS as our input data to predict ATAC-seq. modality has a function called get_tss_region, which uses an annotated genome from GENCODE to identify the region around the TSS. The two arguments start_offset and span allow the user to create regions that are off-centred, and which span a certain length. A positive (resp. negative) value of start_offset indicates that the region starts downstream (resp. upstream) of the TSS.

starts = [-1000, -500, 0, 500, 1000]
span = 1000
reference_genome = "mm10"

regions_dict = {}
for start in starts:
    print(f"Extracting TSS region for offset: {start} and span: {span}")
    tss = get_tss_region(
        contig=None,
        start=None,
        end=None,
        reference=reference_genome,
        protein_coding=True,
        filterby=None,
        start_offset=start,
        span=span,
        as_pyranges=True,
    )

    regions_dict[start] = tss.unstrand()
    regions_dict[start].Region = str(start)

Extracting TSS region for offset: -1000 and span: 1000

Extracting TSS region for offset: -500 and span: 1000

Extracting TSS region for offset: 0 and span: 1000

Extracting TSS region for offset: 500 and span: 1000

Extracting TSS region for offset: 1000 and span: 1000

ranges = list(regions_dict.values())
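The way start_offset and span combine into a window can be sketched with a small helper. tss_window is a hypothetical function written for this illustration, not part of modality, and the coordinate convention it uses (half-open windows, mirrored on the minus strand, off-by-one details ignored) is an assumption; the exact convention of get_tss_region is not shown here.

```python
def tss_window(tss, strand, start_offset, span):
    """Hypothetical helper: return (start, end) of a window placed relative to a TSS.

    Assumes "downstream" follows the direction of transcription, so the
    window is mirrored for genes on the minus strand.
    """
    if strand == "+":
        start = tss + start_offset
        return start, start + span
    # On the minus strand, downstream means decreasing coordinates.
    end = tss - start_offset
    return end - span, end


# Same offsets and span as in the cell above, for a made-up TSS at position 5000.
for offset in [-1000, -500, 0, 500, 1000]:
    print(offset, tss_window(5000, "+", offset, 1000))
```

Because the offsets are spaced 500 bases apart while each window spans 1000 bases, consecutive windows overlap by half their length.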

Create features with reduce_byranges

We use a method of modality.ContigDataset called reduce_byranges, which reduces our contig dataset by summarising methylation information over a list of genomic ranges (in our case, the ranges that we created above). These ranges should be in the form of pyranges objects. In principle, a user could also pass their own ranges, for instance by reading them from a bed or gff file (see the pyranges documentation).

rdr = ds.reduce_byranges(
    ranges=ranges,
    var=["num_mc", "num_hmc", "num_modc", "num_total_c"],
    )

rdr

<xarray.Dataset> Size: 30MB
Dimensions:                (ranges: 108365, sample_id: 1)
Coordinates:
    contig                 (ranges) <U10 4MB 'chr1' 'chr1' ... 'chrY' 'chrY'
    start                  (ranges) int64 867kB 4806787 4806891 ... 89743531
    end                    (ranges) int64 867kB 4807787 4807891 ... 89744531
    range_id               (ranges) int64 867kB 0 1 2 3 ... 108362 108363 108364
    num_contexts           (ranges) int64 867kB 80 114 88 0 20 20 ... 0 0 0 0 2
    range_length           (ranges) int64 867kB 1000 1000 1000 ... 1000 1000
  * sample_id              (sample_id) <U6 24B 'sample'
Dimensions without coordinates: ranges
Data variables:
    num_mc_sum             (ranges, sample_id) uint64 867kB dask.array<chunksize=(27092, 1), meta=np.ndarray>
    num_mc_mean            (ranges, sample_id) float64 867kB dask.array<chunksize=(27092, 1), meta=np.ndarray>
    num_mc_cpg_count       (ranges, sample_id) uint64 867kB dask.array<chunksize=(27092, 1), meta=np.ndarray>
    num_hmc_sum            (ranges, sample_id) uint64 867kB dask.array<chunksize=(27092, 1), meta=np.ndarray>
    num_hmc_mean           (ranges, sample_id) float64 867kB dask.array<chunksize=(27092, 1), meta=np.ndarray>
    num_hmc_cpg_count      (ranges, sample_id) uint64 867kB dask.array<chunksize=(27092, 1), meta=np.ndarray>
    num_modc_sum           (ranges, sample_id) uint64 867kB dask.array<chunksize=(27092, 1), meta=np.ndarray>
    num_modc_mean          (ranges, sample_id) float64 867kB dask.array<chunksize=(27092, 1), meta=np.ndarray>
    num_modc_cpg_count     (ranges, sample_id) uint64 867kB dask.array<chunksize=(27092, 1), meta=np.ndarray>
    num_total_c_sum        (ranges, sample_id) uint64 867kB dask.array<chunksize=(27092, 1), meta=np.ndarray>
    num_total_c_mean       (ranges, sample_id) float64 867kB dask.array<chunksize=(27092, 1), meta=np.ndarray>
    num_total_c_cpg_count  (ranges, sample_id) uint64 867kB dask.array<chunksize=(27092, 1), meta=np.ndarray>
    Source                 (ranges) object 867kB 'HAVANA' 'HAVANA' ... 'HAVANA'
    Type                   (ranges) object 867kB 'gene' 'gene' ... 'gene' 'gene'
    Score                  (ranges) object 867kB '.' '.' '.' '.' ... '.' '.' '.'
    Phase                  (ranges) object 867kB '.' '.' '.' '.' ... '.' '.' '.'
    Id                     (ranges) object 867kB 'ENSMUSG00000025903.14' ... ...
    Gene_id                (ranges) object 867kB 'ENSMUSG00000025903.14' ... ...
    Gene_type              (ranges) object 867kB 'protein_coding' ... 'protei...
    Gene_name              (ranges) object 867kB 'Lypla1' ... 'Gm21996'
    Level                  (ranges) object 867kB '2' '2' '2' '2' ... '2' '2' '2'
    Mgi_id                 (ranges) object 867kB 'MGI:1344588' ... 'MGI:5440224'
    Havana_gene            (ranges) object 867kB 'OTTMUSG00000021562.4' ... '...
    Tag                    (ranges) object 867kB 'overlapping_locus' ... ''
    Region                 (ranges) object 867kB '-1000' '-1000' ... '1000'

The outcome of reduce_byranges is an xarray.Dataset object which contains summarised methylation information for each genomic range (the mean, the sum, and the CpG count over the range) for each of the variables that we specified in var (typically a count such as the number of methylated C’s num_mc, or a fraction such as frac_mc).

To get a mean methylation fraction over a range, we divide the number of mC (or hmC, or modC) summed over the range by the total number of C’s summed over that same range.
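As a rough picture of what a per-range summary contains, the sketch below computes a sum, a mean and a CpG count for toy ranges using plain pandas. It is not how reduce_byranges is implemented; the half-open interval convention and the handling of empty ranges are assumptions, and all numbers are made up.

```python
import numpy as np
import pandas as pd

# Toy per-CpG counts on a single contig (made-up numbers).
cpgs = pd.DataFrame(
    {
        "ref_position": [105, 120, 180, 450, 460, 900],
        "num_mc": [8, 0, 5, 1, 0, 9],
        "num_total_c": [10, 12, 10, 20, 15, 10],
    }
)

# Toy half-open ranges [start, end); the last one contains no CpG.
toy_ranges = pd.DataFrame({"start": [100, 400, 600], "end": [200, 500, 700]})

rows = []
for start, end in zip(toy_ranges["start"], toy_ranges["end"]):
    inside = cpgs[(cpgs["ref_position"] >= start) & (cpgs["ref_position"] < end)]
    rows.append(
        {
            "start": start,
            "end": end,
            "num_mc_sum": inside["num_mc"].sum(),
            "num_mc_mean": inside["num_mc"].mean(),  # NaN for an empty range
            "num_mc_cpg_count": len(inside),
            "num_total_c_sum": inside["num_total_c"].sum(),
        }
    )

summary = pd.DataFrame(rows)
print(summary)
```

The empty range gets a sum and a count of zero and an undefined mean, which is worth keeping in mind when ranges with no CpGs appear in the real output.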

rdr = rdr.assign(
    mean_mc=rdr["num_mc_sum"] / rdr["num_total_c_sum"],
    mean_hmc=rdr["num_hmc_sum"] / rdr["num_total_c_sum"],
    mean_modc=rdr["num_modc_sum"] / rdr["num_total_c_sum"],
)
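Dividing summed counts is not the same as averaging per-CpG fractions. The small worked example below, with made-up counts for three CpGs in one region, shows how far the two can diverge when coverage is uneven.

```python
import numpy as np

# Made-up counts for three CpGs in one region; the third has much higher coverage.
num_mc = np.array([9, 1, 0])
num_total_c = np.array([10, 2, 100])

# Ratio of sums, as in the cell above: every read counts equally.
ratio_of_sums = num_mc.sum() / num_total_c.sum()

# Mean of per-CpG fractions: every CpG counts equally, whatever its coverage.
mean_of_ratios = (num_mc / num_total_c).mean()

print(f"ratio of sums:  {ratio_of_sums:.4f}")
print(f"mean of ratios: {mean_of_ratios:.4f}")
```

The ratio of sums weights each CpG by its coverage, so poorly covered CpGs contribute less noise to the regional estimate. A region with no covered CpGs has a zero denominator, which numpy-style division turns into NaN rather than raising an error.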

Let’s turn this xarray dataset into a pandas dataframe, which will form the base of our feature set.

df = rdr.to_dataframe().reset_index(level="sample_id", drop=True)
df["cpg_count"] = df["num_total_c_cpg_count"]

df_pivot = df.pivot(
    index=["Gene_id", "Gene_name", "contig"],
    columns="Region",
    values=["mean_mc", "mean_hmc", "mean_modc", "cpg_count", "range_length"],
)

df_features = df_pivot.copy()

df_features.columns = [" ".join(col).strip() for col in df_features.columns.values]

# replace white spaces with underscores in features
features = df_features.columns
features = [f.replace(" ", "_") for f in features]
df_features.columns = features

df_features = df_features.reset_index()

The above dataframe is our feature set, containing a series of features (mean 5mC, mean 5hmC, mean modC, CpG count, and region length) for each of the genomic regions of interest, and for each gene.
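The reshaping from one row per (gene, region) to one row per gene can be seen on a toy table. The gene names and values below are made up; only the pivot and the column flattening mirror the cells above.

```python
import pandas as pd

# One row per (gene, region), as produced by converting the reduced dataset.
long_df = pd.DataFrame(
    {
        "Gene_id": ["geneA", "geneA", "geneB", "geneB"],
        "Region": ["500", "1000", "500", "1000"],
        "mean_mc": [0.10, 0.60, 0.80, 0.75],
        "cpg_count": [40, 12, 6, 9],
    }
)

# One row per gene, with a (feature, region) pair of column levels.
wide = long_df.pivot(
    index="Gene_id", columns="Region", values=["mean_mc", "cpg_count"]
)

# Flatten the two column levels into single names such as "mean_mc_500".
wide.columns = ["_".join(col) for col in wide.columns.values]
wide = wide.reset_index()
print(wide.columns.tolist())
```

Because Region holds strings, the region labels are sorted as text, which is why the 1000 columns come before the 500 columns in the feature table.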

Write to file

Finally, we write this feature table to a pickle file, which preserves the data type of each column and is therefore preferred over a more basic text file.

df_features = df_features.reset_index()
df_features.head()

	index	Gene_id	Gene_name	contig	mean_mc_-1000	mean_mc_-500	mean_mc_0	mean_mc_1000	mean_mc_500	mean_hmc_-1000	...	cpg_count_-1000	cpg_count_-500	cpg_count_0	cpg_count_1000	cpg_count_500	range_length_-1000	range_length_-500	range_length_0	range_length_1000	range_length_500
0	0	ENSMUSG00000000001.4	Gnai3	chr3	0.080928	0.003064	0.003045	0.575949	0.296720	0.004567	...	54.0	122.0	98.0	26.0	36.0	1000.0	1000.0	1000.0	1000.0	1000.0
1	1	ENSMUSG00000000003.15	Pbsn	chrX	0.650000	0.753479	0.750700	0.806373	0.760073	0.008333	...	2.0	10.0	14.0	14.0	10.0	1000.0	1000.0	1000.0	1000.0	1000.0
2	2	ENSMUSG00000000028.15	Cdc45	chr16	0.004541	0.000384	0.002900	0.565512	0.233115	0.001090	...	106.0	96.0	60.0	26.0	32.0	1000.0	1000.0	1000.0	1000.0	1000.0
3	3	ENSMUSG00000000037.17	Scml2	chrX	0.707838	0.749503	0.802658	0.764205	0.827795	0.049881	...	18.0	35.0	23.0	17.0	8.0	1000.0	1000.0	1000.0	1000.0	1000.0
4	4	ENSMUSG00000000049.11	Apoh	chr11	0.064476	0.003366	0.000365	0.357172	0.070527	0.009784	...	92.0	165.0	192.0	52.0	137.0	1000.0	1000.0	1000.0	1000.0	1000.0

5 rows × 29 columns

df_features.columns

Index(['index', 'Gene_id', 'Gene_name', 'contig', 'mean_mc_-1000',
       'mean_mc_-500', 'mean_mc_0', 'mean_mc_1000', 'mean_mc_500',
       'mean_hmc_-1000', 'mean_hmc_-500', 'mean_hmc_0', 'mean_hmc_1000',
       'mean_hmc_500', 'mean_modc_-1000', 'mean_modc_-500', 'mean_modc_0',
       'mean_modc_1000', 'mean_modc_500', 'cpg_count_-1000', 'cpg_count_-500',
       'cpg_count_0', 'cpg_count_1000', 'cpg_count_500', 'range_length_-1000',
       'range_length_-500', 'range_length_0', 'range_length_1000',
       'range_length_500'],
      dtype='object')

df_features.to_pickle("features_atacseq.pickle")
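The difference between a pickle and a text file can be checked with a round trip on a toy dataframe. The column names and dtypes below are chosen for illustration only; the real feature table has its own dtypes.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Toy table with a categorical column and a small unsigned integer column.
toy = pd.DataFrame(
    {
        "Gene_id": ["geneA", "geneB"],
        "contig": pd.Categorical(["chr1", "chrX"]),
        "cpg_count_0": np.array([98, 14], dtype="uint16"),
    }
)

with tempfile.TemporaryDirectory() as tmp:
    pickle_path = os.path.join(tmp, "toy.pickle")
    csv_path = os.path.join(tmp, "toy.csv")

    toy.to_pickle(pickle_path)
    toy.to_csv(csv_path, index=False)

    from_pickle = pd.read_pickle(pickle_path)
    from_csv = pd.read_csv(csv_path)

# The pickle round trip keeps the dtypes; the CSV round trip re-infers them.
print(from_pickle.dtypes)
print(from_csv.dtypes)
```

Note that pickle files can execute arbitrary code when loaded, so they should only be read from trusted sources.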

df_features

	index	Gene_id	Gene_name	contig	mean_mc_-1000	mean_mc_-500	mean_mc_0	mean_mc_1000	mean_mc_500	mean_hmc_-1000	...	cpg_count_-1000	cpg_count_-500	cpg_count_0	cpg_count_1000	cpg_count_500	range_length_-1000	range_length_-500	range_length_0	range_length_1000	range_length_500
0	0	ENSMUSG00000000001.4	Gnai3	chr3	0.080928	0.003064	0.003045	0.575949	0.296720	0.004567	...	54.0	122.0	98.0	26.0	36.0	1000.0	1000.0	1000.0	1000.0	1000.0
1	1	ENSMUSG00000000003.15	Pbsn	chrX	0.650000	0.753479	0.750700	0.806373	0.760073	0.008333	...	2.0	10.0	14.0	14.0	10.0	1000.0	1000.0	1000.0	1000.0	1000.0
2	2	ENSMUSG00000000028.15	Cdc45	chr16	0.004541	0.000384	0.002900	0.565512	0.233115	0.001090	...	106.0	96.0	60.0	26.0	32.0	1000.0	1000.0	1000.0	1000.0	1000.0
3	3	ENSMUSG00000000037.17	Scml2	chrX	0.707838	0.749503	0.802658	0.764205	0.827795	0.049881	...	18.0	35.0	23.0	17.0	8.0	1000.0	1000.0	1000.0	1000.0	1000.0
4	4	ENSMUSG00000000049.11	Apoh	chr11	0.064476	0.003366	0.000365	0.357172	0.070527	0.009784	...	92.0	165.0	192.0	52.0	137.0	1000.0	1000.0	1000.0	1000.0	1000.0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
21668	21668	ENSMUSG00000118623.1	AL935121.1	chr2	0.566393	0.590390	0.662259	0.628991	0.681172	0.016393	...	11.0	18.0	22.0	10.0	20.0	1000.0	1000.0	1000.0	1000.0	1000.0
21669	21669	ENSMUSG00000118638.1	AL805980.1	chrX	0.752593	0.802657	0.813023	0.655172	0.740448	0.019259	...	14.0	20.0	20.0	8.0	14.0	1000.0	1000.0	1000.0	1000.0	1000.0
21670	21670	ENSMUSG00000118640.1	AC167036.2	chr1	0.827548	0.801969	0.794483	0.735178	0.796562	0.024242	...	22.0	30.0	34.0	8.0	18.0	1000.0	1000.0	1000.0	1000.0	1000.0
21671	21671	ENSMUSG00000118646.1	AC160405.1	chr10	0.771987	0.793955	0.797990	0.761525	0.779537	0.020630	...	10.0	26.0	30.0	16.0	16.0	1000.0	1000.0	1000.0	1000.0	1000.0
21672	21672	ENSMUSG00000118653.1	AC159819.1	chr9	0.564994	0.304204	0.294077	0.694519	0.649180	0.075257	...	20.0	63.0	57.0	14.0	12.0	1000.0	1000.0	1000.0	1000.0	1000.0

21673 rows × 29 columns

For each of the 21,673 protein-coding genes in the mm10 annotation, this file contains the gene ID, name, and chromosome, as well as information about each sub-region around the TSS (i.e. mean methylation, CpG count, and length of each region).
