Predict Chromatin Accessibility in ES-E14 Cells, Part I: Feature Generation#

This is the first of two notebooks which use 6-base data to predict chromatin accessibility in mouse embryonic stem cells ES-E14. We aim to use methylation data in a simple subset of genomic regions around the TSS, and train a model to predict ATAC-seq data, using a public dataset of accessibility. The regions are based on the GENCODE annotations, for the mm10 mouse genome.

In this notebook, for each of the regions above, we compute a mean 5mC fraction, a mean 5hmC fraction, a record of the number of CpGs in the region (regardless of whether they are methylated or not), and the length of the region. Thus, we capture 4 features per region.

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pyranges as pr
import seaborn as sns

modality contains a set of functions to extract annotations from GENCODE. Note that GENCODE only provides annotations for the Human and Mouse genomes, and the annotation module of modality only supports the hg38 (GRCh38), mm10 (GRCm38) and mm39 (GRCm39) genomes. In this notebook, we use mm10.

from modality.datasets import load_biomodal_dataset
from modality.contig_dataset import ContigDataset
from modality.annotation import (
    get_tss_region,
)
sns.set_theme()
sns.set_style("whitegrid")
biomodal_palette = ["#003B49", "#9CDBD9", "#F87C56", "#C0DF16", "#05868E"]
sns.set_palette(biomodal_palette)

Load the 6-base data#

Our data is stored in a compressed zarr store. modality allows its users to load the zarr store into an object called a ContigDataset, which is a subclass of xarray.Dataset. The ContigDataset gives a useful multidimensional view of our data. In our case, our methylation data (for instance the number of methylated C’s at a given CpG) is represented along two dimensions: the CpG genomic position, and the sample ID.

The function below loads our public mouse dataset for the ES-E14 cell line, using the load_biomodal_dataset function.

The function below does several things:

  • load the data into a ContigDataset called ds.

  • Drop some variables that we won’t need.

  • Because the samples are all technical replicates, we sum over all 4 of them with ds.sum to gain some power.

  • Now that we’ve summed over the samples, ds is a one-dimensional array along the pos dimension. However, internally, modality often has to assume that sample_id is a dimension of ds. So we add it back with ds.expand_dims(dim="sample_id", axis=1), and we give it a name with ds.assign_coords(sample_id=["sample_0"]).

  • Finally we compute methylation fractions with ds.assign_fractions.

def load_data(
    zarr_store_path,
    sum_samples=True,
    merge_contexts=False,
    min_coverage=None,
):
    if zarr_store_path == "esE14":
        ds = load_biomodal_dataset("esE14")
        ds = ds.drop_vars(["Input DNA Quantity (ng/sample)", "tech_replicate_number"])

    else:
        ds = ContigDataset.from_zarrz(zarr_store_path)

    if sum_samples:
        ds = ds.sum(dim="sample_id", keep_attrs=True)
        ds = ds.expand_dims(dim="sample_id", axis=1)
        ds = ds.assign_coords(sample_id=["sample_0"])

    if merge_contexts:
        ds = ds.merge_cpgs()

    if min_coverage is not None:
        ds = ds.subset_bycoverage(min_coverage=min_coverage)

    ds.assign_fractions(
        numerators=["num_modc", "num_mc", "num_hmc"],
        denominator="num_total_c",
        min_coverage=10,
        inplace=True,
    )
    return ds
ds = load_data(
    "esE14", 
    sum_samples=True, 
    merge_contexts=False, 
    min_coverage=10
    )

Below is what our final ContigDataset looks like. It has 26 million CpGs along the pos dimension, and one sample along the sample_id dimension.

print(ds)

<modality.contig_dataset.ContigDataset>
<xarray.Dataset>
Dimensions:       (pos: 24830501, sample_id: 1)
Coordinates:
  * sample_id     (sample_id) <U8 'sample_0'
    group         (sample_id) <U8 'sample_0'
    contig        (pos) <U5 dask.array<chunksize=(100000,), meta=np.ndarray>
    ref_position  (pos) int64 dask.array<chunksize=(100000,), meta=np.ndarray>
    strand        (pos) <U2 dask.array<chunksize=(100000,), meta=np.ndarray>
Dimensions without coordinates: pos
Data variables:
    num_c         (pos, sample_id) uint64 dask.array<chunksize=(100000, 1), meta=np.ndarray>
    num_hmc       (pos, sample_id) uint64 dask.array<chunksize=(100000, 1), meta=np.ndarray>
    num_mc        (pos, sample_id) uint64 dask.array<chunksize=(100000, 1), meta=np.ndarray>
    num_modc      (pos, sample_id) uint64 dask.array<chunksize=(100000, 1), meta=np.ndarray>
    num_other     (pos, sample_id) uint64 dask.array<chunksize=(100000, 1), meta=np.ndarray>
    num_total     (pos, sample_id) uint64 dask.array<chunksize=(100000, 1), meta=np.ndarray>
    num_total_c   (pos, sample_id) uint64 dask.array<chunksize=(100000, 1), meta=np.ndarray>
    frac_modc     (pos, sample_id) float64 dask.array<chunksize=(100000, 1), meta=np.ndarray>
    frac_mc       (pos, sample_id) float64 dask.array<chunksize=(100000, 1), meta=np.ndarray>
    frac_hmc      (pos, sample_id) float64 dask.array<chunksize=(100000, 1), meta=np.ndarray>
Attributes:
    context:            CG
    context_sampling:   1.0
    contigs:            ['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '...
    coordinate_basis:   0
    description:        An evoC dataset of mouse ES-E14 cells, data contains ...
    fasta_path:         GRCm38-ss-ctrls-v23.fa.gz
    input_path:         ['CEG1485-EL01-D1115-001.genome.GRCm38.dedup.duet-mod...
    quant_type:         quant6L
    ref_name:           GRCm38
    sample_ids:         ['CEG1485-EL01-D1115-001', 'CEG1485-EL01-D1115-002', ...
    slice_1:            slice(0, 1686385, 1)
    slice_10:           slice(13470018, 14802757, 1)
    slice_11:           slice(14802757, 16308053, 1)
    slice_12:           slice(16308053, 17405694, 1)
    slice_13:           slice(17405694, 18562567, 1)
    slice_14:           slice(18562567, 19604813, 1)
    slice_15:           slice(19604813, 20653494, 1)
    slice_16:           slice(20653494, 21539973, 1)
    slice_17:           slice(21539973, 22549921, 1)
    slice_18:           slice(22549921, 23411019, 1)
    slice_19:           slice(23411019, 24108652, 1)
    slice_2:            slice(1686385, 3569945, 1)
    slice_3:            slice(3569945, 4912268, 1)
    slice_4:            slice(4912268, 6430526, 1)
    slice_5:            slice(6430526, 8053335, 1)
    slice_6:            slice(8053335, 9406418, 1)
    slice_7:            slice(9406418, 10762044, 1)
    slice_8:            slice(10762044, 12159643, 1)
    slice_9:            slice(12159643, 13470018, 1)
    slice_MT:           slice(24830068, 24830501, 1)
    slice_X:            slice(24108652, 24791522, 1)
    slice_Y:            slice(24791522, 24830068, 1)
    frac_denominator:   num_total_c
    frac_min_coverage:  10

Get regions near the TSS#

We want to use methylation information around the TSS as our input data to predict ATAC-seq. modality has a function called get_tss_region, which uses an annotated genome from GENCODE to identify the region around the TSS. The two arguments start_offset and span allow the user to create regions that are off-centred, and which span a certain length. The positive (resp. negative) value of start_offset indicate that we go downstream (resp. upstream) of the TSS.

starts = [-1000, -500, 0, 500, 1000]
span = 1000
reference_genome = "mm10"
regions_dict = {}
for start in starts:
    print(f"Extracting TSS region for offset: {start} and span: {span}")
    tss = get_tss_region(
        contig=None,
        start=None,
        end=None,
        reference=reference_genome,
        protein_coding=True,
        filterby=None,
        start_offset=start,
        span=span,
        as_pyranges=True,
    )

    regions_dict[start] = tss.unstrand()
    regions_dict[start].Region = str(start)
Extracting TSS region for offset: -1000 and span: 1000
Extracting TSS region for offset: -500 and span: 1000
Extracting TSS region for offset: 0 and span: 1000
Extracting TSS region for offset: 500 and span: 1000
Extracting TSS region for offset: 1000 and span: 1000
ranges = list(regions_dict.values())

Create features with reduce_byranges#

We use a method of modality.ContigDataset called reduce_byranges which allows to reduce our contig dataset to summarise methylation information over a list of genomic ranges (in our case, the ranges that we created above). These ranges should be in the form of pyranges objects. In principle, a user could also pass their own ranges, for instance reading them from a bed or gff file - see pyranges documentation.

rdr = ds.reduce_byranges(
    ranges=ranges,
    var=["num_mc", "num_hmc", "num_modc", "num_total_c"],
    )
rdr
<xarray.Dataset>
Dimensions:                (ranges: 108365, sample_id: 1)
Coordinates:
    contig                 (ranges) <U2 '1' '1' '1' '1' '1' ... 'Y' 'Y' 'Y' 'Y'
    start                  (ranges) int64 4806787 4806891 ... 85528907 89743531
    end                    (ranges) int64 4807787 4807891 ... 85529907 89744531
    range_id               (ranges) int64 0 1 2 3 ... 108362 108363 108364
    num_contexts           (ranges) int64 78 88 78 0 20 0 44 ... 0 0 0 0 0 0 2
    range_length           (ranges) int64 1000 1000 1000 1000 ... 1000 1000 1000
  * sample_id              (sample_id) <U8 'sample_0'
Dimensions without coordinates: ranges
Data variables:
    num_mc_sum             (ranges, sample_id) uint64 dask.array<chunksize=(27092, 1), meta=np.ndarray>
    num_mc_mean            (ranges, sample_id) float64 dask.array<chunksize=(27092, 1), meta=np.ndarray>
    num_mc_cpg_count       (ranges, sample_id) uint64 dask.array<chunksize=(27092, 1), meta=np.ndarray>
    num_hmc_sum            (ranges, sample_id) uint64 dask.array<chunksize=(27092, 1), meta=np.ndarray>
    num_hmc_mean           (ranges, sample_id) float64 dask.array<chunksize=(27092, 1), meta=np.ndarray>
    num_hmc_cpg_count      (ranges, sample_id) uint64 dask.array<chunksize=(27092, 1), meta=np.ndarray>
    num_modc_sum           (ranges, sample_id) uint64 dask.array<chunksize=(27092, 1), meta=np.ndarray>
    num_modc_mean          (ranges, sample_id) float64 dask.array<chunksize=(27092, 1), meta=np.ndarray>
    num_modc_cpg_count     (ranges, sample_id) uint64 dask.array<chunksize=(27092, 1), meta=np.ndarray>
    num_total_c_sum        (ranges, sample_id) uint64 dask.array<chunksize=(27092, 1), meta=np.ndarray>
    num_total_c_mean       (ranges, sample_id) float64 dask.array<chunksize=(27092, 1), meta=np.ndarray>
    num_total_c_cpg_count  (ranges, sample_id) uint64 dask.array<chunksize=(27092, 1), meta=np.ndarray>
    Source                 (ranges) object 'HAVANA' 'HAVANA' ... 'HAVANA'
    Type                   (ranges) object 'gene' 'gene' ... 'gene' 'gene'
    Score                  (ranges) object '.' '.' '.' '.' ... '.' '.' '.' '.'
    Phase                  (ranges) object '.' '.' '.' '.' ... '.' '.' '.' '.'
    Id                     (ranges) object 'ENSMUSG00000025903.14' ... 'ENSMU...
    Gene_id                (ranges) object 'ENSMUSG00000025903.14' ... 'ENSMU...
    Gene_type              (ranges) object 'protein_coding' ... 'protein_coding'
    Gene_name              (ranges) object 'Lypla1' 'Gm37988' ... 'Gm21996'
    Level                  (ranges) object '2' '2' '2' '2' ... '2' '2' '2' '2'
    Mgi_id                 (ranges) object 'MGI:1344588' ... 'MGI:5440224'
    Havana_gene            (ranges) object 'OTTMUSG00000021562.4' ... 'OTTMUS...
    Tag                    (ranges) object 'overlapping_locus' ... ''
    Region                 (ranges) object '-1000' '-1000' ... '1000' '1000'

The outcome of reduce_byranges is an xarray.Dataset object which contains summarised methylation information across a genomic range (mean, sum, and CpG count of the range) for each of the variables that we specified in var (typically the number of modified C’s num_mc, or the methylation fraction frac_mc).

We use the number of mC (or hmc or modC) summed over the range, divided by the total number of C’s summed over that same range to get a mean methylation fraction over the range.

rdr = rdr.assign(
    mean_mc=rdr["num_mc_sum"] / rdr["num_total_c_sum"],
    mean_hmc=rdr["num_hmc_sum"] / rdr["num_total_c_sum"],
    mean_modc=rdr["num_modc_sum"] / rdr["num_total_c_sum"],
)

Let’s turn this xarray dataset into a pandas dataframe, which will form the base of our feature set.

df = rdr.to_dataframe().reset_index(level="sample_id", drop=True)
df["cpg_count"] = df["num_total_c_cpg_count"]
df_pivot = df.pivot(
    index=["Gene_id", "Gene_name", "contig"],
    columns="Region",
    values=["mean_mc", "mean_hmc", "mean_modc", "cpg_count", "range_length"],
)
df_features = df_pivot.copy()
df_features.columns = [" ".join(col).strip() for col in df_features.columns.values]
# replace white spaces with underscores in features
features = df_features.columns
features = [f.replace(" ", "_") for f in features]
df_features.columns = features
df_features = df_features.reset_index()

The above dataframe is our feature set, containing a series of features (mean 5mC, mean 5hmC, mean modC, CpG count, and region length) for each of the genomic regions of interest, and for each gene.

Write to file#

Finally, we write this feature file as a pickle file, which preserves the variable type, and is therefore prefered over a more basic text file.

df_features = df_features.reset_index()
df_features.head()
index Gene_id Gene_name contig mean_mc_-1000 mean_mc_-500 mean_mc_0 mean_mc_1000 mean_mc_500 mean_hmc_-1000 ... cpg_count_-1000 cpg_count_-500 cpg_count_0 cpg_count_1000 cpg_count_500 range_length_-1000 range_length_-500 range_length_0 range_length_1000 range_length_500
0 0 ENSMUSG00000000001.4 Gnai3 3 0.072351 0.003064 0.003045 0.562945 0.296720 0.004368 ... 52.0 122.0 98.0 24.0 36.0 1000.0 1000.0 1000.0 1000.0 1000.0
1 1 ENSMUSG00000000003.15 Pbsn X 0.650000 0.753479 0.750416 NaN 0.734694 0.008333 ... 2.0 10.0 12.0 0.0 2.0 1000.0 1000.0 1000.0 1000.0 1000.0
2 2 ENSMUSG00000000028.15 Cdc45 16 0.004541 0.000480 0.003059 0.661132 0.101471 0.001090 ... 106.0 96.0 60.0 12.0 20.0 1000.0 1000.0 1000.0 1000.0 1000.0
3 3 ENSMUSG00000000037.17 Scml2 X 0.693548 0.749503 0.802658 0.856354 0.849265 0.056452 ... 16.0 35.0 23.0 4.0 6.0 1000.0 1000.0 1000.0 1000.0 1000.0
4 4 ENSMUSG00000000049.11 Apoh 11 0.064476 0.003680 0.000332 0.345097 0.074608 0.009784 ... 92.0 146.0 164.0 48.0 128.0 1000.0 1000.0 1000.0 1000.0 1000.0

5 rows × 29 columns

df_features.columns
Index(['index', 'Gene_id', 'Gene_name', 'contig', 'mean_mc_-1000',
       'mean_mc_-500', 'mean_mc_0', 'mean_mc_1000', 'mean_mc_500',
       'mean_hmc_-1000', 'mean_hmc_-500', 'mean_hmc_0', 'mean_hmc_1000',
       'mean_hmc_500', 'mean_modc_-1000', 'mean_modc_-500', 'mean_modc_0',
       'mean_modc_1000', 'mean_modc_500', 'cpg_count_-1000', 'cpg_count_-500',
       'cpg_count_0', 'cpg_count_1000', 'cpg_count_500', 'range_length_-1000',
       'range_length_-500', 'range_length_0', 'range_length_1000',
       'range_length_500'],
      dtype='object')
df_features.to_pickle("features_atacseq.pickle")
df_features
index Gene_id Gene_name contig mean_mc_-1000 mean_mc_-500 mean_mc_0 mean_mc_1000 mean_mc_500 mean_hmc_-1000 ... cpg_count_-1000 cpg_count_-500 cpg_count_0 cpg_count_1000 cpg_count_500 range_length_-1000 range_length_-500 range_length_0 range_length_1000 range_length_500
0 0 ENSMUSG00000000001.4 Gnai3 3 0.072351 0.003064 0.003045 0.562945 0.296720 0.004368 ... 52.0 122.0 98.0 24.0 36.0 1000.0 1000.0 1000.0 1000.0 1000.0
1 1 ENSMUSG00000000003.15 Pbsn X 0.650000 0.753479 0.750416 NaN 0.734694 0.008333 ... 2.0 10.0 12.0 0.0 2.0 1000.0 1000.0 1000.0 1000.0 1000.0
2 2 ENSMUSG00000000028.15 Cdc45 16 0.004541 0.000480 0.003059 0.661132 0.101471 0.001090 ... 106.0 96.0 60.0 12.0 20.0 1000.0 1000.0 1000.0 1000.0 1000.0
3 3 ENSMUSG00000000037.17 Scml2 X 0.693548 0.749503 0.802658 0.856354 0.849265 0.056452 ... 16.0 35.0 23.0 4.0 6.0 1000.0 1000.0 1000.0 1000.0 1000.0
4 4 ENSMUSG00000000049.11 Apoh 11 0.064476 0.003680 0.000332 0.345097 0.074608 0.009784 ... 92.0 146.0 164.0 48.0 128.0 1000.0 1000.0 1000.0 1000.0 1000.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
21668 21668 ENSMUSG00000118623.1 AL935121.1 2 0.386207 0.577088 0.662259 0.628448 0.680887 0.006897 ... 5.0 16.0 22.0 10.0 20.0 1000.0 1000.0 1000.0 1000.0 1000.0
21669 21669 ENSMUSG00000118638.1 AL805980.1 X 0.752593 0.802657 0.813023 0.655172 0.740448 0.019259 ... 14.0 20.0 20.0 8.0 14.0 1000.0 1000.0 1000.0 1000.0 1000.0
21670 21670 ENSMUSG00000118640.1 AC167036.2 1 0.846839 0.803757 0.795282 0.734275 0.796283 0.017809 ... 12.0 30.0 34.0 8.0 18.0 1000.0 1000.0 1000.0 1000.0 1000.0
21671 21671 ENSMUSG00000118646.1 AC160405.1 10 0.771987 0.793955 0.797990 0.761525 0.779537 0.020630 ... 10.0 26.0 30.0 16.0 16.0 1000.0 1000.0 1000.0 1000.0 1000.0
21672 21672 ENSMUSG00000118653.1 AC159819.1 9 0.376609 0.304145 0.293673 0.692201 0.649180 0.113734 ... 12.0 63.0 57.0 6.0 12.0 1000.0 1000.0 1000.0 1000.0 1000.0

21673 rows × 29 columns

For all 21,673 protein-coding genes in the mm10 genome, this file contains the ID, name, chromosome, start and end, as well as information about each sub-region around the TSS (i.e. mean methylation, cpg count, and length at each region).