Predict Gene Expression Part I: Feature Generation#

This is the first of two notebooks which use 6-base data to predict gene expression in mouse embryonic stem cells ES-E14. We aim to use methylation data in a simple set of genomic regions (upstream regions, around TSSs, gene bodies, first exons and introns, other exons and introns, 5 and 3’ UTRs, downstream regions, and CpG islands), and train a gene expression prediction model, using a public dataset of expression. The regions are based on the Gencode annotations, for the mm10 mouse genome.

In this notebook, for each of the regions above, we compute a mean 5mC fraction, a mean 5hmC fraction, a record the number of CpGs in the region (regardless of whether they are methylated or not), and the length of the region. Thus, we capture 4 features per region.

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pyranges as pr
import seaborn as sns

modality contains a set of functions to extract annotations from GENCODE. Note that GENCODE only provides annotations for the Human and Mouse genomes, and the annotation module of modality only supports the hg38 (GRCh38), mm10 (GRCm38) and mm39 (GRCm39) genomes. In this notebook, we use mm10.

from modality.datasets import load_biomodal_dataset
from modality.annotation import (
    get_genes,
    get_transcription_end_region,
    get_tss_region,
    get_exons,
    get_introns,
    get_five_prime_utrs,
    get_three_prime_utrs,
    get_transcripts,
    get_cpg_islands,
)
sns.set_theme()
sns.set_style("whitegrid")
biomodal_palette = ["#003B49", "#9CDBD9", "#F87C56", "#C0DF16", "#05868E"]
sns.set_palette(biomodal_palette)

Load the data#

Our data is stored in a compressed zarr store. modality allows its users to load the zarr store into an object called a ContigDataset, which is a subclass of xarray.Dataset. The ContigDataset gives a useful multidimensional view of our data. In our case, our methylation data (for instance the number of methylated C’s at a given CpG) is represented along two dimensions: the CpG genomic position, and the sample ID.

The function below loads our public mouse dataset for the ES-E14 cell line, using the load_biomodal_dataset function. Note that a user could decide to load their own zarr store instead, using the more general loader method from_zarrz.

The function below does several things:

  • load the data into a ContigDataset called ds.

  • Drop some variables that we won’t need.

  • Because the samples are all technical replicates, we sum over all 4 of them with ds.sum to gain some power.

  • Now that we’ve summed over the samples, ds is a one-dimensional array along the pos dimension. However, internally, modality often has to assume that sample_id is a dimension of ds. So we add it back with ds.expand_dims(dim="sample_id", axis=1), and we give it a name with ds.assign_coords(sample_id=["sample_0"]).

  • Finally we compute methylation fractions with ds.assign_fractions.

def load_data():
    ds = load_biomodal_dataset("esE14")
    ds = ds.drop_vars(["Input DNA Quantity (ng/sample)", "tech_replicate_number"])  
    ds = ds.sum(dim="sample_id", keep_attrs=True)
    ds = ds.expand_dims(dim="sample_id", axis=1)
    ds = ds.assign_coords(sample_id=["sample_0"])
    ds.assign_fractions(
        numerators=["num_modc", "num_mc", "num_hmc"],
        denominator="num_total_c",
        min_coverage=10,
        inplace=True,
    )
    return ds
ds = load_data()

Below is what our final ContigDataset looks like. It has 26 million CpGs along the pos dimension, and one sample along the sample_id dimension.

print(ds)

<modality.contig_dataset.ContigDataset>
<xarray.Dataset>
Dimensions:       (pos: 26074280, sample_id: 1)
Coordinates:
    contig        (pos) <U5 dask.array<chunksize=(100000,), meta=np.ndarray>
    ref_position  (pos) int64 dask.array<chunksize=(100000,), meta=np.ndarray>
    strand        (pos) <U2 dask.array<chunksize=(100000,), meta=np.ndarray>
  * sample_id     (sample_id) <U8 'sample_0'
    group         (sample_id) <U8 'sample_0'
Dimensions without coordinates: pos
Data variables:
    num_c         (pos, sample_id) uint64 dask.array<chunksize=(100000, 1), meta=np.ndarray>
    num_hmc       (pos, sample_id) uint64 dask.array<chunksize=(100000, 1), meta=np.ndarray>
    num_mc        (pos, sample_id) uint64 dask.array<chunksize=(100000, 1), meta=np.ndarray>
    num_modc      (pos, sample_id) uint64 dask.array<chunksize=(100000, 1), meta=np.ndarray>
    num_other     (pos, sample_id) uint64 dask.array<chunksize=(100000, 1), meta=np.ndarray>
    num_total     (pos, sample_id) uint64 dask.array<chunksize=(100000, 1), meta=np.ndarray>
    num_total_c   (pos, sample_id) uint64 dask.array<chunksize=(100000, 1), meta=np.ndarray>
    frac_modc     (pos, sample_id) float64 dask.array<chunksize=(100000, 1), meta=np.ndarray>
    frac_mc       (pos, sample_id) float64 dask.array<chunksize=(100000, 1), meta=np.ndarray>
    frac_hmc      (pos, sample_id) float64 dask.array<chunksize=(100000, 1), meta=np.ndarray>
Attributes:
    context:            CG
    context_sampling:   1.0
    contigs:            ['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '...
    coordinate_basis:   0
    description:        An evoC dataset of mouse ES-E14 cells, data contains ...
    fasta_path:         GRCm38-ss-ctrls-v23.fa.gz
    input_path:         ['CEG1485-EL01-D1115-001.genome.GRCm38.dedup.duet-mod...
    quant_type:         quant6L
    ref_name:           GRCm38
    sample_ids:         ['CEG1485-EL01-D1115-001', 'CEG1485-EL01-D1115-002', ...
    slice_1:            slice(0, 1737408, 1)
    slice_10:           slice(13946402, 15311424, 1)
    slice_11:           slice(15311424, 16849294, 1)
    slice_12:           slice(16849294, 18005730, 1)
    slice_13:           slice(18005730, 19196682, 1)
    slice_14:           slice(19196682, 20320810, 1)
    slice_15:           slice(20320810, 21410730, 1)
    slice_16:           slice(21410730, 22319618, 1)
    slice_17:           slice(22319618, 23386164, 1)
    slice_18:           slice(23386164, 24278174, 1)
    slice_19:           slice(24278174, 25010342, 1)
    slice_2:            slice(1737408, 3662080, 1)
    slice_3:            slice(3662080, 5043804, 1)
    slice_4:            slice(5043804, 6647958, 1)
    slice_5:            slice(6647958, 8321048, 1)
    slice_6:            slice(8321048, 9714434, 1)
    slice_7:            slice(9714434, 11153720, 1)
    slice_8:            slice(11153720, 12605040, 1)
    slice_9:            slice(12605040, 13946402, 1)
    slice_MT:           slice(26073706, 26074280, 1)
    slice_X:            slice(25010342, 25760674, 1)
    slice_Y:            slice(25760674, 26073706, 1)
    frac_denominator:   num_total_c
    frac_min_coverage:  10

Load genes#

Now it is time to start loading genomic annotations. First we load the genes from GENCODE, focusing only on HAVANA protein-coding genes for the mm10 mouse reference genome.

gene_filter = {
    "gene_type": "protein_coding",
    "source": "HAVANA",
}

genes = get_genes(
    reference="mm10",
    as_pyranges=True,
    filterby=gene_filter,
)

genes = genes.unstrand()
genes.head(5)
Chromosome Source Type Start End Score Phase Id Gene_id Gene_type Gene_name Level Mgi_id Havana_gene Tag Ranges_ID
0 1 HAVANA gene 4807787 4848409 . . ENSMUSG00000025903.14 ENSMUSG00000025903.14 protein_coding Lypla1 2 MGI:1344588 OTTMUSG00000021562.4 overlapping_locus 0
1 1 HAVANA gene 4807891 4886769 . . ENSMUSG00000104217.1 ENSMUSG00000104217.1 protein_coding Gm37988 2 MGI:5611216 OTTMUSG00000050100.1 overlapping_locus 1
2 1 HAVANA gene 4857813 4897908 . . ENSMUSG00000033813.15 ENSMUSG00000033813.15 protein_coding Tcea1 2 MGI:1196624 OTTMUSG00000042348.1 overlapping_locus 2
3 1 HAVANA gene 5070017 5162528 . . ENSMUSG00000033793.12 ENSMUSG00000033793.12 protein_coding Atp6v1h 2 MGI:1914864 OTTMUSG00000050145.9 3
4 1 HAVANA gene 5588465 5606130 . . ENSMUSG00000025905.14 ENSMUSG00000025905.14 protein_coding Oprk1 2 MGI:97439 OTTMUSG00000034734.3 4

Identify relevant transcript for each gene#

For each gene, we also need to extract the most relevant transcript. This is because some of the genomic annotations we’ll want to extract later are defined at a per-transcript level instead of a per-gene level (exons, introns, 5’ UTRs, and 3’ UTRs).

transcripts = get_transcripts(
        reference="mm10",
        contig=None,
        start=None,
        end=None,
        as_pyranges=False,
    )
def select_transcript_based_on_tag(df):
    # for each transcript in df, select the one with the highest priority tag
    # priorities are:
        # 1. 'basic,appris_principal_1,CCDS'
        # 2. 'basic,appris_principal_1'
        # 3. 'basic,CCDS'
        # 4. 'basic'
    # but with 'exp_conf' (experimentally confirmed) tag, the priority is higher.
    
    priorties = {
        'basic,appris_principal_1,exp_conf,CCDS': 1,
        'basic,appris_principal_1,CCDS': 1,
        'basic,appris_principal_1,exp_conf': 3,
        'basic,appris_principal_1': 4,
        'basic,exp_conf,CCDS': 5,
        'basic,CCDS': 6,
        'basic,exp_conf': 7,
        'basic': 8
    }

    # sort the dataframe by the priority of the tags
    df['tag_priority'] = df.tag.map(priorties)

    df = df.sort_values(by='tag_priority')

    # drop duplicates, keeping the first one
    df = df.drop_duplicates(subset='gene_id', keep='first')

    return df[["gene_id", "transcript_id"]]
selected_transcripts = transcripts.groupby('gene_id').apply(
    select_transcript_based_on_tag
    ).reset_index(drop=True)
selected_transcripts.head()
gene_id transcript_id
0 ENSMUSG00000000001.4 ENSMUST00000000001.4
1 ENSMUSG00000000003.15 ENSMUST00000000003.13
2 ENSMUSG00000000028.15 ENSMUST00000000028.13
3 ENSMUSG00000000037.17 ENSMUST00000101113.8
4 ENSMUSG00000000049.11 ENSMUST00000000049.5

Get the upstream region#

We define the upstream region as the region 2kb upstream of the TSS, split into subregions. This region is important as it is likely to contain the promoter.

There is some granularity in the upstream region, which we don’t want to wash out by summarising over a region that would be too large. In particular, promoters are small regions located upstream of the gene, which play a fundamental role in transcription. They may be core promoters very close to the TSS, proximal promoters within a few hundred bp, or more distal promoters, further from the TSS.

Here we’re going to use modality’s get_tss function, which contains a start_offset argument. A negative value, for instance -500, means that we start from 500bp upstream of the TSS. The span argument determines the length of the region we are interested. So start_offset=-500 and span=250 will grab the region from 500 to 250bp upstream of the TSS. Here we include 5 regions, from 2kb to 1.5kb, from 1.5kb to 1kb, from 1kb to 500bp, from 500bp to 250bp, and from 250bp to 0 (the TSS). We chose these to get a bit more granularity near the TSS.

distances = [-2000, -1500, -1000, -500, -250]
spans = np.diff(distances+[0])
default_args = {
    "contig": None,
    "start": None,
    "end": None,
    "reference": "mm10",
    "as_pyranges": True,
    "protein_coding": True,
    "filterby": None,
}
upstream_dict = {}
for distance, span in zip(distances, spans):
    upstream = get_tss_region(
        start_offset=distance, 
        span=span,
        **default_args,
        )
    upstream = upstream.unstrand()
    upstream_dict[f"upstream_{-distance}"] = upstream
upstream_dict["upstream_250"].head()
Chromosome Source Type Start End Score Phase Id Gene_id Gene_type Gene_name Level Mgi_id Havana_gene Tag Ranges_ID
0 1 HAVANA gene 4807537 4807787 . . ENSMUSG00000025903.14 ENSMUSG00000025903.14 protein_coding Lypla1 2 MGI:1344588 OTTMUSG00000021562.4 overlapping_locus 0
1 1 HAVANA gene 4807641 4807891 . . ENSMUSG00000104217.1 ENSMUSG00000104217.1 protein_coding Gm37988 2 MGI:5611216 OTTMUSG00000050100.1 overlapping_locus 1
2 1 HAVANA gene 4857563 4857813 . . ENSMUSG00000033813.15 ENSMUSG00000033813.15 protein_coding Tcea1 2 MGI:1196624 OTTMUSG00000042348.1 overlapping_locus 2
3 1 HAVANA gene 5069767 5070017 . . ENSMUSG00000033793.12 ENSMUSG00000033793.12 protein_coding Atp6v1h 2 MGI:1914864 OTTMUSG00000050145.9 3
4 1 HAVANA gene 5588215 5588465 . . ENSMUSG00000025905.14 ENSMUSG00000025905.14 protein_coding Oprk1 2 MGI:97439 OTTMUSG00000034734.3 4
5 1 HAVANA gene 6205946 6206196 . . ENSMUSG00000025907.14 ENSMUSG00000025907.14 protein_coding Rb1cc1 2 MGI:1341850 OTTMUSG00000033467.12 5
6 1 HAVANA gene 6358967 6359217 . . ENSMUSG00000087247.3 ENSMUSG00000087247.3 protein_coding Alkal1 2 MGI:3645495 OTTMUSG00000050239.2 overlapping_locus 6
7 1 HAVANA gene 6486980 6487230 . . ENSMUSG00000033740.17 ENSMUSG00000033740.17 protein_coding St18 1 MGI:2446700 OTTMUSG00000024833.6 7

Get the region around TSS#

We define the TSS region as 250bp each side of the TSS. Note that this region will overlap with the upstream region above, but also capture relevant information for 250bp downstream of the TSS, inside the gene region.

around_tss = get_tss_region(
    start_offset=-200,
    span=400,
    **default_args,
)

around_tss = around_tss.unstrand()

Get the downstream region#

We define the downstream region as a series of subregions of 1kb length, downstream of the gene body.

distances = np.arange(0, 5000, 1000)

downstream_dict = {}
for distance in distances:
    downstream = get_transcription_end_region(
        span=1000,
        start_offset=distance,
        **default_args,
        )
    downstream = downstream.unstrand()
    downstream_dict[f"downstream_{distance}"] = downstream

Get exons#

Next we grab all the exons. This will return all the exons for all the genes in the GENCODE annotation. But in a previous cell, we have identified a list of selected transcripts, so we subset our exons to that list of transcripts.

exons = get_exons(reference="mm10")
exons = exons[exons.Transcript_id.isin(selected_transcripts.transcript_id)]

Get the first exon#

Exons in GENCODE have an Exon_number field which gives the position of the exon in the transcript from its 5’ end. Therefore we can directly pull out the first exon of each transcript as a standalone annotation.

first_exons = exons[exons.Exon_number == "1"]

Now that we have the first exon, we can remove it from the other exons feature.

exons = exons[exons.Exon_number.astype("int") > 1]

Get introns#

Introns are not readily available in GENCODE annotations, but we can use modality’s get_introns function to collect these. This function collects all exons, 5’ UTRS, 3’UTRs and groups them by transcript ID. It then matches the transcript ID with a gene ID, and grabs that gene from the gene list. Finally it subtracts the exons, 5’ UTRS and 3’UTRs from the gene, leaving a set of introns per transcript.

introns = get_introns(
    reference="mm10",
    transcripts=selected_transcripts.transcript_id.values,
    nb_workers=8,
)                                                                                                                                   

Get the first intron#

Similarly to the first exon, we can also grab the first intron, as get_introns() returns a field with the order in which the introns appear from the 5’ end.

first_introns = introns[introns.Intron_number == "1"]

Now that we have the first intron, we can remove it from the other intron feature.

introns = introns[introns.Intron_number.astype("int") > 1]
first_introns.head()
Transcript_id Level_1 Chromosome Source Type Start End Score Strand Phase ... Gene_id Gene_type Gene_name Level Mgi_id Havana_gene Tag Ranges_id Intron_number Ranges_ID
0 ENSMUST00000000514.10 0 1 HAVANA intron 107590286 107597459 . + . ... ENSMUSG00000026315.13 protein_coding Serpinb8 2 MGI:894657 OTTMUSG00000020994.2 317.0 1 0
1 ENSMUST00000001027.6 0 1 HAVANA intron 58030057 58041435 . + . ... ENSMUSG00000063558.4 protein_coding Aox1 2 MGI:88035 OTTMUSG00000029784.3 117.0 1 7
2 ENSMUST00000001166.13 0 1 HAVANA intron 36513152 36518951 . + . ... ENSMUSG00000001138.13 protein_coding Cnnm3 2 MGI:2151055 OTTMUSG00000021763.2 57.0 1 41
3 ENSMUST00000004829.12 0 1 HAVANA intron 171559384 171573680 . + . ... ENSMUSG00000004709.14 protein_coding Cd244a 2 MGI:109294 OTTMUSG00000016914.2 510.0 1 48
4 ENSMUST00000005824.11 0 1 HAVANA intron 171225053 171225081 . + . ... ENSMUSG00000005681.12 protein_coding Apoa2 2 MGI:88050 OTTMUSG00000021666.2 499.0 1 57
5 ENSMUST00000005907.11 0 1 HAVANA intron 165788680 165788688 . + . ... ENSMUSG00000005763.15 protein_coding Cd247 2 MGI:88334 OTTMUSG00000034863.5 overlapping_locus 480.0 1 62
6 ENSMUST00000006716.7 0 1 HAVANA intron 74772221 74782101 . + . ... ENSMUSG00000033227.7 protein_coding Wnt6 2 MGI:98960 OTTMUSG00000048253.1 179.0 1 71
7 ENSMUST00000006718.14 0 1 HAVANA intron 74792282 74793362 . + . ... ENSMUSG00000026167.14 protein_coding Wnt10a 2 MGI:108071 OTTMUSG00000020879.2 180.0 1 74

8 rows × 21 columns

Get 5’ UTRs#

We also grab all the 5’ UTRs that are in the transcripts of interest.

five_prime_utrs = get_five_prime_utrs(reference="mm10")
five_prime_utrs = five_prime_utrs[
    five_prime_utrs.Transcript_id.isin(selected_transcripts.transcript_id)
    ]

Get 3’ UTR#

And we do the same for all the 3’ UTRs that are in the transcripts of interest.

three_prime_utrs = get_three_prime_utrs(reference="mm10")
three_prime_utrs = three_prime_utrs[
    three_prime_utrs.Transcript_id.isin(selected_transcripts.transcript_id)
    ]

Get CpG Islands#

CpG islands are regions that contain an elevated number of CpGs and tend to have low overall methylation. They are often associated with promoters and can regulate gene expression (see, e.g., Deaton & Bird 2011). modality has a get_cpg_islands which gets CpG islands from the UCSC Genome Browser.

Note that when loading the mouse CpG islands from UCSC, the chromosomes are called “chr1”, “chr2”, etc…, while GENCODE mouse chromosomes are called “1”, “2”, etc…, so we need a quick function to fix that.

def modify_chrom_series(df):
    df.Chromosome = df.Chromosome.apply(lambda val: val.replace("chr", ""))
    return df

def fix_chrom(regions):
    return regions.apply(modify_chrom_series)
cpg_islands = get_cpg_islands(reference="mm10")
cpg_islands = fix_chrom(cpg_islands)
cpg_islands.head()
Index Bin Chromosome Start End Length Cpgnum Gcnum Percpg Pergc Obsexp Type Ranges_ID
0 4 611 1 3531624 3531843 219 27 167 24.7 76.3 0.86 cpg_island 0
1 5 613 1 3670619 3671074 455 34 293 14.9 64.4 0.72 cpg_island 1
2 6 613 1 3671654 3672156 502 45 348 17.9 69.3 0.75 cpg_island 2
3 7 619 1 4491701 4493673 1972 165 1286 16.7 65.2 0.79 cpg_island 3
4 8 619 1 4496947 4497608 661 47 393 14.2 59.5 0.81 cpg_island 4
5 9 619 1 4571641 4572075 434 44 265 20.3 61.1 1.09 cpg_island 5
6 10 620 1 4689184 4689397 213 24 149 22.5 70.0 0.96 cpg_island 6
7 11 621 1 4785376 4785814 438 49 289 22.4 66.0 1.04 cpg_island 7

Now that we have the CpG islands, we need to match them to genes. Here, as a first approximation, we use pyranges.nearest to find the CpG island nearest to each gene.

cpg_islands_by_genes = genes.nearest(cpg_islands, suffix="_island")
cpg_islands_by_genes.Start = cpg_islands_by_genes.Start_island
cpg_islands_by_genes.End = cpg_islands_by_genes.End_island
cpg_islands_by_genes
Chromosome Source Type Start End Score Phase Id Gene_id Gene_type ... End_island Length Cpgnum Gcnum Percpg Pergc Obsexp Type_island Ranges_ID_island Distance
0 1 HAVANA gene 4807559 4808103 . . ENSMUSG00000025903.14 ENSMUSG00000025903.14 protein_coding ... 4808103 544 73 367 26.8 67.5 1.18 cpg_island 8 0
1 1 HAVANA gene 4807559 4808103 . . ENSMUSG00000104217.1 ENSMUSG00000104217.1 protein_coding ... 4808103 544 73 367 26.8 67.5 1.18 cpg_island 8 0
2 1 HAVANA gene 4857465 4858372 . . ENSMUSG00000033813.15 ENSMUSG00000033813.15 protein_coding ... 4858372 907 83 545 18.3 60.1 1.01 cpg_island 9 0
3 1 HAVANA gene 5083039 5083536 . . ENSMUSG00000033793.12 ENSMUSG00000033793.12 protein_coding ... 5083536 497 49 329 19.7 66.2 0.90 cpg_island 11 0
4 1 HAVANA gene 6214430 6215332 . . ENSMUSG00000025907.14 ENSMUSG00000025907.14 protein_coding ... 6215332 902 107 645 23.7 71.5 0.93 cpg_island 13 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
21668 Y HAVANA gene 90741077 90744975 . . ENSMUSG00000099856.1 ENSMUSG00000099856.1 protein_coding ... 90744975 3898 251 2210 12.9 56.7 0.84 cpg_island 15998 3165388
21669 Y HAVANA gene 90741077 90744975 . . ENSMUSG00000101915.1 ENSMUSG00000101915.1 protein_coding ... 90744975 3898 251 2210 12.9 56.7 0.84 cpg_island 15998 2661585
21670 Y HAVANA gene 90741077 90744975 . . ENSMUSG00000102045.1 ENSMUSG00000102045.1 protein_coding ... 90744975 3898 251 2210 12.9 56.7 0.84 cpg_island 15998 1662064
21671 Y HAVANA gene 90741077 90744975 . . ENSMUSG00000100608.1 ENSMUSG00000100608.1 protein_coding ... 90744975 3898 251 2210 12.9 56.7 0.84 cpg_island 15998 995547
21672 Y HAVANA gene 90741077 90744975 . . ENSMUSG00000096178.7 ENSMUSG00000096178.7 protein_coding ... 90744975 3898 251 2210 12.9 56.7 0.84 cpg_island 15998 307816

21673 rows × 29 columns

Create features with reduce_byranges#

We use a method of modality.ContigDataset called reduce_byranges which allows to reduce our contig dataset to summarise methylation information over a list of genomic ranges (in our case, the ranges that we created above). These ranges should be in the form of pyranges objects. In principle, a user could also pass their own ranges, for instance reading them from a bed or gff file - see pyranges documentation.

regions_dict = upstream_dict.copy()
regions_dict.update(
    {
        "around_tss": around_tss,
        "genes": genes,
        "first_exons": first_exons,
        "first_introns": first_introns,
        "cpg_islands": cpg_islands_by_genes,
    }
)
regions_dict.update(downstream_dict)
to_drop = ["Score", "Source", "Phase", "Type"]

for region in regions_dict:
    regions_dict[region].Region = region
    try:
        regions_dict[region] = regions_dict[region].drop(to_drop)
    except:
        pass
rdr = ds.reduce_byranges(
    ranges=list(regions_dict.values()), 
    var=["num_mc", "num_hmc", "num_modc", "num_total_c"]
    )

The outcome of reduce_byranges is an xarray.Dataset object which contains summarised methylation information across a genomic range (mean, sum, and CpG count of the range) for each of the variables that we specified in var (typically the number of modified C’s num_mc, or the methylation fraction frac_mc).

# Compute mean methylation levels as mc / total_c
rdr = rdr.assign(
    mean_mc = rdr["num_mc_sum"] / rdr["num_total_c_sum"],
    mean_hmc = rdr["num_hmc_sum"] / rdr["num_total_c_sum"],
    mean_modc = rdr["num_modc_sum"] / rdr["num_total_c_sum"],   
)
rdr
<xarray.Dataset>
Dimensions:                   (ranges: 323688, sample_id: 1)
Coordinates:
    contig                    (ranges) <U2 '1' '1' '1' '1' ... 'Y' 'Y' 'Y' 'Y'
    start                     (ranges) int64 4805787 4805891 ... 89708423
    end                       (ranges) int64 4806287 4806391 ... 89709423
    range_id                  (ranges) int64 0 1 2 3 ... 323685 323686 323687
    num_contexts              (ranges) int64 0 0 10 0 0 6 0 4 ... 4 2 0 0 0 0 4
    range_length              (ranges) int64 500 500 500 500 ... 1000 1000 1000
  * sample_id                 (sample_id) <U8 'sample_0'
Dimensions without coordinates: ranges
Data variables:
    num_mc_sum                (ranges, sample_id) uint64 dask.array<chunksize=(40461, 1), meta=np.ndarray>
    num_mc_mean               (ranges, sample_id) float64 dask.array<chunksize=(40461, 1), meta=np.ndarray>
    num_mc_cpg_count          (ranges, sample_id) uint64 dask.array<chunksize=(40461, 1), meta=np.ndarray>
    num_hmc_sum               (ranges, sample_id) uint64 dask.array<chunksize=(40461, 1), meta=np.ndarray>
    num_hmc_mean              (ranges, sample_id) float64 dask.array<chunksize=(40461, 1), meta=np.ndarray>
    num_hmc_cpg_count         (ranges, sample_id) uint64 dask.array<chunksize=(40461, 1), meta=np.ndarray>
    num_modc_sum              (ranges, sample_id) uint64 dask.array<chunksize=(40461, 1), meta=np.ndarray>
    num_modc_mean             (ranges, sample_id) float64 dask.array<chunksize=(40461, 1), meta=np.ndarray>
    num_modc_cpg_count        (ranges, sample_id) uint64 dask.array<chunksize=(40461, 1), meta=np.ndarray>
    num_total_c_sum           (ranges, sample_id) uint64 dask.array<chunksize=(40461, 1), meta=np.ndarray>
    num_total_c_mean          (ranges, sample_id) float64 dask.array<chunksize=(40461, 1), meta=np.ndarray>
    num_total_c_cpg_count     (ranges, sample_id) uint64 dask.array<chunksize=(40461, 1), meta=np.ndarray>
    Id                        (ranges) object 'ENSMUSG00000025903.14' ... 'EN...
    Gene_id                   (ranges) object 'ENSMUSG00000025903.14' ... 'EN...
    Gene_type                 (ranges) object 'protein_coding' ... 'protein_c...
    Gene_name                 (ranges) object 'Lypla1' 'Gm37988' ... 'Gm21996'
    Level                     (ranges) object '2' '2' '2' '2' ... '2' '2' '2'
    Mgi_id                    (ranges) object 'MGI:1344588' ... 'MGI:5440224'
    Havana_gene               (ranges) object 'OTTMUSG00000021562.4' ... 'OTT...
    Tag                       (ranges) object 'overlapping_locus' ... ''
    Region                    (ranges) object 'upstream_2000' ... 'downstream...
    Strand                    (ranges) object '.' '.' '.' '.' ... '.' '.' '.'
    Parent                    (ranges) object nan nan nan nan ... nan nan nan
    Transcript_id             (ranges) object nan nan nan nan ... nan nan nan
    Transcript_type           (ranges) object nan nan nan nan ... nan nan nan
    Transcript_name           (ranges) object nan nan nan nan ... nan nan nan
    Exon_number               (ranges) object nan nan nan nan ... nan nan nan
    Exon_id                   (ranges) object nan nan nan nan ... nan nan nan
    Transcript_support_level  (ranges) object nan nan nan nan ... nan nan nan
    Havana_transcript         (ranges) object nan nan nan nan ... nan nan nan
    Protein_id                (ranges) object nan nan nan nan ... nan nan nan
    Ccdsid                    (ranges) object nan nan nan nan ... nan nan nan
    Ont                       (ranges) object nan nan nan nan ... nan nan nan
    Level_1                   (ranges) float64 nan nan nan nan ... nan nan nan
    Ranges_id                 (ranges) float64 nan nan nan nan ... nan nan nan
    Intron_number             (ranges) object nan nan nan nan ... nan nan nan
    Index                     (ranges) float64 nan nan nan nan ... nan nan nan
    Bin                       (ranges) float64 nan nan nan nan ... nan nan nan
    Start_island              (ranges) float64 nan nan nan nan ... nan nan nan
    End_island                (ranges) float64 nan nan nan nan ... nan nan nan
    Length                    (ranges) float64 nan nan nan nan ... nan nan nan
    Cpgnum                    (ranges) float64 nan nan nan nan ... nan nan nan
    Gcnum                     (ranges) float64 nan nan nan nan ... nan nan nan
    Percpg                    (ranges) float64 nan nan nan nan ... nan nan nan
    Pergc                     (ranges) float64 nan nan nan nan ... nan nan nan
    Obsexp                    (ranges) float64 nan nan nan nan ... nan nan nan
    Type_island               (ranges) object nan nan nan nan ... nan nan nan
    Ranges_ID_island          (ranges) float64 nan nan nan nan ... nan nan nan
    Distance                  (ranges) float64 nan nan nan nan ... nan nan nan
    mean_mc                   (ranges, sample_id) float64 dask.array<chunksize=(40461, 1), meta=np.ndarray>
    mean_hmc                  (ranges, sample_id) float64 dask.array<chunksize=(40461, 1), meta=np.ndarray>
    mean_modc                 (ranges, sample_id) float64 dask.array<chunksize=(40461, 1), meta=np.ndarray>

Add features that require patching#

For each transcript that we select, there are multiple exons and introns. We need to patch them together to have one feature per gene.

def get_mean_across_region(df, var):
    sum_var = df[f"{var}_sum"].sum()
    sum_total_c = df["num_total_c_sum"].sum()
    return sum_var / sum_total_c
def summarise_across_region(rdr, region):
    """
    Summarise the methylation data across a region.

    Parameters
        rdr: pyranges object
        region: str

    Returns
        grouped_df: pd.DataFrame
    """

    df = rdr.to_dataframe().reset_index(level="sample_id", drop=True)
    grouped = df.groupby("Gene_id")

    mean_mc = grouped.apply(get_mean_across_region, var="num_mc")
    mean_hmc = grouped.apply(get_mean_across_region, var="num_hmc")
    mean_modc = grouped.apply(get_mean_across_region, var="num_modc")
    cpg_count = grouped["num_total_c_cpg_count"].sum()
    range_length = grouped["range_length"].sum()
    gene_name = grouped["Gene_name"].first()
    contig = grouped["contig"].first()
    
    grouped_df = pd.DataFrame(
        {
            "mean_mc": mean_mc,
            "mean_hmc": mean_hmc,
            "mean_modc": mean_modc,
            "cpg_count": cpg_count,
            "range_length": range_length,
            "Gene_name": gene_name,
            "contig": contig,
        }
    )

    grouped_df = grouped_df.reset_index()
    grouped_df["Region"] = region
    return grouped_df
dict_regions_to_patch = {
    "exons": exons,
    "introns": introns,
    "five_prime_utrs": five_prime_utrs,
    "three_prime_utrs": three_prime_utrs,
}
rdr_dict = {}
for region, gr in dict_regions_to_patch.items():
    rdr_dict[region] = ds.reduce_byranges(
        ranges=gr.unstrand(), 
        var=["num_mc", "num_hmc", "num_modc", "num_total_c"]
    )
dict_df = {}
for region, gr in dict_regions_to_patch.items():
    dict_df[region] = summarise_across_region(rdr_dict[region], region)
df_patched_regions = pd.concat(dict_df.values())
df_patched_regions.head()
Gene_id mean_mc mean_hmc mean_modc cpg_count range_length Gene_name contig Region
0 ENSMUSG00000000001.4 0.808834 0.025974 0.847796 102 2995 Gnai3 3 exons
1 ENSMUSG00000000003.15 0.760580 0.019347 0.793229 14 681 Pbsn X exons
2 ENSMUSG00000000028.15 0.569449 0.031517 0.608647 120 1955 Cdc45 16 exons
3 ENSMUSG00000000037.17 0.762484 0.016005 0.789052 66 2777 Scml2 X exons
4 ENSMUSG00000000049.11 0.788561 0.031635 0.834495 56 1068 Apoh 11 exons

Putting all the features together#

Now for each region we have a mean mC fraction, a mean hmC fraction, a mean modC fraction, as well as the number of CpGs and the length of the region. We can concatenate these results together to have a big dataframe containing all the information we need.

df = rdr.to_dataframe().reset_index(level="sample_id", drop=True)
df["cpg_count"] = df["num_total_c_cpg_count"]
df = df[df_patched_regions.columns].reset_index(drop=True)
df = pd.concat([df, df_patched_regions]).reset_index()

Plots#

We can take this opportunity to look at the distribution of the methylation fraction in all the genomic regions we are interested in.

column_order = [
    'upstream_2000', 'upstream_1000', 'upstream_500', 'upstream_250', 
    'around_tss', 'five_prime_utrs', 'first_exons', 'first_introns', 
    'exons', 'introns', 'three_prime_utrs', 'genes', 'downstream_1000',
    'downstream_2000', 'downstream_3000', 'downstream_4000', 'cpg_islands',
    ]
g = sns.violinplot(
    data=df,
    x="Region",
    y="mean_mc",
    hue="Region",
    order=column_order,
    cut=0,
    palette=biomodal_palette,
    saturation=1,
)
g.set_ylabel("Mean 5mC Fraction")
g.set_xticklabels(g.get_xticklabels(), rotation=45, ha="right")
plt.show()
_images/3afd9c2d7e6c84faa2eed56ca127f0671db87d19ca65fb11831652ee3d61b325.png

We see that overall, as we approach the TSS from the upstream regions, the methylation fraction starts to decrease. Each of the subregions of the genome have their own methylation levels, and it slowly increases in the downstream regions to go back to mean background level similar to that of the far-away upstream regions. CpG islands have near-zero methylation levels, as expected.

We can do the same plot for the 5hmC levels. Overall the levels of 5hmC in the genome are much lower than that of 5mC, but a similar trend can be observed.

g = sns.violinplot(
    data=df,
    x="Region",
    y="mean_hmc",
    hue="Region",
    order=column_order,
    cut=0,
    palette=biomodal_palette,
    saturation=1,
)
g.set_ylim(0, 0.1)
g.set_ylabel("Mean 5hmC Fraction")
g.set_xticklabels(g.get_xticklabels(), rotation=45, ha="right")
plt.show()
_images/4e2bc5cae798107d42424bdbbbed25369c09266d7529760367db1255b2e5e6d4.png

Create a final feature set#

We want a dataframe containing one line per gene, and where each column is a specific feature (e.g. mean mC fraction at exons, or number of CpGs in the first introns). We want the data to be in this format so that we can easily match it to datasets of gene expression, which will have one expression value per gene.

df_pivot = df.pivot(
    index=["Gene_id", "Gene_name", "contig"],
    # index = ["Transcript_id", "contig"],
    columns="Region",
    values=["mean_mc", "mean_hmc", "mean_modc", "cpg_count", "range_length"],
)
df_pivot.head()
mean_mc ... range_length
Region around_tss cpg_islands downstream_0 downstream_1000 downstream_2000 downstream_3000 downstream_4000 exons first_exons first_introns ... first_introns five_prime_utrs genes introns three_prime_utrs upstream_1000 upstream_1500 upstream_2000 upstream_250 upstream_500
Gene_id Gene_name contig
ENSMUSG00000000001.4 Gnai3 3 0.000474 0.000370 0.748425 0.807040 0.706579 0.756383 0.800728 0.808834 0.000240 0.581343 ... 22051.0 140.0 38866.0 13562.0 2054.0 500.0 500.0 500.0 250.0 250.0
ENSMUSG00000000003.15 Pbsn X 0.726073 0.001538 0.631579 0.700000 NaN NaN 0.763359 0.760580 0.726073 0.743142 ... 5295.0 139.0 15722.0 9532.0 235.0 500.0 500.0 500.0 250.0 250.0
ENSMUSG00000000028.15 Cdc45 16 0.000515 0.001563 0.780788 0.664640 0.073536 0.012273 0.135029 0.569449 0.001155 NaN ... 15.0 311.0 31540.0 29402.0 127.0 500.0 500.0 500.0 250.0 250.0
ENSMUSG00000000037.17 Scml2 X 0.738338 0.649756 0.783784 0.745455 0.711864 0.762791 0.532710 0.762484 0.742297 0.730605 ... 34668.0 NaN 175688.0 138112.0 NaN 500.0 500.0 500.0 250.0 250.0
ENSMUSG00000000049.11 Apoh 11 0.003213 0.002429 0.815992 0.812207 NaN 0.754902 0.489796 0.788561 0.826833 0.549999 ... 51939.0 50.0 71042.0 17921.0 100.0 500.0 500.0 500.0 250.0 250.0

5 rows × 95 columns

df_features = df_pivot.copy()
df_features.columns = [" ".join(col).strip() for col in df_features.columns.values]
features = df_features.columns
# replace white spaces with underscores in features
features = [f.replace(" ", "_") for f in features]
df_features.columns = features

The above dataframe is our feature set, containing a series of features (mean 5mC, mean 5hmC, mean modC, CpG count, and region length) for each of the genomic regions of interest, and for each gene.

Write to file#

Finally, we write this feature file as a pickle file, which preserves the variable type, and is therefore prefered over a more basic text file.

df_features = df_features.reset_index()
df_features.head()
Gene_id Gene_name contig mean_mc_around_tss mean_mc_cpg_islands mean_mc_downstream_0 mean_mc_downstream_1000 mean_mc_downstream_2000 mean_mc_downstream_3000 mean_mc_downstream_4000 ... range_length_first_introns range_length_five_prime_utrs range_length_genes range_length_introns range_length_three_prime_utrs range_length_upstream_1000 range_length_upstream_1500 range_length_upstream_2000 range_length_upstream_250 range_length_upstream_500
0 ENSMUSG00000000001.4 Gnai3 3 0.000474 0.000370 0.748425 0.807040 0.706579 0.756383 0.800728 ... 22051.0 140.0 38866.0 13562.0 2054.0 500.0 500.0 500.0 250.0 250.0
1 ENSMUSG00000000003.15 Pbsn X 0.726073 0.001538 0.631579 0.700000 NaN NaN 0.763359 ... 5295.0 139.0 15722.0 9532.0 235.0 500.0 500.0 500.0 250.0 250.0
2 ENSMUSG00000000028.15 Cdc45 16 0.000515 0.001563 0.780788 0.664640 0.073536 0.012273 0.135029 ... 15.0 311.0 31540.0 29402.0 127.0 500.0 500.0 500.0 250.0 250.0
3 ENSMUSG00000000037.17 Scml2 X 0.738338 0.649756 0.783784 0.745455 0.711864 0.762791 0.532710 ... 34668.0 NaN 175688.0 138112.0 NaN 500.0 500.0 500.0 250.0 250.0
4 ENSMUSG00000000049.11 Apoh 11 0.003213 0.002429 0.815992 0.812207 NaN 0.754902 0.489796 ... 51939.0 50.0 71042.0 17921.0 100.0 500.0 500.0 500.0 250.0 250.0

5 rows × 98 columns

# add a column with a boolean depending on whether the transcript is is selected_transcripts or not
df_features["selected_transcript"] = df_features.Gene_id.isin(selected_transcripts.gene_id)
df_features.columns
Index(['Gene_id', 'Gene_name', 'contig', 'mean_mc_around_tss',
       'mean_mc_cpg_islands', 'mean_mc_downstream_0',
       'mean_mc_downstream_1000', 'mean_mc_downstream_2000',
       'mean_mc_downstream_3000', 'mean_mc_downstream_4000', 'mean_mc_exons',
       'mean_mc_first_exons', 'mean_mc_first_introns',
       'mean_mc_five_prime_utrs', 'mean_mc_genes', 'mean_mc_introns',
       'mean_mc_three_prime_utrs', 'mean_mc_upstream_1000',
       'mean_mc_upstream_1500', 'mean_mc_upstream_2000',
       'mean_mc_upstream_250', 'mean_mc_upstream_500', 'mean_hmc_around_tss',
       'mean_hmc_cpg_islands', 'mean_hmc_downstream_0',
       'mean_hmc_downstream_1000', 'mean_hmc_downstream_2000',
       'mean_hmc_downstream_3000', 'mean_hmc_downstream_4000',
       'mean_hmc_exons', 'mean_hmc_first_exons', 'mean_hmc_first_introns',
       'mean_hmc_five_prime_utrs', 'mean_hmc_genes', 'mean_hmc_introns',
       'mean_hmc_three_prime_utrs', 'mean_hmc_upstream_1000',
       'mean_hmc_upstream_1500', 'mean_hmc_upstream_2000',
       'mean_hmc_upstream_250', 'mean_hmc_upstream_500',
       'mean_modc_around_tss', 'mean_modc_cpg_islands',
       'mean_modc_downstream_0', 'mean_modc_downstream_1000',
       'mean_modc_downstream_2000', 'mean_modc_downstream_3000',
       'mean_modc_downstream_4000', 'mean_modc_exons', 'mean_modc_first_exons',
       'mean_modc_first_introns', 'mean_modc_five_prime_utrs',
       'mean_modc_genes', 'mean_modc_introns', 'mean_modc_three_prime_utrs',
       'mean_modc_upstream_1000', 'mean_modc_upstream_1500',
       'mean_modc_upstream_2000', 'mean_modc_upstream_250',
       'mean_modc_upstream_500', 'cpg_count_around_tss',
       'cpg_count_cpg_islands', 'cpg_count_downstream_0',
       'cpg_count_downstream_1000', 'cpg_count_downstream_2000',
       'cpg_count_downstream_3000', 'cpg_count_downstream_4000',
       'cpg_count_exons', 'cpg_count_first_exons', 'cpg_count_first_introns',
       'cpg_count_five_prime_utrs', 'cpg_count_genes', 'cpg_count_introns',
       'cpg_count_three_prime_utrs', 'cpg_count_upstream_1000',
       'cpg_count_upstream_1500', 'cpg_count_upstream_2000',
       'cpg_count_upstream_250', 'cpg_count_upstream_500',
       'range_length_around_tss', 'range_length_cpg_islands',
       'range_length_downstream_0', 'range_length_downstream_1000',
       'range_length_downstream_2000', 'range_length_downstream_3000',
       'range_length_downstream_4000', 'range_length_exons',
       'range_length_first_exons', 'range_length_first_introns',
       'range_length_five_prime_utrs', 'range_length_genes',
       'range_length_introns', 'range_length_three_prime_utrs',
       'range_length_upstream_1000', 'range_length_upstream_1500',
       'range_length_upstream_2000', 'range_length_upstream_250',
       'range_length_upstream_500', 'selected_transcript'],
      dtype='object')
genes.df.columns
Index(['Chromosome', 'Source', 'Type', 'Start', 'End', 'Score', 'Phase', 'Id',
       'Gene_id', 'Gene_type', 'Gene_name', 'Level', 'Mgi_id', 'Havana_gene',
       'Tag', 'Ranges_ID', 'Region'],
      dtype='object')
df_features = genes.df[["Gene_id", "Chromosome", "Start", "End"]].merge(
    df_features, on="Gene_id", how="left",
    )
df_features
Gene_id Chromosome Start End Gene_name contig mean_mc_around_tss mean_mc_cpg_islands mean_mc_downstream_0 mean_mc_downstream_1000 ... range_length_five_prime_utrs range_length_genes range_length_introns range_length_three_prime_utrs range_length_upstream_1000 range_length_upstream_1500 range_length_upstream_2000 range_length_upstream_250 range_length_upstream_500 selected_transcript
0 ENSMUSG00000025903.14 1 4807787 4848409 Lypla1 1 0.000410 0.000327 NaN 0.555201 ... 90.0 40622.0 38089.0 1722.0 500.0 500.0 500.0 250.0 250.0 True
1 ENSMUSG00000104217.1 1 4807891 4886769 Gm37988 1 0.000334 0.000327 0.744035 0.744090 ... NaN 78878.0 57461.0 NaN 500.0 500.0 500.0 250.0 250.0 True
2 ENSMUSG00000033813.15 1 4857813 4897908 Tcea1 1 0.001300 0.000912 0.790541 0.673387 ... 99.0 40095.0 28064.0 1540.0 500.0 500.0 500.0 250.0 250.0 True
3 ENSMUSG00000033793.12 1 5070017 5162528 Atp6v1h 1 0.732394 0.001487 0.796095 NaN ... NaN 92511.0 76589.0 NaN 500.0 500.0 500.0 250.0 250.0 True
4 ENSMUSG00000025905.14 1 5588465 5606130 Oprk1 1 0.228089 0.086337 0.772798 NaN ... 183.0 17665.0 12967.0 3346.0 500.0 500.0 500.0 250.0 250.0 True
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
21668 ENSMUSG00000094739.2 Y 78835720 78838055 Gm20806 Y NaN 0.650937 1.000000 NaN ... 398.0 2335.0 NaN 138.0 500.0 500.0 500.0 250.0 250.0 True
21669 ENSMUSG00000095867.2 Y 79148788 79151121 Gm20917 Y NaN 0.650937 NaN NaN ... 399.0 2333.0 NaN 138.0 500.0 500.0 500.0 250.0 250.0 True
21670 ENSMUSG00000094660.2 Y 84562571 84564906 Gm21394 Y NaN 0.650937 NaN NaN ... 398.0 2335.0 NaN 138.0 500.0 500.0 500.0 250.0 250.0 True
21671 ENSMUSG00000095650.2 Y 85528516 85530907 Gm20854 Y 0.489796 0.650937 0.818182 NaN ... NaN 2391.0 6.0 NaN 500.0 500.0 500.0 250.0 250.0 True
21672 ENSMUSG00000100608.1 Y 89713423 89745531 Gm21996 Y 0.806723 0.650937 NaN NaN ... 56.0 32108.0 29823.0 203.0 500.0 500.0 500.0 250.0 250.0 True

21673 rows × 102 columns

df_features.to_pickle("features.pickle")

For all 21,673 protein-coding genes in the mm10 genome, this file contains the ID, name, chromosome, start and end, as well as information about each sub-region of the gene (i.e. mean methylation, cpg count, and length of regions such as exons, upstream, etc…). It also contains a transcript ID column to identify the relevant transcript that we kept for each gene.