Exploratory analysis with modality#
Generating rapid and sufficiently deep exploratory analyses of complex datasets is an important part of the analysis workflow. Here we demonstrate how modality
enables such analysis with a few key examples.
Load data#
For the purposes of this demo we can use a duet +modC dataset of Genome in a bottle (GIAB) samples. These data are publicly available and can be loaded using the load_biomodal_dataset()
function from the modality.datasets
module. This will pull the dataset and load it in to our session as a ContigDataset
object.
from modality.datasets import load_biomodal_dataset
ds = load_biomodal_dataset("giab")
ds
<xarray.Dataset> Dimensions: (sample_id: 14, pos: 58305782) Coordinates: NA_id (sample_id) object 'NA12878' ... 'NA24695' contig (pos) <U5 dask.array<chunksize=(100000,), meta=np.ndarray> giab_id (sample_id) object 'HG001' ... 'HG007' group (sample_id) <U22 'CEG1530-EL01-A1200-001'... ref_position (pos) int64 dask.array<chunksize=(100000,), meta=np.ndarray> * sample_id (sample_id) <U22 'CEG1530-EL01-A1200-001'... strand (pos) <U2 dask.array<chunksize=(100000,), meta=np.ndarray> Dimensions without coordinates: pos Data variables: Protocol Version (sample_id) object '5-Letter v2.4' ... '5... family (sample_id) object 'Utah' ... 'Han Chinese' family_status (sample_id) object 'Singleton' ... 'Mother' num_c (pos, sample_id) uint16 dask.array<chunksize=(100000, 7), meta=np.ndarray> num_modc (pos, sample_id) uint16 dask.array<chunksize=(100000, 7), meta=np.ndarray> num_other (pos, sample_id) uint16 dask.array<chunksize=(100000, 7), meta=np.ndarray> num_total (pos, sample_id) uint16 dask.array<chunksize=(100000, 7), meta=np.ndarray> num_total_c (pos, sample_id) uint16 dask.array<chunksize=(100000, 7), meta=np.ndarray> sex (sample_id) object 'Female' ... 'Female' tech_replicate_number (sample_id) int64 1 1 1 1 1 1 ... 2 2 2 2 2 Input DNA Quantity (ng/sample) (sample_id) int64 80 80 80 80 ... 80 80 80 Attributes: context: CG context_sampling: 1.0 contigs: ['chr1', 'chr2', 'chr3', 'chr4', 'chr5', 'chr6', 'chr7... coordinate_basis: 0 description: A +modC dataset of human Genome in a bottle (GIAB) sam... fasta_path: GRCh38Decoy-ss-ctrls-v23.fa.gz input_path: ['CEG1530-EL01-A1200-001.genome.GRCh38Decoy.dedup.duet... quant_type: quant5L ref_name: GRCh38Decoy sample_ids: ['CEG1530-EL01-A1200-001', 'CEG1530-EL01-A1200-004', '... slice_chr1: slice(0, 4750318, 1) slice_chr10: slice(29992404, 32770360, 1) slice_chr11: slice(32770360, 35436588, 1) slice_chr12: slice(35436588, 38068524, 1) slice_chr13: slice(38068524, 39753462, 1) slice_chr14: slice(39753462, 41478318, 1) slice_chr15: slice(41478318, 43290370, 1) slice_chr16: slice(43290370, 45592152, 1) slice_chr17: slice(45592152, 48088808, 1) slice_chr18: slice(48088808, 49600836, 1) slice_chr19: slice(49600836, 51714166, 1) slice_chr2: slice(4750318, 9135658, 1) slice_chr20: slice(51714166, 53261120, 1) slice_chr21: slice(53261120, 54118812, 1) slice_chr22: slice(54118812, 55320596, 1) slice_chr3: slice(9135658, 12482244, 1) slice_chr4: slice(12482244, 15489102, 1) slice_chr5: slice(15489102, 18536520, 1) slice_chr6: slice(18536520, 21558898, 1) slice_chr7: slice(21558898, 24804548, 1) slice_chr8: slice(24804548, 27480948, 1) slice_chr9: slice(27480948, 29992404, 1) slice_chrM: slice(58304912, 58305782, 1) slice_chrX: slice(55320596, 57966014, 1) slice_chrY: slice(57966014, 58304912, 1)
This is the ContigDataset
, the main modality
object, which is a multi-dimensional array following the xarray.Dataset
format. The two dimensions here are position and sample. For each of the 7 GIAB samples, we have two technical replicates, so 14 samples in total.
An initial investiagtion of a dataset#
A common goal for initial exploratory anlysis for a set of samples is to understand how they relate to each other. In terms of initial quality control (QC) this can help one indentify any potential issues, such as sample-swaps, very early in the analysis process. It can also give us some initial understanding of the underlying biology and help shape dowstream investigation. Below we go through a few examples of how modality
can be used for these kind of initial investiagtions.
The functionality shown here can be used genome-wide but for brevity we will subset our dataset to a single chromosome. We can use the slice functionality of the ContigDataset
to extract a single chromosome view of the data and work with this downstream.
ds = ds["chr2"]
Pearson correlation across samples#
One such primary analysis is to calculate the pearson correlation coefficients between pairs of samples in the dataset. This helps to summarise similarity between samples and can also flag potential issues, such as duplicated samples. This can be efficiently done using the compute_pearson_matrix
method of the ContigDataset
. This method calculates methylation fractions between numerator
and denominator
arguments and then calculates the pearson correlation coefficients across the captured genomic range (in this case chromosome 2) between each pair of samples and returns the resulting matrix. This method can be used as is, or one can use the sister method: plot_pearson_matrix()
, which uses compute_pearson_matrix
internally but returns a seaborn.heatmap
visualisation of the result.
pearson_matrix = ds.compute_pearson_matrix(
numerator="num_modc",
denominator="num_total_c",
min_coverage=15,
)
fig, axes = ds.plot_pearson_matrix(
numerator="num_modc",
denominator="num_total_c",
min_coverage=15,
)
PCA analysis#
We may also wish to perfom a principal components analysis to help us understand the structure of the dataset. Here we can make use of the modality.pca
module.
from modality.pca import run_pca, plot_pca_scree, plot_pca_scatter
We first calculate methylation fractions across the dataset, using the assign_fractions
method of the ContigDataset
, so we can then use a methylation fraction feature for our pca analysis
ds.assign_fractions(
numerators="num_modc",
denominator="num_total_c",
min_coverage=15,
inplace=True,
)
pca_object, transformed_data = run_pca(
input_array=ds.frac_modc,
n_components=2,
)
You can quickly generate a scree plot to visualise the variance explained by each principal component:
plot_pca_scree(
pca_object=pca_object,
)
<seaborn.axisgrid.FacetGrid at 0x7f4b6945b950>
You can also generate a scatter plot to show how the samples are structured along any two principal components. For example, to check that technical replicates group as expected with their coupled samples.
plot_pca_scatter(
pca_object=pca_object,
pca_result=transformed_data,
hue=ds.sample_id,
style=ds.tech_replicate_number,
)
<seaborn.axisgrid.FacetGrid at 0x7f4b6293d690>
Restricting the dataset to a single trio#
Here we show how to only select part of the dataset. We focus on the GIAB Han Chinese Trio
HG005 is the son
HG006 is the father
HG007 is the mother
Each member of the trio has two replicates, so we will also sum over the replicates to gain some coverage.
from modality.contig_dataset import ContigDataset
# Filter to Han Chinese trio
ds_trio = ds.where(ds.family == "Han Chinese", drop=True)
# Sum the counts for each trio
ds_trio = ds_trio.groupby("giab_id").sum().transpose()
# Rename the index to sample_id for compatibility with the ContigDataset
ds_trio = ds_trio.rename({"giab_id": "sample_id"})
ds_trio = ContigDataset(ds_trio)
ds_trio.assign_fractions(
numerators="num_modc",
denominator="num_total_c",
min_coverage=15,
inplace=True,
)
ds_trio
<xarray.Dataset> Dimensions: (pos: 4385340, sample_id: 3) Coordinates: contig (pos) <U5 dask.array<chunksize=(49682,), meta=np.ndarray> ref_position (pos) int64 dask.array<chunksize=(49682,), meta=np.ndarray> strand (pos) <U2 dask.array<chunksize=(49682,), meta=np.ndarray> * sample_id (sample_id) object 'HG005' 'HG006' 'HG007' group (sample_id) object 'HG005' 'HG006' 'HG007' Dimensions without coordinates: pos Data variables: num_c (pos, sample_id) int64 dask.array<chunksize=(49682, 1), meta=np.ndarray> num_modc (pos, sample_id) int64 dask.array<chunksize=(49682, 1), meta=np.ndarray> num_other (pos, sample_id) int64 dask.array<chunksize=(49682, 1), meta=np.ndarray> num_total (pos, sample_id) int64 dask.array<chunksize=(49682, 1), meta=np.ndarray> num_total_c (pos, sample_id) int64 dask.array<chunksize=(49682, 1), meta=np.ndarray> tech_replicate_number (sample_id) int64 3 3 3 Input DNA Quantity (ng/sample) (sample_id) int64 160 160 160 frac_modc (pos, sample_id) float64 dask.array<chunksize=(49682, 1), meta=np.ndarray> Attributes: context: CG context_sampling: 1.0 contigs: ['chr2'] coordinate_basis: 0 description: A +modC dataset of human Genome in a bottle (GIAB) sa... fasta_path: GRCh38Decoy-ss-ctrls-v23.fa.gz input_path: ['CEG1530-EL01-A1200-001.genome.GRCh38Decoy.dedup.due... quant_type: quant5L ref_name: GRCh38Decoy sample_ids: ['CEG1530-EL01-A1200-001', 'CEG1530-EL01-A1200-004', ... slice_chr2: slice(0, 4385340, 1) frac_denominator: num_total_c frac_min_coverage: 15
Heatmap of methylation fraction between samples#
We can get a more nuanced view of the correaltion between samples by plotting a heatmap of methylation fractions between pairs of samples using the plot_correlation_heatmaps()
method of the ContigDataset
.
fig, axes = ds_trio.plot_correlation_heatmaps(
numerator="num_modc",
denominator="num_total_c",
min_coverage=15,
)
Scatter plot of methylation fraction between samples#
Here we show a scatter plot of the methylation fraction across samples. Because it would represent too many points otherwise, we restrict our dataset to plot only the CpGs between the genomic positions 0 and 2,000,000 of Chromosome 2 using the following syntax: ds_trio["chr2:0-2_000_000"]
.
Note: Each plotting function returns a fig
and axes
object. You can then make adjustments to the figures as you see fit. For instance, below we increase the font size and rotate the axis.
fig, axes = ds_trio["chr2:0-2_000_000"].plot_correlation_scatter(
numerator="num_modc",
denominator="num_total_c",
min_coverage=15,
s=1,
)
for ax in axes:
for k in range(2):
ax[k].set_xlabel(ax[k].get_xlabel(), fontsize=18, rotation=45, ha="right")
ax[k].set_ylabel(ax[k].get_ylabel(), fontsize=18, rotation=45, ha="right")