A framework to discover Spatially Variable genes via spatial clusters (2023)

Peiying Cai1,2* and Simone Tiberi3**

1Institute for Molecular Life Sciences, University of Zurich, Switzerland
2SIB Swiss Institute of Bioinformatics, University of Zurich, Switzerland
3Departiment of Statistical Sciences, University of Bologna, Italy

*peiying.cai@uzh.ch
**simone.tiberi@unibo.it

04/25/2023

  • 1 Introduction
  • 2 Basics
  • 3 Data
    • 3.1 Input data
    • 3.2 Quality control/filtering
  • 4 Individual sample
    • 4.1 Clustering
      • 4.1.1 Manual annotation
      • 4.1.2 Spatially resolved clustering
    • 4.2 SV testing
      • 4.2.1 Gene-level test
      • 4.2.2 Individual cluster test
  • 5 Multiple samples
    • 5.1 Clustering
      • 5.1.1 Manual annotation
      • 5.1.2 Spatially resolved (multi-sample) clustering
    • 5.2 SV testing
      • 5.2.1 Gene-level test
      • 5.2.2 Individual cluster test
  • 6 Session info
  • References

DESpace is an intuitive framework for identifying spatially variable (SV) genes (SVGs) via edgeR (Robinson, McCarthy, and Smyth 2009), one of the most common methods for performing differential expression analyses.

Based on pre-annotated spatial clusters as summarized spatial information, DESpace models gene expression using a negative binomial (NB), via edgeR (Robinson, McCarthy, and Smyth 2009), with spatial clusters as covariates.SV genes (SVGs) are then identified by testing the significance of spatial clusters.

Our approach assumes that the spatial structure can be summarized by spatial clusters, which should reproduce the key features of the tissue (e.g., white matter and layers in brain cortex).A significant test of these covariates indicates that space influences gene expression, hence identifying spatially variable genes.

Our model is flexible and robust, and is significantly faster than the most SV methods.Furthermore, to the best of our knowledge, it is the only SV approach that allows:- performing a SV test on each individual spatial cluster, hence identifying the key regions affected by spatial variability;- jointly fitting multiple samples, targeting genes with consistent spatial patterns across biological replicates.

Below, we illustrate en example usage of the package.

DESpace is implemented as a R package within Bioconductor, which is the main venue for omics analyses, and we use various other Bioconductor packages (e.g., SpatialLIBD, and edgeR).

DESpace package is available on Bioconductor and can be installed with the following command:

if (!requireNamespace("BiocManager", quietly = TRUE)) { install.packages("BiocManager")}BiocManager::install("DESpace")## Check that you have a valid Bioconductor installationBiocManager::valid()

The development version of DESpacecan also be installed from the Bioconductor-devel branch or from GitHub.

To access the R code used in the vignettes, type:

browseVignettes("DESpace")

Questions relative to DESpace should be reported as a new issue at BugReports.

To cite DESpace, type:

citation("DESpace")

Load R packages:

suppressMessages({library(DESpace)library(ggplot2)library(SpatialExperiment)})

As an example dataset, we consider a human dorsolateral pre-frontal cortex (DLPFC) spatial transcriptomics dataset from the 10x Genomics Visium platform, including three neurotypical adult donors (i.e., biological replicates), with four images per subject (Maynard 2021).The full dataset consists of 12 samples, which can be accessed via spatialLIBD Bioconductor package.

3.1 Input data

Here, we consider a subset of the original data, consisting of three biological replicates: 1 image for each of the three brain subjects.Initially, in Section 3 individual sample , we fit our approach on a single sample, whose data is stored in spe3 whereas all 3 samples will later be jointly used in Section 4 Multiple samples.

# Connect to ExperimentHubehub <- ExperimentHub::ExperimentHub()
## Warning: replacing previous import 'utils::findMatches' by## 'S4Vectors::findMatches' when loading 'AnnotationDbi'
# Download the full real data (about 2.1 GB in RAM) use:spe_all <- spatialLIBD::fetch_data(type = "spe", eh = ehub)# Create three spe objects, one per sample:spe1 <- spe_all[, colData(spe_all)$sample_id == '151507']spe2 <- spe_all[, colData(spe_all)$sample_id == '151669']spe3 <- spe_all[, colData(spe_all)$sample_id == '151673']rm(spe_all)# Select small set of random genes for faster runtime in this exampleset.seed(123)sel_genes <- sample(dim(spe1)[1],2000)spe1 <- spe1[sel_genes,]spe2 <- spe2[sel_genes,]spe3 <- spe3[sel_genes,]# For covenience, we use “gene names” instead of “gene ids”:rownames(spe1) <- rowData(spe1)$gene_namerownames(spe2) <- rowData(spe2)$gene_namerownames(spe3) <- rowData(spe3)$gene_name# Specify column names of spatial coordinates in colData(spe) coordinates <- c("array_row", "array_col")# Specify column names of spatial clusters in colData(spe) spatial_cluster <- 'layer_guess_reordered'

The spatial tissues of each sample were manually annotated in the original manuscript (Maynard 2021), and spots were labeled into one of the following categories: white matter (WM) and layers 1 to 6.The manual annotations are stored in column layer_guess_reordered of the colData, while columns array_col and array_row provide the spatial coordinates of spots.

# We select a subset of columnskeep_col <- c(coordinates,spatial_cluster,"expr_chrM_ratio","cell_count")head(colData(spe3)[keep_col])
## DataFrame with 6 rows and 5 columns## array_row array_col layer_guess_reordered expr_chrM_ratio## <integer> <integer> <factor> <numeric>## AAACAAGTATCTCCCA-1 50 102 Layer3 0.166351## AAACAATCTACTAGCA-1 3 43 Layer1 0.122376## AAACACCAATAACTGC-1 59 19 WM 0.114089## AAACAGAGCGACTCCT-1 14 94 Layer3 0.242223## AAACAGCTTTCAGAAG-1 43 9 Layer5 0.152174## AAACAGGGTCTATATT-1 47 13 Layer6 0.155095## cell_count## <integer>## AAACAAGTATCTCCCA-1 6## AAACAATCTACTAGCA-1 16## AAACACCAATAACTGC-1 5## AAACAGAGCGACTCCT-1 2## AAACAGCTTTCAGAAG-1 4## AAACAGGGTCTATATT-1 6

3.2 Quality control/filtering

Quality control (QC) procedures at the spot and gene level aim to remove both low-quality spots, and lowly abundant genes.For QC, we adhere to the instructions from “Orchestrating Spatially Resolved Transcriptomics Analysis with Bioconductor” (OSTA).The library size, UMI counts, ratio of mitochondrial chromosome (chM) expression, and number of cells per spot are used to identify low-quality spots.

# Sample 1:# Calculate per-spot QC metrics and store in colDataspe1 <- scuttle::addPerCellQC(spe1,)# Remove combined set of low-quality spotsspe1 <- spe1[, !(colData(spe1)$sum < 10 | # library size colData(spe1)$detected < 10 | # number of expressed genes colData(spe1)$expr_chrM_ratio > 0.30| # mitochondrial expression ratio colData(spe1)$cell_count > 10)] # number of cells per spot# Sample 2:# Calculate per-spot QC metrics and store in colDataspe2 <- scuttle::addPerCellQC(spe2,)# Remove combined set of low-quality spotsspe2 <- spe2[, !(colData(spe2)$sum < 20 | colData(spe2)$detected < 15 | colData(spe2)$expr_chrM_ratio > 0.35| colData(spe2)$cell_count > 8)]# Sample 3:spe3 <- scuttle::addPerCellQC(spe3,)# Remove combined set of low-quality spotsspe3 <- spe3[, !(colData(spe3)$sum < 25 | colData(spe3)$detected < 25 | colData(spe3)$expr_chrM_ratio > 0.3| colData(spe3)$cell_count > 15)]

Then, we discard lowly abundant genes, which were detected in less than 20 spots.

# For each sample i:for(i in seq_len(3)){ spe_i <- eval(parse(text = paste0("spe", i))) # Select QC threshold for lowly expressed genes: at least 20 non-zero spots: qc_low_gene <- rowSums(assays(spe_i)$counts > 0) >= 20 # Remove lowly abundant genes spe_i <- spe_i[qc_low_gene,] assign(paste0("spe", i), spe_i) message("Dimension of spe", i, ": ", dim(spe_i)[1], ", ", dim(spe_i)[2])}
## Dimension of spe1: 847, 4174
## Dimension of spe2: 868, 3635
## Dimension of spe3: 908, 3601

We fit our approach to discover SVGs in an individual sample.In Section 4 Multiple samples, we will show how to jointly embed multiple replicates.

4.1 Clustering

This framework relies on spatial clusters being accessible and successfully summarizing the primary spatial characteristics of the data.In most datasets, these spatial features are either accessible or can be easily generated with spatial clustering algorithms.

4.1.1 Manual annotation

If manual annotations are provided (e.g., annotated by a pathologist), we can directly use those.With the spe or spe object that contains coordinates of the spot-level data, we can visualize spatial clusters.

# View LIBD layers for one sampleCD <- as.data.frame(colData(spe3))ggplot(CD, aes(x=array_col,y=array_row, color=factor(layer_guess_reordered))) + geom_point() + theme_void() + scale_y_reverse() + theme(legend.position="bottom") + labs(color = "", title = paste0("Manually annotated spatial clusters"))

A framework to discover Spatially Variable genes via spatial clusters (1)

4.1.2 Spatially resolved clustering

If manual annotations are not available, we can use spatially resolved clustering tools.These methods, by jointly employing spatial coordinates and gene expression data, enable obtaining spatial clusters.Although, in this vignette we use pre-computed manually annotated clusters, below we provide links to two popular spatially resolved clustering tools: BayesSpace (Zhao et al. 2021) and StLearn (Pham et al. 2020).

4.1.2.1 BayesSpace

BayesSpace is a Bioconductor package that provides a Bayesian statistical approach for spatial transcriptomics data clustering (BayesSpace).There is a specific vignette for using BayesSpace on this dataset (human DLPFC) here.

After using BayesSpace, the spatial cluster assignments (spatial.cluster) are available in the colData(spe).

4.1.2.2 StLearn

StLearn, a python-based package, is designed for spatial transciptomics data.It allows spatially-resolved clustering based on Louvain or k-means (stLearn).There is a tutorial for using StLearn on this dataset (human DLPFC) here.

After running stLearn, we can store results as a csv file.

# Save spatial resultsobsm.to_csv("stLearn_clusters.csv") 

Then, we can load these results in R and store spatial clusters in the spe object.

stLearn_results <- read.csv("stLearn_clusters.csv", sep = ',', header = TRUE)# Match colData(spe) and stLearn resultsstLearn_results <- stLearn_results[match(rownames(colData(spe3)), rownames(stLearn_results)), ]colData(spe3)$stLearn_clusters <- stLearn_results$stLearn_pca_kmeans

4.2 SV testing

Once we have spatial clusters, we can search for SVGs.

4.2.1 Gene-level test

Fit the model via DESpace_test function.Parameter spe specifies the input SpatialExperiment or SingleCellExperiment object, while spatial_cluster defines the column names of colData(spe) containing spatial clusters. To obtain all statistics, set verbose to TRUE (default value).

set.seed(123)results <- DESpace_test(spe = spe3, spatial_cluster = spatial_cluster, verbose = TRUE)
## using 'DESpace_test' for spatial gene/pattern detection.
## Filter low quality genes:
## min_counts = 20; min_non_zero_spots = 10.
## The number of genes that pass filtering is 908.
## single sample test

A list of results is returned.The main results of interest are stored in the gene_results: a data.fame, where columns contain gene names (gene_id), likelihood ratio test statistics (LR), average (across spots) log-2 counts per million (logCPM), raw p-values (PValue) and Benjamini-Hochberg adjusted p-values (FDR).

head(results$gene_results, 3)
## gene_id LR logCPM PValue FDR## SNCG SNCG 1314.340 14.36222 8.520089e-281 7.736241e-278## ATP1A3 ATP1A3 1231.662 14.68926 6.719198e-263 3.050516e-260## HPCAL1 HPCAL1 1058.545 14.11872 1.940513e-225 5.873285e-223

The second element of the results (a DGEList object estimated_y) contains the estimated common dispersion, which can later be used to speed-up calculation when testing individual clusters.

The third and forth element of the results (DGEGLM and DGELRT objects) contain full statistics from edgeR::glmFit and edgeR::glmLRT.

class(results$estimated_y); class(results$glmLrt); class(results$glmFit)
## [1] "DGEList"## attr(,"package")## [1] "edgeR"
## [1] "DGELRT"## attr(,"package")## [1] "edgeR"
## [1] "DGEGLM"## attr(,"package")## [1] "edgeR"

Visualize the gene expression of the three most significant genes in FeaturePlot().Note that the gene names in vector feature, should also appear in the count matrix’s row names. Specifying the column names of spatial coordinates of spots is only necessary when they are not named row and col.

(feature <- results$gene_results$gene_id[seq_len(3)])
## [1] "SNCG" "ATP1A3" "HPCAL1"
FeaturePlot(spe3, feature, coordinates = coordinates, ncol = 3, title = TRUE)