Han’s MetaCell Model

Han Wang at 2024/6/11.

Step 1: Data Preparation and Preprocessing

1. Filtering Low-Quality Cells:

Remove cells with low total expression, few detected genes, or high mitochondrial gene expression percentage.

\text{Quality}(i) = \frac{\sum_{j} X_{ij}}{\sum_{i} \sum_{j} X_{ij}} > \text{threshold1}

\text{Detected Genes}(i) = \sum_{j} \mathbb{I}(X_{ij} > 0) > \text{threshold2}

\text{Mito Ratio}(i) = \frac{\sum_{j \in \text{Mito}} X_{ij}}{\sum_{j} X_{ij}} < \text{threshold3}

2. Filtering Low-Expression Genes:

Remove genes whose mean expression across the $m$ cells is low or zero.

\text{Expression}(j) = \frac{\sum_{i} X_{ij}}{m} > \text{threshold4}
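The two filtering steps above can be sketched with NumPy. This is a minimal illustration, not a reference implementation: the function name, the threshold parameter names (`min_frac`, `min_genes`, `max_mito`, `min_mean`), and the boolean `mito_mask` marking mitochondrial genes are all assumptions made for this example. It also assumes the gene mean is taken over the cells that survive cell filtering, which the text does not specify.

```python
import numpy as np

def filter_cells_and_genes(X, min_frac, min_genes, max_mito, min_mean, mito_mask):
    """Filter a cells-by-genes count matrix X (rows = cells i, columns = genes j)."""
    X = np.asarray(X, dtype=float)
    totals = X.sum(axis=1)                    # total counts per cell
    quality = totals / X.sum()                # Quality(i)
    detected = (X > 0).sum(axis=1)            # Detected Genes(i)
    # Guard against division by zero for cells with no counts at all.
    mito_ratio = X[:, mito_mask].sum(axis=1) / np.maximum(totals, 1e-12)

    keep_cells = (quality > min_frac) & (detected > min_genes) & (mito_ratio < max_mito)
    X_cells = X[keep_cells]

    # Expression(j): mean over the retained cells (an assumption of this sketch).
    if X_cells.shape[0] > 0:
        gene_mean = X_cells.mean(axis=0)
    else:
        gene_mean = np.zeros(X.shape[1])
    keep_genes = gene_mean > min_mean
    return X_cells[:, keep_genes], keep_cells, keep_genes

# Toy data: 4 cells, 4 genes; the last gene is treated as mitochondrial.
X = np.array([
    [10, 0, 5, 1],
    [0, 0, 1, 0],
    [8, 2, 0, 9],
    [6, 1, 4, 0],
])
mito_mask = np.array([False, False, False, True])
X_f, keep_cells, keep_genes = filter_cells_and_genes(
    X, min_frac=0.05, min_genes=1, max_mito=0.4, min_mean=1.0, mito_mask=mito_mask
)
```

In the toy data, the second cell fails the total-expression and detected-gene checks, and the third fails the mitochondrial check, so two cells and two genes remain.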

3. Data Normalization:

Scale each cell's counts to a common total of $10^6$ (counts per million) so that expression levels are comparable across cells.

X'_{ij} = \frac{X_{ij}}{\sum_{j} X_{ij}} \times 10^6

4. Data Transformation:

Log-transform the normalized expression data.

X''_{ij} = \log_2(X'_{ij} + 1)
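The normalization and log transform above amount to a few array operations. The sketch below assumes a cells-by-genes matrix in which every cell has at least one count (as would be the case after filtering); the function name `normalize_log` is an illustrative choice.

```python
import numpy as np

def normalize_log(X):
    """Counts-per-million scaling per cell, followed by log2(x + 1)."""
    X = np.asarray(X, dtype=float)
    totals = X.sum(axis=1, keepdims=True)     # total counts per cell
    if np.any(totals == 0):
        raise ValueError("cells with zero total counts must be filtered first")
    cpm = X / totals * 1e6                    # X'_ij
    return np.log2(cpm + 1.0)                 # X''_ij

X = np.array([[1, 3],
              [0, 4]])
X_log = normalize_log(X)
```

Zero counts stay at zero after the transform because $\log_2(0 + 1) = 0$, which is the reason for the added pseudocount.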

Step 2: NMF Decomposition

1. NMF Decomposition:

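Decompose the preprocessed expression matrix $X$ into two non-negative matrices $W$ and $H$.

X \approx WH

Where $W \in \mathbb{R}^{n \times k}$ and $H \in \mathbb{R}^{k \times m}$, with $n$ genes, $m$ cells, and $k$ factors. Here $X$ is arranged as genes by cells (the transpose of the cell-by-gene indexing $X_{ij}$ used in Step 1), so each column of $H$ describes one cell.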

2. Objective to Minimize Reconstruction Error:

\min_{W, H} ||X - WH||_F^2

3. Optimization Algorithms:

Use Alternating Least Squares (ALS) or Multiplicative Update Rules (MUR). The multiplicative updates are:

W_{ij} \leftarrow W_{ij} \frac{(XH^T)_{ij}}{(WHH^T)_{ij}}

H_{ij} \leftarrow H_{ij} \frac{(W^TX)_{ij}}{(W^TWH)_{ij}}
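The multiplicative updates can be sketched directly in NumPy. This is a minimal version under a few assumptions that are not in the text: random non-negative initialization, a fixed iteration count instead of a convergence test, and a small `eps` added to the denominators to avoid division by zero.

```python
import numpy as np

def nmf_multiplicative(X, k, n_iter=200, eps=1e-10, seed=0):
    """Approximate X (n x m, non-negative) as W @ H with W (n x k), H (k x m)."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    W = rng.random((n, k)) + eps
    H = rng.random((k, m)) + eps
    errors = []
    for _ in range(n_iter):
        # Element-wise multiplicative updates; factors stay non-negative.
        W *= (X @ H.T) / (W @ H @ H.T + eps)
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        errors.append(np.linalg.norm(X - W @ H, "fro") ** 2)
    return W, H, errors

# Toy matrix with an exact rank-2 non-negative factorization.
rng = np.random.default_rng(1)
X = rng.random((6, 2)) @ rng.random((2, 5))
W, H, errors = nmf_multiplicative(X, k=2)
```

Because each update multiplies by a ratio of non-negative quantities, no explicit projection onto the non-negative orthant is needed. The `errors` list tracks the squared Frobenius reconstruction error after each iteration.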

Step 3: Constructing the Cell Similarity Graph

1. Construct Cell Similarity Graph:

Use the weight matrix $H$ from NMF.

\mathbf{h}_i = H_{:,i}

2. Calculate Similarity Measures:

Calculate pairwise cell similarities using cosine similarity or Pearson correlation.

\text{CosSim}(\mathbf{h}_i, \mathbf{h}_j) = \frac{\mathbf{h}_i \cdot \mathbf{h}_j}{||\mathbf{h}_i|| \cdot ||\mathbf{h}_j||}

\text{Pearson}(\mathbf{h}_i, \mathbf{h}_j) = \frac{\text{Cov}(\mathbf{h}_i, \mathbf{h}_j)}{\sigma_{\mathbf{h}_i} \sigma_{\mathbf{h}_j}}

3. Construct Cell Similarity Graph $G$:

Edges in $G$ represent similarities above a threshold.

E = \{ (i, j) \mid \text{Sim}(\mathbf{h}_i, \mathbf{h}_j) > \text{threshold} \}
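A small sketch of the graph construction with cosine similarity follows. It assumes the columns of $H$ are cells, lists each undirected edge once with $i < j$ (self-pairs are skipped), and assigns similarity 0 to any all-zero column; these conventions are choices made for the example.

```python
import numpy as np

def cosine_similarity_graph(H, threshold):
    """Return the cell-by-cell cosine similarity matrix and thresholded edges."""
    H = np.asarray(H, dtype=float)
    norms = np.linalg.norm(H, axis=0)
    norms = np.where(norms == 0, 1.0, norms)  # all-zero columns get similarity 0
    Hn = H / norms                            # unit-length columns
    S = Hn.T @ Hn                             # S[i, j] = CosSim(h_i, h_j)
    m = H.shape[1]
    edges = [(i, j) for i in range(m) for j in range(i + 1, m) if S[i, j] > threshold]
    return S, edges

# Four cells described by two factors: cells 0-1 load on factor 0, cells 2-3 on factor 1.
H = np.array([[1.0, 2.0, 0.0, 0.1],
              [0.0, 0.1, 1.0, 3.0]])
S, edges = cosine_similarity_graph(H, threshold=0.9)
```

With a threshold of 0.9, only the two within-group pairs are connected.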

Step 4: Graph-Based Clustering (Leiden Algorithm)

1. Local Moving:

Initialization:

Each node starts as an independent community.

Local Optimization:

For each node $i$ currently in community $C_i$, calculate the modularity gain $\Delta Q$ obtained by moving it to a neighboring community $C_j$:

\Delta Q = \frac{k_i^{\text{in}}(C_j) - k_i^{\text{in}}(C_i \setminus \{i\})}{m} - \frac{k_i \left[ \sum_{u \in C_j} k_u - \sum_{u \in C_i \setminus \{i\}} k_u \right]}{2m^2}

Where $w_{uv}$ is the edge weight between $u$ and $v$, $k_i^{\text{in}}(C) = \sum_{v \in C} w_{iv}$ is the total weight of edges between node $i$ and the nodes of community $C$, $k_i = \sum_{v} w_{iv}$ is the weighted degree of node $i$, and $m$ is the total edge weight of the graph (not the number of cells from Steps 1 and 2). The first term is the internal edge weight gained in $C_j$ minus that lost in $C_i$; the second is the change in the expected-weight penalty of the modularity $Q$ defined under Modularity Definition.

Move Nodes:

Move the node to the community with the largest positive $\Delta Q$, and repeat over all nodes until no move improves modularity.
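The local moving phase can be sketched as a greedy loop. To keep the example short and easy to check, each gain is computed as the difference in modularity $Q$ before and after a trial move, rather than through a closed-form $\Delta Q$. This is a simplified sweep-based illustration, not the queue-based procedure used in actual Leiden implementations; the function names and the dense adjacency matrix `A` are assumptions of this sketch.

```python
import numpy as np

def modularity(A, labels):
    """Modularity Q of a partition for a symmetric weighted adjacency matrix A."""
    two_m = A.sum()                           # equals 2m for an undirected graph
    k = A.sum(axis=1)                         # weighted degrees
    same = labels[:, None] == labels[None, :]
    return ((A - np.outer(k, k) / two_m) * same).sum() / two_m

def local_moving(A, max_sweeps=100):
    """Greedy local moving: every node starts alone, then moves while Q improves."""
    n = A.shape[0]
    labels = np.arange(n)
    for _ in range(max_sweeps):
        moved = False
        for i in range(n):
            current_q = modularity(A, labels)
            best_gain, best_label = 0.0, labels[i]
            # Candidate communities are those of the node's neighbors.
            candidates = sorted({int(labels[j]) for j in np.nonzero(A[i])[0] if j != i})
            for c in candidates:
                if c == labels[i]:
                    continue
                trial = labels.copy()
                trial[i] = c
                gain = modularity(A, trial) - current_q
                if gain > best_gain + 1e-12:  # accept only strict improvements
                    best_gain, best_label = gain, c
            if best_label != labels[i]:
                labels[i] = best_label
                moved = True
        if not moved:
            break
    return labels

# Two triangles (0-1-2 and 3-4-5) joined by the single edge 2-3.
A = np.zeros((6, 6))
for u, v in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[u, v] = A[v, u] = 1.0
labels = local_moving(A)
```

On this toy graph the loop settles on the two triangles as communities. Recomputing $Q$ from scratch for every trial move is quadratic in the number of nodes and is only reasonable for small examples.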

2. Refinement Phase:

Community Merging:

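Treat communities as supernodes and construct a new graph where supernodes are connected by aggregated edge weights. In the Leiden algorithm, each community is first refined into well-connected sub-communities, and these sub-communities become the supernodes.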

3. Fine-Tuning Phase:

Apply Local Optimization Again:

Optimize modularity in the new graph of supernodes.

Modularity Definition:

Modularity $Q$:

Q = \frac{1}{2m} \sum_{i,j} \left[ w_{ij} - \frac{k_i k_j}{2m} \right] \delta(C_i, C_j)

Where $w_{ij}$ is the edge weight between nodes $i$ and $j$, $k_i$ and $k_j$ are the degrees of nodes $i$ and $j$, and $\delta(C_i, C_j)$ is the Kronecker delta function indicating whether nodes $i$ and $j$ are in the same community.
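As a worked check of the definition, the double sum can be grouped by community, since the Kronecker delta keeps only pairs within the same community. The example graph (two triangles joined by one edge, all weights 1) is an illustration chosen here, not part of the original model.

```latex
% Grouping the sum over ordered pairs (i, j) by community C:
Q = \sum_{C} \left[ \frac{\sum_{i,j \in C} w_{ij}}{2m}
      - \left( \frac{\sum_{i \in C} k_i}{2m} \right)^2 \right]

% Example: triangles {0,1,2} and {3,4,5} joined by edge 2-3.
% Total edge weight m = 7, so 2m = 14.
% Each triangle: \sum_{i,j \in C} w_{ij} = 6 (3 edges, each counted twice),
%                \sum_{i \in C} k_i = 2 + 2 + 3 = 7.
Q = 2 \left[ \frac{6}{14} - \left( \frac{7}{14} \right)^2 \right]
  = 2 \left[ \frac{3}{7} - \frac{1}{4} \right]
  = \frac{5}{14} \approx 0.357
```

The per-community form makes clear that moving one node only changes the terms of the two communities involved, which is why the gain of a single move can be evaluated locally.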

Step 5: Evaluation and Annotation of Metacells

1. Statistical Metrics Calculation:

Calculate various metrics for each metacell.

\text{Cell Number}(C_i) = |C_i|

\text{Expressed Genes}(C_i) = \sum_{j} \mathbb{I}\left(\sum_{k \in C_i} X_{kj} > 0\right)
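Both metrics can be computed per metacell from the count matrix and a label vector. The sketch assumes a cells-by-genes matrix `X` and an integer array `labels` giving each cell's metacell; the function name and the dictionary layout of the result are illustrative.

```python
import numpy as np

def metacell_stats(X, labels):
    """Cell count and number of expressed genes for each metacell."""
    X = np.asarray(X)
    labels = np.asarray(labels)
    stats = {}
    for c in np.unique(labels):
        members = labels == c
        pooled = X[members].sum(axis=0)       # per-gene counts summed over the metacell
        stats[int(c)] = {
            "cell_number": int(members.sum()),
            "expressed_genes": int((pooled > 0).sum()),
        }
    return stats

X = np.array([[1, 0, 0],
              [0, 2, 0],
              [0, 0, 0],
              [0, 0, 3]])
labels = np.array([0, 0, 1, 1])
stats = metacell_stats(X, labels)
```

A gene counts as expressed in a metacell if at least one member cell has a non-zero count for it, matching the indicator in the formula above.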

2. Differential Expression Analysis:

Identify genes that are differentially expressed in each metacell.

\text{DE}(C_i) = \{ g \mid \text{avg}_{(i)}(X''_{:,g}) > \text{avg}_{\text{(others)}}(X''_{:,g}) \}
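The rule above compares a gene's mean log expression inside a metacell with its mean over all other cells. The sketch below implements exactly that comparison and nothing more: there is no statistical test or fold-change cutoff, which a practical analysis would probably add. The function name and the cells-by-genes layout of `X_log` are assumptions.

```python
import numpy as np

def de_genes(X_log, labels, cluster):
    """Indices of genes whose mean in `cluster` exceeds their mean elsewhere."""
    X_log = np.asarray(X_log, dtype=float)
    inside = np.asarray(labels) == cluster
    if not inside.any() or inside.all():
        raise ValueError("cluster must contain some, but not all, of the cells")
    mean_in = X_log[inside].mean(axis=0)      # avg over cells in the metacell
    mean_out = X_log[~inside].mean(axis=0)    # avg over all other cells
    return np.nonzero(mean_in > mean_out)[0]

X_log = np.array([[5.0, 1.0, 2.0],
                  [4.0, 0.0, 2.0],
                  [1.0, 3.0, 2.0],
                  [0.0, 4.0, 2.0]])
labels = np.array([0, 0, 1, 1])
de_0 = de_genes(X_log, labels, cluster=0)
de_1 = de_genes(X_log, labels, cluster=1)
```

Gene 2 has the same mean in both groups, so the strict inequality excludes it from both sets.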

3. Functional Enrichment Analysis:

Perform functional enrichment analysis on marker genes.

\text{Enrichment}(DE(C_i)) = \text{GO}(DE(C_i)), \text{KEGG}(DE(C_i))

4. Cell Type Comparison:

Compare metacells with known cell types or states.

\text{Annotation}(C_i) = \text{Compare}(DE(C_i), \text{known markers})
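The text does not specify how $\text{Compare}$ is computed. One simple possibility, shown here purely as an assumption, is to score each known cell type by the Jaccard overlap between its marker set and the metacell's DE genes and pick the best match. The function name and the small marker table are illustrative placeholders, not a curated reference.

```python
def annotate(de_gene_names, known_markers):
    """Pick the cell type whose marker set has the highest Jaccard overlap."""
    de = set(de_gene_names)
    scores = {}
    for cell_type, markers in known_markers.items():
        markers = set(markers)
        union = de | markers
        scores[cell_type] = len(de & markers) / len(union) if union else 0.0
    best = max(scores, key=scores.get)
    return best, scores

# Illustrative marker table (hypothetical input for this sketch).
known_markers = {
    "T cell": ["CD3E", "CD3D", "CD2"],
    "B cell": ["CD79A", "MS4A1"],
}
best, scores = annotate(["CD3E", "CD3D", "IL7R"], known_markers)
```

Here two of the three DE genes appear in the T cell marker set, giving a Jaccard score of 2/4, while the B cell set has no overlap.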

5. Visualization:

Visualize metacells using dimensionality reduction techniques such as t-SNE or UMAP.

\text{t-SNE}(H), \text{UMAP}(H)