Han’s MetaCell Model

Han Wang, 2024/6/11.


Step 1: Data Preparation and Preprocessing

1. Filtering Low-Quality Cells:

Remove cells with low total expression, few detected genes, or high mitochondrial gene expression percentage.

$$\text{Quality}(i) = \frac{\sum_{j} X_{ij}}{\sum_{i} \sum_{j} X_{ij}} > \text{threshold1}$$
$$\text{Detected Genes}(i) = \sum_{j} \mathbb{I}(X_{ij} > 0) > \text{threshold2}$$
$$\text{Mito Ratio}(i) = \frac{\sum_{j \in \text{Mito}} X_{ij}}{\sum_{j} X_{ij}} < \text{threshold3}$$

2. Filtering Low-Expression Genes:

Remove genes with low or no expression in most cells.

$$\text{Expression}(j) = \frac{\sum_{i} X_{ij}}{m} > \text{threshold4}$$
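The two filtering steps above can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation: the function name `filter_cells_and_genes`, the boolean `mito_mask` over genes, and the threshold arguments `t1` to `t4` are hypothetical, and computing the gene filter on the cells that survive the cell filter is an assumption about the intended order.

```python
import numpy as np

def filter_cells_and_genes(X, mito_mask, t1, t2, t3, t4):
    """X: raw counts, cells x genes. mito_mask: boolean vector over genes."""
    X = np.asarray(X, dtype=float)
    mito_mask = np.asarray(mito_mask, dtype=bool)

    cell_totals = X.sum(axis=1)
    quality = cell_totals / X.sum()            # Quality(i)
    detected = (X > 0).sum(axis=1)             # Detected Genes(i)
    # Guard against division by zero for cells with no counts.
    mito_ratio = X[:, mito_mask].sum(axis=1) / np.maximum(cell_totals, 1e-12)

    keep_cells = (quality > t1) & (detected > t2) & (mito_ratio < t3)
    X_cells = X[keep_cells]

    # Expression(j): mean count over the remaining cells (assumed order).
    keep_genes = X_cells.mean(axis=0) > t4
    return X_cells[:, keep_genes], keep_cells, keep_genes
```

The returned masks make it possible to carry cell and gene labels along with the filtered matrix.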

3. Data Normalization:

Normalize each cell's counts to counts per million, so that cells with different total counts are comparable.

$$X'_{ij} = \frac{X_{ij}}{\sum_{j} X_{ij}} \times 10^6$$

4. Data Transformation:

Log-transform the normalized expression data.

$$X''_{ij} = \log_2(X'_{ij} + 1)$$
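Normalization and log transformation can be combined in a few lines. The sketch below assumes a cells x genes count matrix; the function name `cpm_log2` is hypothetical, and the small floor on the cell totals is an added safeguard that is not part of the formulas above.

```python
import numpy as np

def cpm_log2(X):
    """Scale each cell (row) to 1e6 total counts, then apply log2(x + 1)."""
    X = np.asarray(X, dtype=float)
    totals = X.sum(axis=1, keepdims=True)
    # Floor avoids division by zero for empty cells (added safeguard).
    X_norm = X / np.maximum(totals, 1e-12) * 1e6
    return np.log2(X_norm + 1.0)
```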

Step 2: NMF Decomposition

1. NMF Decomposition:

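Decompose the preprocessed expression matrix $X$ into two non-negative matrices $W$ and $H$.

$$X \approx WH$$

Where $W \in \mathbb{R}^{n \times k}$ and $H \in \mathbb{R}^{k \times m}$. Because Step 3 reads cell $i$ from the column $H_{:,i}$, $X$ is arranged here as genes by cells ($n$ genes, $m$ cells, $k$ factors), which is the transpose of the cell-by-gene indexing $X_{ij}$ used in Step 1.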

2. Objective to Minimize Reconstruction Error:

$$\min_{W, H} ||X - WH||_F^2 \quad \text{subject to } W \geq 0, \; H \geq 0$$

3. Optimization Algorithms:

Use Alternating Least Squares (ALS) or Multiplicative Update Rules (MUR). The multiplicative updates are:

$$W_{ij} \leftarrow W_{ij} \frac{(XH^T)_{ij}}{(WHH^T)_{ij}}$$
$$H_{ij} \leftarrow H_{ij} \frac{(W^TX)_{ij}}{(W^TWH)_{ij}}$$
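The multiplicative update rules translate almost directly into NumPy. The following is a minimal sketch assuming a dense, non-negative $X$; the function name `nmf_multiplicative`, the random initialization, the fixed iteration count, and the small `eps` added to the denominators are illustrative choices that the text does not specify.

```python
import numpy as np

def nmf_multiplicative(X, k, n_iter=200, eps=1e-10, seed=0):
    """Factor a non-negative X (n x m) into W (n x k) and H (k x m)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    n, m = X.shape
    # Strictly positive random start (assumed; other initializations exist).
    W = rng.random((n, k)) + eps
    H = rng.random((k, m)) + eps
    for _ in range(n_iter):
        # Elementwise updates; eps keeps the denominators away from zero.
        W *= (X @ H.T) / (W @ H @ H.T + eps)
        H *= (W.T @ X) / (W.T @ W @ H + eps)
    return W, H
```

Because every factor in the updates is non-negative, $W$ and $H$ stay non-negative without any explicit projection step.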

Step 3: Constructing the Cell Similarity Graph

1. Construct Cell Similarity Graph:

Use the weight matrix $H$ from NMF.

$$\mathbf{h}_i = H_{:,i}$$

2. Calculate Similarity Measures:

Calculate pairwise cell similarities using cosine similarity or Pearson correlation.

$$\text{CosSim}(\mathbf{h}_i, \mathbf{h}_j) = \frac{\mathbf{h}_i \cdot \mathbf{h}_j}{||\mathbf{h}_i|| \cdot ||\mathbf{h}_j||}$$
$$\text{Pearson}(\mathbf{h}_i, \mathbf{h}_j) = \frac{\text{Cov}(\mathbf{h}_i, \mathbf{h}_j)}{\sigma_{\mathbf{h}_i} \sigma_{\mathbf{h}_j}}$$

3. Construct Cell Similarity Graph $G$:

Edges in $G$ represent similarities above a threshold.

$$E = \{ (i, j) \mid \text{Sim}(\mathbf{h}_i, \mathbf{h}_j) > \text{threshold} \}$$
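A possible way to build the thresholded graph from $H$ with cosine similarity is sketched below. The function name `cosine_similarity_graph` is hypothetical, and two choices are assumptions rather than part of the text: self-loops are dropped, and each retained edge keeps its similarity as a weight.

```python
import numpy as np

def cosine_similarity_graph(H, threshold):
    """H: k x m matrix whose columns are cells. Returns an m x m adjacency."""
    H = np.asarray(H, dtype=float)
    norms = np.linalg.norm(H, axis=0)
    # Unit-normalize columns; the floor protects all-zero columns.
    Hn = H / np.maximum(norms, 1e-12)
    S = Hn.T @ Hn                      # pairwise cosine similarities
    np.fill_diagonal(S, 0.0)           # no self-loops (assumption)
    # Keep only edges above the threshold, weighted by similarity.
    return np.where(S > threshold, S, 0.0)
```

Swapping in Pearson correlation would amount to centering each column of $H$ before the normalization step.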

Step 4: Graph-Based Clustering (Leiden Algorithm)

1. Local Moving:

Initialization:

Each node starts as an independent community.

Local Optimization:

For each node $i$, calculate the modularity gain $\Delta Q$ when moved to a neighboring community $C_j$:

$$\Delta Q = \left[ \frac{\sum_{u,v \in C_j} w_{uv} + k_i^{\text{in}}}{2m} - \left( \frac{\sum_{u \in C_j} k_u + k_i}{2m} \right)^2 \right] - \left[ \frac{\sum_{u,v \in C_i} w_{uv} - k_i^{\text{in}}}{2m} - \left( \frac{\sum_{u \in C_i} k_u - k_i}{2m} \right)^2 \right]$$

Where $w_{uv}$ is the edge weight between $u$ and $v$, $k_i^{\text{in}}$ is the total weight of the edges linking node $i$ to the community in question, $k_i$ is the degree of node $i$, and $m$ is the total edge weight of the graph.

Move Nodes:

Move the node to the community that maximizes $\Delta Q$ and repeat until no further improvement.

2. Refinement Phase:

Community Merging:

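Treat communities as supernodes and construct a new graph in which supernodes are connected by aggregated edge weights. In the Leiden algorithm this aggregation uses a refined partition: before merging, each community found by local moving may be split into well-connected sub-communities.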
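The supernode construction can be written as a matrix product with a one-hot membership matrix. This sketch covers only the aggregation of edge weights, not the Leiden refinement itself; the function name `aggregate_graph` is hypothetical, and a symmetric adjacency matrix is assumed.

```python
import numpy as np

def aggregate_graph(A, labels):
    """Collapse each community into one supernode with summed edge weights."""
    A = np.asarray(A, dtype=float)
    labels = np.asarray(labels).ravel()
    communities, idx = np.unique(labels, return_inverse=True)
    # S[i, c] = 1 when node i belongs to community c.
    S = np.zeros((A.shape[0], communities.size))
    S[np.arange(A.shape[0]), idx] = 1.0
    # Off-diagonal: total weight between communities.
    # Diagonal: twice the internal weight, since A counts each edge twice.
    return S.T @ A @ S, communities
```

Keeping the internal weight on the diagonal preserves each supernode's total degree, so modularity computed on the aggregated graph matches the original graph.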

3. Fine-Tuning Phase:

Apply Local Optimization Again:

Optimize modularity in the new graph of supernodes.

Modularity Definition:

Modularity $Q$:

$$Q = \frac{1}{2m} \sum_{i,j} \left[ w_{ij} - \frac{k_i k_j}{2m} \right] \delta(C_i, C_j)$$

Where $w_{ij}$ is the edge weight between nodes $i$ and $j$, $k_i$ and $k_j$ are the degrees of nodes $i$ and $j$, and $\delta(C_i, C_j)$ is the Kronecker delta function indicating whether nodes $i$ and $j$ are in the same community.
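The modularity definition and the local moving phase can be illustrated together. The sketch below evaluates $Q$ directly from the definition and obtains the gain of a candidate move by recomputing $Q$, which is simple but far slower than the incremental $\Delta Q$ used in practice. The function names `modularity` and `local_moving`, the sweep limit, and the tie tolerance are assumptions, and the refinement phase of Leiden is not included.

```python
import numpy as np

def modularity(A, labels):
    """Q for a symmetric weighted adjacency matrix A and community labels."""
    A = np.asarray(A, dtype=float)
    labels = np.asarray(labels)
    two_m = A.sum()                       # equals 2m for symmetric A
    k = A.sum(axis=1)                     # weighted degrees
    same = labels[:, None] == labels[None, :]
    return ((A - np.outer(k, k) / two_m) * same).sum() / two_m

def local_moving(A, max_sweeps=20):
    """Greedy local moving: each node starts in its own community."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    labels = np.arange(n)
    for _ in range(max_sweeps):
        moved = False
        for i in range(n):
            best_label, best_q = labels[i], modularity(A, labels)
            # Only communities of neighboring nodes are candidates.
            for c in np.unique(labels[A[i] > 0]):
                trial = labels.copy()
                trial[i] = c
                q = modularity(A, trial)  # gain = q - current Q
                if q > best_q + 1e-12:
                    best_label, best_q = c, q
            if best_label != labels[i]:
                labels[i] = best_label
                moved = True
        if not moved:
            break
    return labels
```

On a small graph made of two triangles joined by a single edge, this procedure groups each triangle into its own community.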


Step 5: Evaluation and Annotation of Metacells

1. Statistical Metrics Calculation:

Calculate summary metrics for each metacell $C_i$: the number of cells it contains and the number of genes expressed in at least one of its cells.

$$\text{Cell Number}(C_i) = |C_i|$$
$$\text{Expressed Genes}(C_i) = \sum_{j} \mathbb{I}\left(\sum_{k \in C_i} X_{kj} > 0\right)$$
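Both metrics are simple reductions over the cells of each metacell. The sketch below assumes a cells x genes matrix and one metacell label per cell; the function name `metacell_metrics` and the dictionary layout of the result are hypothetical.

```python
import numpy as np

def metacell_metrics(X, labels):
    """Return {metacell: {"cell_number": ..., "expressed_genes": ...}}."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    metrics = {}
    for c in np.unique(labels):
        members = labels == c
        # A gene counts as expressed if any member cell has a nonzero count.
        expressed = (X[members].sum(axis=0) > 0).sum()
        metrics[c.item()] = {
            "cell_number": int(members.sum()),
            "expressed_genes": int(expressed),
        }
    return metrics
```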

2. Differential Expression Analysis:

Identify genes that are differentially expressed in each metacell.

$$\text{DE}(C_i) = \{ g \mid \text{avg}_{(i)}(X''_{:,g}) > \text{avg}_{(\text{others})}(X''_{:,g}) \}$$
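The criterion above compares a gene's mean log-expression inside a metacell with its mean in all other cells. A minimal sketch of that comparison follows; the function name `de_genes` is hypothetical, and the rule is implemented exactly as written, with no fold-change cutoff or statistical test, which a practical analysis would likely add.

```python
import numpy as np

def de_genes(X_log, labels, cluster):
    """Indices of genes whose mean in `cluster` exceeds the mean elsewhere."""
    X_log = np.asarray(X_log, dtype=float)   # cells x genes, log-transformed
    in_cluster = np.asarray(labels) == cluster
    mean_in = X_log[in_cluster].mean(axis=0)
    mean_out = X_log[~in_cluster].mean(axis=0)
    return np.flatnonzero(mean_in > mean_out)
```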

3. Functional Enrichment Analysis:

Perform functional enrichment analysis on marker genes.

$$\text{Enrichment}(\text{DE}(C_i)) = \text{GO}(\text{DE}(C_i)), \text{KEGG}(\text{DE}(C_i))$$
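The text does not say how enrichment against GO or KEGG terms is scored. One common choice is an over-representation test based on the hypergeometric distribution, sketched here with the standard library only. The function name and its arguments are hypothetical, and using this particular test is an assumption rather than the method stated above.

```python
from math import comb

def hypergeom_enrichment_p(n_background, n_in_term, n_selected, n_overlap):
    """P(overlap >= n_overlap) when n_selected genes are drawn at random.

    n_background: number of genes considered in total
    n_in_term:    number of background genes annotated with the term
    n_selected:   number of marker genes, e.g. the size of DE(C_i)
    n_overlap:    number of marker genes annotated with the term
    """
    upper = min(n_in_term, n_selected)
    total = comb(n_background, n_selected)
    tail = sum(
        comb(n_in_term, i) * comb(n_background - n_in_term, n_selected - i)
        for i in range(n_overlap, upper + 1)
    )
    return tail / total
```

When many terms are tested for each metacell, the resulting p-values would normally be corrected for multiple testing.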

4. Cell Type Comparison:

Compare metacells with known cell types or states.

$$\text{Annotation}(C_i) = \text{Compare}(\text{DE}(C_i), \text{known markers})$$
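The `Compare` step is left open in the text. One simple possibility is to score each known cell type by the Jaccard overlap between its marker set and the metacell's DE genes, as in the sketch below. The function name `annotate`, the Jaccard score, and the marker dictionary in the usage lines are illustrative assumptions.

```python
def annotate(de_genes, known_markers):
    """Return the best-matching cell type and all Jaccard scores."""
    de = set(de_genes)
    scores = {}
    for cell_type, markers in known_markers.items():
        markers = set(markers)
        union = de | markers
        # Jaccard index: shared genes divided by all genes in either set.
        scores[cell_type] = len(de & markers) / len(union) if union else 0.0
    best = max(scores, key=scores.get)
    return best, scores

# Illustrative marker sets only; a real analysis would use a curated list.
example_markers = {
    "T cell": ["CD3D", "CD3E", "CD2"],
    "B cell": ["CD79A", "MS4A1"],
}
best_type, all_scores = annotate(["CD3D", "CD3E", "IL7R"], example_markers)
```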

5. Visualization:

Visualize metacells using dimensionality reduction techniques such as t-SNE or UMAP.

$$\text{t-SNE}(H), \quad \text{UMAP}(H)$$