Hi,这里是有朴的第二大脑。
很高兴与你相遇
Homepage Archives Tags About Me Links

「转录组」001|WGCNA专题:实战原理两不误

Hello,这里是即将开学的陈有朴。

表达矩阵的处理

后续分析所用到的数据,均为FPKM标准化后的表达矩阵。

从流程上对WGCNA进行解读

1)当对芯片数据或者RNA-Seq等数据完成分析之后,我们可以得到一张表达矩阵

一行为一个样本,一列为一个gene。

2)读取表达矩阵之后,对其进行adjacency矩阵的计算

adjacency矩阵,基于gene之间相关性的矩阵。每一个单元代表了2个gene之间的关联性(similarity)。

adjacency矩阵的构建,涉及到软阈值的选择,即构建一个幂函数,选择一个指数(power),强化强相关性,弱化弱相关性。

「adjacency矩阵的拓展」

adjacency矩阵有2种计算方式,

1)unsigned:aij=cor(xx,xj)βa_{ij} = |cor(x_{x}, x_{j})|^{β}

2)signed:aij=(0.5(1+cor))βa_{ij}=(0.5*(1+cor))^{β}

第一种情况,认为具有高负相关的gene之间,是有联系的。

第二种情况,认为具有高负相关的gene之间,是没有联系的。

举个例子,

假设cor = -1,β=2,unsigned情况下,计算得到的adjacency为1,即gene之间高度关联。signed情况下,计算得到的adjacency为0,即gene之间无关联。

3)TOM矩阵的构建

引入一个指标,即topological overlaps的计算,用于定义该gene是否有多个高关联性的gene(用于gene module的构建

Cor(xi,xj)>TOM(xi,xj)Cor(x_{i},x_{j}) -> TOM(x_{i}, x_{j})

「TOM矩阵的拓展」

但是针对unsigned类型的adjacency矩阵,可能会出现无法判断几个基因之间的关联关系。

比如现有i,j,k 3个gene,计算得到它们之间的关联关系为(+,+,)(+,+,-)

1)i和j之间为正相关;2)i和k之间为正相关;3)j和k之间为负相关。

这就矛盾了,说明上面的关联关系可能受到了一些噪音的影响等等。

此时,引入signed TOM,其相较于unsigned TOM能够更好的解决gene之间的冲突情况。

Note:需要注意的是,当使用signed adjacency进行分析时,不会出现上述矛盾。

同时,这些矛盾作者都已经考虑到了,但还是有一定区别,

  • TOMsimilarity()中,默认设置为TOMType = "unsigned"
  • TOMsimilarityFromExpr()中,默认设置为TOMType = "signed"

4)TOM dissimilarity矩阵的构建

TOM dissimilarity矩阵代表了某个gene与其他gene之间的距离。

用上述矩阵进行聚类分析,得到gene module。

5)初步构建gene module

以TOM dissimilarity矩阵作为输入,进行聚类分析。

代码如下,

# Turn data expression into topological overlap matrix
power=sft$powerEstimate   # 使用sft$powerEstimate调用预估出的软阈值
TOM = TOMsimilarityFromExpr(datExpr, power = power)
dissTOM = 1-TOM
# Plot gene tree
geneTree = hclust(as.dist(dissTOM), method = "average");   # 用于后续cutreeDynamic(),对gene tree进行裁剪
# pdf(file = "3-gene_cluster.pdf", width = 12, height = 9);
plot(geneTree, xlab="", sub="", main = "Gene clustering on TOM-based dissimilarity",
     labels = FALSE, hang = 0.04);
# dev.off()

6)使用cutreeDynamic()进行聚类分析的优化

使用pairwise eigengene,进一步计算得到eigengene dissimilarity,用于聚类分析,筛选指标,最终合并gene module。

Note:eigengene为gene表达模式的指标(PCA降维得到的第一个主成分)

cutreeDynamic()代码如下,

  • cutreeDynamic
  • labels2colors
  • plotDendroAndColors
# Module identification using dynamic tree cut
dynamicMods = cutreeDynamic(dendro = geneTree, distM = dissTOM, deepSplit = 2, 
                            pamRespectsDendro = FALSE, minClusterSize = 30);
table(dynamicMods)
length(table(dynamicMods)) 
# Convert numeric labels into colors
dynamicColors = labels2colors(dynamicMods)
table(dynamicColors)
# Plot the dendrogram and colors underneath
# pdf(file = "4-module_tree.pdf", width = 8, height = 6);
plotDendroAndColors(geneTree, dynamicColors, "Dynamic Tree Cut",dendroLabels = FALSE,
                    hang = 0.03,addGuide = TRUE, guideHang = 0.05,main = "Gene dendrogram and module colors")
# dev.off()

绘制图如下,

gene module合并代码如下 -> 基于cutreeDynamic()的分析结果进行聚类分析的gene module的合并

  • mergeCloseModules:合并gene module
  • plotDendroAndColors
# Merge close modules
MEDissThres=0.25
abline(h=MEDissThres, col = "red")
merge = mergeCloseModules(datExpr, dynamicColors, cutHeight = MEDissThres, verbose = 3) 
mergedColors = merge$colors  
mergedMEs = merge$newMEs  
# Plot merged module tree
# pdf(file = "5-merged_Module_Tree.pdf", width = 12, height = 9)  
plotDendroAndColors(geneTree, cbind(dynamicColors, mergedColors), 
                    c("Dynamic Tree Cut", "Merged dynamic"), dendroLabels = FALSE, 
                    hang = 0.03, addGuide = TRUE, guideHang = 0.05)  
# dev.off()
# merge$oldMEs,为数据框,行为样本,列为对应的gene module,其中的数值代表了它们之间的关联度
write.table(merge$oldMEs,file="oldMEs.txt");
write.table(merge$newMEs,file="newMEs.txt");

绘制图如下,可以看到有一些gene module被合并了,

7)将gene module与表型特征相联系

使用标准化的eigengene进行计算。
Cor(MEs,Traits) Cor(MEs, Traits)
代码如下,

  • moduleEigengenes,计算每个gene module的特征值
  • moduleTraitCor = cor(MEs, datTraits, use = "p")
  • moduleTraitPvalue = corPvalueStudent(moduleTraitCor, nSamples)
# Define numbers of genes and samples
nGenes = ncol(datExpr);
nSamples = nrow(datExpr);
# Recalculate MEs with color labels
MEs0 = moduleEigengenes(datExpr, mergedColors)$eigengenes
MEs = orderMEs(MEs0)
# r$> MEs[1:5, 1:5]
#           MElightcyan1 MEdarkolivegreen MEgreenyellow       MEcyan     MEwhite
# LH38_Z1_1    0.3093871      -0.01665288     0.1583588 -0.017809267 -0.01305706
# LH38_Z1_2    0.3318247      -0.01679628     0.1629585 -0.010450719  0.04613664
# LH38_Z1_3    0.3087032      -0.01762908     0.1527085 -0.018456470 -0.03661408
# LH38_Z2_1    0.3050361      -0.01431782     0.1589569 -0.020199995  0.06161998
# LH38_Z2_2    0.3422982      -0.01466196     0.1722949 -0.006663298  0.08192540

# 一些数据处理部分
# Read microbial data as traits
bac_traits = read.table("traits_file/b_order_234.txt", header = T, sep = "\t")
rownames(bac_traits) = bac_traits[, 1]
bac_traits = bac_traits[, -1]
# r$> bac_traits[1:5, 1:5]
#           Pseudomonadales Enterobacteriales Xanthomonadales Burkholderiales Verrucomicrobiales
# LH38_Z1_1      0.02120943       0.006338742      0.07261663      0.05920385         0.02674949
# LH38_Z1_2      0.04192444       0.009089757      0.06880071      0.05583164         0.02156440
# LH38_Z1_3      0.01393256       0.004525862      0.06961207      0.05100152         0.02189402
# LH39_Z1_1      0.11288033       0.013045132      0.07138692      0.04186106         0.01743154
# LH39_Z1_2      0.01214503       0.003410243      0.08236562      0.03733519         0.01953600


rownames(MEs) = paste(substr(rownames(MEs), 1, nchar(rownames(MEs))-1), rep(c("1", "2", "3"), 60), sep = "")
# sample names should be consistent in eigen genes and traits !!!!
bac_traits = bac_traits[match(rownames(MEs), rownames(bac_traits)), ]
table(rownames(MEs) == rownames(bac_traits))

# Calculate pearson correlation coefficients between module eigen-genes and traits
moduleTraitCor = cor(MEs, bac_traits, use = "p");
moduleTraitPvalue = corPvalueStudent(moduleTraitCor, nSamples);
write.table(moduleTraitCor,file="moduleTrait_correlation.txt");
write.table(moduleTraitPvalue,file="moduleTrait_pValue.txt");

结果图可视化代码如下,

sizeGrWindow(10,6)
# Will display correlations and their p-values
# 合并
textMatrix =  paste(signif(moduleTraitCor, 2), "\n(",
                    signif(moduleTraitPvalue, 1), ")", sep = "");
dim(textMatrix) = dim(moduleTraitCor)
# pdf("module-traits-bacteria-order.pdf", width = 100, height = 30)
par(mar = c(15, 12, 5, 5));
# Display the correlation values within a heatmap plot
labeledHeatmap(Matrix = moduleTraitCor,
               xLabels = names(bac_traits),
               yLabels = names(MEs),
               ySymbols = names(MEs),
               colorLabels = FALSE,
               colors = greenWhiteRed(50),
               textMatrix = textMatrix,  # 矩阵单元格上需要显示的信息
               setStdMargins = FALSE,
               cex.text = 0.5,
               zlim = c(-1,1),
               main = paste("Module-trait relationships"))
# dev.off()

结果图如下,

Note:右边的图例,代表相关性程度

8)Hub gene的鉴定

2种方法,

  • 与目标module有强关联性的gene(gene significance)
  • 使用module membership来鉴定关键gene,Cor(i,ME)Cor(i,ME)

一般先绘制出以module membership为x,gene significance为y的散点图。

代码如下,

# 以相关性最高的模块为例
which(moduleTraitCor == max(moduleTraitCor, na.rm = TRUE), arr.ind = TRUE)
#            row col
# MEskyblue3  25  30

Nitrosomonadales <- as.data.frame(bac_traits[, 30])
names(Nitrosomonadales) = "Nitrosomonadales"

modNames = substring(names(MEs), 3) # 去除ME前缀

# 计算module membership
# 使用的是WGCNA自带cor函数,使用皮尔逊计算相关性
geneModuleMembership = as.data.frame(cor(datExpr, MEs, use = "p"));
geneModuleMembership[1:5, 1:5]
#                MElightcyan1 MEdarkolivegreen MEgreenyellow      MEcyan       MEwhite
# Zm00001d001763 -0.045547074      -0.10334193   -0.20030129  0.22671189  0.1501121833
# Zm00001d001766  0.004752275      -0.01548396   -0.11130522 -0.22346048  0.0178421845
# Zm00001d001770 -0.286230379      -0.23340743   -0.39930773  0.22791746  0.0925346831
# Zm00001d001774  0.029505956      -0.06714112   -0.06323784 -0.34302212  0.1677206174
# Zm00001d001775 -0.029767437      -0.07642270   -0.13732818 -0.07876773 -0.0008678361

MMPvalue = as.data.frame(corPvalueStudent(as.matrix(geneModuleMembership), nSamples));
MMPvalue[1:5, 1:5]

names(geneModuleMembership) = paste("MM", modNames, sep="");
names(MMPvalue) = paste("p.MM", modNames, sep="");

# 计算gene significance
geneTraitSignificance = as.data.frame(cor(datExpr, Nitrosomonadales, use = "p"));
# head(geneTraitSignificance)
# nrow(geneTraitSignificance)
GSPvalue = as.data.frame(corPvalueStudent(as.matrix(geneTraitSignificance), nSamples));
names(geneTraitSignificance) = paste("GS.", names(Nitrosomonadales), sep="");
names(GSPvalue) = paste("p.GS.", names(Nitrosomonadales), sep="");


module = "skyblue3"
column = match(module, modNames);  # match(x, y),找到y中x的索引位置
moduleGenes = mergedColors==module;
# length(mergedColors)
# nrow(geneModuleMembership)
# nrow(geneTraitSignificance)
# table(mergedColors)
table(moduleGenes)

sizeGrWindow(7, 7);
par(mfrow = c(1,1));
verboseScatterplot(abs(geneModuleMembership[moduleGenes, column]),
                   abs(geneTraitSignificance[moduleGenes, 1]),
                   xlab = paste("Module Membership in", module, "module"),
                   ylab = "Gene significance for Nitrosomonadales",
                   main = paste("Module membership vs. gene significance\n"),
                   cex.main = 1.2, cex.lab = 1.2, cex.axis = 1.2, col = module)

散点图结果如下,可以看到module membership值越高的gene,其gene significance也越高。

其他

绘制gene module eigengenes与samples之间的热图

library("pheatmap")

# Heatmap of old module eigen-genes and samples
pdf(file="oldMEs.pdf",heigh=80,width=20)
row.names(merge$oldMEs)=names(data0)  # oldMEs,是一个矩阵
pheatmap(merge$oldMEs,cluster_col=T,cluster_row=T,show_rownames=T,show_colnames=T,fontsize=6)
dev.off()

# Heatmap of new module eigen-genes and samples
# pdf(file="newMEs.pdf",heigh=60,width=20)
row.names(merge$newMEs)=names(data0)
pheatmap(merge$newMEs,cluster_col=T,cluster_row=T,show_rownames=T,show_colnames=T,fontsize=6)
# dev.off()

参考资料

[1] https://horvath.genetics.ucla.edu/html/CoexpressionNetwork/Rpackages/WGCNA/Tutorials/

[2] https://github.com/PengYuMaize/Yu2021NaturePlants

[3] 跟着Nature Plants学数据分析:R语言WGCNA分析完整示例

[4] https://www.youtube.com/watch?v=BzYfg1lO3jw

[5] https://www.biostars.org/p/288153/

[6] https://peterlangfelder.com/2018/11/25/signed-or-unsigned-which-network-type-is-preferable/

[7] https://horvath.genetics.ucla.edu/html/CoexpressionNetwork/Rpackages/WGCNA/TechnicalReports/signedTOM.pdf