差异分析
差异分析多使用R包DEseq2 或者edgeR 。这里使用DEseq2对featureCounts的结果进行差异分析。
在windows下的RStudio使用BiocManager安装DESeq2时出现了"Bioconductor version cannot be validated; no internet connection?"这个错误,查找了一下,输入下面这两行可解决,可通过把这两行命令输入到R的配置文件中,避免每次安装都要输入。
options ( download.file.method = "libcurl" )
options ( url.method = "libcurl" )
接下来还是按照基本原则,软件的安装与R包的安装等不再赘述。
DESeq2
以下是DESeq2处理获得差异矩阵的R script,如果是用counts而不是TPM则忽略计算TPM的步骤。
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15 library ( DESeq2 )
# 读入
data <- read.table ( "final_featureCounts.txt" , header = TRUE , skip = 1 , row.names = 1 )
colnames ( data ) <- gsub ( ".bam" , "" , colnames ( data ), fixed = TRUE )
colnames ( data ) <- gsub ( "bam." , "" , colnames ( data ), fixed = TRUE )
countdata <- data [ , 6 : ncol ( data )]
# 计算TPM
KB <- data $ Length / 1000
RPK <- countdata / KB
TPM <- t ( t ( RPK ) / colSums ( RPK ) * 1000000 )
TPM <- merge ( data , as.data.frame ( TPM ), by = "row.names" , sort = FALSE )
TPM <- TPM [, 1 : ncol ( TPM )]
write.table ( TPM , "final_featureCounts.TPM.txt" , sep = "\t" , quote = FALSE , row.names = FALSE )
以上可以计算获得TPM用作分析,不过这里还是继续使用counts。
接下来读入metadata,metadata格式如下:
Group Replicate sampleid
LoGlu Rep1 SRR1374921
LoGlu Rep2 SRR1374922
HiGlu Rep1 SRR1374923
HiGlu Rep2 SRR1374924
使用以下命令读入
# 读入metadata
metadata <- read.table ( "metadata.txt" , header = TRUE )
rownames ( metadata ) <- metadata $ sampleid
# 如果需要调整样本编号顺序使与counts结果中相同
metadata [ match ( colnames ( countdata ), metadata $ sampleid ), ]
接下来使用DESeq2计算差异
1
2
3
4
5
6
7
8
9
10
11
12 # 计算差异
dds <- DESeqDataSetFromMatrix ( countData = countdata , colData = metadata , design =~ Group )
dds <- dds [ rowSums ( counts ( dds )) > 1 , ]
dds <- DESeq ( dds )
res <- results ( dds , pAdjustMethod = "fdr" , alpha = 0.05 )
res <- res [ order ( res $ padj ),]
summary ( res )
mcols ( res , use.names = TRUE )
# 保存结果
resdata <- merge ( as.data.frame ( res ), as.data.frame ( counts ( dds , normalized = TRUE )), by = "row.names" , sort = FALSE )
write.csv ( resdata , file = "LoGlu_HiGlu_mm39_diff.csv" , row.names = FALSE )
计算完后保存的结果可用于作图。
# 也可以输出差异基因
diff_gene <- subset ( res , padj < 0.05 & ( log2FoldChange > 1 | log2FoldChange < -1 ))
write.table ( x = as.data.frame ( diff_gene ), file = "results_gene_annotated_significant.txt" , sep = "\t" , quote = F , col.names = NA )
计算获得差异结果后,进入下一步画图。