使用oligo软件包处理芯片数据

本博客介绍过 Affy芯片的处理方法，其中所使用的软件包有一定的局限性，无法读取和分析一些新版Affy芯片。本文介绍oligo软件包的处理方法以解决这些问题。oligo软件包并不是新出现的软件包，只因新类型芯片的不断推出，关注它的用户越来越多。而且，除了用于Affy芯片处理外，oligo软件包还可处理NimbleGen芯片。

oligo处理芯片的原理和其他方法相同，难点在最后一步：从探针到基因注释的转换。

1 准备工作

1.1 数据下载

本文还是以Affy芯片为例，介绍oligo软件包处理芯片数据的方法。所用数据集为NCBI GSE72154，可通过HTTP或FTP下载原始数据文件。建议使用FTP下载，登录ftp.ncbi.nlm.nih.gov站点，在/geo/datasets下找到GSE72154子目录后全部下载。下载的目录中包含matrix、miniml、soft和suppl四个子目录。

1.2 软件包安装

oligo是bioconductor软件包，可以通过下面语句安装：

source("https://bioconductor.org/biocLite.R")
biocLite("oligo")

2 文件读取与信息修改

oligo读取芯片原始数据文件的函数为 read.celfiles 或 read.xysfiles (NimbleGen芯片)，可以直接读取压缩文件，一张芯片对应一个文件。使用FTP下载后CEL文件位于suppl子目录下，把GSE72154_RAW.tar解压缩为gz文件或CEL文件即可。

library(oligo)
data.dir <- "GSE72154/suppl/"
(celfiles <- list.files(data.dir, "\\.gz$"))

## [1] "GSM1856253_Bohmer_685-3_Ler-0_Rep1_AraGene-1-1-st.CEL.gz"
## [2] "GSM1856254_Bohmer_685-4_Ler-0_Rep2_AraGene-1-1-st.CEL.gz"
## [3] "GSM1856255_Bohmer_685-1_wrky43_Rep1_AraGene-1-1-st.CEL.gz"
## [4] "GSM1856256_Bohmer_685-2_wrky43_Rep2_AraGene-1-1-st.CEL.gz"

data.raw <- read.celfiles(filenames = file.path(data.dir, celfiles))

## Reading in : GSE72154/suppl//GSM1856253_Bohmer_685-3_Ler-0_Rep1_AraGene-1-1-st.CEL.gz
## Reading in : GSE72154/suppl//GSM1856254_Bohmer_685-4_Ler-0_Rep2_AraGene-1-1-st.CEL.gz
## Reading in : GSE72154/suppl//GSM1856255_Bohmer_685-1_wrky43_Rep1_AraGene-1-1-st.CEL.gz
## Reading in : GSE72154/suppl//GSM1856256_Bohmer_685-2_wrky43_Rep2_AraGene-1-1-st.CEL.gz

read.celfiles 函数读取数据的过程中会检查系统中是否已安装相应的芯片注释软件包，没有安装的话将有警告。如果在读取的过程中没有自动安装相应注释软件包，请手动安装后再重新读取数据。

本例含4张芯片，文件名上面代码已列出，含野生型和突变体各两个重复。样品名称默认为文件名，需简单修改：

treats <- strsplit("Ler Ler wrky wrky", " ")[[1]]
(snames <- paste(treats, 1:2, sep = ""))

## [1] "Ler1"  "Ler2"  "wrky1" "wrky2"

sampleNames(data.raw) <- snames
pData(data.raw)$index <- treats
sampleNames(data.raw)

## [1] "Ler1"  "Ler2"  "wrky1" "wrky2"

pData(data.raw)

##       index
## Ler1    Ler
## Ler2    Ler
## wrky1  wrky
## wrky2  wrky

样品名称与芯片文件必需对应，不能有重复；pData的index数据表示处理类型。

3 探针水平分析

原始芯片数据读入后应简单评估一下芯片的质量，这在其他文章中介绍过，oligo中也提供了很多方法，如果样品重复量足够，boxplot是比较直观的方法：

fit1 <- fitProbeLevelModel(data.raw)
boxplot(fit1, names = NA, col = as.factor(treats))
legend("topright", legend = unique(treats), fill = as.factor(unique(treats)),box.col = NA, bg = "white", inset = 0.01)

当然也可以使用MAplot：

par(mfrow = c(2, 2))
MAplot(data.raw[, 1:4], pairs = FALSE)

4 表达量计算

含背景处理、归一化和表达量计算三个步骤，可选组合方法很多。下面就用通用的三合一处理函数rma一步完成：

data.eset <- rma(data.raw)

## Background correcting
## Normalizing
## Calculating Expression

得到eset类型的数据，用exprs函数可以提取其中的表达量矩阵：

data.exprs <- exprs(data.eset)
str(data.exprs)

##  num [1:38408, 1:4] 6.52 6.67 6.52 6.21 7.07 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:38408] "13320001" "13320003" "13320005" "13320007" ...
##   ..$ : chr [1:4] "Ler1" "Ler2" "wrky1" "wrky2"

head(data.exprs)

##              Ler1     Ler2    wrky1    wrky2
## 13320001 6.515240 6.975158 6.163749 6.121993
## 13320003 6.670580 6.816606 5.808836 5.653102
## 13320005 6.522566 7.396850 7.011411 6.604062
## 13320007 6.206310 6.538738 5.477303 6.572429
## 13320009 7.071985 7.015605 6.426396 6.932439
## 13320011 6.558686 6.333987 6.670292 6.277999

5 P/A过滤

目的是去除“不表达”的基因/探针数据，使用paCalls函数，选取p值小于0.05的探针：

xpa <- paCalls(data.raw)
head(xpa)

##           Ler1       Ler2       wrky1       wrky2
## 7  0.477083333 0.80416667 0.327083333 0.689583333
## 8  0.550619835 0.73347107 0.786157025 0.361570248
## 11 0.327731092 0.01890756 0.231092437 0.189075630
## 14 0.005138746 0.01130524 0.005138746 0.005138746
## 17 0.057851240 0.07231405 0.045454545 0.027892562
## 20 0.013360740 0.01027749 0.051387461 0.006166495

AP <- apply(xpa, 1, function(x) any(x < 0.05))
xids <- as.numeric(names(AP[AP]))
head(xids)

## [1] 11 14 17 20 28 42

计算后发现P/A结果和表达量分析结果所用的探针名称是不一样的，需要使用探针信息进行转换。getProbeInfo函数可以获取相关信息：

pinfo <- getProbeInfo(data.raw)
head(pinfo)

##   fid man_fsetid
## 1   7   13507007
## 2   8   13379105
## 3  11   13402401
## 4  14   13363588
## 5  17   13492405
## 6  20   13488138

两类id转换后进行筛选：

fids <- pinfo[pinfo$fid %in% xids, 2]
head(fids)

## [1] "13402401" "13363588" "13492405" "13488138" "13349578" "13529865"

nrow(data.exprs)

## [1] 38408

data.exprs <- data.exprs[rownames(data.exprs) %in% fids, ]
nrow(data.exprs)

## [1] 35697

第一个语句筛选出表达基因的探针ID，第四个语句用筛选出的ID过滤掉不表达的基因/探针。

6 获取差异表达探针

下面使用的limma方法，更多方法请参考本博客其他文章。先根据样品名称、类型和顺序构建试验设计矩阵：

library(limma)
(design <- model.matrix(~0 + treats))

##   treatsLer treatswrky
## 1         1          0
## 2         1          0
## 3         0          1
## 4         0          1
## attr(,"assign")
## [1] 1 1
## attr(,"contrasts")
## attr(,"contrasts")$treats
## [1] "contr.treatment"

colnames(design) <- gsub("treats", "", colnames(design))
rownames(design) <- snames
design

##       Ler wrky
## Ler1    1    0
## Ler2    1    0
## wrky1   0    1
## wrky2   0    1
## attr(,"assign")
## [1] 1 1
## attr(,"contrasts")
## attr(,"contrasts")$treats
## [1] "contr.treatment"

然后将试验设计矩阵应用于表达量数据（exprs）或eset数据，获取拟合模型：

fit1 <- lmFit(data.exprs, design)
## fit1 <- lmFit(data.eset, design)

接下来进行试验处理组间的比较。因为只有两组样品，只可能是突变体和野生型间的比较：

contrast.matrix <- makeContrasts("wrky-Ler", levels = design)
fit2 <- contrasts.fit(fit1, contrast.matrix)
fit2 <- eBayes(fit2)

使用topTable函数获取表达变化倍数和相关统计值：

limma.all <- topTable(fit2, coef = 1, adjust.method = "fdr", number = 3e+05)
limma.fc2 <- topTable(fit2, coef = 1, adjust.method = "fdr", lfc = 1, number = 3e+05)
nrow(limma.all)

## [1] 35697

nrow(limma.fc2)

## [1] 1028

head(limma.fc2)

##             logFC   AveExpr        t      P.Value  adj.P.Val        B
## 13356037 4.540665 10.780301 39.28527 1.070928e-06 0.03822893 3.571807
## 13348341 4.206080  6.004308 31.68465 2.710592e-06 0.04838000 3.402196
## 13348348 3.569355  5.319289 21.22435 1.522879e-05 0.09541435 2.882561
## 13400731 3.029046  6.347611 20.99470 1.595751e-05 0.09541435 2.863891
## 13474474 3.259238  5.847156 20.82997 1.650674e-05 0.09541435 2.850205
## 13454438 2.750208  4.778045 19.55321 2.165705e-05 0.09541435 2.735128

topTable函数中可以直接使用p.value产生进行p值过滤，但由于本数据集生物学重复数太少，上面代码仅用表达变化倍数进行过滤。number参数用于限定结果的最大行数。

7 探针到基因的转换

有多种方法，下面介绍两种。

7.1 方法1：使用芯片注释文件

每版芯片都会有专门文件对探针进行说明，这些文件多数是公开的。前面通过NCBI FTP下载GEO芯片数据集时已提到matrix、miniml、soft和suppl四个子目录。miniml和soft目录都包含有平台相关的信息文件，其中soft目录下的GSE72154_family.soft.gz文件含有表头，可以使用该文件：

ff <- "GSE72154/soft/GSE72154_family.soft.gz"
nn <- grep("^[^#!^]", readLines(ff))[1] - 1
pfinfo <- read.table(ff, sep = "\t", quote = "", header = TRUE, skip = nn, fill = TRUE)
colnames(pfinfo)

##  [1] "ID"                    "seqname"
##  [3] "strand"                "start"
##  [5] "stop"                  "total_probes"
##  [7] "gene_assignment"       "mrna_assignment"
##  [9] "GB_ACC"                "ORF"
## [11] "SPOT_ID"               "swissprot"
## [13] "GO_biological_process" "GO_cellular_component"
## [15] "GO_molecular_function" "protein_domains"
## [17] "crosshyb_type"         "category"

可以看到表格各列所包含的内容。现阶段我们只需要ID和ORF列，即探针编号和它们所检测的mRNA。提取信息后做一些必要的处理

pfinfo <- pfinfo[, c(1, 10)]
pfinfo$ORF <- toupper(pfinfo$ORF)
## 删除重复信息
pfinfo <- pfinfo[!duplicated(pfinfo), ]
## 删除非mRNA探针
pfinfo <- pfinfo[pfinfo$ORF != "", ]
rownames(pfinfo) <- pfinfo$ID
nrow(pfinfo)

## [1] 20657

得到以上基本信息后就可以添加到差异表达列表中。首先去除没有mRNA信息的探针：

result1 <- limma.fc2[rownames(limma.fc2) %in% pfinfo$ID, ]
nrow(result1)

## [1] 214

把AGI信息添加到表达列表，保存表格文件即可：

result1$agi <- pfinfo[rownames(result1), 2]
head(result1)

##              logFC   AveExpr         t      P.Value  adj.P.Val        B
## 13356037  4.540665 10.780301  39.28527 1.070928e-06 0.03822893 3.571807
## 13474474  3.259238  5.847156  20.82997 1.650674e-05 0.09541435 2.850205
## 13454438  2.750208  4.778045  19.55321 2.165705e-05 0.09541435 2.735128
## 13335452 -2.729114  9.397657 -18.21499 2.935056e-05 0.09541435 2.594936
## 13376651 -2.154874  7.127874 -16.55810 4.414524e-05 0.09541435 2.387110
## 13464968  1.667566  8.680761  14.94084 6.845745e-05 0.09541435 2.138016
##                  agi
## 13356037 AT1G67105.1
## 13474474 AT4G27940.1
## 13454438 AT3G27473.1
## 13335452 AT1G03880.1
## 13376651 AT1G47540.2
## 13464968 AT4G02555.1

write.csv(result1, "results.fc2.1.csv")

7.2 方法2：使用基因组特征文件

芯片厂商提供的探针注释文件有可能过时。如分析十多年前的试验芯片，一些基因的注释可能有很大变化。但不管怎么样，探针在基因组上的位置信息应该不会有太大的差异，应用基因组进行比对总不会大错。

Bioconductor提供了专门用于处理基因组特征文件的类和函数，相关的gff格式文件可以从各种途径免费获取。本例使用的是TAIR网站下载的拟南芥基因组特征文件。

library("rtracklayer")
ath.gff <- readGFFAsGRanges("TAIR10_GFF3_genes.gff")
class(ath.gff)

## [1] "GRanges"
## attr(,"package")
## [1] "GenomicRanges"

读入数据存储为GRanges类，其数据结构可以使用str函数查看，相关的基础知识已在 R/BioC序列处理之五：Rle和Ranges 一文中介绍过。GRanges的每一条信息都包含了很多内容，这里仅关心外显子、UTR和miRNA相关的序列：

xtype <- as.character(ath.gff@elementMetadata@listData$type)
unique(xtype)

##  [1] "chromosome"                "gene"
##  [3] "mRNA"                      "protein"
##  [5] "exon"                      "five_prime_UTR"
##  [7] "CDS"                       "three_prime_UTR"
##  [9] "miRNA"                     "tRNA"
## [11] "ncRNA"                     "pseudogene"
## [13] "pseudogenic_transcript"    "pseudogenic_exon"
## [15] "transposable_element_gene" "mRNA_TE_gene"
## [17] "snoRNA"                    "snRNA"
## [19] "rRNA"

sels <- grepl("exon|utr|miRNA", xtype, ignore.case = TRUE)
ath.gff <- ath.gff[sels]

接下来获取探针在基因组上的位置信息。这一步可以用前面的芯片注释文件数据，也可以使用随CEL文件读取时载入的数据。类似数据在前面P/A过滤部分已经用到过，但这里要提取更多的信息：

availProbeInfo(data.raw)

## $pmfeature
## [1] "fid"    "fsetid" "atom"   "x"      "y"
##
## $featureSet
## [1] "fsetid"        "strand"        "start"         "stop"
## [5] "exon_id"       "crosshyb_type" "level"         "chrom"
## [9] "type"
##
## $core_mps
## [1] "meta_fsetid"           "transcript_cluster_id" "fsetid"

pinfo <- getProbeInfo(data.raw, field = strsplit("fid chrom strand start stop exon_id"," ")[[1]])
head(pinfo)

##   fid man_fsetid strand    start     stop   exon chrom
## 1   7   13507007      0 11107381 11107491 884614  chr5
## 2   8   13379105      1 19981386 19981486 778732  chr1
## 3  11   13402401      0 16131514 16131647 798181  chr2
## 4  14   13363588      1  1457201  1457469 765883  chr1
## 5  17   13492405      1 15589898 15590906 872458  chr4
## 6  20   13488138      1 11944077 11944154 868967  chr4

在应用数据前做一些必要的处理：

pinfo <- pinfo[grepl("^chr", pinfo$chrom), ]
pinfo$chrom <- gsub("^c", "C", pinfo$chrom)
pinfo$strand <- sapply(pinfo$strand, FUN = function(x) {if (x == 0)"+" else "-"
})

然后构建探针对应的GRanges类对象：

xrange <- GRanges(seqnames = pinfo$chrom, ranges = IRanges(start = pinfo$start,end = pinfo$stop, names = pinfo$man_fsetid), strand = pinfo$strand)

正式比对并将产生探针-基因注释数据：

xlap <- findOverlaps(xrange, ath.gff, select = "all", type = "within")
anno <- data.frame(fid = names(xrange)[queryHits(xlap)], agi = ath.gff@elementMetadata$Parent@unlistData[subjectHits(xlap)])
anno <- anno[!duplicated(anno), ]
head(anno)

##        fid         agi
## 1 13507007 AT5G29044.1
## 2 13379105 AT1G53541.1
## 3 13402401 AT2G38544.1
## 4 13363588 AT1G05070.1
## 5 13492405 AT4G32290.1
## 6 13488138 AT4G22740.1

有部分探针可能和基因组上多个基因对应，有必要进行合并，或者舍弃这部分数据：

anno <- aggregate(agi ~ fid, FUN = c, data = anno, simplify = TRUE)
anno$agi <- sapply(anno$agi, FUN = paste, collapse = ";")
rownames(anno) <- anno$fid

最后合并数据：

result2 <- limma.fc2[rownames(limma.fc2) %in% anno$fid, ]
result2$agi <- sapply(anno[rownames(result2), "agi"], FUN = paste, collapse = ";")
nrow(limma.fc2)

## [1] 1028

nrow(result2)

## [1] 464

write.csv(result2, "results.fc2.2.csv")

为什么最后得到的基因数量比前一种方法要多？自己分析原因吧。

如果需要进行GO/KEGG途径富集分析，还应导出芯片中的“表达”基因：

xagis <- anno[anno$fid %in% fids, "agi"]
writeLines(xagis, "results.present.genes.txt")

8 SessionInfo

print(sessionInfo(), locale = FALSE)

## R version 3.3.3 (2017-03-06)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Debian GNU/Linux 8 (jessie)
##
## attached base packages:
## [1] stats4    parallel  stats     graphics  grDevices utils     datasets
## [8] methods   base
##
## other attached packages:
##  [1] rtracklayer_1.34.2       GenomicRanges_1.26.4
##  [3] GenomeInfoDb_1.10.3      limma_3.30.13
##  [5] pd.aragene.1.0.st_3.12.0 DBI_0.6
##  [7] RSQLite_1.1-2            oligo_1.38.0
##  [9] Biostrings_2.42.1        XVector_0.14.1
## [11] IRanges_2.8.2            S4Vectors_0.12.2
## [13] Biobase_2.34.0           oligoClasses_1.36.0
## [15] BiocGenerics_0.20.0      zblog_0.1.0
## [17] knitr_1.15.1
##
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.10               BiocInstaller_1.24.0
##  [3] formatR_1.4                highr_0.6
##  [5] bitops_1.0-6               iterators_1.0.8
##  [7] tools_3.3.3                zlibbioc_1.20.0
##  [9] digest_0.6.12              bit_1.1-12
## [11] evaluate_0.10              memoise_1.0.0
## [13] preprocessCore_1.36.0      lattice_0.20-35
## [15] ff_2.2-13                  Matrix_1.2-8
## [17] foreach_1.4.3              stringr_1.2.0
## [19] affxparser_1.46.0          grid_3.3.3
## [21] BiocParallel_1.8.1         XML_3.98-1.5
## [23] magrittr_1.5               GenomicAlignments_1.10.1
## [25] Rsamtools_1.26.1           codetools_0.2-15
## [27] splines_3.3.3              SummarizedExperiment_1.4.0
## [29] KernSmooth_2.23-15         stringi_1.1.3
## [31] RCurl_1.95-4.8             affyio_1.44.0

作者: ZGUANG@LZU

Created: 2017-03-27 一 18:03