R Minimal Tutorial 7: Reading Data

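Reading data is usually the first step of a data analysis. There are many ways to do it; in R the common ones are loading a previously saved RData file, reading a text file, reading an Excel file, reading from a database, and fetching data from the web.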

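Reading RData

RData is the format R uses for data you saved earlier. When saving, you can put more than one variable into a single RData file, and when you load it again they are all brought in together.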

> A
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    5    9   13   17
[2,]    2    6   10   14   18
[3,]    3    7   11   15   19
[4,]    4    8   12   16   20
> save(A,file="./A.rda")

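Above, I saved the variable A to a local file named A.rda. You can name the file whatever you like, and even the extension is arbitrary. I use .rda because Bioconductor, which I use a lot, treats that extension as its standard, but .RData, .rdata, .RDA and so on all work. There is no real difference, because anything written by save() is stored in the same format.

Next, loading the data: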

> load("./A.rda")
> ls()
[1] "A"
> A
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    5    9   13   17
[2,]    2    6   10   14   18
[3,]    3    7   11   15   19
[4,]    4    8   12   16   20
>

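The load() function loads the file. Keep in mind that the larger the file, the longer it takes both to save and to load, so think before saving or loading large files.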
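As a small illustration of storing several variables in one file, the sketch below saves two objects and loads them back. The object B and the file name AB.rda are made up for this example. load() invisibly returns the names of the objects it restored, which is a convenient way to check what a file contained.

```r
# Two objects to store together; B is a made-up second variable
A <- matrix(1:20, nrow = 4)
B <- c("x", "y", "z")

# Write both into one file in a temporary directory (hypothetical file name)
path <- file.path(tempdir(), "AB.rda")
save(A, B, file = path)

# Remove them from the workspace, then restore both with a single load()
rm(A, B)
restored <- load(path)  # character vector naming the objects that were loaded
print(restored)
print(dim(A))
```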

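Reading text files

Text files are a very common format, from txt to many others. R provides two widely used functions for reading them: read.csv and read.table.

In theory read.csv is for csv files, but in practice it is hard to say which of the two works better for a given file. Your csv or txt file may have its own quirks, such as being tab-separated or space-separated. My experience over the years is: try both, adjust the parameters, and one of them will work.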

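Let's demonstrate. First we copy the programming language ranking from the TIOBE website and paste it into a txt file:

Then we try reading it with both read.csv() and read.table():

As you can see, in theory we should use read.table here, but in the end it was read.csv() that read the file successfully. I am no expert and am not clear on the details; I usually get files read by trial and error.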

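In short, the reading command always looks like this:

A <- read.csv("file/path/XXX.file", header = TRUE, sep = ",")

Here header (which can be abbreviated to head) tells R whether the file has a title row, and sep tells R which character separates the columns.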
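To make the role of header and sep concrete, here is a minimal sketch. read.csv() is read.table() with different defaults (header = TRUE and sep = ","), which is one reason either function can be made to work on the same file. The file contents below are a made-up, tab-separated subset of the ranking, not the author's actual file.

```r
# A tiny tab-separated file standing in for the pasted ranking (made-up subset)
path <- file.path(tempdir(), "ranking.txt")
writeLines(c("Rank\tLanguage\tRatings",
             "1\tJava\t14.49%",
             "2\tC\t6.85%",
             "3\tVisual Basic .NET\t3.11%"), path)

# With an explicit separator, both functions parse the file
tab_table <- read.table(path, header = TRUE, sep = "\t", stringsAsFactors = FALSE)
tab_csv   <- read.csv(path, header = TRUE, sep = "\t", stringsAsFactors = FALSE)
print(tab_table)

# With read.table's default sep = "", any whitespace splits fields, so
# "Visual Basic .NET" becomes three fields and the call fails
default_try <- tryCatch(read.table(path, header = TRUE),
                        error = function(e) e)
print(inherits(default_try, "error"))
```

Values that contain spaces are a common reason the default read.table() call fails on pasted tables, so naming the separator explicitly is usually the first parameter worth trying.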

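Reading Excel files

Personally, I think the ideal way to read an Excel file is simply to save it as csv and read it with read.csv().

That said, there are many ways to read data directly from Excel. The gdata package, for example, is a good choice:

First we save the earlier text as an Excel file. I used Office 2016 and named the file Toibe.xlsx. One command then reads everything in the first sheet.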

> library(gdata)
> read.xls("./Toibe.xlsx",sheet=1,head=T)
   Jun.17 Jun.16 Change Programming.Language Ratings Change.1
1       1      1                        Java  14.49%   -6.30%
2       2      2                           C   6.85%   -5.53%
3       3      3                         C++   5.72%   -0.48%
4       4      4                      Python   4.33%    0.43%
5       5      5                          C#   3.53%   -0.26%
6       6      9 change    Visual Basic .NET   3.11%    0.76%
7       7      7                  JavaScript   3.03%    0.44%
8       8      6 change                  PHP   2.77%   -0.45%
9       9      8 change                 Perl   2.31%   -0.09%
10     10     12 change    Assembly language   2.25%    0.13%
11     11     10 change                 Ruby   2.22%   -0.11%
12     12     14 change                Swift   2.21%    0.38%
13     13     13        Delphi/Object Pascal   2.16%    0.22%
14     14     16 change                    R   2.15%    0.61%
15     15     48 change                   Go   2.04%    1.83%
16     16     11 change         Visual Basic   2.01%   -0.24%
17     17     17                      MATLAB   2.00%    0.55%
18     18     15 change          Objective-C   1.96%    0.25%
19     19     22 change              Scratch   1.71%    0.76%
20     20     18 change               PL/SQL   1.57%    0.22%
>

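Note that gdata depends on Perl, so Perl has to be installed on your computer before you can use it. There are other good packages too, such as xlsx, but that one needs Java, which is even more of a hassle.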

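Reading from a database

Connecting R to a database is very convenient: you read from and write to the database directly. The connection itself is also very simple, which gives R a major capability in production environments, namely working directly against a database.

First install some packages. RPostgreSQL talks to PostgreSQL, RMySQL talks to MySQL servers, and there are many other interface packages, all with roughly the same functionality. Since I have been using PostgreSQL lately, I will use it for the demonstration:

Connecting is extremely simple. After loading the package, one line does it: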

library(RPostgreSQL)
con <- dbConnect(PostgreSQL(), host="localhost", user= "joshua", password="12345678", dbname="mydb")

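The package provides the usual database operations: insert, delete, update and query. I usually upload already processed data to the database with a single command, so I don't pay much attention to the others. Suppose I have a batch of data that I want to push to the database as a new table. That also takes just one line: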

dbWriteTable(con, "MyData", MyData)

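Here the second MyData is an R object of yours; a data frame works, and a matrix probably does too. The first, quoted "MyData" is just a name: the name of the table that will hold your data inside the database.

Below I demonstrate with the iris dataset, which is used very often in R: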

> data("iris")
> head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
> dbWriteTable(con, "iris", iris)
[1] TRUE

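With dbWriteTable, the whole iris dataset is uploaded to PostgreSQL, and a table named iris is created there automatically. We can check this from the command line:

The query part has to be done in PostgreSQL. First, in bash, switch to the postgres user, then use it to log in to the PostgreSQL database, and then connect to the mydb database with the command \c mydb: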

postgres=# \c mydb;
You are now connected to database "mydb" as user "postgres".
mydb=#

The mydb=# prompt means we are connected to the mydb database. Next, use the \d command to list all tables:

mydb=# \d
        List of relations
 Schema | Name | Type  | Owner
--------+------+-------+--------
 public | iris | table | joshua
(1 row)

As you can see, a table named iris has appeared. We can look at its contents with the most common kind of SQL query:

mydb=# select * from iris limit 6;
 row.names | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species
-----------+--------------+-------------+--------------+-------------+---------
 1         |          5.1 |         3.5 |          1.4 |         0.2 | setosa
 2         |          4.9 |           3 |          1.4 |         0.2 | setosa
 3         |          4.7 |         3.2 |          1.3 |         0.2 | setosa
 4         |          4.6 |         3.1 |          1.5 |         0.2 | setosa
 5         |            5 |         3.6 |          1.4 |         0.2 | setosa
 6         |          5.4 |         3.9 |          1.7 |         0.4 | setosa
(6 rows)

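As you can see, the result matches what we had in R. If pgAdmin is installed on your computer, you can also inspect the table there. This part is excerpted from a blog post I wrote earlier.

That was writing data to a PostgreSQL database. Next is reading data back from PostgreSQL.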

iris <- dbGetQuery(con, "SELECT * from iris")

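This command reads the data. In essence, you run a SQL command from R and get the corresponding data back.

In short, connecting R to a database takes three steps: 1. load the package, 2. connect to the database, 3. operate on the data.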
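The three steps can be sketched end to end as follows. RPostgreSQL is built on the DBI interface, so the same calls apply to other backends. To keep the sketch runnable without a database server, it swaps in an in-memory SQLite database through the RSQLite package, which is not used in the original text and is only a stand-in here. The guard skips the example when DBI or RSQLite is not installed.

```r
if (requireNamespace("DBI", quietly = TRUE) &&
    requireNamespace("RSQLite", quietly = TRUE)) {
  # Steps 1 and 2: use the DBI interface and connect
  # (in-memory SQLite is a stand-in for the PostgreSQL server)
  con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")

  # Step 3: write a data frame as a table, then query it back with SQL
  DBI::dbWriteTable(con, "iris", iris)
  first_rows <- DBI::dbGetQuery(con, "SELECT * FROM iris LIMIT 6")
  print(first_rows)

  # Close the connection when finished
  DBI::dbDisconnect(con)
}
```

Closing the connection with dbDisconnect() at the end is a step the walkthrough above leaves out; it is generally worth doing so that connections do not pile up on the server.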

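Fetching data from the web

This part is hard to write, because going deep would amount to a whole treatment of web crawling in R. Broadly, "data on the web" comes in two kinds. One is data that already exists as a downloadable file; you just haven't downloaded it and want to read it directly with R. The other is an ordinary web page from which you want to extract detailed information and data, which is what a web crawler does (this is not necessarily legal; it sits in something of a gray area).

Let's start with the first case: data that is already available online, which you can't be bothered to download and so read directly with R.

Honestly, I don't think this situation is very common. R also cannot read every kind of file, so the remote file has to be in a format R supports, which limits things a lot: a csv file works, a Word file probably does not.

One tool for this is the fread function in the data.table package: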

> library(data.table)
> mydat <- fread('http://www.stats.ox.ac.uk/pub/datasets/csb/ch11b.dat')
trying URL 'http://www.stats.ox.ac.uk/pub/datasets/csb/ch11b.dat'
Content type 'application/x-ns-proxy-autoconfig' length 2102 bytes
downloaded 2102 bytes
> head(mydat)
   V1  V2   V3    V4 V5
1:  1 307  930 36.58  0
2:  2 307  940 36.73  0
3:  3 307  950 36.93  0
4:  4 307 1000 37.15  0
5:  5 307 1010 37.23  0
6:  6 307 1020 37.24  0
>

This successfully reads the content at http://www.stats.ox.ac.uk/pub/datasets/csb/ch11b.dat. The example is adapted from a linked source.
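Base R's read.table() also accepts a URL in place of a file path, so the same idea works without data.table. Because remote reads can fail, the sketch below wraps the call in tryCatch. The helper name read_table_safely is hypothetical, and the demonstration uses a small local file (two rows copied from the output above) so that it runs offline.

```r
# Hypothetical helper: read a table from a path or URL, returning NULL
# (with a message) instead of stopping when the source cannot be read
read_table_safely <- function(src, ...) {
  tryCatch(
    read.table(src, ...),
    error = function(e) {
      message("could not read ", src, ": ", conditionMessage(e))
      NULL
    }
  )
}

# Offline demonstration with a small local file
path <- file.path(tempdir(), "demo.dat")
writeLines(c("1 307 930 36.58 0", "2 307 940 36.73 0"), path)
local_dat <- read_table_safely(path)
print(local_dat)

# A source that does not exist yields NULL rather than an error
missing_dat <- read_table_safely(file.path(tempdir(), "no_such_file.dat"))
print(is.null(missing_dat))

# With network access, the same call would take the URL from the text:
# mydat <- read_table_safely("http://www.stats.ox.ac.uk/pub/datasets/csb/ch11b.dat")
```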

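Web crawlers

Web crawling in R means using R to parse HTML pages one by one and pull down the information on them. Many data companies today simply scrape data at massive scale, some with Python, some with R; there are many ways to do it. As for whether it is legal, I do know that Weibo has sued people who scraped Weibo data without permission.

There is a lot to this topic, enough to fill a textbook. Here I implement a tiny crawler that scrapes gene information.

First install two R packages. RCurl is a well-known package for fetching data from the web, and XML is used to parse HTML pages: those colorful HTML pages, once processed by XML, take on a regular structure in which you can find the data you need. Admittedly pages differ a lot, but pages from the same website generally share a format, so you can mine all sorts of information from them.

For example, let's look at this page: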
https://www.ncbi.nlm.nih.gov/gene/2890

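This page describes the gene GRIA1. Humans have more than twenty thousand genes, and scientists are working day and night to decipher their functions. Whenever the function of some gene is worked out, NCBI updates that gene's information. Suppose we want the Summary (the part I boxed in the figure) for a whole series of genes. What should we do?

First, we notice that the gene pages on NCBI follow a pattern: every address starts with https://www.ncbi.nlm.nih.gov/gene/ followed by a number. This is not an arbitrary number but the ID of the gene.

So we can sum up a few prerequisites for crawling. First, the site needs some regularity; a chaotic site can't be crawled. Second, each piece of information you want must correspond to a definite URL: to get a gene's information, you have to know which page corresponds to which gene.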
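The URL pattern described above can be turned into code directly. This is a minimal sketch that only builds the addresses and does not contact the site; the variable names are my own.

```r
# Base address from the text, plus a few of the gene IDs used in this tutorial
base_url <- "https://www.ncbi.nlm.nih.gov/gene/"
gene_ids <- c("2890", "6367", "9536")

# paste0() is vectorised, so a single call builds every URL
urls <- paste0(base_url, gene_ids)
names(urls) <- gene_ids
print(urls)
```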

Moving on: after loading the R packages, we try to get the information on this page:

library(RCurl)
library(XML)
url <- "https://www.ncbi.nlm.nih.gov/gene/2890"
webpage <- getURL(url)
Information <- readHTMLList(webpage)

Then, out of the large pile in Information, we find the piece of data we need:

> Information[[36]][10]
Summary
"Glutamate receptors are the predominant excitatory neurotransmitter receptors in the mammalian brain and are activated in a variety of normal neurophysiologic processes. These receptors are heteromeric protein complexes with multiple subunits, each possessing transmembrane regions, and all arranged to form a ligand-gated ion channel. The classification of glutamate receptors is based on their activation by different pharmacologic agonists. This gene belongs to a family of alpha-amino-3-hydroxy-5-methyl-4-isoxazole propionate (AMPA) receptors. Alternatively spliced transcript variants encoding different isoforms have been found for this gene. [provided by RefSeq, Jul 2008]"
>
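The position [[36]][10] reflects the page layout at the time of writing and may shift if the page changes. A possibly more robust alternative is to look the field up by name. The sketch below assumes that readHTMLList returned a list of character vectors, some of them named, as the printed output above suggests. The helper find_field is hypothetical, and the demonstration uses a small fake list rather than a real page.

```r
# Hypothetical helper: return the first element called `field` found in a
# list of (possibly named) character vectors, or NA if there is none
find_field <- function(lists, field) {
  for (x in lists) {
    if (field %in% names(x)) {
      return(x[[field]])
    }
  }
  NA_character_
}

# Fake stand-in for the parsed page (not real NCBI output)
fake_information <- list(
  c("Home", "About"),
  c(Symbol = "GRIA1", Summary = "Glutamate receptors are ...")
)
print(find_field(fake_information, "Summary"))
print(find_field(fake_information, "Location"))
```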

Once a single fetch works, a loop lets you do many. Suppose we want the Summary for 5 genes; we only need to write a loop to fetch them all in one go:

geneID <- c("2890","6367","9536","5833","5435")
GeneInfo <- list()
for(i in geneID)
{
    url <- paste("https://www.ncbi.nlm.nih.gov/gene/",i,sep="")
    webpage <- getURL(url)
    Information <- readHTMLList(webpage)
    GeneInfo[[i]] <- Information[[36]][10]
}

After it finishes, the information for all 5 genes is in the GeneInfo variable:

> GeneInfo
$`2890`
Summary
"Glutamate receptors are the predominant excitatory neurotransmitter receptors in the mammalian brain and are activated in a variety of normal neurophysiologic processes. These receptors are heteromeric protein complexes with multiple subunits, each possessing transmembrane regions, and all arranged to form a ligand-gated ion channel. The classification of glutamate receptors is based on their activation by different pharmacologic agonists. This gene belongs to a family of alpha-amino-3-hydroxy-5-methyl-4-isoxazole propionate (AMPA) receptors. Alternatively spliced transcript variants encoding different isoforms have been found for this gene. [provided by RefSeq, Jul 2008]"

$`6367`
Summary
"This antimicrobial gene is one of several Cys-Cys (CC) cytokine genes clustered on the q arm of chromosome 16. Cytokines are a family of secreted proteins involved in immunoregulatory and inflammatory processes. The CC cytokines are proteins characterized by two adjacent cysteines. The cytokine encoded by this gene displays chemotactic activity for monocytes, dendritic cells, natural killer cells and for chronically activated T lymphocytes. It also displays a mild activity for primary activated T lymphocytes and has no chemoattractant activity for neutrophils, eosinophils and resting T lymphocytes. The product of this gene binds to chemokine receptor CCR4. This chemokine may play a role in the trafficking of activated T lymphocytes to inflammatory sites and other aspects of activated T lymphocyte physiology. [provided by RefSeq, Sep 2014]"

$`9536`
Summary
"The protein encoded by this gene is a glutathione-dependent prostaglandin E synthase. The expression of this gene has been shown to be induced by proinflammatory cytokine interleukin 1 beta (IL1B). Its expression can also be induced by tumor suppressor protein TP53, and may be involved in TP53 induced apoptosis. Knockout studies in mice suggest that this gene may contribute to the pathogenesis of collagen-induced arthritis and mediate acute pain during inflammatory responses. [provided by RefSeq, Jul 2008]"

$`5833`
Summary
"This gene encodes an enzyme that catalyzes the formation of CDP-ethanolamine from CTP and phosphoethanolamine in the Kennedy pathway of phospholipid synthesis. Alternative splicing results in multiple transcript variants. [provided by RefSeq, May 2010]"

$`5435`
Summary
"This gene encodes the sixth largest subunit of RNA polymerase II, the polymerase responsible for synthesizing messenger RNA in eukaryotes. In yeast, this polymerase subunit, in combination with at least two other subunits, forms a structure that stabilizes the transcribing polymerase on the DNA template. Alternative splicing results in multiple transcript variants. [provided by RefSeq, Jul 2014]"

>

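In short, this is one way to scrape data from the web. Once scraping is done, you can move on to string processing and so on. The scraping can also be parallelized for speed, and when some pages fail to load, tryCatch can keep an error from halting the program. Even the most complex crawler is built up step by step from small scripts like this.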
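The tryCatch idea can be sketched as below. The helper fetch_all is hypothetical and takes the fetching function as an argument, so the pattern can be shown without network access. Here a stand-in fetcher that deliberately fails for one ID replaces the real getURL and readHTMLList calls.

```r
# Hypothetical helper: apply fetch_one to every id, recording NA on failure
# so that one bad page does not stop the whole loop
fetch_all <- function(ids, fetch_one) {
  results <- list()
  for (id in ids) {
    results[[id]] <- tryCatch(
      fetch_one(id),
      error = function(e) {
        message("skipping ", id, ": ", conditionMessage(e))
        NA_character_
      }
    )
  }
  results
}

# Stand-in fetcher that fails for one id, to show that the loop keeps going
fake_fetch <- function(id) {
  if (id == "9536") stop("page did not load")
  paste("summary for gene", id)
}

GeneInfo <- fetch_all(c("2890", "9536", "5833"), fake_fetch)
print(GeneInfo)
```

In real use, fetch_one would be a function that downloads and parses one gene page; the failed IDs remain in the result as NA, which makes them easy to retry later.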

So far I have introduced several ways of getting data into R. Getting the data is always the first step of an analysis.