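Scraping Enterprise Annual Reports from Qichacha with Python 3 and Storing Them in MySQL

This post covers scraping enterprise annual report data from the Qichacha website and storing it in MySQL. To keep things organized, it is split into two modules:

The first module scrapes and parses the data and writes it to files; the second module loads the data into a MySQL database (batch insert).

Part 1: Scraping and parsing the data

Required libraries: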

import pymysql
import requests
from bs4 import BeautifulSoup
import codecs
import time

import sys

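I did not use the Scrapy framework; this was written to help a classmate scrape information on a few companies from Qichacha.

Preparation before scraping (environment):

1. Python is required; the code is written for Python 3

2. The libraries the scraper depends on

3. MySQL is required, together with pymysql, the library for accessing MySQL from Python

(Note: the MySQLdb library does not support Python 3. I did not know this at first and simply ran pip, which failed. I assumed it was a network problem or a Windows problem and tried all sorts of things. A Baidu search for how to install the Python MySQL library showed that everyone else installed it with pip without any trouble. Only after adding "3" after "python" in the search did I learn that MySQLdb does not support Python 3, which left me dumbfounded.)

Before coding, we also need to analyze the structure of the target site:

1. The main target is the enterprise annual report on Qichacha. The page is shown in the figure below; the content to scrape is the set of tables in the red box (one set per year).

As an example, we scrape the annual report of 南京易乾宁金融信息咨询有限公司.

url: http://www.qichacha.com/firm_1aa73f4e4ba0e172143909a124f0cbb6.shtml#report

Content to scrape:

2. Using the browser's built-in developer tools (Inspect Element), locate these tables in the page, as shown below:

As the figure shows, the enterprise basic information is inside <table class="ntable"></table>.

Coding:

1. With the page structure and the scraping requirements analyzed, it is time to roll up our sleeves and code.

Use the requests library for fetching (recommended; it covers the vast majority of scraping needs).

Use beautifulsoup4 to parse the page. I won't say much about bs4; anyone who has written a scraper in Python speaks well of it. Beginners can consult the official bs4 documentation (link: https://www.crummy.com/software/BeautifulSoup/bs4/doc/).

First, simply download the page with requests, then parse it with beautifulsoup4 to check whether the result matches the earlier analysis.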

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import requests
import lxml
from bs4 import BeautifulSoup
import codecs
import time
import sys
headers = {'Accept': 'text/html, */*; q=0.01','Accept-Encoding': 'gzip, deflate','Accept-Language': 'zh-CN,zh;q=0.8','Host': 'www.qichacha.com','Connection': 'keep-alive','Upgrade-Insecure-Requests': '1','Cache-Control': 'max-age=0','Cookie': 'UM_distinctid=162b82f21fa8f1-0803268b01ee0f-7c117d7e-1fa400-162b82f21fb77c; acw_tc=AQAAAIsSAkpjKAAAq7DGOlIV8eQv2vCF; _uab_collina=152350856009288770740658; hasShow=1; _umdata=C234BF9D3AFA6FE719F2E09D797D23E34110F2E82EB4188B670C9EB8EC9DC9B349A81CD9A226A737CD43AD3E795C914CC2334F10781D1FBF6B6942E57AA6F7F8; PHPSESSID=fphni733iobptf39h3nkqrim96; zg_did=%7B%22did%22%3A%20%22162b82f2175390-079c7ce1a4a2ed-7c117d7e-1fa400-162b82f2176859%22%7D; CNZZDATA1254842228=595737747-1523503772-https%253A%252F%252Fwww.baidu.com%252F%7C1523841965; Hm_lvt_3456bee468c83cc63fb5147f119f1075=1523508520,1523515331,1523583111,1523618471; Hm_lpvt_3456bee468c83cc63fb5147f119f1075=1523844231; zg_de1d1a35bfa24ce29bbf2c7eb17e6c4f=%7B%22sid%22%3A%201523844141176%2C%22updated%22%3A%201523844235810%2C%22info%22%3A%201523508519291%2C%22superProperty%22%3A%20%22%7B%7D%22%2C%22platform%22%3A%20%22%7B%7D%22%2C%22utm%22%3A%20%22%7B%7D%22%2C%22referrerDomain%22%3A%20%22www.qichacha.com%22%2C%22cuid%22%3A%20%2299d640975c5ff0e372e5283151adbfd0%22%7D','X-Requested-With': 'XMLHttpRequest','User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.104 Safari/537.36 Core/1.53.4882.400 QQBrowser/9.7.13059.400'
}
'''
Enterprise basic information table (per year)
Shareholder (sponsor) capital contribution table
Enterprise asset status table
Equity change table
'''
if __name__ == '__main__':
    print("Crawler started....")
    url = 'http://www.qichacha.com/firm_1aa73f4e4ba0e172143909a124f0cbb6.shtml#report'
    try:
        # Page to scrape: annual report of 南京易乾宁金融信息咨询有限公司
        reqs = requests.get(url, headers=headers, timeout=30)
        # print(reqs.text)
        soup = BeautifulSoup(reqs.text, 'lxml')
        # print(soup)
        print("#########################")
        report_div = soup.find('div', class_='container m-t-md').find('div', class_="data_div", id='report_div')
        print(report_div)
        print(report_div.find('section', class_='panel pos-rlt  b-a report_info'))
    except Exception:
        # Report the failure together with the URL that caused it
        error = sys.exc_info()
        print('{0} -> {1} -> {2}'.format(url, error[0], error[1]))
    print("Crawler finished....")

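The result comes back empty. Why? The content is clearly present in the page when viewed with the developer tools. Analyzing the page again:

In the figure below, the div in the thin-lined box is what the code above returns, but it is empty; its contents (the thick red box) have vanished.

Just as I was running out of ideas, I remembered having run into cases where the data and the page are sent separately. Rethinking it: when you click the annual report tab, the rest of the page does not change; only the annual report section changes. The site is therefore most likely using Ajax to load the required data asynchronously, and the scraper should request a different URL to get the data.

Open the browser's developer tools and select the Network tab. After entering the company's basic information page, clear everything in the Network panel, then click the annual report tab and inspect the Network entries one by one. One of the URLs returns data in its response, and after careful analysis this turned out to be exactly the data I needed. (Note: separating data from the page makes sense when you think about it. Sending all of a company's data at once would waste network resources and require extra code to manage where the data goes.)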
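The data URLs used later in this post all share one shape: the `company_getinfos` endpoint with `unique`, `companyname` and `tab=report` query parameters. As a rough sketch of how such a URL could be assembled instead of copied by hand, assuming `unique` is the hash that appears in the company page URL (`firm_<unique>.shtml`); the helper name `build_report_url` is made up for illustration:

```python
from urllib.parse import urlencode

# Endpoint as it appears in the data URLs used in this post.
AJAX_ENDPOINT = 'http://www.qichacha.com/company_getinfos'


def build_report_url(unique, companyname):
    """Build the Ajax URL for the annual-report fragment (hypothetical helper)."""
    query = urlencode({
        'unique': unique,            # hash taken from the company page URL
        'companyname': companyname,  # urlencode percent-encodes this as UTF-8
        'tab': 'report',
    })
    return AJAX_ENDPOINT + '?' + query


url = build_report_url('1aa73f4e4ba0e172143909a124f0cbb6',
                       '南京易乾宁金融信息咨询有限公司')
print(url)
```

Whether the site still accepts requests of this shape is not something this sketch can guarantee; it only reproduces the URL pattern recorded here.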

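2. Based on the analysis above, rewrite the earlier code:

Parsing this data with beautifulsoup4 is straightforward, so I won't go into detail here; if anything is unclear, the documentation should be enough.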
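For readers who want a concrete starting point anyway, here is a minimal sketch of the core idea the parsing functions in this post rely on: in the information tables, cells alternate between a label and its value, so only every second cell is kept. The HTML fragment and its values are invented for illustration, and `html.parser` is used so the example does not depend on lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: each row holds label/value pairs side by side.
SAMPLE_HTML = """
<table class="ntable">
  <tr><td>注册号</td><td>320100000000000</td><td>企业经营状态</td><td>开业</td></tr>
  <tr><td>从业人数</td><td>10人</td><td>邮政编码</td><td>210000</td></tr>
</table>
"""


def extract_values(table):
    """Collect the text of every second cell (the values) row by row."""
    values = []
    for tr in table.find_all('tr'):
        tds = tr.find_all('td')
        # tds[1::2] selects cells 2, 4, ... which carry the values
        values.extend(td.text.strip() for td in tds[1::2])
    return values


soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
table = soup.find('table', class_='ntable')
values = extract_values(table)
print('<^>'.join(values))
```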

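Full scraping and parsing code: (I scraped data for 17 companies here. Since that is not many, I hard-coded the URLs. If you need to scrape a very large number of companies, I suggest starting from the Qichacha search box: search by company name to reach the company's basic information page, then analyze the page to get the URL to scrape. That takes some time, so I did not do it here; honestly, I was too lazy.)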

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import requests
import lxml
from bs4 import BeautifulSoup
import codecs
import time
import sys

# Request headers copied from a browser session; replace the Cookie with your own
headers = {'Accept': 'text/html, */*; q=0.01','Accept-Encoding': 'gzip, deflate','Accept-Language': 'zh-CN,zh;q=0.8','Host': 'www.qichacha.com','Connection': 'keep-alive','Upgrade-Insecure-Requests': '1','Cache-Control': 'max-age=0','Cookie': 'UM_distinctid=162b82f21fa8f1-0803268b01ee0f-7c117d7e-1fa400-162b82f21fb77c; acw_tc=AQAAAIsSAkpjKAAAq7DGOlIV8eQv2vCF; _uab_collina=152350856009288770740658; hasShow=1; _umdata=C234BF9D3AFA6FE719F2E09D797D23E34110F2E82EB4188B670C9EB8EC9DC9B349A81CD9A226A737CD43AD3E795C914CC2334F10781D1FBF6B6942E57AA6F7F8; PHPSESSID=fphni733iobptf39h3nkqrim96; zg_did=%7B%22did%22%3A%20%22162b82f2175390-079c7ce1a4a2ed-7c117d7e-1fa400-162b82f2176859%22%7D; CNZZDATA1254842228=595737747-1523503772-https%253A%252F%252Fwww.baidu.com%252F%7C1523841965; Hm_lvt_3456bee468c83cc63fb5147f119f1075=1523508520,1523515331,1523583111,1523618471; Hm_lpvt_3456bee468c83cc63fb5147f119f1075=1523844231; zg_de1d1a35bfa24ce29bbf2c7eb17e6c4f=%7B%22sid%22%3A%201523844141176%2C%22updated%22%3A%201523844235810%2C%22info%22%3A%201523508519291%2C%22superProperty%22%3A%20%22%7B%7D%22%2C%22platform%22%3A%20%22%7B%7D%22%2C%22utm%22%3A%20%22%7B%7D%22%2C%22referrerDomain%22%3A%20%22www.qichacha.com%22%2C%22cuid%22%3A%20%2299d640975c5ff0e372e5283151adbfd0%22%7D','X-Requested-With': 'XMLHttpRequest','User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.104 Safari/537.36 Core/1.53.4882.400 QQBrowser/9.7.13059.400'
}


# Enterprise basic information
def parseEnterpriseBasicInfo(html_table, enterpriseNumber, year, gbdate):
    trs = html_table.find_all('tr')
    res = enterpriseNumber + '<^>' + year + '<^>' + gbdate + '<^>'
    for tr in trs:
        tds = tr.find_all('td')
        count = 1
        for td in tds:
            # Cells alternate label/value; keep only the even-numbered (value) cells
            if count % 2 == 0:
                tableInfo = td.text.strip()
                res += str(tableInfo) + '<^>'
            count += 1
    # Returned record layout:
    # enterprise code, year, publication date, registration number, unified social credit code,
    # business status, contact phone, number of employees, postal code,
    # whether a shareholder equity transfer occurred this year (limited liability company),
    # whether the enterprise holds investments or bought equity in other companies,
    # email, mailing address
    return res[:-3], 1  # drop the trailing '<^>'


# Shareholder (sponsor) capital contribution information
def parseShareholdersInvestmentInfo(html_table, enterpriseNumber, year, gbdate):
    tableInfos = list()
    count = 1
    trs = html_table.find_all('tr')
    for tr in trs:
        if count == 1:  # skip the header row
            count += 1
            continue
        res = enterpriseNumber + '<^>' + year + '<^>' + gbdate + '<^>'
        tds = tr.find_all('td')
        tdcnt = 1
        for td in tds:
            if tdcnt == 1:  # skip the serial-number column
                tdcnt += 1
                continue
            tableInfo = td.text.strip()
            res += str(tableInfo) + '<^>'
            tdcnt += 1
        tableInfos.append(res[:-3])  # drop the trailing '<^>'
        count += 1
    return tableInfos, len(tableInfos)


# Enterprise asset status information
def parseEnterpriseAssetStatusInfo(html_table, enterpriseNumber, year, gbdate):
    trs = html_table.find_all('tr')
    res = enterpriseNumber + '<^>' + year + '<^>' + gbdate + '<^>'
    for tr in trs:
        tds = tr.find_all('td')
        count = 1
        for td in tds:
            # Cells alternate label/value; keep only the even-numbered (value) cells
            if count % 2 == 0:
                tableInfo = td.text.strip()
                res += str(tableInfo) + '<^>'
            count += 1
    return res[:-3], 1  # drop the trailing '<^>'


# Equity change information
def parseStockChangeInfo(html_table, enterpriseNumber, year, gbdate):
    tableInfos = list()
    count = 1
    trs = html_table.find_all('tr')
    for tr in trs:
        if count == 1:  # skip the header row
            count += 1
            continue
        res = enterpriseNumber + '<^>' + year + '<^>' + gbdate + '<^>'
        tds = tr.find_all('td')
        for td in tds:
            tableInfo = td.text.strip()
            res += str(tableInfo) + '<^>'
        tableInfos.append(res[:-3])  # drop the trailing '<^>'
        count += 1
    return tableInfos, len(tableInfos)


# Write to files
def writeToFile(basic, investment, assetState, stockChange):
    print("Writing files: start")
    with codecs.open(r'basic.txt', 'w', 'utf-8') as writeTxt1:
        for line in basic:
            writeTxt1.write(line + "\n")
    with codecs.open(r'investment.txt', 'w', 'utf-8') as writeTxt2:
        for line in investment:
            writeTxt2.write(line + "\n")
    with codecs.open(r'assetState.txt', 'w', 'utf-8') as writeTxt3:
        for line in assetState:
            writeTxt3.write(line + "\n")
    with codecs.open(r'stockChange.txt', 'w', 'utf-8') as writeTxt4:
        for line in stockChange:
            writeTxt4.write(line + "\n")
    print("Writing files: done")


# Referer:http://www.qichacha.com/firm_1aa73f4e4ba0e172143909a124f0cbb6.html
def mainCrawl():
    # Four running totals, one per table
    bacount = 0
    ivcount = 0
    ascount = 0
    stcount = 0
    # Four result lists:
    # enterprise basic information (per year)
    # shareholder (sponsor) capital contributions
    # enterprise asset status
    # equity changes
    basic = list()
    investment = list()
    assetState = list()
    stockChange = list()
    # URLs to scrape
    URI = [
        'http://www.qichacha.com/company_getinfos?unique=1aa73f4e4ba0e172143909a124f0cbb6&companyname=%E5%8D%97%E4%BA%AC%E6%98%93%E4%B9%BE%E5%AE%81%E9%87%91%E8%9E%8D%E4%BF%A1%E6%81%AF%E5%92%A8%E8%AF%A2%E6%9C%89%E9%99%90%E5%85%AC%E5%8F%B8&tab=report',
        'http://www.qichacha.com/company_getinfos?unique=9f2ec47cc261de7878d04e95566b33c5&companyname=%E5%8D%97%E4%BA%AC%E6%96%87%E5%8C%96%E8%89%BA%E6%9C%AF%E4%BA%A7%E6%9D%83%E4%BA%A4%E6%98%93%E6%89%80%E6%9C%89%E9%99%90%E5%85%AC%E5%8F%B8&tab=report',
        'http://www.qichacha.com/company_getinfos?unique=dc95cb922b61508adef8614e4335668d&companyname=%E6%B1%9F%E8%8B%8F%E6%81%92%E5%BE%B7%E8%A1%8C%E6%98%93%E8%B4%B5%E9%87%91%E5%B1%9E%E4%BA%A4%E6%98%93%E4%B8%AD%E5%BF%83%E6%9C%89%E9%99%90%E5%85%AC%E5%8F%B8&tab=report',
        'http://www.qichacha.com/company_getinfos?unique=5fdcc9baf2526ae1932a33cec70ad137&companyname=%E6%B1%9F%E8%8B%8F%E4%BA%91%E5%8E%A8%E4%B8%80%E7%AB%99%E6%96%B0%E9%9B%B6%E5%94%AE%E6%9C%89%E9%99%90%E5%85%AC%E5%8F%B8&tab=report',
        'http://www.qichacha.com/company_getinfos?unique=26f9f27a3f5912b1c6f3efefef0d27c9&companyname=%E5%8D%97%E4%BA%AC%E5%B8%8C%E8%B6%8A%E5%95%86%E5%8A%A1%E4%BF%A1%E6%81%AF%E5%92%A8%E8%AF%A2%E6%9C%89%E9%99%90%E5%85%AC%E5%8F%B8&tab=report',
        'http://www.qichacha.com/company_getinfos?unique=e64ea0014204d067621fd3f385187745&companyname=%E5%8D%97%E4%BA%AC%E6%8B%A8%E4%BA%91%E8%A7%81%E6%97%A5%E4%BF%A1%E6%81%AF%E7%A7%91%E6%8A%80%E6%9C%89%E9%99%90%E5%85%AC%E5%8F%B8&tab=report',
        'http://www.qichacha.com/company_getinfos?unique=d0ea693c06429a2d01167151ac66d266&companyname=%E6%B1%9F%E8%8B%8F%E5%A4%A9%E4%BF%A1%E5%88%9B%E5%AF%8C%E7%BD%91%E7%BB%9C%E7%A7%91%E6%8A%80%E6%9C%89%E9%99%90%E5%85%AC%E5%8F%B8&tab=report',
        'http://www.qichacha.com/company_getinfos?unique=830fee0148fce3047a6e288a9c821453&companyname=%E5%8D%97%E4%BA%AC%E6%A2%93%E7%A6%8F%E5%81%A5%E5%BA%B7%E5%92%A8%E8%AF%A2%E6%9C%89%E9%99%90%E5%85%AC%E5%8F%B8&tab=report',
        'http://www.qichacha.com/company_getinfos?unique=83e46699920920cdbbfb2b8b1b89eaee&companyname=%E6%B1%9F%E8%8B%8F%E6%98%9F%E4%BA%AB%E8%9E%8D%E4%BF%A1%E6%81%AF%E7%A7%91%E6%8A%80%E6%9C%89%E9%99%90%E5%85%AC%E5%8F%B8&tab=report',
        'http://www.qichacha.com/company_getinfos?unique=309e636267612442bbc0ee337283fabd&companyname=%E7%BB%BF%E9%87%91%E5%9C%A8%E7%BA%BF%E7%94%B5%E5%AD%90%E5%95%86%E5%8A%A1%E6%9C%89%E9%99%90%E5%85%AC%E5%8F%B8&tab=report',
        'http://www.qichacha.com/company_getinfos?unique=af2e2d318db7aa5c220fbca707c68600&companyname=%E6%B1%9F%E8%8B%8F%E4%B8%AD%E8%9E%8D%E4%BF%A1%E6%B3%B0%E7%94%B5%E5%AD%90%E5%95%86%E5%8A%A1%E6%9C%89%E9%99%90%E5%85%AC%E5%8F%B8&tab=report',
        'http://www.qichacha.com/company_getinfos?unique=2ea1b09b5537b76252aa130ddc3bc964&companyname=%E6%B1%9F%E8%8B%8F%E6%98%93%E4%BB%98%E5%AE%9D%E5%95%86%E5%8A%A1%E6%9C%8D%E5%8A%A1%E6%9C%89%E9%99%90%E5%85%AC%E5%8F%B8&tab=report',
        'http://www.qichacha.com/company_getinfos?unique=3a2a5785409da2d5d6b284b7973cb8cd&companyname=%E6%B1%9F%E8%8B%8F%E8%9E%8D%E7%B1%B3%E4%BF%A1%E6%81%AF%E7%A7%91%E6%8A%80%E6%9C%89%E9%99%90%E5%85%AC%E5%8F%B8&tab=report',
        'http://www.qichacha.com/company_getinfos?unique=72694ddeb601624d42b7ffd726acaa54&companyname=%E5%8D%97%E4%BA%AC%E5%BE%B7%E7%A6%8F%E5%86%9C%E4%B8%9A%E7%A7%91%E6%8A%80%E6%9C%89%E9%99%90%E5%85%AC%E5%8F%B8&tab=report',
        'http://www.qichacha.com/company_getinfos?unique=156c60afdc14d79aee1a38feb04707db&companyname=%E5%8D%97%E4%BA%AC%E8%8C%82%E8%A3%95%E4%BC%81%E4%B8%9A%E7%AE%A1%E7%90%86%E5%92%A8%E8%AF%A2%E6%9C%89%E9%99%90%E5%85%AC%E5%8F%B8&tab=report',
        'http://www.qichacha.com/company_getinfos?unique=5c36e8ef2192457ca2f35288772829f5&companyname=%E6%B1%9F%E8%8B%8F%E4%B8%AD%E5%AE%9E%E8%B4%B5%E9%87%91%E5%B1%9E%E4%BA%A4%E6%98%93%E6%9C%89%E9%99%90%E5%85%AC%E5%8F%B8&tab=report',
        'http://www.qichacha.com/company_getinfos?unique=8bbe267bb3ab995b095fdb3921e97a32&companyname=%E5%8D%97%E4%BA%AC%E8%81%9A%E6%B7%BB%E5%88%A9%E8%B4%A2%E5%AF%8C%E7%AE%A1%E7%90%86%E6%9C%89%E9%99%90%E5%85%AC%E5%8F%B8&tab=report',
    ]
    enterpriseNo = ['00000001', '00000002', '00000003', '00000004', '00000005', '00000006', '00000007', '00000008', '00000009', '00000010', '00000011', '00000012', '00000013', '00000014', '00000015', '00000016', '00000017']
    print("Crawler started....")
    for index in range(0, len(URI)):
        time.sleep(3)  # throttle requests to avoid triggering anti-scraping measures
        try:
            url = URI[index].strip()
            reqs = requests.get(url, headers=headers, timeout=30)
            # Qichacha sends the data as a separate HTML fragment, which is then embedded in the page
            soup = BeautifulSoup(reqs.text, 'lxml')
            # Check whether there is any data; skip the company if not
            panelDiv = soup.find('div', class_='panel-heading b-b m-t-md')
            if panelDiv is None:
                print('{0} -> {1}'.format(enterpriseNo[index], 'no annual report'))
                continue
            # If there are annual reports, find out how many years are available
            panel_heading = panelDiv.find('ul').find_all('li')
            for num in range(0, len(panel_heading)):
                # Locate the tab pane by its id
                tab_pane2 = soup.find('div', id=str(num))
                # Within the pane, find the publication date and the tables
                # Enterprise code
                enterpriseNumber = enterpriseNo[index]
                # Publication date
                gbdate = tab_pane2.find('div', class_="text-gray m-b-sm m-t-sm").text.split(' ')[0].strip()
                print(gbdate)
                # Report year (the year before the publication year)
                year = str(int(gbdate.split('-')[0]) - 1)
                print(year)
                tableTitles = tab_pane2.find_all('div', class_='tcaption')
                # Dispatch each table to its parser; the titles must match the page text exactly
                for tableTitle in tableTitles:
                    table_title = tableTitle.text.strip()
                    html_table = tableTitle.find_next_sibling()
                    if table_title == '企业基本信息':  # exactly one record
                        s, cnt1 = parseEnterpriseBasicInfo(html_table, enterpriseNumber, year, gbdate)
                        bacount += cnt1
                        basic.append(s)
                    elif table_title == '股东（发起人）出资信息':  # multiple records
                        ivlist, cnt2 = parseShareholdersInvestmentInfo(html_table, enterpriseNumber, year, gbdate)
                        ivcount += cnt2
                        investment.extend(ivlist)
                    elif table_title == '企业资产状况信息':  # one record
                        s, cnt3 = parseEnterpriseAssetStatusInfo(html_table, enterpriseNumber, year, gbdate)
                        ascount += cnt3
                        assetState.append(s)
                    elif table_title == '股权变更信息':  # may be absent; multiple records when present
                        stlist, cnt4 = parseStockChangeInfo(html_table, enterpriseNumber, year, gbdate)
                        stcount += cnt4
                        stockChange.extend(stlist)
                    else:
                        print('{0} -> {1}'.format(enterpriseNo[index], 'no matching table in the database'))
        except Exception:
            # Log the failure and move on to the next company
            error = sys.exc_info()
            print('{0} -> {1} -> {2}'.format(enterpriseNo[index], error[0], error[1]))
            continue
    print("Crawler finished....")
    # Scraping and parsing done; write the results to files first
    print('{0} -> {1}'.format('Enterprise basic information table (per year)', bacount))
    print('{0} -> {1}'.format('Shareholder (sponsor) capital contribution table', ivcount))
    print('{0} -> {1}'.format('Enterprise asset status table', ascount))
    print('{0} -> {1}'.format('Equity change table', stcount))
    writeToFile(basic, investment, assetState, stockChange)


'''
Enterprise basic information table (per year)
Shareholder (sponsor) capital contribution table
Enterprise asset status table
Equity change table
'''
if __name__ == '__main__':
    mainCrawl()
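One step in `mainCrawl` is easy to miss: the report year is not read from the page but derived from the publication date, by taking the first space-separated token of the caption as the date and subtracting one from its year. That rests on the assumption that a report published in year N covers year N-1, which may not hold for late filings. A small sketch of the step in isolation (the function name and the sample caption are made up for illustration):

```python
def report_year(caption_text):
    """Split a caption into (publication_date, report_year).

    Assumes the caption starts with a YYYY-MM-DD date followed by a space,
    and that the report covers the year before it was published.
    """
    gbdate = caption_text.split(' ')[0].strip()
    year = str(int(gbdate.split('-')[0]) - 1)
    return gbdate, year


gbdate, year = report_year('2017-06-28 发布')  # hypothetical caption text
print(gbdate, year)
```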

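Part 2: Loading the data into the database

1. After scraping and parsing, the data is in the format we need. The next step is to batch-write it to the database.

This requires the pymysql library (note: MySQLdb does not support Python 3, which cost me a lot of time; pip failed, and at first I thought it was a network problem caused by unstable access to overseas sites).

Data format in the files: <^> is a separator I defined myself, chosen so that it does not collide with anything in the data.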
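To illustrate why the separator matters, here is a small round-trip sketch. The sample record is invented, and the helper names `to_line` and `from_line` are hypothetical; the point is only that a field containing a comma survives a `<^>` split but not a comma split.

```python
SEP = '<^>'  # the custom separator used in this post


def to_line(fields):
    """Join the fields of one record into a single line."""
    return SEP.join(fields)


def from_line(line):
    """Split a line back into its fields."""
    return line.rstrip('\n').split(SEP)


# Hypothetical record whose last field contains a comma
record = ['00000001', '2016', '2017-06-28', 'Room 101, Building 2']
line = to_line(record)

print(from_line(line) == record)          # the round trip preserves all 4 fields
print(len(','.join(record).split(',')))   # a comma-separated line splits into 5
```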

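2. Create the database and tables in MySQL (here the scraped data is simply stored in the database, with no relationships defined between the tables; if you need them, define the relationships between the tables first)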

create database qichacha;
use qichacha;
-- ----------------------------
-- Table structure for `企业基本信息表`
-- ----------------------------
DROP TABLE IF EXISTS `企业基本信息表`;
CREATE TABLE `企业基本信息表` (
  `key` int(10) NOT NULL AUTO_INCREMENT,
  `企业代码` varchar(10) NOT NULL,
  `年度` varchar(255) DEFAULT NULL,
  `公布日期` varchar(255) DEFAULT NULL,
  `注册号` varchar(255) DEFAULT NULL,
  `统一社会信用代码` varchar(255) DEFAULT NULL,
  `企业经营状态` varchar(255) DEFAULT NULL,
  `企业联系电话` varchar(255) DEFAULT NULL,
  `从业人数` varchar(255) DEFAULT NULL,
  `邮政编码` varchar(255) DEFAULT NULL,
  `有限责任公司本年度是否发生股东股权转让` varchar(255) DEFAULT NULL,
  `企业是否有投资信息或购买其他公司股权` varchar(255) DEFAULT NULL,
  `电子邮箱` varchar(255) DEFAULT NULL,
  `企业通讯地址` varchar(255) DEFAULT NULL,
  PRIMARY KEY (`key`)
) ENGINE=InnoDB AUTO_INCREMENT=218 DEFAULT CHARSET=utf8;
-- ----------------------------
-- Table structure for `股东（发起人）出资信息表`
-- ----------------------------
DROP TABLE IF EXISTS `股东（发起人）出资信息表`;
CREATE TABLE `股东（发起人）出资信息表` (
  `key` int(10) NOT NULL AUTO_INCREMENT,
  `企业代码` varchar(10) NOT NULL,
  `年度` varchar(255) DEFAULT NULL,
  `公布日期` varchar(255) DEFAULT NULL,
  `发起人` varchar(255) DEFAULT NULL,
  `认缴出资额（万元）` varchar(255) DEFAULT NULL,
  `认缴出资时间` varchar(255) DEFAULT NULL,
  `认缴出资方式` varchar(255) DEFAULT NULL,
  `实缴出资额（万元）` varchar(255) DEFAULT NULL,
  `实缴出资时间` varchar(255) DEFAULT NULL,
  `实缴出资方式` varchar(255) DEFAULT NULL,
  PRIMARY KEY (`key`)
) ENGINE=InnoDB AUTO_INCREMENT=574 DEFAULT CHARSET=utf8;
-- ----------------------------
-- Table structure for `企业资产状况信息表`
-- ----------------------------
DROP TABLE IF EXISTS `企业资产状况信息表`;
CREATE TABLE `企业资产状况信息表` (
  `key` int(10) NOT NULL AUTO_INCREMENT,
  `企业代码` varchar(10) NOT NULL,
  `年度` varchar(255) DEFAULT NULL,
  `公布日期` varchar(255) DEFAULT NULL,
  `资产总额` varchar(255) DEFAULT NULL,
  `所有者权益合计` varchar(255) DEFAULT NULL,
  `营业总收入` varchar(255) DEFAULT NULL,
  `利润总额` varchar(255) DEFAULT NULL,
  `营业总收入中主营业务收入` varchar(255) DEFAULT NULL,
  `净利润` varchar(255) DEFAULT NULL,
  `纳税总额` varchar(255) DEFAULT NULL,
  `负债总额` varchar(255) DEFAULT NULL,
  PRIMARY KEY (`key`)
) ENGINE=InnoDB AUTO_INCREMENT=209 DEFAULT CHARSET=utf8;
-- ----------------------------
-- Table structure for `股权变更信息表`
-- ----------------------------
DROP TABLE IF EXISTS `股权变更信息表`;
CREATE TABLE `股权变更信息表` (
  `key` int(10) NOT NULL AUTO_INCREMENT,
  `企业代码` varchar(10) NOT NULL,
  `年度` varchar(255) DEFAULT NULL,
  `发布日期` varchar(255) DEFAULT NULL,
  `股东` varchar(255) DEFAULT NULL,
  `变更前股权比例` varchar(255) DEFAULT NULL,
  `变更后股权比例` varchar(255) DEFAULT NULL,
  `股权变更日期` varchar(255) DEFAULT NULL,
  PRIMARY KEY (`key`)
) ENGINE=InnoDB AUTO_INCREMENT=18 DEFAULT CHARSET=utf8;

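3. Use Python 3 to read the data from the files and write it to the database

The number of values in each inserted record must match the number of columns being inserted; otherwise you get the error: not all arguments converted during string formatting. I ran into exactly this problem when inserting data: records read from the file have to be split, and at first I used a comma rather than a custom separator. The email field of one company's basic information contained a comma where there should have been a dot (xxx@xxx.com had been entered as xxx@xxx,com), so splitting produced one extra field and the insert kept failing.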
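A cheap way to catch this before calling `executemany` is to compare each record's length with the number of `%s` placeholders in the statement. The sketch below is not part of the original script: the `demo` table, the sample rows and the helper name `find_bad_rows` are made up, and counting `%s` assumes the statement contains no literal `%s` elsewhere.

```python
def find_bad_rows(sql, rows):
    """Return (index, row) pairs whose length differs from the placeholder count."""
    expected = sql.count('%s')
    return [(i, row) for i, row in enumerate(rows) if len(row) != expected]


sql = "INSERT INTO demo(a, b, c) VALUES(%s, %s, %s)"  # hypothetical statement
rows = [
    ('00000001', '2016', 'xxx@xxx.com'),
    # A comma-separated line whose email field itself contains a comma
    tuple('00000002,2016,xxx@xxx,com'.split(',')),
]

bad = find_bad_rows(sql, rows)
for i, row in bad:
    print('row {0} has {1} fields: {2}'.format(i, len(row), row))
```

Rows reported here can be fixed or dropped before the batch insert, instead of having the whole batch fail and roll back.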

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import pymysql
import codecs
import sys


def writeToMySql(basic, investment, assetState, stockChange):
    # Open the database connection (keyword arguments; PyMySQL 1.0+ no longer accepts positional ones)
    db = pymysql.connect(host="localhost", user="root", password="root", database="qichacha", charset='utf8')
    # Create a cursor object with cursor()
    cursor = db.cursor()
    # Run a SQL query with execute()
    cursor.execute("SELECT VERSION()")
    # Fetch a single row with fetchone()
    data = cursor.fetchone()
    print("Database version : %s " % data)
    # SQL insert statements
    # Enterprise basic information table
    basic_sql = ("INSERT INTO 企业基本信息表(企业代码, 年度, 公布日期, 注册号, 统一社会信用代码, 企业经营状态, 企业联系电话, 从业人数, 邮政编码, "
                 "有限责任公司本年度是否发生股东股权转让, 企业是否有投资信息或购买其他公司股权, 电子邮箱, 企业通讯地址) "
                 "VALUES(%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)")
    # Shareholder (sponsor) capital contribution table
    investment_sql = ("INSERT INTO 股东（发起人）出资信息表(企业代码, 年度, 公布日期, 发起人, 认缴出资额（万元）, 认缴出资时间, 认缴出资方式, 实缴出资额（万元）, "
                      "实缴出资时间, 实缴出资方式) "
                      "VALUES(%s, %s, %s, %s, %s, %s, %s, %s, %s, %s)")
    # Enterprise asset status table
    assetState_sql = ("INSERT INTO 企业资产状况信息表(企业代码, 年度, 公布日期, 资产总额, 所有者权益合计, 营业总收入, 利润总额, 营业总收入中主营业务收入, "
                      "净利润, 纳税总额, 负债总额) "
                      "VALUES(%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)")
    # Equity change table
    stockChange_sql = ("INSERT INTO 股权变更信息表(企业代码, 年度, 发布日期, 股东, 变更前股权比例, 变更后股权比例, 股权变更日期) "
                       "VALUES(%s, %s, %s, %s, %s, %s, %s)")
    try:
        # Single-row insert would be: cursor.execute(basic_sql, value)
        # Batch insert
        cursor.executemany(basic_sql, basic)
        print(1)
        cursor.executemany(investment_sql, investment)
        print(2)
        cursor.executemany(assetState_sql, assetState)
        print(3)
        cursor.executemany(stockChange_sql, stockChange)
        print(4)
        # Commit to the database
        db.commit()
    except Exception:
        # Roll back on any error
        error = sys.exc_info()
        print('{0} -> {1}'.format(error[0], error[1]))
        db.rollback()
    # SQL query to check the result
    sql = 'select * from 股东（发起人）出资信息表'
    try:
        # Execute the SQL statement
        cursor.execute(sql)
        # Fetch all rows
        results = cursor.fetchall()
        for row in results:
            key = row[0]
            code = row[1]
            year = row[2]
            gbdate = row[3]
            publisher = row[4]
            money1 = row[5]
            date1 = row[6]
            way1 = row[7]
            money2 = row[8]
            date2 = row[9]
            way2 = row[10]
            # Print the row
            print("key=%s,code=%s,year=%s,gbdate=%s,publisher=%s,money1=%s,date1=%s,way1=%s,money2=%s,date2=%s,way2=%s" %
                  (key, code, year, gbdate, publisher, money1, date1, way1, money2, date2, way2))
    except Exception:
        error = sys.exc_info()
        print('{0} -> {1}'.format(error[0], error[1]))
        print("Error: unable to fetch data")
    # Close the database connection
    db.close()


if __name__ == '__main__':
    # Read the files
    print("Reading files: start")
    basic = list()  # each list holds tuples, one per record
    investment = list()
    assetState = list()
    stockChange = list()
    with codecs.open(r'basic.txt', 'r', 'utf-8') as readTxt1:
        for line in readTxt1:
            line = line.strip().replace('<^>-<^>', '<^>NaN<^>')
            lineTuple = tuple(line.strip().split('<^>'))
            # Print records that do not have the expected 13 fields
            if len(lineTuple) != 13:
                print(line)
                print(lineTuple)
            basic.append(lineTuple)
    with codecs.open(r'investment.txt', 'r', 'utf-8') as readTxt2:
        for line in readTxt2:
            line = line.strip().replace('<^>-<^>', '<^>NaN<^>')
            lineTuple = tuple(line.strip().split('<^>'))
            investment.append(lineTuple)
    with codecs.open(r'assetState.txt', 'r', 'utf-8') as readTxt3:
        for line in readTxt3:
            line = line.strip().replace('<^>-<^>', '<^>NaN<^>')
            lineTuple = tuple(line.strip().split('<^>'))
            assetState.append(lineTuple)
    with codecs.open(r'stockChange.txt', 'r', 'utf-8') as readTxt4:
        for line in readTxt4:
            line = line.strip().replace('<^>-<^>', '<^>NaN<^>')
            lineTuple = tuple(line.strip().split('<^>'))
            stockChange.append(lineTuple)
    print("Reading files: done")
    # Write to the database
    writeToMySql(basic, investment, assetState, stockChange)
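One caveat about the placeholder substitution in the loader: `str.replace` only replaces non-overlapping matches, so when two adjacent fields are both `-`, the second one shares its leading `<^>` with the first match and is left unchanged; a `-` in the first or last field is not matched either. A possible alternative, sketched here with an invented sample line and a hypothetical helper name, is to split first and substitute field by field:

```python
SEP = '<^>'


def normalize(line):
    """Split a record and turn fields that are exactly '-' into 'NaN'."""
    return tuple('NaN' if field == '-' else field
                 for field in line.strip().split(SEP))


line = '00000001<^>-<^>-<^>x'  # hypothetical record with two adjacent '-' fields

# The replace-based approach leaves the second '-' in place
replaced = line.replace('<^>-<^>', '<^>NaN<^>').split(SEP)
print(replaced)

# Field-wise substitution handles both
print(normalize(line))
```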

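4. Results in the database

Enterprise basic information table

Shareholder (sponsor) capital contribution table

Enterprise asset status table

Equity change table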