Problem description:

I.

# This is the .csv file; the Chinese text displays as mojibake.
[root@Uu jianshu]# cat jianshu.csv
url,title,author
http://www.jianshu.com/p/2a7a594816e1,彖浣犳村?鏍? 璋㈣传绌凤兼娉绗锛?
[root@Uu jianshu]#
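Before changing any settings, it can help to look at the file's raw bytes, independent of how the terminal renders them. A minimal diagnostic sketch (hypothetical; not part of the original session, but it runs under the Python 2.7 shown in the logs):

# -*- coding: utf-8 -*-
# Read the exported CSV as raw bytes, bypassing the terminal's decoding.
with open('jianshu.csv', 'rb') as f:
    data = f.read()

print(repr(data[:80]))  # raw byte values, unaffected by the locale

# Whichever candidate encoding decodes without errors is most likely
# the file's actual encoding.
for enc in ('utf-8', 'gbk', 'gb2312'):
    try:
        data.decode(enc)
        print('%s: decodes cleanly' % enc)
    except UnicodeDecodeError:
        print('%s: invalid' % enc)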

II.

# This is the .json file; the Chinese text is not displayed as Chinese either.
[root@Uu jianshu]# cat jianshu.json
[
{"url": "http://www.jianshu.com/p/2a7a594816e1", "title": ["\u542c\u8bf4\u4f60\u611f\u8c22\u8d2b\u7a77\uff0c\u6211\u60f3\u7b11\uff0c\u5374\u54ed\u4e86"], "author": ["\u65e0\u6212"]}
]
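This second case is not actually corruption: \uXXXX sequences are standard ASCII-escaped JSON, and json.load() restores the original characters. A quick check (hypothetical sketch, not from the original session; the expected output matches the readable results later in this post):

# -*- coding: utf-8 -*-
import json

# json.load() turns the \uXXXX escapes back into real characters.
with open('jianshu.json') as f:
    items = json.load(f)

print(items[0]['title'][0])   # expected: 听说你感谢贫穷，我想笑，却哭了
print(items[0]['author'][0])  # expected: 无戒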

Troubleshooting process:

I. First guess: UTF-8. The attempt went as follows:

# Trying UTF-8:
[root@Uu jianshu]# vi settings.py
# -*- coding: utf-8 -*-

# Scrapy settings for jianshu project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://doc.scrapy.org/en/latest/topics/settings.html
#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'jianshu'

SPIDER_MODULES = ['jianshu.spiders']
NEWSPIDER_MODULE = 'jianshu.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'jianshu (+http://www.yourdomain.com)'
USER_AGENT = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)"

# Obey robots.txt rules
ROBOTSTXT_OBEY = False
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

#FEED_URI = u'/home/BS/jianshu.json'
#FEED_FORMAT = 'json'
FEED_EXPORT_ENCODING = 'UTF-8'
#FEED_EXPORT_ENCODING = 'GBK'
#FEED_EXPORT_ENCODING = 'GB2312'
"settings.py" 98L, 3371C written
[root@Uu jianshu]# cd ..
[root@Uu jianshu]# ll
total 8
drwxr-xr-x. 3 root root 174 Aug 28 22:35 jianshu
-rw-r--r--. 1 root root 117 Aug 28 22:34 jianshu.json
-rw-r--r--. 1 root root 257 Aug 28 14:44 scrapy.cfg
[root@Uu jianshu]# rm -f jianshu.json
[root@Uu jianshu]# scrapy crawl jianshu -o jianshu.json
2018-08-28 22:35:51 [scrapy.utils.log] INFO: Scrapy 1.5.1 started (bot: jianshu)
2018-08-28 22:35:51 [scrapy.utils.log] INFO: Versions: lxml 3.2.1.0, libxml2 2.9.1, cssselect 1.0.3, parsel 1.5.0, w3lib 1.19.0, Twisted 18.7.0, Python 2.7.5 (default, Jul 13 2018, 13:06:57) - [GCC 4.8.5 20150623 (Red Hat 4.8.5-28)], pyOpenSSL 18.0.0 (OpenSSL 1.1.0i  14 Aug 2018), cryptography 2.3.1, Platform Linux-3.10.0-693.el7.x86_64-x86_64-with-centos-7.4.1708-Core
2018-08-28 22:35:51 [scrapy.crawler] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'jianshu.spiders', 'FEED_URI': 'jianshu.json', 'CONCURRENT_REQUESTS': 1, 'SPIDER_MODULES': ['jianshu.spiders'], 'BOT_NAME': 'jianshu', 'COOKIES_ENABLED': False, 'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)', 'FEED_FORMAT': 'json', 'FEED_EXPORT_ENCODING': 'UTF-8', 'DOWNLOAD_DELAY': 5}
2018-08-28 22:35:51 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.feedexport.FeedExporter','scrapy.extensions.memusage.MemoryUsage','scrapy.extensions.logstats.LogStats','scrapy.extensions.telnet.TelnetConsole','scrapy.extensions.corestats.CoreStats']
2018-08-28 22:35:51 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware','scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware','scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware','scrapy.downloadermiddlewares.useragent.UserAgentMiddleware','scrapy.downloadermiddlewares.retry.RetryMiddleware','scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware','scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware','scrapy.downloadermiddlewares.redirect.RedirectMiddleware','scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware','scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-08-28 22:35:51 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware','scrapy.spidermiddlewares.offsite.OffsiteMiddleware','scrapy.spidermiddlewares.referer.RefererMiddleware','scrapy.spidermiddlewares.urllength.UrlLengthMiddleware','scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-08-28 22:35:51 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2018-08-28 22:35:51 [scrapy.core.engine] INFO: Spider opened
2018-08-28 22:35:51 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-08-28 22:35:51 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-08-28 22:35:51 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.jianshu.com/trending/monthly> from <GET http://www.jianshu.com/trending/monthly>
2018-08-28 22:35:57 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.jianshu.com/trending/monthly> (referer: None)
2018-08-28 22:35:57 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.jianshu.com/trending/monthly>
{'author': [u'\u65e0\u6212'],'title': [u'\u542c\u8bf4\u4f60\u611f\u8c22\u8d2b\u7a77\uff0c\u6211\u60f3\u7b11\uff0c\u5374\u54ed\u4e86'],'url': u'http://www.jianshu.com/p/2a7a594816e1'}
2018-08-28 22:35:57 [scrapy.core.engine] INFO: Closing spider (finished)
2018-08-28 22:35:57 [scrapy.extensions.feedexport] INFO: Stored json feed (1 items) in: jianshu.json
2018-08-28 22:35:57 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 606,'downloader/request_count': 2,'downloader/request_method_count/GET': 2,'downloader/response_bytes': 10881,'downloader/response_count': 2,'downloader/response_status_count/200': 1,'downloader/response_status_count/301': 1,'finish_reason': 'finished','finish_time': datetime.datetime(2018, 8, 28, 14, 35, 57, 854597),'item_scraped_count': 1,'log_count/DEBUG': 4,'log_count/INFO': 8,'memusage/max': 42971136,'memusage/startup': 42971136,'response_received_count': 1,'scheduler/dequeued': 2,'scheduler/dequeued/memory': 2,'scheduler/enqueued': 2,'scheduler/enqueued/memory': 2,'start_time': datetime.datetime(2018, 8, 28, 14, 35, 51, 387501)}
2018-08-28 22:35:57 [scrapy.core.engine] INFO: Spider closed (finished)
[root@Uu jianshu]# cat jianshu.json
[
{"url": "http://www.jianshu.com/p/2a7a594816e1", "title": ["彖浣犳                   村?], "author": ["鏍?]}                                          璋㈣传绌凤兼娉绗锛?
[root@Uu jianshu]# 

Clearly, setting FEED_EXPORT_ENCODING = 'UTF-8' in settings.py does not fix the display problem.
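The feed itself, though, is valid UTF-8; the garbage appears because the terminal decodes those UTF-8 bytes with a GBK-family codec when cat prints them. The effect is easy to reproduce (hypothetical sketch, not from the original session):

# -*- coding: utf-8 -*-
# Encode a known-good title as UTF-8, then decode the bytes as GBK --
# which is what a GBK-locale terminal does to a UTF-8 file.
title = u'听说你感谢贫穷，我想笑，却哭了'
utf8_bytes = title.encode('utf-8')
print(utf8_bytes.decode('gbk', 'replace'))  # same style of mojibake as above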

II. Next, try GBK (i.e., set FEED_EXPORT_ENCODING = 'GBK'). The run went as follows:

# Trying GBK:
[root@Uu jianshu]# scrapy crawl jianshu -o jianshu.json
2018-08-28 22:32:40 [scrapy.utils.log] INFO: Scrapy 1.5.1 started (bot: jianshu)
2018-08-28 22:32:40 [scrapy.utils.log] INFO: Versions: lxml 3.2.1.0, libxml2 2.9.1, cssselect 1.0.3, parsel 1.5.0, w3lib 1.19.0, Twisted 18.7.0, Python 2.7.5 (default, Jul 13 2018, 13:06:57) - [GCC 4.8.5 20150623 (Red Hat 4.8.5-28)], pyOpenSSL 18.0.0 (OpenSSL 1.1.0i  14 Aug 2018), cryptography 2.3.1, Platform Linux-3.10.0-693.el7.x86_64-x86_64-with-centos-7.4.1708-Core
2018-08-28 22:32:40 [scrapy.crawler] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'jianshu.spiders', 'FEED_URI': 'jianshu.json', 'CONCURRENT_REQUESTS': 1, 'SPIDER_MODULES': ['jianshu.spiders'], 'BOT_NAME': 'jianshu', 'COOKIES_ENABLED': False, 'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)', 'FEED_FORMAT': 'json', 'FEED_EXPORT_ENCODING': 'GBK', 'DOWNLOAD_DELAY': 5}
2018-08-28 22:32:40 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.feedexport.FeedExporter','scrapy.extensions.memusage.MemoryUsage','scrapy.extensions.logstats.LogStats','scrapy.extensions.telnet.TelnetConsole','scrapy.extensions.corestats.CoreStats']
2018-08-28 22:32:40 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware','scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware','scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware','scrapy.downloadermiddlewares.useragent.UserAgentMiddleware','scrapy.downloadermiddlewares.retry.RetryMiddleware','scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware','scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware','scrapy.downloadermiddlewares.redirect.RedirectMiddleware','scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware','scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-08-28 22:32:40 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware','scrapy.spidermiddlewares.offsite.OffsiteMiddleware','scrapy.spidermiddlewares.referer.RefererMiddleware','scrapy.spidermiddlewares.urllength.UrlLengthMiddleware','scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-08-28 22:32:40 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2018-08-28 22:32:40 [scrapy.core.engine] INFO: Spider opened
2018-08-28 22:32:40 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-08-28 22:32:40 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-08-28 22:32:40 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.jianshu.com/trending/monthly> from <GET http://www.jianshu.com/trending/monthly>
2018-08-28 22:32:46 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.jianshu.com/trending/monthly> (referer: None)
2018-08-28 22:32:46 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.jianshu.com/trending/monthly>
{'author': [u'\u65e0\u6212'],'title': [u'\u542c\u8bf4\u4f60\u611f\u8c22\u8d2b\u7a77\uff0c\u6211\u60f3\u7b11\uff0c\u5374\u54ed\u4e86'],'url': u'http://www.jianshu.com/p/2a7a594816e1'}
2018-08-28 22:32:46 [scrapy.core.engine] INFO: Closing spider (finished)
2018-08-28 22:32:46 [scrapy.extensions.feedexport] INFO: Stored json feed (1 items) in: jianshu.json
2018-08-28 22:32:46 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 606,'downloader/request_count': 2,'downloader/request_method_count/GET': 2,'downloader/response_bytes': 10879,'downloader/response_count': 2,'downloader/response_status_count/200': 1,'downloader/response_status_count/301': 1,'finish_reason': 'finished','finish_time': datetime.datetime(2018, 8, 28, 14, 32, 46, 587323),'item_scraped_count': 1,'log_count/DEBUG': 4,'log_count/INFO': 8,'memusage/max': 42975232,'memusage/startup': 42975232,'response_received_count': 1,'scheduler/dequeued': 2,'scheduler/dequeued/memory': 2,'scheduler/enqueued': 2,'scheduler/enqueued/memory': 2,'start_time': datetime.datetime(2018, 8, 28, 14, 32, 40, 291948)}
2018-08-28 22:32:46 [scrapy.core.engine] INFO: Spider closed (finished)
[root@Uu jianshu]# cat jianshu.json
[
{"url": "http://www.jianshu.com/p/2a7a594816e1", "title": ["听说你感谢贫穷,我想笑,却哭了"], "author": ["无戒"]}

Clearly the problem is solved: the Chinese text now displays correctly.
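This is consistent with how cat works: it writes the file's bytes verbatim, and the terminal decodes them with its locale encoding. A sketch for checking that encoding from Python (an assumed diagnostic, not from the original session):

# -*- coding: utf-8 -*-
import locale

# If this reports a GBK-family encoding, a GBK feed will display
# correctly under cat and a UTF-8 feed will not.
print(locale.getpreferredencoding())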

III. Now try GB2312 (i.e., set FEED_EXPORT_ENCODING = 'GB2312'). The run went as follows:

# Trying GB2312:
[root@Uu jianshu]# vi settings.py
# -*- coding: utf-8 -*-

# Scrapy settings for jianshu project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://doc.scrapy.org/en/latest/topics/settings.html
#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'jianshu'

SPIDER_MODULES = ['jianshu.spiders']
NEWSPIDER_MODULE = 'jianshu.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'jianshu (+http://www.yourdomain.com)'
USER_AGENT = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)"

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
CONCURRENT_REQUESTS = 1

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 5
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
COOKIES_ENABLED = False
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
#    'jianshu.pipelines.JianshuPipeline': 300,
#}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

#FEED_URI = u'/home/BS/jianshu.json'
#FEED_FORMAT = 'json'
#FEED_EXPORT_ENCODING = 'UTF-8'
#FEED_EXPORT_ENCODING = 'GBK'
FEED_EXPORT_ENCODING = 'GB2312'
"settings.py" 98L, 3371C written
[root@Uu jianshu]# cd ..
[root@Uu jianshu]# rm -f jianshu.json
[root@Uu jianshu]# ll
total 4
drwxr-xr-x. 3 root root 174 Aug 28 22:44 jianshu
-rw-r--r--. 1 root root 257 Aug 28 14:44 scrapy.cfg
[root@Uu jianshu]# scrapy crawl jianshu -o jianshu.json
2018-08-28 22:45:25 [scrapy.utils.log] INFO: Scrapy 1.5.1 started (bot: jianshu)
2018-08-28 22:45:25 [scrapy.utils.log] INFO: Versions: lxml 3.2.1.0, libxml2 2.9.1, cssselect 1.0.3, parsel 1.5.0, w3lib 1.19.0, Twisted 18.7.0, Python 2.7.5 (default, Jul 13 2018, 13:06:57) - [GCC 4.8.5 20150623 (Red Hat 4.8.5-28)], pyOpenSSL 18.0.0 (OpenSSL 1.1.0i  14 Aug 2018), cryptography 2.3.1, Platform Linux-3.10.0-693.el7.x86_64-x86_64-with-centos-7.4.1708-Core
2018-08-28 22:45:25 [scrapy.crawler] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'jianshu.spiders', 'FEED_URI': 'jianshu.json', 'CONCURRENT_REQUESTS': 1, 'SPIDER_MODULES': ['jianshu.spiders'], 'BOT_NAME': 'jianshu', 'COOKIES_ENABLED': False, 'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)', 'FEED_FORMAT': 'json', 'FEED_EXPORT_ENCODING': 'GB2312', 'DOWNLOAD_DELAY': 5}
2018-08-28 22:45:25 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.feedexport.FeedExporter','scrapy.extensions.memusage.MemoryUsage','scrapy.extensions.logstats.LogStats','scrapy.extensions.telnet.TelnetConsole','scrapy.extensions.corestats.CoreStats']
2018-08-28 22:45:26 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware','scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware','scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware','scrapy.downloadermiddlewares.useragent.UserAgentMiddleware','scrapy.downloadermiddlewares.retry.RetryMiddleware','scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware','scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware','scrapy.downloadermiddlewares.redirect.RedirectMiddleware','scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware','scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-08-28 22:45:26 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware','scrapy.spidermiddlewares.offsite.OffsiteMiddleware','scrapy.spidermiddlewares.referer.RefererMiddleware','scrapy.spidermiddlewares.urllength.UrlLengthMiddleware','scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-08-28 22:45:26 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2018-08-28 22:45:26 [scrapy.core.engine] INFO: Spider opened
2018-08-28 22:45:26 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-08-28 22:45:26 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-08-28 22:45:26 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.jianshu.com/trending/monthly> from <GET http://www.jianshu.com/trending/monthly>
2018-08-28 22:45:32 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.jianshu.com/trending/monthly> (referer: None)
2018-08-28 22:45:32 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.jianshu.com/trending/monthly>
{'author': [u'\u65e0\u6212'],'title': [u'\u542c\u8bf4\u4f60\u611f\u8c22\u8d2b\u7a77\uff0c\u6211\u60f3\u7b11\uff0c\u5374\u54ed\u4e86'],'url': u'http://www.jianshu.com/p/2a7a594816e1'}
2018-08-28 22:45:32 [scrapy.core.engine] INFO: Closing spider (finished)
2018-08-28 22:45:32 [scrapy.extensions.feedexport] INFO: Stored json feed (1 items) in: jianshu.json
2018-08-28 22:45:32 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 606,'downloader/request_count': 2,'downloader/request_method_count/GET': 2,'downloader/response_bytes': 10873,'downloader/response_count': 2,'downloader/response_status_count/200': 1,'downloader/response_status_count/301': 1,'finish_reason': 'finished','finish_time': datetime.datetime(2018, 8, 28, 14, 45, 32, 543578),'item_scraped_count': 1,'log_count/DEBUG': 4,'log_count/INFO': 8,'memusage/max': 42971136,'memusage/startup': 42971136,'response_received_count': 1,'scheduler/dequeued': 2,'scheduler/dequeued/memory': 2,'scheduler/enqueued': 2,'scheduler/enqueued/memory': 2,'start_time': datetime.datetime(2018, 8, 28, 14, 45, 26, 27174)}
2018-08-28 22:45:32 [scrapy.core.engine] INFO: Spider closed (finished)
[root@Uu jianshu]# cat jianshu.json
[
{"url": "http://www.jianshu.com/p/2a7a594816e1", "title": ["听说你感谢贫穷,我想笑,却哭了"], "author": ["无戒"]}
][root@Uu jianshu]# 

As you can see, the problem is solved here too: GB2312 also fixes the Chinese mojibake.
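That GBK and GB2312 behave identically here is expected: GBK is a superset of GB2312, and characters within the GB2312 repertoire encode to the same bytes under both. A sketch illustrating this with the scraped title:

# -*- coding: utf-8 -*-
# For GB2312-covered characters, GBK and GB2312 produce identical bytes,
# so the two settings are interchangeable for this data.
title = u'听说你感谢贫穷，我想笑，却哭了'
assert title.encode('gbk') == title.encode('gb2312')
print('GBK and GB2312 give identical bytes for this title')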

The grand summary:

These experiments show that the Chinese mojibake in this Scrapy project comes down to a single parameter in settings.py: FEED_EXPORT_ENCODING.

Moreover, only GBK or GB2312 gave readable output here; UTF-8 did not. (Which encoding "works" depends on the terminal: cat emits the file's bytes verbatim and the terminal decodes them with its locale encoding, so the valid UTF-8 feed was simply being rendered with the wrong codec, as the decode sketch after attempt I illustrates.)
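For reference, the whole fix reduces to a single line in settings.py (a minimal sketch of the relevant setting, per the runs above):

# settings.py -- the one setting these experiments turned on
FEED_EXPORT_ENCODING = 'GBK'  # or 'GB2312'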
