How to transcode HTML fetched with GET: encoding logic for web crawlers

Encoding logic for crawled web pages

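The earliest encoding was ASCII, which was designed in the United States. Each character is stored in one byte (8 bits). English text needs fewer than 128 distinct characters, so ASCII defines only 128 codes (7 bits), while a byte can hold 256 distinct values. At the time, this scheme was entirely sufficient.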

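As computers spread around the world, every language faced the problem of how to be represented on a computer. Chinese alone has several thousand commonly used characters, so single-byte ASCII was clearly not enough. This is where Unicode came in. Strictly speaking, Unicode is only a mapping standard and does not correspond to a concrete storage format. The prefix "Uni-" means "unified": Unicode tries to represent all of the world's languages with one coding scheme. However, Unicode only assigns each character a number (a code point); it does not specify how many bytes that number occupies in memory or on the wire. Several encoding forms appeared as a result, such as UTF-16 and UTF-32, and the goal of one universal format was not realized until the internet became widespread and UTF-8 took hold. UTF-8 implements the Unicode standard with a variable-length scheme: each character is encoded as 1 to 4 bytes (the original design allowed up to 6), and ASCII characters keep their single-byte values.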
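To make the variable-length point concrete, here is a minimal sketch that counts the bytes one character occupies under three Unicode encoding forms. The sample characters and the helper name `encoded_lengths` are chosen purely for illustration.

```python
# Compare how many bytes one character occupies in UTF-8, UTF-16 and UTF-32.
# The sample characters are arbitrary illustrations, written as escapes so the
# code points are unambiguous.
samples = [
    "A",           # U+0041, ASCII letter
    "\u00e9",      # U+00E9, Latin small letter e with acute
    "\u4e2d",      # U+4E2D, a common Chinese character
    "\U0001F600",  # U+1F600, an emoji outside the Basic Multilingual Plane
]


def encoded_lengths(ch):
    """Return (utf-8, utf-16, utf-32) byte counts for a single character."""
    # The "-le" variants are used so that no byte-order mark is prepended.
    return (
        len(ch.encode("utf-8")),
        len(ch.encode("utf-16-le")),
        len(ch.encode("utf-32-le")),
    )


for ch in samples:
    print(hex(ord(ch)), encoded_lengths(ch))
```

UTF-8 grows from one to four bytes as the code point gets larger, while UTF-32 always uses four.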

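A common misunderstanding here is to confuse bytes with the other data types of a programming language. Bytes are the form in which a computer actually stores data, and they are the only format in which data travels over a network. Strings in formats such as JSON or XML must also be converted to bytes before they can be sent through a socket. Converting between bytes and strings is exactly what encoding and decoding mean, and UTF-8 is the format named when doing that conversion.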
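The following sketch illustrates that a JSON string has to become bytes before it can cross a socket. It uses `socket.socketpair()` as a stand-in for a real client and server, and the payload is hypothetical sample data.

```python
import json
import socket

# A connected pair of sockets stands in for a real client/server connection.
sender, receiver = socket.socketpair()

payload = {"title": "\u7f16\u7801", "id": 1}     # hypothetical sample data
text = json.dumps(payload, ensure_ascii=False)   # object -> str
data = text.encode("utf-8")                      # str -> bytes (encoding)

sender.sendall(data)   # sockets accept bytes only; passing a str raises TypeError
sender.close()

chunks = []
while True:
    chunk = receiver.recv(4096)
    if not chunk:      # empty bytes means the peer closed the connection
        break
    chunks.append(chunk)
receiver.close()

received = b"".join(chunks).decode("utf-8")      # bytes -> str (decoding)
result = json.loads(received)                    # str -> object
print(result == payload)
```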

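A brief note on serialization and deserialization. Serialization can be local or over the network. Local serialization usually means persisting an in-memory object to disk: the object and its related information are serialized into a string, the string is encoded in some format (UTF-8, for example) into bytes, and the bytes are written to disk. Deserialization reverses this: the bytes are read from disk into memory and decoded into a string, and the string is then parsed to rebuild the object.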
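The round trip described above can be sketched as follows. JSON is used as the string format and the record is hypothetical sample data; other serialization formats would follow the same object, string, bytes sequence.

```python
import json
import os
import tempfile

record = {"url": "http://example.com", "title": "\u793a\u4f8b"}  # hypothetical object

# Serialize: object -> str -> bytes -> disk
text = json.dumps(record, ensure_ascii=False)
path = os.path.join(tempfile.mkdtemp(), "record.json")
with open(path, "wb") as f:
    f.write(text.encode("utf-8"))

# Deserialize: disk -> bytes -> str -> object
with open(path, "rb") as f:
    raw = f.read()
restored = json.loads(raw.decode("utf-8"))

print(restored == record)
```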

Encoding detection in Requests:

bytes str unicode

1. str/bytes

    >>> s = '123'
    >>> type(s)
    <class 'str'>
    >>> s = b'123'
    >>> type(s)
    <class 'bytes'>

2. Converting between str and bytes

str and bytes are converted as follows:

str ⇒ bytes: bytes(s, encoding='utf8')

bytes ⇒ str: str(b, encoding='utf-8')

The two can also be converted by encoding and decoding:

Encode a str to bytes: str.encode(s)

Decode bytes to a str: bytes.decode(b)
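As a small illustration of these conversions, the sketch below encodes the same text with two codecs and shows what happens when bytes are decoded with the wrong one. The sample string is an assumption made for the example.

```python
s = "\u4e2d\u6587"                # the two characters of "Chinese (language)"

b_utf8 = s.encode("utf-8")        # same result as bytes(s, encoding="utf-8")
b_gbk = s.encode("gbk")
print(len(b_utf8), len(b_gbk))    # the byte length depends on the codec

# Decoding must use the codec the bytes were produced with.
same_utf8 = b_utf8.decode("utf-8") == s
same_gbk = str(b_gbk, encoding="gbk") == s

# Decoding GBK bytes as UTF-8 fails for this input.
try:
    b_gbk.decode("utf-8")
    wrong_decode_failed = False
except UnicodeDecodeError:
    wrong_decode_failed = True

# errors="replace" never raises; undecodable bytes become U+FFFD.
lossy = b_gbk.decode("utf-8", errors="replace")
```

`errors="replace"` hides the failure rather than fixing it, which is why a wrong encoding guess shows up as replacement characters in scraped pages.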

3. Strings in Python 2 and Python 3

What is tensorflow.compat.as_str()?

Python 2 treats strings as native bytes rather than unicode.

In Python 3, all strings are unicode.
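Code that may receive either type often normalizes its input first. A minimal Python 3 sketch of such a helper is shown below; the name `as_text` and the UTF-8 default are assumptions for illustration, not part of any library discussed here.

```python
def as_text(value, encoding="utf-8"):
    """Return `value` as Unicode text (hypothetical helper, Python 3)."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    if isinstance(value, str):
        return value
    raise TypeError("expected bytes or str, got %s" % type(value).__name__)


print(as_text(b"abc"), as_text("abc"))
```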

1. BeautifulSoup encoding logic

Code location: python2.7/site-packages/bs4/dammit.py

    @property
    def encodings(self):
        """Yield a number of encodings that might work for this markup."""
        tried = set()
        for e in self.override_encodings:
            if self._usable(e, tried):
                yield e

        # Did the document originally start with a byte-order mark
        # that indicated its encoding?
        if self._usable(self.sniffed_encoding, tried):
            yield self.sniffed_encoding

        # Look within the document for an XML or HTML encoding
        # declaration.
        if self.declared_encoding is None:
            self.declared_encoding = self.find_declared_encoding(
                self.markup, self.is_html)
        if self._usable(self.declared_encoding, tried):
            yield self.declared_encoding

        # Use third-party character set detection to guess at the
        # encoding.
        if self.chardet_encoding is None:
            self.chardet_encoding = chardet_dammit(self.markup)
        if self._usable(self.chardet_encoding, tried):
            yield self.chardet_encoding

        # As a last-ditch effort, try utf-8 and windows-1252.
        for e in ('utf-8', 'windows-1252'):
            if self._usable(e, tried):
                yield e

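Explanation: this code tries several encoding sources, in the following priority order:

1. self.override_encodings: encodings specified by the user

2. self.sniffed_encoding

    self.markup, self.sniffed_encoding = self.strip_byte_order_mark(markup)

This function determines the page's encoding by checking for a byte-order mark (BOM) at the very start of the document: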

    @classmethod
    def strip_byte_order_mark(cls, data):
        """If a byte-order mark is present, strip it and return the encoding it implies."""
        encoding = None
        if isinstance(data, unicode):
            # Unicode data cannot have a byte-order mark.
            return data, encoding
        if (len(data) >= 4) and (data[:2] == b'\xfe\xff') \
               and (data[2:4] != '\x00\x00'):
            encoding = 'utf-16be'
            data = data[2:]
        elif (len(data) >= 4) and (data[:2] == b'\xff\xfe') \
                 and (data[2:4] != '\x00\x00'):
            encoding = 'utf-16le'
            data = data[2:]
        elif data[:3] == b'\xef\xbb\xbf':
            encoding = 'utf-8'
            data = data[3:]
        elif data[:4] == b'\x00\x00\xfe\xff':
            encoding = 'utf-32be'
            data = data[4:]
        elif data[:4] == b'\xff\xfe\x00\x00':
            encoding = 'utf-32le'
            data = data[4:]
        return data, encoding
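The same idea can be sketched in Python 3 with the BOM constants from the standard `codecs` module. This is an illustrative rewrite, not the bs4 implementation; the name `sniff_bom` is hypothetical, and it returns Python's own codec names.

```python
import codecs

# Longer BOMs are checked first, because the UTF-32-LE mark (FF FE 00 00)
# begins with the UTF-16-LE mark (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def sniff_bom(data):
    """Return (data_without_bom, encoding); encoding is None if no BOM is found."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return data[len(bom):], name
    return data, None


body, encoding = sniff_bom(codecs.BOM_UTF8 + "caf\u00e9".encode("utf-8"))
print(encoding, body.decode(encoding))
```

One inherent ambiguity remains: a UTF-16-LE document whose first character after the BOM is U+0000 looks exactly like a UTF-32-LE BOM.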

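3. self.declared_encoding

    self.declared_encoding = self.find_declared_encoding(
        self.markup, self.is_html)

This function uses regular expressions to find the encoding declaration near the top of the document.

The regular expressions:

    xml_encoding_re = re.compile(
        '^<\?.*encoding=[\'"](.*?)[\'"].*\?>'.encode(), re.I)
    html_meta_re = re.compile(
        '<\s*meta[^>]+charset\s*=\s*["\']?([^>]*?)[ /;\'">]'.encode(), re.I)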

    @classmethod
    def find_declared_encoding(cls, markup, is_html=False, search_entire_document=False):
        """Given a document, tries to find its declared encoding.

        An XML encoding is declared at the beginning of the document.

        An HTML encoding is declared in a tag, hopefully near the
        beginning of the document.
        """
        if search_entire_document:
            xml_endpos = html_endpos = len(markup)
        else:
            xml_endpos = 1024
            html_endpos = max(2048, int(len(markup) * 0.05))

        declared_encoding = None
        declared_encoding_match = xml_encoding_re.search(markup, endpos=xml_endpos)
        if not declared_encoding_match and is_html:
            declared_encoding_match = html_meta_re.search(markup, endpos=html_endpos)
        if declared_encoding_match is not None:
            declared_encoding = declared_encoding_match.groups()[0].decode(
                'ascii', 'replace')
        if declared_encoding:
            return declared_encoding.lower()
        return None
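A simplified sketch of the meta-tag lookup is shown below. The pattern is a deliberately reduced, hypothetical variant (it only handles `<meta>` tags, not the XML declaration), and `declared_charset` is an illustrative name rather than a bs4 function.

```python
import re

# Simplified, hypothetical pattern: captures the value that follows "charset="
# inside a <meta> tag. It works on bytes because the document is not decoded yet.
META_CHARSET_RE = re.compile(
    rb'<\s*meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9_.:-]+)', re.I)


def declared_charset(markup, limit=2048):
    """Return the lowercased charset declared in the first `limit` bytes, or None."""
    match = META_CHARSET_RE.search(markup, 0, limit)
    if match is None:
        return None
    return match.group(1).decode("ascii").lower()


html5 = b'<html><head><meta charset="GBK"><title>t</title></head></html>'
html4 = (b'<html><head><meta http-equiv="Content-Type" '
         b'content="text/html; charset=utf-8"></head></html>')
print(declared_charset(html5), declared_charset(html4))
```

Limiting the search to the first couple of kilobytes mirrors the `endpos` arguments in the source above and avoids scanning very large pages.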

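4. self.chardet_encoding

    self.chardet_encoding = chardet_dammit(self.markup)

This step relies on the chardet package, which guesses the encoding from byte statistics of the document body and reports a confidence value alongside its guess.

    import chardet

    def chardet_dammit(s):
        return chardet.detect(s)['encoding']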
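Because the guess comes with a confidence value, a caller can choose to ignore weak guesses. The sketch below assumes the third-party chardet package may or may not be installed; the helper name `guess_encoding` and the 0.5 threshold are arbitrary choices for illustration.

```python
try:
    import chardet          # third-party; may not be installed
except ImportError:
    chardet = None


def guess_encoding(data, min_confidence=0.5):
    """Return chardet's guess, or None if chardet is missing or unsure."""
    if chardet is None or not data:
        return None
    result = chardet.detect(data)   # dict with 'encoding' and 'confidence' keys
    if result["encoding"] is None or result["confidence"] < min_confidence:
        return None
    return result["encoding"]


sample = "\u7f51\u9875\u7f16\u7801\u68c0\u6d4b".encode("gbk")  # short samples are hard to classify
print(guess_encoding(sample))
```

Short inputs give chardet little to work with, so the guess for a snippet like this may be missing or wrong; longer page bodies generally produce more reliable results.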

2. Requests encoding logic

    response = requests.get(url, verify=False, headers=configSpider.get_head())

Requests provides two encoding detection results.

response.encoding

Location: python2.7/site-packages/requests/adapters.py

```
response.encoding = get_encoding_from_headers(response.headers)
```

Location: python2.7/site-packages/requests/utils.py

```
def get_encoding_from_headers(headers):
    """Returns encodings from given HTTP Header Dict.

    :param headers: dictionary to extract encoding from.
    :rtype: str
    """

    content_type = headers.get('content-type')

    if not content_type:
        return None

    content_type, params = cgi.parse_header(content_type)

    if 'charset' in params:
        return params['charset'].strip("'\"")

    if 'text' in content_type:
        return 'ISO-8859-1'
```

The cgi.parse_header() function:

```
def parse_header(line):
    """Parse a Content-type like header.

    Return the main content-type and a dictionary of options.

    """
    parts = _parseparam(';' + line)
    key = parts.next()
    pdict = {}
    for p in parts:
        i = p.find('=')
        if i >= 0:
            name = p[:i].strip().lower()
            value = p[i+1:].strip()
            if len(value) >= 2 and value[0] == value[-1] == '"':
                value = value[1:-1]
                value = value.replace('\\\\', '\\').replace('\\"', '"')
            pdict[name] = value
    return key, pdict
```

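This simply takes the encoding declared in the response headers. If a charset is given, it is returned; if the content type is a text type such as text/html with no charset, 'ISO-8859-1' is returned.

Many pages send only content-type: text/html in their response headers. Decoding such a page as 'ISO-8859-1' obviously produces garbled text whenever the page is actually in another encoding.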
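The header rule can be reproduced in Python 3 without the `cgi` module, which was removed from the standard library in Python 3.13. The sketch below uses `email.message.Message` to parse the header; it imitates the behavior described above and is not the Requests source, and the function name is hypothetical.

```python
from email.message import Message


def charset_from_content_type(content_type):
    """Return the charset implied by a Content-Type value (illustrative sketch)."""
    if not content_type:
        return None
    msg = Message()
    msg["Content-Type"] = content_type
    charset = msg.get_content_charset()   # lowercased charset parameter, or None
    if charset:
        return charset
    main_type = content_type.split(";", 1)[0].strip().lower()
    if "text" in main_type:
        return "ISO-8859-1"               # the default discussed above
    return None


for value in ("text/html; charset=GBK", "text/html", "application/json"):
    print(value, "->", charset_from_content_type(value))
```

The middle case is the problematic one for crawlers: the header names no charset, so the ISO-8859-1 default is reported regardless of what the body really contains.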

response.apparent_encoding

Requests also exposes apparent_encoding. It is simple: it comes from running chardet over the body, so it is not guaranteed to be accurate either.

3. content and text in Requests

```
@property
def content(self):
    """Content of the response, in bytes."""

    if self._content is False:
        # Read the contents.
        if self._content_consumed:
            raise RuntimeError(
                'The content for this response was already consumed')

        if self.status_code == 0 or self.raw is None:
            self._content = None
        else:
            self._content = bytes().join(self.iter_content(CONTENT_CHUNK_SIZE)) or bytes()

    self._content_consumed = True
    # don't need to release the connection; that's been handled by urllib3
    # since we exhausted the data.
    return self._content

@property
def text(self):
    """Content of the response, in unicode.

    If Response.encoding is None, encoding will be guessed using
    ``chardet``.

    The encoding of the response content is determined based solely on HTTP
    headers, following RFC 2616 to the letter. If you can take advantage of
    non-HTTP knowledge to make a better guess at the encoding, you should
    set ``r.encoding`` appropriately before accessing this property.
    """

    # Try charset from content-type
    content = None
    encoding = self.encoding

    if not self.content:
        return str('')

    # Fallback to auto-detected encoding.
    if self.encoding is None:
        encoding = self.apparent_encoding

    # Decode unicode from given encoding.
    try:
        content = str(self.content, encoding, errors='replace')
    except (LookupError, TypeError):
        # A LookupError is raised if the encoding was not found which could
        # indicate a misspelling or similar mistake.
        #
        # A TypeError can be raised if encoding is None
        #
        # So we try blindly encoding.
        content = str(self.content, errors='replace')

    return content
```

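content is the raw byte stream (bytes), while text is that content decoded to str:

    content = str(self.content, encoding, errors='replace')

If the page happens to be UTF-8 and the script itself works in UTF-8 (# -*- coding: utf-8 -*-), the bytes in content can be used directly. Otherwise the result is still garbled unless the bytes are decoded with the page's real encoding.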
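The garbling is easy to reproduce without any network access. The sketch below takes UTF-8 bytes (a hypothetical page fragment), decodes them as ISO-8859-1 the way `text` would under the header default, and then recovers the original.

```python
raw = "\u4e2d\u6587\u7f51\u9875".encode("utf-8")   # bytes as they arrive from a UTF-8 page

# What .text yields when the encoding is taken to be ISO-8859-1:
garbled = raw.decode("iso-8859-1")
print(repr(garbled))

# ISO-8859-1 maps every byte to exactly one code point, so nothing is lost
# and the original bytes can be recovered and decoded correctly.
repaired = garbled.encode("iso-8859-1").decode("utf-8")
print(repaired)
```

This recovery only works because ISO-8859-1 is lossless for arbitrary bytes; once `errors='replace'` has substituted characters under some other wrong codec, the original bytes are gone, so decoding `content` directly is the safer route.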

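In summary, the best approach is to combine what the source code does with your own needs and implement a detection chain of your own, checking in order:

1. The encoding declared in the headers

2. A byte-order mark at the start of the page

3. The encoding declared in the body

4. Detection with the chardet module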
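The four checks can be chained into one function. What follows is a minimal sketch under stated assumptions: the names `detect_encoding` and `decode_html`, the simplified meta pattern, the 2048-byte search window, and the UTF-8 fallback are all illustrative choices, and chardet is treated as optional.

```python
import codecs
import re
from email.message import Message

# BOM -> Python codec that consumes the mark while decoding.
BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_BE, "utf-16"),
    (codecs.BOM_UTF16_LE, "utf-16"),
]
# Simplified, hypothetical pattern for <meta ... charset=...>.
META_RE = re.compile(rb'<\s*meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9_.:-]+)', re.I)


def detect_encoding(content, content_type=None, default="utf-8"):
    """Return the first encoding suggested by header, BOM, <meta>, then chardet."""
    # 1. charset parameter of the Content-Type header
    if content_type:
        msg = Message()
        msg["Content-Type"] = content_type
        charset = msg.get_content_charset()
        if charset:
            return charset
    # 2. byte-order mark (longer marks are listed first)
    for bom, name in BOMS:
        if content.startswith(bom):
            return name
    # 3. declaration inside the document body
    match = META_RE.search(content, 0, 2048)
    if match:
        return match.group(1).decode("ascii").lower()
    # 4. statistical guess, only if the optional chardet package is available
    try:
        import chardet
    except ImportError:
        chardet = None
    if chardet is not None and content:
        guess = chardet.detect(content)["encoding"]
        if guess:
            return guess.lower()
    return default


def decode_html(content, content_type=None):
    """Decode page bytes with the detected encoding, replacing bad bytes."""
    encoding = detect_encoding(content, content_type)
    try:
        return content.decode(encoding, errors="replace")
    except LookupError:   # the page declared a codec name Python does not know
        return content.decode("utf-8", errors="replace")


page = '<meta charset="gbk"><p>\u4e2d\u6587</p>'.encode("gbk")
print(decode_html(page, "text/html"))
```

Note that a header without a charset falls through to the later checks here instead of defaulting to ISO-8859-1, which is the main behavioral difference from `response.encoding`.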

When using the Requests package, a simple fix is:

    if response.encoding == 'ISO-8859-1':
        response.encoding = response.apparent_encoding
    response.text
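The same fix can be wrapped in a small helper that also tolerates different spellings of the codec name. This is a sketch: `fix_encoding` is a hypothetical name, and it only relies on the `encoding` and `apparent_encoding` attributes discussed above, so a stand-in object is used here in place of a real response.

```python
from types import SimpleNamespace


def fix_encoding(response):
    """Swap the ISO-8859-1 default for the content-based guess, in place."""
    declared = (response.encoding or "").lower()
    if declared in ("", "iso-8859-1") and response.apparent_encoding:
        response.encoding = response.apparent_encoding
    return response


# Stand-in for a requests.Response, so the sketch runs without network access.
fake = SimpleNamespace(encoding="ISO-8859-1", apparent_encoding="GB2312")
print(fix_encoding(fake).encoding)
```

With a real response the call would look like `fix_encoding(requests.get(url)).text`. One limitation: a page that genuinely declares charset=ISO-8859-1 cannot be told apart from the header default, so it is re-guessed as well.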

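Alternatively, borrow the approach from bs4:

    from bs4.dammit import EncodingDetector

    # markup is the raw page as bytes, e.g. response.content
    detector = EncodingDetector(response.content, is_html=True)
    # encodings yields candidates in the priority order shown earlier
    print(next(detector.encodings))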
