Kaggle Data Cleaning Challenge Day 4: Character Encodings

Today is day four of the Kaggle data cleaning challenge, and the task is to work with character encodings.

The material is split into four parts:

Get our environment set up
What are encodings?
Reading in files with encoding problems
Saving your files with UTF-8 encoding

1. Get our environment set up

As usual, start by importing the libraries we need:

# modules we'll use
import pandas as pd
import numpy as np

# helpful character encoding module
import chardet

# set seed for reproducibility
np.random.seed(0)

2. What are encodings?

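A character encoding maps the characters of a character set to objects in some other set (for example bit patterns, sequences of natural numbers, octets, or electrical pulses) so that text can be stored in a computer and transmitted over a network. For example, under ASCII or UTF-8 the bit string 0110100001101001 decodes to the human-readable text "hi".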
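To make the mapping concrete, here is a minimal sketch that turns the bit string from the example above into text and back. The variable names are illustrative, and ASCII is assumed as the encoding (UTF-8 gives the same bytes for these two characters).

```python
# The bit string from the example above, as sixteen "0"/"1" characters
bits = "0110100001101001"

# Pack the bits into raw bytes (two bytes here), most significant byte first
raw = int(bits, 2).to_bytes(len(bits) // 8, byteorder="big")

# Decoding maps the bytes to characters under a chosen encoding
text = raw.decode("ascii")
print(text)

# Encoding goes the other way: characters -> bytes -> bits
round_trip = "".join(f"{byte:08b}" for byte in text.encode("ascii"))
print(round_trip == bits)
```

The same bytes can mean different text under a different encoding, which is why the encoding has to be known (or guessed) before decoding.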

Many different encodings are in use today, and the most important one is UTF-8:

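UTF-8 (8-bit Unicode Transformation Format) is a variable-length character encoding for Unicode. It encodes each Unicode character in 1 to 4 bytes (the original design allowed up to 6 bytes, but RFC 3629 restricts it to 4). On web pages it lets Simplified and Traditional Chinese and other languages (such as English, Japanese, and Korean) be displayed with one uniform encoding.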
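The variable-length property is easy to check directly. The sketch below encodes a handful of characters chosen for illustration (they are not from the original notebook) and counts the bytes each one takes in UTF-8.

```python
# Sample characters written as escapes so the source file's own encoding does not matter:
# "A", e-acute, the euro sign, the CJK character U+4E2D, and an emoji
samples = ["A", "\u00e9", "\u20ac", "\u4e2d", "\U0001F600"]

# Number of bytes each character occupies once encoded as UTF-8
byte_lengths = {ch: len(ch.encode("utf-8")) for ch in samples}

for ch, n in byte_lengths.items():
    print(f"U+{ord(ch):04X} {ch!r}: {n} byte(s)")
```

ASCII characters stay at one byte, which is why plain English text looks identical in ASCII and UTF-8.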

Let's look at an example. Define a string and check its type, which comes back as str:

# start with a string
before = "This is the euro symbol: €"

# check to see what datatype it is
type(before)

Next, encode the string as UTF-8. The result is a bytes object:

# encode it to a different encoding, replacing characters that raise errors
after = before.encode("utf-8", errors = "replace")

# check the type
type(after)

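If you inspect a bytes object, you'll see the letter b in front, followed by a run of text. That is because Python prints bytes as if they were ASCII characters, and shows any byte without a printable ASCII form as a \x escape. The euro symbol therefore appears as \xe2\x82\xac, its three-byte UTF-8 encoding, which can look like "mojibake" at first glance. ASCII is an older encoding that covers only 128 characters, so newer symbols such as the euro sign are not part of it.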

# take a look at what the bytes look like
after
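As a small illustration of what the printed bytes contain, the sketch below isolates the three bytes that encode the euro sign and then shows what real mojibake looks like when those bytes are decoded with the wrong encoding. The choice of cp1252 (Windows-1252) as the wrong encoding is just an example.

```python
after = "This is the euro symbol: €".encode("utf-8")

# The last three bytes are the UTF-8 encoding of the euro sign
euro_bytes = after[-3:]
print(euro_bytes)        # the repr uses \x escapes for non-ASCII bytes
print(list(euro_bytes))  # the same bytes as integers

# Genuine mojibake: UTF-8 bytes decoded with a different single-byte encoding
garbled = euro_bytes.decode("cp1252")
print(garbled)
```

Each of the three bytes is turned into its own unrelated character, so one symbol becomes three.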

When we decode the bytes with the correct encoding, we get the original text back:

# decode the UTF-8 bytes back into a string
print(after.decode("utf-8"))

When we try to decode the bytes with a different encoding, we get an error (a UnicodeDecodeError):

# try to decode our bytes with the ascii encoding
print(after.decode("ascii"))
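If you would rather handle this failure than let it stop the program, one option is to catch the exception or to pass an error handler to decode. This is a minimal sketch, not part of the original notebook; the variable names are illustrative.

```python
data = "This is the euro symbol: €".encode("utf-8")

# Strict decoding (the default) raises when a byte is not valid in the encoding
try:
    data.decode("ascii")
    error_message = None
except UnicodeDecodeError as exc:
    error_message = str(exc)
    print("decode failed:", error_message)

# errors="replace" never raises; undecodable bytes become U+FFFD placeholders
lenient = data.decode("ascii", errors="replace")
print(lenient)
```

The lenient version keeps the program running but loses information, so it is best kept for inspection rather than for cleaning data.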

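We also run into trouble when we encode a string to bytes with an encoding that cannot represent all of its characters. Python 3 strings are Unicode and str.encode() defaults to UTF-8, which can represent every character. If we encode to ASCII with errors="replace" instead, no exception is raised, but the euro sign is silently replaced with ? and cannot be recovered by decoding: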

# start with a string
before = "This is the euro symbol: €"

# encode it to a different encoding, replacing characters that raise errors
after = before.encode("ascii", errors = "replace")

# decode the ASCII bytes back into a string
print(after.decode("ascii"))

# We've lost the original underlying byte string! It's been
# replaced with the underlying byte string for the unknown character :(
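For comparison, here is a short sketch of how a few of Python's built-in error handlers treat the same unencodable character. The handler names are standard values for the errors argument of str.encode; the dictionary and loop are just for illustration.

```python
before = "This is the euro symbol: €"

# Standard error handlers accepted by str.encode
handlers = ["replace", "ignore", "xmlcharrefreplace", "backslashreplace"]
encoded = {name: before.encode("ascii", errors=name) for name in handlers}

for name, value in encoded.items():
    print(f"{name:>18}: {value!r}")

# Decoding the "replace" result does not bring the euro sign back
restored = encoded["replace"].decode("ascii")
print(restored == before)
```

Only the last two handlers keep enough information to reconstruct the character later; "replace" and "ignore" discard it.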

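3. Reading in files with encoding problems

Most files we come across are UTF-8 encoded, and that is what pandas' read_csv expects by default. Reading a file in some other encoding with the default settings will often fail with a UnicodeDecodeError.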
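The failure can be reproduced without the Kaggle dataset. The sketch below uses only the standard library: it writes a tiny file in Windows-1252 (cp1252) and then tries to read it as UTF-8. The file name and contents are made up for the example; pandas.read_csv raises the same kind of UnicodeDecodeError in this situation.

```python
import os
import tempfile

# Write a small file in Windows-1252 (cp1252); the name is a hypothetical example
path = os.path.join(tempfile.mkdtemp(), "example_cp1252.csv")
with open(path, "w", encoding="cp1252", newline="") as f:
    f.write("name,price\nCafé,€5\n")

# Reading it back as UTF-8 fails on the first non-ASCII byte
failed = False
try:
    with open(path, encoding="utf-8") as f:
        f.read()
except UnicodeDecodeError as exc:
    failed = True
    print("could not read as UTF-8:", exc)

# Reading with the encoding it was written in works
with open(path, encoding="cp1252") as f:
    content = f.read()
print(content)
```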

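One way to get past the error is to try different encodings and see which one works. Another is to use chardet to guess the encoding, although its result is not guaranteed to be 100% correct.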
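The trial-and-error approach can be sketched as a small helper. The function name decode_with_fallback and the candidate list are assumptions made for this example, not something from the original notebook.

```python
def decode_with_fallback(raw, candidates=("utf-8", "cp1252", "latin-1")):
    """Return (text, encoding) for the first candidate that decodes cleanly."""
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings worked")


# Sample bytes written in cp1252, so the UTF-8 attempt fails first
sample = "Café €5".encode("cp1252")
text, used = decode_with_fallback(sample)
print(used, text)
```

Order matters here: latin-1 accepts every possible byte, so it never raises and should only come last. A clean decode also does not prove the encoding is right, so the resulting text is still worth checking by eye.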

We can start by looking at the first 10,000 bytes of the file, which is usually enough to guess the encoding of the whole file:

# look at the first ten thousand bytes to guess the character encoding
with open("../input/kickstarter-projects/ks-projects-201801.csv", 'rb') as rawdata:
    result = chardet.detect(rawdata.read(10000))

# check what the character encoding might be
print(result)

According to the result, chardet is 73% confident that the file is encoded as Windows-1252. Let's check whether that guess holds up:
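Since the guess comes with a confidence score, one possible pattern is to accept it only above some threshold and fall back to a default otherwise. The sketch below works on a plain dictionary shaped like the one chardet.detect returns (keys encoding, confidence, language). The values are typed in by hand from the result described above, the language value is a placeholder, and both pick_encoding and the 0.5 threshold are hypothetical choices for illustration.

```python
def pick_encoding(result, min_confidence=0.5, default="utf-8"):
    """Use the detected encoding only if the detector is confident enough."""
    encoding = result.get("encoding")
    confidence = result.get("confidence") or 0.0
    if encoding is None or confidence < min_confidence:
        return default
    return encoding


# Hand-written stand-in for a chardet.detect() result (not produced by running chardet)
example_result = {"encoding": "Windows-1252", "confidence": 0.73, "language": ""}
chosen = pick_encoding(example_result)
print(chosen)
```

A stricter threshold such as 0.9 would reject this guess, so the right cutoff depends on how costly a wrong decode is for the data at hand.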

# read in the file with the encoding detected by chardet
# (note: the detection step sampled ks-projects-201801.csv, while this reads ks-projects-201612.csv)
kickstarter_2016 = pd.read_csv("../input/kickstarter-projects/ks-projects-201612.csv", encoding='Windows-1252')

# look at the first few lines
kickstarter_2016.head()

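No error is raised and the first 5 records of the dataset display, which suggests that chardet's guess works for this file. A file can decode without errors under the wrong encoding, so it is still worth scanning the text for garbled characters.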

4. Saving your files with UTF-8 encoding

UTF-8 is the standard encoding in Python 3, and pandas' to_csv uses it by default, so saving a file produces UTF-8 output:

# save our file (will be saved as UTF-8 by default!)
kickstarter_2016.to_csv("ks-projects-201801-utf8.csv")
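The same read-then-save-as-UTF-8 idea can be sketched with the standard library alone, which also makes it easy to verify the bytes on disk. The file names and contents below are made up for the example. With pandas, passing encoding="utf-8" to to_csv explicitly has the same effect as relying on the default.

```python
import os
import tempfile

workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "legacy_cp1252.csv")   # hypothetical input file
dst = os.path.join(workdir, "converted_utf8.csv")  # hypothetical output file

# Create a small Windows-1252 file to convert
with open(src, "w", encoding="cp1252", newline="") as f:
    f.write("name,price\nCafé,€5\n")

# Read with the source encoding, write with UTF-8
with open(src, encoding="cp1252", newline="") as fin, \
        open(dst, "w", encoding="utf-8", newline="") as fout:
    fout.write(fin.read())

# Verify: the raw bytes on disk now decode cleanly as UTF-8
with open(dst, "rb") as f:
    raw = f.read()
text = raw.decode("utf-8")
print(text)
```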

That wraps up day four of the 5 Day Challenge!

Feel free to follow my Zhihu column 数据池塘, which focuses on machine learning and data mining: https://zhuanlan.zhihu.com/datapool
