This article explores the basic concepts behind character encoding and then takes a deeper dive into the technical details of encoding systems.

If you have just a basic knowledge of character encoding and want to better understand the essentials, the differences between encoding systems, why we sometimes end up with nonsense text, and the principles behind different encoding architectures, then read on.

Getting to understand character encoding in detail requires some extensive reading and a good chunk of time. I’ve tried to save you some of that effort by bringing it all together in one place while providing what I believe to be a pretty thorough background of the topic.

I’m going to go over how single-byte encodings (ASCII, Windows-1251, etc.) work, the history of how Unicode came to be, the Unicode-based encodings UTF-8 and UTF-16 and how they differ, the specific features and compatibility (or lack thereof) among various encodings, character encoding principles, and a practical guide to how characters are encoded and decoded.

While character encoding may not be a cutting-edge topic, it’s useful to understand how it works now and how it worked in the past, without having to spend a lot of time on it.

History of Unicode

I think it’s best to start our story from the time when computers were nowhere near as advanced nor as commonplace a part of our lives as they are now. Developers and engineers trying to come up with standards at the time had no idea that computers and the internet would become as hugely popular and pervasive as they are today. When that happened, the world needed character encodings.

But how could you have a computer store characters or letters when it only understood ones and zeros? Out of this need emerged the single-byte ASCII encoding which, while not necessarily the first encoding, was the most widely used and set the benchmark, so it’s a good place to start.

But what is ASCII? ASCII codes are stored in 8-bit bytes, and some easy arithmetic shows that a byte-sized character set can contain 256 symbols (eight bits, each a zero or a one: 2⁸ = 256).

The first 7 bits (128 symbols, 2⁷ = 128) were used for Latin letters, control characters (such as hard line breaks, tabs and so on) and punctuation. The remaining 128 positions were left for national languages. This way the first 128 characters are always the same, and if you’d like to encode your native language, help yourself to the remaining positions.

This gave rise to a panoply of national encodings. You end up with a situation like this: say you’re in Russia creating a text file which by default will use Windows-1251 (the Russian encoding used in Windows). And you send your document to someone outside Russia, say in the US. Even if the recipient knows Russian they’re going to be out of luck when they open the document on their computer (with word processing software using ASCII as the default code) because they’ll see bizarre garbled characters (mojibake) instead of Russian letters. More precisely, any English letters will appear just fine, because the first 128 symbols in Windows-1251 and ASCII are identical, but wherever there is Russian text our recipient’s word processing software will use the wrong encoding unless the user has manually set the right character encoding.

The problem with national character code standards is obvious. And eventually, these national codes began to multiply, the internet began to explode, and everyone wanted to write in his or her national language without producing these indecipherable mojibakes.

There were two options at this point — use an encoding for each country or create a universal character map to represent all characters on the planet.

A Short Primer on ASCII

It may seem overly elementary, but if we’re going to be thorough we have to cover all the bases.

There are three groups of columns in the ASCII table:

  • the decimal value of the character
  • the hexadecimal value of the character
  • the glyph for the character itself

Let’s say we want to encode the word “ok” in ASCII. The letter “o” has a decimal value of 111, or 6F in hexadecimal, which in binary is 01101111. The letter “k” is position 107 in decimal and 6B in hex, or 01101011 in binary. So the word “ok” in ASCII looks like 01101111 01101011. The decoding process is the reverse: we start with 8 bits, translate them into a decimal value to get the character number, and look up the corresponding symbol in the table.
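
If you want to check the arithmetic yourself, here is a minimal Python sketch of the same round trip using the built-in ASCII codec (not part of the original walkthrough, just an illustration):

```python
text = "ok"

# Encoding: character -> code point -> bits.
encoded = text.encode("ascii")                    # b'ok', i.e. the bytes 0x6F 0x6B
for ch, byte in zip(text, encoded):
    print(ch, byte, hex(byte), format(byte, "08b"))
# o 111 0x6f 01101111
# k 107 0x6b 01101011

# Decoding is the reverse: byte value -> character number -> symbol.
decoded = "".join(chr(b) for b in encoded)
assert decoded == text
```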

Unicode

From the above, it should be pretty obvious why a single common character map was needed. But what would it look like? The answer is Unicode, which is actually not an encoding but a character set. It consists of 1,114,112 positions, or code points, most of which are still empty, so it isn’t likely the set will need to be expanded.

The Unicode standard consists of 17 planes with 65,536 code points each. Each plane contains a group of symbols. Plane zero is the Basic Multilingual Plane, where we find the most commonly used characters of all modern alphabets. The next plane contains characters from historic and dead languages. There are even two planes reserved for private use. Most planes are still empty.

Unicode has code points ranging from 0 through 10FFFF (in hexadecimal).

Code points are written in hexadecimal, preceded by “U+”. So, for example, the first, basic plane includes characters U+0000 to U+FFFF (0 to 65,535), and the seventeenth and last plane (plane 16) contains U+100000 through U+10FFFF (1,048,576 to 1,114,111).

So now instead of a menagerie of numerous encodings, we have an all-encompassing table that encodes all the symbols and characters we might ever need. But it is not without its faults. While each character was previously encoded by one byte, it can now be encoded using different numbers of bytes. For example, you used to need only one byte to encode all of the letters in the English alphabet. The Latin letter “o” in Unicode is U+006F. In other words, the same number as in ASCII — 6F in hexadecimal and 111 in decimal. But to encode the symbol U+103D5 (the Old Persian number one hundred), we need 103D5 in hex, or 66,517 in decimal, and now the code point alone takes three bytes.
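
A quick way to see the size difference in Python (a small sketch; the byte count shown is the raw size of the code point itself, not of any particular encoding):

```python
for ch in ["o", "\U000103D5"]:                 # Latin "o" and the Old Persian number one hundred
    cp = ord(ch)
    raw_bytes = (cp.bit_length() + 7) // 8     # bytes needed just to store the code point
    print(f"U+{cp:04X}  decimal {cp}  needs {raw_bytes} byte(s)")
# U+006F  decimal 111  needs 1 byte(s)
# U+103D5  decimal 66517  needs 3 byte(s)
```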

This complexity is addressed by Unicode encodings such as UTF-8 and UTF-16, which we’ll look at next.

UTF-8

UTF-8 is a variable-width Unicode encoding that can be used to represent any Unicode symbol.

What do we mean when we speak about variable-width? First of all, we need to understand that the structural (atomic) unit in encoding is a byte. Variable-width encoding means that one character can be encoded using different numbers of units, or bytes. For example, Latin letters are encoded with one byte, and Cyrillic letters with two.

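A short Python sketch makes the width difference visible:

```python
# One byte for a Latin letter, two bytes for a Cyrillic one in UTF-8.
print(len("o".encode("utf-8")), len("м".encode("utf-8")))   # 1 2
```
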
Before we move on, a slight aside regarding the compatibility between ASCII and UTF.

The fact that Latin letters and key control characters such as line breaks, tab stops, etc. are encoded with a single byte is what makes UTF-8 compatible with ASCII. In other words, Latin script and control characters sit at exactly the same code points in ASCII and UTF-8 and are encoded with one byte in both, so UTF-8 is backward compatible with ASCII.

Let’s use the letter “o” from our ASCII example from earlier. Recall that its position in the ASCII table is 111 in decimal, or 01101111 in binary. In the Unicode table it’s U+006F, which is again 01101111 in binary. And since UTF-8 is a variable-width encoding, “o” takes just one byte there. In other words, “o” is represented by the same byte in both, and the same goes for characters 0 through 127. So if your document contains only English letters, you won’t notice any difference between opening it as UTF-8 or as ASCII (UTF-16 is a different story, since it uses at least two bytes per character); differences only appear once national encodings come into play.
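
Here is a short sketch of that compatibility in Python, and of how UTF-16 differs:

```python
s = "Hello"
# ASCII-range text is byte-for-byte identical in ASCII and UTF-8.
assert s.encode("ascii") == s.encode("utf-8")        # 48 65 6c 6c 6f in both
# UTF-16 pads every BMP character to two bytes, so the bytes differ.
print(s.encode("utf-16-be").hex())                   # 00480065006c006c006f
```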

Let’s look at how the mixed English/Russian phrase “Hello мир” would appear in three different encoding systems: Windows-1251 (Russian encoding), ISO-8859-1 (encoding system for Western European languages), UTF-8 (Unicode). This example is telling because we have a phrase in two different languages.

Now let’s consider how these encoding systems work and how we can translate a line of text from one encoding to another, and what happens if the characters are displayed improperly, or if we simply can’t do this due to the differences in the systems.

Let’s assume that our original phrase was written in Windows-1251. Looking at the Windows-1251 code table and translating each character from decimal or hex into binary, we get the following byte sequence.

01001000 01100101 01101100 01101100 01101111 00100000 11101100 11101000 11110000

So now we have the phrase “Hello мир” in Windows-1251 encoding.

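The same byte sequence can be reproduced with Python’s cp1251 codec (a small sketch, shown only to confirm the bits above):

```python
phrase = "Hello мир"
data = phrase.encode("cp1251")                       # Windows-1251 bytes
print(" ".join(format(b, "08b") for b in data))
# 01001000 01100101 01101100 01101100 01101111 00100000 11101100 11101000 11110000
```
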
Now imagine that we have a text file but we don’t know what encoding the text was saved in. We assume it’s encoded in ISO-8859-1 and open it in our word processor using that encoding. As we saw earlier, some of the characters appear just fine, since they exist in this encoding too and sit at the same code points, but the characters in the Russian word «мир» don’t fare as well. These characters don’t exist in ISO-8859-1, and their byte values are occupied by completely different characters there. In Windows-1251, “м” is code point 236, “и” is 232, and “р” is 240; in ISO-8859-1 those same values correspond to “ì” (236), “è” (232), and “ð” (240).

So, our mixed-language phrase “Hello мир” encoded in Windows-1251 and read in ISO-8859-1 will look like “Hello ìèð”. We have partial compatibility and we can’t display a phrase encoded in one system properly in the other, because the symbols we need simply don’t exist in the second encoding.

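Here is the same mix-up reproduced in a few lines of Python; cp1251 and latin-1 are the standard codec names for Windows-1251 and ISO-8859-1:

```python
data = "Hello мир".encode("cp1251")     # bytes written in Windows-1251
print(data.decode("latin-1"))           # read as ISO-8859-1 -> "Hello ìèð" (mojibake)
print(data.decode("cp1251"))            # read with the right encoding -> "Hello мир"
```
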
We need a Unicode encoding — in our case, we’ll use UTF-8 as an example. We’ve already discussed that characters can take from 1 to 4 bytes in UTF-8, but another advantage is that UTF-8, unlike the two prior encodings, isn’t restricted to 256 symbols: it covers every symbol in the Unicode character set.

It works something like this: the leading bits of each byte don’t encode the character itself, they describe the structure of the byte sequence. If the first bit is zero, the character occupies just one byte, which is what makes UTF-8 backward compatible with ASCII. If we look closely at the ASCII table in binary, we see that the first 128 symbols (the English alphabet, control characters, and punctuation marks) all begin with a bit value of 0 (note that if you translate characters into binary using an online converter or anything similar, the leading zero high-order bit may be discarded, which can be a bit confusing).

01001000 — the first-bit value is 0, so 1 byte encodes 1 character -> “H”.

01100101 — the first-bit value is 0, so 1 byte encodes 1 character -> “e”.

If the value of the first bit is not zero, the symbol is encoded in several bytes; the number of leading 1 bits in the first byte tells us how many bytes the sequence occupies.

A two-byte sequence has 110 as the first three bits of its first byte and covers the code points U+0080 through U+07FF.

11010000 10111100 — the marker bits are 110 and 10, so we use 2 bytes to encode 1 character. The second and any further bytes in a sequence always start with “10”. So we omit the control bits (the lead bits 110 and 10) and concatenate the remainder (10000 and 111100 -> 10000111100), convert to hex (043C) -> U+043C, which gives us the Russian “м” in Unicode.

The initial bits for a three-byte sequence are 1110; three bytes cover the code points U+0800 through U+FFFF.

11100010 10000010 10101100 — we strip the control bits (1110 and the two leading 10s) and concatenate what’s left: 0010 000010 101100 -> 0010000010101100, which is 20AC in hex -> U+20AC, the euro sign “€”. (A code point such as U+103D5, the Old Persian number one hundred, lies above U+FFFF and therefore needs four bytes, as shown next.)

Four-byte sequences begin with the lead bits 11110 and cover the code points U+10000 through U+10FFFF.

11110000 10010000 10001111 10010101 — U+103D5, the Old Persian number one hundred (payload bits 000 010000 001111 010101), and 11110100 10001111 10111111 10111111 — U+10FFFF, the last code point in the Unicode set (payload bits 100 001111 111111 111111).

Now, we can easily write our multi-language phrase in UTF-8 encoding.

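To make the lead-bit rules concrete, here is a toy UTF-8 decoder in Python. It is only a sketch to make the bit layout visible (in real code you would simply call bytes.decode("utf-8")), and it skips validation of malformed input:

```python
def decode_utf8(data: bytes) -> str:
    chars = []
    i = 0
    while i < len(data):
        b = data[i]
        if b >> 7 == 0b0:                # 0xxxxxxx -> one byte (ASCII range)
            cp, extra = b, 0
        elif b >> 5 == 0b110:            # 110xxxxx -> two bytes
            cp, extra = b & 0x1F, 1
        elif b >> 4 == 0b1110:           # 1110xxxx -> three bytes
            cp, extra = b & 0x0F, 2
        else:                            # 11110xxx -> four bytes
            cp, extra = b & 0x07, 3
        for j in range(1, extra + 1):    # continuation bytes all look like 10xxxxxx
            cp = (cp << 6) | (data[i + j] & 0x3F)
        chars.append(chr(cp))
        i += extra + 1
    return "".join(chars)

assert decode_utf8("Hello мир".encode("utf-8")) == "Hello мир"
print(decode_utf8(bytes([0xF0, 0x90, 0x8F, 0x95])))   # U+103D5, the Old Persian number one hundred
```
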
UTF-16

UTF-16 is another variable-width encoding. The main difference between UTF-16 and UTF-8 is that UTF-16 uses 2 bytes (16 bits) per code unit instead of 1 byte (8 bits). In other words, any Unicode character encoded in UTF-16 takes either two or four bytes. To keep things simple, I will refer to each such pair of bytes as a code unit. So, in UTF-16 any character is represented using either one or two code units.

Let’s start with symbols encoded using one code unit. We can easily calculate that one code unit can take 65,536 (2¹⁶) values, which lines up with the Basic Multilingual Plane of Unicode. All characters in this plane are represented by one code unit (two bytes) in UTF-16.

Latin letter “o” — 00000000 01101111.

Cyrillic letter “М” — 00000100 00011100.

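The same two characters, produced with Python’s big-endian UTF-16 codec (a quick sketch):

```python
for ch in ["o", "М"]:                                # U+006F and U+041C
    unit = ch.encode("utf-16-be")                    # one 16-bit code unit each
    print(ch, unit.hex(), format(int.from_bytes(unit, "big"), "016b"))
# o 006f 0000000001101111
# М 041c 0000010000011100
```
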
Now let’s consider characters outside the basic multilingual plane. These require two code units (4 bytes) and are encoded in a slightly more complicated way.

First, we need to define the concept of a surrogate pair. A surrogate pair is two code units used together to encode a single character (totaling 4 bytes). The Unicode character set reserves the special range D800 to DFFF for surrogates. This means that a 16-bit code unit whose value falls in this range is one half of a surrogate pair rather than a character in its own right.

To encode a symbol in the range 10000 — 10FFFF (i.e., characters that require more than one code unit) we proceed as follows:

  1. Subtract 10000 (hex) from the code point (10000 is the lowest code point in the range 10000 — 10FFFF).

  2. We end up with a 20-bit number no greater than FFFFF.

  3. The high-order 10 bits are added to D800 (the lowest code point in the surrogate range in Unicode); this gives the first code unit.

  4. The low-order 10 bits are added to DC00 (also from the surrogate range); this gives the second code unit.

  5. We end up with two 16-bit surrogate code units, the first 6 bits of which mark each unit as part of a surrogate pair.

  6. Those leading bits also define the order of the pair: a unit in the range D800 — DBFF (leading bits 110110) is the leading, or high, surrogate, and a unit in the range DC00 — DFFF (leading bits 110111) is the trailing, or low, surrogate.

This will make a bit more sense when illustrated with the examples below; a short code sketch follows each walkthrough.

Let’s encode and then decode the Persian number one hundred (U+103D5):

  1. 103D5 — 10000 = 3D5.

  2. 3D5 as a 20-bit number is 0000000000 1111010101 (the high 10 bits are zero, i.e. 0 in hex, and the low 10 bits are 1111010101, i.e. 3D5).

  3. 0 + D800 = D800 (1101100000000000); the leading bits 110110 tell us this unit falls in the range D800 — DBFF, so it is the leading (high) surrogate.

  4. 3D5 + DC00 = DFD5 (1101111111010101); the leading bits 110111 put it in the range DC00 — DFFF, so it is the trailing (low) surrogate.

  5. The resulting character encoded in UTF-16 looks like 1101100000000000 1101111111010101 (D800 DFD5).
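
The same arithmetic in a few lines of Python (a sketch of the steps above, checked against the built-in UTF-16 codec):

```python
cp = 0x103D5
offset = cp - 0x10000                    # step 1: 0x003D5
high = 0xD800 + (offset >> 10)           # step 3: high 10 bits -> D800
low  = 0xDC00 + (offset & 0x3FF)         # step 4: low 10 bits  -> DFD5
print(f"{high:04X} {low:04X}")           # D800 DFD5
assert chr(cp).encode("utf-16-be").hex().upper() == "D800DFD5"
```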

Now let’s decode a character. Say we have the following pair of code units — 1101100000100010 1101111010001000:

  1. We convert to hexadecimal: D822 DE88 (both code units fall in the surrogate range, so we know we’re dealing with a surrogate pair).

  2. 1101100000100010 — the leading bits are 110110, so this is the high surrogate.

  3. 1101111010001000 — the leading bits are 110111, so this is the low surrogate.

  4. We discard the 6 marker bits of each unit and concatenate what’s left: 0000100010 1010001000 (8A88).

  5. We add 10000, the lowest code point in the supplementary range 10000 — 10FFFF: 8A88 + 10000 = 18A88.

  6. We look up U+18A88 in the Unicode table: Tangut Component-649.
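
And the decoding steps as a Python sketch; unicodedata is used only to look up the character’s name:

```python
import unicodedata

high, low = 0xD822, 0xDE88
assert 0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF   # really a surrogate pair
cp = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
print(f"U+{cp:X}")                                            # U+18A88
print(unicodedata.name(chr(cp), "unknown"))                   # TANGUT COMPONENT-649 (in recent Unicode databases)
```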

Kudos to everyone who read this far!

I hope this has been informative without leaving you too bored.

You might also find useful: The Unicode character set

Strategies for content localization: IP- or browser-based

About the translator

Alconost is a global provider of localization services for apps, games, videos, and websites into 70+ languages. We offer translations by native-speaking linguists, linguistic testing, cloud-based workflow, continuous localization, project management 24/7, and work with any format of string resources. We also make advertising and educational videos and images, teasers, explainers, and trailers for Google Play and the App Store.

Translated from: https://habr.com/en/company/alconost/blog/480688/
