What Are Character Encodings Like ANSI and Unicode, and How Do They Differ?

ASCII, UTF-8, ISO-8859… You may have seen these strange monikers floating around, but what do they actually mean? Read on as we explain what character encoding is and how these acronyms relate to the plain text we see on screen.

Fundamental Building Blocks

When we talk about written language, we talk about letters being the building blocks of words, which then build sentences, paragraphs, and so on. Letters are symbols which represent sounds. When you talk about language, you’re talking about groups of sounds that come together to form some sort of meaning. Each language system has a complex set of rules and definitions that govern those meanings. If you have a word, it’s useless unless you know what language it’s from and you use it with others who speak that language.

(Comparison of Grantha, Tulu, and Malayalam scripts, Image from Wikipedia)

In the world of computers, we use the term “character.” A character is sort of an abstract concept, defined by specific parameters, but it is the fundamental unit of meaning. The Latin ‘A’ is not the same as a Greek ‘alpha’ or an Arabic ‘alif’ because they have different contexts – they’re from different languages and have slightly different pronunciations – so we can say that they are different characters. The visual representation of a character is called a “glyph” and different sets of glyphs are called fonts. Groups of characters belong to a “set” or a “repertoire.”
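
To make the character-versus-glyph distinction concrete, here is a small sketch that jumps ahead to Unicode (covered further down) and uses Python's standard `unicodedata` module. The `describe` helper is our own name for illustration. It shows that the Latin, Greek, and Arabic letters mentioned above are recorded as three separate characters, even though the first two are usually drawn with near-identical glyphs.

```python
import unicodedata

# Latin 'A', Greek capital alpha, and Arabic alif (alef), written as escapes
# so that the look-alike glyphs cannot be confused in the source code.
letters = ["A", "\u0391", "\u0627"]

def describe(ch: str) -> str:
    """Return the code point and official Unicode name of one character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"

for ch in letters:
    print(describe(ch))
# U+0041 LATIN CAPITAL LETTER A
# U+0391 GREEK CAPITAL LETTER ALPHA
# U+0627 ARABIC LETTER ALEF
```

Changing the font would change how each of these is drawn, but not which character it is.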

When you type up a paragraph and you change the font, you’re not changing the phonetic values of the letters, you’re changing how they look. It’s just cosmetic (but not unimportant!). Some languages, like ancient Egyptian and Chinese, have ideograms; these represent whole ideas instead of sounds, and their pronunciations can vary over time and distance. If you substitute one character for another, you’re substituting an idea. It’s more than just changing letters, it’s changing an ideogram.

Character Encoding

(Image from Wikipedia)

When you type something on the keyboard, or load a file, how does the computer know what to display? That’s what character encoding is for. Text on your computer isn’t actually stored as letters; it’s stored as a series of numeric values (bytes), which are often written out as pairs of hexadecimal digits. The character encoding acts as a key that says which values correspond to which characters, much like how orthography dictates which sounds correspond to which letters. Morse code is a sort of character encoding. It explains how groups of long and short units, such as beeps, represent characters. In Morse code, the characters are just Latin letters, numerals, and a handful of punctuation marks. There are many computer character encodings, which cover letters, numbers, accent marks, punctuation marks, international symbols, and so on.
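
As a rough illustration of the "key" idea, the sketch below turns text into the values that actually get stored and prints them as hexadecimal pairs. The `hex_bytes` helper is a name we made up for this example, and it assumes Python 3.8 or later (for the separator argument to `bytes.hex`).

```python
def hex_bytes(text: str, encoding: str) -> str:
    """Encode text and write each stored byte as a pair of hex digits."""
    return text.encode(encoding).hex(" ")

print(hex_bytes("Hi!", "ascii"))       # 48 69 21
print(hex_bytes("\u00e9", "latin-1"))  # e9      (the letter e with an acute accent)
print(hex_bytes("\u00e9", "utf-8"))    # c3 a9   (same character, different key)

# Going the other way only works if we read the values with the right key.
print(bytes.fromhex("48 69 21").decode("ascii"))  # Hi!
```

The same character can end up as different values depending on the encoding, which is why the reader of a file needs to know which one the writer used.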

The term “code page” also comes up often on this topic. Code pages are essentially character encodings as used by specific companies, often with slight modifications. For example, the Windows-1252 code page (commonly referred to as “ANSI,” which is a misnomer since it was never standardized by ANSI) is a modified form of ISO-8859-1. Code page numbers mostly serve as a vendor’s internal way of referring to the standard and modified character encodings that its own systems support. Early on, character encoding wasn’t so important because computers didn’t communicate with each other. With the internet rising to prominence and networking being a common occurrence, it has become an increasingly important part of our day-to-day lives without us even realizing it.
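
To see what a "slight modification" looks like in practice, here is a minimal sketch comparing Windows-1252 with ISO-8859-1 using Python's built-in `cp1252` and `latin-1` codecs. The three byte values are picked by us for illustration; they sit in the 0x80 to 0x9F range, which is where the two tables differ.

```python
# Three byte values from the 0x80-0x9F range, where the two tables disagree.
raw = bytes([0x80, 0x93, 0x94])

as_cp1252 = raw.decode("cp1252")   # Windows-1252: euro sign and curly quotes
as_latin1 = raw.decode("latin-1")  # ISO-8859-1: invisible control codes

print([f"U+{ord(c):04X}" for c in as_cp1252])  # ['U+20AC', 'U+201C', 'U+201D']
print([f"U+{ord(c):04X}" for c in as_latin1])  # ['U+0080', 'U+0093', 'U+0094']

# From 0xA0 upward the two tables agree again.
upper = bytes(range(0xA0, 0x100))
same_upper = upper.decode("cp1252") == upper.decode("latin-1")
print(same_upper)  # True
```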

Many Different Types

(Image from sarah sosiak)

There are plenty of different character encodings out there, and there are plenty of reasons for that. Which character encoding you choose to use depends on what your needs are. If you communicate in Russian, it makes sense to use a character encoding that supports Cyrillic well. If you communicate in Korean, then you’ll want something that represents Hangul and Hanja well. If you’re a mathematician, then you want something that has all of the scientific and mathematical symbols represented well, as well as the Greek and Latin glyphs. If you’re a prankster, maybe you’d benefit from upside-down text. And, if you want all of those types of documents to be viewed by any given person, you want an encoding that’s pretty common and easily accessible.

Let’s take a look at some of the more common ones.

(Excerpt of ASCII table, Image from asciitable.com)

ASCII – The American Standard Code for Information Interchange is one of the older character encodings. It was originally devised based on telegraphic codes and evolved over time to include more symbols and some now-outdated non-printed control characters. It’s probably as basic as you can get in terms of modern systems, as it’s limited to the Latin alphabet without accented characters. Its 7-bit encoding allows for only 128 characters, which is why there are several unofficial variants in use around the world.
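
The 128-character limit is easy to bump into. The following sketch (the `fits_in_ascii` helper is a hypothetical name of ours) checks whether every character in a string has a value that fits in 7 bits, and shows Python refusing to encode an accented word as ASCII.

```python
def fits_in_ascii(text: str) -> bool:
    """True if every character's value is below 128, i.e. fits in 7 bits."""
    return all(ord(ch) < 2 ** 7 for ch in text)

print(fits_in_ascii("resume"))              # True
print(fits_in_ascii("r\u00e9sum\u00e9"))    # False (accented e is outside ASCII)

try:
    "r\u00e9sum\u00e9".encode("ascii")
except UnicodeEncodeError as err:
    # The exact wording of the message comes from Python, not from ASCII itself.
    print("cannot encode as ASCII:", err.reason)
```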

ISO-8859 – ISO 8859 is the International Organization for Standardization’s most widely used group of character encodings. Each specific encoding, or part, is designated by a number, often followed by a descriptive moniker, e.g. ISO-8859-3 (Latin-3), ISO-8859-6 (Latin/Arabic). Every part is a superset of ASCII, meaning that the first 128 values in the encoding are the same as ASCII. It’s 8-bit, however, and allows for 256 characters, so each part uses the values above 127 for a different set of additional characters, with each one focusing on a different script or group of languages. Latin-1 (ISO-8859-1) includes a bunch of accented letters and symbols; a later revision called Latin-9 (ISO-8859-15) swaps a few of them for newer characters like the Euro symbol.
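
Here is a short sketch of the "same lower half, different upper half" idea, using the ISO-8859 codecs that ship with Python. The `decode_byte` helper and the choice of the values 0xE4 and 0xA4 are ours, purely for illustration.

```python
def decode_byte(value: int, part: str) -> str:
    """Look up a single byte value in the given ISO-8859 part."""
    return bytes([value]).decode(part)

parts = ["iso-8859-1", "iso-8859-5", "iso-8859-7", "iso-8859-15"]

# The first 128 values mean the same thing in every part: plain ASCII.
lower_half = bytes(range(128))
shared = all(lower_half.decode(p) == lower_half.decode("ascii") for p in parts)
print(shared)  # True

# Above 127, the same value stands for a different character in each part.
print(decode_byte(0xE4, "iso-8859-1"))   # a with diaeresis (U+00E4)
print(decode_byte(0xE4, "iso-8859-5"))   # Cyrillic small ef (U+0444)
print(decode_byte(0xE4, "iso-8859-7"))   # Greek small delta (U+03B4)

# Latin-9 reuses the slot of the generic currency sign for the euro sign.
print(decode_byte(0xA4, "iso-8859-1"))   # currency sign (U+00A4)
print(decode_byte(0xA4, "iso-8859-15"))  # euro sign (U+20AC)
```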

(Excerpt of Tibetan script, Unicode v4, from unicode.org)

Unicode – This standard aims at universality. At the time of writing, it includes 93 scripts organized in several blocks, with many more in the works. Unicode works differently than the character sets above in that it separates the characters from the way they are stored: every character is first assigned a “code point,” a number usually written in hexadecimal, and a separate encoding then turns code points into bytes. The glyphs themselves are provided in a detached way by the program, such as your web browser, and its fonts. Code points are commonly depicted as follows: U+0040 (which translates to ‘@’). Specific encodings under the Unicode standard include UTF-8 and UTF-16. UTF-8 aims for maximum compatibility with ASCII. It works in 8-bit units: ASCII characters take a single byte with the same value they have in ASCII, and every other character is represented by a sequence of two to four bytes. UTF-16 gives up byte-level ASCII compatibility and works in 16-bit units instead: most characters in common use fit in one unit, and the rest take a pair of units.
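
The split between code points and encodings can be sketched in a few lines. The `encodings_of` helper below is a made-up name; it assumes Python 3.8 or later and uses the big-endian flavor of UTF-16 (`utf-16-be`) so that no byte order mark is added to the output.

```python
def encodings_of(ch: str) -> tuple:
    """Return (code point, UTF-8 bytes, UTF-16 bytes) for one character."""
    code_point = f"U+{ord(ch):04X}"
    utf8 = ch.encode("utf-8").hex(" ")
    utf16 = ch.encode("utf-16-be").hex(" ")
    return code_point, utf8, utf16

print(encodings_of("@"))           # ('U+0040', '40', '00 40')
print(encodings_of("\u20ac"))      # ('U+20AC', 'e2 82 ac', '20 ac')       euro sign
print(encodings_of("\U0001F600"))  # ('U+1F600', 'f0 9f 98 80', 'd8 3d de 00')  emoji
```

The ‘@’ keeps its single ASCII byte in UTF-8 but grows to two bytes in UTF-16, while the emoji needs four bytes in both: four 8-bit units in UTF-8, and a pair of 16-bit units in UTF-16.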

ISO-10646 – This isn’t an actual encoding, just the character set of Unicode as standardized by the ISO. It’s mostly important because it’s the character repertoire used by HTML. Some of the more advanced functions provided by Unicode, such as collation and the mixing of right-to-left with left-to-right scripts, are missing. Still, it works very well for use on the internet as it allows for the usage of a wide variety of scripts and allows the browser to interpret the glyphs. This makes localization somewhat easier.

What Encoding Should I Use?

Well, ASCII works for most English speakers, but not for much else. More often you’ll be seeing ISO-8859-1, which works for most Western European languages. The other versions of ISO-8859 work for Cyrillic, Arabic, Greek, or other specific scripts. However, if you want to display multiple scripts in the same document or on the same web page, UTF-8 allows for much better compatibility. It also works really well for people who use proper punctuation, math symbols, or off-the-cuff characters, such as squares and checkboxes.
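
A quick way to get a feel for this trade-off is to test which encodings can hold a given piece of text at all. In the sketch below, `can_encode` is a hypothetical helper of ours and the sample string (a French word, a Russian word, and a checkbox symbol) is invented for the example.

```python
def can_encode(text: str, encoding: str) -> bool:
    """True if every character in text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

french = "Caf\u00e9"                             # Latin letters plus an accented e
russian = "\u041c\u043e\u0441\u043a\u0432\u0430"  # "Moscow" in Cyrillic
mixed = f"{french} {russian} \u2611"             # both, plus a ballot-box symbol

print(can_encode(french, "iso-8859-1"))   # True
print(can_encode(russian, "iso-8859-5"))  # True
for name in ["ascii", "iso-8859-1", "iso-8859-5", "utf-8"]:
    print(name, can_encode(mixed, name))
# ascii False
# iso-8859-1 False
# iso-8859-5 False
# utf-8 True
```

Each single-script encoding handles its own text fine, but only UTF-8 copes once the scripts are mixed in one document.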

(Multiple languages in one document, Screenshot of gujaratsamachar.com)

There are drawbacks to each set, however. ASCII is limited in its punctuation marks, so it doesn’t work incredibly well for typographically correct text. Ever copy and paste from Word only to end up with some weird combination of glyphs? That’s the drawback of ISO-8859, or more correctly, its supposed inter-operability with OS-specific code pages (we’re looking at YOU, Microsoft!). UTF-8’s major drawback is a lack of proper support in some editing and publishing applications. Another problem is that browsers often don’t interpret the byte order mark that can appear at the start of a UTF-8 encoded file and just display it instead. This results in unwanted glyphs being displayed. And of course, declaring one encoding on a web page while using characters from another, without declaring or referencing them properly, makes it difficult for browsers to render them correctly and for search engines to index them appropriately.
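
The "weird combination of glyphs" can be reproduced on purpose. The sketch below is only an illustration under our own assumptions: the sample word with a typographic apostrophe is invented, and the specific encoding mix-ups are two common cases rather than the only ones.

```python
original = "It\u2019s"  # "It's" with a typographic apostrophe (U+2019)

# Saved as UTF-8, then read back with the wrong key (Windows-1252).
garbled = original.encode("utf-8").decode("cp1252")
print(garbled)  # Itâ€™s

# Saved as Windows-1252, then read as ISO-8859-1:
# the apostrophe turns into an invisible control code.
misread = original.encode("cp1252").decode("latin-1")
print(repr(misread))  # 'It\x92s'

# If no bytes were lost along the way, reversing the wrong step repairs the text.
repaired = garbled.encode("cp1252").decode("utf-8")
print(repaired == original)  # True
```

Nothing is wrong with the stored values in either case; they are simply being read with a different key than the one used to write them.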

For your own documents, manuscripts, and so forth, you can use whatever you need to get the job done. As far as the web goes, though, it seems that most people agree on using a UTF-8 version that does not use a byte order mark, but that’s not entirely unanimous. As you can see, each character encoding has its own use, context, and strengths and weaknesses. As an end-user, you probably won’t have to deal with this, but now you can take the extra step forward if you so choose.
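
For anyone who does want to take that extra step, here is a minimal sketch of what the byte order mark looks like in a UTF-8 file and how it can be dropped. It relies on Python's `utf-8-sig` codec and the `codecs.BOM_UTF8` constant; the `strip_bom` helper is a name of our own.

```python
import codecs

text = "hello"
with_bom = text.encode("utf-8-sig")  # UTF-8 with a byte order mark in front
without_bom = text.encode("utf-8")   # plain UTF-8, the usual choice for the web

print(with_bom)     # b'\xef\xbb\xbfhello'
print(without_bom)  # b'hello'

# A program that ignores the mark and assumes Windows-1252 shows stray glyphs.
print(with_bom.decode("cp1252"))  # ï»¿hello

def strip_bom(data: bytes) -> bytes:
    """Remove a leading UTF-8 byte order mark, if there is one."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data

print(strip_bom(with_bom) == without_bom)  # True
print(with_bom.decode("utf-8-sig"))        # hello
```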

Translated from: https://www.howtogeek.com/howto/45765/htg-explains-what-are-character-encodings-and-how-do-they-differ/