
Table of Contents

  • A Historical Perspective
  • Unicode
  • Encodings
  • How UTF-8 works
  • The Single Most Important Fact About Encodings

Did you ever get an email from your friends in Bulgaria with the subject line “??? ??? ??? ???”?

I’ve been dismayed to discover just how many software developers aren’t really completely up to speed on the mysterious world of character sets, encodings, Unicode, all that stuff. A couple of years ago, a beta tester for FogBUGZ was wondering whether it could handle incoming email in Japanese. Japanese? They have email in Japanese? I had no idea. When I looked closely at the commercial ActiveX control we were using to parse MIME email messages, we discovered it was doing exactly the wrong thing with character sets, so we actually had to write heroic code to undo the wrong conversion it had done and redo it correctly. When I looked into another commercial library, it, too, had a completely broken character code implementation. I corresponded with the developer of that package and he sort of thought they “couldn’t do anything about it.” Like many programmers, he just wished it would all blow over somehow.

But it won’t. When I discovered that the popular web development tool PHP has almost complete ignorance of character encoding issues, blithely using 8 bits for characters, making it darn near impossible to develop good international web applications, I thought, enough is enough.

So I have an announcement to make: if you are a programmer working in 2003 and you don’t know the basics of characters, character sets, encodings, and Unicode, and I catch you, I’m going to punish you by making you peel onions for 6 months in a submarine. I swear I will.

And one more thing:

IT’S NOT THAT HARD.

In this article I’ll fill you in on exactly what every working programmer should know. All that stuff about “plain text = ascii = characters are 8 bits” is not only wrong, it’s hopelessly wrong, and if you’re still programming that way, you’re not much better than a medical doctor who doesn’t believe in germs. Please do not write another line of code until you finish reading this article.

Before I get started, I should warn you that if you are one of those rare people who knows about internationalization, you are going to find my entire discussion a little bit oversimplified. I’m really just trying to set a minimum bar here so that everyone can understand what’s going on and can write code that has a hope of working with text in any language other than the subset of English that doesn’t include words with accents. And I should warn you that character handling is only a tiny portion of what it takes to create software that works internationally, but I can only write about one thing at a time so today it’s character sets.

A Historical Perspective

The easiest way to understand this stuff is to go chronologically.

You probably think I’m going to talk about very old character sets like EBCDIC here. Well, I won’t. EBCDIC is not relevant to your life. We don’t have to go that far back in time.

Back in the semi-olden days, when Unix was being invented and K&R were writing The C Programming Language, everything was very simple. EBCDIC was on its way out. The only characters that mattered were good old unaccented English letters, and we had a code for them called ASCII which was able to represent every character using a number between 32 and 127. Space was 32, the letter “A” was 65, etc. This could conveniently be stored in 7 bits. Most computers in those days were using 8-bit bytes, so not only could you store every possible ASCII character, but you had a whole bit to spare, which, if you were wicked, you could use for your own devious purposes: the dim bulbs at WordStar actually turned on the high bit to indicate the last letter in a word, condemning WordStar to English text only. Codes below 32 were called unprintable and were used for cussing. Just kidding. They were used for control characters, like 7 which made your computer beep and 12 which caused the current page of paper to go flying out of the printer and a new one to be fed in.
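
Since ASCII survives as a subset of almost every later encoding, it is easy to poke at from any modern language. A minimal sketch in Python (the particular characters are only examples):

print(ord(' '), ord('A'))   # 32 65 -- every ASCII character is a number between 32 and 127
print(chr(65))              # A -- and back again
print(65 < 2**7)            # True -- seven bits are enough
print(65 | 0x80)            # 193 -- turning on the spare high bit, WordStar-style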

And all was good, assuming you were an English speaker.

Because bytes have room for up to eight bits, lots of people got to thinking, “gosh, we can use the codes 128-255 for our own purposes.” The trouble was, lots of people had this idea at the same time, and they had their own ideas of what should go where in the space from 128 to 255. The IBM-PC had something that came to be known as the OEM character set which provided some accented characters for European languages and a bunch of line drawing characters… horizontal bars, vertical bars, horizontal bars with little dingle-dangles dangling off the right side, etc., and you could use these line drawing characters to make spiffy boxes and lines on the screen, which you can still see running on the 8088 computer at your dry cleaners’. In fact as soon as people started buying PCs outside of America all kinds of different OEM character sets were dreamed up, which all used the top 128 characters for their own purposes. For example on some PCs the character code 130 would display as é, but on computers sold in Israel it was the Hebrew letter Gimel (ג), so when Americans would send their résumés to Israel they would arrive as rגsumגs. In many cases, such as Russian, there were lots of different ideas of what to do with the upper-128 characters, so you couldn’t even reliably interchange Russian documents.
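
Python still ships codecs for several of these old OEM code pages, so the mangled résumé is easy to reproduce; a small sketch (cp437 is the original IBM PC set, cp862 the Hebrew DOS one):

raw = bytes([130])                                 # the single byte 130 (0x82)
print(raw.decode('cp437'))                         # é -- on a US or Western European PC
print(raw.decode('cp862'))                         # ג -- on a PC sold in Israel
print('résumé'.encode('cp437').decode('cp862'))    # rגsumג -- the résumé after its trip to Israel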

Eventually this OEM free-for-all got codified in the ANSI standard. In the ANSI standard, everybody agreed on what to do below 128, which was pretty much the same as ASCII, but there were lots of different ways to handle the characters from 128 and on up, depending on where you lived. These different systems were called code pages. So for example in Israel DOS used a code page called 862, while Greek users used 737. They were the same below 128 but different from 128 up, where all the funny letters resided. The national versions of MS-DOS had dozens of these code pages, handling everything from English to Icelandic and they even had a few “multilingual” code pages that could do Esperanto and Galician on the same computer! Wow! But getting, say, Hebrew and Greek on the same computer was a complete impossibility unless you wrote your own custom program that displayed everything using bitmapped graphics, because Hebrew and Greek required different code pages with different interpretations of the high numbers.

Meanwhile, in Asia, even more crazy things were going on to take into account the fact that Asian alphabets have thousands of letters, which were never going to fit into 8 bits. This was usually solved by the messy system called DBCS, the “double byte character set” in which some letters were stored in one byte and others took two. It was easy to move forward in a string, but dang near impossible to move backwards. Programmers were encouraged not to use s++ and s-- to move backwards and forwards, but instead to call functions such as Windows’ AnsiNext and AnsiPrev which knew how to deal with the whole mess.
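
Shift-JIS, one of the classic double-byte character sets, shows the problem nicely; a quick Python sketch (any Japanese text would do):

s = 'abc日本'                      # three ASCII letters plus two kanji
data = s.encode('shift_jis')
print(len(s), len(data))           # 5 characters but 7 bytes: ASCII takes one byte each, kanji take two
# Walking forward you can tell lead bytes from single bytes, but from a pointer
# in the middle of the buffer there is no reliable way to step backwards,
# which is why helpers like AnsiPrev existed instead of plain s--.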

But still, most people just pretended that a byte was a character and a character was 8 bits and as long as you never moved a string from one computer to another, or spoke more than one language, it would sort of always work. But of course, as soon as the Internet happened, it became quite commonplace to move strings from one computer to another, and the whole mess came tumbling down. Luckily, Unicode had been invented.

Unicode

Unicode was a brave effort to create a single character set that included every reasonable writing system on the planet and some make-believe ones like Klingon, too. Some people are under the misconception that Unicode is simply a 16-bit code where each character takes 16 bits and therefore there are 65,536 possible characters. This is not, actually, correct. It is the single most common myth about Unicode, so if you thought that, don’t feel bad.

In fact, Unicode has a different way of thinking about characters, and you have to understand the Unicode way of thinking of things or nothing will make sense.

Until now, we’ve assumed that a letter maps to some bits which you can store on disk or in memory:

A -> 0100 0001

In Unicode, a letter maps to something called a code point which is still just a theoretical concept. How that code point is represented in memory or on disk is a whole nuther story.

In Unicode, the letter A is a platonic ideal. It’s just floating in heaven:

A

This platonic A is different than B, and different from a, but the same as A and A and A. The idea that A in a Times New Roman font is the same character as the A in a Helvetica font, but different from “a” in lower case, does not seem very controversial, but in some languages just figuring out what a letter is can cause controversy. Is the German letter ß a real letter or just a fancy way of writing ss? If a letter’s shape changes at the end of the word, is that a different letter? Hebrew says yes, Arabic says no. Anyway, the smart people at the Unicode consortium have been figuring this out for the last decade or so, accompanied by a great deal of highly political debate, and you don’t have to worry about it. They’ve figured it all out already.

Every platonic letter in every alphabet is assigned a magic number by the Unicode consortium which is written like this: U+0639. This magic number is called a code point. The U+ means “Unicode” and the numbers are hexadecimal. U+0639 is the Arabic letter Ain. The English letter A would be U+0041. You can find them all using the charmap utility on Windows 2000/XP or visiting the Unicode web site.
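
Most modern languages will hand you the code point behind a character directly; a Python sketch (ع is the Arabic letter Ain mentioned above):

print(hex(ord('A')))     # 0x41 -- the code point U+0041
print(hex(ord('ع')))     # 0x639 -- U+0639, Arabic letter Ain
print(chr(0x0639))       # ع -- and from code point back to character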

There is no real limit on the number of letters that Unicode can define and in fact they have gone beyond 65,536 so not every unicode letter can really be squeezed into two bytes, but that was a myth anyway.
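
Plenty of assigned characters already sit above U+FFFF; one Python line makes the point (using a musical symbol that has been in Unicode since version 3.1):

print(ord('𝄞'))          # 119070, i.e. U+1D11E -- comfortably past 65,535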

OK, so say we have a string:

Hello

which, in Unicode, corresponds to these five code points:

U+0048 U+0065 U+006C U+006C U+006F.

Just a bunch of code points. Numbers, really. We haven’t yet said anything about how to store this in memory or represent it in an email message.
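
Producing those code points mechanically is a one-liner; a Python sketch:

print(' '.join(f'U+{ord(c):04X}' for c in 'Hello'))   # U+0048 U+0065 U+006C U+006C U+006F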

Encodings

That’s where encodings come in.

The earliest idea for Unicode encoding, which led to the myth about the two bytes, was, hey, let’s just store those numbers in two bytes each. So Hello becomes

00 48 00 65 00 6C 00 6C 00 6F

Right? Not so fast! Couldn’t it also be:

48 00 65 00 6C 00 6C 00 6F 00 ?

Well, technically, yes, I do believe it could, and, in fact, early implementors wanted to be able to store their Unicode code points in high-endian or low-endian mode, whichever their particular CPU was fastest at, and lo, it was evening and it was morning and there were already two ways to store Unicode. So the people were forced to come up with the bizarre convention of storing a FE FF at the beginning of every Unicode string; this is called a Unicode Byte Order Mark and if you are swapping your high and low bytes it will look like a FF FE and the person reading your string will know that they have to swap every other byte. Phew. Not every Unicode string in the wild has a byte order mark at the beginning.
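
Both byte orders, and the BOM, are easy to see; a Python sketch (the codec names are the standard library's):

print('Hello'.encode('utf-16-be').hex(' '))   # 00 48 00 65 00 6c 00 6c 00 6f
print('Hello'.encode('utf-16-le').hex(' '))   # 48 00 65 00 6c 00 6c 00 6f 00
print('Hello'.encode('utf-16').hex(' '))      # ff fe 48 00 ... -- BOM first, then native byte order (little-endian on most machines)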

For a while it seemed like that might be good enough, but programmers were complaining. “Look at all those zeros!” they said, since they were Americans and they were looking at English text which rarely used code points above U+00FF. Also they were liberal hippies in California who wanted to conserve (sneer). If they were Texans they wouldn’t have minded guzzling twice the number of bytes. But those Californian wimps couldn’t bear the idea of doubling the amount of storage it took for strings, and anyway, there were already all these doggone documents out there using various ANSI and DBCS character sets and who’s going to convert them all? Moi? For this reason alone most people decided to ignore Unicode for several years and in the meantime things got worse.

Thus was invented the brilliant concept of UTF-8. UTF-8 was another system for storing your string of Unicode code points, those magic U+ numbers, in memory using 8 bit bytes. In UTF-8, every code point from 0-127 is stored in a single byte. Only code points 128 and above are stored using 2, 3, in fact, up to 6 bytes.

How UTF-8 works

This has the neat side effect that English text looks exactly the same in UTF-8 as it did in ASCII, so Americans don’t even notice anything wrong. Only the rest of the world has to jump through hoops. Specifically, Hello, which was U+0048 U+0065 U+006C U+006C U+006F, will be stored as 48 65 6C 6C 6F, which, behold! is the same as it was stored in ASCII, and ANSI, and every OEM character set on the planet. Now, if you are so bold as to use accented letters or Greek letters or Klingon letters, you’ll have to use several bytes to store a single code point, but the Americans will never notice. (UTF-8 also has the nice property that ignorant old string-processing code that wants to use a single 0 byte as the null-terminator will not truncate strings).
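
A Python sketch of the same point (the byte counts per character are what matter, not the particular letters chosen):

print('Hello'.encode('utf-8').hex(' '))   # 48 65 6c 6c 6f -- identical to the ASCII bytes
print(len('é'.encode('utf-8')))           # 2 -- an accented Latin letter takes two bytes
print(len('ع'.encode('utf-8')))           # 2 -- so does Arabic Ain (U+0639)
print(len('日'.encode('utf-8')))          # 3 -- a CJK character takes three
print(b'\x00' in 'Héllo'.encode('utf-8')) # False -- no zero bytes, so null-terminated string code won't truncate it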

So far I’ve told you three ways of encoding Unicode. The traditional store-it-in-two-byte methods are called UCS-2 (because it has two bytes) or UTF-16 (because it has 16 bits), and you still have to figure out if it’s high-endian UCS-2 or low-endian UCS-2. And there’s the popular new UTF-8 standard which has the nice property of also working respectably if you have the happy coincidence of English text and braindead programs that are completely unaware that there is anything other than ASCII.

There are actually a bunch of other ways of encoding Unicode. There’s something called UTF-7, which is a lot like UTF-8 but guarantees that the high bit will always be zero, so that if you have to pass Unicode through some kind of draconian police-state email system that thinks 7 bits are quite enough, thank you it can still squeeze through unscathed. There’s UCS-4, which stores each code point in 4 bytes, which has the nice property that every single code point can be stored in the same number of bytes, but, golly, even the Texans wouldn’t be so bold as to waste that much memory.
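
Both are available as codecs if you want to see them in action (a sketch; UTF-32 is the modern name for the four-bytes-per-code-point idea):

print('é'.encode('utf-7'))                 # b'+AOk-' -- the non-ASCII letter escaped into pure 7-bit bytes
print('A'.encode('utf-32-be').hex(' '))    # 00 00 00 41 -- four bytes for every single code point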

And in fact now that you’re thinking of things in terms of platonic ideal letters which are represented by Unicode code points, those unicode code points can be encoded in any old-school encoding scheme, too! For example, you could encode the Unicode string for Hello (U+0048 U+0065 U+006C U+006C U+006F) in ASCII, or the old OEM Greek Encoding, or the Hebrew ANSI Encoding, or any of several hundred encodings that have been invented so far, with one catch: some of the letters might not show up! If there’s no equivalent for the Unicode code point you’re trying to represent in the encoding you’re trying to represent it in, you usually get a little question mark: ? or, if you’re really good, a box. Which did you get? -> �

There are hundreds of traditional encodings which can only store some code points correctly and change all the other code points into question marks. Some popular encodings of English text are Windows-1252 (the Windows 9x standard for Western European languages) and ISO-8859-1, aka Latin-1 (also useful for any Western European language). But try to store Russian or Hebrew letters in these encodings and you get a bunch of question marks. UTF 7, 8, 16, and 32 all have the nice property of being able to store any code point correctly.
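
The question-mark effect is easy to reproduce; a Python sketch (errors='replace' is roughly what "convert it as best you can" amounts to):

print('Ain: ع'.encode('windows-1252', errors='replace'))   # b'Ain: ?' -- no Arabic in Latin-1-land
print('Привет'.encode('iso-8859-1', errors='replace'))     # b'??????' -- Cyrillic doesn't fit either
print('Привет'.encode('utf-8'))                            # works fine -- UTF-8 can hold any code point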

The Single Most Important Fact About Encodings

If you completely forget everything I just explained, please remember one extremely important fact. It does not make sense to have a string without knowing what encoding it uses. You can no longer stick your head in the sand and pretend that “plain” text is ASCII.

There Ain’t No Such Thing As Plain Text.

If you have a string, in memory, in a file, or in an email message, you have to know what encoding it is in or you cannot interpret it or display it to users correctly.

Almost every stupid “my website looks like gibberish” or “she can’t read my emails when I use accents” problem comes down to one naive programmer who didn’t understand the simple fact that if you don’t tell me whether a particular string is encoded using UTF-8 or ASCII or ISO 8859-1 (Latin 1) or Windows 1252 (Western European), you simply cannot display it correctly or even figure out where it ends. There are over a hundred encodings and above code point 127, all bets are off.
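
This is exactly where the gibberish comes from: the same bytes, read under the wrong encoding, turn into a different and usually nonsensical string. A Python sketch:

data = 'naïve café'.encode('utf-8')
print(data.decode('utf-8'))          # naïve café -- right guess, right text
print(data.decode('windows-1252'))   # naÃ¯ve cafÃ© -- wrong guess, classic mojibake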

How do we preserve this information about what encoding a string uses? Well, there are standard ways to do this. For an email message, you are expected to have a string in the header of the form

Content-Type: text/plain; charset="UTF-8"

For a web page, the original idea was that the web server would return a similar Content-Type http header along with the web page itself – not in the HTML itself, but as one of the response headers that are sent before the HTML page.

This causes problems. Suppose you have a big web server with lots of sites and hundreds of pages contributed by lots of people in lots of different languages and all using whatever encoding their copy of Microsoft FrontPage saw fit to generate. The web server itself wouldn’t really know what encoding each file was written in, so it couldn’t send the Content-Type header.

It would be convenient if you could put the Content-Type of the HTML file right in the HTML file itself, using some kind of special tag. Of course this drove purists crazy… how can you read the HTML file until you know what encoding it’s in?! Luckily, almost every encoding in common use does the same thing with characters between 32 and 127, so you can always get this far on the HTML page without starting to use funny letters:
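
For example, the top of the page can look something like this (a sketch, with utf-8 standing in for whatever charset the page actually uses):

<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">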

But that meta tag really has to be the very first thing in the <head> section because as soon as the web browser sees this tag it's going to stop parsing the page and start over after reinterpreting the whole page using the encoding you specified.

What do web browsers do if they don’t find any Content-Type, either in the http headers or the meta tag? Internet Explorer actually does something quite interesting: it tries to guess, based on the frequency in which various bytes appear in typical text in typical encodings of various languages, what language and encoding was used. Because the various old 8 bit code pages tended to put their national letters in different ranges between 128 and 255, and because every human language has a different characteristic histogram of letter usage, this actually has a chance of working. It’s truly weird, but it does seem to work often enough that naïve web-page writers who never knew they needed a Content-Type header look at their page in a web browser and it looks ok, until one day, they write something that doesn’t exactly conform to the letter-frequency-distribution of their native language, and Internet Explorer decides it’s Korean and displays it thusly, proving, I think, the point that Postel’s Law about being “conservative in what you emit and liberal in what you accept” is quite frankly not a good engineering principle. Anyway, what does the poor reader of this website, which was written in Bulgarian but appears to be Korean (and not even cohesive Korean), do? He uses the View | Encoding menu and tries a bunch of different encodings (there are at least a dozen for Eastern European languages) until the picture comes in clearer. If he knew to do that, which most people don’t.
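
The same guessing idea lives on outside the browser; the third-party chardet library for Python, for instance, makes a statistical guess from byte frequencies and reports how confident it is (a sketch; the answer is only ever a guess, especially for short strings):

import chardet   # pip install chardet -- a byte-frequency based detector

data = 'Здравей, свят'.encode('windows-1251')   # Bulgarian text in a legacy Cyrillic code page
print(chardet.detect(data))                     # a dict with its best guess, e.g. {'encoding': 'windows-1251', 'confidence': ...}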

For the latest version of CityDesk, the web site management software published by my company, we decided to do everything internally in UCS-2 (two byte) Unicode, which is what Visual Basic, COM, and Windows NT/2000/XP use as their native string type. In C++ code we just declare strings as wchar_t (“wide char”) instead of char and use the wcs functions instead of the str functions (for example wcscat and wcslen instead of strcat and strlen). To create a literal UCS-2 string in C code you just put an L before it as so: L"Hello".

When CityDesk publishes the web page, it converts it to UTF-8 encoding, which has been well supported by web browsers for many years. That’s the way all 29 language versions of Joel on Software are encoded and I have not yet heard a single person who has had any trouble viewing them.

This article is getting rather long, and I can’t possibly cover everything there is to know about character encodings and Unicode, but I hope that if you’ve read this far, you know enough to go back to programming, using antibiotics instead of leeches and spells, a task to which I will leave you now.
