Using Unicode Code Points to Recognize Chinese Characters, Letters, and Digits, Including Rare Han Characters

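Searching online for posts on how to recognize Chinese characters, I found that most of them only check the commonly used Han characters, i.e. the Unicode range 0x4E00 ~ 0x9FA5.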

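When the figures below were compiled, the latest Unicode version was 5.2, published in late 2009, which extended the Han repertoire once more. The often-quoted 20,902 Han characters occupy 0x4E00-0x9FA5 in Unicode, but they are not all of the Han characters Unicode defines. The Han blocks are as follows:

0x4E00-0x9FFF: CJK Unified Ideographs, common characters, 20,992 code positions (assigned only up to 0x9FC3 at the time)
0x3400-0x4DBF: CJK Unified Ideographs Extension A, rarely used characters, 6,592 code positions
0x20000-0x2A6DF: CJK Unified Ideographs Extension B, rarely used and historical characters, 42,720 code positions
0xF900-0xFAFF: CJK Compatibility Ideographs, duplicates and unifiable variants, 512 code positions
0x2F800-0x2FA1F: CJK Compatibility Ideographs Supplement, unifiable variants, 544 code positions

Together these blocks provide 71,360 code positions for the Han characters and radicals used in mainland China, Taiwan, Hong Kong, Singapore, and Malaysia, as well as in Japanese and Korean. Later Unicode versions add further extension blocks (Extension C onward), which this post does not cover.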
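The block list above can be turned into a small lookup. The following is only a sketch: the class name `CjkBlocks` and the method `TryGetBlockName` are made up for illustration, and Extension A is assumed to end at 0x4DBF, because 0x4DC0-0x4DFF belongs to the Yijing Hexagram Symbols block rather than to the Han characters.

```csharp
using System;

public static class CjkBlocks
{
    // Block boundaries for the five Han blocks discussed in this post.
    // Newer Unicode versions define additional extension blocks not listed here.
    private static readonly (int Start, int End, string Name)[] Blocks =
    {
        (0x4E00, 0x9FFF, "CJK Unified Ideographs"),
        (0x3400, 0x4DBF, "CJK Unified Ideographs Extension A"),
        (0x20000, 0x2A6DF, "CJK Unified Ideographs Extension B"),
        (0xF900, 0xFAFF, "CJK Compatibility Ideographs"),
        (0x2F800, 0x2FA1F, "CJK Compatibility Ideographs Supplement"),
    };

    // Returns true and the block name if the code point falls in one of the
    // blocks above; otherwise returns false and an empty name.
    public static bool TryGetBlockName(int codePoint, out string name)
    {
        foreach (var block in Blocks)
        {
            if (codePoint >= block.Start && codePoint <= block.End)
            {
                name = block.Name;
                return true;
            }
        }
        name = "";
        return false;
    }
}
```

Note that the argument is a full code point (an `int`), not a `char`: values at or above 0x10000 cannot be held in a single `char`, which is exactly the difficulty the rest of this post deals with.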

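How to recognize rare Han characters, especially those in 0x20000-0x2A6DF that lie beyond what two bytes can encode, is the focus of this post. If you just want the answer, skip to the end of the post.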

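Before giving the answer, some background. In C#, string is encoded as UTF-16, and every two bytes form one code unit. This matches the fact that a string is made up of char values: a char occupies 2 bytes and holds one UTF-16 code unit, which is a complete character only for code points below 0x10000. When you foreach over a string, the iterator advances two bytes, i.e. one char, at a time.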
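To make the difference between char values and actual characters concrete, here is a minimal sketch. The class name `Utf16Demo` and its methods are hypothetical; `char.IsSurrogatePair(string, int)` is the standard .NET method.

```csharp
using System;

public static class Utf16Demo
{
    // Number of UTF-16 code units (char values) in the string.
    public static int CountChars(string text) => text.Length;

    // Number of Unicode code points, counting each surrogate pair once.
    public static int CountCodePoints(string text)
    {
        int count = 0;
        for (int i = 0; i < text.Length; i++)
        {
            // A valid surrogate pair takes two chars but is one code point,
            // so skip its second half.
            if (char.IsSurrogatePair(text, i)) i++;
            count++;
        }
        return count;
    }
}
```

For the string `"中\U00020000"` (the common character U+4E2D followed by the Extension B character U+20000), `CountChars` returns 3 while `CountCodePoints` returns 2, because U+20000 is stored as two char values.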

So how do you iterate over complete Unicode characters? Microsoft's official documentation (https://docs.microsoft.com/zh-cn/dotnet/api/system.string?view=netstandard-2.0#Characters) gives the answer:

Consecutive index values might not correspond to consecutive Unicode characters, because a Unicode character might be encoded as more than one Char object. In particular, a string may contain multi-character units of text that are formed by a base character followed by one or more combining characters or by surrogate pairs. To work with Unicode characters instead of Char objects, use the System.Globalization.StringInfo and TextElementEnumerator classes. The following example illustrates the difference between code that works with Char objects and code that works with Unicode characters. It compares the number of characters or text elements in each word of a sentence. The string includes two sequences of a base character followed by a combining character.

using System;
using System.Collections.Generic;
using System.Globalization;

public class Example
{
    public static void Main()
    {
        // First sentence of The Mystery of the Yellow Room, by Leroux.
        string opening = "Ce n'est pas sans une certaine émotion que " +
                         "je commence à raconter ici les aventures " +
                         "extraordinaires de Joseph Rouletabille.";
        // Character counters.
        int nChars = 0;
        // Objects to store word count.
        List<int> chars = new List<int>();
        List<int> elements = new List<int>();

        foreach (var ch in opening) {
            // Skip the ' character.
            if (ch == '\u0027') continue;

            if (Char.IsWhiteSpace(ch) | (Char.IsPunctuation(ch))) {
                chars.Add(nChars);
                nChars = 0;
            }
            else {
                nChars++;
            }
        }

        TextElementEnumerator te = StringInfo.GetTextElementEnumerator(opening);
        while (te.MoveNext()) {
            string s = te.GetTextElement();
            // Skip the ' character.
            if (s == "\u0027") continue;
            if (String.IsNullOrEmpty(s.Trim()) | (s.Length == 1 && Char.IsPunctuation(Convert.ToChar(s)))) {
                elements.Add(nChars);
                nChars = 0;
            }
            else {
                nChars++;
            }
        }

        // Display character counts.
        Console.WriteLine("{0,6} {1,20} {2,20}",
                          "Word #", "Char Objects", "Characters");
        for (int ctr = 0; ctr < chars.Count; ctr++)
            Console.WriteLine("{0,6} {1,20} {2,20}",
                              ctr, chars[ctr], elements[ctr]);
    }
}

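Using the API Microsoft provides is one approach. You can also walk through the string char by char and decide whether the next character is made up of one char or two; for the details, see https://blog.csdn.net/pzasdq/article/details/51075176. Here is one expert's test code: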

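static bool Test(string text)
{
    // Partial code point built from a high surrogate; 0 means no pair is in progress.
    uint high = 0;
    foreach (char code in text)
    {
        if (high == 0)
        {
            // Letters, digits, and the two-byte Han ranges are accepted directly.
            if (!char.IsLetterOrDigit(code) &&
                (code < 0x3400 || code > 0x4db5) &&
                (code < 0x4e00 || code > 0x9fbb) &&
                (code < 0xf900 || code > 0xfa2d) &&
                (code < 0xfa30 || code > 0xfa6a) &&
                (code < 0xfa70 || code > 0xfad9))
            {
                // Anything else must be a high surrogate (0xD800-0xDBFF) that starts a pair.
                if ((code & 0xfc00) == 0xd800)
                    high = ((code & 0x3ffu) << 10) + 0x10000;
                else
                {
                    return false;
                }
            }
        }
        else
        {
            // A high surrogate must be followed by a low surrogate (0xDC00-0xDFFF).
            if ((code & 0xfc00) == 0xdc00)
            {
                high += code & 0x3ffu;
                // Accept only Extension B and the Compatibility Supplement.
                if (high < 0x20000 || high > 0x2a6d6)
                {
                    if (high < 0x2f800 || high > 0x2fa1d)
                    {
                        return false;
                    }
                }
                high = 0;
            }
            else
            {
                return false;
            }
        }
    }
    // A trailing high surrogate without its low half is invalid.
    return high == 0;
}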

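The principle is to inspect the special UTF-16 bit patterns to determine whether a char is a complete character on its own or has to be combined with the next char to form one.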
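The arithmetic behind that check can be isolated as follows. This is a minimal sketch with a hypothetical class name `SurrogateMath`: a high surrogate (0xD800-0xDBFF) carries the upper 10 bits, a low surrogate (0xDC00-0xDFFF) carries the lower 10 bits, and 0x10000 is added to the combined value.

```csharp
using System;

public static class SurrogateMath
{
    // Combines a high and a low surrogate into one code point by hand.
    public static int Combine(char high, char low)
    {
        // The top six bits identify which half of a pair a char is.
        if ((high & 0xFC00) != 0xD800)
            throw new ArgumentException("Not a high surrogate.", nameof(high));
        if ((low & 0xFC00) != 0xDC00)
            throw new ArgumentException("Not a low surrogate.", nameof(low));

        // Upper 10 bits from the high half, lower 10 bits from the low half.
        return 0x10000 + ((high & 0x3FF) << 10) + (low & 0x3FF);
    }
}
```

For example, the pair 0xD840, 0xDC00 combines to 0x20000, the first code point of Extension B, and 0xD869, 0xDEDF combines to 0x2A6DF, the last position of that block. The built-in `char.ConvertToUtf32(char, char)` performs the same conversion.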

Building on this, and combining it with the enumerator Microsoft provides, here is another solution:

// Requires: using System.Globalization;
public static bool IsChineseLetterOrDigit(string str)
{
    TextElementEnumerator t = StringInfo.GetTextElementEnumerator(str);
    while (t.MoveNext())
    {
        string s = t.GetTextElement();
        if (s.Length == 1)
        {
            // A single char: a two-byte character.
            char ch = s[0];
            uint unicode = (uint) ch;
            if (char.IsLetterOrDigit(ch)) continue;
            if (0x4E00 <= unicode && unicode <= 0x9FBF) continue; // CJK Unified Ideographs
            if (0x3400 <= unicode && unicode <= 0x4DBF) continue; // Extension A
            if (0xF900 <= unicode && unicode <= 0xFAFF) continue; // Compatibility Ideographs
            if (0x3105 <= unicode && unicode <= 0x312D) continue; // Bopomofo
            return false;
        }
        else // rare characters encoded in four bytes (a surrogate pair)
        {
            // Longer elements (e.g. a base character plus combining marks) are rejected.
            if (s.Length > 2) return false;
            char high = s[0], low = s[1];
            if ((high & 0xfc00) != 0xd800) return false;
            if ((low & 0xfc00) != 0xdc00) return false;
            uint unicode = 0x10000 + ((high & 0x3ffu) << 10) + (low & 0x3ffu);
            if (0x20000 <= unicode && unicode <= 0x2A6DF) continue; // Extension B
            if (0x2F800 <= unicode && unicode <= 0x2FA1F) continue; // Compatibility Supplement
            return false;
        }
    }
    return true;
}

That is all for this post; feel free to use the code above.

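P.S. Do not rely on char.IsLetterOrDigit here. It accepts letters and digits from every script (roughly everything except punctuation), so a Han character counts as a letter, and so does a Hangul syllable. What we want to recognize is digits and English letters, so it is better to check the ASCII ranges directly. Even Microsoft ships functions with names this ambiguous, emmmm…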
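A minimal sketch of the ASCII-only check suggested above; the class name `AsciiCheck` and the method name are hypothetical. It could replace the `char.IsLetterOrDigit` call in the earlier functions if only English letters and digits should pass.

```csharp
using System;

public static class AsciiCheck
{
    // True only for 0-9, A-Z, and a-z; letters from other scripts are rejected.
    public static bool IsAsciiLetterOrDigit(char ch)
    {
        return (ch >= '0' && ch <= '9')
            || (ch >= 'A' && ch <= 'Z')
            || (ch >= 'a' && ch <= 'z');
    }
}
```

With this helper, `'a'` and `'7'` pass, while `'中'` and `'한'` are rejected, even though `char.IsLetterOrDigit` returns true for both of the latter.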
