检测编码并制作一切UTF-8

本文翻译自：Detect encoding and make everything UTF-8

I'm reading out lots of texts from various RSS feeds and inserting them into my database. 我正在从各种RSS提要中读取很多文本，并将其插入到数据库中。

Of course, there are several different character encodings used in the feeds, eg UTF-8 and ISO 8859-1. 当然，提要中使用了几种不同的字符编码，例如UTF-8和ISO 8859-1。

Unfortunately, there are sometimes problems with the encodings of the texts. 不幸的是，文本的编码有时会出现问题。 Example: 例：

The "ß" in "Fußball" should look like this in my database: "ÂŸ". “Fußball”中的“ß”在我的数据库中应如下所示：“ÂŸ”。 If it is a "ÂŸ", it is displayed correctly. 如果它是“Â”，则正确显示。
Sometimes, the "ß" in "Fußball" looks like this in my database: "ÃƒÂŸ". 有时，“Fußball”中的“ß”在我的数据库中看起来像这样：“ÃƒÂŸ”。 Then it is displayed wrongly, of course. 然后，当然会显示错误。
In other cases, the "ß" is saved as a "ß" - so without any change. 在其他情况下，“ß”另存为“ß”-因此无需进行任何更改。 Then it is also displayed wrongly. 然后它也会显示错误。

What can I do to avoid the cases 2 and 3? 我该如何避免情况2和情况3？

How can I make everything the same encoding, preferably UTF-8? 如何使所有内容都使用相同的编码，最好是UTF-8？ When must I use utf8_encode() , when must I use utf8_decode() (it's clear what the effect is but when must I use the functions?) and when must I do nothing with the input? 什么时候必须使用utf8_encode() ，什么时候必须使用utf8_decode() （很明显会产生什么效果，但是什么时候必须使用这些函数？），什么时候不对输入执行任何操作？

How do I make everything the same encoding? 如何使所有内容都具有相同的编码？ Perhaps with the function mb_detect_encoding() ? 也许使用mb_detect_encoding()函数？ Can I write a function for this? 我可以为此编写函数吗？ So my problems are: 所以我的问题是：

How do I find out what encoding the text uses? 如何找出文字使用的编码方式？
How do I convert it to UTF-8 - whatever the old encoding is? 我如何将其转换为UTF-8-不管旧的编码是什么？

Would a function like this work? 这样的功能会起作用吗？

function correct_encoding($text) {$current_encoding = mb_detect_encoding($text, 'auto');$text = iconv($current_encoding, 'UTF-8', $text);return $text;
}

I've tested it, but it doesn't work. 我已经测试过了，但是没有用。 What's wrong with it? 它出什么问题了？

#1楼

参考：https://stackoom.com/question/3owD/检测编码并制作一切UTF

#2楼

I had same issue with phpQuery ( ISO-8859-1 instead of UTF-8 ) and this hack helped me: 我在phpQuery上遇到了同样的问题（ ISO-8859-1而不是UTF-8 ），这个hack帮助了我：

$html = '<?xml version="1.0" encoding="UTF-8" ?>' . $html;

mb_internal_encoding('UTF-8') , phpQuery::newDocumentHTML($html, 'utf-8') , mbstring.internal_encoding and other manipulations didn't take any effect. mb_internal_encoding('UTF-8') ， phpQuery::newDocumentHTML($html, 'utf-8') ， mbstring.internal_encoding和其他操作均无效。

#3楼

Get encoding from headers and convert it to utf-8. 从标题获取编码并将其转换为utf-8。

$post_url='http://website.domain';/// Get headers
function get_headers_curl($url)
{ $ch = curl_init(); curl_setopt($ch, CURLOPT_URL,            $url); curl_setopt($ch, CURLOPT_HEADER,         true); curl_setopt($ch, CURLOPT_NOBODY,         true); curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); curl_setopt($ch, CURLOPT_TIMEOUT,        15); $r = curl_exec($ch); return $r;
}
$the_header = get_headers_curl($post_url);
/// check for redirect /
if (preg_match("/Location:/i", $the_header)) {$arr = explode('Location:', $the_header);$location = $arr[1];$location=explode(chr(10), $location);$location = $location[0];$the_header = get_headers_curl(trim($location));
}
/// Get charset /
if (preg_match("/charset=/i", $the_header)) {$arr = explode('charset=', $the_header);$charset = $arr[1];$charset=explode(chr(10), $charset);$charset = $charset[0];}
///
// echo $charset;if($charset && $charset!='UTF-8') { $html = iconv($charset, "UTF-8", $html); }

#4楼

I know this is an older question, but I figure a useful answer never hurts. 我知道这是一个比较老的问题，但是我认为一个有用的答案永远不会让您感到痛苦。 I was having issues with my encoding between a desktop application, SQLite, and GET/POST variables. 我在桌面应用程序，SQLite和GET / POST变量之间的编码存在问题。 Some would be in UTF-8, some would be in ASCII, and basically everything would get screwed up when foreign characters got involved. 有些使用UTF-8，有些使用ASCII，并且基本上，当涉及到外来字符时，所有内容都会搞砸。

Here is my solution. 这是我的解决方案。 It scrubs your GET/POST/REQUEST (I omitted cookies, but you could add them if desired) on each page load before processing. 在处理之前，它会在每次加载页面时擦洗您的GET / POST / REQUEST（我省略了cookie，但可以根据需要添加它们）。 It works well in a header. 它在标头中运行良好。 PHP will throw warnings if it can't detect the source encoding automatically, so these warnings are suppressed with @'s. 如果PHP无法自动检测源编码，则会抛出警告，因此这些警告将被@抑制。

//Convert everything in our vars to UTF-8 for playing nice with the database...
//Use some auto detection here to help us not double-encode...
//Suppress possible warnings with @'s for when encoding cannot be detected
try
{$process = array(&$_GET, &$_POST, &$_REQUEST);while (list($key, $val) = each($process)) {foreach ($val as $k => $v) {unset($process[$key][$k]);if (is_array($v)) {$process[$key][@mb_convert_encoding($k,'UTF-8','auto')] = $v;$process[] = &$process[$key][@mb_convert_encoding($k,'UTF-8','auto')];} else {$process[$key][@mb_convert_encoding($k,'UTF-8','auto')] = @mb_convert_encoding($v,'UTF-8','auto');}}}unset($process);
}
catch(Exception $ex){}

#5楼

A really nice way to implement an isUTF8 -function can be found on php.net : 可以在php.net上找到实现isUTF8功能的一种非常好的方法：

function isUTF8($string) {return (utf8_encode(utf8_decode($string)) == $string);
}

#6楼

If you apply utf8_encode() to an already UTF-8 string, it will return garbled UTF-8 output. 如果将utf8_encode()应用于已经存在的UTF-8字符串，它将返回乱码的UTF-8输出。

I made a function that addresses all this issues. 我做了一个解决所有这些问题的功能。 It´s called Encoding::toUTF8() . 这就是所谓的Encoding::toUTF8() 。

You don't need to know what the encoding of your strings is. 您不需要知道字符串的编码是什么。 It can be Latin1 ( ISO 8859-1) , Windows-1252 or UTF-8, or the string can have a mix of them. 它可以是Latin1（ ISO 8859-1）， Windows-1252或UTF-8，或者字符串可以混合使用。 Encoding::toUTF8() will convert everything to UTF-8. Encoding::toUTF8()会将所有内容转换为UTF-8。

I did it because a service was giving me a feed of data all messed up, mixing UTF-8 and Latin1 in the same string. 我之所以这样做，是因为某项服务使我的数据馈送全乱了，在同一字符串中混合了UTF-8和Latin1。

Usage: 用法：

require_once('Encoding.php');
use \ForceUTF8\Encoding;  // It's namespaced now.$utf8_string = Encoding::toUTF8($utf8_or_latin1_or_mixed_string);$latin1_string = Encoding::toLatin1($utf8_or_latin1_or_mixed_string);

Download: 下载：

https://github.com/neitanod/forceutf8 https://github.com/neitanod/forceutf8

I've included another function, Encoding::fixUFT8() , which will fix every UTF-8 string that looks garbled. 我包括了另一个函数Encoding::fixUFT8() ，该函数将修复每个看起来乱码的UTF-8字符串。

Usage: 用法：

require_once('Encoding.php');
use \ForceUTF8\Encoding;  // It's namespaced now.$utf8_string = Encoding::fixUTF8($garbled_utf8_string);

Examples: 例子：

echo Encoding::fixUTF8("FÃ©dÃ©ration Camerounaise de Football");
echo Encoding::fixUTF8("FÃÂ©dÃÂ©ration Camerounaise de Football");
echo Encoding::fixUTF8("FÃÂÃÂ©dÃÂÃÂ©ration Camerounaise de Football");
echo Encoding::fixUTF8("FÃÂ©dération Camerounaise de Football");

will output: 将输出：

Fédération Camerounaise de Football
Fédération Camerounaise de Football
Fédération Camerounaise de Football
Fédération Camerounaise de Football

I've transformed the function ( forceUTF8 ) into a family of static functions on a class called Encoding . 我已经将函数（ forceUTF8 ）转换为称为Encoding的类的静态函数系列。 The new function is Encoding::toUTF8() . 新函数是Encoding::toUTF8() 。

检测编码并制作一切UTF-8相关推荐

php 检测编码函数,自己写了一个php检测文件编码的函数
关于文件编码的检测,百度一下一大把都是,但是确实没有能用的. 很多人建议 mb_detect_encoding 检测,可是不知为何我这不成功,什么都没输出. 看到有人写了个增强版,用 BOM 判断的, ...
深度学习vad人声检测之标签制作
深度学习几乎渗透到了各行各业,最火热的莫过于视觉算法.然而,音频相关的很多处理算法也逐渐被深度学习所浸润,vad作为音频前处理的一个操作得到了很广泛的应用,比较典型的vad检测算法是通过提取特征,构造 ...
Python使用chardet包自动检测编码
chardet:charset detection 一旦自动检测出编码,就可以解码了. 八种文件打开方式 w:一旦打开文件,文件内容就清空了 r:只读方式打开 a:追加方式打开 r+:先读后写以上四 ...
姿态检测树莓派_基于深度学习的树莓派老人摔倒检测系统的制作方法
本发明属于电子器件技术领域,涉及基于深度学习的树莓派老人摔倒检测系统. 背景技术: 日益庞大的老年群体已成为人们关注的焦点.由于老年人身体活动不便等特点,摔倒已成为我国人员伤亡的第四位原因,而意外摔倒 ...
pcb成型板aoi检测_一种PCB板的AOI检测控制系统的制作方法
本实用新型属于SMT贴片加工工艺技术领域,具体涉及一种PCB板的AOI检测控制系统. 背景技术: 随着表面贴装元件的广泛应用,电子产品的体积变得越来越小,其焊接质量直接影响到产品的稳定性,目前电子元件 ...
adc0832对光电二极管进行数据采集_一种基于光电二极管的麦克风跟踪检测电路的制作方法...
本实用新型涉及一种基于光电二极管的麦克风跟踪检测电路,属于应用电子技术领域. 背景技术: 随着互联网的发展,语音交互应用正在日益变多,近几年视频直播.网络直播.K歌软件都发展得很快,也推高了麦克风的销 ...
基于改进 YOLOv5 的航空发动机表面缺陷检测模型如何制作？
建立基于改进 YOLOv5 的航空发动机表面缺陷检测模型主要需要以下步骤: 收集航空发动机表面缺陷的数据集.这些数据可以包括训练图像和标签数据,其中标签数据包含了航空发动机表面缺陷的位置信息. 利用 ...
python3----智能检测编码的工具
1 f = open('C:/Users/Administrator/Desktop/100.txt', 'rb') 2 data = f.read() 3 # print(data) 4 f.clo ...
谈谈Unicode编码，简要解释UCS、UTF、BMP、BOM等名词
这是一篇程序员写给程序员的趣味读物.所谓趣味是指可以比较轻松地了解一些原来不清楚的概念,增进知识,类似于打RPG游戏的升级.整理这篇文章的动机是两个问题: 问题一: 使用Windows记事本的&quo ...

检测编码并制作一切UTF-8

#1楼

#2楼

#3楼

#4楼

#5楼

#6楼

检测编码并制作一切UTF-8相关推荐

最新文章

热门文章