There has been lots of buzz about many of the new features in PHP 5.4, like the traits support, the short array syntax and all those other syntax improvements.

But one set of changes that I think is particularly important was largely overlooked: For PHP 5.4 cataphract (Artefacto on StackOverflow) heroically rewrote large parts of htmlspecialchars, thus fixing various quirks and adding some really nice new features.

(The changes discussed here apply not only to htmlspecialchars, but also to the related htmlentities and in parts to htmlspecialchars_decode, html_entity_decode and get_html_translation_table.)

Here is a quick summary of the most important changes:

  • UTF-8 as the default charset
  • Improved error handling (ENT_SUBSTITUTE)
  • Doctype handling (ENT_HTML401, …)

UTF-8 as the default charset

As you hopefully know, the third argument for htmlspecialchars is the character set. Thing is: Most people just leave that argument out, thus falling back to the default charset. This default charset was ISO-8859-1 before PHP 5.4 and as such did not match the UTF-8 encoding most people use. PHP 5.4 fixes this by making UTF-8 the default.
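To make the effect of the charset argument visible, here is a minimal sketch. It uses htmlentities rather than htmlspecialchars, because the mismatch shows up directly in the output there; the sample string is just an illustration.

```php
<?php
// "é" encoded as UTF-8 is the two bytes 0xC3 0xA9.
$utf8 = "caf\xC3\xA9";

// Old default (ISO-8859-1): every byte is treated as a character of its own,
// so the two bytes turn into two unrelated entities.
$asLatin1 = htmlentities($utf8, ENT_QUOTES, 'ISO-8859-1');

// PHP 5.4 default (UTF-8): the two bytes are read as the single character "é".
$asUtf8 = htmlentities($utf8, ENT_QUOTES, 'UTF-8');

echo $asLatin1, "\n"; // caf&Atilde;&copy;
echo $asUtf8, "\n";   // caf&eacute;
```

Passing the charset explicitly, as done here, gives the same result on PHP 5.3 and 5.4.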

Improved error handling

Error handling in htmlspecialchars before PHP 5.4 was … uhm, let’s call it “unintuitive”:

If you passed a string containing an “invalid code unit sequence” (which is Unicode slang for “not encoded correctly”), htmlspecialchars would return an empty string. Well, okay, so far so good. The funny thing was that it additionally would throw an error, but only if error display was disabled. So it would only error if errors are hidden. Nice, innit?

This basically meant that on your development machine you wouldn’t see any errors, but on your production machine the error log would be flooded with them. Awesome.

So, as of PHP 5.4 thankfully this behavior is gone. The error will not be generated anymore.

Additionally there are two options that allow you to specify an alternative to just returning an empty string:

  • ENT_IGNORE: This option (which isn’t actually new, it was there in PHP 5.3 already) will just drop all invalid code unit sequences. This is bad for two reasons: First, you won’t notice invalid encoding because it’ll simply be dropped. Second, this poses a certain security risk (for more info see the Unicode Security Considerations).
  • ENT_SUBSTITUTE: This new alternative option takes a much more sensible approach to the problem: Instead of just dropping the code units, they will be replaced by a Unicode Replacement Character (U+FFFD). So invalid code unit sequences will be replaced by � characters.

Let’s have a look at the different behaviors (demo):

[php]

// "\x80" is invalid UTF-8 in this context
var_dump(htmlspecialchars("a\x80b"));                 // string(0) ""
var_dump(htmlspecialchars("a\x80b", ENT_IGNORE));     // string(2) "ab"
var_dump(htmlspecialchars("a\x80b", ENT_SUBSTITUTE)); // string(5) "a�b"

[/php]
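The security concern with ENT_IGNORE can be sketched as follows. The input string and the naive keyword check are made up for illustration; the point is only that dropping bytes can produce text that an earlier check never saw. This assumes PHP 5.4 or later, where ENT_SUBSTITUTE is defined.

```php
<?php
// Hypothetical input: the byte 0x80 is not valid UTF-8 at this position.
$input = "java\x80script:alert(1)";

// A naive check (illustrative only) does not find the keyword,
// because the stray byte splits it in two.
$passesCheck = stripos($input, 'javascript:') === false;

// ENT_IGNORE silently drops the byte and glues the keyword back together.
$ignored = htmlspecialchars($input, ENT_QUOTES | ENT_IGNORE, 'UTF-8');

// ENT_SUBSTITUTE keeps the two halves apart with U+FFFD.
$substituted = htmlspecialchars($input, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');

var_dump($passesCheck);   // bool(true)
echo $ignored, "\n";      // javascript:alert(1)
echo $substituted, "\n";  // java�script:alert(1)
```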

Clearly, you want the last behavior. In your real code it will probably look like this:

[php]

// This goes into the bootstrap (or wherever appropriate) so that the code
// does not throw a notice on PHP 5.3.
if (!defined('ENT_SUBSTITUTE')) {
    // If you want the empty string behavior on 5.3:
    define('ENT_SUBSTITUTE', 0);
    // Or, if you want the char removal behavior on 5.3, use this line instead
    // (don't forget about the security issues though!):
    // define('ENT_SUBSTITUTE', ENT_IGNORE);
}

// Don't forget to specify the charset! Otherwise you'll get the old default charset on 5.3.
$escaped = htmlspecialchars($string, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');

[/php]
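If you do not want to repeat the flags and the charset at every call site, one option is a small wrapper. This is only a sketch: the function name e() is hypothetical and not part of PHP, and it assumes ENT_SUBSTITUTE is available (PHP 5.4, or the shim shown above).

```php
<?php
/**
 * Hypothetical helper: escape a value for HTML output with fixed,
 * explicit settings (both quote types, U+FFFD substitution, UTF-8).
 */
function e($value)
{
    return htmlspecialchars((string) $value, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');
}

echo e("<b>'x'</b>"), "\n";    // &lt;b&gt;&#039;x&#039;&lt;/b&gt;
echo bin2hex(e("a\x80b")), "\n"; // 61efbfbd62 ("a", U+FFFD, "b")
```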

Doctype handling

In PHP 5.4 there are four additional flags for specifying the used doctype:

  • ENT_HTML401 (HTML 4.01) => this is the default
  • ENT_HTML5 (HTML 5)
  • ENT_XML1 (XML 1)
  • ENT_XHTML (XHTML)

Depending on which doctype you specify, htmlspecialchars (and the other related functions) will use different entity tables.

You can see this in the following example (demo):

[php]

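// ENT_QUOTES is needed here: without it single quotes are not escaped at all
var_dump(htmlspecialchars("'", ENT_QUOTES | ENT_HTML401)); // string(6) "&#039;"
var_dump(htmlspecialchars("'", ENT_QUOTES | ENT_HTML5));   // string(6) "&apos;"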

[/php]

So for HTML 5 an &apos; entity will be generated, whereas for HTML 4.01 - which does not yet support &apos; - a numerical &#039; entity is returned.

The difference becomes more evident when using htmlentities, because the differences are larger there. You can easily see this by having a look at the raw translation tables.

To do this, we can use the get_html_translation_table function. Here is a first example for the XML 1 doctype (demo):

[php]

var_dump(get_html_translation_table(HTML_ENTITIES, ENT_QUOTES | ENT_XML1));

[/php]

The result will look like this:

array(5) {
  ["""]=>
  string(6) "&quot;"
  ["&"]=>
  string(5) "&amp;"
  ["'"]=>
  string(6) "&apos;"
  ["<"]=>
  string(4) "&lt;"
  [">"]=>
  string(4) "&gt;"
}

This matches our expectations: XML by itself defines only the five basic entities.

Now try the same thing for HTML 5 (demo) and you’ll see something like this:

array(1510) {
  ["    "]=>
  string(5) "&Tab;"
  ["
"]=>
  string(9) "&NewLine;"
  ["!"]=>
  string(6) "&excl;"
  ["""]=>
  string(6) "&quot;"
  ["#"]=>
  string(5) "&num;"
  ["$"]=>
  string(8) "&dollar;"
  ["%"]=>
  string(8) "&percnt;"
  ["&"]=>
  string(5) "&amp;"
  ["'"]=>
  string(6) "&apos;"
  // ...
}

So HTML 5 defines a vast number of entities - 1510 to be precise. You can also try HTML 4.01 and XHTML; they both define 253 entities.
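To compare the table sizes on your own installation, something like the following sketch can be used. The exact numbers may differ between PHP versions, so no counts are hardcoded here apart from the five XML entities.

```php
<?php
// Doctype flags introduced in PHP 5.4, keyed by a display name.
$doctypes = array(
    'HTML 4.01' => ENT_HTML401,
    'XHTML'     => ENT_XHTML,
    'XML 1'     => ENT_XML1,
    'HTML 5'    => ENT_HTML5,
);

$counts = array();
foreach ($doctypes as $name => $flag) {
    // Pass the charset explicitly so the result does not depend on the default.
    $table = get_html_translation_table(HTML_ENTITIES, ENT_QUOTES | $flag, 'UTF-8');
    $counts[$name] = count($table);
    echo $name, ': ', $counts[$name], " entities\n";
}
```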

Also affected by the chosen doctype is another new error handling flag which I did not mention above: ENT_DISALLOWED. This flag replaces characters that are formally valid code unit sequences, but are invalid in the given doctype, with a Unicode Replacement Character.

This way you can ensure that the returned string is always well formed regarding encoding (in the given doctype). I’m not sure though how much sense it makes to use this flag. The browser will handle invalid characters gracefully anyways, so this seems unnecessary to me (though I’m probably wrong).
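As a small illustration of ENT_DISALLOWED, take the control character U+0001: it is a perfectly valid UTF-8 sequence, but as far as I can tell it is not an allowed character in HTML 4.01 documents. The sketch below prints the results as hex so that the invisible bytes can be seen.

```php
<?php
// U+0001 is valid UTF-8 (a single byte), but not a permitted character in HTML 4.01.
$input = "a\x01b";

// Without ENT_DISALLOWED the control character is passed through unchanged.
$plain = htmlspecialchars($input, ENT_QUOTES | ENT_HTML401, 'UTF-8');

// With ENT_DISALLOWED it is replaced by U+FFFD (bytes EF BF BD in UTF-8).
$strict = htmlspecialchars($input, ENT_QUOTES | ENT_HTML401 | ENT_DISALLOWED, 'UTF-8');

echo bin2hex($plain), "\n";  // 610162
echo bin2hex($strict), "\n"; // 61efbfbd62
```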

There is other stuff too…

… but I don’t want to list everything here. I think the three changes mentioned above are the most important improvements.

[php]

htmlspecialchars("<\x80The End\xef\xbf\xbf>", ENT_QUOTES | ENT_HTML5 | ENT_DISALLOWED | ENT_SUBSTITUTE, 'UTF-8');

[/php]

Reposted from: https://my.oschina.net/clearchen/blog/113105
