Fixing garbled Chinese file names in wget downloads

When mirroring a directory index served by Apache or nginx, file names can come out garbled. In most cases adding the --restrict-file-names=nocontrol option is enough.

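One prerequisite: check which encoding the directory index actually uses. Some sites use UTF-8 (the more standard choice; Chinese names cause little trouble there, and method 2 can handle them), while others use GBK. This probably depends on the encoding of the file names themselves or on the Apache or nginx configuration.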
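One way to check the encoding is to look at the percent-encoded href values in the index page. The following is a minimal sketch rather than a reliable detector: guess_name_encoding is a hypothetical helper, and it assumes the only candidates are UTF-8 and GBK. UTF-8 is tried first because its strict decoder rejects most GBK byte sequences, whereas GBK accepts many byte strings that are not really GBK.

```python
from urllib.parse import unquote_to_bytes


def guess_name_encoding(href, candidates=("utf-8", "gbk")):
    """Return the first candidate that decodes the percent-encoded href.

    Hypothetical helper for illustration; returns None if nothing fits.
    """
    raw = unquote_to_bytes(href)  # "%E4%B8%AD" -> b"\xe4\xb8\xad"
    for encoding in candidates:
        try:
            raw.decode(encoding)  # strict decoding: raises on invalid bytes
            return encoding
        except UnicodeDecodeError:
            continue
    return None


# "中文" percent-encoded as UTF-8 and as GBK respectively.
print(guess_name_encoding("%E4%B8%AD%E6%96%87"))
print(guess_name_encoding("%D6%D0%CE%C4"))
```

A pure ASCII name decodes under every candidate, so the result only means something for names that contain Chinese characters.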

1. Save the names as ASCII while downloading, similar to method 3

wget --restrict-file-names=ascii -m www.xxx.com/

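2. Use a batch renaming tool. 菲菲更名宝贝 RenamePro 8.0 works very well. Under "高级文件名变更" (advanced file name changes) there is a "文件名编码与解码" (file name encoding and decoding) section with an "ANSI编码URL字符串转换为文字" (convert ANSI-encoded URL string to text) option. It is worth experimenting with.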

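3. If that does not work, look into wget's own options. Two are relevant:
--local-encoding=ENC    use ENC as the local encoding for IRIs (Internationalized Resource Identifiers).
--remote-encoding=ENC   use ENC as the default remote encoding.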
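The "ANSI-encoded URL string to text" conversion in item 2 presumably amounts to percent-decoding the name and interpreting the resulting bytes in the local code page. A minimal sketch of that idea follows. It assumes "ANSI" means GBK (code page 936), as on a Simplified Chinese Windows system, and decode_url_name is a hypothetical helper, not part of any tool mentioned here.

```python
from urllib.parse import unquote


def decode_url_name(name, encoding):
    """Percent-decode `name` and interpret the bytes in `encoding`.

    Hypothetical helper; raises UnicodeDecodeError if the bytes do not fit.
    """
    return unquote(name, encoding=encoding, errors="strict")


# The same two characters, percent-encoded from GBK and from UTF-8 bytes.
print(decode_url_name("%D6%D0%CE%C4", "gbk"))
print(decode_url_name("%E4%B8%AD%E6%96%87", "utf-8"))
```

Decoding with the wrong encoding either raises an error or yields different characters, which is why it matters to know what the server used.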

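Method 1
moper: This method converts the file names to ASCII with the --restrict-file-names=ascii option and then uses a small Python program to convert them to an encoding Windows accepts. In fact, adding a different option, --restrict-file-names=nocontrol, is all we need.
The full command is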

wget --restrict-file-names=nocontrol -m http://xxx.xxx.xxx

The original ASCII-based approach instead uses

wget --restrict-file-names=ascii -m http://xxx.xxx.xxx

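With --restrict-file-names=ascii, Chinese file names are saved in URL-encoded (percent-encoded) form. For example, "2010架构师大会PPT" becomes "2010%E6%9E%B6%E6%9E%84%E5%B8%88%E5%A4%A7%E4%BC%9APPT". The root cause is that Chinese URLs on the web are encoded in UTF-8, whereas a Chinese-language Windows system interprets file name bytes as GBK. In other words, a garbled name such as "2010鏋舵瀯甯堝ぇ浼歅PT" is really the UTF-8 encoded file name displayed as GBK. So all we need is a small encoding converter written in Python. The code is as follows: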

#!/usr/bin/env python3
# Rename files and directories whose names were saved in the wrong encoding.
# Ported to Python 3; the original listing targeted Python 2.
import argparse
import os
import sys
from urllib.parse import unquote


def fix_name(name, input_encoding, output_encoding, is_url):
    """Return the repaired name, or None if it cannot be converted."""
    try:
        if is_url:
            # "%E6%9E%B6" -> bytes E6 9E B6 -> text decoded as input_encoding
            return unquote(name, encoding=input_encoding, errors="strict")
        # Garbled name: recover the raw bytes through the local encoding
        # (output_encoding), then decode them with their real encoding.
        return name.encode(output_encoding).decode(input_encoding)
    except (UnicodeError, LookupError):
        return None


def rename_tree(path, input_encoding, output_encoding, is_url):
    # Walk bottom-up so a directory is renamed only after its contents.
    for root, dirs, files in os.walk(path, topdown=False):
        for name in files + dirs:
            new = fix_name(name, input_encoding, output_encoding, is_url)
            # Skip failures, unchanged names, and names with path separators.
            if not new or new == name or "/" in new or "\\" in new:
                continue
            try:
                os.rename(os.path.join(root, name), os.path.join(root, new))
            except OSError as exc:
                print("skipped %r: %s" % (name, exc), file=sys.stderr)


def main(argv):
    parser = argparse.ArgumentParser(
        description="Change the encoding of file and directory names.")
    parser.add_argument("-i", "--input-encoding", default="utf-8",
                        help="original encoding, default is UTF-8")
    parser.add_argument("-o", "--output-encoding", default="gbk",
                        help="output (local) encoding, default is GBK")
    parser.add_argument("-p", "--path", required=True,
                        help="path to process")
    parser.add_argument("-u", "--is-url", action="store_true",
                        help="treat names as URL (percent) encoded")
    args = parser.parse_args(argv)
    rename_tree(args.path, args.input_encoding, args.output_encoding,
                args.is_url)


if __name__ == "__main__":
    main(sys.argv[1:])

If wget was run with the following command line:

wget --restrict-file-names=ascii -m http://ebook.elain.org

then the downloaded files have names of the form "2010%E6%9E%B6%E6%9E%84%E5%B8%88%E5%A4%A7%E4%BC%9APPT", and the script should be run with:

rename.py -i utf-8 -o gbk -p R:\ebook.elain.org -u
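The two kinds of broken name discussed in Method 1 can be reproduced in a few lines, which may make the cause easier to see. This is only an illustrative sketch: it uses Python's quote() to imitate the percent-encoded form shown above, and it assumes the garbled form comes from UTF-8 bytes being read as GBK.

```python
from urllib.parse import quote

name = "2010架构师大会PPT"

# Form 1: every non-ASCII UTF-8 byte written out as %XX.
escaped = quote(name, safe="")

# Form 2: the UTF-8 bytes misread as GBK, which produces the garbled text.
mojibake = name.encode("utf-8").decode("gbk")

# Repair for form 2: undo the wrong decoding, then decode correctly.
repaired = mojibake.encode("gbk").decode("utf-8")

print(escaped)
print(mojibake)
print(repaired)
```

The repair only works when the wrong decoding lost no bytes. If the UTF-8 bytes contain a sequence that GBK cannot represent, the first decode fails and the name has to be recovered from the percent-encoded form instead.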

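Method 2
Modify the wget source code

moper: I do not recommend this method because it is more work, and I have not tested it, though it may give better results.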

Article 1: 《wget中文乱码解决方案》 (A fix for garbled Chinese in wget)

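When wget downloads a page whose file name contains non-ASCII or other special characters, the saved name comes out garbled. One way to fix this for Chinese is to modify wget's source code.
URL strings are encoded in url.c. There, url_file_name() decides which file name a URL should be saved under. It calls append_uri_pathel(), which in turn uses the FILE_CHAR_TEST() macro to decide whether a character in the URL is special, that is, whether it must be URL-encoded (which includes Chinese characters). This macro is where the problem lies. To keep Chinese from being escaped, Chinese characters must be treated as ordinary characters. Take the FILE_CHAR_TEST() macro, shown here: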

#define FILE_CHAR_TEST(c, mask) \
((opt.restrict_files_nonascii && !c_isascii ((unsigned char)(c))) || \
(filechr_table[(unsigned char)(c)] & (mask)))

and change it to:

#define FILE_CHAR_TEST(c, mask) \
(((opt.restrict_files_nonascii && !c_isascii ((unsigned char)(c))) || \
(filechr_table[(unsigned char)(c)] & (mask))) \
&& !((c|0x0fffffff) == 0xffffffff)) /* exclude Chinese characters */
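What the added test (c|0x0fffffff) == 0xffffffff matches is not obvious from the C source. The Python sketch below emulates it under stated assumptions: c has type char, int is 32 bits wide, and the char_is_signed parameter models whether the platform's char is signed. The function name matches_added_test is hypothetical. Under these assumptions the test appears to rely on sign extension, so it is true for every byte with the high bit set, not only for bytes of Chinese characters.

```python
def matches_added_test(byte, char_is_signed=True):
    """Emulate (c | 0x0fffffff) == 0xffffffff for one byte value 0..255.

    Assumes a 32-bit int; char_is_signed models the platform's char type.
    """
    # Integer promotion: a signed char with the high bit set becomes negative.
    c = byte - 256 if (char_is_signed and byte >= 0x80) else byte
    # Mask to 32 bits to mimic the comparison against an unsigned constant.
    return ((c | 0x0FFFFFFF) & 0xFFFFFFFF) == 0xFFFFFFFF


signed_hits = [b for b in range(256) if matches_added_test(b)]
unsigned_hits = [b for b in range(256) if matches_added_test(b, False)]

print(hex(signed_hits[0]), hex(signed_hits[-1]), len(signed_hits))
print(len(unsigned_hits))
```

If char is unsigned on the target platform, the emulated test never matches, so the patch would have no effect there. This is a reading of the expression only, not a tested claim about any particular wget build.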

Method 3: modify the url_file_name() function
In the wget 1.12 source code, the following code at line 1402 of url.c:

for (p = b; p < e; p++)
  if (FILE_CHAR_TEST (*p, mask))
    ++quoted;

is changed to:

for (p = b; p < e; p++)
  if (FILE_CHAR_TEST (*p, mask) && !((*p | 0x0fffffff) == 0xffffffff))
    ++quoted;

Reposted from: https://blog.51cto.com/374400/1534332
