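Web projects sometimes need JavaScript to read data from text files. Those files do not all share one encoding: files containing Chinese characters are often GBK-encoded and usually show up as garbled text once loaded, so they need to be converted to UTF-8, the internationally compatible encoding, before loading. Garbled text is an annoying problem, and after a long search I finally found a solution. The Python program below converts either a single file or every matching file under a folder in one batch. It depends on the third-party chardet package to detect each file's encoding. I tested it on real files and it works. The file type to convert is set in the code; this post uses "cfg", and you can change it in the program to suit your project. The code is as follows: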

gbk2utf.py

#!/usr/bin/env python
# -*- coding: utf-8 -*-

__author__ = ''

import logging, os, argparse, textwrap
import time
import chardet

# Default configuration will take effect when corresponding input args are missing.
# Feel free to change this for your convenience.
DEFAULT_CONF = {
    # Only those files ending with extensions in this list will be scanned or converted.
    'exts'      : ['cfg'],
    'overwrite' : False,
    'add_BOM'   : False,
    'convert_UTF'   : False,
    'confi_thres' : 0.8,
}

# We have to set a minimum threshold. Only those encoding results returned by chardet
# whose confidence is above that threshold level would be accepted.
# See https://github.com/x1angli/convert2utf/issues/4 for further details

logging.basicConfig(format='%(levelname)s:%(message)s', level=logging.INFO)
log = logging.getLogger(__name__)


class Convert2Utf8:
    def __init__(self, args):
        self.args = args

    def walk_dir(self, dirname):
        for root, dirs, files in os.walk(dirname):
            for name in files:
                extension = os.path.splitext(name)[1][1:].strip().lower()
                # On linux there is a newline at the end which will cause the match to fail,
                # so we just 'strip()' the '\n'
                # Also, add 'lower()' to ensure matching
                if (extension in self.args.exts):
                    fullname = os.path.join(root, name)
                    try:
                        self.convert_file(fullname)
                    except IOError:
                        log.error("Unable to read or write the file: %s. Please check the file's permission.", fullname)
                    except KeyboardInterrupt:
                        log.warning("Interrupted by keyboard (e.g. Ctrl+C)")
                        exit()
                    # else:
                    #     log.error("Unable to process the file: %s. Please check.", fullname)
                    #     traceback.print_stack()

    def convert_file(self, filename):
        with open(filename, 'rb') as f:  # read under the binary mode
            bytedata = f.read()

        if len(bytedata) == 0:
            log.info("Skipped empty file %s", filename)
            return

        chr_res = chardet.detect(bytedata)
        if chr_res['encoding'] is None or chr_res['confidence'] < DEFAULT_CONF['confi_thres']:
            log.warning("Ignoring %s, since its encoding could not be detected.", filename)
            return

        src_enc = chr_res['encoding'].lower()
        log.debug("Scanned %s, whose encoding is %s ", filename, src_enc)

        if (src_enc == 'ascii'):
            log.info("Skipped %s, whose encoding is %s", filename, src_enc)
            return

        if (not self.args.convert_utf) and src_enc.startswith('utf'):
            log.info("Skipped %s, whose encoding is %s", filename, src_enc)
            return

        # Since chardet reports all GB-based encodings as 'gb2312', the decoding will fail
        # when the text file contains certain special characters. To make it more
        # special-character-tolerant, we upgrade the source encoding to 'gb18030',
        # which is a character set larger than gb2312.
        if src_enc == 'gb2312':
            src_enc = 'gb18030'

        try:
            strdata = bytedata.decode(src_enc)
        except UnicodeDecodeError as e:
            log.error("Unicode error for file %s", filename)
            print(e)
            return

        # preserving file time information (modification time and access time)
        src_stat = os.stat(filename)

        # if the 'overwrite' flag is 'False', we would make a backup of the original text file.
        if not self.args.overwrite:
            backup_name = filename + '.' + str(int(round(time.time() * 1000))) + '.bak'
            log.info("Renaming %s to %s", filename, backup_name)
            os.rename(filename, backup_name)

        tgt_enc = self.args.target_encoding
        log.debug("Writing the file: %s in %s", filename, tgt_enc)
        with open(filename, 'wb') as f:  # write under the binary mode
            f.write(strdata.encode(tgt_enc))
        log.info("Converted the file: %s from %s to %s", filename, src_enc, tgt_enc)

        # setting the new file's time to the old file
        # (os.utime expects (atime, mtime), so pass st_mtime rather than st_ctime)
        os.utime(filename, times=(src_stat.st_atime, src_stat.st_mtime))
    # end of def convert_file(self, filename)

    def run(self):
        root = self.args.root
        if not os.path.exists(root):
            log.error("The file specified %s is neither a directory nor a regular file", root)
            return

        log.info("Start working now!")

        if os.path.isdir(root):
            log.info("The root is: %s. ", root)
            log.info("Files with these extension names will be inspected: %s", self.args.exts)
            self.walk_dir(root)
        else:
            log.info("Wow, only a single file will be processed: %s", root)
            self.convert_file(root)

        log.info("Finished all.")
    # end of def run(self)


def clean_backups(dirname):
    if not os.path.isdir(dirname):
        log.error("The file specified %s is not a directory ", dirname)
        return

    now = time.time()
    last40min = now - 60 * 40

    log.info("Removing all newly-created .bak files under %s", dirname)

    for root, dirs, files in os.walk(dirname):
        for name in files:
            extension = os.path.splitext(name)[1][1:]
            if extension == 'bak':
                fullname = os.path.join(root, name)
                ctime = os.path.getctime(fullname)
                if ctime > last40min:
                    os.remove(fullname)
                    log.info("Removed the file: %s", fullname)


def cli():
    parser = argparse.ArgumentParser(
        prog='cvt2utf8',
        description="A tool that converts non-UTF-encoded text files into UTF-8 encoded files.",
        epilog="You can use this tool to remove BOM from .php source code files, or convert other encodings into UTF-8")

    parser.add_argument(
        'root',
        metavar = "filename",
        help    = textwrap.dedent('''\
            the path pointing to the file or directory.
            If it's a directory, files contained in it with specified extensions will be converted to UTF-8.
            Otherwise, if it's a file, only that file will be converted to UTF-8.'''
            )
        )

    parser.add_argument(
        '-e',
        '--exts',
        nargs   = '+',  # '+'. Just like '*', all command-line args present are gathered into a list.
        default = DEFAULT_CONF['exts'],
        help    = "the list of file extensions. Only those files ending with extensions in this list will be converted.",
        )

    parser.add_argument(
        '-o',
        '--overwrite',
        action  = 'store_true',
        default = DEFAULT_CONF['overwrite'],
        help    = "Danger! If you turn this switch on, it would directly overwrite existing file without creating any backups.",
        )

    parser.add_argument(
        '-u',
        '--cvtutf',
        action  = 'store_true',
        dest    = 'convert_utf',
        default = DEFAULT_CONF['convert_UTF'],
        help    = "By default, we will skip files whose encodings are UTF (including UTF-8 and UTF-16), and BOM headers in these files will remain unchanged. "
                  "But, if you want to change BOM headers for these files, you could utilize this option to change their signatures.",
        )

    parser.add_argument(
        '-b',
        '--addbom',
        action  = 'store_true',
        dest    = 'add_bom',
        default = DEFAULT_CONF['add_BOM'],
        help    = "If this command line argument is missing, we convert files to UTF-8 without BOM (i.e. the target encoding would be just 'utf-8'). "
                  "But with this flag, we would add BOM in encoded text files (i.e. the target encoding would be 'utf-8-sig').",
        )

    parser.add_argument(
        '-c',
        '--cleanbak',
        action  = 'store_true',
        dest    = 'clean_bak',
        default = False,
        help    = textwrap.dedent('''
            Clean all .bak files generated within last 40 minutes.
            When enabled, no files will be converted to UTF-8. Use this flag with extra caution! '''),
        )

    args = parser.parse_args()

    if args.clean_bak:
        clean_backups(args.root)
    else:
        args.target_encoding = 'utf-8-sig' if args.add_bom else 'utf-8'
        cvt2utf8 = Convert2Utf8(args)
        cvt2utf8.run()


if __name__ == '__main__':
    cli()
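The script upgrades a detected 'gb2312' to 'gb18030' before decoding. The short sketch below illustrates why, using only the standard library. The helper names decodes_with and to_utf8 are hypothetical and are not part of the script above. The sample character "镕" is assumed to be a typical case of a character that exists in GBK and GB18030 but not in the older GB2312 set.

```python
def decodes_with(data: bytes, encoding: str) -> bool:
    """Return True if `data` decodes without error using `encoding`."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


def to_utf8(data: bytes, src_enc: str = "gb18030") -> bytes:
    """Decode bytes from a GB-family encoding and re-encode them as UTF-8."""
    return data.decode(src_enc).encode("utf-8")


# Bytes as they would appear in a GBK-encoded file (illustrative sample text).
sample = "配置镕".encode("gbk")

print(decodes_with(sample, "gb2312"))    # strict GB2312 cannot decode "镕"
print(decodes_with(sample, "gb18030"))   # the larger GB18030 set can
print(to_utf8(sample).decode("utf-8"))   # round-trips to the original text
```

Because gb18030 covers everything gb2312 and GBK can encode, decoding with it costs nothing for plain GB2312 files and avoids failures on files that merely look like GB2312 to the detector.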

Run (pass the file or folder to convert as the argument): python gbk2utf.py <path>

Result:
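One way to confirm that a conversion worked is to check that the file's bytes now decode as UTF-8, and whether they start with a BOM (which corresponds to the script's --addbom option). The following is a minimal sketch using only the standard library; describe_utf8 and check_file are hypothetical helper names, not part of the script.

```python
import codecs
from typing import Optional


def describe_utf8(data: bytes) -> Optional[str]:
    """Return 'utf-8-sig' or 'utf-8' if `data` is valid UTF-8, otherwise None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return None
    # A UTF-8 BOM is the three bytes EF BB BF at the very start of the data.
    return "utf-8-sig" if data.startswith(codecs.BOM_UTF8) else "utf-8"


def check_file(path: str) -> Optional[str]:
    """Read a file in binary mode and describe its UTF-8 status."""
    with open(path, "rb") as f:
        return describe_utf8(f.read())
```

Calling check_file on a converted .cfg file should return "utf-8" (or "utf-8-sig" if a BOM was requested), while a file still in GBK that contains Chinese text would normally return None. Note that a pure-ASCII file is also valid UTF-8, which is why the script skips such files.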

