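1. Background
    Today I was looking for a Japanese tokenizer and came across MeCab. I found some sample code online, but running it produced one error after another.
    The packages and tools I installed along the way were: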
pip install mecab-python-windows
pip install mecab-python3
pip install mecab
pip install whoosh
Microsoft Visual C++ Build Tools: https://visualstudio.microsoft.com/downloads/
pip install tiny_tokenizer[all]
pip install SudachiPy
pip install https://object-storage.tyo2.conoha.io/v1/nc_2520839e1f9641b08211a5c85243124a/sudachi/SudachiDict_core-20191030.tar.gz
sudachipy link -t core
pip install -U pytest
2. Error message
    return self.__parse_tostr(text, **kwargs)
  File "C:\Users\lixianwei\venv\lib\site-packages\natto\mecab.py", line 318, in __parse_tostr
    return self.__bytes2str(raw).strip()
  File "C:\Users\lixianwei\venv\lib\site-packages\natto\support.py", line 26, in bytes2str
    return b.decode(py3enc)
UnicodeDecodeError: 'shift_jis' codec can't decode byte 0x93 in position 4: illegal multibyte sequence
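
The traceback ends in `b.decode(py3enc)`, so the failure is a plain bytes-to-str decode with the wrong codec. The following is a minimal sketch of that kind of mismatch, not a reproduction of natto-py's internals: the helper `try_decode` and the sample text are hypothetical, and the sample only stands in for output produced with a UTF-8 dictionary.

```python
def try_decode(raw, charset):
    """Return the decoded text, or None if `raw` is not valid in `charset`."""
    try:
        return raw.decode(charset)
    except UnicodeDecodeError as exc:
        # Same exception type as in the traceback above.
        print(f"{charset}: {exc}")
        return None


# Hypothetical sample standing in for MeCab output from a UTF-8 dictionary.
raw = "日本語のテキスト".encode("utf-8")

print(try_decode(raw, "utf-8"))      # decodes back to the original text
print(try_decode(raw, "shift_jis"))  # wrong codec: expect an error or mojibake
```

The bytes themselves are fine; only the codec chosen for decoding is wrong, which is why the fix is a configuration change rather than a change to the input text.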
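3. Problem analysis
    This is clearly an encoding problem, but after a long search I still could not find where the encoding is set. The Python file already had the usual utf-8 coding declaration at the top, so that is evidently not the value being used here. Tracing through the code, I found that the encoding is taken from the system. I then found the following passage: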
If natto-py for some reason cannot locate the mecab library, or if it cannot determine the correct charset used internally by mecab, then you will need to set the MECAB_PATH and MECAB_CHARSET environment variables.
Set the MECAB_PATH environment variable to the exact name/path to your mecab library.
Set the MECAB_CHARSET environment variable to the charset character encoding used by your system dictionary.
e.g., for Mac OS:
export MECAB_PATH=/usr/local/Cellar/mecab/0.996/lib/libmecab.dylib
export MECAB_CHARSET=utf8
e.g., for bash on UNIX/Linux:
export MECAB_PATH=/usr/local/lib/libmecab.so
export MECAB_CHARSET=euc-jp
e.g., on Windows:
set MECAB_PATH=C:\Program Files\MeCab\bin\libmecab.dll
set MECAB_CHARSET=shift-jis
e.g., from within a Python program:
import os
os.environ['MECAB_PATH']='/usr/local/lib/libmecab.so'
os.environ['MECAB_CHARSET']='utf-16'

Source: https://github.com/buruzaemon/natto-py
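
Before changing anything, it can help to see what the current process actually reports. The sketch below prints Python's default string encoding, the locale's preferred encoding, and the two environment variables named in the quoted README. The function name `encoding_report` is made up for illustration, and whether natto-py consults exactly these values internally is an assumption that is not verified here.

```python
import locale
import os
import sys


def encoding_report():
    """Collect the encoding-related settings visible to this process."""
    return {
        # Always 'utf-8' on Python 3; unrelated to the system locale.
        "sys.getdefaultencoding": sys.getdefaultencoding(),
        # Locale-dependent; often a legacy code page on Windows.
        "locale.getpreferredencoding": locale.getpreferredencoding(False),
        # None when the variable is not set.
        "MECAB_PATH": os.environ.get("MECAB_PATH"),
        "MECAB_CHARSET": os.environ.get("MECAB_CHARSET"),
    }


for name, value in encoding_report().items():
    print(f"{name}: {value}")
```

A coding declaration at the top of a source file affects only how that file is read, so it does not change any of the values printed here.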
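4. Solution
Find where MeCab is installed on your machine, then add the following code to your Python script, with MECAB_CHARSET set to the encoding of your MeCab dictionary (utf-8 in my case):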

import os

os.environ['MECAB_PATH']='C:\\Program Files (x86)\\MeCab\\bin\\libmecab.dll'
os.environ['MECAB_CHARSET']='utf-8'
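
Put together, the fix might look like the sketch below. The helpers `configure_mecab` and `tokenize` are hypothetical names introduced for illustration. The `MeCab()` context manager and `parse()` call follow the natto-py README, and the sketch assumes natto-py reads the two environment variables when the `MeCab` object is created, so they are set first.

```python
import os


def configure_mecab(lib_path, charset="utf-8"):
    """Point natto-py at the MeCab library and the dictionary's charset."""
    os.environ["MECAB_PATH"] = lib_path
    os.environ["MECAB_CHARSET"] = charset


def tokenize(text):
    """Parse `text` with MeCab; call configure_mecab() beforehand."""
    # Imported lazily so the environment is already set when natto loads.
    from natto import MeCab

    with MeCab() as nm:
        return nm.parse(text)
```

With the install location used in this post, usage would be `configure_mecab(r"C:\Program Files (x86)\MeCab\bin\libmecab.dll")` followed by `print(tokenize("..."))` with your own Japanese text. Adjust the path to wherever MeCab is installed on your machine.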
