Original article: https://blog.csdn.net/W_Honor/article/details/105037033?utm_medium=distribute.pc_relevant.none-task-blog-BlogCommendFromMachineLearnPai2-1.channel_param&depth_1-utm_source=distribute.pc_relevant.none-task-blog-BlogCommendFromMachineLearnPai2-1.channel_param

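gensim is one of the best-known Python libraries in NLP. It includes the widely used word2vec model, and when loading such a model to produce embeddings you may run into the following encoding error: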

UnicodeDecodeError: 'utf-8' codec can't decode bytes in position

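This error means that decoding failed while the model was being loaded: the bytes at some position are not valid UTF-8 and cannot be decoded, and the message reports that position.
The same problem has been reported on gensim's official GitHub, where the authors give one answer covering the fix: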

Answer: The strings (words) stored in your model are not valid utf8. By default, gensim decodes the words using the strict encoding settings, which results in the above exception whenever an invalid utf8 sequence is encountered.
The fix is on your side and it is to either:
a) Store your model using a program that understands unicode and utf8 (such as gensim). Some C and Java word2vec tools are known to truncate the strings at byte boundaries, which can result in cutting a multi-byte utf8 character in half, making it non-valid utf8, leading to this error.
b) Set the unicode_errors flag when running load_word2vec_model, e.g. load_word2vec_model(…, unicode_errors='ignore'). Note that this silences the error, but the utf8 problem is still there – invalid utf8 characters will just be ignored in this case.

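The first option is to re-save the model with a tool that handles Unicode and UTF-8 correctly, such as gensim itself. The C and Java word2vec tools are named as a likely cause of the problem, not as the cure: some of them truncate strings at byte boundaries and leave half of a multi-byte character behind.
The second option is to pass unicode_errors='ignore' when loading the model. The quoted answer calls the function load_word2vec_model; in gensim the loader is KeyedVectors.load_word2vec_format, which is what the code below uses. This suppresses the error, but the invalid bytes are simply dropped from the affected words.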
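To make the two options concrete, here is a minimal sketch using only Python's built-in bytes.decode. The sample word and the cut-off point are made up for illustration, and the helper try_decode is hypothetical. It assumes that gensim's unicode_errors value plays the same role as the errors argument of decode, which is how the quoted answer describes it.

```python
# A sample word whose UTF-8 encoding uses 3 bytes per character (illustrative only).
word = "词向量"
raw = word.encode("utf-8")  # 9 bytes in total

# Simulate a tool that cuts the string at a byte boundary,
# leaving only 2 of the 3 bytes of the last character.
truncated = raw[:8]


def try_decode(data, errors):
    """Decode bytes as UTF-8 with the given error policy; return the text or the exception."""
    try:
        return data.decode("utf-8", errors=errors)
    except UnicodeDecodeError as exc:
        return exc


strict_result = try_decode(truncated, "strict")    # the default policy: raises
ignore_result = try_decode(truncated, "ignore")    # drops the incomplete character
replace_result = try_decode(truncated, "replace")  # marks the damage with U+FFFD

print(repr(strict_result))
print(repr(ignore_result))
print(repr(replace_result))
```

With "ignore" the word silently loses its last character, so it no longer matches the word you would look up later. That is what "the utf8 problem is still there" means in practice.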

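import gensim
import numpy as np


def get_word_vectors(model_path, dim, words):
    # Load binary word2vec vectors, ignoring bytes that are not valid UTF-8.
    wordVec = gensim.models.KeyedVectors.load_word2vec_format(
        model_path, binary=True, unicode_errors='ignore')
    wordEmbedding = []
    for word in words:
        if word == "<PAD>":
            # The padding token maps to the zero vector.
            wordEmbedding.append(np.zeros(dim))
            continue
        try:
            # KeyedVectors supports direct lookup by word.
            wordEmbedding.append(wordVec[word])
        except KeyError:
            # Unknown word: fall back to a random vector.
            print(word + " is not in the word vectors")
            wordEmbedding.append(np.random.randn(dim))
    return np.array(wordEmbedding)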
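Because unicode_errors='ignore' only hides the damage, it can be useful to find out which vocabulary entries are affected. The sketch below scans a word2vec binary stream and reports the words that are not valid UTF-8. It assumes the usual word2vec binary layout (a text header "vocab_size dim", then each word followed by a space and dim float32 values, optionally followed by a newline). The function names find_invalid_words and build_sample are hypothetical, and the sample data is built in memory for illustration only.

```python
import io
import struct


def find_invalid_words(stream):
    """Return (index, raw_bytes) for every vocabulary word that is not valid UTF-8."""
    header = stream.readline().split()
    vocab_size, dim = int(header[0]), int(header[1])
    vector_bytes = dim * 4  # assumption: vectors are stored as float32
    bad = []
    for index in range(vocab_size):
        chars = []
        while True:
            ch = stream.read(1)
            if ch == b"":
                raise EOFError("unexpected end of data while reading a word")
            if ch == b" ":
                break  # a space terminates the word
            if ch != b"\n":  # skip the newline some writers put after each vector
                chars.append(ch)
        raw = b"".join(chars)
        try:
            raw.decode("utf-8")
        except UnicodeDecodeError:
            bad.append((index, raw))
        stream.read(vector_bytes)  # skip over the vector itself
    return bad


def build_sample():
    """Build a tiny in-memory model with one valid word and one truncated word."""
    good = "词".encode("utf-8")
    broken = "向量".encode("utf-8")[:-1]  # last character cut in the middle
    buf = io.BytesIO()
    buf.write(b"2 3\n")  # header: 2 words, 3 dimensions
    for raw in (good, broken):
        buf.write(raw + b" ")
        buf.write(struct.pack("<3f", 0.1, 0.2, 0.3))
        buf.write(b"\n")
    buf.seek(0)
    return buf


bad_words = find_invalid_words(build_sample())
print(bad_words)
```

For a real model you would pass a file opened in binary mode, for example open(model_path, "rb"), in place of the in-memory sample. Reading one byte at a time is slow on large vocabularies, so treat this as a diagnostic sketch and not as a production loader.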

