Text Classification with FastText (Java Version)

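Text classification, also called automatic text classification, is the process by which a computer maps a text carrying information to one or more predefined categories or topics. The algorithmic model that carries out this process is called a classifier. Ha, I admit that sentence is borrowed from an expert's article (this is the original), which covers the history of text classification and a number of classification algorithms. Have a look if you are interested.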

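What I mainly cover here is how to use FastText for text classification. If you want to understand the underlying principles, I suggest reading this article, whose author explains them in great detail. Respect to the author.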

Dependencies

<dependency>
    <groupId>com.github.sszuev</groupId>
    <artifactId>fasttext</artifactId>
    <version>1.0.0</version>
</dependency>

If you want the jar itself, search for FastText on the Maven Repository site, where jars for every version are available.

Training the Model

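The code below trains the model. The trained model is saved to the path you specify. You only need to train once; after that the saved model can be used to predict.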

try {
    Main.train(new String[]{
            "supervised",
            "-input", "./parameter/category_train.txt",   // labeled training data
            "-output", "./parameter/commodity_model",     // path and name prefix for the trained model
            "-dim", "64",            // word vector dimension
            "-lr", "0.5",            // learning rate
            "-wordNgrams", "2",      // max length of word n-grams
            "-minCount", "1",        // minimum word frequency
            "-bucket", "10000000"    // number of hash buckets
    });
} catch (Exception e) {
    e.printStackTrace();
}

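So what does a labeled training set look like? See below. This is a sample training set found on the official site. Every label starts with __label__. The prefix is configurable, but this is the default. The data below is all English; for Chinese text, consider word segmentation and stop-word filtering first.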

__label__sauce __label__cheese How much does potato starch affect a cheese sauce recipe?
__label__food-safety __label__acidity Dangerous pathogens capable of growing in acidic environments
__label__cast-iron __label__stove How do I cover up the white spots on my cast iron stove?
__label__restaurant Michelin Three Star Restaurant; but if the chef is not there
__label__knife-skills __label__dicing Without knife skills, how can I quickly and accurately dice vegetables?
__label__storage-method __label__equipment __label__bread What's the purpose of a bread box?
__label__baking __label__food-safety __label__substitutions __label__peanuts how to seperate peanut oil from roasted peanuts at home?
__label__chocolate American equivalent for British chocolate terms
__label__baking __label__oven __label__convection Fan bake vs bake
__label__sauce __label__storage-lifetime __label__acidity __label__mayonnaise Regulation and balancing of readymade packed mayonnaise and other sauces
__label__tea What kind of tea do you boil for 45minutes?
__label__baking __label__baking-powder __label__baking-soda __label__leavening How long can batter sit before chemical leaveners lose their power?
__label__food-safety __label__soup Can I RE-freeze chicken soup after it has thawed?
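As a rough illustration of the format above, here is a minimal sketch that turns one labeled record into a training line and drops stop words along the way. The class name TrainingLineFormatter, the method format, and the stop-word set are all hypothetical and not part of any fastText library. The sketch also assumes the text has already been segmented into tokens (for Chinese, by whatever segmenter you choose).

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical helper for illustration only; not part of the fastText library.
final class TrainingLineFormatter {
    // Default label prefix, as described in the article.
    static final String LABEL_PREFIX = "__label__";

    // Builds one training line: prefixed labels first, then the remaining tokens.
    // Assumes `tokens` were already produced by a word segmenter.
    static String format(List<String> labels, List<String> tokens, Set<String> stopWords) {
        String prefixed = labels.stream()
                .map(label -> LABEL_PREFIX + label)
                .collect(Collectors.joining(" "));
        String text = tokens.stream()
                .map(String::trim)
                .filter(token -> !token.isEmpty() && !stopWords.contains(token))
                .collect(Collectors.joining(" "));
        return prefixed + " " + text;
    }

    public static void main(String[] args) {
        Set<String> stopWords = new HashSet<>(Arrays.asList("vs"));
        String line = format(
                Arrays.asList("baking", "oven"),
                Arrays.asList("Fan", "bake", "vs", "bake"),
                stopWords);
        System.out.println(line);
    }
}
```

Writing one such line per record to a text file should give input in the same shape as the sample data.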

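Training accepts many more parameters, listed below. There are a lot of them, so if you want to understand what each one means, it is best to spend some time reading the official documentation.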

The following arguments are mandatory:
  -input        training file path
  -output       output file path

The following arguments are optional:
  -lr           learning rate [0.05]
  -lrUpdateRate change the rate of updates for the learning rate [100]
  -dim          size of word vectors [100]  // sets the word vector dimension
  -ws           size of the context window [5]
  -epoch        number of epochs [5]
  -minCount     minimal number of word occurences [1]  // minimum word frequency
  -neg          number of negatives sampled [5]
  -wordNgrams   max length of word ngram [1]
  -loss         loss function {ns, hs, softmax} [ns]
  -bucket       number of buckets [2000000]
  -minn         min length of char ngram [3]
  -maxn         max length of char ngram [6]
  -thread       number of threads [12]
  -t            sampling threshold [0.0001]
  -label        labels prefix [__label__]  // see, this is where you can set your own label prefix
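Since the options are passed as a flat string array, one way to keep them readable is to collect them in a map and flatten it just before training. The sketch below is only an illustration; TrainArgs and build are hypothetical names, and the option keys are taken from the list above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; not part of the fastText library.
final class TrainArgs {
    // Flattens the mandatory arguments and a map of optional ones into
    // the "-name value" array form used by the training call.
    static String[] build(String command, String input, String output, Map<String, String> options) {
        List<String> args = new ArrayList<>();
        args.add(command);
        args.add("-input");
        args.add(input);
        args.add("-output");
        args.add(output);
        for (Map.Entry<String, String> option : options.entrySet()) {
            args.add("-" + option.getKey());
            args.add(option.getValue());
        }
        return args.toArray(new String[0]);
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the options in insertion order.
        Map<String, String> options = new LinkedHashMap<>();
        options.put("dim", "64");
        options.put("lr", "0.5");
        options.put("wordNgrams", "2");
        String[] trainArgs = build("supervised",
                "./parameter/category_train.txt", "./parameter/commodity_model", options);
        System.out.println(Arrays.toString(trainArgs));
    }
}
```

The resulting array has the same shape as the one written by hand in the training code.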

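The trained model is written to the file commodity_model.bin, and a word vector file, commodity_model.vec, is generated alongside it. The first line of the word vector file gives the vocabulary size and the vector dimension; each following line is a token followed by its vector components. Its contents look like this.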

37899 100
</s> -0.55874 -0.10935 0.17115 0.2388 0.2805 0.11883 0.10674 -0.13214 0.32224 -0.54488 0.4239 -0.51117 -0.56261 -0.23805 0.68661 0.35541 -0.77821 -0.53468 0.60013 0.063523 1.1078 -0.31767 -0.38917 -0.85577 -0.35906 0.23197 0.99904 0.26574 -0.27303 0.0091292 0.62824 0.35154 -0.18146 0.62103 -0.65914 0.99872 0.16585 0.2622 -0.1481 -0.22537 -0.27048 0.075146 -0.15598 -0.28847 0.16145 0.10381 0.32652 -0.45171 -0.26597 -0.061287 0.01858 0.50429 -0.17517 0.21205 -0.023571 0.37332 -0.41411 -0.34945 0.28114 -0.0046294 0.39406 0.39902 -0.25752 -0.052666 0.27889 -0.53025 -0.13618 1.1267 0.032445 -0.62555 0.45881 0.24994 -0.12783 1.0336 -0.56122 0.36742 0.26783 -0.064086 0.56198 -0.054111 0.27858 0.43912 -0.21053 0.20468 0.29792 0.16496 0.13347 0.85231 -0.048318 -0.86905 -0.57763 -0.26486 0.58158 0.49263 -0.14475 0.27656 0.29959 -0.37355 -0.24024 -0.83969
新款 -0.36792 -0.052885 0.2175 0.25894 0.14599 0.67911 0.0028615 0.13432 0.26988 0.44109 -0.047146 0.30435 -0.0052764 0.086225 -0.024577 -0.19852 -0.13228 0.27177 -0.26321 0.53231 0.030532 -0.12034 -0.21005 -0.035567 -0.09993 -0.21439 0.30124 -0.081924 0.219 -0.27545 -0.1321 -0.19909 0.49169 -0.35514 0.010071 0.32131 0.13274 0.0017961 -0.25752 0.15799 -0.15891 -0.19768 0.11458 -0.071166 -0.0049989 -0.10033 -0.089192 0.0051551 0.1948 -0.32105 -0.12673 -0.021479 0.2035 -0.3036 0.042287 -0.18418 -0.16937 0.0034305 -0.00054013 -0.20262 0.21633 0.55448 0.5047 -0.30521 -0.33969 0.23641 -0.13683 -0.039237 -0.038262 0.30688 -0.42023 0.17422 -0.36334 -0.027693 0.13593 0.055707 -0.45232 0.21901 0.0038972 0.073224 -0.24337 -0.17771 0.36014 0.0093526 -0.12155 -0.20782 -0.10076 0.0017971 -0.24683 -0.35155 -0.33855 -0.032884 -0.10803 -0.11618 0.06904 0.36682 0.30256 -0.11811 -0.2719 -0.3967
款 0.10719 -0.047097 0.035313 -0.11604 -0.11128 0.098889 0.07338 -0.062592 -0.36591 0.33696 -0.034139 0.41233 -0.044029 0.26736 -0.35058 0.0011242 0.012222 0.28305 -0.42311 0.49023 -0.063058 0.063913 0.065111 -0.073718 0.0074959 -0.36847 -0.041656 -0.09248 0.11114 -0.20892 -0.16453 -0.47284 0.23434 -0.26671 -0.22557 0.31284 0.21075 0.018571 0.10862 0.093459 -0.40156 0.10052 0.15448 0.38009 0.057931 0.10816 -0.456 -0.088661 0.25591 -0.25163 -0.44263 -0.44058 -0.1008 -0.38986 0.14394 -0.45649 0.031487 0.030237 -0.10706 -0.16413 0.43202 0.39386 0.21072 0.26409 0.43194 0.21329 0.10789 0.1402 0.23095 0.044306 -0.19572 -0.022013 -0.080285 -0.13523 0.076271 -0.073187 -0.044929 -0.15822 0.15628 -0.32383 -0.35998 0.16517 -0.20735 -0.066001 -0.012753 0.08091 -0.057798 -0.14446 -0.10688 -0.19407 -0.38554 0.12981 0.28439 -0.030242 0.16122 0.47328 -0.30161 -0.27374 -0.3731 -0.051387
斤 -0.18143 -0.40266 0.036 0.009181 0.12026 -0.29335 -0.15063 -0.078739 -0.1841 0.038445 0.11106 -0.32147 0.1611 0.38768 -0.14828 -0.30435 0.0057699 -0.1971 0.14482 -0.20802 0.71822 0.062339 0.067823 0.23867 -0.013739 -0.20461 0.045837 -0.19937 0.0002158 0.020062 -0.051749 -0.38833 0.010522 -0.056716 0.14504 0.16327 -0.096777 0.079007 0.015782 -0.073867 0.14208 -0.30683 -0.37885 -0.0098635 -0.071231 -0.079818 0.36771 0.097449 -0.54126 0.081269 0.017566 -0.80512 -0.18466 -0.45232 0.11477 0.53177 -0.36932 -0.46266 0.22826 0.20209 0.20264 0.22821 -0.1284 0.30707 0.33509 0.25278 0.18359 -0.122 -0.16871 0.078083 0.03181 0.043715 -0.027783 0.18351 0.22478 0.12866 0.54027 0.033152 0.14358 -0.08765 0.82464 0.19282 -0.64852 -0.045806 -0.12577 0.46663 0.24793 0.40814 0.10723 -0.050818 0.39453 -0.16839 -0.096019 -0.30689 0.3576 0.078405 -0.070142 -0.10879 0.08264 0.23929
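To make the layout of the .vec file concrete, here is a minimal sketch that parses lines in that format and compares two tokens by cosine similarity. VecFile, parse, and cosine are hypothetical names, and the tiny three-dimensional sample is made up for illustration. For a real file the lines could come from something like Files.readAllLines.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; not part of the fastText library.
final class VecFile {
    // Parses the text format: a header "<wordCount> <dim>", then "<token> v1 ... vdim" per line.
    static Map<String, float[]> parse(List<String> lines) {
        Map<String, float[]> vectors = new LinkedHashMap<>();
        if (lines.isEmpty()) {
            return vectors;
        }
        int dim = Integer.parseInt(lines.get(0).trim().split("\\s+")[1]);
        for (String line : lines.subList(1, lines.size())) {
            String[] parts = line.trim().split("\\s+");
            if (parts.length != dim + 1) {
                continue; // skip blank or malformed lines
            }
            float[] vector = new float[dim];
            for (int i = 0; i < dim; i++) {
                vector[i] = Float.parseFloat(parts[i + 1]);
            }
            vectors.put(parts[0], vector);
        }
        return vectors;
    }

    // Cosine similarity between two vectors of equal length.
    static double cosine(float[] a, float[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        // Made-up sample with 2 tokens of dimension 3.
        List<String> sample = Arrays.asList("2 3", "tokenA 1.0 0.0 1.0", "tokenB 1.0 0.0 0.0");
        Map<String, float[]> vectors = parse(sample);
        System.out.println(cosine(vectors.get("tokenA"), vectors.get("tokenB")));
    }
}
```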

Prediction

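With the model ready, we can start predicting. I took 25,253 records that had already been labeled and used them to check the model's accuracy.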

// Load the model
FastText fastText = FastText.load("./parameter/commodity_model.bin");
// Predict
Map<String, Float> map = fastText.predictLine("some text to classify", 1);

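The second argument, 1, means only one label is returned, so pass however many labels you want. The returned map holds each predicted label and its probability share. The result of the run is shown in the figure.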
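When more than one label is requested, you usually still want the most likely one. The sketch below picks it from a Map<String, Float> of the kind returned by the prediction call. TopLabel and bestLabel are hypothetical names, and whether the map keys carry the __label__ prefix is an assumption, so the code handles both cases.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical helper for illustration only; not part of the fastText library.
final class TopLabel {
    // Returns the label with the highest score, with the prefix stripped if present.
    // An empty map yields Optional.empty().
    static Optional<String> bestLabel(Map<String, Float> predictions, String prefix) {
        return predictions.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey)
                .map(label -> label.startsWith(prefix) ? label.substring(prefix.length()) : label);
    }

    public static void main(String[] args) {
        // Made-up scores standing in for a real prediction result.
        Map<String, Float> predictions = new HashMap<>();
        predictions.put("__label__tea", 0.2f);
        predictions.put("__label__baking", 0.7f);
        System.out.println(bestLabel(predictions, "__label__").orElse("(no prediction)"));
    }
}
```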

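The run reports an accuracy of 86.0%. Looking through the records where the prediction differs from the assigned label, at least 700 to 800 have a prediction that fits better than the original label, and another 500 to 600 have a prediction and a label that both fit. Together that is roughly 1,200 to 1,400 of the 25,253 records, or about 5 percentage points, so the real accuracy should be above 90%. That basically meets the bar for production, so the model can be used.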
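The accuracy check itself is not shown in the article, so here is a minimal sketch of how such a check could look. AccuracyCheck and accuracy are hypothetical names, and the record layout (expected label, a tab, then the text) is an assumption. The predictor is passed in as a function so that the sketch does not depend on the library; in practice it could wrap the predictLine call shown earlier.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Hypothetical helper for illustration only; not part of the fastText library.
final class AccuracyCheck {
    // Each record is assumed to be "<expectedLabel>\t<text>".
    // `predictor` returns the top-1 label for a text.
    static double accuracy(List<String> records, Function<String, String> predictor) {
        int total = 0;
        int correct = 0;
        for (String record : records) {
            int tab = record.indexOf('\t');
            if (tab < 0) {
                continue; // skip records without a label
            }
            String expected = record.substring(0, tab);
            String text = record.substring(tab + 1);
            total++;
            if (expected.equals(predictor.apply(text))) {
                correct++;
            }
        }
        return total == 0 ? 0.0 : (double) correct / total;
    }

    public static void main(String[] args) {
        // Made-up records and a stub predictor standing in for the trained model.
        List<String> records = Arrays.asList("tea\tgreen tea", "baking\tbread", "tea\tcake");
        Function<String, String> stub = text -> text.contains("tea") ? "tea" : "baking";
        System.out.printf("accuracy: %.3f%n", accuracy(records, stub));
    }
}
```

Collecting the mismatched records instead of only counting them would support the kind of manual review described above.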
