

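The word2vec project home page is https://code.google.com/p/word2vec/. The documentation is fairly thorough and it is easy to get started. Depending on your system and gcc version, you may need to tweak the code and the makefile slightly. On my Mac, every #include <malloc.h> in the source had to be changed to #include <stdlib.h>, the -Ofast compile option in the makefile had to be changed to -O2, and the -march=native and -Wno-unused-result options were not recognized by the compiler, so I removed them. After that, run make and follow the project documentation to run the tools.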


 anarchism originated as a term of abuse first used against early working class radicals including the diggers of the english revolution and the sans culottes of the french revolution whilst the term is still used in a pejorative way to describe any act that used violent means to destroy the organization of society it has also been taken up as a positive label by self defined anarchists the word anarchism is derived from the greek without archons ruler chief king anarchism as a political philosophy is the belief that rulers are unnecessary and should be abolished although there... 


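For word segmentation we use the open-source ansj_seg project, a Java implementation of the algorithm from the Chinese Academy of Sciences' ICTCLAS. (The ICTCLAS download comes without source code, and its Linux 64-bit build fails with library link errors on a 64-bit Mac, presumably because it is incompatible; ICTCLAS does not officially provide a Mac 64-bit build.) The ansj_seg home page is https://github.com/ansjsun/ansj_seg. Run: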

git clone https://github.com/ansjsun/ansj_seg


error: RPC failed; result=22, HTTP code = 413 | 116 KiB/s
fatal: The remote end hung up unexpectedly
Writing objects: 100% (2504/2504), 449.61 MiB | 4.19 MiB/s, done.
Total 2504 (delta 1309), reused 2242 (delta 1216)
fatal: The remote end hung up unexpectedly


git config --global http.postBuffer 524288000



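At this point it turns out that ansj_seg is a Maven project. Compiling it directly with mvn compile downloads the required dependencies automatically, and the whole build finished without errors. From those dependencies, extract tree_split-1.0.1.jar, which the project uses, add it to the Eclipse project, and rebuild; the red error markers in Eclipse disappear.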


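1. An error reports that library.properties cannot be found. Copy library.properties.bak in the project root to library.properties and make sure it is on the Eclipse project's classpath; this resolves the problem.

2. When the dictionary is initialized, an error reports that nature/nature.map cannot be found (this is the part-of-speech mapping file; ansj_seg does part-of-speech tagging as well as segmentation). Running find . -iname nature.map shows that the file does exist, so it is enough to add the ansj_seg/src/main/resources directory to the Eclipse classpath.

3. Running the demo may fail with an OutOfMemory error, because loading the dictionary can exceed Eclipse's default JVM heap size. In Run As, set the VM arguments to -Xmx512M -Xms512M.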
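To check that the heap setting from item 3 actually reached the JVM, a small probe like the one below may help. It is only a sketch: the class name HeapCheck is hypothetical and not part of ansj_seg.

```java
// Hypothetical helper (not part of ansj_seg): reports the heap limit of the running JVM.
class HeapCheck {

    /** Returns the maximum heap size, in MiB, that this JVM will attempt to use. */
    static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Max heap: " + maxHeapMiB() + " MiB");
    }
}
```

Run it with the same VM arguments as the demo. With -Xmx512M the reported value should be around 512 MiB, although the JVM may report somewhat less.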



package org.ansj.demo;

import java.io.BufferedReader;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.PrintWriter;
import java.io.OutputStreamWriter;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

import love.cq.util.IOUtil;

import org.ansj.domain.Term;
import org.ansj.splitWord.analysis.ToAnalysis;

public class MyFileDemo {

    public static final String TAG_START_CONTENT = "<content>";
    public static final String TAG_END_CONTENT = "</content>";

    public static void main(String[] args) {
        String temp = null;

        BufferedReader reader = null;
        PrintWriter pw = null;
        try {
            reader = IOUtil.getReader("corpus.txt", "UTF-8");
            // Warm-up call so that dictionary loading is not counted in the timing below.
            ToAnalysis.parse("test 123 孙");
            pw = new PrintWriter(new OutputStreamWriter(
                    new FileOutputStream("resultbig.txt"), "UTF-8"), true);
            long start = System.currentTimeMillis();
            int allCount = 0;   // total number of characters processed
            int termcnt = 0;    // total number of terms written
            Set<String> set = new HashSet<String>();   // distinct terms
            while ((temp = reader.readLine()) != null) {
                temp = temp.trim();
                if (temp.startsWith(TAG_START_CONTENT)) {
                    int endIdx = temp.indexOf(TAG_END_CONTENT);
                    if (endIdx < 0) {
                        // No closing tag on this line: skip it instead of
                        // letting substring() throw.
                        continue;
                    }
                    String content = temp.substring(TAG_START_CONTENT.length(), endIdx);
                    if (content.length() > 0) {
                        allCount += content.length();
                        List<Term> result = ToAnalysis.parse(content);
                        // Write the terms of one document on one line, separated by spaces.
                        for (Term term : result) {
                            String item = term.getName().trim();
                            if (item.length() > 0) {
                                termcnt++;
                                pw.print(item + " ");
                                set.add(item);
                            }
                        }
                        pw.println();
                    }
                }
            }
            long end = System.currentTimeMillis();
            System.out.println("Total terms: " + termcnt + ", distinct words: " + set.size()
                    + ", characters: " + allCount
                    + ", characters per second: " + (allCount * 1000.0 / (end - start)));
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            if (null != reader) {
                try {
                    reader.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
            if (null != pw) {
                pw.close();
            }
        }
    }
}
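The tag handling in the program above can be tried out on its own, without ansj_seg on the classpath. The sketch below isolates it in a hypothetical ContentExtractor class (the class and method names are my own, not from ansj_seg) and returns null for lines that do not start with <content> or that lack the closing tag.

```java
// Hypothetical helper: extracts the text between <content> and </content> on one line.
class ContentExtractor {

    static final String TAG_START_CONTENT = "<content>";
    static final String TAG_END_CONTENT = "</content>";

    /**
     * Returns the text between the tags, or null when the line does not
     * start with the opening tag or has no closing tag after it.
     */
    static String extract(String line) {
        String trimmed = line.trim();
        if (!trimmed.startsWith(TAG_START_CONTENT)) {
            return null;
        }
        int close = trimmed.indexOf(TAG_END_CONTENT, TAG_START_CONTENT.length());
        if (close < 0) {
            return null;
        }
        return trimmed.substring(TAG_START_CONTENT.length(), close);
    }

    public static void main(String[] args) {
        System.out.println(extract("<content>some article text</content>"));
        System.out.println(extract("<url>not a content line</url>"));
    }
}
```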


./word2vec -train resultbig.txt -output vectors.bin -cbow 0 -size 200 -window 5 -negative 0 -hs 1 -sample 1e-3 -threads 12 -binary 1
./distance vectors.bin
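The distance tool ranks words by the cosine similarity between their vectors and the vector of the query word. Below is a minimal sketch of that measure in Java. It uses made-up three-dimensional vectors rather than real entries from vectors.bin, and the class name CosineDemo and the sample values are assumptions for illustration only.

```java
// Hypothetical illustration of cosine similarity; does not read vectors.bin.
class CosineDemo {

    /** Cosine similarity of two equal-length vectors; 0.0 if either has zero length. */
    static double cosine(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have the same dimension");
        }
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        // Toy vectors, invented for this example.
        float[] query = {0.2f, 0.7f, 0.1f};
        float[] candidate = {0.25f, 0.6f, 0.05f};
        System.out.println("cosine = " + cosine(query, candidate));
    }
}
```

Values near 1 mean the two vectors point in almost the same direction; values near 0 mean they are close to orthogonal.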



cat news_tensite_xml.dat | iconv -f gbk -t utf-8 -c | grep "<content>" | head -n 200000 > corpus.txt
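If iconv or grep is not available, the same preprocessing can be approximated in Java. The sketch below is an assumption about how one might do it, not part of the original workflow: it decodes the input as GBK while silently dropping bytes that cannot be decoded (similar in spirit to iconv -c), keeps lines containing <content>, stops after a maximum number of lines, and writes UTF-8. The class name CorpusFilter and its method signature are hypothetical.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical stand-in for: iconv -f gbk -t utf-8 -c | grep "<content>" | head -n N
class CorpusFilter {

    /** Copies up to maxLines lines containing marker from in to out (UTF-8); returns the count. */
    static int filter(Path in, Charset inCharset, Path out, String marker, int maxLines)
            throws IOException {
        // Ignore undecodable input instead of failing.
        CharsetDecoder decoder = inCharset.newDecoder()
                .onMalformedInput(CodingErrorAction.IGNORE)
                .onUnmappableCharacter(CodingErrorAction.IGNORE);
        int kept = 0;
        try (BufferedReader reader = new BufferedReader(
                     new InputStreamReader(Files.newInputStream(in), decoder));
             BufferedWriter writer = Files.newBufferedWriter(out, StandardCharsets.UTF_8)) {
            String line;
            while (kept < maxLines && (line = reader.readLine()) != null) {
                if (line.contains(marker)) {
                    writer.write(line);
                    writer.newLine();
                    kept++;
                }
            }
        }
        return kept;
    }

    public static void main(String[] args) throws IOException {
        int kept = filter(Paths.get("news_tensite_xml.dat"), Charset.forName("GBK"),
                Paths.get("corpus.txt"), "<content>", 200000);
        System.out.println("Wrote " + kept + " lines to corpus.txt");
    }
}
```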




