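A Dictionary-Based Reverse Maximum Matching Algorithm for Chinese Word Segmentation, with Better Handling of Mixed Chinese, English, and Numeric Text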

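This dictionary-based reverse maximum matching (RMM) algorithm segments Chinese text and also handles text that mixes Chinese, English letters, and digits. For example, it can produce tokens such as bb霜, 3室, 乐phone, touch4, mp3, and T恤. In practice it segments better than the forward maximum matching version.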
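The core idea can be illustrated with a minimal sketch. It is not the author's implementation, which grows each word one character at a time using suffix markers; it uses the textbook window-shrinking form of reverse maximum matching instead. The class name `RmmSketch`, the `segment` method, and the tiny in-memory dictionary are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical demo class; not part of the original post.
class RmmSketch {
    // Longest candidate we try to match, mirroring WORD_MAX_LENGTH in the post.
    static final int MAX_LEN = 9;

    // Scan from the end of the text. At each position take the longest
    // dictionary word that ends there, falling back to a single character.
    static List<String> segment(String text, Set<String> dict) {
        List<String> out = new ArrayList<>();
        int end = text.length();
        while (end > 0) {
            int start = Math.max(0, end - MAX_LEN);
            // Shrink the window from the left until it is a dictionary word
            // or only one character remains.
            while (start < end - 1 && !dict.contains(text.substring(start, end))) {
                start++;
            }
            out.add(text.substring(start, end));
            end = start;
        }
        Collections.reverse(out); // words were collected right to left
        return out;
    }

    public static void main(String[] args) {
        // Made-up dictionary for the demo.
        Set<String> dict = new HashSet<>(
                Arrays.asList("研究", "研究生", "生命", "起源", "出租", "3室"));
        System.out.println(segment("研究生命起源", dict));
        System.out.println(segment("出租3室", dict));
    }
}
```

With this dictionary, a forward longest-match scan would take 研究生 first and leave 命 stranded, whereas scanning from the right yields 研究 / 生命 / 起源.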

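import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
// Log and LogFactory are assumed to come from Apache Commons Logging
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

public class RMM
{
    private static final Log log = LogFactory.getLog(RMM.class);
    // Maps each entry to 1 (complete word) or 2 (only a suffix of a longer word)
    private static HashMap<String, Integer> dictionary = null;
    // Dictionary entries longer than this are ignored
    private static final int WORD_MAX_LENGTH = 9;

    static
    {
        loadDictionary();
    }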

    // Split sentences into words using reverse maximum matching
    public static ArrayList<Token> getToken(ArrayList<Sentence> list) throws IOException
    {
        Collections.reverse(list);
        ArrayList<Token> tokenlist = new ArrayList<Token>();
        for (Sentence sen : list)
        {
            StringBuffer word = new StringBuffer();
            int offset = sen.getStartOffset() + sen.getText().length;
            int bufferIndex = sen.getText().length - 1;
            char c;
            boolean b = false; // true once the current word is complete
            while (bufferIndex > -1)
            {
                offset--;
                c = sen.getText()[bufferIndex--];
                if (word.length() == 0)
                    word.append(c);
                else
                {
                    String temp = (c + word.toString()).intern();
                    if (dictionary.containsKey(temp) && dictionary.get(temp) == 1)
                        word.insert(0, c); // still a complete word
                    else if (dictionary.containsKey(temp) && bufferIndex > -1)
                        word.insert(0, c); // a suffix that may still grow into a word
                    else
                    {
                        // c does not extend the word: put it back
                        bufferIndex++;
                        offset++;
                        // Trim leading characters while the word is only a suffix
                        while (word.length() > 1 && dictionary.get(word.toString()) != null && dictionary.get(word.toString()) == 2)
                        {
                            word.deleteCharAt(0);
                            bufferIndex++;
                            offset++;
                        }
                        b = true;
                    }
                }
                if (b || bufferIndex == -1)
                {
                    Token token = new Token(word.toString(), offset, offset + word.length(), "word");
                    word.setLength(0);
                    tokenlist.add(token);
                    b = false;
                }
            }
        }
        // Tokens were collected right to left; restore reading order
        Collections.reverse(tokenlist);
        return tokenlist;
    }

    // Load the dictionary
    public static void loadDictionary()
    {
        if (dictionary == null)
        {
            dictionary = new HashMap<String, Integer>();
            InputStream is = null;
            BufferedReader br = null;
            try
            {
                is = new FileInputStream(new File(RMM.class.getClassLoader().getResource("dictionary.txt").toURI()));
                br = new BufferedReader(new InputStreamReader(is, "UTF-8"));
                String word = null;
                while ((word = br.readLine()) != null)
                {
                    word = word.toLowerCase();
                    // Skip lines containing '#' and words longer than WORD_MAX_LENGTH
                    if ((word.indexOf("#") == -1) && (word.length() <= WORD_MAX_LENGTH))
                    {
                        dictionary.put(word.intern(), 1);
                        // Register every proper suffix of length >= 2 with marker 2,
                        // unless it is already present as a word or suffix
                        int i = 1;
                        while (i < word.length() - 1)
                        {
                            String temp = word.substring(i, word.length()).intern();
                            if (!dictionary.containsKey(temp))
                                dictionary.put(temp, 2);
                            i++;
                        }
                    }
                }
            }
            catch (Exception e)
            {
                log.info(e);
            }
            finally
            {
                try
                {
                    if (br != null)
                        br.close();
                    if (is != null)
                        is.close();
                }
                catch (IOException e)
                {
                    log.info(e);
                }
            }
        }
    }

    // Segment the text from the reader and return the words as an array
    public static String[] segWords(Reader reader)
    {
        ArrayList<String> list = new ArrayList<String>();
        try
        {
            ArrayList<Token> tlist = Util.getNewToken(getToken(Util.getSentence(reader)));
            for (Token t : tlist)
            {
                list.add(t.getWord());
            }
        }
        catch (IOException e)
        {
            log.info(e);
        }
        return list.toArray(new String[0]);
    }

    public static void main(String[] args)
    {
        // The input is lowercased because the dictionary is stored in lowercase
        String[] cc = RMM.segWords(new StringReader("急、急、急、花里林居,二房二厅,业主诚心,出租".toLowerCase()));
        for (String c : cc)
        {
            System.out.println(c);
        }
    }
}
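The dictionary loader stores two kinds of entries: complete words (marker 1) and suffixes of longer words (marker 2), which lets the matcher keep extending a word leftwards while it is still the tail of some dictionary word. The following is a small sketch of that marking scheme under stated assumptions: the class name `SuffixDictionarySketch`, the `build` method, and the three-word list are made up for illustration, and the words come from an in-memory list rather than `dictionary.txt`.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical demo class; the word list below is made up for illustration.
class SuffixDictionarySketch {
    static final int WORD = 1;    // entry is a complete dictionary word
    static final int SUFFIX = 2;  // entry is only the tail of a longer word

    static Map<String, Integer> build(List<String> words) {
        Map<String, Integer> dict = new HashMap<>();
        for (String raw : words) {
            String word = raw.toLowerCase();
            dict.put(word, WORD); // a full word always overrides a suffix marker
            // Register proper suffixes of length >= 2, e.g. "hone" for "phone".
            for (int i = 1; i < word.length() - 1; i++) {
                dict.putIfAbsent(word.substring(i), SUFFIX);
            }
        }
        return dict;
    }

    public static void main(String[] args) {
        Map<String, Integer> dict = build(Arrays.asList("乐phone", "phone", "touch4"));
        System.out.println("phone -> " + dict.get("phone")); // complete word
        System.out.println("hone  -> " + dict.get("hone"));  // suffix only
        System.out.println("e     -> " + dict.get("e"));     // single characters are not stored
    }
}
```

Because single characters are never stored, a one-character word is always a valid fallback, which is why the trimming loop in `getToken` stops once the word is down to one character.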

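import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;

public class Util
{
    // Split the input into sentences made up of Chinese characters, letters, and digits.
    // Uppercase letters count as separators, so callers lowercase the input first.
    public static ArrayList<Sentence> getSentence(Reader reader) throws IOException
    {
        ArrayList<Sentence> list = new ArrayList<Sentence>();
        StringBuffer cb = new StringBuffer();
        int d = reader.read();
        int offset = 0;
        boolean b = false; // true when the current character is a separator
        while (d > -1)
        {
            int type = Character.getType(d);
            if (type == Character.LOWERCASE_LETTER || type == Character.DECIMAL_DIGIT_NUMBER || type == Character.OTHER_LETTER)
            {
                d = toAscii(d);
                cb.append((char) d);
            }
            else
            {
                b = true;
            }
            d = reader.read();
            if (d == -1 || b)
            {
                // At end of input the last character belongs to the sentence, so step past it.
                // Skip this when the last character is a separator; otherwise the start
                // offset would be one too large.
                if (d == -1 && !b) offset++;
                b = false;
                char[] ioBuffer = new char[cb.length()];
                cb.getChars(0, cb.length(), ioBuffer, 0);
                Sentence sen = new Sentence(ioBuffer, offset - cb.length());
                list.add(sen);
                cb.setLength(0);
            }
            offset++;
        }
        return list;
    }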

    // Merge adjacent single English letters or digits into one word
    public static ArrayList<Token> getNewToken(ArrayList<Token> list) throws IOException
    {
        ArrayList<Token> tokenlist = new ArrayList<Token>();
        Token word = null; // run of single letters/digits being merged
        for (int i = 0; i < list.size(); i++)
        {
            Token t = list.get(i);
            // A one-character token that is not a Chinese character (OTHER_LETTER)
            if (t.getWord().length() == 1 && Character.getType(t.getWord().charAt(0)) != Character.OTHER_LETTER)
            {
                if (word == null)
                    word = t;
                else if (word.getEnd() == t.getStart())
                {
                    // Directly adjacent: extend the current run
                    word.setEnd(t.getEnd());
                    word.setWord(word.getWord() + t.getWord());
                }
                else
                {
                    tokenlist.add(word);
                    word = t;
                }
            }
            else if (word != null)
            {
                tokenlist.add(word);
                word = null;
                tokenlist.add(t);
            }
            else
                tokenlist.add(t);
        }
        if (word != null)
            tokenlist.add(word);
        return tokenlist;
    }

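    // Convert full-width digits and letters to their half-width (ASCII) forms
    public static int toAscii(int codePoint)
    {
        if ((codePoint >= 65296 && codePoint <= 65305) // ０-９
            || (codePoint >= 65313 && codePoint <= 65338) // Ａ-Ｚ
            || (codePoint >= 65345 && codePoint <= 65370) // ａ-ｚ
        )
        {
            codePoint -= 65248; // 0xFEE0, the distance between the two ranges
        }
        return codePoint;
    }
}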
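The listings above use `Sentence` and `Token`, but the post does not show either class. The following is a hypothetical reconstruction inferred only from how the two classes are called in `RMM` and `Util`; the field names, and the meaning of the fourth `Token` constructor argument (always `"word"` in the post), are assumptions.

```java
// Hypothetical demo entry point; not part of the original post.
class TokenSentenceDemo {
    public static void main(String[] args) {
        Sentence s = new Sentence("出租3室".toCharArray(), 0);
        Token t = new Token("3室", 2, 4, "word");
        System.out.println(s.getText().length + " chars starting at " + s.getStartOffset());
        System.out.println(t.getWord() + " [" + t.getStart() + ", " + t.getEnd() + ")");
    }
}

// Assumed shape: one run of letters, digits, and Chinese characters.
class Sentence {
    private final char[] text;      // characters of the run
    private final int startOffset;  // position of text[0] in the original input

    public Sentence(char[] text, int startOffset) {
        this.text = text;
        this.startOffset = startOffset;
    }

    public char[] getText() { return text; }
    public int getStartOffset() { return startOffset; }
}

// Assumed shape: one segmented word with its [start, end) offsets.
class Token {
    private String word;
    private final int start;
    private int end;
    private final String type; // assumed to be a token-type label

    public Token(String word, int start, int end, String type) {
        this.word = word;
        this.start = start;
        this.end = end;
        this.type = type;
    }

    public String getWord() { return word; }
    public void setWord(String word) { this.word = word; }
    public int getStart() { return start; }
    public int getEnd() { return end; }
    public void setEnd(int end) { this.end = end; }
    public String getType() { return type; }
}
```

Only `word` and `end` have setters because `getNewToken` modifies just those two fields when it merges adjacent letters and digits.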

Reposted from: https://www.cnblogs.com/ibook360/archive/2011/11/11/2245871.html
