Hadoop_23: Implementing an Inverted Index with MapReduce

1.1. Inverted index

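An inverted index looks up records by the value of an attribute. Each entry in the index table holds one attribute value together with the addresses of all records that have that value. Because the attribute value determines where the records are, rather than the record determining its attribute values, the structure is called an inverted index.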

Example: a word-document matrix, in which the attribute value (the word) comes first and serves as the index key.
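The definition above can be illustrated with a minimal in-memory sketch. It is not part of the original post: the class name `InvertedIndexSketch`, its helper methods, and the three tiny documents are hypothetical and chosen only for illustration. Each word (the attribute value) maps to the set of documents that contain it.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical illustration class, not taken from the original post.
class InvertedIndexSketch {

    // Builds word -> sorted set of document names from docName -> document text.
    static Map<String, Set<String>> build(Map<String, String> docs) {
        Map<String, Set<String>> index = new TreeMap<>();
        for (Map.Entry<String, String> doc : docs.entrySet()) {
            for (String word : doc.getValue().split("\\s+")) {
                if (word.isEmpty()) {
                    continue; // leading whitespace yields an empty first token
                }
                index.computeIfAbsent(word, w -> new TreeSet<>()).add(doc.getKey());
            }
        }
        return index;
    }

    // Made-up sample documents (assumption, for illustration only).
    static Map<String, String> sampleDocs() {
        Map<String, String> docs = new LinkedHashMap<>();
        docs.put("doc1.txt", "hadoop mapreduce index");
        docs.put("doc2.txt", "hadoop hdfs");
        docs.put("doc3.txt", "index search");
        return docs;
    }

    public static void main(String[] args) {
        // Prints one line per word followed by the documents that contain it.
        build(sampleDocs()).forEach((word, files) -> System.out.println(word + " -> " + files));
    }
}
```

A forward index would map each document to its words; here the direction is reversed, which is what makes a lookup by word a single map access.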

1.2. Implementing an inverted index with MapReduce

Requirement: build a search index over a large volume of text (documents, web pages).

Code:

package cn.bigdata.hdfs.mr;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/**
 * Builds the first stage of the inverted index with a MapReduce program.
 * Input files:
 *   a.txt          b.txt          c.txt
 *   hello tom      hello jerry    hello jerry
 *   hello jerry    hello jerry    hello tom
 *   hello tom      tom jerry
 */
public class InverIndexStepOne {

    static class InverIndexStepOneMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        Text k = new Text();
        IntWritable v = new IntWritable(1);

        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            String line = value.toString();
            // Split each line of text on a single space " "
            String[] words = line.split(" ");
            // Get the file name from the input split
            FileSplit inputSplit = (FileSplit) context.getInputSplit();
            String fileName = inputSplit.getPath().getName();
            for (String word : words) {
                // Skip empty tokens produced by blank lines or repeated spaces
                if (word.isEmpty()) {
                    continue;
                }
                // Composite key "word--fileName", emitted with a count of 1
                k.set(word + "--" + fileName);
                context.write(k, v);
            }
        }
    }

    static class InverIndexStepOneReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            // Sum the occurrences of one word in one file
            int count = 0;
            for (IntWritable value : values) {
                count += value.get();
            }
            context.write(key, new IntWritable(count));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);
        job.setJarByClass(InverIndexStepOne.class);
        job.setMapperClass(InverIndexStepOneMapper.class);
        job.setReducerClass(InverIndexStepOneReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Input path (args[0]) and output path (args[1])
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.waitForCompletion(true);
    }
}

Output file of the run: E:\inverseOut\part-r-00000

hello--a.txt    3
hello--b.txt    2
hello--c.txt    2
jerry--a.txt    1
jerry--b.txt    3
jerry--c.txt    1
tom--a.txt      2
tom--b.txt      1
tom--c.txt      1
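The composite `word--fileName` key used by the first job can be checked without a cluster. The following is only a local sketch under stated assumptions: the class `StepOneLocalSketch` and its methods are hypothetical, and a sorted in-memory map stands in for Hadoop's shuffle and reduce. The sample lines are the three files described in the job's comment.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical local stand-in for the first job; it does not use Hadoop.
class StepOneLocalSketch {

    // Mimics map + reduce: counts every "word--fileName" key.
    static Map<String, Integer> countWordPerFile(Map<String, List<String>> files) {
        Map<String, Integer> counts = new TreeMap<>(); // sorted by key, like reducer output
        for (Map.Entry<String, List<String>> file : files.entrySet()) {
            for (String line : file.getValue()) {
                for (String word : line.split(" ")) {
                    if (word.isEmpty()) {
                        continue; // ignore empty tokens
                    }
                    // Same composite key as the mapper: word + "--" + fileName
                    counts.merge(word + "--" + file.getKey(), 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    // The three sample files from the job's comment, one list entry per line.
    static Map<String, List<String>> sampleFiles() {
        Map<String, List<String>> files = new LinkedHashMap<>();
        files.put("a.txt", Arrays.asList("hello tom", "hello jerry", "hello tom"));
        files.put("b.txt", Arrays.asList("hello jerry", "hello jerry", "tom jerry"));
        files.put("c.txt", Arrays.asList("hello jerry", "hello tom"));
        return files;
    }

    public static void main(String[] args) {
        countWordPerFile(sampleFiles()).forEach((key, count) -> System.out.println(key + "\t" + count));
    }
}
```

On the sample data this should yield the same nine key/count pairs as the listing above.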

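A second job then merges this output so that each word has a single record, in the layout of the word-document matrix above. The code is as follows: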

package cn.bigdata.hdfs.mr;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
/**
 * Merges the output of the first job so that the per-document records
 * for one word are combined into a single complete record.
 * @author Administrator
 */
public class IndexStepTwo {

    static class IndexStepTwoMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            // A step-one line looks like "hello--a.txt<TAB>3"
            String line = value.toString();
            String[] files = line.split("--");
            // Key: the word; value: "fileName<TAB>count"
            context.write(new Text(files[0]), new Text(files[1]));
        }
    }

    static class IndexStepTwoReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
            // Buffer that accumulates the "fileName-->count" entries for one word
            StringBuilder sb = new StringBuilder();
            for (Text text : values) {
                sb.append(text.toString().replace("\t", "-->")).append("\t");
            }
            context.write(key, new Text(sb.toString()));
        }
    }

    public static void main(String[] args) throws Exception {
        // Check for null before reading the length; both paths are required
        if (args == null || args.length < 2) {
            args = new String[]{"E:/inverseOut/part-r-00000", "D:/inverseOut2"};
        }
        Configuration config = new Configuration();
        Job job = Job.getInstance(config);
        job.setJarByClass(IndexStepTwo.class);
        job.setMapperClass(IndexStepTwoMapper.class);
        job.setReducerClass(IndexStepTwoReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // Exit with 0 on success and 1 on failure
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Output:

hello    c.txt-->2    b.txt-->2    a.txt-->3
jerry    c.txt-->1    b.txt-->3    a.txt-->1
tom      c.txt-->1    b.txt-->1    a.txt-->2
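The merge performed by the second job can also be sketched locally. This is an illustration under assumptions, not the job itself: `StepTwoLocalSketch` and `merge` are hypothetical names, and entries are joined in input order. In a real reducer the order of the values for one key is not guaranteed, which is why the listing above shows c.txt before a.txt.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical local stand-in for the second job; it does not use Hadoop.
class StepTwoLocalSketch {

    // Turns lines like "hello--a.txt<TAB>3" into word -> "a.txt-->3<TAB>b.txt-->2...".
    static Map<String, String> merge(List<String> stepOneLines) {
        Map<String, StringBuilder> buffers = new TreeMap<>();
        for (String line : stepOneLines) {
            String[] parts = line.split("--", 2); // [word, "fileName<TAB>count"]
            if (parts.length < 2) {
                continue; // skip malformed lines
            }
            String entry = parts[1].replace("\t", "-->");
            StringBuilder sb = buffers.computeIfAbsent(parts[0], w -> new StringBuilder());
            if (sb.length() > 0) {
                sb.append('\t'); // separator between entries, no trailing tab
            }
            sb.append(entry);
        }
        Map<String, String> merged = new TreeMap<>();
        buffers.forEach((word, sb) -> merged.put(word, sb.toString()));
        return merged;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "hello--a.txt\t3", "hello--b.txt\t2", "hello--c.txt\t2",
                "jerry--a.txt\t1", "jerry--b.txt\t3", "jerry--c.txt\t1",
                "tom--a.txt\t2", "tom--b.txt\t1", "tom--c.txt\t1");
        merge(lines).forEach((word, entries) -> System.out.println(word + "\t" + entries));
    }
}
```

Unlike the reducer above, this sketch does not append a trailing tab after the last entry; that is a small design difference, not a change to the record format.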

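Summary:

Building an index over a large set of documents comes down to two steps: tokenizing the text, and counting how many times each token occurs in each document. The index file is built from those per-document counts. A later search for a word then queries the index file directly and returns information such as document summaries.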
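The lookup step mentioned in the summary could look roughly like the sketch below. The original post does not show a query side, so everything here is an assumption: the class `IndexLookupSketch`, its `load` and `search` methods, and the choice to rank files by count (ties broken by file name). It parses lines in the format produced by the second job.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical query-side sketch; not part of the original post.
class IndexLookupSketch {

    // Parses lines like "hello<TAB>c.txt-->2<TAB>b.txt-->2<TAB>a.txt-->3"
    // into word -> (fileName -> count).
    static Map<String, Map<String, Integer>> load(List<String> indexLines) {
        Map<String, Map<String, Integer>> index = new HashMap<>();
        for (String line : indexLines) {
            if (line.isEmpty()) {
                continue;
            }
            String[] fields = line.split("\t");
            Map<String, Integer> counts = new HashMap<>();
            for (int i = 1; i < fields.length; i++) {
                String[] pair = fields[i].split("-->");
                if (pair.length == 2) {
                    counts.put(pair[0], Integer.parseInt(pair[1].trim()));
                }
            }
            index.put(fields[0], counts);
        }
        return index;
    }

    // Returns the files containing the word, highest count first, ties by file name.
    static List<String> search(Map<String, Map<String, Integer>> index, String word) {
        Map<String, Integer> counts = index.getOrDefault(word, new HashMap<>());
        List<String> files = new ArrayList<>(counts.keySet());
        files.sort((x, y) -> {
            int byCount = Integer.compare(counts.get(y), counts.get(x));
            return byCount != 0 ? byCount : x.compareTo(y);
        });
        return files;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "hello\tc.txt-->2\tb.txt-->2\ta.txt-->3\t",
                "jerry\tc.txt-->1\tb.txt-->3\ta.txt-->1\t",
                "tom\tc.txt-->1\tb.txt-->1\ta.txt-->2\t");
        System.out.println(search(load(lines), "jerry"));
    }
}
```

A production search engine would keep the index in a structure suited to disk or distributed access; the in-memory maps here only show the shape of the lookup.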

Reposted from: https://www.cnblogs.com/yaboya/p/9252313.html
