Information Retrieval System: A Lucene-Based Implementation

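Requirements

Based on the course material, implement a simple search engine (no GUI is required; terminal output is fine) with the following basic features:

Spell checking (based on the minimum edit distance principle)
Inverted index
Document ranking using TF/IDF or VSM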

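Implementation

This project uses lucene-8.0.0. Because the API changes between versions, many blog tutorials found online no longer work. For the exact API parameters and calls, consult the official documentation for this version, which takes some skill at searching and reading docs.

http://lucene.apache.org/core/8_0_0/core/

Full project source: GitHub link

Only the key parts of the code are covered below.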

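1. Building the inverted index

The index is built with the IndexWriter class. Because the documents are in Chinese, a Chinese-aware analyzer, SmartChineseAnalyzer, is required.
Open the index directory and read the files from the data directory.
Define a FieldType and set its properties so that each value is both stored and indexed.
Read each file into a String.
Build the inverted index on the file content.
Build the inverted index on the file name.
Build the inverted index on the file path.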

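import java.io.BufferedReader;
import java.io.File;
import java.io.FileFilter;
import java.io.FileReader;
import java.io.IOException;
import java.nio.file.FileSystems;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.IndexOptions;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class Indexer {

    private IndexWriter writer;

    public Indexer(String indexDirectoryPath) throws IOException {
        // Open the directory that will hold the index
        Directory indexDirectory = FSDirectory.open(FileSystems.getDefault().getPath(indexDirectoryPath));
        // Chinese analyzer
        Analyzer analyzer = new SmartChineseAnalyzer();
        IndexWriterConfig config = new IndexWriterConfig(analyzer);
        writer = new IndexWriter(indexDirectory, config);
    }

    public void close() throws CorruptIndexException, IOException {
        writer.close();
    }

    private Document getDocument(File file) throws IOException {
        Document document = new Document();
        // Define a FieldType whose values are both stored and indexed
        FieldType fieldType = new FieldType();
        fieldType.setStored(true);
        fieldType.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS);
        // Read the file into a String, one line at a time.
        // FileReader uses the platform default charset, as in the original code.
        StringBuilder buffer = new StringBuilder();
        try (BufferedReader bf = new BufferedReader(new FileReader(file))) {
            String s;
            while ((s = bf.readLine()) != null) {
                buffer.append(s.trim());
            }
        }
        String xml = buffer.toString();
        // Index the file content
        Field contentField = new Field(LuceneConstants.CONTENTS, xml, fieldType);
        // Index the file name
        Field fileNameField = new Field(LuceneConstants.FILE_NAME, file.getName(), fieldType);
        // Index the file path
        Field filePathField = new Field(LuceneConstants.FILE_PATH, file.getCanonicalPath(), fieldType);
        // Add the fields to the document
        document.add(contentField);
        document.add(fileNameField);
        document.add(filePathField);
        return document;
    }

    private void indexFile(File file) throws IOException {
        System.out.println("Indexing " + file.getCanonicalPath());
        Document document = getDocument(file);
        writer.addDocument(document);
    }

    public int createIndex(String dataDirPath, FileFilter filter) throws IOException {
        // Get all files in the data directory; listFiles() returns null if the path is not a directory
        File[] files = new File(dataDirPath).listFiles();
        if (files == null) {
            return 0;
        }
        int count = 0;
        for (File file : files) {
            if (!file.isDirectory() && !file.isHidden() && file.exists() && file.canRead() && filter.accept(file)) {
                indexFile(file);
                count++;
            }
        }
        return count;
    }
}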
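To make the data structure behind the Lucene index concrete, here is a minimal sketch of an inverted index in plain Java. It is not how Lucene stores its postings; the class name `ToyInvertedIndex` and the whitespace tokenization are assumptions made for illustration (the project itself tokenizes with SmartChineseAnalyzer).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy inverted index (hypothetical helper, not part of the project):
// maps each term to the list of ids of the documents that contain it.
class ToyInvertedIndex {

    private final Map<String, List<Integer>> postings = new TreeMap<>();

    // Adds one document. Documents are expected to be added in increasing id order,
    // so each posting list stays sorted.
    void add(int docId, String text) {
        for (String token : text.toLowerCase().trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            List<Integer> list = postings.computeIfAbsent(token, k -> new ArrayList<>());
            // Record each document at most once per term
            if (list.isEmpty() || list.get(list.size() - 1) != docId) {
                list.add(docId);
            }
        }
    }

    // Returns the ids of the documents containing the term (empty if none).
    List<Integer> lookup(String term) {
        return postings.getOrDefault(term.toLowerCase(), new ArrayList<>());
    }
}
```

A query for a term is then a single map lookup instead of a scan over every document, which is the property the IndexWriter-built index provides at scale.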

Test driver:

import java.io.IOException;

public class LuceneTester {

    String indexDir = "C:/Users/asus/Desktop/java/information-retrieval-system/index";
    String dataDir = "C:/Users/asus/Desktop/java/information-retrieval-system/data";
    Indexer indexer;

    public static void main(String[] args) {
        LuceneTester tester;
        // Debug helper: list the files in the data directory
        // File[] fs = new File("C:/Users/asus/Desktop/java/information-retrieval-system/data").listFiles();
        // for (File f : fs) {
        //     System.out.println(f);
        // }
        try {
            tester = new LuceneTester();
            tester.createIndex();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    private void createIndex() throws IOException {
        indexer = new Indexer(indexDir);
        int numIndexed;
        long startTime = System.currentTimeMillis();
        numIndexed = indexer.createIndex(dataDir, new TextFileFilter());
        long endTime = System.currentTimeMillis();
        indexer.close();
        System.out.println(numIndexed + " File indexed, time taken: " + (endTime - startTime) + " ms");
    }
}

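Output: the index has now been built, and the index files can be found in the index directory.

[Screenshot: index files in the index directory]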

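2. Ranking documents with TF/IDF and searching by keyword

Open the directory that holds the index files.
Open a reader over all the index files in that directory.
Set the similarity to TF/IDF (ClassicSimilarity).
Instantiate the analyzer.
Create the query parser.
Parse the query string q that was passed in.
Run the search.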

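import java.nio.file.Paths;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.similarities.ClassicSimilarity;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class ReaderByIndexerTest {

    public static void search(String indexDir, String q) throws Exception {
        // Open the directory that holds the index files
        Directory dir = FSDirectory.open(Paths.get(indexDir));
        // Open a reader over all the index files in that directory
        IndexReader reader = DirectoryReader.open(dir);
        // Create the index searcher
        IndexSearcher is = new IndexSearcher(reader);
        // Rank with TF/IDF
        ClassicSimilarity sim = new ClassicSimilarity();
        // tf is implemented as sqrt(freq).
        // idf is implemented as log((docCount+1)/(docFreq+1)) + 1.
        is.setSimilarity(sim);
        // Instantiate the analyzer
        Analyzer analyzer = new SmartChineseAnalyzer();
        // Create the query parser:
        // the first argument is the field to search, the second is the Analyzer
        QueryParser parser = new QueryParser("contents", analyzer);
        // Parse the query string that was passed in
        Query query = parser.parse(q);
        // Record the search start time
        long start = System.currentTimeMillis();
        // Run the search:
        // the first argument is the parsed query, the second is the maximum number of hits to return
        TopDocs hits = is.search(query, 10);
        // Record the search end time
        long end = System.currentTimeMillis();
        System.out.println("Matched " + q + " in " + (end - start) + " ms, found " + hits.totalHits.value + " records");
        // hits.scoreDocs holds the top-ranked documents; each ScoreDoc carries a document id and its score
        for (ScoreDoc scoreDoc : hits.scoreDocs) {
            Document doc = is.doc(scoreDoc.doc);
            System.out.println(doc.get(LuceneConstants.FILE_PATH));
        }
        // Close the reader
        reader.close();
    }

    // checkWord and main (sections 3 and 4) call search() unqualified,
    // so they presumably belong to this class as well.
}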
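The two formulas quoted in the comments above, tf = sqrt(freq) and idf = log((docCount+1)/(docFreq+1)) + 1, can be evaluated by hand to see how term weights behave. The sketch below is a simplified illustration in plain Java: the class `TfIdfSketch` and the `weight` method (a bare tf times idf product) are assumptions for this example, and the score Lucene actually computes includes further factors such as length normalization.

```java
// Plain-Java sketch of the tf and idf formulas quoted above (hypothetical helper).
class TfIdfSketch {

    // Term frequency component: sqrt of the number of occurrences in the document.
    static double tf(int freq) {
        return Math.sqrt(freq);
    }

    // Inverse document frequency: log((docCount + 1) / (docFreq + 1)) + 1.
    static double idf(long docFreq, long docCount) {
        return Math.log((docCount + 1) / (double) (docFreq + 1)) + 1.0;
    }

    // Simplified weight: tf * idf only. This is NOT the full Lucene score.
    static double weight(int freq, long docFreq, long docCount) {
        return tf(freq) * idf(docFreq, docCount);
    }

    public static void main(String[] args) {
        // A term occurring 4 times in a document, present in 1 of 100 documents
        System.out.println(weight(4, 1, 100));
        // The same frequency for a term present in 50 of 100 documents scores lower
        System.out.println(weight(4, 50, 100));
    }
}
```

Rare terms get a larger idf, so a document that matches a rare query term outranks one that matches only a common term at the same frequency.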

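3. Spell checking

Open the spell-check index directory.
Create and initialize the spell-check index from the existing index.
Use that index to look up k suggested keywords.
Return the suggested keywords.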

// Requires: java.io.File, java.io.IOException,
// org.apache.lucene.search.spell.Dictionary, LuceneDictionary, SpellChecker,
// org.apache.lucene.index.IndexWriterConfig, plus the imports used by search()
public static String[] checkWord(String queryWord) {
    // Directory for the new spell-check index
    String spellIndexPath = "C:\\Users\\asus\\Desktop\\java\\information-retrieval-system\\newPath";
    // Directory of the existing document index
    String oriIndexPath = "C:\\Users\\asus\\Desktop\\java\\information-retrieval-system\\index";
    // Spell check
    try {
        // Open the spell-check directory
        Directory directory = FSDirectory.open((new File(spellIndexPath)).toPath());
        SpellChecker spellChecker = new SpellChecker(directory);
        // The next steps initialize the spell-check index
        IndexReader reader = DirectoryReader.open(FSDirectory.open((new File(oriIndexPath)).toPath()));
        // Use the terms of the existing index as the dictionary
        Dictionary dictionary = new LuceneDictionary(reader, LuceneConstants.CONTENTS);
        IndexWriterConfig config = new IndexWriterConfig(new SmartChineseAnalyzer());
        spellChecker.indexDictionary(dictionary, config, true);
        // Ask for up to 5 suggestions
        int numSug = 5;
        String[] suggestions = spellChecker.suggestSimilar(queryWord, numSug);
        reader.close();
        spellChecker.close();
        directory.close();
        return suggestions;
    } catch (IOException e) {
        e.printStackTrace();
    }
    // Reached only if the spell check failed with an IOException
    return null;
}
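The requirements refer to the minimum edit distance principle. The following sketch shows that principle on its own: a Levenshtein distance computed by dynamic programming, plus a lookup of the closest dictionary word. It illustrates the idea only and is not the internal implementation of Lucene's SpellChecker; the class `EditDistanceSketch` and its `closest` method are assumptions for this example.

```java
import java.util.List;

// Minimum edit distance (Levenshtein) sketch; a hypothetical helper, not Lucene code.
class EditDistanceSketch {

    // Number of single-character insertions, deletions and substitutions
    // needed to turn a into b, using two rolling rows of the DP table.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning a[0..i) into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Returns the dictionary word with the smallest distance to word
    // (the first one on ties), or null if the dictionary is empty.
    static String closest(String word, List<String> dictionary) {
        String best = null;
        int bestDistance = Integer.MAX_VALUE;
        for (String candidate : dictionary) {
            int d = distance(word, candidate);
            if (d < bestDistance) {
                bestDistance = d;
                best = candidate;
            }
        }
        return best;
    }
}
```

For the test keyword used later in this post, 美利坚共和国 differs from 美利坚合众国 by two substitutions, so the distance between them is 2.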

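4. End-to-end test

This step calls the classes and utilities implemented above and wraps them in a bare-bones command-line interface for retrieval.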

// Test entry point
// Requires: java.io.BufferedReader, java.io.IOException, java.io.InputStreamReader
public static void main(String[] args) throws IOException {
    String indexDir = "C:\\Users\\asus\\Desktop\\java\\information-retrieval-system\\index";
    // Read the keyword from standard input
    BufferedReader br = new BufferedReader(new InputStreamReader(System.in));
    String str = null;
    System.out.println("Enter the keyword to search for:");
    try {
        str = br.readLine();
        System.out.println();
    } catch (IOException e1) {
        e1.printStackTrace();
    }
    // Spell check
    String temp = str;
    String[] suggestions = checkWord(str);
    if (suggestions != null && suggestions.length != 0) {
        System.out.println("Did you mean:");
        for (int i = 0; i < suggestions.length; i++) {
            System.out.println((i + 1) + " : " + suggestions[i]);
        }
        System.out.println("Pick one of the keywords above (enter 1 ~ 5), or enter 0 to search with the original word:");
        String choice = br.readLine();
        System.out.println();
        // Compare string contents with equals(); the original used !=, which compares references.
        // Also guard against a choice outside the suggestion list.
        int pick = (choice == null || choice.isEmpty()) ? -1 : choice.charAt(0) - '1';
        if (!"0".equals(choice) && pick >= 0 && pick < suggestions.length) {
            str = suggestions[pick];
        } else {
            str = temp;
        }
    }
    try {
        search(indexDir, str);
    } catch (Exception e) {
        e.printStackTrace();
    }
}
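One detail of the menu handling deserves care: in Java, comparing strings with == or != compares object references, not contents, so a line read from the console will not reliably compare equal to the literal "0" that way. A minimal sketch of a safer selection step follows; the class `ChoiceSketch` and its `resolve` method are hypothetical names introduced for illustration.

```java
// Sketch of resolving the user's menu choice (hypothetical helper).
class ChoiceSketch {

    // Returns suggestions[n - 1] when input parses to n in 1..suggestions.length;
    // otherwise (including "0", blank or non-numeric input) returns the original word.
    static String resolve(String input, String original, String[] suggestions) {
        String trimmed = input == null ? "" : input.trim();
        try {
            int choice = Integer.parseInt(trimmed);
            if (choice >= 1 && choice <= suggestions.length) {
                return suggestions[choice - 1];
            }
        } catch (NumberFormatException e) {
            // Not a number: fall through and keep the original word
        }
        return original;
    }
}
```

Parsing the whole line as an integer also handles out-of-range input, which would otherwise index past the end of the suggestion array.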

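Test results:

Test case 1:

Here I enter a misspelled keyword, 美利坚共和国 ("Republic of America"), and try to search. The system immediately shows the spell-check suggestions and asks me to choose again.

After I choose, it prints the correct search results for 美利坚合众国 ("United States of America").

Test case 2:

Here I enter a correct keyword that exists in the index. The system prints the search results for 美利坚合众国 directly, without showing any spell-check prompt.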
