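A very good book with very comprehensive coverage. I have been reading it for a long time and still have not finished. I have just completed the first ten chapters and found that I have forgotten most of the earlier material, so I am coming back to take notes.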

Boolean Retrieval

1. Definition of information retrieval:

Academic definition:

Information retrieval (IR) is finding material (usually documents) of
an unstructured nature (usually text) that satisfies an information need
from within large collections (usually stored on computers).

Categories:

Categories by scale:

web search, domain-specific search, personal information retrieval

Basic needs:

1. To process large document collections quickly.

2. To allow more flexible matching operations.

3. To allow ranked retrieval.

Simple idea:

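Build a term-document incidence matrix (one bit vector per term, one bit per document) and combine the vectors with the bitwise logical operators AND, OR, and NOT. For example: 110100 AND 110111 AND 101111 = 100100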
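The bit-vector arithmetic above can be checked with a few lines of Python. This is only a sketch: the variable names are made up for illustration, and each vector is assumed to carry one bit per document, read left to right.

```python
# The three incidence vectors from the example, one bit per document.
a = 0b110100
b = 0b110111
c = 0b101111

# Bitwise AND keeps only the documents whose bit is set in every vector.
result = a & b & c
result_bits = format(result, "06b")
print(result_bits)  # 100100

# Positions of the 1 bits (counted from the left) identify the matching documents.
matching_docs = [i for i, bit in enumerate(result_bits) if bit == "1"]
print(matching_docs)  # [0, 3]
```

So the first and fourth documents (positions 0 and 3) satisfy the query.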

What is Boolean retrieval:

The Boolean retrieval model is a model for information retrieval in which we
can pose any query which is in the form of a Boolean expression of terms,
that is, in which terms are combined with the operators AND, OR, and NOT.
Such queries effectively view each document as a set of words.

What a Boolean retrieval query looks like:

(Calpurnia AND Brutus) AND Caesar

How to assess an IR system:

Precision : What fraction of the returned results are relevant to the information
need?
Recall : What fraction of the relevant documents in the collection were returned
by the system?
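Both measures are simple set ratios. A minimal sketch, assuming the returned and relevant documents are given as collections of document IDs (the function name and the sample IDs are hypothetical):

```python
def precision_recall(returned, relevant):
    """Return (precision, recall) for one query."""
    returned, relevant = set(returned), set(relevant)
    hits = len(returned & relevant)  # relevant documents that were actually returned
    precision = hits / len(returned) if returned else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Made-up example: 4 documents returned, 3 relevant in the collection, 2 in common.
p, r = precision_recall(returned=[1, 2, 3, 4], relevant=[2, 4, 5])
print(p, r)
```

Here precision is 2/4 and recall is 2/3. Returning 0.0 for an empty set is just a convention chosen for this sketch.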

Vector space model: easy to rank.

Term-document matrix: not scalable.

Inverted index: a dictionary plus postings lists.

How to build an inverted index:

1. Collect the documents to be indexed:
Friends, Romans, countrymen. So let it be with Caesar . . .
2. Tokenize the text, turning each document into a list of tokens:
Friends Romans countrymen So . . .
3. Do linguistic preprocessing, producing a list of normalized tokens, which
are the indexing terms: friend roman countryman so . . .
4. Index the documents that each term occurs in by creating an inverted index,
consisting of a dictionary and postings.
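The four steps above can be sketched in Python as follows. This is an illustration under simplifying assumptions, not the book's implementation: the "linguistic preprocessing" here is only lowercasing (no stemming, so "Friends" becomes "friends" rather than "friend"), and the three toy documents and all function names are made up.

```python
import re
from collections import defaultdict


def tokenize(text):
    # Step 2: split the text into alphabetic tokens, dropping punctuation.
    return re.findall(r"[A-Za-z]+", text)


def normalize(token):
    # Step 3 (simplified): lowercasing only; real systems may also stem.
    return token.lower()


def build_inverted_index(docs):
    # Step 4: map each term to the sorted list of docIDs that contain it.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[normalize(token)].add(doc_id)
    # The sorted terms play the role of the dictionary; the sorted docID lists are the postings.
    return {term: sorted(ids) for term, ids in sorted(index.items())}


# Step 1: collect the documents (hypothetical toy collection).
docs = {
    1: "Friends, Romans, countrymen.",
    2: "So let it be with Caesar",
    3: "Caesar and the Romans",
}
index = build_inverted_index(docs)
print(index["romans"])  # [1, 3]
print(index["caesar"])  # [2, 3]
```

Keeping each postings list sorted by docID is what makes the linear-time merge for AND queries possible.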

Processing Boolean queries:

AND operation:

Intersect two postings lists:

INTERSECT(p1, p2)
  answer ← ()
  while p1 != NIL and p2 != NIL
  do if docID(p1) = docID(p2)
       then ADD(answer, docID(p1))
            p1 ← next(p1)
            p2 ← next(p2)
       else if docID(p1) < docID(p2)
              then p1 ← next(p1)
              else p2 ← next(p2)
  return answer
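A direct Python translation of the pseudocode, as a sketch. It assumes each postings list is a Python list of docIDs sorted in increasing order, and uses list indexes in place of the linked-list pointers p1 and p2. The sample lists are illustrative.

```python
def intersect(p1, p2):
    """Merge-style intersection of two sorted postings lists."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            # docID present in both lists: keep it and advance both pointers.
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1  # advance the pointer that is behind
        else:
            j += 1
    return answer


brutus = [1, 2, 4, 11, 31, 45, 173, 174]
calpurnia = [2, 31, 54, 101]
print(intersect(brutus, calpurnia))  # [2, 31]
```

Each step advances at least one pointer, so the work is linear in the combined length of the two lists.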

Multiple-term AND operation:

Process terms in order of increasing document frequency:

If we start by intersecting the two smallest postings lists, then all intermediate results must be no bigger than the smallest postings list, and we are therefore likely to do the least amount of total work.

INTERSECT(⟨t1, . . . , tn⟩)
  terms ← SORTBYINCREASINGFREQUENCY(⟨t1, . . . , tn⟩)
  result ← postings(first(terms))
  terms ← rest(terms)
  while terms != NIL and result != NIL
  do result ← INTERSECT(result, postings(first(terms)))
     terms ← rest(terms)
  return result
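The multi-term version can be sketched the same way. Assumptions in this sketch: the index is a plain dict from term to sorted docID list, document frequency is taken to be the length of that list, and a term missing from the index is treated as having an empty postings list. The names and the toy index are hypothetical.

```python
def intersect(p1, p2):
    """Two-list merge intersection, repeated here so the example is self-contained."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer


def intersect_terms(terms, index):
    """AND together several terms, rarest term first."""
    ordered = sorted(terms, key=lambda t: len(index.get(t, [])))
    if not ordered:
        return []
    result = list(index.get(ordered[0], []))
    for term in ordered[1:]:
        if not result:
            break  # an empty intermediate result can never grow again
        result = intersect(result, index.get(term, []))
    return result


index = {
    "brutus": [1, 2, 4, 11, 31, 45, 173, 174],
    "caesar": [1, 2, 4, 5, 6, 16, 57, 132],
    "calpurnia": [2, 31, 54, 101],
}
print(intersect_terms(["brutus", "caesar", "calpurnia"], index))  # [2]
```

Starting from "calpurnia", the shortest list, means no intermediate result ever has more than four entries.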

OR operation:

    The idea is the n-way merge used in merge sort, similar to the AND operation.
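A possible sketch of the OR operation as an n-way merge, using `heapq.merge` from the Python standard library to merge the sorted lists and skipping duplicates as they come out. The function name and sample lists are made up for illustration.

```python
import heapq


def union(*postings_lists):
    """OR of several sorted postings lists via an n-way merge."""
    answer = []
    # heapq.merge yields the docIDs of all lists in sorted order, duplicates included.
    for doc_id in heapq.merge(*postings_lists):
        if not answer or answer[-1] != doc_id:
            answer.append(doc_id)  # keep each docID only once
    return answer


merged = union([1, 2, 4], [2, 31], [4, 54])
print(merged)  # [1, 2, 4, 31, 54]
```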

The extended Boolean model versus ranked retrieval:

Proximity operator:

A proximity operator is a way of specifying that two terms in a query must occur in a document close to each other, where closeness may be measured
by limiting the allowed number of intervening words or by reference to a structural unit such as a sentence or paragraph.
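One way to picture a word-distance proximity check is the sketch below. It assumes we already know the token positions of each term inside a single document, and it measures closeness as a difference in positions (a difference of k means k - 1 intervening words). The function names and the sample sentence are hypothetical.

```python
def positions(tokens, term):
    """Token offsets at which `term` occurs in one document."""
    return [i for i, tok in enumerate(tokens) if tok == term]


def within_k_words(positions_a, positions_b, k):
    """True if some occurrence of A and some occurrence of B are at most k positions apart."""
    i = j = 0
    while i < len(positions_a) and j < len(positions_b):
        if abs(positions_a[i] - positions_b[j]) <= k:
            return True
        # Advance whichever pointer is behind, as in the postings merge.
        if positions_a[i] < positions_b[j]:
            i += 1
        else:
            j += 1
    return False


tokens = "gates spoke about the early days at microsoft".split()
gates = positions(tokens, "gates")          # [0]
microsoft = positions(tokens, "microsoft")  # [7]
print(within_k_words(gates, microsoft, 7))  # True
print(within_k_words(gates, microsoft, 3))  # False
```

Proximity defined by a structural unit such as a sentence or paragraph would need sentence or paragraph boundaries as well, which this sketch does not model.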

Additional things we would like to do:

1. We would like to better determine the set of terms in the dictionary and
to provide retrieval that is tolerant to spelling mistakes and inconsistent
choice of words.
2. It is often useful to search for compounds or phrases that denote a concept
such as “operating system”. As the Westlaw examples show, we might also
wish to do proximity queries such as Gates NEAR Microsoft. To answer
such queries, the index has to be augmented to capture the proximities of
terms in documents.
3. A Boolean model only records term presence or absence, but often we
would like to accumulate evidence, giving more weight to documents that
have a term several times as opposed to ones that contain it only once. To
be able to do this we need term frequency information (the number of
times a term occurs in a document) in postings lists.
4. Boolean queries just retrieve a set of matching documents, but commonly
we wish to have an effective method to order (or “rank”) the returned
results. This requires having a mechanism for determining a document
score which encapsulates how good a match a document is for a query.
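Points 3 and 4 can be illustrated with a small sketch in which each posting stores a (docID, term frequency) pair. The scoring rule used here, summing raw term frequencies over the query terms, is only a placeholder to show where a score would come from. It is not a scoring method taken from the book, and the toy documents and function names are made up.

```python
from collections import Counter, defaultdict


def build_index_with_tf(docs):
    """Postings lists of (doc_id, term_frequency) pairs, sorted by doc_id."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        counts = Counter(docs[doc_id].lower().split())
        for term, tf in counts.items():
            index[term].append((doc_id, tf))
    return dict(index)


def rank(query_terms, index):
    """Naive placeholder score: sum of term frequencies of the query terms."""
    scores = Counter()
    for term in query_terms:
        for doc_id, tf in index.get(term, []):
            scores[doc_id] += tf
    return scores.most_common()  # highest score first


docs = {1: "caesar caesar brutus", 2: "caesar calpurnia"}
index = build_index_with_tf(docs)
print(index["caesar"])                   # [(1, 2), (2, 1)]
print(rank(["caesar", "brutus"], index))  # [(1, 3), (2, 1)]
```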
