Introduction to Information Retrieval: Reading Notes (1)

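A very good book with very thorough coverage. I have been reading it for a long time and still have not finished. I just got through the first ten chapters and found I had forgotten most of the earlier material, so I am going back to write it down.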

Boolean Retrieval

1. Definition of information retrieval:

Academic definition:

Information retrieval (IR) is finding material (usually documents) of
an unstructured nature (usually text) that satisfies an information need
from within large collections (usually stored on computers).

Categories:

By scale:

web search, domain-specific search, personal information retrieval

Basic needs:

1. To process large document collections quickly.

2. To allow more flexible matching operations.

3. To allow ranked retrieval.

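Simple idea:

Build a term-document incidence matrix, where each term has a bit vector with one bit per document, and answer queries with bitwise AND, OR, and NOT on those vectors. In the book's example, Brutus AND Caesar AND NOT Calpurnia becomes: 110100 AND 110111 AND 101111 = 100100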
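The bit-vector calculation above can be checked with plain Python integers. This is only a minimal sketch: the variable names and the six-document collection size are assumptions for illustration, and the bit string for Calpurnia (010000) is inferred as the complement of 101111.

```python
# One bit per document, leftmost bit = first document.
# Names and N_DOCS are illustrative assumptions, not from the notes.
N_DOCS = 6
brutus = 0b110100
caesar = 0b110111
calpurnia = 0b010000

# Python's ~ gives a negative number, so mask NOT to the collection size.
mask = (1 << N_DOCS) - 1
not_calpurnia = ~calpurnia & mask

result = brutus & caesar & not_calpurnia
result_bits = format(result, f"0{N_DOCS}b")
print(result_bits)
```

The set bits of the result mark the documents that match the whole query.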

What is Boolean retrieval:

The Boolean retrieval model is a model for information retrieval in which we
can pose any query which is in the form of a Boolean expression of terms,
that is, in which terms are combined with the operators AND, OR, and NOT.
Such queries effectively view each document as a set of words.

What a Boolean retrieval query looks like:

(Calpurnia AND Brutus) AND Caesar

How to assess an IR system:

Precision: What fraction of the returned results are relevant to the information need?
Recall: What fraction of the relevant documents in the collection were returned by the system?
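Both measures reduce to set arithmetic on document IDs. A minimal sketch, where the function name, the sample IDs, and the convention of returning 0.0 for an empty set are all assumptions made for illustration:

```python
def precision_recall(returned, relevant):
    """Return (precision, recall) for two collections of document IDs."""
    returned, relevant = set(returned), set(relevant)
    hits = len(returned & relevant)  # relevant documents that were actually returned
    # Assumed convention: report 0.0 instead of dividing by zero.
    precision = hits / len(returned) if returned else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical example: 4 results returned, 5 documents truly relevant, 2 in common.
p, r = precision_recall(returned=[1, 2, 3, 4], relevant=[2, 4, 6, 8, 10])
print(p, r)
```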

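Vector space model: easy to rank.

Term-document matrix: not scalable, because for a large collection the matrix is huge and almost entirely zeros.

Inverted index: a dictionary of terms plus a postings list for each term.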

How to build an inverted index:

1. Collect the documents to be indexed:
Friends, Romans, countrymen. So let it be with Caesar . . .
2. Tokenize the text, turning each document into a list of tokens:
Friends Romans countrymen So . . .
3. Do linguistic preprocessing, producing a list of normalized tokens, which
are the indexing terms: friend roman countryman so . . .
4. Index the documents that each term occurs in by creating an inverted index,
consisting of a dictionary and postings.
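The four steps above can be sketched in a few lines of Python. This is a rough illustration under stated assumptions: the tokenizer is a simple regular-expression split, normalization is lowercasing only (the stemming in the book's example, friends to friend, is left out), and the three sample documents are made up.

```python
import re
from collections import defaultdict

def tokenize(text):
    # Step 2: split on anything that is not a letter (a stand-in for a real tokenizer).
    return [t for t in re.split(r"[^A-Za-z]+", text) if t]

def normalize(token):
    # Step 3: lowercasing only; no stemming or lemmatization in this sketch.
    return token.lower()

def build_inverted_index(docs):
    """docs maps docID -> text. Returns term -> sorted list of docIDs (the postings)."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            postings[normalize(token)].add(doc_id)
    # Step 4: sorted dictionary of terms, each with a sorted postings list.
    return {term: sorted(ids) for term, ids in sorted(postings.items())}

# Step 1: hypothetical toy collection.
docs = {
    1: "Friends, Romans, countrymen.",
    2: "So let it be with Caesar.",
    3: "Caesar had friends.",
}
index = build_inverted_index(docs)
print(index["caesar"], index["friends"])
```

Keeping each postings list sorted by docID is what makes the merge-style intersection possible.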

Processing Boolean queries:

AND operation:

Intersect two postings lists:

INTERSECT(p1, p2)
  answer ← ⟨ ⟩
  while p1 ≠ NIL and p2 ≠ NIL
  do if docID(p1) = docID(p2)
       then ADD(answer, docID(p1))
            p1 ← next(p1)
            p2 ← next(p2)
       else if docID(p1) < docID(p2)
              then p1 ← next(p1)
              else p2 ← next(p2)
  return answer
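A Python version of the same merge, as a minimal sketch. It assumes postings lists are plain Python lists sorted by ascending docID, with list indexes playing the role of the pointers p1 and p2; the sample lists are only for illustration.

```python
def intersect(p1, p2):
    """Intersect two postings lists that are sorted by ascending docID."""
    answer = []
    i, j = 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])  # docID present in both lists
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1                # advance the pointer with the smaller docID
        else:
            j += 1
    return answer

brutus = [1, 2, 4, 11, 31, 45, 173, 174]
calpurnia = [2, 31, 54, 101]
print(intersect(brutus, calpurnia))
```

Each pointer only moves forward, so the work is linear in the combined length of the two lists.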

Multiple-term AND operation:

Process terms in order of increasing document frequency:

If we start by intersecting the two smallest postings lists, then all intermediate results must be no bigger than the smallest postings list, and we are therefore likely to do the least amount of total work.

INTERSECT(⟨t1, . . . , tn⟩)
  terms ← SORTBYINCREASINGFREQUENCY(⟨t1, . . . , tn⟩)
  result ← postings(first(terms))
  terms ← rest(terms)
  while terms ≠ NIL and result ≠ NIL
  do result ← INTERSECT(result, postings(first(terms)))
     terms ← rest(terms)
  return result
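A sketch of the multi-term version in Python. Assumptions made here: the index is a dict from term to a sorted list of docIDs, document frequency is taken as the length of the postings list, a term missing from the index counts as an empty list, and the sample index is invented for the example.

```python
def intersect(p1, p2):
    """Two-list merge intersection over sorted postings lists."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

def intersect_terms(terms, index):
    """AND together all terms, rarest term first."""
    # Document frequency of a term = length of its postings list.
    ordered = sorted(terms, key=lambda t: len(index.get(t, [])))
    if not ordered:
        return []
    result = list(index.get(ordered[0], []))
    for term in ordered[1:]:
        if not result:
            break  # nothing can match any more, stop early
        result = intersect(result, index.get(term, []))
    return result

# Hypothetical postings lists.
index = {
    "brutus": [1, 2, 4, 11, 31, 45, 173, 174],
    "caesar": [1, 2, 4, 5, 6, 16, 57, 132],
    "calpurnia": [2, 31, 54, 101],
}
print(intersect_terms(["brutus", "caesar", "calpurnia"], index))
```

Here calpurnia is processed first because it has the shortest list, so the intermediate result never holds more than four docIDs.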

OR operation:

The idea is the n-way merge used in merge sort, similar to the AND operation, except that a docID is kept if it appears in any of the lists.
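One way to sketch the n-way merge is with `heapq.merge` from the Python standard library, which lazily merges already-sorted inputs. The function name `union` and the sample lists are assumptions for illustration.

```python
import heapq

def union(*postings_lists):
    """n-way merge of sorted postings lists, keeping each docID once."""
    answer = []
    for doc_id in heapq.merge(*postings_lists):
        # The merged stream is sorted, so duplicates are always adjacent.
        if not answer or answer[-1] != doc_id:
            answer.append(doc_id)
    return answer

merged = union([1, 2, 4], [2, 31], [4, 5])
print(merged)
```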

The extended Boolean model versus ranked retrieval:

Proximity operator:

A proximity operator is a way of specifying that two terms in a query must occur in a document close to each other, where closeness may be measured by limiting the allowed number of intervening words or by reference to a structural unit such as a sentence or paragraph.
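A minimal sketch of the word-distance variant, assuming the index stores, for each term and document, a sorted list of token positions (the notes do not show such an index). The function `near` and its parameter `k` are hypothetical; `k` is the maximum allowed difference in token positions, so `k = 1` means the two terms are adjacent.

```python
def near(positions_a, positions_b, k):
    """True if some occurrence of term A is within k token positions of term B.

    Both arguments are sorted lists of token offsets within one document.
    """
    i, j = 0, 0
    while i < len(positions_a) and j < len(positions_b):
        if abs(positions_a[i] - positions_b[j]) <= k:
            return True
        # Advance whichever pointer is behind; moving the other could only widen the gap.
        if positions_a[i] < positions_b[j]:
            i += 1
        else:
            j += 1
    return False

print(near([3, 20], [7, 40], k=4))
print(near([3, 20], [9, 40], k=4))
```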

Additional things we would like to be able to do:

1. We would like to better determine the set of terms in the dictionary and
to provide retrieval that is tolerant to spelling mistakes and inconsistent
choice of words.
2. It is often useful to search for compounds or phrases that denote a concept
such as “operating system”. As the Westlaw examples show, we might also
wish to do proximity queries such as Gates NEAR Microsoft. To answer
such queries, the index has to be augmented to capture the proximities of
terms in documents.
3. A Boolean model only records term presence or absence, but often we
would like to accumulate evidence, giving more weight to documents that
have a term several times as opposed to ones that contain it only once. To
be able to do this we need term frequency information (the number of
times a term occurs in a document) in postings lists.
4. Boolean queries just retrieve a set of matching documents, but commonly
we wish to have an effective method to order (or “rank”) the returned
results. This requires having a mechanism for determining a document
score which encapsulates how good a match a document is for a query.
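Items 3 and 4 can be illustrated with postings that carry a term frequency, plus a toy score. Everything here is an assumption for illustration: the function names, the sample documents, and above all the scoring rule (sum of the query terms' frequencies), which is a placeholder and not a scoring method given in these notes.

```python
from collections import Counter, defaultdict

def build_tf_index(docs):
    """docs maps docID -> list of normalized tokens. Returns term -> {docID: term frequency}."""
    index = defaultdict(dict)
    for doc_id, tokens in docs.items():
        for term, tf in Counter(tokens).items():
            index[term][doc_id] = tf
    return index

def rank(query_terms, index):
    """Order documents by a toy score: the summed term frequency of the query terms."""
    scores = Counter()
    for term in query_terms:
        for doc_id, tf in index.get(term, {}).items():
            scores[doc_id] += tf
    # Highest score first; ties broken by docID.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))

# Hypothetical pre-tokenized documents.
docs = {
    1: ["gates", "microsoft", "gates"],
    2: ["microsoft"],
    3: ["windows"],
}
tf_index = build_tf_index(docs)
ranking = rank(["gates", "microsoft"], tf_index)
print(ranking)
```

Document 1 comes first because it contains the query terms three times in total, which is the kind of evidence a pure Boolean match cannot express.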
