Python Algorithm Applications (5): Search and Ranking, Part 1 (Connecting to a Database and Simple Ranking)

The SQLite database

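A SQLite database is stored as a single file (conventionally with a .db extension), so there is no driver or database server to install, and Python's standard library already includes the sqlite3 module for working with it.
Operating on a SQLite .db file from Python takes two steps:
1. Connect to the database
2. Operate on the database through the API (the arguments passed in are ordinary SQL statements)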

import sqlite3

class crawler:
    # Initialize the crawler class with the name of the database file
    def __init__(self, dbname):
        self.con = sqlite3.connect(dbname)

    def __del__(self):
        self.con.close()

    def dbcommit(self):
        self.con.commit()
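The two steps can also be tried without the class. The following is a minimal sketch, assuming an in-memory database (the special name ":memory:") and a hypothetical table called demo, neither of which is part of the original example:

```python
import sqlite3

# Step 1: connect. ":memory:" asks sqlite3 for a temporary in-memory
# database, so no file is created on disk.
con = sqlite3.connect(":memory:")

# Step 2: every operation is an SQL statement passed to execute().
con.execute("create table demo(word)")  # "demo" is a hypothetical table
con.execute("insert into demo(word) values (?)", ("python",))
con.commit()

rows = con.execute("select rowid, word from demo").fetchall()
print(rows)
con.close()
```

Passing values through ? placeholders instead of formatting them into the SQL string lets sqlite3 handle the quoting.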

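Instantiation
Note: select … from … where is the SQL statement for querying the database
1. select is followed by the columns to return
2. from is followed by the table to query
3. where is followed by a condition; only rows that satisfy it are returned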

# Connect to the database (the instance gets its own name so it does not shadow the class)
c = crawler('searchindex.db')
#print([row for row in c.con.execute('select * from wordlist')])
#print(c.con.execute('select * from wordlocation where location=327 and urlid=1').fetchone())
#print(c.con.execute('select * from wordlist where rowid=145').fetchone())
# Close the connection explicitly
c.__del__()
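Queries like the commented-out ones can be tried without searchindex.db. This is a small sketch, assuming an in-memory wordlist table filled with three made-up words:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table wordlist(word)")
con.executemany("insert into wordlist(word) values (?)",
                [("python",), ("search",), ("rank",)])

# select: which columns to return; from: which table; where: which rows
row = con.execute(
    "select rowid, word from wordlist where word = ?", ("search",)
).fetchone()

# fetchone() returns None when no row satisfies the condition
missing = con.execute(
    "select rowid, word from wordlist where word = ?", ("absent",)
).fetchone()
con.close()

print(row, missing)
```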

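The database used in this example:

rowid is the row identifier that SQLite maintains by default; you do not need to add it yourself
urllist: stores the URLs
wordlist: stores every word extracted from all the pages, without duplicates
wordlocation: also stores the words, but here a word appears once per occurrence, so it can repeat. Each row has three columns: urlid is the index in urllist of the URL the word appears in, wordid is the word's index in wordlist, and location is the word's position within that page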
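To make the three tables concrete, here is a sketch that builds them in memory. The column names follow the description above; the untyped column declarations, the URL, and the sample rows are assumptions made for illustration only.

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Column names follow the text; declaring them without types is an assumption.
con.execute("create table urllist(url)")
con.execute("create table wordlist(word)")
con.execute("create table wordlocation(urlid, wordid, location)")

# One made-up page whose text is "python search python"
con.execute("insert into urllist(url) values ('http://example.com/a')")
con.executemany("insert into wordlist(word) values (?)",
                [("python",), ("search",)])
con.executemany("insert into wordlocation values (?, ?, ?)",
                [(1, 1, 0), (1, 2, 1), (1, 1, 2)])

# wordlist holds each word once; wordlocation holds one row per occurrence
word_count = con.execute("select count(*) from wordlist").fetchone()[0]
location_count = con.execute("select count(*) from wordlocation").fetchone()[0]
con.close()

print(word_count, location_count)
```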

Search

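The query function takes a string made up of words separated by spaces.
Algorithm:
1. Split the input string into words to form a list
2. Iterate over the list
3. Use a database query to get each word's ID from wordlist
4. Use the word IDs to query wordlocation
5. Step 4 is implemented with a single SQL query. Each row it returns is a tuple (url, word1location, word2location, ...), that is, the positions of the queried words within one page. Because each word can occur several times on a page, there are many combinations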

select w0.urlid,w0.location,w1.location
from wordlocation w0,wordlocation w1
where w0.wordid=126 and w0.urlid=w1.urlid and w1.wordid=127

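Start with the conditions after where. w0 and w1 are two aliases for the same wordlocation table, so the table is joined with itself.
The clause shown below combines three conditions:
the w0 rows whose wordid is 126
the w1 rows whose wordid is 127
the pairs of a w0 row and a w1 row that share the same urlid
Because the conditions are joined with and, the where clause keeps only the pairs that satisfy all three.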

where w0.wordid=126 and w0.urlid=w1.urlid and w1.wordid=127

Next, the from clause. It says that the wordlocation table can be referred to as w0, and that the same wordlocation table can also be referred to as w1.

from wordlocation w0,wordlocation w1

Finally, the select clause. The rows found above contain columns from both w0 and w1; from them we extract w0's urlid, w0's location, and w1's location.

select w0.urlid,w0.location,w1.location

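The above only covers a query with two words; with more words, more table aliases and more columns are needed.
In general, the where clause requires all urlid values to be equal and each wordid to match the ID of one queried word, the select list is urlid followed by one location per word, and the result has the following form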

[(urlid1,wordlocation1_1,wordlocation2_1),  # URL 1: position 1 of word 1, position 1 of word 2
 (urlid1,wordlocation1_1,wordlocation2_2),  # URL 1: position 1 of word 1, position 2 of word 2
 (urlid1,wordlocation1_1,wordlocation2_3),  # URL 1: position 1 of word 1, position 3 of word 2
 (urlid1,wordlocation1_2,wordlocation2_1),  # URL 1: position 2 of word 1, position 1 of word 2
 (urlid1,wordlocation1_2,wordlocation2_2),  # URL 1: position 2 of word 1, position 2 of word 2
 (urlid1,wordlocation1_2,wordlocation2_3),  # URL 1: position 2 of word 1, position 3 of word 2
 ...
 (urlid2,wordlocation1_1,wordlocation2_1),  # URL 2: position 1 of word 1, position 1 of word 2
 (urlid2,wordlocation1_1,wordlocation2_2),  # URL 2: position 1 of word 1, position 2 of word 2
 ...
]
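The self-join can be run on a handful of rows to see the combinations appear. The sketch below assumes an in-memory wordlocation table; the word IDs 126 and 127 come from the query above, while the URL IDs and locations are made up.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table wordlocation(urlid, wordid, location)")
con.executemany("insert into wordlocation values (?, ?, ?)", [
    (1, 126, 2), (1, 126, 9),   # URL 1: word 126 at positions 2 and 9
    (1, 127, 3), (1, 127, 5),   # URL 1: word 127 at positions 3 and 5
    (2, 126, 4),                # URL 2: contains word 126 only
])

rows = con.execute(
    "select w0.urlid, w0.location, w1.location "
    "from wordlocation w0, wordlocation w1 "
    "where w0.wordid=126 and w0.urlid=w1.urlid and w1.wordid=127"
).fetchall()
con.close()

# The query has no "order by", so sort before printing
for row in sorted(rows):
    print(row)
```

URL 1 contributes 2 × 2 = 4 rows, one per pair of positions, and URL 2 contributes none because it lacks word 127.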

Reference: https://blog.csdn.net/luanpeng825485697/article/details/78997189

# Define a class used for searching
class searcher:
    def __init__(self, dbname):
        self.con = sqlite3.connect(dbname)

    def __del__(self):
        self.con.close()

    # Query function: split the input string into words and look them up;
    # only URLs that contain all of the words are returned
    def getmatchrows(self, q):
        # Pieces of the query string
        fieldlist = 'w0.urlid'
        tablelist = ''
        clauselist = ''
        wordids = []

        # Split the query into words on spaces
        words = q.split(' ')
        tablenumber = 0

        for word in words:
            # Look up the ID of the word (the placeholder avoids quoting problems)
            wordrow = self.con.execute(
                "select rowid from wordlist where word=?", (word,)).fetchone()
            if wordrow is not None:
                wordid = wordrow[0]
                wordids.append(wordid)
                if tablenumber > 0:
                    tablelist += ','
                    clauselist += ' and '
                    clauselist += 'w%d.urlid=w%d.urlid and ' % (tablenumber-1, tablenumber)
                fieldlist += ',w%d.location' % tablenumber
                tablelist += 'wordlocation w%d' % tablenumber
                clauselist += 'w%d.wordid=%d' % (tablenumber, wordid)
                tablenumber += 1

        # None of the words is in the index, so there is no valid query to run
        if not wordids:
            return [], []

        # Build the full query from the pieces
        fullquery = 'select %s from %s where %s' % (fieldlist, tablelist, clauselist)
        cur = self.con.execute(fullquery)
        # Each row: (urlid, location1, location2, ...)
        rows = [row for row in cur]

        # Return the rows together with the list of word IDs
        return rows, wordids
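To inspect the SQL that the loop in getmatchrows assembles, the string building can be pulled out into a standalone function. This is only a sketch: build_match_query is a hypothetical helper, not part of the original class, and it takes the word IDs directly instead of looking them up in wordlist.

```python
def build_match_query(wordids):
    """Build the self-join query for a non-empty list of integer word IDs."""
    if not wordids:
        raise ValueError("at least one word ID is required")
    fields = ['w0.urlid']
    tables = []
    clauses = []
    for n, wordid in enumerate(wordids):
        if n > 0:
            # Tie each alias to the previous one through the same urlid
            clauses.append('w%d.urlid=w%d.urlid' % (n - 1, n))
        fields.append('w%d.location' % n)
        tables.append('wordlocation w%d' % n)
        clauses.append('w%d.wordid=%d' % (n, wordid))
    return 'select %s from %s where %s' % (
        ','.join(fields), ','.join(tables), ' and '.join(clauses))


query = build_match_query([126, 127])
print(query)
```

For the IDs 126 and 127 this produces the same two-word query that was analyzed earlier; each additional ID adds one alias, one location column, and two conditions.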

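Simple ranking

Searching gives us the URLs that match the query keywords; the next step is to rank these URLs so that the best results come first.
Simple ranking falls into two categories:
Content-based ranking: ranks pages by their content, using word frequency, document location, and word distance
Inbound link ranking: ranks pages by how many other pages link to them
Steps for sorting the URLs by score:
1. Score each URL according to the scoring criteria
2. Map each urlid to its URL
3. Print the sorted results, one "score url" pair per line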
    # Takes the matched rows and returns a dictionary mapping each urlid to its total weighted score
    def getscoredlist(self, rows, wordids):
        totalscores = dict([(row[0], 0) for row in rows])

        # This is where the scoring functions are plugged in
        weights = [(1.0, self.frequencyscore(rows)),
                   (1.5, self.locationscore(rows)),
                   (1.0, self.distancescore(rows)),
                   (1.0, self.inboundlinkscore(rows))]

        for (weight, scores) in weights:
            for url in totalscores:
                totalscores[url] += weight * scores[url]

        return totalscores

    # Return the URL for a given urlid
    def geturlname(self, id):
        return self.con.execute(
            "select url from urllist where rowid=?", (id,)).fetchone()[0]

    # Print the results sorted by score
    def query(self, q):
        rows, wordids = self.getmatchrows(q)
        # No page contains all of the words, so there is nothing to rank
        if not rows:
            return
        scores = self.getscoredlist(rows, wordids)
        rankedscores = sorted([(score, url) for (url, score) in scores.items()], reverse=True)
        for (score, urlid) in rankedscores[0:10]:
            print('%f\t%s' % (score, self.geturlname(urlid)))
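The weighting and sorting can be seen in isolation from the database. The following sketch assumes two already-normalized score dictionaries and a plain dict standing in for urllist; the function name rank, the URLs, and all numbers are made up for illustration.

```python
def rank(weighted_scores, urlnames, top=10):
    """Combine (weight, {urlid: score}) pairs and return the best (score, url) pairs."""
    totals = {}
    for weight, scores in weighted_scores:
        for urlid, score in scores.items():
            totals[urlid] = totals.get(urlid, 0.0) + weight * score
    # Sort by total score, highest first
    ranked = sorted(((score, urlid) for urlid, score in totals.items()),
                    reverse=True)
    return [(score, urlnames[urlid]) for score, urlid in ranked[:top]]


urlnames = {1: 'http://example.com/a', 2: 'http://example.com/b'}
weighted_scores = [
    (1.0, {1: 1.0, 2: 0.5}),   # e.g. a frequency score
    (1.5, {1: 0.5, 2: 1.0}),   # e.g. a location score with a higher weight
]
result = rank(weighted_scores, urlnames)
for score, url in result:
    print('%f\t%s' % (score, url))
```

Page b wins here because the metric on which it scores best carries the larger weight, which is the effect the weights list is meant to control.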

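Normalization
All of the scoring methods return numeric scores, but on different scales, and for some a larger value is better while for others a smaller one is. The scores therefore need to be normalized onto the range [0, 1], where 1 is the best and 0 is the worst.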

    # Normalization function: takes a dictionary of IDs and scores and returns a dictionary
    # of IDs and scores between 0 and 1 (1 for the best result)
    def normalizescores(self, scores, smallIsBetter=0):
        vsmall = 0.00001  # avoid division by zero
        if smallIsBetter:
            minscore = min(scores.values())
            return dict([(u, float(minscore)/max(vsmall, l)) for (u, l) in scores.items()])
        else:
            maxscore = max(scores.values())
            if maxscore == 0:
                maxscore = vsmall
            return dict([(u, float(c)/maxscore) for (u, c) in scores.items()])
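A worked example makes the two directions easier to compare. The sketch below restates normalizescores as a standalone function (the name normalize and the input scores are assumptions) and applies it to the same raw values in both modes.

```python
def normalize(scores, small_is_better=False):
    """Standalone restatement of normalizescores for experimentation."""
    vsmall = 0.00001  # avoid division by zero
    if small_is_better:
        minscore = min(scores.values())
        return {u: float(minscore) / max(vsmall, v) for u, v in scores.items()}
    maxscore = max(scores.values())
    if maxscore == 0:
        maxscore = vsmall
    return {u: float(v) / maxscore for u, v in scores.items()}


raw = {1: 2, 2: 4}                       # made-up raw scores for two URLs
larger_wins = normalize(raw)             # each score divided by the maximum
smaller_wins = normalize(raw, small_is_better=True)  # minimum divided by each score
print(larger_wins, smaller_wins)
```

In both modes the best URL ends up at 1.0 and the other is scaled relative to it.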

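Word frequency
How many times the query words appear in a document helps us judge how relevant the document is.
Note: a dict has unique keys, so building one from the rows removes duplicate URLs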

    # Use word frequency as the metric
    def frequencyscore(self, rows):
        # Build a dictionary keyed by URL (duplicates collapse); every value starts at 0
        counts = dict([(row[0], 0) for row in rows])
        for row in rows:
            # Each matching row adds 1 to the count of its URL
            counts[row[0]] += 1
        return self.normalizescores(counts)
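The raw counts, before normalization, can be reproduced with collections.Counter. The rows below are made up and have the (urlid, location1, location2) shape returned by a two-word query.

```python
from collections import Counter

# Made-up rows: (urlid, location of word 1, location of word 2)
rows = [(1, 3, 8), (1, 3, 20), (2, 1, 4)]

# Count how many rows belong to each URL
counts = Counter(row[0] for row in rows)
print(dict(counts))
```

With more than one query word, each row is one combination of positions, so the count is the number of combinations on the page and not the plain number of occurrences.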

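Document location
Generally, the closer the search words are to the beginning of the page, the better.
Note: the distance used here is the sum of the locations of all the search words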

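    # Use document location as the metric: the closer the words are to the start of the page, the better
    def locationscore(self, rows):
        # Build a dictionary keyed by URL (duplicates collapse); every value starts very large
        locations = dict([(row[0], 1000000) for row in rows])
        for row in rows:
            loc = sum(row[1:])
            # Keep the smallest sum of locations seen for each URL
            if locations[row[0]] > loc:
                locations[row[0]] = loc
        return self.normalizescores(locations, smallIsBetter=1)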
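The raw value that gets normalized is the smallest sum of locations per URL. A short sketch with made-up rows:

```python
# Made-up rows: (urlid, location of word 1, location of word 2)
rows = [(1, 3, 8), (1, 3, 20), (2, 1, 4)]

best = {}
for urlid, *locations in rows:
    total = sum(locations)
    # Keep the smallest sum seen so far for this URL
    if urlid not in best or total < best[urlid]:
        best[urlid] = total

print(best)
```

URL 2 has the smaller sum, so after normalization with smallIsBetter it would receive the higher score.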

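Word distance
When a query contains several words, it is often useful to look for pages where the words appear close together.
Note: the word distance is the sum of the absolute differences between consecutive locations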

    # Use word distance as the metric: with several query words, pages where they are close together are often more relevant
    def distancescore(self, rows):
        # With only one word, every URL gets the same score
        if len(rows[0]) <= 2:
            return dict([(row[0], 1.0) for row in rows])

        # Initialize the dictionary with a very large value
        mindistance = dict([(row[0], 1000000) for row in rows])

        for row in rows:
            dist = sum([abs(row[i]-row[i-1]) for i in range(2, len(row))])
            if dist < mindistance[row[0]]:
                mindistance[row[0]] = dist
        return self.normalizescores(mindistance, smallIsBetter=1)
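The distance of a single row is the sum of the gaps between neighbouring locations, and each URL keeps its smallest distance. A short sketch with made-up rows:

```python
# Made-up rows: (urlid, location of word 1, location of word 2)
rows = [(1, 3, 8), (1, 3, 20), (2, 1, 4)]

mindistance = {}
for urlid, *locations in rows:
    # Sum of absolute differences between consecutive locations
    dist = sum(abs(b - a) for a, b in zip(locations, locations[1:]))
    if urlid not in mindistance or dist < mindistance[urlid]:
        mindistance[urlid] = dist

print(mindistance)
```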

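Simple count of inbound links
We can often judge the quality of a page by how other pages treat it; generally, the more pages link to it, the more reliable its content.
Note: this query uses the link table in the database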

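    # Use inbound links: the more a page is referenced, the more reliable its content
    # Simple count of the links pointing to each page
    def inboundlinkscore(self, rows):
        uniqueurls = set([row[0] for row in rows])
        # Count how many times each toid (i.e. each URL) appears in the link table; this is the number of inbound links
        inboundlinkcount = dict([(u, self.con.execute('select count(*) from link where toid=%d' % u).fetchone()[0]) for u in uniqueurls])
        return self.normalizescores(inboundlinkcount)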
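The counting query can be tried on a tiny link table. The text only mentions the table name link and its toid column; the fromid column and the sample rows below are assumptions made for this sketch.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Assumed layout: one row per hyperlink, pointing from page fromid to page toid
con.execute("create table link(fromid integer, toid integer)")
con.executemany("insert into link values (?, ?)",
                [(2, 1), (3, 1), (1, 2)])

uniqueurls = {1, 2, 3}
# Number of rows whose toid is the URL, i.e. its inbound link count
inbound = {
    u: con.execute("select count(*) from link where toid=?", (u,)).fetchone()[0]
    for u in uniqueurls
}
con.close()

print(inbound)
```

count(*) always returns exactly one row, so a URL that nobody links to gets 0 and not None.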