
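When your paper collection grows large and you want to find a specific string somewhere in it, do you really have to open the PDFs one by one? With that many papers, all of them PDFs, opening each file and hitting Ctrl + F is not realistic. So what can you do? Build an index for your own library? Building a full-text index of a local corpus is no small amount of work.

Is there a more lightweight way? Of course there is, and the inspiration comes from grep, the search command everyone uses on Linux. Can we run grep directly on PDF files? No: grep searches text files (source code of all kinds, plain text, and so on). Then how about converting every PDF to TXT first and running grep on the result? That works, but can we skip the conversion step too? We certainly can. Read on to see how to do it in one minute.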



sudo apt-get update -y
sudo apt-get install -y pdfgrep



pdfgrep -n -i 'cosine' *.pdf

(base) ergou@dell:~/Desktop/paper_reading/papers$ pdfgrep -n -i 'cosine' *.pdf
paper0.pdf:22:in VSM or BoW, are compared using similarity measure like Cosine similarity (Vu et al.
paper1.pdf:1:space. Given their vector embeddings, we then use cosine
paper1.pdf:3:first tokenizes the input text and then calculates vectors for                     to average them as the cosine similarity function depends
paper1.pdf:3:or tf-idf, these vectors are contextualized; they consider                   choice of summing or averaging would not influence the cosine
paper1.pdf:3:and problem report as a potential match. Another positive                      the euclidian similarity, and cosine similarity [9], [43]. The
paper1.pdf:4:dimensions. In contrast, the cosine similarity measures the
paper1.pdf:4:Previous research [1][3], [9], showed that cosine similarity                                          05/2012              09/2018
paper1.pdf:5:analyze DeepMatcher's cosine similarity values to understand
paper1.pdf:5:consuming over 80% battery. Had to uninstall to even                       cosine similarity, it added one additional suggestion per step
paper1.pdf:6:as many relevant bug reports in the issue tracker as                                 Cosine Similarity Analysis. We analyzed the cosine sim­
paper1.pdf:6:as many relevant bug reports in the issue tracker as                                 Cosine Similarity Analysis. We analyzed the cosine sim­
paper1.pdf:6:suggested bug reports to three, the MAP score                                    irrelevant bug report suggestions. Figure 4 shows the cosine
paper1.pdf:6:of problem reports for which DeepMatcher                                            We found that VLC has the lowest cosine similarity score
paper1.pdf:6:the MAP and the hit ratio scores for each                                    (26 matches). The lower cosine similarity indicates a higher
paper1.pdf:7:by the developers. our previously reported plot of the cosine                     report 546 days after the corresponding problem report for
paper1.pdf:7:the highest cosine similarity score and highest noun overlap                          It is essential for app developers to address users'
paper1.pdf:11:sensitive embeddings on which we applied cosine similarity to                     [15] M. Honnibal, I. Montani, S. Van Landeghem, and A.
paper2.pdf:3:encoding implies computing 												
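
With many papers the raw listing gets long, so it can help to first see which files contain the most hits. The sketch below is one possible way to do that, not part of pdfgrep itself: `count_hits_per_file` is a hypothetical helper name. It reads `file:page:text` lines like the ones above from standard input and assumes that file names contain no colon. It counts output lines, so a line that pdfgrep prints twice is counted twice.

```shell
# Hypothetical helper (not part of pdfgrep): summarize `pdfgrep -n` output.
# Input on stdin: lines of the form file:page:text
# Output: one "file<TAB>count" line per file, most hits first.
# Assumption: file names do not contain a colon.
count_hits_per_file() {
    awk -F: '
        { hits[$1]++ }                      # field 1 is the file name
        END {
            for (f in hits) {
                printf "%s\t%d\n", f, hits[f]
            }
        }
    ' | sort -t "$(printf '\t')" -k2,2nr -k1,1   # by count (descending), then by name
}
```

To use it, pipe the search into the helper, for example `pdfgrep -n -i 'cosine' *.pdf | count_hits_per_file`, and then open the top-ranked papers first.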


