1. Extracting PDF Content

# pip install PyPDF2
import PyPDF2

# Create a PDF file object.
pdf = open("test.pdf", "rb")

# Create a PDF reader object (legacy PyPDF2 1.x API).
pdf_reader = PyPDF2.PdfFileReader(pdf)

# Check the total number of pages in the PDF file.
print("Total number of Pages:", pdf_reader.numPages)

# Create a page object for a specific page number.
page = pdf_reader.getPage(200)

# Extract the text from that page.
print(page.extractText())

# Close the file object.
pdf.close()
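
PdfFileReader, numPages, getPage() and extractText() are the legacy PyPDF2 1.x names; they were removed in PyPDF2 3.x and its successor pypdf. A minimal sketch of the equivalent on the newer API, assuming the pypdf package is installed:

# pip install pypdf
from pypdf import PdfReader

reader = PdfReader("test.pdf")                       # open the PDF
print("Total number of Pages:", len(reader.pages))   # page count
page = reader.pages[0]                               # pages are a 0-indexed list
print(page.extract_text())                           # extract the page text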

2. Extracting Word Content

# pip install python-docx
import docx


def main():
    try:
        # Create a Word document reader object.
        doc = docx.Document('test.docx')
        fullText = []
        for para in doc.paragraphs:
            fullText.append(para.text)
        data = '\n'.join(fullText)
        print(data)
    except IOError:
        print('There was an error opening the file!')
        return


if __name__ == '__main__':
    main()
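
doc.paragraphs only covers body text; anything inside tables is exposed separately through doc.tables. A small sketch for pulling table text too, assuming the same test.docx:

import docx

doc = docx.Document('test.docx')
# Walk every table, row, and cell and print the cell text, tab-separated.
for table in doc.tables:
    for row in table.rows:
        print('\t'.join(cell.text for cell in row.cells))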

3. Extracting Web Page Content

# pip install bs4
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

req = Request('http://www.cmegroup.com/trading/products/#sortField=oi&sortAsc=false&venues=3&page=1&cleared=1&group=1',
              headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()

# Parse the HTML.
soup = BeautifulSoup(webpage, 'html.parser')

# Pretty-print the parsed HTML.
strhtm = soup.prettify()

# Print the first 500 characters.
print(strhtm[:500])

# Extract the title and a meta tag value.
print(soup.title.string)
print(soup.find('meta', attrs={'property': 'og:description'}))

# Extract anchor tag values.
for x in soup.find_all('a'):
    print(x.string)

# Extract paragraph tag values.
for x in soup.find_all('p'):
    print(x.text)

4. Reading JSON Data

import requests
import json

r = requests.get("https://support.oneskyapp.com/hc/en-us/article_attachments/202761727/example_2.json")
res = r.json()

# Extract a specific node's content.
print(res['quiz']['sport'])

# Dump the data back out as a JSON string.
data = json.dumps(res)
print(data)
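
The same parsing works for a local file with the standard library alone; a minimal sketch, assuming the JSON has been saved as example_2.json:

import json

with open('example_2.json', 'r', encoding='utf-8') as f:
    res = json.load(f)           # parse the file into Python objects
print(res['quiz']['sport'])      # same node access as above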

5. Reading CSV Data

import csv

with open('test.csv', 'r') as csv_file:
    reader = csv.reader(csv_file)
    next(reader)  # Skip the header row
    for row in reader:
        print(row)
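
When the first row is a header, csv.DictReader maps each remaining row to a dict keyed by the column names, so the header never needs to be skipped by hand; a minimal sketch on the same test.csv:

import csv

with open('test.csv', 'r', newline='') as csv_file:
    reader = csv.DictReader(csv_file)  # header row becomes the keys
    for row in reader:
        print(row)                     # each row is a dict of column -> value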

6. Removing Punctuation from a String

import re
import string

data = "Stuning even for the non-gamer: This sound track was beautiful!\
It paints the senery in your mind so well I would recomend\
it even to people who hate vid. game music! I have played the game Chrono \
Cross but out of all of the games I have ever played it has the best music! \
It backs away from crude keyboarding and takes a fresher step with grate\
guitars and soulful orchestras.\
It would impress anyone who cares to listen!"

# Method 1: regex
# Remove the special characters from the string.
no_specials_string = re.sub('[!#?,.:";]', '', data)
print(no_specials_string)

# Method 2: str.translate()
# Make a translator object that deletes all punctuation.
translator = str.maketrans('', '', string.punctuation)
data = data.translate(translator)
print(data)

7. Removing Stop Words with NLTK

# The stop word list needs a one-time download: nltk.download('stopwords')
from nltk.corpus import stopwords

data = ['Stuning even for the non-gamer: This sound track was beautiful!\
It paints the senery in your mind so well I would recomend\
it even to people who hate vid. game music! I have played the game Chrono \
Cross but out of all of the games I have ever played it has the best music! \
It backs away from crude keyboarding and takes a fresher step with grate\
guitars and soulful orchestras.\
It would impress anyone who cares to listen!']

# Remove stop words.
stopwords = set(stopwords.words('english'))
output = []
for sentence in data:
    temp_list = []
    for word in sentence.split():
        if word.lower() not in stopwords:
            temp_list.append(word)
    output.append(' '.join(temp_list))

print(output)

8. Correcting Spelling with TextBlob

from textblob import TextBlob

data = "Natural language is a cantral part of our day to day life, and it's so antresting to work on any problem related to langages."
output = TextBlob(data).correct()
print(output)
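
For a single word, Word.spellcheck() returns the candidate corrections with their confidence instead of silently picking one; a small sketch (the misspelled word is just an example):

from textblob import Word

# List of (candidate, confidence) tuples, best candidate first.
print(Word('antresting').spellcheck())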

9. Word Tokenization with NLTK and TextBlob

# NLTK's tokenizer needs the punkt model: nltk.download('punkt')
import nltk
from textblob import TextBlob

data = "Natural language is a central part of our day to day life, and it's so interesting to work on any problem related to languages."
nltk_output = nltk.word_tokenize(data)
textblob_output = TextBlob(data).words

print(nltk_output)
print(textblob_output)
  • Output:
['Natural', 'language', 'is', 'a', 'central', 'part', 'of', 'our', 'day', 'to', 'day', 'life', ',', 'and', 'it', "'s", 'so', 'interesting', 'to', 'work', 'on', 'any', 'problem', 'related', 'to', 'languages', '.']
['Natural', 'language', 'is', 'a', 'central', 'part', 'of', 'our', 'day', 'to', 'day', 'life', 'and', 'it', "'s", 'so', 'interesting', 'to', 'work', 'on', 'any', 'problem', 'related', 'to', 'languages']

10. Stemming Words in Sentences or Phrases with NLTK

from nltk.stem import PorterStemmer

st = PorterStemmer()
text = ['Where did he learn to dance like that?',
        'His eyes were dancing with humor.',
        'She shook her head and danced away',
        'Alex was an excellent dancer.']

output = []
for sentence in text:
    output.append(" ".join([st.stem(i) for i in sentence.split()]))

for item in output:
    print(item)

print("-" * 50)
print(st.stem('jumping'), st.stem('jumps'), st.stem('jumped'))
  • Output:
where did he learn to danc like that?
hi eye were danc with humor.
she shook her head and danc away
alex wa an excel dancer.
--------------------------------------------------
jump jump jump
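
Porter is the oldest stemmer in NLTK; SnowballStemmer ("Porter2") is generally a slightly more accurate drop-in replacement. A minimal sketch on the same words:

from nltk.stem import SnowballStemmer

sb = SnowballStemmer('english')
print(sb.stem('jumping'), sb.stem('jumps'), sb.stem('jumped'))
print(sb.stem('dancing'), sb.stem('dancer'), sb.stem('excellent'))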

11. Lemmatizing Sentences or Phrases with NLTK

# The lemmatizer needs the WordNet data: nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer

wnl = WordNetLemmatizer()
text = ['She gripped the armrest as he passed two cars at a time.',
        'Her car was in full view.',
        'A number of cars carried out of state license plates.']

output = []
for sentence in text:
    output.append(" ".join([wnl.lemmatize(i) for i in sentence.split()]))

for item in output:
    print(item)

print("*" * 10)
print(wnl.lemmatize('jumps', 'n'))
print(wnl.lemmatize('jumping', 'v'))
print(wnl.lemmatize('jumped', 'v'))
print("*" * 10)
print(wnl.lemmatize('saddest', 'a'))
print(wnl.lemmatize('happiest', 'a'))
print(wnl.lemmatize('easiest', 'a'))
  • Output:
She gripped the armrest a he passed two car at a time.
Her car wa in full view.
A number of car carried out of state license plates.
**********
jump
jump
jump
**********
sad
happy
easy
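
Without a POS argument the lemmatizer treats every token as a noun, which is why the verbs in the sentences above come through unchanged. A hedged sketch that guesses the WordNet POS from nltk.pos_tag (assumes the punkt, wordnet, and averaged_perceptron_tagger resources have been downloaded):

import nltk
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

wnl = WordNetLemmatizer()

def wordnet_pos(treebank_tag):
    # Map Penn Treebank tags (JJ*, VB*, NN*, RB*) onto WordNet POS constants.
    mapping = {'J': wordnet.ADJ, 'V': wordnet.VERB, 'N': wordnet.NOUN, 'R': wordnet.ADV}
    return mapping.get(treebank_tag[0], wordnet.NOUN)

sentence = 'She gripped the armrest as he passed two cars at a time.'
tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
print(" ".join(wnl.lemmatize(word, wordnet_pos(tag)) for word, tag in tagged))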

12. Finding the Frequency of Each Word in a Text File with NLTK

import nltk
from nltk.corpus import webtext
from nltk.probability import FreqDist

nltk.download('webtext')
wt_words = webtext.words('testing.txt')
data_analysis = nltk.FreqDist(wt_words)

# Keep only the words that are longer than 3 characters.
filter_words = dict([(m, n) for m, n in data_analysis.items() if len(m) > 3])

for key in sorted(filter_words):
    print("%s: %s" % (key, filter_words[key]))

data_analysis = nltk.FreqDist(filter_words)
data_analysis.plot(25, cumulative=False)
  • Output:
[nltk_data] Downloading package webtext to
[nltk_data] C:\Users\amit\AppData\Roaming\nltk_data...
[nltk_data] Unzipping corpora\webtext.zip.
1989: 1
Accessing: 1
Analysis: 1
Anyone: 1
Chapter: 1
Coding: 1
Data: 1
...

13. Creating a Word Cloud from a Corpus

import nltk
from nltk.corpus import webtext
from nltk.probability import FreqDist
from wordcloud import WordCloud
import matplotlib.pyplot as plt

nltk.download('webtext')
wt_words = webtext.words('testing.txt')  # Sample data
data_analysis = nltk.FreqDist(wt_words)

# Keep only the words that are longer than 3 characters.
filter_words = dict([(m, n) for m, n in data_analysis.items() if len(m) > 3])

# Build the word cloud from the frequency dict.
wcloud = WordCloud().generate_from_frequencies(filter_words)

# Plot the word cloud.
plt.imshow(wcloud, interpolation="bilinear")
plt.axis("off")
plt.show()

14. Lexical Dispersion Plot with NLTK

import nltk
from nltk.corpus import webtext
import matplotlib.pyplot as plt

words = ['data', 'science', 'dataset']

nltk.download('webtext')
wt_words = webtext.words('testing.txt')  # Sample data

# Record the corpus offset of every occurrence of each target word.
points = [(x, y) for x in range(len(wt_words))
          for y in range(len(words)) if wt_words[x] == words[y]]

if points:
    x, y = zip(*points)
else:
    x = y = ()

plt.plot(x, y, "rx", scalex=.1)
plt.yticks(range(len(words)), words, color="b")
plt.ylim(-1, len(words))
plt.title("Lexical Dispersion Plot")
plt.xlabel("Word Offset")
plt.show()
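
NLTK also ships a ready-made dispersion plot on nltk.Text, which avoids building the offset list by hand; a minimal sketch over the same sample corpus:

import nltk
from nltk.corpus import webtext

nltk.download('webtext')
text = nltk.Text(webtext.words('testing.txt'))
# One row per target word, with a mark at every offset where it occurs.
text.dispersion_plot(['data', 'science', 'dataset'])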

15. Converting Text to Numbers with CountVectorizer

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Sample data for analysis
data1 = "Java is a language for programming that develops a software for several platforms. A compiled code or bytecode on Java application can run on most of the operating systems including Linux, Mac operating system, and Linux. Most of the syntax of Java is derived from the C++ and C languages."
data2 = "Python supports multiple programming paradigms and comes up with a large standard library, paradigms included are object-oriented, imperative, functional and procedural."
data3 = "Go is typed statically compiled language. It was created by Robert Griesemer, Ken Thompson, and Rob Pike in 2009. This language offers garbage collection, concurrency of CSP-style, memory safety, and structural typing."

df1 = pd.DataFrame({'Java': [data1], 'Python': [data2], 'Go': [data3]})

# Initialize the vectorizer and build the term counts per document.
vectorizer = CountVectorizer()
doc_vec = vectorizer.fit_transform(df1.iloc[0])

# Create a DataFrame with one row per vocabulary term.
df2 = pd.DataFrame(doc_vec.toarray().transpose(),
                   index=vectorizer.get_feature_names())

# Change the column headers.
df2.columns = df1.columns
print(df2)
  • Output:
             Go  Java  Python
and           2     2       2
application   0     1       0
are           1     0       1
bytecode      0     1       0
can           0     1       0
code          0     1       0
comes         1     0       1
compiled      0     1       0
derived       0     1       0
develops      0     1       0
for           0     2       0
from          0     1       0
functional    1     0       1
imperative    1     0       1
...

16. Creating a Document-Term Matrix with TF-IDF

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Sample data for analysis
data1 = "Java is a language for programming that develops a software for several platforms. A compiled code or bytecode on Java application can run on most of the operating systems including Linux, Mac operating system, and Linux. Most of the syntax of Java is derived from the C++ and C languages."
data2 = "Python supports multiple programming paradigms and comes up with a large standard library, paradigms included are object-oriented, imperative, functional and procedural."
data3 = "Go is typed statically compiled language. It was created by Robert Griesemer, Ken Thompson, and Rob Pike in 2009. This language offers garbage collection, concurrency of CSP-style, memory safety, and structural typing."

df1 = pd.DataFrame({'Java': [data1], 'Python': [data2], 'Go': [data3]})

# Initialize the vectorizer and build the TF-IDF weights per document.
vectorizer = TfidfVectorizer()
doc_vec = vectorizer.fit_transform(df1.iloc[0])

# Create a DataFrame with one row per vocabulary term.
df2 = pd.DataFrame(doc_vec.toarray().transpose(),
                   index=vectorizer.get_feature_names())

# Change the column headers.
df2.columns = df1.columns
print(df2)
  • Output:
                 Go      Java    Python
and          0.323751  0.137553  0.323751
application  0.000000  0.116449  0.000000
are          0.208444  0.000000  0.208444
bytecode     0.000000  0.116449  0.000000
can          0.000000  0.116449  0.000000
code         0.000000  0.116449  0.000000
comes        0.208444  0.000000  0.208444
compiled     0.000000  0.116449  0.000000
derived      0.000000  0.116449  0.000000
develops     0.000000  0.116449  0.000000
for          0.000000  0.232898  0.000000
...

17. Generating N-grams for a Given Sentence

  • NLTK:
import nltk
from nltk.util import ngrams

# Function to generate n-grams from a sentence.
def extract_ngrams(data, num):
    n_grams = ngrams(nltk.word_tokenize(data), num)
    return [' '.join(grams) for grams in n_grams]

data = 'A class is a blueprint for the object.'
print("1-gram: ", extract_ngrams(data, 1))
print("2-gram: ", extract_ngrams(data, 2))
print("3-gram: ", extract_ngrams(data, 3))
print("4-gram: ", extract_ngrams(data, 4))
  • TextBlob:
from textblob import TextBlob

# Function to generate n-grams from a sentence.
def extract_ngrams(data, num):
    n_grams = TextBlob(data).ngrams(num)
    return [' '.join(grams) for grams in n_grams]

data = 'A class is a blueprint for the object.'
print("1-gram: ", extract_ngrams(data, 1))
print("2-gram: ", extract_ngrams(data, 2))
print("3-gram: ", extract_ngrams(data, 3))
print("4-gram: ", extract_ngrams(data, 4))
  • Output:
1-gram:  ['A', 'class', 'is', 'a', 'blueprint', 'for', 'the', 'object']
2-gram:  ['A class', 'class is', 'is a', 'a blueprint', 'blueprint for', 'for the', 'the object']
3-gram:  ['A class is', 'class is a', 'is a blueprint', 'a blueprint for', 'blueprint for the', 'for the object']
4-gram:  ['A class is a', 'class is a blueprint', 'is a blueprint for', 'a blueprint for the', 'blueprint for the object']

18. Building a Bigram Vocabulary with sklearn's CountVectorizer

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Sample data for analysis
data1 = "Machine language is a low-level programming language. It is easily understood by computers but difficult to read by people. This is why people use higher level programming languages. Programs written in high-level languages are also either compiled and/or interpreted into machine language so that computers can execute them."
data2 = "Assembly language is a representation of machine language. In other words, each assembly language instruction translates to a machine language instruction. Though assembly language statements are readable, the statements are still low-level. A disadvantage of assembly language is that it is not portable, because each platform comes with a particular Assembly Language"

df1 = pd.DataFrame({'Machine': [data1], 'Assembly': [data2]})

# Initialize the vectorizer with a bigram-only vocabulary.
vectorizer = CountVectorizer(ngram_range=(2, 2))
doc_vec = vectorizer.fit_transform(df1.iloc[0])

# Create a DataFrame with one row per bigram.
df2 = pd.DataFrame(doc_vec.toarray().transpose(),
                   index=vectorizer.get_feature_names())

# Change the column headers.
df2.columns = df1.columns
print(df2)
  • Output:
                       Assembly  Machine
also either                    0        1
and or                         0        1
are also                       0        1
are readable                   1        0
are still                      1        0
assembly language              5        0
because each                   1        0
but difficult                  0        1
by computers                   0        1
by people                      0        1
can execute                    0        1
...

19. Extracting Noun Phrases with TextBlob

from textblob import TextBlob

# Extract noun phrases.
blob = TextBlob("Canada is a country in the northern part of North America.")
for nouns in blob.noun_phrases:
    print(nouns)
  • Output:
canada
northern part
america

20. Computing a Word-Word Co-occurrence Matrix

import numpy as np
import nltk
from nltk import bigrams
import itertools
import pandas as pd


def generate_co_occurrence_matrix(corpus):
    vocab = set(corpus)
    vocab = list(vocab)
    vocab_index = {word: i for i, word in enumerate(vocab)}

    # Create bigrams from all words in the corpus.
    bi_grams = list(bigrams(corpus))

    # Frequency distribution of bigrams ((word1, word2), num_occurrences).
    bigram_freq = nltk.FreqDist(bi_grams).most_common(len(bi_grams))

    # Initialise the co-occurrence matrix.
    # co_occurrence_matrix[current][previous]
    co_occurrence_matrix = np.zeros((len(vocab), len(vocab)))

    # Loop through the bigrams, taking the current and previous word
    # and the number of occurrences of the bigram.
    for bigram in bigram_freq:
        current = bigram[0][1]
        previous = bigram[0][0]
        count = bigram[1]
        pos_current = vocab_index[current]
        pos_previous = vocab_index[previous]
        co_occurrence_matrix[pos_current][pos_previous] = count
    co_occurrence_matrix = np.matrix(co_occurrence_matrix)

    # Return the matrix and the index.
    return co_occurrence_matrix, vocab_index


text_data = [['Where', 'Python', 'is', 'used'],
             ['What', 'is', 'Python' 'used', 'in'],
             ['Why', 'Python', 'is', 'best'],
             ['What', 'companies', 'use', 'Python']]

# Flatten the lists of tokens into one list.
data = list(itertools.chain.from_iterable(text_data))
matrix, vocab_index = generate_co_occurrence_matrix(data)

data_matrix = pd.DataFrame(matrix, index=vocab_index,
                           columns=vocab_index)
print(data_matrix)
  • Output:
  best  use  What  Where  ...    in   is  Python  used
best         0.0  0.0   0.0    0.0  ...   0.0  0.0     0.0   1.0
use          0.0  0.0   0.0    0.0  ...   0.0  1.0     0.0   0.0
What         1.0  0.0   0.0    0.0  ...   0.0  0.0     0.0   0.0
Where        0.0  0.0   0.0    0.0  ...   0.0  0.0     0.0   0.0
Pythonused   0.0  0.0   0.0    0.0  ...   0.0  0.0     0.0   1.0
Why          0.0  0.0   0.0    0.0  ...   0.0  0.0     0.0   1.0
companies    0.0  1.0   0.0    1.0  ...   1.0  0.0     0.0   0.0
in           0.0  0.0   0.0    0.0  ...   0.0  0.0     1.0   0.0
is           0.0  0.0   1.0    0.0  ...   0.0  0.0     0.0   0.0
Python       0.0  0.0   0.0    0.0  ...   0.0  0.0     0.0   0.0
used         0.0  0.0   1.0    0.0  ...   0.0  0.0     0.0   0.0

[11 rows x 11 columns]

21. Sentiment Analysis with TextBlob

from textblob import TextBlob


def sentiment(polarity):
    if polarity < 0:
        print("Negative")
    elif polarity > 0:
        print("Positive")
    else:
        print("Neutral")


blob = TextBlob("The movie was excellent!")
print(blob.sentiment)
sentiment(blob.sentiment.polarity)

blob = TextBlob("The movie was not bad.")
print(blob.sentiment)
sentiment(blob.sentiment.polarity)

blob = TextBlob("The movie was ridiculous.")
print(blob.sentiment)
sentiment(blob.sentiment.polarity)
  • Output:
Sentiment(polarity=1.0, subjectivity=1.0)
Positive
Sentiment(polarity=0.3499999999999999, subjectivity=0.6666666666666666)
Positive
Sentiment(polarity=-0.3333333333333333, subjectivity=1.0)
Negative
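
TextBlob's polarity comes from its pattern lexicon; NLTK's VADER analyzer is a rule-based alternative that handles negation and intensifiers well. A minimal sketch on the same three sentences, assuming the vader_lexicon resource is available:

import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')  # one-time lexicon download
sia = SentimentIntensityAnalyzer()
for text in ["The movie was excellent!", "The movie was not bad.", "The movie was ridiculous."]:
    compound = sia.polarity_scores(text)["compound"]  # ranges from -1.0 to 1.0
    label = "Positive" if compound > 0 else "Negative" if compound < 0 else "Neutral"
    print(text, compound, label)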

22. Language Translation with Goslate

import goslate

text = "Comment vas-tu?"
gs = goslate.Goslate()

translatedText = gs.translate(text, 'en')
print(translatedText)

translatedText = gs.translate(text, 'zh')
print(translatedText)

translatedText = gs.translate(text, 'de')
print(translatedText)

23. Language Detection and Translation with TextBlob

from textblob import TextBlob

blob = TextBlob("Comment vas-tu?")
print(blob.detect_language())

print(blob.translate(to='es'))
print(blob.translate(to='en'))
print(blob.translate(to='zh'))
  • Output:
fr
¿Como estas tu?
How are you?
你好吗?
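
detect_language() and translate() were removed from recent TextBlob releases (they relied on a Google endpoint that no longer accepts these requests), so the snippet above may fail on a current install. A hedged alternative sketch using the deep-translator package instead of TextBlob:

# pip install deep-translator
from deep_translator import GoogleTranslator

text = "Comment vas-tu?"
print(GoogleTranslator(source='auto', target='en').translate(text))     # English
print(GoogleTranslator(source='auto', target='zh-CN').translate(text))  # Chinese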

24. Getting Definitions and Synonyms with TextBlob

from textblob import Word

text_word = Word('safe')

# Definitions of the word.
print(text_word.definitions)

# Collect synonyms from every synset.
synonyms = set()
for synset in text_word.synsets:
    for lemma in synset.lemmas():
        synonyms.add(lemma.name())

print(synonyms)
  • Output:
['strongbox where valuables can be safely kept', 'a ventilated or refrigerated cupboard for securing provisions from pests', 'contraceptive device consisting of a sheath of thin rubber or latex that is worn over the penis during intercourse', 'free from danger or the risk of harm', '(of an undertaking) secure from risk', 'having reached a base without being put out', 'financially sound']
{'secure', 'rubber', 'good', 'safety', 'safe', 'dependable', 'condom', 'prophylactic'}

25. Getting a List of Antonyms with TextBlob

from textblob import Word

text_word = Word('safe')

# Collect antonyms from every lemma that has one.
antonyms = set()
for synset in text_word.synsets:
    for lemma in synset.lemmas():
        if lemma.antonyms():
            antonyms.add(lemma.antonyms()[0].name())

print(antonyms)
  • Output:
{'dangerous', 'out'}
