This post collects 25 Python text-processing examples that are worth keeping around. Processing text is an extremely common task in Python, so bookmark these snippets - you will end up needing them sooner or later.

Table of Contents

1. Extract PDF content
2. Extract Word content
3. Extract web page content
4. Read JSON data
5. Read CSV data
6. Remove punctuation from a string
7. Remove stop words with NLTK
8. Correct spelling with TextBlob
9. Word tokenization with NLTK and TextBlob
10. Stem the words of a sentence or phrase with NLTK
11. Lemmatize a sentence or phrase with NLTK
12. Find the frequency of each word in a text file with NLTK
13. Create a word cloud from a corpus
14. Lexical dispersion plot with NLTK
15. Convert text to numbers with CountVectorizer
16. Create a document-term matrix with TF-IDF
17. Generate N-grams for a given sentence
18. Build a bigram vocabulary with sklearn CountVectorizer
19. Extract noun phrases with TextBlob
20. How to compute a word-word co-occurrence matrix
21. Sentiment analysis with TextBlob
22. Language translation with Goslate
23. Language detection and translation with TextBlob
24. Get definitions and synonyms with TextBlob
25. Get a list of antonyms with TextBlob


1. Extract PDF content

# pip install PyPDF2

import PyPDF2

 

# Creating a pdf file object.

pdf = open("test.pdf", "rb")

 

# Creating pdf reader object.

pdf_reader = PyPDF2.PdfFileReader(pdf)

 

# Checking total number of pages in a pdf file.

print("Total number of Pages:", pdf_reader.numPages)

 

# Creating a page object.

page = pdf_reader.getPage(200)

 

# Extract data from a specific page number.

print(page.extractText())

 

# Closing the object.

pdf.close()
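Note: PdfFileReader, numPages, getPage() and extractText() belong to the legacy PyPDF2 API and have been removed in recent releases of pypdf (the renamed successor of PyPDF2). A minimal equivalent sketch for the newer API, assuming the same test.pdf:

# pip install pypdf
from pypdf import PdfReader

reader = PdfReader("test.pdf")

# Total number of pages.
print("Total number of Pages:", len(reader.pages))

# Extract text from the first page.
print(reader.pages[0].extract_text())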

2. Extract Word content

# pip install python-docx

import docx

 

 

def main():

    try:

        doc = docx.Document('test.docx')  # Creating word reader object.

        data = ""

        fullText = []

        for para in doc.paragraphs:

            fullText.append(para.text)

        data = '\n'.join(fullText)

 

        print(data)

 

    except IOError:

        print('There was an error opening the file!')

        return

 

 

if __name__ == '__main__':

    main()
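doc.paragraphs only covers body paragraphs and skips text inside tables. A small additional sketch, assuming the same test.docx and the standard python-docx objects, that walks the tables as well:

import docx

doc = docx.Document('test.docx')

# Each table is a grid of rows and cells; every cell exposes its text via .text.
for table in doc.tables:
    for row in table.rows:
        print('\t'.join(cell.text for cell in row.cells))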

3. Extract web page content

# pip install bs4

from urllib.request import Request, urlopen

from bs4 import BeautifulSoup

 

req = Request('http://www.cmegroup.com/trading/products/#sortField=oi&sortAsc=false&venues=3&page=1&cleared=1&group=1',

              headers={'User-Agent': 'Mozilla/5.0'})

 

webpage = urlopen(req).read()

 

# Parsing

soup = BeautifulSoup(webpage, 'html.parser')

 

# Formatting the parsed HTML

strhtm = soup.prettify()

 

# Print the first 500 characters

print(strhtm[:500])

 

# Extract the page title and a meta tag value

print(soup.title.string)

print(soup.find('meta', attrs={'property':'og:description'}))

 

# Extract anchor tag value

for x in soup.find_all('a'):

    print(x.string)

 

# Extract Paragraph tag value    

for x in soup.find_all('p'):

    print(x.text)

4. Read JSON data

import requests

import json

r = requests.get("https://support.oneskyapp.com/hc/en-us/article_attachments/202761727/example_2.json")

res = r.json()

# Extract specific node content.

print(res['quiz']['sport'])

# Dump data as string

data = json.dumps(res)

print(data)
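The same json module also parses local files; a minimal sketch, assuming a hypothetical example.json with the same structure as the downloaded sample:

import json

# 'example.json' is a placeholder file name for this sketch.
with open("example.json", "r", encoding="utf-8") as f:
    res = json.load(f)

print(res['quiz']['sport'])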

5. Read CSV data

import csv

with open('test.csv','r') as csv_file:

    reader = csv.reader(csv_file)

    next(reader) # Skip first row

    for row in reader:

        print(row)
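If the first row of the CSV file is a header, csv.DictReader uses it as the keys of each row, so there is no need to skip it manually - a minimal sketch with the same test.csv:

import csv

with open('test.csv', 'r') as csv_file:
    for row in csv.DictReader(csv_file):
        # Each row is a dict keyed by the column names from the header row.
        print(row)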

6. Remove punctuation from a string

import re

import string

 

data = "Stuning even for the non-gamer: This sound track was beautiful!\

It paints the senery in your mind so well I would recomend\

it even to people who hate vid. game music! I have played the game Chrono \

Cross but out of all of the games I have ever played it has the best music! \

It backs away from crude keyboarding and takes a fresher step with grate\

guitars and soulful orchestras.\

It would impress anyone who cares to listen!"

 

# Method 1: Regex

# Remove the special characters from the string.

no_specials_string = re.sub('[!#?,.:";]', '', data)

print(no_specials_string)

 

 

# Method 2: translate()

# Make a translator object.

translator = str.maketrans('', '', string.punctuation)

data = data.translate(translator)

print(data)

7. Remove stop words with NLTK

from nltk.corpus import stopwords

 

 

data = ['Stuning even for the non-gamer: This sound track was beautiful!\

It paints the senery in your mind so well I would recomend\

it even to people who hate vid. game music! I have played the game Chrono \

Cross but out of all of the games I have ever played it has the best music! \

It backs away from crude keyboarding and takes a fresher step with grate\

guitars and soulful orchestras.\

It would impress anyone who cares to listen!']

 

# Remove stop words (the stopwords corpus may need nltk.download('stopwords') on first use)

stop_words = set(stopwords.words('english'))

 

output = []

for sentence in data:

    temp_list = []

    for word in sentence.split():

        if word.lower() not in stop_words:

            temp_list.append(word)

    output.append(' '.join(temp_list))

 

 

print(output)

8. Correct spelling with TextBlob

from textblob import TextBlob

data = "Natural language is a cantral part of our day to day life, and it's so antresting to work on any problem related to langages."

output = TextBlob(data).correct()

print(output)
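TextBlob can also rank candidate corrections for a single word; a minimal sketch using Word.spellcheck(), which returns (candidate, confidence) pairs:

from textblob import Word

# Prints a list of (candidate, confidence) tuples for the misspelled word.
print(Word('antresting').spellcheck())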

9. Word tokenization with NLTK and TextBlob

import nltk

from textblob import TextBlob

data = "Natural language is a central part of our day to day life, and it's so interesting to work on any problem related to languages."

nltk_output = nltk.word_tokenize(data)

textblob_output = TextBlob(data).words

print(nltk_output)

print(textblob_output)

Output:

['Natural', 'language', 'is', 'a', 'central', 'part', 'of', 'our', 'day', 'to', 'day', 'life', ',', 'and', 'it', "'s", 'so', 'interesting', 'to', 'work', 'on', 'any', 'problem', 'related', 'to', 'languages', '.']
['Natural', 'language', 'is', 'a', 'central', 'part', 'of', 'our', 'day', 'to', 'day', 'life', 'and', 'it', "'s", 'so', 'interesting', 'to', 'work', 'on', 'any', 'problem', 'related', 'to', 'languages']
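Both libraries can also split text into sentences rather than words; a minimal sketch (NLTK may need nltk.download('punkt') for its tokenizer models):

import nltk
from textblob import TextBlob

data = "Natural language is a central part of our day to day life. It's interesting to work on any problem related to languages."
print(nltk.sent_tokenize(data))
print(TextBlob(data).sentences)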

10. Stem the words of a sentence or phrase with NLTK

from nltk.stem import PorterStemmer

 

st = PorterStemmer()

text = ['Where did he learn to dance like that?',

        'His eyes were dancing with humor.',

        'She shook her head and danced away',

        'Alex was an excellent dancer.']

 

output = []

for sentence in text:

    output.append(" ".join([st.stem(i) for i in sentence.split()]))

 

for item in output:

    print(item)

 

print("-" * 50)

print(st.stem('jumping'), st.stem('jumps'), st.stem('jumped'))

Output:

where did he learn to danc like that?
hi eye were danc with humor.
she shook her head and danc away
alex wa an excel dancer.
--------------------------------------------------
jump jump jump
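PorterStemmer is the classic algorithm; NLTK also ships the Snowball ("Porter2") stemmer, which often produces cleaner stems - a minimal sketch:

from nltk.stem import SnowballStemmer

snow = SnowballStemmer('english')

# Same idea as above, with a different stemming algorithm.
print(snow.stem('dancing'), snow.stem('dancer'), snow.stem('generously'))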

11. Lemmatize a sentence or phrase with NLTK

from nltk.stem import WordNetLemmatizer

wnl = WordNetLemmatizer()

text = ['She gripped the armrest as he passed two cars at a time.',

        'Her car was in full view.',

        'A number of cars carried out of state license plates.']

output = []

for sentence in text:

    output.append(" ".join([wnl.lemmatize(i) for i in sentence.split()]))

for item in output:

    print(item)

print("*" * 10)

print(wnl.lemmatize('jumps', 'n'))

print(wnl.lemmatize('jumping', 'v'))

print(wnl.lemmatize('jumped', 'v'))

print("*" * 10)

print(wnl.lemmatize('saddest', 'a'))

print(wnl.lemmatize('happiest', 'a'))

print(wnl.lemmatize('easiest', 'a'))

Output:

She gripped the armrest a he passed two car at a time.
Her car wa in full view.
A number of car carried out of state license plates.
**********
jump
jump
jump
**********
sad
happy
easy
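The sentence loop above lemmatizes every token as a noun (the lemmatizer's default), which is why "was" only loses its final "s". A sketch that derives the part of speech from nltk.pos_tag first (assumes the averaged_perceptron_tagger and wordnet resources are downloaded):

import nltk
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

def wordnet_pos(treebank_tag):
    # Map Penn Treebank tags to the WordNet POS constants the lemmatizer expects.
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    if treebank_tag.startswith('V'):
        return wordnet.VERB
    if treebank_tag.startswith('R'):
        return wordnet.ADV
    return wordnet.NOUN

wnl = WordNetLemmatizer()
sentence = 'Her car was in full view.'
tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
print(" ".join(wnl.lemmatize(word, wordnet_pos(tag)) for word, tag in tagged))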

12. Find the frequency of each word in a text file with NLTK

import nltk

from nltk.corpus import webtext

from nltk.probability import FreqDist

 

nltk.download('webtext')

wt_words = webtext.words('testing.txt')

data_analysis = nltk.FreqDist(wt_words)

 

# Keep only the words that are longer than 3 characters.

filter_words = dict([(m, n) for m, n in data_analysis.items() if len(m) > 3])

 

for key in sorted(filter_words):

    print("%s: %s" % (key, filter_words[key]))

 

data_analysis = nltk.FreqDist(filter_words)

 

data_analysis.plot(25, cumulative=False)

Output:

[nltk_data] Downloading package webtext to
[nltk_data]     C:\Users\amit\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\webtext.zip.
1989: 1
Accessing: 1
Analysis: 1
Anyone: 1
Chapter: 1
Coding: 1
Data: 1
...

13. Create a word cloud from a corpus

import nltk

from nltk.corpus import webtext

from nltk.probability import FreqDist

from wordcloud import WordCloud

import matplotlib.pyplot as plt

 

nltk.download('webtext')

wt_words = webtext.words('testing.txt')  # Sample data

data_analysis = nltk.FreqDist(wt_words)

 

filter_words = dict([(m, n) for m, n in data_analysis.items() if len(m) > 3])

 

wcloud = WordCloud().generate_from_frequencies(filter_words)

 

# Plotting the wordcloud

plt.imshow(wcloud, interpolation="bilinear")

 

plt.axis("off")


plt.show()

14. Lexical dispersion plot with NLTK

import nltk

from nltk.corpus import webtext

import matplotlib.pyplot as plt

 

words = ['data', 'science', 'dataset']

 

nltk.download('webtext')

wt_words = webtext.words('testing.txt')  # Sample data

 

points = [(x, y) for x in range(len(wt_words))

          for y in range(len(words)) if wt_words[x] == words[y]]

 

if points:

    x, y = zip(*points)

else:

    x = y = ()

 

plt.plot(x, y, "rx")

plt.yticks(range(len(words)), words, color="b")

plt.ylim(-1, len(words))

plt.title("Lexical Dispersion Plot")

plt.xlabel("Word Offset")

plt.show()
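NLTK also has a built-in helper that produces the same kind of plot without manually collecting the points; a minimal sketch using nltk.Text on the same sample file:

import nltk
from nltk.corpus import webtext

nltk.download('webtext')

# Text.dispersion_plot draws the offsets of each word directly.
text = nltk.Text(webtext.words('testing.txt'))
text.dispersion_plot(['data', 'science', 'dataset'])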

15. Convert text to numbers with CountVectorizer

import pandas as pd

from sklearn.feature_extraction.text import CountVectorizer

 

# Sample data for analysis

data1 = "Java is a language for programming that develops a software for several platforms. A compiled code or bytecode on Java application can run on most of the operating systems including Linux, Mac operating system, and Linux. Most of the syntax of Java is derived from the C++ and C languages."

data2 = "Python supports multiple programming paradigms and comes up with a large standard library, paradigms included are object-oriented, imperative, functional and procedural."

data3 = "Go is typed statically compiled language. It was created by Robert Griesemer, Ken Thompson, and Rob Pike in 2009. This language offers garbage collection, concurrency of CSP-style, memory safety, and structural typing."

 

df1 = pd.DataFrame({'Java': [data1], 'Python': [data2], 'Go': [data3]})

 

# Initialize

vectorizer = CountVectorizer()

doc_vec = vectorizer.fit_transform(df1.iloc[0])

 

# Create dataFrame

df2 = pd.DataFrame(doc_vec.toarray().transpose(),

                   index=vectorizer.get_feature_names())

 

# Change column headers

df2.columns = df1.columns

print(df2)

Output:

Go  Java  Python
and           2     2       2
application   0     1       0
are           1     0       1
bytecode      0     1       0
can           0     1       0
code          0     1       0
comes         1     0       1
compiled      0     1       0
derived       0     1       0
develops      0     1       0
for           0     2       0
from          0     1       0
functional    1     0       1
imperative    1     0       1
...
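Note that get_feature_names() was deprecated in scikit-learn 1.0 and removed in later releases; on recent versions use get_feature_names_out() instead. A minimal self-contained sketch with two hypothetical documents:

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

docs = ["Java is a compiled language", "Python is an interpreted language"]
vectorizer = CountVectorizer()
doc_vec = vectorizer.fit_transform(docs)

# get_feature_names_out() replaces the removed get_feature_names().
df = pd.DataFrame(doc_vec.toarray().transpose(),
                  index=vectorizer.get_feature_names_out(),
                  columns=["Java", "Python"])
print(df)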

16. Create a document-term matrix with TF-IDF

import pandas as pd

from sklearn.feature_extraction.text import TfidfVectorizer

# Sample data for analysis

data1 = "Java is a language for programming that develops a software for several platforms. A compiled code or bytecode on Java application can run on most of the operating systems including Linux, Mac operating system, and Linux. Most of the syntax of Java is derived from the C++ and C languages."

data2 = "Python supports multiple programming paradigms and comes up with a large standard library, paradigms included are object-oriented, imperative, functional and procedural."

data3 = "Go is typed statically compiled language. It was created by Robert Griesemer, Ken Thompson, and Rob Pike in 2009. This language offers garbage collection, concurrency of CSP-style, memory safety, and structural typing."

df1 = pd.DataFrame({'Java': [data1], 'Python': [data2], 'Go': [data3]})

# Initialize

vectorizer = TfidfVectorizer()

doc_vec = vectorizer.fit_transform(df1.iloc[0])

# Create dataFrame

df2 = pd.DataFrame(doc_vec.toarray().transpose(),

                   index=vectorizer.get_feature_names())

# Change column headers

df2.columns = df1.columns

print(df2)

Output:

Go      Java    Python
and          0.323751  0.137553  0.323751
application  0.000000  0.116449  0.000000
are          0.208444  0.000000  0.208444
bytecode     0.000000  0.116449  0.000000
can          0.000000  0.116449  0.000000
code         0.000000  0.116449  0.000000
comes        0.208444  0.000000  0.208444
compiled     0.000000  0.116449  0.000000
derived      0.000000  0.116449  0.000000
develops     0.000000  0.116449  0.000000
for          0.000000  0.232898  0.000000
...

17. Generate N-grams for a given sentence

Natural Language Toolkit: NLTK

import nltk

from nltk.util import ngrams

# Function to generate n-grams from sentences.

def extract_ngrams(data, num):

    n_grams = ngrams(nltk.word_tokenize(data), num)

    return [ ' '.join(grams) for grams in n_grams]

data = 'A class is a blueprint for the object.'

print("1-gram: ", extract_ngrams(data, 1))

print("2-gram: ", extract_ngrams(data, 2))

print("3-gram: ", extract_ngrams(data, 3))

print("4-gram: ", extract_ngrams(data, 4))

Text processing tool: TextBlob

from textblob import TextBlob

 

# Function to generate n-grams from sentences.

def extract_ngrams(data, num):

    n_grams = TextBlob(data).ngrams(num)

    return [ ' '.join(grams) for grams in n_grams]

 

data = 'A class is a blueprint for the object.'

 

print("1-gram: ", extract_ngrams(data, 1))

print("2-gram: ", extract_ngrams(data, 2))

print("3-gram: ", extract_ngrams(data, 3))

print("4-gram: ", extract_ngrams(data, 4))

Output:

1-gram:  ['A', 'class', 'is', 'a', 'blueprint', 'for', 'the', 'object']
2-gram:  ['A class', 'class is', 'is a', 'a blueprint', 'blueprint for', 'for the', 'the object']
3-gram:  ['A class is', 'class is a', 'is a blueprint', 'a blueprint for', 'blueprint for the', 'for the object']
4-gram:  ['A class is a', 'class is a blueprint', 'is a blueprint for', 'a blueprint for the', 'blueprint for the object']

18. Build a bigram vocabulary with sklearn CountVectorizer

import pandas as pd

from sklearn.feature_extraction.text import CountVectorizer

 

# Sample data for analysis

data1 = "Machine language is a low-level programming language. It is easily understood by computers but difficult to read by people. This is why people use higher level programming languages. Programs written in high-level languages are also either compiled and/or interpreted into machine language so that computers can execute them."

data2 = "Assembly language is a representation of machine language. In other words, each assembly language instruction translates to a machine language instruction. Though assembly language statements are readable, the statements are still low-level. A disadvantage of assembly language is that it is not portable, because each platform comes with a particular Assembly Language"

 

df1 = pd.DataFrame({'Machine': [data1], 'Assembly': [data2]})

 

# Initialize

vectorizer = CountVectorizer(ngram_range=(2, 2))

doc_vec = vectorizer.fit_transform(df1.iloc[0])

 

# Create dataFrame

df2 = pd.DataFrame(doc_vec.toarray().transpose(),

                   index=vectorizer.get_feature_names())

 

# Change column headers

df2.columns = df1.columns

print(df2)

Output:

Assembly  Machine
also either                    0        1
and or                         0        1
are also                       0        1
are readable                   1        0
are still                      1        0
assembly language              5        0
because each                   1        0
but difficult                  0        1
by computers                   0        1
by people                      0        1
can execute                    0        1
...

19. Extract noun phrases with TextBlob

from textblob import TextBlob

#Extract noun

blob = TextBlob("Canada is a country in the northern part of North America.")

for nouns in blob.noun_phrases:

    print(nouns)

Output:

canada
northern part
america

20. How to compute a word-word co-occurrence matrix

import numpy as np

import nltk

from nltk import bigrams

import itertools

import pandas as pd

 

 

def generate_co_occurrence_matrix(corpus):

    vocab = set(corpus)

    vocab = list(vocab)

    vocab_index = {word: i for i, word in enumerate(vocab)}

 

    # Create bigrams from all words in corpus

    bi_grams = list(bigrams(corpus))

 

    # Frequency distribution of bigrams ((word1, word2), num_occurrences)

    bigram_freq = nltk.FreqDist(bi_grams).most_common(len(bi_grams))

 

    # Initialise co-occurrence matrix

    # co_occurrence_matrix[current][previous]

    co_occurrence_matrix = np.zeros((len(vocab), len(vocab)))

 

    # Loop through the bigrams taking the current and previous word,

    # and the number of occurrences of the bigram.

    for bigram in bigram_freq:

        current = bigram[0][1]

        previous = bigram[0][0]

        count = bigram[1]

        pos_current = vocab_index[current]

        pos_previous = vocab_index[previous]

        co_occurrence_matrix[pos_current][pos_previous] = count

    co_occurrence_matrix = np.matrix(co_occurrence_matrix)

 

    # return the matrix and the index

    return co_occurrence_matrix, vocab_index

 

 

text_data = [['Where', 'Python', 'is', 'used'],

             ['What', 'is', 'Python', 'used', 'in'],

             ['Why', 'Python', 'is', 'best'],

             ['What', 'companies', 'use', 'Python']]

 

# Create one list using many lists

data = list(itertools.chain.from_iterable(text_data))

matrix, vocab_index = generate_co_occurrence_matrix(data)

 

 

data_matrix = pd.DataFrame(matrix, index=vocab_index,

                             columns=vocab_index)

print(data_matrix)

Output:

best  use  What  Where  ...    in   is  Python  used
best         0.0  0.0   0.0    0.0  ...   0.0  0.0     0.0   1.0
use          0.0  0.0   0.0    0.0  ...   0.0  1.0     0.0   0.0
What         1.0  0.0   0.0    0.0  ...   0.0  0.0     0.0   0.0
Where        0.0  0.0   0.0    0.0  ...   0.0  0.0     0.0   0.0
Pythonused   0.0  0.0   0.0    0.0  ...   0.0  0.0     0.0   1.0
Why          0.0  0.0   0.0    0.0  ...   0.0  0.0     0.0   1.0
companies    0.0  1.0   0.0    1.0  ...   1.0  0.0     0.0   0.0
in           0.0  0.0   0.0    0.0  ...   0.0  0.0     1.0   0.0
is           0.0  0.0   1.0    0.0  ...   0.0  0.0     0.0   0.0
Python       0.0  0.0   0.0    0.0  ...   0.0  0.0     0.0   0.0
used         0.0  0.0   1.0    0.0  ...   0.0  0.0     0.0   0.0
 
[11 rows x 11 columns]

21. Sentiment analysis with TextBlob

from textblob import TextBlob

def sentiment(polarity):

    if polarity < 0:

        print("Negative")

    elif polarity > 0:

        print("Positive")

    else:

        print("Neutral")

blob = TextBlob("The movie was excellent!")

print(blob.sentiment)

sentiment(blob.sentiment.polarity)

blob = TextBlob("The movie was not bad.")

print(blob.sentiment)

sentiment(blob.sentiment.polarity)

blob = TextBlob("The movie was ridiculous.")

print(blob.sentiment)

sentiment(blob.sentiment.polarity)

Output:

Sentiment(polarity=1.0, subjectivity=1.0)
Positive
Sentiment(polarity=0.3499999999999999, subjectivity=0.6666666666666666)
Positive
Sentiment(polarity=-0.3333333333333333, subjectivity=1.0)
Negative

22. Language translation with Goslate

import goslate

text = "Comment vas-tu?"

gs = goslate.Goslate()

translatedText = gs.translate(text, 'en')

print(translatedText)

translatedText = gs.translate(text, 'zh')

print(translatedText)

translatedText = gs.translate(text, 'de')

print(translatedText)
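Goslate scrapes an unofficial Google Translate endpoint, so it tends to break when that endpoint changes (the TextBlob translate()/detect_language() calls in the next section rely on the same service). If it fails, one commonly used alternative is the third-party deep-translator package - a hedged sketch, assuming its GoogleTranslator interface:

# pip install deep-translator   (assumption: separate third-party package, not part of Goslate)
from deep_translator import GoogleTranslator

text = "Comment vas-tu?"

print(GoogleTranslator(source='auto', target='en').translate(text))
print(GoogleTranslator(source='auto', target='de').translate(text))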

23. Language detection and translation with TextBlob

from textblob import TextBlob

 

blob = TextBlob("Comment vas-tu?")

 

print(blob.detect_language())

 

print(blob.translate(to='es'))

print(blob.translate(to='en'))

print(blob.translate(to='zh'))

Output:

fr
¿Como estas tu?
How are you?
你好吗?
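In recent TextBlob releases detect_language() and translate() have been deprecated, since they also call Google's translation service under the hood. For offline language detection, a hedged alternative sketch using the third-party langdetect package:

# pip install langdetect   (assumption: separate third-party package, not part of TextBlob)
from langdetect import detect

# Returns an ISO 639-1 language code, e.g. 'fr' for French input.
print(detect("Comment vas-tu?"))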

24. Get definitions and synonyms with TextBlob

from textblob import Word

 

text_word = Word('safe')

 

print(text_word.definitions)

 

synonyms = set()

for synset in text_word.synsets:

    for lemma in synset.lemmas():

        synonyms.add(lemma.name())

         

print(synonyms)

Output:

['strongbox where valuables can be safely kept', 'a ventilated or refrigerated cupboard for securing provisions from pests', 'contraceptive device consisting of a sheath of thin rubber or latex that is worn over the penis during intercourse', 'free from danger or the risk of harm', '(of an undertaking) secure from risk', 'having reached a base without being put out', 'financially sound']
{'secure', 'rubber', 'good', 'safety', 'safe', 'dependable', 'condom', 'prophylactic'}

25. Get a list of antonyms with TextBlob

from textblob import Word

text_word = Word('safe')

antonyms = set()

for synset in text_word.synsets:

    for lemma in synset.lemmas():        

        if lemma.antonyms():

            antonyms.add(lemma.antonyms()[0].name())        

print(antonyms)

Output:

{'dangerous', 'out'}
