Stanford ML Course: Python Port (Week 7, Programming Exercise ex6_2)

This post completes the second part of programming exercise ex6, spam classification, in Python. The introduction from the assignment reads:

Many email services today provide spam filters that are able to classify emails into spam and non-spam email with high accuracy. In this part of the exercise, you will use SVMs to build your own spam filter.

The code is as follows:

ex6_

# -*- coding: utf-8 -*-
"""
Created on Fri Jan  3 10:56:17 2020

@author: Lonely_hanhan
"""

'''
%% Machine Learning Online Class
%  Exercise 6 | Spam Classification with SVMs
%
%  Instructions
%  ------------
%
%  This file contains code that helps you get started on the
%  exercise. You will need to complete the following functions:
%
%     gaussianKernel.m
%     dataset3Params.m
%     processEmail.m
%     emailFeatures.m
%
%  For this exercise, you will not need to change any code in this file,
%  or any other files other than those mentioned above.
%'''
#import matplotlib.pyplot as plt
import scipy.io as sio
import numpy as np
#import scipy.optimize as op
import processEmail as pr
import emailFeatures as ef
from sklearn import svm
import getVocabList as ge

'''
%% ==================== Part 1: Email Preprocessing ====================
%  To use an SVM to classify emails into Spam v.s. Non-Spam, you first need
%  to convert each email into a vector of features. In this part, you will
%  implement the preprocessing steps for each email. You should
%  complete the code in processEmail.m to produce a word indices vector
%  for a given email.
'''
print('\nPreprocessing sample email (emailSample1.txt)\n')
# Extract features
# Raw string, so the backslashes in the Windows path are not treated as escapes
with open(r'D:\exercise\machine-learning-ex6\machine-learning-ex6\ex6\emailSample1.txt', 'r') as f:
    file_contents = f.read()
word_indices = pr.processEmail(file_contents)
# Print stats
print('Word Indices: ')
print(word_indices)

'''
%% ==================== Part 2: Feature Extraction ====================
%  Now, you will convert each email into a vector of features in R^n.
%  You should complete the code in emailFeatures.m to produce a feature
%  vector for a given email.
'''
print('\nExtracting features from sample email (emailSample1.txt)\n')
features = ef.emailFeatures(word_indices)
# Print stats
print('Length of feature vector: {}'.format(features.size))
print('Number of non-zero entries: {}'.format(np.flatnonzero(features).size))

'''
%% =========== Part 3: Train Linear SVM for Spam Classification ========
%  In this section, you will train a linear classifier to determine if an
%  email is Spam or Not-Spam.

% Load the Spam Email dataset
% You will have X, y in your environment
'''
Data = sio.loadmat(r'D:\exercise\machine-learning-ex6\machine-learning-ex6\ex6\spamTrain.mat')
X = Data['X']
y = Data['y'].flatten()

print('\nTraining Linear SVM (Spam Classification)\n')
print('(this may take 1 to 2 minutes) ...\n')

C = 0.1
# C must be passed by keyword in current scikit-learn releases
clf = svm.SVC(C=C, kernel='linear', tol=1e-3)
clf.fit(X, y)
p = clf.predict(X)
print('Training Accuracy: {}'.format(np.mean(p == y) * 100))
'''
%% =================== Part 4: Test Spam Classification ================
%  After training the classifier, we can evaluate it on a test set. We have
%  included a test set in spamTest.mat

% Load the test dataset
% You will have Xtest, ytest in your environment
'''
Data = sio.loadmat(r'D:\exercise\machine-learning-ex6\machine-learning-ex6\ex6\spamTest.mat')
Xtest = Data['Xtest']
ytest = Data['ytest'].flatten()

p = clf.predict(Xtest)
print('Test Accuracy: {}'.format(np.mean(p == ytest) * 100))

'''
%% ================= Part 5: Top Predictors of Spam ====================
%  Since the model we are training is a linear SVM, we can inspect the
%  weights learned by the model to understand better how it is determining
%  whether an email is spam or not. The following code finds the words with
%  the highest weights in the classifier. Informally, the classifier
%  'thinks' that these words are the most likely indicators of spam.
%
'''
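vocabList = ge.getVocabList()
# clf.coef_ holds the feature weights; np.argsort sorts in ascending order and
# returns indices, so the result is reversed to put the largest weights first
indices = np.argsort(clf.coef_).flatten()[::-1]
print(indices)

# Print the top 15 words, the same number as in the assignment handout
for i in range(15):
    # Feature column j corresponds to vocabulary index j + 1: vocab.txt is
    # 1-based and emailFeatures writes x[index - 1]. Looking up vocabList[j]
    # directly would print the neighbouring vocabulary entry instead.
    print('{} ({:0.6f})'.format(vocabList[indices[i] + 1], clf.coef_.flatten()[indices[i]]))

'''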
%% =================== Part 6: Try Your Own Emails =====================
%  Now that you've trained the spam classifier, you can use it on your own
%  emails! In the starter code, we have included spamSample1.txt,
%  spamSample2.txt, emailSample1.txt and emailSample2.txt as examples.
%  The following code reads in one of these emails and then uses your
%  learned SVM classifier to determine whether the email is Spam or
%  Not Spam

% Set the file to be read in (change this to spamSample2.txt,
% emailSample1.txt or emailSample2.txt to see different predictions on
% different email types). Try your own emails as well!
'''
print('\nPreprocessing sample email (emailSample2.txt)\n')
# Extract features
with open(r'D:\exercise\machine-learning-ex6\machine-learning-ex6\ex6\emailSample2.txt', 'r') as f:
    file_contents = f.read()
word_indices = pr.processEmail(file_contents)
# Print stats
print('Word Indices: ')
print(word_indices)

print('\nExtracting features from sample email (emailSample2.txt)\n')
features = ef.emailFeatures(word_indices)
print('Length of feature vector: {}'.format(features.size))
print('Number of non-zero entries: {}'.format(np.flatnonzero(features).size))

Xm = features.reshape((1, 1899))
pm = clf.predict(Xm)
print(pm)  # 0 is not spam, 1 is spam
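
Part 5 ranks the words by their weight in the linear SVM. The sketch below shows the same ranking step on a tiny made-up example, using only plain Python. The helper name `top_predictors`, the four-word vocabulary, and the weights are all hypothetical. The point to note is the index shift: position `j` in the weight vector belongs to vocabulary index `j + 1`, because vocab.txt counts from 1.

```python
def top_predictors(weights, vocab, k):
    """Return the k (word, weight) pairs with the largest weights.

    weights is 0-based (one entry per feature column); vocab maps the
    1-based vocabulary index to its word, like getVocabList() does.
    """
    # Sort feature positions by weight, largest first
    order = sorted(range(len(weights)), key=lambda j: weights[j], reverse=True)
    # Feature position j corresponds to vocabulary index j + 1
    return [(vocab[j + 1], weights[j]) for j in order[:k]]


# Hypothetical vocabulary and weights, for illustration only
vocab = {1: 'aa', 2: 'click', 3: 'our', 4: 'thank'}
weights = [-0.1, 0.47, 0.50, -0.3]

top = top_predictors(weights, vocab, 2)
for word, weight in top:
    print('{} ({:0.6f})'.format(word, weight))
```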

processEmail.m

# -*- coding: utf-8 -*-
"""
Created on Fri Jan  3 11:38:24 2020

@author: Lonely_duoha
"""
import getVocabList as ge
import re
import nltk, nltk.stem.porter
import numpy as np


def processEmail(email_contents):
    '''
    %PROCESSEMAIL preprocesses the body of an email and
    %returns a list of word_indices
    %   word_indices = PROCESSEMAIL(email_contents) preprocesses
    %   the body of an email and returns a list of indices of the
    %   words contained in the email.
    %
    '''
    # Load Vocabulary
    vocabList = ge.getVocabList()
    word_indices = np.array([], dtype=np.int64)

    # Lower case
    email_contents = email_contents.lower()

    # Strip all HTML
    # Looks for any expression that starts with < and ends with > and
    # does not have any < or > in the tag, and replaces it with a space
    email_contents = re.sub(r'<[^<>]+>', ' ', email_contents)

    # Handle Numbers: look for one or more characters between 0-9
    email_contents = re.sub(r'\d+', 'number', email_contents)

    # Handle URLs: look for strings starting with http:// or https://
    email_contents = re.sub(r'(http|https)://[^\s]*', 'httpaddr', email_contents)

    # Handle Email Addresses: look for strings with @ in the middle
    email_contents = re.sub(r'[^\s]+@[^\s]+', 'emailaddr', email_contents)

    # Handle $ sign
    email_contents = re.sub(r'[$]+', 'dollar', email_contents)

    # ===================== Tokenize Email =====================
    # Output the email to screen as well
    print('==== Processed Email ====')
    stemmer = nltk.stem.porter.PorterStemmer()
    # Split on punctuation and on any whitespace. The hyphen is escaped so that
    # ".-:" is not read as a character range, and \s covers line breaks; with
    # only a plain space in the set, words separated by a newline would be
    # fused into one token (for example "diskfor").
    tokens = re.split(r'[@$/#.\-:&*+=\[\]?!(){},\'">_<;%\s]', email_contents)

    # Tokenize and also get rid of any punctuation
    for token in tokens:
        # Remove any non alphanumeric characters
        token = re.sub('[^a-zA-Z0-9]', '', token)
        # Stem the word (reduce it to its stem)
        token = stemmer.stem(token)
        if len(token) < 1:
            continue
        print(token)
        for k, v in vocabList.items():
            if token == v:
                # Record the index if the word is in the vocabulary
                word_indices = np.append(word_indices, k)
    print('==================')
    return word_indices
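
To see what the normalization steps in processEmail do before tokenizing and stemming, here is a minimal sketch that applies the same regular-expression substitutions to one made-up sentence. The function name `normalize_email` and the sample text are assumptions for illustration; stemming with nltk is deliberately left out so the snippet needs only the standard library.

```python
import re


def normalize_email(text):
    """Apply the regex substitutions from processEmail (no stemming)."""
    text = text.lower()
    # Replace HTML tags with a space
    text = re.sub(r'<[^<>]+>', ' ', text)
    # Replace every run of digits with the word "number"
    text = re.sub(r'\d+', 'number', text)
    # Replace URLs with "httpaddr"
    text = re.sub(r'(http|https)://[^\s]*', 'httpaddr', text)
    # Replace email addresses with "emailaddr"
    text = re.sub(r'[^\s]+@[^\s]+', 'emailaddr', text)
    # Replace dollar signs with "dollar"
    text = re.sub(r'[$]+', 'dollar', text)
    return text


# Hypothetical sample text, not taken from the exercise files
sample = 'Visit <b>http://example.com/</b> now, only $100! Mail me@example.com'
normalized = normalize_email(sample)
print(normalized)
```

Note the order of the steps: digits are replaced before the dollar sign, which is why "$100" ends up as "dollarnumber" (and, after stemming, as tokens like "dollarnumb" in the output of the full function).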

getVocabList.m

# -*- coding: utf-8 -*-
"""
Created on Fri Jan  3 11:44:58 2020

@author: Lonely_hanhan
"""


def getVocabList():
    '''
    %GETVOCABLIST reads the fixed vocabulary list in vocab.txt and returns a
    %cell array of the words
    %   vocabList = GETVOCABLIST() reads the fixed vocabulary list in vocab.txt
    %   and returns a cell array of the words in vocabList.
    '''
    # Read the fixed vocabulary list
    #fid = open('D:\exercise\machine-learning-ex6\machine-learning-ex6\ex6\vocab.txt', 'r').read()
    # Store all dictionary words in cell array vocab{}
    #n = 1899 # Total number of words in the dictionary
    '''
    % For ease of implementation, we use a struct to map the strings => integers
    % In practice, you'll want to use some form of hashmap
    '''
    # In this port the result is a dict that maps the integer index to the word
    vocabList = {}
    with open('vocab.txt') as f:
        for line in f:
            (val, key) = line.split()
            vocabList[int(val)] = key
    return vocabList
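
The comment carried over from the course code suggests using a hashmap. getVocabList maps index to word, so processEmail has to scan every entry for each token. One possible alternative, sketched below under the assumption that each vocabulary line has the form "index, whitespace, word", is to build the reverse mapping (word to index) once, so that each token needs a single dict lookup. The helper name `parse_vocab` and the three sample lines are hypothetical.

```python
def parse_vocab(lines):
    """Build a word -> index dict from lines of the form 'index word'."""
    word_to_index = {}
    for line in lines:
        idx, word = line.split()
        word_to_index[word] = int(idx)
    return word_to_index


# Three illustrative lines in the assumed vocab.txt layout
sample_lines = ['1\taa', '2\tab', '3\tabil']
word_to_index = parse_vocab(sample_lines)

# Tokens that are not in the vocabulary are skipped
tokens = ['abil', 'zzz', 'aa']
word_indices = [word_to_index[t] for t in tokens if t in word_to_index]
print(word_indices)
```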

emailFeatures.m

# -*- coding: utf-8 -*-
"""
Created on Sun Jan  5 16:05:09 2020

@author: Lonely_hanhan
"""
import numpy as np
import getVocabList as ge


def emailFeatures(word_indices):
    '''
    %EMAILFEATURES takes in a word_indices vector and produces a feature vector
    %from the word indices
    %   x = EMAILFEATURES(word_indices) takes in a word_indices vector and
    %   produces a feature vector from the word indices.
    '''
    # Total number of words in the dictionary
    n = 1899
    # You need to return the following variables correctly.
    x = np.zeros((n, 1))
    '''
% ====================== YOUR CODE HERE ======================
% Instructions: Fill in this function to return a feature vector for the
%               given email (word_indices). To help make it easier to
%               process the emails, we have already pre-processed each
%               email and converted each word in the email into an index in
%               a fixed dictionary (of 1899 words). The variable
%               word_indices contains the list of indices of the words
%               which occur in one email.
%
%               Concretely, if an email has the text:
%
%                  The quick brown fox jumped over the lazy dog.
%
%               Then, the word_indices vector for this text might look
%               like:
%
%                   60  100   33   44   10     53  60  58   5
%
%               where, we have mapped each word onto a number, for example:
%
%                   the   -- 60
%                   quick -- 100
%                   ...
%
%              (note: the above numbers are just an example and are not the
%               actual mappings).
%
%              Your task is to take one such word_indices vector and construct
%              a binary feature vector that indicates whether a particular
%              word occurs in the email. That is, x(i) = 1 when word i
%              is present in the email. Concretely, if the word 'the' (say,
%              index 60) appears in the email, then x(60) = 1. The feature
%              vector should look like:
%
%              x = [ 0 0 0 0 1 0 0 0 ... 0 0 0 0 1 ... 0 0 0 1 0 ..];
    '''
    vocabList = ge.getVocabList()
    for idx in word_indices:
        if idx in vocabList:
            # Vocabulary indices are 1-based, array positions are 0-based
            x[idx - 1] = 1
    return x
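
The feature vector is binary, so a word that occurs several times in an email still sets only one entry. That is why the first sample email below yields 53 word indices but only 45 non-zero features. A minimal sketch of the same mapping with plain Python lists follows; the function name `email_features` and the toy dictionary size of 6 are assumptions for illustration.

```python
def email_features(word_indices, n):
    """Return a binary list of length n; entry i - 1 is 1 if word i occurs."""
    x = [0] * n
    for idx in word_indices:
        # Vocabulary indices run from 1 to n; ignore anything outside that range
        if 1 <= idx <= n:
            x[idx - 1] = 1
    return x


# Word 3 appears twice but still sets a single entry
x = email_features([3, 1, 3, 5], n=6)
print(x)
print('Number of non-zero entries: {}'.format(sum(x)))
```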

Results of the author's run:

Preprocessing sample email (emailSample1.txt)

==== Processed Email ====
anyon
know
how
much
it
cost
to
host
a
web
portal
well
it
depend
on
how
mani
visitor
you
re
expect
thi
can
be
anywher
from
less
than
number
buck
a
month
to
a
coupl
of
dollarnumb
you
should
checkout
httpaddr
or
perhap
amazon
ecnumb
if
your
run
someth
big
to
unsubscrib
yourself
from
thi
mail
list
send
an
email
to
emailaddr
==================
Word Indices:
[  86  916  794 1077  883  370 1699  790 1822 1831  883  431 1171  794
 1002 1893 1364  592 1676  238  162   89  688  945 1663 1120 1062 1699
  375 1162  479 1893 1510  799 1182 1237  810 1895 1440 1547  181 1699
 1758 1896  688 1676  992  961 1477   71  530 1699  531]

Extracting features from sample email (emailSample1.txt)

Length of feature vector: 1899
Number of non-zero entries: 45

Training Linear SVM (Spam Classification)

(this may take 1 to 2 minutes) ...

Training Accuracy: 99.825
Test Accuracy: 98.9
[1190  297 1397 ... 1764 1665 1560]
otherwis (0.500614)
clearli (0.465916)
remot (0.422869)
gt (0.383622)
visa (0.367710)
base (0.345064)
doesn (0.323632)
wife (0.269724)
previous (0.267298)
player (0.261169)
mortgag (0.257298)
natur (0.253941)
ll (0.253467)
futur (0.248297)
hot (0.246404)

Preprocessing sample email (emailSample2.txt)

==== Processed Email ====
folk
my
first
time
post
have
a
bit
of
unix
experi
but
am
new
to
linux
just
got
a
new
pc
at
home
dell
box
with
window
xp
ad
a
second
hard
diskfor
linux
partit
the
disk
and
have
instal
suse
number
number
from
cd
which
wentfin
except
it
didn
t
pick
up
my
monitor
i
have
a
dell
brand
enumberfpp
number
lcd
flat
panel
monitor
and
a
nvidia
geforcenumbertinumb
video
card
both
of
which
are
probabl
too
new
to
featur
in
suse
s
defaultset
i
download
a
driver
from
the
nvidia
websit
and
instal
it
use
rpm
then
i
ran
saxnumb
as
wa
recommend
in
some
post
i
found
on
the
net
butit
still
doesn
t
featur
my
video
card
in
the
avail
list
what
next
anoth
problem
i
have
a
dell
brand
keyboard
and
if
i
hit
capslock
twice
the
whole
machin
crash
in
linux
not
window
even
the
on
off
switch
isinact
leav
me
to
reach
for
the
power
cabl
instead
if
anyon
can
help
me
in
ani
way
with
these
prob
i
d
be
realli
grate
i
ve
search
the
net
but
have
run
out
of
idea
or
should
i
be
go
for
a
differ
version
of
linux
such
as
redhat
opinionswelcom
thank
a
lot
peter
irish
linux
user
group
emailaddrhttpaddr
for
un
subscript
inform
list
maintain
emailaddr
==================
Word Indices:
[ 662 1084  652 1694 1280  756  186 1162 1752  594  225   64 1099 1699
  960  902  726 1099 1228  124  787  427  208 1860 1855 1885   21 1464
  752  960 1217 1666  464   74  756  847 1627 1120 1120  688  259 1840
  583  883  450 1249 1760 1084 1061  756  427  210 1120 1208 1061   74
 1792  246  204 1162 1840 1308 1708 1099 1699  626  825 1627  487  492
  688 1666 1824   74  847  883 1437 1671  116 1803 1376  825 1545 1280
  677 1171 1666 1095 1590  476  626 1084 1792  246  825 1666  139  961
 1835 1101   80 1309  756  427  210  909   74  810  785 1666 1845  988
  380  825  960 1113 1855  571 1666 1171 1163 1630  940 1018 1699 1365
  666 1666 1284  230  850  810   86  238  771 1018  825   75 1860 1675
  162 1371 1785 1462 1666 1095  225  756 1440 1192 1162  805 1182 1510
  162  718  666  452 1790 1162  960 1613  116 1379 1664  980  876  960
 1773  735  666 1744 1610  840  961  995  531]

Extracting features from sample email (emailSample2.txt)

Length of feature vector: 1899
Number of non-zero entries: 119
[0]

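I have only just started blogging, to keep a record of my learning progress and to deepen my own understanding. If you spot any mistakes or anything that could be done better, please point them out.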
