http://blog.csdn.net/pipisorry/article/details/44833603

In [1]:

%matplotlib inline

Fetching data

A simple HTTP request

In [2]:

import requests
print(requests.get("http://example.com").text)

<!doctype html>
<html>
<head><title>Example Domain</title><meta charset="utf-8" /><meta http-equiv="Content-type" content="text/html; charset=utf-8" /><meta name="viewport" content="width=device-width, initial-scale=1" /><style type="text/css">body {background-color: #f0f0f2;margin: 0;padding: 0;font-family: "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;}div {width: 600px;margin: 5em auto;padding: 50px;background-color: #fff;border-radius: 1em;}a:link, a:visited {color: #38488f;text-decoration: none;}@media (max-width: 700px) {body {background-color: #fff;}div {width: auto;margin: 0 auto;border-radius: 0;padding: 1em;}}</style>
</head><body>
<div><h1>Example Domain</h1><p>This domain is established to be used for illustrative examples in documents. You may use this domain in examples without prior coordination or asking for permission.</p><p><a href="http://www.iana.org/domains/example">More information...</a></p>
</div>
</body>
</html>

Communicating with an API

In [3]:

response = requests.get("https://www.googleapis.com/books/v1/volumes", params={"q":"machine learning"})
raw_data = response.json()
titles = [item['volumeInfo']['title'] for item in raw_data['items']]
titles

Out[3]:

[u'C4.5',u'Machine Learning',u'Machine Learning',u'Machine Learning',u'A First Course in Machine Learning',u'Machine Learning',u'Elements of Machine Learning',u'Introduction to Machine Learning',u'Pattern Recognition and Machine Learning',u'Machine Learning and Its Applications']
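
The request above passes the search term through `params` and then walks the JSON reply. As a rough, standard-library-only sketch of those two steps (the response body below is a hypothetical, heavily trimmed stand-in, not real API output):

```python
import json
from urllib.parse import urlencode

# Same endpoint and query parameter as the request above.
base_url = "https://www.googleapis.com/books/v1/volumes"
params = {"q": "machine learning"}

# A params dict ends up URL-encoded in the query string of the request.
full_url = base_url + "?" + urlencode(params)
print(full_url)

# Hypothetical response body, trimmed to the fields used above (not real API output).
sample_body = '{"items": [{"volumeInfo": {"title": "Book A"}}, {"volumeInfo": {"title": "Book B"}}]}'
raw_data = json.loads(sample_body)

# Same list comprehension as in the cell above.
titles = [item["volumeInfo"]["title"] for item in raw_data["items"]]
print(titles)
```

Building the URL by hand is only meant to show what the `params` argument amounts to; in practice it is simpler to let `requests` do the encoding.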

In [4]:

import lxml.html

page = lxml.html.parse("http://www.blocket.se/stockholm?q=apple")
# ^ This is probably illegal. Blocket, please don't sue me!
items_data = []
for el in page.getroot().find_class("item_row"):
    links = el.find_class("item_link")
    images = el.find_class("item_image")
    prices = el.find_class("list_price")
    if links and images and prices and prices[0].text:
        items_data.append({"name": links[0].text,
                           "image": images[0].attrib['src'],
                           "price": int(prices[0].text.split(":")[0].replace(" ", ""))})
items_data

Out[4]:

[{'image': 'http://cdn.blocket.com/static/2/lithumbs/98/9864322297.jpg',
  'name': 'Macbook laddare 60w',
  'price': 250},
 {'image': 'http://cdn.blocket.com/static/2/lithumbs/43/4338840758.jpg',
  'name': u'Apple iPhone 5S 16GB - Ol\xe5st - 12 m\xe5n garanti',
  'price': 3999},
 {'image': 'http://cdn.blocket.com/static/0/lithumbs/98/9838946223.jpg',
  'name': u'Ol\xe5st iPhone 5 64 GB med n\xe4stan nytt batteri',
  'price': 3000},
 {'image': 'http://cdn.blocket.com/static/1/lithumbs/79/7906971367.jpg',
  'name': u'Apple iPhone 5C 16GB - Ol\xe5st - 12 m\xe5n garanti',
  'price': 3099},
 {'image': 'http://cdn.blocket.com/static/0/lithumbs/79/7926951568.jpg',
  'name': u'HP Z620 Workstation - 1 \xe5rs garanti',
  'price': 12494},
 {'image': 'http://cdn.blocket.com/static/0/lithumbs/97/9798755036.jpg',
  'name': 'HP ProBook 6450b - Andrasortering',
  'price': 1699},
 {'image': 'http://cdn.blocket.com/static/1/lithumbs/98/9898462036.jpg',
  'name': 'Macbook pro 13 retina, 256 gb ssd',
  'price': 12000}]

Reading local data

In [5]:

import pandas
df = pandas.read_csv('sample.csv')

In [6]:

# Display the DataFrame
df

Out[6]:

   Year   Make                                   Model                         Description  Price
0  1997   Ford                                    E350                       ac, abs, moon   3000
1  1999  Chevy              Venture "Extended Edition"                                 NaN   4900
2  1999  Chevy  Venture "Extended Edition, Very Large"                                 NaN   5000
3  1996   Jeep                          Grand Cherokee  MUST SELL!\nair, moon roof, loaded    NaN

In [7]:

# DataFrame's columns
df.columns

Out[7]:

Index([u'Year', u'Make', u'Model', u'Description', u'Price'], dtype='object')

In [8]:

# Values of a given column
df.Model

Out[8]:

0                                      E350
1                Venture "Extended Edition"
2    Venture "Extended Edition, Very Large"
3                            Grand Cherokee
Name: Model, dtype: object

Analyzing the dataframe

In [9]:

# Any missing values?
df['Price']

Out[9]:

0    3000
1    4900
2    5000
3     NaN
Name: Price, dtype: float64

In [10]:

df['Description']

Out[10]:

0                         ac, abs, moon
1                                   NaN
2                                   NaN
3    MUST SELL!\nair, moon roof, loaded
Name: Description, dtype: object

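In [11]:

# Replace missing descriptions with a placeholder text
df['Description'] = df['Description'].fillna("No description is available.")
# Fill missing prices by a linear interpolation
df['Price'] = df['Price'].interpolate()
df

Out[11]:

   Year   Make                                   Model                         Description  Price
0  1997   Ford                                    E350                       ac, abs, moon   3000
1  1999  Chevy              Venture "Extended Edition"        No description is available.   4900
2  1999  Chevy  Venture "Extended Edition, Very Large"        No description is available.   5000
3  1996   Jeep                          Grand Cherokee  MUST SELL!\nair, moon roof, loaded   5000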
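
In the output above the missing price sits at the end of the column, so there is no right-hand neighbour to interpolate towards and the last known value (5000) is repeated. The following is a minimal pure-Python sketch of that behaviour, not the pandas implementation; the helper name `interpolate_linear` is made up for illustration, and leading gaps are deliberately left untouched:

```python
def interpolate_linear(values):
    """Fill None gaps: interior gaps linearly, trailing gaps with the last known value."""
    result = list(values)
    known = [i for i, v in enumerate(result) if v is not None]
    # Interior gaps: draw a straight line between consecutive known positions.
    for left, right in zip(known, known[1:]):
        step = (result[right] - result[left]) / (right - left)
        for i in range(left + 1, right):
            result[i] = result[left] + step * (i - left)
    # Trailing gaps: nothing to interpolate towards, so repeat the last known value.
    if known:
        for i in range(known[-1] + 1, len(result)):
            result[i] = result[known[-1]]
    return result

prices = [3000.0, 4900.0, 5000.0, None]          # the Price column shown above
filled_prices = interpolate_linear(prices)
filled_gap = interpolate_linear([1.0, None, None, 4.0])
print(filled_prices)
print(filled_gap)
```

The second call shows the interior case, where the two missing values are placed evenly between 1.0 and 4.0.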

Exploring data

In [12]:

import matplotlib.pyplot as plt
df = pandas.read_csv('sample2.csv')
df

Out[12]:

       Office  Year  Sales
0   Stockholm  2004    200
1   Stockholm  2005    250
2   Stockholm  2006    255
3   Stockholm  2007    260
4   Stockholm  2008    264
5   Stockholm  2009    274
6   Stockholm  2010    330
7   Stockholm  2011    364
8    New York  2004    432
9    New York  2005    469
10   New York  2006    480
11   New York  2007    438
12   New York  2008    330
13   New York  2009    280
14   New York  2010    299
15   New York  2011    230

In [13]:

# This table has 3 columns: Office, Year, Sales
print(df.columns)

# It's really easy to query data with Pandas:
print(df[(df['Office'] == 'Stockholm') & (df['Sales'] > 260)])

# It's also easy to do aggregations...
aggregated_sales = df.groupby('Year').sum()
print(aggregated_sales)

Index([u'Office', u'Year', u'Sales'], dtype='object')
      Office  Year  Sales
4  Stockholm  2008    264
5  Stockholm  2009    274
6  Stockholm  2010    330
7  Stockholm  2011    364
      Sales
Year
2004    632
2005    719
2006    735
2007    698
2008    594
2009    554
2010    629
2011    594
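
To make explicit what the boolean filter and the `groupby('Year').sum()` call compute, here is a small pure-Python sketch over the same rows. The office names are written as they would appear in English and the variable names are illustrative:

```python
from collections import defaultdict

# (Office, Year, Sales) rows copied from the table above.
rows = [
    ("Stockholm", 2004, 200), ("Stockholm", 2005, 250),
    ("Stockholm", 2006, 255), ("Stockholm", 2007, 260),
    ("Stockholm", 2008, 264), ("Stockholm", 2009, 274),
    ("Stockholm", 2010, 330), ("Stockholm", 2011, 364),
    ("New York", 2004, 432), ("New York", 2005, 469),
    ("New York", 2006, 480), ("New York", 2007, 438),
    ("New York", 2008, 330), ("New York", 2009, 280),
    ("New York", 2010, 299), ("New York", 2011, 230),
]

# Counterpart of df[(df['Office'] == 'Stockholm') & (df['Sales'] > 260)]
selected = [row for row in rows if row[0] == "Stockholm" and row[2] > 260]

# Counterpart of df.groupby('Year').sum() for the Sales column
sales_by_year = defaultdict(int)
for office, year, sales in rows:
    sales_by_year[year] += sales

print(selected)
print(sorted(sales_by_year.items()))
```

The per-year totals agree with the aggregated output printed above (for example 632 for 2004).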

In [14]:

# ... and generate plots
%matplotlib inline
aggregated_sales.plot(kind='bar')

Out[14]:

<matplotlib.axes._subplots.AxesSubplot at 0x1089dcc10>

Machine learning

Feature extraction

In [15]:

from sklearn import feature_extraction

Extracting features from text

In [16]:

corpus = ['All the cats really are great.',
          'I like the cats but I still prefer the dogs.',
          'Dogs are the best.',
          'I like all the trains',
          ]

tfidf = feature_extraction.text.TfidfVectorizer()

print(tfidf.fit_transform(corpus).toarray())
print(tfidf.get_feature_names())

[[ 0.38761905  0.38761905  0.          0.          0.38761905  0.
   0.49164562  0.          0.          0.49164562  0.          0.25656108
   0.        ]
 [ 0.          0.          0.          0.4098205   0.32310719  0.32310719
   0.          0.32310719  0.4098205   0.          0.4098205   0.42772268
   0.        ]
 [ 0.          0.4970962   0.6305035   0.          0.          0.4970962
   0.          0.          0.          0.          0.          0.32902288
   0.        ]
 [ 0.4970962   0.          0.          0.          0.          0.          0.
   0.4970962   0.          0.          0.          0.32902288  0.6305035 ]]
[u'all', u'are', u'best', u'but', u'cats', u'dogs', u'great', u'like', u'prefer', u'really', u'still', u'the', u'trains']
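
The numbers in the matrix above can be reproduced by hand. The sketch below is not scikit-learn's code; it assumes the settings that, to my understanding, are the `TfidfVectorizer` defaults: lowercasing, tokens of at least two word characters, the smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row.

```python
import math
import re
from collections import Counter

corpus = [
    "All the cats really are great.",
    "I like the cats but I still prefer the dogs.",
    "Dogs are the best.",
    "I like all the trains",
]

def tokenize(text):
    # Lowercase and keep tokens of two or more word characters (so "I" is dropped).
    return re.findall(r"\b\w\w+\b", text.lower())

docs = [Counter(tokenize(text)) for text in corpus]
vocabulary = sorted({term for doc in docs for term in doc})
n_docs = len(docs)

# Smoothed inverse document frequency (assumed formula, see the note above).
idf = {}
for term in vocabulary:
    doc_freq = sum(1 for doc in docs if term in doc)
    idf[term] = math.log((1 + n_docs) / (1 + doc_freq)) + 1

def tfidf_row(doc):
    # Raw term count times IDF, then scale the row to unit Euclidean length.
    weights = [doc[term] * idf[term] for term in vocabulary]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights]

matrix = [tfidf_row(doc) for doc in docs]
print(vocabulary)
for row in matrix:
    print([round(w, 4) for w in row])
```

Under these assumptions the word "the", which occurs in every document, gets the lowest IDF, which is why its weight is the smallest non-zero entry in the first row.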

Dict vectorizer

In [17]:

import json

data = [json.loads("""{"weight": 194.0, "sex": "female", "student": true}"""),
        {"weight": 60., "sex": 'female', "student": True},
        {"weight": 80.1, "sex": 'male', "student": False},
        {"weight": 65.3, "sex": 'male', "student": True},
        {"weight": 58.5, "sex": 'female', "student": False}]

vectorizer = feature_extraction.DictVectorizer(sparse=False)

vectors = vectorizer.fit_transform(data)
print(vectors)
print(vectorizer.get_feature_names())

[[   1.     0.     1.   194. ]
 [   1.     0.     1.    60. ]
 [   0.     1.     0.    80.1]
 [   0.     1.     1.    65.3]
 [   1.     0.     0.    58.5]]
[u'sex=female', 'sex=male', u'student', u'weight']
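
The output above shows the encoding rule: string values become one-hot `key=value` columns, while numbers and booleans keep their key and are cast to floats. A minimal pure-Python sketch of that rule follows; the helper `feature_items` is hypothetical and this is not the library's implementation:

```python
data = [
    {"weight": 194.0, "sex": "female", "student": True},
    {"weight": 60.0, "sex": "female", "student": True},
    {"weight": 80.1, "sex": "male", "student": False},
    {"weight": 65.3, "sex": "male", "student": True},
    {"weight": 58.5, "sex": "female", "student": False},
]

def feature_items(record):
    # Strings become a one-hot "key=value" feature; numbers and booleans keep their key.
    for key, value in record.items():
        if isinstance(value, str):
            yield "%s=%s" % (key, value), 1.0
        else:
            yield key, float(value)

# Column order: feature names sorted alphabetically.
feature_names = sorted({name for record in data for name, _ in feature_items(record)})

vectors = []
for record in data:
    values = dict(feature_items(record))
    # Features absent from a record are encoded as 0.0.
    vectors.append([values.get(name, 0.0) for name in feature_names])

print(feature_names)
for row in vectors:
    print(row)
```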

In [18]:

class A:
    def __init__(self, x):
        self.x = x
        self.blabla = 'test'

a = A(20)
a.__dict__

Out[18]:

{'blabla': 'test', 'x': 20}

Pre-processing

Scaling

In [19]:

from sklearn import preprocessing

data = [[10., 2345., 0., 2.],
        [3., -3490., 0.1, 1.99],
        [13., 3903., -0.2, 2.11]]

print(preprocessing.normalize(data))

[[  4.26435200e-03   9.99990544e-01   0.00000000e+00   8.52870400e-04]
 [  8.59598396e-04  -9.99999468e-01   2.86532799e-05   5.70200269e-04]
 [  3.33075223e-03   9.99994306e-01  -5.12423421e-05   5.40606709e-04]]
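
Each row of the output above has unit Euclidean length: every value is divided by the L2 norm of its own row, which is why the second column, by far the largest in magnitude, ends up close to 1 or -1. A short pure-Python sketch of that row-wise scaling, with the illustrative helper name `l2_normalize`:

```python
import math

data = [[10., 2345., 0., 2.],
        [3., -3490., 0.1, 1.99],
        [13., 3903., -0.2, 2.11]]

def l2_normalize(row):
    # Divide every entry by the Euclidean length of the row (all-zero rows are left as-is).
    norm = math.sqrt(sum(x * x for x in row))
    return [x / norm for x in row] if norm else list(row)

normalized = [l2_normalize(row) for row in data]
for row in normalized:
    print(row)
```

Because the scaling is done per row (per sample) and not per column, the columns remain on very different scales after this step.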

Dimensionality reduction

In [20]:

from sklearn import decomposition

data = [[0.3, 0.2, 0.4,  0.32],
        [0.3, 0.5, 1.0, 0.19],
        [0.3, -0.4, -0.8, 0.22]]

pca = decomposition.PCA()
print(pca.fit_transform(data))
print(pca.explained_variance_ratio_)

[[ -2.23442295e-01  -7.71447891e-02   8.06250485e-17]
 [ -8.94539226e-01   5.14200202e-02   8.06250485e-17]
 [  1.11798152e+00   2.57247689e-02   8.06250485e-17]]
[  9.95611223e-01   4.38877684e-03   9.24548594e-33]

Machine learning models

Classification (SVM)

In [21]:

from sklearn import datasets
from sklearn import svm

In [22]:

iris = datasets.load_iris()
X = iris.data[:, :2]
y = iris.target

# Training the model
clf = svm.SVC(kernel='rbf')
clf.fit(X, y)

# Doing predictions
new_data = [[4.85, 3.1], [5.61, 3.02]]
print(clf.predict(new_data))

[0 1]

Regression (linear regression)

In [23]:

import numpy as np
from sklearn import linear_model
import matplotlib.pyplot as plt

def f(x):
    return x + np.random.random() * 3.

X = np.arange(0, 5, 0.5)
X = X.reshape((len(X), 1))
# Build a list so this also works where map() returns an iterator
y = [f(x) for x in X]

clf = linear_model.LinearRegression()
clf.fit(X, y)

Out[23]:

LinearRegression(copy_X=True, fit_intercept=True, normalize=False)
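
For a single input feature, ordinary least squares has a simple closed form: the slope is the covariance of x and y divided by the variance of x, and the intercept follows from the means. The sketch below illustrates that formula in pure Python; it is not how scikit-learn solves the problem internally, and the targets are hypothetical and noise-free so that the result is easy to check:

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-deviations and of squared deviations of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

xs = [0.5 * i for i in range(10)]    # same grid as np.arange(0, 5, 0.5)
ys = [2.0 * x + 1.0 for x in xs]     # hypothetical targets on the line y = 2x + 1
slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```

With the noisy targets from the cell above, the fitted slope should stay close to 1 and the intercept should absorb the average of the added noise.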

In [24]:

new_X = np.arange(0.2, 5.2, 0.3)
new_X = new_X.reshape((len(new_X), 1))
new_y = clf.predict(new_X)

plt.scatter(X, y, color='g', label='Training data')

plt.plot(new_X, new_y, '.-', label='Predicted')
plt.legend()

Out[24]:

<matplotlib.legend.Legend at 0x10a38f290>

Clustering (DBSCAN)

In [25]:

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Generate sample data
centers = [[1, 1], [-1, -1], [1, -1]]
X, labels_true = make_blobs(n_samples=200, centers=centers, cluster_std=0.4,
                            random_state=0)
X = StandardScaler().fit_transform(X)

In [26]:

# Compute DBSCAN
db = DBSCAN(eps=0.3, min_samples=10).fit(X)
db.labels_

Out[26]:

array([-1,  0,  2,  1,  1,  2, -1,  0,  0, -1, -1,  0,  0,  2, -1, -1,  2,
        0,  1,  0,  0,  2, -1, -1,  0, -1, -1,  1, -1,  2,  1, -1,  1, -1,
        1,  0,  1,  0,  0,  2,  2, -1,  2,  1,  0,  1,  0,  1,  2,  1,  1,
        2, -1,  2,  1, -1,  0,  0, -1,  1,  0,  0,  1,  2,  0, -1,  2,  1,
       -1,  0,  0,  1,  1,  0, -1,  2, -1,  1,  2,  2,  0,  2,  1,  0, -1,
        0,  2,  1, -1,  2,  0, -1,  1,  1,  2,  0,  2,  1,  2,  1,  2,  2,
       -1,  2,  0,  1,  0, -1,  2,  0,  1,  0,  0, -1,  1,  0,  2,  2,  0,
        1,  0, -1,  1,  0,  1,  1,  1, -1,  1,  2,  1, -1, -1,  0,  0,  2,
        1,  1, -1,  0,  1,  2,  1,  0,  0, -1,  2,  1,  1,  1,  2,  2,  0,
        0,  2, -1,  1,  0,  1,  1,  2,  1,  2,  1,  0, -1,  2,  0,  2,  1,
        2,  1,  0,  1,  2,  0,  1, -1,  2,  0,  0,  1,  1,  1, -1,  0,  1,
        0,  1,  2, -1, -1,  2,  1,  0,  0,  2, -1,  2,  0])
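
In DBSCAN's label array, `-1` marks points treated as noise and the non-negative integers are cluster ids. A small sketch of how the labels can be summarized, using only the first 17 labels of the array above as sample input:

```python
from collections import Counter

# First 17 labels of the array above; -1 means "noise", other values are cluster ids.
labels = [-1, 0, 2, 1, 1, 2, -1, 0, 0, -1, -1, 0, 0, 2, -1, -1, 2]

counts = Counter(labels)
n_noise = counts.pop(-1, 0)      # remove the noise bucket before counting clusters
n_clusters = len(counts)

print("clusters:", n_clusters)
print("noise points:", n_noise)
print("cluster sizes:", dict(sorted(counts.items())))
```

The same code should work on the full array, for example by passing `db.labels_.tolist()` in place of the hand-copied list.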

In [27]:

import matplotlib.pyplot as plt
plt.scatter(X[:, 0], X[:, 1], c=db.labels_)

Out[27]:

<matplotlib.collections.PathCollection at 0x10a6bc110>

Cross-validation

In [28]:

from sklearn import svm, cross_validation, datasets

iris = datasets.load_iris()
X, y = iris.data, iris.target

model = svm.SVC()
print(cross_validation.cross_val_score(model, X, y, scoring='precision'))
print(cross_validation.cross_val_score(model, X, y, scoring='mean_squared_error'))

[ 0.98148148  0.96491228  0.98039216]
[-0.01960784 -0.03921569 -0.02083333]
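
Each call above returns three scores, one per held-out fold. The sketch below shows the basic idea of splitting sample indices into folds in pure Python. It is a plain, unshuffled k-fold split with a made-up helper name; as far as I know, scikit-learn uses a stratified split for classifiers, which this sketch does not attempt to reproduce.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train, test) index lists; every sample lands in exactly one test fold."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
                  for i in range(n_folds)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

folds = list(k_fold_indices(10, 3))
for train, test in folds:
    print("train:", train, "test:", test)
```

A model would be fitted on each `train` list and scored on the matching `test` list, giving one score per fold.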


ref:Data-processing and machine learning with Python

http://nbviewer.ipython.org/github/halflings/python-data-workshop/blob/master/data-workshop-notebook.ipynb


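Data processing and machine learning in Python: fetching online and local data, parsing, pre-processing, training, prediction, cross-validation, and visualization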