Machine Learning in Practice 3 -- Douban Book Introductions

GraphLab's support for Chinese text seems to be a dead end. What can be done?

# coding: utf-8
# GraphLab's support for Chinese seems to be a dead end. What can be done? Suggestions welcome.

# Python 2 only: make UTF-8 the default string encoding so Chinese text does not raise encoding errors.
import sys
reload(sys)
sys.setdefaultencoding('utf8')
import graphlab
import datetime

# Limit number of worker processes. This preserves system memory, which prevents hosted notebooks from crashing.
graphlab.set_runtime_config('GRAPHLAB_DEFAULT_NUM_PYLAMBDA_WORKERS', 4)

# Load the Douban book data
douban = graphlab.SFrame.read_json('data/douban.json')

douban.head()

len(douban)

# Select the row for the book 围城 (Fortress Besieged)
weicheng = douban[douban['name'] == '围城']

weicheng

weicheng['intro']

# Count the words in the book's introduction
weicheng['word_count'] = graphlab.text_analytics.count_words(weicheng['intro'])
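
A note on the Chinese problem: as far as I understand, `count_words` splits only on whitespace-style delimiters by default, so an unsegmented Chinese introduction ends up counted as a handful of very long "words". One possible workaround is to segment the text first and join the tokens with spaces before counting. The sketch below is only an illustration under that assumption. It uses overlapping character bigrams to stay dependency-free, and the helper names `char_bigrams` and `segment_for_count_words` are made up for this example. A real segmenter such as jieba would probably give better tokens. It is written for Python 3, unlike the notebook code.

```python
# -*- coding: utf-8 -*-
import re
from collections import Counter

# Runs of CJK unified ideographs; everything else acts as a separator.
CJK_RUN = re.compile(u'[\u4e00-\u9fff]+')


def char_bigrams(text):
    """Split Chinese text into overlapping two-character tokens (hypothetical helper)."""
    tokens = []
    for run in CJK_RUN.findall(text):
        if len(run) == 1:
            tokens.append(run)
        else:
            tokens.extend(run[i:i + 2] for i in range(len(run) - 1))
    return tokens


def segment_for_count_words(text):
    """Join the tokens with spaces so a whitespace tokenizer can count them."""
    return u' '.join(char_bigrams(text))


# Made-up sample sentence, not taken from the dataset.
sample = u'围城是一本小说,围城很有名'
print(segment_for_count_words(sample))
print(Counter(char_bigrams(sample)).most_common(1))
```

In the notebook, something like `douban['intro'].apply(segment_for_count_words)` before calling `count_words` might do the job, though under Python 2 the text may first need decoding to `unicode`.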

weicheng['word_count']

# Build a new table; stack turns the key-value dict into two columns
weicheng_word_count_table = weicheng[['word_count']].stack('word_count', new_column_name=['word', 'count'])

weicheng_word_count_table.head()

# Sort in descending order
weicheng_word_count_table.sort('count', ascending=False)

# TF-IDF depends on the whole corpus
douban['word_count'] = graphlab.text_analytics.count_words(douban['intro'])
douban.head()

# Compute TF-IDF
tfidf = graphlab.text_analytics.tf_idf(douban['word_count'])

# Earlier versions of GraphLab Create returned an SFrame rather than a single SArray
# This notebook was created using Graphlab Create version 1.7.1
# Compare parsed versions: a plain string comparison would rank '1.10' below '1.6.1'
from distutils.version import LooseVersion
if LooseVersion(graphlab.version) <= LooseVersion('1.6.1'):
    tfidf = tfidf['docs']

tfidf
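
To make the TF-IDF step less of a black box, here is a small pure-Python sketch of one common formulation: the raw count of a word in a document times log(N / number of documents containing the word). I believe this matches what `tf_idf` computes, but treat that as an assumption. The function name and toy documents are hypothetical, and the code targets Python 3.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs: list of token lists. Returns one {word: weight} dict per document."""
    n_docs = float(len(docs))
    counts = [Counter(doc) for doc in docs]
    # Document frequency: the number of documents each word appears in.
    doc_freq = Counter(word for c in counts for word in c)
    return [
        {word: tf * math.log(n_docs / doc_freq[word]) for word, tf in c.items()}
        for c in counts
    ]


# Toy corpus of already-segmented documents (made up for illustration).
toy_docs = [['围城', '小说', '围城'], ['围城', '爬虫']]
for weights in tf_idf(toy_docs):
    print(weights)
```

A word that appears in every document gets weight 0, which is why the scores depend on the whole corpus and not on a single introduction.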

douban['tfidf'] = tfidf

weicheng = douban[douban['name'] == '围城']

# Stack the TF-IDF column for 围城 and sort it in descending order
weicheng[['tfidf']].stack('tfidf', new_column_name=['word', 'tfidf']).sort('tfidf', ascending=False)

# Build a nearest-neighbors model
knn_model = graphlab.nearest_neighbors.create(douban, features=['tfidf'], label='name')

knn_model.query(weicheng)
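
For intuition about the final query, the sketch below does a brute-force nearest-neighbor lookup over sparse TF-IDF dictionaries. It uses cosine distance, which is an assumption on my part: the distance GraphLab chooses by default for dictionary features may be different. The function names, labels and weights are all hypothetical, and the code targets Python 3.

```python
import math


def cosine_distance(a, b):
    """Cosine distance between two sparse vectors stored as {word: weight} dicts."""
    dot = sum(w * b[k] for k, w in a.items() if k in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # treat an empty vector as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


def nearest(query, corpus, k=5):
    """corpus: {label: sparse vector}. Returns the k closest (label, distance) pairs."""
    scored = [(label, cosine_distance(query, vec)) for label, vec in corpus.items()]
    return sorted(scored, key=lambda pair: pair[1])[:k]


# Made-up labels and weights, purely for illustration.
toy_corpus = {
    'book_a': {'围城': 1.0, '小说': 1.0},
    'book_b': {'围城': 1.0},
    'book_c': {'爬虫': 2.0},
}
print(nearest(toy_corpus['book_a'], toy_corpus, k=2))
```

As with `knn_model.query(weicheng)`, the query item itself comes back first, at a distance of roughly zero.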

Code (with homework answers): https://github.com/RedheatWei/aiproject/tree/master/Machine%20Learning%20Specialization/week4

Crawler: https://github.com/RedheatWei/douban_book_intro

Reposted from: https://www.cnblogs.com/redheat/p/9300059.html
