Web Intelligence and Big Data 
by Dr. Gautam Shroff

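This course is about big-data processing. This week brings the first programming assignment, which asks for statistics over text data using Map-Reduce. The tool used here is the lightweight mincemeat.
Note that words are matched with a regular expression. Once the counts are computed, the results are sorted by author name and then by frequency (a two-key sort) and written to a file. The quiz that follows has a two-minute time limit, so you need to be able to look up answers in the output quickly.
The assignment reads as follows: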
Download data files bundled as a .zip file from hw3data.zip
Each file in this archive contains entries that look like:
journals/cl/SantoNR90:::Michele Di Santo::Libero Nigro::Wilma Russo:::Programmer-Defined Control Abstractions in Modula-2.
that represent bibliographic information about publications, formatted as follows:
paper-id:::author1::author2::…. ::authorN:::title
Your task is to compute how many times every term occurs across titles, for each author.
For example, for the author Alberto Pettorossi the following terms occur in titles with the indicated cumulative frequencies (across all his papers): program:3, transformation:2, transforming:2, using:2, programs:2, and logic:2.
Remember that an author might have written multiple papers, which might be listed in multiple files. Further notice that ‘terms’ must exclude common stop-words, such as prepositions etc. For the purpose of this assignment, the stop-words that need to be omitted are listed in the script stopwords.py. In addition, single letter words, such as "a" can be ignored; also hyphens can be ignored (i.e. deleted). Lastly, periods, commas, etc. need to be ignored; in other words, only alphabets and numbers can be part of a title term: Thus, “program” and “program.” should both be counted as the term ‘program’, and "map-reduce" should be taken as 'map reduce'. Note: You do not need to do stemming, i.e. "algorithm" and "algorithms" can be treated as separate terms.
The assignment is to write a parallel map-reduce program for the above task using either octo.py, or mincemeat.py, each of which is a lightweight map-reduce implementation written in Python.
These are available from http://code.google.com/p/octopy/ and mincemeat.py-zipfile respectively.
I strongly recommend mincemeat.py, which is much faster than octo.py even though the latter was covered first in the lecture video as an example. Both are very similar.
Once you have computed the output, i.e. the terms-frequencies per author, go attempt Homework 3 where you will be asked questions that can be simply answered using your computed output, such as the top terms that occur for some particular author.
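
The term rules in the assignment text can be sketched roughly as follows. This is only an illustration, not the course's reference solution: `STOP_WORDS` is a hypothetical stand-in for the list in stopwords.py (which is not reproduced here), `title_terms` is a name made up for this example, and replacing a hyphen with a space is an assumption based on the 'map reduce' example above.

```python
import re

# Hypothetical stand-in for the stop-word list in stopwords.py.
STOP_WORDS = {"the", "of", "and", "in", "for", "to", "on", "an"}

def title_terms(title, stop_words=STOP_WORDS):
    """Return the countable terms of a title under the rules above."""
    # Assumption: a hyphen separates words ("map-reduce" -> "map reduce").
    text = title.lower().replace("-", " ")
    # Keep only ASCII letters, digits and whitespace.
    text = re.sub(r"[^0-9a-z\s]", "", text)
    # Drop single-character tokens and stop-words.
    return [w for w in text.split() if len(w) > 1 and w not in stop_words]

print(title_terms("Programmer-Defined Control Abstractions in Modula-2."))
# ['programmer', 'defined', 'control', 'abstractions', 'modula']
```

Note that the trailing "2" of "Modula-2" disappears under the single-character rule once the hyphen is treated as a separator.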
Sample input:
conf/fc/KravitzG99:::David W. Kravitz::David M. Goldschlag:::Conditional Access Concepts and Principles.
conf/fc/Moskowitz01:::Scott Moskowitz:::A Solution to the Napster Phenomenon: Why Value Cannot Be Created Absent the Transfer of Subjective Data.
conf/fc/BellareNPS01:::Mihir Bellare::Chanathip Namprempre::David Pointcheval::Michael Semanko:::The Power of RSA Inversion Oracles and the Security of Chaum's RSA-Based Blind Signature Scheme.
conf/fc/Kocher98:::Paul C. Kocher:::On Certificate Revocation and Validation.
conf/ep/BertiDM98:::Laure Berti::Jean-Luc Damoiseaux::Elisabeth Murisasco:::Combining the Power of Query Languages and Search Engines for On-line Document and Information Retrieval : The QIRi@D Environment.
conf/ep/LouS98:::Qun Lou::Peter Stucki:::Funfamentals of 3D Halftoning.
conf/ep/Mather98:::Laura A. Mather:::A Linear Algebra Approach to Language Identification.
conf/ep/BallimCLV98:::Afzal Ballim::Giovanni Coray::A. Linden::Christine Vanoirbeek:::The Use of Automatic Alignment on Structured Multilingual Documents.
conf/ep/ErdenechimegMN98:::Myatav Erdenechimeg::Richard Moore::Yumbayar Namsrai:::On the Specification of the Display of Documents in Multi-lingual Computing.
conf/ep/VercoustreP98:::Anne-Marie Vercoustre::François Paradis:::Reuse of Linked Documents through Virtual Document Prescriptions.
conf/ep/CruzBMW98:::Isabel F. Cruz::Slava Borisov::Michael A. Marks::Timothy R. Webb:::Measuring Structural Similarity Among Web Documents: Preliminary Results.
conf/er/Hohenstein89:::Uwe Hohenstein:::Automatic Transformation of an Entity-Relationship Query Language into SQL.
conf/er/NakanishiHT01:::Yoshihiro Nakanishi::Tatsuo Hirose::Katsumi Tanaka:::Modeling and Structuring Multiple Perspective Video for Browsing.
conf/er/Sciore91:::Edward Sciore:::Abbreviation Techniques in Entity-Relationship Query Languages.
conf/er/Chen79:::Peter P. Chen:::Recent Literature on the Entity-Relationship Approach.
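
Each record above can be taken apart with two splits, first on ":::" and then on "::". A minimal sketch, assuming every line has exactly the three fields shown; `parse_record` is a helper name invented for this example:

```python
def parse_record(line):
    """Split 'paper-id:::a1::a2:::title' into (paper_id, authors, title)."""
    # maxsplit=2 keeps the title intact even if it happened to contain ':::'.
    paper_id, authors, title = line.strip().split(":::", 2)
    return paper_id, authors.split("::"), title

sample = ("conf/fc/KravitzG99:::David W. Kravitz::David M. Goldschlag:::"
          "Conditional Access Concepts and Principles.")
paper_id, authors, title = parse_record(sample)
print(authors)  # ['David W. Kravitz', 'David M. Goldschlag']
print(title)    # Conditional Access Concepts and Principles.
```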

To run the job, open two terminals: one for the mincemeat worker client and one for the server script. The commands are:

python mincemeat.py -p pwd localhost
python hw3.py
The code of hw3.py is:
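import glob
import mincemeat
import operator

all_filepaths = glob.glob('hw3data/*')

def file_contents(filename):
    f = open(filename)
    try:
        return f.read()
    finally:
        f.close()

# Map each file name to the full text of that file.
datasource = dict((filename, file_contents(filename)) for filename in all_filepaths)

def my_mapper(key, value):
    # Imports live inside the function because it is shipped to the workers.
    from stopwords import allStopWords
    import re
    for line in value.splitlines():
        # Fields: paper-id ::: authors ::: title
        allThree = line.split(':::')
        for author in allThree[1].split('::'):
            # Strip everything except whitespace, digits, letters and hyphens.
            # Note: hyphens are kept, so "map-reduce" remains a single term here.
            for word in re.sub(r'([^\s\t0-9a-zA-Z-])+', '', allThree[2]).split():
                tmpWord = word.strip().lower()
                if len(tmpWord) <= 1 or tmpWord in allStopWords:
                    continue
                yield (author, tmpWord), 1

def my_reducer(key, value):
    # Sum the 1s emitted for each (author, term) key.
    result = sum(value)
    return result

s = mincemeat.Server()
s.datasource = datasource
s.mapfn = my_mapper
s.reducefn = my_reducer

results = s.run_server(password="pwd")
print results  # Python 2 print statement, as in the original script

# Flatten {(author, term): count} into (author, term, count) tuples,
# then sort by author and, within each author, by count (ascending).
resList = [(x[0], x[1], results[x]) for x in results.keys()]
sorted_results = sorted(resList, key=operator.itemgetter(0, 2))

with open('output.txt', 'w') as f:
    for (a, b, c) in sorted_results:
        f.write(a + ' *** ' + b + ' *** ' + str(c) + '\n')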
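
Before starting the server and a worker, it can help to check the mapper and reducer logic in a single process. The sketch below is not part of mincemeat; `run_local`, `toy_mapper`, `toy_reducer` and the two-record input are all made up for illustration, and the toy mapper skips stop-word and punctuation handling to stay short.

```python
from collections import defaultdict

def run_local(datasource, mapfn, reducefn):
    """Apply mapfn to every input, group values by key, then reduce each group."""
    grouped = defaultdict(list)
    for key, value in datasource.items():
        for out_key, out_value in mapfn(key, value):
            grouped[out_key].append(out_value)
    return {k: reducefn(k, vs) for k, vs in grouped.items()}

# Toy mapper/reducer with the same shape as my_mapper and my_reducer.
def toy_mapper(key, value):
    for line in value.splitlines():
        _, authors, title = line.split(":::")
        for author in authors.split("::"):
            for word in title.lower().split():
                yield (author, word), 1

def toy_reducer(key, values):
    return sum(values)

# Hypothetical input: one "file" holding two records.
data = {"f1": "id1:::Ann Lee::Bo Chen:::graph search\nid2:::Ann Lee:::graph theory"}
counts = run_local(data, toy_mapper, toy_reducer)
print(counts[("Ann Lee", "graph")])  # 2
```

The grouping step in `run_local` plays the role that the mincemeat server plays between the map and reduce phases: it collects every value emitted for the same (author, term) key before the reducer sees it.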

Sample output:

Stephen L. Bloom *** scalar *** 1
Stephen L. Bloom *** concatenation *** 1
Stephen L. Bloom *** point *** 1
Stephen L. Bloom *** varieties *** 1
Stephen L. Bloom *** observation *** 1
Stephen L. Bloom *** equivalence *** 1
Stephen L. Bloom *** axioms *** 1
Stephen L. Bloom *** languages *** 1
Stephen L. Bloom *** logical *** 1
Stephen L. Bloom *** algebras *** 1
Stephen L. Bloom *** equations *** 1
Stephen L. Bloom *** number *** 1
Stephen L. Bloom *** vector *** 1
Stephen L. Bloom *** polynomial *** 1
Stephen L. Bloom *** solving *** 1
Stephen L. Bloom *** equational *** 1
Stephen L. Bloom *** axiomatizing *** 1
Stephen L. Bloom *** characterization *** 1
Stephen L. Bloom *** regular *** 2
Stephen L. Bloom *** sets *** 2
Stephen L. Bloom *** iteration *** 3
Stephen L. Lieman *** unacceptable *** 1
Stephen L. Lieman *** correcting *** 1
Stephen L. Lieman *** never *** 1
Stephen L. Lieman *** powerful *** 1
Stephen L. Lieman *** accept *** 1
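
In the sample above the counts within each author run in ascending order, so an author's most frequent terms sit at the end of that author's block. For answering the quiz quickly, a small lookup helper may be handy. This is only a sketch that assumes the "author *** term *** count" layout shown above; `load_counts` and `top_terms` are names invented for this example.

```python
from collections import defaultdict

def load_counts(lines):
    """Parse 'author *** term *** count' lines into {author: {term: count}}."""
    table = defaultdict(dict)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        author, term, count = line.split(" *** ")
        table[author][term] = int(count)
    return table

def top_terms(table, author, k=3):
    """Return the k most frequent (term, count) pairs, ties broken alphabetically."""
    items = table.get(author, {}).items()
    return sorted(items, key=lambda tc: (-tc[1], tc[0]))[:k]

sample = [
    "Stephen L. Bloom *** regular *** 2",
    "Stephen L. Bloom *** sets *** 2",
    "Stephen L. Bloom *** iteration *** 3",
    "Stephen L. Bloom *** axioms *** 1",
]
print(top_terms(load_counts(sample), "Stephen L. Bloom"))
# [('iteration', 3), ('regular', 2), ('sets', 2)]
```

With the real file, the same functions can be fed an open file object, for example `load_counts(open('output.txt'))`.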
