MapReduce with MongoDB and Python

From Artificial Intelligence in Motion, by Marcel Pinheiro Caraciolo (the images published on Artificial Intelligence in Motion are hosted outside the firewall, so they have been moved to cnblogs)

Hi all,

In this post, I'll demonstrate a map-reduce example with MongoDB and server-side JavaScript. Since I've been working with this technology recently, I thought it would be useful to present a simple example of how it works and how to integrate it with Python.
But what is MongoDB?
For those who don't know what MongoDB is or the basics of how to use it, it is important to first explain a little about the No-SQL movement. Several current databases break with the requirements of traditional relational database systems. The main characteristics shared by many No-SQL databases are:
  • SQL commands are not used as the query API (queries are typically expressed through JSON- or BSON-based APIs instead).
  • Atomic operations are not always guaranteed.
  • Distributed and horizontally scalable.
  • Schemas do not have to be predefined (schemaless).
  • Non-tabular data storage (e.g. key-value, object, graph).
Although it is not obvious, No-SQL is short for Not Only SQL. This approach has been making a lot of noise since 2009. You can find more information about it here and here. It is important to note that non-relational databases are not a complete replacement for relational databases. You need to know the pros and cons of each approach and choose the one most appropriate for the scenario you are facing.
MongoDB is one of the most popular No-SQL databases today and is the focus of this article. It is a schemaless, document-oriented, high-performance, scalable database that builds on key-value concepts to store documents as JSON-structured documents. It also includes some relational database features such as indexes and dynamic queries. It is used in production today at more than 40 websites, including services such as SourceForge, GitHub, Electronic Arts and The New York Times.
One of the features I like best in MongoDB is Map-Reduce. In the next section I will explain how it works, illustrated with a simple example using MongoDB and Python.
If you want to install MongoDB or get more information, you can download it here and read a nice tutorial here.

Map-Reduce

MapReduce is a programming model for processing and generating large data sets. It is a framework introduced by Google to support parallel computations over large data sets spread across clusters of computers. MapReduce is now a popular model in distributed computing, inspired by the map and reduce functions commonly used in functional programming. It is a data-oriented model that processes data in two primary steps, Map and Reduce, and the query can execute against several data sources simultaneously. The process of applying the request to each record supplied by the input reader is called 'Map', and the process of aggregating the intermediate results of the mapping function into a consolidated result is called 'Reduce'. More details can be found in the MapReduce paper, which can be read here.
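To make the functional-programming origin concrete, here is a minimal sketch using Python's built-in map and functools.reduce. The word list is made up for illustration, and nothing here is distributed; it only shows the two steps in their simplest form.

```python
from functools import reduce

# Hypothetical input: a small list of words.
words = ["map", "reduce", "mongodb", "python"]

# Map step: transform every element independently (here, word -> length).
lengths = list(map(len, words))

# Reduce step: fold the mapped values into one consolidated result.
total_length = reduce(lambda acc, n: acc + n, lengths, 0)

print(lengths)
print(total_length)
```

Because each element is mapped independently, the map step is the part that a distributed framework can spread across machines.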
Today there are several implementations of MapReduce, such as Hadoop, Disco and Skynet. The most famous is Hadoop, an open-source project implemented in Java. MongoDB also has an implementation that is similar in spirit to Hadoop, with all input coming from a collection and output going to a collection. As a practical definition, Map-Reduce in MongoDB is useful for batch manipulation of data and for aggregation operations. In real-world scenarios, wherever you would have used GROUP BY in SQL, map/reduce is the equivalent tool in MongoDB.
Now that we have introduced Map-Reduce, let's see how to access MongoDB from Python.
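The GROUP BY analogy can be sketched in plain Python. The orders records and their field names below are hypothetical; the point is only that the map step picks a grouping key and a value, and the reduce step aggregates each group, roughly as SELECT customer, SUM(amount) FROM orders GROUP BY customer would.

```python
from collections import defaultdict

# Hypothetical records standing in for documents in a collection.
orders = [
    {"customer": "ana", "amount": 10},
    {"customer": "bob", "amount": 5},
    {"customer": "ana", "amount": 7},
]

def map_order(doc):
    """Map: produce a (grouping key, value) pair for one document."""
    return doc["customer"], doc["amount"]

def reduce_amounts(key, values):
    """Reduce: aggregate all values that share a key."""
    return sum(values)

# Group the mapped values by key, as the engine would do between the steps.
groups = defaultdict(list)
for doc in orders:
    key, value = map_order(doc)
    groups[key].append(value)

totals = {key: reduce_amounts(key, values) for key, values in groups.items()}
print(totals)
```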

PyMongo

PyMongo is a Python distribution containing tools for working with MongoDB, and it is the recommended way to work with MongoDB from Python. It's easy to install and use. See here how to install and use it.

Map-Reduce in Action

Now let's see Map-Reduce in action. To demonstrate it, I've decided to use one of the classic problems solved with map-reduce: counting word frequency across a series of documents. It's a simple problem and is well suited to a map-reduce query.
I've decided to use two samples for this task. The first is a list of simple sentences that illustrates how map-reduce works. The second is Obama's 2009 speech on his election as president, which is used to show a real example illustrated by the code.
Let's consider the diagram below to help demonstrate how map-reduce could be distributed. It shows four sentences that are split into words and grouped by the map function, and then reduced independently (aggregated) by the reduce function. This is interesting because it means our query can be distributed across separate nodes (computers), resulting in faster word-frequency counting. It's also important to note that the example below shows a balanced tree, but it could be unbalanced or even show some redundancy.
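The idea in the diagram can be simulated in a few lines of Python. This is only a sketch: the four sentences are hypothetical (they are not the ones from the diagram), and the two "nodes" are just two function calls in the same process.

```python
from collections import Counter

# Hypothetical sentences, made up for this sketch.
sentences = [
    "the quick brown fox",
    "the lazy dog",
    "the fox jumps",
    "a lazy fox",
]

def count_words(chunk):
    """Map and locally reduce one chunk of sentences, as one node might."""
    counts = Counter()
    for sentence in chunk:
        counts.update(sentence.split())
    return counts

# Pretend two nodes each process half of the input.
node_a = count_words(sentences[:2])
node_b = count_words(sentences[2:])

# Final reduce: merge the partial counts from both nodes.
distributed = node_a + node_b

# Same answer as processing everything on a single node.
single_node = count_words(sentences)
print(distributed == single_node)
print(distributed["the"], distributed["fox"], distributed["lazy"])
```

The split point is arbitrary: giving three sentences to one node and one to the other (an unbalanced tree) produces the same totals.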

Map-Reduce Distribution
Some notes you need to know before developing your map and reduce functions:
  • The MapReduce engine may invoke reduce functions iteratively; thus, these functions must be idempotent. That is, the following must hold for your reduce function:
for all k,vals : reduce( k, [reduce(k,vals)] ) == reduce(k,vals)
  • Currently, the return value from a reduce function cannot be an array (it's typically an object or a number).
  • If you need to perform an operation only once, use a finalize function.
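The idempotency rule above can be checked with a small Python sketch. Both reduce functions here are hypothetical stand-ins written for this illustration: one sums partial counts, the other counts how many values it received, which looks reasonable for word counting but breaks as soon as the engine re-reduces a partial result.

```python
def reduce_sum(key, values):
    """Well-behaved reduce: sums partial counts."""
    return sum(values)

def reduce_len(key, values):
    """Flawed reduce: counts how many values it received."""
    return len(values)

def is_idempotent(reduce_fn, key, values):
    """Check reduce(k, [reduce(k, vals)]) == reduce(k, vals) for one sample."""
    once = reduce_fn(key, values)
    return reduce_fn(key, [once]) == once

print(is_idempotent(reduce_sum, "fox", [1, 1, 1]))
print(is_idempotent(reduce_len, "fox", [1, 1, 1]))
```

A passing check on one sample does not prove the property for all inputs, but a failing one is enough to show that a reduce function is unsafe.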
Now let's go to the code. For this task, I'll use PyMongo, which has support for Map/Reduce. As I said earlier, the input text will be Obama's speech, which, by the way, has many repeated words. Take a look at the tag cloud of the speech (a cloud of words in which each word's font size is based on its frequency).

Obama's Speech in 2009
To write our map and reduce functions, note that MongoDB allows clients to send JavaScript map and reduce implementations that are evaluated and run on the server. Here is our map function.

wordMap.js

As you can see, the 'this' variable refers to the context from which the function is called. That is, MongoDB calls the map function on each document in the collection we are querying, with 'this' pointing to that document, so a key of the document such as 'text' can be accessed by calling this.text. The map function doesn't return a list; instead it calls an emit function, which it expects to be defined. The parameters of this function (key, value) are grouped with the intermediate results of other map evaluations that have the same key (key, [value1, value2]) and passed to the reduce function that we will define now.
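The emit mechanism can be imitated in Python to see what the intermediate results look like. This is a sketch under assumptions, not the server's implementation: word_map, the emit callback and the two sample documents are hypothetical, and the document is passed as an argument instead of being bound to 'this'.

```python
from collections import defaultdict

def word_map(doc, emit):
    """Python analogue of a map function: emit (word, 1) for each word in doc['text']."""
    for word in doc["text"].lower().split():
        emit(word, 1)

# Hypothetical documents with a 'text' key.
docs = [{"text": "Yes we can"}, {"text": "yes we did"}]

# Stand-in for the server: collect emitted values under their key.
intermediate = defaultdict(list)

def emit(key, value):
    intermediate[key].append(value)

for doc in docs:
    word_map(doc, emit)

# Each key now maps to the list of values that a reduce function would receive.
print(dict(intermediate))
```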

wordReduce.js
The reduce function must reduce a list of values of a chosen type to a single value of that same type; it must also be associative and commutative, so that it doesn't matter how the mapped items are grouped.

Now let's code our word count example using the PyMongo client, passing the map/reduce functions to the server.

mapReduce.py
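The original listing is not reproduced here, so what follows is only a sketch of how such a script could look, not the author's file. It assumes a PyMongo Database handle named db and a hypothetical collection whose documents have a 'text' field. It runs the server's mapReduce command through db.command with inline output, because Collection.map_reduce was removed in PyMongo 4.0. The mapReduce command itself is deprecated on recent MongoDB servers in favour of the aggregation pipeline, so treat this as illustrative.

```python
# JavaScript sent to the server; both functions are assumptions written for this sketch.
MAP_JS = """
function () {
    // 'this' is the current document; split its text into lowercase words.
    var words = this.text.toLowerCase().match(/[a-z']+/g) || [];
    for (var i = 0; i < words.length; i++) {
        emit(words[i], 1);
    }
}
"""

REDUCE_JS = """
function (key, values) {
    // Sum the partial counts; safe to re-run on its own output.
    var total = 0;
    for (var i = 0; i < values.length; i++) {
        total += values[i];
    }
    return total;
}
"""

def word_count(db, collection_name):
    """Run an inline map-reduce and return {word: count}.

    `db` is expected to be a pymongo Database (anything with a compatible
    .command method works); `collection_name` is a hypothetical collection.
    """
    result = db.command(
        "mapReduce",
        collection_name,
        map=MAP_JS,
        reduce=REDUCE_JS,
        out={"inline": 1},
    )
    # Inline results arrive as a list of {"_id": key, "value": reduced_value}.
    return {row["_id"]: int(row["value"]) for row in result["results"]}
```

With a running server, usage could look like word_count(MongoClient()["test"], "speeches"), where the database and collection names are placeholders.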
Let's see the result now:

And it works! :D

With map-reduce, the word frequency count is efficient, and it can perform even better in a distributed environment. This brief experiment shows the potential of the map-reduce model for distributed computing, especially on large data sets.

All the code used in this article can be downloaded here.

My next posts will be about performance evaluation of machine learning techniques. Stay tuned!

Marcel Caraciolo


Reposted from: https://www.cnblogs.com/zhengyun_ustc/archive/2010/08/22/1805849.html
