I understand that you can send individual files as dependencies with Python Spark programs. But what about full-fledged libraries (e.g. numpy)?

Does Spark have a way to use a provided package manager (e.g. pip) to install library dependencies? Or does this have to be done manually before Spark programs are executed?

If the answer is manual, then what are the "best practice" approaches for synchronizing libraries (installation path, version, etc.) over a large number of distributed nodes?

Solution

Having actually tried it, I think the link I posted as a comment doesn't do exactly what you want with dependencies. What you are quite reasonably asking for is a way to have Spark play nicely with setuptools and pip when installing dependencies. It blows my mind that this isn't supported better in Spark. The third-party dependency problem is largely solved in general-purpose Python, but under Spark the assumption seems to be that you'll go back to manual dependency management.

I have been using an imperfect but functional pipeline based on virtualenv. The basic idea is:

Create a virtualenv purely for your Spark nodes.

Each time you run a Spark job, run a fresh pip install of all your own in-house Python libraries. If you have set these up with setuptools, this will install their dependencies.

Zip up the site-packages dir of the virtualenv. This will include your library and its dependencies, which the worker nodes will need, but not the standard Python library, which they already have.

Pass the single .zip file, containing your libraries and their dependencies, as an argument to --py-files.

Of course you would want to code up some helper scripts to manage this process. Here is a helper script adapted from one I have been using, which could doubtless be improved a lot:

#!/usr/bin/env bash
#
# Helper script to fulfil Spark's Python packaging requirements.
# Installs everything into a designated virtualenv, then zips up that
# virtualenv's site-packages for use as the value of the --py-files
# argument to `pyspark` or `spark-submit`.
#
# First argument is the top-level virtualenv.
# Second argument is the zipfile which will be created, and which you
# can subsequently supply as the --py-files argument to spark-submit.
# Subsequent arguments are all the private packages you wish to install.
# If these are set up with setuptools, their dependencies will be installed.

VENV="$1"; shift
ZIPFILE="$1"; shift

. "$VENV/bin/activate"
for pkg in "$@"; do
    pip install --upgrade "$pkg"
done

# Build the zip under a random temporary name to avoid clashes with
# other processes, then move it into place. Adjust python2.7 to match
# the interpreter your virtualenv uses.
TMPZIP="${TMPDIR:-/tmp}/$RANDOM.zip"
( cd "$VENV/lib/python2.7/site-packages" && zip -q -r "$TMPZIP" . )
mv "$TMPZIP" "$ZIPFILE"

I have a collection of other simple wrapper scripts I run to submit my Spark jobs. I simply call this script first as part of that process and make sure that the second argument (the name of a zip file) is then passed as the --py-files argument when I run spark-submit (as documented in the comments). I always run these scripts, so I never end up accidentally running old code. Compared to the Spark overhead, the packaging overhead is minimal for my small-scale project.
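For concreteness, a hypothetical end-to-end invocation might look like the following (the script name make_pyfiles_zip.sh, the paths ./spark-venv and ./my-package, and the file names deps.zip and my_job.py are all placeholders, not names from my actual setup):

# Build the dependency zip from the virtualenv, installing my private
# package (and, via setuptools, its dependencies) first
./make_pyfiles_zip.sh ./spark-venv deps.zip ./my-package

# Hand the zip to spark-submit so the worker nodes can import from it
spark-submit --py-files deps.zip my_job.py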

There are loads of improvements that could be made, e.g. being smart about when to create a new zip file, or splitting it into two zip files, one containing often-changing private packages and one containing rarely-changing dependencies, which don't need to be rebuilt as often. You could be smarter about checking for file changes before rebuilding the zip (a sketch of that idea follows below). Checking the validity of the arguments would also be a good idea. However, for now this suffices for my purposes.
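As a minimal sketch of the "only rebuild when something changed" idea, using the same hypothetical names as above:

# Rebuild deps.zip only if it is missing, or if any file under the
# package directory is newer than the existing zip
if [ ! -f deps.zip ] || find ./my-package -newer deps.zip | grep -q . ; then
    ./make_pyfiles_zip.sh ./spark-venv deps.zip ./my-package
fi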

The solution I have come up with is not designed for large-scale dependencies like NumPy specifically (although it may work for them). Also, it won't work if you are building C-based extensions and your driver node has a different architecture from your cluster nodes.

I have seen recommendations elsewhere to just run a Python distribution like Anaconda on all your nodes, since it already includes NumPy (and many other packages), and that might be the better way to get NumPy as well as other C-based extensions going. Regardless, we can't always expect Anaconda to have the PyPI package we want in the right version, and in addition you may not control your Spark environment well enough to be able to put Anaconda on it, so I think this virtualenv-based approach is still helpful.
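If you do go the pre-installed-distribution route, pointing Spark's workers at that interpreter is enough; a minimal sketch, assuming Anaconda is installed at the same path on every node (/opt/anaconda here is an assumption):

# Assumes /opt/anaconda/bin/python exists at this path on every node;
# PYSPARK_PYTHON tells Spark which Python the executors should run
export PYSPARK_PYTHON=/opt/anaconda/bin/python
spark-submit my_job.py

That keeps worker-side imports of NumPy and other C-based extensions working without shipping any zip at all.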
