I understand that you can send individual files as dependencies with Python Spark programs. But what about full-fledged libraries (e.g. numpy)?

Does Spark have a way to use a provided package manager (e.g. pip) to install library dependencies? Or does this have to be done manually before Spark programs are executed?

If the answer is manual, then what are the "best practice" approaches for synchronizing libraries (installation path, version, etc.) over a large number of distributed nodes?

Solution

Having actually tried it, I think the link I posted as a comment doesn't do exactly what you want with dependencies. What you are quite reasonably asking for is a way to have Spark play nicely with setuptools and pip when installing dependencies. It blows my mind that this isn't supported better in Spark. The third-party dependency problem is largely solved in general-purpose Python, but under Spark the assumption seems to be that you'll go back to manual dependency management.

I have been using an imperfect but functional pipeline based on virtualenv. The basic idea is:

Create a virtualenv purely for your Spark nodes.

Each time you run a Spark job, run a fresh pip install of all your own in-house Python libraries. If you have set these up with setuptools, this will install their dependencies.

Zip up the site-packages dir of the virtualenv. This will include your library and its dependencies, which the worker nodes will need, but not the standard Python library, which they already have.

Pass the single .zip file, containing your libraries and their dependencies, as an argument to --py-files.

Of course you would want to code up some helper scripts to manage this process. Here is a helper script adapted from one I have been using, which could doubtless be improved a lot:

#!/usr/bin/env bash
#
# Helper script to fulfil Spark's Python packaging requirements.
# Installs everything into a designated virtualenv, then zips up that
# virtualenv's site-packages for use as the value of the --py-files
# argument to `pyspark` or `spark-submit`.
#
# First argument is the top-level virtualenv.
# Second argument is the zipfile which will be created, and which you
# can subsequently supply as the --py-files argument to spark-submit.
# Subsequent arguments are all the private packages you wish to install.
# If these are set up with setuptools, their dependencies will be installed.

VENV="$1"; shift
ZIPFILE="$1"; shift

. "$VENV/bin/activate"
for pkg in "$@"; do
    pip install --upgrade "$pkg"
done

# Build the zip under a random temporary name to avoid clashes with
# other processes, then move it into place. Adjust python2.7 to match
# the interpreter your virtualenv uses.
TMPZIP="${TMPDIR:-/tmp}/$RANDOM.zip"
( cd "$VENV/lib/python2.7/site-packages" && zip -q -r "$TMPZIP" . )
mv "$TMPZIP" "$ZIPFILE"

I have a collection of other simple wrapper scripts I run to submit my Spark jobs. I simply call this script first as part of that process and make sure that the second argument (the name of a zip file) is then passed as the --py-files argument when I run spark-submit (as documented in the comments). I always run these scripts, so I never end up accidentally running old code. Compared to the Spark overhead, the packaging overhead is minimal for my small-scale project.
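For concreteness, a hypothetical end-to-end invocation might look like the following (the script name make_pyfiles_zip.sh, the paths ./spark-venv and ./my-package, and the file names deps.zip and my_job.py are all placeholders, not names from my actual setup):

# Build the dependency zip from the virtualenv, installing my private
# package (and, via setuptools, its dependencies) first
./make_pyfiles_zip.sh ./spark-venv deps.zip ./my-package

# Hand the zip to spark-submit so the worker nodes can import from it
spark-submit --py-files deps.zip my_job.py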

There are loads of improvements that could be made, e.g. being smart about when to create a new zip file, or splitting it into two zip files, one containing often-changing private packages and one containing rarely-changing dependencies, which don't need to be rebuilt as often. You could be smarter about checking for file changes before rebuilding the zip (a sketch of that idea follows below). Checking the validity of the arguments would also be a good idea. However, for now this suffices for my purposes.
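As a minimal sketch of the "only rebuild when something changed" idea, using the same hypothetical names as above:

# Rebuild deps.zip only if it is missing, or if any file under the
# package directory is newer than the existing zip
if [ ! -f deps.zip ] || find ./my-package -newer deps.zip | grep -q . ; then
    ./make_pyfiles_zip.sh ./spark-venv deps.zip ./my-package
fi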

The solution I have come up with is not designed for large-scale dependencies like NumPy specifically (although it may work for them). Also, it won't work if you are building C-based extensions and your driver node has a different architecture from your cluster nodes.

I have seen recommendations elsewhere to just run a Python distribution like Anaconda on all your nodes, since it already includes NumPy (and many other packages), and that might be the better way to get NumPy as well as other C-based extensions going. Regardless, we can't always expect Anaconda to have the PyPI package we want in the right version, and in addition you may not control your Spark environment well enough to be able to put Anaconda on it, so I think this virtualenv-based approach is still helpful.
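If you do go the pre-installed-distribution route, pointing Spark's workers at that interpreter is enough; a minimal sketch, assuming Anaconda is installed at the same path on every node (/opt/anaconda here is an assumption):

# Assumes /opt/anaconda/bin/python exists at this path on every node;
# PYSPARK_PYTHON tells Spark which Python the executors should run
export PYSPARK_PYTHON=/opt/anaconda/bin/python
spark-submit my_job.py

That keeps worker-side imports of NumPy and other C-based extensions working without shipping any zip at all.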
