Building a Monorepo for Data Science with Pantsbuild

At HousingAnywhere, one of the first major obstacles we faced when scaling the Data team was building a centralised repository for our ever-growing collection of machine learning applications. Many of these projects share dependencies with each other, which means code refactoring can become a pain and consume lots of time. In addition, as we're very opposed to Data Scientists' tendency to copy/paste code, we need a unified location to store reusable functions that can be easily accessed.

The perfect solution to our use case was building a monorepo. In this article, I’ll go through how a simple monorepo can be built using the build automation system Pantsbuild.

What is a monorepo?

A monorepo is a repository where code for many projects is stored together. Having a centralised repository for your team comes with a number of benefits:

  • Reusability: Allows projects to share functions. In the case of Data Science, code for preprocessing data, calculating metrics and even plotting graphs can be shared across projects.

  • Atomic changes: It only takes one operation to make changes across multiple projects.

  • Large-scale refactoring: Can be done easily and quickly, ensuring projects still work afterwards.

A monorepo, however, is not a one-size-fits-all solution, as it comes with a number of disadvantages:

  • Security issues: There are no means to expose only parts of the repository.

  • Big codebase: As the repo grows in size, it can cause problems as developers have to check out the entire repository.

At HousingAnywhere, our team of Data Scientists finds the monorepo to be the perfect solution for our use cases in the Data team. Many of our machine learning applications have smaller projects that spin off from them. The monorepo enables us to quickly integrate these new projects into the CI/CD pipeline, reducing the time spent setting up a pipeline individually for each new project.

We tried out a number of build automation systems and the one that we stuck with is Pantsbuild. Pants is one of the few systems that supports Python natively, and is an open-source project widely used by Twitter, Toolchain, Foursquare, Square, and Medium.

Recently, Pants was updated to v2, which only supports Python at the moment, but that isn't too much of a limitation for Data Science projects.

Some basic concepts

There are a couple of concepts in Pants that you should understand beforehand:

  • Goals help users tell Pants what actions to take e.g. test

  • Tasks are the Pants modules that run actions

  • Targets describe what files to take those actions upon. These targets are defined in a BUILD file

  • Target types define the types of operations that can be performed on a target e.g. you can perform tests on test targets

  • Addresses describe the location of a target in the repo

For more information, I highly recommend reading this documentation where the developers of Pants have done an excellent job in explaining these concepts in detail.

An example repository

In this section, I'll go through how you can easily set up a monorepo using Pants. First, make sure these requirements are met to install Pants:

  • Linux or macOS.
  • Python 3.6+ discoverable on your PATH.

  • Internet access (so that Pants can fully bootstrap itself).

Now, let’s set up a new repository:

mkdir monorepo-example
cd monorepo-example
git init

Alternatively, you can clone the example repo via:

git clone https://github.com/uiucanh/monorepo-example.git

Next, run these commands to download the setup file:

printf '[GLOBAL]\npants_version = "1.30.0"\nbackend_packages = []\n' > pants.toml

curl -L -o ./pants https://pantsbuild.github.io/setup/pants && \
    chmod +x ./pants

Then, bootstrap Pants by running ./pants --version . You should receive 1.30.0 as output.

Let’s add a couple of simple apps to the repo. First, we’ll create a utils/data_gen.py and a utils/metrics.py that contain a couple of util functions:

import numpy as np


def generate_linear_data(n_samples: int = 100, n_features: int = 1,
                         x_min: int = -5, x_max: int = 5,
                         m_min: int = -10, m_max: int = 10,
                         noise_strength: int = 1, seed: int = None,
                         bias: int = 10):
    # Set the random seed
    if seed is not None:
        np.random.seed(seed)

    X = np.random.uniform(x_min, x_max, size=(n_samples, n_features))
    m = np.random.uniform(m_min, m_max, size=n_features)
    y = np.dot(X, m).reshape((n_samples, 1))

    if bias != 0:
        y += bias

    # Add Gaussian noise
    y += np.random.normal(size=y.shape) * noise_strength
    return X, y


def split_dataset(X: np.ndarray, y: np.ndarray,
                  test_size: float = 0.2, seed: int = 0):
    # Set the random seed
    np.random.seed(seed)

    # Shuffle dataset
    indices = np.random.permutation(len(X))
    X = X[indices]
    y = y[indices]

    # Splitting
    X_split_point = int(len(X) * (1 - test_size))
    y_split_point = int(len(y) * (1 - test_size))
    X_train, X_test = X[:X_split_point], X[X_split_point:]
    y_train, y_test = y[:y_split_point], y[y_split_point:]
    return X_train, X_test, y_train, y_test
import numpy as np


def mean_absolute_percentage_error(y_true: np.ndarray, y_pred: np.ndarray):
    y_true, y_pred = np.array(y_true), np.array(y_pred)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100


def r2(y_test: np.ndarray, y_pred: np.ndarray):
    y_mean = np.mean(y_test)
    ss_tot = np.square(y_test - y_mean).sum()
    ss_res = np.square(y_test - y_pred).sum()
    result = 1 - ss_res / ss_tot
    return result
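As a quick sanity check of these metric definitions, the snippet below exercises them on a tiny hand-computed example. The two functions are inlined so it runs standalone, and the numbers are purely illustrative (not from the article):

```python
import numpy as np


def mean_absolute_percentage_error(y_true, y_pred):
    # Same definition as in utils/metrics.py above.
    y_true, y_pred = np.array(y_true), np.array(y_pred)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100


def r2(y_test, y_pred):
    # Coefficient of determination: 1 - SS_res / SS_tot.
    y_mean = np.mean(y_test)
    ss_tot = np.square(y_test - y_mean).sum()
    ss_res = np.square(y_test - y_pred).sum()
    return 1 - ss_res / ss_tot


y_true = np.array([100.0, 200.0, 400.0, 100.0])
y_pred = np.array([150.0, 150.0, 100.0, 150.0])

# Per-sample errors are 50%, 25%, 75% and 50%, so MAPE is their mean: 50.0
print(mean_absolute_percentage_error(y_true, y_pred))  # 50.0
# SS_res = 97500, SS_tot = 60000, so R2 = 1 - 1.625 = -0.625
# (negative: these predictions are worse than always predicting the mean)
print(r2(y_true, y_pred))  # -0.625
```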

Now, we'll add an application first_app/app.py that imports this code. The app takes data from generate_linear_data, passes it to a Linear Regression model and outputs the Mean Absolute Percentage Error.

import os
import sys

# Enable import from outer directory
file_path = os.path.dirname(os.path.realpath(__file__))
sys.path.insert(0, file_path + "/..")

from utils.data_gen import generate_linear_data, split_dataset  # noqa
from utils.metrics import mean_absolute_percentage_error, r2  # noqa
from sklearn.linear_model import LinearRegression  # noqa


class Model:
    def __init__(self, X, y):
        self.X = X
        self.y = y
        self.m = LinearRegression()
        self.y_pred = None

    def split(self, test_size=0.33, seed=0):
        self.X_train, self.X_test, self.y_train, self.y_test = split_dataset(
            self.X, self.y, test_size=test_size, seed=seed)

    def fit(self):
        self.m.fit(self.X_train, self.y_train)

    def predict(self):
        self.y_pred = self.m.predict(self.X_test)


def main():
    X, y = generate_linear_data()
    m = Model(X, y)
    m.split()
    m.fit()
    m.predict()
    print("MAPE:", mean_absolute_percentage_error(m.y_test, m.y_pred))


if __name__ == '__main__':
    main()

And another app second_app/app.py that uses the first app's code:

import sys
import os

# Enable import from outer directory
file_path = os.path.dirname(os.path.realpath(__file__))
sys.path.insert(0, file_path + "/..")

from utils.metrics import r2  # noqa
from utils.data_gen import generate_linear_data, split_dataset  # noqa
from first_app.app import Model  # noqa


def main():
    X, y = generate_linear_data()
    m = Model(X, y)
    m.split()
    m.fit()
    m.predict()
    result = r2(m.y_test, m.y_pred)
    print("R2:", result)
    return result


if __name__ == '__main__':
    _ = main()

Then we add a couple of simple tests for these apps, for example:

import numpy as np
from first_app.app import Model


def test_model_working():
    X, y = np.array([[1, 2, 3], [4, 5, 6]]), np.array([[1], [2]])
    m = Model(X, y)
    m.split()
    m.fit()
    m.predict()
    assert m.y_pred is not None
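In the same spirit, a test for split_dataset might check that the split sizes add up. Below is a sketch with the function inlined so it runs standalone; in the repo it would simply import from utils.data_gen:

```python
import numpy as np


def split_dataset(X, y, test_size=0.2, seed=0):
    # Inlined from utils/data_gen.py above.
    np.random.seed(seed)
    indices = np.random.permutation(len(X))
    X, y = X[indices], y[indices]
    split_point = int(len(X) * (1 - test_size))
    return X[:split_point], X[split_point:], y[:split_point], y[split_point:]


def test_split_sizes():
    X = np.arange(20).reshape(10, 2)
    y = np.arange(10).reshape(10, 1)
    X_train, X_test, y_train, y_test = split_dataset(X, y, test_size=0.2)
    # An 80/20 split of 10 samples gives 8 train rows and 2 test rows.
    assert len(X_train) == 8 and len(X_test) == 2
    assert len(y_train) == 8 and len(y_test) == 2


test_split_sizes()
print("ok")  # ok
```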

In each of these directories, we’ll need a BUILD file. These files contain information about your targets and their dependencies. In these files, we’ll declare what requirements are needed for these projects as well as declare the test targets.

Let’s start from the root of the repository:

python_requirements()

This BUILD file contains a macro python_requirements() that creates multiple targets, pulling third-party dependencies from a requirements.txt in the same directory. It saves us the time of declaring each requirement manually like this:

python_requirement_library(
    name="numpy",
    requirements=[
        python_requirement("numpy==1.19.1"),
    ],
)
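For reference, the requirements.txt that the macro reads is just a plain list of pinned third-party dependencies. The article doesn't show its contents; apart from numpy==1.19.1 above and pytest 5.3.5 (visible in the test output later), the exact pins below are assumptions:

```
numpy==1.19.1
scikit-learn==0.23.2
pytest==5.3.5
```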

The BUILD file in utils would look like below:

python_library(
    name = "utils",
    sources = [
        "data_gen.py",
        "metrics.py",
    ],
    dependencies = [
        # The `//` signals that the target is at the root of your project.
        "//:numpy",
    ],
)

python_tests(
    name = 'utils_test',
    sources = [
        "data_gen_test.py",
        "metrics_test.py",
    ],
    dependencies = [
        ":utils",
    ],
)

Here we have two targets. The first is a Python library containing the Python code defined in sources, i.e. our two utility files. It also specifies the requirement needed to run this code: numpy, one of the third-party dependencies we defined in the root BUILD file.

The second target is the collection of tests we defined earlier; their dependency is the previous Python library. Running these tests is as simple as running ./pants test utils:utils_test or ./pants test utils:: from the root. The second : tells Pants to run all the test targets in that BUILD file. The output should look like this:

============== test session starts ===============
platform darwin -- Python 3.7.5, pytest-5.3.5, py-1.9.0, pluggy-0.13.1
cachedir: .pants.d/test/pytest/.pytest_cache
rootdir: /Users/ducbui/Desktop/Projects/monorepo-example, inifile: /dev/null
plugins: cov-2.8.1, timeout-1.3.4
collected 3 items

utils/data_gen_test.py .                   [ 33%]
utils/metrics_test.py ..                   [100%]

Similarly, we'll create two BUILD files for first_app and second_app.

python_library(
    name = "first_app",
    sources = ["app.py"],
    dependencies = [
        "//:numpy",
        "//:scikit-learn",
        "//:pytest",
        "utils",
    ],
)

python_tests(
    name = 'app_test',
    sources = ["app_test.py"],
    dependencies = [
        ":first_app",
    ],
)

In the second_app BUILD file, we declare the first_app library above as the dependency of this library. This means that all the dependencies of that library, together with its sources, become dependencies of second_app.

second_app BUILD文件中,我们从上面的first_app声明该库作为该库的依赖项。 这意味着该库中的所有依赖项及其源将成为first_app的依赖项。

python_library(
    name = "second_app",
    sources = ["app.py"],
    dependencies = [
        "first_app",
    ],
)

python_tests(
    name = 'app_test',
    sources = ["app_test.py"],
    dependencies = [
        ":second_app",
    ],
)

Similarly, we also add some test targets to these BUILD files and they can be run with ./pants test first_app:: or ./pants test second_app:: .

The final directory tree should look like this:

.
├── BUILD
├── first_app
│   ├── BUILD
│   ├── app.py
│   └── app_test.py
├── pants
├── pants.toml
├── requirements.txt
├── second_app
│   ├── BUILD
│   ├── app.py
│   └── app_test.py
└── utils
    ├── BUILD
    ├── data_gen.py
    ├── data_gen_test.py
    ├── metrics.py
    └── metrics_test.py

The power of Pants comes from its ability to trace transitive dependencies between projects and find the test targets affected by a change. The developers of Pants provide us with this nifty bash script that can be used to track down affected test targets:

#!/bin/bash

set -x
set -o
set -e

# Disable Zinc incremental compilation to ensure no historical cruft pollutes the build used for CI testing.
export PANTS_COMPILE_ZINC_INCREMENTAL=False

changed=("$(./pants --changed-parent=origin/master list)")
dependees=("$(./pants dependees --dependees-transitive --dependees-closed ${changed[@]})")
minimized=("$(./pants minimize ${dependees[@]})")
./pants filter --filter-type=-jvm_binary ${minimized[@]} | sort > minimized.txt

# In other contexts we can use --spec-file to read the list of targets to operate on all at
# once, but that would merge all the classpaths of all the test targets together, which may cause
# errors. See https://www.pantsbuild.org/3rdparty_jvm.html#managing-transitive-dependencies.
# TODO(#7480): Background cache activity when running in a loop can sometimes lead to race conditions which
# cause pants to error. This can probably be worked around with --no-cache-compile-rsc-write. See
# https://github.com/pantsbuild/pants/issues/7480.

for target in $(cat minimized.txt); do
    ./pants test $target
done

To showcase its power, let’s run an example. We’ll create a new branch, make a modification to data_gen.py (e.g. changing the default parameter for generate_linear_data ) and commit:

git checkout -b "example_1"
git add utils/data_gen.py
git commit -m "support/change-params"

Now, running the bash script we’ll see a minimized.txt that contains all the projects that are impacted and the test targets that will be executed:

first_app:app_test
second_app:app_test
utils:utils_test
Transitive dependencies
Looking at the graph above, we can clearly see that changing utils would affect all of the nodes above it, including first_app and second_app.
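The dependency-tracing idea itself is easy to sketch. The toy snippet below (illustrative only, not Pants' actual implementation) builds the example repo's dependency graph, inverts it, and walks upwards from a changed target to find everything affected:

```python
# Toy dependency graph: target -> targets it depends on,
# mirroring the example repo (second_app -> first_app -> utils).
deps = {
    "first_app": ["utils"],
    "second_app": ["first_app"],
    "utils": [],
}


def transitive_dependees(changed, deps):
    # Invert the graph: target -> targets that directly depend on it.
    dependees = {target: set() for target in deps}
    for target, requirements in deps.items():
        for req in requirements:
            dependees[req].add(target)

    # Walk upwards from the changed target, collecting everything affected.
    affected, stack = {changed}, [changed]
    while stack:
        for dep in dependees[stack.pop()]:
            if dep not in affected:
                affected.add(dep)
                stack.append(dep)
    return affected


print(sorted(transitive_dependees("utils", deps)))
# ['first_app', 'second_app', 'utils']
```

Changing second_app alone would yield only {'second_app'}, since it is the topmost node.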

Let's do another example; this time we'll only modify second_app/app.py. Switch branches, commit and run the script again. Inside minimized.txt, we'll only get second_app:app_test, as it's the topmost node.

And that's it. Hopefully I've managed to demonstrate how useful Pantsbuild can be for Data Science monorepos. Together with a properly implemented CI/CD pipeline, it can vastly improve the speed and reliability of development.

Translated from: https://towardsdatascience.com/building-a-monorepo-for-data-science-with-pantsbuild-2f77b9ee14bd
