Scraping the Web with Selenium on Google Cloud Composer (Airflow)

There are already plenty of resources on building web scrapers in Python, usually based on either the well-known packages urllib and beautifulsoup4 or on Selenium. When you need to scrape a JavaScript-heavy page, or need a level of interaction with the content that cannot be achieved by simply sending URL requests, Selenium is very likely your preferred choice. I don’t want to go into detail here on how to set up your scraping script or on best practices for running it reliably. I just want to refer to this and this resource, which I found particularly helpful.

The problem we want to solve in this post is: how can I, as a Data Analyst/Data Scientist, set up an orchestrated and fully managed process to run a Selenium scraper with a minimum of DevOps work? The main use case for such a setup is a managed, scheduled solution that runs all your scraping jobs in the cloud.

The tools we are going to use are:

Google Cloud Composer to schedule jobs and orchestrate workflows

Selenium as a framework to scrape websites

Google Kubernetes Engine to deploy a Selenium remote driver as containerized application in the cloud

At HousingAnywhere we were already using Google Cloud Composer for a number of different tasks. Cloud Composer is a great tool for managing, scheduling and monitoring workflows as directed acyclic graphs (DAGs). It is based on the open-source framework Apache Airflow and uses pure Python, which makes it ideal for everyone working in the data field. The entry barrier to deploying Airflow on your own is relatively high if you do not come from a DevOps background, which has led some cloud providers to offer managed Airflow deployments, Google’s Cloud Composer being one of them.

When deploying Selenium for web scraping, we’re actually using the so-called Selenium WebDriver. WebDriver is a framework that allows you to control a browser from code (Java, .NET, PHP, Python, Perl, Ruby). For most use cases you would simply download a browser driver that interacts directly with the WebDriver framework, for example Mozilla geckodriver or ChromeDriver. The scraping script then starts a browser instance on your local machine and executes all actions as specified. In our use case things are a bit more complicated, because we want to run the script on a recurring schedule without using any local resources. To deploy and run web scraping scripts in the cloud, we need a Selenium Remote WebDriver, which talks to a browser hosted by a Selenium server (Selenium Grid), instead of a local Selenium WebDriver.
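
To make the difference between the local and the remote driver concrete, here is a minimal sketch, assuming the Selenium 4 Python bindings. The helper names hub_url and make_driver and the host name selenium-firefox are made up for illustration. The port 4444 and the /wd/hub path are the usual defaults of the standalone Selenium images, but check them against your own deployment.

```python
from typing import Optional


def hub_url(host: str, port: int = 4444) -> str:
    """Build the endpoint of a Selenium server (assumed default port and path)."""
    return f"http://{host}:{port}/wd/hub"


def make_driver(remote: Optional[str] = None):
    """Return a Firefox driver: local if `remote` is None, otherwise remote."""
    # Imported lazily so hub_url() can be used without Selenium installed.
    from selenium import webdriver

    options = webdriver.FirefoxOptions()
    if remote is None:
        # Local WebDriver: starts Firefox on the machine running this script.
        return webdriver.Firefox(options=options)
    # Remote WebDriver: the browser runs wherever the Selenium server runs.
    return webdriver.Remote(command_executor=remote, options=options)


# Usage (hypothetical host name for the remote browser):
#   driver = make_driver(hub_url("selenium-firefox"))
#   driver.get("https://example.com")
#   driver.quit()
```

Only the driver construction changes. The scraping logic that follows (get, find_element and so on) is the same for both variants.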

Source: https://www.browserstack.com/guide/difference-between-selenium-remotewebdriver-and-webdriver

Running remote web browser instances with Selenium Grid

The idea behind Selenium Grid is to provide a framework that lets you run scraping instances in parallel by running web browsers on one or more machines. In this case, we can make use of the provided standalone browsers, which are already packaged as Docker images (keep in mind that each of the available browsers, Firefox, Chrome and Opera, is a separate image).

Cloud Composer runs Apache Airflow on top of a Google Kubernetes Engine (GKE) cluster and is fully integrated with other Google Cloud products. A new Cloud Composer environment also comes with a functional UI and a Cloud Storage bucket. All DAGs, plugins, logs and other required files are stored in this bucket.

Deploy and expose the remote driver on GKE

You can deploy a Docker image for the Firefox standalone browser using a selenium-firefox.yaml configuration file and apply the specified configuration to your cluster by running:

kubectl apply -f selenium-firefox.yaml

The configuration file describes what kind of object you want to create, along with its metadata and spec.
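
The original configuration file is not reproduced here, so the following is only a sketch of what such a configuration could contain, not the author's file. It builds a Deployment and a Service as Python dictionaries and writes them as JSON, which kubectl apply -f accepts as well as YAML. The name selenium-firefox, the single replica and the in-cluster Service are assumptions. selenium/standalone-firefox is the public standalone image, which listens on port 4444 by default.

```python
import json

APP = "selenium-firefox"  # hypothetical name, also used as the label selector
PORT = 4444               # default port of the Selenium standalone server


def build_manifest(image: str = "selenium/standalone-firefox") -> dict:
    """Return a Kubernetes List holding a Deployment and a Service."""
    labels = {"app": APP}
    deployment = {
        "apiVersion": "apps/v1",
        "kind": "Deployment",                        # the kind of object
        "metadata": {"name": APP, "labels": labels},  # its metadata
        "spec": {                                     # its desired state
            "replicas": 1,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [
                        {
                            "name": APP,
                            "image": image,
                            "ports": [{"containerPort": PORT}],
                        }
                    ]
                },
            },
        },
    }
    service = {
        "apiVersion": "v1",
        "kind": "Service",  # exposes the pod inside the cluster (ClusterIP)
        "metadata": {"name": APP},
        "spec": {
            "selector": labels,
            "ports": [{"port": PORT, "targetPort": PORT}],
        },
    }
    return {"apiVersion": "v1", "kind": "List", "items": [deployment, service]}


def write_manifest(path: str = "selenium-firefox.json") -> None:
    """Write the manifest to disk so it can be passed to kubectl apply -f."""
    with open(path, "w") as fh:
        json.dump(build_manifest(), fh, indent=2)
```

After calling write_manifest(), kubectl apply -f selenium-firefox.json would create both objects. The Service name then serves as the host name other pods in the cluster use to reach the browser.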

We can create a new connection in the Admin UI of Airflow and access the connection details later in our plugin. The connection details are either specified in the YAML file or can be found on your Kubernetes cluster.

After setting up the connection, we can access it in our scraping script (an Airflow plugin), where we connect to the remote browser.
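
As a rough sketch of that step, the snippet below looks up an Airflow connection and opens a remote Firefox session with it. The connection ID selenium_remote and the function names are made up for illustration. The import path airflow.hooks.base is the Airflow 2 location of BaseHook (older releases use airflow.hooks.base_hook), and the fallback port 4444 and /wd/hub path are assumed defaults.

```python
def selenium_endpoint(conn) -> str:
    """Turn an Airflow connection (anything with .host and .port) into a URL."""
    port = conn.port or 4444  # fall back to Selenium's default port
    return f"http://{conn.host}:{port}/wd/hub"


def scrape_title(url: str, conn_id: str = "selenium_remote") -> str:
    """Open `url` in the remote browser and return the page title."""
    # Lazy imports: these packages are only needed inside the Airflow workers.
    from airflow.hooks.base import BaseHook
    from selenium import webdriver

    conn = BaseHook.get_connection(conn_id)  # details stored via the Admin UI
    driver = webdriver.Remote(
        command_executor=selenium_endpoint(conn),
        options=webdriver.FirefoxOptions(),
    )
    try:
        driver.get(url)
        return driver.title
    finally:
        driver.quit()  # always release the remote browser session
```

A function like scrape_title could then be called from a PythonOperator task, so that the schedule and retries are handled by Airflow while the browser itself runs in the Selenium pod.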

Thank you, Massimo Belloni, for the technical consultancy and advice in realizing this project and article.

Translated from: https://towardsdatascience.com/scraping-the-web-with-selenium-on-google-cloud-composer-airflow-7f74c211d1a1
