It often happens that you come across a website and are forced to perform a set of actions to finally get some data. You are then faced with a dilemma: how do you make this data available in a form which can easily be consumed by your application?

Scraping comes to the rescue in such a case. And selecting the right tool for the job is quite important.

Puppeteer: Not Just Another Scraping Library

Puppeteer is a Node.js library maintained by the Chrome DevTools team at Google. It runs a Chromium or Chrome (perhaps the more recognizable name) instance, headless by default but configurable, and exposes a set of high-level APIs to control it.

According to its official documentation, Puppeteer is commonly used for tasks including, but not limited to, the following:

  • Generating screenshots and PDFs
  • Crawling an SPA and generating pre-rendered content (i.e. Server Side Rendering)
  • Testing Chrome extensions
  • Automation testing of web interfaces
  • Diagnosing performance issues through techniques like capturing the timeline trace of a website

For our case, we need to be able to access a website and map the data in a form which can be easily consumed by our application.

Sounds simple? The implementation is not that complex, either. Let's start.

Stringing the Code Along

My fondness for Amazon products prompts me to use one of their product listing pages as a sample here. We will implement our use case in two steps:

  • Extract data from the page and map it in an easily consumable JSON form
  • Add a little sprinkle of automation to make our lives a little bit easier

You can find the complete code in this repository.

We will extract the data from this link: https://www.amazon.in/s?k=Shirts&ref=nb_sb_noss_2 (a listing of the top searched shirts, as shown in the image) and map it into an API-servable form.

Before we get started using puppeteer extensively in this section, we need to understand the two primary classes provided by it.

  • Browser: represents a Chrome instance. We get one when we call puppeteer.launch (which starts a new instance) or puppeteer.connect (which attaches to an existing one). This works as a simple browser emulation.

  • Page: resembles a single tab in a Chrome browser. It provides an exhaustive set of methods you can use with a particular page instance, and is created when we call browser.newPage. Just as you can open multiple tabs in the browser, you can create multiple page instances at the same time in puppeteer.
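
As a rough sketch of how the two classes relate, the snippet below opens one Page per URL on a single Browser, much like opening several tabs. The helper name openPages and its list of URLs are hypothetical; only browser.newPage and page.goto are Puppeteer calls.

```javascript
// Hypothetical helper: open one Page (tab) per URL on a single Browser.
// `browser` is expected to be a Puppeteer Browser, e.g. from puppeteer.launch().
async function openPages(browser, urls) {
  return Promise.all(
    urls.map(async (url) => {
      const page = await browser.newPage(); // one tab per URL
      await page.goto(url);
      return page;
    })
  );
}
```

With a real browser this could be called as openPages(await puppeteer.launch(), [urlA, urlB]); the pages load in parallel because each navigation is started before any of them is awaited.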

Setting Up Puppeteer and Navigating to the Target URL

We start setting up puppeteer by using the npm module provided. After installing puppeteer, we create an instance of the browser and the page class and navigate to the target URL.

We use networkidle2 as the value for the waitUntil option while navigating to the URL. With this setting, the navigation is considered finished once there have been no more than 2 network connections for at least 500 ms.

Note: You do not need to have Chrome installed on your system for puppeteer to work. Installing the puppeteer package downloads a compatible browser build along with the library.
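
The setup described in this section could look roughly like the sketch below, assuming puppeteer has been installed with npm install puppeteer. The helper name openListing and the constant names are hypothetical; the option values (a visible browser window, defaultViewport: null, and waitUntil: 'networkidle2') are the ones this article mentions.

```javascript
// URL taken from the article; the option values mirror the text.
const TARGET_URL = 'https://www.amazon.in/s?k=Shirts&ref=nb_sb_noss_2';
const LAUNCH_OPTIONS = { headless: false, defaultViewport: null };
const NAVIGATION_OPTIONS = { waitUntil: 'networkidle2' };

// Hypothetical helper: launch a browser, open a page, and navigate to `url`.
async function openListing(url = TARGET_URL) {
  // Required lazily so the constants above can be inspected without a browser.
  const puppeteer = require('puppeteer');
  const browser = await puppeteer.launch(LAUNCH_OPTIONS);
  const page = await browser.newPage();
  await page.goto(url, NAVIGATION_OPTIONS);
  return { browser, page };
}
```

The caller receives both the browser and the page, so it can keep working with the page and close the browser when it is done.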

Page Methods to Extract and Map Data

The DOM has already loaded in the page instance created. We will go ahead and leverage the page.evaluate() method to query the DOM.

Before we start, we need to figure out the exact data-points we need to extract. In the current sample, each of the product objects will look something like this.
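
The original snippet is not reproduced in this copy. As an illustration only, assuming the five fields that the selector list later in this section maps (brand, product, url, image, price), one product object might be shaped like this. Every value is a made-up placeholder, not scraped data.

```javascript
// Illustrative shape only; all values are placeholders, not real scraped data.
const sampleProduct = {
  brand: 'Example Brand',
  product: 'Example cotton shirt',
  url: 'https://www.amazon.in/example-product',
  image: 'https://example.com/shirt.jpg',
  price: '₹499',
};
```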

We have laid out the structure we want to achieve. Time to start inspecting the DOM for the identifiers. We check for the selectors that occur throughout the items to be mapped. We will mostly use document.querySelector and document.querySelectorAll for traversing the DOM.

After investigating the DOM, we see that each listed item is enclosed in an element matching the selector div[data-cel-widget^="search_result_"]. This selector matches every div whose data-cel-widget attribute has a value starting with search_result_.

Similarly, we map out the selectors for the parameters we require as listed. If you want to learn more about DOM traversal, you can check out this informative article by Zell.

  • total listed items: div[data-cel-widget^="search_result_"]

  • brand: div[data-cel-widget="search_result_${i}"] .a-size-base-plus.a-color-base (i stands for the node number in total listed items)

  • product: div[data-cel-widget="search_result_${i}"] .a-size-base-plus.a-color-base or div[data-cel-widget="search_result_${i}"] .a-size-medium.a-color-base.a-text-normal (i stands for the node number in total listed items)

  • url: div[data-cel-widget="search_result_${i}"] a[target="_blank"].a-link-normal (i stands for the node number in total listed items)

  • image: div[data-cel-widget="search_result_${i}"] .s-image (i stands for the node number in total listed items)

  • price: div[data-cel-widget="search_result_${i}"] span.a-offscreen (i stands for the node number in total listed items)

Note: We wait for elements matching the div[data-cel-widget^="search_result_"] selector to be available on the page by using the page.waitFor method.

Once the page.evaluate method is invoked, we can see the data we require logged.
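
A minimal sketch of this extraction step, using the selectors listed above, might look as follows. It is not the author's code. The function names are hypothetical, the node numbering is assumed to start at 0, and the choice to try the .a-size-medium selector first for the product name is a guess. page.waitForSelector is used here in place of page.waitFor, which recent Puppeteer releases have dropped.

```javascript
// Runs inside the page: it must be self-contained, because Puppeteer
// serializes the function and cannot see variables from the Node side.
function extractProducts() {
  const text = (sel) => {
    const node = document.querySelector(sel);
    return node ? node.textContent.trim() : null;
  };
  const attr = (sel, name) => {
    const node = document.querySelector(sel);
    return node ? node.getAttribute(name) : null;
  };
  const total = document.querySelectorAll('div[data-cel-widget^="search_result_"]').length;
  const products = [];
  // Assumption: node numbers run from 0 to total - 1.
  for (let i = 0; i < total; i++) {
    const item = `div[data-cel-widget="search_result_${i}"]`;
    products.push({
      brand: text(`${item} .a-size-base-plus.a-color-base`),
      // Assumption: prefer the medium-size title, fall back to the other one.
      product:
        text(`${item} .a-size-medium.a-color-base.a-text-normal`) ||
        text(`${item} .a-size-base-plus.a-color-base`),
      url: attr(`${item} a[target="_blank"].a-link-normal`, 'href'),
      image: attr(`${item} .s-image`, 'src'),
      price: text(`${item} span.a-offscreen`),
    });
  }
  return products;
}

// Hypothetical wrapper: wait for the results, then evaluate in the page.
async function scrapeListing(page) {
  await page.waitForSelector('div[data-cel-widget^="search_result_"]');
  return page.evaluate(extractProducts);
}
```

Fields whose selector matches nothing come back as null, so the caller can decide whether to drop incomplete items or keep them.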

Adding Automation to Ease Flow

So far we are able to navigate to a page, extract the data we need, and transform it into an API-ready form. That sounds all hunky-dory.

However, consider for a moment a case where you have to navigate to one URL from another by performing some actions – and then try to extract the data you need.

Would that make your life a little trickier? Not at all. Puppeteer can easily imitate user behavior. Time to add some automation to our existing use case.

Unlike in the previous example, we will go to the amazon.in homepage and search for 'Shirts'. It will take us to the products listing page and we can extract the data required from the DOM. Easy peasy. Let's look at the code.

We can see that we wait for the search box to be available and then fill it with the searchTerm that was passed in, using page.evaluate. We then navigate to the products listing page by emulating a click on the search button, which exposes the DOM of the listing page.
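
The flow just described could be sketched as below. The two selectors for the amazon.in search box and search button are assumptions (the article does not list them) and would need to be checked against the live page; the function name searchFor is hypothetical as well.

```javascript
// Assumed selectors for the amazon.in search box and button; they are not
// given in the article and would need checking against the live page.
const SEARCH_BOX = '#twotabsearchtextbox';
const SEARCH_BUTTON = 'input[type="submit"]';

async function searchFor(page, searchTerm) {
  await page.goto('https://www.amazon.in', { waitUntil: 'networkidle2' });
  await page.waitForSelector(SEARCH_BOX);
  // Set the search term inside the page context.
  await page.evaluate(
    (selector, term) => {
      document.querySelector(selector).value = term;
    },
    SEARCH_BOX,
    searchTerm
  );
  // Start waiting for the navigation before clicking, so it is not missed.
  await Promise.all([
    page.waitForNavigation({ waitUntil: 'networkidle2' }),
    page.click(SEARCH_BUTTON),
  ]);
}
```

Once searchFor resolves, the page is on the listing and the same extraction step used for the direct URL can run against it.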

The complexity of automation varies from use case to use case.

Some Notable Gotchas: A Minor Heads Up

Puppeteer's API is pretty comprehensive, but there are a few gotchas I came across while working with it. Not all of them are strictly about puppeteer itself, but they are worth knowing when you work with it.

  • Puppeteer creates a Chrome browser instance, as already mentioned. However, some websites may block access if they suspect bot activity. A package called user-agents can be used with puppeteer to randomize the browser's user agent.

Note: Scraping a website lies somewhere in the grey areas of legal acceptance. I would recommend using it with caution and checking rules where you live.

  • We came across defaultViewport: null when launching our Chrome instance, and I listed it as optional. That is because it only comes in handy when you are watching the Chrome instance being launched: it disables puppeteer's default fixed viewport, so the website renders at the window's actual width and height.

  • Puppeteer is not the ultimate solution when it comes to performance. You, as a developer, will have to optimize it through actions like throttling animations on the site, allowing only essential network calls, etc.

  • Remember to always end a puppeteer session by closing the Browser instance with browser.close (I happened to miss this on my first try). It ends the running browser session.

  • Certain common JavaScript operations like console.log() will not behave as you might expect within the scope of the page methods: the output goes to the browser's console, not to your terminal. The reason is that the page context/browser context differs from the Node context in which your application is running.
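
Several of these gotchas can be addressed in a few lines. The sketch below is one possible arrangement, not the author's code: the helper names and the list of blocked resource types are assumptions, while page.setUserAgent, page.setRequestInterception, the 'console' and 'request' page events, and browser.close are Puppeteer APIs.

```javascript
// Resource types treated as non-essential here; this list is an assumption.
const BLOCKED_RESOURCE_TYPES = new Set(['image', 'font', 'media']);

function shouldBlock(resourceType) {
  return BLOCKED_RESOURCE_TYPES.has(resourceType);
}

// Hypothetical helper: apply a user agent, forward page logs, trim requests.
async function preparePage(page, userAgent) {
  await page.setUserAgent(userAgent);
  // Forward console.log calls made inside the page to the Node terminal.
  page.on('console', (msg) => console.log('[page]', msg.text()));
  // Allow only the network calls we consider essential.
  await page.setRequestInterception(true);
  page.on('request', (req) =>
    shouldBlock(req.resourceType()) ? req.abort() : req.continue()
  );
}

// Hypothetical helper: always close the browser, even if the task throws.
// `launch` is any function returning a Browser, e.g. () => puppeteer.launch().
async function withBrowser(launch, task) {
  const browser = await launch();
  try {
    return await task(browser);
  } finally {
    await browser.close();
  }
}
```

With the user-agents package, the userAgent string could come from new UserAgent().toString(). Wrapping the work in withBrowser means a thrown error in the scraping code does not leave a browser process running.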

These are some of the gotchas I noticed. If you have more, feel free to reach out to me with them. I would love to learn more.

Done? Let's run the application.

Website to Your API: Bringing It All Together

The application is run in non-headless mode so you can witness what exactly happens. We will automate the navigation to the product listing page from which we obtain the data.

There. You now have your own API-consumable data set up from the website of your choice. All you need to do now is wire this up with a server-side framework like express, and you are good to go.
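
One way that wiring might look, assuming express is installed, is sketched below. The /products route, the q query parameter, port 3000, and the scrape function passed in are all hypothetical; scrape stands for any async function that resolves to the array of product objects.

```javascript
// Hypothetical handler factory: `scrape` is any async function that resolves
// to the array of product objects for a search term.
function makeProductsHandler(scrape) {
  return async (req, res) => {
    try {
      res.json(await scrape(req.query.q || 'Shirts'));
    } catch (err) {
      res.status(500).json({ error: err.message });
    }
  };
}

// Assumed route and port; express is required lazily so the handler can be
// exercised without starting a server.
function startServer(scrape, port = 3000) {
  const express = require('express');
  const app = express();
  app.get('/products', makeProductsHandler(scrape));
  return app.listen(port);
}
```

Keeping the handler separate from the server setup makes it easy to swap in a cached or stubbed scrape function, which matters because each real request would otherwise launch a browser.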

Conclusion

There is so much you can do with Puppeteer. This is just one particular use case. I would recommend that you spend some time to read the official documentation. I will be doing the same.

Puppeteer is used extensively in some of the largest organizations for automation tasks like testing and server side rendering, among others.

There is no better time to get started with Puppeteer than now.

If you have any questions or comments, you can reach out to me on LinkedIn or Twitter.

In the meantime, keep coding.

Translated from: https://www.freecodecamp.org/news/create-api-website-using-puppeteer/
