An Intro to Web Scraping with lxml and Python

by Timber.io

Why should you bother learning how to web scrape? If your job doesn’t require you to learn it, then let me give you some motivation.

What if you want to create a website which curates the cheapest products from Amazon, Walmart, and a couple of other online stores? A lot of these online stores don’t provide you with an easy way to access their information using an API. In the absence of an API, your only choice is to create a web scraper. This allows you to extract information from these websites automatically and makes that information easy to use.

Here is an example of a typical API response in JSON. This is the response from Reddit:

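As a rough illustration (this request and the field names are just an example — many Reddit URLs return JSON if you append .json to them):

import requests

# Hypothetical example: fetch the top post of r/python as JSON
response = requests.get(
    'https://www.reddit.com/r/python/top.json?limit=1',
    headers={'User-Agent': 'lxml-scraping-tutorial'},
)
data = response.json()
print(data['data']['children'][0]['data']['title'])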

There are a lot of Python libraries out there which can help you with web scraping. There is lxml, BeautifulSoup, and a full-fledged framework called Scrapy.

Most of the tutorials discuss BeautifulSoup and Scrapy, so I decided to go with lxml in this post. I will teach you the basics of XPaths and how you can use them to extract data from an HTML document. I will take you through a couple of different examples so that you can quickly get up-to-speed with lxml and XPaths.

Getting the data

If you are a gamer, you will already know of (and likely love) this website. We will be trying to extract data from Steam. More specifically, we will be selecting information from the “Popular New Releases” section.

I am splitting this process into a two-part series. In this part, we will create a Python script which can extract the names of the games, the prices of the games, the different tags associated with each game, and the target platforms. In the second part, we will turn this script into a Flask-based API and then host it on Heroku.

First of all, open up the “popular new releases” page on Steam and scroll down until you see the Popular New Releases tab. At this point, I usually open up Chrome developer tools and see which HTML tags contain the required data. I extensively use the element inspector tool (the button in the top left of the developer tools). It allows you to see the HTML markup behind a specific element on the page with just one click.

As a high-level overview, everything on a web page is encapsulated in an HTML tag, and tags are usually nested. You need to figure out which tags you need to extract the data from, and you’ll be good to go. In our case, if we take a look, we can see that every separate list item is encapsulated in an anchor (a) tag.

The anchor tags themselves are encapsulated in the div with an id of tab_newreleases_content. I mention the id because there are two tabs on this page. The second tab is the standard "New Releases" tab, and we don't want to extract information from that tab. So we will first extract the "Popular New Releases" tab, and then we will extract the required information from this tag.

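In other words, the structure looks roughly like this (a simplified sketch based on the ids described here, not the exact Steam markup):

<div id="tab_newreleases_content">
    <a href="...">…one game…</a>
    <a href="...">…another game…</a>
</div>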

This is a perfect time to create a new Python file and start writing down our script. I am going to create a scrape.py file. Now let's go ahead and import the required libraries. The first one is the requests library and the second one is the lxml.html library.

import requests
import lxml.html

If you don’t have requests installed, you can easily install it by running this command in the terminal:

$ pip install requests
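
The lxml library can be installed the same way if you don’t have it yet:

$ pip install lxml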

The requests library is going to help us open the web page in Python. We could have used lxml to open the HTML page as well, but it doesn't work well with all web pages. To be on the safe side, I am going to use requests.

Extracting and processing the info

Now let’s open up the web page using requests and pass that response to lxml.html.fromstring.

html = requests.get('https://store.steampowered.com/explore/new/')
doc = lxml.html.fromstring(html.content)

This provides us with an object of HtmlElement type. This object has the xpath method which we can use to query the HTML document. This provides us with a structured way to extract information from an HTML document.

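As a quick sanity check (not part of the final script), you can confirm the type and run a trivial query:

print(type(doc))                    # <class 'lxml.html.HtmlElement'>
print(doc.xpath('//title/text()'))  # the page title, as a list of strings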

Now save this file and open up a terminal. Copy the code from the scrape.py file and paste it in a Python interpreter session.

We are doing this so that we can quickly test our XPaths without continuously editing, saving and executing our scrape.py file.

Let’s try writing an XPath for extracting the div which contains the ‘Popular New Releases’ tab. I will explain the code as we go along:

new_releases = doc.xpath('//div[@id="tab_newreleases_content"]')[0]

This statement will return a list of all the divs in the HTML page which have an id of tab_newreleases_content. Now, because we know that only one div on the page has this id, we can take out the first element from the list ([0]) and that will be our required div. Let’s break down the XPath and try to understand it:

  • // — these double forward slashes tell lxml that we want to search for all tags in the HTML document which match our requirements/filters. Another option is / (a single forward slash), which returns only the immediate child tags/nodes that match our requirements/filters. A short illustration follows this list.

  • div tells lxml that we are searching for divs in the HTML page.

  • [@id="tab_newreleases_content"] tells lxml that we are only interested in those divs which have an id of tab_newreleases_content.
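
To see the difference, here is a quick sketch (these queries are illustrative, not part of the script):

all_divs = doc.xpath('//div')            # every div anywhere in the document
body_divs = doc.xpath('/html/body/div')  # only divs that are direct children of <body>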

Cool! We have the required div. Now let's go back to Chrome and check which tag contains the titles of the releases.

The title is contained in a div with a class of tab_item_name. Now that we have the "Popular New Releases" tab extracted, we can run further XPath queries on that tab. Write down the following code in the same Python console which we previously ran our code in:

titles = new_releases.xpath('.//div[@class="tab_item_name"]/text()')

This gives us the titles of all of the games in the “Popular New Releases” tab. Here is the expected output:

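It is a plain Python list of strings, along these lines (placeholder names — the actual titles on the page change daily):

>>> titles
['Some New Game', 'Another Release: Remastered', 'A Third Title', ...]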

Let’s break down this XPath a little bit, because it is a bit different from the last one.

  • . tells lxml that we are only interested in tags which are children of the new_releases tag.

  • [@class="tab_item_name"] is pretty similar to how we were filtering divs based on id. The only difference is that here we are filtering based on the class name.

  • /text() tells lxml that we want the text contained within the tag we just extracted. In this case, it returns the title contained in the div with the tab_item_name class name.

Now we need to extract the prices for the games. We can easily do that by running the following code:

prices = new_releases.xpath('.//div[@class="discount_final_price"]/text()')

I don’t think I need to explain this code, as it is pretty similar to the title extraction code. The only change we made is the change in the class name.

Now we need to extract the tags associated with the titles. Here is the HTML markup:

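Roughly, it looks like this (a simplified sketch inferred from the XPath below; the real markup nests more elements inside the div):

<div class="tab_item_top_tags">Simulation, Strategy, Building, ...</div>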

Write down the following code in the Python terminal to extract the tags:

tags = new_releases.xpath('.//div[@class="tab_item_top_tags"]')
total_tags = []
for tag in tags:
    total_tags.append(tag.text_content())

So what we are doing here is extracting the divs containing the tags for the games. Then we loop over the list of extracted divs and pull out the text using the text_content() method. text_content() returns the text contained within an HTML tag, without the HTML markup.

Note: We could have also made use of a list comprehension to make that code shorter. I wrote it down in this way so that even those who don’t know about list comprehensions can understand the code. Either way, this is the alternate code:

tags = [tag.text_content() for tag in new_releases.xpath('.//div[@class="tab_item_top_tags"]')]

Let’s also split each tag string into a list, so that each individual tag is a separate element:

tags = [tag.split(', ') for tag in tags]
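
For example, a single string like 'Simulation, Strategy, Building' (hypothetical values) becomes ['Simulation', 'Strategy', 'Building'].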

Now the only thing remaining is to extract the platforms associated with each title. Here is the HTML markup:

The major difference here is that the platforms are not contained as text within a specific tag. Instead, they appear as class names. Some titles only have one platform associated with them, like this:

<span class="platform_img win"></span>

While some titles have 5 platforms associated with them, like this:

<span class="platform_img win"></span>
<span class="platform_img mac"></span>
<span class="platform_img linux"></span>
<span class="platform_img hmd_separator"></span>
<span title="HTC Vive" class="platform_img htcvive"></span>
<span title="Oculus Rift" class="platform_img oculusrift"></span>

As we can see, these spans contain the platform type as the class name. The only common thing between these spans is that all of them contain the platform_img class. First of all, we will extract the divs with the tab_item_details class. Then we will extract the spans containing the platform_img class. Finally, we will extract the second class name from those spans. Here is the code:

platforms_div = new_releases.xpath('.//div[@class="tab_item_details"]')
total_platforms = []

for game in platforms_div:
    temp = game.xpath('.//span[contains(@class, "platform_img")]')
    platforms = [t.get('class').split(' ')[-1] for t in temp]
    if 'hmd_separator' in platforms:
        platforms.remove('hmd_separator')
    total_platforms.append(platforms)

In line 1, we start with extracting the tab_item_details div. The XPath in line 5 is a bit different. Here we have [contains(@class, "platform_img")] instead of simply having [@class="platform_img"]. The reason is that [@class="platform_img"] returns those spans which only have the platform_img class associated with them. If the spans have an additional class, they won't be returned. Whereas [contains(@class, "platform_img")] filters all the spans which have the platform_img class. It doesn't matter whether it is the only class or if there are more classes associated with that tag.

In line 6 we are making use of a list comprehension to reduce the code size. The .get() method allows us to extract an attribute of a tag. Here we are using it to extract the class attribute of a span. We get a string back from the .get() method. In the case of the first game, the string returned is platform_img win, so we split that string on the space and store the last part (which is the actual platform name) in the list.

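For instance, splitting that exact string in the interpreter:

>>> 'platform_img win'.split(' ')[-1]
'win'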

In lines 7–8 we are removing the hmd_separator from the list if it exists. This is because hmd_separator is not a platform. It is just a vertical separator bar used to separate actual platforms from VR/AR hardware.

This is the code we have so far:

import requests
import lxml.html

html = requests.get('https://store.steampowered.com/explore/new/')
doc = lxml.html.fromstring(html.content)

new_releases = doc.xpath('//div[@id="tab_newreleases_content"]')[0]

titles = new_releases.xpath('.//div[@class="tab_item_name"]/text()')
prices = new_releases.xpath('.//div[@class="discount_final_price"]/text()')

tags = [tag.text_content() for tag in new_releases.xpath('.//div[@class="tab_item_top_tags"]')]
tags = [tag.split(', ') for tag in tags]

platforms_div = new_releases.xpath('.//div[@class="tab_item_details"]')
total_platforms = []

for game in platforms_div:
    temp = game.xpath('.//span[contains(@class, "platform_img")]')
    platforms = [t.get('class').split(' ')[-1] for t in temp]
    if 'hmd_separator' in platforms:
        platforms.remove('hmd_separator')
    total_platforms.append(platforms)

Now we just need this to return a JSON response so that we can easily turn this into a Flask based API. Here is the code:

output = []
for info in zip(titles, prices, tags, total_platforms):
    resp = {}
    resp['title'] = info[0]
    resp['price'] = info[1]
    resp['tags'] = info[2]
    resp['platforms'] = info[3]
    output.append(resp)

This code is self-explanatory. We are using the zip function to loop over all of those lists in parallel. Then we create a dictionary for each game and assign the title, price, tags, and platforms as a separate key in that dictionary. Lastly, we append that dictionary to the output list.

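If you want to eyeball the result as actual JSON before wiring it into Flask, the standard library's json module does the job (the values shown are placeholders):

import json

print(json.dumps(output[0], indent=4))
# {
#     "title": "Some New Game",
#     "price": "$29.99",
#     "tags": ["Simulation", "Strategy"],
#     "platforms": ["win", "mac"]
# }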

Wrapping up

In the next post, we will take a look at how we can convert this into a Flask based API and host it on Heroku.

I am Yasoob from Python Tips. I hope you guys enjoyed this tutorial. If you want to read more tutorials of a similar nature, please go to Python Tips. I regularly write Python tips, tricks, and tutorials on that blog. And if you are interested in learning intermediate Python, then please check out my open source book here.

Just a disclaimer: we’re a logging company here @ Timber. We’d love it if you tried out our product (it’s seriously great!), but you’re here to learn about web scraping in Python and we didn’t want to take away from that.

Originally published at timber.io.

Translated from: https://www.freecodecamp.org/news/an-intro-to-web-scraping-with-lxml-and-python-b02b7a3f3098/
