Practical Introduction to Web Scraping in Python

Web Scraping Basics

What is web scraping all about? Consider the following scenario:

Imagine that one day, out of the blue, you find yourself thinking “Gee, I wonder who the five most popular mathematicians are?”

You do a bit of thinking, and you get the idea to use Wikipedia’s XTools to measure the popularity of a mathematician by equating popularity with page views. For example, look at the page on Henri Poincaré. There you can see that Poincaré’s pageviews for the last 60 days are, as of December 2017, around 32,000.

Next, you Google for “famous mathematicians” and find this resource, which lists 100 names. Now you have both a page listing mathematicians’ names and a website that provides information about how “popular” each mathematician is. Now what?

This is where Python and web scraping come in. Web scraping is about downloading structured data from the web, selecting some of that data, and passing along what you selected to another process.

In this tutorial, you will be writing a Python program that downloads the list of 100 mathematicians and their XTools pages, selects data about their popularity, and finishes by telling us the all-time top 5 most popular mathematicians! Let’s get started.

Setting Up Your Python Web Scraper

You will be using Python 3 and Python virtual environments throughout the tutorial. Feel free to set things up however you like, but here is how I tend to do it:

$ python3 -m venv venv
$ . ./venv/bin/activate

You will only need to install these two packages:

  • requests for performing your HTTP requests.
  • BeautifulSoup4 for handling all of your HTML processing.

Let’s install these dependencies with pip:

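$ pip install requests BeautifulSoup4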

Finally, if you want to follow along, fire up your favorite text editor and create a file called mathematicians.py. Get started by including these import statements at the top:

from requests import get
from requests.exceptions import RequestException
from contextlib import closing
from bs4 import BeautifulSoup

Making Web Requests

Your first task will be to download web pages. The requests package comes to the rescue. requests aims to be an easy-to-use tool for doing all things HTTP in Python, and it doesn’t disappoint. In this tutorial, you will only need the requests.get function, but you should definitely check out the full documentation when you want to go further.

First, your function for fetching pages:

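A minimal sketch of simple_get follows, along with a small log_error helper used throughout the rest of the tutorial; the status-code and content-type checks are one reasonable way to decide that a response is worth keeping:

def simple_get(url):
    """
    Attempt to GET the content at `url`. Return the raw HTML
    content if the response looks like HTML, otherwise None.
    """
    try:
        # closing() guarantees the response's network resources
        # are released when the with block exits
        with closing(get(url, stream=True)) as resp:
            content_type = resp.headers.get('Content-Type', '').lower()
            if resp.status_code == 200 and 'html' in content_type:
                return resp.content
            return None
    except RequestException as e:
        log_error('Error during requests to {0}: {1}'.format(url, str(e)))
        return None


def log_error(e):
    """
    Print an error to the console. A real application might
    write to a log file instead.
    """
    print(e)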

The simple_get function accepts a single url argument. It then makes a GET request to that url. If nothing goes wrong, you end up with the raw HTML content for the page you requested. If there were any problems with your request (like the url is bad or the remote server is down), then your function returns None.

You may have noticed the use of the closing function in your definition of simple_get. The closing function ensures that any network resources are freed when they go out of scope in that with block. Using closing like that is good practice and helps to prevent fatal errors and network timeouts.

You can test simple_get like this:

>>> from mathematicians import simple_get
>>> raw_html = simple_get('https://realpython.com/blog/')
>>> len(raw_html)
33878
>>> no_html = simple_get('https://realpython.com/blog/nope-not-gonna-find-it')
>>> no_html is None
True

Wrangling HTML With BeautifulSoup

Once you have raw HTML in front of you, you can start to select and extract. For this purpose, you will be using BeautifulSoup. The BeautifulSoup constructor parses raw HTML strings and produces an object that mirrors the HTML document’s structure. The object includes a slew of methods to select, view, and manipulate DOM nodes and text content.

As a quick, contrived example, consider the following HTML document:

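A small document along these lines will do. The paragraph with id="walrus" is the one the session below selects; the other paragraph is made up here so the id check has something to skip:

<!DOCTYPE html>
<html>
  <head>
    <title>Contrived Example</title>
  </head>
  <body>
    <p id="eggman">I am the egg man</p>
    <p id="walrus">I am the walrus</p>
  </body>
</html>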

Suppose that the above HTML is saved in the file contrived.html. Then you can use BeautifulSoup like this:

>>> from bs4 import BeautifulSoup
>>> raw_html = open('contrived.html').read()
>>> html = BeautifulSoup(raw_html, 'html.parser')
>>> for p in html.select('p'):
...     if p['id'] == 'walrus':
...         print(p.text)
'I am the walrus'

Breaking down the example, you first parse the raw HTML by passing it to the BeautifulSoup constructor. BeautifulSoup accepts multiple backend parsers, but the standard backend is 'html.parser', which you supply here as the second argument. (If you neglect to supply 'html.parser', the code will still work, but you will see a warning printed to your screen.)

The select method on your html object lets you use CSS selectors to locate elements in the document. In the above case, html.select('p') returns a list of paragraph elements. Each p has HTML attributes that you can access like a dict. In the line if p['id'] == 'walrus', for example, you check whether the id attribute is equal to the string 'walrus', which corresponds to <p id="walrus"> in the HTML.

Using BeautifulSoup to Get Mathematician Names

Now that you have given BeautifulSoup’s select method a short test drive, how do you find out what to supply to select? The fastest way is to step out of Python and into your web browser’s developer tools. You can use your browser to examine the document in some detail – I usually look for id or class element attributes or any other information that uniquely identifies the information I want to extract.

To make matters concrete, turn to the list of mathematicians you saw earlier. Spending a minute or two looking at this page’s source, you can see that each mathematician’s name appears inside the text content of an <li> tag. Even better, <li> tags on this page seem to contain nothing but names of mathematicians.

Here’s a quick look using Python:

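One way to poke at the page from the REPL, reusing simple_get from earlier (the exact strings you see depend on the live page; repr makes any embedded newline characters visible):

>>> url = 'http://www.fabpedigree.com/james/mathmen.htm'
>>> raw_html = simple_get(url)
>>> html = BeautifulSoup(raw_html, 'html.parser')
>>> for li in html.select('li')[:4]:
...     print(repr(li.text))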

The above experiment shows that some of the <li> elements contain multiple names separated by newline characters, and others contain just a single name. With this information in mind, you can write your function to extract a single list of names:

def get_names():
    """
    Downloads the page where the list of mathematicians is found
    and returns a list of strings, one per mathematician
    """
    url = 'http://www.fabpedigree.com/james/mathmen.htm'
    response = simple_get(url)

    if response is not None:
        html = BeautifulSoup(response, 'html.parser')
        names = set()
        for li in html.select('li'):
            for name in li.text.split('\n'):
                if len(name) > 0:
                    names.add(name.strip())
        return list(names)

    # Raise an exception if we failed to get any data from the url
    raise Exception('Error retrieving contents at {}'.format(url))

The get_names function downloads the page and iterates over the <li> elements, picking out each name that occurs. Next, you add each name to a Python set, which ensures that you don’t end up with duplicate names. Finally, you convert the set to a list and return it.

Getting the Popularity Score

Nice, you’re nearly done! Now that you have a list of names, you need to pick out the pageviews for each one. The function you write is similar to the function you made to get the list of names, only now you supply a name and pick out an integer value from the page.

Again, you should first check out an example page in your browser’s developer tools. It looks like the text appears inside an <a> element, and that the href attribute of that element always contains the string 'latest-60' as a substring. That’s all the information you need to write your function!

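A sketch of that function follows. The url_root template below is a hypothetical stand-in for the XTools URL pattern you found in your browser’s address bar, so substitute the real one:

def get_hits_on_name(name):
    """
    Accepts a `name` of a mathematician and returns the number
    of page views that mathematician's page received in the
    last 60 days, as an `int`.
    """
    # Hypothetical template -- replace with the real XTools URL
    # pattern from your browser
    url_root = 'https://xtools.example/articleinfo/en.wikipedia.org/{}'
    response = simple_get(url_root.format(name))

    if response is not None:
        html = BeautifulSoup(response, 'html.parser')

        # The pageview count lives in the text of an <a> whose
        # href contains 'latest-60'
        hit_link = [link for link in html.select('a')
                    if link.get('href', '').find('latest-60') > -1]

        if len(hit_link) > 0:
            # Strip commas ('32,000' -> '32000') before converting
            link_text = hit_link[0].text.replace(',', '')
            try:
                return int(link_text)
            except ValueError:
                log_error("couldn't parse {} as an `int`".format(link_text))

    raise Exception('No pageview results found for {}'.format(name))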

Putting It All Together

You have reached a point where you can finally find out which mathematician is most beloved by the public! The plan is simple:

  • Get a list of names;
  • Iterate over the list to get a “popularity score” for each name; and
  • Finish by sorting the names by popularity.

Simple, right? Well, there’s one thing that hasn’t been mentioned yet: errors.

Working with real-world data is messy, and trying to force messy data into a uniform shape will invariably result in the occasional error jumping in to mess with your nice, clean vision of how things ought to be. Ideally, you would like to keep track of errors when they occur in order to get a better sense of the quality of your data.

For your present purpose, you will track instances when you could not find a popularity score for a given mathematician’s name. At the end of the script, you will print a message showing the number of mathematicians who were left out of the rankings.

Here’s the code:

if __name__ == '__main__':
    print('Getting the list of names....')
    names = get_names()
    print('... done.\n')

    results = []

    print('Getting stats for each name....')
    for name in names:
        try:
            hits = get_hits_on_name(name)
            if hits is None:
                hits = -1
            results.append((hits, name))
        except:
            results.append((-1, name))
            log_error('error encountered while processing '
                      '{}, skipping'.format(name))
    print('... done.\n')

    results.sort()
    results.reverse()

    if len(results) > 5:
        top_marks = results[:5]
    else:
        top_marks = results

    print('\nThe most popular mathematicians are:\n')
    for (mark, mathematician) in top_marks:
        print('{} with {} page views'.format(mathematician, mark))

    no_results = len([res for res in results if res[0] == -1])
    print('\nBut we did not find results for '
          '{} mathematicians on the list'.format(no_results))

And that’s it!

When you run the script, you should see a report listing the most popular mathematicians by page views, followed by a count of the mathematicians for which no results were found.

Conclusion & Next Steps

Web scraping is a big field, and you have just finished a brief tour of that field with Python as your guide. You can get pretty far using just requests and BeautifulSoup, but as you followed along, you may have come up with a few questions:

  • What happens if page content loads as a result of asynchronous Javascript requests? (Check out Selenium’s Python API.)
  • How do I write a web spider or search engine bot that traverses large portions of the web?
  • What is this Scrapy thing I keep hearing about?

Translated from: https://www.pybloggers.com/2018/01/practical-introduction-to-web-scraping-in-python/
