A Step-by-Step Guide to Web Scraping with Python

As data scientists, we are always on the lookout for new data and information to analyze and manipulate. One of the main ways to find data right now is scraping the web for a particular query.

When we browse the internet, we come across a massive number of websites, and these websites display various data in the browser. If we want to use this data for a project or an ML algorithm, we can (but shouldn’t) gather it manually. That would mean copying the sections we want and pasting them into a doc or CSV file.

Needless to say, that would be quite a tedious task. That’s why most data scientists and developers go with web scraping using code. It’s easier to write code that extracts data from 100 webpages than to do it by hand.

Web Scraping is the technique used by programmers to automate the process of finding and extracting data from the internet within a relatively short time.


The most important question when it comes to web scraping is: is it legal?

Is web scraping legal?

The short answer is yes.

The more detailed answer: scraping publicly available data for non-commercial purposes was announced to be completely legal in late January 2020.

You might wonder, what does publicly available mean?

Publicly available information is information that anyone can see or find on the internet without special access. So, information on Wikipedia, social media, or Google’s search results are examples of publicly available data.

Now, social media is somewhat complicated, because parts of it are not publicly available, such as when a user sets their information to be private. In that case, scraping this information is illegal.

One last thing: there’s a difference between publicly available and copyrighted. For example, you can scrape YouTube for video titles, but you can’t use the videos commercially because they are copyrighted.

How to scrape the web?

There are different programming languages that you can use to scrape the web, and within every programming language, there are different libraries to achieve the same goal.


So, what to use?


In this article, I will use Python, Requests, and BeautifulSoup to scrape some pages from Wikipedia.

To scrape and extract any information from the internet, you’ll probably need to go through three stages: fetching the HTML, obtaining the HTML tree, and then extracting information from the tree.

We will use the Requests library to fetch the HTML code from a specific URL. Then, we will use BeautifulSoup to parse the HTML into a tree and extract information from it, and finally, we will use pure Python to organize the data.
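The three stages can be sketched roughly as three small functions. This is only an illustration: the function names (fetch_html, build_tree, extract_title) and the demo HTML string are made up for this sketch, and the network call is wrapped in a function so that nothing is downloaded until you call it yourself.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url):
    """Stage 1: download the raw HTML of a page (hypothetical helper)."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # fail loudly on 4xx/5xx responses
    return response.text


def build_tree(html):
    """Stage 2: parse an HTML string into a BeautifulSoup tree."""
    return BeautifulSoup(html, "html.parser")


def extract_title(tree):
    """Stage 3: pull one piece of information out of the tree."""
    title = tree.find("title")
    return title.text if title is not None else None


# Offline demo of stages 2 and 3 on a made-up HTML string
demo_tree = build_tree("<html><head><title>Demo page</title></head><body></body></html>")
demo_title = extract_title(demo_tree)
print(demo_title)
```

To run all three stages on a live page, you would call extract_title(build_tree(fetch_html(url))) with a URL of your choice.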

Basic HTML

Before we get scraping, let’s review HTML basics quickly. Everything in HTML is defined within tags. The most important tag is <HTML>, which means that the text to follow is HTML code.

In HTML, almost every opened tag must be closed (a few void tags, such as <br> and <img>, have no closing tag). So, at the end of the HTML file, we need a closing tag </HTML>.

Different tags in HTML mean different things. A webpage is represented using a combination of tags. Any text enclosed between an opening and a closing tag is called inner HTML text.

If we have multiple elements with the same tag, we might (actually, always) want to differentiate between them somehow. There are two ways to do that: classes and ids. Ids are unique, which means we can’t have two elements with the same id. Classes, on the other hand, are not; more than one element can have the same class.

Here are 10 HTML tags you will see a lot when scraping the web.


Basic Scraping

Awesome, now that we know the basics, let’s start small and then build up!

Our first step is to install BeautifulSoup by typing the following in the command line.


pip install bs4

To get familiar with scraping basics, we will consider an example HTML code and learn how to use BeautifulSoup to explore it.


<HTML><HEAD><TITLE>My cool title</TITLE></HEAD><BODY><H1>This is a Header</H1><ul id="list" class="coolList"><li>item 1</li><li>item 2</li><li>item 3</li></ul></BODY>
</HTML>

BeautifulSoup doesn’t fetch HTML from the web; it is, however, extremely good at extracting information from an HTML string.

In order to use the above HTML in Python, we will set it up as a string and then use different BeautifulSoup functions to explore it.

Note: if you’re using Jupyter Notebook to follow this article, you can type the following command to view HTML within the Notebook.


from IPython.display import display, HTML
display(HTML(some_html_str))

For example, the above HTML will look something like this:


Next, we need to feed this HTML to BeautifulSoup in order to generate the HTML tree. The HTML tree is a representation of the different levels of the HTML code; it shows the hierarchy of the code.

The HTML tree of the above code is:


To generate the tree, we write:

from bs4 import BeautifulSoup as bs

some_html_str = """
<HTML>
    <HEAD>
        <TITLE>My cool title</TITLE>
    </HEAD>
<BODY>
    <H1>This is a Header</H1>
    <ul id="list" class="coolList">
        <li>item 1</li>
        <li>item 2</li>
        <li>item 3</li>
    </ul>
</BODY>
</HTML>
"""
#Feed the HTML to BeautifulSoup
soup = bs(some_html_str, "html.parser")

The variable soup now has the information extracted from the HTML string. We can use this variable to obtain information from the HTML tree.

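One way to look at the hierarchy BeautifulSoup has built is to print the tree with prettify(), which indents each nesting level. The sketch below repeats the example HTML so that it runs on its own; the variable name tree_text is only illustrative.

```python
from bs4 import BeautifulSoup

some_html_str = """<HTML><HEAD><TITLE>My cool title</TITLE></HEAD>
<BODY><H1>This is a Header</H1>
<ul id="list" class="coolList"><li>item 1</li><li>item 2</li><li>item 3</li></ul>
</BODY></HTML>"""

soup = BeautifulSoup(some_html_str, "html.parser")

# prettify() returns the parsed tree as a string, one tag per line,
# indented by depth. Note that the parser lowercases the tag names.
tree_text = soup.prettify()
print(tree_text)
```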

BeautifulSoup has many functions that can be used to extract specific parts of the HTML string. However, two functions are used the most: find and find_all.

The function find returns only the first occurrence of the search query, while find_all returns a list of all matches.


Say we are searching for <h1> headers in the code.
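A minimal sketch of that search, using a shortened copy of the example HTML (the variable names are illustrative):

```python
from bs4 import BeautifulSoup

html = "<html><body><h1>This is a Header</h1><ul><li>item 1</li></ul></body></html>"
soup = BeautifulSoup(html, "html.parser")

# find returns the first matching element, or None when nothing matches
header = soup.find("h1")
print(header)  # <h1>This is a Header</h1>

missing = soup.find("h2")
print(missing)  # None
```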

The find function returns the whole <h1> element, tags and all. Often, we only want to extract the inner HTML text. To do that, we use .text.
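For instance, on the same kind of snippet, .text drops the surrounding tags and keeps only the inner text (a small sketch with illustrative names):

```python
from bs4 import BeautifulSoup

html = "<html><body><h1>This is a Header</h1></body></html>"
soup = BeautifulSoup(html, "html.parser")

# .text gives the inner text of the element without the tags
header_text = soup.find("h1").text
print(header_text)  # This is a Header
```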

That worked because we only have one <h1> tag. But what if we want to look for list items? Our example has an unordered list with three items, so we can’t use find. If we do, we will only get the first item.

To find all the list items, we need to use find_all.

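The difference between the two functions can be seen side by side in this small sketch (variable names are illustrative):

```python
from bs4 import BeautifulSoup

html = '<ul id="list" class="coolList"><li>item 1</li><li>item 2</li><li>item 3</li></ul>'
soup = BeautifulSoup(html, "html.parser")

# find stops at the first <li>
first_item = soup.find("li")
print(first_item)  # <li>item 1</li>

# find_all collects every <li> into a list-like result
all_items = soup.find_all("li")
print(len(all_items))  # 3
```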

Okay, now that we have a list of items, let’s answer two questions:


1- How to get the inner HTML of the list items?


To obtain the inner text only, we can’t use .text straight away, because now we have a list of elements and not just one. Hence, we need to iterate over the list and obtain the inner HTML of each list item.

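The iteration can be written as a plain loop or, more compactly, as a list comprehension. A short sketch:

```python
from bs4 import BeautifulSoup

html = '<ul id="list" class="coolList"><li>item 1</li><li>item 2</li><li>item 3</li></ul>'
soup = BeautifulSoup(html, "html.parser")

# Apply .text to each element of the result, one at a time
inner_text = [item.text for item in soup.find_all("li")]
print(inner_text)  # ['item 1', 'item 2', 'item 3']
```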

2- What if we have multiple lists in the code?


If we have more than one list in the code, which is usually the case, we can be precise when searching for elements. In our example, the list has id='list' and class='coolList'. We can use either or both of these with the find_all or find functions to get exactly the information we want.
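To see why this matters, the sketch below adds a second, made-up list (id "other", class "plainList") to the example and then picks out only the first one by id, by class, and by both. The second list and the variable names are assumptions added for illustration.

```python
from bs4 import BeautifulSoup

html = """
<ul id="list" class="coolList"><li>item 1</li><li>item 2</li><li>item 3</li></ul>
<ul id="other" class="plainList"><li>a</li><li>b</li></ul>
"""
soup = BeautifulSoup(html, "html.parser")

# Three equivalent ways to select the first list only
by_id = soup.find("ul", id="list")
by_class = soup.find("ul", class_="coolList")  # class_ because class is a Python keyword
by_both = soup.find(attrs={"class": "coolList", "id": "list"})

# Search inside the selected list, ignoring the other one
items = [li.text for li in by_both.find_all("li")]
print(items)  # ['item 1', 'item 2', 'item 3']
```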

One thing to note here is that find returns a BeautifulSoup Tag object and find_all returns a list of them, and those can be traversed further. So, we can search them just like the object obtained directly from the HTML string.

Complete code for this section:


#Import needed libraries
from bs4 import BeautifulSoup as bs
import requests as rq
#HTML string
some_html_str = """
<HTML><HEAD><TITLE>My cool title</TITLE></HEAD><BODY><H1>This is a Header</H1><ul id="list" class="coolList"><li>item 1</li><li>item 2</li><li>item 3</li></ul>
</BODY>
</HTML>
"""
#Name the parser explicitly to avoid the "no parser specified" warning
soup = bs(some_html_str, "html.parser")
#Get headers
print(soup.find('h1'))
print(soup.find('h1').text)
#Get all list items
inner_text = [item.text for item in soup.find_all('li')]
print(inner_text)
ourList = soup.find(attrs={"class":"coolList", "id":"list"})
print(ourList.find_all('li'))

We can traverse the HTML tree using other BeautifulSoup attributes, like children, parent, next_sibling, etc.
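A brief sketch of these traversal attributes on the example list follows. The HTML is written without whitespace between tags on purpose: with html.parser, line breaks between tags would show up as extra text nodes among the children and siblings.

```python
from bs4 import BeautifulSoup

html = ('<html><body><h1>This is a Header</h1>'
        '<ul id="list"><li>item 1</li><li>item 2</li><li>item 3</li></ul>'
        '</body></html>')
soup = BeautifulSoup(html, "html.parser")
ul = soup.find("ul")

# children: the direct descendants of the list
child_texts = [child.text for child in ul.children]
print(child_texts)  # ['item 1', 'item 2', 'item 3']

# parent: the element that encloses the list
parent_name = ul.parent.name
print(parent_name)  # body

# next_sibling / previous_sibling: neighbours on the same level
second_item = ul.find("li").next_sibling
print(second_item.text)  # item 2
previous_name = ul.previous_sibling.name
print(previous_name)  # h1
```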

Scraping one webpage

Let’s consider a more realistic example, where we fetch the HTML from a URL and then use BeautifulSoup to extract patterns and data.


We will start by fetching one webpage. I love coffee, so let’s try fetching the Wikipedia page listing countries by coffee production and then plot the countries using Pygal.


To fetch the HTML we will use the Requests library and then pass the fetched HTML to BeautifulSoup.


If we open this wiki page, we will find a big table with the countries and different measures of coffee production. We just want to extract the country name and the coffee production in tons.

To extract this information, we need to study the HTML of the page to know what to query. We can just highlight a country name, right-click, and choose inspect.


Through inspecting the page, we can see that the country names and the quantity are enclosed within a ‘table’ tag. Since it is the first table on the page, we can just use the find function to extract it.


However, extracting the table directly will give us all of the table’s content, including the table header (the first row of the table) and the quantity in different measures.

So, we need to fine-tune our search. Let’s try it out with the top 10 countries.


Notice that to clean up the results, I used string manipulation to extract the information I want.

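To try the row-by-row extraction without hitting the network, here is a sketch on a tiny made-up table. The country names and numbers are placeholders, not real production figures, and the layout only loosely mimics the Wikipedia table. Unlike the complete code below, which splits the text of each row, this variant reads the cells one by one, which also copes with country names that contain spaces.

```python
from bs4 import BeautifulSoup

# Made-up table: names and quantities are placeholders for illustration
sample_table = """
<table>
  <tr><th>Rank</th><th>Country</th><th>Tons</th></tr>
  <tr><td>1</td><td>Aland</td><td>1,000</td></tr>
  <tr><td>2</td><td>Borduria</td><td>750</td></tr>
</table>
"""
soup = BeautifulSoup(sample_table, "html.parser")
table = soup.find("table")

rows = []
for row in table.find_all("tr")[1:]:  # skip the header row
    cells = [cell.get_text(strip=True) for cell in row.find_all("td")]
    country = cells[1]
    tons = int(cells[2].replace(",", ""))  # "1,000" -> 1000
    rows.append((country, tons))

print(rows)  # [('Aland', 1000), ('Borduria', 750)]
```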

I can use this list to finally plot the top 10 countries using Pygal.


Complete code for this section:


#Import needed libraries
from bs4 import BeautifulSoup as bs
import requests as rq
import pygal
from IPython.display import display, HTML
#Fetch HTML
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_coffee_production'
page = rq.get(url).text
#Extract HTML tree
soup = bs(page, 'html.parser')
#Find countries and quantity
table = soup.find('table')
top_10_countries = []
#Skip the first two rows and keep the next 10
for row in table.find_all('tr')[2:12]:
    temp = row.text.replace('\n\n', ' ').strip()
    #Obtain only the quantity in tons
    temp_list = temp.split()
    top_10_countries.append((temp_list[0], temp_list[2]))
#Plot the top 10 countries
bar_chart = pygal.Bar(height=400)
for country, tons in top_10_countries:
    bar_chart.add(country, int(tons.replace(',', '')))
#base_html is not shown in the original; this minimal template is an assumption
base_html = '<figure>{rendered_chart}</figure>'
display(HTML(base_html.format(rendered_chart=bar_chart.render(is_unicode=True))))

Scraping multiple webpages

Wow, that was a lot!
