Python Web Crawler: Crawling the External Sites a Page Links To

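Introduction

In the previous article we hopped randomly between article pages inside Wikipedia and ignored links to external sites. This article deals with a site's external links and tries to collect some data about them. Unlike crawling a single domain, sites on different domains vary enormously in structure, so our code has to be more flexible to cope with them.

We therefore write the code as a set of functions that can be combined to serve different kinds of crawling needs.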

Hopping randomly across external links

With this set of functions, about 50 lines of code are enough to crawl external sites.

Example code:

from urllib.request import urlopen
from urllib.parse import quote
from bs4 import BeautifulSoup
import re
import random

# With no argument, the generator is seeded from the OS or the current time.
# (Recent Python versions reject a datetime object as a seed.)
random.seed()


# Collect all internal links on a page
def get_internal_links(soup, include_url):
    internal_links = []
    # Find all links whose href contains include_url
    for link in soup.find_all(
            'a', href=re.compile(r'^.*' + re.escape(include_url))):
        if link.attrs['href'] is not None:
            if link.attrs['href'] not in internal_links:
                internal_links.append(link.attrs['href'])
    return internal_links


# Collect all external links on a page
def get_external_links(soup, exclude_url):
    external_links = []
    # Find all links that start with 'http' or 'www' and do not contain
    # the current URL
    for link in soup.find_all(
            'a', href=re.compile(
                r'^(http|www)((?!' + re.escape(exclude_url) + ').)*$')):
        if link.attrs['href'] is not None:
            if link.attrs['href'] not in external_links:
                external_links.append(link.attrs['href'])
    return external_links


# Split a URL into parts; the first element is the main domain.
# Note: only the 'http://' prefix is stripped, so an 'https://' URL
# yields 'https:' as its first element.
def split_address(address):
    address_parts = address.replace('http://', '').split('/')
    return address_parts


# Pick a random external link from a page
def get_random_external_link(starting_page):
    html = urlopen(starting_page)
    soup = BeautifulSoup(html, 'lxml')
    # The first element of the split address is the domain
    external_links = get_external_links(
        soup, split_address(starting_page)[0])
    if len(external_links) == 0:
        # No external links here: move to a random internal page and retry
        internal_links = get_internal_links(soup, starting_page)
        return get_random_external_link(random.choice(internal_links))
    else:
        return random.choice(external_links)


hop_count = set()


# Follow external links only; loop is the number of hops (default 5)
def follow_external_only(starting_site, loop=5):
    # Keep '&' and '%' safe so existing query strings are not re-encoded
    external_link = get_random_external_link(
        quote(starting_site, safe='/:?=&%'))
    print('Random external link is: ' + external_link)
    if len(hop_count) < loop:
        hop_count.add(external_link)
        follow_external_only(external_link, loop)


follow_external_only("http://www.baidu.com")

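Because the code has no exception handling and does nothing to get around anti-crawler defenses, it will raise an error sooner or later. Since the hops are random, try running it several times; interested readers can improve the code based on the error each run reports.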
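As one possible starting point for that improvement, here is a minimal sketch of a fetch helper that catches the most common failures instead of crashing. The name fetch_html, the opener parameter, the timeout value, and the User-Agent string are all assumptions made for illustration; they are not part of the original code.

```python
from urllib.error import HTTPError, URLError
from urllib.request import Request, urlopen


def fetch_html(url, opener=urlopen, timeout=10):
    """Return the page body as bytes, or None if the request fails."""
    try:
        # Hypothetical User-Agent; some sites reject urllib's default one.
        request = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
        with opener(request, timeout=timeout) as response:
            return response.read()
    except HTTPError as error:
        # The server answered, but with an error status such as 403 or 404.
        print('HTTP error', error.code, 'for', url)
    except URLError as error:
        # DNS failure, refused connection, and similar problems.
        print('Could not reach', url, ':', error.reason)
    except ValueError as error:
        # Malformed URL, for example a relative href with no scheme.
        print('Invalid URL', url, ':', error)
    return None
```

A caller would check the result before parsing, for example by skipping the page when fetch_html(url) returns None. HTTPError is caught before URLError because it is a subclass of URLError.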

Output:

Random external link is: http://v.baidu.com/v?ct=301989888&rn=20&pn=0&db=0&s=25&ie=utf-8&word=

Random external link is: http://baishi.baidu.com/watch/6388818335201070269.html

Random external link is: http://v.baidu.com/tv/

Random external link is: http://player.baidu.com/yingyin.html

Random external link is: http://help.baidu.com/question?prod_en=player

Random external link is: http://home.baidu.com

[Finished in 6.3s]

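Collecting every external link on a site

The advantage of writing the code as functions is that they are easy to modify or extend for a different requirement without breaking what already works. For example:

Goal: crawl an entire site, collect all of its external links, and record each one.

We can add the following function: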

# Collect all external URLs found on a site
all_ext_links = set()
all_int_links = set()


def get_all_external_links(site_url):
    html = urlopen(site_url)
    soup = BeautifulSoup(html, 'lxml')
    domain = split_address(site_url)[0]
    print(domain)
    internal_links = get_internal_links(soup, domain)
    external_links = get_external_links(soup, domain)
    # Record and print every external link not seen before
    for link in external_links:
        if link not in all_ext_links:
            all_ext_links.add(link)
            print(link)
    # Recurse into every internal link not visited before
    for link in internal_links:
        if link not in all_int_links:
            print('About to get link: ' + link)
            all_int_links.add(link)
            get_all_external_links(link)


# follow_external_only("http://www.baidu.com")
get_all_external_links('http://oreilly.com')

The output looks like this:

oreilly.com

https://cdn.oreillystatic.com/pdf/oreilly_high_performance_organizations_whitepaper.pdf

http://twitter.com/oreillymedia

http://fb.co/OReilly

https://www.linkedin.com/company/oreilly-media

https://www.youtube.com/user/OreillyMedia

About to get link: https://www.oreilly.com

https:

https://www.oreilly.com

http://www.oreilly.com/ideas

https://www.safaribooksonline.com/?utm_medium=content&utm_source=oreilly.com&utm_campaign=lgen&utm_content=20170601+nav

http://www.oreilly.com/conferences/

http://shop.oreilly.com/

http://members.oreilly.com

https://www.oreilly.com/topics

https://www.safaribooksonline.com/?utm_medium=content&utm_source=oreilly.com&utm_campaign=lgen&utm_content=20170505+homepage+get+started+now

https://www.safaribooksonline.com/accounts/login/?utm_medium=content&utm_source=oreilly.com&utm_campaign=lgen&utm_content=20170203+homepage+sign+in

https://www.safaribooksonline.com/live-training/?utm_medium=content&utm_source=oreilly.com&utm_campaign=lgen&utm_content=20170201+homepage+take+a+live+online+course

https://www.safaribooksonline.com/learning-paths/?utm_medium=content&utm_source=oreilly.com&utm_campaign=lgen&utm_content=20170201+homepage+follow+a+path

https://www.safaribooksonline.com/?utm_medium=content&utm_source=oreilly.com&utm_campaign=lgen&utm_content=20170505+homepage+unlimited+access

http://www.oreilly.com/live-training/?view=grid

https://www.safaribooksonline.com/your-experience/?utm_medium=content&utm_source=oreilly.com&utm_campaign=lgen&utm_content=20170201+homepage+safari+platform

https://www.oreilly.com/ideas/8-data-trends-on-our-radar-for-2017?utm_medium=referral&utm_source=oreilly.com&utm_campaign=lgen&utm_content=link+2017+trends

https://www.oreilly.com/ideas?utm_medium=referral&utm_source=oreilly.com&utm_campaign=lgen&utm_content=link+read+latest+articles

http://www.oreilly.com/about/

http://www.oreilly.com/work-with-us.html

http://www.oreilly.com/careers/

http://shop.oreilly.com/category/customer-service.do

http://www.oreilly.com/about/contact.html

http://www.oreilly.com/emails/newsletters/

http://www.oreilly.com/terms/

http://www.oreilly.com/privacy.html

http://www.oreilly.com/about/editorial_independence.html

About to get link: https://www.safaribooksonline.com/?utm_medium=content&utm_source=oreilly.com&utm_campaign=lgen&utm_content=20170601+nav

https:

https://www.oreilly.com/

About to get link: https://www.oreilly.com/

https:

About to get link: https://www.oreilly.com/topics

......

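The program keeps recursing until it hits Python's default recursion limit. Interested readers can add a default limit such as loop=5, as in the earlier code.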
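One way to bound the crawl is sketched below, under several assumptions. It replaces recursion with a queue and stops after max_pages pages. It uses the standard library's HTMLParser instead of BeautifulSoup so that it runs without third-party packages, and it compares host names with urlparse rather than split_address. The names LinkParser, collect_external_links, fetch, and max_pages are hypothetical, and fetch stands for any callable that returns a page's HTML as a string, or None on failure.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkParser(HTMLParser):
    """Collect the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.hrefs.append(value)


def collect_external_links(start_url, fetch, max_pages=5):
    """Visit at most max_pages internal pages; return external links seen."""
    site = urlparse(start_url).netloc
    queue = deque([start_url])
    visited = set()
    external = set()
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        html = fetch(url)  # hypothetical callable: HTML text or None
        if html is None:
            continue
        parser = LinkParser()
        parser.feed(html)
        for href in parser.hrefs:
            absolute = urljoin(url, href)  # resolve relative links
            parts = urlparse(absolute)
            if parts.scheme not in ('http', 'https'):
                continue  # skip mailto:, javascript:, and similar
            if parts.netloc == site:
                queue.append(absolute)  # same host: crawl it later
            else:
                external.add(absolute)  # different host: record it
    return external
```

Note that this sketch compares host names exactly, so www.oreilly.com and oreilly.com count as different sites; whether that is the right rule depends on the crawl.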
