1. Scraping Baidu Tieba

import re

# Patterns for the thread title, the thread author, and the reply count
titleR = '<a rel="noreferrer" href=".*?" title=".*?" target="_blank" class="j_th_tit ">(.*?)</a>'
authorR = '<span class=".*?" title="主题作者:(.*?)" data-field'
reduR = '<span class=".*?" title="回复">(.*?)</span>'

with open('test.html', 'r', encoding='utf-8') as f:
    data = f.read()

title = re.findall(titleR, data)
author = re.findall(authorR, data)
redu = re.findall(reduR, data)

# Print the reply count, author, and title of each thread
for i in range(0, len(author)):
    print(redu[i] + author[i] + '   ' + title[i] + '    ')
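To see what the three patterns capture without a saved test.html, the sketch below runs them on a small hand-written fragment. The fragment, the names alice and bob, and the helper extract_threads are made up for illustration; real Tieba markup may differ, so treat this as a sketch rather than a tested scraper.

```python
import re

TITLE_R = '<a rel="noreferrer" href=".*?" title=".*?" target="_blank" class="j_th_tit ">(.*?)</a>'
AUTHOR_R = '<span class=".*?" title="主题作者:(.*?)" data-field'
REPLY_R = '<span class=".*?" title="回复">(.*?)</span>'

# Hypothetical fragment imitating two rows of a thread list (not real Tieba output)
SAMPLE = '''
<span class="threadlist_rep_num" title="回复">12</span>
<a rel="noreferrer" href="/p/1" title="First post" target="_blank" class="j_th_tit ">First post</a>
<span class="tb_icon_author" title="主题作者:alice" data-field="{}">
<span class="threadlist_rep_num" title="回复">3</span>
<a rel="noreferrer" href="/p/2" title="Second post" target="_blank" class="j_th_tit ">Second post</a>
<span class="tb_icon_author" title="主题作者:bob" data-field="{}">
'''


def extract_threads(html):
    """Return (reply_count, author, title) tuples found in html."""
    titles = re.findall(TITLE_R, html)
    authors = re.findall(AUTHOR_R, html)
    replies = re.findall(REPLY_R, html)
    # zip stops at the shortest list, so a missing field cannot raise IndexError
    return list(zip(replies, authors, titles))


for replies, author, title in extract_threads(SAMPLE):
    print(replies, author, title)
```

Pairing the lists with zip instead of indexing by len(author) avoids an IndexError when one pattern matches fewer times than the others, at the cost of silently dropping the unmatched tail.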

2. Extracting novel content

from lxml import etree

with open('work2.html', 'r') as f:
    text = f.read()
html = etree.HTML(text)
result = html.xpath('//*[@id="content"]/text()')
with open('斗罗大陆.txt', 'w', encoding='utf-8') as f:
    f.write(''.join(result))
print(result)

3. Douban novels

from lxml import etree

with open('work3.html', 'r', encoding='utf-8') as f:
    text = f.read()
html = etree.HTML(text)
allInfo = ''
# range(1, 25) visits li[1] through li[24]; use range(1, 26) if the page lists 25 entries
for i in range(1, 25):
    title = html.xpath('//*[@id="content"]/div/div[1]/ol/li[%d]/div/div[2]/div[1]/a/span[1]/text()' % (i))
    score = html.xpath('//*[@id="content"]/div/div[1]/ol/li[%d]/div/div[2]/div[2]/div/span[2]/text()' % (i))
    comment = html.xpath('//*[@id="content"]/div/div[1]/ol/li[%d]/div/div[2]/div[2]/p[2]/span/text()' % (i))
    time = html.xpath('//*[@id="content"]/div/div[1]/ol/li[%d]/div/div[2]/div[2]/p[1]/text()[2]' % (i))
    # One line per entry: title, score, comment, time
    info = ''.join(title) + ' ' + ''.join(score) + ' ' + ''.join(comment) + ' ' + ''.join(time) + '\n'
    allInfo = allInfo + info
with open('豆瓣电影.txt', 'w', encoding='utf-8') as f:
    f.write(allInfo)

4. Scraping Weibo via Ajax

from urllib.parse import urlencode
from pyquery import PyQuery as pq
import requests

base_url = 'https://m.weibo.cn/api/container/getIndex?'
headers = {
    'Host': 'm.weibo.cn',
    'Referer': 'https://m.weibo.cn/u/2360812967',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36',
    'X-Requested-With': 'XMLHttpRequest',
}


def get_page():
    params = {
        'uid': '2360812967',
        't': '0',
        'luicode': '10000011',
        'lfid': '100103type=1&q=李现',
        'type': 'uid',
        'value': '2360812967',
        'containerid': '1076032360812967',
    }
    url = base_url + urlencode(params)
    try:
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            return response.json()
    except requests.ConnectionError as e:
        print('Error', e.args)


def parse_page(json):
    if json:
        items = json.get('data').get('cards')
        i = 0
        for item in items:
            # Skip the first card
            if i == 0:
                i = 1
                continue
            item = item.get('mblog')
            weibo = {}
            weibo['id'] = item.get('id')
            # The post text is HTML; pyquery strips the tags
            weibo['text'] = pq(item.get('text')).text()
            weibo['attitudes'] = item.get('attitudes_count')
            weibo['comments'] = item.get('comments_count')
            weibo['reposts'] = item.get('reposts_count')
            yield weibo


if __name__ == '__main__':
    # result = get_page()
    # print(result)
    # Note: page is not passed to get_page(), so every iteration requests the same URL
    for page in range(1, 2):
        json = get_page()
        results = parse_page(json)
        for result in results:
            print(result)
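The request URL in get_page is simply base_url followed by a percent-encoded query string. The sketch below shows what urlencode does with values that contain `=`, `&`, or non-ASCII characters, using a subset of the parameters above. build_url is a hypothetical helper introduced only for this illustration, and no network request is made.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

base_url = 'https://m.weibo.cn/api/container/getIndex?'


def build_url(params):
    # urlencode percent-encodes each key and value and joins the pairs with '&'
    return base_url + urlencode(params)


params = {
    'type': 'uid',
    'value': '2360812967',
    'lfid': '100103type=1&q=李现',  # contains '=', '&' and non-ASCII text
}
url = build_url(params)
print(url)

# Round trip: decoding the query string recovers the original values
decoded = parse_qs(urlsplit(url).query)
print(decoded['lfid'])
```

Because the `&` and `=` inside the lfid value are escaped, they cannot be mistaken for parameter separators when the server parses the query.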

5. Scraping Taobao with multiple threads

from selenium import webdriver
import time
import threading


def workthis(name):
    # find_element_by_* is the Selenium 3 API; Selenium 4.3+ removed it in favor of find_element(By.ID, ...)
    browser = webdriver.Chrome()
    browser.get('https://www.taobao.com')
    input = browser.find_element_by_id('q')
    input.send_keys(name)  # type the search keyword into the search box
    time.sleep(1)  # pause 1 s to mimic a human and avoid being blocked
    button = browser.find_element_by_class_name('btn-search')
    button.click()  # click the search button
    # Fill in the login form that appears after searching
    phone = browser.find_element_by_id('fm-login-id')
    phone.send_keys('18224393018')
    password = browser.find_element_by_id('fm-login-password')
    password.send_keys('***********')
    login = browser.find_element_by_xpath('//*[@id="login-form"]/div[4]/button')
    login.click()
    time.sleep(3)  # pause 3 s to mimic a human and avoid being blocked
    # Print the title and price of each item in the result list
    for i in range(1, 48):
        price = browser.find_element_by_xpath('//*[@id="mainsrp-itemlist"]/div/div/div[1]/div[%d]/div[2]/div[1]/div[1]/strong' % (i))
        title = browser.find_element_by_xpath('//*[@id="mainsrp-itemlist"]/div/div/div[1]/div[%d]/div[2]/div[2]' % (i))
        print(title.text + '\t' + price.text)
    browser.quit()


if __name__ == '__main__':
    # One browser per keyword, each in its own thread
    threading.Thread(target=workthis, args=('小米手机',)).start()  # Xiaomi phones
    threading.Thread(target=workthis, args=('苹果手机',)).start()  # Apple phones
    threading.Thread(target=workthis, args=('华为手机',)).start()  # Huawei phones
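The main block above starts its threads without keeping references to them, so it cannot join them or gather what each one found. A minimal sketch of one way to do that follows. fake_search is a hypothetical stand-in for workthis so the example runs without a browser, and run_all is likewise an illustrative helper, not part of the original script.

```python
import threading


def fake_search(name):
    # Stand-in for workthis(name): pretend three items were scraped for this keyword
    return [f'{name} item {i}' for i in range(1, 4)]


def run_all(names, worker):
    """Run worker(name) in one thread per name and return {name: result}."""
    results = {}
    lock = threading.Lock()

    def task(name):
        value = worker(name)
        # Guard the shared dict so concurrent writes stay well defined
        with lock:
            results[name] = value

    threads = [threading.Thread(target=task, args=(n,)) for n in names]
    for t in threads:
        t.start()
    # Wait for every thread before reading the results
    for t in threads:
        t.join()
    return results


if __name__ == '__main__':
    for keyword, items in run_all(['小米手机', '苹果手机', '华为手机'], fake_search).items():
        print(keyword, items)
```

To use this with the real scraper, workthis would need to return its rows instead of printing them; that change is left as an assumption here.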
