Python HTML parsing comparison: a Python HTMLParser page-parsing example

November 2013

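0x01 Today I wrote a sample program that parses web pages with Python. HTMLParser is the core; combined with urllib2, it picks out certain specific tags in a page.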
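Before the full listing, here is a minimal sketch of the HTMLParser pattern on its own. It is an illustration under assumptions rather than part of the original program: it uses Python 3's `html.parser` module (the listing in this post targets the Python 2 `HTMLParser` module), and the `LinkCollector` class and the sample markup are made up for the example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag fed to the parser (hypothetical example class)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lower-cased.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Made-up sample markup, not taken from any real page.
SAMPLE_HTML = '<p><a href="/index.php">home</a> <a class="x" href="/about.html">about</a></p>'

collector = LinkCollector()
collector.feed(SAMPLE_HTML)
collector.close()
print(collector.links)  # ['/index.php', '/about.html']
```

The parser never builds a tree. It calls `handle_starttag`, `handle_data`, `handle_endtag` and so on as it scans the text, so any state you need (such as "am I inside a table row?") has to be tracked by the subclass itself.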

0x02 The code is below; its purpose is to crawl information from a page:

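#coding=utf-8
'''
Created on 2013-11-5

@author: lenovo
'''
from HTMLParser import HTMLParser
import time
import urllib
import urllib2

# Module-level state shared between the parser and the helper functions
loginPass=[]   # values collected from the 'pwd[]' checkboxes
check=0        # set to 1 while the parser is inside a matching <tr>
webshell=[]    # one list per matching link: [href, input value, ...]
temp=0         # length of webshell after the most recent append
savefile=''    # output file path, set in __main__
pagenum=0      # last page index to fetch, set in __main__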

class MyParser(HTMLParser):
    """A simple HTMLParser example."""

    def handle_decl(self, decl):
        """Handle the document declaration (e.g. DOCTYPE)."""
        HTMLParser.handle_decl(self, decl)
        #print decl

    def handle_starttag(self, tag, attrs):
        """Handle start tags."""
        global loginPass
        global check
        global webshell
        global temp
        HTMLParser.handle_starttag(self, tag, attrs)
        #if not HTMLParser.get_starttag_text(self).endswith("/>"):
        #print ""
        # NOTE: the string searched for on the next line is missing from the
        # source; the line is incomplete and is kept exactly as it appeared.
        if tag=='tr' and self.rawdata.find("""
            check=1
        # A link ending in 'php' inside a flagged row starts a new entry
        if tag=='a' and check==1 and len(attrs)>1 and attrs[1][1][-3:]=='php' :
            #print attrs[1][1]
            z=[]
            z.append(attrs[1][1])
            webshell.append(z)
            temp=len(webshell)
        # Inputs in the same row add their value to the most recent entry
        if tag=='input' and check==1 and temp>0 and len(attrs)>3 and attrs[3][0]=='value' and attrs[1][0]=='style':
            #print attrs
            webshell[temp-1].append(attrs[3][1])
        if tag=='input' and check==1 and temp>0 and len(attrs)==3 and attrs[2][0]=='value' and attrs[1][0]=='style' and attrs[0][1]=='text':
            #print attrs
            webshell[temp-1].append(attrs[2][1])
        if tag=='input':
            #print attrs
            # Guard on length so inputs with fewer than three attributes are skipped
            if len(attrs)>2 and attrs[0][0]=='type' and attrs[0][1]=='checkbox' and attrs[1][1]=='pwd[]' :
                #print attrs[2][1]
                loginPass.append(attrs[2][1])  # collect the checkbox's third attribute value
        #for attr in attrs:
        #    for t in attr:
        #        print t

    def handle_data(self, data):
        """Handle text data."""
        HTMLParser.handle_data(self, data)
        #print data,

    def handle_endtag(self, tag):
        """Handle end tags."""
        global check  # without this, the assignment below only creates a local
        HTMLParser.handle_endtag(self, tag)
        if tag=='tr':
            check=0
        #if not HTMLParser.get_starttag_text(self).endswith("/>"):
        #print "",tag,">"

    def handle_startendtag(self, tag, attrs):
        """Handle self-closing tags."""
        HTMLParser.handle_startendtag(self, tag, attrs)
        #print HTMLParser.get_starttag_text(self)

    def handle_comment(self, data):
        """Handle comments."""
        HTMLParser.handle_comment(self, data)
        #print data

    def close(self):
        HTMLParser.close(self)
        #print "parser over"

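def Post(url,s):
    """POST the form string s to url; return the response body, or '' on any error."""
    try:
        # Percent-encode the body but leave '=' and '&' intact
        s1=urllib.quote(s,"=&")
        #print s1
        req = urllib2.Request(url,s1)
        resp = urllib2.urlopen(req,timeout=10)
        web=resp.read()
        #print strlist
    except Exception:
        return ""
    return web

def Get(url):
    """GET url; return the response body, or '' on any error."""
    try:
        #s1=urllib.quote(s,"=#"()[],@'&\")
        #print s1
        req = urllib2.Request(url)
        resp = urllib2.urlopen(req,timeout=60)
        web=resp.read()
        #print strlist
    except Exception:
        return ""
    return web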

def saveWebshell():
    """Write each collected entry to savefile as 'href,value', one per line."""
    f=open(savefile,'w')
    for i in webshell:
        print i
        f.write(i[0]+','+i[1]+'\n')
    f.close()
    print '******* Save SuccessFul ********'
    #print 'Save SuccessFul'

#def getInfo(page):

def getPage(url,num):
    if num==0:
        if len(loginPass)!=0:
            print '******* Login ********'
            passWd='pwd[]='+loginPass[2]+'&pwd[]='+loginPass[4]
            demo=MyParser()
            demo.feed(Post(url,passWd))
            demo.close()
    print '******* Get Page '+str(num)+' ********'
    urlNew=url+'&p='+str(num)
    demo=MyParser()
    demo.feed(Get(urlNew))
    demo.close()

#def getTotal(url):

def login(url):
    print '*******Get Password********'
    demo=MyParser()
    demo.feed(Get(url))
    demo.close()

def autoSearch(url):
    #print loginPass
    #getPage(url,0)
    num=pagenum
    for i in xrange(0,num+1):
        page=getPage(url,i)
        #getInfo(page)
    print '******* Scan Over ********'
    saveWebshell()

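if __name__ == '__main__':
    pagenum=9
    # The path separators in this path appear to be missing from the source;
    # it is kept exactly as it appeared.
    savefile=r'E:work profitprofitjavaicetoolstoolsWebShellphp168.txt'
    autoSearch('http://www.XXXX.com/core/centerxxxxx.php?s=php168')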
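The listing picks attributes out by position (for example `attrs[3][0]=='value'`), which only works when the page emits attributes in one fixed order, and it tracks "inside a row" with the global `check` flag. As a hedged alternative, the sketch below looks attributes up by name via `dict(attrs)` and keeps the flag on the parser instance. It assumes Python 3's `html.parser`; the `InputValueCollector` class and the sample markup are hypothetical and not part of the original program.

```python
from html.parser import HTMLParser


class InputValueCollector(HTMLParser):
    """Collect the value of text inputs that appear inside a <tr> (hypothetical class)."""

    def __init__(self):
        super().__init__()
        self.in_row = False  # plays the role of the global `check` flag
        self.values = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)  # look attributes up by name, whatever their order
        if tag == "tr":
            self.in_row = True
        elif tag == "input" and self.in_row:
            if attr_map.get("type") == "text" and "value" in attr_map:
                self.values.append(attr_map["value"])

    def handle_endtag(self, tag):
        if tag == "tr":
            self.in_row = False


# Made-up sample: the two inputs in rows list their attributes in different orders,
# and the last input sits outside any row.
SAMPLE = (
    '<table>'
    '<tr><td><input value="first" type="text" style="w"></td></tr>'
    '<tr><td><input type="text" style="w" value="second"></td></tr>'
    '</table>'
    '<input type="text" value="outside">'
)

parser = InputValueCollector()
parser.feed(SAMPLE)
parser.close()
print(parser.values)  # ['first', 'second']
```

Keeping the state on `self` also means two parser instances no longer interfere with each other, which the module-level globals in the listing cannot guarantee.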
