Parsing POJ pages to fetch problems

A problem page looks like this: http://poj.org/problem?id=3334

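The goal is to extract from such a page the problem title, time limit, memory limit, description, input, output, sample input, sample output, hint, and source, and to get the images used in the problem statement.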

#!/usr/bin/env python
#coding=utf-8
# Python 2 script; it needs the BeautifulSoup 3 package.
from BeautifulSoup import BeautifulSoup
import urllib
import re

def getpojhtml(pid):
    url = "http://poj.org/problem?id="+str(pid)
    # Read the page into a string: BeautifulSoup parses it, and the
    # image regex further down searches the same text.
    html = urllib.urlopen(url).read()
    soup = BeautifulSoup(html)
    # Skip the problem-id prefix at the start of the page title.
    title = soup.title.string[7:]
    # For each label, take the node that follows the first matching text.
    time_limit = soup.findAll(text = re.compile("Time Limit"))[0].next
    mem_limit = soup.findAll(text = re.compile("Memory Limit"))[0].next
    description = soup.findAll(text = re.compile("Description"))[0].next.contents
    input = soup.findAll(text = re.compile("Input"))[0].next.contents
    output = soup.findAll(text = re.compile("Output"))[0].next.contents
    sim_input = soup.findAll(text = re.compile("Sample Input"))[0].next.contents
    sim_output = soup.findAll(text = re.compile("Sample Output"))[0].next.contents
    # Some pages have no Hint or Source section; fall back to an empty list.
    try:
        hint = soup.findAll(text = re.compile("Hint"))[0].next.contents
    except (IndexError, AttributeError):
        hint = []
    try:
        source = soup.findAll(text = re.compile("Source"))[0].next.contents
    except (IndexError, AttributeError):
        source = []
    # Image paths that start with a four-digit number, e.g. images/3344_1.GIF.
    pattern = re.compile(r'images/\d{4}[.\w]*')
    pic = pattern.findall(html)
    pic_url = ['http://poj.org/'+str(item) for item in pic]
    return title,time_limit,mem_limit,description,input,output,sim_input,sim_output,hint,source,pic_url

if __name__=='__main__':
    ret = getpojhtml(3344)
    for item in ret:
        print item
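
The function returns eleven values as a plain tuple, so callers have to remember the order. One possible way to make the result easier to read is to wrap it in a named tuple. This is only a sketch in Python 3: the names Problem and as_problem are hypothetical, the field names simply repeat the variable names in the return statement, and the sample values are made up for illustration rather than fetched from POJ.

```python
from collections import namedtuple

# Field order mirrors the return statement of getpojhtml.
Problem = namedtuple("Problem", [
    "title", "time_limit", "mem_limit", "description", "input", "output",
    "sim_input", "sim_output", "hint", "source", "pic_url",
])


def as_problem(fields):
    """Wrap the eleven-element result tuple so fields can be read by name."""
    return Problem(*fields)


# Made-up stand-in for a real getpojhtml() result.
fake_result = (
    "Chessboard Dance", "2000MS", "65536K",
    [], [], [], [], [], [], [],
    ["http://poj.org/images/3344_1.GIF"],
)
problem = as_problem(fake_result)
print(problem.title)
print(problem.pic_url)
```

A named tuple still unpacks and iterates like the original tuple, so the loop in the main block would keep working unchanged.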

Implementation

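First, the urllib module fetches the whole page, and BeautifulSoup then parses it. Some pages have no Hint or Source section, so those two lookups are wrapped in try to keep the script from exiting with an error.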
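
The idea of tolerating missing sections can be tried offline without any network access. The sketch below is in Python 3 and swaps in the standard library's html.parser for BeautifulSoup so that it runs with no third-party package. It assumes that section headings are marked up as <p class="pst">; that markup, the names SectionParser and parse_sections, and the sample HTML are all assumptions made for illustration, not something taken from the script above.

```python
from html.parser import HTMLParser


class SectionParser(HTMLParser):
    """Collect the text that follows each <p class="pst"> heading."""

    def __init__(self):
        super().__init__()
        self.sections = {}
        self._in_heading = False
        self._current = None

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs.
        if tag == "p" and ("class", "pst") in attrs:
            self._in_heading = True

    def handle_endtag(self, tag):
        if tag == "p":
            self._in_heading = False

    def handle_data(self, data):
        if self._in_heading:
            # Text inside a heading starts a new section.
            self._current = data.strip()
            self.sections[self._current] = ""
        elif self._current is not None:
            # Any other text belongs to the most recent heading.
            self.sections[self._current] += data


def parse_sections(html):
    parser = SectionParser()
    parser.feed(html)
    return {name: text.strip() for name, text in parser.sections.items()}


# Made-up page fragment with no Hint section.
SAMPLE = (
    '<p class="pst">Description</p><div class="ptx">Move the pieces.</div>'
    '<p class="pst">Source</p><div class="ptx"><a href="#">Tehran 2006</a></div>'
)
sections = parse_sections(SAMPLE)
# dict.get supplies a default instead of raising, which plays the role of the try block.
hint = sections.get("Hint", "")
print(sections)
print(repr(hint))
```

Looking up by exact heading name also avoids one weakness of a substring search: a pattern such as "Input" can match "Sample Input" or a sentence in the description if that text happens to come first.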

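The images could also be located with BeautifulSoup, but I chose a regular expression instead. The regular expression pins down exactly the images in the problem description, whereas BeautifulSoup finds every image on the page, and some of those are not the ones I need.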
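
To see why the pattern picks up only the problem images, here is a small offline sketch in Python 3. The pattern is the one from the script. The sample HTML, the logo file name, and the helper name find_problem_images are made up for illustration, and urljoin is used here in place of string concatenation.

```python
import re
from urllib.parse import urljoin

# Same pattern as in the script: "images/" followed by four digits,
# then any run of word characters and dots.
IMAGE_PATTERN = re.compile(r'images/\d{4}[.\w]*')


def find_problem_images(html, base="http://poj.org/"):
    """Return absolute URLs for image paths that start with a four-digit number."""
    return [urljoin(base, path) for path in IMAGE_PATTERN.findall(html)]


# Made-up fragment: one site image plus two problem images.
SAMPLE = (
    '<img src="images/logo.gif">'
    '<center><img src="images/3344_1.GIF"></center>'
    '<center><img src="images/3344_2.GIF"></center>'
)
urls = find_problem_images(SAMPLE)
print(urls)
```

The logo is skipped because its file name does not begin with four digits, while both problem images match. The pattern does not check that the digits equal the requested problem ID, so it relies on other pages' images not appearing in the same HTML.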

Output

Chessboard Dance
2000MS
65536K
[<div>Another boring Friday afternoon, Betty the Beetle thinks how to amuse herself. She goes out of her hiding place to take a walk around the living room in Bennett's house. Mr. and Mrs. Bennett are out to the theatre and there is a chessboard on the table! "The best time to practice my chessboard dance," Betty thinks! She gets so excited that she does not note that there are some pieces left on the board and starts the practice session! She has a script showing her how to move on the chessboard. The script is a sequence like the following example:<center><img src="data:images/3344_1.GIF" /></center>At each instant of time Betty, stands on a square of the chessboard, facing one of the four directions (up, down, left, right) when the board is viewed from the above. Performing a "move n" instruction, she moves n squares forward in her current direction. If moving n squares goes outside the board, she stays at the last square on the board and does not go out. There are three types of turns: turn right, turn left, and turn back, which change the direction of Betty. Note that turning does not change the position of Betty.If Betty faces a chess piece when moving, she pushes that piece, together with all other pieces behind (a tough beetle she is!). This may cause some pieces fall of the edge of the chessboard, but she doesn't care! For example, in the following figure, the left board shows the initial state and the right board shows the state after performing the script in the above example. Upper-case and lower-case letters indicate the white and black pieces respectively. The arrow shows the position of Betty along with her direction. Note that during the first move, the black king (r) falls off the right edge of the board!<center><img src="data:images/3344_2.GIF" /></center>You are to write a program that reads the initial state of the board as well as the practice dance script, and writes the final state of the board after the practice.</div>]
[<div>There are multiple test cases in the input. Each test case has two parts: the initial state of the board and the script. The board comes in eight lines of eight characters. The letters r, d, t, a, c, p indicate black pieces, R, D, T, A, C, P indicate the white pieces and the period (dot) character indicates an empty square. The square from which Betty starts dancing is specified by one of the four characters <, >, ^, and v which also indicates her initial direction (left, right, up, and down respectively). Note that the input is not necessarily a valid chess game status.The script comes immediately after the board. It consists of several lines (between 0 and 1000). In each line, there is one instruction in one of the following formats (n is a non-negative integer number):move n turn left turn right turn backAt the end of each test case, there is a line containing a single # character. The last line of the input contains two dash characters.</div>]
[The output for each test case should show the state of the board in the same format as the input. Write an empty line in the output after each board.]
[u'.....c..\r\n.p..A..t\r\nD..>T.Pr\r\n....aP.P\r\np.d.C...\r\n.....p.R\r\n........\r\n........\r\nmove 2\r\nturn right\r\nmove 3\r\nturn left\r\nturn left\r\nmove 1\r\n#\r\n--\r\n']
[u'.....c..\r\n.p..A..t\r\nD.....TP\r\n....a..P\r\np.d.C^..\r\n.......R\r\n.....P..\r\n.....p..\r\n']
[]
[<a href="searchproblem?field=source&key=Tehran+2006">Tehran 2006</a>]

['http://poj.org/images/3344_1.GIF', 'http://poj.org/images/3344_2.GIF']

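The blogger ma6174 holds the copyright on the articles in this blog (except reposted ones); they may not be used commercially without permission. When reposting, please cite the source: http://www.cnblogs.com/ma6174/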

If you have comments or suggestions about this article, leave a comment or send email to ma6174@163.com

This article is reposted from ma6174's cnblogs blog. Original link: http://www.cnblogs.com/ma6174/archive/2012/08/04/2623159.html. To repost it, please contact the original author yourself.