Request: Headless HTML rendering engine? | Holovaty.com

Request: Headless HTML rendering engine?

Written by Adrian Holovaty on May 2, 2008

Warning: Seriously geeky request ahead!

I'm looking for a way to render arbitrary Web pages -- including CSS and JavaScript -- and access the resulting DOM tree programmatically, i.e., in an automated/headless fashion. I want to be able to ask the following questions of the resulting DOM tree:

For a given element, what font family, size, and color is the text?
How tall and wide (in pixels) is a given element?
What are the x/y coordinates of a given element (from the upper-left corner of the page, or lower-left, or wherever)?
For a given element, what is its text content?

The rendering must be state-of-the-art, handling advanced CSS that Firefox, Safari and IE handle. It should work on Linux. Bonus points if there's a Python API for this magical DOM tree.

This is all stuff that standard in-page JavaScript could accomplish, but the catch with me is that I need to be able to do it in a completely automated way, on arbitrary pages, on a headless server.
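As a rough illustration of the in-page JavaScript mentioned above, the sketch below builds a snippet that answers the four questions for one element. It is written as a Python helper because the snippet would have to be handed to whichever embedding ends up hosting the page. The helper name `build_probe_js` and the CSS-selector argument are assumptions made for this example, not part of any library.

```python
import json

# JavaScript template; __SELECTOR__ is replaced with a quoted string literal.
_PROBE_TEMPLATE = """
(function () {
  var el = document.querySelector(__SELECTOR__);
  if (!el) { return null; }
  var style = window.getComputedStyle(el, null);
  var rect = el.getBoundingClientRect();
  return JSON.stringify({
    fontFamily: style.fontFamily,
    fontSize: style.fontSize,
    color: style.color,
    width: rect.width,
    height: rect.height,
    x: rect.left + window.pageXOffset,
    y: rect.top + window.pageYOffset,
    text: el.textContent
  });
})()
"""


def build_probe_js(selector):
    """Return a JavaScript expression that reports style, geometry, and text
    for the first element matching `selector` (hypothetical helper)."""
    # json.dumps yields a double-quoted, escaped literal that is also valid JavaScript.
    return _PROBE_TEMPLATE.replace("__SELECTOR__", json.dumps(selector)).strip()


if __name__ == "__main__":
    print(build_probe_js("div#content"))
```

The coordinates are measured from the upper-left corner of the page; the hard part, which this sketch does not address, is getting a headless engine to evaluate the expression in the first place.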

I know Gecko and Webkit provide this, but I'm not sure where to start with them. The docs and articles I've read seem to be focused more on embedding the full browser window in a GUI application than embedding the rendering engine itself and manipulating the resulting pages.

Help! If you have any clues, I'd be grateful if you left a comment or got in touch with me.
Comments
Posted by Andrew Sutherland on May 2, 2008 at 2:45 a.m.:

PyXPCOM (http://developer.mozilla.org/en/docs/PyXPCOM) should handle the Python part of the Gecko equation.

I myself am no specific help on the gecko side of things, but I think the following post/thread on the PyXPCOM mailing list may be of assistance:

http://aspn.activestate.com/ASPN/Mail/Message/pyxpcom/3619998
Posted by Rene Dudfield on May 2, 2008 at 3:19 a.m.:

You can set up a headless X server, then run firefox, or whatever browser with a standard build.
Posted by Michael Twomey on May 2, 2008 at 4:46 a.m.:

If you want an example of using WebKit to do headless stuff, you could look at webkit2png, which is a tool for taking screenshots of websites from the command line. It uses WebKit and PyObjC, so you'll need a Mac. It doesn't do any DOM stuff that I can see, but it might be a useful starting point for writing an automated tool.
Posted by Justin Mason on May 2, 2008 at 5:01 a.m.:

http://khtml2png.sourceforge.net/ might be useful, if you're doing this on a *NIX platform. Looks like it's well-maintained, too, since the most recent release was only a couple of weeks ago.
Posted by Gábor Farkas on May 2, 2008 at 5:10 a.m.:

in case of firefox, there are 2 issues:

1. run it somehow in a headless mode: for this, try Xvfb. it starts a headless X server. then you can run firefox in it.

2. communicate with the firefox instance. there is PyXPCOM, as others already mentioned, which could make it work.
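A minimal sketch of step 1, assuming `Xvfb` and `firefox` are installed and on the PATH. The display number `:99`, the screen size, and the helper names are arbitrary choices for illustration; step 2 (talking to the running browser) is not covered here.

```python
import os
import subprocess
import time


def xvfb_command(display=":99", width=1024, height=768, depth=24):
    """Build the Xvfb command line for a single virtual screen."""
    return ["Xvfb", display, "-screen", "0", "%dx%dx%d" % (width, height, depth)]


def browser_env(display=":99", base_env=None):
    """Copy the environment and point DISPLAY at the virtual X server."""
    env = dict(os.environ if base_env is None else base_env)
    env["DISPLAY"] = display
    return env


def launch(url, display=":99"):
    """Start Xvfb, then Firefox inside it. Returns both process handles."""
    xvfb = subprocess.Popen(xvfb_command(display))
    time.sleep(1)  # crude wait for the X server to come up
    firefox = subprocess.Popen(["firefox", url], env=browser_env(display))
    return xvfb, firefox
```

Calling `launch("http://example.com/")` would start both processes; the caller is responsible for terminating them afterwards.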
Posted by Jason on May 2, 2008 at 7:04 a.m.:

If you want to muck in C++ code you could look at RenderTreeAsText in Webkit. For actually setting up the rendering engine, there's some relatively simple high-level apis in the wx and qt ports that seem pretty readable; the kind of api you'd use for those neat "write a web browser in 5 lines of code" demos. See WebFrame in particular. Disclaimer: I've never written anything with webkit, but it might be fun to learn.
Posted by anonymous on May 2, 2008 at 8:15 a.m.:

What about Selenium? or Watir?
Posted by anonymous on May 2, 2008 at 8:50 a.m.:

I haven't tried this (but am planning to), so I don't know if it really meets your needs, but HTMLUnit is a Java-based headless browser (designed for testing).
Posted by anonymous on May 2, 2008 at 10:15 a.m.:

Attributes such as pixel width, height, font etc will either be determined by CSS, or they will be agent (and user setup) specific.

The pixel width of a div of width 50% will depend on the size of the viewport - which of course would be anything. Do you intend to 'fake' the settings of a user agent? If so, then a simple calculation would get the pixel width (as you would know your viewport dimensions).
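The calculation can be sketched as follows, assuming a faked viewport width and ignoring margins, padding, borders, and scrollbars (which a real engine would account for). The function name and the 1024-pixel viewport are made up for the example.

```python
def resolve_width(css_width, container_px):
    """Resolve a CSS width such as '50%' or '300px' against its container width in pixels."""
    value = css_width.strip()
    if value.endswith("%"):
        return container_px * float(value[:-1]) / 100.0
    if value.endswith("px"):
        return float(value[:-2])
    raise ValueError("unsupported width: %r" % css_width)


viewport_px = 1024  # assumed viewport width
outer = resolve_width("50%", viewport_px)  # a div at 50% of the viewport
inner = resolve_width("25%", outer)        # a child at 25% of that div
print(outer, inner)
```

Percentages resolve against the containing block, so nested elements have to be walked from the viewport inward, as the two calls show.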

I really would consider seeing how far you can get by simply manipulating the dom and parsing the css (both of which are easily achieved with the python libraries urllib, lxml / beautifulsoup and cssutils).

I know, I know; None of this helps with javascript dependent attributes.

RC
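A sketch of the static approach from the comment above, using only the standard library's `html.parser` in place of the lxml / BeautifulSoup and cssutils libraries it names (a deliberate substitution to keep the example dependency-free). It reads inline `style` attributes only; stylesheets, the cascade, and anything set by JavaScript are out of reach, which is the limitation the comment acknowledges. The class and function names are hypothetical.

```python
from html.parser import HTMLParser


def parse_declarations(style_text):
    """Turn 'color: red; font-size: 12px' into {'color': 'red', 'font-size': '12px'}.

    Naive split on ';' and ':'; values that contain semicolons are not handled.
    """
    declarations = {}
    for part in style_text.split(";"):
        if ":" in part:
            name, value = part.split(":", 1)
            declarations[name.strip().lower()] = value.strip()
    return declarations


class InlineStyleCollector(HTMLParser):
    """Collect (tag, declarations) for every element carrying a style attribute."""

    def __init__(self):
        super().__init__()
        self.styled = []

    def handle_starttag(self, tag, attrs):
        style = dict(attrs).get("style")
        if style:
            self.styled.append((tag, parse_declarations(style)))


collector = InlineStyleCollector()
collector.feed('<div style="width: 50%; color: #333"><p>Hello</p></div>')
print(collector.styled)
```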
Posted by alan taylor on May 2, 2008 at 10:36 a.m.:

Have you looked at JSSh? Not sure if it fits the bill, but it just might. It's a "Mozilla C++ extension module that allows other programs (such as telnet) to establish JavaScript shell connections to a running Mozilla process via TCP/IP". I know it can return some parts of the DOM, but I'm not sure how much detailed info you can get back from it. http://www.croczilla.com/jssh
Posted by Matthew Marshall on May 2, 2008 at 10:42 a.m.:

I've played with doing this a little. The best I came up with was using PyKDE and khtml. I'm pretty sure it requires an X server, but if nothing else you could use a vnc server.

MWM
Posted by Kumar McMillan on May 2, 2008 at 11:40 a.m.:

There are probably several ways to do it, but the first that comes to mind is using the Python driver for Selenium RC ...

from selenium import selenium

# with the selenium-rc (Java) proxy server running at localhost:4444 ...
sel = selenium("localhost", 4444, "*firefox", "http://thewebsite.com")
sel.start()  # launch the browser session before issuing commands
sel.open("/")
sel.wait_for_page_to_load('30000')
sel.get_html_source()  # this includes any JavaScript DOM manipulations, of course
sel.get_element_position_left("xpath=//div[1]")
sel.get_element_position_top("xpath=//div[1]")
sel.get_element_height("xpath=//table[1]")
sel.capture_screenshot('/tmp/site.png')
sel.stop()  # close the browser session

... but I'm not sure how you get the font/text info. Selenium RC is designed to run headless and also has a "grid" implementation so you can throw more hardware at it. Scaling up to the grid is very transparent -- same code as above, more or less.

Links:

http://selenium-rc.openqa.org/

http://selenium-rc.openqa.org/python.html

http://selenium-grid.openqa.org/
Posted by anonymous on May 2, 2008 at 12:02 p.m.:

seconding the jssh suggestion http://www.urbanhonking.com/ideasfordozens/archives/2008/03/automating_fire.html
Posted by Ryan Shaw on May 2, 2008 at 12:26 p.m.:

You might want to check out Crowbar:

Crowbar is a web scraping environment based on the use of a server-side headless mozilla-based browser. Its purpose is to allow running javascript scrapers against a DOM to automate web sites scraping but avoiding all the syntax normalization issues.
Posted by mikeal on May 2, 2008 at 1:49 p.m.:

I would go with windmill over Selenium if you're going down that road. We have far more comprehensive javascript support, you can use execJS to get back the result of any arbitrary js.

http://windmill.osafoundation.org

And jssh is great, but MozRepl is jssh on crack.

http://hyperstruct.net/projects/mozrepl

The whole interface is much much nicer and I'm in the middle of a Python <-> JavaScript bridge using MozRepl that I'll be sure to send you a link to once it's public.
Posted by Henning on May 2, 2008 at 2:29 p.m.:

Qt 4.4 is available on all platforms and contains a WebKit port. Fortunately the newest PyQt snapshots also contain support for WebKit. Because Qt can render every widget to a pixmap, it should be fairly easy. To run Qt headless you could use Xvfb.

To access the DOM you can query with Javascript.

The following is _not_ tested:

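import sys
from PyQt4.QtCore import QObject, QUrl, SIGNAL
from PyQt4.QtGui import QApplication, QPixmap
from PyQt4.QtWebKit import QWebView

app = QApplication(sys.argv)
browser = QWebView()
browser.resize(800, 600)
browser.show()

def on_load_finished(ok):
    # load() is asynchronous, so grab the widget only once the page has loaded
    pm = QPixmap.grabWidget(browser)
    pm.save("website.jpg")
    # evaluateJavaScript() returns a QVariant, so ask for a string rather than a DOM node
    body = browser.page().mainFrame().evaluateJavaScript("document.body.innerHTML")
    app.quit()

QObject.connect(browser, SIGNAL("loadFinished(bool)"), on_load_finished)
#browser.setHtml("Hello, world")
browser.load(QUrl("http://www.djangoproject.com"))
app.exec_()  # run the event loop until on_load_finished() calls quit()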
Posted by anonymous on May 2, 2008 at 6:08 p.m.:

HTMLUnit is a very good headless browser implementation. It supports different browsers and JavaScript (using Rhino, I think). And finally, it is under active development.

http://htmlunit.sourceforge.net/

Unfortunately, it's a Java library, but you could use Jython to access it.
Posted by anonymous on May 2, 2008 at 6:12 p.m.:

I looked at a few open source projects to do headless rendering. It's tempting to use Firefox/Gecko, but the learning curve is steep: it's 2 million lines of legacy Netscape C++ code. But if you figure it out you'll have a fine tool.

What is working for me now is the Cobra renderer (from the Lobo browser), in Java. It's not the best rendering engine, but it's decent and easy to program. You can get rendered blocks and DOM objects, and answer all the questions as to block location, color, text, etc.

It can be made to work on Linux completely headless, without an X server. The way I have it working, it takes in a URL or HTML and saves to another textual file format. What's important is to encapsulate your choice of rendering engine, because it will change.

Email me at dmitrim at yahoo dot com if you need help.
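One way to read the advice about encapsulating the rendering engine is to have the rest of the code depend on a small interface rather than on Gecko, WebKit, or Cobra directly. The sketch below is an assumption about what such an interface might look like; `Box`, `Renderer`, and `FakeRenderer` are invented names, and the fake backend returns canned data instead of rendering anything.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import List


@dataclass
class Box:
    """One rendered block: position and size in pixels, plus style and text."""
    x: float
    y: float
    width: float
    height: float
    font_family: str
    font_size_px: float
    color: str
    text: str


class Renderer(ABC):
    """Engine-neutral interface; callers never import a specific engine."""

    @abstractmethod
    def render(self, url: str) -> List[Box]:
        ...


class FakeRenderer(Renderer):
    """Stand-in backend that returns canned boxes, useful for tests."""

    def render(self, url: str) -> List[Box]:
        return [
            Box(0, 0, 800, 40, "serif", 32.0, "#000000", "Title of " + url),
            Box(0, 40, 800, 20, "sans-serif", 12.0, "#333333", "Body text"),
        ]


def tallest_box(renderer: Renderer, url: str) -> Box:
    """Example caller that works with any Renderer implementation."""
    return max(renderer.render(url), key=lambda box: box.height)


print(tallest_box(FakeRenderer(), "http://example.com/").text)
```

Swapping engines then means writing one new `Renderer` subclass; callers such as `tallest_box` stay unchanged.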
Posted by Phil on May 2, 2008 at 7:31 p.m.:

Personally I'd try it with MozRepl and an X virtual framebuffer: http://emacspeak.blogspot.com/2007/06/firebox-put-fox-in-box.html
Posted by Daniel on May 2, 2008 at 7:46 p.m.:

As suggested above, run firefox on a virtual X server. Use a firefox extension (mozrepl or jssh) to get automated control over the browser.

I set up a system doing exactly this (for taking screenshots) last summer. In the end it barely took any code, just a fair amount of faffing with config files. Happy to give more details if it's helpful: (my first name) at ohuiginn.net
Posted by rex on May 3, 2008 at 8:44 a.m.:

I went through trying to work out a way to do this ages ago.

Not sure if you're feeling the same, Adrian, but what bothered me (purely on principle) was that I really wanted to be able to do this on my server _without_ having to run a headless X server, or an instance of Firefox or whatever. I wanted a library that was able to do it and give back my responses without the unnecessary overhead of a browser, X server, etc. running (I know very little about it, but I can't help but feel that these are unnecessary elements in the equation).

Surely there is a way to do what you're asking without having a program running that is designed to actually render the pictures on a screen... *shrug*
Posted by anonymous on May 5, 2008 at 4:55 a.m.:

rex: Rendering HTML nowadays is a heavy, complex task, so there is no light library, unfortunately. It sounds like using PyQt is the smartest approach because it does not load a full application, only a rendering engine you can fully control. Having a dummy X server on Unix seems to be a necessary evil.
Posted by Eric Moritz on May 5, 2008 at 3:50 p.m.:

I was thinking of this very issue a while back:

http://eric.themoritzfamily.com/2008/02/08/python-interface-mozilla-dom/

I came across this guy's post:

http://ejohn.org/blog/bringing-the-browser-to-the-server/

He's using Rhino and some custom javascript to emulate the browser's window object.
Posted by John Herren on May 13, 2008 at 1:20 a.m.:

Rhino ftw

Comments have been turned off for this entry.
