How to parse HTML
In the bad old days, parsing real-world HTML was a horrible task, because every web browser had a huge collection of undocumented corner cases and hacks; some accidental, some the result of reverse-engineering other vendors' corner cases and hacks. Most standalone HTML parsers could generate some document tree from a given input file; whether or not it matched the tree an actual browser would build was another matter.
These days, however, we have the HTML5 parsing algorithm: reverse-engineered from various vendors' web browsers, but actually documented and implementable (still horribly complicated, but that's legacy content for you). Not only is the HTML5 parsing algorithm designed to be compatible with legacy browsers, but modern browsers are also replacing their old parsing code with new HTML5-compatible implementations, so parsing should become even more consistent. (I know Firefox has switched to an HTML5 parser, and I think IE has made a bunch of noise about it too; I don't follow WebKit all that closely, but I'd be surprised if they hadn't moved towards an HTML5 parser.)
Now that HTML5 defines how to parse all HTML fragments, there is really no reason not to use that algorithm.
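To illustrate, here is a minimal Python sketch assuming the third-party html5lib package (a pure-Python implementation of the HTML5 parsing algorithm); the tag soup input is made up for illustration:

    import html5lib

    # Tag soup: unclosed <p> tags, a stray </b>, no <html> or <body>.
    tag_soup = "<p>one<p>two</b><li>three"

    # html5lib builds the same tree a conforming browser would,
    # here as an xml.etree.ElementTree structure.
    doc = html5lib.parse(tag_soup, namespaceHTMLElements=False)
    body = doc.find(".//body")
    print([child.tag for child in body])
    # expected: ['p', 'p', 'li'] (the open <p> is closed, the stray </b> dropped)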
This article should be called "how to write a Marpa-based HTML parser", not "how to parse HTML". If you're a Perl programmer and want to parse HTML into an XML-style DOM, use XML::LibXML. If you can't handle the libxml2 dependency, use HTML::Parser.
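(Not a Perl snippet, but for comparison: in Python, lxml wraps the same libxml2 HTML parser that XML::LibXML uses, so the DOM route looks roughly like this; the markup is made up.)

    from lxml.html import document_fromstring, tostring

    # libxml2 repairs the tag soup into a proper document tree,
    # which you can then query (XPath/CSS) or re-serialize.
    doc = document_fromstring("<p>one<p>two<ul><li>a<li>b</ul>")
    print(tostring(doc, pretty_print=True, encoding="unicode"))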
The fact that browsers accept defective HTML is the most evil thing that has happened to the web. Any library that tries to parse "real-world" HTML just contributes to that evil. I am astonished that we tolerate this and still call ourselves (software) engineers.
Since this seems to be aimed (among other things) at input sanitization, here is a semi-relevant entry that might amuse someone.
https://gist.github.com/1575452
This is a sanitizing HTML "parser" done in roughly 100 lines of PHP code. It does tag and attribute whitelisting, checks URL protocols to prevent XSS, deals with unclosed and unopened tags, and does a few other things. The biggest issue is that it's not well factored. However, its shortness is appealing, because I understand how it works. I would have a hard time trusting a library with thousands of lines of code to do input validation.
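For comparison, here is a similarly tiny sketch of the same whitelisting idea in Python. It is not the gist's code; the tag/attribute whitelists and the test input are made up, and it skips plenty that a real sanitizer needs, but it shows the shape of the approach.

    from html import escape
    from html.parser import HTMLParser
    from urllib.parse import urlparse

    # Made-up whitelists for illustration only.
    ALLOWED_TAGS = {"p", "b", "i", "em", "strong", "a", "ul", "ol", "li"}
    ALLOWED_ATTRS = {"a": {"href", "title"}}
    ALLOWED_SCHEMES = {"", "http", "https", "mailto"}

    class Sanitizer(HTMLParser):
        def __init__(self):
            super().__init__(convert_charrefs=True)
            self.out = []
            self.open_tags = []  # so unclosed tags can be closed at the end

        def handle_starttag(self, tag, attrs):
            if tag not in ALLOWED_TAGS:
                return  # drop the tag, keep its text content
            kept = []
            for name, value in attrs:
                if name not in ALLOWED_ATTRS.get(tag, set()):
                    continue
                if name == "href" and urlparse(value or "").scheme.lower() not in ALLOWED_SCHEMES:
                    continue  # blocks javascript:, data:, etc.
                kept.append(' {}="{}"'.format(name, escape(value or "", quote=True)))
            self.out.append("<{}{}>".format(tag, "".join(kept)))
            self.open_tags.append(tag)

        def handle_endtag(self, tag):
            if tag not in self.open_tags:
                return  # ignore end tags that were never opened
            while self.open_tags:  # close anything still open inside it
                closed = self.open_tags.pop()
                self.out.append("</{}>".format(closed))
                if closed == tag:
                    break

        def handle_data(self, data):
            self.out.append(escape(data))

        def result(self):
            while self.open_tags:  # close tags the input left open
                self.out.append("</{}>".format(self.open_tags.pop()))
            return "".join(self.out)

    s = Sanitizer()
    s.feed('<p onclick="evil()">Hi <a href="javascript:alert(1)">there</a><b>bold')
    print(s.result())  # <p>Hi <a>there</a><b>bold</b></p>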
For you Python users, the BeautifulSoup library has a prettify method which does the same thing.
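For example, a quick sketch assuming the beautifulsoup4 package (the markup is made up):

    from bs4 import BeautifulSoup

    messy = "<html><body><p>one</p><p>two <b>bold</b></p></body></html>"
    # prettify() re-serializes the parsed tree, one tag per line,
    # indented to show the structure the parser built.
    print(BeautifulSoup(messy, "html.parser").prettify())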
If you want to get serious about web crawling and/or web scraping (within legal boundaries, of course), you want to use Node.js and the appropriate modules (I don't remember the exact names right now). Because Node.js is built on the V8 JavaScript engine, it can come much closer to emulating a real web browser: it can load and parse the HTML as well as execute the JavaScript. And many sites won't load properly without JavaScript.