Reformatting Bad HTML
I think a better (or more modern) approach would be to use an HTML5 parser (like html5lib), as it has a standard way to parse INVALID documents too, and then recreate the HTML from the DOM.
BTW: Paul Irish has this ticket asking for help/someone to update lazyweb to use an HTML5 parser: https://github.com/paulirish/lazyweb-requests/issues#issue/2...
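For anyone curious what the parse-then-reserialize idea might look like, here is a minimal sketch using html5lib in Python. It assumes html5lib 1.x is installed (pip install html5lib). The function name reserialize and the sample markup are made up for illustration. It only re-emits the repaired tree and does no indentation, so pretty-printing would be a separate step.

```python
import html5lib  # third-party package, assumed installed: pip install html5lib


def reserialize(markup: str) -> str:
    """Parse possibly invalid HTML with the HTML5 algorithm and write it back out.

    'reserialize' is a hypothetical helper name used only for this sketch.
    """
    # Build an ElementTree using the HTML5 parsing algorithm, which specifies
    # how to recover from errors such as unclosed or misnested tags.
    tree = html5lib.parse(markup, treebuilder="etree")
    # Turn the tree back into markup. omit_optional_tags=False keeps end tags
    # such as </p> explicit instead of dropping the ones HTML allows omitting.
    return html5lib.serialize(tree, tree="etree", omit_optional_tags=False)


if __name__ == "__main__":
    # Unclosed <p> and <b> tags: the parser decides where they end.
    print(reserialize("<p>one<p>two<b>bold"))
```

Note that this does correct the document (missing html, head and body elements get added, unclosed tags get closed), so it fits best when a repaired document is what you want.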
My preferred method is "html2haml < filename.html | haml". Doesn't care about document validity - just cleans up attributes and tag nesting/indentation. Generally speaking, I don't want the document corrected, just reformatted. Works for any XML, too, not just HTML.
This even works for .html.erb files.