[Warning: another technical post ahoy...]
I've always been a do-it-yourselfer. If something pre-made doesn't do exactly what I want, easily and reliably, I generally forget about it and make it myself. This has served me well.
However, in my latest IvoryTower HTML-parsing adventures, I decided that I would try to learn some off-the-shelf technologies. For the more technical of you, I decided on this: with a block of HTML, start by cleaning things up with a monstrous regular expression, then put it through HTML Tidy, then put it through a complex XSL transform, and then use the XML DOM to manipulate it further. I've already explained my feelings toward regular expressions (respect and disgust), and I'm not really sold on XSLT (I could have done the same thing in a lot less time manually). I do rather enjoy the XML DOM, even though it does have some annoying shortcomings (for example, it's a major operation to excise a substring from a block of text and replace it with a new node, especially if you want to do this more than once). However, I am furious at HTML Tidy.
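For the curious, here is a rough sketch of the DOM shortcoming mentioned above: excising a substring from a text node and replacing it with a new node, more than once. It uses Python's standard xml.dom.minidom rather than the .NET XML DOM this post is about, and the helper name wrap_substring and the sample markup are made up for illustration. The point is how many steps the operation takes: split the text node twice, build an element, swap it in, and then carry on with the leftover text.

```python
from xml.dom import minidom


def wrap_substring(text_node, target, tag):
    """Replace each occurrence of `target` in `text_node` with <tag>target</tag>.

    Hypothetical helper for illustration; not part of any library.
    """
    doc = text_node.ownerDocument
    node = text_node
    while True:
        index = node.data.find(target)
        if index == -1:
            break
        # splitText() keeps the left part in `node` and returns the right part.
        match = node.splitText(index)
        # Split again so that `match` holds only the target substring.
        rest = match.splitText(len(target))
        element = doc.createElement(tag)
        element.appendChild(doc.createTextNode(target))
        # Swap the isolated text node for the new element.
        match.parentNode.replaceChild(element, match)
        # Keep searching in the text that followed the match.
        node = rest


doc = minidom.parseString("<p>tidy is not tidy at all</p>")
wrap_substring(doc.documentElement.firstChild, "tidy", "b")
print(doc.documentElement.toxml())
```

Each replacement leaves the original text node chopped into pieces, which is why doing this repeatedly means tracking the remaining text by hand.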
At first, I was really excited about HTML Tidy. Its mission is to take as input the worst HTML ever, and turn it into something useful, even well-formed XML... no small order. It does this pretty well, it turns out, though there are a lot of poorly documented options you need to learn. There's even a version of Tidy that you can use from other software, called libtidy. So, I decided early on that I would use this in IvoryTower. Big mistake.
Libtidy is a C library, which was a bit of a pain to call from .NET (string? what's a string?), though that wasn't the biggest problem. The biggest problem is that it failed randomly. I mean, completely at random, it would fail to allocate a memory buffer. I got around this by catching the error, waiting a few milliseconds, and then trying again, which worked. But then more problems started pouring in. The worst is that it wasn't Unicode-compliant, which was kind of an important requirement for me. So, I tried out TidyATL, a C++/ATL/COM wrapper around libtidy, which was easier to call from .NET, and Unicode-friendly. This, unfortunately, didn't get around the random failures... and in TidyATL, instead of a simple failure, you get a stack overflow, which is something you can't recover from. In the end, certain blocks of perfectly valid HTML would hang the app at 100% CPU for a while, and then the program would crash.
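The catch-wait-retry workaround for the random allocation failures looks roughly like the sketch below. This is Python, not the .NET code from IvoryTower, and every name in it is an assumption: call_with_retry is a hypothetical helper, flaky_tidy is a stand-in that simulates the failure instead of calling libtidy, and MemoryError stands in for whatever error the real call raised.

```python
import time


def call_with_retry(operation, attempts=5, delay_seconds=0.005,
                    retry_on=(MemoryError,)):
    """Run `operation`; on a listed error, pause briefly and try again.

    Hypothetical helper. The attempt count and delay are illustrative defaults.
    """
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == attempts:
                raise  # Out of attempts: let the caller see the failure.
            time.sleep(delay_seconds)  # Wait a few milliseconds first.


# Stand-in for the real Tidy call: fails twice, then succeeds.
calls = {"count": 0}


def flaky_tidy():
    calls["count"] += 1
    if calls["count"] < 3:
        raise MemoryError("simulated buffer allocation failure")
    return "<p>clean</p>"


result = call_with_retry(flaky_tidy)
print(result, "after", calls["count"], "calls")
```

A loop like this only helps with failures that surface as catchable errors. It does nothing for the stack overflow case, which takes the whole process down.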
I spent many, many, many late hours trying to come up with a solution, and finally I just couldn't. Tonight, I scrapped the whole idea of using libtidy and tried a different option, one that I had originally decided would be too complicated: using an SGML parser and DOM. (HTML was defined as an application of SGML.) This turned out to be extraordinarily simple: I had my program up and running in half an hour, with no weird little special cases, bugs, or hacks. Wonderful. It doesn't do quite as much as HTML Tidy since it's just a parser and DOM (for example, Tidy could, if I wanted it to, replace ugly <font> tags with valid CSS), but it does everything I need it to, without a heaping helping of pain. It turns out that this SGML reader was developed by another Microsoft employee on his own time.
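To give a feel for the "forgiving parser plus DOM" approach, here is a small sketch in Python. It is not the SGML reader I used: it leans on the standard html.parser module as a stand-in for a lenient parser and builds an xml.dom.minidom tree from it. The class name DomBuilder, the function parse_sloppy_html, the wrapping root element, and the short list of void tags are all assumptions made for the example.

```python
from html.parser import HTMLParser
from xml.dom import minidom

# Tags that never have a closing tag (a partial list, for illustration only).
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class DomBuilder(HTMLParser):
    """Feed in sloppy HTML, get a well-formed DOM tree out."""

    def __init__(self):
        super().__init__()
        self.doc = minidom.Document()
        root = self.doc.createElement("root")  # Wrapper element (an assumption).
        self.doc.appendChild(root)
        self.stack = [root]  # Currently open elements, innermost last.

    def handle_starttag(self, tag, attrs):
        element = self.doc.createElement(tag)
        for name, value in attrs:
            # Valueless attributes (e.g. "checked") get their own name as value.
            element.setAttribute(name, value if value is not None else name)
        self.stack[-1].appendChild(element)
        if tag not in VOID_TAGS:
            self.stack.append(element)

    def handle_endtag(self, tag):
        # Close back to the nearest matching open element, implicitly closing
        # anything left open inside it; stray end tags are ignored.
        for depth in range(len(self.stack) - 1, 0, -1):
            if self.stack[depth].tagName == tag:
                del self.stack[depth:]
                break

    def handle_data(self, data):
        self.stack[-1].appendChild(self.doc.createTextNode(data))


def parse_sloppy_html(html):
    builder = DomBuilder()
    builder.feed(html)
    builder.close()
    return builder.doc


# The <b> is never closed and the <br> has no slash; the tree is still valid.
tree = parse_sloppy_html("<p>Hello <b>world<br>again</p>")
print(tree.documentElement.toxml())
```

Once the markup is in a DOM, the rest of the work is ordinary tree manipulation, with no separate clean-up pass required.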
This whole experience has left me quite frustrated, though in the end, it all seems to have worked out. This was way more work than it needed to be. I should have just stuck with my original instincts: do all of the parsing myself, and then use the XML DOM for tree operations. Then, I would have avoided regular expressions and Tidy altogether, and I would have been a much happier developer.
1 comment:
But it works wonderfully well now, which is good. I'd heard of similar HTML Tidy problems elsewhere, but I can't remember where now. Was it possibly on Slashdot?
Either way, I'm glad you got it working with the SGML parser and DOM.
Woohoo!
Oh! And don't work _too_ hard - breaks are good things.