Tuesday, October 12, 2004

Understand the regex, win teh prize

I have enormous levels of simultaneous respect and disgust for regular expressions (regex). The fact that such complex behavior can be expressed in such a small space is undeniably cool to a software engineer. Yet, as someone who prizes readable and understandable code above almost all else, I generally find them extremely unpleasant to work with. I recently wrote this regular expression:


If this were written in procedural code (VB, C/C++/C#, Java, etc.), it would probably run to a hundred lines, yet it can be expressed in the language of regular expressions in fewer than a hundred characters. The problem with something like this is that it's pretty much impossible to visualize. Once it's written and put into use, it's set in stone, because no one will be able to understand it later. Had I not documented this regular expression, I'd probably have no idea what it was for in a week. Now, the same can be said about a lot of things, and someone who works with regular expressions on a regular basis would have better luck understanding it. Still, I don't expect that there are many people in the world who could understand the line of code above without an extreme amount of effort.
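As a rough illustration of what documenting a regular expression can look like, here is a minimal sketch in Python. It is not the expression from this post. The pattern, the name DATE_RE, and the helper parse_date are all made up for the example, and it assumes a regex engine with a "verbose" mode that allows whitespace and comments inside the pattern.

```python
import re

# Hypothetical example pattern (NOT the expression from this post):
# matches an ISO-style date such as 2004-10-12.
# re.VERBOSE lets the pattern span several lines and carry its own comments.
DATE_RE = re.compile(
    r"""
    (?P<year>\d{4})    # four-digit year
    -
    (?P<month>\d{2})   # two-digit month
    -
    (?P<day>\d{2})     # two-digit day
    """,
    re.VERBOSE,
)


def parse_date(text):
    """Return (year, month, day) as ints, or None if no date is found."""
    match = DATE_RE.search(text)
    if match is None:
        return None
    return tuple(int(match.group(name)) for name in ("year", "month", "day"))
```

The compact form of the same pattern would be a single line of about forty characters. Spreading it out doesn't make the syntax any friendlier, but at least the comments survive next to the pieces they describe.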

In writing that expression, I learned all sorts of new terminology like "atomic zero-width assertions" and "positive lookbehind group." I am certain that these names were chosen primarily to make regular expressions sound even more technical than they really are. The main computer teacher at my high school, Mrs. Trumble, had a theory that most computer science terms were named the way they are to sound more daunting and frightening to non-technical people, setting those who could understand the terminology cleanly apart from the rest of the world, and therefore ensuring greater job security for the future. When I heard this theory, I was immediately convinced, and I still am today.
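For what it's worth, the idea behind a "positive lookbehind" is simpler than the name. It is a "zero-width" assertion, meaning it checks what comes just before the current position without consuming any characters. Here is a small sketch in Python, with a made-up pattern and function name chosen only for illustration:

```python
import re

# Hypothetical pattern: match a run of digits only when it is
# immediately preceded by a dollar sign.
# (?<=\$) is the positive lookbehind: it requires a "$" just before
# the match, but the "$" itself is not part of the matched text.
PRICE_RE = re.compile(r"(?<=\$)\d+")


def find_dollar_amounts(text):
    """Return the digit strings that directly follow a dollar sign."""
    return PRICE_RE.findall(text)


print(find_dollar_amounts("Paid $40 for 3 books and $15 for lunch"))
# prints ['40', '15']
```

The "3" is skipped because nothing dollar-shaped sits in front of it, and the dollar signs never show up in the results because the assertion only looks, it doesn't take.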


Matthew Beermann said...

It scares off a few of the technical people, too...

Anonymous said...

Regular expressions become second nature once you've used them for a while. Granted, I would have to get out the Camel Book and flip to chapter five to verify that lookbehind assertion, but they're really not that difficult to understand. They are certainly more readable and maintainable than the corresponding hundred-line code snippet would be.