Regex to the rescue

At work recently, I was tasked with splitting one massive directory into two equally massive directories. I soon realized that this seemingly simple task would require hundreds, nay, thousands of link changes across perhaps ten thousand pages. Going through each file one by one and manually editing links would have been unthinkable, so naturally, I turned to EditPad Pro. EditPad especially excels at mass file editing; it never ceases to amaze me how it can open hundreds of files at once, all while taking up minimal memory. Its regex features also came in handy, as we're about to see.

With hundreds of files open at once, I found myself presented with a golden opportunity to fix years of bad code and legacy issues, all while doing that which was most essential to the project: the actual link changes. What started out as a simple string "find and replace" soon became a full-on diagnostic scan of all content pages. Here's a rundown of some of the regular expressions that I used and honed during this process.

Problematic characters

(?<!href=.*|aspx?|cfml?|"|/|<)\?\w|\?\s(?-i:[a-z])|[^\x20-\x7E\s]| \s|\s |(?<!<cf.*)&(?:(?=\s)|(?!(?:\w{2,6}|#\d{2,5});))|%5F|%20

Problematic characters got its start when I discovered that many content pages had been composed in Microsoft Word and still contained characters from Microsoft's abominable Windows-1252 text encoding. Depending upon the browser or user setup, these characters would display as empty rectangles, seemingly random characters, or other such gibberish. This regex identifies several kinds of problems:

1. A misplaced question mark, namely one immediately followed by a word character, yet not part of a URL string. This may indicate that a text editor had trouble converting Microsoft's "Smart Quote" characters (or other such nonsense) to a UTF or ISO encoding. Failing that, the text editor may have automatically replaced the unknown character with a simple question mark to warn the user.
2. A question mark, followed by a whitespace character, followed by a lower-case letter. This may indicate a letter that needs to be capitalized, or it may be a conversion problem (see #1).
3. A character that is both 1.) outside of the printable ASCII character set, and 2.) not a whitespace character. A good rule of thumb is to encode non-ASCII characters as HTML entities. For instance, the copyright character (©) would be written in the code as "&copy;".
4. A literal "space" followed by any whitespace character, or the reverse: any whitespace character followed by a literal "space" character. This expression captures a space adjacent to a tab. It also captures trailing space characters and two consecutive literal "space" characters. These characters are problematic only in the sense that they are unnecessary and slow down page loads (if only infinitesimally).
5. An unencoded ampersand (outside of a ColdFusion tag).
6. The literal text "%5F" or "%20". Both are URL encodings and are unnecessary in some contexts.
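For readers without EditPad Pro, here is a rough Python sketch of items 3, 4, and 5. It is a simplified approximation, not the expression above: Python's `re` module does not accept the variable-length lookbehinds, so the ColdFusion and URL exclusions are left out. The function name `find_problems` and the sample string are made up for illustration.

```python
import re

# Simplified approximations of three branches of the EditPad Pro expression.
# Outside printable ASCII and not whitespace. Note that Python's \s is
# Unicode-aware, so a raw non-breaking space (U+00A0) counts as whitespace here.
NON_ASCII = re.compile(r"[^\x20-\x7E\s]")
# A literal space next to any whitespace character (either order).
STRAY_SPACE = re.compile(r" \s|\s ")
# An ampersand that does not begin a named or numeric entity.
BARE_AMPERSAND = re.compile(r"&(?!(?:\w{2,6}|#\d{2,5});)")


def find_problems(text):
    """Return (label, offset, matched text) tuples sorted by offset."""
    checks = [
        ("non-ascii", NON_ASCII),
        ("stray-space", STRAY_SPACE),
        ("bare-ampersand", BARE_AMPERSAND),
    ]
    hits = []
    for label, pattern in checks:
        for m in pattern.finditer(text):
            hits.append((label, m.start(), m.group()))
    return sorted(hits, key=lambda h: h[1])


sample = "Fish & chips  \u00a9 2009 &amp; more"
for hit in find_problems(sample):
    print(hit)
```

In the sample, the bare "&" is reported while "&amp;" is left alone, which is the behavior item 5 describes.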

Bad code

(?:<br>){2,}|<(?:center|font|u)\b[^>]*>|</?(?-i:[A-Z]+[^>]*>)|(?:&nbsp;){2,}|<(\w*)>\s*</\1>

Bad code was originally a part of the regex Problematic characters, but I forked it because the regex was getting too long. Features of Bad code:

1. The <br> tag occurring two or more times in succession. Nine times out of ten, the developer is using <br>s to space paragraphs instead of the venerable <p> tag!
2. The following tags: <center>, <font>, and <u>. All are deprecated elements in HTML 4; <center> and <font> are obsolete in HTML 5 as well, while <u> was later redefined there.
3. Any tag written in uppercase.
4. "&nbsp;" occurring two or more times in succession. Again, nine times out of ten, the developer is trying to hack the non-breaking space into a layout tool (whether he or she knows it or not)!
5. An empty element, that is, an opening tag followed by its own closing tag with nothing but whitespace in between. This is most likely residue from a WYSIWYG editor.
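The same five checks can be sketched in Python. This is an approximation under a few assumptions of mine: the `<br>` branch also tolerates `<br />` and whitespace between tags, the uppercase branch only flags tag names that are entirely uppercase, and the helper name `find_bad_code` is hypothetical.

```python
import re

BAD_CODE = re.compile(
    r"(?i:(?:<br\s*/?>\s*){2,})"        # two or more consecutive <br> tags
    r"|<(?i:center|font|u)\b[^>]*>"      # deprecated presentational tags
    r"|</?[A-Z][A-Z0-9]*\b[^>]*>"        # tag names written entirely in uppercase
    r"|(?:&nbsp;){2,}"                   # runs of non-breaking space entities
    r"|<(\w+)>\s*</\1>"                  # an element with nothing inside it
)


def find_bad_code(html):
    """Return every substring of html that one of the branches matches."""
    return [m.group(0) for m in BAD_CODE.finditer(html)]


sample = '<P>Intro</P><br><br /><font size="2">Hi</font>&nbsp;&nbsp;<span> </span>'
for match in find_bad_code(sample):
    print(repr(match))
```

`finditer` is used instead of `findall` because the backreference group would otherwise make `findall` return only the captured tag name.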

Old variables

Old variables is little more than a piped list of legacy variables that either need updating or removing. Using the list is still much faster than searching for each variable individually.
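As a minimal sketch of how such a piped list can be assembled, here is a Python version. The variable names are invented for illustration; the real list is not shown in this post.

```python
import re

# Hypothetical legacy names, invented for illustration only.
OLD_VARIABLES = ["siteRoot", "old_nav_path", "imgDir"]

# Escape each name and join with "|" to form the piped list. The \b anchors
# keep "imgDir" from matching inside a longer name such as "imgDirectory".
OLD_VARIABLE_PATTERN = re.compile(
    r"\b(?:" + "|".join(map(re.escape, OLD_VARIABLES)) + r")\b"
)

line = '<a href="#siteRoot#/about.cfm"><img src="#imgDir#/logo.gif"></a> imgDirectory'
print(OLD_VARIABLE_PATTERN.findall(line))  # ['siteRoot', 'imgDir']
```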

Remove line breaks in text

Regex: ([^>\s])\s*\r+\s*([^<\s.])

Replacement text: $1 $2

This one removes line breaks where they don't need to be, such as in passages of text. I say, "Let the text wrap on its own!" This may or may not reduce the file size, but nonetheless, it improves readability.
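A rough Python equivalent follows. Two adaptations are mine rather than part of the original: the line-break class is widened to `[\r\n]+` so that files with bare LF endings are handled too, and Python writes the replacement as `\1 \2` instead of `$1 $2`. The function name `join_wrapped_lines` is hypothetical.

```python
import re

# A character that is neither ">" nor whitespace, then a line break (with any
# surrounding whitespace), then a character that is not "<", whitespace, or ".".
WRAPPED_LINE = re.compile(r"([^>\s])\s*[\r\n]+\s*([^<\s.])")


def join_wrapped_lines(text):
    """Replace line breaks inside running text with a single space."""
    return WRAPPED_LINE.sub(r"\1 \2", text)


sample = "<p>This sentence was\nwrapped by hand\n   in the editor.</p>\n<p>Next</p>"
print(join_wrapped_lines(sample))
```

The break between `</p>` and `<p>` survives because the character before it is ">", so only breaks inside the paragraph text are collapsed.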

Caveats

The aforementioned regexes were developed using EditPad Pro, which uses a custom regex flavor — a flavor, I might add, that purports to combine the "best features" of the more prominent regex flavors available: Perl, PCRE, .NET, JavaScript, etc. These code samples may not work as intended in your favorite text editor or programming language.
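To make that concrete, here is a small check against Python's built-in `re` module, which rejects the variable-length lookbehind used in the first expression. The helper name `compiles` is made up for this sketch.

```python
import re


def compiles(pattern):
    """Return True if Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


# Variable-length lookbehind, as in the Problematic characters expression.
print(compiles(r"(?<!href=.*)\?\w"))   # False
# A fixed-width lookbehind is accepted.
print(compiles(r'(?<!["/<])\?\w'))     # True
```

Porting these expressions to another tool therefore means reworking such constructs, not just copying the text across.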

I'm constantly improving and tightening these expressions. In fact, I made some key improvements to the first two examples as I was writing this post. What is acceptable to me this week I may find to be less than optimal next week. Just a month and a half ago, I posted a regex here; I look back on it and observe how naive and sloppy I was!

This entry was posted on Tuesday, July 14th, 2009 at 3:30 pm and is filed under technology.
