Saturday, June 23, 2007

Regex Too Terse

The purpose of making programming languages terser is so they'll be more readable. But regexes are too terse. To make them readable, we need to make them more verbose. We can format them more easily by using the (?x) flag, but the syntax is so different from the languages they're embedded within that they still stick out, requiring mental effort to digest. But JVM languages like Groovy aren't stuck with regexes just because Java has them. Just as Groovy generates JVM bytecode, it could generate the terse regex syntax too. What would a more verbose yet readable regex syntax for Groovy look like?
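To show what the (?x) flag does and doesn't buy us, here's a small sketch in Java, whose regex engine Groovy shares. The class, field and method names are made up for illustration. The same date pattern appears twice, once terse and once spread out with comments:

```java
import java.util.regex.Pattern;

final class CommentsFlagSketch {
    // The terse form.
    static final String TERSE = "\\d{4}-\\d{2}-\\d{2}";

    // The same pattern under (?x): whitespace and #-comments are ignored.
    static final String SPACED =
        "(?x)      # switch on comments mode\n" +
        "\\d{4}    # year\n" +
        "-         # separator\n" +
        "\\d{2}    # month\n" +
        "-         # separator\n" +
        "\\d{2}    # day\n";

    static boolean isDate(String regex, String text) {
        return Pattern.matches(regex, text);
    }

    public static void main(String[] args) {
        System.out.println(isDate(TERSE, "2007-06-23"));   // true
        System.out.println(isDate(SPACED, "2007-06-23"));  // true
    }
}
```

The layout is better, but every token is still regex syntax sitting inside a string, which is the complaint here.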

For starters, we wouldn't need to embed the regex inside slashes / /, as the syntax would be mixable with Groovy's. Perhaps it could be generated by a builder. A small amount of syntax could remain the same. The alternation operator | acts like Groovy's | operator. The option operator ? has a parallel in Groovy's ?. operator. We could keep escaped control characters, and change the meaning of \b from word boundary to backspace as in Groovy, so we'd have '\t\n\b\f\r' instead of /\t\n\010\f\r/.
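The two meanings of \b are easy to mix up, so here's a minimal Java sketch of the difference as it stands today. The class and constant names are just for illustration:

```java
import java.util.regex.Pattern;

final class BackspaceVsBoundarySketch {
    // In a Java (or Groovy) string literal, \b is the backspace character, U+0008.
    static final String BACKSPACE = "\b";

    // In the regex language, \b is a zero-width word boundary ...
    static final Pattern WORD_BOUNDARY = Pattern.compile("\\bcat\\b");

    // ... so a literal backspace has to be spelled as an octal or hex escape.
    static final Pattern OCTAL_BACKSPACE = Pattern.compile("\\010");
    static final Pattern HEX_BACKSPACE = Pattern.compile("\\x08");

    public static void main(String[] args) {
        System.out.println((int) BACKSPACE.charAt(0));                     // 8
        System.out.println(WORD_BOUNDARY.matcher("a cat sat").find());     // true
        System.out.println(WORD_BOUNDARY.matcher("concatenate").find());   // false
        System.out.println(OCTAL_BACKSPACE.matcher(BACKSPACE).matches());  // true
        System.out.println(HEX_BACKSPACE.matcher(BACKSPACE).matches());    // true
    }
}
```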

For character classes, we could use Groovy's sequence syntax, ('a'..'j') instead of /[a-j]/, or [*'a'..'z', *'A'..'Z', '_', *'0'..'9'] instead of /[a-zA-Z_0-9]/. We could use !['a','c','e'] instead of /[^ace]/. Pre-defined classes could have special variable names within the regex builder context, eg, ws for /\s/, digit for /\d/, and word for /\w/. We could even define our own character classes, eg, def hexDigit= [*'0'..'9', *'A'..'F', *'a'..'f'], or def notDigit = !digit for /\D/.
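As a rough sketch of how such character classes could be turned back into the terse syntax, here are a few helper functions in Java. All of the names (range, chars, anyOf, noneOf) are hypothetical, and a real builder would want Groovy's range and list syntax rather than method calls:

```java
import java.util.regex.Pattern;

final class CharClassSketch {
    // Characters that need a backslash inside [...] in java.util.regex.
    private static final String SPECIAL = "\\^-[]&";

    private static String escape(char c) {
        return SPECIAL.indexOf(c) >= 0 ? "\\" + c : String.valueOf(c);
    }

    // range('a', 'j') gives the fragment "a-j", standing in for ('a'..'j').
    static String range(char from, char to) {
        return escape(from) + "-" + escape(to);
    }

    // chars('a', 'c', 'e') gives the fragment "ace".
    static String chars(char... cs) {
        StringBuilder sb = new StringBuilder();
        for (char c : cs) sb.append(escape(c));
        return sb.toString();
    }

    // anyOf(...) wraps fragments as [...]; noneOf(...) as [^...], like ![...] above.
    static String anyOf(String... fragments) {
        return "[" + String.join("", fragments) + "]";
    }

    static String noneOf(String... fragments) {
        return "[^" + String.join("", fragments) + "]";
    }

    static boolean matches(String regex, String text) {
        return Pattern.matches(regex, text);
    }

    public static void main(String[] args) {
        String word = anyOf(range('a', 'z'), range('A', 'Z'), chars('_'), range('0', '9'));
        System.out.println(word);                                 // [a-zA-Z_0-9]
        System.out.println(noneOf(chars('a', 'c', 'e')));         // [^ace]
        String hexDigit = anyOf(range('0', '9'), range('A', 'F'), range('a', 'f'));
        System.out.println(matches(hexDigit + "+", "CafeBabe"));  // true
    }
}
```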

For groups, parentheses are sensible, so 'a'+('b'|'c')+'d' as new syntax for /a(b|c)d/, but groups should be non-capturing by default. For capturing groups, we can use variable names instead of numbers, ie, 'a'+(bc='b'|'c')+'d'+bc+'e' instead of /a(b|c)d\1e/.
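For comparison, the underlying engine can already express both ideas, just tersely: (?:...) for a non-capturing group, and, from Java 7 onwards, (?<name>...) with \k<name> for a named group and its backreference. A builder could plausibly generate these. A small sketch, with class and method names of my own choosing:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NamedGroupSketch {
    // /a(b|c)d/ with a group that does not capture.
    static final Pattern NON_CAPTURING = Pattern.compile("a(?:b|c)d");

    // /a(b|c)d\1e/ with the group referred to by name instead of number.
    static final Pattern NAMED = Pattern.compile("a(?<bc>b|c)d\\k<bc>e");

    // Returns what the group named bc captured, or null if the text doesn't match.
    static String captured(String text) {
        Matcher m = NAMED.matcher(text);
        return m.matches() ? m.group("bc") : null;
    }

    public static void main(String[] args) {
        System.out.println(NON_CAPTURING.matcher("abd").groupCount());  // 0
        System.out.println(captured("abdbe"));                          // b
        System.out.println(captured("abdce"));                          // null
    }
}
```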

For the wildcard, perhaps replace the dot with an underscore, as in 'a'+_+'c' instead of /a.c/. For the repetition operators, we could use sequences, so a new special syntax 'a'*(0..) + 'b'*(1..) instead of /a*b+/, and 'a'*(3..5) instead of /a{3,5}/.
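Here's one way the range-to-quantifier translation might work, sketched in Java. The repeat method, the UNBOUNDED marker (standing in for the open end of 0..) and the ANY constant (standing in for the underscore) are all assumptions for illustration:

```java
final class RepeatSketch {
    static final int UNBOUNDED = Integer.MAX_VALUE;  // the open end of (0..) or (1..)
    static final String ANY = ".";                   // the proposed _ wildcard

    // repeat(expr, min, max) plays the role of expr*(min..max).
    static String repeat(String expr, int min, int max) {
        if (min < 0 || max < min) {
            throw new IllegalArgumentException("bad range " + min + ".." + max);
        }
        // Anything longer than one character is wrapped so the quantifier covers all of it.
        String atom = expr.length() == 1 ? expr : "(?:" + expr + ")";
        return atom + quantifier(min, max);
    }

    private static String quantifier(int min, int max) {
        if (max == UNBOUNDED) {
            if (min == 0) return "*";
            if (min == 1) return "+";
            return "{" + min + ",}";
        }
        if (min == 0 && max == 1) return "?";
        if (min == max) return "{" + min + "}";
        return "{" + min + "," + max + "}";
    }

    public static void main(String[] args) {
        System.out.println(repeat("a", 0, UNBOUNDED) + repeat("b", 1, UNBOUNDED));  // a*b+
        System.out.println(repeat("a", 3, 5));                                      // a{3,5}
        System.out.println("a" + ANY + "c");                                        // a.c
    }
}
```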

Flags could be indicated by names heading a closure, eg, caseInsignificant{ 'aBc'*(1..)+'DeFg' } instead of /(?i:(aBc)+DeFg)/. Lazy and possessive operators could be indicated by such names, eg, lazy{ 'abc'*(0..) } instead of /(abc)*?/, and possessive{ 'def'*(1..) } instead of /(def)++/.
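Since greedy, lazy and possessive quantifiers look almost identical but behave quite differently, a small Java sketch may help. The wrapper functions are hypothetical stand-ins for the named closures, and they assume the argument already ends in a quantifier:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class QuantifierModeSketch {
    // Wraps an expression in an inline case-insensitive group.
    static String caseInsensitive(String expr) { return "(?i:" + expr + ")"; }

    // Both assume 'quantified' already ends in a quantifier such as * or +.
    static String lazy(String quantified)       { return quantified + "?"; }
    static String possessive(String quantified) { return quantified + "+"; }

    // Returns the first match found in the text, or null if there is none.
    static String firstMatch(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String text = "<a><b>";
        // Greedy: takes as much as it can, then backtracks to the last '>'.
        System.out.println(firstMatch("<.+>", text));                     // <a><b>
        // Lazy: takes as little as it can.
        System.out.println(firstMatch("<" + lazy(".+") + ">", text));     // <a>
        // Possessive: swallows the final '>' too and never gives it back.
        System.out.println(firstMatch("<" + possessive(".+") + ">", text));  // null
        System.out.println(Pattern.matches(caseInsensitive("(?:aBc)+DeFg"), "ABCabcdefg"));  // true
    }
}
```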

Lookarounds could also be shown by names, after{'a'} instead of /(?=a)/, !after{'b'} instead of /(?!b)/, and before{'c'} instead of /(?<=c)/. The pre-defined anchors would have special variable names, eg, wordBoundary instead of /\b/, lineStart instead of /^/, and lineEnd instead of /$/. And we could define our own anchors, eg, def sentenceEnd= before{['.','?','!']}.
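A sketch of how these named lookarounds and anchors might map onto the terse syntax, again in Java. The method names follow the naming above (after for a lookahead, before for a lookbehind), notAfter stands in for !after, and the count helper is just an assumption for demonstration:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LookaroundSketch {
    static String after(String expr)    { return "(?=" + expr + ")"; }   // lookahead
    static String notAfter(String expr) { return "(?!" + expr + ")"; }   // negative lookahead
    static String before(String expr)   { return "(?<=" + expr + ")"; }  // lookbehind

    static final String WORD_BOUNDARY = "\\b";
    static final String LINE_START = "^";  // per line only under Pattern.MULTILINE
    static final String LINE_END = "$";

    // A user-defined anchor, like sentenceEnd above.
    static final String SENTENCE_END = before("[.?!]");

    // Counts matches, with ^ and $ applying to every line.
    static int count(String regex, String text) {
        Matcher m = Pattern.compile(regex, Pattern.MULTILINE).matcher(text);
        int n = 0;
        while (m.find()) n++;
        return n;
    }

    public static void main(String[] args) {
        System.out.println(count("q" + notAfter("u"), "Iraq quit"));    // 1
        System.out.println(count(LINE_START + "b", "a\nb\nbc"));        // 2
        System.out.println(count("c" + LINE_END, "abc\nc\ncd"));        // 2
        System.out.println(count(SENTENCE_END, "One. Two? Three!"));    // 3
    }
}
```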

I thought of this replacement syntax off the top of my head. It's just an idea for a RegexBuilder for Groovy. We could have Groovy statements interacting with the regex syntax, just like other builders do, so we could capture information that would normally be lost in the regex backtracking. Maybe regex functions normally outside the regex string, such as text replacement, could also be done within the builder syntax.

So instead of regex syntax being so terse it's unreadable, and sticking out like a sore thumb from the cool Groovy syntax, it could be made more verbose so it's easily readable, and mixes nicely with other Groovy syntax.
