In-place lexing definitions
Beta-4 features a basic in-place comment lex-definition feature. For example, addcomment "(?s:#{4}.*)" in the source causes the parser to add this comment definition to the whitespace lexer for subsequently parsed and/or evaluated files. This particular definition lets us put #### in a source file to comment out everything up to end-of-file. (We'll change the addcomment keyword into an annotation later.) In-place string lex-defines are also coming. For my earlier thoughts on in-place lexical definitions, see part one and part two.
To make such in-place lex-definitions finer-grained, I need to weave just-in-time lexing (with pushbacks) throughout the parser. I've just started on this. We could also define other lex-definitions in the source, e.g. numeric and date formats. See an earlier post of mine for more ideas on this.
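The idea of a comment definition that takes effect from the point it appears can be sketched roughly as follows. This is an illustration in Python, not the Beta-4 code: the WhitespaceLexer class, the significant_text function and the exact directive regex are all assumptions made for the example, and only the addcomment "(?s:#{4}.*)" form comes from the text above.

```python
import re

# Assumed directive shape, modelled on the post's example: addcomment "<regex>"
ADDCOMMENT = re.compile(r'addcomment\s+"([^"]*)"')
WHITESPACE = re.compile(r"\s+")
WORD = re.compile(r"\S+")


class WhitespaceLexer:
    """Skips whitespace plus any comment patterns registered so far (hypothetical)."""

    def __init__(self):
        self.comment_patterns = []

    def add_comment(self, pattern):
        self.comment_patterns.append(re.compile(pattern))

    def skip(self, text, pos):
        """Return the first position at or after pos that is neither whitespace nor comment."""
        moved = True
        while moved:
            moved = False
            m = WHITESPACE.match(text, pos)
            if m:
                pos, moved = m.end(), True
            for pattern in self.comment_patterns:
                m = pattern.match(text, pos)
                if m and m.end() > pos:
                    pos, moved = m.end(), True
        return pos


def significant_text(source, lexer):
    """Collect the non-whitespace, non-comment chunks, honouring addcomment directives."""
    chunks, pos = [], 0
    while True:
        pos = lexer.skip(source, pos)
        if pos >= len(source):
            return chunks
        directive = ADDCOMMENT.match(source, pos)
        if directive:
            # The definition only affects text lexed after this point.
            lexer.add_comment(directive.group(1))
            pos = directive.end()
            continue
        word = WORD.match(source, pos)
        chunks.append(word.group())
        pos = word.end()


lexer = WhitespaceLexer()
first_file = 'addcomment "(?s:#{4}.*)"\nx = 1\n#### ignored to end-of-file\ny = 2\n'
print(significant_text(first_file, lexer))
# Reusing the same lexer models "subsequently parsed" files seeing the definition.
print(significant_text("a ####b", lexer))
```

Because the lexer object carries the registered patterns, a file parsed later with the same lexer treats #### as a comment without repeating the directive, while a fresh lexer does not.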
In-place parsing hooks
We'll also provide hooks in the parser to enable Groovy programmers to define their own primary expressions, path elements, operators, and statements. Path elements are a more recent innovation in C-syntax languages. We'll change the postfix operators, ++ and --, into path elements. The operators will then have 4 well-defined level groups: right-associative prefix unary, left-associative binary, right-associative ternary, and right-associative assignment.
About a year ago, Larry Wall gave his 12th Annual State of the Onion talk, covering progress on the Perl 6 language. His parser didn't use numbers to define precedence levels but instead used "surreal" precedence, i.e. it defined each precedence with respect to an earlier operator definition, specifying either "same as", "looser than", or "tighter than". We'll use this technique for the operator hooks.
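Defining precedence relative to an existing operator, rather than by number, might look something like the sketch below. It is written in Python for illustration only; PrecedenceTable and its method names are hypothetical and do not describe either the Perl 6 parser or the planned Groovy hooks.

```python
class PrecedenceTable:
    """Operators grouped into levels, ordered from loosest to tightest binding."""

    def __init__(self, first_op):
        self.levels = [{first_op}]

    def _index(self, op):
        for i, level in enumerate(self.levels):
            if op in level:
                return i
        raise KeyError(op)

    def same_as(self, new_op, existing):
        # Join the level of an already defined operator.
        self.levels[self._index(existing)].add(new_op)

    def looser_than(self, new_op, existing):
        # Open a new level immediately below the existing operator's level.
        self.levels.insert(self._index(existing), {new_op})

    def tighter_than(self, new_op, existing):
        # Open a new level immediately above the existing operator's level.
        self.levels.insert(self._index(existing) + 1, {new_op})

    def binds_tighter(self, a, b):
        return self._index(a) > self._index(b)


table = PrecedenceTable("+")
table.same_as("-", "+")
table.tighter_than("*", "+")
table.looser_than("==", "+")
table.tighter_than("<<", "+")   # lands between "+" and "*"
print(table.levels)
```

Since a new level can always be opened between two neighbouring ones, there is never a need to renumber existing operators, which is the attraction of the relative scheme.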
Implied statement/etc termination
Currently, I do a 2-phase parse on statements to implement the implied semicolons. In the first phase, I grab tokens up to the following newline or semicolon. The second phase attempts to parse them, accepting every line that parses into some valid node. For every line that doesn't parse successfully, we continue parsing by appending the next group of valid tokens. This means we can end a line with tokens like + or ., but not begin one with these. Later, I'll add a third phase, so that every line that fails to parse before reaching its end will be joined to the end of the previously parsed tokens, and parsing resumed from those. I'll then abstract this technique to all lists in Groovy's syntax. I've already blogged about this.
Such multi-phase parses let us do context-sensitive parses. I use a similar technique with the lexer/parser to parse valid GStrings. I use a ManyLazily combinator parser, which itself is composed of Bind and Retn parsers. (Of course, such stuff can become inefficient, and even intractable, very quickly, but we won't worry about that until later. Premature optimization is evil, right? We'll then add memoization, perhaps even experiment with multicore stuff. Parsing is one algorithm that could really benefit from the looming multicore revolution.)
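The two-phase scheme for implied semicolons can be illustrated with a small sketch. It is in Python and uses Python's own ast.parse purely as a stand-in for the real statement parser, so the grammar is Python's, not Groovy's; split_statements and parses are names invented for this example.

```python
import ast


def parses(text):
    """Stand-in for the real statement parser: does Python's grammar accept the text?"""
    try:
        ast.parse(text)
        return True
    except SyntaxError:
        return False


def split_statements(source, parses=parses):
    """Phase 1: cut the source at newlines and semicolons (string literals are ignored).
    Phase 2: accept a chunk if it parses; otherwise append the next chunk and retry."""
    chunks = [c.strip() for line in source.splitlines() for c in line.split(";")]
    chunks = [c for c in chunks if c]
    statements, pending = [], ""
    for chunk in chunks:
        pending = f"{pending} {chunk}".strip()
        if parses(pending):
            statements.append(pending)
            pending = ""
    if pending:
        raise SyntaxError(f"incomplete statement: {pending!r}")
    return statements


# A line ending in "+" fails on its own, so the next line is appended to it.
print(split_statements("a = 1 +\n2\nb = a; c = b"))
# A line beginning with "+" is not joined backwards: "a = 1" already parsed.
print(split_statements("a = 1\n+ 2"))
```

The second call shows the limitation described above: the first line is already a valid statement, so the line starting with + is treated as a statement of its own. The proposed third phase is what would join it back onto the previous tokens.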
Multi-option syntax forms
Python uses indentation to group statements into blocks; C-syntax languages use curlies. Groovy will provide both techniques simultaneously, using the symbol {:. When that's working, we'll abstract the technique to other lists in the syntax using the symbols (: and [:. Groovy developers will have the choice of syntactic forms, and will even be able to mix both styles together in the same portion of code, just like in natural language.
Another code formatting technique related to indenting is spacing on a single line. I often use the number of spaces between tokens to show how items are grouped, to make the code more readable. An example where I use 3 spaces instead of none or one to show groupings on one line:
[Trig.CTD, object.method( call( [7.89, 'abc'], 7 * (2+x), Trig.END ), Parse.jig(99) )]
I use this technique all the time when coding. Just as Groovy will enable both list-bracketing and indentation, so also it will enable both expression-bracketing and token-spacing. So 7 * 2+x will evaluate as 7 * (2+x) due to the spacing. The Fortress language already checks and rejects code where the spacing doesn't match the syntactic bracketing; the Groovy Language will go a step further and actually enable both formats. Every degree of syntactic freedom will be utilized in the Groovy Language syntax, just like in natural language.
Name aliasing
Name aliasing is what got me interested in writing an alternative lexer/parser for the GrAST (Groovy AST) in the first place. I've long wanted to use Chinese characters as aliases for the English names and keywords in programs, to make the code shorter. I don't want to write this:
content.tokenize().groupBy{ it }.
collect{ ['key':it.key, 'value':it.value.size()] }.
findAll{ it.value > 1 }.sort{ it.value }.reverse().
each{ println "${it.key.padLeft( 12 )} : $it.value" }
when I can write this:
物.割().组{它}.集{ ['钥':它.钥, '价':它.价.夵()] }.
都{它.价>1}.分{它.价}.向().每{打"${它.钥.左(12)}: $它.价"}
(You need a CJK font to see the above correctly.)
Groovy will enable the keywords and names in programs to be aliased lexically. Even right-to-left scripts will be supported. All the vocabulary of Unicode will be tightly integrated into the Groovy Language.
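As a rough illustration of lexical aliasing, the sketch below expands single-character aliases back into the English names. It is written in Python, the expand_aliases function is hypothetical, and the alias table is simply read off the two versions of the word-count example above. For brevity it rewrites every identifier-like token, including those inside string literals, which a real lexer-level implementation would need to treat separately.

```python
import re

# Alias table inferred by lining up the English and CJK versions of the example.
ALIASES = {
    "物": "content", "割": "tokenize", "组": "groupBy", "它": "it",
    "集": "collect", "钥": "key", "价": "value", "夵": "size",
    "都": "findAll", "分": "sort", "向": "reverse", "每": "each",
    "打": "println", "左": "padLeft",
}

# An identifier: a word character that is not a digit, followed by word characters.
# Python's \w is Unicode-aware, so CJK ideographs count as word characters.
IDENTIFIER = re.compile(r"[^\W\d]\w*")


def expand_aliases(source, aliases=ALIASES):
    """Replace each identifier that is a known alias; leave everything else untouched."""
    return IDENTIFIER.sub(lambda m: aliases.get(m.group(), m.group()), source)


print(expand_aliases("都{它.价>1}.分{它.价}.向()"))
```

Running the expansion before the ordinary lexer is the simplest design; weaving the table into the lexer itself would allow aliases to be scoped or defined in-place like the comment definitions.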
Custom input method editor
Because I think CJK characters will prove quite popular when Groovy enables them, I'm also working on an Input Method that non-Chinese like me can learn and use easily. Last year, I posted online a graphical decomposition of the 20,000 most common Chinese characters, both simplified and traditional ones. I periodically munge this to get ideas for the best component-to-key assignments, but won't make any decisions until much later. Such key assignments are analogous to natural language phonetics, and can't be changed easily once people start using them, so I'm not releasing anything else too quickly in this area.
An easily-used IME for CJK characters is just the first task; ultimately the IME will enable every Unicode character, even SMP ones, to be entered. Within 10 years, developers could be programming in Egyptian hieroglyphs for fun! Can you see my vision for Groovy, anyone???
One last thing...
Oh yes, I almost forgot to mention, I'm not doing any of this in C#. I'm switching languages, back to the JVM, back to the (J)Groovy AST, to SCALA.
I'd been dabbling a bit with Lisp/Scheme and Haskell lately, and have wanted to learn a functional language thoroughly, but hadn't known which one till now. I chose Scala because:
- I can start doing things in the object-oriented style, and incrementally switch to the functional style
- I can easily port Groovy-DLR into it, which gives me a real-world programming task to use it for
- I can use it to write to the (J)Groovy AST, putting me back onto the true Groovy platform, after an absence
- it's been blessed by at least one (J)Groovy core developer, Andres Almiray, for use in (J)Groovy
The resulting code will be called Groovier. It'll be Apache 2.0 licensed to encourage the (J)Groovy developers to copy it for bundling with Groovy. Groovier will be to Scala and the GrAST what the (J)Groovy Language is to Java and the JVM. Groovy is getting groovier and Groovier.