Saturday, May 09, 2009

Groovy's Groovier Roadmap

I've just put the source code for beta-04 of Groovy-DLR 1.0 online, with some improvements, including evaluating code in separate files, and in-place comment lex-definitions. Time for a brief roadmap...

In-place lexing definitions
Beta-4 features a basic in-place comment lex-definition feature, e.g. addcomment "(?s:#{4}.*)" in the source will cause the parser to add this comment definition to the whitespace lexer for subsequently parsed and/or evaluated files. This particular example of comment lets us add #### in the source file to comment to end-of-file. (We'll change the addcomment keyword into an annotation later.) In-place string lex-defines are also coming. For my earlier thoughts on in-place lexical definitions, see part one and part two.
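To make the idea concrete, here is a minimal sketch in Python (not the C# of Groovy-DLR) of a whitespace lexer whose comment patterns can be extended from inside the source being read. The names WhitespaceLexer, add_comment and significant_text are hypothetical, and the crude whitespace-delimited "tokens" stand in for a real lexer.

```python
import re


class WhitespaceLexer:
    """Skips whitespace and comments; comment patterns can be added at run time."""

    def __init__(self):
        # Start with plain whitespace and //-style line comments.
        self.patterns = [re.compile(r"\s+"), re.compile(r"//[^\n]*")]

    def add_comment(self, regex):
        self.patterns.append(re.compile(regex))

    def skip(self, text, pos):
        """Return the first position at or after pos that is not whitespace or comment."""
        moved = True
        while moved:
            moved = False
            for pattern in self.patterns:
                match = pattern.match(text, pos)
                if match and match.end() > pos:
                    pos = match.end()
                    moved = True
        return pos


# Hypothetical directive syntax, modelled on the addcomment example in the post.
ADD_COMMENT = re.compile(r'addcomment\s+"([^"]*)"')


def significant_text(source, lexer):
    """Collect the non-comment chunks of source, honouring addcomment directives."""
    chunks, pos = [], 0
    while True:
        pos = lexer.skip(source, pos)
        if pos >= len(source):
            return chunks
        directive = ADD_COMMENT.match(source, pos)
        if directive:
            # The new definition applies to everything lexed from here on.
            lexer.add_comment(directive.group(1))
            pos = directive.end()
            continue
        # Take everything up to the next whitespace as one crude "token".
        end = pos
        while end < len(source) and not source[end].isspace():
            end += 1
        chunks.append(source[pos:end])
        pos = end
```

Because the lexer object keeps the added pattern, passing the same lexer to a second call treats #### as comment-to-end-of-file there too, which mirrors the "subsequently parsed and/or evaluated files" behaviour.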

To make such in-place lex-definitions finer grained, I need to weave just-in-time lexing (with pushbacks) throughout the parser. I've just started on this. We could also define other lex-definitions in the source, e.g. numeric and date formats. See an earlier post of mine for more ideas on this.

In-place parsing hooks
We'll also provide hooks in the parser to enable Groovy programmers to define their own primary expressions, path elements, operators, and statements. Path elements are a more recent innovation in C-syntax languages. We'll change the postfix operators, ++ and --, into path elements. The operators will then have 4 well-defined level groups: right-associative prefix unary, left-associative binary, right-associative ternary, and right-associative assignment.

About a year ago, Larry Wall gave his 12th Annual State of the Onion talk, covering progress on the Perl 6 language. His parser didn't use numbers to define precedence levels, but instead used "surreal" precedence, i.e. it defined a precedence with respect to an earlier operator definition, specifying either "same as", "looser than", or "tighter than". We'll use this technique for the operator hooks.
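A rough sketch of how such relative precedence might be stored, written in Python for illustration only. PrecedenceTable and its methods are hypothetical names, not anything from Perl 6 or Groovy-DLR; the point is just that no level numbers are ever written down.

```python
class PrecedenceTable:
    """Operator precedence levels defined relative to existing operators."""

    def __init__(self):
        # Loosest level first; each level is a set of operator symbols.
        self.levels = []

    def _index(self, op):
        for i, level in enumerate(self.levels):
            if op in level:
                return i
        raise KeyError(op)

    def add_first(self, op):
        """Seed the table with its first operator."""
        self.levels.append({op})

    def same_as(self, op, existing):
        self.levels[self._index(existing)].add(op)

    def looser_than(self, op, existing):
        # A new level slots in immediately below the existing operator's level.
        self.levels.insert(self._index(existing), {op})

    def tighter_than(self, op, existing):
        # A new level slots in immediately above the existing operator's level.
        self.levels.insert(self._index(existing) + 1, {op})

    def binds_tighter(self, a, b):
        return self._index(a) > self._index(b)


table = PrecedenceTable()
table.add_first("+")
table.same_as("-", "+")
table.tighter_than("*", "+")
table.looser_than("==", "+")
```

A new level can always be squeezed between two existing ones, which is what makes this approach friendlier to user-defined operators than fixed numeric levels.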

Implied statement/etc termination
Currently, I do a 2-phase parse on statements to implement the implied semicolons. For the first phase, I grab tokens up to the following newline or semicolon. The second phase attempts to parse them, accepting every line that parses into some valid node. For every line that doesn't parse successfully, we continue parsing by appending the next group of valid tokens. This means we can end a line with tokens like + or ., but not begin one with these. Later, I'll add a third phase, so that every line that fails to parse before reaching its end will be joined to the end of the previously parsed tokens and parsing resumed from those. I'll then abstract this technique to all lists in Groovy's syntax. I've already blogged about this.
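The two phases can be sketched roughly as follows. This is Python, and it borrows Python's own expression parser (ast.parse) as a stand-in for the real statement parser, so it is an illustration of the control flow only; parses and split_statements are hypothetical names.

```python
import ast
import re


def parses(text):
    """Stand-in statement parser: accept text that is a complete Python expression."""
    try:
        ast.parse(text, mode="eval")
        return True
    except SyntaxError:
        return False


def split_statements(source):
    # Phase 1: cut the source into groups at newlines and semicolons.
    groups = [g.strip() for g in re.split(r"[;\n]", source) if g.strip()]
    # Phase 2: accept a group if it parses, otherwise extend it with the next group.
    statements, pending = [], ""
    for group in groups:
        pending = f"{pending} {group}".strip()
        if parses(pending):
            statements.append(pending)
            pending = ""
    if pending:
        raise SyntaxError(f"incomplete statement: {pending!r}")
    return statements
```

With this scheme "1 +" followed by "2" is joined into one statement, while "1" followed by "+ 2" becomes two statements, because "1" already parses on its own. That is the end-a-line-but-not-begin-a-line limitation described above, which the planned third phase is meant to remove.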

Such multi-phase parses let us do context-sensitive parses. I use a similar technique with the lexer/parser to parse valid GStrings. I use a ManyLazily combinator parser, which itself is composed of Bind and Retn parsers. (Of course, such stuff can become inefficient, and even intractable, very quickly, but we won't worry about that until later. Premature optimization is evil, right? We'll then add memoization, perhaps even experiment with multicore stuff. Parsing is one algorithm that could really benefit from the looming multicore revolution.)

Multi-option syntax forms
Python uses indentation to group statements into blocks; C-syntax languages use curlies. Groovy will provide both techniques simultaneously, using the symbol {: to open an indented block. When this is working, we'll abstract the technique to other lists in the syntax using the symbols (: and [:. Groovy developers will have the choice of syntactic forms, even being able to mix both styles together in the same portion of code, just like in natural language.
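One way to picture the {: form is as a preprocessing pass that inserts the missing close-curlies. The sketch below is Python and purely illustrative; close_indented_blocks is a hypothetical helper, it assumes {: always ends its line, and a real implementation would live inside the parser rather than rewrite text.

```python
def close_indented_blocks(source):
    """Rewrite blocks opened with '{:' into ordinary '{ ... }' blocks,
    closing each one when the indentation returns to the opener's level."""
    out, open_indents = [], []
    for line in source.splitlines():
        if not line.strip():
            out.append(line)
            continue
        indent = len(line) - len(line.lstrip())
        # Any open block whose opener is indented at least this far has ended.
        while open_indents and indent <= open_indents[-1]:
            out.append(" " * open_indents.pop() + "}")
        if line.rstrip().endswith("{:"):
            open_indents.append(indent)
            line = line.rstrip()[:-1]  # drop the ':' and keep the '{'
        out.append(line)
    # Close whatever is still open at end of input.
    while open_indents:
        out.append(" " * open_indents.pop() + "}")
    return "\n".join(out)
```

Ordinary curly-delimited blocks pass through untouched, so both styles can be mixed in one file.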

Another code formatting technique related to indenting is spacing on a single line. I often use the number of spaces between tokens to show how items are grouped, to make the code more readable. An example where I use 3 spaces instead of none or one to show groupings on one line:
[Trig.CTD, object.method(   call( [7.89, 'abc'], 7 * (2+x), Trig.END ),   Parse.jig(99)   )]
I use this technique all the time when coding. Just as Groovy will enable both list-bracketing and indentation, so also it will enable both expression-bracketing and token-spacing. So 7 * 2+x will evaluate as 7 * (2+x) due to the spacing. The Fortress language already checks and rejects code where the spacing doesn't match the syntactic bracketing; the Groovy Language will go a step further and actually enable both formats. Every degree of syntactic freedom will be utilized in the Groovy Language syntax, just like in natural language.
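As a toy illustration of spacing-as-bracketing, the Python sketch below wraps every whitespace-free run that contains an infix operator in parentheses. It only handles one level of grouping in flat arithmetic expressions (not the three-space example above), and bracket_by_spacing is a hypothetical name.

```python
import re


def bracket_by_spacing(expr):
    """Wrap each whitespace-free run that contains an infix operator in parentheses."""
    out = []
    for run in expr.split():
        # A run such as "2+x" has an operator strictly inside it;
        # a lone "*" or a signed number like "-3" does not.
        if re.search(r"[\w)\]][-+*/][\w(\[-]", run):
            run = f"({run})"
        out.append(run)
    return " ".join(out)


x = 3
spaced = bracket_by_spacing("7 * 2+x")   # "7 * (2+x)"
```

Evaluating the rewritten text gives 7 * (2+3), whereas ordinary precedence on the same characters would give (7*2)+3.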

Name aliasing
Name aliasing is what got me interested in writing an alternative lexer/parser for the GrAST (Groovy AST) in the first place. I've long wanted to use Chinese characters as aliases for the English names and keywords in programs, to make the code shorter. I don't want to write this:
content.tokenize().groupBy{ it }.
collect{ ['key':it.key, 'value':it.value.size()] }.
findAll{ it.value > 1 }.sort{ it.value }.reverse().
each{ println "${it.key.padLeft( 12 )} : $it.value" }

when I can write this:
物.割().组{它}.集{ ['钥':它.钥, '价':它.价.夵()] }.
都{它.价>1}.分{它.价}.向().每{打"${它.钥.左(12)}: $它.价"}

(You need a CJK font to see the above correctly.)

Groovy will enable the keywords and names in programs to be aliased lexically. Even right-directional languages will be enabled. All the vocabulary of Unicode will be tightly integrated into the Groovy Language.
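A lexical alias table can be as simple as a lookup applied to every name token. The Python sketch below uses the single-character aliases from the example above, with the mapping read off by comparing the two versions of the code; expand_aliases is a hypothetical name, and the sketch makes no attempt to treat string literals differently from code.

```python
import re

# Alias -> English name, taken from the two versions of the snippet above.
ALIASES = {
    "物": "content", "割": "tokenize", "组": "groupBy", "它": "it",
    "集": "collect", "钥": "key", "价": "value", "夵": "size",
    "都": "findAll", "分": "sort", "向": "reverse", "每": "each",
    "打": "println", "左": "padLeft",
}

WORD = re.compile(r"\w+")


def expand_aliases(source, aliases=ALIASES):
    """Replace every aliased name token with its English original."""
    return WORD.sub(lambda m: aliases.get(m.group(), m.group()), source)
```

Because the substitution happens before parsing, the parser and everything after it only ever see the English names.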

Custom input method editor
Because I think CJK characters will prove quite popular when Groovy enables them, I'm also working on an Input Method that non-Chinese like me can learn and use easily. Last year, I posted online a graphical decomposition of the 20,000 most common Chinese characters, both simplified and complex ones. I periodically munge this to get ideas for the best component-to-key assignments, but won't make any decisions until much later. Such key assignments are analogous to natural language phonetics, and can't be changed easily once people start using them, so I'm not releasing anything else too quickly in this area.

An easily-used IME for CJK characters is just the first task; ultimately the IME will enable every Unicode character, even SMP ones, to be entered. Within 10 years, developers could be programming in Egyptian hieroglyphs for fun! Can you see my vision for Groovy, anyone???

One last thing...
Oh yes, I almost forgot to mention, I'm not doing any of this in C#. I'm switching languages, back to the JVM, back to the (J)Groovy AST, to SCALA.

I'd been dabbling a bit with Lisp/Scheme and Haskell lately, and have wanted to learn a functional language thoroughly, but hadn't known which one till now. I chose Scala because:
  • I can start doing things in the object-oriented style, and incrementally switch to the functional style

  • I can easily port Groovy-DLR into it, which gives me a real-world programming task to use it for

  • I can use it to write to the (J)Groovy AST, putting me back onto the true Groovy platform, after an absence

  • it's been blessed by at least one (J)Groovy core developer, Andres Almiray, for use in (J)Groovy


The resulting code will be called Groovier. It'll be Apache 2.0 licensed to encourage the (J)Groovy developers to copy it for bundling with Groovy. Groovier will be to Scala and the GrAST what the (J)Groovy Language is to Java and the JVM. Groovy is getting groovier and Groovier.

Wednesday, May 06, 2009

Syntactic Macros in Groovy

In CLisp, macros are defined by quoting symbols and lists with backquote `, and escaping from them with comma , or comma-at ,@, perhaps nestedly so. One day I realized this syntax is similar to GStrings in Groovy, which are quoted with double-quote " and escaped with dollar $, perhaps nestedly so. Lisp symbols are similar to Java interned strings. I wondered if GString syntax could be used to specify syntactic macros in Groovy.

In Paul Graham's online book, On Lisp, he describes how to write a macro:
(defmacro nil! (var) `(setq ,var nil))
Using GString notation, we could define an equivalent macro in Groovy by writing:
defMacro("setq($var, null)"){"nilBang($var)"}

For Groovy to match Lisp in functionality, we would need two more functions:
'[sum, a, b]'.asFunc() == 'sum(a, b)'
'[+, a, b]'.asFunc() == '(a + b)'
and:
"[a, $b, c]".expand() == [a, 2, c]
as well as the already-provided:
"[$a, $b]".evaluate() == 3

Paul Graham writes that when using the backquote in Lisp:
`(a b c) is the same as '(a b c)
and `(a b c) is the same as (list 'a 'b 'c)
With Groovy syntactic macros:
"[a, b, c]" will be the same as '[a, b, c]'
and "[a, b, c]" the same as ["a", "b", "c"].

Using the comma with the backquote in Lisp:
`(a ,b c ,d) is the same as (list 'a b 'c d)
With Groovy macros:
"[a, $b, c, $d]" will be the same as ["a", b, "c", d]

In Lisp, if we assign some values using (setq a 1 b 2 c 3), then we'll get:
> `(a ,b c)
(A 2 C)

In Groovy, if we assign using def a=1, b=2, c=3, then we would get:
assert "[a, $b, c]".expand() == [a, 2, c]

For an example with nesting using these values:
> `(a (,b c))
(A (2 C))

In Groovy it would be:
assert "[a, [$b, c]]".expand() == [a, [2, c]]

A more complex example from On Lisp:
> `(a b ,c (',(+ a b c)) (+ a b) 'c '((,a ,b)))
(A B 3 ('6) (+ A B) 'C '((1 2)))

In Groovy it would be:
assert "[a, b, $c, ['${"[+, a, b, c]".asFunc()}'], " +
       "[+, a, b], 'c', '[[$a, $b]]']".expand() ==
  [a, b, 3, ['6'], [+, a, b], 'c', '[[1, 2]]']


Another example, in Lisp:
`(,a ,(b `,c))
and in Groovy:
"[$a, ${[b, "$c"]}]"

We'd need to introduce some additional syntax to match the comma-at notation from Lisp:
> (setq b '(1 2 3))
(1 2 3)
> `(a ,b c)
(A (1 2 3) C)
> `(a ,@b c)
(A 1 2 3 C)


The Groovy equivalent, with additional syntax $* to show interpolate-with-spreading, would be:
def b = [1, 2, 3]
"[a, $b, c]".expand() == [a, [1, 2, 3], c]
"[a, $*b, c]".expand() == [a, 1, 2, 3, c]


I'm nowhere near providing this sort of functionality in Groovy/DLR, but such self-referential syntactic macros would complement the recently added AST macro system in Groovy 1.6.
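To show what expand() with $ and $* would have to do, here is a small Python sketch. It is not Groovy and not an existing API: expand takes the template text plus an explicit dictionary of variable values, and plain strings stand in for Lisp symbols.

```python
def expand(template, env):
    """Expand a quasi-quoted list template: bare names stay symbols (strings),
    $name is replaced by env[name], and $*name is spliced in element by element."""
    pos = 0

    def parse_list():
        nonlocal pos
        assert template[pos] == "["
        pos += 1
        items = []
        while True:
            # Skip separators between elements.
            while template[pos] in " ,":
                pos += 1
            if template[pos] == "]":
                pos += 1
                return items
            if template[pos] == "[":
                items.append(parse_list())
                continue
            start = pos
            while template[pos] not in ",]":
                pos += 1
            word = template[start:pos].strip()
            if word.startswith("$*"):
                items.extend(env[word[2:]])   # like ,@ in Lisp
            elif word.startswith("$"):
                items.append(env[word[1:]])   # like , in Lisp
            else:
                items.append(word)            # quoted symbol

    return parse_list()
```

With b bound to [1, 2, 3], the template "[a, $b, c]" yields a nested list and "[a, $*b, c]" yields a flat one, matching the Lisp ,b versus ,@b results shown earlier.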

Wednesday, April 29, 2009

The Groovy DevCons

Looks like another Groovy DevCon is being held soon, perhaps to coincide with the Gr8 conference in Copenhagen, mid-May 2009, but very little info is being made public. Let's take a brief look at the history of the Groovy DevCons...

  • London, Nov 2004: The only info I could find about Groovy DevCon 1, then called GroovyONE, is a powerpoint presentation given by James Strachan. I remember seeing some online pics of the attendees at the time.


  • Paris, Nov 2005: Jeremy Rayner reports, with pics, here on the DevCon 2. Shortly afterwards, Groovy co-founder James Strachan moved on to other things.


  • Paris, Jan 2007: Many decisions were made, some followed, some delayed, others changed, sometimes at the last moment. Shortly after DevCon 3, John Wilson left the Groovy Developer team.


  • London, Oct 2007: Held concurrent with the Grails eXchange conference, where Guillaume Laforge gave this keynote, shortly after the core developers' next move to monetize their work on the language: the formation of G2One, Inc. It was just after that DevCon I heard about the discussions over changing Groovy's name. The Groovy Developers quickly plugged up that hole in their internal communications, and it seems the decision was made not to rename it.


  • Late last year, the Groovy/Grails controlling company, G2One Inc, was acquired by SpringSource, controller of the Spring Framework and Tomcat, in a “stock-and-cash” deal. The SpringSource shareholders and those of G2One decided they can get acquired in a “cash-and-stock” deal more easily if they do it together. Central to that is tightly managing all information flowing out of the Groovy development effort. So only insiders know if this coming DevCon is DevCon 5. The developers recently added Griffon to their list of officially quoted Groovy-based technologies, calling them Groovy, Grails, and Griffon, all in one breath. Now are they resisting the branding impact of my recent Groovy/DLR release by seeding the web with the “Boo is to C# what Groovy is to Java” message? Are we going to see a Groovy-Boo marketing linkup in time for the Gr8/DevCon meeting? Where they all sing “Groovy Groovy Boo, where are you?”


    Building Groovy/DLR
    But Groovy for the DLR is NOT Groovy for .NET. Its purpose is different, simply a lexer/parser for a Groovy-like syntax, only better, and its true target is the (J)Groovy AST. I'm far more interested in programming to (J)Groovy's new ASTBuilder, as recently spec'd by Hamlet D'Arcy, than to the DLR, because (J)Groovy is the real Groovy. Groovy presently has an Antlr 2.x lexer/parser, which some say should be updated to Antlr 3.0. But others, e.g. some Grails developers, want a much faster hand-written parser instead. I'd like a slower but more extensible combinator parser. So to cater for everyone, Groovy needs clearly-specified AST node interfaces and visitation states. Perhaps the coming ASTBuilder will be like this. The Groovy AST could become the JVM's answer to .NET's DLR.

    My original intention was to write the lexer/parser in C# because it had delegates (i.e. closures), then translate to Java 7.0 later. But it appears Java 7.0 is unlikely to have closures, so I could be left in limbo on the DLR. I recently got in-place comment lex-definition running, e.g. @AddComment("(?s:#{4}.*)") in the source code will cause the parser to add this comment definition to the whitespace lexer. This particular example of comment syntax is very handy, i.e. being able to add #### in the source file to comment to end-of-file, without needing to page down to add close-comment syntax at the end. I'm now starting on in-place string syntax definitions, so those who want things like multiline regexes can just add it themselves instead of waiting years for the (J)Groovy developers to do so.

    Underwriting (J)Groovy
    Besides building Groovy/DLR, another involvement I have with the Groovy language is underwriting it: Should the Groovy Language ever have its development abandoned or its name changed, I, as Groovy's underwriter, will pick up from where they left off. When I worked in IT, I found being the “Backup Programmer” for a project to be quite difficult because I never knew how much time and effort to put into following something I would probably never work on. Many a business manager, having little understanding about insuring against risks, has slashed programmer jobs on a computer system to cut costs, making others bear the true costs later on. And so it is with underwriting the Groovy Language. Unlike that of the Groovy Developers, who manage Groovy's upside, my performance in covering Groovy's downside can't be measured unless I actually need to do something.

    I've tried to participate in Groovy's success in other ways, such as writing Groovy documentation for Java newbies. I was being encouraged publicly, but got a different message from somewhere through backchannels, telling me to “stop messing with the brand”. I didn't know for sure where the antagonism was coming from, nor can I prove it, but had suspicions when I read this excellent analysis on brand power in open source software. It seems a name is not just a game. Because of this, I'm quite hesitant to join the Codehaus Groovy effort, preferring to work on the sidelines both underwriting Groovy, and increasing Groovy programmers' lexical and syntactic choices. The Groovy Language will give more choices to developers.

    Thursday, April 23, 2009

    Interactional function of English and Groovy

    Michael A.K. Halliday writes in his 1970 paper Language Structure and Language Function that we should analyze language in terms of its use, considering both its structure and function in so doing. He's found the vast numbers of options embodied in it combine into three relatively independent components, and they each correspond to a certain basic function of language: representational (a.k.a. ideational), interactional (a.k.a. interpersonal), and textual. Within each component, the networks of options are closely interconnected, while between components, the connections are few.

    For natural language, the representational component represents our experience of the outside world, and of our consciousness within us. The representational similarities between natural and computer languages are most easily noticed:
    Mary.have(@Little lamb)
    lamb.fleece.Color = Color.SNOW_WHITE
    synchronized{ place-> Mary.go(place); lamb.go(place) }

    Computer languages' increasing use of abstraction over the years was no doubt based on the representational component of natural languages, giving rise to the functional and object-oriented paradigms. The ideas represented in computer language must be more precise than those in natural language.


    Interactional component of English
    The interactional component of language involves its producer and receiver/s, and the relationship between them. For natural language, there's one or more human receivers, and for computer language, one or more electronic producers and/or receivers as well as the human one/s.

    In English, the interactional component accounts for:

    • many adverbs of opinion, e.g. “That's an incredibly interesting piece of code!”

    • interjections within a clause, e.g. I'm hoping to, er, well, go back sometime, or even in the middle of words, e.g. abso-bloomin'-lutely

    • expressions of politeness we prepend to English sentences, e.g. “Are you able to...” in front of “Tell me the time”

    • the hundreds of different attitudinal intonations we overlay onto our speech, e.g. “Dunno!” (can you native English speakers hear that intonation?)

    • the mood, whether indicative e.g. “He's gone.”, interrogative e.g. “Is she there?”, imperative e.g. “Go now!”, or exclamative e.g. “How clever!”

    • the modal structure of English grammar, i.e. verbal phrases have certainty e.g. “I might see him”, ability e.g. “I can see her”, allowability e.g. “Can he do that?”, polarity e.g. “They didn't know”, and/or tense e.g. “We did make it”


    Natural language offers many choices regarding how closely to intertwine the interactional component with the representational.
    An example... for closely intertwined reported speech: She said that she had already visited her brother, that the day before she'd been with her teacher, and that at that moment she was shopping with her friend.
    and using quoted speech to reduce the tangling between interactional and representational components: She said "I've already visited my brother, yesterday I was with my teacher, and right now I'm shopping with my friend."
    Another example of keeping these two components disjoint: I'm going to tell the following story exactly as she told it, the way she said it, not how I'd say it...

    The original human languages long ago, just like chimpanzee language today, were perhaps mainly interactional, with the representational component slowly added on afterwards.


    Interactional component of computer languages
    For computer languages, the interactional component determines how people interact with the program, and how other programs interact with it. As with natural languages, the interactional component came first, and representational abstractions were added on later. Many have tried to create a representational-only computer language; perhaps the most successful is Haskell. But the Haskell language creators went to great trouble to tack on the minimally required interactional component, that of Input/Output. They introduced monads to add the I/O capability onto the “purer” underlying functional-paradigm function. Perhaps some functional-paradigm language creators don't appreciate the centrality of the interactional component in language.

    Siobhan Clarke et al. write about the tyranny of the dominant decomposition:
    Current object-oriented design methods suffer from the “tyranny of the dominant decomposition” problem, where the dominant decomposition dimension is by object. As a result, designs are caught in the middle of a significant structural misalignment between requirements and code. The units of abstraction and decomposition of object-oriented designs align well with object-oriented code, as both are written in the object-oriented paradigm, and focus on interfaces, classes and methods. However, requirements specifications tend to relate to major concepts in the end user domain, or capabilities like synchronisation, persistence, and failure handling, etc., all of which are unsuited to the object-oriented paradigm.

    The object-paradigm is a representational one. The other user-domain capabilities are interactional ones, either human-to-computer or computer-to-computer. Some examples:
    • I/O actions, i.e. between computer and human/s

    • logging, i.e. between processor and recording medium

    • persistence, database access, i.e. between computer and storage unit/s

    • security, i.e. between computer and certain humans only

    • execution performance, i.e. how to maximize use of computing resources

    • entry point to program, i.e. between processor and external scheduler

    • concurrency, synchronization, i.e. between two processors in one computer

    • distribution, i.e. between two geographically separated computers

    • exceptions, failure handling, i.e. between results of different human-expected certainties

    • testing, i.e. interaction between two different external humans


    These capabilities are often interwoven into the programming code, just as mood and modality are overlaid onto all the finite verbal phrases in English. And just as in English, where various interactional functions can be disentangled from the representation functions, e.g. quoted speech above, so also in computer languages, such user-domain capabilities can be extracted as system-wide aspects in aspect-oriented programming.

    AspectJ is a well-known attempt to let each aspect use the same syntax, that of the base language. But the idea of limited AOP is much older; often a different syntax is used for each user-domain capability.


    Finally...
    I've already blogged about some aspects of the textual component of English and Groovy. Whereas the other two components of language exist for reasons independent of the medium itself, the textual component comes into being because the other two components exist, and refers both to those other two components and to itself self-referentially. The textual component ensures every degree of freedom available in the medium itself is utilized.

    In computer languages, the textual component is often called “syntactic sugar”. Often computer language designers scorn the use of lots of syntactic sugar, but natural language designers, i.e. the speakers of natural languages, use all the syntactic sugar available in the communication medium. Programming language designers should do the same. In the DLR-targeted Groovy I'm working on, I'm focusing on this aspect of the Groovy Language.

    Friday, April 10, 2009

    Groovy for the DLR

    The source code for beta-3 of Groovy/DLR 1.0 is out. Groovy for the Microsoft DLR is an Apache-2-licenced combinator parsing library that lexes and parses (J)Groovy-lookalike syntax to build DLR nodes.

    What can execute...
    I've now successfully plugged in all of the node builds from the ToyScript sample language in DLR version 0.9. Here's an example of syntax beta-03 can parse and execute...
    //various atomic values...
    print 444;
    print - 34;
    print('abc') // <--- implied semis to end statements
    print 7.515
    print(true);
    print(3+7*
      5); //multi-line statement
    print(3+7*
      5) //multi-line stmt with implied semis
    a = 2*3000;
    print a;
    print 12 + a * 2;
    pause

    //we can access .NET classes...
    import System
    zz= new System.Collections.ArrayList();
    zz.Add(7)
    zz.Add('abc')
    zz.Add(6.66)
    print zz.Count
    print zz[0]
    print zz[1]
    print zz[2]
    print zz[1] == 'abc'
    print zz[1]
    pause

    try{ //while and if statements...
      i = 0;
      while(i < 5){
        i=i+ 1;
        print i
      }
      if(i<1){print '...first...';}else{print '...second...';};
      pause;
    };

    def add(a,b) { //function defn and invocation...
      j = a + b;
      return j
    }
    print add(1, 2)
    print add('Hello, ', 'Groovy/DLR.')
    pause

    //'parse' is a temporary command to parse, but not run, another file...
    parse 'test/test01.gvy';
    parse 'test/test02.gvy';
    parse 'test/test03.gvy'
    parse 'test/test04.gvy'
    pause


    Groovy/DLR can parse more syntax than it can execute, including GStrings, closures, and classes, which are tested in extra parse-only files.

    Goals (reposted)
    My key goal when building the Groovy/DLR Language syntax has been to make it totally configurable by the programmer. That's why I've used combinator parsers, instead of bottom-up parsing. Programmers can easily read and change what combinator parsers do. They can swap a 3-character operator symbol e.g. <=> with a single-character Unicode one. They can put in a lookup table to map method names in French or Chinese to ones in English. They can rewrite the entire grammar, based on an in-house corporate policy, but keep the same AST tree semantics. They can embed other little languages in the grammar, without quoting. They can change the tokenizer rules to embed multiline strings with any escapes they want.

    Some salient features of these combinator parsers:
    • Lexing and parsing GStrings, which can be embedded to any number of levels, was an early challenge. When lexing, we return a list of tokens, some of which could be fully parsed GStrings. None of the parsing logic is in the lexer, instead, the lexer calls the parser repeatedly to find the end of GString enclosures.

    • After building the lexer and then the parser, we query the parser to get a list of operator tokens, then pass this to the lexer before running it. That way, we specify tokens only once, in the parser, not in the lexer. The DRY principle at work.

    • The lexing and parsing libraries are separate, fine-tuned for their specific tasks. Not worrying about how a feature required for the parsing can work for the lexing, and vice versa, caused the lexer and parser to evolve into two hugely different beasts.

    • When parsing statements and expressions, we don't litter the code with newline checks, like (J)Groovy does, instead this is separated out into a separate context-sensitive pass using monadic parsers (i.e. bind and return). That way, programmers can add their own statement or expression parsers with the minimum of code, without worrying about the separate concern of how the statement fits into the overall syntactic structure.

    • The original source code can be completely rebuilt from the AST nodes. All whitespace is stored on the tokens.

    • Unlike my previous code and Groovy's Antlr-based parser and AST generation code, this C#-based code uses objects and visitation a lot. Example: the IfStatementToken contains a method returning the combinator parser, a method for generating that portion of the original source, a method for generating source code in a (J)Groovy-compatible normal form, a method for generating the DLR nodes, etc.
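For readers unfamiliar with monadic parsers, the following Python sketch shows the bind/return pair mentioned in the list above and how it allows a context-sensitive parse. It is not the Groovy-DLR library: retn, bind, char, either, many and count are hypothetical names, a parser here is just a function from (text, pos) to either (value, new_pos) or None, and there is no laziness or memoization.

```python
def retn(value):
    """Parser that consumes nothing and succeeds with value."""
    return lambda text, pos: (value, pos)


def bind(parser, make_next):
    """Run parser, then feed its result to make_next to choose the next parser."""
    def run(text, pos):
        result = parser(text, pos)
        if result is None:
            return None
        value, next_pos = result
        return make_next(value)(text, next_pos)
    return run


def char(predicate):
    """Parser for a single character satisfying predicate."""
    def run(text, pos):
        if pos < len(text) and predicate(text[pos]):
            return text[pos], pos + 1
        return None
    return run


def either(first, second):
    """Try first; if it fails, try second at the same position."""
    def run(text, pos):
        result = first(text, pos)
        return result if result is not None else second(text, pos)
    return run


def many(parser):
    """Zero or more repetitions, built only from bind, retn and either."""
    def run(text, pos):
        one_or_more = bind(parser, lambda head:
                           bind(many(parser), lambda tail: retn([head] + tail)))
        return either(one_or_more, retn([]))(text, pos)
    return run


def count(n, parser):
    """Exactly n repetitions of parser."""
    if n == 0:
        return retn([])
    return bind(parser, lambda head:
                bind(count(n - 1, parser), lambda tail: retn([head] + tail)))


digit = bind(char(str.isdigit), lambda d: retn(int(d)))
letter = char(str.isalpha)
# Context-sensitive: the digit just parsed decides how many letters must follow.
counted = bind(digit, lambda n:
               bind(count(n, letter), lambda cs: retn("".join(cs))))
```

The important property is in bind: the second parser is computed from the first parser's result, which is what lets an earlier part of the input change how a later part is parsed.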


    Future plans...
    I'm now working on getting Python-style indenting blocks working, using syntax {:, so eventually the developer can mix close-curly-delimited and indented blocks in the same code. When this is working, I'll separate out the code into a separate generic combinator parser, then reapply it using token (: to argument/parameter lists, and token [: to list/map constructors, so developers can mix close-delimited and indented structures of any kind in their code.

    I'm hoping to eventually enable a syntactic macro style whereby we can define syntax, then use it in the same source code, e.g.
    public class WhileStmtToken: StatementToken {:
      NameToken WhileTag
      ExpressionToken Expr
      BlockToken Block

      static Parser BuildParser( //how to parse the syntax...
        Parser parsery, Parser exprParser, Parser blockParser) {:
          parsery.Name("while")
          .Seq(exprParser){:
            whileTag, expr->
            new WhileStmtToken{:
              Expr = expr as ExpressionToken
              WhileTag = whileTag as NameToken
          .Seq(blockParser){:
            token, block->
            (token as WhileStmtToken).Block = block as BlockToken
            token

      Expression Generate(Generator gen) {: //how to generate the DLR node...
        Utils.While(:
          gen.ConvertTo(typeof(bool), Expr.Generate(gen)),
          Block.Generate(gen),
          null, null, null, ss.End, ss

      string ToString(int indent) {:
        //...etc...
      string NormalForm(int indent) {:
        //...etc...
      string SourceCode() {:
        //...etc...

    useSyntax(WhileStmtToken){: //use class above when parsing code below
      while(condition){:
        codeToRun()


    I'm hesitant to put in any more DLR nodes until DLR version 1.0 is out, because things might change. Perhaps the ToyScript sample language shipped with DLR 1.0 will have more structures implemented so I can copy them. I'm focusing on the combinator parsers because those are what I want copied by someone and put into (J)Groovy. One big hope I have for the Groovy/DLR language is it'll inspire the (J)Groovy developers at Codehaus to also enable a flexible syntactic skin in (J)Groovy for the JVM, the Groovy Language's first target platform. (J)Groovy originally used a handwritten bottom-up parser: no doubt it parsed quicker, but it couldn't easily be expanded with new syntax. After 10 betas, (J)Groovy switched to an Antlr 2.x-based parser, initially allowing more syntax to be added, but now has become so complex the (J)Groovy Developers don't want to add much new syntax, nor even allow developers to customize the parsing themselves. A true DSL needs a lot more flexibility than parenthesis-less method calls and annotation-driven AST transformations. The (J)Groovy AST could become the JVM's answer to the Microsoft DLR.

    Monday, March 16, 2009

    Groovy Combinator Parsing

    About 18 months ago, I began writing an alternative lexer and parser for the Groovy AST. My original plan was to write different layers of code:
    • an ASTBuilder using Groovy's builder facility
    • a lexer and parser using Ben Yu's JParsec 1.2 combinator parsing library, calling the ASTBuilder in the closures
    • an AST decompiler to Groovy source

    The ASTBuilder was difficult to build so it would be callable from closures embedded within JParsec. At the time I concluded the builder pattern that Groovy builders embody was unsuitable for building AST's, and that the interpreter pattern would be better. (Perhaps that's why programming languages became the preferred method to interface to AST's way back :-) So I switched to building the AST directly in the JParsec closures. I called all this Vy Language 1.0.

    However, I eventually abandoned development. The reasons:

    • although JParsec 1.2 was well-behaved, without bugs, I found it brittle when used for a large grammar of my own design
    • my development methodology was similar to what I'd done in previous paid employment, i.e. get it working ASAP, meet the deadline, and don't worry about bugs, quality, and documentation till later. The resulting code worked (tho with bugs), but was messy.
    • I tried to reproduce the entire Groovy 1.0 grammar
    • I got the GString lexing/parsing working (incl nested GStrings) using monadic parsers, but not the implied semicolons
    • the layering design was wrong
    • the closures I gave JParsec expanded out to 2kb each, all 200-odd of them
    • the Groovy AST was a moving target, with no spec, and the Groovy Developers weren't adding the setters I needed to the AST nodes

    Second try...

    I again tried writing a lexer/parser for the Groovy AST, this time using my own combinator parsers, which I based on Ken Barclay's ones written in Groovy. I called it Vy Language 1.1. Ken later announced GParsec, where he would use his ones to reproduce the Groovy grammar, though one year later, no code has appeared. Perhaps he had the same problems I had:

    • the code ran slooooowly, all of it, not just the exponential edge-cases
    • although I implemented a lot less functionality than for my earlier version (v 1.0), the code produced over 300 closures, each at 2kb

    Third try...

    After a break, I was still excited about building a computer language grammar with all the cool ideas I've had while studying natural language theory. So late last year, I tried again, but didn't want to repeat my previous mistakes. So:

    • I needed a fast language with closures. Java was fast but doesn't have closures; Groovy has closures but runs slowly. So I started playing with C#.
    • Instead of whacking out as much code as possible, as soon as possible, I've decided to enjoy the process and refactor often.
    • The Microsoft DLR version 0.9 is available. I've recently fitted some of the nodes from the ToyScript sample language into my grammar.

    I've put the source code online, calling it Groovy for the DLR. It's really just a combinator parsing library, with some Groovy-like grammar, and some DLR nodes created. I didn't start out intending to build Groovy for the DLR, but its natural evolution is exactly that.

    I'm now deciding whether to continue down that lane, or to switch back to the JVM. Because the code is closure-heavy (i.e. C# delegates), it'd be difficult but doable to port it to Java (JParsec manages to implement such stuff in Java without closures). If Java 7 has closures built in, that would help. And of course I would need some target platform.

    Guillaume Laforge mentions AST builders in the most recent Groovy roadmap, but does that mean an openly-specified AST interface available for everyone to use, or an elusively ever-changing implementation that only works with the insiders' bundled builders? There's an opportunity here for the Groovy AST to become the JVM's answer to the Microsoft DLR. Will the Groovy Developers grasp the opportunity?

    Monday, February 23, 2009

    Groovy 1.6 Released

    Groovy 1.6 has recently been released, the first update to the Groovy Language since version 1.5 over a year ago. Version 1.6 brings Python-style tuples to Groovy, allowing multi-value assignments. This was the last unfulfilled item from James Strachan's original 29 August 2003 manifesto for Groovy.

    But many items remain unfulfilled from Groovy's submission as JSR 241 on 16 March 2004 and subsequent approval 2 weeks later. James expanded on his vision at that time. Let's look at some of his points...

    (1) James wrote: “One area we've not yet looked at is merging a Java and Groovy compiler together so Groovy and Java code can be compiled together in the same compilation unit.” The Groovy developers added this capability in their implementation of Groovy 1.5 a year ago.

    (2) He also wrote: “The JSR allows people to [implement Groovy] if they wish - whether its a complete rewrite, replacing a part of it or just some tinkering, and provides a TCK to know if they correctly implement the JSR. Even just embedding the RI inside a container can sometimes affect things so having a JSR & TCK helps even if we're sharing the same codebase but just configuring & deploying it in different ways..” A spec and test kit encourages other developers to come along and contribute to the language correctly, improving it or the parts that need it. This was the vision in place when I first started tinkering with Groovy.

    (3) Also: “Just to be clear, its the expert group's responsibility to make a great spec and a reference implementation and TCK. It's up to others to create different implementations if they want to.” (his emphasis) Not having a JSR provides only one avenue for contributing to the Groovy Language, that of joining the development team, submitting to the commercial interests and internal politics of Codehaus and SpringSource and whoever else might come along, something very few developers want to do. But having a completed JSR provides many avenues to contribute to Groovy. A spec and test kit are like the market rules for a bazaar, but the present Groovy development is still a centrally-planned cathedral. The only really committed developers are those with shares in the business, and even then, when they're not working on Spring or Grails.

    Having tight control over the initial stages of an open-source software package keeps development focused. A year after development began, Groovy creator James Strachan thought the time was right for creating a spec and test kit. After some initial activity, work on the spec halted; the reason given was “possible copyright violations with Sun's Java language spec”. The Groovy developers since then have focused on creating the fastest possible JVM meta-object engine, and on melding their implementation of Groovy closely into Grails, both good activities but not an end in themselves.

    (4) Finally, James also added: “I'm hoping Groovy goes along the same way - that one day maybe Jikes / gcj implement a Groovy compiler or maybe IDEs (say in Eclipse / IDEA / NetBeans / Workshop) do their own Groovy compiler, reusing their internal Java compilers - maybe reusing the Groovy AST compiler, just replacing the bytecode generation with something else, like Java AST generation, Java code generation or whatever.” James Strachan envisioned Groovy as a loose collection of language technologies, just as the Spring technologies are. A Groovy spec, therefore, needs to define not only the language syntax, but also the meta-object interface, the default Groovy method interface, the AST interface, with clearly-defined visitation states, and so forth.

    The Groovy developers, to their credit, have now completed all the items in James' original manifesto, as well as adding a joint Groovy/Java compiler. But the JSR hasn't moved an inch in the past 5 years. Let's hope Groovy 1.7 (or whatever it's called) can address this issue.

    Thursday, February 12, 2009

    The Rise of Unicode

    The next version of Unicode is v.5.2, the latest of a unified character set now with over 100,000 current tokens. One notable addition to v.5.2 will be the Egyptian hieroglyphs, the earliest known system of human writing. Perhaps they will mark Unicode's coming of age, it being another huge step in representing language with graphical symbols. Let's look at a consolidated short history of writing systems, courtesy of various Wikipedia pages, to see Unicode's rise in perspective...

    Hieroglyphs
    Egyptian hieroglyphs were invented around 4000-3000 BC. The earliest type of hieroglyph was the logogram, where a common noun (such as sun or mountain) is represented by a simple picture. These existing hieroglyphs were then used as phonograms, to denote more abstract ideas with the same sound. Later, these were modified by extra trailing hieroglyphs, called semagrams, to clarify their meaning in context. About 5000 Egyptian hieroglyphs existed by Roman times. When papyrus replaced stone tablets, the hieroglyphs were simplified to accommodate the new medium, sometimes losing their resemblance to the original picture.

    The idea of such hieroglyphic writing quickly spread to Sumeria, and eventually to ancient China. The ancient Egyptian and Sumerian hieroglyphs are no longer used, but modern Chinese characters are descended directly from the ancient Chinese ones. Because Chinese characters spread to Japan and ancient Korea, they're now called CJK characters. By looking at such CJK characters, we can get some idea of how Egyptian hieroglyphs worked. Many CJK characters were originally pictures, such as 日 for sun, 月 for moon, 田 for field, 水 for water, 山 for mountain, 女 for woman, and 子 for child. Some pictures have meanings composed of other meanings, such as 女 (woman) and 子 (child) combining into 好, meaning good. About 80% of Chinese characters are phonetic, consisting of two parts, one semantic, the other primarily phonetic, e.g. 土 sounds like tu, and 口 means mouth, so 吐 also sounds like tu, and means to spit (with the mouth). The phonetic part of many phonetic characters often also provides secondary semantics to the character, e.g. the phonetic 土 (in 吐) means ground, where the spit ends up.

    Alphabets
    Eventually in Egypt, a set of 24 hieroglyphs called uniliterals evolved, each denoting one consonant sound in ancient Egyptian speech, though they were probably only used for transliterating foreign names. This idea was copied by the Phoenicians by 1200BC, and their symbols spread around the Middle East into various other languages' writing systems, having a major social effect. It's the base of almost all alphabets used in the world today, except CJK characters. These Phoenician symbols for consonants were copied by the ancient Hebrews and for Arabic, but when the Greeks copied them, they adapted the symbols of unused consonants for vowel sounds, becoming the first writing system to represent both consonants and vowels.

    Over time, cursive versions of letters evolved for the Latin, Greek, and Cyrillic alphabets so people could write them easily on paper. They used either the block or the cursive letters, but not both, in one document. The Carolingian minuscule became the standard cursive script for the Latin alphabet in Europe from 800AD. Soon after, it became common to mix block (uppercase) and cursive (lowercase) letters in the same document. The most common system was to capitalize the first letter of each sentence and of each noun. Chinese characters have only one case, but that may change soon. Simplified characters were invented in 1950s mainland China, replacing the more complex characters still used in Hong Kong, Taiwan, and western countries. Nowadays in mainland China though, both complex and simplified Chinese are sometimes used in the same document, the complex ones for more formal parts of the document. Perhaps one day complex characters will sometimes mix with simplified ones in the same sentence, turning Chinese into another two-case writing system.

    Punctuation
    Punctuation was popularized in Europe around the same time as cursive letters. Punctuation is chiefly used to indicate stress, pause, and tone when reading aloud. Underlining is a common way of indicating stress. In English, the comma, semicolon, colon, and period (,;:.) indicated pauses of varying degrees, though nowadays, only the comma and period are used much in writing. The question mark (?) replaces the period to indicate a question, of either rising or falling tone; the exclamation mark (!) indicates a sharp falling tone.

    The idea of separating words with a special mark also began with the Phoenicians. Irish monks began using spaces in 600-700AD, and this quickly spread throughout Europe. Nowadays, the CJK languages are the only major languages not using some form of word separation. Until recently, the Chinese didn't recognize the concept of word in their language, only of (syllabic) character.

    The bracketing function of spoken English is usually performed by saying something at a higher or lower pitch, between two pauses. At first, only the pauses were shown in writing, perhaps by pairs of commas. Hyphens might replace spaces between words to show which ones are grouped together. Eventually, explicit bracketing symbols were introduced at the beginning and end of the bracketed text. Sometimes the same symbol was used to show both the beginning and the end, such as pairs of dashes to indicate appositives, and pairs of quotes, either single or double, to indicate speech. Sometimes different paired symbols were used, such as parentheses ( and ). In the 1700s, Spanish introduced inverted ? and ! at the beginning of clauses, in addition to the right-way-up ones at the end, to bracket questions and exclamations. Paragraphs are another bracketing technique, being indicated by indentation.

    Printing
    Around 1050, movable-type printing was invented in China. Instead of carving an entire page on one block as in block printing, each character was on a separate tiny block. These were fastened together into a plate to reflect a page of a book, and after printing, the plate was broken up and the characters reused. But because thousands of characters needed to be stored and manipulated, making movable-type printing difficult, it never replaced block printing in China. But fewer than a hundred letters and symbols need to be manipulated for European alphabets, which is much easier. So when movable-type printing reached Europe, the printing revolution began.

    With printing a new type of language matured, one that couldn't be spoken very well, only written: the language of mathematics. Mathematics, unlike natural languages, needs to be precisely represented. Natural languages are very expressive, but can also be quite vague. Numbers were represented by many symbols in ancient Egypt and Sumeria, and had reduced to a mere 10 by the Renaissance. But from then on, mathematics started requiring many more symbols than merely two cases of 26 letters, 10 digits, and some operators. Many symbols were imported from other alphabets, different fonts introduced for Latin letters, and many more symbols invented to accommodate the requirements of writing mathematics. Mathematical symbols are now almost standardized throughout the world. Many other symbol systems, such as those for chemistry, music, and architecture, also require precise representation. Existing writing systems changed to utilize the extra expressiveness that came with movable-type printing. Underlining in handwriting was supplemented with bolding and italics. Parentheses were supplemented with brackets [] and curlies {}.

    Fifty years ago, yet another type of language arose, for specifying algorithms: computer languages. The first computer languages were easy to parse, requiring little backtracking, but the most popular syntax, that of C and its descendants, requires more complex logic and greater resources to parse. Most programming languages used a small repertoire of letters, digits, punctuation, and symbols, being limited by the keyboard. Other languages, most notably APL, attempted to use many more, but this never became popular. Unlike mathematics, computer languages relied on parsing syntax, rather than a large variety of tokens, to represent algorithms. Computer programs generally copied natural language writing systems, using letters, numbers, bracketing, separators, punctuation, and symbols in similar ways. One notable innovation of computer languages, though, is camel case, popularized for names in C-like language syntaxes.

    The natural language that spread around the world in modern times, English, doesn't use a strict pronunciation-spelling correspondence, perhaps one of the many reasons it spread so rapidly. English writing therefore caters for people who speak English with widely differing vowel sounds and stress, pause, and tone patterns. In this way, English words are a little like Chinese ideographs. As Asian economies developed, techniques for quickly entering large-character-set natural languages were invented, known as IMEs (input method editors). But these Asian countries still use English for computer programming.

    Unification
    Around 1990 Unicode was born, unifying the character sets of the world. Initially, there was only room for about 60,000 tokens in Unicode, so the CJK characters of China, Japan, and Korea were unified to save space. Unicode is also bidirectional, catering to Arabic and Hebrew. Top-down languages such as Mongolian and traditional Chinese script can be simulated with left-to-right or right-to-left directioning by using a special sideways font. However, Unicode didn't become very popular until its UTF-8 encoding, designed in 1992, was widely adopted, allowing backwards compatibility with ASCII. Unicode's code space has also since been extended beyond the original 16 bits, so there's now room for about one million characters, allowing less commonly used scripts such as Egyptian hieroglyphs to be encoded.
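
    The ASCII compatibility of UTF-8 is easy to see for yourself. The sketch below, in Python, prints the bytes UTF-8 uses for a few characters: ASCII letters stay as single bytes, a CJK character takes three, and a code point from the Egyptian hieroglyphs block (U+13000, chosen here only as an example) takes four. The helper name utf8_hex is made up for this illustration.

```python
def utf8_hex(ch):
    """Return the UTF-8 encoding of a string as space-separated hex bytes."""
    return " ".join(f"{byte:02x}" for byte in ch.encode("utf-8"))

samples = {
    "A": "Latin capital letter (ASCII)",
    "\u00e9": "Latin small letter e with acute",
    "\u65e5": "CJK character for sun",
    "\U00013000": "first code point of the Egyptian hieroglyphs block",
}

for ch, description in samples.items():
    print(f"U+{ord(ch):04X}  {utf8_hex(ch):<12} {description}")

# Pure ASCII text is byte-for-byte identical in both encodings.
assert "Groovy".encode("utf-8") == "Groovy".encode("ascii")
```

    The four samples encode as 41, c3 a9, e6 97 a5, and f0 93 80 80 respectively, so any existing ASCII file is already valid UTF-8.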

    Many programming languages have recently adopted different policies for using Unicode tokens in names and operators. The display tokens in Unicode are divided into various categories and subcategories, mirroring their use in natural language writing systems. Examples of such subcategories are: uppercase letters (Lu), lowercase ones (Ll), digits (Nd), non-spacing combining marks, e.g. accents (Mn), spacing combining marks, e.g. Eastern vowel signs (Mc), enclosing marks (Me), invisible separators that take up space (Zs), math symbols (Sm), currency symbols (Sc), start bracketing punctuation (Ps), end bracketing (Pe), initial quote (Pi), final quote (Pf), and connector punctuation, e.g. underscore (Pc).
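
    These subcategories can be looked up programmatically. Here's a small sketch using the unicodedata module from Python's standard library, whose category function returns the two-letter codes listed above. The sample characters are my own picks for each subcategory, and the helper name category_of is invented.

```python
import unicodedata

def category_of(ch):
    """Return the two-letter Unicode general category of one character."""
    return unicodedata.category(ch)

# One sample character per subcategory mentioned in the text.
samples = [
    ("A", "uppercase letter"),
    ("a", "lowercase letter"),
    ("7", "decimal digit"),
    ("\u0301", "combining acute accent"),
    ("\u0903", "Devanagari sign visarga"),
    ("\u20dd", "combining enclosing circle"),
    (" ", "space"),
    ("+", "plus sign"),
    ("$", "dollar sign"),
    ("(", "left parenthesis"),
    (")", "right parenthesis"),
    ("\u00ab", "left guillemet"),
    ("\u00bb", "right guillemet"),
    ("_", "underscore"),
]

for ch, description in samples:
    print(f"U+{ord(ch):04X}  {category_of(ch)}  {description}")
```

    The categories printed are, in order: Lu, Ll, Nd, Mn, Mc, Me, Zs, Sm, Sc, Ps, Pe, Pi, Pf, Pc.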

    For it to become popular to use a greater variety of Unicode tokens in computer programs, there must be a commonly available IME for their entry with keyboards. Sun's Fortress provides keystroke sequences for entering mathematical symbols in programs, but leaves it vague whether the Unicode tokens or the ASCII keys used to enter them are the true tokens in the program text. And of course there must be a commonly available font representing every token. Perhaps because of the large number of CJK characters, and the recent technological development of mainland China, a large number of programmers may one day suddenly begin using them in computer programming to make their programs terser.

    Conclusion
    Language representation using graphical symbols has taken many huge leaps in history: Egyptian hieroglyphs to represent speech around 5000 years ago, an alphabet to represent consonant and vowel sounds by the Phoenicians and Greeks around 3000 years ago, movable-type printing in Europe around 500 years ago, and unifying the world's alphabets and symbols into Unicode a mere 20 years ago. And who knows what the full impact of this latest huge leap will be?

    Saturday, December 06, 2008

    The Thematic Structure of English and Groovy

    After working as a programmer for many years, I tossed it in to teach English in China. I spent a few years reading the many books on language and linguistics in the bookshops up here, before returning to programming as a hobby. I then started to see many similarities between natural and computer languages, which I'm intermittently blogging about. Here's today's installment...

    Introduction
    Of the books on language I've read, M.A.K. Halliday's ideas make a lot of sense. He suggests we should analyse language in terms of what it's used for, rather than its inherent structure. From this basis, he's isolated three basic functions of natural language, and their corresponding structural subsystems: the ideational, the interpersonal, and the textual.

    The ideational function is a representation of experience of the outside world, and of our consciousness within us. It has two main components: the experiential and the logical. The experiential component embodies single processes, with their participants and circumstances, in a transitivity structure. For example, “At quarter past four, the train from Newcastle will arrive at the central station.” has a transitive structure with process to arrive, participants train from Newcastle and central station, and circumstance quarter past four. The primary participant is called the actor, here, the train from Newcastle. Computer languages have a structure paralleling the transitivity structure of natural languages, e.g. train.arrive(station, injectedCircumstance) for object-oriented languages. The logical component of ideational function concerns links between the experiential components, attained with English words such as and, which, and while. These have obvious parallels in programming languages.

    The interpersonal function involves the producer and receiver/s of language, and the relationship between them. This function accounts for the hundreds of different attitudinal intonations we overlay onto our speech, interjections, expressions of politeness we prepend to English sentences, e.g. “Are you able to...”, many adverbs of opinion, the mood (whether indicative, interrogative, imperative, or exclamative), and the modal structure of English grammar. The mood structure causes verbal phrases to have certainty, ability, allowability, polarity, and/or tense prepended in English, and can be repeated in the question tag, e.g. isn't he?, can't we?, should they?. The interpersonal function gives the grammatical subject-and-predicate structure to English. In programming languages, the interpersonal function determines how people interact with the program, and how other programs interact with it. The interpersonal functions are what would normally be extracted into aspects in aspect-oriented programming. They generally disrupt the “purer” transitivity structure of the languages.

    The textual function brings context to the language through different subsystems. The informational subsystem divides the speech or text into tone units using pauses, then gives stress/es to that part of the unit that is new information. The cohesive subsystem enables links between sentences, using conjunctions and pronouns, substitution, ellipsis, etc. The thematic subsystem makes it easy for receivers to follow the flow of thought. Comparing this structure of the English and Groovy languages is the topic of today's blog post...

    Thematic structure of English
    Theme in English is overlaid onto the clause, a product of the transitive and modal structures. The theme is the first position in the clause. English grammar allows any lexical item from the clause to be placed in first position. (In fact, English allows the lexical items to be arranged in any order to enable them to be broken up in any combination into tone units.) Some examples, using lexical items to give, Alan, me, and the book, with theme bolded:
      Alan gave me that book in London.
      Alan gave that book to me in London. (putting indirect object into prepositional phrase)
      To me Alan gave that book in London. (fronting indirect object)
      I am who Alan gave that book to in London. (fronting indirect object, with extra emphasis)
      To me that book was given in London. (using passive voice to front indirect object)
      That book was given in London. (using passive voice to omit indirect object)
      That book Alan gave me in London. (fronting direct object as topic)
      That book is the one Alan gave me in London. (fronting direct object in more formal register)
      In London, Alan gave me that book. (fronting adverbial, into separate tone unit)
      London is where Alan gave me that book. (fronting adverbial in the same tone unit)
      There is a book given by Alan to me in London. (null topic)

    Although not common, English also allows the verb to be put in first position as theme:
      Give the book Alan did to me in London.
      Give me the book did Alan in London.
      Give did Alan of the book to me in London.
      Given was the book by Alan to me in London.

    First position is merely the way English indicates what the theme is, not the definition of it. Japanese indicates the theme in the grammatical structure (with the inflection はwa), while Chinese (I think) uses a combination of first position and grammatical structure (prepending with 是shi).

    Thematic structure of Groovy
    One way of indicating theme could be to bold it, assuming the text editor had rich text capabilities. This would be similar to Japanese. For example, for thematic variable a:
      def b = a * 2; def c = a / 2;
    Another way is to use first position, how English indicates it. This would be an Anglo-centric thematic structure to programming languages, which generally already have an Anglo-centric naming system. Perhaps the best way is a combination of both front position and bolding.

    Let's look at how Groovy could enable front-position thematic structure. We'll start with something simple, the lowest precedence operator: a = b. If we want to front the b, we can't. We would need some syntax like =:, the reverse of Algol's :=
      b =: a

    We'd need to provide the same facility for the other precedence operators at the same level += -= *= /= %= <<= >>= >>>= &= ^= |=. Therefore, we'd have operators =+: =-: =*: =/: =%: =<<: =>>: =>>>: =&: =^: =|:.
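
    One way to experiment with such reversed assignments is a tiny source-to-source rewrite. The Python sketch below turns a single statement using =: (or one of the compound forms such as =+:) back into the ordinary assignment. It's only a regex-level illustration under big assumptions: one statement per string, a plain name as the target, and no string literals containing these symbols. The function name unreverse is made up.

```python
import re

# value =OP: target   where OP is optional and is one of the compound operators.
# Longer operators are listed first so that ">>>" isn't read as ">>".
REVERSED = re.compile(
    r"^\s*(?P<value>.+?)\s*=(?P<op>>>>|<<|>>|[-+*/%&^|])?:\s*"
    r"(?P<target>[A-Za-z_]\w*)\s*$"
)

def unreverse(statement):
    """Rewrite 'value =: target' as 'target = value'; leave other text alone."""
    match = REVERSED.match(statement)
    if match is None:
        return statement
    op = match.group("op") or ""
    return f"{match.group('target')} {op}= {match.group('value')}"

print(unreverse("b =: a"))            # a = b
print(unreverse("x * 2 =+: total"))   # total += x * 2
print(unreverse("n =>>>: bits"))      # bits >>>= n
print(unreverse("a = b"))             # a = b (unchanged)
```

    A real implementation would do this in the parser rather than with a regex, so that the reversed operators get proper precedence and associativity.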

    At the next higher precedence level are the conditional and Elvis operators. Many programming languages, such as Perl and Ruby, enable unless as a statement suffix, allowing the action to be fronted as the theme. Groovy users frequently request this feature of Groovy on the user mailing list. An unless keyword would be useful, but we could also make the ? : and ?: operators multi-theme-enabling by reversing them, i.e. : ? and :?, with opposite (leftwards) associativity. The right-associative ones would have higher precedence over these new ones, so, for example:
      a ? b : c ? d : e would associate like a ? b : (c ? d : e)
      a : b ? c : d ? e would associate like (a : b ? c) : d ? e
      a : b : c ? d ? e would associate like a : (b : c ? d) ? e
      and a ? b ? c : d : e would associate like a ? (b ? c : d) : e

    On a similar note: Groovy dropped the do while statement because of parser ambiguities. It should be renamed do until to overcome the ambiguities.

    Next up the precedence hierarchy, we need short-circuit boolean operators ||: and &&:, which evaluate, associate, and short-circuit rightwards. Most of the next few operators up the hierarchy | ^ & == != <=> < <= > >= + * don't need reverse versions, but these do: =~ ==~ << >> >>> - / % **. It's good Groovy supplies the ..< operator so we can emphasize an endpoint in a range without actually processing it. We'll also provide the >.. and >..< operators.
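
    To pin down what the extra range operators might mean, here's a sketch in Python for integer ranges. I'm assuming, by symmetry with ..<, that a > before the dots excludes the lower endpoint, so >..< excludes both. The function name open_range and its keyword arguments are invented for the illustration.

```python
def open_range(low, high, exclude_low=False, exclude_high=False):
    """List the integers from low to high, optionally dropping either endpoint."""
    start = low + 1 if exclude_low else low
    stop = high if exclude_high else high + 1
    return list(range(start, stop))

print(open_range(1, 5))                                        # 1..5
print(open_range(1, 5, exclude_high=True))                     # 1..<5
print(open_range(1, 5, exclude_low=True))                      # 1>..5  (assumed meaning)
print(open_range(1, 5, exclude_low=True, exclude_high=True))   # 1>..<5 (assumed meaning)
```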

    Just as in English we have the choice of saying the king's men or the men of the king, depending on what we want to make thematic, we should have that choice in Groovy too.
    We can easily encode reverse-associating versions of *. ?. .& .@ *.@ as .* .? &. @. @.*. To encode the standard path operator ., we could use .:.

    A positive by-product of having these reverse-associative versions of the Groovy operators is they'll work nicely with names in right-directional alphabets, such as Arabic and Hebrew, when we eventually enable that.

    When defining methods in Groovy, we should have the choice to put return values and modifiers after the method name and parameters, like in Pascal. This would cater to speakers of Romance languages, e.g. French, who generally put the adjectives after the nouns.

    Conclusion
    Groovy, like most programming languages, doesn't enable programmers to supply their own thematic structure to code, only the transitive structure. When used well, thematic structure in code enables someone later on to more easily read and understand the program. Perl was a brave attempt at providing “more than one way to do things”, but most programming languages haven't learnt from it. I'm working on a preprocessor for the Groovy Language, experimenting with some of these ideas. If it looks practical, I'll release it one day, as GroovyScript. It will make Perl code look like utter verbosity.

    Saturday, November 29, 2008

    GrerlVy Symbology

    In my previous post, I suggested we should reserve alphanumeric tokens (Unicode categories L, N, M) for lexical items (e.g. names and numbers) in a programming language, and the other tokens (categories S, P, Z) for grammatical items. That way, people can use any alphanumeric tokens for user-defined names. Programs are also terser because programmers don't need to put whitespace between name and symbol, or (usually) between two symbols.
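
    A rough sketch of a lexer following this rule, in Python: it splits source text purely on the major Unicode category of each character, so no whitespace is needed between a name and a symbol. The grouping of adjacent symbols into one token, the treatment of control characters as separators, and the function names are all assumptions of mine. Note that under this rule the underscore (category Pc) counts as grammatical, not as part of a name.

```python
import unicodedata
from itertools import groupby

def kind(ch):
    """Classify a character by the first letter of its Unicode category."""
    major = unicodedata.category(ch)[0]
    if major in "LNM":
        return "lexical"        # letters, numbers, combining marks
    if major in "ZC":
        return "separator"      # spaces, plus control characters such as tab
    return "grammatical"        # symbols (S) and punctuation (P)

def split_tokens(source):
    """Split source into (kind, text) runs, dropping separators."""
    tokens = []
    for token_kind, chars in groupby(source, key=kind):
        if token_kind != "separator":
            tokens.append((token_kind, "".join(chars)))
    return tokens

print(split_tokens("total+=price*2"))
print(split_tokens("\u540d\u524d=\u5024+1"))
```

    The first call yields the five tokens total, +=, price, *, and 2; the second shows that names written in CJK characters split the same way as Latin ones.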

    The grammatical tokens should have less emphasis than the lexical ones. When a programming language uses names for grammatical items, it's not following the way of natural languages, which can confuse those new to programming. Just as unusual, those same grammatical words are bolded in many IDEs, not the lexical ones. In Fortran these grammatical words were even capitalized, while user-defined names weren't. Programming languages should follow natural languages, or else programmers can't analyze and model the problem space in programs as easily.

    Let's look at some of the culprits...

    Keywords
    Perhaps Smalltalk has the fewest keywords:
    true false nil self super thisContext

    The first language I ever programmed in was a dialect of Algol60, the forerunner of C. We can see there the origins of today's well-known keywords:
    true false boolean integer real string array procedure own value label comment while do for step until if then else switch goto begin end

    C had 32 keywords:
    auto break case char const continue default do double else enum extern float for goto if int long register return short signed sizeof static struct switch typedef union unsigned void volatile while

    C++, besides using those in C, added many more:
    asm bool catch class const_cast delete dynamic_cast explicit export false friend inline mutable namespace new operator private protected public reinterpret_cast static_cast template this throw true try typeid typename using virtual wchar_t

    C# has 79 keywords, removing a few of C++'s, but adding all these:
    abstract as base byte checked decimal delegate event finally fixed foreach get implicit in interface internal is lock null object out override params readonly ref sbyte sealed set stackalloc string typeof uint ulong unchecked unsafe
    though 2 of them, get and set, are only keywords in the right context.

    Java has 53 keywords:
    abstract assert boolean break byte case catch char class continue default do double else extends final finally float for if implements import instanceof int interface long native new package private protected public return short static super switch synchronized this throw throws transient try void volatile while true false null
    Groovy adds in as it and def to these.

    It's very difficult to extend a language without adding keywords. AspectJ adds many to Java:
    aspect aspectOf() issingleton() perthis() pertarget() percflow() percflowbelow() privileged declare precedence parents error warning soft thisJoinPoint thisEnclosingJoinPointStaticPart pointcut before around after returning throwing execution call get set target args initialization preinitialization staticinitialization handler adviceexecution within withincode cflow cflowbelow proceed()
    though many are sensitive to the context.

    Python has a more frugal attitude to keywords:
    False None True class continue def and as assert break finally for from del elif else except is lambda nonlocal global if import in return try while not or pass raise with yield

    Likewise, Ruby:
    BEGIN END alias and begin break case class def defined do else elsif end ensure false for if in module next nil not or redo rescue retry return self super then true undef unless until when while yield

    PHP is far more wasteful of the naming space:
    and or xor __FILE__ exception __LINE__ array() as break case class const continue declare default die() do echo() else elseif empty() enddeclare endfor endforeach endif endswitch endwhile eval() exit() extends for foreach function global if include() include_once() isset() list() new print() require() require_once() return() static switch unset() use var while __FUNCTION__ __CLASS__ __METHOD__ final php_user_filter interface implements instanceof public private protected abstract clone try catch throw cfunction old_function this final __NAMESPACE__ namespace goto __DIR__

    Perl has so many keywords I can't list them.

    And those Fortran keywords I mentioned before:
    ACCEPT ASSIGN AUTOMATIC BACKSPACE BLOCK BYTE CALL CHARACTER CLOSE COMMON COMPLEX CONTINUE DATA DECODE DIMENSION DO DOUBLE ELSE ENCODE END ENTRY EQUIVALENCE EXTERNAL FILE FORMAT FUNCTION GOTO IF IMPLICIT INCLUDE INQUIRE INTEGER INTRINSIC LOGICAL MAP NAMELIST OPEN OPTIONS PARAMETER PAUSE POINTER PRAGMA PRECISION PROGRAM REAL RECORD RETURN REWIND SAVE STATIC STOP STRUCTURE SUBROUTINE TYPE UNION VIRTUAL VOLATILE WHILE WRITE


    Operators
    The original Fortran operators had only a few precedences:
    R: **
    L: * /
    L: + -
    .eq. .ne. .lt. .le. .gt. .ge.
    .not. .and. .or. .eqv. .neqv.
    =


    When choosing symbols and punctuation to use, we don't want any alphanumeric symbols, we should keep the same precedences and associativity, and, at first, we should only use those already existing in other languages.
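
    To make "same precedences and associativity" concrete, here is a minimal precedence-climbing sketch in Python. The OPS table, and the function names tokenize, parse and parse_atom, are made up for this illustration; they're not part of any GrerlVy code. It only handles integers, parentheses and five arithmetic operators, using the Fortran-style levels listed above.

```python
import re

# Hypothetical operator table: (precedence, associativity); higher binds tighter.
OPS = {
    "**": (3, "R"),
    "*": (2, "L"), "/": (2, "L"),
    "+": (1, "L"), "-": (1, "L"),
}

def tokenize(text):
    # Integers, the two-character ** operator, then single-character tokens.
    return re.findall(r"\d+|\*\*|[-+*/()]", text)

def parse_atom(tokens):
    tok = tokens.pop(0)
    if tok == "(":
        node = parse(tokens)
        tokens.pop(0)  # discard the closing ")"
        return node
    return int(tok)

def parse(tokens, min_prec=1):
    left = parse_atom(tokens)
    while tokens and tokens[0] in OPS and OPS[tokens[0]][0] >= min_prec:
        op = tokens.pop(0)
        prec, assoc = OPS[op]
        # A left-associative operator must not absorb another operator of the
        # same level on its right; a right-associative one may.
        next_min = prec + 1 if assoc == "L" else prec
        right = parse(tokens, next_min)
        left = (op, left, right)
    return left

tree_pow = parse(tokenize("2 ** 3 ** 2"))   # groups to the right
tree_sub = parse(tokenize("10 - 4 - 3"))    # groups to the left
```

    The only difference between the L and R rows is the minimum precedence handed to the recursive call, which is why a table of levels plus one associativity flag per level is enough to drive the whole expression parser.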

    The Algol60 operators were:
    ^
    * %
    + -
    < > <= >= = <>
    == => ∨ ∧ ~
    := ; .

    Note: % meant divide.

    The modern operators were first used by C:
    L: ++postfix --postfix ()funcCall [] . ->
    R: ++prefix --prefix +unary -unary ! ~ (castType) *deref &addr sizeof
    L: *mult / %
    L: +binary -binary
    L: << >>
    L: < <= > >=
    L: == !=
    L: &bitAnd
    L: ^
    L: |
    L: &&
    L: ||
    R: ?:conditional
    R: = += -= *= /= %= <<= >>= &= ^= |=
    L: ,


    Apart from alpha token sizeof, all these operators can be used in GrerlVy. We can find another use for dereferencing * and addressing &, but keep their precedence and associativity. And the comma , and -> operators will be reinstated.

    C++ adds non-alpha tokens :: ::* .* ->*. The GCC version of C++ adds the operators &&label <? and >?.

    The Java operators are:
    L: [] ()methCall .
    R: ++prefix ++postfix --prefix --postfix +unary -unary ~ ! (typeCast) new
    L: * / %
    L: +addition -binary +stringConcat
    L: << >> >>>
    L: < <= > >= instanceof
    L: == !=
    L: &bitAnd &boolAnd
    L: ^bitXor ^boolXor
    L: |bitOr |boolOr
    L: &&
    L: ||
    R: ?:conditional
    R: = += -= *= /= %= <<= >>= >>>= &= ^= |=


    Groovy adds:
    **
    <=>
    *. ?. .& .@ *.@
    as
    .. ..< in
    ?:elvis
    =~ ==~

    GrerlVy will replace as, in, new, and instanceof with symbols.

    C# adds the non-alpha operators ?? and => at the lowest precedence. JavaScript adds === and !== at the same precedence as ==. Ruby adds =~ and !~ at the same level as ===, plus ... &&= ||=. There are no further unique operators in PHP.

    Perl 5 uses variable prefixes $ @ % & \. We could swipe $ to mean an escape even when we're not inside a GString. Perl 6 plans on adding many more operators: see the Perl 6 Synopsis 3. Once it's out, I'll use it as a source for even more symbols.

    As well as those operator symbols from those programming languages, GrerlVy will also add:
    !=~ !~ !in !instanceof

    For integer division and modulo, returning two values:
    /%
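
    Python already has a built-in along these lines, divmod, so here's a quick look at what a two-valued /% might return. This shows Python's behaviour only; which rounding rule GrerlVy would pick isn't decided.

```python
# Python's built-in divmod returns quotient and remainder as one pair.
quotient, remainder = divmod(17, 5)

# Python floors the quotient, so the remainder takes the sign of the divisor.
neg_quotient, neg_remainder = divmod(-17, 5)
```

    For 17 and 5 the pair is 3 and 2. For -17 and 5 Python gives -4 and 3, whereas Java's separate / and % truncate toward zero and give -3 and -2, so a /% operator would have to choose one of the two conventions.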

    For defining parser combinators:
    ::= &&& ||| ???
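
    I haven't pinned down what each of these symbols means, so the following Python sketch is only a guess: it assumes &&& is sequencing, ||| is ordered choice, and ??? marks an optional element. The helper names lit, seq, alt and opt are invented for the example. A parser here is a function taking the text and a position, returning a (value, new position) pair or None on failure.

```python
def lit(expected):
    """Match a literal string."""
    def parser(text, pos):
        if text.startswith(expected, pos):
            return expected, pos + len(expected)
        return None
    return parser

def seq(*parsers):
    """Assumed meaning of &&&: every parser must match, one after another."""
    def parser(text, pos):
        values = []
        for p in parsers:
            result = p(text, pos)
            if result is None:
                return None
            value, pos = result
            values.append(value)
        return values, pos
    return parser

def alt(*parsers):
    """Assumed meaning of |||: the first parser that matches wins."""
    def parser(text, pos):
        for p in parsers:
            result = p(text, pos)
            if result is not None:
                return result
        return None
    return parser

def opt(p):
    """Assumed meaning of ???: match if possible, otherwise yield None."""
    def parser(text, pos):
        result = p(text, pos)
        return result if result is not None else (None, pos)
    return parser

# greeting ::= "hello" &&& ","??? &&& (" world" ||| " there")
greeting = seq(lit("hello"), opt(lit(",")), alt(lit(" world"), lit(" there")))
```

    With symbols instead of the names seq, alt and opt, the last line could be written almost exactly like the grammar rule in the comment above it.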

    I don't know what these will do yet, but might as well keep symbols balanced:
    <<< <<<=


    Special Variables
    Groovy disallows names containing $, reserving them for its internal use. Because GrerlVy also needs internal variables, it needs to reserve some names, those containing underscore _. Some pre-existing names in Groovy/Java are coded with all capitals separated by underscores. GrerlVy will code them in camel case with a trailing underscore, e.g. SYMBOLIC_CONSTANT_FOR_IDEC will be coded as the easier-to-type symbolicConstantForIdec_, the trailing underscore showing it must be converted to the all-capitals name.
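
    The conversion from the trailing-underscore form back to the all-capitals name is mechanical. A rough Python sketch, where the function name to_constant_name is hypothetical and the rule for where to split (before each capital that follows a lowercase letter or digit) is my assumption:

```python
import re

def to_constant_name(name):
    """Map symbolicConstantForIdec_ to SYMBOLIC_CONSTANT_FOR_IDEC."""
    if not name.endswith("_"):
        return name  # no trailing underscore: leave the name alone
    stem = name[:-1]
    # Insert an underscore at each lower-to-upper boundary, then upper-case.
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", stem).upper()

converted = to_constant_name("symbolicConstantForIdec_")
```

    Names without the trailing underscore pass through untouched, so ordinary camel-case identifiers aren't affected.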

    Names beginning with _ will be used as special variables, similar to those in Perl and Ruby. When choosing the meanings of special variables, we'll consider their use in regexes and Format strings, as well as in those languages. Some examples from Ruby:
    $! $@ $_ $. $& $~ $1 $2 $3 $= $/ $\ $0 $* $$ $?

    The _ standing alone will refer to the result of the previous statement, similar to Python's interactive mode. Natural languages have pronouns, so should programming languages.
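
    A toy version of that pronoun, sketched in Python. The function run_statements is invented for the illustration and only handles expressions (it relies on eval); it just shows the idea of rebinding _ after every statement.

```python
def run_statements(statements):
    """Evaluate expressions in order, binding _ to the previous result."""
    env = {"_": None}
    for source in statements:
        # Each statement can see the previous result under the name _.
        env["_"] = eval(source, {}, env)
    return env["_"]

final = run_statements(["6 * 7", "_ + 1", "_ * 2"])
```

    Each statement refers back to "it" without anyone having to invent a variable name, which is the whole point of having a pronoun.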


    Numeric Formats
    Formats of numbers must also be terse. C and C++ used the well-known syntax:
    34 -72
    34.1 -23.4 .77 34.1e6 78.9E-2
    077 0xFF -0Xff
    34U 24L 42UL


    Java removes the suffix for unsigned numbers, but adds:
    34.7F 22f 34.8d 77D 67.89e-2d
    34 -72
    Double.NEGATIVE_INFINITY Float.NaN


    When supplying a string to a Float object, we can use a special syntax:
    new Float('0xDEDEp-3')
    new Float('0x.defP-3f')

    We'll inline this into GrerlVy's syntax.
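
    For comparison, Python reads the same hexadecimal floating-point notation through float.fromhex. A small sketch (the Java-style trailing f is left off the second string here):

```python
# Hex digits form the mantissa; p introduces a power-of-two exponent.
first = float.fromhex("0xDEDEp-3")    # 0xDEDE / 2**3
second = float.fromhex("0x.defP-3")   # (0xdef / 16**3) / 2**3
```

    The notation is exact: every such literal names a binary fraction directly, so nothing is lost in decimal rounding.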

    Groovy adds 34i 22G 34.8g. C# removes octal numbers, but adds decimal, e.g. 34.7m. We'll keep octal numbers, but require an o, as in 0o377.

    Perl allows an underscore _ for legibility, e.g. 1_234_567_890. We'll certainly enable that in GrerlVy.

    Ruby mainly copies Perl, but adds ASCII codes and binary:
    ?a //ASCII code for 'a'
    0b01101 //binary
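
    Several of these forms can be tried out in Python 3.6 or later, which accepts underscores in numbers, the 0o octal prefix and the 0b binary prefix. The variable names below are just for the example:

```python
population = 1_234_567_890   # underscores between digits are ignored
file_mode = 0o377            # octal, with the explicit o
flags = 0b01101              # binary
code_for_a = ord("a")        # Python has no ?a literal; ord() is the nearest thing
```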


    Python makes the digits after the decimal point optional, and has imaginary numbers:
    5. 5.e7
    10j 11.1j .1j 1.j 7e5j

    There's no further unique syntax in JavaScript and PHP.

    One source of good ideas is the J language:
    2b1001011 //base 2
    3b12202102 //base 3, etc
    3r5 // 3/5, similar to Scheme's rational numbers
    3j5 // 3+5i, terser than Python's complex numbers
    3p5 // 3*pi^5
    3x5 // 3*e^5
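
    To check my reading of these J forms, here's a rough Python sketch that interprets them. The function parse_j_number is hypothetical and deliberately limited: it only accepts non-negative integers on both sides of the letter, which is much less than J itself allows.

```python
import math
import re
from fractions import Fraction

def parse_j_number(text):
    """Interpret a small subset of J-style numerals (integers only)."""
    based = re.fullmatch(r"(\d+)b([0-9a-z]+)", text)
    if based:
        # <base>b<digits>: the number before b is the base.
        return int(based.group(2), int(based.group(1)))
    match = re.fullmatch(r"(\d+)([rjpx])(\d+)", text)
    if not match:
        raise ValueError(f"not a recognised numeral: {text!r}")
    left, kind, right = int(match.group(1)), match.group(2), int(match.group(3))
    if kind == "r":
        return Fraction(left, right)      # rational: left / right
    if kind == "j":
        return complex(left, right)       # complex: left + right*i
    if kind == "p":
        return left * math.pi ** right    # left * pi^right
    return left * math.e ** right         # kind == "x": left * e^right

binary_value = parse_j_number("2b1001011")
ratio = parse_j_number("3r5")
```

    One letter between two numbers covers bases, rationals, complex numbers and the two constants, which is the kind of terseness I'm after.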



    GrerlVy will be about extreme terseness in programming. GrerlVy is a better choice of syntax for the Groovy/Grape AST.