Friday, March 30, 2007

Programming Language Fluency

Steven Pinker, in his book "Words and Rules", suggests a natural language such as English is stored in the human brain as words (vocabulary) and rules (grammar). As children we learn words, such as "cat", "cats", "dog", "dogs", "lion", and "lions". Our brains then recognize the pattern, remember it as the rule "add -s for plural", and forget the individual plural words. After learning more words such as "tiger", "elephant", and "mouse", our brains also remember the exceptions to the rule, such as "mice" being the plural of "mouse". Most rules of English, however, are about joining words together to generate a phrase or clause. And sometimes a word group is remembered as vocabulary when it can't be generated using a rule, eg, phrasal verbs like "put up with" and idioms like "pulling one's leg".
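The words-and-rules idea can be sketched in code as a small table of memorized exceptions plus one default rule. This is only an illustration in Java, not something taken from Pinker's book; the class name `Plurals` and the contents of the exception table are assumptions made for the example.

```java
import java.util.Map;

class Plurals {
    // "Words": irregular plurals that must be memorized (a hypothetical, tiny table).
    static final Map<String, String> EXCEPTIONS = Map.of("mouse", "mice");

    // "Rules": any noun without a memorized exception gets the default "add -s" rule.
    static String plural(String noun) {
        return EXCEPTIONS.getOrDefault(noun, noun + "s");
    }

    public static void main(String[] args) {
        System.out.println(plural("tiger"));
        System.out.println(plural("mouse"));
    }
}
```

Only the exceptions take up storage; every regular plural is generated on demand, which is the economy the paragraph above describes.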

Computer languages have many similarities to natural languages. The syntax is the grammar, and the keywords, libraries, commands, macros, etc are the vocabulary. Computer languages let us define our own variable names, just as natural languages let us invent new proper nouns. Comments are like interjections, strings are like foreign words, expressions are phrases, statements are clauses, closures are relative clauses, blocks are sentences, semicolons and commas are conjunctions, methods and functions are verbs, interfaces and annotations are adjectives, and classes and properties are nouns. Some packages’ classes are commonly-used nouns, while others’ are less-common technical words.

Over the last few years, powerful "scripting" languages such as Python, Ruby, and Groovy have become more popular. They follow the dead end in programming language evolution known as "Fourth Generation Languages". It was easier to write a program in a visual 4GL, but far more difficult to read and debug it. Most people prefer to use a combination of both language and visual design, but it’s easier to decorate language with visuals (eg, a program listing using indentation, with IDE color highlighting) than visuals with language (eg, a form layout with code snippets on each element). Human brains are wired for language, whether natural or programming, hence scripting languages' recent rising popularity.

But scripting languages do have some drawbacks that prevent them working well with an IDE. Dynamic typing means the language often can’t be supported very well with the visual code auto-completion feature. So to use a dynamically-typed language well, we must be able to recall the most common methods and properties of the most common classes.

Imagine we can speak English with the correct grammar words, stresses, pauses, tones, etc, but don’t know a substantial portion of the vocabulary. To help us, we keep a friend with us always when we go out. So when speaking to a waiter at a restaurant, we say “There’s a”, then turn to our friend for a hint. Our friend tells us “fly”, so we say to the waiter “fly in my”, turn to our friend who tells us “soup”, then tell the waiter “soup!” This friend is just like the IDE auto-completers we use when programming. It’s unnatural to know far more language grammar than vocabulary, and the increasing uptake of dynamically typed languages could correct this imbalance in our knowledge of programming languages.

Young babies learn the vocabulary of their natural language faster than the grammar. Adult foreign language classes focus on topics, primarily requiring learning vocabulary and expressions about the topic. But those learning programming learn language grammar, and keep a class library reference book handy or use a memory-clogging GUI with auto-completion. A programming language should be learnt just like a natural language, grammar and vocabulary together. Perhaps the faulty learning method is why only certain types of people learn to use such languages well. If programming languages were taught properly, perhaps everyone would be able to learn them, just as they learn to count. Perhaps dynamically typed languages that don’t work well with IDE auto-completers could be the first programming languages to be learnt as all languages, whether natural or computer, should be: grammar and vocabulary together. Programmers must become fluent in the language, not just learn about the language.

I'm working on such a tutorial for the Groovy Programming Language. Java programmers are Groovy’s target market, and those who have never programmed in Java can benefit from learning Groovy first: they can be productive sooner. Perhaps they'll go on to learn and work with Java, or maybe they'll find Groovy to be sufficient for all their needs. Either way, Groovy should be easily learnable by them. The Java class libraries, both standard and enterprise editions, are huge, so I've drawn a rough circle around groovy.lang, groovy.util, java.lang, java.util, java.math, java.text, java.util.regex, and maybe also java.lang.reflect and java.util.concurrent. They seem to be the "core" packages and I'll consider them to be the "vocabulary" of Groovy. In my first pass through them, I'll focus on completeness of information.

The tutorial will pay attention to those whose native language isn’t English. Rather than relying on many translations eventually going out of sync with one another, it’s best the tutorial be in English, because most foreign programmers can read English, if not speak it. But the tutorial must use a foreigner-readable style and internationally-known words. The focus of learning must be on code examples, not wordy explanations.

The tutorial will also aim to present topics in the best sequence. People learning Groovy want to be productive as quickly as possible, and so should often learn things in the opposite sequence to people who learnt Java first. Just because closures, categories, interceptors, and builders are more recent concepts doesn’t mean learners should learn them later. Perhaps they need to learn closures before functions, functions and expandos before classes, collections before arrays, encapsulation and static members before instances, categories and interceptors before the complexity of static typing, inheritance, method overriding, multi-methods, casting, etc.

The tutorial will treat commonly-used extensions as an integral part of the language. Regex is an extension to both Java and Groovy, and its syntax is quite different to Groovy’s. But they are meant to be used together, and should be learnt together. There's no conceptual difference between a math expression and a regex pattern. Only because one's quoted and the other isn't in the syntax do many not consider regexes to be part of the language. But programmers must be comfortable using both the procedural and sometimes lispy styles of base Groovy and the more declarative style of regexes, printf format strings, or any other commonly-used extension.
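As a small illustration of mixing the styles, here is a sketch in Java (whose regex and format-string facilities Groovy shares) that uses a regex pattern, ordinary arithmetic, and a printf-style format string in one short method. The class name `OrderLine`, the input format "3 x 2.50", and the output wording are assumptions invented for the example.

```java
import java.math.BigDecimal;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OrderLine {
    // Declarative: the pattern describes what a line such as "3 x 2.50" looks like.
    static final Pattern LINE = Pattern.compile("(\\d+) x (\\d+\\.\\d{2})");

    static String total(String text) {
        Matcher m = LINE.matcher(text);
        if (!m.find()) {
            return "no match";
        }
        // Procedural: pull out the captured groups and do the arithmetic.
        int quantity = Integer.parseInt(m.group(1));
        BigDecimal price = new BigDecimal(m.group(2));
        BigDecimal total = price.multiply(BigDecimal.valueOf(quantity));
        // Declarative again: the format string describes the shape of the output.
        return String.format(Locale.ROOT, "%d items, total %.2f", quantity, total);
    }

    public static void main(String[] args) {
        System.out.println(total("3 x 2.50")); // prints "3 items, total 7.50"
    }
}
```

The pattern and the format string are both quoted, yet reading the method requires knowing them just as well as the unquoted arithmetic between them.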

I'll firm up the sequence of topics during my second pass through the information, when I expand the examples and explanations. For now, I've drawn a rough circle around these ones:
  • Getting started - basic concepts needed for each subsequent tutorial
  • integers, BigDecimal, floating-points
  • enums
  • dates & times
  • Collections, Queues/Deques, Arrays, Maps
  • Strings, Characters, regexes
  • input/output, networking
  • blocks, Closures, functions
  • Expandos, classes, categories
  • static typing
  • interfaces
  • inheritance, method overriding, multi-methods
  • method-to-syntax mappings (in, as, operator overloading)
  • exceptions
  • permissions
  • annotations
  • multi-threading
  • tree processing, builders, XML
  • interceptors, ExpandoMetaClass
  • packages, evaluate, class loading
  • reflection
  • internationalization
  • Further learning - pointers to Java, IDEs, Swing, Grails, Gant, etc
I'm expecting all this to take me until next year sometime.

Tuesday, March 13, 2007

Internationalizing Keywords

Some programming languages, such as Perl and PHP, distinguish variable names from other language tokens by prefixing them with a special character. Sometimes that character also indicates something about the type of the variable, such as $i to indicate a scalar value, or @arr for a vector one. This is handy because the language can then use any word beginning with an alphabetic character as a keyword. Other languages allow variables to begin with any alphabetic character, but not to be one of the keywords of the language, eg, no variables may be called class in Java.

It would be quite easy to internationalize the keywords in a programming language of the first type. A context-sensitive lexical preprocessor could simply replace the native-language-specific keywords with the English ones, eg, 作 with do, 回 with return, etc for Chinese. For the second type of programming language, if programs used, say, Chinese words as keywords, then they should let programmers use the unused English keywords as variable names. To allow this, before a preprocessor converted Chinese characters to the equivalent English keywords, it must somehow mangle names that are English keywords into an eligible alternative. In some languages, such as Groovy, this would be as simple as quoting the name, eg, method.'static'.params instead of method.static.params. In other languages, the mangling is more difficult. In any language with sizable libraries, there'd also be inter-module and namespace issues to deal with.
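The two preprocessor steps for the second type of language could be sketched roughly as follows, in Java. This is a deliberately naive sketch: the class name `KeywordPreprocessor`, the keyword tables, and the regex are all assumptions, it only quotes names that follow a dot (the Groovy-style case mentioned above), and it ignores string literals and comments, which a genuinely context-sensitive preprocessor would have to respect.

```java
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class KeywordPreprocessor {
    // Hypothetical keyword table, holding only the two pairs given in the text.
    static final Map<String, String> CHINESE_TO_ENGLISH = Map.of("作", "do", "回", "return");

    // A few English keywords that programmers might now want to use as names.
    static final Set<String> ENGLISH_KEYWORDS = Set.of("do", "return", "static", "class");

    // A property name after a dot, eg, ".static" in method.static.params.
    static final Pattern PROPERTY = Pattern.compile("\\.([A-Za-z_]\\w*)");

    static String translate(String source) {
        // Step 1: mangle names that collide with English keywords by quoting them.
        Matcher m = PROPERTY.matcher(source);
        StringBuilder mangled = new StringBuilder();
        while (m.find()) {
            String name = m.group(1);
            String replacement = ENGLISH_KEYWORDS.contains(name) ? ".'" + name + "'" : m.group();
            m.appendReplacement(mangled, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(mangled);

        // Step 2: replace the Chinese keywords with their English equivalents.
        String result = mangled.toString();
        for (Map.Entry<String, String> entry : CHINESE_TO_ENGLISH.entrySet()) {
            result = result.replace(entry.getKey(), entry.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        // prints: return method.'static'.params
        System.out.println(translate("回 method.static.params"));
    }
}
```

The order matters: mangling has to happen while the English keywords in the source can only be names, before the translation step introduces real English keywords.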

Another way a language of the second type could internationalize its keywords is to replace all its keywords with symbols and punctuation, then use a preprocessor on all programs. The keywords would be macros that add in these symbols. The default preprocessor would be the English language one, but any could be used. To use keywords in another language, we could exchange the default preprocessor for one in that other language, eg, a Spanish one. The default preprocessor would be conceptually separate from the compiler, but could in fact be tightly coupled at the implementation-level to provide more efficiency to those using keywords in English, the default natural language. Such a language of the second type is in contrast to one of the first type as far as token usage goes: one uses alphabetic characters for names only, the other to begin only the keywords.
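A minimal sketch of exchangeable preprocessors, in Java, might look like this. Everything specific here is an assumption made for illustration: the class name `SymbolicCore`, the symbols chosen for the core language, the Spanish keyword choices, and the simplification that tokens are separated by single spaces.

```java
import java.util.Map;

class SymbolicCore {
    // Two hypothetical keyword tables that expand to the same core symbols.
    static final Map<String, String> ENGLISH = Map.of("while", "@@", "return", "^");
    static final Map<String, String> SPANISH = Map.of("mientras", "@@", "devolver", "^");

    // Expand each keyword macro into its symbol; every other token is left alone.
    static String expand(String source, Map<String, String> keywords) {
        String[] tokens = source.split(" ");
        for (int i = 0; i < tokens.length; i++) {
            tokens[i] = keywords.getOrDefault(tokens[i], tokens[i]);
        }
        return String.join(" ", tokens);
    }

    public static void main(String[] args) {
        // Both lines print the same core program: @@ ( x ) ^ x
        System.out.println(expand("while ( x ) return x", ENGLISH));
        System.out.println(expand("mientras ( x ) devolver x", SPANISH));
    }
}
```

Because the core language contains no alphabetic keywords, a word such as `return` is an ordinary name under the Spanish table: `expand("devolver return", SPANISH)` leaves it untouched.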

As a case study, I'll look at how a hypothetical language with Java's keywords could replace them with ASCII symbols and punctuation. I'll divide them into nouns, verbs, and adjectives/adverbs as far as possible.

The Nouns: With auto- and unboxing, the keywords for void, char, int, long, short, byte, float, double, and boolean could be macros that add in java.lang.Void, java.lang.Character, java.lang.Integer, etc, and the semantics would be unchanged. true and false could be replaced with java.lang.Boolean.TRUE and java.lang.Boolean.FALSE. If the Null type comes in Java 7, null could similarly be replaced.
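Existing Java already hints at how this expansion would behave. In the sketch below, the method written with boxed types is what a hypothetical macro might produce from the method written with primitive keywords; the class and method names are assumptions for the example.

```java
class BoxedNouns {
    // What the programmer writes, using the primitive keywords.
    static int addPrimitive(int a, int b) {
        return a + b;
    }

    // What a hypothetical macro expansion might produce: the arithmetic still
    // works because the compiler unboxes the operands and boxes the result.
    static java.lang.Integer addBoxed(java.lang.Integer a, java.lang.Integer b) {
        return a + b;
    }

    // true and false written as the constants they could expand to.
    static java.lang.Boolean isPositive(java.lang.Integer n) {
        return n > 0 ? java.lang.Boolean.TRUE : java.lang.Boolean.FALSE;
    }

    public static void main(String[] args) {
        System.out.println(addPrimitive(2, 3)); // prints 5
        System.out.println(addBoxed(2, 3));     // prints 5
        System.out.println(isPositive(-1));     // prints false
    }
}
```

One detail such macros would need to handle with care: in current Java, `==` between two boxed values compares object identity rather than numeric value, so comparisons would probably have to expand to `equals` calls.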

The Adjectives: The modifiers could be considered to be annotations that the compiler sees first, before any "other" annotation processor. So static, private, protected, public, abstract, final, volatile, transient, native, strictfp, and synchronized (as a modifier) could be macros that add in equivalent annotations, eg, static with @Static, protected with @Access("Protected"), etc, depending on the lexical context of the macro. Because interface acts as a modifier of an implied class, it could be a macro that adds in @Interface class.
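To make the idea concrete, here is a sketch in Java of what the expanded form might look like. The annotations `@Static` and `@Access` do not exist in Java; they are declared here as hypothetical stand-ins, and the reflective `describe` method merely plays the role of a compiler reading them before any other annotation processor.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;

class ModifierAnnotations {
    // Hypothetical annotation that the "static" macro would add in.
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.METHOD)
    @interface Static {}

    // Hypothetical annotation that "private", "protected", and "public" would add in.
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.METHOD)
    @interface Access { String value(); }

    static class Sample {
        // Imagined expansion of: protected static void helper() {}
        @Static
        @Access("Protected")
        void helper() {}
    }

    // Stand-in for the compiler's first pass: read the modifier annotations.
    static String describe(Class<?> type, String methodName) {
        try {
            Method m = type.getDeclaredMethod(methodName);
            boolean isStatic = m.isAnnotationPresent(Static.class);
            Access access = m.getAnnotation(Access.class);
            return "static=" + isStatic + ", access=" + (access == null ? "default" : access.value());
        } catch (NoSuchMethodException e) {
            return "no such method";
        }
    }

    public static void main(String[] args) {
        // prints: static=true, access=Protected
        System.out.println(describe(Sample.class, "helper"));
    }
}
```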

The Verbs: Many of the keywords are at the beginning of the line, and look like commands, function calls, closure calls, etc. By leaving those keywords in the language but also allowing them to be used as names, the compiler could determine from the context which usage was intended. From the programmer's point of view, a while{ ... } statement would be no different to the use{ ... } closure call, and the assert ... statement no different to the println ... call. Eligible keywords from Java are: for, while, do, if, else, switch, case, default, try, catch, finally, return, throw, break, continue, package, import, class, assert, and synchronized (as a block header). They could even look like they're defined in a standard library class, eg, mylang.lang.System.for(...), mylang.lang.System.while(...), etc. To enable this, the language would need to allow closures with multi-name syntax, eg, myIf(...) myElse(...). Such keywords could then be internationalized by programmers using the same mechanism as for names in standard libraries. (Though in fact, some of these verb-keywords may be definable in terms of others, eg, default defined as case Object || null.)
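Verb-keywords as ordinary library calls can be approximated in today's Java with lambdas standing in for closures. The sketch below is an assumption-laden illustration: the names `ControlVerbs`, `whileLoop`, and `ElseClause` are invented, the bodies still lean on Java's real `while` and `if`, and the multi-name form `myIf(...) myElse(...)` is approximated as a method chain `myIf(...).myElse(...)`.

```java
import java.util.function.BooleanSupplier;

class ControlVerbs {
    // The second half of the multi-name call myIf(...).myElse(...).
    interface ElseClause {
        void myElse(Runnable elseBody);
    }

    // "while" as a plain method taking a condition closure and a body closure.
    static void whileLoop(BooleanSupplier condition, Runnable body) {
        while (condition.getAsBoolean()) {
            body.run();
        }
    }

    // "if" runs its body when the condition holds, then hands back the else part.
    static ElseClause myIf(boolean condition, Runnable thenBody) {
        if (condition) {
            thenBody.run();
        }
        return elseBody -> {
            if (!condition) {
                elseBody.run();
            }
        };
    }

    // Usage: count up to a limit using the library-style loop.
    static int countTo(int limit) {
        int[] counter = {0};
        whileLoop(() -> counter[0] < limit, () -> counter[0]++);
        return counter[0];
    }

    // Usage: pick a description using the library-style if/else.
    static String sign(int n) {
        StringBuilder result = new StringBuilder();
        myIf(n < 0, () -> result.append("negative"))
            .myElse(() -> result.append("non-negative"));
        return result.toString();
    }

    public static void main(String[] args) {
        System.out.println(countTo(3)); // prints 3
        System.out.println(sign(-5));   // prints negative
    }
}
```

Since `whileLoop` and `myIf` are just names in a class, they could be renamed or aliased in another natural language by the same mechanism used for any other library name.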

As for the other keywords: extends and implements can be distinguished from their context, and could be replaced with a colon. So could throws. const and goto could finally be retired. new and instanceof could each be macros that add in some alternative symbols. (Some languages with Java syntax do the opposite, adding in new keywords, such as as and in, as alternatives to symbols in the language.) We could eliminate this and super as keywords by considering a class to be divided into an outer, static portion, and an inner, instantiable portion, borrowing an idea from the Scala language. The static modifiers would be absent from the outer portion, and the inner portion would be bracketed with object this extends super{ ... }, where this and super are simply defaults for any names a programmer might choose. The current class definition syntax would expand as a macro to this new syntax, and the new object keyword would be a verb-keyword, just like class.

We could thus internationalize the keywords for a Java-style language which uses keywords and names with the same syntactic form.