Friday, March 30, 2007

Programming Language Fluency

Steven Pinker, in his book "Words and Rules", suggests that a natural language such as English is stored in the human brain as words (vocabulary) and rules (grammar). As children we learn words such as "cat", "cats", "dog", "dogs", "lion", and "lions". Our brains then recognize a pattern, remember the rule "add -s for plural", and forget the individual plural words. After learning more words such as "tiger", "elephant", and "mouse", our brains also remember the exceptions to the rule, such as "mice" being the plural of "mouse". Most rules of English, however, are about joining words together to generate a phrase or clause. And sometimes a word group is remembered as vocabulary when it can't be generated using a rule, e.g., phrasal verbs like "put up with" and idioms like "pulling one's leg".

Computer languages have many similarities to natural languages. The syntax is the grammar, and the keywords, libraries, commands, macros, etc. are the vocabulary. Computer languages let us define our own variable names, just as natural languages let us invent new proper nouns. Comments are like interjections, strings are like foreign words, expressions are phrases, statements are clauses, closures are relative clauses, blocks are sentences, and semicolons and commas are conjunctions. Methods and functions are verbs, interfaces and annotations are adjectives, and classes and properties are nouns: some packages’ classes are commonly used nouns, while others’ are less common technical words.

Over the last few years, powerful "scripting" languages such as Python, Ruby, and Groovy have become more popular. They follow the dead end in programming language evolution known as "Fourth Generation Languages" (4GLs). It was easier to write a program in a visual 4GL, but far more difficult to read and debug it. Most people prefer to use a combination of both language and visual design, but it’s easier to decorate language with visuals (e.g., a program listing using indentation, with IDE color highlighting) than visuals with language (e.g., a form layout with code snippets on each element). Human brains are wired for language, whether natural or programming, hence the recent rise in popularity of scripting languages.

But scripting languages do have some drawbacks that prevent them from working well with an IDE. Dynamic typing means an IDE often can’t offer useful code auto-completion, because the type of a variable often isn’t known until the program runs. So to use a dynamically-typed language well, we must be able to recall the most common methods and properties of the most common classes.
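To make the point concrete, here is a rough sketch in Java, chosen only because its class libraries are the ones Groovy builds on. The names UnknownType, typedLength, and dynamicCall are invented for illustration, and the reflection lookup is only an analogy for a dynamic call, not how Groovy actually dispatches methods.

```java
import java.lang.reflect.Method;

public class UnknownType {
    // Declared type is String, so the compiler (and an IDE) knows which
    // methods exist and can suggest length(), toUpperCase(), and so on.
    static int typedLength(String text) {
        return text.length();
    }

    // Declared type is only Object, so nothing beyond Object's own methods
    // can be suggested. The method is found by name when the call happens,
    // which means the programmer has to remember that name.
    static Object dynamicCall(Object target, String methodName) {
        try {
            Method method = target.getClass().getMethod(methodName);
            return method.invoke(target);
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException("no usable method: " + methodName, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(typedLength("hello"));
        System.out.println(dynamicCall("hello", "length"));
    }
}
```

In the second method, a misspelt name such as "lenght" would only fail when the program runs, which is the situation a dynamically-typed language puts us in all the time.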

Imagine we can speak English with the correct grammar words, stresses, pauses, tones, etc., but don’t know a substantial portion of the vocabulary. To help us, we keep a friend with us always when we go out. So when speaking to a waiter at a restaurant, we say “There’s a”, then turn to our friend for a hint. Our friend tells us “fly”, so we say to the waiter “fly in my”, turn to our friend who tells us “soup”, then tell the waiter “soup!” This friend is just like the IDE auto-completers we use when programming. It’s unnatural to know far more language grammar than vocabulary, and the increasing uptake of dynamically typed languages could correct this imbalance in our knowledge of programming languages.

Young babies learn the vocabulary of their natural language faster than the grammar. Adult foreign language classes focus on topics, primarily requiring learning vocabulary and expressions about the topic. But those learning programming learn language grammar, and keep a class library reference book handy or use a memory-clogging GUI with auto-completion. A programming language should be learnt just like a natural language: grammar and vocabulary together. Perhaps the faulty learning method is why only certain types of people learn to use such languages well. If programming languages were taught properly, perhaps everyone would be able to learn them, just as they learn to count. Perhaps dynamically typed languages that don’t work well with IDE auto-completers could be the first programming languages to be learnt as all languages, whether natural or computer, should be: grammar and vocabulary together. Programmers must become fluent in the language, not just learn about the language.

I'm working on such a tutorial for the Groovy Programming Language. Java programmers are Groovy’s target market, but those who have never programmed in Java can benefit from learning Groovy first: they can be productive sooner. Perhaps they'll go on to learn and work with Java, or maybe they'll find Groovy to be sufficient for all their needs. Either way, Groovy should be easily learnable by them. The Java class libraries, both standard and enterprise editions, are huge, so I've drawn a rough circle around groovy.lang, groovy.util, java.lang, java.util, java.math, java.text, java.util.regex, and maybe also java.lang.reflect and java.util.concurrent. They seem to be the "core" packages and I'll consider them to be the "vocabulary" of Groovy. In my first pass through them, I'll focus on completeness of information.

The tutorial will pay attention to those whose native language isn’t English. Rather than relying on many translations eventually going out of sync with one another, it’s best the tutorial be in English, because most foreign programmers can read English, if not speak it. But the tutorial must use a foreigner-readable style and internationally-known words. The focus of learning must be on code examples, not wordy explanations.

The tutorial will also aim to present topics in the best sequence. People learning Groovy want to be productive as quickly as possible, and so should often learn things in the opposite sequence to people who learnt Java first. Just because closures, categories, interceptors, and builders are more recent concepts doesn’t mean learners should learn them later. Perhaps they need to learn closures before functions, functions and expandos before classes, collections before arrays, encapsulation and static members before instances, categories and interceptors before the complexity of static typing, inheritance, method overriding, multi-methods, casting, etc.

The tutorial will treat commonly-used extensions as an integral part of the language. Regex is an extension to both Java and Groovy, and its syntax is quite different to Groovy’s. But they are meant to be used together, and should be learnt together. There's no conceptual difference between a math expression and a regex pattern. Only because one's quoted and the other isn't in the syntax do many not consider regexes to be part of the language. But programmers must be comfortable using both the procedural and sometimes lispy styles of base Groovy and the more declarative style of regexes, printf format strings, or any other commonly-used extension.
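As a small illustration of how these quoted "mini-languages" sit beside ordinary syntax, here is a sketch in Java, using only java.util.regex and String.format from the packages circled earlier. The class name MiniLanguages, the method names, and the "3x4" size notation are all made up for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MiniLanguages {
    // A math expression: part of the base syntax, written unquoted.
    static int area(int width, int height) {
        return width * height;
    }

    // A regex pattern: a second small language, written inside quotes.
    // It matches digits, the letter x, then more digits, e.g. "3x4".
    static final Pattern SIZE = Pattern.compile("(\\d+)x(\\d+)");

    static String describe(String size) {
        Matcher matcher = SIZE.matcher(size);
        if (!matcher.matches()) {
            return "not a size: " + size;
        }
        int width = Integer.parseInt(matcher.group(1));
        int height = Integer.parseInt(matcher.group(2));
        // A printf-style format string: a third small language, also quoted.
        return String.format("%d by %d = %d", width, height, area(width, height));
    }

    public static void main(String[] args) {
        System.out.println(describe("3x4"));   // prints "3 by 4 = 12"
        System.out.println(describe("wide"));  // prints "not a size: wide"
    }
}
```

All three notations do their work in the same few lines, so a learner who knows only the unquoted one can read just a third of the method.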

I'll firm up the sequence of topics during my second pass through the information, when I expand the examples and explanations. For now, I've drawn a rough circle around these ones:
  • Getting started - basic concepts needed for each subsequent tutorial
  • integers, BigDecimal, floating-points
  • enums
  • dates & times
  • Collections, Queues/Deques, Arrays, Maps
  • Strings, Characters, regexes
  • input/output, networking
  • blocks, Closures, functions
  • Expandos, classes, categories
  • static typing
  • interfaces
  • inheritance, method overriding, multi-methods
  • method-to-syntax mappings (in, as, operator overloading)
  • exceptions
  • permissions
  • annotations
  • multi-threading
  • tree processing, builders, XML
  • interceptors, ExpandoMetaClass
  • packages, evaluate, class loading
  • reflection
  • internationalization
  • Further learning - pointers to Java, IDEs, Swing, Grails, Gant, etc.
I'm expecting all this to take me until sometime next year.
