Tuesday, March 13, 2007

Internationalizing Keywords

Some programming languages, such as Perl and PHP, distinguish variable names from other language tokens by prefixing them with a special character. Sometimes that character also indicates something about the type of the variable, such as $i to indicate a scalar value, or @arr for an array. This is handy because the language can then use any word beginning with an alphabetic character as a keyword. Other languages allow variables to begin with any alphabetic character, but not to be one of the keywords of the language, eg, no variable may be called class in Java.

It would be quite easy to internationalize the keywords in a programming language of the first type. A context-sensitive lexical preprocessor could simply replace the native-language-specific keywords with the English ones, eg, 作 with do, 回 with return, etc for Chinese. For the second type of programming language, if programs used, say, Chinese words as keywords, then programmers should be free to use the now-unused English keywords as variable names. To allow this, before the preprocessor converts the Chinese keywords to the equivalent English ones, it must somehow mangle names that are English keywords into a legal alternative. In some languages, such as Groovy, this would be as simple as quoting the name, eg, method.'static'.params instead of method.static.params. In other languages, the mangling is more difficult. In any language with sizable libraries, there'd also be inter-module and namespace issues to deal with.
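To make the two steps concrete, here is a minimal sketch of such a preprocessor in Python. It assumes a lot: the keyword table holds only the two mappings mentioned above, the list of English keywords is a small illustrative subset, and the mangling scheme (appending an underscore) is my own placeholder. It is also purely lexical rather than context-sensitive, so it would wrongly rewrite words inside string literals and comments, and it ignores the possibility that a mangled name collides with an existing one.

```python
import re

# Hypothetical keyword table: only the two mappings mentioned in the post.
CHINESE_TO_ENGLISH = {"作": "do", "回": "return"}

# Illustrative subset of the target language's English keywords.
ENGLISH_KEYWORDS = {"do", "return", "class", "static", "while"}

# A token is either a run of word characters or one non-word character.
TOKEN = re.compile(r"\w+|\W")


def mangle(name):
    # Assumed mangling scheme: a trailing underscore makes the name legal.
    return name + "_"


def preprocess(source):
    out = []
    for tok in TOKEN.findall(source):
        if tok in CHINESE_TO_ENGLISH:
            # A Chinese keyword becomes its English equivalent.
            out.append(CHINESE_TO_ENGLISH[tok])
        elif tok in ENGLISH_KEYWORDS:
            # An English keyword in the source can only be a name: mangle it.
            out.append(mangle(tok))
        else:
            out.append(tok)
    return "".join(out)
```

With these assumptions, preprocess("作 { 回 class; }") gives "do { return class_; }": the Chinese keywords are translated, and class, used here as a variable name, is mangled so the compiler no longer sees a keyword.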

Another way a language of the second type could internationalize its keywords is to replace all its keywords with symbols and punctuation, then use a preprocessor on all programs. The keywords would be macros that add in these symbols. The default preprocessor would be the English language one, but any could be used. To use keywords in another language, we could exchange the default preprocessor for one in that other language, eg, a Spanish one. The default preprocessor would be conceptually separate from the compiler, but could in fact be tightly coupled at the implementation-level to provide more efficiency to those using keywords in English, the default natural language. In token usage, such a language is the mirror image of one of the first type: here words beginning with an alphabetic character are always names, whereas there they are always keywords.
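A rough sketch of the swappable preprocessor, again in Python. The symbols (~~, ??, <-) are invented purely for illustration, since I haven't settled on any, and the Spanish keyword choices are likewise just assumptions. Each natural language supplies its own macro table, and the symbolic core language never changes.

```python
import re

# Hypothetical symbolic core syntax; these symbols are placeholders.
ENGLISH = {"while": "~~", "if": "??", "return": "<-"}
SPANISH = {"mientras": "~~", "si": "??", "devolver": "<-"}

# A word starts with a letter or underscore, followed by word characters.
WORD = re.compile(r"[^\W\d]\w*")


def make_preprocessor(keyword_macros):
    """Build a preprocessor that expands one natural language's keywords."""
    def expand(source):
        # Words found in the table expand to symbols; all others are names.
        return WORD.sub(
            lambda m: keyword_macros.get(m.group(0), m.group(0)), source
        )
    return expand


english = make_preprocessor(ENGLISH)  # the default preprocessor
spanish = make_preprocessor(SPANISH)  # a drop-in replacement
```

Both english("while (x) return y;") and spanish("mientras (x) devolver y;") produce the same symbolic program, and under the Spanish preprocessor the word while is just another name.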

As a case study, I'll look at how a hypothetical language with Java's keywords could replace them with ASCII symbols and punctuation. I'll divide them into nouns, verbs, and adjectives/adverbs as far as possible.

The Nouns: With autoboxing and unboxing, the keywords for void, char, int, long, short, byte, float, double, and boolean could be macros that add in java.lang.Void, java.lang.Character, java.lang.Integer, etc, and the semantics would be unchanged. true and false could be replaced with java.lang.Boolean.TRUE and java.lang.Boolean.FALSE. If the Null type comes in Java 7, null could similarly be replaced.

The Adjectives: The modifiers could be considered to be annotations that the compiler sees first, before any "other" annotation processor. So static, private, protected, public, abstract, final, volatile, transient, native, strictfp, and synchronized (as a modifier) could be macros that add in equivalent annotations, eg, static with @Static, protected with @Access("Protected"), etc, depending on the lexical context of the macro. Because interface acts as a modifier of an implied class, it could be a macro that adds in @Interface class.
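Here is a small Python sketch of the modifier macros. Only @Static, @Access("Protected"), and @Interface class come from the paragraph above; the other annotation names are my guesses by analogy. The one piece of lexical context it checks is whether synchronized is followed by an opening parenthesis, in which case it is a block header rather than a modifier and is left alone. String literals and comments are not handled.

```python
import re

MODIFIER_MACROS = {
    "static": "@Static",                  # from the post
    "protected": '@Access("Protected")',  # from the post
    "interface": "@Interface class",      # from the post
    "private": '@Access("Private")',      # assumed by analogy
    "public": '@Access("Public")',        # assumed by analogy
    "final": "@Final",                    # assumed by analogy
    "synchronized": "@Synchronized",      # assumed by analogy
}

WORD = re.compile(r"[A-Za-z_]\w*")


def expand_modifiers(source):
    def replace(match):
        word = match.group(0)
        rest = source[match.end():].lstrip()
        if word == "synchronized" and rest.startswith("("):
            # Block header, not a modifier: leave it untouched.
            return word
        return MODIFIER_MACROS.get(word, word)
    return WORD.sub(replace, source)
```

For example, expand_modifiers("protected static int count;") gives '@Access("Protected") @Static int count;', while "synchronized (lock) { }" passes through unchanged.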

The Verbs: Many of the keywords are at the beginning of the line, and look like commands, function calls, closure calls, etc. By leaving those keywords in the language but also allowing them to be used as names, the compiler could determine from the context which usage was intended. From the programmer's point of view, a while{ ... } statement would be no different to the use{ ... } closure call, and the assert ... statement no different to the println ... call. Eligible keywords from Java are: for, while, do, if, else, switch, case, default, try, catch, finally, return, throw, break, continue, package, import, class, assert, and synchronized (as a block header). They could even look like they're defined in a standard library class, eg, mylang.lang.System.for(...), mylang.lang.System.while(...), etc. To enable this, the language would need to allow closures with multi-name syntax, eg, myIf(...) myElse(...). Such keywords could then be internationalized by programmers using the same mechanism as for names in standard libraries. (Though in fact, some of these verb-keywords may be definable in terms of others, eg, default defined as case Object || null.)
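The idea that verb-keywords could look like library functions taking closures can be imitated in Python, with lambdas standing in for closures. This is only an analogy: the names my_if, my_else, and my_while are hypothetical (modelled on myIf and myElse above), the Spanish aliases are my own choices, and method chaining stands in for real multi-name closure syntax.

```python
class _IfResult:
    """Carries the outcome of my_if so that my_else can be chained onto it."""

    def __init__(self, taken, value):
        self.taken = taken
        self.value = value

    def my_else(self, otherwise):
        # Run the else-closure only if the if-branch was not taken.
        if not self.taken:
            self.value = otherwise()
        return self.value


def my_if(condition, then):
    # The then-closure runs only when the condition holds.
    if condition:
        return _IfResult(True, then())
    return _IfResult(False, None)


def my_while(condition, body):
    # Both arguments are closures, so the condition is re-evaluated each time.
    while condition():
        body()


# Internationalizing a verb-keyword is then ordinary aliasing,
# the same mechanism a programmer would use for any library name.
si = my_if
mientras = my_while
_IfResult.si_no = _IfResult.my_else
```

A call such as si(x > 0, lambda: "positive").si_no(lambda: "not positive") then reads like a two-part statement, yet both si and si_no are ordinary names that could be renamed again for any other natural language.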

As for the other keywords: extends and implements can be distinguished from their context, and could be replaced with a colon. So could throws. const and goto could finally be retired. new and instanceof could each be macros that add in some alternative symbols. (Some languages with Java syntax do the opposite, adding in new keywords, such as as and in, as alternatives to symbols in the language.) We could eliminate this and super as keywords by considering a class to be divided into an outer, static portion, and an inner, instantiable portion, borrowing an idea from the Scala language. The static modifiers would be absent from the outer portion, and the inner portion would be bracketed with object this extends super{ ... }, where this and super are simply defaults for any names a programmer might choose. The current class definition syntax would expand as a macro to this new syntax, and the new object keyword would be a verb-keyword, just like class.

We could thus internationalize the keywords for a Java-style language which uses keywords and names with the same syntactic form.
