Saturday, June 23, 2007

Programming in Unicode

(republished portion)

Unicode, used by both Java and Windows, now has about 100,000 characters: alphabetic letters, unified CJK (Chinese, Japanese, and Korean) characters, digits, symbols, punctuation, and more. But computer programs are still written using a mere 100 or so tokens, the printable ASCII characters. Other characters are difficult to key in, and most programmers don't know other alphabets. Yet in a few years, using the full range of Unicode characters in programs may be standard. Mathematics uses many more tokens than programming languages do, both dedicated symbols and letters from many alphabets, and it can describe concepts extremely tersely: the greater the range of tokens a language has, the terser its expressions can be. Programming, by contrast, is limited to the 100-odd tokens on the keyboard. Many people can type those characters faster than they can write them by hand, yet can write thousands more that they can't type.

Committed programmers are continually looking for ways to make programs terser yet still readable. They choose languages and tools that enable such tersity, and so programming languages evolved: the 2GL (assembler), the 3GL, and the visual 4GL. But 4GLs were limited in their scalability and readability: it's easier to write a program in a 4GL than in a 3GL, but harder to read and debug it. So some programmers turned to IDEs, supplementing 3GLs with visual aids. Others looked for a more productive language, and terser languages such as Perl, Python, and Ruby became popular. Regular expressions are a successful attempt at tersity, now used by many languages, though many consider them unreadable. The K programming language, used in the financial industry, may be the tersest language ever invented. It uses only ASCII symbols, but overloads them profusely. The price is that operators can't be given different precedences, so everything unbracketed is evaluated from the right: 2*3+4 groups as 2*(3+4), giving 14 rather than 10. The tersity of present-day programming languages comes from maximizing the use of grammar, the different ways tokens can be combined. The same 100 tokens are used.
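To make K's right-to-left rule concrete, here is a minimal Java sketch contrasting the two grouping conventions. Java itself uses conventional operator precedence, so the K-style result is simulated with explicit brackets:

```java
public class Grouping {
    public static void main(String[] args) {
        // Java, like most 3GLs, gives * higher precedence than +,
        // so 2*3+4 groups as (2*3)+4.
        int conventional = 2 * 3 + 4;    // 10

        // K has no operator precedence: an unbracketed expression
        // is evaluated from the right, grouping 2*3+4 as 2*(3+4).
        int rightToLeft = 2 * (3 + 4);   // 14, the grouping K would apply

        System.out.println(conventional + " vs " + rightToLeft);
    }
}
```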

Perhaps adding the many Unicode symbols to programming languages would enable terser programs to be written. Operator overloading in C++ was a similar attempt at tersity: programmers could define meanings for combinations of the thirty-odd ASCII symbol characters. Although programs became terser, they became harder to understand, because the meaning of these symbols was unpredictable from one code context to the next, and the feature was eventually left out of Java. The problem wasn't operator overloading itself, but the uncontrolled association of meanings with each operator. Certain meanings would eventually have become generally accepted, the others falling into disuse, but that would have taken many years, with too many incompatible uses produced in the meantime. If there was such a problem with a few dozen operators, what hope would there be for the hundreds of unused Unicode symbols? If programmers were allowed to overload them with any meaning, the increase in program tersity would come at the cost of readability. Although some Unicode symbols have an obvious meaning, such as many math symbols, most have no meaning that transfers easily to the programming context. To keep programs in a terse language readable, the meanings of the Unicode symbols would have to be carefully controlled by the custodians of that language: new symbols activated only at a gradual pace, with their meanings controlled, after careful consideration of how those symbols are already used.

Programming languages do, however, already allow Unicode characters in some parts of a program. The contents of strings and comments can use any Unicode character. User-defined names can use all the alphabetic letters and CJK characters, and because agreed meanings for combinations of these already exist, derived from their respective natural languages, we can increase tersity while keeping readability. But the core of the language, the grammar keywords and symbols, and the names in supplied libraries still use only ASCII characters. Perhaps some programmers use non-Latin characters wherever they can in their programs, but a browse through the computer shelves of a typical bookshop in mainland China suggests they do so only for comments and the contents of strings, not for user-defined names.
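Java illustrates this split well: identifiers, strings, and comments may use Unicode freely, while the keywords and library names remain ASCII. A minimal sketch (the Greek and CJK names here are arbitrary illustrations):

```java
public class Unicode名前 {                  // CJK characters in a class name
    public static void main(String[] args) {
        double π = 3.14159;                 // Greek letter as a variable name
        double 半径 = 2.0;                  // "radius" in Japanese
        String greeting = "你好, κόσμος!";  // any Unicode in string contents
        // Comments too: コメント, σχόλιο, комментарий
        System.out.println(greeting + " area = " + (π * 半径 * 半径));
        // But the keywords (class, static, void) and the library names
        // (String, System.out.println) are still ASCII-only English.
    }
}
```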

Programmers from cultures that don't use the Latin alphabet won't be motivated to use their own alphabets in user-defined names when the pre-supplied names, such as keywords and standard library names, don't use them. Often, most of the names in a program come from the language's standard libraries. To trigger widespread use of non-ASCII alphabets in programs, the pre-supplied names must also be available in those alphabets. And this could easily be done. The grammar of a language and its vocabulary are two different concepts: a programming language grammar could conceivably have many vocabulary options. Almost all programming languages offer only English. Other vocabularies could be based on other natural languages. A Spanish vocabulary plugged into a certain programming language would have Spanish names for the keywords, modules, classes, methods, variables, and so on, as sketched below.
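No mainstream language has such a pluggable vocabulary today, but a crude approximation can be sketched in Java by wrapping standard library names under Spanish ones. All the Spanish names here are hypothetical illustrations, not part of any real library, and note that the keywords themselves (class, void, new) can't be translated this way:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical Spanish vocabulary layer over java.util.List:
// Spanish-named methods simply delegate to the English originals.
class Lista<T> {
    private final List<T> delegado = new ArrayList<>();

    void añadir(T elemento) { delegado.add(elemento); }       // add
    T obtener(int índice)   { return delegado.get(índice); }  // get
    int tamaño()            { return delegado.size(); }       // size
}

public class Demostración {
    public static void main(String[] args) {
        Lista<String> nombres = new Lista<>();
        nombres.añadir("Ana");
        nombres.añadir("Luis");
        System.out.println(nombres.tamaño() + " nombres, primero: " + nombres.obtener(0));
    }
}
```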

Computer software and webpages nowadays are internationalized, and most programming languages make it possible to write internationalized software. But the languages themselves and their libraries are not internationalized. An internationalized programming language would let a fully functional program be written entirely in a natural language of one's choice: not only could all user-defined names be specified in any alphabet, but all keywords and names in the standard libraries would be available in many natural languages. Ideally, when a software supplier ships a library, they would specify the public names in many languages in its definition. But uptake of that is likely to be slow, so languages must make it easy to translate a library incrementally from one natural language into another. Some languages let programmers use mixins and interceptors to do this to various degrees, and some could conceivably allow a preprocessor, pluggable lexer, or closures to internationalize the keywords. But full foreign-language support must be a declared aim of a language's development.
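Java has neither mixins nor a pluggable lexer, but default methods on an interface (Java 8 and later) give a rough flavour of incremental translation: a library class is wrapped once, and translated names can be added method by method. The German names here are hypothetical illustrations:

```java
// A mixin-like layer, sketched as a default-method interface:
// translated names delegate to the English originals and can be
// added one at a time, so translation can proceed incrementally.
interface AufDeutsch {
    StringBuilder puffer();  // the wrapped English-named object

    default void anhängen(String s) { puffer().append(s); }       // append
    default int länge()             { return puffer().length(); } // length
    // Untranslated methods remain reachable through puffer().
}

public class Beispiel implements AufDeutsch {
    private final StringBuilder sb = new StringBuilder();
    public StringBuilder puffer() { return sb; }

    public static void main(String[] args) {
        Beispiel b = new Beispiel();
        b.anhängen("Grüß ");
        b.anhängen("Gott");
        System.out.println(b.länge() + ": " + b.puffer());
    }
}
```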

Internationalized programming languages are presently rare, but they will follow the trend of the software they're used to write: soon enough, most programming languages will be internationalized. The first libraries to be translated will probably be the core Java class libraries, and the first target language will probably be Simplified Chinese.
