Wednesday, September 12, 2007

Internationalized Programming

(reposted portion)

One day programmers will choose from all 100,000 or so Unicode tokens to use in their programs, not just the ASCII ones. The tersity of present-day programming languages is derived from maximizing the use of grammar, the different ways tokens can be combined. Yet the programs are limited to the 100 or so tokens on the keyboard. In the future, languages will also maximize the use of vocabulary to enable tersity, because the greater the range of tokens a language has, the terser it can be written. Unicode supplies many more tokens than ASCII. It contains characters from the scripts of most natural languages, including 70,000 unified CJK (Chinese, Japanese, and Korean) characters. An internationalized programming language enables a fully functional program to be written totally in a natural language of one's choice. Such internationalized programming languages are presently rare, but they will follow the trend of the software they're used to write. Soon enough, most programming languages will be internationalized. I've blogged about this before, but what about the effects of internationalized programming languages being available?

I suspect most natural languages wouldn't actually be used with internationalized programming: there's no real reason to. Many people going to college in non-English-speaking countries can read English, even if they can't speak or write it, and so can use a programming language's libraries easily. Typing the library names into programs involves skill in spelling, not in writing, and so such people can program in English easily enough. And with auto-completers in IDEs, programmers really only need to know the first few letters of a name. Unless there's a strong nationalist movement promoting the native tongue of the country, writing foreigner-readable programs in English will be more important.

Maybe in Northeast Asia, an internationalized programming language could take off. The scripts used there, traditional Chinese, simplified Chinese, Japanese, and Korean, have many more tokens than alphabetic scripts, and so enable programs to be written much more tersely. The Chinese written language has tens of thousands of characters. In fact, 80% of Unicode characters are CJK or Korean-only characters. However, the kanji used for writing Japanese and the traditional Chinese characters (used in Hong Kong, Taiwan, and Chinatowns) must be read at a much larger font size than English, which would cancel out the benefits of using them. This is possibly why the Japanese-invented Ruby only uses English alphabetic library names. Korean can be read at the same font size as English, but, unlike Chinese and Japanese characters, it's really an alphabetic script. There are only 24 basic letters in the Korean alphabet, and they're arranged in a square block to form each syllable, instead of one after the other as in English. Unicode simply provides Korean with the choice to be coded letter by letter, or by the square-shaped syllable. Thus Korean is potentially terser than alphabetic languages, but not as terse as simplified Chinese, the script used in mainland China.
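The two ways of coding Korean can be seen with Python's standard unicodedata module. This is only a small sketch; the choice of Python and of the sample syllable 한 is mine, not something the post depends on.

```python
import unicodedata

# The syllable 한 stored as one precomposed, square-shaped code point (U+D55C).
syllable = "한"

# NFD normalization splits the block into its individual letters (jamo).
letters = unicodedata.normalize("NFD", syllable)

# NFC normalization recombines the letters into the single syllable block.
recomposed = unicodedata.normalize("NFC", letters)

print(len(syllable), len(letters))             # 1 3
print([f"U+{ord(c):04X}" for c in letters])    # the three letter code points
print(recomposed == syllable)                  # True
```

The same text therefore costs one token per syllable in the precomposed form and one token per letter in the decomposed form.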

In the 1950s, mainland China's government replaced hundreds of commonly-used but complexly-written Chinese characters with versions that use far fewer strokes. These simplified characters, also used in Singapore, can be read at the same font size as alphabetic letters and digits. Written Chinese using simplified characters takes up half the page space of written English, and can be condensed even further by using proverb writing style and text-messaging abbreviations. Not all simplified characters can be read at the same font size as alphabetic characters, but the thousands that can enable far greater tersity than 26-letter alphabets. A non-proportional font would enable many horizontally-dense characters, both simplified and traditional, to be read at a normal font size also, though that would be a radical departure from the historical square shape of characters.

Chinese characters are each composed of smaller components, recursively in a treelike manner. When breaking them down into their atomic components, the choice of what's atomic is arbitrary because all components can be broken down further, all the way to individual strokes, whereas in Korean it's clear which component is atomic. A common way of breaking Chinese characters into components gives over 400 of them in simplified Chinese, over 600 in traditional. Some components in a character give a rough idea of the pronunciation in some Chinese dialects, while others, the radicals, give an idea of the meaning. A certain sequence of Chinese components can often be arranged in more than one manner to form a character, unlike Korean where a certain sequence of letters can be arranged into a square in one way only. The arrangement is as much a part of a Chinese character as the components themselves. Also, any two components can combine together into another component in a variety of ways, such as across, downwards, diagonally, repeating into different shapes, reflecting, surrounding another around different sides or corners, being inserted between or within others, melding or touching together, being threaded, overlapping or hooking in various ways, or being modified by strokes. These many ways of combining components, reflecting the pictorial roots of Chinese characters, provide another dimension of lexical variation that increases the potential tersity of written Chinese in computer programs.
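The treelike structure, and the way arrangement matters as much as the components, can be sketched with Unicode's ideographic description characters such as ⿰ (left to right) and ⿱ (above to below). The tiny decomposition table below is hand-written for illustration and is an assumption of this sketch, not a standard data file; where to stop breaking a character down is, as noted above, an arbitrary choice.

```python
# Hand-written sample decompositions (illustrative only).
# ⿰ means "left to right", ⿱ means "above to below".
DECOMPOSITIONS = {
    "森": "⿱木林",   # one tree above two trees
    "林": "⿰木木",   # two trees side by side
    "杏": "⿱木口",   # tree above mouth
    "呆": "⿱口木",   # mouth above tree
}

def expand(sequence: str) -> str:
    """Recursively replace every character that has a known decomposition."""
    return "".join(
        expand(DECOMPOSITIONS[ch]) if ch in DECOMPOSITIONS else ch
        for ch in sequence
    )

print(expand("森"))                                   # ⿱木⿰木木
# 杏 and 呆 use the same components in a different arrangement.
print(sorted(expand("杏")) == sorted(expand("呆")))   # True
print(expand("杏") == expand("呆"))                   # False
```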

A terse programming language and a tersely-written natural language used together mean greater semantic density, more meaning in each screenful or pageful, hence it's easier to see and understand what's happening in the program. If only 3000 of the simplest-written 70,000 CJK characters in Unicode are used, there are 3000 × 3000, or nine million, possible two-character names. Imagine the reduction in code sizes if Chinese programmers mapped them uniquely onto every package, class, method, field, and annotation name in the entire Java standard and enterprise edition class libraries. Just as Perl, Python, and Ruby are used because of the tersity of their grammar, so also Chinese programming will eventually become popular because of the tersity of its vocabulary. One day using the tersity of Chinese characters in programming languages will be of more value to mainland Chinese programmers than writing foreigner-readable programs in English, and when they decide to switch, it'll be over in a year.
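The counting behind that claim is simple enough to check. The sketch below computes how many distinct two-character names 3000 characters allow; the comparison with a 26-letter alphabet is my own addition, and it counts arbitrary letter strings rather than readable English names.

```python
def name_count(alphabet_size: int, length: int) -> int:
    """Number of distinct names of exactly `length` tokens."""
    return alphabet_size ** length

def shortest_length(alphabet_size: int, names_needed: int) -> int:
    """Smallest fixed name length that offers at least `names_needed` names."""
    length = 1
    while name_count(alphabet_size, length) < names_needed:
        length += 1
    return length

cjk_pairs = name_count(3000, 2)
print(cjk_pairs)                        # 9000000
print(shortest_length(26, cjk_pairs))   # 5
```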

If a programming language enabled multi-vocabulary programming, not only could Chinese programmers mix Chinese characters with the Latin alphabet in their programming, but so could Western programmers. Committed programmers want to write terse programs, and will experiment with new languages and tools at home if they can't in their day jobs. They'll begin learning and typing Chinese characters if it reduces clutter on the screen, if there are generally available Chinese translations of the names, and if they can enter the characters easily. IMEs (input method editors) and IDE auto-completers work in similar ways. With an IME, someone types the sound or shape of the character, then selects the character from a popup menu of other similar-sounding or similarly-shaped ones. Instead of two different popup systems, each at a different level of the software stack, IMEs for programming could be combined into IDE auto-completers. Entering terser Chinese names for class library names would then be as easy as entering the English names, limited only by how quickly a Western programmer could learn to recognize the characters in the popup box. They would incrementally learn more Chinese characters, simply to program more tersely.
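A combined IME and auto-completer might look something like the following rough sketch. Everything here is hypothetical: the `Entry` type, the `complete` function, and the five-name table with its toneless pinyin spellings are invented for illustration and do not come from any real IDE or class library.

```python
from typing import NamedTuple

class Entry(NamedTuple):
    english: str   # the existing library name
    pinyin: str    # the sound of the Chinese name, typed without tones
    chinese: str   # the terser name shown in the popup

# Hypothetical translation table; a real one would cover a whole class library.
NAMES = [
    Entry("length", "changdu", "长度"),
    Entry("list", "liebiao", "列表"),
    Entry("sort", "paixu", "排序"),
    Entry("print", "dayin", "打印"),
    Entry("string", "zifuchuan", "字符串"),
]

def complete(typed: str) -> list[str]:
    """Return popup candidates whose English name or pinyin starts with `typed`."""
    typed = typed.lower()
    return [
        entry.chinese
        for entry in NAMES
        if entry.english.startswith(typed) or entry.pinyin.startswith(typed)
    ]

print(complete("l"))     # ['长度', '列表']
print(complete("pai"))   # ['排序']
```

One popup serves both habits: a programmer who knows only the English name and one who knows the sound of the character type into the same box.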

An internationalized programming language and its IDE plugins must allow programmers to begin using another natural language in their programs incrementally, only as fast as they can learn the new vocabulary, so that some names are in one language and some in another. This is much easier if the two natural languages use different scripts, as English and Chinese do. A good IDE plugin could transform the names in a program between two such natural languages easily enough. Non-Chinese programmers won't actually have to learn any Chinese speaking, listening, grammar, or character writing. They can just learn to read characters and type them, at their own pace. Typing Chinese is very different to writing it, requiring only that one recognize eligible characters in a popup menu. They can learn the sound of a character without the syllabic tone, or instead just learn the shape. Because the characters are limited to names in class libraries, they won't need to know grammar or write sentences.
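As a rough idea of what such a name-transforming plugin would do, here is a minimal sketch that rewrites identifiers in Python source using the standard tokenize module, leaving strings and comments alone. The `rename` function and the two-entry mapping are hypothetical; a real plugin would also need to handle scoping, name clashes, and the reverse direction.

```python
import io
import tokenize

def rename(source: str, mapping: dict[str, str]) -> str:
    """Replace identifier tokens found in `mapping`; strings and comments are untouched."""
    lines = source.splitlines(keepends=True)
    tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))
    # Work backwards so earlier column offsets on a line stay valid.
    for tok in reversed(tokens):
        if tok.type == tokenize.NAME and tok.string in mapping:
            row, start = tok.start
            _, end = tok.end
            line = lines[row - 1]
            lines[row - 1] = line[:start] + mapping[tok.string] + line[end:]
    return "".join(lines)

# Hypothetical mapping: only some names are translated, the rest stay in English.
to_chinese = {"length": "长度", "items": "项目"}
source = 'n = length(items)\nprint("length")\n'
print(rename(source, to_chinese))
```

Because the mapping can be as small as one name, a programmer could switch names over one at a time, which is the incremental adoption described above.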

Having begun using simplified Chinese characters in programs, programmers will naturally progress to all the left-to-right characters in the Unicode Basic Multilingual Plane. They'll develop libraries of shorthands, typing π instead of Math.PI. There's a deep urge within committed programmers to write programs with mathlike tersity, to marvel at the power portrayed by a few lines of code. So software developers all over the world could be typing in Chinese within decades.
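Some of this is possible today. Python 3 accepts non-ASCII identifiers, so a shorthand library of the kind imagined here can be sketched directly. The Chinese names 面积 ("area") and 半径 ("radius") are my own illustrative choices, and Python's math.pi stands in for the Math.PI mentioned above.

```python
import math

# A one-token shorthand for the library constant.
π = math.pi

def 面积(半径: float) -> float:
    """Area of a circle; 面积 means "area" and 半径 means "radius"."""
    return π * 半径 ** 2

print(面积(2.0))
print(面积(1.0) == math.pi)   # True
```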