Friday, June 13, 2008

The Future of Programming: Chinese Characters

Last year, I wrote some wordy blog entries on this subject, see Programming in Unicode, part 1 and part 2. Here's a brief rehash of them...

Deep down in their hearts, programmers want to read and write terse code. Hence, along came Perl, Python, Ruby, and for the JVM, Groovy. Lisp macros are enjoying a renaissance. But programmers still want code to be clear, so many reject regexes and the J language as being so terse they're unreadable. All these languages rely on ASCII. The tersity of these languages comes from maximizing the use of grammar, the different ways tokens can be combined. The same 100 or so tokens are used.

Many people can type those 100 tokens faster than they can write them, but can write thousands more they can't type. If there were more tokens available for programming, we could make our code terser by using a greater vocabulary. The greater the range of tokens a language has, the terser its code can be written. APL tried it many years ago but didn't become popular, perhaps because programmers didn't want to learn the unfamiliar tokens and how to enter them. But Unicode has since arrived and is used almost everywhere, including on Java, Windows, and Linux, so programmers already know some of it.

With over 100,000 tokens, Unicode consists of alphabetic letters, CJK (unified Chinese, Japanese, and Korean) characters, digits, symbols, punctuation, and other stuff. Many programming languages already allow all Unicode characters in string contents and comments, which programmers in non-Latin-alphabet countries (e.g. Greece, Russia, China, Japan) often use. Very few programming languages allow Unicode symbols and punctuation in the code: perhaps language designers don't want to allow anything resembling C++ operator overloading.

But many programming languages do allow Unicode alphabetic letters and CJK characters in names. Because there already exists agreed meanings for combinations of these, derived from their respective natural languages, programmers can increase tersity while keeping readability. However, this facility isn't used very much, maybe because the keywords and names in supplied libraries (such as class and method names in java.lang) are only available in English.

I suspect programmers from places not using the Latin alphabet would use their own alphabets in user-defined names in a certain programming language if it was fully internationalized, i.e., if they could:
  • configure the natural language for the pre-supplied keywords and names (e.g. in java.lang)
  • easily refactor their code between natural languages (ChinesePython doesn't do this)
  • use a mixture of English and their own language in code so they could take up new names incrementally
Most programming languages enable internationalized software and webpages, but the languages themselves and their libraries are not internationalized. Although now rare, most programming languages will one day be internationalized, following in the trend of the software they're used to write. The only question is how long this will take.

However, I suspect most natural languages wouldn't actually be used with internationalized programming as there's no real reason to. Programmers in non-English countries can read English and use programming libraries easily, especially with IDE auto-completors. Writing foreigner-readable programs in English will be more important.

To become popular in a programming language, a natural language must:
  • have many tokens available, enabling much terser code, while retaining clarity. East Asian ideographic languages qualify here: in fact, 80% of Unicode tokens are CJK or Korean-only characters.
  • be readable at the normal coding font. Japanese kanji and complex Chinese characters (used in Hong Kong, Taiwan, and Chinatowns) don't qualify here, leaving only Korean and simplified Chinese (used in Mainland China).
  • be easily entered via the keyboard. An IME (input method editor) allows Chinese characters to be entered easily, either as sounds or shapes. The IME for programming could be merged with an IDE auto-completor for even easier input.
And to be the most popular natural language used in programming, it must:
  • enable more tokens to be added, using only present possible components and their arrangements. Chinese characters are composed of over 500 different components (many still unused), in many possible arrangements, while Korean has only 24 components in only one possible arrangement.
  • be used by a large demographic and economic base. Mainland China has over 1.3 billion people and is consistently one of the fastest growing economies in the world.
About a year ago, I posted a comment on Daniel Sun's blog on how to write a Groovy program in Chinese. (The implementation is proof-of-concept only; a scalable one would be different.) The English version is:
content.tokenize().groupBy{ it }.
  collect{ ['key':it.key, 'value':it.value.size()] }.
  findAll{ it.value > 1 }.sort{ it.value }.reverse().
  each{ println "${it.key.padLeft( 12 )} : $it.value" }

The Chinese version reduces by over half the size (Chinese font required):
物.割().组{它}.集{ ['钥':它.钥, '价':它.价.夵()] }.
  都{它.价>1}.分{它.价}.向().每{打"${它.钥.左(12)}: $它.价"}

I believe this reduction is just the beginning of the tersity that using all Chinese characters in programming will bring. The syntax of present-day programming languages is designed to accommodate their ASCII vocabulary. With a Unicode vocabulary, the language grammar could be designed differently to make use of the greater vocabulary of tokens. As one example of many: if all modifiers are each represented by a single Chinese character, for 'public class' we could just write '公类' without a space between (just like in Chinese writing), instead of '公 类', making it terser.

A terse programming language and a tersely-written natural language used together means greater semantic density, more meaning in each screenful or pageful, hence it’s easier to see and understand what's happening in the program. Dynamic language advocates claim this benefit for dynamic programming over static programming: the benefit is enhanced for Chinese characters over the Latin alphabet.

If only 3000 of the simplest-written 70,000 CJK characters in Unicode are used, there are millions of unique two-Chinese-character words. Imagine the reduction in code sizes if the Chinese uniquely map them to every name (packages, classes, methods, fields, etc) in the entire Java class libraries. Just as Perl, Python, and Ruby are used because of the tersity of their grammar, so also Chinese programming will eventually become popular because of the tersity of its vocabulary.

Furthermore, in an internationalized programming language, not only could Chinese programmers mix Chinese characters with the Latin alphabet in their code, but so could Western programmers. Hackers want to write terse code, and will experiment with new languages and tools at home if they can't in their day jobs. They'll begin learning and typing Chinese characters if it reduces clutter on the screen, there's generally available Chinese translations of the names, they can enter the characters easily, and start using them incrementally. By incrementally I mean only as fast as they can learn the new vocabulary, so that some names are in one language and some in another. This is much easier if the two natural languages use different alphabets, as do English with Chinese. A good IDE plugin could transform the names in a program between two such natural languages easily enough.

Non-Chinese programmers won't have to learn Chinese speaking, listening, grammar, or writing. They can just learn to read characters and type them, at their own pace. Typing Chinese is quite different to writing it, requiring recognizing eligible characters in a popup menu. They can learn the sound of a character without the syllabic tone, or instead just learn the shape.

Having begun using simplified Chinese characters in programs, programmers will naturally progress to all the left-to-right characters in the Unicode basic multilingual plane. They'll develop libraries of shorthands, typing π instead of Math.PI. There’s a deep urge within hackers to write programs with mathlike tersity, to marvel at the power portrayed by a few lines of code. Software developers all over the world could be typing in Chinese within decades.

Chinese character data file available...
Recently, I analyzed the most common 20,934 Chinese characters in Unicode (the 20,923 characters in the Unicode CJK common ideograph block, plus the 12 unique characters from the CJK compatibility block), aiming to design an input method easy for foreigners to enter CJK characters.

For each character, I've recorded one or two constituent components, and a decomposition type. Only pictorial configurations are used, not semantic ones, because the decompositions are intended for foreigners when they first start to learn CJK characters, before they're familiar with meanings of characters. Where characters have typeface differences I've used the one in the Unicode spec reference listing. When there's more than one possible configuration, I've selected one based on how I think a fellow foreigner will analyse the character. I've created a few thousand characters to cater for decomposition components not themselves among my collected characters. (Although many are in the CJK extension A and B blocks, I kept those out of scope.) To represent these extra characters in the data, sometimes I've used a multi-character sequence, sometimes a user-defined glyph.

The data file is CSV-format, with 4 fields:
  • the character
  • first component
  • either second component, or -
  • type of decomposition
Here's a zip of that data file and truetype font file if anyone's interested.


aldebaran said...

Hello Gavin,

I'm pretty impressed about your decomposition table, thats what I was looking for for a while.

How did you do the font? That must have been a huge amout of work! Is there any copyleft to it?

Many thanks and kind regards,


Gavin Grover said...

Sorry for the very late reply; I've been on holiday without the net at home for a few months.

For the font, I quickly pasted a glyph from Windows font "SimSun" into an editor, and chopped off the part I didn't need. I'm not sure on the license.

Most of the work is the decomposition table. If there's a licensing issue with the font, you can always write a quick script to replace the user-defined glyphs with a multi-character sequence.

I've distributed it under Apache software licence 2.0, but neglected to put a notice in the distro. I'll correct that soon.