Wednesday, September 12, 2007

Internationalized Programming

(reposted portion)

One day programmers will choose from all the 100,000 Unicode tokens to use in their programs, not just the ASCII ones. The tersity of present-day programming languages is derived from maximizing the use of grammar, the different ways tokens can be combined. Yet the programs are limited to the 100 tokens on the keyboard. In future, languages will also maximize the use of vocabulary to enable tersity, because the greater the range of tokens a language has, the terser it can be written. Unicode supplies many more tokens than ASCII. It contains characters from the scripts of most natural languages, including 70,000 unified CJK (Chinese, Japanese, and Korean) characters. An internationalized programming language enables a fully functional program to be written totally in a natural language of one's choice. Such internationalized programming languages are presently rare, but they will follow the trend of the software they're used to write. Soon enough, most programming languages will be internationalized. I've blogged about this before, but what effects will internationalized programming languages have once they're available?

I suspect most natural languages wouldn't actually be used with internationalized programming: there's no real reason to. Many people going to college in non-English countries can read English, if not speak or write it, and so can use a programming language's libraries easily. Typing the library names into programs involves skill in spelling, not in writing, and so such people can program in English easily enough. And with auto-completors in IDE's, programmers really only need to know the first few letters of the name. Unless there's a strong nationalist movement promoting the native tongue of the country, writing foreigner-readable programs in English will be more important.

Maybe in Northeast Asia, an internationalized programming language could take off. Their natural languages, traditional Chinese, simplified Chinese, Japanese, and Korean, have many more tokens than alphabetic languages, and so enable programs to be written much more tersely. The Chinese written language has tens of thousands of characters. In fact, 80% of Unicode characters are CJK or Korean-only characters. However, the characters used for writing Japanese kanji and traditional Chinese (used in Hong Kong, Taiwan, and Chinatowns) must be read at a much larger font size than English, which would cancel out the benefits of using them. That's possibly why the Japanese-invented Ruby only uses English alphabetic library names. Korean can be read at the same font size as English, but, unlike Chinese and Japanese characters, it's really an alphabetic script. There are only 24 letters in the Korean alphabet, and they're arranged in a square to form each sound, instead of one after the other as in English. Unicode simply provides Korean with the choice to be coded letter by letter, or by the square-shaped sound. Thus Korean is potentially terser than alphabetic languages, but not as terse as simplified Chinese, the script used in mainland China.

In the 1950's, mainland China's government replaced hundreds of commonly-used but complexly-written Chinese characters with versions that use far fewer strokes. These simplified characters, also used in Singapore, can be read at the same font size as alphabetic letters and digits. Written Chinese using simplified characters takes up half the page space of written English, and can be condensed even further by using proverb writing style and text-messaging abbreviations. Not all simplified characters can be read at the same font size as alphabetic characters, but the thousands that can enable far greater tersity than 26-letter alphabets. A non-proportional font would enable many horizontally-dense characters, both simplified and traditional, to be read at a normal font size also, though that would be a radical departure from the historical square shape of characters.

Chinese characters are each composed of smaller components, recursively in a treelike manner. When breaking them down into their atomic components, the choice of what's atomic is arbitrary because all components can be broken down further, all the way to individual strokes, whereas in Korean it's clear which component is atomic. A common way of breaking Chinese characters into components gives over 400 of them in simplified Chinese, over 600 in traditional. Some components in a character give a rough idea of the pronunciation in some Chinese dialects, while others, the radicals, give an idea of the meaning. A certain sequence of Chinese components can often be arranged in more than one manner to form a character, unlike Korean where a certain sequence of letters can be arranged into a square in one way only. The arrangement is as much a part of a Chinese character as the components themselves. Also, any two components can combine into another component in a variety of ways, such as across, downwards, diagonally, repeating into different shapes, reflecting, surrounding another around different sides or corners, being inserted between or within others, melding or touching together, being threaded, overlapping or hooking in various ways, or being modified by strokes. These many ways of combining components, reflecting the pictorial roots of Chinese characters, provide another dimension of lexical variation that increases the potential tersity of written Chinese in computer programs.

A terse programming language and a tersely-written natural language used together means greater semantic density, more meaning in each screenful or pageful, hence it's easier to see and understand what's happening in the program. If only 3000 of the simplest-written 70,000 CJK characters in Unicode are used, there are millions of unique two-Chinese-character words. Imagine the reduction in code sizes if the Chinese uniquely map them to every package, class, method, field, and annotation name in the entire Java standard and enterprise edition class libraries. Just as Perl, Python, and Ruby are used because of the tersity of their grammar, so also Chinese programming will eventually become popular because of the tersity of its vocabulary. One day using the tersity of Chinese characters in programming languages will be of more value to mainland Chinese programmers than writing foreigner-readable programs in English, and when they decide to switch, it'll be over in a year.
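The "millions of unique two-Chinese-character words" claim is simple combinatorics to check. A quick sketch (the 3,000-character base is the figure used above):

```python
# Number of distinct two-character words formable from a base set of
# characters, assuming any ordered pair counts as a candidate word.
def two_char_words(base_chars: int) -> int:
    return base_chars ** 2

# With the 3,000 simplest-written CJK characters mentioned above:
print(two_char_words(3000))  # 9000000
```

Nine million candidate names comfortably covers every public name in the Java class libraries, with room left over.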

If a programming language enabled multi-vocabulary programming, not only could Chinese programmers mix Chinese characters with the Latin alphabet in their programming, but so could Western programmers. Committed programmers want to write terse programs, and will experiment with new languages and tools at home if they can't in their day jobs. They'll begin learning and typing Chinese characters if it reduces clutter on the screen, there are generally available Chinese translations of the names, and they can enter the characters easily. IME's (input method editors) and IDE auto-completors work in similar ways. With an IME, someone types the sound or shape of the character, then selects the character from a popup menu of other similar-sounding or similarly-shaped ones. Instead of two different popup systems, each at a different level of the software stack, IME's for programming could be combined into IDE auto-completors. Entering terser Chinese names for class library names would then be as easy as entering the English names, limited only by how quickly a Western programmer could learn to recognize the characters in the popup box. They would incrementally learn more Chinese characters, simply to program more tersely.
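A minimal sketch of that combined popup: an IME and an auto-completor both reduce to filtering a candidate table by what's been typed so far. The pinyin keys and library names below are made up for illustration, not taken from any real IME:

```python
# Candidate table: what the programmer types (pinyin or an English
# prefix) mapped to the names it could complete to. Illustrative only.
CANDIDATES = {
    "shuzu": ["数组"],                  # pinyin "shuzu" completing to "array"
    "arr":   ["ArrayList", "Arrays"],   # ordinary English-prefix completion
}

def complete(typed: str) -> list[str]:
    """Return every candidate whose key starts with the typed text."""
    matches = []
    for key, names in CANDIDATES.items():
        if key.startswith(typed):
            matches.extend(names)
    return sorted(matches)

print(complete("shu"))   # ['数组']
print(complete("ar"))    # ['ArrayList', 'Arrays']
```

One lookup mechanism serves both kinds of entry; only the candidate table differs.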

An internationalized programming language and IDE plugins must allow programmers to begin using another natural language in their programs incrementally, only as fast as they can learn the new vocabulary, so that some names are in one language and some in another. This is much easier if the two natural languages use different alphabets, as do English with Chinese. A good IDE plugin could transform the names in a program between two such natural languages easily enough. Non-Chinese programmers won't actually have to learn any Chinese speaking, listening, grammar, or character writing. They can just learn to read characters and type them, at their own pace. Typing Chinese is very different to writing it, requiring recognizing eligible characters in a popup menu. They can learn the sound of a character without the syllabic tone, or instead just learn the shape. Because the characters are limited to names in class libraries, they won't need to know grammar or write sentences.
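At its simplest, such a name-transforming plugin is a whole-word substitution over the program text, assuming a shared translation table. A sketch (the two table entries are illustrative renderings, not from any real library):

```python
import re

# Illustrative English -> simplified Chinese name table; a real plugin
# would draw on a shared dictionary of library-name translations.
NAMES = {"println": "打印行", "length": "长度"}

def translate_names(source: str, table: dict[str, str]) -> str:
    """Swap whole-word identifiers per the table, leaving the rest alone."""
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, table)) + r")\b")
    return pattern.sub(lambda m: table[m.group(1)], source)

code = "println(s.length())"
print(translate_names(code, NAMES))  # 打印行(s.长度())
```

Because English and Chinese use disjoint character sets, the transformation can be run in either direction without ambiguity, which is the point made above about mixing alphabets.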

Having begun using simplified Chinese characters in programs, programmers will naturally progress to all the left-to-right characters in the Unicode basic multilingual plane. They'll develop libraries of shorthands, typing π instead of Math.PI. There’s a deep urge within committed programmers to write programs with mathlike tersity, to marvel at the power portrayed by a few lines of code. So software developers all over the world could be typing in Chinese within decades.
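Python 3, for one, already permits this kind of shorthand, since identifiers can use Unicode letters. A tiny illustration:

```python
import math

# Unicode identifiers: a shorthand library in miniature.
π = math.pi          # instead of Math.PI
Σ = sum              # instead of spelling out "sum"

area = π * 2 ** 2    # area of a circle of radius 2
total = Σ([1, 2, 3])

print(round(area, 2))   # 12.57
print(total)            # 6
```

Nothing special is needed beyond the language accepting non-ASCII letters in names; the shorthand library is just a set of assignments.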

Friday, June 29, 2007

Pictorial Analysis of CJK Characters

(republished portion)

One day programmers will use all the Unicode tokens in their programs, not just the ASCII ones. To enter the CJK characters, which make up 70% of Unicode tokens, programmers must enter the pictorial representation of the character, if they don't know its sound in some Asian language or its meaning. I've been analysing the pictorial structure of the 20,000 or so CJK characters in the Unicode CJK Unified Ideograph block with a view to making them easy for Westerners to type.

Basic Constituents

The Chinese often categorize their characters and components based on the first stroke, depending on whether it's horizontal (一), vertical (丨), left-leaning (丿), right-leaning (丶), or bending (eg, 乙). But I saw many more basic strokes than that.

I saw the non-bending basic strokes of equal length as being on a circle:
  1. slightly upwards from horizontal stroke (the bottom upwards stroke in 扌)
  2. horizontal stroke (the common 一)
  3. slightly downwards from horizontal stroke (the bottom right stroke of 之)
  4. perfect right-leaning diagonal stroke (the right side of 八)
  5. vertical stroke (the common 丨)
  6. slightly left-leaning from vertical stroke (丿, the left side of 厂)
  7. perfect left-leaning diagonal stroke (the left side of 八)
  8. almost level left-leaning stroke (top of 禾)
The only difference between the first and last ones is the direction of the stroke; a foreigner just starting to learn characters would consider them the same stroke. Strokes at one point on the circle often transform into a stroke next to it (eg, the horizontal stroke of 子 from type 2 into type 1 in 孙).
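The circle can be made concrete as a data structure, with neighbouring positions being the likely transformations. A sketch; the wrap-around from stroke 8 to stroke 1 follows the observation above that they differ only in direction:

```python
# The eight basic non-bending strokes as positions on a circle.
CIRCLE = [
    "slightly upwards from horizontal",     # 1
    "horizontal 一",                        # 2
    "slightly downwards from horizontal",   # 3
    "right-leaning diagonal",               # 4 (right side of 八)
    "vertical 丨",                          # 5
    "slightly left-leaning from vertical",  # 6 (丿)
    "left-leaning diagonal",                # 7 (left side of 八)
    "almost level left-leaning",            # 8 (top of 禾)
]

def neighbours(i: int) -> tuple[str, str]:
    """Strokes a given stroke (1-based) most often transforms into:
    its circle neighbours. The circle wraps, so 8 is next to 1."""
    n = len(CIRCLE)
    return CIRCLE[(i - 2) % n], CIRCLE[i % n]

# The horizontal stroke (type 2) can shift toward type 1 or type 3,
# as when 子's horizontal stroke tilts upward in 孙.
print(neighbours(2))
```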

The perfect right-leaning and left-leaning strokes can each shorten into a short right-leaning dot (丶) and short left-leaning dot (top of 白) respectively. These two short dots can often transform into each other. There's also a longer right-leaning dot (eg, the right-leaning stroke of the top half of 爻) that's an only slightly shortened form of the perfect right-leaning stroke.

Some of these 8 strokes also have variants with hooks:
  • the bottom of 冫 - a variant of 1 above
  • the stem of 戈 - a variant of 4
  • the bottom of 丁 - a variant of 5
  • the stem of 乄 - a variant of 7
The other basic strokes in CJK characters are distinguished by how many times they bend, and the bending direction.

Strokes that bend once downwards:
  • the top-right surrounding part of 司 (the stem of 犭 is a variant)
  • the bottom-right part of 片
  • 乛 - a variant of each of the above two
  • the rightmost part of 又
  • the bottom of 辶
  • 乀 - a rare character
  • 乁 - a rare character
Strokes that bend once rightwards:
  • the bottom-left of 亾
  • 乚 (including the bottom of 心 in some fonts where it slopes before hooking)
  • right part of inside of 四
  • bottom-right of 鼠
  • bottom of 饣 - a variant of each of the above ones
  • main part of 厶
  • leftmost part of 女
  • central horizontal stroke of 牙
Strokes that bend twice, first downwards then rightwards:
  • stroke from topleft to bottomright of 卍
  • rightmost stroke of top half of 殳
  • rightmost stroke of 九
  • bottom of 气
  • bottom of 讠- a variant of each of the above ones
Strokes that bend twice, first rightwards then downwards:
  • bottom half of 丂
  • stroke from topleft to bottomright of 卐
  • central stem of 专
Strokes that bend three times, first downwards, second rightwards, and then downwards:
  • rightside of 乃
  • central stem of 及
  • right-most stroke of 郑

Components Transformed

When analysing the CJK characters into constituent components, sometimes one component was transformed into another, other times, two components were joined together in some way.

I related pairs of similarly-shaped components to each other with a special transformation. Examples are: 子 and 孑, 勺 and 夕, 己 and 已, 千 and 干, 壬 and 王, 日 and 曰, 土 and 士, 刀 and the bottom of 节.

Another transformation is to repeat a certain component a number of times in a certain shape:
  • twice across (从夶朋林奻)
  • twice downwards (多昌畕)
  • three in a triangle (晶众姦森)
  • three across (巛州川)
  • three down (perhaps, the topright-surrounding component in 司 when constructing 為)
  • four in a square (叕朤燚)
  • four across (the 丨 in 卌)
  • four down (perhaps, the 一 when constructing 隹)
Some components reflect another acrosswards (eg the components of 北 and of 非, and 爿片) or downwards (eg the components of 忽 according to some).

Some characters are best analysed as outlines of another (凹 of 凵, 凸 of 丄).

Components Joined Together

Components can be joined together in many ways.

The most common join configuration is across, the second most common is downwards. The same two CJK components can sometimes be arranged both across and downwards to form different characters, eg 叭只, 略畧, 杠杢, 杍李, 峒峝, 叻另, and 呐呙. A handful of components join diagonally (eg 以; the part of 亥 under the 亠 is 丩 diagonally joined to 人). When two components join downwards, they can touch (eg 示去卡且丘元早光兄支).

A common configuration is where one component surrounds another somehow:
  • on two sides at the top left (厷厄右后)
  • on two sides at the bottom left (亾这迎廷咫尫爬)
  • on two sides at the top right (句匂勾可司匃)
  • on three sides around the top (网闪用瓜同)
  • on three sides around the bottom (凶鼎函)
  • on three sides around the left (区匜匹)
  • on three sides around the right (the left side of 臦, the smaller one on the right of 龜)
  • on all sides (回囚囟)
A character can be inserted between others, either across (eg 衎衒衍 is 行 with another character like 干玄氵 between) or downwards (eg 裒衷衰 is 衣 with another like 臼中母 between).

When two components join, two strokes can be molded together into one stroke, either across (我 is 手 before 戈), downwards (缶 is 午 over 山; 里 is 田 over 土; 重 is 千 over 里), repeating down (岀), or surrounding (飛).

Two components can be threaded together (申 is 曰 threaded with 丨), overlap in various ways (肉民包世氏冉丑內西), or be within each other (夷來乘坐爽兆臾幽巫吏束夾噩承乖).

Components can be modified by a single stroke of some sort in some position (圡太主凡玉叉弋勺, 生午牛, 必才少).

It was straightforward to relate these transformations and join configurations together using an inheritance hierarchy when programming.
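As a sketch of what that hierarchy might look like, in Python rather than the Groovy actually used, with class and field names of my own invention:

```python
from dataclasses import dataclass

# A sketch of the inheritance hierarchy relating transformations and
# join configurations. Names and fields are illustrative.
@dataclass
class Configuration:
    """How a character is built from its constituents."""

@dataclass
class Repeat(Configuration):
    component: str
    times: int
    shape: str      # eg "across", "down", "triangle", "square"

@dataclass
class Join(Configuration):
    first: str
    second: str
    manner: str     # eg "across", "downwards", "surround-all-sides"

examples = [
    Repeat("木", 2, "across"),               # 林: 木 twice across
    Repeat("日", 3, "triangle"),             # 晶: 日 three in a triangle
    Join("囗", "口", "surround-all-sides"),  # 回: one surrounds the other
    Join("白", "勺", "across"),              # 的: joined across
]
print(examples[0])
```

Each decomposition in the data then becomes one record in this hierarchy, which makes the later analysis by program straightforward.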

Saturday, June 23, 2007

Regex Too Terse

The purpose of making programming languages terser is so they'll be more readable. But regexes are too terse. To make them readable, we need to make them more verbose. We can format them more easily by using the ?x flag, but the syntax is so different to the languages they're embedded within, they still stick out, requiring mental effort to digest. But JVM languages like Groovy aren't stuck with regexes just because Java has them. Just as JVM bytecodes are generated by Groovy, so also the terse regex syntax could be too. What would a more verbose yet readable regex syntax for Groovy look like?

For starters, we wouldn't need to embed the regex expression inside slashes / /, as the syntax would be mixable with Groovy's. Perhaps it could be generated by a builder. A small amount of syntax could remain the same. The alternation operator | acts like Groovy's | operator. The option operator ? has a parallel in Groovy's ?. operator. We could keep escaped control characters, and change the meaning of \b from word boundary to backspace as in Groovy, so we'd have '\t\n\b\f\r' instead of /\t\n\010\f\r/.

For character classes, we could use Groovy's sequence syntax, ('a'..'j') instead of /[a-j]/, or [*'a'..'z', *'A'..'Z', '_', *'0'..'9'] instead of /[a-zA-Z_0-9]/. We could use !['a','c','e'] instead of /[^ace]/. Pre-defined classes could have special variable names within the regex builder context, eg, ws for /\s/, digit for /\d/, and word for /\w/. We could even define our own character classes, eg, def hexDigit= [*'0'..'9', *'A'..'F', *'a'..'f'], or def notDigit = !digit for /\D/.

For groups, parentheses are sensible, so 'a'+('b'|'c')+'d' as new syntax for /a(b|c)d/, but groups should be non-capturing by default. For capturing groups, we can use variable names instead of numbers, ie, 'a'+(bc='b'|'c')+'d'+bc+'e' instead of /a(b|c)d\1e/.

For the wildcard, perhaps replace the dot with an underscore, as in 'a'+_+'c' instead of /a.c/. For the repetition operators, we could use sequences, so a new special syntax 'a'*(0..) + 'b'*(1..) instead of /a*b+/, and 'a'*(3..5) instead of /a{3,5}/.

Flags could be indicated by names heading a closure, eg, caseInsignificant{ 'aBc'*(1..)+'DeFg' } instead of /(?i:(aBc)+DeFg)/. Lazy and possessive operators could be indicated by such names, eg, lazy{ 'abc'*(0..) } instead of /(abc)*?/, and possessive{ 'def'*(1..) } instead of /(def)++/.

Lookarounds could also be shown by names, after{'a'} instead of (?=a), !after{'b'} instead of (?!b), and before{'c'} instead of (?<=c). The pre-defined anchors would have special variable names, eg, wordBoundary instead of /\b/, lineStart instead of /^/, and lineEnd instead of /$/. And we could define our own anchors, eg, def sentenceEnd= before{['.','?','!']}.

I thought of this replacement syntax off the top of my head. It's just an idea for a RegexBuilder for Groovy. We could have Groovy statements interacting with the regex syntax, just like other builders do, so we could capture information that would normally be lost in the regex backtracking. Maybe regex functions normally outside the regex string, such as text replacement, could also be done within the builder syntax.

So instead of regex syntax being so terse it's unreadable, and sticking out like a sore thumb from the cool Groovy syntax, it could be made more verbose so it's easily readable, and mixes nicely with other Groovy syntax.

Programming in Unicode

(republished portion)

Unicode, used by both Java and Windows, now has 100,000 characters, a collection of alphabetic letters, CJK (unified Chinese, Japanese, and Korean) characters, digits, symbols, punctuation, etc. But computer programs are still written using a mere 100 tokens, the ASCII characters. It's difficult to key in other characters, and programmers don't know other alphabets. But in a few years, using all the Unicode characters in programs may be standard. Math is a language that uses many more tokens than programming languages, both dedicated symbols and letters from many alphabets. Math can describe concepts extremely tersely, since the greater the range of tokens a language has, the terser it can be written. Yet programming is limited to the 100 tokens on the keyboard. Many people can type those 100 characters faster than they can write them, but can write thousands more they can't type.

Committed programmers are continually looking for ways to make programs terser, yet still readable. They choose languages and tools that enable such tersity, so programming languages evolved into the 2GL (assembler), the 3GL, and the visual 4GL. But 4GL’s were limited in their scalability and readability. It’s easier to write a program in a 4GL than a 3GL, but more difficult to read and debug it. So some used IDE’s, supplementing 3GL's with visual aids. Others looked for a more productive language, so terser languages, such as Perl, Python, and Ruby, became popular. Regular expressions are a successful attempt at tersity, now used by many languages, but many consider them unreadable. The K programming language, used by financial businesses, could be the tersest language ever invented. It only uses ASCII symbols, but overloads them profusely. However, the price is the inability to give different precedences to the operators, so everything unbracketed is evaluated from the right. The tersity of present-day programming languages is derived from maximizing the use of grammar, the different ways tokens can be combined. The same 100 tokens are used.

Perhaps adding the many Unicode symbols to programming languages would enable terser programs to be written. Operator overloading in C++ was a similar attempt at tersity. Programmers could define meanings for some combinations of the 35 ASCII symbols. Although programs became terser, they were more difficult to understand because of the unpredictable meaning of these symbols in different code contexts, and the feature was eventually left out of Java. The problem wasn't with operator overloading itself, but with the uncontrolled association of meanings with each operator. Eventually certain meanings would have become generally accepted, the others falling into disuse, but this would have taken many years, with too many incompatible uses produced in the meantime. If there was such a problem with a few dozen operators, what hope would there be for the hundreds of unused Unicode symbols? If programmers were allowed to overload them with any meaning, the increase in program tersity would be at the cost of readability. Although some Unicode symbols will have an obvious meaning, such as some math symbols, most would have no meaning that could be transferred easily to the programming context. To retain readability of programs in a terse language, the meanings of the Unicode symbols would have to be carefully controlled by the custodians of that language. They would activate new Unicode symbols at a gradual pace only, with control of their meanings, after carefully considering existing use of the symbols.

Programming languages do, however, already allow Unicode characters in some parts of their programs. The contents of strings and comments can use any Unicode character. User-defined names can use all the alphabetic letters and CJK characters, and because there already exists agreed meanings for combinations of these, derived from their respective natural languages, we can increase tersity while keeping readability. But the core of the language, the grammar keywords and symbols, and names in supplied libraries, still only use ASCII characters. Perhaps some programmers use non-Latin characters wherever they can in their programs. A browse through the computer shelves of a typical bookshop in mainland China suggests they only do so for comments and contents of strings, not for user-defined names.

Programmers from cultures not using the Latin alphabet won't be motivated to use their own alphabets in user-defined names when they don't with pre-supplied names, such as keywords or standard libraries. Often, most of the names in a program are from libraries standard to the language. To trigger the widespread use of Unicode characters from non-ASCII alphabets in programs, the pre-supplied names must also be in those alphabets. And this could easily be done. The grammar of a language and its vocabulary are two different concepts. A programming language grammar could conceivably have many vocabulary options. Almost all programming languages only have English. Other vocabularies could be based on other natural languages. A Spanish vocabulary plugged into a certain programming language would have Spanish names for the keywords, modules, classes, methods, variables, etc.
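Even in languages that merely allow Unicode identifiers, the grammar-versus-vocabulary split can be imitated by aliasing the supplied names. A toy Python illustration; the translated names are my own renderings, not any standard vocabulary:

```python
# Same grammar, different vocabulary: alias the pre-supplied names.
suma = sum            # a Spanish vocabulary for the built-in `sum`
longitud = len        # ...and for `len`
求和 = sum             # a simplified-Chinese vocabulary for the same names

print(suma([1, 2, 3]))     # 6
print(longitud("hola"))    # 4
print(求和([4, 5]))         # 9
```

A real internationalized language would ship such vocabularies with the libraries themselves, rather than leaving each programmer to build aliases, but the separation of grammar from vocabulary is the same.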

Computer software nowadays is internationalized, webpages are, and most programming languages enable internationalized software. But the languages themselves and their libraries are not internationalized. An internationalized programming language would enable a fully functional program to be written totally in a natural language of one’s choice. Not only could all user-defined names be specified in any alphabet, but also all keywords and names in the standard libraries would be available in many natural languages. Ideally, when a software supplier ships a library, they'll specify the public names in many languages in the definition. But this is likely to have a slow uptake, so languages must allow a library to be translated in an incremental manner easily from one natural language into another. Some languages let programmers use mixins and interceptors to do this to various degrees. And some could conceivably allow a preprocessor, pluggable lexer, or closures to internationalize the keywords. But full foreign-language support must be a declared aim of a language's development.

Internationalized programming languages are presently rare, but they will follow the trend of the software they're used to write. Soon enough, most programming languages will be internationalized. The first to be translated will probably be the core Java class libraries, and the first language translated into, probably simplified Chinese.

Sunday, April 22, 2007

Google's IME

(Updated on 4 May 2007)

I now use Google's IME (Input Method Editor) for Chinese, released about a month ago, much more than Microsoft's. As well as pinyin input, Google's IME lets me enter the character strokes for the many characters I don't yet know the sound of. Very handy for foreigners learning Chinese. The keyboard assignments are intuitive: h for 一(heng), s for 丨(shu), p for 丿(pie), n or d for 丶(na or dian), and z for 乙(zhe). Because most foreigners learning Chinese would know the pinyin names of the four main strokes, having memorized "heng, shu, pie, na" till it hurts, it's easy to type h, s, p, n for these four strokes. And 乙 looks like z.
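Those assignments amount to a small lookup table. A sketch, with a helper that spells out a character's stroke order as keystrokes (the stroke orders shown are for 十 and 大):

```python
# The stroke-key assignments described above. Where two keys exist
# (n or d for 丶), only one is kept here for simplicity.
STROKE_KEYS = {"一": "h", "丨": "s", "丿": "p", "丶": "n", "乙": "z"}

def keystrokes(strokes: str) -> str:
    """Keystroke sequence for entering a character stroke by stroke."""
    return "".join(STROKE_KEYS[s] for s in strokes)

# 十 is written 一 then 丨, so it's typed "hs";
# 大 is 一, 丿, 丶, so it's typed "hpn".
print(keystrokes("一丨"))    # hs
print(keystrokes("一丿丶"))  # hpn
```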

But only 6 keys are used. It would be nice if some extra common sequences of strokes were added to the 20 unused keys. For example, k for 口(kou), r for 日(ri), y for 月(yue), etc. This would be a lightweight addition, and easy to learn. We could learn them incrementally, always having the choice of stroke-by-stroke, or entering those components. Nothing heavy-duty like Wubizixing, which requires us to learn all of it before we can use any of it.

The IME lets us toggle between English and Chinese using the shift-key. Perhaps other natural languages that usually require an IME will be plugged in later, eg, Japanese, Korean. Maybe the IME will soon let us toggle between languages that don't usually require an IME, eg, English and French keyboard assignments. Also cool would be if we could use Google's IME to enter Unicode symbols and punctuation easily, with keystroke sequences assigned to the Latin-1 (0x0080 to 0x00FF) tokens.

Perhaps the Google IME could become the software most people worldwide use to enter Unicode, not just Northeast Asians entering CJK characters.

Sunday, April 08, 2007

Foreigners Typing Chinese

(updated on Thursday 19 April 2007)

A recent study at Hong Kong University using magnetic brain imaging shows that native Chinese speakers reading Chinese characters generate a pattern of brain activity related to spatial processing, believed to be because each visual Chinese character maps to a syllable of speech. When they read English, this same brain activity occurs, whereas the brain activity is different in native English speakers reading English. The researchers believe this suggests that native Chinese readers use their Chinese-reading capability when reading English, associating written English syllables directly to the sounds, and being less capable of applying rules to convert sequences of individual letters into sounds as native English-readers do.

Many years ago, a teacher of Chinese told me the only way foreigners could learn to write Chinese is the same way the Chinese learn at school, that is, to practise writing each character many times until they know it. But as I learnt to read Chinese, I found it easiest to recognize the simplest written characters, the ones that weren't composed of others, and hardest to recognize the most complexly written ones.

Most learning material requires us to learn the commonly used characters first. When I learnt "我是一个人。这个人是我的爱人。", I'd have no trouble remembering the 一 and 人, both atomic characters. Characters made up of only two components, such as 个 (人 above 丨) and 的 (白 left of 勺), were easy if I already knew the meaning of their constituent components, 人、丨、白、and 勺. Other two-component characters, such as 是 (日 over 正), 这 (辶 around the bottom-left of 文), and 我 (手 before and joined to 戈), were more difficult when I hadn't already learnt 日、正、手、and 戈. And a character like 爱 (爪 over 冂 over 十 around the top-left of 又) was very difficult to recognize when reading. Further complicating the issue, the choice of what's an atomic component in Chinese is arbitrary, as Chinese characters can be continually sub-divided in a tree-like structure all the way down to individual strokes, eg, 正 is 一 over 止, 止 is 丄 around the bottom-left of 卜, 文 is 亠 over 乂, 亠 is 丶 over 一.

I'm wondering if the converse of the Hong Kong University study applies to native English speakers reading Chinese. When we read Chinese, do we use the same brain activity as when we read English? If so, that may be why I "spell out" the components of Chinese characters I don't know. A native Chinese reader may see the character 爱 as a single shape, but I see 4 distinct "letters" (爪冂十又) in that character. For those of us whose first reading language was alphabetic, perhaps the easiest way to learn to read Chinese is to learn the components first. In primary school we practised writing each letter of the alphabet before we practised writing words, and maybe we should learn all the components of Chinese before writing characters.

In the future, many Westerners will need to learn Chinese characters for various reasons, primarily to remain competitive in information-based industries. Because I believe the Chinese will one day use simplified Chinese characters in computer programs to gain the advantage that comes from writing terser code, I've been motivated to study how foreigners can best learn Chinese characters so they can also benefit from writing such terser code. I needed to analyse the constituents of Chinese characters.

There were many issues to consider when choosing which characters to include for analysis. Simplified characters are used in Mainland China, complex ones in Hong Kong, Taiwan, and Chinatowns. My experience is that when someone needs to learn one type of Chinese characters, they often need to learn those of the other type also. Additionally, Unicode treats characters from both Chinese and Japanese languages as one script, the "Unified CJK Characters", or "Unihan" characters. The Unicode Consortium considered that there would be too many duplicate encodings of the same characters if they encoded simplified Chinese, complex Chinese, Japanese, and the other East Asian ideographic scripts separately, and since the characters are all descended from one ancient source, the consortium unified the scripts using some guidelines. These guidelines provided that characters with the same meaning and abstract shape be unified into one, irrespective of source script and ignoring typeface differences. (There were exceptions for the 20,000 most common CJK characters, such as not unifying two characters that are distinct in a single source script.)

I chose the 20,923 characters in the Unicode CJK Unified Ideographs block, plus the 12 unique characters from the CJK Compatibility Ideographs block. This seemed the most natural place to draw the boundary, as foreigners who learn Chinese characters for programming would likely progress to using other CJK characters. I entered decompositions of those characters, aided by various internet listings. For each decomposition, I recorded a "decomposition configuration" and up to two constituent components. Only pictorial configurations were used, not semantic ones, because the decompositions are intended for foreigners when they first start to learn CJK characters, before they're familiar with the meanings of characters. Where characters had typeface differences, I used the form in the Unicode specification's reference listing. When there was more than one possible configuration, I selected one based on how I thought a fellow foreigner would analyse the character. In future, I may add alternative configurations to the data.
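To make the record format concrete, here's a sketch of what one decomposition entry might look like (Python, with field names entirely of my own invention; the actual data format isn't shown here):

```python
from dataclasses import dataclass

# Hypothetical shape of one decomposition record: the character, its
# pictorial configuration (left-right, top-bottom, surround, etc.), and
# up to two constituent components.
@dataclass
class Decomposition:
    char: str          # the character being decomposed, e.g. '好'
    config: str        # pictorial configuration, e.g. 'left-right'
    first: str = ''    # first component, e.g. '女'
    second: str = ''   # second component, e.g. '子'

d = Decomposition('好', 'left-right', '女', '子')
print(d.char, d.config, d.first, d.second)
```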

I had to "create" a few thousand "characters" to cater for decomposition components that weren't themselves among my collected characters. (Though I could have found many of them in the CJK Extension A and B blocks, I wanted to keep those out of scope.) To represent these extra characters in the data, I sometimes used a multi-character sequence, sometimes a user-defined glyph. I've avoided using them in this blog entry because my user-defined fonts can't be displayed here; instead I'm relying on verbal descriptions such as "the bottom part of 节".

With this data, I've been able to perform a good deal of analysis programmatically. I used the Groovy programming language, similar in function to the more popular Perl, Python, PHP, and Ruby "scripting" languages. The two key issues regarding the data were choosing the decomposition configurations and choosing the basic atomic components.
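The analysis itself was done in Groovy; the following Python sketch just shows the kind of computation the decomposition data enables, using a toy two-entry table of my own:

```python
# Toy decomposition table: character -> its two components.
# (Illustrative only; the real data covers ~20,000 characters.)
decomp = {
    '好': ('女', '子'),
    '妈': ('女', '马'),
}

def atoms(ch):
    """Recursively expand a character into components that have no
    further recorded decomposition, i.e. the atomic components."""
    if ch not in decomp:
        return [ch]              # atomic: no recorded decomposition
    result = []
    for part in decomp[ch]:
        result.extend(atoms(part))
    return result

print(atoms('好'))   # ['女', '子']
```

With the full table, the same recursion yields, for any character, the flat list of atomic components a learner would type.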

My main purpose is to generate an optimal collection of atomic components, with keyboard assignments, that is intuitive for foreigners first learning to type CJK characters pictorially. Asian IMEs generally target one particular script (simplified Chinese, traditional Chinese, Japanese, or Korean), not all CJK characters, and are intended for those who already intimately know the written language of that script; e.g., the popular Wubizixing is for native readers of simplified Chinese, not for foreigners.

Wubizixing enables characters to be entered using at most four keystrokes by representing characters by their first three and last components only. Westerners are used to typing full words, and wouldn't want to type my previous sentence as "wubg enas chas to be entd usig at most four keys by repg chas by the firt thre and last coms only". Wubizixing also makes commonly used groups of components available as one keystroke, such as 鱼 (instead of 勹, 田, and 一), 早 (instead of 日 and 十), and 四 (instead of 囗 and 儿). As a result, Wubizixing uses almost 200 components. The popular Taiwanese Cangjie input method is similar. But Westerners aren't used to typing common sequences of letters, such as 'tion', 'str', and 'th', with one keystroke. For foreigners, the number of keystrokes for a character should be related to its complexity, not its frequency of use. Another input method, Wubihua, used on mobile phones in mainland China, uses only 5 components: the basic strokes 一丨丿丶乙. This is too low-level for a foreigner, similar to typing a keystroke for each stroke of a capital Latin letter. I'm looking for a set of components at a level between these 5 strokes and Wubizixing's 200 components.
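The first-three-plus-last rule described above is simple enough to sketch (Python, with illustrative component lists):

```python
# Wubizixing's four-keystroke cap, as described above: a character with
# more than four components is keyed by its first three components plus
# its last one. The component lists here are illustrative only.
def wubi_keys(components):
    """Return the component sequence actually typed for a character."""
    if len(components) <= 4:
        return components
    return components[:3] + components[-1:]

print(wubi_keys(['日', '十']))                    # short: all components
print(wubi_keys(['一', '丨', '丿', '丶', '乙']))   # long: first 3 + last
```

The rule caps keystrokes per character at four, which is exactly the frequency-over-complexity trade-off that suits native readers but not beginners.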

Although I've only analysed 20,000 of the 70,000 CJK characters presently in Unicode, my set of atomic components must also be useful for entering the other CJK characters. There are another 10,000 Korean-only characters, which are built from the 24 letters of the Korean alphabet. If I match the keyboard assignments of my CJK components to the standard Korean keyboard as much as possible, I could cover 80,000 of the 100,000 Unicode characters in one seamless pictorial input method.
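Korean is the easy part: Unicode arranges the precomposed Hangul syllables algorithmically, so splitting a syllable block back into its letters needs no lookup table at all. A sketch using the standard formula from the Unicode Hangul composition algorithm:

```python
# Unicode encodes each precomposed Hangul syllable as:
#   code = 0xAC00 + (lead*21 + vowel)*28 + tail
# using 19 lead consonants, 21 vowels, and 28 tails (including "none"),
# so the letters can be recovered by arithmetic alone.
LEADS  = [chr(0x1100 + i) for i in range(19)]         # conjoining leads
VOWELS = [chr(0x1161 + i) for i in range(21)]         # conjoining vowels
TAILS  = [''] + [chr(0x11A8 + i) for i in range(27)]  # '' = no tail

def jamo(syllable):
    """Split one precomposed Hangul syllable into its component letters."""
    index = ord(syllable) - 0xAC00
    lead, rest = divmod(index, 21 * 28)
    vowel, tail = divmod(rest, 28)
    return LEADS[lead] + VOWELS[vowel] + TAILS[tail]

print(jamo('한'))   # the three letters of "han"
```

So an input method that types Korean letter by letter can compose any of those 10,000-odd syllables mechanically, leaving only the CJK components to be designed by hand.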

I don't want to rush through this final stage of the task, as it's important I get this part right.