Wednesday, September 12, 2007

Internationalized Programming

(reposted portion)

One day programmers will choose from all the 100,000 Unicode tokens to use in their programs, not just the ASCII ones. The tersity of present-day programming languages is derived from maximizing the use of grammar, the different ways tokens can be combined. Yet the programs are limited to the 100 tokens on the keyboard. In future, languages will also maximize the use of vocabulary to enable tersity, because the greater the range of tokens a language has, the terser it can be written. Unicode supplies many more tokens than ASCII. It contains characters from the scripts of most natural languages, including 70,000 unified CJK (Chinese, Japanese, and Korean) characters. An internationalized programming language enables a fully functional program to be written totally in a natural language of one’s choice. Such internationalized programming languages are presently rare, but they will follow the trend of the software they're used to write. Soon enough, most programming languages will be internationalized. I've blogged about this before, but what about the effects of internationalized programming languages being available?

I suspect most natural languages wouldn't actually be used with internationalized programming: there's no real reason to. Many people going to college in non-English-speaking countries can read English, if not speak or write it, and so can use a programming language's libraries easily. Typing the library names into programs requires skill in spelling, not in composition, so such people can program in English easily enough. And with auto-completers in IDEs, programmers really only need to know the first few letters of a name. Unless a strong nationalist movement promotes the country's native tongue, writing foreigner-readable programs in English will matter more.

Maybe in Northeast Asia, an internationalized programming language could take off. Their natural languages, written in traditional Chinese, simplified Chinese, Japanese, and Korean scripts, have many more tokens than alphabetic languages, and so enable programs to be written much more tersely. The Chinese written language has tens of thousands of characters. In fact, 80% of Unicode characters are CJK or Korean-only characters. However, the characters used for writing Japanese kanji and traditional Chinese (used in Hong Kong, Taiwan, and Chinatowns) must be read at a much larger font size than English, which would cancel out the benefits of using them. That's possibly why Ruby, though invented in Japan, uses only English alphabetic library names. Korean can be read at the same font size as English, but, unlike Chinese and Japanese characters, it's really an alphabetic script. There are only 24 letters in the Korean alphabet, and they're arranged in a square block to form each syllable, instead of one after the other as in English. Unicode simply gives Korean the choice of being coded letter by letter, or syllable block by syllable block. Thus Korean is potentially terser than alphabetic languages, but not as terse as simplified Chinese, the script used in mainland China.
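Unicode's dual encoding of Korean can be demonstrated with a normalization round trip. Here's a sketch in Python; the syllable chosen is arbitrary:

```python
import unicodedata

# One square-shaped syllable block, encoded as a single code point
syllable = "한"  # U+D55C

# NFD decomposes the block into its individual letters (jamo)
letters = unicodedata.normalize("NFD", syllable)
print(len(syllable), len(letters))  # 1 3

# NFC recomposes the letters back into the square block
assert unicodedata.normalize("NFC", letters) == syllable
```

So the same written Korean can be stored either way, and the two forms convert losslessly into each other.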

In the 1950's, mainland China's government replaced hundreds of commonly-used but complexly-written Chinese characters with versions that use far fewer strokes. These simplified characters, also used in Singapore, can be read at the same font size as alphabetic letters and digits. Written Chinese using simplified characters takes up half the page space of written English, and can be condensed even further by using a proverb-like writing style and text-messaging abbreviations. Not all simplified characters can be read at the same font size as alphabetic characters, but the thousands that can enable far greater tersity than 26-letter alphabets. A non-proportional font would enable many horizontally-dense characters, both simplified and traditional, to be read at a normal font size also, though that would be a radical departure from the historical square shape of characters.

Chinese characters are each composed of smaller components, recursively in a treelike manner. When breaking them down into their atomic components, the choice of what's atomic is arbitrary because all components can be broken down further, all the way to individual strokes, whereas in Korean it's clear which component is atomic. A common way of breaking Chinese characters into components gives over 400 of them in simplified Chinese, over 600 in traditional. Some components in a character give a rough idea of the pronunciation in some Chinese dialects, while others, the radicals, give an idea of the meaning. A certain sequence of Chinese components can often be arranged in more than one manner to form a character, unlike Korean where a certain sequence of letters can be arranged into a square in one way only. The arrangement is as much a part of a Chinese character as the components themselves. Also, any two components can combine together into another component in a variety of ways, such as across, downwards, diagonally, repeating into different shapes, reflecting, surrounding another around different sides or corners, being inserted between or within others, melding or touching together, being threaded, overlapping or hooking in various ways, or being modified by strokes. These many ways of combining components, reflecting the pictorial roots of Chinese characters, provide another dimension of lexical variation that increases the potential tersity of written Chinese in computer programs.

A terse programming language and a tersely-written natural language used together means greater semantic density, more meaning in each screenful or pageful, hence it’s easier to see and understand what's happening in the program. If only 3000 of the simplest-written 70,000 CJK characters in Unicode are used, there are millions of unique two-Chinese-character words (3000 × 3000 = 9 million ordered pairs). Imagine the reduction in code sizes if the Chinese uniquely map them to every package, class, method, field, and annotation name in the entire Java standard and enterprise edition class libraries. Just as Perl, Python, and Ruby are used because of the tersity of their grammar, so also Chinese programming will eventually become popular because of the tersity of its vocabulary. One day using the tersity of Chinese characters in programming languages will be of more value to mainland Chinese programmers than writing foreigner-readable programs in English, and when they decide to switch, it’ll be over in a year.

If a programming language enabled multi-vocabulary programming, not only could Chinese programmers mix Chinese characters with the Latin alphabet in their programming, but so could Western programmers. Committed programmers want to write terse programs, and will experiment with new languages and tools at home if they can't in their day jobs. They'll begin learning and typing Chinese characters if it reduces clutter on the screen, there are generally available Chinese translations of the names, and they can enter the characters easily. IMEs (input method editors) and IDE auto-completers work in similar ways. With an IME, someone types the sound or shape of the character, then selects the character from a popup menu of other similar-sounding or similarly-shaped ones. Instead of two different popup systems, each at a different level of the software stack, IMEs for programming could be combined into IDE auto-completers. Entering terser Chinese names for class library names would then be as easy as entering the English names, limited only by how quickly a Western programmer could learn to recognize the characters in the popup box. They would incrementally learn more Chinese characters, simply to program more tersely.

An internationalized programming language and IDE plugins must allow programmers to begin using another natural language in their programs incrementally, only as fast as they can learn the new vocabulary, so that some names are in one language and some in another. This is much easier if the two natural languages use different alphabets, as English and Chinese do. A good IDE plugin could transform the names in a program between two such natural languages easily enough. Non-Chinese programmers won't actually have to learn any Chinese speaking, listening, grammar, or character writing. They can just learn to read characters and type them, at their own pace. Typing Chinese is very different to writing it, requiring only the recognition of eligible characters in a popup menu. They can learn the sound of a character without the syllabic tone, or instead just learn the shape. Because the characters are limited to names in class libraries, they won't need to know grammar or write sentences.

Having begun using simplified Chinese characters in programs, programmers will naturally progress to all the left-to-right characters in the Unicode basic multilingual plane. They'll develop libraries of shorthands, typing π instead of Math.PI. There’s a deep urge within committed programmers to write programs with mathlike tersity, to marvel at the power portrayed by a few lines of code. So software developers all over the world could be typing in Chinese within decades.

Friday, June 29, 2007

Pictorial Analysis of CJK Characters

(republished portion)

One day programmers will use all the Unicode tokens in their programs, not just the ASCII ones. To enter the CJK characters, which make up 70% of Unicode tokens, programmers who don't know a character's sound in some Asian language or its meaning must enter its pictorial representation. I've been analysing the pictorial structure of the 20,000 or so CJK characters in the Unicode CJK Unified Ideograph block with a view to making them easy for Westerners to type.
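The size of that block is easy to check programmatically. A quick census in Python (the exact count depends on the Unicode version the interpreter ships with):

```python
import unicodedata

# Count assigned characters in the CJK Unified Ideographs block, U+4E00..U+9FFF
count = sum(1 for cp in range(0x4E00, 0xA000)
            if unicodedata.name(chr(cp), "").startswith("CJK UNIFIED IDEOGRAPH"))
print(count)  # 20,000+ characters, growing with later Unicode versions
```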

Basic Constituents

The Chinese often categorize their characters and components based on the first stroke, depending on whether it's horizontal (一), vertical (丨), left-leaning (丿), right-leaning (丶), or bending (eg, 乙). But I saw many more basic strokes than that.

I saw the non-bending basic strokes of equal length as being on a circle:
  1. slightly upwards from horizontal stroke (the bottom upwards stroke in 扌)
  2. horizontal stroke (the common 一)
  3. slightly downwards from horizontal stroke (the bottom right stroke of 之)
  4. perfect right-leaning diagonal stroke (the right side of 八)
  5. vertical stroke (the common 丨)
  6. slightly left-leaning from vertical stroke (丿, the left side of 厂)
  7. perfect left-leaning diagonal stroke (the left side of 八)
  8. almost level left-leaning stroke (top of 禾)
The only difference between the first and last strokes is the direction in which they're drawn; a foreigner just starting to learn characters would consider them the same stroke. Strokes at one point on the circle often transform into a neighbouring one (eg, the horizontal stroke of 子 changes from type 2 into type 1 in 孙).

The perfect right-leaning and left-leaning strokes can each shorten into a short right-leaning dot (丶) and short left-leaning dot (top of 白) respectively. These two short dots can often transform into each other. There's also a longer right-leaning dot (eg, the right-leaning stroke of the top half of 爻) that's an only slightly shortened form of the perfect right-leaning stroke.

Some of these 8 strokes also have variants with hooks:
  • the bottom of 冫 - a variant of 1 above
  • the stem of 戈 - a variant of 4
  • the bottom of 丁 - a variant of 5
  • the stem of 乄 - a variant of 7
The other basic strokes in CJK characters are distinguished by how many times they bend, and the bending direction.

Strokes that bend once downwards:
  • the top-right surrounding part of 司 (the stem of 犭 is a variant)
  • the bottom-right part of 片
  • 乛 - a variant of each of the above two
  • the rightmost part of 又
  • the bottom of 辶
  • 乀 - a rare character
  • 乁 - a rare character
Strokes that bend once rightwards:
  • the bottom-left of 亾
  • 乚 (including the bottom of 心 in some fonts where it slopes before hooking)
  • right part of inside of 四
  • bottom-right of 鼠
  • bottom of 饣 - a variant of each of the above ones
  • main part of 厶
  • leftmost part of 女
  • central horizontal stroke of 牙
Strokes that bend twice, first downwards then rightwards:
  • stroke from topleft to bottomright of 卍
  • rightmost stroke of top half of 殳
  • rightmost stroke of 九
  • bottom of 气
  • bottom of 讠 - a variant of each of the above ones
Strokes that bend twice, first rightwards then downwards:
  • bottom half of 丂
  • stroke from topleft to bottomright of 卐
  • central stem of 专
Strokes that bend three times, first downwards, second rightwards, and then downwards:
  • rightside of 乃
  • central stem of 及
  • right-most stroke of 郑

Components Transformed

When analysing the CJK characters into constituent components, sometimes one component was transformed into another, other times, two components were joined together in some way.

I related pairs of similarly-shaped components to each other with a special transformation. Examples are: 子 and 孑, 勺 and 夕, 己 and 已, 千 and 干, 壬 and 王, 日 and 曰, 土 and 士, 刀 and the bottom of 节.

Another transformation is to repeat a certain component a number of times in a certain shape:
  • twice across (从夶朋林奻)
  • twice downwards (多昌畕)
  • three in a triangle (晶众姦森)
  • three across (巛州川)
  • three down (perhaps, the topright-surrounding component in 司 when constructing 為)
  • four in a square (叕朤燚)
  • four across (the 丨 in 卌)
  • four down (perhaps, the 一 when constructing 隹)
Some components reflect another across (eg the components of 北 and of 非, and 爿片) or downwards (eg the components of 忽 according to some).

Some characters are best analysed as outlines of another (凹 of 凵, 凸 of 丄).

Components Joined Together

Components can be joined together in many ways.

The most common join configuration is across, the second most common is downwards. The same two CJK components can sometimes be arranged both across and downwards to form different characters, eg 叭只, 略畧, 杠杢, 杍李, 峒峝, 叻另, and 呐呙. A handful of components join diagonally (eg 以, the part of 亥 under the 亠 is 丩 diagonally joined to 人). When two components join downwards, they can touch (eg 示去卡且丘元早光兄支).

A common configuration is where one component surrounds another somehow:
  • on two sides at the top left (厷厄右后)
  • on two sides at the bottom left (亾这迎廷咫尫爬)
  • on two sides at the top right (句匂勾可司匃)
  • on three sides around the top (网闪用瓜同)
  • on three sides around the bottom (凶鼎函)
  • on three sides around the left (区匜匹)
  • on three sides around the right (the left side of 臦, the smaller one on the right of 龜)
  • on all sides (回囚囟)
A character can be inserted between others, either across (eg 衎衒衍 is 行 with another character like 干玄氵 between) or downwards (eg 裒衷衰 is 衣 with another like 臼中母 between).

When two components join, two strokes can be molded together into one stroke, either across (我 is 手 before 戈), downwards (缶 is 午 over 山; 里 is 田 over 土; 重 is 千 over 里), repeating down (岀), or surrounding (飛).

Two components can be threaded together (申 is 曰 threaded with 丨), overlap in various ways (肉民包世氏冉丑內西), or be within each other (夷來乘坐爽兆臾幽巫吏束夾噩承乖).

Components can be modified by a single stroke of some sort in some position (圡太主凡玉叉弋勺, 生午牛, 必才少).

It was straightforward to relate these transformations and join configurations together using an inheritance hierarchy when programming.
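As a sketch of how such a hierarchy might look, here's a Python version; the class and configuration names are illustrative, not the actual data model:

```python
from dataclasses import dataclass

@dataclass
class Decomposition:            # base class: how a character breaks into parts
    pass

@dataclass
class Atom(Decomposition):      # a basic component, not divided further
    glyph: str

@dataclass
class Join(Decomposition):      # two components in some join configuration
    config: str                 # eg "across", "downwards", "surround-top"
    first: Decomposition
    second: Decomposition

@dataclass
class Repeat(Decomposition):    # one component repeated in a certain shape
    shape: str                  # eg "twice-across", "triangle", "square"
    part: Decomposition

# 叭 and 只 use the same two components in different configurations
ba  = Join("across",    Atom("口"), Atom("八"))
zhi = Join("downwards", Atom("口"), Atom("八"))
print(ba != zhi)  # True: the arrangement is part of the character
```

Because the configurations and transformations share a base class, generic analysis code can walk any decomposition tree without knowing which kind it is.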

Saturday, June 23, 2007

Regex Too Terse

The purpose of making programming languages terser is so they'll be more readable. But regexes are too terse. To make them readable, we need to make them more verbose. We can format them more easily by using the ?x flag, but the syntax is so different to the languages they're embedded within that they still stick out, requiring mental effort to digest. JVM languages like Groovy aren't stuck with that syntax just because Java uses it. Just as JVM bytecodes are generated by Groovy, so the terse regex syntax could be too. What would a more verbose yet readable regex syntax for Groovy look like?

For starters, we wouldn't need to embed the regex expression inside slashes / /, as the syntax would be mixable with Groovy's. Perhaps it could be generated by a builder. A small amount of syntax could remain the same. The alternation operator | acts like Groovy's | operator. The option operator ? has a parallel in Groovy's ?. operator. We could keep escaped control characters, and change the meaning of \b from word boundary to backspace as in Groovy, so we'd have '\t\n\b\f\r' instead of /\t\n\x08\f\r/.

For character classes, we could use Groovy's sequence syntax, ('a'..'j') instead of /[a-j]/, or [*'a'..'z', *'A'..'Z', '_', *'0'..'9'] instead of /[a-zA-Z_0-9]/. We could use !['a','c','e'] instead of /[^ace]/. Pre-defined classes could have special variable names within the regex builder context, eg, ws for /\s/, digit for /\d/, and word for /\w/. We could even define our own character classes, eg, def hexDigit= [*'0'..'9', *'A'..'F', *'a'..'f'], or def notDigit = !digit for /\D/.

For groups, parentheses are sensible, so 'a'+('b'|'c')+'d' as new syntax for /a(b|c)d/, but groups should be non-capturing by default. For capturing groups, we can use variable names instead of numbers, ie, 'a'+(bc='b'|'c')+'d'+bc+'e' instead of /a(b|c)d\1e/.

For the wildcard, perhaps replace the dot with an underscore, as in 'a'+_+'c' instead of /a.c/. For the repetition operators, we could use sequences, so a new special syntax 'a'*(0..) + 'b'*(1..) instead of /a*b+/, and 'a'*(3..5) instead of /a{3,5}/.

Flags could be indicated by names heading a closure, eg, caseInsignificant{ 'aBc'*(1..)+'DeFg' } instead of /(?i:(aBc)+DeFg)/. Lazy and possessive operators could be indicated by such names, eg, lazy{ 'abc'*(0..) } instead of /(abc)*?/, and possessive{ 'def'*(1..) } instead of /(def)++/.

Lookarounds could also be shown by names, after{'a'} instead of (?=a), !after{'b'} instead of (?!b), and before{'c'} instead of (?<=c). The pre-defined anchors would have special variable names, eg, wordBoundary instead of /\b/, lineStart instead of /^/, and lineEnd instead of /$/. And we could define our own anchors, eg, def sentenceEnd= before{['.','?','!']}.

I thought of this replacement syntax off the top of my head. It's just an idea for a RegexBuilder for Groovy. We could have Groovy statements interacting with the regex syntax, just like other builders do, so we could capture information that would normally be lost in the regex backtracking. Maybe regex functions normally outside the regex string, such as text replacement, could also be done within the builder syntax.
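The same idea can be sketched outside Groovy too. Here's a minimal Python version of a verbose regex builder; the combinator names are illustrative, and each one simply compiles down to the ordinary terse syntax:

```python
import re

# Each combinator returns an ordinary regex fragment as a string
def lit(s):                 return re.escape(s)          # literal text
def seq(*parts):            return "".join(parts)        # one after another
def alt(*parts):            return "(?:" + "|".join(parts) + ")"  # a | b
def optional(p):            return "(?:" + p + ")?"      # zero or one
def repeat(p, lo, hi=""):   return "(?:" + p + "){" + str(lo) + "," + str(hi) + "}"

# the proposed 'a'+('b'|'c')+'d', ie /a(?:b|c)d/, built verbosely:
pattern = seq(lit("a"), alt(lit("b"), lit("c")), lit("d"))
print(pattern)                             # a(?:b|c)d
print(bool(re.fullmatch(pattern, "abd")))  # True
```

Because the builder output is just a string, it mixes freely with the host language's own operators and variables, which is the whole point of the proposal.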

So instead of regex syntax being so terse it's unreadable, and sticking out like a sore thumb from the cool Groovy syntax, it could be made more verbose so it's easily readable, and mixes nicely with other Groovy syntax.

Programming in Unicode

(republished portion)

Unicode, used by both Java and Windows, now has 100,000 characters, a collection of alphabetic letters, CJK (unified Chinese, Japanese, and Korean) characters, digits, symbols, punctuation, etc. But computer programs are still written using a mere 100 tokens, the ASCII characters. It's difficult to key in other characters, and programmers don't know other alphabets. But in a few years, using all the Unicode characters in programs may be standard. Math is a language that uses many more tokens than programming languages, both dedicated symbols and letters from many alphabets. Math can describe concepts extremely tersely, since the greater the range of tokens a language has, the terser it can be written. Yet programming is limited to the 100 tokens on the keyboard. Many people can type those 100 characters faster than they can write them, but can write thousands more they can't type.

Committed programmers are continually looking for ways to make programs terser, yet still readable. They choose languages and tools that enable such tersity, so programming languages evolved into the 2GL (assembler), the 3GL, and the visual 4GL. But 4GLs were limited in their scalability and readability. It’s easier to write a program in a 4GL than a 3GL, but more difficult to read and debug it. So some used IDEs, supplementing 3GLs with visual aids. Others looked for a more productive language, so terser languages, such as Perl, Python, and Ruby, became popular. Regular expressions are a successful attempt at tersity, now used by many languages, but many consider them unreadable. The K programming language, used by financial businesses, could be the tersest language ever invented. It only uses ASCII symbols, but overloads them profusely. However, the price is the inability to give different precedences to the operators, so everything unbracketed is evaluated from the right. The tersity of present-day programming languages is derived from maximizing the use of grammar, the different ways tokens can be combined. The same 100 tokens are used.

Perhaps adding the many Unicode symbols to programming languages would enable terser programs to be written. Operator overloading in C++ was a similar attempt at tersity. Programmers could define meanings for some combinations of the 35 ASCII symbols. Although programs became terser, they were more difficult to understand because of the unpredictable meaning of these symbols in different code contexts, and they were eventually dropped in Java. The problem wasn't with operator overloading itself, but with the uncontrolled association of meanings with each operator. Eventually certain meanings would have become generally accepted, the others falling into disuse, but this would have taken many years, with too many incompatible uses produced in the meantime. If there was such a problem with a few dozen operators, what hope would there be for the hundreds of unused Unicode symbols? If programmers were allowed to overload them with any meaning, the increase in program tersity would be at the cost of readability. Although some Unicode symbols will have an obvious meaning, such as some math symbols, most would have no meaning that could be transferred easily to the programming context. To retain readability of programs in a terse language, the meanings of the Unicode symbols would have to be carefully controlled by the custodians of that language. They would activate new Unicode symbols at a gradual pace only, with control of their meanings, after carefully considering existing use of the symbols.

Programming languages do, however, already allow Unicode characters in some parts of their programs. The contents of strings and comments can use any Unicode character. User-defined names can use all the alphabetic letters and CJK characters, and because there already exists agreed meanings for combinations of these, derived from their respective natural languages, we can increase tersity while keeping readability. But the core of the language, the grammar keywords and symbols, and names in supplied libraries, still only use ASCII characters. Perhaps some programmers use non-Latin characters wherever they can in their programs. A browse through the computer shelves of a typical bookshop in mainland China suggests they only do so for comments and contents of strings, not for user-defined names.
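Python, for instance, already permits this for user-defined names. In the sketch below every user-defined identifier is Chinese, while the keywords and built-ins remain ASCII:

```python
# Chinese user-defined names; keywords and pre-supplied names stay in ASCII
圆周率 = 3.14159              # "pi"

def 面积(半径):               # "area(radius)"
    return 圆周率 * 半径 ** 2

print(面积(2))  # 12.56636
```

This is exactly the split the paragraph describes: user-defined names may draw on the whole of Unicode, but every pre-supplied name is still English.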

Programmers from cultures not using the Latin alphabet won't be motivated to use their own alphabets in user-defined names when they don't with pre-supplied names, such as keywords or standard libraries. Often, most of the names in a program are from libraries standard to the language. To trigger the widespread use of Unicode characters from non-ASCII alphabets in programs, the pre-supplied names must also be in those alphabets. And this could easily be done. The grammar of a language and its vocabulary are two different concepts. A programming language grammar could conceivably have many vocabulary options. Almost all programming languages only have English. Other vocabularies could be based on other natural languages. A Spanish vocabulary plugged into a certain programming language would have Spanish names for the keywords, modules, classes, methods, variables, etc.

Computer software nowadays is internationalized, webpages are, and most programming languages enable internationalized software. But the languages themselves and their libraries are not internationalized. An internationalized programming language would enable a fully functional program to be written totally in a natural language of one’s choice. Not only could all user-defined names be specified in any alphabet, but also all keywords and names in the standard libraries would be available in many natural languages. Ideally, when a software supplier ships a library, they'll specify the public names in many languages in the definition. But this is likely to have a slow uptake, so languages must allow a library to be translated in an incremental manner easily from one natural language into another. Some languages let programmers use mixins and interceptors to do this to various degrees. And some could conceivably allow a preprocessor, pluggable lexer, or closures to internationalize the keywords. But full foreign-language support must be a declared aim of a language's development.

Internationalized programming languages are presently rare, but they will follow the trend of the software they're used to write. Soon enough, most programming languages will be internationalized. The first to be translated will probably be the core Java class libraries, and the first language translated into, probably simplified Chinese.

Sunday, April 08, 2007

Foreigners Typing Chinese

(updated on Thursday 19 April 2007)

A recent study at Hong Kong University using magnetic brain imaging shows that native Chinese speakers reading Chinese characters generate a pattern of brain activity related to spatial processing, believed to be because each visual Chinese character maps to a syllable of speech. When they read English, this same brain activity occurs, whereas the brain activity is different in native English speakers reading English. The researchers believe this suggests that native Chinese readers use their Chinese-reading capability when reading English, associating written English syllables directly to the sounds, and being less capable of applying rules to convert sequences of individual letters into sounds as native English-readers do.

Many years ago, a teacher of Chinese told me the only way foreigners could learn to write Chinese is the same way the Chinese learn at school, that is, to practise writing each character many times until they know it. But as I learnt to read Chinese, I found it easiest to recognize the simplest written characters, the ones that weren't composed of others, and hardest to recognize the most complexly written ones.

Most learning material requires us to learn the commonly used characters first. When I learnt "我是一个人。这个人是我的爱人。", I'd have no trouble remembering the 一 and 人, both atomic characters. Characters made up of only two components, such as 个 (人 above 丨) and 的 (白 left of 勺), were easy if I already knew the meaning of their constituent components, 人、丨、白、and 勺. Other two-component characters, such as 是 (日 over 正), 这 (辶 around the bottom-left of 文), and 我 (手 before and joined to 戈), were more difficult when I hadn't already learnt 日、正、手、and 戈. And a character like 爱 (爪 over 冂 over 十 around the top-left of 又) was very difficult to recognize when reading. Further complicating the issue, the choice of what's an atomic component in Chinese is arbitrary, as Chinese characters can be continually sub-divided in a tree-like structure all the way down to individual strokes, eg, 正 is 一 over 止, 止 is 丄 around the bottom-left of 卜, 文 is 亠 over 乂, 亠 is 丶 over 一.
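That tree-like sub-division is easy to model in code. A toy sketch in Python, with a table built only from the decompositions just given (the configuration labels are illustrative):

```python
# Each character maps to (configuration, part, part); anything absent is atomic
TABLE = {
    "正": ("over", "一", "止"),
    "止": ("around-bottom-left", "丄", "卜"),
    "文": ("over", "亠", "乂"),
    "亠": ("over", "丶", "一"),
}

def atoms(ch):
    """Recursively flatten a character into components not in the table."""
    if ch not in TABLE:
        return [ch]
    _, a, b = TABLE[ch]
    return atoms(a) + atoms(b)

print(atoms("文"))  # ['丶', '一', '乂']
print(atoms("正"))  # ['一', '丄', '卜']
```

The arbitrariness the paragraph describes shows up directly: whatever is left out of the table becomes, by that choice, an "atomic" component.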

I'm wondering if the converse of the Hong Kong University study applies to native English speakers reading Chinese. When we read Chinese, do we use the same brain activity as when we read English? If so, that may be why I "spell out" the components of Chinese characters I don't know. A native Chinese reader may see the character 爱 as a single shape, but I see 4 distinct "letters" (爪冂十又) in that character. For those of us whose first reading language was alphabetic, perhaps the easiest way to learn to read Chinese is to learn the components first. In primary school we practised writing each letter of the alphabet before we practised writing words, and maybe we should learn all the components of Chinese before writing characters.

In the future, many Westerners will need to learn Chinese characters for various reasons, primarily to remain competitive in information-based industries. Because I believe the Chinese will one day use simplified Chinese characters in computer programs to gain the advantage that comes from writing terser code, I've been motivated to study how foreigners can best learn Chinese characters so they can also benefit from writing such terser code. I needed to analyse the constituents of Chinese characters.

There were many issues to consider when choosing which characters to include for analysis. Simplified characters are used in Mainland China, complex ones in Hong Kong, Taiwan, and Chinatowns. My experience is that when someone needs to learn one type of Chinese characters, they often need to learn those of the other type also. Additionally, Unicode treats characters from both Chinese and Japanese languages as one script, the "Unified CJK Characters", or "Unihan" characters. The Unicode Consortium considered that there would be too many duplicately encoded characters if they encoded simplified Chinese, complex Chinese, Japanese, and the other East Asian ideographic scripts separately, and since the characters are all descended from one ancient source, the consortium unified the scripts using some guidelines. These guidelines provided that characters with the same meaning and abstract shape be unified into one, irrespective of source script and ignoring typeface differences. (There were exceptions for the 20,000 most common CJK characters, such as not unifying two characters that are distinct in a single source script.)

I chose the 20,923 characters in the Unicode CJK common ideograph block, plus the 12 unique characters from the CJK compatibility block. This seemed the most natural place to draw the line, as foreigners who learn Chinese characters for programming would likely progress to using other CJK characters. I inputted decompositions of those characters, aided by various internet listings. For each decomposition, I recorded a "Decomposition Configuration", and up to two constituent components. Only pictorial configurations were used, not semantic ones, because the decompositions are intended for foreigners when they first start to learn CJK characters, before they're familiar with meanings of characters. Where characters had typeface differences I used the one in the Unicode spec reference listing. When there was more than one possible configuration, I selected one based on how I thought a fellow foreigner would analyse the character. In future, I may add alternative configurations to the data.

I had to "create" a few thousand "characters" to cater for decomposition components not themselves among my collected characters. (Though I could have found many of them in the CJK extension A and B blocks, I wanted to keep those out of scope.) To represent these extra characters in the data, sometimes I used a multi-character sequence, sometimes a user-defined glyph. I've avoided using them in this blog entry because my user-defined fonts can't be used, instead I'm relying on verbal descriptions such as "the bottom part of 节".

With this data, I've been able to perform much analysis using programming. I've used the Groovy Programming Language, similar in function to the more popular Perl, Python, PHP, and Ruby "scripting" languages. The two key issues regarding the data were choosing the decomposition configurations and choosing the basic atomic components.

My main purpose is to generate an optimal collection of atomic components and their keyboard assignments that are intuitive for foreigners first learning to type CJK characters pictorially. Asian IME's generally target one particular script (simplified Chinese, traditional Chinese, Japanese, or Korean), not all CJK characters, and are intended for those who already intimately know the written language of that script, eg, the popular Wubizixing is for native readers of simplified Chinese, not for foreigners.

Wubizixing enables characters to be entered using at most four keystrokes by representing characters by their first three and last components only. Westerners are used to typing full words, and wouldn't want to type my previous sentence as "wubg enas chas to be entd usig at most four keys by repg chas by the firt thre and last coms only". Wubizixing also makes commonly-used groups of components available as one keystroke, such as 鱼 (instead of 勹, 田, and 一), 早 (instead of 日 and 十), and 四 (instead of 囗 and 儿). As a result, Wubizixing uses almost 200 components. The popular Taiwanese Cangjie input method is similar. But Westerners aren't used to typing common sequences of letters, such as 'tion', 'str', and 'th', with one keystroke. For foreigners, the number of keystrokes for a character should be related to its complexity, not its frequency of use. Another input method, Wubihua, used on mobile phones in mainland China, uses only 5 components: the basic strokes 一丨丿丶乙. This is too low-level for a foreigner, similar to typing a keystroke for each stroke of a capital Latin letter. I'm looking for a set of components at a level between these 5 strokes and Wubizixing's 200 components.
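The English analogy above follows a simple rule: each word longer than four letters is cut to its first three letters plus its last, mimicking entering a character by its first three components and its last one. A Python sketch:

```python
# Abbreviate a word Wubizixing-style: first three letters plus the last
# letter; words of four letters or fewer are left whole.
def abbreviate(word):
    return word if len(word) <= 4 else word[:3] + word[-1]

sentence = "wubizixing enables characters to be entered"
print(' '.join(abbreviate(w) for w in sentence.split()))
# → wubg enas chas to be entd
```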

Although I've only analysed 20,000 of the 70,000 CJK characters presently in Unicode, my set of atomic components must be useful for entering those other CJK characters also. There are another 10,000 Korean-only characters, which are based on 24 Korean components. If I match the keyboard assignments of my set of CJK components to the standard Korean keyboard as much as possible, I could have 80,000 of the 100,000 Unicode characters in one seamless pictorial-based input method.
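Korean shows why this seamless coverage is plausible: every precomposed Hangul syllable in Unicode decomposes arithmetically into its constituent letters, using the formula from the Unicode standard. A Python sketch:

```python
# Every precomposed Hangul syllable (U+AC00..U+D7A3) decomposes
# arithmetically into a leading consonant, a vowel, and an optional
# trailing consonant, per the Unicode standard's Hangul syllable formula.
LEADS  = 'ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ'            # 19 choices
VOWELS = 'ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ'       # 21 choices
TAILS  = [''] + list('ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ')  # 28, incl. none

def decompose(syllable):
    index = ord(syllable) - 0xAC00
    lead, vowel, tail = index // (21 * 28), (index // 28) % 21, index % 28
    return LEADS[lead], VOWELS[vowel], TAILS[tail]

print(decompose('한'))   # → ('ㅎ', 'ㅏ', 'ㄴ')
```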

I don't want to rush through this final stage of the task, as it's important I get this part right.

Friday, March 30, 2007

Programming Language Fluency

Steven Pinker, in his book "Words and Rules", suggests a natural language such as English is stored in the human brain as words (vocabulary) and rules (grammar). As children we learn words, such as "cat", "cats", "dog", "dogs", "lion", and "lions". Our brains then recognize a rule, remember the rule "add -s for plural", then forget the plural words. After remembering more words such as "tiger", "elephant", and "mouse", our brains then remember the exception-to-the-rule words "mice is plural of mouse". Most rules of English, however, are about joining words together to generate a phrase or clause. And sometimes a word group is remembered as vocabulary when it can't be generated using a rule, eg, phrasal verbs like "put up with" and idioms like "pulling one's leg".
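Pinker's words-and-rules model can be sketched as a lookup table of remembered exceptions backed by a generative rule; the tiny irregulars table here is purely illustrative:

```python
# Words-and-rules sketch: irregular plurals are stored as vocabulary
# (a lookup table), regular plurals are generated by rule.
IRREGULARS = {'mouse': 'mice', 'child': 'children'}

def pluralize(noun):
    # Vocabulary first: use the remembered exception if there is one...
    if noun in IRREGULARS:
        return IRREGULARS[noun]
    # ...otherwise apply the rule "add -s for plural".
    return noun + 's'

print(pluralize('cat'), pluralize('mouse'))   # → cats mice
```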

Computer languages have many similarities to natural languages. The syntax is the grammar, and the keywords, libraries, commands, macros, etc are the vocabulary. Computer languages let us define our own variable names, just as natural languages let us invent new proper nouns. Comments are like interjections, strings like foreign words, expressions are phrases, statements are clauses, closures are relative clauses, blocks are sentences, semi-colons and commas are conjunctions, methods and functions are verbs, interfaces and annotations are adjectives, classes and properties are nouns, some packages’ classes are commonly-used nouns, while others’ are less-common technical words.

Over the last few years, powerful "scripting" languages such as Python, Ruby, and Groovy have been growing more popular. They come after the dead end in programming language evolution known as "Fourth Generation Languages". It was easier to write a program in a visual 4GL, but far more difficult to read and debug it. Most people prefer to use a combination of both language and visual design, but it’s easier to decorate language with visuals (eg, a program listing using indentation, with IDE color highlighting) than visuals with language (eg, a form layout with code snippets on each element). Human brains are wired for language, whether natural or programming, hence scripting languages' recent rise in popularity.

But scripting languages do have some drawbacks that prevent them from working well with an IDE. Dynamic typing means the language often can’t be supported very well by the visual code auto-completion feature. So to use a dynamically-typed language well, we must be able to recall the most common methods and properties of the most common classes.

Imagine we can speak English with the correct grammar words, stresses, pauses, tones, etc, but don’t know a substantial portion of the vocabulary. To help us, we keep a friend with us always when we go out. So when speaking to a waiter at a restaurant, we say “There’s a”, then turn to our friend for a hint. Our friend tells us “fly”, so we say to the waiter “fly in my”, turn to our friend who tells us “soup”, then tell the waiter “soup!” This friend is just like the IDE auto-completors we use when programming. It’s unnatural to know far more language grammar than vocabulary, and the increasing uptake of dynamically typed languages could correct this imbalance in our knowledge of programming languages.

Young babies learn the vocabulary of their natural language faster than the grammar. Adult foreign language classes focus on topics, primarily requiring learning vocabulary and expressions about the topic. But those learning programming learn language grammar, and keep a class library reference book handy or use a memory-clogging GUI with auto-completion. A programming language should be learnt just like a natural language, grammar and vocabulary together. Perhaps the faulty learning method is why only certain types of people learn to use such languages well. If programming languages were taught properly, perhaps everyone would be able to learn them, just as they learn to count. Perhaps dynamically typed languages that don’t work well with IDE auto-completors could be the first programming languages to be learnt as all languages, whether natural or computer, should be: grammar and vocabulary together. Programmers must become fluent in the language, not just learn about the language.

I'm working on such a tutorial for the Groovy Programming Language. Java programmers are Groovy’s target market, but those who have never programmed in Java can benefit from learning Groovy first: they can be productive sooner. Perhaps they'll go on to learn and work with Java, or maybe they'll find Groovy to be sufficient for all their needs. Either way, Groovy should be easily learnable by them. The Java class libraries, both standard and enterprise editions, are huge, so I've drawn a rough circle around groovy.lang, groovy.util, java.lang, java.util, java.math, java.text, java.util.regex, and maybe also java.lang.reflect and java.util.concurrent. They seem to be the "core" packages and I'll consider them to be the "vocabulary" of Groovy. In my first pass through them, I'll focus on completeness of information.

The tutorial will pay attention to those whose native language isn’t English. Rather than relying on many translations eventually going out of sync with one another, it’s best the tutorial be in English, because most foreign programmers can read English, if not speak it. But the tutorial must use a foreigner-readable style and internationally-known words. The focus of learning must be on code examples, not wordy explanations.

The tutorial will also aim to present topics in the best sequence. People learning Groovy want to be productive as quickly as possible, and so should often learn things in the opposite sequence to people who learnt Java first. Just because closures, categories, interceptors, and builders are more recent concepts doesn’t mean learners should learn them later. Perhaps they need to learn closures before functions, functions and expandos before classes, collections before arrays, encapsulation and static members before instances, categories and interceptors before the complexity of static typing, inheritance, method overriding, multi-methods, casting, etc.

The tutorial will treat commonly-used extensions as an integral part of the language. Regex is an extension to both Java and Groovy, and its syntax is quite different to Groovy’s. But they are meant to be used together, and should be learnt together. There's no conceptual difference between a math expression and a regex pattern; many consider regexes separate from the language only because one is quoted in the syntax and the other isn't. But programmers must be comfortable using both the procedural and sometimes lispy styles of base Groovy and the more declarative style of regexes, printf format strings, or any other commonly-used extension.

I'll firm up the sequence of topics during my second pass through the information, when I expand the examples and explanations. For now, I've drawn a rough circle around these ones:
  • Getting started - basic concepts needed for each subsequent tutorial
  • integers, BigDecimal, floating-points
  • enums
  • dates & times
  • Collections, Queues/Deques, Arrays, Maps
  • Strings, Characters, regexes
  • input/output, networking
  • blocks, Closures, functions
  • Expandos, classes, categories
  • static typing
  • interfaces
  • inheritance, method overriding, multi-methods
  • method-to-syntax mappings (in, as, operator overloading)
  • exceptions
  • permissions
  • annotations
  • multi-threading
  • tree processing, builders, XML
  • interceptors, ExpandoMetaClass
  • packages, evaluate, class loading
  • reflection
  • internationalization
  • Further learning - pointers to Java, IDE's, Swing, Grails, Gant, etc
I'm expecting all this to take me until next year sometime.

Tuesday, March 13, 2007

Internationalizing Keywords

Some programming languages, such as Perl and PHP, distinguish variable names from other language tokens by prefixing them with a special character. Sometimes that character also indicates something about the type of the variable, such as $i to indicate a scalar value, or @arr for a vector one. This is handy because the language can then use any word beginning with an alphabetic character as a keyword. Other languages allow variables to begin with any alphabetic character, but not to be one of the keywords of the language, eg, no variables may be called class in Java.

It would be quite easy to internationalize the keywords in a programming language of the first type. A context-sensitive lexical preprocessor could simply replace the native-language-specific keywords with the English ones, eg, 作 with do, 回 with return, etc for Chinese. For the second type of programming language, if programs used, say, Chinese words as keywords, then they should let programmers use the unused English keywords as variable names. To allow this, before a preprocessor converted Chinese characters to the equivalent English keywords, it must somehow mangle names that are English keywords into an eligible alternative. In some languages, such as Groovy, this would be as simple as quoting the name, eg, method.'static'.params instead of method.static.params. In other languages, the mangling is more difficult. In any language with sizable libraries, there'd also be inter-module and namespace issues to deal with.
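A minimal Python sketch of such a lexical preprocessor, using the two keyword mappings from this post (作 → do, 回 → return) and skipping string literals so their contents are never translated. A real preprocessor would need a proper tokenizer; this illustrates only the idea:

```python
import re

# Hypothetical Chinese-to-English keyword mapping (作/回 from the post).
KEYWORDS = {'作': 'do', '回': 'return'}

# Split out single- and double-quoted string literals; the capturing
# group makes them land at the odd indices of the split result.
STRING = re.compile(r"""("[^"]*"|'[^']*')""")

def translate(source):
    parts = STRING.split(source)
    out = []
    for i, part in enumerate(parts):
        if i % 2 == 1:          # odd indices are string literals: keep as-is
            out.append(part)
        else:                   # even indices are code: replace keywords
            for zh, en in KEYWORDS.items():
                part = part.replace(zh, en)
            out.append(part)
    return ''.join(out)

print(translate("作 { x = '回' }"))   # → do { x = '回' }
```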

Another way a language of the second type could internationalize its keywords is to replace all its keywords with symbols and punctuation, then use a preprocessor on all programs. The keywords would be macros that add in these symbols.
The default preprocessor would be the English-language one, but any could be used. To use keywords in another language, we could exchange the default preprocessor for one in that other language, eg, a Spanish one. The default preprocessor would be conceptually separate from the compiler, but could in fact be tightly coupled at the implementation level to provide more efficiency for those using keywords in English, the default natural language. Such a language of the second type contrasts with one of the first type as far as token usage goes: one uses alphabetic characters only for names, the other only to begin keywords.

As a case study, I'll look at how a hypothetical language with Java's keywords could replace them with ASCII symbols and punctuation. I'll divide them into nouns, verbs, and adjectives/adverbs as far as possible.

The Nouns: With auto- and unboxing, the keywords for void, char, int, long, short, byte, float, double, and boolean could be macros that add in java.lang.Void, java.lang.Character, java.lang.Integer, etc, and the semantics would be unchanged. true and false could be replaced with java.lang.Boolean.TRUE and java.lang.Boolean.FALSE. If the Null type comes in Java 7, null could similarly be replaced.

The Adjectives: The modifiers could be considered to be annotations that the compiler sees first, before any "other" annotation processor. So static, private, protected, public, abstract, final, volatile, transient, native, strictfp, and synchronized (as a modifier) could be macros that add in equivalent annotations, eg, static with @Static, protected with @Access("Protected"), etc, depending on the lexical context of the macro. Because interface acts as a modifier of an implied class, it could be a macro that adds in @Interface class.

The Verbs: Many of the keywords are at the beginning of the line, and look like commands, function calls, closure calls, etc. By leaving those keywords in the language but also allowing them to be used as names, the compiler could determine from the context which usage was intended. From the programmer's point of view, a while{ ... } statement would be no different to the use{ ... } closure call, and the assert ... statement no different to the println ... call. Eligible keywords from Java are: for, while, do, if, else, switch, case, default, try, catch, finally, return, throw, break, continue, package, import, class, assert, and synchronized (as a block header). They could even look like they're defined in a standard library class, eg, mylang.lang.System.for(...), mylang.lang.System.while(...), etc. To enable this, the language would need to allow closures with multi-name syntax, eg, myIf(...) myElse(...). Such keywords could then be internationalized by programmers using the same mechanism as for names in standard libraries. (Though in fact, some of these verb-keywords may be definable in terms of others, eg, default defined as case Object || null.)
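As a sketch of the idea that verb-keywords could behave like library-defined calls, here is how an if/else pair might be modelled as ordinary functions taking closures. Python stands in for the hypothetical language, and all names are invented for illustration:

```python
# Verb-keywords as ordinary library functions: control flow becomes
# calls that take closures (here, Python lambdas) as arguments.
def my_if(condition, then, orelse=lambda: None):
    return then() if condition else orelse()

def my_while(condition, body):
    while condition():
        body()

x = 5
result = my_if(x > 3, lambda: 'big', lambda: 'small')
print(result)   # → big
```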

As for the other keywords:
extends and implements can be distinguished from their context, and could be replaced with a colon. So could throws. const and goto could finally be retired. new and instanceof could each be macros that add in some alternative symbols. (Some languages with Java syntax do the opposite, adding in new keywords, such as as and in, as alternatives to symbols in the language.) We could eliminate this and super as keywords by considering a class to be divided into an outer, static portion, and an inner, instantiable portion, borrowing an idea from the Scala language. The static modifiers would be absent from the outer portion, and the inner portion would be bracketed with object this extends super{ ... }, where this and super are simply defaults for any names a programmer might choose. The current class definition syntax would expand as a macro to this new syntax, and the new object keyword would be a verb-keyword, just like class.

We could thus internationalize the keywords for a Java-style language which uses keywords and names with the same syntactic form.