Thursday, February 12, 2009

The Rise of Unicode

The next version of Unicode will be v5.2, the latest revision of a unified character set that now contains over 100,000 characters. One notable addition in v5.2 will be the Egyptian hieroglyphs, the earliest known system of human writing. Perhaps they will mark Unicode's coming of age, as another huge step in representing language with graphical symbols. Let's look at a consolidated short history of writing systems, courtesy of various Wikipedia pages, to see Unicode's rise in perspective...

Egyptian hieroglyphs were invented around 4000-3000 BC. The earliest type of hieroglyph was the logogram, where a common noun (such as sun or mountain) is represented by a simple picture. These existing hieroglyphs were then used as phonograms, to denote more abstract ideas with the same sound. Later, these were modified by extra trailing hieroglyphs, called semagrams, to clarify their meaning in context. About 5000 Egyptian hieroglyphs existed by Roman times. When papyrus replaced stone tablets, the hieroglyphs were simplified to accommodate the new medium, sometimes losing their resemblance to the original picture.

The idea of such hieroglyphic writing quickly spread to Sumeria, and eventually to ancient China. The ancient Egyptian and Sumerian hieroglyphs are no longer used, but modern Chinese characters are descended directly from the ancient Chinese ones. Because Chinese characters spread to Japan and ancient Korea, they're now called CJK characters. By looking at such CJK characters, we can get some idea of how Egyptian hieroglyphs worked. Many CJK characters were originally pictures, such as 日 for sun, 月 for moon, 田 for field, 水 for water, 山 for mountain, 女 for woman, and 子 for child. Some pictures have meanings composed of other meanings, such as 女 (woman) and 子 (child) combining into 好, meaning good. About 80% of Chinese characters are phonetic, consisting of two parts, one semantic, the other primarily phonetic, e.g. 土 sounds like tu, and 口 means mouth, so 吐 also sounds like tu, and means to spit (with the mouth). The phonetic part of many phonetic characters often also provides secondary semantics to the character, e.g. the phonetic 土 (in 吐) means ground, where the spit ends up.

Eventually in Egypt, a set of 24 hieroglyphs called uniliterals evolved, each denoting one consonant sound in ancient Egyptian speech, though they were probably only used for transliterating foreign names. This idea was copied by the Phoenicians by 1200 BC, and their symbols spread around the Middle East into various other languages' writing systems, having a major social effect. It's the basis of almost all alphabets used in the world today, apart from CJK characters. These Phoenician symbols for consonants were adapted by the ancient Hebrews and later for Arabic, but when the Greeks copied them, they reused the symbols of unneeded consonants for vowel sounds, becoming the first writing system to represent both consonants and vowels.

Over time, cursive versions of letters evolved for the Latin, Greek, and Cyrillic alphabets so people could write them easily on paper. They used either the block or the cursive letters, but not both, in one document. The Carolingian minuscule became the standard cursive script for the Latin alphabet in Europe from about 800 AD. Soon after, it became common to mix block (uppercase) and cursive (lowercase) letters in the same document. The most common system was to capitalize the first letter of each sentence and of each noun. Chinese characters have only one case, but that may change. Simplified characters were introduced in 1950s mainland China, replacing the more complex characters still used in Hong Kong, Taiwan, and Western countries. Nowadays in mainland China, though, both complex and simplified Chinese are sometimes used in the same document, the complex ones for the more formal parts. Perhaps one day complex characters will mix with simplified ones in the same sentence, turning Chinese into another two-case writing system.

Punctuation was popularized in Europe around the same time as cursive letters. Punctuation is chiefly used to indicate stress, pause, and tone when reading aloud. Underlining is a common way of indicating stress. In English, the comma, semicolon, colon, and period (,;:.) indicate pauses of varying degrees, though nowadays only the comma and period are used much in writing. The question mark (?) replaces the period to indicate a question, of either rising or falling tone; the exclamation mark (!) indicates a sharp falling tone.

The idea of separating words with a special mark also began with the Phoenicians. Irish monks began using spaces around 600-700 AD, and this quickly spread throughout Europe. Nowadays, the CJK languages are the only major languages not using some form of word separation. Until recently, the Chinese didn't recognize the concept of a word in their language, only of the (syllabic) character.

The bracketing function of spoken English is usually performed by saying something at a higher or lower pitch, between two pauses. At first, only the pauses were shown in writing, perhaps by pairs of commas. Hyphens might replace spaces between words to show which ones are grouped together. Eventually, explicit bracketing symbols were introduced at the beginning and end of the bracketed text. Sometimes the same symbol was used to show both the beginning and the end, such as pairs of dashes to indicate appositives, and pairs of quotes, either single or double, to indicate speech. Sometimes different paired symbols were used, such as parentheses ( and ). In the 1700's, Spanish introduced inverted ? and ! at the beginning of clauses, in addition to the right-way-up ones at the end, to bracket questions and exclamations. Paragraphs are another bracketing technique, being indicated by indentation.

Around 1050, movable-type printing was invented in China. Instead of carving an entire page on one block as in block printing, each character was on a separate tiny block. These were fastened together into a plate to reflect a page of a book, and after printing, the plate was broken up and the characters reused. But because thousands of characters needed to be stored and manipulated, making movable-type printing difficult, it never replaced block printing in China. European alphabets, by contrast, need fewer than a hundred letters and symbols, making the process much easier. So when movable-type printing reached Europe, the printing revolution began.

With printing, a new type of language matured, one that couldn't be spoken very well, only written: the language of mathematics. Mathematics, unlike natural languages, needs to be precisely represented. Natural languages are very expressive, but can also be quite vague. Numbers were represented by many symbols in ancient Egypt and Sumeria, but had been reduced to a mere 10 digits by the Renaissance. From then on, though, mathematics started requiring many more symbols than merely two cases of 26 letters, 10 digits, and some operators. Many symbols were imported from other alphabets, different fonts were introduced for Latin letters, and many more symbols were invented to accommodate the requirements of writing mathematics. Mathematical symbols are now almost standardized throughout the world. Many other symbol systems, such as those for chemistry, music, and architecture, also require precise representation. Existing writing systems changed to utilize the extra expressiveness that came with movable-type printing. Underlining in handwriting was supplemented with bolding and italics. Parentheses were supplemented with brackets [] and curlies {}.

Fifty years ago, yet another type of language arose, this one for specifying algorithms: computer languages. The first computer languages were easy to parse, requiring little backtracking, but the most popular syntax, that of C and its descendants, requires more complex logic and greater resources to parse. Most programming languages used a small repertoire of letters, digits, punctuation, and symbols, being limited by the keyboard. Other languages, most notably APL, attempted to use many more symbols, but this never became popular. Unlike mathematics, computer languages relied on parsing syntax, rather than a large variety of tokens, to represent algorithms. Computer programs generally copied natural language writing systems, using letters, numbers, bracketing, separators, punctuation, and symbols in similar ways. One notable innovation of computer languages, though, is camel case, popularized for names in C-like syntaxes.

The natural language that spread around the world in modern times, English, doesn't use a strict pronunciation-spelling correspondence, perhaps one of the many reasons it spread so rapidly. English writing therefore caters for people who speak English with widely differing vowel sounds and stress, pause, and tone patterns. In this way, English words are a little like Chinese ideographs. As Asian economies developed, techniques for quickly entering large-character-set natural languages were invented, known as IMEs (input method editors). But these Asian countries still use English for computer programming.

Around 1990 Unicode was born, unifying the character sets of the world. Initially, there was only room for about 60,000 characters in Unicode, so the CJK characters of China, Japan, and Korea were unified to save space. Unicode is also bidirectional, catering to Arabic and Hebrew. Top-to-bottom scripts such as Mongolian and traditional Chinese can be simulated with left-to-right or right-to-left text direction by using a special sideways font. However, Unicode didn't become very popular until its UTF-8 encoding, invented in the early 1990s, became widespread, allowing backwards compatibility with ASCII. Unicode was also extended well beyond its original 16-bit range, making room for about 1.1 million code points and allowing less commonly used scripts such as Egyptian hieroglyphs to be encoded.
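Both properties are easy to see in any language with built-in Unicode support; here's a small sketch in Python (the byte counts follow the UTF-8 encoding rules):

```python
# ASCII characters encode to the same single byte under UTF-8,
# so an old ASCII file is already valid UTF-8.
assert "A".encode("utf-8") == b"A"

# Characters outside ASCII take two to four bytes each.
print(len("é".encode("utf-8")))          # 2 bytes (U+00E9)
print(len("好".encode("utf-8")))         # 3 bytes (U+597D)
print(len("\U00013000".encode("utf-8"))) # 4 bytes (an Egyptian hieroglyph, U+13000)
```

The four-byte sequences are what give UTF-8 its reach beyond the original 16-bit range.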

Many programming languages have recently adopted different policies for using Unicode tokens in names and operators. The display tokens in Unicode are divided into various categories and subcategories, mirroring their use in natural language writing systems. Examples of such subcategories are: uppercase letters (Lu), lowercase ones (Ll), digits (Nd), non-spacing combining marks, e.g. accents (Mn), spacing combining marks, e.g. Eastern vowel signs (Mc), enclosing marks (Me), invisible separators that take up space (Zs), math symbols (Sm), currency symbols (Sc), start bracketing punctuation (Ps), end bracketing (Pe), initial quote (Pi), final quote (Pf), and connector punctuation, e.g. underscore (Pc).
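These subcategories can be queried programmatically; for instance, Python's standard unicodedata module reports the general category of any character. A quick sketch using some of the subcategories mentioned above:

```python
import unicodedata

# General categories of some typical program tokens
for ch in "A", "a", "7", "+", "$", "_", "(", ")":
    print(ch, unicodedata.category(ch))
# A Lu  (uppercase letter)
# a Ll  (lowercase letter)
# 7 Nd  (decimal digit)
# + Sm  (math symbol)
# $ Sc  (currency symbol)
# _ Pc  (connector punctuation)
# ( Ps  (start bracketing)
# ) Pe  (end bracketing)
```

Language designers typically define identifier and operator syntax in terms of exactly these categories, rather than listing characters individually.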

For a greater variety of Unicode tokens to become popular in computer programs, there must be a commonly available IME for entering them with keyboards. Sun's Fortress provides keystroke sequences for entering mathematical symbols in programs, but leaves it vague whether the Unicode tokens or the ASCII keys used to enter them are the true tokens in the program text. And of course there must be a commonly available font representing every token. Perhaps because of the large number of CJK characters, and the recent technological development of mainland China, a large number of programmers may one day suddenly begin using them in computer programming to make their programs terser.

Language representation using graphical symbols has taken many huge leaps in history: Egyptian hieroglyphs to represent speech around 5000 years ago, an alphabet to represent consonant and vowel sounds by the Phoenicians and Greeks around 3000 years ago, movable-type printing in Europe around 500 years ago, and the unification of the world's alphabets and symbols into Unicode a mere 20 years ago. And who knows what the full impact of this latest huge leap will be?