Friday, April 30, 2010

Orders of Infinity

edited and abridged on 6 Feb 2018

A follow-up to my previous post Mass-Parity-Distance-invariant Universe where I described a Universe where negative mass explains dark energy...

Cantor's hierarchy of infinities

Let's look at its position in Cantor's hierarchy of infinities, as that seems more foundational than the other branches of mathematics, and move on up from there.

We know the zeroth order of infinity (a.k.a. finiteness) can define a logic system, by using two or more ordered finite values, i.e. false and true for boolean logic, more for other logic systems. The first order of infinity (a.k.a. aleph-0) can define an arithmetic system. The simplest is the natural numbers, created by starting at number 1 and applying induction. Extensions to this such as the integers and rational numbers are also aleph-0. Gödel showed this arithmetic system cannot be both consistent and complete.

Things get more interesting at the second order of infinity. By Cantor's theorem, the power set of a set at one order of infinity is at some strictly higher order. The real numbers, which extend the rationals by introducing irrational values such as the square root of 2, can be proven to be at some order higher than the first (integers, rationals, etc), but we can't prove exactly which order that is. We therefore speculate they're at the second order (a.k.a. aleph-1), calling this the Continuum Hypothesis. Other number systems with the property of continuity (e.g. complex numbers, n-D manifolds) would then also be aleph-1, but the complex numbers, which introduce the square root of -1, are "algebraically closed", not requiring any further extensions.

And what of aleph-2, the third order of infinity? The set of all curves (including fractal ones) is known to be at some order of infinity above aleph-1, and hypothesized to be exactly aleph-2. When looking at the curves, maybe it's best to consider the most complex curves first, such as 1D-lines with a (Hausdorff-) dimension of 2. The most famous of these are the boundaries of the Mandelbrot and Julia sets, which both happen to be defined on the complex plane. When we look at computer simulations of them, we see many concentric closed curves of various colors, reflecting the different integer-valued accuracies of calculation. If we could mathematically define a real-valued accuracy of calculation, would we still see concentric fractal curves? I suspect so, and that they would merge into a continuously-varying fractal structure with (Hausdorff-) dimension of 3, on the 2-dimensional complex plane. All of the Julia sets for the standard Mandelbrot set look like they have this same concentricity property, though only some of them seem to have (Hausdorff-) dimension 2, the rest (on inspection) seeming to have a dimension of less than 2. One is even a perfect circle, with only (Hausdorff-) dimension 1. And of course we could consider the nonstandard Mandelbrot views for all these Julia sets.
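Here's a minimal Groovy sketch of the escape-time calculation behind those colored bands: the integer it returns is the integer-valued "accuracy of calculation" for a point, and the same routine gives both the Mandelbrot and Julia views (the function and variable names are my own, not from any library):

  // A minimal escape-time sketch. The integer returned is the integer-valued
  // "accuracy of calculation" (the iteration count) that gives each colored band
  // around the Mandelbrot and Julia sets.
  int escapeCount(double zr, double zi, double cr, double ci, int maxIter = 256) {
      for (int n = 0; n < maxIter; n++) {
          if (zr * zr + zi * zi > 4) return n        // escaped: |z| > 2
          double nzr = zr * zr - zi * zi + cr        // z = z*z + c, on real and imaginary parts
          double nzi = 2 * zr * zi + ci
          zr = nzr
          zi = nzi
      }
      return maxIter                                 // treated as "inside" at this accuracy
  }

  // Mandelbrot view: z starts at 0 and c is the point being colored.
  def mandelLevel = { double cr, double ci -> escapeCount(0, 0, cr, ci) }
  // Julia view: c is fixed and the starting z is the point being colored.
  def juliaLevel = { double zr, double zi, double cr, double ci -> escapeCount(zr, zi, cr, ci) }

  println mandelLevel(-0.5, 0.5)              // a point near the Mandelbrot boundary
  println juliaLevel(0.3, 0.3, -0.8, 0.156)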

The Julibrot Set

Let's use these two fractals in a candidate mathematical model for our Universe. The Julibrot Set is defined as the topologically-4D union of all the Julia sets of the Mandelbrot set. If we consider real-valued accuracies of calculation for the entire Julibrot, we have a (Hausdorff-) dimensionality of somewhere between 4 and 6, embedded in the 4 topological dimensions. Now I suspect there could be a theorem concerning the (Hausdorff-) dimension in such a structure: I'll speculate it's 5 and see where that leads. If such a 2-complex-plane structure explains the 4D-spacetime of our Universe, then there's an extra non-expansive dimension supplied by considering the fractal curves at different positive-real-valued degrees of accuracy. This could explain the phenomenon of mass in our 4D-spacetime, along with certain rules of distribution within, which could be the law of Einsteinian gravity.

What would be the order of infinity for this Julibrot-based Universe? There could be a theorem saying this structure is necessary and sufficient to contain all curves, and is therefore at the third order of infinity, aleph-2. Our Universe could then be the minimally complex structure that can exist at aleph-2. Furthermore, just as the complex numbers are algebraically closed at aleph-1-infinity, so our 4D-spacetime with its inbuilt phenomenon of Mass, obeying rules both of gravitation and of quantum physics, could also be complete in some way at aleph-2-infinity.

The directed dimension and inbuilt polarity

Any recursively-applied polynomial equation seems to give the basic Mandelbrotly-edged shape on the computer, so presumably they all have a high enough Hausdorff dimension to be a candidate model. But only the simple Julibrot Sets are reflectionally or rotationally symmetric in 3 dimensions while asymmetric in the 4th, giving a dimension that looks like Time.

By looking at the Julibrot's constituent Mandelbrot and Julia sets on the computer, including their colored accuracy levels, we see that the Mandelbrot-real dimension is the asymmetric dimension, i.e. the Time dimension. As a bonus, the Mandelbrot-imaginary dimension is the only one that's reflectionally symmetric, therefore the one along which positive and negative matter flew apart, i.e. the Axial Space dimension. The Julia sets at each point below the standard Mandelbrot Time axis (where y=0 on the plane) are similar to the corresponding ones above, except for being reflected through their Julia-real axis. This gives the appearance of the Mass rotating in opposite directions in each half of the structure, perhaps suggesting the opposite handedness of gravitons and neutrinos in each half of the mass distribution in our Universe. But in each half of the Mandelbrot Mass distribution, when we look toward the Time axis, the Mass appears to rotate in the same direction, perhaps suggesting why space appears to have an inbuilt polarity.
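That reflection can be checked numerically: for z*z + c, conjugating both the starting point and the parameter conjugates every iterate, so the escape counts agree exactly. A small check, reusing the escapeCount sketch above (the sampling ranges are arbitrary):

  // The Julia set at a parameter below the Time axis (ci < 0) is exactly the mirror,
  // through the Julia-real axis, of the one at the conjugate parameter above it.
  def rnd = new Random(42)
  100.times {
      double zr = rnd.nextDouble() * 4 - 2
      double zi = rnd.nextDouble() * 4 - 2
      double cr = rnd.nextDouble() * 3 - 2
      double ci = rnd.nextDouble() * 3 - 1.5
      assert escapeCount(zr, zi, cr, ci) == escapeCount(zr, -zi, cr, -ci)
  }
  println "Julia sets at c and conj(c) mirror each other through the Julia-real axis"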

Higher orders of infinity?

So we've seen a possible model of our physical Universe, by regarding Mathematics not as something that simply describes the Universe, but as something the Universe actually is at some order of infinity.

If our Universe is what exists at the third order of infinity, then what might constitute the next higher order of infinity? Using Cantor's theorem, by considering the power set of our Universe, we might say it's the set of everything that could have happened, could yet happen, or ever would happen in our Universe, except for the number of dimensions. But how would we place our own specific Universe of actual happenings in this picture, with its unique quantum reductions into actualities? One person might say the power set is at a higher order of infinity than our own specific Universe of actual happenings, because of Cantor's theorem. But someone else might say the Universe of happenings is at a higher order, being the real, intentioned actualization, while the possibilities are at a lower order, being just the canvas for the actualization, so to speak. It sounds like an argument between atheists and theists, one that perhaps can never be proven either way.

So possibly our own consciousnesses can't really comprehend very high up the ladders of infinity. At the first order, we can't logically define a consistent and complete system. At the next order, we must hypothesize its actuality as The Continuum in our own Universe. And at another order or two higher, where our own consciousnesses dwell, we can't logically prove which of the two orders is the higher and which is the lower. If we could look higher up the hierarchy of infinities to the infinite level, then the hierarchy itself could be counted by an induction-based counting scheme, and so become self-referential, even becoming itself an inconsistent and incomplete arithmetic system. In fact, because different instances of our human consciousnesses can't agree which of the third and fourth orders of infinity is higher than the other, we can't even make the known physical representations of the orders of infinity into a propositional logic system, not knowing which to call True.

Moving each order of infinity up the ladder seems to require utilizing some new well-known mathematical concept. Moving from finiteness to aleph-0 requires Induction, and from aleph-0 to aleph-1 requires Continuity. If an MPD-invariant Universe is at aleph-2, then moving there requires Probability. This would be the lowest order of infinity which contains the entities of Time and Mass, as distinct from Space. And such entities, Time and Mass, are required for the mathematics of computational complexity, Time as a resource and Mass for building Turing machines. Perhaps the next order of infinity, aleph-3, requires the concept of Computation, which could explain the phenomenon of consciousness. Such Computation is also required for calculating fractal curves in aleph-2 spacetime.

Computational complexity

Perhaps computational complexity will play an important role in a theory of the Universe. There are now many known complexity classes. If we can slot the theory of computational complexity directly onto a mathematical structure which defines a space-distinct Time, and which behaves according to the laws of our Universe, then many of these complexity classes, such as PSPACE and P(TIME), may meld into one when we factor in the effects of Special and General Relativity, such as time dilation and length contraction. PSPACE problems are believed to require more computational "power" than P(TIME) problems (though this remains unproven), but if space can become time due to high acceleration or a nearby strong gravitational field, perhaps ultimately they're really the same complexity class.

Similarly, the distinction between the P(TIME) and NP(TIME) complexity classes may not exist at Planck scales because of quantum nondeterminism. Perhaps the electric field generated by the human brain's neural structure makes use of such quantum nondeterminism to produce the effect of consciousness. Perhaps both large-scale (relativistic) and small-scale (quantum) effects together reduce the many complexity classes down to a mere few. They seem to fall into four broad groups: logarithmic, polynomial, exponential, and recursive.

Do these complexity groups each match up somehow to the various orders of infinity I've described? Do aleph-0-infinite structures like the integers relate somehow to logarithmic-space computation? Do aleph-1-infinite structures like the complex numbers relate to polynomial-resource computation? Is our Universe an aleph-2-infinite structure? Is it an MPD-invariant "canvas" for our Universe of actual happenings, related somehow to exponential-resource computation? Are the quantum reductions that form the actual happenings in our Universe, including our own consciousnesses, an aleph-3-infinite structure? Related somehow to recursively-enumerable computation? And what could possibly lie beyond that?

Programming Language Structure

Originally posted 16 January 2010 on my temporary blogspace...

Programming languages have their origin in natural language, so to understand the structure of computer languages, we need to understand natural ones. According to Systemic Functional Grammar (SFG) theory, to understand the structure of language, we need to consider its use: language is as it is because of the functions it's required to serve. Much analysis of the English language has been performed using these principles, but I haven't found much on programming languages.

Functional grammar of natural languages

According to M.A.K. Halliday's SFG, the vast numbers of options for meaning potential embodied in language combine into three relatively independent components, and each of these components corresponds to a certain basic function of language. Within each component, the networks of options are closely interconnected, while between components, the connections are few. He identifies the "representational" and "interactional" functions of language, and a third, the "textual" function, which is instrumental to the other two, linking with them, with itself, and with features of the situation in which it's used.

To understand these three components in natural languages, we need to understand the stages of encoding. Two principal encodings occur when speech is produced: the first converts semantic concepts into a lexical-syntactic encoding; the second converts this into spoken sounds. A secondary encoding converts some semantics directly into the vocal system, being overlaid onto the output of the lexical-syntactic encoding. Programming languages have the same three-level encoding: at the top is the semantics, in the middle is the language syntax, and at the bottom are the lexical tokens.
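As a simplified illustration of those three levels in a programming language, here's one Groovy statement viewed at each level (the token and tree listings are informal, not the output of any particular compiler):

  // Semantic level: "the total is the price plus the tax"
  //
  // Syntactic level (an informal parse tree):
  //   Declaration
  //     +- Variable: total
  //     +- BinaryExpression: +
  //          +- Variable: price
  //          +- Variable: tax
  //
  // Lexical level (the token stream):
  //   KEYWORD(def) IDENT(total) ASSIGN(=) IDENT(price) PLUS(+) IDENT(tax)
  def price = 10.0
  def tax = 1.5
  def total = price + tax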

The representational function of language involves encoding our experience of the outside world, and of our own consciousness. It's often encoded in as neutral a way as possible for example's sake: "The Groovy Language was first officially announced by James Strachan on Friday 29 August 2003, causing some to rejoice and others to tremble."

We can analyze this as two related processes. The first has actor "James Strachan", process "to officially announce", goal "the Groovy Language", instance circumstance "first", and temporal circumstance "Friday 29 August 2003"; the second process is related as an effect in a cause-and-effect relationship, being two further equally conjoined processes: one with process "to rejoice" and actor "some"; the other with process "to tremble" and actor "others".

The interactional function of language involves injecting the language participants into the encoding. A contrived example showing many types of injects: "The Groovy Language was first announced by, of all people, creator James Strachan, sometime in August 2003. Was it on Friday 29th? Could you tell me if it was? Must have been. That august August day made some happy chappies like me rejoice, didn't it?, yeehaaaah, and probably some other unfortunates to tuh-rem-ble, ha-haaah!"

We see an informal tone, implying the relationship between speaker and listener. There are glosses added, i.e. "of all people", "august", "happy chappies like me", "unfortunates", semantic words added, i.e. "creator", semantic words removed, i.e. "officially", sounds inserted, i.e. "yeehaaaah", "ha-haaah", prepended expressions of politeness, i.e. "Could you tell me if", and words spoken differently, e.g. "tuh-rem-ble". Mood is added, i.e. a sequence of (indicative, interrogative, indicative). Probability modality is added, i.e. "must have", "probably". We could have added other modality, such as obligation, permission, or ability. We've added a tag, i.e. "didn't it?". We could have added polarity in the main predicate. What we can't indicate in this written encoding of speech is the attitudinal intonation overlaid onto each clause, of which English has hundreds. Neither can we show the body language, also part of the interactional function of speech.

Natural language in the human brain

A recent article in Scientific American says biologists now believe the specialization of the human brain’s two cerebral hemispheres was already in place when vertebrates arose 500 million years ago, and that "the left hemisphere originally seems to have focused in general on controlling well-established patterns of behavior; the right specialized in detecting and responding to unexpected stimuli. Both speech and right-handedness may have evolved from a specialization for the control of routine behavior. Face recognition and the processing of spatial relations may trace their heritage to a need to sense predators quickly."

I suspect the representational function of language is that which is produced by the left hemisphere of the brain, and the interactional function by the right hemisphere. Because the right side of the brain is responsible for unexpected stimuli, from both friend and foe, then perhaps interactional language in vertebrates began as body language and facial expressions to denote conditions relevant to others, e.g. anger, fear, affection, humidity, rain, danger, etc. Later, vocal sounds arose as the voice box developed in various species, and in humans, increasingly complex sounds became possible. The left side of the brain is responsible for dealing with regular behavior, and so allowed people to use their right hand to make sign language to communicate. Chimpanzees and gorillas use their right hands to communicate with each other, often in gestures that also incorporate the head and mouth. The article hypothesizes that the evolution of the syllable in humans triggered the ability to form sentences describing processes involving people, things, places, times, etc. Proto-representational language was probably a series of one-syllable sounds similar to what some chimps can do nowadays with sign language, e.g. "Cat eat son night". Later, these two separate functions of natural language intertwined onto human speech.

Programming language structure

When looking at programming languages, we can see the representational function easily. It maps closely to that for natural languages. The process is like a function, and the actor, goal, recipient, and other entities in the transitive structure of natural language are like the function parameters. In the object-oriented paradigm, one entity, the actor, is like the object. The circumstances are the surrounding static scope, and the relationships between processes are the sequencing of statements. Of course, the semantic domains of natural and programming languages are different: natural languages talk about a wider variety of things, themselves more vague, than programming languages. But the encoding systems are similar: the functional and object-oriented paradigms became popular for programming because between them it's easy for programmers to code about certain aspects of things they use natural language to talk about. The example in pseudocode:

Date("2003-8-29").events += {
def a = new Instances();
[1] = jamesStrachan.officiallyAnnounce(Language.GROOVY);
[1].effect = [some: s => s.rejoice(), others: o => o.tremble];

The similarities between the interactional functions of natural and programming languages are more difficult to comprehend. The major complication is the extra participants in programming languages. In natural language, one person speaks, and maybe one, maybe more people listen, perhaps immediately, perhaps later. Occasionally it's intended someone overhears. In programming languages, one person writes. The computer reads, but good programming practice is that other people read the code later. Commenting, use of whitespace, and variable naming partly enable this interactional function. So does including test scripts with code. Java/C#-style exception-handling enables programmer-to-programmer interaction similar to the probability-modality of English verbal phrases, e.g. will/definitely, should/probably, might/could/possibly, won't, probably won't.
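As a rough sketch of that modality parallel (the method names and exception choices below are my own illustration, not an established convention):

  // "will": no declared failure mode, so callers may treat success as near-certain
  int add(int a, int b) { a + b }

  // "might / could": a declared exception the caller is expected to deal with
  // (Groovy doesn't force callers to catch it, but the declaration still documents the "might")
  String fetchPage(String url) throws IOException {
      new URL(url).text                  // network I/O can genuinely fail
  }

  // "probably won't": an unchecked exception marks a failure callers rarely handle
  int parsePort(String s) {
      Integer.parseInt(s)                // throws NumberFormatException on bad input
  }

  println add(2, 3)
  println parsePort("8080")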

Many programming systems allow some interactional code to be separated from the representational code. One way is using system-wide aspects. A security aspect will control the pathway between various humans and different functions of the program while it's running. Aspects can control communication between the running program and different facets of the computer equipment, e.g. a logging aspect comes between the program and recording medium, a persistence aspect between the program and some storage mechanism, an execution performance aspect between the program and CPU, a concurrency aspect between the program and many CPUs, a distribution aspect between the program and another program executing somewhere else. Here, we are considering these different facets of the computer equipment to be participants in the communication, just like the programmer. Aspects can also split out code for I/O actions and the program entry point, which are program-to-human interactions. This can also be done by monads in "pure functional" languages like Haskell. The representational function in Haskell is always kept separate from interactional functions like I/O and program entry, with monads enabling the intertwining between them. Monads also control all access between the program and modifiable state in the computer, another example of an interactional function.
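Here's a minimal Groovy sketch of separating one interactional concern, logging, from the representational code, in the spirit of an aspect; the withLogging helper and the conversion example are invented for illustration:

  // Representational code: pure process + participants, with no knowledge of logging.
  BigDecimal convert(BigDecimal amount, BigDecimal rate) { amount * rate }

  // Interactional concern kept separate: how the running program reports to an observer.
  def withLogging(String name, Closure body) {
      println "-> entering $name"        // stand-in for a real logging aspect
      def result = body()
      println "<- $name returned $result"
      result
  }

  // The two concerns only meet at the point of use.
  def total = withLogging('convert') { convert(100.00, 1.25) }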

Textual function of language

The textual function of language in SFG is that which concerns the language medium itself. In spoken natural language, this is primarily the sequential nature of voice, and in written language, the 2-D form of the page. Whereas in natural language theory, the voice-carrying atmosphere and the ink-carrying paper are obviously mediums and not participants, it's more difficult to categorize the difference between them in programming language theory. Because a program is written as much for the CPU as for other human readers, if not more so, we could call the CPU a participant. But then why can't the CPU cache, computer memory, hard-disk storage, and comms lines also be called participants? Perhaps the participants and the transmission medium for natural languages are also more similar than different.

The textual function of language is made up of the thematic, informational, and cohesive structures. Although mainly medium-oriented, they also involve the participants. The thematic structure is speaker-oriented, the informational structure is listener-oriented. The thematic structure is overlaid onto the clause. In English, what the speaker regards as the heading to what they're saying, the theme, is put in first position. Not only clauses, but also sentences, speech acts, written paragraphs, spoken discourses, and even entire novels have themes. Some examples using lexical items James, to give, programmers, Groovy, and 2003, with the theme in first position in each case:

  • James Strachan gave programmers Groovy in 2003.
  • Programmers are who James gave Groovy to in 2003.
  • The Groovy Language is what James gave programmers in 2003.
  • 2003 is when James gave programmers Groovy.
  • Given was Groovy by James to programmers in 2003.

In English, the Actor of the representational function's transitive structure is more likely to be separated from the interactional function's Subject and from the Theme in a clause than those are from each other. I think the textual functions of natural language are far more closely linked to the interactional function than to the representational. Perhaps the right side of the brain also processes such textual structure.

The informational structure jumps from the top (i.e. semantic) encoding level directly to the bottom (i.e. phonological) one in English, skipping the middle (i.e. lexical/syntactic) level. This is mirrored by how programming languages such as Python use the lexical tokens to directly determine semantic meaning. In English, the speech is broken into tone units, separated by short pauses. Each tone unit has the stress on some part of it to indicate the new information. For example, each of these sentences has a different informational meaning (the asterisks mark the stresses):

  • James gave programmers Groovy in *2003*.
  • James gave programmers *the Groovy Language* in 2003.
  • James gave *programmers* Groovy in 2003.
  • James *gave* programmers Groovy in 2003.
  • *James Strachan* gave programmers Groovy in 2003.

Unlike the thematic structure, the informational structure organizes the tone unit by relating it to what has gone before, reflecting what the speaker assumes is the status of the information in the mind of the listener. The informational structure usually uses the same structure as the thematic, but needn't. English grammar allows the lexical items to be arranged in any order to enable them to be broken up in any combination into tone units. For example, these examples restructure the clause so it can be divided into two tone units (shown by the comma), each with its own stress, so two items of new information can be introduced in one clause:

  • James gave Groovy to programmers, in 2003.
  • As for Groovy, James gave it to programmers in 2003.
  • In 2003, James gave programmers Groovy.

Programming languages should follow the example of natural languages, and allow developers to structure their code to show both thematic and informational structure. The final textual function, the cohesive structure, enables links between clauses, using various techniques such as reference, pronouns, and conjunctions. Imperative programming languages rely heavily on reference, i.e. temporary variables, but don't use pronouns very much. Programming languages should also provide developers with many pronouns.
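Groovy already has a few pronoun-like forms that hint at what this could look like; a small sketch:

  // The implicit closure parameter 'it' behaves like the pronoun "it".
  def names = ['James', 'Guillaume', 'Jochen']
  println names.collect { it.toUpperCase() }

  // 'with' establishes a referent so the clauses inside needn't repeat it,
  // much as a pronoun saves repeating a noun phrase.
  def announcement = new StringBuilder()
  announcement.with {
      append 'Groovy '
      append 'was announced in 2003'
  }
  println announcement

  // 'this', 'super', and a closure's 'owner' and 'delegate' play similar referential roles.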


Programming languages initially represented information in the same way humans do, using transitive structures such as function calls, joined by logical relationships such as blocks and class definitions. Interactional aspects of code were initially intertwined, but could be separated out using aspects and monads. Enabling different textual structures in programs isn't very widespread, so far limited to providing different views of an AST in an IDE, only occasionally allowing "more than one way to do things" at the lexical level. When used well, textual structures in code enable someone later on to more easily read and understand the program.

In promoting the benefits of programming languages enabling different textual structures, I think it's useful to narrow down to two primary structures: the transitive and the thematic, as these two are easiest to communicate to programmers. See my earlier thoughts on how a programming language can enable more thematic variation. Programming languages of the future should provide the same functions for programmers that natural languages provide for humans.

And of course, I'm building Groovy 2.0, which will both enable thematic variation in the language syntax/morphology, and supply a vast vocabulary of Unicode tokens for names. The first iteration will use Groovy 1.x's SwingBuilder and ASTBuilder, along with my own Scala-based combinator parsers, to turn Groovy 2.0 source into Groovy 1.x bytecode. The accompanying Strach IME will enable programmers to enter the Unicode tokens intuitively. Groovy 2.0 will break the chains of the Antlr/Eclipse syntactic bottleneck over Groovy 1.x !!!

Thursday, April 23, 2009

Interactional function of English and Groovy

Michael A.K. Halliday writes in his 1970 paper Language Structure and Language Function that we should analyze language in terms of its use, considering both its structure and function in so doing. He's found the vast numbers of options embodied in it combine into three relatively independent components, and they each correspond to a certain basic function of language: representational (a.k.a. ideational), interactional (a.k.a. interpersonal), and textual. Within each component, the networks of options are closely interconnected, while between components, the connections are few.

For natural language, the representational component represents our experience of the outside world, and of our consciousness within us. The representational similarities between natural and computer languages are most easily noticed:
Mary.have(lamb)                          // Mary had a little lamb
lamb.fleece.color = Color.SNOW_WHITE     // its fleece was white as snow
def follow = { place -> Mary.go(place); lamb.go(place) }   // everywhere that Mary went, the lamb was sure to go

Computer languages' increasing use of abstraction over the years was no doubt based on the representational component of natural languages, giving rise to the functional and object-oriented paradigms. The ideas represented in computer language must be more precise than those in natural language.

Interactional component of English
The interactional component of language involves its producer and receiver/s, and the relationship between them. For natural language, there's one or more human receivers, and for computer language, one or more electronic producers and/or receivers as well as the human one/s.

In English, the interactional component accounts for:

  • many adverbs of opinion, e.g. “That's an incredibly interesting piece of code!”

  • interjections within a clause, e.g. I'm hoping to, er, well, go back sometime, or even in the middle of words, e.g. abso-bloomin'-lutely

  • expressions of politeness we prepend to English sentences, e.g. “Are you able to...” in front of “Tell me the time”

  • the hundreds of different attitudinal intonations we overlay onto our speech, e.g. “Dunno!” (can you native English speakers hear that intonation?)

  • the mood, whether indicative e.g. “He's gone.”, interrogative e.g. “Is she there?”, imperative e.g. “Go now!”, or exclamative e.g. “How clever!”

  • the modal structure of English grammar, i.e. verbal phrases have certainty e.g. “I might see him”, ability e.g. “I can see her”, allowability e.g. “Can he do that?”, polarity e.g. “They didn't know”, and/or tense e.g. “We did make it”

Natural language offers many choices regarding how closely to intertwine the interactional component with the representational.
An example... for closely intertwined reported speech: She said that she had already visited her brother, that the day before she'd been with her teacher, and that at that moment she was shopping with her friend.
and using quoted speech to reduce the tangling between interactional and representational components: She said "I've already visited my brother, yesterday I was with my teacher, and right now I'm shopping with my friend."
Another example of keeping these two components disjoint: I'm going to tell the following story exactly as she told it, the way she said it, not how I'd say it...

The original human languages long ago, just like chimpanzee language today, were perhaps mainly interactional, with the representational component slowly added on afterwards.

Interactional component of computer languages
For computer languages, the interactional component determines how people interact with the program, and how other programs interact with it. Like natural languages, the interactional component came first, with representational abstractions added on later. Many have tried to create a representational-only computer language, perhaps the most successful being Haskell. But the Haskell language creators went to great trouble to tack on the minimally required interactional component, that of Input/Output. They introduced monads to add the I/O capability onto the “purer” underlying functional-paradigm functions. Perhaps some functional-paradigm language creators don't appreciate the centrality of the interactional component in language.

Siobhan Clarke et al. write about the tyranny of the dominant decomposition:
Current object-oriented design methods suffer from the “tyranny of the dominant decomposition” problem, where the dominant decomposition dimension is by object. As a result, designs are caught in the middle of a significant structural misalignment between requirements and code. The units of abstraction and decomposition of object-oriented designs align well with object-oriented code, as both are written in the object-oriented paradigm, and focus on interfaces, classes and methods. However, requirements specifications tend to relate to major concepts in the end user domain, or capabilities like synchronisation, persistence, and failure handling, etc., all of which are unsuited to the object-oriented paradigm.

The object-paradigm is a representational one. The other user-domain capabilities are interactional ones, either human-to-computer or computer-to-computer. Some examples:
  • I/O actions, i.e. between computer and human/s

  • logging, i.e. between processor and recording medium

  • persistence, database access, i.e. between computer and storage unit/s

  • security, i.e. between computer and certain humans only

  • execution performance, i.e. how to maximize use of computing resources

  • entry point to program, i.e. between processor and external scheduler

  • concurrency, synchronization, i.e. between two processors in one computer

  • distribution, i.e. between two geographically separated computers

  • exceptions, failure handling, i.e. between results of different human-expected certainties

  • testing, i.e. interaction between two different external humans

These capabilities are often interwoven into the programming code, just as mood and modality are overlaid onto all the finite verbal phrases in English. And just as in English, where various interactional functions can be disentangled from the representation functions, e.g. quoted speech above, so also in computer languages, such user-domain capabilities can be extracted as system-wide aspects in aspect-oriented programming.

AspectJ is a well-known attempt to let each aspect use the same syntax, that of the base language. But the idea of limited AOP is much older; often a different syntax is used for each different user-domain capability.

I've already blogged about some aspects of the textual component of English and Groovy. Whereas the other two components of language exist for reasons independent of the medium itself, the textual component comes into being because the other two components exist, and refers both to those other two components and to itself self-referentially. The textual component ensures every degree of freedom available in the medium itself is utilized.

In computer languages, the textual component is often called “syntactic sugar”. Often computer language designers scorn the use of lots of syntactic sugar, but natural language designers, i.e. the speakers of natural languages, use all the syntactic sugar available in the communication medium. Programming language designers should do the same. In the DLR-targeted Groovy I'm working on, I'm focusing on this aspect of the Groovy Language.

Thursday, February 12, 2009

The Rise of Unicode

The next version of Unicode is v.5.2, the latest of a unified character set now with over 100,000 current tokens. One notable addition to v.5.2 will be the Egyptian hieroglyphs, the earliest known system of human writing. Perhaps they will mark Unicode's coming of age, it being another huge step in representing language with graphical symbols. Let's look at a consolidated short history of writing systems, courtesy of various Wikipedia pages, to see Unicode's rise in perspective...

Egyptian hieroglyphs were invented around 4000-3000 BC. The earliest type of hieroglyph was the logogram, where a common noun (such as sun or mountain) is represented by a simple picture. These existing hieroglyphs were then used as phonograms, to denote more abstract ideas with the same sound. Later, these were modified by extra trailing hieroglyphs, called semagrams, to clarify their meaning in context. About 5000 Egyptian hieroglyphs existed by Roman times. When papyrus replaced stone tablets, the hieroglyphs were simplified to accommodate the new medium, sometimes losing their resemblance to the original picture.

The idea of such hieroglyphic writing quickly spread to Sumeria, and eventually to ancient China. The ancient Egyptian and Sumerian hieroglyphs are no longer used, but modern Chinese characters are descended directly from the ancient Chinese ones. Because Chinese characters spread to Japan and ancient Korea, they're now called CJK characters. By looking at such CJK characters, we can get some idea of how Egyptian hieroglyphs worked. Many CJK characters were originally pictures, such as 日 for sun, 月 for moon, 田 for field, 水 for water, 山 for mountain, 女 for woman, and 子 for child. Some pictures have meanings composed of other meanings, such as 女 (woman) and 子 (child) combining into 好, meaning good. About 80% of Chinese characters are phonetic, consisting of two parts, one semantic, the other primarily phonetic, e.g. 土 sounds like tu, and 口 means mouth, so 吐 also sounds like tu, and means to spit (with the mouth). The phonetic part of many phonetic characters often also provides secondary semantics to the character, e.g. the phonetic 土 (in 吐) means ground, where the spit ends up.

Eventually in Egypt, a set of 24 hieroglyphs called uniliterals evolved, each denoting one consonant sound in ancient Egyptian speech, though they were probably only used for transliterating foreign names. This idea was copied by the Phoenicians by 1200BC, and their symbols spread around the Middle East into various other languages' writing systems, having a major social effect. It's the base of almost all alphabets used in the world today, except CJK characters. These Phoenician symbols for consonants were copied by the ancient Hebrews and for Arabic, but when the Greeks copied them, they adapted the symbols of unused consonants for vowel sounds, becoming the first writing system to represent both consonants and vowels.

Over time, cursive versions of letters evolved for the Latin, Greek, and Cyrillic alphabets so people could write them easily on paper. They used either the block or the cursive letters, but not both, in one document. The Carolingian minuscule became the standard cursive script for the Latin alphabet in Europe from 800AD. Soon after, it became common to mix block (uppercase) and cursive (lowercase) letters in the same document. The most common system was to capitalize the first letter of each sentence and of each noun. Chinese characters have only one case, but that may change soon. Simplified characters were invented in 1950's mainland China, replacing the more complex characters still used in Hong Kong, Taiwan, and western countries. Nowadays in mainland China though, both complex and simplified Chinese are sometimes used in the same document, the complex ones for more formal parts of the document. Perhaps one day complex characters will sometimes mix with simplified ones in the same sentence, turning Chinese into another two-case writing system.

Punctuation was popularized in Europe around the same time as cursive letters. Punctuation is chiefly used to indicate stress, pause, and tone when reading aloud. Underlining is a common way of indicating stress. In English, the comma, semicolon, colon, and period (,;:.) indicated pauses of varying degrees, though nowadays, only the comma and period are used much in writing. The question mark (?) replaces the period to indicate a question, of either rising or falling tone; the exclamation mark (!) indicates a sharp falling tone.

The idea of separating words with a special mark also began with the Phoenicians. Irish monks began using spaces in 600-700AD, and this quickly spread throughout Europe. Nowadays, the CJK languages are the only major languages not using some form of word separation. Until recently, the Chinese didn't recognize the concept of word in their language, only of (syllabic) character.

The bracketing function of spoken English is usually performed by saying something at a higher or lower pitch, between two pauses. At first, only the pauses were shown in writing, perhaps by pairs of commas. Hyphens might replace spaces between words to show which ones are grouped together. Eventually, explicit bracketing symbols were introduced at the beginning and end of the bracketed text. Sometimes the same symbol was used to show both the beginning and the end, such as pairs of dashes to indicate appositives, and pairs of quotes, either single or double, to indicate speech. Sometimes different paired symbols were used, such as parentheses ( and ). In the 1700's, Spanish introduced inverted ? and ! at the beginning of clauses, in addition to the right-way-up ones at the end, to bracket questions and exclamations. Paragraphs are another bracketing technique, being indicated by indentation.

Around 1050, movable-type printing was invented in China. Instead of carving an entire page on one block as in block printing, each character was on a separate tiny block. These were fastened together into a plate to reflect a page of a book, and after printing, the plate was broken up and the characters reused. But because thousands of characters needed to be stored and manipulated, making movable-type printing difficult, it never replaced block printing in China. But less than a hundred letters and symbols need to be manipulated for European alphabets, much easier. So when movable-type printing reached Europe, the printing revolution began.

With printing a new type of language matured, one that couldn't be spoken very well, only written: the language of mathematics. Mathematics, unlike natural languages, needs to be precisely represented. Natural languages are very expressive, but can also be quite vague. Numbers were represented by many symbols in ancient Egypt and Sumeria, and had reduced to a mere 10 by the Renaissance. But from then on, mathematics started requiring many more symbols than merely two cases of 26 letters, 10 digits, and some operators. Many symbols were imported from other alphabets, different fonts introduced for Latin letters, and many more symbols invented to accommodate the requirements of writing mathematics. Mathematical symbols are now almost standardized throughout the world. Many other symbol systems, such as those for chemistry, music, and architecture, also require precise representation. Existing writing systems changed to utilize the extra expressiveness that came with movable-type printing. Underlining in handwriting was supplemented with bolding and italics. Parentheses were supplemented with brackets [] and curlies {}.

Fifty years ago, yet another type of language arose, for specifying algorithms: computer languages. The first computer languages were easy to parse, requiring little backtracking, but the most popular syntax, that of C and its descendants, requires more complex logic and greater resources to parse. Most programming languages used a small repertoire of letters, digits, punctuation, and symbols, being limited by the keyboard. Other languages, most notably APL, attempted to use many more, but this never became popular. Unlike mathematics, computer languages relied on parsing syntax, rather than a large variety of tokens, to represent algorithms. Computer programs generally copied natural language writing systems, using letters, numbers, bracketing, separators, punctuation, and symbols in similar ways. One notable innovation of computer languages, though, is camel case, popularized for names in C-like language syntaxes.

The natural language that spread around the world in modern times, English, doesn't use a strict pronunciation-spelling correspondence, perhaps one of the many reasons it spread so rapidly. English writing therefore caters for people who speak English with widely differing vowel sounds and stress, pause, and tone patterns. In this way, English words are a little like Chinese ideographs. As Asian economies developed, techniques for quickly entering large-character-set natural languages were invented, known as IME's (input method editors). But these Asian countries still use English for computer programming.

Around 1990 Unicode was born, unifying the character sets of the world. Initially, there was only room for about 60,000 tokens in Unicode, so the CJK characters of China, Japan, and Korea were unified to save space. Unicode is also bidirectional, catering to Arabic and Hebrew. Top-to-bottom scripts such as Mongolian and traditional Chinese can be simulated with left-to-right or right-to-left directioning by using a special sideways font. However, Unicode didn't become very popular until its UTF-8 encoding was standardized about 10 years ago, allowing backwards compatibility with ASCII. Another benefit of UTF-8 is that there's now room for about one million characters in the Unicode character set, allowing less commonly used scripts such as Egyptian hieroglyphs to be encoded.

Many programming languages have recently adopted different policies for using Unicode tokens in names and operators. The display tokens in Unicode are divided into various categories and subcategories, mirroring their use in natural language writing systems. Examples of such subcategories are: uppercase letters (Lu), lowercase ones (Ll), digits (Nd), non-spacing combining marks, e.g. accents (Mn), spacing combining marks, e.g. Eastern vowel signs (Mc), enclosing marks (Me), invisible separators that take up space (Zs), math symbols (Sm), currency symbols (Sc), start bracketing punctuation (Ps), end bracketing (Pe), initial quote (Pi), final quote (Pf), and connector punctuation, e.g. underscore (Pc).
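On the JVM these categories can be queried directly; here's a small Groovy sketch using java.lang.Character (the sample characters and the subset of categories shown are arbitrary):

  // Look up the Unicode general category of a few sample characters.
  def label = { int cp ->
      int t = Character.getType(cp)
      if (t == Character.UPPERCASE_LETTER)      return 'Lu (uppercase letter)'
      if (t == Character.LOWERCASE_LETTER)      return 'Ll (lowercase letter)'
      if (t == Character.DECIMAL_DIGIT_NUMBER)  return 'Nd (decimal digit)'
      if (t == Character.MATH_SYMBOL)           return 'Sm (math symbol)'
      if (t == Character.CURRENCY_SYMBOL)       return 'Sc (currency symbol)'
      if (t == Character.START_PUNCTUATION)     return 'Ps (start bracketing)'
      if (t == Character.END_PUNCTUATION)       return 'Pe (end bracketing)'
      if (t == Character.CONNECTOR_PUNCTUATION) return 'Pc (connector punctuation)'
      if (t == Character.SPACE_SEPARATOR)       return 'Zs (space separator)'
      return "other (type constant $t)"
  }
  ['G', 'g', '7', '∑', '€', '(', ')', '_', ' '].each { s ->
      println "'${s}' is in category ${label(s.codePointAt(0))}"
  }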

For it to become popular to use a greater variety of Unicode tokens in computer programs, there must be a commonly available IME for their entry with keyboards. Sun's Fortress provides keystroke sequences for entering mathematical symbols in programs, but leaves it vague whether the Unicode tokens or the ASCII keys used to enter them are the true tokens in the program text. And of course there must be a commonly available font representing every token. Perhaps because of the large number of CJK characters, and the recent technological development of mainland China, a large number of programmers may one day suddenly begin using them in computer programming to make their programs terser.

Language representation using graphical symbols has taken many huge leaps in history: Egyptian hieroglyphs to represent speech around 5000 years ago, an alphabet to represent consonant and vowel sounds by the Phoenicians and Greeks around 3000 years ago, movable-type printing in Europe around 500 years ago, and unifying the world's alphabets and symbols into Unicode a mere 20 years ago. And who knows what the full impact of this latest huge leap will be?

Saturday, December 06, 2008

The Thematic Structure of English and Groovy

After working as a programmer for many years, I tossed it in to teach English in China. I spent a few years reading the many books on language and linguistics in the bookshops up here, before returning to programming as a hobby. I then started to see many similarities between natural and computer languages, which I'm intermittently blogging about. Here's today's installment...

Of the books on language I've read, M.A.K. Halliday's ideas make a lot of sense. He suggests we should analyse language in terms of what it's used for, rather than its inherent structure. From this basis, he's isolated three basic functions of natural language, and their corresponding structural subsystems: the ideational, the interpersonal, and the textual.

The ideational function is a representation of experience of the outside world, and of our consciousness within us. It has two main components: the experiential and the logical. The experiential component embodies single processes, with their participants and circumstances, in a transitivity structure. For example, “At quarter past four, the train from Newcastle will arrive at the central station.” has a transitive structure with process to arrive, participants train from Newcastle and central station, and circumstance quarter past four. The primary participant is called the actor, here, the train from Newcastle. Computer languages have a structure paralleling the transitivity structure of natural languages, e.g. train.arrive(station, injectedCircumstance) for object-oriented languages. The logical component of ideational function concerns links between the experiential components, attained with English words such as and, which, and while. These have obvious parallels in programming languages.

The interpersonal function involves the producer and receiver/s of language, and the relationship between them. This function accounts for the hundreds of different attitudinal intonations we overlay onto our speech, interjections, expressions of politeness we prepend to English sentences, e.g. “Are you able to...”, many adverbs of opinion, the mood (whether indicative, interrogative, imperative, or exclamative), and the modal structure of English grammar. The modal structure causes verbal phrases to have certainty, ability, allowability, polarity, and/or tense prepended in English, and can be repeated in the question tag, e.g. isn't he?, can't we?, should they?. The interpersonal function gives the grammatical subject-and-predicate structure to English. In programming languages, the interpersonal function determines how people interact with the program, and how other programs interact with it. The interpersonal functions are what would normally be extracted into aspects in aspect-oriented programming. They generally disrupt the “purer” transitivity structure of the languages.

The textual function brings context to the language through different subsystems. The informational subsystem divides the speech or text into tone units using pauses, then gives stress/es to that part of the unit that is new information. The cohesive subsystem enables links between sentences, using conjunctions and pronouns, substitution, ellipsis, etc. The thematic subsystem makes it easy for receivers to follow the flow of thought. Comparing this structure of the English and Groovy languages is the topic of today's blog post...

Thematic structure of English
Theme in English is overlaid onto the clause, a product of the transitive and modal structures. The theme is the first position in the clause. English grammar allows any lexical item from the clause to be placed in first position. (In fact, English allows the lexical items to be arranged in any order to enable them to be broken up in any combination into tone units.) Some examples, using lexical items to give, Alan, me, and the book, with the theme fronted in each case:
  Alan gave me that book in London.
  Alan gave that book to me in London. (putting indirect object into prepositional phrase)
  To me Alan gave that book in London. (fronting indirect object)
  I am who Alan gave that book to in London. (fronting indirect object, with extra emphasis)
  To me that book was given in London. (using passive voice to front indirect object)
  That book was given in London. (using passive voice to omit indirect object)
  That book Alan gave me in London. (fronting direct object as topic)
  That book is the one Alan gave me in London. (fronting direct object in more formal register)
  In London, Alan gave me that book. (fronting adverbial, into separate tone unit)
  London is where Alan gave me that book. (fronting adverbial in the same tone unit)
  There is a book given by Alan to me in London. (null topic)

Although not common, English also allows the verb to be put in first position as theme:
  Give the book Alan did to me in London.
  Give me the book did Alan in London.
  Give did Alan of the book to me in London.
  Given was the book by Alan to me in London.

First position is merely the way English indicates what the theme is, not the definition of it. Japanese indicates the theme in the grammatical structure (with the particle は wa), while Chinese (I think) uses a combination of first position and grammatical structure (prepending with 是 shi).

Thematic structure of Groovy
One way of indicating theme could be to bold it, assuming the text editor had rich text capabilities. This would be similar to Japanese. For example, for thematic variable a:
  def b = a * 2; def c = a / 2
Another way is to use first position, which is how English indicates it. This would be an Anglo-centric thematic structure for programming languages, which generally already have an Anglo-centric naming system. Perhaps the best way is a combination of both front position and bolding.

Let's look at how Groovy could enable front-position thematic structure. We'll start with something simple, the lowest precedence operator: a = b. If we want to front the b, we can't. We would need some syntax like =:, the reverse of Algol's :=
  b =: a

We'd need to provide the same facility for the other precedence operators at the same level += -= *= /= %= <<= >>= >>>= &= ^= |=. Therefore, we'd have operators =+: =-: =*: =/: =%: =<<: =>>: =>>>: =&: =^: =|:.

At the next higher precedence level are the conditional and Elvis operators. Many programming languages, such as Perl and Ruby, enable unless as statement suffix, allowing the action to be fronted as the theme. Groovy users frequently request this feature of Groovy on the user mailing list. An unless keyword would be useful, but we could also make the ? : and ?: operators multi-theme-enabling by reversing them, i.e. : ? and :?, with opposite (leftwards) associativity. The right-associative ones would have higher precedence over these new ones, so, for example:
  a ? b : c ? d : e would associate like a ? b : (c ? d : e)
  a : b ? c : d ? e would associate like (a : b ? c) : d ? e
  a : b : c ? d ? e would associate like a : (b : c ? d) ? e
  and a ? b ? c : d : e would associate like a ? (b ? c : d) : e

On a similar note: Groovy dropped the do while statement because of parser ambiguities. It should be renamed do until to overcome the ambiguities.

Next up the precedence hierarchy, we need shortcut boolean operators ||: and &&:, which evaluate, associate, and shortcut rightwards. Most of the next few operators up the hierarchy | ^ & == != <=> < <= > >= + * don't need reverse versions, but these do: =~ ==~ << >> >>> - / % **. It's good Groovy supplies the ..< operator so we can emphasize an endpoint in a range without actually processing it. We'll also provide the >.. and >..< operators.

Just as in English we have the choice of saying the king's men or the men of the king, depending on what we want to make thematic, we should have that choice in Groovy too.
We can easily encode reverse-associating versions of *. ?. .& .@ *.@ as .* .? &. @. @.*. To encode the standard path operator ., we could use .:.

A positive by-product of having these reverse-associative versions of the Groovy operators is they'll work nicely with names in right-directional alphabets, such as Arabic and Hebrew, when we eventually enable that.

When defining methods in Groovy, we should have the choice to put return values and modifiers after the method name and parameters, like in Pascal. This would cater to speakers of Romance languages, e.g. French, who generally put adjectives after nouns.

Groovy, like most programming languages, doesn't enable programmers to supply their own thematic structure to code, only the transitive structure. When used well, thematic structure in code enables someone later on to more easily read and understand the program. Perl was a brave attempt at providing “more than one way to do things”, but most programming languages haven't learnt from it. I'm working on a preprocessor for the Groovy Language, experimenting with some of these ideas. If it looks practical, I'll release it one day, as GroovyScript. It will make Perl code look like utter verbosity.
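As a toy illustration of the preprocessor idea (nothing like a full GroovyScript implementation, just a regex sketch of the reversed assignment operator discussed above):

  // Rewrite the reversed assignment "expr =: var" back into ordinary Groovy "var = expr".
  // A real preprocessor would work on a parse tree; this regex handles only simple cases.
  String desugar(String line) {
      line.replaceAll(/(.+?)\s*=:\s*(\w+)/) { full, expr, target -> "$target = $expr" }
  }

  assert desugar('b =: a') == 'a = b'
  assert desugar('x * 2 + 1 =: y') == 'y = x * 2 + 1'
  println desugar('theAnswer() =: result')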

Saturday, November 22, 2008

Stress and Unstress in Computer Languages

Computer languages could learn a few things from natural languages in their design...

Natural Language
Many natural languages, such as English, make a distinction between stressed and unstressed words. In general, nouns, verbs, and adjectives (incl adverbs ending in -ly) are stressed, while grammar words are unstressed.

For example: “I *walked* the *spot*ty *dog* to the *shop*, *quick*ly *bought* some *bread*, and re*turned* *home*”. (I've marked with asterisks the syllables we stress during speech in this and following examples.)

We stress the nouns (dog, shop, bread, home), adjectives (spotty, quick), and verbs (walk, buy, return), and don't stress the grammar words (I, the, to, -ly, some, and). (Note: In Transformational Grammar, adverbs ending in -ly are considered to be a specific inflectional form of the corresponding adjectives.)

Examples of unstressed grammar words in English are conjunctions (and, or, but), conjunctive adverbs (while, because), pronouns (this, you, which), determiners (any, his), auxiliary verbs (is, may), prepositions (to, on, after), and other unclassed words (existential there, infinitive to), as well as many inflectional morphemes (-s, -'s, -ing, -ly).

Verbs are often only half-stressed instead of fully stressed, and prepositions half-stressed instead of unstressed, depending on the surrounding context, e.g. “The *teacher* _saw_ the *book* _behind_ the *desk*.” (Here, the underscores mark the half-stressed words.)

English has a clear distinction between grammar words and lexical words (nouns, adjectives/adverbs, and verbs) in speech.

Many languages distinguish between lexical and grammar words in their writing systems. German capitalizes the first letter of each noun. (Danish stopped doing this in 1948, and English in the 1700's). Japanese uses Chinese characters for nouns and many adjectives, and the Japanese alphabet for grammar words and many verbs.

When using grammar words in a lexical capacity, we stress them when speaking, e.g. “I put an 'is' followed by an 'on', before the 'desk' with a 'the' before it, to make a predicate.” And when writing, we put the grammar words we're using as lexical ones inside quotes.

Using stress and unstress to separate lexical and grammar words enables English, and probably all natural languages, to be self-referential.

Computer Languages
Virtually every computer language differentiates between lexical words and grammar words.

Assembler and Cobol used indentation and leading keywords to distinguish different types of statements, and space and comma to separate items. Like many languages after them, the limited set of keywords couldn't be used for user-defined names. Fortran introduced a simple infix expression syntax for math calculations, using special symbols (+ - * etc) for the precedenced infix operators, and ( ) for bracketing. Lisp removed the indentation and keywords completely, making everything use bracketing, with space for separation, and a prefix syntax. APL removed the precedences, but introduced many more symbols for the operators. The experimentation continued until C became widespread.

C uses 3 different types of symbols for bracketing, ( ) [ ] { }. C++, Java, and C# added < > for bracketing. C uses space and , ; . for separators, and a large number of operators, organized via a complex precedence system. Java has 53 keywords; C# has 77.

The lexical words of computer languages are clear. Classes and variables are nouns. Functions and methods are verbs. Keywords beginning a statement are imperative verbs, and in some languages are indistinguishable from functions. Modifiers, interfaces, and annotations are adjectives/adverbs. The operators (+ - * / % etc) bear a similarity to prepositions, some of them (+= -= *= etc), to verbs. And I'd suggest the tokens used for bracketing and separators are clear examples of grammar words in computer languages, being similar to conjunctions and conjunctive adverbs.

In general, computer languages use some tokens (e.g. A-Z a-z 0-9 _) for naming lexical words, and others (e.g. symbols and punctuation) for grammar. Occasionally there are exceptions, such as new and instanceof in Java. Some computer languages use other means. Perl and PHP put a sigil (such as $ or @) before variable names, so any name can be used without clashing with keywords. This is similar to capitalizing all nouns in German. C# allows @ before any lexical word, but only requires it before those which double as keywords. This is similar to quoting grammar words to use them as lexical ones in English.

Newer programming languages have different ways to use Unicode tokens in names and operators. The display tokens in Unicode fall into six basic categories: letters (L), marks (M), numbers (N), symbols (S), punctuation (P), and separators (Z). Python 3.0 names can begin with any Unicode letter (L), numeric letter (in N), or the underscore (in P); subsequent tokens can also be combining marks (in M), digits (in N), and connector punctuation (in P). Scala names can begin with an upper- or lowercase Unicode letter (in L), the underscore (in P), or the dollar sign (in S); subsequent tokens can also be certain other letters (in L), numeric letters (in N), and digits (in N). Scala operators can include math and other symbols (in S). Almost all languages have the same format for numbers, beginning with a number (in N), perhaps with letters (in L) as subsequent tokens.
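
Groovy, which follows Java's identifier rules, behaves much the same way: names built from Unicode letters (category L), including Greek and CJK characters, are already legal, e.g:
  def π = Math.PI            // Greek small letter pi, category Ll
  def 半径 = 2.0              // CJK ideographs ("radius"), category Lo
  println π * 半径 * 半径      // area of a circle, ≈ 12.566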

Perhaps the easiest way to distinguish between lexical and grammar words in GrerlVy is to use Unicode letters (L), marks (M), and numbers (N) exclusively for lexical words, and symbols (S), punctuation (P), and separators (Z) exclusively for grammar words. Of course, we still have a difficulty with the borderline case: infix operators and prefix methods, which correspond roughly to prepositions and verbs, the half-stressed words in English. I'm still thinking about that one.
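
As a rough sketch (in plain Groovy, not any GrerlVy syntax) of how a tokenizer might apply that split, here's a classifier that treats the Unicode L, M, and N categories as lexical and everything else as grammar:
  def lexicalCategories = [
      Character.UPPERCASE_LETTER, Character.LOWERCASE_LETTER, Character.TITLECASE_LETTER,
      Character.MODIFIER_LETTER, Character.OTHER_LETTER,                                      // the L categories
      Character.NON_SPACING_MARK, Character.COMBINING_SPACING_MARK, Character.ENCLOSING_MARK, // the M categories
      Character.DECIMAL_DIGIT_NUMBER, Character.LETTER_NUMBER, Character.OTHER_NUMBER         // the N categories
  ].collect { it as int }
  def kind = { ch -> (Character.getType(ch as char) in lexicalCategories) ? 'lexical' : 'grammar' }
  'α7+{'.toCharArray().each { println "'$it' -> ${kind(it)}" }   // α and 7 are lexical; + and { are grammar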

Saturday, September 06, 2008

Mass-Parity-Distance Invariance

Edited and abridged on 6 Feb 2018

During July and August 2008, I hoed into a copy of Roger Penrose's The Road to Reality, and came up with an idea to explain Dark Energy...

Negative Mass
Negative mass is usually defined in such a way that Einstein's equivalence principle still holds, where gravitational mass is proportional to inertial mass. This results in some bizarre effects. But while reading Penrose's book, I got an idea for how to define negative mass so that all the positive matter and all the negative matter fly off in two opposite directions at the Big Bang, with the equivalence principle still holding.

The key is how we calculate the (scalar) distance with respect to some mass. For positive matter, we would continue to use the positive solution to the formula where we take the square root of the sum of the squares of the three spatial coordinates. But we'd introduce an invariance, which I call Mass-Distance Invariance, where we'd use the negative solution to the square root for scalar distances measured with respect to negative masses.

Some consequences of this invariance are:
  • The same vector values for velocity and acceleration would be used for negative mass as for positive mass, but their scalar values would depend on whether positive matter was referenced, or negative matter. Negative matter would use negative speeds and, to indicate increasing speeds, negative acceleration values.

  • A positive-valued g-force (created by positive matter) would still mean attraction for positive matter, but repulsion for negative matter. However, a negative g-force (created by negative matter) would mean attraction for negative matter, but repulsion for positive.

  • When calculating the (scalar) gravitational force between two objects, the square of the distance between them would always be positive, but a positive force is attraction, and a negative force is repulsion. This means two negative masses attract, as do two positive masses, but positive and negative masses repel each other.

  • When calculating energies, such scalar forces involving negative matter would be multiplied by negative distances again, resulting in negative energies. Penrose mentions that negative energies mess with quantum mechanical calculations, but in the real Universe this might be OK, because positive and negative energies would be partitioned off from each other due to the gravitational effects of the Big Bang.

Therefore, when calculating scalar values in the negatively-massed side of the Universe, we'd use (1) negative distances, (2) divided by positive time to give negative-valued speeds, (3) divided by positive time again to give negative acceleration values to indicate increasing speeds, (4) multiplied by negative mass to give positive-valued scalar forces to indicate attraction, (5) multiplied by negative distances to give negative values for energy.
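
A minimal numeric sketch of that chain in Groovy, with arbitrary values, just to show how the signs would propagate under the proposed convention:
  def mass = -2.0                              // a negative mass (arbitrary value)
  def distance = -Math.sqrt(3*3 + 4*4 + 0*0)   // negative root for coordinates (3, 4, 0): -5.0
  def time = 1.0                               // time stays positive
  def speed = distance / time                  // negative speed
  def acceleration = speed / time              // negative acceleration = increasing speed
  def force = mass * acceleration              // negative * negative = positive: attraction
  def energy = force * distance                // positive * negative = negative energy
  println([speed, acceleration, force, energy])   // [-5.0, -5.0, 10.0, -50.0]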

Picturing All This
When picturing such a scenario using the common "matter bends space which moves matter" 2D curved-space picture to model the 3+1D reality in general relativity, the positive matter would be on top of the sheet sinking downwards as before, but the negative matter would be under the sheet, to indicate negative distances, floating upwards, to indicate the negative mass. We can then visualize positive and negative matter each self-gravitating, but repelling each other.

The positive matter would act via left-handed gravitons as before, but the negative matter would act via right-handed gravitons. Penrose, in his description of Twistor Theory, says that there's a problem in the calculations getting left-handed and right-handed gravitons to interact with each other to enable graviton plane polarization, similar to what's possible with electromagnetism. But in my theory, it would be a requirement that left-handed and right-handed gravitons don't interact in any way. This enables both attractive gravity and repulsive gravity to operate at different scales in the same spacetime.

This graviton-handedness has a counterpart in neutrinos, responsible for the vast excess of matter over antimatter in the observable Universe. So we need to follow the lead of Charge-Parity-Time (CPT) Invariance, and likewise introduce parity invariance, resulting in what I'm now calling Mass-Parity-Distance Invariance, or MPD-invariance.

Dark Energy
Observational evidence of such MPD-invariant negative matter would be an expected after-effect of the inflation of the very early Universe. The modified version of the Big Bang is that the Universe's overall zero energy fractured into equal Planck-distance-separated positive and negative amounts in the first quantum instant of the Universe; then their respective gravitational fields repelled the positive and negative matter away from each other, resulting in a Big Bang in two different directions along one spatial axis. The actual reason for the Big Bang can therefore be explained by quantum effects.

After the faster-than-light inflation stopped, the right-handed gravitons from the negative matter would be travelling towards the positive matter at only the speed of light, resulting in a time lag between inflation ending and the gravitational repulsion of the negative mass beginning to affect the positive mass with a renewed expansion. This is what we observe starting after about 8 billion years: what's called Dark Energy. Though I suspect that for it to have its observed strength and timing, the observable Universe would have to be a very tiny proportion of the actual Universe. Just as our sun is one of about 100 billion stars in the Milky Way, and our galaxy one of about 100 billion galaxies in the observable Universe, so our observable Universe could also be a 100-billionth of the actual Universe.

Negative-Frequency Electromagnetism
The photon would behave differently to the graviton. Planck's famous equation states photon energy equals Planck's constant multiplied by the frequency. Negative-energy photons would then have negative frequency, but for a photon this is not the same as changing the handedness (helicity), because photons have both electric and magnetic vectors. Both left-handed and right-handed photons have positive energy, and can polarize. Photons of negative energy/frequency, whether left-handed or right-handed, would have their electric and magnetic vectors swapped around.

Negative matter and antimatter are two separate concepts. Matter and antimatter created from positive energy in normal particle interactions would both have positive mass, similarly negative mass for negative energy. The first quantum event of the Big Bang would determine how much energy, positive or negative, is in each side of the Universe. The left-handed gravitons and left-handed neutrinos go one way, their right-handed counterparts, the other. So one half of the Universe is matter with positive mass, the other half, antimatter with negative mass. One spatial dimension of the Universe is thus different to the other two, with homogeneity and isotropy being more local effects.

An alternative shape of the Universe is a four-partitioned one, where positive matter, positive antimatter, negative matter, and negative antimatter fly off in 4 different directions on a plane. This can be visualized with the 2-D saddle-shape for a hyperbolic Universe, with positive matter on top of the sheet, its matter going one way and its antimatter the other, both down the saddle on each side, and negative matter underneath the sheet, its matter and antimatter each flying off up the saddle, at ninety-degree angles to the positive matter and antimatter.

MPD/CPT symmetry duality

There's an eerie similarity between the well-known Charge-Parity-Time (CPT) Invariance and my proposed Mass-Parity-Distance (MPD) Invariance. I've ignored the forces without an infinite range (the strong and weak forces) in this model. The basic difference between MPD-invariant gravity and CPT-invariant electromagnetism is that in gravity, like masses attract while unlike ones repel, whereas in electromagnetism, like charges repel while unlike ones attract. The logical effect of this (ignoring finite-range forces) is that gravity's masses are real numbers, while charges are polar.

Under presently known laws of physics, the Universe isn't a self-contained system because it looks different at its largest scale than it does at its smallest. The entire Universe might simply be contained in a speck of dust in another one with different laws. Perhaps the CPT/MPD symmetry duality suggests the Universe looks exactly the same at its largest scale as at its smallest.

So perhaps gravity and electromagnetism started off as exactly the same force. Gravity is simply what we see when we're looking outwards to the edge of the Universe, and electromagnetism is what we see when we're looking inwards to the smallest scale of the Universe. On the inside looking outwards, there's only one instance to look at, but on the outside looking in, we see many instances. From the inside looking out, it looks like MPD-symmetric positive and negative Mass obeying the laws of gravity, but from the outside looking in, it looks like CPT-symmetric positive and negative Charge obeying electromagnetic laws. The other forces, those with finite ranges, would fork from the electromagnetic under low energies, as explained by current theories.

How real is the negatively massed side of the Universe? If the only way we detect its presence is via Dark Energy at the expected strength and distribution, then surely it's no more real than quantum-based possibilities that never actualized in our positively-massed side of the Universe. Quantum uncertainty would therefore exist at the largest scale of the Universe, as well as the smallest. The main actualized effect is that such Dark Energy causes one of the spatial dimensions in our positively-massed Universe to have a downwards direction, perhaps a complement to one of the dimensions in our spacetime having a timelike arrow. It might be difficult to comprehend this, but 500 years ago people knew one dimension of space had a downwards direction, and Copernicus got into a lot of trouble for suggesting it didn't.

Friday, June 20, 2008

Word Classes in English and Groovy

When I was in primary school, I learnt that English had 8 parts of speech: nouns, verbs, adjectives, adverbs, pronouns, conjunctions, prepositions, and articles. Nowadays linguists call them word classes. Since working in TESOL, I've learnt that words in English are better classified as falling somewhere along a continuum, with conjunctions, the most grammatical words, at one end, and proper nouns, the most lexical, at the other.

We'll take a quick look at these word classes in English grammar, then look at the similar concept in the Groovy Language. (Note: The English grammar is very simple, and based on what I remember from personal reading, not academic study, so I don't guarantee total correctness).

Word Classes in English

The most grammatical words in English are and, or, and not, the same operators as in propositional logic. and and or can be used to join together most words anywhere further along the continuum. Most obvious are the lexical words, e.g:
  the book, the pen, and the pad (nouns)
  black and blue (adjectives)
  slowly and carefully (adverbs)
  to stop, listen, and sing (verbs)
  to put up or shut up (phrasal verbs)

Also, multiword lexical forms, such as phrases and clauses, can be similarly joined:
  the house, dark blue and three storeys high, ... (adjectival phrases)
  The batter hit the ball and the fielder caught it. (clauses)

But more grammatical words at the same position on the continuum can be joined:
  your performance is over and above expectations (prepositions)
  I could and should go (auxiliary verbs)
  They were and are studying (different type of auxiliary verbs)
  this and that (pronouns)

Incidentally, the continuum can have more than one type of multiword form at the same position, such as adverbials and prepositional phrases:
  They walked, very silently and with great care, ...

and and or are 2 of the only 7 coordinating conjunctions in English, memorized by the acronym FANBOYS: for, and, nor, but, or, yet, and so. But and and or are more grammatical than the other five conjunctions, and can be used to join the others together, e.g:
  It was difficult, yet and so I tried.

The propositional logic operators are the most grammatical words in English.

Next along the continuum are proforms, words that take the place of other more lexical words. In English, the most common type of proform is the pronoun, e.g. he, she, this, which also has determiner form, e.g. his, hers. For example:
  The dog chased the cat, but lost it. (pronoun: it)
  The dog escaped from the goat, but lost its collar. (determiner: its)

Other word classes and multiword forms have proforms. For example, pro-verb do/did:
  I enjoyed the film, and so did the ushers.
Gap for pro-verb:
  We found the south exit, and the other team, the north exit.
Pro-adjective such:
  We experienced a humid day, and also such a night.
Pro-adverb thus:
  Swiftly the Italians played; thus also did the Brazilians.
Proform for multiword adverbial so:
  The programmers finished totally on time; so did the testers.

Next are a large number of miscellaneous words between grammatical and lexical, which some call particles. Examples are interjections, articles (a/an/the), phrasal verb particles, conjunctive adverbs, sentence connectors, verb auxiliaries, not, only, infinitive's to, etc.

English, and I guess every natural language, is really a mess, and the particles are a way of categorizing the messy stuff.

Prepositions and Verbs
The first lexical word class along the continuum is the prepositions. In Hallidayan Functional Grammar, they're considered to be reduced verbs. Some examples: under, over, through, in. There are also multiword prepositional groups, e.g: up to, out of, with respect to, in lieu of.

Further along the continuum are the verbs, e.g. listen, write, walk. Verbs can be multiword, such as phrasal verbs, e.g. put up, shut up, prepositional phrasal verbs, e.g. get on with, put up with, and verb groups, e.g: will be speaking, has walked, to have gotten on with.

Adjectives and Nouns
Next along the continuum are adjectives, e.g. black, blacker, blackest. In Chomskian Transformational Grammar, adverbs ending in -ly are considered to be the same as adjectives, only modified at the surface level, e.g. slowly, slower, slowest.

Adjectives/adverbs can be multiword, e.g:
  The building is three storeys high. (adjectival phrase)
  That cat walks incredibly slowly. (adverb word group)

Next are common nouns, both count nouns, e.g. pen, pens, and mass nouns, e.g. coffee, hope. Nouns can be built into noun phrases, e.g. the long dark blue pen.

Just as verbs and prepositions are related, so are nouns and adjectives. Abstract ideas often only differ grammatically, e.g:
  Jack is very hopeful.
  Jack has much hope.
  Jack has many hopes.

At the lexical end of the grammar-lexis continuum are proper nouns. These can be phrases we construct from other words, e.g. the Speaker's Tavern, foreign words, e.g. pyjamas, fooyung, or even invented words, e.g. Kodak, Pepsi.

The largest word class in English is the nouns, then the adjectives, then the verbs. When new words enter English, they're usually nouns. Some will become adjectives and maybe verbs, but very few ever move further along the continuum towards the grammar end. Although English has many Norman words from 800 or 900 years ago, very few are prepositions, and all the other, more grammatical, words came from Anglo-Saxon.

Perhaps all natural languages have a word class continuum with propositional logic words at one end, and definable nouns at the other.

Word Classes in Groovy
Groovy uses both symbols and alphanumeric keywords for grammar, both lexed and parsed grammar. Groovy builds on Java, and hence C++ and C, for its tokens.

Bracketing and Separators
Perhaps the most grammatical along the continuum are the various bracketing symbols. Some have different tokens for opening and closing, e.g:
  /* */ ( ) [ ] { } < >
while others use the same token for both, e.g:
  """ ''' " ' /
There's no corresponding word class in English because English uses prosody (tone, stress, pause, etc) rather than words for the bracketing function.

Next along Groovy's continuum could be separators, e.g:
  , ; : ->
We can use , and ; for lists of elements, similar to and and or in English.

Groovy has a very limited repertoire of pronouns, only this and it.
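
For example, it stands in for the single closure parameter, and this for the enclosing object:
  [1, 2, 3].each { println it * 2 }     // 'it' is the implicit closure parameter
  class Counter {
      int n = 0
      def bump() { this.n += 1 }        // 'this' refers to the enclosing object
  }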

Verbs and Prepositions
Perhaps operators are like English prepositions, e.g:
  == != > >= < <= <=>
  .. ..< ?: ? : . .@ ?. *. .& ++ -- + - * / % **
  & | ^ ! ~ << >> >>> && || =~ ==~

while some operators are almost like verbs, e.g:
  = += -= *= /= %= **= &= |= ^= <<= >>= >>>=

Some operators are represented by keywords in Groovy, viz. prepositions, an adjective, and a multiword noun-preposition, i.e:
  in as new instanceof

Verbs in indicative form are used in definitions, e.g:
  throws extends implements
  ... .*

The most common verb form is the imperative, e.g:
  switch, do, try, catch, assert, return, throw
  break, continue, import, def, goto

though sometimes English adverbs are used as commands in Groovy, e.g:
  if, else, while, for, finally
Also used for this are nouns, e.g:
  case, default
and symbols, e.g:
  \ // #! $

Nouns and Adjectives
Groovy uses English adjectives for adjectival functions in Groovy, e.g:
  public, protected, private, abstract, final, static
  transient, volatile, strictfp, synchronized, native, const

Groovy has many built-in Groovy common nouns, e.g:
  class, interface, enum, package
  super, true, false, null
  25.49f, \u004F, 0x7E, 123e7

Some of them can also be used like adjectives, e.g:
  boolean, char, byte, short, int, long, float, double, void
are nouns (types) that can precede other nouns (variables), like Toy in A Toy Story.

We can define our own Groovy proper nouns using letters, digits, underscore, and dollar sign, e.g:
  MY_NAME, closure$17, αβγδε

Using @, we can also define our own Groovy adjectives.
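
For instance, here's a minimal sketch of defining and applying such an "adjective" (recent Groovy versions let you declare the annotation in Groovy itself; otherwise it can be declared in Java):
  import java.lang.annotation.*

  @Retention(RetentionPolicy.RUNTIME)    // keep the "adjective" visible at runtime
  @interface Fragile { }

  @Fragile                               // applying our adjective to a noun (a class)
  class TeaCup { }

  println TeaCup.getAnnotation(Fragile)  // prints the @Fragile annotation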

Because Groovy is syntactically derived from Java, and hence from C++ and C, it, like English, is a little messy in its choice of tokens.

Notice also the different emphasis of word classes between English and Groovy, e.g:
  • Groovy uses tokens for bracketing while English uses non-token cues
  • English uses far more proforms than Groovy, which forces us to use temporary variables a lot
  • English uses Huffman coding by shortening common words like prepositions, while Groovy retains instanceof and implements

Conclusion: The Unicode Future
Unicode divides its tokens into different categories: letters (L), marks (M), separators (Z), symbols (S), numbers (N), punctuation (P), and other (C). Within each are various sub-categories. I'm looking at how best to use all Unicode characters (not just CJK ones) when extending a Java-like language such as Groovy with more tokens. The more tokens a language has, the terser it can be written while retaining clarity. Unicode is now a standard, so perhaps programmers will be more motivated to learn its tokens than they were when APL was released. And modern IMEs enable all tokens to be entered easily, for example, my ideas for the Latin-1 characters. Such a Unicode-based grammar must be backwards-compatible, Huffman coded, and easy to enter at the keyboard.

Friday, June 13, 2008

The Future of Programming: Chinese Characters

Last year, I wrote some wordy blog entries on this subject, see Programming in Unicode, part 1 and part 2. Here's a brief rehash of them...

Deep down in their hearts, programmers want to read and write terse code. Hence, along came Perl, Python, Ruby, and for the JVM, Groovy. Lisp macros are enjoying a renaissance. But programmers still want code to be clear, so many reject regexes and the J language as being so terse they're unreadable. All these languages rely on ASCII. The tersity of these languages comes from maximizing the use of grammar, the different ways tokens can be combined. The same 100 or so tokens are used.

Many people can type those 100 tokens faster than they can write them, but can write thousands more they can't type. If there were more tokens available for programming, we could make our code terser by using a greater vocabulary. The greater the range of tokens a language has, the terser its code can be written. APL tried it many years ago but didn't become popular, perhaps because programmers didn't want to learn the unfamiliar tokens and how to enter them. But Unicode has since arrived and is used almost everywhere, including on Java, Windows, and Linux, so programmers already know some of it.

With over 100,000 tokens, Unicode consists of alphabetic letters, CJK (unified Chinese, Japanese, and Korean) characters, digits, symbols, punctuation, and other stuff. Many programming languages already allow all Unicode characters in string contents and comments, which programmers in non-Latin-alphabet countries (e.g. Greece, Russia, China, Japan) often use. Very few programming languages allow Unicode symbols and punctuation in the code: perhaps language designers don't want to allow anything resembling C++ operator overloading.

But many programming languages do allow Unicode alphabetic letters and CJK characters in names. Because there already exist agreed meanings for combinations of these, derived from their respective natural languages, programmers can increase tersity while keeping readability. However, this facility isn't used very much, maybe because the keywords and names in supplied libraries (such as class and method names in java.lang) are only available in English.

I suspect programmers from places not using the Latin alphabet would use their own alphabets in user-defined names in a certain programming language if it was fully internationalized, i.e., if they could:
  • configure the natural language for the pre-supplied keywords and names (e.g. in java.lang)
  • easily refactor their code between natural languages (ChinesePython doesn't do this)
  • use a mixture of English and their own language in code so they could take up new names incrementally
Most programming languages enable internationalized software and webpages, but the languages themselves and their libraries are not internationalized. Although such internationalization is now rare, most programming languages will one day be internationalized, following the trend of the software they're used to write. The only question is how long this will take.

However, I suspect most natural languages wouldn't actually be used with internationalized programming as there's no real reason to. Programmers in non-English countries can read English and use programming libraries easily, especially with IDE auto-completors. Writing foreigner-readable programs in English will be more important.

To become popular in a programming language, a natural language must:
  • have many tokens available, enabling much terser code, while retaining clarity. East Asian ideographic languages qualify here: in fact, 80% of Unicode tokens are CJK or Korean-only characters.
  • be readable at the normal coding font. Japanese kanji and complex Chinese characters (used in Hong Kong, Taiwan, and Chinatowns) don't qualify here, leaving only Korean and simplified Chinese (used in Mainland China).
  • be easily entered via the keyboard. An IME (input method editor) allows Chinese characters to be entered easily, either as sounds or shapes. The IME for programming could be merged with an IDE auto-completor for even easier input.
And to be the most popular natural language used in programming, it must:
  • enable more tokens to be added, using only present possible components and their arrangements. Chinese characters are composed of over 500 different components (many still unused), in many possible arrangements, while Korean has only 24 components in only one possible arrangement.
  • be used by a large demographic and economic base. Mainland China has over 1.3 billion people and is consistently one of the fastest growing economies in the world.
About a year ago, I posted a comment on Daniel Sun's blog on how to write a Groovy program in Chinese. (The implementation is proof-of-concept only; a scalable one would be different.) The English version is:
content.tokenize().groupBy{ it }.
  collect{ ['key':it.key, 'value':it.value.size()] }.
  findAll{ it.value > 1 }.sort{ it.value }.reverse().
  each{ println "${it.key.padLeft( 12 )} : $it.value" }

The Chinese version reduces the size by over half (Chinese font required):
物.割().组{它}.集{ ['钥':它.钥, '价':它.价.夵()] }.
  都{它.价>1}.分{它.价}.向().每{打"${它.钥.左(12)}: $它.价"}

I believe this reduction is just the beginning of the tersity that using all Chinese characters in programming will bring. The syntax of present-day programming languages is designed to accommodate their ASCII vocabulary. With a Unicode vocabulary, the language grammar could be designed differently to make use of the greater vocabulary of tokens. As one example of many: if all modifiers are each represented by a single Chinese character, for 'public class' we could just write '公类' without a space between (just like in Chinese writing), instead of '公 类', making it terser.

A terse programming language and a tersely-written natural language used together means greater semantic density, more meaning in each screenful or pageful, hence it’s easier to see and understand what's happening in the program. Dynamic language advocates claim this benefit for dynamic programming over static programming: the benefit is enhanced for Chinese characters over the Latin alphabet.

If only 3000 of the simplest-written 70,000 CJK characters in Unicode are used, there are millions of unique two-Chinese-character words. Imagine the reduction in code sizes if the Chinese uniquely map them to every name (packages, classes, methods, fields, etc) in the entire Java class libraries. Just as Perl, Python, and Ruby are used because of the tersity of their grammar, so also Chinese programming will eventually become popular because of the tersity of its vocabulary.
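
A quick back-of-the-envelope check of that claim in Groovy:
  println 3000 ** 2   // 9,000,000 possible ordered two-character combinations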

Furthermore, in an internationalized programming language, not only could Chinese programmers mix Chinese characters with the Latin alphabet in their code, but so could Western programmers. Hackers want to write terse code, and will experiment with new languages and tools at home if they can't in their day jobs. They'll begin learning and typing Chinese characters if it reduces clutter on the screen, there's generally available Chinese translations of the names, they can enter the characters easily, and start using them incrementally. By incrementally I mean only as fast as they can learn the new vocabulary, so that some names are in one language and some in another. This is much easier if the two natural languages use different alphabets, as do English with Chinese. A good IDE plugin could transform the names in a program between two such natural languages easily enough.

Non-Chinese programmers won't have to learn Chinese speaking, listening, grammar, or writing. They can just learn to read characters and type them, at their own pace. Typing Chinese is quite different to writing it, requiring recognizing eligible characters in a popup menu. They can learn the sound of a character without the syllabic tone, or instead just learn the shape.

Having begun using simplified Chinese characters in programs, programmers will naturally progress to all the left-to-right characters in the Unicode basic multilingual plane. They'll develop libraries of shorthands, typing π instead of Math.PI. There’s a deep urge within hackers to write programs with mathlike tersity, to marvel at the power portrayed by a few lines of code. Software developers all over the world could be typing in Chinese within decades.

Chinese character data file available...
Recently, I analyzed the most common 20,934 Chinese characters in Unicode (the 20,923 characters in the Unicode CJK common ideograph block, plus the 12 unique characters from the CJK compatibility block), aiming to design an input method that makes it easy for foreigners to enter CJK characters.

For each character, I've recorded one or two constituent components, and a decomposition type. Only pictorial configurations are used, not semantic ones, because the decompositions are intended for foreigners when they first start to learn CJK characters, before they're familiar with meanings of characters. Where characters have typeface differences I've used the one in the Unicode spec reference listing. When there's more than one possible configuration, I've selected one based on how I think a fellow foreigner will analyse the character. I've created a few thousand characters to cater for decomposition components not themselves among my collected characters. (Although many are in the CJK extension A and B blocks, I kept those out of scope.) To represent these extra characters in the data, sometimes I've used a multi-character sequence, sometimes a user-defined glyph.

The data file is CSV-format, with 4 fields:
  • the character
  • first component
  • either second component, or -
  • type of decomposition
Here's a zip of that data file and truetype font file if anyone's interested.
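
A minimal Groovy sketch of reading the file (the filename here is just a placeholder for whatever the file in the zip is called):
  // each line: character, first component, second component (or -), decomposition type
  new File('cjk-decompositions.csv').eachLine { line ->
      def (character, first, second, type) = line.tokenize(',')
      if (second == '-') {
          println "$character has the single component $first ($type)"
      } else {
          println "$character = $first + $second ($type)"
      }
  }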