Saturday, December 06, 2008

The Thematic Structure of English and Groovy

After working as a programmer for many years, I tossed it in to teach English in China. I spent a few years reading the many books on language and linguistics in the bookshops up here, before returning to programming as a hobby. I then started to see many similarities between natural and computer languages, which I'm intermittently blogging about. Here's today's installment...

Introduction
Of all the writers on language I've read, M.A.K. Halliday makes the most sense to me. He suggests we should analyse language in terms of what it's used for, rather than its inherent structure. From this basis, he's isolated three basic functions of natural language, and their corresponding structural subsystems: the ideational, the interpersonal, and the textual.

The ideational function is a representation of experience of the outside world, and of our consciousness within us. It has two main components: the experiential and the logical. The experiential component embodies single processes, with their participants and circumstances, in a transitivity structure. For example, “At quarter past four, the train from Newcastle will arrive at the central station.” has a transitive structure with process to arrive, participants train from Newcastle and central station, and circumstance quarter past four. The primary participant is called the actor, here, the train from Newcastle. Computer languages have a structure paralleling the transitivity structure of natural languages, e.g. train.arrive(station, injectedCircumstance) for object-oriented languages. The logical component of ideational function concerns links between the experiential components, attained with English words such as and, which, and while. These have obvious parallels in programming languages.
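To make the parallel concrete, here's a minimal Groovy sketch; the Train and Station classes, and the arrive method, are invented for illustration, not taken from any real library:

class Station { String name }
class Train {
  String origin
  void arrive(Station at, String circumstance) {
    // process = method, actor = receiver, other participant = argument, circumstance = injected parameter
    println "The train from $origin arrives at the ${at.name} at $circumstance."
  }
}
new Train(origin: 'Newcastle').arrive(new Station(name: 'central station'), 'quarter past four')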

The interpersonal function involves the producer and receiver/s of language, and the relationship between them. This function accounts for the hundreds of different attitudinal intonations we overlay onto our speech, interjections, expressions of politeness we prepend to English sentences, e.g. “Are you able to...”, many adverbs of opinion, the mood (whether indicative, interrogative, imperative, or exclamative), and the modal structure of English grammar. The mood structure causes verbal phrases to have certainty, ability, allowability, polarity, and/or tense prepended in English, and can be repeated in the question tag, e.g. isn't he?, can't we?, should they?. The interpersonal function gives the grammatical subject-and-predicate structure to English. In programming languages, the interpersonal function determines how people interact with the program, and how other programs interact with it. The interpersonal functions are what would normally be extracted into aspects in aspect-oriented programming. They generally disrupt the “purer” transitivity structure of the languages.
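As a rough sketch of that correspondence, here's a hand-rolled Groovy closure standing in for a real AOP framework; the politely wrapper is hypothetical, and simply keeps the interpersonal overlay separate from the transitivity-structured action:

def politely = { Closure action ->
  println 'Excuse me...'      // interpersonal material prepended, like "Are you able to..."
  def result = action()
  println 'Thank you.'
  result
}
politely { println 'The train arrives at the central station.' }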

The textual function brings context to the language through different subsystems. The informational subsystem divides the speech or text into tone units using pauses, then gives stress/es to that part of the unit that is new information. The cohesive subsystem enables links between sentences, using conjunctions and pronouns, substitution, ellipsis, etc. The thematic subsystem makes it easy for receivers to follow the flow of thought. Comparing this structure of the English and Groovy languages is the topic of today's blog post...

Thematic structure of English
Theme in English is overlaid onto the clause, a product of the transitivity and mood structures. The theme occupies first position in the clause. English grammar allows any lexical item from the clause to be placed in first position. (In fact, English allows the lexical items to be arranged in any order, so they can be broken up in any combination into tone units.) Some examples, using the lexical items to give, Alan, me, and the book, with the theme bolded:
  Alan gave me that book in London.
  Alan gave that book to me in London. (putting indirect object into prepositional phrase)
  To me Alan gave that book in London. (fronting indirect object)
  I am who Alan gave that book to in London. (fronting indirect object, with extra emphasis)
  To me that book was given in London. (using passive voice to front indirect object)
  That book was given in London. (using passive voice to omit indirect object)
  That book Alan gave me in London. (fronting direct object as topic)
  That book is the one Alan gave me in London. (fronting direct object in more formal register)
  In London, Alan gave me that book. (fronting adverbial, into separate tone unit)
  London is where Alan gave me that book. (fronting adverbial in the same tone unit)
  There is a book given by Alan to me in London. (null topic)

Although not common, English also allows the verb to be put in first position as theme:
  Give the book Alan did to me in London.
  Give me the book did Alan in London.
  Give did Alan of the book to me in London.
  Given was the book by Alan to me in London.

First position is merely the way English indicates what the theme is, not the definition of it. Japanese marks the theme in the grammatical structure (with the particle は wa), while Chinese (I think) uses a combination of first position and grammatical structure (prepending with 是 shì).

Thematic structure of Groovy
One way of indicating theme could be to bold it, assuming the text editor had rich text capabilities. This would be similar to Japanese. For example, for thematic variable a:
  def b = a * 2; def c = a / 2
Another way is to use first position, as English does. This would give programming languages an Anglo-centric thematic structure, to match the Anglo-centric naming system they generally already have. Perhaps the best way is a combination of both front position and bolding.

Let's look at how Groovy could enable front-position thematic structure. We'll start with something simple, the lowest precedence operator: a = b. If we want to front the b, we can't. We would need some syntax like =:, the reverse of Algol's :=
  b =: a

We'd need to provide the same facility for the other precedence operators at the same level += -= *= /= %= <<= >>= >>>= &= ^= |=. Therefore, we'd have operators =+: =-: =*: =/: =%: =<<: =>>: =>>>: =&: =^: =|:.
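Since none of this syntax exists in Groovy, here's a minimal sketch of how a line-based preprocessor (like the one mentioned in the conclusion) might rewrite the plain reversed assignment back into ordinary Groovy; it's purely textual, ignores strings and comments, and handles only simple names:

def unreverse = { String line -> line.replaceAll(/(\w+)\s*=:\s*(\w+)/, '$2 = $1') }
assert unreverse('b =: a') == 'a = b'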

At the next higher precedence level are the conditional and Elvis operators. Many programming languages, such as Perl and Ruby, enable unless as a statement suffix, allowing the action to be fronted as the theme. Groovy users frequently request this feature on the user mailing list. An unless keyword would be useful (a closure-based stand-in is sketched after the list below), but we could also make the ? : and ?: operators multi-theme-enabling by reversing them, i.e. : ? and :?, with opposite (leftwards) associativity. The existing right-associative ones would have higher precedence than these new ones, so, for example:
  a ? b : c ? d : e would associate like a ? b : (c ? d : e)
  a : b ? c : d ? e would associate like (a : b ? c) : d ? e
  a : b : c ? d ? e would associate like a : (b : c ? d) ? e
  and a ? b ? c : d : e would associate like a ? (b ? c : d) : e
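Until such syntax exists, an unless that fronts the action can be faked with an ordinary closure; this is only a sketch, and unless here is a variable, not a keyword:

def unless = { boolean condition, Closure action -> if (!condition) action() }
def balance = 50
unless(balance < 0) { println 'carry on spending' }   // the action is the theme, the condition trails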

On a similar note: Groovy dropped the do while statement because of parser ambiguities. It should be renamed do until to overcome the ambiguities.

Next up the precedence hierarchy, we need shortcut boolean operators ||: and &&:, which evaluate, associate, and shortcut rightwards. Most of the next few operators up the hierarchy | ^ & == != <=> < <= > >= + * don't need reverse versions, but these do: =~ ==~ << >> >>> - / % **. It's good Groovy supplies the ..< operator so we can emphasize an endpoint in a range without actually processing it. We'll also provide the >.. and >..< operators.
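For instance, the existing half-open range names its endpoint without ever producing it:

  assert (1..<5).toList() == [1, 2, 3, 4]   // 5 is named as the endpoint but never processed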

Just as in English we have the choice of saying the king's men or the men of the king, depending on what we want to make thematic, we should have that choice in Groovy too.
We can easily encode reverse-associating versions of *. ?. .& .@ *.@ as .* .? &. @. @.*. To encode the standard path operator ., we could use .:.

A positive by-product of having these reverse-associative versions of the Groovy operators is they'll work nicely with names in right-directional alphabets, such as Arabic and Hebrew, when we eventually enable that.

When defining methods in Groovy, we should have the choice to put return values and modifiers after the method name and parameters, as in Pascal. This would cater for speakers of Romance languages, e.g. French, who generally put adjectives after nouns.

Conclusion
Groovy, like most programming languages, doesn't enable programmers to supply their own thematic structure to code, only the transitivity structure. When used well, thematic structure in code enables someone later on to more easily read and understand the program. Perl was a brave attempt at providing “more than one way to do things”, but most programming languages haven't learnt from it. I'm working on a preprocessor for the Groovy Language, experimenting with some of these ideas. If it looks practical, I'll release it one day as GroovyScript. It will make even Perl code look verbose.

Saturday, November 22, 2008

Stress and Unstress in Computer Languages

Computer languages could learn a few things from natural languages in their design...

Natural Language
Many natural languages, such as English, make a distinction between stressed and unstressed words. In general, nouns, verbs, and adjectives (including adverbs ending in -ly) are stressed, while grammar words are unstressed.

For example: “I walked the spotty dog to the shop, quickly bought some bread, and returned home”. (I've bolded the syllables we stress during speech in this and following examples.)

We stress the nouns (dog, shop, bread, home), adjectives (spotty, quick), and verbs (walk, buy, return), and don't stress the grammar words (I, the, to, -ly, some, and). (Note: In Transformational Grammar, adverbs ending in -ly are considered to be a specific inflectional form of the corresponding adjectives.)

Examples of unstressed grammar words in English are conjunctions (and, or, but), conjunctive adverbs (while, because), pronouns (this, you, which), determiners (any, his), auxiliary verbs (is, may), prepositions (to, on, after), and other unclassed words (existential there, infinitive to), as well as many inflectional morphemes (-s, -'s, -ing, -ly).

Verbs are often only half-stressed instead of fully stressed, and prepositions half-stressed instead of unstressed, depending on the surrounding context, e.g. “The teacher saw the book behind the desk.” (Here, I've bold-italicized the half-stressed words.)

English has a clear distinction between grammar words and lexical words (nouns, adjectives/adverbs, and verbs) in speech.

Many languages distinguish between lexical and grammar words in their writing systems. German capitalizes the first letter of each noun. (Danish stopped doing this in 1948, and English in the 1700s.) Japanese uses Chinese characters for nouns and many adjectives, and the kana syllabaries for grammar words and many verbs.

When using grammar words in a lexical capacity, we stress them when speaking, e.g. “I put an 'is' followed by an 'on', before the 'desk' with a 'the' before it, to make a predicate.” And when writing, we put the grammar words we're using as lexical ones inside quotes.

Using stress and unstress to separate lexical and grammar words enables English, and probably all natural languages, to be self-referential.


Computer Languages
Virtually every computer language differentiates between lexical words and grammar words.

Assembler and Cobol used indentation and leading keywords to distinguish different types of statements, and space and comma to separate items. In these, as in many languages after them, the limited set of keywords couldn't be used for user-defined names. Fortran introduced a simple infix expression syntax for math calculations, using special symbols (+ - * etc) for the infix operators with their precedences, and ( ) for bracketing. Lisp removed the indentation and keywords completely, making everything use bracketing, with space for separation, and a prefix syntax. APL removed the precedences, but introduced many more symbols for the operators. The experimentation continued until C became widespread.

C uses 3 different types of symbols for bracketing, ( ) [ ] { }. C++, Java, and C# added < > for bracketing. C uses space and , ; . for separators, and a large number of operators, organized via a complex precedence system. Java has 53 keywords; C# has 77.

The lexical words of computer languages are clear. Classes and variables are nouns. Functions and methods are verbs. Keywords beginning a statement are imperative verbs, and in some languages are indistinguishable from functions. Modifiers, interfaces, and annotations are adjectives/adverbs. The operators (+ - * / % etc) bear a similarity to prepositions, some of them (+= -= *= etc), to verbs. And I'd suggest the tokens used for bracketing and separators are clear examples of grammar words in computer languages, being similar to conjunctions and conjunctive adverbs.

In general, computer languages use some tokens (e.g. A-Z a-z 0-9 _) for naming lexical words, and others (e.g. symbols and punctuation) for grammar. Occasionally, there are exceptions, such as new and instanceof in Java. Some computer languages use other means. Perl and PHP put a sigil (such as $ or @) before all lexical words, enabling all combinations of tokens to be used for names. This is similar to capitalizing all nouns in German. C# allows @ before any lexical word, but only requires it before those which double as keywords. This is similar to quoting grammar words to use them as lexical ones in English.

Newer programming languages have different ways to use Unicode tokens in names and operators. The display tokens in Unicode fall into six basic categories: letters (L), marks (M), numbers (N), symbols (S), punctuation (P), and separators (Z). Python 3.0 names can begin with any Unicode letter (L), numeric letter (in N), or the underscore (in P); subsequent tokens can also be combining marks (in M), digits (in N), and connector punctuation (in P). Scala names can begin with an upper- or lowercase Unicode letter (in L), the underscore (in P), or the dollar sign (in S); subsequent tokens can also be certain other letters (in L), numeric letters (in N), and digits (in N). Scala operators can include math and other symbols (in S). Almost all languages have the same format for numbers, beginning with a number (in N), perhaps with letters (in L) as subsequent tokens.

Perhaps the easiest way to distinguish between lexical and grammar words in GrerlVy is to use Unicode letters (L), marks (M), and numbers (N) exclusively for lexical words, and symbols (S), punctuation (P), and separators (Z) exclusively for grammar words. Of course, we still have a difficulty with the borderline case: infix operators and prefix methods, which correspond roughly to prepositions and verbs, the half-stressed words in English. I'm still thinking about that one.
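Groovy already leans this way: letters from any script can form lexical words, while the grammar stays in symbols and punctuation. A small sketch, with invented names:

def 速度 = 30                      // a lexical word written with CJK letters (category L)
def πρόσθεση = { a, b -> a + b }   // Greek letters work too
assert πρόσθεση(速度, 12) == 42    // the grammar remains symbols and punctuation (S and P)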

Saturday, September 06, 2008

Mass-Parity-Distance Invariance

During July and August, I took a break from China and Tesol and Groovy, visiting my home country, New Zealand. I hoed into a copy of Roger Penrose's The Road to Reality, and came up with an idea to explain Dark Energy...

Negative Mass
Negative mass is usually defined in such a way that Einstein's equivalence principle still holds, where gravitational mass is proportional to inertial mass. This results in some bizarre effects. But while reading Penrose's book, I got an idea on how to define negative mass so that all the positive matter and all the negative fly off in two opposite directions at the Big Bang, with the equivalence principle still holding.

The key is how we calculate the (scalar) distance with respect to some mass. For positive matter, we would continue to use the positive solution to the formula where we square root the sum of the squares of the three spatial coordinates. But we'd introduce an invariance, known as the Mass-Distance Invariance, where we'd use the negative solution to the square root for scalar distances measured with respect to negative masses.

Some consequences of this invariance are:
  • The same vector values for velocity and acceleration would be used for negative mass as for positive mass, but their scalar values would depend on whether positive matter was referenced, or negative matter. Negative matter would use negative speeds and, to indicate increasing speeds, negative acceleration values.

  • A positive-valued g-force (created by positive matter) would still mean attraction for positive matter, but repulsion for negative matter. However, a negative g-force (created by negative matter) would mean attraction for negative matter, but repulsion for positive.

  • When calculating the (scalar) gravitational force between two objects, the square of the distance between them would always be positive, but a positive force is attraction, and a negative force is repulsion. This means two negative masses attract, as do two positive masses, but positive and negative masses repel each other.

  • Such scalar values for force involving negative matter would use negative distance again when calculating energies, resulting in negative energies. Penrose mentions negative energies mess with quantum mechanical calculations, but in the real Universe, this would be OK because positive and negative energies would be partitioned off due to the gravitational effects of the Big Bang.

Therefore, when calculating scalar values in the negatively-massed side of the Universe, we'd use (1) negative distances, (2) divided by positive time to give negative-valued speed, (3) divided by positive time again to give negative acceleration values to indicate increasing speeds, (4) multiplied by negative mass to give positive-valued scalar forces to indicate attraction, (5) multiplied by negative distances to give negative values for energy.


Picturing All This
When picturing such a scenario using the common "matter bends space which moves matter" 2D curved-space picture to model the 3+1D reality in general relativity, the positive matter would be on top of the sheet sinking downwards as before, but the negative matter would be under the sheet, to indicate negative distances, floating upwards, to indicate the negative mass. We can then visualize positive and negative matter each self-gravitating, but repelling each other.


Gravitons
The positive matter would act via left-handed gravitons as before, but the negative matter would act via right-handed gravitons. Penrose, in his description of Twistor Theory, says that there's a problem in the calculations getting left-handed and right-handed gravitons to interact with each other to enable graviton plane polarization, similar to what's possible with electromagnetism. But in my theory, it would be a requirement that left-handed and right-handed gravitons don't interact in any way. This enables both attractive gravity and repulsive gravity to operate at different scales in the same spacetime.

This graviton-handedness has a counterpart in neutrinos, responsible for the vast excess of matter over antimatter in the observable Universe. So we need to follow the lead of Charge-Parity-Time (CPT) Invariance, and likewise introduce parity invariance, resulting in what I'm now calling Mass-Parity-Distance Invariance, or MPD-invariance.


Dark Energy
Observational evidence of such MPD-invariant negative matter would be an expected after-effect of the inflation of the very early Universe. The modified version of the Big Bang is that the Universe's overall zero energy fractures into equal Planck-distance-separated positive and negative amounts in the first quantum instant of the Universe, then their respective gravitational fields repel the positive and negative amounts away from each other, resulting in a Big Bang in two different directions along one spatial axis. The actual reason for the Big Bang can therefore be explained by quantum effects.

After the faster-than-light inflation stopped, the right-handed gravitons from the negative matter would be travelling towards the positive matter at only the speed of light, resulting in a time lag between inflation ending and the gravitational repulsion of the negative mass beginning to affect the positive mass with a renewed expansion. This is exactly what happened after about 10 billion years: what's called Dark Energy.


Negative-Frequency Electromagnetism
The photon would behave differently to the graviton. Planck's famous equation states photon energy equals Planck's constant multiplied by the frequency. Negative-energy photons would then have negative frequency, but for a photon this is not the same as changing the handedness (helicity), because photons have both electric and magnetic vectors. Both left-handed and right-handed photons have positive energy, and can polarize. Photons of negative energy/frequency, whether left-handed or right-handed, would have their electric and magnetic vectors swapped around.


Antimatter
Negative matter and antimatter are two separate concepts. Matter and antimatter created from positive energy in normal particle interactions would both have positive mass, similarly negative mass for negative energy. But a virtual particle-antiparticle pair in a vacuum would not only have an overall charge of zero, but also an overall energy of zero, one of the pair having positive energy, the other, negative. Perhaps the particle has negative mass, or perhaps the antiparticle does. This fact could provide a solution to the "hierarchy problem", there no longer being any need for supersymmetric particles to adjust quantum energy values.

The first quantum event of the Big Bang would determine how much energy, positive or negative, is in each side of the Universe. The left-handed gravitons and left-handed neutrinos go one way, their right-handed counterparts, the other. So one half of the Universe is matter with positive mass, the other half, antimatter with negative mass. One spatial dimension of the Universe is thus different to the other two, with homogeneity and isotropy being more local effects.

An alternative shape of the Universe is a four-partitioned one, where positive matter, positive antimatter, negative matter, and negative antimatter fly off in 4 different directions on a plane. This can be visualized with the 2-D saddle-shape for a hyperbolic Universe, with positive matter on top of the sheet, its matter going one way and its antimatter the other, both down the saddle on each side, and negative matter underneath the sheet, its matter and antimatter each flying off up the saddle, at ninety-degree angles to the positive matter and antimatter.


It's been two decades since I finished my undergrad maths degree, and I haven't used it since, so I'm rusty. And although I basically followed the maths in Penrose's book, I didn't get all the intricacies of manifold calculus and bundles and Lagrangians. If anyone out there fills my wordy explanation of MPD-Invariance with numbers, let me know if it works or if it's rubbish. But there's more follow-on ideas I've had...

The Universe as Two Complex Planes
There's an eerie similarity between the well-known Charge-Parity-Time (CPT) Invariance and my proposed Mass-Parity-Distance (MPD) Invariance. I think it suggests a certain structure to the Universe alluded to by Penrose in his Twistor Theory. He suggests the Universe can be modelled as three complex planes (i.e. 6 real dimensions), the "imaginary" dimension being as physically real as a "real" one. But elsewhere Penrose says if there are only 4 observational dimensions of spacetime, we shouldn't try to model them with 11 or 26 dimensions. I'd suggest the Universe can be modelled as only 2 complex planes to match the 4 observational dimensions of spacetime. The extra 2 dimensions required by Penrose's model could come from the fractal dimensions created by those 2 complex planes.

A curve on a complex plane usually has a (Hausdorff) dimension of 1, but fractal curves have a dimension higher than 1, but less than or equal to 2. Only very special fractals, such as the Mandelbrot set and Julia sets, have Hausdorff dimension of 2. If there exist on any complex plane an (aleph-zero-)infinite number of concentric Hausdorff-dimension-2 sets, then I suspect the plane itself would have Hausdorff dimension 3. The union into a manifold of two such complex planes would have Hausdorff dimension 6, while only having topological dimension 4, thus satisfying Penrose's minimal number of dimensions to model our Universe.

We can create such an arrangement on both our complex planes by relating them together using an uncertainty relation. Because the Mandelbrot and Julia sets are the only sets I know of with Hausdorff dimension high enough to be valid in this model, I'll use the Mandelbrot set as an example. The basic set is only one connected curve on the complex plane, but when a computer calculates it, many circles of various colors are usually displayed to reflect different accuracies of calculation. These circles are concentric. Although only the infinitely accurate Mandelbrot set normally has any mathematical significance, when relating two complex planes together in an uncertainty relationship, the curve generated from each accuracy level takes on significance.


Relationship Between CPT and MPD Invariances
Mass would be modelled as one of the fractal dimensions, while charge would be modelled as the other. The two invariances, CPT and MPD, both of them having parity (i.e. space reflection) included, bear a vague resemblance to the requirements for 2n-D real manifolds to be treated as n-D complex manifolds under the Newlander-Nirenberg theorem, in this case 4 real dimensions as 2 complex planes. One plane, required to be CPT-invariant, would have time as one dimension, say, the real. The imaginary dimension would be a dimension of space, and the fractal dimension, charge. The other complex plane, required to be MPD-invariant, would have the other two dimensions of space for its real and imaginary dimensions, and mass for its fractal dimension.

Planck's constant defines the uncertainty relationship between time (i.e. the real dimension of one complex plane) and energy (i.e. a proxy for mass, the fractal dimension of the other complex plane). This would be the uncertainty relationship that makes the complex planes have (Hausdorff-)dimension-3.

The other dimensional constants of nature could be interpreted as observational coordinate mappings between dimensions on these two complex planes. The speed of light is a mapping between the time and space dimensions on the same plane. Newton's gravitational constant is a mapping between the (fractal) mass dimension and a space dimension. Coulomb's constant is a mapping between the (fractal) charge dimension and a space dimension. The three space dimensions wouldn't need mapping between one another, as their differences from one another are only apparent in the helicity of the graviton and neutrino. So the four constants would be sufficient mappings for the two planes.


Everyday Observation
I've ignored the forces without an infinite range (the strong and weak forces) in this model. The basic difference between MPD-invariant gravity and CPT-invariant electromagnetism is that in gravity, like masses attract while unlike ones repel, whereas in electromagnetism, like charges repel while unlike ones attract. The logical effect of this (ignoring finite-range forces) is that gravity's masses are real numbers, while charges are polar.

So we have two complex planes, each with three dimensions, i.e. real-imaginary-fractal. The first has Time-Distance-Charge, the second, Distance-Distance-Mass. Perhaps, in our own everyday observation of these planes, the charge, having polar (i.e. 0 or +1 or -1) values only, doesn't require its total dimensional freedom to operate, and only needs a Planck-distance portion of the (fractal) charge and (imaginary) distance dimensions. So the second complex plane "takes" the excess distance dimension from the first plane to create 3D-space, and the mass, having aggregative values, also "takes" the excess fractal freedom of the charge. So we end up with 1D-time, 3D-space, polar-charge, and aggregative-mass.


If I had the time, I'd be looking at the maths for relating two complex planes together, each with (Hausdorff-)2D fractal curves, using an uncertainty relationship, trying to derive relativity axioms and asymmetric time and such stuff. But I've now got other demands on my time, hence this blog entry. If my description rings a bell with anyone, let me know how it goes.

Friday, June 20, 2008

Word Classes in English and Groovy

When I was in primary school, I learnt that English had 8 parts of speech: nouns, verbs, adjectives, adverbs, pronouns, conjunctions, prepositions, and articles. Nowadays linguists call them word classes. Since working in Tesol, I've learnt that words in English are better classified as falling somewhere along a continuum, with conjunctions, the most grammatical words, at one end, and proper nouns, the most lexical, at the other.

We'll take a quick look at these word classes in English grammar, then look at the similar concept in the Groovy Language. (Note: The English grammar is very simple, and based on what I remember from personal reading, not academic study, so I don't guarantee total correctness).


Word Classes in English

Conjunctions
The most grammatical words in English are and, or, and not, the same operators as in propositional logic. and and or can be used to join together most words anywhere further along the continuum. Most obvious are the lexical words, e.g:
  the book, the pen, and the pad (nouns)
  black and blue (adjectives)
  slowly and carefully (adverbs)
  to stop, listen, and sing (verbs)
  to put up or shut up (phrasal verbs)

Also, multiword lexical forms, such as phrases and clauses, can be similarly joined:
  the house, dark blue and three storeys high, ... (adjectival phrases)
  The batter hit the ball and the fielder caught it. (clauses)

But more grammatical words at the same position on the continuum can be joined:
  your performance is over and above expectations (prepositions)
  I could and should go (auxiliary verbs)
  They were and are studying (different type of auxiliary verbs)
  this and that (pronouns)

Incidentally, the continuum can have more than one type of multiword form at the same position, such as adverbials and prepositional phrases:
  They walked, very silently and with great care, ...

and and or are 2 of only 7 conjunctions in English, memorized by the acronym FANBOYS: for, and, nor, but, or, yet, and so. But and and or are more grammatical than the other five conjunctions, and can be used to join the others together, e.g:
  It was difficult, yet and so I tried.

The propositional logic operators are the most grammatical words in English.


Proforms
Next along the continuum are proforms, words that take the place of other more lexical words. In English, the most common type of proform is the pronoun, e.g. he, she, this, which also has determiner form, e.g. his, hers. For example:
  The dog chased the cat, but lost it. (pronoun: it)
  The dog escaped from the goat, but lost its collar. (determiner: its)

Other word classes and multiword forms have proforms. For example, pro-verb do/did:
  I enjoyed the film, and so did the ushers.
Gap for pro-verb:
  We found the south exit, and the other team, the north exit.
Pro-adjective such:
  We experienced a humid day, and also such a night.
Pro-adverb thus:
  Swiftly the Italians played; thus also did the Brazilians.
Proform for multiword adverbial so:
  The programmers finished totally on time; so did the testers.


Particles
Next are a large number of miscellaneous words between grammatical and lexical, which some call particles. Examples are interjections, articles (a/an/the), phrasal verb particles, conjunctive adverbs, sentence connectors, verb auxiliaries, not, only, infinitive's to, etc.

English, and I guess every natural language, is really a mess, and the particles are a way of categorizing the messy stuff.


Prepositions and Verbs
The first lexical word class along the continuum is the prepositions. In Hallidayan Functional Grammar, they're considered to be reduced verbs. Some examples: under, over, through, in. There are also multiword prepositional groups, e.g: up to, out of, with respect to, in lieu of.

Further along the continuum are the verbs, e.g. listen, write, walk. Verbs can be multiword, such as phrasal verbs, e.g. put up, shut up, prepositional phrasal verbs, e.g. get on with, put up with, and verb groups, e.g: will be speaking, has walked, to have gotten on with.


Adjectives and Nouns
Next along the continuum are adjectives, e.g. black, blacker, blackest. In Chomskian Transformational Grammar, adverbs ending in -ly are considered to be the same as adjectives, only modified at the surface level, e.g. slowly, slower, slowest.

Adjectives/adverbs can be multiword, e.g:
  The building is three storeys high. (adjectival phrase)
  That cat walks incredibly slowly. (adverb word group)

Next are common nouns, both count nouns, e.g. pen, pens, and mass nouns, e.g. coffee, hope. Nouns can be built into noun phrases, e.g. the long dark blue pen.

Just as verbs and prepositions are related, so are nouns and adjectives. Abstract ideas often only differ grammatically, e.g:
  Jack is very hopeful.
  Jack has much hope.
  Jack has many hopes.

At the lexical end of the grammar-lexis continuum are proper nouns. These can be phrases we construct from other words, e.g. the Speaker's Tavern, foreign words, e.g. pyjamas, fooyung, or even invented words, e.g. Kodak, Pepsi.

The largest word class in English is the nouns, then the adjectives, then the verbs. When new words enter English, they're usually nouns. Some will become adjectives and maybe verbs, but very few ever move further along the continuum towards the grammar end. Although English has many Norman words from 800 or 900 years ago, very few are prepositions, and all the other more grammatical words came from Anglo-Saxon.

Perhaps all natural languages have a word class continuum with prepositional logic words at one end, and definable nouns at the other.



Word Classes in Groovy
Groovy uses both symbols and alphanumeric keywords for its grammar, both lexed and parsed. Groovy builds on Java, and hence C++ and C, for its tokens.

Bracketing and Separators
Perhaps the most grammatical along the continuum are the various bracketing symbols. Some have different tokens for opening and closing, e.g:
  /* */ ( ) [ ] { } < >
while others use the same token for both, e.g:
  """ ''' " ' /
There's no corresponding word class in English because English uses prosody (tone, stress, pause, etc) rather than words for the bracketing function.

Next along Groovy's continuum could be separators, e.g:
  , ; : ->
We can use , and ; for lists of elements, similar to and and or in English.

Groovy has a very limited repertoire of pronouns, only this and it.


Verbs and Prepositions
Perhaps operators are like English prepositions, e.g:
  == != > >= < <= <=>
  .. ..< ?: ? : . .@ ?. *. .& ++ -- + - * / % **
  & | ^ ! ~ << >> >>> && || =~ ==~

while some operators are almost like verbs, e.g:
  = += -= *= /= %= **= &= |= ^= <<= >>= >>>=

Some operators are represented by keywords in Groovy, viz. prepositions, an adjective, and a multiword noun-preposition, i.e:
  in as new instanceof

Verbs in indicative form are used in definitions, e.g:
throws extends implements
... .*

The most common verb form is the imperative, e.g:
  switch, do, try, catch, assert, return, throw
  break, continue, import, def, goto

though sometimes English adverbs are used as commands in Groovy, e.g:
  if, else, while, for, finally
Also used for this are nouns, e.g:
  case, default
and symbols, e.g:
  \ // #! $


Nouns and Adjectives
Groovy uses English adjectives for adjectival functions in Groovy, e.g:
  public, protected, private, abstract, final, static
  transient, volatile, strictfp, synchronized, native, const


Groovy has many built-in Groovy common nouns, e.g:
  class, interface, enum, package
  super, true, false, null
  25.49f, \u004F, 0x7E, 123e7

Some of them can also be used like adjectives, e.g:
  boolean, char, byte, short, int, long, float, double, void
are nouns (types) that can precede other nouns (variables), like Toy in A Toy Story.

We can define our own Groovy proper nouns using letters, digits, underscore, and dollar sign, e.g:
  MY_NAME, closure$17, αβγδε

Using @, we can also define our own Groovy adjectives.


Because Groovy is syntactically derived from Java, and hence from C++ and C, it, like English, is a little messy in its choice of tokens.

Notice also the different emphasis of word classes between English and Groovy, e.g:
  • Groovy uses tokens for bracketing while English uses non-token cues
  • English uses far more proforms than Groovy, which forces us to use temporary variables a lot
  • English uses Huffman coding by shortening common words like prepositions, while Groovy retains instanceof and implements



Conclusion: The Unicode Future
Unicode divides its tokens into different categories: letters (L), marks (M), separators (Z), symbols (S), numbers (N), punctuation (P), and other (C). Within each are various sub-categories. I'm looking at how best to use all Unicode characters (not just CJK ones) when extending a Java-like language such as Groovy with more tokens. The more tokens a language has, the terser it can be written while retaining clarity. Unicode's now a standard, so perhaps programmers will be more motivated to learn its tokens than they were when APL was released. And modern IMEs enable all tokens to be entered easily, for example, my ideas for the Latin-1 characters. Such a Unicode-based grammar must be backwards-compatible, Huffman coded, and easy to enter from the keyboard.

Friday, June 13, 2008

The Future of Programming: Chinese Characters

Last year, I wrote some wordy blog entries on this subject, see Programming in Unicode, part 1 and part 2. Here's a brief rehash of them...


Deep down in their hearts, programmers want to read and write terse code. Hence, along came Perl, Python, Ruby, and for the JVM, Groovy. Lisp macros are enjoying a renaissance. But programmers still want code to be clear, so many reject regexes and the J language as being so terse they're unreadable. All these languages rely on ASCII. The tersity of these languages comes from maximizing the use of grammar, the different ways tokens can be combined. The same 100 or so tokens are used.

Many people can type those 100 tokens faster than they can write them, but can write thousands more they can't type. If there were more tokens available for programming, we could make our code terser by using a greater vocabulary. The greater the range of tokens a language has, the terser its code can be written. APL tried it many years ago but didn't become popular, perhaps because programmers didn't want to learn the unfamiliar tokens and how to enter them. But Unicode has since arrived and is used almost everywhere, including on Java, Windows, and Linux, so programmers already know some of it.

With over 100,000 tokens, Unicode consists of alphabetic letters, CJK (unified Chinese, Japanese, and Korean) characters, digits, symbols, punctuation, and other stuff. Many programming languages already allow all Unicode characters in string contents and comments, which programmers in non-Latin-alphabet countries (e.g. Greece, Russia, China, Japan) often use. Very few programming languages allow Unicode symbols and punctuation in the code: perhaps language designers don't want to allow anything resembling C++ operator overloading.

But many programming languages do allow Unicode alphabetic letters and CJK characters in names. Because there already exists agreed meanings for combinations of these, derived from their respective natural languages, programmers can increase tersity while keeping readability. However, this facility isn't used very much, maybe because the keywords and names in supplied libraries (such as class and method names in java.lang) are only available in English.

I suspect programmers from places not using the Latin alphabet would use their own alphabets in user-defined names in a certain programming language if it was fully internationalized, i.e., if they could:
  • configure the natural language for the pre-supplied keywords and names (e.g. in java.lang)
  • easily refactor their code between natural languages (ChinesePython doesn't do this)
  • use a mixture of English and their own language in code so they could take up new names incrementally
Most programming languages enable internationalized software and webpages, but the languages themselves and their libraries are not internationalized. Although such languages are rare now, most programming languages will one day be internationalized, following the trend of the software they're used to write. The only question is how long this will take.

However, I suspect most natural languages wouldn't actually be used with internationalized programming as there's no real reason to. Programmers in non-English countries can read English and use programming libraries easily, especially with IDE auto-completors. Writing foreigner-readable programs in English will be more important.

To become popular in a programming language, a natural language must:
  • have many tokens available, enabling much terser code, while retaining clarity. East Asian ideographic languages qualify here: in fact, 80% of Unicode tokens are CJK or Korean-only characters.
  • be readable at the normal coding font. Japanese kanji and complex Chinese characters (used in Hong Kong, Taiwan, and Chinatowns) don't qualify here, leaving only Korean and simplified Chinese (used in Mainland China).
  • be easily entered via the keyboard. An IME (input method editor) allows Chinese characters to be entered easily, either as sounds or shapes. The IME for programming could be merged with an IDE auto-completor for even easier input.
And to be the most popular natural language used in programming, it must:
  • enable more tokens to be added, using only present possible components and their arrangements. Chinese characters are composed of over 500 different components (many still unused), in many possible arrangements, while Korean has only 24 components in only one possible arrangement.
  • be used by a large demographic and economic base. Mainland China has over 1.3 billion people and is consistently one of the fastest growing economies in the world.
About a year ago, I posted a comment on Daniel Sun's blog on how to write a Groovy program in Chinese. (The implementation is proof-of-concept only; a scalable one would be different.) The English version is:
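// a word-frequency report: 'content' is assumed to hold the text being analysed; words occurring more than once are printed, most frequent first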
content.tokenize().groupBy{ it }.
  collect{ ['key':it.key, 'value':it.value.size()] }.
  findAll{ it.value > 1 }.sort{ it.value }.reverse().
  each{ println "${it.key.padLeft( 12 )} : $it.value" }


The Chinese version reduces by over half the size (Chinese font required):
物.割().组{它}.集{ ['钥':它.钥, '价':它.价.夵()] }.
  都{它.价>1}.分{它.价}.向().每{打"${它.钥.左(12)}: $它.价"}


I believe this reduction is just the beginning of the tersity that using all Chinese characters in programming will bring. The syntax of present-day programming languages is designed to accommodate their ASCII vocabulary. With a Unicode vocabulary, the language grammar could be designed differently to make use of the greater vocabulary of tokens. As one example of many: if all modifiers are each represented by a single Chinese character, for 'public class' we could just write '公类' without a space between (just like in Chinese writing), instead of '公 类', making it terser.

A terse programming language and a tersely-written natural language used together means greater semantic density, more meaning in each screenful or pageful, hence it’s easier to see and understand what's happening in the program. Dynamic language advocates claim this benefit for dynamic programming over static programming: the benefit is enhanced for Chinese characters over the Latin alphabet.

If only 3000 of the simplest-written 70,000 CJK characters in Unicode are used, there are millions of unique two-Chinese-character words. Imagine the reduction in code sizes if the Chinese uniquely map them to every name (packages, classes, methods, fields, etc) in the entire Java class libraries. Just as Perl, Python, and Ruby are used because of the tersity of their grammar, so also Chinese programming will eventually become popular because of the tersity of its vocabulary.

Furthermore, in an internationalized programming language, not only could Chinese programmers mix Chinese characters with the Latin alphabet in their code, but so could Western programmers. Hackers want to write terse code, and will experiment with new languages and tools at home if they can't in their day jobs. They'll begin learning and typing Chinese characters if it reduces clutter on the screen, there's generally available Chinese translations of the names, they can enter the characters easily, and start using them incrementally. By incrementally I mean only as fast as they can learn the new vocabulary, so that some names are in one language and some in another. This is much easier if the two natural languages use different alphabets, as do English with Chinese. A good IDE plugin could transform the names in a program between two such natural languages easily enough.

Non-Chinese programmers won't have to learn Chinese speaking, listening, grammar, or writing. They can just learn to read characters and type them, at their own pace. Typing Chinese is quite different to writing it, requiring recognizing eligible characters in a popup menu. They can learn the sound of a character without the syllabic tone, or instead just learn the shape.

Having begun using simplified Chinese characters in programs, programmers will naturally progress to all the left-to-right characters in the Unicode basic multilingual plane. They'll develop libraries of shorthands, typing π instead of Math.PI. There’s a deep urge within hackers to write programs with mathlike tersity, to marvel at the power portrayed by a few lines of code. Software developers all over the world could be typing in Chinese within decades.
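That kind of shorthand already works in today's Groovy; a tiny sketch, with the names chosen purely for illustration:

def π = Math.PI                 // a one-character lexical word standing in for Math.PI
def 面积 = { r -> π * r * r }    // the area of a circle, named in Chinese
assert Math.abs(面积(1) - π) < 1e-12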


Chinese character data file available...
Recently, I analyzed the most common 20,934 Chinese characters in Unicode (the 20,923 characters in the Unicode CJK common ideograph block, plus the 12 unique characters from the CJK compatibility block), aiming to design an input method easy for foreigners to enter CJK characters.

For each character, I've recorded one or two constituent components, and a decomposition type. Only pictorial configurations are used, not semantic ones, because the decompositions are intended for foreigners when they first start to learn CJK characters, before they're familiar with meanings of characters. Where characters have typeface differences I've used the one in the Unicode spec reference listing. When there's more than one possible configuration, I've selected one based on how I think a fellow foreigner will analyse the character. I've created a few thousand characters to cater for decomposition components not themselves among my collected characters. (Although many are in the CJK extension A and B blocks, I kept those out of scope.) To represent these extra characters in the data, sometimes I've used a multi-character sequence, sometimes a user-defined glyph.

The data file is CSV-format, with 4 fields:
  • the character
  • first component
  • either second component, or -
  • type of decomposition
Here's a zip of that data file and truetype font file if anyone's interested.

Wednesday, June 04, 2008

Base-100 Arithmetic

(reposted)

In The Number Sense, Stanislas Dehaene says that in Cantonese and Mandarin, the sounds for the numbers are much shorter than in Western languages, and so native speakers of those Chinese languages can speak numbers quicker. He argues that this enables them to do mental math quicker than speakers of Western languages. In many parts of Asia including China, learning mental math is considered very important for children.

Dehaene also writes elsewhere in his book that many people who can do fast mental math not only practise the many calculation shortcuts, but also often memorize the products of 2-digit numbers. I've wondered whether people memorizing such products would be better off using a base-100 instead of a base-10 system, that is, creating a hundred digits and mapping them to the numbers from 0 to 99. After some initial memorization, it would be easy to convert back and forth between them. Even better would be if Chinese sounds were used for the base-100 digits, taking advantage of their short sounds. The Chinese group digits into groups of four, unlike English speakers' groups of three, making Chinese numbering even more suitable.

The first ten digits already exist: 0零, 1一, 2二, 3三, 4四, 5五, 6六, 7七, 8八, and 9九. There's already characters for some of the other 2-digit numbers: 10十, 20廿, 30卅, and 40卌. Perhaps also 木 for 80 (from Chinese riddles) and 半 (meaning ½) for 50. Maybe in some cases these characters for multiples of ten could be used as radicals in associated numbers, for example, digits related in some certain way to 80 could be represented by characters with the 木 radical (eg, 相枩來枳林柬朿朾朽朳朲朰東杰, etc). There's many more existing sequences that could be used in some way, like the 10 stems (甲乙丙丁戊己庚辛壬癸), the 12 branches (子丑寅卯辰巳午未申酉戌亥), or the Yi Ching characters. What is most important, though, is that the sound of each digit from 0 to 99 be different. Because there's about 400 different sounds in Mandarin Chinese, that would be possible.

The easy part for those learning such base-100 arithmetic would be memorizing every mapping between a 2-digit base-10 number and the matching base-100 digit. Children could learn that before they're 3 years old. To do any effective mental math, they would need to memorize many sums and products of pairs of base-100 digits, far more difficult. If they memorized sums by putting the higher number first, and products by putting the lower first, they wouldn't need to remember whether a sequence of four base-100 digits was a sum or product, they would only memorize the sequence itself. If the two numbers were the same, it would be the product. This gives 5050 different ways two base-100 digits can be multiplied together and 4950 ways they can be added: 10,000 combinations in total.
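A quick Groovy check of those counts (pairs ordered lower-first for products, with equal digits counted as products, and higher-first for sums):

def products = 0, sums = 0
(0..99).each { a ->
  (0..99).each { b ->
    if (a <= b) products++    // product: lower digit first, equal digits included
    else sums++               // sum: higher digit first
  }
}
assert products == 5050 && sums == 4950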

Many of those 10,000, though, could be worked out using shortcuts based on patterns. For example, to multiply two numbers, such as 93 x 98, by using the complement (on 100) of each number, 7 and 2, we can calculate the complement of their sum, 91, followed by their product, 14, giving the final result 9114. This particular example is really only useful in base-10 for numbers quite close to 100, but in base-100, it can be used for all numbers over 50. At the cost of memorizing 50 pairs of complements (1+99, 2+98, etc), we can reduce the 10,000 combinations down by 1275, to 8725.
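Here's that shortcut as a small Groovy sketch, valid for any two numbers from 50 to 99:

def complementMultiply = { int x, int y ->
  assert x in 50..99 && y in 50..99
  int c1 = 100 - x, c2 = 100 - y
  (100 - (c1 + c2)) * 100 + c1 * c2   // complement of the sum of the complements, then their product
}
assert complementMultiply(93, 98) == 93 * 98   // 9114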

There's many other shortcuts that could be utilized to reduce that number down considerably further. I suspect those shortcuts would be based on the common divisors of 100, i.e. 2, 4, 5, 10, 20, 25, and 50. For example, when adding 25 + 22, in my mind I calculate it as 25 + (25 – 3) = (2 * 25) – 3.

Of the four-character sequences that would need to be memorized, if many of them bore some pictorial or phonetic resemblance to the thousands of four-character proverbs (成语) that Chinese children already learn by rote, they'd find it much easier to memorize them. In Chinese proverbs, only the content words are recited, not the grammar words, so English proverbs in the Chinese style would be "Stitch time, save nine", "Stone roll, no moss", "Bird hand, two bush", etc. This is what would make it far easier for native Chinese speakers to do base-100 mental math than Westerners learning such arithmetic.

Here's an example of this technique, but using an English proverb instead, with associations 13=bird, 19=hand, 2=two, and 47=bush. To multiply 13 x 19, there's no shortcut, so we'd recite the associated sounds, with the lower number first for multiplication, i.e., 13 x 19 = “bird hand”. We'd automatically finish it in our heads, i.e., “two bush” = 0247. Voilà!

I don't know of any existing base-100 arithmetic in China, having never seen any websites or books on the subject. What such base-100 arithmetic needs is for a native Chinese speaker with a background in computing and linguistics to design and run the intensive computations necessary to assign the best possible mapping between 2-digit numbers and base-100 digits, so the memorizations will be easiest for native-speaking Chinese children. It would be a time-consuming input-intensive programming task with a deliverable of only 90 ordered Chinese characters. An example of the future of computing, perhaps?

Tuesday, June 03, 2008

Ejoty in Groovy

Ejoty is a word invented by magician Stewart James to describe the mental skill of easily remembering the numeric value of each letter of the English alphabet (i.e. A=1, B=2, ..., Z=26) to enable quick mental calculation of the value of words (e.g. WORD = W + O + R + D = 23 + 15 + 18 + 4 = 60). The letters in ejoty refer to the ordered multiples of 5 in the alphabet, i.e. E=5, J=10, O=15, T=20, Y=25, which, if we memorize those, will enable us to easily calculate the values of most other letters by using only 1 or 2 offsets.

Musical and aural learners could learn the values of the letters by remembering the value of the first letter in each foot of the popular children's song "abcd efg, hijk lmnop, qrs tuv, wx yz", i.e. "1 5, 8 12, 17 20, 23 25".

If we can instantly know the value of each letter, we can more easily practise adding a sequence of numbers in our heads whenever we see words written down somewhere. For example, signs we see when riding public transport:
  SYDNEY = 19 + 25 + 4 + 14 + 5 + 25 = 92
  FLINDERS = 6 + 12 + 9 + 14 + 4 + 5 + 18 + 19 = 87


Using Groovy
To more quickly ejotize words, we can learn the sums of common letter sequences off by heart. To find the most common ones, we can write a Groovy program...

After extracting the English word list english.3 within this zip file, we can run this script:

// value of a letter sequence: the sum of its letter positions (a=1 ... z=26); assumes lowercase input
def gramVal(gr){
  def tot= 0
  gr.each{ tot += (it as int) - 96 }   // 'a' has code 97, so subtract 96
  tot
}

// tally every letter sequence of length 2 or more, across all words in the list
def grams= [:]
new File("english.3").eachLine{word->
  word -= "'"
  for(int i in 2..word.size())
    if(word.size() >= i)
      for(int j in 0..word.size() - i){
        def gram= word[j..j+i-1].toLowerCase()
        if( grams[gram] != null ) grams[gram]++
        else grams[gram]= 1
      }
}

// report the sequences occurring more than 200 times, most frequent first
grams.entrySet().findAll{it.value > 200}
     .sort{it.value}.reverse().each{
  def gm= gramVal(it.key)
  println "$it.key ($gm): $it.value"
}


Only the sequences of letters occurring more than 200 times in that word list will be displayed by that version of the program. The first 20 lines of output are:

an (15): 2634
er (23): 2606
in (23): 2080
ar (19): 1977
on (29): 1780
te (25): 1750
ra (19): 1732
en (19): 1625
al (13): 1570
ro (33): 1498
ri (27): 1485
is (28): 1472
la (13): 1444
or (33): 1426
le (17): 1425
at (21): 1404
ch (11): 1327
st (39): 1303
re (23): 1269
ti (29): 1253


(The reason the commonly-occurring th doesn't appear is that the program counts sequences in the word list only, without weighting by word frequencies in normal text.)

Ejoty In Reverse
Perhaps we want to easily convert numbers to letters. We could learn the letters for the numbers up to 26 easily enough, but what if we want to convert higher numbers? We could convert them to groups of letters whose sum is the number, but we'd need to generate some common possibilities. This Groovy code uses the grams map we generated in the previous code sample to show the 5 most common sequences for each number up to 100:

// group the letter sequences by their value, then show the most common sequences for each value
def grGrams= grams.groupBy{gramVal(it.key)}
grGrams.entrySet().findAll{it.key <= 100}
       .sort{it.key}.each{
  print "$it.key (${it.value.inject(0){flo,itt-> flo+itt.value}}): "
  def set= it.value.entrySet().sort{it.value}.reverse()
  def setSz= 5
  if(set.size() >= setSz) set= set[0..setSz-1]
  println set.collect{"$it.key($it.value)" }.join(', ')
}


Here's a segment of output showing the most common letter sequences for numbers greater than 26:

27 (10243): ri(1485), lo(973), ol(955), sh(587), ve(442)
28 (11695): is(1472), th(855), si(785), mo(709), om(691)
29 (11231): on(1780), ti(1253), it(919), no(768), ell(251)
30 (7766): oo(370), ing(362), ati(245), iu(234), sk(207)
31 (7516): op(660), po(534), rm(330), vi(294), her(217)
32 (8183): sm(361), rn(276), tic(255), eri(249), mar(188)
33 (11647): ro(1498), or(1426), ul(550), hy(403), ns(355)
34 (9931): nt(950), os(793), um(504), so(424), pr(304)
35 (9482): to(1077), ot(634), un(541), ant(321), sp(282)
36 (8867): ou(788), rr(288), pt(183), ato(165), min(165)
37 (8499): rs(330), ly(267), ov(250), yl(220), pu(149)
38 (10054): tr(740), ss(458), rt(414), ion(261), qu(256)
39 (11026): st(1303), ur(745), ent(320), ru(304), per(244)
40 (9339): us(1130), su(353), tt(308), ast(225), sta(211)
41 (9141): ut(364), tu(282), ism(252), rin(191), yp(178)
42 (8756): ers(200), mon(199), olo(170), res(144), ori(132)
43 (9499): ter(534), ry(378), tin(177), nti(151), ium(138)
44 (8233): ste(219), ys(184), est(168), tio(157), sy(107)
45 (7260): ty(229), ver(182), yt(128), tte(107), aceou(104)
46 (7459): ris(147), rom(135), mor(108), orm(105), los(77)
47 (7762): sis(206), tri(180), ron(156), rit(118), ssi(107)
48 (8292): ist(236), sti(162), uri(116), tis(115), eter(114)
49 (7790): ton(202), rop(143), ont(122), pro(120), phy(120)
50 (6711): oto(117), graph(83), oun(59), tric(57), low(54)

There are plenty of choices there.

The code uses the groupBy GDK function, and the output gives a visual representation of applying it to some data.
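For anyone who hasn't met groupBy before, here's a tiny standalone illustration of what it does to a map (the data is made up):

def scores= [ant:3, bee:3, cow:3, hippo:5]
def byLength= scores.groupBy{ it.key.size() }   // bucket the entries by key length
assert byLength == [3: [ant:3, bee:3, cow:3], 5: [hippo:5]]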

Wednesday, April 09, 2008

Syntactic Rank in English and Groovy

There are many similarities between natural languages and computer languages. Let's look at one aspect of language, syntactic rank, and compare it in English and Groovy. The analysis of English here is only basic, just enough to allow a comparison with Groovy.

Rank in English
In English, structures larger than a sentence differ between writing and speech: writing has paragraphs and speech has exchanges. Sentences are strung together into a cohesive sequence by pronoun linking, transition phrases, etc. Syntax kicks in below the sentence level by defining 5 ranks: (1) sentence, (2) clause, (3) phrase, (4) word, (5) morpheme. Items at one rank are composed of items from the next rank down, with the morpheme being the lowest, atomic rank.

A sentence can be simple, consisting of only one clause, or more complex. For example, this compound-complex sentence:
The batter hit the ball hard, but, because the wind blew it his way, the out-fielder caught it easily.
consists of 3 clauses, in a tree-like manner:
(the batter hit the ball hard)   //compound constituent
but
( because
  (the wind blew it his way)   //complex constituents
  (the out-fielder caught it easily)
)

A clause consists of phrases (sometimes called groups). For example:
Surprisingly, the batter has slammed the ball out of the pitch.
has the tree structure:
surprisingly   //adverb phrase
( the batter   //noun phrase
  ( has slammed   //verb phrase
    the ball   //noun phrase
    (out of the pitch)   //prepositional phrase
  )
)

A very common structure of clause is noun phrase followed by predicate. For example:
The beekeeper became a mountain climber.
is divided into:
the beekeeper   //noun phrase at the head, called a subject
became a mountain climber
    //predicate, which can be further broken down

A phrase consists of words. For example, this noun phrase:
A big bright red truck
has structure:
a
big
(bright red) //not a word, but an example of rank-shifting
truck
The (bright red) isn't a word, but another phrase used as a word. This is called rank-shifting. Sometimes, a rank can be shifted more than one place. For example:
the pay-as-you-earn tax
has a clause shifted to the position of a word.


Phrases can nest many levels deep more easily than other ranks can. For example, a noun phrase embellished with adjectives and prepositional phrases:
the big thick book with the silky red cover on the bookshelf by the fireplace
has structure:
( the (big (thick book)) ( with ( the (silky (red cover)) ) ) )
on ( the bookshelf ( by ( the fireplace ) ) )

Finally, words consist of morphemes. Some morphemes are lexical, others are grammatical. For example, the word undiscerningly has structure:
( un
  ( discern   //only one lexical morpheme
    ing)
)
ly

The compound word beekeeper, on the other hand, has two lexical morphemes, bee and keeper. Morphemes are the atomic structure in English grammar.


Rank in Groovy
Like English, Groovy is best analyzed as having 5 ranks: (1) top-level, (2) statement, (3) expression, (4) path, (5) primary. It could be useful to match up the ranks of Groovy with those of English.

A top-level is a class, interface, or enum definition, standalone method definition or statement, or package or import statement. For example:
def mean(a, b){
  def c= a + b
  c / 2
}
It could correspond to a sentence in English. Class and method definitions consist of statements, just as sentences consist of clauses. A standalone statement can be a top-level, just as a simple sentence consists of only one clause.

A statement is of various types, e.g. if, while, try, break, expression, or block. A statement in Groovy could correspond to a clause in English.
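A quick sketch (an illustrative method of my own, not from any library) shows several of these statement types side by side, each acting like a clause:

def digitSum(n){
  if(n < 0) return 0        // if statement
  def total= 0              // expression statement
  while(true){              // while statement holding a block statement
    if(n == 0) break        // break statement
    total += n % 10
    n= n.intdiv(10)
  }
  total                     // the last expression is the return value
}
assert digitSum(123) == 6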

A common form of the expression statement is the assignment, e.g.:
def b= c + ( d * e )
This could correspond to the subject-predicate style of clause, where b is the subject and the rest is the predicate.

A block statement could correspond to the complex portion within a compound-complex statement.

An expression consists of path structures, just as a phrase consists of words.

One such structure is the closure, which itself consists of statements, entities of a higher rank. This is rank-shifting in Groovy, e.g.:
def c= {
  it= it * 2
  println it
  it * 3
}(7)
Compare this with the rank-shifted compound sentence inside the relative clause:
The mountain range, of which the tallest was there and its peak needed knocking off, loomed before them.

Expressions enable the deepest nesting, just as phrases do in English. For example:
a + ( b * ( -c - d ** e ) / (z= f1 - (f2 * f3)) % g ) +
  ( ( !h1 || h2 ) ? i : -j )
In general, the unary operators have highest precedence, then left-associative binary, then ternary, then right-associative binary.
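A few asserts illustrate that general ordering (just an illustrative sketch):

def x
assert -3 + 5 == 2              // unary minus binds tighter than binary +
assert 1 + 2 * 3 == 7           // * binds tighter than +; both are left-associative
assert (1 < 2 ? 10 : 20) == 10  // without the parens, == would bind tighter than ?:
x= 1 < 2 ? 10 : 20              // assignment is right-associative and binds loosest,
assert x == 10                  //   so it takes the whole conditional as its value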

Each path structure consists of a head followed by path elements, e.g. arguments, subscripts, closures, and names after ., ?., *., .&, or .@. For example:
callMet(d, e, 2.71, "abcdefg")[7].&fin()?.gun()
  .g2{->}.g3(1, "b"){->}.@hun.&@iun*.jun[a,b,c]

Rank-shifting is far more likely within Groovy path structures than in English words. Path structures are eventually composed of primaries.

Primaries such as identifiers, operators, literals, numbers, and strings are the atomic structures of Groovy grammar, just as morphemes are the atomic structure of English. The operators ( + * % || && ) and literals (true, null, this) are like grammatical morphemes, while identifiers are like lexical morphemes.


Summary
The matching between Groovy and English syntactic rank isn't perfect, especially for the lower ranks, but does go a fair way. Groovy copies its 5-rank structure from other programming languages, such as Java and C++. I suspect the matching 5-rank structure makes it easier for English speakers to write and read programs in these languages. Perhaps people who don't already know Java or C++ would benefit from seeing comparisons with English grammar when they learn Groovy.

Monday, March 10, 2008

JRuby and Jython: Groovy's peers

Sun Microsystems recently hired Frank Wierzbicki and Ted Leung to work on Jython full-time at Sun. Glen Smith recently blogged concerning this and his meeting with James Gosling of Sun in Australia:
“One question that was asked was about the recent hires around JRuby & Jython, is Sun sidelining Groovy? James had some nice things to say about Groovy, actually, and reiterated that really it's a numbers game for Sun. Big numbers of Ruby and Python developers to lure to the platform, fewer (but growing) numbers of Groovy developers already productive there.”

In Charles Nutter's blog entry on the same subject he writes:
“So what does this mean for Sun? Well, it means we're serious about improving support for Python on Sun platforms. Jython is a big part of the story, since they have many challenges similar to JRuby, but a bunch of new ones as well. So we'll be looking to share key libraries and subsystems as much as possible, and we'll start looking at Jython as another driver for future JVM and platform improvement.”

As suggested in a recent blog entry of mine, the JVM needs to keep up with Microsoft's DLR support of IronPython and IronRuby. By adding the Jython experts to the JRuby ones already on the payroll, Sun is better able to build a solid MLVM to compete. Groovy must consider whether to also program to such a MLVM, about which Alex Tkachman says: “I think it is a serious effort for Groovy but i am for 100% sure it is doable. There is nothing I am aware about that prevent that in theory.”

Now Charles Nutter has blogged about Duby, JRuby's static-mode answer to what some Groovy developers would like for Groovy but others don't.

If Sun builds a MLVM with a tree-like DLR-style interface, and re-engineers JRuby and Jython for it, then Groovy must follow suit to stay relevant. But Groovy doesn't need to build a static Doovy mode because Java is Groovy's Duby.

Wednesday, January 30, 2008

The Groovy Releases

Today, Groovy 1.5.4 was released, the 30th official release of a Groovy version. For this month's entry tracking the Groovy Programming Language here at Gavin Grover's GROOVY Blog, let's look at the official releases of Groovy.

The Early Betas

The very first release was of Groovy 1.0 beta 1 on 11 December 2003, coming in at 4.9Mb. According to the release notes by James Strachan, it had compilation straight to Java bytecode, groovyc Ant task and command line script, an early Swing-based console, and a simple command line terminal. It had full support for properties and fields, native syntax for maps lists and regexes, autoboxing, Ruby 2.0-style closures, both static and dynamic typing, operator overloading, text templating, GPath, GroovyMarkup (supporting DOM, SAX, Swing, and Ant), and Groovlets. A few weeks later, beginning a tradition of releasing just before Christmas, beta 2 also featured full support for subscripts on lists, maps, and strings.

A month later, beta 3 brought both backwards and forwards inclusive and exclusive ranges in collection, string, and map subscripting, more core Java polymorphism, the break statement, ternary expressions, and the Groovy GDK additions to the JDK. Guillaume Laforge had joined the project at that stage. As with every release of Groovy, there were many bug fixes. The most notable quote in the release notice: "Whilst the language syntax is not quite frozen for the final 1.0 release its getting very close (we hope the next release to freeze the syntax for backwards compatibility) and the projects codebase is getting stable and solid now." This was on 23 January 2004, a mere 6 weeks after the first release. A week later, the Groovy logo was chosen. Salute that green star!

Six weeks later, beta 4, weighing in at 5.2Mb, brought class imports with wildcards, default method and function parameters, I/O and process GDK methods, BigInteger and BigDecimal integration, closure currying, and the templating engine. A few days later, James, Richard Monson-Haefel, and Geir Magnusson Jr submitted Groovy for standardization to the Java Community Process as JSR-241. The project was promptly voted in, the expert group was formed on 30 March 2004, and there's been no progression during the 4 years since. The other major Java-syntax-compatible language, BeanShell, took the same path a year later as JSR-274.

The next few betas were released every two months or so, but there was no syntax freeze and final release. Beta 5 hit the web on 12 May 2004, a week after Guillaume became a Groovy despot, with parser and bytecode fixes, along with an expanding GDK. Beta 6 on 15 July 2004 had a new type inference engine for optimizing calls. Beta 7 on 29 September 2004 clocked in at 10Mb. This was, in fact, the first version of Groovy I myself used.

A New Parser

Guillaume writes that around this time Groovy development almost stalled. The developers met at DevCon1 (then called GroovyOne) in London to restart the project. This was when Russel Winder, Jochen Theodorou, and Dierk Koenig joined James, Guillaume, and Jeremy Rayner on the development team. On 17 December 2004, continuing the Christmas release tradition, with Guillaume taking over the release announcements, beta 8 arrived, all 11Mb of it, bringing lighter error messages and a better GroovyShell experience. A month later, beta 9 brought JDK 1.5 compliance and an improved GroovyServlet.

At this time, the lexer/parser was overhauled, the new "JSR" one being included in Beta 10 on 28 Feb 2005 as an "Early Access Preview", in addition to the old "Classic" one. In this announcement, for the first time, Guillaume widens the announcer from just himself to "the Groovy development team and the JSR-241 Expert Group". Six weeks later, on 5 April 2005, the next beta of Groovy used the new "JSR Parser" by default, including the old "Classic Parser" for backwards compatibility only, with fairly extensive changes required to the source.

Instead of "beta 11", the new beta was called "Groovy 1.0 JSR 1" to tag it as the syntax intended for standardization. Guillaume wrote in the release notes: "We'll have a few jsr-x versions till this summer, two or three more before the final 1.0 release." In the "JSR 2" beta version on 15 June 2005, there were a record 1000 test cases, the error-reporting was much improved, and interfaces could be written in Groovy for the first time. Two months later on 16 August 2005, the "JSR 3" beta version was released. The summer ended with no final release.

The Baton Passes

Three months later, the second Groovy developers conference (DevCon2) occurred in Paris, with the 15Mb "JSR 4" beta released the week before on 21 November 2005. The release notes mentioned improved compilation and class loading, enhanced inner class imports, improved startup scripts, an upgrade to ASM 2.1, synchronized blocks, and improved namespace support for XML and builders, including quoted method names. Guillaume wrote: "The two main aspects remaining for Groovy to reach its 1.0 final milestone is to clarify the name resolution and scoping rules. Those two concerns will hopefully be addressed during the Groovy JSR meeting, and we're going to implement these rules as quickly and as thouroughly as possible. Keep in mind that those rules might be a little different than our current rules. However, we hope these rules will be more coherent and closer to what we're used to in Java."

At that DevCon 2 meeting, there was disagreement over whether or not Groovy's closures and builders should be distinct syntactic entities. Shortly after the meeting, Groovy's founder James Strachan moved on to other projects, having passed the leadership baton onto Guillaume. That year, 2005, has been the only year Groovy missed a Christmas release, the "JSR 5" beta not being released until 13 February 2006. That release brought multi-dimensional array support, calling method names defined in strings, semantic changes to "def" and binding variables, and improvements to the scoping algorithms.

Said Guillaume: "This is the last release of the JSR-xx line. The next release will be the first RC-x release before the final 1.0. We're planning to release RC-1 in about two months, and the final Groovy 1.0 release should be out in about three months. As you might know, and as decided during the last Groovy conference in Paris, the two main tasks towards the final version of the projects are the rework of the scoping algorithms, and the rewrite and enhancements of the Meta-Object Protocol. JSR-05 contains these new scoping algorithms. [...] The next step before RC-1 is the work on the MOP and name resolution algorithms." Unfortunately, the developers underestimated the time it would take to complete this work on the MOP (meta-object protocol) and name resolution. RC-1 was released 10 months later.

Groovy weathered another unfortunate occurrence at this time. Since July 2005, Guillaume and Graeme Rocher had been working on the Groovy on Rails project, some web infrastructure to duplicate that of Ruby on Rails. When they released version 0.1 on 30 March 2006, they changed its name to Grails, dropping "Groovy" from the name. The Ruby on Rails lead developer, David Heinemeier Hansson, had emailed saying he considered the "Rails" name to be exclusive to the Ruby on Rails project. I suspect the real concern of Hansson's email was just as much to limit the "Groovy" name as to protect the "Rails" name, and he succeeded in both aims. Perhaps at that time Guillaume and Graeme didn't fully understand the value of the "Groovy" brand in the way Jonathan Schwartz understands the value of the "Java" brand. I'm sure they've since wised up.

Countdown to 1.0 Final

Although Groovy had clocked up 15 official releases, each better than the previous, the lead developers seemed concerned about the negative effect on marketing of not having one tagged "1.0 final". However, they didn't want to release 1.0 final without the MOP finished, so they released another beta, "JSR 6", four months later on 28 June 2006, with syntax changes for properties, class loader improvements including class initializers, and mocking for unit testing.

Then there were no official releases for another 5 months. I still remember that long wait, when I wrote so much Groovy code experimenting with interceptors in JSR-6 and sometimes wondering if version 1.0 would really ever arrive. But arrive it did on 4 December 2006. RC-1, the first release candidate, brought class coercion using the asType(Class) method, the 'in' operator, coercing Maps and closures to interfaces, and increased dynamicity using "$methodName"(*args). John Wilson, Paul King, and Guillaume Alléon had joined the development team by this stage.

On 23 December 2006 came RC-2, the only one of Groovy's 30 releases to not include the source code, just in time for Christmas. I downloaded it when it arrived, and ran my test scripts through it. Almost everything mildly complex using interceptors failed, even though they'd run OK through RC-1 and JSR-6. Without source code, I couldn't see what had changed in 3 weeks. When I tried to look at the source online, I couldn't: an earthquake had damaged the internet cables servicing where I live, and they weren't fixed for 6 to 8 weeks. Only my email account was accessible.

A week later on 2 January 2007, Groovy 1.0 final was released. On 29 January 2007 were the release parties around the world and the 3rd DevCon meeting, in Paris. By the time I downloaded 1.0 final sometime in late February, my MOP-dependent test scripts still didn't work, and the Groovy developers were talking online about a new MOP in Groovy 2.0, to be released sometime after Groovy 1.1. I think I might have moved on from Groovy at that stage if it weren't for Groovy's groovy name. But I put some time into learning and documenting for Java newbies the core JDK and GDK methods, coding that didn't require the MOP. In either 1.1 beta 1 or beta 2, my MOP-dependent scripts were working again, but by then, my coding interests had moved on.

Targeting Java 5.0

On 30 April 2007, a week before JavaOne, Groovy 1.1 beta 1 was released, with annotation use and static imports from Java 5.0, an ExpandoMetaClass, and many other smaller improvements. On 5 July 2007, beta 2 added generics from Java 5.0 to Groovy, joint Groovy/Java compilation, a ConfigSlurper, the classical C++/Java-style for loop, and named parameters without parentheses, as well as general performance improvements. And on 20 September 2007, beta 3 brought enums from Java 5.0, coercion of maps and closures to concrete classes, the Elvis operator, closure/builder name resolution enhancements, and GroovyShell improvements.

Groovy development during this time seems to have been the most productive ever, with estimates being roughly accurate. Of course, without the tough challenge of the MOP, the programming tasks were a lot more doable, Jochen had been working full-time on the Groovy codebase since December 2006, and the developers had obviously learnt from mistakes in previous years. Soon after the Groovy DevCon 3 meeting in January 2007, John Wilson suddenly left the Groovy development team. However, he's since been working on "Ng", a MOP for a Groovy-like language, which may prove to be of benefit to the development of Groovy 2.0.

On 10 October 2007, Alex Tkachman, former COO of JetBrains, announced G2One, Inc, a company with funding from Bay Partners, a VC firm from California. G2One is a Groovy & Grails consulting and training firm, founded by Alex, Guillaume, and Graeme. Jochen also promptly joined them. A few days later, on 12 October 2007, Groovy 1.1 RC 1 was released, bringing improved performance and various extra features, such as string to class coercion and overloadable unary operators. Guillaume referred to the announcers as "The Groovy development team and the G2One company". A few days later, on 15-16 October 2007 in London, was the Groovy DevCon 4 meeting. Unlike previous meetings, this one was quite closed, no doubt because of the new commercial realities of running a consulting company for venture capitalists who expect profits and capital gains. Immediately afterwards, on 17-19 October 2007, the first ever Grails eXchange was held, having been postponed from May.

On 2 November 2007, RC 2 was released, followed by Groovy 1.1 RC 3 on 28 November 2007, both performance improvement and bug fix releases. Immediately afterwards was a discussion on the mailing lists resulting in Groovy 1.1 being renamed version 1.5 for marketing reasons.

Present and Future

On 7 December 2007, Groovy 1.5.0 was released, only 11 months after version 1.0, probably the best project-managed calendar year in Groovy's history. Two bug fix and performance improvement releases, 1.5.1 on 21 December 2007 (the Christmas release) and 1.5.2 on 29 January 2008, have since followed, the latter quickly followed by the bug-fixing 1.5.3 on 31 January 2008 and 1.5.4 on 1 February 2008. In the latest release notes, Guillaume attributes the recent success in bug fixes and performance increases to G2One's creation. No doubt full-time employees passionate about Groovy help development a lot, but funded companies must make money, and clocking up chargeable hours with Grails consulting is far more lucrative than pro-bono Groovy development, so there's always tension from executives and investors between investing time and cashing in.

The current roadmap defines a very tentative structure for future Groovy development, focusing on smaller sets of features in each release. 1.6 is slated to bring annotation definitions and multiple assignment, 1.7 to bring incremental compilation, upgrading to ASM 3.0, and AST transformations, 1.8 to bring nested and anonymous classes, 1.9 to bring Antlr 3.0 upgrading and concurrency features, and finally 2.0 to bring a new MOP, with homogenized features. Focusing on a small set of features at a time worked during 2007, so that's the best way to go. But the new MOP in version 2.0 threatens to bring back 2006-style development. 2007 also saw the timing of Groovy releases revolve around marketing events such as JavaOne and Grails eXchange, which could also impact release quality.

So how will the future releases of Groovy play out?