Friday, June 20, 2008

Word Classes in English and Groovy

When I was in primary school, I learnt that English had 8 parts of speech: nouns, verbs, adjectives, adverbs, pronouns, conjunctions, prepositions, and articles. Nowadays linguists call them word classes. Since working in TESOL, I've learnt that words in English are better classified as falling somewhere along a continuum, with conjunctions, the most grammatical words, at one end, and proper nouns, the most lexical, at the other.

We'll take a quick look at these word classes in English grammar, then look at the similar concept in the Groovy Language. (Note: The English grammar is very simple, and based on what I remember from personal reading, not academic study, so I don't guarantee total correctness).


Word Classes in English

Conjunctions
The most grammatical words in English are and, or, and not, the same operators as in propositional logic. and and or can be used to join together most words anywhere further along the continuum. Most obvious are the lexical words, e.g:
  the book, the pen, and the pad (nouns)
  black and blue (adjectives)
  slowly and carefully (adverbs)
  to stop, listen, and sing (verbs)
  to put up or shut up (phrasal verbs)

Also, multiword lexical forms, such as phrases and clauses, can be similarly joined:
  the house, dark blue and three storeys high, ... (adjectival phrases)
  The batter hit the ball and the fielder caught it. (clauses)

But more grammatical words can also be joined, provided they sit at the same position on the continuum:
  your performance is over and above expectations (prepositions)
  I could and should go (auxiliary verbs)
  They were and are studying (different type of auxiliary verbs)
  this and that (pronouns)

Incidentally, the continuum can have more than one type of multiword form at the same position, such as adverbials and prepositional phrases:
  They walked, very silently and with great care, ...

and and or are 2 of only 7 conjunctions in English, memorized by the acronym FANBOYS: for, and, nor, but, or, yet, and so. But and and or are more grammatical than the other five conjunctions, and can be used to join the others together, e.g:
  It was difficult, yet and so I tried.

The propositional logic operators are the most grammatical words in English.


Proforms
Next along the continuum are proforms, words that take the place of other more lexical words. In English, the most common type of proform is the pronoun, e.g. he, she, this, which also have determiner forms, e.g. his, its. For example:
  The dog chased the cat, but lost it. (pronoun: it)
  The dog escaped from the goat, but lost its collar. (determiner: its)

Other word classes and multiword forms have proforms. For example, pro-verb do/did:
  I enjoyed the film, and so did the ushers.
Gap for pro-verb:
  We found the south exit, and the other team, the north exit.
Pro-adjective such:
  We experienced a humid day, and also such a night.
Pro-adverb thus:
  Swiftly the Italians played; thus also did the Brazilians.
Proform for multiword adverbial so:
  The programmers finished totally on time; so did the testers.


Particles
Next are a large number of miscellaneous words between grammatical and lexical, which some call particles. Examples are interjections, articles (a/an/the), phrasal verb particles, conjunctive adverbs, sentence connectors, verb auxiliaries, not, only, the infinitive marker to, etc.

English, and I guess every natural language, is really a mess, and the particles are a way of categorizing the messy stuff.


Prepositions and Verbs
The first lexical word class along the continuum is the prepositions. In Hallidayan Functional Grammar, they're considered to be reduced verbs. Some examples: under, over, through, in. There are also multiword prepositional groups, e.g: up to, out of, with respect to, in lieu of.

Further along the continuum are the verbs, e.g. listen, write, walk. Verbs can be multiword, such as phrasal verbs, e.g. put up, shut up, prepositional phrasal verbs, e.g. get on with, put up with, and verb groups, e.g: will be speaking, has walked, to have gotten on with.


Adjectives and Nouns
Next along the continuum are adjectives, e.g. black, blacker, blackest. In Chomskian Transformational Grammar, adverbs ending in -ly are considered to be the same as adjectives, only modified at the surface level, e.g. slowly, slower, slowest.

Adjectives/adverbs can be multiword, e.g:
  The building is three storeys high. (adjectival phrase)
  That cat walks incredibly slowly. (adverb word group)

Next are common nouns, both count nouns, e.g. pen, pens, and mass nouns, e.g. coffee, hope. Nouns can be built into noun phrases, e.g. the long dark blue pen.

Just as verbs and prepositions are related, so are nouns and adjectives. Abstract ideas often only differ grammatically, e.g:
  Jack is very hopeful.
  Jack has much hope.
  Jack has many hopes.

At the lexical end of the grammar-lexis continuum are proper nouns. These can be phrases we construct from other words (e.g. the Speaker's Tavern), foreign words (e.g. pyjamas, fooyung), or even invented words (e.g. Kodak, Pepsi).

The largest word class in English is the nouns, followed by the adjectives, then the verbs. When new words enter English, they're usually nouns. Some will become adjectives and maybe verbs, but very few ever move further along the continuum towards the grammar end. Although English has many Norman words from 800 or 900 years ago, very few are prepositions, and all the other more grammatical words came from Anglo-Saxon.

Perhaps all natural languages have a word class continuum with propositional logic words at one end, and definable nouns at the other.



Word Classes in Groovy
Groovy uses both symbols and alphanumeric keywords for its grammar, both lexed and parsed. Groovy builds on Java, and hence C++ and C, for its tokens.

Bracketing and Separators
Perhaps the most grammatical along the continuum are the various bracketing symbols. Some have different tokens for opening and closing, e.g:
  /* */ ( ) [ ] { } < >
while others use the same token for both, e.g:
  """ ''' " ' /
There's no corresponding word class in English because English uses prosody (tone, stress, pause, etc) rather than words for the bracketing function.
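
For instance, here's a minimal sketch with each kind of bracketing in use (the names are only for illustration):

/* a block comment uses asymmetric opening and closing tokens */
def odds = [1, 3, 5]                  // [ ] brackets a list literal
def square = { n -> n * n }           // { } brackets a closure
List<Integer> pair = [2, 4]           // < > brackets a generic type parameter
def s1 = "double-quoted"              // " closes what it opens
def s2 = '''triple-quoted,
spanning two lines'''                 // ''' likewise
def pattern = /gro+vy/                // / delimits a slashy string
println square(odds.size())           // ( ) brackets the argument list, prints 9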

Next along Groovy's continuum could be separators, e.g:
  , ; : ->
We can use , and ; for lists of elements, similar to and and or in English.
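
For instance (a tiny sketch; the names are arbitrary):

def scores = [maths: 7, art: 9]                  // , separates entries, : separates key from value
def doubled = scores.collect { k, v -> v * 2 }   // -> separates closure parameters from body
def a = 1; def b = 2                             // ; separates statements on one line
println doubled                                  // [14, 18]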

Groovy has a very limited repertoire of pronouns, only this and it.
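
For example (a minimal sketch; the class and its names are made up):

class Counter {
  int total = 0
  def addAll(nums) {
    nums.each { total += it }     // it is the implicit parameter of the closure
    this                          // this refers to the enclosing Counter instance
  }
}
println new Counter().addAll([1, 2, 3]).total   // 6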


Verbs and Prepositions
Perhaps operators are like English prepositions, e.g:
  == != > >= < <= <=>
  .. ..< ?: ? : . .@ ?. *. .& ++ -- + - * / % **
  & | ^ ! ~ << >> >>> && || =~ ==~

while some operators are almost like verbs, e.g:
  = += -= *= /= %= **= &= |= ^= <<= >>= >>>=
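
To make the comparison concrete, here's a small sketch using a few of each (the values are arbitrary):

def range = 1..5                  // .. builds an inclusive range, preposition-like
def name = null
println(name ?: 'anonymous')      // ?: supplies a default when the left side is null
println(name?.length())           // ?. avoids a NullPointerException, prints null
println(3 <=> 7)                  // <=> compares two values, prints -1
def total = 10
total += range.sum()              // += acts on its subject like a verb
println(total)                    // 25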

Some operators are represented by keywords in Groovy, viz. prepositions, an adjective, and a multiword noun-preposition, i.e:
  in as new instanceof
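
For example, a quick sketch using all four (the variable names are arbitrary):

def words = new ArrayList(['to', 'be'])   // new builds an instance
assert 'be' in words                      // in tests membership
assert words instanceof List              // instanceof tests type
assert ('42' as int) * 2 == 84            // as coerces between types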

Verbs in indicative form are used in definitions, e.g:
  throws extends implements
  ... .*

The most common verb form is the imperative, e.g:
  switch, do, try, catch, assert, return, throw
  break, continue, import, def, goto

though sometimes English adverbs are used as commands in Groovy, e.g:
  if, else, while, for, finally
Also used for this are nouns, e.g:
  case, default
and symbols, e.g:
  \ // #! $


Nouns and Adjectives
Groovy uses English adjectives for adjectival functions, e.g:
  public, protected, private, abstract, final, static
  transient, volatile, strictfp, synchronized, native, const


Groovy has many built-in common nouns, e.g:
  class, interface, enum, package
  super, true, false, null
  25.49f, \u004F, 0x7E, 123e7

Some of them can also be used like adjectives, e.g:
  boolean, char, byte, short, int, long, float, double, void
are nouns (types) that can precede other nouns (variables), like Toy in Toy Story.

We can define our own Groovy proper nouns using letters, digits, underscore, and dollar sign, e.g:
  MY_NAME, closure$17, αβγδε

Using @, we can also define our own Groovy adjectives.
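
For instance, a minimal sketch of a home-made adjective (the annotation name Reviewed is invented for illustration):

import java.lang.annotation.*

@Retention(RetentionPolicy.RUNTIME)
@interface Reviewed {                 // define our own adjective with @interface
  String by() default 'anonymous'
}

@Reviewed(by = 'alice')               // the adjective modifying a class
class Report { }

println Report.getAnnotation(Reviewed).by()    // alice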


Because Groovy is syntactically derived from Java, and hence from C++ and C, it, like English, is a little messy in its choice of tokens.

Notice also the different emphasis of word classes between English and Groovy, e.g:
  • Groovy uses tokens for bracketing while English uses non-token cues
  • English uses far more proforms than Groovy, whose lack of them forces us to use temporary variables a lot
  • English uses Huffman coding by shortening common words like prepositions, while Groovy retains instanceof and implements



Conclusion: The Unicode Future
Unicode divides its tokens into different categories: letters (L), marks (M), separators (Z), symbols (S), numbers (N), punctuation (P), and other (C). Within each are various sub-categories. I'm looking at how best to use all Unicode characters (not just CJK ones) when extending a Java-like language such as Groovy with more tokens. The more tokens a language has, the terser its code can be while retaining clarity. Unicode is now a standard, so perhaps programmers will be more motivated to learn its tokens than they were when APL was released. And modern IMEs enable all tokens to be entered easily, for example, my ideas for the Latin-1 characters. Such a Unicode-based grammar must be backwards-compatible, Huffman coded, and easy to enter from the keyboard.
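
As a small illustration (a sketch only), Groovy can already query a character's Unicode category via the standard Character class:

// Character.getType returns the Unicode general category of a character
['A', 'π', '3', '+', ',', ' '].each { ch ->
  println "$ch -> ${Character.getType(ch as char)}"
}
// e.g. UPPERCASE_LETTER is 1, DECIMAL_DIGIT_NUMBER is 9, MATH_SYMBOL is 25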

Friday, June 13, 2008

The Future of Programming: Chinese Characters

Last year, I wrote some wordy blog entries on this subject, see Programming in Unicode, part 1 and part 2. Here's a brief rehash of them...


Deep down in their hearts, programmers want to read and write terse code. Hence, along came Perl, Python, Ruby, and for the JVM, Groovy. Lisp macros are enjoying a renaissance. But programmers still want code to be clear, so many reject regexes and the J language as being so terse they're unreadable. All these languages rely on ASCII. The tersity of these languages comes from maximizing the use of grammar, the different ways tokens can be combined. The same 100 or so tokens are used.

Many people can type those 100 tokens faster than they can write them, but can write thousands more they can't type. If there were more tokens available for programming, we could make our code terser by using a greater vocabulary. The greater the range of tokens a language has, the terser its code can be written. APL tried it many years ago but didn't become popular, perhaps because programmers didn't want to learn the unfamiliar tokens and how to enter them. But Unicode has since arrived and is used almost everywhere, including on Java, Windows, and Linux, so programmers already know some of it.

With over 100,000 tokens, Unicode consists of alphabetic letters, CJK (unified Chinese, Japanese, and Korean) characters, digits, symbols, punctuation, and other stuff. Many programming languages already allow all Unicode characters in string contents and comments, which programmers in non-Latin-alphabet countries (e.g. Greece, Russia, China, Japan) often use. Very few programming languages allow Unicode symbols and punctuation in the code: perhaps language designers don't want to allow anything resembling C++ operator overloading.

But many programming languages do allow Unicode alphabetic letters and CJK characters in names. Because agreed meanings for combinations of these already exist, derived from their respective natural languages, programmers can increase tersity while keeping readability. However, this facility isn't used very much, maybe because the keywords and names in supplied libraries (such as class and method names in java.lang) are only available in English.

I suspect programmers from places not using the Latin alphabet would use their own alphabets in user-defined names in a certain programming language if it was fully internationalized, i.e., if they could:
  • configure the natural language for the pre-supplied keywords and names (e.g. in java.lang)
  • easily refactor their code between natural languages (ChinesePython doesn't do this)
  • use a mixture of English and their own language in code so they could take up new names incrementally
Most programming languages enable internationalized software and webpages, but the languages themselves and their libraries are not internationalized. Although internationalized programming languages are now rare, most will one day be internationalized, following the trend of the software they're used to write. The only question is how long this will take.

However, I suspect most natural languages wouldn't actually be used with internationalized programming as there's no real reason to. Programmers in non-English countries can read English and use programming libraries easily, especially with IDE auto-completors. Writing foreigner-readable programs in English will be more important.

To become popular in a programming language, a natural language must:
  • have many tokens available, enabling much terser code, while retaining clarity. East Asian ideographic languages qualify here: in fact, 80% of Unicode tokens are CJK or Korean-only characters.
  • be readable at the normal coding font. Japanese kanji and complex Chinese characters (used in Hong Kong, Taiwan, and Chinatowns) don't qualify here, leaving only Korean and simplified Chinese (used in Mainland China).
  • be easily entered via the keyboard. An IME (input method editor) allows Chinese characters to be entered easily, either as sounds or shapes. The IME for programming could be merged with an IDE auto-completor for even easier input.
And to be the most popular natural language used in programming, it must:
  • enable more tokens to be added, using only present possible components and their arrangements. Chinese characters are composed of over 500 different components (many still unused), in many possible arrangements, while Korean has only 24 components in only one possible arrangement.
  • be used by a large demographic and economic base. Mainland China has over 1.3 billion people and is consistently one of the fastest growing economies in the world.
About a year ago, I posted a comment on Daniel Sun's blog on how to write a Groovy program in Chinese. (The implementation is proof-of-concept only; a scalable one would be different.) The English version is:
content.tokenize().groupBy{ it }.
  collect{ ['key':it.key, 'value':it.value.size()] }.
  findAll{ it.value > 1 }.sort{ it.value }.reverse().
  each{ println "${it.key.padLeft( 12 )} : $it.value" }


The Chinese version is less than half the size (Chinese font required):
物.割().组{它}.集{ ['钥':它.钥, '价':它.价.夵()] }.
  都{它.价>1}.分{它.价}.向().每{打"${它.钥.左(12)}: $它.价"}


I believe this reduction is just the beginning of the tersity that using all Chinese characters in programming will bring. The syntax of present-day programming languages is designed to accommodate their ASCII vocabulary. With a Unicode vocabulary, the language grammar could be designed differently to make use of the greater vocabulary of tokens. As one example of many: if all modifiers are each represented by a single Chinese character, for 'public class' we could just write '公类' without a space between (just like in Chinese writing), instead of '公 类', making it terser.

A terse programming language and a tersely-written natural language used together means greater semantic density, more meaning in each screenful or pageful, hence it’s easier to see and understand what's happening in the program. Dynamic language advocates claim this benefit for dynamic programming over static programming: the benefit is enhanced for Chinese characters over the Latin alphabet.

If only 3000 of the simplest-written 70,000 CJK characters in Unicode are used, there are millions of unique two-Chinese-character words. Imagine the reduction in code sizes if the Chinese uniquely map them to every name (packages, classes, methods, fields, etc) in the entire Java class libraries. Just as Perl, Python, and Ruby are used because of the tersity of their grammar, so also Chinese programming will eventually become popular because of the tersity of its vocabulary.

Furthermore, in an internationalized programming language, not only could Chinese programmers mix Chinese characters with the Latin alphabet in their code, but so could Western programmers. Hackers want to write terse code, and will experiment with new languages and tools at home if they can't in their day jobs. They'll begin learning and typing Chinese characters if it reduces clutter on the screen, there are generally available Chinese translations of the names, they can enter the characters easily, and they can start using them incrementally. By incrementally I mean only as fast as they can learn the new vocabulary, so that some names are in one language and some in another. This is much easier if the two natural languages use different alphabets, as English and Chinese do. A good IDE plugin could transform the names in a program between two such natural languages easily enough.

Non-Chinese programmers won't have to learn Chinese speaking, listening, grammar, or writing. They can just learn to read characters and type them, at their own pace. Typing Chinese is quite different to writing it, requiring recognizing eligible characters in a popup menu. They can learn the sound of a character without the syllabic tone, or instead just learn the shape.

Having begun using simplified Chinese characters in programs, programmers will naturally progress to all the left-to-right characters in the Unicode basic multilingual plane. They'll develop libraries of shorthands, typing π instead of Math.PI. There’s a deep urge within hackers to write programs with mathlike tersity, to marvel at the power portrayed by a few lines of code. Software developers all over the world could be typing in Chinese within decades.


Chinese character data file available...
Recently, I analyzed the most common 20,934 Chinese characters in Unicode (the 20,923 characters in the Unicode CJK common ideograph block, plus the 12 unique characters from the CJK compatibility block), aiming to design an input method easy for foreigners to enter CJK characters.

For each character, I've recorded one or two constituent components, and a decomposition type. Only pictorial configurations are used, not semantic ones, because the decompositions are intended for foreigners when they first start to learn CJK characters, before they're familiar with meanings of characters. Where characters have typeface differences I've used the one in the Unicode spec reference listing. When there's more than one possible configuration, I've selected one based on how I think a fellow foreigner will analyse the character. I've created a few thousand characters to cater for decomposition components not themselves among my collected characters. (Although many are in the CJK extension A and B blocks, I kept those out of scope.) To represent these extra characters in the data, sometimes I've used a multi-character sequence, sometimes a user-defined glyph.

The data file is CSV-format, with 4 fields:
  • the character
  • first component
  • either second component, or -
  • type of decomposition
Here's a zip of that data file and truetype font file if anyone's interested.
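
If anyone wants to load the data from Groovy, a minimal sketch along these lines should work (the file name decompositions.csv is assumed here; use whatever name is inside the zip):

def decomp = [:]
new File('decompositions.csv').eachLine('UTF-8') { line ->
  def f = line.split(',')     // character, first component, second component or -, decomposition type
  decomp[f[0]] = [first: f[1], second: f[2], type: f[3]]
}
println decomp.size()         // number of characters loaded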

Wednesday, June 04, 2008

Base-100 Arithmetic

(reposted)

In The Number Sense, Stanislas Dehaene says that in Cantonese and Mandarin, the sounds for the numbers are much shorter than in Western languages, and so native speakers of those Chinese languages can speak numbers quicker. He argues that this enables them to do mental math quicker than speakers of Western languages. In many parts of Asia including China, learning mental math is considered very important for children.

Dehaene also writes elsewhere in his book that many people who can do fast mental math not only practise the many calculation shortcuts, but also often memorize the products of 2-digit numbers. I've wondered whether people memorizing such products would be better off using a base-100 instead of a base-10 system, that is, creating a hundred digits and mapping them to the numbers from 0 to 99. After some initial memorization, it would be easy to convert back and forth between them. Even better would be if Chinese sounds were used for the base-100 digits, taking advantage of the short sounds. The Chinese group digits into groups of four, unlike English speakers' groups of three, making Chinese numbering even more suitable.

The first ten digits already exist: 0零, 1一, 2二, 3三, 4四, 5五, 6六, 7七, 8八, and 9九. There are already characters for some of the other 2-digit numbers: 10十, 20廿, 30卅, and 40卌. Perhaps also 木 for 80 (from Chinese riddles) and 半 (meaning ½) for 50. Maybe in some cases these characters for multiples of ten could be used as radicals in associated numbers, for example, digits related in some way to 80 could be represented by characters with the 木 radical (e.g. 相枩來枳林柬朿朾朽朳朲朰東杰, etc). There are many more existing sequences that could be used in some way, like the 10 stems (甲乙丙丁戊己庚辛壬癸), the 12 branches (子丑寅卯辰巳午未申酉戌亥), or the I Ching characters. What is most important, though, is that the sound of each digit from 0 to 99 be different. Because there are about 400 different sounds in Mandarin Chinese, that would be possible.

The easy part for those learning such base-100 arithmetic would be memorizing every mapping between a 2-digit base-10 number and the matching base-100 digit. Children could learn that before they're 3 years old. To do any effective mental math, they would need to memorize many sums and products of pairs of base-100 digits, which is far more difficult. If they memorized sums by putting the higher number first, and products by putting the lower first, they wouldn't need to remember whether a sequence of four base-100 digits was a sum or product, they would only memorize the sequence itself. If the two numbers were the same, it would be the product. This gives 5050 different ways two base-100 digits can be multiplied together and 4950 ways they can be added: 10,000 combinations in total.
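
Those counts are easy to double-check with a few lines of Groovy:

def products = 0, sums = 0
(0..99).each { a ->
  (0..99).each { b ->
    if (a <= b) products++    // product sequences put the lower number first (equal pairs allowed)
    else        sums++        // sum sequences put the higher number first
  }
}
println "$products + $sums = ${products + sums}"   // 5050 + 4950 = 10000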

Many of those 10,000, though, could be worked out using shortcuts based on patterns. For example, to multiply two numbers, such as 93 x 98, by using the complement (on 100) of each number, 7 and 2, we can calculate the complement of their sum, 91, followed by their product, 14, giving the final result 9114. This particular example is really only useful in base-10 for numbers quite close to 100, but in base-100, it can be used for all numbers over 50. At the cost of memorizing 50 pairs of complements (1+99, 2+98, etc), we can reduce the 10,000 combinations down by 1275, to 8725.
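
A few lines of Groovy confirm the shortcut works for every pair of numbers over 50 (the variable names are just for illustration):

// identity behind the trick: a * b == (100 - (aComp + bComp)) * 100 + aComp * bComp
for (a in 51..99) {
  for (b in 51..99) {
    def aComp = 100 - a, bComp = 100 - b
    assert a * b == (100 - (aComp + bComp)) * 100 + aComp * bComp
  }
}
println "93 x 98 = ${(100 - (7 + 2)) * 100 + 7 * 2}"   // 9114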

There are many other shortcuts that could be used to reduce that number considerably further. I suspect those shortcuts would be based on the common divisors of 100, i.e. 2, 4, 5, 10, 20, 25, and 50. For example, when adding 25 + 22, in my mind I calculate it as 25 + (25 - 3) = (2 * 25) - 3.

Of the four-character sequences that would need to be memorized, if many of them bore some pictorial or phonetic resemblance to the thousands of four-character proverbs (成语) that Chinese children already learn by rote, they'd find it much easier to memorize them. In Chinese proverbs, only the content words are recited, not the grammar words, so English proverbs in the Chinese style would be "Stitch time, save nine", "Stone roll, no moss", "Bird hand, two bush", etc. This is what would make it far easier for native Chinese speakers to do base-100 mental math than Westerners learning such arithmetic.

Here's an example of this technique, but using an English proverb instead, with associations 13=bird, 19=hand, 2=two, and 47=bush. To multiply 13 x 19, there's no shortcut, so we'd recite the associated sounds, with the lower number first for multiplication, i.e., 13 x 19 = “bird hand”. We'd automatically finish it in our heads, i.e., “two bush” = 0247. Voilà!

I don't know of any existing base-100 arithmetic in China, having never seen any websites or books on the subject. What such base-100 arithmetic needs is for a native Chinese speaker with a background in computing and linguistics to design and run the intensive computations necessary to assign the best possible mapping between 2-digit numbers and base-100 digits, so the memorizations will be easiest for native-speaking Chinese children. It would be a time-consuming input-intensive programming task with a deliverable of only 90 ordered Chinese characters. An example of the future of computing, perhaps?

Tuesday, June 03, 2008

Ejoty in Groovy

Ejoty is a word invented by magician Stewart James to describe the mental skill of easily remembering the numeric value of each letter of the English alphabet (i.e. A=1, B=2, ..., Z=26) to enable quick mental calculation of the value of words (e.g. WORD = W + O + R + D = 23 + 15 + 18 + 4 = 60). The letters in ejoty refer to the ordered multiples of 5 in the alphabet, i.e. E=5, J=10, O=15, T=20, Y=25; if we memorize those, we can easily calculate the values of most other letters using only 1 or 2 offsets.

Musical and aural learners could learn the values of letters by remembering the value of the first letter in each foot of the popular children's song "abcd efg, hijk lmnop, qrs tuv, wx yz", i.e. "1 5, 8 12, 17 20, 23 25".

If we can instantly know the value of each letter, we can more easily practise adding a sequence of numbers in our heads whenever we see words written down somewhere. For example, signs we see when riding public transport:
  SYDNEY = 19 + 25 + 4 + 14 + 5 + 25 = 92
  FLINDERS = 6 + 12 + 9 + 14 + 4 + 5 + 18 + 19 = 87
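
With Groovy, we can quickly check such sums (the helper name wordVal is just for illustration):

def wordVal(word) {
  word.toLowerCase().collect { (it as char) - 96 }.sum()   // 'a' is 97, so a=1 ... z=26
}
assert wordVal('WORD') == 60
assert wordVal('SYDNEY') == 92
assert wordVal('FLINDERS') == 87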


Using Groovy
To more quickly ejotize words, we can learn the sums of common letter sequences off by heart. To find the most common ones, we can write a Groovy program...

After extracting the English word list english.3 within this zip file, we can run this script:

// letter-sum of a sequence: a=1, b=2, ..., z=26
def gramVal(gr){
  def tot= 0
  gr.each{ tot += (it as char) - 96 }   // 'a' is 97 in Unicode, so 'a' scores 1
  tot
}

// count how often each letter sequence of 2 or more letters occurs in the word list
def grams= [:]
new File("english.3").eachLine{word->
  word -= "'"                           // strip the apostrophe from entries that have one
  for(int i in 2..word.size())          // i is the length of the sequence
    if(word.size() >= i)
      for(int j in 0..word.size() - i){ // j is its starting position in the word
        def gram= word[j..j+i-1].toLowerCase()
        if( grams[gram] != null ) grams[gram]++
        else grams[gram]= 1
      }
}

// display the sequences occurring more than 200 times, most frequent first
grams.entrySet().findAll{it.value > 200}
     .sort{it.value}.reverse().each{
  def gm= gramVal(it.key)
  println "$it.key ($gm): $it.value"
}


Only the sequences of letters occurring more than 200 times in that word list will be displayed by that version of the program. The first 20 lines of output are:

an (15): 2634
er (23): 2606
in (23): 2080
ar (19): 1977
on (29): 1780
te (25): 1750
ra (19): 1732
en (19): 1625
al (13): 1570
ro (33): 1498
ri (27): 1485
is (28): 1472
la (13): 1444
or (33): 1426
le (17): 1425
at (21): 1404
ch (11): 1327
st (39): 1303
re (23): 1269
ti (29): 1253


(The reason the commonly-occurring th doesn't appear is that the program doesn't consider word frequencies in normal text.)

Ejoty In Reverse
Perhaps we want to easily convert numbers to letters. We could learn the letters for the numbers up to 26 easily enough, but what if we want to convert higher numbers? What about converting them to groups of letters whose sum is the number? We'd need to generate some common possibilities. This Groovy code uses the grams map we generated in the previous code sample to generate the 5 most common sequences for each number up to 100:

// group the letter sequences by their letter-sum value
def grGrams= grams.groupBy{gramVal(it.key)}
grGrams.entrySet().findAll{it.key <= 100}
       .sort{it.key}.each{
  // for each value: print the total occurrences, then the 5 most frequent sequences
  print "$it.key (${it.value.inject(0){flo,itt-> flo+itt.value}}): "
  def set= it.value.entrySet().sort{it.value}.reverse()
  def setSz= 5
  if(set.size() >= setSz) set= set[0..setSz-1]
  println set.collect{"$it.key($it.value)" }.join(', ')
}


Here's a segment of output showing the most common letter sequences for numbers greater than 26:

27 (10243): ri(1485), lo(973), ol(955), sh(587), ve(442)
28 (11695): is(1472), th(855), si(785), mo(709), om(691)
29 (11231): on(1780), ti(1253), it(919), no(768), ell(251)
30 (7766): oo(370), ing(362), ati(245), iu(234), sk(207)
31 (7516): op(660), po(534), rm(330), vi(294), her(217)
32 (8183): sm(361), rn(276), tic(255), eri(249), mar(188)
33 (11647): ro(1498), or(1426), ul(550), hy(403), ns(355)
34 (9931): nt(950), os(793), um(504), so(424), pr(304)
35 (9482): to(1077), ot(634), un(541), ant(321), sp(282)
36 (8867): ou(788), rr(288), pt(183), ato(165), min(165)
37 (8499): rs(330), ly(267), ov(250), yl(220), pu(149)
38 (10054): tr(740), ss(458), rt(414), ion(261), qu(256)
39 (11026): st(1303), ur(745), ent(320), ru(304), per(244)
40 (9339): us(1130), su(353), tt(308), ast(225), sta(211)
41 (9141): ut(364), tu(282), ism(252), rin(191), yp(178)
42 (8756): ers(200), mon(199), olo(170), res(144), ori(132)
43 (9499): ter(534), ry(378), tin(177), nti(151), ium(138)
44 (8233): ste(219), ys(184), est(168), tio(157), sy(107)
45 (7260): ty(229), ver(182), yt(128), tte(107), aceou(104)
46 (7459): ris(147), rom(135), mor(108), orm(105), los(77)
47 (7762): sis(206), tri(180), ron(156), rit(118), ssi(107)
48 (8292): ist(236), sti(162), uri(116), tis(115), eter(114)
49 (7790): ton(202), rop(143), ont(122), pro(120), phy(120)
50 (6711): oto(117), graph(83), oun(59), tric(57), low(54)

There's plenty of choices there.

The code uses the groupBy GDK function, and the output gives a visual representation of applying it to some data.