
rongopy

Ideas for the decipherment of rongorongo using machine learning in Python

Jonas Gregorio de Souza

Background

What is rongorongo?

Rongorongo (henceforth RoR) is an undeciphered glyph system from Easter Island. The very nature of RoR as true writing is debated. In the past, the prevalent view was that the glyphs were a mnemonic device and were unrelated to the specific words of the chants they were meant to recall (Métraux 1957; Routledge 1919). Nowadays, most scholars assume that the system was either logographic, with a few phonetic complements (Guy 1990, 2006; Fischer 1995a), or predominantly syllabic, with certain glyphs working as determinatives or logograms (Pozdniakov 1996; Pozdniakov and Pozdniakov 2007; Horley 2005, 2007).

The canonical RoR corpus comprises texts carved on 20 wooden tablets, one staff, two reimiro (pectoral adornments), one birdman sculpture (tagata manu), and one snuffbox (assembled from an earlier tablet). A bark-cloth fragment has recently been recognized as another genuine inscription (Schoch and Melka 2019). The texts are repetitive, with three tablets (H, P, Q) containing the same text. Certain sequences of glyphs, some of them quite long, appear in multiple artefacts.

The only RoR passage whose meaning is thought to be understood by most scholars is the lunar calendar on tablet Mamari (Guy 1990; Horley 2011; but see Davletshin 2012b). Repeated crescent-shaped glyphs are combined with other signs, presumably phonetic complements used to spell the names of the nights.

The antiquity of the system is another point of contention (Langdon and Fischer 1996). Most of the artefacts appear to be recent. Three tablets were carved on European oars, and the only radiocarbon measurement available (for tablet Q, Small St. Petersburg) points to the 19th century (Orliac 2005). If, however, RoR can be proven earlier than the European encounter and its function as real writing can be ascertained, this would be a remarkable finding - one of the rare cases of independent invention of writing in the world.

On this website and repository, I offer some thoughts on a machine-learning approach to decipherment, alongside the data (the RoR corpus in a format that can be loaded in Python) and code. This is not another claim of decipherment, as I do not consider the results acceptable, but the method is promising and may inspire others.

Approaches to decipherment

The earliest attempts at decipherment, still in the 19th century, took advantage of the fact that informants were still alive who had presumably been instructed in RoR - or at least heard the tablets being recited (Routledge 1919). Two informants, named Metoro and Ure Vaeiko, provided readings for entire tablets (Thompson 1889; Jaussen 1893). Metoro's readings - apparently just a description of the objects depicted by individual glyphs - formed the basis for Thomas Barthel's interpretation of RoR (Barthel 1958).

Yuri Knorozov, famous for the decipherment of Maya glyphs, was later involved with other Soviet scholars in the study of RoR (Butinov and Knorozov 1957). Their understanding was that RoR was a mixed writing with logograms and phonetic complements, similar to other hieroglyphic systems.

The many publications of Jacques Guy opened several routes to decipherment. Most importantly, we must mention the recognition of potential taxograms or determinatives (Guy 2006) and the interpretation of the structure of the lunar calendar in tablet Mamari, including a number of plausible phonetic readings for signs that accompany the moon glyphs (Guy 1990).

In the 1990s, Steven R. Fischer brought renewed attention to the field with his purported decipherment. Based on similarities with the structure of a cosmogonic chant recited by Ure Vaeiko, Fischer read a series of procreation triads in the Santiago Staff (Fischer 1995a) and other tablets (Fischer 1995b). His work, however, was heavily criticized by other RoR scholars (Guy 1998; Pozdniakov 1996).

The recent work by the Pozdniakovs (Pozdniakov 1996; Pozdniakov and Pozdniakov 2007) and Paul Horley (2005, 2007) is focused on simplifying Barthel's catalogue by isolating the basic glyphs in RoR and comparing glyph and Rapanui syllable statistics. Similarly, Albert Davletshin (Davletshin 2012a, 2012b) has been attempting to separate syllabograms and logograms in RoR based on glyph combinatorial properties. Most recently, he proposed some syllabic readings of glyphs based on apparent logogram substitutions and use as phonetic complements (Davletshin 2022).

Martyn Harris and Tomi Melka have been moving the field in the direction of machine learning and natural language processing with n-gram collocation and latent semantic analysis (LSA) (Harris and Melka 2011a, 2011b).

Finally, updating the tracings of the corpus with new recording methods has been a priority in recent years (Lastilla et al. 2022; Valerio et al. 2022). The reference in that regard is Paul Horley's book (2021), which, in addition to publishing updated tracings of the entire corpus, provides numerous discussions of the glyph catalogue, parallel passages, structured sequences and list entries.

2023 Update

This is an update to my previous code (see ga_lstm folder), which used a genetic algorithm scored by an LSTM language model to "brute force" a mapping between glyphs and syllables.

The idea is to use syllable and glyph frequencies, combined with sequence-to-sequence (seq2seq) models, to map the glyphs to the language. Similar approaches have been proposed to decrypt substitution cyphers by encoding both source and target texts to a common space, either using the respective letter/symbol frequencies (Aldarrab and May 2021) or as recurrent integer sequences (Kambhatla et al. 2023).

Seq2Seq Model

Here, I use a selection of short recitations and chants, which most likely represent the genres present in some of the rongorongo texts (Barthel 1960; Blixen 1979; Fischer 1994). These texts, including the kaikai recitations that accompany string figures, very often preserve fossilised forms of the ancient Rapa Nui language. Assuming that rongorongo represents a logosyllabic writing system, the verses are split into syllables. The syllables are then converted to integer sequences according to the rank of each syllable by order of frequency.
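This encoding step can be sketched as follows. The syllabifier and its consonant inventory are simplified assumptions for illustration, not the repository's actual code; Rapa Nui syllables are an optional consonant plus a vowel, and long vowels and the glottal stop are ignored here, as in the source texts.

```python
import re
from collections import Counter

# Simplified CV syllabifier (illustrative consonant inventory).
SYLLABLE = re.compile(r"[ptkmnvrhg]?[aeiou]")

def syllabify(verse):
    """Split a verse into CV syllables."""
    return SYLLABLE.findall(verse.lower())

def rank_encode(verses):
    """Encode each verse as the frequency ranks of its syllables (1 = most frequent)."""
    counts = Counter(s for v in verses for s in syllabify(v))
    rank = {syl: i + 1 for i, (syl, _) in enumerate(counts.most_common())}
    return [[rank[s] for s in syllabify(v)] for v in verses]

print(rank_encode(["ka tere te vaka", "e tere te vaka e"]))
```

In this toy example, te is the most frequent syllable and is therefore encoded as 1 wherever it occurs.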

If decipherment were a simple matter of matching glyph and syllable frequencies, the solution would have been found long ago. Unfortunately, frequencies vary considerably depending on the subset of texts that is considered. To account for that, each verse is encoded ten times, each time based on the frequencies calculated from a random sample of verses, similar to the approach of Aldarrab and May (2021).
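A minimal sketch of this resampling scheme, reusing the same simplified syllabifier as above; the sample fraction and seed are illustrative choices, not the values used in the repository:

```python
import random
import re
from collections import Counter

SYLLABLE = re.compile(r"[ptkmnvrhg]?[aeiou]")  # simplified CV syllabifier

def syllabify(verse):
    return SYLLABLE.findall(verse.lower())

def resampled_encodings(verses, n_samples=10, frac=0.7, seed=0):
    """Encode every verse n_samples times, each time ranking syllables
    by frequencies computed on a random subsample of the verses."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_samples):
        sample = rng.sample(verses, max(1, int(frac * len(verses))))
        counts = Counter(s for v in sample for s in syllabify(v))
        rank = {syl: i + 1 for i, (syl, _) in enumerate(counts.most_common())}
        for v in verses:
            syls = syllabify(v)
            # syllables unseen in the subsample get rank 0
            pairs.append(([rank.get(s, 0) for s in syls], syls))
    return pairs
```

Each (integer sequence, syllable sequence) pair then serves as a source/target training example for the seq2seq model.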

An encoder-decoder model with attention is then trained with the integer sequences (i.e. each syllable encoded as its frequency rank) as sources and the syllable sequences as targets. I use a GRU model with 100 embedding units and 250 hidden units for both the encoder and the decoder (the TensorFlow authors 2018).

The model is trained to predict Rapa Nui sentences from integer sequences based on syllable frequencies. Glyph sequences are then encoded in the same way and fed into the model. Note that the "decoded" glyphs are shown only as an example; the correctness of such a decoding is by no means endorsed.

Once trained, the model is tested on sequences of glyphs. I used the parallel passages catalogued by Horley (2021). These are numerous mini-texts which are found in various tablets, often in different spellings, with some tablets almost entirely consisting of collations of such passages (Pozdniakov 1996; Sproat 2003; Horley 2007; among many others). It is possible that they are self-contained texts consisting of short invocations, chants, prayers etc. similar to the content of the selected Rapa Nui texts. Another advantage of using the parallel passages is that the beginning and end of the sequences can be determined - facilitating the analysis of glyph collocations (see below). The glyphs, originally transcribed using Barthel's (1958) catalogue, were converted into the encoding proposed by Horley (2021), which, similar to previous proposals, simplifies the numerous ligatures in the catalogue to a set of about 130 basic glyphs. These are encoded according to their frequencies, just like the syllables in the language. Since there are many more glyphs than syllables (partly due to the source materials not distinguishing long vowels or the glottal stop), a maximum rank of 30 was considered, anything lower than that being encoded as 0.
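The glyph-side encoding with the rank cap can be sketched in the same way; the glyph numbers in the toy sequences below are illustrative only, and the cap of 3 stands in for the actual maximum rank of 30:

```python
from collections import Counter

def encode_glyphs(sequences, max_rank=30):
    """Rank glyphs by corpus frequency and encode each sequence,
    mapping any glyph ranked below max_rank to 0 (unknown)."""
    counts = Counter(g for seq in sequences for g in seq)
    rank = {g: i + 1 for i, (g, _) in enumerate(counts.most_common())}
    return [[rank[g] if rank[g] <= max_rank else 0 for g in seq]
            for seq in sequences]

# Toy sequences using Barthel-style glyph numbers (illustrative only)
seqs = [[200, 6, 10, 4], [6, 600, 6, 1]]
print(encode_glyphs(seqs, max_rank=3))
```

With the cap lowered to 3 for the toy example, any glyph outside the three most frequent is encoded as 0.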

Results

Results were not consistent, preventing the assignment of syllabic values to specific glyphs. In fact, the frequency distribution of the glyphs is incompatible with that of the Rapa Nui syllables. While the latter obeys Zipf's law, as expected of natural language, the frequencies of the first and second most frequent glyphs - glyphs 200 and 6, respectively - are not too distant from each other. This is in agreement with the observation that glyph 200 may be a taxogram, frequently omitted in parallel passages and apparently used to connect other glyphs in ligatures (Guy 2006). Thus, for the final evaluation, glyph 200 was omitted from the rongorongo sequences.
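A quick way to see this departure from Zipf's law is to compare the two highest frequencies, as in this sketch (the helper name and toy data are illustrative):

```python
from collections import Counter

def top_two_ratio(symbols):
    """Ratio of the most frequent to the second most frequent symbol.
    Under Zipf's law this ratio is roughly 2; a value close to 1, as
    observed for glyphs 200 and 6, departs from a Zipfian profile."""
    (_, f1), (_, f2) = Counter(symbols).most_common(2)
    return f1 / f2

print(top_two_ratio(list("aaaabb")))
```

Applied to the Rapa Nui syllable corpus and to the glyph corpus, this statistic separates the two distributions.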

To assess the viability of the decoding, the predicted texts were evaluated using perplexity based on a bigram model of the Rapa Nui corpus. In addition, the closest matches were sought between predicted sentences and the Rapa Nui verses based on the Levenshtein edit distance (file results.csv). The results were far from acceptable, but still provide some insights. Glyph drawings below are from Paul Horley (2021).
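Both evaluation metrics are standard and can be sketched as below; the add-one smoothing in the bigram model is an illustrative choice, not necessarily the scheme used in the repository:

```python
import math
from collections import Counter

def bigram_perplexity(tokens, bigram_counts, unigram_counts, vocab_size):
    """Perplexity of a token sequence under an add-one-smoothed bigram model."""
    logp = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        p = (bigram_counts[(prev, cur)] + 1) / (unigram_counts[prev] + vocab_size)
        logp += math.log(p)
    return math.exp(-logp / max(1, len(tokens) - 1))

def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two sequences."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]
```

Low perplexity indicates a predicted sentence that reads like plausible Rapa Nui; low edit distance to an attested verse indicates a close match.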

Attention weights and prediction from one parallel passage (#11, Ar3) (Horley 2021). Note that the "decoded" glyphs are shown only as an example; the correctness of such a decoding is by no means endorsed.

Some of the readings agree with the position of the glyphs/syllables as starters or enders of sequences. The tables below show the five most frequent glyphs/syllables in the parallel passages and selected Rapa Nui corpus, as well as the five most frequent in the beginning and end of a sequence. As expected, the model often decodes glyph 4, which is the most frequent in the beginning of a sequence, as i or ka. The most frequent glyph, 6, is often decoded as a, as long postulated by Pozdniakov and Pozdniakov (2007). Notice that very different readings were recently proposed by Davletshin (2022) based on substitutions and apparent use as phonetic complements, which may provide a more fruitful approach towards decipherment.

Glyphs   Starters   Enders
6        4          1
1        600        4
10       10         10
4        6          6
600      1          711

Syllables   Starters   Enders
a           ka         e
i           e          a
ka          ko         i
e           i          ga
u           a          na

Conclusion

It was not possible to arrive at a viable decoding using the Seq2Seq model, which may be due to the small size of the corpus when compared to similar studies (Aldarrab and May 2021). The uncertainty about the glyph catalogue, as well as the presence of logograms and determinatives (taxograms) in addition to syllabic glyphs, are also factors that hinder an automated approach to decipherment.
