# Influx Dev Log - Phase I
#post
Dev log for Influx
Development on Influx got off to a slow start after it was proposed back in 2022. We finally landed on a somewhat exotic stack with Rust, Axum, SurrealDB, Svelte, and Tauri, perhaps partly because I’m tired of React and wanted something cleaner, and perhaps also because I just love the Rust language. Either way, there’s something to be learnt from taking on this project and creating some artificial challenges.
## Influx’s Philosophy
The vision has not changed: Influx shall be a language learning tool for those who wish to teach themselves the language, rather than those who want an app that teaches them the language. For learners who believe in active learning, who have their own timeline for learning material, and who are interested in consuming large amounts of language content, Influx is meant to assist this process with a versatile and well-integrated toolbox. Its goal is to minimise the friction involved in consuming and learning from challenging foreign language material.
## Development Philosophy
The development plan is, in many ways, intentionally hard. I’m choosing not to stick with React so I get to learn Svelte, and not to use UI libraries so I get to learn how to implement and style components from scratch. As for SurrealDB, it looked cool, so why not give it a try by building a project with it? Ultimately, the problems I encounter and solve along the way are all learning opportunities. Meanwhile, there will be no test cases, no specs, and little existing software to mimic, so I get to learn how to design software.
Writing the backend in Rust might save some headaches with type safety, or at least that’s what I hoped. Though development may be slower, Rust seems like a more robust choice with powerful unit testing and documentation tools out of the box.
## What works so far
- Access content in the file system and render it in the front-end
- Working databases: vocabulary, languages, phrases
- Python scripting capability for extendable language support
- Text processing: tokenization, lemmatization, sentence segmentation, and phrase identification with optimization
- APIs: getting documents, vocab, languages, and modifying tokens
- Front-end: basic text reader, basic navigation, workspace layout
For screenshots, see Influx Continuous Dev Log.
## NLP Integration
One problem that draws me to working on Influx is how to make use of existing NLP tools to assist language learning.
Supporting custom learning material comes with the inherent issue of having to process whatever the learner brings. If the input is text, how does one know where to split sentences and words? Tricky, and even more so if I want to support arbitrary languages.
For now, I’m using Stanza for tokenization, sentence segmentation, and lemmatization. It supports 70 human languages, which should be a good number for early-stage development. Rather than manually writing parsing rules, Stanza uses pre-trained neural models and produces output in a format consistent across languages—less of a headache compared to, say, using MeCab for Japanese, and separately Jieba for Chinese, and separately doing some horrible regex for Spanish…
The trouble, though, is that Stanza uses Python but our code is in Rust. Fortunately, there exist things like PyO3 to embed Python within Rust. Here’s an example of Stanza output being parsed into the desired Rust struct, with some extra work to bring back the whitespace, which Stanza throws away.
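To make the debug output below easier to read, here is a sketch of the kind of enum it could deserialise into. This is reconstructed from the output itself; the actual Influx type and field names may differ.

```rust
/// A sketch of the parsed-document shape, reconstructed from the debug
/// output below; not necessarily the actual Influx definitions.
#[derive(Debug, PartialEq)]
enum SentenceConstituent {
    /// One orthographic word that Stanza splits into subwords ("Let's").
    CompositToken { sentence_id: usize, ids: Vec<usize>, text: String, start_char: usize, end_char: usize },
    /// A subword belonging to a CompositToken; carries its own lemma.
    SubwordToken { sentence_id: usize, id: usize, text: String, lemma: String },
    /// An ordinary token with its lemma and character span.
    SingleToken { sentence_id: usize, id: usize, text: String, lemma: String, start_char: usize, end_char: usize },
    /// Whitespace recovered from the gaps between token spans.
    Whitespace { text: String, start_char: usize, end_char: usize },
}

#[derive(Debug)]
struct Sentence {
    id: usize,
    text: String,
    start_char: usize,
    end_char: usize,
    constituents: Vec<SentenceConstituent>,
}

fn main() {
    let s = Sentence {
        id: 0,
        text: "Let's".to_string(),
        start_char: 0,
        end_char: 5,
        constituents: vec![SentenceConstituent::CompositToken {
            sentence_id: 0,
            ids: vec![1, 2],
            text: "Let's".to_string(),
            start_char: 0,
            end_char: 5,
        }],
    };
    println!("{:#?}", s);
}
```

Character offsets are what make the whitespace recovery possible: any gap between one constituent’s `end_char` and the next one’s `start_char` is a `Whitespace` slice of the original text.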
**Example text parsing**

Input (a horrible mess):

```
Let's write some SUpeR nasty text, tabs haha shall we?
```
Parsed document (structured!):

```rust
&sentences = [
    Sentence {
        id: 0,
        text: "Let's write some SUpeR nasty text,",
        start_char: 0,
        end_char: 42,
        constituents: [
            CompositToken { sentence_id: 0, ids: [1, 2], text: "Let's", start_char: 0, end_char: 5 },
            SubwordToken { sentence_id: 0, id: 1, text: "Let", lemma: "let" },
            SubwordToken { sentence_id: 0, id: 2, text: "'s", lemma: "'s" },
            Whitespace { text: " ", start_char: 5, end_char: 6 },
            SingleToken { sentence_id: 0, id: 3, text: "write", lemma: "write", start_char: 6, end_char: 11 },
            Whitespace { text: " ", start_char: 11, end_char: 15 },
            SingleToken { sentence_id: 0, id: 4, text: "some", lemma: "some", start_char: 15, end_char: 19 },
            Whitespace { text: " ", start_char: 19, end_char: 23 },
            SingleToken { sentence_id: 0, id: 5, text: "SUpeR", lemma: "SUpeR", start_char: 23, end_char: 28 },
            Whitespace { text: " ", start_char: 28, end_char: 31 },
            SingleToken { sentence_id: 0, id: 6, text: "nasty", lemma: "nasty", start_char: 31, end_char: 36 },
            Whitespace { text: " ", start_char: 36, end_char: 37 },
            SingleToken { sentence_id: 0, id: 7, text: "text", lemma: "text", start_char: 37, end_char: 41 },
            SingleToken { sentence_id: 0, id: 8, text: ",", lemma: ",", start_char: 41, end_char: 42 },
        ],
    },
    Whitespace { text: " \n\n\t", start_char: 42, end_char: 46 },
    Sentence {
        id: 1,
        text: "tabs haha",
        start_char: 46,
        end_char: 55,
        constituents: [
            SingleToken { sentence_id: 1, id: 1, text: "tabs", lemma: "tabs", start_char: 46, end_char: 50 },
            Whitespace { text: " ", start_char: 50, end_char: 51 },
            SingleToken { sentence_id: 1, id: 2, text: "haha", lemma: "haha", start_char: 51, end_char: 55 },
        ],
    },
    Whitespace { text: "\n\n", start_char: 55, end_char: 57 },
    Sentence {
        id: 2,
        text: "shall we?",
        start_char: 57,
        end_char: 66,
        constituents: [
            SingleToken { sentence_id: 2, id: 1, text: "shall", lemma: "shall", start_char: 57, end_char: 62 },
            Whitespace { text: " ", start_char: 62, end_char: 63 },
            SingleToken { sentence_id: 2, id: 2, text: "we", lemma: "we", start_char: 63, end_char: 65 },
            SingleToken { sentence_id: 2, id: 3, text: "?", lemma: "?", start_char: 65, end_char: 66 },
        ],
    },
]
```
### Lemma vs Inflection: a Learner’s Perspective
I don’t recall having a good experience with lemmas when using LWT. For a content-based language learner, the order in which the lemma and inflections of the same word are learnt may be all over the place. Sometimes an inflection can look very different from its lemma, and it takes some learning to recognize that the two are related; other times, a reminder of what the lemma is would be sufficient.
A benefit of automated lemmatization is that tokens don’t need a stored lemma field: the lemma can be predicted on the fly from context, even when one orthographic form corresponds to multiple lemmas. This is not something LWT does.
Perhaps there needs to be some flexibility in how the learner chooses to learn an inflection. For example, if suis (an inflection of the lemma être) shows up in the content, we can write out this matrix. Here, vocab refers to the learner’s saved vocabulary.
|  | **être (lemma) in vocab** | **être (lemma) not in vocab** |
| --- | --- | --- |
| **suis (inflection) in vocab** | Learner knows both the lemma and the inflection. Which of the two is more helpful to review? | Learner encounters a previously seen inflection but has not encountered the lemma. Is it sufficient to remind them of the inflection alone, or should they also learn the underlying lemma? |
| **suis (inflection) not in vocab** | Learner encounters a new inflection of a lemma they know. Should they learn to associate the inflection with the lemma, or is it more beneficial to build intuition with the inflection directly? | Learner encounters a new inflection, and the underlying lemma is also new to them. What should they learn first, and how do they learn to recognize the underlying lemma? |
Ideas:
|  | **être (lemma) in vocab** | **être (lemma) not in vocab** |
| --- | --- | --- |
| **suis (inflection) in vocab** | Influx shows notes for suis along with a reminder that suis is an inflection of être; clicking on the reminder brings up their notes on être. | Influx shows notes for suis and a reminder that être is the underlying lemma, along with a way for the learner to save être as a new token. |
| **suis (inflection) not in vocab** | Influx reminds the learner that suis is an inflection of a lemma they have seen, and lets the learner decide how they want to learn suis. | Influx defaults to showing a new-token form for suis, but a button allows the learner to bring up an additional new-token form for être. |
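The two matrices above boil down to a four-way decision keyed on whether the inflection and its lemma are saved. A hypothetical Rust sketch of that decision (none of these names come from the actual Influx code):

```rust
/// Hypothetical display modes for a token in the reader,
/// mirroring the four cells of the matrix above.
#[derive(Debug, PartialEq)]
enum TokenCard {
    ReviewInflectionWithLemmaLink,  // both saved
    RemindOfUnsavedLemma,           // inflection saved, lemma not
    NewInflectionOfKnownLemma,      // lemma saved, inflection not
    NewTokenWithOptionalLemmaForm,  // neither saved
}

/// Pick the card to show based on vocab membership.
fn card_for(inflection_in_vocab: bool, lemma_in_vocab: bool) -> TokenCard {
    match (inflection_in_vocab, lemma_in_vocab) {
        (true, true) => TokenCard::ReviewInflectionWithLemmaLink,
        (true, false) => TokenCard::RemindOfUnsavedLemma,
        (false, true) => TokenCard::NewInflectionOfKnownLemma,
        (false, false) => TokenCard::NewTokenWithOptionalLemmaForm,
    }
}

fn main() {
    // "suis" saved, "être" not: remind the learner of the unsaved lemma.
    println!("{:?}", card_for(true, false));
}
```

The open questions in the first matrix live inside each card's UI; the enum only decides which of the four situations the reader is in.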
### Multi-word tokens (or let’s call them phrases)
The learner will almost certainly, at some point, want to learn a group of tokens together. How can one tokenize correctly based on the learner’s saved multi-word tokens?
Tricky.
Even worse, there could be situations where multiple saved multi-word tokens overlap. Then, how does one know where to split the tokens?
Consider:

```
vocab = {"aa", "bb", "aa bb", "bb cc"}
text = "aa bb cc"
```

Do you tokenize as `[aa bb] cc` or as `aa [bb cc]`? I don’t know. Worse still, a naive algorithm can easily attempt `(aa [bb) cc]` if it tries to apply every multi-word token. Perhaps we can prioritize longer multi-word tokens, but what about:

```
vocab = {"aa", "bb", "aa bb", "cc dd", "aa bb cc"}
text = "aa bb cc dd"
```

With higher priority for the longest sequence, we get `[aa bb cc] dd`, but the split that groups the maximal number of tokens into phrases is `[aa bb] [cc dd]`.

Trouble.
Well, at the end of the day, Stanza tokenizes without knowing the learner’s saved multi-word tokens, so there needs to be some non-trivial algorithm that does a second-pass tokenization based on what phrases the learner has in the database.
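As a sketch of what such a second pass could look like, here is a naive greedy longest-match grouping in Rust. This is hypothetical code, not Influx’s actual algorithm, and it deliberately exhibits the `[aa bb cc] dd` behaviour discussed above (only the multi-word entries of the vocab matter here):

```rust
use std::collections::HashSet;

/// Naive second-pass grouping: scan left to right and, at each position,
/// take the longest saved phrase that matches; otherwise emit the single
/// token. A sketch only — it never backtracks, so it cannot find the
/// maximal-coverage split.
fn group_phrases<'a>(tokens: &[&'a str], phrases: &HashSet<&str>) -> Vec<Vec<&'a str>> {
    let mut out = Vec::new();
    let mut i = 0;
    while i < tokens.len() {
        // Try the longest candidate window first, shrinking down to 2 tokens.
        let mut matched = 1;
        let mut len = tokens.len() - i;
        while len > 1 {
            let candidate = tokens[i..i + len].join(" ");
            if phrases.contains(candidate.as_str()) {
                matched = len;
                break;
            }
            len -= 1;
        }
        out.push(tokens[i..i + matched].to_vec());
        i += matched;
    }
    out
}

fn main() {
    let phrases: HashSet<&str> =
        ["aa bb", "bb cc", "aa bb cc", "cc dd"].into_iter().collect();
    // Longest-first grouping picks [aa bb cc] [dd], even though
    // [aa bb] [cc dd] would group more tokens into phrases.
    println!("{:?}", group_phrases(&["aa", "bb", "cc", "dd"], &phrases));
}
```

Doing better than this — e.g. maximizing the number of tokens covered by phrases — would take something like dynamic programming over the token sequence rather than a single greedy scan.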
## Next Steps
- UI/UX design
- Phrase selection and modification support in text reader
- Packaging
## Open problems
- How to get dictionaries for arbitrary languages? Users will likely have to bring their own dictionaries.
- Packaging a Python interpreter, and with it the Stanza library