Tuesday, October 12, 2010

Recent progress in computer-aided language learning

Summary: This post explains why Anki and sentence mining are important steps forward in the computer-aided language learning scene. Both steps have happened during the last 3.5 years.

Background: The fundamental problem of flashcard programs

When studying languages, flashcard programs show you a word and ask you to give the translation. In a recognition task the program shows the foreign word and asks for the English meaning. A production task tests your ability to spell out the foreign word. Flashcard programs are called spaced repetition systems because they contain timing algorithms which ask easy questions rarely and difficult questions often until they become easy. This ensures that the material is on average suitably difficult.
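To make the timing idea concrete, here is a minimal sketch of a Leitner-style scheduler. It is not the algorithm of any particular program mentioned in this post; the `INTERVALS` values, the `Card` fields and the function names are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Hypothetical review intervals in days, one per "box". The numbers are
# illustrative only and are not taken from any real flashcard program.
INTERVALS = [1, 2, 4, 8, 16]


@dataclass
class Card:
    front: str
    back: str
    box: int = 0      # index into INTERVALS; a higher box means an easier card
    due_day: int = 0  # day number on which the card should be asked next


def review(card, correct, today):
    """Promote the card one box on a correct answer, reset it on a miss."""
    if correct:
        card.box = min(card.box + 1, len(INTERVALS) - 1)
    else:
        card.box = 0
    card.due_day = today + INTERVALS[card.box]
    return card


def due_cards(cards, today):
    """Return the cards whose next review day has arrived."""
    return [card for card in cards if card.due_day <= today]
```

A card that keeps getting answered correctly drifts toward the long intervals and is asked rarely, while a missed card drops back to the shortest interval and is asked often.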

The fundamental problem is that you don't learn a word by remembering its translation. If you memorize now that the Telugu word "adivaramu" means Sunday, you'll just forget it in a few weeks. Spaced repetition systems can stretch this to months by reminding you of the word. But to learn a word permanently, in the sense that Finnish speakers of English know that "Sunday" means "sunnuntai", you need context. You need to see the foreign word in tens or hundreds of sentences, so that it integrates with larger data structures in your head and is no longer just a factlet like "the circumference of earth is 44000km".

This problem is specific to spaced repetition systems, because it is already solved in the analog world. Language textbooks provide the context in the text chapters. Philologists who train to be interpreters and translators mainly read books to expand their vocabulary. In that situation all words are in context.

I first realized this problem after I banged through 1000 Lojban words with Logflash only to forget them all in 3 months.

My first conclusion was that you should only flash cards for which you have text. This worked great with Practical Chinese Reader I & II. First I flashed the words and then I read the text. Thanks to the spaced repetition system I could go through chapters much faster.

When I started to build my own language-learning website, I fully realized the importance of tackling this problem. I was using the MDBG annotator. It can turn any text into decent study material, unless the text is much above your level. My first approach was to grab the context from the same source as the words. My website had a feature which turned a copy-pasted Chinese text into a flashcard deck containing all the words in the text. It also had an easy interface for removing familiar words. The word flashcards had context attached: After you gave your answer, the card showed the sentences where the word appeared. It also annotated the sentences MDBG-style: When your mouse hovered over any unknown word in the sentences, the meaning of the word appeared.
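The text-to-deck feature described above can be sketched roughly as follows. This is a reconstruction of the idea, not the code of the original website: the greedy longest-match segmentation, the tiny dictionary format and every function name are assumptions made for illustration.

```python
import re


def split_sentences(text):
    """Split on Chinese or Western sentence-final punctuation, keeping it."""
    parts = re.findall(r"[^。！？.!?]+[。！？.!?]?", text)
    return [part.strip() for part in parts if part.strip()]


def segment(sentence, dictionary):
    """Greedy longest-match segmentation against the dictionary keys."""
    max_len = max(map(len, dictionary), default=0)
    words, i = [], 0
    while i < len(sentence):
        for size in range(min(max_len, len(sentence) - i), 0, -1):
            chunk = sentence[i:i + size]
            if chunk in dictionary:
                words.append(chunk)
                i += size
                break
        else:
            i += 1  # skip characters the dictionary does not cover
    return words


def build_deck(text, dictionary, known=()):
    """Map each unfamiliar word to its meaning and the sentences it occurs in."""
    deck = {}
    for sentence in split_sentences(text):
        for word in segment(sentence, dictionary):
            if word in known:
                continue  # the user has removed this word as familiar
            card = deck.setdefault(
                word, {"meaning": dictionary[word], "context": []}
            )
            if sentence not in card["context"]:
                card["context"].append(sentence)
    return deck
```

With a hypothetical dictionary such as `{"我": "I", "喜欢": "to like", "茶": "tea", "喝": "to drink"}`, calling `build_deck("我喜欢茶。我喝茶。", dictionary, known={"我"})` yields cards for 喜欢, 茶 and 喝, and the card for 茶 carries both sentences as context.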

This solution had a shortcoming: The sentences were too long and difficult, and having just one sentence of context was not enough. I also realized that the real learning happened when studying the sentences, and that they were at least as important as the words being flashed.

My second solution was to collect a database of translated easy sentences and to automatically match them to flashcards. I never properly implemented this, because it required a HUGE amount of data collection. Anyone who has ever written example sentences knows how slow it is. The best I achieved was to type enough sentences for an elementary course in Chinese. The material contained Skritter-style character drawing exercises for 200 characters and simple, clear, translated example sentences for them all. This produced adequate quality but it didn't scale. This lack of scalability made it a toy site. Shortly after that, I graduated and stopped developing the site.
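The matching step itself is the easy part, as this minimal sketch suggests. It assumes a hypothetical sentence bank of `(sentence, translation)` pairs and uses sentence length as a crude stand-in for difficulty. Neither choice comes from the original site.

```python
def match_sentences(word, sentence_bank, limit=3):
    """Pick up to `limit` bank sentences containing `word`, shortest first.

    `sentence_bank` is a list of (sentence, translation) pairs. Plain
    substring matching is an assumption made for brevity; in Chinese it can
    also hit a character that is really part of a longer word.
    """
    hits = [pair for pair in sentence_bank if word in pair[0]]
    hits.sort(key=lambda pair: len(pair[0]))
    return hits[:limit]
```

The slow part is filling `sentence_bank` with good translated sentences, which is exactly what did not scale.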

Sentence-based flashcards

During the last 3.5 years, an ingenious solution to the Fundamental Problem surfaced: Sentence mining. The idea is that sentences are the basic unit of flashing, not words. Just as gymnasts train whole-body movements and trust that individual muscles get stronger, with sentence flashcards you trust that you also learn the words while flashing sentences.

This is a new development, as Xamuel's article was written in September 2009 and the Chinese sentence deck I now use was written in 2008. I stopped working on SRS in 2007. This idea is so simple that it makes me ashamed that I didn't notice it. I had already diagnosed the problem and was trying different solutions to it, but somehow failed to take the last step of imagination and to fully move to sentence-based cards.

My own experience confirms that it works like a dream. During my Chinese study, I've periodically benchmarked my character count with Clavis Sinica's character test. During the first 4 years, I reached a weekly average score of 2200. During 10 months with the sentence deck, the character count exploded to 3000. I could have reached the current skill level a full year earlier, had I known about this method. Now I no longer use the sentence deck, because it has been so efficient that the bottleneck has moved away from single Chinese characters, and more context-heavy methods like reading texts with MDBG are more appropriate.


The rise of Anki is the second big step forward in the computer-aided language learning (CALL) scene. Anki does not contain anything revolutionary, but it combines all the good features of previous flashcard programs into one consistent and easy package. It is so good that if I entered the CALL scene again to do research for graduate studies, I would scrap my old website, which included a spaced repetition system, and use the superior, refined and open-source Anki as a basis instead.


Although my own CALL efforts failed, recent developments in the CALL field demonstrate that I was tackling the right questions: How to get context for words in flashcards, and how to construct a good spaced repetition system. Progress happened when these problems were addressed. I've witnessed the superiority of the result myself with Anki and a 20000-card HSK sentence deck.


Markku said...

I'm in the middle of reading your post, so I don't have anything more substantial to say, yet, but one thing stuck out from your text. You write:

a factoid like "the circumference of earth is 44000km"

A factoid is a fact-like item, not really a fact. Analogously, an android is not quite an andros, a man.

Simo said...

I wasn't aware that the word factoid has two contradictory meanings: the original meaning you give, and the later meaning which Wikipedia mentions:

"A factoid is a questionable or spurious—unverified, incorrect, or fabricated—statement presented as a fact, but with no veracity. The word can also be used to describe a particularly insignificant or novel fact, in the absence of much relevant context."

Clearly a lousy word for any purpose.

Markku said...

The latter meaning is ascribed to the word only because so many people have misunderstood the word exactly in that manner.

Simo said...

Now it is a factoid that factoid means "factlet separate from relevant context." :)