

Where We Left Off

In the last article, we discussed the formatting of the Master List and how it can aid the study of Traditional Chinese characters, assuming familiarity with their Simplified counterparts. Since then, I have realized that the first list, containing 350 character simplifications that do not occur as components of other character simplifications, is not just a random list of characters. The explanation is as follows: some components of characters that are not considered legitimate radicals have also been simplified, and unlike in Japanese, they have been simplified across the board, in pretty much every occurrence of that component. For example, the characters 汉 - 漢 and 难 - 難 both see a common component simplified to 又. Apparently, this component is a character itself: 堇 (jǐn), which does not appear in the simplification table; according to digital simplified dictionaries, it describes a certain type of plant and is not simplified. It is understandable that 堇 did not get simplified to 又, as that would mean attaching yet another definition and pronunciation to a common character such as 又. In this case, since 堇 was not itself simplified, characters containing it cannot be categorized as “words containing simplified component(s),” and therefore must fall under the miscellaneous category in the first table. It is also worth noting that 难/難 is itself listed as a simplified component in the second table, but the point still holds.

An Action Plan (Back to Anki!)

Now that we have gone over most of the useful theory regarding the Master List and the simplification of Chinese, it is time to put our plan for learning Traditional Chinese into action. Perhaps the most difficult step is to actually remember the new written forms of the characters, by which I mean being able to reproduce them with a writing utensil. Being able to do so will “cure” Chinese character amnesia to some extent, which is why this is a compelling goal to set.1 I did some research on the Chinese internet regarding the best ways to memorize and remember characters. Though each source suggested the clever use of some mnemonic or memory system, none of them escaped the fate of “just practice a lot.” So I set out to practice.

My first thought was Anki, the ultimate (and open-source) spaced repetition memory training program. I envisioned an Anki deck that prompted me with a simplified or traditional character (in the former case, potentially elaborating on meaning to avoid ambiguities), and I would have to write down the corresponding character. In the spirit of not reinventing the wheel, I quickly searched Anki for the type of deck I was looking for, and voilà, there it was! This deck contains 800 characters that differ between simplified and traditional forms. The card format is exactly what I was envisioning, although it does not come with animated stroke orders for simplified Chinese characters, so I would either have to look those up manually whenever I came across an unfamiliar one or write some code to scrape another set of gif animations from an online database. The characters also come with preliminary English definitions. Although definitions of single characters tend to be abstract, ambiguous, and not useful to me, I left them there as a reference for unfamiliar characters. Regardless, I’m glad I came across this before I attempted to create my own deck from scratch. My hope is that I can sort out as many of the characters from tables 1 and 2 of the Master List as possible (excluding radicals, of course; those 14 radicals can be learned separately). Hopefully, with little work, I can extract the decks I set out to create. For now, I will only be compiling the second table. I want to drill in those characters first before organizing a deck for the first table.

Compiling the Deck

As with any manual-labor-intensive data mining activity, I pulled up a new Excel spreadsheet, exported the downloaded Anki deck to a plain text file (with media references and all the other options checked), and pasted that into Excel. Since Anki conveniently exports in TSV (tab-separated values) format by default, Excel is able to accept it directly as input. Now, after writing about why I like the Taiwanese Zhuyin system more than Pinyin, I feel obligated to include Zhuyin in the spreadsheet as well. Of course, this won’t be manual; I’ll just have to write some Python code. As it turns out, there is a Python library called dragonmapper, which contains a transcriptions module that can convert between various types of Chinese transcription.2 Since the deck already came with Pinyin pronunciations, we can let dragonmapper handle the remaining conversions automatically.

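from dragonmapper import transcriptions as t
from pyperclip import copy, paste

zhuyin_separator = '; '  # assumed value; the original snippet never defined this name

# Each helper reads lines from the clipboard, converts every ';'-separated reading, and copies the result back.
z = lambda: copy('\n'.join([zhuyin_separator.join([t.pinyin_to_zhuyin(j.replace('v','ü').strip()) for j in i.split(';')]) for i in paste().split('\n')]))  # Pinyin -> Zhuyin
p = lambda: copy('\n'.join(['; '.join([t.to_pinyin(j.replace('v','ü').strip()) for j in i.split(';')]) for i in paste().split('\n')]))  # raw Pinyin -> Pinyin with tone marks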

Here’s my lazy Python code. Run it in the console in interactive mode, copy a column of raw Pinyin to the clipboard, and call z() to convert it into Zhuyin or p() to convert it into diacritic form. Granted, for Zhuyin conversion, there are some data points that the module does not understand, such as 哟 (yo), which require some manual treatment. That’s why it is not a good idea to write lazy code, but I somehow got away with it after a bit of trial and error.
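One hedged way to make that manual treatment less painful is to collect the failures instead of letting the first exception abort the whole batch. The sketch below is not the code I ran above: `convert_lines` and its `converter` parameter are hypothetical names, and it assumes the converter raises an exception on input it does not understand. The stand-in `demo_converter` exists only so the example runs without any third-party package.

```python
def convert_lines(text, converter, sep="; "):
    """Convert every ';'-separated reading on every line of `text`.

    Readings the converter rejects are kept unchanged and reported
    separately so they can be fixed by hand afterwards.
    """
    out_lines, failures = [], []
    for line_no, line in enumerate(text.split("\n"), start=1):
        converted = []
        for reading in line.split(";"):
            reading = reading.replace("v", "ü").strip()
            try:
                converted.append(converter(reading))
            except Exception:  # assumption: the converter raises on unknown syllables
                converted.append(reading)
                failures.append((line_no, reading))
        out_lines.append(sep.join(converted))
    return "\n".join(out_lines), failures


# Stand-in converter for demonstration only; a real run would pass a real
# converter such as dragonmapper's transcriptions.pinyin_to_zhuyin.
def demo_converter(reading):
    table = {"han4": "ㄏㄢˋ", "nan2": "ㄋㄢˊ"}
    return table[reading]  # raises KeyError for anything not in the table


result, failed = convert_lines("han4\nnan2; yo", demo_converter)
print(result)
print(failed)
```

The second return value lists each unconverted reading with its line number, which is exactly the set of cells that still need attention in the spreadsheet.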

After sorting the second table, I found that some of the characters were missing. The Anki deck was built from the 800 or so most common characters that differ between their simplified and traditional forms. However, many of the characters in the second table aren’t “popular” themselves, though they appear inside other frequently used characters. Hence, I had to manually add the following characters:

Simplified Traditional
3
鹵,滷
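The check for missing characters can also be done in code rather than by eye. The following is a minimal sketch, assuming the deck export is tab-separated with the character of interest in a known column; `missing_from_deck`, the column layout, and the tiny sample export are all made up for illustration.

```python
import csv
import io


def missing_from_deck(table_chars, deck_tsv, column=0):
    """Return the characters in `table_chars` that never appear in the
    chosen column of a tab-separated deck export, in their original order."""
    reader = csv.reader(io.StringIO(deck_tsv), delimiter="\t", quoting=csv.QUOTE_NONE)
    in_deck = {row[column].strip() for row in reader if len(row) > column}
    return [c for c in table_chars if c not in in_deck]


# Tiny made-up export: simplified, traditional, pinyin (column layout is assumed).
sample_export = "汉\t漢\thàn\n难\t難\tnán\n"
print(missing_from_deck(["汉", "卤", "难"], sample_export))
```

Anything the function returns is a character from the table that still has to be added to the deck by hand.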

After some manual data mining, I became slightly dissatisfied with the stroke order format used by the deck. The website from which these stroke order animations were extracted has an incomplete database, and I also wanted stroke diagrams with a grid to help me visualize the character’s proportions.

Current stroke order animation sample (left) and an alternative that I found (right).

Now, the animation files in the new database are named by the decimal Unicode value of their corresponding Chinese characters. If I could automate the process of querying and downloading these animations, I could provide stroke order for both simplified and traditional characters, as opposed to having stroke order only for traditional characters as in the current deck. Again, with some Python magic, calling the ord() function on a Chinese character returns precisely the digits needed to access the corresponding file. I’ll spare the technical details, but after some experimenting, I was able to have Python automatically download the stroke order gifs, using the MDBG database as much as possible and resorting to other sources when necessary. Here is, once again, some Python code that I used in interactive mode to download all the gifs.

import urllib.request
from pyperclip import copy, paste

# Setup for various sources.
base_link = ["https://www.mdbg.net/chinese/rsc/img/stroke_anim/", "https://learnchineseez.com/read-write/images/"]
convert = [lambda c : str(ord(c)), lambda c : hex(ord(c))[2:]]

x = []
ex = [] # words that failed to download

def retrieveChar(c, alt=0):
    urllib.request.urlretrieve(base_link[alt]+convert[alt](c)+".gif", c+".gif")

def getList():
    global x
    x = [c.strip() for c in paste().split("\n")]

def checkDupe():
    global x
    n = len(x)
    x = list(set(x))
    print(f"removed {n-len(x)} duplicate(s)")

def downloadAll(alt=0):
    global ex
    for c in x:
        try:
            retrieveChar(c, alt=alt)
        except Exception: # usually an HTTP 404 error; a bare except would also swallow Ctrl+C
            print(f"Error downloading gif for character {c}")
            ex.append(c)
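To make the file naming concrete, here is a small sketch that builds the URLs without touching the network, plus one possible way to fall back to the second source automatically. It reuses the two base links and naming rules from the script above; `gif_url`, `download_with_fallback`, and the `fetch` parameter are hypothetical names, and the fallback loop is an illustration rather than the procedure I actually followed.

```python
# Same two sources and naming rules as in the script above.
BASE_LINKS = [
    "https://www.mdbg.net/chinese/rsc/img/stroke_anim/",
    "https://learnchineseez.com/read-write/images/",
]
CONVERTERS = [
    lambda c: str(ord(c)),      # decimal code point, e.g. 漢 -> "28450"
    lambda c: hex(ord(c))[2:],  # lowercase hex code point, e.g. 漢 -> "6f22"
]


def gif_url(char, alt=0):
    """Build the animation URL for one character from source number `alt`."""
    return BASE_LINKS[alt] + CONVERTERS[alt](char) + ".gif"


def download_with_fallback(chars, fetch):
    """Try each source in order; return the characters no source could serve.

    `fetch(url, filename)` is a hypothetical callable that raises on failure,
    for example urllib.request.urlretrieve.
    """
    failed = []
    for char in chars:
        for alt in range(len(BASE_LINKS)):
            try:
                fetch(gif_url(char, alt), char + ".gif")
                break  # this source worked, move on to the next character
            except Exception:  # e.g. an HTTP 404 from this source
                continue
        else:
            failed.append(char)  # every source raised
    return failed


print(gif_url("漢"))
print(gif_url("漢", alt=1))
```

Passing the download function in as `fetch` keeps the URL logic testable offline; in an interactive session it could be called as `download_with_fallback(x, urllib.request.urlretrieve)`.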

Now, all I had left to do was to redesign the card template and add the downloaded gif files to Anki’s media folder. After that, the deck was complete! I designed the deck to be as “symmetric” as possible, meaning that a Traditional Chinese user can also use it to learn Simplified Chinese. The only difference is a short “disambiguation” added to the front of the card whenever a simplified character corresponds to two or more traditional characters (simplification isn’t an injective map). Since I mostly borrowed from existing Anki decks online, I don’t want to publish my deck to AnkiWeb yet. You can download the Anki deck with the button below.

Download Anki Deck
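For anyone rebuilding the deck, the cards that need a disambiguation can be found mechanically by grouping traditional characters under their simplified form. This is only a sketch: `ambiguous_simplified` is a hypothetical helper, and the pair list is a hand-picked sample rather than the deck's actual contents.

```python
from collections import defaultdict


def ambiguous_simplified(pairs):
    """Map each simplified character that has more than one traditional
    counterpart to the list of those counterparts."""
    by_simplified = defaultdict(list)
    for simplified, traditional in pairs:
        if traditional not in by_simplified[simplified]:
            by_simplified[simplified].append(traditional)
    return {s: ts for s, ts in by_simplified.items() if len(ts) > 1}


# Hand-picked sample pairs (simplified, traditional), not the real deck data.
pairs = [("汉", "漢"), ("汇", "匯"), ("汇", "彙"), ("发", "發"), ("发", "髮")]
print(ambiguous_simplified(pairs))
```

Every key in the returned dictionary is a simplified character whose front card needs a hint about which traditional form is meant.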

Closing Remarks

That’s all for now. I might go on to compile another Anki deck for the first table if this proves effective.

  1. As Chinese is a logographic language that eventually came to be typed on a QWERTY keyboard for ease of communication, many Chinese language users, native or otherwise, have developed character amnesia (提筆忘字), causing them to forget how to physically write characters. Refer to this interesting study. 

  2. Check out its documentation here 

  3. The character 匯, which also simplifies to 汇, has already been included in the Anki deck.