Announcing dongsa.net

I have been working on a program that, when given an infinitive, can conjugate Korean verbs. You can see it in action at dongsa.net (Thanks for the great domain name recommendation, Sangwhan!). My original attempt in Erlang never panned out (but it was good functional programming practice). I restarted a couple of months ago using Python (a language that I'm much more familiar with) and it's now very close to passing a database of 115812 conjugations.

According to ohloh.net it would've cost $18,542 to develop this software commercially.

Features

dongsa.net is currently focused on just providing explanations about conjugations and pronunciation guides for Korean verbs.
It is currently ~784 lines of Python and ~1007 lines of unit tests written in Python.
It handles almost all irregulars I could find: ㅅ, ㅂ, 르, ㅎ, ㄷ, and ㄹ irregulars as well as the just downright irregular 이다, 아니다 and 푸다. The database tests (see below) revealed that there are still some ㅎ verbs it can't handle properly: they change ㅎ -> ㅔ instead of ㅎ -> ㅐ. I'm investigating whether there is a pattern or whether this will require a lookup table.
It displays information about why a verb is conjugated a particular way. For example, the declarative past informal high form of 모르다 (to not know) is explained like this:

르 irregular stem change [모르 -> 몰라]

vowel contraction [ㅏ + ㅏ -> ㅏ] (몰라 + 았 -> 몰랐)

join (몰랐 + 어 -> 몰랐어)

join (몰랐어 + 요 -> 몰랐어요)

Testing

Since it deals with something as murky as a natural language it has been a rough ride getting this program to handle all the verbs in Korean. That said, I am sure that the core is pretty solid because I developed it by writing the tests first (a method called Test Driven Development). I have the book "500 Korean Verbs" written by Bryan Park and I made a bunch of unit tests using the data I extracted from the spreadsheet that someone made from the book:

$ ./test
.....................................................................
.....................................................................
.....................................................................
.....................................................................
.....................................................................
.....................................................................
.....................................................................
.................................................
Name                   Stmts   Exec  Cover   Missing
----------------------------------------------------
hangeul_utils             50     50   100%   
index                     33     26    78%   26-27, 46-50, 53-54
korean_conjugator        294    293    99%   298
korean_pronunciation      87     87   100%   
korean_stemmer            23     20    86%   14, 16, 18
----------------------------------------------------
TOTAL                    487    476    97%   
----------------------------------------------------------------------
Ran 532 tests in 3.113s

OK

I made heavy use of nose's ability to create tests from generator functions. In the pronunciation test I yield from a long list of pronunciation samples:

for x, y in [(u'국물',       u'궁물'),
             ...
             (u'앉아',       u'안자'),
             (u'잃어버리다', u'이러버리다'),
             (u'앉는',       u'안는'),
             (u'닮다',       u'담다'),
             (u'닮아',       u'달마'),
             (u'못하다',     u'모타다'),
             (u'학교',       u'학꾜'),
             (u'손이',       u'소니'),
             (u'산에',       u'사네'),
             (u'돈을',       u'도늘'),
             (u'문으로',     u'무느로'),
             (u'좋은',       u'조은')
             ...]:
   yield check_pronunciation, x, y

And here's an example from the conjugation tests:

def test_model_verbs():
    yield check, declarative_present_informal_low, u'기다', u'겨'
    yield check, declarative_past_informal_low, u'기다', u'겼어'
    yield check, declarative_future_informal_low, u'기다', u'길거야'
    yield check, inquisitive_present_informal_low, u'기다', u'겨'
    yield check, inquisitive_past_informal_low, u'기다', u'겼어'
    yield check, propositive_present_informal_low, u'기다', u'겨'
    yield check, declarative_present_informal_high, u'기다', u'겨요'
    yield check, declarative_past_informal_high, u'기다', u'겼어요'
    yield check, declarative_present_informal_low, u'기다', u'겨'
    yield check, declarative_past_informal_low, u'기다', u'겼어'
    yield check, declarative_future_informal_low, u'기다', u'길거야'
    yield check, inquisitive_present_informal_low, u'기다', u'겨'

On top of the unit tests, I also semi-regularly run a complete check against a database I extracted from the spreadsheet mentioned above. The latest run was 114382 correct conjugations / 115812 total conjugations (98.77% accurate). The failures have several causes: some of the data is incorrect, some entries are not contracted (the conjugation algorithm currently only produces contracted forms), and of course sometimes the algorithm is wrong. I'm going to make the database test smarter by having it mark whether a verb has been conjugated properly in a particular tense at least once, which should weed out the verbs that have not been contracted in the database. If a conjugation fails it is printed as an assert so I can put it in the tests. Here are the results from the last run. Currently it prints out the infinitive and what is expected. I should also add what dongsa.net currently returns so that a human tester can more easily determine what is happening.
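A database check along these lines could look roughly like the sketch below. The names check_database, rows and conjugate are made up for illustration and are not the project's actual code. The sketch is written for Python 3, and it assumes each row is a (tense, infinitive, expected) tuple and that failures are printed in the same "yield check" form used by the unit tests above.

```python
def check_database(rows, conjugate):
    """Compare conjugate(tense, infinitive) against every expected form.

    rows and conjugate are hypothetical stand-ins: rows is a list of
    (tense, infinitive, expected) tuples, conjugate is any callable that
    returns the conjugated form as a string.
    """
    correct = 0
    failures = []
    for tense, infinitive, expected in rows:
        actual = conjugate(tense, infinitive)
        if actual == expected:
            correct += 1
        else:
            # Formatted so the line can be pasted into a generator test;
            # the trailing comment records what was actually returned.
            failures.append("yield check, %s, u'%s', u'%s'  # got u'%s'"
                            % (tense, infinitive, expected, actual))
    accuracy = 100.0 * correct / len(rows) if rows else 0.0
    return correct, accuracy, failures
```

With the counts from the latest run, 100.0 * 114382 / 115812 rounds to the 98.77% quoted above.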

Lessons learned

I learned a lot about both Korean verbs and programming through this project. Here are the highlights:

I kept bouncing back and forth between the belief that Korean is a highly regular language and the belief that it is a highly irregular language. In the end it turns out that, like most languages, there are classes of irregulars that all act the same and a few oddballs that do their own thing. Here's a table of model verbs and the number of verbs that conjugate similarly in Korean:

기다	1869
지다	1144
가깝다	976
적다	729
남다	683
가다	503
내다	464
만들다	456
부르다	177
살다	163
주다	163
외우다	140
보다	137
까맣다	119
애쓰다	119
바르다	115
뛰다	110
되다	108
걷다	90
하다	90
오다	75
잇다	66
서다	63
담그다	32
쓰다	24
눕다	23
그러다	15
푸르다	14
켜다	12
낫다	4
누르다	4
깨닫다	2
돕다	1
아니다	1
이다	1
푸다	1

Verbs that lose a ㅅ are still treated as if they have a padchim: 낫다 -> 나아 (not 나). This is one of those language points that you just pick up and use without thinking about it. 낫다 was hardcoded in my mind. Now I am aware of why it is conjugated this way. dongsa.net informs you when this happens:

infinitive 낫다
declarative present informal low 나아    
1. ㅅ irregular (낫 -> 나 [hidden padchim])
2. join (나 + 아 -> 나아)

This is implemented by subclassing unicode and keeping a flag that records whether the last character has a hidden padchim. When a slice is taken from the string, the flag is only transferred if the slice includes the last character.
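Here is a minimal sketch of that idea. It is not the project's actual class: the name HiddenPadchimStr and the attribute hidden_padchim are made up for illustration, and it subclasses Python 3's str rather than Python 2's unicode.

```python
class HiddenPadchimStr(str):
    """A string that remembers whether its last syllable lost a padchim.

    Hypothetical illustration of the flag-carrying string described above.
    """

    def __new__(cls, value="", hidden_padchim=False):
        obj = super().__new__(cls, value)
        obj.hidden_padchim = hidden_padchim
        return obj

    def __getitem__(self, key):
        piece = super().__getitem__(key)
        # The flag survives only if the piece still contains the last character.
        keep = self.hidden_padchim and self._covers_last(key)
        return HiddenPadchimStr(piece, keep)

    def _covers_last(self, key):
        last = len(self) - 1
        if isinstance(key, slice):
            return last in range(*key.indices(len(self)))
        # Integer index; the modulo maps negative indexes onto positive ones.
        return key % len(self) == last
```

For example, with stem = HiddenPadchimStr("나", hidden_padchim=True), the slice stem[-1:] keeps the flag while stem[:0] drops it. Note that in this sketch concatenation with + returns a plain str, so a complete version would also need to decide what joining does to the flag.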

On the face of it, it looks like Korean has 2 base forms for verbs, but in reality there are 3 (the third is only rarely seen). I'm not sure what the exact linguistic term for this is. I believe the third form is there to keep vowel harmony (but I'm just pulling that out of the air, I'm an amateur linguist). The verb 돕다 is a good example of this. Here's what dongsa.net says about it:

base                                 돕
base2                                도오
base3                                도우
declarative present informal low     도와
imperative present informal high     도우세요

As you can see, base3 is used in the imperative present informal high but base2 is used in the declarative present informal low conjugation. For most verbs base2 and base3 are the same.

There are several verbs in Korean that, depending on their meaning, are conjugated as either regulars or irregulars:

일다     -- 일어 일러
곱다     -- 고와 곱아
파묻다   -- 파묻어 파물어
누르다   -- 눌러 누래
묻다     -- 물어 묻어
이르다   -- 일러 이르러
되묻다   -- 되물어 되묻어
썰다     -- 썰어 써려
붓다     -- 부숴 부어 부수어
들까불다 -- 들까불러 들까불어
굽다     -- 굽어 구워
걷다     -- 걷어 걸어
뒤까불다 -- 뒤까불러 뒤까불어
이다     -- 이야 여

The algorithm for choosing which vowel to append to a conjugation is slightly more complicated than I originally thought.

def find_vowel_to_append(string):
    for character in reversed(string):
        if character in [u'뜨', u'쓰', u'트']:
            return u'어'
        if vowel(character) == u'ㅡ' and not padchim(character):
            continue
        elif vowel(character) in [u'ㅗ', u'ㅏ', u'ㅑ']:
            return u'아'
        else:
            return u'어'
    return u'어' 


>>> from hangeul_utils import find_vowel_to_append
>>> print find_vowel_to_append(u'크')
어
>>> print find_vowel_to_append(u'먹')
어
>>> print find_vowel_to_append(u'알')
아
>>> print find_vowel_to_append(u'아프')
아

As you can see, most of the time it is sufficient to just look at the last vowel. However, if the stem ends in the neutral vowel ㅡ with no padchim, as is the case with 아프다, you have to look at the syllable before it to see if it has a non-neutral vowel. If it does, that vowel decides what to append, so 아프 -> 아파. In the case of 크다 there is no previous syllable, so it gets the default ㅓ. Stems whose deciding vowel is ㅗ, ㅏ, or ㅑ get ㅏ appended when conjugated; everything else gets ㅓ.
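The function above relies on the helpers vowel and padchim, which are not shown. The sketch below fills them in so the example runs on its own. The helper bodies are my assumption, not the project's code: they use the standard Unicode arithmetic for precomposed Hangul syllables (code point = 0xAC00 + (lead * 21 + vowel) * 28 + padchim). It is written for Python 3, unlike the Python 2 snippets in this post, and assumes the input contains only precomposed syllables.

```python
VOWELS = list("ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ")
TAILS = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")


def vowel(syllable):
    """Return the vowel of a precomposed syllable (assumed implementation)."""
    index = ord(syllable) - 0xAC00
    return VOWELS[(index // 28) % 21]


def padchim(syllable):
    """Return the final consonant, or '' if there is none (assumed implementation)."""
    return TAILS[(ord(syllable) - 0xAC00) % 28]


def find_vowel_to_append(stem):
    # Same logic as the function above, ported to Python 3 strings.
    for syllable in reversed(stem):
        if syllable in ("뜨", "쓰", "트"):
            return "어"
        if vowel(syllable) == "ㅡ" and not padchim(syllable):
            continue  # neutral vowel, no padchim: look one syllable back
        if vowel(syllable) in ("ㅗ", "ㅏ", "ㅑ"):
            return "아"
        return "어"
    return "어"  # nothing but skipped syllables, e.g. 크


for stem in ("크", "먹", "알", "아프"):
    print(stem, find_vowel_to_append(stem))
```

This prints the same four results as the interpreter session above: 어, 어, 아 and 아.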

Pronunciation

The pronunciation rules are applied in a single pass over the string, but every rule is evaluated, so the original string can be modified by several different rules. Here's an example:

먹었 [머걷]
# modified by change_padchim_pronunciation(changers=(u'ᆺ', u'ᆻ', u'ᆽ', u'ᆾ', u'ᇀ', u'ᇂ'), to=u'ᆮ')
먹었다 [머걷 + 다] -> [머거따]
# modified by change_padchim_pronunciation(changers=(u'ᆺ', u'ᆻ', u'ᆽ', u'ᆾ', u'ᇀ', u'ᇂ'), to=u'ᆮ')
# then modified by consonant_combination_rule(u'ᆮ', u'ᄃ', None, u'ᄄ')
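To make the single pass concrete, here is a small sketch that reproduces the two results above. It is not the project's implementation: the function names (liaison, neutralize_padchim, tensify, pronounce) are made up, it works on compatibility jamo rather than the conjoining jamo shown in the trace, and it only covers the three rules this example needs. In particular it makes no attempt to handle ㅎ or double padchim correctly. Python 3, precomposed syllables only.

```python
LEADS = list("ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ")
VOWELS = list("ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ")
TAILS = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")


def decompose(syllable):
    """Split a precomposed syllable into [lead, vowel, padchim]."""
    index = ord(syllable) - 0xAC00
    return [LEADS[index // 588], VOWELS[(index % 588) // 28], TAILS[index % 28]]


def compose(lead, vowel, tail):
    return chr(0xAC00 + (LEADS.index(lead) * 21 + VOWELS.index(vowel)) * 28
               + TAILS.index(tail))


def liaison(cur, nxt):
    # A single padchim moves onto a following syllable that starts with ㅇ.
    if nxt and nxt[0] == "ㅇ" and cur[2] in LEADS and cur[2] != "ㅇ":
        nxt[0], cur[2] = cur[2], ""


def neutralize_padchim(cur, nxt):
    # ㅅ, ㅆ, ㅈ, ㅊ, ㅌ and ㅎ are all pronounced ㄷ in padchim position.
    if cur[2] in ("ㅅ", "ㅆ", "ㅈ", "ㅊ", "ㅌ", "ㅎ"):
        cur[2] = "ㄷ"


def tensify(cur, nxt):
    # Padchim ㄷ followed by lead ㄷ: no padchim, tense lead ㄸ.
    if nxt and cur[2] == "ㄷ" and nxt[0] == "ㄷ":
        cur[2], nxt[0] = "", "ㄸ"


RULES = (liaison, neutralize_padchim, tensify)


def pronounce(word):
    syllables = [decompose(s) for s in word]
    for i, cur in enumerate(syllables):
        nxt = syllables[i + 1] if i + 1 < len(syllables) else None
        for rule in RULES:  # every rule sees every position, in order
            rule(cur, nxt)
    return "".join(compose(*s) for s in syllables)


print(pronounce("먹었"), pronounce("먹었다"))
```

Because each rule mutates the syllables in place, a later rule sees the output of an earlier one. That is how the ㅆ in 먹었다 first becomes ㄷ and then combines with the following ㄷ.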

Stemming

Stemming was not an original goal of the project (there isn't even a way to get stems using the web interface). But, I had an epiphany while I was writing the conjugator and it was quick to implement it in the brute force way that I imagined. I guess that real Korean stemmers that have to be efficient either have a massive lookup table or a much more efficient algorithm than my stemmer employs.

The strategy used in the stem function is pretty simple. Start from the left, take a character at a time, and build a list of all conjugations that come from that character plus the preceding characters in the verb. If the input doesn't appear among those conjugations, the function adds another character and continues.

>>> print korean_stemmer.stem(u'안녕하세요', verbose=True)
으
안
앋
압
알
앗
안
아
안니
안느
안녕
안녇
안녑
안녈
안녓
안년
안녀
안녕흐
안녕하
Matches conjugation imperative present informal high of the verb 안녕하다

You might be wondering why there are so many strange stems that the stemmer tries (like "으"). Sometimes information is lost when a verb is conjugated (e.g. 짓다 -> 지). Other verbs go through destructive stem changes (e.g. 걷다 -> 걸어). In order to catch irregular verbs the function modifies each character that is appended with all the possible ways that stem-changing irregulars can change form. This is why the stemmer as it is implemented is so inefficient.

>>> print korean_stemmer.stem(u'지어', verbose=True)
즈
지
짇
집
질
짓
Matches conjugation propositive present informal low of the verb 짓다
>>> print korean_stemmer.stem(u'걸으세요', verbose=True)
그
걸
걷
Matches conjugation imperative present informal high of the verb 걷다
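The brute-force strategy can be sketched in a few lines. Everything here is a toy and none of it is the project's code: toy_conjugations is a hypothetical stand-in for the real conjugator that only knows one form (imperative present informal high) and treats every ㄷ stem as irregular, and candidate_stems only tries padchim swaps, not the other variants shown in the verbose output above. Python 3, precomposed syllables only.

```python
TAILS = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")


def padchim(syllable):
    return TAILS[(ord(syllable) - 0xAC00) % 28]


def with_padchim(syllable, tail):
    """Return the syllable with its padchim replaced ('' removes it)."""
    index = ord(syllable) - 0xAC00
    return chr(0xAC00 + index - index % 28 + TAILS.index(tail))


def candidate_stems(prefix):
    """Yield the prefix, then variants an irregular stem could have come from."""
    yield prefix
    for tail in ("ㄷ", "ㅂ", "ㄹ", "ㅅ"):
        variant = prefix[:-1] + with_padchim(prefix[-1], tail)
        if variant != prefix:
            yield variant


def toy_conjugations(verb_stem):
    """Hypothetical mini-conjugator: one form, simplified rules."""
    last = verb_stem[-1]
    tail = padchim(last)
    if tail == "ㄹ":    # ㄹ drops before 세요
        form = verb_stem[:-1] + with_padchim(last, "") + "세요"
    elif tail == "ㄷ":  # toy assumption: every ㄷ stem is ㄷ irregular
        form = verb_stem[:-1] + with_padchim(last, "ㄹ") + "으세요"
    elif tail:
        form = verb_stem + "으세요"
    else:
        form = verb_stem + "세요"
    return {"imperative present informal high": form}


def find_stem(word, conjugate=toy_conjugations):
    # Grow the prefix one character at a time, left to right.
    for end in range(1, len(word) + 1):
        for candidate in candidate_stems(word[:end]):
            for name, form in conjugate(candidate).items():
                if form == word:
                    return candidate + "다", name
    return None


print(find_stem("걸으세요"))
```

Under these toy rules, 걸으세요 is rejected for the candidate 걸 (which would give 거세요) and accepted for the variant 걷, mirroring the last example above. The cost is visible too: every prefix is conjugated several times over, which is why a real stemmer would want a lookup table or a smarter algorithm.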

The source can be had at GitHub. It's written in Python and should work in anything >= 2.5 and < 3.0.

If this floats your boat you might want to check out Hanjadic, a tool that I wrote to build my Korean vocabulary by learning the Korean pronunciation of Chinese characters.

Also, please be sure to check out Matt Strum's Hangeul Assistant. It's great to see other people working on stuff like this!