I have been working on a program that conjugates Korean verbs when given an infinitive. You can see it in action at dongsa.net (thanks for the great domain name recommendation, Sangwhan!). My original attempt in Erlang never panned out, but it was good functional programming practice. I restarted a couple of months ago in Python, a language that I'm much more familiar with, and it is now very close to passing a database of 115812 conjugations.

According to ohloh.net it would've cost $18,542 to develop this software commercially.

Features

  1. 르 irregular stem change [모르 -> 몰라]
  2. vowel contraction [ㅏ + ㅏ -> ㅏ] (몰라 + 았 -> 몰랐)
  3. join (몰랐 + 어 -> 몰랐어)
  4. join (몰랐어 + 요 -> 몰랐어요)
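
Steps 2 to 4 above can be sketched in a few lines. This is a simplified illustration written for Python 3, not the code that runs on dongsa.net: the function name merge and the Unicode arithmetic are assumptions made for this example, and only the same-vowel contraction from step 2 is handled.

```python
HANGEUL_BASE = 0xAC00  # code point of 가, the first precomposed syllable
IEUNG = 11             # index of the silent initial ㅇ among the 19 initials


def merge(x, y):
    """Join y onto x, contracting when a vowel meets the same vowel."""
    last = ord(x[-1]) - HANGEUL_BASE
    first = ord(y[0]) - HANGEUL_BASE
    no_padchim = last % 28 == 0                # x ends in an open syllable
    starts_with_ieung = first // 588 == IEUNG  # y starts with a bare vowel
    same_vowel = last % 588 // 28 == first % 588 // 28
    if no_padchim and starts_with_ieung and same_vowel:
        # Keep the last syllable of x and give it the padchim of y's first syllable.
        return x[:-1] + chr(ord(x[-1]) + first % 28) + y[1:]
    return x + y


word = "몰라"
for ending in ["았", "어", "요"]:
    word = merge(word, ending)
print(word)  # 몰랐어요
```

Each precomposed syllable encodes its initial, vowel and padchim arithmetically, which is why the contraction needs no lookup table.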

Testing

Since it deals with something as murky as a natural language, it has been a rough ride getting this program to handle all the verbs in Korean. That said, I am confident that the core is pretty solid because I developed it by writing the tests first (a method called test-driven development). I have the book "500 Korean Verbs" by Bryan Park, and I made a bunch of unit tests from data I extracted from a spreadsheet that someone made from the book:

$ ./test
.....................................................................
.....................................................................
.....................................................................
.....................................................................
.....................................................................
.....................................................................
.....................................................................
.................................................
Name                   Stmts   Exec  Cover   Missing
----------------------------------------------------
hangeul_utils             50     50   100%   
index                     33     26    78%   26-27, 46-50, 53-54
korean_conjugator        294    293    99%   298
korean_pronunciation      87     87   100%   
korean_stemmer            23     20    86%   14, 16, 18
----------------------------------------------------
TOTAL                    487    476    97%   
----------------------------------------------------------------------
Ran 532 tests in 3.113s

OK

I made heavy use of nose's ability to create tests from generator functions. In the pronunciation test I yield from a long list of pronunciation samples:

for x, y in [(u'국물',       u'궁물'),
             ...
             (u'앉아',       u'안자'),
             (u'잃어버리다', u'이러버리다'),
             (u'앉는',       u'안는'),
             (u'닮다',       u'담다'),
             (u'닮아',       u'달마'),
             (u'못하다',     u'모타다'),
             (u'학교',       u'학꾜'),
             (u'손이',       u'소니'),
             (u'산에',       u'사네'),
             (u'돈을',       u'도늘'),
             (u'문으로',     u'무느로'),
             (u'좋은',       u'조은')
             ...]:
   yield check_pronunciation, x, y

And here's an example from the conjugation tests:

def test_model_verbs():
    yield check, declarative_present_informal_low, u'기다', u'겨'
    yield check, declarative_past_informal_low, u'기다', u'겼어'
    yield check, declarative_future_informal_low, u'기다', u'길거야'
    yield check, inquisitive_present_informal_low, u'기다', u'겨'
    yield check, inquisitive_past_informal_low, u'기다', u'겼어'
    yield check, propositive_present_informal_low, u'기다', u'겨'
    yield check, declarative_present_informal_high, u'기다', u'겨요'
    yield check, declarative_past_informal_high, u'기다', u'겼어요'

On top of unit testing, I also semi-regularly run a complete check against a database I extracted from the spreadsheet mentioned above. The latest run produced 114382 correct conjugations out of 115812 total (98.77% accurate). The failures have three causes: some of the data is incorrect, some verbs in the database are not contracted (the conjugation algorithm currently only produces contracted forms), and, of course, sometimes the algorithm is wrong. I'm going to make the database test smarter: it will record whether a verb has been conjugated correctly in a particular tense at least once, which will weed out the verbs that appear uncontracted in the database. If a conjugation fails, it is printed as an assert so I can put it in the tests. Here are the results from the last run. Currently it prints the infinitive and the expected conjugation. I should also print what dongsa.net currently returns so that a human tester can more easily determine what is happening.

Lessons learned

I learned a lot about both Korean verbs and programming through this project. Here are the highlights:

기다     1869
지다     1144
가깝다   976
적다     729
남다     683
가다     503
내다     464
만들다   456
부르다   177
살다     163
주다     163
외우다   140
보다     137
까맣다   119
애쓰다   119
바르다   115
뛰다     110
되다     108
걷다     90
하다     90
오다     75
잇다     66
서다     63
담그다   32
쓰다     24
눕다     23
그러다   15
푸르다   14
켜다     12
낫다     4
누르다   4
깨닫다   2
돕다     1
아니다   1
이다     1
푸다     1

infinitive 낫다
declarative present informal low 나아    
1. ㅅ irregular (낫 -> 나 [hidden padchim])
2. join (나 + 아 -> 나아)

This is implemented by subclassing unicode and keeping a flag that records whether the last character has a hidden padchim. When a slice is taken from the string, the flag is only transferred if the slice includes the last character.
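
Here is a minimal sketch of that idea, written for Python 3, where the subclass is of str rather than unicode. The class name Geulja and the attribute name hidden_padchim are chosen for this illustration and may not match the names in the source.

```python
class Geulja(str):
    """A string that remembers whether its last character hides a padchim."""

    def __new__(cls, value, hidden_padchim=False):
        obj = super().__new__(cls, value)
        obj.hidden_padchim = hidden_padchim
        return obj

    def __getitem__(self, index):
        result = str.__getitem__(self, index)
        if isinstance(index, slice):
            # Positions covered by the slice, computed without copying the string.
            positions = range(len(self))[index]
            includes_last = (len(self) - 1) in positions
        else:
            includes_last = index in (-1, len(self) - 1)
        # The flag only survives if the last character is part of the result.
        return Geulja(result, self.hidden_padchim and includes_last)


stem = Geulja("지나", hidden_padchim=True)  # hypothetical two-syllable example
print(stem[1:].hidden_padchim)  # True
print(stem[:1].hidden_padchim)  # False
```

Because the result of every slice is again a Geulja, the flag follows the last character around for as long as it stays in the string.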

base                                 돕
base2                                도오
base3                                도우
declarative present informal low     도와
imperative present informal high     도우세요

As you can see, base3 is used in the imperative present informal high but base2 is used in the declarative present informal low conjugation. For most verbs base2 and base3 are the same.

일다     -- 일어 일러
곱다     -- 고와 곱아
파묻다   -- 파묻어 파물어
누르다   -- 눌러 누래
묻다     -- 물어 묻어
이르다   -- 일러 이르러
되묻다   -- 되물어 되묻어
썰다     -- 썰어 써려
붓다     -- 부숴 부어 부수어
들까불다 -- 들까불러 들까불어
굽다     -- 굽어 구워
걷다     -- 걷어 걸어
뒤까불다 -- 뒤까불러 뒤까불어
이다     -- 이야 여
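def find_vowel_to_append(string):
    """Return the vowel (u'아' or u'어') to append to a verb stem."""
    for character in reversed(string):
        # 뜨, 쓰 and 트 are special-cased: return 어 without looking further back.
        if character in [u'뜨', u'쓰', u'트']:
            return u'어'
        # A neutral ㅡ without a padchim decides nothing; check the previous character.
        if vowel(character) == u'ㅡ' and not padchim(character):
            continue
        elif vowel(character) in [u'ㅗ', u'ㅏ', u'ㅑ']:
            return u'아'
        else:
            return u'어'
    # Only neutral vowels were found (e.g. 크), so fall back to 어.
    return u'어'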


>>> from hangeul_utils import find_vowel_to_append
>>> print find_vowel_to_append(u'크')
어
>>> print find_vowel_to_append(u'먹')
어
>>> print find_vowel_to_append(u'알')
아
>>> print find_vowel_to_append(u'아프')
아

As you can see, most of the time it is sufficient to just look at the last vowel. However, if the stem ends in the neutral vowel ㅡ with no padchim, as is the case with 아프다, you have to look at the character before that to see if it has a non-neutral vowel. If it does, that vowel decides what to append, so 아프 -> 아파. In the case of 크다 there is no previous character, so it gets the default ㅓ. Stems whose deciding vowel is ㅗ, ㅏ, or ㅑ get ㅏ appended when conjugated; all others get ㅓ.
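
The function above relies on two helpers, vowel and padchim. The following is a self-contained sketch of how they could be written with Unicode arithmetic, for Python 3. The bodies of the helpers are assumptions made for this example (in particular, padchim here returns a boolean) and are not necessarily how hangeul_utils implements them.

```python
# The 21 medial vowels in the order used by the Unicode Hangul syllable block.
VOWELS = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"


def vowel(character):
    """Return the vowel of a precomposed Hangul syllable."""
    return VOWELS[(ord(character) - 0xAC00) % 588 // 28]


def padchim(character):
    """Return True if the syllable has a final consonant."""
    return (ord(character) - 0xAC00) % 28 != 0


def find_vowel_to_append(string):
    for character in reversed(string):
        if character in ["뜨", "쓰", "트"]:
            return "어"
        if vowel(character) == "ㅡ" and not padchim(character):
            continue  # neutral vowel: look at the previous character
        elif vowel(character) in ["ㅗ", "ㅏ", "ㅑ"]:
            return "아"
        else:
            return "어"
    return "어"


for stem in ["크", "먹", "알", "아프"]:
    print(stem, find_vowel_to_append(stem))
```

This works because every syllable from 가 (U+AC00) onward is laid out as (initial * 21 + vowel) * 28 + padchim.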

Pronunciation

The pronunciation rules are applied in a single pass, but every rule is evaluated, so the original string can be modified by several different rules. Here's an example:

먹었 [머걷]
# modified by change_padchim_pronunciation(changers=(u'ᆺ', u'ᆻ', u'ᆽ', u'ᆾ', u'ᇀ', u'ᇂ'), to=u'ᆮ')
먹었다 [머걷 + 다] -> [머거따]
# modified by change_padchim_pronunciation(changers=(u'ᆺ', u'ᆻ', u'ᆽ', u'ᆾ', u'ᇀ', u'ᇂ'), to=u'ᆮ')
# then modified by consonant_combination_rule(u'ᆮ', u'ᄃ', None, u'ᄄ')
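
To make the single pass concrete, here is a rough sketch for Python 3. The two rule factories borrow their names from the comments above, but their bodies, the liaison rule, the pronounce driver and the use of compatibility jamo (instead of the conjoining jamo shown above) are all assumptions made for this example. It only handles strings made entirely of Hangul syllables.

```python
LEADS = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"
VOWELS = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"
TAILS = [None] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")


def decompose(syllable):
    offset = ord(syllable) - 0xAC00
    return [LEADS[offset // 588], VOWELS[offset % 588 // 28], TAILS[offset % 28]]


def compose(lead, vowel, tail):
    index = (LEADS.index(lead) * 21 + VOWELS.index(vowel)) * 28 + TAILS.index(tail)
    return chr(0xAC00 + index)


def liaison(x, y):
    # Assumed rule: a simple padchim moves onto a following silent ㅇ.
    if y is not None and y[0] == "ㅇ" and x[2] not in (None, "ㅇ") and x[2] in LEADS:
        y[0], x[2] = x[2], None


def change_padchim_pronunciation(to, changers):
    def rule(x, y):
        if x[2] in changers and (y is None or y[0] != "ㅇ"):
            x[2] = to
    return rule


def consonant_combination_rule(x_padchim, y_lead, new_padchim, new_lead):
    def rule(x, y):
        if y is not None and x[2] == x_padchim and y[0] == y_lead:
            x[2], y[0] = new_padchim, new_lead
    return rule


RULES = [
    liaison,
    change_padchim_pronunciation("ㄷ", ("ㅅ", "ㅆ", "ㅈ", "ㅊ", "ㅌ", "ㅎ")),
    consonant_combination_rule("ㄷ", "ㄷ", None, "ㄸ"),
]


def pronounce(word, rules=RULES):
    syllables = [decompose(character) for character in word]
    # Single pass over adjacent pairs; every rule is tried on every pair.
    for i, x in enumerate(syllables):
        y = syllables[i + 1] if i + 1 < len(syllables) else None
        for rule in rules:
            rule(x, y)
    return "".join(compose(*syllable) for syllable in syllables)


print(pronounce("먹었"))    # 머걷
print(pronounce("먹었다"))  # 머거따
```

Because the rules mutate the syllables in place, a later rule sees the output of an earlier one, which is how ㅆ first becomes ㄷ and then combines with the following ㄷ into ㄸ.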

Stemming

Stemming was not an original goal of the project (there isn't even a way to get stems using the web interface). However, I had an epiphany while I was writing the conjugator, and it was quick to implement in the brute-force way that I imagined. I guess that real Korean stemmers, which have to be efficient, use either a massive lookup table or a much more efficient algorithm than mine.

The strategy used in the stem function is pretty simple. Start from the left, take one character at a time, and build a list of all conjugations that can be produced from that character plus the preceding characters in the verb. If the input doesn't appear among the conjugations generated for that candidate stem, the function adds another character and continues.

>>> print korean_stemmer.stem(u'안녕하세요', verbose=True)
으
안
앋
압
알
앗
안
아
안니
안느
안녕
안녇
안녑
안녈
안녓
안년
안녀
안녕흐
안녕하
Matches conjugation imperative present informal high of the verb 안녕하다

You might be wondering why there are so many strange stems that the stemmer tries (like "으"). Sometimes information is lost when a verb is conjugated (e.g. 짓다 -> 지). Other verbs go through destructive stem changes (e.g. 걷다 -> 걸어). In order to catch irregular verbs the function modifies each character that is appended with all the possible ways that stem-changing irregulars can change form. This is why the stemmer as it is implemented is so inefficient.

>>> print korean_stemmer.stem(u'지어', verbose=True)
즈
지
짇
집
질
짓
Matches conjugation propositive present informal low of the verb 짓다
>>> print korean_stemmer.stem(u'걸으세요', verbose=True)
그
걸
걷
Matches conjugation imperative present informal high of the verb 걷다
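
The brute-force strategy can be sketched as follows, for Python 3. This is a simplified illustration, not the actual korean_stemmer code: TOY_CONJUGATIONS is a tiny hand-written stand-in for the real conjugator, and the candidates helper only tries a handful of padchim substitutions rather than every stem change the real stemmer tries.

```python
# Toy stand-in for the conjugator: candidate stem -> a few of its conjugations.
TOY_CONJUGATIONS = {
    "안녕하": {"안녕하세요", "안녕해요"},
    "짓": {"지어", "지으세요"},
    "걷": {"걸어", "걸으세요"},
}

# Padchim indices to try on the last character: none, ㄷ, ㅂ, ㄹ, ㅅ.
PADCHIM_VARIANTS = [0, 7, 17, 8, 19]


def candidates(prefix):
    """Yield the prefix itself plus variants with a different final padchim."""
    yield prefix
    last = ord(prefix[-1])
    without_padchim = last - (last - 0xAC00) % 28
    for tail in PADCHIM_VARIANTS:
        yield prefix[:-1] + chr(without_padchim + tail)


def stem(conjugated):
    # Grow the prefix one character at a time, left to right.
    for end in range(1, len(conjugated) + 1):
        for candidate in candidates(conjugated[:end]):
            if conjugated in TOY_CONJUGATIONS.get(candidate, ()):
                return candidate
    return None


print(stem("걸으세요"))    # 걷
print(stem("지어"))        # 짓
print(stem("안녕하세요"))  # 안녕하
```

Every candidate costs a full round of conjugation in the real program, which is where the inefficiency comes from.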

The source is available on GitHub. It's written in Python and should work with any version >= 2.5 and < 3.0.

If this floats your boat you might want to check out Hanjadic, a tool that I wrote to build my Korean vocabulary by learning the Korean pronunciation of Chinese characters.

Also, please be sure to check out Matt Strum's Hangeul Assistant. It's great to see other people working on stuff like this!