It's almost the end of the year already, and I don't think I've yet mentioned my honours project (and am too lazy/busy right now to check). So here is a brief description before I get to what I want to say. The title is something like "Machine Translation of SASL (South African Sign Language)" (it keeps changing) and what I'm mainly looking at is which sign notation to use as the middle step between video data and English text (there are quite a few notations). Go here if you're really interested.
So far, it has been one frustrating mountain after the other, like trying to walk when you're stuck in a slow motion capture. So. Frustratingly. Slow at making any progress at all.
One of the frustrating things has been the lack of available data. Even now that I (finally) have data, it is not enough. The people I got the software from recommend "800k+ sentence pairs (1.2m+ for 'difficult' language pairs)". Sign languages + spoken languages definitely fall into the "difficult language pairs" category, and so far I have about 5000 sentence pairs. Not quite a million, hey?
Now, 5000 sentences is a small amount for training a translation system, but it is a fair amount for reading through - something I have to do because I still need to clean the data; it is not all in the correct format. Sigh.
There is something pretty cool about this, though. The only parallel data I have been able to find is the ASL (American Sign Language) Bible. It's not the whole Bible, but a large portion of it (5000 or so verses) has been translated into ASL and written in SW (SignWriting), available at www.aslgospel.org. This is what I will be using to train my system.
So, I am currently reading through the English sentences and discarding any that aren't fit for translation (e.g. some of them are translated from archaic English versions of the Bible, thees and thous and chooseth and all - not quite the language I want to translate into) or that are duplicates, fragments, etc. This means I get to read the Bible and make progress on my thesis at the same time.
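For the curious, here is a rough sketch of how a first pass of that filtering could be automated before the manual read-through. Everything in it is an assumption for illustration: the archaic word list, the "-eth" suffix check with its small exception set, the three-word minimum for a fragment, and the `(english, sign)` tuple layout are all made up, not the actual format of the data.

```python
import re

# Hypothetical list of archaic markers; a real list would need to be longer.
ARCHAIC_WORDS = {"thee", "thou", "thy", "thine", "ye", "hath", "doth"}
# Words ending in "eth" that are not archaic verb forms (assumed, incomplete).
ETH_EXCEPTIONS = {"nazareth", "teeth", "elizabeth"}


def has_archaic(sentence):
    """Return True if the sentence contains an archaic-looking word."""
    for word in re.findall(r"[a-z']+", sentence.lower()):
        if word in ARCHAIC_WORDS:
            return True
        # Catch verb forms such as "chooseth" or "goeth".
        if len(word) > 4 and word.endswith("eth") and word not in ETH_EXCEPTIONS:
            return True
    return False


def is_fragment(sentence, min_words=3):
    """Treat very short sentences as fragments (threshold is an assumption)."""
    return len(sentence.split()) < min_words


def clean_pairs(pairs):
    """Drop archaic, fragmentary, and duplicate English sentences.

    `pairs` is an iterable of (english, sign) tuples; the first occurrence
    of each English sentence is kept, compared case-insensitively.
    """
    seen = set()
    kept = []
    for english, sign in pairs:
        english = " ".join(english.split())  # normalise whitespace
        key = english.lower()
        if not english or is_fragment(english) or has_archaic(english):
            continue
        if key in seen:
            continue
        seen.add(key)
        kept.append((english, sign))
    return kept
```

A filter like this is crude: it will miss archaic phrasing that uses ordinary words, and the exception set will never be complete, so it only narrows down what still has to be read by hand.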