Glen Smith

The Groovy community is a very international one… so it’s no surprise that GroovyBlogs ends up getting entries in all sorts of languages, particularly Spanish. As someone with no exposure to non-English applications or any i18n stuff (shame on me, I know), all these UTF-8 issues have my head spinning…

So I’ve been thinking that some of those non-English entries look pretty interesting. Surely there must be some Java translation APIs out there I can call to do an in-place translation? Even a web service would be nice. You’d think so, but you’d be wrong. The best I can manage without violating terms of service is to link you off to Google, for now. But how do you work out what language the source document is written in?

At the moment I’m using textcat, which is a cute little library for doing just that. It comes bundled with a small selection of languages, but the jar seems to contain signatures for a whole lot more. Just need to work out how to activate them… The API is practically a one-liner:

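// TextCategorizer lives in the textcat jar; the no-arg constructor uses its bundled language set
def guesser = new org.knallgrau.utils.textcat.TextCategorizer()
// categorize() returns the name of the best-matching language as a String
def category = guesser.categorize(yourString)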

which, for a Spanish entry, will return the string “spanish”. Combine that with Groovy hashmaps, templates and Google linking goodness, and you get something like:

[Screenshot: Translation in action at GroovyBlogs]

where that Translate link will send you off to Google Translate with the “Spanish to English” option. Detection is still not so accurate for some languages (notably Portuguese), but fixing that may just be a matter of activating the correct rulesets that come bundled with the jar.
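
For the curious, the hashmap-plus-link part might be sketched roughly like this. The map contents, the `translateLink` name and the Google Translate URL parameters (`sl`, `tl`, `u`) are assumptions for illustration, not the actual GroovyBlogs code, and the category is passed in as a plain string so the snippet runs without the textcat jar:

```groovy
// Hypothetical map from textcat category names to Google language codes.
def languageCodes = [spanish: 'es', german: 'de', french: 'fr', italian: 'it']

// Build a Google Translate link for an entry, or return null when the
// detected language is not in the map (English entries, for example).
def translateLink = { String category, String entryUrl ->
    def code = languageCodes[category]
    if (!code) return null
    // java.net.URLEncoder is available without an import in Groovy
    def encoded = URLEncoder.encode(entryUrl, 'UTF-8')
    "https://translate.google.com/translate?sl=${code}&tl=en&u=${encoded}".toString()
}

println translateLink('spanish', 'http://example.com/entrada')
```

Returning null for unmapped languages means the template can simply skip the Translate link when there is nothing sensible to point at.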

Anyways, kudos to the textcat guys for a great little library. Now back to working out how to get Portuguese happening… And sorting out my current UTF-8 encoding woes…