
  Spencer Huddleston

Jim Breen: Behind the Scenes of WWWJDIC


One of our favorite sites is Jim Breen’s WWWJDIC. We have admiration for any service that’s been pushing the envelope with language and technology since most of us were still in primary school.

For those of you who aren’t Japanese language students, Jim’s WWWJDIC is the most comprehensive Japanese to English dictionary online. Its entries are user-generated and it’s completely open-source. As a college student on a budget in Tokyo, I didn’t have the money to buy an expensive $400 electronic dictionary that all the cool kids had. No sweat though, I could just flip open my keitai and access WWWJDIC’s mobile site for free, and I was good to go (only problem is my teachers thought I was texting in class!).

We asked Jim about the WWWJDIC project as well as the interaction between translators and technology in general:

What sparked your initial interest in Japan and the Japanese language?

JB: It all started around 1977, when my family was first introduced to the Suzuki Method for learning music, and my wife, a music teacher, began to use it. In November of 1981, the five Breens (our kids were then aged 10, 7 and 3) travelled to Japan to the Suzuki headquarters in Matsumoto. Upon first arriving, we found ourselves being greeted by a welcoming party from the Nagano Girl Scouts! We spent two months in Matsumoto living in two tiny 6-mat apartments, with my wife studying Suzuki flute pedagogy with Toshio Takahashi, and the kids having lessons in violin and piano (the former with Shin’ichi Suzuki himself!).

Back in Australia, I decided I didn’t want to go back to Japan until I could speak Japanese somewhat better, and read it too. It wasn’t until 1986, when I had become an academic, that I had the time to start studying properly, and I did three years of Japanese at Swinburne Institute (now University) of Technology in Melbourne. Swinburne’s course was innovative in that it concentrated on modern practical Japanese, and was taught entirely without use of romaji.

A longer version of this story is also available on Jim’s website.

How did the WWWJDIC project get started?

JB: As someone who had spent much of my life around computers, I had hankered to come to grips with handling Japanese text on computers. I had been told in Japan that it was too hard for Western computers to “do” Japanese because of the need for fonts, etc., so it was a refreshing surprise in late 1989 to discover that Mark Edwards at the University of Wisconsin was writing a free Japanese word-processor that could run on ordinary PCs. I downloaded Mark’s program, MOKE 1.0 (original documentation downloadable here), and from then on I was hooked.

MOKE came with a rudimentary Japanese-English dictionary file, which was expanded somewhat (1,900 entries) in the commercial 2.0 release. I had long been interested in the idea of a computerized dictionary; indeed, I had helped people at Swinburne publish a student dictionary. I wrote a C program that searched the MOKE dictionary file and displayed selected entries. Of course the file was too small, so I added several thousand new entries, and in early 1991 released the software (JDIC for DOS) and the expanded file as freeware. The rest, as they say, is history.

What sources do the dictionary’s entries come from?

JB: Initially from all over the place. Former students emailed their vocabulary lists, people sat down with Nelson and typed away, etc. These were the gung-ho early days. I wish I’d been aware of what lay ahead and been more critical. Now that it is a lot more mature, the entries usually come from people who notice that something is missing and send it in. Or they see something that is incomplete, wrong, etc. and amend it. At over 150,000 entries the basic Japanese lexicon is mostly there, so the new entries these days tend to be rather technical or historical.

How are new entries added and how do you check for accuracy?

JB: Since mid-2010 it has been maintained via a WWW-based system connected to a PostgreSQL database. You can get to it at the “JMdictDB database” here or simply by clicking on the Edit or Promote links in WWWJDIC.

Once a day the database is stripped and formatted into the several distribution formats, the main ones being the XML JMdict and the EDICT2 format used by WWWJDIC. These are installed in the various ftp archives and in WWWJDIC itself.
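To give a rough feel for what a flat distribution file like this makes possible, here is a minimal Python sketch that parses and searches EDICT2-style lines. It assumes each line has the shape `HEADWORD;HEADWORD [READING;READING] /gloss/gloss/.../`, with a trailing entry-ID field beginning with `EntL`. The sample lines, their IDs, and the names `Entry`, `parse_line` and `search` are invented for illustration; this is not the code WWWJDIC itself uses.

```python
import re
from typing import List, NamedTuple, Optional


class Entry(NamedTuple):
    headwords: List[str]  # kanji or kana forms, separated by ";" in the file
    readings: List[str]   # kana readings, empty for kana-only headwords
    glosses: List[str]    # English glosses, separated by "/" in the file


# Assumed line shape: HEADWORDS [READINGS] /gloss/gloss/.../
LINE_RE = re.compile(
    r"^(?P<head>[^\s\[]+)(?:\s+\[(?P<read>[^\]]*)\])?\s+/(?P<gloss>.*)/\s*$"
)

# Invented sample lines in the assumed format (IDs are placeholders).
SAMPLE_LINES = [
    "辞書 [じしょ] /(n) dictionary/lexicon/EntL0000001X/",
    "ああ /(int) ah!/oh!/EntL0000002X/",
    "猫 [ねこ] /(n) cat/EntL0000003X/",
]


def parse_line(line: str) -> Optional[Entry]:
    """Parse one line; return None if it does not match the assumed shape."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    headwords = m.group("head").split(";")
    readings = m.group("read").split(";") if m.group("read") else []
    # Drop empty fields and the trailing entry-ID field.
    glosses = [
        g for g in m.group("gloss").split("/")
        if g and not g.startswith("EntL")
    ]
    return Entry(headwords, readings, glosses)


def search(lines: List[str], term: str) -> List[Entry]:
    """Return entries whose headword or reading equals the term,
    or whose glosses contain it (case-insensitive)."""
    needle = term.lower()
    hits = []
    for line in lines:
        entry = parse_line(line)
        if entry is None:
            continue  # skip comments or malformed lines
        if (
            term in entry.headwords
            or term in entry.readings
            or any(needle in g.lower() for g in entry.glosses)
        ):
            hits.append(entry)
    return hits
```

Calling `search(SAMPLE_LINES, "dictionary")` or `search(SAMPLE_LINES, "ねこ")` returns the matching entries. A linear scan like this is fine for a sketch; a real service over 150,000 entries would more likely use an index.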

What’s the process for checking accuracy?

JB: All new entries and edits are classed as “pending” until an editor can check them, make amendments if necessary and approve them. We have several editors in the project. We usually check against a range of Japanese dictionaries (Kōjien, Daijirin, Daijisen, etc.) and the major Japanese – English dictionaries such as the big Kenkyusha. Subject-specific glossaries, etc. are used a lot. A lot of checking is done via the WWW too. Sometimes we’ll argue for days over the meaning of an entry.
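The pending-then-approved flow Jim describes can be sketched as a tiny state machine. This is only an illustration of the idea: the names `Status`, `Submission`, `approve` and `published` are hypothetical, and nothing here reflects the actual JMdictDB schema or code.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Status(Enum):
    PENDING = "pending"
    APPROVED = "approved"


@dataclass
class Submission:
    headword: str
    gloss: str
    status: Status = Status.PENDING  # every new entry or edit starts as pending
    notes: List[str] = field(default_factory=list)


def approve(
    sub: Submission, editor: str, amended_gloss: Optional[str] = None
) -> Submission:
    """An editor checks a pending submission, optionally amends it, then approves it."""
    if sub.status is not Status.PENDING:
        raise ValueError("only pending submissions can be approved")
    if amended_gloss is not None:
        sub.notes.append(
            f"{editor} amended gloss: {sub.gloss!r} -> {amended_gloss!r}"
        )
        sub.gloss = amended_gloss
    sub.status = Status.APPROVED
    sub.notes.append(f"approved by {editor}")
    return sub


def published(subs: List[Submission]) -> List[Submission]:
    """Only approved submissions would make it into the distributed files."""
    return [s for s in subs if s.status is Status.APPROVED]
```

The point of the design is that nothing a contributor submits reaches the published dictionary until an editor has looked at it, and any amendment leaves a note behind.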

A lot of mobile and other apps have been developed using WWWJDIC. Any favorites?

  

JB: I use an Android smartphone, and I have the WWWJDIC for Android (left) and AEdict (right) apps installed. The former uses WWWJDIC’s API to pull out data and the latter has a local data holding, so it can run without the network.

How do you find the way Japanese translators interact with technology has changed over the years?

JB: A lot. Hugely. It’s spelt out in a keynote presentation I gave in 2007 at IJET in Bath. To prepare for the presentation, I conducted a survey of 171 translators working to and from Japanese.

One aspect that might be surprising for younger Japanese learners is that until 1994, all email communication in the translation community was done exclusively using romaji, which these days—with most computers coming pre-installed with Japanese kanji/kana word processors—is completely unheard of. Of course, WWW search has become integral to the majority of translators – with over 70% of those I surveyed saying it is “indispensable.” Use of Translation Memory (TM) by Japanese translators has been a relatively recent phenomenon, with almost 50% of translators adopting it between 2005 and 2007, the year I conducted the survey. Translators are still pretty unimpressed with Machine Translation, however, with over 90%(!) of those surveyed saying they never use it to assist in their translations. (Actually I’m surprised that even 10% use it.)

Where do you see the future of technology and translation heading?

JB: To highlight some of the predictions I made in my presentation, I think there will continue to be steady improvement in the quality of statistical Machine Translation without ever quite reaching what would be considered “high quality.” More and more translators will be turning to Computer Assisted Translation (CAT), within which Translation Memory is just one tool.

Companies dealing with a lot of translation work are likely to move to more server-based systems with shared memories and glossaries. As these glossaries become increasingly comprehensive, they are likely to become quite valuable and there will be a big incentive to sell them.

The future for translators will feel much more “networked,” with the boundaries between voice/data/video becoming increasingly blurred. With the wealth of new technologies at translators’ fingertips, the potential for freelancers will continue to increase, but they’ll need to integrate more with those server systems. Of course I have to emphasize that most predictions in the IT area are wrong, so what I’m saying may not come to pass. Also IT is very subject to significant paradigm shifts (think of the PC, the Internet and the WWW), so possibly translators will be working with tools and processes which haven’t been thought of yet.

 


THE AUTHOR
Spencer Huddleston
As Gengo’s lead Account Executive, Spencer acquires and manages our US and European clients. Based in San Mateo, he was previously responsible for making data-driven decisions to improve overall speed, quality and capacity as part of our Crowd Operations team.