Eloquent Software Forum Index Eloquent Software
Excellent utilities for Mac OS X
 

Asian languages word finding algorithm

 
mikeash
Site Admin


Joined: 11 Mar 2004
Posts: 127
Location: VA, USA

Posted: Thu Oct 27, 2005 2:24 pm    Post subject: Asian languages word finding algorithm

There's been some recent interest in how LiveDictionary figures out what to display when working with Japanese text, and I thought I'd post that here. The same algorithm is used for Chinese, Japanese, and Korean. This is probably not the best strategy for all three, but since I only know one of them well enough to actually come up with an algorithm, I made something decent for that one (Chinese) and hoped it worked well enough for the others.

If anybody with knowledge of any of these languages has doable suggestions for improving the word search, please post here or e-mail me. And make sure that you download the latest version first, as versions prior to 1.2 had a serious defect in their indexing of Japanese words.

Here is the algorithm:

The goal is to find the longest piece of text which includes the character under the cursor and which also has a match in the dictionary.

Dictionary lookups are done by passing a large amount of text to the dictionary, with the beginning of the text set to the first character of the candidate "word". The dictionary then does a search which returns the entries which have the largest match for the beginning of the text.
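A minimal Python sketch of that kind of lookup, assuming the dictionary is nothing more than an in-memory set of keys. The name longest_prefix_match and the sample keys are made up for illustration; LiveDictionary's actual index is not shown in this thread.

```python
def longest_prefix_match(text, keys):
    """Return the longest prefix of `text` that is in `keys`, or "" if none is."""
    # Try the longest candidate first and shrink it one character at a time.
    for length in range(len(text), 0, -1):
        candidate = text[:length]
        if candidate in keys:
            return candidate
    return ""


# Hypothetical keys, whole words only.
keys = {"AB", "DEF", "BCDE"}
print(longest_prefix_match("DEFGHI", keys))     # DEF
print(longest_prefix_match("ABCDEFGHI", keys))  # AB
```

A real dictionary would more likely use a sorted index or a trie than try every prefix length, but the result is the same: the longest key that the text begins with.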

So, LD first looks up the text starting from the character under the cursor. It then looks up the text starting from the character immediately to the left, and keeps moving the starting point left one character at a time until it gets a match which does not include the character under the cursor. At this point, it can stop because the dictionary includes the trailing fragments of words, and so if this match doesn't reach the cursor, then no match starting further left will reach it either. It then returns the longest match found. I would have to look at the code to see what it does in case of a tie, but I believe that this is probably not a big issue in real-world text.

An example, with non-Japanese characters:

Say that the text is ABCDEFGHI, with the cursor on the E, and a dictionary of three words, AB, DEF, and BCDE.

LD first performs a lookup on EFGHI. The dictionary returns the largest match as EF, as part of the DEF entry.

Code:
    vvvvv   LD looks up this
ABCDEFGHI
   -^^      and finds this


Next, it looks up DEFGHI. The dictionary returns the largest match as DEF.

Code:
   vvvvvv   LD looks up this
ABCDEFGHI
   ^^^      and finds this


Next, it looks up CDEFGHI. The dictionary returns the largest match as CDE, as part of the BCDE entry.

Code:
  vvvvvvv   LD looks up this
ABCDEFGHI
 -^^^       and finds this


Next, it looks up BCDEFGHI. The dictionary returns the largest match as BCDE.

Code:
 vvvvvvvv   LD looks up this
ABCDEFGHI
 ^^^^       and finds this


Finally, it looks up ABCDEFGHI. The dictionary returns the largest match as AB.

Code:
vvvvvvvvv   LD looks up this
ABCDEFGHI
^^          and finds this


Since this match does not include the character under the cursor, the search is terminated. The largest match found during the search was BCDE, and BCDE's entry is displayed.
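The walk-through above can be reproduced with a short Python sketch. This is only an illustration of the steps described in the post, not LiveDictionary's code: the names build_fragment_index, longest_match, and find_word are hypothetical, the index is a plain dict from every trailing fragment to the words it came from, and ties are resolved by keeping the first match found, which is an assumption since the post leaves tie handling open.

```python
def build_fragment_index(words):
    """Map every trailing fragment (suffix) of each word to the words it came from."""
    index = {}
    for word in words:
        for i in range(len(word)):
            index.setdefault(word[i:], []).append(word)
    return index


def longest_match(index, text, start):
    """Length of the longest index key that text[start:] begins with (0 if none)."""
    for end in range(len(text), start, -1):
        if text[start:end] in index:
            return end - start
    return 0


def find_word(index, text, cursor):
    """Return (matched text, dictionary words) for the longest match covering `cursor`."""
    best = None  # (start, length) of the longest match seen so far
    # Start at the cursor and move the starting point left one character at a time.
    for start in range(cursor, -1, -1):
        length = longest_match(index, text, start)
        if start + length <= cursor:
            break  # this match does not reach the cursor, so stop searching
        if best is None or length > best[1]:  # assumption: first match wins a tie
            best = (start, length)
    if best is None:
        return None
    start, length = best
    fragment = text[start:start + length]
    return fragment, index[fragment]


index = build_fragment_index(["AB", "DEF", "BCDE"])
print(find_word(index, "ABCDEFGHI", 4))  # ('BCDE', ['BCDE'])
```

The sketch only compares the number of matched characters. It does not check that the rest of a word is present when the match is a fragment such as EF.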
gary



Joined: 01 Apr 2005
Posts: 32

Posted: Fri Jul 28, 2006 12:32 pm

Hi Mike,

I mentioned some of this in another thread.

Short words are a problem. For instance, I think CEDICT has 6 definitions for the word 'to', yet LiveDictionary finds toad, toast, tobacco, tock, today, and Trinidad and Tobago. And see what happens when 'a' is translated to Chinese. :) It should be yi ge, yi, and ge.

'Largest' also doesn't find a match.

So far I'm going by comparison with HanziBar. I believe it also uses CEDICT. It might be worth purchasing and comparing. Reading the forums would probably be very helpful.
gary



Joined: 01 Apr 2005
Posts: 32

Posted: Mon Jul 31, 2006 6:58 am

Hi Mike,

If I find words that are in the dictionary but aren't found, is there a good place to post them? Maybe I'll make a new thread.

To start,

http://zh.wikipedia.org/wiki/%E8%8B%B9%E6%9E%9C%E7%94%B5%E8%84%91

Apple Computer (in Chinese) doesn't get found. It is in CEDICT.
mikeash
Site Admin


Joined: 11 Mar 2004
Posts: 127
Location: VA, USA

Posted: Mon Jul 31, 2006 11:42 pm

If I type "????" and point LiveDictionary at it, it finds the Apple Computer entry with no problem. However, it doesn't find it with just "??", nor does it find it on the English "Apple Computer". I'm not entirely sure why. I'll look into it.

I guess phpbb doesn't like Chinese. The first thing in quotes is "ping guo dian nao", and the second thing is "ping guo".
gary



Joined: 01 Apr 2005
Posts: 32

Posted: Mon Jul 31, 2006 11:49 pm

mikeash wrote:
If I type "????" and point LiveDictionary at it, it finds the Apple Computer entry with no problem. However, it doesn't find it with just "??", nor does it find it on the English "Apple Computer". I'm not entirely sure why. I'll look into it.

I guess phpbb doesn't like Chinese. The first thing in quotes is "ping guo dian nao", and the second thing is "ping guo".


My bad. I was looking at a Traditional Chinese page while using the Simplified dictionary.

As for the other problem, I hope that someday "my" doesn't match myopia, mycotoxin, or mycoplasma pneumonia. :)
gary



Joined: 01 Apr 2005
Posts: 32

Posted: Thu Aug 03, 2006 3:20 pm

Hi Mike,

I noticed that sometimes words are not defined. For instance, on Wikipedia, the first character of a title isn't displayed. If you go to Chinese Wikipedia and look at the page title (shou3 ye4), shou3 isn't defined. It seems to happen with all Wikipedia titles.

I also found that CEDICT often puts explanations in parentheses. For instance:

ji2 - "extremely/pole (geography, physics)/utmost/top"

ji1 fu3 - "Kiev (capital of Ukraine)"

Yet sometimes a definition like si1 is "(phonetic)/this/"

If a definition consists only of something in parentheses, it seems good to include it; but if a definition merely contains parentheses, what's inside them should be excluded.

Gary
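One way to read that suggestion, as a rough Python sketch. It assumes a CEDICT definition is a string of senses separated by slashes, as in the examples quoted in the post, and applies the rule to each sense separately, which is an interpretation rather than something the post spells out. The function name clean_senses is made up for illustration.

```python
import re

# Matches one parenthesised note together with the whitespace around it.
_PAREN_NOTE = re.compile(r"\s*\([^()]*\)\s*")


def clean_senses(definition):
    """Split a slash-separated definition and drop parenthesised notes.

    A sense that consists only of a parenthesised note is kept unchanged.
    """
    senses = [s.strip() for s in definition.split("/") if s.strip()]
    cleaned = []
    for sense in senses:
        stripped = _PAREN_NOTE.sub(" ", sense).strip()
        # Fall back to the original sense if nothing is left after stripping.
        cleaned.append(stripped if stripped else sense)
    return cleaned


print(clean_senses("extremely/pole (geography, physics)/utmost/top"))
# ['extremely', 'pole', 'utmost', 'top']
print(clean_senses("Kiev (capital of Ukraine)"))
# ['Kiev']
print(clean_senses("(phonetic)/this/"))
# ['(phonetic)', 'this']
```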
gary



Joined: 01 Apr 2005
Posts: 32

Posted: Fri Aug 11, 2006 3:33 pm

Mike,

I'm curious how LiveDictionary development has gone. Since learning Chinese is pretty much my top priority at the moment, I eagerly await new versions of LD. :) I'd be happy even with betas if anything has changed.

One of the main things I'm hoping for is speed. It'd be great if LiveDictionary lookups were pretty much instant.