Eloquent Software Excellent utilities for Mac OS X
mikeash Site Admin
Joined: 11 Mar 2004 Posts: 127 Location: VA, USA
|
Posted: Thu Oct 27, 2005 2:24 pm Post subject: Asian languages word finding algorithm |
|
|
There's been some recent interest in how LiveDictionary figures out what to display when working with Japanese text, and I thought I'd post that here. The same algorithm is used for Chinese, Japanese, and Korean. This is probably not the best strategy for all three, but since I only know one of them well enough to actually come up with an algorithm, I made something decent for that one (Chinese) and hoped it worked well enough for the others.
If anybody with knowledge of any of these languages has doable suggestions for improving the word search, please post here or e-mail me. And make sure that you download the latest version first, as versions prior to 1.2 had a serious defect in their indexing of Japanese words.
Here is the algorithm:
The goal is to find the longest piece of text which includes the character under the cursor and which also has a match in the dictionary.
Dictionary lookups are done by passing a large amount of text to the dictionary, with the beginning of the text set to the first character of the candidate "word". The dictionary then does a search which returns the entries which have the largest match for the beginning of the text.
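A minimal sketch of such a lookup, not LiveDictionary's actual code. It assumes the index stores every trailing fragment (suffix) of every entry and that a match must be a whole fragment; the names `build_suffix_index` and `longest_match` are hypothetical.

```python
from collections import defaultdict

def build_suffix_index(entries):
    """Map every trailing fragment (suffix) of each entry to the entries it came from."""
    index = defaultdict(set)
    for entry in entries:
        for start in range(len(entry)):
            index[entry[start:]].add(entry)
    return index

def longest_match(index, text):
    """Return (matched_text, entries) for the longest prefix of `text` found in the index.

    Assumption: only whole indexed fragments count as a match.
    """
    for length in range(len(text), 0, -1):
        prefix = text[:length]
        if prefix in index:
            return prefix, sorted(index[prefix])
    return "", []

index = build_suffix_index(["AB", "DEF", "BCDE"])
print(longest_match(index, "EFGHI"))  # ('EF', ['DEF'])
```

Trying the longest prefix first means the first hit is also the largest match, so the loop can return immediately.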
So, LD first looks up the text starting from the character under the cursor. It then looks up the text starting from the character immediately to the left, and continues moving left in this way until it gets a match which does not include the character under the cursor. At this point, it can stop because the dictionary includes the trailing fragments of words, and so if this match doesn't reach the cursor, then no match starting further to the left will reach it either. It then returns the longest match found. I would have to look at the code to see what it does in case of a tie, but I believe that this is probably not a big issue in real-world text.
An example, with non-Japanese characters:
Say that the text is ABCDEFGHI, with the cursor on the E, and a dictionary of three words, AB, DEF, and BCDE.
LD first performs a lookup on EFGHI. The dictionary returns the largest match as EF, as part of the DEF entry.
| Code: |
    vvvvv  LD looks up this
ABCDEFGHI
   -^^     and finds this
|
Next, it looks up DEFGHI. The dictionary returns the largest match as DEF.
| Code: |
   vvvvvv  LD looks up this
ABCDEFGHI
   ^^^     and finds this
|
Next, it looks up CDEFGHI. The dictionary returns the largest match as CDE, as part of the BCDE entry.
| Code: |
  vvvvvvv  LD looks up this
ABCDEFGHI
 -^^^      and finds this
|
Next, it looks up BCDEFGHI. The dictionary returns the largest match as BCDE.
| Code: |
 vvvvvvvv  LD looks up this
ABCDEFGHI
 ^^^^      and finds this
|
Finally, it looks up ABCDEFGHI. The dictionary returns the largest match as AB.
| Code: |
vvvvvvvvv  LD looks up this
ABCDEFGHI
^^         and finds this
|
Since this match does not include the character under the cursor, the search is terminated. The largest match found during the search was BCDE, and BCDE's entry is displayed. |
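The walkthrough above can be reproduced with a short sketch. This is an illustration under assumptions, not the real implementation: `lookup` and `find_word` are hypothetical names, a match must equal an entry or a whole trailing fragment of one, and on a tie the first match found is kept (the post does not say what LD actually does).

```python
def lookup(entries, text):
    """Longest prefix of `text` that equals an entry or a trailing fragment of one."""
    fragments = {e[i:] for e in entries for i in range(len(e))}
    for length in range(len(text), 0, -1):
        if text[:length] in fragments:
            return text[:length]
    return ""

def find_word(entries, text, cursor):
    """Walk left from the cursor, keeping the longest match that covers it."""
    best_start, best_match = cursor, ""
    for start in range(cursor, -1, -1):
        match = lookup(entries, text[start:])
        if start + len(match) <= cursor:
            # The match ends before the cursor, so no start further left can reach it.
            break
        if len(match) > len(best_match):
            # Strict '>' keeps the earlier-found match on a tie (an assumption).
            best_start, best_match = start, match
    return best_start, best_match

# Cursor on the E (index 4) of ABCDEFGHI, same three-word dictionary as above.
print(find_word(["AB", "DEF", "BCDE"], "ABCDEFGHI", 4))  # (1, 'BCDE')
```

The loop stops at the lookup starting from A, because AB ends before the cursor, exactly as in the last step of the example.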
|
gary
Joined: 01 Apr 2005 Posts: 32
|
Posted: Fri Jul 28, 2006 12:32 pm Post subject: |
|
|
Hi Mike,
I mentioned some of this in another thread.
Short words are a problem. For instance, I think CEDICT has 6 definitions for the word 'to', yet LiveDictionary finds toad, toast, tobacco, tock, today, and Trinidad and Tobago. And see what happens when 'a' is translated to Chinese. It should be yi ge, yi, and ge.
'Largest' also doesn't find a match.
So far I'm going by comparison with HanziBar. I believe it also uses CEDICT. It might be worth purchasing and comparing. Reading the forums would probably be very helpful. |
|
mikeash Site Admin
Joined: 11 Mar 2004 Posts: 127 Location: VA, USA
|
Posted: Mon Jul 31, 2006 11:42 pm Post subject: |
|
|
If I type "????" and point LiveDictionary at it, it finds the Apple Computer entry with no problem. However, it doesn't find it with just "??", nor does it find it on the English "Apple Computer". I'm not entirely sure why. I'll look into it.
I guess phpbb doesn't like Chinese. The first thing in quotes is "ping guo dian nao", and the second thing is "ping guo". |
|
gary
Joined: 01 Apr 2005 Posts: 32
|
Posted: Mon Jul 31, 2006 11:49 pm Post subject: |
|
|
| mikeash wrote: | If I type "????" and point LiveDictionary at it, it finds the Apple Computer entry with no problem. However, it doesn't find it with just "??", nor does it find it on the English "Apple Computer". I'm not entirely sure why. I'll look into it.
I guess phpbb doesn't like Chinese. The first thing in quotes is "ping guo dian nao", and the second thing is "ping guo". |
My bad. I was looking at a Traditional Chinese page while using the Simplified dictionary.
As for the other problem, I hope that someday "my" doesn't match myopia, mycotoxin, or mycoplasma pneumonia. |
|
gary
Joined: 01 Apr 2005 Posts: 32
|
Posted: Thu Aug 03, 2006 3:20 pm Post subject: |
|
|
Hi Mike,
I noticed that sometimes words are not defined. For instance, on Wikipedia, the first character of a title isn't displayed. If you go to the Chinese Wikipedia and look at the page title (shou3 ye4), shou3 isn't defined. It seems to happen with all Wikipedia titles.
I also found that CEDICT often displays explanations in parentheses. For instance:
ji2 - "extremely/pole (geography, physics)/utmost/top"
ji1 fu3 - "Kiev (capital of Ukraine)"
Yet sometimes a definition, like the one for si1, is "(phonetic)/this/".
If a definition consists only of something in parentheses, it seems good to include it; but if a definition merely contains parentheses, what's inside them should be excluded.
Gary |
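One way to read the parentheses suggestion above, as a rough sketch. It assumes the rule is applied to each slash-separated sense separately and that parentheses are not nested; `trim_senses` is a hypothetical helper, not part of LiveDictionary or CEDICT.

```python
import re

# Matches one parenthesized remark plus any whitespace before it (no nesting).
PAREN = re.compile(r"\s*\([^()]*\)")

def trim_senses(definition):
    """Drop parenthesized remarks from each sense, unless the sense is nothing else."""
    senses = [s.strip() for s in definition.split("/") if s.strip()]
    trimmed = []
    for sense in senses:
        stripped = PAREN.sub("", sense).strip()
        # Keep the original sense when stripping would leave it empty.
        trimmed.append(stripped if stripped else sense)
    return "/".join(trimmed)

print(trim_senses("extremely/pole (geography, physics)/utmost/top"))  # extremely/pole/utmost/top
print(trim_senses("Kiev (capital of Ukraine)"))                       # Kiev
print(trim_senses("(phonetic)/this/"))                                # (phonetic)/this
```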
|
gary
Joined: 01 Apr 2005 Posts: 32
|
Posted: Fri Aug 11, 2006 3:33 pm Post subject: |
|
|
Mike,
Curious how the LiveDictionary development has gone. Since learning Chinese is pretty much my top priority at the moment, I eagerly await new versions of LD. I'd be happy even with betas if anything has changed.
One of the main things I'm hoping for is speed. It'd be great if LiveDictionary lookups were pretty much instant. |
|