mnoGoSearch 3.2.43 reference manual: Full-featured search engine software
Chapter 7. Languages support
Unlike Western languages, Chinese, Thai and Japanese are written without spaces between the words of a phrase. When indexing documents in these languages, the indexer therefore needs to segment phrases into words.
For Japanese phrase segmenting, either ChaSen, a morphological analysis system for the Japanese language, or MeCab, a Japanese morphological analyzer, is used. One of these systems must be installed before you configure and build mnoGoSearch.
To enable Japanese phrase segmenting, pass the switch for the installed analyzer (ChaSen or MeCab) to configure.
For Chinese phrase segmenting, a frequency dictionary of Chinese words is used. Segmenting is done with a dynamic programming method that chooses the split maximizing the cumulative frequency of the produced words.
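The idea of frequency-maximizing segmentation can be illustrated with a minimal Python sketch. This is not mnoGoSearch's actual code: the function name segment, the max_word_len limit, the toy frequencies, and the rule that an unknown single character counts as a word with frequency 0 are all assumptions made for illustration.

```python
from typing import Dict, List


def segment(phrase: str, freq: Dict[str, int], max_word_len: int = 4) -> List[str]:
    """Split phrase into words so that the sum of their frequencies is maximal.

    Hypothetical helper: unknown single characters are kept as words with
    frequency 0 so that every phrase has at least one segmentation.
    """
    n = len(phrase)
    # best[i]: highest cumulative frequency over all segmentations of phrase[:i]
    best = [0] + [-1] * n
    # prev[i]: start index of the last word in the best segmentation of phrase[:i]
    prev = [0] * (n + 1)

    for end in range(1, n + 1):
        for start in range(max(0, end - max_word_len), end):
            word = phrase[start:end]
            if word in freq:
                score = freq[word]
            elif end - start == 1:
                score = 0  # assumption: unknown character stands alone
            else:
                continue  # unknown multi-character string is not a word
            total = best[start] + score
            if total > best[end]:
                best[end] = total
                prev[end] = start

    # Walk the back-pointers from the end of the phrase to recover the words.
    words: List[str] = []
    end = n
    while end > 0:
        start = prev[end]
        words.append(phrase[start:end])
        end = start
    words.reverse()
    return words


# Toy frequencies, invented for this example only.
TOY_FREQ = {"北京": 50, "大学": 40, "北京大学": 120, "北": 5, "京": 5, "大": 10, "学": 8}

if __name__ == "__main__":
    print(segment("北京大学", TOY_FREQ))
```

With these toy numbers the single word 北京大学 (120) beats 北京 + 大学 (90), so the phrase is kept as one word. A real dictionary would supply the frequencies instead of TOY_FREQ.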
To enable Chinese phrase segmenting, enable the matching charset support when configuring mnoGoSearch: GB2312 to use mandarin.freq, the simplified Chinese dictionary, or Big5 to use TraditionalChinese.freq, the traditional Chinese dictionary. Then specify the frequency dictionary of Chinese words with the LoadChineseList command in the indexer.conf file:
LoadChineseList [charset dictionaryfilename]
The GB2312 charset and the mandarin.freq dictionary are used by default.
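As an illustration, an indexer.conf line that selects the traditional Chinese dictionary might look like the fragment below. The charset and file name are the ones mentioned above. That the bare file name resolves without a full path is an assumption, so adjust the path to your installation.

```
# Assumption: the dictionary file is found without an explicit path
LoadChineseList Big5 TraditionalChinese.freq
```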
For Thai phrase segmenting, a frequency dictionary of Thai words is used, and segmenting is done the same way as for Chinese. The dictionary is specified with the LoadThaiList command:
LoadThaiList [charset dictionaryfilename]
The TIS-620 charset and the thai.freq dictionary are used by default.