Chapter 7. Languages support

Table of Contents
Character sets
Making multi-language search pages
Segmenters for Chinese, Thai and Japanese languages
Multilingual servers support

Character sets

Supported character sets

mnoGoSearch supports almost all known 8-bit character sets (hereafter called charsets) as well as some multi-byte charsets, including Korean EUC-KR, Chinese Big5 and GB2312, Japanese Shift-JIS, EUC-JP and ISO-2022-JP, and UTF-8. Some multi-byte charsets are not supported by default because their conversion tables are rather large and considerably increase the size of the executable files. See the configure parameters to enable support for these charsets.

mnoGoSearch also supports the following Macintosh character sets: MacCE, MacCroatian, MacGreek, MacRoman, MacTurkish, MacIceland, MacRomania, MacThai, MacArabic, MacHebrew, MacCyrillic, MacGujarati.

Several languages in one database

It is often necessary to deal with several languages simultaneously. The number of supported languages depends on the choice of charset that mnoGoSearch will use to store data. The charset is specified by the LocalCharset command.

UTF-8 mode

When UTF-8 is specified in the LocalCharset command, you may work with any language supported by Unicode, that is, with any number of more than 650 languages. However, UTF-8 may consume considerably more disk space (up to twice as much for some languages), which slows down both indexing and searching.
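
To get a feel for this overhead, the following minimal Python sketch compares the encoded size of the same short strings in an 8-bit charset and in UTF-8. The sample strings and the helper name encoded_sizes are illustrative assumptions; the actual growth of a mnoGoSearch database depends on the indexed content and the storage mode.

```python
# Compare how many bytes the same text occupies in an 8-bit charset and in UTF-8.
# The sample strings are arbitrary illustrations, not mnoGoSearch data.
samples = {
    "English": ("search engine", "iso-8859-1"),
    "Russian": ("поисковая система", "koi8-r"),
    "Greek": ("μηχανή αναζήτησης", "iso-8859-7"),
}


def encoded_sizes(text, charset):
    """Return (bytes in the 8-bit charset, bytes in UTF-8) for text."""
    return len(text.encode(charset)), len(text.encode("utf-8"))


for language, (text, charset) in samples.items():
    legacy, utf8 = encoded_sizes(text, charset)
    print(f"{language}: {charset}={legacy} bytes, UTF-8={utf8} bytes")
```

Latin text takes the same space in both encodings, while Cyrillic and Greek letters take two bytes each in UTF-8, which is where the "up to twice" figure comes from.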

non-UTF-8 mode

Since every character set includes Latin characters, any character set supports at least two languages - English and one or more other languages. US-ASCII is an exception - it supports only Latin characters.

Note: When using mnoGoSearch in standard (non-UTF-8) mode, you may use as many languages as you like, provided they all belong to the same language group.

Table 7-1. Language groups

Language group | Languages | Character sets
Group 1 | Western Europe: Albanian, Catalan, Danish, Dutch, English, Faeroese, Finnish, French, Galician, German, Icelandic, Italian, Norwegian, Portuguese, Spanish, Swedish | ASCII 8, CP437, CP850, CP860, CP1252, ISO 8859-1, ISO 8859-15, MacRoman, MacIceland
Group 2 | Eastern Europe: Croatian, Czech, Estonian, Hungarian, Latvian, Lithuanian, Polish, Romanian, Slovak, Slovene | CP852, CP1250, ISO 8859-2, MacCentralEurope, MacRomania, MacCroatian
Group 4 | Baltic | CP1257, ISO-8859-4, ISO-8859-13
Group 5 | Cyrillic: Bulgarian, Belorussian, Macedonian, Russian, Serbian, Ukrainian | CP855, CP866, CP1251, ISO 8859-5, Koi8-r, Koi8-u, MacCyrillic
Group 6 | Arabic | CP864, CP1256, ISO 8859-6, MacArabic
Group 7 | Greek | CP869, CP1253, ISO 8859-7, MacGreek
Group 8 | Hebrew | CP1255, ISO 8859-8, MacHebrew
Group 9 | Turkish | CP857, CP1254, ISO 8859-9, MacTurkish
Group 101 | Japanese | Shift-JIS, EUC-JP, ISO-2022-JP
Group 102 | Simplified Chinese (PRC) | GB2312
Group 103 | Traditional Chinese (ROC) | Big5
Group 104 | Korean | EUC-KR
Group 105 | Thai | CP874, TIS 620, MacThai
Group 106 | Vietnamese | CP1258
Group 107 | Indian | MacGujarati, TSCII
Group 108 | Georgian | geostd8
Unicode | Over 650 languages | UTF-8 (Unicode)

For example, if your search engine is configured with a LocalCharset from group 5 (Cyrillic), you may index servers containing documents in Bulgarian, Belorussian, Macedonian, Russian, Serbian and Ukrainian. Indexing a multi-language document encoded in UTF-8 is possible as well; however, the indexer will extract and save only the Cyrillic content of the page. To support all of the more than 650 languages, use LocalCharset UTF-8.

Recoding

The indexer recodes all documents to the character set specified by the LocalCharset command in indexer.conf. Internal recoding is implemented using Unicode. Please note that some recoding procedures may lose data. For example, recoding between any Greek charset and any Russian charset loses all national characters, because neither charset can represent the letters of the other. This does not matter for single-language sites. If you want to build a multilingual search engine, use UTF-8 as LocalCharset.
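
The data loss described above can be reproduced with a short Python sketch. It recodes bytes from one charset to another through Unicode, as the indexer is described to do. The helper name recode is hypothetical, and replacing unmappable characters with "?" is Python's behavior, assumed here for illustration; what mnoGoSearch substitutes for such characters is not stated in this chapter.

```python
def recode(data: bytes, charset_from: str, charset_to: str) -> bytes:
    """Recode bytes between two charsets via Unicode.

    Characters that do not exist in the target charset are replaced with '?'
    (Python's errors="replace" policy, an assumption for this sketch).
    """
    text = data.decode(charset_from)          # source charset -> Unicode
    return text.encode(charset_to, errors="replace")  # Unicode -> target charset


greek_page = "Athens: Αθήνα".encode("iso-8859-7")

# Greek -> Cyrillic charset: Latin letters survive, Greek letters are lost.
print(recode(greek_page, "iso-8859-7", "koi8-r"))

# Greek -> UTF-8: nothing is lost.
print(recode(greek_page, "iso-8859-7", "utf-8").decode("utf-8"))
```

The Latin part of the text survives in both cases because, as noted earlier, every charset includes the Latin characters.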

Recoding at search time

You may use the BrowserCharset command to choose a charset which will be used to display search results. BrowserCharset may differ from LocalCharset.

Character sets aliases

Each charset is recognized under a number of aliases, because different web servers may report the same charset in different notations. For example, ISO-8859-2, ISO8859-2 and latin2 all denote the same charset. The search engine understands the following charset name aliases:

Table 7-2. Charsets aliases

ISO-2022-JP: ISO-2022-JP
ISO-8859-1: CP819, CSISOLATIN1, IBM819, ISO-8859-1, ISO-IR-100, ISO_8859-1, ISO_8859-1:1987, L1, LATIN1
ISO-8859-10: CSISOLATIN6, ISO-8859-10, ISO-IR-157, ISO_8859-10, ISO_8859-10:1992, L6, LATIN6
ISO-8859-11: ISO-8859-11, TIS-620, TIS620, TACTIS
ISO-8859-13: ISO-8859-13, ISO-IR-179, ISO_8859-13, L7, LATIN7
ISO-8859-14: ISO-8859-14, ISO-IR-199, ISO_8859-14, ISO_8859-14:1998, L8, LATIN8
ISO-8859-15: ISO-8859-15, ISO-IR-203, ISO_8859-15, ISO_8859-15:1998
ISO-8859-16: ISO-8859-16, ISO-IR-226, ISO_8859-16, ISO_8859-16:2000
ISO-8859-2: CSISOLATIN2, ISO-8859-2, ISO-IR-101, ISO_8859-2, ISO_8859-2:1987, L2, LATIN2
ISO-8859-3: CSISOLATIN3, ISO-8859-3, ISO-IR-109, ISO_8859-3, ISO_8859-3:1988, L3, LATIN3
ISO-8859-4: CSISOLATIN4, ISO-8859-4, ISO-IR-110, ISO_8859-4, ISO_8859-4:1988, L4, LATIN4
ISO-8859-5: CSISOLATINCYRILLIC, CYRILLIC, ISO-8859-5, ISO-IR-144, ISO_8859-5, ISO_8859-5:1988
ISO-8859-6: ARABIC, ASMO-708, CSISOLATINARABIC, ECMA-114, ISO-8859-6, ISO-IR-127, ISO_8859-6, ISO_8859-6:1987
ISO-8859-7: CSISOLATINGREEK, ECMA-118, ELOT_928, GREEK, GREEK8, ISO-8859-7, ISO-IR-126, ISO_8859-7, ISO_8859-7:1987
ISO-8859-8: CSISOLATINHEBREW, HEBREW, ISO-8859-8, ISO-IR-138, ISO_8859-8, ISO_8859-8:1988
ISO-8859-9: CSISOLATIN5, ISO-8859-9, ISO-IR-148, ISO_8859-9, ISO_8859-9:1989, L5, LATIN5
armscii-8: ARMSCII-8, ARMSCII8
big5: BIG-5, BIG-FIVE, BIG5, BIGFIVE, CN-BIG5, CSBIG5
cp1250: CP1250, MS-EE, WINDOWS-1250
cp1251: CP1251, MS-CYRL, WINDOWS-1251
cp1252: CP1252, MS-ANSI, WINDOWS-1252
cp1253: CP1253, MS-GREEK, WINDOWS-1253
cp1254: CP1254, MS-TURK, WINDOWS-1254
cp1255: CP1255, MS-HEBR, WINDOWS-1255
cp1256: CP1256, MS-ARAB, WINDOWS-1256
cp1257: CP1257, WINBALTRIM, WINDOWS-1257
cp1258: CP1258, WINDOWS-1258
cp437: 437, CP437, IBM437
cp850: 850, CP850, CSPC850MULTILINGUAL, IBM850
cp852: 852, CP852, IBM852
cp855: 855, CP855, IBM855
cp857: 857, CP857, IBM857
cp860: 860, CP860, IBM860
cp861: 861, CP861, IBM861
cp862: 862, CP862, IBM862
cp863: 863, CP863, IBM863
cp864: 864, CP864, IBM864
cp865: 865, CP865, IBM865
cp866: 866, CP866, CSIBM866, IBM866
cp869: 869, CP869, IBM869
cp874: CP874, WINDOWS-874
EUC-JP: CSEUCJP, EUC-JP, EUCJP, UJIS, X-EUC-JP
EUC-KR: CSEUCKR, EUC-KR, EUCKR
GB2312: CHINESE, CSGB2312, CSISO58GB231280, GB2312, GB_2312-80, ISO-IR-58
koi8-r: CSKOI8R, KOI8-R, KOI8R
koi8-u: KOI8-U, KOI8U
shift-JIS: CSSHIFTJIS, MS_KANJI, S-JIS, SHIFT-JIS, SHIFT_JIS, SJIS
cp367: ANSI_X3.4-1968, ASCII, CP367, CSASCII, IBM367, ISO-IR-6, ISO646-US, ISO_646.IRV:1991, US, US-ASCII
UTF8: UTF-8, UTF8
viscii: CSVISCII, VISCII, VISCII1.1-1
MacCyrillic: MACCYRILLIC, X-MAC-CYRILLIC
MacRoman: MACROMAN, MACINTOSH, CSMACINTOSH, MAC
MacCentralEurope: MACCENTRALEUROPE, MACCE
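
Conceptually, alias handling is a lookup from any known notation to one canonical charset name. The Python sketch below illustrates the idea with a few rows copied from Table 7-2. The function name canonical_charset and the case-insensitive matching are assumptions made for illustration; this is not the lookup code used by mnoGoSearch itself.

```python
# A few rows copied from Table 7-2; the remaining rows would be added the same way.
ALIASES = {
    "ISO-8859-2": ["CSISOLATIN2", "ISO-8859-2", "ISO-IR-101", "ISO_8859-2",
                   "ISO_8859-2:1987", "L2", "LATIN2"],
    "cp1251": ["CP1251", "MS-CYRL", "WINDOWS-1251"],
    "koi8-r": ["CSKOI8R", "KOI8-R", "KOI8R"],
    "UTF8": ["UTF-8", "UTF8"],
}

# Reverse index: upper-cased alias -> canonical charset name.
CANONICAL = {
    alias.upper(): name
    for name, aliases in ALIASES.items()
    for alias in aliases
}


def canonical_charset(label):
    """Return the canonical charset name for a label, or None if it is unknown."""
    return CANONICAL.get(label.strip().upper())


print(canonical_charset("latin2"))
print(canonical_charset("windows-1251"))
print(canonical_charset("x-unknown"))
```

Building the reverse index once keeps each lookup a single dictionary access, however many aliases a charset has.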

Document charset detection

The indexer detects document charsets in this order:

  1. "Content-type: text/html; charset=xxx"

  2. <META NAME="Content-Type" CONTENT="text/html; charset=xxx"> (for HTML documents) or

    <?xml version="1.0" encoding="xxx"?> (for XML documents)

    This second source can be switched off with the GuesserUseMeta no command in your indexer.conf.

  3. The default charset given by the "Charset" setting of the corresponding Server or Realm command.
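
The detection order above can be sketched in Python as follows. This is a simplified illustration, not the indexer's parser: the function name detect_charset, the use_meta flag (standing in for GuesserUseMeta) and the regular expressions are all assumptions, and the document body is taken as already-decoded text for brevity.

```python
import re

# Simplified patterns for the three places a charset label can appear.
HEADER_RE = re.compile(r"charset\s*=\s*[\"']?([\w.:-]+)", re.IGNORECASE)
META_RE = re.compile(
    r"<meta[^>]+content-type[^>]*charset\s*=\s*[\"']?([\w.:-]+)", re.IGNORECASE)
XML_RE = re.compile(
    r"<\?xml[^>]*encoding\s*=\s*[\"']([\w.:-]+)[\"']", re.IGNORECASE)


def detect_charset(content_type, body, default_charset, use_meta=True):
    """Pick a charset using the order: HTTP header, META/XML declaration, default."""
    # 1. The Content-type HTTP header.
    if content_type:
        match = HEADER_RE.search(content_type)
        if match:
            return match.group(1)
    # 2. The META tag (HTML) or the XML declaration, unless switched off.
    if use_meta:
        match = META_RE.search(body) or XML_RE.search(body)
        if match:
            return match.group(1)
    # 3. The configured default for the server.
    return default_charset


html = '<META NAME="Content-Type" CONTENT="text/html; charset=koi8-r">'
print(detect_charset("text/html; charset=ISO-8859-2", html, "cp1251"))
print(detect_charset("text/html", html, "cp1251"))
print(detect_charset("text/html", html, "cp1251", use_meta=False))
```

The three calls exercise the three steps in turn: the header wins when it names a charset, the META tag is used when the header is silent, and the configured default applies when META inspection is disabled.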

Automatic charset guesser

Since version 3.2.0, mnoGoSearch has an automatic charset and language guesser. It currently recognizes more than 100 charsets and languages. Charset and language detection is implemented using the "N-Gram-Based Text Categorization" technique. There are a number of so-called "language map" files, one for each language-charset pair. By default they are installed in the /usr/local/mnogosearch/etc/langmap/ directory. Look there for the list of currently provided charset-language pairs. The guesser works well for texts longer than 500 characters; shorter texts may not be guessed reliably.
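
The general idea behind N-gram-based text categorization can be shown with a toy Python sketch: each language sample is reduced to a ranked list of its most frequent character n-grams, and a document is assigned to the language whose ranking differs least from its own. Everything here is an assumption made for illustration (the function names, the tiny sample texts, the profile keys such as en.ascii, and the rank-difference distance); the actual mguesser algorithm and the language map file format may differ.

```python
from collections import Counter


def ngram_profile(text, max_n=3, top=300):
    """Return the most frequent character n-grams of text, most frequent first."""
    counts = Counter()
    for word in text.lower().split():
        padded = f"_{word}_"  # mark word boundaries
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Sort by descending count, then alphabetically, so the ranking is deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [gram for gram, _ in ranked[:top]]


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams missing from lang_profile get a fixed penalty."""
    ranks = {gram: rank for rank, gram in enumerate(lang_profile)}
    penalty = len(lang_profile)
    return sum(
        abs(rank - ranks[gram]) if gram in ranks else penalty
        for rank, gram in enumerate(doc_profile)
    )


def guess(text, profiles):
    """Return the name of the profile closest to the text."""
    doc = ngram_profile(text)
    return min(profiles, key=lambda name: out_of_place(doc, profiles[name]))


# Tiny hypothetical training samples; real language maps are built from far more text.
SAMPLES = {
    "en.ascii": "the quick brown fox jumps over the lazy dog and then runs away from the farm",
    "de.ascii": "der schnelle braune fuchs springt ueber den faulen hund und rennt dann vom hof weg",
}
PROFILES = {name: ngram_profile(text) for name, text in SAMPLES.items()}

print(guess("the dog runs over the farm", PROFILES))
```

With samples this small the result for arbitrary input is not reliable, which mirrors the remark above that short texts may not be guessed well.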

Build your own language maps

To build your own language map, use the mguesser utility. You also need to collect files with language samples in the desired charset. To create a new language map, use the following command:


        mguesser -p -c charset -l language < FILENAME > language.charset.lm

You can also use the mguesser utility to guess a document's language and charset from the existing language maps. To do this, use the following command:


        mguesser [-n maxhits] < FILENAME

For some languages, you may use several different charsets. To convert from one charset supported by mnoGoSearch to another, use the mconv utility:


        mconv [OPTIONS] -f charset_from -t charset_to [configfile] < infile > outfile

By default, both mguesser and mconv utilities are installed into the /usr/local/mnogosearch/sbin/ directory.

Since version 3.2.14, mnoGoSearch can update language and charset maps automatically while indexing, provided the remote server supplies pages with an explicitly specified language and charset. To enable this feature, add the command

        LangMapUpdate yes

to your indexer.conf file.

Default charset

Use the RemoteCharset indexer.conf command to choose the default charset of indexed servers.

Default Language

You can set the default language for Servers by using the DefaultLang indexer.conf command. This is useful if you later want to restrict search results by language.