Mime -- defines an external parser for a given mime type



Mime {from_mime} {to_mime} {command line} [source]


Mime enables parsing of documents whose mime type is not text/plain, text/html or text/xml. Those three types have built-in parsers.

Documents of other mime types are processed with the help of external parsers: external programs that convert documents of arbitrary types to one of the above types natively supported by mnoGoSearch.

The from_mime and to_mime parameters are standard mime types.

to_mime should be one of the natively supported types (listed above) and can optionally include a charset= part. If the charset= part is omitted, the parser output is assumed to be in LocalCharset.
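The charset rule can be modelled in a few lines. The following Python sketch is not mnoGoSearch code: the function name split_to_mime and the local_charset argument are hypothetical, and it only illustrates how a to_mime value splits into a type and an effective charset.

```python
from email.message import Message
from typing import Tuple


def split_to_mime(to_mime: str, local_charset: str) -> Tuple[str, str]:
    """Split a to_mime value into (mime type, effective charset).

    Hypothetical helper: local_charset stands in for the LocalCharset
    setting and is used when to_mime has no charset= part.
    """
    header = Message()
    header["Content-Type"] = to_mime
    return header.get_content_type(), header.get_content_charset(local_charset)


with_charset = split_to_mime("text/plain; charset=cp1251", "utf-8")
without_charset = split_to_mime("text/plain", "utf-8")
```

Here with_charset carries the explicit cp1251, while without_charset falls back to the value passed as local_charset.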

By default, when executing a parser, indexer sends the document data to the parser's STDIN and reads the result from the parser's STDOUT.
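A parser written for this default mode is simply a filter. The sketch below is a hypothetical parser in Python, not one shipped with mnoGoSearch. The function names are made up, and the "conversion" (dropping blank lines) is only a placeholder for real format handling.

```python
import io


def convert(raw: bytes) -> bytes:
    """Placeholder conversion: strip each line and drop blank ones."""
    lines = (line.strip() for line in raw.splitlines())
    return b"\n".join(line for line in lines if line) + b"\n"


def run_filter(src, dst) -> None:
    """Read the whole document from src and write the converted text to dst."""
    dst.write(convert(src.read()))


# In-memory demonstration. A real parser script would instead import sys
# and call run_filter(sys.stdin.buffer, sys.stdout.buffer).
demo_out = io.BytesIO()
run_filter(io.BytesIO(b"  first line \n\n second line\n"), demo_out)
```

Because such a script never needs a file name, its Mime command line would not contain a $1 reference.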

Some parsers cannot operate on STDIN and need a file. The command line parameter can contain a $1 reference, which stands for a temporary file name. If $1 is specified, indexer creates a temporary file, writes the input data to it, and substitutes the temporary file name for the $1 reference in the parser command line.
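The $1 mechanism can be pictured with a small model. This Python sketch only imitates the steps described above and is not how indexer is implemented. The function name substitute_temp_file and the temporary file prefix are assumptions, and the real file naming and cleanup are not covered here.

```python
import os
import tempfile
from typing import Tuple


def substitute_temp_file(command_line: str, data: bytes) -> Tuple[str, str]:
    """Write data to a temporary file and put its name in place of $1.

    Returns the resulting command line and the temporary file path so the
    caller can remove the file after the parser has finished.
    """
    fd, path = tempfile.mkstemp(prefix="parser-input-")
    with os.fdopen(fd, "wb") as tmp:
        tmp.write(data)
    return command_line.replace("$1", path), path


command, tmp_path = substitute_temp_file("pdftotext $1 -", b"%PDF-1.4 sample")
with open(tmp_path, "rb") as check:
    written = check.read()
os.remove(tmp_path)  # clean up the demonstration file
```

After the call, command holds the pdftotext invocation with a real file name where $1 was, and the parser output would still be read from STDOUT.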

The command line can also use variables, for example ${URL} or ${Content-Type}. The full list of available variables appears in the indexer -v6 output, in the lines with the "Response." prefix.

The fourth parameter, source, is optional. It specifies what kind of data is sent to the parser. By default, indexer sends the raw document content. With the help of the source parameter you can mix document content with other kinds of data, for example its URL or an HTTP header, using the same variable notation as in the command line parameter. Raw content is available as ${HTTP.Content}.

Note: To make ${HTTP.Content} available, use the Section HTTP.Content 0 0 command.
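To see what a parser would receive when a source template is given, the expansion can be modelled as a plain text substitution. This Python sketch is an illustration only: the function name expand_source is hypothetical, the sample values are invented, and replacing unknown variables with an empty string is an assumption rather than documented indexer behavior.

```python
import re
from typing import Dict


def expand_source(template: str, variables: Dict[str, str]) -> str:
    """Replace every ${Name} reference in template with its value.

    Assumption: a variable missing from the mapping expands to "".
    """
    def lookup(match: "re.Match[str]") -> str:
        return variables.get(match.group(1), "")

    return re.sub(r"\$\{([^}]+)\}", lookup, template)


parser_input = expand_source(
    "${URL} # ${Content-Type} # ${HTTP.Content}",
    {
        "URL": "http://example.com/doc",         # invented sample value
        "Content-Type": "application/mytype2",
        "HTTP.Content": "raw document bytes",    # stands in for the document
    },
)
```

In this model the parser reads one string in which the URL, the content type and the raw content are separated by " # ", so it has to split its input accordingly.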


Mime application/msword      "text/plain; charset=cp1251"  "catdoc $1"
Mime application/x-troff-man  text/plain                    "deroff"
Mime text/x-postscript        text/plain                    "ps2ascii"
Mime application/pdf          text/plain                    "pdftotext $1 -"
Mime application/vnd.ms-excel text/plain                    "xls2csv $1"
Mime "text/rtf*"              text/html                     "rthc --use-stdout $1 2>/dev/null"

# A parser example with variables in its command line
Mime application/mytype       text/html    "myparser -u ${URL} -t ${Content-Type} $1"

# Mixing content with URL and HTTP headers
Section HTTP.Content 0 0
Mime application/mytype2      text/html    "myparser2"   "${URL} # ${Content-Type} # ${HTTP.Content}"

See also

AddType, DefaultContentType, UseRemoteContentType.