Description
When used in search.htm,
the "Section" command requires only the first three
parameters and activates recognition of section
name references in search queries. See
the section "Restrict searched words to a section"
in Chapter 8 for details.
The "Section" command has no other purpose
in search.htm. The rest of this article
applies mostly to indexer.conf.
"string" is the section name and "number" is the section ID,
between 0 and 255. Use 0 for sections that you don't want to
index. It is better to use different section IDs
for different document parts. This way, at search
time you can give a different weight to each part
or even exclude some sections from the search.
The "maxlen" argument is the maximum length of the section
that will be stored in the database.
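As a minimal sketch of the three required parameters, the following indexer.conf fragment gives each document part its own ID. The "body" line matches the example at the end of this article. The "title" and "meta.description" section names and all lengths are assumptions chosen for illustration, so check them against your own configuration.

```
# Section <string> <number> <maxlen>
Section body 1 256
Section title 2 128
# ID 0: this section is not indexed
Section meta.description 0 64
```

With distinct IDs for "body" and "title", the two parts can be weighted or excluded independently at search time.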
"when" is an optional parameter defining when the
section should be created. Three values are possible:
- afterheaders - creates the section after processing of HTTP headers.
Using Section with the "afterheaders" parameter lets you replace the headers
returned by the HTTP server with your own values. For example, if the HTTP server
is misconfigured and returns "Content-Type: text/plain" for documents
that are in fact XML or HTML, or "Content-Type: application/octet-stream"
for Word or XLS documents, you can overwrite the "Content-Type" header
and thus force indexer to invoke the proper external or internal parser.
- afterguesser - creates the section after execution of the character set guesser.
In this mode a special variable, ${HTTP.LocalCharsetContent}, is additionally
available for use in the "expression" argument. It holds the raw document content
converted into the LocalCharset.
- afterparser - creates the section after extracting pieces of text
from the document (i.e. after removing tags in the case of HTML or XML),
but before breaking them into individual words. This is the default
value for the "when" parameter.
"cloneflag" is a flag describing whether the section
should affect clone detection. It can be "DetectClone"
(or "cdon"), or "NoDetectClone" (or "cdoff"). By default,
url.* section values are not taken into account for clone
detection, while all other sections take part in it.
"separator" is a string that separates section values. This is useful
for attribute sections.
"expression" and "replacement" can be used to extract user-defined
sections.
There is a special "User.Date" section. It makes it possible
to use a user-defined meta tag (or any other document part)
as an alternative "Last-Modified" value. A number of widespread formats are understood:
Sun, 06 Nov 1994 08:49:37 GMT
Sun, 6 Nov 1994 08:49:37 GMT
Sunday, 06-Nov-94 08:49:37 GMT
Sun Nov 6 08:49:37 1994
1994-11-06
06.11.1994
"nobody" is another section with a special meaning.
When parsing HTML documents, indexer
ignores words outside the <body> and </body> tags by default.
To activate indexing of these words, you can define a special section
"nobody", which should have the same ID and length as the section "body".
Making indexer see the words outside the body tags can be useful
to index a remote site with broken HTML pages that you can't modify,
or to index local HTML pages with SSI (server-side
include) directives directly from disk using the file:/// scheme,
even if the <body> and </body> tags are not in the HTML
pages themselves but in shared files included using SSI directives,
like <!--#include virtual="../include/top.html"-->.
For example:
Section body 1 256
Section nobody 1 256