Indexer configuration is covered mostly by the indexer.conf-dist file.
You can find it in the /etc directory
of the mnoGoSearch installation
directory. You may also want to look at the
other *.conf samples
in the doc/samples directory of
the mnoGoSearch source distribution.
To set up the indexer.conf file,
go to the /etc directory of your
mnoGoSearch installation,
copy indexer.conf-dist to
indexer.conf and edit it using a text editor.
Typically, the DBAddr
command needs to be modified to match your database connection
parameters, and a new
Server command
describing your Web site needs to be added. The other default
indexer.conf commands are suitable
in most cases and do not need changes. The file
indexer.conf is well commented and
contains examples for the most important commands, so
you should find it easy to configure.
To configure the search front-end search.cgi,
copy the file search.htm-dist to search.htm and edit it.
Typically, only DBAddr
needs to be modified according to your database connection parameters,
similar to indexer.conf.
See the Section called How to write search result templates
in Chapter 10 for a more detailed description.
To create the SQL tables required by
mnoGoSearch, use indexer -Ecreate.
When started with this argument, indexer opens the file
containing the SQL statements necessary for creating all SQL tables
according to the database type and storage mode given in
the DBAddr command
in indexer.conf. The files with the SQL
scripts are typically installed to the /share
directory of the mnoGoSearch installation,
which is usually /usr/local/mnogosearch/share/mnogosearch/.
To drop all SQL tables created by mnoGoSearch,
use indexer -Edrop. The files with the SQL statements
required to drop all tables previously created by
mnoGoSearch are installed in the /share
directory of the mnoGoSearch installation.
Note:
When you need to remove all existing data
from the search database and crawl your sites from the very beginning,
you can use indexer -Edrop followed
by indexer -Ecreate instead of
truncating the existing tables (indexer -C).
In some databases, recreating the tables works faster than
truncating data from the existing tables.
Run indexer periodically
(once a week, a day, an hour...), depending
on how often changes on your sites happen.
You may find it useful to add indexer
to a cron job.
If you run indexer without any command
line arguments, it crawls only new and expired documents, while
fresh documents are not crawled. You can change the expiration time
using the Period
command in indexer.conf.
The default expiration period is one week.
If you need to crawl all documents, including the fresh ones
(i.e. without waiting for their expiration period),
use the -a command line option.
With this option, indexer marks all documents as expired at startup.
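The expiration rules described above can be modelled in a few lines of Python. This is only an illustrative sketch, not mnoGoSearch code: the Doc class, its fields and the helper functions are hypothetical names, and a new document is assumed to be one whose expiration time already lies in the past.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable

# Default expiration period according to the text above: one week.
DEFAULT_PERIOD = timedelta(weeks=1)


@dataclass
class Doc:
    """Hypothetical stand-in for one document in the crawl queue."""
    url: str
    next_index_time: datetime  # moment at which the document expires


def is_due(doc: Doc, now: datetime) -> bool:
    """A document is crawled only once its expiration time has passed."""
    return doc.next_index_time <= now


def after_crawl(doc: Doc, now: datetime,
                period: timedelta = DEFAULT_PERIOD) -> None:
    """After a crawl the document stays fresh for one period."""
    doc.next_index_time = now + period


def mark_all_expired(docs: Iterable[Doc], now: datetime) -> None:
    """Model of the -a option: every document becomes due immediately."""
    for doc in docs:
        doc.next_index_time = now
```

In this model, a document crawled now is skipped by a plain run for the next seven days, while calling mark_all_expired first makes it due again at once.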
If indexer gets a redirect
response (301, 302,
303 HTTP status), the URL from
the Location: HTTP header is added
into the database.
Note:
indexer
puts the redirect target
into its queue. It does not follow the redirect target
immediately after processing a URL with a redirect response.
When downloading documents, indexer tries
to do some optimization. It sends the
If-Modified-Since HTTP header for the
documents it has already downloaded (during previous crawling
sessions). If the HTTP server replies "304 Not
modified", only minor updates are made in the database.
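As an illustration of what such a conditional request looks like, the following Python sketch builds (but does not send) a request carrying the If-Modified-Since header. It uses only the Python standard library and is not how indexer itself is implemented; the function name and its parameters are hypothetical.

```python
from email.utils import formatdate
from urllib.request import Request


def conditional_request(url: str, last_fetch_epoch: float) -> Request:
    """Build a GET request that lets the server answer 304 if unchanged.

    last_fetch_epoch is the time of the previous download, in seconds
    since the Unix epoch.
    """
    # HTTP dates are expressed in GMT, e.g. "Thu, 01 Jan 1970 00:00:00 GMT".
    stamp = formatdate(last_fetch_epoch, usegmt=True)
    return Request(url, headers={"If-Modified-Since": stamp})
```

A server that supports conditional requests compares the date in the header with the modification time of the resource and replies 304 with an empty body when nothing has changed, which saves transferring the document again.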
When indexer downloads a document
(i.e. when it gets an "HTTP 200 Ok" response),
it calculates the document checksum using the crc32 algorithm.
If the checksum is the same as the previous checksum stored in the database,
indexer does not do a full update of the database
with the new information about this document.
This is also an optimization that improves
crawling performance.
The -m command line option prevents
indexer from sending the
If-Modified-Since headers and forces
a full update of the database even if the checksum is the same.
It can be useful if you have modified indexer.conf.
For example, when the Allow,
Disallow rules were
changed, or new Server
commands were added, and therefore you need indexer
to parse the old documents once again and add new links which
were ignored in the previous configuration.
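The interplay between the 304 response, the checksum comparison and the -m option can be summarised in a small decision function. The sketch below is a simplified model under stated assumptions, not the indexer implementation: the function names are hypothetical, zlib's CRC-32 is used as a stand-in for the crc32 algorithm mentioned above, and statuses other than 200 and 304 are deliberately left out.

```python
import zlib


def crc32_of(body: bytes) -> int:
    """CRC-32 of the document body (zlib's variant, assumed here)."""
    return zlib.crc32(body) & 0xFFFFFFFF


def update_kind(status: int, body: bytes, stored_crc: int,
                force: bool = False) -> str:
    """Return "minor" or "full" for the database update to perform.

    force models the -m option. With -m no If-Modified-Since header is
    sent, so a 304 reply is not expected in that mode.
    """
    if status not in (200, 304):
        raise ValueError("only 200 and 304 are covered by this sketch")
    if force:
        return "full"
    if status == 304:
        return "minor"
    # Status 200: compare the new checksum with the stored one.
    if crc32_of(body) == stored_crc:
        return "minor"
    return "full"
```

With force left at its default, an unchanged body leads to a minor update only, which is why links skipped under an old configuration are not picked up until the documents are parsed again with -m.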
Note:
Sometimes you may need to force reindexing of a document
(or a group of documents), that is, to force both downloading the document
(even when it has not expired yet) and updating the information about
the document in the database (even if the checksum has not changed).
You may find this command useful:
indexer -am -u http://site/some/document.html
indexer understands the -t, -u, and -s
command line options, which limit actions to only a part of the database.
-t sets a limit on
Tag,
-u sets a limit on URL substring
(using SQL LIKE wildcards),
and -s sets a limit on HTTP status.
All limit options can be specified multiple times.
All limit options of the same group are OR-ed,
and the groups are AND-ed. For example,
if you run indexer -s200 -s304 -u http://site1/% -u
http://site2/%, indexer will re-crawl
the documents having HTTP status 200 or
304, but only those from the site
http://site1/ or from the site
http://site2/.
Note:
The above command line is internally translated
into this SQL query when fetching URLs from the queue:
SELECT
<columns>
FROM
url
WHERE
status IN (200,304)
AND
(url LIKE 'http://site1/%' OR url LIKE 'http://site2/%')
AND
next_index_time <= <current_time>
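The grouping rule (OR within a group, AND between groups) can be sketched as a small Python function that assembles the limit part of such a WHERE clause. This is an illustration only: the function name is hypothetical, only the -u and -s groups are covered, and real code should pass values as bound parameters rather than build SQL text.

```python
def build_limit_clause(urls=(), statuses=()):
    """Combine limits: OR inside each group, AND between the groups."""
    groups = []
    if statuses:
        # Several -s options collapse into one IN (...) list.
        codes = ",".join(str(int(s)) for s in statuses)
        groups.append(f"status IN ({codes})")
    if urls:
        # Several -u options become LIKE conditions joined with OR.
        likes = " OR ".join(
            "url LIKE '{}'".format(u.replace("'", "''")) for u in urls
        )
        groups.append(f"({likes})")
    return " AND ".join(groups)
```

For the command line discussed above, build_limit_clause(urls=["http://site1/%", "http://site2/%"], statuses=[200, 304]) yields the status and url conditions of the query, to which the expiration condition is then added.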
To clear all information from the database,
use indexer -C.
By default, indexer asks you
to confirm that you really want to delete data
from the database.
$ indexer -C
You are going to delete content from the database(s):
pgsql://root@/root/?dbmode=blob
Are you sure?(YES/no)
You can use the
-w command line option
together with
-C to force deleting data
without asking for confirmation:
indexer -Cw.
You may also delete only a part of the database.
All subsection control options are taken into account
when deleting data. For example:
indexer -Cw -u http://site/%
will delete information about all documents from the
site
http://site/ without asking
for confirmation.
If you run indexer -S,
indexer will display the current database statistics,
including the number of total and expired documents for each HTTP
status:
$ indexer -S
Database statistics [2008-12-21 15:35:34]
Status    Expired      Total
-----------------------------
     0        883        971  Not indexed yet
   200          0        891  OK
   404          0       1585  Not found
-----------------------------
 Total        883       3447
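If you want to monitor these numbers from a script, the report can be parsed with a few lines of Python. The sketch below assumes the layout of the sample output shown above (a numeric status followed by the Expired and Total counters); the function name is hypothetical and the actual column spacing may differ between versions.

```python
def parse_statistics(text):
    """Parse an indexer -S report into {status: (expired, total)}."""
    rows = {}
    for line in text.splitlines():
        fields = line.split()
        # Data rows start with a numeric status and two numeric counters;
        # the header, separator and "Total" lines do not match.
        if len(fields) >= 3 and all(f.isdigit() for f in fields[:3]):
            rows[int(fields[0])] = (int(fields[1]), int(fields[2]))
    return rows
```

Because the lines are split on arbitrary whitespace, the function does not depend on how the columns are aligned.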
It is also possible to see database statistics for a certain
moment of time in the future using the
-j
command line argument, for example to check the expiration period of the documents.
-j understands time in the format
YYYY-MM[-DD[ HH[:MM[:SS]]]], or a time offset
from the current time in the same format as the
Period command.
For example, 7d12h means
seven days and 12 hours:
$ indexer -S -j 7d12h
Database statistics [2008-12-29 03:44:19]
Status    Expired      Total
-----------------------------
     0        971        971  Not indexed yet
   200        891        891  OK
   404       1585       1585  Not found
-----------------------------
 Total       3447       3447
From the above output we know that after
the given period of time all documents
in the database will have expired.
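To make the offset notation concrete, here is a minimal Python sketch that converts an offset such as 7d12h into a time span. It is not the parser used by indexer: the function name is hypothetical, and only the d (days) and h (hours) units confirmed by the example above are handled. Any other unit letters that Period may accept are left out.

```python
import re
from datetime import datetime, timedelta

# Only the two units confirmed by the example in the text.
_UNITS = {"d": "days", "h": "hours"}


def parse_offset(text: str) -> timedelta:
    """Convert an offset like "7d12h" into a timedelta."""
    parts = re.findall(r"(\d+)([dh])", text)
    # Reject input that contains anything besides <number><unit> pairs.
    if not parts or "".join(n + u for n, u in parts) != text:
        raise ValueError(f"unsupported offset: {text!r}")
    delta = timedelta()
    for number, unit in parts:
        delta += timedelta(**{_UNITS[unit]: int(number)})
    return delta
```

Adding parse_offset("7d12h") to datetime(2008, 12, 21, 15, 44, 19) gives datetime(2008, 12, 29, 3, 44, 19), the timestamp shown in the sample report above.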
Note:
All subsection control options work together with -S.
The meaning of the various status values is given in this
list:
Status 0 means that the document has not been indexed yet.
If the status is not 0,
then it is the HTTP response code indexer got
when downloading this document. Some of the HTTP codes are:
200 - OK (the document was successfully downloaded)
301 - Moved Permanently (redirect to another URL)
302 - Moved Temporarily (redirect to another URL)
303 - See Other (redirect to another URL)
304 - Not modified (the document has not been modified since last visit)
401 - Authorization required (use login/password for the given URL)
403 - Forbidden (you have no access to this URL)
404 - Not found (the document does not exist)
500 - Internal Server Error (an error in a CGI script, etc)
503 - Service Unavailable (host is down, connection timed out)
504 - Gateway Timeout (read timeout happened during downloading the document)
HTTP 401 means that this URL is password protected.
You can use the AuthBasic
command in indexer.conf to specify the
login:password pair for this URL.
HTTP 404 means that you have a broken link
in one of your documents (a reference to a resource that does not exist).
Take a look at
HTTP-specific documentation
for further information on HTTP status codes.
Run indexer -I to display the
list of URLs together with their referrers. This can be useful
for finding broken links on your site.
Note:
If HoldBadHrefs is set to 0,
link validation won't work.
Note:
All subsection control options work together with -I.
For example, indexer -I -s 404 will display
the list of the documents with HTTP status 404 Not
found together with their referrers where the links to the
missing documents were found.
You can use
mnoGoSearch
specifically for link validation purposes.
It is always safe to run multiple indexer
processes with different indexer.conf
files configured to use different databases
in the DBAddr.
Some databases also allow running multiple
indexer crawling processes with the same
indexer.conf file. As of
mnoGoSearch version
3.3.8, this is possible with
MySQL, PgSQL, and
Oracle.
indexer uses locking mechanisms
provided by the database software
(such as SELECT FOR UPDATE and
LOCK TABLE) when fetching crawling
targets from the database. This is done to avoid double
crawling of the same documents
by simultaneous indexer processes.
Note:
indexer is known to work fine
with 30 simultaneous crawling
processes with MySQL.
Note:
It is not recommended to use the same database with
different indexer.conf files.
The first process can add new documents to the database,
while the second process can delete the same documents
because of its different configuration. This cycle may never end.
You can start indexer with multiple threads
using the -N command line option. For example,
indexer -N10 will start 10
crawling threads, which means 10 documents
from different locations will be downloaded at the same time,
which improves crawling performance significantly.
Note:
Running 10 instances of indexer
is very similar in effect to running a single indexer
with 10 threads. You may notice a difference,
though, if you terminate (using Ctrl-Break)
or kill (using kill(1)) indexer,
or if indexer crashes for some reason (e.g. when
it hits a bug in the sources). With separate processes,
only one process dies and the remaining processes continue
crawling, while with a multi-threaded indexer
all threads die and crawling stops completely.