Cluster support was added. A typical cluster
consists of several database machines and a
single front-end machine. The front-end machine
receives HTTP requests from a user's browser,
forwards search queries to the database machines
over HTTP, receives a limited number of top-scoring
search results from every database machine (in a simple
XML format based on the OpenSearch specifications),
then parses and merges the results, sorts them by score,
and displays them using an HTML template.
This approach distributes operations with high
CPU and hard disk consumption between the database
machines in parallel, leaving simple merge and HTML
template processing functions to the front-end machine.
As of version 3.3.0, mnoGoSearch allows joining up
to 256 database machines into a single cluster.
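The merge step performed by the front-end machine can be sketched as follows. This is only an illustration of the idea, not mnoGoSearch code: the (score, url) record, the function name and the sample data are assumptions, and XML parsing of node responses is omitted.

```python
import heapq
import itertools
from typing import List, Tuple

# Hypothetical record: (score, url). Real nodes answer in XML; parsing is omitted here.
Result = Tuple[float, str]


def merge_node_results(per_node: List[List[Result]], limit: int) -> List[Result]:
    """Merge per-node lists, each already sorted by descending score,
    and keep the `limit` best results overall."""
    merged = heapq.merge(*per_node, key=lambda r: r[0], reverse=True)
    return list(itertools.islice(merged, limit))


# Two database machines, each returning its own top results.
node_a = [(0.9, "http://a/1"), (0.4, "http://a/2")]
node_b = [(0.7, "http://b/1"), (0.5, "http://b/2")]
top = merge_node_results([node_a, node_b], limit=3)
print(top)
```

Because every node returns only its best few results, the front-end merges short, pre-sorted lists, which is cheap compared with the searching done on the database machines.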
node.xml-dist, an XML template for a cluster database
machine, is now installed into the /etc directory.
"DBAddr http://hostname/search.cgi/node.xml" search.htm
command was added to specify the URL of the XML
interface of a cluster database machine.
"DBAddr file:///path/to/node.xml" search.htm
command was added, to specify a static XML search response.
This is mostly for test purposes.
Two cluster types were implemented: a merge cluster,
which joins results from several independent databases,
each created by its own indexer.conf, and
a distributed cluster, created by a single indexer.conf,
in which indexer automatically distributes the search index
between database machines.
The default distribution type was changed from "remainder" to "quotient".
Thus, for an indexer.conf with three DBAddr commands,
distribution is done as follows:
URLs with seed 0..85 go to the first DBAddr
URLs with seed 86..170 go to the second DBAddr
URLs with seed 171..255 go to the third DBAddr
This distribution style simplifies manual redistribution
of an existing clustered database when adding a new DBAddr
(i.e. a new database machine). Future releases will provide an automatic
tool for redistribution when adding and deleting machines in an existing
cluster, as well as more configuration commands to control distribution.
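The difference between the two distribution types can be sketched as below. The formulas are assumptions, not taken from the mnoGoSearch sources: the quotient formula seed * N / 256 was chosen because, for three machines, it yields the contiguous ranges 0..85, 86..170 and 171..255. All function names are hypothetical.

```python
from typing import List, Tuple


def db_by_quotient(seed: int, num_dbs: int) -> int:
    """Assumed "quotient" rule: each database gets one contiguous seed range."""
    return seed * num_dbs // 256


def db_by_remainder(seed: int, num_dbs: int) -> int:
    """Assumed "remainder" rule: neighbouring seeds go to different databases."""
    return seed % num_dbs


def seed_ranges(num_dbs: int) -> List[Tuple[int, int]]:
    """Return the (first, last) seed handled by each database under the quotient rule."""
    ranges = {}
    for seed in range(256):
        db = db_by_quotient(seed, num_dbs)
        first, last = ranges.get(db, (seed, seed))
        ranges[db] = (min(first, seed), max(last, seed))
    return [ranges[db] for db in range(num_dbs)]


print(seed_ranges(3))
print(seed_ranges(4))
```

With contiguous ranges, adding a machine only moves range boundaries, so the rows to relocate can be selected by seed interval; under the remainder rule the seeds of every machine are interleaved across the whole 0..255 space.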
The maximum number of words collected from a document
was changed from 64K words per document to 64K words
per section: positions are now enumerated per section,
starting from the beginning of each section separately.
"SaveSectionSize yes/no" indexer.conf and search.htm
command was added. When SaveSectionSize is set to yes,
indexer stores additional information about section sizes,
making it possible to generate better score values, as
well as to do "exact section match" searches.
Default value is "yes".
Relevancy improvement: "WordDensityFactor num" search.htm
command was added. Num is a number in the range 0..255
that specifies the impact of word frequency on the result score.
This feature works with "SaveSectionSize yes".
The default value is 25.
Exact section match syntax was added:
title="Apache web server"
This feature works with "SaveSectionSize yes".
"WordFormFactor num" search.htm command was added
to give more weight to the word forms originally
written in the search query and less weight to
word forms generated using ispell dictionaries
and synonyms. Num is a number in the range 0..255.
The default value is 255, which gives the same
weight to the original and generated forms.
0 means maximum effect, i.e. the weight of a generated
word form is much smaller than the weight of the original
word form.
The performance of the excerpt generation code
was improved. Excerpt generation from CachedCopy is
now about 6-12% faster.
Using URL and Tag limits is now possible with "indexer -Eblob", e.g.:
./indexer -Eblob -u "%subdir%"
./indexer -Eblob -t tag
This is to generate a search index over a subset of
all documents collected during crawling.
Using "Limit" command is also possible with "indexer -Eblob", e.g.:
indexer.conf command:
Limit subdir "SELECT rec_id FROM url WHERE url LIKE '%/subdir/%'"
command line:
./indexer -Eblob --fl=subdir
"ResultContentType type" search.htm command
was added to specify Content-Type header generated by
search.cgi. The default value is "text/html".
"Dehyphenate yes/no" search.htm command was added.
When "Dehyphenate yes" is specified, searching for "peace-making"
will also return documents containing "peacemaking".
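The effect of Dehyphenate on a query word can be illustrated with the sketch below. It only mirrors the behaviour described above; how mnoGoSearch implements it internally is not stated here, and the function name is hypothetical.

```python
from typing import List


def dehyphenated_forms(word: str) -> List[str]:
    """Return the query word, followed by its hyphen-free form when that differs."""
    forms = [word]
    joined = word.replace("-", "")
    if joined and joined != word:
        forms.append(joined)
    return forms


print(dehyphenated_forms("peace-making"))
print(dehyphenated_forms("peace"))
```

A search would then match documents containing any of the returned forms.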
Thanks to Oz Basarir and Natural
Capital Institute for sponsoring this feature.
Clone template variables were changed: clones are now returned
in the same row as the document itself, using the CloneN prefix, e.g.:
$(Clone0.URL). The "<!--clone-->" search.htm section and the $(CL)
variable are no longer supported.
DetectClones is now "no" by default, for performance purposes.
"CollectLinks yes/no" indexer.conf command was added.
The default value is "no", which improves indexing performance
by not populating the "links" table. As a side effect,
PopRank calculation is not possible in the default configuration.
If PopRank is important for your installation, specify
"CollectLinks yes" in indexer.conf.
Default sort order was changed from "RP" (score, then popularity)
to "R" (score). This change improves search performance for
installations where PopRank is not important.
Indexer now honors <a rel="nofollow"> tags.
Thanks to Jeff Veit for the contribution.
A simplified format of HTDBDoc command was added:
HTDBDoc "SELECT title, body FROM docs WHERE id=$2"
SQL column names are associated with "Section" names.
Thanks to Oz Basarir and Natural
Capital Institute for sponsoring this feature.
It's now possible to specify wf as a parameter of the
DBAddr search.htm command, which is useful when
merging two or more databases, to give a higher score
to results coming from a preferred database.
DBAddr mysql://root@localhost/db1/?wf=FFFF
DBAddr mysql://root@localhost/db2/?wf=1111
DBAddr mysql://root@localhost/db3/?wf=1111
A MaxResults parameter was added for DBAddr, which is
useful for adding a limited number of sponsored links
at the top of search results:
DBAddr mysql://root@localhost/avd/?wf=FFFF&MaxResults=1
DBAddr mysql://root@localhost/db1/?wf=1111
DBAddr mysql://root@localhost/db2/?wf=1111
$(DBOrder) template variable was added to display
the original order of a document in its database result,
before multiple DBAddr search results were merged into
the final result. It is equal to $(Order) when using
only a single DBAddr command in search.htm.
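The relationship between $(DBOrder), $(Order) and MaxResults can be sketched as follows. This is a simplified model, not mnoGoSearch code: scores are taken as given (any wf effect is assumed to be already included), both orders are assumed to be 1-based, and the names Row and merge_with_db_order are hypothetical.

```python
from typing import List, NamedTuple, Optional, Tuple


class Row(NamedTuple):
    url: str
    score: float
    db_order: int  # position within its own database result (assumed 1-based)
    order: int     # position in the merged result (assumed 1-based)


def merge_with_db_order(
    db_results: List[List[Tuple[str, float]]],
    max_results: List[Optional[int]],
) -> List[Row]:
    """Merge per-database results (each sorted by descending score),
    keeping at most max_results[i] rows from database i (None = no cap)."""
    candidates = []
    for db_index, results in enumerate(db_results):
        cap = max_results[db_index]
        kept = results if cap is None else results[:cap]
        for position, (url, score) in enumerate(kept, start=1):
            candidates.append((url, score, position))
    # Stable sort: equal scores keep their original relative order.
    candidates.sort(key=lambda c: c[1], reverse=True)
    return [
        Row(url, score, db_order, order)
        for order, (url, score, db_order) in enumerate(candidates, start=1)
    ]


ads = [("http://ads/1", 0.99), ("http://ads/2", 0.98)]
db1 = [("http://db1/a", 0.8), ("http://db1/b", 0.3)]
db2 = [("http://db2/a", 0.5)]
for row in merge_with_db_order([ads, db1, db2], [1, None, None]):
    print(row)
```

In this model, only one sponsored row survives the cap, and with a single database the two order values coincide for every row.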
The FOR template operator was added. Loop limits
can be constants:
<!FOR NAME="a" FROM="10" TO="20">a=$(a)<!ENDFOR>
or variables that were previously set, for example by the SET operator:
<!SET NAME="from" CONTENT="80">
<!SET NAME="to" CONTENT="90">
<!FOR NAME="a" FROM="$(from)" TO="$(to)">a=$(a)<!ENDFOR>
"[no title]" is no longer added automatically:
an empty string is printed instead. The IF template
operator can be used to reproduce the 3.2.x behaviour:
<!IF NAME="title" CONTENT="">[no title]<!ELSE>$&(title)<!ENDIF>
Various indexing and search performance improvements were made.
Fixed that indexer didn't work with MySQL-5.1.15-GPL.
"indexer -?" now prints its help page to STDOUT instead of STDERR.
A "#version" record is now written to the table "bdict" when
running "indexer -Eblob". The mnoGoSearch version ID is stored
as its value. For example, mnoGoSearch 3.3.0 stores the string "30300".
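The version ID format can be illustrated with the sketch below. The encoding (two decimal digits each for the minor and patch numbers) is an assumption inferred from the single example above, and the function name is hypothetical.

```python
def version_id(major: int, minor: int, patch: int) -> str:
    """Encode a version as a decimal string, assuming two digits
    each for the minor and patch numbers."""
    return str(major * 10000 + minor * 100 + patch)


print(version_id(3, 3, 0))  # 30300
```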
A preliminary implementation of DBMode=rawblob
in search.htm was added. This mode is designed for
direct search from the table "bdicti" without having
to run "indexer -Eblob" and is intended for use with
small search databases as a replacement for DBMode=single.
In future releases it will also be reused for real-time
index updates, to avoid running "indexer -Eblob" when
only a small number of documents have changed.