Chapter 12. mnoGoSearch cluster

Table of Contents
Introduction
How it works
Operations done on the database machines
How a typical XML response looks like
Operations done on the front-end machine
Cluster types
Installing and configuring a merge cluster
Installing and configuring a distributed cluster
Using dump/restore tools to add cluster nodes
Cluster limitations

Introduction

Starting from the version 3.3.0, mnoGoSearch provides a clustered solution, which allows to scale search on several computers, extending possible database sizes up to several dozens million documents.

How it works

A typical cluster consists of several database machines (cluster nodes) and a single front-end machine. The front-end machine receives HTTP requests from a user's browser, forwards search queries to the database machines using HTTP protocol, receives back a limited number of the best top search results (using a simple XML format, based on the OpenSearch specifications) from every database machine, then parses and merges the results ordering them by score, and displays the results applying the HTML template. This approach distributes operations with high CPU and hard disk consumption between the database machines in parallel, leaving simple merge and HTML template processing functions to the the front-end machine.

Operations done on the database machines

In a clustered mnoGoSearch installation, all hard operations are done on the database machines.

This is an approximate distribution chart of time spent on different search steps:

How a typical XML response looks like


<rss>
  <channel>

    <!-- General search results information block -->
    <openSearch:totalResults>20</openSearch:totalResults>
    <openSearch:startIndex>1</openSearch:startIndex>
    <openSearch:itemsPerPage>2</openSearch:itemsPerPage>

    <!-- Word statistics block -->
    <mnoGoSearch:WordStatList>
      <mnoGoSearch:WordStatItem order="0" count="300" word="apache"/>
      <mnoGoSearch:WordStatItem order="1" count="103" word="web"/>
      <mnoGoSearch:WordStatItem order="2" count="250" word="server"/>
    </mnoGoSearch:WordStatList>

    <!-- document information block -->
    <item>
      <id>1</id>
      <score>70.25%</score>
      <title>Test page of the Apache HTTP Server</title>
      <link>http://hostname/index.html</link>
      <description>...to use the images below on Apache
      and Fedora Core powered HTTP servers. Thanks for using Apache ...
      </description>
      <updated>2006-12-13T18:30:02Z</updated>
      <content-length>3956</content-length>
      <content-type>text/html</content-type>
    </item>

    <!-- more items, typically 10 items total -->

  </channel>
</rss>

Operations done on the front-end machine

The front-end machine receives XML responses from every database machine. On the first query, the front-end machine requests top 10 results from every database machine. An XML response for the top 10 results page is about 5Kb. Parsing of each XML response takes less than 1% of time. Thus, a cluster consisting of 50 machines is about 50% slower than a cluster consisting of a single machine, but allows to search through a 50-times bigger collection of documents.

If the user is not satisfied with search results returned on the first page and navigates to higher pages, then the front-end machine requests ps*np results from each database machine, where ps is the page size (10 by default), and np is the page number. Thus, to display the 5th result page, the front-end machine requests 50 results from every database machine.

Cluster types

mnoGoSearch supports two cluster types. A merge cluster is to join results from multiple independent machines, each one created by its own indexer.conf. This type of cluster is recommended when it is possible to distribute your web space into separate databases evenly using URL patterns by means of the Server or Realm commands. For example, if you need to index three sites siteA, siteB and siteC with an approximately equal number of documents on each site, and you have three cluster machines nodeA, nodeB and nodeC, you can put each site to a separate machine using a corresponding Server command in the indexer.conf file on each cluster machine:


# indexer.conf on machine nodeA:
DBAddr mysql://root@localhost/test/
Server http://siteA/

# indexer.conf on machine nodeB:
DBAddr mysql://root@localhost/test/
Server http://siteB/

# indexer.conf on machine nodeC:
DBAddr mysql://root@localhost/test/
Server http://siteC/

A distributed cluster is created by a single indexer.conf, with indexer automatically distributing search data between database machines. This type of cluster is recommended when it is hard to distribute web space between cluster machines using URL patterns, for example when you don't know your site sizes or the site sizes are very different.

Note: Even distribution of search data between cluster machines is important to achieve the best performance. Search front-end waits for the slowest cluster node. Thus, if cluster machines nodeA and nodeB return search results in 0.1 seconds and nodeC return results in 0.5 seconds, the overall cluster response time will be about 0.5 seconds.

Installing and configuring a merge cluster

Configuring the database machines

On each database machine install mnoGoSearch using the usual procedure:

  • Configure indexer.conf: Edit DBAddr - usually specifying a database installed on the local host, for example:

    
DBAddr mysql://localhost/dbname/?dbmode=blob
    

  • Add a "Server" command corresponding to a desired collection of documents - its own collection on every database machine. Crawl the collections on every database machine by running indexer, then index the collections by running indexer --index.

  • Configure search.htm by copying the DBAddr command from indexer.conf.

  • Make sure that search works in the usual (non-clustered) mode by opening http://hostname/cgi-bin/search.cgi in your browser and typing some search query.

Configuring XML interface on the database machines

In addition to the usual installation steps, it's also necessary to configure XML interfaces on every database machine.

Go through the following steps:

  • cd /usr/local/mnogosearch/etc/

  • cp node.xml-dist node.xml

  • Edit node.xml by specifying the same DBAddr that you used in indexer.conf on this machine

  • make sure the XML front-end returns a well-formed response (according to the above format) by opening http://hostname/cgi-bin/search.cgi/node.xml?q=test

After these steps, you will have several separate document collections, with every collection indexed into its own database, and with configured XML interfaces on all database machines.

Configuring the front-end machine

Install mnoGoSearch using the usual procedure, then do the following additional steps:

  • cd /usr/local/mnogosearch/etc/

  • cp search.htm-dist search.htm

  • Fix this declaration:

    
string DBAddr= "mysql://root@localhost/test/";
    
    to
    
string DBAddr= "http://hostname1/cgi-bin/search.cgi/node.xml"
                  " http://hostname2/cgi-bin/search.cgi/node.xml"
                  " http://hostname3/cgi-bin/search.cgi/node.xml";
    
    i.e. listing URLs of all XML interfaces that were installed during the previous step. Use space to separate multiple entries.

    Note: Alternatively, if you're not using the standard search.htm from the distribution, you'll need to add a piece of the code like this:

    
ENV env;
    ...
    if (env.addline("DBAddr http://hostname1/cgi-bin/search.cgi/node.xml") ||
        env.addline("DBAddr http://hostname2/cgi-bin/search.cgi/node.xml") ||
        env.addline("DBAddr http://hostname3/cgi-bin/search.cgi/node.xml"))
    {
      cout << env.errmsg();
      exit(0);
    }
    

You're done. Now open http://frontend-hostname/cgi-bin/search.cgi. The cluster is ready to search.

Note: DBAddr file:///path/to/response.xml is also understood, to load an XML-formatted response from a static file. This is mostly for test purposes.

Installing and configuring a distributed cluster

Configuring the database machines

Install mnoGoSearch on a single database machine. Edit indexer.conf by specifying multiple DBAddr commands:


DBAddr mysql://hostname1/dbname/
DBAddr mysql://hostname2/dbname/
DBAddr mysql://hostname3/dbname/
and describing web space using Realm or Server commands. For example:

#
# The entire top level domain .ru,
# using http://www.ru/ as a start point
#
Server http://www.ru/
Realm http://*.ru/*
After that, install mnoGoSearch on all other database machines and copy indexer.conf from the first database machine. indexer configuration is done at this point. Now you can start indexer on any database machine to crawl. indexer will distribute data between the databases specified in the DBAddr commands. When crawling is done, run indexer --index.

How indexer distributes data between multiple databases

The number of the database a document is put into is calculated as a result of division of url.seed by the number of DBAddr commands specified in indexer.conf, where url.seed is calculated using a hash function on the URL of the document.

Thus, for indexer.conf having three DBAddr command, distribution is done as follows:

  • URLs with seed 0..85 go to the first DBAddr

  • URLs with seed 85..170 go to the second DBAddr

  • URLs with seed 171..255 go to the third DBAddr

Prior to version 3.3.0, indexer could also distribute data between several databases, but the distribution was done using the reminder of division url.seed by the number of DBAddr commands.

The new distribution style, introduced in 3.3.0, simplifies manual redistribution of an existing clustered database when adding a new DBAddr (i.e. a new database machine). Future releases will likely provide automatic tools for redistribution data when adding or deleting machines in an existing cluster, as well as more configuration commands to control distribution.

Configuring the front-end machine

Follow the same configuration instructions with the merge cluster type.

Using dump/restore tools to add cluster nodes

Starting from the version 3.3.9, mnoGoSearch allows to add new nodes into a cluster (or remove nodes) without having to re-crawl the documents once again.

Suppose you have 5 cluster nodes and what extend the cluster to 10 nodes. Please go through the following steps:

  1. Stop all indexer processes.

  2. Create all new 10 SQL databases and create a new .conf file with 10 DBAddr commands. Note, the old and the new SQL databases can NOT overlap. The new databases must be freshly created empty databases.

  3. Run

    indexer -d /path/to/new.conf --create
    to create table structure in all 10 new SQL databases.

  4. Make sure you have enough disk space - you'll need about 2 times extra disk space of the all original SQL databases.

  5. Create a directory, say, /usr/local/mnogosearch/dump-restore/, where you will put the dump file, then go to this directory.

  6. Run

    indexer --dumpdata | gzip > dumpfile.sql.gz 
    It will create the dump file.

  7. Run

    zcat dumpfile.sql.gz | indexer -d /path/to/new.conf --sqlmon -v2
    It will load information from the dump file and put it into the new SQL databases. Note, all document IDs will be automatically re-assigned.

  8. Check that restoring worked fine. These two commands should report the same statistics:

    
indexer -S
    indexer -S -d /path/to/new.conf
    

  9. Run

    indexer -d /path/to/new.conf --index
    to create inverted search index in the new SQL databases;

  10. Configure a new search front-end to use the new SQL databases and check that search bring the same results from the old and the new databases.

Cluster limitations