Cached copies

Starting from version 3.2.2, mnoGoSearch can store compressed copies of the indexed documents, so-called cached copies. Cached copies are stored in the urlinfob SQL table of the mnoGoSearch database.

Cached copies are used for these purposes:

  1. As an indexing source when running:

    
indexer --index
    

  2. To display smart excerpts from every found document, with the search query words shown in their context.

  3. To display the entire original copy of the document, with the search words highlighted. Viewing a cached copy can be especially useful when the original site is temporarily down or the document no longer exists.

    Note: A cached copy is opened in the browser when the user clicks on the Cached copy link near every document in search results.
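
The appearance of smart excerpts can be tuned in search.htm. For example, the ExcerptSize and ExcerptPadding commands (shown here as an illustration; check the documentation for your version for the exact command names and defaults) control the total excerpt length and the amount of context shown around each found word:


#
# An approximate excerpt size of 256 characters,
# with about 40 characters of context around each word:
#
ExcerptSize 256
ExcerptPadding 40
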

Configuring cached copies

Starting from version 3.4.0, cached copies are collected by default; no special configuration is needed.

Cached copy compression is controlled by the CachedCopyEncoding command in indexer.conf.
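
For example, a minimal indexer.conf fragment (the deflate value is an assumption for illustration; consult the documentation for your version for the list of supported encodings):


#
# Compress cached copies using deflate:
#
CachedCopyEncoding deflate
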

Cached copies are not stored for documents that have both noindex and noarchive in either the X-Robots-Tag HTTP header or the <META NAME="Robots"> tag. With the help of this feature, in combination with user-defined sections, it's possible to disable storing cached copies for certain documents, e.g. for documents of a certain content type:


#
# Disallow cached copies using Content-Type,
# for the documents of text/html type:
#
Section HTTP.Content 0 0
Section X-Robots-Tag 0 64 afterheaders cdoff "" "${Content-Type}" "text/html" "noindex, noarchive"
or using a more exotic condition:

#
# Disallow cached copies using content length
# for documents with size between 120 and 130 bytes:
#
Section HTTP.Content 0 0
Section X-Robots-Tag 0 64 afterheaders cdoff "" "${HTTP.Content}" "^.{120,130}$" "noindex, noarchive"
The above commands instruct indexer to replace the X-Robots-Tag value with "noindex, noarchive" under the given condition, and thus disable storing cached copies for the matching documents.

Note: Documents with cached copies disabled this way will not be found by search.cgi, because their content is not available for indexing.

Using cached copies at search time

Displaying cached copies is enabled in the default search result template search.htm-dist.

The URL of a cached copy of a document is available by this call:


string cached_url= res.document_property_html(i, "stored_href");
where res is a variable of the RESULT class.
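
For example, the link can be displayed in a template approximately like this (a sketch; the exact markup in search.htm-dist may differ):


<a href="<?mnogosearch res.document_property_html(i, "stored_href")?>">Cached copy</a>
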

When using the default search template, search.cgi refers to itself recursively. When you follow the Cached copy link in your browser, you'll open search.cgi again, but with special query string parameters which tell it to display a cached copy rather than search results.

At search time, this works in the following order:

  1. For each document, a link to its cached copy is displayed in search results.

  2. When the user clicks the link, search.cgi is executed. It sends a query to the SQL database and fetches the cached copy content.

  3. search.cgi decompresses the requested cached copy and sends it to the web browser, highlighting the search keywords.

Moving cached copies to another machine

You can optionally specify an alternative URL for displaying cached copies, so that cached copies reside under another location on the same server, or even on another physical server. For example:


<a href="http://site2/cgi-bin/search.cgi?<?mnogosearch res.document_property_html(i, "stored_href")?>">Cached copy</a>
Moving cached copies to another server can be useful to distribute CPU load between machines.

Note: mnoGoSearch must be installed on the machine site2.

Using the original document as a cached copy source

Starting from version 3.3.8, mnoGoSearch understands the UseLocalCachedCopy command in search.htm, which forces downloading documents from their original locations when generating smart excerpts for search results, as well as when generating "Cached copy" documents. This command can be useful when you index documents residing on your local file system: it avoids storing cached copies in the database and thus makes the database smaller.
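
For example, a minimal search.htm fragment (the yes value is an assumption; check the documentation for your version for the exact syntax):


#
# Fetch excerpt and cached copy content from the
# original local files instead of the database:
#
UseLocalCachedCopy yes
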