The XperienCentral search engine uses several configuration files, which are stored in the <searchengine-directory>/conf directory. The most important file is properties.txt, which contains the initial settings for the search engine. The other configuration files contain settings for meta information, credentials, and parser mappings.

General Configuration: properties.txt

During the startup of the search engine, some basic configuration parameters have to be available. These basic settings are stored in the file properties.txt. The general format in the properties.txt file is [config parameter]=[config value]. The names of the configuration parameters are case sensitive. Comments can be added by putting a # at the beginning of a line. Additional explanation is given in the comments in the properties.txt file itself.
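As an illustration, a properties.txt fragment using this format could look like the lines below. Note that the parameter names shown here are hypothetical examples of the [config parameter]=[config value] format, not the actual parameter set; the authoritative list is documented in the comments of the properties.txt file itself.

```
# Lines starting with a # are comments.
# Parameter names are case sensitive.
indexdir=/opt/searchengine/index
loglevel=info
```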





Task Configuration: crontab.txt

In production environments, indexing the website is a recurring task that is usually executed every 24 hours. The indexing schedule is configured in the crontab.txt file. This file is read frequently, so it is not necessary to restart the search service when changes have been made. The crontab file uses the default Unix format for cron jobs. The XperienCentral configuration on Windows servers and desktops does not use crontabs but instead uses the default Windows services. For more explanation, see the Windows documentation.

The crontab.txt file contains one or more lines. Each line corresponds to one task and consists of three parts:

  • Command - Example: fullindex, index, check. fullindex first erases the index and then indexes the website; index indexes the website; check checks all URLs on the website and removes pages that no longer exist.
  • Arguments - Example: http://edit.mywebsite.com/web/webmanager/id=39016
  • Time interval - Format: [minutes hours day_of_month month weekday]

Example 1: [0 0 * * *] (Run this task at 00:00, midnight)

Example 2: [30 2 * * *] (Run this task at 2:30 AM)

Example 3: [55 * * * *] (Run this task at the 55th minute of every hour)

For more information and examples see Crontab. The fullindex and index commands have three arguments:

  1. The URL of the index page.
  2. The depth of the indexing process.
  3. A comma-separated list of the host names that may be indexed. The list must contain at least all host names that are used externally, plus the host name of the index page URL (see the first argument).

An example of a crontab.txt file (two lines, one starting with index, the second with check):


index http://localhost:8080/web/webmanager/id=39016 1 127.0.0.1,localhost [5 0 * * *]
check * [0 2 * * *]


This configuration specifies that a local website has to be indexed with depth=1 at five past midnight (00:05) every day. At 2:00 AM all URLs in the index are checked (*) and nonexistent URLs are removed.





Parser Configuration: parser.txt

The relevant properties of documents are retrieved by parsers. The mapping between document type and parser type is configured in the file parser.txt. Each line in this file consists of three parts:

  • URL regular expression matched against the document URL
    Example:  .*\.pdf
  • Content type regular expression matched against the document content type retrieved from the HTTP header
    Example: application/pdf
  • Full class name of the parser
    Example: nl.gx.webmanager.searchengine.parser.XmlParser


Example of parser.txt:


.*                  .*   nl.gx.webmanager.searchengine.parser.CentralContentParser
.*\.htm             .*   nl.gx.webmanager.searchengine.parser.HtmlParser
.*\.html            .*   nl.gx.webmanager.searchengine.parser.HtmlParser
.*         text/html.*   nl.gx.webmanager.searchengine.parser.HtmlParser
.*\.txt             .*   nl.gx.webmanager.searchengine.parser.TextParser
.*        text/plain.*   nl.gx.webmanager.searchengine.parser.TextParser
.*\.xml             .*   nl.gx.webmanager.searchengine.parser.XmlParser
.*          text/xml.*   nl.gx.webmanager.searchengine.parser.XmlParser
.*\.pdf             .*   nl.gx.webmanager.searchengine.parser.PdfParser
.*   application/pdf.*   nl.gx.webmanager.searchengine.parser.PdfParser
.*\.doc             .*   nl.gx.webmanager.searchengine.parser.AntiwordParser


The parser.txt file is read every minute, so it is not necessary to restart the search service when the contents are edited. Every document is matched against the lines top-down, and each line from left to right. The document is sent to the parser of every line that matches. When no matching parser is found, the document is not indexed. The same applies to the special parser name '-', which explicitly marks a document type as not to be indexed.
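The matching behavior described above can be sketched in a few lines of code. This is an illustrative model only, not the engine's actual implementation: the rule table is a shortened copy of the parser.txt example above, and full-string regex matching is an assumption for the sketch.

```python
import re

# Illustrative model of parser.txt matching: each rule is a tuple of
# (URL regex, content type regex, parser class name). Rules are tried
# top-down; the document goes to the parser of every line that matches.
RULES = [
    (r".*\.html", r".*", "nl.gx.webmanager.searchengine.parser.HtmlParser"),
    (r".*", r"text/html.*", "nl.gx.webmanager.searchengine.parser.HtmlParser"),
    (r".*\.pdf", r".*", "nl.gx.webmanager.searchengine.parser.PdfParser"),
    (r".*", r"application/pdf.*", "nl.gx.webmanager.searchengine.parser.PdfParser"),
]

def matching_parsers(url, content_type):
    """Return the parser class names of every rule the document matches."""
    parsers = []
    for url_pattern, ct_pattern, parser_class in RULES:
        if re.fullmatch(url_pattern, url) and re.fullmatch(ct_pattern, content_type):
            parsers.append(parser_class)
    return parsers
```

Note that a document can match several lines at once (for example, a PDF matches both on its .pdf extension and on its application/pdf content type); in this model each matching line contributes its parser.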





Credentials Configuration: credentials.xml

Even though the search engine indexes the website through the frontend, a basic form of authentication is required to retrieve the indexer page and the meta information of documents. The authentication for the search engine is configured in the file credentials.xml. Besides basic authentication, credentials.xml can also contain advanced authentication for secure websites and documents.

Basic Authentication

In a default installation, the configuration is limited to creating a dedicated search engine user and password, for example "gxsearch" with password "Search987", and entering this information in credentials.xml. For example:


<credentials>
   <credential pattern=".*localhost.*" type="postform" username="gxsearch" password="Search987" />
      ...
</credentials>


Advanced Authentication

XperienCentral supports three types of secure indexing: NTLM, basic authentication, and postform authentication. Secure indexing is set up by creating a credential pattern for the website (or part of it) and mapping this credential to the required login attributes of the authentication. All authentication types require at least a username and password; for NTLM authentication, host and domain attributes must also be specified.


NTLM Example


<credentials>
   <credential pattern="http://www.gx.nl/docs/.*" type="ntlm" username="Administrator" password="secretpa$$word" host="wmhost" domain="GX" />
   <!-- other credentials here -->
</credentials>


Basic Authentication Example


<credentials>
   <credential pattern="http://localhost/secret.*pdf" type="basic" username="admin" password="secretpa$$word" />
   <!-- other credentials here -->
</credentials>


Postform Authentication Example

Postform authentication relies on the cookie that is returned after the form submit; this cookie contains the session ID required to index the protected URLs.


<credentials>
   <credential pattern="http://www.gxsoftware.com/web/show/.*" type="postform" username="gxsearch" password="Search987">
   <!-- indicate which input parameters in the login form correspond to the user and password -->
   <param name="userparam" value="f48305" />
   <param name="passwordparam" value="f48306" />
   <!-- the action URL the search engine posts the username/password to -->
   <param name="actionurl" value="http://www.gx.nl/web/formhandler?source=form" />
   <!-- include all input parameters in the form -->
   <formparam name="id" value="29347" />
   <formparam name="pageid" value="47952" />
   <formparam name="handle" value="form" />
   <formparam name="ff" value="47954" />
   <formparam name="form" value="48067" />
   <formparam name="formelement" value="47954" />
   <formparam name="originalurl" value="http://www.gx.nl/web/show/id=40945/cfe=47954/ff=47954" />
   <formparam name="errorurl" value="http://www.gx.nl/web/show/id=40945/cfe=47954/ff=47954/formerror=47954" />
   <formparam name="f48305" value="" />
   <formparam name="formpartcode" value="f48305" />
   <formparam name="f48306" value="" />
   <formparam name="formpartcode" value="f48306" />
   </credential>
   <!-- other credentials here -->
</credentials>





Additional Meta Data: meta.txt

Additional metadata can be provided during the indexing process by using the configuration file meta.txt. This file can be used to fill metadata fields in the index with values based on specific URLs. An example of a meta.txt file is:


.*/javadoc/.*                  pagetype   javadoc
http://www.gxsoftware.com/.*   owner      gx


In this example, documents with URLs that contain the string "javadoc" get an additional field pagetype with the value "javadoc". The second line creates a field owner with the value "gx" for all documents from the website www.gxsoftware.com.

The format for the meta.txt file is <URL pattern><tab><index field><tab><index value>. The URL pattern is a regular expression. The separator has to be a tab, not a number of spaces; note that some IDEs (such as Eclipse) can be configured to automatically convert tabs to spaces, which leads to unwanted behavior. The meta.txt file is read every minute, so it is not necessary to restart the search engine after the file has been changed.

The reason for setting these properties is that they can be used for filtering the search results. For example, based on the meta.txt above it is very easy to filter out all items that have "gx" as the value of the property owner.
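As a sketch of how such a file could be processed, the tab-separated format and the URL matching described above can be modeled as follows. This is illustrative only, not the engine's actual code; full-string regex matching is an assumption for the sketch.

```python
import re

def parse_meta_line(line):
    """Split one meta.txt line into (url_pattern, field, value).

    The parts must be separated by real tab characters; consecutive
    tabs are tolerated, but spaces are not accepted as separators.
    """
    parts = [part for part in line.rstrip("\n").split("\t") if part]
    if len(parts) != 3:
        raise ValueError("expected <URL pattern><tab><index field><tab><index value>")
    return parts[0], parts[1], parts[2]

def extra_fields(url, meta_lines):
    """Collect the additional index fields that apply to a document URL."""
    fields = {}
    for line in meta_lines:
        pattern, field, value = parse_meta_line(line)
        if re.fullmatch(pattern, url):
            fields[field] = value
    return fields
```

With the two example lines above, a URL such as http://www.gxsoftware.com/javadoc/index.html matches both patterns and therefore receives both the pagetype and the owner field.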

