Installation
Table Of Contents
- Planning for Search Appliance Installation
- About This Document
- How Does the Search Appliance Work?
- About the End User License Agreement
- How Do I Plan My Installation?
- What Character Encoding Should Content Files and Feeds Use?
- What Hardware and Software Do I Need?
- What Does the Google Search Appliance Shipping Box Contain?
- What File Types Can Be Indexed?
- What File Sizes Can Be Indexed?
- What Content Locations Can Be Crawled or Traversed?
- How Many URLs Can Be Crawled?
- How Do I Control Security?
- Can the Search Appliance Use a Dedicated Network Interface Card for Administration?
- What Ports Does the Search Appliance Use?
- What User Accounts Do I Need?
- How are Administration Accounts Authenticated?
- How Do I Obtain Technical Support?
- How is Power Supplied to the Search Appliance?
- How is Data Destroyed on a Returned Search Appliance?
- What Values Do I Need for the Installation Process?
- What Tasks Do I Need to Perform Before I Install?
- Electrical and Other Technical Requirements
• Start URLs, which control where the crawl begins. All content must be reachable by following links
from one or more start URLs.
• Follow and Crawl URLs, which set the patterns of URLs that are crawled. Use follow and crawl URLs
to define the paths to pages and files you want crawled. If a URL in a crawled document links to a
document whose URL does not match a pattern defined as a follow and crawl URL, that document
is not crawled.
• Do Not Crawl URLs, which designate the paths to pages, files, and file types that you do not want
crawled.
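The interplay of these three pattern lists can be sketched as a simple matcher. This is only an illustration: the hostnames are placeholders, and plain substring matching stands in for the search appliance's richer URL pattern syntax.

```python
# Sketch of how the Follow and Crawl and Do Not Crawl pattern lists interact.
# Substring matching is a simplification of the appliance's pattern language.

def should_crawl(url, follow_patterns, do_not_crawl_patterns):
    """Return True if the URL matches a Follow and Crawl pattern
    and does not match any Do Not Crawl pattern."""
    if any(p in url for p in do_not_crawl_patterns):
        return False
    return any(p in url for p in follow_patterns)

# Placeholder patterns for illustration.
follow = ["http://intranet.example.com/"]
avoid = [".exe", "/archive/"]

print(should_crawl("http://intranet.example.com/docs/plan.html", follow, avoid))  # True
print(should_crawl("http://intranet.example.com/archive/old.html", follow, avoid))  # False
```

A document whose URL matches no Follow and Crawl pattern is simply never fetched, even if a crawled page links to it.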
If the search appliance is crawling a web site, the crawl software issues HTTP requests to retrieve
content files in the locations defined by the URLs and to retrieve files from links discovered in crawled
content. If the search appliance is crawling a file share, the crawl software uses the SMB or Common
Internet File System (CIFS) protocol to locate and retrieve the content files. For more information on
crawl, see Administering Crawl, which also includes checklists of crawl-related tasks in the “Crawl Quick
Reference.”
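The choice between HTTP and SMB/CIFS retrieval follows from the URL scheme. A minimal dispatch might look like the following sketch; the function and return values are illustrative placeholders, not the appliance's actual internals.

```python
# Illustrative sketch: choosing a retrieval protocol from the URL scheme.
from urllib.parse import urlparse

def fetcher_for(url):
    """Pick a retrieval protocol based on the URL scheme.
    Return values are placeholder labels for this sketch."""
    scheme = urlparse(url).scheme
    if scheme in ("http", "https"):
        return "HTTP fetch"
    if scheme == "smb":
        return "SMB/CIFS fetch"
    raise ValueError(f"unsupported scheme: {scheme}")

print(fetcher_for("http://www.example.com/page.html"))   # HTTP fetch
print(fetcher_for("smb://fileserver/share/report.doc"))  # SMB/CIFS fetch
```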
Traversal
Traversal is the process by which the Google Search Appliance locates content to be indexed in a
content repository such as SharePoint or Lotus Notes. During traversal, the connector issues queries
to the repository to retrieve document data, which it feeds to the Google Search Appliance for
indexing.
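A traversal can be pictured as a query-and-feed loop. The sketch below is hypothetical: `FakeRepo` stands in for a repository client, since each connector (SharePoint, Lotus Notes, and so on) uses its repository's own query API.

```python
# Hypothetical sketch of a connector traversal loop. `repo` stands in for a
# repository client; real connectors use the repository's own query API and
# feed the retrieved document data to the appliance.

def traverse(repo, batch_size=100):
    """Query the repository in batches and yield document data for feeding."""
    checkpoint = None
    while True:
        batch = repo.query_documents(after=checkpoint, limit=batch_size)
        if not batch:
            break
        for doc in batch:
            yield {"url": doc["url"], "content": doc["content"]}
        checkpoint = batch[-1]["id"]  # resume point for the next query

class FakeRepo:
    """In-memory stand-in for a repository client, for illustration only."""
    def __init__(self, docs):
        self.docs = docs
    def query_documents(self, after=None, limit=100):
        start = 0 if after is None else next(
            i + 1 for i, d in enumerate(self.docs) if d["id"] == after)
        return self.docs[start:start + limit]

repo = FakeRepo([{"id": i, "url": f"repo://doc/{i}", "content": "..."} for i in range(5)])
print(len(list(traverse(repo, batch_size=2))))  # 5
```

The checkpoint lets a traversal resume where it left off rather than re-querying the whole repository each cycle.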
Feeds
Feeding is the process by which you direct content to the Google Search Appliance instead of having the
search appliance locate content. Feeding is a push process, in which the content files are pushed to the
Google Search Appliance. You can feed several types of content to a Google Search Appliance:
• A list of URLs
The crawl software fetches documents listed in the URLs.
• Content files
The files and their URLs are fed to the search appliance.
• External metadata that is not stored in a relational database, or that is difficult to map to the
content file
For more information on feeding, see the Feeds Protocol Developer’s Guide and External Metadata Indexing
Guide.
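A URL-list ("web") feed can be sketched as follows. The XML shape follows the gsafeed format described in the Feeds Protocol Developer's Guide, but the datasource name, URLs, and appliance hostname here are placeholders; consult that guide for the authoritative DTD and feed types.

```python
# Sketch of a "web" feed: a list of URLs for the crawl software to fetch.
# Real feeds reference the gsafeed DTD; datasource and URLs are placeholders.

FEED_TEMPLATE = """<?xml version="1.0" encoding="UTF-8"?>
<gsafeed>
  <header>
    <datasource>example_source</datasource>
    <feedtype>web</feedtype>
  </header>
  <group>
{records}
  </group>
</gsafeed>"""

def build_url_feed(urls):
    """Build a feed listing URLs for the search appliance to fetch."""
    records = "\n".join(
        f'    <record url="{u}" mimetype="text/html"/>' for u in urls)
    return FEED_TEMPLATE.format(records=records)

feed = build_url_feed(["http://www.example.com/a.html"])
# The feed is then pushed (HTTP POST) to the appliance's feed port, e.g.:
#   requests.post("http://appliance:19900/xmlfeed",
#                 data={"feedtype": "web",
#                       "datasource": "example_source",
#                       "data": feed})
```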
Indexing
Indexing is the process of adding the content from the crawled documents to the index.
After a file is retrieved by the crawl, the file is converted to an HTML file and submitted for indexing. The
indexing process extracts the full text from each content file, breaks down the text, and adds both the
text and information such as date and page rank to the index so that users’ search requests can be
satisfied. The index and the HTML versions of each indexed file are stored on the search appliance.
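The text extraction and breakdown step can be illustrated with a toy inverted index. This is a drastic simplification with placeholder URLs; the appliance's index also stores dates, page rank, and the HTML version of each file.

```python
# Toy inverted index: maps each term to the set of documents containing it,
# a simplified version of the index built from extracted text.
from collections import defaultdict

def build_index(documents):
    """Index a mapping of URL -> extracted text."""
    index = defaultdict(set)
    for url, text in documents.items():
        for term in text.lower().split():  # crude tokenization
            index[term].add(url)
    return index

docs = {
    "http://example.com/a.html": "search appliance installation",
    "http://example.com/b.html": "planning the installation",
}
index = build_index(docs)
print(sorted(index["installation"]))  # both URLs
```

A search request is then answered by looking up the query terms in the index rather than scanning the documents themselves.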