Installation
Table Of Contents
- Planning for Search Appliance Installation
- About This Document
- How Does the Search Appliance Work?
- About the End User License Agreement
- How Do I Plan My Installation?
- What Character Encoding Should Content Files and Feeds Use?
- What Hardware and Software Do I Need?
- What Does the Google Search Appliance Shipping Box Contain?
- What File Types Can Be Indexed?
- What File Sizes Can Be Indexed?
- What Content Locations Can Be Crawled or Traversed?
- How Many URLs Can Be Crawled?
- How Do I Control Security?
- Can the Search Appliance Use a Dedicated Network Interface Card for Administration?
- What Ports Does the Search Appliance Use?
- What User Accounts Do I Need?
- How are Administration Accounts Authenticated?
- How Do I Obtain Technical Support?
- How is Power Supplied to the Search Appliance?
- How is Data Destroyed on a Returned Search Appliance?
- What Values Do I Need for the Installation Process?
- What Tasks Do I Need to Perform Before I Install?
- Electrical and Other Technical Requirements
• Start URLs, which control where the crawl begins. All content must be reachable by following links
from one or more start URLs.
• Follow and Crawl URLs, which set the patterns of URLs that are crawled. Use follow and crawl URLs
to define the paths to pages and files you want crawled. If a URL in a crawled document links to a
document whose URL does not match a pattern defined as a follow and crawl URL, that document
is not crawled.
• Do Not Crawl URLs, which designate the paths to pages, files, and file types that you do not want
crawled.
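The interplay of these three pattern lists can be sketched as a simple matcher. This is only an illustration: the hostnames are placeholders, and plain substring matching stands in for the search appliance's richer URL pattern syntax.

```python
# Sketch of how the Follow and Crawl and Do Not Crawl pattern lists interact.
# Substring matching is a simplification of the appliance's pattern language.

def should_crawl(url, follow_patterns, do_not_crawl_patterns):
    """Return True if the URL matches a Follow and Crawl pattern
    and does not match any Do Not Crawl pattern."""
    if any(p in url for p in do_not_crawl_patterns):
        return False
    return any(p in url for p in follow_patterns)

# Placeholder patterns for illustration.
follow = ["http://intranet.example.com/"]
avoid = [".exe", "/archive/"]

print(should_crawl("http://intranet.example.com/docs/plan.html", follow, avoid))  # True
print(should_crawl("http://intranet.example.com/archive/old.html", follow, avoid))  # False
```

A document whose URL matches no Follow and Crawl pattern is simply never fetched, even if a crawled page links to it.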
If the search appliance is crawling a web site, the crawl software issues HTTP requests to retrieve
content files in the locations defined by the URLs and to retrieve files from links discovered in crawled
content. If the search appliance is crawling a file share, the crawl software uses the SMB or Common
Internet File System (CIFS) protocol to locate and retrieve the content files. For more information on
crawl, see Administering Crawl, which also includes checklists of crawl-related tasks in the “Crawl Quick
Reference.”
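The choice between HTTP and SMB/CIFS retrieval follows from the URL scheme. A minimal dispatch might look like the following sketch; the function and return values are illustrative placeholders, not the appliance's actual internals.

```python
# Illustrative sketch: choosing a retrieval protocol from the URL scheme.
from urllib.parse import urlparse

def fetcher_for(url):
    """Pick a retrieval protocol based on the URL scheme.
    Return values are placeholder labels for this sketch."""
    scheme = urlparse(url).scheme
    if scheme in ("http", "https"):
        return "HTTP fetch"
    if scheme == "smb":
        return "SMB/CIFS fetch"
    raise ValueError(f"unsupported scheme: {scheme}")

print(fetcher_for("http://www.example.com/page.html"))   # HTTP fetch
print(fetcher_for("smb://fileserver/share/report.doc"))  # SMB/CIFS fetch
```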
Traversal
Traversal is the process by which the Google Search Appliance locates content to be indexed in a
content repository such as SharePoint or Lotus Notes. During traversal, the connector issues queries
to the repository to retrieve document data, which it feeds to the Google Search Appliance for
indexing.
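A traversal can be pictured as a query-and-feed loop. The sketch below is hypothetical: `FakeRepo` stands in for a repository client, since each connector (SharePoint, Lotus Notes, and so on) uses its repository's own query API.

```python
# Hypothetical sketch of a connector traversal loop. `repo` stands in for a
# repository client; real connectors use the repository's own query API and
# feed the retrieved document data to the appliance.

def traverse(repo, batch_size=100):
    """Query the repository in batches and yield document data for feeding."""
    checkpoint = None
    while True:
        batch = repo.query_documents(after=checkpoint, limit=batch_size)
        if not batch:
            break
        for doc in batch:
            yield {"url": doc["url"], "content": doc["content"]}
        checkpoint = batch[-1]["id"]  # resume point for the next query

class FakeRepo:
    """In-memory stand-in for a repository client, for illustration only."""
    def __init__(self, docs):
        self.docs = docs
    def query_documents(self, after=None, limit=100):
        start = 0 if after is None else next(
            i + 1 for i, d in enumerate(self.docs) if d["id"] == after)
        return self.docs[start:start + limit]

repo = FakeRepo([{"id": i, "url": f"repo://doc/{i}", "content": "..."} for i in range(5)])
print(len(list(traverse(repo, batch_size=2))))  # 5
```

The checkpoint lets a traversal resume where it left off rather than re-querying the whole repository each cycle.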
Feeds
Feeding is the process by which you direct content to the Google Search Appliance instead of having the
search appliance locate content. Feeding is a push process, in which the content files are pushed to the
Google Search Appliance. You can feed several types of content to a Google Search Appliance:
• A list of URLs
The crawl software fetches documents listed in the URLs.
• Content files
The files and their URLs are fed to the search appliance.
• External metadata that is not stored in a relational database, or that is difficult to map to the
content file
For more information on feeding, see the Feeds Protocol Developer’s Guide and External Metadata Indexing
Guide.
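A URL-list ("web") feed can be sketched as follows. The XML shape follows the gsafeed format described in the Feeds Protocol Developer's Guide, but the datasource name, URLs, and appliance hostname here are placeholders; consult that guide for the authoritative DTD and feed types.

```python
# Sketch of a "web" feed: a list of URLs for the crawl software to fetch.
# Real feeds reference the gsafeed DTD; datasource and URLs are placeholders.

FEED_TEMPLATE = """<?xml version="1.0" encoding="UTF-8"?>
<gsafeed>
  <header>
    <datasource>example_source</datasource>
    <feedtype>web</feedtype>
  </header>
  <group>
{records}
  </group>
</gsafeed>"""

def build_url_feed(urls):
    """Build a feed listing URLs for the search appliance to fetch."""
    records = "\n".join(
        f'    <record url="{u}" mimetype="text/html"/>' for u in urls)
    return FEED_TEMPLATE.format(records=records)

feed = build_url_feed(["http://www.example.com/a.html"])
# The feed is then pushed (HTTP POST) to the appliance's feed port, e.g.:
#   requests.post("http://appliance:19900/xmlfeed",
#                 data={"feedtype": "web",
#                       "datasource": "example_source",
#                       "data": feed})
```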
Indexing
Indexing is the process of adding the content from the crawled documents to the index.
After a file is retrieved by the crawl, the file is converted to an HTML file and submitted for indexing. The
indexing process extracts the full text from each content file, breaks down the text, and adds both the
text and information such as date and page rank to the index so that users’ search requests can be
satisfied. The index and the HTML versions of each indexed file are stored on the search appliance.
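The text extraction and breakdown step can be illustrated with a toy inverted index. This is a drastic simplification with placeholder URLs; the appliance's index also stores dates, page rank, and the HTML version of each file.

```python
# Toy inverted index: maps each term to the set of documents containing it,
# a simplified version of the index built from extracted text.
from collections import defaultdict

def build_index(documents):
    """Index a mapping of URL -> extracted text."""
    index = defaultdict(set)
    for url, text in documents.items():
        for term in text.lower().split():  # crude tokenization
            index[term].add(url)
    return index

docs = {
    "http://example.com/a.html": "search appliance installation",
    "http://example.com/b.html": "planning the installation",
}
index = build_index(docs)
print(sorted(index["installation"]))  # both URLs
```

A search request is then answered by looking up the query terms in the index rather than scanning the documents themselves.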