User Guide

Getting Results with Novell Web Services
103-000133-001
August 29, 2001
Manual 99a38 July 17, 2001
encodings identified in the SearchServlet and PrintServlet properties files.
You can modify these settings using NetWare Web Search Manager.
Because the character sets of most languages can be identified by several
different encodings, NetWare Web Search Server supports a wide variety of
character set encodings and encoding aliases.
Some examples of character set encodings include iso-8859-1, shift_jis, big5,
and latin2. The official list of registered encodings is available from the
Internet Assigned Numbers Authority (IANA) (see Table 16 on page 222). These are
the official names for character sets that can be used on the Internet and
referred to in Internet documentation. However, not all IANA-registered
character set encodings are supported by NetWare Web Search Server. Refer
to Table 16 on page 222 for a list of the encodings and encoding aliases that
NetWare Web Search Server supports.
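The idea of an encoding alias can be illustrated with a short Python sketch.
It uses Python's own codec registry, which is only a stand-in here: it is not
the alias table used by NetWare Web Search Server, and the helper names
canonical_name and is_supported are hypothetical. The sketch shows several
labels resolving to one underlying encoding, and an unregistered label being
rejected.

```python
import codecs


def canonical_name(label: str) -> str:
    """Return the canonical codec name Python registers for an encoding label."""
    return codecs.lookup(label).name


def is_supported(label: str) -> bool:
    """Report whether Python's codec registry recognizes the label at all."""
    try:
        codecs.lookup(label)
        return True
    except LookupError:
        return False


# Labels taken from the examples in the text, plus iso-8859-2 for comparison.
# In Python's registry, latin2 is an alias for the iso-8859-2 encoding.
for label in ["iso-8859-1", "shift_jis", "big5", "latin2", "iso-8859-2"]:
    print(f"{label:12} -> {canonical_name(label)}")

# A label that is not registered (assumed example name) is reported as unsupported.
print(is_supported("no-such-charset"))
```

An application would typically perform a check like is_supported before
attempting to read a document, much as Table 16 should be consulted before
configuring an encoding in Web Search.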
Unicode and UTF-8
Unicode is a 16-bit character encoding standard developed by the Unicode
Consortium. By using two bytes to represent each character, Unicode enables
almost all of the written languages of the world to be represented using a
single character set. Unicode does not require any special processing to access
any character in any language.
This makes Unicode very easy to use when processing text from multiple
languages and scripts. This is the reason NetWare Web Search converts all
external files into Unicode for processing.
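The benefit of converting everything into Unicode can be sketched in Python.
This is an illustration only, not the conversion code used by NetWare Web
Search; the function name to_unicode and the sample text are assumptions. The
same Japanese text stored in two different encodings produces different bytes
on disk, but both decode to an identical Unicode string that can then be
processed in one way.

```python
def to_unicode(raw: bytes, encoding: str) -> str:
    """Decode the raw bytes of an external file into a Unicode string."""
    return raw.decode(encoding)


# Sample text (assumed for illustration): the word "Japanese" written in kanji.
text = "\u65e5\u672c\u8a9e"

# The same text as it might be stored by two differently encoded files.
shift_jis_bytes = text.encode("shift_jis")
euc_jp_bytes = text.encode("euc_jp")

# The stored byte sequences differ between the two encodings...
print(shift_jis_bytes != euc_jp_bytes)

# ...but after conversion both files yield the same Unicode text.
print(to_unicode(shift_jis_bytes, "shift_jis") == to_unicode(euc_jp_bytes, "euc_jp"))
```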
As already mentioned, Unicode is two bytes wide for all characters. Although
this is ideal for computer processing, it doubles the size of text written in
single-byte languages, which can have a significant impact on Internet
performance. For this reason, NetWare Web Search also supports an alternate
representation of Unicode known as UTF-8. UTF-8 is a Unicode Transformation
Format that uses sequences of 1 to 6 bytes to represent all the characters in
the Unicode standard. Most notably, ASCII characters are transmitted without
any conversion at all. This means that Internet content consisting only of
ASCII characters is already in the UTF-8 representation. Many Asian languages,
however, require three bytes per character in the UTF-8 format. The longer
sequences are needed only for characters that fall outside the 16-bit range.
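The size trade-off described above can be checked with a short Python sketch.
It is an illustration, not part of NetWare Web Search: the helper name
byte_lengths and the sample characters are assumptions, and Python's
utf-16-be codec stands in for the two-byte Unicode representation.

```python
def byte_lengths(ch: str) -> tuple[int, int]:
    """Return (bytes in two-byte Unicode, bytes in UTF-8) for one character."""
    return len(ch.encode("utf-16-be")), len(ch.encode("utf-8"))


# Sample characters (assumed for illustration), written as escapes so the
# exact code points are unambiguous.
samples = {
    "ASCII letter A": "A",
    "accented Latin letter (U+00E9)": "\u00e9",
    "Japanese kanji (U+65E5)": "\u65e5",
}

for description, ch in samples.items():
    two_byte, utf8 = byte_lengths(ch)
    print(f"{description}: {two_byte} bytes as two-byte Unicode, {utf8} as UTF-8")

# ASCII-only text is byte-for-byte identical in ASCII and UTF-8,
# and half the size of its two-byte Unicode form.
word = "search"
print(word.encode("ascii") == word.encode("utf-8"))
print(len(word.encode("utf-16-be")), len(word.encode("utf-8")))
```

For mostly ASCII content, UTF-8 is the smaller representation; for content
made up largely of three-byte characters, the two-byte form is smaller.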
You will have to decide if Unicode or UTF-8 best meets your needs when
creating HTML content, Web Search templates, or search pages.