User Guide


Getting Results with Novell Web Services

103-000133-001

August 29, 2001

Novell Confidential

Manual 99a38 July 17, 2001

encodings identified in the SearchServlet and PrintServlet properties files.

You can modify these settings using NetWare Web Search Manager.

Because the character set of most languages can be represented by more than one encoding, NetWare Web Search Server supports a wide variety of character set encodings and encoding aliases.

Some examples of character set encodings include iso-8859-1, shift_jis, big5, and latin2. The official list of registered encodings is available from the Internet Assigned Numbers Authority (IANA) (see Table 16 on page 222). These are the official names for character sets that can be used on the Internet and referred to in Internet documentation. However, NetWare Web Search Server does not support every IANA-registered character set encoding. Refer to Table 16 on page 222 for the encodings and encoding aliases that it supports.
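The relationship between an encoding and its aliases can be illustrated with a minimal sketch. It uses Python's standard `codecs` registry, not the NetWare Web Search Server table, so the set of supported names is an assumption of this example and may differ from Table 16. The helper `canonical_name` and the label `no-such-encoding` are hypothetical names introduced for illustration.

```python
import codecs

def canonical_name(label):
    """Return the codec registry's canonical name for a label, or None if unsupported."""
    try:
        return codecs.lookup(label).name
    except LookupError:
        return None

# Labels to check: some are aliases of each other, one is deliberately invalid.
labels = ["iso-8859-1", "latin1", "shift_jis", "big5",
          "latin2", "iso-8859-2", "no-such-encoding"]

resolved = {label: canonical_name(label) for label in labels}
for label, name in resolved.items():
    print(f"{label:>18} -> {name or 'not supported'}")
```

In this sketch, an alias such as latin2 resolves to the same codec as iso-8859-2, while an unregistered label is reported as not supported rather than raising an error.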

Unicode and UTF-8

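Unicode is a character encoding standard developed by the Unicode Consortium. In the 16-bit form described here, two bytes represent each character, which enables almost all of the written languages of the world to be represented using a single character set. (The standard also defines characters outside the 16-bit range; these are represented by a pair of 16-bit units.) Unicode does not require any special processing, such as switching between character sets, to access a character in any language.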

This makes Unicode very easy to use when processing text from multiple languages and scripts, which is why NetWare Web Search converts all external files into Unicode for processing.
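The idea of converting external files into Unicode can be sketched as follows. This is an illustration in Python under stated assumptions, not the conversion code used by NetWare Web Search: the sample word and the helper `to_unicode` are hypothetical, and the declared encoding of each input is assumed to be known.

```python
# The same word stored in two different external encodings.
text = "caf\u00e9"                           # "café" as a Unicode string
latin1_bytes = text.encode("iso-8859-1")     # one byte per character
utf8_bytes = text.encode("utf-8")            # the accented letter takes two bytes

def to_unicode(raw, declared_encoding):
    """Decode raw bytes into a Unicode string using the declared encoding."""
    return raw.decode(declared_encoding)

from_latin1 = to_unicode(latin1_bytes, "iso-8859-1")
from_utf8 = to_unicode(utf8_bytes, "utf-8")

# The byte sequences differ, but the decoded Unicode strings are identical.
print(latin1_bytes != utf8_bytes, from_latin1 == from_utf8)
```

Once both inputs are decoded, later processing can treat them identically, regardless of the encoding in which each file was originally stored.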

As already mentioned, 16-bit Unicode uses at least two bytes for every character. Although this is ideal for computer processing, it doubles the size of text in languages that a single-byte encoding can represent, which has a significant impact on Internet performance. For this reason, NetWare Web Search also supports an alternate representation of Unicode known as UTF-8. UTF-8 is a Unicode Transformation Format that uses sequences of one to four bytes to represent all the characters in the Unicode standard. (The original UTF-8 definition allowed sequences of up to six bytes, but no assigned character requires more than four.) Most notably, ASCII characters are transmitted without any conversion at all, so content that contains only ASCII characters is already in the UTF-8 representation. Many Asian languages, however, require three bytes per character in the UTF-8 format, and characters outside the 16-bit range require four.
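The size trade-off between the two representations can be checked with a short sketch. The sample characters below are chosen for illustration and do not come from this manual; the helper `encoded_sizes` is a hypothetical name, and UTF-16 (big-endian, no byte order mark) stands in for the 16-bit form of Unicode.

```python
# Sample characters (chosen for illustration; not taken from the manual).
samples = {
    "ASCII letter A": "A",
    "Accented Latin letter": "\u00e9",
    "Japanese kanji": "\u65e5",
    "Musical symbol outside the 16-bit range": "\U0001d11e",
}

def encoded_sizes(ch):
    """Return (bytes in UTF-16, bytes in UTF-8) for a single character."""
    return len(ch.encode("utf-16-be")), len(ch.encode("utf-8"))

sizes = {label: encoded_sizes(ch) for label, ch in samples.items()}
for label, (utf16_len, utf8_len) in sizes.items():
    print(f"{label}: UTF-16 = {utf16_len} bytes, UTF-8 = {utf8_len} bytes")

# Pure ASCII text is byte-for-byte identical in ASCII and UTF-8.
ascii_unchanged = "Web Search".encode("ascii") == "Web Search".encode("utf-8")
```

For mostly ASCII content, UTF-8 is about half the size of the 16-bit form, while text made up largely of three-byte characters is larger in UTF-8 than in the 16-bit form.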

You will have to decide whether Unicode or UTF-8 best meets your needs when creating HTML content, Web Search templates, or search pages.