User's Manual

Chapter 11. Services Tutorials
11.6.1.3. Content provision
The remaining method to be implemented in the MetadataProvider interface is the one
that actually provides the searchable content. Searchable content can currently be
provided in three formats, each of which is identified by a static constant in the
com.arsdigita.search.ContentType class. Each search indexer supports a different set of
formats, so applications should implement every format that makes sense for their content.
TEXT - plain text, with no markup. Equivalent to the text/plain MIME type.
RAW - an arbitrary document format such as HTML, OpenOffice, PDF, or RTF. The indexer
auto-detects the format and extracts the content for building the search index.
XML - a well-formed XML document containing arbitrary elements. The indexer extracts
content from the elements and keeps track of each element's XPath, which enables searches
to be restricted by element name.
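Because each indexer supports a different set of formats, only the formats that both the application offers and the indexer accepts are of any use. The following is a minimal sketch of that intersection, not WAF code: the Format enum is a hypothetical stand-in for the constants in com.arsdigita.search.ContentType, and the usable() helper and the class name are invented for illustration.

```java
import java.util.ArrayList;
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

// Illustration only: none of these types belong to the com.arsdigita.search API.
class FormatSelectionSketch {

    // Hypothetical stand-in for the three ContentType constants.
    enum Format { TEXT, RAW, XML }

    // Returns the formats that the application offers and the indexer supports,
    // in declaration order (TEXT, RAW, XML).
    static List<Format> usable(Set<Format> offered, Set<Format> supportedByIndexer) {
        List<Format> result = new ArrayList<>();
        for (Format f : Format.values()) {
            if (offered.contains(f) && supportedByIndexer.contains(f)) {
                result.add(f);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // An application that offers plain text and a raw document format.
        Set<Format> offered = EnumSet.of(Format.TEXT, Format.RAW);

        Set<Format> textOnlyIndexer = EnumSet.of(Format.TEXT);
        Set<Format> rawAndXmlIndexer = EnumSet.of(Format.RAW, Format.XML);

        System.out.println(usable(offered, textOnlyIndexer));   // prints [TEXT]
        System.out.println(usable(offered, rawAndXmlIndexer));  // prints [RAW]
    }
}
```

An application that offered only one format would return an empty list for an indexer that does not accept it, which is why implementing more than one format is worthwhile.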
The Lucene implementation in WAF supports the TEXT content type, and InterMedia supports the RAW
and XML formats. The ContentProvider interface defines the API for providing content. The Note
object allows its body text attribute to store HTML, so there can be two implementations of
ContentProvider, one for TEXT and the other for HTML. These are typically written as
package-private inner classes of the MetadataProvider implementation:
...
class TextContentProvider implements ContentProvider {
    private Note m_note;

    public TextContentProvider(Note note) {
        m_note = note;
    }

    // Label identifying this piece of content
    public String getTag() {
        return "Body Text";
    }

    // This provider supplies plain text
    public ContentType getType() {
        return ContentType.TEXT;
    }

    public byte[] getBytes() {
        String body = m_note.getBody();
        // Strip out HTML tags
        return StringUtils.htmlToText(body).getBytes();
    }
}
...
Example 11-12. Text content provider
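The getBytes() method in Example 11-12 can be tried outside WAF with a small stand-alone sketch. Everything here is an assumption made for illustration: the Note class is a hypothetical stand-in for the tutorial's domain object, stripTags() is a crude regular-expression substitute for StringUtils.htmlToText (whose real behavior is not shown in this section), and UTF-8 is an assumed encoding, since this section does not state which encoding the indexer expects.

```java
import java.nio.charset.StandardCharsets;

// Illustration only: a stand-alone approximation, not the WAF classes.
class TextProviderSketch {

    // Hypothetical stand-in for the tutorial's Note domain object.
    static class Note {
        private final String body;

        Note(String body) {
            this.body = body;
        }

        String getBody() {
            return body;
        }
    }

    // Crude stand-in for StringUtils.htmlToText: replaces each tag with a
    // space, collapses runs of whitespace, and trims the ends.
    static String stripTags(String html) {
        return html.replaceAll("<[^>]*>", " ").replaceAll("\\s+", " ").trim();
    }

    // Mirrors the shape of TextContentProvider.getBytes(), but names the
    // charset explicitly (assumed UTF-8) instead of using the platform default.
    static byte[] textBytes(Note note) {
        return stripTags(note.getBody()).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        Note note = new Note("<p>Buy <b>milk</b> today</p>");
        System.out.println(new String(textBytes(note), StandardCharsets.UTF_8));
        // prints: Buy milk today
    }
}
```

Calling getBytes() with no argument, as the example does, uses the platform default charset, so the bytes handed to the indexer can differ between machines. Naming the charset is one way to avoid that, if the indexer's expected encoding is known.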
...
class HTMLContentProvider implements ContentProvider {
    private Note m_note;

    public HTMLContentProvider(Note note) {
        m_note = note;
    }