You can implement a crawler agent to crawl and index a proprietary document repository, such as Lotus Notes or Documentum. In Ultra Search, the proprietary repository is called a user-defined data source. The module that enables the crawler to access the data source is called a crawler agent.
The agent collects document URLs and associated metadata from the user-defined data source and returns the information to the Ultra Search crawler, which enqueues it for later crawling. The crawler agent must be implemented in Java using the Ultra Search crawler agent API.
A crawler agent does the following:
- Authenticates the crawler for accessing the data source
- Provides access to the data source document through an HTTP URL (display URL)
- Provides the metadata of the document in the form of document attributes
- Maps document attributes to common attribute names used by end users
- Provides a "flattened" view of the data source, so that documents are retrieved one by one in a streaming fashion
- Instructs the crawler, if necessary, to parse the URL document for standard metadata such as author and title
- Optionally provides the list of URLs that have changed since a given time stamp
- Optionally provides an access URL in addition to the display URL for the processing of the document
From the crawler's perspective, the agent retrieves the list of URLs from the target data source, and the crawler saves that list in its queue before processing it.
Note: If the crawler is interrupted for any reason, the agent invocation process is repeated with the original last crawl time stamp. If the crawler has already finished enqueuing the URLs fetched from the agent and is halfway through crawling them, then it only starts the agent but does not fetch URLs from it. Instead, it finishes crawling the URLs already enqueued.
There are two kinds of crawler agents: standard agents and smart agents.
Standard Agent
The standard agent returns the list of URLs currently existing in the data source. It does not know whether any of the URLs have been crawled before, and it relies on the crawler to find any updates to the target data source. The standard agent's interaction with the crawler is the following (see the sketch after this list):
- The crawler marks all existing URLs of this data source for garbage collection, on the assumption that they no longer exist in the target data source.
- The crawler calls the agent to get an updated list of URLs. Every returned URL that already exists in the URL table is marked for crawling; every new URL is inserted into the URL table and the queue.
- The crawler deletes the URLs that are still marked for garbage collection.
- The crawler goes through every URL marked for crawling and checks for updates.
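Viewed as code, this interaction is a mark-and-sweep pass over the crawler's URL table. The following Java sketch is purely illustrative; the Mark flags, the table, and the queue types are simplifications, not the actual crawler internals:

    import java.util.*;

    // Illustrative mark-and-sweep pass for a standard agent. The Mark flags,
    // URL table, and queue are simplified stand-ins, not Ultra Search internals.
    class StandardAgentPass {

        enum Mark { GARBAGE, CRAWL }

        static void run(Map<String, Mark> urlTable, Iterable<String> agentUrls,
                        Deque<String> queue) {
            // 1. Assume every known URL is gone until the agent says otherwise.
            urlTable.replaceAll((url, mark) -> Mark.GARBAGE);

            // 2. Fetch the current URL list from the agent.
            for (String url : agentUrls) {
                if (urlTable.containsKey(url)) {
                    urlTable.put(url, Mark.CRAWL);  // known URL: mark for crawling
                } else {
                    urlTable.put(url, Mark.CRAWL);  // new URL: insert into the table...
                    queue.add(url);                 // ...and into the crawling queue
                }
            }

            // 3. Delete URLs still marked for garbage collection.
            urlTable.values().removeIf(mark -> mark == Mark.GARBAGE);

            // 4. The crawler then checks every URL marked for crawling for updates.
        }
    }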
Smart Agent
The smart agent uses a modified-since time stamp (provided by the crawler) to return the list of URLs that have been updated, inserted, or deleted. The crawler crawls only the URLs returned by the agent and does not recrawl existing ones; URLs reported as deleted are removed from the URL table. If the smart agent can return only updated and inserted URLs, but not deleted ones, then the crawler cannot detect deletions. In this case, you must change the schedule's crawler recrawl policy so that the schedule periodically runs in force recrawl mode. Force recrawl mode signals the agent to return every URL in the data source.
The agent method isDeltaCrawlingCapable() tells the crawler whether the agent it invokes is a standard agent or a smart agent. The agent method startCrawling(boolean forceRecrawl, Date lastCrawlTime) lets the crawler tell the agent the last crawl time and whether the crawler is running in force recrawl mode.
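For illustration, a smart agent might honor these two calls as in the sketch below. The interface shown is a reduced stand-in that declares only the two documented methods, and the repository query is hypothetical:

    import java.util.Date;

    // Reduced stand-in for the CrawlerAgent interface; the real interface
    // declares more methods than the two documented calls shown here.
    interface CrawlerAgent {
        boolean isDeltaCrawlingCapable();
        void startCrawling(boolean forceRecrawl, Date lastCrawlTime);
    }

    // Hypothetical smart agent for a repository that reports changes by time stamp.
    class NotesSmartAgent implements CrawlerAgent {

        private Iterable<String> pendingUrls;

        // A smart agent returns true: it can produce the delta since a time stamp.
        public boolean isDeltaCrawlingCapable() {
            return true;
        }

        public void startCrawling(boolean forceRecrawl, Date lastCrawlTime) {
            if (forceRecrawl || lastCrawlTime == null) {
                // Force recrawl mode: return every URL in the data source, so
                // the crawler can also detect deleted documents.
                pendingUrls = queryRepository(null);
            } else {
                // Normal mode: return only URLs inserted or updated since the last crawl.
                pendingUrls = queryRepository(lastCrawlTime);
            }
        }

        // Hypothetical repository query; a null time stamp means "everything".
        private Iterable<String> queryRepository(Date since) {
            return java.util.List.of(); // placeholder for repository-specific logic
        }
    }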
Document Attributes and Properties
Document attributes, or metadata, describe document properties. Some attributes can be irrelevant to your application. The crawler agent creator must decide which document attributes should be extracted and saved. The agent can also be written so that the list of collected attributes is configurable. Ultra Search automatically registers attributes returned by the agent. The agent can decide which attributes to return for a document.
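For example, an agent might hold the configured attribute names in a set and filter each document's metadata against it. The sketch below uses plain Java collections; the AttributeFilter class is purely hypothetical, not part of the Ultra Search API:

    import java.util.*;

    // Hypothetical helper that narrows document metadata to a configurable
    // list of attributes before the agent returns them to the crawler.
    class AttributeFilter {

        private final Set<String> wanted;

        // The attribute list could come from a data source parameter, for example.
        AttributeFilter(Collection<String> configuredAttributes) {
            this.wanted = new HashSet<>(configuredAttributes);
        }

        // Keep only the attributes the agent was configured to collect.
        Map<String, String> select(Map<String, String> documentMetadata) {
            Map<String, String> kept = new LinkedHashMap<>();
            for (Map.Entry<String, String> e : documentMetadata.entrySet()) {
                if (wanted.contains(e.getKey())) {
                    kept.put(e.getKey(), e.getValue());
                }
            }
            return kept;
        }
    }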
Data Source Type Registration
A data source type is an abstraction of a data source. You can define new data source types with the following attributes:
- Name of data source type: For example, Lotus Notes. The name cannot be more than 100 bytes.
- ID of data source type: This is automatically assigned
- Description of the data source type: The limit is 4000 bytes.
- Agent Java class name: For example, WebDbAgent. The location of this class is predefined by Ultra Search in $ORACLE_HOME/ultrasearch/lib/agent/ and cannot be changed.
- Agent jar file name: The agent class can be stored in a Java jar file. This jar file must be in $ORACLE_HOME/ultrasearch/lib/agent/.
- Parameters: Parameters are the properties of a data source; for example, the seed URL and inclusion pattern for a Web data source. Define a parameter by specifying a parameter name (100 bytes maximum) and its description (4000 bytes maximum).
- Encryption: Whether the value of this parameter should be encrypted when stored. By default, a parameter is not encrypted.
Ultra Search does not enforce parameter occurrence constraints. You cannot specify that a particular parameter must occur zero or more times, at least once, or exactly once.
Data Source Registration
After a data source type is defined, any instance of that data source type can be defined:
- Data source name
- Description of the data source; limited to 4000 bytes
- Data source type ID
- Default language; default is 'en' (English)
- Parameter values; for example, seed = http://www.oracle.com and depth = 8 (see the sketch after this list)
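On the agent side, such values are typically read through the DataSourceParams interface described later in this section. The sketch below assumes a simple getValue(String) accessor; that signature is an illustration, not the documented API:

    // Simplified stand-in for the DataSourceParams interface; the assumed
    // getValue(String) accessor is illustrative, not the documented API.
    interface DataSourceParams {
        String getValue(String name);
    }

    // Hypothetical agent initialization that reads the parameter values
    // registered for a data source instance (for example, seed and depth).
    class WebAgentConfig {
        final String seed;
        final int depth;

        WebAgentConfig(DataSourceParams params) {
            this.seed = params.getValue("seed");                     // e.g. http://www.oracle.com
            this.depth = Integer.parseInt(params.getValue("depth")); // e.g. 8
        }
    }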
Data Source Attribute Registration
You can add new attributes to Ultra Search by providing the attribute name and the attribute data type. The data type can be string, number, or date. Attributes with the same name but different data types can be added. Attributes returned by an agent are automatically registered if they have not already been defined.
User-Implemented Crawler Agent
The crawler agent has the following requirements:
- The agent must be implemented in Java
- The agent must support the Java agent APIs defined by Ultra Search
- The agent must return the URL attributes and properties
- The agent optionally can authenticate the crawler's access to the data source
- The agent must "flatten" the data source such that document is retrieved one by one in a streaming fashion. This is to encapsulate the crawling logic of a specific data source into the agent.
- The agent must decide which document attributes Ultra Search should keep. Any attribute not defined in Ultra Search is automatically registered.
- The agent can map attributes to data source properties. For example, if an attribute "ID" is the unique ID of a document, then the agent should return (document_key, 4), where "ID" has been mapped to the property "document_key" and its value is 4 for this particular document (see the sketch after this list).
- If attribute LOVs (lists of values) are available, then the agent returns them upon request.
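For illustration, the attribute-to-property mapping from the last two points might be kept as a small lookup table inside the agent; everything in this sketch (class name, map, helper method) is hypothetical:

    import java.util.*;

    // Hypothetical attribute-to-property mapping: the repository attribute
    // "ID" is mapped to the data source property "document_key".
    class PropertyMapper {

        // Maps repository attribute names to data source property names.
        private static final Map<String, String> MAPPING =
                Map.of("ID", "document_key");

        // Returns (property, value) pairs for attributes that have a mapping.
        static Map<String, String> toProperties(Map<String, String> attributes) {
            Map<String, String> properties = new LinkedHashMap<>();
            attributes.forEach((name, value) -> {
                String property = MAPPING.get(name);
                if (property != null) {
                    properties.put(property, value); // e.g. ("document_key", "4")
                }
            });
            return properties;
        }
    }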
Interaction between the Crawler and the Crawler Agent
The crawler crawls data sources defined by the user through the invocation of the user-supplied crawler agent. The crawler can do the following:
- Invoke the crawler agent of the defined data source
- Supply data source parameter information to the agent
- Authenticate itself with the agent if needed
- Retrieve a list of URLs, with associated attributes and properties, that need to be crawled
- Use the URL provided by the agent to retrieve the document
- Detect inserts, updates, and deletes in the data source
- Retrieve attribute LOV data if available
The crawler agent API is a collection of methods used to implement a crawler agent. A sample implementation of a crawler agent, SampleAgent.java, is provided under $ORACLE_HOME/ultrasearch/sample/.
UrlData: The crawler agent uses this interface to populate document properties and attribute values. Ultra Search provides a basic implementation of this interface, the DocAttributes class, which has a no-argument constructor; the agent can use it directly or extend it if necessary. The agent might decide to create a pool of UrlData objects and cycle through them during crawling. In the simplest implementation, the agent creates one DocAttributes object, repeatedly resets and populates the data, and returns this object.
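A sketch of that simplest pattern follows. The DocAttributes stand-in below is a plain class with assumed reset and set methods; the real class implements the UrlData interface and its exact methods may differ:

    import java.util.*;

    // Simplified stand-in for Ultra Search's DocAttributes; the real class
    // implements the UrlData interface. The reset/set methods are assumed.
    class DocAttributes {
        private String displayUrl;
        private final Map<String, String> attributes = new HashMap<>();

        void reset()                             { displayUrl = null; attributes.clear(); }
        void setDisplayUrl(String url)           { displayUrl = url; }
        void setAttribute(String name, String v) { attributes.put(name, v); }
    }

    // The simplest agent pattern: one reusable DocAttributes object that is
    // reset and repopulated for each document, then handed back to the crawler.
    class OneObjectAgent {
        private final DocAttributes data = new DocAttributes();

        DocAttributes nextDocument(String url, Map<String, String> metadata) {
            data.reset();                         // clear the previous document's values
            data.setDisplayUrl(url);
            metadata.forEach(data::setAttribute); // populate this document's attributes
            return data;                          // the same object is returned every time
        }
    }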
LovInfo: The crawler agent uses this interface to submit attribute LOV definitions.
DataSourceParams: The crawler agent uses this interface to read and write data source parameters.
AgentException: The crawler agent uses this exception class when an error occurs.
CrawlerAgent: This interface lets the crawler communicate with the user-defined data source. The crawler agent must implement this interface.
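Putting the pieces together, a skeleton agent might look like the following sketch. Only isDeltaCrawlingCapable() and startCrawling() are documented above; fetchNext() and stopCrawling() are assumed here to make the lifecycle concrete, so consult the shipped SampleAgent.java for the authoritative interface:

    import java.util.Date;
    import java.util.Iterator;
    import java.util.List;

    // Reduced stand-in for the CrawlerAgent interface. fetchNext and
    // stopCrawling are assumed methods, added only to make the lifecycle
    // concrete; see SampleAgent.java for the real interface.
    interface CrawlerAgentSketch {
        boolean isDeltaCrawlingCapable();
        void startCrawling(boolean forceRecrawl, Date lastCrawlTime);
        String fetchNext();   // assumed: next document URL, or null when exhausted
        void stopCrawling();  // assumed: release repository connections
    }

    // Hypothetical standard agent for a proprietary repository.
    class RepositoryAgent implements CrawlerAgentSketch {

        private Iterator<String> urls;

        public boolean isDeltaCrawlingCapable() {
            return false; // standard agent: always returns the full URL list
        }

        public void startCrawling(boolean forceRecrawl, Date lastCrawlTime) {
            // Authenticate against the repository and flatten it into a URL
            // list; a standard agent ignores lastCrawlTime.
            urls = List.of("rep://doc/1", "rep://doc/2").iterator();
        }

        public String fetchNext() {
            return urls.hasNext() ? urls.next() : null; // one document at a time
        }

        public void stopCrawling() {
            // Close repository sessions here.
        }
    }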