Oracle Ultra Search Online Documentation Release 9.2
The Ultra Search crawler is a Java application that spawns threads to crawl defined data sources, such as Web sites, database tables, or email archives.
Crawling occurs at regularly scheduled intervals, as defined in the Schedules Page.
Use this page to set the following crawler parameters:
Crawler Threads
Specify the number of crawler threads to be spawned at run time.
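A crawler of this kind typically maps the configured thread count onto a fixed-size thread pool. The sketch below only illustrates that idea; the `CRAWLER_THREADS` constant and the placeholder URL list are assumptions, not Ultra Search internals.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class CrawlerThreadsSketch {
    // Illustrative value; in Ultra Search this number comes from the Crawler page.
    static final int CRAWLER_THREADS = 4;

    public static void main(String[] args) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(CRAWLER_THREADS);
        // Placeholder URL queue; the real crawler pulls URLs from its own queue.
        List<String> queue = List.of("http://host-a/", "http://host-b/", "http://host-c/");
        for (String url : queue) {
            pool.submit(() -> System.out.println("crawl: " + url));
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
    }
}
```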
Number of Processors
Specify the number of central processing units (CPUs) that exist on the server where the Ultra Search crawler will run. This setting determines the optimal number of document conversion threads used by the system. A document conversion thread converts multiformat documents into HTML documents for proper indexing.
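The exact mapping from CPU count to conversion threads is internal to Ultra Search; the sketch below assumes the simplest policy, one conversion thread per configured CPU with a minimum of one, purely to illustrate how the setting is consumed.

```java
public class ConversionThreadsSketch {
    // Assumption: one document-conversion thread per configured CPU, at least one.
    // Ultra Search's actual sizing policy is internal; this only illustrates the idea.
    static int conversionThreads(int configuredCpus) {
        return Math.max(1, configuredCpus);
    }

    public static void main(String[] args) {
        // If the setting were unknown, the CPUs visible to the JVM are a natural default.
        int cpus = Runtime.getRuntime().availableProcessors();
        System.out.println("conversion threads: " + conversionThreads(cpus));
    }
}
```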
Automatic Language Detection
Not all documents retrieved by the Ultra Search crawler specify the language. For documents with no language specification, the Ultra Search crawler attempts to automatically detect language. Click Yes to turn on this feature.
Default Language
If automatic language detection is disabled, or when a Web document does not have a specified language, the crawler assumes that the Web page is written in this default language. This setting is important, because language directly determines how a document is indexed.
Note: This default language is used only if the crawler cannot determine the document language during crawling. You can set language preference in the Users Page.
You can choose a default language for the crawler or for data sources. Default language support for indexing and querying is available for the following languages:
- English
- Brazilian Portuguese
- Danish
- Dutch
- French
- German
- Italian
- Japanese
- Korean
- Portuguese
- Simplified Chinese
- Spanish
- Swedish
- Traditional Chinese
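The precedence described above — a document-declared language first, then automatic detection when enabled, then the default language — can be sketched as follows. The `Detector` interface is a stand-in for Ultra Search's internal detection, not its actual API.

```java
import java.util.Optional;

public class LanguageFallbackSketch {
    // Stand-in for the crawler's internal language detector.
    interface Detector {
        Optional<String> detect(String text);
    }

    static String resolve(Optional<String> declared, boolean autoDetect,
                          Detector detector, String text, String defaultLanguage) {
        if (declared.isPresent()) {
            return declared.get();            // document specifies its language
        }
        if (autoDetect) {
            Optional<String> guess = detector.detect(text);
            if (guess.isPresent()) {
                return guess.get();           // automatic detection succeeded
            }
        }
        return defaultLanguage;               // last resort: the default language
    }
}
```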
Crawling Depth
A Web document can contain links to other Web documents, which in turn can contain more links. This setting specifies the maximum number of nested links the crawler will follow. Because each additional level can multiply the number of documents retrieved, the crawling depth largely determines the size of a crawl.
Crawler Timeout Threshold
Specify a crawler timeout in seconds. The timeout threshold forces the crawler to give up on a Web page that it cannot access within that time.
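In Java, this kind of threshold is typically enforced by setting connect and read timeouts on the HTTP connection. A minimal sketch; the 30-second value and method name are illustrative, not Ultra Search code.

```java
import java.net.HttpURLConnection;
import java.net.URL;

public class TimeoutSketch {
    // timeoutSeconds plays the role of the crawler timeout threshold.
    static HttpURLConnection openWithTimeout(String url, int timeoutSeconds)
            throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setConnectTimeout(timeoutSeconds * 1000); // fail if no TCP connection
        conn.setReadTimeout(timeoutSeconds * 1000);    // fail if the page stalls
        return conn;
    }

    public static void main(String[] args) throws Exception {
        HttpURLConnection conn = openWithTimeout("http://example.com/", 30);
        System.out.println("connect timeout ms: " + conn.getConnectTimeout());
    }
}
```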
Default Character Set
Specify the default character set. The crawler uses this setting when an HTML document does not have its character set specified.
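The fallback rule — use the charset declared in the document's HTTP header or `<meta>` tag when present, otherwise the configured default — can be sketched as below. The method and parameter names are illustrative.

```java
import java.nio.charset.Charset;
import java.util.Optional;

public class CharsetFallbackSketch {
    // declared is the charset from the Content-Type header or <meta> tag, if any;
    // defaultCharset corresponds to the crawler's default character set setting.
    static Charset resolve(Optional<String> declared, String defaultCharset) {
        return Charset.forName(declared.orElse(defaultCharset));
    }
}
```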
Temporary Directory Location and Size
Specify a temporary directory and its size. The crawler uses the temporary directory for interim storage during indexing. Specify the absolute path of the temporary directory. The size is the maximum space, in megabytes, that the crawler will use in this directory.
The size of the temporary directory is important because it affects index fragmentation. The smaller the size, the more fragmented the index. As a result, the query will be slower, and index optimization needs to be performed more frequently. Increasing the directory size reduces index fragmentation, but it also reduces crawling throughput (total number of documents crawled each hour). This is because it takes longer to index a bigger temporary directory, and the crawler needs to wait for the indexing to complete before it can continue writing new documents to the directory.
Crawler Logging
Specify the following:
- Level of detail: everything or only a summary
- Crawler log file directory
- Crawler log file language
The log file directory stores the crawler log files. A log file records all crawler activity, warnings, and error messages for a particular schedule, including messages logged at startup, during runtime, and at shutdown. Logging everything can create very large log files when crawling a large number of documents; however, in certain situations it can be beneficial to configure the crawler to print detailed activity to each schedule log file. The crawler log file language is the language the crawler uses to write the log file.
Database Connect String
The database connect string is a standard JDBC connect string used by the remote crawler when it connects to the database. The connect string can be provided in the form [hostname]:[port]:[sid] or in TNS keyword-value syntax; for example:
"(DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=...)(PORT=5521)...))"
There is no connect string for a crawler launched locally on the server machine. Instead, it uses the Oracle JDBC OCI driver to connect to Oracle; for example, "jdbc:oracle:oci8:@". See the Oracle9i JDBC Developer's Guide and Reference for more information.
In a Real Application Clusters environment, use the TNS keyword-value syntax, because it allows connection to any node of the system. For example:
"(DESCRIPTION=(LOAD_BALANCE=yes)(ADDRESS=(PROTOCOL=TCP)(HOST=cls02a)(PORT=3001))(ADDRESS=(PROTOCOL=TCP)(HOST=cls02b)(PORT=3001))(CONNECT_DATA=(SERVICE_NAME=sales.us.acme.com)))"
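Mechanically, the host:port:sid form is three values joined by colons; for Oracle's thin driver the string is typically prefixed with jdbc:oracle:thin:@ (that prefix is an assumption here, since the text above only specifies the host:port:sid shape). The host, port, and SID below are placeholders.

```java
public class ConnectStringSketch {
    // Builds the [hostname]:[port]:[sid] form described above.
    static String hostPortSid(String host, int port, String sid) {
        return host + ":" + port + ":" + sid;
    }

    public static void main(String[] args) {
        String connect = hostPortSid("dbhost.example.com", 1521, "orcl");
        System.out.println(connect);                         // dbhost.example.com:1521:orcl
        System.out.println("jdbc:oracle:thin:@" + connect);  // full thin-driver URL (assumed prefix)
    }
}
```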
Remote Crawler Profiles
Use this page to view and edit remote crawler profiles. A remote crawler profile consists of all parameters needed to run the Ultra Search crawler on a machine other than the Oracle Ultra Search database host. A remote crawler profile is identified by its hostname. The profile includes the cache, log, and mail directories that the remote crawler shares with the database machine.
To set these parameters, click Edit. Enter the shared directory paths as seen by the remote crawler. You must ensure that these directories are shared or mounted appropriately.
Crawler Statistics
Use this page to view the following crawler statistics:
Summary of Crawler Activity
This provides a general summary of crawler activity:
- Aggregate crawler statistics
- Total number of documents indexed
- Crawler statistics by data source name
Detailed Crawler Statistics
This includes the following:
- List of hosts crawled
- Document distribution by depth
- Document distribution by document type
- Document distribution by data source type
Crawler Progress
Problematic URLs
This lists errors encountered during the crawling process. It also lists the number of URLs that cause each error.
Copyright © 2002 Oracle Corporation. All Rights Reserved.