Oracle Ultra Search Online Documentation Release 9.2
The Ultra Search crawler is a Java application that spawns threads to crawl defined data sources, such as Web sites, database tables, or email archives.
Crawling occurs at regularly scheduled intervals, as defined in the Schedules Page.
Use this page to set the following crawler parameters:
Crawler Threads
Specify the number of crawler threads to be spawned at run time.
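A crawler of this kind typically maps the configured thread count onto a fixed-size thread pool. The sketch below only illustrates that idea; the `CRAWLER_THREADS` constant and the placeholder URL list are assumptions, not Ultra Search internals.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class CrawlerThreadsSketch {
    // Illustrative value; in Ultra Search this number comes from the Crawler page.
    static final int CRAWLER_THREADS = 4;

    public static void main(String[] args) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(CRAWLER_THREADS);
        // Placeholder URL queue; the real crawler pulls URLs from its own queue.
        List<String> queue = List.of("http://host-a/", "http://host-b/", "http://host-c/");
        for (String url : queue) {
            pool.submit(() -> System.out.println("crawl: " + url));
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
    }
}
```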
Number of Processors
Specify the number of central processing units (CPUs) that exist on the server where the Ultra Search crawler will run. This setting determines the optimal number of document conversion threads used by the system. A document conversion thread converts multiformat documents into HTML documents for proper indexing.
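The exact mapping from CPU count to conversion threads is internal to Ultra Search; the sketch below assumes the simplest policy, one conversion thread per configured CPU with a minimum of one, purely to illustrate how the setting is consumed.

```java
public class ConversionThreadsSketch {
    // Assumption: one document-conversion thread per configured CPU, at least one.
    // Ultra Search's actual sizing policy is internal; this only illustrates the idea.
    static int conversionThreads(int configuredCpus) {
        return Math.max(1, configuredCpus);
    }

    public static void main(String[] args) {
        // If the setting were unknown, the CPUs visible to the JVM are a natural default.
        int cpus = Runtime.getRuntime().availableProcessors();
        System.out.println("conversion threads: " + conversionThreads(cpus));
    }
}
```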
Automatic Language Detection
Not all documents retrieved by the Ultra Search crawler specify the language. For documents with no language specification, the Ultra Search crawler attempts to automatically detect language. Click Yes to turn on this feature.
Default Language
If automatic language detection is disabled, or when a Web document does not have a specified language, the crawler assumes that the Web page is written in this default language. This setting is important, because language directly determines how a document is indexed.
Note: This default language is used only if the crawler cannot determine the document language during crawling. You can set language preference in the Users Page.
You can choose a default language for the crawler or for data sources. Default language support for indexing and querying is available for the following languages:
- English
- Brazilian Portuguese
- Danish
- Dutch
- French
- German
- Italian
- Japanese
- Korean
- Portuguese
- Simplified Chinese
- Spanish
- Swedish
- Traditional Chinese
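The precedence described above — a document-declared language first, then automatic detection when enabled, then the default language — can be sketched as follows. The `Detector` interface is a stand-in for Ultra Search's internal detection, not its actual API.

```java
import java.util.Optional;

public class LanguageFallbackSketch {
    // Stand-in for the crawler's internal language detector.
    interface Detector {
        Optional<String> detect(String text);
    }

    static String resolve(Optional<String> declared, boolean autoDetect,
                          Detector detector, String text, String defaultLanguage) {
        if (declared.isPresent()) {
            return declared.get();            // document specifies its language
        }
        if (autoDetect) {
            Optional<String> guess = detector.detect(text);
            if (guess.isPresent()) {
                return guess.get();           // automatic detection succeeded
            }
        }
        return defaultLanguage;               // last resort: the default language
    }
}
```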
Crawling Depth
A Web document can contain links to other Web documents, which in turn can contain more links. This setting specifies the maximum number of nested links the crawler will follow. Because each additional level can multiply the number of documents retrieved, the crawling depth largely determines the size of a crawl.
Crawler Timeout Threshold
Specify a crawler timeout in seconds. The timeout threshold forces the crawler to give up on a Web page that it cannot access within that time.
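In Java, this kind of threshold is typically enforced by setting connect and read timeouts on the HTTP connection. A minimal sketch; the 30-second value and method name are illustrative, not Ultra Search code.

```java
import java.net.HttpURLConnection;
import java.net.URL;

public class TimeoutSketch {
    // timeoutSeconds plays the role of the crawler timeout threshold.
    static HttpURLConnection openWithTimeout(String url, int timeoutSeconds)
            throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setConnectTimeout(timeoutSeconds * 1000); // fail if no TCP connection
        conn.setReadTimeout(timeoutSeconds * 1000);    // fail if the page stalls
        return conn;
    }

    public static void main(String[] args) throws Exception {
        HttpURLConnection conn = openWithTimeout("http://example.com/", 30);
        System.out.println("connect timeout ms: " + conn.getConnectTimeout());
    }
}
```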
Default Character Set
Specify the default character set. The crawler uses this setting when an HTML document does not have its character set specified.
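The fallback rule — use the charset declared in the document's HTTP header or `<meta>` tag when present, otherwise the configured default — can be sketched as below. The method and parameter names are illustrative.

```java
import java.nio.charset.Charset;
import java.util.Optional;

public class CharsetFallbackSketch {
    // declared is the charset from the Content-Type header or <meta> tag, if any;
    // defaultCharset corresponds to the crawler's default character set setting.
    static Charset resolve(Optional<String> declared, String defaultCharset) {
        return Charset.forName(declared.orElse(defaultCharset));
    }
}
```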
Temporary Directory Location and Size
Specify a temporary directory and its size. The crawler uses the temporary directory for interim storage during indexing. Specify the absolute path of the temporary directory. The size is the maximum space, in megabytes, that the crawler will use in this directory.
The size of the temporary directory is important because it affects index fragmentation. The smaller the size, the more fragmented the index. As a result, the query will be slower, and index optimization needs to be performed more frequently. Increasing the directory size reduces index fragmentation, but it also reduces crawling throughput (total number of documents crawled each hour). This is because it takes longer to index a bigger temporary directory, and the crawler needs to wait for the indexing to complete before it can continue writing new documents to the directory.
Crawler Logging
Specify the following:
- Level of detail: everything or only a summary
- Crawler log file directory
- Crawler log file language
The log file directory stores the crawler log files. A log file records all crawler activity, warnings, and error messages for a particular schedule, including messages logged at startup, during runtime, and at shutdown. Logging everything can create very large log files when crawling a large number of documents; however, in certain situations it can be beneficial to configure the crawler to print detailed activity to each schedule log file. The crawler log file language is the language the crawler uses to write the log file.
Database Connect String
The database connect string is a standard JDBC connect string used by the remote crawler when it connects to the database. The connect string can be provided in the form [hostname]:[port]:[sid] or in TNS keyword-value syntax; for example:
"(DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=...)(PORT=5521)...))"
There is no connect string for a crawler launched locally on the server machine. Instead, it uses the Oracle JDBC OCI driver to connect to Oracle; for example, "jdbc:oracle:oci8:@". See the Oracle9i JDBC Developer's Guide and Reference for more information.
In a Real Application Clusters environment, use the TNS keyword-value syntax, because it allows connection to any node of the system. For example:
"(DESCRIPTION=(LOAD_BALANCE=yes)(ADDRESS=(PROTOCOL=TCP)(HOST=cls02a)(PORT=3001))(ADDRESS=(PROTOCOL=TCP)(HOST=cls02b)(PORT=3001))(CONNECT_DATA=(SERVICE_NAME=sales.us.acme.com)))"
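Mechanically, the host:port:sid form is three values joined by colons; for Oracle's thin driver the string is typically prefixed with jdbc:oracle:thin:@ (that prefix is an assumption here, since the text above only specifies the host:port:sid shape). The host, port, and SID below are placeholders.

```java
public class ConnectStringSketch {
    // Builds the [hostname]:[port]:[sid] form described above.
    static String hostPortSid(String host, int port, String sid) {
        return host + ":" + port + ":" + sid;
    }

    public static void main(String[] args) {
        String connect = hostPortSid("dbhost.example.com", 1521, "orcl");
        System.out.println(connect);                         // dbhost.example.com:1521:orcl
        System.out.println("jdbc:oracle:thin:@" + connect);  // full thin-driver URL (assumed prefix)
    }
}
```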
Remote Crawler Profiles
Use this page to view and edit remote crawler profiles. A remote crawler profile consists of all parameters needed to run the Ultra Search crawler on a machine other than the Oracle Ultra Search database host. A remote crawler profile is identified by its hostname. The profile includes the cache, log, and mail directories that the remote crawler shares with the database machine.
To set these parameters, click Edit. Enter the shared directory paths as seen by the remote crawler. You must ensure that these directories are shared or mounted appropriately.
Crawler Statistics
Use this page to view the following crawler statistics:
Summary of Crawler Activity
This provides a general summary of crawler activity:
- Aggregate crawler statistics
- Total number of documents indexed
- Crawler statistics by data source name
Detailed Crawler Statistics
This includes the following:
- List of hosts crawled
- Document distribution by depth
- Document distribution by document type
- Document distribution by data source type
Crawler Progress
Problematic URLs
This lists errors encountered during the crawling process. It also lists the number of URLs that cause each error.
Copyright © 2002 Oracle Corporation. All Rights Reserved.