Oracle Ultra Search Online Documentation Release 9.2
A collection of documents is called a data source. A data source is characterized by the properties of its location, such as a Web site or an email inbox. The Ultra Search crawler retrieves data from one or more data sources.
The different types of sources are:
- Web sources
- Table sources
- Email sources
- File sources
- User-defined sources (requires a crawler agent)
- Oracle9iAS Portal sources
You can assign one or more data sources to a synchronization schedule. To do so, use the Schedules Page.
You can also assign data sources to data groups to enable restrictive querying. To do so, use the Queries Page.
You can create as many data sources as you want. This section explains how to create and edit data sources.
A Web source represents HTML content on a specific Web site. Web sources differ from other data source types because they exist specifically to facilitate maintenance crawling of specific Web sites.
To create a new Web source, do the following:
- Specify a name for the Web source.
- Override the default crawler settings, such as default language, for each Web source. This step is optional. If you change any of the default settings, click Update. For information on default languages, see the Crawler Page.
- Enter a starting address. This is the URL for the crawler to begin crawling.
- Set URL boundary rules to refine the crawling space. You can include or exclude hosts and URL paths. For example, an inclusion domain of oracle.com limits the Ultra Search crawler to hosts belonging to Oracle Corporation worldwide. (This is a suffix inclusion, so anything ending with oracle.com is crawled; however, http://www.oracle.com.tw is not crawled.) An exclusion domain uk.oracle.com prevents the crawler from crawling Oracle hosts in the United Kingdom. Exclusion rules always override inclusion rules.
- Specify the types of documents the Ultra Search crawler should process for this source. HTML and plain text are default document types that the crawler will always process.
- Define, edit, or delete metatag mappings for your Web source. Metatags are descriptive tags in the HTML document header. One metatag can map to only one search attribute.
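The URL boundary rules described above can be sketched as follows. This is a minimal illustration in Python, not Ultra Search code: the function name `in_crawl_space` and the sample URLs are assumptions, and the host comparison is a plain suffix match because that is how the inclusion rule is described in the steps above.

```python
from urllib.parse import urlsplit


def in_crawl_space(url, include_suffixes, exclude_suffixes):
    """Return True if the URL's host passes the boundary rules (hypothetical helper)."""
    host = (urlsplit(url).hostname or "").lower()
    # Exclusion rules always override inclusion rules, so check them first.
    if any(host.endswith(suffix) for suffix in exclude_suffixes):
        return False
    # Suffix inclusion: the host must end with one of the inclusion domains.
    return any(host.endswith(suffix) for suffix in include_suffixes)


include = ["oracle.com"]     # inclusion domain from the example above
exclude = ["uk.oracle.com"]  # exclusion domain from the example above

print(in_crawl_space("http://www.oracle.com/index.html", include, exclude))  # True
print(in_crawl_space("http://www.oracle.com.tw/", include, exclude))         # False
print(in_crawl_space("http://support.uk.oracle.com/", include, exclude))     # False
```

Checking exclusions before inclusions is one simple way to guarantee that an excluded host is never crawled, whatever the inclusion list contains.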
A table source represents content in a database table or view. The database table or view can reside in the Ultra Search database instance or in a remote database. Ultra Search accesses remote databases using database links. (Note: Some limitations apply when using a database link to a remote database. See "Limitations With Database Links" at the end of this section.)
To edit the name of a table source, click Edit.
To create a table source, click Create new table source, and follow these steps:
- Specify a table source name, and the name of the database link, schema, and table. Click Locate table.
- Specify settings for your table source, such as the default language and the primary key column. You can also specify the column where final content should be delivered, and the type of data stored in that column; for example, HTML, plain text, or binary. For information on default languages, see the Crawler Page.
- Verify the information about your table source.
- You can decide whether or not to use the Ultra Search logging mechanism to optimize the crawling of table data sources. With this logging mechanism, only newly updated documents are revisited during the crawling process. You can enable logging for Oracle tables, enable logging for non-Oracle tables, or disable the logging mechanism. If you enable logging, then you are prompted to create a log table and log triggers. Oracle SQL statements are provided for Oracle tables. If you are using non-Oracle tables, then you must manually create a log table and log triggers. You can follow the examples provided to create the log table and log triggers. After you have created the table, enter the table name in Log table name.
- Map table columns to document attributes. Each table column can be mapped to exactly one document attribute. This lets the search engine seamlessly search data from the table source.
- Specify the display URL template or column for the table source. This step is optional. Ultra Search uses a default text viewer for table data sources. If you specify display URL, then Ultra Search uses the Web URL defined to display the table data retrieved. If display URL column is available, then Ultra Search uses the column to get the URL to display the table data source content. You can also specify display URL templates in the following format: http://[hostname]:[port]/[path]?[parameter_name]=$(key1) where key1 is the corresponding table's primary key column.
The Table Column to Key Mappings section provides mapping information. Ultra Search supports table keys in STRING, NUMBER, or DATE type. If key1 is of NUMBER or DATE type, then you must specify the format model used by the Web site so that Oracle knows how to interpret the string. For example, the date format model for the string '11-Nov-1999' is 'DD-Mon-YYYY'. For more information on format models, see the Oracle9i SQL Reference. You can also map other table columns to Ultra Search attributes. Do not map the text column.
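As an illustration of how a display URL template could be expanded for one row, here is a small Python sketch. The helper `expand_display_url`, the host name, and the path are hypothetical; only the `$(key1)` placeholder syntax and the 'DD-Mon-YYYY' date example come from the text above. URL-encoding the key value is an assumption, since the encoding Ultra Search applies is not described here.

```python
from datetime import date
from urllib.parse import quote


def expand_display_url(template, key_values):
    """Replace each $(name) placeholder with its URL-encoded key value."""
    url = template
    for name, value in key_values.items():
        url = url.replace("$(" + name + ")", quote(str(value), safe=""))
    return url


# Hypothetical template in the [hostname]:[port]/[path]?[parameter_name]=$(key1) form.
template = "http://intranet.example.com:7777/orders/view?id=$(key1)"

# NUMBER key: the primary key value is substituted as text.
print(expand_display_url(template, {"key1": 1042}))

# DATE key: the value is rendered in the format the Web site expects;
# '%d-%b-%Y' is the Python counterpart of the 'DD-Mon-YYYY' format model.
date_key = date(1999, 11, 11).strftime("%d-%b-%Y")
print(expand_display_url(template, {"key1": date_key}))
```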
Editing Table Sources
You can change the name of the table source; change, add, or delete table column and search attribute mappings; change the display URL template or column; and view values of the table source settings.
Table Sources Comprised of More Than One Table
If a table source has more than one table, then a view joining the relevant tables must be created. Ultra Search then uses this view as the table source. For example, two tables with a master-detail relationship can be joined through a select statement on the master table and a user-implemented PL/SQL function that concatenates the detail table rows.
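The shape of such a view can be sketched with Python's built-in `sqlite3` module. This is only an analogy under stated assumptions: SQLite stands in for the Oracle database, the built-in `group_concat` aggregate stands in for the user-implemented PL/SQL function, and the table, column, and view names (`orders`, `order_lines`, `order_docs`) are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer TEXT);
    CREATE TABLE order_lines (order_id INTEGER, item TEXT);

    INSERT INTO orders VALUES (1, 'Acme'), (2, 'Globex');
    INSERT INTO order_lines VALUES (1, 'bolt'), (1, 'nut'), (2, 'gear');

    -- One row per master record; the detail rows are folded into one text column.
    CREATE VIEW order_docs AS
    SELECT o.order_id,
           o.customer,
           (SELECT group_concat(l.item, ' ')
              FROM order_lines l
             WHERE l.order_id = o.order_id) AS content
      FROM orders o;
""")

rows = conn.execute(
    "SELECT order_id, customer, content FROM order_docs ORDER BY order_id"
).fetchall()
for row in rows:
    print(row)
```

Each row of the view then behaves like a single document: the primary key comes from the master table, and the concatenated detail rows form the text column.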
Limitations With Database Links
The following restrictions apply to base tables or views on a remote database that are accessed over a database link by the crawler.
- If the text column of the base table or view is of type BLOB or CLOB, then the table must have a ROWID column. A table or view might not have a ROWID column for various reasons, including the following:
  - A view comprised of a join of one or more tables
  - A view based on a single table using a GROUP BY clause
  The best way to know if a remote table or view can be safely crawled by Ultra Search is to check for the existence of the ROWID column. To do so, run the following SQL statement against that table or view using SQL*Plus:
  SELECT MIN(ROWID) FROM <table or view name>;
- The base table or view cannot have text columns of type BFILE.
An email source derives its content from emails sent to a specific email address. When the Ultra Search crawler searches an email source, it collects all emails that have the specific email address in any of the "To:" or "Cc:" email header fields.
The most popular application of an email source is where an email source represents all emails sent to a mailing list. In such a scenario, multiple email sources are defined where each email source represents an email list.
To crawl email sources, you need an IMAP account. At present, the Ultra Search crawler can only crawl one IMAP account. Therefore, all emails to be crawled must be found in the inbox of that IMAP account. For example, in the case of mailing lists, the IMAP account should be subscribed to all desired mailing lists. All new postings to the mailing lists are sent to the IMAP email account and subsequently crawled. The Ultra Search crawler is IMAP4 compliant.
When the Ultra Search crawler retrieves an email message, it deletes the email message from the IMAP server. Then, it converts the email message content to HTML and temporarily stores that HTML in the cache directory for indexing. Next, the Ultra Search crawler stores all retrieved messages in a directory known as the archive directory. The email files stored in this directory are displayed to the search end-user when referenced by a query hit.
To crawl email sources, you must specify the username and password of the email account on the IMAP server. Also specify the IMAP server hostname and the archive directory.
To create an email source, you must enter an email address and a description. The description can be viewed by all search end-users, so you should specify a short but meaningful name. When you create (register) an email source, the name you use is the email address of the mailing list. Emails that are not sent to one of the registered mailing lists are not crawled.
Finally, you can specify email address aliases for an email source. Specifying an alias for an email source causes all emails sent to the main email address, as well as the alias address, to be gathered by the crawler.
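The selection rule for an email source (main address or any alias appearing in a "To:" or "Cc:" header) can be sketched with Python's standard `email` package. This is an illustration only: the function `belongs_to_source`, the addresses, and the sample message are assumptions, and case-insensitive comparison of addresses is a simplification rather than documented crawler behavior.

```python
from email import message_from_string
from email.utils import getaddresses


def belongs_to_source(raw_message, source_addresses):
    """Return True if any source address appears in the To: or Cc: headers."""
    msg = message_from_string(raw_message)
    headers = msg.get_all("To", []) + msg.get_all("Cc", [])
    recipients = {addr.lower() for _name, addr in getaddresses(headers)}
    return any(address.lower() in recipients for address in source_addresses)


# Main address of the email source followed by one alias (both hypothetical).
source = ["dev-list@example.com", "developers@example.com"]

raw = (
    "From: alice@example.com\n"
    "To: Bob <bob@example.com>\n"
    "Cc: dev-list@example.com\n"
    "Subject: Build status\n"
    "\n"
    "All green.\n"
)

print(belongs_to_source(raw, source))                  # True
print(belongs_to_source(raw, ["other@example.com"]))   # False
```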
A file source is the set of documents that can be accessed through the file protocol on the Ultra Search database machine or on a remote crawler machine.
To edit the name of a file source, click Edit.
To create a new file source, do the following:
- Specify a name for the file source.
- Designate files or directories to be crawled. If a URL represents a single file, then the Ultra Search crawler searches only that file. If a URL represents a directory, then the crawler recursively crawls all files and subdirectories in that directory.
- Specify inclusion and exclusion paths to modify the crawling space associated with this file source. This step is optional. An inclusion path limits the crawling space. An exclusion path lets you further define the crawling space. If neither path is specified, then crawling is limited to the underlying file system access privileges.
- Specify the types of documents the Ultra Search crawler should process for this file source. HTML and plain text are default document types that the crawler will always process.
Ultra Search displays file data sources in text format by default. However, if you specify display URL for the file data source, then Ultra Search uses the URL to display the file data source.
With display URL for file data sources, the URL uses network protocols, such as HTTP or HTTPS, to access the file data source. To generate display URL for the file data source, specify the prefix of the original file URL and the prefix of the display URL. Ultra Search replaces the prefix of the file URL with the prefix of the display URL.
For example, if your file URL is file:///home/archive/<sub_dir_name>/<file_name> and the display URL is https://host:7777/private/<sub_dir_name>/<file_name>, then you can set the file URL prefix to file:///home/archive and the display URL prefix to https://host:7777/private.
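The prefix replacement can be illustrated with a few lines of Python. The function name `to_display_url` and the sample subdirectory and file name are hypothetical; returning `None` when the file URL does not start with the configured prefix is an assumption made for this sketch, not documented behavior.

```python
def to_display_url(file_url, file_prefix, display_prefix):
    """Swap the file URL prefix for the display URL prefix."""
    if not file_url.startswith(file_prefix):
        # Assumption: a file outside the configured prefix gets no display URL.
        return None
    return display_prefix + file_url[len(file_prefix):]


file_prefix = "file:///home/archive"
display_prefix = "https://host:7777/private"

print(to_display_url("file:///home/archive/reports/q1.html", file_prefix, display_prefix))
# https://host:7777/private/reports/q1.html
```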
Oracle Ultra Search lets you define, edit, or delete your own data sources and types in addition to the ones provided. You might implement your own crawler agent to crawl and index a proprietary document repository or management system, such as Lotus Notes or Documentum, which contain their own databases and interfaces.
For each new data source type, you must implement a crawler agent as a Java class. The agent collects document URLs and associated metadata from the proprietary document source and returns the information to the Ultra Search crawler, which enqueues it for later crawling. For more information, see the crawler agent API documentation.
To define a new data source, you first define a data source type to represent it. You define the type name, the crawler agent Java class or jar file, and the parameters to be used, such as the starting address. After you define your data source type, you can define a new data source by specifying parameter values.
To create a new user-defined data source, click Create new source. To create, edit, or delete data source types, click Manage types.
To create a user-defined data source type:
- Specify data source type name, description, and crawler agent Java class file or jar file name. The crawler agent Java class path is predefined at installation time. The agent collects the list of document URLs and associated metadata from the proprietary document source and returns it to the Ultra Search crawler, which enqueues the information for later crawling. The agent class file or jar file must be located under $ORACLE_HOME/ultrasearch/lib/agent/.
- Specify parameters for this data source type. If you add parameters, you need to enter the parameter name and a description. Also, you must decide whether to encrypt the parameter value.
You can edit data source type information by changing the data source type name, description, crawler agent Java class/jar file name, or parameters.
To create a user-defined data source:
- Specify a name, data source type, and default language for the data source. Each data source is created based on data source type definition. For information on default languages, see the Crawler Page.
- Enter parameter values, such as starting point.
- Specify mappings. This step is optional. Document attributes are automatically mapped directly to the search attribute with the same name during crawling. If you want document attributes to map to another search attribute, you can specify it here. The crawler picks up attributes that have been returned by the crawler agent or specified here.
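The mapping rule in the last step can be sketched as follows. This is a Python illustration under assumptions: the function `resolve_search_attributes` and the attribute names are hypothetical. A document attribute maps to the search attribute of the same name unless an explicit mapping redirects it.

```python
def resolve_search_attributes(doc_attributes, mappings=None):
    """Map document attributes to search attributes.

    An attribute keeps its own name unless `mappings` names another
    search attribute for it.
    """
    mappings = mappings or {}
    return {mappings.get(name, name): value for name, value in doc_attributes.items()}


# Attributes as a crawler agent might return them (hypothetical values).
doc = {"Author": "J. Smith", "Title": "Q1 Report"}

print(resolve_search_attributes(doc))
# {'Author': 'J. Smith', 'Title': 'Q1 Report'}
print(resolve_search_attributes(doc, {"Author": "Creator"}))
# {'Creator': 'J. Smith', 'Title': 'Q1 Report'}
```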
You can edit user-defined data sources by changing the name, type, default language, or starting address.
Oracle Ultra Search supports the crawling and indexing of Oracle9i Application Server (9iAS) Portal installations. This enables searching across multiple portal installations. To crawl a 9iAS Portal, you must first register your portal with Ultra Search. To register your portal:
- Choose a name and portal URL base for the portal source. After it is created, the portal URL base is not updatable. For information on default languages, see the Crawler Page.
- Click Register Portal. Ultra Search attempts to contact the Oracle 9iAS Portal instance and retrieve information about it.
After registering your portal, choose the Oracle 9iAS Portal page groups you want to index. Each page group chosen is created as a 9iAS portal source. You can edit the types of documents the Ultra Search crawler should process for a portal source. HTML and plain text are default document types that the crawler will always process. You can edit document types by clicking the edit icon of the portal source after it has been created.
For more information on page groups, see the Oracle 9iAS Portal documentation.
Copyright © 2002 Oracle Corporation. All Rights Reserved.