Oracle® Data Mining Concepts 11g Release 1 (11.1) Part Number B28129-01
This chapter explains how you can use Oracle Data Mining to mine text.
Oracle Data Mining is an option of the Enterprise Edition of Oracle Database. Oracle Text is a separate product that is part of the base functionality offered by Oracle Database. Oracle Text uses internal components of Oracle Data Mining to provide some data mining capabilities.
To use Oracle Text and its data mining capabilities, you do not need to license the Data Mining option. If you wish to use Oracle Data Mining, then a license for the Data Mining option is required.
The support for text data in Oracle Data Mining is different from that provided by Oracle Text. Oracle Text is dedicated to text document processing. Oracle Data Mining allows the combination of text (unstructured) columns and non-text (categorical and numerical) columns of data as input for clustering, classification, and feature extraction.
Oracle Text is described in the Oracle Text Reference and the Oracle Text Application Developer's Guide.
Table 20-1 summarizes how DBMS_DATA_MINING, the Oracle Data Mining Java interface, and Oracle Text support text mining.
Oracle Data Mining Application Developer's Guide provides information that helps you develop text mining applications using the PL/SQL and Java interfaces. Oracle Data Mining Administrator's Guide contains descriptions of the sample text mining programs included with Oracle Data Mining.
Text mining is conventional data mining done using "text features." Text features are usually keywords, frequencies of words, or other document-derived features. Once you derive text features, you mine them just as you would any other data.
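As an illustration of what "text features" can look like, the following minimal Python sketch derives relative keyword frequencies from one document. It is a conceptual stand-in only, not the extraction performed by Oracle Text or Oracle Data Mining; the function name `extract_text_features` and the small stopword list are assumptions made for this example.

```python
import re
from collections import Counter

# Hypothetical stopword list, chosen only for this illustration.
STOPWORDS = frozenset({"the", "a", "of", "and", "is", "to", "in"})

def extract_text_features(document, stopwords=STOPWORDS):
    """Return a mapping of keyword -> relative frequency for one document."""
    tokens = [t for t in re.findall(r"[a-z0-9]+", document.lower())
              if t not in stopwords]
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {term: count / total for term, count in counts.items()}

features = extract_text_features(
    "The patient reports fever and the fever is rising")
print(features)
```

Once each document is reduced to such a mapping, the keywords behave like ordinary numerical attributes and can be passed to any mining algorithm.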
Some of the applications for text mining include:
Create and manage taxonomies
Classify or categorize documents
Integrate search capabilities with classification and clustering of documents returned from a search
Extract topics automatically
Extract features for subsequent mining
Document classification, also known as document categorization, is the process of assigning documents to categories (for example, themes or subjects). A particular document may fit into two or more different categories. This type of classification can often be represented as a multi-target classification problem in which a supervised model is built for each category.
In some classes of problems, text is combined with structured data. For example, patient records or other clinical records usually contain both structured data (temperature, blood pressure) and unstructured data (physician's notes). In such a case, you can use Oracle Data Mining to perform mining on the structured data, the unstructured data, or both the structured and unstructured data combined.
Oracle Data Mining supports mining one or more columns of text data. A column of text data must have data type CLOB, BLOB, BFILE, LONG, VARCHAR2, XMLType, CHAR, RAW, or LONG RAW. Before text columns can be used in mining, the features of the text columns must be extracted into a nested table. Before you can extract features, you must create a text index for the columns containing text using Oracle Text.
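The idea of extracting a text column into a nested collection that sits alongside the structured columns can be pictured with a small Python sketch. This is only a conceptual model: it does not use Oracle Text, a text index, or an Oracle nested table type, and the function name `nest_text_features` and the sample rows are hypothetical.

```python
import re
from collections import Counter

def nest_text_features(rows, text_column):
    """Replace each row's text column with a sorted list of (term, count) pairs."""
    nested_rows = []
    for row in rows:
        terms = re.findall(r"[a-z]+", str(row[text_column]).lower())
        nested = dict(row)  # copy so the input rows stay unchanged
        nested[text_column] = sorted(Counter(terms).items())
        nested_rows.append(nested)
    return nested_rows

# Hypothetical clinical records mixing structured and unstructured data.
rows = [
    {"case_id": 1, "temperature": 38.5,
     "notes": "Persistent cough, cough worse at night"},
    {"case_id": 2, "temperature": 36.8, "notes": "No cough"},
]
nested_rows = nest_text_features(rows, "notes")
for nested in nested_rows:
    print(nested)
```

Each case keeps its structured attributes (here `temperature`) while the text column becomes a variable-length list of term and count pairs, which is the shape a mining algorithm needs in order to treat text and non-text attributes together.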
The sample programs distributed with Oracle Data Mining include examples of text mining. For information about the sample programs, see the Oracle Data Mining Administrator's Guide.
Oracle Data Mining provides infrastructure for developing data mining applications suitable for addressing a variety of business problems involving text. Among these, the following specific technologies provide key elements for addressing problems that require text mining:
Classification
Clustering
Feature extraction
Association
Regression
Anomaly Detection
The technologies that are most used in text mining are classification, clustering, and feature extraction.
A large number of document classification applications fall into one of the following:
Assigning multiple labels to a document. Oracle Data Mining does not support this case.
Assigning a document to one of many labels. Examples include automatically assigning a mail message to a folder and filtering spam. This application requires multi-class classification.
The Support Vector Machine (SVM) algorithm provides powerful classifiers that have been used successfully in document classification applications. SVM can deal with thousands of features and is easy to train with small or large amounts of data. SVM is known to work well with text data. For more information about SVM, see Chapter 5.
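To make the idea of an SVM text classifier concrete, here is a minimal sketch of a linear SVM over bag-of-words features, trained by stochastic subgradient descent on the regularized hinge loss. It is not Oracle Data Mining's SVM implementation; the function names, hyperparameter values, and the tiny spam/non-spam training set are assumptions chosen for illustration.

```python
from collections import defaultdict

def tokenize(text):
    return text.lower().split()

def train_linear_svm(samples, epochs=20, learning_rate=0.1, regularization=0.01):
    """Train on (text, label) pairs with label +1 or -1; return (weights, bias)."""
    weights = defaultdict(float)
    bias = 0.0
    for _ in range(epochs):
        for text, label in samples:
            tokens = tokenize(text)
            score = sum(weights[t] for t in tokens) + bias
            # L2 regularization: shrink every weight a little on each step.
            for term in list(weights):
                weights[term] *= 1.0 - learning_rate * regularization
            # Hinge loss subgradient: update only when the margin is violated.
            if label * score < 1.0:
                for t in tokens:
                    weights[t] += learning_rate * label
                bias += learning_rate * label
    return weights, bias

def predict(text, weights, bias):
    """Return +1 or -1; terms unseen during training contribute nothing."""
    score = sum(weights.get(t, 0.0) for t in tokenize(text)) + bias
    return 1 if score >= 0 else -1

# Hypothetical training data: +1 = spam, -1 = not spam.
training = [
    ("win money now", 1),
    ("free money offer", 1),
    ("meeting agenda attached", -1),
    ("project meeting tomorrow", -1),
]
weights, bias = train_linear_svm(training)
print(predict("free money", weights, bias))
print(predict("meeting tomorrow", weights, bias))
```

Each distinct term is one feature, so the weight vector grows with the vocabulary. This is why an algorithm that copes with thousands of features suits text.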
Clustering is used frequently in text mining. Its main applications in text mining are:
Taxonomy generation
Topic extraction
Grouping the hits returned by a search engine
Clustering can also be used to group textual information with other indications from business databases to provide novel insights.
The current release of Oracle Data Mining supports clustering text features using both the PL/SQL and Java interfaces.
The k-Means clustering algorithm supports mining text columns.
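The following sketch shows, in plain Python, how clustering can group documents once they are expressed as term-count vectors. It implements textbook k-means with Euclidean distance and caller-chosen starting centroids; it is not Oracle's k-Means implementation, and the helper names and sample documents are assumptions for this example.

```python
import re
from collections import Counter

def vectorize(docs):
    """Turn documents into term-count vectors over a shared, sorted vocabulary."""
    tokenized = [re.findall(r"[a-z]+", d.lower()) for d in docs]
    vocab = sorted({t for tokens in tokenized for t in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([float(counts[t]) for t in vocab])
    return vocab, vectors

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def k_means(vectors, k, init_indices, max_iter=20):
    """Plain k-means; initial centroids are the vectors at init_indices."""
    centroids = [list(vectors[i]) for i in init_indices]
    assignments = None
    for _ in range(max_iter):
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        if new_assignments == assignments:
            break  # converged: no document changed cluster
        assignments = new_assignments
        for c in range(k):
            members = [v for v, a in zip(vectors, assignments) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignments, centroids

docs = [
    "stock market prices fall",
    "market prices rise on stock news",
    "team wins football match",
    "football team loses match",
]
vocab, vectors = vectorize(docs)
assignments, centroids = k_means(vectors, 2, init_indices=(0, 2))
print(assignments)
```

The two finance documents land in one cluster and the two sports documents in the other, which is the kind of grouping used for topic extraction or for organizing search hits.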
There are two kinds of text mining problems for which feature extraction is useful:
Extract features from actual text. Oracle Text is designed to solve this kind of problem. Oracle Data Mining also supports feature extraction from text. Most text mining is focused on this problem.
Extract semantic features or higher-level features from the basic features uncovered when features are extracted from actual text. Statistical techniques, such as singular value decomposition (SVD) and non-negative matrix factorization (NMF), are important in solving this kind of problem. Higher-order features can greatly improve the quality of information retrieval, classification, and clustering tasks.
NMF has been found to provide superior text retrieval when compared to SVD and other traditional decomposition methods. NMF takes as input a term-document matrix and generates a set of topics that represent weighted sets of co-occurring terms. The discovered topics form a basis that provides an efficient representation of the original documents. For more information about NMF, see Chapter 9, "Feature Selection and Extraction".
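The factorization described above can be sketched with the standard multiplicative update rules for NMF, which approximate a non-negative term-document matrix V as the product W x H. This pure-Python version is only an illustration under stated assumptions: it is not the algorithm as implemented in Oracle Data Mining, and the matrix values, the rank, the iteration count, and the function names are all chosen for the example.

```python
import random

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def frobenius_error(v, w, h):
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0]))) ** 0.5

def nmf(v, rank, iterations=200, seed=0, eps=1e-9):
    """Factor v (terms x documents) into w (terms x topics) and h (topics x documents)."""
    rng = random.Random(seed)
    n_terms, n_docs = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n_terms)]
    h = [[rng.random() + 0.1 for _ in range(n_docs)] for _ in range(rank)]
    errors = [frobenius_error(v, w, h)]
    for _ in range(iterations):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        numer = matmul(wt, v)
        denom = matmul(matmul(wt, w), h)
        h = [[h[i][j] * numer[i][j] / (denom[i][j] + eps)
              for j in range(n_docs)] for i in range(rank)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        numer = matmul(v, ht)
        denom = matmul(w, matmul(h, ht))
        w = [[w[i][j] * numer[i][j] / (denom[i][j] + eps)
              for j in range(rank)] for i in range(n_terms)]
        errors.append(frobenius_error(v, w, h))
    return w, h, errors

# Hypothetical term-document counts: rows are terms, columns are documents.
terms = ["stock", "market", "football", "team"]
v = [
    [2.0, 1.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 2.0, 1.0],
    [0.0, 0.0, 1.0, 2.0],
]
w, h, errors = nmf(v, rank=2)
for term, row in zip(terms, w):
    print(term, [round(x, 3) for x in row])
```

Each column of `w` is a topic expressed as non-negative term weights, and each column of `h` gives a document's weights over those topics, which is the compact representation the paragraph refers to.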
Association models can be used to uncover the semantic meaning of words. For example, suppose that the word sheep co-occurs with words like sleep, fence, chew, grass, meadow, farmer, and shear. An association model would include rules connecting sheep with these concepts. Inspection of the rules would provide context for sheep in the document collection. Such associations can improve information retrieval engines. For more information about association models, see Chapter 8, "Association Rules".
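The sheep example can be made concrete with a small sketch that computes support and confidence for rules of the form "sheep implies X" across a document collection. This is a simplified, single-antecedent illustration rather than the association algorithm used by Oracle Data Mining; the function name `rules_for`, the confidence threshold, and the keyword sets are assumptions.

```python
def rules_for(term, documents, min_confidence=0.6):
    """Return (consequent, support, confidence) for rules term -> consequent.

    documents is a list of keyword sets, one set per document.
    """
    with_term = [doc for doc in documents if term in doc]
    if not with_term:
        return []
    rules = []
    candidates = set().union(*with_term) - {term}
    for other in sorted(candidates):
        both = sum(1 for doc in with_term if other in doc)
        support = both / len(documents)      # share of all documents with both terms
        confidence = both / len(with_term)   # share of term's documents that have other
        if confidence >= min_confidence:
            rules.append((other, support, confidence))
    return rules

# Hypothetical keyword sets already extracted from four documents.
documents = [
    {"sheep", "grass", "meadow"},
    {"sheep", "farmer", "shear"},
    {"sheep", "grass", "fence"},
    {"farmer", "fence", "tractor"},
]
rules = rules_for("sheep", documents)
for consequent, support, confidence in rules:
    print(f"sheep -> {consequent}: support={support:.2f}, confidence={confidence:.2f}")
```

Lowering `min_confidence` surfaces weaker associations such as meadow or shear, which is how inspecting the rules builds up context for a word.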
Table 20-1 summarizes how Oracle Data Mining (both the Java and PL/SQL interfaces) and Oracle Text support text mining functions.
Table 20-1 Text Mining Comparison
Feature | Oracle Data Mining | Oracle Text |
---|---|---|
Association | Text data only or text and non-text data | No support |
Clustering | k-Means supports text only or text and non-text data | k-Means algorithm supports text only |
Attribute importance | Text data only or text and non-text data | No support |
Regression | SVM supports text data only or text and non-text data | No support |
Classification | SVM and Naive Bayes support text only or text and non-text data. Support for assigning documents to one of many labels. | SVM and decision trees support text only. Support for assigning documents to one of many labels and also for assigning documents to multiple labels at the same time. |
Anomaly detection | One-Class SVM supports text only or text and non-text data | No support |
Feature extraction (basic features) | The Java API supports the feature extraction process that transforms a text column to a nested table. The PL/SQL API requires the use of Oracle Text procedures to perform extraction. Oracle Data Mining allows the same degree of control as Oracle Text. | Feature extraction is done internally; the results are not exposed |
Feature extraction (higher-order features) | NMF supports either text or text and non-text data | No support |
Record apply | No support for record apply | Supports record apply for classification |
Support for text columns | Features extracted from a column of type CLOB, BLOB, BFILE, LONG, VARCHAR2, XMLType, CHAR, RAW, or LONG RAW using an appropriate transformation | Supports table columns of type CLOB, BLOB, BFILE, LONG, VARCHAR2, XMLType, CHAR, RAW, or LONG RAW |