This chapter describes the feature selection and feature extraction mining functions. Oracle Data Mining supports feature extraction, which is unsupervised, and attribute importance (a form of feature selection), which is supervised.
Feature Extraction creates a set of features based on the original data. A feature is a combination of attributes that is of special interest and captures important characteristics of the data. It becomes a new attribute. Typically, there are far fewer features than there are original attributes.
Some applications of feature extraction are latent semantic analysis, data compression, data decomposition and projection, and pattern recognition. Feature extraction can also be used to enhance the speed and effectiveness of supervised learning.
For example, feature extraction can be used to extract the themes of a document collection, where documents are represented by a set of keywords and their frequencies. Each theme (feature) is represented by a combination of keywords. The documents in the collection can then be expressed in terms of the discovered themes.
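For illustration, a feature extraction model can be built through the DBMS_DATA_MINING PL/SQL package. The following is a minimal sketch rather than an example from this guide; the table DM_DOCS, its case identifier DOC_ID, and the model and settings table names are hypothetical, and Non-Negative Matrix Factorization is chosen as the extraction algorithm.

-- Settings table naming the algorithm (hypothetical example names throughout)
CREATE TABLE nmf_settings (
  setting_name  VARCHAR2(30),
  setting_value VARCHAR2(4000)
);

BEGIN
  INSERT INTO nmf_settings (setting_name, setting_value)
    VALUES (DBMS_DATA_MINING.ALGO_NAME,
            DBMS_DATA_MINING.ALGO_NONNEGATIVE_MATRIX_FACTOR);
  COMMIT;

  -- Build an unsupervised feature extraction model on the DM_DOCS table
  DBMS_DATA_MINING.CREATE_MODEL(
    model_name          => 'DOC_THEMES_NMF',
    mining_function     => DBMS_DATA_MINING.FEATURE_EXTRACTION,
    data_table_name     => 'DM_DOCS',
    case_id_column_name => 'DOC_ID',
    settings_table_name => 'NMF_SETTINGS');
END;
/

-- Express each document in terms of its dominant extracted feature (theme)
SELECT doc_id, FEATURE_ID(doc_themes_nmf USING *) AS theme_id
FROM   dm_docs;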
Attribute Importance provides an automated solution for improving the speed and possibly the accuracy of classification models built on data tables with a large number of attributes.
The time required to build classification models increases with the number of attributes. Attribute Importance identifies a proper subset of the attributes that are most relevant to predicting the target. Model building can proceed using the selected attributes only.
Using fewer attributes does not necessarily result in lost predictive accuracy. Including too many attributes, especially "noise" attributes that carry little information about the target, can degrade the model's performance and accuracy. Mining with the smallest sufficient set of attributes can save significant computing time and may produce better models.
The programming interfaces for Attribute Importance permit the user to specify a number or percentage of attributes to use; alternatively the user can specify a cutoff point.
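As a sketch of how this might look through the PL/SQL interface, the following builds an attribute importance model with the MDL algorithm and then keeps only the ten highest-ranked attributes (an arbitrary cutoff); the table MINING_DATA, case identifier CUST_ID, and target AFFINITY_CARD are hypothetical names.

-- Settings table selecting the MDL algorithm for attribute importance
CREATE TABLE ai_settings (
  setting_name  VARCHAR2(30),
  setting_value VARCHAR2(4000)
);

BEGIN
  INSERT INTO ai_settings (setting_name, setting_value)
    VALUES (DBMS_DATA_MINING.ALGO_NAME, DBMS_DATA_MINING.ALGO_AI_MDL);
  COMMIT;

  -- Attribute importance is supervised: a target column is required
  DBMS_DATA_MINING.CREATE_MODEL(
    model_name          => 'AI_EXAMPLE',
    mining_function     => DBMS_DATA_MINING.ATTRIBUTE_IMPORTANCE,
    data_table_name     => 'MINING_DATA',
    case_id_column_name => 'CUST_ID',
    target_column_name  => 'AFFINITY_CARD',
    settings_table_name => 'AI_SETTINGS');
END;
/

-- Rank the attributes and keep the top ten for subsequent model building
SELECT attribute_name, importance_value, rank
FROM   TABLE(DBMS_DATA_MINING.GET_MODEL_DETAILS_AI('AI_EXAMPLE'))
WHERE  rank <= 10
ORDER  BY rank;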
Attribute importance models typically benefit from binning of numerical attributes. However, when the data contains outliers and equal-width binning is used, the discriminating power of the model can be significantly reduced: the outliers stretch the bin boundaries so that most of the data concentrates in a few bins (a single bin in extreme cases). In this case, quantile binning is a better solution.
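As an illustration of one way to apply quantile binning before building the model, the following sketch uses the DBMS_DATA_MINING_TRANSFORM package; the MINING_DATA table, its CUST_ID and AFFINITY_CARD columns, and the bin table and view names are assumptions.

BEGIN
  -- Create the table that will hold the numerical bin boundaries
  DBMS_DATA_MINING_TRANSFORM.CREATE_BIN_NUM(
    bin_table_name => 'MINING_DATA_BINS');

  -- Quantile (equal-frequency) bin boundaries are far less sensitive to
  -- outliers than equal-width boundaries
  DBMS_DATA_MINING_TRANSFORM.INSERT_BIN_NUM_QTILE(
    bin_table_name  => 'MINING_DATA_BINS',
    data_table_name => 'MINING_DATA',
    bin_num         => 10,
    exclude_list    => DBMS_DATA_MINING_TRANSFORM.COLUMN_LIST(
                         'CUST_ID', 'AFFINITY_CARD'));

  -- Expose the binned data as a view for building the attribute
  -- importance model
  DBMS_DATA_MINING_TRANSFORM.XFORM_BIN_NUM(
    bin_table_name  => 'MINING_DATA_BINS',
    data_table_name => 'MINING_DATA',
    xform_view_name => 'MINING_DATA_BINNED');
END;
/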
Oracle Data Mining uses the Minimum Description Length (MDL) algorithm for attribute importance.
See Also:
Chapter 14, "Minimum Description Length"