This chapter describes the feature selection and feature extraction mining functions. Oracle Data Mining supports feature extraction, which is unsupervised, and attribute importance (a form of feature selection), which is supervised.
Feature Extraction creates a set of features based on the original data. A feature is a combination of attributes that is of special interest and captures important characteristics of the data. It becomes a new attribute. Typically, there are far fewer features than there are original attributes.
Some applications of feature extraction are latent semantic analysis, data compression, data decomposition and projection, and pattern recognition. Feature extraction can also be used to enhance the speed and effectiveness of supervised learning.
For example, feature extraction can be used to extract the themes of a document collection, where documents are represented by a set of keywords and their frequencies. Each theme (feature) is represented by a combination of keywords. The documents in the collection can then be expressed in terms of the discovered themes.
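For illustration, a feature extraction model can be built through the DBMS_DATA_MINING PL/SQL package. The following is a minimal sketch rather than an example from this guide; the table DM_DOCS, its case identifier DOC_ID, and the model and settings table names are hypothetical, and Non-Negative Matrix Factorization is chosen as the extraction algorithm.

-- Settings table naming the algorithm (hypothetical example names throughout)
CREATE TABLE nmf_settings (
  setting_name  VARCHAR2(30),
  setting_value VARCHAR2(4000)
);

BEGIN
  INSERT INTO nmf_settings (setting_name, setting_value)
    VALUES (DBMS_DATA_MINING.ALGO_NAME,
            DBMS_DATA_MINING.ALGO_NONNEGATIVE_MATRIX_FACTOR);
  COMMIT;

  -- Build an unsupervised feature extraction model on the DM_DOCS table
  DBMS_DATA_MINING.CREATE_MODEL(
    model_name          => 'DOC_THEMES_NMF',
    mining_function     => DBMS_DATA_MINING.FEATURE_EXTRACTION,
    data_table_name     => 'DM_DOCS',
    case_id_column_name => 'DOC_ID',
    settings_table_name => 'NMF_SETTINGS');
END;
/

-- Express each document in terms of its dominant extracted feature (theme)
SELECT doc_id, FEATURE_ID(doc_themes_nmf USING *) AS theme_id
FROM   dm_docs;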
Attribute Importance provides an automated solution for improving the speed and possibly the accuracy of classification models built on data tables with a large number of attributes.
The time required to build classification models increases with the number of attributes. Attribute Importance identifies a proper subset of the attributes that are most relevant to predicting the target. Model building can proceed using the selected attributes only.
Using fewer attributes does not necessarily result in lost predictive accuracy. Including too many attributes, especially "noise" attributes that carry little information about the target, can degrade the model's performance and accuracy. Mining with the smallest sufficient set of attributes can save significant computing time and may produce better models.
The programming interfaces for Attribute Importance permit the user to specify a number or percentage of attributes to use; alternatively the user can specify a cutoff point.
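As a sketch of how this might look through the PL/SQL interface, the following builds an attribute importance model with the MDL algorithm and then keeps only the ten highest-ranked attributes (an arbitrary cutoff); the table MINING_DATA, case identifier CUST_ID, and target AFFINITY_CARD are hypothetical names.

-- Settings table selecting the MDL algorithm for attribute importance
CREATE TABLE ai_settings (
  setting_name  VARCHAR2(30),
  setting_value VARCHAR2(4000)
);

BEGIN
  INSERT INTO ai_settings (setting_name, setting_value)
    VALUES (DBMS_DATA_MINING.ALGO_NAME, DBMS_DATA_MINING.ALGO_AI_MDL);
  COMMIT;

  -- Attribute importance is supervised: a target column is required
  DBMS_DATA_MINING.CREATE_MODEL(
    model_name          => 'AI_EXAMPLE',
    mining_function     => DBMS_DATA_MINING.ATTRIBUTE_IMPORTANCE,
    data_table_name     => 'MINING_DATA',
    case_id_column_name => 'CUST_ID',
    target_column_name  => 'AFFINITY_CARD',
    settings_table_name => 'AI_SETTINGS');
END;
/

-- Rank the attributes and keep the top ten for subsequent model building
SELECT attribute_name, importance_value, rank
FROM   TABLE(DBMS_DATA_MINING.GET_MODEL_DETAILS_AI('AI_EXAMPLE'))
WHERE  rank <= 10
ORDER  BY rank;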
Attribute importance models typically benefit from binning of numerical attributes. However, when the data contains outliers and equal-width binning is used, the discriminating power of the model can be significantly reduced: the outliers stretch the bin boundaries so that most of the data concentrates in a few bins (a single bin in extreme cases). In this case, quantile binning is a better solution.
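As an illustration of one way to apply quantile binning before building the model, the following sketch uses the DBMS_DATA_MINING_TRANSFORM package; the MINING_DATA table, its CUST_ID and AFFINITY_CARD columns, and the bin table and view names are assumptions.

BEGIN
  -- Create the table that will hold the numerical bin boundaries
  DBMS_DATA_MINING_TRANSFORM.CREATE_BIN_NUM(
    bin_table_name => 'MINING_DATA_BINS');

  -- Quantile (equal-frequency) bin boundaries are far less sensitive to
  -- outliers than equal-width boundaries
  DBMS_DATA_MINING_TRANSFORM.INSERT_BIN_NUM_QTILE(
    bin_table_name  => 'MINING_DATA_BINS',
    data_table_name => 'MINING_DATA',
    bin_num         => 10,
    exclude_list    => DBMS_DATA_MINING_TRANSFORM.COLUMN_LIST(
                         'CUST_ID', 'AFFINITY_CARD'));

  -- Expose the binned data as a view for building the attribute
  -- importance model
  DBMS_DATA_MINING_TRANSFORM.XFORM_BIN_NUM(
    bin_table_name  => 'MINING_DATA_BINS',
    data_table_name => 'MINING_DATA',
    xform_view_name => 'MINING_DATA_BINNED');
END;
/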
Oracle Data Mining uses the Minimum Description Length (MDL) algorithm for attribute importance.
See Also:
Chapter 14, "Minimum Description Length"