Oracle® Data Mining Concepts 11g Release 1 (11.1) Part Number B28129-01 |
This chapter describes clustering, the unsupervised mining function for discovering natural groupings within the data.
See Also:
"Unsupervised Data Mining"
A cluster is a collection of data objects that are similar in some sense to one another. Clustering analysis identifies clusters in the data.
A good clustering method produces high-quality clusters in which intra-cluster similarity is high and inter-cluster similarity is low; in other words, members of a cluster are more like each other than they are like members of a different cluster.
Clustering is useful for exploring data. If there are many cases and no obvious natural groupings, clustering data mining algorithms can be used to find natural groupings. Clustering can also serve as a useful data-preprocessing step to identify homogeneous groups on which to build supervised models.
Clustering models are different from supervised models in that the outcome of the process is not guided by a known result, that is, there is no target attribute. Clustering models focus on the intrinsic structure, relations, and interconnectedness of the data. Clustering models are built using optimization criteria that favor high intra-cluster and low inter-cluster similarity. The model can then be used to assign cluster identifiers to data points.
In Oracle Data Mining, a cluster is characterized by its centroid, attribute histograms, and the cluster's place in the model's hierarchical tree. A centroid represents the most typical case in a cluster. For numerical attributes, the centroid is the mean of the attribute values; for categorical attributes, it is the mode.
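The mean/mode centroid computation can be sketched as follows. This is an illustrative example, not Oracle's implementation; the attribute names and data are hypothetical.

```python
# Illustrative sketch (not Oracle's implementation): a cluster centroid
# uses the mean for numerical attributes and the mode for categorical ones.
from statistics import mean, mode

def centroid(rows):
    """rows: list of dicts mapping attribute name -> value for one cluster."""
    result = {}
    for attr in rows[0]:
        values = [r[attr] for r in rows]
        if all(isinstance(v, (int, float)) for v in values):
            result[attr] = mean(values)   # numerical attribute: mean
        else:
            result[attr] = mode(values)   # categorical attribute: mode
    return result

cluster = [{"AGE": 30, "JOB": "clerk"},
           {"AGE": 40, "JOB": "clerk"},
           {"AGE": 35, "JOB": "manager"}]
print(centroid(cluster))  # {'AGE': 35, 'JOB': 'clerk'}
```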
Oracle Data Mining performs hierarchical clustering using an enhanced version of the k-means algorithm and the Orthogonal Partitioning Clustering (O-Cluster) algorithm, an Oracle proprietary algorithm.
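The distance-based idea behind k-means can be shown with a minimal one-dimensional sketch: alternate between assigning each point to its nearest centroid and recomputing centroids as cluster means until the centroids stop moving. This is the textbook algorithm, not Oracle's enhanced hierarchical version.

```python
# Minimal k-means sketch (illustrative; not Oracle's enhanced algorithm).
def kmeans(points, centroids):
    """points: list of floats; centroids: initial guesses, one per cluster."""
    while True:
        # Assignment step: each point goes to its nearest centroid.
        groups = {i: [] for i in range(len(centroids))}
        for p in points:
            i = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            groups[i].append(p)
        # Update step: each centroid becomes the mean of its cluster.
        new = [sum(g) / len(g) if g else centroids[i] for i, g in groups.items()]
        if new == centroids:          # converged: centroids stopped moving
            return centroids, groups
        centroids = new

points = [1.0, 1.2, 0.8, 8.0, 8.4, 7.6]
centroids, groups = kmeans(points, [0.0, 10.0])
print(centroids)  # [1.0, 8.0]
```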
The clusters discovered by these algorithms are used to create rules that capture the main characteristics of the data assigned to each cluster. The rules represent the bounding boxes that envelop the data in the clusters discovered by the clustering algorithm. The antecedent of each rule describes the clustering bounding box. The consequent encodes the cluster ID for the cluster described by the rule. For example, for a data set with two attributes: AGE and HEIGHT, the following rule represents most of the data assigned to cluster 10:
If AGE >= 25 and AGE <= 40 and HEIGHT >= 5.0ft and HEIGHT <= 5.5ft then CLUSTER = 10
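A rule of this form can be derived from a cluster's members by taking the minimum and maximum of each attribute as the bounding box. The following sketch is illustrative only; it is not how Oracle Data Mining generates rules internally, and the sample data is hypothetical.

```python
# Illustrative sketch (not Oracle's implementation): the rule antecedent
# is the bounding box of the attribute values in the cluster; the
# consequent is the cluster ID.
def cluster_rule(rows, cluster_id):
    """rows: list of dicts of numerical attributes for one cluster."""
    conditions = []
    for attr in sorted(rows[0]):
        values = [r[attr] for r in rows]
        conditions.append(f"{attr} >= {min(values)} and {attr} <= {max(values)}")
    return "If " + " and ".join(conditions) + f" then CLUSTER = {cluster_id}"

members = [{"AGE": 25, "HEIGHT": 5.0},
           {"AGE": 40, "HEIGHT": 5.5},
           {"AGE": 33, "HEIGHT": 5.2}]
print(cluster_rule(members, 10))
# If AGE >= 25 and AGE <= 40 and HEIGHT >= 5.0 and HEIGHT <= 5.5 then CLUSTER = 10
```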
The clusters are also used to generate a Bayesian probability model, which is used during scoring for assigning data points to clusters.
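The idea of probabilistic cluster assignment can be sketched as follows: score each cluster by combining its prior probability with per-attribute likelihoods (a naive Bayes-style assumption) and assign the point to the highest-scoring cluster. This is a conceptual illustration, not Oracle's scoring code; the clusters, priors, and likelihoods are hypothetical.

```python
# Illustrative sketch (not Oracle's scoring code): assign a data point to
# the cluster with the highest posterior, computed in log space from the
# cluster prior and per-attribute likelihoods.
import math

def assign(point, clusters):
    """clusters: {cluster_id: (prior, {attr: {value: likelihood}})}"""
    scores = {}
    for cid, (prior, likelihoods) in clusters.items():
        log_p = math.log(prior)
        for attr, value in point.items():
            # Small floor avoids log(0) for unseen attribute values.
            log_p += math.log(likelihoods[attr].get(value, 1e-9))
        scores[cid] = log_p
    return max(scores, key=scores.get)

clusters = {
    10: (0.6, {"AGE_BIN": {"young": 0.7, "old": 0.3}}),
    11: (0.4, {"AGE_BIN": {"young": 0.2, "old": 0.8}}),
}
print(assign({"AGE_BIN": "old"}, clusters))  # 11
```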
The main characteristics of the enhanced k-means and O-Cluster algorithms are summarized in Table 7-1.
Table 7-1 Clustering Algorithms Compared
Feature | Enhanced k-Means | O-Cluster |
---|---|---|
Clustering methodology | Distance-based | Grid-based |
Number of cases | Handles data sets of any size | More appropriate for data sets that have more than 500 cases; handles large tables through active sampling |
Number of attributes | More appropriate for data sets with a low number of attributes | More appropriate for data sets with a high number of attributes |
Number of clusters | User-specified | Automatically determined |
Hierarchical clustering | Yes | Yes |
Probabilistic cluster assignment | Yes | Yes |
Recommended data preparation | Normalization | Equi-width binning after clipping |