Oracle® Data Mining Application Developer's Guide 11g Release 1 (11.1) Part Number B28131-01 |
|
|
View PDF |
This chapter explains how to use data mining models to mine your data.
This chapter contains the following sections:
Scoring, the application of models to new data, is a primary objective of data mining. Once the models have been built, the challenges come in deploying them to obtain the best results, and in maintaining them within a production environment.
In traditional data mining, models are built using specialized software on a remote system and deployed to another system for scoring. This is a cumbersome, error-prone process open to security violations and difficulties in data synchronization.
With Oracle Data Mining, scoring is easy and secure. The scoring engine and the data both reside within the database. Scoring is an extension to the SQL language, so the results of mining can easily be incorporated into applications and reporting systems.
In-database mining provides security, backup and recovery, and high performance. It minimizes the IT effort needed to support data mining initiatives. Using standard database techniques, models can easily be refreshed (recreated) on more recent data and redeployed. The deployment is immediate since the scoring query remains the same; only the underlying model is replaced in the database.
Deploying a model means using it within a target environment. Model deployment could be:
Scoring data either for batch or real-time results
Extracting model details to produce reports. For example: the rules from a Decision Tree model, or the attribute rankings from an Attribute Importance model
Extending the business intelligence infrastructure of a data warehouse by incorporating mining results in applications or operational systems
Moving a model from the database where it was built to the database where it will be used for scoring (export/import)
See Also:
Oracle Data Mining Administrator's Guide for information about exporting and importing data mining modelsOracle Data Mining SQL functions enable prediction, clustering, and feature extraction analysis to be easily integrated into live production and operational systems. Because mining results are returned within SQL queries, mining can occur in real time.
With real-time scoring, point-of-sales database transactions can be mined. Predictions and rule sets can be generated to help front-line workers make better analytical decisions. Real-time scoring enables fraud detection, identification of potential liabilities, and recognition of better marketing and selling opportunities.
The query in Example 4-1 uses a Decision Tree model named dt_sh_clas_sample
to predict the probability that customer 101488 will use an affinity card. A customer representative could retrieve this information in real time when talking to this customer on the phone. Based on the query result, the representative might offer an extra-value card, since there is a 73% chance that the customer will use a card.
Example 4-1 Real-Time Query with Prediction Probability
SELECT PREDICTION_PROBABILITY(dt_sh_clas_sample, 1 USING *) cust_card_prob FROM mining_data_apply_v WHERE cust_id = 101488; CUST_CARD_PROB-------------- .727642276
Oracle Data Mining supports six SQL functions that return results from predictive models (classification or regression).
Predictive models produce a target value for each row (case) in the scoring data. Each SQL function returns different information from the scoring results.
See Also:
Oracle Data Mining Concepts for information on classification and regression(Classification or regression). For classification, the PREDICTION
function returns the target value that is predicted with the highest probability (or lowest cost, if costs are specified). For regression, PREDICTION
returns the best predicted target value.
PREDICTION
supports costs for classification. See "Cost-Sensitive Decision Making".
(Classification or regression) The PREDICTION_BOUNDS
function returns the upper and lower confidence bounds computed by the model.
Confidence is the degree of certainty that the true value (regression) or probability (classification) lies within the bounded interval. The default confidence is .95. Confidence can be specified by the user in the GLMS_CONF_LEVEL
setting for the model. You can override the confidence associated with the model by specifying the confidence inline when you invoke the PREDICTION_BOUNDS
function.
No confidence bounds are returned if ridge regression is being used by the algorithm.
See Also:
Oracle Database SQL Language Reference for syntax and an example that uses PREDICTION_BOUNDS
Oracle Data Mining Concepts for information on GLM
Oracle Database PL/SQL Packages and Types Reference for information on GET_MODEL_DETAILS_GLM
(Classification only) The PREDICTION_COST
function returns the cost associated with the class that is predicted with the lowest cost. If you specify a class, the function returns the cost associated with that class.
Costs are a user-specified biasing mechanism for classification. See "Cost-Sensitive Decision Making".
See Also:
Oracle Database SQL Language Reference for syntax and an example that usesPREDICTION_COST
(Classification only) The PREDICTION_DETAILS
function returns the rule of a Decision Tree model corresponding to the given prediction. A rule is the condition (combination of attribute values) that leads to a specific classification.
Decision Tree rule identifiers are returned as XML. The full rules can be retrieved with the GET_MODEL_DETAILS_XML
function.
See Also:
Oracle Database SQL Language Reference for syntax and an example that uses PREDICTION_RULES
Oracle Data Mining Concepts for information about Decision Tree
Oracle Database PL/SQL Packages and Types Reference for information on GET_MODEL_DETAILS_XML
(Classification only) The PREDICTION_PROBABILITY
function returns the probability associated with the best prediction (prediction with the highest probability) or the probability associated with the class that you specify.
See Also:
Oracle Database SQL Language Reference for syntax and an example that usesPREDICTION_PROBABILITY
(Classification only) The PREDICTION_SET
function returns all the target classes, associated probabilities, and associated costs (if specified) for each scored row. You can specify parameters to restrict the output of the function.
See Also:
Oracle Database SQL Language Reference for syntax and an example that uses PREDICTION_SET
Oracle Data Mining supports three SQL functions that return results when applying clustering models.
Clustering models assign each row to a cluster with an associated probability. Each SQL function returns different information from the scoring results.
See Also:
Oracle Data Mining Concepts for information on clusteringThe CLUSTER_ID
function returns the identifier of the cluster predicted with the highest probability.
The CLUSTER_PROBABILITY
function returns the probability associated with the cluster to which cases are most likely to be assigned. If you specify a cluster ID, the function returns the probability associated with that cluster.
See Also:
Oracle Database SQL Language Reference for syntax and an example that usesCLUSTER_PROBABILITY
Oracle Data Mining supports three SQL functions that return results from feature extraction models.
Feature extraction models combine the attributes into a set of features that capture important characteristics of the data. The scoring process generates a value of each feature for each row. The value is a number that identifies the match quality of the case to the feature. Each SQL function returns different information from the scoring results.
The FEATURE_ID
function returns the identifier of the feature with the highest value (match quality) for the data.
The FEATURE_VALUE
function returns the highest feature value. If you specify a feature ID, the function returns the value of that feature.
If you wish to save the results of a scoring function, you can store them in a table.
This example shows how to save the results of scoring a customer response model.
UPDATE CUST_RESPONSE_APPLY_UPDATE SET prediction = prediction(CUST_RESPONSE19964_DT using *), probability = prediction_probability(CUST_RESPONSE19964_DT using *)
The table in question has all of the predictors, and has columns to hold the prediction and probability. The assumption is that any necessary transformations are embedded in the model (otherwise the using clause would need to contain them).
Costs are user-specified numbers that bias classification. The algorithm uses positive numbers to penalize more expensive outcomes over less expensive outcomes. Higher numbers indicate higher costs. The algorithm uses negative numbers to favor more beneficial outcomes over less beneficial outcomes. Lower negative numbers indicate higher benefits.
All classification algorithms can use costs for scoring. You can specify the costs in a cost matrix table, or you can specify the costs inline when scoring. The PREDICTION
, PREDICTION_COST
, and PREDICTION_SET
functions all support costs.
A sample cost matrix table is shown in Table 4-1.
Table 4-1 Sample Cost Matrix Table
ACTUAL_TARGET_VALUE | PREDICTED_TARGET_VALUE | COST |
---|---|---|
0 |
0 |
0 |
0 |
1 |
2 |
1 |
0 |
1 |
1 |
1 |
0 |
The cost matrix in Table 4-1 specifies costs for a binary target. The matrix indicates that the algorithm should treat a misclassified 0 as twice as costly as a misclassified 1. If the table name is cost_tbl
and it is associated with the Naive Bayes model nb_sh_clas_sample
, then the following query takes cost_tbl
into account when scoring nb_sh_clas_sample
. The output will be restricted to those rows where a prediction of 1 is less costly then a prediction of 0.
SELECT cust_gender, COUNT(*) AS cnt, ROUND(AVG(age)) AS avg_age FROM mining_data_apply_v WHERE PREDICTION (nb_sh_clas_sample COST MODEL USING cust_marital_status, education, household_size) = 1 GROUP BY cust_gender ORDER BY cust_gender;
If there is a possibility that the cost matrix table is not present, or that a cost matrix was not specified for the model, you can use the AUTO
keyword with COST
MODEL
so that scoring only uses costs if the cost matrix is available.
SELECT cust_gender, COUNT(*) AS cnt, ROUND(AVG(age)) AS avg_age FROM mining_data_apply_v WHERE PREDICTION (nb_sh_clas_sample COST MODEL AUTO USING cust_marital_status, education, household_size) = 1 GROUP BY cust_gender ORDER BY cust_gender;
You can specify the costs inline when you invoke the scoring function. The inline costs are used for scoring even if a cost matrix table is associated with the model. Here is the same query with the costs specified inline.
SELECT cust_gender, COUNT(*) AS cnt, ROUND(AVG(age)) AS avg_age FROM mining_data_apply_v WHERE PREDICTION (nb_sh_clas_sample COST (0,1) values ((0, 2), (1, 0)) USING cust_marital_status, education, household_size) = 1 GROUP BY cust_gender ORDER BY cust_gender;
To associate a cost matrix table with a model for scoring, use the ADD_COST_MATRIX
procedure in the DBMS_DATA_MINING
package. You can retrieve the cost matrix with the GET_COST_MATRIX
function. The REMOVE_COST_MATRIX
procedure removes the cost matrix. If you want to use a different cost matrix table with a model, first remove the existing one then add the new one.
-- add cost matrix table cost_tbl -- to model nb_sh_clas_sample -- for scoring -- EXEC DBMS_DATA_MINING.ADD_COST_MATRIX('nb_sh_clas_sample', 'cost_tbl'); -- -- replace cost_tbl with cost_tbl_2 -- EXEC DBMS_DATA_MINING.REMOVE_COST_MATRIX('nb_sh_clas_sample', 'cost_tbl'); EXEC DBMS_DATA_MINING.ADD_COST_MATRIX('nb_sh_clas_sample', 'cost_tbl_2');
The Decision Tree algorithm can use costs to bias the model build. If you want to create a Decision Tree model with costs, create a cost matrix table and provide its name in the CLAS_COST_TABLE_NAME
setting for the model. If you specify costs when scoring the model, the cost matrix used to create the model will be used. If you want to use a different cost matrix table for scoring, first remove the existing cost matrix table then add the new one.
Oracle Data Mining supports a batch apply operation that writes the results of scoring directly to a table. The columns in the table are mining function-dependent. The apply operation is accomplished by DBMS_DATA_MINING.APPLY
.
APPLY
creates an output table with the columns shown in Table 4-2.
Table 4-2 APPLY Output Table
Mining Function | Output Columns |
---|---|
classification |
|
regression |
|
anomaly detection (one-class SVM) |
|
clustering |
|
feature extraction |
|
Example 4-2 illustrates anomaly detection with APPLY
. The query of the APPLY
output table returns the ten first customers in the table. Each has a a probability for being typical (1) and a probability for being anomalous (0).
Example 4-2 Anomaly Detection with DBMS_DATA_MINING.APPLY
EXEC dbms_data_mining.apply ('SVMO_SH_Clas_sample','svmo_sh_sample_prepared', 'cust_id', 'one_class_output'); SELECT * from one_class_output where rownum < 11; CUST_ID PREDICTION PROBABILITY ---------- ---------- ----------- 101798 1 .567389309 101798 0 .432610691 102276 1 .564922469 102276 0 .435077531 102404 1 .51213544 102404 0 .48786456 101891 1 .563474346 101891 0 .436525654 102815 0 .500663683 102815 1 .499336317
Whether performed by APPLY
or by a SQL scoring function, scoring generates the same mining results. Classification produces a prediction and a probability for each case; clustering produces a cluster ID and a probability for each case, and so on. The difference lies in the way that scoring results are captured and the mechanisms that can be used for retrieving them.
Since APPLY
output is stored separately from the scoring data, it must be joined to the scoring data to support queries that include the data attributes being mined (the scored rows). Thus any model that will be used with APPLY
must have a case ID.
A case ID is not required for models that will be applied with SQL scoring functions. Likewise, storage and joins are not required, since scoring results are generated and consumed in real time within a SQL query.
The SQL scoring functions offer flexibility. You can invoke the function that returns the specific information you need. You simply reference a model and identify the kind of mining results you want to retrieve.