6 Preparing Text for Mining

Oracle Data Mining supports the mining of data sets that have one or more text columns. These columns must undergo a special preprocessing step whereby text tokens known as terms are extracted and stored in a nested table column. The transformed text can then be used as any other attribute in the building, testing, and scoring of models.

This chapter explains how to use Oracle Text packages in a PL/SQL program to prepare a column of text for Oracle Data Mining.

Note:

Oracle Data Mining includes sample programs that illustrate text transformation and text mining in both PL/SQL and Java. Refer to Oracle Data Mining Administrator's Guide for information on the Oracle Data Mining sample programs.

This chapter contains the following sections.

Oracle Text for Oracle Data Mining
Term Extraction in the Sample Programs
From Unstructured Data to Structured Data
Steps in the Term Extraction Process
Example: Transforming a Text Column

Oracle Text for Oracle Data Mining

Oracle Data Mining uses specialized Oracle Text routines to preprocess text data. Oracle Text is a technology within the Database for building text querying and classification applications. Oracle Text provides the following facilities that are specific to the Oracle Data Mining term extraction process:

SVM_CLASSIFIER, defined in the CTX_DLL Oracle Text PL/SQL package, specifies an index preference for Oracle Data Mining term extraction. It is used in the text transformation process for all algorithms supported by Oracle Data Mining.
The CTXSYS.DRVODM Oracle Text PL/SQL package defines the table functions, FEATURE_PREP and FEATURE_EXPLAIN, which generate intermediate and final tables of text terms for Oracle Data Mining.

Note:

Text terms are also known as features. In text mining, a feature is a word or group of words extracted from a text attribute. Both NMF models and text mining transformation perform a kind of feature extraction. NMF creates a single feature from multiple attributes. Text transformation creates multiple features from a single attribute.

The data preparation process in a PL/SQL text mining application requires the use of these Oracle Text facilities. Java developers can use the OraTextTransform interface, which presents the Oracle Text term extraction capability within the context of a Java environment.

Term Extraction in the Sample Programs

A good place to start in learning the text term extraction process is with the sample programs. You can find these programs in the /rdbms/demo directory under $ORACLE_HOME. Refer to the Oracle Data Mining Administrator's Guidefor more information.

The following sample programs contain term extraction code for text mining:

dmsh.sql — Prepares the build, test, and scoring data for the sample programs, including the text mining programs. dmsh.sql creates views for data mining and tables and indexes for text mining.
dmtxtfe.sql — Uses a table with an indexed text column, created by dmsh.sql, to create a table of build data with a nested table column.

The dmtxtfe.sql program is a sample term extractor. It contains extensive comments that explain the code in a step-by-step fashion. You can expand this program into a complete term extraction solution by adding index creation and the preparation of test and scoring data (as in dmsh.sql).

Text Mining Programs

Once you have properly prepared the text data, you can build a text mining program using any algorithm that supports sparse data: Apriori, k-Means, SVM (classification, regression, and one-class classification), and Non-Negative Matrix Factorization.

Two text mining sample PL/SQL programs use the data prepared by dmsh.sql.

dmtxtnmf.sql creates a text mining model that uses Non-Negative Matrix Factorization.
dmtxtsvm.sql creates a text mining model that uses SVM classification.

Both these programs mine a table of customer data, which includes a nested table column called COMMENTS. The COMMENTS column has been pre-processed by dmsh.sql. The models created by these programs are shown in the following example from a Linux system.

-- Run the programs
SQL> @ $ORACLE_HOME%rdbms/demo/dmtxtnmf.sql
SQL> @ $ORACLE_HOME/rdbms/demo/dmtxtsvm.sql
-- List the models created by the programs
SQL> select NAME, FUNCTION_NAME, ALGORITHM_NAME, TARGET_ATTRIBUTE  
           from dm_user_models;NAME               FUNCTION_NAME      ALGORITHM_NAME            TARGET_ATTRIBUTE ----------------   ------------------ ------------------------  ----------------T_NMF_SAMPLE       FEATURE_EXTRACTION NONNEGATIVE_MATRIX_FACTOR T_SVM_CLAS_SAMPLE  CLASSIFICATION     SUPPORT_VECTOR_MACHINES   AFFINITY_CARD

From Unstructured Data to Structured Data

The pre-processing steps for text mining create nested table columns of type DM_NESTED_NUMERICALS from columns of type VARCHAR2 or CLOB. Each row of the nested table specifies an attribute name and a value. The DM_NESTED_NUMERICALS type defines the following columns.

attribute_name     VARCHAR2(4000)
value               NUMBER)

The term extraction process treats the text in each row of the original table as a separate document. Each document is transformed to a set of terms that have a numeric value and a text label. Within the nested table column, the attribute_name column holds the text and the value column holds the numeric value of the term, which is derived using the term frequency in the document and in the document collection (other rows).

For example, the following query returns various attributes of customer 102998, including a text column of comments. The text column has not been transformed.

SQL> select cust_id, cust_gender, cust_income_level, affinity_card, comments
             from mining_build_text
             where cust_id = 102998;
 
CUST_ID C  CUST_INCOME_LEVEL     AFFINITY_CARD COMMENTS
------- -- --------------------  ------------- --------------------------------
102998  M  J: 190,000 - 249,999  1             I wanted to write you to let you
                                               know that I've purchased several
                                               items at your store recently and
                                               have been very satisfied with my
                                               purchases. Keep up the good work.

The following query returns the same attributes of customer 102998, but the text in the comments column has been transformed. The query extracts the ATTRIBUTE_NAME and VALUE columns from the nested table that holds the transformed text.

SQL> select b.cust_id, b.cust_gender, b.cust_income_level, b.affinity_card, n.*
             from mining_build_nested_text b,
                  table(b.comments) n
             where b.cust_id = 102998
             order by n.attribute_name;
 
CUST_ID  C  CUST_INCOME_LEVEL    AFFINITY_CARD  ATTRIBUTE_NAME  VALUE
-------  -- -------------------  -------------  --------------  --------
102998   M  J: 190,000 - 249,999  1              GOOD            .26894
102998   M  J: 190,000 - 249,999  1              ITEMS            158062
102998   M  J: 190,000 - 249,999  1              KEEP             238765
102998   M  J: 190,000 - 249,999  1              KNOW              .2006
102998   M  J: 190,000 - 249,999  1              LET              299856
102998   M  J: 190,000 - 249,999  1              PURCHASED        142743
102998   M  J: 190,000 - 249,999  1              PURCHASES        173146
102998   M  J: 190,000 - 249,999  1              RECENTLY        .195223
102998   M  J: 190,000 - 249,999  1              SATISFIED       .355851
102998   M  J: 190,000 - 249,999  1              SEVERAL         .355851
102998   M  J: 190,000 - 249,999  1              STORE          .0712537
102998   M  J: 190,000 - 249,999  1              UP              .159838
102998   M  J: 190,000 - 249,999  1              WANTED          .355851
102998   M  J: 190,000 - 249,999  1              WORK            .299856
102998   M  J: 190,000 - 249,999  1              WRITE           .355851

The ATTRIBUTE_NAME column holds an item of text from the original comments column. The VALUE column holds the term value. Note that not all words from the original comments column are extracted as terms. For example, the articles the and to are not included.

Steps in the Term Extraction Process

The steps in the term extraction process are summarized in this section. Further details and specific syntax requirements are explained later in this chapter.

Transform a Text Column in the Build Table

First transform the text in the build data. During this process you will generate the text term definitions, which you will reuse for the test and apply data. Perform the following steps:

Create an index on the text column in the build table.
Create an SVM_CLASSIFIER preference for the index.
Define a table to hold the categories specified by the SVM_CLASSIFIER index.
Use the FEATURE_PREP table function to create the term definitions and populate an intermediate terms table.
Use the FEATURE_EXPLAIN table function to populate the final terms table.
Replicate the columns of the original build table (using a view or another table), replacing the text column with a nested table column. Load the terms from the final terms table into the nested table column.

Transform a Text Column in the Test and Apply Tables

The test and apply data must undergo the same pre-processing as the build data. To transform the test and apply data, you will reuse the term definitions generated for the build data. Perform the following steps:

Create an index on the text column in the test or apply table.
Use the FEATURE_PREP table function to populate an intermediate terms table. Use the term definitions previously generated for the build data.
Use the FEATURE_EXPLAIN table function to populate the final terms table.
Replicate the columns of the original test or apply table, replacing the text column with a nested table column. Load the terms from the final terms table into the nested table column.

Create the Index and Index Preference

Oracle Text processing requires a text index. Oracle Text supports several types of indexes for querying, cataloging, and classifying text documents. The Oracle Data Mining term extraction process requires a CONTEXT index for text querying.

You must create an index for each text column to be transformed. Use the following syntax to create the index.

SQL>CREATE INDEX index_name ON table_name(column_name)
                   INDEXTYPE IS ctxsys.context PARAMETERS ('nopopulate');

Note:

This statement creates a basic CONTEXT index. You can further define the characteristics of the index by specifying additional arguments to the CREATE INDEX statement. Refer to Oracle Text Reference for details.

Oracle Text supports index preferences for overriding the default characteristics of an index. The CREATE_PREFERENCE procedure in the Oracle Text package CTX_DDL creates a preference with the name and type that you specify. The SVM_CLASSIFIER preference type defines the characteristics of an index for Oracle Data Mining.

You must create an index preference when you prepare the build data. It will be reused when you prepare the test and apply data. Use the following syntax to create the index preference.

SQL>EXECUTE ctx_ddl.create_preference('preference_name', 'SVM_CLASSIFIER');

The SVM_CLASSIFIER index preference uses a predefined table with two numeric columns: an ID column for the case ID, and a CAT column for the category. The category table is used for internal processing. You must create the category table using the following syntax.

SQL>CREATE TABLE category_table_name(id NUMBER, cat NUMBER);

Create the Intermediate Terms Table

The FEATURE_PREP table function in the CTXSYS.DRVODM Oracle Text package extracts terms from a text column using an index preference of type SVM_CLASSIFIER. FEATURE_PREP creates a table of term definitions from the build data and reuses these definitions for the test and apply data.

FEATURE_PREP returns an intermediate terms table.

FEATURE_PREP Calling Syntax

FEATURE_PREP is an over-loaded function that accepts two different sets of arguments. You will specify one set of arguments for the build data and another set for the test and apply data.

--- syntax for build data ---
            CTXSYS.DRVODM.FEATURE_PREP (
                   text_index                IN   VARCHAR2,
                   case_id                   IN   VARCHAR2,
                   category_tbl              IN   VARCHAR2,
                   category_tbl_id_col       IN   VARCHAR2,
                   category_tbl_cat_col      IN   VARCHAR2,
                   feature_definition_tbl    IN   VARCHAR2,
                   index_preference          IN   VARCHAR2)
                RETURN DRVODM;

--- syntax for test/apply data ---
           CTXSYS.DRVODM.FEATURE_PREP (
                   text_index                IN   VARCHAR2,
                   case_id                   IN   VARCHAR2,
                   feature_definition_tbl    IN   VARCHAR2,
                RETURN DRVODM;

FEATURE_PREP Return Value

FEATURE_PREP returns the following columns. The SEQUENCE_ID column holds the case ID; the ATTRIBUTE_ID column holds the term ID.

Name                   NULL?   Type
---------------------- ------- ------
SEQUENCE_ID                    NUMBER
ATTRIBUTE_ID                   NUMBER
VALUE                          NUMBER

FEATURE_PREP Arguments

FEATURE_PREP accepts the arguments described in Table 6-1.

Table 6-1 FEATURE_PREP Table Function Arguments

Argument Name	Data Type
`text_index`	`VARCHAR2`	Name of the index on the text column in the build, test, or apply table.
`case_ID`	`VARCHAR2`	Name of the case ID column in the build, test, or apply table.
`category_tbl`	`VARCHAR2`	Name of the table used by the `SVM_CLASSIFIER` index preference. Specify this argument only for build data.
`category_tbl_id_col`	`VARCHAR2`	Specify 'id'. This is the name of the `ID` column in the table used by the `SVM_CLASSIFIER` index preference. Specify this argument only for build data.
`category_tbl_cat_col`	`VARCHAR2`	Specify 'cat'. This is the name of the `CAT` column in the table used by the `SVM_CLASSIFIER` index preference. Specify this argument only for build data.
`feature_definition_tbl`	`VARCHAR2`	Name of the term definition table created by `FEATURE_PREP`. The columns of the term definition table are: Name Null? Type --------------------------------- CAT_ID NUMBER TYPE NUMBER RULE BLOB
`index_preference`	`VARCHAR2`	Name of the `SVM_CLASSIFIER` index preference. Specify this argument only for build data.

FEATURE_PREP Example

The following example creates an intermediate terms table called txt_term_out. The FEATURE_PREP table function extracts terms from a text column with an index called build_text_idx. The text column is in a build table with a case ID column called cust_id. The index preference txt_pref is applied to the index using the id and cat columns in the table cat_tbl. FEATURE_PREP creates a table of term definitions called txt_pref_terms.

CREATE TABLE txt_term_out AS
SELECT *
  FROM TABLE(ctxsys.drvodm.feature_prep (
               'build_text_idx',
               'cust_id',
               'cat_tbl',
               'id',
               'cat',
               'txt_pref_terms',
               'txt_pref'));

Create the Final Terms Table

The FEATURE_EXPLAIN table function in the CTXSYS.DRVODM Oracle Text package extracts the term values from the definitions created by FEATURE_PREP and appends the associated word to each value.

FEATURE_EXPLAIN returns the final terms table.

FEATURE_EXPLAIN Calling Syntax

The calling syntax of FEATURE_EXPLAIN is described as follows.

CTXSYS.DRVODM.FEATURE_EXPLAIN (
                   feature_definition_tbl     IN   VARCHAR2,
                RETURN DRVODM;

FEATURE_EXPLAIN Return Value

FEATURE_EXPLAIN returns the following columns.

Name              Type
---------------    ---------------
text               VARCHAR2(160)
type               NUMBER(3)
ID                 NUMBER
score              NUMBER

FEATURE_EXPLAIN Arguments

FEATURE_EXPLAIN accepts a single argument: the terms definition table created by FEATURE_PREP.

FEATURE_EXPLAIN Example

The following example creates a final terms table called txt_final_terms using the intermediate terms table txt_term_out. The FEATURE_EXPLAIN table function returns the terms specified in the terms definition table txt_pref_terms.

SQL> create table txt_final_terms as
                   select A.sequence_id, B.text, A.value
                         FROM txt_term_out A,
                              TABLE(ctxsys.drvodm.feature_explain(
                                    'txt_pref_terms')) B
                         WHERE A.attribute_id = B.id;

Populate a Nested Table Column

Use the final terms table to populate a nested table column of type DM_NESTED_NUMERICALS.

The following example creates the table mining_build_nested_text. (Alternatively, you could create a view.) The table has a case ID column of customer IDs and three customer attribute columns: age, education, and occupation. It also includes a comments column of type DM_NESTED_NUMERICALS created from the terms table txt_final_terms.

SQL> CREATE TABLE mining_build_nested_text
       NESTED TABLE comments store AS build_comments
      AS
    SELECT non_text.cust_id,
      non_text.age,
      non_text.education,
      non_text.occupation,
     txt.comments
     FROM
     mining_build_text non_text,
     ( SELECT features.sequence_id,
              cast(COLLECT(dm_nested_numerical(features.text,features.value))
                           as dm_nested_numericals)  comments
       FROM txt_final_terms features
       group by features.sequence_id) txt
     WHERE non_text.cust_id = txt.sequence_id(+);

Example: Transforming a Text Column

In the following example, a text column in MINING_BUILD_TEXT is transformed to a nested table column in MINING_BUILD_NESTED_TEXT. The same text column in MINING_APPLY_TEXT is transformed to a nested table column in MINING_APPLY_NESTED_TEXT.

Both MINING_BUILD_TEXT and MINING_APPLY_TEXT have the following columns.

Name                              Null?    Type
 --------------------------------- -------- ---------------------------
 CUST_ID                           NOT NULL NUMBER
 AGE                                        NUMBER
 EDUCATION                                  VARCHAR2(21)
 OCCUPATION                                 VARCHAR2(21)
 COMMENTS                                   VARCHAR2(4000)

The following statements create the indexes.

SQL> create index build_text_idx on mining_build_text (comments)
             indextype is ctxsys.context parameters ('nopopulate');
SQL> create index apply_text_idx ON mining_apply_text (comments)
             indextype is ctxsys.context parameters ('nopopulate');

The following statements create the index preference and its table.

SQL> execute ctx_ddl.create_preference('idx_pref', 'SVM_CLASSIFIER');
SQL> create table idx_pref_cat (id number, cat number);

The following statement returns the intermediate terms in the table BUILD_TERMS_OUT. It also creates the table FEATURE_DEFS and populates it with the term definitions.

SQL>  create table build_terms_out as
              select * from
                     table (ctxsys.drvodm.feature_prep
                                  ('build_text_idx',
                                   'cust_id',
                                   'idx_pref_cat',
                                   'id',
                                   'cat',
                                   'feature_defs',
                                   'idx_pref'));

The following statement returns the final terms in the table BUILD_EXPLAIN_OUT.

SQL> create table build_explain_out as
             select a.sequence_id,
                    b.text,
                    a.value
             from build_terms_out a,
             table (ctxsys.drvodm.feature_explain('feature_defs')) b
             where a.attribute_id = b.id;

The following statement creates the table MINING_BUILD_NESTED_TEXT. This table contains the non-text attributes from the original build table and a nested table of comments. This table can be used to build a model.

SQL> create table mining_build_nested_text
        nested table comments store as build_comments
          as select non_text.cust_id,
                    non_text.age,
                    non_text.education,
                    non_text.occupation,
                    txt.comments
              from mining_build_text non_text,
             (select features.sequence_id,
                cast(collect(dm_nested_numerical(features.text,features.value))
                as dm_nested_numericals)  comments
              from build_explain_out features
              group by features.sequence_id) txt
              where non_text.cust_id = txt.sequence_id(+);

The following statement creates the intermediate terms table for the comments column in the apply table, MINING_APPLY_TEXT. It uses the term definitions in the FEATURE_DEFS table, which was created during the pre-processing of the comments column in MINING_BUILD_TEXT.

SQL>  create table apply_terms_out as
              select * from
                     table (ctxsys.drvodm.feature_prep
                                  ('build_text_idx',
                                   'cust_id',
                                   'feature_defs'));

The following statement creates the final terms table for apply.

SQL> create table apply_explain_out as
             select a.sequence_id,
                    b.text,
                    a.value
             from apply_terms_out a,
             table (ctxsys.drvodm.feature_explain('feature_defs')) b
             where a.attribute_id = b.id;

The following statement creates the table MINING_APPLY_NESTED_TEXT. This table contains the non-text attributes from the original apply table and a nested table of comments. This table can be used to apply the model.

SQL> create table mining_apply_nested_text
        nested table comments store as apply_comments
          as select non_text.cust_id,
                    non_text.age,
                    non_text.education,
                    non_text.occupation,
                    txt.comments
              from mining_apply_text non_text,
             (select features.sequence_id,
                cast(collect(dm_nested_numerical(features.text,features.value))
                as dm_nested_numericals)  comments
              from apply_explain_out features
              group by features.sequence_id) txt
              where non_text.cust_id = txt.sequence_id(+);