Oracle® Database Concepts 11g Release 1 (11.1) Part Number B28318-01 |
|
|
View PDF |
This chapter describes some of the basic ideas in business intelligence.
This chapter contains the following topics:
A data warehouse is a relational database that is designed for query and analysis rather than for transaction processing. It usually contains historical data derived from transaction data, but it can include data from other sources. It separates analysis workload from transaction workload and enables an organization to consolidate data from several sources.
In addition to a relational database, a data warehouse environment includes an extraction, transportation, transformation, and loading (ETL) solution, an online analytical processing (OLAP) engine, Oracle Warehouse Builder, client analysis tools, and other applications that manage the process of gathering data and delivering it to business users.
This section includes the following topics:
Data warehouses all share the following basic characteristics:
Data warehouses are designed to help you analyze data. For example, to learn more about your company's sales data, you can build a warehouse that concentrates on sales. Using this warehouse, you can answer questions like "Who was our best customer for this item last year?" This ability to define a data warehouse by subject matter, sales in this case, makes the data warehouse subject oriented.
Integration is closely related to subject orientation. Data warehouses must put data from disparate sources into a consistent format. They must resolve such problems as naming conflicts and inconsistencies among units of measure. When they achieve this, they are said to be integrated.
Nonvolatile means that, once entered into the warehouse, data should not change. This is logical because the purpose of a warehouse is to enable you to analyze what has occurred.
In order to discover trends in business, analysts need large amounts of data. This is very much in contrast to online transaction processing (OLTP) systems, where performance requirements demand that historical data be moved to an archive. A data warehouse's focus on change over time is what is meant by the term time variant.
Typically, data flows from one or more online transaction processing (OLTP) databases into a data warehouse on a monthly, weekly, or daily basis. The data is normally processed in a staging file before being added to the data warehouse. Data warehouses commonly range in size from tens of gigabytes to a few terabytes. Usually, the vast majority of the data is stored in a few very large fact tables.
Data warehouses and OLTP systems have very different requirements. Here are some examples of differences between typical data warehouses and OLTP systems:
Data warehouses are designed to accommodate ad hoc queries. You might not know the workload of your data warehouse in advance, so a data warehouse should be optimized to perform well for a wide variety of possible query operations.
OLTP systems support only predefined operations. Your applications might be specifically tuned or designed to support only these operations.
A data warehouse is updated on a regular basis by the ETL process (run nightly or weekly) using bulk data modification techniques. The end users of a data warehouse do not directly update the data warehouse.
In OLTP systems, end users routinely issue individual data modification statements to the database. The OLTP database is always up to date, and reflects the current state of each business transaction.
Data warehouses often use denormalized or partially denormalized schemas (such as a star schema) to optimize query performance.
OLTP systems often use fully normalized schemas to optimize update/insert/delete performance, and to guarantee data consistency.
A typical data warehouse query scans thousands or millions of rows.For example, "Find the total sales for all customers last month."
A typical OLTP operation accesses only a handful of records. For example, "Retrieve the current order for this customer."
Data warehouses usually store many months or years of data. This is to support historical analysis.
OLTP systems usually store data from only a few weeks or months. The OLTP system stores only historical data as needed to successfully meet the requirements of the current transaction.
Data warehouses and their architectures vary depending upon the specifics of an organization's situation. Three common architectures are:
Figure 16-1 shows a simple architecture for a data warehouse. End users directly access data derived from several source systems through the data warehouse.
Figure 16-1 Architecture of a Data Warehouse
In Figure 16-1, the metadata and raw data of a traditional OLTP system is present, as is an additional type of data, summary data. Summaries are very valuable in data warehouses because they pre-compute long operations in advance. For example, a typical data warehouse query is to retrieve something like August sales.
Summaries in Oracle Database are called materialized views.
Figure 16-1, you need to clean and process your operational data before putting it into the warehouse. You can do this programmatically, although most data warehouses use a staging area instead. A staging area simplifies building summaries and general warehouse management. Figure 16-2 illustrates this typical architecture.
Figure 16-2 Architecture of a Data Warehouse with a Staging Area
Although the architecture in Figure 16-2 is quite common, you might want to customize your warehouse's architecture for different groups within your organization.
Do this by adding data marts, which are systems designed for a particular line of business. Figure 16-3 illustrates an example where purchasing, sales, and inventories are separated. In this example, a financial analyst might want to analyze historical data for purchases and sales.
Figure 16-3 Architecture of a Data Warehouse with a Staging Area and Data Marts
See Also:
Oracle Database Data Warehousing GuideYou need to load your data warehouse regularly so that it can serve its purpose of facilitating business analysis. To do this, data from one or more operational systems needs to be extracted and copied into the warehouse. The process of extracting data from source systems and bringing it into the data warehouse is commonly called ETL, which stands for extraction, transformation, and loading. The acronym ETL is perhaps too simplistic, because it omits the transportation phase and implies that each of the other phases of the process is distinct. The entire process, including data loading, is referred to as ETL. You should understand that ETL refers to a broad process, and not three well-defined steps.
The methodology and tasks of ETL have been well known for many years, and are not necessarily unique to data warehouse environments: a wide variety of proprietary applications and database systems are the IT backbone of any enterprise. Data has to be shared between applications or systems, trying to integrate them, giving at least two applications the same picture of the world. This data sharing was mostly addressed by mechanisms similar to what is now called ETL.
Data warehouse environments face the same challenge with the additional burden that they not only have to exchange but to integrate, rearrange and consolidate data over many systems, thereby providing a new unified information base for business intelligence. Additionally, the data volume in data warehouse environments tends to be very large.
What happens during the ETL process? During extraction, the desired data is identified and extracted from many different sources, including database systems and applications. Very often, it is not possible to identify the specific subset of interest, therefore more data than necessary has to be extracted, so the identification of the relevant data will be done at a later point in time. Depending on the source system's capabilities (for example, operating system resources), some transformations may take place during this extraction process. The size of the extracted data varies from hundreds of kilobytes up to gigabytes, depending on the source system and the business situation. The same is true for the time delta between two (logically) identical extractions: the time span may vary between days/hours and minutes to near real-time. Web server log files for example can easily become hundreds of megabytes in a very short period of time.
After extracting data, it has to be physically transported to the target system or an intermediate system for further processing. Depending on the chosen way of transportation, some transformations can be done during this process, too. For example, a SQL statement which directly accesses a remote target through a gateway can concatenate two columns as part of the SELECT
statement.
If any errors occur during loading, an error is logged and the operation can continue.
This section includes the following topics:
Transportable tablespaces are the fastest way for moving large volumes of data between two Oracle databases. You can transport tablespaces between different computer architectures and operating systems.
Previously, the most scalable data transportation mechanisms relied on moving flat files containing raw data. These mechanisms required that data be unloaded or exported into files from the source database. Then, after transportation, these files were loaded or imported into the target database. Transportable tablespaces entirely bypass the unload and reload steps.
Using transportable tablespaces, Oracle Database data files (containing table data, indexes, and almost every other Oracle database object) can be directly transported from one database to another. Furthermore, like import and export, transportable tablespaces provide a mechanism for transporting metadata in addition to transporting data.
The most common applications of transportable tablespaces in data warehouses are in moving data from a staging database to a data warehouse, or in moving data from a data warehouse to a data mart.
Table functions provide the support for pipelined and parallel execution of transformations implemented in PL/SQL, C, or Java. Scenarios as mentioned earlier can be done without requiring the use of intermediate staging tables, which interrupt the data flow through various transformations steps.
A table function is defined as a function that can produce a set of rows as output. Additionally, table functions can take a set of rows as input. Table functions extend database functionality by allowing:
Multiple rows to be returned from a function
Results of SQL subqueries (that select multiple rows) to be passed directly to functions
Functions take cursors as input
Functions can be parallelized
Returning result sets incrementally for further processing as soon as they are created. This is called incremental pipelining
Table functions can be defined in PL/SQL using a native PL/SQL interface, or in Java or C using the Oracle Data Cartridge Interface (ODCI).
External tables let you use external data as a virtual table that can be queried and joined directly and in parallel without requiring the external data to be first loaded in the database. You can then use SQL, PL/SQL, and Java to access the external data.
External tables enable the pipelining of the loading phase with the transformation phase. The transformation process can be merged with the loading process without any interruption of the data streaming. It is no longer necessary to stage the data inside the database for further processing inside the database, such as comparison or transformation. For example, the conversion functionality of a conventional load can be used for a direct-path INSERT AS SELECT
statement in conjunction with the SELECT
from an external table. Figure 16-4 illustrates a typical example of pipelining.
Figure 16-4 Pipelined Data Transformation
The main difference between external tables and regular tables is that externally organized tables are read-only. No DML operations (UPDATE
/INSERT
/DELETE
) are possible and no indexes can be created on them.
External tables are a complement to SQL*Loader and are especially useful for environments where the complete external source has to be joined with existing database objects and transformed in a complex manner, or where the external data volume is large and used only once. SQL*Loader, on the other hand, might still be the better choice for loading of data where additional indexing of the staging table is necessary. This is true for operations where the data is used in independent complex transformations or the data is only partially used in further processing.
You can save disk space by compressing heap-organized tables. A typical type of heap-organized table you should consider for table compression is partitioned tables.
To reduce disk use and memory use (specifically, the buffer cache), you can store tables and partitioned tables in a compressed format inside the database. This often leads to a better scaleup for read-only operations. Table compression can also speed up query execution. There is, however, a slight cost in CPU overhead.
Table compression should be used with highly redundant data, such as tables with many foreign keys. You should avoid compressing tables with much update or other DML activity. Although compressed tables or partitions are updatable, there is some overhead in updating these tables, and high update activity may work against compression by causing some space to be wasted.
See Also:
"Table Compression"Change Data Capture efficiently identifies and captures data that has been added to, updated, or removed from Oracle Database relational tables, and makes the change data available for use by applications.
Oftentimes, data warehousing involves the extraction and transportation of relational data from one or more source databases into the data warehouse for analysis. Change Data Capture quickly identifies and processes only the data that has changed, not entire tables, and makes the change data available for further use.
Change Data Capture does not depend on intermediate flat files to stage the data outside of the relational database. It captures the change data resulting from INSERT
, UPDATE
, and DELETE
operations made to user tables. The change data is then stored in a database object called a change table, and the change data is made available to applications in a controlled way.
See Also:
Oracle Database Data Warehousing GuideOne technique employed in data warehouses to improve performance is the creation of summaries. Summaries are special kinds of aggregate views that improve query execution times by precalculating expensive joins and aggregation operations prior to execution and storing the results in a table in the database. For example, you can create a table to contain the sums of sales by region and by product.
The summaries or aggregates that are referred to in this book and in literature on data warehousing are created in Oracle Database using a schema object called a materialized view. Materialized views can perform a number of roles, such as improving query performance or providing replicated data.
Previously, organizations using summaries spent a significant amount of time and effort creating summaries manually, identifying which summaries to create, indexing the summaries, updating them, and advising their users on which ones to use. Summary management eased the workload of the database administrator and meant that the user no longer needed to be aware of the summaries that had been defined. The database administrator creates one or more materialized views, which are the equivalent of a summary. The end user queries the tables and views at the detail data level.
The query rewrite mechanism in Oracle Database automatically rewrites the SQL query to use the summary tables. This mechanism reduces response time for returning results from the query. Materialized views within the data warehouse are transparent to the end user or to the database application.
Although materialized views are usually accessed through the query rewrite mechanism, an end user or database application can construct queries that directly access the summaries. However, serious consideration should be given to whether users should be allowed to do this because any change to the summaries will affect the queries that reference them.
To help you select from among the many possible materialized views in your schema, Oracle Database provides a collection of materialized view analysis and advisor functions and procedures in the DBMS_ADVISOR
package. Collectively, these functions are called the SQL Access Advisor, and they are callable from any PL/SQL program. The SQL Access Advisor recommends materialized views from a hypothetical or user-defined workload or one obtained from the SQL cache. You can run the SQL Access Advisor from Oracle Enterprise Manager or by invoking the DBMS_ADVISOR
package.
See Also:
Oracle Database Performance Tuning Guide for information about materialized views and the SQL Access AdvisorBitmap indexes are widely used in data warehousing environments. The environments typically have large amounts of data and ad hoc queries, but a low level of concurrent DML transactions. For such applications, bitmap indexing provides:
Reduced response time for large classes of ad hoc queries
Reduced storage requirements compared to other indexing techniques
Dramatic performance gains even on hardware with a relatively small number of CPUs or a small amount of memory
Efficient maintenance during parallel DML and loads
Fully indexing a large table with a traditional B-tree index can be prohibitively expensive in terms of space because the indexes can be several times larger than the data in the table. Bitmap indexes are typically only a fraction of the size of the indexed data in the table.
An index provides pointers to the rows in a table that contain a given key value. A regular index stores a list of rowids for each key corresponding to the rows with that key value. In a bitmap index, a bitmap for each key value replaces a list of rowids.
Each bit in the bitmap corresponds to a possible rowid, and if the bit is set, it means that the row with the corresponding rowid contains the key value. A mapping function converts the bit position to an actual rowid, so that the bitmap index provides the same functionality as a regular index. If the number of different key values is small, bitmap indexes save space.
Bitmap indexes are most effective for queries that contain multiple conditions in the WHERE
clause. Rows that satisfy some, but not all, conditions are filtered out before the table itself is accessed. This improves response time, often dramatically. A good candidate for a bitmap index would be a gender column due to the low number of possible values.
Parallel query and parallel DML work with bitmap indexes as they do with traditional indexes. Bitmap indexing also supports parallel create indexes and concatenated indexes.
See Also:
Oracle Database Data Warehousing GuideWhen Oracle Database runs SQL statements in parallel, multiple processes work together simultaneously to run a single SQL statement. By dividing the work necessary to run a statement among multiple processes, Oracle Database can run the statement more quickly than if only a single process ran it. This is called parallel execution or parallel processing.
Parallel execution dramatically reduces response time for data-intensive operations on large databases typically associated with decision support systems (DSS) and data warehouses. Symmetric multiprocessing (SMP), clustered systems, and large-scale cluster systems gain the largest performance benefits from parallel execution because statement processing can be split up among many CPUs on a single Oracle Database system. You can also implement parallel execution on certain types of online transaction processing (OLTP) and hybrid systems.
Parallelism is the idea of breaking down a task so that, instead of one process doing all of the work in a query, many processes do part of the work at the same time. An example of this is when 12 processes handle 12 different months in a year instead of one process handling all 12 months by itself. The improvement in performance can be quite high.
Parallel execution helps systems scale in performance by making optimal use of hardware resources. If your system's CPUs and disk controllers are already heavily loaded, you need to alleviate the system's load or increase these hardware resources before using parallel execution to improve performance.
In Oracle RAC environments, parallel execution is controlled by the service placement of a particular service. Specifically, parallel processes run on the nodes on which you have configured the service. The default behavior is for Oracle Database to run the parallel process only on the instance that offers the service that you used to connect to the database. This does not affect other parallel operations such as parallel recovery or the processing of GV$queries
.
Some tasks are not well-suited for parallel execution. For example, many OLTP operations are relatively fast, completing in mere seconds or fractions of seconds, and the overhead of utilizing parallel execution would be large, relative to the overall execution time.
See Also:
Oracle Database Data Warehousing Guide for specific information on tuning your parameter files and database to take full advantage of parallel execution and the Oracle Real Application Clusters Administration and Deployment Guide for considerations regarding parallel execution in Oracle RAC environmentsWhen parallel execution is not used, a single server process performs all necessary processing for the sequential execution of a SQL statement. For example, to perform a full table scan (such as SELECT
* FROM emp
), one process performs the entire operation, as illustrated in Figure 16-5.
Figure 16-6 illustrates several parallel execution servers performing a scan of the table emp
. The table is divided dynamically (dynamic partitioning) into load units called granules and each granule is read by a single parallel execution server. The granules are generated by the coordinator. Each granule is a range of physical blocks of the table emp
. The mapping of granules to execution servers is not static, but is determined at execution time. When an execution server finishes reading the rows of the table emp
corresponding to a granule, it gets another granule from the coordinator if there are any granules remaining. This continues until all granules are exhausted, in other words, until the entire table emp
has been read. The parallel execution servers send results back to the parallel execution coordinator, which assembles the pieces into the desired full table scan.
Given a query plan for a SQL query, the parallel execution coordinator breaks down each operator in a SQL query into parallel pieces, runs them in the right order as specified in the query, and then integrates the partial results produced by the parallel execution servers executing the operators. The number of parallel execution servers assigned to a single operation is the degree of parallelism (DOP) for an operation. Multiple operations within the same SQL statement all have the same degree of parallelism.
See Also:
Oracle Database Data Warehousing Guide for information on granules as well as how Oracle Database divides work and handles DOP in multiuser environmentsOracle has introduced many SQL operations for performing analytic operations in the database. These operations include ranking, moving averages, cumulative sums, ratio-to-reports, and period-over-period comparisons. Although some of these calculations were previously possible using SQL, this syntax offers much better performance.
This section discusses:
Aggregation is a fundamental part of data warehousing. To improve aggregation performance in your warehouse, Oracle Database provides extensions to the GROUP BY
clause to make querying and reporting easier and faster. Some of these extensions enable you to:
Aggregate at increasing levels of aggregation, from the most detailed up to a grand total
Calculate all possible combinations of aggregations with a single statement
Generate the information needed in cross-tabulation reports with a single query
These extension let you specify exactly the groupings of interest in the GROUP BY
clause. This allows efficient analysis across multiple dimensions without performing a CUBE
operation. Computing a full cube creates a heavy processing load, so replacing cubes with grouping sets can significantly increase performance. CUBE
, ROLLUP
, and grouping sets produce a single result set that is equivalent to a UNION ALL
of differently grouped rows.
To enhance performance, these extensions can be parallelized: multiple processes can simultaneously run all of these statements. These capabilities make aggregate calculations more efficient, thereby enhancing database performance, and scalability.
One of the key concepts in decision support systems is multidimensional analysis: examining the enterprise from all necessary combinations of dimensions. The term dimension is used to mean any category used in specifying questions. Among the most commonly specified dimensions are time, geography, product, department, and distribution channel, but the potential dimensions are as endless as the varieties of enterprise activity. The events or entities associated with a particular set of dimension values are usually referred to as facts. The facts might be sales in units or local currency, profits, customer counts, production volumes, or anything else worth tracking.
Here are some examples of multidimensional requests:
Show total sales across all products at increasing aggregation levels for a geography dimension, from state to country to region, for 1999 and 2000.
Create a cross-tabular analysis of our operations showing expenses by territory in South America for 1999 and 2000. Include all possible subtotals.
List the top 10 sales representatives in Asia according to 2000 sales revenue for automotive products, and rank their commissions.
All these requests involve multiple dimensions. Many multidimensional questions require aggregated data and comparisons of data sets, often across time, geography or budgets.
See Also:
Oracle Database Data Warehousing GuideOracle has advanced SQL analytical processing capabilities using a family of analytic SQL functions. These analytic functions enable you to calculate:
Rankings and percentiles
Moving window calculations
Lag/lead analysis
First/last analysis
Linear regression statistics
Ranking functions include cumulative distributions, percent rank, and N-tiles. Moving window calculations allow you to find moving and cumulative aggregations, such as sums and averages. Lag/lead analysis enables direct inter-row references so you can calculate period-to-period changes. First/last analysis enables you to find the first or last value in an ordered group.
Other features include the CASE
expression. CASE
expressions provide if-then logic useful in many situations.
To enhance performance, analytic functions can be parallelized: multiple processes can simultaneously run all of these statements. These capabilities make calculations easier and more efficient, thereby enhancing database performance, scalability, and simplicity.
See Also:
Oracle Database Data Warehousing GuideThe Oracle MODEL
clause brings a new level of power and flexibility to SQL calculations. With the MODEL
clause, you can create a multidimensional array from query results and then apply formulas to this array to calculate new values. The formulas can range from basic arithmetic to simultaneous equations using recursion. For some applications, the MODEL
clause can replace PC-based spreadsheets. Models in SQL leverage the Oracle Database strengths in scalability, manageability, collaboration, and security. The core query engine can work with unlimited quantities of data. By defining and executing models within the database, users avoid transferring large datasets to and from separate modeling environments. Models can be shared easily across workgroups, ensuring that calculations are consistent for all applications. Just as models can be shared, access can also be controlled precisely with the Oracle Database security features. With its rich functionality, the MODEL
clause can enhance all types of applications.
See Also:
Oracle Database Data Warehousing GuideOracle online analytical processing (OLAP) adds power to your SQL applications by providing extensive analytic content and fast query response times. A SQL query interface enables any application to query cubes and dimensions without any knowledge of OLAP.
The OLAP option automatically generates a set of relational views on cubes, dimensions, and hierarchies. SQL applications query these views to display the information-rich contents of these objects to analysts and decision makers. You can also create custom views that comply with the structure expected by your applications, using the system-generated views like base tables.
Analysts can choose any SQL query and analysis tool for selecting, viewing, and analyzing the data. You can use your favorite tool or application, or use one of the tools supplied with Oracle Database, such as Oracle Application Express and Business Intelligence Publisher.
See Also:
Oracle OLAP User's GuideThis section includes the following topics:
By integrating multidimensional objects and analytics into the database, Oracle provides the best of both worlds: the power of multidimensional analysis along with the reliability, availability, security, and scalability of Oracle Database.
Oracle OLAP is fully integrated into Oracle Database. At a technical level, this means:
The OLAP engine runs within the kernel of Oracle Database
Dimensional objects are stored in Oracle Database in their native multidimensional format
Cubes and other dimensional objects are first class data objects represented in the Oracle data dictionary
Data security is administered in the standard way, by granting and revoking privileges to Oracle Database users and roles
Applications can query dimensional objects using SQL
The benefits to your organization are significant. Oracle OLAP offers the power of simplicity. One database, standard administration and security, standard interfaces and development tools.
Oracle OLAP makes it easy to enrich your database and your applications with interesting analytic content. Native SQL access to Oracle multidimensional objects and calculations greatly eases the task of developing dashboards, reports, business intelligence, and analytical applications of any type compared to systems that offer proprietary interfaces. Moreover, SQL access means that the power of Oracle OLAP analytics can be used by any database application, not just by the traditional limited collection of OLAP applications.
Because Oracle OLAP is completely embedded in Oracle Database, there is no administration learning curve as is typically associated with stand-alone OLAP servers. You can leverage your existing DBA staff, rather than invest in specialized administration skills.
One major administrative advantage of Oracle's embedded OLAP technology is automated cube maintenance. With stand-alone OLAP servers, the burden of refreshing the cube is left entirely to the administrator. This can be a complex and potentially error-prone job. The administrator must create procedures to extract the changed data from the relational source, move the data from the source system to the system running the stand-alone OLAP server, load and rebuild the cube. The administrator must take responsibility for the security of the changed values during this process, as well.
With Oracle OLAP, in contrast, cube refresh is handled entirely by Oracle Database. The database tracks the staleness of the dimensional objects, automatically keeps track of the deltas in the source tables, and automatically applies only the changed values during the refresh process. The administrator simply schedules the refresh at appropriate intervals, and Oracle Database takes care of everything else.
With Oracle OLAP, standard Oracle Database security features are used to secure your multidimensional data.
In contrast, with a stand-alone OLAP server, administrators must manage security twice: once on the relational source system and again on the OLAP server system. Additionally, they must manage the security of data in transit from the relational system to the stand-alone OLAP system.
Business intelligence and analytical applications are dominated by actions such as drilling up and down hierarchies and comparing aggregate values such as period-over-period, share of parent, projections onto future time periods, and a myriad of similar calculations. Often these actions are essentially random across the entire space of potential hierarchical aggregations. Because Oracle OLAP pre-computes or efficiently computes on the fly all aggregates in the defined multidimensional space, it delivers unmatched performance for typical business intelligence applications.
Oracle OLAP queries take advantage of Oracle shared cursors, dramatically reducing memory requirements and increasing performance.
When Oracle Database is installed with Oracle Real Application Clusters (Oracle RAC), OLAP applications receive the same benefits in performance, scalability, failover, and load balancing as any other application.
All these features add up to reduced costs. Administrative costs are reduced because existing personnel skills can be leveraged. Moreover, Oracle Database can manage the refresh of dimensional objects, a complex task left to administrators in other systems. Standard security reduces administration costs as well. Application development costs are reduced because the availability of a large pool of application developers who are SQL knowledgeable, and a large collection of SQL-based development tools means applications can be developed and deployed more quickly. Any SQL-based development tool can take advantage of Oracle OLAP. Hardware costs are reduced by Oracle OLAP's efficient management of aggregations, use of shared cursors, and Oracle RAC, which enables highly scalable systems to be built from low-cost commodity components.
Oracle Data Mining embeds data mining in the Oracle Database. The data never leaves the database — data preparation, model building, and model scoring are all performed within the database. Since the data never leaves the database, there are significant advantages in scalability, manageability, and user access. Thus, the Oracle Database provides an infrastructure for application developers to integrate data mining seamlessly with database applications. Data mining is often used in applications such as call centers, ATMs, ERM, and business planning.
As of Oracle Database 11g, Oracle Data Mining models are implemented as data dictionary objects in the SYS schema. A set of new data dictionary views present mining models and their properties. New system and object privileges control access to mining model objects.
Support of Generalized Linear Models (GLM) is new for Oracle Data Mining 11g. Oracle Data Mining supports two forms of GLM, one for classification and one for regression:
Binary Logistic Regression, used for classification, predicts the probability for each row of scoring data. The dependent variable (target) is binary and categorical. For example, demographic attributes might be used to predict whether customer response to a promotion is low or high.
Multivariate Linear Regression, used for regression, predicts the best estimate within a continuum for each row of scoring data. For example, demographic attributes such as age bracket, income level, gender, and town of residence might be used to predict sales per customer.
Oracle Data Mining GLM can handle many hundreds or thousands of input attributes, unlike traditional implementations that typically support 30 or fewer input attributes.
Data mining activities such as model building, testing, and scoring are accomplished through a PL/SQL API, a Java API, and SQL Data Mining functions. The Java API is compliant with the data mining standard JSR 73. The Java API and the PL/SQL API are fully interoperable.
Optionally, Oracle Data Mining can automatically perform all algorithm-required data preparation, such as binning, normalization, and outlier treatment. Additionally, user-specified data transformations can be integrated with the algorithm-specific data preparation to simplify testing and scoring; models like this are supermodel.
The SQL Data Mining functions are SQL language operators for the deployment of data mining models. The Data Mining functions support the scoring of classification, regression, clustering, and feature extraction models. Within the context of standard SQL statements, pre-created models can be applied to new data and the results returned for further processing.
Predictive Analytics is a technology that captures data mining processes in simple routines. Sometimes called "one-click data mining," predictive analytics simplify and automate the data mining process. The procedure returns the results of analytic processing. The models and other intermediate objects are not preserved. The DBMS_PREDICTIVE_ANALYTICS
PL/SQL package implements Predictive Analytics with the following procedures:
EXPLAIN
- Ranks attributes in order of strongest relationships with a target attribute.
PREDICT
- Predicts the value of a target attribute.
PROFILE
- Creates rules that identify the records that have the same target value.
Oracle Data Mining supports the following algorithms (Generalized Linear Models are new for Oracle Database 11g):
For classification, Naive Bayes, Decision Tree, Generalized Linear Models (Binary Logistic Regression), and Support Vector Machine
For regression, Support Vector Machine and Generalized Linear Models (Multivariate Linear Regression)
For associations (market basket analysis), Apriori
For clustering, k-Means and O-Cluster
For attribute importance, Minimum Description Length
For anomaly detection, One Class Support Vector Machine
For feature extraction, Non-Negative Matrix Factorization
Oracle Data Mining Java API Reference contains Javadoc descriptions of the classes that constitute the Oracle Data Mining Java API
The PL/SQL API is described in the DBMS_DATA_MINING
, DBMS_DATA_MINING_TRANSFORM
, and DBMS_PREDICTIVE_ANALYTICS
chapters of Oracle Database PL/SQL Packages and Types Reference
The SQL Data Mining functions are described in Oracle Database SQL Language Reference