Replacing 3GL Data Loading with SQL*Loader and Database Triggers

 

Michael A. Rife                                                           Rumpi Gravenstein

Goodyear Tire & Rubber Company                            Rumken, Inc.

Gerhardt Martin                                                            Jen Croy

Goodyear Tire & Rubber Company                             Rumken, Inc.

 

Introduction

This paper describes a design to replace 3GL data load programs with SQL*Loader and database triggers while supporting tracking of load errors. In presenting this design the paper discusses why to use SQL*Loader, SQL*Loader datatypes, typical SQL*Loader use, two SQL*Loader approaches, future enhancements, and am example of using on the SQL*Loader approaches.

This design was implemented in an Oracle8 for Solaris environment.  All the techniques used are compatible with Oracle7, but have not been tested in Oracle7.

Why Use SQL*Loader

In a 3GL loader program the business rules for loading data are usually coded in the 3GL application.  Using SQL*Loader, database triggers, and database packages moves the load logic and business rules into the database and out of the load program.  Since the logic is stored in the database, any logic that is common across data loads can be shared.  The resulting environment has less code and is easier to maintain.

3GL loader programs also tend to be more difficult to develop, since both 3GL and Oracle skills are required.  It is easier to find a developer with Oracle skills than to find a developer with expertise in both skill sets.  In addition, a developer skilled in both areas would likely command a higher salary.  Since PL/SQL tends to be a higher level language than 3GL languages, the use of a pure PL/SQL approach should save development time. The developer would be working at a higher level and does not need to be concerned with the level of minutia that need to be handled by a 3GL program.

If that is not enough, we believe that 3GL loader programs are more difficult to maintain. Operating system, Oracle database, and 3GL language upgrades require re-linking and possible recompilation of 3GL programs while PL/SQL based programs need very little attention during migrations as they migrate with the database.  An added issue here is that 3GL programs, and the people that support them can not always move from one environment (say MVS) to another  (say Solaris).  One platform's implementation of a language may have different libraries from another's implementation.  With the business rules coded in PL/SQL database packages and utilizing SQL*Loader, these packages and the people that support them will be portable across operating system, Oracle database, and PL/SQL language migrations and upgrades. 

SQL*Loader Datatypes

SQL*Loader can handle a large number of data types.  The table below shows the data type in the data file and what Oracle data type SQL*Loader could convert that data type to.


Data Type in File

Oracle Data Type

Char

Any type

Integer

Number

Small integer

Number

Float

Number

Double Precision

Number

Byte Integer

Number

Packed Decimal

Number

 

Data Type in File

Oracle Data Type

Zoned

Number

Varchar

Varchar2

Graphics

Raw/long

Graphic External

Raw/long

VarGraphic

Raw/long

Raw

Raw

 

 


Typical SQL*Loader Use

The typical manner in which SQL*Loader is used is made up of three main components.  There is the SQL*Loader process, staging tables, and post insert triggers on the staging tables.  A description of each of these elements follows.

The SQL*Loader process is an operating specific program that executes at a specific time or is caused to execute by an operating system event like the receipt of a data file.  This OS program can do preprocessing of the data file and invokes SQL*Loader to load the data file into a staging table.  Once SQL*Loader is finished loading the data file into the database, the OS program creates and distributes a load report.  Lastly the OS program handles any required data file archiving issues.

As SQL*Loader loads data into the staging table, it will save a set of records at a time.  The set size, transaction set, is defined by the number of rows SQL*Loader loads during each commit.  A transaction set then can be one or more lines from the data file being loaded, the size of which is normally determined while tuning the load process.  The data loaded into the staging table resides only temporarily in this table as it is moved almost immediately after it is loaded.  The staging table is normally defined to mirror the layout of the data file.  The mirroring here includes mirroring the column data types.  That is if you are loading an ASCII file, all the columns in the staging table should be defined as VARCHAR2 or CHAR.  Similarly if you are loading data from a binary data file, the columns in the staging table should match exactly the definitions of the structure in the data file. This is done so that load errors do not occur while SQL*Loader is loading data into the table, but rather later while the data is being moved from the staging table.  Handling errors later allows the developer more control (with PL/SQL code) in how he wants to deal with these data errors.

Once a transaction set is inserted into the staging table a post-insert trigger fires.  It processes the transaction set of records that was just inserted into the staging table.  The trigger moves the data from the staging table into the destination tables while handling all data errors.  The logic in this post-insert trigger is:

1.       Mark this transaction set’s records in the staging table as unprocessed

2.       Get a set of unprocessed records out of the staging table

3.       If records were fetched, then process and save the data to the destination table(s)

4.       If no records were fetched, then end processing

5.       Delete or mark fetched records as processed

6.       Go back to step 2

The key to the logic is the fact that the database trigger on the staging table contains the DML statements for saving the records to the destination tables.  The trigger deletes the data out of the staging table, so the data in the staging table is only available for this SQL*Loader's Oracle session.  Given this load processing approach, the data that is temporarily stored in the staging table is not accessible to any other Oracle sessions.

The following chart shows the general flow in a typical SQL*Loader design.


 


SQL*Loader Approaches

There are two approaches to using SQL*Loader.  There are two major differences between the two approaches.  The first is obviously the number of staging tables and table triggers that are used.  One approach has multiple staging tables while the other utilizes a single generic staging table.  The other difference is when using multiple staging tables the SQL*Loader process does the parsing of the data file line into fields.  When using a single staging table, parsing the data file line is performed by the post-insert trigger or a package called by this trigger.

Data Loading with Multiple Staging Tables

Data loading with multiple staging tables is composed of the SQL*Loader process, multiple staging tables, post-insert triggers on the staging tables, a staging table specific data package, and a common load package.  A description of each of these elements follows.

The SQL*Loader process is identical to the one described in the Typical SQL*Loader Use section above.  The SQL*Loader process handles generating the load report as well as parsing the data file's lines into fields and saving those fields in a staging table.

Since there is one staging table per data file, there will be many staging tables if there are many data file types to load.  The layout of each staging table mirrors the layout of the file being loaded.  Each field in the data file will have a matching field in the staging table. There may be some additional bookkeeping fields for tracking the data file's load status, recording load order, time stamping the processing of the record, as well as capturing any load messages or errors that occur.

Once a set of records has been inserted into a staging table by the SQL*Loader process, the post-insert trigger on that staging table executes once for a set of records.  The logic in this post-insert trigger is:

1.       Mark this transaction set’s records in the staging table as unprocessed

2.       Get all the unprocessed records out of the staging table

3.       For each unprocessed record fetched call the data interface package to process and save the data in the destination table(s)

4.       Delete or mark unprocessed records as processed

The data interface package procedure loads the data in that specific staging table.  The data interface package procedure can insert the data directly into the destination tables or can utilize shared DML procedures in the database for inserting, updating, and deleting data in the destination tables.  These shared procedures are stored in a general use database package.

The chart below gives the general flow of data loading using multiple staging tables.


Data Loading with a Single Staging Table

Data loading with a single staging table is composed of the SQL*Loader process, one staging table, two  table triggers on the staging table, a data interface package, a common load package, and data load tracking tables.  A description of each of these elements follows.

The SQL*Loader process is similar to the one described in the Data Loading with Multiple Staging Tables section above. The SQL*Loader process does not generate the load report, but still does the distribution of the load report.  With a single staging table the SQL*Loader process does not parse the data file's lines into fields.  It saves the record directly in the staging table.  If the data file's lines were made of a mixture of binary and character datatypes, then the SQL*Loader process would parse like datatypes into one field in the staging table.  For example, if the lines in the data file had 100 characters, composed of different fields, at the beginning of each line followed by 1000 bytes of binary data, the staging table would have a character field as well as a raw field.  The SQL*Loader process would put the 100 characters into one character field in the staging table and the 1000 bytes of data into a raw field in the staging table.  If a data file had no binary data, nothing would get loaded into the raw field in the staging table.

There is one staging table.  It is laid out generically to be able to handle all datatypes in all data files to be loaded.  All data files being loaded into the database will be inserted into the one staging table.  The status of the processing of each data file line is tracked in the staging table.  The staging table has a row level pre-insert and post-insert trigger.

The row level pre-insert trigger handles assigning a unique job ID to the data load that is occurring.  This trigger fires for each row inserted into the staging table.  If the record the trigger is firing for is the first line in the data file, the trigger determines the job ID, assigns it to the record, and saves the job ID in a variable in the common load package.  If the record the trigger is firing for is not the first line in the data file, the trigger assigns the value of the job ID in the common load package to the row in the staging table.  After the pre-record row insert trigger has completed executing the post-insert trigger fires once.

The post-insert trigger handles the same functionality as the post-insert trigger in loading data with multiple staging tables. When loading with a single staging table, it additionally will check to see if it is processing the last line of a data load file.  If it is, the post-insert trigger will call a procedure in the data interface package to create the load report.  This load report is stored in a table in the database.  The post-insert trigger queries the database for the name of the procedure that loads the staging table record.  The trigger will invoke that procedure to load the record into the destination tables.

Each data load has its own data interface package. The data interface package is a collection of procedures.  One creates the load report. Another procedure parses the data out of the staging table record, does needed data processing on the record, and calls a DML procedure in the common load package to save the information to the database.

The common load package is a group of DML procedures stored in the database.  If multiple data loads were inserting employee information, all the data interface packages would call the same common DML procedure for inserting employee data.  This makes for more modular code.

The chart below shows the general flow of data loading a single staging table.


Data Load Tracking tables for Single Staging Table Loading

In addition to the single staging table, there are common tables which track information about the data loads.  Information is tracked at a data load run level and at a data file line level.  The figure below shows the common tables used to track data loads.


Data_Load_Interfaces

This table contains a list of data interfaces.  The key field is the interface name.  The load program field contains the name of the data interface package procedure that should be called to process the data records for this data load.  This is a powerful field.  With the insertion of a new record in the table, the post-insert trigger will have a new procedure in a data interface package defined that it can call to load data.  This can be accomplished by either changing the trigger to recognize and call the new interface or by using fairly simple dynamic PL/SQL.  In our case we chose to change the trigger every time we added a new interface and not use the dynamic PL/SQL as we found that the use of dynamic PL/SQL had an unacceptable impact on the run time.  The example included in this white contains the dynamic PL/SQL.

Data_Loads

This table stores information about each data load that occurs.  Each data interface occurs multiple times.  Each occurrence is assigned a job ID.  Each data interface has its own rules for determining the value of the job ID.  This table contains various high-level statistics that apply across the whole data load.

Data_Load_Inbox

This table is the single staging table.  All data for loading will get stored here.  This table should be laid out to handle all types of data, from character to binary, that will be loaded into the database.  The table as it is laid out above can only support the loading of character-based files.

Data_Load_Records

This table contains a snapshot of the data file line that had a message or error.  This information is used in generating the data load report.  This information may or may not be retained after the data load report is generated.  In determining whether to keep these records, consider the following issues.  Is the data really needed?  Will the user be correcting the data record in the database or in the file?  If the user will correct the data in a file, the database record may not needed.  Consider the volume of records that will be retained.  A possible option is to identify a retention period for purging old records in this table.

Data_Load_Message_Codes

This table contains a list of message codes with error type, error or informational, and a clear text description of its exact meaning.

Data_Load_Messages

This table contains the messages and errors that exist for a specific run of a data load.  Along with what data load ran and messages occurred, this table has the data value causing the message, what area of the database the data was going to be stored in (table and column), whether the data was stored in the database, and any additional information about the message.  This information is tracked at a line by line basis in the data file.

Data_Load_Report

This table contains the load reports for the data loads.  Each record in this table is a line of the report.  To list out a load report for a specific data load, locate all the records based on the interface name and job ID and list the records in the order of the line number.

Implementing a New Data Interface

After the single staging and load tracking tables, the staging table triggers, the common load packages are created, what is needed to implement a get a new data interface running?  This is where this design is powerful.

1.       Create an Operating System Job or Event

2.       Create a new record in the Data_Load_Interfaces table defining the new data load

3.       Create SQL*Loader Control File

4.       Create Data Interface Package

5.       Add logic to the row level pre-insert trigger for creating a job ID

6.       Create new DML procedures in the common load package if needed

The SQL*Loader control file would look like the following:

LOAD DATA

INFILE Data_File_Name

APPEND

INTO TABLE data_load_inbox

TRAILING NULLCOLS

(

 interface_name  CONSTANT Interface_Name,

 line_no         RECNUM,

 line_text       POSITION(1:Record_Length),

 status          CONSTANT NEW

)

Data_File_Name is the name of the data file.

Interface_Name is the name of the data interface.

Record_Length is the length of the longer line in the data file.

This SQL*Loader control file loads every record into the staging table (DATA_LOAD_INBOX).  It gives each record a value identifying the data interface.  It also assigns a line number to record in the data file.  By default each record is marked with a status of 'NEW''.

Future Enhancements

Load Record Set Support

Currently the procedure in the data interface package that processes the data file record is designed to receive one record from the data file.  It would be nice to be able to send it a set of records.  This would reduce the number of times that this packaged procedure is invoked, improving performance.

Java Implementation

Future releases of Oracle8 will support stored packages, procedures, and functions coded in Java.

Oracle IFS

Oracle's Internet File System (Oracle 8.1) can store operating system files in the Oracle database.  The creation of a file in a particular directory could fire off an Oracle job to load that data file.  The reading of the file could be developed in a PL/SQL stored procedure eliminating the use of SQL*Loader.

XML Support

Adding an XML parser to the common load package would provide the ability of receiving data in and XML format.

Multiple Data File Lines per Logical Record

The current data interfaces support one data file line per record loaded in the database.  Some interfaces may need to support the ability to have logical records span multiple physical lines in the data file.  This may mean that logic will need to be added to the post-insert trigger on the staging table that would assemble multiple data file lines into one record and pass it to the procedure in the data interface package that loads the data into the destination tables.  Of course with the multiple table approach this is issue can be handled in the SQL*Loader control file.

 

Data Mapping Stored in Database Tables

The parsing of a data file line into fields and data mapping of those fields to data base tables and fields are coded in the procedure in the data interface package.  It would be nice to store this information in database tables and dynamically generate INSERT, UPDATE, and DELETE DML statements to be executed against the database.   Care needs to be taken here since making this enhancement will impact the speed at which data is loaded.

Multiple Simultaneous Data Loads

It appears that the single staging table design supports the ability to do load more different data file types (interfaces) at the same time.  This has not been tested.  If this design supported this, then scheduling a data load to run would not be depend on when other data loads run.

Multiple Threaded Data Load

This is the ability to take a single data file and split it into multiple simultaneous data loads.  This is assuming that the order in which the records are loaded into the database is not important.

Employee Data Load Example

The example included in this white paper is an implementation of data loading with a single staging table.  The example loads data into Oracle's EMP and DEPT tables.  The data file contains employee information along with the department number and department name of the employee.  The result of loading this data will be information in the EMP and DEPT tables along with a load report.  The example is pretty basic, but demonstrates how one line in a file can result in records in multiple tables in the database.  It also illustrates informational and error trapping and messaging.

This example can be use as a stub for a first implementation.  A developer can take the example and replace the data interface package with a package customized to implement the load of any data file and the associated business rules.

Here are the file names and their content.  These files should be available from the IOUG-A homepage.  If not, e-mail Mike (Rife@Compuserve.com) and he will send them to you.

File Name

Contents

421ins.sql

SQL*Plus script to install the example data interface

421ex.sql

SQL*Plus script to create the example DEPT and EMP tables

421dbg.sql

SQL*Plus script to create a debugging messaging procedure

421tab.sql

SQL*Plus script to create the data interface tracking tables

421clpk.sql

SQL*Plus script to create the common load package

421dipk.sql

SQL*Plus script to create the data interface package

421trig.sql

SQL*Plus script to create the staging table triggers

421test.sql

SQL*Plus script to run a test load of the example data interface

421ldr.bat

Windows NT batch file to start SQL*Loader

421ex.dat

Employee data file to load

421ex.ctl

SQL*Loader control file

 

Example Load Execution

The following messages were put into the data load package to list out what was read from the data file.  Please note that line 1 is missing an employee number and lines 1, 12, and 15 have invalid hire dates.  These are  logged as errors.  Lines 1 and 10 do not have a manager's employee number.  These are logged as informational messages.

EMP_LOAD   1: Line   1:           DOE        SALESMAN            24/06/1982

EMP_LOAD   2: Line   2:      7369 SMITH      CLERK          7902 12/17/1980       800                  20 RESEARCH       DALLAS

EMP_LOAD   3: Line   3:      7499 ALLEN      SALESMAN       7698 02/20/1981      1600       300        30 SALES          CHICAGO

EMP_LOAD   4: Line   4:      7521 WARD       SALESMAN       7698 02/22/1981      1250       500        30 SALES          CHICAGO

EMP_LOAD   5: Line   5:      7566 JONES      MANAGER        7839 04/02/1981      2975                  20 RESEARCH       DALLAS

EMP_LOAD   6: Line   6:      7654 MARTIN     SALESMAN       7698 09/28/1981      1250      1400        30 SALES          CHICAGO

EMP_LOAD   7: Line   7:      7698 BLAKE      MANAGER        7839 05/01/1981      2850                  30 SALES          CHICAGO

EMP_LOAD   8: Line   8:      7782 CLARK      MANAGER        7839 06/09/1981      2450                  10 ACCOUNTING     NEW YORK

EMP_LOAD   9: Line   9:      7788 SCOTT      ANALYST        7566 04/19/1987      3000                  20 RESEARCH       DALLAS

EMP_LOAD  10: Line  10:      7839 KING       PRESIDENT           11/17/1981      5000                  10 ACCOUNTING     NEW YORK

EMP_LOAD  11: Line  11:      7844 TURNER     SALESMAN       7698 09/08/1981      1500         0        30 SALES          CHICAGO

EMP_LOAD  12: Line  12:      7876 ADAMS      CLERK          7788 23/05/1987      1100                  20 RESEARCH       DALLAS

EMP_LOAD  13: Line  13:      7900 JAMES      CLERK          7698 12/03/1981       950                  30 SALES          CHICAGO

EMP_LOAD  14: Line  14:      7902 FORD       ANALYST        7566 12/03/1981      3000                  20 RESEARCH       DALLAS

EMP_LOAD  15: Line  15:      7934 MILLER     CLERK          7782 23/01/1982      1300                  10 ACCOUNTING     NEW YORK

EMP_LOAD  16: Line  16: EOF

 

16 rows selected.

 

The following are records in the DEPT and EMP tables.  The loading of the example data file created these records.

   DEPTNO DNAME          LOC

--------- -------------- -------------

       30 SALES          CHICAGO

       10 ACCOUNTING     NEW YORK

       20 RESEARCH       DALLAS

 

 

    EMPNO ENAME      JOB             MGR HIREDATE        SAL      COMM    DEPTNO

--------- ---------- --------- --------- --------- --------- --------- ---------

     7499 ALLEN      SALESMAN       7698 20-FEB-81      1600       300        30

     7521 WARD       SALESMAN       7698 22-FEB-81      1250       500        30

     7566 JONES      MANAGER        7839 02-APR-81      2975                  20

     7654 MARTIN     SALESMAN       7698 28-SEP-81      1250      1400        30

     7698 BLAKE      MANAGER        7839 01-MAY-81      2850                  30

     7782 CLARK      MANAGER        7839 09-JUN-81      2450                  10

     7788 SCOTT      ANALYST        7566 19-APR-87      3000                  20

     7839 KING       PRESIDENT           17-NOV-81      5000                  10

     7844 TURNER     SALESMAN       7698 08-SEP-81      1500         0        30

     7876 ADAMS      CLERK          7788                1100                  20

     7900 JAMES      CLERK          7698 03-DEC-81       950                  30

     7902 FORD       ANALYST        7566 03-DEC-81      3000                  20

     7934 MILLER     CLERK          7782                1300                  10

     7369 SMITH      CLERK          7902 17-DEC-80       800                  20

 

14 rows selected.

Below is the SQL*Loader log file.

SQL*Loader: Release 8.0.5.0.0 - Production on Mon Feb 14 15:43:3 2000

(c) Copyright 1998 Oracle Corporation.  All rights reserved.

 

Control File:   421ex.ctl

Data File:      421ex.dat

  Bad File:     421ex.bad

  Discard File:  none specified

 

 (Allow all discards)

 

Number to load: ALL

Number to skip: 0

Errors allowed: 50

Bind array:     64 rows, maximum of 65536 bytes

Continuation:    none specified

Path used:      Conventional

 

Table DATA_LOAD_INBOX, loaded from every logical record.

Insert option in effect for this table: APPEND

TRAILING NULLCOLS option in effect

 

   Column Name                  Position   Len  Term Encl Datatype

------------------------------ ---------- ----- ---- ---- ---------------------

LINE_TEXT                           1:200   200           CHARACTER           

 

INTERFACE_NAME                 CONSTANT 'EMP_LOAD'

STATUS                         CONSTANT 'NEW'

LINE_NO                        RECNUM

 

 

Table DATA_LOAD_INBOX:

  16 Rows successfully loaded.

  0 Rows not loaded due to data errors.

  0 Rows not loaded because all WHEN clauses were failed.

  0 Rows not loaded because all fields were null.

 

 

Space allocated for bind array:                  65450 bytes(275 rows)

Space allocated for memory besides bind array:        0 bytes

 

Total logical records skipped:          0

Total logical records read:            16

Total logical records rejected:         0

Total logical records discarded:        0

 

Run began on Mon Feb 14 15:43:03 2000

Run ended on Mon Feb 14 15:43:07 2000

 

Elapsed time was:     00:00:03.53

CPU time was:         00:00:00.05

 

The following is the data load report.  If the user wanted to know which lines of data had these errors, the report generation procedure in the data interface package could be modified to also list out the line number on which the error occurred.

Interface:                    EMP_LOAD

Job ID:                       20000214154306

Load Started:                 14-FEB-2000 15:43:06

Load Finished:                14-FEB-2000 15:43:07

Load Status:                  COMPLETE

Lines in file:                000,000,016

Data records in file:         000,000,015

Lines with errors:            000,000,003

Total error count:            000,000,004

Lines with messages:          000,000,002

Total message count:          000,000,002

Dept Inserts:                 000,000,003

Dept Updates:                 000,000,011

Emp Inserts:                  000,000,014

Emp Updates:                  000,000,000

 

Message                                  Occurrences       %

---------------------------------------- ----------- -------

ERROR:Missing Employee Number                      1    6.67

     1.  '         '                               1    6.67

ERROR:Unable to load hire date                     3   20.00

     1.  '23/01/1982'                              1    6.67

     2.  '23/05/1987'                              1    6.67

     3.  '24/06/1982'                              1    6.67

INFO:Missing Manager Employee Number               2   13.33

     1.  '         '                               2   13.33

 

26 rows selected.

 

Conclusion

The appropriate use of staging tables, triggers, and packages can be used to replace most, if not all, 3GL load programs.  The authors have not seen a load that could not be replaced with the right SQL*Loader algorithm. The appropriate choice of  SQL*Loader technology is not only easier to develop and maintain, but can also run faster.