Replacing 3GL Data Loading with SQL*Loader and Database Triggers
Michael A. Rife Rumpi Gravenstein
Goodyear Tire & Rubber Company Rumken, Inc.
Gerhardt Martin Jen Croy
Goodyear Tire & Rubber Company Rumken, Inc.
Introduction
This paper describes a design to replace 3GL data load programs with SQL*Loader and database triggers while supporting tracking of load errors. In presenting this design the paper discusses why to use SQL*Loader, SQL*Loader datatypes, typical SQL*Loader use, two SQL*Loader approaches, future enhancements, and am example of using on the SQL*Loader approaches.
This design was implemented in an Oracle8 for Solaris environment. All the techniques used are compatible with Oracle7, but have not been tested in Oracle7.
Why Use SQL*Loader
In a 3GL loader program the business rules for loading data are usually coded in the 3GL application. Using SQL*Loader, database triggers, and database packages moves the load logic and business rules into the database and out of the load program. Since the logic is stored in the database, any logic that is common across data loads can be shared. The resulting environment has less code and is easier to maintain.
3GL loader programs also tend to be more difficult to develop, since both 3GL and Oracle skills are required. It is easier to find a developer with Oracle skills than to find a developer with expertise in both skill sets. In addition, a developer skilled in both areas would likely command a higher salary. Since PL/SQL tends to be a higher level language than 3GL languages, the use of a pure PL/SQL approach should save development time. The developer would be working at a higher level and does not need to be concerned with the level of minutia that need to be handled by a 3GL program.
If that is not enough, we believe that 3GL loader programs are more difficult to maintain. Operating system, Oracle database, and 3GL language upgrades require re-linking and possible recompilation of 3GL programs while PL/SQL based programs need very little attention during migrations as they migrate with the database. An added issue here is that 3GL programs, and the people that support them can not always move from one environment (say MVS) to another (say Solaris). One platform's implementation of a language may have different libraries from another's implementation. With the business rules coded in PL/SQL database packages and utilizing SQL*Loader, these packages and the people that support them will be portable across operating system, Oracle database, and PL/SQL language migrations and upgrades.
SQL*Loader Datatypes
SQL*Loader can handle a large number of data types. The table below shows the data type in the data file and what Oracle data type SQL*Loader could convert that data type to.
|
Data Type in File |
Oracle Data Type |
|
Char |
Any type |
|
Integer |
Number |
|
Small integer |
Number |
|
Float |
Number |
|
Double Precision |
Number |
|
Byte Integer |
Number |
|
Packed Decimal |
Number |
|
Data Type in File |
Oracle Data Type |
|
Zoned |
Number |
|
Varchar |
Varchar2 |
|
Graphics |
Raw/long |
|
Graphic External |
Raw/long |
|
VarGraphic |
Raw/long |
|
Raw |
Raw |
|
|
|
The typical manner in which SQL*Loader is used is made up of three main components. There is the SQL*Loader process, staging tables, and post insert triggers on the staging tables. A description of each of these elements follows.
The SQL*Loader process is an operating specific program that executes at a specific time or is caused to execute by an operating system event like the receipt of a data file. This OS program can do preprocessing of the data file and invokes SQL*Loader to load the data file into a staging table. Once SQL*Loader is finished loading the data file into the database, the OS program creates and distributes a load report. Lastly the OS program handles any required data file archiving issues.
As SQL*Loader loads data into the staging table, it will save a set of records at a time. The set size, transaction set, is defined by the number of rows SQL*Loader loads during each commit. A transaction set then can be one or more lines from the data file being loaded, the size of which is normally determined while tuning the load process. The data loaded into the staging table resides only temporarily in this table as it is moved almost immediately after it is loaded. The staging table is normally defined to mirror the layout of the data file. The mirroring here includes mirroring the column data types. That is if you are loading an ASCII file, all the columns in the staging table should be defined as VARCHAR2 or CHAR. Similarly if you are loading data from a binary data file, the columns in the staging table should match exactly the definitions of the structure in the data file. This is done so that load errors do not occur while SQL*Loader is loading data into the table, but rather later while the data is being moved from the staging table. Handling errors later allows the developer more control (with PL/SQL code) in how he wants to deal with these data errors.
Once a transaction set is inserted into the staging table a post-insert trigger fires. It processes the transaction set of records that was just inserted into the staging table. The trigger moves the data from the staging table into the destination tables while handling all data errors. The logic in this post-insert trigger is:
1. Mark this transaction set’s records in the staging table as unprocessed
2. Get a set of unprocessed records out of the staging table
3. If records were fetched, then process and save the data to the destination table(s)
4. If no records were fetched, then end processing
5. Delete or mark fetched records as processed
6. Go back to step 2
The key to the logic is the fact that the database trigger on the staging table contains the DML statements for saving the records to the destination tables. The trigger deletes the data out of the staging table, so the data in the staging table is only available for this SQL*Loader's Oracle session. Given this load processing approach, the data that is temporarily stored in the staging table is not accessible to any other Oracle sessions.
The following chart shows the general flow in a typical SQL*Loader design.

SQL*Loader Approaches
There are two approaches to using SQL*Loader. There are two major differences between the two approaches. The first is obviously the number of staging tables and table triggers that are used. One approach has multiple staging tables while the other utilizes a single generic staging table. The other difference is when using multiple staging tables the SQL*Loader process does the parsing of the data file line into fields. When using a single staging table, parsing the data file line is performed by the post-insert trigger or a package called by this trigger.
Data Loading with Multiple Staging Tables
Data loading with multiple staging tables is composed of the SQL*Loader process, multiple staging tables, post-insert triggers on the staging tables, a staging table specific data package, and a common load package. A description of each of these elements follows.
The SQL*Loader process is identical to the one described in the Typical SQL*Loader Use section above. The SQL*Loader process handles generating the load report as well as parsing the data file's lines into fields and saving those fields in a staging table.
Since there is one staging table per data file, there will be many staging tables if there are many data file types to load. The layout of each staging table mirrors the layout of the file being loaded. Each field in the data file will have a matching field in the staging table. There may be some additional bookkeeping fields for tracking the data file's load status, recording load order, time stamping the processing of the record, as well as capturing any load messages or errors that occur.
Once a set of records has been inserted into a staging table by the SQL*Loader process, the post-insert trigger on that staging table executes once for a set of records. The logic in this post-insert trigger is:
1. Mark this transaction set’s records in the staging table as unprocessed
2. Get all the unprocessed records out of the staging table
3. For each unprocessed record fetched call the data interface package to process and save the data in the destination table(s)
4. Delete or mark unprocessed records as processed
The data interface package procedure loads the data in that specific staging table. The data interface package procedure can insert the data directly into the destination tables or can utilize shared DML procedures in the database for inserting, updating, and deleting data in the destination tables. These shared procedures are stored in a general use database package.
The chart below gives the general flow of data loading using multiple staging tables.

Data Loading with a Single Staging Table
Data loading with a single staging table is composed of the SQL*Loader process, one staging table, two table triggers on the staging table, a data interface package, a common load package, and data load tracking tables. A description of each of these elements follows.
The SQL*Loader process is similar to the one described in the Data Loading with Multiple Staging Tables section above. The SQL*Loader process does not generate the load report, but still does the distribution of the load report. With a single staging table the SQL*Loader process does not parse the data file's lines into fields. It saves the record directly in the staging table. If the data file's lines were made of a mixture of binary and character datatypes, then the SQL*Loader process would parse like datatypes into one field in the staging table. For example, if the lines in the data file had 100 characters, composed of different fields, at the beginning of each line followed by 1000 bytes of binary data, the staging table would have a character field as well as a raw field. The SQL*Loader process would put the 100 characters into one character field in the staging table and the 1000 bytes of data into a raw field in the staging table. If a data file had no binary data, nothing would get loaded into the raw field in the staging table.
There is one staging table. It is laid out generically to be able to handle all datatypes in all data files to be loaded. All data files being loaded into the database will be inserted into the one staging table. The status of the processing of each data file line is tracked in the staging table. The staging table has a row level pre-insert and post-insert trigger.
The row level pre-insert trigger handles assigning a unique job ID to the data load that is occurring. This trigger fires for each row inserted into the staging table. If the record the trigger is firing for is the first line in the data file, the trigger determines the job ID, assigns it to the record, and saves the job ID in a variable in the common load package. If the record the trigger is firing for is not the first line in the data file, the trigger assigns the value of the job ID in the common load package to the row in the staging table. After the pre-record row insert trigger has completed executing the post-insert trigger fires once.
The post-insert trigger handles the same functionality as the post-insert trigger in loading data with multiple staging tables. When loading with a single staging table, it additionally will check to see if it is processing the last line of a data load file. If it is, the post-insert trigger will call a procedure in the data interface package to create the load report. This load report is stored in a table in the database. The post-insert trigger queries the database for the name of the procedure that loads the staging table record. The trigger will invoke that procedure to load the record into the destination tables.
Each data load has its own data interface package. The data interface package is a collection of procedures. One creates the load report. Another procedure parses the data out of the staging table record, does needed data processing on the record, and calls a DML procedure in the common load package to save the information to the database.
The common load package is a group of DML procedures stored in the database. If multiple data loads were inserting employee information, all the data interface packages would call the same common DML procedure for inserting employee data. This makes for more modular code.
The chart below shows the general flow of data loading a single staging table.

Data Load Tracking tables for Single Staging Table Loading
In addition to the single staging table, there are common tables which track information about the data loads. Information is tracked at a data load run level and at a data file line level. The figure below shows the common tables used to track data loads.

Data_Load_Interfaces
This table contains a list of data interfaces. The key field is the interface name. The load program field contains the name of the data interface package procedure that should be called to process the data records for this data load. This is a powerful field. With the insertion of a new record in the table, the post-insert trigger will have a new procedure in a data interface package defined that it can call to load data. This can be accomplished by either changing the trigger to recognize and call the new interface or by using fairly simple dynamic PL/SQL. In our case we chose to change the trigger every time we added a new interface and not use the dynamic PL/SQL as we found that the use of dynamic PL/SQL had an unacceptable impact on the run time. The example included in this white contains the dynamic PL/SQL.
Data_Loads
This table stores information about each data load that occurs. Each data interface occurs multiple times. Each occurrence is assigned a job ID. Each data interface has its own rules for determining the value of the job ID. This table contains various high-level statistics that apply across the whole data load.
Data_Load_Inbox
This table is the single staging table. All data for loading will get stored here. This table should be laid out to handle all types of data, from character to binary, that will be loaded into the database. The table as it is laid out above can only support the loading of character-based files.
Data_Load_Records
This table contains a snapshot of the data file line that had a message or error. This information is used in generating the data load report. This information may or may not be retained after the data load report is generated. In determining whether to keep these records, consider the following issues. Is the data really needed? Will the user be correcting the data record in the database or in the file? If the user will correct the data in a file, the database record may not needed. Consider the volume of records that will be retained. A possible option is to identify a retention period for purging old records in this table.
Data_Load_Message_Codes
This table contains a list of message codes with error type, error or informational, and a clear text description of its exact meaning.
Data_Load_Messages
This table contains the messages and errors that exist for a specific run of a data load. Along with what data load ran and messages occurred, this table has the data value causing the message, what area of the database the data was going to be stored in (table and column), whether the data was stored in the database, and any additional information about the message. This information is tracked at a line by line basis in the data file.
Data_Load_Report
This table contains the load reports for the data loads. Each record in this table is a line of the report. To list out a load report for a specific data load, locate all the records based on the interface name and job ID and list the records in the order of the line number.
Implementing a New Data Interface
After the single staging and load tracking tables, the staging table triggers, the common load packages are created, what is needed to implement a get a new data interface running? This is where this design is powerful.
1. Create an Operating System Job or Event
2. Create a new record in the Data_Load_Interfaces table defining the new data load
3. Create SQL*Loader Control File
4. Create Data Interface Package
5. Add logic to the row level pre-insert trigger for creating a job ID
6. Create new DML procedures in the common load package if needed
The SQL*Loader control file would look like the following:
LOAD DATA
INFILE Data_File_Name
APPEND
INTO TABLE data_load_inbox
TRAILING NULLCOLS
(
interface_name CONSTANT Interface_Name,
line_no RECNUM,
line_text
POSITION(1:Record_Length),
status CONSTANT
NEW
)
Data_File_Name is the name of the data file.
Interface_Name is the name of the data interface.
Record_Length is the length of the longer line in the data file.
This SQL*Loader control file loads every record into the staging table (DATA_LOAD_INBOX). It gives each record a value identifying the data interface. It also assigns a line number to record in the data file. By default each record is marked with a status of 'NEW''.
Future Enhancements
Load Record Set Support
Currently the procedure in the data interface package that processes the data file record is designed to receive one record from the data file. It would be nice to be able to send it a set of records. This would reduce the number of times that this packaged procedure is invoked, improving performance.
Java Implementation
Future releases of Oracle8 will support stored packages, procedures, and functions coded in Java.
Oracle IFS
Oracle's Internet File System (Oracle 8.1) can store operating system files in the Oracle database. The creation of a file in a particular directory could fire off an Oracle job to load that data file. The reading of the file could be developed in a PL/SQL stored procedure eliminating the use of SQL*Loader.
XML Support
Adding an XML parser to the common load package would provide the ability of receiving data in and XML format.
Multiple Data File Lines per Logical Record
The current data interfaces support one data file line per record loaded in the database. Some interfaces may need to support the ability to have logical records span multiple physical lines in the data file. This may mean that logic will need to be added to the post-insert trigger on the staging table that would assemble multiple data file lines into one record and pass it to the procedure in the data interface package that loads the data into the destination tables. Of course with the multiple table approach this is issue can be handled in the SQL*Loader control file.
Data Mapping Stored in Database Tables
The parsing of a data file line into fields and data mapping of those fields to data base tables and fields are coded in the procedure in the data interface package. It would be nice to store this information in database tables and dynamically generate INSERT, UPDATE, and DELETE DML statements to be executed against the database. Care needs to be taken here since making this enhancement will impact the speed at which data is loaded.
Multiple Simultaneous Data Loads
It appears that the single staging table design supports the ability to do load more different data file types (interfaces) at the same time. This has not been tested. If this design supported this, then scheduling a data load to run would not be depend on when other data loads run.
Multiple Threaded Data Load
This is the ability to take a single data file and split it into multiple simultaneous data loads. This is assuming that the order in which the records are loaded into the database is not important.
Employee Data Load Example
The example included in this white paper is an implementation of data loading with a single staging table. The example loads data into Oracle's EMP and DEPT tables. The data file contains employee information along with the department number and department name of the employee. The result of loading this data will be information in the EMP and DEPT tables along with a load report. The example is pretty basic, but demonstrates how one line in a file can result in records in multiple tables in the database. It also illustrates informational and error trapping and messaging.
This example can be use as a stub for a first implementation. A developer can take the example and replace the data interface package with a package customized to implement the load of any data file and the associated business rules.
Here are the file names and their content. These files should be available from the IOUG-A homepage. If not, e-mail Mike (Rife@Compuserve.com) and he will send them to you.
|
File Name |
Contents |
|
421ins.sql |
SQL*Plus script to install the example data interface |
|
421ex.sql |
SQL*Plus script to create the example DEPT and EMP tables |
|
421dbg.sql |
SQL*Plus script to create a debugging messaging procedure |
|
421tab.sql |
SQL*Plus script to create the data interface tracking tables |
|
421clpk.sql |
SQL*Plus script to create the common load package |
|
421dipk.sql |
SQL*Plus script to create the data interface package |
|
421trig.sql |
SQL*Plus script to create the staging table triggers |
|
421test.sql |
SQL*Plus script to run a test load of the example data interface |
|
421ldr.bat |
Windows NT batch file to start SQL*Loader |
|
421ex.dat |
Employee data file to load |
|
421ex.ctl |
SQL*Loader control file |
Example Load Execution
The following messages were put into the data load package to list out what was read from the data file. Please note that line 1 is missing an employee number and lines 1, 12, and 15 have invalid hire dates. These are logged as errors. Lines 1 and 10 do not have a manager's employee number. These are logged as informational messages.
EMP_LOAD 1: Line
1: DOE SALESMAN 24/06/1982
EMP_LOAD 2: Line
2: 7369 SMITH CLERK 7902 12/17/1980
800 20
RESEARCH DALLAS
EMP_LOAD 3: Line
3: 7499 ALLEN SALESMAN 7698 02/20/1981
1600 300 30 SALES CHICAGO
EMP_LOAD 4: Line
4: 7521 WARD SALESMAN 7698 02/22/1981
1250 500 30 SALES CHICAGO
EMP_LOAD 5: Line
5: 7566 JONES
MANAGER 7839
04/02/1981 2975 20 RESEARCH DALLAS
EMP_LOAD 6: Line
6: 7654 MARTIN SALESMAN 7698 09/28/1981
1250 1400 30 SALES CHICAGO
EMP_LOAD 7: Line
7: 7698 BLAKE
MANAGER 7839
05/01/1981 2850 30 SALES CHICAGO
EMP_LOAD 8: Line
8: 7782 CLARK MANAGER 7839 06/09/1981
2450 10
ACCOUNTING NEW YORK
EMP_LOAD 9: Line
9: 7788 SCOTT ANALYST 7566 04/19/1987
3000 20
RESEARCH DALLAS
EMP_LOAD 10: Line
10: 7839 KING PRESIDENT 11/17/1981
5000 10
ACCOUNTING NEW YORK
EMP_LOAD 11: Line
11: 7844 TURNER SALESMAN 7698 09/08/1981
1500 0 30 SALES CHICAGO
EMP_LOAD 12: Line
12: 7876 ADAMS CLERK 7788 23/05/1987
1100 20
RESEARCH DALLAS
EMP_LOAD 13: Line
13: 7900 JAMES CLERK 7698 12/03/1981
950 30
SALES CHICAGO
EMP_LOAD 14: Line
14: 7902 FORD ANALYST 7566 12/03/1981
3000 20
RESEARCH DALLAS
EMP_LOAD 15: Line 15: 7934 MILLER CLERK 7782 23/01/1982
1300 10
ACCOUNTING NEW YORK
EMP_LOAD 16: Line
16: EOF
16 rows selected.
The
following are records in the DEPT and EMP tables. The loading of the example data file created these records.
DEPTNO DNAME LOC
--------- -------------- -------------
30 SALES CHICAGO
10 ACCOUNTING NEW YORK
20 RESEARCH DALLAS
EMPNO ENAME JOB MGR HIREDATE SAL COMM DEPTNO
--------- ---------- ---------
--------- --------- --------- --------- ---------
7499 ALLEN SALESMAN 7698 20-FEB-81 1600 300 30
7521 WARD SALESMAN 7698 22-FEB-81 1250 500 30
7566 JONES MANAGER 7839 02-APR-81
2975 20
7654 MARTIN SALESMAN 7698 28-SEP-81 1250 1400 30
7698 BLAKE MANAGER 7839 01-MAY-81 2850 30
7782 CLARK MANAGER 7839 09-JUN-81 2450 10
7788 SCOTT ANALYST 7566 19-APR-87 3000 20
7839 KING PRESIDENT 17-NOV-81 5000
10
7844 TURNER SALESMAN 7698 08-SEP-81 1500 0 30
7876 ADAMS CLERK 7788 1100
20
7900 JAMES CLERK 7698 03-DEC-81 950 30
7902 FORD ANALYST 7566 03-DEC-81 3000 20
7934 MILLER CLERK 7782 1300
10
7369 SMITH CLERK 7902 17-DEC-80 800 20
14 rows selected.
Below is the SQL*Loader log file.
SQL*Loader: Release 8.0.5.0.0 -
Production on Mon Feb 14 15:43:3 2000
(c) Copyright 1998 Oracle
Corporation. All rights reserved.
Control File: 421ex.ctl
Data File: 421ex.dat
Bad File: 421ex.bad
Discard File: none specified
(Allow all discards)
Number to load: ALL
Number to skip: 0
Errors allowed: 50
Bind array: 64 rows, maximum of 65536 bytes
Continuation: none specified
Path used: Conventional
Table DATA_LOAD_INBOX,
loaded from every logical record.
Insert option in effect
for this table: APPEND
TRAILING NULLCOLS option
in effect
Column Name Position
Len Term Encl Datatype
------------------------------
---------- ----- ---- ---- ---------------------
LINE_TEXT 1:200
200 CHARACTER
INTERFACE_NAME CONSTANT 'EMP_LOAD'
STATUS CONSTANT 'NEW'
LINE_NO RECNUM
Table DATA_LOAD_INBOX:
16 Rows successfully loaded.
0 Rows not loaded due to data errors.
0 Rows not loaded because all WHEN clauses
were failed.
0 Rows not loaded because all fields were
null.
Space allocated for bind
array: 65450 bytes(275
rows)
Space allocated for
memory besides bind array: 0 bytes
Total logical records
skipped: 0
Total logical records
read: 16
Total logical records
rejected: 0
Total logical records
discarded: 0
Run began on Mon Feb 14
15:43:03 2000
Run ended on Mon Feb 14
15:43:07 2000
Elapsed time was: 00:00:03.53
CPU time was: 00:00:00.05
The following is the data load report. If the user wanted to know which lines of data had these errors, the report generation procedure in the data interface package could be modified to also list out the line number on which the error occurred.
Interface: EMP_LOAD
Job ID: 20000214154306
Load Started: 14-FEB-2000 15:43:06
Load Finished: 14-FEB-2000 15:43:07
Load Status: COMPLETE
Lines in file: 000,000,016
Data records in
file: 000,000,015
Lines with errors: 000,000,003
Total error count: 000,000,004
Lines with messages: 000,000,002
Total message count: 000,000,002
Dept Inserts: 000,000,003
Dept Updates: 000,000,011
Emp Inserts: 000,000,014
Emp Updates: 000,000,000
Message
Occurrences %
----------------------------------------
----------- -------
ERROR:Missing Employee
Number 1 6.67
1.
' ' 1 6.67
ERROR:Unable to load hire
date 3 20.00
1.
'23/01/1982' 1
6.67
2.
'23/05/1987' 1
6.67
3.
'24/06/1982' 1
6.67
INFO:Missing Manager
Employee Number 2 13.33
1.
' ' 2
13.33
26 rows selected.
Conclusion
The appropriate use of staging tables, triggers, and packages can be used to replace most, if not all, 3GL load programs. The authors have not seen a load that could not be replaced with the right SQL*Loader algorithm. The appropriate choice of SQL*Loader technology is not only easier to develop and maintain, but can also run faster.