Friday, August 7, 2009

Using a Randomizer to Split EIM Batches

EIM runs best when it imports data in batches of 5,000 to 10,000 records. There are potentially many techniques for dividing a large number of records into batches of this size. Here is my favorite:

Imagine the EIM_CONTACT table has 300,000 records that you want to import, but all of them contain the number 1 in the IF_ROW_BATCH_NUM column. Ideally, you would prefer to load the data in 60 batches of 5,000 records each, numbered 101 - 160. Use the following SESSION SQL step in the process to split the batches:

[SPLIT BATCHES]
TYPE = IMPORT
BATCH = 0
TABLE = EIM_CONTACT
SESSION SQL = "UPDATE SIEBEL.EIM_CONTACT SET IF_ROW_BATCH_NUM = floor(dbms_random.value(101, 160)) WHERE IF_ROW_BATCH_NUM = 1

The above EIM step is separated from all other functionality for clarity, but your own process will probably contain some differences. For example, the SESSION SQL could be part of your EIM IMPORT step, or the UPDATE statement could contain more columns.

Part of the guarantee of a randomizer is that generated numbers will spread equally across the available range, so you can assume that your data will be distributed evenly across the 60 batches.

Please notice that I'm assuming there are no records in batch number 0. Otherwise, they would be imported during this step. Also notice that I'm using functions specific to the Oracle database. Different databases have different functions for generating random numbers, but the technique is largely the same for any database you use.

4 comments:

Deadok said...

Useful advice. Thanks!

PeterToes said...

nice function but it is, of course, always faster to split into batch ranges in the insert so we have even less data transformation.

Jim Uicker said...

Agreed.

I used Session SQL in my example because I thought it would be easier to understand, but the function works just as well in an Insert statement.

The Session SQL example is really an artificial one. I generally only use a step like this in a process where I am running a batch update against existing Siebel data. I begin with an EXPORT step, then I use a Session SQL step to make the updates to the exported data, including populating the BU columns and splitting into batches. So the actual query would be more complex than this.

Anonymous said...

I use Bulk Collect to read from Staging table. During Bulk Collect select statement I also select rownum (oracle row number).

Now in my insert statement I use the below function :

FLOOR((V_ROWNUM(indx)-1)/5000) + V_BATCH_NUM

Ex :
Initial value V_BATCH_NUM=100
V_ROWNUM=1,2,3,4,5,6....

After 5000 records batch number will be incremented by 1.
:)
Hope it helps