Wednesday, August 11, 2021

Finding the fastest way to copy data from one table to another

This post is based on the presentation Thomas Leo Swint made to the Central Texas IBM i User Group last night. Thomas performed a series of tests looking for the fastest methods to "read" several million rows of data, and to "read" the data from one table and output it to another table. When he shared his results with me, I was surprised by what he had discovered. I asked him to make a presentation to the CTXiUG, and for his permission to share his results here.

Thomas needed to copy millions of rows of data from one set of SQL tables to another, and this process needed to take the least amount of time possible. He created four test tables with the character columns defined in different ways. He compared the time taken to "copy" rows from these tables to another table using:

  1. CPYF, executed by SQL's QCMDEXC
  2. A SQL cursor with a single row fetch and a single row insert
  3. A SQL cursor with a multiple row fetch and a multiple row insert

But before I start showing the code of these programs, I want to show the four different tables. All of the code is his; all I have done is remove the extra lines that are not necessary for showing how this all works.

In his first table, which I am calling TCHAR, the columns are all defined as CHAR:

01  CREATE TABLE TLSLIB.TCHAR (
02   MSGID     CHAR(7)    CCSID 37 NOT NULL DEFAULT '' ,
03   FROMJOB   CHAR(28)   CCSID 37 NOT NULL DEFAULT '' ,
04   MSG_TEXT  CHAR(200)  CCSID 37 NOT NULL DEFAULT '' ,
05   MSG_TEXT2 CHAR(2000) CCSID 37 NOT NULL DEFAULT '' ,
06   MSG_TOKEN CHAR(500)  CCSID 37 NOT NULL DEFAULT '' )
07  RCDFMT TCHARR ;

The concern with large fixed length character columns is that the contents are usually much shorter than the columns themselves, which wastes DASD space. To overcome this waste of space he created another file, which I am calling TVARCHAR, where the columns are defined as variable length character, VARCHAR:

01  CREATE TABLE TLSLIB.TVARCHAR (
02   MSGID     VARCHAR(7)    CCSID 37 NOT NULL DEFAULT '' ,
03   FROMJOB   VARCHAR(28)   CCSID 37 NOT NULL DEFAULT '' ,
04   MSG_TEXT  VARCHAR(200)  CCSID 37 NOT NULL DEFAULT '' ,
05   MSG_TEXT2 VARCHAR(2000) CCSID 37 NOT NULL DEFAULT '' ,
06   MSG_TOKEN VARCHAR(500)  CCSID 37 NOT NULL DEFAULT '' )
07  RCDFMT TVARCHARR ;

There is a performance issue when using columns defined as VARCHAR: when data is fetched from them, two input/output operations can occur. This is nicely explained in an IBM Support document. I have copied the relevant section from that document below:

Q1: If I insert only 50, 30, or 80 bytes into a VARCHAR column (defined as 254 with ALLOCATE(0) ), will I incur two I/Os?

A1: Yes, it will be two I/Os because ALLOCATE(0) was specified. To avoid an extra I/O, use ALLOCATE(30) or (50). We suggest allocating enough so most of your input will fit and only the long ones will overflow the allocated part.

How much space do you allocate for each VARCHAR column?

Thomas found that the consensus was that the allocated space should be large enough to accommodate 80% of the data in that column, not 80% of the column's size. I am not going to show how the size of 80% of the data was calculated; that will be the subject of a future post.
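Purely as an illustration, one possible way to determine such a length with SQL is sketched below. This is my own example, not necessarily how Thomas calculated his allocation sizes; it uses the MSG_TEXT column of the TCHAR table shown above and returns the smallest length that covers roughly 80% of the values:

  --  Length of each trimmed MSG_TEXT value, numbered from shortest to longest
  WITH lens AS (
      SELECT LENGTH(RTRIM(MSG_TEXT)) AS len,
             ROW_NUMBER() OVER (ORDER BY LENGTH(RTRIM(MSG_TEXT))) AS rn
        FROM TLSLIB.TCHAR
  )
  --  The longest length found among the shortest 80% of the values
  SELECT MAX(len) AS allocate_length
    FROM lens
   WHERE rn <= 0.8 * (SELECT COUNT(*) FROM TLSLIB.TCHAR) ;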

The file with the 80% allocated columns, which I am calling TVAR80, looks like this:

01  CREATE TABLE TLSLIB.TVAR80 (
02   MSGID     VARCHAR(7)    ALLOCATE(7)   CCSID 37 NOT NULL DEFAULT '' ,
03   FROMJOB   VARCHAR(28)   ALLOCATE(20)  CCSID 37 NOT NULL DEFAULT '' ,
04   MSG_TEXT  VARCHAR(200)  ALLOCATE(69)  CCSID 37 NOT NULL DEFAULT '' ,
05   MSG_TEXT2 VARCHAR(2000) ALLOCATE(849) CCSID 37 NOT NULL DEFAULT '' ,
06   MSG_TOKEN VARCHAR(500)  ALLOCATE(107) CCSID 37 NOT NULL DEFAULT '' )
07  RCDFMT TVAR80R ;

The final table allocates 100% of each column's size. This was to see if the results would be the same as for the TCHAR table. This file I am going to call TVAR100:

01  CREATE TABLE TLSLIB.TVAR100 (
02   MSGID     VARCHAR(7)    ALLOCATE(7)    CCSID 37 NOT NULL DEFAULT '' ,
03   FROMJOB   VARCHAR(28)   ALLOCATE(28)   CCSID 37 NOT NULL DEFAULT '' ,
04   MSG_TEXT  VARCHAR(200)  ALLOCATE(200)  CCSID 37 NOT NULL DEFAULT '' ,
05   MSG_TEXT2 VARCHAR(2000) ALLOCATE(2000) CCSID 37 NOT NULL DEFAULT '' ,
06   MSG_TOKEN VARCHAR(500)  ALLOCATE(500)  CCSID 37 NOT NULL DEFAULT '' )
07  RCDFMT TVAR100R ;

Thomas performed three rounds of tests with different numbers of rows:

  1. 1 million rows
  2. 5 million rows
  3. 15 million rows

I am not going to show all the results from all of those tests. I am just going to show the results from his tests where he used 15 million rows.

As I mentioned above, he used three RPG programs, which I will show below. The code I am going to show is not the full programs, just what I want to call the "interesting" parts of them. I am also only showing one version of each program; there was a version for each of the four files.

The first RPG program used CPYF, executed by SQL's QCMDEXC:

01  dcl-s CopyCommand     char(250) ;
02  dcl-s StrTS           timestamp ;
03  dcl-s EndTS           timestamp ;

04  CopyCommand =  'CPYF FROMFILE(TLSLIB/TCHAR) ' +
05                   'TOFILE(QTEMP/OUTFILE) ' +
06                   'CRTFILE(*YES) ' ;

07  Exec SQL SET :StrTS = CURRENT_TIMESTAMP ;

08  Exec sql CALL QSYS2.QCMDEXC(:CopyCommand) ;

09  Exec SQL SET :EndTS = CURRENT_TIMESTAMP ;

10  Exsr $WritetoLogFile ;
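The $WritetoLogFile subroutine, used at the end of each program, is not shown in these extracts. As a guess at how the elapsed time could be derived from the two timestamps, here is a minimal sketch; the variable name ElapsedSeconds is mine, and this is an assumption rather than Thomas's code:

  dcl-s ElapsedSeconds  packed(20:6) ;

  //  %diff with *MSECONDS returns the difference in microseconds,
  //  dividing by 1,000,000 gives seconds with six decimal places,
  //  which matches the precision of the results shown later
  ElapsedSeconds = %diff(EndTS : StrTS : *MSECONDS) / 1000000 ;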

The next program uses a SQL cursor to perform a single row fetch to retrieve data from the table, and then uses a single row insert to add the data to the output table.

01  dcl-ds @DataIn  extname('TLSLIB/TCHAR') qualified ;
02  end-ds ;

03  dcl-ds @TestFileOut extname('TLSLIB/TCHAROUT') qualified ;
04  end-ds ;

05  dcl-s StrTS           timestamp ;
06  dcl-s EndTS           timestamp ;
07  dcl-s @SqlSuccessful  ind ;

08  Exec SQL Declare TestFile_Cursor Cursor For
09      select * from TLSLIB.TCHAR ;

10  Exec SQL SET :StrTS = CURRENT_TIMESTAMP ;

11  Exec SQL Open TestFile_Cursor ;
12  @SqlSuccessful = (SQLCOD = 0) ;

13  DoW @SqlSuccessful ;
14      Exec SQL Fetch TestFile_Cursor Into :@DataIn ;
15      @SqlSuccessful = (SQLCOD <> 100) ;

16      If @SqlSuccessful ;
17          @TestFileOut = @DataIn ;
18          Exec SQL Insert into TLSLIB.TCHAROUT
19                 values( :@TestFileOut ) ;
20      EndIf ;
21  Enddo;

22  Exec SQL Close TestFile_Cursor ;

23  Exec SQL SET :EndTS = CURRENT_TIMESTAMP ;

24  Exsr $WritetoLogFile ;

The third program uses a SQL cursor to do a multiple row fetch, and then performs a multiple row insert to add data to the output table.

01  dcl-ds @DataIn  extname('TLSLIB/TCHAR') qualified dim(6000);
02  end-ds ;

03  dcl-ds @TestFileOut extname('TLSLIB/TCHAROUT') qualified ;
04  end-ds ;

05  dcl-ds gd qualified ;
06    RowsCount        int(10) ;
07  end-ds ;

08  dcl-s StrTS           timestamp ;
09  dcl-s EndTS           timestamp ;
10  dcl-s @SqlSuccessful  ind ;
11  dcl-s NbrOfRows       int(5) inz(%elem(@DataIn)) ;
12  dcl-s p_RowsCount     int(10) ;

13  Exec SQL Declare TestFile_Cursor Cursor For
14      select * from TLSLIB.TCHAR ;

15  Exec SQL SET :StrTS = CURRENT_TIMESTAMP ;

16  Exec SQL Open TestFile_Cursor ;
17  @SqlSuccessful = (SQLCOD = 0) ;

18  DoW @SqlSuccessful ;
19      Clear @DataIn ;

20      Exec SQL Fetch TestFile_Cursor
21          FOR :NbrOfRows ROWS Into :@DataIn ;

22      SQLCOD = SQLCOD ;
23      @SqlSuccessful = (SQLCOD <> 100) ;

24      If @SqlSuccessful ;
25        gd = GetSqlDiagnostics();
26        p_RowsCount = gd.RowsCount ;

27        Exec SQL
28          INSERT into TLSLIB.TCHAROUT :p_RowsCount ROWS
29            values( :@DataIn ) ;
30     EndIf ;
31  Enddo;                                   

32  Exec SQL Close TestFile_Cursor ;

33  exec sql SET :EndTS = CURRENT_TIMESTAMP ;

34  Exsr $WritetoLogFile ;

Line 1: The data structure array is defined with 6,000 elements. Defining it with a greater number of elements exceeded the maximum size allowed for the array; each element of this data structure is 2,735 bytes, so 6,000 elements already comes to roughly 16.4 million bytes.
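The GetSqlDiagnostics() procedure called on line 25 is not included in the extract. As a guess at what it might contain, here is a minimal sketch of a procedure that returns the number of rows fetched, using SQL's GET DIAGNOSTICS statement. The procedure name and the gd data structure come from the listing above; the body is my assumption, not Thomas's code:

  dcl-proc GetSqlDiagnostics ;
    dcl-pi *n likeds(gd) end-pi ;

    dcl-ds Diagnostics likeds(gd) ;
    dcl-s RowsFetched int(10) ;

    //  ROW_COUNT contains the number of rows affected by the most
    //  recently executed SQL statement, here the multiple row fetch
    exec sql GET DIAGNOSTICS :RowsFetched = ROW_COUNT ;

    Diagnostics.RowsCount = RowsFetched ;
    return Diagnostics ;
  end-proc ;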

I have broken out the results into a table making it easy to compare the results for each type of table. Each program was run ten times in different jobs. The times shown are the average of the ten runs and are in seconds.

Program          No. of tests  TCHAR       TVARCHAR    TVAR80      TVAR100
SQL multi row    10            150.780885  224.643632  271.863179  271.826870
CPYF             10            446.536102  556.010700  708.202601  467.950940
SQL single row   10            472.460167  620.534527  785.702390  774.258705

What conclusions can I draw from these results?

  1. Using a SQL multi row fetch is the fastest way to copy data from one table to another (I was surprised how much faster it is when compared to CPYF).
  2. Using VARCHAR is slower than just a straight CHAR (not surprised).
  3. Using the 80% allocation was the slowest of all copies (surprised).
  4. Only with the CPYF was the 100% allocation time close to the TCHAR number (surprised).

Why was RPG not used in these examples?

In earlier rounds of testing Thomas compared native RPG input, a SQL single row fetch, and a SQL multiple row fetch for input only. The RPG program was so much slower than the others that it was not used in the next set of tests.
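Thomas's native RPG test program is not shown. Purely as an illustration, a read only loop over the TCHAR table using native record level access might look something like the sketch below; this is my assumption of the kind of program used, not his code:

  dcl-f TCHAR usage(*input) ;

  dcl-ds DataIn likerec(TCHARR : *input) ;

  //  read every row of the table, nothing is done with the data
  read TCHAR DataIn ;
  dow not %eof(TCHAR) ;
    read TCHAR DataIn ;
  enddo ;

As in the other programs, the start and end timestamps would be captured either side of the read loop.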

Results for reading the tables containing 1 million rows (times in seconds):

Table      RPG         Single row fetch  Multiple row fetch
TCHAR      56.258650   17.878956          7.920792
TVARCHAR   56.335911   19.058225          9.734931
TVAR80     66.362741   21.619942         11.693264

These results reinforce my preference for using SQL multi row fetches for performing all kinds of database input and output.

Thank you to Thomas for allowing me to share his results.

15 comments:

  1. Thanks Simon for sharing.

  2. Thanks Simon. Interesting methods. For the CPYF, we used the blocking factor to speed up the copy way back when. I wonder if it still works?

     Replies:

     1. Yes, you can still use blocking with CPYF. I did when I was copying large files... now I think I should replace those with SQL.

        See here how to increase the blocking for CPYF.

     2. Hi, I totally agree. In some tests we did a long time ago we found that SQL was in fact using blocking equivalent to the maximum you could get with OVRDBF. I even built a specific FCPYF command that forced the blocking (calculated, as the buffer sizes available today were not implemented back then), and using it with CPYF made the timings more similar than without OVRDBF. But SQL was still always the best, as long as you didn't need CPYF's functionality.

     3. Part 2 includes the CPYF with blocking versus the standard CPYF, and SQL. Read it here.

  3. Thanks for sharing! I was in a discussion about this very thing a couple of days ago... frustratingly, not everyone I work with is on board with using SQL for things like this. Now I have some proof to show.

  4. Thanks Simon for sharing this fantastic article. It really enhances my knowledge.

  5. Good findings. Embedded SQL is faster than a native RPG database call. Thanks for sharing.

  6. Interesting.

  7. Besides the performance aspect, a valid reason to use the CPYF command is when you have identity column(s) in the table which are referenced as foreign keys in other table(s). This is the only way to be 100% sure that the records copied have the same values in the source and target tables.

  8. Thanks for sharing.

  9. Paweł Gębski, August 12, 2021 at 9:38 AM

     Thanks for sharing.

To prevent "comment spam" all comments are moderated.
Learn about this website's comments policy here.

Some people have reported that they cannot post a comment using certain computers and browsers. If this is you feel free to use the Contact Form to send me the comment and I will post it for you, please include the title of the post so I know which one to post the comment to.