Wednesday, January 18, 2017

Easy way to merge data using SQL

merge statement in sql

I had a project where the first step was to merge data from three files into one. I have used the word "merge" deliberately to differentiate it from "copy". With a "merge" I needed to update specific fields if a record is present in the output file with a matching key. If there is no matching key record I add a new record. I have done this many times in RPG, this time I wanted to try something different, like SQL.

After searching IBM's KnowledgeCenter I discovered the MERGE SQL statement. I do not know when it was released, but searching in the different versions of the KnowledgeCenter I can find it in the versions for IBM i 7.1, 7.2, and 7.3, but I cannot find it in the earlier releases. The MERGE does exactly what I want it to, it will update on a match and insert when there is no match.

Before I explain the MERGE I need to show the three files I will be using in this example. The files are named:

  1. FILE1
  2. FILE2
  3. FILE3

They all contain one common field, that I will be using to match with, and two unique fields.

Not all of the files have the records with the same matching key, see below.

Key Present in
FILE1 FILE2 FILE3
1 Y Y Y
2 Y Y
3 Y Y
4 Y
5 Y
6 Y

The DDS and the content of for FILE1 looks like this:

01  A          R FILE1R
02  A            F1KEY1         3A
03  A            F1F1           5A
04  A            F1F2          10A


F1KEY1  F1F1   F1F2
 1      AAAAA  BBBBBBBBBB
 2      CCCCC  DDDDDDDDDD
 6      EEEEE  FFFFFFFFFF

FILE2 looks like:

01  A          R FILE2R
02  A            F2KEY1    R               REFFLD(F1KEY1  FILE1)
03  A            F2F1          10A
04  A            F2F2           5P 0


F1KEY1  F2F1          F2F2
 1      GGGGGGGGGG       1
 2      HHHHHHHHHH       2
 3      IIIIIIIIII       3
 4      JJJJJJJJJJ       4

The reason the field F2KEY1 shows F1KEY1 when looking at the data is because the REFFLD copies the attributes of F1KEY1 from FILE1. As F1KEY1 in FILE1 does not have a column heading, COLHDG, the field name is used.

FILE3:

01  A          R FILE3R
02  A            F3KEY1    R               REFFLD(F1KEY1  FILE1)
03  A            F3F1           7P 2
04  A            F3F2           5A


F1KEY1       F3F1   F3F2
 1           5.00   KKKKK
 3           6.00   LLLLL
 5           7.00   MMMMM

I decided to put my MERGE into a RPG program, which starts like this:

01  exec sql SET OPTION COMMIT = *NONE ;

02  exec sql DROP TABLE QTEMP.TABLE1 ;

03  exec sql CREATE TABLE QTEMP.TABLE1
04           (KEY_COLUMN FOR "KEYFLD",
05            FIRST,SECOND,THIRD,FOURTH,FIFTH,SIXTH)
06            AS (SELECT A.F1KEY1,A.F1F1,A.F1F2,
07                       B.F2F1,B.F2F2,
08                       C.F3F1,C.F3F2
09                  FROM FILE1 A CROSS JOIN
10                       FILE2 B CROSS JOIN
11                       FILE3 C
12                 WHERE A.F1KEY1 = B.F2KEY1
13                   AND A.F1KEY1 = C.F3KEY1)
14           DEFINITION ONLY
15           INCLUDING COLUMN DEFAULTS
16           RCDFMT TABLE1R ;

Line 1: I unless I need commitment control I always add this option into my SQLRPGLE programs.

Line 2: It may be unnecessary to delete a file in QTEMP, especially when the program is running in batch. This program was not, and I ran it repeatedly changing my program. I wanted to delete the table to ensure that the one that was being created would be the way I wanted it and it would be empty. If I use a CREATE OR REPLACE on line 3 the table would not be deleted and the data remain within it.

Lines 3 – 16: I am creating my output table in a similar way to using REFFLD in DDS, defining the attributes of the columns to be the same as the fields in the files used.

Line 3: When creating a table you use the CREATE TABLE statement followed by which library you want it in and its name. I have decided to use "SQL naming convention", therefore the library and file names are separated by a period ( . ).

Line 4: I have decided to give the first column a long name, KEY_COLUMN, as well as a short name, KEYFLD. I can use either name for the column in this program.

Line 5: These are the names of the other columns.

Lines 6 – 8: I am using a SELECT statement to give the "referenced" fields that will be used to define the columns on lines 4 and 5.

Lines 9 – 11: These are the files those fields are contained within. I have used a CROSS JOIN here I am just creating an empty table, therefore, I don't have to bother with trying to match records by using some other kind of join.

Lines 12 and 13: This defines how the three files will be joined.

Line 14: I only want to define the table, in other words create an empty table, so I use the DEFINTION ONLY.

Line 15: I have found that by not including the from field defaults did cause me errors when inserting into the table. By including INCLUDING COLUMN DEFAULTS means that if a column is not defined in an INSERT the default value, taken from the DDS field, is used.

Line 16: I have given the table a record format name. I may, at some time in the future, decide to read this file in a RPG program, and by having a record format name removes the need for me to use a RENAME in the file definition.

At the end of this code I have an empty table ready to be filled. Let me start by inserting data from FILE1.

17  exec sql INSERT INTO QTEMP.TABLE1
             (KEY_COLUMN,FIRST,SECOND)
             SELECT * FROM FILE1 ;

When that has finished the output table contains:

KEY_COLUMN  FIRST  SECOND      THIRD       FOURTH       FIFTH   SIXTH
   1        AAAAA  BBBBBBBBBB                   0         .00
   2        CCCCC  DDDDDDDDDD                   0         .00
   6        EEEEE  FFFFFFFFFF                   0         .00

Let me now merge the data from FILE2 into the output table:

18  exec sql MERGE INTO QTEMP.TABLE1 A USING FILE2 B
19                 ON A.KEY_COLUMN = B.F2KEY1
20            WHEN MATCHED THEN
21                 UPDATE SET A.THIRD = B.F2F1,
22                            A.FOURTH = B.F2F2
23            WHEN NOT MATCHED THEN
24                 INSERT (KEY_COLUMN,THIRD,FOURTH)
25                 VALUES (B.F2KEY1,B.F2F1,B.F2F2) ;

Line 18: I want to merge data into QTEMP.TABLE1 from FILE2

Line 19: Using these "key" fields for the match.

Line 20 – 22: If the "key" fields match update the existing columns in TABLE1.

Line 23 – 25: If the "key" fields do not match insert a new row into TABLE1 using the values from the fields in FILE2 record.

The MERGE updated the rows for "key" fields 1 and 2, and inserted new rows for the "key" values of 3 and 4.

KEY_COLUMN  FIRST  SECOND      THIRD       FOURTH       FIFTH   SIXTH
   1        AAAAA  BBBBBBBBBB  GGGGGGGGGG       1         .00
   2        CCCCC  DDDDDDDDDD  HHHHHHHHHH       2         .00
   3                           IIIIIIIIII       3         .00
   4                           JJJJJJJJJJ       4         .00
   6        EEEEE  FFFFFFFFFF                   0         .00

If there is a problem with the matching, for example a duplicate row in either file caused by the given "key" matching, the MERGE quits. The error is diagnostic, therefore, it does not interrupt the user, and the merge is unfinished. The only way to find if this type of error has happened is to look at the joblog when the job has completed. To stop errors like this I can add an extra line to the MERGE statement. The merge of the data from FILE3 contains this extra line.

26  exec sql MERGE INTO QTEMP.TABLE1 A USING FILE3 B
27                 ON A.KEY_COLUMN = B.F3KEY1
28            WHEN MATCHED THEN
29                 UPDATE SET A.FIFTH = B.F3F1,
30                            A.SIXTH = B.F3F2
31            WHEN NOT MATCHED THEN
32                 INSERT (KEY_COLUMN,FIFTH,SIXTH)
33                 VALUES (B.F3KEY1,B.F3F1,B.F3F2)
34            NOT ATOMIC CONTINUE ON SQLEXCEPTION ;

That "extra" line is line 34. This wonderfully named line means that if there is an error the merge must continue.

With all three files' data merged my output file looks like:

KEY_COLUMN  FIRST  SECOND      THIRD       FOURTH       FIFTH   SIXTH
   1        AAAAA  BBBBBBBBBB  GGGGGGGGGG       1        5.00   KKKKK
   2        CCCCC  DDDDDDDDDD  HHHHHHHHHH       2         .00
   3                           IIIIIIIIII       3        6.00   LLLLL
   4                           JJJJJJJJJJ       4         .00
   5                                            0        7.00   MMMMM
   6        EEEEE  FFFFFFFFFF                   0         .00

Having used the MERGE in a more complicated scenario I have to say I am impressed with its functionality and speed. No more merging data using RPG for me, I'll use SQL's MERGE instead.

 

You can learn more about SQL's MERGE from the IBM website here.

 

This article was written for IBM i 7.3, and should work for 7.2 and 7.1 too.

8 comments:

  1. Good article Simon, but it would be more complete with the mention of UNION. UNION allows the merging of data from different tables into a single result-set which presents the merged rows across all tables coded in the UNION. The various tables that are UNION'd do not necessarily have to have the same column spellings. A set process (INSERT or whatever) could access a UNION TABLE and write to another file, there by doing a true merge. Keep up the good work!

    ReplyDelete
  2. Hi Dan

    Could you please post an example?

    ReplyDelete
  3. SELECT COLa, COLb FROM TABLEa UNION ALL SELECT COLx, COLy FROM TABLEb ORDER BY 1, 2;

    As long as the data types of COLa same as COLx and COLb same as COLy this will work.

    UNION ALL means all rows from both tables are merged. UNION without ALL means only distinct rows merged.

    Instead of UNION one could use INTERSECT or SUBTRACT to do all kinds ways to create sets of data.

    Where would RPG be without DB2?!? LOL

    ReplyDelete
  4. I started using the SQL Merge statements in 2013. Unfortunately, my stone-age supervisor could not grasp it and asked others to investigate it as if I had somehow concocted it. He was against RPG-FREE too.

    ReplyDelete
    Replies
    1. There are a lot of managers and supervisors still stuck in that "stone age".

      Delete
    2. That's just because they are not updated with the new techniques. Things that they don't understand are the things we should not use. That's how they think. In my county, there are too many such managers. SQL is like a taboo to them.

      Delete
    3. Part of being a good manager is to learn and trust from your staff. Wanting everyone to do thing the same way you did 20 years ago hurts you, your employer, and destroys your staff's morale.

      Delete
    4. Yes but they do not want to learn because they wrongly think that's going to be difficult and they do not want to risk their retirement.

      Delete

To prevent "comment spam" all comments are moderated.
Learn about this website's comments policy here.

Some people have reported that they cannot post a comment using certain computers and browsers. If this is you feel free to use the Contact Form to send me the comment and I will post it for you, please include the title of the post so I know which one to post the comment to.