Wednesday, February 24, 2016

Removing duplicate records from a file


How to remove duplicate records from a file? is a question I have been asked by so many different people that I feel it must be on one of those interview question lists. What I mean by duplicate records is where there is more than one record in a file whose values in all of the fields are the same as another record's. I do not mean records with duplicate keys, as that is a whole other "can of worms".

In a well-built database each "master" file or table should have a unique key, which prevents duplicate keys and records:

A                                      UNIQUE
A          R CSTMSTR
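
If the file was defined with SQL instead, the equivalent would be a primary key constraint. This is only a sketch; the column names are ones I have invented, as CSTMSTR's fields are not shown here:

  CREATE TABLE MYLIB.CSTMSTR
         (CUSTNBR  CHAR(10) NOT NULL,
          CUSTNAME CHAR(40) NOT NULL,
          PRIMARY KEY(CUSTNBR)) ;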

Alas, in these examples I am going to have a DDS file that does not have any unique keys, or even any keys at all:

A..........T.Name++++++RLen++TDpB......
A          R TESTFILER
A            F00           10A
A            F01           10A
A            F02           10A
A            F03           10A
A            F04           10A
A            F05           10A
A            F06           10A
A            F07           10A
A            F08           10A
A            F09           10A

It should come as no surprise that, as there are no unique keys, the file contains duplicate records:

 F00         F01         F02         F03
 One         One         One         One
 Two         Two         Two         Two
 Three       Three       Three       Three
 Four        Four        Four        Four
 Five        Five        Five        Five
 One         One         One         One
 Two         Two         Two         Two
 Three       Three       Three       Three
 Four        Four        Four        Four
 Five        Five        Five        Five
 One         One         One         One
 Two         Two         Two         Two
 Three       Three       Three       Three
 Four        Four        Four        Four
 Five        Five        Five        Five

Rather than delete the duplicate records from the original file, TESTFILE, I am going to create a duplicate file, TESTFILE2, and copy the unique data to it. The command to create a duplicate of any object is the Create Duplicate Object command, CRTDUPOBJ.

  CRTDUPOBJ OBJ(TESTFILE) FROMLIB(MYLIB) OBJTYPE(*FILE)
             NEWOBJ(TESTFILE2) CST(*NO) TRG(*NO)

The easiest way I could think of was to use a simple SQL statement:

  INSERT INTO TESTFILE2 (SELECT DISTINCT * FROM TESTFILE)

Using SELECT DISTINCT * makes sure I am only selecting distinct (unique) records using all of the file's fields ( * ). Personally I think this is the easiest way, as this statement works the same whether the file has one field or several hundred fields.
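
If I wanted to see which records are duplicated, and how many times, before running the insert I could count them with a GROUP BY over all of the fields. This is just a quick check I might run; there is no way to avoid listing every field here:

  SELECT F00, F01, F02, F03, F04,
         F05, F06, F07, F08, F09,
         COUNT(*) AS OCCURRENCES
    FROM TESTFILE
   GROUP BY F00, F01, F02, F03, F04,
            F05, F06, F07, F08, F09
  HAVING COUNT(*) > 1

Any row returned with OCCURRENCES greater than one is a duplicate record.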

If I really had to I could use the Open Query File, OPNQRYF, and Copy From Query File, CPYFRMQRYF, commands. I am reluctant to use these as IBM is encouraging us all to use SQL instead:

01  OPNQRYF FILE((TESTFILE)) +
              KEYFLD((F00) (F01) (F02) (F03) (F04) +
                     (F05) (F06) (F07) (F08) (F09)) +
              UNIQUEKEY(*ALL)

02  CPYFRMQRYF FROMOPNID(TESTFILE) TOFILE(TESTFILE2) +
                 MBROPT(*REPLACE)

03  CLOF OPNID(TESTFILE)

Line 1: I list all of the key fields to sort the file by. This can be a lot of work for a file with a lot of fields, and the KEYFLD parameter is limited to a maximum of 50 entries. The UNIQUEKEY(*ALL) parameter is essential as it will only select unique keys, the first occurrence of each set of duplicate records.

Line 2: The unique keyed records are copied from the original file to the duplicated file.

Line 3: As I have used OPNQRYF I have to close the file using the Close File command, CLOF.

What if I had to remove the duplicate records using RPG? RPG is not a database language; in other words, it does not contain within the language the ability to sort the contents of a file. Therefore, I would have to provide a key, external to the program, for the program to use. Without an external key I would have to read the entire output file until I either found a match for the record from the original file or reached end of file, a very slow and tedious process.

So I built a logical file over my duplicate file, which I called TESTLF. The key is made up of every field within the file, and the UNIQUE keyword ensures that there can only be one record with that key. As the maximum number of key fields you can have on a file is 120, if the file contained more fields than that we could not use this approach.

A                                      UNIQUE
A          R TESTFILER                 PFILE(TESTFILE2)
A          K F00
A          K F01
A          K F02
A          K F03
A          K F04
A          K F05
A          K F06
A          K F07
A          K F08
A          K F09
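
If I preferred SQL DDL to DDS I could enforce the same uniqueness with a unique index over TESTFILE2, and have the program write to the physical file instead. The index name is my own invention:

  CREATE UNIQUE INDEX MYLIB.TESTFILE2_UK
      ON MYLIB.TESTFILE2
         (F00, F01, F02, F03, F04,
          F05, F06, F07, F08, F09) ;

A duplicate record would then fail to be written no matter which interface tried to insert it.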

Here is my RPG program:

01  dcl-f TESTFILE ;
02  dcl-f TESTLF usage(*output)
                   rename(TESTFILER:TESTLFR) ;

03  dou %eof(TESTFILE) ;
04    read TESTFILER ;
05    write(e) TESTLFR ;
06  enddo ;

07  *inlr = *on ;

Line 1: This is the file with the duplicates in it.

Line 2: The output file is the logical file. It is opened for output, and I have to rename the record format as it is the same in both files.

Line 3: I am going to perform this loop until I reach end of file for TESTFILE.

Line 4: I read TESTFILE.

Line 5: Notice that after the WRITE is the error operation code extender, (E). Without it the program would error whenever I tried to write a duplicate keyed record to the output file; with it the write just fails quietly and %ERROR comes on. After the last read of TESTFILE, when end of file is encountered, the record buffer still holds the last record, so the unique key stops it being written to TESTLF a second time.

Line 6: Loop's end.

Once you understand how the error operation code extender works to prevent database errors, you can see how these few lines of code do what we want: copy only the unique records to the output file.

For those of you still stuck having to use fixed definitions in your RPG these are the file definitions:

    FFilename++IPEASF.....L.....A.Device+.Keywords+++++++++++++++++
01  FTESTFILE  IF   E             DISK
02  FTESTLF    O    E             DISK    rename(TESTFILER:TESTLFR)

No matter which of these alternatives was used I would do the following afterwards (there is a sketch of these steps in SQL after the list):

  • Back up TESTFILE. It does not matter whether it is to tape or just a copy of the file in another library.
  • Clear TESTFILE.
  • Copy the data from TESTFILE2 into TESTFILE.
  • Make a new logical file with a unique key over TESTFILE to prevent duplicate keyed records being written to TESTFILE again. This does not have to contain every field in the file, just enough fields to make each record unique.
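
Those steps could look something like this in SQL. The backup library, QGPL, and the choice of only three key fields are assumptions I have made for the example; a unique SQL index serves the same purpose as a unique keyed logical file:

  --  Back up the original file to another library
  CREATE TABLE QGPL.TESTFILE AS
    (SELECT * FROM MYLIB.TESTFILE) WITH DATA ;

  --  Clear the original file
  DELETE FROM MYLIB.TESTFILE ;

  --  Copy the unique records back into it
  INSERT INTO MYLIB.TESTFILE
    (SELECT * FROM MYLIB.TESTFILE2) ;

  --  Prevent duplicate keyed records in the future
  CREATE UNIQUE INDEX MYLIB.TESTFILE_UK
      ON MYLIB.TESTFILE (F00, F01, F02) ;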

Which of these three methods would I use? The SQL approach: it does not matter how many fields the file has, it works the same for 1 field or 300. If someone insists that it has to be an RPG program then I would give them this:

01  exec sql INSERT INTO TESTFILE2
               (SELECT DISTINCT * FROM TESTFILE) ;

02  *inlr = *on ;

 

Addendum

I have received, from Paul R, Sumit Goyal, and others, another way to remove the duplicate records using SQL:

  DELETE FROM TESTFILE A
   WHERE RRN(A) > (SELECT MIN(RRN(B))
                     FROM TESTFILE B
                    WHERE A.F00 = B.F00
                      AND A.F01 = B.F01
                      AND A.F02 = B.F02
                      AND A.F03 = B.F03
                      AND A.F04 = B.F04
                      AND A.F05 = B.F05
                      AND A.F06 = B.F06
                      AND A.F07 = B.F07
                      AND A.F08 = B.F08
                      AND A.F09 = B.F09)

It is not too long if I only have ten columns/fields, but if the file/table had 100...
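
One way to avoid typing a hundred comparisons would be to let the database catalog generate them for me. This is only a sketch, assuming the file is in MYLIB; the generated lines still have to be pasted into the DELETE statement by hand:

  SELECT '   AND A.' CONCAT COLUMN_NAME CONCAT
         ' = B.' CONCAT COLUMN_NAME
    FROM QSYS2.SYSCOLUMNS
   WHERE TABLE_NAME = 'TESTFILE'
     AND TABLE_SCHEMA = 'MYLIB'
   ORDER BY ORDINAL_POSITION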

 

This article was written for IBM i 7.2, and should work for earlier releases too.

7 comments:

  1. Removing duplicate records is needed very often but I've never seen a case where all columns are considered the key. A more practical example would be if you checked the first x columns for a unique key.

    Replies
    1. If I only checked a few of the columns/fields then I still might not have a duplicate record, just a duplicate key.

In my opinion duplicate keyed records/rows are best avoided, as it is a nightmare to work out which one is right, or perhaps they are both partially right.

  2. cpyf errlvl(*NOMAX) into file with unique key

  3. If all you need to do is delete duplicate records, wouldn't SQL be a cleaner alternative?
    For example:

    Delete from libr.file01 f1
    where rrn(f1) > (select min(rrn(f2))
    from libr.file01 f2
    where f2.fld01 = f1.fld01
    and f2.fld02 = f1.fld02
    and f2.fld03 = f1.fld03
    and f2.fld04 = f1.fld04)


  4. Very nice explanation thanks 🙏

