Wednesday, June 22, 2022

Removing duplicate characters from a string

rpg scan replace %scanrpl

I was asked how I would remove duplicate characters from a string just using RPG. I has been some time since I came up with an all RPG solution, so I accepted the challenge.

The string of data is in a file, TESTFILE, in the field F1. The sample string I used in F1 was:

SELECT F1 FROM TESTFILE ;


F1
--------------------
abcdefghabcdijkl

I have more than one of these characters in the string: a, b, c, and d. And only one of these: f, g, h, i, j, k, and l. My goal is for a single character of all of these.

The program to do this is only 12 lines long:

01  **free
02  ctl-opt option(*srcstmt) ;

03  dcl-f TESTFILE ;

04  dcl-s Counter uns(3) ;

05  read TESTFILER ;

06  for Counter = 1 to %len(%trimr(F1)) ;
07    if (%subst(F1:Counter:1) <> ' ') ;
08      F1 = %scanrpl(%subst(F1:Counter:1):'':F1:Counter + 1) ;
09    endif ;
10  endfor ;

11  dsply ('F1 = ' + %subst(F1:1:20)) ;
12  *inlr = *on ;

Line 1: If I am writing the code then it can only be totally free format RPG.

Line 2: My favorite control option *SRCSTMT.

Line 3: I need to define the file TESTFILE so I can read it to get the string in the field F1.

Line 4: I am defining a variable, Counter, as unsigned numeric.

Line 5: I read my file, TESTFILE. I know I have one record in the file; therefore, I am not going to bother to check for end of file or anything like that.

Line 6: Start of a For group. Here the maximum value the group is performed is calculated by nested BiFs. I use the %TRIMR nested with in the %LEN to determine the length of the string, 16, not the length of the field, 50.

Line 17: I am "working" my way through the string one position at a time. I am doing this with the %SUBST BiF checking the single character in the same number position as the value in the variable Counter. If it is not blank I execute the code below.

Line 8: I have found the Scan and Replace BiF, %SCANRPL, one of my favorites. Here I am using it with four parameters:

  1. Character to replace: This is the character I have "substringed" from the string
  2. Value to replace it with: '' contains no value, therefore it is a null value
  3. Name of the field containing the string: F1
  4. Starting position for the scan and replace to begin: Which I want this to be the next character to current position I "substringed"

By replacing a character with a null value the characters to the right all move one to the left.

First time in the For-group I have "substringed" the character a in the first position of the string. I start the scan and replacing an one more than the position I "substringed", the second position. It finds an a in the ninth position, and replaces that with a null, which means that all the characters in the tenth and greater positions move one to the "left". This now makes the string:

Before
--------------------
abcdefghabcdijkl

After
--------------------
abcdefghbcdijkl

Line 11: When I leave the For-group I am using the DSPLY operation code to show the contents of F1.

DSPLY  F1 = abcdefghijkl

Success! Each character is only found in the string one time.

I came up with an equivalent with SQL using a Common Table Expression, CTE. In my opinion the CTE is not as elegant as the RPG.

01  WITH T0(ELEMENT) AS
02  (SELECT ELEMENT FROM TESTFILE,
03  TABLE(SYSTOOLS.SPLIT(SUBSTR(F1,1,1)|| '*' ||
04                       SUBSTR(F1,2,1)|| '*' ||
05                       SUBSTR(F1,3,1)|| '*' ||
06                       SUBSTR(F1,4,1)|| '*' ||
07                       SUBSTR(F1,5,1)|| '*' ||
08                       SUBSTR(F1,6,1)|| '*' ||
09                       SUBSTR(F1,7,1)|| '*' ||
10                       SUBSTR(F1,8,1)|| '*' ||
11                       SUBSTR(F1,9,1)|| '*' ||
12                       SUBSTR(F1,10,1)|| '*' ||
13                       SUBSTR(F1,11,1)|| '*' ||
14                       SUBSTR(F1,12,1)|| '*' ||
15                       SUBSTR(F1,13,1)|| '*' ||
16                       SUBSTR(F1,14,1)|| '*' ||
17                       SUBSTR(F1,15,1)|| '*' ||
18                       SUBSTR(F1,16,1)|| '*' ||
19                       SUBSTR(F1,17,1),'*'))),
20                   '*'))),

21  T1 AS
22  (SELECT DISTINCT ELEMENT FROM T0),

23  T2 (RESULT) AS
24  (SELECT LISTAGG(ELEMENT) WITHIN GROUP(ORDER BY ELEMENT) FROM T1) 
                     
25  SELECT RESULT FROM T2 ; 

Lines 1 – 20: The first part of the CTE uses the Split table function. This splits apart a string delimited by a character, in this case an asterisk ( * ), into separate rows. This is where my code gets messy. I have to in add an asterisk after every character. The only way I could think of was using the substring function, SUBSTR. In lines 3 – 19 the string is broken apart into separate characters, by using the substring function, and concatenated with an asterisk and then the next character. I had to hard code the positions used for the starting position of the substring, and the number of times 16, this is to be performed. Line 20 is where I inform the Split function that the delimiting character is the asterisk. All very messy.

The results for this part of the CTE looks like:

ELEMENT
-------
a
b
c
d
e
f
g
h
a
b
c
d
i
j
k
l

Lines 21 and 22: As I have the results of the first part in a virtual table, T0, I can use a SELECT DISTINCT to only return distinct (unique) characters in my results:

ELEMENT
-------
c
e
j
k
l
i
b
d
a
f
h
g

Lines 23 and 24: In the last part I use the LISTAGG function to combine the rows in the virtual T1 table into one string. I want to sort the rows before they are converted into the string and I use the WITHIN GROUP(ORDER BY <column name>) to do that.

Line 25: When I display the results of this CTE I get what I wanted, each character once and sorted in alphabetical order:

RESULT
------------
abcdefghijkl

If anyone knows a way I can accomplish what I did in the first part of this CTE, lines 1 – 20, please let me know as I am not happy with that one bit.

Will this remain a scenario where the RPG is neater and cleaner than the SQL?




Ray Richert sent me the following example:

The key is using a Recursive Common Table Expression(RCTE) when you need to break something down. Rather than show a query statement, I have provided a scalar function. You can see the query statement in the function. I also enhanced it to allow a number to be passed that defines how many characters you want to check for duplicates. If nothing is specified, the default is 1. I didn't sort the data as you referenced in your post. If that is a requirement, then the RCTE will look different.

CREATE OR REPLACE FUNCTION GET_DEDUPED_STRING
   ( deduplicate_string VARCHAR( 32739 )
   , deduplicate_length INTEGER DEFAULT 1
   )
  RETURNS VARCHAR( 32739 )
  LANGUAGE SQL
  SPECIFIC DEDUPE
  DETERMINISTIC -- It will return the same value if the inputs are the same.
  MODIFIES SQL DATA
--  RETURNS NULL ON NULL INPUT

  SET OPTION
    DBGVIEW = *SOURCE
    , DATFMT = *ISO
    , COMMIT = *NONE


BEGIN
   -- Author.........: Ray Richert
   -- Date Written...: 6/22/2022
   -- Purpose........: Deduplicate a variable


   -- Common variables
   DECLARE v_MessageText VARCHAR( 1000 );
   DECLARE v_MissingColumns VARCHAR( 900 );

   -- Specific variables
   DECLARE v_DedupedString VARCHAR( 32739 );

   -- Absolutely Required Parameters
   SET v_MissingColumns = '';

   -- Code for parameters that are required.
   IF deduplicate_string IS NULL THEN
     SET v_MissingColumns = v_MissingColumns || ', DEDUPLICATE_STRING';
   END IF;

   -- Code for parameters that are required.
   IF deduplicate_length IS NULL THEN
     SET v_MissingColumns = v_MissingColumns || ', DEDUPLICATE_LENGTH';
   END IF;


   -- If  required parameters are missing, send error and terminate.
   IF v_MissingColumns <> '' THEN
     SET v_MessageText = ' - Missing required parameter(s): '
                       || TRIM( SUBSTR( v_MissingColumns, 2 ) ) || '.';
     SIGNAL SQLSTATE 'XX999' SET MESSAGE_TEXT = v_MessageText;
   END IF;

   -- Edit/Validation of parameters
   -- Make sure that the deduplicate length doesn't exceed the deduplicate sting.
   IF deduplicate_length > LENGTH( deduplicate_string ) THEN
     SET v_MessageText = ' - DEDUPLICATE_LENGTH exceeds length of DEDUPLICATE_STRING.';
     SIGNAL SQLSTATE 'XX998' SET MESSAGE_TEXT = v_MessageText;
   END IF;

   -- Mainline

   WITH remove_duplicates( dedupe_length, start_position, dedupe_string ) as
     ( SELECT
         CAST( deduplicate_length AS INTEGER ) AS dedupe_length
         , CAST( deduplicate_length + 1 AS INTEGER ) AS start_position
         , CAST( REGEXP_REPLACE( deduplicate_string  -- Source string
                               , SUBSTRING( deduplicate_string
                                          , 1
                                          , deduplicate_length
                                          )  -- Search string
                               , ''  -- Replacement value is null
                               , deduplicate_length + 1  -- Starting position to search
                               )
                 AS VARCHAR( 32739 )
               ) AS deduped_element
       FROM SYSIBM.SYSDUMMY1
       UNION ALL
       SELECT
         remove_duplicates.dedupe_length
         , remove_duplicates.start_position + remove_duplicates.dedupe_length
         , CAST( REGEXP_REPLACE( remove_duplicates.dedupe_string  -- Source string
                               , SUBSTRING( remove_duplicates.dedupe_string
                                          , remove_duplicates.start_position
                                          , remove_duplicates.dedupe_length
                                          )  -- Search string
                               , ''  -- Replacement value is null
                               , remove_duplicates.start_position + 
                                 remove_duplicates.dedupe_length  -- Starting position to search
                               )
                 AS VARCHAR( 32739 )
               ) AS deduped_element
       FROM remove_duplicates
       WHERE remove_duplicates.start_position < LENGTH(remove_duplicates.dedupe_string)
     )
   SELECT
     remove_duplicates.dedupe_string
   INTO
     v_DedupedString
   FROM remove_duplicates
   ORDER BY
     remove_duplicates.start_position DESC
   LIMIT 1;

   RETURN v_DedupedString;

END;


COMMENT ON SPECIFIC FUNCTION DEDUPE IS
   'Return a string after having removed duplicate values.';


COMMENT ON PARAMETER SPECIFIC FUNCTION DEDUPE
   ( deduplicate_string IS
       'Required'
   , deduplicate_length IS
       'Optional - Default 1'
   );

Examples:
values get_deduped_string( deduplicate_string => 'abcdefghabcdijkl' ); -- Result abcdefghijkl
values get_deduped_string( deduplicate_string => 'aabbccddaacceeffff' ); -- Result abcdef
values get_deduped_string( deduplicate_string => 'aabbccddaacceeffff', deduplicate_length => 2 );
Result aabbccddeeff
values get_deduped_string( deduplicate_string => 'aabbccddaacceeffff', deduplicate_length => 20 );
Error XX998 will show.
values get_deduped_string( deduplicate_string => 'aabbccddaacceeffff', deduplicate_length => cast( null as integer ) );
Error XX999 will show if "RETURNS NULL ON NULL INPUT" is commented out.

8 comments:

  1. For SQL, I've done this previously with a regex, however, my solution leaves the last occurrence, instead of keeping the first one and removing subsequent chars:

    values(REGEXP_REPLACE('abcdefghabcdijkl', '(?:(.)(?=(.*)\1))', ''))

    returns efghabcdijkl

    ReplyDelete
  2. Hello,

    You could use a recursive CTE instead of SPLIT() so your code still works when the length of the input is greater than 17, like
    with testfile(f1) as (values 'abcdefghabcdijkl', 'jyzgnfoqrihgemosegafj'),
    T0(n, element, f1) as (
    select 1, cast(null as char(1)), f1 from testfile where length(f1) > 0
    union all
    select n+1, substr(f1, n+1, 1), f1 from t0 where n<length(f1)
    )
    select f1, listagg(distinct element, '') within group(order by element) from t0 group by f1

    ReplyDelete
  3. Code
    06 for Counter = 1 to %len(%trimr(F1)) ;
    07 if (%subst(F1:Counter:1) <> ' ') ;
    08 F1 = %scanrpl(%subst(F1:Counter:1):'':F1:Counter + 1) ;
    09 endif ;
    10 endfor ;
    seems very inefficient to me.
    Firstly, at each stage of the loop, the length of the string can change, which means that the variable responsible for the end of the loop will be recalculated each time. Given that the construction %len(%trimr(F1)) by itself is not very efficient.
    Secondly, at each stage of the loop, a complete scan of the "forward" line will occur, which also does not add speed.
    In my opinion, it is more efficient to create a table of 256 indicators, firstly, and, secondly, make a new line for the result. Then the algorithm will be something like this:
    1. We take the next character from the source string.
    2. We look at the table - if the indicator corresponding to this is already set (the symbol has already been seen before), go to the next symbol. If not, we transfer the symbol to the result line and set the indicator corresponding to it.
    And yes, it's easier to write it in C and plug it in as a module with extproc . But RPG is not a problem either.

    ReplyDelete
  4. For SQL, as you asked, you can reduce the string with the statement below.
    If you remove the last WHERE you can even see the reduction steps.
    But this is of course as an exercise... For readability I think this is best left to a proper iterative routine (RPG or procedural SQL indifferently) then exposed to SQL as a user scalar defined function.

    Recursive CTE:


    WITH data(F1, s, i) as (
    VALUES ('llabcdeallall', 'llabcdeallall', 1)
    UNION
    SELECT F1, substring(s, 1, i) || replace(substring(s, i+1), substring(s, i, 1), ''), i+1
    FROM data where i < length(F1)
    )
    SELECT s from data WHERE I = LENGTH(F1)

    ReplyDelete
  5. Additionally, I suspect that your CTE doesn't even guarantee original ordering.
    I suspect that the optimizer/db2 doesn't give guarantees about that (with DISTINCT you just asking to produce a set of elements).

    ReplyDelete
  6. josewalker@gmail.comJune 23, 2022 at 8:41 AM

    Hi Simon.
    with d(n, str, nstr) as (
    select 1, replace(str, l, ''), cast(l as char(50))
    from (select trim(str) str, substr(trim(str), 1, 1) l
    from table(values(' abcde abcde')) x(str) ) i(str, l)
    union all
    select n + 1, replace(str, substr(trim(str), 1, 1)
    , ''), trim(nstr)|| substr(trim(str), 1, 1)
    from d
    where length(trim(str)) > 0
    )
    select nstr from d
    order by n desc
    fetch first 1 rows only

    ReplyDelete
  7. Your example using %scanrpl() will of course only work with fixed length character fields. If you are using a variable length field, then you will have problems because of the way the example is programmed. But it is a good example to be inspired with.

    ReplyDelete
  8. Corrected version

    CREATE TABLE je_char.chars (chars CHAR(20))
    ;

    INSERT INTO je_char.chars
    VALUES
    ('adagaziad')
    ;


    WITH characters (value, char, index, length) AS (
    SELECT chars,
    SUBSTR(chars, 1, 1),
    1,
    LENGTH(chars)
    FROM je_char.chars
    UNION ALL
    SELECT value,
    SUBSTR(value, index + 1, 1),
    index + 1,
    length
    FROM characters
    WHERE index < length
    )
    SELECT MAX(value) original,
    LISTAGG(DISTINCT char) within group(order by char) sorted_unique
    FROM characters
    ;

    ReplyDelete

To prevent "comment spam" all comments are moderated.
Learn about this website's comments policy here.

Some people have reported that they cannot post a comment using certain computers and browsers. If this is you feel free to use the Contact Form to send me the comment and I will post it for you, please include the title of the post so I know which one to post the comment to.