Wednesday, October 10, 2018

Removing multiple alpha characters using SQL

using sql translate function to translate characters

I needed to remove alphabetic characters (A - Z) from a field in a file, and left justify the remaining numeric characters. These alpha characters could be before and/or after any numbers contained within the string. As usual I need data in a test file, TESTFILE, that I can play with to find a way to do what I want.

In these examples, as the subject I have only bothered with two records/rows in my table/file. Why have more when I know what works in these examples will work in any scenario.

COLUMN
abc1234defghijklmnop
aaBbcCDd1mmg2k3LLkkH

The problem is that I want to replace any alpha character, no matter where in the string and what order they come in.

If I use the REPLACE function in a SQL Select it regards the list of "from values" as a string. Which does not give me the result I want.

01  SELECT COLUMN,REPLACE(COLUMN,'abc','')
02    FROM TESTFILE


COLUMN                REPLACE
abc1234defghijklmnop  1234defghijklmnop
aaBbcCDd1mmg2k3LLkkH  aaBbcCDd1mmg2k3LLkkH

To remove any occurrence of the letters I would need to nest the REPLACE functions, that is going to get too unwieldy too quickly.

SELECT REPLACE(REPLACE(REPLACE(COLUMN,'a',''),'b',''),'c','')
  FROM TESTFILE

I could use the regular expression function REGEXP_REPLACE to get what I desire.

01  SELECT COLUMN,
02         LTRIM(REGEXP_REPLACE(COLUMN,'[A-z]',' '))
03    FROM TESTFILE


COLUMN                LTRIM
abc1234defghijklmnop  1234
aaBbcCDd1mmg2k3LLkkH  1   2 3

On line 2 I pass three parameters to the REGEXP_REPLACE function:

  1. Name of column/field.
  2. The value(s) to be translated. In this case A (upper case A) to z (lower case z) includes all of the alphabetic characters regardless of case.
  3. Result. In this case I want to translate all of the letters to blank.

The LTRIM function removes any leading spaces, which results in my results being right justified.

When using regular expressions I find working out what the values to be translated can become complicated, especially if special characters are used. Which leaves me to ask: is there another way?

Fortunately there is the Translate function, TRANSLATE.

The order of the parameters I need to pass to TRANSLATE function is a bit strange:

  1. Column/field name
  2. To value(s)
  3. From value(s)

For this example my statement would look like.

01  SELECT COLUMN,
02         LTRIM(TRANSLATE(COLUMN,' ',
03             'AaBbCcDdEeFfGgHhIiJjKkLlMm
04             NnOoPpQqRrSsTtUuVvWwXxYyZz'))
05    FROM TESTFILE


COLUMN                LTRIM
abc1234defghijklmnop  1234
aaBbcCDd1mmg2k3LLkkH  1   2 3

Line 2: I want to translate the values in the column COLUMN to be a space.

Lines 3 and 4: I have listed all of the values I want to translate from on two lines just to show it is possible with a large number of values. This allows me to add or remove characters to or from the list.

Again the LTRIM, line 2, is needed to remove the leading spaces.

As I find many times in programming there is more than one way to do a task, in this case remove the alphabetic characters from the column COLUMN.

 

You can learn more about the TRANSLATE function from the IBM website here.

 

This article was written for IBM i 7.3, and should work for some earlier releases too.

6 comments:

  1. A word of caution about the second argument in TRANSLATE(FIELD, REPLCHARS, BADCHARS)

    SELECT COLUMN, TRANSLATE(COLUMN,' ',
    'AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsTtUuVvWwXxYyZz')
    FROM TESTFILE

    Note that you have a single space there as the replacement character for each digit in REPLCHARS. This causes only the ‘A’ to be replaced with the ‘ ‘ from argument 2. I changed the data in TESTFILE to demonstrate this as I need an capital A in there:

    COLUMN
    aaBbcCDd1mAg2k3LLkkH

    I set the replacement character to ‘#’ versus ‘ ‘ this time to make things more visible:

    SELECT COLUMN, TRANSLATE(COLUMN,'#',
    'AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsTtUuVvWwXxYyZz') as RESULT
    FROM TESTFILE

    COLUMN RESULT
    aaBbcCDd1mAg2k3LLkkH ‘ 1 # 2 3 ‘ ******** End of data ********

    Note that only the ‘A’ was replaced with it!! TRANSLATE will only replace the character in the BADCHARS with the character in the corresponding position of REPLCHARS, very much like %XLATE.

    Then why did all the letters get replaced with a ‘ ’ in your example? That happened because of optional argument 4 of TRANSLATE, which you omitted: TRANSLATE(FIELD, REPLCHARS, BADCHARS, PADDING)

    PADDING defines which character(s) should be used when the REPLCHARS string is shorter than the BADCHARS string. To demonstrate, I’ll use ‘#’ again for REPLCHARS and ’@’ for PADDING and pass it explicitly:

    SELECT COLUMN, TRANSLATE(COLUMN,'#',
    'AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsTtUuVvWwXxYyZz', '@') as RESULT
    FROM TESTFILE

    COLUMN RESULT
    aaBbcCDd1mAg2k3LLkkH @@@@@@@@1@#@2@3@@@@@

    You can clearly see now that only A was replaced with the # from argument 2. All other bad characters were not replaced because a replacement character was not defined for them. Instead, they were ‘padded’ with the @ from argument 4. The reason all your letters seem to have been “replaced” with a space was because argument 4 was not passed, which causes it to defaults to a space, giving the illusion that they were all “replaced” . I just don’t want people to think they only need to define a single replacement character for every digit they want replaced in BADCHARS. The fact that it worked here was just pure luck.

    There are 2 ways to do this properly:
    1. Define REPLCHARS with equal length as BADCHARS:
    SELECT COLUMN, TRANSLATE(COLUMN,
    ' ',
    'AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsTtUuVvWwXxYyZz') as RESULT
    FROM TESTFILE

    Now every character in BADCHARS has a replacement character in the corresponding position in REPLCHARS.

    2. Explicitly define argument 4:

    SELECT COLUMN, TRANSLATE(COLUMN, '#',
    'AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsTtUuVvWwXxYyZz', '#') as RESULT
    FROM TESTFILE

    Here I can’t be bothered to figure out how many characters are in the BADCHARS. I replace the first one with a # and “replace” every other bad character that doesn’t have a defined replacement with the padding character, which is also a #. I am using # here and not ‘ ‘ because it makes it more visible, but of course you can use a space on both, which is in essence what you did.

    ReplyDelete
  2. One neat feature REPLACE has and TRANSLATE does not is the ability to *remove* characters:

    SELECT COLUMN, REPLACE(COLUMN, 'A', '') from TESTFILE

    COLUMN REPLACE
    abc1234defghijklmnop abc1234defghijklmnop
    aaBbcCDd1mAg2k3LLkkH aaBbcCDd1mg2k3LLkkH

    Note that my replacement character is not a space (' '), it’s a nothing (''). Sadly, unlike REPLACE, you cannot specify a “nothing” as the replacement character in TRANSLATE. So what if I want to strip all the letters out and only keep the numbers? Here’s the trick:

    STEP 1: deliberately replace/pad each character you do not want with a single character you also don’t want:

    SELECT COLUMN, TRANSLATE(COLUMN,'A',
    'AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsTtUuVvWwXxYyZz', 'A') as RESULT FROM TESTFILE

    COLUMN RESULT
    aaBbcCDd1mAg2k3LLkkH AAAAAAAA1AAA2A3AAAAA

    Step 2: Wrap a REPLACE around that to replace every 'A' with a nothing ('')

    SELECT COLUMN, REPLACE(TRANSLATE(COLUMN,'A',
    'AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsTtUuVvWwXxYyZz', 'A'),'A','') as RESULT (that's a NOTHING at the end there)
    FROM TESTFILE

    COLUMN RESULT
    aaBbcCDd1mAg2k3LLkkH 123

    I use this REPLACE/TRANSLATE function a lot to remove embedded “junk” characters from inbound data, such as X'0D' LineFeed and X'25' Carriage return. Basically anything below Hex(40) is either a non-human readable character or a control character that often show up in imported data from other systems and wreak havoc on the IBM i. If you have a need to strip junk characters from fields often, you can create a subprocedure in a service program. Conceptually:

    Dcl-Proc StripJunk Export;
    Dcl-Pi *N Char(512);
    FieldIn Char(512) Const;
    End-Pi;
    Dcl-S Char(40) BadChars Inz(X’0001020304……39’)
    Dcl-S Char(1) Null Inz(X’00’)
    Dcl-S Char(512) FieldOut Inz;

    Exec sql ‘set :FieldOut = SELECT REPLACE(TRANSLATE(:FieldIn, :Null,:BadChars, :Null),:Null,’’) [that’s a NOTHING at the end there] FROM SYSIBM/SYSDUMMY1;

    Return FieldOut;
    End-Proc;

    In English, translate the first digit in BadChars with a null character X'00', pad all the others with X'00' as well, then replace all those null characters with nothings, i.e. strip them out. Of course you could do that same thing with %XLATE as well. I like using X;00; for my temporary character because there's no way I'll ever want that one in my data. In the calling program:

    CleanField = Stripjunk(RawField);


    ReplyDelete
  3. Apparently this editor can't handle a series of blanks within quotes, which will cause people confusion as my solution now looks exactly the same as what you had. Let's try that again:

    1. Define REPLCHARS with equal length as BADCHARS:
    SELECT COLUMN, TRANSLATE(COLUMN,
    '___________________________________________________',
    'AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsTtUuVvWwXxYyZz') as RESULT FROM TESTFILE

    that whole '____' thing is supposed to be 52 blanks.

    ReplyDelete
  4. Great, Simon and also thanks to Dan! I think we’ll soon have a StripJunk UDF in our shop. :-)

    ReplyDelete
  5. To Dan Doorn: Dan, your idea really is very interesting but the code ( according to my opinion ) is not correct one. I tried it and had to change some important points:
    1) Sorry, but I cannot realize how the string with hexadecimal codes can be ended with 39 ?
    ( dc;-s BadChars ... Inz(X’0001020304……39’)

    39 is wrong value for decimal 39, the Badchar string values should be ended with 27 like in example below :
    Dcl-S BadChars Char(40) Inz(X'000102030405060708090A0B0C0D0E0F10111213+ 1415161718191A1B1C1D1E1F2021222324252627');

    2) The other thing as it seems me is the wrong format of this example operator :
    Exec sql set :FieldOut = SELECT REPLACE(TRANSLATE(:FieldIn, :Null,:BadChars, :Null),:Null,’’) [that’s a NOTHING at the end there] FROM SYSIBM/SYSDUMMY1;
    Compiler Never permit equation like this :
    set :Variable = SELECT .... From
    - cannot be used

    ReplyDelete
  6. The above routine doesn't actually "strip" junk characters from the string, it replaces them with BLANKS. That's not "stripping" and in hindsight I probably shouldn't have called this routine "strip". What you do with the data you need to scrub is your call and totally depends on your application. There are cases where you can't move where digits are located within a string so then you should not "strip out" junk characters because that will cause all characters after that junk character to move 1 position to the left. In those cases you will want to REPLACE all junk characters and I generally choose to replace them with spaces.

    Now, if you want to strip them OUT then you need to do the replace(translate thing that is duly detailed in my post above from 6 years ago.

    It was PURE LUCK that I happened to see this recent question here. I happened to be in the service program that has to StripHx routine in it and that program had a url in the comment and that url was to this blogpost of Simon's. If it wasn't for that I probably would have never seen this question. I hope this helps. Dan.

    ReplyDelete

To prevent "comment spam" all comments are moderated.
Learn about this website's comments policy here.

Some people have reported that they cannot post a comment using certain computers and browsers. If this is you feel free to use the Contact Form to send me the comment and I will post it for you, please include the title of the post so I know which one to post the comment to.