Wednesday, October 11, 2017

Using Regexp to substring

regular expression, regexe, to substring

In this third installment of the regular expression functions added to Db2 for i is going to be about REGEXP_SUBSTR. The SQL SUBSTRING and RPG %SUBST built in function need a fixed start and number of positions. The REGEXP_SUBST allows me to use regular expressions patterns to find and substring the pattern I desire.

All of these Regular Expressions functions were added to Db2 for i as part of the 7.3 TR2 and 7.2 TR6 updates. Rather than repeat what I have in the earlier posts, I will refer you to them for more details.

The format for this function is:

  REGEXP_SUBSTR(source-string, search-pattern, starting-position,
                occurrence, flags, group)
  1. source-string  What is searched, it can be a string or a variable.
  2. search-pattern  What I am looking for in the search string.
  3. starting-position  Where I want to start my search for the search pattern.
  4. flags  These are the same flags I can use with the other regexp functions.
  5. group  Capture group, as with REGEXP_INSTR I was not able to find a reason why I would use this.

In these examples I am going to use the same table, PERSON, I used in the post I wrote about REGEXP_LIKE, therefore, I am not going to waste your time to describe the table or its contents again.

Let me start with my example, to find all the people who have an e in their last name followed by another character.

01  SELECT LNAME,
02         REGEXP_SUBSTR(LNAME,'e.',1,1,'i') AS RESULT
03   FROM PERSON

Line 2: The parameters in the REGEXP_SUBSTR are:

  1. LNAME  The short name of the last name column in the table.
  2. 'e.'  The dot ( . ) following the e indicates that I want an E followed by some other character.
  3. 1  I want my search to start in the first position of the last name.
  4. 1  First occurrence of the letter E only.
  5. 'i'  The lower case i means that my search is not case specific, in other words it will match both lower and upper case forms of the letter.

The results are:

Last            RESULT
name
ALLEN           EN
Austin          -
bennett         en
BRUCK           -
Caine           -
curtis          -
DALEY           EY
Framer          er
harvey          ey
HEWITT          EW
Hill            -
hill            -
HOLDER          ER
Imlach          -
jones           es
KEARNS          EA
Kletzenbauer    et
lightening      en
MYERSCOUGH      ER
Nicholas        -
ryan            -
SATCHWELL       EL
Stiffle         -
straw           -
WESSON          ES

Note:  The dashes ( - ) indicate a null, not found, value.

This has worked like a scan, to find the E, and a substring, to get the result. Notice that with the last name Stiffle null was returned as the E at the end of the name is not followed by another character.

I only know of one way to remove the rows that return null from the results by adding a where statement like line 4, see below.

01  SELECT LNAME,
02         REGEXP_SUBSTR(LNAME,'e.',1,1,'i') AS RESULT
03    FROM PERSON
04   WHERE REGEXP_SUBSTR(LNAME,'e.',1,1,'i') IS NOT NULL

If you know of an easier or better way to stop null appearing in the result set please let me know in the Comments below.

So what do I get if I change my Select to look for the second occurrence of E.

01  SELECT LNAME,
02         REGEXP_SUBSTR(LNAME,'e.',1,2,'i') AS RESULT
03    FROM PERSON
04   WHERE REGEXP_SUBSTR(LNAME,'e.',1,2,'i') IS NOT NULL

Line 2: I change the occurrence value, the fourth parameter, to 2 to indicate I only want to include in my results a second E in the last names.

Line 4: I have repeated the REGEXP_SUBSTR in the Where to ignore any null results.

And what do I get?

Last            RESULT
name
bennett         et
Kletzenbauer    en

Only those names with a second E in them, followed by a character.

How about any name with a vowel in it, followed by N.

01  SELECT LNAME,
02         REGEXP_SUBSTR(LNAME,'[aeiou]n',1,1,'i')
             AS RESULT
03    FROM PERSON
04   WHERE REGEXP_SUBSTR(LNAME,'[aeiou]n',1,1,'i')
             IS NOT NULL

Line 2: The square brackets ( [ ] ) allow me to put in a list of range of values. In my example I have listed all the vowels. What this will do is return in the results any last name with a vowel followed by a letter N.

Last            RESULT
name
ALLEN           EN
Austin          in
bennett         en
Caine           in
jones           on
Kletzenbauer    en
lightening      en
ryan            an
WESSON          ON

How about a vowel followed by another vowel?

01  SELECT LNAME,
02         REGEXP_SUBSTR(LNAME,'[aeiou]+[aeiou]',1,1,'i')
             AS RESULT
03    FROM PERSON
04   WHERE REGEXP_SUBSTR(LNAME,'[aeiou]+[aeiou]',1,1,'i')
            IS NOT NULL

Line 2: in the search pattern I have used a plus ( + ), in regular expressions it means "match preceding element one or more times".

The best way to understand what that means is to see the results.

Last            RESULT
name
Austin          Au
Caine           ai
KEARNS          EA
Kletzenbauer    aue
MYERSCOUGH      OU

The interesting result is for the last name "Kletzenbauer" as it returns three vowels because we have the first vowel, A, and an ending vowel, E, but also any other vowels that precede the last vowel. If there was a last name with four vowels next to each other, "Baaaat" is would appear in the results as "aaaa", because the vowels are all next to one another.

Let me make this more complicated… I want to extract strings from these last names starting with the first vowel and ending with the last vowel.

01  SELECT LNAME,
02         REGEXP_SUBSTR(LNAME,'[aeiou]+[a-z]+[aeiou]',1,1,'i') 
             AS RESULT
03    FROM PERSON

My results are just what I desired:

Last            RESULT
name
ALLEN           ALLE
Austin          Austi
bennett         enne
BRUCK           -
Caine           aine
curtis          urti
DALEY           ALE
Framer          ame
harvey          arve
HEWITT          EWI
Hill            -
hill            -
HOLDER          OLDE
Imlach          Imla
jones           one
KEARNS          -
Kletzenbauer    etzenbaue
lightening      ighteni
MYERSCOUGH      ERSCOU
Nicholas        ichola
ryan            -
SATCHWELL       ATCHWE
Stiffle         iffle
straw           -
WESSON          ESSO

I included the null results this time so you could see why some of the last names did not return a result.

  • BRUCK, Hill, hill, ryan, straw do not return results as they only have one vowel.
  • KEARNS does not return a vowel as the two vowels are next to one another, with no other character between.

What I have been impressed with is the ease I have been able to get substrings from strings without having to give a hard starting position and a length, try creating a line of RPG code to do the same thing. I can definitely see myself using this function in the future.

 

Other Db2 for i regular expressions:

 

You can learn more about the REGEXP_SUBSTR command from the IBM website here.

 

This article was written for IBM i 7.3 TR2 and 7.2 TR6.

3 comments:

  1. excellent examples, was able to use in my SQLRPGLE programs

    ReplyDelete
  2. The tools of IBM Power i just keep getting more powerful! Good article with great examples!

    ReplyDelete

To prevent "comment spam" all comments are moderated.
Learn about this website's comments policy here.

Some people have reported that they cannot post a comment using certain computers and browsers. If this is you feel free to use the Contact Form to send me the comment and I will post it for you, please include the title of the post so I know which one to post the comment to.