Wednesday, August 23, 2017

REGEXP like added to SQL

regular expression regex like

I can remember many years ago having a conversation with a non-AS400 (yes, it was that long ago) person about the programming environment we worked with. He scoffed at the AS400 stating he could never work on it as it did not support Regular Expressions. I am not going to describe in too much detail what Regular Expressions, sometimes called regex or regexp, are as others have done a good job doing that. To give a very simple explanation: it is a sequence of characters that define a search pattern, which is then used to find or find and replace parts of strings. If you are interested in learning more check the Wikipedia page about regular expressions here.

Within the Db2 for i PTFs that accompanied IBM i TR2 and 7.2 TR6 a number of regular expressions functions were introduced to SQL. I am not going to describe them all in one post, as it would be too long, as I feel they merit their own posts. To start this post is about the REGEXP_LIKE. I am just going to give a simple overview of how to use this, if you want more details or want to learn how to make complex statements I suggest you visit the page on IBM's KnowledgeCenter, link at the bottom of this post, or use your favorite search engine to search for "REGEXP_LIKE". The posts about REGEXP_LIKE for Oracle's PL/SQL I found helpful.

For these examples I built a Db2 table to contain the data I will be using:

01  CREATE OR REPLACE TABLE MYLIB.PERSON (
02    PERSON_ID FOR "PID" INTEGER
03              GENERATED ALWAYS AS IDENTITY
04              (START WITH 1, INCREMENT BY 1, NOCYCLE),
05    FIRST_NAME FOR "FNAME" VARCHAR(20),
06    LAST_NAME FOR "LNAME" VARCHAR(30)
07  ) ;

08  LABEL ON COLUMN PERSON (
09   PERSON_ID  IS 'Person              id', 
10   FIRST_NAME IS 'First               name', 
11   LAST_NAME  IS 'Last                name'
12  ) ;

Line 1: As of IBM i 7.2 I can use CREATE OR REPLACE to define my table. If I change the definition of the table I can run this statement again and it will change the table to match the new definition, and map the contents of the table to the new definition.

Lines 2 – 4: I am defining an identity column to help keep the rows unique. For more details on its definition refer to the post about Temporal tables. Personally, I think the definition is self explanatory.

Lines 5 and 6: I have defined two columns for the first and last names of the people to be included in this table.

Lines 8 – 12: I want to give this table decent column headings. These are the equivalent of DDS's COLHDG.

I added rows to the table using the INSERT statement.

INSERT INTO PERSON VALUES(DEFAULT,'ROBERT','ALLEN') ;
INSERT INTO PERSON VALUES(DEFAULT,'John','Austin') ;
INSERT INTO PERSON VALUES(DEFAULT,'don','bennett') ;

Notice that I used DEFAULT for the value of the first row. As this is the identity column it will increment itself when rows are added, I do not give it value.

The names I have loaded the table with are the names of the Coventry City Football Club squad from the 1961-62 season.

Let me start with the my first Select statement.

SELECT * FROM PERSON
 WHERE REGEXP_LIKE(LNAME,'Y')

I have given two of the REGEXP-LIKE parameters. The first is the name of the column to perform the test upon. The second is the character(s) to search for. There are others, but I have not used them in this statement. Here I want my result set to include all of the players whose last name has a "Y" in it.

Person   First    Last
id       name     name
     7   ALAN     DALEY
    19   BILLY    MYERSCOUGH

As I entered an upper case "Y" all of the players with that in their last name were returned. But I know I have players with a lower case "y" in the last name too. Fortunately I can use the third parameter to ignore the case.

SELECT * FROM PERSON 
 WHERE REGEXP_LIKE(LNAME,'Y','i')

By using a lower case 'i' as the third parameter the results will return all of the last names that contain a "Y", regardless of case. The "i" must be lower case, an upper case version will give you an error. As this is a utility taken from Unix it is case sensitive.

Person   First    Last
id       name     name
     7   ALAN     DALEY
     9   lol      harvey
    19   BILLY    MYERSCOUGH
    21   reg      ryan

I could make my search string be more than one character if I so desire.

In this next I want to find all of the rows where the last name contains a "Y" or "Z", regardless of case. I use the pipe ( | ) symbol to indicate that I want an "or".

SELECT * FROM PERSON
 WHERE REGEXP_LIKE(LNAME,'y|z','i')

The names returned are shown below. Frank is the only player with a "Z" in his name.

Person   First    Last
id       name     name
     7   ALAN     DALEY
     9   lol      harvey
    17   Frank    Kletzenbauer
    19   BILLY    MYERSCOUGH
    21   reg      ryan

If I wanted to test for any three characters in a name I could use the pipe, but if there were more than three it would look unwieldy. I can use the square brackets ( [ ] ) and place the values I want to search for within, think of it like a list of characters.

SELECT * FROM PERSON
 WHERE REGEXP_LIKE(LNAME,'f|y|z','i')

SELECT * FROM PERSON
 WHERE REGEXP_LIKE(LNAME,'[fyz]','i')

Now I have Ron Framer added to the results.

Person   First    Last
id       name     name
     7   ALAN     DALEY
     8   Ron      Framer
     9   lol      harvey
    17   Frank    Kletzenbauer
    19   BILLY    MYERSCOUGH
    21   reg      ryan

I can also use a hyphen ( - ) for a range of values within the square brackets. With this statement I want all of the players whose last name contains a letter between "I" and "K".

SELECT * FROM PERSON
 WHERE REGEXP_LIKE(LNAME,'[i-k]','i')

This gives me a long list of players.

Person   First     Last
id       name      name
     2   John      Austin
     4   DIETMAR   BRUCK
     5   Brian     Caine
     6   george    curtis
    10   RON       HEWITT
    11   Brian     Hill
    12   peter     hill
    14   Stuart    Imlach
    15   eric      jones
    16   MICK      KEARNS
    17   Frank     Kletzenbauer
    18   arthur    lightening
    20   Brian     Nicholas
    23   Nelson    Stiffle

I can give two ranges to only return those players whose names contain a letter from "I" – "K" followed by a letter in the range "A" – "E".

SELECT * FROM PERSON
 WHERE REGEXP_LIKE(LNAME,'[i-k][a-e]','i')

The results are much smaller. "KEARNS" is returned as it has a "K" followed by an "E". "Nicholas" as the "i" is followed by "c".

Person   First    Last
id       name     name
    16   MICK     KEARNS
    20   Brian    Nicholas

If I want to check the variable I am testing starts with a letter I use the caret ( ^ ) symbol.

SELECT * FROM PERSON
 WHERE REGEXP_LIKE(LNAME,'^b','i')

Which will give me.

Person   First     Last
id       name      name
     3   don       bennett
     4   DIETMAR   BRUCK

The dollar ( $ ) character is used to indicate that I want to look for the last character in the variable.

SELECT * FROM PERSON
 WHERE REGEXP_LIKE(LNAME,'t$|r$','i')

I used the or ( | ) I get all of players whose name ends in "T" or "R".

Person   First    Last
id       name     name
     3   don      bennett
     8   Ron      Framer
    10   RON      HEWITT
    13   COLIN    HOLDER
    17   Frank    Kletzenbauer

I can also search for patterns. For example a vowel, followed by any character, and then another vowel. To indicate the "any character" I use a period/full stop ( . ).

SELECT * FROM PERSON
 WHERE REGEXP_LIKE(LNAME,'[aeiou].[aeiou]','i')

Which returns the following.

Person   First     Last
id       name      name
     5   Brian     Caine
     7   ALAN      DALEY
     8   Ron       Framer
    10   RON       HEWITT
    15   eric      jones
    17   Frank     Kletzenbauer
    18   arthur    lightening
    20   Brian     Nicholas

And I just add another period/full stop if I want a list of all the players with vowels separated by two characters.

SELECT * FROM PERSON
 WHERE REGEXP_LIKE(LNAME,'[aeiou]..[aeiou]','i')

Which gives me.

Person   First    Last
id       name     name
     1   ROBERT   ALLEN
     2   John     Austin
     3   don      bennett
     5   Brian    Caine
     6   george   curtis
     9   lol      harvey
    13   COLIN    HOLDER
    14   Stuart   Imlach
    17   Frank    Kletzenbauer
    20   Brian    Nicholas
    25   BOB      WESSON

In this last example I want to list all of the players who have two "T" in their last name. I could have "tt" in my search patter, but there is another way. If I put the character(s) I want to search for in square brackets and the number of occurrences in curly brackets ( { } ).

SELECT * FROM PERSON
 WHERE REGEXP_LIKE(LNAME,'[t]{2}','i')

Only two players have "TT" in their last names.

Person   First    Last
id       name     name
     3   don      bennett
    10   RON      HEWITT

There is a lot more you can do with the REGEXP_LIKE, this is just intended as a quick introduction. For more of what you can do with it click on the link below to the KnowledgeCenter page, below. Thankfully this is POSIX compliant, therefore, examples from other SQL databases that are also POSIX compliant can be used.

 

Other Db2 for i regular expressions:

 

You can learn more about the REGEXP_LIKE command from the IBM website here.

 

This article was written for IBM i 7.3 TR2 and 7.2 TR6.

7 comments:

  1. This is Great we just started an application using regular expression. *Note we are still on 7.1 and this is working for us.
    Also house keeping notes.
    Your table is using MYLIB but your labels are using QTEMP.
    Also your blog is only showing 3 inserts but your examples are using other data.
    **else this is a great resource looking forward to more from this powerful tool.

    ReplyDelete
    Replies
    1. Good catch! And a big oops on my part.

      All of the references to QTEMP have been removed.

      I only gave three of the insert statements just as an example. Trust me the other 22 looked pretty much the same, so I did not want to bother to take up space above.

      Thanks

      Delete
  2. This is a great tool for pattern matching!

    ReplyDelete
  3. Thanks for your article. As an SQL novice, I found it very helpful.




    ReplyDelete
  4. Excellent article. Investigating the topic further, there are three other services that use REGEXP: SUBSTR, SCAN and REPLACE which will allow REGEXP functionality in other SQL clauses as well. For example this allowed me to find the top email domains of our customers or at least got me started:

    SELECT REGEXP_SUBSTR(CONTACTFORM , '@...*') ,count(*)
    FROM CUSTOMERS
    WHERE REGEXP_LIKE(CONTACTFORM ,'@','i')
    and DELT ='N'
    group by REGEXP_SUBSTR(CONTACTFORM , '@...*')
    order by REGEXP_SUBSTR(CONTACTFORM , '@...*')

    ReplyDelete
    Replies
    1. All of those other REGEXP functions will be covered in future posts.

      Delete
  5. This an excellent, clear article, Simon. Thank you.

    ReplyDelete

To prevent "comment spam" all comments are moderated.
Learn about this website's comments policy here.

Some people have reported that they cannot post a comment using certain computers and browsers. If this is you feel free to use the Contact Form to send me the comment and I will post it for you, please include the title of the post so I know which one to post the comment to.