Thursday, August 5, 2021

Finding records with a certain character in a field using SQL regex

regex for finding characters in a string

This question came from a colleague of mine. He was trying to identify which records from a DDS file contains one of several special characters within a particular field. To make it a bit more interesting these special characters could be in any position of the field, maybe the first position, last or in the middle. He had tried several SQL statements but had not managed to create one he felt comfortable with.

I am sure you can understand why I cannot share the file he was using. I have created my own file to illustrate the problem and my solution. I am sure you will not be surprised to find I called it TESTFILE.

To see the contents of the file I can use this simple SQL statement:

01  SELECT * FROM TESTFILE ;

And the following results are returned:

FLD001
------
 @
#
     &
   !
ABCDEF
@@@@@

As you can see I have various special characters scattered in various positions in the field FLD001.

My colleague had tried using a WHERE .. LIKE clause but it had become to messy for his liking. His problem was trying to find a way to do the equivalent of SQL's like with multiple values.

When he asked me my opinion I realized another approach that could be used was to use a SQL function that would return the position of a character in a string. If the character is found its position is returned, and if not zero is returned.

My first attempt was using the LOCATE_IN_STRING function:

01  SELECT FLD001, 
02         LOCATE_IN_STRING(FLD001,'@')
03    FROM TESTFILE ;

The results give me the first occurrence of the special character in the field:

FLD001     00002
---------  -----
 @             2
#              0
    &          0
   !           0
ABCDEF         0
@@@@@          1

This works fine for one character, it is not possible to do it for multiple characters.

Then I remember that there is a Regular Expression function, REGEXP_INSTR, that can be used to search strings for one of more characters. At times the syntax of the regex, Regular Expression, can be overwhelming, but in this case I think it is simple. In this example I only want to check for three of the special characters, and return the position they first are found in.

01  SELECT FLD001,
02         REGEXP_INSTR(FLD001,'[@,!,#]')
03    FROM TESTFILE ;

Line 2: By placing the special characters, separated by commas, within square brackets ( [ ] ) acts as list of values for the regex to test for. In this example I want to test for @, !, or # in FLD001 and return the first position they are found in. The results look like:

FLD001     00002
---------  -----
 @             2
#              1
    &          0
   !           4
ABCDEF         0
@@@@@          1

Now if I make the regex my WHERE clause I will only return the rows I desire:

01  SELECT FLD001
02    FROM TESTFILE
03   WHERE REGEXP_INSTR(FLD001,'[@,!,#]') > 0 ;

Line 3: As I demonstrated above if one of the desired characters is found in the field then the regex expression returns a value that corresponds to its place in the string. If none of the characters are found then a value of zero is returned. By testing if the returned value is greater than zero then I will only have rows with these special characters in them returned to me.

FLD001
------
 @
#
   !
@@@@@

My colleague was impressed how simple the statement looked compared to his, and took this to use in his program.

 

This article was written for IBM i 7.4, and should work for some earlier releases too.

5 comments:

  1. Simon, you show a regex of [@,!,#], I'm thinking you really want [@!#].
    What you have written includes the comma in the set to search for.

    ReplyDelete
  2. I found that to be true as well. don't use commas unless your searching for them.

    ReplyDelete
  3. Thanks Simon, your your articles are clear and very helpful, really inspiring!

    ReplyDelete
  4. Coming in late as this article just popped for me in LinkedIn. To add to the first comment made by Unknown, you could also use the pipe as it represents an OR, so [@|!|#] or select testfile.fld001, regexp_instr( testfile.fld001, '[@|!|#]' ) from testfile;. The documentation for pattern expressions for V7R4 on IBM's website is https://www.ibm.com/docs/en/i/7.4?topic=predicates-regexp-like-predicate#rbafzregexp_like__regexp_likecontrol.

    I enjoy the REGEXP functions but find that when you are dealing with very large tables, it doesn't perform that well. If you are dealing with smaller tables, smaller being relative to the server configuration, it works great. If the search is relatively simple, as per this example, I'm throwing out another alternative method, TRANSLATE. This is what I used before IBM introduced the REGEXP functions and it performs way better on larger tables. TRANSLATE can convert a list of characters to 1 character and then you use LOCATE to find that 1 character. IBM's documentation - https://www.ibm.com/docs/en/i/7.4?topic=functions-translate

    select
    testfile.fld001
    from testfile
    where locate( '@'
    , translate( testfile.fld001
    , '@'
    , '@!#'
    , '@'
    )
    ) > 0;

    Ray Richert

    ReplyDelete
  5. Simon, great read and examples. This is a great function for researching . We have another tool in Qshell that will do the same GREP . Not only one lib but every lib on the box. Again, thanks for sharing.

    ReplyDelete

To prevent "comment spam" all comments are moderated.
Learn about this website's comments policy here.

Some people have reported that they cannot post a comment using certain computers and browsers. If this is you feel free to use the Contact Form to send me the comment and I will post it for you, please include the title of the post so I know which one to post the comment to.