Wednesday, February 1, 2023

Finding all the fields that contain certain characters

The germ for the idea for this post came from a question I saw in a Facebook Group. The question was how to identify records in a file where a particular field contains "special" characters, using SQL.

As the question was specially about a DDS file, I created a file, TESTFILE, with one field, FIELD1. I am not going to give the code for this file, just its contents. I would use the following SQL statement to list the contents of this file:

SELECT FIELD1 FROM TESTFILE

The results from this file are:

FIELD1
--------
abcde
ABCDE
12345
ABC 123
A@
B#
4$
!@#$%^&*

I have used the REGEXP_LIKE scalar function to return whether a character, or string of characters, is in the field. In this first statement I am going to return whether the field contains one of the characters I am testing for.

01  SELECT FIELD1,
02         REGEXP_LIKE(FIELD1,'[!@#$%^&*()_+]')
03    FROM TESTFILE

Line 2: I am using just two of the possible parameters of REGEXP_LIKE:

  1. Name of the field
  2. List of the characters to test for

As I want to check for a list of possible characters, I list them within square brackets ( [ ] ).

Two columns are returned:

  1. Contents of FIELD1
  2. Whether the field contains one of characters I am testing for

Which returns:

FIELD1     00002
--------   -----
abcde      false
ABCDE      false
12345      false
ABC 123    false
A@         true
B#         true
4$         true
!@#$%^&*   true

If I wanted to return only those records that has at least one of the characters I am testing for in FIELD1, I would use:

01  SELECT FIELD1 FROM TESTFILE
02   WHERE REGEXP_LIKE(FIELD1,'[!@#$%^&*()_+]')

Line 2: In this statement the REGEXP_LIKE is in the Where clause.

The results are:

FIELD1
---------
A@
B#
4$
!@#$%^&*

If I wanted the opposite, a list of records without any of those test characters in FIELD1, I would make one simple change.

01  SELECT FIELD1 FROM TESTFILE
02   WHERE NOT REGEXP_LIKE(FIELD1,'[!@#$%^&*()_+]')

Line 2: I have added the word NOT to the Where clause, to reverse its meaning from the previous statement.

The results are:

FIELD1
---------
abcde
ABCDE
12345
ABC 123

What if I wanted to return the records that contain characters other than A-Z and 0-9?

If I used the method I have described above I would need to list every character that is not in the desired ranges. Which would be a laborious process. There must be an alternative.

Fortunately there is, I can use wildcard and "or" in the list of characters:

01  SELECT FIELD1 FROM TESTFILE
02   WHERE NOT REGEXP_LIKE(FIELD1,'.[A-Z]|[0-9]')

Let me describe which is meant in the REGEXP_LIKE values.

  • . = The dot character is used as the wildcard character. This allows for a match to a character anywhere in FIELD1
  • [A-Z] = Range of values from A to Z
  • | = The pipe symbol is used to denote "or"
  • [0-9] = Range of values from 0 to 9

The results from this statement are:

FIELD1
---------
abcde
A@
B#
!@#$%^&*

All of the Regular Expressions are case sensitive, therefore, the first returned row is considered invalid as the characters are all lower case.

I consider 'abcde' to be valid, therefore, I need to add a third parameter to REGEXP_LIKE:

01  SELECT FIELD1 FROM TESTFILE
02   WHERE NOT REGEXP_LIKE(FIELD1,'.[A-Z]|[0-9]','i')

Line 2: The 'i' character tells REGEXP_LIKE to ignore case.

This change means that the record that contains 'abcde' is not consider invalid:

FIELD1
---------
A@
B#
!@#$%^&*

If you think the third parameter of REGEXP_LIKE is too cryptic for others to follow I can simply use the UPPER scalar function in its place.

01  SELECT FIELD1 FROM TESTFILE
02   WHERE NOT REGEXP_LIKE(UPPER(FIELD1),'.[A-Z]|[0-9]')

Which produces the same results as the prior statement.

I have found the above very useful when scanning strings coming from non-IBM i sources for invalid characters before moving the data into the database.

 

This article was written for IBM i 7.5, and should work for some earlier releases too.

4 comments:

  1. Hi Simon, very informative as always. In the example with the WHERE clause as WHERE NOT REGEXP_LIKE(FIELD1,'.[A-Z]|[0-9]'), shouldn't the value of 4$ also be listed in the results?

    ReplyDelete
    Replies
    1. I had to look up what "4$" stands for, local currency symbol.
      In the original question the person did not want currency symbols either.

      But this is a useful code to know and to use in future examples.

      Thanks

      Delete
  2. thanks for sharing Simon

    ReplyDelete
  3. Thanks for sharing Simon Hutchinson , This particular scalar function is handy.

    ReplyDelete

To prevent "comment spam" all comments are moderated.
Learn about this website's comments policy here.

Some people have reported that they cannot post a comment using certain computers and browsers. If this is you feel free to use the Contact Form to send me the comment and I will post it for you, please include the title of the post so I know which one to post the comment to.