Wednesday, February 1, 2023

Finding all the fields that contain certain characters

The germ for the idea for this post came from a question I saw in a Facebook Group. The question was how to identify records in a file where a particular field contains "special" characters, using SQL.

As the question was specially about a DDS file, I created a file, TESTFILE, with one field, FIELD1. I am not going to give the code for this file, just its contents. I would use the following SQL statement to list the contents of this file:

SELECT FIELD1 FROM TESTFILE

The results from this file are:

FIELD1
--------
abcde
ABCDE
12345
ABC 123
A@
B#
4$
!@#$%^&*

I have used the REGEXP_LIKE scalar function to return whether a character, or string of characters, is in the field. In this first statement I am going to return whether the field contains one of the characters I am testing for.

01  SELECT FIELD1,
02         REGEXP_LIKE(FIELD1,'[!@#$%^&*()_+]')
03    FROM TESTFILE

Line 2: I am using just two of the possible parameters of REGEXP_LIKE:

  1. Name of the field
  2. List of the characters to test for

As I want to check for a list of possible characters, I list them within square brackets ( [ ] ).

Two columns are returned:

  1. Contents of FIELD1
  2. Whether the field contains one of characters I am testing for

Which returns:

FIELD1     00002
--------   -----
abcde      false
ABCDE      false
12345      false
ABC 123    false
A@         true
B#         true
4$         true
!@#$%^&*   true

If I wanted to return only those records that has at least one of the characters I am testing for in FIELD1, I would use:

01  SELECT FIELD1 FROM TESTFILE
02   WHERE REGEXP_LIKE(FIELD1,'[!@#$%^&*()_+]')

Line 2: In this statement the REGEXP_LIKE is in the Where clause.

The results are:

FIELD1
---------
A@
B#
4$
!@#$%^&*

If I wanted the opposite, a list of records without any of those test characters in FIELD1, I would make one simple change.

01  SELECT FIELD1 FROM TESTFILE
02   WHERE NOT REGEXP_LIKE(FIELD1,'[!@#$%^&*()_+]')

Line 2: I have added the word NOT to the Where clause, to reverse its meaning from the previous statement.

The results are:

FIELD1
---------
abcde
ABCDE
12345
ABC 123

What if I wanted to return the records that contain characters other than A-Z and 0-9?

If I used the method I have described above I would need to list every character that is not in the desired ranges. Which would be a laborious process. There must be an alternative.

Fortunately there is, I can use wildcard and "or" in the list of characters:

01  SELECT FIELD1 FROM TESTFILE
02   WHERE NOT REGEXP_LIKE(FIELD1,'.[A-Z]|[0-9]')

Let me describe which is meant in the REGEXP_LIKE values.

  • . = The dot character is used as the wildcard character. This allows for a match to a character anywhere in FIELD1
  • [A-Z] = Range of values from A to Z
  • | = The pipe symbol is used to denote "or"
  • [0-9] = Range of values from 0 to 9

The results from this statement are:

FIELD1
---------
abcde
A@
B#
!@#$%^&*

All of the Regular Expressions are case sensitive, therefore, the first returned row is considered invalid as the characters are all lower case.

I consider 'abcde' to be valid, therefore, I need to add a third parameter to REGEXP_LIKE:

01  SELECT FIELD1 FROM TESTFILE
02   WHERE NOT REGEXP_LIKE(FIELD1,'.[A-Z]|[0-9]','i')

Line 2: The 'i' character tells REGEXP_LIKE to ignore case.

This change means that the record that contains 'abcde' is not consider invalid:

FIELD1
---------
A@
B#
!@#$%^&*

If you think the third parameter of REGEXP_LIKE is too cryptic for others to follow I can simply use the UPPER scalar function in its place.

01  SELECT FIELD1 FROM TESTFILE
02   WHERE NOT REGEXP_LIKE(UPPER(FIELD1),'.[A-Z]|[0-9]')

Which produces the same results as the prior statement.

I have found the above very useful when scanning strings coming from non-IBM i sources for invalid characters before moving the data into the database.

 

This article was written for IBM i 7.5, and should work for some earlier releases too.

3 comments:

  1. Hi Simon, very informative as always. In the example with the WHERE clause as WHERE NOT REGEXP_LIKE(FIELD1,'.[A-Z]|[0-9]'), shouldn't the value of 4$ also be listed in the results?

    ReplyDelete
    Replies
    1. I had to look up what "4$" stands for, local currency symbol.
      In the original question the person did not want currency symbols either.

      But this is a useful code to know and to use in future examples.

      Thanks

      Delete
  2. thanks for sharing Simon

    ReplyDelete

To prevent "comment spam" all comments are moderated.
Learn about this website's comments policy here.

Some people have reported that they cannot post a comment using certain computers and browsers. If this is you feel free to use the Contact Form to send me the comment and I will post it for you, please include the title of the post so I know which one to post the comment to.