Wednesday, July 19, 2017

Finding the number of times a character is in a field using SQL

using sql to count the number of times a character is in a field

I needed to determine, first, if a certain character was present in a file's field, and then how many times it occurs. Using RPG's file input I would not be able to do this without reading and checking every record's field. Could this be done easier in SQL? Yes.

I am going to give examples of various ways of being able to tell if a certain character is in a field in a file, using increasing complex methods I found. At first I will just give the SQL Select statement for each, and at the end the SQLRPGLE code I created to do this in a program.

Regular readers of this blog will not be surprised that my data is held in a file called TESTFILE, and it contains just one field F1. If I run a very simple Select statement over this file I can see what rows/records are in the file.

01  SELECT F1 FROM TESTFILE

it would be hard to come up with a simpler statement than that. When executed I see:

F1
SIMON
EDWARD
SUSAN
SIMMON
SIMOM
MURRAY
---------M

I want to select only those rows/records that contain the letter M. By using the WHERE clause with a wild card I can select all of those rows/records.

01  SELECT F1 FROM TESTFILE
02   WHERE F1 LIKE '%M%'

Line 2: The wild card character in SQL is the percent symbol ( % ). When I place a percent symbol either side of the M it will return the rows/records that contains M anywhere in the field.

F1
SIMON
SIMMON
SIMOM
MURRAY
---------M

When searching the first thing I found was how to determine the first position of the character in the field. I know this was not relevant to the task at hand, but I could resist trying it out.

The LOCATE scalar function returns the first place the search character is found in the column/field. The format for this scalar function is:

LOCATE(<string to search for>, <field or column>)
  <starting position>

In my example I want to return the first place an M is found in the field:

01  SELECT F1,LOCATE('M',F1) AS FIRST_POSITION
02    FROM TESTFILE
03   WHERE LOCATE('M',F1) > 0

Line 1: I am defining a LOCATE as one of the columns in the returned results, and giving that column a meaningful name. I do not need a starting position here.

Line 3: This time I do need to define a start position for the scalar function. Here I am checking for anywhere in the field, i.e. greater than zero.

The results are:

F1          FIRST_POSITION
SIMON                    3
SIMMON                   3
SIMOM                    3
MURRAY                   1
---------M              10

If I change the WHERE to only look in the third position of the file:

01  SELECT F1,LOCATE('M',F1) AS FIRST_POSITION
02    FROM TESTFILE
03   WHERE LOCATE('M',F1) = 3

The results are:

F1          FIRST_POSITION
SIMON                    3
SIMMON                   3
SIMOM                    3

What if I change the starting position of the search to greater than three.

01  SELECT F1,LOCATE('M',F1) AS FIRST_POSITION
02    FROM TESTFILE
03   WHERE LOCATE('M',F1) > 3

The results are:

F1          FIRST_POSITION
---------M              10

I have to admit the first time I saw the result I was surprised, until I remembered only rows/records would be returned if the first occurrence of the character in the column/field is in a position greater than three. This row/record was returned as it is the only one where the first M occurs in a place greater than the third.

Returning to task at hand, finding the rows/records with my desired character and the number of times that character occurs in the field. I am going to show the Select statement I came up with and then describe scalar functions I used.

01  SELECT LENGTH(F1) - LENGTH(REPLACE(F1,'M','')),F1
02         LENGTH(F1),LENGTH(REPLACE(F1,'M','')) 
03    FROM TESTFILE
04   WHERE LENGTH(F1) - LENGTH(REPLACE(F1,'M','')) > 0

Let me start with the REPLACE. As its name suggests it replaces one character in a field with another.

REPLACE(<column or field>, <character>, <replacement character>)

In my Select statement I am replace ever occurrence of M with null. Null is indicated by the two apostrophes ( ' ) without anything between them.

The first LENGTH functions returns the length of the field F1, 10. The second LENGTH will return a value of less than 10 as the Ms have replaced by nulls decreasing the length of F1. I know it sounds a bit strange, which is why I included these two values as columns in the results, line 2.

Line 4: I am only going to return any rows/records where the difference between the length of F1 and the value of the length when the Ms have been replaced by null is greater than zero. The more Ms there are the greater the difference, which, of course, corresponds to the number of Ms.

Numeric Expression  F1          LENGTH ( F1 )          LENGTH
               1    SIMON                  10               9
               2    SIMMON                 10               8
               2    SIMOM                  10               8
               1    MURRAY                 10               9
               1    ---------M             10               9

Now I have the SQL statement how can I put it in a SQLRPGLE program?

In this program I am going to make the character that is selected and counted be a variable. This could be passed to a program, subprocedure, etc. I am just going to show this code in its simplest form.

01  **free
02  dcl-s Char char(1) inz('M') ;
03  dcl-s SQL char(300) ;

04  exec sql SET OPTION COMMIT = *NONE ;

05  exec sql DROP TABLE QTEMP.OUTPUT ;

06  SQL = 'CREATE TABLE QTEMP.OUTPUT AS +
07         (SELECT F1, +
08                 LENGTH(F1) - LENGTH(REPLACE(F1,''' + Char +
09                                             ''','''')) +
10                 AS NBR_OF_CHAR +
11            FROM TESTFILE +
12           WHERE LENGTH(F1) - LENGTH(REPLACE(F1,''' + Char +
13                                             ''','''')) > 0) +
14           WITH DATA ' ;

15  exec sql EXECUTE IMMEDIATE :SQL ;

16  *inlr = *on ;

Line 1: As this is on an IBM i running 7.3 why would I not use totally free RPG.

Line 2: This is the definition of the variable I will be using for the search character.

Line 3: This variable will be used to contain the string for the SQL Select statement.

Line 4: I do not want to have commitment control on the file I will be creating. By using the SET OPTION it is turned off.

Line 5: I always like to drop/delete, or try to drop/delete, the output table/file before I create it.

Lines 6 – 14: This is the SQL statement that uses the Select Statement I developed before, to generate a table/file in QTEMP.

Line 8 and 9: Yes, lots of apostrophes. Three either side of the variable Char, and two either side of the null value. I have to do the same on lines 12 and 13.

line 14: Need the WITH DATA otherwise my table/file will be empty.

If I use debug and look at the value in SQL I can see that all the apostrophes are all aligned correctly.

SQL =
       ....5...10...15...20...25...30...35...40...45...50...55...60 
  1   'CREATE TABLE QTEMP.OUTPUT AS (SELECT F1, LENGTH(F1) - LENGTH'
 61   '(REPLACE(F1,'M','')) AS NBR_OF_CHAR FROM TESTFILE WHERE LENG'
121   'TH(F1) - LENGTH(REPLACE(F1,'M','')) > 0) WITH DATA          '
181   '                    '

Line 15: I am using the EXECUTE IMMEDIATE to execute the SQL command string in the SQL variable.

When the program is compiled and run it generates a table/file in QTEMP, that contains the following:

F1            NBR_OF_CHAR
SIMON                   1
SIMMON                  2
SIMOM                   2
MURRAY                  1
---------M              1

 

You can learn more about these scalar functions on the IBM website page about SQL scalar functions here.

 

This article was written for IBM i 7.3, and should work for earlier releases too.

4 comments:

  1. another great post, thanks for all of your tips/techniques

    ReplyDelete
  2. Hi Simon good article, like always, however I think this can be accomplished easier using regular expression like this
    SELECT REGEXP_COUNT(F1,'M',1) FROM T1, of course there are many way to skin a cat.

    ReplyDelete
  3. Oh I forgot to say with regular expression count you can even use ignore case like this.
    SELECT REGEXP_COUNT(F1,'M',1,'i') FROM TESTFILE

    ReplyDelete
  4. I was just going to suggest the regexp scalar functions of DB2, but looks like Jose beat me to it! Discovered them recently and was pleasantly surprised. Finding support for regex anywhere is always a plus!

    ReplyDelete

To prevent "comment spam" all comments are moderated.
Learn about this website's comments policy here.

Some people have reported that they cannot post a comment using certain computers and browsers. If this is you feel free to use the Contact Form to send me the comment and I will post it for you, please include the title of the post so I know which one to post the comment to.