Wednesday, March 4, 2015

Checking for Mixed Case in SQL

convert to upper case using sql

Most of the files and tables I use contain data that is only in upper case. Occasionally I have to work with files that can contain data in mixed case, both upper and lower, or data from our overseas subsidiaries that can contain special characters: accents, umlauts, circumflexes, etc. This does present me with an issue when I need to search data in those files. I mainly use SQL to search and extract from files, and while I can use a LIKE in the SQL statement it does still compare like to like, where 'A' is not the same as 'a'.

The simple solution would appear to be to convert the incoming field/column into upper case when performing the comparison, but I have Theo Kouwnehoven to thank for an ever simpler and better solution.

For this example I have a file, TESTFILE, that contains a field FRUIT.

  A          R TESTFILER
  A            FRUIT         15

The field contains repeating names of fruits in mixed case and, as you can see in record 9, it can contain special characters:

     FRUIT
     ---------------
  01 peach
  02 APPLE
  03 PineApple
  04 Pear
  05 pear
  06 PEACH
  07 BLUEBERRY
  08 blackberry
  09 pÉæR
  10 mangO

I could just cope with the translation of lower to upper case using the UPPER function, which is the same as UCASE.

  SELECT FRUIT FROM TESTFILE
         WHERE UPPER(FRUIT) LIKE 'PE%'
         ORDER BY UPPER(FRUIT)

This would give me the following output:

  FRUIT
  ---------------
  peach
  PEACH
  Pear
  pear

Using the UPPER function will result in a slight increase in the overhead used by this program, as each time a record is fetched the field FRUIT goes through the translation process. It also missed "pÉæR".

Theo Kouwnehoven came up with another solution: to change the sort sequence to *LANGIDSHR. IBM defines the sort sequence *LANGIDSHR as using "A shared-weight sort table". What this means is that it treats upper and lower case as the same, and will treat the special characters as their non-special equivalent. For example: "É" will be sorted as "E".

If I was to embed this into an RPG program it could look like:

01  dcl-s wkFRUIT char(15) ;

02  exec sql SET OPTION SRTSEQ = *LANGIDSHR,
                         COMMIT = *NONE ;

03  exec sql DECLARE C1 CURSOR FOR
                   SELECT FRUIT FROM TESTFILE
                          WHERE FRUIT LIKE 'PE%'
                          ORDER BY FRUIT
                          FOR READ ONLY ;
04  exec sql OPEN C1 ;

05  dow (1 = 1) ;
06    exec sql FETCH C1 INTO :wkFruit ;
07    if (SQLCOD <> 0) ;
08      leave ;
09    endif ;

10    dsply wkFruit ;
11  enddo ;

12  *inlr = *on ;
13  exec sql CLOSE C1 ;

I am not going to give an explanation on how to read a file in SQL using a cursor and fetch here. I will explain the pertinent parts of this example.

On line 2 I am using the SET OPTION to set the sort sequence to *LANGIDSHR and not to use commitment control.

I do not have to do any translation to the field FRUIT in the SELECT statement on line 3.

The FETCH on line 6 is like a read.

The output from the DSPLY would look like:

  DSPLY  peach
  DSPLY  PEACH
  DSPLY  Pear
  DSPLY  pear
  DSPLY  pÉæR

This time "pÉæR" is included.

If you want to change the sort sequence in interactive SQL, STRSQL command:

  1. At the "Enter SQL Statements" screen press F13 (F13=Services)
  2. At the menu select option 1 (1. Change session attributes)
  3. Page down to the second screen.
  4. Fourth field down is "Sort sequence", change the value to *LANGIDSHR.
  5. Press Enter twice to update and return to the "Enter SQL Statements" screen.

 

You can learn more about SQL's SET OPTION on the IBM website here.

 

This article was written for IBM i 7.2, and it should work with earlier releases too.

4 comments:

  1. Well done, amazing power in that sql

    ReplyDelete
  2. Using sort sequences is great, but it is an all or nothing option, i.e. if changed it is used for all columns in all files.
    When using scalar functions you can use different sequences.

    In either way to get the best performance, the right indexes must be created.

    When using a sort sequence only indexes created with this sortsequence can be used, otherwise table scans may be performed, which is no problem for 500 records but is a huge problem for 500 billion of records.

    To ge the best performance with scalar functions, you need to create a new drived index, i.e. with one or more new columns including the scalar function.

    Birgitta

    ReplyDelete
  3. Birgitta, are you suggesting that an index with an expression over that single field would be better than using *LANGIDSHR is there a scalar function that gives that result? An index with Upper would work except for the one missing value, it appears.

    ReplyDelete
  4. Its not a good habit to test SQLCOD with "SQLCOD <> 0"...it's safer to test it this way: SQLCOD < 0. This is because, as along as SQLCOD is positive, its fine...even a value of 100 (row not found) still show the embedded statement did work, even if it did not find a row. There are other embedded commands that return non-zero but positive values even when the command executed successfully.

    ReplyDelete

To prevent "comment spam" all comments are moderated.
Learn about this website's comments policy here.

Some people have reported that they cannot post a comment using certain computers and browsers. If this is you feel free to use the Contact Form to send me the comment and I will post it for you, please include the title of the post so I know which one to post the comment to.