Pages

Wednesday, December 13, 2023

Obscuring data in a column with SQL

I am not sure if this is the most accurate title for what I want to do, obscure part of an email address as a hint for someone trying to signon to a web faced application. If someone has forgotten the email address they used to register, part of the process is to present them with a hint of the email address they used to sign up for the service. It obscures part of the username of the email address. For example, the email address:

john.smith@gmail.com

Would be obscured to become:

jXXXXXXXXX@gmail.com

At present this is being handled within a RPG program. But as part of the modernize drive the idea is to have this information available in a SQL View. When this was brought up in a meeting one of the programmers suggested creating an User Defined Function, UDF, that would still be a RPG program to do this. I explained that was not necessary as it could be done completely within SQL. This post describes my recommendation.

Before I start showing code I want to show that I have a small file with three email addresses that I want to obscure:

01  SELECT EMAIL_ADDRESS FROM TESTFILE


EMAIL_ADDRESS
--------------------------
john.smith@gmail.com
marymoore456@hotmail.com
LIZA_MADISON@CORPORATE.NET

Below is similar to the RPG program that proposed to become the basis for the UDF:

01  **free
02  dcl-f TESTFILE alias ;

03  dcl-s Position int(5) ;
04  dcl-s Obscured like(EMAIL_ADDRESS) ;

05  dcl-c FromChars 'abcdefghijklmnopqrstuvwxyz1234567890._' ;
06  dcl-c ToChars 'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX' ;

07  dow (*on) ;
08    read TESTFILE ;
09    if (%eof) ;
10      leave ;
11    endif ;

12    Position = %scan('@' : EMAIL_ADDRESS : 1) ;

13    Obscured =
14      %subst(EMAIL_ADDRESS : 1 : 1) +
15      %xlate(FromChars : ToChars :
16        %lower(%subst(EMAIL_ADDRESS : 2 : Position - 2))) +
17      %subst(EMAIL_ADDRESS : Position) ;
18  enddo ;

Line 1: If it is not free then it cannot be modern RPG.

Line 2: I have defined the File or Table to use the alias names for the fields or columns. I do this as I prefer to use the long name, rather than short system names.

Line 3: I am defining an unsigned integer variable to contain where in the email address the at sign is ( @ ).

Line 4: This variable is defined to be the same as the EMAIL_ADDRESS from the File or Table.

Lines 5 and 6: I need these two constants that I will be using in the program.

Line 7: This is a never-ending loop. The loop does not end. I have to leave it to get out of it.

Lines 8 - 11: I read the input File or Table. If "end of file" is returned I leave the loop.

Line 12: I want to have the position of the at symbol in the email address in its own variable as I will be using this number in more than one place.

Lines 13 – 17: This is where I obscure the email address.

Line 14: I want the first character of the email to not be obscured, which I concatenate to…

Lines 15 and 16: I am using the %XLATE built in function, BiF, to translate the character in the username to the be "X". The first of the parameters is the constant of lower-case characters, numbers, and special characters I am going to be translating from. Which is followed by the constant of all "X", as I must have one "X" for every character in from constant. Those are followed by the value to be translated. What I have done might look complicated, but it is not. I have used a substring to extract from the username part of the email address from the second to the character to the one before the at sign. Then convert that string to lower case. I decided to convert the string to lower case as then I would only need to have all the lower-case letters in the From Characters constant.

I finally concatenate the at sign and domain name to the email address.

When the program is run I get the following results in the Obscured variable:

jXXXXXXXXX@gmail.com 
mXXXXXXXXXXX@hotmail.com  
LXXXXXXXXXXX@CORPORATE.NET

How could I do the same in SQL? Let me start showing how I would determine the parts before coming to the final statement.

Let me start with determining where the at sign is:

01  SELECT EMAIL_ADDRESS,
02         LOCATE_IN_STRING(EMAIL_ADDRESS, '@')
03  FROM TESTFILE

Line 2: I am using the LOCATE_IN_STRING to find and return the location of the at sign.

The results are:

EMAIL_ADDRESS               00002 
--------------------------  ------
john.smith@gmail.com            11
marymoore456@hotmail.com        13
LIZA_MADISON@CORPORATE.NET      13

Next step is to extract the string I will be changing, the piece from the second character to just before the at sign.

01  SELECT EMAIL_ADDRESS,
02         LOCATE_IN_STRING(EMAIL_ADDRESS, '@'),
03         SUBSTR(EMAIL_ADDRESS, 2,(LOCATE_IN_STRING(EMAIL_ADDRESS, '@') - 2))
04  FROM TESTFILE

Line 3: I substring from the email address, starting in the second position, to the place before the at sign.

My results are:

EMAIL_ADDRESS               00002   00003 
--------------------------  ------  -----------
john.smith@gmail.com            11  ohn.smith
marymoore456@hotmail.com        13  arymoore456
LIZA_MADISON@CORPORATE.NET      13  IZA_MADISON

In the statement below I bring the parts from the above and combine them into a new statement to create the obscured email address:

01  SELECT EMAIL_ADDRESS,
02         SUBSTR(EMAIL_ADDRESS, 1, 1) ||
03         REGEXP_REPLACE(LOWER(SUBSTR(EMAIL_ADDRESS, 2,
                      (LOCATE_IN_STRING(EMAIL_ADDRESS, '@') - 2))),
                      '[a-z]|[0-9]|[._]', 'x')
04        || SUBSTR(EMAIL_ADDRESS, LOCATE_IN_STRING(EMAIL_ADDRESS, '@'))
05  FROM TESTFILE

Line 2: I use the substring function to extract the first character of the username of the email address, and by using the double pipe characters ( || ) concatenate it to…

Line 3: I am using the REGEXP_REPLACE function to do the obscuring of the part I want.

As I did in the RPG program I extract the string using the substring function and then convert it to lower case using the LOWER function.

On the third line is the part that tells REGEXP_REPLACE what to replace and with what. The first part instructs the function to replace the range: ("a" – "z") or ("0" – "9") or (period, underscore) with value in the next parameter. This happens to be the letter "x".

Line 4: I concatenate all of the above with remainder of the email address, from the at sign and whatever follows.

My results:

EMAIL_ADDRESS               00002
--------------------------  --------------------------
john.smith@gmail.com        jxxxxxxxxx@gmail.com
marymoore456@hotmail.com    mxxxxxxxxxxx@hotmail.com
LIZA_MADISON@CORPORATE.NET  Lxxxxxxxxxxx@CORPORATE.NET

But this is not a View, which what was wanted. I would define my View as:

01 CREATE OR REPLACE VIEW QTEMP.EMAIL_ADDRESSES
02  (EMAIL_ADDRESS,OBSCURED_EMAIL_ADDRESS)
03  AS
04  (SELECT EMAIL_ADDRESS,
05         SUBSTR(EMAIL_ADDRESS, 1, 1) ||
06         REGEXP_REPLACE(LOWER(SUBSTR(EMAIL_ADDRESS, 2,
                      (LOCATE_IN_STRING(EMAIL_ADDRESS, '@') - 2))),
                      '[a-z]|[1-9]|[._]', 'x') ||
07         SUBSTR(EMAIL_ADDRESS, LOCATE_IN_STRING(EMAIL_ADDRESS, '@'))
08  FROM TESTFILE)

Line 1: My View will be called EMAIL_ADDRESSES.

Line 2: This is a very small View with just two columns: the email address and the obscured email address.

Lines 4 – 8: The same as the previous SQL statement.

When the View has been created I can just use the obscured email address straight from the View.

SELECT * FROM EMAIL_ADDRESSES


EMAIL_ADDRESS               OBSCURED_EMAIL_ADDRESS
--------------------------  --------------------------
john.smith@gmail.com        jxxxxxxxxx@gmail.com
marymoore456@hotmail.com    mxxxxxxxxxxx@hotmail.com
LIZA_MADISON@CORPORATE.NET  Lxxxxxxxxxxx@CORPORATE.NET

By the end of the meeting the others agreed that it was better "to do it all in SQL", rather than use the RPG program UDF.

 

This article was written for IBM i 7.5, and should work for some earlier releases too.

4 comments:

  1. What if there are special characters? Instead of translating specific characters, it would be better to catenate the required count of 'X's.

    ReplyDelete
    Replies
    1. What kind of special characters are you thinking of?
      Any valid special characters could be added to the FROM_CHAR constant.

      Delete
  2. Further to my previous comment, for better obscuration, inserting a set amount of 'X's instead of the exact count of letter being replaced. Eg. Regardless of the length of the email address, you could have the first letter plus something like 10 Xs, then the .domain.

    ReplyDelete
  3. To mask at the database level (so it also applies to ODBC or file transfer) you can use sql masks. But be careful, once masked if one of the conditions is not true it is no longer possible to know the value of the field.

    CREATE MASK EMAILMASK ON TESTFILE
    FOR COLUMN EMAIL_ADDRESS RETURN
    CASE WHEN (SESSION_USER = 'USER1') THEN
    EMAIL_ADDRESS
    WHEN (SESSION_USER = 'USER2') THEN
    'XXX-XXX' || SUBSTR(EMAIL_ADDRESS, 8, 43)
    WHEN (EMAIL_ADDRESS > 'abc') THEN
    'YYYYYYYYYY' || SUBSTR(EMAIL_ADDRESS, 11, 40)
    ELSE ' '
    END

    ReplyDelete

To prevent "comment spam" all comments are moderated.
Learn about this website's comments policy here.

Some people have reported that they cannot post a comment using certain computers and browsers. If this is you feel free to use the Contact Form to send me the comment and I will post it for you, please include the title of the post so I know which one to post the comment to.