Wednesday, August 18, 2021

CTE example: Calculating the maximum length of 80% of the rows in a column

common table express cte example

I don't have a good example of using a Common Table Expression, CTE, in this blog so I decided to write this one as an example of how I can determine the maximum length of 80% of the rows in a SQL Table.

It does sound a bizarre thing to calculate, but it was used in the post from last week where I discussed the fastest way to copy data from one file or table to another. IBM's recommendation was if you use a VARCHAR to define a column in a table using the ALLOCATE it could reduce the number of I/O operations that would be performed. The suggested allocation for any column is that the number of bytes allocated will include 80% of the rows in the table.

I decided to write the SQL statement I will show, below, to perform that calculation. I broke up what I wanted to do into "parts", this would allow me to test my results before going to the next "part".

The table I tested with was one of the ones I used in last week's post, TCHAR. All of its columns are defined as CHAR. I want to know how many rows that table contains, which I can simply so with the following statement:

SELECT TO_CHAR(COUNT(*),'999G999G999') FROM U4142TLS.TCHAR ;

The result is:

00001
---------
1,008,026

Why did I use the TO_CHAR function in my SQL statement?

TO_CHAR acts like the Edit Word keyword, EDTWRD, I can use when defining a field in a DDS file or the %EDTW built in function in RPG. I am using it to insert thousand separators into the returned number. I think that 1,008,026 are a lot easier to understand than just 1008026.

CTE is used to define multiple SQL statements that create tables in memory, that are combined into one set of results at the end of the statement.

This CTE statement will return the length of 80% of the rows from the column MSG_TEXT from the table TCHAR. I am going to show the entire statement below, and then I will "break it up" into its parts to show you what it is doing.

    -- Calculate the length of each row's column
01  WITH T1(T1_LENGTH)
02  AS (SELECT LENGTH(RTRIM(MSG_TEXT)) FROM U4142TLS.TCHAR),

    -- Summarize to give me the number or rows with each record length
03  T2 (T2_LENGTH,T2_COUNT)
04  AS (SELECT T1_LENGTH,COUNT(T1_LENGTH) 
05        FROM T1
06       GROUP BY T1_LENGTH
07       ORDER BY T1_LENGTH),

    -- Total the number of rows have this record length or a smaller record length
08  T3 (T3_LENGTH,T3_COUNT)
09  AS (SELECT A.T2_LENGTH,
10             (SELECT SUM(B.T2_COUNT) 
11                FROM T2 B
12                WHERE B.T2_LENGTH <= A.T2_LENGTH) 
13        FROM T2 A),

    -- Calculate what is 80% of all the records
14  T4 (TOT_COUNT,P80_COUNT)
15  AS (SELECT COUNT(*),INT((COUNT(*) * 0.8)) FROM T1)

    -- Final statement is what is displayed
16  SELECT T3_LENGTH,T3_COUNT 
17    FROM T3 
18   WHERE T3_COUNT <= (SELECT T4.P80_COUNT FROM T4) 
19   ORDER BY T3_LENGTH DESC
20   LIMIT 1 ;

When I create a CTE I always add a comment before each part to explain what it is doing.

The final result from this statement is:

T_LENGTH   T5_COUNT
--------   --------
     130     787868

The results tell me that the length to accommodate 80% of the data in the column MSG_TEXT is 130 characters.

What I like about CTE is that I can change the last statement to return results from any of the temporary tables, not just the last one. This is a wonderful feature that allows me to debug the CTE by checking the results from each part.

Let me start with the first part, the part that creates the temporary file T1:

    -- Calculate the length of each row's column
01  WITH T1(T1_LENGTH)
02  AS (SELECT LENGTH(RTRIM(MSG_TEXT)) FROM U4142TLS.TCHAR),

Line 1: All CTE start with the word WITH. My first temporary table is T1 and contains just one column, T1_LENGTH. I could have called the temporary table and the column name anything I liked, I just like to K.I.S.S.

Line 2: This is the SQL statement that builds that data that will be in T1. Here I want to determine the length of the string in MSG_TEXT. I am using the RTRIM to remove the trailing blanks from the string, before determining its length with the LENGTH function.

Notice that line 2 ends with a comma ( , ) I need this as there this is not the last part of this CTE statement.

I replace the last statement, on line 16, with:

    -- T1 results
16  SELECT * FROM T1 LIMIT 5 ;

Line 16: I want the first five results from the temporary table T1.

T1_LENGTH
---------
       90
       58
       90
       58
      130

The next part creates the temporary table T2, which contains a count of each record length.

    -- Summarize to give me the number or rows with each record length
03  T2 (T2_LENGTH,T2_COUNT)
04  AS (SELECT T1_LENGTH,COUNT(T1_LENGTH) 
05        FROM T1
06       GROUP BY T1_LENGTH
07       ORDER BY T1_LENGTH),

Line 3: As this is not the first table defined in the CTE I do not need the WITH. This temporary table, T2 contains two columns.

Lines 4 – 7: This is the SQL statement that builds T2. It uses the temporary file T1 that I defined in the previous statement. I use the GROUP BY and ORDER BY to produce a count for each length.

Notice that line 7 ends with a comma.

To see the results from this I need to replace the final statement, line 16, with:

-- T2 results
16  SELECT * FROM T2
17  ORDER BY 1
18  LIMIT 5 ;

Lines 16 – 18: I want to return both the columns from T2. Sort the results by the first column, T2_LENGTH, and only return the first five results.

The results are:

T2_LENGTH   T2_COUNT
---------   --------
        1        558
        4        360
        6         54
       16         72
       20         36

In the next part I want to produce a count for each length value that is a cumulative count. I am not sure if I am saying this right. For example: the length of 2 includes the counts of length 1 and 2. Count for length 3 includes the counts of lengths 1, 2, and 3. Count for length 4 includes the counts for lengths 1, 2, 3, and 4. And so on.

I did that with the following:

    -- Total the number of rows have this record length or a smaller record length
08  T3 (T3_LENGTH,T3_COUNT)
09  AS (SELECT A.T2_LENGTH,
10             (SELECT SUM(B.T2_COUNT) 
11                FROM T2 B
12                WHERE B.T2_LENGTH <= A.T2_LENGTH) 
13        FROM T2 A),

Line 8: The temporary table T3 is defined as having two columns.

Lines 9 – 13: Is the SQL statement that builds those columns.

Line 9: T2_LENGTH is simply taken from the column on T2 T2_LENGTH. I have prefixed this field with A to make it easier that this value is coming from t2 defines on like 13.

Lines 10 -12: The cumulative column, T3_COUNT is calculated by a subselect which also uses T2.

Line 13: There is the comma, indicating that there is more to come.

To check that the cumulative count is correct I can replace line 16, the final Select statement of the CTE with:

    -- T3 results
16  SELECT * FROM T3
17   ORDER BY 1
18   LIMIT 5 ;

This part of the statement just access the temporary table T3, sorts the results by the first column, T3_LENGTH, and just returns the first five results. That is all I need to check that my logic is valid.

T3_LENGTH   T3_COUNT
---------   --------
        1        558
        4        918
        6        972
       16       1044
       20       1080

The next part is where I calculate what is the value that is 80%. The calculation is simple, I need the total count of the rows in the table and determine 80% for that.

For this I go back and use the first temporary table T1.

    -- Calculate what is 80% of all the records
14  T4 (TOT_COUNT,P80_COUNT)
15  AS (SELECT COUNT(*),INT((COUNT(*) * 0.8)) FROM T1)

Line 14: The temporary table T4 contains two columns the total number of rows, and the number that is 80% of them.

Line 15: The statement to create the contents of T4 is simple enough. The COUNT is used for the first column to get the total number of rows. The second row takes the total number of rows and multiplies is by 0.8 to determine 80% of that count. In most cases the 80% will not be a whole number, as I cannot have a part of a row I am using the INT function to convert the result of that calculation to an integer, in other words removing the decimal values.

Line 15 is last table expression before the final Select statement, therefore, it does not have a comma.

I need to change line 16 to see these results:

    -- T4 results
16  SELECT * FROM T4 ;

The results, shown below confirm that the table has 1,008,026 rows. Multiplying that by 0.8 gives me: 806,420.8 . As I cannot have 0.8 of a row the integer function, INT, converts the calculated value to 806,420.

TOT_COUNT   P80_COUNT
---------   ---------
  1008026      806420

The last part to determine in which row is that 80% row. I know that there will not be an exact match. But which row is the one that is the 80% value or the next smallest?

I am going to replace line 16 with the following statement:

    -- T3 results again
16  SELECT * FROM T3
17   WHERE T3_COUNT > 750000
18   ORDER BY 1 ;

Line 17: The 80% number is 806,420 but I want to see what lengths are close to that. Here I want to list all of the rows where the count is greater than 750,000.

I am only going to show three of the returned results so that you can see that I am going to use the length 130 as my 80% value.

T3_LENGTH   T3_COUNT
---------   --------
      129     759158
      130     787868
      131     820592

All CTE end with a SQL statement that does not generate a temporary table. In this example this is my final statement:

    -- Final statement is what is displayed
16  SELECT T3_LENGTH,T3_COUNT 
17    FROM T3 
18   WHERE T3_COUNT <= (SELECT T4.P80_COUNT FROM T4) 
19   ORDER BY T3_LENGTH DESC
20   LIMIT 1 ;

Line 16 and 17: This CTE will return the length and the count, from T3.

Line 18: I only want to select those rows where the count is less than the 80% I calculated in T4.

Line 19 and 20: By sorting the results by the length in descending order and only returning one result I get:

T_LENGTH   T5_COUNT
--------   --------
     130     787868

When I want to calculate the 80% length for other columns in TCHAR all I need to do is to change the column name on line 2.

I have seen CTE insert or delete data from a table. I am sure you can come up with other uses once you get to play with them for a while.

 

This article was written for IBM i 7.4, and should work for some earlier releases too.

3 comments:

  1. Thanks for posting

    ReplyDelete
  2. CTE make SQL extremely powerful and makes big queries more readable

    By the way, I think the final SELECT as to be
    SELECT T3_LENGTH,T3_COUNT
    FROM T3
    WHERE T3_COUNT >= (SELECT T4.P80_COUNT FROM T4)
    ORDER BY T3_LENGTH
    LIMIT 1

    Else if you do
    with inp_data (vc) as (
    values '1', '1', '1', '1', '1', '1', '1', '999999999', '999999999', '999999999'
    ),
    T1(T1_LENGTH)
    AS (SELECT length(rtrim(vc)) FROM inp_data),
    and so on...

    you get 1 as length and 7 records among 10

    ReplyDelete
  3. Hello, thanks for the examples.
    But your requirements translate simply to get the 80 percentile of the distribution so can be solved by one statement, as follows.
    These lines exemplifies CTEs and analysis function:
    ---
    WITH distribution as
    (select length(rtrim(mystringfield)) stringlength from mytable)

    select percentile_disc(0.8) WITHIN GROUP ( ORDER BY stringlength ) from distribution
    ---

    ReplyDelete

To prevent "comment spam" all comments are moderated.
Learn about this website's comments policy here.

Some people have reported that they cannot post a comment using certain computers and browsers. If this is you feel free to use the Contact Form to send me the comment and I will post it for you, please include the title of the post so I know which one to post the comment to.