Wednesday, March 29, 2023

More useful information added to SYSCOLUMNSTAT View about variable length columns

As part of the recent IBM i Technology refreshes, 7.5 TR1 and 7.4 TR7, three new columns were added to the SYSCOLUMNSTAT View. These are only of interest for variable length columns. While two of the columns I found easy to understand the third took awhile for me to work out what its contents mean.

These new columns are:

  • MAXIMUM_COLUMN_LENGTH:  For variable length columns this is the maximum length of the column. For fixed width columns this is the column length.

  • OVERFLOW_ROWS:  If the variable length column has been defined with an ALLOCATE clause this is the number of rows that exceed the allocate value. For fixed length columns this column is null.

  • LENGTH_AT_90TH_PERCENTILE:  This is the one that I found harder to grasp. There is what is called a 90/10 rule for using the allocate clause. 90% of the data should be within the allocate value, and only 10% should exceed it. If this is a variable length column the value here is the database's best estimate of what the 90th percentile should be. This column is null for fixed width columns.

If I am going to show these columns in action I need to have a SQL DDL table to use:

01  CREATE TABLE MYLIB.TESTTABLE (
02    FIRST_COLUMN VARCHAR(100) ALLOCATE(50),
03    SECOND_COLUMN VARCHAR(100),
04    THIRD_COLUMN CHAR(100))

Line 1: The example table will be called TESTTABLE, and it will be created in my library.

Line 2: The first column, FIRST_COLUMN is variable length of 100. I have used the allocate clause to give an allocation limit of 50. I chose 50 for no other reason than it is half of 100.

Line 3: SECOND_COLUMN also has a variable length of 100. But it has no allocate clause.

Line 4: THIRD_COLUMN is fixed length.

I wrote a RPG program to insert variable length values into TESTABLE:

01  **free
02  ctl-opt dftactgrp(*no) ;
03  dcl-s Asterisks char(100) inz(*all'*') ;
04  dcl-s Counter int(3) ;

05  exec sql CALL QSYS2.QCMDEXC('CLRPFM TESTTABLE') ;

06  for Counter = 10 by 10 to 70 ;
07    InsertIntoTable(Counter) ;
08  endfor ;

09  *inlr = *on ;

Line 2: As the program contains a subprocedure I need to use the DFTACTGRP(*NO) control option.

Line 3: I have defined this variable with a fixed width of 100 characters. I have initialized it with asterisks ( * ). The *ALL'*' fills the variable with asterisks. This is a wonderful shortcut, and better than having to type 100 asterisks.

Line 4: This variable is defined as an integer.

Line 5: I am using SQL's QCMDEXC procedure to execute the CL command to clear TESTTABLE.

Lines 6 - 9: A For group.

Line 6: The variable COUNTER will start with a value of 10. That will be incremented by 10 every time the loop is performed. And stops after it reaches the value 70. In other words on each execution of the loop Counter will contain: 10, 20, 30, 40, 50, 60, and 70.

Line 7: The subprocedure InsertIntoTable is called, and passed to it is the value contained in Counter. The subprocedure is below.

10  dcl-proc InsertIntoTable ;
11    dcl-pi *n ;
12      EndPosition int(3) const ;
13    end-pi ;

14    dcl-s String varchar(100) ;

15    String = %subst(Asterisks:1:EndPosition) ;

16    exec sql INSERT INTO TESTTABLE VALUES(:String,:String,:String) ;
17  end-proc ;

Lines 11-13: The subprocedure interface, which defines the variable that the passed number will be contained in. As a number is passed, rather than a variable, I need to give CONST for the parameter, as a number is a constant and cannot be changed.

Line 14: This variable is "local" to just this subprocedure. Notice that it is a VARCHAR type.

Line 15: I am substring asterisks from the variable Asterisks of the length that was passed to this subprocedure.

Line 16: I am inserting a row into TESTABLE, using the same variable as the value for the three columns.

I can get a count of the number of rows in TESTTABLE of each length by using the following SQL statement:

01  SELECT LENGTH(FIRST_COLUMN) AS "Col length",
02         COUNT(*) AS "Count"
03    FROM TESTTABLE 
04   GROUP BY LENGTH(FIRST_COLUMN)

Line 1: I am using the Length scalar function to return the length of the variable FIRST_COLUMN. As it is a variable length column so I will get different results for each row.

Line 2: The count of each length, this is overkill as there is just one row for each length. This will prove useful later on.

Line 4: The GROUP BY groups the different row lengths together into one result.

Col length    Count
-----------   ------
         10        1
         20        1
         30        1
         40        1
         50        1
         60        1
         70        1

The results are what I expected, one row of each length.

This where things can get complicated. The sometimes the statistics that are contained with SYSCOLUMNSTAT are updated immediately. Other times they are not, in this situation I would need to manually generate the statistics for TESTTABLE.

In my experience when I use the following SQL statement:

01  SELECT * FROM QSYS2.SYSCOLUMNSTAT
02   WHERE TABLE_SCHEMA = 'MYLIB'
03     AND TABLE_NAME = 'TESTTABLE'

If no results, or incomplete results, are returned I am going to have to manually generate the table's statistics. Tomorrow I will publish a video showing how to manually refresh the statistics for the table. It is a lot easier to understand if I show, rather than describe in text. After doing the refresh I wait a couple of minutes before checking for results again, with the above SQL statement. When there are results I continue.

Now I can see the new columns I mentioned at the top of this post:

01  SELECT A.COLUMN_NAME,A.DATA_TYPE,A.LENGTH,
02         A.CHARACTER_MAXIMUM_LENGTH AS "Char max len",
03         B.AVERAGE_COLUMN_LENGTH AS "Ave col len",
04         B.MAXIMUM_COLUMN_LENGTH AS "Max col len",
05         B.LENGTH_AT_90TH_PERCENTILE "90th %tile",
06         B.OVERFLOW_ROWS AS "Over rows"
07    FROM QSYS2.SYSCOLUMNS A,QSYS2.SYSCOLUMNSTAT B
08   WHERE A.TABLE_SCHEMA = B.TABLE_SCHEMA
09     AND A.TABLE_NAME = B.TABLE_NAME
10     AND A.COLUMN_NAME = B.COLUMN_NAME
11     AND A.TABLE_SCHEMA = 'MYLIB'
12     AND A.TABLE_NAME = 'TESTTABLE'

I have joined SYSCOLUMNS and SYSCOLUMNSTAT to get a more useful set of results.

Lines 1 and 2: All columns with the "A" prefix are from SYSCOLUMNS.

Lines 3 – 6: All of these columns are from SYSCOLUMNSTAT. As the column names are so long I have decided to give them shorter column headings.

Line 7: This is a novel way to join to files. I was shown this approach by a colleague last week.

Lines8 – 10: I join the two views by schema (library), table, and column names.

Lines 11 – 12: This is where I have give the schema (library) and table name for TESTTABLE.

               DATE             Char     Average  Max      90th    Over
COLUMN_NAME    _TYPE    LENGTH  max len  col len  col len  %tile   rows
-------------  -------  ------  -------  -------  -------  ------  -----
FIRST_COLUMN   VARCHAR     100     100        40       70     70       1
SECOND_COLUMN  VARCHAR     100     100        40       70     70       7
THIRD_COLUMN   CHAR        100     100       100      100  <NULL>  <NULL>

Let me explain the results for each column.

FIRST_COLUMN average length is 40 (= (10 + 20 + 30 + 40 + 50 + 60 + 70) / 7). As it is variable length its maximum length is the 70. With so few rows of data the 90th percentile of my rows happens to be the same as the length as the longest column. What is interesting is that it had calculate that only one row exceeds that allocate value, which surprises me as the allocate value is 50, and there are columns of 60 and 70 long.

SECOND_COLUMN the results are the same as the previous column. The only difference being that it calculates there are 7 rows over the allocate value. Since this column does not have one I assume when one is not given the allocate value is zero.

THIRD_COLUMN is the fixed length column. Average amd Maximum column lengths can only be 100. As this column is fixed length there are no values for the 90th percentile and the number of rows over the allocate value.

As my sample size is so small what happens if I have more rows to test with. I add 999 rows for each of the lengths 10 – 40:

Col length    Count
-----------   ------
         10     1000
         20     1000
         30     1000
         40     1000
         50        1
         60        1
         70        1

Now when I look at the statistics for this greater number of rows I see:

               DATE             Char     Average  Max      90th    Over
COLUMN_NAME    _TYPE    LENGTH  max len  col len  col len  %tile   rows
-------------  -------  ------  -------  -------  -------  ------  -----
FIRST_COLUMN   VARCHAR     100     100        25       70     40       1
SECOND_COLUMN  VARCHAR     100     100        25       70     40    4003
THIRD_COLUMN   CHAR        100     100       100      100  <NULL>  <NULL>

The results now look better. The average length for the variable length columns is now 25. Interestingly the number of rows longer than the allocate value for FIRST_COLUMN shows 1, IMHO this should be 3.

With these results I can see if the allocate value for FIRST_COLUMN is optimal, or at least the 90th percentile value is not too different from the allocate value.

This is another great addition to information I can retrieve about columns, especially when I need to determine if the allocate value I gave is optimal.

You can learn more about the new columns in SYSCOLUMNSTAT command from the IBM website here.

 

This article was written for IBM i 7.5 TR1 and 7.4 TR7.


Addendum

I received this clarification from Sue Romano of the IBM Db2 for i team:

I noticed the odd overflow rows number you observed, so questioned the developer to see what is going on. He was delighted that his minor enhancement was noticed, and provided an answer. I'll summarize my understanding of this mystery.

It turns out that SLIC doesn't strictly use the ALLOCATE value when deciding how much fixed-length space to allocate. When doing efficient column alignment in storage, SLIC can leave unused bytes between columns. When dealing with an ALLOCATE length, SLIC can round up the ALLOCATE value to use this "gap" space, rather than wasting it. I don't know what the alignment algorithm is, but I assume it is probably something like 8- or 16-byte alignment. So perhaps a VARCHAR(50) ALLOCATE 10 might actually implement an allocate of 16.

Conceptually, this makes sense to me. I hope it does to you as well.

No comments:

Post a Comment

To prevent "comment spam" all comments are moderated.
Learn about this website's comments policy here.

Some people have reported that they cannot post a comment using certain computers and browsers. If this is you feel free to use the Contact Form to send me the comment and I will post it for you, please include the title of the post so I know which one to post the comment to.