Find the most recent rows for a repeating sets of data with SQL @ RPGPGM.COM

Wednesday, March 22, 2023

Find the most recent rows for a repeating sets of data with SQL

This post is based upon something I needed to provide to my superiors at my work. We have a history file that contains many thousands of repeating sets records (yes, I know it is time to purge the old data), I was asked to "create a SQL statement" that would return the most recent records for each key. It will make more sense when I show the data I am going to be using here.

In these examples the file is named TESTFILE, and contains three columns of data. I can use the following SQL statement to list the contents of this file:

01  SELECT * FROM TESTFILE

The results are:

KEY      TIMESTAMP                    VALUE
------   --------------------------   ------
FIRST    2023-02-28 07:28:55.362189        1
SECOND   2023-02-28 07:28:55.362189        1
THIRD    2023-02-28 07:28:55.362189        1
FIRST    2023-02-28 08:28:58.307440        2
SECOND   2023-02-28 08:28:58.307440        2
THIRD    2023-02-28 08:28:58.307440        2
FIRST    2023-02-28 09:12:32.470802        3
SECOND   2023-02-28 09:12:32.470802        3
THIRD    2023-02-28 09:12:32.470802        3

As you can see there are three sets of data, with the same timestamp and value in the VALUE column. The most recent rows for the three key values are all where the column is VALUE equal to '3'.

This is an occasion where I use SQL's inner join. The inner join will return all the rows that are in the two files I have joined. To achieve my desired results I am joining TESTFILE to itself:

01  SELECT A.KEY,A.TIME_STAMP,A.VALUE
02    FROM TESTFILE A
03   INNER JOIN
04    (SELECT MAX(TIME_STAMP) AS LATEST,KEY
05       FROM TESTFILE
06      GROUP BY KEY) B
07      ON A.TIME_STAMP = B.LATEST
08     AND A.KEY = B.KEY

Line 1: These are the columns I want in my results. I needed to prefix them with the letter 'A' as the fields in both files have the same names.

Line 2: The first definition of the file, and the 'A' follows the file name to indicate that this is the prefix for the file's fields.

Line 3: Not much to say here except I am using an inner join.

Lines 4 – 6: This is a sub select to return the greatest value of the timestamp field.

Line 4: I need to define two columns here. The first is the maximum value from the timestamp column, which I can easily get with the MAX scalar function. I have even given this result column its own name, LATEST, which I can use later in this statement. The second is the key column, I need that to join the two set of results together.

Line 6: By using the GROUP BY only one result is return for each unique key. And I have prefixed the results from this file with 'B'.

Lines 7 and 8: Defining the joins between the two files.

The results are:

KEY      TIMESTAMP                    VALUE
------   --------------------------   ------
FIRST    2023-02-28 09:12:32.470802        3
SECOND   2023-02-28 09:12:32.470802        3
THIRD    2023-02-28 09:12:32.470802        3

The results are exactly what I wanted.

What about doing the opposite, returning the oldest set of records from the file. With just one change I have those results:

01  SELECT A.KEY,A.TIME_STAMP,A.VALUE
02    FROM TESTFILE A
03   INNER JOIN
04    (SELECT MIN(TIME_STAMP) AS LATEST,KEY
05       FROM TESTFILE
06      GROUP BY KEY) B
07      ON A.TIME_STAMP = B.LATEST
08     AND A.KEY = B.KEY

Line 4: The only difference is to replace the MAX scalar function with the MIN scalar function.

Results are:

KEY      TIMESTAMP                    VALUE
------   --------------------------   ------
FIRST    2023-02-28 07:28:55.362189        1
SECOND   2023-02-28 07:28:55.362189        1
THIRD    2023-02-28 07:28:55.362189        1

The results are for the oldest set of records.

I assume that if I was to add a record with an even greater timestamp for one of the keys it would appear in the results, rather than the record that belongs to the third set of records. I inserted a row with a future timestamp, and a value of '4':

1  INSERT INTO TESTFILE 
        VALUES('SECOND',CURRENT_TIMESTAMP + 1 YEAR,4)

Then I used the same statement as before to retrieve the records with the greatest timestamp for each key:

01  SELECT A.KEY,A.TIME_STAMP,A.VALUE
02    FROM TESTFILE A
03   INNER JOIN
04    (SELECT MAX(TIME_STAMP) AS LATEST,KEY
05       FROM TESTFILE
06      GROUP BY KEY) B
07      ON A.TIME_STAMP = B.LATEST
08     AND A.KEY = B.KEY

As I expected, the row I inserted above now appears in the results.

KEY      TIMESTAMP                    VALUE
------   --------------------------   ------
FIRST    2023-02-28 09:12:32.470802        3
SECOND   2024-02-28 08:06:25.619974        4
THIRD    2023-02-28 09:12:32.470802        3

In the real world a statement equivalent to these returned to me the most recent values for the file's keys. My superiors were happy with the results, and the short amount of time taken to generate them and the "SQL statement".

This article was written for IBM i 7.5, and should work for some earlier releases too.

2 comments:

AnonymousMarch 24, 2023 at 5:00 AM
Instead of joining to a subselect, there is an alternative way of achieving this. And that is to use the row_number() OLAP function. This is a special type of aggregation function that can add a row number to the records, within a custom grouping and with a custom ordering.

For this example, you would group it by KEY and order it by TIMESTAMP desc - you would then get row numbering for each key, such that the newest timestamp in each key is row number 1. Then just select the rows with row number = 1.

with RANKED as (
select
t.*,
row_number() over(
partition by KEY
order by TIME_STAMP desc
) as rn
from TESTFILE t
select KEY, TIME_STAMP, VALUE
from RANKED
where rn = 1;

When I first discovered this a few years ago, providing the appropriate indexes existed, Visual Explain showed it to be considerably faster than the join to subselect method.
ReplyDelete
Replies
AnonymousApril 6, 2023 at 3:08 PM
I have done something similar in the past using RRN (providing table is REUSEDLT(*NO).
ReplyDelete
Replies

Add comment

To prevent "comment spam" all comments are moderated.
Learn about this website's comments policy here.

Some people have reported that they cannot post a comment using certain computers and browsers. If this is you feel free to use the Contact Form to send me the comment and I will post it for you, please include the title of the post so I know which one to post the comment to.

RPGPGM.COM - From AS400 to IBM i

Pages

Advertisements

Wednesday, March 22, 2023

Find the most recent rows for a repeating sets of data with SQL

2 comments: