## Wednesday, March 22, 2023

### Find the most recent rows for a repeating sets of data with SQL

This post is based upon something I needed to provide to my superiors at my work. We have a history file that contains many thousands of repeating sets records (yes, I know it is time to purge the old data), I was asked to "create a SQL statement" that would return the most recent records for each key. It will make more sense when I show the data I am going to be using here.

In these examples the file is named TESTFILE, and contains three columns of data. I can use the following SQL statement to list the contents of this file:

 ```01 SELECT * FROM TESTFILE ```

The results are:

 ```KEY TIMESTAMP VALUE ------ -------------------------- ------ FIRST 2023-02-28 07:28:55.362189 1 SECOND 2023-02-28 07:28:55.362189 1 THIRD 2023-02-28 07:28:55.362189 1 FIRST 2023-02-28 08:28:58.307440 2 SECOND 2023-02-28 08:28:58.307440 2 THIRD 2023-02-28 08:28:58.307440 2 FIRST 2023-02-28 09:12:32.470802 3 SECOND 2023-02-28 09:12:32.470802 3 THIRD 2023-02-28 09:12:32.470802 3 ```

As you can see there are three sets of data, with the same timestamp and value in the VALUE column. The most recent rows for the three key values are all where the column is VALUE equal to '3'.

This is an occasion where I use SQL's inner join. The inner join will return all the rows that are in the two files I have joined. To achieve my desired results I am joining TESTFILE to itself:

 ```01 SELECT A.KEY,A.TIME_STAMP,A.VALUE 02 FROM TESTFILE A 03 INNER JOIN 04 (SELECT MAX(TIME_STAMP) AS LATEST,KEY 05 FROM TESTFILE 06 GROUP BY KEY) B 07 ON A.TIME_STAMP = B.LATEST 08 AND A.KEY = B.KEY ```

Line 1: These are the columns I want in my results. I needed to prefix them with the letter 'A' as the fields in both files have the same names.

Line 2: The first definition of the file, and the 'A' follows the file name to indicate that this is the prefix for the file's fields.

Line 3: Not much to say here except I am using an inner join.

Lines 4 – 6: This is a sub select to return the greatest value of the timestamp field.

Line 4: I need to define two columns here. The first is the maximum value from the timestamp column, which I can easily get with the MAX scalar function. I have even given this result column its own name, LATEST, which I can use later in this statement. The second is the key column, I need that to join the two set of results together.

Line 6: By using the GROUP BY only one result is return for each unique key. And I have prefixed the results from this file with 'B'.

Lines 7 and 8: Defining the joins between the two files.

The results are:

 ```KEY TIMESTAMP VALUE ------ -------------------------- ------ FIRST 2023-02-28 09:12:32.470802 3 SECOND 2023-02-28 09:12:32.470802 3 THIRD 2023-02-28 09:12:32.470802 3 ```

The results are exactly what I wanted.

What about doing the opposite, returning the oldest set of records from the file. With just one change I have those results:

 ```01 SELECT A.KEY,A.TIME_STAMP,A.VALUE 02 FROM TESTFILE A 03 INNER JOIN 04 (SELECT MIN(TIME_STAMP) AS LATEST,KEY 05 FROM TESTFILE 06 GROUP BY KEY) B 07 ON A.TIME_STAMP = B.LATEST 08 AND A.KEY = B.KEY ```

Line 4: The only difference is to replace the MAX scalar function with the MIN scalar function.

Results are:

 ```KEY TIMESTAMP VALUE ------ -------------------------- ------ FIRST 2023-02-28 07:28:55.362189 1 SECOND 2023-02-28 07:28:55.362189 1 THIRD 2023-02-28 07:28:55.362189 1 ```

The results are for the oldest set of records.

I assume that if I was to add a record with an even greater timestamp for one of the keys it would appear in the results, rather than the record that belongs to the third set of records. I inserted a row with a future timestamp, and a value of '4':

 ```1 INSERT INTO TESTFILE VALUES('SECOND',CURRENT_TIMESTAMP + 1 YEAR,4) ```

Then I used the same statement as before to retrieve the records with the greatest timestamp for each key:

 ```01 SELECT A.KEY,A.TIME_STAMP,A.VALUE 02 FROM TESTFILE A 03 INNER JOIN 04 (SELECT MAX(TIME_STAMP) AS LATEST,KEY 05 FROM TESTFILE 06 GROUP BY KEY) B 07 ON A.TIME_STAMP = B.LATEST 08 AND A.KEY = B.KEY ```

As I expected, the row I inserted above now appears in the results.

 ```KEY TIMESTAMP VALUE ------ -------------------------- ------ FIRST 2023-02-28 09:12:32.470802 3 SECOND 2024-02-28 08:06:25.619974 4 THIRD 2023-02-28 09:12:32.470802 3 ```

In the real world a statement equivalent to these returned to me the most recent values for the file's keys. My superiors were happy with the results, and the short amount of time taken to generate them and the "SQL statement".

This article was written for IBM i 7.5, and should work for some earlier releases too.

#### 2 comments:

1. Instead of joining to a subselect, there is an alternative way of achieving this. And that is to use the row_number() OLAP function. This is a special type of aggregation function that can add a row number to the records, within a custom grouping and with a custom ordering.

For this example, you would group it by KEY and order it by TIMESTAMP desc - you would then get row numbering for each key, such that the newest timestamp in each key is row number 1. Then just select the rows with row number = 1.

with RANKED as (
select
t.*,
row_number() over(
partition by KEY
order by TIME_STAMP desc
) as rn
from TESTFILE t
select KEY, TIME_STAMP, VALUE
from RANKED
where rn = 1;

When I first discovered this a few years ago, providing the appropriate indexes existed, Visual Explain showed it to be considerably faster than the join to subselect method.

2. I have done something similar in the past using RRN (providing table is REUSEDLT(*NO).

To prevent "comment spam" all comments are moderated.
Learn about this website's comments policy here.

Some people have reported that they cannot post a comment using certain computers and browsers. If this is you feel free to use the Contact Form to send me the comment and I will post it for you, please include the title of the post so I know which one to post the comment to.

```
```