Wednesday, June 17, 2026

Program to alert when jobs' CPU percentage becomes too high

I was asked if there was a way to alert when a job exceeds a certain threshold of CPU percentage. The IBM i operating system collects performance data through Collection Services. Navigator for i and Performance Data Investigator, PDI, can be used to analyze CPU heavy jobs, but this is all historical data; there is no real time alerting.

There are various tools from ISVs that can do this. For this website I always write about solutions using just what comes with the IBM i operating system, with no third party software.

This problem allows me to use one of my favorite Db2 for i table functions, ACTIVE_JOB_INFO. This table function allows me to retrieve information about all of the active jobs on my partition.

I can use the following statement to list the top five consumers of CPU percent on the partition I use for creating the posts for this website. This partition is only used by a few others; most of the time I am the only person using it. Therefore, the results I show will be a lot smaller than you are likely to find on your partitions.

The statement I would use is:

01  SELECT ELAPSED_CPU_PERCENTAGE AS "CPU %",
02         JOB_NAME,JOB_TYPE,JOB_STATUS,CPU_TIME,SUBSYSTEM
03    FROM TABLE(QSYS2.ACTIVE_JOB_INFO(DETAILED_INFO => 'WORK'))
04   ORDER BY ELAPSED_CPU_PERCENTAGE DESC
05  LIMIT 5

Lines 1 and 2: The columns I find interesting are:

  • ELAPSED_CPU_PERCENTAGE:  Percent of processing time attributed to this job during the measured time interval
  • JOB_NAME:  Fully qualified job name of the job
  • JOB_TYPE:  Type of the job
  • JOB_STATUS:  Status of the job
  • CPU_TIME:  Total processing time, in milliseconds
  • SUBSYSTEM:  Subsystem the job is running in

Line 3: I am using the 'WORK' value in the detailed info parameter as this returns a limited number of columns, so the results are returned faster than when using 'ALL'.

Line 4: I want to order the results by the elapsed CPU percentage in descending order, so that the biggest consumer of CPU appears first.

Line 5: I am limiting the number of results to five.

The results look like:

                                JOB_  JOB_    CPU_
CPU %  JOB_NAME                 TYPE  STATUS  TIME    SUBSYSTEM
-----  -----------------------  ----  ------  ------  ---------
 1.10  308719/QWEBADMIN/ADMIN1  BCI   THDW    176147  QHTTPSVR
 1.04  308720/QLWISVR/ADMIN5    BCI   THDW    167984  QHTTPSVR
 0.99  387826/QUSER/QZDASOINIT  PJ    RUN        184  QUSRWRK
 0.77  370511/QUSER/QZRCSRVS    PJ    TIMA     37655  QUSRWRK
 0.66  367310/QUSER/QZRCSRVS    PJ    TIMA     40092  QUSRWRK
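
The "measured time interval" behind ELAPSED_CPU_PERCENTAGE can be restarted with the table function's RESET_STATISTICS parameter. The following is only a sketch, and is not part of the program in this post. As I understand it, the first statement restarts the interval, so I would expect its elapsed values to be zero, and a later call from the same job reports the CPU percentage accumulated since that reset.

```sql
-- Restart the measurement interval used for the elapsed columns
SELECT JOB_NAME,ELAPSED_CPU_PERCENTAGE
  FROM TABLE(QSYS2.ACTIVE_JOB_INFO(RESET_STATISTICS => 'YES',
                                   DETAILED_INFO => 'WORK'))
 LIMIT 1 ;

-- Run some time later, in the same job, without resetting the interval
SELECT ELAPSED_CPU_PERCENTAGE AS "CPU %",JOB_NAME,SUBSYSTEM
  FROM TABLE(QSYS2.ACTIVE_JOB_INFO(RESET_STATISTICS => 'NO',
                                   DETAILED_INFO => 'WORK'))
 ORDER BY ELAPSED_CPU_PERCENTAGE DESC
 LIMIT 5 ;
```

How long to wait between the two statements is a judgment call: the longer the interval, the smoother the percentage, and the slower a sudden spike shows up.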

Below, I take that SQL statement and make it part of an RPG program.

The first part I am showing are the global definitions:

01  **free
02  ctl-opt main(Main) option(*srcstmt) dftactgrp(*no) actgrp(*caller) ;

03  dcl-s Threshold packed(7 : 2) inz(1.00) ;
04  dcl-s SleepSeconds int(10) inz(60) ;

05  dcl-ds Data qualified dim(*auto : 9999) ;
06    Percent packed(7 : 2) ;
07    JobName char(28) ;
08    JobType char(3) ;
09    JobSts char(4) ;
10    CpuTime packed(20 : 0) ;
11    Subsystem char(10) ;
12  end-ds ;

13  dcl-pr sleep int(10) extproc('sleep') ;
14    *n uns(10) value ;
15  end-pr ;

Line 2: My favorite control options. This program will not use the RPG cycle, therefore, it needs a Main procedure. I want debug to use the source member sequence numbers, rather than generate its own. And as the Main procedure calls a couple of subprocedures, I cannot run in the default activation group.

Lines 3 and 4: I am defining these two variables here as these are the ones that will be used to control the way this program works.

  • Threshold:  The CPU percentage threshold. I want an alert when any job reaches it
  • SleepSeconds:  The number of seconds I want to pause before getting the results again

I put these at the top of the program so they would be easy to modify. The threshold is low, only 1%, because, as I mentioned before, this partition is rarely used by anyone else.

Lines 5 – 12: This data structure array is used to contain the results retrieved from ACTIVE_JOB_INFO. It is an auto-extending array, therefore, the number of array elements will be the same as the number of results, up to the maximum of 9,999.

Lines 13 – 15: This is the procedure prototype for the C procedure sleep. I will be using this to pause the job between fetching results.

Onto the Main procedure:

16  dcl-proc Main ;
17    dow (*on) ;
18      GetData() ;
19      if (%elem(Data) > 0) ;
20        SendMessages() ;
21      endif ;

22      sleep(SleepSeconds) ;
23    enddo ;

24  on-exit ;
25  end-proc ;

Line 17: The start of a "never ending" Do loop. The loop will keep on being performed until the program is ended.

Line 18: Call the procedure to fetch the data from ACTIVE_JOB_INFO and populate the Data array.

Line 19: The %ELEM Built in Function, BiF, returns the number of elements the array has. If that number is greater than zero, there are jobs using more than the threshold CPU percentage.

Line 20: Call the procedure to send messages alerting that there are jobs using more than the threshold of CPU.

Line 22: Pause the job before returning to the start of the Do loop.

Line 24: I have included an ON-EXIT group even though I am not doing anything within it. I could use this later to close files, delete objects, etc.

Now onto the procedure that gets the data from ACTIVE_JOB_INFO:

26  dcl-proc GetData ;
27    dcl-s Rows uns(10) inz(%elem(Data : *max)) ;

28    %elem(Data) = 0 ;

29    exec sql DECLARE C0 CURSOR FOR
30               SELECT ELAPSED_CPU_PERCENTAGE,JOB_NAME,JOB_TYPE,
31                      JOB_STATUS,CPU_TIME,SUBSYSTEM
32                 FROM TABLE(QSYS2.ACTIVE_JOB_INFO(DETAILED_INFO => 'WORK'))
33                WHERE ELAPSED_CPU_PERCENTAGE >= :Threshold
34                ORDER BY ELAPSED_CPU_PERCENTAGE DESC
35                  FOR READ ONLY ;

36    exec sql OPEN C0 ;

37    exec sql FETCH C0 FOR :Rows ROWS INTO :Data ;

38    exec sql CLOSE C0 ;
39  end-proc ;

Line 27: Define a variable that contains the maximum number of elements my array can contain. As this is the only subprocedure I need it in, I defined it here.

Line 28: Clear the array, by setting the number of elements to zero.

Lines 29 – 35: The cursor definition is the same as the previous SQL statement I showed. The only differences are: I am not limiting the number of results, and, on line 35, I have defined the cursor as read only, as I will not be updating it. Line 33 is where the RPG variable Threshold is used as part of the selection criteria for the results.

Line 36: Open the cursor.

Line 37: This is a multi-row Fetch that will return all eligible rows, up to 9,999, and populate the array with them.

Line 38: Close the cursor.

The last subprocedure is where I send the message.

40  dcl-proc SendMessages ;
41    dcl-ds Single likeds(Data) ;
42    dcl-s Text varchar(1000) ;

43    for-each Single in Data ;
44      Text = 'SNDMSG MSG('''+
45             'Job ' + %trimr(Single.JobName) +
46             ' is running at ' +  %trimr(%char(Single.Percent)) + '%' +
47             ' in subsystem ' + %trimr(Single.Subsystem) +
48             ''') TOMSGQ(QSYSOPR)' ;

49      exec sql CALL QCMDEXC(:Text) ;
50    endfor ;
51  end-proc ;

Line 41: I need to define a data structure based on the data structure array so I can use the FOR-EACH operation code. I can make a copy of the data structure array's definition by using the LIKEDS keyword.

Line 42: This variable will contain the SNDMSG command, including the text of the message, that I will be sending to the System Operator, QSYSOPR, message queue.

Line 43: Start of the For Each group, that takes each element, in turn, from the data structure array Data, and places the contents of that element into the Single data structure.

Lines 44 – 48: I am using the SNDMSG CL command to send the message to the QSYSOPR message queue. I am only using a few pieces of the information I retrieved from ACTIVE_JOB_INFO:

  • Single.JobName:  Fully qualified job name
  • Single.Percent:  I am using the %CHAR and %TRIMR BiFs to convert the numeric percent to a character value, without trailing blanks, so that I can concatenate it into my text
  • Single.Subsystem:  The name of the subsystem the job is running in

I could have added any of the other information I retrieved from the table function. But, for this example, I wanted to keep it simple.

Line 49: I use the QCMDEXC SQL procedure to send the message.

When I look in the System Operator message queue I see the message I sent:

From  . . . :   SIMON          06/16/26   19:06:11
Job 308720/QLWISVR/ADMIN5 is running at 1.90% in subsystem QHTTPSVR

I know that 1.90% is a ridiculously low percent to alert on, but as I have said before there is not much going on on this partition.

I could have done more "alerting": sent messages to multiple message queues, sent an email, etc. But this example demonstrates how to do what I set out to do. If you want to copy and use this program you can add any other "alerting" methods you like.

If I was deploying this to a production server I would add it as a subsystem autostart job entry, so that it starts when the subsystem starts, and ends when the subsystem ends.
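
As a sketch of that setup, the autostart job entry needs a job description whose request data calls the program. The names MYLIB, CPUALERT, MYSBS, and MYUSER are placeholders I made up for illustration. I am running the CL commands with QCMDEXC, as I did for SNDMSG above, but they could just as well be entered on a command line.

```sql
-- Job description that calls the (hypothetical) program MYLIB/CPUALERT.
-- An autostart job needs a real user profile in the job description.
CALL QSYS2.QCMDEXC('CRTJOBD JOBD(MYLIB/CPUALERT) USER(MYUSER) +
                      RQSDTA(''CALL PGM(MYLIB/CPUALERT)'')') ;

-- Autostart job entry in the (hypothetical) subsystem MYLIB/MYSBS
CALL QSYS2.QCMDEXC('ADDAJE SBSD(MYLIB/MYSBS) JOB(CPUALERT) +
                      JOBD(MYLIB/CPUALERT)') ;
```

The entry takes effect the next time the subsystem is started. The job description's routing data must match a routing entry in the subsystem. That depends on how the subsystem is configured, so I have left it out.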

 

This article was written for IBM i 7.6, and should work for some earlier releases too.
