Wednesday, February 26, 2020

Checking if Job has error, and sends an email

rpogram to monitor job for error message and send email

The germ for this post came from a question I was asked about if it is possible to monitor a job, and if it errors to send an email. In previous posts I have given examples of the pieces need to do this, and in this post I will just put it all together, like assembling pieces of Lego.

The one perquisite that is needed to make this work is that the name of the Job has to be unique, and always the same. Each time the job is submitted to batch it has to be submitted with the same unique name. In this example the job I want to monitor will always have the name: SIMON_JOB

SBMJOB CMD(CALL PGM(program)) JOB(SIMON_JOB)

If the name of this job is not unique then there is no way I can identify it, and I will be able to check its status.

The program is quite simple I just need to do two things:

  1. Check the status of the job
  2. Send an email

I create one RPG program that does all of this. It contains just three procedures:

  1. The main procedure
  2. A procedure to determine if the job is in message wait
  3. Another procedure to send the email

While the program is not too long, it is too long to show in one piece here. Let me start with the "global" section, the place where the global variables that are available to all of the procedures in the program, are defined.

01  **free
02  ctl-opt option(*srcstmt)
03            dftactgrp(*no)
04            main(Main) ;

05  dcl-s ThisJobNameOnly char(10) inz('SIMON_JOB') ;
06  dcl-s EmailAddress char(30) inz('SIMON@EMAIL.COM') ;

07  dcl-ds Data qualified ;
08    JobName char(28) ;
09    Subsystem char(10) ;
10    MsgText char(350) ;
11    MsgDate date(*mdy) ;
12    MsgTime time(*hms) ;
13  end-ds ;

Line 1: I only write programs in totally free RPG. It has been five years since it was introduced so why would I want to use an older form of the language?

Lines 2 – 4: Here are the program's control options.

Line 2: I always want the source member sequence numbers to be the program's internal line numbers used when it is compiled. It makes it so much easier to find an error when the program and source statement numbers are the same.

Line 3: As this program contains subprocedures it cannot be run in the default activation group.

Line 4: I am using a Main procedure, rather than have my program use the RPG cycle.

Line 5 and 6: Rather than hard code the name of the Job and the email address further into the program, I have defined these two variables here so that you can easily find and change them to be used in your environment. I think their names explains their purpose. When you can use long names why not make the variable name truly descriptive of its function?

Lines 7 – 13: I am going to use SQL to fetch results from a table function. This data structure will contain the results. The only reason I have defined the date format, line 11, and time format, line 12, is when I show the results here they are in American format.

The next section of the program is the Main procedure.

14  dcl-proc Main ;
15    dcl-s Stop char(1) ;

16    exec sql SET OPTION COMMIT = *NONE,CLOSQLCSR = *ENDMOD ;

17    dow (1 = 1) ;
18      exec sql SELECT DATA_AREA_VALUE INTO :Stop
                   FROM TABLE(QSYS2.DATA_AREA_INFO(
                          DATA_AREA_NAME => 'TESTDA')) ;
19      if (Stop = '1') ;
20        leave ;
21      endif ;

22      GetData() ;

23      if (Data.JobName <> ' ') ;
24        SendEmail() ;
25      endif ;

26      exec sql CALL QSYS2.QCMDEXC('DLYJOB DLY(300)') ;
27    enddo ;
28  end-proc ;

Before I start describing the Main procedure notice that there is not Procedure Prototype, as the Main procedure does not receive parameters from whatever is calling it a Procedure Prototype is not needed.

Line 15: I have defined this variable in the Main procedure as this is the only procedure it will be used in.

Line 16: I add these SQL options to all of the SQL RPG programs I write.

Line 19: The data area TESTDA contains either zero or 1. If it contains 1 the program will end. This way I can end the program in a controlled manner. Here I am using the DATA_AREA_INFO SQL table function to get the data from the data area. This table function is only available in IBM i 7.4 TR1. If I was using an earlier release of IBM i I would need to do:

15  dcl-s Stop char(1) dtaara('TESTDA') ;
 .
 .
18  in Stop ;

Lines 19 – 21: If the value retrieved from the data area is "1" then the Do loop is exited, and the program will end.

Line 22: This is the call to the subprocedure that gets the data about whether the job SIMON_JOB is in error.

Line 23: If the fist data structure subfield contains data then the job is in error. If it is blank then there is no error.

Line 24: Call to the subprocedure to send the email.

Line 26: I am going to pause processing for five minutes before checking for an error again. I am using the Delay Job command, DLYJOB, in the SQL procedure QCMDEXC to accomplish this.

The next procedure, GetData, gets the data to show if the job is in error.

29  dcl-proc GetData ;
30    reset Data ;

31    exec sql DECLARE C0 CURSOR FOR
32      SELECT A.JOB_NAME,A.SUBSYSTEM,
33             CAST(RTRIM(B.MESSAGE_TEXT) || ' '
34                    || SUBSTR(B.MESSAGE_SECOND_LEVEL_TEXT,24)
35                    AS CHAR(350)),
36             DATE(MESSAGE_TIMESTAMP),TIME(MESSAGE_TIMESTAMP)
37      FROM TABLE(QSYS2.ACTIVE_JOB_INFO(JOB_NAME_FILTER 
38                                    => :ThisJobNameOnly)) A
39           CROSS JOIN TABLE(QSYS2.HISTORY_LOG_INFO()) B
40      WHERE A.JOB_NAME = B.FROM_JOB
41        AND A.JOB_STATUS = 'MSGW'
42        AND MESSAGE_TIMESTAMP < (CURRENT TIMESTAMP - 5 MINUTES)
43      ORDER BY B.ORDINAL_POSITION DESC
44      LIMIT 1
45      FOR READ ONLY ;

46    exec sql OPEN C0 ;

47    exec sql FETCH C0 INTO :Data ;

48    exec sql CLOSE C0 ;
49  end-proc ;

I know that it looks complicated. I could have made this very simple and made the program send an email that says something like "You have an error!". In my opinion that would not be very useful, I wanted the email to contain some information that could give me an idea of what the error is. Therefore, I need to know if the job has an error what is the error message. To achieve this I have joined two table functions together:

  1. ACTIVE_JOB_INFO:  I use this to determine if the job is in error
  2. HISTORY_LOG_INFO  To get the error message details

Let me describe the select statement in the cursor declaration.

I only want five columns returned in the results:

  1. Full job name
  2. Subsystem the job is running in
  3. Error message text
  4. Date the error happened
  5. Time the error happened

Line 32: The first two columns are returned by ACTIVE_JOB_INFO.

Lines 33 – 35: I am making the third column, the error message text, by concatenating the first level message text and the second level message text. I am using a CAST to only return 350 characters as I will not be able to include any more of it in my email. If you notice I have used a substring to only include the message second level text starting at the 24th position. I have done this as the first 23 characters are always the same no matter what the message is.

Line 36: Rather than display a timestamp in the email I think it is better to display the date and time as two different pieces of information. Therefore, I have used the DATE and TIME two extract those part of the message timestamp into separate columns.

Lines 37 - 40: This is where the two table functions are defined, and joined. ACTIVE_JOB_INFO has the parameter of JOB_NAME_FILTER as I only want the information for the job named SIMON_JOB. I am just using a CROSS JOIN as I know that there will be matching rows in the HISTORY_LOG_INFO to the ACTIVE_JOB_INFO. The two table functions are joined by the full job name columns, line 40.

Line 41: And I only want those jobs that are in message wait status.

Line 42: And the message must have happened at least five minutes ago.

Line 43: I want to sort the results in descending order by the ordinal position column from the history log info. This will make the most recent row for this job come first, which should be the one with the error message in it.

Line 44: I only want one row of the results returned.

Line 45: I am only using this cursor for input only.

Lines 46 – 48: Pretty standard stuff. The cursor is opened. The data is fetched from the cursor into the data structure. The cursor is closed.

The last subprocedure is what sends the email.

50  dcl-proc SendEmail ;
51    dcl-s Command char(500) ;

52    Command = 'SNDSMTPEMM ' +
53              'RCP((' + %trimr(EmailAddress) + ')) ' +
54              'SUBJECT(''Error msg job ' + %trimr(Data.JobName) +
55              ' in sbs ' +  %trimr(Data.Subsystem) + ''') ' +
56              'NOTE('''  + %char(Data.MsgDate:*mdy) + ' ' +
57                           %char(Data.MsgTime:*hms) + ' | ' +
58                           Data.MsgText + ''')' ;

59    exec sql CALL QSYS2.QCMDEXC(:Command) ;
60  end-proc ;

Line 51: This is the variable that will contain the CL command to send the email.

Lines 52 – 58: Here I am building the Send SMTP Email command, SNDSMTPEMM command to send me email.

Line 53: The recipient is the address I gave as a global variable.

Lines 54 and 55: The email's subject line contains the full job name and the subsystem the job is in.

Lines 56 – 58: The email body contains the date and time the error happened, followed by the message texts.

Line 59: Last step is to use the SQL QCMDEXC procedure to send the email.

I tested this with two scenarios: the first was an error in a CL program, and the second in a RPG program.

When the error was in the CL program this is what I retrieved from the SQL cursor:

EVAL data
DATA.JOBNAME = '135993/SIMON/SIMON_JOB      '
DATA.SUBSYSTEM = 'QSPL      '
DATA.MSGTEXT =
     ....5...10...15...20...25...30...35...40...45...50...55...60 
  1 'CPF2105 received by procedure EG061TC. (C D I R) ILE Control'
 61 ' language (CL) procedure EG061TC in module EG061TC in progra'
121 'm EG061TC in library RPGPGM1 detected an error at statement '
181 'number 0000001400.  Message text for CPF2105 is: Object A in'
241 ' QTEMP type *FILE not found.  Use F10 (if available) or the '
301 'Display Job Log (DSPJOBLOG) command to see the mes'
DATA.MSGDATE = '02/25/20'
DATA.MSGTIME = '18:46:09'

When this was formatted into the SNDSMTPEMMM command it looked like:

EVAL command
COMMAND =
     ....5...10...15...20...25...30...35...40...45...50...55...60 
  1 'SNDSMTPEMM RCP((SIMON@EMAIL.COM)) SUBJECT('Error msg job 135'
 61 '993/RPGPGM/SIMON_JOB in sbs QSPL') NOTE('02/25/20 18:46:09 |'
121 ' CPF2105 received by procedure EG061TC. (C D I R) ILE Contro'
181 'l language (CL) procedure EG061TC in module EG061TC in progr'
241 'am EG061TC in library RPGPGM1 detected an error at statement'
301 ' number 0000001400.  Message text for CPF2105 is: Object A i'
361 'n QTEMP type *FILE not found.  Use F10 (if available) or the'
421 ' Display Job Log (DSPJOBLOG) command to see the mes')       '
481 '                    '

The format of the email depends upon which email client you use. But it will look something like:

From: XXXXXXXXXXXXXXXXXXXXXX
To: simon@email.com
Date: 02/25/2020 6:52 PM
Subject: Error msg job 135993/RPGPGM/SIMON_JOB in sbs QSPL

01/18/20 18:46:09 | CPF2105 received by procedure EG061TC. 
(C D I R) ILE Control language (CL) procedure EG061TC in 
module EG061TC in program EG061TC in library RPGPGM1 
detected an error at statement number 0000001400.  Message
text for CPF2105 is: Object A in QTEMP type *FILE not found.
Use F10 (if available) or the Display Job Log (DSPJOBLOG) 
command to see the mes

The data retrieve from the SQL cursor for the RPG program error looked like:

EVAL data
DATA.JOBNAME = '137698/SIMON/SIMON_JOB      '
DATA.SUBSYSTEM = 'QSPL      '
DATA.MSGTEXT =
     ....5...10...15...20...25...30...35...40...45...50...55...60 
  1 'Attempt to divide by zero (C G D F). RPG procedure EG061TR1 '
 61 'in program RPGPGM1/EG061TR1 at statement 000500 tried to div'
121 'ide by zero. &N Recovery  . . . :   Contact the person respo'
181 'nsible for program maintenance to determine the cause of the'
241 ' problem. &N Possible choices for replying to message . . . '
301 '. . . . . . . . . . . . :   &B D -- Obtain RPG for'
DATA.MSGDATE = '02/25/20'
DATA.MSGTIME = '20:08:19'

When this was formatted for the SNDSMTPEMM command it looked like:

EVAL command
COMMAND =
     ....5...10...15...20...25...30...35...40...45...50...55...60 
  1 'SNDSMTPEMM RCP((SIMON@EMAIL.COM)) SUBJECT('Error msg job 137'
 61 '698/RPGPGM/SIMON_JOB in sbs QSPL') NOTE('02/25/20 20:08:19 |'
121 ' Attempt to divide by zero (C G D F). RPG procedure EG061TR1'
181 ' in program RPGPGM1/EG061TR1 at statement 000500 tried to di'
241 'vide by zero. &N Recovery  . . . :   Contact the person resp'
301 'onsible for program maintenance to determine the cause of th'
361 'e problem. &N Possible choices for replying to message . . .'
421 ' . . . . . . . . . . . . :   &B D -- Obtain RPG for')       '

And the email:

From: XXXXXXXXXXXXXXXXXXXXXX
To: simon@email.com
Date: 02/25/2020 8:13 PM
Subject: Error msg job 137698/RPGPGM/SIMON_JOB in sbs QSPL

02/25/20 20:08:19 | Attempt to divide by zero (C G D F). RPG 
procedure EG061TR1 in program RPGPGM1/EG061TR1 at statement
000500 tried to divide by zero. &N Recovery  . . . : Contact
the person responsible for program maintenance to determine
the cause of the  problem. &N Possible choices for replying 
to message . . . . . . . . . . . . . . . :   &B D -- Obtain
RPG for 

Now whenever a critical job goes into message wait you will receive an email alerting you.

 

This article was written for IBM i 7.4, and should work for some earlier releases too.

6 comments:

  1. Excellent .. I will test this
    I am sure this will help my future developments

    ReplyDelete
  2. Rather than having a monitoring job running and using system resources, you could use a Watch Exit Program to watch for any messages in your job and have it send an email instead.

    A very good example is available here: https://www.ibm.com/support/pages/strwch-watch-exit-programs-explained-cl-example

    Just change the /* DSPJOB and JOBLOG FOR OFFENDING JOB */ section to call SNDSMTPEMM instead, and you are done. Start the watch using the STRWCH command.

    ReplyDelete
  3. Thanks for the post Simon, do we have any alternative solution for QSYS2.HISTORY_LOG_INFO() as this table is unavailable in V7.1

    ReplyDelete
  4. You have to do this the "old way", which means a lot of work for you to find what you are looking for.

    Start with the following command: DSPLOG
    And prompt it.
    Enter the time and date range you desire.
    In the OUTPUT parm enter *PRINT.
    Press Enter.

    Depending upon the range you select this can take some time to generate. You may also get a message that the spool file is full, CPA4072, and you will need to answer the message with NOMAX.

    When the spool file has generated you then need to search that for what you are looking for.

    You could copy the spool file to a physical file, see here how to do it, and then use SQL to search the "flat" physical file.

    ReplyDelete
  5. Hi Simon, thank you for the tip. You also could have used the external procedure sleep() instead of DLYJOB.

    ReplyDelete

To prevent "comment spam" all comments are moderated.
Learn about this website's comments policy here.

Some people have reported that they cannot post a comment using certain computers and browsers. If this is you feel free to use the Contact Form to send me the comment and I will post it for you, please include the title of the post so I know which one to post the comment to.