Wednesday, March 15, 2017

Difference between UNION and JOIN

difference between selection with join and select with union

After last week's post about using a Select statement with an UNION several people have asked me to clarify the differences between an Union and a Join.

I think using some simple graphics makes it a whole lot easier to explain. I have two files:

File 1
           
           
File 2
           
           

If I Join the two files I have data from both files into a single row:

Join
                       
                       

The Union merges data from two or more subselects into one result set, the data from each of the subselects is kept as a separate row:

Union
           
           
           
           

OK, let me give more detail using some example code. I have two files, which are identical:

A          R RFILE1
A            F1_1           2S 0
A            F1_2      R               REFFLD(F1_1 *SRC)
A                                      COLHDG('F1_2')

And:

A                                      REF(FILE1)
A          R RFILE2
A            F2_1      R               REFFLD(F1_1)
                                       COLHDG('F2_1')
A            F2_2      R               REFFLD(F1_2)
                                       COLHDG('F2_2')

And a third with just one field:

A                                      REF(FILE1)
A          R RFILE3
A            F3_1      R               REFFLD(F1_1)
                                       COLHDG('F3_1')

Each of the file contain five records:

FILE1
F1_1  F1_2
  1     1
  2     2
  3     3
  4     4
  5     5
FILE2
F2_1  F2_2
  2     2
  4     4
  6     6
  8     8
 10    10
FILE3
F3_1
  3
  6
  9
 12
 15

I can join the three with the following SQL statement:

01  SELECT A.F1_1,A.F1_2,B.F2_1,B.F2_2,C.F3_1
02    FROM FILE1 A
03    FULL OUTER JOIN FILE2 B
04      ON A.F1_1 - B.F2_1
05    FULL OUTER JOIN FILE3 C
06      ON A.F1_1 = C.F3_1
07      OR B.F2_1 = C.F3_1

A FULL OUTER join will include all of the records from the three files, even if there are not matches in the others. I must admit I still refer to the SQL joins diagram to determine which kind of join is best for my desired result.

The results show that the data has been combined into rows.

F1_1  F1_2  F2_1  F2_2  F3_1
  1     1      -     -     -
  2     2     2     2      -
  3     3      -     -    3
  4     4     4     4      -
  5     5      -     -     -
   -     -    6     6     6
   -     -    8     8      -
   -     -   10    10      -
   -     -     -     -    9
   -     -     -     -   12
   -     -     -     -   15

The dash ( - ) indicates a value of null, in other words no match could be made.

In my first example of an Union I am going to put together two subselects selecting all the records from FILE1 and FILE2.

01  SELECT * FROM FILE1
02  UNION
03  SELECT * FROM FILE2
04  ORDER BY 1

As the names of the fields in the files are different I need to give a number for the ORDER BY. By giving the number 1 I want my result set to be sorted by the first row of the result set.

F1_1  F1_2
  1     1
  2     2
  3     3
  4     4
  5     5
  6     6
  8     8
 10    10

By using the UNION my result set is "distinct", in other words duplicate results have been removed. If I want to see the duplicates I use the UNION ALL instead:

01  SELECT * FROM FILE1
02  UNION ALL
03  SELECT * FROM FILE2
04  ORDER BY 1

Now I see the duplicates:

F1_1  F1_2
  1     1
  2     2
  2     2
  3     3
  4     4
  4     4
  5     5
  6     6
  8     8
 10    10

I can include more than two files in an Union, so let me add FILE3.

01  SELECT * FROM FILE1
02  UNION
03  SELECT * FROM FILE2
04  UNION
05  SELECT * FROM FILE3
06  ORDER BY 1

When I execute that statement I get an error:

Number of columns not consistent.


 Message ID . . . . . :   SQL0421       Severity . . . . . . :   30
 Message type . . . . . :   Diagnostic

 Message . . . . :   Number of columns not consistent.
 Cause . . . . . :   The subselects of a UNION, INTERCEPT, or EXCEPT
   must have the same number of result columns.  All rows specified
   for a multiple row insert or for a VALUES clause must have the
   same number of values.
 Recovery  . . . :   Correct the SQL statement so that the same 
   number of columns are defined for each row. Try the request 
   again.

The result set of the Union between FILE1 and FILE2 produces a result set of two columns. As FILE3 only has one field it cannot be part of the result set. Unless I do something like:

01  SELECT * FROM FILE1
02  UNION
03  SELECT * FROM FILE2
04  UNION
05  SELECT F3_1,0 FROM FILE3
06  ORDER BY 1

Line 5: I have added a second column to the subselect from FILE3 by using the zero ( 0 ). This will place zero into the second column of the result set, and I get:

F1_1            F1_2
  1                1
  2                2
  3                0
  3                3
  4                4
  5                5
  6                6
  6                0
  8                8
  9                0
 10               10
 12                0
 15                0

If I just want to have a "distinct" list of the values I need to modify my SQL statement to just be for the first columns:

01  SELECT F1_1 FROM FILE1
02  UNION
03  SELECT F2_1 FROM FILE2
04  UNION
05  SELECT F3_1 FROM FILE3
06  ORDER BY 1

Which gives me:

F1_1
  1
  2
  3
  4
  5
  6
  8
  9
 10
 12
 15

 

To those who asked me about the Union clause I hope that these example adequately show the differences between the Join and the Union.

 

You can learn more about this from the IBM website:

 

This article was written for IBM i 7.3, and should work for earlier releases too.

5 comments:

  1. FWIW, I use UNION to detect row differences in 2 files where one file is a clone of the other. If 2 rows are identical, only 1 row gets put into the result set.

    with mytable as
    ( select * from ITEM a
    UNION
    select * from ITEM_SAVE b )
    select item, count(*) from mytable
    group by item
    having count(*) > 1

    Ringer

    ReplyDelete
    Replies
    1. Another smart tecnique for detecting row differences from 2 files is using the EXCEPT keyword. Eg.:
      SELECT * FROM item
      EXCEPT
      SELECT * FROM item_save

      This returns all rows from ITEM table that do not have a matching row in ITEM_SAVE. Here the comparison is made on all columns, but you can also choose single columns to compare.
      For reference:

      https://www.ibm.com/support/knowledgecenter/ssw_ibm_i_72/sqlp/rbafyexcept.htm

      Delete
  2. I use union with CTE to give me a number of data samples union-ed together. I create a CTE of data where a for 5 rows then 5 rows where b and 2 rows where c then union the data set.

    ReplyDelete
  3. If you like to know, what row comes from what table you can add a new field like:

    Select 'tableA' as Source, f11 from tableA
    Union
    Select 'tableB' as Source, f12 from tableB

    ReplyDelete
  4. a short tip for using union and sort out what row comes from what table afterwards: just add a

    'Table A' as Source, ...
    union
    'Table B' as Source, ...
    ...

    field and you still know the source if you like to do...

    ReplyDelete

To prevent "comment spam" all comments are moderated.
Learn about this website's comments policy here.

Some people have reported that they cannot post a comment using certain computers and browsers. If this is you feel free to use the Contact Form to send me the comment and I will post it for you, please include the title of the post so I know which one to post the comment to.