Relational databases make it very easy to join data from different tables, but did you know you can also JOIN data within the same table?
Relational databases make it easy to JOIN (combine) data from different tables.
However, you can also JOIN a table with itself. This is especially handy when a
table already contains all the data you need, but not in the right format.
References to the same table
One of the most common reasons to JOIN a table with itself, is because you have
a table that contains a foreign key reference to itself.
In the example below we have an employees table with employees. The table includes
a manager_id column that refers to an employee in that same employees table.
employees
id
name
manager_id
1
Denholm
2
Richmond
3
Roy
5
4
Maurice
5
5
Jen
1
If we want to create an overview that lists the name of each employee along
with the name of their manager, we need to join the employees table with
itself.
When we do this, we need to assign to one (or both) of them so that the database can
distinguish between the two instances. It doesn’t really matter what names you
use, as long as they’re unique and make sense to you.
Since we have two tables with identical column names, the database no longer
understands what you mean when you say you want something like a name. You
therefore have to explicitly tell it that you want the name of some row in the
employees or the managers version of the table. This can be done by
prepending column names with a table alias, followed by a .:
This query gives us the following result:
Computing durations between changes
Another common reason to use self joins is to calculate some duration or distance
between pairs of rows within a table.
For example, the order_state table below keeps track of state changes that
happen to orders at a webshop. Each record in this table includes a reference to
the order, the name of the new state, and shows when the order changed to its
new state. Can we calculate how long each of these state transitions took?
order_state
id
order_id
state
created_at
1
1
placed
2021-06-01 00:01:00
2
1
packaged
2021-06-01 00:58:00
3
1
despatched
2021-06-01 02:20:00
4
1
delivered
2021-06-01 15:20:00
5
2
placed
2021-06-02 14:10:00
6
2
packaged
2021-06-02 18:55:00
7
3
placed
2021-06-02 19:00:00
8
3
packaged
2021-06-02 20:30:00
9
3
despatched
2021-06-03 01:40:00
10
2
despatched
2021-06-03 01:40:00
11
2
delivered
2021-06-03 08:00:00
12
3
delivered
2021-06-03 16:10:00
Of course we can! The query below shows how.
Once again, we have one table, order_state, that we want to JOIN with itself.
We name one of these from_state and the other to_state, to represent the
collection of states before and after the transition.
Obviously, the two states involved in the transition should belong to the same
order, so we add the condition that from_state.order_id must be equal to
to_state.order_id. Moreover, the to_state should occur later than the
from_state, so we’ll also add a condition that the .
Finally, we want to make sure that we only see direct state transitions, like
those from placed to packages, and not indirect ones, like those from
placed to delivered. This is done by first grouping by from_state.id, which
ensures that the “from” state appears only once in the result, and then JOINING
each from_state with the closest to_state record that exists, i.e. the
record with the lowest to_state.id.
Creating pairs of data
Self joins can also be used to easily create large or unique pairs of data from
a relatively small set of data.
Cartesian product
You can to create a result set that includes every possible
combination of values within a single table, for instance when you want to know
how many possible configurations exist for a product or when you wish to discover
which combinations are common or rare.
In the example below, we combine multiple instances of a digit table to form
larger numbers.
digit
id
0
1
2
3
4
5
6
7
8
9
I once used a very similar method to generate lists of dates in a legacy
business intelligence application that was built on top of a MySQL database:
Unique pairs
Sometimes you’re only interested in unique combinations. For example when you
need to create pairs of people or want to list : it doesn’t
matter which value is listed first or second – what matters is that the first
and second values are listed together exactly once.
The example below shows a simple person table that contains the names of six
people.
person
id
name
1
Rachel
2
Monica
3
Phoebe
4
Joey
5
Chandler
6
Ross
We can create a list of possible pairings by adding a simple condition to the
JOIN clause: