SQL Joins Internal Architecture
Understanding the internals of the SQL JOIN operation
Have you ever tried to combine pieces of information from the different sources available to you? Maybe you have a list of users on the platform you built, and you want to see who has written which blogs, or how many likes a specific blog has received. In databases, we use something called joins to connect data from different tables.
Recently, while working on a project at my company, I found myself diving deep into SQL joins. I came across some videos that explained the concept well, but as I watched, I kept wondering: how do SQL joins work internally?
This curiosity led me to discover the difference between logical joins and physical joins. While we, as users, write SQL queries using logical joins to express the relationships we want to establish, it’s the relational database management system (RDBMS) that translates these logical joins into physical joins to perform the actual operations. The magic happens behind the scenes, and understanding how the RDBMS executes these operations can be a game-changer for optimising query performance.
So before we dive deep into physical joins, let's do a quick refresher of the different types of joins in SQL.
NOTE: This article assumes that no indexes are present.
Summary of the different types of JOINs:
INNER JOIN: Combines rows from two tables where the join condition is met.
LEFT (OUTER) JOIN: Includes all rows from the left table and matched rows from the right table.
RIGHT (OUTER) JOIN: Includes all rows from the right table and matched rows from the left table.
FULL (OUTER) JOIN: Includes all rows from both tables, with NULL in unmatched fields.
CROSS JOIN: Produces a Cartesian product of two tables.
SELF JOIN: Joins a table to itself using a join condition.
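To make these logical joins concrete, here is a minimal Python sketch of the INNER and LEFT join semantics on two hypothetical in-memory tables (the rows and names are made up for illustration; this is not how a database implements them):

# Hypothetical toy rows: one user has written no blog, one blog points to a missing user
users = [{"id": 1, "name": "Asha"}, {"id": 2, "name": "Ravi"}]
blogs = [{"id": 10, "user_id": 1, "title": "Joins 101"},
         {"id": 11, "user_id": 3, "title": "Orphan post"}]

# INNER JOIN: only pairs where the join condition b.user_id = u.id holds
inner = [(u, b) for u in users for b in blogs if b["user_id"] == u["id"]]

# LEFT JOIN (users LEFT JOIN blogs): every user appears, paired with None when no blog matches
left = [(u, b) for u in users
        for b in ([x for x in blogs if x["user_id"] == u["id"]] or [None])]

print(len(inner))  # 1 -> only Asha's blog has a matching user
print(left)        # Ravi is kept, paired with None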
Now let's jump to physical joins, where the real magic lies.
Internal Working of SQL
Parsing and Query Optimisation
Parsing: The SQL query is parsed into a query tree to ensure its syntax is valid.
Optimisation: The database engine evaluates multiple query execution plans and chooses the one with the lowest cost, considering factors like indexes, table sizes, and data distribution.
Once the query is optimised, we move on to the algorithms used to execute the JOIN operation.
Physical Join Algorithms
Before we start explaining the algorithms behind SQL joins, here is the example query we will be using throughout:
SELECT u.name, COUNT(*) AS bcount
FROM blogs b
INNER JOIN users u ON b.user_id = u.id
GROUP BY u.name
ORDER BY bcount DESC;
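Before looking at how the engine executes this, here is a small Python sketch of what the query computes, using hypothetical toy rows for users and blogs (the data and the naive in-memory approach are assumptions for illustration, not how a database actually runs it):

from collections import Counter

# Hypothetical toy rows standing in for the users and blogs tables
users = [{"id": 1, "name": "Asha"}, {"id": 2, "name": "Ravi"}]
blogs = [{"id": 10, "user_id": 1}, {"id": 11, "user_id": 1}, {"id": 12, "user_id": 2}]

# INNER JOIN ... GROUP BY u.name ... ORDER BY bcount DESC, done naively in memory
name_by_id = {u["id"]: u["name"] for u in users}
bcount = Counter(name_by_id[b["user_id"]] for b in blogs if b["user_id"] in name_by_id)

for name, count in bcount.most_common():
    print(name, count)   # Asha 2, Ravi 1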
Nested Loop Join
In simple terms, the right relation is scanned once for every row in the left relation.
It is quite similar to nested for loops in programming.
Processing Flow:
FOR r1 in table1:                 # Users table in this case
    FOR r2 in table2:             # Blogs table in this case
        if r1.attr1 = r2.attr2:   # join condition
            resultset.push(<r1, r2>)
Example:
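As a concrete illustration, here is a minimal Python sketch of a nested loop join over hypothetical toy rows (the data is made up, and a real engine works on disk pages and may swap which relation is the outer one):

# Hypothetical toy rows for the two relations
users = [{"id": 1, "name": "Asha"}, {"id": 2, "name": "Ravi"}]
blogs = [{"id": 10, "user_id": 1}, {"id": 11, "user_id": 1}, {"id": 12, "user_id": 2}]

result = []
for u in users:                      # outer loop: scan one relation once
    for b in blogs:                  # inner loop: rescan the other relation for every outer row
        if b["user_id"] == u["id"]:  # join condition
            result.append((u, b))

print(len(result))   # 3 joined pairs, after len(users) * len(blogs) comparisons

Every row of one table is compared against every row of the other, which is exactly where the O(n*m) cost below comes from.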
Characteristics:
Easiest to Implement
Time-consuming for large datasets
Better suited to small datasets, or to cases where an index exists on the join attribute (no need for a full scan; the index is used to access the relevant rows).
Computational Complexity: O(n*m)
Hash Join
It involves two relations:
Right relation (build side): scanned once and inserted into a hash table keyed on the join attribute.
Left relation (probe side): each tuple in this relation is then matched against the hash table entries to form the result set.
It involves building a hash table, as the name suggests, and then performing the join. Generally, the hash key is derived from the JOIN attribute. The whole approach rests on the premise that the hash function distributes the data evenly.
Processing Flow:
For r1 in relation1:
    rows = table[FN(r1.attr1)]    # JOIN attribute
    For r2 in rows:               # hash table is prebuilt from the right relation using the condition above
        resultset.push(<r1, r2>)
Example:
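Here is a minimal Python sketch of the two phases on hypothetical toy rows: build a hash table on the right relation (users, keyed on id), then probe it with the left relation (blogs). A real engine hashes the join key and can spill to disk when memory runs out; this only shows the idea:

from collections import defaultdict

# Hypothetical toy rows
users = [{"id": 1, "name": "Asha"}, {"id": 2, "name": "Ravi"}]                            # right / build relation
blogs = [{"id": 10, "user_id": 1}, {"id": 11, "user_id": 1}, {"id": 12, "user_id": 2}]    # left / probe relation

# Build phase: one pass over the right relation, keyed on the join attribute
hash_table = defaultdict(list)
for u in users:
    hash_table[u["id"]].append(u)

# Probe phase: one pass over the left relation, looking up matches in the hash table
result = []
for b in blogs:
    for u in hash_table.get(b["user_id"], []):
        result.append((b, u))

print(len(result))   # 3 pairs, after a single pass over each relation (roughly O(m + n))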
Characteristics:
Well Suited for Equi-Joins (attr = attr)
Efficient for Large Datasets
Pre-join preparation is required, namely the hash table construction
Requires additional memory.
Computational Cost: O(m+n)
Merge Join
It works on the principle of sorting each table on the join attribute, then scanning both sorted tables and joining the matching rows.
It is quite similar to merging two sorted arrays, which makes it easy for the DB engine to join the two tables.
Processing Flow:
table1 = sort(table1) on attribute1
table2 = sort(table2) on attribute2
Initialise pointers to the first rows of sorted Table 1 and Table 2
While neither table is exhausted:
    If join keys match:
        Output matching rows
        Advance both pointers
    Else if Table 1's key < Table 2's key:
        Advance Table 1 pointer
    Else:
        Advance Table 2 pointer
Example:
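Here is a minimal Python sketch of the merge phase on hypothetical toy rows, assuming the join key is unique in users (id is a primary key); blogs may hold several rows per user, so only the blogs pointer advances on a match:

# Hypothetical toy rows, sorted on their join attributes
users = sorted([{"id": 2, "name": "Ravi"}, {"id": 1, "name": "Asha"}], key=lambda u: u["id"])
blogs = sorted([{"id": 12, "user_id": 2}, {"id": 10, "user_id": 1}, {"id": 11, "user_id": 1}],
               key=lambda b: b["user_id"])

result = []
i = j = 0
while i < len(blogs) and j < len(users):        # stop when either input is exhausted
    if blogs[i]["user_id"] == users[j]["id"]:
        result.append((blogs[i], users[j]))
        i += 1                                  # the next blog may belong to the same user
    elif blogs[i]["user_id"] < users[j]["id"]:
        i += 1                                  # this blog has no matching user
    else:
        j += 1                                  # this user has no (more) blogs

print(len(result))   # 3 pairs from one coordinated scan of both sorted inputs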
Characteristics:
Well Suited for Equi-Joins (attr = attr)
Efficient for Large Datasets
Pre-join preparation is required, namely sorting both inputs on the join attribute
May require additional memory (or disk) for the sort if the inputs are not already sorted.
Computational Cost: O(m+n) for the merge phase, plus the cost of sorting when the inputs are not already sorted
Now, how does the SQL engine decide which algorithm to apply?
The SQL engine decides which join algorithm to use (nested loop join, hash join, or merge join) based on the query execution plan, which is generated during the optimisation phase. This decision depends on factors like the size of the tables, the availability of indexes, the distribution of data, and whether the inputs are already sorted.
For example, if one table is significantly smaller, a hash join may be preferred to build an efficient in-memory lookup structure. For already sorted tables, a merge join can minimize overhead by directly matching rows. If no indexes are available and the dataset is small, a nested loop join might suffice. The SQL optimiser evaluates these scenarios to choose the algorithm with the lowest estimated cost in terms of CPU, memory, and I/O.
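To make the trade-off tangible, here is a deliberately simplified Python sketch of such a cost comparison; the formulas and row counts are assumptions for illustration only and are nothing like a real optimiser's cost model:

import math

def rough_join_costs(n_left, n_right, left_sorted=False, right_sorted=False):
    # Toy cost model: counts row operations only, ignoring I/O, memory and indexes
    nested_loop = n_left * n_right
    hash_join = n_left + n_right                         # plus memory for the hash table
    merge = n_left + n_right                             # the merge phase itself
    if not left_sorted:
        merge += n_left * math.log2(max(n_left, 2))      # sort cost if not already sorted
    if not right_sorted:
        merge += n_right * math.log2(max(n_right, 2))
    return {"nested_loop": nested_loop, "hash": hash_join, "merge": merge}

costs = rough_join_costs(1_000, 1_000_000)
print(min(costs, key=costs.get), costs)                  # the hash join wins for large, unsorted inputs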
Factors Affecting Join Performance
Indexes: Proper indexing on join columns can vastly improve the performance.
Table Size: Larger tables require more resources for scanning and processing.
Join Condition: Simple conditions are cheaper to evaluate than complex ones.
Data Distribution: Unevenly distributed data can lead to skewed join performance.
Available Memory: Hash joins and merge joins may require sufficient memory to store intermediate results.
Optimising SQL Joins
Index Join Columns: Create indexes on frequently joined columns to reduce lookup times.
Filter Early: Use WHERE clauses to filter rows before the join operation to reduce the dataset size.
Choose the Right Join: Select the join type that matches our use case and minimizes unnecessary data retrieval.
Use Explain Plans: Analyze query execution plans to identify inefficiencies and optimize the query.
Partition Tables: For very large datasets, consider partitioning tables to improve join performance.
Conclusion
Now we have reached the end of the article. Always remember that we never need to tell SQL explicitly which physical join to use; the engine selects the best one based on the data it is scanning.
Writing this article helped me greatly in understanding the internals of the SQL JOIN, and in revising concepts I had long forgotten.
If you found this blog post helpful, please consider sharing it with others who might benefit. You can also follow me for more content on JavaScript, React, and other web development topics.
For Paid collaboration, mail me at: avikm744@gmail.com
Connect with me on Twitter, LinkedIn, and GitHub.
Thank you for Reading. Adios :)