Pondering Join Algorithms

Truly enjoying this Intro to Database Systems course from Carnegie Mellon University. Some really great breakdowns of common join algorithms in this lecture. Here are my notes.

Lecture 11- Join Algorithms (CMU Databases Systems / Fall 2019)

Prof. Andy Pavlo, Carnegie Mellon Database Group

Join Algorithms

screenshot from lecture

Table Positioning for a Join

“In general, your smaller table should be the “left” table when joining two tables.”… Professor demonstrates better performance by making the smaller table the “outer” table in a join.

Block Nested Loop Join [mysql example]

  • “The brute force approach”
  • If you have enough memory to hold a large table, a good option for joining.
  • Always pick the smaller table as the outer table.
  • Buffer as much of your outer table in memory as possible to reduce redundant I/O.
  • Loop over the inner table or use an index.

Index Nested Loop Join [CS Course definition]

If indexes are available, or you could create an index to use for a join.

Sort-Merge Join [wikipedia]

Useful if one or both tables are sorted on a join key. Maximize sequential I/O.

Sort - Merge Join

screenshot from lecture

Hash Join [mysql blog example]

Best performance. For large datasets.

  1. Phase #1 Build (Hash Table)
  2. Phase #2 Probe

Use Bloom Filter set operations for probe phase optimization

  1. insert a key
  2. lookup a key

Additional Reading on Bloom Filters

Let’s implement a Bloom Filter

Bloom Filters Debunked

Grace Hash Join [wikipedia]

  • “Do hash joins when things don’t fit in memory.”
  • Use a hash table for each table. Break the tables into buckets then do a nested loop join on each bucket. If the buckets do not fit in memory, use recursive partitioning. Then everything fits in memory for the join.

“Split outer relation into partitions based on the hash key.”

Prof. Andy Pavlo on Hash Join algorithm

  • Hashing is almost always better than sorting for operator execution.

“No join algorithm works well in all scenarios.”

-Prof. Andy Pavlo