Motivation
- How do we process
SELECT * FROM Student WHERE sid > 30
?
- Learned in previous lectures, use indexes and data structures etc
- How do we process
SELECT * FROM Student S, Enroll E WHERE S.sid = E.sid
?
Nest-Loop Join (NLJ)
For each r in R:
For each s in S:
if r.A = s.A: output(r,s)
Performance are bad when the table is large
Sort-Merge Join (SMJ)
- Sort the relations first, then join
- Use a two pointer approach to compare each tuple in the table
- e.g. always move the smaller pointer when the 2 pointers' value are not the same
if not sorted, sort R and S by A
i,j = 1,1
while i < len(R) and j < len(S):
if R[i].A == S[j].A:
output (R[i],S[j])
i += 1
j += 1
else if R[i].A > S[j].A: j += 1
else if R[i].A < S[j].A: i += 1
Index Join (IJ)
-
Create an index for S.A if needed
For each r in R:
let X = loopkup index on S.A with r.A value
For each s in X:
output (r,s)
Hash Join (HJ)
- Hash function : h(v) → [1, k]
- Q: Given $(r \in R)$ and $(s \in S)$, can r and s join if h(r.A) ≠ h(s.A)?
- Main idea
- Partition tuples in R and S based on hash values on join attributes
- Perform "joins" only between partitions of the same hash value
- Hashing stage (bucketizing): hash tuples into buckets
- Hash R tuples into G1,...Gk buckets
- Hash S tuples into G1,...Gk buckets
- Join stage: join tuples in matching buckets