There are 2 differences:
Sorting:
- a merge join requires both inputs to be sorted the same way
- lookup does not require either input to be sorted.
Database query load:
- a merge join does not refer to the database, just the 2 input flows (although the reference data is typically in the form of 'select * from table order by join criteria')
- lookup will issue 1 query for each (distinct, if cached) value that it is being asked to join on. This rapidly becomes more expensive than the above select.
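
To make the two differences concrete, here is a minimal Python sketch of both strategies. It is not how SSIS implements either component; `merge_join`, `CachedLookup` and `lookup_join` are hypothetical names, a dict stands in for the database table, and the reference keys are assumed to be unique.

```python
def merge_join(left, right):
    """Inner-join two lists of (key, value) tuples.

    Both inputs must already be sorted by key in the same order.
    Assumes keys in `right` (the reference data) are unique.
    """
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        left_key, right_key = left[i][0], right[j][0]
        if left_key < right_key:
            i += 1          # no reference row for this key
        elif left_key > right_key:
            j += 1          # reference row not needed, skip it
        else:
            out.append((left_key, left[i][1], right[j][1]))
            i += 1          # keep j: the next left row may repeat the key
    return out


class CachedLookup:
    """Stand-in for a cached lookup: one 'query' per distinct key."""

    def __init__(self, table):
        self.table = table  # dict playing the role of the database table
        self.cache = {}
        self.queries = 0    # how many times the 'database' was hit

    def get(self, key):
        if key not in self.cache:
            self.queries += 1
            self.cache[key] = self.table.get(key)
        return self.cache[key]


def lookup_join(left, lookup):
    """Inner-join unsorted (key, value) rows against a CachedLookup."""
    out = []
    for key, value in left:
        ref = lookup.get(key)
        if ref is not None:
            out.append((key, value, ref))
    return out
```

The merge version reads every reference row once and never queries by key; the lookup version accepts input in any order but pays one round trip per distinct key, which is the cost the rest of this answer is weighing.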
This leads to:
If it is no effort to produce a sorted list, and you want more than about 1% of the rows, then merge join is the way to go. The 1% figure comes from a single-row select costing roughly 100x as much as reading the same row while streaming. The caveat is the sort itself: you don't want to sort a 10 million row table in memory.
If you only expect a small number of matches (distinct values looked up, when caching is enabled) then lookup is better.
For me, the break-even point between the two falls somewhere between 10k and 100k rows needing to be looked up.
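
The 1% rule of thumb can be written down as a rough cost model. This is only a sketch: the function names are made up for illustration, the 100x ratio is the estimate quoted above rather than a measured value, and the model ignores sort cost, latency and row width.

```python
def estimated_costs(table_rows, distinct_lookups, single_row_cost_ratio=100.0):
    """Return (merge_cost, lookup_cost) in units of 'one streamed row'.

    merge:  every row of the reference table is streamed once.
    lookup: each distinct key costs one single-row select, assumed to be
            `single_row_cost_ratio` times the cost of a streamed row.
    """
    merge_cost = float(table_rows)
    lookup_cost = distinct_lookups * single_row_cost_ratio
    return merge_cost, lookup_cost


def cheaper_join(table_rows, distinct_lookups, single_row_cost_ratio=100.0):
    """Pick the cheaper strategy under the model above."""
    merge_cost, lookup_cost = estimated_costs(
        table_rows, distinct_lookups, single_row_cost_ratio
    )
    return "merge" if merge_cost < lookup_cost else "lookup"
```

Under these assumptions, a 10 million row reference table breaks even at 100,000 distinct lookups (1% of the table): `cheaper_join(10_000_000, 50_000)` picks the lookup and `cheaper_join(10_000_000, 200_000)` picks the merge.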
Which one is quicker will depend on:
- the total number of rows to be processed. (if the table is memory resident, a sort of the data to merge it is cheap)
- the number of duplicate lookups expected. (high per-row overhead of lookup)
- if you can select sorted data (note, text sorts are influenced by collation, so be careful that what sql considers sorted is also what ssis considers sorted)
- what percentage of the entire table you will look up. (merge will require selecting every row, lookup is better if you only have a few rows on one side)
- the width of a row (rows per page can strongly influence the io cost of doing single lookups vs a scan) (narrow rows -> more preference for merge)
- the order of data on disk (if it is easy to produce sorted output, prefer merge; if you can organise the lookups to be done in physical disk order, lookups are less costly due to fewer cache misses)
- network latency between the ssis server and the destination (larger latency -> prefer merge)
- how much coding effort you wish to spend (merge is a bit more complex to write)
- the collation of the input data. SSIS merge has weird ideas about sorting text strings which contain non-alphanumeric characters but are not nvarchar. (This goes back to sorting: getting sql to emit a sort order which ssis is happy to merge is hard.)
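
The collation problem is easy to reproduce outside SSIS. The sketch below uses two stand-in orderings: plain code point order, and a hypothetical `collation_key` that ignores case and punctuation. Neither is claimed to match the actual rules of SQL Server or SSIS; the point is only that data sorted under one ordering is not necessarily sorted under another, and a merge that assumes sorted input will then silently miss matches.

```python
def collation_key(text):
    """Hypothetical collation: case-insensitive, punctuation ignored."""
    return "".join(ch for ch in text.lower() if ch.isalnum())


def is_sorted(rows, key):
    """True if `rows` is in non-decreasing order under `key`."""
    return all(key(a) <= key(b) for a, b in zip(rows, rows[1:]))


keys = ["a-b", "ab", "A_c", "B"]

# Order produced by a plain code point comparison.
binary_order = sorted(keys)

# Order produced by the hypothetical collation.
collated_order = sorted(keys, key=collation_key)

# Each list is sorted under its own rules but not under the other's.
print(binary_order, is_sorted(binary_order, collation_key))
print(collated_order, is_sorted(collated_order, lambda s: s))
```

A cheap safeguard along these lines is to check a sample of the sorted output under the comparison the merge will actually use before trusting it as sorted.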