I think you can do what you want with two queries, but I'm not 100% sure. Often in this situation, it is sufficient to find things in the first table that don't match in the second table. You are also trying to get a "closest" match, which is why this is challenging.
The following query looks for matches on user id and exactly one of the other two fields, and then combines them:
SELECT table2.buyer_id, table2.item_id, table2.created_time, prod_and_ts.*
FROM (SELECT user_id,
             prod_and_ts.product_id AS product_id,
             prod_and_ts.timestamps AS timestamps
      FROM testingtable2
      LATERAL VIEW explode(purchased_item) exploded_table AS prod_and_ts
     ) prod_and_ts
JOIN table2
  ON prod_and_ts.user_id = table2.buyer_id
 AND prod_and_ts.product_id = table2.item_id
 AND prod_and_ts.timestamps <> UNIX_TIMESTAMP(table2.created_time)
UNION ALL
SELECT table2.buyer_id, table2.item_id, table2.created_time, prod_and_ts.*
FROM (SELECT user_id,
             prod_and_ts.product_id AS product_id,
             prod_and_ts.timestamps AS timestamps
      FROM testingtable2
      LATERAL VIEW explode(purchased_item) exploded_table AS prod_and_ts
     ) prod_and_ts
JOIN table2
  ON prod_and_ts.user_id = table2.buyer_id
 AND prod_and_ts.product_id <> table2.item_id
 AND prod_and_ts.timestamps = UNIX_TIMESTAMP(table2.created_time)
This will not find situations where there is no match on either field.
Also, I've written the conditions in the "on" clause rather than in a "where" clause. I assume Hive supports this.
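If your Hive version rejects the <> comparison inside ON (older releases only allow equality conditions in a join), the inequality can be moved into WHERE. For an inner join this should return the same rows. Here is a sketch of the first branch only; the aliases p and t2 are mine, everything else comes from the query above, and the second branch can be treated the same way:

```sql
-- Sketch: first branch of the UNION ALL, with only equality tests in ON
-- and the inequality applied afterwards in WHERE.
-- Aliases p and t2 are introduced here for brevity.
SELECT t2.buyer_id, t2.item_id, t2.created_time, p.*
FROM (
    SELECT user_id,
           prod_and_ts.product_id AS product_id,
           prod_and_ts.timestamps AS timestamps
    FROM testingtable2
    LATERAL VIEW explode(purchased_item) exploded_table AS prod_and_ts
) p
JOIN table2 t2
  ON p.user_id = t2.buyer_id
 AND p.product_id = t2.item_id
WHERE p.timestamps <> UNIX_TIMESTAMP(t2.created_time);
```

This rewrite is only safe because the join is an inner join. With an outer join, moving a condition from ON to WHERE changes which unmatched rows survive.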
Your rep is too high to open a duplicate, and especially two duplicates, of the same question:
Joining two Tables in Hive using HiveQL(Hadoop)
Join Two Tables and get the output from both of them
You don't have enough info to tie the records back for the third scenario.
You can do a FULL OUTER JOIN with an OR and get everything back. That matches the rows you have enough info on (the first and second cases you list) and identifies the records you don't, by returning rows with NULLs for the fields from the non-matching table in the third scenario.
SELECT DATEPART(d,B.T1time),DATEPART(d,A.Created_TIME),*
FROM SO_Table1HIVE A
FULL OUTER JOIN SO_Table2HIVE B ON A.BUYER_ID = B.[USER_ID]
AND (B.t1time = A.Created_TIME OR B.PRODUCTID = A.ITEM_ID)
Trying to match on the third scenario is a hack; the info is not there.
The query below also pairs up rows that fall on the same day and are not already matched on time or product, but again you will get Cartesian products.
SELECT DATEPART(d,B.T1time),DATEPART(d,A.Created_TIME),*
FROM SO_Table1HIVE A
FULL OUTER JOIN SO_Table2HIVE B ON A.BUYER_ID = B.[USER_ID]
    AND (
        (B.t1time = A.Created_TIME OR B.PRODUCTID = A.ITEM_ID)
        OR
        (
            (A.Created_TIME <> B.t1time AND B.PRODUCTID <> A.ITEM_ID AND DATEPART(d,B.T1time) = DATEPART(d,A.Created_TIME))
            AND A.ITEM_ID NOT IN (SELECT ITEM_ID
                                  FROM SO_Table1HIVE A2
                                  INNER JOIN SO_Table2HIVE B2 ON A2.BUYER_ID = B2.[USER_ID] AND (A2.Created_TIME = B2.t1time OR B2.PRODUCTID = A2.ITEM_ID)
                                 )
            AND B.PRODUCTID NOT IN (SELECT PRODUCTID
                                    FROM SO_Table1HIVE A2
                                    INNER JOIN SO_Table2HIVE B2 ON A2.BUYER_ID = B2.[USER_ID] AND (A2.Created_TIME = B2.t1time OR B2.PRODUCTID = A2.ITEM_ID)
                                   )
        )
    )
You could use RANK() or try a top one, etc. RANK() or ROW_NUMBER() would probably be the best of these hacks if this were not a Hive question, but since I know you are using HQL, I am not going to write it up. You could also pull the rows out into a separate table, run some logical update queries against it, and then use that as a lookup table to tie back.
tbl1Tbl2Lookup
---------------
id int identity
table1info FK
table2info FK
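For readers who are not on Hive, the ROW_NUMBER() idea mentioned above might look roughly like this in T-SQL. It is a sketch only: it assumes Created_TIME and t1time are both datetime columns (as the DATEPART calls above suggest), and the names rn and ranked are made up for the example. For each row of SO_Table1HIVE it keeps the one row of SO_Table2HIVE for the same user whose time is closest:

```sql
-- Sketch: pick the "closest" SO_Table2HIVE row for each SO_Table1HIVE row.
-- Assumes Created_TIME and t1time are datetime columns; rn and ranked
-- are illustrative names.
SELECT BUYER_ID, ITEM_ID, Created_TIME, PRODUCTID, t1time
FROM (
    SELECT A.BUYER_ID, A.ITEM_ID, A.Created_TIME,
           B.PRODUCTID, B.t1time,
           ROW_NUMBER() OVER (
               PARTITION BY A.BUYER_ID, A.ITEM_ID, A.Created_TIME
               ORDER BY ABS(DATEDIFF(second, A.Created_TIME, B.t1time))
           ) AS rn
    FROM SO_Table1HIVE A
    INNER JOIN SO_Table2HIVE B ON A.BUYER_ID = B.[USER_ID]
) ranked
WHERE rn = 1
```

Ties are broken arbitrarily, and "closest" here means closest in time only, so this is still a guess about which rows belong together. Recent Hive versions also provide ROW_NUMBER(), so something similar may be possible there, though the date arithmetic would need Hive functions instead of DATEDIFF.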
What you should probably do is what the person in the question you offered a bounty on suggested, since you really don't have a good way to query the third scenario and they offered you an alternative that is specific to Hive.