Join two RDD in spark

偶尔善良 提交于 2021-02-06 14:00:02

问题


I have two rdd one rdd have just one column other have two columns to join the two RDD on key's I have add dummy value which is 0 , is there any other efficient way of doing this using join ?

val lines = sc.textFile("ml-100k/u.data")
val movienamesfile = sc.textFile("Cml-100k/u.item")

val moviesid = lines.map(x => x.split("\t")).map(x => (x(1),0))
val test = moviesid.map(x => x._1)
val movienames = movienamesfile.map(x => x.split("\\|")).map(x => (x(0),x(1)))
val shit = movienames.join(moviesid).distinct()

Edit:

Let me convert this question in SQL. Say for example I have table1 (moveid) and table2 (movieid,moviename). In SQL we write something like:

select moviename, movieid, count(1)
from table2 inner join table table1 on table1.movieid=table2.moveid 
group by ....

here in SQL table1 has only one column where as table2 has two columns still the join works, same way in Spark can join on keys from both the RDD's.


回答1:


Join operation is defined only on PairwiseRDDs which are quite different from a relation / table in SQL. Each element of PairwiseRDD is a Tuple2 where the first element is the key and the second is value. Both can contain complex objects as long as key provides a meaningful hashCode

If you want to think about this in a SQL-ish you can consider key as everything that goes to ON clause and value contains selected columns.

SELECT table1.value, table2.value
FROM table1 JOIN table2 ON table1.key = table2.key

While these approaches look similar at first glance and you can express one using another there is one fundamental difference. When you look at the SQL table and you ignore constraints all columns belong in the same class of objects, while key and value in the PairwiseRDD have a clear meaning.

Going back to your problem to use join you need both key and value. Arguably much cleaner than using 0 as a placeholder would be to use null singleton but there is really no way around it.

For small data you can use filter in a similar way to broadcast join:

val moviesidBD = sc.broadcast(
  lines.map(x => x.split("\t")).map(_.head).collect.toSet)

movienames.filter{case (id, _) => moviesidBD.value contains id}

but if you really want SQL-ish joins then you should simply use SparkSQL.

val movieIdsDf = lines
   .map(x => x.split("\t"))
   .map(a => Tuple1(a.head))
   .toDF("id")

val movienamesDf = movienames.toDF("id", "name")

// Add optional join type qualifier 
movienamesDf.join(movieIdsDf, movieIdsDf("id") <=> movienamesDf("id"))


来源:https://stackoverflow.com/questions/33321704/join-two-rdd-in-spark

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!