I have two comma-separated string columns (sourceAuthors and targetAuthors).
val df = Seq(
("Author1,Author2,Author3","Author2,Aut
Unless I misunderstood your problem, there are standard functions that do this without a UDF, namely split and array_intersect (the latter is available since Spark 2.4).
Given the following dataset:
val df = Seq(("Author1,Author2,Author3","Author2,Author3"))
.toDF("source","target")
scala> df.show(false)
+-----------------------+---------------+
|source                 |target         |
+-----------------------+---------------+
|Author1,Author2,Author3|Author2,Author3|
+-----------------------+---------------+
You could write the following structured query:
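// Explicit import; required when running outside spark-shell
import org.apache.spark.sql.functions.{array_intersect, split}
val intersect = array_intersect(split($"source", ","), split($"target", ","))
val solution = df.select(intersect as "common_elements")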
scala> solution.show(false)
+------------------+
|common_elements   |
+------------------+
|[Author2, Author3]|
+------------------+
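If you want to keep the original columns next to the intersection, something like the following might work. It is only a sketch under a few assumptions: the local SparkSession setup, the app name, and the common_count column are illustrative additions that are not part of the question, and Spark 2.4 or later is assumed for array_intersect.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{array_intersect, col, size, split}

// Assumption: a local session for experimenting; in spark-shell `spark` already exists
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("common-authors") // hypothetical app name
  .getOrCreate()
import spark.implicits._

val df = Seq(("Author1,Author2,Author3", "Author2,Author3"))
  .toDF("source", "target")

// Split both comma-separated strings into arrays and intersect them
val common = array_intersect(split(col("source"), ","), split(col("target"), ","))

val result = df
  .withColumn("common_elements", common)                     // keeps source and target
  .withColumn("common_count", size(col("common_elements")))  // hypothetical extra column

result.show(false)
```

Note that the second argument of split is a regular expression, so if your values have spaces after the commas, a pattern such as "\\s*,\\s*" may be a better fit than a plain ",".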