Spark column string replace when present in other column (row)

Submitted by 冷眼眸甩不掉的悲伤 on 2019-12-17 13:56:34

Question


I would like to remove from one column (sentence) any string that is present in another column (label) of the same row:

val df = spark.createDataFrame(Seq(
("Hi I heard about Spark", "Spark"),
("I wish Java could use case classes", "Java"),
("Logistic regression models are neat", "models")
)).toDF("sentence", "label")

I would like to do this with regexp_replace or translate (see the Spark functions API), something like:

// "(?????)" is a placeholder for the pattern I do not know how to write
val res = df.withColumn("sentence_without_label",
  regexp_replace(col("sentence"), "(?????)", ""))

so that res gains a sentence_without_label column holding each sentence with its label removed.


Answer 1:


You could simply use regexp_replace, passing the label column as the pattern (df5 is defined in the snippet further down):

df5.withColumn("sentence_without_label", regexp_replace($"sentence", $"label", lit("")))

Or you can use a simple UDF, as below:

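import org.apache.spark.sql.functions.udf
import spark.implicits._  // enables the $"column" syntax

val df5 = spark.createDataFrame(Seq(
  ("Hi I heard about Spark", "Spark"),
  ("I wish Java could use case classes", "Java"),
  ("Logistic regression models are neat", "models")
)).toDF("sentence", "label")

// Note: replaceAll treats rep as a regular expression, not as a literal string
val replace = udf((data: String, rep: String) => data.replaceAll(rep, ""))

val res = df5.withColumn("sentence_without_label", replace($"sentence", $"label"))

// show(false) prints full, untruncated values, as in the output below
res.show(false)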

Output:

+-----------------------------------+------+------------------------------+
|sentence                           |label |sentence_without_label        |
+-----------------------------------+------+------------------------------+
|Hi I heard about Spark             |Spark |Hi I heard about              |
|I wish Java could use case classes |Java  |I wish  could use case classes|
|Logistic regression models are neat|models|Logistic regression  are neat |
+-----------------------------------+------+------------------------------+
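
One caveat with the UDF above: String.replaceAll interprets rep as a regular expression, and a null in either column would throw inside the UDF. The following is a minimal sketch of a literal, null-safe variant, written in plain Scala so it runs without Spark. The names LabelRemoval and removeLiteral are hypothetical and not part of the original answer.

```scala
object LabelRemoval {
  // Remove every literal occurrence of `label` from `sentence`.
  // String.replace matches plain text, so regex metacharacters in `label`
  // (for example "C++" or "(beta)") are not interpreted as a pattern.
  // A null or empty input leaves `sentence` unchanged (hypothetical policy).
  def removeLiteral(sentence: String, label: String): String =
    if (sentence == null || label == null || label.isEmpty) sentence
    else sentence.replace(label, "")
}
```

With Spark on the classpath, this could presumably be wrapped as udf(LabelRemoval.removeLiteral _) and applied to $"sentence" and $"label" in the same way as the replace UDF above.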



Answer 2:


If label is just a literal, it is pretty simple:

import org.apache.spark.sql.functions._

df.withColumn("sentence_without_label", 
  regexp_replace(col("sentence"), col("label"), lit(""))).show(false)

+-----------------------------------+------+------------------------------+
|sentence                           |label |sentence_without_label        |
+-----------------------------------+------+------------------------------+
|Hi I heard about Spark             |Spark |Hi I heard about              |
|I wish Java could use case classes |Java  |I wish  could use case classes|
|Logistic regression models are neat|models|Logistic regression  are neat |
+-----------------------------------+------+------------------------------+  
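
Note that regexp_replace treats the label value as a Java regular expression, so a label containing metacharacters such as . or + will not be matched literally. The following is a minimal sketch of quoting the label first, shown in plain Scala so it runs without Spark. QuotedLabel and removeQuoted are hypothetical names used only for illustration.

```scala
import java.util.regex.Pattern

object QuotedLabel {
  // Pattern.quote returns a pattern string that matches `label` literally,
  // so characters like '.', '+', '(' lose their regex meaning.
  def quote(label: String): String = Pattern.quote(label)

  // Remove all literal occurrences of `label` via a quoted regex.
  def removeQuoted(sentence: String, label: String): String =
    sentence.replaceAll(quote(label), "")
}
```

In Spark, the same idea could be expressed by building the pattern column as concat(lit("\\Q"), col("label"), lit("\\E")) and passing it to regexp_replace, assuming no label itself contains the sequence \E.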

In Spark 1.6 you can do the same with expr:

df.withColumn(
  "sentence_without_label",
  expr("regexp_replace(sentence, label, '')"))


Source: https://stackoverflow.com/questions/45615621/spark-column-string-replace-when-present-in-other-column-row
