How to use the PySpark CountVectorizer on columns that may be null

梦想与她 submitted on 2019-12-13 05:47:27

Question


I have a column in my Spark DataFrame:

 |-- topics_A: array (nullable = true)
 |    |-- element: string (containsNull = true)

I'm using CountVectorizer on it:

from pyspark.ml.feature import CountVectorizer
topic_vectorizer_A = CountVectorizer(inputCol="topics_A", outputCol="topics_vec_A")

I get NullPointerExceptions, because the topics_A column sometimes contains null.

Is there a way around this? Filling the nulls with a zero-length array would work OK (although it will blow out the data size quite a lot), but I can't work out how to do a fillna on an array column in PySpark.


Answer 1:


Personally I would drop rows with NULL values, because there is no useful information there, but you can replace nulls with empty arrays. First some imports:

from pyspark.sql.functions import when, col, coalesce, array

You can define an empty array of a specific type as:

fill = array().cast("array<string>")

and combine it with a when clause:

topics_a = when(col("topics_A").isNull(), fill).otherwise(col("topics_A"))

or coalesce:

topics_a = coalesce(col("topics_A"), fill)

and use it as:

df.withColumn("topics_A", topics_a)

so with example data:

df = sc.parallelize([(1, ["a", "b"]), (2, None)]).toDF(["id", "topics_A"])

df_ = df.withColumn("topics_A", topics_a)
topic_vectorizer_A.fit(df_).transform(df_)

the result will be:

+---+--------+-------------------+
| id|topics_A|       topics_vec_A|
+---+--------+-------------------+
|  1|  [a, b]|(2,[0,1],[1.0,1.0])|
|  2|      []|          (2,[],[])|
+---+--------+-------------------+



Answer 2:


I had a similar issue. Based on a comment, I used the following syntax to resolve it before tokenizing:

Remove the null values:

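from pyspark.sql.functions import col

# Rows with a null title, before cleaning
clean_text_ddf.where(col("title").isNull()).show()
# Drop rows whose title is null, then confirm none remain
cleaned_text = clean_text_ddf.na.drop(subset=["title"])
cleaned_text.where(col("title").isNull()).show()
cleaned_text.printSchema()
cleaned_text.show(2)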

+-----+
|title|
+-----+
+-----+

+-----+
|title|
+-----+
+-----+

root
 |-- title: string (nullable = true)

+--------------------+
|               title|
+--------------------+
|Mr. Beautiful (Up...|
|House of Ravens (...|
+--------------------+
only showing top 2 rows


Source: https://stackoverflow.com/questions/50744930/failed-to-execute-user-defined-functionanonfuncreatetransformfunc1-string
