Convert StringType to ArrayType in PySpark


Question


I am trying to run the FPGrowth algorithm in PySpark on my dataset.

from pyspark.ml.fpm import FPGrowth

fpGrowth = FPGrowth(itemsCol="name", minSupport=0.5, minConfidence=0.6)
model = fpGrowth.fit(df)

I am getting the following error:

An error occurred while calling o2139.fit.
: java.lang.IllegalArgumentException: requirement failed: The input 
column must be ArrayType, but got StringType.
at scala.Predef$.require(Predef.scala:224)

My DataFrame df looks like this:

df.show(2)

+---+---------+--------------------+
| id|     name|               actor|
+---+---------+--------------------+
|  0|['ab,df']|                 tom|
|  1|['rs,ce']|                brad|
+---+---------+--------------------+
only showing top 2 rows
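
A quick schema check confirms the mismatch the error is complaining about (sketch only; output paraphrased rather than copied):

# inspect the column types: "name" shows up as string, not array<string>
df.printSchema()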

The FPGrowth algorithm works if the data in the "name" column is in the form:

 name
[ab,df]
[rs,ce]

How do I get it into this form, i.e. convert the column from StringType to ArrayType?

I formed the DataFrame from my RDD as follows:

rd2=rd.map(lambda x: (x[1], x[0][0] , [x[0][1]]))

rd3 = rd2.map(lambda p:Row(id=int(p[0]),name=str(p[2]),actor=str(p[1])))
df = spark.createDataFrame(rd3)

rd2.take(2):

[(0, 'tom', ['ab,df']), (1, 'brad', ['rs,ce'])]

Answer 1:


Split on the comma for each row in the name column of your DataFrame, e.g. with a pandas UDF:

from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import ArrayType, StringType

@pandas_udf(ArrayType(StringType()), PandasUDFType.SCALAR)
def split_comma(v):
    # strip the surrounding brackets/quotes, then split the remaining 'ab,df' on the comma
    return v.str.strip("[]'").str.split(',')

df = df.withColumn('name', split_comma(df.name))
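
If you'd rather avoid a UDF altogether, the same cleanup can be done with built-in functions. A minimal sketch, assuming the values look exactly like ['ab,df'] as in the df.show() output above:

from pyspark.sql.functions import regexp_replace, split

# drop the [, ], and ' characters, then split the remaining 'ab,df' on the comma
df = df.withColumn('name', split(regexp_replace('name', r"[\[\]']", ''), ','))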

Or better, don't defer this: set name to the list directly when you build the RDD.

rd2 = rd.map(lambda x: (x[1], x[0][0], x[0][1].split(',')))
rd3 = rd2.map(lambda p: Row(id=int(p[0]), name=p[2], actor=str(p[1])))



Answer 2:


Based on your previous question, it seems you are building rd2 incorrectly: wrapping the list in str() via name=str(p[2]) turns ['ab,df'] into the string "['ab,df']", which is why Spark sees StringType.

Try this:

rd2 = rd.map(lambda x: (x[1], x[0][0], x[0][1].split(",")))
rd3 = rd2.map(lambda p: Row(id=int(p[0]), name=p[2], actor=str(p[1])))

The change is that we call str.split(",") on x[0][1], which converts a string like 'a,b' into a list ['a', 'b'], and we pass p[2] to Row as-is instead of wrapping it in str(), so the name column becomes an array of strings.
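
A minimal end-to-end sketch of the fix, assuming rd and spark from the question (this wrap-up is not part of the original answer):

from pyspark.sql import Row, SparkSession
from pyspark.ml.fpm import FPGrowth

spark = SparkSession.builder.getOrCreate()

# rd elements look like (('tom', 'ab,df'), 0), as implied by the question's mapping
rd2 = rd.map(lambda x: (x[1], x[0][0], x[0][1].split(",")))
rd3 = rd2.map(lambda p: Row(id=int(p[0]), name=p[2], actor=str(p[1])))
df = spark.createDataFrame(rd3)   # name is now array<string>

fpGrowth = FPGrowth(itemsCol="name", minSupport=0.5, minConfidence=0.6)
model = fpGrowth.fit(df)          # no more IllegalArgumentException
model.freqItemsets.show()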



Source: https://stackoverflow.com/questions/49681837/convert-stringtype-to-arraytype-in-pyspark
