pyspark: TypeError: IntegerType can not accept object in type


Programming with PySpark on a Spark cluster: the data is large and comes in pieces, so it cannot be loaded into memory, and checking the sanity of the data is not easy.

basically it loo

2 Answers
  • 2021-02-20 17:52

    With Spark 2.0 you can let Spark infer the schema of your data. Otherwise you'll need to cast in your parser function, as argued in the other answer. From the createDataFrame documentation:

    "When schema is None, it will try to infer the schema (column names and types) from data, which should be an RDD of Row, or namedtuple, or dict."

  • 2021-02-20 17:53

    As noted by ccheneson, you are passing the wrong types.

    Assuming your data looks like this:

    data = sc.parallelize(["af.b Current%20events 1 996"])
    

    After the first map you get an RDD[List[String]]:

    parts = data.map(lambda l: l.split())
    parts.first()
    ## ['af.b', 'Current%20events', '1', '996']
    

    The second map converts it to a tuple of (String, String, String, String):

    wikis = parts.map(lambda p: (p[0], p[1], p[2],p[3]))
    wikis.first()
    ## ('af.b', 'Current%20events', '1', '996')
    

    Your schema states that the third column is an integer:

    [f.dataType for f in schema.fields]
    ## [StringType, StringType, IntegerType, StringType]
    
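    The schema and fields referenced here are defined in the question and not shown in this extract; a hypothetical reconstruction consistent with the output above (the field names are assumptions):

    from pyspark.sql.types import StructType, StructField, StringType, IntegerType

    fields = [StructField("project", StringType(), True),
              StructField("page", StringType(), True),
              StructField("count", IntegerType(), True),
              StructField("bytes", StringType(), True)]
    schema = StructType(fields)
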

    The schema is mostly used to avoid a full table scan when inferring types; it doesn't perform any type casting.
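
    Because of that, building a DataFrame from the all-string tuples reproduces the error in the title (a sketch assuming a Spark 1.x sqlContext; the exact message varies by Spark and Python version):

    sqlContext.createDataFrame(wikis, schema)
    ## TypeError: IntegerType can not accept object in type <type 'str'>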

    You can either cast your data during the last map:

    wikis = parts.map(lambda p: (p[0], p[1], int(p[2]), p[3]))
    
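    After this cast the tuples match the schema; the third element is now an int:

    wikis.first()
    ## ('af.b', 'Current%20events', 1, '996')
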

    Or define count as a StringType and cast the column afterwards:

    from pyspark.sql.functions import col
    from pyspark.sql.types import StructField, StringType, StructType

    fields[2] = StructField("count", StringType(), True)
    schema = StructType(fields)

    wikis.toDF(schema).withColumn("cnt", col("count").cast("integer")).drop("count")
    
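    A quick check mirrors the schema inspection above; the cast column now comes out as an integer (it is appended after the remaining string columns):

    df = wikis.toDF(schema).withColumn("cnt", col("count").cast("integer")).drop("count")
    [f.dataType for f in df.schema.fields]
    ## [StringType, StringType, StringType, IntegerType]
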

    On a side note, count is a reserved word in SQL and shouldn't be used as a column name. In Spark it will work as expected in some contexts and fail in others.
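
    If you do keep count as a column name, you can escape it with backticks in SQL contexts (a hypothetical snippet assuming a Spark 1.x sqlContext; registerTempTable was the API of that era):

    df = wikis.toDF(schema)  # schema with count as a StringType, as above
    df.registerTempTable("wikis")
    # Without the backticks, count may be parsed as the aggregate function
    sqlContext.sql("SELECT `count` FROM wikis").show()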
