Load a JSON file into a Spark DataFrame

Posted by 邮差的信 on 2021-01-29 20:16:40

Question


I am trying to load the following data.json file into a Spark DataFrame:

{"positionmessage":{"callsign": "PPH1", "name": 0.0, "mmsi": 100}}
{"positionmessage":{"callsign": "PPH2", "name": 0.0, "mmsi": 200}}
{"positionmessage":{"callsign": "PPH3", "name": 0.0, "mmsi": 300}}

using the following code:

from pyspark.sql import SparkSession
from pyspark.sql.types import ArrayType, StructField, StructType, StringType, IntegerType

appName = "PySpark Example - JSON file to Spark Data Frame"
master = "local"

# Create Spark session
spark = SparkSession.builder \
    .appName(appName) \
    .master(master) \
    .getOrCreate()

# Create a schema for the dataframe
schema = StructType([
    StructField('callsign', StringType(), True),
    StructField('name', StringType(), True),
    StructField('mmsi', IntegerType(), True)
])

# Create data frame
json_file_path = "data.json"
df = spark.read.json(json_file_path, schema, multiLine=True)
print(df.schema)
print(df.head(3))

It prints: [Row(callsign=None, name=None, mmsi=None)]. What am I doing wrong? I have set my environment variables in the system settings.


Answer 1:


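Every record in your file wraps its fields in a top-level positionmessage struct, but that struct is missing from your schema. Spark looks for callsign, name and mmsi at the top level, finds none of them, and fills every column with null.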
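
To see the nesting without Spark, the records can be inspected with Python's standard json module. This is only a sketch over the three sample lines from the question; the variable names (lines, records, top_level_keys, nested_keys) are illustrative and not part of the original post.

```python
import json

# The three sample records from data.json, one JSON object per line.
lines = [
    '{"positionmessage":{"callsign": "PPH1", "name": 0.0, "mmsi": 100}}',
    '{"positionmessage":{"callsign": "PPH2", "name": 0.0, "mmsi": 200}}',
    '{"positionmessage":{"callsign": "PPH3", "name": 0.0, "mmsi": 300}}',
]

records = [json.loads(line) for line in lines]

# Keys visible at the top level of each record.
top_level_keys = sorted({key for record in records for key in record})

# Keys one level down, inside the positionmessage object.
nested_keys = sorted(
    {key for record in records for key in record["positionmessage"]}
)

print(top_level_keys)
print(nested_keys)
```

The only top-level key is positionmessage; callsign, name and mmsi sit one level below it, so a schema has to describe that outer struct before it can reach them.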

Change the schema to include the struct field, as shown below:

schema = StructType([
    StructField("positionmessage", StructType([
        StructField('callsign', StringType(), True),
        StructField('name', StringType(), True),
        StructField('mmsi', IntegerType(), True)
    ]))
])

# Read with the nested schema, then expand the struct into top-level columns
spark.read.schema(schema).json("<path>") \
    .select("positionmessage.*") \
    .show()
#+--------+----+----+
#|callsign|name|mmsi|
#+--------+----+----+
#|    PPH1| 0.0| 100|
#|    PPH2| 0.0| 200|
#|    PPH3| 0.0| 300|
#+--------+----+----+
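
One further difference from the question's code: the read call above does not pass multiLine=True. data.json holds one JSON object per line, which is the layout spark.read.json expects by default, whereas multiLine is intended for a single JSON document that spans several lines. The sketch below is plain Python rather than Spark code and makes no claim about Spark internals; it only illustrates, with hypothetical variable names, that the sample file is not one JSON document but parses cleanly line by line. This may be why head(3) in the question returned a single row.

```python
import json

# Contents of data.json as shown in the question: one object per line.
text = "\n".join([
    '{"positionmessage":{"callsign": "PPH1", "name": 0.0, "mmsi": 100}}',
    '{"positionmessage":{"callsign": "PPH2", "name": 0.0, "mmsi": 200}}',
    '{"positionmessage":{"callsign": "PPH3", "name": 0.0, "mmsi": 300}}',
])

# Treating the whole file as a single JSON document fails, because a
# second object follows the first one.
try:
    json.loads(text)
    parses_as_single_document = True
except json.JSONDecodeError:
    parses_as_single_document = False

# Treating each line as its own document works for every record.
mmsi_values = [
    json.loads(line)["positionmessage"]["mmsi"] for line in text.splitlines()
]

print(parses_as_single_document)
print(mmsi_values)
```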


Source: https://stackoverflow.com/questions/61877486/load-json-file-to-spark-dataframe
