I have created a PySpark application that reads a JSON file into a DataFrame through a defined schema. Code sample below (using the same fields as the answer's example):
schema = StructType([
    StructField("domain", StringType(), True),
    StructField("timestamp", LongType(), True),
])
StructType provides json and jsonValue methods, which return the JSON string and dict representations respectively, and fromJson, which converts a Python dictionary back into a StructType.
from pyspark.sql.types import StructType, StructField, StringType, LongType

schema = StructType([
    StructField("domain", StringType(), True),
    StructField("timestamp", LongType(), True),
])

# Round-trip: the dict representation back to an equivalent StructType
StructType.fromJson(schema.jsonValue())
The only thing you need beyond that is the built-in json module to parse the input into a dict that can be consumed by StructType.fromJson.
For the Scala version, see How to create a schema from CSV file and persist/save that schema to a file?