I'm using PySpark to do a classic ETL job (load a dataset, process it, save it) and want to save my DataFrame as files in a directory partitioned by a "virtual" column. By "virtual" I mean that I have a Timestamp column whose values are ISO 8601 encoded date strings, and I'd like to partition by Year / Month / Day, even though there is no Year, Month, or Day column in the DataFrame. I can derive these columns from the Timestamp, but I don't want the resulting items to have any of them serialized.
The file structure resulting from saving the DataFrame to disk should look like:

/
    year=2016/
        month=01/
            day=01/
                part-****.gz

Is there a way to do this with Spark / PySpark?
Columns which are used for partitioning are not included in the serialized data itself. For example, if you create a DataFrame like this:
df = sc.parallelize([
    (1, "foo", 2.0, "2016-02-16"),
    (2, "bar", 3.0, "2016-02-16")
]).toDF(["id", "x", "y", "date"])
and write it as follows:
import tempfile
from pyspark.sql.functions import col, dayofmonth, month, year

# Output path; it must not exist yet, which is why mktemp (name only) is used
outdir = tempfile.mktemp()

# Derive the partition columns from the date string
dt = col("date").cast("date")
fname = [(year, "year"), (month, "month"), (dayofmonth, "day")]
exprs = [col("*")] + [f(dt).alias(name) for f, name in fname]

(df
    .select(*exprs)
    .write
    .partitionBy(*(name for _, name in fname))
    .format("json")
    .save(outdir))
individual files won't contain partition columns:
import os

(sqlContext.read
    .json(os.path.join(outdir, "year=2016/month=2/day=16/"))
    .printSchema())
## root
## |-- date: string (nullable = true)
## |-- id: long (nullable = true)
## |-- x: string (nullable = true)
## |-- y: double (nullable = true)
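As a side note, when you read a single leaf directory like this, the partition columns are not recovered at all. A minimal sketch, assuming a Spark version that supports the reader's basePath option (1.6+), which marks the root of the partitioned layout so that partition discovery still runs:

(sqlContext.read
    .option("basePath", outdir)  # assumed Spark >= 1.6; root of the partitioned layout
    .json(os.path.join(outdir, "year=2016/month=2/day=16/"))
    .printSchema())

With basePath set, year, month and day should reappear in the schema as partition columns.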
More generally, partitioning data is stored only in the directory structure and is not duplicated in the serialized files. It is attached when you read the complete or a partial directory tree:
sqlContext.read.json(outdir).printSchema()
## root
## |-- date: string (nullable = true)
## |-- id: long (nullable = true)
## |-- x: string (nullable = true)
## |-- y: double (nullable = true)
## |-- year: integer (nullable = true)
## |-- month: integer (nullable = true)
## |-- day: integer (nullable = true)
sqlContext.read.json(os.path.join(outdir, "year=2016/month=2/")).printSchema()
## root
## |-- date: string (nullable = true)
## |-- id: long (nullable = true)
## |-- x: string (nullable = true)
## |-- y: double (nullable = true)
## |-- day: integer (nullable = true)
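To map this back to the original question: a minimal sketch, assuming your DataFrame has a Timestamp column holding ISO 8601 strings that Spark's cast can parse (the column name and format are taken from the question; the rest is illustrative):

from pyspark.sql.functions import col, dayofmonth, month, year

# Assumed input: a "Timestamp" column with ISO 8601 strings,
# e.g. "2016-01-01T12:34:56"; cast("timestamp") is expected to parse them
ts = col("Timestamp").cast("timestamp")

(df
    .withColumn("year", year(ts))
    .withColumn("month", month(ts))
    .withColumn("day", dayofmonth(ts))
    .write
    .partitionBy("year", "month", "day")
    .format("json")
    .save(outdir))

The serialized rows keep the original Timestamp but not year, month, or day, exactly as shown above.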
Source: https://stackoverflow.com/questions/35437378/spark-save-dataframe-partitioned-by-virtual-column