Get CSV to Spark dataframe

前端未结

关注

 9  1300

忘了有多久 2020-12-05 14:45

I\'m using python on Spark and would like to get a csv into a dataframe.

The documentation for Spark SQL strangely does not provide explanations for CSV as a source.

9条回答

谎友^ (楼主)

2020-12-05 15:08

Following Spark 2.0, it is recommended to use a Spark Session:

from pyspark.sql import SparkSession
from pyspark.sql import Row

# Create a SparkSession
spark = SparkSession \
    .builder \
    .appName("basic example") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

def mapper(line):
    fields = line.split(',')
    return Row(ID=int(fields[0]), field1=str(fields[1].encode("utf-8")), field2=int(fields[2]), field3=int(fields[3]))

lines = spark.sparkContext.textFile("file.csv")
df = lines.map(mapper)

# Infer the schema, and register the DataFrame as a table.
schemaDf = spark.createDataFrame(df).cache()
schemaDf.createOrReplaceTempView("tablename")

0 讨论(0)

查看其它9个回答