Get CSV to Spark dataframe

忘了有多久 2020-12-05 14:45

I'm using Python on Spark and would like to get a CSV into a dataframe.

Strangely, the Spark SQL documentation does not explain how to use CSV as a source.

9 answers
  •  谎友^ (original poster)
    2020-12-05 15:08

    As of Spark 2.0, the recommended entry point is a SparkSession:

    from pyspark.sql import SparkSession
    from pyspark.sql import Row
    
    # Create a SparkSession
    spark = SparkSession \
        .builder \
        .appName("basic example") \
        .config("spark.some.config.option", "some-value") \
        .getOrCreate()
    
    def mapper(line):
        # Naive split: assumes no quoted commas and no header row in the file.
        fields = line.split(',')
        return Row(ID=int(fields[0]), field1=fields[1], field2=int(fields[2]), field3=int(fields[3]))
    
    # Read the file as an RDD of lines, then turn each line into a Row.
    lines = spark.sparkContext.textFile("file.csv")
    rows = lines.map(mapper)
    
    # Infer the schema from the Rows, and register the DataFrame as a temporary view.
    schemaDf = spark.createDataFrame(rows).cache()
    schemaDf.createOrReplaceTempView("tablename")
    
