Reading data from a URL using Spark on the Databricks platform

野趣味 2021-02-09 17:20

I am trying to read data from a URL using Spark on the Databricks Community Edition platform. I tried using spark.read.csv and SparkFiles, but I am still missing some simple point.

2 Answers
  •  不要未来只要你来
    2021-02-09 17:54

    Try this.

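    from pyspark import SparkFiles

    url = "https://raw.githubusercontent.com/thomaspernet/data_csv_r/master/data/adult.csv"
    # Download the file into Spark's local working directory
    spark.sparkContext.addFile(url)

    # SparkFiles.get returns a local path, so prefix it with the file:// scheme
    df = spark.read.csv("file://" + SparkFiles.get("adult.csv"), header=True, inferSchema=True)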
    

    Then select a few columns from the CSV to check the result:

    >>> df.select("age","workclass","fnlwgt","education").show(10)
    +---+----------------+------+---------+
    |age|       workclass|fnlwgt|education|
    +---+----------------+------+---------+
    | 39|       State-gov| 77516|Bachelors|
    | 50|Self-emp-not-inc| 83311|Bachelors|
    | 38|         Private|215646|  HS-grad|
    | 53|         Private|234721|     11th|
    | 28|         Private|338409|Bachelors|
    | 37|         Private|284582|  Masters|
    | 49|         Private|160187|      9th|
    | 52|Self-emp-not-inc|209642|  HS-grad|
    | 31|         Private| 45781|  Masters|
    | 42|         Private|159449|Bachelors|
    +---+----------------+------+---------+
    

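    `SparkFiles.get` returns the absolute path of the file on the local filesystem of the driver or worker. Without the `file://` prefix, Spark resolves that path against the default filesystem (DBFS on Databricks) instead of the local disk, which is why it was not able to find the file.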
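
The same steps could be wrapped in a small reusable function. This is only a sketch: `to_local_file_uri`, `file_name_from_url`, and `read_csv_from_url` are hypothetical helper names (not part of PySpark), and it assumes an existing `spark` session and that the file name is the last segment of the URL path.

```python
import posixpath
from urllib.parse import urlparse


def to_local_file_uri(local_path):
    """Prefix an absolute local path with the file:// scheme."""
    if not posixpath.isabs(local_path):
        raise ValueError("expected an absolute path, got: " + local_path)
    return "file://" + local_path


def file_name_from_url(url):
    """Return the last segment of the URL path, e.g. 'adult.csv'."""
    return posixpath.basename(urlparse(url).path)


def read_csv_from_url(spark, url):
    """Download a CSV with addFile and read it from the local filesystem."""
    # Imported here so the helpers above can be used without PySpark installed.
    from pyspark import SparkFiles

    spark.sparkContext.addFile(url)
    local_path = SparkFiles.get(file_name_from_url(url))
    return spark.read.csv(to_local_file_uri(local_path), header=True, inferSchema=True)
```

On Databricks, where `spark` is already defined in the notebook, a call would look like `df = read_csv_from_url(spark, url)`.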
