Question:
I'm using Python on Spark and would like to get a CSV into a DataFrame.
The Spark SQL documentation strangely provides no explanation of CSV as a source.
I have found spark-csv, but I have issues with two parts of its documentation:
"This package can be added to Spark using the --jars command line option. For example, to include it when starting the spark shell: $ bin/spark-shell --packages com.databricks:spark-csv_2.10:1.0.3"
Do I really need to add this argument every time I launch pyspark or spark-submit? It seems very inelegant. Isn't there a way to import it in Python rather than redownloading it each time?
df = sqlContext.load(source="com.databricks.spark.csv", header="true", path="cars.csv")
Even if I do the above, this won't work. What does the "source" argument stand for in this line of code? How do I simply load a local file on Linux, say "/Spark_Hadoop/spark-1.3.1-bin-cdh4/cars.csv"?
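For reference, a minimal sketch of how that call is shaped under the Spark 1.3-era API, assuming the spark-csv package is already on the classpath; the file:// scheme is one way to address a local path:

# Sketch only: "source" names the data-source implementation to use, and a
# local file can be addressed with an explicit file:// scheme.
df = sqlContext.load(source="com.databricks.spark.csv",
                     header="true",
                     path="file:///Spark_Hadoop/spark-1.3.1-bin-cdh4/cars.csv")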
Answer 1:
1. Read the CSV file into an RDD and then generate a RowRDD from the original RDD.
2. Create the schema, represented by a StructType, matching the structure of the rows in the RDD created in step 1.
3. Apply the schema to the RDD of rows via the createDataFrame method provided by SQLContext.
from pyspark.sql.types import StructField, StringType, StructType

lines = sc.textFile("examples/src/main/resources/people.txt")
parts = lines.map(lambda l: l.split(","))
# Each line is converted to a tuple.
people = parts.map(lambda p: (p[0], p[1].strip()))

# The schema is encoded in a string.
schemaString = "name age"
fields = [StructField(field_name, StringType(), True) for field_name in schemaString.split()]
schema = StructType(fields)

# Apply the schema to the RDD (here spark is the SparkSession; on Spark 1.x
# use sqlContext.createDataFrame instead).
schemaPeople = spark.createDataFrame(people, schema)
Source: Spark Programming Guide
Answer 2:
from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)  # required so that .toDF() is available on RDDs

Employee_rdd = sc.textFile("../Employee.csv").map(lambda line: line.split(","))
Employee_df = Employee_rdd.toDF(['Employee_ID', 'Employee_name'])
Employee_df.show()
Answer 3:
If you do not mind the extra package dependency, you could use Pandas to parse the CSV file. It handles internal commas just fine.
Dependencies:
from pyspark import SparkContext
from pyspark.sql import SQLContext
import pandas as pd
Read the whole file at once into a Spark DataFrame:
sc = SparkContext('local', 'example')  # if running locally
sql_sc = SQLContext(sc)
pandas_df = pd.read_csv('file.csv')  # assuming the file contains a header
# If there is no header:
# pandas_df = pd.read_csv('file.csv', names=['column 1', 'column 2'])
s_df = sql_sc.createDataFrame(pandas_df)
Or, to be more memory-conscious, you can chunk the data into a Spark RDD and then convert it to a DataFrame:
chunk_100k = pd.read_csv('file.csv', chunksize=100000)
for chunky in chunk_100k:
    Spark_temp_rdd = sc.parallelize(chunky.values.tolist())
    try:
        Spark_full_rdd += Spark_temp_rdd
    except NameError:
        Spark_full_rdd = Spark_temp_rdd
    del Spark_temp_rdd
Spark_DF = Spark_full_rdd.toDF(['column 1', 'column 2'])
Answer 4:
As of Spark 2.0, the recommended entry point is a SparkSession:
from pyspark.sql import SparkSession
from pyspark.sql import Row

# Create a SparkSession
spark = SparkSession \
    .builder \
    .appName("basic example") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

def mapper(line):
    fields = line.split(',')
    return Row(ID=int(fields[0]),
               field1=str(fields[1].encode("utf-8")),
               field2=int(fields[2]),
               field3=int(fields[3]))

lines = spark.sparkContext.textFile("file.csv")
df = lines.map(mapper)

# Infer the schema, and register the DataFrame as a table.
schemaDf = spark.createDataFrame(df).cache()
schemaDf.createOrReplaceTempView("tablename")
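Once the temp view is registered, it can be queried with plain SQL. A small usage sketch; the column names and the field2 > 0 predicate are illustrative, matching the mapper above:

results = spark.sql("SELECT ID, field1 FROM tablename WHERE field2 > 0")
results.show()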
Answer 5:
With more recent versions of Spark this has become a lot easier. The expression sqlContext.read (added in Spark 1.4) gives you a DataFrameReader instance, which has a .csv() method as of Spark 2.0:
df = sqlContext.read.csv("/path/to/your.csv")
Note that you can also indicate that the CSV file has a header by adding the keyword argument header=True to the .csv() call. A handful of other options are available, described in the DataFrameReader documentation.
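For illustration, a sketch of a few of those options under the Spark 2.x reader API (the path is a placeholder):

df = sqlContext.read.csv("/path/to/your.csv",
                         header=True,       # treat the first line as column names
                         inferSchema=True,  # sample the data to guess column types
                         sep=",",           # field delimiter
                         quote='"')         # quote character for fields containing commas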
Answer 6:
For PySpark, assuming the first row of the CSV file contains a header:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('chosenName').getOrCreate()
df = spark.read.csv('fileNameWithPath', mode="DROPMALFORMED", inferSchema=True, header=True)
Answer 7:
I ran into a similar problem. The solution is to add an environment variable named PYSPARK_SUBMIT_ARGS and set its value to "--packages com.databricks:spark-csv_2.10:1.4.0 pyspark-shell". This works with Spark's Python interactive shell.
Make sure you match the version of spark-csv with the version of Scala installed: with Scala 2.11 it is spark-csv_2.11, and with Scala 2.10 (including 2.10.5) it is spark-csv_2.10.
Hope it works.
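A minimal sketch of setting that variable from Python itself; it must be set before pyspark launches its JVM gateway, and the package coordinates are an example that must match your Scala build, as noted above:

import os
# Must be set before the pyspark gateway starts; the spark-csv coordinates
# are an example and must match your Scala version.
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--packages com.databricks:spark-csv_2.10:1.4.0 pyspark-shell"
)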
Answer 8:
Based on the answer by Aravind, but much shorter, e.g.:
lines = sc.textFile("/path/to/file").map(lambda x: x.split(","))
df = lines.toDF(["year", "month", "day", "count"])