Question
I am reading a dataset as below.
f = sc.textFile("s3://test/abc.csv")
My file contains 50+ fields and I want to assign a column header to each field so I can reference them later in my script.
How do I do that in PySpark? Is a DataFrame the way to go here?
PS - Newbie to Spark.
Answer 1:
Here is how to add column names using DataFrame:
Assume your CSV uses ',' as the delimiter. Prepare the data as follows before converting it to a DataFrame:
f = sc.textFile("s3://test/abc.csv")
data_rdd = f.map(lambda line: line.split(','))  # split each line on the delimiter
Suppose the data has 3 columns:
data_rdd.take(1)
[[u'1.2', u'red', u'55.6']]
Now you can specify the column names when converting this RDD to a DataFrame using toDF():
df_withcol = data_rdd.toDF(['height','color','width'])
df_withcol.printSchema()
root
|-- height: string (nullable = true)
|-- color: string (nullable = true)
|-- width: string (nullable = true)
If you don't specify column names, you get a DataFrame with default column names '_1', '_2', ...:
df_default = data_rdd.toDF()
df_default.printSchema()
root
|-- _1: string (nullable = true)
|-- _2: string (nullable = true)
|-- _3: string (nullable = true)
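Note that every column comes back as a string with this approach. A minimal sketch of casting the columns to more useful types, using the hypothetical height/color/width names from above:
from pyspark.sql.functions import col

# Cast the numeric columns from string to double
# (height/color/width are the illustrative names used above).
df_typed = df_withcol.select(
    col("height").cast("double"),
    col("color"),
    col("width").cast("double"),
)
df_typed.printSchema()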
Answer 2:
The solution to this question really depends on the version of Spark you are running. Assuming you are on Spark 2.0+, you can read the CSV in as a DataFrame and add column names with toDF, which works both for turning an RDD into a DataFrame and for renaming the columns of an existing DataFrame.
filename = "/path/to/file.csv"
df = spark.read.csv(filename).toDF("col1","col2","col3")
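toDF also works for renaming the columns of a DataFrame you already have; a minimal sketch (the DataFrame and replacement names here are illustrative):
# Rename every column of an existing DataFrame in one call
# (df and the new names are illustrative).
renamed = df.toDF("height", "color", "width")
renamed.printSchema()
If the file has a header row, Spark 2.0+ can also pick the names up directly with spark.read.csv(filename, header=True).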
Answer 3:
f = sc.textFile("s3://test/abc.csv")
header = f.first()
header gives you, for example with 3 columns:
u'col1,col2,col3'
head = str(header).split(",")
head gives you a list:
['col1', 'col2', 'col3']
fDF = f.filter(lambda row: row != header).map(lambda x: str(x).split(",")).toDF(head)
fDF.show()
This gives you a DataFrame with the header names as columns and the data rows, as required.
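Once the headers are attached you can reference the columns by name, which is what the question asks for; a minimal sketch using the hypothetical col1 name from the example output above:
# Reference the named columns later in the script
# (col1 is the illustrative header name from above; the filter value is made up).
fDF.select("col1").show()
fDF.filter(fDF["col1"] == "some_value").show()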
Source: https://stackoverflow.com/questions/36608559/how-to-assign-and-use-column-headers-in-spark