Question
Here is a snippet of a CSV file that I have:
"Index", "Living Space (sq ft)", "Beds", "Baths", "Zip", "Year", "List Price ($)"
1, 2222, 3, 3.5, 32312, 1981, 250000
2, 1628, 3, 2, 32308, 2009, 185000
3, 3824, 5, 4, 32312, 1954, 399000
4, 1137, 3, 2, 32309, 1993, 150000
5, 3560, 6, 4, 32309, 1973, 315000
Oddly, when I run the following PySpark (v2.4) statements, the header column names (apart from the first column) come back with leading whitespace, and the quote characters survive into the names. I've tried different quote and escape options, but to no avail.
Does anyone know why this is happening and how to strip the extra whitespace on load? Thank you in advance!
>>> csv_file = '/tmp/file.csv'
>>> spark_reader = spark.read  # assumes an active SparkSession named `spark`
>>> spark_reader.format('csv')
>>> spark_reader.option("inferSchema", "true")
>>> spark_reader.option("header", "true")
>>> spark_reader.option("quote", '"')
>>> df = spark_reader.load(csv_file)
>>> df.columns
['Index', ' "Living Space (sq ft)"', ' "Beds"', ' "Baths"', ' "Zip"', ' "Year"', ' "List Price ($)"']
Answer 1:
From the docs for pyspark.sql.DataFrameReader, you can use the ignoreLeadingWhiteSpace parameter.
ignoreLeadingWhiteSpace – A flag indicating whether or not leading whitespaces from values being read should be skipped. If None is set, it uses the default value, false.
This happens because the space after each comma is treated as part of the next field, so the quote character is no longer the first character of the value and Spark keeps it literally. In your case, you just need to add:
spark_reader.option("ignoreLeadingWhiteSpace", "true")
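For completeness, here is a minimal sketch of the whole read with that option applied. It assumes an active SparkSession named spark and reuses the /tmp/file.csv path from the question; the expected column list is an assumption based on the option's documented behavior, not a verified run:

from pyspark.sql import SparkSession

# In a pyspark shell or spark-submit job, `spark` usually already exists.
spark = SparkSession.builder.appName("csv-whitespace-demo").getOrCreate()

df = (
    spark.read.format("csv")
    .option("inferSchema", "true")
    .option("header", "true")
    .option("quote", '"')
    .option("ignoreLeadingWhiteSpace", "true")  # strip the space after each comma
    .load("/tmp/file.csv")
)

print(df.columns)
# Expected (an assumption): ['Index', 'Living Space (sq ft)', 'Beds',
#                            'Baths', 'Zip', 'Year', 'List Price ($)']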
Source: https://stackoverflow.com/questions/53656237/pyspark-v2-4-dataframereader-adds-leading-whitespace-to-column-names