Question
I am trying to read a csv file into a PySpark DataFrame. However, for some reason the PySpark CSV load methods are loading significantly more rows than expected.
I have tried using both the spark.read method and the spark.sql method for reading the CSV.
import pandas as pd

df = pd.read_csv("preprocessed_data.csv")
len(df)
# out: 318477
spark_df = spark.read.format("csv") \
    .option("header", "true") \
    .option("mode", "DROPMALFORMED") \
    .load("preprocessed_data.csv")
spark_df.count()
# out: 6422020
df_test = spark.sql("SELECT * FROM csv.`preprocessed_data.csv`")
df_test.count()
# out: 6422020
I cannot figure out why the csv is being read incorrectly. The columns appear the same when I show them, but there are far too many rows. I am therefore looking for a way to solve this problem.
Answer 1:
You can try the following. I am assuming your csv has a header row.
from pyspark import SparkContext
from pyspark.sql import SQLContext

fileName = "my.csv"
sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)
df = sqlContext.read.csv(fileName, header=True, inferSchema=True)
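On Spark 2.x and later, SQLContext is a legacy entry point; a minimal sketch of the same read through SparkSession (assuming the original preprocessed_data.csv path) would be:

from pyspark.sql import SparkSession

# Reuse an existing session or create a new one
spark = SparkSession.builder.getOrCreate()

# header=True treats the first line as column names;
# inferSchema=True makes Spark sample the file to guess column types
df = spark.read.csv("preprocessed_data.csv", header=True, inferSchema=True)
df.count()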
Source: https://stackoverflow.com/questions/56257225/pyspark-read-csv-reading-incorrectly