Question
I am trying to read a csv file into a PySpark DataFrame. However, for some reason the PySpark CSV load methods are loading significantly more rows than expected.
I have tried using both the spark.read method and the spark.sql method for reading the CSV.
import pandas as pd

df = pd.read_csv("preprocessed_data.csv")
len(df)
# out: 318477
spark_df = spark.read.format("csv") \
    .option("header", "true") \
    .option("mode", "DROPMALFORMED") \
    .load("preprocessed_data.csv")
spark_df.count()
# out: 6422020
df_test = spark.sql("SELECT * FROM csv.`preprocessed_data.csv`")
df_test.count()
# out: 6422020
I cannot figure out why the csv is being read incorrectly. The columns appear the same when I show them, but there are far too many rows. I am therefore looking for a way to solve this problem.
Answer 1:
You can try the following. I am assuming your csv has a header row.
from pyspark import SparkContext
from pyspark.sql import SQLContext

fileName = "my.csv"
sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)
df = sqlContext.read.csv(fileName, header=True, inferSchema=True)
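On Spark 2.x and later, SQLContext is a legacy entry point; a minimal sketch of the same read through SparkSession (assuming the original preprocessed_data.csv path) would be:

from pyspark.sql import SparkSession

# Reuse an existing session or create a new one
spark = SparkSession.builder.getOrCreate()

# header=True treats the first line as column names;
# inferSchema=True makes Spark sample the file to guess column types
df = spark.read.csv("preprocessed_data.csv", header=True, inferSchema=True)
df.count()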
Source: https://stackoverflow.com/questions/56257225/pyspark-read-csv-reading-incorrectly