PySpark Read CSV reading incorrectly

Posted by 老子叫甜甜 on 2020-01-16 20:18:34

Question


I am trying to read a csv file into a PySpark DataFrame. However, for some reason the PySpark CSV load methods are loading significantly more rows than expected.

I have tried using both the spark.read method and the spark.sql method for reading the CSV.

# pandas reads the expected number of rows
import pandas as pd

df = pd.read_csv("preprocessed_data.csv")
len(df)
# out: 318477

# Spark's CSV reader returns far more rows from the same file
spark_df = (spark.read.format("csv")
                 .option("header", "true")
                 .option("mode", "DROPMALFORMED")
                 .load("preprocessed_data.csv"))
spark_df.count()
# out: 6422020

df_test = spark.sql("SELECT * FROM csv.`preprocessed_data.csv`")
df_test.count()
# out: 6422020

I cannot figure out why it is reading the CSV incorrectly. The columns appear the same when I show them, but there are far too many rows. I am therefore looking for a way to solve this problem.
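One common cause of a mismatch like this is fields that contain embedded newlines inside quoted values: pandas honours quoted line breaks by default, while Spark's CSV reader treats every physical line as a row unless multiline parsing is switched on. A minimal sketch of that check (the multiLine option is standard in Spark 2.2+; the file name is taken from the question):

spark_df = (spark.read.format("csv")
                 .option("header", "true")
                 .option("multiLine", "true")  # keep quoted line breaks inside a single field
                 .load("preprocessed_data.csv"))
spark_df.count()  # should drop back toward the pandas count if embedded newlines were splitting rows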


Answer 1:


You can try the following. I am assuming your CSV has a header row.

from pyspark import SparkContext
from pyspark.sql import SQLContext

fileName = "my.csv"
sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)
df = sqlContext.read.csv(fileName, header=True, inferSchema=True)
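If the counts still disagree after this, comparing the raw line count with the parsed row count can show whether rows are being split. A sketch reusing the sc from above; preprocessed_data.csv is the file from the question:

raw_lines = sc.textFile("preprocessed_data.csv").count()
parsed_rows = sqlContext.read.csv("preprocessed_data.csv", header=True).count()
# If raw_lines and parsed_rows both far exceed the pandas count,
# quoted fields with embedded newlines are likely being split into extra rows.
print(raw_lines, parsed_rows)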


Source: https://stackoverflow.com/questions/56257225/pyspark-read-csv-reading-incorrectly
