Unable to infer schema when loading Parquet file

Submitted by 守給你的承諾、 on 2019-12-03 04:39:10

This error usually occurs when you try to read an empty directory as Parquet. Your resulting DataFrame is probably empty.

You could check whether the DataFrame is empty with outcome.rdd.isEmpty() before writing it.
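A minimal sketch of that guard, assuming df is a Spark DataFrame exposing the usual rdd.isEmpty() and write.parquet() API (the helper name is hypothetical):

```python
def write_parquet_if_nonempty(df, path):
    """Write df as Parquet only if it has at least one row, so a later
    read of `path` won't fail with 'Unable to infer schema for Parquet'.
    Assumes `df` is a Spark DataFrame."""
    if df.rdd.isEmpty():
        # Nothing to write; skip creating an empty/absent Parquet directory.
        return False
    df.write.parquet(path)
    return True
```

Note that rdd.isEmpty() triggers a small job on large DataFrames, so use it only where an empty result is a realistic outcome.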

In my case, the error occurred because I was trying to read a Parquet file whose name started with an underscore (e.g. _lots_of_data.parquet). Not sure why this was an issue, but removing the leading underscore solved the problem.


This case occurs when you try to read a table that is empty. If data had been correctly inserted into the table, there would be no problem.

Besides Parquet, the same thing happens with ORC.

I'm using AWS Glue and I received this error while reading data from a data catalog table (location: S3 bucket). After a bit of analysis I realised that this was because the file was not available at the expected location (in my case, the S3 bucket path).

Glue was trying to apply data catalog table schema on a file which doesn't exist.

After copying the file into the S3 bucket location, the issue was resolved.

Hope this helps someone who encounters this error in AWS Glue.

In my case, the error occurred because the filename contained underscores. Rewriting / reading the file without underscores (hyphens were OK) solved the problem...

I ran into a similar problem when reading a CSV:

spark.read.csv("s3a://bucket/spark/csv_dir/.")

gave an error of:

org.apache.spark.sql.AnalysisException: Unable to infer schema for CSV. It must be specified manually.;

I found that if I removed the trailing . it worked, i.e.:

spark.read.csv("s3a://bucket/spark/csv_dir/")

I tested this for Parquet by adding a trailing . and you get an error of:

org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.;
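The fix above can be reduced to a tiny path-normalization step before calling spark.read. This is a sketch under the assumption that the only problem is a trailing "/." on the directory path (the function name is hypothetical):

```python
def normalize_dir(path: str) -> str:
    """Strip a trailing '/.' from a directory path, which Spark can
    treat as a hidden entry, and return the path with a single
    trailing slash."""
    while path.endswith("/."):
        path = path[:-2]
    return path.rstrip("/") + "/"
```

For example, normalize_dir("s3a://bucket/spark/csv_dir/.") yields "s3a://bucket/spark/csv_dir/", which reads without the exception.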

I see there are already many answers. But the issue I faced was that my Spark job was trying to read a file that was being overwritten by another Spark job started earlier. It sounds bad, but I made that mistake.

Just to emphasize @Davos's answer in a comment: you will encounter this exact exception if your filename has a dot . or an underscore _ at the start of the filename

val df = spark.read.format("csv").option("delimiter", "|").option("header", "false")
         .load("/Users/myuser/_HEADER_0")

org.apache.spark.sql.AnalysisException: 
Unable to infer schema for CSV. It must be specified manually.;

The solution is to rename the file and try again (e.g. rename _HEADER to HEADER).
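The underscore and dot behaviour reported in several answers above matches Hadoop's hidden-file filter: when Spark lists an input path, files whose names begin with _ or . are skipped (this is also why _SUCCESS marker files are ignored), so a directory containing only such files looks empty and schema inference fails. A minimal sketch of that rule:

```python
def is_hidden(name: str) -> bool:
    """Mimic Hadoop's hidden-file filter: input files whose names
    start with '_' or '.' are skipped when Spark lists a path."""
    return name.startswith("_") or name.startswith(".")
```

So _HEADER_0, _lots_of_data.parquet, and .anything are all filtered out before Spark ever tries to read them, while HEADER_0 or data.parquet would be read normally.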
