Reading JSON with Apache Spark - `corrupt_record`

Anonymous (unverified), submitted 2019-12-03 02:06:01

Question:

I have a JSON file, nodes.json, that looks like this:

[{"toid":"osgb4000000031043205","point":[508180.748,195333.973],"index":1} ,{"toid":"osgb4000000031043206","point":[508163.122,195316.627],"index":2} ,{"toid":"osgb4000000031043207","point":[508172.075,195325.719],"index":3} ,{"toid":"osgb4000000031043208","point":[508513,196023],"index":4}] 

I am able to read and manipulate this record with Python.

I am trying to read this file in Scala through the spark-shell.

From this tutorial, I can see that it is possible to read JSON via sqlContext.read.json:

val vfile = sqlContext.read.json("path/to/file/nodes.json") 

However, this results in a corrupt_record error:

vfile: org.apache.spark.sql.DataFrame = [_corrupt_record: string] 

Can anyone shed some light on this error? I can read and use the file with other applications, and I am confident it is not corrupt but sound JSON.

Answer 1:

Spark cannot map a top-level JSON array to records, so you have to pass:

{"toid":"osgb4000000031043205","point":[508180.748,195333.973],"index":1}  {"toid":"osgb4000000031043206","point":[508163.122,195316.627],"index":2}  {"toid":"osgb4000000031043207","point":[508172.075,195325.719],"index":3}  {"toid":"osgb4000000031043208","point":[508513,196023],"index":4} 

As described in the tutorial you're referring to:

Let's begin by loading a JSON file, where each line is a JSON object

The reasoning is quite simple. Spark expects a file containing many JSON entities, one per line, so that it can distribute their processing (roughly speaking, per entity). That's why it expects a JSON object at the top level but gets an array, which it cannot map to a record because there is no name for such a column. Basically (but not precisely), Spark sees your array as one row with one column and fails to find a name for that column.

To shed more light on it, here is a quote from the official docs:

Note that the file that is offered as a json file is not a typical JSON file. Each line must contain a separate, self-contained valid JSON object. As a consequence, a regular multi-line JSON file will most often fail.

This format is often called JSON Lines (JSONL). Basically, it's an alternative to CSV.
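
As an aside: Spark 2.2 and later can also load a file whose top level is a JSON array directly, via the multiLine option, so with a recent version the original file works without reformatting. A minimal sketch, assuming a SparkSession named spark is already available:

// multiLine treats each input file as one JSON document; a top-level
// array then becomes one row per array element.
val vfile = spark.read.option("multiLine", true).json("path/to/file/nodes.json")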



Answer 2:

To read the multi-line JSON as a DataFrame:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
val df = spark.read.json(spark.sparkContext.wholeTextFiles("file.json").values)

Reading large files in this manner is not recommended; from the wholeTextFiles docs:

Small files are preferred, large file is also allowable, but may cause bad performance.



Answer 3:

I ran into the same problem. I used SparkContext and Spark SQL with the same configuration:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SparkSession

val conf = new SparkConf()
  .setMaster("local[1]")
  .setAppName("Simple Application")
val sc = new SparkContext(conf)

val spark = SparkSession
  .builder()
  .config(conf)
  .getOrCreate()

Then, using the Spark context, I read the whole JSON file (JSON holds the path to the file):

 val jsonRDD = sc.wholeTextFiles(JSON).map(x => x._2) 

You can create a schema for future selects, filters, etc.:

import org.apache.spark.sql.types.{ArrayType, DoubleType, StringType, StructField, StructType}

val schema = StructType(List(
  StructField("toid", StringType, nullable = true),
  StructField("point", ArrayType(DoubleType), nullable = true),
  StructField("index", DoubleType, nullable = true)
))

Create a DataFrame using Spark SQL:

import org.apache.spark.sql.DataFrame

val df: DataFrame = spark.read.schema(schema).json(jsonRDD)

For testing, use show and printSchema:

df.show()
df.printSchema()
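
With the explicit schema above, printSchema should print something roughly like the following (a sketch of the expected output, not captured from any particular Spark version):

root
 |-- toid: string (nullable = true)
 |-- point: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- index: double (nullable = true)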

sbt build file:

name := "spark-single"  version := "1.0"  scalaVersion := "2.11.7"  libraryDependencies += "org.apache.spark" %% "spark-core" % "2.0.2" libraryDependencies +="org.apache.spark" %% "spark-sql" % "2.0.2" 

