Json file to pyspark dataframe

问题

I'm trying to work with JSON file on spark (pyspark) environment.

Problem: Unable to convert JSON to expected format in Pyspark Dataframe

1st Input data set:

https://health.data.ny.gov/api/views/cnih-y5dw/rows.json

In this file metadata is defined at start for of the file with tag "meta" and then followed by data with tag "data".

FYI: Steps taken to download data from web to local drive. 1. I've downloaded file to my local drive 2. then pushed to hdfs - from there I'm reading it to spark environment.

df=sqlContext.read.json("/user/train/ny.json",multiLine=True)
df.count()
out[5]: 1
df.show()

df.printSchema()

root
 |-- data: array (nullable = true)
 |    |-- element: array (containsNull = true)
 |    |    |-- element: string (containsNull = true)
 |-- meta: struct (nullable = true)
 |    |-- view: struct (nullable = true)
 |    |    |-- attribution: string (nullable = true)
 |    |    |-- attributionLink: string (nullable = true)
 |    |    |-- averageRating: long (nullable = true)
 |    |    |-- category: string (nullable = true)
 |    |    |-- columns: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- cachedContents: struct (nullable = true)
 |    |    |    |    |    |-- average: string (nullable = true)
 |    |    |    |    |    |-- largest: string (nullable = true)
 |    |    |    |    |    |-- non_null: long (nullable = true)
 |    |    |    |    |    |-- null: long (nullable = true)
 |    |    |    |    |    |-- smallest: string (nullable = true)
 |    |    |    |    |    |-- sum: string (nullable = true)
 |    |    |    |    |    |-- top: array (nullable = true)
 |    |    |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |    |    |-- count: long (nullable = true)
 |    |    |    |    |    |    |    |-- item: string (nullable = true)
 |    |    |    |    |-- dataTypeName: string (nullable = true)
 |    |    |    |    |-- description: string (nullable = true)
 |    |    |    |    |-- fieldName: string (nullable = true)
 |    |    |    |    |-- flags: array (nullable = true)
 |    |    |    |    |    |-- element: string (containsNull = true)
 |    |    |    |    |-- format: struct (nullable = true)
 |    |    |    |    |    |-- align: string (nullable = true)
 |    |    |    |    |    |-- mask: string (nullable = true)
 |    |    |    |    |    |-- noCommas: string (nullable = true)
 |    |    |    |    |    |-- precisionStyle: string (nullable = true)
 |    |    |    |    |-- id: long (nullable = true)
 |    |    |    |    |-- name: string (nullable = true)
 |    |    |    |    |-- position: long (nullable = true)
 |    |    |    |    |-- renderTypeName: string (nullable = true)
 |    |    |    |    |-- tableColumnId: long (nullable = true)
 |    |    |    |    |-- width: long (nullable = true)
 |    |    |-- createdAt: long (nullable = true)
 |    |    |-- description: string (nullable = true)
 |    |    |-- displayType: string (nullable = true)
 |    |    |-- downloadCount: long (nullable = true)
 |    |    |-- flags: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)
 |    |    |-- grants: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- flags: array (nullable = true)
 |    |    |    |    |    |-- element: string (containsNull = true)
 |    |    |    |    |-- inherited: boolean (nullable = true)
 |    |    |    |    |-- type: string (nullable = true)
 |    |    |-- hideFromCatalog: boolean (nullable = true)
 |    |    |-- hideFromDataJson: boolean (nullable = true)
 |    |    |-- id: string (nullable = true)
 |    |    |-- indexUpdatedAt: long (nullable = true)
 |    |    |-- metadata: struct (nullable = true)
 |    |    |    |-- attachments: array (nullable = true)
 |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |-- assetId: string (nullable = true)
 |    |    |    |    |    |-- blobId: string (nullable = true)
 |    |    |    |    |    |-- filename: string (nullable = true)
 |    |    |    |    |    |-- name: string (nullable = true)
 |    |    |    |-- availableDisplayTypes: array (nullable = true)
 |    |    |    |    |-- element: string (containsNull = true)
 |    |    |    |-- custom_fields: struct (nullable = true)
 |    |    |    |    |-- Additional Resources: struct (nullable = true)
 |    |    |    |    |    |-- See Also: string (nullable = true)
 |    |    |    |    |-- Dataset Information: struct (nullable = true)
 |    |    |    |    |    |-- Agency: string (nullable = true)
 |    |    |    |    |-- Dataset Summary: struct (nullable = true)
 |    |    |    |    |    |-- Contact Information: string (nullable = true)
 |    |    |    |    |    |-- Coverage: string (nullable = true)
 |    |    |    |    |    |-- Data Frequency: string (nullable = true)
 |    |    |    |    |    |-- Dataset Owner: string (nullable = true)
 |    |    |    |    |    |-- Granularity: string (nullable = true)
 |    |    |    |    |    |-- Organization: string (nullable = true)
 |    |    |    |    |    |-- Posting Frequency: string (nullable = true)
 |    |    |    |    |    |-- Time Period: string (nullable = true)
 |    |    |    |    |    |-- Units: string (nullable = true)
 |    |    |    |    |-- Disclaimers: struct (nullable = true)
 |    |    |    |    |    |-- Limitations: string (nullable = true)
 |    |    |    |    |-- Local Data: struct (nullable = true)
 |    |    |    |    |    |-- County Filter: string (nullable = true)
 |    |    |    |    |    |-- County_Column: string (nullable = true)
 |    |    |    |-- filterCondition: struct (nullable = true)
 |    |    |    |    |-- children: array (nullable = true)
 |    |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |    |-- metadata: struct (nullable = true)
 |    |    |    |    |    |    |    |-- includeAuto: long (nullable = true)
 |    |    |    |    |    |    |    |-- multiSelect: boolean (nullable = true)
 |    |    |    |    |    |    |    |-- operator: string (nullable = true)
 |    |    |    |    |    |    |    |-- tableColumnId: struct (nullable = true)
 |    |    |    |    |    |    |    |    |-- 583607: long (nullable = true)
 |    |    |    |    |    |    |-- type: string (nullable = true)
 |    |    |    |    |    |    |-- value: string (nullable = true)
 |    |    |    |    |-- metadata: struct (nullable = true)
 |    |    |    |    |    |-- advanced: boolean (nullable = true)
 |    |    |    |    |    |-- unifiedVersion: long (nullable = true)
 |    |    |    |    |-- type: string (nullable = true)
 |    |    |    |    |-- value: string (nullable = true)
 |    |    |    |-- jsonQuery: struct (nullable = true)
 |    |    |    |    |-- order: array (nullable = true)
 |    |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |    |-- ascending: boolean (nullable = true)
 |    |    |    |    |    |    |-- columnFieldName: string (nullable = true)
 |    |    |    |-- rdfSubject: string (nullable = true)
 |    |    |    |-- renderTypeConfig: struct (nullable = true)
 |    |    |    |    |-- visible: struct (nullable = true)
 |    |    |    |    |    |-- table: boolean (nullable = true)
 |    |    |    |-- rowLabel: string (nullable = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- newBackend: boolean (nullable = true)
 |    |    |-- numberOfComments: long (nullable = true)
 |    |    |-- oid: long (nullable = true)
 |    |    |-- owner: struct (nullable = true)
 |    |    |    |-- displayName: string (nullable = true)
 |    |    |    |-- flags: array (nullable = true)
 |    |    |    |    |-- element: string (containsNull = true)
 |    |    |    |-- id: string (nullable = true)
 |    |    |    |-- profileImageUrlLarge: string (nullable = true)
 |    |    |    |-- profileImageUrlMedium: string (nullable = true)
 |    |    |    |-- profileImageUrlSmall: string (nullable = true)
 |    |    |    |-- screenName: string (nullable = true)
 |    |    |    |-- type: string (nullable = true)
 |    |    |-- provenance: string (nullable = true)
 |    |    |-- publicationAppendEnabled: boolean (nullable = true)
 |    |    |-- publicationDate: long (nullable = true)
 |    |    |-- publicationGroup: long (nullable = true)
 |    |    |-- publicationStage: string (nullable = true)
 |    |    |-- query: struct (nullable = true)
 |    |    |    |-- orderBys: array (nullable = true)
 |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |-- ascending: boolean (nullable = true)
 |    |    |    |    |    |-- expression: struct (nullable = true)
 |    |    |    |    |    |    |-- columnId: long (nullable = true)
 |    |    |    |    |    |    |-- type: string (nullable = true)
 |    |    |-- rights: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)
 |    |    |-- rowsUpdatedAt: long (nullable = true)
 |    |    |-- rowsUpdatedBy: string (nullable = true)
 |    |    |-- tableAuthor: struct (nullable = true)
 |    |    |    |-- displayName: string (nullable = true)
 |    |    |    |-- flags: array (nullable = true)
 |    |    |    |    |-- element: string (containsNull = true)
 |    |    |    |-- id: string (nullable = true)
 |    |    |    |-- profileImageUrlLarge: string (nullable = true)
 |    |    |    |-- profileImageUrlMedium: string (nullable = true)
 |    |    |    |-- profileImageUrlSmall: string (nullable = true)
 |    |    |    |-- screenName: string (nullable = true)
 |    |    |    |-- type: string (nullable = true)
 |    |    |-- tableId: long (nullable = true)
 |    |    |-- tags: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)
 |    |    |-- totalTimesRated: long (nullable = true)
 |    |    |-- viewCount: long (nullable = true)
 |    |    |-- viewLastModified: long (nullable = true)
 |    |    |-- viewType: string (nullable = true)

Problem: All records getting wrapped up in single row and two column, i.e meta and data. Also with spark native JSON utility - Spark infers the schema (meatadata) automatically - and my expectation is it shouldn't explicitly as separate column on dataframe.

Expected Output JSON data set has following list of columns. It should show them in tabular format in dataframe where I can query them"

FACILITY, ADDRESS, LAST INSPECTED, VIOLATIONS,TOTAL  CRITICAL VIOLATIONS, TOTAL CRIT.  NOT CORRECTED, TOTAL  NONCRITICAL VIOLATIONS, DESCRIPTION, LOCAL HEALTH DEPARTMENT, COUNTY, FACILITY ADDRESS, CITY, ZIP CODE, NYSDOH GAZETTEER (1980), MUNICIPALITY, OPERATION NAME, PERMIT EXPIRATION DATE, PERMITTED  (D/B/A), PERMITTED  CORP. NAME,PERM. OPERATOR LAST NAME, PERM. OPERATOR LAST NAME, PERM. OPERATOR FIRST NAME, NYS HEALTH OPERATION ID, INSPECTION TYPE, INSPECTION COMMENTS, FOOD SERVICE FACILITY STATE, Location1

2nd Input DataSet: On site, it's first data set about funded projects by world bank

http://jsonstudio.com/resources/

(On site, it's first data set about funded projects by world bank)

It works all fine.

df=sqlContext.read.json("/user/train/wb.json")
df.count()
500

2nd Input Data Sets works all but 1st Input dataset is not. My Observation is the way metadata defined is different for both Json files. In 1st. Meta data is defined first and then data however in 2nd file - meatadate is available with data on every line.

Can you please guide me on 1st input JSON file format and how to handle situation while converting it into pyspark dataframe?

Updated Outcome: After initial analysis we found that format seems wrong but community member helped an alternative way to read format. Marking answer as right and closing this thread.

Let me know if you need any further details, thanks in advance.

回答1:

Check out my notebook on databricks

https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/753971180331031/2636709537891264/8469274372247319/latest.html

The first dataset is corrupt, i.e. it's not valid json and so spark can't read it.

But this was for spark 2.2.1

This is especially confusing because of the way this json file is organized The data is stored as a list of lists

df=spark.read.json("rows.json",multiLine=True)
data = df.select("data").collect()[0]['data']

And the column names are stored separately

column_names = map(lambda x: x['fieldName'], df.select("meta").collect()[0][0][0][4])

来源：https://stackoverflow.com/questions/49396558/json-file-to-pyspark-dataframe

标签

python

json

apache-spark

pyspark

spark-dataframe