Question
I have referred to all the links mentioned here:
1) Link-1 2) Link-2 3) Link-3 4) Link-4
The following R code uses the sparklyr package. It reads a large JSON file and creates a database schema.
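library(sparklyr)
library(sparklyr.nested) # provides sdf_schema_viewer()
library(dplyr)
# conf is a Spark configuration object defined elsewhere
sc <- spark_connect(master = "local", config = conf, version = "2.2.0") # connection
sample_tbl <- spark_read_json(sc, name = "example", path = "example.json",
                              memory = FALSE, overwrite = TRUE) # reads the JSON file
sdf_schema_viewer(sample_tbl) # displays the schema without overwriting sample_tbl
df <- tbl(sc, "example") # reference to the registered table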
It created the following database schema.
Renaming a first-level column works. For example:
df %>% rename(ent = entities)
But when I try to rename a column nested at the second level, it is not renamed:
df %>% rename(e_hashtags = entities.hashtags)
It raises this error:
Error in .f(.x[[i]], ...) : object 'entities.hashtags' not found
Question
How can I rename nested columns, including those at the third and fourth levels?
Please refer to the database schema mentioned above.
Answer 1:
Spark as such doesn't support renaming individual nested fields. You have to either cast or rebuild a whole structure. For simplicity let's assume that data looks as follows:
cat('{"contributors": "foo", "coordinates": "bar", "entities": {"hashtags": ["foo", "bar"], "media": "missing"}}', file = "/tmp/example.json")
df <- spark_read_json(sc, "df", "/tmp/example.json", overwrite=TRUE)
df %>% spark_dataframe() %>% invoke("schema") %>% invoke("treeString") %>% cat()
root
 |-- contributors: string (nullable = true)
 |-- coordinates: string (nullable = true)
 |-- entities: struct (nullable = true)
 |    |-- hashtags: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |    |-- media: string (nullable = true)
The schema has the following simple string representation:
df %>%
  spark_dataframe() %>%
  invoke("schema") %>%
  invoke("simpleString") %>%
  cat(sep = "\n")
struct<contributors:string,coordinates:string,entities:struct<hashtags:array<string>,media:string>>
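With a cast, you have to define an expression using a matching type description. The struct is cast field by field, so the type must list every field in its original order, with only the names changed: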
expr_cast <- invoke_static(
  sc, "org.apache.spark.sql.functions", "expr",
  "CAST(entities AS struct<e_hashtags:array<string>,media:string>)"
)
df_cast <- df %>%
  spark_dataframe() %>%
  invoke("withColumn", "entities", expr_cast) %>%
  sdf_register()
df_cast %>% spark_dataframe() %>% invoke("schema") %>% invoke("treeString") %>% cat()
root
 |-- contributors: string (nullable = true)
 |-- coordinates: string (nullable = true)
 |-- entities: struct (nullable = true)
 |    |-- e_hashtags: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |    |-- media: string (nullable = true)
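For structs nested three or four levels deep, the type description passed to CAST has to spell out every level, which is easy to get wrong by hand. The following is a minimal sketch in base R, not part of sparklyr, that builds the type string from a nested named list. The helper name struct_type_string and the field names in the deeper example (urls, display_url) are hypothetical and only illustrate the nesting.

```r
# Hypothetical helper: convert a nested named list into a Spark SQL struct type string.
# Character values are taken as Spark type names; list values become nested structs.
struct_type_string <- function(fields) {
  parts <- vapply(names(fields), function(nm) {
    value <- fields[[nm]]
    type <- if (is.list(value)) struct_type_string(value) else value
    paste0(nm, ":", type)
  }, character(1))
  paste0("struct<", paste(parts, collapse = ","), ">")
}

# Target layout for the example above: same fields, same order, new name for hashtags.
target <- list(e_hashtags = "array<string>", media = "string")
cast_sql <- sprintf("CAST(entities AS %s)", struct_type_string(target))

# A deeper layout with made-up field names, to show that nesting is handled recursively.
deep_target <- list(
  e_hashtags = "array<string>",
  e_urls = list(shown_url = "string")
)
deep_sql <- sprintf("CAST(entities AS %s)", struct_type_string(deep_target))

cat(cast_sql, deep_sql, sep = "\n")
```

The resulting string can be passed as the last argument of the invoke_static(sc, "org.apache.spark.sql.functions", "expr", ...) call shown above. The list must name every field of the struct in its original order, because the cast matches fields by position.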
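To rebuild the structure, you have to list all of its components, including the ones you are not renaming. Note in the output below that the rebuilt struct is reported as nullable = false: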
expr_struct <- invoke_static(
  sc, "org.apache.spark.sql.functions", "expr",
  "struct(entities.hashtags AS e_hashtags, entities.media)"
)
df_struct <- df %>%
  spark_dataframe() %>%
  invoke("withColumn", "entities", expr_struct) %>%
  sdf_register()
df_struct %>% spark_dataframe() %>% invoke("schema") %>% invoke("treeString") %>% cat()
root
 |-- contributors: string (nullable = true)
 |-- coordinates: string (nullable = true)
 |-- entities: struct (nullable = false)
 |    |-- e_hashtags: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |    |-- media: string (nullable = true)
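When a struct has many fields, the struct(...) expression can be generated rather than typed out. This is a rough sketch in base R under the assumption that the field names of the struct are already known; the helper name struct_expr and its arguments are hypothetical, not a sparklyr function. It handles a single level, so a deeper rename would need one generated expression per nested struct.

```r
# Hypothetical helper: build a Spark SQL struct(...) expression for one struct column.
#   parent  - name of the struct column, e.g. "entities"
#   fields  - character vector with all field names of that struct
#   renames - named character vector, old name -> new name, for the fields to rename
struct_expr <- function(parent, fields, renames = character()) {
  parts <- vapply(fields, function(f) {
    new_name <- if (f %in% names(renames)) renames[[f]] else f
    sprintf("%s.%s AS %s", parent, f, new_name)
  }, character(1))
  sprintf("struct(%s)", paste(parts, collapse = ", "))
}

rebuild_sql <- struct_expr(
  "entities",
  fields = c("hashtags", "media"),
  renames = c(hashtags = "e_hashtags")
)
cat(rebuild_sql, "\n")
```

The generated string aliases every field explicitly, including the unchanged ones, and can be used in place of the hand-written expression in the invoke_static() call above.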
Source: https://stackoverflow.com/questions/52263836/changing-nested-column-names-using-sparklyr-in-r