apache-spark-sql

Parse JSON root in a column using Spark-Scala

泄露秘密 submitted on 2020-05-26 09:23:49

Question: I'm having trouble turning the root of a JSON document into records in a DataFrame when the number of records is not known in advance. I have a DataFrame generated from JSON similar to the following:

val exampleJson = spark.createDataset(
  """ {"ITEM1512": {"name":"Yin", "address":{"city":"Columbus", "state":"Ohio"} },
       "ITEM1518": {"name":"Yang", "address":{"city":"Working", "state":"Marc"} } }""" :: Nil)

When I read it with the following instruction

val itemsExample = spark.read.json(exampleJson)

The Schema
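The excerpt above is cut off before the schema and the answer, so the following is only a hedged sketch of one common approach (not the original solution): parse the whole document as a map from item id to struct and explode it into one row per item. It assumes Spark 2.2+ (where from_json accepts a MapType schema), an existing SparkSession named spark, and made-up output column names such as item_id.

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

// Struct describing the value under each root key (ITEM1512, ITEM1518, ...).
val itemSchema = new StructType()
  .add("name", StringType)
  .add("address", new StructType()
    .add("city", StringType)
    .add("state", StringType))

// Parse the raw JSON string as a map keyed by the root field names,
// then explode the map into (key, value) rows.
val itemsParsed = exampleJson.toDF("json")
  .select(explode(from_json(col("json"), MapType(StringType, itemSchema))))
  .select(
    col("key").as("item_id"),
    col("value.name").as("name"),
    col("value.address.city").as("city"),
    col("value.address.state").as("state"))

itemsParsed.show()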

message:Hive Schema version 1.2.0 does not match metastore's schema version 2.1.0 Metastore is not upgraded or corrupt

故事扮演 submitted on 2020-05-26 04:30:46

Question: Environment: spark2.11, hive2.2, hadoop2.8.2. The Hive shell runs successfully with no errors or warnings, but when I run application.sh, startup fails:

/usr/local/spark/bin/spark-submit \
  --class cn.spark.sql.Demo \
  --num-executors 3 \
  --driver-memory 512m \
  --executor-memory 512m \
  --executor-cores 3 \
  --files /usr/local/hive/conf/hive-site.xml \
  --driver-class-path /usr/local/hive/lib/mysql-connector-java.jar \
  /usr/local/java/sql/sparkstudyjava.jar

and the error reads: Exception in thread "main"
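The stack trace is truncated above. As a hedged sketch only (not the poster's eventual fix), one documented way to avoid the clash between Spark's built-in Hive 1.2 client and a 2.1.0 metastore schema is to point Spark at a newer metastore client; the jar path below simply mirrors the paths used in the question and is an assumption. Another frequently used workaround is setting hive.metastore.schema.verification to false in hive-site.xml.

import org.apache.spark.sql.SparkSession

// Sketch: ask Spark to talk to the metastore with a 2.1-era Hive client
// instead of its bundled 1.2 client, so the schema version check passes.
val spark = SparkSession.builder()
  .appName("Demo")
  .config("spark.sql.hive.metastore.version", "2.1.0")               // version the metastore reports
  .config("spark.sql.hive.metastore.jars", "/usr/local/hive/lib/*")  // assumed location of the Hive 2.x jars
  .enableHiveSupport()
  .getOrCreate()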

How to drop multiple column names given in a list from Spark DataFrame?

耗尽温柔 submitted on 2020-05-25 12:15:50

Question: I have a dynamic list that is created based on the value of n:

n = 3
drop_lst = ['a' + str(i) for i in range(n)]
df.drop(drop_lst)

But the above does not work. Note: my use case requires a dynamic list. If I just spell the columns out without a list, it works:

df.drop('a0','a1','a2')

How do I make the drop function work with a list? Spark 2.2 doesn't seem to have this capability. Is there a way to make it work without using select()?

Answer 1: You can use the * operator to pass the contents of your list as arguments
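The answer is truncated in this listing; assuming it refers to Python's argument-unpacking operator, the idea looks like this (a sketch, not the verbatim answer):

n = 3
drop_lst = ['a' + str(i) for i in range(n)]
# '*' unpacks the list so each column name is passed as a separate
# argument, matching the varargs signature of DataFrame.drop.
df = df.drop(*drop_lst)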

Reading CSV into a Spark Dataframe with timestamp and date types

南笙酒味 submitted on 2020-05-25 09:05:10

Question: It's CDH with Spark 1.6. I am trying to import this hypothetical CSV into an Apache Spark DataFrame:

$ hadoop fs -cat test.csv
a,b,c,2016-09-09,a,2016-11-11 09:09:09.0,a
a,b,c,2016-09-10,a,2016-11-11 09:09:10.0,a

I use the databricks-csv jar.

val textData = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "false")
  .option("delimiter", ",")
  .option("dateFormat", "yyyy-MM-dd HH:mm:ss")
  .option("inferSchema", "true")
  .option("nullValue", "null")
  .load("test.csv")

I use
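The excerpt stops before the actual question and answer. As a hedged sketch only: with spark-csv on Spark 1.6, a common way to get real date and timestamp types, rather than relying on inferSchema, is to supply an explicit schema. The column names c1..c7 below are made up, since the CSV has no header.

import org.apache.spark.sql.types._

// Explicit schema: column 4 is a date, column 6 a timestamp, the rest strings.
val schema = StructType(Seq(
  StructField("c1", StringType),
  StructField("c2", StringType),
  StructField("c3", StringType),
  StructField("c4", DateType),        // 2016-09-09
  StructField("c5", StringType),
  StructField("c6", TimestampType),   // 2016-11-11 09:09:09.0
  StructField("c7", StringType)))

val typedData = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "false")
  .option("delimiter", ",")
  .option("dateFormat", "yyyy-MM-dd HH:mm:ss")
  .option("nullValue", "null")
  .schema(schema)                     // explicit types instead of inferSchema
  .load("test.csv")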

Spark colocated join between two partitioned dataframes

*爱你&永不变心* submitted on 2020-05-25 06:52:47

Question: For the following join between two DataFrames in Spark 1.6.0:

val df0Rep = df0.repartition(32, col("a")).cache
val df1Rep = df1.repartition(32, col("a")).cache
val dfJoin = df0Rep.join(df1Rep, "a")
println(dfJoin.count)

Is this join not only co-partitioned but also co-located? I know that for RDDs, when the same partitioner is used and the shuffle happens in the same operation, the join is co-located. But what about DataFrames? Thank you.

Answer 1: [https://medium.com/@achilleus/https-medium-com-joins-in
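The linked answer is cut off above. As a hedged sketch (not that answer): one way to see whether the join re-shuffles the cached, repartitioned inputs is to inspect the physical plan for Exchange operators; if both sides already satisfy the join's required hash partitioning on "a", no additional Exchange should appear above them.

import org.apache.spark.sql.functions.col

// Reuses df0 and df1 from the question; only the plan inspection is added.
val df0Rep = df0.repartition(32, col("a")).cache
val df1Rep = df1.repartition(32, col("a")).cache
val dfJoin = df0Rep.join(df1Rep, "a")

// Print the physical plan: an Exchange directly above either side of the
// SortMergeJoin would indicate that side is being shuffled again.
dfJoin.explain()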
