databricks

Passing external yml file in my spark-job/code not working throwing “Can't construct a java object for tag:yaml.org,2002”

流过昼夜 submitted on 2019-12-17 14:32:27
Question: I am using Spark 2.4.1 and Java 8. I am trying to load an external property file while submitting my Spark job with spark-submit. I am using the Typesafe config dependency below to load my property file: <groupId>com.typesafe</groupId> <artifactId>config</artifactId> <version>1.3.1</version> In my Spark driver class MyDriver.java I load the YML file as below: String ymlFilename = args[1].toString(); Optional<QueryEntities> entities = InputYamlProcessor.process(ymlFilename); I have all the code here
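The "Can't construct a java object for tag:yaml.org,2002:..." error comes from SnakeYAML rather than Typesafe config, and it usually means the class named in the YAML tag cannot be constructed by the parser, for example because it is not on the classpath of the JVM doing the parsing or the bean layout does not match the document. A minimal sketch of what a SnakeYAML-based InputYamlProcessor typically looks like is below; the QueryEntities bean shape and its fields are assumptions, not the poster's actual code. If the YAML file is shipped via spark-submit --files, it may also need to be resolved with SparkFiles.get on the cluster.

import java.io.FileInputStream
import scala.beans.BeanProperty
import scala.util.Try
import org.yaml.snakeyaml.Yaml

// Hypothetical bean: SnakeYAML needs a no-arg constructor and getters/setters.
class QueryEntities {
  @BeanProperty var table: String = _
  @BeanProperty var columnFamily: String = _
}

object InputYamlProcessor {
  def process(ymlFilename: String): Option[QueryEntities] = {
    val yaml = new Yaml()
    Try {
      val in = new FileInputStream(ymlFilename)
      // loadAs maps the document onto the given class; if the YAML carries an
      // explicit "!!com.example.QueryEntities" tag, that class must be loadable,
      // otherwise SnakeYAML raises "Can't construct a java object for tag:yaml.org,2002".
      try yaml.loadAs(in, classOf[QueryEntities]) finally in.close()
    }.toOption
  }
}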

How to know the file formats supported by Databricks?

坚强是说给别人听的谎言 submitted on 2019-12-14 03:26:45
Question: I have a requirement to load various files (of different types) into a Spark DataFrame. Are all of these file formats supported by Databricks? If yes, where can I get the list of options supported for each file format? delimited, csv, parquet, avro, excel, json. Thanks. Answer 1: I don't know exactly what Databricks offers out of the box (pre-installed), but you can do some reverse engineering using the org.apache.spark.sql.execution.datasources.DataSource object, which is (quoting the scaladoc): The main class
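For context, the formats in the question map onto Spark's own datasource API; a minimal sketch of how they are usually read is below. The paths and options are placeholders, and "excel" is not a built-in Spark source, so it typically needs a third-party package (e.g. the crealytics spark-excel library), which is an assumption here rather than something Databricks ships by default.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("format-check").getOrCreate()

// Built-in sources addressed by their short names (paths are placeholders):
val delimitedDf = spark.read.option("header", "true").option("delimiter", "|").csv("/data/in.txt")
val csvDf       = spark.read.option("header", "true").csv("/data/in.csv")
val parquetDf   = spark.read.parquet("/data/in.parquet")
val jsonDf      = spark.read.json("/data/in.json")
// avro is built in from Spark 2.4; on older versions it needs the spark-avro package.
val avroDf      = spark.read.format("avro").load("/data/in.avro")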

How can I access a method which returns an Option object?

三世轮回 submitted on 2019-12-13 11:24:34
Question: I have to call a method which returns Option[List[Obj]]. After the call I need to iterate over the List and print the Obj attributes. object Tester { def main(args: Array[String]) { val ymlFilename = "some.yml"; val entities: Option[QueryEntities] = InputYamlProcessor.process(ymlFilename); for (e: QueryEntities <- entities) { // this is not working // How to access the columnFamily, fromData and toDate? } } } Complete sample: https://gist.github.com/shatestest/fdeaba767d78e171bb6c08b359fbd1bf Answer 1: The most
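Since the answer is cut off above, here is a minimal sketch of the usual patterns: Option can be treated as a collection, so foreach (or a for-comprehension over both the Option and the inner list) only runs when a value is present. The Entity/QueryEntities shapes below are assumptions based on the field names mentioned in the question.

case class Entity(columnFamily: String, fromData: String, toDate: String)
case class QueryEntities(entities: List[Entity])

val entitiesOpt: Option[QueryEntities] =
  Some(QueryEntities(List(Entity("cf1", "2019-01-01", "2019-12-31"))))

// foreach on an Option runs only when it is a Some:
entitiesOpt.foreach { qe =>
  qe.entities.foreach(e => println(s"${e.columnFamily} ${e.fromData} ${e.toDate}"))
}

// Equivalent for-comprehension over the Option and the inner List:
for {
  qe <- entitiesOpt
  e  <- qe.entities
} println(s"${e.columnFamily} ${e.fromData} ${e.toDate}")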

How do I use a from_json() dataframe in Spark?

ⅰ亾dé卋堺 submitted on 2019-12-13 07:54:35
Question: I'm trying to create a dataset from a JSON string within a dataframe in Databricks 3.5 (Spark 2.2.1). In the code block below, 'jsonSchema' is a StructType with the correct layout for the JSON string, which is in the 'body' column of the dataframe. val newDF = oldDF.select(from_json($"body".cast("string"), jsonSchema)) This returns a dataframe where the root object is jsontostructs(CAST(body AS STRING)):struct, followed by the fields in the schema (which looks correct). When I try another select on
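Although the question is cut off, the usual next step after from_json is to alias the parsed struct and then select its fields (or "parsed.*") to flatten it; a minimal sketch is below. The stand-in oldDF, the placeholder schema and the Body case class are assumptions used only to keep the snippet self-contained.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("from-json").getOrCreate()
import spark.implicits._

// Stand-in for the question's oldDF: one JSON string in a 'body' column.
val oldDF = Seq("""{"id": 1, "name": "a"}""").toDF("body")
val jsonSchema = new StructType().add("id", LongType).add("name", StringType)

val parsedDF = oldDF
  .select(from_json($"body".cast("string"), jsonSchema).alias("parsed"))
  .select("parsed.*")            // flatten the struct into top-level columns

case class Body(id: Long, name: String)
val ds = parsedDF.as[Body]       // a typed Dataset, if that is the end goal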

Spark 2.0.0 truncate from Redshift table using jdbc

断了今生、忘了曾经 submitted on 2019-12-13 07:22:56
Question: Hello, I am using Spark SQL (2.0.0) with Redshift and I want to truncate my tables. I am using the spark-redshift package and I want to know how I can truncate my table. Can anyone please share an example of this? Answer 1: I was unable to accomplish this using Spark and the code in the spark-redshift repo that you have listed above. I was, however, able to use AWS Lambda with psycopg2 to truncate a Redshift table. Then I use boto3 to kick off my Spark job via AWS Glue. The important code below is
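Since the answer above went the Lambda/psycopg2 route, here is a minimal sketch of a pure-JVM alternative: issuing the TRUNCATE over plain JDBC from the Spark driver before writing with spark-redshift. The JDBC URL, credentials and table name are placeholders, and the Redshift (or PostgreSQL-compatible) JDBC driver has to be on the driver's classpath.

import java.sql.DriverManager

val jdbcUrl = "jdbc:redshift://my-cluster.xxxxxxxx.us-east-1.redshift.amazonaws.com:5439/mydb"
val conn = DriverManager.getConnection(jdbcUrl, "db_user", "db_password")
try {
  val stmt = conn.createStatement()
  // TRUNCATE runs directly against Redshift, outside of Spark.
  try stmt.executeUpdate("TRUNCATE TABLE my_schema.my_table")
  finally stmt.close()
} finally {
  conn.close()
}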

How to perform accumulated avg for multiple companies using spark based on the results stored in Cassandra?

时光毁灭记忆、已成空白 submitted on 2019-12-13 03:49:54
Question: I need to get the avg and count for a given dataframe, and I also need to get the previously stored avg and count from a Cassandra table for each company. Then I need to calculate the combined avg and count and persist them back into the Cassandra table. How can I do this for each company? I have two dataframe schemas as below: ingested_df |-- company_id: string (nullable = true) |-- max_dd: date (nullable = true) |-- min_dd: date (nullable = true) |-- mean: double (nullable = true) |-- count: long (nullable = false) cassandra
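Since the question is cut off above, here is a minimal sketch of the per-company combination step it describes: join the freshly ingested (mean, count) with the previously stored (mean, count) on company_id and compute a weighted average. It assumes the stored dataframe (cassandra_df below) has matching company_id, mean and count columns, and that both dataframes from the question are in scope.

import org.apache.spark.sql.functions._

val combined = ingested_df.alias("n")
  .join(cassandra_df.alias("o"), Seq("company_id"), "left_outer")
  .select(
    col("company_id"),
    // running count = previously stored count + newly ingested count
    (coalesce(col("o.count"), lit(0L)) + col("n.count")).alias("count"),
    // weighted mean = (old_mean * old_count + new_mean * new_count) / total_count
    ((coalesce(col("o.mean"), lit(0.0)) * coalesce(col("o.count"), lit(0L)) +
      col("n.mean") * col("n.count")) /
      (coalesce(col("o.count"), lit(0L)) + col("n.count"))).alias("mean")
  )
// 'combined' can then be written back to Cassandra, e.g. with the
// spark-cassandra-connector's "org.apache.spark.sql.cassandra" format.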

Writing to Cosmos DB Graph API from Databricks (Apache Spark)

本秂侑毒 submitted on 2019-12-13 00:56:30
Question: I have a DataFrame in Databricks which I want to use to create a graph in Cosmos, with one row in the DataFrame corresponding to one vertex in Cosmos. When I write to Cosmos I can't see any properties on the vertices, just a generated id. Get data: data = spark.sql("select * from graph.testgraph") Configuration: writeConfig = { "Endpoint" : "******", "Masterkey" : "******", "Database" : "graph", "Collection" : "TestGraph", "Upsert" : "true", "query_pagesize" : "100000", "bulkimport": "true",

Is there a good way to join a stream in spark with a changing table?

余生颓废 submitted on 2019-12-12 08:09:35
Question: Our Spark environment: Databricks 4.2 (includes Apache Spark 2.3.1, Scala 2.11). What we are trying to achieve: we want to enrich streaming data with some reference data, which is updated regularly. The enrichment is done by joining the stream with the reference data. What we implemented: we implemented two Spark jobs (jars). The first one updates a Spark table TEST_TABLE every hour (let's call it 'reference data') by using .write.mode(SaveMode.Overwrite).saveAsTable("TEST_TABLE") And afterwards
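Although the question text is cut off, the second job is presumably the streaming join, so here is a minimal sketch of a stream-static join for context: the reference table is read as a static dataframe and joined to the stream. The Kafka source, the join_key column and the console sink are placeholders, and whether the static side actually picks up the hourly Overwrite of TEST_TABLE is exactly the issue the question is about, so this sketch does not settle that.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("stream-enrich").getOrCreate()

// Static side: the regularly refreshed reference data.
val referenceDf = spark.table("TEST_TABLE")

// Streaming side (placeholder Kafka source).
val streamDf = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host:9092")
  .option("subscribe", "events")
  .load()
  .selectExpr("CAST(key AS STRING) AS join_key", "CAST(value AS STRING) AS payload")

// Stream-static join; assumes TEST_TABLE also exposes a join_key column.
val enriched = streamDf.join(referenceDf, Seq("join_key"), "left_outer")

enriched.writeStream.format("console").start()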

“expression is neither present in the group by, nor is it an aggregate function” what is wrong here?

流过昼夜 submitted on 2019-12-11 23:53:53
Question: I am trying to apply a pivot on my dataframe as below: val pivot_company_model_vals_df = company_model_vals_df.groupBy("company_id","model_id","data_date") .pivot("data_code") .agg( when( col("data_item_value_numeric").isNotNull, first("data_value_numeric")).otherwise(first("data_value_string")) ) Error: org.apache.spark.sql.AnalysisException: expression '`data_item_value_numeric`' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first
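The error message itself points at the usual workaround: every column referenced inside .agg must sit inside an aggregate function, and here data_item_value_numeric is referenced bare inside when(...). A minimal sketch of wrapping the whole conditional in first() is below; it assumes company_model_vals_df from the question is in scope and keeps the column names as given.

import org.apache.spark.sql.functions._

val pivot_company_model_vals_df = company_model_vals_df
  .groupBy("company_id", "model_id", "data_date")
  .pivot("data_code")
  .agg(
    first(
      when(col("data_item_value_numeric").isNotNull, col("data_value_numeric"))
        .otherwise(col("data_value_string")),
      ignoreNulls = true
    )
  )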