databricks

Why does “databricks-connect test” not work after configuring Databricks Connect?

て烟熏妆下的殇ゞ Submitted on 2019-12-11 03:11:47
Question: I want to run my Spark processes directly on my cluster from IntelliJ IDEA, so I'm following this documentation: https://docs.azuredatabricks.net/user-guide/dev-tools/db-connect.html After configuring everything, I run databricks-connect test but I don't get the Scala REPL that the documentation says should appear. Here is my cluster configuration. Answer 1: Your problem looks like it is one of the following: a) You specified the wrong port (it has to be 8787 on Azure) b) You didn't open up the port in you
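Once the port and cluster settings look right, a quick way to confirm connectivity outside of databricks-connect test is to run a trivial job from a local script; Databricks Connect routes the local SparkSession to the remote cluster. A minimal sketch in PySpark, assuming databricks-connect configure has already completed successfully:

from pyspark.sql import SparkSession

# With Databricks Connect installed, this local SparkSession is backed by the remote cluster.
spark = SparkSession.builder.getOrCreate()

# A trivial job: if this prints 45, the local client can reach the cluster and run jobs on it.
print(spark.range(10).selectExpr("sum(id) AS s").collect()[0]["s"])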

SparkSQL job fails when calling stddev over 1,000 columns

落花浮王杯 Submitted on 2019-12-11 00:56:25
Question: I am on Databricks with Spark 2.2.1 and Scala 2.11. I am attempting to run a SQL query that looks like the following. select stddev(col1), stddev(col2), ..., stddev(col1300) from mydb.mytable I then execute the code as follows. myRdd = sqlContext.sql(sql) However, I see the following exception thrown. Job aborted due to stage failure: Task 24 in stage 16.0 failed 4 times, most recent failure: Lost task 24.3 in stage 16.0 (TID 1946, 10.184.163.105, executor 3): org.codehaus.janino
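An org.codehaus.janino error on very wide aggregations usually means the generated code for a single stage exceeded the JVM's 64KB method limit. One common workaround (a sketch, not the accepted answer, which is cut off above) is to compute the aggregates in smaller batches of columns and then combine the one-row results; shown here in PySpark, with the table name and column naming taken from the question:

from functools import reduce
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.table("mydb.mytable")                 # table name from the question
cols = [c for c in df.columns if c.startswith("col")]

batch_size = 100                                 # small enough to keep generated code under the limit
results = []
for i in range(0, len(cols), batch_size):
    batch = cols[i:i + batch_size]
    # One narrow aggregation per batch instead of a single 1,300-column aggregation.
    results.append(df.agg(*[F.stddev(c).alias(f"stddev_{c}") for c in batch]))

# Each result is a single-row DataFrame; cross-join them back into one row.
combined = reduce(lambda a, b: a.crossJoin(b), results)
combined.show()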

Equality of two data frames

梦想与她 Submitted on 2019-12-10 11:55:56
Question: I have the following scenario: I have 2 dataframes, each containing only 1 column. Let's say DF1=(1,2,3,4,5) and DF2=(3,6,7,8,9,10). Basically those values are keys, and I am creating a parquet file of DF1 if the keys in DF1 are not in DF2 (in the current example it should return false). My current way of achieving my requirement is: val df1count= DF1.count val df2count=DF2.count val diffDF=DF2.except(DF1) val diffCount=diffDF.count if(diffCount==(df2count-df1count)) true else false The problem with this approach is I
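A cheaper way to test whether every key in DF1 also appears in DF2 is a single left-anti join followed by an emptiness check, which avoids the three separate count actions in the snippet above. A minimal sketch (in PySpark for consistency with the other examples on this page; the Scala Dataset API offers the same join and except methods), assuming both frames have a single column named "key":

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df1 = spark.createDataFrame([(v,) for v in (1, 2, 3, 4, 5)], ["key"])
df2 = spark.createDataFrame([(v,) for v in (3, 6, 7, 8, 9, 10)], ["key"])

# Rows of df1 whose key is absent from df2; if this is empty, every DF1 key is in DF2.
missing = df1.join(df2, on="key", how="left_anti")

# head(1) only needs to find a single row, unlike a full count().
all_keys_present = len(missing.head(1)) == 0
print(all_keys_present)   # False for the example data (keys 1, 2, 4, 5 are missing)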

Consume events from EventHub In Azure Databricks using pySpark

半世苍凉 Submitted on 2019-12-10 10:36:51
Question: I can see Spark connectors and guidelines for consuming events from Event Hub using Scala in Azure Databricks. But how can we consume events from Event Hub in Azure Databricks using pySpark? Any suggestions or documentation details would help. Thanks. Answer 1: Below is the snippet for reading events from Event Hub with pyspark on Azure Databricks. # With an entity path connection_string = "Endpoint=sb://SAMPLE;SharedAccessKeyName=KEY_NAME;SharedAccessKey=KEY;EntityPath=EVENTHUB_NAME" # Source with default
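For context, a fuller PySpark sketch of a Structured Streaming read through the azure-event-hubs-spark connector is shown below; the connection string is a placeholder, the connector library must be attached to the cluster, and newer connector versions additionally expect the connection string to be encrypted via the connector's EventHubsUtils before use:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder connection string; recent connector versions require encrypting it first,
# e.g. via org.apache.spark.eventhubs.EventHubsUtils.encrypt(...) on the JVM side.
connection_string = "Endpoint=sb://SAMPLE;SharedAccessKeyName=KEY_NAME;SharedAccessKey=KEY;EntityPath=EVENTHUB_NAME"
ehConf = {"eventhubs.connectionString": connection_string}

df = (spark.readStream
      .format("eventhubs")          # format name provided by the azure-event-hubs-spark connector
      .options(**ehConf)
      .load())

# The payload arrives in the binary `body` column; cast it to a string for inspection.
events = df.selectExpr("CAST(body AS STRING) AS body", "enqueuedTime")

query = (events.writeStream
         .format("console")
         .outputMode("append")
         .start())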

Saving empty DataFrame with known schema (Spark 2.2.1)

不羁岁月 Submitted on 2019-12-10 07:24:37
Question: Is it possible to save an empty DataFrame with a known schema such that the schema is written to the file, even though it has 0 records? def example(spark: SparkSession, path: String, schema: StructType) = { val dataframe = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema) val dataframeWriter = dataframe.write.mode(SaveMode.Overwrite).format("parquet") dataframeWriter.save(path) spark.read.load(path) // ERROR!! No files to read, so schema unknown } Answer 1: This is the answer I
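One workaround (a sketch, not necessarily the answer the original poster accepted, which is cut off above) is to keep the schema on the read side: since Parquet cannot infer a schema from zero data files, supply it explicitly with .schema(...) when loading. Shown in PySpark with a hypothetical schema and path:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.getOrCreate()

# Hypothetical schema and path for illustration.
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
])
path = "/tmp/empty_example"

# Write an empty DataFrame; depending on the Spark version this may produce
# no Parquet data files at all, only a _SUCCESS marker.
spark.createDataFrame([], schema).write.mode("overwrite").parquet(path)

# Reading back with an explicit schema avoids the "unable to infer schema" error.
df = spark.read.schema(schema).parquet(path)
df.printSchema()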

Exporting spark dataframe to .csv with header and specific filename

寵の児 Submitted on 2019-12-08 19:35:47
Question: I am trying to export data from a Spark dataframe to a .csv file: df.coalesce(1)\ .write\ .format("com.databricks.spark.csv")\ .option("header", "true")\ .save(output_path) It creates a file named "part-r-00001-512872f2-9b51-46c5-b0ee-31d626063571.csv". I want the filename to be "part-r-00000.csv" or "part-00000.csv". As the file is being created on AWS S3, I am limited in how I can use os.system commands. How can I set the file name while keeping the header in the file? Thanks! Answer 1: Well,
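Spark itself always names its output files part-*; a common pattern is to write to a staging directory and then move the single part file to the desired name. A minimal sketch assuming a Databricks notebook (where dbutils is available), with hypothetical S3 paths and `df` being the DataFrame from the question:

# Assumes a Databricks notebook: `spark` and `dbutils` are predefined there.
tmp_path = "s3://my-bucket/tmp_export"                  # hypothetical staging directory
final_path = "s3://my-bucket/export/part-00000.csv"     # desired final file name

(df.coalesce(1)
   .write
   .format("csv")                # built-in CSV writer in Spark 2.x+; header still included
   .option("header", "true")
   .mode("overwrite")
   .save(tmp_path))

# Find the single part file Spark produced and move it to the final name.
part_file = [f.path for f in dbutils.fs.ls(tmp_path) if f.name.startswith("part-")][0]
dbutils.fs.mv(part_file, final_path)
dbutils.fs.rm(tmp_path, True)    # clean up the staging directory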

Databricks cluster does not initialize Azure library with error: module 'lib' has no attribute 'SSL_ST_INIT'

佐手、 Submitted on 2019-12-08 05:06:50
Question: I am using an Azure Databricks notebook with the Azure library to get the list of files in Blob Storage. This task is scheduled; the cluster is terminated after the job finishes and started again for each new run. I am using the Azure 4.0.0 library (https://pypi.org/project/azure/). Sometimes I get the error message: AttributeError: module 'lib' has no attribute 'SSL_ST_INIT' and, very rarely, also: AttributeError: cffi library '_openssl' has no function, constant or global variable named 'CRYPTOGRAPHY_PACKAGE
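This particular AttributeError usually points to a mismatch between the pyOpenSSL and cryptography packages on the cluster rather than to the azure package itself. One commonly suggested mitigation, sketched below with assumed version pins (match them to your Databricks runtime), is to install known-compatible versions at the start of the job, before anything imports the Azure SDK:

# Cell 1: pin compatible crypto packages before any `import azure...` runs.
# dbutils.library.installPyPI is available on older Databricks runtimes;
# on newer runtimes the %pip magic serves the same purpose.
dbutils.library.installPyPI("pyOpenSSL", version="19.1.0")    # assumed version pin
dbutils.library.installPyPI("cryptography", version="2.8")    # assumed version pin
dbutils.library.restartPython()   # restart so the pinned packages take effect

# Cell 2 (after the restart): the Azure SDK import should no longer hit SSL_ST_INIT.
from azure.storage.blob import BlockBlobService   # azure==4.0.0 style import
block_blob_service = BlockBlobService(account_name="ACCOUNT", account_key="KEY")
print([b.name for b in block_blob_service.list_blobs("my-container")])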

Scala: How can I split up a dataframe by row number?

别说谁变了你拦得住时间么 Submitted on 2019-12-08 03:26:26
Question: I want to split up a dataframe of 2.7 million rows into small dataframes of 100,000 rows, so I end up with about 27 dataframes, which I also want to store as csv files. I have already taken a look at partitionBy and groupBy, but I don't need to worry about any conditions, except that the rows have to be ordered by date. I am trying to write my own code to make this work, but if you know of some Scala (Spark) functions I could use, that would be great! Thank you all for the suggestions! Answer 1: You
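One way to do this (a sketch in PySpark for consistency with the other examples on this page; the Window/row_number API is identical in the Scala DataFrame API) is to assign a global row number ordered by date, derive a chunk id from it, and let the writer partition the output by that id:

from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.table("my_big_table")      # hypothetical source with a `date` column

chunk_size = 100000
# A window with no partition column gives a global ordering; note that this funnels
# all rows through a single task, which is tolerable here only because the result
# is being split into modest CSV files anyway.
w = Window.orderBy("date")

chunked = (df
           .withColumn("row_num", F.row_number().over(w))
           .withColumn("chunk", ((F.col("row_num") - 1) / chunk_size).cast("int")))

# One output directory per chunk, each holding roughly 100,000 rows as CSV.
(chunked.drop("row_num")
        .write
        .partitionBy("chunk")
        .option("header", "true")
        .mode("overwrite")
        .csv("/tmp/chunked_output"))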

Spark Will Not Load Large MySql Table: Java Communications link failure - Timing Out

*爱你&永不变心* Submitted on 2019-12-07 15:41:28
I'm trying to get a pretty large table from MySQL so I can manipulate it using Spark/Databricks. I can't get it to load into Spark - I have tried taking smaller subsets, but even at the smallest reasonable unit, it still fails to load. I have tried playing with wait_timeout and interactive_timeout in MySQL, but it doesn't seem to make any difference. I am also loading a smaller (different) table, and that loads just fine. df_dataset = get_jdbc('raw_data_load', predicates=predicates).select('field1','field2', 'field3','date') df_dataset = df_dataset.repartition('date') df_dataset
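When a single large JDBC read times out, the usual levers are to split the read across several connections (partitionColumn/lowerBound/upperBound/numPartitions, or explicit predicates as in the question) and to stream rows with a reasonable fetchsize instead of buffering the whole result. A sketch with hypothetical connection details (the get_jdbc helper from the question is not shown, so this uses the plain DataFrameReader API, and the partition column and bounds are assumptions):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

jdbc_url = "jdbc:mysql://HOST:3306/mydb"    # hypothetical connection details

df_dataset = (spark.read.format("jdbc")
              .option("url", jdbc_url)
              .option("dbtable", "raw_data_load")
              .option("user", "USER")
              .option("password", "PASSWORD")
              # Split the table into parallel reads on a numeric column so no single
              # connection has to pull the whole table before timing out.
              .option("partitionColumn", "id")       # assumed numeric primary key
              .option("lowerBound", "1")
              .option("upperBound", "100000000")
              .option("numPartitions", "32")
              # Fetch rows in batches from MySQL instead of materializing everything at once.
              .option("fetchsize", "10000")
              .load()
              .select("field1", "field2", "field3", "date"))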

Create external table in Azure Databricks

痞子三分冷 Submitted on 2019-12-07 11:55:33
Question: I am new to Azure Databricks and trying to create an external table pointing to an Azure Data Lake Storage (ADLS) Gen-2 location. From a Databricks notebook I have tried to set the Spark configuration for ADLS access. Still, I am unable to execute the DDL I created. Note: one solution working for me is mounting the ADLS account to the cluster and then using the mount location in the external table's DDL. But I needed to check whether it is possible to create an external table DDL with an ADLS path without mount
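For reference, the direct-path approach generally means setting the service-principal OAuth credentials as Spark (or cluster) configuration and giving the table location as an abfss:// URI. A sketch for a Databricks notebook (where `spark` is predefined), with hypothetical storage account, container, and service-principal values; in practice the secret should come from a secret scope:

# Hypothetical values for illustration only.
storage_account = "mystorageaccount"
client_id = "APP_ID"
client_secret = "APP_SECRET"
tenant_id = "TENANT_ID"

# OAuth configuration for ADLS Gen2 access without mounting.
spark.conf.set(f"fs.azure.account.auth.type.{storage_account}.dfs.core.windows.net", "OAuth")
spark.conf.set(f"fs.azure.account.oauth.provider.type.{storage_account}.dfs.core.windows.net",
               "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set(f"fs.azure.account.oauth2.client.id.{storage_account}.dfs.core.windows.net", client_id)
spark.conf.set(f"fs.azure.account.oauth2.client.secret.{storage_account}.dfs.core.windows.net", client_secret)
spark.conf.set(f"fs.azure.account.oauth2.client.endpoint.{storage_account}.dfs.core.windows.net",
               f"https://login.microsoftonline.com/{tenant_id}/oauth2/token")

# External table DDL pointing straight at the abfss:// path (no mount); table and path are hypothetical.
spark.sql(f"""
  CREATE TABLE IF NOT EXISTS mydb.my_external_table (id INT, name STRING)
  USING PARQUET
  LOCATION 'abfss://mycontainer@{storage_account}.dfs.core.windows.net/path/to/data'
""")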