databricks

Why does “databricks-connect test” not work after configuring Databricks Connect?

て烟熏妆下的殇ゞ Submitted on 2019-12-11 03:11:47
Question: I want to run my Spark processes directly on my cluster from IntelliJ IDEA, so I'm following this documentation: https://docs.azuredatabricks.net/user-guide/dev-tools/db-connect.html After configuring everything, I run databricks-connect test but I don't get the Scala REPL that the documentation says should appear. Here is my cluster configuration. Answer 1: Your problem looks like it is one of the following: a) You specified the wrong port (it has to be 8787 on Azure) b) You didn't open up the port in you
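Once the port and cluster settings look right, a quick way to confirm connectivity outside of databricks-connect test is to run a trivial job from a local script; Databricks Connect routes the local SparkSession to the remote cluster. A minimal sketch in PySpark, assuming databricks-connect configure has already completed successfully:

from pyspark.sql import SparkSession

# With Databricks Connect installed, this local SparkSession is backed by the remote cluster.
spark = SparkSession.builder.getOrCreate()

# A trivial job: if this prints 45, the local client can reach the cluster and run jobs on it.
print(spark.range(10).selectExpr("sum(id) AS s").collect()[0]["s"])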

SparkSQL job fails when calling stddev over 1,000 columns

落花浮王杯 Submitted on 2019-12-11 00:56:25
Question: I am on Databricks with Spark 2.2.1 and Scala 2.11. I am attempting to run a SQL query that looks like the following. select stddev(col1), stddev(col2), ..., stddev(col1300) from mydb.mytable I then execute the code as follows. myRdd = sqlContext.sql(sql) However, I see the following exception thrown. Job aborted due to stage failure: Task 24 in stage 16.0 failed 4 times, most recent failure: Lost task 24.3 in stage 16.0 (TID 1946, 10.184.163.105, executor 3): org.codehaus.janino
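An org.codehaus.janino error on very wide aggregations usually means the generated code for a single stage exceeded the JVM's 64KB method limit. One common workaround (a sketch, not the accepted answer, which is cut off above) is to compute the aggregates in smaller batches of columns and then combine the one-row results; shown here in PySpark, with the table name and column naming taken from the question:

from functools import reduce
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.table("mydb.mytable")                 # table name from the question
cols = [c for c in df.columns if c.startswith("col")]

batch_size = 100                                 # small enough to keep generated code under the limit
results = []
for i in range(0, len(cols), batch_size):
    batch = cols[i:i + batch_size]
    # One narrow aggregation per batch instead of a single 1,300-column aggregation.
    results.append(df.agg(*[F.stddev(c).alias(f"stddev_{c}") for c in batch]))

# Each result is a single-row DataFrame; cross-join them back into one row.
combined = reduce(lambda a, b: a.crossJoin(b), results)
combined.show()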

Equality of two data frames

梦想与她 Submitted on 2019-12-10 11:55:56
Question: I have the following scenario: I have 2 dataframes, each containing only 1 column. Let's say DF1=(1,2,3,4,5) and DF2=(3,6,7,8,9,10). Basically those values are keys, and I am creating a parquet file of DF1 if the keys in DF1 are not in DF2 (in the current example it should return false). My current way of achieving my requirement is: val df1count= DF1.count val df2count=DF2.count val diffDF=DF2.except(DF1) val diffCount=diffDF.count if(diffCount==(df2count-df1count)) true else false The problem with this approach is I
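A cheaper way to test whether every key in DF1 also appears in DF2 is a single left-anti join followed by an emptiness check, which avoids the three separate count actions in the snippet above. A minimal sketch (in PySpark for consistency with the other examples on this page; the Scala Dataset API offers the same join and except methods), assuming both frames have a single column named "key":

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df1 = spark.createDataFrame([(v,) for v in (1, 2, 3, 4, 5)], ["key"])
df2 = spark.createDataFrame([(v,) for v in (3, 6, 7, 8, 9, 10)], ["key"])

# Rows of df1 whose key is absent from df2; if this is empty, every DF1 key is in DF2.
missing = df1.join(df2, on="key", how="left_anti")

# head(1) only needs to find a single row, unlike a full count().
all_keys_present = len(missing.head(1)) == 0
print(all_keys_present)   # False for the example data (keys 1, 2, 4, 5 are missing)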

Consume events from EventHub In Azure Databricks using pySpark

半世苍凉 Submitted on 2019-12-10 10:36:51
Question: I can see Spark connectors and guidelines for consuming events from Event Hub using Scala in Azure Databricks. But how can we consume events from Event Hub in Azure Databricks using pySpark? Any suggestions or documentation details would help. Thanks. Answer 1: Below is the snippet for reading events from Event Hub with pyspark on Azure Databricks. # With an entity path connection_string = "Endpoint=sb://SAMPLE;SharedAccessKeyName=KEY_NAME;SharedAccessKey=KEY;EntityPath=EVENTHUB_NAME" # Source with default
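For context, a fuller PySpark sketch of a Structured Streaming read through the azure-event-hubs-spark connector is shown below; the connection string is a placeholder, the connector library must be attached to the cluster, and newer connector versions additionally expect the connection string to be encrypted via the connector's EventHubsUtils before use:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder connection string; recent connector versions require encrypting it first,
# e.g. via org.apache.spark.eventhubs.EventHubsUtils.encrypt(...) on the JVM side.
connection_string = "Endpoint=sb://SAMPLE;SharedAccessKeyName=KEY_NAME;SharedAccessKey=KEY;EntityPath=EVENTHUB_NAME"
ehConf = {"eventhubs.connectionString": connection_string}

df = (spark.readStream
      .format("eventhubs")          # format name provided by the azure-event-hubs-spark connector
      .options(**ehConf)
      .load())

# The payload arrives in the binary `body` column; cast it to a string for inspection.
events = df.selectExpr("CAST(body AS STRING) AS body", "enqueuedTime")

query = (events.writeStream
         .format("console")
         .outputMode("append")
         .start())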

Saving empty DataFrame with known schema (Spark 2.2.1)

不羁岁月 Submitted on 2019-12-10 07:24:37
Question: Is it possible to save an empty DataFrame with a known schema such that the schema is written to the file, even though it has 0 records? def example(spark: SparkSession, path: String, schema: StructType) = { val dataframe = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema) val dataframeWriter = dataframe.write.mode(SaveMode.Overwrite).format("parquet") dataframeWriter.save(path) spark.read.load(path) // ERROR!! No files to read, so schema unknown } Answer 1: This is the answer I
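One workaround (a sketch, not necessarily the answer the original poster accepted, which is cut off above) is to keep the schema on the read side: since Parquet cannot infer a schema from zero data files, supply it explicitly with .schema(...) when loading. Shown in PySpark with a hypothetical schema and path:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.getOrCreate()

# Hypothetical schema and path for illustration.
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
])
path = "/tmp/empty_example"

# Write an empty DataFrame; depending on the Spark version this may produce
# no Parquet data files at all, only a _SUCCESS marker.
spark.createDataFrame([], schema).write.mode("overwrite").parquet(path)

# Reading back with an explicit schema avoids the "unable to infer schema" error.
df = spark.read.schema(schema).parquet(path)
df.printSchema()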

Exporting spark dataframe to .csv with header and specific filename

寵の児 Submitted on 2019-12-08 19:35:47
Question: I am trying to export data from a Spark dataframe to a .csv file: df.coalesce(1)\ .write\ .format("com.databricks.spark.csv")\ .option("header", "true")\ .save(output_path) It creates a file named "part-r-00001-512872f2-9b51-46c5-b0ee-31d626063571.csv". I want the filename to be "part-r-00000.csv" or "part-00000.csv". As the file is being created on AWS S3, I am limited in how I can use os.system commands. How can I set the file name while keeping the header in the file? Thanks! Answer 1: Well,
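Spark itself always names its output files part-*; a common pattern is to write to a staging directory and then move the single part file to the desired name. A minimal sketch assuming a Databricks notebook (where dbutils is available), with hypothetical S3 paths and `df` being the DataFrame from the question:

# Assumes a Databricks notebook: `spark` and `dbutils` are predefined there.
tmp_path = "s3://my-bucket/tmp_export"                  # hypothetical staging directory
final_path = "s3://my-bucket/export/part-00000.csv"     # desired final file name

(df.coalesce(1)
   .write
   .format("csv")                # built-in CSV writer in Spark 2.x+; header still included
   .option("header", "true")
   .mode("overwrite")
   .save(tmp_path))

# Find the single part file Spark produced and move it to the final name.
part_file = [f.path for f in dbutils.fs.ls(tmp_path) if f.name.startswith("part-")][0]
dbutils.fs.mv(part_file, final_path)
dbutils.fs.rm(tmp_path, True)    # clean up the staging directory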

Databricks cluster does not initialize Azure library with error: module 'lib' has no attribute 'SSL_ST_INIT'

佐手、 Submitted on 2019-12-08 05:06:50
Question: I am using an Azure Databricks notebook with the Azure library to get the list of files in Blob Storage. This task is scheduled; the cluster is terminated after the job finishes and started again for each new run. I am using the Azure 4.0.0 library (https://pypi.org/project/azure/). Sometimes I get the error message: AttributeError: module 'lib' has no attribute 'SSL_ST_INIT' and, very rarely, also: AttributeError: cffi library '_openssl' has no function, constant or global variable named 'CRYPTOGRAPHY_PACKAGE
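This particular AttributeError usually points to a mismatch between the pyOpenSSL and cryptography packages on the cluster rather than to the azure package itself. One commonly suggested mitigation, sketched below with assumed version pins (match them to your Databricks runtime), is to install known-compatible versions at the start of the job, before anything imports the Azure SDK:

# Cell 1: pin compatible crypto packages before any `import azure...` runs.
# dbutils.library.installPyPI is available on older Databricks runtimes;
# on newer runtimes the %pip magic serves the same purpose.
dbutils.library.installPyPI("pyOpenSSL", version="19.1.0")    # assumed version pin
dbutils.library.installPyPI("cryptography", version="2.8")    # assumed version pin
dbutils.library.restartPython()   # restart so the pinned packages take effect

# Cell 2 (after the restart): the Azure SDK import should no longer hit SSL_ST_INIT.
from azure.storage.blob import BlockBlobService   # azure==4.0.0 style import
block_blob_service = BlockBlobService(account_name="ACCOUNT", account_key="KEY")
print([b.name for b in block_blob_service.list_blobs("my-container")])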

Scala: How can I split up a dataframe by row number?

别说谁变了你拦得住时间么 Submitted on 2019-12-08 03:26:26
Question: I want to split up a dataframe of 2.7 million rows into small dataframes of 100,000 rows, so I end up with about 27 dataframes, which I also want to store as csv files. I have already taken a look at partitionBy and groupBy, but I don't need to worry about any conditions, except that the rows have to be ordered by date. I am trying to write my own code to make this work, but if you know of some Scala (Spark) functions I could use, that would be great! Thank you all for the suggestions! Answer 1: You
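One way to do this (a sketch in PySpark for consistency with the other examples on this page; the Window/row_number API is identical in the Scala DataFrame API) is to assign a global row number ordered by date, derive a chunk id from it, and let the writer partition the output by that id:

from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.table("my_big_table")      # hypothetical source with a `date` column

chunk_size = 100000
# A window with no partition column gives a global ordering; note that this funnels
# all rows through a single task, which is tolerable here only because the result
# is being split into modest CSV files anyway.
w = Window.orderBy("date")

chunked = (df
           .withColumn("row_num", F.row_number().over(w))
           .withColumn("chunk", ((F.col("row_num") - 1) / chunk_size).cast("int")))

# One output directory per chunk, each holding roughly 100,000 rows as CSV.
(chunked.drop("row_num")
        .write
        .partitionBy("chunk")
        .option("header", "true")
        .mode("overwrite")
        .csv("/tmp/chunked_output"))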

Spark Will Not Load Large MySql Table: Java Communications link failure - Timing Out

*爱你&永不变心* Submitted on 2019-12-07 15:41:28
I'm trying to get a pretty large table from MySQL so I can manipulate it using Spark/Databricks. I can't get it to load into Spark - I have tried taking smaller subsets, but even at the smallest reasonable unit, it still fails to load. I have tried playing with wait_timeout and interactive_timeout in MySQL, but it doesn't seem to make any difference. I am also loading a smaller (different) table, and that loads just fine. df_dataset = get_jdbc('raw_data_load', predicates=predicates).select('field1','field2', 'field3','date') df_dataset = df_dataset.repartition('date') df_dataset
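When a single large JDBC read times out, the usual levers are to split the read across several connections (partitionColumn/lowerBound/upperBound/numPartitions, or explicit predicates as in the question) and to stream rows with a reasonable fetchsize instead of buffering the whole result. A sketch with hypothetical connection details (the get_jdbc helper from the question is not shown, so this uses the plain DataFrameReader API, and the partition column and bounds are assumptions):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

jdbc_url = "jdbc:mysql://HOST:3306/mydb"    # hypothetical connection details

df_dataset = (spark.read.format("jdbc")
              .option("url", jdbc_url)
              .option("dbtable", "raw_data_load")
              .option("user", "USER")
              .option("password", "PASSWORD")
              # Split the table into parallel reads on a numeric column so no single
              # connection has to pull the whole table before timing out.
              .option("partitionColumn", "id")       # assumed numeric primary key
              .option("lowerBound", "1")
              .option("upperBound", "100000000")
              .option("numPartitions", "32")
              # Fetch rows in batches from MySQL instead of materializing everything at once.
              .option("fetchsize", "10000")
              .load()
              .select("field1", "field2", "field3", "date"))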

Create external table in Azure Databricks

痞子三分冷 Submitted on 2019-12-07 11:55:33
Question: I am new to Azure Databricks and trying to create an external table pointing to an Azure Data Lake Storage (ADLS) Gen-2 location. From a Databricks notebook I have tried to set the Spark configuration for ADLS access. Still, I am unable to execute the DDL I created. Note: one solution working for me is mounting the ADLS account to the cluster and then using the mount location in the external table's DDL. But I needed to check whether it is possible to create an external table DDL with an ADLS path without mount
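For reference, the direct-path approach generally means setting the service-principal OAuth credentials as Spark (or cluster) configuration and giving the table location as an abfss:// URI. A sketch for a Databricks notebook (where `spark` is predefined), with hypothetical storage account, container, and service-principal values; in practice the secret should come from a secret scope:

# Hypothetical values for illustration only.
storage_account = "mystorageaccount"
client_id = "APP_ID"
client_secret = "APP_SECRET"
tenant_id = "TENANT_ID"

# OAuth configuration for ADLS Gen2 access without mounting.
spark.conf.set(f"fs.azure.account.auth.type.{storage_account}.dfs.core.windows.net", "OAuth")
spark.conf.set(f"fs.azure.account.oauth.provider.type.{storage_account}.dfs.core.windows.net",
               "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set(f"fs.azure.account.oauth2.client.id.{storage_account}.dfs.core.windows.net", client_id)
spark.conf.set(f"fs.azure.account.oauth2.client.secret.{storage_account}.dfs.core.windows.net", client_secret)
spark.conf.set(f"fs.azure.account.oauth2.client.endpoint.{storage_account}.dfs.core.windows.net",
               f"https://login.microsoftonline.com/{tenant_id}/oauth2/token")

# External table DDL pointing straight at the abfss:// path (no mount); table and path are hypothetical.
spark.sql(f"""
  CREATE TABLE IF NOT EXISTS mydb.my_external_table (id INT, name STRING)
  USING PARQUET
  LOCATION 'abfss://mycontainer@{storage_account}.dfs.core.windows.net/path/to/data'
""")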