databricks

Getting an error while connecting from Databricks to Azure SQL DB with ActiveDirectoryPassword

我只是一个虾纸丫 submitted on 2020-01-02 10:01:07
Question: I am trying to connect to Azure SQL DB from Databricks with AAD password authentication. I imported the Azure SQL DB and adal4j libraries, but I still get the error below: java.lang.NoClassDefFoundError: com/nimbusds/oauth2/sdk/AuthorizationGrant. Stack trace: at com.microsoft.sqlserver.jdbc.SQLServerADAL4JUtils.getSqlFedAuthToken(SQLServerADAL4JUtils.java:24) at com.microsoft.sqlserver.jdbc.SQLServerConnection.getFedAuthToken(SQLServerConnection.java:3609) at com.microsoft.sqlserver.jdbc.SQLServerConnection
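This NoClassDefFoundError usually means the Nimbus OAuth2 SDK, a transitive dependency of adal4j, is not attached to the cluster. Below is a minimal, hypothetical PySpark sketch of the JDBC read once that jar is present; the server, database, table, user, and secret-scope names are placeholders, not values from the question.

```python
# Assumes mssql-jdbc, adal4j and its transitive dependency oauth2-oidc-sdk
# (which provides com.nimbusds.oauth2.sdk.AuthorizationGrant) are attached
# to the cluster as libraries.
jdbc_url = (
    "jdbc:sqlserver://<server>.database.windows.net:1433;"
    "database=<db>;"
    "authentication=ActiveDirectoryPassword;"
    "encrypt=true;hostNameInCertificate=*.database.windows.net"
)

df = (spark.read
      .format("jdbc")
      .option("url", jdbc_url)
      .option("dbtable", "dbo.my_table")                         # placeholder table
      .option("user", "someone@contoso.com")                     # AAD user principal name
      .option("password", dbutils.secrets.get("scope", "key"))   # pulled from a Databricks secret scope
      .load())
```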

Cannot create Dataframe in PySpark

夙愿已清 submitted on 2020-01-02 09:40:12
Question: I want to create a DataFrame in PySpark with the following code: from pyspark.sql import * from pyspark.sql.types import * temp = Row("DESC", "ID") temp1 = temp('Description1323', 123) print temp1 schema = StructType([StructField("DESC", StringType(), False), StructField("ID", IntegerType(), False)]) df = spark.createDataFrame(temp1, schema) But I am receiving the following error: TypeError: StructType can not accept object 'Description1323' in type <type 'str'>. What's wrong with my code? Answer 1:
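The usual cause is that createDataFrame expects a collection of rows, so a bare Row gets unpacked into its individual fields. A minimal sketch of the fix, assuming a standard SparkSession:

```python
from pyspark.sql import Row, SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.getOrCreate()

temp = Row("DESC", "ID")
temp1 = temp('Description1323', 123)

schema = StructType([StructField("DESC", StringType(), False),
                     StructField("ID", IntegerType(), False)])

# Wrap the single Row in a list: createDataFrame expects an iterable of rows,
# and a bare Row is otherwise treated field by field, hence the StructType error.
df = spark.createDataFrame([temp1], schema)
df.show()
```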

spark dropDuplicates based on json array field

白昼怎懂夜的黑 submitted on 2019-12-31 05:36:33
Question: I have JSON files of the following structure: {"names":[{"name":"John","lastName":"Doe"}, {"name":"John","lastName":"Marcus"}, {"name":"David","lastName":"Luis"} ]} I want to read several such JSON files and deduplicate them based on the "name" column inside names. I tried df.dropDuplicates(Array("names.name")) but it didn't do the magic. Answer 1: This seems to be a regression that was added in Spark 2.0. If you bring the nested column to the highest level you can drop the duplicates. If we create a
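Following the answer's "bring the nested column to the highest level" idea, here is a hedged PySpark sketch; the file path is a placeholder and the column names come from the question.

```python
from pyspark.sql import functions as F

df = spark.read.json("path/to/json/files")   # placeholder path

# Flatten the nested array first, then deduplicate on the extracted column.
flattened = (df
             .select(F.explode("names").alias("n"))
             .select("n.name", "n.lastName"))

deduped = flattened.dropDuplicates(["name"])
deduped.show()
```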

Spark Redshift with Python

我的梦境 submitted on 2019-12-30 07:30:49
Question: I'm trying to connect Spark with Amazon Redshift but I'm getting this error. My code is as follows: from pyspark.sql import SQLContext from pyspark import SparkContext sc = SparkContext(appName="Connect Spark with Redshift") sql_context = SQLContext(sc) sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", <ACCESSID>) sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", <ACCESSKEY>) df = sql_context.read \ .option("url", "jdbc:redshift://example.coyf2i236wts.eu-central-1
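For reference, a hedged sketch of a spark-redshift read, assuming the spark-redshift package and the Redshift JDBC driver are on the classpath; the host, database, table, and S3 tempdir bucket are placeholders.

```python
df = (sql_context.read
      .format("com.databricks.spark.redshift")
      .option("url", "jdbc:redshift://<host>:5439/<db>?user=<user>&password=<password>")
      .option("dbtable", "my_table")                   # placeholder table name
      .option("tempdir", "s3n://my-temp-bucket/tmp")   # S3 staging area the connector requires
      .load())
```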

Operation results in exceeding quota limits of Core. Maximum allowed: 4, Current in use: 4, Additional requested: 4. While in 14 day free trial

我怕爱的太早我们不能终老 submitted on 2019-12-29 02:12:40
Question: I'm using the 14-day Premium free trial. I'm trying to create and run a cluster in Databricks (I'm following the quick start guide). However, I'm getting the following error: "Operation results in exceeding quota limits of Core. Maximum allowed: 4, Current in use: 4, Additional requested: 4." I can't bump up the limit because I am in the free trial. I'm trying to run only 1 worker on the weakest worker type. I've already tried deleting all my subscriptions and made sure that there are no other
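One workaround often suggested for the 4-vCPU trial quota is a single-node cluster, so only the driver consumes cores. Below is a hypothetical cluster spec expressed as a Python dict; the node type and runtime version are placeholders and may differ in your workspace.

```python
single_node_cluster = {
    "cluster_name": "trial-single-node",
    "spark_version": "5.5.x-scala2.11",       # placeholder runtime version
    "node_type_id": "Standard_DS3_v2",        # 4 vCPUs, fits within the 4-core quota
    "num_workers": 0,                         # no workers, driver only
    "spark_conf": {
        "spark.databricks.cluster.profile": "singleNode",
        "spark.master": "local[*]",
    },
    "custom_tags": {"ResourceClass": "SingleNode"},
}
```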

Get the highest price with the smaller ID when two IDs have the same highest price in Scala

三世轮回 submitted on 2019-12-25 19:00:02
Question: I have a DataFrame called productPrice with columns ID and Price. I want to get the ID with the highest price; if two IDs share the same highest price, I want only the one with the smaller ID number. I used val highestprice = productPrice.orderBy(asc("ID")).orderBy(desc("price")).limit(1) but the result I got is not the one with the smaller ID; instead I got the one with the larger ID. I don't know what's wrong with my logic, any idea? Answer 1: Try this. scala> val df = Seq((4,
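The root cause is that a second orderBy replaces the first rather than acting as a secondary sort key, so both keys must go into a single orderBy call. The same logic expressed in PySpark as a sketch (the sample data is made up):

```python
from pyspark.sql import functions as F

productPrice = spark.createDataFrame([(4, 30.0), (7, 50.0), (2, 50.0)], ["ID", "price"])

# Sort by price descending and ID ascending in one orderBy, then take a single row.
highest = productPrice.orderBy(F.desc("price"), F.asc("ID")).limit(1)
highest.show()   # returns ID 2, the smaller of the two IDs tied at 50.0
```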

if/else in Spark: passing a condition to find values from a csv file

跟風遠走 submitted on 2019-12-25 17:19:23
Question: I want to read a csv file into dfTRUEcsv. How do I get the values (03, 05) and 11 as strings in the example below? I want to pass those strings as parameters to fetch files from the corresponding folders: if isreload is TRUE, I will pass (03, 05) and 11 as parameters and, for each loop iteration, read Folder\03, Folder\05, Folder\11.
+-------------+--------------+--------------------+-----------------+--------+
|Calendar_year|Calendar_month|EDAP_Data_Load_Statu|lake_refined_date|isreload|
+-------------+--------------+--------------------+---------
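A hedged sketch of one way to do this, assuming the column names shown in the truncated table; the csv path and folder layout are placeholders.

```python
# Read the status file, keep only rows flagged for reload, and collect the
# month values to drive the per-folder loads.
dfTRUEcsv = (spark.read
             .option("header", "true")
             .csv("path/to/status.csv")            # placeholder path
             .filter("isreload = 'TRUE'"))

months = [r["Calendar_month"] for r in dfTRUEcsv.select("Calendar_month").collect()]

for m in months:
    folder = "Folder\\{}".format(str(m).zfill(2))  # e.g. Folder\03, Folder\05, Folder\11
    folder_df = spark.read.csv(folder)             # load whatever sits in that folder
```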

Spark csv data validation failed for date and timestamp data types of Hive

狂风中的少年 submitted on 2019-12-25 07:47:01
Question: Hive table schema:
c_date date
c_timestamp timestamp
It's a text table. Hive table data:
hive> select * from all_datetime_types;
OK
0001-01-01    0001-01-01 00:00:00.000000001
9999-12-31    9999-12-31 23:59:59.999999999
csv obtained after the Spark job:
c_date,c_timestamp
0001-01-01 00:00:00.0,0001-01-01 00:00:00.0
9999-12-31 00:00:00.0,9999-12-31 23:59:59.999
Issues: 00:00:00.0 is appended to the date type, and the timestamp is truncated to millisecond precision. Useful code: SparkConf conf = new SparkConf(true)
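A hedged sketch of how the csv output can be kept under control, assuming a DataFrame df that already holds the two columns; note that Spark's TimestampType stores at most microseconds, so the nanosecond digits from Hive cannot survive the round trip unless the column is read as a string in the first place.

```python
from pyspark.sql import functions as F

# Format the date explicitly so no time component is appended, and cast the
# timestamp to string so the writer does not re-render it; sub-microsecond
# precision is still lost unless the Hive column is kept as a string.
out = (df
       .withColumn("c_date", F.date_format("c_date", "yyyy-MM-dd"))
       .withColumn("c_timestamp", F.col("c_timestamp").cast("string")))

out.write.option("header", "true").csv("path/to/output")   # placeholder path
```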

Batch write from Databricks to Kafka does not observe checkpoints and writes duplicates

纵饮孤独 submitted on 2019-12-25 03:12:47
Question: Follow-up from my previous question: I'm writing a large dataframe in a batch from Databricks to Kafka. This generally works fine now. However, sometimes there are errors (mostly timeouts). Retrying kicks in and processing starts over again, but this does not seem to observe the checkpoint, which results in duplicates being written to the Kafka sink. So should checkpoints work in batch-writing mode at all? Or am I missing something? Config: EH_SASL = 'kafkashaded.org.apache.kafka
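Checkpoint locations are honored by Structured Streaming queries, not by a plain batch df.write, so a retried batch rewrites everything. A hedged sketch of a common workaround, driving the "batch" through a streaming query with a one-time trigger; the source format, paths, broker, and topic are placeholders, and the SASL settings from the question would be supplied as additional kafka.* options. Kafka still gives at-least-once delivery, so consumers should be prepared to deduplicate.

```python
query = (spark.readStream
         .format("delta")                                      # assuming the source is a Delta table
         .load("/path/to/source")
         .selectExpr("to_json(struct(*)) AS value")            # Kafka sink needs a 'value' column
         .writeStream
         .format("kafka")
         .option("kafka.bootstrap.servers", "broker:9093")
         .option("topic", "my-topic")
         .option("checkpointLocation", "/path/to/checkpoint")  # progress tracked here across retries
         .trigger(once=True)                                   # runs once, like a batch, but with checkpointing
         .start())
query.awaitTermination()
```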