databricks

Getting an error while connecting from Databricks to Azure SQL DB with ActiveDirectoryPassword

我只是一个虾纸丫 submitted on 2020-01-02 10:01:07
Question: I am trying to connect to Azure SQL DB from Databricks with AAD password authentication. I imported the Azure SQL DB and adal4j libraries, but I still get the error below: java.lang.NoClassDefFoundError: com/nimbusds/oauth2/sdk/AuthorizationGrant. Stack trace: at com.microsoft.sqlserver.jdbc.SQLServerADAL4JUtils.getSqlFedAuthToken(SQLServerADAL4JUtils.java:24) at com.microsoft.sqlserver.jdbc.SQLServerConnection.getFedAuthToken(SQLServerConnection.java:3609) at com.microsoft.sqlserver.jdbc.SQLServerConnection
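This NoClassDefFoundError usually means the Nimbus OAuth2 SDK, a transitive dependency of adal4j, is not attached to the cluster. Below is a minimal, hypothetical PySpark sketch of the JDBC read once that jar is present; the server, database, table, user, and secret-scope names are placeholders, not values from the question.

```python
# Assumes mssql-jdbc, adal4j and its transitive dependency oauth2-oidc-sdk
# (which provides com.nimbusds.oauth2.sdk.AuthorizationGrant) are attached
# to the cluster as libraries.
jdbc_url = (
    "jdbc:sqlserver://<server>.database.windows.net:1433;"
    "database=<db>;"
    "authentication=ActiveDirectoryPassword;"
    "encrypt=true;hostNameInCertificate=*.database.windows.net"
)

df = (spark.read
      .format("jdbc")
      .option("url", jdbc_url)
      .option("dbtable", "dbo.my_table")                         # placeholder table
      .option("user", "someone@contoso.com")                     # AAD user principal name
      .option("password", dbutils.secrets.get("scope", "key"))   # pulled from a Databricks secret scope
      .load())
```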

Cannot create Dataframe in PySpark

夙愿已清 submitted on 2020-01-02 09:40:12
Question: I want to create a DataFrame in PySpark with the following code: from pyspark.sql import * from pyspark.sql.types import * temp = Row("DESC", "ID") temp1 = temp('Description1323', 123) print temp1 schema = StructType([StructField("DESC", StringType(), False), StructField("ID", IntegerType(), False)]) df = spark.createDataFrame(temp1, schema) But I am receiving the following error: TypeError: StructType can not accept object 'Description1323' in type <type 'str'>. What's wrong with my code? Answer 1:
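The usual cause is that createDataFrame expects a collection of rows, so a bare Row gets unpacked into its individual fields. A minimal sketch of the fix, assuming a standard SparkSession:

```python
from pyspark.sql import Row, SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.getOrCreate()

temp = Row("DESC", "ID")
temp1 = temp('Description1323', 123)

schema = StructType([StructField("DESC", StringType(), False),
                     StructField("ID", IntegerType(), False)])

# Wrap the single Row in a list: createDataFrame expects an iterable of rows,
# and a bare Row is otherwise treated field by field, hence the StructType error.
df = spark.createDataFrame([temp1], schema)
df.show()
```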

spark dropDuplicates based on json array field

白昼怎懂夜的黑 submitted on 2019-12-31 05:36:33
Question: I have JSON files of the following structure: {"names":[{"name":"John","lastName":"Doe"}, {"name":"John","lastName":"Marcus"}, {"name":"David","lastName":"Luis"} ]} I want to read several such JSON files and deduplicate them based on the "name" column inside names. I tried df.dropDuplicates(Array("names.name")) but it didn't do the magic. Answer 1: This seems to be a regression that was added in Spark 2.0. If you bring the nested column to the highest level you can drop the duplicates. If we create a
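Following the answer's "bring the nested column to the highest level" idea, here is a hedged PySpark sketch; the file path is a placeholder and the column names come from the question.

```python
from pyspark.sql import functions as F

df = spark.read.json("path/to/json/files")   # placeholder path

# Flatten the nested array first, then deduplicate on the extracted column.
flattened = (df
             .select(F.explode("names").alias("n"))
             .select("n.name", "n.lastName"))

deduped = flattened.dropDuplicates(["name"])
deduped.show()
```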

Spark Redshift with Python

我的梦境 submitted on 2019-12-30 07:30:49
Question: I'm trying to connect Spark with Amazon Redshift but I'm getting this error. My code is as follows: from pyspark.sql import SQLContext from pyspark import SparkContext sc = SparkContext(appName="Connect Spark with Redshift") sql_context = SQLContext(sc) sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", <ACCESSID>) sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", <ACCESSKEY>) df = sql_context.read \ .option("url", "jdbc:redshift://example.coyf2i236wts.eu-central-1
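For reference, a hedged sketch of a spark-redshift read, assuming the spark-redshift package and the Redshift JDBC driver are on the classpath; the host, database, table, and S3 tempdir bucket are placeholders.

```python
df = (sql_context.read
      .format("com.databricks.spark.redshift")
      .option("url", "jdbc:redshift://<host>:5439/<db>?user=<user>&password=<password>")
      .option("dbtable", "my_table")                   # placeholder table name
      .option("tempdir", "s3n://my-temp-bucket/tmp")   # S3 staging area the connector requires
      .load())
```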

Operation results in exceeding quota limits of Core. Maximum allowed: 4, Current in use: 4, Additional requested: 4. While in 14 day free trial

我怕爱的太早我们不能终老 submitted on 2019-12-29 02:12:40
Question: I'm using the 14-day Premium free trial. I'm trying to create and run a cluster in Databricks (I'm following the quick start guide). However, I'm getting the following error: "Operation results in exceeding quota limits of Core. Maximum allowed: 4, Current in use: 4, Additional requested: 4." I can't bump up the limit because I am in the free trial. I'm trying to run only 1 worker on the weakest worker type. I've already tried deleting all my subscriptions and made sure that there are no other
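One workaround often suggested for the 4-vCPU trial quota is a single-node cluster, so only the driver consumes cores. Below is a hypothetical cluster spec expressed as a Python dict; the node type and runtime version are placeholders and may differ in your workspace.

```python
single_node_cluster = {
    "cluster_name": "trial-single-node",
    "spark_version": "5.5.x-scala2.11",       # placeholder runtime version
    "node_type_id": "Standard_DS3_v2",        # 4 vCPUs, fits within the 4-core quota
    "num_workers": 0,                         # no workers, driver only
    "spark_conf": {
        "spark.databricks.cluster.profile": "singleNode",
        "spark.master": "local[*]",
    },
    "custom_tags": {"ResourceClass": "SingleNode"},
}
```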

Get the highest price with the smaller ID when two IDs have the same highest price in Scala

三世轮回 submitted on 2019-12-25 19:00:02
Question: I have a DataFrame called productPrice with columns ID and Price. I want to get the ID with the highest price; if two IDs share the same highest price, I want only the one with the smaller ID number. I used val highestprice = productPrice.orderBy(asc("ID")).orderBy(desc("price")).limit(1) but the result I got is not the one with the smaller ID; instead I got the one with the larger ID. I don't know what's wrong with my logic, any idea? Answer 1: Try this. scala> val df = Seq((4,
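The root cause is that a second orderBy replaces the first rather than acting as a secondary sort key, so both keys must go into a single orderBy call. The same logic expressed in PySpark as a sketch (the sample data is made up):

```python
from pyspark.sql import functions as F

productPrice = spark.createDataFrame([(4, 30.0), (7, 50.0), (2, 50.0)], ["ID", "price"])

# Sort by price descending and ID ascending in one orderBy, then take a single row.
highest = productPrice.orderBy(F.desc("price"), F.asc("ID")).limit(1)
highest.show()   # returns ID 2, the smaller of the two IDs tied at 50.0
```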

if/else in Spark: passing a condition to find values from a csv file

跟風遠走 submitted on 2019-12-25 17:19:23
Question: I want to read a csv file into dfTRUEcsv. How do I get the values (03, 05) and 11 as strings in the example below? I want to pass those strings as parameters to fetch files from the corresponding folders: if isreload is TRUE, I will pass (03, 05) and 11 as parameters and, for each loop iteration, read Folder\03, Folder\05, Folder\11.
+-------------+--------------+--------------------+-----------------+--------+
|Calendar_year|Calendar_month|EDAP_Data_Load_Statu|lake_refined_date|isreload|
+-------------+--------------+--------------------+---------
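A hedged sketch of one way to do this, assuming the column names shown in the truncated table; the csv path and folder layout are placeholders.

```python
# Read the status file, keep only rows flagged for reload, and collect the
# month values to drive the per-folder loads.
dfTRUEcsv = (spark.read
             .option("header", "true")
             .csv("path/to/status.csv")            # placeholder path
             .filter("isreload = 'TRUE'"))

months = [r["Calendar_month"] for r in dfTRUEcsv.select("Calendar_month").collect()]

for m in months:
    folder = "Folder\\{}".format(str(m).zfill(2))  # e.g. Folder\03, Folder\05, Folder\11
    folder_df = spark.read.csv(folder)             # load whatever sits in that folder
```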

Spark csv data validation failed for date and timestamp data types of Hive

狂风中的少年 submitted on 2019-12-25 07:47:01
Question: Hive table schema:
c_date date
c_timestamp timestamp
It's a text table. Hive table data:
hive> select * from all_datetime_types;
OK
0001-01-01    0001-01-01 00:00:00.000000001
9999-12-31    9999-12-31 23:59:59.999999999
csv obtained after the Spark job:
c_date,c_timestamp
0001-01-01 00:00:00.0,0001-01-01 00:00:00.0
9999-12-31 00:00:00.0,9999-12-31 23:59:59.999
Issues: 00:00:00.0 is appended to the date type, and the timestamp is truncated to millisecond precision. Useful code: SparkConf conf = new SparkConf(true)
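A hedged sketch of how the csv output can be kept under control, assuming a DataFrame df that already holds the two columns; note that Spark's TimestampType stores at most microseconds, so the nanosecond digits from Hive cannot survive the round trip unless the column is read as a string in the first place.

```python
from pyspark.sql import functions as F

# Format the date explicitly so no time component is appended, and cast the
# timestamp to string so the writer does not re-render it; sub-microsecond
# precision is still lost unless the Hive column is kept as a string.
out = (df
       .withColumn("c_date", F.date_format("c_date", "yyyy-MM-dd"))
       .withColumn("c_timestamp", F.col("c_timestamp").cast("string")))

out.write.option("header", "true").csv("path/to/output")   # placeholder path
```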

Batch write from Databricks to Kafka does not observe checkpoints and writes duplicates

纵饮孤独 submitted on 2019-12-25 03:12:47
Question: Follow-up from my previous question: I'm writing a large dataframe in a batch from Databricks to Kafka. This generally works fine now. However, sometimes there are errors (mostly timeouts). Retrying kicks in and processing starts over again, but this does not seem to observe the checkpoint, which results in duplicates being written to the Kafka sink. So should checkpoints work in batch-writing mode at all? Or am I missing something? Config: EH_SASL = 'kafkashaded.org.apache.kafka
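Checkpoint locations are honored by Structured Streaming queries, not by a plain batch df.write, so a retried batch rewrites everything. A hedged sketch of a common workaround, driving the "batch" through a streaming query with a one-time trigger; the source format, paths, broker, and topic are placeholders, and the SASL settings from the question would be supplied as additional kafka.* options. Kafka still gives at-least-once delivery, so consumers should be prepared to deduplicate.

```python
query = (spark.readStream
         .format("delta")                                      # assuming the source is a Delta table
         .load("/path/to/source")
         .selectExpr("to_json(struct(*)) AS value")            # Kafka sink needs a 'value' column
         .writeStream
         .format("kafka")
         .option("kafka.bootstrap.servers", "broker:9093")
         .option("topic", "my-topic")
         .option("checkpointLocation", "/path/to/checkpoint")  # progress tracked here across retries
         .trigger(once=True)                                   # runs once, like a batch, but with checkpointing
         .start())
query.awaitTermination()
```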