databricks

How to use the dbutils command in a pyspark job other than a notebook

Posted by ∥☆過路亽.° on 2020-01-24 00:26:41
Question: I want to use the dbutils command to access secrets in my PySpark job, which is submitted through spark-submit inside Jobs on Databricks. When I use a dbutils command it fails with the error "dbutils not defined". Is there a way to use dbutils in a PySpark job other than a notebook? I tried the following solutions: 1) import DBUtils, according to this solution, but this is not the Databricks dbutils. 2) from pyspark.dbutils import DBUtils, according to this solution, but this also didn't work.
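One workaround that is commonly posted for this situation is sketched below; it assumes the job runs on a Databricks cluster where the pyspark.dbutils module exists, and the helper name get_dbutils plus the secret scope/key names are illustrative only:

from pyspark.sql import SparkSession

def get_dbutils(spark):
    # In a job (spark-submit) context on Databricks, DBUtils can be built from the SparkSession.
    try:
        from pyspark.dbutils import DBUtils
        return DBUtils(spark)
    except ImportError:
        # In a notebook, dbutils is already injected into the IPython user namespace.
        import IPython
        return IPython.get_ipython().user_ns["dbutils"]

spark = SparkSession.builder.getOrCreate()
dbutils = get_dbutils(spark)

# Hypothetical scope and key names, shown only to illustrate the secrets call.
secret_value = dbutils.secrets.get(scope="my-scope", key="my-key")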

pyspark replace multiple values with null in dataframe

Posted by 隐身守侯 on 2020-01-15 06:43:09
Question: I have a dataframe (df), and within the dataframe I have a column user_id.

df = sc.parallelize([(1, "not_set"), (2, "user_001"), (3, "user_002"), (4, "n/a"), (5, "N/A"), (6, "userid_not_set"), (7, "user_003"), (8, "user_004")]).toDF(["key", "user_id"])

df:
+---+--------------+
|key|       user_id|
+---+--------------+
|  1|       not_set|
|  2|      user_003|
|  3|      user_004|
|  4|           n/a|
|  5|           N/A|
|  6|userid_not_set|
|  7|      user_003|
|  8|      user_004|
+---+--------------+

I would like to replace the following values: not
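A minimal PySpark sketch of one way to do this, nulling out the placeholder strings with when/otherwise; the list of values to replace is inferred from the sample data above and is illustrative only:

from pyspark.sql import functions as F

# Placeholder strings to turn into nulls (illustrative list, based on the sample data).
bad_values = ["not_set", "n/a", "N/A", "userid_not_set"]

df_clean = df.withColumn(
    "user_id",
    F.when(F.col("user_id").isin(bad_values), F.lit(None)).otherwise(F.col("user_id"))
)
df_clean.show()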

Databricks Spark: java.lang.OutOfMemoryError: GC overhead limit exceeded i

Posted by 我是研究僧i on 2020-01-13 15:01:07
Question: I am executing a Spark job on a Databricks cluster. I trigger the job via an Azure Data Factory pipeline that runs it at a 15-minute interval, and after three or four successful executions it fails with the exception "java.lang.OutOfMemoryError: GC overhead limit exceeded". There are many answers to this question, but in most of those cases the jobs do not run at all, whereas in my case the job fails after a successful execution of
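The snippet contains no code, but one mitigation that is often suggested when repeated runs on the same cluster degrade into GC overhead errors is to release cached data explicitly at the end of each run; a minimal sketch under that assumption:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Drop all cached tables/DataFrames so state does not accumulate across the 15-minute runs.
spark.catalog.clearCache()

# If specific DataFrames were cached, release them as soon as they are no longer needed:
# df.unpersist()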

Reading csv data into SparkR after writing it out from a DataFrame

Posted by 独自空忆成欢 on 2020-01-07 06:38:14
Question: I followed the example in this post to write out a DataFrame as a CSV to an AWS S3 bucket. The result was not a single file but rather a folder containing many .csv files. I'm now having trouble reading this folder back in as a DataFrame in SparkR. Below is what I've tried, but it does not result in the same DataFrame that I wrote out.

write.df(df, 's3a://bucket/df', source="csv") # Creates a folder named df in the S3 bucket
df_in1 <- read.df("s3a://bucket/df", source="csv")
df_in2 <- read.df("s3a://bucket/df
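The snippet above is SparkR; the usual cause is that write.df emits a folder of part files without a header, so reading the folder back without a schema loses the column names and types. A sketch of the same round trip in PySpark (paths are placeholders), showing the header/schema options that make the read match the write:

# Writing as CSV still produces a folder of part files, but with a header row in each part.
df.write.csv("s3a://bucket/df", header=True, mode="overwrite")

# Read the whole folder back; header=True restores the column names and
# inferSchema=True (or an explicit schema) restores the column types.
df_in = spark.read.csv("s3a://bucket/df", header=True, inferSchema=True)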

Avro file error while loading decimal field into Redshift table using Databricks

Posted by 自作多情 on 2020-01-06 07:02:10
Question: I have a dataframe in Databricks with a bunch of columns, including a decimal(15,2) field. If I exclude the decimal field I am able to insert this data into the Redshift table, but when the decimal field is included I get the following error: "Cannot init avro reader from s3 file Cannot parse file header: Cannot save fixed schema". Any thoughts? Answer 1: Try using just decimal without a precision and scale, or cast the existing column to decimal. Also try a different tempformat; from my experience CSV
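A sketch of the two suggestions from the answer in PySpark, assuming the Databricks spark-redshift connector; the column, table, URL and credentials are placeholders:

from pyspark.sql import functions as F

# Per the answer: use plain decimal (no precision/scale) or cast the column; double is another option.
df_out = df.withColumn("amount", F.col("amount").cast("decimal"))

(df_out.write
    .format("com.databricks.spark.redshift")
    .option("url", "jdbc:redshift://host:5439/db?user=...&password=...")
    .option("dbtable", "my_table")
    .option("tempdir", "s3a://bucket/tmp/")
    .option("tempformat", "CSV")  # avoid the Avro staging path that rejects the fixed decimal schema
    .option("forward_spark_s3_credentials", "true")
    .mode("append")
    .save())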

Cast multiple columns in a DataFrame

Posted by 孤者浪人 on 2020-01-06 06:11:44
Question: I'm on Databricks and I'm working on a classification problem. I have a DataFrame with 2000+ columns, and I want to cast all the columns that will become features to double.

val array45 = data.columns.drop(1)
for (element <- array45) {
  data.withColumn(element, data(element).cast("double"))
}
data.printSchema()

The cast to double works, but I'm not saving it in the DataFrame called data. If I create a new DataFrame inside the loop, my DataFrame won't exist outside the for loop. I do not want
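withColumn returns a new DataFrame, so the result has to be reassigned rather than discarded inside the loop. The question is Scala, but the same idea in a minimal PySpark sketch (the name data is taken from the question) is to build all the casts in a single select:

from pyspark.sql import functions as F

# Keep the first column as-is and cast every remaining column to double in one select,
# then reassign the result so the change is actually kept.
first_col = data.columns[0]
feature_cols = data.columns[1:]

data = data.select(
    F.col(first_col),
    *[F.col(c).cast("double").alias(c) for c in feature_cols]
)
data.printSchema()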

AnalysisException is thrown when the DataFrame is empty (No such struct field)

Posted by [亡魂溺海] on 2020-01-04 09:48:08
Question: I have a dataframe on which I apply a filter and then a series of transformations. At the end, I select several columns.

// Filters the events related to a user_principal.
var filteredCount = events.filter("Properties.EventTypeName == 'user_principal_created' or Properties.EventTypeName == 'user_principal_updated'")
  // Selects the columns based on the event type.
  .withColumn("Username", when(col("Properties.EventTypeName") === lit("user_principal_created"), col("Body.Username"))
  .otherwise
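The snippet is Scala and cut off, but the usual cause of "No such struct field" is that the nested field is absent from the schema when the source is empty. A hedged PySpark sketch of one defensive pattern, checking the struct's fields before referencing them (events and the field names come from the question; has_field is an illustrative helper):

from pyspark.sql import functions as F
from pyspark.sql.types import StructType

def has_field(df, struct_col, field_name):
    # True if struct_col exists in df's schema and contains field_name.
    for f in df.schema.fields:
        if f.name == struct_col and isinstance(f.dataType, StructType):
            return field_name in [sub.name for sub in f.dataType.fields]
    return False

filtered = events.filter(
    F.col("Properties.EventTypeName").isin("user_principal_created", "user_principal_updated")
)

if has_field(filtered, "Body", "Username"):
    filtered = filtered.withColumn("Username", F.col("Body.Username"))
else:
    # Fall back to a null column so the later select does not throw.
    filtered = filtered.withColumn("Username", F.lit(None).cast("string"))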

Getting an error while connecting Databricks to Azure SQL DB with ActiveDirectoryPassword

Posted by 你。 on 2020-01-02 10:02:34
Question: I am trying to connect to an Azure SQL DB from Databricks with AAD Password authentication. I imported the Azure SQL DB and adal4j libraries, but I am still getting the error below.

java.lang.NoClassDefFoundError: com/nimbusds/oauth2/sdk/AuthorizationGrant

Stack trace:
at com.microsoft.sqlserver.jdbc.SQLServerADAL4JUtils.getSqlFedAuthToken(SQLServerADAL4JUtils.java:24)
at com.microsoft.sqlserver.jdbc.SQLServerConnection.getFedAuthToken(SQLServerConnection.java:3609)
at com.microsoft.sqlserver.jdbc.SQLServerConnection
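The NoClassDefFoundError points at a missing transitive dependency (the Nimbus OAuth2 SDK that adal4j requires), so that library usually has to be attached to the cluster as well. For reference, a hedged PySpark sketch of the JDBC read itself with ActiveDirectoryPassword authentication; the server, database, table and credentials are placeholders:

jdbc_url = "jdbc:sqlserver://myserver.database.windows.net:1433;database=mydb"

df = (spark.read
    .format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable", "dbo.my_table")
    .option("user", "user@mytenant.onmicrosoft.com")
    .option("password", "...")
    .option("authentication", "ActiveDirectoryPassword")
    .option("encrypt", "true")
    .option("hostNameInCertificate", "*.database.windows.net")
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
    .load())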