databricks

SQL / sparklyr / SparkR dataframe conversions on Databricks

Submitted by 痞子三分冷 on 2019-12-23 20:55:20
Question: I have a SQL table on Databricks created using the following code:

%sql
CREATE TABLE data USING CSV OPTIONS (header "true", inferSchema "true") LOCATION "url/data.csv"

The following code converts that table to a SparkR and an R dataframe, respectively:

%r
library(SparkR)
data_spark <- sql("SELECT * FROM data")
data_r_df <- as.data.frame(data_spark)

But how should I convert any or all of these dataframes into a sparklyr dataframe, so that I can leverage sparklyr's parallelization?

Answer 1: Just sc

Spark doing an exchange of partitions that are already correctly distributed

Submitted by 扶醉桌前 on 2019-12-23 06:49:33
Question: I am joining two datasets on two columns, and the result is a dataset containing 55 billion rows. After that I have to do some aggregation on this dataset by a different column than the ones used in the join. The problem is that Spark performs a partition exchange after the join (which takes too much time with 55 billion rows), even though the data is already correctly distributed, because the aggregation column is unique. I know that the aggregation key is correctly distributed; is there a way to tell this to the Spark application?

Answer 1: 1) Go to Spark UI
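The answer breaks off after pointing to the Spark UI. One hedged workaround, separate from whatever the original answer went on to suggest, is to pay the shuffle once by persisting the joined result bucketed on the aggregation column, so later aggregations can reuse that layout instead of planning a fresh exchange. This is only a sketch: `joined`, `agg_key`, `value`, the bucket count, and the table name are all placeholders, and `spark` is the session Databricks provides.

```python
from pyspark.sql import functions as F

# `joined` is the 55-billion-row join result, `agg_key` the later group-by column.
(joined.write
    .bucketBy(512, "agg_key")          # shuffle once into a fixed bucket layout
    .sortBy("agg_key")
    .mode("overwrite")
    .saveAsTable("joined_bucketed"))   # bucketed table name is an assumption

# Aggregations on the bucketed table can reuse the bucketing metadata, so Spark
# knows how the data is distributed and can often skip the extra exchange.
result = (spark.table("joined_bucketed")
          .groupBy("agg_key")
          .agg(F.sum("value").alias("total")))   # `value` is a placeholder column
```

Whether the exchange is actually skipped depends on the Spark version and the bucketing-related SQL settings, so this should be verified in the query plan.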

How to create a DataFrame schema from a JSON schema file

Submitted by 百般思念 on 2019-12-23 05:19:06
Question: My use case is to read an existing json-schema file, parse it, and build a Spark DataFrame schema out of it. To start off, I followed the steps mentioned here.

Steps followed:
1. Imported the library from Maven
2. Restarted the cluster
3. Created a sample JSON schema file
4. Used this code to read the sample schema file:

val schema = SchemaConverter.convert("/FileStore/tables/schemaFile.json")

When I run the above command I get the error: not found: value SchemaConverter. To ensure that the
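The question (and the "not found: value SchemaConverter" error) concerns a Scala schema-conversion library, but for comparison, here is a rough PySpark alternative for the narrower case where the file contains Spark's own schema JSON (the output of DataFrame.schema.json()) rather than a standard JSON Schema document. The paths are placeholders and `spark` is the Databricks-provided session.

```python
import json
from pyspark.sql.types import StructType

# Rebuild a StructType from a schema that was previously produced by
# df.schema.json() -- an assumption about what the file contains.
with open("/dbfs/FileStore/tables/schemaFile.json") as f:
    schema = StructType.fromJson(json.load(f))

# Apply the reconstructed schema when reading data (path is hypothetical).
df = spark.read.schema(schema).json("/FileStore/tables/sample_data.json")
```

This does not parse standard JSON Schema; for that, the Scala library from the question (or an equivalent converter) is still needed.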

Delete Azure SQL database rows from Azure Databricks

Submitted by 梦想与她 on 2019-12-23 04:54:09
Question: I have a table in an Azure SQL database from which I want to delete either selected rows, based on some criteria, or the entire table, from Azure Databricks. Currently I am using the truncate property of JDBC to truncate the entire table without dropping it, and then re-write it with a new dataframe:

df.write \
  .option('user', jdbcUsername) \
  .option('password', jdbcPassword) \
  .jdbc('<connection_string>', '<table_name>', mode = 'overwrite', properties = {'truncate' : 'true'})

But going forward I don't
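The question is cut off here, but the asker apparently wants to delete only selected rows rather than truncating and rewriting the whole table. One hedged workaround (a sketch, not necessarily the accepted answer; the URL, credentials, table name, and predicate are placeholders) is to open a plain JDBC connection from the driver and issue the DELETE statement directly, since Spark's DataFrame writer has no row-level delete.

```python
# Reuse the JDBC driver already on the Databricks cluster from the driver JVM;
# Spark itself never sees the DELETE.
driver_manager = spark.sparkContext._jvm.java.sql.DriverManager
conn = driver_manager.getConnection(jdbcUrl, jdbcUsername, jdbcPassword)
try:
    stmt = conn.createStatement()
    # Hypothetical table and criteria -- replace with the real ones.
    stmt.executeUpdate("DELETE FROM dbo.my_table WHERE eventdate < '2019-01-01'")
    stmt.close()
finally:
    conn.close()
```

The same pattern works for TRUNCATE or any other DML that has to run server-side.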

Spark Will Not Load Large MySQL Table: Java Communications Link Failure - Timing Out

Submitted by 跟風遠走 on 2019-12-23 04:24:38
Question: I'm trying to pull a pretty large table from MySQL so I can manipulate it using Spark/Databricks. I can't get it to load into Spark. I have tried taking smaller subsets, but even at the smallest reasonable unit it still fails to load. I have tried playing with wait_timeout and interactive_timeout in MySQL, but that doesn't seem to make any difference. I am also loading a smaller (different) table, and that loads just fine.

df_dataset = get_jdbc('raw_data_load', predicates=predicates).select(
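The read call above is cut off, but communications-link timeouts on one huge JDBC query are often worked around by splitting the read into parallel partitions with a bounded fetch size, so no single statement has to stream the whole table. A minimal sketch, assuming a numeric, indexed column named `id` and placeholder connection details (the original `get_jdbc` helper is not shown in the question):

```python
df_dataset = (spark.read.format("jdbc")
    .option("url", "jdbc:mysql://<host>:3306/<database>")
    .option("dbtable", "raw_data_load")
    .option("user", mysql_user)            # placeholder credentials
    .option("password", mysql_password)
    .option("partitionColumn", "id")       # assumed numeric, indexed column
    .option("lowerBound", 1)
    .option("upperBound", 100000000)       # rough max of the id column
    .option("numPartitions", 64)           # 64 smaller queries instead of one huge one
    .option("fetchsize", 10000)            # stream rows from MySQL in batches
    .load())
```

The bounds only control how the id range is split across partitions; they do not filter rows, so they can be rough estimates.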

Reading data from a URL using Spark on the Databricks platform

Submitted by 自作多情 on 2019-12-22 18:45:20
Question: I'm trying to read data from a URL using Spark on the Databricks Community Edition platform. I tried spark.read.csv together with SparkFiles, but I'm still missing some simple point.

url = "https://raw.githubusercontent.com/thomaspernet/data_csv_r/master/data/adult.csv"
from pyspark import SparkFiles
spark.sparkContext.addFile(url)
# sc.addFile(url)
# sqlContext = SQLContext(sc)
# df = sqlContext.read.csv(SparkFiles.get("adult.csv"), header=True, inferSchema= True)
df = spark.read.csv(SparkFiles.get(
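The last line is cut off, but one hedged alternative (not necessarily the fix the answer went on to give) is to skip SparkFiles entirely: fetch the CSV on the driver with pandas and convert it to a Spark DataFrame, which sidesteps the question of which node SparkFiles.get() resolves on. `spark` is the session Databricks provides.

```python
import pandas as pd

url = "https://raw.githubusercontent.com/thomaspernet/data_csv_r/master/data/adult.csv"

# Download and parse on the driver, then hand the result to Spark. Fine for a
# small file like adult.csv; very large files would need a different approach.
pdf = pd.read_csv(url)
df = spark.createDataFrame(pdf)
df.printSchema()
```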

Unsupported literal type class scala.runtime.BoxedUnit

Submitted by 允我心安 on 2019-12-20 04:38:43
Question: I am trying to filter a column of a dataframe read from Oracle, as below:

import org.apache.spark.sql.functions.{col, lit, when}
val df0 = df_org.filter(col("fiscal_year").isNotNull())

When I do this I get the error below:

java.lang.RuntimeException: Unsupported literal type class scala.runtime.BoxedUnit ()
at org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:77)
at org.apache.spark.sql.catalyst.expressions.Literal$$anonfun$create$2.apply(literals.scala:163)
at org.apache

Save custom transformers in PySpark

Submitted by 随声附和 on 2019-12-19 10:11:56
Question: When I implement this part of this Python code in Azure Databricks:

class customTransformations(Transformer):
    <code>

custom_transformer = customTransformations()
....
pipeline = Pipeline(stages=[custom_transformer, assembler, scaler, rf])
pipeline_model = pipeline.fit(sample_data)
pipeline_model.save(<your path>)

When I attempt to save the pipeline, I get this:

AttributeError: 'customTransformations' object has no attribute '_to_java'

Any workarounds?

Answer 1: It seems like there is no easy
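The answer breaks off here, but one commonly cited workaround (a sketch that assumes Spark 2.3 or later, not the author's exact code) is to mix DefaultParamsWritable and DefaultParamsReadable into the pure-Python transformer, so Pipeline persistence uses the Python Params machinery instead of looking for a _to_java bridge.

```python
from pyspark.ml import Transformer
from pyspark.ml.util import DefaultParamsReadable, DefaultParamsWritable

class customTransformations(Transformer, DefaultParamsReadable, DefaultParamsWritable):
    """Pure-Python transformer that can be saved and loaded inside a Pipeline."""

    def _transform(self, dataset):
        # Placeholder logic: the real transformation from the question goes here.
        return dataset
```

With the mixins in place, pipeline_model.save(path) and PipelineModel.load(path) should work for the whole pipeline, with the caveat that only state declared as Params is persisted, so any configuration the transformer needs has to be expressed that way.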

How to slice a PySpark dataframe into two, row-wise

Submitted by 人走茶凉 on 2019-12-18 03:34:47
Question: I am working in Databricks. I have a dataframe which contains 500 rows; I would like to create two dataframes, one containing 100 rows and the other containing the remaining 400 rows.

+--------------------+----------+
|              userid| eventdate|
+--------------------+----------+
|00518b128fc9459d9...|2017-10-09|
|00976c0b7f2c4c2ca...|2017-12-16|
|00a60fb81aa74f35a...|2017-12-04|
|00f9f7234e2c4bf78...|2017-05-09|
|0146fe6ad7a243c3b...|2017-11-21|
|016567f169c145ddb...|2017-10-16|
|01ccd278777946cb8...
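The example table is cut off, but one way to split the 500 rows deterministically is to number them with a window function and filter on the row number. This is only a sketch: it assumes ordering by eventdate and userid is acceptable, since Spark dataframes have no inherent row order, and for 500 rows the single-partition window is harmless.

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Assign a stable row number over the whole dataframe, then split on it.
w = Window.orderBy("eventdate", "userid")
numbered = df.withColumn("rn", F.row_number().over(w))

first_100 = numbered.filter(F.col("rn") <= 100).drop("rn")
remaining_400 = numbered.filter(F.col("rn") > 100).drop("rn")
```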

How to TRUNCATE and/or use wildcards with Databricks

Submitted by 空扰寡人 on 2019-12-17 21:14:54
Question: I'm trying to write a script in Databricks that will select a file based on certain characters in the file's name, or just on the datestamp in the file name. For example, a file name looks as follows:

LCMS_MRD_Delta_LoyaltyAccount_1992_2018-12-22 06-07-31

I have created the following code in Databricks:

import datetime
now1 = datetime.datetime.now()
now = now1.strftime("%Y-%m-%d")

Using the above code I tried to select the file using the following:

LCMS_MRD_Delta_LoyaltyAccount_1992_%s.csv'
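The snippet stops at the format string, but since the exact timestamp suffix (06-07-31 in the example) is not known in advance, one hedged approach (the directory is a placeholder mount point, and `dbutils`/`spark` are the globals Databricks provides) is to list the folder and keep the files whose names start with the fixed prefix plus today's datestamp.

```python
import datetime

now = datetime.datetime.now().strftime("%Y-%m-%d")
prefix = "LCMS_MRD_Delta_LoyaltyAccount_1992_" + now

# List the landing folder and match on the prefix, ignoring the
# hour-minute-second part of the file name.
matching = [f.path for f in dbutils.fs.ls("/mnt/landing/") if f.name.startswith(prefix)]

df = spark.read.csv(matching, header=True, inferSchema=True)
```

This behaves like a wildcard match on the datestamp without having to know the trailing timestamp at all.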