databricks

DataFrame to RDD[(String, String)] conversion

家住魔仙堡 submitted on 2019-12-02 07:33:57
I want to convert an org.apache.spark.sql.DataFrame to an org.apache.spark.rdd.RDD[(String, String)] in Databricks. Can anyone help? Background (a better solution is also welcome): I have a Kafka stream which (after some steps) becomes a two-column DataFrame. I would like to put this into a Redis cache, with the first column as the key and the second column as the value. More specifically, the type of the input is lastContacts: org.apache.spark.sql.DataFrame = [serialNumber: string, lastModified: bigint]. I try to put it into Redis as follows: sc.toRedisKV(lastContacts)(redisConfig). The error message looks...
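A minimal sketch of the conversion, assuming the bigint column can simply be rendered as a string and that the spark-redis connector is on the classpath (lastContacts, redisConfig and toRedisKV are the names already used in the question):

```scala
import com.redislabs.provider.redis._        // spark-redis implicits (assumed available)
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.DataFrame

// lastContacts has the schema [serialNumber: string, lastModified: bigint];
// map every Row to a (key, value) pair of strings.
def toKeyValueRdd(lastContacts: DataFrame): RDD[(String, String)] =
  lastContacts.rdd.map(row => (row.getString(0), row.getLong(1).toString))

// With the pair RDD in hand, the call from the question should type-check:
// sc.toRedisKV(toKeyValueRdd(lastContacts))(redisConfig)
```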

Call a function for each element of a stream in Databricks

夙愿已清 submitted on 2019-12-02 07:06:57
I have a DataFrame stream in Databricks, and I want to perform an action on each element. On the net I found special-purpose methods, like writing it to the console or dumping it into memory, but I want to add some business logic and put some results into Redis. To be more specific, this is how it would look in the non-stream case: val someDataFrame = Seq( ("key1", "value1"), ("key2", "value2"), ("key3", "value3"), ("key4", "value4") ).toDF() def someFunction(keyValuePair: (String, String)) = { println(keyValuePair) } someDataFrame.collect.foreach(r => someFunction((r(0).toString, r(1)...
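One possible sketch, not taken from an answer: in the streaming case a ForeachWriter can carry the per-row business logic to the executors. It assumes someDataFrame is the streaming version of the two-string-column frame and that someFunction holds the Redis logic:

```scala
import org.apache.spark.sql.{ForeachWriter, Row}

def someFunction(keyValuePair: (String, String)): Unit = println(keyValuePair)  // stand-in for the Redis logic

// A ForeachWriter runs on the executors: open/close can manage a Redis connection,
// and process is called once for every row of each micro-batch.
val redisWriter = new ForeachWriter[Row] {
  def open(partitionId: Long, epochId: Long): Boolean = true   // e.g. open the Redis connection here
  def process(row: Row): Unit = someFunction((row(0).toString, row(1).toString))
  def close(errorOrNull: Throwable): Unit = ()                 // e.g. close the connection here
}

// someDataFrame is assumed to be the streaming two-column DataFrame from the question:
// val query = someDataFrame.writeStream.foreach(redisWriter).start()
// query.awaitTermination()
```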

How to use file from Databricks FileStore

一世执手 submitted on 2019-12-02 02:05:56
Trying to use a .dat file for IP lookup. The file is on the Databricks FileStore; from Scala code: def getCountryCode(ip: String) { val filePath = "FileStore/maxmind/GeoIPCountry.dat" val ipLookups = new IpLookups(geoFile = Option(new File(filePath)), ispFile = None, orgFile = None, domainFile = None, memCache = false, lruCache = 0) val location = ipLookups.performLookups(ip)._1.head println(location.countryCode) } I am getting an exception: java.io.FileNotFoundException: FileStore/maxmind/GeoIPCountry.dat (No such file or directory). The method works in a local environment with relative/absolute paths. Use...
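A likely cause, stated here as an assumption rather than the accepted answer: java.io.File only sees the driver's local filesystem, and DBFS (including FileStore) is exposed there under the /dbfs mount. A minimal sketch of the adjusted path:

```scala
import java.io.File

// DBFS (including FileStore) is mounted on the driver's local filesystem under /dbfs,
// which is what java.io.File needs; the bare "FileStore/..." path only exists in DBFS.
val filePath = "/dbfs/FileStore/maxmind/GeoIPCountry.dat"
val geoFile = new File(filePath)

// The rest of getCountryCode can stay as in the question, e.g.:
// val ipLookups = new IpLookups(geoFile = Option(geoFile), ispFile = None,
//                               orgFile = None, domainFile = None, memCache = false, lruCache = 0)
```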

Databricks - Creating permanent User Defined Functions (UDFs)

穿精又带淫゛_ submitted on 2019-12-01 13:47:15
I am able to create a UDF and register it with Spark using the spark.udf method. However, this is per session only. How can Python UDFs be registered automatically when the cluster starts? These functions should be available to all users. An example use case is converting time from UTC to a local time zone. Answer: This is not possible; this is not like UDFs in Hive. Code the UDF as part of the package/program you submit, or in the jar included in the Spark app if using spark-submit. However, the spark.udf.register("... call is still required as well. This applies to Databricks notebooks, etc. The...
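A rough sketch of what the answer describes, with illustrative names: ship the UDF in a library attached to the cluster and call a single registration helper from each notebook or job (or from a startup notebook), using the UTC-to-local-time use case from the question:

```scala
import java.time.{Instant, ZoneId, ZonedDateTime}
import org.apache.spark.sql.SparkSession

object UdfRegistration {
  // Illustrative UDF for the question's use case: epoch millis in UTC -> local-time string.
  def registerAll(spark: SparkSession): Unit = {
    spark.udf.register("utc_to_local", (epochMillis: Long, zone: String) =>
      ZonedDateTime.ofInstant(Instant.ofEpochMilli(epochMillis), ZoneId.of(zone)).toString
    )
  }
}

// In every notebook or job (or a notebook run when the cluster starts):
// UdfRegistration.registerAll(spark)
// spark.sql("SELECT utc_to_local(1575270000000, 'Europe/Brussels')").show()
```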

Why is only one core taking all the load, and how do I make the other 29 cores take load?

不问归期 submitted on 2019-12-01 12:02:31
I am trying to push my Spark-processed data to a 3-node C* cluster. I am pushing 200 million records to Cassandra and it is failing with the error below. Below is my Spark cluster configuration: Nodes: 12, vCores total: 112, total memory: 1.5 TB. Below are my spark-submit parameters: $SPARK_HOME/bin/spark-submit --master yarn --deploy-mode cluster --name app --class Driver --executor-cores 3 --executor-memory 8g --num-executors 10 --driver-cores 2 --driver-memory 10g --conf spark.shuffle.service.enabled=true --conf spark.dynamicAllocation.enabled=false --conf spark.task.maxFailures=8 --conf spark...
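Since the actual error is cut off, the following is only an assumption about a common cause: all 200 million rows land in one partition before the write, so a single task (and therefore one core) does the work. A sketch of repartitioning ahead of the spark-cassandra-connector write, with an illustrative partition count and placeholder keyspace/table names:

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}

// Spreading the rows over many partitions gives every executor core its own task:
// 10 executors x 3 cores = 30 concurrent tasks, so 120 partitions keeps them all busy.
def writeToCassandra(df: DataFrame): Unit = {
  df.repartition(120)
    .write
    .format("org.apache.spark.sql.cassandra")
    .options(Map("keyspace" -> "my_ks", "table" -> "my_table"))  // placeholder names
    .mode(SaveMode.Append)
    .save()
}
```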

Save custom transformers in pyspark

血红的双手。 submitted on 2019-12-01 11:55:36
When I implement this part of this Python code in Azure Databricks: class customTransformations(Transformer): <code> custom_transformer = customTransformations() ... pipeline = Pipeline(stages=[custom_transformer, assembler, scaler, rf]) pipeline_model = pipeline.fit(sample_data) pipeline_model.save(<your path>) When I attempt to save the pipeline, I get this: AttributeError: 'customTransformations' object has no attribute '_to_java' Any workarounds? Answer (dportman): It seems like there is no easy workaround but to try and implement the _to_java method, as is suggested here for StopWordsRemover:...

How to read a XML file with spark that contains multiple namespaces?

ぐ巨炮叔叔 submitted on 2019-12-01 09:53:34
Question: I'm using the spark-xml library on Azure Databricks, but I can't get the options right to read this kind of file, which contains multiple namespaces. So I'm looking for some help to get this coded in the options, or any other approach. Here is a stripped sample: <msg:TrainTrackingMessage xmlns:msg="be:brail:nmbs-it:esb:msg:traintraffic" xmlns:trtf="be:brail:nmbs-it:esb:traintraffic" xmlns:gene="be:brail:nmbs-it:esb:generalelements"> <gene:Event> <gene:EventType>tracking</gene:EventType> <gene...
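A minimal sketch of one approach, offered as an assumption rather than the accepted answer: spark-xml compares rowTag against the literal element name, so the namespace prefix from the sample can simply be included in it (the file path below is illustrative):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// spark-xml matches rowTag against the literal element name, so the
// "gene:" prefix from the sample is kept as part of the tag here.
val events = spark.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "gene:Event")
  .load("/FileStore/traintraffic/sample.xml")   // illustrative path

events.printSchema()
```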

How to set jdbc/partitionColumn type to Date in spark 2.4.1

ぐ巨炮叔叔 submitted on 2019-12-01 09:38:22
Question: I am trying to retrieve data from Oracle using spark-sql-2.4.1. I tried to set the JdbcOptions as below: .option("lowerBound", "31-MAR-02"); .option("upperBound", "01-MAY-19"); .option("partitionColumn", "data_date"); .option("numPartitions", 240); But it gives an error: java.lang.IllegalArgumentException: Timestamp format must be yyyy-mm-dd hh:mm:ss[.fffffffff] at java.sql.Timestamp.valueOf(Timestamp.java:204) at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation$...
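A sketch of one possible fix, based on the error text rather than an accepted answer: the exception comes from java.sql.Timestamp.valueOf, which suggests the Oracle DATE column is being treated as a Spark timestamp, so the bounds would need to be supplied in the yyyy-MM-dd HH:mm:ss form that call expects (connection details below are placeholders):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
val jdbcUrl = "jdbc:oracle:thin:@//dbhost:1521/ORCL"    // placeholder connection string

val df = spark.read
  .format("jdbc")
  .option("url", jdbcUrl)
  .option("dbtable", "some_table")                      // placeholder table name
  .option("user", "user").option("password", "secret")  // placeholders
  .option("partitionColumn", "data_date")
  .option("lowerBound", "2002-03-31 00:00:00")          // instead of 31-MAR-02
  .option("upperBound", "2019-05-01 00:00:00")          // instead of 01-MAY-19
  .option("numPartitions", "240")
  .load()
```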