apache-spark

Spark 2.3 - Minikube - Kubernetes - Windows - Demo - SparkPi not found

南笙酒味 posted on 2021-02-07 10:59:27
Question: I am trying to follow this, but I am encountering an error. In particular, when I run: spark-submit.cmd --master k8s://https://192.168.1.40:8443 --deploy-mode cluster --name spark-pi --class org.apache.spark.examples.SparkPi --conf spark.executor.instances=1 --conf spark.kubernetes.container.image=spark:spark --conf spark.kubernetes.driver.pod.name=spark-pi-driver local:///opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar I get: 2018-03-17 02:09:00 INFO LoggingPodStatusWatcherImpl:54 -

Is there an API function to display “Fraction Cached” for an RDD?

落爺英雄遲暮 posted on 2021-02-07 10:59:26
Question: On the Storage tab of the PySparkShell application UI ([server]:8088) I can see information about an RDD I am using. One of the columns is Fraction Cached. How can I retrieve this percentage programmatically? I can use getStorageLevel() to get some information about RDD caching, but not Fraction Cached. Do I have to calculate it myself? Answer 1: SparkContext.getRDDStorageInfo is probably the thing you're looking for. It returns an Array of RDDInfo which provides information about: Memory size.
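A rough PySpark sketch of computing the same ratio yourself: getRDDStorageInfo is a Scala API and is not exposed directly in Python, so this goes through the JVM gateway (the private sc._jsc handle); the RDDInfo accessor names below come from the Scala class, and treating them as no-arg py4j methods is an assumption.

```python
# Minimal sketch, assuming an active SparkContext named `sc` (e.g. in PySparkShell)
# and that RDDInfo fields are reachable as no-arg methods via py4j.
def fraction_cached(sc):
    """Return {rdd_name: fraction of partitions currently cached}."""
    infos = sc._jsc.sc().getRDDStorageInfo()  # Array[RDDInfo] from the JVM
    result = {}
    for info in infos:
        total = info.numPartitions()
        cached = info.numCachedPartitions()
        result[info.name()] = cached / total if total else 0.0
    return result

# Usage: cache an RDD, run an action so it actually materialises, then inspect it.
rdd = sc.parallelize(range(100), 4).setName("demo").cache()
rdd.count()
print(fraction_cached(sc))  # e.g. {'demo': 1.0}
```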

How to transform JSON string with multiple keys, from spark data frame rows in pyspark?

南笙酒味 posted on 2021-02-07 10:53:15
Question: I'm looking for help with parsing a JSON string that has multiple keys into a JSON struct; see the required output. The answer below shows how to transform a JSON string with one Id: jstr1 = '{"id_1": [{"a": 1, "b": 2}, {"a": 3, "b": 4}]}' How to parse and transform json string from spark data frame rows in pyspark How do I transform thousands of Ids in jstr1, jstr2, when the number of Ids per JSON string changes in each string? Current Code: jstr1 = """ {"id_1": [{"a": 1, "b": 2}, {"a": 3, "b": 4}], "id_2": [
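One way to handle a varying number of ids is to parse each JSON string as a map rather than a struct, so the keys do not have to be declared up front. A sketch, assuming the strings live in a column named json_str and the inner records hold only integer values (both assumptions):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import MapType, StringType, ArrayType, IntegerType

spark = SparkSession.builder.getOrCreate()

jstr1 = '{"id_1": [{"a": 1, "b": 2}, {"a": 3, "b": 4}], "id_2": [{"a": 5, "b": 6}]}'
df = spark.createDataFrame([(jstr1,)], ["json_str"])

# Parse each row as id -> array of {a, b} records; a MapType schema means the
# number (and names) of ids per string does not need to be known in advance.
schema = MapType(StringType(), ArrayType(MapType(StringType(), IntegerType())))
parsed = df.withColumn("parsed", F.from_json("json_str", schema))

# Explode the outer map into (id, records) rows, then the inner array of records.
result = (parsed
          .select(F.explode("parsed").alias("id", "records"))
          .select("id", F.explode("records").alias("record")))
result.show(truncate=False)
```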

Find all permutations of values in Spark RDD; python

末鹿安然 posted on 2021-02-07 10:51:39
Question: I have a Spark RDD (myData) that has been mapped as a list. The output of myData.collect() yields the following: ['x', 'y', 'z'] What operation can I perform on myData to map to or create a new RDD containing a list of all permutations of xyz? For example, newData.collect() would output: ['xyz', 'xzy', 'zxy', 'zyx', 'yxz', 'yzx'] I've tried using variations of cartesian(myData), but as far as I can tell, the best that gives is different combinations of two-value pairs. Answer 1: Doing this all in
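For a small number of elements, the simplest route is to collect them to the driver, build the permutations locally with itertools, and parallelize the result again. A sketch, assuming an active SparkContext named sc:

```python
from itertools import permutations

# myData holds a handful of single elements, e.g. 'x', 'y', 'z'.
myData = sc.parallelize(['x', 'y', 'z'])

# Collect to the driver (cheap for a few elements), enumerate every ordering,
# join each ordering into a string, and distribute the result as a new RDD.
elements = myData.collect()
newData = sc.parallelize([''.join(p) for p in permutations(elements)])

print(newData.collect())
# ['xyz', 'xzy', 'yxz', 'yzx', 'zxy', 'zyx']  (all 3! orderings; order may differ)
```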

How to convert an Iterable to an RDD

戏子无情 posted on 2021-02-07 10:45:26
Question: To be more specific, how can I convert a scala.Iterable to an org.apache.spark.rdd.RDD? I have an RDD of (String, Iterable[(String, Integer)]) and I want this converted into an RDD of (String, RDD[String, Integer]), so that I can apply a reduceByKey function to the internal RDD. E.g. I have an RDD where the key is the 2-letter prefix of a person's name and the value is a List of pairs of person name and the hours they spent in an event. My RDD is: ("To", List(("Tom",50),("Tod","30"),("Tom",70
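RDDs cannot be nested, so the usual workaround is either to aggregate each record's inner collection directly, or to flatten to a composite key and let reduceByKey do the aggregation across the cluster. The question is in Scala; the following is a rough PySpark sketch of both approaches, with illustrative sample data and names:

```python
# Sample grouped data: key = 2-letter name prefix, value = list of (name, hours).
rdd = sc.parallelize([("To", [("Tom", 50), ("Tod", 30), ("Tom", 70)])])

# Option 1: aggregate inside each record -- the inner value is a plain
# collection, so it can be summed without ever becoming an RDD.
def sum_hours(pairs):
    totals = {}
    for name, hours in pairs:
        totals[name] = totals.get(name, 0) + hours
    return list(totals.items())

per_prefix = rdd.mapValues(sum_hours)

# Option 2: flatten to a composite (prefix, name) key and use reduceByKey,
# which distributes the aggregation instead of doing it per record.
flat = (rdd.flatMap(lambda kv: [((kv[0], name), hours) for name, hours in kv[1]])
           .reduceByKey(lambda a, b: a + b))

print(per_prefix.collect())  # [('To', [('Tom', 120), ('Tod', 30)])]
print(flat.collect())        # e.g. [(('To', 'Tom'), 120), (('To', 'Tod'), 30)]
```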

Enabling SSL between Apache Spark and Kafka broker

你说的曾经没有我的故事 posted on 2021-02-07 10:43:15
Question: I am trying to enable SSL between my Apache Spark 1.4.1 and Kafka 0.9.0.0. I am using the spark-streaming-kafka_2.10 jar to connect to Kafka, and the KafkaUtils.createDirectStream method to read data from a Kafka topic. Initially I got an OOM issue, which I resolved by increasing the driver memory; after that I am seeing the issue below. I have done a little bit of reading and found out that spark-streaming-kafka_2.10 uses the Kafka 0.8.2.1 API, which doesn't support SSL (Kafka supports
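The 0.8-based direct stream indeed has no SSL support; SSL only becomes usable from Spark's side with the newer Kafka 0.10 integrations. As a point of reference (not a fix for Spark 1.4.1 itself), here is a sketch of how Kafka SSL settings are passed through in the Spark 2.x Structured Streaming Kafka source, where Kafka client properties get a kafka. prefix; the broker address, topic, and truststore details are placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Kafka client properties are forwarded by prefixing them with "kafka.".
# Host, topic, and truststore details below are placeholders, not real values.
df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker.example.com:9093")
      .option("subscribe", "my_topic")
      .option("kafka.security.protocol", "SSL")
      .option("kafka.ssl.truststore.location", "/path/to/client.truststore.jks")
      .option("kafka.ssl.truststore.password", "changeit")
      .load())

# The value column arrives as binary; cast it to string before processing.
query = (df.selectExpr("CAST(value AS STRING) AS value")
           .writeStream
           .format("console")
           .start())
query.awaitTermination()
```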

Spark SQL to Hive table - Datetime Field Hours Bug

孤街浪徒 posted on 2021-02-07 10:42:14
Question: I face this problem: when I write data into a timestamp field in Hive with spark.sql, the hours are strangely changed to 21:00:00! Let me explain: I have a CSV file that I read with spark.sql. I read the file, convert it to a dataframe, and store it in a Hive table. One of the fields in this file is a date in the format "3/10/2017". The Hive field I want to write it into is of Timestamp type (the reason I use this data type instead of Date is that I want to query the table with Impala, and Impala
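A shift like 21:00:00 usually points at a timezone mismatch between how Spark/Hive store the timestamp and how the reader (e.g. Impala) interprets it, rather than the date pattern itself; either way, parsing the string with an explicit format keeps the conversion unambiguous. A sketch assuming Spark 2.2+ and a string column named event_date (both assumptions):

```python
from pyspark.sql import functions as F

# df is assumed to be the dataframe read from the CSV, with a string column
# "event_date" holding values such as "3/10/2017".
parsed = df.withColumn("event_ts", F.to_timestamp(F.col("event_date"), "M/d/yyyy"))

parsed.select("event_date", "event_ts").show(truncate=False)
# Writing `parsed` to the Hive table stores a midnight timestamp for each date;
# if the stored hours still shift, compare the Spark session, Hive, and Impala
# timezone settings (e.g. spark.sql.session.timeZone) before changing the data.
```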
