apache-spark

Spark 2.3 - Minikube - Kubernetes - Windows - Demo - SparkPi not found

南笙酒味 posted on 2021-02-07 10:59:27
Question: I am trying to follow this, but I am encountering an error. In particular, when I run: spark-submit.cmd --master k8s://https://192.168.1.40:8443 --deploy-mode cluster --name spark-pi --class org.apache.spark.examples.SparkPi --conf spark.executor.instances=1 --conf spark.kubernetes.container.image=spark:spark --conf spark.kubernetes.driver.pod.name=spark-pi-driver local:///opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar I get: 2018-03-17 02:09:00 INFO LoggingPodStatusWatcherImpl:54 -

Is there an API function to display “Fraction Cached” for an RDD?

落爺英雄遲暮 posted on 2021-02-07 10:59:26
Question: On the Storage tab of the PySparkShell application UI ([server]:8088) I can see information about an RDD I am using. One of the columns is Fraction Cached. How can I retrieve this percentage programmatically? I can use getStorageLevel() to get some information about RDD caching, but not Fraction Cached. Do I have to calculate it myself? Answer 1: SparkContext.getRDDStorageInfo is probably the thing you're looking for. It returns an Array of RDDInfo which provides information about: Memory size.
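A rough PySpark sketch of computing the same ratio yourself: getRDDStorageInfo is a Scala API and is not exposed directly in Python, so this goes through the JVM gateway (the private sc._jsc handle); the RDDInfo accessor names below come from the Scala class, and treating them as no-arg py4j methods is an assumption.

```python
# Minimal sketch, assuming an active SparkContext named `sc` (e.g. in PySparkShell)
# and that RDDInfo fields are reachable as no-arg methods via py4j.
def fraction_cached(sc):
    """Return {rdd_name: fraction of partitions currently cached}."""
    infos = sc._jsc.sc().getRDDStorageInfo()  # Array[RDDInfo] from the JVM
    result = {}
    for info in infos:
        total = info.numPartitions()
        cached = info.numCachedPartitions()
        result[info.name()] = cached / total if total else 0.0
    return result

# Usage: cache an RDD, run an action so it actually materialises, then inspect it.
rdd = sc.parallelize(range(100), 4).setName("demo").cache()
rdd.count()
print(fraction_cached(sc))  # e.g. {'demo': 1.0}
```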

How to transform JSON string with multiple keys, from spark data frame rows in pyspark?

南笙酒味 posted on 2021-02-07 10:53:15
Question: I'm looking for help with parsing a JSON string that has multiple keys into a JSON struct; see the required output. The answer below shows how to transform a JSON string with one Id: jstr1 = '{"id_1": [{"a": 1, "b": 2}, {"a": 3, "b": 4}]}' How to parse and transform json string from spark data frame rows in pyspark How do I transform thousands of Ids in jstr1, jstr2, when the number of Ids per JSON string changes in each string? Current Code: jstr1 = """ {"id_1": [{"a": 1, "b": 2}, {"a": 3, "b": 4}], "id_2": [
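One way to handle a varying number of ids is to parse each JSON string as a map rather than a struct, so the keys do not have to be declared up front. A sketch, assuming the strings live in a column named json_str and the inner records hold only integer values (both assumptions):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import MapType, StringType, ArrayType, IntegerType

spark = SparkSession.builder.getOrCreate()

jstr1 = '{"id_1": [{"a": 1, "b": 2}, {"a": 3, "b": 4}], "id_2": [{"a": 5, "b": 6}]}'
df = spark.createDataFrame([(jstr1,)], ["json_str"])

# Parse each row as id -> array of {a, b} records; a MapType schema means the
# number (and names) of ids per string does not need to be known in advance.
schema = MapType(StringType(), ArrayType(MapType(StringType(), IntegerType())))
parsed = df.withColumn("parsed", F.from_json("json_str", schema))

# Explode the outer map into (id, records) rows, then the inner array of records.
result = (parsed
          .select(F.explode("parsed").alias("id", "records"))
          .select("id", F.explode("records").alias("record")))
result.show(truncate=False)
```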

Find all permutations of values in Spark RDD; python

末鹿安然 posted on 2021-02-07 10:51:39
Question: I have a Spark RDD (myData) that has been mapped as a list. The output of myData.collect() yields the following: ['x', 'y', 'z'] What operation can I perform on myData to map to or create a new RDD containing a list of all permutations of xyz? For example, newData.collect() would output: ['xyz', 'xzy', 'zxy', 'zyx', 'yxz', 'yzx'] I've tried using variations of cartesian(myData), but as far as I can tell, the best that gives is different combinations of two-value pairs. Answer 1: Doing this all in
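For a small number of elements, the simplest route is to collect them to the driver, build the permutations locally with itertools, and parallelize the result again. A sketch, assuming an active SparkContext named sc:

```python
from itertools import permutations

# myData holds a handful of single elements, e.g. 'x', 'y', 'z'.
myData = sc.parallelize(['x', 'y', 'z'])

# Collect to the driver (cheap for a few elements), enumerate every ordering,
# join each ordering into a string, and distribute the result as a new RDD.
elements = myData.collect()
newData = sc.parallelize([''.join(p) for p in permutations(elements)])

print(newData.collect())
# ['xyz', 'xzy', 'yxz', 'yzx', 'zxy', 'zyx']  (all 3! orderings; order may differ)
```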

How to convert an Iterable to an RDD

戏子无情 posted on 2021-02-07 10:45:26
Question: To be more specific, how can I convert a scala.Iterable to an org.apache.spark.rdd.RDD? I have an RDD of (String, Iterable[(String, Integer)]) and I want this converted into an RDD of (String, RDD[String, Integer]), so that I can apply a reduceByKey function to the internal RDD. E.g. I have an RDD where the key is the 2-letter prefix of a person's name and the value is a List of pairs of person name and the hours they spent in an event. My RDD is: ("To", List(("Tom",50),("Tod","30"),("Tom",70
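RDDs cannot be nested, so the usual workaround is either to aggregate each record's inner collection directly, or to flatten to a composite key and let reduceByKey do the aggregation across the cluster. The question is in Scala; the following is a rough PySpark sketch of both approaches, with illustrative sample data and names:

```python
# Sample grouped data: key = 2-letter name prefix, value = list of (name, hours).
rdd = sc.parallelize([("To", [("Tom", 50), ("Tod", 30), ("Tom", 70)])])

# Option 1: aggregate inside each record -- the inner value is a plain
# collection, so it can be summed without ever becoming an RDD.
def sum_hours(pairs):
    totals = {}
    for name, hours in pairs:
        totals[name] = totals.get(name, 0) + hours
    return list(totals.items())

per_prefix = rdd.mapValues(sum_hours)

# Option 2: flatten to a composite (prefix, name) key and use reduceByKey,
# which distributes the aggregation instead of doing it per record.
flat = (rdd.flatMap(lambda kv: [((kv[0], name), hours) for name, hours in kv[1]])
           .reduceByKey(lambda a, b: a + b))

print(per_prefix.collect())  # [('To', [('Tom', 120), ('Tod', 30)])]
print(flat.collect())        # e.g. [(('To', 'Tom'), 120), (('To', 'Tod'), 30)]
```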

Enabling SSL between Apache Spark and Kafka broker

你说的曾经没有我的故事 posted on 2021-02-07 10:43:15
Question: I am trying to enable SSL between my Apache Spark 1.4.1 and Kafka 0.9.0.0. I am using the spark-streaming-kafka_2.10 jar to connect to Kafka, and the KafkaUtils.createDirectStream method to read data from a Kafka topic. Initially I got an OOM issue, which I resolved by increasing the driver memory; after that I am seeing the issue below. I have done a little bit of reading and found out that spark-streaming-kafka_2.10 uses the Kafka 0.8.2.1 API, which doesn't support SSL (Kafka supports
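The 0.8-based direct stream indeed has no SSL support; SSL only becomes usable from Spark's side with the newer Kafka 0.10 integrations. As a point of reference (not a fix for Spark 1.4.1 itself), here is a sketch of how Kafka SSL settings are passed through in the Spark 2.x Structured Streaming Kafka source, where Kafka client properties get a kafka. prefix; the broker address, topic, and truststore details are placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Kafka client properties are forwarded by prefixing them with "kafka.".
# Host, topic, and truststore details below are placeholders, not real values.
df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker.example.com:9093")
      .option("subscribe", "my_topic")
      .option("kafka.security.protocol", "SSL")
      .option("kafka.ssl.truststore.location", "/path/to/client.truststore.jks")
      .option("kafka.ssl.truststore.password", "changeit")
      .load())

# The value column arrives as binary; cast it to string before processing.
query = (df.selectExpr("CAST(value AS STRING) AS value")
           .writeStream
           .format("console")
           .start())
query.awaitTermination()
```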

Spark SQL to Hive table - Datetime Field Hours Bug

孤街浪徒 posted on 2021-02-07 10:42:14
Question: I face this problem: when I write data into a timestamp field in Hive with spark.sql, the hours are strangely changed to 21:00:00! Let me explain: I have a CSV file that I read with spark.sql. I read the file, convert it to a dataframe, and store it in a Hive table. One of the fields in this file is a date in the format "3/10/2017". The Hive field I want to write it into is of Timestamp type (the reason I use this data type instead of Date is that I want to query the table with Impala, and Impala
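A shift like 21:00:00 usually points at a timezone mismatch between how Spark/Hive store the timestamp and how the reader (e.g. Impala) interprets it, rather than the date pattern itself; either way, parsing the string with an explicit format keeps the conversion unambiguous. A sketch assuming Spark 2.2+ and a string column named event_date (both assumptions):

```python
from pyspark.sql import functions as F

# df is assumed to be the dataframe read from the CSV, with a string column
# "event_date" holding values such as "3/10/2017".
parsed = df.withColumn("event_ts", F.to_timestamp(F.col("event_date"), "M/d/yyyy"))

parsed.select("event_date", "event_ts").show(truncate=False)
# Writing `parsed` to the Hive table stores a midnight timestamp for each date;
# if the stored hours still shift, compare the Spark session, Hive, and Impala
# timezone settings (e.g. spark.sql.session.timeZone) before changing the data.
```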
