apache-spark

Java Spark : Stack Overflow Error on GroupBy

Submitted by 社会主义新天地 on 2021-02-07 16:07:08
Question: I am using Spark 2.3.1 with Java. I have a Dataset which I want to group in order to run some aggregations (say, a count() for the example). The grouping must be done according to a given list of columns. My function is the following:

public Dataset<Row> compute(Dataset<Row> data, List<String> columns) {
    final List<Column> columns_col = new ArrayList<Column>();
    for (final String tag : columns) {
        columns_col.add(new Column(tag));
    }
    Seq<Column> columns_seq = JavaConverters
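The question itself is in Java; for reference, a minimal PySpark sketch of the same pattern (grouping by a list of column names and counting). The session, DataFrame, and column names here are hypothetical, not taken from the question.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("groupby-by-column-list").getOrCreate()

# Toy data standing in for the real Dataset (hypothetical columns).
df = spark.createDataFrame(
    [("a", "x", 1), ("a", "x", 2), ("b", "y", 3)],
    ["col1", "col2", "value"],
)

cols = ["col1", "col2"]            # grouping columns given as a plain list of names
df.groupBy(*cols).count().show()   # groupBy accepts column names directly
```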

Cosine similarity of word2vec more than 1

Submitted by 大憨熊 on 2021-02-07 14:49:26
Question: I used the word2vec algorithm in Spark to compute document vectors for a text. I then used the findSynonyms function of the model object to get synonyms of a few words. I see something like this:

w2vmodel.findSynonyms('science',4).show(5)

+------------+------------------+
|        word|        similarity|
+------------+------------------+
|     physics| 1.714908638833209|
|     fiction|1.5189824643358183|
|neuroscience|1.4968051528391833|
|  psychology| 1.458865636374223|
+------------+------------------+

I do not
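The excerpt is cut off, but a common sanity check is to recompute the similarity as a true cosine from the raw word vectors, which is bounded in [-1, 1] once the norms are divided out. A minimal sketch, assuming w2vmodel is a pyspark.ml.feature.Word2VecModel and that the vocabulary contains the words used below (both are assumptions):

```python
import numpy as np

# Word2VecModel.getVectors() returns a DataFrame with "word" and "vector" columns.
vecs = {row["word"]: row["vector"].toArray() for row in w2vmodel.getVectors().collect()}

def cosine(u, v):
    # A proper cosine similarity divides the dot product by both vector norms.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(vecs["science"], vecs["physics"]))  # should lie in [-1, 1]
```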

Is there an effective partitioning method when using reduceByKey in Spark?

Submitted by 雨燕双飞 on 2021-02-07 14:21:45
Question: When I use reduceByKey or aggregateByKey, I run into partitioning problems, e.g. reduceByKey(_+_).map(code). In particular, if the input data is skewed, the partitioning problem becomes even worse with these methods. As a workaround I use the repartition method, as described for example at http://dev.sortable.com/spark-repartition/. This helps distribute the partitions, but repartition is also expensive. Is there a way to solve the partition problem wisely? Answer 1: You are
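Separate from the truncated answer above, a minimal PySpark sketch of two common skew mitigations: passing an explicit partition count to reduceByKey, and "salting" hot keys so one key's values are spread across several partitions. The RDD contents and the salt range are hypothetical.

```python
import random
from pyspark import SparkContext

sc = SparkContext.getOrCreate()
rdd = sc.parallelize([("hot", 1)] * 1000 + [("cold", 1)] * 10)

# 1) Pass an explicit partition count to reduceByKey instead of a separate repartition().
counts = rdd.reduceByKey(lambda a, b: a + b, numPartitions=8)

# 2) Salt the keys with a random suffix, reduce the partial sums,
#    then strip the salt and reduce once more.
salted = (rdd.map(lambda kv: ((kv[0], random.randint(0, 7)), kv[1]))
             .reduceByKey(lambda a, b: a + b)
             .map(lambda kv: (kv[0][0], kv[1]))
             .reduceByKey(lambda a, b: a + b))

print(counts.collect())
print(salted.collect())
```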

Reading pretty print json files in Apache Spark

Submitted by 前提是你 on 2021-02-07 13:50:22
Question: I have a lot of JSON files in my S3 bucket and I want to be able to read them and query those files. The problem is that they are pretty-printed. Each JSON file has just one massive dictionary, but it is not on one line. As per this thread, a dictionary in a JSON file should be on one line, which is a limitation of Apache Spark. I don't have it structured that way. My JSON schema looks like this:

{
  "dataset": [
    {
      "key1": [
        { "range": "range1", "value": 0.0 },
        { "range": "range2", "value": 0.23 }
      ]
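A minimal sketch of the usual approach, assuming Spark 2.2 or later: the JSON reader's multiLine option parses records that span multiple lines, so pretty-printed files can be read directly. The bucket path is hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("multiline-json").getOrCreate()

df = (spark.read
          .option("multiLine", True)             # allow one record to span many lines
          .json("s3a://my-bucket/path/*.json"))  # hypothetical path
df.printSchema()
```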

pyspark: groupby and then get max value of each group

Submitted by 末鹿安然 on 2021-02-07 13:12:55
Question: I would like to group by a value and then find the max value in each group using PySpark. I have the following code, but now I am a bit stuck on how to extract the max value.

# some file contains tuples ('user', 'item', 'occurrences')
data_file = sc.textFile('file:///some_file.txt')

# Create the triplet so I can index stuff
data_file = data_file.map(lambda l: l.split()).map(lambda l: (l[0], l[1], float(l[2])))

# Group by the user i.e. r[0]
grouped = data_file.groupBy(lambda r: r[0])

# Here is where
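A minimal sketch of one way to get the per-user maximum without materializing whole groups: key by user and let reduceByKey keep only the running maximum. The toy triplets below are hypothetical stand-ins for the file contents.

```python
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Toy triplets standing in for ('user', 'item', 'occurrences').
triplets = sc.parallelize([("u1", "i1", 2.0), ("u1", "i2", 5.0), ("u2", "i3", 1.0)])

# Key by user and keep only the (item, occurrences) pair with the larger count;
# reduceByKey avoids building full groups the way groupBy does.
max_per_user = (triplets.map(lambda t: (t[0], (t[1], t[2])))
                        .reduceByKey(lambda a, b: a if a[1] >= b[1] else b))

print(max_per_user.collect())   # e.g. [('u1', ('i2', 5.0)), ('u2', ('i3', 1.0))]
```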

Why does Spark (on Google Dataproc) not use all vcores?

Submitted by 杀马特。学长 韩版系。学妹 on 2021-02-07 12:31:55
Question: I'm running a Spark job on a Google Dataproc cluster, but it looks like Spark is not using all the vcores available in the cluster, as you can see below. Based on some other questions like this and this, I have set up the cluster to use DominantResourceCalculator so that both vcpus and memory are considered for resource allocation:

gcloud dataproc clusters create cluster_name --bucket="profiling-job-default" \
    --zone=europe-west1-c \
    --master-boot-disk-size=500GB \
    --worker-boot-disk-size=500GB \
    --master
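The excerpt is cut off, but one frequent contributor to this symptom is that each executor only requests the default number of vcores. A minimal sketch of setting the executor shape explicitly when the session is created; the numbers below are purely illustrative, not tuned for this cluster.

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
             .appName("use-more-vcores")
             .config("spark.executor.cores", "4")       # vcores requested per executor
             .config("spark.executor.memory", "10g")
             .config("spark.executor.instances", "8")
             .getOrCreate())

print(spark.sparkContext.defaultParallelism)
```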

Kafka Structured Streaming KafkaSourceProvider could not be instantiated

Submitted by 橙三吉。 on 2021-02-07 11:38:13
Question: I am working on a streaming project where I have a Kafka stream of ping statistics, like so:

64 bytes from vas.fractalanalytics.com (192.168.30.26): icmp_seq=1 ttl=62 time=0.913 ms
64 bytes from vas.fractalanalytics.com (192.168.30.26): icmp_seq=2 ttl=62 time=0.936 ms
64 bytes from vas.fractalanalytics.com (192.168.30.26): icmp_seq=3 ttl=62 time=0.980 ms
64 bytes from vas.fractalanalytics.com (192.168.30.26): icmp_seq=4 ttl=62 time=0.889 ms

I am trying to read this as a structured stream in
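A minimal sketch of reading these ping lines as a structured stream. It assumes the spark-sql-kafka-0-10 package matching your Spark/Scala version is on the classpath (for example, launched with --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.1); a missing or mismatched package is a common cause of the KafkaSourceProvider instantiation error. The broker address and topic name are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ping-stream").getOrCreate()

pings = (spark.readStream
             .format("kafka")
             .option("kafka.bootstrap.servers", "localhost:9092")  # hypothetical broker
             .option("subscribe", "ping-stats")                    # hypothetical topic
             .load()
             .selectExpr("CAST(value AS STRING) AS line"))

query = pings.writeStream.format("console").start()
query.awaitTermination()
```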

Spark + Hive : Number of partitions scanned exceeds limit (=4000)

Submitted by 有些话、适合烂在心里 on 2021-02-07 11:03:50
Question: We upgraded our Hadoop platform (Spark: 2.3.0, Hive: 3.1), and I'm facing this exception when reading some Hive tables in Spark: "Number of partitions scanned on table 'my_table' exceeds limit (=4000)". The tables we are working on:

table1: external table with a total of ~12300 partitions, partitioned by (col1: String, date1: String), ORC compressed with ZLIB
table2: external table with a total of 4585 partitions, partitioned by (col21: String, date2: Date, col22: String), ORC uncompressed

[A]
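The excerpt is truncated, and the 4000 limit itself is enforced on the Hive metastore side. A minimal sketch of the usual workaround in the Spark query: filter on the partition columns so Spark only asks the metastore for the partitions it needs instead of listing all of them. The table and column names follow the question; the date range is hypothetical.

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
             .appName("partition-pruning")
             .enableHiveSupport()
             .getOrCreate())

# Filtering on the partition columns prunes the metastore request to a subset
# of the ~12300 partitions of table1.
df = (spark.table("table1")
          .where("date1 >= '2021-01-01' AND date1 < '2021-02-01'"))
df.groupBy("col1").count().show()
```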

Spark 2.3 - Minikube - Kubernetes - Windows - Demo - SparkPi not found

Submitted by 牧云@^-^@ on 2021-02-07 10:59:32
Question: I am trying to follow this but I am encountering an error. In particular, when I run:

spark-submit.cmd --master k8s://https://192.168.1.40:8443 --deploy-mode cluster --name spark-pi --class org.apache.spark.examples.SparkPi --conf spark.executor.instances=1 --conf spark.kubernetes.container.image=spark:spark --conf spark.kubernetes.driver.pod.name=spark-pi-driver local:///opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar

I get:

2018-03-17 02:09:00 INFO LoggingPodStatusWatcherImpl:54 -

Difference of elements in list in PySpark

Submitted by 白昼怎懂夜的黑 on 2021-02-07 10:59:28
Question: I have a PySpark dataframe (df) with a column that contains lists with two elements. The two elements in the list are not ordered ascending or descending.

+--------+----------+-------+
| version| timestamp| list  |
+--------+----------+-------+
| v1     |2012-01-10| [5,2] |
| v1     |2012-01-11| [2,5] |
| v1     |2012-01-12| [3,2] |
| v2     |2012-01-12| [2,3] |
| v2     |2012-01-11| [1,2] |
| v2     |2012-01-13| [2,1] |
+--------+----------+-------+

I want to take the difference between the first and the
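The question is truncated, but assuming the goal is the difference between the two elements of the list column, a minimal PySpark sketch is below (taking the absolute difference, since the elements are not stored in a fixed order). The toy rows mirror the table above.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("list-diff").getOrCreate()

df = spark.createDataFrame(
    [("v1", "2012-01-10", [5, 2]), ("v1", "2012-01-11", [2, 5]), ("v2", "2012-01-13", [2, 1])],
    ["version", "timestamp", "list"],
)

# Index into the array column and take the absolute difference of its two elements.
df = df.withColumn("diff", F.abs(F.col("list")[0] - F.col("list")[1]))
df.show()
```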