apache-spark

Java Spark : Stack Overflow Error on GroupBy

Submitted by 社会主义新天地 on 2021-02-07 16:07:08
Question: I am using Spark 2.3.1 with Java. I have a Dataset which I want to group in order to run some aggregations (say, a count() for the example). The grouping must be done according to a given list of columns. My function is the following:

public Dataset<Row> compute(Dataset<Row> data, List<String> columns) {
    final List<Column> columns_col = new ArrayList<Column>();
    for (final String tag : columns) {
        columns_col.add(new Column(tag));
    }
    Seq<Column> columns_seq = JavaConverters
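The question itself is in Java; for reference, a minimal PySpark sketch of the same pattern (grouping by a list of column names and counting). The session, DataFrame, and column names here are hypothetical, not taken from the question.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("groupby-by-column-list").getOrCreate()

# Toy data standing in for the real Dataset (hypothetical columns).
df = spark.createDataFrame(
    [("a", "x", 1), ("a", "x", 2), ("b", "y", 3)],
    ["col1", "col2", "value"],
)

cols = ["col1", "col2"]            # grouping columns given as a plain list of names
df.groupBy(*cols).count().show()   # groupBy accepts column names directly
```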

Cosine similarity of word2vec more than 1

Submitted by 大憨熊 on 2021-02-07 14:49:26
Question: I used the word2vec algorithm in Spark to compute document vectors for a text. I then used the findSynonyms function of the model object to get synonyms of a few words. I see something like this:

w2vmodel.findSynonyms('science',4).show(5)

+------------+------------------+
|        word|        similarity|
+------------+------------------+
|     physics| 1.714908638833209|
|     fiction|1.5189824643358183|
|neuroscience|1.4968051528391833|
|  psychology| 1.458865636374223|
+------------+------------------+

I do not
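The excerpt is cut off, but a common sanity check is to recompute the similarity as a true cosine from the raw word vectors, which is bounded in [-1, 1] once the norms are divided out. A minimal sketch, assuming w2vmodel is a pyspark.ml.feature.Word2VecModel and that the vocabulary contains the words used below (both are assumptions):

```python
import numpy as np

# Word2VecModel.getVectors() returns a DataFrame with "word" and "vector" columns.
vecs = {row["word"]: row["vector"].toArray() for row in w2vmodel.getVectors().collect()}

def cosine(u, v):
    # A proper cosine similarity divides the dot product by both vector norms.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(vecs["science"], vecs["physics"]))  # should lie in [-1, 1]
```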

Is there an effective partitioning method when using reduceByKey in Spark?

Submitted by 雨燕双飞 on 2021-02-07 14:21:45
Question: When I use reduceByKey or aggregateByKey, I run into partitioning problems, e.g. reduceByKey(_+_).map(code). In particular, if the input data is skewed, the partitioning problem becomes even worse with these methods. As a workaround I use the repartition method, as described for example at http://dev.sortable.com/spark-repartition/. This helps distribute the partitions, but repartition is also expensive. Is there a way to solve the partition problem wisely? Answer 1: You are
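Separate from the truncated answer above, a minimal PySpark sketch of two common skew mitigations: passing an explicit partition count to reduceByKey, and "salting" hot keys so one key's values are spread across several partitions. The RDD contents and the salt range are hypothetical.

```python
import random
from pyspark import SparkContext

sc = SparkContext.getOrCreate()
rdd = sc.parallelize([("hot", 1)] * 1000 + [("cold", 1)] * 10)

# 1) Pass an explicit partition count to reduceByKey instead of a separate repartition().
counts = rdd.reduceByKey(lambda a, b: a + b, numPartitions=8)

# 2) Salt the keys with a random suffix, reduce the partial sums,
#    then strip the salt and reduce once more.
salted = (rdd.map(lambda kv: ((kv[0], random.randint(0, 7)), kv[1]))
             .reduceByKey(lambda a, b: a + b)
             .map(lambda kv: (kv[0][0], kv[1]))
             .reduceByKey(lambda a, b: a + b))

print(counts.collect())
print(salted.collect())
```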

Reading pretty print json files in Apache Spark

Submitted by 前提是你 on 2021-02-07 13:50:22
Question: I have a lot of JSON files in my S3 bucket and I want to be able to read them and query those files. The problem is that they are pretty-printed. Each JSON file has just one massive dictionary, but it is not on one line. As per this thread, a dictionary in a JSON file should be on one line, which is a limitation of Apache Spark. I don't have it structured that way. My JSON schema looks like this:

{
  "dataset": [
    {
      "key1": [
        { "range": "range1", "value": 0.0 },
        { "range": "range2", "value": 0.23 }
      ]
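A minimal sketch of the usual approach, assuming Spark 2.2 or later: the JSON reader's multiLine option parses records that span multiple lines, so pretty-printed files can be read directly. The bucket path is hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("multiline-json").getOrCreate()

df = (spark.read
          .option("multiLine", True)             # allow one record to span many lines
          .json("s3a://my-bucket/path/*.json"))  # hypothetical path
df.printSchema()
```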

pyspark: groupby and then get max value of each group

Submitted by 末鹿安然 on 2021-02-07 13:12:55
Question: I would like to group by a value and then find the max value in each group using PySpark. I have the following code, but now I am a bit stuck on how to extract the max value.

# some file contains tuples ('user', 'item', 'occurrences')
data_file = sc.textFile('file:///some_file.txt')

# Create the triplet so I can index stuff
data_file = data_file.map(lambda l: l.split()).map(lambda l: (l[0], l[1], float(l[2])))

# Group by the user i.e. r[0]
grouped = data_file.groupBy(lambda r: r[0])

# Here is where
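A minimal sketch of one way to get the per-user maximum without materializing whole groups: key by user and let reduceByKey keep only the running maximum. The toy triplets below are hypothetical stand-ins for the file contents.

```python
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Toy triplets standing in for ('user', 'item', 'occurrences').
triplets = sc.parallelize([("u1", "i1", 2.0), ("u1", "i2", 5.0), ("u2", "i3", 1.0)])

# Key by user and keep only the (item, occurrences) pair with the larger count;
# reduceByKey avoids building full groups the way groupBy does.
max_per_user = (triplets.map(lambda t: (t[0], (t[1], t[2])))
                        .reduceByKey(lambda a, b: a if a[1] >= b[1] else b))

print(max_per_user.collect())   # e.g. [('u1', ('i2', 5.0)), ('u2', ('i3', 1.0))]
```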

Why does Spark (on Google Dataproc) not use all vcores?

Submitted by 杀马特。学长 韩版系。学妹 on 2021-02-07 12:31:55
Question: I'm running a Spark job on a Google Dataproc cluster, but it looks like Spark is not using all the vcores available in the cluster, as you can see below. Based on some other questions like this and this, I have set up the cluster to use DominantResourceCalculator so that both vcpus and memory are considered for resource allocation:

gcloud dataproc clusters create cluster_name --bucket="profiling-job-default" \
    --zone=europe-west1-c \
    --master-boot-disk-size=500GB \
    --worker-boot-disk-size=500GB \
    --master
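The excerpt is cut off, but one frequent contributor to this symptom is that each executor only requests the default number of vcores. A minimal sketch of setting the executor shape explicitly when the session is created; the numbers below are purely illustrative, not tuned for this cluster.

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
             .appName("use-more-vcores")
             .config("spark.executor.cores", "4")       # vcores requested per executor
             .config("spark.executor.memory", "10g")
             .config("spark.executor.instances", "8")
             .getOrCreate())

print(spark.sparkContext.defaultParallelism)
```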

Kafka Structured Streaming KafkaSourceProvider could not be instantiated

Submitted by 橙三吉。 on 2021-02-07 11:38:13
Question: I am working on a streaming project where I have a Kafka stream of ping statistics, like so:

64 bytes from vas.fractalanalytics.com (192.168.30.26): icmp_seq=1 ttl=62 time=0.913 ms
64 bytes from vas.fractalanalytics.com (192.168.30.26): icmp_seq=2 ttl=62 time=0.936 ms
64 bytes from vas.fractalanalytics.com (192.168.30.26): icmp_seq=3 ttl=62 time=0.980 ms
64 bytes from vas.fractalanalytics.com (192.168.30.26): icmp_seq=4 ttl=62 time=0.889 ms

I am trying to read this as a structured stream in
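A minimal sketch of reading these ping lines as a structured stream. It assumes the spark-sql-kafka-0-10 package matching your Spark/Scala version is on the classpath (for example, launched with --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.1); a missing or mismatched package is a common cause of the KafkaSourceProvider instantiation error. The broker address and topic name are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ping-stream").getOrCreate()

pings = (spark.readStream
             .format("kafka")
             .option("kafka.bootstrap.servers", "localhost:9092")  # hypothetical broker
             .option("subscribe", "ping-stats")                    # hypothetical topic
             .load()
             .selectExpr("CAST(value AS STRING) AS line"))

query = pings.writeStream.format("console").start()
query.awaitTermination()
```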

Spark + Hive : Number of partitions scanned exceeds limit (=4000)

Submitted by 有些话、适合烂在心里 on 2021-02-07 11:03:50
Question: We upgraded our Hadoop platform (Spark: 2.3.0, Hive: 3.1), and I'm facing this exception when reading some Hive tables in Spark: "Number of partitions scanned on table 'my_table' exceeds limit (=4000)". The tables we are working on:

table1: external table with a total of ~12300 partitions, partitioned by (col1: String, date1: String), ORC compressed with ZLIB
table2: external table with a total of 4585 partitions, partitioned by (col21: String, date2: Date, col22: String), ORC uncompressed

[A]
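The excerpt is truncated, and the 4000 limit itself is enforced on the Hive metastore side. A minimal sketch of the usual workaround in the Spark query: filter on the partition columns so Spark only asks the metastore for the partitions it needs instead of listing all of them. The table and column names follow the question; the date range is hypothetical.

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
             .appName("partition-pruning")
             .enableHiveSupport()
             .getOrCreate())

# Filtering on the partition columns prunes the metastore request to a subset
# of the ~12300 partitions of table1.
df = (spark.table("table1")
          .where("date1 >= '2021-01-01' AND date1 < '2021-02-01'"))
df.groupBy("col1").count().show()
```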

Spark 2.3 - Minikube - Kubernetes - Windows - Demo - SparkPi not found

Submitted by 牧云@^-^@ on 2021-02-07 10:59:32
Question: I am trying to follow this but I am encountering an error. In particular, when I run:

spark-submit.cmd --master k8s://https://192.168.1.40:8443 --deploy-mode cluster --name spark-pi --class org.apache.spark.examples.SparkPi --conf spark.executor.instances=1 --conf spark.kubernetes.container.image=spark:spark --conf spark.kubernetes.driver.pod.name=spark-pi-driver local:///opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar

I get:

2018-03-17 02:09:00 INFO LoggingPodStatusWatcherImpl:54 -

Difference of elements in list in PySpark

Submitted by 白昼怎懂夜的黑 on 2021-02-07 10:59:28
Question: I have a PySpark dataframe (df) with a column that contains lists with two elements. The two elements in the list are not ordered ascending or descending.

+--------+----------+-------+
| version| timestamp| list  |
+--------+----------+-------+
| v1     |2012-01-10| [5,2] |
| v1     |2012-01-11| [2,5] |
| v1     |2012-01-12| [3,2] |
| v2     |2012-01-12| [2,3] |
| v2     |2012-01-11| [1,2] |
| v2     |2012-01-13| [2,1] |
+--------+----------+-------+

I want to take the difference between the first and the
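The question is truncated, but assuming the goal is the difference between the two elements of the list column, a minimal PySpark sketch is below (taking the absolute difference, since the elements are not stored in a fixed order). The toy rows mirror the table above.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("list-diff").getOrCreate()

df = spark.createDataFrame(
    [("v1", "2012-01-10", [5, 2]), ("v1", "2012-01-11", [2, 5]), ("v2", "2012-01-13", [2, 1])],
    ["version", "timestamp", "list"],
)

# Index into the array column and take the absolute difference of its two elements.
df = df.withColumn("diff", F.abs(F.col("list")[0] - F.col("list")[1]))
df.show()
```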