apache-spark

Detecting repeating consecutive values in large datasets with Spark

Submitted by 感情迁移 on 2021-02-10 23:43:15
Question: Cheerz! Recently I have been trying out Spark, and so far I have observed quite interesting results, but currently I am stuck on the famous groupByKey OOM problem. Basically, the job searches large datasets for periods in which the measured value increases consecutively at least N times. I managed to get rid of the problem by writing the results to disk, but the application now runs much slower (which is expected due to the disk IO). Now the question: is
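
A common way to avoid groupByKey for this kind of problem is to detect the runs with window functions, so the data never has to be collected per key in memory. The sketch below is not the asker's code; the column names (sensor_id, ts, value) and the threshold N are assumptions.

from pyspark.sql import SparkSession, functions as F, Window

spark = SparkSession.builder.appName("consecutive-increases").getOrCreate()
N = 3  # minimum number of consecutive increases to report

df = spark.table("measurements")  # hypothetical source with sensor_id, ts, value

w = Window.partitionBy("sensor_id").orderBy("ts")

flagged = (df
    .withColumn("prev", F.lag("value").over(w))
    # 1 when the value increased relative to the previous row, else 0
    .withColumn("inc", F.when(F.col("value") > F.col("prev"), 1).otherwise(0))
    # gaps-and-islands: a running count of non-increase rows gives every
    # consecutive run of increases the same group id
    .withColumn("grp", F.sum(F.lit(1) - F.col("inc")).over(w)))

runs = (flagged
    .groupBy("sensor_id", "grp")
    .agg(F.sum("inc").alias("run_len"),
         F.min("ts").alias("start_ts"),
         F.max("ts").alias("end_ts"))
    .where(F.col("run_len") >= N))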

Count distinct in window functions

Submitted by 天涯浪子 on 2021-02-10 20:26:49
Question: I was trying to count the distinct values of column b for each c, without doing a group by. I know this could be done with a join; how can I do count(distinct b) over (partition by c) without resorting to a join? And why is count distinct not supported in window functions? Thank you in advance. Given this data frame: val df = Seq(("a1","b1","c1"), ("a2","b2","c1"), ("a3","b3","c1"), ("a31",null,"c1"), ("a32",null,"c1"), ("a4","b4","c11"), ("a5","b5","c11"), ("a6","b6","c11"), ("a7","b1","c2"), ("a8","b1","c3"), ("a9
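
One way to get the effect without a join is to compute size(collect_set(b)) over the partition window, since collect_set drops nulls and duplicates. The question is in Scala; the sketch below is a PySpark equivalent using the same data and column names.

from pyspark.sql import SparkSession, functions as F, Window

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("a1", "b1", "c1"), ("a2", "b2", "c1"), ("a3", "b3", "c1"),
     ("a31", None, "c1"), ("a32", None, "c1"), ("a4", "b4", "c11"),
     ("a5", "b5", "c11"), ("a6", "b6", "c11"), ("a7", "b1", "c2"),
     ("a8", "b1", "c3")],
    ["a", "b", "c"])

w = Window.partitionBy("c")
# collect_set ignores nulls and duplicates, so its size is the distinct count
result = df.withColumn("distinct_b", F.size(F.collect_set("b").over(w)))
# F.approx_count_distinct("b").over(w) is a cheaper alternative when an
# approximate answer is acceptable
result.show()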

Not able to set number of shuffle partition in pyspark

Submitted by ℡╲_俬逩灬. on 2021-02-10 19:57:54
Question: I know that by default the number of shuffle partitions is set to 200 in Spark, but I can't seem to change this. I'm running Jupyter with Spark 1.6 and loading a fairly small table with about 37K rows from Hive, using the following in my notebook: from pyspark.sql.functions import * sqlContext.sql("set spark.sql.shuffle.partitions=10") test= sqlContext.table('some_table') print test.rdd.getNumPartitions() print test.count() The output confirms 200 tasks. From the activity log, it's spinning up
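
For what it's worth, spark.sql.shuffle.partitions only governs the number of partitions produced by a shuffle; the partition count reported right after loading a Hive table comes from the input splits. A hedged sketch of how this usually plays out, with the placeholder table name from the question and a hypothetical grouping column:

sqlContext.setConf("spark.sql.shuffle.partitions", "10")  # same effect as the SET statement

test = sqlContext.table('some_table')
print(test.rdd.getNumPartitions())       # driven by the Hive input splits, not the setting

shuffled = test.groupBy("some_column").count()   # hypothetical column; forces a shuffle
print(shuffled.rdd.getNumPartitions())   # should now report 10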

ERROR PythonUDFRunner: Python worker exited unexpectedly (crashed)

Submitted by 时光怂恿深爱的人放手 on 2021-02-10 19:52:26
Question: I am running a PySpark job that calls UDFs. I know UDFs are memory-heavy and slow because of serialization/deserialization, but given the situation we have to use them. The dataset is 60GB and well partitioned, and the cluster has 240GB of memory. The job reads the data in and runs Spark functions fine, but it always fails with the error below when it starts calling the Python UDFs. At first I thought it was a memory issue, so I increased the memory of the nodes and executors, but the problem persists. What does this
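
The Python workers run outside the JVM heap, so raising executor memory alone often does not help; the usual knobs are the executor memory overhead and the per-worker Python memory, plus vectorized UDFs to cut serialization where possible. A sketch with placeholder values, not tuned recommendations (config names vary slightly across Spark versions and cluster managers):

from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import DoubleType

spark = (SparkSession.builder
    .appName("udf-job")
    # off-heap headroom for the Python worker processes
    # (spark.yarn.executor.memoryOverhead on older YARN deployments)
    .config("spark.executor.memoryOverhead", "4g")
    # memory each Python worker may use before spilling
    .config("spark.python.worker.memory", "2g")
    .getOrCreate())

# Where possible, a vectorized (pandas) UDF reduces serialization overhead
# compared to a row-at-a-time UDF (Spark 2.3+). Hypothetical example UDF:
@pandas_udf(DoubleType())
def scaled(v):
    return v * 2.0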

Cosine Similarity for two pyspark dataframes

Submitted by 穿精又带淫゛_ on 2021-02-10 19:31:21
Question: I have a PySpark DataFrame, df1, that looks like:

CustomerID  CustomerValue  CustomerValue2
12          .17            .08

I have a second PySpark DataFrame, df2:

CustomerID  CustomerValue  CustomerValue
15          .17            .14
16          .40            .43
18          .86            .09

I want to take the cosine similarity of the two dataframes and end up with something like:

CustomerID  CustomerID  CosineCustVal  CosineCustVal
15          12          1              .90
16          12          .45            .67
18          12          .8             .04

Answer 1: You can calculate cosine similarity only for two vectors, not for two numbers. That said, if the
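
A sketch of the cross-join approach the answer points toward: treat each customer's two value columns as a 2-dimensional vector and compute one cosine similarity per pair of customers. It assumes df2's second value column is also named CustomerValue2 (the question shows the header twice as CustomerValue, presumably a typo).

from pyspark.sql import functions as F

# df1 and df2 are the DataFrames from the question
pairs = df2.alias("l").crossJoin(df1.alias("r"))

dot = (F.col("l.CustomerValue") * F.col("r.CustomerValue")
       + F.col("l.CustomerValue2") * F.col("r.CustomerValue2"))
norm_l = F.sqrt(F.pow(F.col("l.CustomerValue"), 2) + F.pow(F.col("l.CustomerValue2"), 2))
norm_r = F.sqrt(F.pow(F.col("r.CustomerValue"), 2) + F.pow(F.col("r.CustomerValue2"), 2))

cosine = pairs.select(
    F.col("l.CustomerID").alias("CustomerID_df2"),
    F.col("r.CustomerID").alias("CustomerID_df1"),
    (dot / (norm_l * norm_r)).alias("CosineSimilarity"))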

Why does streaming query with Summarizer fail with “requirement failed: Nothing has been added to this summarizer”?

Submitted by 不羁岁月 on 2021-02-10 18:24:01
Question: This is a follow-up to "How to generate summary statistics (using Summarizer.metrics) in streaming query?" I am running a Python script to generate summary statistics for the micro-batches of a streaming query. Python code (which I am currently running): import sys import json import psycopg2 from pyspark import SparkContext from pyspark.streaming import StreamingContext from pyspark.sql import SparkSession from pyspark.sql.types import StructType, StructField, StringType, IntegerType from pyspark.sql
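
The "Nothing has been added to this summarizer" requirement failure is what Summarizer raises when it is asked to aggregate zero rows, which can happen on empty micro-batches. One hedged way around it is to compute the metrics inside foreachBatch and skip empty batches; the "features" vector column and streaming_df below are assumptions standing in for the asker's pipeline.

from pyspark.ml.stat import Summarizer
from pyspark.sql import functions as F

def summarize_batch(batch_df, batch_id):
    if batch_df.rdd.isEmpty():   # skip empty micro-batches
        return
    stats = batch_df.select(
        Summarizer.metrics("mean", "std").summary(F.col("features")).alias("summary"))
    stats.show(truncate=False)

query = (streaming_df.writeStream   # streaming_df comes from the asker's pipeline
    .outputMode("append")
    .foreachBatch(summarize_batch)
    .start())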

Spark writing to Cassandra with varying TTL

Submitted by 我们两清 on 2021-02-10 18:12:20
Question: In Java Spark, I have a DataFrame with a 'bucket_timestamp' column, which represents the time of the bucket that the row belongs to. I want to write the DataFrame to a Cassandra DB, and the data must be written with a TTL. The TTL should depend on the bucket timestamp: each row's TTL should be calculated as ROW_TTL = CONST_TTL - (CurrentTime - bucket_timestamp), where CONST_TTL is a constant TTL that I configured. Currently I am writing to Cassandra with Spark using a
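
The question is in Java; the PySpark sketch below only shows the TTL arithmetic from the question as a per-row column plus a connector-style write. How that column actually gets applied as a TTL depends on the connector version (the RDD API exposes per-row TTL via WriteConf/TTLOption.perRow in Scala/Java, and newer connector releases add per-row TTL support for DataFrame writes), so treat the ttl write option below as an assumption to verify against the connector docs. Keyspace and table names are hypothetical.

from pyspark.sql import functions as F

CONST_TTL = 86400  # seconds; placeholder constant

# ROW_TTL = CONST_TTL - (now - bucket_timestamp), in seconds
with_ttl = df.withColumn(
    "ttl",
    (F.lit(CONST_TTL) - (F.unix_timestamp(F.current_timestamp())
                         - F.unix_timestamp(F.col("bucket_timestamp")))).cast("int"))

(with_ttl.write
    .format("org.apache.spark.sql.cassandra")
    .options(keyspace="my_keyspace", table="my_table")
    .option("ttl", "ttl")   # per-row TTL column; connector-version dependent (assumption)
    .mode("append")
    .save())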