apache-spark | 易学教程

Detecting repeating consecutive values in large datasets with Spark

Question: Cheers. I have recently been trying out Spark, and so far I have observed quite interesting results, but I am currently stuck on the well-known groupByKey OOM problem. The job searches large datasets for periods where the measured value increases consecutively at least N times. I managed to get rid of the problem by writing the results to disk, but the application now runs much slower (which is expected, given the disk I/O). Now the question: is
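The search described in the question (periods where the value increases at least N times in a row) can be sketched in plain Python as a single pass over one series. This is only an illustration of the logic, not the asker's Spark job: the function name `increasing_runs` and its parameters are hypothetical, and it assumes the measurements are already sorted by time and that "increasing" means strictly greater than the previous value.

```python
from typing import List, Sequence, Tuple


def increasing_runs(values: Sequence[float], min_increases: int) -> List[Tuple[int, int]]:
    """Return inclusive (start, end) index pairs of maximal runs in which every
    value is strictly greater than the one before it, keeping only runs with
    at least `min_increases` consecutive increases."""
    runs: List[Tuple[int, int]] = []
    start = 0  # index where the current run begins
    for i in range(1, len(values) + 1):
        # The current run ends at the end of the input or when the value stops increasing.
        if i == len(values) or values[i] <= values[i - 1]:
            increases = (i - 1) - start  # number of increases inside the run
            if increases >= min_increases:
                runs.append((start, i - 1))
            start = i
    return runs


if __name__ == "__main__":
    series = [1, 2, 3, 4, 2, 3, 1, 5, 6, 7, 8]
    print(increasing_runs(series, min_increases=3))
```

Because the scan keeps only the start index of the current run, it never needs the whole series grouped in memory at once; how it would be wired into a Spark job is left open here.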

Count distinct in window functions

Question: I am trying to count the distinct values of column b for each c without doing a group by. I know this can be done with a join. How can I do count(distinct b) over (partition by c) without resorting to a join? Why is count distinct not supported in window functions? Thank you in advance. Given this data frame: val df = Seq(("a1","b1","c1"), ("a2","b2","c1"), ("a3","b3","c1"), ("a31",null,"c1"), ("a32",null,"c1"), ("a4","b4","c11"), ("a5","b5","c11"), ("a6","b6","c11"), ("a7","b1","c2"), ("a8","b1","c3"), ("a9

Not able to set number of shuffle partition in pyspark

Question: I know that by default the number of partitions for tasks is set to 200 in Spark. I can't seem to change this. I'm running Jupyter with Spark 1.6. I'm loading a fairly small table with about 37K rows from Hive using the following in my notebook:

    from pyspark.sql.functions import *
    sqlContext.sql("set spark.sql.shuffle.partitions=10")
    test = sqlContext.table('some_table')
    print test.rdd.getNumPartitions()
    print test.count()

The output confirms 200 tasks. From the activity log, it's spinning up

ERROR PythonUDFRunner: Python worker exited unexpectedly (crashed)

Question: I am running a PySpark job that calls UDFs. I know UDFs are memory-hungry and slow because of serialization and deserialization, but in our situation we have to use them. The dataset is 60 GB and well partitioned; the cluster has 240 GB of memory. The job runs fine while reading the data and applying Spark functions, but it always fails with the error below when it starts calling the Python UDFs. At first I thought it was a memory issue, so I increased the memory of the nodes and executors, but the problem persists. What does this

Cosine Similarity for two pyspark dataframes

Question: I have a PySpark DataFrame, df1, that looks like:

    CustomerID  CustomerValue  CustomerValue2
    12          .17            .08

I have a second PySpark DataFrame, df2:

    CustomerID  CustomerValue  CustomerValue
    15          .17            .14
    16          .40            .43
    18          .86            .09

I want to take the cosine similarity of the two DataFrames and get something like this:

    CustomerID  CustomerID  CosineCustVal  CosineCustVal
    15          12          1              .90
    16          12          .45            .67
    18          12          .8             .04

Answer 1: You can calculate cosine similarity only for two vectors, not for two numbers. That said, if the
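To illustrate the answer's point that cosine similarity is defined for vectors rather than single numbers, here is a minimal plain-Python sketch. It is not the answer's own code: the function name `cosine_similarity` is hypothetical, and treating each customer's two value columns as one 2-dimensional vector is an assumption made only for this example.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    # Assumption: each row's two value columns form one vector.
    customer_12 = [0.17, 0.08]
    customer_15 = [0.17, 0.14]
    print(cosine_similarity(customer_12, customer_15))

    # Two single positive numbers, treated as 1-dimensional vectors, point in
    # the same direction, so their similarity is 1 whatever their magnitudes.
    print(cosine_similarity([0.17], [0.40]))
```

The second call shows why a column-by-column result such as the one requested carries no information: for single numbers only the sign matters.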

Why does streaming query with Summarizer fail with “requirement failed: Nothing has been added to this summarizer”?

Question: This is a follow-up to "How to generate summary statistics (using Summarizer.metrics) in streaming query?" I am running a Python script to generate summary statistics of the micro-batches of a streaming query. Python code (that I am currently running):

    import sys
    import json
    import psycopg2
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext
    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, StringType, IntegerType
    from pyspark.sql

Spark writing to Cassandra with varying TTL

Question: In Java Spark, I have a DataFrame with a 'bucket_timestamp' column, which represents the time of the bucket that the row belongs to. I want to write the DataFrame to a Cassandra DB. The data must be written with a TTL. The TTL should depend on the bucket timestamp: each row's TTL should be calculated as ROW_TTL = CONST_TTL - (CurrentTime - bucket_timestamp), where CONST_TTL is a constant TTL that I configured. Currently I am writing to Cassandra with Spark using a
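The TTL formula in the question can be worked through with a small plain-Python sketch. It covers only the arithmetic, not the Cassandra write itself. The function name `row_ttl_seconds` is hypothetical, and it assumes that CurrentTime and bucket_timestamp are both Unix timestamps in seconds, so the result is in the seconds that Cassandra TTLs use.

```python
def row_ttl_seconds(const_ttl: int, current_time: int, bucket_timestamp: int) -> int:
    """ROW_TTL = CONST_TTL - (CurrentTime - bucket_timestamp), all in seconds.

    A result of zero or less means the row's lifetime has already elapsed;
    what to do with such rows (for example, skipping them) is up to the caller.
    """
    age = current_time - bucket_timestamp  # how old the bucket already is
    return const_ttl - age


if __name__ == "__main__":
    one_day = 86_400
    # A bucket that is 600 seconds old keeps the remainder of its one-day lifetime.
    print(row_ttl_seconds(one_day, current_time=1_000, bucket_timestamp=400))
    # A bucket older than CONST_TTL yields a non-positive value.
    print(row_ttl_seconds(600, current_time=2_000, bucket_timestamp=1_000))
```

Non-positive results deserve care before writing, since a TTL of 0 in Cassandra means the data does not expire.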