pyspark

Not able to set number of shuffle partitions in pyspark

Submitted by 大兔子大兔子 on 2021-02-10 19:57:47
Question: I know that by default the number of partitions for tasks in Spark is set to 200, but I can't seem to change this. I'm running Jupyter with Spark 1.6 and loading a fairly small table of about 37K rows from Hive, using the following in my notebook:

    from pyspark.sql.functions import *
    sqlContext.sql("set spark.sql.shuffle.partitions=10")
    test = sqlContext.table('some_table')
    print test.rdd.getNumPartitions()
    print test.count()

The output confirms 200 tasks. From the activity log, it's spinning up …
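A minimal sketch of one thing worth trying, assuming the same sqlContext as in the question (the column name some_column is a placeholder): set the property through setConf, and keep in mind that spark.sql.shuffle.partitions only affects stages that actually shuffle, not the partitioning of the table scan itself.

    # Sketch only: spark.sql.shuffle.partitions controls the reduce side of a shuffle,
    # so it shows up in groupBy/join stages, not in test.rdd.getNumPartitions().
    sqlContext.setConf("spark.sql.shuffle.partitions", "10")

    test = sqlContext.table('some_table')
    print(test.rdd.getNumPartitions())      # partitioning of the Hive scan, unaffected by the setting

    grouped = test.groupBy('some_column').count()   # 'some_column' is a placeholder
    print(grouped.rdd.getNumPartitions())   # should report 10 after the shuffle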

ERROR PythonUDFRunner: Python worker exited unexpectedly (crashed)

Submitted by 时光怂恿深爱的人放手 on 2021-02-10 19:52:26
Question: I am running a PySpark job that calls UDFs. I know UDFs are slow and memory-hungry because of serialization/deserialization, but given the situation we have to use them. The dataset is 60 GB and well partitioned, and the cluster has 240 GB of memory. The job reads the data and runs Spark functions fine, but it always fails once it starts calling the Python UDFs, with the error below. At first I thought it was a memory issue, so I increased the memory for the nodes and executors, but the problem persists. What does this …
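A minimal sketch of one common mitigation, assuming the Python worker is being killed for exceeding off-heap memory (the Python processes that run UDFs live outside the JVM heap, so raising executor memory alone does not help them); the property names vary slightly by Spark version and cluster manager, and the values below are illustrative.

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("udf-job")
             # memory reserved outside the JVM heap, where the Python UDF workers run
             .config("spark.executor.memoryOverhead", "4g")
             # cap per-worker Python memory used during aggregation before spilling to disk
             .config("spark.python.worker.memory", "2g")
             .getOrCreate())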

Cosine Similarity for two pyspark dataframes

Submitted by 穿精又带淫゛_ on 2021-02-10 19:31:21
Question: I have a PySpark DataFrame, df1, that looks like:

    CustomerID  CustomerValue  CustomerValue2
    12          .17            .08

I have a second PySpark DataFrame, df2:

    CustomerID  CustomerValue  CustomerValue2
    15          .17            .14
    16          .40            .43
    18          .86            .09

I want to take the cosine similarity of the two dataframes and end up with something like:

    CustomerID  CustomerID  CosineCustVal  CosineCustVal
    15          12          1              .90
    16          12          .45            .67
    18          12          .8             .04

Answer 1: You can calculate cosine similarity only for two vectors, not for two numbers. That said, if the …
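A minimal sketch of the cross-join approach the answer points toward, treating the two value columns of each row as a 2-dimensional vector (column names are taken from the question, and Spark 2.x is assumed for crossJoin):

    from pyspark.sql import functions as F

    pairs = df1.alias("a").crossJoin(df2.alias("b"))

    dot = (F.col("a.CustomerValue") * F.col("b.CustomerValue")
           + F.col("a.CustomerValue2") * F.col("b.CustomerValue2"))
    norm_a = F.sqrt(F.col("a.CustomerValue") ** 2 + F.col("a.CustomerValue2") ** 2)
    norm_b = F.sqrt(F.col("b.CustomerValue") ** 2 + F.col("b.CustomerValue2") ** 2)

    result = pairs.select(
        F.col("b.CustomerID").alias("CustomerID"),
        F.col("a.CustomerID").alias("CustomerID_df1"),
        (dot / (norm_a * norm_b)).alias("CosineSim"),
    )
    result.show()

This yields one cosine value per customer pair rather than one per value column, which is what the "two vectors, not two numbers" remark in the answer amounts to.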

Why does streaming query with Summarizer fail with “requirement failed: Nothing has been added to this summarizer”?

Submitted by 不羁岁月 on 2021-02-10 18:24:01
Question: This is a follow-up to "How to generate summary statistics (using Summarizer.metrics) in streaming query?" I am running a Python script to generate summary statistics over the micro-batches of a streaming query. Python code (which I am currently running):

    import sys
    import json
    import psycopg2
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext
    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, StringType, IntegerType
    from pyspark.sql …
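A minimal sketch of one way to avoid the message, assuming it is triggered by an empty micro-batch (Summarizer has nothing to aggregate when a batch contains no rows); streaming_df and the vector column name "features" are placeholders, and Spark 2.4+ is assumed for foreachBatch and pyspark.ml.stat.Summarizer.

    from pyspark.ml.stat import Summarizer
    from pyspark.sql import functions as F

    def process_batch(batch_df, batch_id):
        # Skip empty micro-batches; summarizing them raises
        # "requirement failed: Nothing has been added to this summarizer".
        if batch_df.rdd.isEmpty():
            return
        stats = batch_df.select(
            Summarizer.metrics("mean", "count").summary(F.col("features")).alias("stats"))
        stats.show(truncate=False)

    query = (streaming_df.writeStream          # streaming_df is a placeholder streaming DataFrame
             .foreachBatch(process_batch)
             .start())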

spark-class java no such file or directory

Submitted by 眉间皱痕 on 2021-02-10 14:20:46
Question: I am a newbie to Spark/Scala. I have set up Spark, Scala, and sbt on a fully distributed cluster. When I test it and issue the pyspark command, I get the following error:

    /home/hadoop/spark/bin/spark-class: line 75: /usr/lib/jvm/java-7-openjdk-amd64/jre/bin/java - no such file or directory

My .bashrc contains:

    export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64

hadoop-env.sh contains:

    export JAVA_HOME=/usr/lib/jvm/java7-openjdk-amd64/jre/

conf/spark-env.sh contains:

    JAVA_HOME=usr/lib/jvm/java7 …

spark-nlp 'JavaPackage' object is not callable

Submitted by 会有一股神秘感。 on 2021-02-10 12:56:19
Question: I am using JupyterLab to run spark-nlp text analysis. At the moment I am just running the sample code:

    import sparknlp
    from pyspark.sql import SparkSession
    from sparknlp.pretrained import PretrainedPipeline

    # create or get Spark Session
    # spark = sparknlp.start()
    spark = SparkSession.builder \
        .appName("ner") \
        .master("local[4]") \
        .config("spark.driver.memory", "8G") \
        .config("spark.driver.maxResultSize", "2G") \
        .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.11:2.6.5") \
        …
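A minimal sketch of the usual way around this: the 'JavaPackage' object is not callable error generally means the spark-nlp jar never made it onto the JVM classpath (often a Scala/Spark/spark-nlp version mismatch), and letting sparknlp.start() build the session sidesteps the manual spark.jars.packages configuration. The pipeline name below is one of the published pretrained pipelines and needs internet access to download.

    import sparknlp
    from sparknlp.pretrained import PretrainedPipeline

    spark = sparknlp.start()                 # attaches a matching spark-nlp jar for you
    print(sparknlp.version(), spark.version)

    pipeline = PretrainedPipeline("explain_document_dl", lang="en")
    print(pipeline.annotate("Spark NLP pretrained pipelines run without extra setup."))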

Pyspark groupBy DataFrame without aggregation or count

Submitted by 狂风中的少年 on 2021-02-10 12:18:09
Question: Can you iterate through a PySpark groupBy DataFrame without aggregation or count? For example, in pandas:

    for i, d in df2:
        mycode ....

    ^^ if using pandas ^^

Is iterating over a groupBy different in PySpark, or do you have to use aggregation and count?

Answer 1: At best you can use .first or .last to get the respective values from the groupBy, but not everything in the way you can in pandas. For example:

    from pyspark.sql import functions as f
    df.groupBy(df['some_col']).agg(f.first(df['col1']), f.first(df[ …
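A minimal sketch of the closest equivalent to iterating pandas groups, assuming the groups are small enough to bring to the driver; some_col comes from the answer, while col1 and col2 are placeholder column names.

    from pyspark.sql import functions as F

    grouped = (df.groupBy("some_col")
                 .agg(F.collect_list(F.struct("col1", "col2")).alias("rows")))

    # One Row per group, roughly like `for key, frame in df.groupby(...)` in pandas.
    for row in grouped.toLocalIterator():
        print(row["some_col"], row["rows"])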

PySpark: An error occurred while calling o51.showString. No module named XXX

Submitted by 你。 on 2021-02-10 11:49:34
Question: My PySpark version is 2.2.0. I ran into a strange problem, which I will try to simplify as follows. The file structure:

    |root
    |-- cast_to_float.py
    |-- tests
        |-- test.py

In cast_to_float.py, my code:

    from pyspark.sql.types import FloatType
    from pyspark.sql.functions import udf

    def cast_to_float(y, column_name):
        return y.withColumn(column_name, y[column_name].cast(FloatType()))

    def cast_to_float_1(y, column_name):
        to_float = udf(cast2float1, FloatType())
        return y.withColumn(column_name, to_float …
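A minimal sketch of the usual fix when a udf raises "No module named ..." on the executors while the same import works on the driver: ship the module explicitly with addPyFile (the path below assumes the layout from the question and that the script is launched from the root directory).

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("udf-imports").getOrCreate()

    # Distribute the module to every executor (and add it to the driver's sys.path),
    # so the serialized udf can import it when it runs inside a Python worker.
    spark.sparkContext.addPyFile("cast_to_float.py")

    from cast_to_float import cast_to_float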

pyspark. zip arrays in a dataframe

Submitted by 笑着哭i on 2021-02-10 09:32:27
Question: I have the following PySpark DataFrame:

    +------+----------------+
    |    id|            data|
    +------+----------------+
    |     1|    [10, 11, 12]|
    |     2|    [20, 21, 22]|
    |     3|    [30, 31, 32]|
    +------+----------------+

At the end, I want to have the following DataFrame:

    +--------+----------------------------------+
    |      id|                              data|
    +--------+----------------------------------+
    | [1,2,3]|[[10,20,30],[11,21,31],[12,22,32]]|
    +--------+----------------------------------+

In order to do this, first I extract the data arrays as …
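A minimal sketch of one way to get there, assuming the data arrays hold integers; the transpose udf is an illustrative helper, and the ordering of collect_list is not guaranteed in general, so a real job may need an explicit sort key.

    from pyspark.sql import functions as F
    from pyspark.sql.types import ArrayType, IntegerType

    # Transpose a list of equal-length arrays: rows become columns.
    transpose = F.udf(lambda rows: [list(col) for col in zip(*rows)],
                      ArrayType(ArrayType(IntegerType())))

    collected = df.agg(F.collect_list("id").alias("id"),
                       F.collect_list("data").alias("data"))
    result = collected.select("id", transpose("data").alias("data"))

    result.show(truncate=False)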