pyspark

Not able to set number of shuffle partitions in pyspark

Submitted by 大兔子大兔子 on 2021-02-10 19:57:47
Question: I know that by default the number of partitions for tasks in Spark is set to 200, but I can't seem to change this. I'm running Jupyter with Spark 1.6 and loading a fairly small table of about 37K rows from Hive, using the following in my notebook:

    from pyspark.sql.functions import *
    sqlContext.sql("set spark.sql.shuffle.partitions=10")
    test = sqlContext.table('some_table')
    print test.rdd.getNumPartitions()
    print test.count()

The output confirms 200 tasks. From the activity log, it's spinning up …
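A minimal sketch of one thing worth trying, assuming the same sqlContext as in the question (the column name some_column is a placeholder): set the property through setConf, and keep in mind that spark.sql.shuffle.partitions only affects stages that actually shuffle, not the partitioning of the table scan itself.

    # Sketch only: spark.sql.shuffle.partitions controls the reduce side of a shuffle,
    # so it shows up in groupBy/join stages, not in test.rdd.getNumPartitions().
    sqlContext.setConf("spark.sql.shuffle.partitions", "10")

    test = sqlContext.table('some_table')
    print(test.rdd.getNumPartitions())      # partitioning of the Hive scan, unaffected by the setting

    grouped = test.groupBy('some_column').count()   # 'some_column' is a placeholder
    print(grouped.rdd.getNumPartitions())   # should report 10 after the shuffle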

ERROR PythonUDFRunner: Python worker exited unexpectedly (crashed)

Submitted by 时光怂恿深爱的人放手 on 2021-02-10 19:52:26
Question: I am running a PySpark job that calls UDFs. I know UDFs are slow and memory-hungry because of serialization/deserialization, but given the situation we have to use them. The dataset is 60 GB and well partitioned, and the cluster has 240 GB of memory. The job reads the data and runs Spark functions fine, but it always fails once it starts calling the Python UDFs, with the error below. At first I thought it was a memory issue, so I increased the memory for the nodes and executors, but the problem persists. What does this …
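A minimal sketch of one common mitigation, assuming the Python worker is being killed for exceeding off-heap memory (the Python processes that run UDFs live outside the JVM heap, so raising executor memory alone does not help them); the property names vary slightly by Spark version and cluster manager, and the values below are illustrative.

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("udf-job")
             # memory reserved outside the JVM heap, where the Python UDF workers run
             .config("spark.executor.memoryOverhead", "4g")
             # cap per-worker Python memory used during aggregation before spilling to disk
             .config("spark.python.worker.memory", "2g")
             .getOrCreate())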

Cosine Similarity for two pyspark dataframes

Submitted by 穿精又带淫゛_ on 2021-02-10 19:31:21
Question: I have a PySpark DataFrame, df1, that looks like:

    CustomerID  CustomerValue  CustomerValue2
    12          .17            .08

I have a second PySpark DataFrame, df2:

    CustomerID  CustomerValue  CustomerValue2
    15          .17            .14
    16          .40            .43
    18          .86            .09

I want to take the cosine similarity of the two dataframes and end up with something like:

    CustomerID  CustomerID  CosineCustVal  CosineCustVal
    15          12          1              .90
    16          12          .45            .67
    18          12          .8             .04

Answer 1: You can calculate cosine similarity only for two vectors, not for two numbers. That said, if the …
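A minimal sketch of the cross-join approach the answer points toward, treating the two value columns of each row as a 2-dimensional vector (column names are taken from the question, and Spark 2.x is assumed for crossJoin):

    from pyspark.sql import functions as F

    pairs = df1.alias("a").crossJoin(df2.alias("b"))

    dot = (F.col("a.CustomerValue") * F.col("b.CustomerValue")
           + F.col("a.CustomerValue2") * F.col("b.CustomerValue2"))
    norm_a = F.sqrt(F.col("a.CustomerValue") ** 2 + F.col("a.CustomerValue2") ** 2)
    norm_b = F.sqrt(F.col("b.CustomerValue") ** 2 + F.col("b.CustomerValue2") ** 2)

    result = pairs.select(
        F.col("b.CustomerID").alias("CustomerID"),
        F.col("a.CustomerID").alias("CustomerID_df1"),
        (dot / (norm_a * norm_b)).alias("CosineSim"),
    )
    result.show()

This yields one cosine value per customer pair rather than one per value column, which is what the "two vectors, not two numbers" remark in the answer amounts to.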

Why does streaming query with Summarizer fail with “requirement failed: Nothing has been added to this summarizer”?

Submitted by 不羁岁月 on 2021-02-10 18:24:01
Question: This is a follow-up to "How to generate summary statistics (using Summarizer.metrics) in streaming query?" I am running a Python script to generate summary statistics over the micro-batches of a streaming query. Python code (which I am currently running):

    import sys
    import json
    import psycopg2
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext
    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, StringType, IntegerType
    from pyspark.sql …
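A minimal sketch of one way to avoid the message, assuming it is triggered by an empty micro-batch (Summarizer has nothing to aggregate when a batch contains no rows); streaming_df and the vector column name "features" are placeholders, and Spark 2.4+ is assumed for foreachBatch and pyspark.ml.stat.Summarizer.

    from pyspark.ml.stat import Summarizer
    from pyspark.sql import functions as F

    def process_batch(batch_df, batch_id):
        # Skip empty micro-batches; summarizing them raises
        # "requirement failed: Nothing has been added to this summarizer".
        if batch_df.rdd.isEmpty():
            return
        stats = batch_df.select(
            Summarizer.metrics("mean", "count").summary(F.col("features")).alias("stats"))
        stats.show(truncate=False)

    query = (streaming_df.writeStream          # streaming_df is a placeholder streaming DataFrame
             .foreachBatch(process_batch)
             .start())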

spark-class java no such file or directory

Submitted by 眉间皱痕 on 2021-02-10 14:20:46
Question: I am a newbie to Spark/Scala. I have set up Spark, Scala, and sbt on a fully distributed cluster. When I test it and issue the pyspark command, I get the following error:

    /home/hadoop/spark/bin/spark-class: line 75: /usr/lib/jvm/java-7-openjdk-amd64/jre/bin/java - no such file or directory

My .bashrc contains:

    export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64

hadoop-env.sh contains:

    export JAVA_HOME=/usr/lib/jvm/java7-openjdk-amd64/jre/

conf/spark-env.sh contains:

    JAVA_HOME=usr/lib/jvm/java7 …

spark-nlp 'JavaPackage' object is not callable

Submitted by 会有一股神秘感。 on 2021-02-10 12:56:19
Question: I am using JupyterLab to run spark-nlp text analysis. At the moment I am just running the sample code:

    import sparknlp
    from pyspark.sql import SparkSession
    from sparknlp.pretrained import PretrainedPipeline

    # create or get Spark Session
    # spark = sparknlp.start()
    spark = SparkSession.builder \
        .appName("ner") \
        .master("local[4]") \
        .config("spark.driver.memory", "8G") \
        .config("spark.driver.maxResultSize", "2G") \
        .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.11:2.6.5") \
        …
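A minimal sketch of the usual way around this: the 'JavaPackage' object is not callable error generally means the spark-nlp jar never made it onto the JVM classpath (often a Scala/Spark/spark-nlp version mismatch), and letting sparknlp.start() build the session sidesteps the manual spark.jars.packages configuration. The pipeline name below is one of the published pretrained pipelines and needs internet access to download.

    import sparknlp
    from sparknlp.pretrained import PretrainedPipeline

    spark = sparknlp.start()                 # attaches a matching spark-nlp jar for you
    print(sparknlp.version(), spark.version)

    pipeline = PretrainedPipeline("explain_document_dl", lang="en")
    print(pipeline.annotate("Spark NLP pretrained pipelines run without extra setup."))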

Pyspark groupBy DataFrame without aggregation or count

Submitted by 狂风中的少年 on 2021-02-10 12:18:09
Question: Can you iterate through a PySpark groupBy DataFrame without aggregation or count? For example, in pandas:

    for i, d in df2:
        mycode ....

    ^^ if using pandas ^^

Is iterating over a groupBy different in PySpark, or do you have to use aggregation and count?

Answer 1: At best you can use .first or .last to get the respective values from the groupBy, but not everything in the way you can in pandas. For example:

    from pyspark.sql import functions as f
    df.groupBy(df['some_col']).agg(f.first(df['col1']), f.first(df[ …
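A minimal sketch of the closest equivalent to iterating pandas groups, assuming the groups are small enough to bring to the driver; some_col comes from the answer, while col1 and col2 are placeholder column names.

    from pyspark.sql import functions as F

    grouped = (df.groupBy("some_col")
                 .agg(F.collect_list(F.struct("col1", "col2")).alias("rows")))

    # One Row per group, roughly like `for key, frame in df.groupby(...)` in pandas.
    for row in grouped.toLocalIterator():
        print(row["some_col"], row["rows"])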

PySpark: An error occurred while calling o51.showString. No module named XXX

Submitted by 你。 on 2021-02-10 11:49:34
Question: My PySpark version is 2.2.0. I ran into a strange problem, which I will try to simplify as follows. The file structure:

    |root
    |-- cast_to_float.py
    |-- tests
        |-- test.py

In cast_to_float.py, my code:

    from pyspark.sql.types import FloatType
    from pyspark.sql.functions import udf

    def cast_to_float(y, column_name):
        return y.withColumn(column_name, y[column_name].cast(FloatType()))

    def cast_to_float_1(y, column_name):
        to_float = udf(cast2float1, FloatType())
        return y.withColumn(column_name, to_float …
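A minimal sketch of the usual fix when a udf raises "No module named ..." on the executors while the same import works on the driver: ship the module explicitly with addPyFile (the path below assumes the layout from the question and that the script is launched from the root directory).

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("udf-imports").getOrCreate()

    # Distribute the module to every executor (and add it to the driver's sys.path),
    # so the serialized udf can import it when it runs inside a Python worker.
    spark.sparkContext.addPyFile("cast_to_float.py")

    from cast_to_float import cast_to_float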

pyspark. zip arrays in a dataframe

Submitted by 笑着哭i on 2021-02-10 09:32:27
Question: I have the following PySpark DataFrame:

    +------+----------------+
    |    id|            data|
    +------+----------------+
    |     1|    [10, 11, 12]|
    |     2|    [20, 21, 22]|
    |     3|    [30, 31, 32]|
    +------+----------------+

At the end, I want to have the following DataFrame:

    +--------+----------------------------------+
    |      id|                              data|
    +--------+----------------------------------+
    | [1,2,3]|[[10,20,30],[11,21,31],[12,22,32]]|
    +--------+----------------------------------+

In order to do this, first I extract the data arrays as …
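A minimal sketch of one way to get there, assuming the data arrays hold integers; the transpose udf is an illustrative helper, and the ordering of collect_list is not guaranteed in general, so a real job may need an explicit sort key.

    from pyspark.sql import functions as F
    from pyspark.sql.types import ArrayType, IntegerType

    # Transpose a list of equal-length arrays: rows become columns.
    transpose = F.udf(lambda rows: [list(col) for col in zip(*rows)],
                      ArrayType(ArrayType(IntegerType())))

    collected = df.agg(F.collect_list("id").alias("id"),
                       F.collect_list("data").alias("data"))
    result = collected.select("id", transpose("data").alias("data"))

    result.show(truncate=False)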