pyspark

pyspark.sql.utils.IllegalArgumentException: "Error while instantiating 'org.apache.spark.sql.hive.HiveSessionStateBuilder'" in Windows 10

不羁的心 submitted on 2019-12-23 18:43:45
Question: I have installed Spark 2.2 with winutils on Windows 10. When I run pyspark I hit the exception below: pyspark.sql.utils.IllegalArgumentException: "Error while instantiating 'org.apache.spark.sql.hive.HiveSessionStateBuilder'" I have already tried granting 777 permissions on the tmp/hive folder with winutils.exe chmod -R 777 C:\tmp\hive but the problem remains the same after applying it. I am using pyspark 2.2 on Windows 10. Here is the spark-shell env. Here is
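One workaround sometimes suggested for this class of error on Windows (a sketch only, not confirmed for this particular setup; the warehouse path is just an example) is to create the SparkSession with an explicit, writable spark.sql.warehouse.dir so Hive session state does not depend solely on the /tmp/hive permissions:

```python
from pyspark.sql import SparkSession

# Sketch of a possible workaround: point the SQL warehouse at a directory the
# current Windows user can definitely write to. The path is only an example.
spark = (SparkSession.builder
         .appName("windows-hive-workaround")
         .config("spark.sql.warehouse.dir", "file:///C:/tmp/spark-warehouse")
         .getOrCreate())

print(spark.range(5).collect())  # simple smoke test that the session came up
```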

pyspark join rdds by a specific key

末鹿安然 submitted on 2019-12-23 16:49:01
Question: I have two RDDs that I need to join together. They look like the following: RDD1 [(u'2', u'100', 2), (u'1', u'300', 1), (u'1', u'200', 1)] RDD2 [(u'1', u'2'), (u'1', u'3')] My desired output is: [(u'1', u'2', u'100', 2)] So I would like to select the records from RDD2 whose second element matches the first element of a record in RDD1. I have tried join and also cartesian, and neither works or even comes close to what I am looking for. I am new to Spark and would appreciate any help. Thanks. Answer 1:
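A minimal sketch of one way to get the desired row, assuming rdd1 and rdd2 hold the tuples shown above: key RDD1 by its first element, key RDD2 by its second element, and join on that key.

```python
# rdd1: [(u'2', u'100', 2), (u'1', u'300', 1), (u'1', u'200', 1)]
# rdd2: [(u'1', u'2'), (u'1', u'3')]

# Key RDD1 by its first element, keeping the remaining fields as the value.
rdd1_keyed = rdd1.map(lambda r: (r[0], (r[1], r[2])))

# Key RDD2 by its *second* element, keeping the first as the value.
rdd2_keyed = rdd2.map(lambda r: (r[1], r[0]))

# Inner join on the shared key, then flatten back into a 4-tuple.
joined = (rdd2_keyed.join(rdd1_keyed)
          .map(lambda kv: (kv[1][0], kv[0], kv[1][1][0], kv[1][1][1])))

print(joined.collect())  # [(u'1', u'2', u'100', 2)]
```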

AttributeError: module 'numpy' has no attribute 'core'

断了今生、忘了曾经 submitted on 2019-12-23 16:24:16
Question: I was wondering if anyone has hit this issue when running Spark and trying to import numpy. numpy imports properly in a standard notebook, but when I try importing it in a notebook running Spark, I get the error below. I have the most recent version of numpy and am running the most recent Anaconda Python 3.6. Thanks! --------------------------------------------------------------------------- AttributeError Traceback (most recent call last) in () ----> 1 import numpy /Users/michaelthomas/anaconda/lib
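A common cause of this kind of mismatch (an assumption here, not confirmed from the truncated traceback) is that the Spark-backed kernel runs a different Python interpreter or site-packages than the plain notebook. A small diagnostic sketch to compare the two environments:

```python
import os
import sys

# Run this in both the plain notebook and the Spark-backed notebook; if the
# values differ, the Spark kernel is picking up another interpreter or an
# older/broken numpy from a different site-packages directory.
print(sys.executable)
print([p for p in sys.path if "site-packages" in p])
print(os.environ.get("PYSPARK_PYTHON"))
print(os.environ.get("PYSPARK_DRIVER_PYTHON"))
```

If they differ, pointing PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON at the Anaconda interpreter before starting pyspark is one way to line them up.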

Should the DataFrame function groupBy be avoided?

风流意气都作罢 submitted on 2019-12-23 16:17:05
Question: This link and others tell me that Spark's groupByKey should not be used when there is a large number of keys, since Spark shuffles all the keys around. Does the same apply to the groupBy function as well, or is it something different? I'm asking because I want to do what this question tries to do, but I have a very large number of keys. It should be possible to do this without shuffling all the data around, by reducing locally on each node, but I can't find the PySpark way to do this
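A short sketch of the two usual alternatives, using made-up example data: reduceByKey for RDDs, which combines values locally on each partition before the shuffle, and DataFrame groupBy().agg(...), which Spark executes with a partial (map-side) aggregation step and is therefore not the same as RDD groupByKey:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# RDD route: reduceByKey pre-aggregates per partition, so only one partial
# result per key per partition crosses the network.
pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
print(pairs.reduceByKey(lambda x, y: x + y).collect())  # [('a', 4), ('b', 2)] (order may vary)

# DataFrame route: groupBy followed by an aggregate uses partial aggregation
# under the hood, unlike collecting whole groups with RDD groupByKey.
df = spark.createDataFrame(pairs, ["key", "value"])
df.groupBy("key").agg(F.sum("value").alias("total")).show()
```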

pyspark program throwing name 'spark' is not defined

半城伤御伤魂 submitted on 2019-12-23 15:59:13
Question: The program below throws the error name 'spark' is not defined: Traceback (most recent call last): File "pgm_latest.py", line 232, in <module> sconf = SparkConf().set(spark.dynamicAllocation.enabled,true) .set(spark.dynamicAllocation.maxExecutors,300) .set(spark.shuffle.service.enabled,true) .set(spark.shuffle.spill.compress,true) NameError: name 'spark' is not defined It is submitted with spark-submit --driver-memory 12g --master yarn-cluster --executor-memory 6g --executor-cores 3 pgm_latest.py Code #!/usr/bin/python
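The traceback suggests the configuration keys and values are written as bare names, so Python looks for a variable called spark. A minimal sketch of the likely fix is to pass them as strings:

```python
#!/usr/bin/python
from pyspark import SparkConf, SparkContext

# Configuration keys and values are plain strings; writing them unquoted makes
# Python evaluate 'spark' and 'true' as (undefined) names, hence the NameError.
conf = (SparkConf()
        .set("spark.dynamicAllocation.enabled", "true")
        .set("spark.dynamicAllocation.maxExecutors", "300")
        .set("spark.shuffle.service.enabled", "true")
        .set("spark.shuffle.spill.compress", "true"))

sc = SparkContext(conf=conf)
```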

pyspark replace all values in dataframe with another values

醉酒当歌 submitted on 2019-12-23 14:59:12
Question: I have 500 columns in my pyspark DataFrame... Some are of string type, some int and some boolean (100 boolean columns). All the boolean columns have two distinct levels, Yes and No, and I want to convert those into 1/0. For the string column I have three values: passed, failed and null. How do I replace those nulls with 0? fillna(0) works only with integers.
c1  | c2  | c3     | c4  | c5 ..... | c500
yes | yes | passed | 45 ....
No  | Yes | failed | 452 ....
Yes | No  | None   | 32 ............
When I do df.replace(yes,1) I get
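A sketch of one approach, assuming df is the DataFrame above and that the column lists below (boolean_cols, string_cols) are stand-ins for the real 500-column schema:

```python
from pyspark.sql import functions as F

boolean_cols = ["c1", "c2"]   # hypothetical names for the Yes/No columns
string_cols = ["c3"]          # hypothetical names for the passed/failed/null columns

# Map Yes/No to 1/0 (case-insensitively); anything else becomes null because
# there is no otherwise() clause.
for c in boolean_cols:
    df = df.withColumn(
        c,
        F.when(F.lower(F.col(c)) == "yes", 1)
         .when(F.lower(F.col(c)) == "no", 0))

# fillna(0) only applies to numeric columns, so fill string columns with "0".
df = df.fillna("0", subset=string_cols)
```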

ModuleNotFoundError because PySpark serializer is not able to locate library folder

房东的猫 submitted on 2019-12-23 10:26:17
Question: I have the following folder structure:
- libfolder
  - lib1.py
  - lib2.py
- main.py
main.py calls libfolder.lib1.py, which then calls libfolder.lib2.py and others. It all works perfectly fine on my local machine, but after I deploy it to Dataproc I get the following error: File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 455, in loads return pickle.loads(obj, encoding=encoding) ModuleNotFoundError: No module named 'libfolder' I have zipped the folder into xyz.zip and run the
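A sketch of one way to make the package importable on the executors as well as the driver, assuming xyz.zip contains libfolder/ (with an __init__.py) at its top level:

```python
# main.py (sketch)
from pyspark import SparkContext

sc = SparkContext(appName="libfolder-demo")

# Ship the archive to every executor (and add it to the driver's sys.path) so
# that pickled functions referring to libfolder can be deserialized on workers.
sc.addPyFile("xyz.zip")

from libfolder import lib1

# Alternatively, pass the archive at submit time, e.g.:
#   spark-submit --py-files xyz.zip main.py
```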

Python Round Function Issues with pyspark

大憨熊 submitted on 2019-12-23 09:59:16
Question: I am relatively new to Spark and I've run into an issue when I try to use Python's built-in round() function after importing pyspark functions. It seems to have to do with how I import the pyspark functions, but I am not sure what the difference is or why one way causes issues and the other doesn't. Expected behavior: import pyspark.sql.functions print(round(3.14159265359,2)) >>> 3.14 Unexpected behavior: from pyspark.sql.functions import * print(round(3.14159265359,2)) >>> ERROR
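The star import replaces the built-in round with pyspark.sql.functions.round, which expects a Column. A short sketch of two ways to keep both usable:

```python
# Option 1: import the Spark functions under a namespace so nothing is shadowed.
import pyspark.sql.functions as F
print(round(3.14159265359, 2))  # 3.14, Python's built-in round
# F.round(some_column, 2) would be the Spark version for DataFrame columns.

# Option 2: if a star import is already in place, reach the builtin explicitly.
from pyspark.sql.functions import *   # round now refers to the Spark function
import builtins
print(builtins.round(3.14159265359, 2))  # 3.14
```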

Add jar to pyspark when using notebook

笑着哭i submitted on 2019-12-23 09:49:05
Question: I'm trying the MongoDB Hadoop integration with Spark but can't figure out how to make the jars accessible to an IPython notebook. Here is what I'm trying to do: # set up parameters for reading from MongoDB via Hadoop input format config = {"mongo.input.uri": "mongodb://localhost:27017/db.collection"} inputFormatClassName = "com.mongodb.hadoop.MongoInputFormat" # these values worked but others might as well keyClassName = "org.apache.hadoop.io.Text" valueClassName = "org.apache.hadoop.io
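One way to hand extra jars to the JVM that a notebook-started SparkContext launches (a sketch; the jar paths are placeholders) is to set PYSPARK_SUBMIT_ARGS before the context is created:

```python
import os

# Must be set before the SparkContext starts the JVM gateway; the jar paths
# below are placeholders for the mongo-hadoop and MongoDB driver jars.
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--jars /path/to/mongo-hadoop-core.jar,/path/to/mongo-java-driver.jar "
    "pyspark-shell"
)

from pyspark import SparkContext
sc = SparkContext(appName="mongo-hadoop-notebook")
```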

java.lang.NoSuchMethodError: net.jpountz.util.Utils.checkRange

ⅰ亾dé卋堺 submitted on 2019-12-23 09:42:10
Question: I use spark-streaming 2.2.0 with Python and read data from a Kafka (2.11-0.10.0.0) cluster. I submit a Python script with spark-submit --jars spark-streaming-kafka-0-8-assembly_2.11-2.2.0.jar hodor.py and Spark reports this error: 17/08/04 10:52:00 ERROR Utils: Uncaught exception in thread stdout writer for python java.lang.NoSuchMethodError: net.jpountz.util.Utils.checkRange([BII)V at org.apache.kafka.common.message.KafkaLZ4BlockInputStream.read(KafkaLZ4BlockInputStream.java:176) at