pyspark

pyspark.sql.utils.IllegalArgumentException: "Error while instantiating 'org.apache.spark.sql.hive.HiveSessionStateBuilder'" in Windows 10

不羁的心 submitted on 2019-12-23 18:43:45
Question: I have installed Spark 2.2 with winutils on Windows 10. When I run pyspark I hit the exception below: pyspark.sql.utils.IllegalArgumentException: "Error while instantiating 'org.apache.spark.sql.hive.HiveSessionStateBuilder'" I have already tried granting 777 permissions on the tmp/hive folder with winutils.exe chmod -R 777 C:\tmp\hive but the problem remains the same after applying it. I am using pyspark 2.2 on Windows 10. Here is the spark-shell env. Here is
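One workaround sometimes suggested for this class of error on Windows (a sketch only, not confirmed for this particular setup; the warehouse path is just an example) is to create the SparkSession with an explicit, writable spark.sql.warehouse.dir so Hive session state does not depend solely on the /tmp/hive permissions:

```python
from pyspark.sql import SparkSession

# Sketch of a possible workaround: point the SQL warehouse at a directory the
# current Windows user can definitely write to. The path is only an example.
spark = (SparkSession.builder
         .appName("windows-hive-workaround")
         .config("spark.sql.warehouse.dir", "file:///C:/tmp/spark-warehouse")
         .getOrCreate())

print(spark.range(5).collect())  # simple smoke test that the session came up
```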

pyspark join rdds by a specific key

末鹿安然 submitted on 2019-12-23 16:49:01
Question: I have two RDDs that I need to join together. They look like the following: RDD1 [(u'2', u'100', 2), (u'1', u'300', 1), (u'1', u'200', 1)] RDD2 [(u'1', u'2'), (u'1', u'3')] My desired output is: [(u'1', u'2', u'100', 2)] So I would like to select the records from RDD2 whose second element matches the first element of a record in RDD1. I have tried join and also cartesian, and neither works or even comes close to what I am looking for. I am new to Spark and would appreciate any help. Thanks. Answer 1:
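A minimal sketch of one way to get the desired row, assuming rdd1 and rdd2 hold the tuples shown above: key RDD1 by its first element, key RDD2 by its second element, and join on that key.

```python
# rdd1: [(u'2', u'100', 2), (u'1', u'300', 1), (u'1', u'200', 1)]
# rdd2: [(u'1', u'2'), (u'1', u'3')]

# Key RDD1 by its first element, keeping the remaining fields as the value.
rdd1_keyed = rdd1.map(lambda r: (r[0], (r[1], r[2])))

# Key RDD2 by its *second* element, keeping the first as the value.
rdd2_keyed = rdd2.map(lambda r: (r[1], r[0]))

# Inner join on the shared key, then flatten back into a 4-tuple.
joined = (rdd2_keyed.join(rdd1_keyed)
          .map(lambda kv: (kv[1][0], kv[0], kv[1][1][0], kv[1][1][1])))

print(joined.collect())  # [(u'1', u'2', u'100', 2)]
```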

AttributeError: module 'numpy' has no attribute 'core'

断了今生、忘了曾经 submitted on 2019-12-23 16:24:16
Question: I was wondering if anyone has hit this issue when running Spark and trying to import numpy. numpy imports properly in a standard notebook, but when I try importing it in a notebook running Spark, I get the error below. I have the most recent version of numpy and am running the most recent Anaconda Python 3.6. Thanks! --------------------------------------------------------------------------- AttributeError Traceback (most recent call last) in () ----> 1 import numpy /Users/michaelthomas/anaconda/lib
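A common cause of this kind of mismatch (an assumption here, not confirmed from the truncated traceback) is that the Spark-backed kernel runs a different Python interpreter or site-packages than the plain notebook. A small diagnostic sketch to compare the two environments:

```python
import os
import sys

# Run this in both the plain notebook and the Spark-backed notebook; if the
# values differ, the Spark kernel is picking up another interpreter or an
# older/broken numpy from a different site-packages directory.
print(sys.executable)
print([p for p in sys.path if "site-packages" in p])
print(os.environ.get("PYSPARK_PYTHON"))
print(os.environ.get("PYSPARK_DRIVER_PYTHON"))
```

If they differ, pointing PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON at the Anaconda interpreter before starting pyspark is one way to line them up.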

Should the DataFrame function groupBy be avoided?

风流意气都作罢 submitted on 2019-12-23 16:17:05
Question: This link and others tell me that Spark's groupByKey should not be used when there is a large number of keys, since Spark shuffles all the keys around. Does the same apply to the groupBy function as well, or is it something different? I'm asking because I want to do what this question tries to do, but I have a very large number of keys. It should be possible to do this without shuffling all the data around, by reducing locally on each node, but I can't find the PySpark way to do this
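A short sketch of the two usual alternatives, using made-up example data: reduceByKey for RDDs, which combines values locally on each partition before the shuffle, and DataFrame groupBy().agg(...), which Spark executes with a partial (map-side) aggregation step and is therefore not the same as RDD groupByKey:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# RDD route: reduceByKey pre-aggregates per partition, so only one partial
# result per key per partition crosses the network.
pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
print(pairs.reduceByKey(lambda x, y: x + y).collect())  # [('a', 4), ('b', 2)] (order may vary)

# DataFrame route: groupBy followed by an aggregate uses partial aggregation
# under the hood, unlike collecting whole groups with RDD groupByKey.
df = spark.createDataFrame(pairs, ["key", "value"])
df.groupBy("key").agg(F.sum("value").alias("total")).show()
```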

pyspark program throwing name 'spark' is not defined

半城伤御伤魂 submitted on 2019-12-23 15:59:13
Question: The program below throws the error name 'spark' is not defined: Traceback (most recent call last): File "pgm_latest.py", line 232, in <module> sconf = SparkConf().set(spark.dynamicAllocation.enabled,true) .set(spark.dynamicAllocation.maxExecutors,300) .set(spark.shuffle.service.enabled,true) .set(spark.shuffle.spill.compress,true) NameError: name 'spark' is not defined It is submitted with spark-submit --driver-memory 12g --master yarn-cluster --executor-memory 6g --executor-cores 3 pgm_latest.py Code #!/usr/bin/python
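The traceback suggests the configuration keys and values are written as bare names, so Python looks for a variable called spark. A minimal sketch of the likely fix is to pass them as strings:

```python
#!/usr/bin/python
from pyspark import SparkConf, SparkContext

# Configuration keys and values are plain strings; writing them unquoted makes
# Python evaluate 'spark' and 'true' as (undefined) names, hence the NameError.
conf = (SparkConf()
        .set("spark.dynamicAllocation.enabled", "true")
        .set("spark.dynamicAllocation.maxExecutors", "300")
        .set("spark.shuffle.service.enabled", "true")
        .set("spark.shuffle.spill.compress", "true"))

sc = SparkContext(conf=conf)
```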

pyspark replace all values in dataframe with another values

醉酒当歌 submitted on 2019-12-23 14:59:12
Question: I have 500 columns in my pyspark DataFrame... Some are of string type, some int and some boolean (100 boolean columns). All the boolean columns have two distinct levels, Yes and No, and I want to convert those into 1/0. For the string column I have three values: passed, failed and null. How do I replace those nulls with 0? fillna(0) works only with integers.
c1  | c2  | c3     | c4  | c5 ..... | c500
yes | yes | passed | 45 ....
No  | Yes | failed | 452 ....
Yes | No  | None   | 32 ............
When I do df.replace(yes,1) I get
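A sketch of one approach, assuming df is the DataFrame above and that the column lists below (boolean_cols, string_cols) are stand-ins for the real 500-column schema:

```python
from pyspark.sql import functions as F

boolean_cols = ["c1", "c2"]   # hypothetical names for the Yes/No columns
string_cols = ["c3"]          # hypothetical names for the passed/failed/null columns

# Map Yes/No to 1/0 (case-insensitively); anything else becomes null because
# there is no otherwise() clause.
for c in boolean_cols:
    df = df.withColumn(
        c,
        F.when(F.lower(F.col(c)) == "yes", 1)
         .when(F.lower(F.col(c)) == "no", 0))

# fillna(0) only applies to numeric columns, so fill string columns with "0".
df = df.fillna("0", subset=string_cols)
```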

ModuleNotFoundError because PySpark serializer is not able to locate library folder

房东的猫 submitted on 2019-12-23 10:26:17
Question: I have the following folder structure:
- libfolder
  - lib1.py
  - lib2.py
- main.py
main.py calls libfolder.lib1.py, which then calls libfolder.lib2.py and others. It all works perfectly fine on my local machine, but after I deploy it to Dataproc I get the following error: File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 455, in loads return pickle.loads(obj, encoding=encoding) ModuleNotFoundError: No module named 'libfolder' I have zipped the folder into xyz.zip and run the
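A sketch of one way to make the package importable on the executors as well as the driver, assuming xyz.zip contains libfolder/ (with an __init__.py) at its top level:

```python
# main.py (sketch)
from pyspark import SparkContext

sc = SparkContext(appName="libfolder-demo")

# Ship the archive to every executor (and add it to the driver's sys.path) so
# that pickled functions referring to libfolder can be deserialized on workers.
sc.addPyFile("xyz.zip")

from libfolder import lib1

# Alternatively, pass the archive at submit time, e.g.:
#   spark-submit --py-files xyz.zip main.py
```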

Python Round Function Issues with pyspark

大憨熊 submitted on 2019-12-23 09:59:16
Question: I am relatively new to Spark and I've run into an issue when I try to use Python's built-in round() function after importing pyspark functions. It seems to have to do with how I import the pyspark functions, but I am not sure what the difference is or why one way causes issues and the other doesn't. Expected behavior: import pyspark.sql.functions print(round(3.14159265359,2)) >>> 3.14 Unexpected behavior: from pyspark.sql.functions import * print(round(3.14159265359,2)) >>> ERROR
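The star import replaces the built-in round with pyspark.sql.functions.round, which expects a Column. A short sketch of two ways to keep both usable:

```python
# Option 1: import the Spark functions under a namespace so nothing is shadowed.
import pyspark.sql.functions as F
print(round(3.14159265359, 2))  # 3.14, Python's built-in round
# F.round(some_column, 2) would be the Spark version for DataFrame columns.

# Option 2: if a star import is already in place, reach the builtin explicitly.
from pyspark.sql.functions import *   # round now refers to the Spark function
import builtins
print(builtins.round(3.14159265359, 2))  # 3.14
```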

Add jar to pyspark when using notebook

笑着哭i submitted on 2019-12-23 09:49:05
Question: I'm trying the MongoDB Hadoop integration with Spark but can't figure out how to make the jars accessible to an IPython notebook. Here is what I'm trying to do: # set up parameters for reading from MongoDB via Hadoop input format config = {"mongo.input.uri": "mongodb://localhost:27017/db.collection"} inputFormatClassName = "com.mongodb.hadoop.MongoInputFormat" # these values worked but others might as well keyClassName = "org.apache.hadoop.io.Text" valueClassName = "org.apache.hadoop.io
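One way to hand extra jars to the JVM that a notebook-started SparkContext launches (a sketch; the jar paths are placeholders) is to set PYSPARK_SUBMIT_ARGS before the context is created:

```python
import os

# Must be set before the SparkContext starts the JVM gateway; the jar paths
# below are placeholders for the mongo-hadoop and MongoDB driver jars.
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--jars /path/to/mongo-hadoop-core.jar,/path/to/mongo-java-driver.jar "
    "pyspark-shell"
)

from pyspark import SparkContext
sc = SparkContext(appName="mongo-hadoop-notebook")
```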

java.lang.NoSuchMethodError: net.jpountz.util.Utils.checkRange

ⅰ亾dé卋堺 submitted on 2019-12-23 09:42:10
Question: I use spark-streaming 2.2.0 with Python and read data from a Kafka (2.11-0.10.0.0) cluster. I submit a Python script with spark-submit --jars spark-streaming-kafka-0-8-assembly_2.11-2.2.0.jar hodor.py and Spark reports this error: 17/08/04 10:52:00 ERROR Utils: Uncaught exception in thread stdout writer for python java.lang.NoSuchMethodError: net.jpountz.util.Utils.checkRange([BII)V at org.apache.kafka.common.message.KafkaLZ4BlockInputStream.read(KafkaLZ4BlockInputStream.java:176) at