pyspark

Spark + s3 - error - java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found

Submitted by 我是研究僧i on 2020-03-21 22:00:32
Question: I have a Spark EC2 cluster where I am submitting a pyspark program from a Zeppelin notebook. I have loaded hadoop-aws-2.7.3.jar and aws-java-sdk-1.11.179.jar and placed them in the /opt/spark/jars directory of the Spark instances. I get a java.lang.NoClassDefFoundError: com/amazonaws/AmazonServiceException. Why is Spark not seeing the jars? Do I have to have the jars on all the slaves and specify a spark-defaults.conf for the master and slaves? Is there something that needs to be configured…
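
A minimal sketch of one common way to make the S3A classes visible to the driver and every executor, assuming the jars can be resolved at session start via spark.jars.packages rather than copied by hand (the bucket and key below are hypothetical, and the hadoop-aws version must match the Hadoop build Spark ships with; 2.7.3 mirrors the jar named in the question):

from pyspark.sql import SparkSession

# Pull hadoop-aws (and its aws-java-sdk dependency) in as a package so both the
# driver and the executors see the same classes.
spark = (
    SparkSession.builder
    .appName("s3a-check")
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:2.7.3")
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    .getOrCreate()
)

# Hypothetical bucket and key, just to exercise the s3a:// scheme.
df = spark.read.text("s3a://some-bucket/some-prefix/sample.txt")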

How to partition data dynamically in this use-case

Submitted by 一世执手 on 2020-03-21 11:05:53
Question: I am using spark-sql-2.4.1 version. I have code for a scenario like the below. val superDataset = // load the whole data set of student marks records ... assume 10 years of data val selectedYrsDataset = superDataset.repartition("--GivenYears--") // i.e. GivenYears are 2010, 2011 On the selectedYrsDataset I need to calculate the year-wise toppers overall, country-wise, state-wise, and college-wise. How do I handle this kind of use case? Is there any possibility of doing it dynamic…
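
A sketch of one way to do this, written in PySpark (the question uses Scala) and assuming hypothetical column names year, country, state, college, and marks, since the real schema isn't shown. Each call ranks students within one grouping level per year with a window function and keeps the top scorer:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

def toppers(df, group_cols):
    # Rank inside each group by marks and keep only the highest scorer.
    w = Window.partitionBy(*group_cols).orderBy(F.col("marks").desc())
    return (df.withColumn("rnk", F.row_number().over(w))
              .filter(F.col("rnk") == 1)
              .drop("rnk"))

selected = superDataset.filter(F.col("year").isin(2010, 2011))
country_toppers = toppers(selected, ["year", "country"])
state_toppers = toppers(selected, ["year", "country", "state"])
college_toppers = toppers(selected, ["year", "country", "state", "college"])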

spark-submit on kubernetes cluster

Submitted by 假如想象 on 2020-03-21 07:01:57
Question: I have created a simple word count program jar file, which is tested and works fine. However, when I try to run the same jar file on my Kubernetes cluster it throws an error. Below is my spark-submit command along with the error thrown. spark-submit --master k8s://https://192.168.99.101:8443 --deploy-mode cluster --name WordCount --class com.sample.WordCount --conf spark.executor.instances=5 --conf spark.kubernetes.container.image=debuggerrr/spark-new:spark-new local:///C:/Users/siddh…
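
A sketch of how the submission usually has to look, with the caveat that this is not necessarily the poster's exact fix: in cluster deploy mode on Kubernetes, a local:// URI is resolved inside the driver container, not on the Windows machine running spark-submit, so the jar has to be baked into the image (or served from HTTP/S3/HDFS) and referenced by its in-container path. The jar path below is hypothetical:

spark-submit \
  --master k8s://https://192.168.99.101:8443 \
  --deploy-mode cluster \
  --name WordCount \
  --class com.sample.WordCount \
  --conf spark.executor.instances=5 \
  --conf spark.kubernetes.container.image=debuggerrr/spark-new:spark-new \
  local:///opt/spark/examples/jars/wordcount.jar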
How to fix 'TypeError: an integer is required (got type bytes)' error when trying to run pyspark after installing spark 2.4.4

Submitted by 痴心易碎 on 2020-03-18 11:00:30
Question: I've installed OpenJDK 13.0.1, Python 3.8, and Spark 2.4.4. The instructions to test the install say to run .\bin\pyspark from the root of the Spark installation. I'm not sure if I missed a step in the Spark installation, like setting some environment variable, but I can't find any further detailed instructions. I can run the Python interpreter on my machine, so I'm confident that it is installed correctly, and running "java -version" gives me the expected response, so I don't think the problem…
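
The question is cut off before any resolution; a commonly reported cause of exactly this TypeError is that the cloudpickle bundled with Spark 2.4.4 does not work on Python 3.8. A hedged sketch of the usual workaround (run from the Spark root in PowerShell; the Python 3.7 interpreter path is hypothetical), the alternative being to move to Spark 3.x, which supports 3.8:

$env:PYSPARK_PYTHON = "C:\Python37\python.exe"         # hypothetical Python 3.7 path
$env:PYSPARK_DRIVER_PYTHON = "C:\Python37\python.exe"  # hypothetical Python 3.7 path
.\bin\pyspark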

How to subtract a column of days from a column of dates in Pyspark?

Submitted by 时光总嘲笑我的痴心妄想 on 2020-03-18 10:54:09
Question: Given the following PySpark DataFrame df = sqlContext.createDataFrame([('2015-01-15', 10), ('2015-02-15', 5)], ('date_col', 'days_col')), how can the days column be subtracted from the date column? In this example, the resulting column should be ['2015-01-05', '2015-02-10']. I looked into pyspark.sql.functions.date_sub(), but it requires a date column and a single day, i.e. date_sub(df['date_col'], 10). Ideally, I'd prefer to do date_sub(df['date_col'], df['days_col']). I also tried…
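
A minimal sketch of one way around this: the Python date_sub() helper only accepts a literal day count, but the SQL form of the function accepts a column for the second argument, so it can be wrapped in expr():

from pyspark.sql import functions as F

# Subtract days_col from date_col row by row via the SQL expression form.
result = df.withColumn("new_date", F.expr("date_sub(date_col, days_col)"))
result.show()
# Expected per the question: 2015-01-05 and 2015-02-10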

apply OneHotEncoder for several categorical columns in Spark MLlib

Submitted by 时光毁灭记忆、已成空白 on 2020-03-17 09:03:41
Question: I have several categorical features and would like to transform them all using OneHotEncoder. However, when I try to apply the StringIndexer, I get an error: stringIndexer = StringIndexer( inputCol = ['a', 'b', 'c', 'd'], outputCol = ['a_index', 'b_index', 'c_index', 'd_index'] ) model = stringIndexer.fit(Data) An error occurred while calling o328.fit. : java.lang.ClassCastException: java.util.ArrayList cannot be cast to java.lang.String at org.apache.spark.ml.feature.StringIndexer.fit…
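
A sketch assuming Spark 2.3/2.4 and the column names from the question: StringIndexer only takes a single inputCol at that version, so the usual pattern is one indexer per column chained in a Pipeline, while the encoder stage (OneHotEncoderEstimator, renamed OneHotEncoder in Spark 3) does accept lists of columns:

from pyspark.ml import Pipeline
from pyspark.ml.feature import OneHotEncoderEstimator, StringIndexer

cat_cols = ["a", "b", "c", "d"]

# One StringIndexer per categorical column, then a single multi-column encoder.
indexers = [StringIndexer(inputCol=c, outputCol=c + "_index") for c in cat_cols]
encoder = OneHotEncoderEstimator(
    inputCols=[c + "_index" for c in cat_cols],
    outputCols=[c + "_vec" for c in cat_cols],
)

encoded = Pipeline(stages=indexers + [encoder]).fit(Data).transform(Data)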

Find mean of pyspark array<double>

Submitted by 自古美人都是妖i on 2020-03-16 03:01:05
Question: In pyspark, I have a variable-length array of doubles for which I would like to find the mean. However, the average function requires a single numeric type. Is there a way to find the average of an array without exploding the array out? I have several different arrays and I'd like to be able to do something like the following: df.select(col("Segment.Points.trajectory_points.longitude")) DataFrame[longitude: array] df.select(avg(col("Segment.Points.trajectory_points.longitude"))).show() org…
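
A sketch assuming Spark 2.4+, where SQL higher-order functions are available: sum the array with aggregate() and divide by size(), so nothing has to be exploded. The nested column path is copied from the question; adjust it to the actual schema:

from pyspark.sql import functions as F

# Per-row mean of an array<double> column without exploding it.
path = "Segment.Points.trajectory_points.longitude"
df = df.withColumn(
    "longitude_mean",
    F.expr(f"aggregate({path}, CAST(0 AS DOUBLE), (acc, x) -> acc + x) / size({path})"),
)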

When submitting a job with pyspark, how to access static files uploaded with the --files argument?

Submitted by 跟風遠走 on 2020-03-11 02:47:45
Question: For example, I have a folder: / - test.py - test.yml and the job is submitted to the Spark cluster with: gcloud beta dataproc jobs submit pyspark --files=test.yml "test.py" In test.py, I want to access the static file I uploaded: with open('test.yml') as test_file: logging.info(test_file.read()) but I get the following exception: IOError: [Errno 2] No such file or directory: 'test.yml' How do I access the file I uploaded? Answer 1: Files distributed using SparkContext.addFile (and --files) can be…
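
A sketch of the approach the (truncated) answer points at: files shipped with --files land in a per-application directory on each node, and SparkFiles.get() resolves the absolute path to them:

import logging

from pyspark import SparkFiles
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Resolve the distributed copy of test.yml instead of assuming the working directory.
with open(SparkFiles.get("test.yml")) as test_file:
    logging.info(test_file.read())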

inferSchema in spark csv package

Submitted by 那年仲夏 on 2020-03-06 09:25:10
Question: I am trying to read a CSV file as a Spark DataFrame by enabling inferSchema, but then I am unable to get fv_df.columns. Below is the error message: >>> fv_df = spark.read.option("header", "true").option("delimiter", "\t").csv('/home/h212957/FacilityView/datapoints_FV.csv', inferSchema=True) >>> fv_df.columns Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/home/h212957/spark/python/pyspark/sql/dataframe.py", line 687, in columns return [f.name for f in self.schema…
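
A sketch only, since the traceback above is cut off and the root cause isn't visible. Two things worth trying are passing inferSchema as a string-valued read option, or skipping inference entirely with an explicit schema (the field names and types below are hypothetical):

from pyspark.sql.types import DoubleType, StringType, StructField, StructType

# Variant 1: inferSchema supplied as a read option.
fv_df = (spark.read
         .option("header", "true")
         .option("delimiter", "\t")
         .option("inferSchema", "true")
         .csv("/home/h212957/FacilityView/datapoints_FV.csv"))

# Variant 2: explicit schema, no inference pass over the file.
schema = StructType([
    StructField("facility", StringType(), True),   # hypothetical column
    StructField("value", DoubleType(), True),      # hypothetical column
])
fv_df = (spark.read
         .option("header", "true")
         .option("delimiter", "\t")
         .schema(schema)
         .csv("/home/h212957/FacilityView/datapoints_FV.csv"))
print(fv_df.columns)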