pyspark

Spark + s3 - error - java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found

Submitted by 我是研究僧i on 2020-03-21 22:00:32
Question: I have a Spark EC2 cluster where I am submitting a pyspark program from a Zeppelin notebook. I have loaded hadoop-aws-2.7.3.jar and aws-java-sdk-1.11.179.jar and placed them in the /opt/spark/jars directory of the Spark instances. I get a java.lang.NoClassDefFoundError: com/amazonaws/AmazonServiceException. Why is Spark not seeing the jars? Do I have to have the jars on all the slaves and specify a spark-defaults.conf for the master and slaves? Is there something that needs to be configured…
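
A minimal sketch of one common way to make the S3A classes visible to the driver and every executor, assuming the jars can be resolved at session start via spark.jars.packages rather than copied by hand (the bucket and key below are hypothetical, and the hadoop-aws version must match the Hadoop build Spark ships with; 2.7.3 mirrors the jar named in the question):

from pyspark.sql import SparkSession

# Pull hadoop-aws (and its aws-java-sdk dependency) in as a package so both the
# driver and the executors see the same classes.
spark = (
    SparkSession.builder
    .appName("s3a-check")
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:2.7.3")
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    .getOrCreate()
)

# Hypothetical bucket and key, just to exercise the s3a:// scheme.
df = spark.read.text("s3a://some-bucket/some-prefix/sample.txt")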

How to partition data dynamically in this use-case

Submitted by 一世执手 on 2020-03-21 11:05:53
Question: I am using spark-sql-2.4.1 version. I have code for a scenario like the below. val superDataset = // load the whole data set of student marks records ... assume 10 years of data val selectedYrsDataset = superDataset.repartition("--GivenYears--") // i.e. GivenYears are 2010, 2011 On the selectedYrsDataset I need to calculate the year-wise toppers overall, country-wise, state-wise, and college-wise. How do I handle this kind of use case? Is there any possibility of doing it dynamic…
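
A sketch of one way to do this, written in PySpark (the question uses Scala) and assuming hypothetical column names year, country, state, college, and marks, since the real schema isn't shown. Each call ranks students within one grouping level per year with a window function and keeps the top scorer:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

def toppers(df, group_cols):
    # Rank inside each group by marks and keep only the highest scorer.
    w = Window.partitionBy(*group_cols).orderBy(F.col("marks").desc())
    return (df.withColumn("rnk", F.row_number().over(w))
              .filter(F.col("rnk") == 1)
              .drop("rnk"))

selected = superDataset.filter(F.col("year").isin(2010, 2011))
country_toppers = toppers(selected, ["year", "country"])
state_toppers = toppers(selected, ["year", "country", "state"])
college_toppers = toppers(selected, ["year", "country", "state", "college"])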

spark-submit on kubernetes cluster

Submitted by 假如想象 on 2020-03-21 07:01:57
Question: I have created a simple word count program jar file, which is tested and works fine. However, when I try to run the same jar file on my Kubernetes cluster it throws an error. Below is my spark-submit command along with the error thrown. spark-submit --master k8s://https://192.168.99.101:8443 --deploy-mode cluster --name WordCount --class com.sample.WordCount --conf spark.executor.instances=5 --conf spark.kubernetes.container.image=debuggerrr/spark-new:spark-new local:///C:/Users/siddh…
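
A sketch of how the submission usually has to look, with the caveat that this is not necessarily the poster's exact fix: in cluster deploy mode on Kubernetes, a local:// URI is resolved inside the driver container, not on the Windows machine running spark-submit, so the jar has to be baked into the image (or served from HTTP/S3/HDFS) and referenced by its in-container path. The jar path below is hypothetical:

spark-submit \
  --master k8s://https://192.168.99.101:8443 \
  --deploy-mode cluster \
  --name WordCount \
  --class com.sample.WordCount \
  --conf spark.executor.instances=5 \
  --conf spark.kubernetes.container.image=debuggerrr/spark-new:spark-new \
  local:///opt/spark/examples/jars/wordcount.jar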
How to fix 'TypeError: an integer is required (got type bytes)' error when trying to run pyspark after installing spark 2.4.4

Submitted by 痴心易碎 on 2020-03-18 11:00:30
Question: I've installed OpenJDK 13.0.1, Python 3.8, and Spark 2.4.4. The instructions to test the install say to run .\bin\pyspark from the root of the Spark installation. I'm not sure if I missed a step in the Spark installation, like setting some environment variable, but I can't find any further detailed instructions. I can run the Python interpreter on my machine, so I'm confident that it is installed correctly, and running "java -version" gives me the expected response, so I don't think the problem…
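
The question is cut off before any resolution; a commonly reported cause of exactly this TypeError is that the cloudpickle bundled with Spark 2.4.4 does not work on Python 3.8. A hedged sketch of the usual workaround (run from the Spark root in PowerShell; the Python 3.7 interpreter path is hypothetical), the alternative being to move to Spark 3.x, which supports 3.8:

$env:PYSPARK_PYTHON = "C:\Python37\python.exe"         # hypothetical Python 3.7 path
$env:PYSPARK_DRIVER_PYTHON = "C:\Python37\python.exe"  # hypothetical Python 3.7 path
.\bin\pyspark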

How to subtract a column of days from a column of dates in Pyspark?

Submitted by 时光总嘲笑我的痴心妄想 on 2020-03-18 10:54:09
Question: Given the following PySpark DataFrame df = sqlContext.createDataFrame([('2015-01-15', 10), ('2015-02-15', 5)], ('date_col', 'days_col')), how can the days column be subtracted from the date column? In this example, the resulting column should be ['2015-01-05', '2015-02-10']. I looked into pyspark.sql.functions.date_sub(), but it requires a date column and a single day, i.e. date_sub(df['date_col'], 10). Ideally, I'd prefer to do date_sub(df['date_col'], df['days_col']). I also tried…
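
A minimal sketch of one way around this: the Python date_sub() helper only accepts a literal day count, but the SQL form of the function accepts a column for the second argument, so it can be wrapped in expr():

from pyspark.sql import functions as F

# Subtract days_col from date_col row by row via the SQL expression form.
result = df.withColumn("new_date", F.expr("date_sub(date_col, days_col)"))
result.show()
# Expected per the question: 2015-01-05 and 2015-02-10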

apply OneHotEncoder for several categorical columns in Spark MLlib

Submitted by 时光毁灭记忆、已成空白 on 2020-03-17 09:03:41
Question: I have several categorical features and would like to transform them all using OneHotEncoder. However, when I try to apply the StringIndexer, I get an error: stringIndexer = StringIndexer( inputCol = ['a', 'b', 'c', 'd'], outputCol = ['a_index', 'b_index', 'c_index', 'd_index'] ) model = stringIndexer.fit(Data) An error occurred while calling o328.fit. : java.lang.ClassCastException: java.util.ArrayList cannot be cast to java.lang.String at org.apache.spark.ml.feature.StringIndexer.fit…
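
A sketch assuming Spark 2.3/2.4 and the column names from the question: StringIndexer only takes a single inputCol at that version, so the usual pattern is one indexer per column chained in a Pipeline, while the encoder stage (OneHotEncoderEstimator, renamed OneHotEncoder in Spark 3) does accept lists of columns:

from pyspark.ml import Pipeline
from pyspark.ml.feature import OneHotEncoderEstimator, StringIndexer

cat_cols = ["a", "b", "c", "d"]

# One StringIndexer per categorical column, then a single multi-column encoder.
indexers = [StringIndexer(inputCol=c, outputCol=c + "_index") for c in cat_cols]
encoder = OneHotEncoderEstimator(
    inputCols=[c + "_index" for c in cat_cols],
    outputCols=[c + "_vec" for c in cat_cols],
)

encoded = Pipeline(stages=indexers + [encoder]).fit(Data).transform(Data)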

Find mean of pyspark array<double>

Submitted by 自古美人都是妖i on 2020-03-16 03:01:05
Question: In pyspark, I have a variable-length array of doubles for which I would like to find the mean. However, the average function requires a single numeric type. Is there a way to find the average of an array without exploding the array out? I have several different arrays and I'd like to be able to do something like the following: df.select(col("Segment.Points.trajectory_points.longitude")) DataFrame[longitude: array] df.select(avg(col("Segment.Points.trajectory_points.longitude"))).show() org…
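
A sketch assuming Spark 2.4+, where SQL higher-order functions are available: sum the array with aggregate() and divide by size(), so nothing has to be exploded. The nested column path is copied from the question; adjust it to the actual schema:

from pyspark.sql import functions as F

# Per-row mean of an array<double> column without exploding it.
path = "Segment.Points.trajectory_points.longitude"
df = df.withColumn(
    "longitude_mean",
    F.expr(f"aggregate({path}, CAST(0 AS DOUBLE), (acc, x) -> acc + x) / size({path})"),
)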

When submitting a job with pyspark, how to access static files uploaded with the --files argument?

Submitted by 跟風遠走 on 2020-03-11 02:47:45
Question: For example, I have a folder: / - test.py - test.yml and the job is submitted to the Spark cluster with: gcloud beta dataproc jobs submit pyspark --files=test.yml "test.py" In test.py, I want to access the static file I uploaded: with open('test.yml') as test_file: logging.info(test_file.read()) but I get the following exception: IOError: [Errno 2] No such file or directory: 'test.yml' How do I access the file I uploaded? Answer 1: Files distributed using SparkContext.addFile (and --files) can be…
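
A sketch of the approach the (truncated) answer points at: files shipped with --files land in a per-application directory on each node, and SparkFiles.get() resolves the absolute path to them:

import logging

from pyspark import SparkFiles
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Resolve the distributed copy of test.yml instead of assuming the working directory.
with open(SparkFiles.get("test.yml")) as test_file:
    logging.info(test_file.read())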

inferSchema in spark csv package

Submitted by 那年仲夏 on 2020-03-06 09:25:10
Question: I am trying to read a CSV file as a Spark DataFrame by enabling inferSchema, but then I am unable to get fv_df.columns. Below is the error message: >>> fv_df = spark.read.option("header", "true").option("delimiter", "\t").csv('/home/h212957/FacilityView/datapoints_FV.csv', inferSchema=True) >>> fv_df.columns Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/home/h212957/spark/python/pyspark/sql/dataframe.py", line 687, in columns return [f.name for f in self.schema…
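
A sketch only, since the traceback above is cut off and the root cause isn't visible. Two things worth trying are passing inferSchema as a string-valued read option, or skipping inference entirely with an explicit schema (the field names and types below are hypothetical):

from pyspark.sql.types import DoubleType, StringType, StructField, StructType

# Variant 1: inferSchema supplied as a read option.
fv_df = (spark.read
         .option("header", "true")
         .option("delimiter", "\t")
         .option("inferSchema", "true")
         .csv("/home/h212957/FacilityView/datapoints_FV.csv"))

# Variant 2: explicit schema, no inference pass over the file.
schema = StructType([
    StructField("facility", StringType(), True),   # hypothetical column
    StructField("value", DoubleType(), True),      # hypothetical column
])
fv_df = (spark.read
         .option("header", "true")
         .option("delimiter", "\t")
         .schema(schema)
         .csv("/home/h212957/FacilityView/datapoints_FV.csv"))
print(fv_df.columns)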