pyspark

How to execute a .sql file in Spark using Python

戏子无情 submitted on 2019-12-30 18:24:57
Question:

    from pyspark import SparkConf, SparkContext
    from pyspark.sql import SQLContext

    conf = SparkConf().setAppName("Test").set("spark.driver.memory", "1g")
    sc = SparkContext(conf=conf)
    sqlContext = SQLContext(sc)
    results = sqlContext.sql("/home/ubuntu/workload/queryXX.sql")

When I execute this script with python test.py, it gives me an error:

    py4j.protocol.Py4JJavaError: An error occurred while calling o20.sql.
    : java.lang.RuntimeException: [1.1] failure: ``with'' expected but `/' found
    /home
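For context: sqlContext.sql() expects a SQL statement as a string, not a path to a .sql file, which is why the parser fails on the leading "/". A minimal sketch of the workaround, assuming queryXX.sql contains a single query:

    # Read the .sql file on the driver and pass its text to sqlContext.sql().
    from pyspark import SparkConf, SparkContext
    from pyspark.sql import SQLContext

    conf = SparkConf().setAppName("Test").set("spark.driver.memory", "1g")
    sc = SparkContext(conf=conf)
    sqlContext = SQLContext(sc)

    with open("/home/ubuntu/workload/queryXX.sql") as f:
        query = f.read()

    results = sqlContext.sql(query)
    results.show()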

Convert an RDD to an iterable: PySpark?

百般思念 submitted on 2019-12-30 17:26:12
Question: I have an RDD which I am creating by loading a text file and preprocessing it. I don't want to collect the entire dataset and save it to disk or memory, but rather want to pass it to some other function in Python which consumes the data one element after another, in the form of an iterable. How is this possible?

    data = sc.textFile('file.txt').map(lambda x: some_func(x))
    an_iterable = data.  ## what should I do here to make it give me one element at a time?

    def model1(an_iterable):
        for i in an_iterable:
            do_that(i
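A minimal sketch of one way to do this, reusing the hypothetical some_func and do_that from the question: RDD.toLocalIterator() returns a plain Python iterator that pulls the data to the driver one partition at a time instead of collecting everything at once.

    data = sc.textFile('file.txt').map(lambda x: some_func(x))
    an_iterable = data.toLocalIterator()  # yields one element at a time

    def model1(an_iterable):
        for i in an_iterable:
            do_that(i)

    model1(an_iterable)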

Accessing S3 using the s3a protocol from Spark using Hadoop version 2.7.2

旧街凉风 submitted on 2019-12-30 12:12:45
Question: I'm trying to access S3 (s3a protocol) from pyspark (version 2.2.0) and I'm having some difficulty. I'm using the Hadoop and AWS SDK packages:

    pyspark --packages com.amazonaws:aws-java-sdk-pom:1.10.34,org.apache.hadoop:hadoop-aws:2.7.2

Here is what my code looks like:

    sc._jsc.hadoopConfiguration().set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    sc._jsc.hadoopConfiguration().set("fs.s3a.access.key", AWS_ACCESS_KEY_ID)
    sc._jsc.hadoopConfiguration().set("fs.s3a.secret.key", AWS
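A minimal sketch of the same configuration, completed under stated assumptions: the credentials are read from the standard AWS environment variables, and the bucket and object path are hypothetical. It must be launched with the same --packages option shown above so the hadoop-aws classes are on the classpath.

    import os
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("s3a-test").getOrCreate()

    hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
    hadoop_conf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    hadoop_conf.set("fs.s3a.access.key", os.environ["AWS_ACCESS_KEY_ID"])
    hadoop_conf.set("fs.s3a.secret.key", os.environ["AWS_SECRET_ACCESS_KEY"])

    # Hypothetical bucket/object path, just to exercise the s3a filesystem.
    df = spark.read.text("s3a://my-bucket/some/path.txt")
    df.show(5)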

pyspark lag function with an offset based on a column

橙三吉。 submitted on 2019-12-30 11:15:37
Question: I want to achieve the below:

    lag(column1, datediff(column2, column3)).over(window)

The offset is dynamic. I have tried using a UDF as well, but it didn't work. Any thoughts on how to achieve the above?

Answer 1: The count argument of the lag function takes an integer, not a column object:

    psf.lag(col, count=1, default=None)

Therefore it cannot be a "dynamic" value. Instead you can build your lag in a column and then join the table with itself. First let's create our dataframe:

    df = spark.createDataFrame
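A minimal sketch of that self-join idea on a hypothetical dataframe (columns key, column1, column2, column3): since lag() only accepts a constant offset, number the rows within each window, compute the dynamically offset row number, and join the table to itself.

    import pyspark.sql.functions as psf
    from pyspark.sql import SparkSession, Window

    spark = SparkSession.builder.appName("dynamic-lag").getOrCreate()

    # Hypothetical data standing in for the question's columns.
    df = spark.createDataFrame(
        [("A", 10.0, "2016-09-02", "2016-09-01"),
         ("A", 20.0, "2016-09-03", "2016-09-01"),
         ("A", 30.0, "2016-09-04", "2016-09-02")],
        ["key", "column1", "column2", "column3"],
    )

    w = Window.partitionBy("key").orderBy("column2")
    numbered = (df.withColumn("rn", psf.row_number().over(w))
                  .withColumn("offset", psf.datediff("column2", "column3")))

    # Join each row to the row 'offset' positions earlier within the same key.
    lagged = numbered.alias("a").join(
        numbered.alias("b"),
        (psf.col("a.key") == psf.col("b.key")) &
        (psf.col("a.rn") - psf.col("a.offset") == psf.col("b.rn")),
        "left",
    ).select("a.*", psf.col("b.column1").alias("lagged_column1"))

    lagged.show()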

spark-submit fails to detect modules installed with pip

丶灬走出姿态 submitted on 2019-12-30 10:57:07
Question: I have a Python script with the following third-party dependencies:

    import boto3
    from warcio.archiveiterator import ArchiveIterator
    from warcio.recordloader import ArchiveLoadFailed
    import requests
    import botocore
    from requests_file import FileAdapter
    ....

I installed the dependencies using pip and made sure they were correctly installed by running pip list. Then, when I tried to submit the job to Spark, I received the following errors:

    ImportError: No module named
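A common cause of this error is that the Python interpreter Spark uses (on the driver, and especially on the executors) is not the one pip installed into; pointing PYSPARK_PYTHON at the right interpreter, or shipping the packages with --py-files, usually resolves it. A minimal diagnostic sketch, not the poster's code, that can be run through spark-submit to compare the interpreters:

    import sys
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("env-check").getOrCreate()
    sc = spark.sparkContext

    print("driver python:", sys.executable)

    # Collect the interpreter path actually used on the executors.
    executor_pythons = (sc.parallelize(range(sc.defaultParallelism))
                          .map(lambda _: __import__("sys").executable)
                          .distinct()
                          .collect())
    print("executor python(s):", executor_pythons)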

pyspark: generate a row hash of specific columns and add it as a new column

六眼飞鱼酱① submitted on 2019-12-30 09:53:33
Question: I am working with Spark 2.2.0 and pyspark2. I have created a DataFrame df and am now trying to add a new column "rowhash" that is the sha2 hash of specific columns in the DataFrame. For example, say that df has the columns (column1, column2, ..., column10). I require sha2((column2||column3||column4||...... column8), 256) in a new column "rowhash". For now, I have tried the below methods:

1) Used the hash() function, but since it gives an integer output it is of not much use.
2) Tried using the sha2()
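A minimal sketch of the usual approach on a small hypothetical dataframe: concatenate the chosen columns with concat_ws and pass the result to sha2 with a 256-bit length.

    import pyspark.sql.functions as F
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("rowhash").getOrCreate()

    # Hypothetical stand-in for df, with only a few of the real columns.
    df = spark.createDataFrame(
        [("a", "b", "c"), ("d", "e", "f")],
        ["column2", "column3", "column4"],
    )

    cols_to_hash = ["column2", "column3", "column4"]  # would be column2 .. column8 in the question
    df = df.withColumn("rowhash", F.sha2(F.concat_ws("||", *cols_to_hash), 256))
    df.show(truncate=False)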

Zeppelin - Cannot use %sql to query a table I registered with pyspark

大憨熊 submitted on 2019-12-30 08:36:07
Question: I am new to Spark/Zeppelin and I wanted to complete a simple exercise where I transform a CSV file from a pandas DataFrame to a Spark DataFrame, then register the table to query it with SQL and visualise it using Zeppelin. But I seem to be failing at the last step. I am using Spark 1.6.1. Here is my code:

    %pyspark
    spark_clean_df.registerTempTable("table1")
    print spark_clean_df.dtypes
    print sqlContext.sql("select count(*) from table1").collect()

Here is the output:

    [('id', 'bigint'), ('name',
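A frequent cause in Zeppelin is that the temp table gets registered on a different SQLContext than the one the %sql interpreter queries. A minimal sketch of the usual pattern, assuming a pandas dataframe named pandas_df already exists: build the Spark DataFrame with the sqlContext that Zeppelin provides to the %pyspark interpreter and register the table on that same context.

    %pyspark
    # Use Zeppelin's injected sqlContext so %sql sees the same catalog.
    spark_clean_df = sqlContext.createDataFrame(pandas_df)
    spark_clean_df.registerTempTable("table1")
    print sqlContext.sql("select count(*) from table1").collect()

A separate %sql paragraph such as "select * from table1" should then resolve against the same context; if it still reports the table as missing, the two interpreters are likely bound to different contexts (for example after restarting only one of them).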

Multiple criteria for aggregation on a PySpark DataFrame

笑着哭i submitted on 2019-12-30 08:10:11
Question: I have a pySpark dataframe that looks like this:

    +-------------+----------+
    |          sku|      date|
    +-------------+----------+
    |MLA-603526656|02/09/2016|
    |MLA-603526656|01/09/2016|
    |MLA-604172009|02/10/2016|
    |MLA-605470584|02/09/2016|
    |MLA-605502281|02/10/2016|
    |MLA-605502281|02/09/2016|
    +-------------+----------+

I want to group by sku and then calculate the min and max dates. If I do this:

    df_testing.groupBy('sku') \
        .agg({'date': 'min', 'date': 'max'}) \
        .limit(10) \
        .show()

the behavior is the same
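A Python dict literal cannot hold the key 'date' twice, so {'date': 'min', 'date': 'max'} collapses to a single aggregation before Spark ever sees it. A minimal sketch of the usual workaround, reusing df_testing from the question, passes explicit function columns instead:

    import pyspark.sql.functions as F

    df_agg = (df_testing
              .groupBy('sku')
              .agg(F.min('date').alias('min_date'),
                   F.max('date').alias('max_date')))
    df_agg.show(10)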

How to write a PySpark UDAF on multiple columns?

对着背影说爱祢 submitted on 2019-12-30 06:59:10
Question: I have the following data in a pyspark dataframe called end_stats_df:

    values  start  end  cat1  cat2
    10      1      2    A     B
    11      1      2    C     B
    12      1      2    D     B
    510     1      2    D     C
    550     1      2    C     B
    500     1      2    A     B
    80      1      3    A     B

And I want to aggregate it in the following way:

- I want to use the "start" and "end" columns as the aggregate keys.
- For each group of rows, I need to do the following: compute the unique number of values in both cat1 and cat2 for that group. e.g., for the group of start=1 and end=2, this number would be 4 because
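For that first requirement alone, a minimal sketch that avoids a Python UDAF in Spark 2.x, reusing end_stats_df from the question: stack cat1 and cat2 into one column and count distinct values per (start, end) group, which gives 4 for the start=1, end=2 group.

    import pyspark.sql.functions as F

    # One row per (start, end, category value), taking values from both cat columns.
    exploded = end_stats_df.select(
        "start", "end",
        F.explode(F.array("cat1", "cat2")).alias("cat"),
    )

    result = (exploded
              .groupBy("start", "end")
              .agg(F.countDistinct("cat").alias("n_unique_cats")))
    result.show()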

PySpark socket timeout exception after the application has been running for a while

匆匆过客 submitted on 2019-12-30 06:34:55
Question: I am using pyspark to estimate parameters for a logistic regression model. I use Spark to calculate the likelihood and gradients, and then use scipy's minimize function (L-BFGS-B) for the optimization. I use yarn-client mode to run my application. My application starts running without any problem. However, after a while it reports the following error:

    Traceback (most recent call last):
      File "/home/panc/research/MixedLogistic/software/mixedlogistic/mixedlogistic_spark/simulation/20160716-1626
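For readers unfamiliar with this setup, here is a minimal sketch of the driver pattern the question describes, with made-up data and a deliberately simple objective (it is not the poster's mixed-logistic code): Spark evaluates the negative log-likelihood over the distributed data, while scipy.optimize.minimize drives the parameter updates on the driver, so every evaluation triggers a Spark job.

    import numpy as np
    from scipy.optimize import minimize
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("lbfgs-driver").getOrCreate()
    sc = spark.sparkContext

    # Made-up (features, label) pairs for a plain logistic regression.
    rdd = sc.parallelize([(np.array([1.0, x]), 1 if x > 0 else 0)
                          for x in np.random.randn(1000)]).cache()

    def neg_log_likelihood(beta):
        b = sc.broadcast(beta)
        def point_nll(row):
            x, y = row
            z = float(np.dot(b.value, x))
            return np.log1p(np.exp(z)) - y * z
        return rdd.map(point_nll).sum()

    # scipy drives the optimisation on the driver (numerical gradients here for brevity).
    result = minimize(neg_log_likelihood, x0=np.zeros(2), method="L-BFGS-B")
    print(result.x)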