pyspark

How to execute a .sql file in Spark using Python

戏子无情 submitted on 2019-12-30 18:24:57
Question:

    from pyspark import SparkConf, SparkContext
    from pyspark.sql import SQLContext

    conf = SparkConf().setAppName("Test").set("spark.driver.memory", "1g")
    sc = SparkContext(conf=conf)
    sqlContext = SQLContext(sc)
    results = sqlContext.sql("/home/ubuntu/workload/queryXX.sql")

When I execute this script with python test.py, it gives me an error:

    py4j.protocol.Py4JJavaError: An error occurred while calling o20.sql.
    : java.lang.RuntimeException: [1.1] failure: ``with'' expected but `/' found
    /home
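For context: sqlContext.sql() expects a SQL statement as a string, not a path to a .sql file, which is why the parser fails on the leading "/". A minimal sketch of the workaround, assuming queryXX.sql contains a single query:

    # Read the .sql file on the driver and pass its text to sqlContext.sql().
    from pyspark import SparkConf, SparkContext
    from pyspark.sql import SQLContext

    conf = SparkConf().setAppName("Test").set("spark.driver.memory", "1g")
    sc = SparkContext(conf=conf)
    sqlContext = SQLContext(sc)

    with open("/home/ubuntu/workload/queryXX.sql") as f:
        query = f.read()

    results = sqlContext.sql(query)
    results.show()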

Convert an RDD to an iterable: PySpark?

百般思念 submitted on 2019-12-30 17:26:12
Question: I have an RDD which I am creating by loading a text file and preprocessing it. I don't want to collect the entire dataset and save it to disk or memory, but rather want to pass it to some other function in Python which consumes the data one element after another, in the form of an iterable. How is this possible?

    data = sc.textFile('file.txt').map(lambda x: some_func(x))
    an_iterable = data.  ## what should I do here to make it give me one element at a time?

    def model1(an_iterable):
        for i in an_iterable:
            do_that(i
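A minimal sketch of one way to do this, reusing the hypothetical some_func and do_that from the question: RDD.toLocalIterator() returns a plain Python iterator that pulls the data to the driver one partition at a time instead of collecting everything at once.

    data = sc.textFile('file.txt').map(lambda x: some_func(x))
    an_iterable = data.toLocalIterator()  # yields one element at a time

    def model1(an_iterable):
        for i in an_iterable:
            do_that(i)

    model1(an_iterable)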

Accessing S3 using the s3a protocol from Spark using Hadoop version 2.7.2

旧街凉风 submitted on 2019-12-30 12:12:45
Question: I'm trying to access S3 (s3a protocol) from pyspark (version 2.2.0) and I'm having some difficulty. I'm using the Hadoop and AWS SDK packages:

    pyspark --packages com.amazonaws:aws-java-sdk-pom:1.10.34,org.apache.hadoop:hadoop-aws:2.7.2

Here is what my code looks like:

    sc._jsc.hadoopConfiguration().set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    sc._jsc.hadoopConfiguration().set("fs.s3a.access.key", AWS_ACCESS_KEY_ID)
    sc._jsc.hadoopConfiguration().set("fs.s3a.secret.key", AWS
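A minimal sketch of the same configuration, completed under stated assumptions: the credentials are read from the standard AWS environment variables, and the bucket and object path are hypothetical. It must be launched with the same --packages option shown above so the hadoop-aws classes are on the classpath.

    import os
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("s3a-test").getOrCreate()

    hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
    hadoop_conf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    hadoop_conf.set("fs.s3a.access.key", os.environ["AWS_ACCESS_KEY_ID"])
    hadoop_conf.set("fs.s3a.secret.key", os.environ["AWS_SECRET_ACCESS_KEY"])

    # Hypothetical bucket/object path, just to exercise the s3a filesystem.
    df = spark.read.text("s3a://my-bucket/some/path.txt")
    df.show(5)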

pyspark lag function with an offset based on a column

橙三吉。 submitted on 2019-12-30 11:15:37
Question: I want to achieve the below:

    lag(column1, datediff(column2, column3)).over(window)

The offset is dynamic. I have tried using a UDF as well, but it didn't work. Any thoughts on how to achieve the above?

Answer 1: The count argument of the lag function takes an integer, not a column object:

    psf.lag(col, count=1, default=None)

Therefore it cannot be a "dynamic" value. Instead you can build your lag in a column and then join the table with itself. First let's create our dataframe:

    df = spark.createDataFrame
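A minimal sketch of that self-join idea on a hypothetical dataframe (columns key, column1, column2, column3): since lag() only accepts a constant offset, number the rows within each window, compute the dynamically offset row number, and join the table to itself.

    import pyspark.sql.functions as psf
    from pyspark.sql import SparkSession, Window

    spark = SparkSession.builder.appName("dynamic-lag").getOrCreate()

    # Hypothetical data standing in for the question's columns.
    df = spark.createDataFrame(
        [("A", 10.0, "2016-09-02", "2016-09-01"),
         ("A", 20.0, "2016-09-03", "2016-09-01"),
         ("A", 30.0, "2016-09-04", "2016-09-02")],
        ["key", "column1", "column2", "column3"],
    )

    w = Window.partitionBy("key").orderBy("column2")
    numbered = (df.withColumn("rn", psf.row_number().over(w))
                  .withColumn("offset", psf.datediff("column2", "column3")))

    # Join each row to the row 'offset' positions earlier within the same key.
    lagged = numbered.alias("a").join(
        numbered.alias("b"),
        (psf.col("a.key") == psf.col("b.key")) &
        (psf.col("a.rn") - psf.col("a.offset") == psf.col("b.rn")),
        "left",
    ).select("a.*", psf.col("b.column1").alias("lagged_column1"))

    lagged.show()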

spark-submit fails to detect modules installed with pip

丶灬走出姿态 submitted on 2019-12-30 10:57:07
Question: I have a Python script with the following third-party dependencies:

    import boto3
    from warcio.archiveiterator import ArchiveIterator
    from warcio.recordloader import ArchiveLoadFailed
    import requests
    import botocore
    from requests_file import FileAdapter
    ....

I installed the dependencies using pip and made sure they were correctly installed by running pip list. Then, when I tried to submit the job to Spark, I received the following errors:

    ImportError: No module named
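A common cause of this error is that the Python interpreter Spark uses (on the driver, and especially on the executors) is not the one pip installed into; pointing PYSPARK_PYTHON at the right interpreter, or shipping the packages with --py-files, usually resolves it. A minimal diagnostic sketch, not the poster's code, that can be run through spark-submit to compare the interpreters:

    import sys
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("env-check").getOrCreate()
    sc = spark.sparkContext

    print("driver python:", sys.executable)

    # Collect the interpreter path actually used on the executors.
    executor_pythons = (sc.parallelize(range(sc.defaultParallelism))
                          .map(lambda _: __import__("sys").executable)
                          .distinct()
                          .collect())
    print("executor python(s):", executor_pythons)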

pyspark: generate a row hash of specific columns and add it as a new column

六眼飞鱼酱① submitted on 2019-12-30 09:53:33
Question: I am working with Spark 2.2.0 and pyspark2. I have created a DataFrame df and am now trying to add a new column "rowhash" that is the sha2 hash of specific columns in the DataFrame. For example, say that df has the columns (column1, column2, ..., column10). I require sha2((column2||column3||column4||...... column8), 256) in a new column "rowhash". For now, I have tried the below methods:

1) Used the hash() function, but since it gives an integer output it is of not much use.
2) Tried using the sha2()
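A minimal sketch of the usual approach on a small hypothetical dataframe: concatenate the chosen columns with concat_ws and pass the result to sha2 with a 256-bit length.

    import pyspark.sql.functions as F
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("rowhash").getOrCreate()

    # Hypothetical stand-in for df, with only a few of the real columns.
    df = spark.createDataFrame(
        [("a", "b", "c"), ("d", "e", "f")],
        ["column2", "column3", "column4"],
    )

    cols_to_hash = ["column2", "column3", "column4"]  # would be column2 .. column8 in the question
    df = df.withColumn("rowhash", F.sha2(F.concat_ws("||", *cols_to_hash), 256))
    df.show(truncate=False)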

Zeppelin - Cannot use %sql to query a table I registered with pyspark

大憨熊 submitted on 2019-12-30 08:36:07
Question: I am new to Spark/Zeppelin and I wanted to complete a simple exercise where I transform a CSV file from a pandas DataFrame to a Spark DataFrame, then register the table to query it with SQL and visualise it using Zeppelin. But I seem to be failing at the last step. I am using Spark 1.6.1. Here is my code:

    %pyspark
    spark_clean_df.registerTempTable("table1")
    print spark_clean_df.dtypes
    print sqlContext.sql("select count(*) from table1").collect()

Here is the output:

    [('id', 'bigint'), ('name',
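A frequent cause in Zeppelin is that the temp table gets registered on a different SQLContext than the one the %sql interpreter queries. A minimal sketch of the usual pattern, assuming a pandas dataframe named pandas_df already exists: build the Spark DataFrame with the sqlContext that Zeppelin provides to the %pyspark interpreter and register the table on that same context.

    %pyspark
    # Use Zeppelin's injected sqlContext so %sql sees the same catalog.
    spark_clean_df = sqlContext.createDataFrame(pandas_df)
    spark_clean_df.registerTempTable("table1")
    print sqlContext.sql("select count(*) from table1").collect()

A separate %sql paragraph such as "select * from table1" should then resolve against the same context; if it still reports the table as missing, the two interpreters are likely bound to different contexts (for example after restarting only one of them).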

Multiple criteria for aggregation on a PySpark DataFrame

笑着哭i submitted on 2019-12-30 08:10:11
Question: I have a pySpark dataframe that looks like this:

    +-------------+----------+
    |          sku|      date|
    +-------------+----------+
    |MLA-603526656|02/09/2016|
    |MLA-603526656|01/09/2016|
    |MLA-604172009|02/10/2016|
    |MLA-605470584|02/09/2016|
    |MLA-605502281|02/10/2016|
    |MLA-605502281|02/09/2016|
    +-------------+----------+

I want to group by sku and then calculate the min and max dates. If I do this:

    df_testing.groupBy('sku') \
        .agg({'date': 'min', 'date': 'max'}) \
        .limit(10) \
        .show()

the behavior is the same
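A Python dict literal cannot hold the key 'date' twice, so {'date': 'min', 'date': 'max'} collapses to a single aggregation before Spark ever sees it. A minimal sketch of the usual workaround, reusing df_testing from the question, passes explicit function columns instead:

    import pyspark.sql.functions as F

    df_agg = (df_testing
              .groupBy('sku')
              .agg(F.min('date').alias('min_date'),
                   F.max('date').alias('max_date')))
    df_agg.show(10)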

How to write a PySpark UDAF on multiple columns?

对着背影说爱祢 submitted on 2019-12-30 06:59:10
Question: I have the following data in a pyspark dataframe called end_stats_df:

    values  start  end  cat1  cat2
    10      1      2    A     B
    11      1      2    C     B
    12      1      2    D     B
    510     1      2    D     C
    550     1      2    C     B
    500     1      2    A     B
    80      1      3    A     B

And I want to aggregate it in the following way:

- I want to use the "start" and "end" columns as the aggregate keys.
- For each group of rows, I need to do the following: compute the unique number of values in both cat1 and cat2 for that group. e.g., for the group of start=1 and end=2, this number would be 4 because
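For that first requirement alone, a minimal sketch that avoids a Python UDAF in Spark 2.x, reusing end_stats_df from the question: stack cat1 and cat2 into one column and count distinct values per (start, end) group, which gives 4 for the start=1, end=2 group.

    import pyspark.sql.functions as F

    # One row per (start, end, category value), taking values from both cat columns.
    exploded = end_stats_df.select(
        "start", "end",
        F.explode(F.array("cat1", "cat2")).alias("cat"),
    )

    result = (exploded
              .groupBy("start", "end")
              .agg(F.countDistinct("cat").alias("n_unique_cats")))
    result.show()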

PySpark socket timeout exception after the application has been running for a while

匆匆过客 submitted on 2019-12-30 06:34:55
Question: I am using pyspark to estimate parameters for a logistic regression model. I use Spark to calculate the likelihood and gradients, and then use scipy's minimize function (L-BFGS-B) for the optimization. I use yarn-client mode to run my application. My application starts running without any problem. However, after a while it reports the following error:

    Traceback (most recent call last):
      File "/home/panc/research/MixedLogistic/software/mixedlogistic/mixedlogistic_spark/simulation/20160716-1626
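For readers unfamiliar with this setup, here is a minimal sketch of the driver pattern the question describes, with made-up data and a deliberately simple objective (it is not the poster's mixed-logistic code): Spark evaluates the negative log-likelihood over the distributed data, while scipy.optimize.minimize drives the parameter updates on the driver, so every evaluation triggers a Spark job.

    import numpy as np
    from scipy.optimize import minimize
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("lbfgs-driver").getOrCreate()
    sc = spark.sparkContext

    # Made-up (features, label) pairs for a plain logistic regression.
    rdd = sc.parallelize([(np.array([1.0, x]), 1 if x > 0 else 0)
                          for x in np.random.randn(1000)]).cache()

    def neg_log_likelihood(beta):
        b = sc.broadcast(beta)
        def point_nll(row):
            x, y = row
            z = float(np.dot(b.value, x))
            return np.log1p(np.exp(z)) - y * z
        return rdd.map(point_nll).sum()

    # scipy drives the optimisation on the driver (numerical gradients here for brevity).
    result = minimize(neg_log_likelihood, x0=np.zeros(2), method="L-BFGS-B")
    print(result.x)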