pyspark

Join two data frames, select all columns from one and some columns from the other

。_饼干妹妹 submitted on 2019-12-28 04:48:05
Question: Let's say I have a Spark data frame df1 with several columns (among which the column 'id'), and a data frame df2 with two columns, 'id' and 'other'. Is there a way to replicate the following command sqlContext.sql("SELECT df1.*, df2.other FROM df1 JOIN df2 ON df1.id = df2.id") by using only pyspark functions such as join(), select() and the like? I have to implement this join in a function and I don't want to be forced to have sqlContext as a function parameter. Thanks! Answer 1: Not sure if the
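
A minimal sketch (separate from the truncated answer above), assuming df1 and df2 exist as described; the a/b aliases are introduced here only to disambiguate the shared id column:

    from pyspark.sql.functions import col

    # Keep every column of df1 plus the single 'other' column of df2.
    result = (df1.alias("a")
              .join(df2.alias("b"), col("a.id") == col("b.id"))
              .select("a.*", "b.other"))

If a single id column in the result is acceptable, the shorter df1.join(df2, "id").select(df1.columns + ["other"]) does the same thing without aliases.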

get value out of dataframe

人走茶凉 submitted on 2019-12-28 04:19:07
Question: In Scala I can do get(#) or getAs[Type](#) to get values out of a dataframe. How should I do it in pyspark? I have a two-column DataFrame: item (string) and salesNum (integers). I do a groupBy and mean to get the mean of those numbers like this: saleDF.groupBy("salesNum").mean().collect() and it works. Now I have the mean in a dataframe with one value. How can I get that value out of the dataframe to get the mean as a float number? Answer 1: collect() returns your results as a Python list. To get
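
A short sketch (my own, not the cut-off answer), assuming saleDF is the two-column frame from the question; indexing into the collected Row objects is what turns the result into a plain Python float:

    # collect()/first() return Row objects; index by position or by column name.
    rows = saleDF.groupBy("item").mean("salesNum").collect()
    first_mean = float(rows[0]["avg(salesNum)"])      # grouped means, one per item

    # Mean over the whole column (a single-row, single-column result):
    overall_mean = float(saleDF.agg({"salesNum": "avg"}).first()[0])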

reading json file in pyspark

一世执手 submitted on 2019-12-28 03:11:05
Question: I'm new to PySpark. Below is my JSON file format from Kafka:

{
  "header": { "platform": "atm", "version": "2.0" },
  "details": [
    { "abc": "3", "def": "4" },
    { "abc": "5", "def": "6" },
    { "abc": "7", "def": "8" }
  ]
}

How can I read through the values of all "abc" and "def" in details and add them to a new list like this [(1,2),(3,4),(5,6),(7,8)]? The new list will be used to create a Spark data frame. How can I do this in pyspark? I tried the code below: parsed = messages.map(lambda (k,v): json.loads(v))
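
A sketch of one way to do it (not from the original thread), assuming messages is an RDD of (key, value) pairs whose values are JSON documents shaped like the one above:

    import json

    def extract_pairs(value):
        # Pull every (abc, def) pair out of the 'details' array of one document.
        doc = json.loads(value)
        return [(d["abc"], d["def"]) for d in doc["details"]]

    # Python 3 lambdas cannot tuple-unpack (k, v), so index into the pair instead.
    pairs = messages.flatMap(lambda kv: extract_pairs(kv[1]))
    df = sqlContext.createDataFrame(pairs, ["abc", "def"])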

Spark load data and add filename as dataframe column

蹲街弑〆低调 submitted on 2019-12-27 17:41:29
Question: I am loading some data into Spark with a wrapper function:

def load_data(filename):
    df = sqlContext.read.format("com.databricks.spark.csv")\
        .option("delimiter", "\t")\
        .option("header", "false")\
        .option("mode", "DROPMALFORMED")\
        .load(filename)
    # add the filename base as hostname
    (hostname, _) = os.path.splitext(os.path.basename(filename))
    (hostname, _) = os.path.splitext(hostname)
    df = df.withColumn('hostname', lit(hostname))
    return df

Specifically, I am using a glob to load a
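
If the Spark build in use provides pyspark.sql.functions.input_file_name(), a per-row source path can replace the per-file wrapper; a sketch, with the glob and the extension-stripping regex being my own placeholders:

    from pyspark.sql.functions import input_file_name, regexp_extract

    df = (sqlContext.read.format("com.databricks.spark.csv")
          .option("delimiter", "\t")
          .option("header", "false")
          .option("mode", "DROPMALFORMED")
          .load("/data/*.tsv.gz")                                   # hypothetical glob
          .withColumn("path", input_file_name())
          # keep the file's base name (extensions dropped) as the hostname
          .withColumn("hostname", regexp_extract("path", r"/([^/]+?)\.[^/]*$", 1)))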

How to improve performance for slow Spark jobs using DataFrame and JDBC connection?

耗尽温柔 submitted on 2019-12-27 11:45:26
Question: I am trying to access a mid-size Teradata table (~100 million rows) via JDBC in standalone mode on a single node (local[*]). I am using Spark 1.4.1, set up on a very powerful machine (2 CPUs, 24 cores, 126 GB RAM). I have tried several memory setups and tuning options to make it work faster, but none of them made a huge impact. I am sure there is something I am missing, and below is my final try, which took about 11 minutes to get this simple count vs. only 40 seconds using a JDBC
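
A sketch of a partitioned JDBC read (all option values below are illustrative placeholders, not taken from the question): a single-connection read pulls the whole table through one task, while partitionColumn/lowerBound/upperBound/numPartitions let Spark open several parallel connections.

    df = (sqlContext.read.format("jdbc")
          .option("url", "jdbc:teradata://host/DATABASE=db")   # placeholder URL
          .option("driver", "com.teradata.jdbc.TeraDriver")    # placeholder driver class
          .option("dbtable", "big_table")                      # placeholder table
          .option("partitionColumn", "id")                     # numeric, roughly uniform column
          .option("lowerBound", "1")
          .option("upperBound", "100000000")
          .option("numPartitions", "24")
          .load())
    print(df.count())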

Unable to open pyspark in mac os

我与影子孤独终老i submitted on 2019-12-26 07:18:49
Question: I have installed pyspark through pip but am unable to open it. It shows the following error:

/Users/sonveer.narwaria/anaconda/bin/pyspark: line 24: /Users/sonveer.narwaria/anaconda/lib/python3.6/site-packages/pyspark/bin/load-spark-env.sh: No such file or directory
/Users/sonveer.narwaria/anaconda/bin/pyspark: line 77: /Users/sonveer.narwaria//Users/sonveer.narwaria/anaconda/lib/python3.6/site-packages/pyspark/bin/spark-submit: No such file or directory
/Users/sonveer.narwaria/anaconda/bin/pyspark:
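
A diagnostic sketch (my guess at the situation, not a confirmed fix): the doubled path in the error suggests a stale or wrong SPARK_HOME, so it can help to check where pip actually installed the package and to start Spark from Python directly instead of the pyspark wrapper script:

    import os
    import pyspark

    print(pyspark.__file__)                     # where pip actually put the package

    # Point SPARK_HOME at that package directory (it ships its own bin/ scripts).
    os.environ["SPARK_HOME"] = os.path.dirname(pyspark.__file__)

    from pyspark.sql import SparkSession
    spark = SparkSession.builder.master("local[*]").getOrCreate()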

Running Spark on m4 instead of m3 on AWS

让人想犯罪 __ submitted on 2019-12-25 19:11:11
Question: I have a small script with which I submit a job to AWS. I changed the instance type from m3.xlarge to m4.xlarge, and I suddenly get an error message and the cluster is terminated without completing all steps. The script is:

aws emr create-cluster --name "XXXXXX" --ami-version 3.7 --applications Name=Hive --use-default-roles --ec2-attributes KeyName=gattami,SubnetId=subnet-xxxxxxx \
  --instance-type=m4.xlarge --instance-count 3 \
  --log-uri s3://pythonpicode/ --bootstrap-actions Path=s3://eu
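
For comparison, a boto3 sketch of roughly the same request (the region, release label and IDs are placeholders; that m4 instances need a VPC subnet and a newer EMR release than --ami-version 3.7 is my assumption to verify, not something stated in the question):

    import boto3

    emr = boto3.client("emr", region_name="eu-west-1")           # placeholder region
    emr.run_job_flow(
        Name="XXXXXX",
        ReleaseLabel="emr-4.7.2",                                # instead of --ami-version 3.7
        Applications=[{"Name": "Hive"}],
        LogUri="s3://pythonpicode/",
        ServiceRole="EMR_DefaultRole",
        JobFlowRole="EMR_EC2_DefaultRole",
        Instances={
            "Ec2KeyName": "gattami",
            "Ec2SubnetId": "subnet-xxxxxxx",
            "MasterInstanceType": "m4.xlarge",
            "SlaveInstanceType": "m4.xlarge",
            "InstanceCount": 3,
            "KeepJobFlowAliveWhenNoSteps": False,
        },
    )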

Can't instantiate Spark Context in iPython

怎甘沉沦 submitted on 2019-12-25 19:01:40
Question: I'm trying to set up a stand-alone instance of Spark locally on a Mac and use the Python 3 API. To do this I've done the following: 1. I downloaded and installed Scala and Spark. 2. I set up the following environment variables:

#Scala
export SCALA_HOME=$HOME/scala/scala-2.12.4
export PATH=$PATH:$SCALA_HOME/bin

#Spark
export SPARK_HOME=$HOME/spark/spark-2.2.1-bin-hadoop2.7
export PATH=$PATH:$SPARK_HOME/bin

#Jupyter Python
export PYSPARK_PYTHON=python3
export PYSPARK_DRIVER_PYTHON
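
A sketch of one common way to get a context inside Jupyter/IPython once those variables are in place (findspark is a separate pip package, not part of Spark itself):

    import findspark
    findspark.init()                      # picks up SPARK_HOME set above

    from pyspark import SparkContext
    sc = SparkContext(master="local[*]", appName="notebook-test")
    print(sc.version)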

How do I decrease iteration time when making data transformations?

帅比萌擦擦* submitted on 2019-12-25 18:06:13
Question: I have a couple of data transformations that seem to operate quite slowly while I iterate on them. What general strategies can I use to increase performance? Input data:

+-----------+-------+
| key       | val   |
+-----------+-------+
| a         | 1     |
| a         | 2     |
| b         | 1     |
| b         | 2     |
| b         | 3     |
+-----------+-------+

The code I'm iterating on is the following:

from pyspark.sql import functions as F

# Output = /my/function/output
# input_df = /my/function/input

def my_compute_function(input_df):
    """Compute difference
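
The docstring is cut off, so the sketch below assumes "difference" means the gap between consecutive val rows within each key; a single window expression keeps the work in one shuffle rather than an explicit loop:

    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    def my_compute_function(input_df):
        """Per-key difference between consecutive val rows (assumed intent)."""
        w = Window.partitionBy("key").orderBy("val")
        return input_df.withColumn("diff", F.col("val") - F.lag("val").over(w))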