pyspark

Join two data frames, select all columns from one and some columns from the other

。_饼干妹妹 submitted on 2019-12-28 04:48:05
Question: Let's say I have a Spark data frame df1 with several columns (among which the column 'id'), and a data frame df2 with two columns, 'id' and 'other'. Is there a way to replicate the following command sqlContext.sql("SELECT df1.*, df2.other FROM df1 JOIN df2 ON df1.id = df2.id") by using only pyspark functions such as join(), select() and the like? I have to implement this join in a function and I don't want to be forced to have sqlContext as a function parameter. Thanks! Answer 1: Not sure if the
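
A minimal sketch (separate from the truncated answer above), assuming df1 and df2 exist as described; the a/b aliases are introduced here only to disambiguate the shared id column:

    from pyspark.sql.functions import col

    # Keep every column of df1 plus the single 'other' column of df2.
    result = (df1.alias("a")
              .join(df2.alias("b"), col("a.id") == col("b.id"))
              .select("a.*", "b.other"))

If a single id column in the result is acceptable, the shorter df1.join(df2, "id").select(df1.columns + ["other"]) does the same thing without aliases.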

get value out of dataframe

人走茶凉 submitted on 2019-12-28 04:19:07
Question: In Scala I can do get(#) or getAs[Type](#) to get values out of a dataframe. How should I do it in pyspark? I have a two-column DataFrame: item (string) and salesNum (integers). I do a groupBy and mean to get the mean of those numbers like this: saleDF.groupBy("salesNum").mean().collect() and it works. Now I have the mean in a dataframe with one value. How can I get that value out of the dataframe to get the mean as a float number? Answer 1: collect() returns your results as a Python list. To get
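
A short sketch (my own, not the cut-off answer), assuming saleDF is the two-column frame from the question; indexing into the collected Row objects is what turns the result into a plain Python float:

    # collect()/first() return Row objects; index by position or by column name.
    rows = saleDF.groupBy("item").mean("salesNum").collect()
    first_mean = float(rows[0]["avg(salesNum)"])      # grouped means, one per item

    # Mean over the whole column (a single-row, single-column result):
    overall_mean = float(saleDF.agg({"salesNum": "avg"}).first()[0])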

reading json file in pyspark

一世执手 submitted on 2019-12-28 03:11:05
Question: I'm new to PySpark. Below is my JSON file format from Kafka:

{
  "header": { "platform": "atm", "version": "2.0" },
  "details": [
    { "abc": "3", "def": "4" },
    { "abc": "5", "def": "6" },
    { "abc": "7", "def": "8" }
  ]
}

How can I read through the values of all "abc" and "def" in details and add them to a new list like this [(1,2),(3,4),(5,6),(7,8)]? The new list will be used to create a Spark data frame. How can I do this in pyspark? I tried the code below: parsed = messages.map(lambda (k,v): json.loads(v))
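
A sketch of one way to do it (not from the original thread), assuming messages is an RDD of (key, value) pairs whose values are JSON documents shaped like the one above:

    import json

    def extract_pairs(value):
        # Pull every (abc, def) pair out of the 'details' array of one document.
        doc = json.loads(value)
        return [(d["abc"], d["def"]) for d in doc["details"]]

    # Python 3 lambdas cannot tuple-unpack (k, v), so index into the pair instead.
    pairs = messages.flatMap(lambda kv: extract_pairs(kv[1]))
    df = sqlContext.createDataFrame(pairs, ["abc", "def"])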

Spark load data and add filename as dataframe column

蹲街弑〆低调 submitted on 2019-12-27 17:41:29
Question: I am loading some data into Spark with a wrapper function:

def load_data(filename):
    df = sqlContext.read.format("com.databricks.spark.csv")\
        .option("delimiter", "\t")\
        .option("header", "false")\
        .option("mode", "DROPMALFORMED")\
        .load(filename)
    # add the filename base as hostname
    (hostname, _) = os.path.splitext(os.path.basename(filename))
    (hostname, _) = os.path.splitext(hostname)
    df = df.withColumn('hostname', lit(hostname))
    return df

Specifically, I am using a glob to load a
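
If the Spark build in use provides pyspark.sql.functions.input_file_name(), a per-row source path can replace the per-file wrapper; a sketch, with the glob and the extension-stripping regex being my own placeholders:

    from pyspark.sql.functions import input_file_name, regexp_extract

    df = (sqlContext.read.format("com.databricks.spark.csv")
          .option("delimiter", "\t")
          .option("header", "false")
          .option("mode", "DROPMALFORMED")
          .load("/data/*.tsv.gz")                                   # hypothetical glob
          .withColumn("path", input_file_name())
          # keep the file's base name (extensions dropped) as the hostname
          .withColumn("hostname", regexp_extract("path", r"/([^/]+?)\.[^/]*$", 1)))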

How to improve performance for slow Spark jobs using DataFrame and JDBC connection?

耗尽温柔 submitted on 2019-12-27 11:45:26
Question: I am trying to access a mid-size Teradata table (~100 million rows) via JDBC in standalone mode on a single node (local[*]). I am using Spark 1.4.1, set up on a very powerful machine (2 CPUs, 24 cores, 126 GB RAM). I have tried several memory setups and tuning options to make it work faster, but none of them made a huge impact. I am sure there is something I am missing, and below is my final try, which took about 11 minutes to get this simple count vs. only 40 seconds using a JDBC
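
A sketch of a partitioned JDBC read (all option values below are illustrative placeholders, not taken from the question): a single-connection read pulls the whole table through one task, while partitionColumn/lowerBound/upperBound/numPartitions let Spark open several parallel connections.

    df = (sqlContext.read.format("jdbc")
          .option("url", "jdbc:teradata://host/DATABASE=db")   # placeholder URL
          .option("driver", "com.teradata.jdbc.TeraDriver")    # placeholder driver class
          .option("dbtable", "big_table")                      # placeholder table
          .option("partitionColumn", "id")                     # numeric, roughly uniform column
          .option("lowerBound", "1")
          .option("upperBound", "100000000")
          .option("numPartitions", "24")
          .load())
    print(df.count())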

Unable to open pyspark in mac os

我与影子孤独终老i submitted on 2019-12-26 07:18:49
Question: I have installed pyspark through pip but am unable to open it. It shows the following error:

/Users/sonveer.narwaria/anaconda/bin/pyspark: line 24: /Users/sonveer.narwaria/anaconda/lib/python3.6/site-packages/pyspark/bin/load-spark-env.sh: No such file or directory
/Users/sonveer.narwaria/anaconda/bin/pyspark: line 77: /Users/sonveer.narwaria//Users/sonveer.narwaria/anaconda/lib/python3.6/site-packages/pyspark/bin/spark-submit: No such file or directory
/Users/sonveer.narwaria/anaconda/bin/pyspark:
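
A diagnostic sketch (my guess at the situation, not a confirmed fix): the doubled path in the error suggests a stale or wrong SPARK_HOME, so it can help to check where pip actually installed the package and to start Spark from Python directly instead of the pyspark wrapper script:

    import os
    import pyspark

    print(pyspark.__file__)                     # where pip actually put the package

    # Point SPARK_HOME at that package directory (it ships its own bin/ scripts).
    os.environ["SPARK_HOME"] = os.path.dirname(pyspark.__file__)

    from pyspark.sql import SparkSession
    spark = SparkSession.builder.master("local[*]").getOrCreate()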

Running Spark on m4 instead of m3 on AWS

让人想犯罪 __ submitted on 2019-12-25 19:11:11
Question: I have a small script with which I submit a job to AWS. I changed the instance type from m3.xlarge to m4.xlarge, and I suddenly get an error message and the cluster is terminated without completing all steps. The script is:

aws emr create-cluster --name "XXXXXX" --ami-version 3.7 --applications Name=Hive --use-default-roles --ec2-attributes KeyName=gattami,SubnetId=subnet-xxxxxxx \
  --instance-type=m4.xlarge --instance-count 3 \
  --log-uri s3://pythonpicode/ --bootstrap-actions Path=s3://eu
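
For comparison, a boto3 sketch of roughly the same request (the region, release label and IDs are placeholders; that m4 instances need a VPC subnet and a newer EMR release than --ami-version 3.7 is my assumption to verify, not something stated in the question):

    import boto3

    emr = boto3.client("emr", region_name="eu-west-1")           # placeholder region
    emr.run_job_flow(
        Name="XXXXXX",
        ReleaseLabel="emr-4.7.2",                                # instead of --ami-version 3.7
        Applications=[{"Name": "Hive"}],
        LogUri="s3://pythonpicode/",
        ServiceRole="EMR_DefaultRole",
        JobFlowRole="EMR_EC2_DefaultRole",
        Instances={
            "Ec2KeyName": "gattami",
            "Ec2SubnetId": "subnet-xxxxxxx",
            "MasterInstanceType": "m4.xlarge",
            "SlaveInstanceType": "m4.xlarge",
            "InstanceCount": 3,
            "KeepJobFlowAliveWhenNoSteps": False,
        },
    )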

Can't instantiate Spark Context in iPython

怎甘沉沦 submitted on 2019-12-25 19:01:40
Question: I'm trying to set up a stand-alone instance of Spark locally on a Mac and use the Python 3 API. To do this I've done the following: 1. I downloaded and installed Scala and Spark. 2. I set up the following environment variables:

#Scala
export SCALA_HOME=$HOME/scala/scala-2.12.4
export PATH=$PATH:$SCALA_HOME/bin

#Spark
export SPARK_HOME=$HOME/spark/spark-2.2.1-bin-hadoop2.7
export PATH=$PATH:$SPARK_HOME/bin

#Jupyter Python
export PYSPARK_PYTHON=python3
export PYSPARK_DRIVER_PYTHON
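
A sketch of one common way to get a context inside Jupyter/IPython once those variables are in place (findspark is a separate pip package, not part of Spark itself):

    import findspark
    findspark.init()                      # picks up SPARK_HOME set above

    from pyspark import SparkContext
    sc = SparkContext(master="local[*]", appName="notebook-test")
    print(sc.version)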

How do I decrease iteration time when making data transformations?

帅比萌擦擦* submitted on 2019-12-25 18:06:13
Question: I have a couple of data transformations that seem to operate quite slowly while I iterate on them. What general strategies can I use to increase performance? Input data:

+-----------+-------+
| key       | val   |
+-----------+-------+
| a         | 1     |
| a         | 2     |
| b         | 1     |
| b         | 2     |
| b         | 3     |
+-----------+-------+

The code I'm iterating on is the following:

from pyspark.sql import functions as F

# Output = /my/function/output
# input_df = /my/function/input

def my_compute_function(input_df):
    """Compute difference
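
The docstring is cut off, so the sketch below assumes "difference" means the gap between consecutive val rows within each key; a single window expression keeps the work in one shuffle rather than an explicit loop:

    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    def my_compute_function(input_df):
        """Per-key difference between consecutive val rows (assumed intent)."""
        w = Window.partitionBy("key").orderBy("val")
        return input_df.withColumn("diff", F.col("val") - F.lag("val").over(w))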