pyspark

SparkSQL: conditional sum using two columns

和自甴很熟 submitted on 2019-12-21 05:35:31
Question: I hope you can help me with this. I have a DF as follows:

val df = sc.parallelize(Seq(
  (1, "a", "2014-12-01", "2015-01-01", 100),
  (2, "a", "2014-12-01", "2015-01-02", 150),
  (3, "a", "2014-12-01", "2015-01-03", 120),
  (4, "b", "2015-12-15", "2015-01-01", 100)
)).toDF("id", "prodId", "dateIns", "dateTrans", "value")
  .withColumn("dateIns", to_date($"dateIns"))
  .withColumn("dateTrans", to_date($"dateTrans"))

I would love to do a groupBy on prodId and aggregate 'value', summing it for ranges of dates
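The preview cuts off before the ranges are defined, but one common pattern for this kind of conditional sum is when/otherwise inside the aggregation. A minimal PySpark sketch (the question itself uses Scala, and the 30/60-day windows below are assumptions for illustration, not from the post):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1, "a", "2014-12-01", "2015-01-01", 100),
     (2, "a", "2014-12-01", "2015-01-02", 150),
     (3, "a", "2014-12-01", "2015-01-03", 120),
     (4, "b", "2015-12-15", "2015-01-01", 100)],
    ["id", "prodId", "dateIns", "dateTrans", "value"]
).withColumn("dateIns", F.to_date("dateIns")) \
 .withColumn("dateTrans", F.to_date("dateTrans"))

# Only add 'value' when dateTrans falls within N days of dateIns; otherwise contribute 0.
days = F.datediff("dateTrans", "dateIns")
result = df.groupBy("prodId").agg(
    F.sum(F.when(days <= 30, F.col("value")).otherwise(0)).alias("sum30"),
    F.sum(F.when(days <= 60, F.col("value")).otherwise(0)).alias("sum60"),
)
result.show()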

PySpark explode stringified array of dictionaries into rows

怎甘沉沦 submitted on 2019-12-21 05:25:11
Question: I have a pyspark dataframe with a StringType column ( edges ) which contains a list of dictionaries (see example below). The dictionaries contain a mix of value types, including another dictionary ( nodeIDs ). I need to explode the top-level dictionaries in the edges field into rows; ideally, I should then be able to convert their component values into separate fields. Input:

import findspark
findspark.init()
from pyspark.sql import SparkSession, Row

SPARK = SparkSession.builder.enableHiveSupport() \
    .getOrCreate()

data = [ Row(trace
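The example data is truncated above, but the usual route for a stringified array of dictionaries is from_json with an explicit schema followed by explode. A hedged sketch, in which the field names (weight, nodeIDs) and their types are guesses for illustration only:

from pyspark.sql import SparkSession, functions as F, types as T

spark = SparkSession.builder.getOrCreate()

# Hypothetical input; the real 'edges' strings are not shown in full above.
df = spark.createDataFrame(
    [('[{"weight": 1.0, "nodeIDs": {"src": "a", "dst": "b"}}]',)],
    ["edges"],
)

# Schema for one dictionary in the array; adjust to the actual keys and types.
edge_schema = T.ArrayType(T.StructType([
    T.StructField("weight", T.DoubleType()),
    T.StructField("nodeIDs", T.MapType(T.StringType(), T.StringType())),
]))

exploded = (df
    .withColumn("edge", F.explode(F.from_json("edges", edge_schema)))
    .select(F.col("edge.weight").alias("weight"),
            F.col("edge.nodeIDs").alias("nodeIDs")))
exploded.show(truncate=False)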

Spark Python Performance Tuning

て烟熏妆下的殇ゞ submitted on 2019-12-21 04:51:52
Question: I brought up an IPython notebook for Spark development using the command below:

ipython notebook --profile=pyspark

And I created a SparkContext sc using Python code like this:

import sys
import os
os.environ["YARN_CONF_DIR"] = "/etc/hadoop/conf"
sys.path.append("/opt/cloudera/parcels/CDH/lib/spark/python")
sys.path.append("/opt/cloudera/parcels/CDH/lib/spark/python/lib/py4j-0.8.1-src.zip")
from pyspark import SparkContext, SparkConf
from pyspark.sql import *

sconf = SparkConf()
conf =
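The snippet cuts off mid-setup. For context, a minimal sketch of how such a SparkConf/SparkContext is typically completed; the master URL and tuning values below are placeholders, not taken from the post:

from pyspark import SparkContext, SparkConf

# Placeholder values for illustration only; tune for the actual cluster.
sconf = SparkConf() \
    .setAppName("notebook") \
    .setMaster("yarn-client") \
    .set("spark.executor.memory", "4g") \
    .set("spark.executor.instances", "4")

sc = SparkContext(conf=sconf)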

How to calculate mean and standard deviation given a PySpark DataFrame?

不想你离开。 submitted on 2019-12-21 04:44:06
Question: I have a PySpark DataFrame (not pandas) called df that is too large to use collect() on. Therefore the code below is not efficient; it worked with a smaller amount of data, but now it fails.

import numpy as np

myList = df.collect()
total = []
for product, nb in myList:
    for p2, score in nb:
        total.append(score)
mean = np.mean(total)
std = np.std(total)

Is there any way to get mean and std as two variables by using pyspark.sql.functions or similar? from pyspark.sql.functions import
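A hedged sketch of the aggregation-side approach, assuming the nested scores can first be flattened with explode; the column and field names below follow the Python loop above and are assumptions about the real schema:

from pyspark.sql import functions as F

# Assuming df has columns (product, nb) where nb is an array of (p2, score) structs.
scores = df.select(F.explode("nb").alias("item")) \
           .select(F.col("item.score").alias("score"))

stats = scores.agg(
    F.mean("score").alias("mean"),
    F.stddev("score").alias("std"),
).first()

mean, std = stats["mean"], stats["std"]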

Read a bytes column in spark

谁都会走 submitted on 2019-12-21 04:33:10
Question: I have a data set which contains an ID field that is in an unknown (and not friendly) encoding. I can read the single column using plain Python and verify that the values are distinct and consistent across multiple data sets (i.e. it can be used as a primary key for joining). When loading the file using spark.read.csv, it seems that Spark is converting the column to UTF-8. However, some of the multibyte sequences are converted to the Unicode character U+FFFD REPLACEMENT CHARACTER. ( EF BF
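One workaround worth sketching: read the file with a single-byte encoding such as ISO-8859-1, which maps every byte to a distinct code point, so the ID survives round-trips and joins even though it is not human-readable. The path and column name below are placeholders:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# ISO-8859-1 decodes any byte sequence losslessly (1 byte -> 1 code point),
# avoiding the U+FFFD substitutions introduced by UTF-8 decoding.
df = spark.read \
    .option("header", "true") \
    .option("encoding", "ISO-8859-1") \
    .csv("/path/to/data.csv")

# The ID can still be used for joins; if a readable form is needed, hex-encode it.
df = df.withColumn("id_hex", F.hex(F.encode(F.col("id"), "ISO-8859-1")))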

Evaluating Spark DataFrame in loop slows down with every iteration, all work done by controller

久未见 submitted on 2019-12-21 04:02:24
Question: I am trying to use a Spark cluster (running on AWS EMR) to link groups of items that have common elements in them. Essentially, I have groups with some elements, and if some of the elements are in multiple groups, I want to make one group that contains the elements from all of those groups. I know about the GraphX library and I tried to use the graphframes package (ConnectedComponents algorithm) to resolve this task, but it seems that the graphframes package is not yet mature enough and is very wasteful
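One common cause of the per-iteration slowdown named in the title is the growing lineage of a DataFrame that is rebuilt in a loop. A hedged sketch of truncating that lineage with checkpointing; the loop body and checkpoint directory are stand-ins, not the original merging algorithm:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")  # assumed path

df = spark.range(0, 1000).withColumnRenamed("id", "group")  # stand-in for the real groups DataFrame

def merge_step(frame):
    # Placeholder for the real per-iteration group-merging logic.
    return frame.withColumn("group", F.col("group") + 0)

for i in range(10):
    df = merge_step(df)
    if i % 3 == 0:
        # checkpoint() materializes the data and cuts the logical plan, so later
        # iterations do not replay every earlier transformation on the driver.
        df = df.checkpoint(eager=True)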

How can I set the default Spark logging level?

廉价感情. submitted on 2019-12-21 03:57:34
Question: I launch pyspark applications from PyCharm on my own workstation, to an 8-node cluster. This cluster also has settings encoded in spark-defaults.conf and spark-env.sh. This is how I obtain my spark context variable:

spark = SparkSession \
    .builder \
    .master("spark://stcpgrnlp06p.options-it.com:7087") \
    .appName(__SPARK_APP_NAME__) \
    .config("spark.executor.memory", "50g") \
    .config("spark.eventlog.enabled", "true") \
    .config("spark.eventlog.dir", r"/net/share/grid/bin/spark/UAT/SparkLogs/") \
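A hedged sketch of adjusting the level at runtime from the application side (the app name is a placeholder; this does not change the cluster-wide default, which lives in conf/log4j.properties):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("logging-level-demo")  # placeholder app name
         .getOrCreate())

# Applies to log output produced after the context exists; startup messages
# are still governed by the log4j configuration shipped with the cluster.
spark.sparkContext.setLogLevel("WARN")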

Pandas-style transform of grouped data on PySpark DataFrame

回眸只為那壹抹淺笑 submitted on 2019-12-21 03:57:05
Question: If we have a Pandas data frame consisting of a column of categories and a column of values, we can remove the mean in each category by doing the following:

df["DemeanedValues"] = df.groupby("Category")["Values"].transform(lambda g: g - numpy.mean(g))

As far as I understand, Spark dataframes do not directly offer this group-by/transform operation (I am using PySpark on Spark 1.5.0). So, what is the best way to implement this computation? I have tried using a group-by/join as follows:

df2 = df
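A hedged sketch of the window-function route, which avoids the self-join; the column names follow the Pandas example above, and the sample rows are invented for illustration:

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

# Small stand-in frame with the same column names as the Pandas example.
df = spark.createDataFrame(
    [("a", 1.0), ("a", 3.0), ("b", 10.0), ("b", 14.0)],
    ["Category", "Values"],
)

# Subtract the per-category average computed over a partition-by window.
w = Window.partitionBy("Category")
demeaned = df.withColumn("DemeanedValues", F.col("Values") - F.avg("Values").over(w))
demeaned.show()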

Exception: java.lang.Exception: When running with master 'yarn' either HADOOP_CONF_DIR or YARN_CONF_DIR must be set in the environment. in spark

依然范特西╮ submitted on 2019-12-21 03:55:23
Question: I am new to apache-spark. I have tested some applications in Spark standalone mode, but I want to run an application in YARN mode. I am running apache-spark 2.1.0 on Windows. Here is my code:

c:\spark>spark-submit2 --master yarn --deploy-mode client --executor-cores 4 --jars C:\DependencyJars\spark-streaming-eventhubs_2.11-2.0.3.jar,C:\DependencyJars\scalaj-http_2.11-2.3.0.jar,C:\DependencyJars\config-1.3.1.jar,C:\DependencyJars\commons-lang3-3.3.2.jar --conf spark.driver.userClasspathFirst=true --conf
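The error in the title usually means the process launching the application cannot see the YARN configuration directory. A hedged illustration of setting it from Python before creating a YARN-backed session; the path is a placeholder, and when using spark-submit itself the same variable has to be set in that shell's environment instead:

import os

# Placeholder path: point this at the directory holding yarn-site.xml / core-site.xml
# copied from the cluster.
os.environ["HADOOP_CONF_DIR"] = r"C:\hadoop\conf"
os.environ["YARN_CONF_DIR"] = r"C:\hadoop\conf"

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("yarn")
         .config("spark.submit.deployMode", "client")
         .getOrCreate())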

How does Spark interoperate with CPython

有些话、适合烂在心里 submitted on 2019-12-21 03:42:16
Question: I have an Akka system written in Scala that needs to call out to some Python code, relying on Pandas and NumPy, so I can't just use Jython. I noticed that Spark uses CPython on its worker nodes, so I'm curious how it executes Python code and whether that code exists in some reusable form.

Answer 1: The PySpark architecture is described here: https://cwiki.apache.org/confluence/display/SPARK/PySpark+Internals. As @Holden said, Spark uses Py4J to access Java objects in the JVM from Python. But this is
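As a concrete illustration of the worker-side CPython execution the answer refers to, a hedged sketch of a plain Python function (using NumPy) shipped to the workers as a UDF; the function and column names are invented for the example:

import numpy as np
from pyspark.sql import SparkSession, functions as F, types as T

spark = SparkSession.builder.getOrCreate()

# Ordinary CPython code: the driver pickles the function, and the JVM executors
# feed rows to python worker processes over sockets to evaluate it.
def log1p_score(x):
    return float(np.log1p(x))

log1p_udf = F.udf(log1p_score, T.DoubleType())

df = spark.range(1, 6).withColumn("score", log1p_udf(F.col("id")))
df.show()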