pyspark

Access to WrappedArray elements

Submitted by 百般思念 on 2021-02-19 02:37:39
Question: I have a Spark DataFrame and here is its schema:

|-- eid: long (nullable = true)
|-- age: long (nullable = true)
|-- sex: long (nullable = true)
|-- father: array (nullable = true)
|    |-- element: array (containsNull = true)
|    |    |-- element: long (containsNull = true)

and a sample of rows:

df.select(df['father']).show()

+--------------------+
|              father|
+--------------------+
|[WrappedArray(-17...|
|[WrappedArray(-11...|
|[WrappedArray(13,...|
+--------------------+

and the type is DataFrame
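A minimal sketch (not the asker's data) of how such a nested array column can be reached from PySpark, assuming a hypothetical DataFrame with the same father: array<array<long>> schema; indexing with getItem, or flattening with explode, keeps the work on the JVM side so the Python code never has to deal with WrappedArray directly:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical data mirroring the schema: father is an array of arrays of longs.
df = spark.createDataFrame(
    [(1, [[-17, 2], [3, 4]]), (2, [[-11, 5]])],
    ["eid", "father"],
)

# Index into the nested arrays: first inner array, then its first element.
df.select(F.col("father").getItem(0).getItem(0).alias("first_value")).show()

# Or flatten one level with explode to get one inner array per row.
df.select("eid", F.explode("father").alias("inner")).show()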

Pyspark command not recognised

Submitted by 断了今生、忘了曾经 on 2021-02-19 01:17:47
Question: I have Anaconda installed, and I have also downloaded Spark 1.6.2. I am using the instructions from this answer to configure Spark for Jupyter. I have downloaded and unzipped the Spark directory as ~/spark. Now when I cd into this directory and then into bin, I see the following:

SFOM00618927A:spark $ cd bin
SFOM00618927A:bin $ ls
beeline       pyspark       run-example.cmd    spark-class2.cmd   spark-sql     sparkR
beeline.cmd   pyspark.cmd   run-example2.cmd   spark-shell        spark-submit
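Not from the linked answer, but one common way to make a local Spark unpack usable from an Anaconda/Jupyter Python session is to point the environment at it before importing pyspark. The ~/spark path below comes from the question; findspark is an extra helper package (pip install findspark) that this sketch assumes is installed:

import os

# Point the session at the unpacked Spark directory from the question.
os.environ["SPARK_HOME"] = os.path.expanduser("~/spark")

# findspark adds SPARK_HOME's python/ directories to sys.path so pyspark imports.
import findspark
findspark.init()

import pyspark

sc = pyspark.SparkContext(appName="check")
print(sc.version)
sc.stop()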

How to insert a custom function within For loop in pyspark?

Submitted by ﹥>﹥吖頭↗ on 2021-02-18 19:41:53
Question: I am facing a challenge in Spark on Azure Databricks. I have a dataset such as:

+------------------+----------+-------------------+---------------+
|     OpptyHeaderID|   OpptyID|               Date|BaseAmountMonth|
+------------------+----------+-------------------+---------------+
|0067000000i6ONPAA2|OP-0164615|2014-07-27 00:00:00|    4375.800000|
|0065w0000215k5kAAA|OP-0218055|2020-12-23 00:00:00|    4975.000000|
+------------------+----------+-------------------+---------------+

Now I need to use a loop function to
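The question is cut off at this point, so the sketch below is only a generic illustration (not the asker's actual requirement) of how a custom function can be applied inside a plain Python for loop with withColumn; the round_amount helper and the column list are hypothetical:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("0067000000i6ONPAA2", "OP-0164615", 4375.8),
     ("0065w0000215k5kAAA", "OP-0218055", 4975.0)],
    ["OpptyHeaderID", "OpptyID", "BaseAmountMonth"],
)

def round_amount(col):
    # Example custom transformation: round to two decimals.
    return F.round(col, 2)

# Apply the custom function to each column of interest inside a for loop.
for c in ["BaseAmountMonth"]:
    df = df.withColumn(c, round_amount(F.col(c)))

df.show()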

PySpark UDF optimization challenge using a dictionary with regex's (Scala?)

Submitted by ぐ巨炮叔叔 on 2021-02-18 17:09:50
Question: I am trying to optimize the code below (a PySpark UDF). It gives me the desired result (based on my data set), but it is too slow on very large datasets (approx. 180M). The results (accuracy) are better than those of the available Python modules (e.g. geotext, hdx-python-country), so I am not looking for another module. DataFrame:

df = spark.createDataFrame([
    ["3030 Whispering Pines Circle, Prosper Texas, US", "John"],
    ["Kalverstraat Amsterdam", "Mary"],
    ["Kalverstraat Amsterdam, Netherlands", "Lex"]
]).toDF(
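The asker's UDF is not shown in full above; purely as a sketch of the general pattern it describes (compile a dictionary of regexes once, broadcast it, and match inside a UDF so executors reuse the compiled patterns instead of rebuilding them per row), with made-up country patterns:

import re
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([
    ["3030 Whispering Pines Circle, Prosper Texas, US", "John"],
    ["Kalverstraat Amsterdam", "Mary"],
    ["Kalverstraat Amsterdam, Netherlands", "Lex"]
]).toDF("address", "name")

# Hypothetical country patterns; compiled once and broadcast to the executors.
patterns = {
    "US": re.compile(r"\b(US|Texas)\b", re.IGNORECASE),
    "NL": re.compile(r"\b(Netherlands|Amsterdam)\b", re.IGNORECASE),
}
bc_patterns = spark.sparkContext.broadcast(patterns)

@F.udf(returnType=StringType())
def find_country(text):
    # Return the first country code whose pattern matches the address.
    for code, rx in bc_patterns.value.items():
        if rx.search(text):
            return code
    return None

df.withColumn("country", find_country("address")).show(truncate=False)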

Pyspark RDD collect first 163 Rows

Submitted by 不打扰是莪最后的温柔 on 2021-02-18 13:51:54
Question: Is there a way to get the first 163 rows of an RDD without converting it to a DataFrame? I've tried something like newrdd = rdd.take(163), but that returns a list, and rdd.collect() returns the whole RDD. Is there a way to do this? Or, if not, is there a way to convert a list into an RDD?

Answer 1: It is not very efficient, but you can zipWithIndex and filter:

rdd.zipWithIndex().filter(lambda vi: vi[1] < 163).keys()

In practice it makes more sense to simply take and parallelize:

sc.parallelize(rdd.take(163))
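A minimal runnable version of both suggestions from the answer, using a small hypothetical RDD:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(range(1000))

# Option 1: stay in RDD land by attaching an index and filtering on it.
first_163 = rdd.zipWithIndex().filter(lambda vi: vi[1] < 163).keys()

# Option 2: take() returns a local list; parallelize turns it back into an RDD.
first_163_alt = sc.parallelize(rdd.take(163))

print(first_163.count(), first_163_alt.count())  # 163 163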

How to avoid multiple window functions in a expression in pyspark

Submitted by 拥有回忆 on 2021-02-18 07:55:48
Question: I want Spark to avoid creating two separate window stages for the same window object that is used twice in my code. How can I use it once in the following example and tell Spark to do both the sum and the division under a single window?

df = df.withColumn(
    "colum_c",
    f.sum(f.col("colum_a")).over(window) / f.sum(f.col("colum_b")).over(window)
)

Example:

days = lambda i: (i - 1) * 86400
window = (
    Window()
    .partitionBy(f.col("account_id"))
    .orderBy(f.col("event_date").cast("timestamp").cast("long"))
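A self-contained sketch (hypothetical data) of the same expression; both aggregates reference an identical window specification, and df.explain() can be used to check whether the optimizer plans them into a single Window operator rather than two:

from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as f

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("a1", "2021-01-01", 10.0, 2.0), ("a1", "2021-01-02", 20.0, 4.0)],
    ["account_id", "event_date", "colum_a", "colum_b"],
)

window = (
    Window
    .partitionBy(f.col("account_id"))
    .orderBy(f.col("event_date").cast("timestamp").cast("long"))
)

# Both sums use the identical window spec; inspect the physical plan to see
# whether they are evaluated in one Window stage.
df = df.withColumn(
    "colum_c",
    f.sum("colum_a").over(window) / f.sum("colum_b").over(window),
)
df.explain()
df.show()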

Calculate a grouped median in pyspark

Submitted by 倖福魔咒の on 2021-02-18 07:55:36
Question: When using PySpark, I'd like to be able to calculate the difference between grouped values and their median for the group. Is this possible? Here is some code I hacked up that does what I want, except that it calculates the grouped diff from the mean. Also, please feel free to comment on how I could make this better if you feel like being helpful :)

from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.sql.types import (
    StringType, LongType, DoubleType, StructField,
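Not the asker's code, but one common way to get a grouped (approximate) median is the percentile_approx SQL function with p = 0.5, joined back to the original rows so the per-row difference can be computed; the group/value column names below are made up:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("a", 1.0), ("a", 2.0), ("a", 10.0), ("b", 5.0), ("b", 7.0)],
    ["group", "value"],
)

# Approximate median per group.
medians = df.groupBy("group").agg(
    F.expr("percentile_approx(value, 0.5)").alias("median")
)

# Join the group medians back and compute the per-row difference.
result = df.join(medians, on="group").withColumn(
    "diff_from_median", F.col("value") - F.col("median")
)
result.show()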

Should we parallelize a DataFrame like we parallelize a Seq before training

Submitted by 不羁的心 on 2021-02-17 15:36:40
Question: Consider the code given here: https://spark.apache.org/docs/1.2.0/ml-guide.html

import org.apache.spark.ml.classification.LogisticRegression

val training = sparkContext.parallelize(Seq(
  LabeledPoint(1.0, Vectors.dense(0.0, 1.1, 0.1)),
  LabeledPoint(0.0, Vectors.dense(2.0, 1.0, -1.0)),
  LabeledPoint(0.0, Vectors.dense(2.0, 1.3, 1.0)),
  LabeledPoint(1.0, Vectors.dense(0.0, 1.2, -0.5))))

val lr = new LogisticRegression()
lr.setMaxIter(10).setRegParam(0.01)
val model1 = lr.fit(training)

Assuming we
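Not from the linked guide, but a PySpark sketch of the same training flow for comparison: a DataFrame created with createDataFrame (or any reader) is already distributed across the cluster, so there is no separate parallelize step the way there is for a local Seq:

from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.getOrCreate()

# createDataFrame already distributes the rows; no explicit parallelize needed.
training = spark.createDataFrame(
    [(1.0, Vectors.dense(0.0, 1.1, 0.1)),
     (0.0, Vectors.dense(2.0, 1.0, -1.0)),
     (0.0, Vectors.dense(2.0, 1.3, 1.0)),
     (1.0, Vectors.dense(0.0, 1.2, -0.5))],
    ["label", "features"],
)

lr = LogisticRegression(maxIter=10, regParam=0.01)
model = lr.fit(training)
print(model.coefficients)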