pyspark

Access to WrappedArray elements

Submitted by 百般思念 on 2021-02-19 02:37:39
Question: I have a Spark DataFrame and here is its schema:

|-- eid: long (nullable = true)
|-- age: long (nullable = true)
|-- sex: long (nullable = true)
|-- father: array (nullable = true)
|    |-- element: array (containsNull = true)
|    |    |-- element: long (containsNull = true)

and a sample of rows:

df.select(df['father']).show()

+--------------------+
|              father|
+--------------------+
|[WrappedArray(-17...|
|[WrappedArray(-11...|
|[WrappedArray(13,...|
+--------------------+

and the type is DataFrame
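A minimal sketch (not the asker's data) of how such a nested array column can be reached from PySpark, assuming a hypothetical DataFrame with the same father: array<array<long>> schema; indexing with getItem, or flattening with explode, keeps the work on the JVM side so the Python code never has to deal with WrappedArray directly:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical data mirroring the schema: father is an array of arrays of longs.
df = spark.createDataFrame(
    [(1, [[-17, 2], [3, 4]]), (2, [[-11, 5]])],
    ["eid", "father"],
)

# Index into the nested arrays: first inner array, then its first element.
df.select(F.col("father").getItem(0).getItem(0).alias("first_value")).show()

# Or flatten one level with explode to get one inner array per row.
df.select("eid", F.explode("father").alias("inner")).show()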

Pyspark command not recognised

Submitted by 断了今生、忘了曾经 on 2021-02-19 01:17:47
Question: I have Anaconda installed, and I have also downloaded Spark 1.6.2. I am using the instructions from this answer to configure Spark for Jupyter. I have downloaded and unzipped the Spark directory as ~/spark. Now when I cd into this directory and then into bin, I see the following:

SFOM00618927A:spark $ cd bin
SFOM00618927A:bin $ ls
beeline       pyspark       run-example.cmd    spark-class2.cmd   spark-sql     sparkR
beeline.cmd   pyspark.cmd   run-example2.cmd   spark-shell        spark-submit
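Not from the linked answer, but one common way to make a local Spark unpack usable from an Anaconda/Jupyter Python session is to point the environment at it before importing pyspark. The ~/spark path below comes from the question; findspark is an extra helper package (pip install findspark) that this sketch assumes is installed:

import os

# Point the session at the unpacked Spark directory from the question.
os.environ["SPARK_HOME"] = os.path.expanduser("~/spark")

# findspark adds SPARK_HOME's python/ directories to sys.path so pyspark imports.
import findspark
findspark.init()

import pyspark

sc = pyspark.SparkContext(appName="check")
print(sc.version)
sc.stop()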

How to insert a custom function within For loop in pyspark?

Submitted by ﹥>﹥吖頭↗ on 2021-02-18 19:41:53
Question: I am facing a challenge in Spark on Azure Databricks. I have a dataset such as:

+------------------+----------+-------------------+---------------+
|     OpptyHeaderID|   OpptyID|               Date|BaseAmountMonth|
+------------------+----------+-------------------+---------------+
|0067000000i6ONPAA2|OP-0164615|2014-07-27 00:00:00|    4375.800000|
|0065w0000215k5kAAA|OP-0218055|2020-12-23 00:00:00|    4975.000000|
+------------------+----------+-------------------+---------------+

Now I need to use a loop function to
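The question is cut off at this point, so the sketch below is only a generic illustration (not the asker's actual requirement) of how a custom function can be applied inside a plain Python for loop with withColumn; the round_amount helper and the column list are hypothetical:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("0067000000i6ONPAA2", "OP-0164615", 4375.8),
     ("0065w0000215k5kAAA", "OP-0218055", 4975.0)],
    ["OpptyHeaderID", "OpptyID", "BaseAmountMonth"],
)

def round_amount(col):
    # Example custom transformation: round to two decimals.
    return F.round(col, 2)

# Apply the custom function to each column of interest inside a for loop.
for c in ["BaseAmountMonth"]:
    df = df.withColumn(c, round_amount(F.col(c)))

df.show()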

PySpark UDF optimization challenge using a dictionary with regex's (Scala?)

Submitted by ぐ巨炮叔叔 on 2021-02-18 17:09:50
Question: I am trying to optimize the code below (a PySpark UDF). It gives me the desired result (based on my data set), but it is too slow on very large datasets (approx. 180M). The results (accuracy) are better than those of the available Python modules (e.g. geotext, hdx-python-country), so I am not looking for another module. DataFrame:

df = spark.createDataFrame([
    ["3030 Whispering Pines Circle, Prosper Texas, US", "John"],
    ["Kalverstraat Amsterdam", "Mary"],
    ["Kalverstraat Amsterdam, Netherlands", "Lex"]
]).toDF(
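The asker's UDF is not shown in full above; purely as a sketch of the general pattern it describes (compile a dictionary of regexes once, broadcast it, and match inside a UDF so executors reuse the compiled patterns instead of rebuilding them per row), with made-up country patterns:

import re
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([
    ["3030 Whispering Pines Circle, Prosper Texas, US", "John"],
    ["Kalverstraat Amsterdam", "Mary"],
    ["Kalverstraat Amsterdam, Netherlands", "Lex"]
]).toDF("address", "name")

# Hypothetical country patterns; compiled once and broadcast to the executors.
patterns = {
    "US": re.compile(r"\b(US|Texas)\b", re.IGNORECASE),
    "NL": re.compile(r"\b(Netherlands|Amsterdam)\b", re.IGNORECASE),
}
bc_patterns = spark.sparkContext.broadcast(patterns)

@F.udf(returnType=StringType())
def find_country(text):
    # Return the first country code whose pattern matches the address.
    for code, rx in bc_patterns.value.items():
        if rx.search(text):
            return code
    return None

df.withColumn("country", find_country("address")).show(truncate=False)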

Pyspark RDD collect first 163 Rows

Submitted by 不打扰是莪最后的温柔 on 2021-02-18 13:51:54
Question: Is there a way to get the first 163 rows of an RDD without converting it to a DataFrame? I've tried something like newrdd = rdd.take(163), but that returns a list, and rdd.collect() returns the whole RDD. Is there a way to do this? Or, if not, is there a way to convert a list into an RDD?

Answer 1: It is not very efficient, but you can zipWithIndex and filter:

rdd.zipWithIndex().filter(lambda vi: vi[1] < 163).keys()

In practice it makes more sense to simply take and parallelize:

sc.parallelize(rdd.take(163))
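A minimal runnable version of both suggestions from the answer, using a small hypothetical RDD:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(range(1000))

# Option 1: stay in RDD land by attaching an index and filtering on it.
first_163 = rdd.zipWithIndex().filter(lambda vi: vi[1] < 163).keys()

# Option 2: take() returns a local list; parallelize turns it back into an RDD.
first_163_alt = sc.parallelize(rdd.take(163))

print(first_163.count(), first_163_alt.count())  # 163 163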

How to avoid multiple window functions in a expression in pyspark

Submitted by 拥有回忆 on 2021-02-18 07:55:48
Question: I want Spark to avoid creating two separate window stages for the same window object that is used twice in my code. How can I use it once in the following example and tell Spark to do both the sum and the division under a single window?

df = df.withColumn(
    "colum_c",
    f.sum(f.col("colum_a")).over(window) / f.sum(f.col("colum_b")).over(window)
)

Example:

days = lambda i: (i - 1) * 86400
window = (
    Window()
    .partitionBy(f.col("account_id"))
    .orderBy(f.col("event_date").cast("timestamp").cast("long"))
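A self-contained sketch (hypothetical data) of the same expression; both aggregates reference an identical window specification, and df.explain() can be used to check whether the optimizer plans them into a single Window operator rather than two:

from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as f

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("a1", "2021-01-01", 10.0, 2.0), ("a1", "2021-01-02", 20.0, 4.0)],
    ["account_id", "event_date", "colum_a", "colum_b"],
)

window = (
    Window
    .partitionBy(f.col("account_id"))
    .orderBy(f.col("event_date").cast("timestamp").cast("long"))
)

# Both sums use the identical window spec; inspect the physical plan to see
# whether they are evaluated in one Window stage.
df = df.withColumn(
    "colum_c",
    f.sum("colum_a").over(window) / f.sum("colum_b").over(window),
)
df.explain()
df.show()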

Calculate a grouped median in pyspark

Submitted by 倖福魔咒の on 2021-02-18 07:55:36
Question: When using PySpark, I'd like to be able to calculate the difference between grouped values and their median for the group. Is this possible? Here is some code I hacked up that does what I want, except that it calculates the grouped diff from the mean. Also, please feel free to comment on how I could make this better if you feel like being helpful :)

from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.sql.types import (
    StringType, LongType, DoubleType, StructField,
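Not the asker's code, but one common way to get a grouped (approximate) median is the percentile_approx SQL function with p = 0.5, joined back to the original rows so the per-row difference can be computed; the group/value column names below are made up:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("a", 1.0), ("a", 2.0), ("a", 10.0), ("b", 5.0), ("b", 7.0)],
    ["group", "value"],
)

# Approximate median per group.
medians = df.groupBy("group").agg(
    F.expr("percentile_approx(value, 0.5)").alias("median")
)

# Join the group medians back and compute the per-row difference.
result = df.join(medians, on="group").withColumn(
    "diff_from_median", F.col("value") - F.col("median")
)
result.show()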

Should we parallelize a DataFrame like we parallelize a Seq before training

Submitted by 不羁的心 on 2021-02-17 15:36:40
Question: Consider the code given here: https://spark.apache.org/docs/1.2.0/ml-guide.html

import org.apache.spark.ml.classification.LogisticRegression

val training = sparkContext.parallelize(Seq(
  LabeledPoint(1.0, Vectors.dense(0.0, 1.1, 0.1)),
  LabeledPoint(0.0, Vectors.dense(2.0, 1.0, -1.0)),
  LabeledPoint(0.0, Vectors.dense(2.0, 1.3, 1.0)),
  LabeledPoint(1.0, Vectors.dense(0.0, 1.2, -0.5))))

val lr = new LogisticRegression()
lr.setMaxIter(10).setRegParam(0.01)
val model1 = lr.fit(training)

Assuming we
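Not from the linked guide, but a PySpark sketch of the same training flow for comparison: a DataFrame created with createDataFrame (or any reader) is already distributed across the cluster, so there is no separate parallelize step the way there is for a local Seq:

from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.getOrCreate()

# createDataFrame already distributes the rows; no explicit parallelize needed.
training = spark.createDataFrame(
    [(1.0, Vectors.dense(0.0, 1.1, 0.1)),
     (0.0, Vectors.dense(2.0, 1.0, -1.0)),
     (0.0, Vectors.dense(2.0, 1.3, 1.0)),
     (1.0, Vectors.dense(0.0, 1.2, -0.5))],
    ["label", "features"],
)

lr = LogisticRegression(maxIter=10, regParam=0.01)
model = lr.fit(training)
print(model.coefficients)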