spark-dataframe

Trouble With Pyspark Round Function

北慕城南 submitted on 2019-12-18 04:48:07
Question: Having some trouble getting the round function in PySpark to work. I have the block of code below, where I'm trying to round the new_bid column to 2 decimal places and rename the column as bid afterwards. I'm importing pyspark.sql.functions as func for reference, and using the round function contained within it:

output = output.select(col("ad").alias("ad_id"),
                       col("part").alias("part_id"),
                       func.round(col("new_bid"), 2).alias("bid"))

The new_bid column here is of type float; the resulting …
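A minimal runnable sketch of the rounding described above, with made-up data and the same column names; pyspark.sql.functions.round uses HALF_UP rounding, and a float column can still display extra digits, so casting to a decimal type is a common follow-up if exact display precision matters:

# Minimal sketch (not the asker's exact data): rounding a float column to 2 decimals.
from pyspark.sql import SparkSession
from pyspark.sql import functions as func
from pyspark.sql.functions import col

spark = SparkSession.builder.master("local[1]").appName("round-demo").getOrCreate()

df = spark.createDataFrame(
    [("a1", "p1", 12.3456), ("a2", "p2", 0.995)],
    ["ad", "part", "new_bid"],
)

output = df.select(
    col("ad").alias("ad_id"),
    col("part").alias("part_id"),
    func.round(col("new_bid"), 2).alias("bid"),  # HALF_UP rounding to 2 decimal places
)
output.show()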

Pyspark: cast array with nested struct to string

生来就可爱ヽ(ⅴ<●) submitted on 2019-12-18 03:44:38
Question: I have a PySpark dataframe with a column named Filters of type "array<struct<…>>". I want to save my dataframe to a CSV file, and for that I need to cast the array to string type. I tried to cast it with DF.Filters.tostring() and DF.Filters.cast(StringType()), but both solutions generate an error message for each row in the Filters column: org.apache.spark.sql.catalyst.expressions.UnsafeArrayData@56234c19. The code is as follows:

from pyspark.sql.types import StringType
DF.printSchema()

|-- ClientNum: string (nullable = …
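The question is cut off above, but one widely used way to make such a column CSV-friendly is to serialize it with pyspark.sql.functions.to_json (available since Spark 2.1). A hedged sketch with invented field names:

# Hedged sketch: serializing an array<struct<...>> column to a JSON string so it can
# be written to CSV. Column/field names here are made up for illustration.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.master("local[1]").appName("to-json-demo").getOrCreate()

df = spark.createDataFrame(
    [("c1", [("acme", "2017-01-01")]), ("c2", [("init", "2017-02-01")])],
    "ClientNum string, Filters array<struct<company:string, date:string>>",
)

# to_json (Spark 2.1+) turns the complex column into a plain string column,
# which the CSV writer accepts.
flat = df.withColumn("Filters", F.to_json("Filters"))
flat.show(truncate=False)
# flat.write.csv("/tmp/filters_csv", header=True)  # output path is an assumption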

How to slice a pyspark dataframe in two row-wise

人走茶凉 submitted on 2019-12-18 03:34:47
Question: I am working in Databricks. I have a dataframe which contains 500 rows, and I would like to create two dataframes: one containing 100 rows and the other containing the remaining 400 rows.

+--------------------+----------+
|              userid| eventdate|
+--------------------+----------+
|00518b128fc9459d9...|2017-10-09|
|00976c0b7f2c4c2ca...|2017-12-16|
|00a60fb81aa74f35a...|2017-12-04|
|00f9f7234e2c4bf78...|2017-05-09|
|0146fe6ad7a243c3b...|2017-11-21|
|016567f169c145ddb...|2017-10-16|
|01ccd278777946cb8... …
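One common approach (not necessarily what the asker ended up using) is limit() plus subtract(); a small sketch with synthetic data:

# Hedged sketch: split a dataframe row-wise with limit() and subtract();
# ordering first makes the split deterministic. Data below is invented.
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").appName("split-demo").getOrCreate()

df = spark.range(500).withColumnRenamed("id", "userid")

first_100 = df.orderBy("userid").limit(100)
remaining_400 = df.subtract(first_100)   # the other 400 rows

print(first_100.count(), remaining_400.count())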

Using UDF ignores condition in when

淺唱寂寞╮ submitted on 2019-12-17 21:11:15
Question: Suppose you had the following PySpark DataFrame:

data = [('foo',), ('123',), (None,), ('bar',)]
df = sqlCtx.createDataFrame(data, ["col"])
df.show()
#+----+
#| col|
#+----+
#| foo|
#| 123|
#|null|
#| bar|
#+----+

The next two code blocks should do the same thing, that is, return the uppercase of the column if it is not null. However, the second method (using a udf) produces an error.

Method 1: Using pyspark.sql.functions.upper()

import pyspark.sql.functions as f
df.withColumn('upper', f…
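The snippet is truncated above, but the usual explanation for this class of error is that Spark does not guarantee short-circuit evaluation inside when(), so the Python UDF may still receive the null rows. A hedged sketch of a null-safe UDF:

# Hedged sketch: making the UDF itself tolerate None avoids the error even when
# Spark evaluates it on rows the when() condition would exclude.
from pyspark.sql import SparkSession
from pyspark.sql import functions as f
from pyspark.sql.types import StringType

spark = SparkSession.builder.master("local[1]").appName("udf-when-demo").getOrCreate()

df = spark.createDataFrame([('foo',), ('123',), (None,), ('bar',)], ["col"])

upper_udf = f.udf(lambda s: s.upper() if s is not None else None, StringType())

df.withColumn(
    "upper",
    f.when(f.col("col").isNotNull(), upper_udf(f.col("col")))
).show()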

How to turn off scientific notation in pyspark?

拜拜、爱过 submitted on 2019-12-17 20:37:26
Question: As the result of some aggregation I come up with the following Spark dataframe:

+------------+-----------------+-----------------+
|sale_user_id|     gross_profit|total_sale_volume|
+------------+-----------------+-----------------+
| 20569| -3322960.0| 2.12569482E8|
| 24269| -1876253.0| 8.6424626E7|
| 9583| 0.0| 1.282272E7|
| 11722| 18229.0| 5653149.0|
| 37982| 6077.0| 1181243.0|
| 20428| 1665.0| 7011588.0|
| 41157| 73227.0| 1.18631E7|
| 9993| 0.0| 1481437.0|
| 9030| 8865.0| 4.4133791E7|
| 829| 0.0| …
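A hedged sketch of one way to suppress the E-notation: the scientific display comes from double columns, and casting to DecimalType (or formatting with format_number) shows plain decimals. The data below is abbreviated from the table above:

# Hedged sketch: cast a double column to a fixed-point decimal so show() prints
# plain numbers instead of scientific notation.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import DecimalType

spark = SparkSession.builder.master("local[1]").appName("notation-demo").getOrCreate()

df = spark.createDataFrame(
    [(20569, -3322960.0, 2.12569482e8), (24269, -1876253.0, 8.6424626e7)],
    ["sale_user_id", "gross_profit", "total_sale_volume"],
)

df.withColumn(
    "total_sale_volume",
    F.col("total_sale_volume").cast(DecimalType(18, 2)),   # fixed-point display
).show()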

How does createOrReplaceTempView work in Spark?

被刻印的时光 ゝ submitted on 2019-12-17 18:02:25
Question: I am new to Spark and Spark SQL. How does createOrReplaceTempView work in Spark? If we register an RDD of objects as a table, will Spark keep all the data in memory?

Answer 1: createOrReplaceTempView creates (or replaces, if a view with that name already exists) a lazily evaluated "view" that you can then use like a Hive table in Spark SQL. It does not persist to memory unless you cache the dataset that underpins the view.

scala> val s = Seq(1,2,3).toDF("num")
s: org.apache.spark.sql.DataFrame = [num: int] …
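For reference, a small PySpark sketch of the same point (names are illustrative): the temp view is just a name bound to a query plan, and data is only held in memory after an explicit cache plus an action:

# Hedged PySpark sketch: createOrReplaceTempView does not materialize anything.
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").appName("tempview-demo").getOrCreate()

s = spark.createDataFrame([(1,), (2,), (3,)], ["num"])
s.createOrReplaceTempView("nums")          # lazily evaluated view, no data materialized

spark.sql("SELECT num * 2 AS doubled FROM nums").show()

s.cache()                                  # mark the underlying DataFrame for caching
s.count()                                  # an action is needed to actually populate the cache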

Convert spark DataFrame column to python list

核能气质少年 submitted on 2019-12-17 17:25:00
Question: I work on a dataframe with two columns, mvv and count.

+---+-----+
|mvv|count|
+---+-----+
| 1 | 5 |
| 2 | 9 |
| 3 | 3 |
| 4 | 1 |

I would like to obtain two lists containing the mvv values and the count values, something like mvv = [1,2,3,4] and count = [5,9,3,1]. So I tried the following code: the first line should return a Python list of rows, and I wanted to see the first value:

mvv_list = mvv_count_df.select('mvv').collect()
firstvalue = mvv_list[0].getInt(0)

But I get an error message with the second line: …
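The error itself is cut off, but getInt()/getFloat() are Scala Row methods; in PySpark a Row is indexed with [] or by field name. A minimal sketch:

# Hedged sketch: a list comprehension over collect() yields plain Python lists.
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").appName("collect-demo").getOrCreate()

mvv_count_df = spark.createDataFrame(
    [(1, 5), (2, 9), (3, 3), (4, 1)], ["mvv", "count"]
)

mvv = [row["mvv"] for row in mvv_count_df.select("mvv").collect()]
count = [row["count"] for row in mvv_count_df.select("count").collect()]

print(mvv)    # [1, 2, 3, 4]
print(count)  # [5, 9, 3, 1]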

How to get a value from the Row object in Spark Dataframe?

£可爱£侵袭症+ submitted on 2019-12-17 16:34:22
Question: For

averageCount = (wordCountsDF
                .groupBy().mean()).head()

I get Row(avg(count)=1.6666666666666667), but when I try:

averageCount = (wordCountsDF
                .groupBy().mean()).head().getFloat(0)

I get the following error:

AttributeError: getFloat
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
in ()
      1 # TODO: Replace with appropriate code
----> 2 averageCount = (wordCountsDF
      3                 .groupBy().mean()).head().getFloat(0)
      4
      5 print …
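As in the previous question, PySpark Row objects have no getFloat(); a hedged sketch of the usual alternatives, using invented data:

# Hedged sketch: read a value out of a Row by position, by field name, or via asDict().
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").appName("row-demo").getOrCreate()

wordCountsDF = spark.createDataFrame([("a", 1), ("b", 2), ("c", 2)], ["word", "count"])

row = wordCountsDF.groupBy().mean().head()   # Row(avg(count)=1.666...)

averageCount = row[0]                        # by position
# averageCount = row["avg(count)"]           # by field name
# averageCount = row.asDict()["avg(count)"]  # via a plain dict
print(averageCount)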

Which Spark library version supports SparkSession?

给你一囗甜甜゛ submitted on 2019-12-17 16:30:50
Question: Spark code with SparkSession:

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext

val conf = SparkSession.builder
  .master("local")
  .appName("testing")
  .enableHiveSupport()  // <- enable Hive support.
  .getOrCreate()

pom.xml:

<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId …
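The pom.xml is truncated above, but SparkSession was introduced in Spark 2.0 and lives in the spark-sql module, so the dependency needs to be version 2.0 or later. For reference, a PySpark equivalent of the snippet (illustrative only):

# Hedged note: SparkSession requires Spark 2.0+; this is the PySpark equivalent
# of the Scala builder above, not the asker's actual code.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local")
    .appName("testing")
    .enableHiveSupport()   # requires a Spark build with Hive support on the classpath
    .getOrCreate()
)
print(spark.version)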

Applying a Window function to calculate differences in pySpark

蓝咒 submitted on 2019-12-17 15:42:44
Question: I am using PySpark and have set up my dataframe with two columns representing a daily asset price, as follows:

ind = sc.parallelize(range(1,5))
prices = sc.parallelize([33.3,31.1,51.2,21.3])
data = ind.zip(prices)
df = sqlCtx.createDataFrame(data,["day","price"])

Upon applying df.show() I get:

+---+-----+
|day|price|
+---+-----+
|  1| 33.3|
|  2| 31.1|
|  3| 51.2|
|  4| 21.3|
+---+-----+

Which is fine and all. I would like to have another column that contains the day-to-day returns of the price …
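The question is truncated here, but day-to-day differences of this kind are typically computed with lag() over a Window ordered by day. A hedged sketch reproducing the small table above:

# Hedged sketch: compute the previous day's price with lag() and derive a simple
# return column; the return formula is one common choice, not the asker's spec.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.master("local[1]").appName("window-demo").getOrCreate()

df = spark.createDataFrame(
    [(1, 33.3), (2, 31.1), (3, 51.2), (4, 21.3)], ["day", "price"]
)

w = Window.orderBy("day")
df = df.withColumn("prev_price", F.lag("price").over(w))
df = df.withColumn(
    "return",
    (F.col("price") - F.col("prev_price")) / F.col("prev_price"),  # null on the first day
)
df.show()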