apache-spark-dataset

Pyspark transform method that's equivalent to the Scala Dataset#transform method

随声附和 submitted on 2020-06-27 17:49:05
Question: The Spark Scala API has a Dataset#transform method that makes it easy to chain custom DataFrame transformations like so:

    val weirdDf = df
      .transform(myFirstCustomTransformation)
      .transform(anotherCustomTransformation)

I don't see an equivalent transform method for PySpark in the documentation. Is there a PySpark way to chain custom transformations? If not, how can the pyspark.sql.DataFrame class be monkey patched to add a transform method?

Update: The transform method was added to PySpark as
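The transformations being chained are just functions from one DataFrame to another, which is the shape Dataset#transform expects. A minimal self-contained Scala sketch of that pattern (the transformation names match the question, but their bodies and columns are made up for illustration):

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.lit

val spark = SparkSession.builder().appName("transform-chaining").master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical custom transformations: each one is a plain DataFrame => DataFrame
// function, which is exactly the argument type Dataset#transform takes.
def myFirstCustomTransformation(df: DataFrame): DataFrame =
  df.withColumn("greeting", lit("hello"))

def anotherCustomTransformation(df: DataFrame): DataFrame =
  df.withColumn("farewell", lit("goodbye"))

// A toy DataFrame to run the chain against.
val df = Seq("alice", "bob").toDF("name")

// Chaining reads top to bottom instead of nesting the calls inside out.
val weirdDf = df
  .transform(myFirstCustomTransformation)
  .transform(anotherCustomTransformation)

weirdDf.show()
```

A PySpark equivalent has to reproduce this function-as-argument idea, either through a built-in method where available or by patching one onto pyspark.sql.DataFrame, which is what the question asks about.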

Sorting numeric String in Spark Dataset

旧时模样 submitted on 2020-05-29 06:14:17
Question: Let's assume that I have the following Dataset:

    +-----------+----------+
    |productCode|    amount|
    +-----------+----------+
    |      XX-13|       300|
    |       XX-1|       250|
    |       XX-2|       410|
    |       XX-9|        50|
    |      XX-10|        35|
    |     XX-100|       870|
    +-----------+----------+

where productCode is of String type and amount is an Int. If one tries to order this by productCode, the result will be (and this is expected because of the nature of String comparison):

    def orderProducts(product: Dataset[Product]): Dataset[Product] = {
      product
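The question's code is cut off above, but the usual fix for this kind of ordering is to sort on the numeric suffix of the code rather than on the raw string. A minimal sketch, assuming every productCode has the form "<prefix>-<number>" as in the sample data:

```scala
import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.functions.{col, split}

case class Product(productCode: String, amount: Int)

val spark = SparkSession.builder().appName("numeric-string-sort").master("local[*]").getOrCreate()
import spark.implicits._

// Order by the integer after the dash instead of by the whole string,
// so that XX-9 sorts before XX-10 and XX-100.
def orderProducts(product: Dataset[Product]): Dataset[Product] =
  product.orderBy(split(col("productCode"), "-").getItem(1).cast("int"))

val products = Seq(
  Product("XX-13", 300), Product("XX-1", 250), Product("XX-2", 410),
  Product("XX-9", 50), Product("XX-10", 35), Product("XX-100", 870)
).toDS()

orderProducts(products).show()
```

Because orderBy is applied to a typed Dataset, the result stays a Dataset[Product]; only the sort key is derived from the string column.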

How to read a “.gz” compressed file using Spark DF or DS?

人走茶凉 submitted on 2020-05-29 05:11:16
Question: I have a compressed file in .gz format. Is it possible to read the file directly using a Spark DF/DS? Details: the file is a CSV with tab delimiters.

Answer 1: Reading a compressed CSV is done in the same way as reading an uncompressed CSV file. For Spark version 2.0+ it can be done as follows using Scala (note the extra option for the tab delimiter):

    val df = spark.read.option("sep", "\t").csv("file.csv.gz")

PySpark:

    df = spark.read.csv("file.csv.gz", sep='\t')

The only extra consideration to take into
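The answer is truncated above. One consideration worth adding (a general Spark fact, not a quote from the original answer) is that gzip is not a splittable codec, so a single .gz file is read by one task. A sketch that reads the file and then repartitions so later stages can run in parallel (the path and the header option are illustrative assumptions):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("read-gz-csv").master("local[*]").getOrCreate()

// Spark picks the gzip codec from the ".gz" extension, so the read itself looks
// exactly like reading a plain CSV; only the tab delimiter needs an explicit option.
val df = spark.read
  .option("sep", "\t")        // tab-delimited, per the question
  .option("header", "true")   // assumption: the file has a header row
  .csv("path/to/file.csv.gz") // hypothetical path

// The gzipped file arrives as a single partition; repartition if downstream
// transformations should be spread across the cluster.
val parallel = df.repartition(8)
parallel.show(5)
```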

Create Spark Dataset from a CSV file

ぐ巨炮叔叔 submitted on 2020-05-26 10:59:13
Question: I would like to create a Spark Dataset from a simple CSV file. Here are the contents of the CSV file:

    name,state,number_of_people,coolness_index
    trenton,nj,"10","4.5"
    bedford,ny,"20","3.3"
    patterson,nj,"30","2.2"
    camden,nj,"40","8.8"

Here is the code to make the Dataset:

    var location = "s3a://path_to_csv"

    case class City(name: String, state: String, number_of_people: Long)

    val cities = spark.read
      .option("header", "true")
      .option("charset", "UTF8")
      .option("delimiter",",")
      .csv(location)
      .as
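The snippet above is cut off at .as. Since the CSV reader produces string columns while City expects number_of_people as a Long, the typed conversion needs either schema inference or an explicit cast. A sketch assuming the inference route (the local path stands in for the question's s3a location):

```scala
import org.apache.spark.sql.SparkSession

case class City(name: String, state: String, number_of_people: Long)

val spark = SparkSession.builder().appName("csv-to-dataset").master("local[*]").getOrCreate()
import spark.implicits._

// inferSchema turns the quoted numbers into numeric columns; selecting only the
// fields of City drops coolness_index before the typed conversion.
val cities = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("data/cities.csv") // hypothetical path standing in for s3a://path_to_csv
  .select("name", "state", "number_of_people")
  .as[City]

cities.show()
```

An explicit .withColumn("number_of_people", $"number_of_people".cast("long")) before .as[City] would achieve the same result without relying on schema inference.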

How to use the Spark stats?

ⅰ亾dé卋堺 submitted on 2020-05-17 06:54:31
Question: I'm using spark-sql-2.4.1v, and I'm trying to find quantiles, i.e. percentile 0, percentile 25, etc., on each column of my given data. As I am doing multiple percentiles, how can I retrieve each calculated percentile from the results? Here is an example with the data shown below:

    +----+---------+-------------+----------+-----------+
    |  id|     date|total_revenue|con_dist_1| con_dist_2|
    +----+---------+-------------+----------+-----------+
    |3310|1/15/2018|  0.010680705|         6|0.019875458|
    |3310|1/15/2018| 0
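The example data is cut off above, but for several percentiles per column, DataFrameStatFunctions.approxQuantile accepts an array of column names and returns one array of quantile values per column, which makes each computed percentile easy to pick out. A sketch with toy rows standing in for the question's data:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("column-quantiles").master("local[*]").getOrCreate()
import spark.implicits._

// Toy stand-in for the question's data; only the numeric columns matter here.
val df = Seq(
  (3310, "1/15/2018", 0.010680705, 6.0, 0.019875458),
  (3310, "1/15/2018", 0.006628853, 4.0, 0.816039063),
  (3310, "1/15/2018", 0.010680705, 6.0, 0.019875458)
).toDF("id", "date", "total_revenue", "con_dist_1", "con_dist_2")

val probabilities = Array(0.0, 0.25, 0.5, 0.75, 1.0)
val columns = Array("total_revenue", "con_dist_1", "con_dist_2")

// result(i)(j) is the probabilities(j) quantile of columns(i); a relative
// error of 0.0 asks for exact quantiles.
val result: Array[Array[Double]] = df.stat.approxQuantile(columns, probabilities, 0.0)

columns.zip(result).foreach { case (name, qs) =>
  println(s"$name -> ${qs.mkString(", ")}")
}
```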