pyspark

outlier detection in pyspark

Submitted by 只愿长相守 on 2020-02-27 12:00:11
Question: I have a pyspark data frame as shown below.

+---+-------+--------+
|age|balance|duration|
+---+-------+--------+
|  2|   2143|     261|
| 44|     29|     151|
| 33|      2|      76|
| 50|   1506|      92|
| 33|      1|     198|
| 35|    231|     139|
| 28|    447|     217|
|  2|      2|     380|
| 58|    121|      50|
| 43|    693|      55|
| 41|    270|     222|
| 50|    390|     137|
| 53|      6|     517|
| 58|     71|      71|
| 57|    162|     174|
| 40|    229|     353|
| 45|     13|      98|
| 57|     52|      38|
|  3|      0|     219|
|  4|      0|      54|
+---+-------+--------+

and my expected output should look like,

+---+-------+--------+-
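The excerpt cuts off before the expected output. For reference, one common way to flag outliers in PySpark (an assumption on my part, not taken from the question) is the 1.5 * IQR rule computed with approxQuantile; a minimal sketch over the columns above:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(2, 2143, 261), (44, 29, 151), (33, 2, 76), (50, 1506, 92)],  # subset of the rows above
    ["age", "balance", "duration"],
)

# For each column, compute Q1/Q3 with approxQuantile and flag values
# outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
for c in ["age", "balance", "duration"]:
    q1, q3 = df.approxQuantile(c, [0.25, 0.75], 0.0)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    df = df.withColumn(
        c + "_outlier",
        F.when((F.col(c) < lower) | (F.col(c) > upper), 1).otherwise(0),
    )

df.show()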

pyspark convert dataframe column from timestamp to string of “YYYY-MM-DD” format

Submitted by 匆匆过客 on 2020-02-27 05:59:19
Question: In pyspark, is there a way to convert a dataframe column of timestamp datatype to a string in 'YYYY-MM-DD' format?

Answer 1: If you have a column with the schema

root
 |-- date: timestamp (nullable = true)

then you can use the from_unixtime function to convert the timestamp to a string, after converting the timestamp to bigint with the unix_timestamp function:

from pyspark.sql import functions as f
df.withColumn("date", f.from_unixtime(f.unix_timestamp(df.date), "yyyy-MM-dd"))

and you should have

root
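A self-contained version of that answer, assuming a DataFrame with a single timestamp column named date (a sketch, not the poster's exact data):

import datetime as dt
from pyspark.sql import SparkSession
from pyspark.sql import functions as f

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(dt.datetime(2020, 2, 27, 5, 59, 19),)], ["date"])

# Convert timestamp -> unix seconds -> formatted string.
df = df.withColumn("date", f.from_unixtime(f.unix_timestamp(df.date), "yyyy-MM-dd"))
df.printSchema()  # date is now a string column
df.show()         # shows 2020-02-27

f.date_format(df.date, "yyyy-MM-dd") produces the same string in a single call.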

How to extract a single (column/row) value from a dataframe using PySpark?

Submitted by 不羁的心 on 2020-02-25 22:43:31
Question: Here's my spark code. It works fine and returns 2517. All I want to do is print "2517 degrees"... but I'm not sure how to extract that 2517 into a variable. I can only display the dataframe, not extract values from it. Sounds super easy, but unfortunately I'm stuck! Any help will be appreciated. Thanks!

df = sqlContext.read.format("csv").option("header", "true").option("inferSchema", "true").option("delimiter", "\t").load("dbfs:/databricks-datasets/power-plant/data")
df
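The excerpt ends before the answers. For reference, a common pattern for pulling a single value out of a DataFrame (a sketch, with a hypothetical aggregation standing in for whatever produced the 2517 in the question):

from pyspark.sql import functions as F

# Hypothetical aggregation; the question's actual computation is not shown.
row = df.agg(F.count("*").alias("n")).first()  # first() returns a Row (or None if empty)
value = row["n"]                               # index by name, or row[0] by position
print("{} degrees".format(value))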

Spark aggregations where output columns are functions and rows are columns

Submitted by …衆ロ難τιáo~ on 2020-02-25 05:06:45
Question: I want to compute a bunch of different agg functions on different columns in a dataframe. I know I can do something like this, but the output is all one row.

df.agg(max("cola"), min("cola"), max("colb"), min("colb"))

Let's say I will be performing 100 different aggregations on 10 different columns. I want the output dataframe to look like this:

     |Min |Max |AnotherAggFunction1|AnotherAggFunction2|...etc..
cola | 1  | 10 | ...
colb | 2  | NULL| ...
colc | 5  | 20 | ...
cold | NULL | 42| ...
...

Where my
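The excerpt is truncated, but for reference, one way to get that "columns as rows" shape (an assumed approach, not taken from an answer) is to compute every aggregation with a predictable alias and then rebuild a small DataFrame with one row per source column:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 2, 5), (10, None, 20)], ["cola", "colb", "colc"])

# Name -> aggregation function; extend with more functions as needed.
aggs = {"Min": F.min, "Max": F.max}

# One wide row holding every aggregation, aliased as "<Name>_<column>".
stats = df.agg(*[fn(c).alias(name + "_" + c)
                 for name, fn in aggs.items()
                 for c in df.columns]).first()

# Reshape: one output row per source column, one output column per function.
result = spark.createDataFrame(
    [tuple([c] + [stats[name + "_" + c] for name in aggs]) for c in df.columns],
    ["column"] + list(aggs),
)
result.show()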

How to generate hourly timestamps between two dates in PySpark?

Submitted by 一世执手 on 2020-02-25 04:14:09
Question: Consider this sample dataframe

data = [(dt.datetime(2000,1,1,15,20,37), dt.datetime(2000,1,1,19,12,22))]
df = spark.createDataFrame(data, ["minDate", "maxDate"])
df.show()

+-------------------+-------------------+
|            minDate|            maxDate|
+-------------------+-------------------+
|2000-01-01 15:20:37|2000-01-01 19:12:22|
+-------------------+-------------------+

I would like to explode those two dates into an hourly time-series like

+-------------------+-------------------+
|            minDate|            maxDate|
+--
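The expected output is cut off, but a common approach (an assumption on my part; it needs Spark 2.4+ for the sequence function) is to generate the hourly timestamps with sequence and explode them:

import datetime as dt
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
data = [(dt.datetime(2000, 1, 1, 15, 20, 37), dt.datetime(2000, 1, 1, 19, 12, 22))]
df = spark.createDataFrame(data, ["minDate", "maxDate"])

# sequence(start, stop, step) builds an array of timestamps one hour apart;
# explode turns that array into one row per timestamp.
hourly = df.withColumn(
    "hour",
    F.explode(F.expr("sequence(minDate, maxDate, interval 1 hour)")),
)
hourly.show(truncate=False)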

unable to install pyspark

Submitted by 最后都变了- on 2020-02-24 11:11:49
Question: I am trying to install pyspark like this:

python setup.py install

I get this error:

Could not import pypandoc - required to package PySpark

pypandoc is installed already. Any ideas how I can install pyspark?

Answer 1: I faced the same issue and solved it as below: install pypandoc before installing pyspark.

pip install pypandoc
pip install pyspark

Answer 2: You need to use findspark or spark-submit to use pyspark. After installing Scala and Java, download Apache Spark and put it in some folder. Then try this 2
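As a reference for the findspark route in the second answer, a minimal sketch (the Spark path is a placeholder, not from the post):

import findspark
findspark.init("/path/to/spark")  # folder where the downloaded Spark distribution was unpacked
import pyspark

sc = pyspark.SparkContext(appName="test")
print(sc.version)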