pyspark

outlier detection in pyspark

Submitted by 只愿长相守 on 2020-02-27 12:00:11
Question: I have a pyspark data frame as shown below.

+---+-------+--------+
|age|balance|duration|
+---+-------+--------+
|  2|   2143|     261|
| 44|     29|     151|
| 33|      2|      76|
| 50|   1506|      92|
| 33|      1|     198|
| 35|    231|     139|
| 28|    447|     217|
|  2|      2|     380|
| 58|    121|      50|
| 43|    693|      55|
| 41|    270|     222|
| 50|    390|     137|
| 53|      6|     517|
| 58|     71|      71|
| 57|    162|     174|
| 40|    229|     353|
| 45|     13|      98|
| 57|     52|      38|
|  3|      0|     219|
|  4|      0|      54|
+---+-------+--------+

and my expected output should look like,

+---+-------+--------+-
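The excerpt cuts off before the expected output. For reference, one common way to flag outliers in PySpark (an assumption on my part, not taken from the question) is the 1.5 * IQR rule computed with approxQuantile; a minimal sketch over the columns above:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(2, 2143, 261), (44, 29, 151), (33, 2, 76), (50, 1506, 92)],  # subset of the rows above
    ["age", "balance", "duration"],
)

# For each column, compute Q1/Q3 with approxQuantile and flag values
# outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
for c in ["age", "balance", "duration"]:
    q1, q3 = df.approxQuantile(c, [0.25, 0.75], 0.0)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    df = df.withColumn(
        c + "_outlier",
        F.when((F.col(c) < lower) | (F.col(c) > upper), 1).otherwise(0),
    )

df.show()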

pyspark convert dataframe column from timestamp to string of “YYYY-MM-DD” format

Submitted by 匆匆过客 on 2020-02-27 05:59:19
Question: In pyspark, is there a way to convert a dataframe column of timestamp datatype to a string in 'YYYY-MM-DD' format?

Answer 1: If you have a column with the schema

root
 |-- date: timestamp (nullable = true)

then you can use the from_unixtime function to convert the timestamp to a string, after converting the timestamp to bigint with the unix_timestamp function:

from pyspark.sql import functions as f
df.withColumn("date", f.from_unixtime(f.unix_timestamp(df.date), "yyyy-MM-dd"))

and you should have

root
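A self-contained version of that answer, assuming a DataFrame with a single timestamp column named date (a sketch, not the poster's exact data):

import datetime as dt
from pyspark.sql import SparkSession
from pyspark.sql import functions as f

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(dt.datetime(2020, 2, 27, 5, 59, 19),)], ["date"])

# Convert timestamp -> unix seconds -> formatted string.
df = df.withColumn("date", f.from_unixtime(f.unix_timestamp(df.date), "yyyy-MM-dd"))
df.printSchema()  # date is now a string column
df.show()         # shows 2020-02-27

f.date_format(df.date, "yyyy-MM-dd") produces the same string in a single call.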

How to extract a single (column/row) value from a dataframe using PySpark?

Submitted by 不羁的心 on 2020-02-25 22:43:31
Question: Here's my spark code. It works fine and returns 2517. All I want to do is print "2517 degrees"... but I'm not sure how to extract that 2517 into a variable. I can only display the dataframe, not extract values from it. Sounds super easy, but unfortunately I'm stuck! Any help will be appreciated. Thanks!

df = sqlContext.read.format("csv").option("header", "true").option("inferSchema", "true").option("delimiter", "\t").load("dbfs:/databricks-datasets/power-plant/data")
df
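The excerpt ends before the answers. For reference, a common pattern for pulling a single value out of a DataFrame (a sketch, with a hypothetical aggregation standing in for whatever produced the 2517 in the question):

from pyspark.sql import functions as F

# Hypothetical aggregation; the question's actual computation is not shown.
row = df.agg(F.count("*").alias("n")).first()  # first() returns a Row (or None if empty)
value = row["n"]                               # index by name, or row[0] by position
print("{} degrees".format(value))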

Spark aggregations where output columns are functions and rows are columns

Submitted by …衆ロ難τιáo~ on 2020-02-25 05:06:45
Question: I want to compute a bunch of different agg functions on different columns in a dataframe. I know I can do something like this, but the output is all one row.

df.agg(max("cola"), min("cola"), max("colb"), min("colb"))

Let's say I will be performing 100 different aggregations on 10 different columns. I want the output dataframe to look like this:

     |Min |Max |AnotherAggFunction1|AnotherAggFunction2|...etc..
cola | 1  | 10 | ...
colb | 2  | NULL| ...
colc | 5  | 20 | ...
cold | NULL | 42| ...
...

Where my
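The excerpt is truncated, but for reference, one way to get that "columns as rows" shape (an assumed approach, not taken from an answer) is to compute every aggregation with a predictable alias and then rebuild a small DataFrame with one row per source column:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 2, 5), (10, None, 20)], ["cola", "colb", "colc"])

# Name -> aggregation function; extend with more functions as needed.
aggs = {"Min": F.min, "Max": F.max}

# One wide row holding every aggregation, aliased as "<Name>_<column>".
stats = df.agg(*[fn(c).alias(name + "_" + c)
                 for name, fn in aggs.items()
                 for c in df.columns]).first()

# Reshape: one output row per source column, one output column per function.
result = spark.createDataFrame(
    [tuple([c] + [stats[name + "_" + c] for name in aggs]) for c in df.columns],
    ["column"] + list(aggs),
)
result.show()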

How to generate hourly timestamps between two dates in PySpark?

Submitted by 一世执手 on 2020-02-25 04:14:09
Question: Consider this sample dataframe

data = [(dt.datetime(2000,1,1,15,20,37), dt.datetime(2000,1,1,19,12,22))]
df = spark.createDataFrame(data, ["minDate", "maxDate"])
df.show()

+-------------------+-------------------+
|            minDate|            maxDate|
+-------------------+-------------------+
|2000-01-01 15:20:37|2000-01-01 19:12:22|
+-------------------+-------------------+

I would like to explode those two dates into an hourly time-series like

+-------------------+-------------------+
|            minDate|            maxDate|
+--
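The expected output is cut off, but a common approach (an assumption on my part; it needs Spark 2.4+ for the sequence function) is to generate the hourly timestamps with sequence and explode them:

import datetime as dt
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
data = [(dt.datetime(2000, 1, 1, 15, 20, 37), dt.datetime(2000, 1, 1, 19, 12, 22))]
df = spark.createDataFrame(data, ["minDate", "maxDate"])

# sequence(start, stop, step) builds an array of timestamps one hour apart;
# explode turns that array into one row per timestamp.
hourly = df.withColumn(
    "hour",
    F.explode(F.expr("sequence(minDate, maxDate, interval 1 hour)")),
)
hourly.show(truncate=False)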

unable to install pyspark

Submitted by 最后都变了- on 2020-02-24 11:11:49
Question: I am trying to install pyspark like this:

python setup.py install

I get this error:

Could not import pypandoc - required to package PySpark

pypandoc is installed already. Any ideas how I can install pyspark?

Answer 1: I faced the same issue and solved it as below: install pypandoc before installing pyspark.

pip install pypandoc
pip install pyspark

Answer 2: You need to use findspark or spark-submit to use pyspark. After installing Scala and Java, download Apache Spark and put it in some folder. Then try this 2
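As a reference for the findspark route in the second answer, a minimal sketch (the Spark path is a placeholder, not from the post):

import findspark
findspark.init("/path/to/spark")  # folder where the downloaded Spark distribution was unpacked
import pyspark

sc = pyspark.SparkContext(appName="test")
print(sc.version)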