pyspark

Get Last Monday in Spark

Submitted by 社会主义新天地 on 2020-01-03 18:05:53
Question: I am using Spark 2.0 with the Python API. I have a dataframe with a column of type DateType(). I would like to add a column to the dataframe containing the most recent Monday. I can do it like this:

reg_schema = pyspark.sql.types.StructType([
    pyspark.sql.types.StructField('AccountCreationDate', pyspark.sql.types.DateType(), True),
    pyspark.sql.types.StructField('UserId', pyspark.sql.types.LongType(), True)
])
reg = spark.read.schema(reg_schema).option('header', True).csv(path_to_file)
reg =
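
A minimal sketch of one way to do this, assuming the goal is the Monday on or before each AccountCreationDate: next_day() jumps to the following Monday and date_sub() steps back a week, so a date that is already a Monday maps to itself.

from pyspark.sql import functions as F

# Most recent Monday on or before AccountCreationDate:
# next_day() returns the first Monday strictly after the date, so subtract 7 days.
reg = reg.withColumn(
    "most_recent_monday",
    F.date_sub(F.next_day(F.col("AccountCreationDate"), "monday"), 7)
)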

withColumn not allowing me to use max() function to generate a new column

Submitted by 非 Y 不嫁゛ on 2020-01-03 16:47:38
Question: I have a dataset like this:

a = sc.parallelize([[1,2,3],[0,2,1],[9,8,7]]).toDF(["one", "two", "three"])

I want to have a dataset that adds a new column that is equal to the largest value in the other three columns. The output would look like this:

+----+----+-----+-------+
| one| two|three|max_col|
+----+----+-----+-------+
|   1|   2|    3|      3|
|   0|   2|    1|      2|
|   9|   8|    7|      9|
+----+----+-----+-------+

I thought I would use withColumn, like so:

b = a.withColumn("max_col", max(a["one"], a["two"], a[
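
A minimal sketch of an alternative, assuming Spark's built-in greatest() is acceptable here: it computes a row-wise maximum across columns and avoids the clash between Python's built-in max and pyspark.sql.functions.max, which is an aggregate over rows.

from pyspark.sql import functions as F

# Row-wise maximum across the three columns; F.max would aggregate over rows instead.
b = a.withColumn("max_col", F.greatest(a["one"], a["two"], a["three"]))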

Spark (pySpark) groupBy misordering first element on collect_list

Submitted by 旧巷老猫 on 2020-01-03 05:40:30
Question: I have the following dataframe (df_parquet):

DataFrame[id: bigint, date: timestamp, consumption: decimal(38,18)]

I intend to get sorted lists of dates and consumptions using collect_list, just as stated in this post: collect_list by preserving order based on another variable. I am following the last approach (https://stackoverflow.com/a/49246162/11841618), which is the one I think is more efficient. So instead of just calling repartition with the default number of partitions (200) I call it
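
A minimal sketch of another common way to get order-preserving lists without relying on repartitioning behaviour, assuming ordering by date is what matters: collect (date, consumption) structs and sort the resulting array, so the order is enforced after the shuffle rather than depended upon.

from pyspark.sql import functions as F

# Collect (date, consumption) pairs per id and sort the array by date (the first
# struct field), so the final order does not depend on how the shuffle happened.
sorted_lists = (
    df_parquet
    .groupBy("id")
    .agg(F.sort_array(F.collect_list(F.struct("date", "consumption"))).alias("events"))
    .select(
        "id",
        F.col("events.date").alias("dates"),
        F.col("events.consumption").alias("consumptions"),
    )
)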

Filter if String contain sub-string pyspark

Submitted by 梦想与她 on 2020-01-03 05:18:21
Question: I have 2 datasets. In each one I have several columns, but I want to use only 2 columns from each dataset, without doing any join, merge or combination between the two datasets. Example of dataset 1:

column_dataset_1 <String>    | column_dataset_1_normalized <String>
-----------------------------------------------------------------------
11882621-V021BRP161305-1     | 11882621V021BRP1613051
-----------------------------------------------------------------------
W-B.7120RP1605794            |
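
The excerpt is cut off, but going by the title, a minimal sketch of filtering rows whose string column contains a sub-string follows. The DataFrame name dataset_1 and the literal sub-string are placeholders; the column name comes from the example above.

from pyspark.sql import functions as F

# Keep only the rows whose normalized value contains the given sub-string.
matches = dataset_1.filter(F.col("column_dataset_1_normalized").contains("V021BRP"))

# Equivalent SQL-style predicate:
matches_like = dataset_1.filter(F.col("column_dataset_1_normalized").like("%V021BRP%"))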

How to generate summary statistics (using Summarizer.metrics) in streaming query?

Submitted by 家住魔仙堡 on 2020-01-03 05:08:47
Question: Currently, I am using Spark structured streaming to create data frames of random data in the form of (id, timestamp_value, device_id, temperature_value, comment). Spark DataFrame per batch: (screenshot omitted). Based on that data frame, I would like to have some descriptive statistics for the column "temperature_value", for example min, max, mean, count, variance. My approach to achieve this in Python is the following:

import sys
import json
import psycopg2
from pyspark import SparkContext
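
A minimal sketch of one way to get there, assuming Spark 2.4+ and a streaming DataFrame named streaming_df with the columns listed above: Summarizer works on a Vector column, and running it inside foreachBatch sidesteps restrictions on streaming aggregations because each micro-batch is an ordinary DataFrame.

from pyspark.ml.feature import VectorAssembler
from pyspark.ml.stat import Summarizer

# Summarizer expects a Vector column, so assemble temperature_value into one.
assembler = VectorAssembler(inputCols=["temperature_value"], outputCol="temperature_vec")

def summarize_batch(batch_df, batch_id):
    # Inside foreachBatch each micro-batch is a static DataFrame, so the
    # Summarizer aggregation runs without streaming-specific restrictions.
    vec = assembler.transform(batch_df)
    vec.select(
        Summarizer.metrics("min", "max", "mean", "count", "variance")
        .summary(vec["temperature_vec"])
        .alias("temperature_stats")
    ).show(truncate=False)

query = streaming_df.writeStream.foreachBatch(summarize_batch).start()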

find the date, when value of column changed

Submitted by 别等时光非礼了梦想. on 2020-01-03 04:56:07
Question: I had one DataFrame, A, like:

+---+---+---+---+----------+
|key| c1| c2| c3|      date|
+---+---+---+---+----------+
| k1| -1|  0| -1|2015-04-28|
| k1|  1| -1|  1|2015-07-28|
| k1|  1|  1|  1|2015-10-28|
| k1|  1|  1| -1|2015-12-28|
| k2| -1|  0|  1|2015-04-28|
| k2| -1|  1| -1|2015-07-28|
| k2|  1| -1|  0|2015-10-28|
| k2|  1| -1|  1|2015-11-28|
+---+---+---+---+----------+

Code to create A:

data = [('k1', '-1', '0', '-1','2015-04-28'), ('k1', '1', '-1', '1', '2015-07-28'), ('k1', '1', '1', '1', '2015-10-28'
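
A minimal sketch of the usual approach, assuming the aim is the date on which a column (c1 here, as an example) takes a different value than in the previous row of the same key: a lag window per key, ordered by date, exposes the previous value for comparison.

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Compare each value of c1 with the previous value for the same key (ordered by date);
# the rows that differ mark the dates on which the value changed.
w = Window.partitionBy("key").orderBy("date")
changes = (
    A.withColumn("prev_c1", F.lag("c1").over(w))
     .where(F.col("prev_c1").isNotNull() & (F.col("c1") != F.col("prev_c1")))
     .select("key", "date", "c1")
)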

Spark (pyspark) having difficulty calling statistics methods on worker node

Submitted by ↘锁芯ラ on 2020-01-03 04:45:11
Question: I am hitting a library error when running PySpark (from an IPython notebook). I want to use Statistics.chiSqTest(obs) from pyspark.mllib.stat in a .mapValues operation on my RDD containing (key, list(int)) pairs. On the master node, if I collect the RDD as a map and iterate over the values like so, I have no problems:

keys_to_bucketed = vectors.collectAsMap()
keys_to_chi = {key: Statistics.chiSqTest(value).pValue for key, value in keys_to_bucketed.iteritems()}

but if I do the same directly
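
The excerpt breaks off here, but the underlying constraint is that pyspark.mllib wrappers such as Statistics.chiSqTest call into the JVM through the driver's SparkContext, so they generally cannot run inside transformations executed on workers. A hypothetical workaround sketch, swapping in a pure-Python goodness-of-fit test (SciPy's chisquare, which, like the mllib call on a single vector, defaults to uniform expected frequencies):

from scipy import stats

# Runs entirely in Python on the workers, so no JVM gateway is needed per record.
keys_to_chi = vectors.mapValues(lambda obs: float(stats.chisquare(obs).pvalue))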

Pyspark - reusing JDBC connection

Submitted by ☆樱花仙子☆ on 2020-01-03 04:18:05
Question: I have the following task:

- load data from one table from multiple schemas
- use PySpark
- use one user which has access to all schemas in the DB

I am using the following code (more or less):

def connect_to_oracle_db(spark_session, db_query):
    return spark_session.read \
        .format("jdbc") \
        .option("url", "jdbc:oracle:thin:@//<host>:<port>/<service_name>") \
        .option("user", "<user>") \
        .option("password", "<pass>") \
        .option("dbtable", db_query) \
        .option("driver", "oracle.jdbc.driver.OracleDriver")

def
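
A minimal sketch of how this helper might be reused across schemas, under the assumption that "reusing" means sharing one set of JDBC options rather than one physical connection (Spark's JDBC source opens and closes its own connections per read). The schema and table names below are placeholders.

schemas = ["SCHEMA_A", "SCHEMA_B", "SCHEMA_C"]  # placeholder schema names

def read_table_from_schema(spark_session, schema, table):
    # Wrap the per-schema query so the same JDBC options are reused for every read.
    query = "(SELECT * FROM {}.{}) t".format(schema, table)
    return connect_to_oracle_db(spark_session, query).load()

frames = [read_table_from_schema(spark, s, "MY_TABLE") for s in schemas]

combined = frames[0]
for df in frames[1:]:
    combined = combined.union(df)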