pyspark

Get Last Monday in Spark

Submitted by 社会主义新天地 on 2020-01-03 18:05:53
Question: I am using Spark 2.0 with the Python API. I have a dataframe with a column of type DateType(). I would like to add a column to the dataframe containing the most recent Monday. I can do it like this:

reg_schema = pyspark.sql.types.StructType([
    pyspark.sql.types.StructField('AccountCreationDate', pyspark.sql.types.DateType(), True),
    pyspark.sql.types.StructField('UserId', pyspark.sql.types.LongType(), True)
])
reg = spark.read.schema(reg_schema).option('header', True).csv(path_to_file)
reg =
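
A minimal sketch of one way to do this, assuming the goal is the Monday on or before each AccountCreationDate: next_day() jumps to the following Monday and date_sub() steps back a week, so a date that is already a Monday maps to itself.

from pyspark.sql import functions as F

# Most recent Monday on or before AccountCreationDate:
# next_day() returns the first Monday strictly after the date, so subtract 7 days.
reg = reg.withColumn(
    "most_recent_monday",
    F.date_sub(F.next_day(F.col("AccountCreationDate"), "monday"), 7)
)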

withColumn not allowing me to use max() function to generate a new column

Submitted by 非 Y 不嫁゛ on 2020-01-03 16:47:38
Question: I have a dataset like this:

a = sc.parallelize([[1,2,3],[0,2,1],[9,8,7]]).toDF(["one", "two", "three"])

I want to have a dataset that adds a new column that is equal to the largest value in the other three columns. The output would look like this:

+----+----+-----+-------+
| one| two|three|max_col|
+----+----+-----+-------+
|   1|   2|    3|      3|
|   0|   2|    1|      2|
|   9|   8|    7|      9|
+----+----+-----+-------+

I thought I would use withColumn, like so:

b = a.withColumn("max_col", max(a["one"], a["two"], a[
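
A minimal sketch of an alternative, assuming Spark's built-in greatest() is acceptable here: it computes a row-wise maximum across columns and avoids the clash between Python's built-in max and pyspark.sql.functions.max, which is an aggregate over rows.

from pyspark.sql import functions as F

# Row-wise maximum across the three columns; F.max would aggregate over rows instead.
b = a.withColumn("max_col", F.greatest(a["one"], a["two"], a["three"]))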

Spark (pySpark) groupBy misordering first element on collect_list

Submitted by 旧巷老猫 on 2020-01-03 05:40:30
Question: I have the following dataframe (df_parquet):

DataFrame[id: bigint, date: timestamp, consumption: decimal(38,18)]

I intend to get sorted lists of dates and consumptions using collect_list, just as stated in this post: collect_list by preserving order based on another variable. I am following the last approach (https://stackoverflow.com/a/49246162/11841618), which is the one I think is more efficient. So instead of just calling repartition with the default number of partitions (200) I call it
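
A minimal sketch of another common way to get order-preserving lists without relying on repartitioning behaviour, assuming ordering by date is what matters: collect (date, consumption) structs and sort the resulting array, so the order is enforced after the shuffle rather than depended upon.

from pyspark.sql import functions as F

# Collect (date, consumption) pairs per id and sort the array by date (the first
# struct field), so the final order does not depend on how the shuffle happened.
sorted_lists = (
    df_parquet
    .groupBy("id")
    .agg(F.sort_array(F.collect_list(F.struct("date", "consumption"))).alias("events"))
    .select(
        "id",
        F.col("events.date").alias("dates"),
        F.col("events.consumption").alias("consumptions"),
    )
)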

Filter if String contain sub-string pyspark

Submitted by 梦想与她 on 2020-01-03 05:18:21
Question: I have 2 datasets. In each one I have several columns, but I want to use only 2 columns from each dataset, without doing any join, merge or combination between the two datasets. Example of dataset 1:

column_dataset_1 <String>    | column_dataset_1_normalized <String>
-----------------------------------------------------------------------
11882621-V021BRP161305-1     | 11882621V021BRP1613051
-----------------------------------------------------------------------
W-B.7120RP1605794            |
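
The excerpt is cut off, but going by the title, a minimal sketch of filtering rows whose string column contains a sub-string follows. The DataFrame name dataset_1 and the literal sub-string are placeholders; the column name comes from the example above.

from pyspark.sql import functions as F

# Keep only the rows whose normalized value contains the given sub-string.
matches = dataset_1.filter(F.col("column_dataset_1_normalized").contains("V021BRP"))

# Equivalent SQL-style predicate:
matches_like = dataset_1.filter(F.col("column_dataset_1_normalized").like("%V021BRP%"))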

How to generate summary statistics (using Summarizer.metrics) in streaming query?

Submitted by 家住魔仙堡 on 2020-01-03 05:08:47
Question: Currently, I am using Spark structured streaming to create data frames of random data in the form of (id, timestamp_value, device_id, temperature_value, comment). Spark DataFrame per batch: (screenshot omitted). Based on that data frame, I would like to have some descriptive statistics for the column "temperature_value", for example min, max, mean, count, variance. My approach to achieve this in Python is the following:

import sys
import json
import psycopg2
from pyspark import SparkContext
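
A minimal sketch of one way to get there, assuming Spark 2.4+ and a streaming DataFrame named streaming_df with the columns listed above: Summarizer works on a Vector column, and running it inside foreachBatch sidesteps restrictions on streaming aggregations because each micro-batch is an ordinary DataFrame.

from pyspark.ml.feature import VectorAssembler
from pyspark.ml.stat import Summarizer

# Summarizer expects a Vector column, so assemble temperature_value into one.
assembler = VectorAssembler(inputCols=["temperature_value"], outputCol="temperature_vec")

def summarize_batch(batch_df, batch_id):
    # Inside foreachBatch each micro-batch is a static DataFrame, so the
    # Summarizer aggregation runs without streaming-specific restrictions.
    vec = assembler.transform(batch_df)
    vec.select(
        Summarizer.metrics("min", "max", "mean", "count", "variance")
        .summary(vec["temperature_vec"])
        .alias("temperature_stats")
    ).show(truncate=False)

query = streaming_df.writeStream.foreachBatch(summarize_batch).start()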

find the date, when value of column changed

Submitted by 别等时光非礼了梦想. on 2020-01-03 04:56:07
Question: I had one DataFrame, A, like:

+---+---+---+---+----------+
|key| c1| c2| c3|      date|
+---+---+---+---+----------+
| k1| -1|  0| -1|2015-04-28|
| k1|  1| -1|  1|2015-07-28|
| k1|  1|  1|  1|2015-10-28|
| k1|  1|  1| -1|2015-12-28|
| k2| -1|  0|  1|2015-04-28|
| k2| -1|  1| -1|2015-07-28|
| k2|  1| -1|  0|2015-10-28|
| k2|  1| -1|  1|2015-11-28|
+---+---+---+---+----------+

Code to create A:

data = [('k1', '-1', '0', '-1','2015-04-28'), ('k1', '1', '-1', '1', '2015-07-28'), ('k1', '1', '1', '1', '2015-10-28'
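
A minimal sketch of the usual approach, assuming the aim is the date on which a column (c1 here, as an example) takes a different value than in the previous row of the same key: a lag window per key, ordered by date, exposes the previous value for comparison.

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Compare each value of c1 with the previous value for the same key (ordered by date);
# the rows that differ mark the dates on which the value changed.
w = Window.partitionBy("key").orderBy("date")
changes = (
    A.withColumn("prev_c1", F.lag("c1").over(w))
     .where(F.col("prev_c1").isNotNull() & (F.col("c1") != F.col("prev_c1")))
     .select("key", "date", "c1")
)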

Spark (pyspark) having difficulty calling statistics methods on worker node

Submitted by ↘锁芯ラ on 2020-01-03 04:45:11
Question: I am hitting a library error when running PySpark (from an IPython notebook). I want to use Statistics.chiSqTest(obs) from pyspark.mllib.stat in a .mapValues operation on my RDD containing (key, list(int)) pairs. On the master node, if I collect the RDD as a map and iterate over the values like so, I have no problems:

keys_to_bucketed = vectors.collectAsMap()
keys_to_chi = {key: Statistics.chiSqTest(value).pValue for key, value in keys_to_bucketed.iteritems()}

but if I do the same directly
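
The excerpt breaks off here, but the underlying constraint is that pyspark.mllib wrappers such as Statistics.chiSqTest call into the JVM through the driver's SparkContext, so they generally cannot run inside transformations executed on workers. A hypothetical workaround sketch, swapping in a pure-Python goodness-of-fit test (SciPy's chisquare, which, like the mllib call on a single vector, defaults to uniform expected frequencies):

from scipy import stats

# Runs entirely in Python on the workers, so no JVM gateway is needed per record.
keys_to_chi = vectors.mapValues(lambda obs: float(stats.chisquare(obs).pvalue))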

Pyspark - reusing JDBC connection

Submitted by ☆樱花仙子☆ on 2020-01-03 04:18:05
Question: I have the following task:

- load data from one table from multiple schemas
- use PySpark
- use one user which has access to all schemas in the DB

I am using the following code (more or less):

def connect_to_oracle_db(spark_session, db_query):
    return spark_session.read \
        .format("jdbc") \
        .option("url", "jdbc:oracle:thin:@//<host>:<port>/<service_name>") \
        .option("user", "<user>") \
        .option("password", "<pass>") \
        .option("dbtable", db_query) \
        .option("driver", "oracle.jdbc.driver.OracleDriver")

def
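
A minimal sketch of how this helper might be reused across schemas, under the assumption that "reusing" means sharing one set of JDBC options rather than one physical connection (Spark's JDBC source opens and closes its own connections per read). The schema and table names below are placeholders.

schemas = ["SCHEMA_A", "SCHEMA_B", "SCHEMA_C"]  # placeholder schema names

def read_table_from_schema(spark_session, schema, table):
    # Wrap the per-schema query so the same JDBC options are reused for every read.
    query = "(SELECT * FROM {}.{}) t".format(schema, table)
    return connect_to_oracle_db(spark_session, query).load()

frames = [read_table_from_schema(spark, s, "MY_TABLE") for s in schemas]

combined = frames[0]
for df in frames[1:]:
    combined = combined.union(df)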