apache-spark-sql

Spark: need confirmation on approach for capturing first and last date on a dataset

半腔热情 submitted on 2020-12-23 13:43:12
Question: I have a data frame:

```
A, B, C, D, 201701, 2020001
A, B, C, D, 201801, 2020002
A, B, C, D, 201901, 2020003
```

Expected output:

```
col_A, col_B, col_C, col_D, min_week, max_week, min_month, max_month
A, B, C, D, 201701, 201901, 2020001, 2020003
```

What I tried in PySpark:

```python
from pyspark.sql import Window
import pyspark.sql.functions as psf

w1 = Window.partitionBy('A', 'B', 'C', 'D')\
    .orderBy('WEEK', 'MONTH')

df_new = df_source\
    .withColumn("min_week", psf.first("WEEK").over(w1))\
    .withColumn("max_week",
```
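The attempt above is cut off, so for comparison here is a minimal sketch of one way to get the same result with groupBy plus min/max aggregates instead of an ordered window. The sample rows mirror the question; the WEEK and MONTH column names are taken from the attempt and are otherwise assumptions.

```python
# Sketch: grouping with min/max yields the first/last WEEK and MONTH per key
# without needing an ordered window; column names follow the question.
from pyspark.sql import SparkSession
import pyspark.sql.functions as psf

spark = SparkSession.builder.getOrCreate()

df_source = spark.createDataFrame(
    [("A", "B", "C", "D", 201701, 2020001),
     ("A", "B", "C", "D", 201801, 2020002),
     ("A", "B", "C", "D", 201901, 2020003)],
    ["A", "B", "C", "D", "WEEK", "MONTH"],
)

df_new = df_source.groupBy("A", "B", "C", "D").agg(
    psf.min("WEEK").alias("min_week"),
    psf.max("WEEK").alias("max_week"),
    psf.min("MONTH").alias("min_month"),
    psf.max("MONTH").alias("max_month"),
)
df_new.show()
```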

Extract values from a Spark dataframe column into a new derived column

倖福魔咒の submitted on 2020-12-15 07:31:52
Question: I have the following dataframe schema:

```
root
 |-- SOURCE: string (nullable = true)
 |-- SYSTEM_NAME: string (nullable = true)
 |-- BUCKET_NAME: string (nullable = true)
 |-- LOCATION: string (nullable = true)
 |-- FILE_NAME: string (nullable = true)
 |-- LAST_MOD_DATE: string (nullable = true)
 |-- FILE_SIZE: string (nullable = true)
```

I would like to derive a column after extracting the data values from certain columns. The data in the LOCATION column looks like the following: example 1: prod/docs
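The example value is truncated above, so the sketch below only illustrates the general pattern: split the LOCATION string on a delimiter and pull segments out as new derived columns. The "/" delimiter, segment positions, and sample row are assumptions, not the questioner's actual data.

```python
# Hypothetical sketch: the real LOCATION format is truncated in the question,
# so the delimiter and segment positions here are illustrative assumptions.
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("s3", "sysA", "bucket1", "prod/docs/2020/file.txt",
      "file.txt", "2020-12-01", "1024")],
    ["SOURCE", "SYSTEM_NAME", "BUCKET_NAME", "LOCATION",
     "FILE_NAME", "LAST_MOD_DATE", "FILE_SIZE"],
)

# Split LOCATION on "/" and expose individual segments as new columns.
parts = F.split(F.col("LOCATION"), "/")
df_derived = (
    df.withColumn("env", parts.getItem(0))        # e.g. "prod"
      .withColumn("category", parts.getItem(1))   # e.g. "docs"
)
df_derived.show(truncate=False)
```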

Spark org.apache.http.ConnectionClosedException when calling .show() and .toPandas() with an S3 dataframe

不打扰是莪最后的温柔 submitted on 2020-12-15 06:39:45
Question: I created a PySpark DataFrame df with Parquet data on AWS S3. Calling df.count() works, but df.show() or df.toPandas() fails with the following error:

```
Py4JJavaError: An error occurred while calling o41.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 8.0 failed 1 times, most recent failure: Lost task 0.0 in stage 8.0 (TID 14, 10.20.202.97, executor driver): org.apache.http.ConnectionClosedException: Premature end of Content-Length delimited
```
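No accepted answer is shown for this question. As a hedged sketch only, one common mitigation for premature connection closure with the S3A client is to enlarge its HTTP connection pool and retry settings when building the session; the property values and bucket path below are placeholders, not a confirmed fix for this case.

```python
# Sketch only: these S3A settings are a common mitigation for premature
# connection closure, not a verified answer to this specific question.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("s3a-read-sketch")
    # Increase the HTTP connection pool and retries used by the S3A client.
    .config("spark.hadoop.fs.s3a.connection.maximum", "100")
    .config("spark.hadoop.fs.s3a.attempts.maximum", "10")
    .config("spark.hadoop.fs.s3a.connection.timeout", "60000")
    .getOrCreate()
)

# Hypothetical bucket and prefix; replace with the real Parquet location.
df = spark.read.parquet("s3a://my-bucket/path/to/parquet/")
print(df.count())
df.show(5)
```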

pandas dataframe: order-supply matching

前提是你 submitted on 2020-12-15 06:30:53
Question: I am very new to Python and pandas coding, so I am kind of stuck here and any input is appreciated. I have two DataFrames, each ordered based on a criterion:

df1: list of orders with quantity
df2: list of inventories with quantity and date available; the inventory quantity does not necessarily equal the order quantity

I need to pop the first order in df1 and keep popping inventory from df2 until the order quantity is satisfied, and also keep track of how much inventory I took to fulfill the order. Any help would be greatly
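A minimal sketch of the greedy matching loop described above; the column names and sample quantities are assumptions, since the actual DataFrames are not shown in the question.

```python
# Sketch of greedy order-supply matching: consume inventory rows in order
# until each order is satisfied, recording every allocation made.
import pandas as pd

df1 = pd.DataFrame({"order_id": ["o1", "o2"], "qty": [50, 30]})
df2 = pd.DataFrame({"inv_id": ["i1", "i2", "i3"],
                    "qty": [20, 40, 25],
                    "date_available": ["2020-12-01", "2020-12-05", "2020-12-10"]})

allocations = []                  # one row per (order, inventory) pairing
inv = df2.to_dict("records")
inv_idx = 0

for order in df1.to_dict("records"):
    remaining = order["qty"]
    used = 0
    # Keep consuming inventory rows until this order is fully satisfied.
    while remaining > 0 and inv_idx < len(inv):
        take = min(remaining, inv[inv_idx]["qty"])
        allocations.append({"order_id": order["order_id"],
                            "inv_id": inv[inv_idx]["inv_id"],
                            "allocated": take})
        remaining -= take
        inv[inv_idx]["qty"] -= take
        used += 1
        if inv[inv_idx]["qty"] == 0:
            inv_idx += 1
    print(order["order_id"], "fulfilled with", used, "inventory rows")

result = pd.DataFrame(allocations)
print(result)
```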

How to efficiently parse a dataframe object into a map of key-value pairs

风格不统一 submitted on 2020-12-15 05:52:33
Question: I'm working with a dataframe with the columns basketID and itemID. Is there a way to efficiently parse through the dataset and generate a map where the keys are basketID and the value is a set of all the itemID contained within each basket? My current implementation uses a for loop over the data frame, which isn't very scalable. Is it possible to do this more efficiently? Any help would be appreciated, thanks!

[screenshot of sample data]

The goal is to obtain:

```
basket = Map("b1" -> Set("i1", "i2"
```
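A hedged sketch of a more scalable approach, written in PySpark rather than the Scala-style Map shown in the expected output: aggregate with groupBy and collect_set, then collect only the much smaller grouped result into a dictionary. The sample rows are invented.

```python
# Sketch: build {basketID: set(itemID)} without a Python for-loop over rows.
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("b1", "i1"), ("b1", "i2"), ("b2", "i3")],
    ["basketID", "itemID"],
)

# Aggregation runs distributed; only the grouped result is collected.
grouped = df.groupBy("basketID").agg(F.collect_set("itemID").alias("items"))

basket = {row["basketID"]: set(row["items"]) for row in grouped.collect()}
print(basket)   # {'b1': {'i1', 'i2'}, 'b2': {'i3'}}
```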

Custom SQL using the Spark BigQuery connector

本秂侑毒 submitted on 2020-12-15 05:34:05
Question: I have some custom SQL to read data from BigQuery. How can I execute it? I tried using the query option, but it is not working: the query option is ignored and the full table is read.

```java
Dataset<Row> testDS = session.read().format("bigquery")
    //.option("table", <TABLE>)
    .option("query", <QUERY>)
    .option("project", <PROJECT_ID>)
    .option("parentProject", <PROJECT_ID>)
    .load();
```

Answer 1: That's because the query option is not available in the connector. See https://github.com/GoogleCloudDataproc
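Given the answer's claim that the query option is not supported, a hedged sketch of the usual workaround is to load the table and let Spark push column pruning and simple filters down to BigQuery. It is shown in PySpark rather than the question's Java, and the project, table, column, and filter names are placeholders.

```python
# Sketch of a workaround per the answer above: load the table and push the
# selection/filtering through Spark instead of passing a custom query.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bq-read-sketch").getOrCreate()

df = (
    spark.read.format("bigquery")
    .option("table", "my-project.my_dataset.my_table")   # placeholder names
    .option("parentProject", "my-project")
    .load()
    .select("col_a", "col_b")          # column pruning is pushed to BigQuery
    .where("col_a = 'some_value'")     # simple filters can be pushed down too
)
df.show()
```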

Load dataframe in PySpark using spark.read.jdbc

倾然丶 夕夏残阳落幕 submitted on 2020-12-15 05:23:56
Question: I am trying to connect to an MS SQL DB from PySpark using spark.read.jdbc:

```python
import os
from pyspark.sql import *
from pyspark.sql.functions import *
from pyspark import SparkContext
from pyspark.sql.session import SparkSession

sc = SparkContext.getOrCreate()
spark = SparkSession(sc)

df = spark.read \
    .format('jdbc') \
    .option('url', 'jdbc:sqlserver://local:1433') \
    .option('user', 'sa') \
    .option('password', '12345') \
    .option('dbtable', '(select COL1, COL2 from tbl1 WHERE COL1 = 2)')
```

then I do df
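The snippet above stops before the read is executed. A hedged sketch of a completed version is below; the two adjustments are aliasing the subquery (which the dbtable option generally requires) and finishing with .load(). The connection details simply mirror the question and assume the SQL Server JDBC driver is on the classpath.

```python
# Sketch: same connection details as the question; the key changes are the
# subquery alias required by the dbtable option and the final .load() call.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:sqlserver://local:1433")
    .option("user", "sa")
    .option("password", "12345")
    .option("dbtable", "(select COL1, COL2 from tbl1 WHERE COL1 = 2) AS t")
    .load()
)

# Alternatively (Spark 2.4+), pass the statement through the query option:
# .option("query", "select COL1, COL2 from tbl1 WHERE COL1 = 2")
df.show()
```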