apache-spark-sql

Spark: need confirmation on approach for capturing first and last date on a dataset

半腔热情 submitted on 2020-12-23 13:43:12
Question: I have a data frame:

```
A, B, C, D, 201701, 2020001
A, B, C, D, 201801, 2020002
A, B, C, D, 201901, 2020003
```

Expected output:

```
col_A, col_B, col_C, col_D, min_week, max_week, min_month, max_month
A, B, C, D, 201701, 201901, 2020001, 2020003
```

What I tried in PySpark:

```python
from pyspark.sql import Window
import pyspark.sql.functions as psf

w1 = Window.partitionBy('A', 'B', 'C', 'D')\
    .orderBy('WEEK', 'MONTH')

df_new = df_source\
    .withColumn("min_week", psf.first("WEEK").over(w1))\
    .withColumn("max_week",
```
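The attempt above is cut off, so for comparison here is a minimal sketch of one way to get the same result with groupBy plus min/max aggregates instead of an ordered window. The sample rows mirror the question; the WEEK and MONTH column names are taken from the attempt and are otherwise assumptions.

```python
# Sketch: grouping with min/max yields the first/last WEEK and MONTH per key
# without needing an ordered window; column names follow the question.
from pyspark.sql import SparkSession
import pyspark.sql.functions as psf

spark = SparkSession.builder.getOrCreate()

df_source = spark.createDataFrame(
    [("A", "B", "C", "D", 201701, 2020001),
     ("A", "B", "C", "D", 201801, 2020002),
     ("A", "B", "C", "D", 201901, 2020003)],
    ["A", "B", "C", "D", "WEEK", "MONTH"],
)

df_new = df_source.groupBy("A", "B", "C", "D").agg(
    psf.min("WEEK").alias("min_week"),
    psf.max("WEEK").alias("max_week"),
    psf.min("MONTH").alias("min_month"),
    psf.max("MONTH").alias("max_month"),
)
df_new.show()
```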

Extract values from a Spark dataframe column into a new derived column

倖福魔咒の submitted on 2020-12-15 07:31:52
Question: I have the following dataframe schema:

```
root
 |-- SOURCE: string (nullable = true)
 |-- SYSTEM_NAME: string (nullable = true)
 |-- BUCKET_NAME: string (nullable = true)
 |-- LOCATION: string (nullable = true)
 |-- FILE_NAME: string (nullable = true)
 |-- LAST_MOD_DATE: string (nullable = true)
 |-- FILE_SIZE: string (nullable = true)
```

I would like to derive a column after extracting the data values from certain columns. The data in the LOCATION column looks like the following: example 1: prod/docs
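The example value is truncated above, so the sketch below only illustrates the general pattern: split the LOCATION string on a delimiter and pull segments out as new derived columns. The "/" delimiter, segment positions, and sample row are assumptions, not the questioner's actual data.

```python
# Hypothetical sketch: the real LOCATION format is truncated in the question,
# so the delimiter and segment positions here are illustrative assumptions.
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("s3", "sysA", "bucket1", "prod/docs/2020/file.txt",
      "file.txt", "2020-12-01", "1024")],
    ["SOURCE", "SYSTEM_NAME", "BUCKET_NAME", "LOCATION",
     "FILE_NAME", "LAST_MOD_DATE", "FILE_SIZE"],
)

# Split LOCATION on "/" and expose individual segments as new columns.
parts = F.split(F.col("LOCATION"), "/")
df_derived = (
    df.withColumn("env", parts.getItem(0))        # e.g. "prod"
      .withColumn("category", parts.getItem(1))   # e.g. "docs"
)
df_derived.show(truncate=False)
```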

Spark org.apache.http.ConnectionClosedException when calling .show() and .toPandas() with an S3 dataframe

不打扰是莪最后的温柔 submitted on 2020-12-15 06:39:45
Question: I created a PySpark DataFrame df with Parquet data on AWS S3. Calling df.count() works, but df.show() or df.toPandas() fails with the following error:

```
Py4JJavaError: An error occurred while calling o41.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 8.0 failed 1 times, most recent failure: Lost task 0.0 in stage 8.0 (TID 14, 10.20.202.97, executor driver): org.apache.http.ConnectionClosedException: Premature end of Content-Length delimited
```
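No accepted answer is shown for this question. As a hedged sketch only, one common mitigation for premature connection closure with the S3A client is to enlarge its HTTP connection pool and retry settings when building the session; the property values and bucket path below are placeholders, not a confirmed fix for this case.

```python
# Sketch only: these S3A settings are a common mitigation for premature
# connection closure, not a verified answer to this specific question.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("s3a-read-sketch")
    # Increase the HTTP connection pool and retries used by the S3A client.
    .config("spark.hadoop.fs.s3a.connection.maximum", "100")
    .config("spark.hadoop.fs.s3a.attempts.maximum", "10")
    .config("spark.hadoop.fs.s3a.connection.timeout", "60000")
    .getOrCreate()
)

# Hypothetical bucket and prefix; replace with the real Parquet location.
df = spark.read.parquet("s3a://my-bucket/path/to/parquet/")
print(df.count())
df.show(5)
```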

pandas dataframe: order-supply matching

前提是你 submitted on 2020-12-15 06:30:53
Question: I am very new to Python and pandas coding, so I am kind of stuck here and any input is appreciated. I have two DataFrames, each ordered based on a criterion:

df1: list of orders with quantity
df2: list of inventories with quantity and date available; the inventory quantity does not necessarily equal the order quantity

I need to pop the first order in df1 and keep popping inventory from df2 until the order quantity is satisfied, and also keep track of how much inventory I took to fulfill the order. Any help would be greatly
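A minimal sketch of the greedy matching loop described above; the column names and sample quantities are assumptions, since the actual DataFrames are not shown in the question.

```python
# Sketch of greedy order-supply matching: consume inventory rows in order
# until each order is satisfied, recording every allocation made.
import pandas as pd

df1 = pd.DataFrame({"order_id": ["o1", "o2"], "qty": [50, 30]})
df2 = pd.DataFrame({"inv_id": ["i1", "i2", "i3"],
                    "qty": [20, 40, 25],
                    "date_available": ["2020-12-01", "2020-12-05", "2020-12-10"]})

allocations = []                  # one row per (order, inventory) pairing
inv = df2.to_dict("records")
inv_idx = 0

for order in df1.to_dict("records"):
    remaining = order["qty"]
    used = 0
    # Keep consuming inventory rows until this order is fully satisfied.
    while remaining > 0 and inv_idx < len(inv):
        take = min(remaining, inv[inv_idx]["qty"])
        allocations.append({"order_id": order["order_id"],
                            "inv_id": inv[inv_idx]["inv_id"],
                            "allocated": take})
        remaining -= take
        inv[inv_idx]["qty"] -= take
        used += 1
        if inv[inv_idx]["qty"] == 0:
            inv_idx += 1
    print(order["order_id"], "fulfilled with", used, "inventory rows")

result = pd.DataFrame(allocations)
print(result)
```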

How to efficiently parse a dataframe object into a map of key-value pairs

风格不统一 submitted on 2020-12-15 05:52:33
Question: I'm working with a dataframe with the columns basketID and itemID. Is there a way to efficiently parse through the dataset and generate a map where the keys are basketID and the value is a set of all the itemID contained within each basket? My current implementation uses a for loop over the data frame, which isn't very scalable. Is it possible to do this more efficiently? Any help would be appreciated, thanks!

[screenshot of sample data]

The goal is to obtain:

```
basket = Map("b1" -> Set("i1", "i2"
```
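A hedged sketch of a more scalable approach, written in PySpark rather than the Scala-style Map shown in the expected output: aggregate with groupBy and collect_set, then collect only the much smaller grouped result into a dictionary. The sample rows are invented.

```python
# Sketch: build {basketID: set(itemID)} without a Python for-loop over rows.
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("b1", "i1"), ("b1", "i2"), ("b2", "i3")],
    ["basketID", "itemID"],
)

# Aggregation runs distributed; only the grouped result is collected.
grouped = df.groupBy("basketID").agg(F.collect_set("itemID").alias("items"))

basket = {row["basketID"]: set(row["items"]) for row in grouped.collect()}
print(basket)   # {'b1': {'i1', 'i2'}, 'b2': {'i3'}}
```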

Custom SQL using the Spark BigQuery connector

本秂侑毒 submitted on 2020-12-15 05:34:05
Question: I have some custom SQL to read data from BigQuery. How can I execute it? I tried using the query option, but it is not working: the query option is ignored and the full table is read.

```java
Dataset<Row> testDS = session.read().format("bigquery")
    //.option("table", <TABLE>)
    .option("query", <QUERY>)
    .option("project", <PROJECT_ID>)
    .option("parentProject", <PROJECT_ID>)
    .load();
```

Answer 1: That's because the query option is not available in the connector. See https://github.com/GoogleCloudDataproc
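Given the answer's claim that the query option is not supported, a hedged sketch of the usual workaround is to load the table and let Spark push column pruning and simple filters down to BigQuery. It is shown in PySpark rather than the question's Java, and the project, table, column, and filter names are placeholders.

```python
# Sketch of a workaround per the answer above: load the table and push the
# selection/filtering through Spark instead of passing a custom query.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bq-read-sketch").getOrCreate()

df = (
    spark.read.format("bigquery")
    .option("table", "my-project.my_dataset.my_table")   # placeholder names
    .option("parentProject", "my-project")
    .load()
    .select("col_a", "col_b")          # column pruning is pushed to BigQuery
    .where("col_a = 'some_value'")     # simple filters can be pushed down too
)
df.show()
```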

Load dataframe in PySpark using spark.read.jdbc

倾然丶 夕夏残阳落幕 submitted on 2020-12-15 05:23:56
Question: I am trying to connect to an MS SQL DB from PySpark using spark.read.jdbc:

```python
import os
from pyspark.sql import *
from pyspark.sql.functions import *
from pyspark import SparkContext
from pyspark.sql.session import SparkSession

sc = SparkContext.getOrCreate()
spark = SparkSession(sc)

df = spark.read \
    .format('jdbc') \
    .option('url', 'jdbc:sqlserver://local:1433') \
    .option('user', 'sa') \
    .option('password', '12345') \
    .option('dbtable', '(select COL1, COL2 from tbl1 WHERE COL1 = 2)')
```

then I do df
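The snippet above stops before the read is executed. A hedged sketch of a completed version is below; the two adjustments are aliasing the subquery (which the dbtable option generally requires) and finishing with .load(). The connection details simply mirror the question and assume the SQL Server JDBC driver is on the classpath.

```python
# Sketch: same connection details as the question; the key changes are the
# subquery alias required by the dbtable option and the final .load() call.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:sqlserver://local:1433")
    .option("user", "sa")
    .option("password", "12345")
    .option("dbtable", "(select COL1, COL2 from tbl1 WHERE COL1 = 2) AS t")
    .load()
)

# Alternatively (Spark 2.4+), pass the statement through the query option:
# .option("query", "select COL1, COL2 from tbl1 WHERE COL1 = 2")
df.show()
```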