pyspark

Spark - missing 1 required positional argument (lambda function)

Submitted by 人走茶凉 on 2020-01-06 06:42:28
Question: I'm trying to distribute some text extraction from PDFs between multiple servers using Spark. This uses a custom Python module I made and is an implementation of this question. The 'extractTextFromPdf' function takes 2 arguments: a string representing the path to the file, and a configuration file used to determine various extraction constraints. In this case the config file is just a simple YAML file sitting in the same folder as the Python script running the extraction, and the files are
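
A minimal sketch of the usual cause and fix: mapping a two-argument function directly over an RDD supplies only one argument per element, so the config path has to be bound in a closure. The module name and paths below are placeholders for the asker's custom module, not the original code.

```python
from pyspark.sql import SparkSession
# placeholder for the asker's custom module that defines extractTextFromPdf
from my_pdf_module import extractTextFromPdf

spark = SparkSession.builder.appName("pdf-extract").getOrCreate()
sc = spark.sparkContext

pdf_paths = ["/data/a.pdf", "/data/b.pdf"]   # assumed example paths
config_path = "config.yml"                   # the YAML file next to the script

# Binding the second argument inside the lambda means the mapped function
# itself takes exactly one element, which avoids the
# "missing 1 required positional argument" error.
texts = (sc.parallelize(pdf_paths)
           .map(lambda p: extractTextFromPdf(p, config_path))
           .collect())
```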

Pyspark: Concat function-generated columns into a new dataframe

Submitted by 一曲冷凌霜 on 2020-01-06 05:50:12
Question: I have a pyspark dataframe (df) with n columns. I would like to generate another df of n columns, where each column records the percentage difference between consecutive rows in the corresponding column of the original df, and the column headers in the new df should equal the corresponding column header in the old dataframe + "_diff". With the following code I can generate the new columns of percentage changes for each column in the original df, but I am not able to stick them in a new df with suitable column headers:
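
A minimal sketch of one way to do this, assuming an ordering column (here "ts") exists so that "consecutive rows" is well defined; the column names and data are illustrative, not the asker's.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, 10.0, 100.0), (2, 12.0, 90.0), (3, 9.0, 99.0)],
    ["ts", "colA", "colB"],
)

w = Window.orderBy("ts")
value_cols = [c for c in df.columns if c != "ts"]

# Build one new dataframe whose columns are "<old name>_diff":
# percent change between each row and the previous row.
diff_df = df.select(
    *[
        ((F.col(c) - F.lag(c, 1).over(w)) / F.lag(c, 1).over(w) * 100)
        .alias(c + "_diff")
        for c in value_cols
    ]
)
diff_df.show()
```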

PySpark UDF on withColumn to replace column

Submitted by 徘徊边缘 on 2020-01-06 05:43:08
Question: This UDF is written to replace a column's value with a variable. Python 2.7; Spark 2.2.0. import pyspark.sql.functions as func def updateCol(col, st): return func.expr(col).replace(func.expr(col), func.expr(st)) updateColUDF = func.udf(updateCol, StringType()) Variables L_1 to L_3 hold the updated column value for each row. This is how I am calling it: updatedDF = orig_df.withColumn("L1", updateColUDF("L1", func.format_string(L_1))). \ withColumn("L2", updateColUDF("L2", func.format_string(L_2))). \
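
A hedged sketch of a simpler route: to overwrite a column with the value held in a Python variable, no UDF is needed at all; lit() wraps the variable as a literal column. The variable values and sample data below are assumptions for illustration.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
orig_df = spark.createDataFrame([("a", "b", "c")], ["L1", "L2", "L3"])

# The variables the question refers to; values are made up here.
L_1, L_2, L_3 = "new1", "new2", "new3"

updatedDF = (
    orig_df
    .withColumn("L1", F.lit(L_1))   # replace each column with its variable
    .withColumn("L2", F.lit(L_2))
    .withColumn("L3", F.lit(L_3))
)
updatedDF.show()
```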

Import pyspark dataframe from multiple S3 buckets, with a column denoting which bucket the entry came from

Submitted by *爱你&永不变心* on 2020-01-06 05:23:07
Question: I have a list of S3 buckets partitioned by date. The first bucket is titled 2019-12-1, the second 2019-12-2, etc. Each of these buckets stores parquet files that I am reading into a pyspark dataframe. The pyspark dataframe generated from each of these buckets has the exact same schema. What I would like to do is iterate over these buckets and store all of these parquet files in a single pyspark dataframe that has a date column denoting which bucket each entry in the dataframe actually came
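
A minimal sketch of the iterate-and-union approach, assuming the bucket names follow the YYYY-MM-D pattern from the question; the date range and s3a prefix are illustrative. unionByName requires Spark 2.3+.

```python
from functools import reduce
from pyspark.sql import SparkSession, functions as F, DataFrame

spark = SparkSession.builder.getOrCreate()

dates = ["2019-12-1", "2019-12-2", "2019-12-3"]   # assumed bucket names

dfs = []
for d in dates:
    df = (
        spark.read.parquet("s3a://{}/".format(d))   # each bucket holds parquet files
        .withColumn("date", F.lit(d))               # record which bucket the rows came from
    )
    dfs.append(df)

# All per-bucket frames share the same schema, so a straight union works.
combined = reduce(DataFrame.unionByName, dfs)
```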

pyspark: Performance difference between spark.read.format("csv") and spark.read.csv

Submitted by 一世执手 on 2020-01-06 04:52:09
Question: Does anyone know what the difference is between spark.read.format("csv") and spark.read.csv? Some say "spark.read.csv" is an alias of "spark.read.format("csv")", but I have seen a difference between the two. I ran an experiment executing each command below with a new pyspark session so that there is no caching. DF1 took 42 secs while DF2 took just 10 secs. The csv file is 60+ GB. DF1 = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("hdfs://bda-ns/user/project/xxx.csv"
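
A sketch of the two call styles for comparison; spark.read.csv(...) is a shorthand that goes through the same DataFrameReader "csv" source, so a large timing gap usually comes from inferSchema forcing an extra full pass over the 60+ GB file (or from cluster-side variation), not from the API itself. The path is the one from the question.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
path = "hdfs://bda-ns/user/project/xxx.csv"

DF1 = (spark.read.format("csv")
       .option("header", "true")
       .option("inferSchema", "true")
       .load(path))

DF2 = spark.read.csv(path, header=True, inferSchema=True)

# Supplying an explicit schema avoids the schema-inference pass entirely:
# schema = "col1 INT, col2 STRING"          # hypothetical columns
# DF3 = spark.read.csv(path, header=True, schema=schema)
```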

Register UDF from external Java jar class in pyspark [duplicate]

Submitted by 会有一股神秘感。 on 2020-01-06 03:52:12
Question: This question already has answers here: Spark: How to map Python with Scala or Java User Defined Functions? (1 answer) Calling Java/Scala function from a task (1 answer) Closed last year. I have a Java jar which contains functions, for example: package com.test.oneid; public class my_class { public static void main(String args[]) { } public static int add(int x) throws IOException { try { return (x+2); } catch(Exception e) { throw new IOException("Caught exception processing input row ", e); }
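
A hedged sketch of one common workaround: if the jar is on the driver classpath, a static method can be reached from the driver through Py4J. Registering it as a SQL UDF would additionally require the class to implement org.apache.spark.sql.api.java.UDF1; the jar path below is an assumption.

```python
from pyspark.sql import SparkSession

# Assumes the jar is on the driver classpath, e.g. via --jars at submit time
# or the extraClassPath config below.
spark = (SparkSession.builder
         .config("spark.driver.extraClassPath", "/path/to/my_class.jar")
         .getOrCreate())

# Driver-side call to the static method from the question's package.
# _jvm is an internal Py4J handle, so treat this as a pragmatic workaround.
result = spark.sparkContext._jvm.com.test.oneid.my_class.add(5)
print(result)   # 7, since add(x) returns x + 2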

How to convert multiple row-tag XML files to a dataframe

Submitted by 泪湿孤枕 on 2020-01-06 03:44:25
Question: I have an XML file with multiple row tags. I need to convert this XML to a proper dataframe. I have used spark-xml, which handles only a single row tag. The XML data is below: <?xml version='1.0' encoding='UTF-8' ?> <generic xmlns="http://xactware.com/generic.xsd" majorVersion="28" minorVersion="300" transactionId="0000"> <HEADER compName="ABGROUP" dateCreated="2018-03-09T09:38:51"/> <COVERSHEET> <ESTIMATE_INFO estimateName="2016-09-28-133907" priceList="YHTRDF" laborEff="Restoration/Service/Remodel
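
A minimal sketch of one workaround: since spark-xml takes a single rowTag per read, do one read per tag of interest and handle (or later join/union) the resulting dataframes. Assumes the com.databricks:spark-xml package is on the classpath (e.g. via --packages); the file path is a placeholder.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
path = "/data/generic.xml"   # assumed location of the file shown above

header_df = (spark.read.format("com.databricks.spark.xml")
             .option("rowTag", "HEADER")
             .load(path))

estimate_df = (spark.read.format("com.databricks.spark.xml")
               .option("rowTag", "ESTIMATE_INFO")
               .load(path))

header_df.show()
estimate_df.show()
```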

Drop function not working after left outer join in pyspark

Submitted by 泪湿孤枕 on 2020-01-06 03:26:33
Question: My pyspark version is 2.1.1. I am trying to join two dataframes (left outer) having two columns, id and priority. I am creating my dataframes like this: a = "select 123 as id, 1 as priority" a_df = spark.sql(a) b = "select 123 as id, 1 as priority union select 112 as uid, 1 as priority" b_df = spark.sql(b) c_df = a_df.join(b_df, (a_df.id==b_df.id), 'left').drop(b_df.priority) The c_df schema comes out as DataFrame[uid: int, priority: int, uid: int, priority: int]. The drop function is not removing
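
A hedged sketch of the usual workaround when drop() will not disambiguate same-named columns after a join: alias each side and select the surviving columns explicitly. The b_df query here is simplified for illustration, not the asker's exact SQL.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

a_df = spark.sql("select 123 as id, 1 as priority")
b_df = spark.sql("select 123 as id, 1 as priority union select 112 as id, 2 as priority")

c_df = (
    a_df.alias("a")
    .join(b_df.alias("b"), F.col("a.id") == F.col("b.id"), "left")
    .select("a.id", "a.priority")   # keep only the left side's columns
)
c_df.printSchema()   # only the two columns from a_df remain
```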

TypeError converting a Pandas Dataframe to Spark Dataframe in Pyspark

Submitted by 大兔子大兔子 on 2020-01-06 03:08:09
Question: I did my research but didn't find anything on this. I want to convert a simple pandas.DataFrame to a Spark dataframe, like this: df = pd.DataFrame({'col1': ['a', 'b', 'c'], 'col2': [1, 2, 3]}) sc_sql.createDataFrame(df, schema=df.columns.tolist()) The error I get is: TypeError: Can not infer schema for type: <class 'str'> I tried something even simpler: df = pd.DataFrame([1, 2, 3]) sc_sql.createDataFrame(df) And I get: TypeError: Can not infer schema for type: <class 'numpy.int64'> Any help?
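
A hedged sketch of one approach that usually sidesteps this kind of inference error: hand Spark an explicit StructType schema instead of letting it infer types from the pandas/numpy dtypes. The column names and types mirror the example above.

```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = SparkSession.builder.getOrCreate()

pdf = pd.DataFrame({"col1": ["a", "b", "c"], "col2": [1, 2, 3]})

schema = StructType([
    StructField("col1", StringType(), True),
    StructField("col2", LongType(), True),
])

# Explicit schema: no type inference over numpy values is attempted.
sdf = spark.createDataFrame(pdf, schema=schema)
sdf.show()
```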

Import PySpark packages with a regular Jupyter Notebook

Submitted by 烂漫一生 on 2020-01-06 02:51:13
Question: What is pyspark actually doing other than importing packages properly? Is it possible to use a regular Jupyter notebook and then import what is needed? Answer 1: Yes, it is possible, but it can be painful. Python alone is not an issue: all you need is to set $SPARK_HOME and add $SPARK_HOME/python (and, if not otherwise accessible, $SPARK_HOME/python/lib/py4j-[VERSION]-src.zip) to the path. But the pyspark script handles JVM setup as well (--packages, --jars, --conf, etc.). This can be handled using PYSPARK_SUBMIT_ARGS
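
A minimal sketch of the manual setup this answer describes; the install path, the --packages example, and the py4j zip version are placeholders that depend on the local Spark installation, and the environment variables must be set before pyspark is imported.

```python
import os
import sys

os.environ["SPARK_HOME"] = "/opt/spark"   # assumed install path
# JVM-side options the pyspark script would normally pass; must end with "pyspark-shell".
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--packages com.databricks:spark-csv_2.11:1.5.0 pyspark-shell"
)

sys.path.insert(0, os.path.join(os.environ["SPARK_HOME"], "python"))
sys.path.insert(0, os.path.join(os.environ["SPARK_HOME"],
                                "python", "lib", "py4j-0.10.7-src.zip"))

from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()
```

The third-party findspark package automates the path-setup part of this, though the JVM options still go through PYSPARK_SUBMIT_ARGS.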