pyspark

Do Spark/Parquet partitions maintain ordering?

流过昼夜 submitted on 2021-01-28 03:09:44
Question: If I partition a data set, will it be in the correct order when I read it back? For example, consider the following pyspark code:

    # read a csv
    df = sql_context.read.csv(input_filename)

    # add a hash column
    hash_udf = udf(lambda customer_id: hash(customer_id) % 4, IntegerType())
    df = df.withColumn('hash', hash_udf(df['customer_id']))

    # write out to parquet
    df.write.parquet(output_path, partitionBy=['hash'])

    # read back the file
    df2 = sql_context.read.parquet(output_path)

I am partitioning on a
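A minimal sketch, not from the original post, of the usual caveat: Spark gives no row-ordering guarantee across a write/read round trip of partitioned Parquet, so any order that matters has to be re-imposed with an explicit sort after reading. The choice of 'customer_id' as the meaningful sort key is an assumption for illustration.

    # hedged sketch: re-impose ordering after reading the partitioned parquet back
    # ('customer_id' as the sort key is an assumption, not from the original post)
    df2 = sql_context.read.parquet(output_path)
    df2_ordered = df2.orderBy('hash', 'customer_id')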

Premature end of Content-Length delimited message body SparkException while reading from S3 using Pyspark

我的梦境 submitted on 2021-01-28 01:42:06
Question: I am using the code below to read an S3 csv file from my local machine.

    from pyspark import SparkConf, SparkContext
    from pyspark.sql import SparkSession
    import configparser
    import os

    conf = SparkConf()
    conf.set('spark.jars', '/usr/local/spark/jars/aws-java-sdk-1.7.4.jar,/usr/local/spark/jars/hadoop-aws-2.7.4.jar')

    # Tried by setting this, but failed
    conf.set('spark.executor.memory', '8g')
    conf.set('spark.driver.memory', '8g')

    spark_session = SparkSession.builder \
        .config(conf=conf) \
        .appName(
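A minimal sketch, not from the original post, of how the s3a connector is commonly configured for a local read; the bucket, object key, and credentials are placeholders, and none of this is claimed to resolve the Content-Length error by itself.

    # hedged sketch: typical s3a configuration for reading a CSV from S3 locally
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName('s3-read-example')
             .config('spark.hadoop.fs.s3a.access.key', '<ACCESS_KEY>')      # placeholder
             .config('spark.hadoop.fs.s3a.secret.key', '<SECRET_KEY>')      # placeholder
             .config('spark.hadoop.fs.s3a.impl', 'org.apache.hadoop.fs.s3a.S3AFileSystem')
             .getOrCreate())

    df = spark.read.csv('s3a://my-bucket/path/to/file.csv', header=True)    # placeholder path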

Spark on Windows 10. 'Files\Spark\bin\..\jars""\' is not recognized as an internal or external command

泪湿孤枕 submitted on 2021-01-28 00:51:00
Question: I am very frustrated with Spark. An evening wasted thinking that I was doing something wrong, but I have uninstalled and reinstalled several times, following multiple guides that all indicate a very similar path. At the cmd prompt, I am trying to run:

    pyspark

or

    spark-shell

The steps I followed include downloading a pre-built package from https://spark.apache.org/downloads.html, including spark 2.0.2 with hadoop 2.3 and spark 2.1.0 with hadoop 2.7. Neither works, and I get this error: 'Files\Spark\bin

Find latest file pyspark

我们两清 submitted on 2021-01-27 22:04:16
Question: So I've figured out how to find the latest file using Python. Now I'm wondering if I can find the latest file using pyspark. Currently I specify a path, but I'd like pyspark to pick up the latest modified file. The current code looks like this:

    df = sc.read.csv("Path://to/file", header=True, inferSchema=True)

Thanks in advance for your help.

Answer 1: I copied the code to get the HDFS API to work with PySpark from this answer: Pyspark: get list of files/directories on HDFS path

    URI = sc._gateway.jvm.java
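A minimal sketch, assuming the gateway-based Hadoop FileSystem approach the answer refers to; the filesystem URI and directory below are placeholders.

    # hedged sketch: pick the most recently modified file via the Hadoop FileSystem API
    URI = sc._gateway.jvm.java.net.URI
    Path = sc._gateway.jvm.org.apache.hadoop.fs.Path
    FileSystem = sc._gateway.jvm.org.apache.hadoop.fs.FileSystem
    Configuration = sc._gateway.jvm.org.apache.hadoop.conf.Configuration

    fs = FileSystem.get(URI('hdfs://namenode:8020'), Configuration())   # placeholder URI
    statuses = fs.listStatus(Path('/path/to/dir'))                      # placeholder directory
    latest = max(statuses, key=lambda s: s.getModificationTime())

    df = spark.read.csv(latest.getPath().toString(), header=True, inferSchema=True)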

How to write spark dataframe in a single file in local system without using coalesce

↘锁芯ラ submitted on 2021-01-27 21:21:22
Question: I want to generate an avro file from a pyspark dataframe, and currently I am doing a coalesce as below:

    df = df.coalesce(1)
    df.write.format('avro').save('file:///mypath')

But this now leads to memory issues, as all the data is fetched into memory before writing, and my data size is growing consistently every day. So I want to write the data partition by partition, so that the data is written to disk in chunks and doesn't raise OOM issues. I found that toLocalIterator helps in achieving this.
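A minimal sketch, not from the original post, of streaming rows through the driver with toLocalIterator and writing them incrementally; fastavro and the record schema below are assumptions for illustration, not part of the question.

    # hedged sketch: write rows one partition at a time via toLocalIterator
    # (fastavro and this schema are assumptions, not from the original post)
    from fastavro import writer, parse_schema

    schema = parse_schema({
        'name': 'Record',
        'type': 'record',
        'fields': [{'name': 'id', 'type': 'string'}],   # hypothetical field
    })

    with open('/mypath/output.avro', 'wb') as out:
        rows = (row.asDict() for row in df.toLocalIterator())
        writer(out, schema, rows)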

Handling empty arrays in pySpark (optional binary element (UTF8) is not a group)

寵の児 submitted on 2021-01-27 21:02:50
Question: I have a json-like structure in spark which looks as follows:

    >>> df = spark.read.parquet(good_partition_path)
    id: string
    some-array: array
        element: struct
            array-field-1: string
            array-field-2: string

Depending on the partition, some-array might be an empty array for all id's. When this happens, spark infers the following schema:

    >>> df = spark.read.parquet(bad_partition_path)
    id: string
    some-array: array
        element: string

Of course that's a problem if I want to read multiple partitions because
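A minimal sketch, not taken from the post, of one common way around this: supply an explicit schema on read so an empty partition cannot change the inferred element type. The field names mirror the printed schema above.

    # hedged sketch: force the schema instead of letting each partition infer its own
    from pyspark.sql.types import StructType, StructField, StringType, ArrayType

    schema = StructType([
        StructField('id', StringType()),
        StructField('some-array', ArrayType(StructType([
            StructField('array-field-1', StringType()),
            StructField('array-field-2', StringType()),
        ]))),
    ])

    df = spark.read.schema(schema).parquet(good_partition_path, bad_partition_path)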

PySpark, Win10 - The system cannot find the path specified

淺唱寂寞╮ submitted on 2021-01-27 20:54:19
Question: I previously had PySpark installed as a Python package through pip. I uninstalled it recently, along with a clean version of Python, and downloaded the standalone version. In my User variables I made a variable named SPARK_HOME with the value C:\spark-2.3.2-bin-hadoop2.7\bin. In System variables, under Path, I made the entry C:\spark-2.3.2-bin-hadoop2.7\bin. When I run pyspark I cannot run spark-shell either. Any ideas?

Answer 1: SPARK_HOME should be without the bin folder. Hence, set SPARK_HOME
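A minimal sketch of what the corrected variable amounts to, checked from a plain Python session; findspark is an assumption here and is not mentioned in the original answer.

    # hedged sketch: SPARK_HOME points at the Spark root, not at bin
    # (findspark is an assumption, not from the original answer)
    import os
    import findspark

    os.environ['SPARK_HOME'] = r'C:\spark-2.3.2-bin-hadoop2.7'   # no trailing \bin
    findspark.init()                                             # resolves Spark from SPARK_HOME

    import pyspark
    sc = pyspark.SparkContext(appName='smoke-test')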

PySpark first and last function over a partition in one go

北战南征 submitted on 2021-01-27 19:54:45
Question: I have pyspark code like this:

    spark_df = spark_df.orderBy('id', 'a1', 'c1')
    out_df = spark_df.groupBy('id', 'a1', 'a2').agg(
        F.first('c1').alias('c1'),
        F.last('c2').alias('c2'),
        F.first('c3').alias('c3'))

I need to keep the data ordered by id, a1 and c1, and then select the columns as shown above over the group defined by the keys id, a1 and a2. Because of the non-determinism of first and last, I changed the code to the ugly-looking version below, which works, but I'm not sure it is efficient.

    w_first =
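A minimal sketch, not from the post, of the usual deterministic alternative: evaluating first and last over an explicitly ordered window with an unbounded frame, then de-duplicating per group.

    # hedged sketch: deterministic first/last over an ordered window
    from pyspark.sql import Window, functions as F

    w = (Window.partitionBy('id', 'a1', 'a2')
               .orderBy('c1')
               .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing))

    out_df = (spark_df
              .withColumn('c1', F.first('c1').over(w))
              .withColumn('c2', F.last('c2').over(w))
              .withColumn('c3', F.first('c3').over(w))
              .select('id', 'a1', 'a2', 'c1', 'c2', 'c3')
              .dropDuplicates(['id', 'a1', 'a2']))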

Implicit schema for pandas_udf in PySpark?

南楼画角 submitted on 2021-01-27 18:01:32
Question: This answer nicely explains how to use pyspark's groupby and pandas_udf to do custom aggregations. However, I cannot possibly declare my schema manually, as shown in this part of the example:

    from pyspark.sql.types import *

    schema = StructType([
        StructField("key", StringType()),
        StructField("avg_min", DoubleType())
    ])

since I will be returning 100+ columns with names that are automatically generated. Is there any way to tell PySpark to just implicitly use the schema returned by my function, and
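A minimal sketch, not from the post, of one common workaround: run the aggregation on a small pandas sample and let Spark derive the StructType from it. Here my_agg and the sample frame are hypothetical names used only for illustration.

    # hedged sketch: derive the pandas_udf schema from a small pandas sample
    # (my_agg and the sample frame are hypothetical)
    import pandas as pd
    from pyspark.sql.functions import pandas_udf, PandasUDFType

    sample_pdf = my_agg(pd.DataFrame({'key': ['a'], 'value': [1.0]}))
    schema = spark.createDataFrame(sample_pdf).schema     # StructType inferred by Spark

    @pandas_udf(schema, PandasUDFType.GROUPED_MAP)
    def wrapped(pdf):
        return my_agg(pdf)

    out = df.groupby('key').apply(wrapped)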

What does checkpointing do on Apache Spark?

心不动则不痛 submitted on 2021-01-27 17:50:17
Question: What does checkpointing do for Apache Spark, and does it take any hits on RAM or CPU?

Answer 1: From the Apache Spark Streaming documentation (hope it helps): A streaming application must operate 24/7 and hence must be resilient to failures unrelated to the application logic (e.g., system failures, JVM crashes, etc.). For this to be possible, Spark Streaming needs to checkpoint enough information to a fault-tolerant storage system such that it can recover from failures. There are two types of data that are
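A minimal sketch, not part of the quoted documentation, of how checkpointing is typically enabled; the checkpoint directories are placeholders, and sc and df are assumed to exist already.

    # hedged sketch: enabling checkpointing (directories below are placeholders)
    from pyspark.streaming import StreamingContext

    sc.setCheckpointDir('hdfs:///tmp/checkpoints')
    df_checkpointed = df.checkpoint()            # truncates lineage, persists to the directory

    # for Spark Streaming, the streaming context itself is checkpointed
    ssc = StreamingContext(sc, batchDuration=10)
    ssc.checkpoint('hdfs:///tmp/streaming-checkpoints')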