pyspark

Do Spark/Parquet partitions maintain ordering?

流过昼夜 submitted on 2021-01-28 03:09:44
Question: If I partition a data set, will it be in the correct order when I read it back? For example, consider the following pyspark code:

    # read a csv
    df = sql_context.read.csv(input_filename)

    # add a hash column
    hash_udf = udf(lambda customer_id: hash(customer_id) % 4, IntegerType())
    df = df.withColumn('hash', hash_udf(df['customer_id']))

    # write out to parquet
    df.write.parquet(output_path, partitionBy=['hash'])

    # read back the file
    df2 = sql_context.read.parquet(output_path)

I am partitioning on a
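A minimal sketch, not from the original post, of the usual caveat: Spark gives no row-ordering guarantee across a write/read round trip of partitioned Parquet, so any order that matters has to be re-imposed with an explicit sort after reading. The choice of 'customer_id' as the meaningful sort key is an assumption for illustration.

    # hedged sketch: re-impose ordering after reading the partitioned parquet back
    # ('customer_id' as the sort key is an assumption, not from the original post)
    df2 = sql_context.read.parquet(output_path)
    df2_ordered = df2.orderBy('hash', 'customer_id')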

Premature end of Content-Length delimited message body SparkException while reading from S3 using Pyspark

我的梦境 submitted on 2021-01-28 01:42:06
Question: I am using the code below to read an S3 csv file from my local machine.

    from pyspark import SparkConf, SparkContext
    from pyspark.sql import SparkSession
    import configparser
    import os

    conf = SparkConf()
    conf.set('spark.jars', '/usr/local/spark/jars/aws-java-sdk-1.7.4.jar,/usr/local/spark/jars/hadoop-aws-2.7.4.jar')

    # Tried by setting this, but failed
    conf.set('spark.executor.memory', '8g')
    conf.set('spark.driver.memory', '8g')

    spark_session = SparkSession.builder \
        .config(conf=conf) \
        .appName(
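A minimal sketch, not from the original post, of how the s3a connector is commonly configured for a local read; the bucket, object key, and credentials are placeholders, and none of this is claimed to resolve the Content-Length error by itself.

    # hedged sketch: typical s3a configuration for reading a CSV from S3 locally
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName('s3-read-example')
             .config('spark.hadoop.fs.s3a.access.key', '<ACCESS_KEY>')      # placeholder
             .config('spark.hadoop.fs.s3a.secret.key', '<SECRET_KEY>')      # placeholder
             .config('spark.hadoop.fs.s3a.impl', 'org.apache.hadoop.fs.s3a.S3AFileSystem')
             .getOrCreate())

    df = spark.read.csv('s3a://my-bucket/path/to/file.csv', header=True)    # placeholder path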

Spark on Windows 10. 'Files\Spark\bin\..\jars""\' is not recognized as an internal or external command

泪湿孤枕 submitted on 2021-01-28 00:51:00
Question: I am very frustrated with Spark. An evening wasted thinking that I was doing something wrong, but I have uninstalled and reinstalled several times, following multiple guides that all indicate a very similar path. At the cmd prompt, I am trying to run:

    pyspark

or

    spark-shell

The steps I followed include downloading a pre-built package from https://spark.apache.org/downloads.html, including spark 2.0.2 with hadoop 2.3 and spark 2.1.0 with hadoop 2.7. Neither works, and I get this error: 'Files\Spark\bin

Find latest file pyspark

我们两清 submitted on 2021-01-27 22:04:16
Question: So I've figured out how to find the latest file using Python. Now I'm wondering if I can find the latest file using pyspark. Currently I specify a path, but I'd like pyspark to pick up the latest modified file. The current code looks like this:

    df = sc.read.csv("Path://to/file", header=True, inferSchema=True)

Thanks in advance for your help.

Answer 1: I copied the code to get the HDFS API to work with PySpark from this answer: Pyspark: get list of files/directories on HDFS path

    URI = sc._gateway.jvm.java
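A minimal sketch, assuming the gateway-based Hadoop FileSystem approach the answer refers to; the filesystem URI and directory below are placeholders.

    # hedged sketch: pick the most recently modified file via the Hadoop FileSystem API
    URI = sc._gateway.jvm.java.net.URI
    Path = sc._gateway.jvm.org.apache.hadoop.fs.Path
    FileSystem = sc._gateway.jvm.org.apache.hadoop.fs.FileSystem
    Configuration = sc._gateway.jvm.org.apache.hadoop.conf.Configuration

    fs = FileSystem.get(URI('hdfs://namenode:8020'), Configuration())   # placeholder URI
    statuses = fs.listStatus(Path('/path/to/dir'))                      # placeholder directory
    latest = max(statuses, key=lambda s: s.getModificationTime())

    df = spark.read.csv(latest.getPath().toString(), header=True, inferSchema=True)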

How to write spark dataframe in a single file in local system without using coalesce

↘锁芯ラ submitted on 2021-01-27 21:21:22
Question: I want to generate an avro file from a pyspark dataframe, and currently I am doing a coalesce as below:

    df = df.coalesce(1)
    df.write.format('avro').save('file:///mypath')

But this now leads to memory issues, as all the data is fetched into memory before writing, and my data size is growing consistently every day. So I want to write the data partition by partition, so that the data is written to disk in chunks and doesn't raise OOM issues. I found that toLocalIterator helps in achieving this.
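A minimal sketch, not from the original post, of streaming rows through the driver with toLocalIterator and writing them incrementally; fastavro and the record schema below are assumptions for illustration, not part of the question.

    # hedged sketch: write rows one partition at a time via toLocalIterator
    # (fastavro and this schema are assumptions, not from the original post)
    from fastavro import writer, parse_schema

    schema = parse_schema({
        'name': 'Record',
        'type': 'record',
        'fields': [{'name': 'id', 'type': 'string'}],   # hypothetical field
    })

    with open('/mypath/output.avro', 'wb') as out:
        rows = (row.asDict() for row in df.toLocalIterator())
        writer(out, schema, rows)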

Handling empty arrays in pySpark (optional binary element (UTF8) is not a group)

寵の児 submitted on 2021-01-27 21:02:50
Question: I have a json-like structure in spark which looks as follows:

    >>> df = spark.read.parquet(good_partition_path)
    id: string
    some-array: array
        element: struct
            array-field-1: string
            array-field-2: string

Depending on the partition, some-array might be an empty array for all id's. When this happens, spark infers the following schema:

    >>> df = spark.read.parquet(bad_partition_path)
    id: string
    some-array: array
        element: string

Of course that's a problem if I want to read multiple partitions because
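A minimal sketch, not taken from the post, of one common way around this: supply an explicit schema on read so an empty partition cannot change the inferred element type. The field names mirror the printed schema above.

    # hedged sketch: force the schema instead of letting each partition infer its own
    from pyspark.sql.types import StructType, StructField, StringType, ArrayType

    schema = StructType([
        StructField('id', StringType()),
        StructField('some-array', ArrayType(StructType([
            StructField('array-field-1', StringType()),
            StructField('array-field-2', StringType()),
        ]))),
    ])

    df = spark.read.schema(schema).parquet(good_partition_path, bad_partition_path)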

PySpark, Win10 - The system cannot find the path specified

淺唱寂寞╮ submitted on 2021-01-27 20:54:19
Question: I previously had PySpark installed as a Python package through pip. I uninstalled it recently, along with a clean version of Python, and downloaded the standalone version. In my User variables I made a variable named SPARK_HOME with the value C:\spark-2.3.2-bin-hadoop2.7\bin. In System variables, under Path, I made the entry C:\spark-2.3.2-bin-hadoop2.7\bin. When I run pyspark I cannot run spark-shell either. Any ideas?

Answer 1: SPARK_HOME should be without the bin folder. Hence, set SPARK_HOME
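A minimal sketch of what the corrected variable amounts to, checked from a plain Python session; findspark is an assumption here and is not mentioned in the original answer.

    # hedged sketch: SPARK_HOME points at the Spark root, not at bin
    # (findspark is an assumption, not from the original answer)
    import os
    import findspark

    os.environ['SPARK_HOME'] = r'C:\spark-2.3.2-bin-hadoop2.7'   # no trailing \bin
    findspark.init()                                             # resolves Spark from SPARK_HOME

    import pyspark
    sc = pyspark.SparkContext(appName='smoke-test')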

PySpark first and last function over a partition in one go

北战南征 submitted on 2021-01-27 19:54:45
Question: I have pyspark code like this:

    spark_df = spark_df.orderBy('id', 'a1', 'c1')
    out_df = spark_df.groupBy('id', 'a1', 'a2').agg(
        F.first('c1').alias('c1'),
        F.last('c2').alias('c2'),
        F.first('c3').alias('c3'))

I need to keep the data ordered by id, a1 and c1, and then select the columns as shown above over the group defined by the keys id, a1 and a2. Because of the non-determinism of first and last, I changed the code to the ugly-looking version below, which works, but I'm not sure it is efficient.

    w_first =
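A minimal sketch, not from the post, of the usual deterministic alternative: evaluating first and last over an explicitly ordered window with an unbounded frame, then de-duplicating per group.

    # hedged sketch: deterministic first/last over an ordered window
    from pyspark.sql import Window, functions as F

    w = (Window.partitionBy('id', 'a1', 'a2')
               .orderBy('c1')
               .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing))

    out_df = (spark_df
              .withColumn('c1', F.first('c1').over(w))
              .withColumn('c2', F.last('c2').over(w))
              .withColumn('c3', F.first('c3').over(w))
              .select('id', 'a1', 'a2', 'c1', 'c2', 'c3')
              .dropDuplicates(['id', 'a1', 'a2']))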

Implicit schema for pandas_udf in PySpark?

南楼画角 submitted on 2021-01-27 18:01:32
Question: This answer nicely explains how to use pyspark's groupby and pandas_udf to do custom aggregations. However, I cannot possibly declare my schema manually, as shown in this part of the example:

    from pyspark.sql.types import *

    schema = StructType([
        StructField("key", StringType()),
        StructField("avg_min", DoubleType())
    ])

since I will be returning 100+ columns with names that are automatically generated. Is there any way to tell PySpark to just implicitly use the schema returned by my function, and
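A minimal sketch, not from the post, of one common workaround: run the aggregation on a small pandas sample and let Spark derive the StructType from it. Here my_agg and the sample frame are hypothetical names used only for illustration.

    # hedged sketch: derive the pandas_udf schema from a small pandas sample
    # (my_agg and the sample frame are hypothetical)
    import pandas as pd
    from pyspark.sql.functions import pandas_udf, PandasUDFType

    sample_pdf = my_agg(pd.DataFrame({'key': ['a'], 'value': [1.0]}))
    schema = spark.createDataFrame(sample_pdf).schema     # StructType inferred by Spark

    @pandas_udf(schema, PandasUDFType.GROUPED_MAP)
    def wrapped(pdf):
        return my_agg(pdf)

    out = df.groupby('key').apply(wrapped)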

What does checkpointing do on Apache Spark?

心不动则不痛 submitted on 2021-01-27 17:50:17
Question: What does checkpointing do for Apache Spark, and does it take any hits on RAM or CPU?

Answer 1: From the Apache Spark Streaming documentation (hope it helps): A streaming application must operate 24/7 and hence must be resilient to failures unrelated to the application logic (e.g., system failures, JVM crashes, etc.). For this to be possible, Spark Streaming needs to checkpoint enough information to a fault-tolerant storage system such that it can recover from failures. There are two types of data that are
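A minimal sketch, not part of the quoted documentation, of how checkpointing is typically enabled; the checkpoint directories are placeholders, and sc and df are assumed to exist already.

    # hedged sketch: enabling checkpointing (directories below are placeholders)
    from pyspark.streaming import StreamingContext

    sc.setCheckpointDir('hdfs:///tmp/checkpoints')
    df_checkpointed = df.checkpoint()            # truncates lineage, persists to the directory

    # for Spark Streaming, the streaming context itself is checkpointed
    ssc = StreamingContext(sc, batchDuration=10)
    ssc.checkpoint('hdfs:///tmp/streaming-checkpoints')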