apache-spark

Pyspark simple re-partition and toPandas() fails to finish on just 600,000+ rows

痴心易碎 submitted on 2021-01-27 04:08:01
Question: I have JSON data that I am reading into a data frame with several fields, repartitioning it based on two columns, and converting to Pandas. This job keeps failing on EMR on just 600,000+ rows of data with some obscure errors. I have also increased the memory settings of the Spark driver and still don't see any resolution. Here is my PySpark code:

enhDataDf = (
    sqlContext
    .read.json(sys.argv[1])
)
enhDataDf = (
    enhDataDf
    .repartition('column1', 'column2')
    .toPandas()
)
enhDataDf = sqlContext
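
The snippet above is cut off, but the failure pattern is common: repartition() only changes the distributed layout, while toPandas() still pulls every row back to the driver. A minimal sketch of one way to ease the pressure, assuming the column names column1/column2 from the question, a Spark version with Arrow support, and pyarrow installed on the driver:

import sys
from pyspark.sql import SparkSession

# Give the driver more headroom and let Arrow serialize the collected data
# compactly (driver memory itself must be set before the JVM starts, e.g.
# via spark-submit --driver-memory).
spark = (
    SparkSession.builder
    .config("spark.driver.maxResultSize", "4g")
    .config("spark.sql.execution.arrow.enabled", "true")
    .getOrCreate()
)

enhDataDf = spark.read.json(sys.argv[1])

# Select only the columns that are actually needed before collecting.
pdf = (
    enhDataDf
    .select("column1", "column2")
    .repartition("column1", "column2")
    .toPandas()
)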

Calling scala code in pyspark for XSLT transformations

ぃ、小莉子 submitted on 2021-01-27 02:48:17
Question: This might be a long shot, but I figured it couldn't hurt to ask. I'm attempting to use Elsevier's open-sourced spark-xml-utils package in PySpark to transform some XML records with XSLT. I've had a bit of success with some exploratory code getting a transformation to work:

# open XSLT processor from spark's jvm context
with open('/tmp/foo.xsl', 'r') as f:
    proc = sc._jvm.com.elsevier.spark_xml_utils.xslt.XSLTProcessor.getInstance(f.read())

# transform XML record with 'proc'
with open('/tmp/bar
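
Continuing that pattern, a sketch of how the driver-side transformation might look. The file name /tmp/bar.xml is hypothetical (the excerpt is cut off there), and it assumes XSLTProcessor exposes a transform(String) method as the spark-xml-utils documentation describes. Note that sc._jvm objects live only on the driver, so using this inside an RDD or DataFrame transformation would require instantiating the processor on the executors instead.

# open XSLT processor from spark's jvm context (driver side only)
with open('/tmp/foo.xsl', 'r') as f:
    proc = sc._jvm.com.elsevier.spark_xml_utils.xslt.XSLTProcessor.getInstance(f.read())

# transform a single XML record with 'proc'; '/tmp/bar.xml' is a placeholder
with open('/tmp/bar.xml', 'r') as f:
    transformed = proc.transform(f.read())

print(transformed)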

Does Spark write intermediate shuffle outputs to disk

ぃ、小莉子 submitted on 2021-01-26 16:46:28
Question: I'm reading Learning Spark, and I don't understand what it means that Spark's shuffle outputs are written to disk. See Chapter 8, Tuning and Debugging Spark, pages 148-149: "Spark's internal scheduler may truncate the lineage of the RDD graph if an existing RDD has already been persisted in cluster memory or on disk. A second case in which this truncation can happen is when an RDD is already materialized as a side effect of an earlier shuffle, even if it was not explicitly persisted." This is
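
A small experiment (my own sketch, not from the book) makes the behaviour visible: the shuffle-map stage writes its output files to the executors' local disks, so a second action on the same shuffled RDD shows up in the Spark UI with that stage marked as skipped.

# Pair RDD that needs a shuffle to aggregate by key.
rdd = sc.parallelize(range(1000000)).map(lambda x: (x % 100, 1))
counts = rdd.reduceByKey(lambda a, b: a + b)

counts.count()    # job 1: shuffle map output is written to local disk
counts.collect()  # job 2: the map stage is skipped, the shuffle files are reused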

Iterate through Columns of a Spark Dataframe and update specified values

我怕爱的太早我们不能终老 submitted on 2021-01-24 21:23:37
Question: To iterate through the columns of a Spark DataFrame created from a Hive table and update all occurrences of desired column values, I tried the following code.

import org.apache.spark.sql.{DataFrame}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.functions.udf

val a: DataFrame = spark.sql(s"select * from default.table_a")
val column_names: Array[String] = a.columns
val required_columns: Array[String] = column_names.filter(name => name.endsWith("_date"))
val func = udf((value:
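
The snippet ends mid-UDF, but the same update can usually be expressed without a UDF at all. A minimal PySpark sketch of the idea (the table name and the _date suffix come from the question; the value being replaced is a hypothetical placeholder):

from pyspark.sql import functions as F

df = spark.sql("select * from default.table_a")
date_columns = [c for c in df.columns if c.endswith("_date")]

# Replace a placeholder value in every *_date column with when/otherwise,
# which keeps the work inside Catalyst instead of a UDF.
for c in date_columns:
    df = df.withColumn(
        c,
        F.when(F.col(c) == "9999-12-31", F.lit(None)).otherwise(F.col(c)),
    )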

Read file path from Kafka topic and then read file and write to DeltaLake in Structured Streaming

删除回忆录丶 submitted on 2021-01-24 18:56:54
Question: I have a use case where the paths of JSON records stored in S3 arrive as Kafka messages. I have to process the data using Spark Structured Streaming. The design I had in mind is as follows: in Spark Structured Streaming, read the Kafka message containing the data path; collect the message records on the driver (the messages are small); create the DataFrame from the data location.

kafkaDf.select($"value".cast(StringType))
  .writeStream.foreachBatch((batchDf: DataFrame,
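
A sketch of that design in PySpark (the broker address, topic name, and Delta output path are placeholders, and it assumes the Delta Lake package is on the classpath):

from pyspark.sql import SparkSession, DataFrame
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

kafka_df = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "file-paths")
    .load()
)

def process_batch(batch_df: DataFrame, batch_id: int) -> None:
    # The messages are small, so collecting the S3 paths on the driver is cheap.
    paths = [row["value"]
             for row in batch_df.select(col("value").cast("string")).collect()]
    if paths:
        data = spark.read.json(paths)  # read the JSON files referenced by the messages
        data.write.format("delta").mode("append").save("s3://bucket/delta/table")

query = kafka_df.writeStream.foreachBatch(process_batch).start()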

Get HDFS file path in PySpark for files in sequence file format

我的未来我决定 submitted on 2021-01-24 07:09:23
Question: My data on HDFS is in sequence file format. I am using PySpark (Spark 1.6) and trying to achieve two things: the data path contains a timestamp in yyyy/mm/dd/hh format that I would like to bring into the data itself. I tried SparkContext.wholeTextFiles, but I think that might not support sequence file format. How do I deal with the point above if I want to crunch data for a day and bring the date into the data? In that case I would be loading data with a yyyy/mm/dd/* pattern. Appreciate any
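
One possible approach (a sketch; the base path and date are placeholders): since wholeTextFiles does not read sequence files, load each hourly directory with sequenceFile() and attach the timestamp parsed from the directory name, then union the pieces.

import datetime

base = "hdfs:///data"
day = datetime.date(2021, 1, 24)

hourly_rdds = []
for hour in range(24):
    path = "{}/{:%Y/%m/%d}/{:02d}".format(base, day, hour)
    ts = datetime.datetime(day.year, day.month, day.day, hour)
    # Bind ts as a default argument so each hour keeps its own timestamp.
    rdd = sc.sequenceFile(path).map(lambda kv, ts=ts: (ts, kv[0], kv[1]))
    hourly_rdds.append(rdd)

day_rdd = sc.union(hourly_rdds)  # all 24 hours, each record tagged with its hour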

PySpark MongoDB :: java.lang.NoClassDefFoundError: com/mongodb/client/model/Collation

旧街凉风 submitted on 2021-01-23 06:01:19
Question: I was trying to connect to MongoDB Atlas from PySpark and I have the following problem:

from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.sql.types import *
from pyspark.sql.functions import *

sc = SparkContext
spark = SparkSession.builder \
    .config("spark.mongodb.input.uri", "mongodb+srv://#USER#:#PASS#@test00-la3lt.mongodb.net/db.BUSQUEDAS?retryWrites=true") \
    .config("spark.mongodb.output.uri", "mongodb+srv://#USER#:#PASS#@test00-la3lt.mongodb.net/db
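
The NoClassDefFoundError for com/mongodb/client/model/Collation usually means the MongoDB Java driver the connector expects is missing from the classpath or too old. A sketch of one way to let Spark resolve a matching connector and driver from Maven (the connector coordinates are an assumption; pick the artifact that matches your Spark and Scala versions):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.jars.packages",
            "org.mongodb.spark:mongo-spark-connector_2.11:2.4.1")
    .config("spark.mongodb.input.uri",
            "mongodb+srv://#USER#:#PASS#@test00-la3lt.mongodb.net/db.BUSQUEDAS?retryWrites=true")
    .getOrCreate()
)

# Read the collection configured in spark.mongodb.input.uri.
df = spark.read.format("com.mongodb.spark.sql.DefaultSource").load()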