apache-spark

Pyspark simple re-partition and toPandas() fails to finish on just 600,000+ rows

痴心易碎 submitted on 2021-01-27 04:08:01
Question: I have JSON data that I am reading into a data frame with several fields, repartitioning it based on two columns, and converting to Pandas. This job keeps failing on EMR on just 600,000+ rows of data with some obscure errors. I have also increased the memory settings of the Spark driver and still don't see any resolution. Here is my PySpark code:

enhDataDf = (
    sqlContext
    .read.json(sys.argv[1])
)
enhDataDf = (
    enhDataDf
    .repartition('column1', 'column2')
    .toPandas()
)
enhDataDf = sqlContext
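
The snippet above is cut off, but the failure pattern is common: repartition() only changes the distributed layout, while toPandas() still pulls every row back to the driver. A minimal sketch of one way to ease the pressure, assuming the column names column1/column2 from the question, a Spark version with Arrow support, and pyarrow installed on the driver:

import sys
from pyspark.sql import SparkSession

# Give the driver more headroom and let Arrow serialize the collected data
# compactly (driver memory itself must be set before the JVM starts, e.g.
# via spark-submit --driver-memory).
spark = (
    SparkSession.builder
    .config("spark.driver.maxResultSize", "4g")
    .config("spark.sql.execution.arrow.enabled", "true")
    .getOrCreate()
)

enhDataDf = spark.read.json(sys.argv[1])

# Select only the columns that are actually needed before collecting.
pdf = (
    enhDataDf
    .select("column1", "column2")
    .repartition("column1", "column2")
    .toPandas()
)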

Calling scala code in pyspark for XSLT transformations

ぃ、小莉子 submitted on 2021-01-27 02:48:17
Question: This might be a long shot, but I figured it couldn't hurt to ask. I'm attempting to use Elsevier's open-sourced spark-xml-utils package in PySpark to transform some XML records with XSLT. I've had a bit of success with some exploratory code getting a transformation to work:

# open XSLT processor from spark's jvm context
with open('/tmp/foo.xsl', 'r') as f:
    proc = sc._jvm.com.elsevier.spark_xml_utils.xslt.XSLTProcessor.getInstance(f.read())

# transform XML record with 'proc'
with open('/tmp/bar
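
Continuing that pattern, a sketch of how the driver-side transformation might look. The file name /tmp/bar.xml is hypothetical (the excerpt is cut off there), and it assumes XSLTProcessor exposes a transform(String) method as the spark-xml-utils documentation describes. Note that sc._jvm objects live only on the driver, so using this inside an RDD or DataFrame transformation would require instantiating the processor on the executors instead.

# open XSLT processor from spark's jvm context (driver side only)
with open('/tmp/foo.xsl', 'r') as f:
    proc = sc._jvm.com.elsevier.spark_xml_utils.xslt.XSLTProcessor.getInstance(f.read())

# transform a single XML record with 'proc'; '/tmp/bar.xml' is a placeholder
with open('/tmp/bar.xml', 'r') as f:
    transformed = proc.transform(f.read())

print(transformed)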

Does Spark write intermediate shuffle outputs to disk

ぃ、小莉子 submitted on 2021-01-26 16:46:28
Question: I'm reading Learning Spark, and I don't understand what it means that Spark's shuffle outputs are written to disk. See Chapter 8, Tuning and Debugging Spark, pages 148-149: "Spark's internal scheduler may truncate the lineage of the RDD graph if an existing RDD has already been persisted in cluster memory or on disk. A second case in which this truncation can happen is when an RDD is already materialized as a side effect of an earlier shuffle, even if it was not explicitly persisted." This is
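
A small experiment (my own sketch, not from the book) makes the behaviour visible: the shuffle-map stage writes its output files to the executors' local disks, so a second action on the same shuffled RDD shows up in the Spark UI with that stage marked as skipped.

# Pair RDD that needs a shuffle to aggregate by key.
rdd = sc.parallelize(range(1000000)).map(lambda x: (x % 100, 1))
counts = rdd.reduceByKey(lambda a, b: a + b)

counts.count()    # job 1: shuffle map output is written to local disk
counts.collect()  # job 2: the map stage is skipped, the shuffle files are reused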

Iterate through Columns of a Spark Dataframe and update specified values

我怕爱的太早我们不能终老 submitted on 2021-01-24 21:23:37
Question: To iterate through the columns of a Spark DataFrame created from a Hive table and update all occurrences of desired column values, I tried the following code.

import org.apache.spark.sql.{DataFrame}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.functions.udf

val a: DataFrame = spark.sql(s"select * from default.table_a")
val column_names: Array[String] = a.columns
val required_columns: Array[String] = column_names.filter(name => name.endsWith("_date"))
val func = udf((value:
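
The snippet ends mid-UDF, but the same update can usually be expressed without a UDF at all. A minimal PySpark sketch of the idea (the table name and the _date suffix come from the question; the value being replaced is a hypothetical placeholder):

from pyspark.sql import functions as F

df = spark.sql("select * from default.table_a")
date_columns = [c for c in df.columns if c.endswith("_date")]

# Replace a placeholder value in every *_date column with when/otherwise,
# which keeps the work inside Catalyst instead of a UDF.
for c in date_columns:
    df = df.withColumn(
        c,
        F.when(F.col(c) == "9999-12-31", F.lit(None)).otherwise(F.col(c)),
    )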

Read file path from Kafka topic and then read file and write to DeltaLake in Structured Streaming

删除回忆录丶 submitted on 2021-01-24 18:56:54
Question: I have a use case where the paths of JSON records stored in S3 arrive as Kafka messages. I have to process the data using Spark Structured Streaming. The design I had in mind is as follows: in Spark Structured Streaming, read the Kafka message containing the data path; collect the message records on the driver (the messages are small); create the DataFrame from the data location.

kafkaDf.select($"value".cast(StringType))
  .writeStream.foreachBatch((batchDf: DataFrame,
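
A sketch of that design in PySpark (the broker address, topic name, and Delta output path are placeholders, and it assumes the Delta Lake package is on the classpath):

from pyspark.sql import SparkSession, DataFrame
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

kafka_df = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "file-paths")
    .load()
)

def process_batch(batch_df: DataFrame, batch_id: int) -> None:
    # The messages are small, so collecting the S3 paths on the driver is cheap.
    paths = [row["value"]
             for row in batch_df.select(col("value").cast("string")).collect()]
    if paths:
        data = spark.read.json(paths)  # read the JSON files referenced by the messages
        data.write.format("delta").mode("append").save("s3://bucket/delta/table")

query = kafka_df.writeStream.foreachBatch(process_batch).start()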

Get HDFS file path in PySpark for files in sequence file format

我的未来我决定 submitted on 2021-01-24 07:09:23
Question: My data on HDFS is in sequence file format. I am using PySpark (Spark 1.6) and trying to achieve two things: the data path contains a timestamp in yyyy/mm/dd/hh format that I would like to bring into the data itself. I tried SparkContext.wholeTextFiles, but I think that might not support sequence file format. How do I deal with the point above if I want to crunch data for a day and bring the date into the data? In that case I would be loading data with a yyyy/mm/dd/* pattern. Appreciate any
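
One possible approach (a sketch; the base path and date are placeholders): since wholeTextFiles does not read sequence files, load each hourly directory with sequenceFile() and attach the timestamp parsed from the directory name, then union the pieces.

import datetime

base = "hdfs:///data"
day = datetime.date(2021, 1, 24)

hourly_rdds = []
for hour in range(24):
    path = "{}/{:%Y/%m/%d}/{:02d}".format(base, day, hour)
    ts = datetime.datetime(day.year, day.month, day.day, hour)
    # Bind ts as a default argument so each hour keeps its own timestamp.
    rdd = sc.sequenceFile(path).map(lambda kv, ts=ts: (ts, kv[0], kv[1]))
    hourly_rdds.append(rdd)

day_rdd = sc.union(hourly_rdds)  # all 24 hours, each record tagged with its hour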

PySpark MongoDB :: java.lang.NoClassDefFoundError: com/mongodb/client/model/Collation

旧街凉风 submitted on 2021-01-23 06:01:19
Question: I was trying to connect to MongoDB Atlas from PySpark and I have the following problem:

from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.sql.types import *
from pyspark.sql.functions import *

sc = SparkContext
spark = SparkSession.builder \
    .config("spark.mongodb.input.uri", "mongodb+srv://#USER#:#PASS#@test00-la3lt.mongodb.net/db.BUSQUEDAS?retryWrites=true") \
    .config("spark.mongodb.output.uri", "mongodb+srv://#USER#:#PASS#@test00-la3lt.mongodb.net/db
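
The NoClassDefFoundError for com/mongodb/client/model/Collation usually means the MongoDB Java driver the connector expects is missing from the classpath or too old. A sketch of one way to let Spark resolve a matching connector and driver from Maven (the connector coordinates are an assumption; pick the artifact that matches your Spark and Scala versions):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.jars.packages",
            "org.mongodb.spark:mongo-spark-connector_2.11:2.4.1")
    .config("spark.mongodb.input.uri",
            "mongodb+srv://#USER#:#PASS#@test00-la3lt.mongodb.net/db.BUSQUEDAS?retryWrites=true")
    .getOrCreate()
)

# Read the collection configured in spark.mongodb.input.uri.
df = spark.read.format("com.mongodb.spark.sql.DefaultSource").load()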