spark-dataframe

How to handle data skew in a Spark DataFrame outer join

Submitted by 亡梦爱人 on 2019-12-06 04:56:04
Question: I have two DataFrames and I am performing an outer join on 5 columns. Below is an example of my data set (the header row uses |^| as the field delimiter):

uniqueFundamentalSet|^|PeriodId|^|SourceId|^|StatementTypeCode|^|StatementCurrencyId|^|FinancialStatementLineItem.lineItemId|^|FinancialAsReportedLineItemName|^|FinancialAsReportedLineItemName.languageId|^|FinancialStatementLineItemValue|^|AdjustedForCorporateActionValue|^|ReportedCurrencyId|^|IsAsReportedCurrencySetManually|^|Unit|^|IsTotal|^|StatementSectionCode|^|DimentionalLineItemId|^
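One common way to tackle skewed join keys is salting. The sketch below is PySpark and assumes a left outer join on a composite key; left_df, right_df, spark, the chosen key columns, and the number of salts are all illustrative, not taken from the question.

from pyspark.sql import functions as F

NUM_SALTS = 10  # tune to the observed skew

# Add a random salt to the large, skewed side so one hot key is split
# across NUM_SALTS shuffle partitions.
left_salted = left_df.withColumn("salt", F.floor(F.rand() * NUM_SALTS))

# Replicate every row of the smaller side once per salt value so each
# salted partition still sees all of its potential matches.
salts = spark.range(NUM_SALTS).withColumnRenamed("id", "salt")
right_salted = right_df.crossJoin(salts)

join_keys = ["uniqueFundamentalSet", "PeriodId", "SourceId",
             "StatementTypeCode", "StatementCurrencyId"]  # illustrative choice of the 5 join columns
joined = left_salted.join(right_salted, on=join_keys + ["salt"],
                          how="left_outer").drop("salt")

Note that with this replication scheme a full outer join would duplicate unmatched right-side rows NUM_SALTS times, so the sketch covers only the left-outer case; it is worth validating row counts after the join.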

Add leading zeros to columns in a Spark DataFrame [duplicate]

Submitted by 做~自己de王妃 on 2019-12-06 04:40:36
This question already has an answer here: Prepend zeros to a value in PySpark (1 answer). Closed last year.

In short, I'm leveraging spark-xml to do some parsing of XML files. However, using it strips the leading zeros from all the values I'm interested in, and I need the final output, which is a DataFrame, to include those leading zeros. I cannot figure out a way to add the leading zeros back to the columns I'm interested in.

val df = spark.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "output")
  .option("excludeAttribute", true)
  .option("allowNumericLeadingZeros", true)
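If the values only need a fixed width restored in the final output, one option is to cast the affected column to string and left-pad it with zeros. This is a minimal sketch, shown in PySpark for consistency with the other examples on this page (the Scala equivalent uses the same lpad function from org.apache.spark.sql.functions); the column name and the width of 10 are illustrative.

from pyspark.sql import functions as F

df_padded = df.withColumn(
    "account_number",  # hypothetical column that lost its leading zeros
    F.lpad(F.col("account_number").cast("string"), 10, "0"),
)

Another route, assuming your spark-xml version supports it, is to pass an explicit schema to the reader that declares these fields as StringType so they are never parsed as numbers in the first place.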

Create a dataframe from a list in pyspark.sql

Submitted by 本小妞迷上赌 on 2019-12-06 04:19:37
Question: I am totally lost in a weird situation. I have a list li:

li = example_data.map(lambda x: get_labeled_prediction(w, x)).collect()
print li, type(li)

The output looks like this:

[(0.0, 59.0), (0.0, 51.0), (0.0, 81.0), (0.0, 8.0), (0.0, 86.0), (0.0, 86.0), (0.0, 60.0), (0.0, 54.0), (0.0, 54.0), (0.0, 84.0)] <type 'list'>

When I try to create a DataFrame from this list:

m = sqlContext.createDataFrame(l, ["prediction", "label"])

it throws this error message:

TypeError Traceback (most recent call last)
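The traceback is cut off above, so this is only a guess at the cause: a frequent source of this TypeError is that the tuples hold NumPy scalars (e.g. numpy.float64) rather than plain Python floats, which createDataFrame cannot map to a Spark type. A minimal sketch of that fix, reusing li and sqlContext from the question:

# Convert every element to a native Python float before building the DataFrame.
clean = [(float(prediction), float(label)) for prediction, label in li]
m = sqlContext.createDataFrame(clean, ["prediction", "label"])
m.show()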

SPARK read.json throwing java.io.IOException: Too many bytes before newline

Submitted by ↘锁芯ラ on 2019-12-06 03:25:44
I am getting the following error when reading a large 6 GB single-line JSON file:

Job aborted due to stage failure: Task 5 in stage 0.0 failed 1 times, most recent failure: Lost task 5.0 in stage 0.0 (TID 5, localhost): java.io.IOException: Too many bytes before newline: 2147483648

Spark does not read JSON records that span multiple lines, hence the entire 6 GB JSON document is on a single line:

jf = sqlContext.read.json("jlrn2.json")

Configuration:

spark.driver.memory 20g

Yep, you have more than Integer.MAX_VALUE bytes in your line. You need to split it up. Keep in mind that Spark is expecting each line to be a
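If splitting the file is not an option, one alternative worth trying is the reader's multiLine mode, which treats the file as a single JSON document rather than line-delimited JSON. This is a sketch assuming Spark 2.2 or later; very large single documents still have to fit comfortably in executor memory, so splitting the input remains the more robust fix suggested above.

jf = spark.read.option("multiLine", True).json("jlrn2.json")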

PySpark - Convert to JSON row by row

Submitted by 一个人想着一个人 on 2019-12-06 02:04:55
I have a very large PySpark DataFrame. I need to convert each row of the DataFrame into a JSON-formatted string and then publish that string to a Kafka topic. I originally used the following code:

for message in df.toJSON().collect():
    kafkaClient.send(message)

However, the DataFrame is very large, so it fails when trying to collect(). I was thinking of using a UDF since it processes the data row by row:

from pyspark.sql.functions import udf, struct

def get_row(row):
    json = row.toJSON()
    kafkaClient.send(message)
    return "Sent"

send_row_udf = F.udf(get_row, StringType())
df_json = df.withColumn("Sent", get
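A sketch of an alternative that avoids both collect() and the dummy "Sent" column: serialize with toJSON() and send from the executors with foreachPartition, so nothing is pulled back to the driver. It assumes the kafka-python client library; the broker address and topic name are illustrative.

def send_partition(json_rows):
    # One producer per partition, created on the executor.
    from kafka import KafkaProducer  # assumed client library
    producer = KafkaProducer(bootstrap_servers="broker-1:9092")
    for json_str in json_rows:
        producer.send("my_topic", json_str.encode("utf-8"))
    producer.flush()

df.toJSON().foreachPartition(send_partition)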

The root scratch dir: /tmp/hive on HDFS should be writable. Current permissions are: rwx------ (on Linux)

Submitted by 元气小坏坏 on 2019-12-06 01:43:15
Question: The root scratch dir: /tmp/hive on HDFS should be writable. Current permissions are: rwx------

Hi, I was executing the following Spark code in Eclipse on CDH 5.8 and getting the above RuntimeException:

public static void main(String[] args) {
    final SparkConf sparkConf = new SparkConf().setMaster("local").setAppName("HiveConnector");
    final JavaSparkContext sparkContext = new JavaSparkContext(sparkConf);
    SQLContext sqlContext = new HiveContext(sparkContext);
    DataFrame df = sqlContext.sql("SELECT *

DataFrame filtering based on a second DataFrame

Submitted by 僤鯓⒐⒋嵵緔 on 2019-12-05 23:59:34
Using Spark SQL, I have two DataFrames; they are created from one, such as:

df = sqlContext.createDataFrame(...);
df1 = df.filter("value = 'abc'"); //[path, value]
df2 = df.filter("value = 'qwe'"); //[path, value]

I want to filter df1 so that a row is kept only if part of its 'path' is a path present in df2. So if df1 has a row with path 'a/b/c/d/e', I would check whether df2 has a row whose path is 'a/b/c'. In SQL it would look like

SELECT * FROM df1 WHERE udf(path) IN (SELECT path FROM df2)

where udf is a user-defined function that shortens the original path from df1. A naive solution is to use JOIN and then filter the result, but it is
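The join idea can be expressed as a left semi join, which keeps only the rows of df1 that have a match and never duplicates them, so no post-join filtering is needed. A minimal sketch, assuming the "shortened" path is simply the first three segments; the UDF below is illustrative, not the questioner's actual shortening logic.

from pyspark.sql import functions as F
from pyspark.sql.types import StringType

# Hypothetical shortening rule: keep the first three path segments.
shorten = F.udf(lambda p: "/".join(p.split("/")[:3]), StringType())

result = (
    df1.withColumn("short_path", shorten(F.col("path")))
       .join(df2.select("path").alias("d2"),
             F.col("short_path") == F.col("d2.path"),
             "left_semi")
       .drop("short_path")
)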

How to select a same-size stratified sample from a dataframe in Apache Spark?

Submitted by 吃可爱长大的小学妹 on 2019-12-05 23:16:27
Question: I have a DataFrame in Spark 2, as shown below, where users have between 50 and thousands of posts. I would like to create a new DataFrame that has all the users from the original DataFrame but only 5 randomly sampled posts for each user.

+--------+--------------+--------------------+
| user_id|       post_id|                text|
+--------+--------------+--------------------+
|67778705|44783131591473|some text...........|
|67778705|44783134580755|some text...........|
|67778705|44783136367108|some text.....
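A common way to get a fixed number of rows per group is a window with a random ordering. A minimal sketch; column names follow the question, and users with fewer than 5 posts simply keep everything they have.

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Number each user's posts in a random order, then keep the first 5.
w = Window.partitionBy("user_id").orderBy(F.rand())

sampled = (
    df.withColumn("rn", F.row_number().over(w))
      .filter(F.col("rn") <= 5)
      .drop("rn")
)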

pyspark - create a DataFrame grouping columns into a map-type structure

Submitted by 我与影子孤独终老i on 2019-12-05 22:49:37
Question: My DataFrame has the following structure:

-------------------------
| Brand | type | amount|
-------------------------
| B     | a    | 10    |
| B     | b    | 20    |
| C     | c    | 30    |
-------------------------

I want to reduce the number of rows by grouping type and amount into one single column of type Map, so that Brand will be unique and MAP_type_AMOUNT will hold a key/value pair for each type/amount combination. I think Spark SQL might have some functions to help in this process, or do I have to get the RDD being the
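One way to express this with built-in functions is to collect each group's (type, amount) pairs and fold them into a map. This sketch assumes Spark 2.4+, where map_from_entries is available; on older versions a UDF over the collected pairs can build the same map.

from pyspark.sql import functions as F

result = (
    df.groupBy("Brand")
      .agg(
          F.map_from_entries(
              F.collect_list(F.struct("type", "amount"))
          ).alias("MAP_type_AMOUNT")
      )
)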

Using Python's reduce() to join multiple PySpark DataFrames

Submitted by 我是研究僧i on 2019-12-05 20:02:53
Does anyone know why using Python 3's functools.reduce() leads to worse performance when joining multiple PySpark DataFrames than just iteratively joining the same DataFrames using a for loop? Specifically, this gives a massive slowdown followed by an out-of-memory error:

def join_dataframes(list_of_join_columns, left_df, right_df):
    return left_df.join(right_df, on=list_of_join_columns)

joined_df = functools.reduce(
    functools.partial(join_dataframes, list_of_join_columns),
    list_of_dataframes,
)

whereas this one doesn't:

joined_df = list_of_dataframes[0]
joined_df.cache()
for right_df in
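Whichever construct drives the joins, chaining many of them makes the query plan grow with every step, and plan analysis on the driver is a common source of exactly this kind of slowdown and out-of-memory error. A hedged sketch of the loop variant with periodic lineage truncation via checkpoint(); the checkpoint directory and the interval of 5 are illustrative.

spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")  # assumed writable path

joined_df = list_of_dataframes[0]
for i, right_df in enumerate(list_of_dataframes[1:], start=1):
    joined_df = joined_df.join(right_df, on=list_of_join_columns)
    if i % 5 == 0:
        # Materialize and cut the lineage so the plan does not keep growing.
        joined_df = joined_df.checkpoint()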