spark-dataframe

How to handle data skew in a Spark DataFrame outer join

Submitted by 亡梦爱人 on 2019-12-06 04:56:04
Question: I have two DataFrames and I am performing an outer join on 5 columns. Below is an example of my data set (the header row uses |^| as the field delimiter):

uniqueFundamentalSet|^|PeriodId|^|SourceId|^|StatementTypeCode|^|StatementCurrencyId|^|FinancialStatementLineItem.lineItemId|^|FinancialAsReportedLineItemName|^|FinancialAsReportedLineItemName.languageId|^|FinancialStatementLineItemValue|^|AdjustedForCorporateActionValue|^|ReportedCurrencyId|^|IsAsReportedCurrencySetManually|^|Unit|^|IsTotal|^|StatementSectionCode|^|DimentionalLineItemId|^
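One common way to tackle skewed join keys is salting. The sketch below is PySpark and assumes a left outer join on a composite key; left_df, right_df, spark, the chosen key columns, and the number of salts are all illustrative, not taken from the question.

from pyspark.sql import functions as F

NUM_SALTS = 10  # tune to the observed skew

# Add a random salt to the large, skewed side so one hot key is split
# across NUM_SALTS shuffle partitions.
left_salted = left_df.withColumn("salt", F.floor(F.rand() * NUM_SALTS))

# Replicate every row of the smaller side once per salt value so each
# salted partition still sees all of its potential matches.
salts = spark.range(NUM_SALTS).withColumnRenamed("id", "salt")
right_salted = right_df.crossJoin(salts)

join_keys = ["uniqueFundamentalSet", "PeriodId", "SourceId",
             "StatementTypeCode", "StatementCurrencyId"]  # illustrative choice of the 5 join columns
joined = left_salted.join(right_salted, on=join_keys + ["salt"],
                          how="left_outer").drop("salt")

Note that with this replication scheme a full outer join would duplicate unmatched right-side rows NUM_SALTS times, so the sketch covers only the left-outer case; it is worth validating row counts after the join.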

Add leading zeros to columns in a Spark DataFrame [duplicate]

Submitted by 做~自己de王妃 on 2019-12-06 04:40:36
This question already has an answer here: Prepend zeros to a value in PySpark (1 answer). Closed last year.

In short, I'm leveraging spark-xml to do some parsing of XML files. However, using it strips the leading zeros from all the values I'm interested in, and I need the final output, which is a DataFrame, to include those leading zeros. I cannot figure out a way to add the leading zeros back to the columns I'm interested in.

val df = spark.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "output")
  .option("excludeAttribute", true)
  .option("allowNumericLeadingZeros", true)
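If the values only need a fixed width restored in the final output, one option is to cast the affected column to string and left-pad it with zeros. This is a minimal sketch, shown in PySpark for consistency with the other examples on this page (the Scala equivalent uses the same lpad function from org.apache.spark.sql.functions); the column name and the width of 10 are illustrative.

from pyspark.sql import functions as F

df_padded = df.withColumn(
    "account_number",  # hypothetical column that lost its leading zeros
    F.lpad(F.col("account_number").cast("string"), 10, "0"),
)

Another route, assuming your spark-xml version supports it, is to pass an explicit schema to the reader that declares these fields as StringType so they are never parsed as numbers in the first place.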

Create a dataframe from a list in pyspark.sql

Submitted by 本小妞迷上赌 on 2019-12-06 04:19:37
Question: I am totally lost in a weird situation. I have a list li:

li = example_data.map(lambda x: get_labeled_prediction(w, x)).collect()
print li, type(li)

The output looks like this:

[(0.0, 59.0), (0.0, 51.0), (0.0, 81.0), (0.0, 8.0), (0.0, 86.0), (0.0, 86.0), (0.0, 60.0), (0.0, 54.0), (0.0, 54.0), (0.0, 84.0)] <type 'list'>

When I try to create a DataFrame from this list:

m = sqlContext.createDataFrame(l, ["prediction", "label"])

it throws this error message:

TypeError Traceback (most recent call last)
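The traceback is cut off above, so this is only a guess at the cause: a frequent source of this TypeError is that the tuples hold NumPy scalars (e.g. numpy.float64) rather than plain Python floats, which createDataFrame cannot map to a Spark type. A minimal sketch of that fix, reusing li and sqlContext from the question:

# Convert every element to a native Python float before building the DataFrame.
clean = [(float(prediction), float(label)) for prediction, label in li]
m = sqlContext.createDataFrame(clean, ["prediction", "label"])
m.show()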

SPARK read.json throwing java.io.IOException: Too many bytes before newline

Submitted by ↘锁芯ラ on 2019-12-06 03:25:44
I am getting the following error when reading a large 6 GB single-line JSON file:

Job aborted due to stage failure: Task 5 in stage 0.0 failed 1 times, most recent failure: Lost task 5.0 in stage 0.0 (TID 5, localhost): java.io.IOException: Too many bytes before newline: 2147483648

Spark does not read JSON records that span multiple lines, hence the entire 6 GB JSON document is on a single line:

jf = sqlContext.read.json("jlrn2.json")

Configuration:

spark.driver.memory 20g

Yep, you have more than Integer.MAX_VALUE bytes in your line. You need to split it up. Keep in mind that Spark is expecting each line to be a
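If splitting the file is not an option, one alternative worth trying is the reader's multiLine mode, which treats the file as a single JSON document rather than line-delimited JSON. This is a sketch assuming Spark 2.2 or later; very large single documents still have to fit comfortably in executor memory, so splitting the input remains the more robust fix suggested above.

jf = spark.read.option("multiLine", True).json("jlrn2.json")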

PySpark - Convert to JSON row by row

Submitted by 一个人想着一个人 on 2019-12-06 02:04:55
I have a very large PySpark DataFrame. I need to convert each row of the DataFrame into a JSON-formatted string and then publish that string to a Kafka topic. I originally used the following code:

for message in df.toJSON().collect():
    kafkaClient.send(message)

However, the DataFrame is very large, so it fails when trying to collect(). I was thinking of using a UDF since it processes the data row by row:

from pyspark.sql.functions import udf, struct

def get_row(row):
    json = row.toJSON()
    kafkaClient.send(message)
    return "Sent"

send_row_udf = F.udf(get_row, StringType())
df_json = df.withColumn("Sent", get
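A sketch of an alternative that avoids both collect() and the dummy "Sent" column: serialize with toJSON() and send from the executors with foreachPartition, so nothing is pulled back to the driver. It assumes the kafka-python client library; the broker address and topic name are illustrative.

def send_partition(json_rows):
    # One producer per partition, created on the executor.
    from kafka import KafkaProducer  # assumed client library
    producer = KafkaProducer(bootstrap_servers="broker-1:9092")
    for json_str in json_rows:
        producer.send("my_topic", json_str.encode("utf-8"))
    producer.flush()

df.toJSON().foreachPartition(send_partition)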

The root scratch dir: /tmp/hive on HDFS should be writable. Current permissions are: rwx------ (on Linux)

Submitted by 元气小坏坏 on 2019-12-06 01:43:15
Question: The root scratch dir: /tmp/hive on HDFS should be writable. Current permissions are: rwx------

Hi, I was executing the following Spark code in Eclipse on CDH 5.8 and getting the above RuntimeException:

public static void main(String[] args) {
    final SparkConf sparkConf = new SparkConf().setMaster("local").setAppName("HiveConnector");
    final JavaSparkContext sparkContext = new JavaSparkContext(sparkConf);
    SQLContext sqlContext = new HiveContext(sparkContext);
    DataFrame df = sqlContext.sql("SELECT *

DataFrame filtering based on a second DataFrame

Submitted by 僤鯓⒐⒋嵵緔 on 2019-12-05 23:59:34
Using Spark SQL, I have two DataFrames; they are created from one, such as:

df = sqlContext.createDataFrame(...);
df1 = df.filter("value = 'abc'"); //[path, value]
df2 = df.filter("value = 'qwe'"); //[path, value]

I want to filter df1 so that a row is kept only if part of its 'path' is a path present in df2. So if df1 has a row with path 'a/b/c/d/e', I would check whether df2 has a row whose path is 'a/b/c'. In SQL it would look like

SELECT * FROM df1 WHERE udf(path) IN (SELECT path FROM df2)

where udf is a user-defined function that shortens the original path from df1. A naive solution is to use JOIN and then filter the result, but it is
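The join idea can be expressed as a left semi join, which keeps only the rows of df1 that have a match and never duplicates them, so no post-join filtering is needed. A minimal sketch, assuming the "shortened" path is simply the first three segments; the UDF below is illustrative, not the questioner's actual shortening logic.

from pyspark.sql import functions as F
from pyspark.sql.types import StringType

# Hypothetical shortening rule: keep the first three path segments.
shorten = F.udf(lambda p: "/".join(p.split("/")[:3]), StringType())

result = (
    df1.withColumn("short_path", shorten(F.col("path")))
       .join(df2.select("path").alias("d2"),
             F.col("short_path") == F.col("d2.path"),
             "left_semi")
       .drop("short_path")
)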

How to select a same-size stratified sample from a dataframe in Apache Spark?

Submitted by 吃可爱长大的小学妹 on 2019-12-05 23:16:27
Question: I have a DataFrame in Spark 2, as shown below, where users have between 50 and thousands of posts. I would like to create a new DataFrame that has all the users from the original DataFrame but only 5 randomly sampled posts for each user.

+--------+--------------+--------------------+
| user_id|       post_id|                text|
+--------+--------------+--------------------+
|67778705|44783131591473|some text...........|
|67778705|44783134580755|some text...........|
|67778705|44783136367108|some text.....
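A common way to get a fixed number of rows per group is a window with a random ordering. A minimal sketch; column names follow the question, and users with fewer than 5 posts simply keep everything they have.

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Number each user's posts in a random order, then keep the first 5.
w = Window.partitionBy("user_id").orderBy(F.rand())

sampled = (
    df.withColumn("rn", F.row_number().over(w))
      .filter(F.col("rn") <= 5)
      .drop("rn")
)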

pyspark - create a DataFrame grouping columns into a map-type structure

Submitted by 我与影子孤独终老i on 2019-12-05 22:49:37
Question: My DataFrame has the following structure:

-------------------------
| Brand | type | amount|
-------------------------
| B     | a    | 10    |
| B     | b    | 20    |
| C     | c    | 30    |
-------------------------

I want to reduce the number of rows by grouping type and amount into one single column of type Map, so that Brand will be unique and MAP_type_AMOUNT will hold a key/value pair for each type/amount combination. I think Spark SQL might have some functions to help in this process, or do I have to get the RDD being the
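One way to express this with built-in functions is to collect each group's (type, amount) pairs and fold them into a map. This sketch assumes Spark 2.4+, where map_from_entries is available; on older versions a UDF over the collected pairs can build the same map.

from pyspark.sql import functions as F

result = (
    df.groupBy("Brand")
      .agg(
          F.map_from_entries(
              F.collect_list(F.struct("type", "amount"))
          ).alias("MAP_type_AMOUNT")
      )
)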

Using Python's reduce() to join multiple PySpark DataFrames

Submitted by 我是研究僧i on 2019-12-05 20:02:53
Does anyone know why using Python 3's functools.reduce() leads to worse performance when joining multiple PySpark DataFrames than just iteratively joining the same DataFrames using a for loop? Specifically, this gives a massive slowdown followed by an out-of-memory error:

def join_dataframes(list_of_join_columns, left_df, right_df):
    return left_df.join(right_df, on=list_of_join_columns)

joined_df = functools.reduce(
    functools.partial(join_dataframes, list_of_join_columns),
    list_of_dataframes,
)

whereas this one doesn't:

joined_df = list_of_dataframes[0]
joined_df.cache()
for right_df in
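Whichever construct drives the joins, chaining many of them makes the query plan grow with every step, and plan analysis on the driver is a common source of exactly this kind of slowdown and out-of-memory error. A hedged sketch of the loop variant with periodic lineage truncation via checkpoint(); the checkpoint directory and the interval of 5 are illustrative.

spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")  # assumed writable path

joined_df = list_of_dataframes[0]
for i, right_df in enumerate(list_of_dataframes[1:], start=1):
    joined_df = joined_df.join(right_df, on=list_of_join_columns)
    if i % 5 == 0:
        # Materialize and cut the lineage so the plan does not keep growing.
        joined_df = joined_df.checkpoint()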