spark-dataframe

How to “reduce” multiple json tables stored in a column of an RDD to a single RDD table as efficiently as possible

Submitted by 六眼飞鱼酱① on 2019-12-02 18:46:55
Question: Will appending rows concurrently with union on a DataFrame, using the following code, work correctly? It is currently showing a type error.
from pyspark.sql.types import *
schema = StructType([
    StructField("owreg", StringType(), True),
    StructField("we", StringType(), True),
    StructField("aa", StringType(), True),
    StructField("cc", StringType(), True),
    StructField("ss", StringType(), True),
    StructField("ss", StringType(), True),
    StructField("sss", StringType(), True)
])
f = sqlContext
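The question is cut off above; a minimal sketch of the usual pattern, assuming the goal is to collect rows from several sources into one table: build each piece as its own small DataFrame with the same explicit schema and union them (DataFrames are immutable, so each union returns a new DataFrame rather than appending in place). The two-column schema and sample rows below are illustrative, not the asker's.

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.getOrCreate()

# One explicit schema shared by every batch; note that duplicate field names
# (like "ss" appearing twice above) should be avoided.
schema = StructType([
    StructField("owreg", StringType(), True),
    StructField("we", StringType(), True),
])

# Start from an empty DataFrame with that schema, then union each batch into it.
result = spark.createDataFrame([], schema)
for batch in [[("a", "b")], [("c", "d")]]:
    result = result.union(spark.createDataFrame(batch, schema))

result.show()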

Spark: Find Each Partition Size for RDD

Submitted by 為{幸葍}努か on 2019-12-02 17:46:01
What's the best way of finding the size of each partition for a given RDD? I'm trying to debug a skewed partition issue, and I've tried this:
l = builder.rdd.glom().map(len).collect()  # get length of each partition
print('Min Partition Size: ', min(l), '. Max Partition Size: ', max(l), '. Avg Partition Size: ', sum(l)/len(l), '. Total Partitions: ', len(l))
It works fine for small RDDs, but for bigger RDDs it gives an OOM error. My idea is that glom() is causing this to happen. Is there a better way to do it?
Use: builder.rdd.mapPartitions(lambda it: [sum(1 for _ in it)])
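For reference, a runnable sketch of that mapPartitions approach: it counts elements per partition on the executors and ships only one integer per partition back to the driver, so it avoids the OOM that glom() can cause by materializing whole partitions on the driver. The RDD here is a made-up stand-in for builder.rdd.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
rdd = spark.sparkContext.parallelize(range(1000), numSlices=8)  # stand-in for builder.rdd

# Count rows inside each partition; only the per-partition counts travel to the driver.
sizes = rdd.mapPartitions(lambda it: [sum(1 for _ in it)]).collect()

print('Min Partition Size:', min(sizes),
      'Max Partition Size:', max(sizes),
      'Avg Partition Size:', sum(sizes) / len(sizes),
      'Total Partitions:', len(sizes))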

Adding a custom delimiter adds double quotes in the final Spark data frame CSV output

Submitted by 放肆的年华 on 2019-12-02 16:56:23
Question: I have a data frame where I am replacing the default delimiter , with |^|. It is working fine and I am getting the expected result, except where , is found in the records. For example, I have one such record like the one below:
4295859078|^|914|^|INC|^|Balancing Item - Non Operating Income/(Expense),net|^||^||^|IIII|^|False|^||^||^||^||^|False|^||^||^||^||^|505096|^|505074|^|505074|^|505096|^|505096|^||^|505074|^|True|^||^|3014960|^||^|I|!|
So there is a , in the 4th field. Now I am doing like this
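One common way around the quoting, sketched below under the assumption that the multi-character delimiter |^| is being produced through the CSV writer: build each output line yourself with concat_ws and write it with the text writer, which never adds quotes around fields that contain commas. The column names and sample row here are placeholders, not the asker's real schema.

from pyspark.sql import SparkSession
from pyspark.sql.functions import concat_ws, col

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(4295859078, 914, "Balancing Item - Non Operating Income/(Expense),net")],
    ["id", "code", "description"],
)

# Join all columns with the custom delimiter into a single string column,
# then use the text writer, which does not quote fields containing commas.
# Note: concat_ws skips nulls, so replace nulls with empty strings first
# if the field count must be preserved.
out = df.select(concat_ws("|^|", *[col(c).cast("string") for c in df.columns]).alias("value"))
out.write.mode("overwrite").text("/tmp/output_with_custom_delimiter")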

Spark Window Functions require HiveContext?

Submitted by 痞子三分冷 on 2019-12-02 15:08:45
Question: I am trying an example of a window function in Spark from this blog: http://xinhstechblog.blogspot.in/2016/04/spark-window-functions-for-dataframes.html. I get the following error while running the program. My question: do we need a HiveContext to execute window functions in Spark?
Exception in thread "main" org.apache.spark.sql.AnalysisException: Could not resolve window function 'avg'. Note that, using window functions currently requires a HiveContext;
at org.apache.spark.sql.catalyst.analysis
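If this is Spark 1.x, the error message is literal: window functions need a HiveContext rather than a plain SQLContext. In Spark 2.x the SparkSession covers this without any Hive setup. A small sketch of a window-function example, assuming Spark 2.x and column names invented to mirror the blog's revenue example:

from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import avg

spark = SparkSession.builder.getOrCreate()  # in Spark 1.x, construct HiveContext(sc) instead of SQLContext(sc)
df = spark.createDataFrame(
    [("Thin", "Cell Phone", 6000), ("Normal", "Tablet", 1500), ("Mini", "Tablet", 5500)],
    ["product", "category", "revenue"],
)

# Average revenue per category, attached to every row of that category.
w = Window.partitionBy("category")
df.withColumn("avg_revenue", avg("revenue").over(w)).show()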

How to enable Cartesian join in Spark 2.0? [duplicate]

Submitted by 孤人 on 2019-12-02 12:24:43
Question: This question already has answers here: spark.sql.crossJoin.enabled for Spark 2.x (3 answers). Closed 2 years ago. I have to cross join 2 DataFrames in Spark 2.0, and I am encountering the error below:
User class threw exception: org.apache.spark.sql.AnalysisException: Cartesian joins could be prohibitively expensive and are disabled by default. To explicitly enable them, please set spark.sql.crossJoin.enabled = true;
Please help me with where to set this configuration; I am coding in Eclipse.
Answer 1: As the
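The flag is a Spark SQL configuration, so it is set where the SparkSession (or SQLContext) is created rather than anywhere in Eclipse itself. A minimal sketch, shown here in PySpark with made-up DataFrames; in Scala or Java the same .config(...) call on the SparkSession builder applies:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("cross-join-example")
         .config("spark.sql.crossJoin.enabled", "true")  # allow Cartesian products
         .getOrCreate())

left = spark.createDataFrame([(1,), (2,)], ["a"])
right = spark.createDataFrame([("x",), ("y",)], ["b"])

# With the flag enabled, a join without a condition is allowed as a Cartesian product.
left.join(right).show()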

Transforming a column and updating the DataFrame

Submitted by 本小妞迷上赌 on 2019-12-02 11:45:31
Question: What I'm doing below is dropping column A from a DataFrame because I want to apply a transformation (here I just json.loads a JSON string) and replace the old column with the transformed one. After the transformation I just join the two resulting data frames.
df = df_data.drop('A').join(
    df_data[['ID', 'A']].rdd\
        .map(lambda x: (x.ID, json.loads(x.A)) if x.A is not None else (x.ID, None))\
        .toDF()\
        .withColumnRenamed('_1', 'ID')\
        .withColumnRenamed('_2', 'A'),
    ['ID']
)
The thing I dislike
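A possibly simpler alternative that avoids the drop-and-join round trip, sketched under the assumption that A holds a JSON string: transform the column in place with withColumn and a UDF (or from_json in newer Spark versions). The sample rows and the string-to-string map return type are illustrative only.

import json
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, col
from pyspark.sql.types import MapType, StringType

spark = SparkSession.builder.getOrCreate()
df_data = spark.createDataFrame(
    [(1, '{"k": "v"}'), (2, None)],
    ["ID", "A"],
)

# Parse the JSON string into a map; None stays None.
parse_a = udf(lambda s: json.loads(s) if s is not None else None,
              MapType(StringType(), StringType()))

# Replace column A in place instead of dropping it and re-joining.
df = df_data.withColumn("A", parse_a(col("A")))
df.show(truncate=False)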

How to Remove header and footer from Dataframe?

Submitted by 荒凉一梦 on 2019-12-02 10:29:49
Question: I am reading a text (not CSV) file that has a header, content and a footer using spark.read.format("text").option("delimiter","|")...load(file). I can access the header with df.first(). Is there something close to df.last() or df.reverse().first()?
Answer 1: Sample data:
col1|col2|col3
100|hello|asdf
300|hi|abc
200|bye|xyz
800|ciao|qwerty
This is the footer line
Processing logic:
# load text file
txt = sc.textFile("path_to_above_sample_data_text_file.txt")
# remove header
header = txt.first()
txt = txt
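The answer is cut off above. A sketch of one way to finish the idea and also drop the footer, assuming the file looks like the sample data: zipWithIndex gives every line a position, so the first and last lines can be filtered out without collecting anything to the driver. The path and column names mirror the sample above.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

txt = sc.textFile("path_to_above_sample_data_text_file.txt")

# Number every line, then drop index 0 (the header) and the highest index (the footer).
total = txt.count()
body = (txt.zipWithIndex()
           .filter(lambda pair: pair[1] != 0 and pair[1] != total - 1)
           .map(lambda pair: pair[0]))

# Split the remaining lines on "|" into a DataFrame using the header's column names.
df = body.map(lambda line: line.split("|")).toDF(["col1", "col2", "col3"])
df.show()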

How do I compare each column in a table using DataFrames in Scala?

Submitted by 跟風遠走 on 2019-12-02 10:12:28
There are two tables; one is an ID table (Table 1) and the other is an attribute table (Table 2); the tables themselves appear as images in the original post. If the two IDs in the same row of Table 1 have the same attribute, we get 1, otherwise we get 0. Finally, we get the result, Table 3. For example, id1 and id2 have a different color and size, so the id1/id2 row (2nd row in Table 3) is "id1 id2 0 0"; id1 and id3 have the same color but a different size, so the id1/id3 row (3rd row in Table 3) is "id1 id3 1 0". Same attribute: 1; different attribute: 0. How can I get the result Table 3 using a Scala DataFrame? This should do the trick: import spark
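The answer is cut off above. As an illustration of the general idea (join the attribute table onto both ID columns of the pair table, then compare the attribute columns), here is a sketch written in PySpark; the Scala DataFrame API calls are analogous. The table contents and column names are invented to match the description.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, when

spark = SparkSession.builder.getOrCreate()

# Table 1: pairs of IDs to compare; Table 2: attributes per ID.
pairs = spark.createDataFrame([("id1", "id2"), ("id1", "id3")], ["left_id", "right_id"])
attrs = spark.createDataFrame(
    [("id1", "red", "L"), ("id2", "blue", "M"), ("id3", "red", "M")],
    ["id", "color", "size"],
)

a = attrs.alias("a")
b = attrs.alias("b")

# Join the attributes of both IDs onto each pair, then emit 1 when they match and 0 otherwise.
result = (pairs
          .join(a, pairs.left_id == col("a.id"))
          .join(b, pairs.right_id == col("b.id"))
          .select(
              "left_id", "right_id",
              when(col("a.color") == col("b.color"), 1).otherwise(0).alias("same_color"),
              when(col("a.size") == col("b.size"), 1).otherwise(0).alias("same_size"),
          ))
result.show()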

Spark dataframes convert nested JSON to separate columns

Submitted by 我与影子孤独终老i on 2019-12-02 10:05:32
I have a stream of JSONs with the following structure that gets converted to a DataFrame:
{ "a": 3936, "b": 123, "c": "34", "attributes": { "d": "146", "e": "12", "f": "23" } }
The DataFrame show function results in the following output:
sqlContext.read.json(jsonRDD).show
+----+-----------+---+---+
|   a| attributes|  b|  c|
+----+-----------+---+---+
|3936|[146,12,23]|123| 34|
+----+-----------+---+---+
How can I split the attributes column (a nested JSON structure) into attributes.d, attributes.e and attributes.f as separate columns in a new DataFrame, so I can have columns a, b, c, attributes.d, attributes.e
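A small sketch of the usual fix, assuming the JSON shown above: select the nested struct fields directly (or use attributes.* to pull every child up) so they become top-level columns.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
json_str = '{"a": 3936, "b": 123, "c": "34", "attributes": {"d": "146", "e": "12", "f": "23"}}'
df = spark.read.json(spark.sparkContext.parallelize([json_str]))

# Option 1: pull every child of the struct up as its own column (named d, e, f).
df.select("a", "b", "c", "attributes.*").show()

# Option 2: keep the dotted names the question asks for by aliasing explicitly
# (dots in column names then need backticks when referenced later).
df.select(
    "a", "b", "c",
    col("attributes.d").alias("attributes.d"),
    col("attributes.e").alias("attributes.e"),
    col("attributes.f").alias("attributes.f"),
).show()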

Datasets in Apache Spark

Submitted by 荒凉一梦 on 2019-12-02 09:30:09
Dataset<Tweet> ds = sc.read().json("path").as(Encoders.bean(Tweet.class));
ds.show();
JavaRDD<Tweet> dstry = ds.toJavaRDD();
System.out.println(dstry.first().getClass());
Caused by: java.util.concurrent.ExecutionException: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 50, Column 16: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 50, Column 16: No applicable constructor/method found for actual parameters "org.apache.spark.unsafe.types.UTF8String"; candidates are: "public void sparkSQL.Tweet.setId(long)"
at org.spark