spark-dataframe

Joining a large and a ginormous spark dataframe

一笑奈何 submitted on 2019-12-03 07:47:29
I have two dataframes: df1 has 6 million rows and df2 has 1 billion. I have tried the standard join,

df1.join(df2, df1("id") <=> df2("id2"))

but I run out of memory. df1 is too large to be put into a broadcast join. I have even tried a Bloom filter, but it was also too large to fit in a broadcast and still be useful. The only thing I have tried that doesn't error out is to break df1 into 300,000-row chunks and join each chunk with df2 in a foreach loop. But this takes an order of magnitude longer than it probably should (likely because it is too large to fit when persisted, causing the split to be redone up to that point).
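A direction that is often suggested for joins at this scale (my assumption, not part of the question) is to keep the shuffle join but make each shuffle unit smaller: raise spark.sql.shuffle.partitions and repartition both sides by the join key so that no single task has to hold too much data. A minimal PySpark sketch, reusing the df1/df2 and id/id2 names from the question:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Assumption: many small shuffle partitions keep each join task's memory footprint low.
spark.conf.set("spark.sql.shuffle.partitions", "2000")

# Repartition both sides by the join key, then join with null-safe equality
# (eqNullSafe mirrors the <=> operator used in the question).
joined = (
    df1.repartition(2000, "id")
       .join(df2.repartition(2000, "id2"), df1["id"].eqNullSafe(df2["id2"]))
)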

How to modify a Spark Dataframe with a complex nested structure?

為{幸葍}努か submitted on 2019-12-03 07:44:45
I have a complex DataFrame structure and would like to null out a column easily. I have created implicit classes that wire up the functionality and easily handle flat, two-dimensional DataFrame structures, but once the DataFrame becomes more complicated with ArrayType or MapType columns I have not had much luck. For example, I have a schema defined as:

StructType(
  StructField(name, StringType, true),
  StructField(data, ArrayType(
    StructType(
      StructField(name, StringType, true),
      StructField(values, MapType(StringType, StringType, true), true)
    ), true
  ), true)
)

I'd like to produce a new DataFrame that has the MapType field data.values set to null, but ...
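One possible approach (my assumption, not from the question, and it requires Spark 2.4+ for the transform higher-order function) is to rebuild the data array with the values field replaced by a null of the same map type. A PySpark sketch against the schema above, for a dataframe called df:

from pyspark.sql import functions as F

# Rebuild each struct in the `data` array, keeping `name` and replacing `values`
# with a null of the same map type (Spark 2.4+ for transform/named_struct).
df_nulled = df.withColumn(
    "data",
    F.expr(
        "transform(data, x -> named_struct("
        "'name', x.name, "
        "'values', cast(null as map<string,string>)))"
    ),
)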

Cannot resolve column (numeric column name) in Spark Dataframe

一曲冷凌霜 submitted on 2019-12-03 07:30:56
This is my data:

scala> data.printSchema
root
 |-- 1.0: string (nullable = true)
 |-- 2.0: string (nullable = true)
 |-- 3.0: string (nullable = true)

This doesn't work :(

scala> data.select("2.0").show

Exception:
org.apache.spark.sql.AnalysisException: cannot resolve '`2.0`' given input columns: [1.0, 2.0, 3.0];;
'Project ['2.0]
+- Project [_1#5608 AS 1.0#5615, _2#5609 AS 2.0#5616, _3#5610 AS 3.0#5617]
   +- LocalRelation [_1#5608, _2#5609, _3#5610]
...

Try this at home (I'm running on the shell v_2.1.0.5)!

val data = spark.createDataFrame(Seq(
  ("Hello", ", ", "World!")
)).toDF("1.0", "2.0", "3.0")
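The usual fix for column names that contain a dot is to escape the name in backticks, so the analyzer treats it as a literal column name rather than a struct field access. A quick sketch (shown in PySpark here; the same backtick escaping works from the Scala API and in SQL):

from pyspark.sql import functions as F

# Backticks make "2.0" a literal column name instead of column "2" with field "0".
data.select("`2.0`").show()
data.select(F.col("`2.0`")).show()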

Randomly shuffle column in Spark RDD or dataframe

牧云@^-^@ submitted on 2019-12-03 07:21:25
Is there any way I can shuffle a column of an RDD or dataframe so that the entries in that column appear in random order? I'm not sure which APIs I could use to accomplish such a task.

If you don't need a global shuffle across your data, you can shuffle within partitions using the mapPartitions method:

rdd.mapPartitions(Random.shuffle(_))

For a PairRDD (RDDs of type RDD[(K, V)]), if you are interested in shuffling the key-value mappings (mapping an arbitrary key to an arbitrary value):

pairRDD.mapPartitions(iterator => {
  val (keySequence, valueSequence) = iterator.toSeq.unzip
  val ...
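A PySpark equivalent of the per-partition shuffle quoted above (a sketch, with the same caveat that this does not shuffle globally across partitions):

import random

def shuffle_partition(rows):
    # Materialize the partition, shuffle it in place, and hand it back as an iterator.
    rows = list(rows)
    random.shuffle(rows)
    return iter(rows)

shuffled_rdd = rdd.mapPartitions(shuffle_partition)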

pyspark: isin vs join

北战南征 submitted on 2019-12-03 06:42:15
What are general best practices for filtering a dataframe in pyspark by a given list of values? Specifically: depending on the size of the given list of values, when is it best, with respect to runtime, to use isin vs. an inner join vs. broadcast?

This question is the Spark analogue of the following question in Pig: Pig: efficient filtering by loaded list

Additional context: Pyspark isin function

Considering

import pyspark.sql.functions as psf

there are two types of broadcasting: sc.broadcast() to copy Python objects to every node for a more efficient use of psf.isin, and psf.broadcast inside a join ...
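A sketch of the variants being compared, assuming a dataframe df with a column key and a Python list values to filter by (these names are mine, not from the question):

import pyspark.sql.functions as psf

# 1) Small list: isin on the column is usually the simplest option.
filtered_isin = df.filter(psf.col("key").isin(values))

# 2) Larger list: turn the list into a DataFrame and do an inner join.
values_df = spark.createDataFrame([(v,) for v in values], ["key"])
filtered_join = df.join(values_df, on="key", how="inner")

# 3) Same join, with a hint that the list side is small enough to broadcast.
filtered_broadcast = df.join(psf.broadcast(values_df), on="key", how="inner")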

“resolved attribute(s) missing” when performing join on pySpark

こ雲淡風輕ζ submitted on 2019-12-03 05:43:45
I have the following two pySpark dataframes:

> df_lag_pre.columns
['date', 'sku', 'name', 'country', 'ccy_code', 'quantity', 'usd_price', 'usd_lag', 'lag_quantity']

> df_unmatched.columns
['alt_sku', 'alt_lag_quantity', 'country', 'ccy_code', 'name', 'usd_price']

Now I want to join them on the common columns, so I try the following:

> df_lag_pre.join(df_unmatched, on=['name', 'country', 'ccy_code', 'usd_price'])

And I get the following error message:

AnalysisException: u'resolved attribute(s) price#3424 missing from country#3443,month#801,price#808,category#803,subcategory#804,page#805,date#280,link#809,name ...
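A workaround that is commonly suggested for this error (my assumption here, since the question is cut off before any answer) is to break the shared lineage by re-selecting and aliasing every column on both sides, so each dataframe carries fresh attribute ids before the join:

from pyspark.sql import functions as F

# Re-select every column with an explicit alias to give each side fresh attribute ids.
left = df_lag_pre.select([F.col(c).alias(c) for c in df_lag_pre.columns])
right = df_unmatched.select([F.col(c).alias(c) for c in df_unmatched.columns])

joined = left.join(right, on=['name', 'country', 'ccy_code', 'usd_price'])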

Why does Apache Spark read unnecessary Parquet columns within nested structures?

有些话、适合烂在心里 submitted on 2019-12-03 05:37:12
My team is building an ETL process to load raw delimited text files into a Parquet-based "data lake" using Spark. One of the promises of the Parquet column store is that a query will only read the necessary "column stripes". But we're seeing unexpected columns being read for nested schema structures. To demonstrate, here is a POC using Scala and the Spark 2.0.1 shell:

// Preliminary setup
sc.setLogLevel("INFO")
import org.apache.spark.sql.types._
import org.apache.spark.sql._

// Create a schema with nested complex structures
val schema = StructType(Seq(
  StructField("F1", IntegerType),
  ...
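One way to see exactly which Parquet columns Spark plans to read is to check the ReadSchema entry in the physical plan. A PySpark sketch against a hypothetical Parquet path (F1 comes from the snippet above; the nested field name S1.F2 is purely illustrative):

# Hypothetical path; the POC in the question writes its own test data instead.
df = spark.read.parquet("/tmp/nested_poc")

# Top-level column: ReadSchema in the printed plan should list only F1.
df.select("F1").explain()

# Nested field: depending on the Spark version, ReadSchema may show the whole
# enclosing struct being read rather than just the single leaf field.
df.select("S1.F2").explain()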

Spark: Find Each Partition Size for RDD

你离开我真会死。 submitted on 2019-12-03 04:36:32
Question: What's the best way of finding each partition size for a given RDD? I'm trying to debug a skewed partition issue, and I've tried this:

l = builder.rdd.glom().map(len).collect()  # get length of each partition
print('Min Partition Size: ', min(l), '. Max Partition Size: ', max(l), '. Avg Partition Size: ', sum(l)/len(l), '. Total Partitions: ', len(l))

It works fine for small RDDs, but for bigger RDDs it gives an OOM error. My idea is that glom() is causing this to happen. But anyway, I just wanted to ...
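A lower-memory variant (a sketch, not from the question) is to count rows inside mapPartitions instead of materializing whole partitions with glom(), so the driver only collects one integer per partition:

sizes = builder.rdd.mapPartitions(lambda rows: [sum(1 for _ in rows)]).collect()

print('Min Partition Size: ', min(sizes), '. Max Partition Size: ', max(sizes),
      '. Avg Partition Size: ', sum(sizes) / len(sizes), '. Total Partitions: ', len(sizes))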

Take n rows from a spark dataframe and pass to toPandas()

对着背影说爱祢 submitted on 2019-12-03 04:15:13
I have this code:

l = [('Alice', 1), ('Jim', 2), ('Sandra', 3)]
df = sqlContext.createDataFrame(l, ['name', 'age'])
df.withColumn('age2', df.age + 2).toPandas()

It works fine and does what it needs to. Suppose, though, that I only want to display the first n rows and then call toPandas() to return a pandas dataframe. How do I do it? I can't call take(n) because that doesn't return a dataframe, and thus I can't pass it to toPandas(). To put it another way: how can I take the top n rows from a dataframe and call toPandas() on the resulting dataframe? I can't think this is difficult, but I can't figure it out. I ...
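A sketch of the usual answer: limit(n) returns a DataFrame (unlike take(n), which returns a list of Rows), so it can be chained straight into toPandas():

n = 2  # hypothetical row count

# limit(n) keeps the result a Spark DataFrame, so toPandas() can follow it.
pdf = df.withColumn('age2', df.age + 2).limit(n).toPandas()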

Inferring Spark DataType from string literals

我是研究僧i submitted on 2019-12-03 03:57:41
I am trying to write a Scala function that can infer Spark DataTypes based on a provided input string:

/**
 * Example:
 * ========
 * toSparkType("string")  => StringType
 * toSparkType("boolean") => BooleanType
 * toSparkType("date")    => DateType
 * etc.
 */
def toSparkType(inputType: String): DataType = {
  var dt: DataType = null
  if (matchesStringRegex(inputType)) {
    dt = StringType
  } else if (matchesBooleanRegex(inputType)) {
    dt = BooleanType
  } else if (matchesDateRegex(inputType)) {
    dt = DateType
  } else if (...) {
    ...
  }
  dt
}

My goal is to support a large subset, if not all, of the available DataTypes ...
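For comparison, a minimal PySpark sketch of the same idea using a plain lookup table instead of regex matching (the helper name and the supported set of type names are my assumptions, not from the question):

from pyspark.sql.types import (BooleanType, DataType, DateType,
                               IntegerType, StringType)

# Hypothetical lookup table; extend it with whatever type names need supporting.
_TYPE_MAP = {
    "string": StringType(),
    "boolean": BooleanType(),
    "date": DateType(),
    "int": IntegerType(),
}

def to_spark_type(input_type: str) -> DataType:
    try:
        return _TYPE_MAP[input_type.strip().lower()]
    except KeyError:
        raise ValueError("Unsupported type name: " + input_type)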