spark-dataframe

Define a Spark UDF by reflection on a String

Submitted by 末鹿安然 on 2019-12-04 11:12:09
I am trying to define a UDF in Spark (2.0) from a string containing a Scala function definition. Here is the snippet:

    val universe: scala.reflect.runtime.universe.type = scala.reflect.runtime.universe
    import universe._
    import scala.reflect.runtime.currentMirror
    import scala.tools.reflect.ToolBox

    val toolbox = currentMirror.mkToolBox()
    val f = udf(toolbox.eval(toolbox.parse("(s:String) => 5")).asInstanceOf[String => Int])
    sc.parallelize(Seq("1","5")).toDF.select(f(col("value"))).show

This gives me an error:

    Caused by: java.lang.ClassCastException: cannot assign instance of scala.collection…
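As a sanity check, the ToolBox part of the pattern can be exercised outside Spark. The sketch below is illustrative (the function string and names are not from the question): it compiles a Scala function literal from a String and calls it directly. Wrapping such a toolbox-compiled closure in udf can still fail when tasks are deserialized on executors, which is the likely source of the ClassCastException above.

    import scala.reflect.runtime.currentMirror
    import scala.tools.reflect.ToolBox

    object ToolBoxFunctionSketch {
      def main(args: Array[String]): Unit = {
        // Build a toolbox from the runtime mirror (needs scala-compiler on the classpath).
        val toolbox = currentMirror.mkToolBox()

        // Parse and evaluate a Scala function literal supplied as a String,
        // then cast the result to the expected function type.
        val fn = toolbox.eval(toolbox.parse("(s: String) => s.length + 5"))
          .asInstanceOf[String => Int]

        println(fn("hello")) // prints 10
      }
    }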

Create a dataframe from a list in pyspark.sql

Submitted by 北慕城南 on 2019-12-04 10:26:18
I am totally lost in a weird situation. I have a list li:

    li = example_data.map(lambda x: get_labeled_prediction(w,x)).collect()
    print li, type(li)

The output looks like this:

    [(0.0, 59.0), (0.0, 51.0), (0.0, 81.0), (0.0, 8.0), (0.0, 86.0), (0.0, 86.0), (0.0, 60.0), (0.0, 54.0), (0.0, 54.0), (0.0, 84.0)] <type 'list'>

When I try to create a dataframe from this list:

    m = sqlContext.createDataFrame(l, ["prediction", "label"])

it throws the error message:

    TypeError Traceback (most recent call last)
    <ipython-input-90-4a49f7f67700> in <module>()
    56 l = example_data.map(lambda x: get_labeled_prediction(w,x)…
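The excerpt is PySpark, but for comparison, here is a minimal Scala sketch of the same operation: building a two-column DataFrame from an in-memory list of (prediction, label) pairs. The session setup and sample values are assumptions, not code from the question.

    import org.apache.spark.sql.SparkSession

    object DataFrameFromListSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().master("local[*]").appName("df-from-list").getOrCreate()
        import spark.implicits._

        // An in-memory list of (prediction, label) pairs, mirroring the printed output above.
        val li = Seq((0.0, 59.0), (0.0, 51.0), (0.0, 81.0))

        // Name the columns explicitly when building the DataFrame.
        val df = li.toDF("prediction", "label")
        df.show()

        spark.stop()
      }
    }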

How to handle data skew in the spark data frame for outer join

Submitted by 南笙酒味 on 2019-12-04 09:20:29
I have two data frames and I am performing an outer join on 5 columns. Below is an example of my data set:

    uniqueFundamentalSet|^|PeriodId|^|SourceId|^|StatementTypeCode|^|StatementCurrencyId|^|FinancialStatementLineItem.lineItemId|^|FinancialAsReportedLineItemName|^|FinancialAsReportedLineItemName.languageId|^|FinancialStatementLineItemValue|^|AdjustedForCorporateActionValue|^|ReportedCurrencyId|^|IsAsReportedCurrencySetManually|^|Unit|^|IsTotal|^|StatementSectionCode|^|DimentionalLineItemId|^|IsDerived|^|EstimateMethodCode|^|EstimateMethodNote|^|EstimateMethodNote.languageId|^…
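One common way to soften skew in a join is key salting. The sketch below is a generic, assumed illustration (single join key, left outer join, made-up column names), not the asker's data: the skewed side gets a random salt and the other side is replicated once per salt value, so rows sharing a hot key spread across several partitions.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    object SaltedJoinSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().master("local[*]").appName("salted-join").getOrCreate()
        import spark.implicits._

        val numSalts = 8

        // Skewed side: many rows share key 1.
        val big   = Seq((1, "a"), (1, "b"), (1, "c"), (2, "d")).toDF("key", "value")
        val small = Seq((1, "x"), (2, "y"), (3, "z")).toDF("key", "other")

        // Spread the hot key across numSalts buckets on the big side...
        val bigSalted = big.withColumn("salt", (rand() * numSalts).cast("int"))
        // ...and replicate every row of the small side once per salt value.
        val smallSalted = small.withColumn("salt", explode(array((0 until numSalts).map(lit): _*)))

        bigSalted.join(smallSalted, Seq("key", "salt"), "left_outer").drop("salt").show()
      }
    }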

Fetching distinct values on a column using Spark DataFrame

Submitted by a 夏天 on 2019-12-04 08:15:26
Question: Using Spark version 1.6.1, I need to fetch the distinct values of a column and then perform some specific transformation on top of them. The column contains more than 50 million records and can grow larger. I understand that doing a distinct.collect() will bring the results back to the driver program. Currently I am performing this task as below; is there a better approach?

    import sqlContext.implicits._
    preProcessedData.persist(StorageLevel.MEMORY_AND_DISK_2)
    preProcessedData.select(ApplicationId)…
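A minimal sketch of the usual alternative to distinct().collect(): keep the distinct values as a DataFrame and apply the follow-up transformation on the executors. The column name mirrors the excerpt, but the data, the transformation, and the Spark 2.x-style session are assumptions.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.upper

    object DistinctColumnSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().master("local[*]").appName("distinct-column").getOrCreate()
        import spark.implicits._

        // Stand-in for the excerpt's preProcessedData.
        val preProcessedData = Seq(("app1", 1), ("app1", 2), ("app2", 3)).toDF("ApplicationId", "value")

        // Keep the distinct values as a DataFrame instead of collecting them to the driver.
        val distinctIds = preProcessedData.select("ApplicationId").distinct()

        // Example follow-up transformation, executed on the executors; only show() touches the driver.
        distinctIds.withColumn("ApplicationId", upper($"ApplicationId")).show()
      }
    }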

Spark 1.6 SQL or Dataframe or Windows

Submitted by 爷，独闯天下 on 2019-12-04 06:44:28
Question: I have a data dump of work orders as below. I need to identify the orders that have both the 'In Progress' and 'Finished' statuses. I also need to display only the case of an 'In Progress' status together with a 'Finished/Not Valid' status. The expected output is shown below. What is the best approach I can follow for this in Spark? The input and output are attached here.

Input:

    Work_Req_Id,Assigned to,Date,Status
    R1,John,3/4/15,In Progress
    R1,George,3/5/15,In Progress
    R2,Peter,3/6/15,In…
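One possible approach (an assumption, not taken from the question) is a grouped aggregation that flags each work order having both statuses, then joins back to the detail rows. A minimal Scala sketch with made-up rows:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    object WorkOrderStatusSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().master("local[*]").appName("work-orders").getOrCreate()
        import spark.implicits._

        val orders = Seq(
          ("R1", "John",   "3/4/15", "In Progress"),
          ("R1", "George", "3/5/15", "Finished"),
          ("R2", "Peter",  "3/6/15", "In Progress")
        ).toDF("Work_Req_Id", "Assigned_to", "Date", "Status")

        // Flag each order that has at least one 'In Progress' and one 'Finished' row.
        val withBoth = orders
          .groupBy("Work_Req_Id")
          .agg(
            max(when($"Status" === "In Progress", 1).otherwise(0)).as("has_in_progress"),
            max(when($"Status" === "Finished", 1).otherwise(0)).as("has_finished")
          )
          .where($"has_in_progress" === 1 && $"has_finished" === 1)

        // Keep only the detail rows of those orders.
        orders.join(withBoth.select("Work_Req_Id"), Seq("Work_Req_Id")).show()
      }
    }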

How to efficiently read multiple small parquet files with Spark? Is there a CombineParquetInputFormat?

Submitted by 余生颓废 on 2019-12-04 06:38:46
Question: Spark generated multiple small Parquet files. How can one efficiently handle many small Parquet files in both the producer and the consumer Spark jobs?

Answer 1: The most straightforward approach, IMHO, is to use repartition/coalesce (prefer coalesce unless the data is skewed and you want to create same-sized outputs) before writing the Parquet files, so that you do not create small files to begin with.

    df
      .map(<some transformation>)
      .filter(<some filter>)
      ///...
      .coalesce(<number of partitions>)
      .write…
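A runnable version of the coalesce-before-write pattern from the answer; the filter, partition count, and output path are placeholders.

    import org.apache.spark.sql.SparkSession

    object CoalesceWriteSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().master("local[*]").appName("coalesce-write").getOrCreate()
        import spark.implicits._

        val df = spark.range(0, 1000000).toDF("id")

        df.filter($"id" % 2 === 0)   // some filter
          .coalesce(4)               // cap the number of output files
          .write
          .mode("overwrite")
          .parquet("data/output_parquet")
      }
    }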

Spark Dataframe: Generate an Array of Tuple from a Map type

Submitted by 南楼画角 on 2019-12-04 06:19:10
Question: My downstream system does not support a Map type, but my source does, and so it sends one. I need to convert this map into an array of structs (tuples). Scala supports Map.toArray, which creates an array of tuples, and that seems like the function I need to apply to the Map to transform:

    {
      "a" : {
        "b": {
          "key1" : "value1",
          "key2" : "value2"
        },
        "b_" : {
          "array": [
            { "key": "key1", "value" : "value1" },
            { "key": "key2", "value" : "value2" }
          ]
        }
      }
    }

What is the most efficient way in Spark to do this…
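A minimal sketch of two ways to turn a MapType column into an array of key/value structs; the column names and the Spark-version notes are assumptions on top of the excerpt.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{map_entries, udf}

    object MapToArraySketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().master("local[*]").appName("map-to-array").getOrCreate()
        import spark.implicits._

        val df = Seq(("a", Map("key1" -> "value1", "key2" -> "value2"))).toDF("id", "b")

        // Spark 2.4+: the built-in map_entries turns map<k,v> into array<struct<key,value>>.
        df.select($"id", map_entries($"b").as("b_")).printSchema()

        // Older releases: a UDF over Map.toSeq yields array<struct<_1,_2>>.
        val mapToArray = udf((m: Map[String, String]) => m.toSeq)
        df.select($"id", mapToArray($"b").as("b_")).printSchema()
      }
    }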

How do I compare each column in a table using DataFrames in Scala?

Submitted by 回眸只為那壹抹淺笑 on 2019-12-04 06:15:22
Question: There are two tables; one is the ID table (Table 1) and the other is the attribute table (Table 2).

[Table 1] [Table 2]

If the IDs in the same row of Table 1 have the same attribute, we get 1; otherwise we get 0. Finally, we get the result, Table 3.

[Table 3]

For example, id1 and id2 have a different color and size, so the id1/id2 row (2nd row in Table 3) is "id1 id2 0 0"; id1 and id3 have the same color but a different size, so the id1/id3 row (3rd row in Table 3) is "id1 id3 1 0". Same attribute---1, Different attribute--…
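One way to produce a result like Table 3 is a self cross join of the attribute table followed by per-column equality flags. The sketch below uses made-up colors, sizes, and column names, since the actual tables are only attached as images.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    object PairwiseCompareSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().master("local[*]").appName("pairwise-compare").getOrCreate()
        import spark.implicits._

        val attrs = Seq(
          ("id1", "red",  "S"),
          ("id2", "blue", "M"),
          ("id3", "red",  "L")
        ).toDF("id", "color", "size")

        val left  = attrs.select($"id".as("id_a"), $"color".as("color_a"), $"size".as("size_a"))
        val right = attrs.select($"id".as("id_b"), $"color".as("color_b"), $"size".as("size_b"))

        // Compare every pair of distinct ids once.
        val pairs = left.crossJoin(right).where($"id_a" < $"id_b")

        pairs.select(
          $"id_a", $"id_b",
          when($"color_a" === $"color_b", 1).otherwise(0).as("same_color"),
          when($"size_a" === $"size_b", 1).otherwise(0).as("same_size")
        ).show()
      }
    }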

The root scratch dir: /tmp/hive on HDFS should be writable. Current permissions are: rwx------ (on Linux)

Submitted by 纵饮孤独 on 2019-12-04 05:58:19
The root scratch dir: /tmp/hive on HDFS should be writable. Current permissions are: rwx------

Hi, I was executing the following Spark code in Eclipse on CDH 5.8 and getting the above RuntimeException:

    public static void main(String[] args) {
        final SparkConf sparkConf = new SparkConf().setMaster("local").setAppName("HiveConnector");
        final JavaSparkContext sparkContext = new JavaSparkContext(sparkConf);
        SQLContext sqlContext = new HiveContext(sparkContext);
        DataFrame df = sqlContext.sql("SELECT * FROM test_hive_table1");
        //df.show();
        df.count();
    }

According to the exception, /tmp/hive on HDFS should be…
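The usual remedy is to widen the permissions on /tmp/hive, for example with hdfs dfs -chmod 733 /tmp/hive (or chmod on the local filesystem when running in local mode). Purely as an illustration, here is a sketch that does the same thing through the Hadoop FileSystem API; the choice of octal mode and running this as a separate step are assumptions, not part of the question.

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.hadoop.fs.permission.FsPermission

    object ScratchDirPermissionSketch {
      def main(args: Array[String]): Unit = {
        val fs = FileSystem.get(new Configuration())
        val scratchDir = new Path("/tmp/hive")

        // Widen the scratch dir to rwx-wx-wx (octal 733) so the Hive client can write to it.
        if (fs.exists(scratchDir)) {
          fs.setPermission(scratchDir, new FsPermission(Integer.parseInt("733", 8).toShort))
        }
      }
    }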

Bulk data migration through Spark SQL

Submitted by 生来就可爱ヽ(ⅴ<●) on 2019-12-04 05:11:57
I'm currently trying to bulk-migrate the contents of a very large MySQL table into a Parquet file via Spark SQL. But when doing so, I quickly run out of memory, even when setting the driver's memory limit higher (I'm using Spark in local mode). Example code:

    Dataset<Row> ds = spark.read()
        .format("jdbc")
        .option("url", url)
        .option("driver", "com.mysql.jdbc.Driver")
        .option("dbtable", "bigdatatable")
        .option("user", "root")
        .option("password", "foobar")
        .load();

    ds.write().mode(SaveMode.Append).parquet("data/bigdatatable");

It seems like Spark tries to read the entire table contents into memory…
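A sketch of the usual mitigation: a partitioned JDBC read, so the table is pulled in several chunks instead of through a single connection. The partition column, bounds, fetch size, and connection details are assumptions and must be adapted to the actual table.

    import org.apache.spark.sql.{SaveMode, SparkSession}

    object JdbcPartitionedReadSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().master("local[*]").appName("jdbc-migration").getOrCreate()

        val ds = spark.read
          .format("jdbc")
          .option("url", "jdbc:mysql://localhost:3306/db")
          .option("driver", "com.mysql.jdbc.Driver")
          .option("dbtable", "bigdatatable")
          .option("user", "root")
          .option("password", "foobar")
          .option("partitionColumn", "id")  // a numeric, roughly uniformly distributed column
          .option("lowerBound", "1")
          .option("upperBound", "100000000")
          .option("numPartitions", "32")
          .option("fetchsize", "10000")     // JDBC fetch size hint
          .load()

        ds.write.mode(SaveMode.Append).parquet("data/bigdatatable")
      }
    }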