spark-dataframe

Why is my Spark App running in only 1 executor?

Submitted by 痴心易碎 on 2019-12-07 08:21:59
Question: I'm still fairly new to Spark, but I have been able to create the Spark app I need in order to reprocess data from our SQL Server using JDBC drivers (we are removing expensive stored procedures). The app loads a few tables from SQL Server via JDBC into DataFrames, then I do a few joins, a group, and a filter, finally reinserting the results back into a different table via JDBC. All of this executes just fine on Spark EMR in Amazon Web Services on an m3.xlarge with 2 cores in around a minute. My …
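A frequent cause is that a non-partitioned JDBC read yields a single partition, so the whole job runs as one task on one executor. Below is a minimal sketch of a partitioned JDBC read; the connection URL, table name, and split column (order_id) are placeholders standing in for the real ones, and the bounds would normally be looked up from the source table.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("sql-server-reprocess").getOrCreate()

// A plain JDBC read produces a single partition, so every downstream stage runs as one
// task on one executor. Supplying partitioning options splits the read into parallel queries.
val orders = spark.read
  .format("jdbc")
  .option("url", "jdbc:sqlserver://host:1433;databaseName=mydb") // placeholder URL
  .option("dbtable", "dbo.Orders")                               // placeholder table
  .option("user", "user")
  .option("password", "password")
  .option("partitionColumn", "order_id") // a numeric, ideally indexed column to split on
  .option("lowerBound", "1")
  .option("upperBound", "1000000")
  .option("numPartitions", "8")
  .load()

println(orders.rdd.getNumPartitions) // should now report 8 instead of 1

Executor count on EMR is set separately via spark-submit (--num-executors or dynamic allocation); the read options above only control how many parallel tasks the load itself uses.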

Effect of fetchsize and batchsize on Spark

Submitted by China☆狼群 on 2019-12-07 07:46:01
Question: I want to directly control the speed at which Spark reads from and writes to an RDB, yet the parameters named in the title seemingly were not working. Can I conclude that fetchsize and batchsize did not work with my testing method? Or do they affect reading and writing after all, given that the measured results scale reasonably with the data size? Stats of batchsize, fetchsize, and the data set:
/*Dataset*/
+--------------+-----------+
| Observations | Dataframe |
+--------------+-----------+
| …
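For reference, this is where the two options plug into the DataFrame JDBC source. A minimal sketch with a placeholder PostgreSQL URL and table names: fetchsize tunes how many rows the driver pulls per round trip on read, batchsize how many rows go into each INSERT batch on write.

import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder().appName("jdbc-tuning").getOrCreate()
val url = "jdbc:postgresql://host:5432/testdb" // placeholder connection details

// fetchsize: rows fetched per round trip while reading.
val df = spark.read
  .format("jdbc")
  .option("url", url)
  .option("dbtable", "observations")
  .option("user", "user")
  .option("password", "password")
  .option("fetchsize", "10000")
  .load()

// batchsize: rows sent per INSERT batch while writing.
df.write
  .format("jdbc")
  .option("url", url)
  .option("dbtable", "observations_copy")
  .option("user", "user")
  .option("password", "password")
  .option("batchsize", "10000")
  .mode(SaveMode.Append)
  .save()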

Spark Scala - How do I iterate over rows in a DataFrame and add calculated values as new columns of the DataFrame

Submitted by 柔情痞子 on 2019-12-07 07:03:34
Question: I have a DataFrame with two columns, "date" and "value". How do I add two new columns, "value_mean" and "value_sd", to the DataFrame, where "value_mean" is the average of "value" over the last 10 days (including the current day as specified in "date") and "value_sd" is the standard deviation of "value" over the last 10 days?

Answer 1: Spark SQL provides various DataFrame functions like avg, mean, sum, etc. You just have to apply them to a DataFrame column using the Spark SQL Column API: import org.apache.spark.sql …
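The answer is cut off above; here is a minimal sketch of the window-function approach it gestures at, assuming "date" is a yyyy-MM-dd string and using made-up sample rows.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{avg, stddev, unix_timestamp}

val spark = SparkSession.builder().appName("rolling-stats").getOrCreate()
import spark.implicits._

// Made-up sample rows; "date" is assumed to be a yyyy-MM-dd string.
val df = Seq(("2017-01-01", 10.0), ("2017-01-05", 20.0), ("2017-01-09", 30.0))
  .toDF("date", "value")

// Express "last 10 days" as a range frame over the date converted to epoch seconds.
val secondsPerDay = 86400L
val withTs = df.withColumn("ts", unix_timestamp($"date", "yyyy-MM-dd"))

val last10Days = Window.orderBy($"ts").rangeBetween(-9 * secondsPerDay, 0)

val result = withTs
  .withColumn("value_mean", avg($"value").over(last10Days))
  .withColumn("value_sd", stddev($"value").over(last10Days))

result.show()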

Convert a List into a DataFrame in Spark Scala

Submitted by 為{幸葍}努か on 2019-12-07 05:18:50
Question: I have a list with more than 30 strings. How do I convert the list into a DataFrame? What I tried, e.g.:
val list = List("a","b","v","b").toDS().toDF()
Output:
+-------+
| value |
+-------+
| a     |
| b     |
| v     |
| b     |
+-------+
Expected output:
+---+---+---+---+
| _1| _2| _3| _4|
+---+---+---+---+
|  a|  b|  v|  a|
+---+---+---+---+
Any help on this?

Answer 1: List("a","b","c","d") represents a record with one field, and so the result set displays one element in each row. To get the expected output, the row should have …
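With 30+ elements, hand-writing a tuple is impractical, so one option is to build a single Row plus a matching schema. A minimal sketch under that assumption:

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val spark = SparkSession.builder().appName("list-to-row").getOrCreate()

val list = List("a", "b", "v", "b") // stands in for the real 30+ element list

// One Row holding every element, plus a schema that names the fields _1, _2, ...
val schema = StructType(list.indices.map(i => StructField(s"_${i + 1}", StringType, nullable = true)))
val row = Row.fromSeq(list)
val df = spark.createDataFrame(spark.sparkContext.parallelize(Seq(row)), schema)

df.show()
// +---+---+---+---+
// | _1| _2| _3| _4|
// +---+---+---+---+
// |  a|  b|  v|  b|
// +---+---+---+---+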

How to add an incremental column ID for a table in Spark SQL

Submitted by 混江龙づ霸主 on 2019-12-07 04:58:39
Question: I'm working on a Spark MLlib algorithm. The dataset I have is in this form: "Company":"XXXX","CurrentTitle":"XYZ","Edu_Title":"ABC","Exp_mnth": … (there are more values similar to these). I'm trying to recode the String values to numeric values. So I tried using zipWithUniqueId to get a unique value for each of the string values. For some reason I'm not able to save the modified dataset to disk. Can I do this in any way using Spark SQL? Or what would be a better approach for this?

Answer 1 (Scala): val …
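The answer is truncated above; here is a minimal sketch of the DataFrame-level route with monotonically_increasing_id, using stand-in column values and a placeholder output path.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.monotonically_increasing_id

val spark = SparkSession.builder().appName("add-id").getOrCreate()
import spark.implicits._

// Stand-in rows shaped like the question's example record.
val df = Seq(("XXXX", "XYZ", "ABC"), ("YYYY", "PQR", "DEF"))
  .toDF("Company", "CurrentTitle", "Edu_Title")

// monotonically_increasing_id gives each row a unique (but not consecutive) 64-bit id.
// For strictly consecutive ids, df.rdd.zipWithIndex is the usual RDD-level alternative.
val withId = df.withColumn("id", monotonically_increasing_id())

withId.write.mode("overwrite").parquet("/tmp/dataset_with_id") // placeholder output path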

Get the row corresponding to the latest timestamp in a Spark Dataset using Scala

Submitted by 淺唱寂寞╮ on 2019-12-07 04:51:10
Question: I am relatively new to Spark and Scala. I have a DataFrame with the following format:
| Col1 | Col2 | Col3 | Col_4 | Col_5 | Col_TS                  | Col_7 |
| 1234 | AAAA | 1111 | afsdf | ewqre | 1970-01-01 00:00:00.0   | false |
| 1234 | AAAA | 1111 | ewqrw | dafda | 2017-01-17 07:09:32.748 | true  |
| 1234 | AAAA | 1111 | dafsd | afwew | 2015-01-17 07:09:32.748 | false |
| 5678 | BBBB | 2222 | afsdf | qwerq | 1970-01-01 00:00:00.0   | true  |
| 5678 | BBBB | 2222 | bafva | qweqe | 2016-12-08 07:58:43.04  | …
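A common approach is to rank rows within each key by the timestamp and keep the first. A minimal sketch using a cut-down copy of the sample data, assuming Col1/Col2/Col3 identify a group:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

val spark = SparkSession.builder().appName("latest-per-key").getOrCreate()
import spark.implicits._

// A cut-down version of the sample data from the question.
val df = Seq(
  ("1234", "AAAA", "1111", "1970-01-01 00:00:00.0", false),
  ("1234", "AAAA", "1111", "2017-01-17 07:09:32.748", true),
  ("5678", "BBBB", "2222", "2016-12-08 07:58:43.04", true)
).toDF("Col1", "Col2", "Col3", "Col_TS", "Col_7")
  .withColumn("Col_TS", col("Col_TS").cast("timestamp"))

// Rank rows within each key by Col_TS, newest first, and keep only the top row per key.
val latestFirst = Window.partitionBy("Col1", "Col2", "Col3").orderBy(col("Col_TS").desc)

val latestPerKey = df
  .withColumn("rn", row_number().over(latestFirst))
  .filter(col("rn") === 1)
  .drop("rn")

latestPerKey.show(false)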

Spark union fails with nested JSON dataframe

Submitted by 允我心安 on 2019-12-07 04:24:57
Question: I have the following two JSON files:
{ "name" : "Agent1", "age" : "32", "details" : [{ "d1" : 1, "d2" : 2 }] }
{ "name" : "Agent2", "age" : "42", "details" : [] }
I read them with Spark:
val jsonDf1 = spark.read.json(pathToJson1)
val jsonDf2 = spark.read.json(pathToJson2)
Two DataFrames are created with the following schemas:
root
 |-- age: string (nullable = true)
 |-- details: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- d1: long (nullable = true)
 |    |    |-- d2: …
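The schemas diverge because the empty details array gives Spark nothing to infer an element type from. A minimal sketch of one workaround, reusing the first file's schema for the second read (the paths are placeholders):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("json-union").getOrCreate()

// With an empty "details" array there is nothing to infer an element type from, so the
// second file gets a different schema and the union fails. Reusing the first file's schema
// (or a hand-written one) for both reads keeps the column types identical.
val jsonDf1 = spark.read.json("/path/to/json1") // placeholder paths
val jsonDf2 = spark.read.schema(jsonDf1.schema).json("/path/to/json2")

val combined = jsonDf1.union(jsonDf2)
combined.printSchema()
combined.show(false)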

How to filter rows for a specific aggregate with Spark SQL?

Submitted by 隐身守侯 on 2019-12-07 02:18:12
Question: Normally all rows in a group are passed to an aggregate function. I would like to filter rows using a condition so that only some rows within a group are passed to an aggregate function. Such an operation is possible with PostgreSQL. I would like to do the same thing with a Spark SQL DataFrame (Spark 2.0.0). The code could probably look like this:
val df = ... // some data frame
df.groupBy("A").agg(
  max("B").where("B").less(10), // there is no such method as `where` :(
  max("C").where("C").less(5) …
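There is no where on an aggregate in the DataFrame API, but the same effect is commonly achieved with a conditional expression inside the aggregate. A minimal sketch with made-up sample data:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, max, when}

val spark = SparkSession.builder().appName("conditional-agg").getOrCreate()
import spark.implicits._

// Hypothetical sample data with a grouping key A and two measures B and C.
val df = Seq(("x", 3, 7), ("x", 12, 2), ("y", 8, 9)).toDF("A", "B", "C")

// Rows that fail the condition become null inside the aggregate and are ignored by max,
// so only the qualifying rows contribute to each result.
val result = df.groupBy("A").agg(
  max(when(col("B") < 10, col("B"))).as("max_b_below_10"),
  max(when(col("C") < 5, col("C"))).as("max_c_below_5")
)

result.show()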

How do I enable partition pruning in Spark?

Submitted by 倾然丶 夕夏残阳落幕 on 2019-12-06 21:22:29
Question: I am reading Parquet data and I see that it is listing all the directories on the driver side:
Listing s3://xxxx/defloc/warehouse/products_parquet_151/month=2016-01 on driver
Listing s3://xxxx/defloc/warehouse/products_parquet_151/month=2014-12 on driver
I have specified month=2014-12 in my where clause. I have tried using Spark SQL and the DataFrame API, and it looks like neither is pruning partitions.
Using the DataFrame API:
df.filter("month='2014-12'").show()
Using Spark SQL:
sqlContext.sql("select name, …
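For comparison, a minimal sketch of a read rooted at the table directory, so that month is discovered as a partition column and the filter can be applied at planning time; whether directory listing still happens on the driver also depends on the Spark version in use.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("partition-pruning").getOrCreate()

// Reading from the table root lets Spark discover "month" as a partition column,
// so a filter on it can prune month= directories instead of scanning them all.
val products = spark.read
  .option("basePath", "s3://xxxx/defloc/warehouse/products_parquet_151") // path from the question
  .parquet("s3://xxxx/defloc/warehouse/products_parquet_151")

products.filter("month = '2014-12'").select("name").show()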

Spark off-heap memory config and Tungsten

Submitted by 一个人想着一个人 on 2019-12-06 18:45:11
Question: I thought that with the integration of project Tungsten, Spark would automatically use off-heap memory. What are spark.memory.offHeap.size and spark.memory.offHeap.enabled for? Do I need to manually specify the amount of off-heap memory for Tungsten here?

Answer 1: Spark/Tungsten uses Encoders/Decoders to represent JVM objects as highly specialized Spark SQL type objects, which can then be serialized and operated on in a highly performant way. The internal format representation is highly efficient …
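A minimal sketch of enabling the two settings explicitly; they are opt-in rather than something Tungsten turns on by itself, and the size is given in bytes (2 GB here is just an example value).

import org.apache.spark.sql.SparkSession

// Both settings default to off; the size is specified in bytes.
val spark = SparkSession.builder()
  .appName("offheap-demo")
  .config("spark.memory.offHeap.enabled", "true")
  .config("spark.memory.offHeap.size", (2L * 1024 * 1024 * 1024).toString)
  .getOrCreate()

println(spark.conf.get("spark.memory.offHeap.size"))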