spark-dataframe

Flatten a DataFrame in Scala with different DataTypes inside

Question: As you may know, a DataFrame can contain fields of complex types, such as structures (StructType) or arrays (ArrayType). You may need, as in my case, to map all the DataFrame data to a Hive table with simple-type fields (String, Integer, ...). I've been struggling with this issue for a long time, and I've finally found a solution I want to share. I'm also sure it could be improved, so feel free to reply with your own suggestions. It's based on this thread, but also works for ArrayType
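
Although the question targets Scala, the flattening idea itself is easy to sketch in PySpark. The helper below is a hypothetical illustration, not the poster's solution: it repeatedly promotes struct fields to top-level columns and explodes array columns until only simple types remain.

from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StructType

def flatten(df):
    # Columns whose type is still complex (struct or array).
    complex_fields = {f.name: f.dataType for f in df.schema.fields
                      if isinstance(f.dataType, (StructType, ArrayType))}
    while complex_fields:
        name, dtype = next(iter(complex_fields.items()))
        if isinstance(dtype, StructType):
            # Promote each nested field to a top-level column, e.g. a.b -> a_b.
            expanded = [F.col(name + "." + sub.name).alias(name + "_" + sub.name)
                        for sub in dtype.fields]
            df = df.select("*", *expanded).drop(name)
        else:
            # ArrayType: produce one output row per array element.
            df = df.withColumn(name, F.explode_outer(F.col(name)))
        complex_fields = {f.name: f.dataType for f in df.schema.fields
                          if isinstance(f.dataType, (StructType, ArrayType))}
    return df

Arrays of structs are handled too: the explode step turns them into plain struct columns, which the next pass flattens.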

Apache Spark Window function with nested column

Question: I'm not sure whether this is a bug or just incorrect syntax. I searched around and didn't see it mentioned elsewhere, so I'm asking here before filing a bug report. I'm trying to use a Window function partitioned on a nested column. I've created a small example below demonstrating the problem.

import sqlContext.implicits._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window

val data = Seq(("a", "b", "c", 3), ("c", "b", "a", 3)).toDF("A", "B", "C", "num")
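
A workaround often suggested for this kind of failure (sketched here in PySpark, with hypothetical names "nested" and "key" for the struct column and its field) is to materialize the nested field as a flat column before partitioning on it:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Pull the nested field out into a top-level column first.
flat = df.withColumn("partition_key", F.col("nested.key"))

# Partition on the flat copy instead of the dotted path.
w = Window.partitionBy("partition_key").orderBy(F.col("num").desc())
ranked = flat.withColumn("rank", F.row_number().over(w)).drop("partition_key")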

How to filter rows for a specific aggregate with spark sql?

Normally all rows in a group are passed to an aggregate function. I would like to filter rows using a condition so that only some rows within a group are passed to an aggregate function. Such an operation is possible with PostgreSQL. I would like to do the same thing with a Spark SQL DataFrame (Spark 2.0.0). The code could probably look like this:

val df = ... // some data frame
df.groupBy("A").agg(
  max("B").where("B").less(10), // there is no such method as `where` :(
  max("C").where("C").less(5)
)

So for a data frame like this:

| A | B | C |
| 1 | 14| 4 |
| 1 | 9 | 3 |
| 2 | 5 | 6 |

The result would be:
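
A common way to express this kind of filtered aggregate (sketched here in PySpark rather than Scala, using the column names from the example above) is to make the aggregate conditional with when(): rows that fail the condition contribute null, and max() ignores nulls.

from pyspark.sql import functions as F

result = df.groupBy("A").agg(
    F.max(F.when(F.col("B") < 10, F.col("B"))).alias("max_B_lt_10"),
    F.max(F.when(F.col("C") < 5, F.col("C"))).alias("max_C_lt_5"),
)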

how can you calculate the size of an apache spark data frame using pyspark?

Is there a way to calculate the size in bytes of an Apache Spark DataFrame using PySpark?

Answer: Why don't you just cache the df and then look in the Spark UI under Storage and convert the units to bytes?

df.cache()

Source: https://stackoverflow.com/questions/38180140/how-can-you-calculate-the-size-of-an-apache-spark-data-frame-using-pyspark
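
A small sketch of that suggestion (assuming an existing SparkSession and an already-loaded df): cache() alone is lazy, so run an action to force materialization before reading the size off the Storage tab.

# Mark the DataFrame for caching; nothing is materialized yet.
df.cache()

# Run an action so every partition is computed and cached.
df.count()

# Now open the Spark UI (http://localhost:4040 by default), go to the
# Storage tab, and read "Size in Memory" for this DataFrame, converting
# KiB/MiB/GiB to bytes as needed.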

Spark Sql: TypeError(“StructType can not accept object in type %s” % type(obj))

I am currently pulling data from SQL Server using PyODBC and trying to insert into a table in Hive in a Near Real Time (NRT) manner. I got a single row from the source, converted it into a List[String], and created the schema programmatically, but while creating a DataFrame, Spark throws a StructType error.

>>> cnxn = pyodbc.connect(con_string)
>>> aj = cnxn.cursor()
>>>
>>> aj.execute("select * from tjob")
<pyodbc.Cursor object at 0x257b2d0>
>>> row = aj.fetchone()
>>> row
(1127, u'', u'8196660', u'', u'', 0, u'', u'', None, 35, None, 0, None, None, None, None, None, None, None, None, None, None,
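
This error usually means createDataFrame received a single value, or a flat list of values, where it expects an iterable of rows. A minimal sketch of that fix, under the assumption that a StringType schema built from the cursor metadata is acceptable (every name here is illustrative, not taken from an accepted answer):

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.appName("nrt-ingest").getOrCreate()

# Hypothetical schema: one StringType field per cursor column.
schema = StructType([StructField(c[0], StringType(), True) for c in aj.description])

# fetchone() returns a single pyodbc Row; createDataFrame wants a list of rows,
# so wrap it in a list and stringify the values to match the declared types.
data = [tuple(str(v) if v is not None else None for v in row)]
df = spark.createDataFrame(data, schema)
df.show()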

Spark UI showing 0 cores even when setting cores in App

Question: I am having a strange issue running an application against the Spark master URL, where the UI reports a "STATE" of "WAITING" indefinitely and 0 cores show up under the RUNNING APPLICATIONS table, no matter what I configure the core count to be. I've configured my app with the following settings, where spark.max.cores = 2 & spark.default.cores = 2 & memory is set to 3GB. The machine is an enterprise-class server with over 24 cores.

SparkConf conf = new SparkConf() .setAppName
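
One thing worth checking (a sketch, not necessarily the fix the poster needed): Spark silently ignores unknown property names, and the standalone-scheduler setting is spark.cores.max rather than spark.max.cores. A minimal PySpark equivalent of that configuration with the standard property names:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("spark://master-host:7077")   # hypothetical master URL
    .appName("core-count-check")
    .config("spark.cores.max", "2")        # total cores for the app (standalone/Mesos)
    .config("spark.executor.cores", "2")   # cores per executor
    .config("spark.executor.memory", "3g")
    .getOrCreate()
)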

How do I enable partition pruning in spark

I am reading Parquet data and I see that it is listing all the directories on the driver side:

Listing s3://xxxx/defloc/warehouse/products_parquet_151/month=2016-01 on driver
Listing s3://xxxx/defloc/warehouse/products_parquet_151/month=2014-12 on driver

I have specified month=2014-12 in my where clause. I have tried using Spark SQL and the DataFrame API, and it looks like neither is pruning partitions.

Using the DataFrame API:
df.filter("month='2014-12'").show()

Using Spark SQL:
sqlContext.sql("select name, price from products_parquet_151 where month = '2014-12'")

I have tried the above on versions 1.5.1, 1
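
One pattern that helps pruning along (a sketch using the Spark 2.x SparkSession API, under the assumption that month is a real partition directory beneath a common base path; not quoted from an answer) is to read the partitioned layout with an explicit basePath, filter on the discovered partition column, and, for Hive tables, keep metastore-side pruning enabled:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("partition-pruning")
    # For Hive tables, push partition predicates down to the metastore.
    .config("spark.sql.hive.metastorePartitionPruning", "true")
    .getOrCreate()
)

# basePath makes "month" a discovered partition column of the DataFrame.
df = (
    spark.read
    .option("basePath", "s3://xxxx/defloc/warehouse/products_parquet_151")
    .parquet("s3://xxxx/defloc/warehouse/products_parquet_151")
)

pruned = df.filter(df["month"] == "2014-12").select("name", "price")
pruned.explain()  # PartitionFilters in the physical plan confirm pruning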

multiple criteria for aggregation on pySpark Dataframe

I have a PySpark DataFrame that looks like this:

+-------------+----------+
|          sku|      date|
+-------------+----------+
|MLA-603526656|02/09/2016|
|MLA-603526656|01/09/2016|
|MLA-604172009|02/10/2016|
|MLA-605470584|02/09/2016|
|MLA-605502281|02/10/2016|
|MLA-605502281|02/09/2016|
+-------------+----------+

I want to group by sku, and then calculate the min and max dates. If I do this:

df_testing.groupBy('sku') \
    .agg({'date': 'min', 'date': 'max'}) \
    .limit(10) \
    .show()

the behavior is the same as in Pandas, where I only get the sku and max(date) columns. In Pandas I would normally do the following
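
The usual explanation (sketched here, not quoted from an answer) is that a Python dict cannot hold the key 'date' twice, so the second entry overwrites the first before Spark ever sees it. Passing explicit column expressions avoids the collision:

from pyspark.sql import functions as F

result = (
    df_testing.groupBy("sku")
    .agg(
        F.min("date").alias("min_date"),
        F.max("date").alias("max_date"),
    )
)
result.show(10)

Note that with string dates in DD/MM/YYYY form, min and max compare lexicographically; parse them with to_date first if calendar order matters.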

Trying to read and write parquet files from s3 with local spark

Question: I'm trying to read and write Parquet files from my local machine to S3 using Spark, but I can't seem to configure my Spark session properly to do so. Obviously there are configurations to be made, but I could not find a clear reference on how to do it. Currently my Spark session reads local Parquet mocks and is defined as follows:

val sparkSession = SparkSession.builder.master("local").appName("spark session example").getOrCreate()

Answer 1: I'm going to have to correct the post by himanshuIIITian
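
For reference, a minimal PySpark sketch of the kind of configuration involved (the hadoop-aws version must match your Hadoop build, and the bucket paths and credential placeholders below are assumptions):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("s3a parquet example")
    # Pulls in hadoop-aws and its matching AWS SDK; version is an assumption.
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:2.7.3")
    .config("spark.hadoop.fs.s3a.access.key", "<AWS_ACCESS_KEY_ID>")
    .config("spark.hadoop.fs.s3a.secret.key", "<AWS_SECRET_ACCESS_KEY>")
    .getOrCreate()
)

df = spark.read.parquet("s3a://my-bucket/input/")
df.write.mode("overwrite").parquet("s3a://my-bucket/output/")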

spark off heap memory config and tungsten

I thought that with the integration of Project Tungsten, Spark would automatically use off-heap memory. What are spark.memory.offheap.size and spark.memory.offheap.enabled for? Do I need to manually specify the amount of off-heap memory for Tungsten here?

Answer: Spark/Tungsten uses Encoders/Decoders to represent JVM objects as highly specialized Spark SQL Type objects, which can then be serialized and operated on in a highly performant way. The internal format representation is highly efficient and friendly to GC memory utilization. Thus, even operating in the default on-heap mode, Tungsten alleviates
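
For context, a minimal sketch of how those settings are supplied (note the camel-cased property names used in Spark's configuration docs; off-heap Tungsten memory stays disabled unless both are set):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("offheap-example")
    .config("spark.memory.offHeap.enabled", "true")
    .config("spark.memory.offHeap.size", "2g")   # must be > 0 when enabled
    .getOrCreate()
)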