spark-dataframe

Flatten a DataFrame in Scala with different DataTypes inside

Question: As you may know, a DataFrame can contain fields of complex types, such as structures (StructType) or arrays (ArrayType). You may need, as in my case, to map all the DataFrame data to a Hive table with simple-type fields (String, Integer, ...). I've been struggling with this issue for a long time, and I've finally found a solution I want to share. I'm also sure it could be improved, so feel free to reply with your own suggestions. It's based on this thread, but also works for ArrayType
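
Although the question targets Scala, the flattening idea itself is easy to sketch in PySpark. The helper below is a hypothetical illustration, not the poster's solution: it repeatedly promotes struct fields to top-level columns and explodes array columns until only simple types remain.

from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StructType

def flatten(df):
    # Columns whose type is still complex (struct or array).
    complex_fields = {f.name: f.dataType for f in df.schema.fields
                      if isinstance(f.dataType, (StructType, ArrayType))}
    while complex_fields:
        name, dtype = next(iter(complex_fields.items()))
        if isinstance(dtype, StructType):
            # Promote each nested field to a top-level column, e.g. a.b -> a_b.
            expanded = [F.col(name + "." + sub.name).alias(name + "_" + sub.name)
                        for sub in dtype.fields]
            df = df.select("*", *expanded).drop(name)
        else:
            # ArrayType: produce one output row per array element.
            df = df.withColumn(name, F.explode_outer(F.col(name)))
        complex_fields = {f.name: f.dataType for f in df.schema.fields
                          if isinstance(f.dataType, (StructType, ArrayType))}
    return df

Arrays of structs are handled too: the explode step turns them into plain struct columns, which the next pass flattens.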

Apache Spark Window function with nested column

Question: I'm not sure whether this is a bug or just incorrect syntax. I searched around and didn't see it mentioned elsewhere, so I'm asking here before filing a bug report. I'm trying to use a Window function partitioned on a nested column. I've created a small example below demonstrating the problem.

import sqlContext.implicits._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window

val data = Seq(("a", "b", "c", 3), ("c", "b", "a", 3)).toDF("A", "B", "C", "num")
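
A workaround often suggested for this kind of failure (sketched here in PySpark, with hypothetical names "nested" and "key" for the struct column and its field) is to materialize the nested field as a flat column before partitioning on it:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Pull the nested field out into a top-level column first.
flat = df.withColumn("partition_key", F.col("nested.key"))

# Partition on the flat copy instead of the dotted path.
w = Window.partitionBy("partition_key").orderBy(F.col("num").desc())
ranked = flat.withColumn("rank", F.row_number().over(w)).drop("partition_key")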

How to filter rows for a specific aggregate with spark sql?

Normally all rows in a group are passed to an aggregate function. I would like to filter rows using a condition so that only some rows within a group are passed to an aggregate function. Such an operation is possible with PostgreSQL. I would like to do the same thing with a Spark SQL DataFrame (Spark 2.0.0). The code could probably look like this:

val df = ... // some data frame
df.groupBy("A").agg(
  max("B").where("B").less(10), // there is no such method as `where` :(
  max("C").where("C").less(5)
)

So for a data frame like this:

| A | B | C |
| 1 | 14| 4 |
| 1 | 9 | 3 |
| 2 | 5 | 6 |

The result would be:
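
A common way to express this kind of filtered aggregate (sketched here in PySpark rather than Scala, using the column names from the example above) is to make the aggregate conditional with when(): rows that fail the condition contribute null, and max() ignores nulls.

from pyspark.sql import functions as F

result = df.groupBy("A").agg(
    F.max(F.when(F.col("B") < 10, F.col("B"))).alias("max_B_lt_10"),
    F.max(F.when(F.col("C") < 5, F.col("C"))).alias("max_C_lt_5"),
)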

how can you calculate the size of an apache spark data frame using pyspark?

Is there a way to calculate the size in bytes of an Apache Spark DataFrame using PySpark?

Answer: Why don't you just cache the df and then look in the Spark UI under Storage and convert the units to bytes?

df.cache()

Source: https://stackoverflow.com/questions/38180140/how-can-you-calculate-the-size-of-an-apache-spark-data-frame-using-pyspark
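
A small sketch of that suggestion (assuming an existing SparkSession and an already-loaded df): cache() alone is lazy, so run an action to force materialization before reading the size off the Storage tab.

# Mark the DataFrame for caching; nothing is materialized yet.
df.cache()

# Run an action so every partition is computed and cached.
df.count()

# Now open the Spark UI (http://localhost:4040 by default), go to the
# Storage tab, and read "Size in Memory" for this DataFrame, converting
# KiB/MiB/GiB to bytes as needed.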

Spark Sql: TypeError(“StructType can not accept object in type %s” % type(obj))

I am currently pulling data from SQL Server using PyODBC and trying to insert into a table in Hive in a Near Real Time (NRT) manner. I got a single row from the source, converted it into a List[String], and created the schema programmatically, but while creating a DataFrame, Spark throws a StructType error.

>>> cnxn = pyodbc.connect(con_string)
>>> aj = cnxn.cursor()
>>>
>>> aj.execute("select * from tjob")
<pyodbc.Cursor object at 0x257b2d0>
>>> row = aj.fetchone()
>>> row
(1127, u'', u'8196660', u'', u'', 0, u'', u'', None, 35, None, 0, None, None, None, None, None, None, None, None, None, None,
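
This error usually means createDataFrame received a single value, or a flat list of values, where it expects an iterable of rows. A minimal sketch of that fix, under the assumption that a StringType schema built from the cursor metadata is acceptable (every name here is illustrative, not taken from an accepted answer):

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.appName("nrt-ingest").getOrCreate()

# Hypothetical schema: one StringType field per cursor column.
schema = StructType([StructField(c[0], StringType(), True) for c in aj.description])

# fetchone() returns a single pyodbc Row; createDataFrame wants a list of rows,
# so wrap it in a list and stringify the values to match the declared types.
data = [tuple(str(v) if v is not None else None for v in row)]
df = spark.createDataFrame(data, schema)
df.show()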

Spark UI showing 0 cores even when setting cores in App

Question: I am having a strange issue running an application against the Spark master URL, where the UI reports a "STATE" of "WAITING" indefinitely and 0 cores show up under the RUNNING APPLICATIONS table, no matter what I configure the core count to be. I've configured my app with the following settings, where spark.max.cores = 2 & spark.default.cores = 2 & memory is set to 3GB. The machine is an enterprise-class server with over 24 cores.

SparkConf conf = new SparkConf() .setAppName
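
One thing worth checking (a sketch, not necessarily the fix the poster needed): Spark silently ignores unknown property names, and the standalone-scheduler setting is spark.cores.max rather than spark.max.cores. A minimal PySpark equivalent of that configuration with the standard property names:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("spark://master-host:7077")   # hypothetical master URL
    .appName("core-count-check")
    .config("spark.cores.max", "2")        # total cores for the app (standalone/Mesos)
    .config("spark.executor.cores", "2")   # cores per executor
    .config("spark.executor.memory", "3g")
    .getOrCreate()
)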

How do I enable partition pruning in spark

I am reading Parquet data and I see that it is listing all the directories on the driver side:

Listing s3://xxxx/defloc/warehouse/products_parquet_151/month=2016-01 on driver
Listing s3://xxxx/defloc/warehouse/products_parquet_151/month=2014-12 on driver

I have specified month=2014-12 in my where clause. I have tried using Spark SQL and the DataFrame API, and it looks like neither is pruning partitions.

Using the DataFrame API:
df.filter("month='2014-12'").show()

Using Spark SQL:
sqlContext.sql("select name, price from products_parquet_151 where month = '2014-12'")

I have tried the above on versions 1.5.1, 1
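
One pattern that helps pruning along (a sketch using the Spark 2.x SparkSession API, under the assumption that month is a real partition directory beneath a common base path; not quoted from an answer) is to read the partitioned layout with an explicit basePath, filter on the discovered partition column, and, for Hive tables, keep metastore-side pruning enabled:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("partition-pruning")
    # For Hive tables, push partition predicates down to the metastore.
    .config("spark.sql.hive.metastorePartitionPruning", "true")
    .getOrCreate()
)

# basePath makes "month" a discovered partition column of the DataFrame.
df = (
    spark.read
    .option("basePath", "s3://xxxx/defloc/warehouse/products_parquet_151")
    .parquet("s3://xxxx/defloc/warehouse/products_parquet_151")
)

pruned = df.filter(df["month"] == "2014-12").select("name", "price")
pruned.explain()  # PartitionFilters in the physical plan confirm pruning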

multiple criteria for aggregation on pySpark Dataframe

I have a PySpark DataFrame that looks like this:

+-------------+----------+
|          sku|      date|
+-------------+----------+
|MLA-603526656|02/09/2016|
|MLA-603526656|01/09/2016|
|MLA-604172009|02/10/2016|
|MLA-605470584|02/09/2016|
|MLA-605502281|02/10/2016|
|MLA-605502281|02/09/2016|
+-------------+----------+

I want to group by sku, and then calculate the min and max dates. If I do this:

df_testing.groupBy('sku') \
    .agg({'date': 'min', 'date': 'max'}) \
    .limit(10) \
    .show()

the behavior is the same as in Pandas, where I only get the sku and max(date) columns. In Pandas I would normally do the following
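
The usual explanation (sketched here, not quoted from an answer) is that a Python dict cannot hold the key 'date' twice, so the second entry overwrites the first before Spark ever sees it. Passing explicit column expressions avoids the collision:

from pyspark.sql import functions as F

result = (
    df_testing.groupBy("sku")
    .agg(
        F.min("date").alias("min_date"),
        F.max("date").alias("max_date"),
    )
)
result.show(10)

Note that with string dates in DD/MM/YYYY form, min and max compare lexicographically; parse them with to_date first if calendar order matters.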

Trying to read and write parquet files from s3 with local spark

Question: I'm trying to read and write Parquet files from my local machine to S3 using Spark, but I can't seem to configure my Spark session properly to do so. Obviously there are configurations to be made, but I could not find a clear reference on how to do it. Currently my Spark session reads local Parquet mocks and is defined as follows:

val sparkSession = SparkSession.builder.master("local").appName("spark session example").getOrCreate()

Answer 1: I'm going to have to correct the post by himanshuIIITian
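
For reference, a minimal PySpark sketch of the kind of configuration involved (the hadoop-aws version must match your Hadoop build, and the bucket paths and credential placeholders below are assumptions):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("s3a parquet example")
    # Pulls in hadoop-aws and its matching AWS SDK; version is an assumption.
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:2.7.3")
    .config("spark.hadoop.fs.s3a.access.key", "<AWS_ACCESS_KEY_ID>")
    .config("spark.hadoop.fs.s3a.secret.key", "<AWS_SECRET_ACCESS_KEY>")
    .getOrCreate()
)

df = spark.read.parquet("s3a://my-bucket/input/")
df.write.mode("overwrite").parquet("s3a://my-bucket/output/")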

spark off heap memory config and tungsten

I thought that with the integration of Project Tungsten, Spark would automatically use off-heap memory. What are spark.memory.offheap.size and spark.memory.offheap.enabled for? Do I need to manually specify the amount of off-heap memory for Tungsten here?

Answer: Spark/Tungsten uses Encoders/Decoders to represent JVM objects as highly specialized Spark SQL Type objects, which can then be serialized and operated on in a highly performant way. The internal format representation is highly efficient and friendly to GC memory utilization. Thus, even operating in the default on-heap mode, Tungsten alleviates
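
For context, a minimal sketch of how those settings are supplied (note the camel-cased property names used in Spark's configuration docs; off-heap Tungsten memory stays disabled unless both are set):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("offheap-example")
    .config("spark.memory.offHeap.enabled", "true")
    .config("spark.memory.offHeap.size", "2g")   # must be > 0 when enabled
    .getOrCreate()
)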