spark-dataframe

Spark and Cassandra Java Application Exception: Provider org.apache.hadoop.fs.s3.S3FileSystem not found

∥☆過路亽.° submitted on 2019-12-04 22:52:34
I want to load a Cassandra table into a DataFrame in Spark. I have followed the sample program below (found in this answer), but I am getting the exception mentioned below. I have also tried loading the table into an RDD first and then converting it to a DataFrame; loading the RDD succeeds, but when I try to convert it to a DataFrame I get the same exception as with the first approach. Any suggestions? I am using Spark 2.0.0, Cassandra 3.7, and Java 8. public class SparkCassandraDatasetApplication { public static void main(String[] args) { SparkSession spark = SparkSession .builder() .appName(
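
A minimal Scala sketch of loading a Cassandra table straight into a DataFrame via the spark-cassandra-connector data source (the keyspace, table, and host below are placeholders). It does not address the S3 error itself, which is usually a sign of mismatched or missing Hadoop jars on the classpath:

```scala
import org.apache.spark.sql.SparkSession

object CassandraLoadSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("CassandraLoadSketch")
      .config("spark.cassandra.connection.host", "127.0.0.1") // adjust to your cluster
      .getOrCreate()

    // Read a Cassandra table through the spark-cassandra-connector data source.
    // "my_keyspace" and "my_table" are placeholders.
    val df = spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "my_keyspace", "table" -> "my_table"))
      .load()

    df.show()
    spark.stop()
  }
}
```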

Converting row values into a column array in spark dataframe

久未见 submitted on 2019-12-04 19:52:16
I am working with Spark DataFrames and I need to group by a column and convert the column values of the grouped rows into an array of elements as a new column. Example:

Input:
employee | Address
------------------
Micheal  | NY
Micheal  | NJ

Output:
employee | Address
------------------
Micheal  | (NY,NJ)

Any help is highly appreciated! Here is an alternate solution where I converted the DataFrame to an RDD for the transformations and then converted it back to a DataFrame using sqlContext.createDataFrame(). Sample.json: {"employee":"Michale","Address":"NY"} {"employee":"Michale","Address":"NJ"} {
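
A DataFrame-only way to get this result, without dropping to an RDD, is groupBy plus collect_list. A Scala sketch using the question's sample data (column names and spelling as in the question):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.collect_list

val spark = SparkSession.builder().appName("GroupToArray").getOrCreate()
import spark.implicits._

// Input rows as in the question.
val df = Seq(("Micheal", "NY"), ("Micheal", "NJ")).toDF("employee", "Address")

// Collect the grouped Address values into an array column.
val grouped = df.groupBy($"employee")
  .agg(collect_list($"Address").as("Address"))

grouped.show(false)
// +--------+--------+
// |employee|Address |
// +--------+--------+
// |Micheal |[NY, NJ]|
// +--------+--------+
```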

Why does df.limit keep changing in Pyspark?

你说的曾经没有我的故事 submitted on 2019-12-04 19:27:46
Question: I'm creating a data sample from some DataFrame df with rdd = df.limit(10000).rdd. This operation takes quite some time (why, actually? Can it not short-circuit after 10000 rows?), so I assume I have a new RDD now. However, when I now work on rdd, I get different rows every time I access it, as if it resamples over and over again. Caching the RDD helps a bit, but surely that's not safe? What is the reason behind this? Update: Here is a reproduction on Spark 1.5.2: from operator import add from pyspark.sql
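
For context, limit on an unordered DataFrame is not deterministic: each action may recompute the plan and pick a different set of rows. The question is PySpark; below is a Scala sketch of two common ways to pin the sample down (the input DataFrame and its id column are stand-ins):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("StableLimit").getOrCreate()

// Stand-in for the real df; "id" is just a placeholder column.
val df = spark.range(0, 1000000).toDF("id")

// 1) A deterministic ordering before limit makes every recomputation pick the same rows.
val stableSample = df.orderBy("id").limit(10000)

// 2) Alternatively, materialize the limited result once so later actions reuse it
//    instead of re-running the limit over possibly reordered input partitions.
val cached = df.limit(10000).cache()
cached.count() // force evaluation so the cached rows are fixed
```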

Apply a function to a single column of a csv in Spark

99封情书 submitted on 2019-12-04 17:54:10
Question: Using Spark, I'm reading a CSV and want to apply a function to one of its columns. I have some code that works, but it's very hacky. What is the proper way to do this? My code: SparkContext().addPyFile("myfile.py") spark = SparkSession\ .builder\ .appName("myApp")\ .getOrCreate() from myfile import myFunction df = spark.read.csv(sys.argv[1], header=True, mode="DROPMALFORMED",) a = df.rdd.map(lambda line: Row(id=line[0], user_id=line[1], message_id=line[2], message=myFunction(line[3]))).toDF() I
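
A less hacky pattern is to wrap the function in a UDF and apply it with withColumn, staying in the DataFrame API. The question is PySpark; here is a Scala sketch of the same idea, where myFunction, the "message" column, and the CSV path are placeholders:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

val spark = SparkSession.builder().appName("UdfOnColumn").getOrCreate()

// Placeholder for the user's myFunction from myfile.py.
def myFunction(s: String): String = s.toUpperCase

val myFunctionUdf = udf(myFunction _)

// "message" is the column being transformed; "data.csv" is a placeholder path.
val df  = spark.read.option("header", "true").csv("data.csv")
val out = df.withColumn("message", myFunctionUdf(col("message")))
out.show()
```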

How to refer to a broadcast variable in DataFrames

二次信任 submitted on 2019-12-04 17:12:59
I use Spark 1.6. I tried to broadcast an RDD and am not sure how to access the broadcast variable from the DataFrames. I have two DataFrames, employee and department.

Employee DataFrame
-------------------
Emp Id | Emp Name | Emp_Age
---------------------------
1      | john     | 25
2      | David    | 35

Department DataFrame
--------------------
Dept Id | Dept Name | Emp Id
-----------------------------
1       | Admin     | 1
2       | HR        | 2

import scala.collection.Map val df_emp = hiveContext.sql("select * from emp") val df_dept = hiveContext.sql("select * from dept") val rdd = df_emp.rdd.map(row => (row.getInt(0),row.getString(1)))
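
Two common patterns here are a broadcast hash join (via functions.broadcast) and explicitly broadcasting a collected Map that a UDF then looks up. A Scala sketch using the Spark 2.x SparkSession API with simplified column names (the question itself is on Spark 1.6 with hiveContext, where sc.broadcast plays the same role):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{broadcast, col, udf}

val spark = SparkSession.builder().appName("BroadcastLookup").getOrCreate()
import spark.implicits._

// Stand-ins for the question's df_emp and df_dept (column names simplified).
val df_emp  = Seq((1, "john", 25), (2, "David", 35)).toDF("emp_id", "emp_name", "emp_age")
val df_dept = Seq((1, "Admin", 1), (2, "HR", 2)).toDF("dept_id", "dept_name", "emp_id")

// Option 1: broadcast hash join; Spark ships the small side to every executor.
val joined = df_dept.join(broadcast(df_emp), Seq("emp_id"))

// Option 2: collect the small side into a Map, broadcast it, and look it up in a UDF.
val empMap     = df_emp.rdd.map(r => (r.getInt(0), r.getString(1))).collectAsMap()
val empMapBc   = spark.sparkContext.broadcast(empMap)
val lookupName = udf((id: Int) => empMapBc.value.getOrElse(id, "unknown"))
val withNames  = df_dept.withColumn("emp_name", lookupName(col("emp_id")))
withNames.show()
```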

Spark: executor memory exceeds physical limit

女生的网名这么多〃 submitted on 2019-12-04 16:55:00
My input dataset is about 150 GB. I am setting --conf spark.cores.max=100 --conf spark.executor.instances=20 --conf spark.executor.memory=8G --conf spark.executor.cores=5 --conf spark.driver.memory=4G, but since the data is not evenly distributed across executors, I keep getting "Container killed by YARN for exceeding memory limits. 9.0 GB of 9 GB physical memory used". Here are my questions: 1. Did I not set up enough memory in the first place? I think 20 * 8 GB > 150 GB, but it's hard to achieve a perfect distribution, so some executors will suffer. 2. I am thinking about repartitioning the input DataFrame, so how can I
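
YARN kills a container when the executor JVM plus its off-heap overhead exceeds the container size, so the usual remedies are raising spark.yarn.executor.memoryOverhead and repartitioning so partitions are more evenly sized. A sketch with illustrative values only (the overhead setting is more commonly passed to spark-submit via --conf, and the input path is a placeholder):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("MemoryTuningSketch")
  // Reserve more off-heap headroom per YARN container (Spark 2.0 property name).
  .config("spark.yarn.executor.memoryOverhead", "2048")
  .getOrCreate()

val df = spark.read.parquet("s3://bucket/input/") // placeholder path

// Spread the 150 GB more evenly across executors before the heavy stage.
// The partition count is a guess; aim for partitions of roughly 128-256 MB.
val evened = df.repartition(1000)
```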

Exploding nested Struct in Spark dataframe

两盒软妹~` submitted on 2019-12-04 15:46:29
Question: I'm working through the Databricks example. The schema for the DataFrame looks like:

> parquetDF.printSchema
root
 |-- department: struct (nullable = true)
 |    |-- id: string (nullable = true)
 |    |-- name: string (nullable = true)
 |-- employees: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- firstName: string (nullable = true)
 |    |    |-- lastName: string (nullable = true)
 |    |    |-- email: string (nullable = true)
 |    |    |-- salary: integer (nullable = true)

In the example,
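
The usual pattern for this schema is explode on the employees array, followed by dot-notation selects on the resulting struct column. A sketch in which the parquet path is a placeholder:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.explode

val spark = SparkSession.builder().appName("ExplodeSketch").getOrCreate()
import spark.implicits._

// Placeholder path; the schema matches the one printed above.
val parquetDF = spark.read.parquet("people.parquet")

// explode turns each element of the employees array into its own row,
// then the struct fields can be selected with dot notation.
val exploded = parquetDF
  .select($"department.name".as("dept_name"), explode($"employees").as("employee"))
  .select($"dept_name", $"employee.firstName", $"employee.lastName",
          $"employee.email", $"employee.salary")

exploded.show()
```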

Spark Java: How to move data from HTTP source to Couchbase sink?

可紊 submitted on 2019-12-04 15:40:10
I have a .gz file available on a web server that I want to consume in a streaming manner and insert the data into Couchbase. The .gz archive contains only one file, which in turn contains one JSON object per line. Since Spark doesn't have an HTTP receiver, I wrote one myself (shown below). I'm using the Couchbase Spark connector to do the insertion. However, when running, the job does not actually insert anything. I suspect this is due to my inexperience with Spark and not knowing how to start the job and await termination. As you can see below, there are two places where such calls can be made. Receiver:
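
A sketch of the overall streaming skeleton, with a stub receiver standing in for the question's HTTP receiver and the Couchbase write left as a comment (the connector call is omitted rather than guessed). The key point is that ssc.start() and ssc.awaitTermination() are called exactly once, on the StreamingContext, after all DStream operations have been declared:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.receiver.Receiver

// Minimal stand-in for the question's HTTP receiver: a custom Receiver must
// call store() from a background thread started in onStart().
class HttpLineReceiver(url: String) extends Receiver[String](StorageLevel.MEMORY_AND_DISK) {
  override def onStart(): Unit = {
    new Thread("http-receiver") {
      override def run(): Unit = {
        // Real implementation: open the URL, wrap it in a GZIPInputStream, read line by line.
        while (!isStopped()) { store(s"line from $url"); Thread.sleep(1000) }
      }
    }.start()
  }
  override def onStop(): Unit = {}
}

val conf = new SparkConf().setAppName("HttpToCouchbaseSketch")
val ssc  = new StreamingContext(conf, Seconds(10))

val lines = ssc.receiverStream(new HttpLineReceiver("http://example.com/data.gz"))

lines.foreachRDD { rdd =>
  // Convert each JSON line to a Couchbase document and save it here with the
  // Couchbase Spark connector's save API (deliberately omitted in this sketch).
  rdd.take(5).foreach(println)
}

// The crucial part for the question: start the receivers and block.
// Without ssc.start() nothing runs, and without awaitTermination()
// the driver exits before any batch is processed.
ssc.start()
ssc.awaitTermination()
```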

Spark: What is the difference between Aggregator and UDAF?

北战南征 submitted on 2019-12-04 15:01:05
In Spark's documentation, Aggregator is:

abstract class Aggregator[-IN, BUF, OUT] extends Serializable
A base class for user-defined aggregations, which can be used in Dataset operations to take all of the elements of a group and reduce them to a single value.

UserDefinedAggregateFunction is:

abstract class UserDefinedAggregateFunction extends Serializable
The base class for implementing user-defined aggregate functions (UDAF).

According to Dataset Aggregator - Databricks, "an Aggregator is similar to a UDAF, but the interface is expressed in terms of JVM objects instead of as a Row." It seems
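
To make the interface difference concrete, here is a minimal typed Aggregator in the style of the average example from the Spark documentation; every method works on plain JVM objects (Double and a case class buffer), whereas a UserDefinedAggregateFunction would express the same logic through Row objects and StructType schemas:

```scala
import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator

case class AvgBuffer(sum: Double, count: Long)

// A typed average: IN = Double, BUF = AvgBuffer, OUT = Double.
object TypedAverage extends Aggregator[Double, AvgBuffer, Double] {
  def zero: AvgBuffer = AvgBuffer(0.0, 0L)
  def reduce(b: AvgBuffer, a: Double): AvgBuffer = AvgBuffer(b.sum + a, b.count + 1)
  def merge(b1: AvgBuffer, b2: AvgBuffer): AvgBuffer =
    AvgBuffer(b1.sum + b2.sum, b1.count + b2.count)
  def finish(r: AvgBuffer): Double = r.sum / r.count
  def bufferEncoder: Encoder[AvgBuffer] = Encoders.product[AvgBuffer]
  def outputEncoder: Encoder[Double] = Encoders.scalaDouble
}

// Usage on a typed Dataset:
// import spark.implicits._
// val ds = Seq(1.0, 2.0, 3.0).toDS()
// ds.select(TypedAverage.toColumn.name("avg")).show()
```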

GroupBy operation on a DataFrame takes a lot of time in Spark 2.0

随声附和 submitted on 2019-12-04 14:37:01
In one of my Spark jobs (2.0 on EMR 5.0.0) I had about 5 GB of data that was cross joined with 30 rows (a few MB of data). I then needed to group by on it. I noticed that it was taking a lot of time (approximately 4 hours with one m3.xlarge master and six m3.2xlarge core nodes). Of the total time, 2 hours went to processing and another 2 hours to writing the data to S3. The time taken did not impress me. I searched the net and found a link that says groupBy leads to a lot of shuffling; it also suggests that, to avoid heavy shuffling, reduceByKey should be used
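
For what it's worth, DataFrame groupBy().agg() already performs partial (map-side) aggregation before the shuffle, so it is closer in spirit to reduceByKey than to RDD groupByKey; broadcasting the 30-row side of the cross join and controlling the number of output files are often the bigger wins. A sketch with placeholder paths and column names:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{broadcast, sum}

val spark = SparkSession.builder()
  .appName("CrossJoinGroupBySketch")
  // Needed on Spark 2.0 to allow an explicit join without a condition.
  .config("spark.sql.crossJoin.enabled", "true")
  .getOrCreate()

// Placeholder inputs: "big" is the ~5 GB table, "small" the 30-row table.
val big   = spark.read.parquet("s3://bucket/big/")
val small = spark.read.parquet("s3://bucket/small/")

// Broadcasting the tiny side keeps the cross join from shuffling the big table.
val crossed = big.join(broadcast(small))

// groupBy().agg() aggregates partially before the shuffle; "key_col" and
// "value_col" are placeholder column names.
val result = crossed.groupBy("key_col").agg(sum("value_col").as("total"))

// Fewer, larger output files often cut the S3 write time noticeably.
result.coalesce(200).write.parquet("s3://bucket/output/")
```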