spark-dataframe

Spark and Cassandra Java Application Exception: Provider org.apache.hadoop.fs.s3.S3FileSystem not found

∥☆過路亽.° submitted on 2019-12-04 22:52:34
I want to load a Cassandra table into a DataFrame in Spark. I have followed the sample program below (found in this answer), but I am getting the exception mentioned below. I have also tried loading the table into an RDD first and then converting it to a DataFrame; loading the RDD succeeds, but when I try to convert it to a DataFrame I get the same exception as with the first approach. Any suggestions? I am using Spark 2.0.0, Cassandra 3.7, and Java 8. public class SparkCassandraDatasetApplication { public static void main(String[] args) { SparkSession spark = SparkSession .builder() .appName(
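
A minimal Scala sketch of loading a Cassandra table straight into a DataFrame via the spark-cassandra-connector data source (the keyspace, table, and host below are placeholders). It does not address the S3 error itself, which is usually a sign of mismatched or missing Hadoop jars on the classpath:

```scala
import org.apache.spark.sql.SparkSession

object CassandraLoadSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("CassandraLoadSketch")
      .config("spark.cassandra.connection.host", "127.0.0.1") // adjust to your cluster
      .getOrCreate()

    // Read a Cassandra table through the spark-cassandra-connector data source.
    // "my_keyspace" and "my_table" are placeholders.
    val df = spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "my_keyspace", "table" -> "my_table"))
      .load()

    df.show()
    spark.stop()
  }
}
```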

Converting row values into a column array in spark dataframe

久未见 submitted on 2019-12-04 19:52:16
I am working with Spark DataFrames and I need to group by a column and convert the column values of the grouped rows into an array of elements as a new column. Example:

Input:
employee | Address
------------------
Micheal  | NY
Micheal  | NJ

Output:
employee | Address
------------------
Micheal  | (NY,NJ)

Any help is highly appreciated! Here is an alternate solution where I converted the DataFrame to an RDD for the transformations and then converted it back to a DataFrame using sqlContext.createDataFrame(). Sample.json: {"employee":"Michale","Address":"NY"} {"employee":"Michale","Address":"NJ"} {
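
A DataFrame-only way to get this result, without dropping to an RDD, is groupBy plus collect_list. A Scala sketch using the question's sample data (column names and spelling as in the question):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.collect_list

val spark = SparkSession.builder().appName("GroupToArray").getOrCreate()
import spark.implicits._

// Input rows as in the question.
val df = Seq(("Micheal", "NY"), ("Micheal", "NJ")).toDF("employee", "Address")

// Collect the grouped Address values into an array column.
val grouped = df.groupBy($"employee")
  .agg(collect_list($"Address").as("Address"))

grouped.show(false)
// +--------+--------+
// |employee|Address |
// +--------+--------+
// |Micheal |[NY, NJ]|
// +--------+--------+
```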

Why does df.limit keep changing in Pyspark?

你说的曾经没有我的故事 submitted on 2019-12-04 19:27:46
Question: I'm creating a data sample from some DataFrame df with rdd = df.limit(10000).rdd. This operation takes quite some time (why, actually? Can it not short-circuit after 10000 rows?), so I assume I have a new RDD now. However, when I now work on rdd, I get different rows every time I access it, as if it resamples over and over again. Caching the RDD helps a bit, but surely that's not safe? What is the reason behind this? Update: Here is a reproduction on Spark 1.5.2: from operator import add from pyspark.sql
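
For context, limit on an unordered DataFrame is not deterministic: each action may recompute the plan and pick a different set of rows. The question is PySpark; below is a Scala sketch of two common ways to pin the sample down (the input DataFrame and its id column are stand-ins):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("StableLimit").getOrCreate()

// Stand-in for the real df; "id" is just a placeholder column.
val df = spark.range(0, 1000000).toDF("id")

// 1) A deterministic ordering before limit makes every recomputation pick the same rows.
val stableSample = df.orderBy("id").limit(10000)

// 2) Alternatively, materialize the limited result once so later actions reuse it
//    instead of re-running the limit over possibly reordered input partitions.
val cached = df.limit(10000).cache()
cached.count() // force evaluation so the cached rows are fixed
```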

Apply a function to a single column of a csv in Spark

99封情书 submitted on 2019-12-04 17:54:10
Question: Using Spark, I'm reading a CSV and want to apply a function to one of its columns. I have some code that works, but it's very hacky. What is the proper way to do this? My code: SparkContext().addPyFile("myfile.py") spark = SparkSession\ .builder\ .appName("myApp")\ .getOrCreate() from myfile import myFunction df = spark.read.csv(sys.argv[1], header=True, mode="DROPMALFORMED",) a = df.rdd.map(lambda line: Row(id=line[0], user_id=line[1], message_id=line[2], message=myFunction(line[3]))).toDF() I
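
A less hacky pattern is to wrap the function in a UDF and apply it with withColumn, staying in the DataFrame API. The question is PySpark; here is a Scala sketch of the same idea, where myFunction, the "message" column, and the CSV path are placeholders:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

val spark = SparkSession.builder().appName("UdfOnColumn").getOrCreate()

// Placeholder for the user's myFunction from myfile.py.
def myFunction(s: String): String = s.toUpperCase

val myFunctionUdf = udf(myFunction _)

// "message" is the column being transformed; "data.csv" is a placeholder path.
val df  = spark.read.option("header", "true").csv("data.csv")
val out = df.withColumn("message", myFunctionUdf(col("message")))
out.show()
```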

How to refer to a broadcast variable in DataFrames

二次信任 submitted on 2019-12-04 17:12:59
I use Spark 1.6. I tried to broadcast an RDD and am not sure how to access the broadcast variable from the DataFrames. I have two DataFrames, employee and department.

Employee DataFrame
-------------------
Emp Id | Emp Name | Emp_Age
---------------------------
1      | john     | 25
2      | David    | 35

Department DataFrame
--------------------
Dept Id | Dept Name | Emp Id
-----------------------------
1       | Admin     | 1
2       | HR        | 2

import scala.collection.Map val df_emp = hiveContext.sql("select * from emp") val df_dept = hiveContext.sql("select * from dept") val rdd = df_emp.rdd.map(row => (row.getInt(0),row.getString(1)))
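
Two common patterns here are a broadcast hash join (via functions.broadcast) and explicitly broadcasting a collected Map that a UDF then looks up. A Scala sketch using the Spark 2.x SparkSession API with simplified column names (the question itself is on Spark 1.6 with hiveContext, where sc.broadcast plays the same role):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{broadcast, col, udf}

val spark = SparkSession.builder().appName("BroadcastLookup").getOrCreate()
import spark.implicits._

// Stand-ins for the question's df_emp and df_dept (column names simplified).
val df_emp  = Seq((1, "john", 25), (2, "David", 35)).toDF("emp_id", "emp_name", "emp_age")
val df_dept = Seq((1, "Admin", 1), (2, "HR", 2)).toDF("dept_id", "dept_name", "emp_id")

// Option 1: broadcast hash join; Spark ships the small side to every executor.
val joined = df_dept.join(broadcast(df_emp), Seq("emp_id"))

// Option 2: collect the small side into a Map, broadcast it, and look it up in a UDF.
val empMap     = df_emp.rdd.map(r => (r.getInt(0), r.getString(1))).collectAsMap()
val empMapBc   = spark.sparkContext.broadcast(empMap)
val lookupName = udf((id: Int) => empMapBc.value.getOrElse(id, "unknown"))
val withNames  = df_dept.withColumn("emp_name", lookupName(col("emp_id")))
withNames.show()
```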

Spark: executor memory exceeds physical limit

女生的网名这么多〃 submitted on 2019-12-04 16:55:00
My input dataset is about 150 GB. I am setting --conf spark.cores.max=100 --conf spark.executor.instances=20 --conf spark.executor.memory=8G --conf spark.executor.cores=5 --conf spark.driver.memory=4G, but since the data is not evenly distributed across executors, I keep getting "Container killed by YARN for exceeding memory limits. 9.0 GB of 9 GB physical memory used". Here are my questions: 1. Did I not set up enough memory in the first place? I think 20 * 8 GB > 150 GB, but it's hard to achieve a perfect distribution, so some executors will suffer. 2. I am thinking about repartitioning the input DataFrame, so how can I
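
YARN kills a container when the executor JVM plus its off-heap overhead exceeds the container size, so the usual remedies are raising spark.yarn.executor.memoryOverhead and repartitioning so partitions are more evenly sized. A sketch with illustrative values only (the overhead setting is more commonly passed to spark-submit via --conf, and the input path is a placeholder):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("MemoryTuningSketch")
  // Reserve more off-heap headroom per YARN container (Spark 2.0 property name).
  .config("spark.yarn.executor.memoryOverhead", "2048")
  .getOrCreate()

val df = spark.read.parquet("s3://bucket/input/") // placeholder path

// Spread the 150 GB more evenly across executors before the heavy stage.
// The partition count is a guess; aim for partitions of roughly 128-256 MB.
val evened = df.repartition(1000)
```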

Exploding nested Struct in Spark dataframe

两盒软妹~` submitted on 2019-12-04 15:46:29
Question: I'm working through the Databricks example. The schema for the DataFrame looks like:

> parquetDF.printSchema
root
 |-- department: struct (nullable = true)
 |    |-- id: string (nullable = true)
 |    |-- name: string (nullable = true)
 |-- employees: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- firstName: string (nullable = true)
 |    |    |-- lastName: string (nullable = true)
 |    |    |-- email: string (nullable = true)
 |    |    |-- salary: integer (nullable = true)

In the example,
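
The usual pattern for this schema is explode on the employees array, followed by dot-notation selects on the resulting struct column. A sketch in which the parquet path is a placeholder:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.explode

val spark = SparkSession.builder().appName("ExplodeSketch").getOrCreate()
import spark.implicits._

// Placeholder path; the schema matches the one printed above.
val parquetDF = spark.read.parquet("people.parquet")

// explode turns each element of the employees array into its own row,
// then the struct fields can be selected with dot notation.
val exploded = parquetDF
  .select($"department.name".as("dept_name"), explode($"employees").as("employee"))
  .select($"dept_name", $"employee.firstName", $"employee.lastName",
          $"employee.email", $"employee.salary")

exploded.show()
```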

Spark Java: How to move data from HTTP source to Couchbase sink?

可紊 submitted on 2019-12-04 15:40:10
I have a .gz file available on a web server that I want to consume in a streaming manner and insert the data into Couchbase. The .gz archive contains only one file, which in turn contains one JSON object per line. Since Spark doesn't have an HTTP receiver, I wrote one myself (shown below). I'm using the Couchbase Spark connector to do the insertion. However, when running, the job does not actually insert anything. I suspect this is due to my inexperience with Spark and not knowing how to start the job and await termination. As you can see below, there are two places where such calls can be made. Receiver:
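
A sketch of the overall streaming skeleton, with a stub receiver standing in for the question's HTTP receiver and the Couchbase write left as a comment (the connector call is omitted rather than guessed). The key point is that ssc.start() and ssc.awaitTermination() are called exactly once, on the StreamingContext, after all DStream operations have been declared:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.receiver.Receiver

// Minimal stand-in for the question's HTTP receiver: a custom Receiver must
// call store() from a background thread started in onStart().
class HttpLineReceiver(url: String) extends Receiver[String](StorageLevel.MEMORY_AND_DISK) {
  override def onStart(): Unit = {
    new Thread("http-receiver") {
      override def run(): Unit = {
        // Real implementation: open the URL, wrap it in a GZIPInputStream, read line by line.
        while (!isStopped()) { store(s"line from $url"); Thread.sleep(1000) }
      }
    }.start()
  }
  override def onStop(): Unit = {}
}

val conf = new SparkConf().setAppName("HttpToCouchbaseSketch")
val ssc  = new StreamingContext(conf, Seconds(10))

val lines = ssc.receiverStream(new HttpLineReceiver("http://example.com/data.gz"))

lines.foreachRDD { rdd =>
  // Convert each JSON line to a Couchbase document and save it here with the
  // Couchbase Spark connector's save API (deliberately omitted in this sketch).
  rdd.take(5).foreach(println)
}

// The crucial part for the question: start the receivers and block.
// Without ssc.start() nothing runs, and without awaitTermination()
// the driver exits before any batch is processed.
ssc.start()
ssc.awaitTermination()
```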

Spark: What is the difference between Aggregator and UDAF?

北战南征 submitted on 2019-12-04 15:01:05
In Spark's documentation, Aggregator is:

abstract class Aggregator[-IN, BUF, OUT] extends Serializable
A base class for user-defined aggregations, which can be used in Dataset operations to take all of the elements of a group and reduce them to a single value.

UserDefinedAggregateFunction is:

abstract class UserDefinedAggregateFunction extends Serializable
The base class for implementing user-defined aggregate functions (UDAF).

According to Dataset Aggregator - Databricks, "an Aggregator is similar to a UDAF, but the interface is expressed in terms of JVM objects instead of as a Row." It seems
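
To make the interface difference concrete, here is a minimal typed Aggregator in the style of the average example from the Spark documentation; every method works on plain JVM objects (Double and a case class buffer), whereas a UserDefinedAggregateFunction would express the same logic through Row objects and StructType schemas:

```scala
import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator

case class AvgBuffer(sum: Double, count: Long)

// A typed average: IN = Double, BUF = AvgBuffer, OUT = Double.
object TypedAverage extends Aggregator[Double, AvgBuffer, Double] {
  def zero: AvgBuffer = AvgBuffer(0.0, 0L)
  def reduce(b: AvgBuffer, a: Double): AvgBuffer = AvgBuffer(b.sum + a, b.count + 1)
  def merge(b1: AvgBuffer, b2: AvgBuffer): AvgBuffer =
    AvgBuffer(b1.sum + b2.sum, b1.count + b2.count)
  def finish(r: AvgBuffer): Double = r.sum / r.count
  def bufferEncoder: Encoder[AvgBuffer] = Encoders.product[AvgBuffer]
  def outputEncoder: Encoder[Double] = Encoders.scalaDouble
}

// Usage on a typed Dataset:
// import spark.implicits._
// val ds = Seq(1.0, 2.0, 3.0).toDS()
// ds.select(TypedAverage.toColumn.name("avg")).show()
```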

GroupBy operation on a DataFrame takes a lot of time in Spark 2.0

随声附和 submitted on 2019-12-04 14:37:01
In one of my Spark jobs (2.0 on EMR 5.0.0) I had about 5 GB of data that was cross joined with 30 rows (a few MB of data). I then needed to group by on it. I noticed that it was taking a lot of time (approximately 4 hours with one m3.xlarge master and six m3.2xlarge core nodes). Of the total time, 2 hours went to processing and another 2 hours to writing the data to S3. The time taken did not impress me. I searched the net and found a link that says groupBy leads to a lot of shuffling; it also suggests that, to avoid heavy shuffling, reduceByKey should be used
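
For what it's worth, DataFrame groupBy().agg() already performs partial (map-side) aggregation before the shuffle, so it is closer in spirit to reduceByKey than to RDD groupByKey; broadcasting the 30-row side of the cross join and controlling the number of output files are often the bigger wins. A sketch with placeholder paths and column names:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{broadcast, sum}

val spark = SparkSession.builder()
  .appName("CrossJoinGroupBySketch")
  // Needed on Spark 2.0 to allow an explicit join without a condition.
  .config("spark.sql.crossJoin.enabled", "true")
  .getOrCreate()

// Placeholder inputs: "big" is the ~5 GB table, "small" the 30-row table.
val big   = spark.read.parquet("s3://bucket/big/")
val small = spark.read.parquet("s3://bucket/small/")

// Broadcasting the tiny side keeps the cross join from shuffling the big table.
val crossed = big.join(broadcast(small))

// groupBy().agg() aggregates partially before the shuffle; "key_col" and
// "value_col" are placeholder column names.
val result = crossed.groupBy("key_col").agg(sum("value_col").as("total"))

// Fewer, larger output files often cut the S3 write time noticeably.
result.coalesce(200).write.parquet("s3://bucket/output/")
```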