spark-dataframe

Spark and Cassandra Java Application Exception Provider org.apache.hadoop.fs.s3.S3FileSystem not found

拥有回忆 submitted on 2019-12-06 15:59:52
Question: I want to load a Cassandra table into a DataFrame in Spark. I followed the sample programs below (found in this answer), but I am getting the exception mentioned below. I also tried to load the table into an RDD first and then convert it to a DataFrame; loading the RDD is successful, but when I try to convert it to a DataFrame I hit the same exception as in the first approach. Any suggestions? I am using Spark 2.0.0, Cassandra 3.7, and Java 8. public class SparkCassandraDatasetApplication {
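For illustration, a minimal Scala sketch of reading a Cassandra table straight into a DataFrame with the spark-cassandra-connector (the question's code is Java, but the data source name is the same); the connection host, keyspace, and table names are placeholders rather than values from the question:

```scala
import org.apache.spark.sql.SparkSession

object CassandraToDataFrame {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("SparkCassandraDatasetApplication")
      .config("spark.cassandra.connection.host", "127.0.0.1") // placeholder host
      .getOrCreate()

    // Read the table directly as a DataFrame via the Cassandra data source;
    // "my_keyspace" and "my_table" are placeholders.
    val df = spark.read
      .format("org.apache.spark.sql.cassandra")
      .option("keyspace", "my_keyspace")
      .option("table", "my_table")
      .load()

    df.show(10)
    spark.stop()
  }
}
```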

Change the Datatype of columns in PySpark dataframe

a 夏天 submitted on 2019-12-06 15:53:38
I have an input dataframe ( ip_df ); the data in it looks like this:
id col_value
1 10
2 11
3 12
The data type of both id and col_value is String. I need to get another dataframe ( output_df ) with id as string and col_value as decimal(15,4). There is no data transformation, just a data type conversion. Can I do this with PySpark? Any help will be appreciated. Try using the cast method: from pyspark.sql.types import DecimalType <your code> output_df = ip_df.withColumn("col_value", ip_df["col_value"].cast(DecimalType())) neeraj bhadani: try the statement below. output_df = ip

Spark Structured streaming- Using different Windows for different GroupBy Keys

邮差的信 submitted on 2019-12-06 12:39:10
Currently I have the following table after reading from a Kafka topic via Spark structured streaming:
key,timestamp,value
-----------------------------------
key1,2017-11-14 07:50:00+0000,10
key1,2017-11-14 07:50:10+0000,10
key1,2017-11-14 07:51:00+0000,10
key1,2017-11-14 07:51:10+0000,10
key1,2017-11-14 07:52:00+0000,10
key1,2017-11-14 07:52:10+0000,10
key2,2017-11-14 07:50:00+0000,10
key2,2017-11-14 07:51:00+0000,10
key2,2017-11-14 07:52:10+0000,10
key2,2017-11-14 07:53:00+0000,10
I would like to use a different window for each of the keys and perform aggregation; for example key1 would be
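One workaround sketch, not taken from the question itself: split the stream by key, apply a different window to each branch, and run each branch as its own streaming query. The column names follow the excerpt; the window lengths and the console sink are illustrative.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, sum, window}

// `events` is assumed to be the streaming DataFrame read from Kafka,
// already parsed into columns key, timestamp, value.
def startPerKeyWindows(events: DataFrame): Unit = {
  // 1-minute window for key1 (illustrative length).
  val key1Agg = events
    .filter(col("key") === "key1")
    .groupBy(col("key"), window(col("timestamp"), "1 minute"))
    .agg(sum("value").as("total"))

  // 3-minute window for key2 (illustrative length).
  val key2Agg = events
    .filter(col("key") === "key2")
    .groupBy(col("key"), window(col("timestamp"), "3 minutes"))
    .agg(sum("value").as("total"))

  // Each branch becomes its own streaming query, so the different
  // windows never have to coexist in a single aggregation.
  key1Agg.writeStream.outputMode("complete").format("console").start()
  key2Agg.writeStream.outputMode("complete").format("console").start()
}
```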

Performance of UDAF versus Aggregator in Spark

浪子不回头ぞ submitted on 2019-12-06 12:09:29
I am trying to write some performance-minded code in Spark and am wondering whether I should write an Aggregator or a user-defined aggregate function (UDAF) for my rollup operations on a DataFrame. I have not been able to find any data on how fast each of these approaches is and which one you should use for Spark 2.0+. Source: https://stackoverflow.com/questions/45356452/performance-of-udaf-versus-aggregator-in-spark
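For reference, this is roughly the shape of the Row-based UDAF side of that comparison in Spark 2.x; a minimal averaging sketch over a hypothetical Double column, not code from the question. A commonly cited difference is that a UDAF's buffer goes through Spark's generic Row representation on every update, while a typed Aggregator works on JVM objects through encoders.

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

// A minimal Spark 2.x UDAF computing an average over a Double column.
object AverageUdaf extends UserDefinedAggregateFunction {
  def inputSchema: StructType = StructType(StructField("value", DoubleType) :: Nil)
  def bufferSchema: StructType = StructType(
    StructField("sum", DoubleType) :: StructField("count", LongType) :: Nil)
  def dataType: DataType = DoubleType
  def deterministic: Boolean = true

  def initialize(buffer: MutableAggregationBuffer): Unit = {
    buffer(0) = 0.0
    buffer(1) = 0L
  }
  def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    if (!input.isNullAt(0)) {
      buffer(0) = buffer.getDouble(0) + input.getDouble(0)
      buffer(1) = buffer.getLong(1) + 1L
    }
  }
  def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    buffer1(0) = buffer1.getDouble(0) + buffer2.getDouble(0)
    buffer1(1) = buffer1.getLong(1) + buffer2.getLong(1)
  }
  def evaluate(buffer: Row): Double =
    if (buffer.getLong(1) == 0L) 0.0 else buffer.getDouble(0) / buffer.getLong(1)
}

// Usage sketch: df.groupBy("key").agg(AverageUdaf(col("value")))
```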

How to do custom partition in spark dataframe with saveAsTextFile

▼魔方 西西 submitted on 2019-12-06 11:35:51
I have created data in Spark and then performed a join operation; finally I have to save the output to partitioned files. I am converting the data frame into an RDD and then saving it as a text file, which allows me to use a multi-character delimiter. My question is how to use dataframe columns as a custom partition in this case. I cannot use the option below for custom partitioning because it does not support a multi-character delimiter: dfMainOutput.write.partitionBy("DataPartiotion","StatementTypeCode") .format("csv") .option("delimiter", "^") .option("nullValue", "") .option("codec", "gzip") .save("s3://trfsdisu/SPARK
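One workaround sketch under the assumptions of the excerpt (dfMainOutput and the two partition columns exist; the delimiter and output path are placeholders): build the delimited record yourself with concat_ws, keep the partition columns, and write through the text source, which has no single-character delimiter restriction.

```scala
import org.apache.spark.sql.functions.{col, concat_ws}

// Join every column into one string using the multi-character delimiter,
// casting to string first; "|^|" and the output path are placeholders.
val delimited = dfMainOutput.withColumn(
  "value",
  concat_ws("|^|", dfMainOutput.columns.map(c => col(c).cast("string")): _*)
)

delimited
  .select(col("DataPartiotion"), col("StatementTypeCode"), col("value"))
  .write
  .partitionBy("DataPartiotion", "StatementTypeCode")
  .option("compression", "gzip")
  .text("s3://bucket/output") // placeholder path
```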

spark: What is the difference between Aggregator and UDAF?

∥☆過路亽.° submitted on 2019-12-06 11:27:34
Question: In Spark's documentation, Aggregator is: abstract class Aggregator[-IN, BUF, OUT] extends Serializable. A base class for user-defined aggregations, which can be used in Dataset operations to take all of the elements of a group and reduce them to a single value. UserDefinedAggregateFunction is: abstract class UserDefinedAggregateFunction extends Serializable. The base class for implementing user-defined aggregate functions (UDAF). According to Dataset Aggregator - Databricks, “an Aggregator is
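To make the quoted definition concrete, a minimal typed Aggregator sketch; the case classes and the averaging logic are illustrative, not taken from the documentation excerpt:

```scala
import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator

// Hypothetical input and buffer types for a typed average.
case class Record(key: String, value: Double)
case class AvgBuffer(sum: Double, count: Long)

object AverageAggregator extends Aggregator[Record, AvgBuffer, Double] {
  def zero: AvgBuffer = AvgBuffer(0.0, 0L)
  def reduce(b: AvgBuffer, r: Record): AvgBuffer = AvgBuffer(b.sum + r.value, b.count + 1)
  def merge(b1: AvgBuffer, b2: AvgBuffer): AvgBuffer = AvgBuffer(b1.sum + b2.sum, b1.count + b2.count)
  def finish(b: AvgBuffer): Double = if (b.count == 0) 0.0 else b.sum / b.count
  def bufferEncoder: Encoder[AvgBuffer] = Encoders.product[AvgBuffer]
  def outputEncoder: Encoder[Double] = Encoders.scalaDouble
}

// Usage sketch on a Dataset[Record]:
// val ds = spark.createDataset(Seq(Record("a", 1.0), Record("a", 3.0)))
// ds.select(AverageAggregator.toColumn).show()
```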

Spark Java: How to move data from HTTP source to Couchbase sink?

做~自己de王妃 submitted on 2019-12-06 11:10:44
Question: I have a .gz file available on a web server that I want to consume in a streaming manner and insert the data into Couchbase. The .gz archive contains only one file, which in turn contains one JSON object per line. Since Spark doesn't have an HTTP receiver, I wrote one myself (shown below). I'm using the Couchbase Spark connector to do the insertion. However, when running, the job is not actually inserting anything. I suspect this is due to my inexperience with Spark and not knowing how to
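Since the receiver code itself is cut off above, here is a minimal Scala sketch of the general pattern only: a custom Receiver that streams the gzipped HTTP response line by line and hands each line to Spark with store(). The URL is a placeholder, and the Couchbase write step is only referenced in a comment rather than shown.

```scala
import java.io.{BufferedReader, InputStreamReader}
import java.net.URL
import java.util.zip.GZIPInputStream

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// Streams a gzipped file over HTTP, one line (one JSON object) at a time.
class HttpGzipReceiver(url: String)
  extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  def onStart(): Unit = {
    // Read on a separate thread so onStart() returns immediately,
    // as the Receiver contract requires.
    new Thread("HTTP GZip Receiver") {
      override def run(): Unit = receive()
    }.start()
  }

  def onStop(): Unit = { /* nothing to clean up; the read loop ends with the stream */ }

  private def receive(): Unit = {
    try {
      val in = new GZIPInputStream(new URL(url).openStream())
      val reader = new BufferedReader(new InputStreamReader(in, "UTF-8"))
      var line = reader.readLine()
      while (!isStopped() && line != null) {
        store(line) // each line is one JSON object
        line = reader.readLine()
      }
      reader.close()
    } catch {
      case e: Exception => restart("Error reading from " + url, e)
    }
  }
}

// Usage sketch: ssc.receiverStream(new HttpGzipReceiver("http://example.com/data.json.gz"))
// The resulting DStream[String] can then be converted to documents and
// written out with the Couchbase Spark connector.
```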

Spark: executor memory exceeds physical limit

随声附和 submitted on 2019-12-06 09:43:23
Question: My input dataset is about 150G. I am setting --conf spark.cores.max=100 --conf spark.executor.instances=20 --conf spark.executor.memory=8G --conf spark.executor.cores=5 --conf spark.driver.memory=4G, but since the data is not evenly distributed across executors, I kept getting Container killed by YARN for exceeding memory limits. 9.0 GB of 9 GB physical memory used. Here are my questions: 1. Did I not set up enough memory in the first place? I think 20 * 8G > 150G, but it's hard to make perfect
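The limit YARN enforces is roughly executor memory plus the off-heap memory overhead (by default about 10% of executor memory), so a usual first step is to raise that overhead explicitly and/or repartition the skewed data, rather than only increasing executor memory. A sketch of what that could look like in Spark 2.x; all values are illustrative, not recommendations:

```scala
import org.apache.spark.sql.SparkSession

// Illustrative settings only; the right numbers depend on the cluster.
// In Spark 2.x on YARN the overhead key is spark.yarn.executor.memoryOverhead
// (renamed to spark.executor.memoryOverhead in later releases).
val spark = SparkSession.builder()
  .appName("SkewedJob")
  .config("spark.executor.instances", "20")
  .config("spark.executor.memory", "8g")
  .config("spark.executor.cores", "5")
  .config("spark.yarn.executor.memoryOverhead", "2048") // MB, above the ~10% default
  .getOrCreate()

// Repartitioning after the skewed read spreads records more evenly across executors.
// val balanced = spark.read.parquet("s3://bucket/input").repartition(800) // placeholder path
```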

'RDD' object has no attribute '_jdf' pyspark RDD

喜欢而已 submitted on 2019-12-06 09:14:24
I'm new to PySpark. I would like to perform some machine learning on a text file. from pyspark import Row from pyspark.context import SparkContext from pyspark.sql.session import SparkSession from pyspark import SparkConf sc = SparkContext spark = SparkSession.builder.appName("ML").getOrCreate() train_data = spark.read.text("20ng-train-all-terms.txt") td= train_data.rdd #transform df to rdd tr_data= td.map(lambda line: line.split()).map(lambda words: Row(label=words[0],words=words[1:])) from pyspark.ml.feature import CountVectorizer vectorizer = CountVectorizer(inputCol ="words", outputCol=

How to generate a DataFrame with random content and N rows?

时光总嘲笑我的痴心妄想 submitted on 2019-12-06 08:54:24
How can I create a Spark DataFrame in Scala with 100 rows and 3 columns that have random integer values in the range (1, 100)? I know how to create a DataFrame manually, but I cannot automate it: val df = sc.parallelize(Seq((1,20, 40), (60, 10, 80), (30, 15, 30))).toDF("col1", "col2", "col3") Here you go, Seq.fill is your friend: def randomInt1to100 = scala.util.Random.nextInt(100)+1 val df = sc.parallelize( Seq.fill(100){(randomInt1to100,randomInt1to100,randomInt1to100)} ).toDF("col1", "col2", "col3") You can simply use scala.util.Random to generate the random numbers within the range and loop for
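An alternative sketch for the same result that stays in the DataFrame API, assuming an existing SparkSession named spark, using spark.range and rand instead of building a local Seq:

```scala
import org.apache.spark.sql.functions.rand

// 100 rows, 3 columns of random integers in [1, 100], generated on the cluster
// rather than collected from a local collection.
val df = spark.range(100).select(
  (rand() * 100 + 1).cast("int").as("col1"),
  (rand() * 100 + 1).cast("int").as("col2"),
  (rand() * 100 + 1).cast("int").as("col3")
)
df.show(5)
```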